# Learning Implicit User Interest Hierarchy for Web Personalization (Hyoung-rae Kim, dissertation, Florida Institute of Technology)

These specially formatted terms carry more emphasis in the page than terms formatted normally. A document that emphasizes a search term in bold is more related to that term than a document that shows the term in a normal format without emphasis. If a term is emphasized with two or more types of special formatting, we assign a priority in the order of title, bold, and italic.

The significance of each type of format is estimated from its probability: a format type is more significant/important if it has a lower probability of occurring in a web page. A lower probability P(Eti) for a matching term ti indicates that the term is more significant.

The probability is estimated by the relative frequency of the format among the matching terms:

P(Eti) = (number of matching terms with format Eti) / (total number of matching terms)
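As a concrete illustration, a relative-frequency estimate of this kind can be sketched in a few lines of Python (the function name and the example format values are ours, not from the dissertation):

```python
from collections import Counter

def estimate_probability(values):
    """Estimate P(v) for each observed characteristic value as its
    relative frequency among the matching terms (illustrative estimator)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

# Emphasis formats observed for five matching terms in a page:
emphasis = ["normal", "bold", "normal", "title", "normal"]
probs = estimate_probability(emphasis)
# "title" and "bold" are rarer than "normal", so they receive lower
# probabilities and hence carry more significance.
```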

### 5.2.2. Scoring a Term

#### 5.2.2.1. Uniform Scoring

P(Dti, Lti, Fti, Eti) is the joint probability of all four characteristics occurring in term ti: Dti is the depth of the node to which the term belongs, Lti is the length of the term, Fti is the frequency of the term, and Eti is the emphasis of the term. Assuming independence among the four characteristics, we estimate:

P(Dti, Lti, Fti, Eti) = P(Dti) × P(Lti) × P(Fti) × P(Eti)    (Eq. 1)

A smaller log likelihood means the term match is more significant. In information theory (Mitchell et al., 1997), –log2 P(e) is the number of bits needed to encode event e; hence using –log2, instead of log, in Eq. 1 yields the total number of bits needed to encode the four characteristics. The uniform term scoring (US) function for a personalized term score is formulated as:

Sti = –log2 P(Dti) – log2 P(Lti) – log2 P(Fti) – log2 P(Eti)

which we use as the score for term ti. A larger Sti means the term match is more significant.
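The US function is direct to compute; a minimal Python sketch (the function name is ours):

```python
import math

def uniform_score(p_d, p_l, p_f, p_e):
    """Uniform term score S_ti: the total number of bits needed to
    encode the four characteristics, assuming independence."""
    return -sum(math.log2(p) for p in (p_d, p_l, p_f, p_e))

# If every characteristic has probability 0.5, each contributes one
# bit, so the score is 4.0; a rarer characteristic contributes more.
```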

#### 5.2.2.2. Weighted Scoring

The uniform term scoring function uses uniform weights for each characteristic. It is possible that some characteristics are more important than others. For instance, the depth of a node (D) may be more significant than frequency (F). Therefore, we attempted to differentiate the weights for each characteristic. The F and E characteristics represent the relevance of a web page. Longer terms (greater L) represent a user’s interest more specifically; however, a longer term does not mean that a user is more interested in it.

Therefore, the L, F, and E characteristics do not fully reflect a user’s interests. It is more reasonable to emphasize the D characteristic over the others, because D (depth) represents the strength of a user’s interests.

A simple heuristic is used in this paper that assumes the depth of a node is at least two times more important than the other characteristics. Based on this heuristic, the weights w1 = 0.4, w2 = 0.2, w3 = 0.2, and w4 = 0.2 are assigned. The weighted term scoring (WS) function for a personalized term score is formulated as:

Sti = –(w1 log2 P(Dti) + w2 log2 P(Lti) + w3 log2 P(Fti) + w4 log2 P(Eti))
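The weighted variant only scales each characteristic’s bit count by its weight; a sketch using the weights described above:

```python
import math

# Depth (D) weighted twice as heavily as the other characteristics.
WEIGHTS = {"D": 0.4, "L": 0.2, "F": 0.2, "E": 0.2}

def weighted_score(probs):
    """Weighted term score S_ti.  probs maps a characteristic name
    ('D', 'L', 'F', 'E') to its estimated probability."""
    return -sum(WEIGHTS[k] * math.log2(p) for k, p in probs.items())
```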

### 5.2.3. Scoring a Page

The personal page score is based on the number of interesting terms in a web page and on how interesting those terms are. A web page with many terms that interest a user will be more interesting to that user than a page with fewer interesting terms. Likewise, a page whose terms are more interesting to the user will be more interesting than a page whose terms are less interesting.

The personalized page scoring function for a web page, Spj, adds the scores of all the terms in the web page and can be formulated as:

Spj = Σi Sti, summed over the distinct terms ti in page pj

The time complexity of scoring a page is O(nt), where nt is the number of “distinct” terms in a web page. The D and L characteristics can be calculated during the preprocessing stage of building a UIH; the F and E characteristics can be calculated while extracting distinct terms from a web page.
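Since each distinct term is scored once, page scoring is a single pass over the distinct terms. A minimal sketch, where `term_score` stands in for either the US or the WS function:

```python
def page_score(page_terms, term_score):
    """Personal page score S_pj: sum of the scores of the *distinct*
    terms in the page -- O(nt) in the number of distinct terms."""
    return sum(term_score(t) for t in set(page_terms))

# Example with a toy scoring function that scores a term by its length;
# the duplicate "neural" is counted only once.
score = page_score(["neural", "network", "neural"], len)
```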

### 5.2.4. Incorporating Public Page Score

Personal page scoring is not sufficient for some search engines. The success of public scoring in popular search engines, such as Google’s PageRank, indicates the importance of using a public page-popularity measure to determine which pages a user is interested in. Many existing methods determine the public popularity of a page from the number of pages that link to it (Haveliwala, 2002; Jeh and Widom, 2003).

Many collaborative filtering approaches also use the popularity of a web page for recommendation (Eirinaki et al., 2004; Kim et al., 2004). Section 5.2.3 described our personal web page scoring function. We wish to incorporate public scoring into our page scoring function so that both the popularity of a page and individual interests are taken into account. We use the rank order returned by Google as our public score, GOOGLEpj.

Google’s public page scoring function has been found in a recent study (Delaney, 2004) to be very effective at returning pages that users find interesting. Using Google’s rank as the public page score also makes our experimental comparison with Google clearer, because any improvement in the ordering is then due to the contribution of our personal page score. For a given web page, pj, the personal and public page score (PPS) equation can be written as:

PPSpj = c × R(Spj) + (1 – c) × R(GOOGLEpj)

where R(GOOGLEpj) is the rank of web page pj under the public score GOOGLEpj, and R(Spj) is the rank of pj under the personal page score Spj. If the function R returned ranks in ascending order (best page first), more interesting web pages would have lower PPS values; therefore R reverses the rank, assigning the largest value to the best page. The personal page score and the public page score are weighted by the constant c. In this paper, both functions are weighted equally: c = 0.5.
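The rank-reversal and combination step can be sketched as follows (an illustration under our reading of R; function and variable names are ours):

```python
def pps(pages, personal_score, google_rank, c=0.5):
    """Combine personal and public rankings.  R is the *reversed* rank:
    the best page under a criterion gets n, the worst gets 1, so a
    larger PPS value means a more interesting page."""
    # Personal: a higher score is better, so an ascending sort puts the
    # worst page first (reversed rank 1).
    by_personal = sorted(pages, key=personal_score)
    # Public: Google rank 1 is best, so a descending sort puts the
    # worst page first (reversed rank 1).
    by_public = sorted(pages, key=google_rank, reverse=True)
    r_personal = {p: i + 1 for i, p in enumerate(by_personal)}
    r_public = {p: i + 1 for i, p in enumerate(by_public)}
    return {p: c * r_personal[p] + (1 - c) * r_public[p] for p in pages}
```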

## 5.3. Experiments

In our experiments, data were collected from 11 different users. Of the 11 human subjects, 4 were undergraduate students and 7 were graduate students. In terms of major, 7 were in Computer Science, 2 in Aeronautical Science, 1 in Chemical Engineering, and 1 in Marine Biology. We asked each volunteer to submit 2 search terms, which could contain any Boolean operators. Some examples of the search terms used are {review forum +"scratch remover", cpu benchmark, aeronautical, Free cross-stitch scenic patterns, neural networks tutorial, DMC (digital media center), artificial intelligence, etc.}. We then used Google to retrieve 100 related web pages for each search term.

The collected web pages were classified/labeled by each user under two categories: interest and potential interest. The data set for interest has more authority because it indicates direct relevance to the current search query. The data set for potential interest reflects the user’s general personal interests, which might not be directly relevant at the time of the query. The areas of a user’s potential interests often go beyond the boundary of a search term’s specific meaning. Sometimes users find interesting web pages while searching for different subjects, and these unexpected results help the user as well. Therefore, it is also a contribution if a method shows higher precision in finding potentially interesting web pages.

In order to build UIHs, we also requested each volunteer to submit the web pages in their bookmarks. If there were fewer than 50 web pages in a volunteer’s bookmarks, we asked them to collect more pages, up to around 50. The minimum number of web pages was 38 and the maximum was 72. Web pages from both bookmarks and Google were parsed to retrieve only text. The terms (words and phrases) in the web pages were stemmed and filtered through a stop list (Rasmussen, 1992). A phrase-finding algorithm (Kim and Chan, 2004) was used to collect variable-length phrases. Words in selection boxes/menus were removed because they do not appear on the screen until a user clicks on them.

Unimportant content such as comments and style information was also removed. Web pages consisting of non-text content (e.g., “.pdf” files, image files, etc.) were excluded because we handle only text. To avoid any negative bias against Google, broken links that were erroneously still ranked high by Google were excluded from the test, since those web pages would certainly be scored “Poor” by the user. The data used in this study are accessible at http://cs.fit.edu/~hkim/dissertation/dissertation.htm. The programs were written in a Microsoft .NET language and ran on an Intel Pentium 4 CPU.


Interest was categorized as “Good”, “Fair”, or “Poor”; potential interest was categorized as “Yes” or “No”. A web page was scored “Good”, “Fair”, or “Poor” according to each individual’s subjective opinion, based on the definition of interest, and marked “Yes” or “No” based on the user’s potential interest. We evaluated a ranking method by how many interesting (“Good”) or potentially interesting (“Yes”) web pages it placed within a certain number of top links (Bharat and Mihaila, 2001), which we call “Top link analysis”. This is realistic in the sense that many information retrieval systems focus on the top 10 or 20 results.

A precision/recall graph (van Rijsbergen, 1979) is used for evaluation as well (called “precision/recall analysis”). It is one of the most common evaluation methods in information retrieval. However, traditional precision/recall graphs are very sensitive to the initial rank positions and evaluate entire rankings (Croft and Das, 1989). The formulas for precision and recall are:

Precision = (number of “Good” or “Yes” pages retrieved in the set) / (size of the set)

Recall = (number of “Good” or “Yes” pages retrieved in the set) / (number of “Good” or “Yes” pages among all retrieved pages)

where the “set” is the group of top-ranked web pages. In this paper we study five groups: Top 1, 5, 10, 15, and 20.
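These two measures at a Top-k cutoff can be sketched directly (an illustration; names are ours):

```python
def precision_recall_at_k(labels_in_rank_order, k, total_relevant):
    """labels_in_rank_order: True for a 'Good'/'Yes' page, listed in
    ranked order.  The 'set' is the top k pages; total_relevant is the
    number of 'Good'/'Yes' pages among all retrieved pages."""
    hits = sum(labels_in_rank_order[:k])
    return hits / k, hits / total_relevant

# Top-5 set containing all three relevant pages in the collection:
p, r = precision_recall_at_k([True, False, True, True, False], 5, 3)
# precision = 3/5, recall = 3/3
```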

## 5.4. Results and Analysis

We compare four ranking methods: Google, Random, US, and WS. Google is the ranking provided by Google. Random ranks the web pages arbitrarily. US and WS are the two proposed methods, based on a personal UIH learned from a user’s bookmarks. For Random, US, and WS, the top 100 pages retrieved by Google are re-ranked by the method. Each method is analyzed with two data sets: a set of web pages chosen as interesting and another chosen as potentially interesting by the users. Top link analysis, precision/recall analysis, and the sensitivity of the personal score weight c (Section 5.2.4) are discussed.

### 5.4.1. Interesting Web Pages

#### 5.4.1.1. Top Link Analysis

Web search engine users are usually interested in the links ranked within the top 20 (Chen and Sycara, 1998). We therefore compare each method only at the Top 1, 5, 10, 15, and 20 links on the interesting web page data set and present the results in Table 20. The first column lists the methods; the next five columns present the precision values of each method for the five Top links. The values in each cell are the average of the 22 search terms’ precision values. A higher precision value indicates higher accuracy/performance.

Precision values higher than Google’s are set in bold, with the percentage of improvement in parentheses. The highest precision value in each column is underlined.

The results show that our WS method was more accurate than Google at three Top links (Top 10, 15, and 20), with improvements of at least 13%, while WS tied with Google at Top 1 and Top 5. In terms of the highest precision, WS achieved the highest value in four columns; Google did so in only two columns, with values equal to WS’s. Compared to US, WS showed higher precision in four of the five columns (Top 1, 5, 15, and 20). Random was the lowest, as we expected, showing the lowest precision in all five columns. These results indicate that WS achieves the highest overall precision.

We also wanted to know which search terms yielded higher precision with WS than with Google, and we analyzed the precision with respect to each individual search term.

Out of 22 search terms (11 users × 2 search terms), WS achieved higher precision for 12 search terms (55%), Google did for 8 search terms (36%), and they were even for 2 search terms (9%). Since the UIH is built from a user’s bookmarks, we analyzed the bookmarks to understand the search terms that did not perform well with WS. When we compared the bookmarks with the “Good” retrieved web pages, we found that they were unrelated. For example, a volunteer used “woodworking tutorial” as a search term but had never bookmarked web pages related to that term. This implies that bookmarks are useful for building user profiles but are not sufficient. We will discuss enhancements in the conclusion.

#### 5.4.1.2. Statistical Significance

To see whether this improvement is statistically significant, we conducted a t-Test (paired two samples for means) between the per-search-term precisions of Google and WS for each Top link. There was no statistically significant difference between WS and Google for any Top link at 95% confidence (P = 1 and t-critical = 2.079 for Top 1; P = 1 and t-critical = 2.079 for Top 5; P = 0.328 and t-critical = 2.079 for Top 10; P = 0.204 and t-critical = 2.079 for Top 15; P = 0.147 and t-critical = 2.079 for Top 20).
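The paired t statistic used here can be computed directly; a sketch (in practice, |t| would be compared against the critical value, about 2.079 for 21 degrees of freedom at 95% confidence, and P obtained from the t distribution):

```python
import math

def paired_t(xs, ys):
    """t statistic for a paired two-sample test on matched lists of
    per-search-term precision values."""
    n = len(xs)
    diffs = [x - y for x, y in zip(xs, ys)]
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)  # mean difference over its std error
```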

To understand why our improvements are not statistically significant, we analyzed the variance in the precision values. In Figure 22 we plot the average and the standard deviation (SD) of the 22 search terms’ precisions from Google for the five Top links. The x-axis shows the Top links; the y-axis represents the average and SD of the precision values. The dots in the middle of the vertical bars are the averages, and the bars themselves represent the SD values. Variance was large for Top 1 and decreased as more links were considered.