# «Learning Implicit User Interest Hierarchy for Web Personalization by Hyoung-rae Kim A dissertation submitted to Florida Institute of Technology in ...»

We summarized the results in a Contingency matrix in Figure 17. The numbers outside the box were the sum of rows and columns. Out of 32 total correlation functions tested, 20 correlation functions satisfied old properties (P1-P3); 19 satisfied our new properties (P4, P5); 15 satisfied both desirable properties (P1-P5); 5 functions satisfied old properties only, and 4 functions satisfied new properties only. Since our properties could distinguish 5 correlation functions out of 20 functions that satisfied old properties (5/20=25%), it determined that our properties were independent from previous 3 desirable properties.

** Figure 17. Contingency matrix of desirable properties over all functions 4.**

3.2. Comparison Based upon Property 1 If a correlation function can detect the statistical independence of a diagram, it satisfies P1. The correlation functions that satisfy old properties and new properties were counted under the condition of satisfying P1. We created contingency matrix of the correlation functions that satisfied P1 in Figure 18. Twenty correlation functions satisfied P1. Interestingly, the correlation functions that satisfied P1 also satisfied P2 and P3. The correlation functions that satisfied P2 and P3 were dependent on the correlation functions that satisfied P1. This means P1 alone can explain P2 and P3. In other words, P1 can subsume P2 and P3, which reveals the nonessential nature of P2 and P3. When verifying correlation functions, P1 may be the only necessary reference property.

** Figure 18. Contingency matrix of desirable properties over property 1 4.**

3.3. Comparison Based upon Property 6 In Table 19, an “o” is placed in each column, for P6 if the correlation function can distinguish positive and negative relations. The correlation functions that satisfied old properties or new properties was counted and shown in the contingency matrix (Figure 19).

Of the 12 functions satisfying P6, 8 satisfied P1-P5; only 4 satisfied old properties (P1-P3) without satisfying new properties (P4 and P5).

All correlation functions satisfying P6 also satisfied old properties. It might be unnecessary to test a function for P1-P3 when P6 encompasses all three. Our properties (P4 and P5) tested an alternative aspect of the correlation functions, which is unrelated to P6.

In the contingency matrix, we compared 12 correlation functions, which satisfied old properties to new properties (Table 19). Of the 12 functions, only 8 satisfied our new properties. Clearly our properties were testing an alternative aspect of correlation functions.

** Figure 19. Contingency matrix of desirable properties over property 6 4.**

3.4. Normalized Results – Property 7 Normalization helps users analyze or understand the data. The P7 is not a highly desirable property because normalization is unrealistic for many correlation functions. We included the results of P7 testing only to help those users who prefer normalized results.

A comparison of P7 to both old properties and new properties may prove illustrative. Correlation functions that produced normalized results were assigned an “o” in Table 19. Out of 32 correlation functions, 22 functions satisfied P7. Ten functions satisfied both old properties and P7; fourteen functions satisfied both new properties and P7. The results also indicated that the characteristics of correlation functions were unrelated to function normality.

4.4. Summary In order to select the right correlation functions for an application out of a set of well-known correlation functions, the characteristics of each function must be compared.

Piatetsky-Shapiro (1991) opened new research areas by proposing three basic desirable properties for functions (P1-P3). Tan et al. (2002) compared correlation functions using the three established desirable properties and other existing function properties. Our research on two-variable Venn diagrams led us to identify two more desirable properties of correlation functions (P4 and P5). Property 4: If P(A,B) increases when P(A) and P(B) remain unchanged, then F monotonically decreases. Property 5: If P(A) (or P(B)) and P(A,B) increase at the same ratio, then F monotonically increases.

Consider the cases where A1=5.0, A2=10.0, B=10.0, A1∩B=3.0, and A2∩B=6.0 (Table 17). The correlations of the two cases, F(A1,B,A1∩B) and F(A2,B,A2∩B), may have different values. The log CPR values, which assess the significance of the correlation between A and B, are 4.2 and 5.0 respectively (Rosenfeld, 1994). But a deficiency exists.

For example, the function Inte satisfies P1-P3, but returns the same value of 6 in both cases. Our proposed property 5 can tell the difference between the two cases, and is critical for measuring the characteristics of correlation functions.

In addition, the definition of P1 used by Piatetsky-Shapiro (1991) lacks some descriptive power. For example, some correlation functions (e.g., Odds, Conv, Inte, and Coll in Appendix 1) produced a score of 1, even when two events were statistically independent. We revised the definition of P1 to improve its descriptive power.

P1: F can distinguish statistical independence of A and B.

In order to see whether those correlation functions that satisfy our two new desirable properties also satisfy previous three desirable properties, we collected 32 correlation functions as shown in Appendix 1 and used systematically generated test cases for testing each property. The results indicated that our two new desirable properties were more descriptive than the previous three old desirable properties. It was because all correlation functions that satisfy P1 also satisfied P2 and P3, and all correlation functions that satisfied P6 also satisfied P1-P3; but our properties were independent from P1 and P6.

We insist that these two properties are important in terms of understanding the correlations of two variables and characterizing an appropriate correlation function. The summarized results not only of P1-P5 but also of P6 that measured negative/positive correlation and the result of P7 that checked which correlation function returned a normalized return value will help reader compare characteristics of correlations functions.

Our experiment was limited to positive correlations for our web personalization since many applications depend on positive correlation. We will extend our analysis to

Personalized Ranking of Search Results with Implicitly Learned User Interest Hierarchies Web search engines are usually designed to serve all users, without considering the interests of individual users. Personalized web search incorporates an individual user's interests when deciding relevant results to return. We propose to learn a user profile, called a user interest hierarchy (UIH), from web pages that are of interest to the user. The user’s interest in web pages will be determined implicitly, without directly asking the user. Using the implicitly learned UIH, we study methods that (re)rank the results from a search engine. Experimental results indicate that our personalized ranking methods, when used with a popular search engine, can yield more relevant web pages for individual users. This

Figure 20. Devising a scoring function in the framework of web personalization

5.1. Personalized Results Personalization of web search involves adjusting search results for each user based on his or her unique interests. Our approach orders the pages returned by a search engine depending on a user’s interests. Instead of creating our own web search engine, we retrieved results from Google (Google, 2005). Since the purpose of this Chapter is to achieve a personalized ordering of search engine results, we can score a page based on the user profile and the results returned by a search engine as shown in the dashed box in Figure 21.

pages in his/her bookmarks (Li et al., 1999; Maarek and Ben-Shaul, 1996) and the Divisive Hierarchy Clustering (DHC) algorithm (Kim and Chan, 2003). A UIH organizes a user’s interests from general to specific. Near the root of a UIH, general interests are represented by larger clusters of terms while towards the leaves, more specific interests are represented by smaller clusters of terms. The root node contains all distinct terms in the bookmarked web page. The leaf nodes contain more specifically interesting terms. The relations between terms are calculated based on the co-occurrence in the same web page.

An example of a UIH is shown in Figure 3. Each node (cluster) contains a set of words. The root node contains all words that exist in a set of web pages. Each node can represent a conceptual relationship if those terms occur together at the same web page frequently, for example ‘perceptron’ and ‘ann’ (in italics) can be categorized as belonging to neural network algorithms, whereas ‘id3’ and ‘c4.5’ (in bold) in another node cannot.

Words in these two nodes are mutually related to some other words such as ‘machine’ and ‘learning’. This set of mutual words, ‘machine’ and ‘learning’, performs the role of connecting italicized and bold words in sibling nodes and forms the parent node. We illustrate this notion in the dashed box.

This Chapter focuses on devising a scoring method that receives two inputs (UIH and retrieved results) and one output (personalized ranking).

5.2. Approach In order to provide personalized, reordered search results to a user, we need to score each page depending on personal interests. Therefore, the goal is to assign higher scores to web pages that a user finds more interesting. This section explains how to score a retrieved web page using a user’s UIH. First, we explain the basic characteristics for each matching term. Second, based on the characteristics, we propose functions to score a term.

These functions determine how interesting a term is to a user. Third, based on the score and the number of the matching terms, we calculate an overall score for the page. Last, since the search engine provides a score/ranking for a web page, we incorporate this ranking into our final score of the web page.

5.2.1. Four Characteristics of a Matching Term Given a web page and a UIH, we identify matching terms (words/phrases) that reside both in the web page and in the UIH. The number of matching terms is defined m, which is less than the number of total distinct terms in the web page, nt, and the number of total distinct terms in the UIH, nu.

Each matching term, ti, is analyzed according to four characteristics: the level of a node where a term belongs to (Dti), the length of a term such as how many words are in the term (Lti), the frequency of a term (Fti), and the emphasis of a term (Eti). D and L can be calculated while building a UIH from the web pages in a user’s bookmark. Different web page has different values for F and E characteristics. We estimate the probability of these four characteristics and based on these probabilities, we approximate the significance of each matching term.

5.2.1.1. Level/Depth of a UIH Node A UIH represents general interests in large clusters of terms near the root of the UIH, while more specific interests are represented by smaller clusters of terms near the leaves. The root node contains all distinct terms and the leaf nodes contain small groups of terms that represent more specific interests. Therefore, terms in more specific interests are harder to match, and the level (depth) where the term matches indicates significance. For example, a document that contains terms in leaf nodes will be more related to the user’s interests than a document that contains the terms in a root node only. If a term in a node also appears in several of its ancestors, we use the level (depth) closest to the leaves.

There is research that indicates user-defined query scores can be used effectively (Salton and Waldstein, 1978; Harper, 1980; Croft and Das, 1989). From the acquisition point of view, it is not clear how many levels of importance users can specify if we ask a user directly. In I3R (Croft and Thompson, 1987), they used only two levels: important or default. Harper (1980) used 5 levels of importance, and Croft and Das (1989) used 4 levels.

We calculate the scores of terms using the level (depth) of a node in the UIH instead of explicitly asking the user.

The significance of a term match can be measured by estimating the probability, P(Dti), of matching term ti at depth (level) Dti in the UIH. P(Dti) is the probability of a level in a UIH. A term that matches more specific interests (deeper in the UIH) has a lower P(Dti) of occurring. Lower probability indicates the matching term, ti, is more significant. The

**probability is estimated by:**

Longer terms (phrases) are more specific than shorter ones. If a web page contains a long search term typed in by a user, the web page is more likely what the user was looking for.

In general, there are fewer long terms than short terms. To measure the significance of a term match, the probability, P(Lti), of matching term ti of length Lti in the UIH is calculated. Lti is defined as MIN (10, the length of a term). We group the longer (greater than 10) phrases into one bin because they are rare. Longer terms has a smaller probability, P(Lti), of occurring, which indicates a more significant match. The probability

**is estimated by:**

5.2.1.3. Frequency of a Term More frequent terms are more significant/important than less frequent terms.

Frequent terms are often used for document clustering or information retrieval (van Rijsbergen, 1979). A document that contains a search term many times will be more related to a user’s interest than a document that has the term only once.

page to measure the significance of the term. However, in general, frequent terms have a lower probability of occurring. For example, in a web page most of the terms (without the terms in a stop list (Rasmussen, 1992)) will occur once, some terms happen twice, and

5.2.1.4. Emphasis of a Term Some terms have different formatting (HTML tags) such as title, bold, or italic.