# «Learning Implicit User Interest Hierarchy for Web Personalization by Hyoung-rae Kim A dissertation submitted to Florida Institute of Technology in ...»

To provide a more robust context for personalization, we desire to extract a continuum of general to specific interests of a user, called a user interest hierarchy (UIH).

The higher-level interests are more general, while the lower-level interests are more specific. A UIH can represent a user’s interests at different abstraction levels and can be learned from the contents (words/phrases) in a set of web pages bookmarked by a user. We propose a divisive hierarchical clustering (DHC) algorithm to group words (topics) into a hierarchy where more general interests are represented by a larger set of words. Our approach does not need user involvement and learns the UIH “implicitly.” To enrich features used in the UIH, we used phrases in addition to words.

Figure 2. Learning user interest hierarchy in the framework of web personalization

2.1. User Interest Hierarchy A user interest hierarchy (UIH) organizes a user’s general to specific interests.

Towards the root of a UIH, more general (passive) interests are represented by larger clusters of words while towards the leaves, more specific (active) interests are represented by smaller clusters of words. To generate a UIH for a user, our clustering algorithm (details in Section 4) accepts a set of web pages bookmarked by the user as input. That is, the input of DHC is web pages that are interesting to a user. We use the words and phrases in the web pages and ignore link or image information. The web pages are stemmed and filtered by ignoring the most common words listed in a stop list (Rasmussen, 1992). The phrases are extracted by VPF algorithm (Kim and Chan, 2004). These processes are depicted in Figure 2.

Table 1 contains a sample data set. Numbers on the left represent individual web pages; the content has words stemmed and filtered through the stop list. These words in the web pages can be represented by a UIH as shown in Figure 3. Each cluster, node, can represent a conceptual relationship, for example ‘perceptron’ and ‘ann’ (in italics) can be categorized as belonging to neural network algorithms, whereas ‘id3’ and ‘c4.5’ (in bold) in another node cannot. Words in these two nodes are mutually related to some other words such as ‘machine’ and ‘learning’. This set of mutual words, ‘machine’ and ‘learning’, performs the role of connecting italic and bold words in sibling clusters and forms the parent cluster. We illustrate this notion in the dashed box in Figure 3.

One can easily identify phrases like “machine learning” and “searching algorithm” in the UIH, however only the individual words are represented in the UIH. By locating phrases from the pages, we can enrich the vocabulary for building the UIH. For example, the phrase “machine learning” can be identified and added to Pages 1-6. If we can use phrases as feature in the UIH, each cluster will be enriched because phrases are more specific than words. For example, a user is interested in both phrases “java coffee” and “java language”. The word “java” will be in the parent cluster of both “coffee” and “language”. Each child cluster would contain only “coffee” or “language”, which is relatively less useful when not in combination with “java”.

The approach we take to generate the hierarchy is similar to clustering pages, but pages may belong to multiple clusters – overlapping clusters of pages. That is, instead of directly clustering the original objects (web pages), we first cluster features (words) of the objects and then the objects are assigned to clusters based on the features in each cluster.

Note that a document can have terms in different clusters, hence, a document can be in more than one cluster. Since the more challenging step is the initial hierarchical clustering of features, our primary focus for this Chapter is on devising and evaluating algorithms for this step.

2.2. Building User Interest Hierarchy We desire to learn a hierarchy of interest topics from a user’s web pages bookmarked by a user, in order to provide a context for personalization. Our divisive hierarchical clustering (DHC) algorithm recursively partitions the words into smaller clusters, which represent more related words. We assume words occurring close to each other (within a window size) are related to each other. We investigate correlation functions that measure how closely two words are related in Section 4.2. We also investigate techniques that dynamically locate a threshold that decides whether two words are strongly related or not in Section 4.3. If two words are determined to be strongly related to each other, they will be in the same cluster; otherwise, they will be in different clusters.

2.2.1. Algorithm Our algorithm is a divisive graph-based hierarchical clustering method called DHC, that recursively divides clusters into child clusters until it meets the stopping conditions. In preparation for our clustering algorithm, we extract words from web pages that are interesting to the user, filter them through a stop list, and stem them (Rasmussen, 1992). Figure 4 illustrates the pseudo code for the DHC algorithm. Using a correlation function, we calculate the strength of the relationship between a pair of words in line 1.

After calculating a threshold to differentiate strong correlation values from weak correlation in line 2, we remove all weak correlation values in line 5. We then build a weighted undirected graph with each vertex representing a word and each weight denoting the correlation between two words. Since related words are more likely to appear in the same document than unrelated terms, we measure co-occurrence of words in a document.

Given the graph, called a CorrelationMatrix, the clustering algorithm recursively partitions the graph into sub-graphs, called Clusters, each of which represents a sibling node in the resulting UIH in line 6.

At each partitioning step, edges with “weak” weights are removed and the resulting connected components constitute sibling clusters (we can also consider cliques as clusters, but more computation is required). We treat the determination of what value is considered to be “strong” or “weak”, as another clustering. The recursive partitioning process stops when one of the stopping criteria is satisfied. The first criterion is when the current graph does not have any connected components after weak edges are removed. The second criterion is a new child cluster is not formed if the number of words in the cluster falls below a predetermined threshold.

Cluster: distinct words in a set of interesting web pages to a user [with information of web page membership] CORRELATIONFUNCTION: Calculates the "closeness" of two words.

FINDTHRESHOLD: Calculates the cutoff value for determining strong and weak correlation values.

WindowSize: The maximum distance (in number of words) between two related words in calculating their correlation value.

Procedure DHC (Cluster,CORRELATIONFUNCTION,FINDTHRESHOLD,WindowSize)

1. CorrelationMatrix ← CalculateCorrelationMatrix (CORRELATIONFUNCTION, Cluster, WindowSize)

2. Threshold ← CalculateThreshold(FINDTHRESHOLD, CorrelationMatrix)

3. If all correlation values are the same or a threshold is not found

4. Return EmptyHierarchy

5. Remove weights that are less than Threshold from CorrelationMatrix

6. While (ChildCluster←NextConnectedComponent (CorrelationMatrix))

7. If size of ChildCluster = MinClusterSize 8. ClusterHierarchy←ClusterHierarchy + ChildCluster + DHC(ChildCluster,CORRELATIONFUNCTION,FINDTHRESHOLD, WindowSize)

9. Return ClusterHierarchy End Procedure

Suppose we built a weighted undirected graph with the running example in Table 1 where each vertex represents a word and each weight (value) denotes the correlation value.

The undirected graph can be depicted as shown in a) in Figure 5 and Figure 6 – the left column shows graph partitioning and the right column represents the corresponding tree.

We presented only some vertices and edges as shown in a). Once a threshold for differentiating “strong” edges from “weak” edges is calculated by using a Findthreshold method, we can remove weak edges – those removed edges are represented as dashed lines.

After removing weak edges, DHC finds connected components, which is shown in b). If the number of elements in a cluster is greater than the minimum number of elements in a cluster (e.g., 4), then the correlation values are recalculated and the algorithm repeats the process of removing “weak” edges as shown in c). Since DHC recursively partitions the graph into subgraphs, called Clusters, the final result becomes hierarchical clusters as shown in d).

The CalculateCorrelationMatrix function takes a correlation function, cluster, and window size as parameters and returns the correlation matrix, where the window size affects how far two words (the number of words between two words) can be considered as related. The CalculateThreshold function takes a threshold-finding method and correlation matrix as parameters and returns the threshold. The correlation function (Sec. 4.2) and threshold-finding method (Sec. 4.3) greatly influence the clustering algorithm, and are discussed next..

2.2.2. Correlation Functions The correlation function calculates how strongly two words are related. Since related words are likely to be closer to each other than unrelated words, we assume two words co-occurring within a window size are related to each other. To simplify our discussion, we have been assuming the window size to be the entire length of a document.

That is, two words co-occur if they are in the same document. These functions are used in

2.2.2.1. AEMI We use AEMI (Augmented Expected Mutual Information) (Chan, 1999) as a correlation function. AEMI is an enhanced version of MI (Mutual Information) and EMI (Expected Mutual Information). Unlike MI which considers only one corner of the contingency matrix and EMI which sums the MI of all four corners of the contingency matrix, AEMI sums supporting evidence and subtracts counter-evidence. Chan (1999) demonstrates that AEMI could find more meaningful multi-word phrases than MI or EMI.

Concretely, consider A and B in AEMI (A,B) are the events for the two words. P ( A = a ) is the probability of a document containing a and P ( A = a ) is the probability of a

is the probability of a document containing both terms a and b. These probabilities are

**estimated from documents that are interesting to the user. AEMI (A,B) is defined as:**

The first term computes supporting evidence that a and b are related and the second term calculates counter-evidence. Using our running example in Figure 3, Table 2 shows a few examples of how AEMI values are computed. The AEMI value between ‘searching’ and ‘algorithm’ is 0.36, which is higher than the AEMI value between ‘space’ and ‘constraint’, –0.09.

2.2.2.2. AEMI-SP Inspired by work in the information retrieval community, we enhance AEMI by incorporating a component for inverse document frequency (IDF) in the correlation function. The document frequency of a word calculates the number of documents that contain the word. Words that are commonly used in many documents are usually not informative in characterizing the content of the documents. Hence, the inverse document frequency (the reciprocal of document frequency) measures how informative a word is in characterizing the content. Since our formulation is more sophisticated than IDF and it involves a pair of words rather than one word in IDF, we use a different name and call our function specificity (SP).

We estimate the probability of word occurrence in documents instead of just document frequency so that we can scale the quantity between 0 and 1. We desire to give high SP values to words with a probability below 0.3 (approximately), gradually decreasing values from 0.3 to 0.7, and low values above 0.7. This behavior can be approximated by a sigmoid function, commonly used as a smoother threshold function in neural networks, though ours needs to be smoother. SP(x) is defined as: 1/(1 + exp(0.6 × (x × 10.5 – 5))), where x is defined as: MAX (P(a), P(b)) - we choose the larger probability so that SP is more conservative. The factor 0.6 smoothes the curve, and constants 10.5 and –5 shift the range of x from between 0 and 1 to between -5 and 5.5. The new range of -5 and

5.5 is slightly asymmetrical because we would like to give a small bias to more specific words. For instance, for a = ‘ann’ and b = ‘perceptron’, x is 0.2 and SP(x) is 0.85, but for a=‘machin’ and b=‘ann’, x is 0.6 and SP(x) is 0.31.

Our correlation function AEMI-SP is defined as: AEMI × SP/2. The usual range for AEMI is 0.1–0.45 and SP is 0–1. To scale SP to a similar range as AEMI, we divide SP by 2. For example, in Table 3 the AEMI-SP value for ‘searching’ and ‘algorithm’ is lower than the value for ‘ann’ and ‘perceptron’ because the SP value for ‘ann’ and ‘percetpron’ is higher even though the AEMI value is lower.

expect it to occur quite often and appear with different, more specific words. Hence, we desire general (“connecting”) words to exist only at higher levels in the UIH. For example, ‘ai’ is general and preferably should not appear at the lower levels. Using our running example in Figure 3, the Jaccard value between ‘ai’ and ‘machine’ is 0.6 and the value between ‘ai’ and ‘search’ is 0.5. If the threshold is 0.49, both pairs are in the same cluster and ‘ai’ may perform the role to connect ‘machine’ and ‘search’. Even if the threshold is 0.55, ‘ai’ still remains in the child cluster with ‘machine’ (since their correlation value is over the threshold) - which is a wrong decision. This phenomenon tells us the Jaccard function is not proper for making hierarchical clusters.