# Learning Implicit User Interest Hierarchy for Web Personalization

Hyoung-rae Kim. A dissertation submitted to Florida Institute of Technology.

2.4.1.3. Window Size

We compared performance using two different window sizes: the entire page versus 100 words (roughly paragraph length). We fixed the correlation function to AEMI and the threshold-finding method to MaxChildren. Table 5 and Table 8 illustrate the results. A window size of the entire page generated slightly more meaningful clusters (59% good) than a window size of 100 (57% good). However, a window size of 100 yielded more trees with 100% ‘good’ leaf clusters (6) than a window size of the entire page (5). Hence, it is not clear which window size produces more meaningful clusters. Both settings resulted in ‘medium’ trees, although a window size of 100 generated one thin tree for U11: the ‘T’ tree in Table 8 has only two nodes, the root and one leaf.

Table 5. Combination of AEMI, MaxChildren, and entire page

If we can add phrases as features in the UIH, each cluster will be enriched because phrases are more specific than words. We compared two data sets: one consisting of words only and the other consisting of words and phrases, where the phrases were collected by VPF (Kim and Chan, 2004). Table 5 and Table 9 illustrate the results. The data with phrases yielded more meaningful leaf clusters (64%) than the data with words only (59%). Tree shapes were similar (medium) for both.

The UIHs learned for one user (U1) are depicted in Figure 9 and Figure 10. The UIH built from words only has three ‘good’ leaf clusters (1, 3, and 4) and one ‘bad’ leaf cluster (5).

Cluster 0 denotes the root node, which contains all words or phrases. The UIH on the right, learned from both words and phrases, has all ‘good’ clusters; furthermore, it is more descriptive, because Clusters 6 and 7 in Figure 10 describe class names, whereas only Cluster 3 in Figure 9 does. We can say Clusters 6 and 7 in Figure 10 are conceptually related because both are class names. We cannot explain why some specific interests in one UIH do not exist in the other. For example, Cluster 4 in Figure 9 showed that the user (U1) was interested in a Master’s or Doctoral degree program, but the

hierarchy (UIH) that can represent a continuum of general to specific interests from a set of web pages interesting to a user. This approach is non-intrusive and allows web pages to be associated with multiple clusters/topics. We proposed our divisive hierarchical clustering (DHC) algorithm and evaluated it based on data obtained from 13 users on our web server.

We also introduced correlation functions and threshold-finding methods for the clustering algorithm. Our empirical results suggested that DHC with the AEMI correlation function and the MaxChildren threshold-finding method yielded the most meaningful leaf clusters. In addition, by using phrases found by the VPF algorithm, we improved performance to 64% interpretable clusters. We did not analyze differences among the UIHs obtained from various users because of the large number of web pages used in our experiments. Results from experiments not reported here indicated that stemmed words were more effective than whole words. The minimum cluster size affected the number of leaf clusters; a size of 4 was easy to use and seemed to produce reasonable results.

The performance of the DHC algorithm varied depending on the user and the articles selected. We currently do not understand the reason for this variance in performance. We

Finding phrases in a document has been studied in various information retrieval systems to improve their performance. Many previous statistical phrase-finding methods had different aims, such as document classification. Some hybridize statistical and syntactic grammatical methods; others use correlation heuristics between words. We propose a new phrase-finding algorithm that adds correlated words one by one to the phrases found in the previous stage, maintaining high correlation within a phrase. The inputs are the words in a document and the outputs are the collected phrases. These processes are illustrated in Figure 11.

Figure 11. Finding phrases in the framework of web personalization

3.1. Variable-length Phrases

Our algorithm consists of two components: the main algorithm and the correlation function. In preparation for VPF, we extract words from a web page visited by the user, filter them through a stop list, and stem them (Ahonen et al., 1998; Chan, 1999; Frakes and Baeza-Yates, 1992; Zamir and Etzioni, 1998). Other phrase-finding algorithms also require these pre-processing steps.
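As a concrete illustration, the pre-processing pipeline described above might be sketched in Python as follows. The stop list and the suffix-stripping stemmer here are toy stand-ins for illustration only; the dissertation relies on standard stop lists and stemmers (e.g., Frakes and Baeza-Yates, 1992), not these.

```python
import re

# Toy stop list -- an illustrative stand-in, not the one used in the thesis.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def naive_stem(word):
    """Very naive suffix stripping; a real system would use a Porter-style stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Extract lowercase word tokens, drop stop words, and stem the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The user visited interesting pages"))
# → ['user', 'visit', 'interest', 'pag']
```

Note that naive suffix stripping can over-truncate (“pages” → “pag”); this is why production systems use a proper stemmer.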

3.1.1. VPF Algorithm

Our algorithm adds words one by one to the phrases found in the previous stage; each stage corresponds to one recursion of the VPF algorithm. One might assume that each phrase P{l} of length l has the form P{l-1}w, where w is one word and P{l-1} is a phrase of length l-1. However, since the phrase P{l-1}w is defined by the correlation between P{l-1} and w, the correlation can also hold between a non-phrase P{l-1} and a word w; that is, P{l-1} is not itself a phrase, but can be extended into a phrase of length l. Conversely, even if a phrase P{l} exists in a document, it might not be generated because P{l-1} was not kept. For example, suppose a web page contains the phrase “wireless powerful computer”. Since “wireless powerful” is not a phrase, it is possible that the full phrase could not be generated. However, if the phrase is meaningful/important enough, the sub-phrases “wireless powerful” and “powerful computer” will be generated, and “computer” will then be added to “wireless powerful”. To alleviate this problem, we calculate the threshold only once at the beginning, so the threshold remains consistent across stages. If the correlation value of “wireless powerful” is lower than that of “wireless powerful computer”, the shorter phrase will be removed in the pruning stage.

Our algorithm receives a sequence of words as input and returns meaningful phrases. It combines words into phrases until no more phrases are generated. Figure 12 presents the pseudocode for the variable-length phrase-finding algorithm (VPF). It uses three main variables: List, SEQ, and Corr. The List variable stores all collected phrases in a Hash attribute; each element (phrase) of the Hash keeps its correlation value in a sim attribute and its position list in a posi attribute. VPF first builds a linked list (SEQ) from the input example, storing each word and its position in a node.

Then, all distinct 1-grams are stored in List[1].Hash. Corr is the chosen correlation function. List, SEQ, and Corr are passed to the BeSpecific procedure, which finds all phrases. Once the phrases are acquired, they are pruned: the pruning process simply removes every sub-phrase whose sim value is lower than the sim value of its super-phrases.

The BeSpecific procedure receives five parameters: List, which stores all phrases; L, the length of the phrases (initially 2); thre, which keeps the calculated threshold value differentiating “strong” relations from “weak” relations (initialized to 0); SEQ, which stores the linked list; and Corr. BeSpecific recursively creates new sequences by removing nodes that are not in the Hash table generated in the previous stage, and by removing runs of consecutive nodes shorter than L. Because words absent from the previous Hash table are removed, gaps can appear between nodes. Once the new sequence is generated, BeSpecific collects from it all L-grams having no gaps. The threshold is calculated only once, when L=2, by averaging all correlation values between the first and second word of each element in List[2].Hash. The for loop removes from the Hash any element with a low correlation value (sim). The sim property of p keeps the correlation between the sub-phrase p[1..L-1] and the last word p[L], where p is a phrase of L words. For example, if p = “computer science seminar”, then p[1..L-1] = “computer science” and p[L] = “seminar”. The Intersection counts adjacent positions based on the distance: the intersection between the positions of p[1..L-1] and p[L] becomes the positions of p, and if the intersection of p is empty, p is removed. BeSpecific recursively increases the phrase length L until the Hash becomes empty. Various correlation functions can be applied in place of Corr.
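The position-intersection step can be sketched as follows. Since the distance convention is only briefly stated in the text, this sketch assumes posi stores phrase start positions, so a length-L phrase starts at position i exactly when its (L-1)-word prefix starts at i and its last word occurs at i + L - 1; `intersect_positions` is a hypothetical helper name, not from the dissertation.

```python
def intersect_positions(prefix_starts, last_word_pos, offset):
    """Positions of an extended phrase p = p[1..L-1] + p[L].

    Assumption: posi keeps 1-based start positions. A phrase of length L
    starts at i when its prefix starts at i and its last word occurs at
    i + offset, with offset = L - 1 under this convention.
    """
    last = set(last_word_pos)
    return [i for i in prefix_starts if i + offset in last]

# "computer science" starts at positions 1 and 10; "seminar" occurs at 3 and 7.
# With offset 2, "computer science seminar" starts at position 1 only.
print(intersect_positions([1, 10], [3, 7], 2))  # → [1]
```

If the intersection comes back empty, the candidate phrase is discarded, exactly as in step 11 of the pseudocode.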

In order to improve the phrase-selection accuracy, we calculate for each word the percentage of times it can come before any other word and the percentage of times it can come after any other word, called the pre-percentage and post-percentage respectively. The idea is that if a word occurs at the end of a sequence, it loses one chance to come before another word, so we adjust the word’s pre-percentage by deducting one from its number of occurrences; the post-percentage is adjusted symmetrically for a word at the beginning of a sequence. We can view a string S[1..n] of n consecutive words as two substrings, Spre = S[1..n-1] and Spost = S[2..n]. The pre- and post-percentage of w can then be computed in O(1) time, when we know all the positions where w occurred:

w pre-percentage = (frequency of w in Spre) / |Spre|
w post-percentage = (frequency of w in Spost) / |Spost|
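Assuming a word’s occurrence positions are kept sorted and 1-based (as the posi attribute suggests), both percentages follow in constant time from the position list; `pre_post_percentage` is an illustrative helper name, not from the dissertation.

```python
def pre_post_percentage(positions, n):
    """Pre- and post-percentage of a word in a sequence of n words.

    positions: sorted, 1-based occurrence positions of the word.
    A word at the very end of S loses one chance to come before another
    word, so its pre-frequency drops by one; symmetrically, a word at the
    very beginning loses one post-chance.  |Spre| = |Spost| = n - 1.
    """
    count = len(positions)
    freq_pre = count - (1 if positions and positions[-1] == n else 0)
    freq_post = count - (1 if positions and positions[0] == 1 else 0)
    return freq_pre / (n - 1), freq_post / (n - 1)

# In S = "a b a" (n = 3), "a" occurs at positions 1 and 3: it loses one
# pre-chance (it ends S) and one post-chance (it begins S).
print(pre_post_percentage([1, 3], 3))  # → (0.5, 0.5)
```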

Input: Example – a sequence of words
Output: Collected phrases

Function VPF(Example) returns phrases
  SEQ – linked list; each node has a word and a position
  List – array of Hash tables; each phrase in a Hash has sim and posi attributes
  Corr – a correlation function
  1. SEQ ← linked list made from the input Example
  2. List[1].Hash ← store all 1-grams; each word keeps its positions
  3. BeSpecific(List, 2, 0, SEQ, Corr)
  4. Prune phrases in the List
  5. Return all phrases in the List
End Function

Input: List – array of Hash tables
       L – length of phrase, initialized to 2
       thre – threshold value, initialized to 0
       SEQ – linked list; its size shrinks every iteration
       Corr – a correlation function

Procedure BeSpecific(List, L, thre, SEQ, Corr)
  1. if List[L-1].Hash is empty then stop
  2. SEQ ← remove nodes that are not in List[L-1].Hash, and also remove runs of consecutive nodes shorter than L
  3. List[L].Hash ← collect all L-grams from SEQ
  4. if L = 2 then thre ← average correlation across all words in List[2].Hash
  5. for each p in List[L].Hash do
  6.    A ← pre-percentage of p[1..L-1]
  7.    B ← post-percentage of p[L]
  8.    p.sim ← Corr(A, B, A∩B)
  9. remove any p for which p.sim is lower than thre
  10. p.posi ← intersection of p[1..L-1].posi and p[L].posi with distance of L-2
  11. remove any p for which p.posi is empty
  12. BeSpecific(List, L+1, thre, SEQ, Corr)
End Procedure

Figure 12. Pseudocode of the variable-length phrase-finding algorithm (VPF)

when L is 2. Even though BeSpecific is called L times, as L increases the size of the sequence decreases drastically.

The Corr function has O(T) time complexity, where T is the length of the position list. But as T increases, the number of elements in a Hash decreases. We can therefore claim that the time complexity of VPF in the general case is roughly O(S), where S is the sequence size.

For example, in Figure 2 we have the input string S = “abcdabefbcdghabcd”, in which each letter represents a word. The correlation function simply returns the frequency of a phrase, and the threshold value is 1.5. List[1].Hash contains all distinct words.

List[2].Hash originally contains all possible 2-word phrases; removing any phrase that occurs fewer than 2 times leaves {“ab”, “bc”, “cd”}. We remove all words that are not in List[2].Hash from the 1st SEQ, resulting in the 2nd SEQ. List[3].Hash contains all 3-word phrases; “cda” and “dab” are then removed because they occur only once. When BeSpecific runs for the 3rd time, “abc” is removed from the 3rd SEQ because its consecutive length is less than 4 (L increases on every call to BeSpecific). On the 4th call to BeSpecific, all phrases in List[5].Hash (“abcda” and “bcdab”) are removed because they occur only once. Since the 5th Hash is empty, BeSpecific stops. Afterwards, the List keeps {“ab”, “bc”, “cd”, “abc”, “bcd”, “abcd”}. Suppose the pruning removes every sub-phrase whose sim value is lower than or equal to the sim value of a super-phrase. The occurrences of the phrases are: “ab”-3, “bc”-3, “cd”-3, “abc”-2, “bcd”-3, and “abcd”-2. We remove “bc” and “cd” because “bcd” subsumes them and has equal frequency; “abc” is removed by “abcd” for the same reason. The final returned phrases are {“ab”, “bcd”, “abcd”}.
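The worked example above can be reproduced with a simplified sketch: it uses raw frequency as the correlation (sim) value with a threshold of 1.5, and it skips the SEQ-shrinking and position-intersection machinery of the full algorithm, simply counting contiguous n-grams of growing length. It is a minimal illustration, not the full VPF implementation.

```python
from collections import Counter

def find_phrases(words, threshold=1.5):
    """Simplified VPF-style phrase finder (sim = raw frequency)."""
    collected = {}          # phrase (tuple of words) -> frequency
    n = 2
    while True:
        # Count all contiguous n-grams of the current length.
        counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
        survivors = {p: c for p, c in counts.items() if c > threshold}
        if not survivors:    # the Hash for this length is empty: stop
            break
        collected.update(survivors)
        n += 1

    def contains(q, p):
        """True if phrase p occurs contiguously inside the longer phrase q."""
        return any(q[i:i + len(p)] == p for i in range(len(q) - len(p) + 1))

    # Pruning: drop a sub-phrase whose frequency is <= that of a super-phrase.
    pruned = [
        "".join(p) for p, c_p in collected.items()
        if not any(len(q) > len(p) and c_p <= c_q and contains(q, p)
                   for q, c_q in collected.items())
    ]
    return sorted(pruned)

print(find_phrases(list("abcdabefbcdghabcd")))  # → ['ab', 'abcd', 'bcd']
```

Each letter stands for a word here, so phrases are joined without spaces; with real words one would join with spaces instead.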

3.1.2. Correlation Functions

The VPF algorithm builds phrases; the correlation function calculates the weight of each phrase. Correlation functions are therefore important for selecting more meaningful phrases. VPF can cooperate with many different existing correlation functions, and it can be hard to choose one among them. In this section, we describe several key properties of a good correlation function. Much of the statistical work in building multi-word features focuses on co-occurrence (Chen and Sycara, 1998; Rosenfeld, 1994). Not all correlation measures are equally good at capturing the dependencies between different events (Tan et al., 2002), because each correlation function is biased toward different individual event probabilities and joint probabilities.

Piatetsky-Shapiro (1991) proposed three key properties that a good correlation function, F, for events A and B should satisfy:

P1: if A and B are statistically independent, then F is 0;

P2: F monotonically increases with P(A,B) when P(A) and P(B) remain the same;

P3: if P(A) (or P(B)) increases when the rest of the parameters (P(A,B) and P(B) (or P(A))) remain unchanged, then F monotonically decreases.
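These three properties can be checked numerically with the simplest measure that satisfies them, the leverage (Piatetsky-Shapiro) measure P(A,B) − P(A)P(B). This function is chosen here purely for illustration; it is not the correlation function used by VPF.

```python
def leverage(p_ab, p_a, p_b):
    """Leverage (Piatetsky-Shapiro) measure: P(A,B) - P(A)P(B)."""
    return p_ab - p_a * p_b

# P1: if A and B are statistically independent, F is 0.
assert leverage(0.1, 0.2, 0.5) == 0.0          # P(A,B) = P(A)P(B)
# P2: F increases with P(A,B) when P(A) and P(B) are fixed.
assert leverage(0.15, 0.2, 0.5) > leverage(0.10, 0.2, 0.5)
# P3: F decreases when P(A) grows while P(A,B) and P(B) stay fixed.
assert leverage(0.10, 0.3, 0.5) < leverage(0.10, 0.2, 0.5)
```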

Statistical independence can be measured by the determinant operator, where Det(A,B) = A∩B × A′∩B′ − A∩B′ × A′∩B. Thus, A and B are independent when this determinant is equal to zero (Tan and Kumar, 2000). Another important operation in finding phrases is distinguishing between positive and negative correlations (P4).
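As a sketch, the determinant test amounts to the determinant of the 2×2 contingency table of co-occurrence counts; `det2` is an illustrative name, and the counts stand in for the four joint events A∩B, A∩B′, A′∩B, and A′∩B′.

```python
def det2(a_b, a_notb, nota_b, nota_notb):
    """Determinant of the 2x2 contingency table for events A and B.

    Zero exactly when A and B are statistically independent; the sign
    distinguishes positive from negative correlation.
    """
    return a_b * nota_notb - a_notb * nota_b

# Independent table (out of 100 trials): P(A,B) = 0.2 = P(A)P(B) = 0.5 * 0.4.
assert det2(20, 30, 20, 30) == 0
# A and B co-occur more often than chance: positive correlation.
assert det2(30, 20, 10, 40) > 0
```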

The significance of the correlation between A and B can be assessed by measuring their cross-product ratio (CPR) (Rosenfeld, 1994), which is defined as: