«Learning Implicit User Interest Hierarchy for Web Personalization by Hyoung-rae Kim A dissertation submitted to Florida Institute of Technology in ...»
The MIN method in STC (Zamir and Etzioni, 1998) can be defined as MIN(P(a|b), P(b|a)). The idea is that if we assign the same correlation value to connected words and connecting words, they would go together. For instance, ‘ai’ connects ‘machine’ and ‘searching’, so they are grouped together in one cluster. However, when they are divided into child clusters, ‘ai’ should be removed because ‘ai’ is too general. But MIN (P(‘ai’|’machine’), P(‘machine’|’ai’)) still yields a relatively higher value than the average.
Alternatively, the MAX function, MAX(P(a|b),P(b|a)), does not distinguish the value for ‘ai’ and ‘machine’, and the value for ‘machine’ and ‘learning’, even though the latter pair has a much stronger relationship. Since Jaccard, MIN, and MAX did not generate desirable cluster hierarchies, we excluded them from further experiments.
2.2.3. Threshold-finding Methods Instead of using a fixed user-provided threshold (as in STC (Zamir and Etzioni, 1998)) to differentiate strong from weak correlation values between a pair of words, we examine methods that dynamically determine a reasonable threshold value. Weights with a weak correlation are removed from CorrelationMatrix and child clusters are identified.
22.214.171.124. Valley To determine the threshold, we would like to find a sparse region that does not have a lot of similar values. That is, the frequency of weights in that region is low. We first determine the highest observed and lowest desirable correlation values, and quantize the interval into ten regions of equal width. The lowest desirable correlation value is defined as the value achieved by a pair of words that occur together only in one document. We then determine the frequency of values in each region. Generally, lower weights have a higher frequency and higher weights have a lower frequency. If the frequency monotonically decreases with regions of higher weights, picking the region with the lowest frequency will always be the region with the highest weights. If, unfortunately, the threshold is too high, then too many edges will be cut. In this case, the threshold is set to be the average plus standard deviation (biasing to remove more edges with lower weights).
However, if the frequency does not decrease monotonically, we attempt to identify the “widest and steepest” valley. A valley is defined as any region where the frequency decreases and then increases. Steepness can be measured by the slopes of the two sides of a valley and the width of how many regions the valley covers. Since the regions are of equal
regions on the two sides of a valley. Once the widest and steepest valley is located, we identify the threshold in the region that constitutes the bottom (lowest frequency) of the valley.
For example, in Table 4, the first column is the id of each region, the second column is the range of correlation values, the third column is the number of values resides in each region, and the last column is the number of child nodes that can be generated with the lowest value in each range as a threshold. There are three valleys when a histogram is drawn like Figure 7: one from Region 0 through 3, (quality is 17), another one from Region 3 through 5, (quality is 14), and the last one is from Region 5 through 9, (quality is 15). Therefore, the widest and steepest valley is the first valley and its bottom is in Regions 1 and 2, which is shown in Figure 8. To identify the threshold inside the bottom region, we ignore the frequency information and find two clusters of correlation values. In this case, it is a one-dimensional two-cluster task, which can be accomplished by sorting the weights and splitting at the largest gap between two successive weights (Largest gap). In our example, since the bottom has zero frequency, any value between.28 and.30 can be the threshold. If the bottom does not have zero frequency, we recursively divide the bottom until the frequency is zero.
126.96.36.199. MaxChildren The MaxChildren method selects a threshold such that maximum of child clusters are generated. This way we divide the strongly correlated values from weakly correlated ones. This also ensures that the resulting hierarchy does not degenerate to a tall and thin tree (which might be the case for other methods). This preference also stems from the fact that topics are generally more diverse than detailed and the library catalog taxonomy is typically short and wide. For example, we want the trees in Figure 3 to be shorter and wider. MaxChildren calculates the number of child clusters for each boundary value between two quantized regions. To guarantee the selected threshold is not too low, this method ignores the first half of the boundary values. For example, in Table 4, the boundary value 0.33 (between Regions 5 and 6) generates the most children and is selected as the threshold. This method recursively divides the selected best region until there are no changes on the number of child clusters.
188.8.131.52. Other Threshold-finding Methods There are some other threshold-finding methods that we initially studied but found to be inferior to Valley or MaxChildren, and subsequently are not included in this Chapter.
LargestGap sorts the values and split at the largest gap between two successive values (this method can be used in the Valley method after the bottom of the largest valley is found).
Again this is motivated by trying to form two clusters in a one-dimensional space.
However, in our initial experiments, the largest gap is close to the largest observed value and thus the resulting tree is usually too small. To prevent the threshold from being too large, Top30% method selects a threshold that retains values in the top 30%. However, this method generates tall and thin trees. To retain ‘abnormally’ large values of a threshold, we also studied Average+StandardDeviation, in order to select a threshold larger than the average. This is later combined into the Valley method.
2.2.4. Window Size and Minimum Size of a Cluster The window size parameter specifies the maximum ‘physical’ distance (in terms of number of words) between a pair of words for consideration of co-occurrence. We have been using the entire document length as the window size to simplify our discussion.
However, considering two words occurring in the same page as related might be too optimistic. Hence, we investigated smaller window sizes that roughly cover a paragraph (e.g., 100 words) or a sentence (e.g., 15 words). However, in our experiments the window size does not make a significant difference. And, the minimum size of a cluster affects the number of clusters. A larger number of clusters makes the hierarchy less comprehensible and requires more computation. We picked 4 as the minimum size of a cluster, because this number of words can represent a concept that is sufficiently specific.
2.3. Experiments We will evaluate the UIH itself to see if it is meaningful using real data. The quality of the UIH will also describe the performance of DHC. We then compared user interest hierarchies using different methods. Furthermore, we compared the quality of UIHs of which one uses words only and the other includes phrases.
2.3.1. Experimental Data and Procedures Experiments were conducted on data obtained from our departmental web server.
By analyzing the server access log from January to April 1999, we identified hosts that were accessed at least 50 times in the first two months and also in the next two months. We filtered out proxy, crawler, and our computer lab hosts, and identified “single-user” hosts, which are at dormitory rooms and a local company (Chan, 1999). We yielded 13 different users and collected the web pages they visited. The total pages that we used were around 1400 files and most of the pages contained regular contents such as no media biased files.
The number of words on the web pages was on the average 1,918 – minimum number of words was 340, and maximum was 3,708. To find phrases we used variable-length phrase finding algorithm (VPF) (Kim and Chan, 2004) because it finds more meaningful phrases than other methods (Chan, 1999; Ahonen et al., 1998). We evaluated the effectiveness of our algorithms by analyzing the generated hierarchies in terms of meaningfulness and shape. Separate experiments were conducted to evaluate the effectiveness of different correlation functions, threshold-finding methods, and window sizes. In order to remove the researcher’s bias, we randomly reordered whole clusters from all approaches, before we evaluate each cluster.
2.3.2. Evaluation Criteria To evaluate a UIH, we use both qualitative and quantitative measures.
Qualitatively, we examine if the cluster hierarchies reasonably describe some topics (meaningfulness). Quantitatively, we measure shape of the cluster trees by calculating the average branching factor (ABF) (Russell and Norvig, 1995). ABF is defined as the total number of branches of all non-leaf nodes divided by the number of non-leaf nodes.
We categorized meaningfulness as ‘good’, ‘bad,’ or ‘other’. Since the leaf clusters should have specific meaning and non-leaf clusters are hard to interpret due to their size, we only evaluated the leaf clusters for meaningfulness. Our measure is based on
interpretability and usability (Han, 2001). So, we check two properties of the leaf clusters:
the existence of related words, and possibility of combining words. For instance, for related words, consider ‘formal’, ‘compil’, ‘befor’, ‘graphic’, ‘mathemat’, and ‘taken’ are in a cluster, even though ‘befor’ and ‘taken’ do not have any relationship with the other words. Since the words, ‘formal’, ‘compil’, ‘graphic’, and ‘mathemat’, are classified as class names related to computer science major, this cluster is evaluated as ‘good’. For the possibility of combining words, consider ‘research’, ‘activ’, ‘class’, and ‘web’ are in a cluster. In this case, the meaning of the cluster can be estimated as ‘research activity’ or ‘research class’ (Zamir and Etzioni, 1999), so we regard this cluster as ‘good’. A cluster is marked as ‘good’ when it has more than 2/5 of the words that are related or have more than 2 possible composite phrases as well. This is hard to measure, so we attempted to be as skeptical as possible. For example, suppose a cluster has ‘test’, ‘info’, ‘thursdai’, ‘pleas’, ‘cours’, ‘avail’, and ‘appear’. In this case one can say ‘test info’ or ‘cours info’ are possible composite phrases, but ‘test info’ does not have any conceptual meaning in our opinion, so we did not count that phrase. If a cluster contains less then 15 words and does not satisfy the criteria for ‘good’ cluster, it is marked as ‘bad’. A cluster is marked as ‘other’ when a leaf cluster has more than 15 words because a big leaf cluster is hard to interpret.
We categorized shape as ‘thin’, ‘medium,’ or ‘fat’. If a tree’s ABF value is 1, the tree is considered a ‘thin’ tree (marked as ‘T’ in the following tables). If the ABF value of a tree is at least 10, the tree is considered a ‘fat’ tree (marked as ‘F’). The rest are ‘medium’ trees (marked as ‘M’). We consider one more tree type: ‘conceptual’ tree (marked as ‘C’), which subsumes ‘M’ or ‘F’ type trees. A conceptual tree is one that has at least one node with more than two child clusters and more than 80% of the words in each child cluster have similar meaning. Since we prefer a tree that can represent meaningful concepts, ‘C’ type trees are the most desirable. ‘T’ type trees are degenerate (imagine each node in the hierarchy has only one child and the hierarchy resembles a list, which is usually not how concepts are organized) and hence undesirable. Based on these evaluation criteria, we analyze different correlation functions, threshold-finding methods and window sizes.
2.4. Results and Analysis In this section we analyze the results from the DHC. We first evaluate the DHC algorithm with only words as features. Then, we compare the results from DHC using only words and the combination of words and phrases as features.
2.4.1. Building UIH with Only Words as Features 184.108.40.206. Correlation Functions We compared two correlation functions: AEMI versus AEMI-SP. We fixed the threshold-finding method to Valley and the window size to ‘entire page.’ Table 5 and Table 6 illustrate the results. The letter ‘U’ stands for user, ‘# of L’ means the number of leaf nodes. ‘G %’ means ‘percentage of good’, which is calculated by dividing the number of ‘good’ leaves by the ‘# of L’. AEMI yielded significantly more meaningful leaf clusters (59% good) than AEMI-SP (41% good). The means of the two groups were significantly different from each other according to the t-test at level 0.05 (Lind et al., 2002). Both methods generated trees whose shapes were mostly ‘medium’. For U8, AEMI generated a conceptually related tree ― the tree had a node with two child clusters, which contained words from course titles and hence represented the concepts of different classes. For U2 with AEMI-SP, the generated tree was ‘fat’ and had an ABF value of 10.
220.127.116.11. Threshold-finding Method We compared two threshold-finding methods: Valley versus MaxChildren. We fixed the correlation function to AEMI and the window size to entire page. Table 5 and Table 7 illustrate the results. MaxChildren generated more meaningful leaf clusters (59% good) than Valley (47% good). However, the means of two groups were not statistically different from each other according to the t-test at level 0.05. Tree shapes are similar (medium) in both methods. However, generally, trees generated by MaxChildren were shorter, which indicates that MaxChildren reduces the number of iterations in the DHC algorithm by dividing the cluster in an early stage. Hence, MaxChildren is faster than Valley.