«Learning Implicit User Interest Hierarchy for Web Personalization by Hyoung-rae Kim A dissertation submitted to Florida Institute of Technology in ...»
Negative correlation has a negative log CPR value. P4 is that F can distinguish positive and negative correlation of A and B. Since positive correlation is much more important than negative correlation in finding phrases, we only measured the change of correlation values over positive correlation.
Tan et al. (2002) illustrated those properties and extended them to each of the existing measures to determine if the existing measure satisfies the properties required (Kamber and Shinghal, 1996). Some of these properties have been extensively investigated in the data mining literature (Hilderman and Hamilton, 2001; Tan and Kumar, 2000; Tan et al., 2002). We examined 32 correlation functions of properties and cooperated them with phrase-finding algorithms. A complete list of the correlation functions to be examined in this study is given in Appendix 1.
3.2. Experiments 3.2.1. Experimental Data and Procedures We use five New York Times articles and five web pages collected from our department server. We use web pages to test, because contents in a web page differ from the content found in normal article. The data used in this study is accessible at http://cs.fit.edu/~hkim/dissertation/dissertation.htm. The article size was about 2 pages and we asked 10 human subjects, other than the authors, to read the 10 articles. Each article contains about 1,300 words - 720 words after removing stop-words. Since we assigned 6 as a threshold of maximum phrases length for Ahonen’s and Chan’s, the total possible number of phrases for each article is approximately 3600 (=720×5). Of the 10 subjects, 4 were graduate students from a department of computer sciences and 6 were undergraduates with various majors.
We asked the 10 human subjects to choose their top 10 meaningful phrases for each article or web page. One might insist that the results will be different depending on how 10 humans are chosen. If all volunteers have the same background the matching rate will be higher than the normal case. However, since we are not comparing the algorithm with humans, but comparing among algorithms, it does not matter how we chose the 10 volunteers. Furthermore, since the algorithm finds phrases statistically that cover general human meaningfulness, we choose 10 human subjects arbitrarily.
The instruction that we gave them were:
Identify the top 10 "meaningful/important" phrases for each article.
Phrases are defined as two or more adjacent words that are meaningful, for example, "computer science," "florida institute of technology,"...
The definition of meaningful is up to you.
We will measure the number of matches between the human subjects’ selections and different correlation functions’ selections as well as different phrase-finding algorithms.
We also count average matching of humans – in this case, we divided the sum by
9. There are cases for the human or algorithm to select less than 10 phrases. In order to be fair in these cases, we use an additional adjustment function. We also attempt to prevent a measure from being scored 1 by finding one phrase and having one matched phrase by chance – the results are too sensitive. We, therefore, decided to give a lower incentive as a measure finds fewer phrases 20% at the most. For example, if there are 5 matches out of 10, the number of matching is 5×1/10. If there are 5 matches out of 9, then we assigned 5×1/9. But, if there are 5 matches out of 8, then we assigned 5×1/8.5. The generalized
, where m is the number of matched words and f is the number of selected words.
We also applied different correlation functions to Ahonen’s algorithm to see if the difference of the performance depended on the correlation functions. Ahonen used two different correlation functions: conditional probability (Confidence, F11) for filtering phrases and mutual confidence (F32) for ordering the collected phrases determining which phrase is more important than the other. Since he used fixed user-defined threshold (0.2) for filtering the phrases, we only varied the correlation function used for ordering phrases.
3.2.2. Evaluation Criteria We evaluate the meaningfulness of phrases. We believe the closer a match comes to our set of human-selected phrases, the better the phrase-finding algorithm is in terms of finding meaningful phrases. To evaluate the correlation functions for each phrase-finding algorithm, we have two evaluation criteria: the number of exact matches and the number of simple matches. We have 128 methods (4 algorithms × 32 correlation functions) – the 4 algorithms are VPF, Chan’s, Ahonen’s, and Ahonen’s with gap, and the correlation functions F1 through F32 are in Appendix A.
The number of exact matches of a method is measured by the percentage of the matches between the human’s and a method’s. We count each match with a human’s and then average the 10 compared results. This counting approach assigns more weights to the more meaningful words – more meaningful word means that they were selected by more human subjects. If a phrase is selected by several human subjects, every match is counted.
Therefore, finding more popular phrases increases the matching average. The number of matches will be very low, because only 10 phrases selected by a method and a human respectively are going to be compared.
The number of simple matches counts the matched phrases against the list collected by all human (i.e., the union of the words from the 10 human subjects). The list will be less then 100 because some phrases can overlap. Simple match is not directly related to finding more meaningful phrases, because this count removes the popularity information. We count this only to support the result of the exact match.
Comparison of the results is done using a matched-pair design (Robertson, 1981).
In this design, the top ten phrases in the ranking generated are compared. The comparison, which simply identifies if one group of ten phrases is better than the other, is on the basis of precision in other words the number of matched phrases. This type of evaluation has the following advantages: It is realistic in the sense that many information retrieval systems are interested in the top group. Traditional recall/precision tables are very sensitive to the initial rank positions and evaluate entire rankings (Crotf and Das, 1989). Another advantage is that significance measures can be readily used.
3.3. Results and Analysis Before comparing our algorithm with existing methods we need to decide whether to use pruning or not. After that we will be able to perform the comparison. In evaluating our method against related algorithms we use different scoring methods: exact match and simple match.
3.3.1. With-pruning vs. Without-pruning The VPF algorithm has a pruning function. The results differed whether we used the pruning function or not. We compared them by comparing the top 10 best methods with exact match (Sec 6.2). By composing the algorithm and 32 correlation functions (in Appendix A), we generated 32 methods. We ranked the top 10 methods using “with pruning” and presented the corresponding results of “without-pruning” next in Table 10.
The values are the average of matches across 10 human subjects and 10 articles. Most methods yielded improved results when they had been pruned. The top method VPF with F25 had improved its performance by 5.7%. With pruning is statistically significantly better than without-pruning with a 95% confidence interval (P=0.004).
3.3.2. Analysis with Exact Match Because “with-pruning” achieves a higher matching rate than “without-pruning”, we decided to use pruning in our algorithms for the rest of our experiment.
22.214.171.124. Top 10 Methods The main purpose of the analysis in this section is to choose the best method.
Which method is the best is the most interesting question. We averaged the results from 10 articles and 10 human subjects and sorted by the average to rank all 128 methods. We presented the results in Table 11 and included the rank, methods used, and the average.
Each method was composed of an algorithm and a correlation function. Notice that, we also presented the results of previous methods. Ahonen used correlation function F32. He also introduced a method with gaps. The row Ahonen_gap represented the results using Ahonen’s method allowing gaps within a phrase.
The best method was the combination of VPF and correlation functions F25 followed by F16 and F28 – all those three correlation functions satisfied PiatetskyShapiro’s three desirable properties and distinguish positive from negative correlations.
The best method VPF with F25 matched 0.93 phrases on average with the phrases selected by a human subject. In the next section we measured the average number of matching phrases between human subjects and compared those results to the results from methods.
Interestingly, VPF won the top 3. Chan’s algorithm occupied the next ranks.
Another observation was that the correlation functions F25, F16, and F28 that marked high rank with VPF also marked high rank with Chan’s. This observation implied that the performance also depends on the correlation functions.
Table 11. Ranked by average across humans and articles – Exact match
Ahonen’s algorithm ranked 24 and Ahonen_gap 105. These methods matched
0.767 and 0.452 numbers of phrases with human subjects respectively. The low performance with gap is the same phenomenon as shown in (Ahonen et al., 1998). We conducted t-Test (paired two sample for means) between VPF with F25 and Ahonen with F32. There was a clear statistically significant difference between the two methods with 95% confidence (P=0.016). Therefore, we can conclude that VPF with F25 found statistically significantly more meaningful phrases than Ahonen’s previous algorithm.
Ahonen’s algorithm with other correlation functions received higher ranks such as F6, F10, F11, F12, F17, F26, and F20 as shown in Table 11. They all ranked 13, 15, and 20, which are higher than Ahonen’s original method (24). This indicates Ahonen’s algorithm can be improved upon by using different correlation functions.
126.96.36.199. Comparing with Human Subjects To see the average number of matches among human subjects is interesting and also provides insight into interpreting the average number of matching by the algorithm.
For instance, if an algorithm matches 1 on average and the human matches 7, then the performance of the algorithm is almost negligible no matter how much higher its performance is compared to others.
We presented the best, average, and worst matching human results in Table 3. The results told us that only 1.3 phrases out of 10 picked by a human subject matched with the phrases picked by the others on average. This is not an unrealistic result. Considering that each document has approximately 1,300 words, more than 7779 possible phrase combinations exist and each person has a different background, matching 1.3 phrases out of 10 on average is a reasonable result. Our method achieved a result (0.93), which was close to the typical human result. We also conducted a t-Test with the human average and VPF with F25. The human subjects’ average was statistically significantly better than the best result obtained by the algorithm with a 95% confidence interval (P=0.02). It would be interesting to see if the worst case of human matching was higher than the algorithm’s. The answer was no. It was not statistically significantly better than the machine’s. This result indicates that human matching is better than the matching of algorithms in general but not always.
Table 13. Ranked by average across humans and articles – Simple match
3.3.3. Analysis with Simple Match This simple match count showed similar ranking to the exact match. VPF with F28 followed by F25 and F13 had the top matching rates: 3.70, 3.69, and 3.67 respectively as shown in Table 13. Since simple match uses a list of meaningful phrases by taking the union of phrases selected by the 10 human subjects, average number of matching phrases is higher than the average by exact match. Chan’s with F25 ranked 22 (3.05 matching rate), Ahonen without gap ranked 33 (2.93), and Ahonen with gap ranked 118 (1.76) out of 128.
Chan’s original method ranked 22 (3.053). These results also told us VPF found more phrases than Ahonen’s and Chan’s. The result from simple match also indicated that Ahonen’s algorithm could be improved by incorporating different correlation functions.
We also attempted to compare the methods’ results with the results from humans.
Human matched the list 5.6 out of 10 on average; the best and worst cases are 6.3 and 4.7 as shown in Table 14. The result 3.69 of method VPF with F25, which was the highest score with exact match, was quite significant considering that we only used the statistical information.
3.4. Summary We proposed a variable-length phrase-finding algorithm, which find more meaningful phrases – VPF – than older methods – Ahonen’s and Chan’s algorithms. We also coordinated these algorithms with 32 different correlation functions. They regenerate sequences recursively with the words selected in the previous stage and search for increased length of phrases in time O(nw), where nw is the number of distinct words in a page. Since our algorithm uses average as a threshold and stops when the length of phrases does not increase, no user-defined parameter is required.
In order to choose the best method, we conducted an experiment by asking 10 human subjects to select 10 phrases from 10 different articles. We compared the number of matching phrases chosen by a method to those phrases chosen by 10 human subjects. By comparing the top 10 best measures (matched-pair design (Robertson, 1981)), we observed that when we add pruning, the algorithm (VPF) had improved performance.