Learning Implicit User Interest Hierarchy for Web Personalization, by Hyoung-rae Kim. A dissertation submitted to Florida Institute of Technology in …
1989) is an extension of COBWEB for incremental clustering of continuous data sets. It stores a mean or distribution for each individual attribute in each node and uses a modified similarity/dissimilarity measure that integrates over the continuous attributes. AutoClass (Cheeseman and Stutz, 1996) allows probabilistic membership of objects in clusters, so clusters can overlap (an object can belong to multiple clusters). However, it is a flat (non-hierarchical) clustering algorithm.
7.3.4. Correlation Functions

Correlation functions are used to measure the similarity/distance between objects within a clustering function or other machine learning algorithms.
Tan et al. (2002) demonstrated that not all functions are equally good at capturing the dependencies among variables and there is no single function that is consistently better than the others in all applications. They compared various existing correlation functions based on the key properties that Piatetsky-Shapiro (1991) presented and other properties of a correlation function. We propose some additional desirable properties for correlation functions.
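The differences among such functions can be made concrete on a 2x2 contingency table of co-occurrence counts. The sketch below is illustrative only (the `measures` helper and its count arguments are our own naming); it evaluates three widely used correlation functions and shows that each takes a fixed reference value when the two events are independent.

```python
def measures(n11, n10, n01, n00):
    """n11 = count(A and B), n10 = count(A, not B), etc."""
    n = n11 + n10 + n01 + n00
    p_a = (n11 + n10) / n
    p_b = (n11 + n01) / n
    p_ab = n11 / n
    return {
        # Piatetsky-Shapiro's leverage: 0 when A and B are independent
        "leverage": p_ab - p_a * p_b,
        # Lift (interest): 1 when A and B are independent
        "lift": p_ab / (p_a * p_b),
        # Odds ratio (cross-product ratio): 1 when independent
        "odds": (n11 * n00) / (n10 * n01),
    }

# Independent events, P(A,B) = P(A) * P(B):
print(measures(25, 25, 25, 25))   # leverage 0.0, lift 1.0, odds 1.0
```

Note that lift and the odds ratio return 1 rather than 0 at independence, a distinction that matters for the desirable-property analysis discussed below.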
Huang et al. (2002) applied a number of correlation functions in their application to identify interesting rules and sequential patterns from the Livelink log files. They presented a comparison of these measures based on feedback from domain experts.
Some of the interestingness measures were found to be better than others. Their experiment also supported the conclusion that no single function is consistently better than the others. However, they did not analyze the desirable properties of correlation functions.
Piatetsky-Shapiro (1991) opened a new research area in information retrieval by presenting three desirable properties of a correlation function. However, his properties depend on the cross-product ratio (CPR) (Rosenfeld, 1994), in that all correlation functions that satisfy the CPR also satisfy Piatetsky-Shapiro's properties. We propose two other desirable properties that are independent of the CPR.
Jaroszewicz and Simovici (2004) presented a method of computing the interestingness of item sets and attribute sets with respect to background knowledge encoded as a Bayesian network. Their algorithm found interesting, unexpected patterns. Their work could be extended to calculate other interestingness measures, such as correlation or support between variables. Their work used the interestingness of dependency but did not focus on
system. This consisted of five areas. First, to create a context for personalization, we proposed to establish a user interest hierarchy (UIH) that can represent a continuum of general to specific interests from a set of web pages interesting to a user. This approach was non-intrusive and allowed web pages to be associated with multiple clusters/topics.
We evaluated the learned UIH based on data obtained from 13 users on our web server.
Second, a variable-length phrase-finding algorithm found meaningful phrases. In order to choose the best method, we compared the phrases chosen by each method against those chosen by 10 human subjects. A matched-pair design (Robertson, 1981), comparing the top 10 best measures, was used for evaluation. Third, we proposed two new desirable properties for correlation functions. To compare our two new desirable properties with the previous three, we collected 32 correlation functions and examined which correlation functions satisfied which desirable properties.
Fourth, we devised a new method of ranking web search results to serve each individual user's interests using the UIH. We evaluated methods based on how many interesting or potentially interesting web pages each algorithm found within a certain number of top links (Bharat and Mihaila, 2001). Traditional precision/recall graphs (van Rijsbergen,
1979) were also used for evaluation. We also counted which search terms showed higher performance with our weighted scoring method (WS) than with Google. Fifth, we identified several implicit indicators that can be used to determine a user's interest in a web page. We evaluated both previously studied implicit indicators and several new ones. The indicators examined were complete duration, active window duration, look-at-it duration, distance of mouse movement, number of mouse clicks, distance of scrollbar movement, number of scrollbar clicks, number of key up/down events, and size of highlighted text. The data consisted of 11 users' implicit indicator data and a 1-5 interest rating for each page. During our experiment, volunteers were encouraged to behave normally. Two evaluation criteria were used: (1) how accurately an indicator can predict users' interests, and (2) for how many users an indicator can predict interest.
8.1. Summary of Contributions The following is a summary of our contributions.
In Chapter 2, Learning Implicit User Interest Hierarchy for Context in Personalization, we represented a user interest hierarchy (UIH) at different abstraction levels (general to specific), which could be learned implicitly from the contents (words/phrases) of a set of web pages bookmarked by a user. The higher-level interests were more general, while the lower-level ones were more specific. To build a UIH, we devised a divisive graph-based hierarchical clustering algorithm (DHC), which constructed a UIH by grouping words (topics) into a hierarchy instead of the flat clusters used by STC (Zamir and Etzioni, 1998). Between the root and the leaves, "internal" tree nodes represented different levels of generality and duration of interest. Toward the root of a UIH, more general (passive) interests were represented by larger clusters of words, while toward the leaves, more specific (active) interests were represented by smaller clusters of words. DHC automatically found the threshold for clusters of terms (words and phrases), whereas STC required the threshold to be specified. Furthermore, we used a more sophisticated correlation function, AEMI, rather than STC's conditional probability. We also observed that DHC with the AEMI correlation function and the MaxChildren threshold-finding method produced a more meaningful UIH, with 59% of clusters judged meaningful, than the other combinations. We added phrases to the words as features. Our experimental results indicated that 64% of the UIHs were interpretable by humans.
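The clustering idea can be sketched roughly as follows. This is not the dissertation's implementation: the toy correlation function, the mean edge weight standing in for the MaxChildren threshold-finding method, and all helper names are our own simplifications. Each recursion treats terms as graph nodes, keeps only edges whose correlation exceeds the threshold, and recurses on the resulting connected components, yielding a hierarchy from general (large) to specific (small) clusters.

```python
from itertools import combinations

def components(nodes, edges):
    """Connected components of an undirected graph given as an edge list."""
    remaining, comps = set(nodes), []
    while remaining:
        stack, comp = [remaining.pop()], set()
        while stack:
            v = stack.pop()
            comp.add(v)
            for a, b in edges:
                for u, w in ((a, b), (b, a)):
                    if u == v and w in remaining:
                        remaining.discard(w)
                        stack.append(w)
        comps.append(comp)
    return sorted(comps, key=min)            # deterministic ordering

def dhc(words, score):
    """Recursively split `words` into a hierarchy using pairwise `score`."""
    pairs = list(combinations(sorted(words), 2))
    if not pairs:
        return sorted(words)
    weights = {p: score(*p) for p in pairs}
    threshold = sum(weights.values()) / len(weights)   # mean as threshold
    strong = [p for p, w in weights.items() if w > threshold]
    comps = components(words, strong)
    if len(comps) <= 1:                      # cannot split further: leaf
        return sorted(words)
    return [dhc(c, score) for c in comps]

# Toy correlation: "a"/"b" co-occur strongly, as do "c"/"d".
corr = lambda x, y: 1.0 if {x, y} in ({"a", "b"}, {"c", "d"}) else 0.1
print(dhc({"a", "b", "c", "d"}, corr))       # [[['a'], ['b']], [['c'], ['d']]]
```

The nested lists mirror the UIH shape: the root holds all terms, children hold increasingly specific clusters.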
In Chapter 3, Identifying Variable-Length Meaningful Phrases with Correlation Functions, we proposed a variable-length phrase-finding algorithm (VPF), which found more meaningful phrases than older methods, namely Ahonen's and Chan's algorithms. VPF regenerates sequences recursively from the words selected in the previous stage and searches for phrases of increasing length in O(nw) time, where nw is the page size. The algorithm needs no user-specified parameters, since it uses the average as a threshold and stops when the length of the phrases no longer increases. It achieved further improved performance by pruning less meaningful phrases: with pruning, VPF with F25 was statistically significantly better than without pruning at the 95% confidence level (P = 0.004). VPF with F25 found more meaningful phrases than previous methods, and the improvement over Ahonen's original method is statistically significant. We suspect that the filtering stage of Ahonen's algorithm filtered out many meaningful phrases, or that their weighting scheme based on phrase length and tightness (Ahonen et al., 1998) distorted the correlation value of a phrase.
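A much-simplified sketch of the phrase-growing idea follows. It is not the chapter's exact VPF: the pair score is an ad-hoc stand-in for the correlation functions compared in the chapter, and `find_phrases` and its parameters are our own names. It does reflect the two parameter-free choices described above: the average score serves as the merge threshold, and the loop stops when no phrase grows longer.

```python
from collections import Counter

def find_phrases(tokens, max_rounds=5):
    """Grow variable-length phrases by merging correlated adjacent units."""
    units = list(tokens)
    phrases = set()
    for _ in range(max_rounds):
        pairs = Counter(zip(units, units[1:]))
        if not pairs:
            break
        unit_counts = Counter(units)
        # ad-hoc symmetric co-occurrence score (stand-in for a real
        # correlation function)
        def score(pair):
            a, b = pair
            return pairs[pair] / max(unit_counts[a], unit_counts[b])
        avg = sum(score(p) for p in pairs) / len(pairs)
        keep = {p for p in pairs if score(p) > avg}     # average as threshold
        if not keep:
            break                                       # phrases stopped growing
        merged, i = [], 0
        while i < len(units):
            if i + 1 < len(units) and (units[i], units[i + 1]) in keep:
                merged.append(units[i] + " " + units[i + 1])
                phrases.add(merged[-1])
                i += 2
            else:
                merged.append(units[i])
                i += 1
        units = merged
    return phrases

print(find_phrases("new york city is big new york is old".split()))
```

On this toy input only the repeated pair "new york" beats the average score and is merged; a second round finds no above-average pair, so the loop stops.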
In Chapter 4, Analysis of Desirable Properties of Correlation Functions between Two Events, we identified two new desirable properties for correlation functions in general:
• if P(A) and P(B) increase while P(A,B) remains unchanged, then F monotonically decreases;
• if P(A) (or P(B)) and P(A,B) increase at the same ratio, then F monotonically increases.
In addition, we revised and improved the first desirable property proposed by Piatetsky-Shapiro (1991) to make it more accurately descriptive. Some functions (e.g., Odds, Conv, Inte, and Coll in Appendix 1) produced a score of 1, rather than 0, even when two events were statistically independent:
• F can distinguish statistical independence of A and B.
Our empirical results indicated that our two new desirable properties were more descriptive than the three desirable properties in Piatetsky-Shapiro (1991). This is because all correlation functions that satisfied P1 also satisfied P2 and P3, and all correlation functions that satisfied P6 also satisfied P1-P3; our properties, however, were independent of P1 and P6. We tested our two new desirable properties on many different correlation functions and summarized the results with respect to each property. The summarized results covered P1-P5, P6 (which measured negative/positive correlation), and P7 (which checked whether a correlation function returns a normalized value). The associated table will help users compare the characteristics of correlation functions.
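As a numeric illustration of this kind of property analysis (a spot-check only; the chapter's comparison of the 32 functions is analytical), the snippet below evaluates two of the simplest functions at statistical independence, and then under growing marginals with the joint probability held fixed:

```python
def leverage(p_a, p_b, p_ab):
    """Piatetsky-Shapiro's function: P(A,B) - P(A)P(B)."""
    return p_ab - p_a * p_b

def lift(p_a, p_b, p_ab):
    """Lift (interest): P(A,B) / (P(A)P(B))."""
    return p_ab / (p_a * p_b)

# At statistical independence (P(A,B) = P(A)P(B)), each function returns
# a fixed reference value, so both can distinguish independence even
# though only leverage returns 0 there.
print(leverage(0.4, 0.5, 0.2), lift(0.4, 0.5, 0.2))   # 0.0 1.0

# With P(A,B) held fixed, growing marginals drive the score down.
print(lift(0.4, 0.5, 0.1) > lift(0.6, 0.7, 0.1))      # True
```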
In Chapter 5, Personalized Ranking of Search Results with Implicitly Learned User Interest Hierarchies, we compared four ranking methods: Google, Random, US, and WS.
We used two data sets: interesting web pages that are relevant to the user's search term, and potentially interesting web pages that could become relevant in the future. We introduced two personalized ranking methods (WS and US) that utilize an implicitly learned user profile (UIH). We built different UIHs for users depending on their interests. We identified four characteristics of terms that match the user profile and provided a probabilistic measure for each. The four characteristics were the level of the node to which a term belongs (D), the length of a term (L), the frequency of a term (F), and the emphasis of a term (E).
D and L were calculated while building a UIH from the web pages in a user's bookmarks.
Different web pages had different values for the F and E characteristics. Our experimental results indicate that the WS method achieved higher precision than Google for the Top 10, 15, and 20 web pages relevant to the user's search query. On interesting web pages, the Top-link analysis showed that WS achieved at least 13% higher precision than Google for the Top 10, 15, and 20 links on average. WS also generally outperformed US and Random. On potentially interesting web pages, WS achieved the highest performance for all five Top-link cutoffs (Top 1, 5, 10, 15, and 20) as well. When incorporating the (public) ranking from the search engine, we found that equal weights for the public and personalized rankings can yield higher precision. A weight of 0.5 for the personal ranking showed the highest performance on both data sets.
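The weighting experiment can be sketched as a simple blend of a normalized public rank with a personal score. The scoring function here is a hypothetical stand-in for the WS term-characteristic scores (D, L, F, E), and `blended_rank` and `alpha` are our own names, with `alpha=0.5` matching the best-performing weight above.

```python
def blended_rank(pages, personal_score, alpha=0.5):
    """Re-rank `pages` (ordered by the public search-engine ranking)."""
    n = len(pages)
    def combined(item):
        i, page = item
        public = (n - i) / n                 # top public hit scores highest
        return alpha * personal_score(page) + (1 - alpha) * public
    return [p for _, p in sorted(enumerate(pages), key=combined, reverse=True)]

# A page the profile scores highly ("b") is promoted past the public order.
scores = {"a": 0.1, "b": 0.9, "c": 0.2}
print(blended_rank(["a", "b", "c"], scores.get))   # ['b', 'a', 'c']
```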
The precision/recall analysis visualizes the performance of each method in graphs.
WS and US are closer to the upper-right corner than Google, except at recall values lower than .15 (after the Top 5). In general, WS outperforms US and Random for interesting web pages. The results from the precision/recall graph for potentially interesting web pages are similar: WS was closer to the upper-right corner than Google, US, and Random overall.
These results indicate that WS can provide more accurate ranking than Google on average.
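The Top-link and precision/recall measures themselves are straightforward; a minimal sketch (helper names are our own):

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the first k ranked pages judged relevant (Top-k precision)."""
    return sum(1 for p in ranked[:k] if p in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant pages found within the first k."""
    return sum(1 for p in ranked[:k] if p in relevant) / len(relevant)

ranked, relevant = ["a", "b", "c", "d"], {"a", "c"}
print(precision_at_k(ranked, relevant, 2),   # 0.5: "a" relevant, "b" not
      recall_at_k(ranked, relevant, 4))      # 1.0: both relevant pages found
```

Sweeping k over the ranking yields the points of a precision/recall curve like those compared in the chapter.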
In Chapter 6, Implicit Indicators for Interesting Web Pages, our experiments indicate that complete duration, active window duration, look-at-it duration, and distance of mouse movement are reliable indicators for more users than the other indicators: 8 users out of 11. Over the data set of "all visits," these indicators also predicted the interests of the most users: 7 out of 11. The distance of mouse movement was as accurate as the duration-based indicators, and it may be the most practical indicator since it is simple to detect and is more robust than active window duration to a user's absence: if a user leaves a web page open and leaves the room, the distance of mouse movement is not affected. For the bookmark, save, print, and memo indicators, more than 95% of the pages were correctly scored as "interested." When we divided the data set into pages rated below "interested" and pages rated at or above "interested," 95% of bookmarked web pages, 98% of saved web pages, 100% of printed web pages, and 98% of memoed web pages belonged to the "interested"-or-above group.
Our results also indicate that there was no indicator that was valid for all users. Depending on the user, an indicator may or may not be valid.
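The per-user evaluation can be sketched as thresholding a single indicator's value and comparing the prediction against the user's explicit rating. The rating cutoff of 4 on the 1-5 scale, the threshold, and all names here are our own assumptions for illustration.

```python
def indicator_accuracy(visits, threshold, cutoff=4):
    """visits: (indicator_value, rating) pairs for one user.
    Accuracy of predicting rating >= cutoff by indicator_value >= threshold."""
    correct = sum(1 for value, rating in visits
                  if (value >= threshold) == (rating >= cutoff))
    return correct / len(visits)

# e.g. mouse-movement distance per visit, with the user's 1-5 ratings
visits = [(120, 5), (10, 1), (200, 4), (15, 2)]
print(indicator_accuracy(visits, threshold=100))   # 1.0
```

Running this per user makes the finding above concrete: the same indicator and threshold can score well for one user's visits and poorly for another's.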
8.2. Ethical Issues in User Modeling

8.2.1. Privacy

An individual's right to privacy has always been an issue in user modeling, because the consequences for victims of privacy intrusions can be serious. Although the Internet is widely used nowadays, many users remain unfamiliar with and skeptical about the level of security and privacy on the Internet. Great numbers of users are uncomfortable with "user profiling," a practice in which users' online movements are recorded. Much of this anxiety is caused by the fact that users have no clear understanding of the rules that govern this practice, how extensive it is, what is recorded, or how the information is used (Chiu, 2000). If the process of user profiling is explained, user modeling will be more readily accepted.
This personalization method should be offered to the user as a web-browser option, with a full description. Before building a user profile, the browser must obtain the user's permission to build the profile from their bookmarks or from web pages collected by implicit interest indicators; it is a permission-based personalization tool. Furthermore, the profile can be stored on the client. If both the web log of the user's behavior and the user profile remain on the user's computer, users will feel their privacy is protected. The browser should give the user complete control over the profile.