«Learning Implicit User Interest Hierarchy for Web Personalization by Hyoung-rae Kim A dissertation submitted to Florida Institute of Technology in ...»
Wu and Gunopulos, 2002), or finding more descriptive or important/meaningful phrases (Ahonen et al., 1998; Chan, 1999). We attempt to find more meaningful phrases in a document. The definition of meaningful is unique to each individual. So, we define a phrase as more meaningful if it is meaningful to the most people.
Building UIH, finding phrases, and devising page-scoring functions (in Chapter 5) can use correlation functions. The analysis of relationships among variables/events is a fundamental task for many data mining problems. There are two types of properties for correlation functions: other properties and desired properties of functions. Other properties such as inversion invariance are depending on each application domain (Tan et al., 2002).
One has to examine which properties of other properties are more suitable for his/her application domain (Tan and Kumar, 2000). However, the desirable properties can be applied to all correlation functions. Before selecting or devising a correlation function, it is important to check whether each measure satisfies basic desirable properties of a function (Piatetsky-Shapiro, 1991; Tan et al., 2002). We propose two new desirable properties.
Web personalization adapts the information or services provided by a web site to
the needs of a user. There are two main techniques for performing web personalization:
collaborative filtering (Eirinaki et al., 2004; Kim et al., 2004; Cadez et al., 2000) and using user profiles (Albanese et al., 2004; Mobasher et al., 1999). Collaborative filtering uses information from many different users to make recommendations. Disadvantages of this method are that it cannot predict a new page and it requires a large amount of data from many users to determine what pages are the most popular. Obtaining data on the web pages visited by many different users is often difficult (or illegal) to collect in many application domains. In contrast, user profiles require the web page history of only a single user.
Google (2005) provides personalized services based on explicitly learned user profiles, which consumes users’ time and effort. We attempt to personalize the search results using the user profile that is learned implicitly and uses contents of web pages instead of user’s behavior (Kim et al., 2004; Albanese et al., 2004).
A UIH is learned from a set of interesting web pages to a user. Determining a user’s interesting web pages can be performed explicitly by asking the user, or implicitly by observing the user’s behaviour. Implicit indicators are usually less accurate than explicit indicators (Watson et al., 1998). However, implicit indicators do not require any extra time or effort from the user and can adapt to changes in the user’s interests over time. To implicitly measure user interest we need to identify reliable implicit indicators. One of the major user interest indicators identified by researchers is duration, or the time spent on a web page (Granka et al., 2004; Jung, 2001; Claypool et al., 2001; Resnick et al., 1994;
Liberman, 1995; Kim et al., 2001; Oard et al., 1998). However, some researches indicate that duration may not be an accurate measure of user interest (Jung, 2001). We suspect that this is because the duration indicator often does not account for the user’s absence. We examine the duration in more detail and attempt to propose new other indicators.
1.2. Problem Statement Our problem consists of three parts as shown in Figure 1: identifying interesting web pages, learning user profile, and personalizing search results. The part of learning user profile can be divided into three sub problems: profile-learning, phrase-finding, and desirable properties of a correlation function. These five main problems for the framework of our web personalization are marked as thick boxes.
The first problem in our system is learning implicit user interest hierarchy for context in personalization. The inputs are the terms (words and phrases) in a set of interesting web pages or bookmarks. The bookmarked web pages are used when it is difficult to collect a set of interesting web pages to a user. The output is a learned profile called a user interest hierarchy (UIH). We want to devise a method that can learn the UIH from bookmarks implicitly. The second problem is identifying variable-length meaningful phrases with correlation functions from a web page. The inputs are the words in a document. The outputs are meaningful phrases with various lengths. The definition of meaningful is unique to each individual. We let each individual define his or her own definition of meaningful. So, we define a phrase as more meaningful if it is meaningful to the most people. The third problem is about the analysis of desirable properties of correlation functions between two events. These properties of correlation functions help select a right correlation function in learning a user profile, finding meaningful phrases, and scoring interesting web pages. The problem is to find properties that can describe the characteristics of correlation functions. The fourth problem is the personalized ranking of search results with implicitly learned user interest hierarchies. The inputs are a profile (user interest hierarchy) and search results from Google (2005). The outputs will be the reordered search results depending on a user’s interests. The goal is to assign higher scores to web pages that a user finds more interesting. The last problem is to find implicit indicators for interesting web pages. The inputs are web log-files and interest scores of web pages provided by a user. The log-files record users’ behavior while they are reading web pages. The goal is to find an implicit indicator that could predict the interest score of a web page.
Figure 1. Five research parts in the framework of web personalization
1.3. Approach The first main objective of this research is to build UIH’s that capture general to specific interests without the user’s involvement (implicitly). The most common and obvious solution for building a UIH is for the user to specify interests explicitly. However, the explicit approach includes these disadvantages: time and effort in specifying interests, and user interest may change over time. Alternatively, an implicit approach can identify a user’s interests by inference. Leaf nodes of the UIH generated by our algorithm represent a list of specific user interests. Internal nodes (towards root node) represent more general interests. For example, a graduate student in computer science is looking for a research paper in web personalization. The specific interest is web personalization, but the general interest is computer science. The web pages the student is interested could all be related to computer science and hence words and phrases from these pages would appear in the root node of the UIH. Some of the pages he is interested in could be related to web personalization, and the words (e.g., profile, user, and personalization) might be at the leaf of the UIH. Between the root and the leaves, “internal” tree nodes represent different levels of generality of interest. We devise a divisive hierarchical clustering (DHC) algorithm that constructs such a hierarchy and supports overlapping clusters of the original objects (web pages in our case). Note that clustering web pages is not one of our objectives; the overlapping property allows us to associate a web page to (potentially) multiple topics. We believe our approach has significant benefits and possesses interesting challenges.
Using phrases, in addition to words, we can improve the UIH. A composed term by two or more single words (called “phrase”) usually has more specific meaning and can disambiguate related words. For instance, “apple” has different meanings in “apple tree” and in “apple computer”. Therefore, we propose a variable-length phrase finding algorithm (VPF), which finds more meaningful variable-length phrases. VPF is designed to remove the maximum length of phrases in Ahonen’s algorithm (Ahonen et al., 1998) and fix the problem in Chan’s algorithm (Chan, 1999). VPF increases the length of phrases by adding a word to the phrases made in the previous stage one by one, maintaining high correlation within a phrase. VPF also applies pruning to the phrases collected in order to remove less meaningful phrases.
Both divisive hierarchical clustering (DHC) algorithm and variable-length phrase finding (VPF) algorithm use correlation functions. It is important to understand the characteristics of a correlation function in selecting a correlation function and devising a new correlation function. We analyze the previous desirable properties and propose two new desirable properties. All possible cases that could be made by two events are enumerated and examined. Out of many possible cases, we illustrate that a measure should show a consistent pattern in two cases that are new desirable properties of a correlation function. We believe that these properties will help understand the general characteristics of a measure.
The UIH can be used to build a method for personalizing web search results that is able to reorder the results from Google (2005), such that web pages that a user is most interested in appear at the top of the page. We wish to devise a page scoring function with a user’s implicitly learned User Interest Hierarchy (UIH). Web pages are ranked based on their “score,” where higher scores are considered to be more interesting to the user after comparing the text of the page to the user’s profile. For example, if a user searches for “Australia” and is interested in “travel,” then links related to “travel” will be scored higher;
but if a user is interested in “universities,” then pages related to “universities” will be scored higher.
Implicit indicators for interesting web pages help build a UIH implicitly. We compare previous implicit indicators and propose other new implicit indicators. A user’s duration on a web page is divided into three types depending on if the browser is open (complete duration), if the browser is the active application (active window duration), and if the user is looking at the screen (look at it duration). We also study new implicit indicators (memo) that have not been evaluated in previous research. We divide the web pages visited during our evaluation into two groups: (1) web pages that a user visited more than once and viewed for the longest duration, and (2) all web pages that were visited more than once.
1.4. Key Contributions
The main contributions are:
• we represent user interest hierarchy (UIH) at different abstraction levels (general to specific), which can be learned implicitly from the contents (words/phrases) in a set of interesting pages to a user;
• we build a divisive graph-based hierarchical clustering algorithm (DHC), which constructs a UIH by grouping words (topics) into a hierarchy instead of flat cluster used by STC;
• we propose a variable-length phrase-finding algorithm (VPF) that is designed for finding meaningful phrases – the time complexity remains as O(nw) where nw is the number of distinct words in a document;
• more meaningful phrases than previous methods are found and the improvement in performance is statistically significant.
• we identify 2 new desirable properties for a correlation function in general;
• our results indicate that these 2 new desirable properties are more descriptive than the previous 3 desirable properties provided in Piatetsky-Shapiro (1991), because the previous ones highly depend on cross product ratio (CPR);
• We introduce personalized ranking methods that utilize an implicitly learned user profile (UIH);
• Our experimental results indicate that Weighted Scoring method can achieve higher precision than Google for some top links with interesting and potentially interesting web pages to the user;
• Our experiments indicate that complete duration, active window duration, look at it duration, and distance of mouse movement are reliable indicators for more users than other indicators;
1.5. Dissertation Organization The rest of this dissertation is organized as follows. In Chapter 2 through 6, we introduce 5 problems of our web personalization architecture. These five parts are learning implicit user interest hierarchy for context in personalization (Kim and Chan, 2003), identifying variable-length meaningful phrases with correlation functions (Kim and Chan, 2004), analysis of desirable properties of correlation functions between two events, personalized ranking of search results with implicitly learned user interest hierarchies, and implicit indicators for interesting web pages. In Chapter 2, we introduce user interest hierarchies (UIH’s) and detail our approach towards building an implicit UIH’s. Our empirical evaluation regarding the meaningfulness of a UIH is discussed. In Chapter 3, we provide the detailed description of our variable-length phrase-finding algorithm (VPF).
Every iteration a word is added to the phrases generated in a previous stage when the correlation value between them is higher than a threshold. This algorithm does not request any user-defined parameters. The performance of the algorithm will be evaluated by experimental result. In Chapter 4, two new desirable properties of correlation functions are proposed. The experimental comparisons with previous three desirable properties are discussed. In addition to proposing two new properties, we summarize the characteristics of 32 correlation functions based on those properties. In Chapter 5, we incorporate UIHs to the personalization of web search results. Scoring functions are introduced that calculates the public and personal page score of a web page in the web search results. The function uses four characteristics for a term: the level of a node where a term belongs to (D), the length of a term (L), the frequency of a term (F), and the html formatting used for the emphasis of a term (E). The results will be evaluated empirically. In Chapter 6, we learn implicit indicators for interesting web pages to a user and propose more predictable implicit indicators. Some indicators are used frequently and some less frequently. We test both types of implicit indicators empirically. In Chapter 7, we review the areas of web information retrieval, user modeling, and machine learning. In Chapter 8, we summarize our contributions and discuss limitations and future work.