Learning Implicit User Interest Hierarchy for Web Personalization, by Hyoung-rae Kim. A dissertation submitted to Florida Institute of Technology …
Yan et al. (1996) present a system that facilitates the analysis of past user access patterns to discover common user access behavior (for example, navigation through the men's clothing department, consumer electronics, or travel). They perform clustering over a web site's logs. Once this information is analyzed, it is used to improve the static hypertext structure or to dynamically insert links into web pages. They model visitors as vectors in URL-space (an n-dimensional space with a separate dimension for each page at a site) and cluster them using the "leader" algorithm (Hartigan, 1975).
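The leader algorithm is a simple one-pass clustering method: each vector joins the first cluster whose leader lies within a distance threshold, or otherwise starts a new cluster. A minimal sketch in URL-space (the toy visit counts below are hypothetical):

```python
import math

def leader_cluster(vectors, threshold):
    """One-pass 'leader' clustering (Hartigan, 1975).

    A vector joins the first cluster whose leader is within `threshold`
    (Euclidean distance); otherwise it becomes the leader of a new cluster.
    """
    leaders, clusters = [], []
    for v in vectors:
        for i, leader in enumerate(leaders):
            if math.dist(leader, v) <= threshold:
                clusters[i].append(v)
                break
        else:
            leaders.append(v)
            clusters.append([v])
    return clusters

# Toy URL-space with 3 pages; each vector holds one user's visit counts.
users = [(5, 0, 0), (4, 1, 0), (0, 0, 6), (0, 1, 5)]
groups = leader_cluster(users, threshold=2.0)
print(len(groups))  # → 2
```

Because it makes a single pass and keeps only one representative per cluster, the algorithm scales to large web logs, at the cost of sensitivity to the order in which visitors are processed.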
7.1.5. Assisting Personal Information

Methods for assisting with personal information help users organize their own information and improve web usability. The assistant usually resides on a user's personal computer. These techniques do not use other people's information and are not related to predicting navigation.
PowerBookmarks (Li et al., 1999) is a web information organization, sharing, and management tool, which monitors and utilizes users’ access patterns to provide useful personalized services. PowerBookmarks provides automated URL bookmarking, document refreshing, bookmark expiration, and subscription services for new or updated documents.
BookmarkOrganizer (Maarek and Ben-Shaul, 1996) is an automated system that maintains a hierarchical organization of a user's bookmarks using the classical HAC algorithm (Voorhees, 1986), combined with a "slicing" technique (slicing the tree at regular intervals and collapsing all levels between two slices into a single level). Both BookmarkOrganizer and PowerBookmarks reduce the effort required to maintain bookmarks, but they are insensitive to the context browsed by users and do not have reordering functions.
7.1.6. Implicit Detection of User's Characteristics

Detecting a user's interest in a web page can happen on either the client or the server. Obtaining labeled training instances is necessary for agents to learn a user's interest; how the learning algorithm obtains these training examples is therefore an important issue.
Jung (2001) developed Kixbrowser, a custom web browser that recorded users' explicit ratings for web pages along with their actions during browsing: mouse clicks, highlighting, key input, window size, copying, rollover, mouse movement, adding to bookmarks, select-all, viewing page source, printing, going forward, stopping, duration, number of visits (frequency), and recency. He developed individual linear and nonlinear regression models to predict the explicit rating.
His results indicate that the number of mouse clicks is the most accurate indicator for predicting a user’s interest level.
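Jung's approach of regressing explicit ratings on implicit indicators can be sketched with ordinary least squares. The indicator names and data below are hypothetical illustrations, not his actual features:

```python
# Fit a linear model that predicts a user's explicit page rating (1-5)
# from implicit indicators, in the spirit of Jung (2001).
import numpy as np

# columns: [mouse_clicks, scroll_events, duration_sec]
X = np.array([[12, 3, 40], [2, 1, 5], [8, 6, 60],
              [1, 0, 3], [15, 4, 55]], dtype=float)
y = np.array([4.0, 1.0, 5.0, 1.0, 5.0])   # explicit ratings

# Ordinary least squares with an intercept column appended.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

pred = A @ coef
print(np.round(pred, 1))
```

The learned coefficients indicate how strongly each implicit indicator contributes to predicted interest; comparing coefficient magnitudes (on standardized features) is one way to identify the most predictive indicator.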
CuriousBrowser (Claypool et al., 2001) is a web browser that recorded the actions (implicit ratings) and explicit ratings of users. This browser was used to record mouse clicks, mouse movement, scrolling, and elapsed time. The results indicate that the time spent on a page, the amount of scrolling on a page, and the combination of time and scrolling have a strong correlation with explicit interest.
The two experiments above show some inconsistency. Jung (2001) found the mouse click to be a good indicator, but Claypool et al. (2001) did not. Conversely, Jung (2001) found that duration and scrollbar movement are not very predictive of a user's interest, whereas Claypool et al. (2001) reported them as good indicators.
Powerize (Kim et al., 2001) is a content-based information filtering and retrieval system that uses an explicit user interest model. The authors also reported a way to implement an implicit feedback technique for user modeling in Powerize, and found that observing the printing of web pages along with reading time could increase the prediction rate for detecting relevant documents.
Goecks and Shavlik (2000) proposed an approach for an intelligent web browser that is able to learn a user’s interest without the need for explicitly rating pages. They measured mouse movement and scrolling activity in addition to user browsing activity (e.g., navigation history). We extend these existing implicit interest indicators in this research.
Granka et al. (2004) used eye-tracking to determine how displayed web pages are actually viewed. Their experimental environment was restricted to search result pages.
We examine the duration implicit indicator in more detail. We divide duration into three types: complete duration, active window duration, and "look-at-it" duration. Our complete duration differs from the duration in Jung's (2001) work: his duration includes the downloading time of a web page, while ours does not. We divided the web pages visited during our evaluation into two groups: (1) web pages that a user visited more than once and viewed for the longest duration, and (2) all web pages that were visited more than once; Jung (2001) used only the second data set. In our experiment, we let a user navigate to any web page and perform normal tasks, such as using chat programs or word processors, during the experiment. Another difference is that we use head orientation instead of eye-tracking (Granka et al., 2004). Our approach is also valuable because there are cases where an application does not have devices for tracking a user's eyes.
7.2. User Modeling

This section lists adaptive systems that use user modeling. The primary goal of user modeling is to enable the prediction of a user's actions on a personalized web site, and thus to help determine which adaptations are useful for the user navigating the web. The forms of user models are as varied as the purposes for which they are formed, as shown in Figure 35. User models mainly try to describe (Webb et al., 2001): the cognitive processes behind a user's actions, the difference between the user's skill and expert skill, the user's behavioral patterns or preferences, and the user's characteristics. Another important dimension is whether models are based on individual users or communities of users (Webb et al., 2001). Whereas much of the academic research is concerned with modeling individual users, many applications in electronic commerce (Ungar and Foster, 1998) are concerned with forming generic models of user communities.
User modeling poses a number of challenges for machine learning, including computational complexity, concept drift, the need for labeled data, and the need for large data sets. User modeling is a very dynamic modeling task: attributes change over time. This change in the target concept is known as "concept drift" (Widmer and Kubat, 1996), and a learner must be able to adjust to it quickly. Webb et al. (2001) examined each of these issues and reviewed approaches to them. These techniques can reside on either the client or the server side. Generally they do not use other people's information. They can support prefetching and recommend unvisited pages.
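A common way to cope with concept drift is to learn from a sliding window of recent examples, so that old preferences fade out. A toy sketch (the window size, labels, and data are illustrative, not from any cited system):

```python
# Concept drift handled by windowing: the model reflects only the most
# recent observations, in the spirit of Widmer and Kubat (1996).
from collections import deque, Counter

class WindowedMajority:
    """Predicts the majority label among the last `size` examples."""
    def __init__(self, size=5):
        self.window = deque(maxlen=size)

    def update(self, label):
        self.window.append(label)   # old examples fall off the left end

    def predict(self):
        return Counter(self.window).most_common(1)[0][0]

model = WindowedMajority(size=5)
for label in ["sports"] * 6 + ["finance"] * 4:   # the user's interest drifts
    model.update(label)
print(model.predict())  # → finance (window holds 1 sports, 4 finance)
```

Larger windows give more stable models but adapt more slowly; choosing the window size is the central trade-off in this family of techniques.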
7.2.1. Adaptive Hypermedia

Adaptive hypermedia focuses on improving web (hypermedia) interactions by modeling users and adapting the experience. The difference from adaptive web sites lies in the application domain: hypermedia is related to help systems (adapting to the particular context of the help request), information retrieval (helping users find as much relevant content as possible), and online information systems (helping users find high-quality content quickly). Brusilovsky (2001) provides an overview that introduces this field to newcomers. Previous empirical studies have shown that adaptive navigation support can improve the speed of navigation (Kaplan et al., 1993) and learning (Brusilovsky and Pesin, 1998). Adaptive presentation can also affect the understanding of content (Boyle and Encarnacion, 1994).
Weber and Specht (1997) demonstrated that user modeling techniques such as simple overlay models or more elaborate episodic learner models are effective for adaptive guidance and individualized help in web-based learning systems. Their system combines an overlay model (providing default paths and shortcut paths) with an episodic user model (storing knowledge about the learner as a collection of episodes, which can be viewed as cases). The system also supports adaptive navigation, individualized diagnosis, and help on problem-solving tasks.
7.2.2. Human Behavior Based User Model

Human behavior based user models are not good at predicting unvisited web pages, because this approach relies on models built from user actions such as paths, clicks, downloads, and the frequency of visits to a web page.
Mobasher et al. (1999) proposed an approach to usage-based web personalization that takes into account both the offline tasks related to the mining of usage data and the online process of automatic web page customization. Their technique captures common user profiles based on association-rule discovery and usage-based clustering.
Figure 35. Diagram of user modeling

Letizia (Lieberman, 1995) is a user interface agent (client-side) that operates with a conventional web browser.
The agent tracks the user’s browsing behavior (e.g., following links, initiating searches, and requests for help) and tries to estimate the user’s interest in as-yet-unseen pages. Letizia can recommend nearby pages by doing lookahead search. Letizia cannot take advantage of the past experiences of other visitors to the same site, since it runs on a client.
Pazzani and Billsus (1997, 1999a, 1999b) state that a web site should be augmented with an intelligent agent to help visitors navigate the site, and should learn from the visitors to the web site. An agent can learn common access patterns of the site both by analyzing web logs and by inferring the visitor’s interests from actions of the visitor.
TELLIM (Hoerding, 1999) monitors the behavior of a customer and recognizes the user's needs and preferences. This information is used to adapt the product presentations. Using a set of rules, the system evaluates for every presentation element whether the customer was interested in it or not. These rules were derived from the authors' own experience; for example, if the downloading of an embedded image was interrupted, the element is judged to be of no interest to the customer. The attributes of an element are quite simple: kind of product (e.g., "car"), brand (e.g., "Ford"), and kind of information (e.g., "engine"). They used the CDL4 algorithm to learn the preferences of the customer. The user model for each customer is expressed as a set of rules, such as: if the size of the item (a hard drive) is less than 20 GB, then the customer may not be interested. They left refinement of the user model as future work.
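A rule set of this kind can be evaluated with a few lines of code. The attribute names and the 20 GB threshold below mirror the example in the text; everything else is a hypothetical sketch:

```python
# Evaluate a TELLIM-style customer model: a set of if-then rules over
# presentation-element attributes, each predicting disinterest.
def interested(item, rules):
    """Return False if any rule fires (predicts disinterest), else True."""
    return not any(rule(item) for rule in rules)

# "If a hard drive is smaller than 20 GB, the customer is not interested."
rules = [
    lambda item: item["kind"] == "hard drive" and item["size_gb"] < 20,
]

print(interested({"kind": "hard drive", "size_gb": 10}, rules))  # → False
print(interested({"kind": "hard drive", "size_gb": 80}, rules))  # → True
```

In TELLIM such rules are not hand-written per customer but induced by the CDL4 rule learner from the observed presentation events.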
The AVANTI Project (Fink et al., 1996) focuses on helping users by adapting the content and presentation of web pages to each individual user. Elderly and handicapped users are also partly considered. AVANTI relies partly on explicit profiles, and it uses both the user's path and his or her model to predict pages.
7.2.3. Contents Based User Model

Content-based user models can predict web pages that a user has not visited. This is possible because the model learns from the contents of the web pages that the user did visit. This technique usually involves higher-dimensional vectors.
WebWatcher (Joachims et al., 1997) is a tour guide software agent. It accompanies users from page to page, providing several types of assistance: highlighted interesting hyperlinks, a menu bar, and advice. It also learns from experience to improve its advice-giving skills. Since it runs as a centralized server, it can leverage data from different users.
User interest (the user model) is represented by high-dimensional feature vectors, with each dimension representing a word. WebWatcher uses reinforcement learning (Sutton and Barto, 1998) to learn control strategies that select optimal actions in certain settings.
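Representing interest as a word vector lets the agent score candidate hyperlinks by similarity to the profile. A minimal sketch using word-count vectors and cosine similarity (the vocabulary and scoring scheme are illustrative, not WebWatcher's exact method):

```python
# User interest as a high-dimensional word vector: one dimension per
# word, with cosine similarity scoring candidate hyperlink texts.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

profile = Counter("machine learning agents web".split())
link1 = Counter("machine learning tutorial".split())
link2 = Counter("cooking recipes pasta".split())

print(cosine(profile, link1) > cosine(profile, link2))  # → True
```

The same representation appears throughout content-based systems; what varies is how the profile vector is built and updated from the pages a user visits.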
SiteIF (Stefani and Strapparava, 1999) is a personal agent that follows users as they browse a web site. It learns the user's interests from the requested pages and builds and updates a user model. The system builds the user model in the form of a semantic net whose nodes are concepts and whose arcs represent the co-occurrence of two concepts. The relevance between the user model and a document is estimated using the Semantic Network Value Technique.
As noted above, the usage-based approach of Mobasher et al. (1999) can predict visited web pages well, but it is not good at predicting unvisited web pages. Content-based user models, by contrast, are generated from the contents of the web pages that a user has visited. This technique usually involves higher-dimensional vectors and needs more training data, but its advantage is that it can predict pages the user has not yet visited.
Syskill & Webert (Pazzani et al., 1996) is an intelligent agent that learns user profiles. After identifying informative words from web pages to use as Boolean features, it learns a naïve Bayesian classifier to determine the interest of a page to a user. It converts the HTML source of a web page into a Boolean feature vector that indicates whether each word is present or absent in that page.

Hybrid models are learned by observing a user's actions together with the contents of the web pages the user visited.
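The core steps, turning a page into a Boolean word-presence vector and scoring it with a naïve Bayesian classifier, can be sketched as follows. The tiny vocabulary, corpus, and "hot"/"cold" labels are hypothetical; Laplace smoothing is one standard choice, not necessarily the paper's:

```python
# Boolean word-presence features plus a naive Bayesian classifier,
# in the spirit of Syskill & Webert (Pazzani et al., 1996).
import math

VOCAB = ["bike", "trail", "stock", "market"]

def features(text):
    """Boolean vector: is each vocabulary word present in the text?"""
    words = set(text.lower().split())
    return [w in words for w in VOCAB]

def train(pages, labels):
    """Estimate per-class priors and P(word present | class), smoothed."""
    model = {}
    for c in set(labels):
        docs = [features(p) for p, l in zip(pages, labels) if l == c]
        probs = [(sum(col) + 1) / (len(docs) + 2) for col in zip(*docs)]
        model[c] = (len(docs) / len(pages), probs)
    return model

def predict(model, page):
    f = features(page)
    return max(model, key=lambda c: math.log(model[c][0]) + sum(
        math.log(p if x else 1 - p) for x, p in zip(f, model[c][1])))

model = train(["bike trail reviews", "mountain bike gear",
               "stock market news", "market analysis"],
              ["hot", "hot", "cold", "cold"])
print(predict(model, "new bike trail opened"))  # → hot
```

In the real system the vocabulary is chosen by an informativeness measure over the user's rated pages rather than fixed in advance.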
Mobasher et al. (2000) combine site usage-based clustering and a site content-based approach to obtain a uniform representation, in which user preferences are automatically learned from web usage data and integrated with domain knowledge and the site content. These profiles can be used to perform real-time personalization. Their experimental results indicate that integrating usage and content mining increases the usefulness and accuracy of the resulting recommendations.
A news agent called News Dude (Billsus and Pazzani, 1999) learns which news stories a user is interested in. It uses a multi-strategy machine learning approach to create separate models of a user's short-term and long-term interests: a nearest-neighbor algorithm models short-term interests, and a naïve Bayesian classifier models long-term interests.
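The multi-strategy idea can be sketched as a nearest-neighbor check over recently rated stories, falling back to a long-term model when nothing recent is similar. The Jaccard similarity, threshold, and toy data here are illustrative choices, not the paper's exact design:

```python
# News Dude in miniature: short-term nearest neighbor over recent
# stories, with a long-term classifier as the fallback.
def jaccard(a, b):
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def rate(story, recent, long_term, threshold=0.3):
    if recent:
        sim, label = max((jaccard(story, s), l) for s, l in recent)
        if sim >= threshold:
            return label             # short-term interests decide
    return long_term(story)          # otherwise fall back to long-term

recent = [("nasa launches mars probe", "interesting")]
long_term = lambda s: "interesting" if "science" in s else "boring"

print(rate("mars probe sends photos", recent, long_term))   # short-term hit
print(rate("local election results", recent, long_term))    # falls back
```

The split lets the agent react immediately to a story the user just read while still capturing stable preferences accumulated over weeks.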
Unlike News Dude, which creates a two-layer model, our approach models a continuum that spans from general to specific interests. Once we obtain a content-based user profile, we can extend it to incorporate a human behavior based user model.