# «Learning Implicit User Interest Hierarchy for Web Personalization by Hyoung-rae Kim A dissertation submitted to Florida Institute of Technology in ...»

7.2.4. Hybrid Way Based User Model Hybrid approach uses both user’s actions and the contents of the web pages visited by a user for building a user model. Mobasher et al. (2000) combine site usage based clustering and site contents based approach to obtain a uniform representation in which the user preference is automatically learned from web usage data and integrated with domain knowledge and the site contents. These profiles can be used to perform real-time personalization. Their experimental results indicate that the integration of usage and content mining increases the usefulness and accuracy of the resulting recommendations.

7.2.5. Explicit/Implicit Way of Building a User Model Most current approaches to personalization rely heavily on human participation to collect profile information about a user. The most common and obvious solution for collecting profile information about a user is asking for the user to specify their interests explicitly (Yahoo mail, 2003). However, the explicit approach has several disadvantages.

Time and effort are required to specify interests, and user’s interests may change over time.

Alternatively, an implicit approach can identify a user’s interests by inference.

Ardissono et al. (1999) demonstrated how user modeling and adaptive hypermedia techniques could be applied to present the most appropriate set of news (and advertisement) to each user. This system builds the initial model of a new user by asking questions directly such as age, gender, job, hobbies, etc. Since the initial stereotype user model may be not accurate, the model is refined periodically after monitoring the user’s behavior (e.g., which news s/he selects). The obtained user models are used for dynamic generation of the web pages based on a knowledge base (e.g., which news, at which detail level and which advertisement). However, setting rules for revising user profile and for predicting probability is difficult.

7.3. Machine Learning Machine learning has two different methods for leaning: supervised learning and unsupervised leaning. Supervised leaning has labels on learning data sets, but unsupervised

**learning does not. Supervised learning can be divided into two sub categories:**

characterization and classification. Characterization methods learn from a set of good data, and then detect no-goods based on the learning. This technique is mainly used for anomaly detection. Classification accepts a set of labeled data and then learns from them.

Unsupervised learning has two sub categories: clustering and outlier detection. Both methods use unlabeled data. Clustering tries to group similar elements. Outlier detection also groups similar elements, but rejects far elements at the same time. Outlier detection is commonly used for anomaly detection. Theses categories can be depicted as shown in

shown in Figure 37. Generally numerical method is hard for humans to understand in terms of how it is educated; but human can easily interpret the learned results by symbolic method. Following study mainly focuses on symbolic method, numeric method, and clustering techniques.

Some web personalization systems use machine learning to model user interest model or visitor’s behavior. In their work, they have a set of web pages, which have category labels (e.g. interesting, uninteresting, or topics of interest). The task is to assign labels to the unseen web pages.

7.3.1. Symbolic Methods of Learning Symbolic methods of learning yield results that can be readily understood by users.

This ease of user interface contributes to understanding the learning methods and

Semantic networks use objects as nodes in a graph, where the nodes are organized in a taxonomic structure and arrows represent relations between nodes. Each node represents a separate concept and links between nodes have several types of semantic relations. SiteIF (Stefani and Strapparava, 1999) represents a user model as a semantic network, whose nodes are concepts and arcs between nodes are the co-occurrence relation of two concepts. These methods learn the relationship between two concepts. It is an effective way to represent data as they incorporate the inheritance mechanism that prevents duplication of data.

7.3.1.2. Learning Decision Trees Decision tree, such as ID3, ASSISTANT, and C4.5, generates a nested set of ifthen-else rules, which is of the form, “if attribute value then …”. These algorithms recursively construct a tree using a greedy algorithm by finding the next attribute that maximizes the separation of categories at each node. Their inductive bias is a preference for small trees over large trees, Occam’s razor.

7.3.1.3. Learning Sets of Rules Rule-learning algorithms learn rule sets to assign categories to a training set.

RIPPER (Cohen, 1995) is a rule learning algorithms that perform efficiently on large noisy datasets. It uses greedy algorithm, so the early rules generated cover more number of training instances. Apriori (Agrawal and Srikant, 1994) generates all significant association rules between the categories and other attributes in large databases. Given a training set, it allows any attribute to serve as the label and finds if-then rules among attributes. LERAD (Mahoney and Chan, 2003) also finds association rules by choosing a set of rules randomly and applying rule validation.

Genetic algorithms encode each rule set as a bit string and use genetic search operators to explore this hypothesis space. It operates by iteratively updating a pool of rules, called population. On each generation, only some of the rules are selected according to their fitness and are carried forward into the next generation’s population intact.

SurfLen (Fu et al., 2000) uses rules in recommending next web pages. TELLIM (Hoerding, 1999) represents user profile as a set of rules. TELLIM handles only some product items, and measures if a customer is interested in the item or not. Their rules consist of not many precedents and antecedents. However in our study there are many attributes (all distinct words in web pages), which can reduce the speed of the rule-learning algorithm.

7.3.2. Numerical Methods of Learning If a symbolic method of learning can be called a white box method, a numerical method of learning will be a black box method. Black box methods do not provide human readable results. However, the performance is quite good such as with Neural Networks which is one of the most popular learning algorithms.

7.3.2.1. Hidden Markov Models Hidden markov models (HMM) are to model sequences of data. It can be viewed as a stochastic generalization of finite-state automata, where both the transitions between states and the generation of output symbols are governed by probability distributions.

Deshpande and Karypis (2001) applied pruning techniques to HMM model in order to reduce the model size and improve the prediction. Anderson (2002) also used Markov models to provide a probability distribution over out-links. Like wise, this approach is mainly used in predicting navigation in a web server.

7.3.2.2. Naïve Bayes Classifier Naïve Bayes classifier is a Bayesian learning method. It is called “naïve” because the attribute values are assumed to be conditionally independent. Even when this assumption is not met, the naïve Bayes classifier is often quite effective. Bayesian belief networks represent the sets of conditional independence assumptions among subsets of the attributes more effectively. Syskill & Webert (Pazzani et al., 1996) uses Naïve Bayes classifier in building user profiles. As a disadvantage, the interpretation of the results can be difficult since every vector has only probability values.

7.3.2.3. Artificial Neural Networks Artificial neural networks is among the most effective learning methods currently known. For example BACKPROPAGATION algorithm has been surprisingly successful in many practical problems. Each vector element is assigned to an input neuron; each category to an output neuron. These neurons and other intermediate neurons are all connected by weights. The network is trained by incrementally adjusting the weights to correctly categorize the training set. Nonlinear functions of the input can be learned by adding the intermediate neurons. Even though this approach shows good performance in various areas, it supports very little human interpretation.

7.3.2.4. Instance-based Learning Instance-based learning such as nearest neighbor and locally weighted regression simply memorize the presented training data. When a new query instance is encountered, a set of similar instances is retrieved from memory and used to classify the new query instance. These methods can use more complex, symbolic representations for instances.

The disadvantage of this approach is that the cost of classifying new instances can be high.

For some application domains, it may be necessary to classify hundreds or thousands of web documents in a few seconds.

7.3.3. Clustering Techniques Many web personalization systems use clustering techniques in building a user model. In our work, we try to group words according to their correlations. For example all class names are supposed to be in one cluster or some related words such as computer and monitor are also in the same cluster. There are five categories of major clustering methods

every object in its own cluster and then repeatedly merge similar clusters together, resulting in a tree shape structure that contains clustering information on many different levels (Voorhees, 1986). Merges are usually binary – merging two entities, which could be clusters or initial data points. Hence, each parent is forced to have two children in the hierarchy. Divisive (top-down) hierarchical clustering (DHC) algorithms are similar to agglomerative ones, except that initially all objects start in one cluster which is repeatedly split. These algorithms find the two furthest points, which are the two initial clusters. Then, the rest of the points are assigned to those two clusters depending on which one is closer.

Hence, a binary tree is generated. Our DHC algorithm can generate multiple branches from one node depending on the data, which is the advantage of using a graph-partitioning technique. Several stopping criteria for AHC/DHC algorithms have been suggested, but they are typically predetermined constants – one common stopping criterion is the desired number of clusters (Fisher, 1987; Milligan and Cooper, 1985). These algorithms are very sensitive to the stopping criterion. The web documents, however, could be extremely varied (in the number, length, type and relevance of the terms/documents). When these algorithms mistakenly merge multiple “good” clusters due to the predetermined constraint, the resulting cluster could be meaningless to the user (Zamir and Etzioni, 1998). Another characteristic of the terms in web documents is that there reside many outliers. These outliers (sort of “noise”) reduce the effectiveness of commonly used stopping criteria.

AGNES (AGglomerative NESting) (Kaufman and Rousseeuw, 1990), DIANA(DIvisive ANAlysis) (Kaufman and Rousseeuw, 1990), BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) (Zhang et al., 1996), CURE(Clustering Using REpresentatives) (Guha et al., 1998), ROCK (Guha et al, 1999), CHAMELEON (Karypis et al., 1999) all use these methods. In some application this binary split can be a major disadvantage. The hierarchical tree reorganized by slicing technique (Maarek and BenShaul, 1996) can be useful in many areas.

Partitioning clustering algorithms such as the K-means algorithm initially creates a partitioning of K clusters. Those initial K clusters are then iteratively refined to achieve the final clustering of K clusters. A major drawback of this approach is that the number of clusters must be specified beforehand as an input parameter. Our algorithm only needs to cluster strongly connected words, but the K-means algorithm divides all words into K clusters without removing weak relations. We define the “strongly connected” in this paper as the relation between words whose correlation value is higher than the threshold. K-mean algorithm is too sensitive to outliers since an object with an extremely large value may substantially distort the distribution of data. Instead of taking the mean value, the medoid can be used, which is the most centrally located object in a cluster. PAM(Partitioning around Medoids) (Kaufman and Rousseeuw, 1990), CLARA(Clustering LARge Applications) (Kaufman and Rousseeuw, 1990), and CLARANS (Clustering Large Applications based upon RANdomized Search) (NG and Han, 1994) all use k-medoid method. We will not use this method in our work, because it is too difficult to find the best number of clusters initially.

Density-based clustering methods handles clusters with arbitrary shapes. This method regard clusters as dense regions of objects in the data space that are separated by regions of low density. Some examples of density-based approach are DBSCAN (Density Based Spatial Clustering of Applications with Noise) (Ester et al., 1996), OPTICS (Ordering Points To Identify the Clustering) (Ankerst et al., 1999), and DENCLUE (DENsity-based CLUstEring) (Hinneburg and Keim, 1998).

Grid-based method uses a multi-resolution grid data structure. It quantizes the space into a finite number of smaller cells. On which all of the operations for clustering are performed. It is typically fast; the processing time is typically dependent on the number of cells not the number of data objects. Examples of grid-based methods are STING (STatistical INformation Grid approach) (Wang et al., 1997) and WaveCluster (Wang et al., 1997).

Model-based clustering methods set some mathematical model and then attempt to optimize the fit between the given data and the mathematical model. Such methods are based on the assumption that there is a mixture of underlying probability distributions that generate the data. COBWEB is an incremental conceptual clustering algorithm. Each cluster records the probability of each attribute and value, and the probabilities are updated every time an object is added to a cluster (Fisher, 1987). Instead of recalculating the whole probabilities of clusters to determine if child clusters are generated, our DHC algorithm uses a graph-based method and a different correlation function. CLASSIT (Gennari et al.,