«Learning Implicit User Interest Hierarchy for Web Personalization by Hyoung-rae Kim A dissertation submitted to Florida Institute of Technology in ...»
18.104.22.168. Phrase Describing concepts often requires more than a single word. A composed term using two or more single words (called “phrase”) usually has more specific meaning and can disambiguate related words. For instance, “apple” has different meanings in “apple tree” and in “apple computer”. Using phrases, in addition to words, can improve the performance of WIR systems. Statistical phrase-finding algorithms are mainly used for improving the performance of information retrieval (Kim and Chan, 2003; Turpin and Moffat, 1999; Croft et al., 1991; Fagan, 1987). There are three main approaches: syntactic (Lima and Pedersen, 1999), statistical (Mitra et al., 1997), and hybridized (Gokcay and Gokcay, 1995). Our research mainly focuses on the statistical approach, which does not need any grammatical knowledge and has easy adaptability to other languages. Statistical phrase-finding approaches have been used for expanding vector dimensions in clustering multiple documents (Turpin and Moffat, 1999; Wu and Gunopulos, 2002), or finding more descriptive or important/meaningful phrases (Ahonen et al., 1998; Chan, 1999).
Wu and Gunopulos (2002) examined the usefulness of phrases as terms in vectorbased document classification. They used statistical techniques to extract phrases from documents whose document frequency (df) is larger than or at least equal to a predefined threshold. Fagan (1987) selected phrases having a document frequency of at least 55 and a high co-occurrence in the same sentence. Mitra et al. (1997) collected all pairs of nonfunction words that occur contiguously in at least 25 documents. Turpin and Moffat (1999) used Mitra’s method for statistical phrases for vector-space retrieval. Since the aim of these approaches was to find the significant words or phrases among documents, this method could remove meaningful phrases in a document. These all methods focused on collecting two-word phrases.
Croft, et al. (1991) described an approach where phrases identified in natural language queries are used to build structured queries for probabilistic retrieval models and showed that using phrases could improve performance. They used tf×idf (term frequency inverse document frequency) information for a similarity measure. Croft (2000) segmented a document’s text using a number of phrase separators such as verbs, numbers, dates, title words, format changes, etc. Next, his method checks the candidate phrases to see if they are syntactically correct. Finally, the occurrence frequency of the remaining phrases is checked.
Gokcay and Gokcay (1995) used statistically extracted keywords and phrases for title generation. Their statistical method used grammatical information of tags and sentences, but it is hard to determine a sentence without grammatical information. They used the cosine correlation function for comparing the similarity of two words.
Ahonen’s method (Ahonen et al., 1998) finds all possible combinations of words within a fixed window using Mannila and Toivonen’s (1996) algorithm. Suppose the window size is 6 and the string in that window is “abcdef”. Their algorithm generates all possible cases: “ab”, “bc”, “cd”, “de”, “ef”, “abc”, “bcd”,… “bcdef”, “abcdef”. Then, it computes the conditional probability for the weight of those phrases. A phrase “abc” has two possible weights from P(“c”| “ab”) and P(“bc”| “a”), from which the higher value is chosen. They also allowed gaps within a phrase. The use of two parameters – a threshold to remove less descriptive phrases in the generating stage and a threshold for the maximum phrase length – could be too strict.
Zamir’s algorithm (Zamir and Etzioni, 1998) uses only frequency information.
They collect neither too frequent nor too rare phrases. This method needs two user-defined parameters: one for removing too rare or too frequent words and the other for selecting phrases out of all possible phrases.
Chan’s algorithm (1999) improved performance by using correlation information within a phrase. His algorithm calculates the correlation values of all pairs. The main drawback of this method resides in its incompleteness. Suppose there is a string S=“abaxbxaxxbxaxxxb…xxbbxbxbxxbxxxb…”, where ‘a’ and ‘b’ represent words, and ‘x’ represents any word. If all correlations between ‘a’ and ‘b’ with 1 through 4 distances, and correlations between ‘b’ and ‘b’ with 1 through 4 distances have values higher than the threshold, Chan’s algorithm will generate a word “abbbb” that does not exist in the string.
These cases are very unlikely to happen in a normal article such as newspaper or journal article. But web pages contain lists of similar product names or tables that just arrange a few different repeating words many times. We experienced these non-existing words in our experiment such as “test pass test”, “student test pass”, “teach assist teach class”, etc.
Another disadvantage of Chan’s algorithm is that it requires a user-defined maximum phrase length. Chan (1999) implemented the algorithm in time O(nw2), where nw is the number of distinct words, while our implementation consumes O(nw).
We compare previous statistical approaches and attempts to find meaningful phrases in a document. The length of the phrases is various like normal phrases and our algorithm requires no specified parameters. We mainly focus on a statistical approach without using syntactic information. Simple phrase separators (e.g., stop-words and nonalphabet characters) are used only. Our research experimentally shows which correlation functions are better than others in terms of measuring word correlation in our variablelength phrase-finding algorithm.
7.1.2. Clustering Web Contents Methods about clustering web contents group related web pages together (e.g., Vivisimo (2005)). It does not use any personal information. These methods focus on clustering large amount of information in a short time.
Scatter/Gather (Cutting et al., 1992) supports a hierarchical interface by clustering documents into topically coherent groups and providing descriptive summaries. A user can select one or more basic clusters to focus on, recursively, in the subset of documents. The user specifies a query and two thresholds (the # of documents to be initially retrieved and the # of sub-clusters). Scatter/Gather produces significant improvements over similarity search ranking.
STC (Suffix Tree Clustering) (Zamir and Etzioni, 1998) is a linear clustering algorithm in terms of the document size, which is based on the phrases (an ordered sequence of one or more words) that are common to groups of documents. STC has three steps: document cleaning, identifying base features using a suffix tree, and merging these base features into clusters.
Grouper (Zamir and Etzioni, 1999) is an interface that dynamically groups the search results into clusters labeled by phrases extracted from the snippets – this uses STC as its clustering algorithm. Given a set of phrases from STC, Grouper presents them in descending order of coverage (the percentage of documents in the cluster that contain the phrase) and length (# of words in the phrase, not counting stopped words). Grouper creates clusters by merging base features (a phrase and the set of documents that contain it).
However, when the clusters fail to capture the semantic distinctions the users were expecting, it can be confusing.
Newsblaster (Barzilay et al., 1999) generates a concise summary by identifying and synthesizing similar elements across related text from a set of multiple documents.
Common phrases are identified across sentences assuming a set of similar sentences as input extracted from multiple documents on the same event. Using language generation to merge similar information is a new approach that significantly improves the quality of the resulting summaries. This technique may be able to be used for summarizing web pages.
7.1.3. Predicting Navigation Another approach of web personalization is to predict forward references based on partial knowledge about the history of a session. This approach predicts future requests and present documents to client. When the server guesses correctly, the latency of the next request is greatly reduced; and if the server guesses incorrectly, the client requests the intended next document. These techniques also use other people’s information, and can support pre-fetching. The drawback is that it is difficult for them to predict previously unvisited pages.
The two major approaches for predicting navigation are Markov models and ngram models. Zukerman et al. (1999) and Cadez et al. (2000) use a Markov model to learn and represent significant dependencies among page references. The n-gram models predict which URL will be requested next; the Markov models compute the probability of next request. An n-gram is a sequence of n web requests, and the n-gram models learn how often each sequence of n requests was made in the training data.
Deshpande and Karypis (2001) use pruning techniques to reduce the model size and improve predictions. Their method intelligently selects parts of different order Markov models so that the resulting model has a reduced state complexity and improved prediction accuracy.
WebCANVAS (Cadez et al., 2000) visualizes the similar group of users by using a Markov model – this is illustrated on user-traffic data from msnbc.com. Zukerman et al.
(1999) developed a system that reduces a user’s expected waiting time by pre-sending documents s/he is likely to request. Their models are based on observing the behavior patterns of many users, rather than using the behavior of an individual user. They combine two features: the order in which documents are requested and the structure of the server site. Mladenić (2000) used naïve Bayes or k-nearest neighbor models to predict the next link.
SurfLen (Fu et al., 2000) actively monitors and tracks a user’s navigation. Once a user’s navigation history is captured, they discover the hidden knowledge contained in the history by applying association rule mining techniques. This uses a form of “market basket” analysis (Agrawal and Srikant, 1994). The knowledge is then used to recommend potentially interesting web pages to users. PageGather (Perkowitz and Etzioni, 2000) builds a co-occurrence matrix of all pairs of pages visited and finds clusters of pages that are frequently viewed in the same session. Like SurfLen, PageGather also recommends the top n pages that are most likely to co-occur with the visitor’s current session.
Shahabi and Banaei-Kashani (2003) proposed a web-usage-mining framework using navigation pattern information. They introduced a feature-matrices model (FM) to discover and interpret users’ access patterns.
This approach is different from ours since we do not use navigation pattern information but rather the contents of web pages. Both the n-gram and the Markov methods require large volumes of training data and cannot generate previously unvisited pages because they use navigation pattern information.
7.1.4. Personalized Contents The methods related to personalized contents use user’s access pattern to provide personalized web services. Server usually monitors the user’s access patterns. This technique identifies a set of categories from other users’ access pattern and then provides a new user the closest category based on his/her access pattern. Our method does not personalize the set of categories, but personalizes results returned from a search engine.
Page et al. (1998) first proposed personalized web searches based on modifying the global PageRank algorithm with the input of bookmarks or homepages of a user. Their work mainly focuses on global “importance” by taking advantage of the link structure of the web. Haveliwala (1999) determined that PageRank could be computed for very large subgraphs of the web on machines with limited main memory. Brin et al. (1998) suggested the idea of biasing the PageRank computation for the purpose of personalization, but it was never fully explored. Haveliwala (2002) used personalized PageRank scores to enable “topic sensitive” web search. He concluded that the use of personalized PageRank scores can improve web search, but the number of hub vectors (e.g., number of interesting web pages used in a bookmark) used was limited to 16 due to the computational requirements.
Jeh and Widom (2003) scaled the number of hub pages beyond 16 for finer-grained personalization.
Liu et al. (2002) also tried mapping user queries to sets of categories. This set of categories served as a context to disambiguate the words in the user’s query, which is similar to Vivisimo (2005). They studied how to supply, for each user, a small set of categories as a context for each query submitted by the user, based on his or her search history.
Anderson (2002) proposed PROTEUS (This system personalizes web browsing for visitors using wireless PDAs at many web sites, adapting each site in turn), MINPATH (This algorithm finds personalized shortcut links efficiently), and MONTAGE (This system supports personalized, dynamic portals of web content, based on the navigation behavior of each individual visitor), which personalize the web content for different audience, that personalize individual pages (e.g., elide-content), site-sessions (add-shortcut), or entire browsing sessions.
Footprints (Wexelblat and Maes, 1997) helps user browse complex web sites by visualizing the paths taken by users who have been to the site before. The paths are visualized as a graph of linked document nodes – color represents the frequency of use of the different paths. Footprints are left automatically by anonymous different users, and new visitors do not need to provide any information about themselves in order to use the system. However, users can only see the frequency of the links between adjacent pages.
Perkowitz and Etzioni (1997, 2000) find singular transformations that appeal to all visitors at the site by synthesizing index pages – hubs of links to other pages in the site.
They consider the problem of index page synthesis and sketch a solution that relies on novel clustering and conceptual clustering techniques.