FREE ELECTRONIC LIBRARY - Dissertations, online materials

Pages:     | 1 |   ...   | 2 | 3 || 5 | 6 |   ...   | 19 |

«Learning Implicit User Interest Hierarchy for Web Personalization by Hyoung-rae Kim A dissertation submitted to Florida Institute of Technology in ...»

-- [ Page 4 ] --

To provide a more robust context for personalization, we desire to extract a continuum of general to specific interests of a user, called a user interest hierarchy (UIH).

The higher-level interests are more general, while the lower-level interests are more specific. A UIH can represent a user’s interests at different abstraction levels and can be learned from the contents (words/phrases) in a set of web pages bookmarked by a user. We propose a divisive hierarchical clustering (DHC) algorithm to group words (topics) into a hierarchy where more general interests are represented by a larger set of words. Our approach does not need user involvement and learns the UIH “implicitly.” To enrich features used in the UIH, we used phrases in addition to words.

–  –  –

Figure 2. Learning user interest hierarchy in the framework of web personalization

2.1. User Interest Hierarchy A user interest hierarchy (UIH) organizes a user’s general to specific interests.

Towards the root of a UIH, more general (passive) interests are represented by larger clusters of words while towards the leaves, more specific (active) interests are represented by smaller clusters of words. To generate a UIH for a user, our clustering algorithm (details in Section 4) accepts a set of web pages bookmarked by the user as input. That is, the input of DHC is web pages that are interesting to a user. We use the words and phrases in the web pages and ignore link or image information. The web pages are stemmed and filtered by ignoring the most common words listed in a stop list (Rasmussen, 1992). The phrases are extracted by VPF algorithm (Kim and Chan, 2004). These processes are depicted in Figure 2.

–  –  –

Table 1 contains a sample data set. Numbers on the left represent individual web pages; the content has words stemmed and filtered through the stop list. These words in the web pages can be represented by a UIH as shown in Figure 3. Each cluster, node, can represent a conceptual relationship, for example ‘perceptron’ and ‘ann’ (in italics) can be categorized as belonging to neural network algorithms, whereas ‘id3’ and ‘c4.5’ (in bold) in another node cannot. Words in these two nodes are mutually related to some other words such as ‘machine’ and ‘learning’. This set of mutual words, ‘machine’ and ‘learning’, performs the role of connecting italic and bold words in sibling clusters and forms the parent cluster. We illustrate this notion in the dashed box in Figure 3.

One can easily identify phrases like “machine learning” and “searching algorithm” in the UIH, however only the individual words are represented in the UIH. By locating phrases from the pages, we can enrich the vocabulary for building the UIH. For example, the phrase “machine learning” can be identified and added to Pages 1-6. If we can use phrases as feature in the UIH, each cluster will be enriched because phrases are more specific than words. For example, a user is interested in both phrases “java coffee” and “java language”. The word “java” will be in the parent cluster of both “coffee” and “language”. Each child cluster would contain only “coffee” or “language”, which is relatively less useful when not in combination with “java”.

The approach we take to generate the hierarchy is similar to clustering pages, but pages may belong to multiple clusters – overlapping clusters of pages. That is, instead of directly clustering the original objects (web pages), we first cluster features (words) of the objects and then the objects are assigned to clusters based on the features in each cluster.

Note that a document can have terms in different clusters, hence, a document can be in more than one cluster. Since the more challenging step is the initial hierarchical clustering of features, our primary focus for this Chapter is on devising and evaluating algorithms for this step.

2.2. Building User Interest Hierarchy We desire to learn a hierarchy of interest topics from a user’s web pages bookmarked by a user, in order to provide a context for personalization. Our divisive hierarchical clustering (DHC) algorithm recursively partitions the words into smaller clusters, which represent more related words. We assume words occurring close to each other (within a window size) are related to each other. We investigate correlation functions that measure how closely two words are related in Section 4.2. We also investigate techniques that dynamically locate a threshold that decides whether two words are strongly related or not in Section 4.3. If two words are determined to be strongly related to each other, they will be in the same cluster; otherwise, they will be in different clusters.

2.2.1. Algorithm Our algorithm is a divisive graph-based hierarchical clustering method called DHC, that recursively divides clusters into child clusters until it meets the stopping conditions. In preparation for our clustering algorithm, we extract words from web pages that are interesting to the user, filter them through a stop list, and stem them (Rasmussen, 1992). Figure 4 illustrates the pseudo code for the DHC algorithm. Using a correlation function, we calculate the strength of the relationship between a pair of words in line 1.

After calculating a threshold to differentiate strong correlation values from weak correlation in line 2, we remove all weak correlation values in line 5. We then build a weighted undirected graph with each vertex representing a word and each weight denoting the correlation between two words. Since related words are more likely to appear in the same document than unrelated terms, we measure co-occurrence of words in a document.

Given the graph, called a CorrelationMatrix, the clustering algorithm recursively partitions the graph into sub-graphs, called Clusters, each of which represents a sibling node in the resulting UIH in line 6.

At each partitioning step, edges with “weak” weights are removed and the resulting connected components constitute sibling clusters (we can also consider cliques as clusters, but more computation is required). We treat the determination of what value is considered to be “strong” or “weak”, as another clustering. The recursive partitioning process stops when one of the stopping criteria is satisfied. The first criterion is when the current graph does not have any connected components after weak edges are removed. The second criterion is a new child cluster is not formed if the number of words in the cluster falls below a predetermined threshold.

Cluster: distinct words in a set of interesting web pages to a user [with information of web page membership] CORRELATIONFUNCTION: Calculates the "closeness" of two words.

FINDTHRESHOLD: Calculates the cutoff value for determining strong and weak correlation values.

WindowSize: The maximum distance (in number of words) between two related words in calculating their correlation value.


1. CorrelationMatrix ← CalculateCorrelationMatrix (CORRELATIONFUNCTION, Cluster, WindowSize)

2. Threshold ← CalculateThreshold(FINDTHRESHOLD, CorrelationMatrix)

3. If all correlation values are the same or a threshold is not found

4. Return EmptyHierarchy

5. Remove weights that are less than Threshold from CorrelationMatrix

6. While (ChildCluster←NextConnectedComponent (CorrelationMatrix))

7. If size of ChildCluster = MinClusterSize 8. ClusterHierarchy←ClusterHierarchy + ChildCluster + DHC(ChildCluster,CORRELATIONFUNCTION,FINDTHRESHOLD, WindowSize)

9. Return ClusterHierarchy End Procedure

–  –  –

Suppose we built a weighted undirected graph with the running example in Table 1 where each vertex represents a word and each weight (value) denotes the correlation value.

The undirected graph can be depicted as shown in a) in Figure 5 and Figure 6 – the left column shows graph partitioning and the right column represents the corresponding tree.

We presented only some vertices and edges as shown in a). Once a threshold for differentiating “strong” edges from “weak” edges is calculated by using a Findthreshold method, we can remove weak edges – those removed edges are represented as dashed lines.

After removing weak edges, DHC finds connected components, which is shown in b). If the number of elements in a cluster is greater than the minimum number of elements in a cluster (e.g., 4), then the correlation values are recalculated and the algorithm repeats the process of removing “weak” edges as shown in c). Since DHC recursively partitions the graph into subgraphs, called Clusters, the final result becomes hierarchical clusters as shown in d).

The CalculateCorrelationMatrix function takes a correlation function, cluster, and window size as parameters and returns the correlation matrix, where the window size affects how far two words (the number of words between two words) can be considered as related. The CalculateThreshold function takes a threshold-finding method and correlation matrix as parameters and returns the threshold. The correlation function (Sec. 4.2) and threshold-finding method (Sec. 4.3) greatly influence the clustering algorithm, and are discussed next..

–  –  –

2.2.2. Correlation Functions The correlation function calculates how strongly two words are related. Since related words are likely to be closer to each other than unrelated words, we assume two words co-occurring within a window size are related to each other. To simplify our discussion, we have been assuming the window size to be the entire length of a document.

That is, two words co-occur if they are in the same document. These functions are used in

–  –  – AEMI We use AEMI (Augmented Expected Mutual Information) (Chan, 1999) as a correlation function. AEMI is an enhanced version of MI (Mutual Information) and EMI (Expected Mutual Information). Unlike MI which considers only one corner of the contingency matrix and EMI which sums the MI of all four corners of the contingency matrix, AEMI sums supporting evidence and subtracts counter-evidence. Chan (1999) demonstrates that AEMI could find more meaningful multi-word phrases than MI or EMI.

Concretely, consider A and B in AEMI (A,B) are the events for the two words. P ( A = a ) is the probability of a document containing a and P ( A = a ) is the probability of a

–  –  –

is the probability of a document containing both terms a and b. These probabilities are

estimated from documents that are interesting to the user. AEMI (A,B) is defined as:

–  –  –

The first term computes supporting evidence that a and b are related and the second term calculates counter-evidence. Using our running example in Figure 3, Table 2 shows a few examples of how AEMI values are computed. The AEMI value between ‘searching’ and ‘algorithm’ is 0.36, which is higher than the AEMI value between ‘space’ and ‘constraint’, –0.09.

–  –  – AEMI-SP Inspired by work in the information retrieval community, we enhance AEMI by incorporating a component for inverse document frequency (IDF) in the correlation function. The document frequency of a word calculates the number of documents that contain the word. Words that are commonly used in many documents are usually not informative in characterizing the content of the documents. Hence, the inverse document frequency (the reciprocal of document frequency) measures how informative a word is in characterizing the content. Since our formulation is more sophisticated than IDF and it involves a pair of words rather than one word in IDF, we use a different name and call our function specificity (SP).

We estimate the probability of word occurrence in documents instead of just document frequency so that we can scale the quantity between 0 and 1. We desire to give high SP values to words with a probability below 0.3 (approximately), gradually decreasing values from 0.3 to 0.7, and low values above 0.7. This behavior can be approximated by a sigmoid function, commonly used as a smoother threshold function in neural networks, though ours needs to be smoother. SP(x) is defined as: 1/(1 + exp(0.6 × (x × 10.5 – 5))), where x is defined as: MAX (P(a), P(b)) - we choose the larger probability so that SP is more conservative. The factor 0.6 smoothes the curve, and constants 10.5 and –5 shift the range of x from between 0 and 1 to between -5 and 5.5. The new range of -5 and

5.5 is slightly asymmetrical because we would like to give a small bias to more specific words. For instance, for a = ‘ann’ and b = ‘perceptron’, x is 0.2 and SP(x) is 0.85, but for a=‘machin’ and b=‘ann’, x is 0.6 and SP(x) is 0.31.

Our correlation function AEMI-SP is defined as: AEMI × SP/2. The usual range for AEMI is 0.1–0.45 and SP is 0–1. To scale SP to a similar range as AEMI, we divide SP by 2. For example, in Table 3 the AEMI-SP value for ‘searching’ and ‘algorithm’ is lower than the value for ‘ann’ and ‘perceptron’ because the SP value for ‘ann’ and ‘percetpron’ is higher even though the AEMI value is lower.

–  –  –

expect it to occur quite often and appear with different, more specific words. Hence, we desire general (“connecting”) words to exist only at higher levels in the UIH. For example, ‘ai’ is general and preferably should not appear at the lower levels. Using our running example in Figure 3, the Jaccard value between ‘ai’ and ‘machine’ is 0.6 and the value between ‘ai’ and ‘search’ is 0.5. If the threshold is 0.49, both pairs are in the same cluster and ‘ai’ may perform the role to connect ‘machine’ and ‘search’. Even if the threshold is 0.55, ‘ai’ still remains in the child cluster with ‘machine’ (since their correlation value is over the threshold) - which is a wrong decision. This phenomenon tells us the Jaccard function is not proper for making hierarchical clusters.

Pages:     | 1 |   ...   | 2 | 3 || 5 | 6 |   ...   | 19 |

Similar works:

«Pathos, Performance, Volition: Melodrama's Legacy in the Work of Carl Th. Dreyer by Amanda Elaine Doxtater A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Scandinavian Languages and Literatures and the Designated Emphasis in Film Studies in the Graduate Division of the University of California, Berkeley Committee in charge: Professor Mark Sandberg, Chair Professor Linda Rugg Professor Linda Williams Fall 2012 Pathos, Performance,...»

«MULTI-CAMERA SIMULTANEOUS LOCALIZATION AND MAPPING Brian Sanderson Clipp A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science. Chapel Hill Approved by: Marc Pollefeys Jan-Michael Frahm Gary Bishop Svetlana Lazebnik Jongwoo Lim Gregory Welch c 2010 Brian Sanderson Clipp ALL RIGHTS RESERVED ii ABSTRACT BRIAN SANDERSON CLIPP: Multi-Camera...»

«Masterarbeit zur Erlangung des akademischen Grades Master of Arts der Philosophischen Fakultät der Universität Zürich Die Hausnamen der Schaffhauser Altstadt Verfasserin: Linda Hatt Matrikel-Nr.: 08-716-896 Referentin: Prof. Dr. Elvira Glaser Betreuer: Dr. Martin Graf Deutsches Seminar Abgabedatum: 15.12.2014 Masterarbeit Universität Zürich Hausnamen der Schaffhauser Altstadt Deutsches Seminar Linda Hatt Prof. Dr. E. Glaser / Dr. M. Graf Inhaltsverzeichnis 1. EINLEITUNG 1.1. VORWORT...»

«Forthcoming in Jessica Brown and Herman Cappelen (eds.), Assertion (Oxford: Oxford University Press). Assertion and Isolated Secondhand Knowledge Jennifer Lackey Northwestern University A common view in the recent philosophical literature is that knowledge is the norm governing proper assertion. Thus, according to Keith DeRose, “.one is positioned well-enough to assert that P iff one knows that P” (DeRose 2002, p. 180). Let us call the thesis expressed here the Knowledge Norm of Assertion,...»

«MANAGING VARIABILITY IN VLSI CIRCUITS by Brian T. Cline A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Electrical Engineering) in The University of Michigan Doctoral Committee: Professor David Blaauw, Chair Professor Marios Papaefthymiou Associate Professor Joanna Mirecki-Millunchick Associate Professor Dennis M. Sylvester © Brian T. Cline All Rights Reserved DEDICATION To my Family & Friends. This dissertation is dedicated to my...»

«CONSTRUCTION MANAGER'S INFLUENCE ON PROJECT SUCCESS V. LATORRE A thesis submitted to the University of Plymouth in fulfillment for the degree of DOCTOR OF PHILOSOPHY Page Numbering as Bound s. Blank In Original Copyright Statement This copy of the report has been supplied on condition that anyone who consults it is understood to recognise that its copyright rests with its author and that no quotation from the thesis and no information derived from it may be published without the author's prior...»

«Uniquely Represented Data Structures with Applications to Privacy Daniel Golovin CMU-CS-08-135 August 2008 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 Thesis Committee: Guy E. Blelloch, Chair Gary L. Miller R. Ravi Jon M. Kleinberg, Cornell University Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy. Copyright c 2008 Daniel Golovin This research was sponsored by the National Science Foundation under ITR grants...»

«In vitro and In vivo Proteome Analysis of Coccidioides posadasii by Setu Kaushal A Dissertation Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy Approved November 2015 by the Graduate Supervisory Committee: Douglas Lake, Chair D. Mitchell Magee Douglas Chandler Jeffery Rawls ARIZONA STATE UNIVERSITY December 2015 ABSTRACT Coccidioidomycosis (valley fever) is caused by inhalation of arthrospores from soildwelling fungi, Coccidioides immitis and C....»

«TOUGHNESS-DOMINATED HYDRAULIC FRACTURES IN COHESIONLESS PARTICULATE MATERIALS A Dissertation Presented to The Academic Faculty by Robert S Hurt In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in the School of Civil and Environmental Engineering Georgia Institute of Technology May 2012 TOUGHNESS-DOMINATED HYDRAULIC FRACTURES IN COHESIONLESS PARTICULATE MATERIALS Approved by: Dr. Leonid Germanovich, Advisor Dr. Glen Rix School of Civil and Environmental School of...»

«CHARLEMAGNE: THE MAKING OF AN IMAGE, 1100-1300 By JACE ANDREW STUCKEY A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA Copyright 2006 by Jace Andrew Stuckey My dissertation and all of my work is dedicated to my wife Shannon. Without her support none of this would have been possible. I would also like to dedicate this work to my daughter, Elizabeth, who was born the...»

«Symptom Experience Following Lung Cancer Surgery by Kathleen Garrubba Hopkins Bachelors of Science, University of Pittsburgh, 1978 Master of Science, Industrial Engineering, University of Pittsburgh, 1982 Associates Degree, Community College of Allegheny County, 2005 Submitted to the Graduate Faculty of School of Nursing in partial fulfillment of the requirements for the degree of Doctor of Philosophy University of Pittsburgh © 2014 UNIVERSITY OF PITTSBURGH School of Nursing This dissertation...»


<<  HOME   |    CONTACTS
2016 www.dissertation.xlibx.info - Dissertations, online materials

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.