FREE ELECTRONIC LIBRARY - Dissertations, online materials

Pages:     | 1 |   ...   | 4 | 5 || 7 | 8 |   ...   | 19 |

«Learning Implicit User Interest Hierarchy for Web Personalization by Hyoung-rae Kim A dissertation submitted to Florida Institute of Technology in ...»

-- [ Page 6 ] -- Window Size We compared the performance using different window sizes: ‘entire page’ versus 100 words (paragraph length). We fixed the correlation function to AEMI and the threshold-finding method to MaxChildren. Table 5 and Table 8 illustrate the results. A window size of the entire page generated slightly more meaningful clusters (59% good) than a window size of 100 (57% good). However, a window size of 100 yielded more tress with 100% ‘good’ leaf clusters (6) than a window size of the entire page (5). Hence, it is not clear which window size produces more meaningful clusters. Both methods resulted in ‘medium’ trees. A window size of 100 generated one thin tree for U11. The ‘T’ tree in Table 8 has only two nodes: the root and one leaf.

Table 5. Combination of AEMI, MaxChildren, and entire page

–  –  –

If we can add phrases as feature in the UIH, each cluster will be enriched because phrases are more specific than words. We compared two different data sets: one consisting of only words and the other consisting of words and phrases – the phrases were collected by VPF (Kim and Chan, 2004). Table 5 and Table 9 illustrate the results. Results from the data with phrases presented more meaningful leaf clusters (64%) than results with only words (59%). Tree shapes were similar (medium) in both methods.

UIHs learned from a user (U1) are depicted in Figure 9 and Figure 10. The one from only words has three ‘good’ leaf clusters (1, 3, and 4) and one ‘bad’ leaf cluster (5).

Cluster 0 denotes root nodes, which has all words or phrases. The right one which is learned from both words and phrases has all ‘good’ clusters; furthermore, it is more descriptive because Clusters 6 and 7 in Figure 10 describe class names while cluster 3 in Figure 9 alone describes class names. We can say Clusters 6 and 7 in Figure 10 are conceptually related because both are class names. We cannot explain why some specific interests in one UIH do not exist in the other UIH. For example, Clusters 4 in Figure 9 showed that the user (U1) was interested in a Master or Doctoral degree program, but the

–  –  –

hierarchy (UIH) that can represent a continuum of general to specific interests from a set of web pages interesting to a user. This approach is non-intrusive and allows web pages to be associated with multiple clusters/topics. We proposed our divisive hierarchical clustering (DHC) algorithm and evaluated it based on data obtained from 13 users on our web server.

We also introduced correlation functions and threshold-finding methods for the clustering algorithm. Our empirical results suggested that DHC with the AEMI correlation function and the MaxChildren threshold-finding method yielded more meaningful leaf clusters. In addition, by using phrases found by VPF algorithm, we improved performance up to 64% of interpretable clusters. We did not analyze differences among the UIHs’ obtained from various users because of the large numbers of web pages used in our experiments. Results from experiments not reported here indicated that stemmed words were more effective than whole words. The minimum cluster size affected the number of leaf clusters; size 4 was easy to use and seemed to produce reasonable results.

The performance of the DHC algorithm varied depending on a user and the articles selected. We currently do not understand the reason for the variance in performance. We

–  –  –

Finding phrases in a document has been studied in various information retrieval systems to improve their performance. Many previous statistical phrase-finding methods had a different aim such as document classification. Some are hybridized with statistical and syntactic grammatical methods; others use correlation heuristics between words. We propose a new phrase-finding algorithm that adds correlated words one by one to the phrases found in the previous stage, maintaining high correlation within a phrase. The inputs are words in a document and the outputs are phrases collected. These processes are

–  –  –

Figure 11. Finding phrases in the framework of web personalization

3.1. Variable-length Phrases Our algorithm consists of two components: the main algorithm and the correlation function. In preparation for VPF, we extract words from a web page visited by the user, filter them through a stop list, and stem them (Ahonen et al., 1998; Chan, 1999; Frakes and Baeza-Yates, 1992; Zamir and Etzioni, 1998). Other phrase-finding algorithms also require these pre-processing steps.

3.1.1. VPF Algorithm Our algorithm adds words one by one to the phrases found in a previous stage – each stage corresponds to each recursion in the VPF algorithm. One might insist that each phrase P{l} of length l has the form P{l-1}w, where w is one word and P{l-1} is a phrase of length l-1. Since the phrase P{ l-1}w is defined by the correlation between P{l-1} and w, it is possible that the correlation exists between a non-phrase P{l-1} and a word w. That is, P{l-1} is not a phrase, but can be extended into a phrase of length l. If this is possible, it is also possible that even if there exists a phrase, P{l}, in a document, the phrase P{l} could not be generated because P{l-1} does not exist. For example, there exists a phrase “wireless powerful computer” in a web page. But, since “wireless powerful” is not a phrase, it is possible that the phrase could not be generated. However, if the phrase is meaningful/important enough, the sub phrases “wireless powerful” and “powerful computer” will be generated. Next, “computer” will be added to “wireless powerful”. To relieve this problem, we calculate the threshold once at the beginning – this means the threshold is consistent. If the correlation value of “wireless powerful” is lower than the value of “wireless powerful computer” then the shorter phrase will be removed at a pruning stage.

Our algorithm receives a sequence of words as input and returns meaningful phrases. It combines words into phrases until it generates no more phrases. Figure 12 illustrates the pseudo code for the variable-length phrase-finding algorithm (VPF). It uses three main variables – List, SEQ, and Corr. The List variable stores all collected phrases in a Hash attribute. Each element (phrase) of the Hash attribute keeps correlation value in sim attribute and the position list in posi attribute. VPF first makes a linked list (SEQ) with an input example. Each word and word position are stored in each node in the linked list.

Then, all 1-gram distinct words are stored in List[1].Hash. Corr is a chosen correlation function. The List, SEQ, and Corr are passed to BeSpecific procedure and this finds all phrases. Once the phrases are acquired, they are pruned. The pruning process simply removes all sub-phrases that have a sim value lower than the sim value of the superphrases.

The BeSpecific procedure receives five parameters: List that stores all phrases, l which is the length of phrases (initial value is 2), thre which keeps the calculated threshold value differentiating “strong” relations from “weak” relations (initialized to 0); SEQ which stores the linked list, and Corr. BeSpecific recursively creates new sequences by removing nodes that are not in the Hash table generated in the previous stage, and also by removing consecutive nodes whose lengths are shorter than l. Since it removes words that are not in the Hash table generated in the previous stage, there can be gaps between nodes. Once the new sequence is generated, it collects all l-grams having no gaps from the sequence. The threshold is calculated when l=2 only once by averaging all correlation values between the first and second word in each element in List[2].Hash. The for loop removes any element with low correlation values (sim) from Hash. The sim property of p keeps the correlation between a sub phrase, p[1.. l-1], and the last word, p[l], where p is a phrase consisting of l words. For example, if p=“computer science seminar”, then p[1.. l-1]=“computer science” and p[l]=“seminar”. The Intersection counts all adjacent points based on the distance. The intersection between the positions of p[1.. l-1] and p[l] becomes the positions of p. If the intersection of p becomes 0, then the p is removed. The BeSpecific procedure recursively increases the phrase length L until Hash become empty. We can also apply various correlation functions in the place of Corr.

In order to improve the phrase-selection accuracy, we need to calculate for each word the percentage that a word can come before any other words and the percentage that the word can come after any other words, called pre-percentage and post-percentage respectively. The idea is that if a word occurred at the end of a sequence, then this word loses his one chance to come before any other words, so we adjust the pre-percentage of the word by deducting one from the number of occurrences of the word. The postpercentage is vice versa. We can view a string S[1..n] of n consecutive words as two sub strings, Spre=S[1..n-1] and Spost=S[2..n]. Pre- and post-percentage of w can be computed in

time O(1), when we know all the positions where w occurred:

wpre-percentage= Frequency of w in Spre / |Spre| wpost-percentage= Frequency of w in Spost / |Spost|.

Input: Example– a sequence of words Output: Collected phrases Function VPF (Example) return phrases SEQ– linked list, each node has a word and a position.

List– array of Hash table, each word in a Hash has sim and posi attributes.

Corr- a correlation function

1. SEQ←linked list made by the input Example

2. List[1].Hash←store all 1-grams, each word keeps its positions

3. BeSpecific(List, 2, 0, SEQ, Corr)

4. Prune phrases in the List

5. Return all phrases in the List End Function Input: List- store array of Hash table L- length of phrase, initialized to 2 thre- threshold value, initialized to 0 SEQ- linked list, the size reduces every iteration Corr- correlation function BeSpecific(List, L, thre, SEQ, Corr)

1. if List[L-1].Hash is empty then stop

2. SEQ←remove nodes that are not in the List[L-1].Hash and also remove consecutive nodes which length is shorter than L

3. List[L].Hash←Collect all L-grams from SEQ

4. if L=2 then thre←Average correlation across all words in List[2].Hash

5. for each p in List[L].Hash do A ← pre-percentage of p[1..L-1] 6.

B ← post-percentage of p[L] 7.

8. p.sim←Corr(A,B,A∩B)

9. remove any p for which p.sim is lower then thre

10. p.posi←Intersection of p[1..L-1].posi and p[n].posi with distance of L-2

11. remove any p for which p.posi=0

12. BeSpecific(List, L+1, thre, SEQ, Corr) End Procedure

–  –  –

when L is 2. Even though the BeSpecific is called L times, as L increases the size of the sequence decreases drastically.

The Corr function has O(T) time complexity, where T is the position length. But, as T increases the number of element in a Hash decrease. We, therefore, can claim that the time complexity of VPF in general case is roughly: O(S), where S is the sequence size.

For example, in Figure 2 we have an input string S = “abcdabefbcdghabcd,” in which each letter represents a word. The correlation function simply returns the frequency of a phrase using a threshold value of 1.5. List[1].Hash contains all distinct words.

List[2].Hash originally contains all possible 2-word phrases, then removes any phrase that occur less than 2 times, resulting in {“ab”, “bc”, “cd”}. We remove all words that are not in the List[2].Hash from the 1st SEQ resulting in 2nd SEQ. List[3].Hash contains all 3-word phrases, and then remove “cda” and “dab” because they occur only once. When we came to BeSpecific for the 3rd time, we removed “*abc*” from the 3rd SEQ, because their consecutive length is less than 4 – the size of L increases every time we come to BeSpecific. When we run the 4th BeSpecific, we can remove all phrases (“abcda” and “bcdab”) in List[5].Hash, because they occur only once. Since the 5th Hash is empty, the BeSpecific stops. After the BeSpecific, the List keeps {“ab”, “bc”, “cd”, “abc”, “bcd”, “abcd”}. Suppose the pruning removes all sub-phrases that have a sim value lower than or equal to the sim value of the super-phrases. The occurrences of phrases are: “ab”-3, “bc”-3, “cd”-3, “abc”-2, “bcd”-3, and “abcd”-2. We remove “bc” and “cd” because “abc” subsumes them and has equal frequency. “abc” is removed by “abcd” with the same reason. The final returned phrases are {“ab”, “bcd”, “abcd”}.

3.1.2. Correlation Functions The VPF algorithms build phrases; and correlation functions actually calculate the weight of a phrase. The correlation functions are important in terms of selecting more meaningful phrases. The VPF is able to cooperate with many different existing correlation functions, and it can be hard to choose one correlation function out of many. In this section, we describe several key properties of a good correlation function. Much of the statistical work in building multi-word features focuses on co-occurrence (Chen and Sycara, 1998; Rosenfeld, 1994). All correlation measures are not equally good at capturing the dependencies between different events (Tan et al., 2002). It is because each correlation function biases toward different individual event probabilities and joint probabilities.

Piatetsky-Shapiro (1991) has proposed three key properties that a good correlation

function, F, for events A and B should satisfy:

P1: if A and B are statistically independent, then F is 0;

P2: F monotonically increases with P(A,B) when P(A) and P(B) remain the same;

P3: if P(A) (or P(B)) increases when the rest of the parameters (P(A,B) and P(B) (or P(A))) remain unchanged, then F monotonically decreases.

Statistical independence can be measured by the determinant operator, where Det (A,B) = A∩B×A′∩B′ − A∩B′×A′∩B. Thus, a singular diagram is independent when its determinant is equal to zero (Tan and Kumar, 2000). Another important operation in finding phrases is distinguishing between positive and negative correlations (P4).

Measuring their cross product ratio (CPR) can assess the significance of the correlation between A and B (Rosenfeld, 1994) and is defined as:

Pages:     | 1 |   ...   | 4 | 5 || 7 | 8 |   ...   | 19 |

Similar works:

«The Pennsylvania State University The Graduate School Department of Anthropology SETTLEMENT AND POPULATION AT PIEDRAS NEGRAS, GUATEMALA A Thesis in Anthropology by Zachary Nathan Nelson © 2005 Zachary Nathan Nelson Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy May 2005 Abstract My dissertation examines the relationship between settlement and population at Piedras Negras, Guatemala. This Classic Maya center developed from a small village into a...»

«The Present Participle as a Marker of Style and Authorship in Old English Biblical Translation by George James McBeath Lamont A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Centre for Medieval Studies University of Toronto © Copyright by George James McBeath Lamont 2015 The Present Participle as a Marker of Style and Authorship in Old English Biblical Translation George James McBeath Lamont Doctor of Philosophy Centre for Medieval Studies...»

«Loughborough University Institutional Repository Richard Hamilton a multi-elusive artist of the modern world? This item was submitted to Loughborough University's Institutional Repository by the/an author.Additional Information: • A Master's Thesis. Submitted in partial fullment of the requirements for the award of Master of Philosophy of Loughborough University. Volume 1 is available on open access and Volume 2 is closed access for copyright reasons. https://dspace.lboro.ac.uk/2134/8542...»

«MODELING, DESIGN AND ENERGY MANAGEMENT OF FUEL CELL SYSTEMS FOR AIRCRAFT A Dissertation Presented to The Academic Faculty by Thomas Heenan Bradley In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the School of Mechanical Engineering Georgia Institute of Technology December, 2008 Copyright © Thomas Heenan Bradley 2008 MODELING, DESIGN AND ENERGY MANAGEMENT OF FUEL CELL SYSTEMS FOR AIRCRAFT Approved by: Dr. David E. Parekh, Advisor Dr. William J. Wepfer School of...»

«LOCAL DEVELOPMENT PLANNING AND COMMUNITY EMPOWERMENT IN DECENTRALISED INDONESIA: THE ROLE OF LOCAL PLANNING IN IMPROVING SELF ORGANISING CAPABILITIES OF LOCAL COMMUNITIES IN TAKALAR, INDONESIA Setiawan Aswad BA (Dev. Mgt), M.Dev.Plg Submitted in fulfilment of the requirements for the degree of Doctor of Philosophy School of Civil Engineering and Built Environment Faculty of Science and Engineering Queensland University of Technology September 2013 Abstract In the process of empowering local...»

«From the Neverland to the Midnight Garden: The Changing Representation of Boys in Children’s Literature 53 From the Neverland to the Midnight Garden: The Changing Representation of Boys in Children’s Literature Kazuki OGAWA Introduction British children’s literature has produced many popular boy characters: Peter Pan, Winnie-the-Pooh and Harry Potter, to name but a few. The last name among them may now be the most famous boy all over the world; he is the hero in the Harry Potter series by...»

«ABSTRACT Title of Document: MAPPING TERRORISM: AMORPHOUS NATIONS, TRANSIENT LOYALTIES Ritu Saksena, Doctor of Philosophy, 2006 Directed By: Professor Sangeeta Ray Department of English Terrorism has a predilection with nations and nationalism and it plays on the symbiotic relationship between nationalism and violence. But “forgetting” this violence and bloodshed was crucial to the perpetuation of the myth of civilized nations. While Postcolonial Studies has offered incisive justifications...»

«‘This is Not Your Father’s War’ Confronting the Moral Challenges of ‘Irregular War’ Banquet Keynote Address NATO 60th Anniversary Conference “Building Integrity” Monterey, CA: Thursday evening 26 February 2009 Sponsored by Transparency International (U.K.) and the Naval Postgraduate School George R. Lucas, Jr Professor of Philosophy Director of Navy and National Programs The Stockdale Center for Ethical Leadership US Naval Academy (Annapolis MD, USA) On Tuesday 4 November 2008,...»

«July, 2015 Vita Allen W. Wood Academic Address: Department of Philosophy Sycamore Hall 026 Indiana University 1033 E. Third St., Sycamore Hall 026 Bloomington, IN 47405-7005 Telephone +1 (812) 855-9503 Fax +1 (812) 855-3777 E-mail: awwood@indiana.edu, allenw@stanford.edu, wood2420@gmail.com http://www.indiana.edu/~phil/people/allen-wood.shtml http://philosophy.stanford.edu/profile/Allen+Wood/ http://en.wikipedia.org/wiki/Allen_W._Wood Education: B. A., Reed College, 1964; Major: Literature and...»

«Student Handbook College of Urban and Public Affairs Nohad A. Toulan School of Urban Studies and Planning Doctor of Philosophy: Urban Studies (Ph.D.) Doctor of Philosophy: Urban Studies: Regional Science (Ph.D.) Academic Year 2012-2013 For additional information, contact: Nohad A. Toulan School of Urban Studies and Planning Urban Center Building Room 350 503-725-5477 susp@pdx.edu Latest Revision Date: 10/17/2012 Urban Studies Ph.D. Handbook 2012 Page 2 I. CONTENTS College and School Overview I....»

«NOT NOTHINGNESS: PETER BROOK’S ‘EMPTY SPACE’ AND ITS ARCHITECTURE Negin Djavaherian School of Architecture McGill University, Montreal, Canada August 2012 A thesis submitted to McGill University in partial fulfilment of the requirements of degree of Doctor of Philosophy © Negin Djavaherian, 2012 To my parents, Parvaneh and Hassan, and to Alois ABSTRACT The thesis explores architectural potential and experience in the theatre of Peter Brook (1925-). The importance of his thought, writings...»

«Information Integration for Software Maintenance and Evolution Malcom Bernard Gethers II Elizabeth City, North Carolina Master of Science, University of North Carolina at Greensboro, 2007 Bachelor of Science, High Point University, 2005 A Dissertation presented to the Graduate Faculty of the College of William and Mary in Candidacy for the Degree of Doctor of Philosophy Department of Computer Science The College of William and Mary August, 2012 APPROVAL PAGE This Dissertation is submitted in...»

<<  HOME   |    CONTACTS
2016 www.dissertation.xlibx.info - Dissertations, online materials

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.