FREE ELECTRONIC LIBRARY - Dissertations, online materials

Pages:     | 1 |   ...   | 8 | 9 || 11 | 12 |   ...   | 19 |

«Learning Implicit User Interest Hierarchy for Web Personalization by Hyoung-rae Kim A dissertation submitted to Florida Institute of Technology in ...»

-- [ Page 10 ] --

These specially-formatted terms have more emphasis in the page than those that are formatted normally. A document that emphasize a search term as a bold format will be more related to the search term than a document that has the term in a normal format without emphasis. If a term is emphasized by the use of two or more types of special formatting we assign a priority in the order of title, bold, and italic.

The significance for each type of format is estimated based on the probability,

–  –  –

more significant/important if the format type has lower probability of occurring in a web page. Lower probability P(Eti) of a matching term, ti, indicates the term is more significant.

The probability is estimated by:

–  –  –

5.2.2. Scoring a Term Uniform Scoring P(Dti, Lti, Fti, Eti) is the joint probability of all four characteristics occurring in term ti

-- Dti is the depth of a node where a term belongs to, Lti is the length of a term, Fti is the frequency of a term, and Eti is the emphasis of a term. Assuming independence among the

four characteristics, we estimate:

–  –  –

Smaller log likelihood means the term match is more significant. In information theory (Mitchell et al., 1997), –log2 P(e) is the number of bits needed to encode event e, hence using –log2, instead of log, in Eq. 1 yields the total number of bits needed to encode the four characteristics. The uniform term scoring (US) function for a personalized term score

is formulated as:

–  –  –

which we use as a score for term ti. Larger Sti means the term match is more significant. Weighted Scoring The uniform term scoring function uses uniform weights for each characteristic. It is possible that some characteristics are more important than the others. For instance, the depth of a node (D) may be more significant than frequency (F). Therefore, we attempted to differentiate the weights for each characteristic. F and E characteristics represent the relevance of a web page. Longer terms (greater L) represent a user’s interest more specifically; however, longer terms do not mean that a user is more interested in that term.

Therefore, those L, F, and E characteristics do not fully reflect a user’s interests. It is more reasonable to emphasize D characteristic more than other characteristics, because D (depth) represents the strength of a user’s interests.

A simple heuristic is used in this paper that assumes the depth of a node is at least two times more important than other characteristics. Based on this heuristic, the weights w1=0.4, w2=0.2, w3=0.2, and w4=0.2 are assigned. The weighted term scoring (WS)

function for a personalized term score is formulated as:

–  –  –

5.2.3. Scoring a Page The personal page score is based on the number of interesting terms and how interesting the terms are in a web page. If there are many terms in a web page that are interesting to a user, it will be more interesting to the user than a web page that has fewer interesting terms. If there are terms in a pages that are more interesting to a user, the web page will be more interesting to the user than a web page that has less interesting terms.

The personalized page scoring function for a web page Spj adds all the scores of the

terms in the web page and can be formulated as:

–  –  –

distinct term. The time complexity of scoring a page is O(nt), where nt is the number of “distinct” terms in a web page. D and L characteristics can be calculated during the preprocessing stage of building a UIH. F and L characteristics can be calculated while extracting distinct terms from a web page.

5.2.4. Incorporating Public Page Score Personal page scoring is not sufficient for some search engines. The success of using public scoring in popular search engines, such as Google’s PageRank, indicates the importance of using a public page-popularity measure to determine what page a user is interested in. Many existing methods determine the public popularity of a page by determining the number pages that link to it (Haveliwala, 2002; Jeh and Widom, 2003).

Many collaborative filtering approaches also use the popularity of a web page for recommendation (Eirinaki et al., 2004; Kim et al., 2004). Section 6.2.3 described our personal web page scoring function. We wish to incorporate the public scoring into our page scoring function so both the popularity of a page and individual interests are taken into account. We use the rank order returned by Google as our public score. GOOGLEpj is

–  –  –

Google’s public page scoring function has been found in a recent study (Delaney, 2004) to be very effective at returning pages that users find interesting. The use of Google’s page rank as a public page score makes our experimental comparison with Google clearer, because any improvement in the ordering is due to the contribution of our personal page score. For a given web page, pj, the personal and public page score (PPS) equation can be

written as:

–  –  –

of GOOGLEpj, and R(Spj) is the rank of a web page, pj, with the personal page score, Spj. If the function R returns the rank in an ascending order, more interesting web pages will have lower PPS values. Therefore, the function R reverses the rank. The personal page score and the public page score are weighted by the value of the constant c. In this paper, both functions are weighed equally: c = 0.5.

5.3. Experiments In our experiments data were collected from 11 different users. Of the 11 human subjects, 4 were undergraduate students and 7 were graduate students. In terms of major, 7 were Computer Sciences, 2 were Aeronautical Sciences, 1 was Chemical Engineering, and 1 was Marine Biology. We asked each volunteer to submit 2 search terms that can contain any Boolean operators. Some examples of the search terms used are {review forum +"scratch remover", cpu benchmark, aeronautical, Free cross-stitch scenic patterns, neural networks tutorial, DMC(digital media center), artificial intelligence, etc.} Then, we used Google to retrieve 100 related web pages for each search term. Those collected web pages were classified/labeled by user based on two categories: interest and potential interest. The data set for interest has more authority because it indicates direct relevance to the current search query. The data set for potential interest reflects the user’s general personal interests, which might not be directly relevant at the time of query. The areas of a user’s potential interests often go beyond the boundary of a search term’s specific meaning. Sometimes users find interesting web pages while searching for different subjects. These unexpected results help the user as well. Therefore, it is also a contribution if a method shows higher precision in finding potentially interesting web pages.

In order to build UIHs, we also requested each volunteer to submit the web pages in their bookmarks. If there were fewer than 50 web pages in their bookmark, we asked them to collect more pages up to around 50. The minimum number of web pages was 38 and the maximum number was 72. Web pages from both bookmarks and Google were parsed to retrieve only texts. The terms (words and phrases) in the web pages are stemmed and filtered through the stop list (Rasmussen, 1992). A phrase-finding algorithm (Kim and Chan, 2004) was used to collect variable-length phrases. Words in selection boxes/menus were also removed because they did not appear on the screen until a user clicks on them.

Unimportant contexts such as comments and style were also removed. Web pages that contain non-text (e.g., “.pdf” files, image files, etc.) were excluded because we are handling only text. To remove any negative bias to Google, broken links that were still ranked high erroneously by Google were excluded from the test, since those web pages will be scored “Poor” by the user for sure. The data used in this study is accessible at http://cs.fit.edu/~hkim/dissertation/dissertation.htm. Microsoft.NET language was used, and the program ran on an Intel Pentium 4 CPU.

We attempted to remove any negative bias to Google. Those web pages that contain non-text (e.g., “.pdf” files, image files, etc.) were excluded because we are handling only texts. Furthermore, the broken links that were still ranked high erroneously by Google were excluded from the test, since those web pages will be scored “Poor” by user for sure.

We categorized the interest as “Good”, “Fair”, and “Poor”; the potential interest is categorized as “Yes” and “No”. A web page was scored as “Good”, “Fair”, and “Poor” depending on each individual’s subjective opinion based on the definition of interest. It was also marked as “Yes” or “No” based on the user’s potential interest. We evaluated a ranking method based on how many interesting (categorized as “Good”) or potentially interesting web pages (categorized as “Yes”) the method collected within a certain number of top links (Bharat and Mihaila, 2001) (called “Top link analysis”). It is realistic in a sense many information retrieval systems are interested in the top 10 or 20 groups.

Precision/recall graph (van Rijsbergen, 1979) is used for evaluation as well (called “precision/recall analysis”). It is one of the most common evaluation methods in information retrieval. However, traditional precision/recall graphs are very sensitive to the initial rank positions and evaluate entire rankings (Croft and Das, 1989). The formula for

precision and recall were:

Precision = Number of “Good” or ”Yes” pages retrieved in the set / Size of the set Recall = Number of “Good” or ”Yes” pages retrieved in the set / Number of “Good” or “Yes” pages in the set

where the “set” is the group of top ranked web pages. In this paper we study five groups:

Top 1, 5, 10, 15, and 20.

5.4. Results and Analysis We compare four ranking methods: Google, Random, US, and WS. Google is the ranking provided by Google. Random arbitrarily ranks the web pages. US and WS are the two proposed methods based on a personal UIH learned from a user’s bookmarks. For Random, US, and WS, the top 100 pages retrieved by Google are re-ranked based on the method. Each method is analyzed with two data sets: a set of web pages chosen as interesting and another chosen as potentially interesting by the users. Top link analysis, precision/recall analysis, the sensitivity of personal score weight c (Section 6.2.4) are discussed.

5.4.1. Interesting Web Page Top Link Analysis Web search engine users are usually interested in the links ranked within top 20 (Chen and Sycara, 1998). We compare each method only with Top 1, 5, 10, 15, and 20 links on the interesting web page data set and present the results in Table 20. The first column is the methods; the next five columns present the precision values of each method with respect to the five Top links. The values in each cell are the average of 22 search terms’ precision values. High precision value indicates high accuracy/performance.

Precision values higher than Google’s are formatted as bold and the percentage of improvement is within parentheses. The highest precision value in each column is underscored.

The results show that our WS method was more accurate than Google in three Top links (Top 10, 15, and 20) and the percentages of improvements are at least 13%, while WS ties with Google for Top 1 and Top 5. In terms of the highest precision, WS showed highest performance in four columns; Google showed in only two columns and the values are equal to WS. Compared to US, WS showed higher precision in four (Top 1, 5, 15 and

20) of the five columns. Random was the lowest as we expected, showing the lowest precisions in all five columns. These results indicate that WS achieves the highest overall precision.

We also wanted to know which search terms yielded higher precision with WS than with Google and analyzed the precision with respect to each individual search terms.

Out of 22 search terms (11 users × 2 search terms), WS achieved higher precision for 12 search terms (55%), Google did for 8 search terms (36%), and they were even for 2 search terms (9%). Since the UIH is built from a user’s bookmarks, we analyse the bookmarks to understand the search terms that did not perform well using WS. When we compare the bookmarks with the “good” retrieved web pages, we found that they are unrelated. For example, a volunteer used “woodworking tutorial” as a search term, but he never bookmarked web pages related to that term. This implies bookmarks are useful for building user profiles, but they are not sufficient. We will discuss enhancements in the conclusion. Statistical Significance In order to see if this improvement is statistically significant we conducted t-Test (paired two samples for means) between two groups of individual search terms with Google and WS for each Top link. There was no statistically significant difference between WS and Google for any Top link with 95% confidence (P=1and t=2.079 for Top 1; P=1and t=2.079 for Top 5; P=0.328 and t=2.079 for Top 10; P=0.204 and t=2.079 for Top 15;

P=0.147 and t=2.079 for Top 20).

To understand why our improvements are not statistically significant, we analyze the variance in the precision values. In Figure 22 we plot the average and the standard deviation (SDs) of 22 search terms’ precisions from Google with respect to the five Top links. The x-axis shows the Top links and y-axis represents the average and the SD of precision values. The dots in the middle of vertical bars are the averages and the bars themselves represent the SD values. Variance was large for Top 1 and decreases when more links were considered.

Pages:     | 1 |   ...   | 8 | 9 || 11 | 12 |   ...   | 19 |

Similar works:

«Activated atmosphere case hardening of steels by Xiaolan Wang A Dissertation Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUTE in partial fulfillment of the requirements for the Degree of Doctor of Philosophy in Material Science and Engineering Dec 2011 Approved: _ Prof. Richard D. Sisson Jr, Advisor Mr. Zbigniew Zurecki, Co-advisor Prof. Diran Apelian, Committee Member Prof., Makhlouf M. Makhlouf, Committee Member Prof. Jianyu Liang, Committee Member i ACKNOWLEDGMENTS I would...»


«PREPARATION AND CHARACTERIZATION OF POLYMER COMPOSITES CONTAINING GOLD NANOPARTICLES a dissertation submitted to the department of chemistry and the graduate school of engineering and science of bilkent university in partial fulfillment of the requirements for the degree of doctor of philosophy By Eda Yılmaz September, 2011 I certify that I have read this thesis and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of doctor of philosophy. Prof....»

«Bedu-Addo, Paul Kobina Annan (2010) Work-family interference among Ghanaian women in higher status occupations. PhD thesis, University of Nottingham.Access from the University of Nottingham repository: http://eprints.nottingham.ac.uk/11529/1/ready_for_binding.pdf Copyright and reuse: The Nottingham ePrints service makes this work by researchers of the University of Nottingham available open access under the following conditions. This article is made available under the University of Nottingham...»

«MONITORING AND ANALYSIS SYSTEM FOR PERFORMANCE TROUBLESHOOTING IN DATA CENTERS A Thesis Presented to The Academic Faculty by Chengwei Wang In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the School of Computer Science Georgia Institute of Technology December 2013 Copyright c 2013 by Chengwei Wang MONITORING AND ANALYSIS SYSTEM FOR PERFORMANCE TROUBLESHOOTING IN DATA CENTERS Approved by: Professor Karsten Schwan, Dr. Matthew Wolf Committee Chair School of...»

«NAME: DRAGICA VUJADINOVIC Phone: (+381) 65 82 72 607 Mailing address: Email: dragicav@ius.bg.ac.rg 11 000 Belgrade Hilandarska 24/8. EDUCATION PhD University of Belgrade, Faculty of Law, September 24, 1986. Dissertation: “The Budapest School Theory of Radical Needs” Committee: Prof. Dr. Zoran Vidaković (chair), Prof. Dr. Miroslav Pečujlić, Prof. Dr. Vladimir Milić MS University of Belgrade, Faculty of Philosophy, June 22, 1982. Thesis: “Conception of Dialectic Method by Hegel and...»

«The Ghost of Socrates The Ghost of Socrates: Exploring Philosophical Issues in Information Systems By John M. Artz, PhD Associate Professor of Information Systems The George Washington University These chapters may be freely downloaded and distributed as long as the person doing the downloading or distribution does not modify it or claim credit for any part of it. © 2010 John M. Artz, PhD 1 The Ghost of Socrates Table of Contents Introduction The Ghost of Socrates Knowledge Management Concept...»


«MODELING, DESIGN AND ENERGY MANAGEMENT OF FUEL CELL SYSTEMS FOR AIRCRAFT A Dissertation Presented to The Academic Faculty by Thomas Heenan Bradley In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the School of Mechanical Engineering Georgia Institute of Technology December, 2008 Copyright © Thomas Heenan Bradley 2008 MODELING, DESIGN AND ENERGY MANAGEMENT OF FUEL CELL SYSTEMS FOR AIRCRAFT Approved by: Dr. David E. Parekh, Advisor Dr. William J. Wepfer School of...»

«1 Curriculum Vita (Marshella) Shelly Sheats Harkness Associate Professor Home Address: Secondary Mathematics Education 429 N. Pennsylvania St. #607 University of Cincinnati Indianapolis, IN 46204 Teachers College 511C (317) 426-2311 Cincinnati, OH 45221-0002 (513) 556-3743 Co-Editor School Science and Mathematics Education Doctor of Philosophy (Ph.D.), Curriculum and Instruction/Mathematics Education. Indiana University. Bloomington, Indiana. (2002) K-12 Gifted and Talented Endorsement. Indiana...»

«Why I Choose a Vocational High School: The Study of Elicited Expectation and Educational Decision Parita Suaphan Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy under the Executive Committee of the Graduate School of Arts and Science COMLUMBIA UNIVERSITY 2015 ©2014 Parita Suaphan All rights reserved ABSTRACT Why I Choose a Vocational High School: A Study of Elicited Expectation and Educational Decision Parita Suaphan The decision between vocational...»

«The Late Agnostic: William Bronk as Religious Poet Thesis submitted for the degree of Doctor of Philosophy at the University of Leicester by James Marian Bober Department of English University of Leicester May 2014 The Late Agnostic: William Bronk as Religious Poet James Marian Bober Abstract This thesis examines the poetry of William Bronk (1918-99). Through close readings of individual texts and broader thematic explorations it demonstrates that Bronk can and should be viewed as a religious...»

<<  HOME   |    CONTACTS
2016 www.dissertation.xlibx.info - Dissertations, online materials

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.