WWW.DISSERTATION.XLIBX.INFO
FREE ELECTRONIC LIBRARY - Dissertations, online materials
 
<< HOME
CONTACTS



Pages:   || 2 | 3 | 4 | 5 |   ...   | 12 |

«by MICHAŁ WRÓBLEWSKI Supervisor JERZY STEFANOWSKI, Assistant Professor Referee ROBERT SUSMAGA, Assistant Professor MASTER THESIS Submitted in ...»

-- [ Page 1 ] --

A HIERARCHICAL WWW PAGES CLUSTERING

ALGORITHM BASED ON THE VECTOR SPACE MODEL

by

MICHAŁ WRÓBLEWSKI

Supervisor

JERZY STEFANOWSKI, Assistant Professor

Referee

ROBERT SUSMAGA, Assistant Professor

MASTER THESIS

Submitted in partial fulfillment of the requirements

for the degree of Master of Science, Poznań University of Technology, Poland July 2003

-1-

HIERARCHICZNY ALGORYTM GRUPOWANIA STRON

WWW WYKORZYSTUJĄCY MODEL PRZESTRZENI

WEKTOROWEJ

MICHAŁ WRÓBLEWSKI

Promotor dr hab. inż. JERZY STEFANOWSKI Recenzent dr inż. ROBERT SUSMAGA

PRACA MAGISTERSKA

Specjalność Inżynieria Oprogramowania, Instytut Informatyki Politechnika Poznańska, Poznań lipiec 2003

-2-

ABSTRACT

This master thesis examines possibility of adaptation of hierarchical cluster analysis algorithm AHC (Agglomerative Hierarchical Clustering) based on the vector space model to Web search results clustering. We show here problems that arise when trying to use this algorithm in this application domain, where it must process textual data sets, containing very limited information.

We present here, how these problems may be overcome by using adequate techniques of pre-processing (use of phrases in the vector space model) and post-processing. In particular, we propose several methods of pruning of output dendrogram created by AHC, based on descriptions of clusters being elements of that tree.

In order to perform a verification of presented version of the algorithm, it has been implemented as a module of the Carrot2 system. Moreover, we have carried out an evaluation of quality of results given by the algorithm.

STRESZCZENIE

Niniejsza praca magisterska bada możliwość adaptacji hierarchicznego algorytmu analizy skupień AHC (Agglomerative Hierarchical Clustering) opartego na modelu przestrzeni wektorowej do grupowania wyników zapytań do serwisów internetowych. Pokazujemy tu problemy, jakie powstają w momencie próby użycia tego algorytmu do tego typu zastosowań, gdzie musi on działać na tekstowych zbiorach danych, zawierających bardzo ograniczone informacje.

Przedstawiamy tu, jak można przezwyciężyć te problemy za pomocą odpowiednich technik wstępnego przetwarzania danych wejściowych (zastosowanie fraz w modelu przestrzeni wektorowej) i końcowego przetwarzania wyników algorytmu. W szczególności proponujemy kilka technik redukcji zbędnych węzłów wyjściowego dendrogramu tworzonego przez AHC w oparciu o opisy grup będących jego elementami.

W celu dokonania praktycznej weryfikacji przedstawionej przez nas wersji algorytmu, została ona zaimplementowana jako moduł systemu Carrot2. Ponadto przeprowadzono też eksperyment mający na celu ocenę jakości wyników tworzonych przez algorytm.

-3ACKNOWLEDGMENTS In the course of creating this thesis, I have come upon many difficulties, which I couldn't certainly solve easily or even at all using just my own skills. Finding solutions to these problems was possible only thanks to help from other persons. First of all, I would like to mention Mr. Jerzy Stefanowski, to whom I am deeply grateful for arousing my interest in Web search results clustering and important scientific suggestions. I give thanks also to Mr.

Dawid Weiss, author and manager of the Carrot2 project, whose help in solving technical problem was invaluable. Without his hard work on the Carrot2, and earlier also on the Carrot system, this thesis would either simply not exist or it would take incomparably much more effort to complete it.

I would like here to express my gratitude to other Carrot2 developers – Paweł Kowalik and Stanisław Osiński, with whom I had the pleasure to cooperate for a really long year. Working together, we have participated in creation of a really interesting system, with great research perspectives for the future.

I wish to thank people, whose time and effort have made evaluation of the results given by the algorithm possible – Jacek Błaszczyk, Irmina Masłowska, Stanisław Osiński, Jerzy Stefanowski and Dawid Weiss. Mrs. Irmina Masłowska deserves particular thanks for her valuable remarks on hierarchical results quality measures and giving me the ability to compare results of AHC and her clustering method.

Finally, I would like to express how important for me was the help of all the nearest people, who supported me during course of this thesis. It is hard for me to say, how much I owe you.

-4CONTENTS

1. INTRODUCTION

1.1. MOTIVATIONS

1.2. THE SCOPE AND GOAL OF THE THESIS

1.3. SHORT THESIS SUMMARY

1.4. THESIS LAYOUT

2. BASIC CONCEPTS

2.1. TYPES OF INTERNET SEARCH ENGINES

2.2. INDEXING SERVICES

2.2.1. COLLECTING AND STORING THE INFOMATION

2.2.2. MATCHING THE USER'S QUERY

2.3. AUTOMATIC SEARCH RESULTS CLUSTERING

2.4. EXPERIMENTAL SOLUTIONS





2.5. THE CARROT2 SYSTEM

3. THE VECTOR SPACE MODEL IN INFORMATION RETRIEVAL25

3.1. BASIC CONCEPTS

3.1.1. DOCUMENTS REPRESENTATION

3.1.2. STRONG AND WEAK POINTS OF THE MODEL

3.2. ASSIGNING WEIGHTS TO TERMS

3.2.1. METHODS OF TERMS WEIGTHING

3.2.2. TERMS WEIGHTING ISSUES SPECIFIC FOR SEARCH RESULTS

CLUSTERING

3.3. DOCUMENT SIMILARITY MEASURES

3.4. DEALING WITH NATURAL LANGUAGE

3.4.1. STOPWORDS

3.4.2. STEMMING

3.4.3. SYNONYMS

3.5. USING PHRASES IN THE VECTOR SPACE MODEL

3.5.1. REASONS TO USE PHRASES

3.5.2. PHRASES EXTRACTION

3.5.3. POSSIBLE IMPROVEMENTS OF PHRASES EXTRACTION................ 38

-5DOCUMENTS CLUSTERING ALGORITHMS

4.1. OVERVIEW OF MAIN CLUSTERING ALGORITHMS

4.1.1. K-MEANS

4.1.2. OTHER METHODS BASED ON THE VECTOR SPACE MODEL......... 42 4.1.3. SUFFIX TREE CLUSTERING

4.1.4. LINGO

4.2. THE AGGLOMERATIVE HIERARCHICAL CLUSTERING (AHC)

ALGORITHM

4.2.1. ALGORITHM OUTLINE

4.2.2. LINKAGE METHODS

4.3. FORMING CLUSTERS FROM RESULTS OF AHC

4.3.1. INTITIAL CLUSTERS FORMATION

4.3.2. DENDROGRAM POST-PROCESSING

4.3.3. CREATING DESCRIPTIONS OF CLUSTERS

4.3.4. REMOVING REDUNDANT DESCRIPTIONS

4.3.5. DESCRIPTIONS-BASED CLUSTERS POST-PROCESSING.................. 54 4.3.6. FINAL POST-PROCESSING

5. IMPLEMENTATION OF THE AHC ALGORITHM IN THE

CARROT2 SYSTEM

5.1. THE CARROT2 SYSTEM ARCHITECTURE

5.1.1. BASIC IDEAS

5.1.2. ELEMENTS OF SYSTEM ARCHITECTURE

5.2. AHC MODULE AS A PART OF CARROT2

5.3. AHC MODULE DESIGN AND IMPLEMENTATION

5.3.1. GENERAL DESIGN AND IMPLEMENTATION ISSUES

5.3.2. EFFICIENCY ISSUES

6. EVALUATION OF THE AHC ALGORITHM

6.1. METHODS OF CLUSTERING RESULTS EVALUTION

6.1.1. EXISTING WEB MINING METHODS

6.1.2. OUR CLUSTERING RESULTS QUALITY MEASURES PROPOSAL... 70 6.1.3. AHC-SPECIFIC QUALITY MEASURES

6.2. USER EVALUATION OF THE AHC ALGORITHM

6.2.1. THE EXPERIMENT

6.2.2. RESULTS

-6EXPERT EVALUATION OF THE AHC ALGORITHM - INFLUENCE OF

AHC PARAMETERS AND INPUT DATA ON THE RESULTS

6.3.1. INFLUENCE OF QUERY BREADTH

6.3.2. INFLUENCE OF THE INPUT SNIPPETS NUMBER

6.3.3. INFLUENCE OF PHRASES USE

6.3.4. INFLUENCE OF GROUPS CREATING THRESHOLD

6.3.5. INFLUENCE OF POST-PROCESSING METHODS

6.3.6. INFLUENCE OF TERMS WEIGHTING METHODS

6.4. EXPERT EVALUATION OF THE AHC ALGORITHM - COMPARISON OF

AHC AND OTHER CLUSTERING ALGORITHMS

7. CONCLUSIONS

7.1. SCIENTIFIC CONTRIBUTIONS

7.2. POSSIBLE FUTURE RESEARCH

8. BIBLIOGRAPHY

–  –  –

1.1. MOTIVATIONS From the beginnings of civilization, man has always gathered knowledge. First (and the most important) invention in this area was writing, which enabled persistent storage of information. In ancient times and then Middle Ages great libraries like ones in Alexandria or Neneveh, storing thousands of books, were often the most important centers of civilization.

When computers, and then the Internet have appeared, our abilities of gathering information have dramatically increased. The number of documents available in Internet grows exponentially since its origin. This is mostly due to the fact that automatic generation of Web documents is possible. Databases of world's currently largest Web search service, Google contains now about 3 billion of indexed Web pages, while just two years ago it contained "only" 1.3 billion [Google]. And even the biggest search services cover only a dozen or so percent of the Internet.

Moreover, the accessibility of information has considerably improved. Nowadays, one does not longer need to travel to Athens or Alexandria to find needed information – all you need to do is just to get on the Internet, type in a few words, click something and wait a few seconds for the result (at least so would claim the marketing wizards of the Internet search services).

Unfortunately, as the size of stored data grows, it is getting even harder and harder to find the relevant information that one is looking for. In a small library the librarian may know where to find each book, but dealing with quantity of documents such as mentioned in the first paragraph is beyond human capabilities. Moreover, as almost anyone can publish almost anything in the Web (which is therefore often concerned as a "Big Mess"), only a small part of information contained there is really valuable. This creates further challenge as user should get both all and only the information relevant to his query.

All above problems can be formulated as a single task of finding the set of documents on the Web relevant to a given user query. This is so called Web Search Problem [Selberg 99].

As a response to it, a special kind of Web service – search engines were created, which provide some – but still far from perfect – solution. Search engines are now among most popular net services, and are often points from which user starts to surf the Internet, so that one may call them "gates to the Web". A survey made recently by Search Engine Watch Web site revealed that they are the top information resource that Americans use when seeking answers (used 32 percent of the time) [SearchEngineWatch]. The most commonly used services are nowadays Yahoo [Yahoo], MSN [MSN], AltaVista [AltaVista] and Google [Google], receiving 200 millions queries daily and considered usually as the best one available.

In such typical service, the user types in his query, describing what she or he is looking for and the service returns a list of documents more or less relevant to this query. Although

-8algorithms and techniques used for finding this data may be different, user interaction in most of today search engines is based on two main models: query list (for the input) and ranked list (for the output). Unfortunately, it is still not possible to ask most of the services questions in the natural language (and it is even less probable to get a valid answer to such type of query).

So the query list consists of words or phrases in some natural language (in almost all cases it is English), linked by Boolean operators. The answer comes as a ranking list of documents matching the query, ranked (ordered) so that the most relevant documents appear on the top of it.

Usually here a problem arises, and namely that the list is too long and contains many pages that are not relevant to the query, but rather concern many different topics. Then user must sift through hundreds or thousands of documents to find the few ones that he has been looking for. But, as human patience is limited, usually users after taking a look at the few top documents give over browsing ranking list (quite often not finding the information that they were looking for). According to another survey made by the Search Engine Watch, about 75% of Internet users report searching to be "quite frustrating", and 33% of them finds it "very frustrating". This fact has even been given a name – "search rage". The same survey shows that on average users "search rage" if they don't find information that they were looking for in 12 minutes [SearchEngineWatch].

–  –  –

As an attempt to improve the quality of search a new paradigm, called search results clustering, was formulated. It does not focus on the process of searching for the results (it may be, and usually is done completely independent of the searching algorithm), but rather on their presentation, so one may say that it is a more "user-centered" approach. Instead of showing the user a long list of results itself, it is split in groups related to sub-topics that occur in the set of documents returned as a result of the query. Then these groups, with a descriptions that enable users to determine their content, are presented. Groups may eventually contain also smaller, more specific groups which may contain further groups etc. –

this is called hierarchical clustering. There exist two main benefits of clustering:

• the number of groups is much smaller than the number of documents contained in them, so it is easier and faster to browse the results. This fact has been put into the

-9marketing slogan of the Vivisimo clustering engine: "No more wasted time scrolling and backtracking through long one-dimensional lists".



Pages:   || 2 | 3 | 4 | 5 |   ...   | 12 |


Similar works:

«The Design of a Stitched, High-dynamic Range CMOS Particle Sensor Master of Science Thesis Jiaqi Zhu (4233476) July 2014 Faculty of Electrical Engineering, Mathematics and Computer Science Delft University of Technology The project of this thesis was in cooperation with Caeleste CVBA. Their support and assistance are hereby gratefully acknowledged. The Design of a Stitched, High-dynamic Range CMOS Particle Sensor Master of Science Thesis Jiaqi Zhu (4233476) Supervisor: Prof. dr. ir. Albert J.P....»

«Marius Cătălin Iordan mci@princeton.edu Ÿ༉ Princeton Neuroscience Institute, Princeton, NJ 08544 Ÿ༉ www.MariusCatalinIordan.com ACADEMIC APPOINTMENTS Postdoctoral Research Associate 2016 – present Princeton Neuroscience Institute, Princeton University Advisors: Jonathan D. Cohen and Dan Osherson EDUCATION Ph.D., M.S., Computer Science 2009 – 2016 Stanford University Advisors: Fei-Fei Li and Diane M. Beck (University of Illinois) Degree Focus: Cognitive and...»

«Bay Area Scientists in Schools Presentation Plan Lesson Name: A Whole New World of DNA and Proteins: 7th Grade Version_ Presenters: Grade Level: 7th Standards Connection(s): Life Science: 1.c the nucleus is the repository for genetic information in plant and animal cells. 2.e Students know DNA (deoxyribonucleic acid) is the genetic material of living organisms and is located in the chromosomes of each cell. Next Generation Science Standards MS-LS1Conduct an investigation to provide evidence...»

«Genes 2011, 2, 48-58; doi:10.3390/genes2010048 OPEN ACCESS genes ISSN 2073-4425 www.mdpi.com/journal/genes Review Genetic Diversification by Somatic Gene Conversion Kohei Kurosawa and Kunihiro Ohta * Department of Life Sciences, Graduate School of Arts and Sciences, The University of Tokyo, Komaba 3-8-1, Meguro-ku, Tokyo 153-8902, Japan; E-Mail: cc107708@mail.ecc.u-tokyo.ac.jp * Author to whom correspondence should be addressed; E-Mail: kohta@bio.c.u-tokyo.ac.jp; Tel.: +81-3-5465-8834; Fax:...»

«Photo on the Cover Page – Rubbing of the Luoyang Jingjiao Pillar (China, 9th Century) Conference Organizers: Dr. Li Tang 唐莉 Univ.-Prof. Dr. Dietmar W. Winkler Together with: Ulrike Grill Christine Hofer-Ranftl Gabriel Rabo 2 Saturday, June 18, 2016 Registration 16:00 18:30 Registration Lobby of St. Virgil 18:00 19:00 Dinner 19:00 Meet & Greet (Cafeteria of St. Virgil) Conference Section Sunday, June 19, 2016 09:00 – 09:10 Welcome & Anouncements Section I: Persia and Central Asia...»

«I* * V CXTeC^ C^^J^ffV^l^, L* CONSEIL * DE L'EUROPE PUBDGIV024 Architectural heritage Reports and Studies, No. 15 Mining engineering monuments as a cultural heritage Report of the Bochum Colloquy (Federal Republic of Germany) Strasbourg 1989 *** * * * * COUNCIL * * CONSEIL OF EUROPE *** DE L'EUROPE Mining engineering monuments as a cultural heritage A Council of Europe Colloquy organised in co-operation with the Deutsches Bergbau-Museum Bochum (Federal Republic of Germany), 5-8 September 1988...»

«Aegon Group Primary Credit Analyst: Mark Button, London (44) 20-7176-7045; mark.button@standardandpoors.com Secondary Contact: Patrick C Wong, New York (1) 212-438-1936; patrick.wong@standardandpoors.com Table Of Contents Rationale Outlook Base-Case Scenario Company Description Business Risk Profile Financial Risk Profile Other Assessments Factors Specific To The Holding Company Related Criteria And Research WWW.STANDARDANDPOORS.COM/RATINGSDIRECT MARCH 5, 2015 1 1387618 | 300000817 Aegon Group...»

«Inni e cantici spirituali – Cassetta n° 4 Inni e cantici spirituali cassetta n° 4 121. Amor Divin mi chiama 1. Amor Divin m chiama A consacrarGli il cuor, Spegnerei io la brama, La voce del Signor?Coro: Oh! Senza più tardare Ti voglio dare il cor Voglio che Tu possegga La vita mia, Signor.2. Misero, nel peccato, Giacevo in schiavitù. Per esser liberato Nulla trovavo più! Coro: Oh! Senza più tardare. 3. La luce m’è venuta E rischiarò il sentier. E quasi all’insaputa Non fui più...»

«Rock Chick Reawakening By Kristen Ashley A novella for 1,001 Dark Nights Prologue Building Castles Daisy “You’re a lunatic!” “You didn’t think that when I had my mouth wrapped around your dick!” “That’s because you couldn’t use it to talk!” “Kiss my ass!” “Not anymore, babe. We’re done.” “Like I care.” “You’ll care when you got no one’s dick to suck to pay your cable bill.” My eyes were closed. I was lying alone in my dark room, on my back in my twin...»

«THE SOCIAL LIFE OF JAPAN’S ADOLECHNIC Todd Joseph Miles Holden INTRODUCTION This chapter addresses key themes of this book by offering insights into the world of Japanese adolescents. As the hybrid term adolechnic indicates, most Japanese youth make extensive use of communication technology. But do such technologies affect the rhythms and nature of their existence? For instance, does technological mediation enable youth to inhabit plural worlds? Does it bear on their socialization? And what...»

«Phonetic Factors in Transliteration of Biblical Proper Names into Greek and Latin Jože Krašovec The current forms of biblical proper names in various European languages have been influenced by the phonetic changes necessitated by their transfer and transliteration from Hebrew and Aramaic into Greek and Latin, from which sources other languages borrowed in their turn. By way of Bible translation into Greek, Latin, and other ancient languages many biblical proper names have passed into general...»

«Wojciech Nowiński Handball Goalkeeperthe first steps. 1 Cover project Tomasz Popowicz Drawings Magdalena Czeszyk Graphic design Tomasz Popowicz Pictures Rafal Zakrzewski Tomasz Popowicz 2 1. Introduction 2. The rules of the game for a goalkeeper 3. Terminology in handball 4. Saving a goalbasic rules for a goalkeeper 5. One-to-one: goalkeeper and a shooter 5.1. How to understand a shooter?2. How to save throws from different distances and at different angles? 3. How to save a 7-meters throw? 6....»





 
<<  HOME   |    CONTACTS
2016 www.dissertation.xlibx.info - Dissertations, online materials

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.