A HIERARCHICAL WWW PAGES CLUSTERING
ALGORITHM BASED ON THE VECTOR SPACE MODEL
by MICHAŁ WRÓBLEWSKI
Supervisor: JERZY STEFANOWSKI, Assistant Professor
Referee: ROBERT SUSMAGA, Assistant Professor
MASTER THESIS
Submitted in partial fulfillment of the requirements
for the degree of Master of Science, Poznań University of Technology, Poland, July 2003
(Polish title page, translated)
A HIERARCHICAL WWW PAGES CLUSTERING ALGORITHM
USING THE VECTOR SPACE MODEL
MICHAŁ WRÓBLEWSKI
Supervisor: dr hab. inż. JERZY STEFANOWSKI
Referee: dr inż. ROBERT SUSMAGA
MASTER THESIS, specialization: Software Engineering, Institute of Computing Science, Poznań University of Technology, Poznań, July 2003
We show how these problems can be overcome by using adequate pre-processing techniques (the use of phrases in the vector space model) and post-processing techniques. In particular, we propose several methods of pruning the output dendrogram created by AHC, based on the descriptions of the clusters that are the elements of that tree.
In order to verify the presented version of the algorithm in practice, it has been implemented as a module of the Carrot2 system. Moreover, we have carried out an evaluation of the quality of the results produced by the algorithm.
STRESZCZENIE (Polish abstract, translated)
This master thesis investigates the possibility of adapting the hierarchical cluster analysis algorithm AHC (Agglomerative Hierarchical Clustering), based on the vector space model, to the clustering of results of queries to Web search services. We show the problems that arise when attempting to use this algorithm in applications of this kind, where it has to operate on textual data sets containing very limited information.
We show how these problems can be overcome by means of appropriate techniques of pre-processing the input data (the use of phrases in the vector space model) and post-processing the results of the algorithm. In particular, we propose several techniques for pruning redundant nodes of the output dendrogram created by AHC, based on the descriptions of the clusters that are its elements.
In order to verify the presented version of the algorithm in practice, it has been implemented as a module of the Carrot2 system. Moreover, an experiment has also been carried out to evaluate the quality of the results produced by the algorithm.
ACKNOWLEDGMENTS
In the course of creating this thesis, I have encountered many difficulties that I certainly could not have solved easily, or even at all, using just my own skills. Finding solutions to these problems was possible only thanks to the help of other people. First of all, I would like to mention Mr. Jerzy Stefanowski, to whom I am deeply grateful for arousing my interest in Web search results clustering and for his important scientific suggestions. I also give thanks to Mr. Dawid Weiss, author and manager of the Carrot2 project, whose help in solving technical problems was invaluable. Without his hard work on Carrot2, and earlier also on the Carrot system, this thesis would either simply not exist, or it would have taken incomparably more effort to complete.
I would also like to express my gratitude to the other Carrot2 developers, Paweł Kowalik and Stanisław Osiński, with whom I had the pleasure to cooperate for a really long year. Working together, we have participated in the creation of a truly interesting system with great research prospects for the future.
I wish to thank the people whose time and effort made the evaluation of the results produced by the algorithm possible: Jacek Błaszczyk, Irmina Masłowska, Stanisław Osiński, Jerzy Stefanowski and Dawid Weiss. Mrs. Irmina Masłowska deserves particular thanks for her valuable remarks on quality measures for hierarchical results and for giving me the opportunity to compare the results of AHC with those of her clustering method.
Finally, I would like to express how important to me was the help of all those closest to me, who supported me during the course of this thesis. It is hard for me to say how much I owe you.
1.2. THE SCOPE AND GOAL OF THE THESIS
1.3. SHORT THESIS SUMMARY
1.4. THESIS LAYOUT
2. BASIC CONCEPTS
2.1. TYPES OF INTERNET SEARCH ENGINES
2.2. INDEXING SERVICES
2.2.1. COLLECTING AND STORING THE INFORMATION
2.2.2. MATCHING THE USER'S QUERY
2.3. AUTOMATIC SEARCH RESULTS CLUSTERING
2.4. EXPERIMENTAL SOLUTIONS
2.5. THE CARROT2 SYSTEM
3. THE VECTOR SPACE MODEL IN INFORMATION RETRIEVAL
3.1. BASIC CONCEPTS
3.1.1. DOCUMENTS REPRESENTATION
3.1.2. STRONG AND WEAK POINTS OF THE MODEL
3.2. ASSIGNING WEIGHTS TO TERMS
3.2.1. METHODS OF TERMS WEIGHTING
3.2.2. TERMS WEIGHTING ISSUES SPECIFIC TO SEARCH RESULTS CLUSTERING
3.3. DOCUMENT SIMILARITY MEASURES
3.4. DEALING WITH NATURAL LANGUAGE
3.5. USING PHRASES IN THE VECTOR SPACE MODEL
3.5.1. REASONS TO USE PHRASES
3.5.2. PHRASES EXTRACTION
3.5.3. POSSIBLE IMPROVEMENTS OF PHRASES EXTRACTION
4. DOCUMENTS CLUSTERING ALGORITHMS
4.1. OVERVIEW OF MAIN CLUSTERING ALGORITHMS
4.1.2. OTHER METHODS BASED ON THE VECTOR SPACE MODEL
4.1.3. SUFFIX TREE CLUSTERING
4.2. THE AGGLOMERATIVE HIERARCHICAL CLUSTERING (AHC) ALGORITHM
4.2.1. ALGORITHM OUTLINE
4.2.2. LINKAGE METHODS
4.3. FORMING CLUSTERS FROM RESULTS OF AHC
4.3.1. INITIAL CLUSTERS FORMATION
4.3.2. DENDROGRAM POST-PROCESSING
4.3.3. CREATING DESCRIPTIONS OF CLUSTERS
4.3.4. REMOVING REDUNDANT DESCRIPTIONS
4.3.5. DESCRIPTIONS-BASED CLUSTERS POST-PROCESSING
4.3.6. FINAL POST-PROCESSING
5. IMPLEMENTATION OF THE AHC ALGORITHM IN THE CARROT2 SYSTEM
5.1. THE CARROT2 SYSTEM ARCHITECTURE
5.1.1. BASIC IDEAS
5.1.2. ELEMENTS OF SYSTEM ARCHITECTURE
5.2. AHC MODULE AS A PART OF CARROT2
5.3. AHC MODULE DESIGN AND IMPLEMENTATION
5.3.1. GENERAL DESIGN AND IMPLEMENTATION ISSUES
5.3.2. EFFICIENCY ISSUES
6. EVALUATION OF THE AHC ALGORITHM
6.1. METHODS OF CLUSTERING RESULTS EVALUATION
6.1.1. EXISTING WEB MINING METHODS
6.1.2. OUR CLUSTERING RESULTS QUALITY MEASURES PROPOSAL
6.1.3. AHC-SPECIFIC QUALITY MEASURES
6.2. USER EVALUATION OF THE AHC ALGORITHM
6.2.1. THE EXPERIMENT
6.3. EXPERT EVALUATION OF THE AHC ALGORITHM - INFLUENCE OF AHC PARAMETERS AND INPUT DATA ON THE RESULTS
6.3.1. INFLUENCE OF QUERY BREADTH
6.3.2. INFLUENCE OF THE INPUT SNIPPETS NUMBER
6.3.3. INFLUENCE OF PHRASES USE
6.3.4. INFLUENCE OF GROUPS CREATING THRESHOLD
6.3.5. INFLUENCE OF POST-PROCESSING METHODS
6.3.6. INFLUENCE OF TERMS WEIGHTING METHODS
6.4. EXPERT EVALUATION OF THE AHC ALGORITHM - COMPARISON OF AHC AND OTHER CLUSTERING ALGORITHMS
7.1. SCIENTIFIC CONTRIBUTIONS
7.2. POSSIBLE FUTURE RESEARCH
1.1. MOTIVATIONS
From the beginnings of civilization, man has always gathered knowledge. The first (and most important) invention in this area was writing, which enabled the persistent storage of information. In ancient times, and later in the Middle Ages, great libraries like the ones in Alexandria or Nineveh, storing thousands of books, were often the most important centers of civilization.
When computers, and then the Internet, appeared, our ability to gather information increased dramatically. The number of documents available on the Internet has grown exponentially since its origin, mostly because Web documents can be generated automatically. The database of the world's currently largest Web search service, Google, now contains about 3 billion indexed Web pages, while just two years ago it contained "only" 1.3 billion [Google]. And even the biggest search services cover only a dozen or so percent of the Internet.
Moreover, the accessibility of information has considerably improved. Nowadays, one no longer needs to travel to Athens or Alexandria to find the needed information: all you need to do is get on the Internet, type in a few words, click something and wait a few seconds for the result (or at least so the marketing wizards of the Internet search services would claim).
Unfortunately, as the size of the stored data grows, it becomes harder and harder to find the relevant information one is looking for. In a small library the librarian may know where to find each book, but dealing with quantities of documents such as those mentioned in the first paragraph is beyond human capabilities. Moreover, as almost anyone can publish almost anything on the Web (which is therefore often regarded as a "Big Mess"), only a small part of the information contained there is really valuable. This creates a further challenge, as the user should get both all and only the information relevant to his query.
All the above problems can be formulated as the single task of finding the set of documents on the Web relevant to a given user query. This is the so-called Web Search Problem [Selberg 99].
In response to it, a special kind of Web service, the search engine, was created, which provides some solution, though one still far from perfect. Search engines are now among the most popular net services and are often the points from which users start to surf the Internet, so one may call them "gates to the Web". A survey made recently by the Search Engine Watch Web site revealed that they are the top information resource that Americans use when seeking answers (used 32 percent of the time) [SearchEngineWatch]. The most commonly used services are nowadays Yahoo [Yahoo], MSN [MSN], AltaVista [AltaVista] and Google [Google], the last of which receives 200 million queries daily and is usually considered the best one available.
In such a typical service, the user types in a query describing what he or she is looking for, and the service returns a list of documents more or less relevant to this query. Although the algorithms and techniques used for finding this data may differ, user interaction in most of today's search engines is based on two main models: the query list (for the input) and the ranked list (for the output). Unfortunately, it is still not possible to ask most of the services questions in natural language (and it is even less probable to get a valid answer to such a query).
So the query list consists of words or phrases in some natural language (in almost all cases English), linked by Boolean operators. The answer comes as a ranked list of documents matching the query, ordered so that the most relevant documents appear at the top.
Here a problem usually arises, namely that the list is too long and contains many pages that are not relevant to the query, but rather concern many different topics. The user must then sift through hundreds or thousands of documents to find the few he has been looking for. But as human patience is limited, users usually give up browsing the ranked list after taking a look at the few top documents (quite often without finding the information they were looking for). According to another survey made by Search Engine Watch, about 75% of Internet users report searching to be "quite frustrating", and 33% of them find it "very frustrating". This phenomenon has even been given a name: "search rage". The same survey shows that, on average, users reach "search rage" if they do not find the information they are looking for within 12 minutes [SearchEngineWatch].
As an attempt to improve the quality of search, a new paradigm called search results clustering was formulated. It does not focus on the process of searching for the results (clustering may be, and usually is, done completely independently of the searching algorithm), but rather on their presentation, so one may say that it is a more "user-centered" approach. Instead of showing the user the long list of results itself, the list is split into groups related to the sub-topics that occur in the set of documents returned as the result of the query. These groups are then presented, with descriptions that enable users to determine their content. Groups may in turn contain smaller, more specific groups, which may contain further groups, and so on; this is called hierarchical clustering. There are two main benefits of clustering:
• the number of groups is much smaller than the number of documents contained in them, so it is easier and faster to browse the results. This fact has been put into the marketing slogan of the Vivisimo clustering engine: "No more wasted time scrolling and backtracking through long one-dimensional lists".