by MICHAŁ WRÓBLEWSKI
Supervisor: JERZY STEFANOWSKI, Assistant Professor
Referee: ROBERT SUSMAGA, Assistant Professor
MASTER THESIS
Submitted in ...
• the query may concern many different topics (not only those which the user had in mind when typing it into the search engine). Clustering reveals these topics as separate groups, so that the user may choose only the ones that interest him. Vivisimo claims that "categorized search results increase the probability of finding key information by an estimated 44%" [Vivisimo]. Clustering is a natural technique when dealing with a large amount of information – an example is a library catalogue (imagine looking for a book in a library without one, just using a flat list of items!). Clustering algorithms have long been known in computer science, statistics and economics, where they are used for finding groups in numerical data [Everitt, 80]. They may also be used to cluster documents in textual databases [Salton, 89]. An example of such an algorithm is Agglomerative Hierarchical Clustering (AHC), which generates a hierarchical output.
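To make the AHC idea concrete, the following is a minimal illustrative sketch, not the implementation developed in this thesis (which is described in Chapter 4 and Chapter 5). It uses single-link distance purely as an example; the function names and toy data are the author's assumptions for illustration.

```python
def ahc(points, dist):
    """Single-link Agglomerative Hierarchical Clustering.

    Every point starts in its own cluster; the two closest clusters are
    repeatedly merged until only one remains. Returns the merge history
    as a nested tuple (a simple dendrogram of point indices).
    """
    clusters = [[i] for i in range(len(points))]  # members of each cluster
    trees = list(range(len(points)))              # dendrogram per cluster

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-link distance,
        # i.e. the distance between their closest pair of members.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a, recording the merge in the tree.
        clusters[a] += clusters.pop(b)
        trees[a] = (trees[a], trees.pop(b))

    return trees[0]
```

For instance, on the one-dimensional points `[0.0, 0.1, 5.0]` with absolute difference as the distance, the two nearby points merge first, yielding the hierarchy `((0, 1), 2)`.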
The first really successful search results clustering service was Grouper, developed at the University of Washington [Zamir and Etzioni, 98], [Zamir and Etzioni, 99], which used a brand-new algorithm called Suffix Tree Clustering (STC). Grouper was a pure research project and unfortunately is no longer active. Currently the most popular clustering engine is Vivisimo [Vivisimo], which is capable of performing hierarchical clustering. It is a commercial service whose authors do not reveal the details of the algorithms used.
The first clustering engine developed in Poland was Carrot, written as part of a master's thesis at Poznań University of Technology [Carrot], [Weiss, 01]. Carrot also utilized the STC algorithm, attempting to apply it to documents written in both English and Polish. It was the first clustering service designed for a language other than English. As a continuation of Carrot, a new system called Carrot2 [Carrot2] was developed. The main purpose of this system was to enable easy experimentation with and comparison of different clustering algorithms [Weiss 02a]. This work is also part of the Carrot2 system and focuses on the application of Agglomerative Hierarchical Clustering to this problem, as the system does not yet contain an implementation of any hierarchical algorithm.
1.2. THE SCOPE AND GOAL OF THE THESIS
This thesis addresses subjects from the domain of computer science called Information Retrieval, including the problems of processing and searching text databases [Salton, 89]. The work is particularly strongly connected with another, relatively new domain partly overlapping Information Retrieval, called Web Mining, which applies similar problems to the Internet (which may also be considered a large, distributed database). Web Mining raises completely new problems, associated with the distributed and chaotic structure of the Internet, the rapid changes of its content, constraints on processing time, and the unknown profile of the end user, who may be an ordinary person without any knowledge of how to use the system that is supposed to help him find the information he needs.
This thesis is supposed to reach two main goals:
• Implementation of a hierarchical clustering algorithm for Web documents as part of the Carrot2 system.
This requires the choice of an existing, or the design of a new, clustering algorithm based on the so-called vector space model (presented in Chapter 3). The algorithm should then be implemented as a module of the Carrot2 search service. The design and implementation of this module should comply with the standards of information flow in the Carrot2 system [Weiss 02a].
• Evaluation of the algorithm's performance on chosen sets of WWW pages (including documents in Polish) and its comparison with recent approaches to search results clustering, such as Suffix Tree Clustering or Latent Semantic Indexing.
In order to achieve this goal, it was necessary to devise quality measures for the algorithm's results that take into account the specific properties of all the compared algorithms.
An appropriate set of test data, allowing different aspects of these algorithms to be studied, also had to be created.
1.3. SHORT THESIS SUMMARY
Chapter 2 gives a brief overview of existing answers to the Web search problem, with emphasis placed on search results clustering. Innovative, experimental approaches to presenting search results to the user are also discussed. In the final subchapter we introduce the Carrot2 clustering engine, which is especially important as this work is part of that system.
In Chapter 3 we present the vector space model, a classical paradigm in the Information Retrieval domain which underlies Agglomerative Hierarchical Clustering. The basic terms and assumptions of this model are presented, along with a discussion of problems that can be encountered when dealing with natural language.
Chapter 4 presents the main document clustering algorithms, both those based on the vector space model and newer ones based on key phrases. A separate subchapter is devoted to a detailed description of the AHC algorithm, the major subject of this thesis. Finally, we give a comprehensive overview of the post-processing of the algorithm's results that is performed before their presentation to the user.
In Chapter 5, the design and implementation issues of the AHC algorithm are discussed, along with a short presentation of the overall Carrot2 system architecture, which gives background that may help in understanding some of the design decisions.
The main purpose of Chapter 6 is to evaluate the results produced by the algorithm.
First, we present both standard quality measures and the clustering-specific ones that we have designed. The evaluation experiment and its results are then presented in two subchapters.
Chapter 7 concludes the thesis, summarizing the main accomplishments and conclusions. Possible future research directions are also listed.
1.4. THESIS LAYOUT
In the first subchapter we stated that not only the amount of information is important, but also how easy it is to find. One might say that this observation also applies to this thesis itself. So, apart from creating the content, an effort has also been made to present it to the reader so that he or she may easily find the information being looked for. The layout of this thesis is based on those used in many other papers that the author has come across.
Different font styles are used to emphasize the special meaning of some parts of the thesis. Italics place emphasis on text of special importance, while bold font appears whenever a new term is defined or introduced.
A standard has also been introduced for the numbering of figures. A decision has been made not to distinguish between figures, tables, equations, etc.; all these elements are treated as figures and numbered according to their order of appearance in the chapter in which they occur (e.g. "Figure 4-2-5" would denote the fifth figure in Chapter 4.2). The number, together with a short title, always appears in the text below each figure; thereafter the figure is referenced in the text by its number.
References to other papers and Internet resources are marked using square brackets containing the name of the reference. This name usually consists of the name of one of the authors and the year of publication.
2. BASIC CONCEPTS
The main purpose of this part of the thesis is to give the reader more detailed knowledge of the Web searching topic. Several distinguishable types of search services are discussed here, and the most interesting approaches are then examined in detail in subsequent subchapters. First come the indexing services, still the most often used tool; the techniques that allow them to search and process the vast amount of data found on the WWW are presented. We then shift to recent interesting approaches to Internet searching. Most of the ideas introduced here are still in the experimental phase, but they represent very interesting (and sometimes perhaps unexpected) approaches to the Web search problem. We focus our attention especially on search results clustering and the Carrot2 system, which are presented in separate subchapters.
2.1. TYPES OF INTERNET SEARCH ENGINES
Not all Web search engines work in the same manner; currently several types of services representing different approaches are in use. We will use a classification based on those given in [Zamir and Etzioni 99] and [Weiss 02b]. The most important distinction concerns the way search results are presented – whether it is oriented toward visualization of the attributes of documents or toward visualization of the similarities between documents.
Historically, the first type of search engine was the so-called indexing service, which has its own database containing an immense number of WWW pages that are searched in order to answer the user's query. Indexing services have always been the most popular tool for finding information on the net; the biggest Web search services, like Google or Yahoo, work in this very way. However, as we have already stated, the database of a single search engine covers only a small part of the Internet's information resources. Thus, for some very specific queries, it may return very few or even no results.
In such cases, the simplest solution is to repeat the search using another service, which may give better results. Typing the same query several times and then comparing the result lists is a time-consuming and boring task which, moreover, can easily be automated, and so meta-search engines came into being. Meta-search engines do not have their own document databases; after receiving the user's query, they redirect it to other services, collect the results, and send them back to the user. Example services of this type are Dogpile [Dogpile], Mamma [Mamma], Metasearch [Metasearch] and Vivisimo, which in addition performs clustering before presenting results to the user. Interestingly for us, Polish meta-search engines that query Polish search services have also been built – Emulti [Emulti] and RazDwaTrzy [RazDwaTrzy]. Client programs that work in a similar fashion also exist (Copernicus or, again for querying the Polish Web, Internetowy Poszukiwacz) [Chabiński and Bugajska 03].
Fig. 2-1-2 Vivisimo's Advanced Search options reveal which search engines it uses
Because different search services may use different query formats, meta-search engines must translate the user's query before broadcasting it. Problems also arise from the fact that the results obtained from different services are disjoint and sorted in different orders, so finding a common order for the final ranking list may be difficult. Meta-search engines can sort the results using their own algorithms, but as they have no additional data beyond the collected pages themselves, the results are often poor. A simple test of meta-search engines was performed in [Chabiński and Bugajska 03]; the authors claim that the quality of the results of most of these services (excluding Vivisimo) is rather poor compared to the results given by Google.
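To illustrate one way such differently ordered, partly overlapping result lists could be combined, the sketch below uses reciprocal-rank voting. This is purely an illustrative assumption on the author's part: the services mentioned above do not disclose their actual merging algorithms.

```python
def merge_results(result_lists):
    """Merge ranked result lists from several engines into one ranking.

    Each inner list contains URLs ordered best-first. A URL's score is
    the sum of reciprocal ranks over all engines that returned it, so
    documents ranked highly by several engines rise to the top.
    """
    scores = {}
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    # Sort by descending score; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda u: (-scores[u], u))

merged = merge_results([
    ["a.com", "b.com", "c.com"],   # result list from engine 1
    ["b.com", "a.com", "d.com"],   # result list from engine 2
])
```

Here `a.com` and `b.com`, each ranked highly by both engines, end up above `c.com` and `d.com`, which only one engine returned. Such rank-only schemes need no relevance scores from the underlying engines, which is exactly the constraint meta-search engines operate under.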
A completely different approach is represented by Web directories, which work as thematic catalogues of Internet pages, manually assigned to respective categories. The main directory services are Yahoo, which was the first to employ the directory idea for searching the Web, LookSmart [LookSmart], and the Open Directory Project [ODP], which can be viewed as an "Open Source" directory: documents can be assigned to categories by any Internet user, not only the employees of the company that owns the directory.
This approach is very interesting, as it differs widely from the classical "query-results list" paradigm, possessing many advantages over it (but unfortunately many disadvantages, too). It is also becoming more popular (according to [Weiss 02b], a few years ago about half of Internet users declared using directories when looking for information). It can also be viewed as a form of clustering that is performed manually rather than by a machine. A different, interesting solution was applied in the NorthernLight service [NorthernLight], which is a "half-manual" directory: categories are predefined by humans, but the classification of documents into them is performed automatically.
Fig. 2-1-3 Comparison of the top Web directories according to [SearchEngineWatch]
The main disadvantages of Web directories are:
• slow indexing (as it has to be performed manually); for instance, the Open Directory Project receives 250 site submissions per hour [SearchEngineWatch]
• relatively small databases (example figures are given in Figure 2-1-3), resulting from the former point
• the content of catalogues may be out of date, also because of the former point
• a predefined directory structure, which may not cover very specific topics
Their strong points are:
• the hierarchical structure of catalogues, which allows the user to refine the search, starting from very general topics and moving towards more and more detailed ones
• categories are assigned manually, so accuracy is much better than in the case of automatic clustering
• there is no need to know any keywords or formulate queries; navigating through the directory structure is much simpler and more intuitive
Fig. 2-1-4 Open Directory Project catalogue for the category "Computers"
This subchapter concerns typical indexing services using the "query-results list" model, and tries to explain how the documents that respond to the user's query are found.
2.2.1. COLLECTING AND STORING THE INFORMATION
In order to find the Web documents relevant to a user's query, search services keep indexes of them in their databases. This creates great challenges, as the number of these documents may run into billions, and the result has to be found within just a few seconds unless we want the user to get impatient. Google has published a short list of example difficulties that they have to overcome [Google], including:
• efficiently searching a full index of more than 3 billion documents more than 3,000 times per second at peak traffic times
• crawling and indexing billions of documents, comprising more than 20 TB of data, in a few days
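The index at the heart of such a service is typically an inverted index, which maps each term to the list of documents containing it; answering a conjunctive query then amounts to intersecting these posting lists. The following is only a minimal sketch of this classic structure [Salton, 89], not of Google's actual (and far more elaborate) implementation; the function names and toy documents are illustrative assumptions.

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """AND-query: return ids of documents containing every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)

docs = [
    "web search engines",        # document 0
    "web document clustering",   # document 1
    "library catalogue",         # document 2
]
idx = build_index(docs)
```

With this index, the query "web clustering" matches only document 1, without scanning the document texts at all; this is what makes sub-second answers over billions of documents feasible, once the index itself is distributed and compressed.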