FREE ELECTRONIC LIBRARY - Dissertations, online materials

Pages:     | 1 || 3 | 4 |   ...   | 12 |

«by MICHAŁ WRÓBLEWSKI Supervisor JERZY STEFANOWSKI, Assistant Professor Referee ROBERT SUSMAGA, Assistant Professor MASTER THESIS Submitted in ...»

-- [ Page 2 ] --

• the query may concern many different topics (not only those which user had in mind when typing it into the search engine). Clustering reveals these topics as separate groups, so that the user may choose only ones that are interesting for him Vivisimo claims that "categorized search results increase the probability of finding key information by an estimated 44%" [Vivisimo]. Clustering is a natural technique when dealing with a big amount of information – a example of it can be a library catalogue (imagine looking for a book in a library without it, just using a list of items!). Algorithms performing clustering are well known for a long time in computing science, statistics and economy, where they are used for finding groups in numerical data [Everitt, 80]. They may be used also to perform a clustering of documents in textual databases [Salton, 89]. Example of such algorithm is Agglomerative Hierarchical Clustering (AHC), which generates a hierarchical output.

First really successful search results clustering service was Grouper, developed at the University of Washington [Zamir and Etzioni, 98], [Zamir and Etzioni, 99], which used a brand new algorithm called Suffix Tree Clustering (STC). Grouper was a pure research project, and unfortunately is no longer active. Currently the most popular clustering engine is Vivisimo [Vivisimo], which is capable of performing hierarchical clustering. It is a commercial service whose authors do not reveal secrets of used algorithms.

First clustering engine developed in Poland was Carrot, written as a part of master thesis at Poznań University of Technology [Carrot], [Weiss, 01]. Carrot also utilized STC algorithm, trying to apply it to documents written in both English and Polish languages. It was the first clustering service designed for a language different than English. As a continuation of Carrot a new system, called Carrot2 [Carrot2] was developed. Main purpose of this system was to enable easy performance of experiments comparing different clustering algorithms [Weiss 02a]. This work is also a part of the Carrot2 system, and focuses on application of Agglomerative Hierarchical Clustering to this problem, as this system doesn't contain implementation of any hierarchical algorithm.


This thesis raises subjects associated with domain of computing science called Information Retrieval, including problems of processing and searching of text databases [Salton, 89]. This work is particularly strongly connected with another quite new domain partly covering Information4 Retrieval, called Web Mining, which concerns similar problems towards Internet (which one also may consider as a large, distributed database). Web Mining raises completely new problems, associated with distributed and chaotic structure of Internet, rapid changes of its content, constraints on time of processing and unknown profile of end user, who may be a ordinary person without any knowledge of how to use the system, which should help him in finding information he needs.

- 10 This thesis is supposed to reach two main goals:

• Implementation of Web documents hierarchical clustering algorithm as a part of the Carrot2 system.

Choice of existing or design of new clustering algorithm based on so called vector space model (presented in Chapter 3). This algorithm should then be implemented as a module of the Carrot2 search service. Design and implementation of this module should be compliant with standards of information flow in the Carrot2 system [Weiss 02a].

• Evaluation of the algorithm's performance on chosen sets of WWW pages (considering also documents in Polish language) and comparison of it and recent approaches to search results clustering like Suffix Tree Clustering or Latent Semantic Indexing.

In order to achieve this goal, it was necessary to consider measures of quality of algorithm's results, which would consider specific properties of all compared algorithms.

Appropriate set of test data, which would allow to study different aspects of these algorithms had also to be created.


Chapter 2 gives a brief overview of existing answers to the Web search problem, with emphasis placed on search results clustering. Innovative, experimental approaches to presenting search results to the user are also discussed. In the final subchapter we introduce the Carrot2 clustering engine, which is especially important as this thesis is a part of this system.

In chapter 3 we present the vector space model, which is a classical paradigm in the Information Retrieval domain and underlies the Agglomerative Hierarchical Clustering. Basic terms and assumptions in this model are presented, along with discussion of problems that can be encountered when dealing with natural language.

Chapter 4 presents main documents clustering algorithms, both ones based on the vector spaced model and new ones based on key phrases. A separate subchapter is devoted to a detailed description the AHC algorithm, being a major subject of this thesis. Finally, we give a comprehensive overview of post-processing of the algorithm results which is performed before their presentation to user.

In chapter 5 design and implementation issues of AHC algorithm are discussed, along with a short presentation of the whole Carrot2 system architecture, which gives a background that may help to understand some of the design solutions.

Main purpose of chapter 6 is an attempt to evaluate the results given by the algorithm.

First, we present both the standard quality measures and clustering-specific ones that we have designed. Then, the evaluation experiment and its results are presented in two subchapters.

Chapter 7 ends this thesis presenting once more the main accomplishments and conclusions. Possible future research directions are also listed.


In the first subchapter we have stated that not only the amount of information is important, but also how easy it is to find it. One might say that this sentence also concerns this thesis. So apart from creating the content itself, an effort has also been made to present it to the reader so that she or he may easily find the information that she/he is looking for. Layout of this thesis is based on ones used in many other papers that the author has come across in the past.

Different font styles are used to emphasize special meaning of some parts of thesis. Italics is used to place emphasis on parts of text of some special importance. On the other hand, bold font appears whenever a new term is defined or introduced.

A standard has also been introduced for the numbering of figures. A decision has been made not to distinguish between figures, tables, equations, etc. All these elements are treated as figures and are numbered according to their order of precedence in chapter in which they appear (e.g. "Figure 4-2-5" would denote fifth figure in chapter 4.2). Respective number and also a short title always appears in the text below each figure. Thereafter figure is referenced in text by its number.

References to other papers and Internet resources are marked using square bracket with a name of the reference. This name consists usually of name of one of the authors of the paper and year of its issue.

- 12 BASIC CONCEPTS Main purpose of this fragment of thesis is to give the Reader more detailed knowledge of the Web searching topic. Several types of search services that can be distinguished are discussed here. Most interesting approaches are then discussed in details in subsequent subchapters. First come the indexing services, being still the most often used tool. Techniques that allow them to search and process the vast amount of data found in the WWW are presented. Then we shift to recent interesting approaches to the Internet searching. Most of ideas introduced here are still in experimental phase, but they represent very interesting (and sometimes maybe even unexpected) approaches to Web search problem. We focus our attention especially on search results clustering and Carrot2 system, which are presented in separate subchapters.


Not all Web search engines work in the same manner. Currently several types of services representing different approaches are in use. We will use here a classification based on ones given in [Zamir and Etzioni 99] and [Weiss 02b]. The most important distinction concerns the way of presentation of search results – whether it is oriented toward visualization of attributes of documents or toward visualization of similarities between documents.

–  –  –

Historically first type of search engines were so called indexing services, which have their own databases containing immense numbers of WWW pages searched in order to find an answer to user's query. Indexing services have always been the most popular tool used for looking for information in the net. The biggest Web search services like Google or Yahoo work this very way. However, as we have already stated, databases of single search engines cover only a small part of Internet information resources. Thus in case of some very specific queries they may give very little or even no results.

In such cases, the simplest solution is to repeat the search using other service, which may give better results. Typing the same query several times and then comparing the result lists is

- 13 a time-consuming and boring task, which moreover can be easily automated, and so the metasearch engines came to being. Meta-search engines don't have their own databases containing documents, but after receiving the user's query they redirect it to other services, collect the results, and send them back to user. Example services of this type are Dogpile [Dogpile], Mamma [Mamma], Metasearch [Metasearch] or Vivisimo, which in addition performs clustering before presenting results to the user. What is interesting for us is that also Polish meta-search engines, allowing to query Polish search services were built – Emulti [Emulti] and RazDwaTrzy [RazDwaTrzy]. Also client programs that work in a similar fashion do exist (Copernicus or, again for querying Polish Web – Internetowy Poszukiwacz) [Chabiński and Bugajska 03].

Fig. 2-1-2 Vivisimo's Advanced Search options reveal, which search egines it uses

Due to the fact that different search services may have different formats of query, metasearch engines must translate the user's query before its further broadcasting. There also exist some problems associated with fact that results obtained from different services are disjoint and sorted in different order, so finding a common order for the final ranking list may pose a problem. Meta-search engines can sort the results using their own algorithms, but as they have no additional data, only the collected pages, therefore the results often are poor. A simple test of meta-search engines was performed in [Chabiński and Bugajska 03], and the authors claim that the quality of the results of most of these services (excluding Vivisimo) is rather poor when compared to results given by Google.

Completely different approach are the Web directories, working as thematic catalogues of Internet pages, which are manually assigned to respective categories. Main directory services are Yahoo, which was the first to employ the idea of directory to search the Web, LookSmart [LookSmart] and Open Directory Project [ODP], which can be viewed as "Open Source" directory – i.e. the documents can be assigned to categories by any Internet user, not only the workers of the company that owns the directory.

This approach is very interesting as it differs widely from classical "query-results list" paradigm, possessing many advantages over it (but unfortunately a lot disadvantages, too). It is also getting more popular (according to [Weiss 02b] a few years ago about a half of Internet users declared using directories when looking for information). It can be also viewed as a

- 14 form of clustering, which is done manually, not by the machine. Different, interesting solution was applied in the NorthernLight service [NorthernLight], which is a "half-manual" directory, i.e. categories are predefined by man, but the classification of documents to them is performed automatically.

–  –  –

Fig. 2-1-3 Comparison of the top Web directories according to [SearchEngineWatch]

Main disadvantages of web directories are:

• slow indexing (due to the fact that it has to be performed manually), for instance the Open Directory Project receives 250 site submissions per hour [SearchEngineWatch]

• relatively small size of databases (example exact numbers are given in Figure 2-1-3), resulting from the former point

• the content of catalogues may be out-of-date, also because of the former point

• predefined directories structure, which may not consider very specific topics

Strong points are:

• hierarchical structure of catalogues, allowing to refine the search, starting from very general topics and moving towards more and more detailed.

• categories are assigned manually, so accuracy is much better than in case of automatic clustering

• there is no need to know any keywords or formulate queries, navigating through directory structure is much more simple and intuitive Fig. 2-1-4 Open Directory Project catalogue for category "Computers"

–  –  –

This subchapter concerns typical indexing services using "query-results list" model, and tries to answer how the documents that respond to the user's query are found.

2.2.1. COLLECTING AND STORING THE INFORMATION In order to find the Web documents relevant to user's query, search services keep indexes of them in their databases. This creates great challenges, as number of these documents may amount to billions and the result has to be found during just few seconds unless we want the user to get impatient. Google has published a short list of example difficulties that they have

to overcome [Google], including:

• efficiently searching full index of more than 3 billion documents more than 3,000 times per second at peak traffic times

• crawling and indexing billions of documents, comprising of more than 20 TB of data, in a few days

Pages:     | 1 || 3 | 4 |   ...   | 12 |

Similar works:

«Human Pandemic Influenza Business Continuity Guidelines Local Government Association of South Australia Pandemic Awareness Project Version 2, July 2007 Table of Contents Introduction Establish the Context Identify the Risks Analyse the Risks Maintaining Essential Services Leadership Financial Knowledge Management Communications Critical Supplies Community Expectations Evaluate the Risks Maximum Tolerable Outages Treat the Risks Step 1 Gain Executive Support Step 2 Form Pandemic Planning Team...»

«Andrews Uniuersity Seminuy Studies, Summer 1981, Vol. 19, No. 2, 99-114 Copyright O 1981 by Andrews University Press.THE EXEGETICAL METHODS OF SOME SIXTEENTH-CENTURY PURITAN PREACHERS: HOOPER, CARTWRIGHT, AND PERKINS PART 11* ERWIN R. GANE Pacific Union College Angwin, California In Part I of this series, I provided a brief overview of the preaching careers of the three Puritan preachers here under consideration-John Hooper, Thomas Cartwright, and William Perkins. I also analyzed their concept...»

«Corporate Governance Statement The Hague, March 2016 aegon.com 1. Dutch Corporate Governance Code comply or explain As a company based in the Netherlands, Aegon N.V. (also being referred to as the “Company”) adheres to the Dutch Corporate Governance Code. The complete text of the Code can be found on www.commissiecorporategovernance.nl. Aegon endorses the Code and strongly supports its principles for sound and responsible corporate governance. Aegon regards the Code as an effective means of...»

«Quantitative Analysis of Local Mineral Content and Composition During Bone Growth and Remodeling DISSERTATION zur Erlangung des akademischen Grades doctor rerum naturalium (Dr. rer. nat.) im Fach Physik eingereicht an der Mathematisch-Naturwissenschaftlichen Fakultät I der Humboldt-Universität zu Berlin von Diplom Ingenieur Andreas Roschger geboren am 17. September 1983 in Wien Präsident der Humboldt-Universität zu Berlin Prof. Dr. Jan-Hendrik Olbertz Dekan der...»

«Mar 14, 2011 Chapter 2 The Deep Time of the Dead [We are] mortal creatures who miss our dead friends, and thus can appreciate levitating tigers and portraits by Raphael for what they are—songs of mortality sung by the prisoners of time”1 Vous nous voyez ci-attachés cinq, six Quant de la chair, que trop avons nourrie, Elle est pieça devoree et pourrie, Et nous les os, devenons cendre et pouldre.You see us cleaving together, five, six: As for the flesh, that we nourished too much, It is...»

«Computational Linguistics in the Netherlands Journal 4 (2014) 53-74 Submitted 05/2014; Published 12/2014 T-Scan: a new tool for analyzing Dutch text Henk Pander Maat∗ h.l.w.pandermaat@uu.nl Rogier Kraf∗ kraf@ziggo.nl Antal van den Bosch∗∗ a.vandenbosch@let.ru.nl Nick Dekker∗ n.w.dekker@uu.nl Maarten van Gompel∗∗ proycon@anaproy.nl Suzanne Kleijn∗ s.kleijn1@uu.nl Ted Sanders ∗ t.j.m.sanders@uu.nl Ko van der Sloot∗∗∗ ko.vandersloot@uvt.nl ∗ Department of Languages,...»

«Unit 3 Notes J Ewan, St Ninian’s High School, Douglas Isle of Man. Edexcel Unit 3 Option B A2 Notes Edexcel Guidance for Unit 3 Option B The focus of this topic is on France during a tumultuous period of change with French men and women evolving from subjects to citizens in a maelstrom of revolutions, war and constitutional experiment. The key areas of study required for Section A are summarised in the four bullet points and the two associated controversies to be examined in Section B are...»

«THE EFFECT OF PARTICIPATION IN READ 180 ON SIXTH GRADE STUDENTS’ READING ACHIEVEMENT Christopher M. Miller B.A., Southwest Missouri State University, 1998 M.S., Southwest Missouri State University, 2002 Submitted to the Graduate Department and Faculty of the School of Education of Baker University in partial fulfillment of the requirements for the degree Doctor of Education in Educational Leadership May 1, 2014 Copyright 2014 by Christopher M. Miller ii Dissertation Committee Major Advisor...»

«The Early Aurignacian in Central Europe and its Place in a European Perspective Nicolas Teyssandier To cite this version: Nicolas Teyssandier. The Early Aurignacian in Central Europe and its Place in a European Perspective. Ofer BAr-Yosef; Joao Zilhao. Towards a definition of the Aurignacian, 2002, Lisbonne, Portugal. 241-256, pp.241-256, 2006, Trabalhos de Arqueologia, 45. hal-00174678 HAL Id: hal-00174678 https://hal.archives-ouvertes.fr/hal-00174678 Submitted on 24 Sep 2007 HAL is a...»

«What is discourse analysis? The relationship between linguistic and non-linguistic behaviour connected discourse occurs within a particular situation – whether of a person speaking, or of a conversation, or of someone sitting down occasionally over the period of months to write a particular kind of book in a particular literary or scientific tradition (Harris, 1952: 3). what happens when people draw on the knowledge they have about language. to do things in the world (Johnstone, 2002: 3)....»

«Federal Reserve Bank of Minneapolis Research Department Costly Financial Intermediation in Neoclassical Growth Theory* Rajnish Mehra, Facundo Piguillem, and Edward C. Prescott Working Paper 685 April 2011 Abstract The neoclassical growth model is extended to include costly intermediated borrowing and lending between households. This is an important extension as substantial resources are used in intermediating the large amount of borrowing and lending between households. In 2007, in the United...»

«Technologies and Architectures for Multimedia-Support in Wireless Sensor Networks 373 1 22 Technologies and Architectures for Multimedia-Support in Wireless Sensor Networks Sven Zacharias and Thomas Newe University of Limerick Ireland 1. Introduction Wireless Sensor Networks (WSNs) are an emerging technology in the area of sensory and distributed computing. A WSN consists of many, theoretically up to some thousand or even millions of sensor nodes. A sensor node is generally defined as a cheap...»

<<  HOME   |    CONTACTS
2016 www.dissertation.xlibx.info - Dissertations, online materials

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.