
by MICHAŁ WRÓBLEWSKI. Supervisor: JERZY STEFANOWSKI, Assistant Professor. Referee: ROBERT SUSMAGA, Assistant Professor. MASTER THESIS. Submitted in ...


4. DOCUMENTS CLUSTERING ALGORITHMS

The main purpose of this chapter is to present to the Reader the most important clustering algorithms, with special attention given to Agglomerative Hierarchical Clustering, as it is the major subject of this thesis. In the first part, a short overview of the clustering problem is given. Several algorithms (including ones based on the vector space model) are presented in order to show different approaches to the issue. The second subchapter is dedicated entirely to a detailed presentation of Agglomerative Hierarchical Clustering. As this method is in reality just a framework of an algorithm and may be interpreted in many ways, we present its main versions and describe the differences between them. Finally, we discuss the techniques we applied in order to adapt the AHC algorithm to our application, i.e. Web search results clustering.


Before discussing clustering algorithms, we should present the main idea of clustering itself.

A good (although a little too narrow for our purposes) definition is given in [Everitt 80]:

"Given a number of objects or individuals, each of which is described by a set of numerical measures, devise a classification scheme for grouping the objects into a number of classes such that objects within classes are similar in some respect and unlike those from other classes. The number of classes and the characteristics of each class are to be determined."

This definition emphasizes the most important properties of clustering:

• Objects within the formed clusters should be homogeneous, while the clusters themselves should be heterogeneous. This is necessary, as clusters should help us distinguish between objects.

• The number and characteristics of the clusters should be discovered by the algorithm, not given as input data. Thus the algorithm should not only assign objects to groups, but also determine the structure of those groups. That is why clustering is sometimes called unsupervised learning.

An example of clustering of numerical data, consisting of a set of objects described by two variables (so that they may be interpreted as points in two-dimensional space), is given in Figure 4-1-1 below. These points have been classified into five distinct groups.

Fig. 4-1-1 Clustering example

However, the clustering definition mentioned earlier misses one important point: it assumes that the data is always described by numerical measures. This does not always have to be true, as clustering is used in many different domains, and one might for instance wish to find clusters in other kinds of data, such as images or text documents (which is our case). Most known clustering algorithms were designed for numeric data, so it may seem easier to transform attributes of other data types into numbers (e.g. for text, using the vector space model).

However, this transformation may lead to a loss of some properties of the original data representation (this issue for the vector space model is discussed in Chapter 3-1-2).

Results of different clustering algorithms may have very different properties. The most important are:

• Hierarchical structure of groups

In most cases it turns out that the clusters created by an algorithm could be further split into smaller (or merged into larger) ones, depending on how much detail we want. Performing clustering on an example collection of documents concerning intelligence, we could distinguish the groups "artificial intelligence" and "emotional intelligence". But the cluster of documents about artificial intelligence could be split further into groups such as "neural networks" or "knowledge discovery". Without a hierarchy, the clusters would either be too general and contain a lot of documents, or there would be too many of them. Using a tree instead of a flat list of groups enables the user to quickly navigate from general to detailed topics.

• Overlapping groups

Some of the clustered objects can be assigned equally well to more than one cluster. Consider for example a great book about UNIX network programming, written by W. Richard Stevens. It belongs to two categories: books about network programming and books about the UNIX operating system, and assigning it to only one of them would mean a loss of valuable information.

• Leaving some of the clustered objects outside the created groups

Usually among the data there exist some objects that are not similar to any others. Such elements (called outliers) should not be classified, as they would not fit any group.

4.1.1. K-MEANS

K-means is one of the classical methods of cluster analysis ([Rocchio 66], [MacQueen 67]). It operates on numerical data, so in order to employ it to cluster text documents one has to use the vector space model. The algorithm is non-hierarchical, and in its basic version it does not allow the clusters to overlap. In order to use k-means, we must first specify the desired number of clusters k. Then the algorithm acts in the following manner:

Select k initial random centroids
Do
    Assign each object to the closest centroid
    Recompute all centroids
While (centroids are not stabilized)


This method iteratively optimizes the assignment of objects to clusters, using as a criterion the distance of each object in a cluster to the cluster's centroid, computed as the mean vector of the objects that belong to the cluster (therefore centroids are also called means, hence the algorithm's name).


The main advantage of the algorithm is its speed. It requires only O(nk) time per iteration, where k is the desired number of clusters and n is the number of input documents (when considering document clustering algorithms, n usually denotes the size of the document collection, so from now on, whenever no meaning for n is given, it is supposed to have exactly this meaning). The number of iterations is usually significantly smaller than the number of objects (although there is no such guarantee), so the overall complexity of the k-means algorithm is usually said to be O(nk).
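The loop above can be sketched in a few lines of plain Python. This is a minimal sketch on points given as coordinate tuples; the function and parameter names are ours, and a production version would need a smarter initialization than random sampling:

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means sketch on points given as numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # k initial random centroids
    for _ in range(max_iter):
        # Assign each object to the closest centroid (squared distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Recompute all centroids as mean vectors of their clusters;
        # an empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:         # centroids are stabilized
            break
        centroids = new_centroids
    return clusters, centroids
```

Each iteration touches every point once per centroid, which is the O(nk) per-iteration cost mentioned above.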

Unfortunately, this algorithm is too simple and so it performs poorly, especially when applied to textual data [Zamir and Etzioni 98]. Its main drawbacks are:

• Necessity of knowing the number of clusters in advance (usually there is no way of knowing how many clusters exist).

• The way of initializing the centroids is not specified, so they usually have to be chosen randomly. An attempt to overcome both this and the former problem by finding the initial centroids using another algorithm is discussed in Chapter 4-1-2.

• Only local optimization of cluster assignment (a "greedy" method)

• Inability of the method to deal with clusters of non-convex shapes, due to the use of centroids as the cluster representation

There exists also a "fuzzy" version of the k-means algorithm [Bezdek 81]. In this method, instead of being assigned to the closest cluster, each document has a degree of membership in each cluster, so that overlapping clusters are allowed:

Select k initial random centroids
Do
    Compute for each object its degree of membership in each cluster
    Recompute all centroids
While (centroids are not stabilized)
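The "degree of membership" step is usually computed with the standard fuzzy c-means rule. A sketch, assuming Euclidean distance and the conventional fuzzifier parameter m (the function names are ours):

```python
import math

def dist(p, c):
    """Euclidean distance between a point and a centroid."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, c)))

def memberships(point, centroids, m=2.0):
    """Degree of membership of one object in every cluster:
    u_i = 1 / sum_j (d_i / d_j)^(2/(m-1)), where d_i is the distance
    to centroid i. Memberships are positive and sum to 1."""
    dists = [max(dist(point, c), 1e-12) for c in centroids]  # avoid /0
    return [1.0 / sum((di / dj) ** (2.0 / (m - 1.0)) for dj in dists)
            for di in dists]
```

Centroids are then recomputed as membership-weighted means, and the loop repeats until they stabilize, just as in crisp k-means.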


4.1.2. OTHER METHODS BASED ON THE VECTOR SPACE MODEL

An approach to dealing with the previously mentioned problems of the k-means algorithm was proposed in [Cutting et al. 92]. This method, called Buckshot, calculates the initial cluster centroids by applying the AHC algorithm to a small random sample of documents from the collection. The groups found by AHC are then taken as these centroids. This trick allows not only to find the number and position of the initial clusters, but also to speed up computations, as k-means has to perform a smaller number of iterations before stabilization when the initial centroids are already known. As the AHC algorithm used in Buckshot runs in O(n²) time, the initial sample is of size √(kn) in order to maintain the overall performance of O(kn). A comparison of clustering methods performed in [Zamir and Etzioni 98] shows that using this method improves both the time and the quality of clustering, but unfortunately the gain is not significant. The main drawback of this method results from the fact that the initial sample may not be representative of the whole document collection.
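The sampling trick can be sketched as follows. Since AHC is only outlined later in this chapter, the clustering of the sample is passed in as a function; `ahc_cluster` and all other names here are our assumptions, not the original implementation:

```python
import math
import random

def buckshot_init(docs, k, ahc_cluster, seed=0):
    """Buckshot initialization sketch: draw a random sample of sqrt(k*n)
    documents, cluster it with AHC (quadratic in the sample size, hence
    O(kn) overall), and return the centroid of each resulting group as an
    initial centroid for k-means. `ahc_cluster(sample, k)` is assumed to
    return k lists of document vectors (tuples)."""
    n = len(docs)
    sample = random.Random(seed).sample(docs, int(math.sqrt(k * n)))
    groups = ahc_cluster(sample, k)
    return [tuple(sum(dim) / len(g) for dim in zip(*g)) for g in groups]
```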

Another simple approach is the Single-pass method [Hill 68]:

Treat the first document as the first cluster.
Do
    Add the next document from the list to the closest cluster if their similarity is above the (predetermined) similarity threshold
    Otherwise treat it as a new cluster
While (not all documents are clustered)


This method has computational costs similar to k-means, namely it runs in linear time O(kn), where k is the desired number of clusters. However, it also has similar disadvantages. Another weak point is that the clustering results depend heavily on the order of documents in the input list. Again according to [Zamir and Etzioni 98], this method gives results of the worst quality.
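The single-pass loop is only a few lines of code. A sketch with a caller-supplied similarity function (the names are ours; note how the result depends on document order, as discussed above):

```python
def single_pass(docs, threshold, similarity):
    """Single-pass clustering sketch: each document joins the most similar
    existing cluster if that similarity exceeds the (predetermined)
    threshold; otherwise it starts a new cluster. The first document of a
    cluster serves as its representative here; a common variant recomputes
    a centroid instead."""
    clusters = []                          # list of (representative, members)
    for d in docs:
        best, best_sim = None, threshold
        for rep, members in clusters:
            s = similarity(d, rep)
            if s > best_sim:
                best, best_sim = members, s
        if best is None:
            clusters.append((d, [d]))      # the document opens a new cluster
        else:
            best.append(d)
    return [members for _, members in clusters]
```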

4.1.3. SUFFIX TREE CLUSTERING

Suffix Tree Clustering (STC) is a relatively new clustering algorithm, designed specifically for document clustering purposes. It was originally proposed by Zamir and Etzioni ([Zamir and Etzioni 98]) and is the first clustering algorithm that fulfills their requirements for Web document clustering methods (presented in Chapter 2.3). It is based on the idea of creating clusters from documents that share common phrases.

The first phase of the algorithm is finding common phrases and the documents that share them. This is done using a special data structure called a Suffix Tree [Ukkonen 95]. This structure enables us to find all phrases, and the documents that share them, in time linear in the number of documents; moreover, new documents can be added to it incrementally, just as they arrive from the search engine. An example suffix tree constructed from three documents is shown below in Figure 4-1-3-1:

Fig. 4-1-3-1 Sample suffix tree for three documents containing the sentences: "cat ate cheese", "mouse ate cheese too" and "cat ate mouse too" [Zamir and Etzioni 98].

In the next stage, a list of base (initial) clusters is created. All phrases that occur in more than one document are considered base clusters. Then for each base cluster a score function is calculated. In essence, this function promotes groups that contain a larger number of documents and are described by longer (i.e. more informative) phrases. Clusters whose score does not exceed a certain threshold value are rejected.

The purpose of the last stage of STC is to merge base clusters that share a considerable number of documents. In this phase, a base cluster graph is created. The nodes of this graph correspond to base clusters, and edges exist between clusters that share more than a certain fraction of their documents.
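These two stages can be sketched without a suffix tree by scanning word n-grams directly (quadratic instead of linear, but functionally similar). The 0.5 overlap threshold and all names below are our assumptions for illustration:

```python
from itertools import combinations

def base_clusters(docs, max_len=4):
    """Map each phrase (word n-gram up to max_len words) to the set of
    documents containing it, and keep phrases shared by more than one
    document. A suffix tree does this in linear time; plain n-gram
    scanning shown here is simpler but slower."""
    phrases = {}
    for i, doc in enumerate(docs):
        words = doc.split()
        for n in range(1, max_len + 1):
            for j in range(len(words) - n + 1):
                phrases.setdefault(" ".join(words[j:j + n]), set()).add(i)
    return {p: ds for p, ds in phrases.items() if len(ds) > 1}

def merge(base, overlap=0.5):
    """Merge base clusters that share more than `overlap` of the documents
    of both clusters, i.e. find the connected components of the
    base cluster graph (simple union-find)."""
    items = list(base.items())
    parent = list(range(len(items)))
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for a, b in combinations(range(len(items)), 2):
        da, db = items[a][1], items[b][1]
        common = len(da & db)
        if common / len(da) > overlap and common / len(db) > overlap:
            parent[find(a)] = find(b)      # link the two components
    groups = {}
    for i, (p, ds) in enumerate(items):
        groups.setdefault(find(i), []).append(p)
    return list(groups.values())
```

Each merged group's documents are the union of its base clusters' document sets; the base-cluster score mentioned above would be computed before merging, as the number of documents times a function of the phrase length.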


Fig. 4-1-3-2 Base cluster graph for suffix tree from Figure 4-1-3-1 [Zamir and Etzioni 98].

This method was the first really successful search results clustering algorithm. Its authors claim that it is both faster (due to its linear complexity and incremental processing) and gives better results than any method based on the vector space model [Zamir and Etzioni 98].

The main disadvantage of the basic version of STC is the "flat" structure of the resulting clusters, i.e. the lack of a hierarchy. An attempt to overcome this problem by creating a hierarchy from the resulting groups is presented in [Masłowska and Słowiński 03].

4.1.4. LINGO

During work on the Carrot2 system, a new clustering algorithm called LINGO was proposed [Osiński 03]. This method works in a very specific way: it first finds, in the input document set, phrases that would fit as cluster descriptions (called "abstract concepts"), and only then creates the groups themselves by trying to match documents to the found descriptions. Therefore this approach is called by its author "description-oriented".

LINGO is also based on the vector space model, and it employs advanced techniques such as Singular Value Decomposition of the term-document matrix and its dimensionality reduction (see Chapter 3-2-2) using low-rank approximation.

The main advantage of this algorithm is the very good, informative cluster labels it presents to the user (although in some cases they are too specific and fit only some of the documents in the group). LINGO is also capable of creating overlapping clusters. Its main weak points are the inability to perform hierarchical clustering and rather slow processing, due to the use of Singular Value Decomposition. However, attempts to accurately assess this method are premature, as it is still a very new idea, with only one existing implementation.


4.2. AGGLOMERATIVE HIERARCHICAL CLUSTERING (AHC) ALGORITHM

4.2.1. ALGORITHM OUTLINE

Agglomerative Hierarchical Clustering is a classical clustering algorithm which, like k-means, originates from the statistics domain. The fundamental idea of this method is very simple, not to say obvious: it consists in iteratively merging the two most similar groups, which in the first step contain single elements:

Fig. 4-2-1-1 Outline of the agglomerative ("bottom-up") hierarchical clustering algorithm
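The merging loop can be sketched directly from this outline. Single-link group similarity (the maximum pairwise similarity) is used here as one concrete choice; AHC variants differ exactly in that step, and the names in this sketch are ours:

```python
def ahc(sim):
    """Agglomerative hierarchical clustering sketch over a precomputed
    similarity matrix (e.g. pairwise cosine similarities). Returns the
    list of merges, i.e. a dendrogram: each step links exactly two
    clusters, recording the similarity at which they were merged."""
    clusters = [[i] for i in range(len(sim))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single-link similarity between group a and group b;
                # other AHC variants replace only this line.
                s = max(sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        merges.append((clusters[a], clusters[b], s))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return merges
```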

The algorithm takes as input a matrix of pairwise similarities between objects. In the case of documents, this matrix is created by calculating all pairwise similarities between documents using one of the measures presented in Chapter 3-3 (according to our literature review, usually the cosine similarity). The algorithm returns a binary tree of clusters (binary because it always links exactly two clusters), called a dendrogram (sample dendrograms are depicted in Figure 4-2-2-6). The name of the algorithm refers to its way of working, as it creates the hierarchical result in an "agglomerative" or "bottom-up" way, i.e. by merging smaller groups into larger ones. There also exists a "divisive" (rather rarely used) version, which works in the opposite way, iteratively partitioning clusters instead of merging them:


Fig. 4-2-1-2 Outline of the divisive ("top-down") hierarchical clustering algorithm

AHC is in reality just a template for a whole class of algorithms, with many detailed versions proposed in the literature. The basic outline of this method is very general and does not specify two


