FREE ELECTRONIC LIBRARY - Dissertations, online materials

Pages:     | 1 |   ...   | 5 | 6 || 8 | 9 |   ...   | 12 |

«by MICHAŁ WRÓBLEWSKI Supervisor JERZY STEFANOWSKI, Assistant Professor Referee ROBERT SUSMAGA, Assistant Professor MASTER THESIS Submitted in ...»

-- [ Page 7 ] --

main points:

• stopping condition Normally, this method stops when only one group is left, so there is nothing left to cluster. But, as this method turns out to be quite slow, it is sometimes suggested that it should be stopped when either specified number of clusters remains or similarity value between two closest clusters in iteration falls below certain threshold. This approach often fails, as these two thresholds have to be arbitrary, and it is hard to find a value that

- 45 would suit different cases of input data (simple experiment with manipulation of value of this parameter is described in Chapter 6-3-4). Moreover, usually subsequent iterations of the algorithm take less and less time, so in order to have a significant time gain, the quality trade-off would have to be too big.

• similarity measure between clusters This is a very important issue in the AHC method, and may have a great impact on its results. As we have already stated, the algorithm merges in each step two closest clusters.

But it receives as input only similarities between objects, not groups. So there must be some way of calculating the latter. Similarity measures between clusters are called linkage methods, as they influence the way the clusters are merged. Different linkage methods are discussed in details in Chapter 4-2-2.

AHC is often considered as algorithm that gives the best results among ones based on the vector space model [Zamir and Etzioni 98]. Its biggest advantage is that it is capable of creating a hierarchy of clusters, which is necessary to reflect the hierarchy of topics in documents. Unfortunately, results of typical version of AHC don't have other properties described in Chapter 4.1 (namely, they cannot contain overlapping clusters, and – when no extra stop condition is used – this method clusters all objects). However, the latter problem can be solved by post-processing of dendrogram given by the algorithm (see Chapter 4-3-5).

But real main disadvantage of AHC is its low speed – it takes between O(n2) and O(n3) time to cluster n documents, depending on the linkage method used ([Voorhees 86], [Maarek et al. 00]). This means that using AHC for large document collections may be impossible.

Although in Web search results applications the size of input data set amounts usually only to few hundreds documents (see key requirements for Web document clustering methods in Chapter 2-3), we also have less time as clustering has to be performed "on-line".

4.2.2. LINKAGE METHODS Linkage methods are the most important issue in the AHC algorithm, and in fact they decide how it exactly works. As it was already mentioned in the former subchapter, they are used to aggregate similarities between single documents in two different groups into similarities between the both whole groups. So linkage methods formulas are based on these inter-document similarities. There are many methods [Everitt 80], and we will describe here

only the three most popular of them, presented in [Salton 89]:

• Single-link or Nearest Neighbour In this method, similarity between a pair of clusters is taken to be the similarity between the most similar pair of objects from each cluster, thus each cluster member will be more similar to at least one member of the same cluster than to any member of any another


–  –  –

This method merges clusters that have at least one pair of similar items. Unfortunately, it tends to produce a small number of "straggly", large and loosely connected clusters. This is so called chaining effect (illustrated also in Figure 4-2-2-6), and happens due to the

fact that only two of the elements in cluster have to be very similar. Examples of Singlelink clustering results are shown on the below figure:

Fig. 4-2-2-2 Three sample results of the Nearest Neighbour linkage method

• Complete-link or Furthest Neighbour In this method, similarity between a pair of clusters is taken to be the similarity between the least similar pair of objects from each cluster, thus each cluster member will be more similar to all members of the same cluster than to the least similar member of any

another cluster:

–  –  –

This method merges clusters that have all pairs of their members similar. It produces rather smaller, but tightly connected clusters that are typically preferable. Examples of Complete-link clustering results are shown on the below figure, and it is clearly visible

that clusters given by this method make more sense that results of Single-link:

- 47 Fig. 4-2-2-4 Three sample results of the Furthest Neighbour linkage method

• Group-average or UPGMA (Unweighted Pair-Group Method using Arithmetic averages) In this method, similarity between a pair of clusters is taken to be the average similarity between all pairs of objects from each cluster, thus each cluster member will have a greater average similarity to all members of the same cluster than to all members of any

another cluster:

–  –  –

Example dendrograms for the same documents collection containing 100 items using cosine coefficient as similarity measure between documents are shown below. On the one corresponding to Single-link method, the chaining effect is visible, while dendrogram for Complete-link shows that similarity of larger groups decreases rapidly, and the top, most general groups have their similarity equal to 0. Dendrogram corresponding to Group-average seems to be somehow "intermediate" between two former methods.

–  –  –

Formulas used to calculate similarity between groups used in most linkage methods may

be reduced to a generic form:

sim (a, bc ) = α b ⋅ sim (a, b)+ α c ⋅ sim (a, c ) + β ⋅ sim(b, c) + γ ⋅ sim ( a, b) − sim (a, c )


b, c – merged clusters, a – any other cluster, αb, αc, β, γ – coefficients specific for different linkage methods

–  –  –


4.3.1. INITIAL CLUSTERS FORMATION Dendrogram coming as a result of the AHC algorithm consists always of n – 1 nodes (for n

leaves representing documents). For each of these nodes three cases are possible:

• A node has two leaves. In this case it is a group containing two documents.

• It has a leaf and another node. Then document represented by leaf and group represented by node should then be merged together into one group.

• It has two nodes. It means that it is a group, made of another two subgroups

This idea is illustrated on Figure 4-3-1-1 below:

Fig. 4-3-1-1 Example of forming groups from dendrogram

Process of forming clusters from the output tree may be stopped after reaching certain similarity threshold in order to not create large, but loosely-connected groups.

Unfortunately, dendrogram created by AHC apart from informative, coherent clusters contains also a lot of useless, redundant ones, which should be removed from it before presentation of results to the user. Therefore we need to create methods which would enable to sift out only the important clusters from the tree.

The most important source of above problem is the binary structure of dendrogram, caused by the way the algorithm links groups – always by two. Unfortunately in the documents collection that might not always be the case (i.e. a topic can have either just one ore more than two subtopics). Consider an example set of documents corresponding to query "king". It

–  –  –

Fig. 4-3-1-2 Example document collection structure (on the left) and corresponding AHC dendrogram.

"Artificial" clusters are marked with "???".

It is clearly visible that these useless clusters should be removed, as their presence in the results obscures the hierarchy and makes its browsing by the user much more difficult.

4.3.2. DENDROGRAM POST-PROCESSING Several methods of dendrogram pruning, mostly based on similarities between the nodes in it have been proposed in order to select the most significant levels of hierarchy. An intuitive approach, whose authors claim that it follows expert's way of most important clusters selection has been proposed in [Maarek et al. 91]. It analyses the changes in similarity values when navigating between different levels of dendrogram. If a group of nodes shares a common parent, and the differences in similarity level between its members are relatively

small, then it forms a cluster. This approach is illustrated in Figure 4-3-2-1:

Fig. 4-3-2-1 Example of dendrogram prunning taken from [Maarek et al. 91].

This method can be further simplified as described in [Maarek et al. 00] by rounding the similarities of the dendrogram nodes to some value (authors of this paper find 0.1 to be enough) in order to make the hierarchy more compact. Then all nodes having the same similarity value and a common parent are treated as a cluster. When using Single-link or Complete-link linkage methods, we can round the similarity values in the input matrix, not in the output dendrogram and therefore even remove the post-processing phase from the

- 51 algorithm. This is due to the fact that using these linkage methods does not introduce any new values to the similarity matrix during its recomputations by AHC (see Chapter 4-2-2).

However, instead of these approaches we propose a different one, based not on some numerical properties of the tree, but rather on textual descriptions of created clusters. It is discussed in Chapter 4-3-5, as we would like first to present the way these descriptions are created.

4.3.3. CREATING DESCRIPTIONS OF CLUSTERS Unfortunately, the AHC algorithm itself contrary to methods like STC or LINGO doesn't create any labels of the formed groups. But when we already have the clusters and their contents, the task of describing them isn't difficult. An intuitive approach is to form description of a group from terms common in documents that are members of this group (we should keep in mind the fact that sharing common terms is the reason why these documents form a group). Detailed version of this method has been presented in [Maarek et al. 00] and is

used in our system. It consists of three main points:

• Label of a group consists of several (e.g. 3-5) terms or phrases. The purpose of this rule is simply to constrain the length of descriptions in order to make them more readable for the user.

• Label is made up from the "best" terms or phrases (i.e. ones that have the biggest sum of weights in documents contained in the cluster). When using standard terms weighting methods, it means that words included in descriptions appear frequently in documents. A possible improvement at this stage would be taking into consideration length of phrases (if it is not done already in the terms weighting process – see Chapter 3-5-3). Preferring longer phrases could help to overcome one of weaknesses of AHC, namely short cluster descriptions (refer to Chapter 6-4 for comparison of results given by AHC and other clustering algorithms).

• Terms / phrases from the label have to occur in a considerable part (e.g. 70 %), but not necessarily all of documents from this group. It is obvious that terms appearing rarely in the group shouldn't form its description. On the other hand, it would be a mistake to exclude a term from a label if it is not present in just few members of the cluster.

Creating cluster descriptions is this part of the clustering process, where application of phrases gives very good results. According to [Zamir and Etzioni 99], phrases are "extremely useful in characterizing the cluster's content". This issue has been also discussed also in Chapter 3-5-1.

4.3.4. REMOVING REDUNDANT DESCRIPTIONS Descriptions obtained using method presented in the former subchapter still have some flaws. Usually the problem is that they contain a lot of similar and therefore redundant

phrases. Two main reasons of this situation are:

- 52 hierarchical structure of clusters Usually set of common terms of a subcluster is made up of a corresponding set of its parent cluster and words specific for this subcluster. So for an example cluster "King County", all its subclusters would have a phrase "King County" in their descriptions (i.e.

a cluster about government of King County would have a description "King County", "King County Government").

• use of phrases As we have already stated in Chapter 3-5-2, during phrases extraction overlapping sequences may be retrieved. For example, in a collection of documents containing phrase "King County Government", algorithm described in this chapter would find also its subphrases: "King County", "County Government", "King" etc., and possibly all or most of them would appear in the cluster label, although most of them would be redundant.

Subphrase of a given phrase is defined in [Zamir and Etzioni 99] as a phrase (or term) that is its substring (i.e. two subphrases of "King County" would be "King" and "County"). We propose a more general definition, and call a subphrase of given phrase a phrase (or term) that contains any of terms contained in the first one (but not all of them). This definition is orderindependent, so for instance "agglomerative clustering" would be a subphrase of "agglomerative hierarchical clustering", while according to the first one, only "agglomerative hierarchical" and "hierarchical clustering" could be its subphrases. Then superphrase of phrase a is any phrase b such that a is a subphrase of b. Of course any definition of sub- or super phrases should take into account stemming of terms contained in phrases so that "King" would be a subphrase of "King's College".

First problem is simply eliminated by removing terms or phrases that appear in descriptions of parent clusters from descriptions of their subclusters [Maarek et al. 00]. We have enhanced this method by removing also subphrases of phrases that appear in parent clusters labels.

–  –  –

Pages:     | 1 |   ...   | 5 | 6 || 8 | 9 |   ...   | 12 |

Similar works:

«The Journal of Hebrew Scriptures ISSN 1203-1542 http://www.jhsonline.org and http://purl.org/jhs Articles in JHS are being indexed in the ATLA Religion Database, RAMBI and THEOLDI. Their abstracts appear in Religious and Theological Abstracts. The journal is archived by the National Library of Canada, and is accessible for consultation and research at the Electronic Collection site maintained by the The National Library of Canada. VOLUME 6, ARTICLE 10 doi:10.5508/jhs.2006.v6.a10 JAKOB WÖHRLE,...»

«CAARE FOSTER DOG MANUAL Companion Animal Advocacy & Rescue Effort A Guide for Foster Dog Parents CAARE FOSTER DOG MANUAL Companion Animal Advocacy & Rescue Effort Table of Contents P R O G R A M I N F O R M A T I O N F R E Q U E N T L Y A S K E D Q U E S T I O N S R E Q U I R E M E N T S F O R A L L F O S T E R P A R E N T S S E L E C T I N G A N A P P R O P R I A T E Rescue D O G T O F O S T E R I N T R O D U C I N G Y O U R D O G S T O F O S T E R D O G S I N T R O D U C I N G Y O U R C A T S...»

«IMOGEN BINNIE a novel “A powerful new literary voice.” MICHELLE TEA AUTHOR OF VALENCIA IMOGEN BINNIE copyright © 2013 by Imogen Binnie Topside Press 228 Park Avenue South, #14261, New York, NY 10003 topsidepress.com Some rights reserved. Most wrongs reversed. Publisher’s Note: This novel is a work of fiction. Names, characters, places, and incidents either are the product of the author’s imagination or are used ficticiously. Any resemblance to actual events, locales, or...»

«Biro Theatricality and Performance: Reanimating Pia Lindman’s ‘Performance of Grief’ As with video, my approach to drawing is informed by the tradition of performance art. After videotaping myself re-enacting gestures of mourning captured in photographs in the New York Times, I traced these gestures from video stills with pencil. By exhibiting both the tracings and the enactments, I tried to illuminate some of the relationships between a photograph, its mediation, and the idea of original...»

«Topology 43 (2004) 983 – 1033 www.elsevier.com/locate/top Filling in solvable groups and in lattices in semisimple groups Cornelia Drutu∗ UFR de Mathà matiques et UMR CNRS no. 8524, Università de Lille I, F-59655 Villeneuve d’Ascq, France e e Received 14 September 2000; received in revised form 13 March 2003; accepted 18 November 2003 Abstract We prove that the ÿlling order is quadratic for a large class of solvable groups and asymptotically quadratic for all Q-rank one lattices in...»

«Chapter 5 I WOKE UP LATE IN the afternoon. For a second I didn't know where I was. You know how it is, when you wake up in a strange place and wonder where in the world you are, until memory comes rushing over you like a wave. I half convinced myself that I had dreamed everything that had happened the night before. I'm really home in bed, I thought. It's late and both Darry and Sodapop are up. Darry's cooking breakfast, and in a minute he and Soda will come in and drag me out of bed and wrestle...»

«    Working Paper  The Accrual Anomaly: Exploring the Optimal Investment   Hypothesis      Lu Zhang  Stephen M. Ross School of Business   at the University of Michigan    Jin Ginger Wu  University of Georgia    X. Frank Zhang  Yale School of Management        Ross School of Business Working Paper Series  Working Paper No. 1100  December 2007        This paper can be downloaded without charge from the  ...»


«Chiesa di Nostra Signora delle Grazie (detta di Sant’Agostino) 1 CHIESA DI NOSTRA SIGNORA DELLE GRAZIE (detta di S. Agostino) Cenni storici La Chiesa di S. Maria delle Grazie è comunemente chiamata “Chiesa di S. Agostino”, a ricordo dell’ordine religioso che ospitò per secoli e che la fece erigere. Quella che oggi vediamo non è l’originale; infatti, la prima chiesa agostiniana era sita nell’area fuori delle mura cittadine, negli airali della Maddalena. Gli Agostiniani (Ordine...»


«Minimalism and Explanation Robert D. Van Valin, Jr. Department of Linguistics & Center for Cognitive Science The State University of New York at Buffalo VANVALIN@ACSU.BUFFALO.EDU 1. Introduction Chomsky’s Minimalist Program (Chomsky 1995, 1998) portrays itself as an attempt to simplify generative syntactic theory while enhancing its explanatory potential. This paper explores some of the questions of language design raised in the Minimalist Program, especially in Chomsky (1998), from the point...»

«1 “Nose Rings and Bellybutton Things Counseling the Next Generation” This.pdf document contains the course materials you must read. Simply keep scrolling down and read every page. To receive CEU credit after reading this file, please follow the directions at the end of the course. Peachtree is approved to provide continuing education services by the National Association of Alcohol and Drug Addiction Counselors (NAADAC) and the National Board of Certified Counselors (NBCC), as well as by...»

<<  HOME   |    CONTACTS
2016 www.dissertation.xlibx.info - Dissertations, online materials

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.