FREE ELECTRONIC LIBRARY - Dissertations, online materials

Pages:     | 1 |   ...   | 9 | 10 || 12 |

«by MICHAŁ WRÓBLEWSKI Supervisor JERZY STEFANOWSKI, Assistant Professor Referee ROBERT SUSMAGA, Assistant Professor MASTER THESIS Submitted in ...»

-- [ Page 11 ] --

6.3.4. INFLUENCE OF GROUPS CREATING THRESHOLD Groups creating threshold parameter indicates, whether any groups should be created from the nodes of dendrogram, whose similarity is lower than value of this parameter (see Chapter 4-3-1). Default value of this parameter is equal to 0.0 so no groups are discarded only on the basis of their similarity level. We have supposed that setting this threshold to some higher number (in our experiment we have chosen 0.3) should leave in the results only the most coherent groups, moreover cleaned from snippets that don't fit well into them. The trade-off would of course be moving these documents to the "Other Topics" group, but we were willing to pay it as the main problem with the AHC algorithm consists rather in constructing weakly associated and poorly labeled groups than in its inability to assign documents to some cluster (compare with results of LINGO evaluation – [Osiński 03]).

- 79 Fig. 6-3-4-1 Comparison of AHC results for a query "cell" using groups creating threshold parameter values of 0.0 (on the left) and 0.3(on the right).

It turned out that change of this parameter's value from 0.0 to 0.3 was enough to cause a sharp decrease in size of created clusters, without substantial changes in their quality. It is easy to see that in spite of using these different parameter values, descriptions of created clusters are almost the same, only their sizes have changed. One may also notice that the "Other Topics" group is much bigger and contains 22 instead of 8 documents. We may therefore say that increasing value of this threshold is not a good way of improving clustering quality.

6.3.5. INFLUENCE OF POST-PROCESSING METHODS In Chapter 4-3-5 we have proposed dendrogram pruning methods based on descriptions of created clusters, and not on values of similarities between its nodes. Here we will show how their use improves the quality of results.

–  –  –

It is clearly visible that according to our hypothesis removing groups that have no descriptions has an essential influence on the quality of algorithm's results. In above example hierarchy consists of a dozen or so top level groups. As AHC creates a binary dendrogram, in order to present such hierarchy it has to introduce "artificial" groups into the results structure (compare to Figure 4-3-1-2). Character of these groups is their lack of descriptions, clearly visible on right tree in the above figure. Left dendrogram is created just by removing these clusters. Remaining post-processing methods from Chapter 4-3-5 are used quite rare, as situations when they may be used are also rare.

Another interesting observation is lack of the "Other Topics" group in the right tree. It is due to the fact that no clusters were removed from it and so there was no reason to create such special group for documents that would be contained in these groups.

6.3.6. INFLUENCE OF TERMS WEIGHTING METHODS In this experiment we have confronted four already mentioned in Chapter 3-2-1 methods of

terms weighting:

• standard tf terms weighting

• its version assigning the square root of the tf coefficient value as weight of a term

• standard tf-idf terms weighting (see Figure 3-2-1-3)

• its version proposed by us, using linearly decreasing idf factor (see Figure 3-2-1-4) Dendrograms obtained by using these methods during terms weighting phase are shown


–  –  –

One may easily notice that it is hard to find any significant differences between hierarchies created by using these methods (with the sole exception of "standard" tf-idf weighting). In particular, ones created by using tf weighting and tf-idf formula from Figure 3-2-1-4 are very similar.

The most important difference occurring in case of use of standard tf-idf terms weighting method is frequently its inability to form larger clusters. In this example, this property manifests itself in case of the group "Clinton County" (being the largest from all created for used query). When using this method, this cluster is only half of its size when compared to use of other ones. This is due to our hypothesis from Chapter 3-2-1, where we have noticed that idf factor used in this method decreases too quickly with increase in number of documents in which a term appears.

It is hard to determine unequivocally, which one of the remaining methods allows to produce best results. In our opinion making such judgments would require conducting an experiment similar to one form Chapter 6-2, with at least several different queries and exact quality measures used. It is needless to say that in fact there would have to be three experiments – one for each of the presented methods.


ALGORITHMS In this subchapter we present a rough comparison of Agglomerative Hierarchical

Clustering and three other algorithms:

- 82 hierarchical Suffix Tree Clustering [Masłowska and Słowiński 03], creating the clusters tree on the basis of results of applying of "classical" STC algorithm (refer to Chapter 4-1-3 for details)

• hierarchical method used in the Vivisimo clustering service [Vivisimo]. Unfortunately, authors of this system don't reveal any secrets about type of used algorithm

• the LINGO algorithm [Osiński 03], being also an element of the Carrot2 system. It is the only compared here method that doesn't create a hierarchy.

It would be also interesting to compare "our" AHC implementation and some other version of this algorithm. However, unfortunately we don't know about any available implementation, so we couldn't perform such comparison.

The experiment was conducted on the results of a "data mining" query from the Vivisimo service. 400 top snippets (after duplicate removal about 350) were downloaded and then clustered using different algorithms. Such large input set was used in order to fairly compare these methods, as hierarchical STC was designed especially for clustering of big numbers (i.e.

hundreds) of documents.

Fig. 6-4-1 Comparison of results of different clustering algorithms for a query "data mining". From left to right:

AHC, LINGO, Vivisimo, and hierarchical STC (results of all methods are visualized using Carrot2 output component). Different total numbers of snippets in case of different algorithms result from overlapping.

- 83 After a closer examination one may notice that results created by different algorithms share a lot of common clusters (this is true especially in case of the largest ones). Groups such as labeled with phrases "Knowledge Discovery", "Data Warehouses", "Machine Learning", "Visualization" appear in all hierarchies (although their exact descriptions sometimes differ between results of different algorithms).

A noteworthy fact in case of AHC results is very small size of the "Other Topics" group (consisting only of 1 document of 350 present in the input). We have also noticed that results given by this algorithm in case of snippets downloaded from Vivisimo are slightly better than when using Google as data source. This may be due to the fact that Vivisimo makes use of several search engines that also probably provide longer snippets.

Here we will point out the most important properties of results of AHC and other


• AHC Agglomerative Hierarchical Clustering creates very small clusters when compared to the other methods. This is partially due to the fact that it is unable to produce overlapping results. Moreover, the clusters are labeled with very general descriptions (in part because of their hierarchical structure – groups with long, specific descriptions occur in the lower levels of hierarchy). These descriptions may be too short to understand the topic of the group (for instance "Text" or "Offers").

• LINGO This method creates the smallest number of clusters from all algorithms presented here.

On the other hand, their descriptions are very long and meaningful (although sometimes unfortunately too long and not matching all documents in their respective groups). For instance, groups labeled by the other methods "Decision Support" or "Notes" are described by LINGO respectively with phrases "Decision Support Software" and "Data Mining Lecture Notes".

• Vivisimo Results of method employed in Vivisimo seem to have no "outstanding features". In all considered aspects (size and number of groups and length of their descriptions), this algorithm seems to find a "compromise" between AHC and LINGO. One may also notice that hierarchy produced by it is less deep than one created by the AHC algorithm.

• hierarchical STC The most characteristic property of results of this method is their strong overlapping (compare total sizes of created clusters on Figure 6-4-1). Hierarchical STC creates several huge groups, but also a lot of very small, so that its results consist of biggest number of clusters when compared to ones of other algorithms (unfortunately we were unable to present on figure the whole list of top-level groups due to its length).

Moreover, labels of top-level clusters are very short (shorter even than in case of AHC), probably due to the same reasons as in case of this method.


Web mining, and in particular search results clustering are relatively new and quickly developing domains of computing science. These domains are promising and certainly there still remains a lot of work to be done before the clustering algorithms will be widely used.

New approaches to this problem are continually developed (for instance LINGO, which was created during the work on the Carrot2 system). Current methods usually seem to have a great advantage over old algorithms derived from the areas of both statistics and text databases (such as Agglomerative Hierarchical Clustering, which is the main subject of our thesis), as they pay more attention to meaning of texts contained in the input data, and not only to simple common occurrences of particular words in them.

However, as we have shown during the work on this thesis, thanks to introduction of some changes adapting it to text processing, also the AHC algorithm may successfully perform documents clustering. Results collected during the algorithm's evaluation clearly show that it may be useful in our application. It shouldn't be forgotten that AHC is algorithm designed rather for offline clustering, where this activity may be performed on larger collections of complete documents. Here we have shown that its results may be helpful to users even if the input data set is small and consists only of digests of documents.

This is an effect of employing pre- and post-processing methods that enable to fit this algorithm to our application domain. Thanks to use of these methods (i.e. phrases extraction and dendrogram pruning) it has become more sensitive to meaning of clustered texts. Our research has been guided towards these techniques also due to the fact that AHC itself is a well-studied and used for tens of years algorithm. But pre- and post-processing phases that according to us are very important are considerably less studied. Moreover, we have found these phases, where not numerical, but textual data is processed much more interesting.

Finally, we would like to emphasize meaning of the fact that this thesis is a part of much greater whole. Our implementation of AHC is a module of the Carrot2 system, which is the most advanced, if not the only one Open Source clustering service. Due to the fact of its integration in the Carrot2 framework, software being the result of this thesis may turn out much more useful than if it were a separate application. Moreover, maybe in the future our module will be developed further in order to perform research on other aspects of the algorithm.


Main scientific contributions of this thesis include:

• use of phrases in the AHC algorithm We have succeeded in implementing of a simple statistical technique of phrases extraction and employing it in our module. However simple, this method turned out to

- 85 give useful results. Moreover, it can be easily applied in any algorithm using the vector space model.

• introduction of description-based dendrogram pruning methods In order to perform post-processing of the algorithm's results we have proposed techniques based on analysis of result clusters labels instead of ones based on analysis of similarities between them, as we feel that these methods are more "topic-aware". This approach has been introduced also according to repeatedly emphasized in publications from this domain importance of good cluster descriptions.

implementation of a hierarchical clustering algorithm as a part of the Carrot2 system • During work on this thesis we have implemented the AHC algorithm together with mentioned before methods. Then our code has been released as a part of Open Source Carrot2 system. This module is the only one from three clustering components in this system capable of generating hierarchical results.

• simple evaluation of our version of the algorithm performed We have conducted different types of evaluation experiments – user- and expert-oriented.

User assessment has shown that our implementation of the AHC algorithm gives useful results. On the other hand, expert evaluation enabled us to study the influence of different parameters of the algorithm on its results and compare AHC and recent approaches to document clustering.


Unfortunately, we haven't succeeded in addressing all issues associated with implementation of AHC. However, we are aware of existence of these problems and point

them out here as subjects of possible future research:

Pages:     | 1 |   ...   | 9 | 10 || 12 |

Similar works:

«Purchase Agreement for Wigglebutt Partnership Family (Female) Bill & Angela Ogle, d.b.a. Wigglebutt Aussies and hereinafter known as the Breeder, in consideration of payment of $ _1600.00_(discounted price) and the agreement of _ hereinafter known as the Buyer, to abide by the terms of the agreement set forth below, the execution of which is acknowledged, hereby releases to the undersigned Buyer responsibility and ownership subject to the reservation of rights set forth below for the Miniature...»

«ETSI ETR 289 TECHNICAL October 1996 REPORT Source: EBU/CENELEC/ETSI-JTC Reference: DTR/JTC-DVB-14 ICS: 33.020 Key words: Digital, video, broadcasting, DVB, TV, CA, security Union Européenne de Radio-Télévision European Broadcasting Union Digital Video Broadcasting (DVB); Support for use of scrambling and Conditional Access (CA) within digital broadcasting systems ETSI European Telecommunications Standards Institute ETSI Secretariat Postal address: F-06921 Sophia Antipolis CEDEX FRANCE Office...»

«Case 1:13-cr-00521-LTS Document 256 Filed 03/06/16 Page 1 of 16 U.S. Department of Justice United States Attorney Southern District of New York The Silvio J. Mollo Building One Saint Andrew’s Plaza New York, New York 10007 March 6, 2016 Via ECF & Facsimile Honorable Laura Taylor Swain United States District Judge Southern District of New York Fax: 212-805-0426 Re: United States v. Adam Samia, S9 13 Cr. 521 (LTS) Dear Judge Swain: The Government writes in opposition to the March 3, 2016...»

«MCAFEE SCHOOL OF THEOLOGY STYLE GUIDE REVISED FALL 2010 based on Turabian, Kate L. A Manual for Writers of Term Papers, Theses, and Dissertations, 7th ed. Chicago: The University of Chicago Press, 2007.. Nancy L. deClaissé-Walford McAfee School of Theology Atlanta, Georgia Fall 2010 TABLE OF CONTENTS SECTION 1: PARTS OF A PAPER 1.1 Title Page 1.2 Page Numbering 1.3 Margins 1.4 General Formatting Issues 1.4.1 Font Size 1.4.2 Line Spacing 1.4.3 Indenting Paragraphs 1.4.4 Paragraphs 1.4.5...»

«Comp. by: pg2793 Stage : Revises1 ChapterID: 0001286003 Date:19/7/11 Time:13:19:07 Filepath:d:/womat-filecopy/0001286003.3D OUP UNCORRECTED PROOF – REVISES, 19/7/2011, SPi Part IV Narrative Structures and Strategies Comp. by: pg2793 Stage : Revises1 ChapterID: 0001286003 Date:19/7/11 Time:13:19:07 Filepath:d:/womat-filecopy/0001286003.3D OUP UNCORRECTED PROOF – REVISES, 19/7/2011, SPi Comp. by: pg2793 Stage : Revises1 ChapterID: 0001286003 Date:19/7/11 Time:13:19:07...»

«269 BIBLIOGRAPHY Alder, V S 1979. From the mundane to the magnificent: a volume of autobiography. London: Rider. Alder, V S 2000. The fifth dimension. York Beach, Me: Weiser. Anderson, C A 1985. The problem is God: the selection and care of your personal God. Walpole, NH: Stillpoint. Anderson, C A 1991. A guide to the selection and care of your personal God. Canton, Mass: Squantum. Anderson, C A 1993. ‘The New Thought movement: a link between east and west’. Paper delivered at the...»

«Dog/Puppy Handbook Thank you for offering your home, time, and heart to an ARF animal that needs foster care. By providing a foster home, you are providing a stable, loving environment as well as much needed socialization and basic training that will increase their adoptability. If you have never fostered a shelter animal before, the following will help you with the care of your foster puppy/dog. Foster Contacts Foster Supervisor (fostercoordinator@arflife.org) The best way to reach the foster...»

«Mon temps, le vôtre et le nôtre : des catégories temporelles aux significations plurielles Licínio M. Vicente Tomás, Ph. D. Universidade dos Açores, Portugal Résumé Ce texte vise à donner un bref aperçu de l’analyse interprétative en ce qui concerne la construction des catégories temporelles ainsi que leur rapport aux contenus et significations socioculturelles et individuelles qui en découlent. Pour ce faire, nous relèverons une approche du temps comme phénomène qui légitime...»

«Chapter Fourteen THE SACRED SIGNIFICANCE OF WIND CAVE AND ITS ENVIRONS Probably no area speaks to the controversy over the sacredness of the Black Hills better than Wind Cave and its environs, which include the Race Track, the Buffalo Gap, and the Hot Springs. The identification of Wind Cave as a sacred site appears to be recent, at least from the vantage point of published sources, even though this is not the sense one gets from tribal elders White Bull, Left Hand Bear, Charlie Eagle Louse,...»


«Installation instructions Refrigerator & Freezer and User guide NZ AU UK IE PAC Active Smart™ models 635 mm wide E331T, E372B, E381T, E402B, E411T, E415H 680 mm wide E406B, E413T, E440T, E442B 790 mm wide E521T, E522B Cyclic and Compact models 525 mm wide P120, E169T, C170T, C190, E240B, E249T, C270 635 mm wide C373, C450, E373, E450 Vertical freezer models 525 mm wide E150, E210 635 mm wide E308, E388 Chest freezer models Slimline H215, H275, H320 Standard H160, H220, H280, H360, H510, H701...»

«Rare lung cancers There are several different kinds of lung cancer, often referred to as lung cancer subtypes. Some of these occur more often than others. In this factsheet we will specifically look at the subtypes of cancers that do not happen very often and are considered ‘rare’. People’s experiences of these rarer types of lung cancer will generally be similar to those of people with more common forms of the condition. However, there are some differences in treatment and outlook....»

<<  HOME   |    CONTACTS
2016 www.dissertation.xlibx.info - Dissertations, online materials

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.