WWW.DISSERTATION.XLIBX.INFO
FREE ELECTRONIC LIBRARY - Dissertations, online materials
 
<< HOME
CONTACTS



Pages:     | 1 |   ...   | 6 | 7 || 9 | 10 |   ...   | 12 |

«by MICHAŁ WRÓBLEWSKI Supervisor JERZY STEFANOWSKI, Assistant Professor Referee ROBERT SUSMAGA, Assistant Professor MASTER THESIS Submitted in ...»

-- [ Page 8 ] --

The need for this improvement is illustrated on above Figure 4-3-4-1. There we have a cluster of documents corresponding to Shakespeare's "King Lear". It has a subcluster which could be described with following phrases: "William Shakespeare", "King Lear", "King" and "Lear" as all of these occur frequently in it. It is easy to see that three last phrases are redundant. Using the method from [Maarek et al. 00] we would remove only the phrase "King Lear", but also its subphrases "King" and "Lear" should be removed from the description.

- 53 It should be mentioned here that this method may lead to removal of all phrases that make up cluster label. It happens most often in case of "artificial" groups from Figure 4-3-1-2, as they are usually labeled only with phrases that also occur in descriptions of their ancestors in the hierarchy. This is due to the fact that theses groups themselves have no specific meaning.

Second problem is quite common for phrase-based algorithms. Authors of Suffix Tree Clustering have also run into it and have proposed three simple heuristic methods in order to deal with it [Zamir and Etzioni 99]. We have employed here all of these approaches.

4.3.5. DESCRIPTIONS-BASED CLUSTERS POST-PROCESSING During the work on this thesis we have discovered that in order to perform dendrogram pruning instead of methods already described in literature (see Chapter 4-3-2) one may use techniques based on simple analysis of cluster descriptions. We hold that such approach may give better results, as cluster labels tell us more about meaning of their content than values of their mutual similarities. This approach brings also AHC closer to new generation of clustering algorithms like STC or LINGO, which are based mostly just on group descriptions.

We will describe here several simple strategies, which are implemented in our system.

• removing clusters without description If no description can be created for a group (see Chapter 4-3-3 for information about describing groups), it is removed from the results hierarchy and replaced by its respective subgroups. A rationale for this method is importance of good cluster descriptions (compare with Chapter 2-3). Group with no label would surely be completely noninformative to the user, who would have to browse all of its documents in order to deduce its topic.

This method was introduced mainly in order to deal with "artificial" groups (see Figure 4-3-1-2), for whom usually no description can be created as they correspond to no distinct topics (compare also the former subchapter). It is also the most important of strategies presented here, as it helps to fix our biggest problem.

• merging clusters with the same descriptions Sometimes the AHC algorithm creates two very similar, but yet separate groups under the same node in the hierarchy. However, the difference between them is not enough to receive different labels in the description process. Then they are merged into one, as description is for us a more important hint than exact dendrogram's structure. Besides, presence of two different groups with the same description might confuse the user.

–  –  –

Fig. 4-3-5-1 Example of merging clusters with equal descriptions. Two subclusters of a cluster "King County" are merged as they have the same label. However, they are not merged with subcluster of cluster "King's College" that has the same description, as they have different parent clusters.

–  –  –

Fig. 4-3-5-2 Example of attaching a cluster to another one with more general description.

As a result of two latter post-processing methods it may turn out that after merging some of the clusters will have only one descendant, and such hierarchy may seem strange. However, we allow such situation, as it may also happen in the input set, when in some group only one subtopics may be distinguished. In such case, the algorithm should reflect this fact in its results.

4.3.6. FINAL POST-PROCESSING Purpose of methods described in the former subchapter was to choose from the result dendrogram only such clusters, which really correspond to some distinguishing thematic categories. Here we will present three simple approaches used by us to increase readability of

the hierarchy to the user:

• removing root of the hierarchy If the cluster corresponding to the root of the dendrogram is not removed during pruning phase, then we eliminate it here (independently of its description or any other factor), and its subclusters become the top level of hierarchy. This method is used in order to increase the readability of the first, most important level of hierarchy.

- 55 creating the "Other Topics" group After post-processing of dendrogram it may turn out that due to removal of some clusters, a certain part of documents from input collection is not contained in any result cluster. These documents usually don't correspond to any of the topics in final clusters structure. A special group labeled "Other Topics" is created and they are assigned to it.

• ordering result clusters by their size For each of the result groups, its list of subgroups is ordered according to their size, so that the biggest and possibly more important ones appear at the beginning of it.

- 56 IMPLEMENTATION OF THE AHC

ALGORITHM IN THE CARROT2





SYSTEM In this chapter we will present the way of implementation of introduced earlier in Chapters 3 and 4 issues, particularly the AHC algorithm. We will also discuss in details already mentioned before Carrot2 environment, which spared us a considerable amount of work.

Contrary to the previous parts of the thesis, this chapter focuses rather on technical than theoretical aspects of the AHC algorithm. It gives a brief overview of the module's architecture along with its way of working. Moreover, in its final subchapter, issues concerning efficiency of chosen solutions are also discussed.

5.1. THE CARROT2 SYSTEM ARCHITECTURE

5.1.1. BASIC IDEAS Carrot2 [Carrot2] is a generic data-processing framework, developed by Dawid Weiss [Weiss 02a]. It is currently used to perform experiments with different search results clustering algorithms, but as claims its author, "it can be easily configured to do other, interesting things". The system has a pipelined architecture, based on processes, being chains of components, i.e. independent software entities (modules) capable of performing some specified tasks.

Fig. 5-1-1-1 Components and data flow in Carrot2 (taken from [Weiss 02a]).

Components communicate between themselves using HTTP protocol instead of function calls, so they may be distributed on different machines. Moreover, the XML language is (usually) used in the inter-component communication. These facts have a number of

implications:

• Flexibility, as components are independent software units.

- 57 Easy distribution.

• Low speed of processing due use of HTTP and XML in communication between components. However, use of HTTP enables possible distribution of components which may be an advantage, as it allows load balancing in case of severe network load.

As Carrot2 was meant to be rather research than commercial, user-oriented system, the main goal was ease of performance of different experiments and not efficiency. Moreover, the clustering process has a pipeline-oriented nature (e.g. stemming – terms weighting – similarities calculation – clustering algorithm – post-processing etc.), independently of the type of method used. This architecture completely suits our needs since we can only focus on writing the clustering component and then assemble it in a process with already existing components.

5.1.2. ELEMENTS OF SYSTEM ARCHITECTURE

In the Carrot2 system, four types of components can be distinguished (see Figure 5-1-1-1):

• input components Input component is at the beginning of a process. It accepts a query request from the controller. Then it should generate data corresponding to this query (e.g. download from some search engine snippets relevant to the query)

• filters ("intermediate" components) This type of component takes as input some data and transforms it, sending its own results further down the stream to next component in the processing chain. Most important filters in Carrot2 are so called clusterers, implementing clustering algorithms.

Filters associated with the AHC algorithm are presented in Chapter 5-2.

• output components Output component converts the received data into a form that may be returned to the outside environment. For instance such component may visualize results of a clustering algorithm as a HTML page.

• controller It is the main component in the Carrot2 system, controlling the cooperation of other ones from the input to the output. It receives the query from the user and returns to him the final data.

Each component is described by a special file, called descriptor. It provides all data about a component needed by the controller, namely its network address (needed in order to localize it) and data formats that it produces / accepts. The latter information (not obligatory) enables the controller to check if one component can receive results of another as input.

- 58 Fig. 5-1-2-1 Component descriptor of a sample output component.

Information about processes is also provided by their respective descriptors. Purpose of a descriptor is to define, which components (and in which order) make up the process and specify values of parameters passed to components included in it. Most important property of descriptors is the possibility to change them at runtime, so one may repeat one clustering process using different values of parameters without either restarting the server or recompiling the components.

Fig. 5-1-2-2 Process descriptor of a sample clustering process employing STC algorithm.

5.2. AHC MODULE AS A PART OF CARROT2 Agglomerative Hierarchical Clustering algorithm implementation in the Carrot2 system consists of two filter components, named TermsFilter and AHCFilter. Main purpose of the first component is to extract terms and phrases from the input snippet list and calculate the term-document matrix (in fact, this matrix is returned by the component as vector of lists due to he mentioned in Chapter 3-1-2 fact that it is sparse). This component performs neither stemming nor stop-words extraction, which have to be done earlier in the processing chain.

Example of the output of TermsFilter is shown below:

–  –  –

AHCFilter receives the output of TermsFilter, calculates matrix of similarities between documents, executes the AHC algorithm and finally performs the cluster labeling along with the dendrogram pruning methods described in Chapter 4-3-5. Its output is the final tree of clusters and their labels, presented below.

–  –  –

AHCFilter also receives many parameters that enable to change behavior of different parts of the algorithm. However, this variety of parameters may be also viewed as a disadvantage, as it makes tuning of the algorithm very difficult. Parameters of this component are listed

below:

–  –  –

Main rationale for splitting of the AHC module into two ones was fact that the first phase of algorithm (i.e. terms weighing) may be common for many clustering methods. So once implemented, TermsFilter component can be reused many times in different processes.

Unfortunately, splitting of this module has caused a slight increase of its execution time (efficiency issues will be discussed further in more details in Chapter 5-3-2). Output of the TermsFilter component is directly passed to input of the AHCFilter component. So if they were two classes in one software unit, passing of data between them could be realized simply just by passing of few references to memory. But as they are two separate components, we have to store and then read whole lists of hundreds of objects (all input documents and terms and phrases occurring in them with their respective weights) in XML format. However, influence of these tasks on the overall module's performance is not very significant, as they take just a dozen or so percent of its total execution time (when clustering 100 snippets), and even less for larger amounts of data (compare Figure 5-3-2-2).

5.3. AHC MODULE DESIGN AND IMPLEMENTATION

5.3.1. GENERAL DESIGN AND IMPLEMENTATION ISSUES Both components making up AHC module were designed in the UML modeling language [UML], being a de facto standard in object-oriented modeling. Due to definitely research nature of the project and frequent changes of requirements, only a rough, general design was

created, consisting of the following diagrams:

–  –  –

Fig. 5-3-1-1 General architecture of the AHC module in the Carrot2 system on UML component diagram. Two main packages: termsfilter and ahcfilter correspond to filter components of the Carrot2 system.

–  –  –

- 63 It was already mentioned that one of the major goals during design of the system was ease of performing different experiments. Due to this fact, we have put great emphasis on parametrization of the module. Many tasks, such as for instance terms weighting or interdocument similarities calculation may be performed in may different ways. Therefore for each of them several different classes having a common interface that may perform the same task were implemented (see Figures 5-3-1-2 and 5-3-1-3). As objects of these classes may be passed as parameters to respective components in descriptor, it is possible to replace one method of for instance term weighing by another during runtime.

As the language of implementation, we have chosen Java, being also the language in which is written the whole Carrot2 system (although its architecture in fact imposes no constraints on used language [Weiss 02a]). Java was chosen because it has a number of very useful features,

namely:

• A number of most important data structures such as lists or hash tables are already implemented in the standard libraries of the language.



Pages:     | 1 |   ...   | 6 | 7 || 9 | 10 |   ...   | 12 |


Similar works:

«Order of the Minister for Foreign Trade and Development Cooperation of 8 January, no. DHS_2016.18114, laying down administrative rules and a ceiling for grants awarded under the Ministry of Foreign Affairs Grant Regulations 2006 (Addressing Root Causes Fund 2016-2021) The Minister for Foreign Trade and Development Cooperation; Having regard to articles 6 and 7 of the Ministry of Foreign Affairs Grants Decree;1 Having regard to article 5.1 of the Ministry of Foreign Affairs Grant Regulations...»

«Policy Study No. 254 PASSENGER-FRIENDLY AIRPORTS: ANOTHER REASON FOR AIRPORT PRIVATIZATION BY ASHEESH ADVANI Executive Summary S ince the privatization of the British Airports Authority in 1987, over 20 countries have privatized airports by means of equity divestitures, leases, and incentive-laden management contracts. After more than a decade of experience, many of the benefits of airport privatization are becoming more discernible: • Capital infusion: Privatization enables airports to raise...»

«Seeing Through the Lies 9/10/07 3:33 PM Page 16 C 1 HAPTER A Half-Truth Is a Whole Lie Guide me in your truth and teach me, for you are God my Savior, and my hope is in you all day long.Psalm 25:5 The half-truth has been around since Satan and Eve met for lunch in the garden. And it wasn’t the snake who first danced around the truth. It was Eve. Sweet, innocent Eve. The crown of creation, the helpmate of all mankind. (At least all mankind of that day!) It’s not that she really lied or...»

«WHO/CDS/EPR/GIP/2006.2 WHO strategic action plan for pandemic influenza 2006–2007 WHO strategic action plan for pandemic influenza 2006–2007 Contents 2 Executive summary 4 Background 6 Goals 7 Strategic actions and expected results 7 Reduce human exposure to the H5N1 virus 10 Strengthen the early warning system 14 Intensify rapid containment operations 16 Build capacity to cope with a pandemic 18 Coordinate global scientific research and development 23 2006-2007 Funding Requirements 1 WHO...»

«QUALCOMM INCORPORATED Notice of 2016 Annual Meeting of Stockholders and Proxy Statement January 21, 2016 Dear Fellow Stockholder: You are cordially invited to attend Qualcomm’s 2016 Annual Meeting of Stockholders (the Annual Meeting) on Tuesday, March 8, 2016. The meeting will begin promptly at 9:30 a.m. Pacific Time at the Irwin M. Jacobs Qualcomm Hall, 5775 Morehouse Drive, San Diego, California 92121. I invite you to arrive early at 8:30 a.m. to preview our product displays. We will begin...»

«The Codsall High Federation of Schools Bilbrook Church of England Middle School  01902 840910 www.bilbrook middle.com  office@bilbrook.staffs.sch Don’t forget to follow us on Twitter Newsletter-December 2015 What a fantastic first term! The fifteen school weeks of this term seem to have flown by. I have taught in a wide and diverse range of schools over the course of my career and I can honestly say that so many of the things I've seen across all three schools are some of the very...»

«Caminos que nos separan Propuesta Didáctica para la exposición Mapamundistas 2014-2015    Miren Arregi, Zulema Alvarez, Marta Arana, Oihane Ardanaz, Josune Gago y Itziar Bazterrika. 1  ÍNDICE 1.PROPUESTA EDUCATIVA PARA LA EXPOSICIÓN MAPAMUNDISTAS -2014..3.2.JUSTIFICACIÓN..3-4.3.PRINCIPIOS EDUCATIVOS Y ESTÉTICOS.4-5. 4.PRINCIPIOS METODOLÓGICOS.5-6. 5.OBJETIVOS..6-7. 6.INFORMACIÓN Y RECURSOS..7-17. 7.ACTIVIDADES..18. 8.BIBLIOGRAFIA Y WEBGRAFIA..18. 2  PROPUESTA EDUCATIVA PARA LA...»

«Men's healing circles Transcript of a presentation at the No To Violence Conference, Melbourne, Australia, November 2012 Presenters Jamie Thomas and Kerry Thomson Boorndawan Willam Aboriginal Healing Service www.bwahs.org.au Note This presentation was transcribed by a professional transcription company and then lightly copyedited for readability and length. There are some gaps in the transcript (indicated by.), where the recording was not clear. The transcript may also inadvertently contain...»

«Tenth U.S. National Conference on Earthquake Engineering Frontiers of Earthquake Engineering July 21-25, 2014 Anchorage, Alaska 10NCEE EARTHQUAKE RESILIENCE OF A 632METER SUPER-TALL BUILDING WITH ENERGY DISSIPATION OUTRIGGERS ZHOU Ying 1, ZHANG Cuiqiang2 and LU Xilin3 ABSTRACT Structural systems of super-tall buildings, including RC/SRC core walls, frames, and sometimes bracings, are utilized collaboratively to resist strong earthquakes and wind loads. Outriggers are usually set to ensure the...»

«Comput Econ DOI 10.1007/s10614-014-9465-4 Minimality of State Space Solutions of DSGE Models and Existence Conditions for Their VAR Representation Massimo Franchi · Paolo Paruolo Accepted: 13 August 2014 © European Union 2014 Abstract Standard solution methods of DSGE models do not necessarily deliver minimal state space forms. When the ABCD form is non-minimal, the conditions in the literature are not necessary for the existence of a VAR representation of the observables. In this paper we...»

«Energy and Power Engineering, 2015, 7, 12-29 Published Online January 2015 in SciRes. http://www.scirp.org/journal/epe http://dx.doi.org/10.4236/epe.2015.71002 Integration of Renewable Energy Resources in Microgrid Manzar Ahmed, Uzma Amin, Suhail Aftab, Zaki Ahmed Electrical Engineering Department, Faculty of Engineering, University of South Asia, Lahore, Pakistan Email: azaki786@usa.edu.pk Received 10 January 2015; accepted 24 January 2015; published 29 January 2015 Copyright © 2015 by...»

«Last modified February 26, 2016 Bear-Resistant Products Testing Program CERTIFIED BEAR-RESISTANT PRODUCTS Methods for Complying with Food Storage Regulations on Public Lands In order to keep attractants unavailable to bears and reduce human/bear conflicts, regulations pertaining to proper food and attractant storage are likely to be present on public lands within grizzly bear habitat. Use of IGBC-certified bear-resistant containers is one of the methods available to comply with many of these...»





 
<<  HOME   |    CONTACTS
2016 www.dissertation.xlibx.info - Dissertations, online materials

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.