by MICHAŁ WRÓBLEWSKI
Supervisor: JERZY STEFANOWSKI, Assistant Professor
Referee: ROBERT SUSMAGA, Assistant Professor
MASTER THESIS
Submitted in ...

# DOCUMENTS CLUSTERING ALGORITHMS

The main purpose of this chapter is to present to the Reader the most important clustering algorithms, with particular attention to Agglomerative Hierarchical Clustering, as it is the major subject of this thesis. In the first part, a short overview of the clustering problem is given. Several algorithms (including ones based on the vector space model) are presented in order to show different approaches to this issue. The second subchapter is dedicated entirely to a detailed presentation of Agglomerative Hierarchical Clustering. As this method is in reality just a framework of an algorithm, and may be interpreted in many ways, we present its main versions and describe the differences between them. Finally, we discuss the techniques we applied in order to adapt the AHC algorithm to our application, i.e. Web search results clustering.

**4.1. OVERVIEW OF MAIN CLUSTERING ALGORITHMS**

Before discussing clustering algorithms, we should present the main idea of clustering itself.

**A good (although a little too narrow for our purposes) definition is given in [Everitt 80]:**

"Given a number of objects or individuals, each of which is described by a set of numerical measures, devise a classification scheme for grouping the objects into a number of classes such that objects within classes are similar in some respect and unlike those from other classes. The number of classes and the characteristics of each class are to be determined."

**This definition emphasizes the most important properties of clustering:**

• Objects within the formed clusters should be homogeneous, while the clusters themselves should be heterogeneous. This is necessary, as clusters should help us distinguish between objects.

• The number and attributes of the clusters should be found by the algorithm, not given as input data. So the algorithm should not only assign objects to groups, but also determine the structure of these groups. This is why clustering is sometimes called unsupervised learning.

An example of clustering numerical data, consisting of a set of objects described by two variables (so that they may be interpreted as points in a two-dimensional space), is shown in Figure 4-1-1 below. These points have been classified into five distinct groups.

Fig. 4-1-1 Clustering example

However, the clustering definition mentioned earlier misses one important point: it assumes that data are always described by numerical measures. This does not always have to be true, as clustering is used in many different domains, and one might for instance wish to find clusters in other kinds of data, such as images or text documents (which is our case). Most of the known clustering algorithms were designed for numeric data, so it may seem easier to transform attributes of other data types into numbers (e.g. for text, using the vector space model).

However, this transformation may lead to a loss of some properties of the original data representation (this issue is discussed for the vector space model in Chapter 3-1-2).

Results of different clustering algorithms may have very different properties. The most important are:

• Hierarchical structure of groups

In most cases it turns out that the clusters created by an algorithm could be further split into smaller ones (or merged into larger ones), depending on how much detail we want. Performing clustering on an example collection of documents concerning intelligence, we could distinguish the groups "artificial intelligence" and "emotional intelligence". But the cluster of documents about artificial intelligence could be further split into groups such as "neural networks" or "knowledge discovery". Without a hierarchy, the clusters would either be too general and contain a lot of documents, or there would be too many of them. Using a tree instead of a list of groups enables the user to navigate quickly from general to detailed topics.

• Overlapping groups

Some of the clustered objects can be assigned equally well to more than one cluster. Consider for example a great book about UNIX network programming, written by W. Richard Stevens. It belongs to two categories: books about network programming and books about the UNIX operating system; assigning it to only one of them would mean a loss of valuable information.

• Leaving some of the clustered objects outside the created groups

Usually there exists among the data some number of objects that are not similar to any others. Such elements (called outliers) should not be classified, as they would not fit into any group.

**4.1.1. K-MEANS**

K-means is one of the classical methods of cluster analysis ([Rocchio 66], [MacQueen 67]). It is based on numerical data, so in order to employ it for clustering text documents one should use the vector space model. The algorithm is non-hierarchical, and in its basic version it does not allow the clusters to overlap. In order to use k-means, we must first specify the desired number of clusters k. Then the algorithm acts in the following manner:

```
Select k initial random centroids
Do
    Assign each object to the closest centroid
    Recompute all centroids
While (centroids are not stabilized)
```

This method iteratively optimizes the assignment of objects to clusters, using as a criterion the distance of each object to its cluster's centroid. Centroids are computed as the mean vectors of the objects belonging to the respective clusters (therefore centroids are also called means, hence the algorithm's name).

The main advantage of the algorithm is its speed. It requires only O(nk) time per iteration, where k is the desired number of clusters and n the number of input documents (when considering document clustering algorithms, n usually denotes the size of the document collection, so from now on, whenever no meaning for n is given, it is supposed to have exactly this meaning). The number of iterations is usually significantly smaller than the number of objects (although there is no such guarantee), so the overall complexity of the k-means algorithm is usually said to be O(nk).
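The loop above can be sketched in plain Python (a minimal illustration using squared Euclidean distance; all function and variable names are ours, and a real implementation would add restarts and a better initialization):

```python
import random

def dist(a, b):
    """Squared Euclidean distance between two points (tuples of floats)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(xs) / len(pts) for xs in zip(*pts))

def kmeans(points, k, max_iter=100):
    """Minimal k-means sketch: returns (clusters, centroids)."""
    centroids = random.sample(points, k)              # k initial random centroids
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                              # assign to the closest centroid
            i = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[i].append(p)
        new_centroids = [mean(c) if c else centroids[i]   # keep old centroid if empty
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:                # centroids stabilized
            return clusters, centroids
        centroids = new_centroids
    return clusters, centroids
```

Note that the result depends on the random initial centroids, which is exactly the initialization weakness listed among the drawbacks.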

Unfortunately, this algorithm is quite simple and performs poorly, especially when applied to textual data [Zamir and Etzioni 98]. Its main drawbacks are:

• Necessity of knowing the number of clusters in advance (usually there is no way of knowing how many clusters exist).

• The way of initializing the centroids is not specified, so they usually have to be chosen randomly. An attempt to overcome both this and the former problem, by finding the initial centroids using another algorithm, is discussed in Chapter 4-1-2.

• Only local optimization of cluster assignment (a "greedy" method)

• Inability of the method to deal with clusters of non-convex shapes, due to the usage of centroids as the cluster representation

There exists also a "fuzzy" version of the k-means algorithm [Bezdek 81]. In this method, instead of simply being assigned to the closest cluster, each document has a degree of membership in each cluster, so that overlapping clusters are allowed:

```
Select k initial random centroids
Do
    Compute for each object its degree of membership in each cluster
    Recompute all centroids
While (centroids are not stabilized)
```
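The key difference in the fuzzy variant is the membership computation. Under the standard fuzzy c-means update formula (with fuzziness exponent m > 1; the function names are ours), the degree of membership of one object in each cluster can be sketched as:

```python
def dist(a, b):
    """Squared Euclidean distance between two points (tuples of floats)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def memberships(point, centroids, m=2.0):
    """Fuzzy c-means membership update: degree of membership of one
    object in each cluster; the memberships sum to 1."""
    d = [dist(point, c) for c in centroids]      # squared distances
    if any(x == 0.0 for x in d):                 # object coincides with a centroid
        return [1.0 if x == 0.0 else 0.0 for x in d]
    exp = 1.0 / (m - 1.0)                        # 2/(m-1) on norms = 1/(m-1) on squares
    return [1.0 / sum((d[i] / d[j]) ** exp for j in range(len(d)))
            for i in range(len(d))]
```

Objects close to a centroid thus get a membership near 1 for that cluster, while objects between centroids get partial memberships in several clusters.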

**4.1.2. OTHER METHODS BASED ON THE VECTOR SPACE MODEL**

An approach dealing with the previously mentioned problems of the k-means algorithm was proposed in [Cutting et al. 92]. This method, called Buckshot, calculates the initial cluster centroids by applying the AHC algorithm to a small (random) sample of documents from the collection. The groups found by AHC are then taken as these centroids. This trick not only finds the number and positions of the initial clusters, but also speeds up the computations, as k-means performs a smaller number of iterations before stabilization when the initial centroids are already known. As the AHC algorithm used in Buckshot runs in time O(n²), the initial sample is of size √(kn) in order to maintain the overall performance of O(kn). A comparison of clustering methods performed in [Zamir and Etzioni 98] shows that using this method improves both the time and the quality of clustering, but unfortunately the gain is not significant. The main drawback of this method results from the fact that the initial sample may not be representative of the whole document collection.

**Another simple approach is the Single-pass method [Hill 68]:**

```
Treat the first document as the first cluster
Do
    Add the next document from the list to the closest cluster
        if their similarity is above the (predetermined) similarity threshold,
    otherwise treat it as a new cluster
While (not all documents are clustered)
```
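This single scan can be sketched in Python as follows (here each cluster is represented by its first document, one simple choice among several; the similarity function and threshold are supplied by the caller, and all names are ours):

```python
def single_pass(docs, threshold, similarity):
    """Single-pass clustering sketch: one scan over the documents in input order.

    Each cluster is a dict with its member documents and a representative
    (here simply the first document added to the cluster)."""
    clusters = []
    for doc in docs:
        best, best_sim = None, threshold
        for c in clusters:
            s = similarity(doc, c["rep"])
            if s >= best_sim:                       # closest cluster above threshold
                best, best_sim = c, s
        if best is None:
            clusters.append({"members": [doc], "rep": doc})   # start a new cluster
        else:
            best["members"].append(doc)
    return clusters
```

The order dependence criticized below is visible directly: reordering `docs` can change which document becomes a cluster representative, and hence the final grouping.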

This method has computational costs similar to k-means: it runs in linear time O(kn), where k is the desired number of clusters. However, it also has similar disadvantages. Another weak point is that the clustering results depend heavily on the order of the documents in the input list. Again according to [Zamir and Etzioni 98], this method produces results of the worst quality.

**4.1.3. SUFFIX TREE CLUSTERING**

Suffix Tree Clustering (STC) is a relatively new clustering algorithm, designed specifically for document clustering. It was originally proposed by Zamir and Etzioni [Zamir and Etzioni 98], and it is the first clustering algorithm that fulfills their requirements for Web document clustering methods (presented in Chapter 2.3). It is based on the idea of creating clusters from documents that share common phrases.

The first phase of the algorithm is finding common phrases and the documents that share them. This is done using a special data structure called a suffix tree [Ukkonen 95]. This structure enables us to find all phrases, and the documents sharing them, in time linear in the number of documents; moreover, new documents can be added to it incrementally, as they arrive from the search engine. An example suffix tree constructed from three documents is shown below in Figure 4-1-3-1:

Fig. 4-1-3-1 Sample suffix tree for three documents containing the sentences: "cat ate cheese", "mouse ate cheese too" and "cat ate mouse too" [Zamir and Etzioni 98].
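For illustration only, the phrase-to-documents index that the suffix tree provides can be emulated with a brute-force dictionary of all word-level phrases. Unlike Ukkonen's construction this is quadratic per document rather than linear, but on small inputs it yields the same base clusters (the function name is ours):

```python
from collections import defaultdict

def shared_phrases(docs):
    """Map every word-level phrase to the set of documents containing it,
    keeping only phrases shared by more than one document (base clusters)."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.split()
        # enumerate all contiguous word sequences (phrases) of the document
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                index[tuple(words[i:j])].add(doc_id)
    return {p: ids for p, ids in index.items() if len(ids) > 1}
```

Running it on the three sentences of Figure 4-1-3-1 recovers, for example, the phrase "cat ate" shared by documents 1 and 3, and "ate" shared by all three.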

In the next stage, a list of base (initial) clusters is created: all phrases that occur in more than one document are considered base clusters. Then for each base cluster its score function is calculated. In essence, this function promotes groups that contain a larger number of documents and are described by longer (i.e. more informative) phrases. Clusters whose score does not exceed a certain threshold value are rejected.

The purpose of the last stage of STC is to merge base clusters that share a considerable number of documents. In this phase, a base cluster graph is created. The nodes of this graph correspond to base clusters, and edges exist between clusters that share more than a certain fraction of their documents.

Fig. 4-1-3-2 Base cluster graph for suffix tree from Figure 4-1-3-1 [Zamir and Etzioni 98].
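A sketch of this merging phase, assuming the overlap criterion of [Zamir and Etzioni 98] (an edge between base clusters A and B when both |A∩B|/|A| and |A∩B|/|B| exceed 0.5) and taking connected components of the graph as the merged clusters; all names are ours:

```python
def merge_base_clusters(base, overlap=0.5):
    """Merge base clusters (a dict: phrase -> set of document ids).

    Builds the base cluster graph under the overlap criterion, then
    returns its connected components as lists of phrases."""
    items = list(base.items())
    n = len(items)
    adj = [[] for _ in range(n)]
    for i in range(n):                       # edges by document overlap
        for j in range(i + 1, n):
            a, b = items[i][1], items[j][1]
            inter = len(a & b)
            if inter / len(a) > overlap and inter / len(b) > overlap:
                adj[i].append(j)
                adj[j].append(i)
    seen, components = set(), []
    for s in range(n):                       # connected components via DFS
        if s in seen:
            continue
        stack, comp = [s], []
        seen.add(s)
        while stack:
            v = stack.pop()
            comp.append(items[v][0])
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(comp)
    return components
```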

This method was the first really successful search results clustering algorithm. Its authors claim that it is both faster (due to its linear complexity and incremental processing) and gives better results than any method based on the vector space model [Zamir and Etzioni 98].

The main disadvantage of the basic version of STC is the "flat" structure of the resulting clusters, i.e. the lack of a hierarchy. An attempt to overcome this problem by building a hierarchy from the resulting groups is presented in [Masłowska and Słowiński 03].

**4.1.4. LINGO**

During the work on the Carrot2 system a new clustering algorithm, called LINGO, was proposed [Osiński 03]. This method works in a very specific way: it first finds, in the input set of documents, phrases that would fit as cluster descriptions (called "abstract concepts"), and only then creates the groups themselves, by trying to match documents to the found descriptions. Therefore this approach is called by its author "description-oriented".

LINGO is also based on the vector space model, and it employs advanced techniques such as Singular Value Decomposition of the term-document matrix and its dimensionality reduction using low-rank approximation (see Chapter 3-2-2).
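The following is not LINGO itself, but an illustration of the SVD and low-rank approximation step it builds on, using NumPy and a toy term-document matrix (all values and names are ours):

```python
import numpy as np

# toy term-document matrix: rows = terms, columns = documents
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

# Singular Value Decomposition: A = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# rank-k (low-rank) approximation: keep only the k largest singular values
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# the first k left singular vectors span the "abstract concept" space
concepts = U[:, :k]
```

The rank-k matrix A_k is the best rank-k approximation of A in the least-squares sense, which is why truncating the SVD serves as dimensionality reduction.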

The main advantage of this algorithm is the very good, informative cluster labels it presents to the user (although in some cases they are too specific and fit only some of the documents in the group). LINGO is also capable of creating overlapping clusters. Its main weak points are its inability to perform hierarchical clustering and its quite slow processing, due to the use of Singular Value Decomposition. However, attempts to accurately assess this method are premature, as it is still a completely new idea, with only one existing implementation.

**4.2. THE AGGLOMERATIVE HIERARCHICAL CLUSTERING (AHC) ALGORITHM**

**4.2.1. ALGORITHM OUTLINE**

Agglomerative Hierarchical Clustering is a classical clustering algorithm, originating, like k-means, from the statistics domain. The fundamental idea of this method is very simple, not to say obvious. It consists in iteratively merging the two most similar groups, which in the first step contain single elements.
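This merging loop can be sketched as follows (a naive O(n³) version operating on a precomputed similarity matrix and returning the merge history; it uses the single-link rule for the similarity between groups, one of several variants, and all names are ours):

```python
def ahc(sim):
    """Naive agglomerative clustering over a similarity matrix sim[i][j].

    Single-link rule: similarity of two groups = max pairwise similarity.
    Returns the merge history, i.e. the dendrogram as a list of
    (cluster_a, cluster_b) pairs of frozensets."""
    clusters = [frozenset([i]) for i in range(len(sim))]
    merges = []
    while len(clusters) > 1:
        best, pair = float("-inf"), None
        for x in range(len(clusters)):              # find the two most similar clusters
            for y in range(x + 1, len(clusters)):
                s = max(sim[i][j] for i in clusters[x] for j in clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        x, y = pair
        merges.append((clusters[x], clusters[y]))   # record the merge
        merged = clusters[x] | clusters[y]
        clusters = [c for n, c in enumerate(clusters) if n not in (x, y)] + [merged]
    return merges
```

Since every step links exactly two clusters, the recorded merge history is exactly the binary dendrogram described below.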

The algorithm takes as input a matrix of pairwise similarities between objects. In the case of documents, this matrix is created by calculating all pairwise similarities between documents using one of the measures presented in Chapter 3-3 (according to our literature review, usually the cosine similarity). It returns a binary tree of clusters (binary, because the algorithm always links exactly two clusters), called a dendrogram (sample dendrograms are depicted in Figure 4-2-2-6). The name of the algorithm refers to its way of working, as it creates hierarchical results in an "agglomerative", or "bottom-up", way, i.e. by merging smaller groups into larger ones. There also exists a "divisive" (rather rarely used) version, which works in just the opposite way, iteratively

partitioning clusters instead of merging them:

Fig. 4-2-1-2 Outline of the divisive ("top-down") hierarchical clustering algorithm

AHC is in reality just a template for a whole class of algorithms, with many detailed versions proposed in the literature. The basic outline of this method is very general and does not specify two