by Michał Wróblewski
Supervisor: Jerzy Stefanowski, Assistant Professor
Referee: Robert Susmaga, Assistant Professor
Master Thesis. Submitted in ...
g – total number of clusters in the results
gu – total number of useful clusters
s – total number of snippets in the results
sa – total number of snippet assignments in the results
sw – total number of snippets well matching their clusters
sm – total number of snippets moderately matching their clusters
sn – total number of snippets not matching their clusters
so – total number of snippets in the "Other Topics" group
On the basis of these statistics, the following quality measures have been defined:
• cluster label quality
It is the ratio of the number of useful clusters to the number of all clusters (excluding the "Other Topics" group):

cluster label quality = gu / g
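The measures above can be sketched in code. This is a minimal illustration, not the thesis implementation: only the cluster label quality formula is given explicitly in the text, so the coverage and distribution formulas below are assumptions consistent with how these coefficients are discussed later in the chapter.

```python
from dataclasses import dataclass

@dataclass
class EvaluationStats:
    g: int    # total number of clusters in the results
    gu: int   # number of useful clusters
    s: int    # total number of snippets in the results
    sa: int   # total number of snippet assignments
    sw: int   # snippets well matching their clusters
    sm: int   # snippets moderately matching their clusters
    sn: int   # snippets not matching their clusters
    so: int   # snippets in the "Other Topics" group

def cluster_label_quality(st: EvaluationStats) -> float:
    """Proportion of useful clusters among all clusters (gu / g)."""
    return st.gu / st.g

def assignment_coverage(st: EvaluationStats) -> float:
    """Assumed definition: fraction of snippets assigned outside
    the "Other Topics" group."""
    return (st.s - st.so) / st.s

def assignment_distribution(st: EvaluationStats) -> dict:
    """Assumed definition: shares of well / moderately / not matching
    assignments among all snippet assignments."""
    return {
        "well": st.sw / st.sa,
        "moderate": st.sm / st.sa,
        "not": st.sn / st.sa,
    }

# Hypothetical numbers for illustration only.
stats = EvaluationStats(g=10, gu=7, s=100, sa=120, sw=80, sm=28, sn=12, so=10)
print(cluster_label_quality(stats))  # 0.7
print(assignment_coverage(stats))    # 0.9
```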
6.1.3. AHC-SPECIFIC QUALITY MEASURES
The presence of hierarchy in the clustering results makes their evaluation even more difficult. Groups included in the results of AHC may contain not only documents but also other groups, which in turn may contain further groups, and so on. As a consequence, a number of problems arise, for instance:
• should hierarchical dependencies between groups influence the assessment of their quality, e.g. can a subcluster of a useless cluster be useful?
It seems that the top levels of the hierarchy are the most important ones, as their descriptions give the user the "first impression" of their contents. So even if a useless top-level cluster contains some interesting subclusters, a discouraged user may disregard the whole branch of the hierarchy without looking further down into its structure.
• should redundancy of groups be penalized?
It happens that some levels of the hierarchy introduce no distinct subtopics compared to the topics of their parent clusters. In such a case, these subclusters should be assessed as useless.
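One crude way to detect such redundancy automatically (a heuristic of our own for illustration, not a method from the evaluation procedure) is to flag a subcluster whose label contributes no words beyond its parent's label:

```python
def is_redundant(parent_label: str, child_label: str) -> bool:
    """Heuristic: a subcluster label whose words are all already present
    in the parent's label introduces no distinct subtopic."""
    parent_words = set(parent_label.lower().split())
    child_words = set(child_label.lower().split())
    return child_words <= parent_words

print(is_redundant("King County Government", "King County"))  # True
print(is_redundant("King County", "Government"))              # False
```

This only catches literal repetition; a subcluster can of course be semantically redundant while using different words, which is why human judgment was used in the experiment.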
• should the descriptions of parent groups be taken into consideration when evaluating their subgroups?
It seems that when assessing subcluster descriptions, one must also take the descriptions of their parent clusters into consideration. Consider, for instance, the results of clustering for the query "king" again. They may contain a top-level cluster described as "King County" with a subcluster "Government". The latter group of course contains documents about the government of King County, not about government in general (in which case it would be too general). On the other hand, during our evaluation experiment we came across an example hierarchy for the query "volleyball" containing a cluster "women's volleyball", which in turn contained a subcluster "men's volleyball". Although the latter is useful by itself, its position in the hierarchy makes it rather useless.
In order to at least partially deal with these problems, we have introduced a few improvements and additional metrics for evaluating AHC results:
• calculating cluster label quality and snippet assignment distribution separately for the top-level groups
Apart from calculating these coefficients for all clusters, when evaluating hierarchical clustering results one may calculate them only for the top-level clusters (i.e. ones that have no parent groups). This should be done due to the fact mentioned earlier that top-level groups are the ones most important to the user.
• average hierarchy depth
This metric does not directly measure the quality of the results, but rather has an informative purpose. However, we intuitively feel that the constructed hierarchy has a certain optimal depth and should be neither too shallow nor too deep. In [Maarek et al. 00] the dependence of result quality on hierarchy depth is shown for a sample document collection, but the authors of that paper do not draw any general conclusions. The optimal hierarchy depth may heavily depend on the structure of the input data.
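The depth metric can be sketched as follows. The thesis does not spell out the exact averaging, so this sketch assumes one plausible reading: the mean depth of leaf clusters, with top-level clusters at depth 1; the tree representation is likewise hypothetical.

```python
def leaf_depths(cluster: dict, depth: int = 1) -> list:
    """Collect the depths of all leaf clusters in one cluster subtree."""
    children = cluster.get("children", [])
    if not children:
        return [depth]
    depths = []
    for child in children:
        depths.extend(leaf_depths(child, depth + 1))
    return depths

def average_hierarchy_depth(top_clusters: list) -> float:
    """Mean leaf depth over all top-level cluster trees (assumed reading
    of 'average hierarchy depth')."""
    depths = []
    for cluster in top_clusters:
        depths.extend(leaf_depths(cluster))
    return sum(depths) / len(depths)

# Toy hierarchy: one two-level branch and one flat cluster.
tree = [
    {"label": "King County", "children": [{"label": "Government"}]},
    {"label": "Burger King"},
]
print(average_hierarchy_depth(tree))  # (2 + 1) / 2 = 1.5
```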
6.2. USER EVALUATION OF THE AHC ALGORITHM
6.2.1. THE EXPERIMENT
Our user evaluation experiment was conducted with 6 users, who were asked to fill in survey forms. Such a small number of participants may of course seem insufficient for the results to be considered statistically significant. Unfortunately, given our situation, and especially the lack of resources and time, it was hard to find a larger number of people who could help us carry out the evaluation.
We should also mention that all persons engaged in this process are experienced Internet users, professionally connected with computer science. Moreover, most of them are also experts in the domain of clustering algorithms. This fact must also have influenced the obtained results.
For the purpose of evaluation we used four queries, two in Polish and two in English, in order to examine the difference in behavior of AHC for different languages. As we also wanted to study the influence of the breadth of topic scope in the input document set on result quality, two of the queries used in the experiment were general, while the other two were more specific.
Snippets used as input data for the experiment were downloaded from the Google search service. Results handed to the persons involved in the evaluation were obtained using the Carrot2 AHC module with its default parameter values (listed in Figure 5-2-4 and Figure 5-2-2).
6.2.2. RESULTS
Here we would like to present the most important results of the user evaluation. As it turned out that the values of the quality measures calculated only for top-level clusters differ by at most a few percent from their equivalents calculated for the whole hierarchy (a somewhat unexpected fact), we present here only the values of the latter on one collective chart.
One may notice that the algorithm gives results of quite solid quality. The percentage of well and moderately assigned snippets is high, at around 90%, a result that may be considered satisfying. Unfortunately, AHC deals much worse with cluster descriptions: only about 60-70% of groups were considered useful. On the other hand, almost all snippets are assigned to groups, with the value of the assignment coverage factor usually amounting to about 90%.
It is easy to see that AHC deals better with results of queries given in English than in Polish (probably due to the less complex issues of stemming and phrase extraction in the former). The influence of query scope on clustering quality is less obvious. It seems that for specific queries snippet assignment coverage is higher and its distribution is better, but when comparing cluster label quality no general trend can be observed. We must be aware that our conclusions may lack a solid basis, as the experiment was performed on a small number of both users and queries.
It is very interesting to compare the values of quality measures assessed by different evaluators. As it turns out, user judgements are very subjective, especially in the case of cluster label quality. This simple comparison shows how difficult the evaluation of clustering results is, particularly in the case of a hierarchy (see Figure 6-2-2-2 below).
Finally, we would like to compare the values of our metrics for the results of AHC and LINGO (for a detailed description of the LINGO evaluation refer to [Osiński 03]). In the table below we have set the aggregated evaluation results of one algorithm against those of the other.
Fig. 6-2-2-3 Comparison of aggregated evaluation results of AHC and LINGO
This juxtaposition shows that while the snippet assignment distribution is practically the same for both algorithms, LINGO creates significantly better descriptions, at the cost of assigning a lower number of snippets to clusters. It may seem interesting that the differences between the values of the two latter measures for both methods are almost equal (about 15%). It may therefore be that this fraction of the input collection is made up of documents that are difficult to place in any cluster, yet the AHC algorithm tries to do just that.
6.3. EXPERT EVALUATION OF THE AHC ALGORITHM
INFLUENCE OF AHC PARAMETERS AND INPUT DATA ON THE RESULTS
The major goal of the experiments performed in this subchapter is to present the influence of changes in the algorithm's parameters and input data on its results. Due to the already mentioned variety of parameters that may be used in our implementation of AHC (compare Figure 5-2-4 and Figure 5-2-2), it was impossible to repeat a user evaluation similar to the one presented in the former subchapter for each studied aspect of the algorithm. We have therefore decided to present the results of these experiments (unfortunately, for large hierarchies we were forced to present only the biggest and topmost groups), with a brief verbal description pointing out the most important differences.
During the expert evaluation we used the default values of the algorithm parameters from Figure 5-2-4 and Figure 5-2-2 (unless stated otherwise in the experiment's description). All queries were formulated in English, and we used the top 100 snippets returned by the Google search service as the input data.
6.3.1. INFLUENCE OF QUERY BREADTH
The purpose of this experiment was to compare the hierarchies created by the algorithm for two different queries, one general and another more specific, covering a small subset of the documents corresponding to the first one. We chose two queries: "clinton" (a widely used query in examples of results clustering, as it covers a lot of completely different topics – see also [Zamir 99], [Weiss 01] and [Osiński 03]) and "president bill clinton".
Fig. 6-3-1-1 Comparison of AHC results for the queries: "clinton" (general – on the left) and "president bill clinton" (much more specific – on the right).
The result of this comparison shows that the algorithm performs better with more general queries. The main reason is of course that it is much more difficult to find distinct topics in a set of documents relevant to a specific query, and therefore to create good, mutually distinct clusters. In particular, group descriptions are much better when clustering the results of general queries.
6.3.2. INFLUENCE OF THE INPUT SNIPPETS NUMBER
The natural effect of an increase in input data size should be an increase in both the number and the size of clusters in the results. It should also be expected to improve cluster label quality, as a small document set may contain proportionally many overspecialized phrases that will be chosen as cluster labels. In the case of a larger input collection, on the other hand, we may expect these phrases to be replaced by slightly more general ones that are nevertheless much more informative to the user (while their subclusters may have specific labels). Another important aspect of an increase in the input data size, not considered here, is its influence on the algorithm's processing time. This issue is discussed in more detail in Chapter 5-3-2.
Fig. 6-3-2-1 Comparison of AHC results for the query "unix" using 100 (on the left) and 300 (on the right) input snippets.
The most important differences caused by the increase in the number of input snippets are the increase in the number of created top-level clusters and the growth of the hierarchy, which has become much more complex (unfortunately, due to its size we were not able to present the whole tree here). It is clearly visible that while with 300 snippets a dozen or so top-level groups have their own subclusters, with 100 snippets the structure is flatter, with hierarchy present only in the few biggest clusters. However, the increase in the number of groups was much smaller than that in the number of documents, due to repeating topics in the input set.
It turns out that our assumption about the influence of an increase in input data size on cluster label quality is only partially fulfilled. Although some of the groups in the right tree have shorter descriptions, it is hard to say whether their quality has really improved. This most probably results from the fact that the problem of overspecialized labels is rather rare for clusters created by the AHC algorithm (compare LINGO). On the other hand, it is clearly visible that in the hierarchy constructed by clustering a larger number of documents, the first (i.e. the biggest) clusters correspond better to the most important subtopics (due to the fact that the algorithm operates on a more representative sample).
6.3.3. INFLUENCE OF PHRASES USE
The purpose of this experiment was to show the benefits of using phrases in the AHC algorithm and in the vector space model in general. According to our hypothesis formulated in Chapter 3-5, the use of phrases shouldn't have a strong influence on the structure of the hierarchy itself, but rather on the quality of cluster labels.
Fig. 6-3-3-1 Comparison of AHC results for the query "king" for a maximum length of extracted phrases equal to 1 (i.e. no phrase use – on the left) and 5 (on the right).
After closer examination of the tree structure, it is easy to see that despite the use of phrases its structure has changed only slightly. This is due to the fact that if a group of documents shares a phrase, it also shares all of the words included in it, so the introduction of phrases should have no great impact on inter-document similarities. On the other hand, the cluster labels have become much more coherent. When we use only single terms, their order in descriptions often differs from that in the documents, and this fact may confuse the user. Moreover, terms equal to the user's query are removed from the term-document matrix during term weighting (see Chapter 3-2-2) and therefore cannot appear in labels, which often become too short and not fully informative. However, when using phrases, the same terms may appear as their elements and therefore also in descriptions. This problem is illustrated in the figure above, where labels of clusters such as "King County" or "Burger King" become "County" or "Burger" when we use single terms only.
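The query-removal effect can be demonstrated with a toy label chooser. This is not the thesis implementation: it simply picks the most frequent word n-gram up to a maximum phrase length, dropping single terms equal to the query (mirroring the term-weighting step), and preferring longer candidates on ties; the example snippets are invented.

```python
from collections import Counter

def label_from_terms(snippets: list, query: str, max_phrase_len: int) -> str:
    """Toy label chooser: most frequent n-gram of up to max_phrase_len
    words; single terms equal to the query are skipped, as the
    term-weighting step removes them from the term-document matrix."""
    counts = Counter()
    for text in snippets:
        words = text.lower().split()
        for n in range(1, max_phrase_len + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if n == 1 and gram[0] == query:
                    continue  # query term cannot appear as a single-term label
                counts[" ".join(gram)] += 1
    # Prefer longer candidates on frequency ties, so a phrase containing
    # the query word can win over its truncated remainder.
    return max(counts, key=lambda p: (counts[p], len(p.split())))

docs = ["King County government services",
        "King County executive office",
        "King County council members"]
print(label_from_terms(docs, "king", 1))  # "county"
print(label_from_terms(docs, "king", 3))  # "king county"
```

With single terms only, the query word is filtered out and the label degrades to "county"; allowing phrases lets "king county" reappear, just as in the "King County" / "Burger King" example above.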