by MICHAŁ WRÓBLEWSKI. Supervisor: JERZY STEFANOWSKI, Assistant Professor. Referee: ROBERT SUSMAGA, Assistant Professor. Master Thesis. Submitted in ...
• The Java language itself enforces object-oriented programming and good software practices.
• The Java Virtual Machine performs runtime checks for common and often difficult-to-track errors, such as null pointers, array indices running out of bounds, etc.
• Many libraries supporting Web technologies (particularly XML) are written in this language.
• A number of utilities for the Carrot2 system (component frameworks etc.) were already implemented in this language.
A serious drawback of Java is, unfortunately, the speed of execution of code written in this language, which is much lower than that of natively-compiled code written, for instance, in C++.
However, as we have stated before in Chapter 5-1-1, due to the character of the system, speed was not the most important criterion in the choice of solutions. Efficiency issues are discussed in detail in the next subchapter.
5.3.2. EFFICIENCY ISSUES
Unfortunately, the speed of execution of our AHC module is quite low. The tests we performed showed that for more than just 200 snippets the system's response time exceeds several seconds, making it too slow for more impatient users. Exact results of these tests, with values for the individual parts of the module, are given below in Figure 5-3-2-1 (please bear in mind that these are the times of the AHC module alone, not of the whole Carrot2 system):
It is easy to see that the most laborious element of the module is the implementation of the AHC algorithm (only for small input data sets, containing fewer than 200 snippets, does the extraction of terms and phrases last longer than this task), due to its computational complexity. The major goal of implementing this algorithm in the Carrot2 system was to make it as simple and as flexible as possible. We have created a general framework which enables easy replacement of stop conditions and linkage methods. Unfortunately, the chosen solution also has the biggest complexity possible for this algorithm, O(n³) (see Chapter 4-2-1), which is definitely too slow to meet Zamir and Etzioni's key requirements for Web document clustering methods (presented in Chapter 2-3).
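The idea of such a pluggable framework can be sketched as follows. The interface and class names here are illustrative only, not the actual Carrot2 code; the sketch uses the naive O(n³) agglomeration scheme discussed above, with the linkage method and the stop condition supplied as interchangeable strategy objects:

```java
import java.util.ArrayList;
import java.util.List;

// Similarity between two clusters, given the matrix of pairwise document similarities.
interface LinkageMethod {
    double similarity(List<Integer> a, List<Integer> b, double[][] docSim);
}

// Decides when agglomeration should end (e.g. a similarity threshold or cluster count).
interface StopCondition {
    boolean shouldStop(List<List<Integer>> clusters, double bestSimilarity);
}

// Single linkage: similarity of the closest pair of documents across the two clusters.
class SingleLinkage implements LinkageMethod {
    public double similarity(List<Integer> a, List<Integer> b, double[][] docSim) {
        double best = Double.NEGATIVE_INFINITY;
        for (int i : a) for (int j : b) best = Math.max(best, docSim[i][j]);
        return best;
    }
}

class AhcClusterer {
    private final LinkageMethod linkage;
    private final StopCondition stop;

    AhcClusterer(LinkageMethod linkage, StopCondition stop) {
        this.linkage = linkage;
        this.stop = stop;
    }

    // Naive agglomeration: start from singletons, repeatedly merge the two
    // most similar clusters until the stop condition fires. The scan over all
    // cluster pairs on each of up to n-1 iterations gives the O(n^3) behavior.
    List<List<Integer>> cluster(double[][] docSim) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < docSim.length; i++) {
            List<Integer> c = new ArrayList<>();
            c.add(i);
            clusters.add(c);
        }
        while (clusters.size() > 1) {
            int bi = 0, bj = 1;
            double best = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double s = linkage.similarity(clusters.get(i), clusters.get(j), docSim);
                    if (s > best) { best = s; bi = i; bj = j; }
                }
            if (stop.shouldStop(clusters, best)) break;
            clusters.get(bi).addAll(clusters.remove(bj)); // bj > bi, so bi stays valid
        }
        return clusters;
    }
}
```

Swapping in a different linkage method or stop condition then only requires providing another implementation of the respective interface.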
Another element that takes up a bigger and bigger part of the whole running time as the size of the data increases is the calculation of inter-document similarities. It is usually said that this process takes O(n²) time for n documents [Salton 89]. This formula doesn't take into account the fact that, for all commonly used similarity measures, the time needed for their calculation also depends linearly on the number of terms in the input document set. So the exact computational complexity of calculating document similarities is O(n²·m), where m is the number of terms, which increases with the number of documents (a fact not considered by most authors). However, this increase is slower than linear (there is no exact formula, at least we haven't found any), so the overall complexity of this process is between O(n²) and O(n³), just a little smaller than that of the AHC algorithm.
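The source of the O(n²·m) bound is easy to see in code: each of the O(n²) document pairs requires an O(m)-time pass over the term dimension. The following sketch (our illustration, not the Carrot2 implementation) computes cosine similarities over dense term-weight vectors; a sparse representation lowers the constant factor but not the worst-case bound:

```java
class SimilarityMatrix {
    // docs[i][t] = weight of term t in document i (e.g. a tf-idf weight).
    static double[][] cosineSimilarities(double[][] docs) {
        int n = docs.length;
        double[][] sim = new double[n][n];
        double[] norm = new double[n];
        // Precompute vector lengths: O(n * m).
        for (int i = 0; i < n; i++) {
            for (double w : docs[i]) norm[i] += w * w;
            norm[i] = Math.sqrt(norm[i]);
        }
        // O(n^2) pairs, each with an O(m) dot product: O(n^2 * m) in total.
        for (int i = 0; i < n; i++) {
            sim[i][i] = 1.0;
            for (int j = i + 1; j < n; j++) {
                double dot = 0;
                for (int t = 0; t < docs[i].length; t++)
                    dot += docs[i][t] * docs[j][t];
                double s = (norm[i] == 0 || norm[j] == 0) ? 0 : dot / (norm[i] * norm[j]);
                sim[i][j] = s;
                sim[j][i] = s;
            }
        }
        return sim;
    }
}
```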
6. EVALUATION OF THE AHC ALGORITHM
The purpose of this chapter is to present an assessment of our version of the AHC algorithm. In order to achieve this goal, we have carried out several experiments, whose results are also presented. In this part of the thesis we often refer to the former chapters, showing the influence of the techniques described there on the quality of the algorithm's results. We have also performed a comparison between our implementation of AHC and several other important clustering algorithms.
Our experiments may be divided into two classes based on the way their results are evaluated:
• user evaluation (presented in Chapter 6-2)
• expert evaluation (presented in Chapters 6-3 and 6-4)
The major goal of the first type of experiment was to determine the quality of the results given by the system from the common user's point of view. In order to achieve this goal, we have performed a survey on a group of several users, asking them to judge simple aspects of the algorithm's results. We have also created a few simple metrics (presented in Chapters 6-1-2 and 6-1-3) that give exact, numerical values based on these users' assessments.
On the other hand, experiments of the second kind consisted of comparing the results of different algorithms (or of one algorithm for different parameters / input data) by a single person (an expert) in order to find differences between them. In this part of the evaluation we have concentrated our attention on finding and describing the properties of the algorithm's results rather than on attempting to calculate some exact metrics. Therefore, this part of the evaluation experiment was more research- than user-oriented, trying to find the reasons behind the final results rather than to judge them definitively.
6.1. METHODS OF CLUSTERING RESULTS' EVALUATION
It turns out that the evaluation of clustering results (especially in the case of hierarchical algorithms) is quite a difficult task. It is particularly hard to create objective, numerical measures of their quality. Here we point out a few of the most important reasons for this:
• subjectivity of evaluation
The primary criterion in clustering evaluation should be the usefulness of the given results.
However, for such a definition of quality it is hard to find objective measures. Different end users may have different preferences (e.g. some may be interested in small, precisely described clusters while others prefer large, more general groups), and therefore there may exist many equivalent ways of partitioning the input document set. This is a common problem in Web Mining, as this domain is very user-oriented.
It turns out that there are at least two benefits of clustering: enabling the user to find the relevant information faster, and "search refinement", which consists of creating a good partitioning of the input document set, making it possible to find different subtopics in the query results. The similarity measures used should differ depending on which application is more important.
• multi-criteria nature of the problem
Apart from the major goal of our evaluation, the quality of results is influenced by a number of factors, such as the conciseness of the created clusters or the suitability of their descriptions, and it is hard to aggregate them all into one factor. This issue will also be discussed in Chapter 6-1-2.
• difficulty in comparing results given by different algorithms
As we have mentioned before in Chapter 4-1, the results of different clustering algorithms may have different properties (e.g. be hierarchical or flat, overlapping or not). It is difficult to create quality measures that could be applied equally well to different algorithms and therefore fairly compare their results.
6.1.1. EXISTING WEB MINING METHODS
The problem of results evaluation is quite often addressed in the Web Mining literature. During our studies of publications from this domain we have come across several different approaches, and in this subchapter we give a brief overview of them, pointing out both their strong and weak points.
• standard Information Retrieval metrics
The most commonly used quality measures in the Information Retrieval domain are precision and recall [Salton 89]. These two metrics were originally defined for the search results evaluation problem as follows:
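In the usual formulation (the notation here is ours: R denotes the set of documents relevant to the query, and A the set of documents actually retrieved):

```latex
\mathit{precision} = \frac{|R \cap A|}{|A|},
\qquad
\mathit{recall} = \frac{|R \cap A|}{|R|}
```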
Both metrics are different and opposing quality criteria: usually, the more precise the search results are, the fewer (also relevant) documents we get. However, these quality measures in their basic version don't satisfy the needs stated before.
• merge-then-cluster approach
A technique that seems to be a natural approach to our problem is performing clustering on a specially prepared set of documents. This input set is created by collecting documents concerning different topics. The results of clustering algorithms run on this set are then compared with its original partition [Zamir 99]. The most important advantage of this approach is the ability to automatically repeat multiple tests of different methods (or of one method with different parameters) once the input set has been established.
This approach requires the construction of the input document set to be done very carefully. In particular, one cluster appearing in the results of some algorithm may be split by another into several more specific ones, and the decision as to which partition is correct is often arbitrary. One may also criticize the "artificial" way of creating the test document set by merging documents concerning different topics instead of using real data (ranked lists), in which case finding coherent clusters may be more difficult.
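One simple way to score the agreement between a flat clustering and the original topic partition is the Rand index; this is only an illustrative sketch of the comparison step (the cited works use their own measures), assuming every document carries a single cluster label in each partition:

```java
// Rand index: the fraction of document pairs on which two flat partitions
// agree, i.e. both place the pair in the same cluster, or both place it in
// different clusters. a[i] and b[i] are the cluster ids of document i in the
// two partitions being compared.
class PartitionComparison {
    static double randIndex(int[] a, int[] b) {
        int n = a.length, agree = 0, pairs = 0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {
                pairs++;
                boolean sameA = a[i] == a[j];
                boolean sameB = b[i] == b[j];
                if (sameA == sameB) agree++;
            }
        return pairs == 0 ? 1.0 : (double) agree / pairs;
    }
}
```

A value of 1.0 means the clustering reproduces the original partition exactly (up to relabeling of clusters).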
• ground truth hypothesis
Another, somewhat similar approach to this problem is using a quality measure based on entropy ([Maarek et al. 00], [Stefanowski and Weiss 03]). In this approach, the deviation of the algorithm's results from a ground truth set, representing some "ideal" partition, is calculated. Constructing the ground truth set may be performed, for instance, manually by experts who try to find clusters in the input data set [Weiss and Stefanowski 03]. The deviation is calculated based on an information theory measure presented in [Dom 2001]. This method is similar to merge-then-cluster, as the subject of evaluation in both approaches is the difference between some reference set and the algorithm's results.
It also enables us to quickly perform many different tests after creating the reference clustering results (the ground truth). Its main disadvantage is that the quality measure used is defined only for flat partitionings (so neither hierarchical nor overlapping results may be correctly evaluated using this method).
• user evaluation
There are many evaluation methods based on the examination of users' behavior. These methods may either make use of direct feedback from users (e.g. some kind of survey), or attempt to automatically analyze their reaction to the system's response, for instance by tracing logs. A very interesting and detailed example of user evaluation is presented in [Zamir and Etzioni 99], where the authors apply user behavior analysis to the evaluation of the Grouper system.
This approach seems very promising, as the quality of the results is decided by the users themselves. However, there are several problems associated with performing this type of evaluation, such as the differing levels of users' experience or finding a statistically significant number of experiment participants [Weiss 01]. Moreover, such experiments take quite a lot of time and cannot be easily and automatically repeated.
6.1.2. OUR CLUSTERING RESULTS QUALITY MEASURES PROPOSAL
During the work on the Carrot2 system we have decided to make use of the last approach presented in the former subchapter and base the system evaluation on a user survey. The same method was also used to evaluate the results of the LINGO algorithm (see [Osiński 03]), also a part of the Carrot2 system.
The survey has been created taking into account the key requirements for Web search results clustering, such as browsable summaries of the clusters or relevance (refer to Chapter 2-3 for more details). Users taking part in the evaluation were asked to judge the following aspects:
• usefulness of created clusters
For each of the created groups (except the "Other Topics" group), the users were asked to assess its usefulness, defined as the user's ability to deduce the topics of the documents contained in it on the basis of its description. It is easy to see that the most important factor influencing the evaluation of this criterion is the algorithm's ability to create meaningful, accurate cluster labels.
We have to add here that not only the clusters corresponding to topics in which the user is interested are useful. Provided their descriptions are concise and accurate, clusters "unexpected" in the context of the query, containing documents irrelevant to the user's interests, are also helpful, as they enable him to quickly separate these documents from the interesting ones. This is exactly the way in which clustering improves the search results.
In the case of a hierarchy, the usefulness of subclusters is also evaluated, independently of their parent clusters, as top-level clusters that are too general and therefore meaningless may be split into more detailed, useful ones (we continue this topic in the next subchapter).
Scale: useful cluster | useless cluster
• precision of assignment of documents to their respective clusters
For each assignment of a document to a group, the users were asked to judge whether this assignment is correct (i.e. whether its snippet fits the group's description). Here we have decided to use a scale more fine-grained than a binary one, as documents may fit their clusters to some degree (e.g. be more or less specific than their respective clusters, but still somehow related to their descriptions).
The precision of an assignment is assessed only if the cluster to which the document has been assigned has been judged useful. There are two rationales behind this decision. First of all, if a group's description is nonsense, it is also nonsense to judge whether a document fits it. Secondly, if the user cannot understand the cluster's title, he will probably discard it and therefore won't browse the documents contained in it. Moreover, the "Other Topics" group with its documents is also excluded from this process.
Here we have to make an important remark: in the case of both overlapping and hierarchical results, one document may be contained in many clusters. Therefore its assignment to each of the clusters in which it is contained should be judged independently.
Scale: well matching | moderately matching | not matching
Having collected the assessments of single groups and documents from the users, we may try to define general quality measures for the whole clustering results. We have decided to use several metrics evaluating different aspects of the results, due to the multi-criteria nature of the problem mentioned earlier. Some of the measures used in the AHC evaluation were also used to assess the quality of LINGO and have already been presented in [Osiński 03]. Here we will focus on the presentation of AHC-specific solutions, to which we dedicate the whole next subchapter.
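As an illustration of how such survey answers can be aggregated into numerical values, consider the following sketch. The two statistics and the numerical weights attached to the three-point scale are our own assumptions for the example, not the measures of [Osiński 03]:

```java
import java.util.List;

class SurveyStatistics {
    // useful[g] = true if group g was judged useful by the user.
    // Returns the fraction of groups judged useful.
    static double usefulClusterRatio(boolean[] useful) {
        int count = 0;
        for (boolean u : useful) if (u) count++;
        return useful.length == 0 ? 0 : (double) count / useful.length;
    }

    // Assignment judgments on the three-point scale, mapped to assumed weights:
    // "well" matching = 1.0, "moderate" = 0.5, anything else ("not") = 0.0.
    // Returns the average weight over all judged assignments.
    static double assignmentPrecision(List<String> judgments) {
        double sum = 0;
        for (String j : judgments) {
            switch (j) {
                case "well":     sum += 1.0; break;
                case "moderate": sum += 0.5; break;
                default:         break; // "not matching" contributes nothing
            }
        }
        return judgments.isEmpty() ? 0 : sum / judgments.size();
    }
}
```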
After collecting the forms from the users, we are able to calculate the following statistics: