# MASTER THESIS by MICHAŁ WRÓBLEWSKI. Supervisor: JERZY STEFANOWSKI, Assistant Professor. Referee: ROBERT SUSMAGA, Assistant Professor. Submitted in ...

The main problems concerning further development of Carrot2 are, although it may seem unexpected, legal ones. As a meta-search engine, it does not have an indexed page database of its own. Instead, Carrot2 performs clustering on search results collected from other search engines. As it turned out, this kind of "parasitic" activity is prohibited by the terms of use of commercial services. Fortunately, contact has been established with the authors of the new, experimental indexing service Egothor [Egothor], who have declared that Carrot may make use of the results it returns, so it may finally be possible for Carrot to have a legal "background" search engine.

## THE VECTOR SPACE MODEL IN INFORMATION RETRIEVAL

This chapter gives a comprehensive overview of the vector space model, a classical approach to processing databases containing text documents. First we present the basic concepts of the model, then the representation of documents and how documents corresponding to a query, or similar to another document, may be found. The model has some severe drawbacks resulting from its main assumption: reducing texts written in natural language, which is very flexible, to a strict, mathematical representation. These problems, along with their possible solutions, are also discussed.

**3.1. BASIC CONCEPTS**

3.1.1. DOCUMENTS REPRESENTATION

The vector space model is based on linear algebra and treats documents and queries as vectors of numbers, whose values correspond to occurrences of words (called here terms) in the respective documents [Salton et al. 75]. Let t be the size of the term set, and n the size of the document set. Then both the query Q and all documents Di, i = 1..n, may be represented as t-dimensional vectors:

$$D_i = [a_{i1}, a_{i2}, \ldots, a_{it}], \qquad Q = [a_{q1}, a_{q2}, \ldots, a_{qt}]$$

Fig. 3-1-1-1: Document and query representation in the vector space model

Here the coefficients $a_{ik}$ and $a_{qk}$ represent the values of term k in document Di or query Q, respectively [Salton 89]. Together, documents and terms form a term-document matrix A(n×t). Rows of this matrix represent documents, and its columns are the so-called term vectors. Let us assume for now that position $a_{ik}$ is set to 1 when term k appears in document i, and to 0 when it does not (the next subchapter is devoted to the topic of calculating these values).

Then, for an example collection of documents returned for the query "king", we could create the corresponding term-document matrix (see Figure 3-1-1 below):

**Terms set:** The, King, University, College, Site, Contents, of, County, Bar, Association, Government, Seattle, Washington, Martin, Luther

|    | The | King | University | College | Site | Contents | of | County | Bar | Association | Government | Seattle | Washington | Martin | Luther |
|----|-----|------|------------|---------|------|----------|----|--------|-----|-------------|------------|---------|------------|--------|--------|
| D1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| D2 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| D3 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| D4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| D5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| D6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |

Fig. 3-1-1: An example term-document matrix for the query "king"

It is easy to see that documents corresponding to similar topics should have similar values in the same positions of their vectors.
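The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the Carrot2 implementation; `build_term_document_matrix` is a hypothetical helper, and the tokenization is naive whitespace splitting (a real system would also strip punctuation and normalize case).

```python
# Minimal sketch: building a binary term-document matrix.
# matrix[i][k] == 1 iff the k-th term occurs in the i-th document.

def build_term_document_matrix(documents):
    terms = []          # term list in order of first appearance
    seen = set()
    for doc in documents:
        for word in doc.split():
            if word not in seen:
                seen.add(word)
                terms.append(word)
    matrix = [[1 if term in doc.split() else 0 for term in terms]
              for doc in documents]
    return terms, matrix

docs = [
    "The King's University College",
    "King County Bar Association",
]
terms, A = build_term_document_matrix(docs)
```

Note that with this naive tokenizer "King's" and "King" are distinct terms; stemming and stop-word removal would normally be applied first.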

3.1.2. STRONG AND WEAK POINTS OF THE MODEL

**The main advantages of this paradigm are:**

• Linear algebra as the basis of the model. After transforming documents into vectors, we may easily perform mathematical operations on them using methods of linear algebra, a well-known and well-studied sub-domain of mathematics.

• Simple, efficient data structures may be used to store data. The representation of documents in the vector space model is very simple, and we may store them efficiently using arrays. However, these arrays are often sparse, i.e. most of the contained values are equal to 0 (especially when considering large numbers of documents containing many terms). Special methods for storing and processing such vectors may be used in order to decrease the amount of time and memory needed [Duff et al. 02]. However, they are beyond the scope of this thesis and we will not discuss these issues.
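As a small illustration of the sparsity point, a document vector can be stored as a dictionary of its nonzero entries only. This is a sketch under simple assumptions (the helper names are hypothetical), not one of the specialized schemes of [Duff et al. 02]:

```python
# Sketch: store a sparse document vector as {term_index: weight},
# keeping only nonzero entries to save memory.

def to_sparse(dense):
    return {k: v for k, v in enumerate(dense) if v != 0}

def sparse_dot(u, v):
    # Iterate over the smaller vector; only shared indices contribute.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v[k] for k, w in u.items() if k in v)

d1 = to_sparse([1, 0, 0, 2, 0, 0, 0, 1])
d2 = to_sparse([0, 0, 0, 1, 0, 1, 0, 1])
```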

The basic weakness of this model is:

• Lack of information contained in the structure of documents. In the basic vector space model only the occurrence of words in documents matters (i.e. a document is treated as a set of terms), and their order is not considered. This is the main reason why the approach is often criticized ([Zamir and Etzioni 98], [Weiss 01]), as information about the proximity of words (their context in a sentence) is not utilized. Consider for example two documents: one containing the phrase "White House", which has a very specific meaning, and another containing the sentence "A white car was parked near the house". Treating documents simply as sets of terms, we only know that the words "white" and "house" occur in both documents, although their context is completely different.

However, this problem can be easily overcome: one can supplement the model by using phrases in addition to single terms in document vectors, as described in [Maarek et al. 91] and [Maarek et al. 00]. This approach will be discussed in the last subchapter of this part of the thesis.
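The "White House" example above can be made concrete with word bigrams as a crude stand-in for phrases. This is only a sketch: real phrase extraction (as in [Maarek et al. 91]) is far more selective than taking every adjacent word pair.

```python
# Sketch: extend a document's feature set with word bigrams, so that
# "White House" and "white ... house" yield different features.

def terms_and_bigrams(text):
    words = text.lower().replace(",", "").split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return set(words) | set(bigrams)

f1 = terms_and_bigrams("White House")
f2 = terms_and_bigrams("A white car was parked near the house")
```

Both feature sets contain "white" and "house", but only the first contains the bigram "white house", so the two documents are no longer indistinguishable.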

**3.2. ASSIGNING WEIGHTS TO TERMS**

3.2.1. METHODS OF TERM WEIGHTING

As we have mentioned in the previous subchapter, the values at individual positions of a document vector correspond to occurrences of terms in that document. These values are called weights, as they describe how important a term is in the context of the containing document (the same term may have different weights in different documents). The process of calculating term weights is called term weighting. There are several main methods used to assign weights to terms.

The simplest method is boolean term weighting, which, as its name suggests, sets weights to 0 or 1 depending on the presence of a term in a document. This method was used to calculate the term-document matrix in the example shown in Figure 3-1-1. Using it causes a loss of valuable information, as it differentiates only between two cases, the presence and absence of a term in a document, whereas the exact number of occurrences of a word may indicate its importance in a document [Salton 89].

A method utilizing knowledge of the exact number of term occurrences in documents is called tf term weighting (tf stands for term frequency). It assigns to each position of a document vector a value equal to the number of the corresponding term's occurrences in that document (so the same term may have different weights in different documents). This weight is linear in the number of occurrences and may rise too fast (a word appearing 5 times as often does not necessarily have to be 5 times as relevant), so some kind of "smoothing" may be applied here, for instance the square root [Salton and Buckley 87].
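The tf weighting just described, with optional square-root smoothing, can be sketched as follows (the function name is a hypothetical helper, not from Carrot2):

```python
# Sketch: tf term weighting with optional square-root "smoothing".
# With smoothing, the weight grows sub-linearly in the occurrence count.
import math

def tf_weights(tokens, smooth=False):
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    if smooth:
        return {t: math.sqrt(c) for t, c in counts.items()}
    return counts

doc = "king county king county king".split()
raw = tf_weights(doc)                    # plain term frequencies
smoothed = tf_weights(doc, smooth=True)  # sqrt-smoothed frequencies
```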

Another significant issue is the frequency of a term's appearance in the whole collection of documents, not only in a single one. This is especially important for clustering purposes, where we have to split documents into subgroups, so words occurring only in a small part of the collection discriminate between subgroups better than words occurring in most documents. This observation leads to the inverse document frequency (idf) factor, classically computed as $idf_k = \log(n / df_k)$, where $df_k$ is the number of documents containing term k (Fig. 3-2-1-1).

We have discovered that, unfortunately, using the formula for inverse document frequency presented in Figure 3-2-1-1 causes the weight of a term to drop too rapidly with the increase of its document frequency (words that occur in many documents are penalized too strongly). As a result, it may be difficult to form larger clusters of documents (because words common to these documents have a very small weight). Therefore we propose another formula for calculating inverse document frequency, whose value decreases linearly in the number of documents in which the term occurs (sample results of different term weighting methods are compared in Chapter 6-3-6):

Fig. 3-2-1-4: Formula for inverse document frequency, n – number of all documents


Fig. 3-2-1-5: Inverse document frequency (using formula from Fig. 3-2-1-4) as a function of document frequency
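The contrast between the two idf variants can be sketched numerically. Note an assumption here: the linear form `(n - df) / n` below is only an illustration of the "decreases linearly" idea; the thesis's exact proposed formula is the one in Fig. 3-2-1-4.

```python
# Sketch: classical logarithmic idf vs. an assumed linear variant.
# idf_linear is illustrative only, not the formula from Fig. 3-2-1-4.
import math

def idf_log(df, n):
    # Classical idf: drops quickly as document frequency grows.
    return math.log(n / df)

def idf_linear(df, n):
    # Assumed linear variant: decreases linearly from ~1 to 0.
    return (n - df) / n

n = 100
# For a term occurring in half of the documents:
log_mid = idf_log(50, n)
lin_mid = idf_linear(50, n)
```

The logarithmic curve penalizes moderately frequent terms much more heavily than the linear one, which is exactly the behavior the text above argues hinders the formation of larger clusters.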

## 3.2.2. TERM WEIGHTING ISSUES SPECIFIC TO SEARCH RESULTS CLUSTERING

According to Zamir and Etzioni's key requirements for Web document clustering methods (presented in Chapter 2-3), the only data that a search results clustering system may use are snippets and document titles. Although search engines usually return snippets of good quality (i.e. ones forming a good digest of a document, enabling its topic to be deduced), this restriction still means that we have to perform clustering using incomplete data. Therefore, during term weighting we do not know how often the terms really occur in the documents and may, for instance, assign a low weight to a term that appears many times in a document but only a few times in its snippet.

Because occurrences of terms in snippets may seem random, it is sometimes suggested to calculate the idf factor on some considerably large document set instead of on the search results [Weiss 01]. However, we cannot agree with this approach, as the distribution of terms in the query result set (which is usually associated with just a few topics) may differ significantly from their distribution in a large text collection containing documents from many domains. For instance, the word "snippet" would have a very high document frequency in a collection of documents returned for the query "search results clustering", but in general it is a rather rarely occurring term. Therefore, we have decided to calculate term weights on the basis of the input snippet collection.

While working on this thesis we have also used other ideas that may partially improve search results clustering:

• Increasing weights of terms that appear in titles of documents. According to this method, if a term appears not only in the snippet but also in the title of an input document, its weight in this document should be scaled by some factor ([Osiński 03] proposes a factor of 2.5). This is due to the fact that titles, given to documents by their authors, should be treated as their best digests. We have also used this method, but have not observed any considerable, unambiguous improvement.
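The title boost above amounts to a one-line transformation of a document's weight vector. A minimal sketch, using the 2.5 factor reported from [Osiński 03] (the function name is hypothetical):

```python
# Sketch: scale the weight of every term that also occurs in the title.
TITLE_BOOST = 2.5  # factor proposed in [Osiński 03]

def boost_title_terms(weights, title_terms):
    return {t: (w * TITLE_BOOST if t in title_terms else w)
            for t, w in weights.items()}

w = boost_title_terms({"king": 1.0, "county": 2.0}, {"king"})
```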

• Removing terms that appear only in one document. As the matrices created during term weighting are sparse, some authors postulate that dimensionality reduction should be performed [Karypis and Han 00]. The main goal of this process is to remove from the matrix the rows corresponding to the least important terms, and it is usually performed by applying complicated matrix transformations. We propose here a simple approach, consisting in removing from the term-document matrix the rows corresponding to terms (and phrases, when they are used) that occur in only one document (i.e. all terms ti such that dfi = 1). Such words do not aid in forming clusters, as they cannot be shared by multiple documents, and may be treated as noise. This method was also introduced for performance reasons (about two thirds of terms occur in only one document, due to the aforementioned sparsity of term vectors), allowing us to decrease the size of the term-document matrix and therefore to achieve a considerable speed-up in computations.
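The df = 1 pruning just described can be sketched directly on per-document term sets (a simplified stand-in for the term-document matrix):

```python
# Sketch: drop every term whose document frequency is 1,
# since a term occurring in a single document cannot link documents
# into a cluster.

def prune_singleton_terms(doc_term_sets):
    df = {}
    for terms in doc_term_sets:
        for t in terms:
            df[t] = df.get(t, 0) + 1
    kept = {t for t, f in df.items() if f > 1}
    return [terms & kept for terms in doc_term_sets]

docs = [{"king", "county", "seattle"},
        {"king", "college"},
        {"king", "county"}]
pruned = prune_singleton_terms(docs)
```

Here "seattle" and "college" each occur in only one document, so they are removed, while "king" and "county" survive.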

• Removing the term that is equal to the user's query. Forming a group from documents that share the query as a common term would be rather absurd. Moreover, this group would probably contain almost all, or even all, result documents, as most results contain the term given in the query. When phrases are used, this condition means that a phrase equal to the query (which may itself be a phrase) and its subphrases are not added to the term vector.
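A sketch of this filter, including the subphrase condition for multi-word queries (helper names are hypothetical):

```python
# Sketch: remove features equal to the query or to any of its subphrases.
# For the query "king county", the features "king county", "king"
# and "county" are all filtered out.

def query_subphrases(query):
    words = query.split()
    return {" ".join(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)}

def remove_query_features(features, query):
    return features - query_subphrases(query)

kept = remove_query_features(
    {"king", "king county", "bar", "bar association"},
    "king county")
```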

**3.3. DOCUMENT SIMILARITY MEASURES**

Using the vector representation, it is easy to find documents relevant to a query or similar to each other by calculating a similarity (or distance) value between the vectors representing a document and the query, or representing a pair of documents. Calculating the similarity of two t-dimensional vectors is a well-known mathematical problem and can be done in many ways. Example similarity measures between two vectors X and Y are given in [Salton 89] (Figure 3-3-1):

$$\text{inner product:}\quad sim(X, Y) = \sum_{k=1}^{t} x_k y_k$$

$$\text{Dice:}\quad sim(X, Y) = \frac{2 \sum_{k=1}^{t} x_k y_k}{\sum_{k=1}^{t} x_k^2 + \sum_{k=1}^{t} y_k^2}$$

$$\text{Jaccard:}\quad sim(X, Y) = \frac{\sum_{k=1}^{t} x_k y_k}{\sum_{k=1}^{t} x_k^2 + \sum_{k=1}^{t} y_k^2 - \sum_{k=1}^{t} x_k y_k}$$

$$\text{cosine:}\quad sim(X, Y) = \frac{\sum_{k=1}^{t} x_k y_k}{\sqrt{\sum_{k=1}^{t} x_k^2 \cdot \sum_{k=1}^{t} y_k^2}}$$

Fig. 3-3-1: Example similarity measures between vectors X and Y [Salton 89]

Unfortunately, the choice of a particular similarity measure for a certain application lacks theoretical justification and is rather arbitrary [Salton 89]. The most commonly used measure is the cosine coefficient [Maarek et al. 00], which is equal to the cosine of the t-dimensional angle between the two compared vectors.

A very useful property of a similarity measure is normalization of its values. Normalized similarity coefficients always take values between 0 and 1 when the vector elements are nonnegative (which is always the case for term vectors). Knowing the range of possible values of a coefficient enables us to compare its values across different pairs of vectors. Typical measures that have this property are the Dice, Jaccard and cosine coefficients (see Figure 3-3-1), whereas values of the inner product are practically unbounded [Salton 89] and have to be normalized after their calculation. Moreover, the inner product takes into account only how many common terms occur in the two compared vectors, completely ignoring the number of differences between them. Thus two exemplary pairs of sentences, ("King's College", "King's College") and ("The University of King's College", "King's College London"), will have the same similarity value according to this measure, as both pairs share the same 2 terms.
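The standard formulas for these coefficients (as given in [Salton 89]) are straightforward to implement; the sketch below shows the inner product and the three normalized measures on binary vectors:

```python
# Sketch: similarity coefficients over t-dimensional vectors.
import math

def inner(x, y):
    # Inner product: counts shared weight, unbounded.
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    return inner(x, y) / math.sqrt(inner(x, x) * inner(y, y))

def dice(x, y):
    return 2 * inner(x, y) / (inner(x, x) + inner(y, y))

def jaccard(x, y):
    return inner(x, y) / (inner(x, x) + inner(y, y) - inner(x, y))

a = [1, 1, 0, 1]  # binary document vector
b = [1, 1, 1, 0]
```

For these two vectors (2 shared terms, 1 unique term each) the cosine and Dice values coincide at 2/3, while Jaccard is stricter at 0.5, and all three stay within [0, 1].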

Another question concerning similarity measures is the association between similarity and distance of vectors. The measures presented in Figure 3-3-1 produce results that show how similar two vectors are to each other, but there are also other coefficients, for instance the widely used Euclidean measure shown in Figure 3-3-2, whose results indicate the distance between vectors:

$$d(X, Y) = \sqrt{\sum_{k=1}^{t} (x_k - y_k)^2}$$

Fig. 3-3-2: The Euclidean distance measure

In order to compare the results of these two approaches, some way of converting between their values must be proposed.

$$sim(X, Y) = \frac{1}{1 + d(X, Y)}$$

Fig. 3-3-3: A formula used to convert from a distance to a similarity measure [Everitt 80]
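This conversion can be sketched as follows; the `1 / (1 + d)` form shown is one common convention (cf. [Everitt 80]) that maps distance 0 to similarity 1 and pushes large distances toward 0:

```python
# Sketch: Euclidean distance and a distance-to-similarity conversion.
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distance_to_similarity(d):
    # Identical vectors (d = 0) map to similarity 1;
    # similarity decreases monotonically as distance grows.
    return 1.0 / (1.0 + d)

s_same = distance_to_similarity(euclidean([1, 0], [1, 0]))
```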