
MASTER THESIS by MICHAŁ WRÓBLEWSKI. Supervisor: JERZY STEFANOWSKI, Assistant Professor. Referee: ROBERT SUSMAGA, Assistant Professor. Submitted in ...


The main problems concerning further development of Carrot2 are, although it may seem unexpected, legal issues. As a meta-search engine, it does not have an indexed-page database of its own; instead, Carrot2 performs clustering on search results collected from other search engines. As it turned out, this kind of "parasitic" activity is considered illegal under the terms of use of commercial services. Fortunately, contact has been established with the authors of the new, experimental indexing service Egothor [Egothor], who have declared that Carrot2 may make use of the results it returns, so it may finally be possible for Carrot2 to have a legal "background" search engine.



This chapter gives a comprehensive overview of the vector space model, a classical approach to processing databases of text documents. First we present the basic concepts of the model, then the representation of documents, and how documents corresponding to a query, or similar to one another, may be found. The model has some severe drawbacks resulting from its main assumption: reducing text written in natural language, which is very flexible, to a strict mathematical representation. These problems, along with their possible solutions, are also discussed.


3.1.1. DOCUMENT REPRESENTATION

The vector space model is based on linear algebra and treats documents and queries as vectors of numbers whose values correspond to the occurrences of words (called terms here) in the respective documents [Salton et al. 75]. Let t be the size of the set of terms, and n the size of the set of documents. Then both the query Q and all documents Di, i = 1..n, may be represented as t-dimensional vectors:

Di = [ai1, ai2, ..., ait],    Q = [aq1, aq2, ..., aqt]

Fig. 3-1-1-1: Document and query representation in the vector space model

Here the coefficients aik and aqk represent the values of term k in document Di and in query Q, respectively [Salton 89]. Together, the documents and terms form a term-document matrix A(n×t): rows of this matrix represent documents, and its columns the so-called term vectors. Let us assume for now that position aik is set to 1 when term k appears in document i, and to 0 when it does not (the next subchapter is devoted to calculating these values).

Then, for example, for a collection of documents returned for the query "king", we could create the following term-document matrix (see Figure 3-1-1-2 below):

Terms set:

The, King, University, College, Site, Contents, of, County, Bar, Association, Government, Seattle, Washington, Martin, Luther

Term-document matrix (one row per document, one column per term):

1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
0 1 0 1 1 1 0 0 0 0 0 0 0 0 0
0 1 1 1 0 0 1 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 1 1 1 0 0 0 0 0
0 1 0 0 0 0 0 1 0 0 1 1 1 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 1 1

Fig. 3-1-1-2: Example term-document matrix for the query "king"

It is easy to see that documents on similar topics should have similar values at the same positions of their vectors.
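As an illustration (not part of the thesis's own implementation), the boolean matrix above can be rebuilt in a few lines. The six document titles below are hypothetical reconstructions consistent with the matrix of Figure 3-1-1-2, and the tokenizer is deliberately naive.

```python
import re

def boolean_term_document_matrix(documents, terms):
    """|documents| x |terms| matrix: 1 if the term occurs in the
    document, 0 otherwise (boolean terms weighting)."""
    rows = []
    for doc in documents:
        # Naive tokenizer: runs of letters, lowercased ("King's" -> king, s).
        words = {w.lower() for w in re.findall(r"[A-Za-z]+", doc)}
        rows.append([1 if t.lower() in words else 0 for t in terms])
    return rows

terms = ["The", "King", "University", "College", "Site", "Contents",
         "of", "County", "Bar", "Association", "Government",
         "Seattle", "Washington", "Martin", "Luther"]
# Hypothetical titles, chosen only to reproduce the example matrix.
documents = [
    "The King's University College",
    "King's College Site Contents",
    "University of King's College",
    "King County Bar Association",
    "King County Government Seattle Washington",
    "Martin Luther King",
]
A = boolean_term_document_matrix(documents, terms)
```

Every row of A has a 1 in the "King" column, as all documents were returned for the query "king".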


The main advantages of this paradigm are:

• Linear algebra as the basis of the model. After transforming documents into vectors, we may easily perform mathematical operations on them using the methods of linear algebra, a well-known and thoroughly studied subdomain of mathematics.

• Simple, efficient data structures may be used to store the data. The representation of documents in the vector space model is very simple, and we may store them efficiently using arrays. However, these arrays are often sparse, i.e. most of the contained values are equal to 0 (especially when considering large numbers of documents containing many terms). Special methods for storing and processing such vectors may be used in order to decrease the amount of time and memory needed [Duff et al. 02].

However, they are beyond the scope of this thesis, and we will not discuss these issues further.
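Even so, the basic idea can be shown in a few lines. This is a simple dictionary-based sketch of sparse storage, not the specialized schemes of [Duff et al. 02]:

```python
def to_sparse(vector):
    """Keep only the nonzero entries of a document vector as
    {term_index: weight}; memory grows with the number of nonzero
    weights, not with the vocabulary size t."""
    return {i: w for i, w in enumerate(vector) if w != 0}

def sparse_dot(a, b):
    """Inner product of two sparse vectors: iterate over the smaller
    dictionary and look its indices up in the larger one."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b[i] for i, w in a.items() if i in b)

dense = [0, 1, 0, 0, 3, 0, 0, 0, 2]
sparse = to_sparse(dense)  # {1: 1, 4: 3, 8: 2}
```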

The basic weakness of this model is:

• Lack of information contained in the structure of documents. In the basic vector space model only the occurrence of words in documents is important (i.e. a document is treated as a set of terms), and their order is not considered. This is the main reason why the approach is often criticized ([Zamir and Etzioni 98], [Weiss 01]), as the information about the proximity between words (their context in a sentence) is not utilized. Consider for example two documents: one containing the phrase "White House", which has a very specific meaning, and another containing the sentence "A white car was parked near the house". Treating documents simply as sets of terms, we only know that the words "white" and "house" occur in both documents, although their contexts there are completely different.

However, this problem can easily be overcome: one can supplement the model by using phrases in addition to single terms in document vectors, as described in [Maarek et al. 91] and [Maarek et al. 00]. This approach will be discussed in the last subchapter of this part of the thesis.


3.2.1. METHODS OF TERMS WEIGHTING

As we have mentioned in the previous subchapter, the values at individual positions of a document vector correspond to occurrences of terms in that document. These values are called weights, as they describe how important a term is in the context of the containing document (the same term may have different weights in different documents). The process of calculating the weights of terms is called terms weighting. There are several main methods used to assign weights to terms.

The simplest method is boolean terms weighting, which, as its name suggests, sets weights to 0 or 1 depending on the presence of the term in the document. This method was used to calculate the term-document matrix in the example shown in Figure 3-1-1-2. Using this method causes a loss of valuable information, as it differentiates only between two cases, presence and absence of a term in a document, while the exact number of occurrences of a word may indicate its importance in a document [Salton 89].

A method utilizing knowledge of the exact number of term occurrences in documents is called tf terms weighting (tf stands for term frequency). It assigns to each position of the document vector a value equal to the number of occurrences of the corresponding term in this document (so, when it is used, the same term may have different weights in different documents). This weight is linear in the number of occurrences and may rise too fast (a word appearing 5 times as often does not necessarily have to be 5 times as relevant), so some kind of "smoothing" may be applied here (for instance the square root [Salton and Buckley 87]).
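As a sketch, tf weighting with optional square-root smoothing might look as follows; the function name and interface are illustrative, not taken from the thesis.

```python
from collections import Counter
import math

def tf_weights(tokens, smooth=False):
    """Map each term to its number of occurrences in the document (tf).
    With smooth=True, apply square-root damping so the weight grows
    sub-linearly in the occurrence count."""
    counts = Counter(tokens)
    if smooth:
        return {t: math.sqrt(c) for t, c in counts.items()}
    return dict(counts)

doc = "king county government king county".split()
tf_weights(doc)               # {'king': 2, 'county': 2, 'government': 1}
tf_weights(doc, smooth=True)  # weights about 1.41, 1.41, 1.0
```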

Another significant issue is the frequency of appearance of a term in the whole collection of documents, not only in a single one. This is especially important for clustering purposes, where we have to split documents into subgroups, so words occurring in only a small subset of the documents separate these subgroups better than words occurring in most of them. This observation is captured by the inverse document frequency (idf) factor, classically computed as:

idfk = log(n / dfk)

Fig. 3-2-1-1: Classical formula for inverse document frequency, where n is the number of all documents and dfk the number of documents containing term k

We have discovered that, unfortunately, the formula for inverse frequency presented in Figure 3-2-1-1 causes the weight of a term to drop too rapidly as its document frequency increases (words that occur in many documents are penalized too strongly). As a result, it may be difficult to form larger clusters of documents, because the words common to these documents have very small weights. Therefore we propose another formula for calculating inverse document frequency, whose value decreases linearly in the number of documents in which the term occurs (sample results of different terms weighting methods are compared in Chapter 6-3-6):


Fig. 3-2-1-4: Formula for inverse document frequency, n – number of all documents



Fig. 3-2-1-5: Inverse document frequency (using formula from Fig. 3-2-1-4) as a function of document frequency
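Since the exact linear formula of Fig. 3-2-1-4 is not reproduced in this text, the sketch below pairs the classical logarithmic idf with one plausible linear variant; idf_linear is a hypothetical stand-in, not necessarily the thesis's formula.

```python
import math

def idf_log(df, n):
    """Classical inverse document frequency (cf. Fig. 3-2-1-1):
    penalizes frequent terms logarithmically."""
    return math.log(n / df)

def idf_linear(df, n):
    """Hypothetical linear variant (assumption, standing in for
    Fig. 3-2-1-4): falls off linearly from 1 at df = 1 to 0 at df = n,
    so common terms are penalized less aggressively than by the log."""
    return (n - df) / (n - 1)

n = 100
[round(idf_log(df, n), 2) for df in (1, 10, 50)]     # [4.61, 2.3, 0.69]
[round(idf_linear(df, n), 2) for df in (1, 10, 50)]  # [1.0, 0.91, 0.51]
```

Note how the logarithmic weight falls steeply for small df, the behaviour criticized above, while the linear one degrades gradually.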


CLUSTERING

According to Zamir and Etzioni's key requirements for Web document clustering methods (presented in Chapter 2-3), the only data that a search results clustering system may use are snippets and document titles. Although search engines usually return snippets of good quality (i.e. ones that are a good digest of the document, enabling its topic to be deduced), this restriction still means that we have to perform clustering using incomplete data. Therefore, during the process of terms weighting we do not know how often the terms really occur in the documents and may, for instance, assign a low weight to a term that appears many times in a document when the snippet contains only a few occurrences of it.

So occurrences of terms in snippets may seem random, and that is why it is sometimes suggested to calculate the idf factor on some considerably large document set instead of on the search results [Weiss 01]. However, we cannot agree with this approach, as the distribution of terms in the query result set (which is usually associated with just a few topics) may differ significantly from their distribution in a large text set containing documents from many domains. For instance, the word "snippet" would have a very high document frequency in a collection of documents returned for the query "search results clustering", but in general it is a rarely occurring term. Therefore, we have decided to calculate term weights on the basis of the input collection of snippets.

While working on this thesis, we have also used other ideas that may partially improve search results clustering:

• Increasing weights of terms that appear in titles of documents. According to this method, if a term appears not only in the snippet but also in the title of an input document, its weight in this document should be scaled by some factor ([Osiński 03], where a factor value of 2.5 is proposed). This is due to the fact that titles, given to documents by their authors, should be treated as their best digests. We have also used this method, but have not observed any considerable, unambiguous improvement.

• Removing terms that appear only in one document. As the matrices created during terms weighting are sparse, some authors postulate that dimensionality reduction should be performed [Karypis and Han 00]. The main goal of this process is to remove from the matrix the rows corresponding to the least important terms, and it is usually performed by applying complicated matrix transformations. We propose here a simple approach, consisting in removing from the term-document matrix the rows corresponding to terms (and phrases, when they are used) that occur in only one document (i.e. all terms ti such that dfi = 1). Such words do not aid in forming clusters, as they cannot be shared by multiple documents, and may be treated as noise. This method was introduced also for performance reasons (about two thirds of terms occur in only one document, due to the previously mentioned fact that term vectors are sparse), as it allows us to decrease the size of the term-document matrix and therefore achieve a considerable speed-up in its computations.

• Removing the term that is equal to the user's query. Forming a group from documents that share the query as a common term would be rather absurd. Moreover, this group would probably contain almost all, or even all, result documents, as most of the results contain the term that was given in the query. When phrases are used, this condition means that the phrase equal to the query (which may itself be a phrase) and its subphrases are not added to the terms vector.
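The two pruning ideas above can be sketched together. This is an illustration only, using the documents-as-rows layout of Figure 3-1-1-2 (so terms are columns here rather than rows); the function and variable names are my own.

```python
def prune_terms(matrix, terms, query_terms):
    """Drop term columns that (a) occur in only one document (df == 1)
    or (b) are equal to one of the user's query terms. `matrix` is a
    list of document rows, as in Figure 3-1-1-2."""
    keep = []
    for j, term in enumerate(terms):
        df = sum(1 for row in matrix if row[j] != 0)  # document frequency
        if df > 1 and term.lower() not in query_terms:
            keep.append(j)
    pruned = [[row[j] for j in keep] for row in matrix]
    return pruned, [terms[j] for j in keep]

terms = ["king", "college", "county", "luther"]
matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 1, 1],
]
# "king" is the query term, "luther" has df = 1: both columns are dropped.
pruned, kept = prune_terms(matrix, terms, {"king"})
```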


Using the vector representation, it is easy to find documents relevant to a query, or similar to each other, by calculating a similarity (or distance) value between the vectors representing a document and the query, or a pair of documents. Calculating the similarity of two t-dimensional vectors is a well-known mathematical problem and can be done in many ways. Example similarity measures between two vectors are given in [Salton 89]:

inner product:        sim(X, Y) = Σk xk yk
Dice coefficient:     sim(X, Y) = 2 Σk xk yk / (Σk xk² + Σk yk²)
Jaccard coefficient:  sim(X, Y) = Σk xk yk / (Σk xk² + Σk yk² - Σk xk yk)
cosine coefficient:   sim(X, Y) = Σk xk yk / sqrt(Σk xk² · Σk yk²)

Fig. 3-3-1: Example similarity measures between vectors X and Y [Salton 89]

Unfortunately, the choice of a particular similarity measure for a certain application lacks theoretical justification and is rather arbitrary [Salton 89]. The most commonly used measure is the cosine coefficient [Maarek et al. 00], which is equal to the cosine of the t-dimensional angle between the two compared vectors.

A very useful property of a similarity measure is normalization of its values. Normalized similarity coefficients always have values between 0 and 1 when the vector elements are nonnegative (which is always the case for term vectors). Knowing the range of possible values of a coefficient enables us to compare its values for different pairs of vectors. Typical measures having this property are the Dice, Jaccard and cosine coefficients (see Figure 3-3-1), whereas values of the inner product are practically unbounded [Salton 89] and have to be normalized after their calculation. Moreover, the inner product takes into account only how many common terms occur in the two compared vectors, completely ignoring the differences between them. Thus the two example pairs ("King's College", "King's College") and ("The University of King's College", "King's College London") will have the same similarity value according to this measure, as both pairs share the same 2 terms.
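The normalization argument can be checked numerically. This sketch implements the measures of Figure 3-3-1 and replays the "King's College" example; the binary vectors are built over a small assumed vocabulary.

```python
import math

def inner(x, y):
    """Inner product of two term vectors."""
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """Cosine coefficient: in [0, 1] for nonnegative weights."""
    return inner(x, y) / math.sqrt(inner(x, x) * inner(y, y))

def dice(x, y):
    return 2 * inner(x, y) / (inner(x, x) + inner(y, y))

def jaccard(x, y):
    return inner(x, y) / (inner(x, x) + inner(y, y) - inner(x, y))

# Binary vectors over the vocabulary (the, university, of, king, college, london):
a = [0, 0, 0, 1, 1, 0]  # "King's College"
b = [1, 1, 1, 1, 1, 0]  # "The University of King's College"
c = [0, 0, 0, 1, 1, 1]  # "King's College London"

# Both pairs share 2 terms, so the inner product cannot tell them
# apart, while the cosine coefficient can.
inner(a, a), inner(b, c)    # 2 and 2
cosine(a, a), cosine(b, c)  # 1.0 and about 0.52
```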

Another question concerning similarity measures is the association between similarity and distance of vectors. The measures presented in Figure 3-3-1 produce results that show how similar two vectors are, but there are also other coefficients (for instance the widely used Euclidean measure shown in Figure 3-3-2) whose results indicate the distance between vectors. In order to compare the results of these two approaches, some way of converting between their values must be proposed.

sim(X, Y) = 1 / (1 + dist(X, Y))

Fig. 3-3-3 A formula used to convert from distance to similarity measure [Everitt 80]
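As the exact formula of Fig. 3-3-3 is not reproduced in this text, the sketch below uses one common conversion, sim = 1 / (1 + d), as an assumed stand-in, together with the Euclidean distance.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two term vectors (the distance
    measure referred to as Fig. 3-3-2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distance_to_similarity(d):
    """One common distance-to-similarity conversion (assumption,
    standing in for Fig. 3-3-3): identical vectors (d = 0) map to
    similarity 1, and similarity falls toward 0 as distance grows."""
    return 1.0 / (1.0 + d)

distance_to_similarity(euclidean([1, 1, 0], [1, 1, 0]))  # 1.0
```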
