FREE ELECTRONIC LIBRARY - Dissertations, online materials

Pages:     | 1 |   ...   | 3 | 4 || 6 | 7 |   ...   | 12 |

«by MICHAŁ WRÓBLEWSKI Supervisor JERZY STEFANOWSKI, Assistant Professor Referee ROBERT SUSMAGA, Assistant Professor MASTER THESIS Submitted in ...»

-- [ Page 5 ] --

3.4. DEALING WITH NATURAL LANGUAGE This subchapter is dedicated to presentation of difficulties that may arise during conversion of written texts to vector representation because of complexity of natural languages. This subchapter is actually the only parts of the thesis, where described algorithms are dependent on the language of clustered documents. It means that during development of a multilingual clustering system, a lot of work may be needed in this area.

3.4.1. STOPWORDS Many words occurring in text contain no information about its topic. These are so called function words [Salton 89], which perform some necessary function in sentence structure (e.g. link two words – "and", "but", are used to form different tenses – "will", "have", etc.), but have no meaning themselves. These words shouldn't be considered during processing of the text, as they are not only meaningless, but moreover occur very frequent in normal sentences. So their usage could lead to forming strong, but nonsense clusters – for instance from documents that share a word "of", "and" or "the".

This problem is solved by excluding these terms from terms weighting. A list of function word, called stop-list is used and words from this list (called stop-words) are treated as meaningless. Example stop-list for English language (it is obvious that stop-lists have to be constructed separately for each language) used in the Carrot2 system is shown on Figure 3-4-1-1. This list is of course constructed before clustering and remains the same for different queries. It may be either constructed manually by linguistic experts or automatically by analyzing frequencies of different words in a text corpus (i.e. some set of texts that is big enough to allow statistical processing of it and to recognize it as representative for given language) and choosing as stop-words terms that are most common [Weiss 01]. In the second case, the list also usually needs some manual correction.

Putting to stop-list only words performing some grammatical function may not be enough.

Usually Web documents contain some words associated with Internet ("link", "www", "url"), which perform some other functions in text of the document (as navigation between Web sites), but contain no information. These terms should be also added to the stop-list.

–  –  –

Fig. 3-4-1-1 An example stop-list for English language used in the Carrot system [Weiss 01] 3.4.2. STEMMING Presented in chapter 3.2 methods of terms weighting treat different sequences of characters as separate positions of term vectors. But it turns out that two quite different strings can represent one word. It is a result of inflection, which is present in natural languages and may cause a word to occur in different grammatical forms, depending for instance on its number (car – cars) or degree (slow – slower – slowest). Also several different words may originate from one, and have a one common meaning (example: information – inform – informative).

All these words represent a similar sense and should be treated as one position in the terms vector.

A solution to this problem is stemming, consisting in reduction of all forms of a word or words to one, common part called stem. After performing of stemming, the position of terms

–  –  –

Issue of stemming is particularly well studied in case of English language. A lot of stemmers (i.e. stemming algorithms) have been create for this language. Main reasons of this fact are that, firstly it has quite simple syntax and strict inflection rules, making the process of stemmer development easy, and secondly of course English is the language of computing science. Example stemmers for English language are Lovins stemmer [Lovins 68] and Porter stemmer [Porter 80]. In the Carrot2 system, the Porter stemming algorithm is used. It is a small piece of code, being just a set of several quite simple rules (steps).

Fig. 3-4-2-2 An example of a stemmer algorithm [Lovins 68]

Since one of most important goals of the Carrot2 system is clustering of documents in Polish language, we will describe here shortly also stemming for this language. This issue has already been addressed in details in [Weiss 01], and we will mention here only the most important facts.

Polish language is one that has quite complex inflection rules, with a lot of exceptions.

Therefore stemmers for it have to be created manually by linguistic experts on the basis of

- 34 dictionaries, which makes process of their development expensive, and none is available for free. So during development of the original Carrot system a need arose to create one, and a simple stemmer based on automatic analysis of text corpus was created. This software is used now in the Carrot2 system for stemming of Polish documents.

3.4.3. SYNONYMS Apart from the fact that one word may have different forms, also many different words that have different stems may have similar or the same meaning (for example – "cluster" and "bunch"). Synonyms may be used often in written documents by their authors to avoid repetition of the same word, and they are treated as different, not correlated terms. In order to deal with synonyms, a thesaurus has to be used, enabling to check if a word has the same meaning as other ones occurring in the document. All terms containing the same information are usually replaced by one common, artificial identifier which is used instead of many words during the terms weighting phase. This process is called in [Salton 89] thesaurus transformation.

Such thesaurus has to be created manually, and its construction takes a lot of resources (especially time), so we couldn't develop our own one. Instead we would have to choose an existing one (which would be available for free). Example of such system is WordNet [WordNet], developed at the Princeton University. It contains most of word in the English language, and also the relations that exist between them, including synonymy (which is only one possibility, information about other relations such as hypernymy representing a hierarchy of words is also available). Unfortunately, due to the lack of time we have made no use of WordNet (anyway, using it would enable us only to detect synonyms in English documents, we don't know about any tool for Polish language). Using synonyms was rather a minor goal as we have observed that they occur quite rarely in documents, and therefore this approach wouldn't improve clustering quality significantly.

–  –  –

3.5.1. REASONS TO USE PHRASES Natural way to represent phrases in the vector space model is to simply treat them as elements of the terms vector so that weights may be assigned to them and then they can be included during vector similarity computations. For example set of sentences presented earlier in Figure 3-1-1-2, valid phrases could be "King College" or "King County". As we have mentioned earlier in Chapter 3-1-2, such phrases carry much more information than just simple occurrences of terms (in this case – "King", "College" and "County") in document.

Nevertheless there has been a number of opinions that using phrases in the vector space model is a rare approach [Zamir and Etzioni 99] and that it is not likely to achieve significantly better results than using only single words [Salton 89]. There are also some papers showing successful application of phrases [Maarek et al. 91] [Maarek et al. 00]. In our opinion, main benefit of using phrases consists rather in better quality of group labels (see Chapter 4-3-2) than the quality of clustering itself. Phrases are more "expressive", and make the descriptions more concise and accurate. Consider for example a group of documents about using Agglomerative Hierarchical Clustering algorithm to search results clustering.

This group could be described with three phrases "agglomerative hierarchical clustering", "search results clustering", and "vector space model". Without phrases its label would change into something like "space", "clustering", "model", "hierarchical", "results", "vector", "agglomerative", "search", and even if this cluster would contain documents relevant to the description, it would take the user much more effort to guess its topic from the description. In Chapter 6-3-3, we show benefits from phrases use on sample algorithm results.

3.5.2. PHRASES EXTRACTION In order to treat phrases as elements of the terms vector, they have first to be found in the input documents set. This process is called the phrases extraction. Phrases are built from single words, so the only possible way to retrieve them without additional information is to consider all sequences of words in documents (up to some length) and choose from them ones that may represent valid phrases.

Phrases retrieval is performed separately for different length of sequences from 2 up to some arbitrarily chosen maximum length (we have found 5 to be enough, as phrases longer than 5 words occur in documents very rarely). Each iteration of the process consists of two stages. In the first one, all sequences (called also n-grams) of words of given length are found in the documents using algorithm that is presented below on Figure 3-5-2-1. It is a simplified version of method presented in [Maarek et al. 91] and [Smadja 93].

–  –  –

Result of this algorithm are all n-grams of words of given length (not crossing sentence boundaries) that are present in the given document, so it has to be repeated for all the documents from the whole collection. It is also easy to see that this algorithm's complexity is linear in number of words in the document. This means that retrieving sequences from all documents is linear in their number (as number of words in one document may be treated as a constant).

Only a small part of sequences produced by this method represent valid, important phrases, so a second stage is needed – sifting through the candidates in order to find the significant phrases. This is done using only information about their frequencies. For all n-grams of current length, their frequency, its average value and deviation are calculated. Then for each one its strength [Smadja 93] is calculated. Value of a strength of a sequence s tells us how many standard deviations above the average frequency of all sequences of current length is

the frequency of s:

–  –  –

The most frequent n-grams (i.e. possessing certain minimal strength) are considered as important and attached to the term vector. Then the process is repeated for the next value of length.

In [Smadja 93] the value suggested for strength is 1, which should filter out the majority of insignificant phrases. Our experiments have shown that using strength value of 1 for a document collection of 100 snippets usually meant that a sequence had to occur at least 3 times in the documents set in order to be classified as important.


Methods presented by us in this chapter are very simple and may be further improved. In

particular, it does not address the following issues:

• using grammatical information during phrases retrieval Most of publications about phrases extraction comes from the lexicography domain and their authors postulate that the statistical significance of a retrieved phrase alone is not enough – the algorithm should also check its grammatical correctness ([Church et al. 91], [Smadja 91] [Smadja 93]). Also [Salton 89] states that "phrase-formation process controlled only by word co-occurrences and the document frequencies of certain words is not likely to generate a large number of high-quality phrases".

In order to perform grammatical validation, we need for every word information about the corresponding part of speech and also knowledge of the syntax rules (which both are again language-dependent). We didn't have such data and have extracted the phrases using only the statistical information. Moreover, it turned out that in fact grammatical information is not necessary for our application of phrases extraction – we are interested just in the mere fact of occurring some words together in different documents as it may point that they are somehow related.

• consideration of the order of terms Sometimes phrases that contain the same terms, but in different order have the same meaning. This point is partially connected with the previous one, as importance of order of words depends on the documents language and used grammatical constructions. Using an example from [Weiss 01], in English two phrases: "John hit Paul" and "Paul hit John" carry completely opposite meaning, while in Polish sentences "Jan uderzył Pawła" and "Pawła uderzył Jan" mean exactly the same.

• using complete phrases only When using our method, if a document contains a phrase longer than two terms, also its subphrases may be additionally extracted and treated as separate ones (for instance, from a document containing a sentence "King County Seattle Government", also phrases "King County", "County Seattle", etc. would be retrieved). It is often postulated that only complete phrases should be used, i.e. in case of our example – "King County Seattle Government", in order to avoid redundancy..

• methods of weighting specific for phrases Extracted phrases are treated during construction of term-document matrix just like terms, i.e. their weights are calculated on the basis of their occurrences in documents.

However, one may try also to make use of phrases length during this process, as usually longer phrases are more important and informative to the user.

Pages:     | 1 |   ...   | 3 | 4 || 6 | 7 |   ...   | 12 |

Similar works:

«Play Paper Victimization, aggression and sexism in Vijay Tendulkar’s works IJIFR/ V2/ E2/ 010 339-345 English Paper ID Page No Subject Area Vijay Tendulkar, Unmitigated Violence, Mutual Interactions, Human Character, Evil Key Words Inherent, Realistic Display, Projection Play, Torture,Victimization, Helpless Woman Ph. D Research Scholar Raj Kumar. M Karpagam University, Coimbatore Principal (Retd.), Dr. Mummachi Rathinam Arts and Science College, Coimbatore. Abstract Vijay Tendulkar is a...»


«1 Post Pituitary Surgery Fact Sheet This information has been provided for patients who have recently had pituitary surgery and to inform patients of what they could expect post-operatively. Please refer to our Pituitary Surgery & Radiotherapy and Pituitary Gland booklets where you can read a full explanation of the pituitary gland, its hormones and their function. The reason you have been referred for pituitary surgery is due to findings of an abnormality in or near to the pituitary gland;...»

«A NEW WORLD OF KNOWLEDGE Canadian Universities and Globalization Edited by Sheryl Bond and Jean-Pierre Lemasson INTERNATIONAL DEVELOPMENT RESEARCH CENTRE Ottawa • Cairo • Dakar • Johannesburg • Montevideo • Nairobi • New Delhi • Singapore Published by the International Development Research Centre PO Box 8500, Ottawa, ON, Canada K1G 3H9 © International Development Research Centre 1999 Canadian Cataloguing in Publication Data Main entry under title : A new world of knowledge :...»

«MUSIK & FORSKNING 28 · 2003 FABIAN HOLT Genre Formation in Popular Music 1 In the first chapter of his classic study Mimesis. The Representation of Reality in Western Literature Erich Auerbach outlines several essential aspects of representation with his comparative analysis of the stories of Odysseus’ homecoming in book 19 of the Odyssey and Abraham’s offering in the Old Testament (Genesis 22:1). Auerbach takes those texts as examples of two basic styles with “determining influence upon...»

«Oral Language and Vocabulary Development Activities For Student-Student Interaction Oral language is the ability to speak and listen. The development of thinking and reading abilities is closely linked to the development of oral language. “Speaking to learn is the vehicle for increasing and deepening knowledge. Think about very young toddlers. Spoken language naturally comes before written language. This developmental pattern continues throughout school: it is extremely difficult to read a...»

«University of Southampton Research Repository ePrints Soton Copyright © and Moral Rights for this thesis are retained by the author and/or other copyright owners. A copy can be downloaded for personal non-commercial research or study, without prior permission or charge. This thesis cannot be reproduced or quoted extensively from without first obtaining permission in writing from the copyright holder/s. The content must not be changed in any way or sold commercially in any format or medium...»

«The Center for Research Libraries scans to provide digital delivery of its holdings. In some cases problems with the quality of the original document or microfilm reproduction may result in a lower quality scan, but it w i l l be legible. In some cases pages may be damaged or missing. Files include O C R (machine searchable text) when the quality o f the scan and the language or format of the text allows. If preferred, you may request a loan by contacting Center for Research Libraries through...»

«UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, DC 20549 FORM 10-Q X QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934 For the quarterly period ended March 31, 2008. _ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934 for the transition period from _ to _. Commission File Number 0-50481 AEOLUS PHARMACEUTICALS, INC. (Exact Name of Registrant as Specified in Its Charter) Delaware 56-1953785 (State or Other...»

«Commercial Court Division THE REPUBLIC OF UGANDA IN THE HIGH COURT OF UGANDA AT KAMPALA (COMMERCIAL COURT DIVISION) HCT 00 CC CS 0184 2001 1. ABDUL BASIT SENGOOBA 2. HARUNA NYANZI ::::::::::::::::::::::::::::::::::::: PLAINTIFFS 3. MARIAM NAMAWEJJE 4. AKRAM LULE suing through next of friend 5. IMAAMA NAMULIWAYA Haji Suleman Lule VERSUS STANBIC BANK LTD (Former Uganda Commercial Bank Ltd) ::::::::::::::::::::::::::::: DEFENDANTS BEFORE: THE HON. JUSTICE GEOFFREY KIRYABWIRE. J U D G M E N T: The...»

«Idiopathic anaphylaxis The hidden allergy? Keywords: Allergy, anaphylaxis, Food allergy Most people have heard of severe allergies, or anaphylaxis, to food, insect stings, drugs and other common substances, but sometimes no cause at all can be found for a person's reaction a condition called Idiopathic anaphylaxis. But does this really mean they are reacting to nothing? To raise awareness of this little known condition and help those affected, the Anaphylaxis Campaign has updated its factsheet...»

«Reframin g the mom LEGA CIES OF ent 1982 BL K ART G ROUP CO NFEREN CE n 1980 eardo John R rs by Chambe ie of Edd Saturday 27th Oct photo ober 201 On Thu 2 rsday 28th O group ctober of bla 1982, ck art The Fi a studen rst Na ts hos tional Conven ted Black tion a Art t Wolv Polyte erhamp chnic. ton Their discus purpos s ‘the e was form, future to functi of bla on and ck art ’. Thirty years Lecture Th later day, o almost n Satu eatre to the rday 2 artist 7th Oc s, cur Millenium tober ators,...»

<<  HOME   |    CONTACTS
2016 www.dissertation.xlibx.info - Dissertations, online materials

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.