«by MICHAŁ WRÓBLEWSKI Supervisor JERZY STEFANOWSKI, Assistant Professor Referee ROBERT SUSMAGA, Assistant Professor MASTER THESIS Submitted in ...»
3.4. DEALING WITH NATURAL LANGUAGE This subchapter is dedicated to presentation of difficulties that may arise during conversion of written texts to vector representation because of complexity of natural languages. This subchapter is actually the only parts of the thesis, where described algorithms are dependent on the language of clustered documents. It means that during development of a multilingual clustering system, a lot of work may be needed in this area.
3.4.1. STOPWORDS Many words occurring in text contain no information about its topic. These are so called function words [Salton 89], which perform some necessary function in sentence structure (e.g. link two words – "and", "but", are used to form different tenses – "will", "have", etc.), but have no meaning themselves. These words shouldn't be considered during processing of the text, as they are not only meaningless, but moreover occur very frequent in normal sentences. So their usage could lead to forming strong, but nonsense clusters – for instance from documents that share a word "of", "and" or "the".
This problem is solved by excluding these terms from terms weighting. A list of function word, called stop-list is used and words from this list (called stop-words) are treated as meaningless. Example stop-list for English language (it is obvious that stop-lists have to be constructed separately for each language) used in the Carrot2 system is shown on Figure 3-4-1-1. This list is of course constructed before clustering and remains the same for different queries. It may be either constructed manually by linguistic experts or automatically by analyzing frequencies of different words in a text corpus (i.e. some set of texts that is big enough to allow statistical processing of it and to recognize it as representative for given language) and choosing as stop-words terms that are most common [Weiss 01]. In the second case, the list also usually needs some manual correction.
Putting to stop-list only words performing some grammatical function may not be enough.
Usually Web documents contain some words associated with Internet ("link", "www", "url"), which perform some other functions in text of the document (as navigation between Web sites), but contain no information. These terms should be also added to the stop-list.
Fig. 3-4-1-1 An example stop-list for English language used in the Carrot system [Weiss 01] 3.4.2. STEMMING Presented in chapter 3.2 methods of terms weighting treat different sequences of characters as separate positions of term vectors. But it turns out that two quite different strings can represent one word. It is a result of inflection, which is present in natural languages and may cause a word to occur in different grammatical forms, depending for instance on its number (car – cars) or degree (slow – slower – slowest). Also several different words may originate from one, and have a one common meaning (example: information – inform – informative).
All these words represent a similar sense and should be treated as one position in the terms vector.
A solution to this problem is stemming, consisting in reduction of all forms of a word or words to one, common part called stem. After performing of stemming, the position of terms
Issue of stemming is particularly well studied in case of English language. A lot of stemmers (i.e. stemming algorithms) have been create for this language. Main reasons of this fact are that, firstly it has quite simple syntax and strict inflection rules, making the process of stemmer development easy, and secondly of course English is the language of computing science. Example stemmers for English language are Lovins stemmer [Lovins 68] and Porter stemmer [Porter 80]. In the Carrot2 system, the Porter stemming algorithm is used. It is a small piece of code, being just a set of several quite simple rules (steps).
Fig. 3-4-2-2 An example of a stemmer algorithm [Lovins 68]
Since one of most important goals of the Carrot2 system is clustering of documents in Polish language, we will describe here shortly also stemming for this language. This issue has already been addressed in details in [Weiss 01], and we will mention here only the most important facts.
Polish language is one that has quite complex inflection rules, with a lot of exceptions.
Therefore stemmers for it have to be created manually by linguistic experts on the basis of
- 34 dictionaries, which makes process of their development expensive, and none is available for free. So during development of the original Carrot system a need arose to create one, and a simple stemmer based on automatic analysis of text corpus was created. This software is used now in the Carrot2 system for stemming of Polish documents.
3.4.3. SYNONYMS Apart from the fact that one word may have different forms, also many different words that have different stems may have similar or the same meaning (for example – "cluster" and "bunch"). Synonyms may be used often in written documents by their authors to avoid repetition of the same word, and they are treated as different, not correlated terms. In order to deal with synonyms, a thesaurus has to be used, enabling to check if a word has the same meaning as other ones occurring in the document. All terms containing the same information are usually replaced by one common, artificial identifier which is used instead of many words during the terms weighting phase. This process is called in [Salton 89] thesaurus transformation.
Such thesaurus has to be created manually, and its construction takes a lot of resources (especially time), so we couldn't develop our own one. Instead we would have to choose an existing one (which would be available for free). Example of such system is WordNet [WordNet], developed at the Princeton University. It contains most of word in the English language, and also the relations that exist between them, including synonymy (which is only one possibility, information about other relations such as hypernymy representing a hierarchy of words is also available). Unfortunately, due to the lack of time we have made no use of WordNet (anyway, using it would enable us only to detect synonyms in English documents, we don't know about any tool for Polish language). Using synonyms was rather a minor goal as we have observed that they occur quite rarely in documents, and therefore this approach wouldn't improve clustering quality significantly.
3.5.1. REASONS TO USE PHRASES Natural way to represent phrases in the vector space model is to simply treat them as elements of the terms vector so that weights may be assigned to them and then they can be included during vector similarity computations. For example set of sentences presented earlier in Figure 3-1-1-2, valid phrases could be "King College" or "King County". As we have mentioned earlier in Chapter 3-1-2, such phrases carry much more information than just simple occurrences of terms (in this case – "King", "College" and "County") in document.
Nevertheless there has been a number of opinions that using phrases in the vector space model is a rare approach [Zamir and Etzioni 99] and that it is not likely to achieve significantly better results than using only single words [Salton 89]. There are also some papers showing successful application of phrases [Maarek et al. 91] [Maarek et al. 00]. In our opinion, main benefit of using phrases consists rather in better quality of group labels (see Chapter 4-3-2) than the quality of clustering itself. Phrases are more "expressive", and make the descriptions more concise and accurate. Consider for example a group of documents about using Agglomerative Hierarchical Clustering algorithm to search results clustering.
This group could be described with three phrases "agglomerative hierarchical clustering", "search results clustering", and "vector space model". Without phrases its label would change into something like "space", "clustering", "model", "hierarchical", "results", "vector", "agglomerative", "search", and even if this cluster would contain documents relevant to the description, it would take the user much more effort to guess its topic from the description. In Chapter 6-3-3, we show benefits from phrases use on sample algorithm results.
3.5.2. PHRASES EXTRACTION In order to treat phrases as elements of the terms vector, they have first to be found in the input documents set. This process is called the phrases extraction. Phrases are built from single words, so the only possible way to retrieve them without additional information is to consider all sequences of words in documents (up to some length) and choose from them ones that may represent valid phrases.
Phrases retrieval is performed separately for different length of sequences from 2 up to some arbitrarily chosen maximum length (we have found 5 to be enough, as phrases longer than 5 words occur in documents very rarely). Each iteration of the process consists of two stages. In the first one, all sequences (called also n-grams) of words of given length are found in the documents using algorithm that is presented below on Figure 3-5-2-1. It is a simplified version of method presented in [Maarek et al. 91] and [Smadja 93].
Result of this algorithm are all n-grams of words of given length (not crossing sentence boundaries) that are present in the given document, so it has to be repeated for all the documents from the whole collection. It is also easy to see that this algorithm's complexity is linear in number of words in the document. This means that retrieving sequences from all documents is linear in their number (as number of words in one document may be treated as a constant).
Only a small part of sequences produced by this method represent valid, important phrases, so a second stage is needed – sifting through the candidates in order to find the significant phrases. This is done using only information about their frequencies. For all n-grams of current length, their frequency, its average value and deviation are calculated. Then for each one its strength [Smadja 93] is calculated. Value of a strength of a sequence s tells us how many standard deviations above the average frequency of all sequences of current length is
the frequency of s:
The most frequent n-grams (i.e. possessing certain minimal strength) are considered as important and attached to the term vector. Then the process is repeated for the next value of length.
In [Smadja 93] the value suggested for strength is 1, which should filter out the majority of insignificant phrases. Our experiments have shown that using strength value of 1 for a document collection of 100 snippets usually meant that a sequence had to occur at least 3 times in the documents set in order to be classified as important.
- 37 POSSIBLE IMPROVEMENTS OF PHRASES EXTRACTIONMethods presented by us in this chapter are very simple and may be further improved. In
particular, it does not address the following issues:
• using grammatical information during phrases retrieval Most of publications about phrases extraction comes from the lexicography domain and their authors postulate that the statistical significance of a retrieved phrase alone is not enough – the algorithm should also check its grammatical correctness ([Church et al. 91], [Smadja 91] [Smadja 93]). Also [Salton 89] states that "phrase-formation process controlled only by word co-occurrences and the document frequencies of certain words is not likely to generate a large number of high-quality phrases".
In order to perform grammatical validation, we need for every word information about the corresponding part of speech and also knowledge of the syntax rules (which both are again language-dependent). We didn't have such data and have extracted the phrases using only the statistical information. Moreover, it turned out that in fact grammatical information is not necessary for our application of phrases extraction – we are interested just in the mere fact of occurring some words together in different documents as it may point that they are somehow related.
• consideration of the order of terms Sometimes phrases that contain the same terms, but in different order have the same meaning. This point is partially connected with the previous one, as importance of order of words depends on the documents language and used grammatical constructions. Using an example from [Weiss 01], in English two phrases: "John hit Paul" and "Paul hit John" carry completely opposite meaning, while in Polish sentences "Jan uderzył Pawła" and "Pawła uderzył Jan" mean exactly the same.
• using complete phrases only When using our method, if a document contains a phrase longer than two terms, also its subphrases may be additionally extracted and treated as separate ones (for instance, from a document containing a sentence "King County Seattle Government", also phrases "King County", "County Seattle", etc. would be retrieved). It is often postulated that only complete phrases should be used, i.e. in case of our example – "King County Seattle Government", in order to avoid redundancy..
• methods of weighting specific for phrases Extracted phrases are treated during construction of term-document matrix just like terms, i.e. their weights are calculated on the basis of their occurrences in documents.
However, one may try also to make use of phrases length during this process, as usually longer phrases are more important and informative to the user.