«by MICHAŁ WRÓBLEWSKI Supervisor JERZY STEFANOWSKI, Assistant Professor Referee ROBERT SUSMAGA, Assistant Professor MASTER THESIS Submitted in ...»
• extension of input pre-processing
The pre-processing of input snippets performed by other modules of the Carrot2 system, such as stemming or stop-word removal, also has a strong influence on the final results. We feel, for instance, that extending the stop-lists would considerably improve the quality of the resulting clusters, as the descriptions of clusters created by our algorithm often consist of single terms only. Such descriptions are unclear when these terms carry no information and should in fact be on the stop-list.
• overlapping clusters
Unfortunately, in the results given by AHC a document may appear in only one group (not counting its parent groups in the hierarchy). Yet in the real world a single document may correspond to several different topics and should be contained in an appropriate number of clusters. However, during our work on this thesis we have not found any publications mentioning attempts to create a version of the AHC algorithm with such capabilities.
• improvement of usefulness of created clusters
We feel that the main problem with the quality of AHC results is that the algorithm quite often creates clusters made up of documents that share just one or two single words, which usually tell us nothing about their real topic. Rejecting some of the created clusters and moving the documents they contain to the "Other Topics" group may therefore improve the overall results.
• speed of implementation
Unfortunately, our implementation of the AHC algorithm is certainly too slow for practical application. The main reason is the cubic complexity of the version of the clustering algorithm that we used.
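The effect of extending the stop-list, suggested in the first item above, can be illustrated with a minimal sketch. It is not the Carrot2 pre-processing code: the function name `filter_terms`, both word lists and the sample snippet are hypothetical and chosen only for illustration.

```python
# Hypothetical base stop-list and extension; neither is taken from Carrot2.
BASE_STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}
EXTRA_STOP_WORDS = {"click", "here", "home", "page", "site"}


def filter_terms(snippet, stop_words):
    """Lower-case a snippet, split it on whitespace and drop stop-words."""
    return [term for term in snippet.lower().split() if term not in stop_words]


# Hypothetical snippet containing typical uninformative Web terms.
snippet = "Click here to visit the home page of the Jaguar club"

# With the base list only, terms such as "click" or "home" survive and
# could end up as single-term cluster descriptions.
basic = filter_terms(snippet, BASE_STOP_WORDS)

# With the extended list, only the more informative terms remain.
extended = filter_terms(snippet, BASE_STOP_WORDS | EXTRA_STOP_WORDS)
```

In this sketch the extended list leaves only terms that could serve as a meaningful cluster description. A real extension would have to be prepared separately for each supported language.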
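The rejection of weak clusters proposed in the third item above could look roughly as follows. This is only a sketch under simplifying assumptions: clusters are flat rather than hierarchical, each document is represented as a set of terms, and the function name `prune_clusters` and the threshold `min_shared_terms` are hypothetical, not part of our implementation.

```python
def prune_clusters(clusters, min_shared_terms=3, other_label="Other Topics"):
    """Move documents of weakly supported clusters to a catch-all group.

    `clusters` maps a cluster label to a list of documents, each document
    being a set of terms. A cluster is kept only if all of its documents
    share at least `min_shared_terms` terms (an assumed threshold).
    """
    kept = {}
    other = []
    for label, docs in clusters.items():
        if label == other_label:
            # Documents already in the catch-all group stay there.
            other.extend(docs)
            continue
        shared = set.intersection(*docs) if docs else set()
        if len(shared) >= min_shared_terms:
            kept[label] = docs
        else:
            other.extend(docs)
    if other:
        kept[other_label] = other
    return kept


# Hypothetical input: the second cluster is held together by one word only.
clusters = {
    "jaguar car": [
        {"jaguar", "car", "dealer", "price"},
        {"jaguar", "car", "price", "used"},
    ],
    "home": [
        {"home", "jaguar", "club"},
        {"home", "garden", "tips"},
    ],
}
result = prune_clusters(clusters)
```

Here the "home" cluster is dissolved because its documents share a single word, and both of its documents end up in "Other Topics". The default threshold of three shared terms reflects the observation that one or two common words rarely indicate a real topic, but a suitable value would have to be determined experimentally.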