by MICHAŁ WRÓBLEWSKI
Supervisor: JERZY STEFANOWSKI, Assistant Professor
Referee: ROBERT SUSMAGA, Assistant Professor
MASTER THESIS
Submitted in ...
• running graph algorithms on a graph of 3 billion nodes and 20 billion edges
• keeping the index up to the minute by finding and re-indexing almost all Web pages within minutes of their change or creation

These problems have to be dealt with using distributed and parallel processing techniques, as no single machine has enough processing power. Google takes a very interesting approach here: instead of using powerful workstations, it employs a cluster of over 10,000 cheap PCs to search its databases. Judging simply by the time needed to answer a query (which seems to be the user's most important criterion), this solution works perfectly.
Other serious problems for search engine designers are filling the service's databases with all documents from the Web and updating them to reflect its current state. Selberg [Selberg 99] phrased these problems as questions: How can all Web pages be located? (the Web Discovery Problem) and: How can the index be kept up to date with the Web? (the Coherence Problem).
It turns out that the only way for a search service to keep a moderately complete and consistent image of the Web in its databases is to search the Web constantly, discovering new documents and detecting changes in those already stored. In most cases one unfortunately cannot rely on any external notification of the changes that have occurred. To perform this searching, special programs called Web robots or spiders are used; their task can be pictured as traversing a graph whose nodes represent documents and whose edges represent the hyperlinks between them. When a change is detected, the corresponding document is updated, inserted or deleted. Considering the size of the Internet, it may take a long time to find new pages, and despite ongoing efforts, search services cannot keep up with the abrupt and chaotic growth of the WWW, covering an ever smaller fraction of it [SearchEngineWatch]. They are only able to capture a snapshot of the current state of the Web.
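The traversal described above can be sketched as a breadth-first walk over the Web graph. The sketch below is only an illustration of the idea, not any actual robot's implementation; the fetch_links function, which is assumed to download a page and return the URLs it links to, is hypothetical.

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=1000):
    """Breadth-first traversal of the Web graph: pages are nodes,
    hyperlinks are edges. fetch_links(url) is a hypothetical helper
    that downloads a page and returns the URLs it links to."""
    frontier = deque(seed_urls)          # pages waiting to be visited
    visited = set(seed_urls)             # pages already discovered
    while frontier and len(visited) <= max_pages:
        url = frontier.popleft()
        for link in fetch_links(url):
            if link not in visited:      # a newly discovered document
                visited.add(link)
                frontier.append(link)
    return visited
```

A real robot would additionally have to revisit known pages periodically to detect changes, which this sketch omits.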
Particularly irritating from the user's point of view is the so-called "dead links" problem, which arises when the database of a search engine contains references to documents that no longer exist. Research from a few years ago shows that on average several percent of the links returned by search services are "dead" [Notess 99].
2.2.2. MATCHING THE USER'S QUERY

The first stage of the searching process is the interpretation of the user's query. As already mentioned in the introduction, a query is usually a list of words (or phrases) joined by Boolean operators. For instance, a query asking for all documents about Oxford except those containing anything about Cambridge would look like "Oxford AND NOT Cambridge".
This way of formulating queries may pose a number of problems:
• it forces users to express their needs in an artificial language, and the process of "translation" may introduce distortions, so that the query typed into the search engine may mean something different from what the user originally had in mind.
• user queries are usually too short (in most cases they consist of 1-3 words). Such queries lack context and are ambiguous. Consider for example a user who wants to go skiing in Switzerland and types the query "Switzerland" into his favorite search engine. He will be flooded not only with information about skiing in the Swiss Alps, but also with lots of information about Swiss watches and chocolate...
• queries are constructed using Boolean operators, which are not well known to most people.
• the user may not know exactly what he is looking for (this is also possible, and search engines should be designed with the fact in mind that people do not always have precise questions).
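Evaluation of a Boolean query like "Oxford AND NOT Cambridge" against an inverted index can be sketched as simple set operations. The index layout below (a mapping from each word to the set of document identifiers containing it) is an assumption made for illustration, not the structure of any particular search engine.

```python
def boolean_search(index, required, forbidden=()):
    """index: {word: set of doc ids containing that word}.
    Returns the documents that contain every word in `required`
    and no word in `forbidden`, e.g. "Oxford AND NOT Cambridge"
    becomes required=["oxford"], forbidden=["cambridge"]."""
    result = None
    for word in required:
        docs = index.get(word, set())
        # AND: intersect the posting sets of the required words
        result = docs.copy() if result is None else result & docs
    if result is None:
        result = set()
    for word in forbidden:
        # AND NOT: remove documents containing a forbidden word
        result -= index.get(word, set())
    return result
```

For example, with an index where "oxford" occurs in documents {1, 2, 3} and "cambridge" in {2}, the query above would return {1, 3}.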
As a result of the search, a list of Web pages containing the given combination of key words from the query is returned (depicted in Figure 2-2-2-1; the topic of finding documents relevant to the query is discussed in Chapter 3). Because this list is usually so long that it is impossible to look through it exhaustively, it has to be ordered so that the documents most relevant to the query occupy the first positions and the user can focus just on the top of the list.
Fig. 2-2-2-1 Google page showing ranking list of results of a query "king"

The original way to sort the documents was to compare how often the key words appear in the different pages on the ranking list. Unfortunately, the quality of results of this approach is often poor, because the number of occurrences of certain words is not the only factor determining the importance of a document. Moreover, it leaves room for abuse: by adding suitable words to a page's content, its author may manipulate its position in the ranking.
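The naive occurrence-counting criterion criticized above can be sketched in a few lines; the sketch also makes the manipulation problem obvious, since an author only needs to repeat a word to climb the ranking. The document representation (a plain text per document id) is an assumption for illustration.

```python
def rank_by_frequency(docs, query_words):
    """docs: {doc_id: text}. Scores each document by the total
    number of occurrences of the query words -- the naive criterion
    that page authors can easily manipulate by keyword stuffing."""
    query = [w.lower() for w in query_words]
    scores = {}
    for doc_id, text in docs.items():
        words = text.lower().split()
        scores[doc_id] = sum(words.count(q) for q in query)
    # highest-scoring documents first
    return sorted(scores, key=scores.get, reverse=True)
```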
To solve this problem, it was necessary to use the information provided by the structure of mutual references between Web pages. As every experienced user of the Internet (and not only of the Internet; the same happens with books, papers, etc.) knows, in each domain there are services which are so-called "authorities", i.e. services that many other pages refer to.
Moreover, this approach, called connectivity analysis, can easily be performed automatically. It has become the basis of methods which, apart from estimating the relevance of a document to the query, also take into account its "authority". An example of such an algorithm is PageRank [Page et al. 98], used in Google, which was one of the reasons for this service's great success.
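The core of PageRank can be sketched as a simple power iteration: a page's rank is distributed evenly among the pages it links to, and a damping factor models a "random surfer" who sometimes jumps to an arbitrary page. This is only a minimal sketch of the idea from [Page et al. 98]; the adjacency-list representation and the handling of dangling pages are simplifying assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: list of pages it links to}; the graph is
    assumed closed (every linked page is also a key)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # base probability of the random jump
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:          # distribute rank along out-links
                    new[q] += share
            else:                       # dangling page: spread everywhere
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank
```

In a small graph where pages "b" and "c" both link to "a" while "a" links only to "b", page "a" ends up with the highest rank: it is the "authority" referred to by the most pages.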
Apart from the ordering of the ranking list, a separate problem of creating its content arises. Of course, the list cannot consist of whole documents, but giving the user just their addresses without any indication of what they contain would not be a satisfactory solution. So besides links to the documents, ranking lists contain their titles and short, automatically created digests called snippets. Snippets should be as short as possible, but on the other hand give the user enough information about the page's content to judge whether it is of interest without reading the whole page. Usually snippets consist of sentences that contain the keywords, allowing the user to see the context in which they appear in the document (which, as stated before, can tell us a lot about its exact topic). This also means that the snippets created for the same document may differ between queries.
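The snippet construction described above might be sketched as follows. This is an illustration only, with sentence splitting crudely simplified to periods; real search engines use more careful segmentation and scoring.

```python
def make_snippet(text, query_words, max_sentences=2):
    """Pick the sentences that contain the query words, showing the
    context in which they occur; different queries therefore yield
    different snippets for the same document."""
    query = {w.lower() for w in query_words}
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    # keep only sentences sharing at least one word with the query
    hits = [s for s in sentences if query & set(s.lower().split())]
    if not hits:                 # fall back to the document's opening
        hits = sentences[:1]
    return ". ".join(hits[:max_sentences]) + "."
```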
2.3. AUTOMATIC SEARCH RESULTS CLUSTERING

This approach is based on the cluster hypothesis, formulated in [VanRijsbergen 79], which states that similar documents should be relevant to the same queries. Clustering of search results should therefore form groups of similar documents from the ranking lists. An example of document clustering is shown in Figure 2-3-1:
Fig. 2-3-1 Vivisimo page showing clustered results of a query "king"
Automatic document classification has long been a well-known approach in the Information Retrieval domain [Salton 89], [Maarek et al. 91], where it was applied to cluster large text databases. This kind of clustering is much easier to perform, as these databases are not updated as frequently as the Internet, and the clustering can be performed periodically.
This also makes it possible to cluster the entire document collection, not only the query results, which produces a hierarchy of the whole database, just like in human-made Web directories. This kind of document clustering may be referred to as "persistent clustering", as opposed to the "ephemeral clustering" performed on search results [Maarek et al. 00].
However, the Web is too dynamic an environment for this method, because no clustering algorithms exist that allow the fast, incremental updating of results that would be required here. Considering also the size of search services' databases, running a clustering algorithm on their entire contents would be a serious problem. For these reasons, only search results are clustered. This also gives better quality than assigning the result documents to a hierarchy created from the whole database, because documents not present in the result set then have no influence on the way it is split into topics [Zamir and Etzioni 99].
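As an illustration of ephemeral clustering (and not of any algorithm used by the systems discussed here), grouping result snippets by word overlap might look like the following sketch; the greedy strategy and the Jaccard threshold are arbitrary choices made for brevity.

```python
def jaccard(a, b):
    """Word-overlap similarity between two snippets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def cluster_results(snippets, threshold=0.25):
    """Greedy clustering: each snippet joins the first cluster whose
    representative (first member) is similar enough, otherwise it
    starts a new cluster. Per the cluster hypothesis, similar
    documents should end up relevant to the same queries."""
    clusters = []                        # list of lists of snippets
    for s in snippets:
        for c in clusters:
            if jaccard(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters
```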
The first, pioneering attempts to address the problem of Web search results clustering date back to the mid-1990s. The most important of these systems was Scatter/Gather, described in [Cutting et al. 92]. These approaches were based on grouping techniques from the domain of statistics, used to cluster numerical rather than textual data. As Web search results clustering must meet some specific conditions (listed below), the performance of these "classical" algorithms in this area was quite poor [Zamir and Etzioni 98], [Zamir and Etzioni 99]. The breakthrough in this area was the Grouper system, whose authors succeeded in creating an algorithm that clusters documents in time linear in the size of the document set and performs well when applied to textual data. They also formulated generic key requirements for Web document clustering methods [Zamir and Etzioni 98], [Zamir and Etzioni 99]:
• Relevance. The method ought to cluster documents relevant to the user's query separately from irrelevant ones.
• Browsable summaries. The method should provide concise and accurate descriptions of the clusters, so that sifting through ranking lists is not simply replaced by sifting through clusters. This is a very important issue for document clustering; the authors of Vivisimo likewise claim that "rather than form clusters and then figure out how to describe them, we only form describable clusters in the first place" [Vivisimo].
• Cluster overlap. Since documents may have multiple topics, the method should allow a document to be assigned to more than one cluster.
• Snippet-tolerance. The method should produce high-quality clusters even when it only has access to the snippets from the ranking list rather than the whole original documents, as downloading these from the Web would take too much time.
• Speed. The method should be able to cluster several hundred documents (an order of magnitude more than the user might sift through in a ranked list presentation) in a time acceptable to the user.
• Incrementality. This is a variant of the speed requirement: to save time, the method should start processing each snippet as soon as it is received from the search engine.
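The base-cluster phase of the Suffix Tree Clustering algorithm used by Grouper can be sketched very roughly as follows: documents sharing a common phrase form a candidate cluster labeled with that phrase, which directly addresses the browsable-summaries and cluster-overlap requirements. The real algorithm builds a generalized suffix tree in linear time; the naive phrase enumeration below is a simplification for illustration only.

```python
from collections import defaultdict

def base_clusters(snippets, max_len=3, min_docs=2):
    """Map each phrase (word n-gram, n <= max_len) to the set of
    documents containing it; phrases shared by at least min_docs
    documents become candidate clusters with readable labels.
    A document may appear in many clusters (cluster overlap)."""
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(snippets):
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase_docs[" ".join(words[i:i + n])].add(doc_id)
    return {p: d for p, d in phrase_docs.items() if len(d) >= min_docs}
```

Note that this sketch works directly on snippets (snippet-tolerance), but unlike the suffix-tree construction it is neither linear-time nor incremental.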
2.4. EXPERIMENTAL SOLUTIONS
As we have already stated, many problems with the poor quality of search results arise because users have to express their needs as queries in an artificial language. This observation has led to the construction of search engines that accept queries in natural language, such as AskJeeves [AskJeeves] or AnswerBus [AnswerBus]. This approach is very rare, as automatic analysis of natural language is difficult to perform. Another problem is that Internet users speak many different languages (of these two services, only AnswerBus allows queries in languages other than English, namely French, Spanish, German, Italian and Portuguese); moreover, they tend to make a lot of grammatical and orthographic mistakes, making their queries even harder for the service to understand.
Fig. 2-4-1 AskJeeves page showing results of a query "to be or not to be?"
The remaining services discussed here focus on presenting the similarity between the documents returned as query results. A recently very popular approach to this problem is the graphical visualization of results [Weiss 02c]. One of the most interesting services applying it is Kartoo [Kartoo], which has a very nice graphical interface that may, however, be too "abstract" for users accustomed to typical search engines. The idea is to visualize documents as nodes of a graph, with the relations between them represented by edges (see Figure 2-4-2). These edges are labeled with sets of keywords (somewhat resembling group labels in document clustering) that the related documents share. The importance of keywords and documents is also visualized.
Fig. 2-4-2 Kartoo page showing results of a query "king"

Yet another service of this type is Map.net [MapNet], which, as its name suggests, creates a "map" of the results. This search engine uses the Open Directory Project catalogue, dividing the Internet into the categories present there, and plots the result documents on a map, placing them in areas corresponding to their topics. The size and "geographic position" of the areas on the map depend on the number of documents in the respective categories and on the mutual relations between them.
Fig. 2-4-3 Map.Net page showing map of results of a query "king"
Both these systems are very interesting, and the way they present the results is certainly more attractive and user-friendly than the ranking list. But they are still in an experimental phase and should be considered fads rather than practical tools. It seems that nothing is going to threaten Google's (in any case well-earned) market position in the next few years.
2.5. THE CARROT2 SYSTEM

The Carrot2 system originated from the Carrot clustering meta-search engine, a result of the work described in [Weiss 01]. The main goals of Carrot included evaluating the applicability of Web document clustering to the Polish language and, more importantly from the point of view of this thesis, creating an open-source clustering system which could form a basis for further work. This goal was successfully realized, and the system is being further developed under the name Carrot2. The main improvements concern the system's architecture, which was redesigned to increase its flexibility [Weiss 02]. Thanks to these changes, components of the system can easily be added or replaced, which is necessary in order to experiment conveniently with different algorithms (illustrated below in Figures 2-5-1, 2-5-2 and 2-5-3).

Fig. 2-5-1 Carrot2 page showing clustered results of a query "king" (Suffix Tree Clustering algorithm used)