On Search Engine Evaluation Metrics. Inaugural dissertation for the degree of Doctor of Philosophy (Dr. Phil.), submitted to the Philosophische ...
Though some of these users may just enter the Google URL into the search bar instead of the address bar, I have personally witnessed a very nice lady typing “google” into the Bing search bar to get to her search engine of choice. She then queried Google with the name of the organization she worked at to get to its web site.
Figure 1.1. The decline and fall of the Yahoo directory from 1994 to 2010. The directory is marked with red on each screenshot. If you can’t find it in the last one, look harder at the fourth item in the drop-down menu in the top-center part of the page, just after “Celebrities”. Screenshots of Yahoo homepage (partly from CNET News 2006).
Even if we could, armed with the knowledge of the difference between search engines, agree upon a definition, there is no means of collecting information about all the services out there, though I would venture that a number in the tens of thousands is a conservative estimate if one includes regional and specialized search engines.
1.3 Web Search Evaluation
Now that we know how important search engines are, we might have an easier time explaining why their evaluation is important. In fact, there are at least three paths to the heart of the matter, and we will take all of them.
Firstly, and perhaps most obviously, since users depend on search engines, they have a lot to gain from knowing which search engine would serve them best. Even if there is a difference of, say, 5% – however measured – between the search engine you are using and a better alternative, your day-to-day life could become just that little bit easier. This is the reason for the popular-press search engine reviews and tests, though they seem to be less common these days. Some years back, tests like those by the on-line publication Search Engine Watch (2002) or the German consumer watchdog Stiftung Warentest (2003) were plentiful. However, they were rarely rigorous enough to satisfy scientific standards, and often lacked a proper methodology. The few current evaluations (such as Breithut 2011) tend to be still worse, methodologically speaking.
Connected to the first one is another reason for evaluating search engines, which might be viewed as a public version of the former. There is sizeable concern among some politicians and non-governmental organizations that the search engine market is too concentrated (in the hands of Google, that is); this leads to issues ranging from privacy to freedom of speech and freedom of information. What has that got to do with evaluation? Lewandowski (2007) provides two possible extreme outcomes of a comparative evaluation. Google might be found to deliver the best quality; then it would seem counterproductive to try to diversify the search engine market by imposing fines or restrictions on Google, since this would impair the users’ search experience. Instead, should one still wish to encourage alternatives, the solution would have to be technological – creating, in one way or another, a search engine of sufficient quality to win a significant market share.7 Alternatively, the search engines might be found to perform comparably well, in which case the question arises why Google has the lion’s share of the market. It might be because it misuses its present leading position – in whatever way it has been achieved in the first place – to keep competition low, which would make it a case for regulators and anti-monopoly commissions; or because of better marketing; or some other reason not directly related to search quality. In any case, the policy has to be quite different depending on the results of an evaluation.8
7 Though the French-German Quaero project (www.quaero.org) and the German Theseus research program (www.theseus-programm.de), both touted as “European answers to Google” by the media, have developed in directions quite different from a general web search engine.
8 Though the “winner takes it all” principle flourishes online because of network effects, and it is unclear whether an attempt to remedy the situation would be sensible (Linde and Stock 2011).
The public might also be interested in another aspect of web search evaluation. Apart from asking “Which search engine is better?”, we can also concern ourselves with possible search engine bias. This might be intentional, accidental or an artifact of the search engine’s algorithms. For example, preference may be given to old results over new ones or to popular views over marginal opinions. While in some cases this is acceptable or even desirable, in others it might be considered harmful (Diaz 2008).
The third reason web search evaluation is useful is that search engine operators want (and need) to improve their services. But in order to know whether some change really improves the user experience, it is necessary to evaluate both versions. This operator-based approach will be the one we will concentrate upon.
In Part I, I will do my best to provide a theoretical basis for the study described in Part II. This will require a range of topics to be dealt with – some in considerable depth, and others, less pertinent, almost in passing. This will enable us to understand the motivation for the practical study, its position in the research landscape, and the questions it asks and attempts to answer.
The first thing to know about web search evaluation is how web search works, and how people use it; this will be the topic of Chapter 2. Chapter 3 then turns to evaluation; it explains a widely employed IR evaluation framework and its relevance to web search evaluation.
Chapter 4 is the largest one in the theoretical part; it describes various explicit metrics proposed in the literature and discusses their merits as well as potential shortcomings. General limitations of such metrics are also discussed. Chapter 5 does the same for implicit metrics, albeit on a smaller scale. The relation between the two kinds of metrics, explicit and implicit, is explored in Chapter 6.
Search engine evaluation is, in a very basic sense, all about how well a search engine serves its users, or, from the other point of view, how satisfied users are with a search engine.
Therefore, in this section I will provide some information on today’s search engines as well as their users. Please keep in mind that the search engine market and technologies change rapidly, as does user behavior. Data which is two years old may or may not be still representative of what happens today; five-year-old data is probably not reliably accurate anymore; and at ten years, it can be relied upon to produce some serious head-shaking when compared to the current situation. I will try to present fresh information, falling back on older data when no new studies are available or the comparison promises interesting insights.
2.1 Search Engines in a Nutshell
Probably the first web search engine was the “World Wide Web Wanderer”, started in 1993.
It possessed a crawler which traversed the thousands, and soon tens of thousands, of web sites available on the newly-created World Wide Web, creating its own index of their content.9 It was closely followed by ALIWEB, which did have the intention of bringing search to the masses, but no web crawler, relying instead on webmasters to submit the descriptions of their web sites to an index (Koster 1994; Spink and Jansen 2004). Soon after that, a run began on the search engine market, with names like Lycos, Altavista, Excite, Inktomi and Ask Jeeves appearing from 1994 to 1996. The current market leaders entered later, with Google launching in 1998, Yahoo! Search with its own index and ranking (as opposed to the long-established catalog) in 2004, and MSN Search (to be renamed Live Search, and then Bing) in 2005.
Today, the market situation is quite unambiguous; there is a 360-kilogram gorilla on the stage.
It has been estimated that Google handled at least 65% of all US searches in June 2009, followed by Yahoo (20%) and Bing (6%) (comScore 2009). Worldwide, Google’s lead is considered to be considerably higher; the estimates range from 83% (NetApplications 2009) up to 90% for single countries such as the UK (Hitwise 2009a) and Germany (WebHits 2009).
In the United States, google.com has also been the most visited website (Hitwise 2009b). Its dominance is only broken in a few countries with strong and established local search engines, notably in China, where Baidu has over 60% market share (Barboza 2010), and Russia, where Yandex has over 50% market share (LiveInternet 2010). The dictionary tells us that there is a verb to google, meaning to “search for information about (someone or something) on the Internet”.
9 Though the WWWW was primarily intended to measure the size of the web, not to provide a search feature (Lamacchia 1996).
The general workings of web search engines are quite similar to each other. In a first step, a web crawler, starting from a number of seed pages, follows outbound links, and so attempts to “crawl” the entire web, copying the content of the visited pages into the search engine’s storage as it does so. These contents are then indexed, and made available for retrieval. On the frontend side, the user generally only has to enter search terms, whereupon a ranking mechanism attempts to find results in the index which will most probably satisfy the user, and to rank them in a way that will maximize this satisfaction (Stock 2007).
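The crawl-and-index loop described above amounts to a breadth-first traversal of the link graph. The following sketch illustrates it; the toy pages and the `fetch` function are hypothetical stand-ins for real HTTP requests and HTML link extraction:

```python
from collections import deque

def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl: start from seed URLs, follow outbound
    links, and copy each visited page's content into an index."""
    index = {}               # url -> page content (the engine's "storage")
    frontier = deque(seeds)  # URLs waiting to be visited
    seen = set(seeds)
    while frontier and len(index) < max_pages:
        url = frontier.popleft()
        content, outlinks = fetch(url)  # in reality: HTTP request + parsing
        index[url] = content
        for link in outlinks:
            if link not in seen:        # avoid revisiting known pages
                seen.add(link)
                frontier.append(link)
    return index

# A toy "web" standing in for the real one (invented page names).
WEB = {
    "a.example": ("page A", ["b.example", "c.example"]),
    "b.example": ("page B", ["c.example"]),
    "c.example": ("page C", ["a.example"]),
}
index = crawl(["a.example"], lambda url: WEB[url])
print(sorted(index))  # all three pages are discovered and indexed
```

The indexed contents would then be handed to the indexing and ranking components described next.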
Result ranking is an area deserving a few lines of its own since it is the component of a search engine which (together with the available index) determines what the result list will be. In traditional information retrieval, where results for most queries were relatively few and the quality of documents was assumed to be universally high, there often was no standard ranking; all documents matching the query were displayed in a more or less fixed order, often ordered by their publication dates. This is clearly not desirable with modern web search engines where the number of matches can easily go into the millions. Therefore, a large number of ranking criteria have been established to present the user with the most relevant results in the first ranks of the list.
One old-established algorithm goes by the acronym tf-idf, which resolves to term frequency – inverse document frequency (Salton and McGill 1986). As the name suggests, this method considers two dimensions; term frequency means that documents containing more occurrences of terms also found in the query are ranked higher, and inverse document frequency implies that documents containing query terms that are rare throughout the index are also considered more relevant. Another popular ranking approach is the Vector Space Model (Salton, Wong and Yang 1975). It represents every document as a vector, with every term occurring in the index providing a dimension of the vector space, and the number of occurrences of the term in a document providing the extension in this dimension. For the query another vector is constructed in the same way; the ranking is then determined by a similarity measure (e.g. the cosine) between the query vector and single document vectors.
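The two ideas can be combined in a small sketch: tf-idf weights build the document and query vectors, and cosine similarity ranks the documents. The documents and query below are invented for illustration:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build sparse tf-idf vectors: term frequency weighted by
    inverse document frequency (rarer terms count for more)."""
    n = len(docs)
    df = Counter()                       # in how many documents each term occurs
    for doc in docs:
        df.update(set(doc.split()))
    idf = {t: math.log(n / df[t]) for t in df}
    vecs = [{t: c * idf[t] for t, c in Counter(doc.split()).items()}
            for doc in docs]
    return vecs, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u.get(t, 0.0) * w for t, w in v.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["web search engine ranking",
        "vector space model ranking",
        "cooking recipes"]
vecs, idf = tf_idf_vectors(docs)
query = Counter("search ranking".split())
qvec = {t: c * idf.get(t, 0.0) for t, c in query.items()}
ranked = sorted(range(len(docs)),
                key=lambda i: cosine(qvec, vecs[i]), reverse=True)
print(ranked)  # document 0 matches both query terms and ranks first
```

Real engines use refined variants (length normalization, smoothing of the idf term), but the ordering principle is the same.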
These approaches are based only on the content of the single documents. However, the emergence of the web with its highly structured link network has allowed for novel approaches, known as “link analysis algorithms”. The first major one was the HITS (Hyperlink-Induced Topic Search) algorithm, also known as “Hubs and Authorities” (Kleinberg 1999). The reason for this name is that the algorithm attempts to derive from the link network the most selective and the most selected documents. In a nutshell, the algorithm first retrieves all documents relevant to the query; then it assigns higher “hub” values to those with many outgoing links to others within the set, and higher “authority” values to those with many incoming links. This process is repeated iteratively, with links from many good “hub” pages providing a larger increase in “authority” scores, and links to many good “authority”
pages leading to significant improvements in the “hub” values.10 HITS is rarely used in general web search engines since it is vulnerable to spam links (Asano, Tezuka and Nishizeki 2008); also, it is query-specific and as such has to be computed at runtime, which would in many cases prove too laborious. At one point, a HITS-based ranking was used by the Teoma search engine (see Davison et al. 1999).
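The iterative hub/authority update can be sketched in a few lines, including the per-iteration normalization that keeps the scores from growing unboundedly. The toy link graph is invented for illustration:

```python
import math

def hits(links, iterations=50):
    """Simplified HITS: pages linked from many good hubs gain
    authority; pages linking to many good authorities become hubs."""
    pages = sorted(set(links) | {t for outs in links.values() for t in outs})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority score: sum of hub scores of pages linking in
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ()))
                for p in pages}
        # hub score: sum of authority scores of pages linked to
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        # normalize so the values converge instead of diverging
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Toy graph: h1 and h2 only point at others; a receives the most links.
links = {"h1": ["a", "b"], "h2": ["a"], "a": [], "b": []}
hub, auth = hits(links)
print(max(auth, key=auth.get))  # "a" emerges as the top authority
```

Note that the whole computation runs on the query-specific result set, which is precisely why it is expensive at retrieval time.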
The second major link-based algorithm, PageRank (Brin and Page 1998), proved far more successful. It does not distinguish between different dimensions; instead, a single value is calculated for every page. Every link from page A to page B is considered to transfer a fraction of its own PageRank to the linked page; if a page with PageRank 10 has 5 outgoing links, it contributes 2 to the PageRank of every one of those. This process is also repeated iteratively; as it is query-independent, the PageRank values need not be calculated at runtime, which saves time.
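The simplified scheme described here – each page splitting its score evenly among its outgoing links, iterated until the values stabilize – can be sketched as follows. The damping factor of 0.85 and the toy link graph are illustrative assumptions, not part of the description above:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: each page passes its current score,
    divided evenly, to the pages it links to; repeated until stable."""
    pages = sorted(set(links) | {t for outs in links.values() for t in outs})
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            if outs:
                # e.g. a page with rank 10 and 5 outgoing links
                # contributes 2 per link (before damping)
                share = damping * rank[p] / len(outs)
                for t in outs:
                    new[t] += share
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = pagerank(links)
print(max(rank, key=rank.get))  # "c" collects links from both a and b
```

Because the scores depend only on the link graph, not on any query, this loop can run offline over the whole index, which is the efficiency advantage mentioned above.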
It is important to remember that these descriptions are extremely simplified versions of the actual algorithms used; that no major search engine publishes its ranking technology, though brief glimpses might be caught in patents; and that the ranking is always based on a large number of criteria. Google, for example, uses more than 200 indicators, of which PageRank is but one (Google 2010). The leader of the Russian search engine market, Yandex, claims that the size of its ranking formula increased from 20 bytes in 2006 to 120 MB in the middle of 2010 (Segalovich 2010), and to 280 MB in September 2010 (Raskovalov 2010). To put things in perspective, that is approximately 50 times the size of the complete works of William Shakespeare.11
Figure 2.1. The growing complexity of Yandex ranking formulas (data from Raskovalov 2010; Segalovich 2010). Note that the vertical scale (formula size in kilobytes) is logarithmic.
10 To allow the results to converge and the algorithm to terminate, the hub and authority values are usually normalized after every iteration.
11 Uncompressed digital version available at Project Gutenberg (http://www.gutenberg.org/ebooks/100).