On Search Engine Evaluation Metrics. Inaugural dissertation for the degree of Doctor of Philosophy (Dr. Phil.), submitted to the Philosophische ...
Many major search engines offer other types of results than the classical web page list – images, videos, news, maps and so forth. However, the present work only deals with the document lists. As far as I know, there have been no attempts to provide a single framework for evaluating all aspects of a search engine, and such an approach seems unfeasible. There are also other features I will have to ignore in order to present a concise case: for example, spellcheckers and query suggestions, question-answering features, relevant advertising, and so on. I do not attempt to discuss the usefulness of the whole range of output a search engine is capable of producing; when talking of “result lists”, I will refer only to the organic document listing that is still the search engines’ stock-in-trade.
This is as much depth as we need to consider search evaluation and its validity; for a real
introduction to search engines, I recommend the excellent textbook “Search Engines:
Information Retrieval in Practice” (Croft, Metzler and Strohman 2010).
2.2 Search Engine Usage

For any evaluation of search engine performance it is crucial to understand user behavior.
This is true at the experimental design stage, when it is important to formulate a study object which corresponds to something the users do out there in the real world, as well as in the evaluation process itself, when one needs to know how to interpret the gathered data.
Therefore, I will present some findings that attempt to shed light on the question of actual search engine usage. This data comes mainly from two sources: log data evaluation (log data is discussed in Chapter 5) and eye-tracking studies. Most other laboratory experiment types, as well as questionnaires, interviews, and other explicit elicitations, have rarely been used in recent years.12 Unfortunately, search engine logs have not been made available to the scientific community as often in recent years as they were ten years ago. The matter of privacy alone, which is in any case hotly debated by the interested public, is surely enough to make a search engine provider think twice before releasing potentially sensitive data.13 What, then, does this data show? The users’ behavior is unlike that seen in classical information retrieval systems, where searches were mostly performed by intermediaries or other information professionals skilled at retrieval tasks and familiar with the database, query language and so forth. Web search engines aim at the average internet user or, more precisely, at any internet user, whether new to the web or a seasoned Usenet veteran.
For our purposes, a search session starts with the user submitting a query. The query usually contains two to three terms (see also Figure 2.2), and average query length has been slowly increasing over the period during which web search has been observed (Jansen and Spink 2006; Yandex 2008). The average query has no operators; the only one used with any frequency is the phrase operator (most major engines use quotation marks for it), which occurs in under 2% of queries, with 9% of users issuing at least one query containing operators (White and Morris 2007).14 Queries are traditionally divided into navigational (the user looks for a specific web page known or supposed to exist), transactional (not surprisingly, in this case the user looks to perform a transaction like buying or downloading) and informational, which should need no further explanation (Broder 2002). While this division is by no means final, and will be modified after an encounter with the current state of affairs in Section 9.2, it is undoubtedly useful (and widely used). Broder’s estimate of the query type distribution is that 20-25% of queries are navigational, 39-48% informational, and 30-36% transactional. A large-scale study with automatic classification found a much larger share of informational queries (around 80%), with navigational and transactional queries at about 10% each (Jansen, Booth and Spink 2007).15 Another study with similar categories (Rose and Levinson 2004) also found informational queries to be more frequent than originally assumed by Broder (62-63%), and the other two types less frequent (navigational queries 12-15%, transactional queries 22

Figure 2.2. Query length distribution (from Yandex 2008, p. 2). Note that the data is for Russian searches.

Footnote 12: This is probably due to multiple reasons: the availability (at least for a while) of large amounts of log data; the difficulty of inferring the behaviour of millions of users from a few test persons (who are rarely representative of the average surfer); and the realization that explicit questions may induce the test person to behave unnaturally.

Footnote 13: The best-known case is probably AOL’s release for research purposes of anonymized log data from 20 million queries made by about 650,000 subscribers. Despite the anonymization, some users were soon identified (Barbaro and Zeller Jr. 2006). Its attempt at public-mindedness brought AOL not only an entry in CNN’s “101 Dumbest Moments in Business” (Horowitz et al. 2007), but also a Federal Trade Commission complaint by the Electronic Frontier Foundation (2006) and a class-action lawsuit which was settled out of court for an undisclosed amount (Berman DeValerio 2010).
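Broder’s taxonomy can be made concrete with a toy rule-based classifier. The cue lists and rules below are invented purely for illustration; real classifiers such as the one used by Jansen, Booth and Spink (2007) rely on far richer features and log evidence.

```python
# Toy illustration of Broder's query taxonomy: navigational /
# transactional / informational. Cue lists are invented examples.

TRANSACTIONAL_CUES = {"buy", "download", "order", "price", "free"}
NAVIGATIONAL_CUES = {"www", "homepage", "login", ".com", ".org"}

def classify_query(query: str) -> str:
    """Assign a query to one of Broder's three classes (toy heuristic)."""
    terms = query.lower().split()
    # Navigational cues may appear inside a term, e.g. "www.example.com".
    if any(cue in term for term in terms for cue in NAVIGATIONAL_CUES):
        return "navigational"
    if any(term in TRANSACTIONAL_CUES for term in terms):
        return "transactional"
    # Everything else defaults to the largest class.
    return "informational"

print(classify_query("download firefox"))       # transactional
print(classify_query("www.example.com login"))  # navigational
print(classify_query("causes of inflation"))    # informational
```

A classifier of this kind inevitably defaults ambiguous queries to the informational class, which is one reason automatic studies report such a large informational share.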
Once the user submits his query, he is presented with the result page. Apart from advertising, “universal search” features and other elements mentioned in Section 2.1, it contains a result list with usually ten results. Each result consists of three parts: a page title taken directly from the page’s title tag; a so-called “snippet”, which is a query-dependent extract from the page; and a URL pointing to the page itself. When the result page appears, the user tends to start by scanning the first-ranked result; if this seems promising, the link is clicked and the user (at least temporarily) leaves the search engine and its hospitable result page. If the first snippet seems irrelevant, or if the user returns for more information, he proceeds to the second snippet, which he processes in the same way. Generally, if the user selects a result at all, the first click falls within the first three or four results. The time before this first click is, on average, about 6.5 seconds; in this time, the user catches at least a glimpse of about 4 results (Hotchkiss, Alston and Edwards 2005).17 As Thomas and Hawking note, “a quick scan of part of a result set is often enough to judge its utility for the task at hand. Unlike relevance assessors, searchers very seldom read all the documents retrieved for them by a search engine” (Thomas and Hawking 2006, p. 95). There is a sharp drop in user attention after rank 6 or 7, caused by the fact that only these results are directly visible on the result page without scrolling.18 The remaining results up to the end of the page receive approximately equal amounts of attention (Granka, Joachims and Gay 2004; Joachims et al. 2005). If a user scrolls to the end of the result page, he has the opportunity to move to the next one, which contains further results; however, this option is used very rarely – fewer than 25% of users visit more than one result page,19 the others confining themselves to the (usually ten) results on the first page (Spink and Jansen 2004).

Footnote 14: For the 13 weeks during which the data was collected. A study of log data collected 7-10 years earlier had significantly higher numbers of queries using operators, though they were still under 10% (Spink and Jansen 2004).

Footnote 15: While the study was published in 2007, the log data it is based on is at least five years older, and partly originates in the late 1990s.

Footnote 16: Note that all cited statistics are for English-speaking users. The picture may or may not be different for other languages; a study using data from 2004 showed no fundamental differences in queries originating from Greece (Efthimiadis 2008). A study of Russian log data additionally showed 4% of queries containing a full web address, which would have been sufficient to reach the page via the browser’s address bar instead of the search engine. It also found that about 15% of all queries contained some type of mistake, with about 10% being typos (Yandex 2008).
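The top-down scanning behaviour described above can be sketched as a toy cascade model: the user examines snippets from rank 1 downward and clicks the first one that looks promising, scanning at most a handful of results. All probability values below are invented placeholders, not estimates from any of the cited studies.

```python
import random

def simulate_session(attractiveness, patience=7, rng=random):
    """Return the 1-based rank clicked, or None if the user gives up.

    attractiveness[i] is the (assumed) probability that the snippet at
    rank i+1 looks promising enough to click; patience caps how many
    snippets the user scans (roughly what is visible without scrolling).
    """
    for rank, p in enumerate(attractiveness[:patience], start=1):
        if rng.random() < p:
            return rank
    return None  # abandoned: no result on this page was clicked

rng = random.Random(42)
# Invented per-rank attractiveness values, decaying with rank.
probs = [0.45, 0.15, 0.10, 0.08, 0.05, 0.04, 0.03, 0.02, 0.02, 0.01]
clicks = [simulate_session(probs, rng=rng) for _ in range(10_000)]
first = sum(1 for c in clicks if c == 1) / len(clicks)
print(f"share of sessions clicking rank 1: {first:.2f}")
```

Even this crude sketch reproduces the qualitative pattern from the studies cited above: most first clicks fall on the top-ranked results, and later ranks are reached only when everything above them was judged unpromising.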
Figure 2.3. Percentage of times an abstract20 was viewed/clicked depending on the rank of the result (taken from Joachims et al. 2005, p. 156). Data is based on original and manipulated Google results, in an attempt to eliminate quality bias. Note that the study was laboratory-based, not log-based.
Footnote 17: Turpin, Scholer et al. (2009) mention 19 seconds per snippet; however, this is the time needed to consciously judge the result and produce an explicit rating. When encountering a result list in a “natural” search process, users tend to spend much less time deciding whether to click on a result.
Footnote 18: Obviously, this might not be the case for very high display resolutions; however, this does not invalidate the observations.
Footnote 19: Given the previous numbers in this section, a quarter of all users going beyond the first result page seems like a lot. The reason probably lies in the different sources of data in the various studies; in particular, the data used by Spink and Jansen (2004) mostly stems from the 1990s.
Footnote 20: “Abstract” and “snippet” have the same meaning in this case.
The number of pages a user actually visits is also not high. Spink and Jansen (2004) estimate it at 2 or 3 per query; and only about one in 35,000 users clicks on all ten results on the first result page (Thomas and Hawking 2006). This fact will be very important when we discuss the problems with traditional explicit measures in Chapter 4. Most of those clicks are on results in the very highest ranks; Figure 2.3 gives an overview of fixation (that is, viewing one area of the screen for ca. 200-300 milliseconds) and click rates for the top ten results. Note that the click rate falls much faster than the fixation rate: while rank 2 gets almost as much attention as rank 1, it is clicked less than a third as often.
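The gap between attention and clicks can be made explicit with a small calculation. The fixation and click percentages below are rounded placeholders in the spirit of Figure 2.3, not the actual values reported by Joachims et al. (2005).

```python
# Illustrative (invented) per-rank fixation and click rates for the
# top three ranks, loosely mimicking the shape reported in eye-tracking
# studies: attention decays slowly, clicks decay sharply.
fixation = {1: 0.90, 2: 0.80, 3: 0.50}
clicks = {1: 0.45, 2: 0.12, 3: 0.08}

for rank in fixation:
    # Conditional click rate: of the users who looked at this rank,
    # how many clicked it?
    ctr_given_seen = clicks[rank] / fixation[rank]
    print(f"rank {rank}: fixated {fixation[rank]:.0%}, "
          f"clicked {clicks[rank]:.0%}, "
          f"click rate among viewers {ctr_given_seen:.0%}")
```

With numbers of this shape, rank 2 receives nearly the same attention as rank 1 (0.80 vs. 0.90) yet well under a third of its clicks (0.12 vs. 0.45), which is precisely the asymmetry the eye-tracking studies highlight.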
The evaluation of a web search engine has the same general goal as that of any other retrieval system: to provide a measure of how good the system is at providing the user with the information he needs. This knowledge can then be used in numerous ways: to select the most suitable of available search engines, to learn about the users or the data searched upon, or – most often – to find weaknesses in the system and ways to eliminate them.
However, web search engine evaluation also has the same potential pitfalls as its more general counterpart. The classical overview of an evaluation methodology was developed by Tague-Sutcliffe (Tague 1981; Tague-Sutcliffe 1992). The ten points made by her have been extremely influential, and we will now briefly recount them, commenting on their relevance to the present work.
The first issue is “To test or not to test”, which bears more weight than it seems at first glance. Obviously, this step includes reviewing the literature to check whether the questions asked already have answers. However, before this can be done, one is forced to actually formulate the question. The task is anything but straightforward. “Is that search engine good?” is not a valid formulation; one has to be as precise as possible. If the question is “How many users with navigational queries are satisfied by their search experience?”, the whole evaluation process will be very different than if we ask “Which retrieval function is better at providing results covering multiple aspects of a legal professional’s information need?”. These questions are still not as detailed as they will need to be when the concrete evaluation methods are devised in the next steps; but they show the minimal level of precision required to even start thinking about evaluation. It is my immodest opinion that quite a few of the studies presented in this chapter would have profited from providing exact information on what it is they attempt to quantify. As it stands, neither the introduction of new nor the evaluation of existing metrics is routinely accompanied by a statement on what precisely is to be captured. There are some steps in the direction of a goal definition, but these tend to refer to behavioral observations ("in reality, the probability that a user browses to some position in the ranked list depends on many other factors other than the position alone", Chapelle et al.
2009, p. 621) or general theoretical statements ("When examining the ranked result list of a query, it is obvious that highly relevant documents are more valuable than marginally relevant", Järvelin and Kekäläinen 2002, p. 424). To be reasonably certain to find an explicit phenomenon to be measured, one needs to turn to the type of study discussed in Chapter 6, which deals precisely with the relationship of evaluation metrics and the real world.
The second point mentioned by Tague-Sutcliffe concerns the decision on what kind of test is to be performed, the basic distinction being that between laboratory experiments and operational tests. In laboratory experiments, the conditions are controlled for, and the aim of the experiment can be addressed more precisely. Operational tests run with real users on real systems, and thus can be said to be generally closer to “real life”. Of course, in practice the distinction is not binary but gradual; still, explicit measures (Chapter 4) tend to be obtained from laboratory experiments, while implicit measures (Chapter 5) often come from operational tests.
The third issue “is deciding how actually to observe or measure […] concepts – how to operationalize them” (Tague-Sutcliffe 1992, p. 469). This means deciding which variables one controls for, and which are going to be measured. The features of the database, the type of user to be modeled, the intended behavioral constraints for the assessor (e.g. “Visit at least five results”), and – directly relevant to the present work – what and how to evaluate are all questions for this step.
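The operationalization decisions of this third step can be usefully pinned down in an explicit, machine-readable form before a study is run. The sketch below is one hypothetical way to do so; all field names and example values are invented for illustration and do not come from Tague-Sutcliffe's methodology itself.

```python
from dataclasses import dataclass, field

@dataclass
class StudyDesign:
    """A hypothetical record of operationalization decisions."""
    research_question: str
    controlled: dict = field(default_factory=dict)   # variables held fixed
    measured: list = field(default_factory=list)     # variables recorded
    assessor_constraints: list = field(default_factory=list)

design = StudyDesign(
    research_question=("How many users with navigational queries are "
                       "satisfied by their search experience?"),
    controlled={"query_type": "navigational", "engine": "engine A"},
    measured=["clicked_rank", "time_to_first_click"],
    assessor_constraints=["Visit at least five results"],
)
print(design.research_question)
```

Writing the design down this explicitly forces exactly the precision argued for above: a vague question like “Is that search engine good?” cannot even be encoded, because it names nothing to control, measure, or constrain.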
The fourth question, what database to select, is not very relevant for us; mostly, either one of the popular web search engines is evaluated, or the researcher has a retrieval method of his own whose index is filled with web pages, of which there is no shortage. Sometimes, databases assembled by large workshops such as TREC21 or CLEF22 can be employed.