On Search Engine Evaluation Metrics. Inaugural dissertation for the degree of Doctor of Philosophy (Dr. Phil.), submitted to the Philosophische ...
One more study which examined the relationship between explicit relevance ratings and user satisfaction has been conducted at Google (Huffman and Hochster 2007). It used 200 random queries from those posed to the search engine, asking raters to infer probable information needs and to provide further information on the query (e.g. navigational versus informational versus transactional). Then, one set of raters was presented with the information need and the Google result list for the query, and asked to perform a normal search. After the user indicated he was done, an explicit satisfaction rating was solicited. A second group of raters was asked to give relevance ratings for the first three documents of each result list. A simple average precision at rank 3 had a very strong Pearson correlation of 0.73 with user satisfaction, which is quite a high number for so simple a metric. A higher correlation was observed for highly relevant results; this means that with really good results in the top 3 ranks, the user will not be unsatisfied. This is not really surprising; however, we have seen that studies do not always support intuitions, so the results are valuable. By manipulating the metric (in ways not precisely specified), the authors managed to increase correlation to over 0.82.
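The paper does not spell out the exact formula behind its metric, so the following is only an illustrative sketch: average precision over binary judgments of the top three results, plus the Pearson correlation used to compare metric values against satisfaction ratings. The normalization by the number of relevant results found is one common convention, not necessarily the authors' choice, and the sample data is invented.

```python
# Illustrative sketch, not the authors' code: average precision at rank k
# over binary relevance judgments, and the Pearson correlation between
# per-query metric values and satisfaction ratings.

def average_precision_at_k(relevances, k=3):
    """relevances: 0/1 judgments for ranks 1..n; returns AP over the top k."""
    hits = 0
    precisions = []
    for rank, rel in enumerate(relevances[:k], start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    # Normalizing by the number of relevant results seen is one convention;
    # the paper does not specify which variant was used.
    return sum(precisions) / max(hits, 1)

def pearson(xs, ys):
    """Plain Pearson product-moment correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented example: three queries with judged top-3 results and
# per-query satisfaction ratings.
ap_values = [average_precision_at_k(r) for r in ([1, 0, 1], [1, 1, 1], [0, 0, 1])]
satisfaction = [3, 5, 2]
print(pearson(ap_values, satisfaction))
```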
There was also an intermediate condition providing, for different topics, results starting at different ranks, from 1 to 300; the findings for this treatment were mixed, and will not be detailed here.
There might be multiple reasons why the Huffman and Hochster (2007) study finds a strong correlation between precision and user satisfaction while most other studies do not. First, the queries used and the tasks posed were those of real search engine users, something that other studies rarely featured. Second, the relevance considered was only that of the first three results. Studies have shown that on the web, and especially at Google, the first three or four results are the most often viewed and consequently the most important ones (Hotchkiss, Alston and Edwards 2005). This might mean that the relevance ratings of lower-ranked results do not play an important role for many queries, and as such only complicate the overall picture. It would have been interesting to see how the correlation changed for average precision at, say, 10 results; that is something that will be addressed in the evaluation performed in Part II.
Figure 4.4: Session satisfaction vs. first-query relevance (modified from Huffman and Hochster 2007, p. 568). Crosses mark misspellings. The red circles indicate data points which do not have any further points with higher relevance and lower satisfaction; for all other points, a positive change in relevance might, even within the data set of that study, come with a decline in satisfaction.
Perhaps more importantly, one has to keep in mind that correlation is not everything. The models described by Huffman and Hochster may help to provide an answer to the question “Will the user be satisfied with his search session, given certain relevance rankings?”. But they are probably significantly worse at another task: judging whether a certain change in indexing or ranking procedures improves performance. The authors write: “Relevance metrics are useful in part because they enable repeated measurement. Once results for an appropriate sample of queries are graded, relevance metrics can be easily computed for multiple ranking algorithms” (Huffman and Hochster 2007, p. 571). However, if we look at the relevance-satisfaction plot (Figure 4.4), we see that there are at most 14 data points (circled in red) which do not have further results to the right of and below them; that is, points for which increased relevance cannot correspond to decreased satisfaction. If we discard misspellings, this means that fewer than 10% of all data points in this sample cannot actually be worsened by
a result set with a higher relevance score. Without access to the raw data of the study, it is difficult to tell how well these models predict exact satisfaction values given a relevance rating and – related to that and important in a practical sense – the direction of change in satisfaction given a certain (usually relatively small) change in the relevance ranking. These questions will also be followed up in Part II, where they will be answered based on new data.
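The “circled in red” criterion described above, points with no other point at higher relevance and lower satisfaction, amounts to a simple dominance check; the data below is made up and does not come from the study.

```python
# Hypothetical sketch of the criterion described above: keep a
# (relevance, satisfaction) point only if no other point has strictly
# higher relevance together with strictly lower satisfaction.

def undominated(points):
    """points: list of (relevance, satisfaction) pairs; invented data."""
    return [
        (r, s) for (r, s) in points
        if not any(r2 > r and s2 < s for (r2, s2) in points)
    ]

data = [(0.2, 0.5), (0.6, 0.4), (0.9, 0.9), (0.8, 0.2)]
print(undominated(data))  # [(0.9, 0.9), (0.8, 0.2)]
```

For every excluded point, some result set with a higher relevance score came with lower satisfaction within the sample.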
Similar results were also produced in a Yahoo study aimed mainly at goal success determination from user behavior (Hassan, Jones and Klinkner 2010). The average precision of the top 3 results had an 80% accuracy in predicting goal success, while DCG offered 79% (here, the considered rank was not disclosed). A Markov Model trained on query reformulation and click data provided 85% accuracy. However, one has to note that this study differed from most others in that it considered success as a function of sessions instead of the more usual queries. Also, success was estimated not by the originators of the information need but by company-internal editors.
The measures described in this section are not acquired by asking the user, as was the case with explicit ones. Instead, they are calculated using search engine logs. Logs are documents automatically created by the search engine’s servers which record user actions. Typical contents of a log are a user ID, a session ID, the query, the result list and a timestamp.
The user ID is an identifier which allows tracking the user’s progress through different sessions. In the log’s original state, this is normally the user’s IP address. The session ID is a limited form of a user ID; its precise form depends on the definition of “session” used. This may range from a single query to a timeframe of many hours, during which the user submits multiple queries. The query is just the query in the form submitted by the user; it is paired with the result list. The result list contains not just the URLs of the results offered to the user;
it also shows which of the results have been clicked on. Finally, the timestamp provides information on the exact moment of the query, and – if available – the time of each individual click (Croft, Metzler and Strohman 2010). It is also important to note what the logs of a search engine do not usually contain, not least because much of the research on implicit metrics has used features not readily available. Some of these features are viewing time for a document (“How long did the user stay on a web page he found through his search?”), the amount of scrolling (“How far down the document did the user go?”), or the bookmarking or printing of a document (Kelly and Teevan 2003). Also, the logs do not provide any information as to the users’ information needs, a topic that will be discussed in Chapter 7.
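The fields listed above can be pictured as a record like the following; the names and types are assumptions for illustration, not any actual search engine’s log schema.

```python
# Sketch of one log record with the typical fields named above;
# field names and types are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class LogEntry:
    user_id: str       # in the log's original state, usually the IP address
    session_id: str    # depends on the definition of "session" used
    query: str         # the query as submitted by the user
    results: list      # URLs offered to the user, in rank order
    clicked: list      # which of the results were clicked on
    timestamp: float   # moment of the query; click times may also be logged

entry = LogEntry("198.51.100.7", "session-42", "evaluation metrics",
                 ["url-a", "url-b", "url-c"], ["url-b"], 1189000000.0)
print(entry.clicked)  # ['url-b']
```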
It is especially the click data from the result list that provides the basis for analysis. Different explanations have been proposed as to how exactly the click data corresponds to user preferences, and some of those explanations will be presented in this section. As clicks and other log data reflect user behavior, a user model is indispensable in analyzing the data;
therefore, those measures lean towards the user-based camp.
Some relatively straightforward metrics are being used for internal evaluations at Google.
While not much is known about the details, at least two metrics have been mentioned in interviews: average click rank and average search time (Google 2011d). The smaller (that is, higher up in the result list) the average click rank, and the less time a user spends on the search session, the better the result list is assumed to be. However, the company stops short of providing any clues on how exactly the evaluation is conducted or any evidence they have that these metrics, intuitive as they are, actually represent user preferences.
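Since the definitions have not been disclosed, the following is only one plausible reading of the two metrics: a mean over all clicked ranks and a mean over session durations, computed from hypothetical logs.

```python
# Plausible sketches of the two interview-mentioned metrics; Google's
# actual definitions are not public, so these formulas are assumptions.

def average_click_rank(click_ranks_per_session):
    """click_ranks_per_session: list of lists of clicked ranks (1-based)."""
    ranks = [rank for session in click_ranks_per_session for rank in session]
    return sum(ranks) / len(ranks)

def average_search_time(session_spans):
    """session_spans: (start, end) timestamps per session, in seconds."""
    return sum(end - start for start, end in session_spans) / len(session_spans)

print(average_click_rank([[1, 3], [2]]))        # 2.0
print(average_search_time([(0, 90), (0, 30)]))  # 60.0
```

Under these readings, lower values of both metrics would be taken as evidence of a better result list.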
A general difficulty with interpreting click data as relevance judgments is the absence of any non-click data. While a click on a result can intuitively be interpreted as a positive judgment on the quality of the document, a non-click has little informative value. The user may not
have liked the result, or just not seen it. One solution is to relinquish absolute relevance judgments in favor of relative ones (Joachims 2002). The most common way is assuming that a user tends to scan the result list from top to bottom.36 In this case, all results up to the lowest-ranked click he makes can be considered to have been examined. The logical next step is to assume that every examined but not selected result has been judged by the user to be less relevant than any later result that has been clicked. This model, called “click skip above” by Joachims, does not make any statements about the relative relevance of clicked results among themselves, or between the non-clicked results among themselves. The preference of selected documents over non-selected but presumably considered ones is the only assertion it makes.
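Under the top-to-bottom scanning assumption, the model can be sketched as extracting preference pairs: each clicked result is preferred over every unclicked result ranked above it. The representation below is an illustration, not Joachims’ code.

```python
# Illustrative extraction of relative preferences from one result list:
# each clicked result is preferred over every higher-ranked unclicked one.

def click_skip_above(results, clicked):
    """results: URLs in rank order; clicked: set of clicked URLs.
    Returns (preferred, less_preferred) pairs."""
    pairs = []
    for i, res in enumerate(results):
        if res in clicked:
            for skipped in results[:i]:
                if skipped not in clicked:
                    pairs.append((res, skipped))
    return pairs

print(click_skip_above(["a", "b", "c", "d"], {"c"}))
# [('c', 'a'), ('c', 'b')] -- no statement about d, or about a vs. b
```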
This is not the only possible interpretation of click data. Joachims, Granka et al. (2005) also test further evaluation (or re-ranking) strategies. Assuming that the user learns from results he sees, later clicks would be made on the basis of more available information and thus be more meaningful. A possible expression of this is the “last click skip above” measure, which considers the result last clicked on 37 to be more relevant than all higher-ranked unclicked results. A stronger version of the same assumption is “click earlier click”, interpreting the second of any two clicks as the more relevant (the rationale is that if a highly relevant result has been encountered, it does not pay to click on a less relevant result afterwards). Lastly, noticing that eye-tracking results show that users are likely to examine the abstracts directly above or below a click, they introduce the “click skip previous” and the “click no-click next” measures.
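Two of these variants can be sketched in the same style; click order stands in for the timestamps, and the function names merely paraphrase the measures described above.

```python
# Illustrative sketches of two variants; clicks are given in time order.

def last_click_skip_above(results, clicks_in_time_order):
    """Prefer the last click over all unclicked results ranked above it."""
    if not clicks_in_time_order:
        return []
    last = clicks_in_time_order[-1]  # last in time, not necessarily lowest-ranked
    clicked = set(clicks_in_time_order)
    above = results[:results.index(last)]
    return [(last, res) for res in above if res not in clicked]

def click_earlier_click(clicks_in_time_order):
    """Prefer every click over each click made before it."""
    return [(later, earlier)
            for i, earlier in enumerate(clicks_in_time_order)
            for later in clicks_in_time_order[i + 1:]]

print(last_click_skip_above(["a", "b", "c", "d"], ["d", "b"]))  # [('b', 'a')]
print(click_earlier_click(["d", "b"]))                          # [('b', 'd')]
```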
The authors also compared their measures to explicit ratings collected for the same data. The results are mixed; the correlations for the “click skip above”, “last click skip above” and “click skip previous” measures range from 80% to 90% compared to explicit ratings of the descriptions in the result list, and from 78% to 81% for the ratings of the pages themselves.
The values for “click earlier click” and “click no-click next” are about 15% lower. On the one hand, the values are quite high, especially considering that the inter-judge agreement (and with it the reasonable upper bound for possible correlation) was at just 83% to 90%. On the other hand, the variances were very large, from 5% up to 25%. Also, the correlation of the implicit measures with explicit
ratings varied strongly between two phases of the study, suggesting a worrying lack of result stability.
All of the described metrics are based on the comparison between clicked and unclicked results. They assume that a click reflects a high quality of the document, while a non-click on a supposedly considered document reflects low or at least lower quality. This approach has, apart from being once again focused on individual results, two main conceptual problems.
First, all described methods – apart from “click no-click next”, which has a relatively low correlation with explicit ratings – can only indicate a preference of a lower-ranked result over a higher-ranked one, and “such relevance judgments are all satisfied if the ranking is reversed, making the preferences difficult to use as training data” (Radlinski and Joachims 2006, p. 1407). Thus, the ranking of a result list is never vindicated, and a reversal of a result list can only result in an improved (or at least equally good) result list, as measured by those metrics.

36 Though some studies reported a significant minority (up to 15% of users in a test group) employing “an extreme breadth-first strategy, looking through the entire list [of 25 results] before opening any document” (Klöckner, Wirschum and Jameson 2004, p. 1539), and another group preferring a mixed strategy, looking at a few results before selecting one.

37 As it is determined by the timestamp, this is not necessarily the lowest-ranked clicked result.
Secondly, clicks are not based on document quality, but at best on the quality of document presentations shown in the result list. The document itself may be irrelevant or even spam, but if the title, snippet or URL is attractive enough to induce a click, it is still considered to be an endorsement.
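The first of these problems, the reversal property, can be demonstrated in a few lines; the data is invented.

```python
# Demonstration of the reversal problem: "click skip above"-style pairs
# derived from a click on c are violated by the original ranking, but all
# satisfied once the list is reversed.

def satisfied(pairs, ranking):
    """True if every (preferred, less_preferred) pair is ordered correctly."""
    return all(ranking.index(win) < ranking.index(lose) for win, lose in pairs)

results = ["a", "b", "c", "d"]
pairs = [("c", "a"), ("c", "b")]  # preferences from a single click on c

print(satisfied(pairs, results))                  # False
print(satisfied(pairs, list(reversed(results))))  # True
```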
Dupret and Liao (2010) suggest a metric which does not compare clicked to non-clicked results. Instead, the question the method addresses is whether, after clicking on a result, the user returns to the result list to proceed with his search. If he does not, the document is assumed to have satisfied (together with previously clicked documents) the user’s information need. This assumption, as the authors point out, is unrealistic (the user could have abandoned his search effort) – but it provides some practical benefits. The method is not used for creating explicit relevance predictions; but if a function maximizing the a posteriori likelihood is added as a feature to the ranking system of “a leading commercial search engine”,38 the NDCG – as compared to explicit relevance judgments – grows by 0.06% to 0.76%, depending on the cutoff value and type of query. This seems to be an improvement over a presumably already sophisticated effort; however, the order of magnitude is not large, and there is a notion that “a gain of a few percent in mean average precision is not likely to be detectable by the typical user” (Allan, Carterette and Lewis 2005, p. 433) which presumably also applies to NDCG.
38 The authors work for Yahoo.
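As a reference point for the NDCG figures above: the cited study does not state its exact gain and discount functions, so this sketch uses the common exponential gain with a log2 rank discount.

```python
# Common (N)DCG formulation; the gain 2^rel - 1 and the log2 discount are
# assumptions here, since the study's exact variant is not disclosed.
import math

def dcg_at_k(relevances, k):
    """relevances: graded judgments in rank order; k: cutoff value."""
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 1], 3))  # 1.0 -- already ideally ordered
print(ndcg_at_k([1, 2, 3], 3))  # below 1.0
```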