On Search Engine Evaluation Metrics
Inaugural-Dissertation zur Erlangung des Doktorgrades der Philosophie (Dr. Phil.) durch die Philosophische ...
each other so as to avoid duplicate information. In short, MMR finds, at each rank, the document for which the difference between query similarity (topicality) and similarity to already retrieved documents (repetition of information) is maximal. A parameter can be used to assign relative importance to the two concepts; if it is set to 1, the novelty factor disappears, and MMR is reduced to a standard topicality measure. If it is set to 0, topicality is disregarded, and novelty remains the only factor. The concept of marginal relevance is especially interesting since it neatly encompasses the economic notion that only new information is useful (Varian 1999). MMR was conceived as a ranking function, but it can easily be adapted for evaluation. However, since it requires a sophisticated similarity measurement, it will not be employed in the study presented in Part II (and I will spare you a complicated formula).
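For readers who do want the gist, the greedy selection rule behind MMR can be sketched in a few lines (an illustrative reconstruction, not the original formulation; sim_query and sim_doc are placeholder similarity tables standing in for a real similarity measurement such as cosine similarity):

```python
def mmr_rank(docs, sim_query, sim_doc, lam=0.7):
    """Greedily order docs by marginal relevance.

    lam = 1.0 reduces MMR to a pure topicality ranking;
    lam = 0.0 leaves novelty as the only factor.
    """
    selected = []
    remaining = list(docs)
    while remaining:
        def score(d):
            topicality = sim_query[d]
            # Redundancy: highest similarity to any already selected document.
            redundancy = max((sim_doc.get((d, s), 0.0) for s in selected),
                             default=0.0)
            return lam * topicality - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With toy similarities where documents "a" and "b" are near-duplicates, MMR demotes "b" below a less topical but novel "c", while lam=1.0 restores the plain topicality order.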
MMR’s approach is to start from first principles; an alternative is to take existing measures and expand them by adding a novel user model. One such measure is α-NDCG, which adds to the (normalized) DCG score the concept of users’ information needs as well as details of novelty and relevance ratings (Clarke et al. 2008). It employs the concept of an “information nugget”, which has gained acceptance in several IR subfields (Dang, Lin and Kelly 2006). In the context of α-NDCG, an “information nugget” denotes any property of a document that might be relevant to the user. Thus, a document can contain none, one, or multiple useful information nuggets. The metric then proceeds with some quite complex calculations, determining the probability that a user will find nuggets in the current document, as well as the probability, for each nugget, that it has already been encountered during the search session.
These two probabilities then determine the gain at the current rank, which is handled similarly to the gains in usual NDCG. Once again, it is too complex for the present evaluation, and is mentioned mainly because it shows the sophistication of user models that can be employed by current metrics.
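To make the nugget-based gain more concrete, the following is a much simplified sketch of the α-DCG computation (a reduced reading of Clarke et al. 2008: documents are represented directly as sets of nuggets, and both the full probabilistic machinery and the normalization by an ideal ranking are omitted):

```python
import math

def alpha_dcg(ranked_nugget_sets, alpha=0.5):
    """Simplified alpha-DCG: each repeat sighting of a nugget is
    discounted by (1 - alpha); gains are then rank-discounted as in DCG."""
    seen = {}          # nugget -> number of earlier documents containing it
    score = 0.0
    for rank, nuggets in enumerate(ranked_nugget_sets, start=1):
        gain = sum((1 - alpha) ** seen.get(n, 0) for n in nuggets)
        for n in nuggets:
            seen[n] = seen.get(n, 0) + 1
        score += gain / math.log2(rank + 1)
    return score
```

A list that merely repeats the same nugget thus scores lower than one that covers new nuggets at each rank, which is exactly the novelty behavior described above.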
An example of a novel evaluation metric with a different agenda is the Expected Reciprocal Rank (ERR). Whereas α-NDCG pays particular attention to the diversity and novelty of the information nuggets, ERR focuses on the cumulative probability that the documents seen so far have been useful enough to satisfy the user’s information need. In practice, this means examining whether the documents earlier in the result list have been useful; if so, the value of a new document is discounted accordingly (Chapelle et al. 2009).
Formula 4.12 (ERR):

ERR = Σ_{r=1..n} (1/r) · R_r · Π_{i=1..r−1} (1 − R_i), with R_i = (2^{g_i} − 1) / 2^{g_max}

For each rank r, the probabilities of a relevant result at each earlier rank i are inverted (1 − R_i) and multiplied; the product is used as a damping factor for the gain at the current rank.
The calculation of ERR shown in Formula 4.12 has a structure similar to that of traditional measures like MAP: a summation over document ratings g_i (mapped to probabilities in such a way that even the best-rated result leaves a slight chance of not satisfying the user completely) and a rank-based discounting function 1/r.32 The novel part is that every addend is also discounted by a factor indicating how likely it is that the user has already found enough information not to be interested in the current result; this factor is the product of the previous results’ weights, each subtracted from 1, so that better earlier results produce a higher discount.
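For concreteness, the ERR computation can be written out in a few lines (the mapping from graded ratings g to satisfaction probabilities, R = (2^g − 1) / 2^{g_max}, follows Chapelle et al. 2009):

```python
def err(gains, g_max):
    """Expected Reciprocal Rank over a list of graded relevance ratings."""
    score = 0.0
    p_reached = 1.0   # probability the user is still searching at this rank
    for rank, g in enumerate(gains, start=1):
        r_prob = (2 ** g - 1) / 2 ** g_max   # chance this result satisfies
        score += p_reached * r_prob / rank
        p_reached *= 1 - r_prob              # discount for later ranks
    return score
```

Note that even a perfect result (g = g_max) yields R = (2^{g_max} − 1)/2^{g_max} < 1, so some probability mass always flows on to later ranks, as described above.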
4.4 General Problems of Explicit Metrics

There are some problems common to all the explicit ratings described above. This section provides an overview and indicates where solutions may be found.
Firstly, user judgments tend to be hard to obtain, as they require considerable resources. For the quantities of judgments needed for study results to be stable and significant, these resources may surpass the capabilities of single research groups. This is less of a problem for businesses and for large coordinated efforts such as TREC. However, commercial research is not always made publicly available; and the amount of work put into every large-scale evaluation such as TREC means there is likely to be a limited number of methods for which user judgments will be provided, and those methods are likely to be well-established and already tested.
The second issue is the one partially addressed by user-based metrics. Explicit relevance judgments are almost always collected for individual results, and then combined in a certain way to deduce a measure of user satisfaction. The raters do answer the questions asked of them; but it is important to know whether those questions really relate to the point at stake. This is the downside of the precision advantage offered by explicit user judgments: the results are focused, but the focus may be too narrow. It has been argued, for example, that the concentration on individual results is counterproductive and should be replaced, or at least supplemented, by methods taking into account the user’s interaction with the result list as a whole (e.g. Spink 2002; for an overview, see Mandl 2008). Indeed, there are studies suggesting that discounting for rank, and even for previously encountered relevant documents, does not always alleviate the problems.
One such study tested four models of user behavior as related to the position bias of search engines (Craswell et al. 2008). Among them are a baseline model (no positional bias at all; clicks are based only on snippet quality), an examination model (the probability of a click depends on the position-based chance that the snippet is examined, combined with snippet quality), and a cascade model (clicks depend on the attractiveness of the previous snippets and on the snippet of the examined document).
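The cascade model in particular is simple enough to sketch directly: the user is assumed to scan top-down and click the first snippet found attractive (the attractiveness values in the example are hypothetical; the actual study estimates them from click data):

```python
def cascade_click_probs(attractiveness):
    """P(click at rank r) = a_r * prod_{i<r} (1 - a_i):
    the user clicks at rank r only if no earlier snippet attracted a click."""
    probs = []
    reach = 1.0                     # probability the user examines rank r
    for a in attractiveness:
        probs.append(reach * a)
        reach *= 1 - a              # user continues only if not satisfied
    return probs
```

With uniformly attractive snippets (e.g. a = 0.5 at each rank), the click probability halves at every rank, which illustrates why such a model is strongest at the very top of the list.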
The models were trained on data acquired from “a major search engine” (the study was performed at Microsoft), where the results were manipulated as follows: for a subset of users, adjacent results from the first result page were swapped (e.g., the first result with the second, or the 9th with the 10th). The difference in click occurrence was measured and used to train the individual models. The predictions of these models were then compared to the actual click data.
32 As the authors note regarding the choice of 1/r as the discounting function, “there is nothing particular about that choice” (Chapelle et al. 2009, p. 624); another function can easily be employed if it promises closer correspondence with the assumed user model.
The results, based on over 100,000 data sets, are very interesting. In short, the cascade model performs better than all the others, while the examination model produces results barely above the baseline (see Figure 4.3). This is remarkable, especially given that the cascade model used here was deliberately kept very simplistic and rests on some crude assumptions (such as there always being one and only one click per query). The obvious conclusion would be that a measure composed of ratings for single results does not lead to a coherent representation of a user’s probable behavior, even if position bias is taken into account. However, the results also show that the strength of the cascade model lies in predicting user behavior at the very first ranks. At the first rank,33 its performance is superb; at the second, good; at the third, unremarkable. From the fourth result on, the cascade model actually performs worse than the baseline model, which assumes no positional bias at all. The authors “take this to mean that there is a small amount of presentation bias at lower ranks, and a large amount of variability because clicks are rare” (Craswell et al. 2008, p. 93). This is not the only possible interpretation. Another is that the difference is not primarily one of rank, but of query type. In this reading, the model performs very well on navigational queries, which indeed contain only one result that the user is interested in, as the model assumes. Furthermore, these results are usually near the top of the result list, in the positions where the cascade model is strongest.
Broad informational queries, in contrast, tend to require multiple documents to satisfy the information need. Also, the results for non-navigational queries are considered to be more difficult to rank (Dupret and Liao 2010), so that the ranking list might be worse to begin with, allowing a relatively good performance of the baseline model. If this interpretation is correct, different metrics would be required in order to adequately assess different query types.
33 That is, when the first result is swapped with the second; the rank indicates the first of two consecutive swapped results.
Whatever the precise reasons for the cascade model’s performance, the overall results of the study seem to indicate that the methods for using explicit relevance ratings to construct result list evaluations, as they are mostly applied today, do not correspond well with real-life user behavior.
There have been further studies indicating the need for more holistic ratings. Ali, Chang et al. (2005) have shown that the correlation between result-based DCG scores and whole-result-list scores (on a tertiary scale) is 0.54 for image search and 0.29 for news search. While these fields are more specific than general web search, the numbers clearly do not indicate a reliable link between the scores. Instead, the low correlation suggests that a result list might be larger or smaller than the sum of its parts; that is, good individual results might combine into a bad result list, and a result list judged to be of high quality might consist of results which are, on average, individually considered to be low-grade.
A further study examined the correlation between average precision and user success (Al-Maskari et al. 2008). For a number of TREC topics, test runs were performed on three search engines to determine their average precision. Then, users were presented with either the best or the worst of the search engines for each topic, and were instructed to find as many relevant documents as possible within 7 minutes. The results showed a strong correlation between average precision and user success metrics (such as the number of retrieved documents), as well as with user satisfaction.
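For reference, average precision, the measure at issue here, can be sketched as follows (a standard non-interpolated formulation; the study’s exact computation may differ, e.g. regarding cut-off values):

```python
def average_precision(relevances, total_relevant):
    """Mean of the precision values at each rank where a relevant
    document occurs, normalized by the number of known relevant documents."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant if total_relevant else 0.0
```

Because every relevant document contributes the precision at its own rank, early relevant results dominate the score; this rank sensitivity is exactly what the task design discussed below may have neutralized.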
However, a number of conceptual issues might be raised. The correlation values are significant; but the increase in average precision that they correlate with is fourfold, which is quite an extraordinary difference. Compared with this distinction, the increase in user success and especially user satisfaction is quite low. When the (absolute or relative) difference between the systems’ average precision was reduced, the significance of correlations promptly dropped and all but disappeared when the increase in average precision was at 30%.
The topics were taken from TREC, and the “title” and “description” fields were used as queries for the precision calculation; however, the users searched with queries of their own, so that the correlation was actually between the precision of one query and the output and user satisfaction of another. The average precision was very low (0.09 to 0.16 for the three search engines tested); this might be a consequence of measuring it for all results, without a cut-off value.34 This makes the “good” systems not very good, and the average precision not very realistic with regard to actual user behavior. Furthermore, the interface used in the study bears little resemblance to the interfaces of most search engines, which makes the results less transferable and aggravates the possible effects of an artificial setting on user behavior. Perhaps most significantly, the task itself, with its fixed time frame and its requirement to recall relevant documents independently of their actual novelty or usefulness, seems to be more of a task one would like to measure with average precision than a typical information need of a web searcher. This would also explain why precision@200 was found to correlate better with users’ effectiveness than average precision: since the users were instructed to gather as many relevant documents as possible, and were given a fixed (and not very small) amount of time, rank effects probably did not play a major role in this evaluation. This long list of possible error sources stresses the necessity of carefully considering study setups.

34 The authors, at least, do not mention a cut-off value, so we assume the average precision was that of all returned results.
One other possible explanation of the differing correlations is provided by Smith and Kantor (2008). They studied the influence of degraded search quality on user performance. The degradation was achieved by returning standard Google results to the control group, and Google results starting at rank 300 to the treatment group.35 The lower quality was confirmed by significantly lower precision and (relative) recall numbers. The users were asked to find as many good sources for the topics as they could. There was no significant difference between the numbers of good results marked by the users, in either absolute or relative terms. The authors conjecture that the users adapt their behavior to the change in result quality, an assumption supported by the implicit metrics collected (for example, users working on “bad” result lists tended to pose significantly more queries per topic). Regrettably, the study provides no user satisfaction measures. Those would be quite interesting: if, as the results seem to suggest, the correlation between precision and user performance is high, while the correlation between precision and user satisfaction is low, as suggested by other studies (e.g. Turpin and Scholer 2006), there might not be any direct relationship between user performance and user satisfaction. Instead, the difference between performance and satisfaction may be caused by user effort (in this case, issuing more queries, and perhaps spending more time considering non-relevant abstracts and documents in the search for relevant ones).