«On Search Engine Evaluation Metrics Inaugural-Dissertation zur Erlangung des Doktorgrades der Philosophie (Dr. Phil.) durch die Philosophische ...»
ERR has the lowest peak value, 0.88. While MAP (Rank) reaches 0.90, it does so only at cutoff rank 9. Up to cut-off rank 7, it is the worst-performing metric. Somewhat surprising are the PIR values for classical Precision. While it does not attain the highest scores, it does very well up to rank 6; after that, though, the PIR declines sharply, to hardly 0.75 towards cut-off rank 10. The no-discount version of MAP takes off rather slowly, but overtakes Precision by cut-off rank 7 and reaches a PIR score of 0.92 at ranks 8 and 9.
- 115 The best-performing metrics are NDCG (with the standard discount function log2) and ESL (with rank-based discount and a cumulative relevance of 2.5). NDCG rises just a bit slower at the start, and declines after cut-off rank 8 while ESL stays stable, but over the middle distance, their scores are the same.
Apart from individual metrics’ scores, there are two important questions to be raised about this data, both of which have been mentioned in the NDCG discussion in Section 10.1. The first of those is the shape of the individual curves, while the second concerns absolute values.
None of the metrics in Figure 10.34 starts with high PIR values at low cut-off ranks. Instead, there are two basic forms; either the scores rise up to a point, and then stay stable (MRR, MAP (Rank), and ESL (Rank, n=2.5) ), or they rise to a point, plateau, and the decline again (Precision, NDCG). MAP (No discount) and ERR (Square) only decline at the cut-off 10, the former more, the latter less strongly. The broad pattern that can be discerned is one of discount steepness; the metrics with steeper discounts tend to keep their high scores, while those with low or no discounts tend to do worse starting with cut-off rank 6 to 9. The nodiscount MAP and ERR (Square) metrics do not quite conform with this observation, but they do not depart from it strikingly, either. But is there, I hear you cry out, any possible reason for this strange phenomenon? Indeed, there might well be.
Inter-metric PIR comparison using the best-threshold method. NDCG (log2) and ESL (Rank, n=2.5) have the same values for cut-off ranks 3 to 8.
Usually, it is assumed that it would be best to evaluate all available results if possible. In web search engines, this possibility rarely occurs, and the researchers settle for the second-best option – evaluating as many results as they can afford. But this may turn out to be a relic of good old days of professional databases used by professional researchers. Consider Precision, the most venerable of metrics. In olden days, the perfect solution would have been to evaluate all results returned by the information retrieval system. But, as study after study has shown, modern-day search engine users do not pay attention to all documents. Their attention dwindles rapidly, so that in most search sessions, a result at rank ten is unlikely to even be considered, much less clicked on. This means that if one result list has a top-quality page at rank 10, while a second has a completely irrelevant result, the user is unlikely to care (or even notice). Precision, however, regards this difference to be important – in fact, as important as a similar difference at rank 1. What does this mean for the relative evaluation of the two result lists? If the result at a later rank does not influence user preference, but is counted towards the metric score, the score may go up or down independent of the preference rating. And an indiscriminate fluctuation of scores pushes the PIR to its baseline of 0.5, which, as discussed in Section 8.2.1, constitutes precisely the case where the user preference and metric score are independent.
More generally, this line of reasoning suggests that, while later ranks have the potential to improve preference identification, this diminishes the farther away one gets from the first few results; the chance of leading the metric off course with a result ignored by the user will, in contrast, rise. If the metric takes too little notice of lower-ranked results, it might miss an
The result of this view on discounting is that considering only a small part of results returned by a search engine might not be a necessary evil, saving resources by diminishing the quality of the study. It might instead be a way to save resources while improving the study quality – a trade-off most researchers would not consider too disadvantageous.
The second important question is that of the metrics’ absolute values. As has been already mentioned in Section 10.1, the scores seem quite high. There might be multiple explanations for that; for example, metric quality, query type, threshold advantages, lack of non-preferable result lists, and unrealistic rater availability, which I will discuss in the following paragraphs.
First, let us call to mind what the numbers actually mean. The “good” metrics (that is, every metric but MRR) peak at PIR scores of 0.88 to 0.92, so I will use a value of 0.9 as an average.
A PIR score of 0.9 means that, given a query, two result lists one of which is preferable to the user, and explicit result judgments by this user, a metric will correctly pick the preferred result list nine times out of ten; one in ten result lists will be the less desirable one. Alternatively, the metric might pick the preferred result eight times out of ten, and return no preference for the other two results (leading to guessing and a preference recognition rate of 50% for those results).
The most straightforward explanation is that the numbers simply reflect the high quality of the metrics. After all, many of these metrics have been used extensively, often for decades, and there is no compelling intrinsic reason to assume they have to be poor predictors of user preference. However, previous studies examining the relationship between these metrics and user satisfaction and quality have produced results which are hard to reconcile with these high scores (e.g. Turpin and Hersh (2001) or Al-Maskari, Sanderson and Clough (2007); see Section 4 for detailed discussions of these studies).
The next possibility is that the mix of different query types used for the evaluation pushed the scores higher than they would be otherwise, assuming that different metrics should be used for different query types. We can check this by comparing Figure 10.34 with its informational-only counterpart (Figure 10.36), as we did before. The results for informational queries show slightly higher PIR scores than, but no major differences from the all-query results.
- 118 Figure 10.36. Inter-metric PIR comparison using the best-threshold method and only informational queries. NDCG (log2) and ESL (Rank, n=2.5) have the same values for cut-off ranks 3 to 8. Again.
Another conceivable reason for the high scores is the best-threshold approach; by cherrypicking the best thresholds, the results are likely to be on the upper bounds of what is possible to determine on a relatively modest data set. This possibility can also be easily tested. In Figure 10.35, the zero-threshold approach is used instead of best-threshold, and the results are so similar that it takes a bit of time to find the differences. 92 The peak values remain practically unchanged; and so we can shelve this possibility for good.
A third option for doubting the high values comes from the pre-selected nature of result lists used for this evaluation. As explained in Section 8.2.1, result list pairs where the user does not have a preference have been excluded from the study. The reason for this is simple: if the user does not care which result list he gets, the metric can neither improve nor degrade the user experience. On the downside, this means that PIR measures only a specific sub-part of a metric’s ability to correctly predict the user’s judgments. The question of judging metric quality for other, more detailed evaluations will be discussed in Section 220.127.116.11. But we can already hypothesize the results in this case; as we have seen, most high PIR scores are reached at very low thresholds. However, low thresholds mean that we judge almost any difference in metric scores to be important; and the likelihood of assuming a user preference when there is none is quite considerable in these cases.
A final reason for possibly inflated scores is grounded in the relevance ratings and preference judgments used. As it is probably the most complex one, it has a section of its very own.
Yes, there are some. I leave it to the reader to find them as a home assignment.
- 119 Preference Judgments and Extrinsic Single-result Ratings As mentioned before, the results discussed in the previous section were all based on a comparison of result lists preferred by users and ratings of single results by the very same users. As was also briefly mentioned before, this is an unrealistic situation. If a user has already seen the individual results, he is unlikely to be interested in seeing result lists containing links to those results, even with the added benefit of abstracts. If an evaluation effort uses any single result based metric to determine the better result list, its ratings will not come from precisely the people for whom the result list is intended. Accordingly, a more realistic estimate of a metric’s usefulness will come from the comparison of user preferences with metric scores based on other users’ ratings. The calculation is straightforward; the standard PIR is used, as it has been stated in Formula 8.2, but with the relevance rankings of individual results being the average relevance ratings of all users except for the user whose result list preference is being used.
As in the same-user rating condition, I will start with threshold graphs. Only a few will be used, as the graphs show many of the same properties. One difference, however, can be seen on graphs such as Figure 10.37 (NDCG with log2 discount) and Figure 10.38 (NDCG with square-rank discount). When compared to their same-user-rating counterparts (Figure 10.3 and Figure 10.6), they show much more shallow curves for individual cut-off ranks. This is especially true for higher-discount functions; while the PIR scores diminish rapidly with increasing thresholds in the same-user condition, here, they stay relatively stable or even increase slightly at threshold values up to 0.1, and decline relatively slowly after that.
NDCG PIR values discounted by log2 of result rank. Result and preference ratings come from different users.
NDCG PIR scores for different discount functions, with the best-threshold approach. Result and preference ratings come from different users.
This change has predictable implications for the PIR comparison graphs using different threshold approaches (since t=0 does not necessarily provide scores close to the best-threshold method, as it did almost universally in the condition with same-user ratings). While the general shapes of the curves in Figure 10.39 and Figure 10.40 are still quite similar, the differences are immediately visible. The differences between different discount functions are larger with the zero-threshold approach, as are the score amplitude and score fluctuations for individual functions. The best-threshold approach produces more stable results, but it also shows some changes. The PIR scores peak earlier, mostly at cut-off rank 6; the declines after the peak are less steep and less universal; and because of the stronger score fluctuations, it is harder to determine which metrics perform better. The low-discount functions (no-discount and log5) seem to perform worst if viewed over all cut-offs; but their peak score is higher than that of the high-discount functions (rank square, click-based). Arguably, the average-discount functions (log2, root, rank) perform the best; for this reason, I will keep the NDCG log2 as NDCG reference metric.
A similar picture arises for MAP. The zero-threshold graph (Figure 10.42) appears even more chaotic than that for NDCG. Scores still tend to peak around cut-off 6, but there are some large increases at ranks 7 or 8. The best-threshold evaluation (Figure 10.41) is again slightly more stable, but still hard to read. The low-discount functions tend to perform better; the “No discount” function can still be considered to perform best in the best-threshold case, while falling behind for zero-threshold; the “Rank” discount, equivalent to traditional MAP, still performs rather poorly.
MAP PIR scores for different discount functions, with the zero-threshold approach. Result and preference ratings come from different users.
- 123 Our next test case is ESL. The shapes of the curves in Figure 10.43 and Figure 10.44 show less difference to their counterparts with same-user ratings. The latter graph, showing the PIR scores for the cumulative relevance n=3, is somewhat more jagged, and more lines show a slight decline at later ranks (n=3 produced better results than n=2.5 in this condition, and will be used for the purposes of this section). Nevertheless, both graphs are much more orderly than MAP or even NDCG. The scores obtained with the zero-threshold approach are less stable for n=3 (Figure 10.46), but otherwise both these results and those for n=1 (Figure 10.45) are quite similar to the results of the best-threshold approach.
ESL PIR scores for cumulated relevance n=1 with the best-threshold approach. Result and preference ratings come from different users.
ESL PIR scores for cumulated relevance n=1 with the zero-threshold approach. Result and preference ratings come from different users.