On Search Engine Evaluation Metrics. Inaugural dissertation for the degree of Doctor of Philosophy (Dr. Phil.) through the Philosophische ...
Arguably, this approach ignores a lot of fine detail. For example, if the user clicks on all of the top five results, it could be a sign of higher result quality than if only the third (or even only the first) result is selected. However, the average click rank would be the same in the former comparison (and, where only the first result is clicked, even lower, thus favoring the supposedly worse list). But a metric can still be useful, irrespective of how far it goes in its attempt to capture all possible influences.
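To make this limitation concrete, here is a minimal sketch (hypothetical click data; `average_click_rank` is an illustrative helper using the no-click convention of rank 21 described in the figure captions):

```python
def average_click_rank(clicked_ranks, no_click_rank=21):
    """Mean rank of the clicked results; no-click sessions get a
    default rank just below the first 20 results."""
    if not clicked_ranks:
        return no_click_rank
    return sum(clicked_ranks) / len(clicked_ranks)

# Clicking all of the top five results...
all_top_five = average_click_rank([1, 2, 3, 4, 5])  # 3.0
# ...scores the same as clicking only the third result,
only_third = average_click_rank([3])                # 3.0
# and worse than clicking only the first result.
only_first = average_click_rank([1])                # 1.0
```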
Figure 11.8. Preference Identification Ratio based on average click rank. PIR is shown on the Y-axis and the threshold (difference in rank average) on the X-axis. The PIR indicates the correct preference identification if average click ranks which lie higher in the result list are considered to be better. No-click sessions are regarded as having an average click rank of 21, so as to have a lower score than sessions with any clicks in the first 20 results.
And indeed, Figure 11.8 shows average click rank to perform well above chance when predicting user preferences. At low threshold values (up to a difference of about 3 ranks in average click rank), the PIR scores lie at around 0.65. For thresholds of 4 ranks and more, the score declines to around 0.55 (and for thresholds of 10 ranks and above, it lies around the baseline score of 0.50).
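A PIR curve of this kind can be sketched as follows (hypothetical session pairs; the convention of scoring undecided pairs as chance, 0.5, is an assumption for illustration, not necessarily the study's exact procedure):

```python
def pir(pairs, threshold=0.0):
    """Preference Identification Ratio for a list of session pairs.

    Each pair is (avg_rank_a, avg_rank_b, preferred), where `preferred`
    ("a" or "b") is the user's stated preference and a *lower* average
    click rank is taken to indicate the better result list.  Pairs whose
    rank difference does not exceed the threshold yield no prediction
    and are counted as chance (0.5).
    """
    total = 0.0
    for rank_a, rank_b, preferred in pairs:
        if abs(rank_a - rank_b) <= threshold:
            total += 0.5  # metric makes no call
        else:
            predicted = "a" if rank_a < rank_b else "b"
            total += 1.0 if predicted == preferred else 0.0
    return total / len(pairs)

# Hypothetical data: four session pairs with stated preferences.
pairs = [(2.0, 7.0, "a"), (3.5, 3.0, "b"), (1.0, 9.0, "b"), (4.0, 12.0, "a")]
```

Scanning `threshold` over a range of values and plotting `pir` at each point yields a curve of the sort shown in Figure 11.8.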
Figure 11.9. Preference Identification Ratio based on first click rank. PIR is shown on the Y-axis and the threshold (difference in first click rank) on the X-axis. The PIR indicates the correct preference identification if clicks which lie higher in the result list are considered to be better. No-click sessions are regarded as having a first click rank of 21, so as to have a lower score than sessions with any clicks in the first 20 results.
Instead of using the average click rank, we could also consider just the first click. By the same logic as before, if the first result a user clicks on is at rank 1, the result list should be preferable to one where the first click falls on rank 10. This metric is probably even more simplistic than average click rank; however, as Figure 11.9 shows, it produces at least some results.
Figure 11.10. Preference Identification Ratio based on first click rank. PIR is shown on the Y-axis and the threshold (difference in first click rank) on the X-axis. The PIR indicates the correct preference identification if clicks which lie higher in the result list are considered to be better. No-click sessions are regarded as having a first click rank of 21, so as to have a lower score than sessions with any clicks in the first 20 results. Only informational queries have been evaluated.
In the previous chapters, I have presented the results of the study in some detail; perhaps sometimes in more detail than would be strictly necessary. I feel this is justified, since my main goal is to show the possibilities of the approach, which is to judge metrics by comparing them to users’ explicit statements regarding their preferences. I have also commented briefly on the meaning of most results. In this section, I want to pull together these results to provide a broader picture. I will pronounce judgment on individual metrics and types of evaluation that were studied; however, a very important caveat (which I will repeat a few times) is that this is but one study, and not a large one at that. Before anyone throws up his hands in despair and wails “Why, oh why have I ever used method X! Woe unto me!”, he should attempt to reproduce the results of this study and see whether they really hold. On the other hand, most of the findings do not contradict earlier results; at most, they contradict views widely held on theoretical grounds. That is, while the results may not be as solid as I would like, they are sometimes the best empirical results on particular topics, if only because there are no others.
The discussion will span three topics. First, I will sum up the lessons gained from the general results, particularly the comparison of the two result list types. Then, I will offer an overview of and some further thoughts on the performance of individual metrics and parameters.
Finally, I will turn to the methodology used and discuss its merits and shortcomings, for the evaluation presented in this study as well as for further uses.
12.1 Search Engines and Users

The evaluation of user judgments shows some quite important results. These concern three areas: search engine performance, peculiarities of user judgments, and the consequent appropriateness of these judgments for particular evaluation tasks.
The conclusions regarding search engine performance come from the evaluation of judgments for the original result list versus the randomized one (to recall: the latter was constructed by shuffling the top 50 results of the former). For most evaluations, the judgments were significantly different from each other; but, also for most evaluations, much less so than one would expect for an ordered versus an unordered result list from a top search engine. In click numbers and click ranks, in user satisfaction and in user preference, the randomized result lists performed quite acceptably.
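The randomized condition described above can be reproduced in a few lines (a sketch; the result identifiers are placeholders):

```python
import random

def randomized_list(original_ranking, seed=None):
    """Build the randomized condition by shuffling the top 50
    results of the original ranking."""
    top50 = list(original_ranking[:50])
    random.Random(seed).shuffle(top50)
    return top50

original = [f"result_{rank}" for rank in range(1, 51)]
shuffled = randomized_list(original, seed=0)
# Same 50 documents, different order.
```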
This was most frequent for queries which have a large number of results that might be relevant. This can be the case for queries where a broad range of options is sought.
Informational queries that have a wide scope (“What about climate change?”) fall into that category. In this case, the user presumably wants to look at a variety of documents, and there are also likely to be more than enough pages in the search engine index to fill the top 50 ranks with more or less relevant results. Another possibility is that the user looks for something
concrete which can be found on any number of pages. This can be a factual query (“When was Lenin born?”), or a large subset of transactional queries (there are hundreds of online shops selling similar products at similar prices, or offering the same downloads). All in all, those options cover a significant part of all queries. The queries for which there is likely to be a large difference between the performance of a normally ranked and a randomized list are those which have a small number of relevant results; these can be narrow informational or transactional queries (“Teaching music to the hard of hearing”), or, most obviously, navigational queries, which tend to have only one relevant result. In the present study, only 2 of the 42 queries were navigational, compared to the 10% to 25% given in the literature (see Section 2.2); a higher proportion may well have resulted in more differences between result list performance.
One evaluation where the two result list types performed very similarly was the number of result lists that failed to satisfy any rater (17% versus 18%). This, together with the relatively high percentage of queries rated as very satisfactory for the randomized result list (57%), indicates that user satisfaction might not be a very discriminating measure. Of course, satisfaction was recorded as a binary judgment in this case, and a broader rating scale might produce more diverse results.
For clicks, the results are twofold. On the one hand, the original result lists attracted many more clicks at early ranks than the randomized ones. But after rank five, the numbers are virtually indistinguishable. They generally decline at the same pace, apparently due to the users’ growing satisfaction with the information they have already gathered, or to their growing tiredness, or just to good old position bias; but result quality seems to play no role. Almost the same picture arises when we consider relevance, whether macro or micro; at first, the original result list has more relevant results, but, halfway through the first result page, the quality difference between it and the randomized result list all but disappears.
An important lesson from these findings is that authors should be very careful when, in their studies, they use result lists of different quality to assess a metric. The a priori difference in quality can turn out to be much smaller than expected, and any correlation it shows with other values might turn out to be, in the worst case, worthless. The solution is to define the quality criteria in an unambiguous form; in most cases, a definition in terms of explicit user ratings (or perhaps user behavior) will be appropriate.
12.2 Parameters and Metrics

Which metric performs best? That is, which metric is best at capturing the real user preferences between two result lists? This is a relatively clear-cut question; unfortunately, there is no equally unequivocal answer. However, if we look closer and move away from unrealistically perfect conditions, differences between the metrics’ performances become visible. Then, towards the end of this section, we will return to the question of general metric quality.
12.2.1 Discount Functions

One important point concerns the discount functions. This is especially important in the case of MAP, where classical, widely used MAP (with result scores discounted by rank) is consistently outperformed by a no-discount version.102 However, this by no means implies that no discount is the best discount. In NDCG and ESL, for example, the no-discount versions tend to perform worst. All in all, the hardly world-changing result is that different discount functions go down well with different metrics; though, once again, a larger study may find more regularities.
However, this lack of clarity does not extend to different usage situations of most metrics.
That is, no-discount MAP outperforms rank-discounted MAP for all queries lumped together and for informational queries on their own, for result ratings given by the same user who made the preference judgment and for those made by another one, and for ratings made on a six-point scale as well as binary ratings. If this finding is confirmed, it would mean that one only needs to determine the best discount function for a given metric in one situation, and can then use it in any other.
Another point to be noted is that there is rarely a single discount function for any metric and situation performing significantly better than all others. Rather, there tend to be some well-performing and some not-so-well-performing discounts, with perhaps one or two not-so-well-but-also-not-so-badly-performing ones in between. For NDCG, for example, the shallowest discounts (no discount and log5) perform worst, the steepest (square and click-based) do better, with the moderately discounting functions providing the highest results.
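For illustration, discount families of this kind can be plugged into a generic NDCG (a sketch: the gain is simply the raw relevance score, a log2-based discount stands in for the moderate log family, and the study's exact formulas may differ):

```python
import math

def dcg(relevances, discount):
    """Cumulative gain with each result's relevance divided by a
    pluggable, rank-based discount (ranks are 1-based)."""
    return sum(rel / discount(rank)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances, discount):
    """Normalize by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), discount)
    return dcg(relevances, discount) / ideal if ideal else 0.0

no_discount = lambda rank: 1.0                    # shallowest: rank ignored
log_discount = lambda rank: math.log2(rank + 1)   # moderate
square_discount = lambda rank: rank ** 2          # steep

rels = [3, 2, 3, 0, 1]  # made-up graded relevance, ordered by rank
# Over a full list, the no-discount version collapses to a constant 1.0
# (any permutation sums to the same gain), which hints at why it
# discriminates poorly; steeper discounts penalize misordering more.
```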
All of these results are hard to explain. One thing that seems to be clear is that the properties of a metric make it work better with a certain kind of discount; however, this is a reformulated description rather than an explanation. If pressed, I think I could come up with a non-contradictory explanation of why MAP works best without a discount, NDCG with a slight one, and ERR likes it steep. For example: MAP already regards earlier results as more important, even without an explicit discount function. As it averages over precision at ranks 1 to 3, say, the relevance of the first result influences all three precision values, while that of the third result has an effect on the third precision value only. Thus, using an additional, explicit discount by rank turns the overall discount into something more like a squared-rank function.
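This implicit weighting is easy to verify numerically (a sketch with made-up binary relevance values; the optional explicit discount divides each precision value by its rank, mimicking the rank-discounted MAP variant):

```python
def average_precision(rels, discount=lambda rank: 1.0):
    """AP over a ranked list of binary relevance values: average the
    precision at each relevant result's rank, optionally divided by an
    explicit rank discount."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += (hits / rank) / discount(rank)
    return total / hits if hits else 0.0

# Even without an explicit discount, the first rank matters most:
front = average_precision([1, 1, 0])  # 1.0
back = average_precision([0, 1, 1])   # (1/2 + 2/3) / 2, about 0.583
# Adding an explicit rank discount compounds the implicit one, pushing
# the overall emphasis towards something like a squared-rank function.
front_disc = average_precision([1, 1, 0], discount=lambda rank: rank)
back_disc = average_precision([0, 1, 1], discount=lambda rank: rank)
```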
However, the other cases are less clear-cut; and in all of them, the a posteriori nature of possible explanations should make us cautious not to put too much trust in them until they are corroborated by further results.
12.2.2 Thresholds

The thresholds, as employed in this study, have the function of providing some extra sensitivity to preference predictions. These predictions depend on differences between two values; and it should make immediate sense that a difference of 300% should be a pretty strong predictor, while a difference of 1% might or might not mean anything. Indeed, the smaller the difference, the larger the chance that the user preference actually goes in the opposite direction.103

The good news is that in many cases, the best predictions can be made with the threshold set to zero. For example, the inter-metric evaluation results given in Section 10.5 look very similar, whether we use the zero-threshold approach, or take the best-performing thresholds for every individual metric, discount and cut-off value. If there are changes, they generally lie in the margin of 0.01-0.02 PIR points.

102 At later ranks, no-discount MAP sometimes has lower scores than rank-discounted MAP. However, this is not really a case of the latter improving with more data, but rather of the former falling. This is an important difference, since it means that no-discount MAP generally has a higher peak score (that is, its highest score is almost always higher than the highest score of rank-discounted MAP).
The bad news is that this is not the case when we use single-result and preference judgments by different users – the most likely situation for a real-life evaluation. For the six-point relevance scale as well as for its binary and three-point versions, the zero-threshold scores are lower and more volatile. While the peak values are relatively stable, it becomes more important to either use precisely the right cut-off value, or go at least some way towards determining good threshold values; otherwise, one can land at a point where PIR scores take a dip, and get significantly worse predictions than a slightly different parameter setting would yield.
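Determining a good threshold can then be a simple grid scan over judged session pairs (a sketch; each pair is reduced to a signed metric difference plus whether the metric's direction matched the stated preference, and undecided pairs again count as chance, which is an assumed convention):

```python
def pir_at(pairs, threshold):
    """PIR over (delta, correct) pairs: `delta` is the metric difference
    between two result lists, `correct` whether its sign matches the
    user's stated preference; undecided pairs count as chance (0.5)."""
    score = 0.0
    for delta, correct in pairs:
        if abs(delta) <= threshold:
            score += 0.5
        else:
            score += 1.0 if correct else 0.0
    return score / len(pairs)

def best_threshold(pairs, candidates):
    """Grid-scan candidate thresholds; return the (threshold, pir)
    pair with the highest PIR."""
    return max(((t, pir_at(pairs, t)) for t in candidates),
               key=lambda result: result[1])

# Hypothetical pattern: tiny differences are unreliable, larger ones hold.
pairs = [(0.05, False), (0.08, False), (0.4, True), (0.6, True), (0.9, True)]
```

With data of this shape, a small nonzero threshold filters out the unreliable pairs and lifts the PIR above its zero-threshold value, while an overly large threshold throws away reliable predictions as well.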
12.2.3 Detailed Preference Identification

There is an additional twist to threshold values. Throughout this study, I have assumed that the goal, the performance we were trying to capture, was to provide a maximal rate of preference recognition; this is what PIR scores are about. Sensible as this assumption is, there can be others. One situation where PIR scores are not everything occurs when we are trying to improve our algorithm without worsening anyone’s search experience.