On Search Engine Evaluation Metrics. Inaugural dissertation for the degree of Doctor of Philosophy (Dr. Phil.), submitted to the Philosophical ...
It is no revelation that most changes, in search as elsewhere, make some things better and some worse. Unless we reach a PIR of 1 (and we won't), some users' preferences are not going to be recognized, or will even be inverted. If we have a PIR of 0.7, then 70% of user preferences are recognized correctly while 30% are reversed; or 40% are recognized correctly and 60% are treated as no preference at all; or anything in between. For a search engine operator, it is a big difference whether you improve results for two fifths of your users, or improve them for about two thirds while antagonizing the remaining third. Users may be content with stagnating quality, but react badly if it actually deteriorates.
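The two readings of PIR = 0.7 above can be reproduced with a small sketch. It assumes, as the worked examples suggest, that a correctly identified preference counts as 1, a reversed one as 0, and a missed preference (treated as no preference) as 0.5; the function name and interface are mine, not part of the dissertation.

```python
def pir(correct: int, reversed_: int, missed: int) -> float:
    """Preference Identification Ratio over queries with a user
    preference, assuming a missed preference counts as a coin
    flip (0.5) while a reversed one counts as 0."""
    total = correct + reversed_ + missed
    return (correct + 0.5 * missed) / total

# The two readings of PIR = 0.7 from the text:
assert pir(70, 30, 0) == 0.7   # 70% correct, 30% reversed
assert pir(40, 0, 60) == 0.7   # 40% correct, 60% treated as ties
```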
Consider the example of one of the best-performing metrics, NDCG with a default log2 discount (the graph is reproduced for your convenience as Figure 12.1). Let us assume that you've gone for the cut-off value of 10 to use as much information as possible. You can see in the graph that the peak PIR value is reached at thresholds 0.01-0.03 and again at 0.08. Which of those should you choose if you want to maximize benefits, but only provided you cause a minimum of harm? Or is there another threshold which is best suited for that?
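For reference, NDCG with a log2 discount and a cut-off can be sketched as follows. The log2(rank + 1) form is one common convention (so that rank 1 keeps its full gain); the example lists and the 0-3 relevance grades are hypothetical, not taken from the study.

```python
import math

def dcg(relevances, k):
    """DCG at cut-off k with a log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg(relevances, k):
    """NDCG: DCG normalized by the DCG of the ideally ordered list."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal else 0.0

# Two hypothetical result lists on a 0-3 relevance scale; a preference
# is only inferred if the score difference exceeds the threshold.
list_a = [3, 2, 0, 1, 2, 0, 0, 1, 0, 0]
list_b = [2, 3, 1, 0, 0, 2, 0, 0, 1, 0]
difference = ndcg(list_a, 10) - ndcg(list_b, 10)
prefer_a = difference > 0.01   # threshold of 0.01, as in the text
```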
This example suggests that another method would be to determine thresholds by relative, not absolute values.
However, since the PIR scores are unlikely to differ by orders of magnitude, large differences in the results seem improbable.
This example has been picked at random; obviously, the same methods can be used to evaluate in detail any other metric and discount.
In this case, it might make sense to go beyond PIR. Figure 12.2 shows the detailed distribution of judgments at the first peak point (threshold 0.01). I will take a moment to explain what the colours mean.
Dark green shows the proportion of queries where result list preference is identified correctly by the difference in NDCG scores.
Light green means that the user has no preference, and the NDCG scores’ difference is below the threshold, so that the equal quality of the result lists is correctly inferred.
These queries are not used in PIR calculation for reasons stated below.
The yellow sector is for the queries where the user considers the result lists to be equally good, but the difference in NDCG scores is larger than the threshold, falsely showing a preference. This is the second kind of query not used for PIR calculation.
As explained in Section 8.2.1, if a user considers two result lists to be of equal use, then recognizing this (non-)preference or failing to do so has no direct influence on the user's experience, since he will be neither better nor worse off, whichever result list he gets.
Orange means that the NDCG difference is once again below the threshold, but this time, there is a user preference, which is thus not captured.
Red means that there is a user preference, but the difference in NDCG scores actually predicts the opposite preference.
As an easy guide, the colours go from dark green (good) through yellow (relatively harmless) towards red (really bad).
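The five categories can be stated compactly in code. This is a sketch of the classification logic as I read it from the colour descriptions above; the function name and the encoding of preferences as +1/0/-1 are my own.

```python
def classify(user_pref: int, score_a: float, score_b: float,
             threshold: float) -> str:
    """Assign a query to one of the five pie-chart categories.
    user_pref: +1 (prefers list A), -1 (prefers list B), 0 (no preference)."""
    diff = score_a - score_b
    # The metric predicts a preference only beyond the threshold.
    metric_pref = 0 if abs(diff) <= threshold else (1 if diff > 0 else -1)
    if user_pref == 0:
        return "light green" if metric_pref == 0 else "yellow"
    if metric_pref == 0:
        return "orange"        # existing preference missed
    return "dark green" if metric_pref == user_pref else "red"
```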
Figure 12.2: Preference identifications for NDCG discounted by log2 of result rank, at a cut-off value of ten and a threshold of 0.01.
The PIR value for the point shown in Figure 12.2 is 0.84, as can be read from Figure 12.1.
However, the pie graph provides more information. We see that for over 60% of all queries, users have preferences which are correctly identified by NDCG (dark green); that of the 23% of queries where the user does not have a preference, the NDCG scores reflect this in just 2% (light green), while for 21%, they suggest a non-existent preference (yellow); that in 7% of cases, NDCG scores miss existing preferences (orange); and that for 8% of all queries, NDCG actually picks the result list the user likes less. It is this last number that you are interested in minimizing in this scenario, while keeping the first one as high as possible. But 8% seems quite large; about one in thirteen users will actually experience a decline in result list quality. Therefore, we need a method to look at the number of reversed preference judgments depending on the threshold used.
Of course, the erroneous “No preference” statement should also be kept as low as possible; but while this would mean missed chances to improve the user experience, reversed preference actually impairs it.
Figure 12.3: Preference identifications for NDCG discounted by log2 of result rank, at a cut-off value of 10.
This is where Figure 12.3 comes into play, and I suspect that if you regarded PIR graphs as too immersed in specifics, you are going to hate this one. It shows the relative development of the five categories introduced above as a function of threshold values. As above, we would like the green areas to be as large as possible, while keeping orange and red small.
Additionally, the black line shows the PIR at each threshold.
If we look at the PIR line, we will see that the peak scores described above are indeed quite different with regard to the number of queries which will have their user preference reversed.
With a threshold of 0.01, more than 8% of all users will experience a decline in result list quality; at 0.03, the number falls to about 6%, and at 0.08 to 4%. If our aim is to minimize wrong predictions, we should go with the 0.08 threshold; or even raise it to as much as 0.24, where PIR is 0.05 points below its maximum, but no user will be worse off than before.
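Such a sweep is easy to automate. The sketch below uses synthetic score differences and preferences rather than the study's data, and assumes (as before) that PIR counts missed preferences as 0.5; it only illustrates that raising the threshold can trade a little PIR for fewer reversals.

```python
import random

random.seed(0)
# Hypothetical per-query data: (NDCG score difference, user preference),
# where preference is +1 / -1 / 0 as above.
queries = [(random.gauss(0.05, 0.1), random.choice((1, 1, 1, -1, 0)))
           for _ in range(1000)]

def stats(threshold):
    """PIR and share of reversed preferences at a given threshold."""
    correct = reversed_ = missed = 0
    for diff, pref in queries:
        if pref == 0:
            continue                      # ties are not used for PIR
        if abs(diff) <= threshold:
            missed += 1
        elif (diff > 0) == (pref > 0):
            correct += 1
        else:
            reversed_ += 1
    n = correct + reversed_ + missed
    return (correct + 0.5 * missed) / n, reversed_ / n

for t in (0.0, 0.05, 0.1, 0.2):
    pir, rev = stats(t)
    print(f"threshold {t:.2f}: PIR {pir:.3f}, reversed {rev:.3f}")
```

Raising the threshold can never increase the reversed share, since queries only ever move from the preference categories into the "no preference" ones.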
Figure 12.3 also illustrates an inherent property of PIR development.
With a threshold of 0, all queries (except those where metric scores are absolutely identical) are regarded as having a preferred result list. If we increase the threshold, the only possible change is that some of these queries move into one of the "no preference" categories. Thus, if PIR falls, some queries have moved from correct preference identification (dark green) to mistaken equality (orange); and if PIR increases, some queries must have gone from wrong preference identification (red) to mistaken equality. This gives us an easy way to detect purged mistakes: whenever PIR goes up as the threshold increases, the red sector has shrunk. We cannot assume that PIR always increases when errors are corrected, however, since correct judgments may be dropped at the same point.
To be more precise, more queries must have moved in the mentioned direction than in the opposite.
12.2.3 Rating Sources
A large difference exists between results depending on whether the explicit result ratings come from the same raters as the preference judgments. Not surprisingly, if a user's ratings are used to predict his own preference, the PIR scores are significantly higher. While the highest attained PIR score in this condition is about 0.92, it drops to just 0.80 if the preference is inferred from other users' judgements. This means that with all other parameters being optimal, one in five users would actually experience worse results than in the absence of any judgments.
This result holds broadly across most parameters. It can be seen at any cut-off rank, with any metric, threshold and scale (with an exception mentioned in the next section). Moreover, the results are not only lower in absolute PIR scores; they also get much less stable. The increased fluctuations can be seen particularly well in threshold evaluations, though most parameters are affected.
What are we to make of these results? To me, they mainly suggest the importance of search personalization. I stated that explicit result ratings from the real-life user issuing a query are a luxury hardly imaginable for a search engine. But neither is it necessary to base preference assessments on the judgements of all other users (or a random subset of them). The personalization of web search is a well-established topic which has grown in importance in the last decade (see, for example, Speretta and Gauch 2005; Teevan, Dumais and Horvitz 2005; Sieg, Mobasher and Burke 2007). Usually, the research focuses on generating useful changes to a result list drawing upon the user's previous actions. For our purposes, some kind of user profile could be generated, and a user's preference could be induced from the preferences of raters or real users with similar profiles. The resulting evaluation scores would then fall between the conditions of same-user and different-user ratings.
12.2.4 Relevance Scales
The evaluation of relevance scales is somewhat less satisfactory than that of other parameters in that it is not direct. The better method would certainly have been to provide different rater groups with different scales, and then give different sub-groups different instructions as to what the individual relevance ratings are supposed to signify. However, for the relevance scales tested here this would mean a total of six subgroups, in which case each subgroup would be too small to provide reliable results. Therefore, I had to make do with conflating the six-point scale into coarser ones; a more direct evaluation would be appropriate before taking any drastic measures. Generally, the six-point scale performs better than any binary or three-point scale. However, there are large differences among the subgroups as well as for different situations. For binary scales, the subgroup where the division between relevant and non-relevant was drawn in the middle (between the six-point scale's ratings 3 and 4) performed only slightly worse than the six-point scale itself, while the subgroups where the relevance criteria were either very strict or very lax did far worse. This is important since users tend to be generous with their ratings.
Although I do not think any of the results are very revolutionary in the first place.
For three-point relevance scales, an equal distribution of relevance ratings also provided results not far below those of the six-point scale. If “relevant” and “non-relevant” categories were narrower, with more ratings falling into the “partially relevant” category, the PIR scores dropped. However, there was a noticeable exception: in one condition (with result ratings and preference judgements coming from different users), the latter three-point scale was actually more successful than the six-point scale. The result is not easily explained, and awaits further examination.
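For concreteness, the conflation might look like this. The middle split between ratings 3 and 4 is taken from the text; the exact breakpoints for the strict, lax, and three-point variants are my own guesses, not the ones used in the study.

```python
# Conflating a six-point scale (ratings 1-6) into coarser scales.

def binary_middle(rating):
    """Split in the middle, between ratings 3 and 4 (as in the text)."""
    return 1 if rating >= 4 else 0

def binary_strict(rating):
    """Very strict criterion: only the top rating counts as relevant."""
    return 1 if rating >= 6 else 0

def binary_lax(rating):
    """Very lax criterion: everything but the bottom rating is relevant."""
    return 1 if rating >= 2 else 0

def three_point_equal(rating):
    """Equal distribution: 1-2 -> 0, 3-4 -> 1, 5-6 -> 2."""
    return (rating - 1) // 2
```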
12.2.5 Cut-off Ranks
The evaluation of cut-off ranks and their influence on the metrics' performance was a surprise for me, and could lead to a change in the methodology of search engine evaluation if it is confirmed by further studies. Usually, the necessity of introducing a cut-off rank is regarded as just that – a necessary compromise between optimal quality (evaluating all results) and reality (having a limited amount of resources). Accordingly, my initial aim regarding cut-offs was to see whether there are certain points at which the marginal gain of evaluating one more rank falls significantly, making them good candidates for cut-off-rankdom.
I did not expect to find that for most metrics and discount functions, as well as other parameters, the metrics’ performance does not rise steadily from rank one to ten. Instead, it rises initially, levels out, and then declines. The peak cut-off rank differs, and I will come to the particulars shortly. But before that, I would like to illustrate how more information can actually decrease the correspondence between an explicit metric and user preference.
As an extreme example, consider a navigational query. In the first result list, the desired result is at rank one; in the second, it is at rank 10. It is reasonable to expect that users will prefer the first result list. However, with precision at a cut-off rank of 10, the two lists receive identical scores. As soon as we lower the cut-off rank, the "right" result list takes the lead.
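The navigational example can be verified in a few lines; precision here is simply the share of relevant results above the cut-off.

```python
def precision_at_k(relevances, k):
    """Share of relevant results among the top k."""
    top = relevances[:k]
    return sum(top) / len(top)

# Navigational query: the single relevant result is at rank 1 in
# list A and at rank 10 in list B.
list_a = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
list_b = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

same_at_10 = precision_at_k(list_a, 10) == precision_at_k(list_b, 10)
a_leads_at_5 = precision_at_k(list_a, 5) > precision_at_k(list_b, 5)
```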
Of course, it is also possible that the first result list has another really good result at rank 10, while the second result list has none at all. In this case, going all the way to cut-off rank 10 would increase the chance of getting the right prediction. But if we assume that both scenarios are equally likely, then we have a fifty-fifty chance of improving or decreasing PIR, all other things being equal. But usually, all other things are not equal. PIR, in all cases we have seen, lies above the baseline of 0.5, and mostly well above it, sometimes in regions around 0.9. This means that it has much more room for decline than for gain; and, in our example, if the first result list had the good result at rank ten, the PIR would not have improved.
There is also another tendency working in the same direction. As shown in Section 9.4, the average relevance of the results up to rank 5 is higher for the original result lists than for the
That is not necessarily true; but the general logic holds in any case.
If we accept this “regression towards the mean” explanation, what parameters would determine a metric’s proneness to this cut-off decay? It would be the discount function and the PIR score itself.
The discount function's influence is easy to see: the steeper the discount, the less heed is taken of later results. With square rank discount, for example, the tenth result has a hundred times less influence than the first; for most purposes, it might as well not have been included at all. This effect can be seen clearly in Figure 12.4, which shows the performance of NDCG with different discount functions. The legend lists them in order of increasing steepness, and it is easy to see that no-discount DCG is the first to go into decline, and has the largest fall. log5 starts to decline at the same point (cut-off rank 5), but does so less dramatically. log2 and root are next to go, at cut-off rank 8. The steeper discounts (rank and squared rank) do not decline at all, and the click-based discount actually manages a marginal rise as late as cut-off rank 10.
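The relative weight of the tenth result under each discount makes this steepness ordering concrete. The formulas below are my reconstruction of the named discounts (the click-based discount is omitted, since its shape is not given here); the `max(1.0, ...)` guards keep the earliest ranks at full weight for the logarithmic discounts.

```python
import math

# Candidate discount functions from Figure 12.4 (names as in the text);
# each maps a 1-based rank to the weight of that result's gain.
discounts = {
    "none":         lambda r: 1.0,
    "log5":         lambda r: 1.0 / max(1.0, math.log(r, 5)),
    "log2":         lambda r: 1.0 / max(1.0, math.log2(r)),
    "root":         lambda r: 1.0 / math.sqrt(r),
    "rank":         lambda r: 1.0 / r,
    "squared rank": lambda r: 1.0 / r ** 2,
}

# Weight of the tenth result relative to the first: the steeper the
# discount, the less rank 10 can affect the score.
for name, d in discounts.items():
    print(f"{name:>12}: {d(10) / d(1):.3f}")
```

Under squared rank the ratio is 0.01, matching the "hundred times less influence" figure in the text.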