On Search Engine Evaluation Metrics. Inaugural dissertation for the degree of Doctor of Philosophy (Dr. phil.), submitted to the Philosophische ...
The final question concerns the relative and absolute PIR performance of the individual metrics. As Figure 10.47 shows, the results are significantly different from those obtained with same-user ratings. The shapes of the individual curves differ, as do the ranking of the individual metrics and the absolute PIR values. When the individual result ratings and preferences came from the same users, the PIR curves were relatively stable, mostly rising continuously towards their peak PIR value around cut-off ranks 7 or 8, and possibly declining after that (Figure 10.34). However, when we remove the luxury of same-user ratings, the picture changes. The metrics mostly peak at cut-off rank 6, with a few scores peaking as early as rank 5 (ERR and traditional, rank-discounted MAP).
Both of these metrics, as well as Precision, also experience a dip in PIR scores at rank 4. Only ERR, with its square discount, manages to stay at its (relatively low) peak score until cut-off rank 10; all other metrics' scores fall towards the later ranks, sometimes quite significantly. Again, the no-discount metrics (Precision and MAP) show the largest declines in PIR towards later cut-off ranks.
The relative performance of the different metrics changed as well, though not as strikingly.
The log2-discounted NDCG is still arguably the best-performing metric, although ESL (with n=3 and discounted by rank) has fractionally higher scores at early ranks. Traditional MRR, however, is omitted from this and all further evaluations in which single-result relevance ratings and preference judgments come from different users: it consistently lies far below the other metrics, and does nothing to aid our understanding while distracting from more important matters (and stretching the Y-scale of the graphs).
MAP, which, at least at later ranks, had high PIR scores in the same-rater condition, performs poorly throughout the current evaluation. Precision is even more variable than before; it has a PIR score of 0.80 at cut-off rank 6, tying with NDCG for the highest overall PIR score and beating traditional MAP (which is also at its peak value) by over 0.05. At cut-off rank 10, however, Precision falls not only more than 0.06 PIR points behind NDCG, but also behind all other metrics, including MAP.
Finally, the absolute scores also change, not unexpectedly. While the same-user approach produced PIR scores of up to 0.92, meaning that more than nine out of ten users would get their preferred result list based on individual result ratings, the different-user approach brings the top scores down to 0.80. This means that, even if we determine the ideal threshold value and the ideal rank, one out of five queries will result in an inferior result list being presented to the user. And if we do not have the ideal thresholds for each cut-off value, the results are slightly worse and more complex. Figure 10.48 shows the results of the zero-threshold approach;
while the absolute peak values decline only slightly (from 0.80 to 0.775), the individual metrics' scores fluctuate more. If we compare metrics pairwise, we find only a few pairs where one metric performs better than the other at all ranks. The scores increase slightly if only informational queries are evaluated (Figure 10.49).
These results are not easy to explain. As a possible explanation for the general PIR decline at later ranks, I previously offered the diminished chance of the results playing a role in the user's assessment; the influence of the relevance rating on the metric then changes PIR for better or worse by chance, bringing it closer to the baseline of 0.5. This could also explain why the PIR scores rise at cut-off rank 6 for many of the metrics. Some studies suggest that the result at the sixth rank is clicked more often than the one before it, raising its importance and usefulness (Joachims et al. 2005). Others suggest a possible explanation: users study the results immediately below those visible on the first screen in more detail, since the very act of scrolling down means their information need has not yet been satisfied (Hotchkiss, Alston and Edwards 2005). This transition might well happen immediately before or after cut-off rank 5. But it is unclear to me why this would only be the case if relevance and preference judgments come from different users. If the reason for the stagnation or decline of PIR scores lay in general user behavior, it should be the same across methods. The differences between the results of same-user and different-user evaluations have to stem from the only point of departure: the raters determining the relevance of single results. This does produce some effects; for example, as the relevance ratings of different users are averaged, the ratings change from a six-point to a more fine-grained scale. This leads to fewer instances of exactly the same PIR scores for different metrics. Obviously, the decline in absolute PIR scores can also be attributed to the different raters; the relevance of individual results for one person can be assumed to have a looser connection with another person's preference than with their own.
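The averaging effect just mentioned can be illustrated with a short sketch; the rater counts and rating values here are invented for illustration only:

```python
# Invented example: averaging six-point ratings (1 = best, 6 = worst)
# from several raters yields fractional values, turning the discrete
# six-point scale into a finer-grained one.
ratings_per_result = {
    "result_a": [2, 3, 2],  # three raters
    "result_b": [1, 4],     # two raters
}
averaged = {doc: round(sum(r) / len(r), 2) for doc, r in ratings_per_result.items()}
print(averaged)  # {'result_a': 2.33, 'result_b': 2.5}
```

With fractional averaged ratings, two metrics are far less likely to assign identical scores to two result lists, which is consistent with the observation above.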
However, I see no obvious explanation for the distinctive declines of PIR scores at cut-off ranks 4 or 5, and their subsequent rise at rank 6, in terms of different users providing the relevance ratings.
10.7 PIR and Relevance Scales

As mentioned in Section 8.2, one more question arising with regard to the evaluation is whether the six-point scale used to rate the relevance of individual results draws too fine a distinction. The most popular scale is still the binary one (Macdonald et al. 2010), and occasionally a three-point scale is used (Chu 2011). Perhaps the metrics I have evaluated are not well suited for graded relevance, having been conceived for use with binary or three-point relevance scales? The way to deal with this question will be, of course, consistent with the rest of this study: I will test different configurations and see which relevance scales produce what kind of results.
The two alternative relevance scales will be two-level and three-level relevance. While four-level and five-level relevance scales are quite conceivable (and occasionally used; see Gao et al. (2009) or Kishida et al. (2004) for some examples), the differences between a five-point and a six-point scale (and with them the need for an evaluation) are not as large as those between two-point and six-point scales. Moreover, it is more difficult to convert judgments from a six-point to a four- or five-point scale in a meaningful way. The same is true for more detailed relevance ratings.
(Although DCG, at least, was constructed with graded relevance input as one of its explicitly stated advantages; see Järvelin and Kekäläinen 2000.)

For both possible scales there are also questions of conversion. If we want to employ a binary scale, what would we want the rater to regard as relevant? There are three obvious approaches. One is to rate as relevant anything having even the slightest connection with the information need; another is to rate as irrelevant anything not regarded as optimal for the query;
and the third is to draw the line in the middle, using the categories “rather relevant than irrelevant” and vice versa (or perhaps “rather relevant” versus “not too relevant”). This corresponds to regarding as relevant the ratings 1 to 5, the rating 1 alone, and the ratings 1 to 3, respectively. For brevity, the three binary relevance scales will be referred to as R25, R21 and R23, the “2” designating binarity and the subscript showing the worst rating still regarded as relevant. For the three-point scale, there are two possibilities of conflating the ratings: either in equal intervals (1 and 2 as “highly relevant” or 1; 3 and 4 as “partially relevant” or 0.5; and 5 and 6 as “non-relevant” or 0), or regarding 1 as highly relevant (1), 6 as non-relevant (0), and everything in between as “partially relevant” (0.5). The two methods will be called R32 and R31 respectively, with the “3” showing a three-point scale and the subscript designating the worst rating still regarded as highly relevant.
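The conversions just defined can be written down compactly. The following sketch is my own illustration of the mappings (the function names simply mirror the labels in the text); it assumes the original six-point scale runs from 1 (best) to 6 (worst):

```python
# Illustration of the scale conversions described above.
# Ratings follow the six-point scale used in this study: 1 (best) to 6 (worst).

def binary(rating: int, worst_still_relevant: int) -> int:
    """Collapse a six-point rating to binary relevance (1 = relevant)."""
    return 1 if rating <= worst_still_relevant else 0

def r25(rating): return binary(rating, 5)  # everything but 6 counts as relevant
def r23(rating): return binary(rating, 3)  # "rather relevant" vs. "rather irrelevant"
def r21(rating): return binary(rating, 1)  # only the best rating counts as relevant

def r32(rating: int) -> float:
    """Three-point scale in equal intervals: 1-2 -> 1, 3-4 -> 0.5, 5-6 -> 0."""
    if rating <= 2:
        return 1.0
    if rating <= 4:
        return 0.5
    return 0.0

def r31(rating: int) -> float:
    """Three-point scale: only 1 is highly relevant, only 6 is non-relevant."""
    if rating == 1:
        return 1.0
    if rating == 6:
        return 0.0
    return 0.5

ratings = [1, 2, 3, 4, 5, 6]
print([r23(r) for r in ratings])  # [1, 1, 1, 0, 0, 0]
print([r32(r) for r in ratings])  # [1.0, 1.0, 0.5, 0.5, 0.0, 0.0]
print([r31(r) for r in ratings])  # [1.0, 0.5, 0.5, 0.5, 0.5, 0.0]
```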
How, then, will the PIR performance change if we use different rating scales? I will again start with NDCG, and proceed through the individual metrics, giving fewer details once the tendencies start to repeat.
10.7.1 Binary Relevance

Figure 10.50 and Figure 10.51 show that the difference between cut-off ranks becomes smaller for no-discount and rank-discounted NDCG when binary R23 relevance is used; especially with higher discounts, the scores after cut-off rank 3 get very close to each other. The picture is similar for R25 (Figure 10.52) and R21 (Figure 10.53).
There are also other possibilities, like considering 1-4 to be relevant, since those are the “pass” grades in the German school system, or considering 1-2 to be relevant, since these are the grades described as “very good” or “good”, respectively. I do not wish to complicate this section (further) by considering all possibilities at once, instead leaving these particular evaluations to future studies.
[Figure: NDCG PIR values discounted by result rank, with R21 relevance.]
Of more immediate interest are the relative performances of differently discounted NDCG metrics. Again, we see that the performances (in terms of PIR) become more similar. As in the six-level relevance evaluation, the performance in the R23 evaluation (Figure 10.54) depends on the discount strength; the low-discount functions (no discount, log5) perform worse than the higher-discount ones. In the other two binary relevance evaluations, the individual lines are still closer to each other. With R25, even the differences between low-discount and high-discount functions' PIR scores become less pronounced (Figure 10.55), and in the R21 evaluation, the difference between the best-performing and the worst-performing function at any particular cut-off value almost never exceeds 0.02 PIR points (Figure 10.56).
The log2-discounted NDCG metric, the one actually used in evaluations, does not perform as well as with the six-point relevance scale. It does have the highest scores with R21 relevance, but is inferior to other discounts with R23 and R25. Nevertheless, as no other discount consistently outperforms log2, I shall stick with it as the NDCG reference.
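For readers who want to experiment with discount strength themselves, the general scheme of NDCG with interchangeable discount functions can be sketched as follows. This is an illustration of the scheme, not the exact implementation used in this study; the gain values are invented, and the normalization (against the ideally reordered list) is one common convention:

```python
import math

def dcg(gains, discount):
    """Discounted cumulative gain: each rank's gain divided by its discount."""
    return sum(g / discount(rank) for rank, g in enumerate(gains, start=1))

def ndcg(gains, discount):
    """DCG normalized by the DCG of the ideally ordered list."""
    ideal = dcg(sorted(gains, reverse=True), discount)
    return dcg(gains, discount) / ideal if ideal > 0 else 0.0

# Discount functions from weakest to strongest; log2 is the classic NDCG choice.
no_discount   = lambda rank: 1.0
log5_discount = lambda rank: max(1.0, math.log(rank, 5))
log2_discount = lambda rank: max(1.0, math.log2(rank))
rank_discount = lambda rank: float(rank)

gains = [0.5, 1.0, 0.0, 0.5]  # invented graded relevance values
for name, d in [("none", no_discount), ("log5", log5_discount),
                ("log2", log2_discount), ("rank", rank_discount)]:
    print(f"{name}: {ndcg(gains, d):.3f}")
```

Note that in this sketch the no-discount variant is insensitive to the ordering of a fixed result set (it can only distinguish lists containing different results), which makes the role of discount strength easy to see by comparing the printed scores.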
[Figure: NDCG PIR scores for different discount functions, with the best-threshold approach using R23 relevance.]
[Figure: NDCG PIR scores for different discount functions, with the best-threshold approach using R21 relevance.]
As the results for individual MAP-based discount functions are similar in kind to those for NDCG, I will not bore you with more threshold graphs. The discount comparisons, however, are a different matter. The graph for R23 relevance (Figure 10.57) is not dissimilar to that for six-point relevance (Figure 10.18). Though there are some dips, the lines rise quite steadily throughout the cut-off values. The PIR scores for R25 relevance (Figure 10.58) rise faster, with most discount functions reaching their peak values by cut-off ranks six to eight. For R21, the scores fluctuate still more (Figure 10.59).
As in the six-point relevance evaluation, the no-discount function performs best in all three conditions (closely followed by log5 discount), while traditional MAP scores (discounted by rank) are at best average, compared to other discount functions.
[Figure: MAP PIR scores for different discount functions, with the best-threshold approach using R23 relevance.]
[Figure: MAP PIR scores for different discount functions, with the best-threshold approach using R21 relevance.]
The graphs for the other metrics do not differ in significant or interesting ways from those for NDCG and MAP, except in their absolute values; therefore, I will omit the individual metric graphs and proceed directly to the inter-metric comparison. For reference and ease of comparison, the PIR graph for six-point relevance is reproduced as Figure 10.60.
In Figure 10.61, showing PIR scores for R23 (binary relevance with the relevant results defined in the manner of “rather relevant” versus “not very relevant”), some changes are immediately apparent. While the top scores stay approximately the same, some metrics do better and some worse. MRR is the success story; rather than consistently lagging far behind the other metrics, it actually beats traditional MAP (discounted by rank) for most cut-off values. Rank-discounted MAP generally performs worse than in the six-point relevance evaluation, and so does Precision. The other metrics have results comparable to those in the six-point relevance evaluation, and no-discount MAP even improves its performance, though only by a small margin. All in all, there is hardly any difference between the best-performing metrics (NDCG log2, ERR Square and ESL Rank with n=2.5), and no-discount MAP overtakes them at the later cut-off ranks.
When we change the definition of relevance to be “not completely irrelevant” (R25), the scores sag (Figure 10.62). Instead of peaking at above 0.9, the best scores are now at just 0.8.
Again, the worst-performing metrics are MRR and rank-based MAP. The best-scoring metrics are NDCG (with log2 discount) and ESL with n=2.5 and rank-based discount; however, the remaining three metrics are not far behind. It is also remarkable that, apart from Precision, no metric experiences a significant PIR decline towards the later ranks. One possible reason might be the generally lower PIR scores overall (while in the six-point relevance and R23 conditions, the peak scores are high and accordingly have more ground to lose). Thus, a relatively moderate influence of later-rank PIR may not be enough to bring the scores down.
Finally, in Figure 10.63, the peak scores fall lower still. In this condition, where “relevance” is defined as something like “a result of the most relevant kind”, the PIR never reaches 0.75.
MRR and rank-based MAP still constitute the rear guard, with MAP performing worst.
NDCG (log2) and ERR (Square) provide the highest scores.