On Search Engine Evaluation Metrics. Inaugural dissertation for the doctoral degree in philosophy (Dr. phil.), submitted to the Philosophical ...
These evaluations seem to suggest that single result evaluations based on binary relevance perform worse than those based on six-point relevance. If the distinction between “relevant” and “non-relevant” is drawn in the middle, the decline in preference identification is moderate; if “relevant” is defined as “somehow relevant” or “highly relevant”, the losses in preference prediction quality become very large indeed.
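The binary conversions discussed here can be pictured as simple threshold mappings from the six-point scale (1 = best, 6 = worst). A minimal sketch, assuming the naming convention used in the text (R2k meaning a binary scale in which ratings 1 through k count as relevant); the function name is mine, not the study's:

```python
def to_binary(rating, threshold):
    """Map a six-point rating (1 = best, 6 = worst) to binary relevance.
    Ratings <= threshold count as relevant (1.0), the rest as non-relevant (0.0)."""
    return 1.0 if rating <= threshold else 0.0

ratings = [1, 2, 3, 4, 5, 6]
# R23: split in the middle; R21: only top-rated results are relevant;
# R25: everything except "totally irrelevant" (6) counts as relevant.
r23 = [to_binary(r, 3) for r in ratings]  # [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
r21 = [to_binary(r, 1) for r in ratings]  # [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
r25 = [to_binary(r, 5) for r in ratings]  # [1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
```

The choice of threshold is exactly what separates the moderate decline (R23) from the large one (R21, R25) described above.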
If we confine the evaluation to informational queries, the picture stays the same. Figure 10.64 through Figure 10.67 show the graphs; the lines are somewhat more jumpy (probably simply due to the reduced dataset), but all the trends remain.
Figure 10.60. Reproduction of Figure 10.34 with a modified Y-scale. Inter-metric PIR comparison using the best-threshold method and six-point relevance. NDCG (log2) and ESL (Rank, n=2.5) have the same values for cut-off ranks 3 to 8.
Inter-metric PIR comparison using the best-threshold method and binary R23 relevance. NDCG (log2) and ESL (Rank, n=2.5) have the same values for cut-off ranks 3 to 9; ERR and ESL have exactly the same scores.
Inter-metric PIR comparison using the best-threshold method and binary R21 relevance. ERR and ESL have exactly the same scores.
Figure 10.64. Reproduction of Figure 10.36 with a modified Y-scale. Inter-metric PIR comparison using the best-threshold method and six-point relevance, for informational queries only. NDCG (log2) and ESL (Rank, n=2.5) have the same values for cut-off ranks 3 to 8.
Inter-metric PIR comparison using the best-threshold method and binary R23 relevance for informational queries only.
Inter-metric PIR comparison using the best-threshold method and binary R25 relevance for informational queries only.
The next step is to examine what happens if we use binary relevance judgments coming from users other than those providing preference ratings, thus adding the effects described in the previous section. The first graph in the next cluster (Figure 10.68) again shows the six-point relevance version, which is the reference point. For the R23 binary relevance, the peak values are somewhat lower (around 0.77, down from 0.8); but not all metrics perform worse.
ERR scores improve a bit, and Precision now has a top score of 0.77 as early as cut-off rank 4. The largest overall decline can be seen in both MAP metrics, which perform worst.
The situation is different for R25 relevance (Figure 10.70). At early cut-off ranks (mostly up to rank 6), the PIR scores are even lower than for R23, apart from the MAP scores, which do not perform as badly. In fact, the no-discount MAP performs best at this stage, while NDCG does worst. At about cut-off rank 7, the scores for R23 and R25 are roughly similar; after that, most R25 scores improve, while those for R23 stay approximately the same or decline. The PIR scores at cut-off rank 10 are similar in both cases, except for Precision (lower for R25) and ESL. By ranks 9 and 10, ESL reaches 0.8, which is also the peak value in the six-point relevance evaluation.
How does it come about that the results at cut-off ranks 7 and 8 seem to contribute more towards correct preference predictions than those at ranks 4 and 5? As relevance is defined as “not totally irrelevant” for R25, the only way to distinguish between two result lists is when one has more totally irrelevant results, or (depending on the discount function) when its totally irrelevant results occur earlier. One possible interpretation is that, at earlier ranks, the weak relevance criteria carry little distinguishing power: almost any result may be regarded as “concerning the query to some extent”. After a while, either the number of totally irrelevant results changes (which might be true for the original result lists – but hardly for randomized ones), or the raters’ standards rise with growing acquaintance with the topic at hand and information saturation. The latter explanation cannot be true in this case, since we are dealing with relevance ratings obtained from users who were not exposed to any of the result lists; they saw only single results, in random order, and mostly only two or three results from each query. However, the detailed evaluation of relevance ratings in Section 9.4 (Figure 9.19 in particular) does show some patterns similar to the issue under consideration; it is around rank 7 that the relative frequency of results judged to be absolutely irrelevant stops rising. This means that from rank 7 on, the relevance scores of the original result lists stop declining towards the levels of the randomized result lists. This is but a weak connection, and might be mere coincidence – but at present, I do not see a better-grounded explanation of the results.
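The limited discriminating power of R25 can be made concrete with a small sketch: under R25, only rating 6 maps to non-relevant, so two lists with no totally irrelevant results in their top ranks are indistinguishable there. The ratings and function names below are hypothetical illustrations, not data from the study:

```python
def r25_scores(list_ratings):
    """Binary R25 relevance: only rating 6 ('totally irrelevant') maps to 0."""
    return [0.0 if r == 6 else 1.0 for r in list_ratings]

def precision_at(scores, k):
    """Precision at cut-off rank k: share of relevant results in the top k."""
    return sum(scores[:k]) / k

# Two hypothetical result lists: under R25 their top-4 results are all
# "relevant", so only the rating-6 result at rank 5 separates them.
list_a = r25_scores([2, 4, 3, 5, 6])  # -> [1, 1, 1, 1, 0]
list_b = r25_scores([1, 3, 5, 4, 2])  # -> [1, 1, 1, 1, 1]
assert precision_at(list_a, 4) == precision_at(list_b, 4)  # tie at rank 4
assert precision_at(list_a, 5) < precision_at(list_b, 5)   # list_b wins at rank 5
```

A six-point evaluation would separate these lists at every rank; R25 cannot do so until a totally irrelevant result appears.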
Evaluations using the zero-threshold approach, or informational queries only, produce very similar results. The former have slightly lower but more volatile scores (Figure 10.72), while the scores in the latter are marginally higher (Figure 10.73). Graphs are shown for R23 only since the results for R25 and R21 have the same properties.
For better readability, the Y-scales of different-user rating evaluation graphs run not from 0.5 to 1, as in the same-user rating condition, but from 0.5 to 0.8. Since this time no metric exceeds the PIR score of 0.8, this does not constitute a problem.
Inter-metric PIR comparison using the best-threshold method and binary R23 relevance. Result and preference ratings come from different users.
Inter-metric PIR comparison using the best-threshold method and binary R21 relevance. Result and preference ratings come from different users.
Inter-metric PIR comparison using the best-threshold method and binary R23 relevance for informational queries only. Result and preference ratings come from different users.
All in all, the results of this section’s evaluation suggest that metrics perform better when using six-point relevance ratings than when they use binary relevance. This is less of an issue when the relevance ratings and preference judgments come from different users (which is mostly the case in real-life evaluations), since the absolute PIR values are not that high to begin with. But even in this case, there are differences. If binary relevance is employed for whatever reason, then it is worth the time to provide the raters with guidelines as to what constitutes relevance.
10.7.2 Three-point Relevance

I now proceed to three-point relevance. As discussed above, the two possibilities for converting six-point relevance to three-point relevance are an equal division (1 and 2 as 1.0; 3 and 4 as 0.5; 5 and 6 as 0) or an extreme division (1 as 1.0; 2, 3, 4 and 5 as 0.5; 6 as 0). These possibilities will be examined analogously to the binary relevance conversions in the previous section.
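The two conversions just described can be sketched as follows; the function name and the use of the labels R32 and R31 (which the text introduces below) are my own shorthand:

```python
def to_three_point(rating, scheme):
    """Collapse a six-point rating (1 = best, 6 = worst) to 1.0 / 0.5 / 0.0."""
    if scheme == "R32":    # equal division: 1-2 -> 1.0, 3-4 -> 0.5, 5-6 -> 0.0
        return 1.0 if rating <= 2 else (0.5 if rating <= 4 else 0.0)
    if scheme == "R31":    # extreme division: 1 -> 1.0, 2-5 -> 0.5, 6 -> 0.0
        return 1.0 if rating == 1 else (0.5 if rating <= 5 else 0.0)
    raise ValueError(f"unknown scheme: {scheme}")

ratings = [1, 2, 3, 4, 5, 6]
equal   = [to_three_point(r, "R32") for r in ratings]  # [1.0, 1.0, 0.5, 0.5, 0.0, 0.0]
extreme = [to_three_point(r, "R31") for r in ratings]  # [1.0, 0.5, 0.5, 0.5, 0.5, 0.0]
```

Note that the extreme division assigns the broad middle value 0.5 to four of the six original grades, which is relevant to the flatter graphs discussed below.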
The threshold-based evaluations do not show any marked differences. The rank-based NDCG scores serve as an example, but the situation is similar for other metrics and discounts. The graphs for R32 (Figure 10.74) and R31 (Figure 10.75) show some differences from each other and from the six-point relevance evaluation (Figure 10.5); but the differences are those of degree, not of kind, and consist largely of changes in absolute values, which I shall deal with immediately, in an inter-metric comparison. The same is true for the discount function comparison, exemplified by NDCG in Figure 10.76 and Figure 10.77.
NDCG PIR values without any discount for results at later ranks. R32 relevance (1 and 2 as relevant, 3 and 4 as partially relevant, 5 and 6 as non-relevant) is used.
NDCG PIR scores for different discount functions, with the best-threshold approach. R32 relevance (1 and 2 as relevant, 3 and 4 as partially relevant, 5 and 6 as non-relevant) is used.
With the small-scale issues taken care of, we can now look at the larger picture, that is, the inter-metric PIR comparisons. Once again, I start with a reminder of what the original, six-point relevance evaluation looked like (Figure 10.78). The R32 relevance evaluation shown in Figure 10.79 shows some changes reminiscent of those we saw in the binary relevance evaluations. Most metrics show a more or less pronounced decline in PIR scores, in particular those that had performed best (ESL, NDCG, no-discount MAP). MRR and (at earlier ranks) rank-discount MAP show slightly higher scores than in the six-point relevance evaluation. As in the case of binary relevance, the improvement in the performance of MRR is easily explained. The MRR score is determined by the first occurrence of a result which is not irrelevant. On the six-point relevance scale, this means any result with a rating from 1 to 5; on the R32 scale, this changes to the equivalent of the six-point scale’s 1 to 4 ratings. Thus, the criteria are stricter, and the discriminatory power of MRR rises, albeit not to a point where it can outperform any other metric. Overall, the top PIR scores for R32 relevance are lower than those for six-point relevance by 0.03 to 0.05 points; not much, but still quite noticeable.
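The MRR effect described above can be illustrated directly: with a stricter notion of "not irrelevant", the first qualifying result moves further down the list, and the metric starts to differentiate lists it previously scored identically. A minimal sketch with an invented result list (1 = best, 6 = worst):

```python
def mrr(ratings, is_not_irrelevant):
    """Reciprocal rank of the first result that counts as 'not irrelevant'."""
    for rank, r in enumerate(ratings, start=1):
        if is_not_irrelevant(r):
            return 1.0 / rank
    return 0.0

ratings = [6, 5, 3, 1]  # a hypothetical result list

# Six-point scale: everything rated 1-5 is "not irrelevant" -> first hit at rank 2.
six_point = mrr(ratings, lambda r: r <= 5)  # 0.5
# R32 scale: only ratings 1-4 map above zero -> first hit at rank 3.
r32 = mrr(ratings, lambda r: r <= 4)        # 1/3
```

Under the six-point reading, any list starting with a rating-5 result at rank 2 gets the same MRR; the stricter R32 reading separates such lists, which is the rise in discriminatory power noted above.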
If we are looking for an evaluation which does perform much worse, we do not have to search for long; R31 relevance does the job quite nicely. As Figure 10.80 shows, almost all the scores are significantly lower than even in the R32 condition. The top PIR score (NDCG at rank 8) lies at 0.88, and the other metrics do not reach 0.85. Otherwise, the shape of the lines is quite similar to that of six-point evaluation, and (perhaps because of the broad band of “partly relevant” results) the graph is less erratic.
As the results for informational queries only are very similar to those for all queries, I proceed immediately to the next evaluation. Again, it will be the condition in which relevance ratings and preference judgments come from different users. Though the differences between metrics are smaller than in the same-user condition, this time I will keep the Y-scale at PIR values 0.60 to 1, to show the differences between the two categories of evaluation more clearly.
Even the graph based on the six-point relevance evaluation (Figure 10.81) is much denser than any other previously shown in the present section. With all PIR values in the range from 0.66 to 0.80, it also gives an impression of more evenness. Again, NDCG performs best, followed by Precision and ESL, while ERR and rank-discounted MAP do worst; however, the differences between the peak PIR values of the different metrics do not exceed 0.1. The discrepancies shrink further when we switch to the R32 evaluation (Figure 10.82); except for rank-discounted MAP, whose scores improve, the other metrics’ PIR scores mostly decline.
This is, however, not the case with R31 (Figure 10.83). Some metrics’ PIR scores did decline a bit, especially at earlier cut-off ranks; at cut-off rank 4, all the scores are almost equal. But starting with cut-off rank 6, all metrics’ performances (with the exception of Precision) actually improve. This situation is analogous to that in the R25 evaluation with different users providing relevance ratings and preference judgments (cp. Figure 10.70). An obvious common feature of the two evaluations is the use of only the lowest possible relevance rating as “completely non-relevant”. However, as has been argued in the discussion of the R25 graph, there is no clear reason why this should lead to an increase in PIR scores from this particular cut-off rank on. Thus, this issue remains a riddle for now.
Inter-metric PIR comparison using the best-threshold method and three-point R32 relevance. Result and preference ratings come from different users.
As I have described in Part I, explicit metrics based on users’ relevance ratings of individual results are but one of the two major search engine evaluation methods. The other is based on the data provided more or less unwittingly by the users who go about their usual business of performing web searches. As mentioned in Section 9.1, some of that data has been gathered during the study. The important types of data, the ones that are going to be evaluated in this section, are session duration and clicks.
Most click metrics described in Section 5 are relative metrics; that is, they do not provide a numeric value for individual results or for the result list as a whole, but rather indicate whether one result in the list is better than another. This can be a useful feature if one wants to improve a ranking algorithm: if it works, it provides not only an indication that something is wrong, but also a concrete way of fixing the problem (for example, by swapping the two results’ positions). However, relative log metrics are not easy to evaluate, at least with the methods of this study. These metrics do not show which of two given result lists is better; instead, they take one result list as their input and give an improved version as their output. Therefore, evaluating them would require three steps:
1. letting users interact with a result list in the usual way;
2. using the relative metric to construct a second, presumably superior result list;
3. asking (other) users which of the two result lists is preferable.
This method is not impracticable; but it requires an evaluation structure quite different from what is needed for the explicit (and log-based absolute) metrics which constitute the main body of this study. Therefore, the evaluation of relative, click-based metrics will be left to further studies to undertake.
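Step 2 of the procedure above can be sketched with one very simple relative click heuristic (in the spirit of "a clicked result should outrank the unclicked results above it"); the function and the document identifiers are hypothetical, and real relative metrics from Section 5 are considerably more nuanced:

```python
def promote_clicked(result_list, clicked):
    """Build an 'improved' list by moving clicked results ahead of all
    unclicked ones (a stable partition: order within each group is kept).
    A sketch of step 2 of the three-step evaluation, not a real metric."""
    hits = [r for r in result_list if r in clicked]
    rest = [r for r in result_list if r not in clicked]
    return hits + rest

original = ["d1", "d2", "d3", "d4"]   # list shown to users in step 1
improved = promote_clicked(original, clicked={"d3"})
# improved: ["d3", "d1", "d2", "d4"] - the list that step 3 would pit
# against the original in a user preference judgment
```

The point of the three-step design is visible here: the heuristic's output is a second list, so its quality can only be judged by asking users which of the two lists they prefer, which is precisely the structure this study does not provide.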