On Search Engine Evaluation Metrics. Inaugural dissertation for the degree of Doctor of Philosophy (Dr. Phil.) through the Philosophische ...
rather, most) users will find satisfactory results even if they have to look longer for them in a randomized list; the average user satisfaction was 0.79 for the original and 0.65 for the randomized lists. We take this to be a result of the same kind as that found by researchers who observed that “searchers using […] degraded systems are as successful as those using the standard system, but that, in achieving this success, they alter their behavior” (Smith and Kantor 2008, p. 147). The results might also be connected to the fact that binary satisfaction was used for the evaluation; as is the case with other measures like relevance, the threshold for a good rating is probably quite low when the user is presented with only two options. This, in its turn, suggests that retrieval system operators might want to concentrate on improving the weak points in their systems rather than trying to further enhance their strengths, as users seem to be satisfied with a relatively low standard. Of course, this is only one possible interpretation, which needs to be thoroughly tested in larger studies before any real practical recommendations can be made on its basis.
Figure 9.17. User satisfaction with the original and randomized result lists. The numbers indicate the percentage of queries for which mean user satisfaction lies in the bands 0, 0.01-0.25, 0.26-0.50, 0.51-0.75 and 0.76-1.
Finally, I want to look at the relevance distributions for the two result list types. Figure 9.18 shows the macro-relevance; that is, the relevance ratings for individual results are averaged, and then these average ratings are averaged again for each rank. The first thing to catch the beholder’s eye is the narrow range of relevance scores. The best score (original result lists at rank one) is just below 3; all other ratings fall in the strip between just below 3.5 and just below 4.5. That is, all positions but one have relevance scores worse than the theoretical average of 3.5. This fits well with the notion described above, namely, that it does not require top-relevance results (or even a single top-relevance result) to make a user’s search experience satisfactory. The second issue with this graph concerns the shapes of the two lines. The randomized result list has values in the range from 4.2 to 4.6, which do not seem to depend in any meaningful way on the rank. This is entirely expected since the results have been
randomized. It can also be reasoned that, since the randomized ratings stem from results anywhere in the original top 50, the score of 4.2 to 4.6 is also the average relevance of those first fifty results. The relevance ratings for the original result lists do show a clear direction;
they start off (relatively) high and decline towards the later ranks. However, from rank 5 on, when the average relevance falls below 4, the decline is not very pronounced, and the values behave similarly to those for the randomized result lists. The simplest interpretation of this data is that the ranking algorithm is really effective only for the first four or five ranks, and after that does no better than chance.
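To make the two-step macro-relevance averaging concrete, here is a minimal sketch: first the individual judgments for each result are averaged, then those per-result averages are averaged across queries for each rank. The data layout, function name and rating scale (1 = highest relevance) are illustrative assumptions, not the study’s actual data:

```python
from statistics import mean

def macro_relevance_by_rank(ratings):
    """ratings[q][r] is the list of individual relevance judgments
    (1 = best) for the result shown at rank r of query q."""
    n_ranks = len(ratings[0])
    per_rank = []
    for r in range(n_ranks):
        # First average the judgments for each individual result...
        per_result = [mean(query[r]) for query in ratings]
        # ...then average those averages across queries for this rank.
        per_rank.append(mean(per_result))
    return per_rank

# Two toy queries, three ranks, two raters per result.
ratings = [
    [[1, 2], [3, 4], [5, 5]],
    [[2, 2], [4, 4], [4, 6]],
]
print(macro_relevance_by_rank(ratings))  # [1.75, 3.75, 5]
```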
Figure 9.18. Macro-relevance at individual ranks for the original and randomized result lists. The randomized list relevance stays fairly stable, with fluctuations which can be attributed to chance.
Figure 9.19 and Figure 9.20 provide a more detailed view of the original and randomized result lists, respectively. They show the precise distribution of relevance ratings by rank. In the original result list, we see that the share of high-relevance results declines towards the end.
However, there are some details that need to be kept in mind. Firstly, the decline in highest-relevance results (relevance score 1) more or less ends at rank 7, staying much the same after that. Secondly, it is mostly that relevance band (and to a lesser extent relevance score 2) that declines in frequency. The relative frequencies of average-relevance scores (3 and 4) stay more or less the same, while the lower scores’ frequencies increase, as the counterpart to the decline of high-relevance ratings. These tendencies correspond well to the macro-relevance evaluation described in the previous paragraph.
The relevance evaluations for the randomized result lists are, as could be expected, less telling. There are no obvious trends depending on the rank. It is perhaps worth noting that here, even more than in the evaluation of the original result lists, we see stable relative frequencies for mid-range relevance ratings. Almost all the changes occur in the highest and lowest ratings.
In this section, I will describe how well many of the explicit metrics described in Section 4 perform. We will start by considering the effect of different thresholds and discounting functions (combined with cut-off value changes) on a metric’s performance (Sections 10.1-10.4). Then we will compare the different metrics themselves (Section 10.5).
After that, the influence of using different raters for different judgments will be considered (Section 10.6). Finally, we will look at the effects of choosing different judgment scales, bearing in mind the previous findings (Section 10.7).
10.1 (N)DCG

I shall start with (N)DCG, which uses a somewhat unusual standard discount; while most other metrics normally discount by rank, the authors of DCG use the binary logarithm as their denominator of choice, while indicating that it might not always be the best option. I will use NDCG instead of DCG. As the PIR is always concerned with aggregating individual queries, both metrics produce the same relative scores; however, NDCG falls into the more familiar and uniform 0..1 range, while DCG scores can potentially be arbitrarily high.
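To make the discounting concrete, here is a minimal sketch of (N)DCG in the style of the original definition, where the binary-logarithm discount stays inactive at ranks 1 and 2 because log₂(rank) is below 1 there. The gain values and helper names are illustrative assumptions, not the exact implementation evaluated in this study:

```python
import math

def log2_discount(rank):
    # log2(rank) < 1 for ranks 1 and 2, so those ranks stay undiscounted.
    return max(1.0, math.log2(rank))

def dcg(gains, discount=log2_discount):
    """Discounted cumulative gain: each rank's gain divided by its discount."""
    return sum(g / discount(r) for r, g in enumerate(gains, start=1))

def ndcg(gains, discount=log2_discount):
    """Normalize by the DCG of the ideally reordered list; unlike raw DCG,
    the result always falls into the familiar 0..1 range."""
    ideal = dcg(sorted(gains, reverse=True), discount)
    return dcg(gains, discount) / ideal if ideal else 0.0

print(round(ndcg([3, 2, 3, 0, 1]), 3))  # 0.944
```

Note that a perfectly ordered list always scores exactly 1, which is what makes NDCG values comparable across queries with different numbers of relevant results.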
Figure 10.1 to Figure 10.7 show our first real PIR graphs. Each Y-axis shows the PIR scores, and is scaled from 0.5 (the baseline, equivalent to guessing) to 1 (every single user preference can be correctly induced from the metric in question). The figures show PIR values for NDCG using the seven discount functions listed above. There are quite a few interesting observations to be made about the differences between, as well as the similarities among, the graphs; however, most of those are better viewed on a comparison graph, which is presented later. I will start with the discussion of the properties of Figure 10.1 (showing NDCG PIR scores without any discounting), and work down from there; this will hopefully make for a clearer picture.
Probably the first thing an observer sees is that the graph is not very intuitive. However, by concentrating on specific questions, one can read quite a lot from it. My primary objective in this particular evaluation was to see which threshold value performed best for each NDCG discount function. This means we should look at all lines to see whether we can find regularities in their runs. The first impression is that, while the lines go up and down a bit, in general the PIR values for smaller thresholds tend to be higher, and those for the largest threshold (0.30) are almost universally the lowest of all.[80] Table 10.1 makes this situation immediately comprehensible; it shows different average measures for the thresholds which have the highest PIR scores. Five out of seven median values and six out of seven mode values have a threshold of 0 as the best-performing one; and four out of seven average scores are below the lowest non-zero threshold, 0.01.[81] While the 0.01 threshold itself might perform slightly better, the zero threshold provides a useful counterpart to the best-threshold approach for PIR comparison as discussed in Section 8.2.1. It makes the evaluation simpler by leaving thresholds entirely out of the equation; and if it underestimates the best possible PIR scores (which might be achieved by using, say, a threshold of 0.01, or perhaps some even smaller number), it provides a lower boundary as opposed to the best-threshold approach’s upper boundary for PIR scores. In later sections, we will see that this holds true not just for (N)DCG, but also for most other metrics.

[80] Keep in mind that values larger than 0.30 are not pictured, since they failed to produce any increase in PIR.
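The interaction between thresholds and PIR can be sketched as follows. The metric pronounces a preference between two result lists only when its score difference exceeds the threshold; undecided cases count as coin flips worth 0.5, which is what produces the 0.5 guessing baseline. The tie handling and the toy data here are illustrative assumptions; the exact procedure is the one defined in Section 8.2.1:

```python
def pir(score_pairs, preferences, threshold=0.0):
    """Fraction of sessions whose preferred result list can be induced
    from the metric.  preferences: +1 if the user preferred the first
    list of a pair, -1 if the second.  Score differences within the
    threshold are treated as ties and scored as coin flips (0.5)."""
    total = 0.0
    for (a, b), pref in zip(score_pairs, preferences):
        diff = a - b
        if abs(diff) <= threshold:
            total += 0.5              # metric sees a tie: guess
        elif (diff > 0) == (pref > 0):
            total += 1.0              # preference correctly induced
    return total / len(preferences)

pairs = [(0.9, 0.4), (0.52, 0.50), (0.3, 0.8), (0.6, 0.1)]
prefs = [+1, -1, -1, +1]
print(pir(pairs, prefs, threshold=0.0))   # 0.75: the 0.02 gap decides wrongly
print(pir(pairs, prefs, threshold=0.05))  # 0.875: that gap becomes a tie
```

In this toy example a non-zero threshold helps, because the one tiny score difference happens to point the wrong way; with t=0 every difference, however small, forces a decision.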
The next graphs show other discounts, which get progressively steeper. You will note that the curves follow this trend and become steeper themselves; that is, larger threshold values have more rapidly declining PIR scores for steeper discounts.
Figure 10.1. NDCG PIR values without any discount for results at later ranks.
[81] The numbers are on the low side since, in the cases when multiple thresholds have the same PIR, the lowest one has been used for the calculation. I feel this method is acceptable since the important point is that, in this situation, lower thresholds rarely produce lower PIR scores, not that they always produce higher ones.
Figure 10.7. NDCG PIR values discounted by the frequency of clicks on results of a particular rank according to Hotchkiss, Alston and Edwards (2005).
There are more things to be seen in the graphs; however, they concern not the changes brought on by different thresholds, but rather the influence of cut-off values and discount functions. Thus, they are more visible, and better discussed, in the context of inter-metric PIR evaluation, as shown in Figure 10.8 and Figure 10.9.
The first point I would like to make about these two graphs is their similarity. At first glance, it might not even be obvious that they are different at all. For this reason, I will discuss only the best-threshold results; everything said about them applies equally well to the t=0 graph.
Figure 10.8. NDCG PIR scores for different discount functions, with the best-threshold approach. As many of the discount functions often produce identical PIR scores, a description is needed. Due to the discount function definitions, the “No discount”, “log5” and “log2” conditions produce the same results for cut-off values 1 and 2, and “No discount” and “log5” have the same values for cut-off values 1 to 5. Furthermore, the “Root” and “Rank” conditions have the same scores for cut-off ranks 1 to 8, and “Square” and “Click-based” coincide for cut-off values 3 to 4. “log2” has the same results as “Root” for cut-off values 4 to 10 and the same results as “Rank” for cut-off values 4 to 8.
Figure 10.9. NDCG PIR scores for different discount functions, with the threshold values set to zero. As many of the discount functions often produce identical PIR scores, a description is needed. Due to the discount function definitions, the “No discount”, “log5” and “log2” conditions produce the same results for cut-off values 1 and 2, and “No discount” and “log5” have the same values for cut-off values 1 to 5. Furthermore, the “Root” and “Rank” conditions have the same scores for cut-off ranks 3 to 8, and “Square” and “Click-based” coincide for cut-off values 3 to 9. “log2” has the same results as “Root” for cut-off values 4 to 10 and the same results as “Rank” for cut-off values 4 to 8.
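The coincidences noted in these captions follow directly from the discount definitions: a logarithmic discount only starts to bite once log_b(rank) exceeds 1, i.e. beyond rank b. The following side-by-side sketch uses best-guess reconstructions of six of the seven discount divisors (the exact formulas are my assumptions; the click-based discount is omitted because it relies on click-frequency data from Hotchkiss, Alston and Edwards (2005) not reproduced here):

```python
import math

# Best-guess reconstructions of the discount divisors applied to the
# gain at each rank; the study's exact definitions may differ.
discounts = {
    "No discount": lambda r: 1.0,
    "log5":        lambda r: max(1.0, math.log(r, 5)),
    "log2":        lambda r: max(1.0, math.log2(r)),
    "Root":        lambda r: math.sqrt(r),
    "Rank":        lambda r: float(r),
    "Square":      lambda r: float(r * r),
}

# log2 equals "No discount" up to rank 2, log5 up to rank 5 - which is
# why those conditions produce identical PIR scores at small cut-offs.
for rank in (1, 2, 5, 10):
    print(rank, {name: round(f(rank), 2) for name, f in discounts.items()})
```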
The second issue concerns the overall absolute values. The best PIR scores are slightly higher than 0.92; this means that for over 92% of all sessions the user would get the result list he preferred if the result list choice is based on the NDCG scores. This is a surprisingly high value. However, there are a few factors that put the numbers in perspective. While for 92% of sessions the users get their preferred result lists, for 8% they get a result list they consider inferior. In other words, one in twelve sessions still results in a choice of a result list which the user finds worse. Furthermore, the relevance ratings on which the NDCG calculations are based come from the same user as the preference ratings with which they are compared.
Having the user rate single results for the very query he is interested in is a luxury which no search system provider is likely to be able to afford. This and other interpretations will be discussed in more detail in Section 10.6.
The third issue is the general shape of the different discount function curves. All curves start at the same PIR score of approximately 0.7; as no discount is ever applied at rank 1, the PIR scores for a cut-off value of 1 will always be the same for all discount functions. They rise when the cut-off value is increased; no surprises there – the amount of information on which to base NDCG ratings increases, and with it the quality of preference predictions. What
is surprising, however,[82] is that the highest PIR scores are reached by rank six or seven by every discount function except for the click-based one. After that, the scores stay the same, and often even decline, sometimes quite significantly (for example, from a peak PIR score of 0.9 at rank 6 for the “No discount” function to just over 0.75 at rank 10). This implies that an increased amount of information can actually lead to lower-quality metric performance. We will see this trend in most other metrics as well; the possible reasons and implications will be discussed in detail in Sections 10.5 and 12.2.5.
The final issue is the one which the graphs were originally intended to clarify: how do the different discount functions perform in comparison to each other? Again, one thing that catches the eye is the relative similarity of the graph lines. The PIR scores peak between 0.9 and slightly over 0.92, and up to a cut-off value of 8, there are no PIR differences larger than 0.055 (to put flesh on the numbers once again: this is equivalent to ca. 5% of sessions displaying the better result list instead of the worse one). This difference is not negligible, of course, but neither is it gaping. Only after that, when the “No discount” and “log5” functions’ PIR scores decline sharply at cut-off values 9 and 10, do the differences become really large.