# On Search Engine Evaluation Metrics

Inaugural dissertation for the award of the doctoral degree in philosophy (Dr. Phil.), submitted to the Philosophische ...

A further property of the graphs that at first surprised me was the large number of identical data points for different discount functions (for example, NDCG Square and Click-based coinciding at cut-offs 3, 4, and 6), necessitating the detailed verbal descriptions found in Figure 10.8 and Figure 10.9.⁸³ However, this becomes less surprising when one considers that the discount functions start out with the same performance at cut-off 1, and then improve whenever, for a number of sessions, the better result list can be identified correctly. The improvement in PIR score is then always a multiple of one over the overall number of sessions with stated result list preferences. If, for example, we had 50 sessions where a user has indicated a preference for one of the possible result lists, the PIR would always be a multiple of 1/50, or 0.02. If we furthermore consider that the PIR baseline is 0.5, there are only 25 possible PIR scores left. Therefore, it is not really surprising that different discount functions might have the same score. Also, the functions with the same scores tend to be ones with relatively similar discounts (“Square” and “Click-based”, for example, are the two functions which discount most strongly).
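The arithmetic can be sketched in a few lines (a toy illustration; the 50 sessions are the example value from the text above, not the study’s actual session count):

```python
# Toy sketch of the PIR score grid. With N preference sessions, each
# correctly identified preference moves the score up by 1/N from the
# 0.5 baseline, so the attainable scores form a coarse, discrete grid.
N = 50  # example number of sessions with a stated preference

step = 1 / N
scores_above_baseline = [0.5 + k * step for k in range(1, N // 2 + 1)]

print(step)                        # 0.02
print(len(scores_above_baseline))  # 25 possible scores above the 0.5 baseline
```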

Overall, the “No discount”, “log5” and “Square” discounts can be said to perform worse, while “Root”, “log2” and “Rank” do better, with the “Click-based” function falling in between. For the purposes of later inter-metric comparison, “log2” will be used, for two reasons. First, it is one of the more successful functions; though it performs slightly worse than “Rank” and “Root” at cut-off 2, it is better at cut-off 3 and equally good up to cut-off 8. We will have to bear in mind, though, that its PIR scores decline at cut-offs 9 and 10, falling almost 0.08 points below “Rank” at 10. The second reason to nevertheless pick it rather than “Rank” for future prominence is its role in the literature, past and present. As discussed in Section 8.2, log2 is routinely used as the NDCG discount function.

*For me, at least. Perhaps you suspected all along that the results would turn out to be just like this.*

⁸³ And in many of the upcoming graphs, though I omit the descriptions later, once we have a feel for the graphs and the course of each line.

An issue with the data requires more graphs. As stated in Section 9.2, only 31 of the 42 evaluated queries were informational, with the others classified as transactional, navigational, factual, or meta-queries. While the latter categories contain too few queries to be evaluated on their own, they might have disturbed the picture if their PIR score patterns looked significantly different from those of informational queries. Therefore, it is important to examine the issue.

Luckily, Figure 10.10 (log2 PIR threshold graph for informational queries only) and Figure 10.11 (comparison of PIR scores for different discount functions for informational queries only) show such a strong similarity to their respective counterparts (Figure 10.3 and Figure 10.8) that it is unnecessary to reproduce all the graphs in an “informational-only” form.

**Figure 10.10.**

NDCG PIR values for informational queries only, discounted by log2 of result rank.

**Figure 10.11.**

NDCG PIR scores for informational queries only, with the best-threshold approach. As many of the discount functions often produce identical PIR scores, a description is needed. Due to the discount function definitions, the “No discount”, “log5” and “log2” conditions produce the same results for cut-off values 1 and 2, and “No discount” and “log5” have the same values for cut-off values 1 to 5. Furthermore, the “Root” and “Rank” conditions have the same scores for cut-off ranks 1 to 8, and “Square” and “Click-based” coincide for cut-off values 3 to 4. “log2” has the same results as “Root” for cut-off values 4 to 10 and the same results as “Rank” for cut-off values 4 to 8.

## 10.2 Precision

As mentioned in Section 8.2, classical Precision has quite a lot in common with a no-discount version of NDCG (alternatively, one might say standard NDCG has a lot in common with Precision with a log2 discount function). Both metrics rely on a simple addition of relevance scores, which are then normalized. The main difference is that Precision, in dividing the sum of scores by their number, reaches its highest possible score of one when all results are perfectly relevant; NDCG, instead, considers the situation where all known relevant results are presented in the optimal order to be the ideal. Therefore, Precision tends to have lower absolute scores than NDCG.
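To make the normalization difference concrete, here is a small sketch, assuming graded relevance scores in [0, 1] and the standard log2-discounted NDCG; the judgment values are made up for illustration:

```python
import math

def dcg(rels, k):
    # Discounted cumulative gain with the standard log2 discount.
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(rels[:k], start=1))

def ndcg(rels, k):
    # Normalized by the DCG of the ideally reordered result list.
    ideal = sorted(rels, reverse=True)
    return dcg(rels, k) / dcg(ideal, k)

def precision(rels, k):
    # Graded precision: mean relevance over the top-k results.
    return sum(rels[:k]) / k

rels = [1.0, 0.4, 0.0, 0.8, 0.2]  # hypothetical graded judgments

print(round(precision(rels, 5), 2))  # 0.48
print(round(ndcg(rels, 5), 2))       # 0.93, higher than the Precision score
```

Because Precision divides by the number of results while NDCG divides by the best achievable score for the same judgments, NDCG will usually sit higher on the scale, as here.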

At first glance, Figure 10.12, showing the PIR scores of classical Precision depending on cut-off and threshold values, looks significantly different from its NDCG counterpart (Figure 10.1). And the difference becomes extreme when one compares the two measures with steep discounting functions, such as the squared rank (Figure 10.13 versus Figure 10.6). While the no-discount versions appear basically similar, though with a more marked decrease in PIR scores at higher threshold values, the strongly discounted Precision graph shows not just a strong PIR decline: every line showing PIR scores for cut-off values of 5 and above hits the 0.5 baseline. This means that at this point, Precision performs no better than chance at selecting the preferred result list.

However, this is easily explained by the differences in absolute values mentioned above. If, for example, the NDCG scores for all queries tend to lie in the interval between 0.2 and 0.9, while the Precision scores lie between 0.1 and 0.5, then the maximum differences are 0.7 and 0.4, respectively. With the Precision differences spread over a shorter score span, the same threshold steps will provide coarser distinctions between result lists; in the above example, any threshold above 0.4 will fail to distinguish between any result lists. The effect is much stronger for high-discount versions, where the discounts reduce the score range even further. One could change this situation by choosing a finer threshold step for Precision, and by selecting a different denominator for discounted Precision metrics to compensate; however, I think that this would combat a non-existent problem.
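A quick sketch of this range effect, using the hypothetical intervals from the example above:

```python
# With a fixed grid of threshold steps, a metric whose scores span a
# narrower range survives fewer of those steps before all result-list
# pairs become indistinguishable.
thresholds = [t / 10 for t in range(11)]  # 0.0, 0.1, ..., 1.0

ndcg_gap = 0.9 - 0.2       # maximum score difference for the NDCG-like range
precision_gap = 0.5 - 0.1  # maximum difference for the Precision-like range

# Count how many threshold steps still let the gap register a preference
# (a tiny epsilon guards against floating-point ties).
ndcg_steps = sum(1 for t in thresholds if ndcg_gap > t + 1e-9)
precision_steps = sum(1 for t in thresholds if precision_gap > t + 1e-9)

print(ndcg_steps, precision_steps)  # 7 4
```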

To see why the threshold issues do not matter much, we will examine two Precision PIR graphs comparing different discount functions, just as we did with NDCG. Figure 10.14, using the best-threshold approach, is very similar to its NDCG counterpart (Figure 10.8).

There are a few minor differences (mostly some PIR scores being slightly lower in the Precision graph), but the overall picture is clearly the same. And when we compare the graphs where threshold values of 0 are used (Figure 10.15 and Figure 10.9), we see that they are completely identical. This should not really be surprising: if no threshold is used, any difference between two result lists is considered significant, and the magnitude of the absolute scores does not matter any more. For all future purposes, then, Precision and NDCG can be regarded as the same metric with different discount functions. I will still refer to them by their familiar names, but the connection should be kept in mind. For the later inter-metric comparison, I will also show classical Precision as a separate metric, for the simple reason that it is an all-time favourite and is still used in evaluations. It will be interesting and instructive to see how it performs against more modern metrics.
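The identity at a zero threshold can be checked directly: both metrics divide the same discounted relevance sum by a quantity that is constant across the two result lists of one query (the list length for Precision, the ideal DCG for NDCG), so the sign of the score difference never changes. A small sketch with invented judgments:

```python
import math

def dcg(rels, discount):
    # Discounted relevance sum for one result list.
    return sum(rel / discount(rank) for rank, rel in enumerate(rels, start=1))

def preference(a, b):
    # +1, 0 or -1: which result list the metric prefers.
    return (a > b) - (a < b)

discount = lambda r: math.log2(r + 1)  # the standard log2 discount

# Two hypothetical result lists for the same query (same judgment pool).
list_a = [1.0, 0.6, 0.0, 0.4]
list_b = [0.6, 1.0, 0.4, 0.0]

k = len(list_a)
ideal = dcg(sorted(list_a, reverse=True), discount)  # identical for both lists

prec_pref = preference(dcg(list_a, discount) / k, dcg(list_b, discount) / k)
ndcg_pref = preference(dcg(list_a, discount) / ideal, dcg(list_b, discount) / ideal)

assert prec_pref == ndcg_pref  # at t=0, both pick the same preferred list
```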

**Figure 10.15.**

Precision PIR scores for different discount functions, with the threshold values set to zero. The graph is identical to the one in Figure 10.9 (NDCG PIR scores with t=0).

Formula 10.1. MAP formula with queries Q, relevant documents R and documents D (at rank r). rel is a relevance function assigning 1 to relevant and 0 to non-relevant results, or, in this study, one of six values in the range from 1 to 0.

Formula 10.2. Modified MAP formula with queries Q, relevant documents R and documents D (at rank r). rel is a relevance function assigning 1 to relevant and 0 to non-relevant results, or, in this study, one of six values in the range from 1 to 0. disc(r) is a discount function depending on the rank r.
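The formula bodies themselves have not survived extraction. Based purely on the two captions and the discussion below them, and assuming the study follows the standard graded-MAP form, a reconstruction might look as follows (the symbols Q, R, D, rel and disc(r) are taken from the captions):

```latex
% Formula 10.1 (reconstruction): graded MAP over the query set Q
\mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{|R_q|}
    \sum_{r=1}^{|D|} rel(d_r) \cdot \frac{\sum_{i=1}^{r} rel(d_i)}{r}

% Formula 10.2 (reconstruction): the rank denominator replaced by disc(r)
\mathrm{MAP}_{disc} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{|R_q|}
    \sum_{r=1}^{|D|} rel(d_r) \cdot \frac{\sum_{i=1}^{r} rel(d_i)}{disc(r)}
```

This matches the discussion below: each addend is a cumulative relevance gain weighted by the relevance at rank r (so the relevance function occurs twice), and disc(r) = r recovers Formula 10.1.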

Note that it is not the relevance of later results that is being discounted, but rather the importance of later addends. In a way, the formula is not really using Precision any more; the addends (which are later averaged) may range from a simple sum of relevance judgment sums (if no discount is used) to a sum of cumulative gains discounted, for later addends, by the square of their rank. If the function discounts by rank, it is obviously the same as standard MAP.⁸⁵ Also note that for disc(r) ≠ r, MAP does not necessarily fall into the [0..1] range. This is not a problem for the present study, for two reasons. Firstly, this would only have a significant effect for consistently high relevance ratings on all ranks, which hardly ever happens. Secondly, even where MAP scores do exceed 1, they will do so for both result lists of a comparison, thereby preserving the preference.⁸⁶

As mentioned before, for our purposes (without cross-query comparison), MAP is the same as Average Precision. I will use the more familiar term MAP throughout this work.

The two graphs showing MAP by threshold are somewhere between those for NDCG and those for Precision. Both Figure 10.16 (no discount) and Figure 10.17 (discount by rank, that is, traditional MAP) show a marked PIR decline at higher thresholds, though it is not as steep as that of the Precision PIR graphs. Otherwise, they are rather unspectacular, which is why I omit the detailed threshold graphs for the other five discount functions and proceed straight to the comparative graphs of the different discount metrics.

⁸⁵ Well... not quite. In this first batch of evaluations, a six-point relevance scale is used. While this usually only influences the metrics by providing a more faceted relevance distribution, AP is different. The relevance function occurs twice in Formula 10.2; the second time, it acts as a multiplier for the entire precision fraction. Thus, its influence is higher than for other metrics. This issue will be dealt with in Section 10.7, which concerns itself with relevance scales.

⁸⁶ The only problem could be threshold steps, which might be too small to reach an optimum level. However, the graphs in this section clearly show that this is not the case.
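The note about scores exceeding 1 for disc(r) ≠ r can be illustrated with a minimal sketch of the per-query addends from Formula 10.2 (the relevance values are invented; disc(r) = r recovers traditional MAP/AP):

```python
# Per-query modified AP: each addend is the relevance at rank r times the
# cumulative gain up to r, divided by disc(r); the total is averaged over
# the number of relevant documents.
def modified_ap(rels, disc, n_relevant):
    total, cumulative = 0.0, 0.0
    for rank, rel in enumerate(rels, start=1):
        cumulative += rel
        total += rel * cumulative / disc(rank)
    return total / n_relevant

rels = [1.0, 1.0, 1.0]  # consistently high ratings on all ranks

traditional = modified_ap(rels, lambda r: r, n_relevant=3)     # disc(r) = r
undiscounted = modified_ap(rels, lambda r: 1.0, n_relevant=3)  # disc(r) = 1

print(traditional)   # 1.0
print(undiscounted)  # 2.0 -- outside the [0..1] range
```

As the text argues, this only happens for uniformly high ratings, and since both compared result lists are affected alike, the preference between them is preserved.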

With MAP, many results are analogous to those of NDCG. The graphs using the t=0 and the best-threshold approach are quite similar; and the peak scores lie at about 0.92 once again.

The tendency of PIR scores to decline after a peak has been reached is less pronounced; it only happens at the very last rank, and only with two curves. These two discount functions, “No discount” and “log5”, are also the best-performing ones. The equivalent of traditional AP, the “Rank” discount, lies in the middle, and the only discount function it consistently outperforms is the “Square” discount. All in all, the tendency in this case seems to be for more steeply discounting functions to perform worse, while the tendency seemed to go in the opposite direction in the NDCG evaluations.

**Figure 10.16.**

MAP PIR values without any discount for results at later ranks.

**Figure 10.18.**

MAP PIR scores for different discount functions, with the best-threshold approach. MAP Rank is the traditional MAP metric. As many of the discount functions often produce identical PIR scores, a description is needed. Due to the discount function definitions, the “No discount”, “log5” and “log2” conditions produce the same

**Figure 10.19.**

MAP PIR scores for different discount functions, with the threshold values set to zero. Again, MAP Rank is the traditional MAP metric.

Formula 10.3. MAP calculation for the “No discount” version.

As with NDCG, I would like to compare these results with those for informational queries only, shown in Figure 10.20. The difference from Figure 10.18 (MAP PIR for all types of queries) can be discerned with the naked eye; while the overall shapes of the individual lines are similar, they lie consistently higher for the informational queries. The reason for this is not quite clear. As returning the few navigational, factual and meta-queries to the graph does not change the scores in any way, the only remaining influence is that of the transactional queries. The main difference between informational and transactional queries is that, while with the former the user is generally supposed to have an information need to be satisfied by multiple sources, the “transactional need” in the latter case is often satisfied by the first good result. By

**Figure 10.20.**

MAP PIR scores for informational queries only, with the best-threshold approach.

## 10.4 Other Metrics

With the more popular metrics safely tucked away, we can now turn towards others, which have not been used as extensively in evaluations. These metrics sound promising on paper, and it is not easy to tell whether they have not been used more often because other researchers found some intrinsic flaws in them, because they did not see why they should abandon more familiar metrics, or perhaps just because of some unlucky circumstances.

Formula 10.4. ERR formula. For each rank r, the probabilities of a relevant result at each earlier rank i are multiplied; the inverse probability is used as a damping factor for the gain at the current rank.
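The formula body is missing from the extraction; going by the caption and the standard definition of Expected Reciprocal Rank, the intended formula is presumably:

```latex
% ERR (reconstruction): R_i is the probability that the result at rank i
% satisfies the user; satisfaction at earlier ranks damps later gains.
\mathrm{ERR} = \sum_{r=1}^{n} \frac{1}{r} \, R_r \prod_{i=1}^{r-1} \left(1 - R_i\right)
```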