# On Search Engine Evaluation Metrics. Inaugural Dissertation for the Degree of Doctor of Philosophy (Dr. Phil.) at the Faculty of Philosophy ...

The first of these metrics is Expected Reciprocal Rank (ERR). To recapitulate briefly, its outstanding feature is a built-in user model which assumes that the more relevant results the user has already found, the less important the relevance of further results becomes; Formula 10.4 shows the ERR calculation with a flexible discount function disc(r). This is only a possible explanation, and a post-hoc one at that. Theoretically, it could be argued with equal logic that the increased influence of any single relevant result is most visible in cases where there are many such relevant results, as the scores add up faster. Clearly, more research is needed on this topic.
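The cascade user model behind ERR can be sketched in a few lines of code. This is a minimal illustration, not the dissertation's own implementation: it assumes relevance grades have already been mapped to per-rank stopping probabilities in [0, 1], and takes the discount as a pluggable function, with disc(r) = r recovering the standard metric.

```python
def err(stop_probs, disc=lambda r: r):
    """Expected Reciprocal Rank with a pluggable discount disc(r).

    stop_probs: per-rank probabilities that the result satisfies the
    user (derived from graded relevance); disc(r) = r gives standard ERR.
    """
    score = 0.0
    p_reach = 1.0  # probability the user examines rank r at all
    for r, p in enumerate(stop_probs, start=1):
        score += p_reach * p / disc(r)  # weighted by the chance of reaching r
        p_reach *= (1.0 - p)            # user continues only if unsatisfied
    return score
```

The more relevant results the user has seen already, the smaller `p_reach` becomes, and the less further results can contribute, which is exactly the built-in user model described above.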

**Figure 10.21.** ERR PIR values with discount for result rank.

The threshold-dependent graph in Figure 10.21 (discounted by rank, which corresponds to the standard ERR metric formula) is typical for ERR, so it will be the only one of its kind shown.

It shows the by now familiar pattern of PIR scores declining with increasing threshold values; the decline is steeper than in NDCG, but not as steep as in MAP.

In the comparative graph (Figure 10.22), ERR also offers a picture similar to what we have seen with the previous metrics. All discount functions have their peak at rank 6 or 7, and their scores start to decline at the last ranks. However, it is harder to discern a pattern regarding the different discounts’ relative performance. While high-discount functions performed best in NDCG, and low-discount functions in MAP, there is no clear regularity in ERR. While most low-discount functions do perform worse than average, the log2 discount has some of the highest scores (at least up to rank 6). The rank-based discount (used in standard ERR) lies quite reliably in the middle of the group over all cut-off ranks.

The t=0 graph is omitted, since it shows the same features as previous t=0 graphs, being almost the same as the corresponding best-threshold graph, with minimally lower scores.

**Figure 10.22.** ERR PIR scores for different discount functions, with the best-threshold approach. ERR Rank is the traditional ERR metric. As many of the discount functions often produce identical PIR scores, a description is needed.

Due to the discount function definitions, the “No discount”, “log5” and “log2” conditions produce the same results for cut-off values 1 and 2, and “No discount” and “log5” have the same values for cut-off values 1 to 5. On the non-self-evident side, “No discount” and “log5” coincide further (up to rank 9), while “Rank” and “Click-based” values are the same from rank 4 onwards. “Square” and “Click-based” scores are the same for ranks 1-6. Also, “log2” and “Click-based” coincide at ranks 6 to 10.
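The first of these coincidences follows directly from the shape of the discount functions. A sketch of plausible definitions (the exact formulations are my assumption, inferred from the behaviour described above, with logarithmic discounts floored at 1):

```python
import math

def no_discount(r):
    return 1.0

def log_discount(base):
    # Floored at 1 so early ranks are not boosted: log_b(r) < 1 for r < base.
    return lambda r: max(1.0, math.log(r, base))

def rank_discount(r):
    return float(r)

def square_discount(r):
    return float(r * r)

# Why "No discount", "log5" and "log2" coincide at ranks 1 and 2:
# log2(r) <= 1 for r <= 2, and log5(r) <= 1 for r <= 5, so the floor
# makes both logarithmic discounts identical to no discount there.
```

Under these definitions, any metric evaluated at cut-off 2 cannot distinguish the three conditions, and “log5” remains indistinguishable from “No discount” up to cut-off 5.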

As with MAP, ERR PIR scores increase if only informational queries are evaluated (Figure 10.23). Here, as with MAP, the argument could be made that the metric was to be expected to perform better for this type of query; after all, ERR discounts for previously found results, and informational queries are the ones where multiple results can be expected to be needed. However, as with MAP, the logic might have run the opposite way had the results been different. After all, the discount for highly relevant previous results is warranted if it means later results are not very useful anymore – which is precisely the case with navigational, fact-based and (perhaps to a lesser extent) transactional queries.

Another little-used metric is Mean Reciprocal Rank (MRR), defined as the reciprocal of the first rank at which a relevant result is found. The discount comparison graph (Figure 10.24) differs from the ones we have encountered before in a few respects. First, it covers a narrow range of PIR scores; the difference between the highest and the lowest is below 0.15 points. Secondly, the scores are very low in absolute terms; the peak is below 0.65, and the lowest scores lie barely above the baseline of 0.5. Thirdly, while most previous graphs showed PIR scores rising (mostly until ranks 6-8) and then declining, MRR PIR scores hardly rise at all. The high-discount functions’ scores rise from 0.63 at rank one by just about 0.03 points; the low-discount functions’ scores decline sharply after rank one. And finally, the Root, Rank, Square and Click-based discounts have precisely the same PIR scores.
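The definition is simple enough to state directly in code; a minimal sketch (the function name and the cut-off handling are my own, not taken from the dissertation):

```python
def reciprocal_rank(rels, cutoff=10):
    """1 / (first rank holding a relevant result); 0 if none within cutoff.

    rels: per-rank relevance scores; anything above 0 counts as relevant,
    matching the broad definition of relevance used in this evaluation.
    """
    for r, rel in enumerate(rels[:cutoff], start=1):
        if rel > 0:
            return 1.0 / r
    return 0.0

# MRR is then the mean of reciprocal_rank over a set of queries.
```

Note that only the position of the first relevant result matters; everything after it is ignored, which is the root of the discrimination problems discussed next.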

Fortunately, there is an obvious explanation for all of these phenomena. In preference terms, MRR selects the better result by looking at the result lists and picking the one where the first relevant result lies at an earlier rank. Therefore, it is irrelevant how strong the discount is, as long as there is any discount at all. When there is none (as in the log2, log5 and especially no-discount conditions), the discriminatory power of the metric actually diminishes; the no-discount condition at cut-off rank 10, for example, can only differentiate between two result lists if one of them has at least one relevant result in the top ten ranks and the other has none.

The problem is aggravated by the broad definition of “relevant” used in the current batch of evaluations; any result which is not completely useless counts. We will examine the influence of different definitions of “relevance” in Section 10.7.

**Figure 10.24.** MRR PIR scores for different discount functions, with the best-threshold approach. MRR Rank is the traditional MRR metric. As many of the discount functions often produce identical PIR scores, a description is needed. Due to the discount function definitions, the “No discount”, “log5” and “log2” conditions produce the same results for cut-off values 1 and 2, and “No discount” and “log5” have the same values for cut-off values 1 to 5. “Root”, “Rank”, “Square” and “Click-based” have the same scores at all cut-off ranks.

As mentioned in Section 4.3, a more general analogue of MRR is the Expected Search Length (ESL), defined as the number of non-relevant results a user has to examine before finding n relevant ones. With non-binary relevance, and a need for the metric’s scores to fall into the usual 0..1 range as well as to be usable with a discount function, a modified definition is needed. It is introduced in Formula 10.5, which attempts to capture a relatively simple concept. Unfortunately, it seems to lose quite a lot of simplicity, but hopefully none of the concept.

$$ESL = 1 - \frac{r_n - \sum_{i=1}^{r_n} \frac{rel(i)}{disc(i)}}{c}$$

**Formula 10.5.** Expected Search Length adapted for graded relevance and discount. r_n is the rank at which the sum of single result relevance scores reaches a threshold n. The relevance scores are obtained by dividing the user relevance score rel(i) for rank i by a discount disc(i) dependent on the same rank. c is the cut-off value used.

The numerator reflects the number of non-relevant results a user has to examine before satisfying his information need. For this, we take the number of all results examined (the rank r_n) and subtract the relevant results. This number is then divided by the cut-off value to normalize it, since the worst case is having nothing but non-relevant results; in that case, the numerator reaches its maximum of c, and the score is 0. The metric given in Formula 10.5 might more appropriately be called “Normalized Expected Search Length”; however, for simplicity’s sake, I will continue to refer to it as ESL.

However, as we have a multi-graded relevance scale, instead of simply counting results which have any relevance, I opted for using the sum of single relevance scores. In this way, a result list which provides perfectly relevant results until the user’s information need is satisfied still receives an optimal score of 1, one without any relevant results receives a score of 0, and one providing only results with relevance 0.5 would also get an ESL score of 0.5. That is, at least, if no discount is used; otherwise, the score is lower. It might not be intuitive (or advisable) to use a discount for ESL, as the rank r_n already provides some discount – if good results appear earlier, the information need is satisfied faster. However, a few good results up front and one at the end are presumably better than all good results being stuck at the lower end of the result list. Another objection against using a discount is the very nature of ESL, which assumes rather plausibly that the user looks for information until he is satisfied, and then stops. In this case, he either sees the result or he does not; thus, either no discount should apply (before r_n), or the results should not be regarded at all (after r_n). However, this model does not take into account that the user may abandon his search before fully satisfying his information need; from this viewpoint, a discount is a factor incorporating the increasing likelihood of query abortion at later ranks. Ultimately, as usual, it will be up to the evaluation to show which method works best; and we always have the no-discount function to fall back on if discounts turn out to be a bad move.
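The computation can be sketched as follows. This is one possible reading of the definition above, not the dissertation's own code; in particular, defaulting r_n to the cut-off when the threshold n is never reached is my assumption.

```python
def esl(rels, n=1.0, cutoff=10, disc=lambda r: 1.0):
    """Normalized Expected Search Length with graded relevance and discount.

    rels: per-rank relevance scores in [0, 1].
    n: required cumulated relevance; r_n is the rank where it is reached.
    """
    gained = 0.0
    r_n = cutoff  # if n is never reached, the user examines all c results
    for r, rel in enumerate(rels[:cutoff], start=1):
        gained += rel / disc(r)
        if gained >= n:
            r_n = r
            break
    # (r_n - gained) is the non-relevant effort; normalize by the cut-off c.
    return 1.0 - (r_n - gained) / cutoff
```

With a discount, each result contributes less, so the threshold is reached later (larger r_n) while the cumulated gain shrinks; both effects lower the score, consistent with the discussion above.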

For any discount function (other than “No discount”) and cut-off values larger than 1, the maximum final score is actually lower than 1. However, as with the opposite case we have seen in the MAP calculation, this would only be a problem if the threshold steps were wrong (in this case, too coarse), and there is no evidence of that problem in the data.

**Figure 10.26.** ESL PIR values discounted by the frequency of clicks on results of a particular rank, with required cumulated relevance n=1.

Figure 10.25 and Figure 10.26 show a marked difference from the graphs we have encountered before: the individual scores at any cut-off value show much less variability across threshold values. In the no-discount condition, the difference between the highest PIR score of the best-performing metric and the lowest PIR score of the worst-performing metric is just about 0.1. In the most divergent case, with the discounts being click-based, this number rises to just over 0.2. Remarkably, while the decline occurs at low threshold values and proceeds steeply for higher cut-off values, once the line reaches a PIR score of about 0.6, its fall stops. The steepest fall is seen in the no-discount condition for a cut-off rank of 10.

The PIR scores fall rapidly from ca. 0.81 for a threshold value of 0 to ca. 0.61 for a threshold value of 0.08 – but then they stay at or even above this boundary up to the maximum threshold of 0.30. The probable reason for this is that the most significant factor in the ESL calculation is r_n, the rank at which the predetermined cumulated relevance threshold is attained. That, in turn, depends mainly on the cumulated relevance n, which will be considered in more detail below.

The inter-discount comparison in Figure 10.27 shows the “Rank” function performing best, with “No discount” (which is as close as we get to the “classical” ESL, considering the number of changes made to the formula) having the lowest scores. Generally, functions with lower discounts perform worse, though the “Square” and “Click-based” functions lie slightly below the maximum.

**Figure 10.27.** ESL PIR scores for different discount functions with required cumulated relevance n=1 and the best-threshold approach. Per definition, the “No discount”, “log5” and “log2” conditions produce the same results for cut-off values 1 and 2, and “No discount” and “log5” for cut-off values 1 to 5. “Root” scores coincide with those of “Rank” for cut-offs 1-3, and with those of “Click-based” for 3-9. The “Click-based” line is identical with “Square”.

However, the discount function is not the only way in which the ESL results can be modified; it has a parameter for regulating the required cumulated relevance, n (used in r_n). In the evaluation above, n=1 was used, analogous to looking for one top-quality result. It is then a logical next step to ask whether other values of n might produce better results. Figure 10.28 through Figure 10.31 show the PIR graphs for n=0.5 to n=2.5 (with n=1 in Figure 10.27). A tendency is easy to discern: higher values of n, that is, assuming that the user wants more information, lead to higher PIR scores. Again, low-discount functions tend to perform worse; and again (even more markedly for these functions), the PIR scores decline at the later cut-off values. For values of n higher than 2.5, the peak values stay the same, but scores tend to decline faster at later ranks. From these graphs, the “Rank” function is the obvious best candidate for inter-metric comparison, since it generally performs as well as, and often better than, the other discount functions in terms of PIR.

As for scores depending on query type, Figure 10.32 and Figure 10.33 show PIR graphs for informational queries only. The general shape of the curves is the same; however, the absolute values are higher. This is especially visible for lower cumulative relevance values (for n=1, the peak PIR scores rise by about 0.03).

**Figure 10.28.** ESL PIR scores for different discount functions with required cumulated relevance n=0.5, with the best-threshold approach.

## 10.5 Inter-metric Comparison

Now we come to the first iteration of the answer to the question which set us off in the first place: which metric is best at predicting user preferences? To provide a visual answer, Figure 10.34 shows the PIR performance of all the different metrics discussed in the preceding sections at a glance. In most cases, only one line is presented for each metric; however, MAP has two lines, one for the universally used discount function (by rank) and one for the best-performing one (without any discount).

The graph provides many instantly visible results. MRR lies far below all the other metrics; it is better than chance at recognizing preferred result lists, but does so only in about 2/3 of all cases, picking the worse result list one-third of the time. Because of this vast discrepancy, I will ignore MRR for the rest of this section when discussing the performances of individual metrics. All other metrics have peak PIR scores in the range from 0.88 to 0.92; nevertheless, many of them differ significantly.

The two metrics performing worst are ERR and traditional (that is, discounted by rank) MAP.