# On Search Engine Evaluation Metrics

## 8.2.1 The Preference Identification Ratio

The second method of evaluating metrics against preference judgments is based on the notion that correlation, while well-tested and statistically meaningful, does not necessarily provide an immediate grasp of what the results actually mean for the algorithm or the user. Let us assume we have a Pearson correlation of 0.73 between average precision at rank 3 and user satisfaction (as was the case in Huffman and Hochster 2007). What does it tell us? What do we gain if we use this particular metric? What if we can increase the correlation to 0.82 (as Huffman and Hochster did)? How much good does it do to the end user? The answers to these questions are far from clear. The most we can say is whether the correlation is statistically significant, and that a higher correlation is, all other things being equal, generally preferable.

To provide a better grasp of metric evaluation, I propose a meta-evaluation metric. “Meta-evaluation” means that it is not a metric for testing the performance of a search engine, or any other retrieval system; rather, its aim and proper use is to offer an intuitive but well-defined[^60] indicator of the usefulness of a metric which is directly concerned with a search engine’s performance. This metric, which I call the Preference Identification Ratio (or PIR for short), is concerned with the following situation:

Imagine that we have two different result lists that we could present to the user for his query (as a result of different index sizes, ranking algorithms, or any other reasons). We also have explicit relevance judgments for the individual results which appear in the result lists. If we use a certain metric (say, Precision) to calculate the quality of the two result lists, and present the users with the one scoring higher, how likely is it, on average, that the user would really prefer it to the lower-scoring one? To put it more simply: How good is a metric at picking out the result list the user would have preferred?

The formula is probably more complex than the concept itself, so here is an example. Let us assume five queries with the characteristics shown in Table 8.1. The preliminary step is to identify the set of queries Q for which a user preference judgment between the result lists actually exists. In the sample data, this excludes q2 and leaves us with q1, q3, q4, and q5. The reason for excluding non-discriminating queries is that, whatever metric we use and whichever result list scores higher, the user is no better or worse off, since he regards both result lists as equally good. It may be argued that not recognizing the similarity of the result lists’ quality is detrimental, since it would suggest a change of algorithm which would not provide a better experience to the user. This may indeed lead a search engine operator to …

[^60]: It is unfortunate that there is a “but” rather than an “and” between “intuitive” and “well-defined”; but after some thought, I concluded that the overlap of the two attributes cannot be taken for granted.

$$
\mathrm{pref}(x) = \begin{cases} 1 & \text{if } x > t \\ 0 & \text{if } -t \le x \le t \\ -1 & \text{if } x < -t \end{cases}
$$

**Formula 8.3.** This is an auxiliary formula which defines a preference value as 0 if a variable does not exceed a certain threshold t, and as 1 or −1 if it is larger than t or smaller than −t, respectively. pref(x) is similar to the sign function.
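For readers who prefer code to case notation, a minimal Python sketch of this function could look as follows (the name and signature are my own; the thesis only gives the mathematical form):

```python
def pref(x: float, t: float = 0.0) -> int:
    """Thresholded sign function in the spirit of Formula 8.3.

    Returns 1 if x exceeds the threshold t, -1 if it falls below -t,
    and 0 if x lies within [-t, t], i.e. if the difference is treated
    as insignificant.
    """
    if x > t:
        return 1
    if x < -t:
        return -1
    return 0
```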

Now that we have cleansed our data set, the first step is to calculate the difference between the metric scores (which are precision scores in this case). Figuring out how to perform this task is left to the reader, though the results are provided in Table 8.2. The next step is determining the preference indicated by these scores; this is done by the pref function, which is calculated according to Formula 8.3. Basically, it is defined as 1 if the argument is positive (that is, the first result list has a higher metric score), −1 if it is negative (the second result list having the higher score), and 0 in case it is zero (or lies below a certain threshold t; more on that below). Then, the user preferences are converted into a similar format, with pref_user set to 1 if the first result list was judged to be better, and to −1 in the opposite case.[^62] After that, the two numbers calculated last (the metric preference pref and the user preference pref_user) are multiplied. The result of all those difficult manipulations is a 1 if the metric scores point to the same result list the user prefers, and a −1 if they point to the other one.[^61]

[^61]: Section 12.2.2.1 has more to say on this topic.

[^62]: If you are asking yourself whether pref_user is set to 0 if there is no preference, you might want to re-read the previous paragraph.

We can now proceed with the final step by adding together the numbers (1+(−1)+1+1=2), dividing the result by double the number of our queries (2·4=8), and adding 0.5, which gives us a PIR of 0.75. Since three of the four queries where the user actually has a preference would deliver the preferred result list when using precision to choose, the PIR value of 3/4 might not come as a shock. Indeed, that is what PIR does – indicating what ratio of queries will get the preferred result list. Generally, if a metric picks out the wrong result lists in most cases, its PIR will be below 0.5. However, as guessing randomly (or assigning all result lists the same score) would provide a PIR of 0.5, this can be considered the baseline. To sum up, the PIR scale goes from 0 to +1, but any metric with a PIR lower than 0.5 is doing something really wrong; so, for all practical purposes, a PIR value of 0.5 is as low as it gets. The division by 2|Q| instead of |Q| and the subsequent addition of 0.5 to the result serve only to put the PIR results from the conceivable but unintuitive −1..+1 range onto an immediately understandable and common 0..1 scale.
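Written out as a single formula, the computation just described is (this is my own rendering based on the description in this section; for t = 0 it should correspond to the thesis’s Formula 8.4):

$$
\mathrm{PIR}_t \;=\; \frac{1}{2\,|Q|}\sum_{q \in Q} \mathrm{pref}\!\left(m_{q,1} - m_{q,2}\right)\cdot\mathrm{pref}_{\mathrm{user},q} \;+\; 0.5
$$

Here Q contains only the queries for which the user has a preference, $m_{q,1}$ and $m_{q,2}$ are the metric scores of the two result lists for query q, pref is the thresholded sign function of Formula 8.3, and $\mathrm{pref}_{\mathrm{user},q}$ is +1 or −1 according to the user’s judgment. For the example above this gives $(1 - 1 + 1 + 1)/(2 \cdot 4) + 0.5 = 0.75$.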

There is one additional consideration that has been left out. It can be argued that not all result lists with different precision scores should be regarded as different in quality. One might think, for example, that the user does not really distinguish between results with precision scores of 0.48 and 0.51. To account for that, the preference calculation contains the threshold t. If t is set to zero, any difference between result list scores is considered significant; this is the route that has been taken when discussing the example up to now (the official form of this simplified metric, which I call $\mathrm{PIR}_{t=0}$, is given in Formula 8.4). But what happens if any difference of less than, say, 0.15 between the precision scores of the two result lists is considered not significant? In this case, the calculation has one additional step. After arriving in the same way as before at the data presented in Table 8.2, we compare the absolute value of the precision difference with the chosen threshold. If the difference does not exceed the threshold, the query’s addend value for the final calculation is set to zero; this is the case with q3. The PIR for t=0.15 is, then, (1+0+1+1)/8+0.5, or 0.875. In words, there was one result where the difference in precision scores was rather small and pointed to the wrong result list; by requiring any score difference to meet a minimal threshold, the risk of making the wrong judgment is reduced for the close calls. The higher the threshold is set, the less probable both wrong and correct predictions become, as the number of disregarded queries increases; for t=0.35, our example’s PIR drops to 0.625, and if the threshold is higher than the full span of possible metric scores (in most cases this would be t≥1), PIR would always be 0.5, regardless of the input.
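The whole calculation is compact enough to be written down as a short, self-contained Python sketch. The function and variable names are mine, and the per-query precision differences below are not the actual Table 8.1/8.2 values (which are not reproduced here) but hypothetical figures chosen to reproduce the numbers of the worked example:

```python
def pref(x: float, t: float = 0.0) -> int:
    """Thresholded sign function (Formula 8.3)."""
    if x > t:
        return 1
    if x < -t:
        return -1
    return 0


def pir(metric_diffs, user_prefs, t: float = 0.0) -> float:
    """Preference Identification Ratio.

    metric_diffs: per-query difference between the metric scores of result
        list 1 and result list 2 (only queries where the user has a preference).
    user_prefs: +1 if the user prefers list 1, -1 if list 2.
    t: threshold below which a score difference is ignored.
    """
    total = sum(pref(d, t) * u for d, u in zip(metric_diffs, user_prefs))
    return total / (2 * len(user_prefs)) + 0.5


# Hypothetical precision differences and user preferences for q1, q3, q4, q5,
# chosen so that the results match the worked example above.
diffs = [0.20, -0.10, 0.25, 0.50]
users = [1, 1, 1, 1]

print(pir(diffs, users, t=0.00))   # 0.75
print(pir(diffs, users, t=0.15))   # 0.875
print(pir(diffs, users, t=0.35))   # 0.625
```

Running it prints 0.75, 0.875 and 0.625 for t = 0, 0.15 and 0.35 respectively, matching the values derived above.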

So, once more: What does PIR tell us? One interpretation is that it shows how much a metric can help improve user experience. It is a commonplace that one cannot improve the experience for all users, at least if the system is as large as most web search engines; Google “could make 500 [queries] great, and only one worse, but you are never gonna make all of them better” (Google 2011d). If the PIR of a metric is 0.75, it means that for 75% of users who actually have a preference, that preference can be correctly determined. The other 25% would get the result list they consider to be worse. Alternatively, a PIR of 0.75 could mean that the preference can be determined correctly for 50% of users, while for the other 50%, the metric (wrongly) indicates that the result lists are equally good. In this case, those users might get either of the result lists. The numbers could also be different, as long as the share of correct preference predictions plus half the share of “equal” predictions adds up to the PIR. Put another way, the PIR, as its full name (Preference Identification Ratio, in case you already forgot) suggests, shows how well a metric can identify user preferences.
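The two readings can be stated compactly. Writing $p_{\text{correct}}$, $p_{\text{wrong}}$ and $p_{\text{equal}}$ for the shares of preference-queries where the metric picks the preferred list, the other list, or declares a tie (this decomposition is my own restatement, but it follows directly from the definition above):

$$
\mathrm{PIR} \;=\; \frac{p_{\text{correct}} - p_{\text{wrong}}}{2} + 0.5 \;=\; p_{\text{correct}} + 0.5\,p_{\text{equal}},
$$

so both $0.75 + 0.5 \cdot 0$ and $0.5 + 0.5 \cdot 0.5$ amount to the same PIR of 0.75.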

PIR, like most other measures, is more useful in comparisons. If viewed for just one metric, it can be considered a comparison with the baseline of 0.5 (which might be achieved, for example, by guessing at the better result list randomly). Then, it shows to what extent the metric closes the knowledge gap between guessing and knowing the user’s preferences. If PIR is used for inter-metric comparison, it shows which metric will give users more of their preferred result lists, and how often.

### PIR Graphs

In order to provide as much information as possible, I have used graphs that might be considered to be a bit on the crammed side. However, I think that with a proper explanation, they are quite readable. Here, then, is the explanation. I will not provide any interpretations of the results at this point, since those will be given in more detail in the next section. The aim of this graph and the next one is only to provide a guide on how to read them.

**Figure 8.2.** Precision PIR (on the Y-axis) depending on threshold value (X-axis) and cut-off value (coloured lines). Note that the Y scale does not start at zero and does not end at 1; rather, it is dynamic and graph-specific. This means different graphs can be compared directly as to the form of single plot lines, but a comparison in absolute PIR terms requires a careful look at the numbers on the Y-axis rather than just a glance at two graphs side by side.

Figure 8.2 shows a Precision PIR graph.

Let us first consider the blue line. From the legend at the bottom you can see that it is the curve (or perhaps jagged straight) for precision at the cut-off value of 3. That is, for the purposes of this evaluation, the judgments for any result beyond the third rank are ignored; the retrieval systems might just as well have returned only three results each. On the horizontal axis, the threshold values are given. If we look up the PIR value at zero threshold, we see that it is 0.835. This means that if we consider only the first three results of every result list, and if we regard any difference in Precision, however small, to be significant, then for about five of six queries Precision will correctly pick out the result list preferred by the user, while once every six queries it will pick out the less preferred one. If we move right to the threshold of 0.05, we will consider only result lists whose precision scores differ by at least 0.05 to be of significantly unequal quality. This reduces the risk of accidentally regarding a close call as marking a preference and getting it wrong; but it also reduces the chance of getting it right. The larger the threshold value, the more cases are considered to be “close”, and the less material the metric uses. Thus, when the threshold value is high enough (1 is the extreme), the PIR will always be 0.5, as no judgments will be made, and whether the better or the worse result list is returned will be a matter of chance.

The black line, indicating a cut-off value of one (only the first result is considered), is straight; that is, threshold values up to 0.30 do not change the PIR. The reason is that we have only one judgment on what is in this case a six-point scale between 0 and 1 (0.0, 0.2, 0.4, 0.6, 0.8, 1.0), and the likelihood of just the two first-rank results being exactly one rating step apart is relatively small. With even larger thresholds, there would be differences; also, other ratings or other metrics might well provide for a different graph.

I have two final remarks on this graph. One concerns the threshold scale itself: the highest value you will find displayed on the graphs is always 0.30, as there was not a single evaluation where the PIR improved in any way once the threshold exceeded this value. The second is that in the evaluation proper, all cut-off values from 1 to 10 will be displayed[^63]; in the sample graph, simplicity rather than completeness was called for.

[^63]: The reasons for choosing 10 as the cut-off value will be explained in Section 9.1; basically, the reason lies in the limited availability of resources.
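To make the mechanics behind such a graph concrete, here is a rough, self-contained Python sketch that sweeps threshold values for a few cut-offs and prints one PIR curve per cut-off, in the spirit of Figure 8.2. All names, the data, and the simulated user are my own assumptions; in particular, “precision” is read here as the mean graded judgment of the top-k results, which may differ from the thesis’s exact definition.

```python
import random

SCALE = range(6)   # the six-point judgment scale, stored as 0..5 (0.0 .. 1.0 in steps of 0.2)


def precision_at(judgments, k):
    """Mean graded judgment of the top-k results, mapped to the 0..1 range.

    Only one plausible reading of "Precision" over graded judgments; the
    exact definition used in the thesis is not reproduced in this excerpt.
    """
    return sum(judgments[:k]) / (5 * k)


def pref(x, t):
    """Thresholded sign function (Formula 8.3)."""
    return 1 if x > t else (-1 if x < -t else 0)


def pir(queries, k, t):
    """PIR at cut-off k and threshold t for (list1, list2, user_pref) tuples."""
    total = sum(
        pref(precision_at(l1, k) - precision_at(l2, k), t) * user
        for l1, l2, user in queries
    )
    return total / (2 * len(queries)) + 0.5


# Hypothetical data: 200 query pairs with random judgments; the "user" is
# simulated as preferring the list whose ten results have the higher total
# judgment, and ties are dropped (queries without a preference do not enter
# the PIR calculation). With this toy user model, cut-off 10 agrees with the
# user by construction, so its line stays at 1.0.
rng = random.Random(0)
queries = []
for _ in range(200):
    l1 = [rng.choice(SCALE) for _ in range(10)]
    l2 = [rng.choice(SCALE) for _ in range(10)]
    if sum(l1) != sum(l2):
        queries.append((l1, l2, 1 if sum(l1) > sum(l2) else -1))

thresholds = [t / 100 for t in range(0, 31, 5)]        # 0.00, 0.05, ..., 0.30
for k in (1, 3, 5, 10):                                 # a few cut-off values
    curve = [pir(queries, k, t) for t in thresholds]
    print(k, ["%.3f" % value for value in curve])
```

Each printed row corresponds to one coloured plot line in a Figure 8.2 style graph.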

**Figure 8.3.** PIR comparison graph. The PIR values are again displayed on the Y-axis. Here, the thresholds are used on a “best-for-cut-off” basis; for each cut-off value, the best-performing threshold is chosen.
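As a sketch of what this “best-for-cut-off” basis amounts to, assume we already have PIR-vs-threshold curves such as those plotted in Figure 8.2. The numbers below are invented for illustration, apart from the 0.695 (cut-off 1) and 0.835 (cut-off 3, threshold 0) values mentioned in the surrounding text:

```python
# Hypothetical PIR values per cut-off and threshold.
curves = {
    1: {0.00: 0.695, 0.05: 0.695, 0.10: 0.695},
    3: {0.00: 0.835, 0.05: 0.845, 0.10: 0.820},
    5: {0.00: 0.790, 0.05: 0.800, 0.10: 0.810},
}

# "Best-for-cut-off": for each cut-off value, keep the threshold with the
# highest PIR; these (threshold, PIR) pairs form one plot line of Figure 8.3.
best_per_cutoff = {
    k: max(by_threshold.items(), key=lambda item: item[1])
    for k, by_threshold in curves.items()
}
print(best_per_cutoff)   # {1: (0.0, 0.695), 3: (0.05, 0.845), 5: (0.1, 0.81)}
```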

Figure 8.3 shows a graph comparing two PIR plots.

Here, the cut-off values are displayed on the X-axis. The graph does not show any threshold information; the threshold values used will be declared with each graph. If we again consider the Precision plot, we can see that it starts with a PIR of 0.695; that is, if only one judgment is used for each result list, Precision will correctly identify the better list seven times out of ten. If you compare this to Figure 8.2, you …

The source of values for cut-offs greater than one is probably not quite self-explanatory. While the Precision PIR at cut-off one was stable (at least up to 0.30) with regard to threshold values, this is not the case for other cut-offs. The question that arises is: which threshold do we use to represent Precision in this inter-metric comparison? There are two answers that will be used in the evaluation. One is to use the best possible threshold values for each cut-off; this is the method employed in Figure 8.3. For a cut-off value of 3, for example, we look at Figure …