On Search Engine Evaluation Metrics. Inaugural-Dissertation zur Erlangung des Doktorgrades der Philosophie (Dr. Phil.)
NDCG PIR scores for different discount functions, with the best-threshold approach. Since several of the discount functions produce identical PIR scores, some description is needed. Due to the discount function definitions, the “No discount”, “log5” and “log2” conditions produce the same results for cut-off values 1 and 2, and “No discount” and “log5” have the same values for cut-off values 1 to 5. Furthermore, the “Root” and “Rank” conditions have the same scores for cut-off ranks 1 to 8, and “Square” and “Click-based” coincide for cut-off values 3 and 4. “log2” has the same results as “Root” for cut-off values 4 to 10 and the same results as “Rank” for cut-off values 4 to 8. Reproduction of Figure 10.8.
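The discount families named in the caption can be sketched as follows. This is an illustration, not the study's exact code; the logarithmic variants use the common max(1, log_b(rank)) convention, which is an assumption on my part, chosen because it reproduces the ties noted above (log2 equals "No discount" at ranks 1 and 2, log5 at ranks 1 to 5). The study's "Click-based" discount is not reproduced here.

```python
import math

def discount(rank: int, kind: str) -> float:
    """Weight applied to the relevance value at a (1-based) rank.

    Sketch only: the log variants assume the max(1, log_b(rank))
    convention; the study's click-based discount is omitted.
    """
    if kind == "none":                               # "No discount"
        return 1.0
    if kind == "log2":                               # customary NDCG discount
        return 1.0 / max(1.0, math.log2(rank))
    if kind == "log5":
        return 1.0 / max(1.0, math.log(rank, 5))
    if kind == "root":
        return 1.0 / math.sqrt(rank)
    if kind == "rank":
        return 1.0 / rank
    if kind == "square":
        return 1.0 / rank ** 2
    raise ValueError(f"unknown discount: {kind}")
```

Under this convention, `discount(2, "log2")` and `discount(5, "log5")` both equal 1.0, matching the coinciding curves described in the caption.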
Of course, the picture is not always as clear as that. When the curves become more volatile (for example, when result ratings and preferences come from different users), it is harder to discern patterns, though mostly they are still visible. The second parameter which can influence the decline in PIR scores, however, is harder to observe in an actual evaluation.
It seems logical to assume that if the explanation given above for the tendency of PIR scores to decline after some rank is true, then the peak PIR score should play a role in the extent of this decline. After all, if the PIR is at 0.5 to begin with, then it has nothing to fear from results that randomize its predictive power. Conversely, the higher the peak score, the larger the possible (and probable) fall could be expected to be. However, this tendency is hard to see in the data. A possible explanation is that a metric with a high PIR peak is, on some level, a “good” metric. And a “good” metric would also be more likely than a “bad” metric to provide better preference recognition at later ranks, which counterbalances any possible “the higher the fall” effect.109 This argument is not very satisfactory, since it would imply that testing the effect is generally not possible. However, a more confident explanation will require more data.
12.2.6 Metric Performance

We are now approaching the results which are probably of most immediate importance for search engine evaluation. Which metric should be employed? I hope that it has by now become abundantly clear that there is no single answer to this question (or perhaps not even much sense in the question itself, at least in such an oversimplified form). However, there are doubtless some metrics more suitable than others for certain evaluations, and even cases of metrics that seem to perform significantly better (or worse) than others in a wide range of situations.
Two of the log-based metrics that have been examined, Session Duration and Click Count, did not perform much better than chance. They might be more useful in specific situations (e.g. when considering result lists which received clicks versus no-click result lists). The third, Click Rank, had PIR scores significantly above the baseline, but its peak PIR values of 0.66 were only as good as those of the worst-performing explicit metrics.
Of the explicit metrics evaluated in this study, the one which performed by far the worst was Mean Reciprocal Rank. This is not too surprising; it is a very simple metric, considering only the rank of the first relevant result. But for other metrics, the picture is less clear.
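Mean Reciprocal Rank's simplicity is easy to see from its definition; a minimal sketch (standard MRR, not a study-specific variant):

```python
def reciprocal_rank(relevant: list) -> float:
    """1/rank of the first relevant result; 0.0 if no result is relevant.

    Everything after the first relevant result is ignored, which is
    exactly the simplicity criticized above.
    """
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(result_lists: list) -> float:
    """Average of the reciprocal ranks over a set of result lists."""
    return sum(reciprocal_rank(r) for r in result_lists) / len(result_lists)
```

Two result lists that place their first relevant result at the same rank receive identical scores, no matter how the remaining results differ.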
ERR was a metric I expected to do well, as it tries to incorporate more facets of user behaviour than others. Perhaps it makes too many assumptions; in any case, it is never best in class. In some cases it can keep up with the best-performing metrics, but sometimes, as in the general same-user evaluation (Figure 10.34), its peak value lies significantly below all others except MRR.
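The cascade assumption behind ERR can be made concrete with a short sketch of the standard formulation (Chapelle et al.); the grade-to-probability mapping below is the usual one, not necessarily the exact configuration used in this study:

```python
def err(grades: list, max_grade: int = 3) -> float:
    """Expected Reciprocal Rank over graded relevance judgments.

    Each result stops the user with probability R = (2^g - 1) / 2^g_max;
    a relevant result at rank r contributes 1/r, weighted by the
    probability that the user reached rank r at all. This is why a
    relevant result counts less if earlier results were also relevant.
    """
    p_reach = 1.0   # probability the user examines the current rank
    score = 0.0
    for rank, g in enumerate(grades, start=1):
        r = (2 ** g - 1) / 2 ** max_grade
        score += p_reach * r / rank
        p_reach *= 1.0 - r
    return score
```

A top result with the maximum grade already yields an ERR of 0.875 on a 0-3 scale, leaving later results little room to matter.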
109 “Good” and “bad”, in this case, can be taken to mean “tend to produce high PIR scores” and “tend to produce low PIR scores”. The argument says only that if a metric produces a higher PIR score at its peak value than another metric, it can also be expected to produce a higher PIR score at a later rank.
Precision did not produce any miracles; but nevertheless, its performance was better than I expected considering its relatively simple nature.110 Its peak values were not much lower than those of most other metrics in same-user evaluations, and it actually led the field when the explicit result ratings and preference judgments came from different users. Precision’s weak point is its poor performance at later cut-off ranks; as it lacks a discounting function, the results at rank 10 are almost always significantly worse than those at an (earlier) peak rank.
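Precision's lack of a discount is visible directly in its definition; a minimal sketch of standard precision at a cut-off rank:

```python
def precision_at_k(relevant: list, k: int) -> float:
    """Fraction of relevant results among the top k.

    There is no discount: a relevant result at rank k counts exactly
    as much as one at rank 1, which explains the weak performance at
    later cut-off ranks noted above.
    """
    return sum(bool(r) for r in relevant[:k]) / k
```

Swapping the results at ranks 1 and 10 leaves precision at cut-off 10 unchanged, whereas any discounted metric would register the difference.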
Another positive surprise was the performance of ESL. The metric itself is almost as old as precision, and it performed significantly better. It is not quite the traditional version of ESL that has been evaluated; in Section 10.4, I described some changes I made to allow for non-binary relevance and a discount factor, as well as for normalization. However, I think that its essence is still the same. Be that as it may, ESL’s PIR scores are impressive. In many situations it is one of the best-performing metrics. However, its scores are somewhat lower when explicit result ratings and preference judgments come from different users. Also, it is the metric most susceptible to threshold changes; that is, taking a simple zero-threshold approach as opposed to a best-threshold one has slightly more of a detrimental effect on ESL than on other metrics.
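The core idea of the generalized ESL can be sketched as follows. This is an illustration under assumptions, not the exact Formula 10.5: accumulated (optionally discounted) gain replaces a count of binary-relevant results, normalization is omitted, and the convention for an unreached target is my own.

```python
def graded_esl(gains: list, target: float, discount=None) -> float:
    """Search length until accumulated gain reaches `target`.

    Sketch of an ESL generalized to graded relevance: instead of
    counting relevant documents, (optionally discounted) gains are
    accumulated until a required total is reached. Lower is better.
    Returns len(gains) + 1 if the target is never reached (an assumed
    convention; the study normalizes the score, which is omitted here).
    """
    acc = 0.0
    for rank, g in enumerate(gains, start=1):
        weight = discount(rank) if discount else 1.0
        acc += weight * g
        if acc >= target:
            return rank
    return len(gains) + 1
```

The `target` argument corresponds to the extra parameter discussed below (the amount of required accumulated relevance), which can be tuned per evaluation environment.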
MAP, on the other hand, performed surprisingly poorly. As explained in Section 10.3, two versions of MAP have been evaluated, one with the traditional rank-based discount, and another without a discount. The no-discount MAP performed unspectacularly, having mostly average PIR scores, and reaching other metrics’ peak scores in some circumstances (although at later cut-off ranks). The PIR scores of traditional MAP, discounted by rank, were outright poor. It lagged behind the respective top metrics in every evaluation. It came as a bit of a shock to me that one of the most widely used evaluation metrics performed so poorly, even though there have been other studies pointing in the same direction (see Section 4.1 for an overview). However, taken together with the current study, those results seem convincing.
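For reference, the traditional rank-discounted version can be sketched as standard average precision; the 1/k factor in each addend is the rank-based discount at issue. The study's no-discount variant, which drops that factor, is not reproduced here, as its exact form is described in Section 10.3.

```python
def average_precision(relevant: list) -> float:
    """Traditional AP over binary judgments.

    At each relevant rank k, the addend is precision@k, i.e. the number
    of relevant results so far divided by k; the final score is the
    average of these addends. The division by k is the rank-based
    discount whose suitability the results above call into question.
    """
    hits = 0
    total = 0.0
    for k, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0
```

MAP is then the mean of this value over a set of queries.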
While in most cases I would advocate attempting to replicate or substantiate the findings of the current study before taking any decisions, for MAP (at least in its usual rank-discounted version) I recommend further research before continuing to use the metric at all.
Finally, there is (N)DCG. In earlier sections I pointed out some theoretical reservations about the metric itself and the lack of reflection when selecting its parameter (discount function).
However, these potential shortcomings did not have a visible effect on the metric’s performance. If there is a metric which can be recommended for default use as a result of this study, it is NDCG, and in particular NDCG with its customary log2 discount. It has been in the best-performing group in virtually every single evaluation; not always providing the very best scores, but being close behind in the worst of cases.
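The recommended configuration is easy to state precisely; a minimal sketch of NDCG with the customary log2 discount (using the common convention of leaving ranks 1 and 2 undiscounted, which is an assumption consistent with the discount ties described earlier):

```python
import math

def ndcg(gains: list, k: int = None) -> float:
    """NDCG with the customary log2 discount.

    DCG sums gain / max(1, log2(rank)); NDCG divides by the DCG of the
    ideally reordered list, so a perfect ordering scores 1.0. Sketch
    only; the study's exact implementation may differ in detail.
    """
    def dcg(gs):
        return sum(g / max(1.0, math.log2(r))
                   for r, g in enumerate(gs, start=1))
    gs = gains[:k] if k is not None else gains
    ideal = sorted(gains, reverse=True)[:len(gs)]
    best = dcg(ideal)
    return dcg(gs) / best if best else 0.0
```

A list already sorted by gain scores 1.0; moving the high-gain results to the bottom pushes the score toward 0.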
If saying which metrics perform better and which worse can only be done with significant reservations, explaining the why is harder still. Why does NDCG generally perform well, and why does ERR perform worse? To find general tendencies, we can try to employ the Web Evaluation Relevance model described in Chapter 7.
110 I do not think I was alone in my low expectations, either, as the use of precision in evaluations has steadily declined over time.
The WER model has three dimensions. The first of these is representation. While different evaluation situations have different representation styles (information need or request), they are identical for all metrics. The same can be said about the information resources; all explicit metrics are document-based. The only differences, then, lie in the third category: context.
Context is somewhat of a catch-all concept, encompassing many aspects of the evaluation environment. On an intuitive level, it reflects the complexity of the user model used by a metric.
One component of context that is present in most of the evaluated metrics is a discount function. However, as described above, the same discount functions have been tested for all metrics. A second component is taking into consideration the quality of earlier results.
This happens, in different ways, with ERR (where a relevant result contributes less if previous results were also relevant), and ESL (where the number of relevant results up to a rank is the defining feature).111 But this does not shed too much light on the reasons for the metrics’ performances, either. The two metrics handle the additional input in quite different ways that do not constitute parts of a continuum to be rated for performance improvement. And the use of additional input itself does not provide any clues, either; ESL generally performs well, while ERR’s PIR scores are average. This might be explained by the fact that ESL has an extra parameter, the number of required relevant results,112 which can be manipulated to select the best-performing value for any evaluation environment. All in all, it can be said that there is not enough evidence that metrics with a more complex user model (as exemplified by ERR and ESL) provide better results.
12.3 The Methodology and Its Potential

The results presented in the two previous sections, if confirmed, can provide valuable information for search engine operators and evaluators. The different impacts of user satisfaction and user preference or the performance differences between MAP and NDCG are important results; however, I regard the methodology developed and used in this study as its most important contribution.
In my eyes, one of the most significant drawbacks of many previous studies was a lack of clear definitions of what it was the study was actually going to measure in the real world.
Take as an example the many evaluations where the standard against which a metric is evaluated is another metric, or result lists postulated to be of significantly different quality. The question which these evaluations attempt to answer is “How similar are this metric’s results to those provided by another metric?”, which is of limited use if the reference metric does not have a well-defined meaning itself; or “If we are right that the result lists are of different quality, how closely does this metric reflect the difference?”, which depends on the satisfaction of the condition. Moreover, even if the reference values are meaningful, such evaluations provide results which are at best twice removed from actual, real-life users and use cases.
111 MAP could be argued to actually rate results higher if previous results have been more relevant (since the addend for each rank partly depends on the sum of the previous relevancies). However, as the addends are then averaged, the higher scores at earlier ranks mean that the MAP itself declines. For an illustration, you might return to the example in Table 4.2 and Figure 4.2, and consider the opposed impacts of the results at rank 4 in both result lists.
112 Or rather, in the form used here, the total amount of accumulated relevance (see Formula 10.5).
This is the reason for the Preference Identification Ratio and the methodology that goes with it. The idea behind PIR is defining a real-life, user-based goal (in this case, providing users with better result lists by predicting user preferences), and measuring directly how well a metric performs. I would like to make it clear that PIR is by no means the only or the best measure imaginable. One could concentrate on user satisfaction instead of preference, for example; or on reducing (or increasing) the number of clicks, or on any number of other parameters important for a search engine operator. However, the choice of measure does not depend on theoretical considerations, but only on the goal of the evaluation. The question of how well PIR reflects a metric’s ability to predict user preference is answered by the very definition of PIR.
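The core of the PIR idea can be sketched in a few lines. This is an illustration under stated assumptions, not the study's exact implementation: each test case pairs two result lists with a known user preference, the metric predicts a preference only when its score difference exceeds a threshold, and an abstention is counted as half a correct identification (as a coin flip would be); the study's exact tie handling may differ.

```python
def pir(pairs, metric, threshold=0.0):
    """Preference Identification Ratio, sketched.

    `pairs` holds (list_a, list_b, preferred) tuples with preferred in
    {"a", "b"}; `metric` maps a result list to a score. The metric is
    credited when its score difference (beyond `threshold`) points to
    the list the user actually preferred; a difference within the
    threshold counts as half correct (assumed tie convention).
    """
    correct = 0.0
    for list_a, list_b, preferred in pairs:
        diff = metric(list_a) - metric(list_b)
        if abs(diff) <= threshold:
            correct += 0.5          # metric abstains: coin-flip credit
        elif (diff > 0) == (preferred == "a"):
            correct += 1.0          # metric predicted the preference
    return correct / len(pairs)
```

A metric that always abstains thus scores 0.5, which is the chance baseline referred to throughout this chapter.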
Another important method employed in this study is parameter variation. Metric parameters are routinely used in studies without much reflection. I have mostly used the log2 discount of NDCG as an example, but it is only one instance, and probably not the most significant one. I am not aware of any research into whether the discount by rank used in MAP for decades is really the best one; and the results of this evaluation suggest it emphatically is not. I am also not aware of a systematic evaluation of different cut-off ranks to determine which ones produce the best effort/benefit ratio, or simply the best results.
I hope I have shown a few important facts about parameters. They should not be taken for granted and used without reflection; you may be missing an opportunity to reduce effort or improve performance, or simply be producing bad results. But just as important is the realization that it is not necessarily outlandishly hard to find the best values. In this study, there were two main sets of data: user preferences and result judgments. But with just these two input sets, it was possible to evaluate a large range of parameters: half a dozen discount functions, ten cut-off ranks, thirty threshold values, two types of rating input, three rating scales (with multiple subscales), and five main metrics, which can also be viewed as just another parameter.
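The sweep described above amounts to a simple grid search; a minimal sketch (the function and parameter names are illustrative, not the study's code):

```python
from itertools import product

def best_parameters(evaluate, discounts, cutoffs, thresholds):
    """Exhaustive parameter sweep over a PIR-style evaluation.

    `evaluate(discount, cutoff, threshold)` is assumed to return a PIR
    score computed from the same two fixed input sets (preferences and
    ratings); the sweep returns the best-scoring combination. With the
    grid sizes mentioned above, this is only a few thousand evaluations.
    """
    return max(product(discounts, cutoffs, thresholds),
               key=lambda params: evaluate(*params))
```

Because the input data is fixed, adding another parameter dimension multiplies only computation, not data-collection effort, which is the point made above.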
Granted, the evaluation’s results are not necessarily as robust as I would have liked. However, that is not a result of the large number of evaluated parameters, but of the small scale of the study itself. I expect that a study with, say, fifty participants and a hundred queries would have stronger evidence for its findings; and, should its findings corroborate those of the present evaluation, be conclusive enough to act upon the results.
12.4 Further Research Possibilities

What could be the next steps in the research framework described here? There are quite a few; while some could revisit or deepen research presented in this study, others might consider new questions or parameters. Below, I will suggest a few possibilities going beyond the obvious (and needed) one of attempting to replicate this evaluation’s results.
One avenue of research, discussed in an earlier section, is detailed preference identification. It goes beyond a simple, single score to provide a faceted view of a metric’s performance. By