A major problem with the cited experimental confirmations of log data evaluation is that the confirmation comes from their correlation with measures based on explicit document relevance rankings, whose significance, in turn, is often based on not very persuasive evidence; indeed, sometimes their vindication comes from log data evaluations. Strictly speaking, this implies only that explicit and log-based ratings correlate with each other, but not necessarily with any satisfaction levels on the user side. The claim that implicit and explicit measures do not reflect any user preferences would be an overstatement; but the point remains that there is hardly any validation of their significance outside this vicious circle. Another strategy is to compare explicit metrics among each other (e.g. Kekäläinen 2005; Della Mea et al. 2006; Sakai 2007b); I consider this to be valuable once the validity of the primary metric has been established.
There have been some attempts to break this circle. One interesting method is to create two result lists so that one of them is deemed to be better than the other “by construction” (Radlinski, Kurup and Joachims 2008). In one of the conditions of this study, the authors implemented the “good” result list by calculating the cosine similarity between the query and multiple fields (e.g. title, abstract and full text) of the arxiv.org database of scientific articles.
A “worse” result list was created by removing the content of record fields considered to be important, such as title and abstract, from the score. Those fields were still included in the full text and thus matched; but in the “better” list, the inclusion of the further fields increased the match count for every word appearing in the title and abstract. A “worst” list was created by randomizing the top 11 results of the “worse” list. In a second condition, the “worse” list was constructed from the same “good” list by randomly swapping two of the results in positions 2-6 with two in positions 7-11. For the “worst” list, four instead of two results were swapped. One of these lists was randomly shown to users of a search engine implemented for this very purpose, and the data from their interactions with it was recorded.
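To make the construction of these lists more concrete, the following sketch shows one way such result list triples could be generated. It is not the authors’ implementation: the field names, the plain term-frequency cosine scoring and the exact degradation procedures are assumptions for illustration only.

```python
import math
import random
from collections import Counter

def cosine(query_tokens, doc_tokens):
    """Plain term-frequency cosine similarity between two token lists."""
    q, d = Counter(query_tokens), Counter(doc_tokens)
    dot = sum(q[t] * d[t] for t in q.keys() & d.keys())
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

def rank(query, docs, fields):
    """Rank documents by cosine similarity of the query to the given fields."""
    q = query.lower().split()
    scored = [(cosine(q, " ".join(doc[f] for f in fields).lower().split()), doc)
              for doc in docs]
    return [doc for _, doc in sorted(scored, key=lambda s: s[0], reverse=True)]

def build_triple_condition_1(query, docs):
    """'Good': all fields contribute to the score; 'worse': title and abstract are
    dropped from the score (their words still match via the full text);
    'worst': the top 11 results of the 'worse' list are randomized."""
    good = rank(query, docs, ["title", "abstract", "fulltext"])
    worse = rank(query, docs, ["fulltext"])
    worst = worse[:11]
    random.shuffle(worst)
    return good, worse, worst + worse[11:]

def swap_degrade(good, n_swaps):
    """Condition 2: swap results between positions 2-6 and 7-11 (1-based);
    two swaps give the 'worse' list, four swaps the 'worst' list."""
    degraded = list(good)
    for i, j in zip(random.sample(range(1, 6), n_swaps),
                    random.sample(range(6, 11), n_swaps)):
        degraded[i], degraded[j] = degraded[j], degraded[i]
    return degraded
```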
The data was evaluated using “absolute metrics”, that is, metrics attempting to put a concrete number on a user’s satisfaction with the result list. Among these metrics were, for example, the Abandonment Rate (the share of queries without clicks on any result), Clicks per Query, and the Mean Reciprocal Rank (the mean of the reciprocal ranks of the clicked results).
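As a rough illustration of how such absolute metrics can be computed from a click log, here is a minimal sketch; the session format and the precise metric definitions are simplifying assumptions rather than the study’s implementation.

```python
def abandonment_rate(sessions):
    """Share of queries for which no result was clicked."""
    return sum(1 for s in sessions if not s["clicked_ranks"]) / len(sessions)

def clicks_per_query(sessions):
    """Mean number of result clicks per query."""
    return sum(len(s["clicked_ranks"]) for s in sessions) / len(sessions)

def mean_reciprocal_rank(sessions):
    """Mean of 1/rank over all clicked results (ranks are 1-based)."""
    reciprocals = [1.0 / r for s in sessions for r in s["clicked_ranks"]]
    return sum(reciprocals) / len(reciprocals) if reciprocals else 0.0

# Each session records the 1-based ranks the user clicked for one query.
sessions = [
    {"clicked_ranks": [1, 3]},
    {"clicked_ranks": []},      # an abandoned query
    {"clicked_ranks": [2]},
]
print(abandonment_rate(sessions))      # 0.33...
print(clicks_per_query(sessions))      # 1.0
print(mean_reciprocal_rank(sessions))  # 0.61...
```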
The authors formulated expectations for the measures’ behavior when confronted with a worse result list; for example, Clicks per Query and the Mean Reciprocal Rank were expected to decline. In each of the two conditions, there are three result lists which can be compared to each other; thus, the overall number of comparisons is six. The results are given in Table 6.1; it is obvious that, while the metrics tend to agree with the hypothesized quality differences more often than they contradict them, the results are by no means conclusive. This is especially true if only statistically significant differences (right column) are considered: of the 48 metric comparisons, only 10 significantly agree with the assumed quality difference, while one significantly contradicts it. The authors draw the conclusion that, while the problems may be “due to […]”
Table 6.1: Absolute metric results and their significance; a check mark indicates agreement and a danger symbol disagreement with the hypothesized quality difference (Radlinski, Kurup and Joachims 2008).
In the next step, the three result lists of a condition were shuffled into one using a novel method (which, ingenious as it is, does not directly concern us here), so that the resulting list contains, at any point, an almost equal number of results from each list, in their original order. If we assume that the three lists were completely different (which they were not) and call them A, B and C, then the combined result list might look like A1, B1, C1, C2, B2, A2, A3, B3, C3… If we further assume that users scan the list progressively from top to bottom, a click on C1 without one on A1 or B1 indicates that the user has seen the first two results and has not found them relevant, in contrast to the third. From this it follows, in turn, that the first result of list C is more relevant than those of the other lists. This combined list was also shown to users of the same search engine, and the results were much more encouraging. Without delving too deep into the methodology, the better-conceived result lists were preferred on every statistic, and this preference was statistically significant in over 80% of evaluations. The authors conclude that, in contrast to absolute metrics, these so-called “paired comparison tests” deliver stable results and are well suited for establishing the preferences even of relatively small user groups.
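The following sketch illustrates the basic idea of such an interleaving evaluation. The actual method of Radlinski, Kurup and Joachims is considerably more sophisticated (it balances contributions randomly and handles clicks more carefully); this is only a naive round-robin merge with per-list click credit, and all names are illustrative.

```python
def interleave(*lists):
    """Round-robin merge of several result lists, skipping duplicates.
    Returns the combined list and, for each shown result, the index of the
    source list it came from."""
    combined, sources, seen = [], [], set()
    for rank in range(max(len(lst) for lst in lists)):
        for src, lst in enumerate(lists):
            if rank < len(lst) and lst[rank] not in seen:
                combined.append(lst[rank])
                sources.append(src)
                seen.add(lst[rank])
    return combined, sources

def credit_clicks(sources, clicked_positions, n_lists):
    """Credit each click to the list that contributed the clicked result;
    the list collecting more credit is inferred to be the better one."""
    credit = [0] * n_lists
    for pos in clicked_positions:
        credit[sources[pos]] += 1
    return credit

A = ["a1", "a2", "a3"]
B = ["b1", "a1", "b3"]                 # overlaps with A are shown only once
C = ["c1", "c2", "a2"]
combined, sources = interleave(A, B, C)
print(combined)                         # ['a1', 'b1', 'c1', 'a2', 'c2', 'a3', 'b3']
print(credit_clicks(sources, [2], 3))   # a click on 'c1' credits list C: [0, 0, 1]
```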
I have described this study in some detail because it addresses a central question of the present work, namely, the validity of evaluation metrics, and, furthermore, it does so in a novel and promising way. Despite, or perhaps precisely because of that, it is important to realize that the study is not as conclusive as it seems, and at almost every step objections can be made to the interpretation of its findings. The problems start at the very outset, when the two result list triples are constructed. In each triple, there is supposed to be a significant difference in retrieval quality. But does this really follow from the construction of the result lists? The “good” list is the result of a relatively simple cosine similarity measure. We do not know much about the usual retrieval behavior of the users who made up the test group; but for many queries, a ranking algorithm that does not include, for example, any citation data (which a major academic search engine like Google Scholar uses quite heavily) may produce a result list where the average decline of result quality from rank to rank is very shallow. If the relevance of the 10th-ranked result is, on average, indeed not much lower than that of the first
one, then a slight decrease in the importance of some record fields like title and abstract might not make the resulting list noticeably worse. (Note that the result quality at any rank is always meant to be averaged over all queries; if a ranking algorithm is not very effective, the relevance of results at any rank will be virtually unpredictable for any single query.)
More doubt awaits us at the evaluation of the absolute measures. One issue follows directly from the above argument: if the differences in ranking quality are insignificant, then insignificant metric differences may still reflect the situation correctly. The evaluation might also be taken to support the thesis that the “good” result lists are not that good to begin with. The Abandonment Rate for these result lists lies at 68 to 70%; that is, over two thirds of the searches with the retrieval quality assumed to be superior do not lead the user to any useful results.⁴⁰ A second problem is that, once again, the assumptions made by the authors may, but need not, be correct. The hypothesized metric changes from one result list to another are founded on sensible logical arguments; but sensible logical arguments could also be made for changes in the opposite direction. Indeed, the authors inadvertently illustrate this ambiguity in the brief motivations they provide for their hypotheses. For Clicks per Query, the prediction is “Decrease (fewer relevant results)”; for the Mean Reciprocal Rank, it is “Decrease (more need for many clicks)” (Radlinski, Kurup and Joachims 2008, p. 46), which, of course, implies an increase in Clicks per Query.
⁴⁰ There seems to be some error in the data, which may or may not weaken this line of reasoning. The authors state that the search engine received ca. 700 queries per day, with about 600 clicks on results; this would imply an overall click rate of ca. 0.86. The tables, however, give an Abandonment Rate of 0.68 to 0.70; and the remaining queries, where at least one result was selected (ca. 250 of them), had a click rate of about 0.72. Now, 0.72 is by itself significantly lower than 0.86; but if the queries with no clicks were indeed excluded, as stated by the authors, the Clicks per Query metric cannot be lower than 1. Furthermore, 250 queries with a click rate of 0.72 would mean around 180 clicks overall, not 600.
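The inconsistency is easy to verify with a back-of-the-envelope calculation; the numbers below are the approximate values quoted in the footnote above, not additional data.

```python
# Figures as quoted from Radlinski, Kurup and Joachims (2008) in the footnote above.
queries_per_day = 700
clicks_per_day = 600
overall_click_rate = clicks_per_day / queries_per_day         # ~0.86

non_abandoned_queries = 250        # queries with at least one click (approx.)
reported_clicks_per_query = 0.72   # reported for exactly those queries

# Restricted to queries with at least one click, Clicks per Query cannot fall
# below 1, yet 0.72 is reported; and the implied click volume does not match
# the stated 600 clicks per day.
implied_clicks = non_abandoned_queries * reported_clicks_per_query   # ~180
print(round(overall_click_rate, 2), round(implied_clicks))
```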
These problems should make us view the final evaluation, the one involving “paired comparison tests”, with caution. Again, one issue stems from the study’s difficulties in creating, a priori, result lists which significantly differ in retrieval quality. The evaluation undoubtedly shows a significant correlation of the measure with the result list types; but we cannot be certain that this correlation is with retrieval quality. Instead, it might correlate with one of a number of other properties, such as the appearance of search keywords in the title.⁴¹ A further problem is the method of assembling a joint result list. As the results from two separate lists are rearranged, de-duplicated, and re-joined, the original ordering and half of the results are lost for every result page. Thus, even if the comparison tests can be used to judge which list has more relevant results,⁴² they may well be unable to compare the usefulness of the two original result lists as cohesive, interrelated units. By neglecting the structure of the result list, the evaluation method turns away from user models, whose use can be seen as one of the main advantages of log data evaluation.
It should be stated once more that the study by Radlinski, Kurup and Joachims is a promising contribution, and I do not intend to claim that its results are wrong. They are, however, based on premises that are less self-evident than the authors assume, and therefore may not be very conclusive.
Another study that addresses, albeit indirectly, the connection between explicit and implicit measures was conducted at Microsoft (Dou et al. 2008). The authors once again try to determine relative judgments from click-through data, but their method is slightly different.
Instead of calculating the preferences for every session, the clicks on every document are aggregated for every query, and only then is the preference established. That is, if a query q is encountered 10 times in the log data, and document d1 is clicked 4 times while document d2 is clicked once, a preference for d1 over d2 is assumed.
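A minimal sketch of this aggregation step, assuming a simple log of (query, clicked document) pairs; the data structures and names are illustrative rather than those of Dou et al.:

```python
from collections import Counter, defaultdict
from itertools import combinations

def aggregate_preferences(click_log):
    """Derive per-query document preferences from aggregated click counts.
    `click_log` is an iterable of (query, clicked_document) pairs collected
    over many sessions; a document clicked more often than another for the
    same query is assumed to be preferred over it."""
    clicks = defaultdict(Counter)
    for query, doc in click_log:
        clicks[query][doc] += 1
    preferences = defaultdict(list)
    for query, counts in clicks.items():
        for d1, d2 in combinations(counts, 2):
            if counts[d1] > counts[d2]:
                preferences[query].append((d1, d2))   # d1 preferred over d2
            elif counts[d2] > counts[d1]:
                preferences[query].append((d2, d1))
    return preferences

# Example from the text: query q seen 10 times, d1 clicked 4 times, d2 once.
log = [("q", "d1")] * 4 + [("q", "d2")]
print(dict(aggregate_preferences(log)))   # {'q': [('d1', 'd2')]}
```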
In a first step, the authors compare such preferences derived from click data with explicit judgments. The latter are graded ratings on a 0-4 scale. However, the prediction accuracy of the implicit judgments is determined by comparing simple preferences, without regard to the actual difference in ratings. Still, the results are not very encouraging; the Kendall tau-b measure (Kendall and Smith 1938) shows a correlation of 0.36 in the test condition. This includes pairs where one document received some clicks while the other was not clicked at all; if those are excluded, the correlation drops to 0.19. Neither condition includes statistics for pairs where neither of the documents received any clicks; such pairs obviously provide no means of determining preferences, and would lower the correlation still further. There are multiple possible reasons for these results. As the authors note, their method is very simple and disregards many known issues such as position bias (Joachims et al. 2005). A more sophisticated approach might bring better results. Another explanation might be a generally low predictive power of click data; a third is that the implicit judgments are valid, but the explicit judgments they are compared to are not.
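For illustration, a correlation between click-derived scores and explicit ratings can be computed with Kendall’s tau-b roughly as follows; the data is invented, and the pairwise evaluation of Dou et al. is more involved than this direct correlation of scores.

```python
from scipy.stats import kendalltau

# Hypothetical data for one query: aggregated click counts per document
# (implicit signal) and relevance ratings on a 0-4 scale (explicit judgments)
# for the same five documents.
click_counts = [4, 1, 0, 0, 2]
ratings      = [3, 2, 2, 0, 1]

# scipy's kendalltau computes the tau-b variant by default, which corrects
# for the many ties that click counts and graded ratings produce.
tau_b, p_value = kendalltau(click_counts, ratings)
print(f"Kendall tau-b: {tau_b:.2f} (p = {p_value:.2f})")
```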
⁴¹ Remember that a click on a result means not positive feedback on the document, but positive feedback on its representation in the result list. The “good” retrieval function, with its focus on matching query keywords with title and abstract, can be seen as increasing not necessarily the quality of the retrieved documents, but the attractiveness of the result list.
⁴² That is, either more results which are relevant, or results which are more relevant. Clicks provide a binary distinction at best, so we do not know which of the two is being measured.
Figure 6.1: The correlation between preferences as determined by CT-Gn and explicitly stated preferences, for different values of n (from Dou et al. 2008).