To improve the results, the authors introduce a more discriminating measure they call CT-Gn (for Click-Through Greater than n). The first evaluation considers only cases where the number of clicks on a pair of results differs by more than a certain number n. The correlation statistics for different values of n are given in Figure 6.1. The numbers look significantly better; but, as always, there is a trade-off. The higher n (and the prediction confidence), the fewer document pairs with a sufficient click difference there are. Unfortunately, the authors do not provide any numbers; but they may be roughly estimated using the size of the data samples used for training and testing. The training sample consisted of 10,000 queries with ca. 580,000 documents, of which ca. 70,000 have been clicked.43 The easiest way to get a document pair with n≥100 is for one document to have no clicks while the other has 100.44 This, in turn, means that a query must comprise at least 1% of all stated queries, and in each of these cases a document d1 must be clicked on while a document d2 must not. Obviously, this can produce results for at most 100 queries, and realistically for considerably fewer. The problem can be mitigated by using significantly larger amounts of data, which are surely available to company researchers. But those logs will probably not be open to the scientific community for scrutiny; and besides, the problem of proportions will emerge. If the amount of log data is significantly higher so that a larger number of queries reach the n≥100 level, some queries will have very high frequencies, with tens or perhaps hundreds of thousands of occurrences. For those queries, the n≥100 border might be too low if considered in proportion to much larger click numbers. The solution to that might be a border line which combines the total number of clicks with the frequency of a query; but that remains to be implemented and tested in detail.

43 This proportion of over 7 clicks per query is extremely high. The usual number given in the literature varies, but generally lies between 0.8 (Joachims et al. 2005) and 2 (Baeza-Yates 2004; Baeza-Yates, Hurtado and Mendoza 2005). However, the 58 documents considered for each query are also a very high number.

44 To be precise, this would be n=101.
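The pair filtering and the rough estimate discussed above can be illustrated with a minimal Python sketch. It is not the authors' implementation; the function names, the dictionary layout of the click data and the toy numbers are assumptions made purely for illustration. The sketch uses the strict "differs by more than n" reading of CT-Gn, and the bound assumes that a document can be clicked at most once per occurrence of a query.

```python
from itertools import combinations


def ct_gn_pairs(clicks, n):
    """Return document pairs whose click counts differ by more than n.

    `clicks` maps a document id to its click count for a single query
    (a hypothetical data layout). Each returned pair lists the
    more-clicked document first.
    """
    pairs = []
    for (doc_a, c_a), (doc_b, c_b) in combinations(clicks.items(), 2):
        if abs(c_a - c_b) > n:
            pairs.append((doc_a, doc_b) if c_a > c_b else (doc_b, doc_a))
    return pairs


def max_qualifying_queries(total_query_occurrences, min_click_difference):
    """Upper bound on the number of distinct queries that can yield a pair.

    Assumes at most one click per document and query occurrence, so a
    query needs at least `min_click_difference` occurrences to qualify.
    """
    return total_query_occurrences // min_click_difference


# Toy click counts for one query (invented for illustration).
example_clicks = {"d1": 100, "d2": 0, "d3": 60}
print(ct_gn_pairs(example_clicks, 50))

# Sample size from the text: 10,000 queries, click difference of 100.
print(max_qualifying_queries(10_000, 100))
```

Raising the threshold shrinks both the list of usable pairs and the bound on qualifying queries, which is the trade-off described above.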

There have also been attempts to derive more general connections between log data and explicit measures. One such study looked at the correspondence between a result list’s MAP rating and the number and rank of clicked results (Scholer et al. 2008). The result lists were constructed in advance so as to deliver certain levels of MAP, ranging from 55% to 95% (the method previously used by Turpin and Scholer 2006). Then, users were given a recall-based search task and five minutes to complete it by finding as many relevant documents as possible. The results were unambiguous; there was no significant correlation between MAP (or precision at rank 10), whether calculated based on TREC assessments or ratings by users doing the searching, and the average number of clicks or the mean rank of clicked results.

Again, there are some methodological issues which can be raised, such as the unrealistic fixed session time; also, the result lists were not directly related to the queries users entered, but rather to the TREC topics from which the task statements were taken. However, the tendency still points to a discrepancy between explicit measures and log data. This may or may not be a problem for web evaluation using click-through measures. With more realistic tasks, which do not have fixed time limits or the goal of finding as many documents as possible,45 click frequency would not necessarily be expected to be an indicator of retrieval quality. Indeed, for search engines which deliver highly relevant results at the top ranks, the number of required clicks might be expected to drop.

Another Microsoft study has addressed the implicit-explicit-measure problem (Fox et al. 2005). The authors collected a variety of data from the company’s employees using web search engines; the session measures included data on clicks, time spent on the result page, scrolling data etc., while document measures included time spent viewing a document, the document’s position in the result list, the number of images in the document and so forth.

Also, during and after each search session explicit feedback was collected on individual results as well as on the satisfaction with the complete session; it was given on a three-point scale, where the result or session was classified as satisfying, partially satisfying or not satisfying.

The collected implicit measures were submitted to a Bayesian model which was used to predict the user satisfaction with individual results or sessions.

The results this model produced were twofold. For explicit measures, the overall prediction accuracy was 57%, which is significantly higher than the baseline method of assigning the most frequent category (“satisfied”) to all results (which would provide 40% accuracy). If only results with high prediction confidence are taken into account (those constitute about 60% of all results), the prediction accuracy rises to 66%. The most powerful predictors were the time spent viewing a document (that is, from clicking on a result in the result list to leaving the page) and exit type (whether the user left the document by returning to the result page, following a link, entering a new address, closing the browser window etc.). Just these two features provided an overall prediction accuracy of 56%, as compared to the 57% accuracy when using all features; thus, it may be concluded that they carry the main burden of providing evidence for the quality of documents. These findings are highly significant; but it should be noted that they do not provide as much help for search engine evaluation as the numbers suggest. The two main features can be obtained through a search engine log only if the user returns from a document to the result list; for other exit types, the type as well as the time spent viewing the document can only be judged if further software is installed on the test persons’ computers.46 Furthermore, if the user does return to the result list, this was found to be an indication of dissatisfaction with the viewed document. Thus, if this feature is available in the logs, it is only in cases where the prediction accuracy is lower than average (ca. 50%, as compared to 57% overall prediction accuracy and 40% baseline). As there are multiple other ways of leaving a document than returning to the result page, the absence of this feature in the logs does not seem to be an indicator of user satisfaction. To summarize, the most important features are either not available through the search engine logs, or are available but have lower than usual predictive power. This means that the logs’ main advantage, that is, their ready availability and large size, all but disappears for most researchers.

45 Even an informational query has a certain, limited amount of required information; this holds all the more for transactional and especially navigational queries.
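The baseline comparison used in this discussion (assigning the most frequent category to every item) can be sketched in a few lines of Python. This is an illustrative sketch only, not the procedure of Fox et al.; the function names and the toy label distribution are assumptions, chosen so that the most frequent category covers 40% of the items, as in the figure quoted above.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by assigning the most frequent label to every item."""
    if not labels:
        raise ValueError("labels must not be empty")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def accuracy(predicted, actual):
    """Share of items for which the predicted label equals the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("label lists must not be empty")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Toy explicit ratings (invented): 4 satisfied, 3 partially, 3 dissatisfied.
ratings = ["satisfied"] * 4 + ["partially"] * 3 + ["dissatisfied"] * 3
print(majority_baseline_accuracy(ratings))
```

Any reported prediction accuracy is only informative relative to this baseline: a model that reaches 57% where the baseline reaches 40% gains 17 percentage points, not 57.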

A second part of the study concerned itself with the evaluation of complete sessions. Here, the overall prediction accuracy was 70%, as compared to a 56% baseline (again assuming that all sessions produced satisfying experiences). However, the features considered for this model included the explicit relevance judgments collected during the study; if those were excluded to construct a model based entirely on implicit metrics, the accuracy sank to 60% – better, but not much better, than the baseline. On the one hand, this obviously means that the study could not show a sizable improvement of predictive power over a baseline method when using implicit metrics on sessions, even if those metrics included much more than what is usually available from search engine logs.47 On the other hand, and perhaps more importantly, it gives us two very important results on the relation between explicit measures, which have not been addressed by the authors.

Firstly, the impact of result ratings on the precision of session evaluation suggests that there is indeed a connection between explicit single-result ratings and user satisfaction as measured by the explicit session ratings collected in this study. This is a rare case where result- and session-based measures have indeed been collected side by side, and where an attempt has been made to incorporate the first into a model predicting the second. The “secondly”, however, goes in the opposite direction. If we consider that they were aided by other log data as well as usage data not usually available to researchers, and a complex model, the predictive power of explicit result ratings is remarkably low – 70% as compared to a baseline of 56%.

Given that the ratings were obtained from the same user who also rated the query, and the result rating was in relation to this query, we can hardly hope for more or better data. This means that either the Bayesian model employed was very far from ideal, or that the correlation of result ratings with user satisfaction with a search session is quite low. Together, the results suggest that, while the perceived quality of individual results has a noticeable influence on the perceived quality of the search experience, this influence is not so high as to be, at least on its own, an accurate predictor of user satisfaction.

46 Fox, Karnawat et al. used a plug-in for the Internet Explorer browser.

47 As the authors rightly note, this problem could be mitigated by using the implicit result ratings constructed in the first part of the study. However, it remains to be seen whether this less reliable input can provide the same level of improvement in session-based evaluation.

The concentration of some of the studies presented in this chapter on user satisfaction (or sometimes user success in executing a task) merits a separate discussion. For studies performed within the Cranfield paradigm, the “gold standard” against which a system’s performance is measured is the rating for a single document provided by a judge. More concretely, in most studies using TREC data, “gold standard” is associated with a rating made by an expert who is, at the same time, the originator of the query (e.g. Bailey et al. 2008; Saracevic 2008). However, this is only a meaningful standard if one is interested in the relevance to a query of one isolated document. If, instead, the question one wants to ask is about the quality of a result list as a whole (and that is what popular measures like MAP and DCG attempt to answer), it seems prudent to ask the rater for his opinion on the same subject.

User ratings of single documents might or might not be combinable into a meaningful rating of the whole; but this is a question to be asked and answered, not an axiom to start from. An evaluation measure is not logically deduced from some uncontroversial first principles; instead, it is constructed by an author who has some empirical knowledge and some gut feelings about what should be a good indicator of search engine quality. If we probe further and ask what constitutes high search engine quality, or what an evaluation standard for evaluation methods should be, we unavoidably arrive at a quite simple conclusion: “By definition, whatever a user says is relevant, is relevant, and that is about all that can be said” (Harter 1992, p. 603).

Part II: Meta-Evaluation

Are you happy? Are you satisfied?
QUEEN, “ANOTHER ONE BITES THE DUST”

“While many different evaluation measures have been defined and used, differences among measures have almost always been discussed based on their principles. That is, there has been very little empirical examination of the measures themselves” (Buckley and Voorhees 2000, p. 34). Since the writing of that statement, things have changed; the previous chapter contains a number of studies concerning themselves with empirical examination of search engine evaluation measures. However, it also shows that the results of this empirical work are not ones that would allow us to carry on with the methods we have. For most measures examined, no correlation with user satisfaction has been found; for a few, brief glimpses of correlation have been caught, but the results are far too fragile to build upon. What, then, do the search engine evaluation measures measure? This is where the present work enters the stage.

–  –  –

As I already remarked, it is important to make sure the metric one uses measures the intended quality of the search engine. So, after considering the various evaluation types, it is time for us to ponder one of the central issues of information retrieval evaluation: the concept of relevance. I do not intend to give a complete overview of the history or breadth of the field; instead, I will concentrate on concepts which are promising for clarifying the issues at hand.

This is all the more appropriate since for most of the history of the “relevance” concept, it was applied to classical IR systems, and thus featured such elements as intermediaries or the query language, which do not play a major role in modern-day web search engines.

The approach I am going to embrace is a modification of a framework by Mizzaro (1998). He proposed a four-dimensional relevance space; I shall discuss his dimensions, consider their appropriateness for web search evaluation, and modify his model to fit my requirements.

The first category, adopted from earlier work (Lancaster 1979), is that of the information resource. Three types are distinguished. The most basic is the surrogate; this is a representation of a document as provided by the IR system, such as bibliographical data for a book. The next level is the document itself, “the physical entity that the user of an IR system will obtain after his seeking of information” (Mizzaro 1998, p. 308).48 The third level, the most user-centered one, is the information the user obtains by examining the provided document.

