On Search Engine Evaluation Metrics. Inaugural dissertation for the doctoral degree in philosophy (Dr. Phil.), submitted to the Philosophische ...
The second category is the representation of the user’s problem; again, it is based on earlier research (Taylor 1967). The original form is called the Real Information Need (RIN). This means, broadly speaking, the real world problem which the user attempts to solve by his search. The user transforms this into a Personal Information Need (PIN); this is his perception of his information need. Then, he formulates his PIN in a request (a verbal formulation of the PIN), before formalizing it into a query (a form that can be processed by the information retrieval system).
The third category is time. The user’s requirements may change over time, whether as a result of changes in the world and therefore in the RIN, or because of the progress of the user’s search itself. For example, the user may hit upon a promising branch of information and then wish to encounter more documents from this area.
The fourth category is constituted by what Mizzaro calls components. He identifies three of those: topic – "the subject area interesting for the user"; task – "the activity that the user will execute with the retrieved documents"; and context – "everything not pertaining to topic and task".
Obviously, "physical entities" include electronic entities.
Figure: Mizzaro’s relevance model without the temporal dimension (Mizzaro 1998, p. 314).
For the purposes of this study, I propose some changes to this model. First, as mentioned above, some of its aspects are not appropriate for modern web search. Second, the present work concerns itself only with a certain part of web search evaluation, making it possible to omit parts of the model, while acknowledging their importance for other aspects of evaluation. To distinguish my model from others, I will call it the Web Evaluation Relevance model, or WER model for short.
First, let us address the question of representation. The possibility of differences between RIN and PIN is undeniable; as an extreme example, the user can simply misunderstand the task at hand. But while "find what I mean, not what I say" has been mentioned quite often [49] (it refers to inducing the PIN from the query), "find what I need, not what I want", its equivalent for the RIN–PIN distinction, does not seem to be very relevant to the field of Information Retrieval. [50] Granted, there are cases when a user learns during his search session and recognizes that his information needs are different from what he assumed. But most of them lie outside the scope of this work. The easy cases might be caught by a spell checker or another suggestion system.
In difficult cases, the user might need to see the "wrong" results to realize his error, and thus those results will still be "right" for his RIN as well as his PIN. All in all, it seems justifiable in these circumstances to drop the distinction and to speak, quite generally, of a user’s information need (IN). The request, on the contrary, has its place in search evaluation. If the raters are not the initiators of the query, they might be provided with an explicit statement of the information need.
[49] It has even been the title of at least two articles (Feldman 2000; Lewandowski 2001).
[50] Although Google, for one, has had a vacancy for an autocompleter position, with a "certificate in psychic reading strongly preferred" (Google 2011a; Google 2011b).
Next, let us consider the information resources. Surrogates and documents are obviously the most frequently used evaluation objects. Lately, however, there has been another type: the result set (studies directly measuring absolute or relative result set quality include Fox et al. 2005; Radlinski, Kurup and Joachims 2008). Different definitions of "result set" are possible; we shall use it to mean those surrogates and documents provided by a search engine in response to a query that have been examined, clicked or otherwise interacted with by the user.
Usually, that means all surrogates up to the one viewed last, [51] and all visited documents. The evaluation of a result set as a whole might be modeled as the aggregated evaluation of surrogates as well as documents within their appropriate contexts; but, given the number of often quite different methods for obtaining set judgments from individual results, the process is far from straightforward. Thus, for our purposes, we shall view set evaluation as separate from that of other resources, while keeping in mind the prospect of equating it with a surrogate/document metric, should one be shown to measure the same qualities. It is clear that the user is interested in his overall search experience, meaning that set evaluation is the most user-centered type. Surrogates and documents, however, cannot easily be ordered as to their closeness to the actual user interest. While the user usually derives the information he needs from documents, for simple, fact-based informational queries (such as "length of Nile") all the required data may be available in the snippet. Furthermore, the user only views documents whose surrogates he has already examined and found promising. Therefore, I will regard surrogates and documents as equidistant from the user’s interests.
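This working definition of a result set can be illustrated with a small sketch. The function and the sample interaction data are my own hypothetical choices, not taken from any cited study:

```python
# Sketch of the result-set definition used above: given the rank of the
# last surrogate the user viewed and the set of clicked ranks, collect
# the surrogates and documents that make up the "result set".
# All names and data structures here are hypothetical.

def result_set(ranked_urls, last_viewed_rank, clicked_ranks):
    surrogates = ranked_urls[:last_viewed_rank]        # top-down examination
    documents = [ranked_urls[r - 1] for r in sorted(clicked_ranks)]
    return surrogates, documents

urls = ["a", "b", "c", "d", "e"]
s, d = result_set(urls, last_viewed_rank=4, clicked_ranks={1, 3})
print(s)  # ['a', 'b', 'c', 'd']
print(d)  # ['a', 'c']
```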
I do not distinguish between components. Topic and task are parts of the user’s information need, and as such already covered by the model. [52] I also do not use context in the way Mizzaro did. Instead, I combine it with the time dimension to produce quite another context, which I will use to mean any change in the user’s searching behavior as a result of his interaction with the search engine. In the simplest form, this might just mean the lesser usefulness of results at later ranks, or of information already encountered during the session.
On a more complex level, it can signify anything from the user finding an interesting subtopic and desiring to learn more about it, to a re-formulation of the query, to better surrogate evaluation by the user caused by an increased acquaintance with the topic or with the search engine’s summarization technique.
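The two simplest forms of context just mentioned – a rank discount and a penalty for already encountered information – can be sketched as follows; the gain values and topic labels are invented for illustration:

```python
# Sketch: two simple forms of "context" as defined above - a rank
# discount (later results are less useful) and a novelty penalty
# (information already seen in the session counts for nothing).
# Gains and topic labels are invented for illustration.
import math

def contextual_gain(results, seen_topics=None):
    seen = set(seen_topics or ())
    total = 0.0
    for rank, (topic, gain) in enumerate(results, start=1):
        if topic in seen:
            continue                              # already-encountered information
        seen.add(topic)
        total += gain / math.log2(rank + 1)       # rank discount
    return total

session = [("nile", 3), ("nile", 3), ("egypt", 2)]
print(contextual_gain(session))  # 3.0 + 0 + 1.0 = 4.0
```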
[51] The result list is normally examined from the top down in a linear fashion (Joachims et al. 2005).
[52] One might reverse the reasoning and define the information need as the complex of all conceivable components. There may be many of those, such as the desired file type of the document, its language, and so forth. These are all considered to be part of the user’s information need.
Figure 7.2. The Web Evaluation Relevance (WER) model; one axis runs from "No context" to "Context". The point at the intersection of the three axes indicates maximal user-centricity.
Using this model, we can now return to the metrics described above and categorize them.
Precision, for example, is document-based (though a surrogate precision is also conceivable) and context-free; the representation type may be anything from a query to the information need, depending on whether the originator of the query is identical with the assessor. MAP is similar, but allows for a small amount of context – it discounts later results, reflecting findings that users find those less useful, if they find them at all. Click-based measures, on the other hand, are based on the information need and incorporate the entire context, but regard only the surrogate (since the relevance of the document itself is not assessed).
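For concreteness, the two document-based metrics just categorized can be sketched in a few lines; the binary judgments below are invented for illustration:

```python
# Sketch: document-based, context-free precision vs. the mildly
# context-sensitive MAP component (average precision), computed over
# one hypothetical ranked list. Judgments are invented.

def precision_at_k(judgments, k):
    """Fraction of relevant results among the top k (no rank discount)."""
    return sum(judgments[:k]) / k

def average_precision(judgments):
    """Mean of precision@k over the ranks k holding a relevant result.
    Later relevant results contribute less - a small dose of 'context'."""
    hits, total = 0, 0.0
    for k, rel in enumerate(judgments, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0

judged = [1, 0, 1, 1, 0]          # 1 = relevant, 0 = not
print(precision_at_k(judged, 5))  # 0.6
print(average_precision(judged))  # (1/1 + 2/3 + 3/4) / 3 = 0.8055...
```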
What is the WER model good for? We can postulate that the more user-centric metrics are better at discerning a user’s actual search experience. To me, it stands to reason that a user’s satisfaction with a result list as a whole provides a more appropriate view of his web search session than his satisfaction with individual results; or that a result list should, if possible, be evaluated with regard to the information need rather than the query itself. [53] Similarly, metrics that consider more aspects of context might be supposed to be better. However, this last assertion is more problematic. We can hypothesize about the superiority of some context-sensitive user model, but, as the variety of user models employed in evaluations suggests, they cannot all be appropriate in all circumstances. Therefore, this is another issue awaiting testing: do metrics which employ a more user-centric model of relevance generally perform better than those that do not?
[53] Of course, there are cases when a query is obviously and blatantly inappropriate for an information need (e.g. a "me" search to find information about oneself). But these queries do occur, and all search algorithms are in the same situation with regard to them.
The discussions in the earlier sections indicate that user-based measures can generally be expected to be closer to users’ real satisfaction than system-based ones. It is not hard to obtain very user-centered measures; all one has to do is ask the user about his satisfaction after a search session. However, this metric has one very important disadvantage: it is hard to use for improving the search engine, which, after all, is what evaluation should be about. If a user states that he is unsatisfied or hardly satisfied with a search session, nothing is learned about the reasons for this dissatisfaction. One possible way to overcome this problem is to submit the ratings, along with a list of features for every rated result list, to an algorithm (e.g. a Bayesian model) which will try to discern the key features connected to the user’s satisfaction. However, this seems too complex a task, given the amount of data that can possibly influence the results; and I do not know of any attempts to derive concrete proposals for improvement from explicit set-based results alone.
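The feature-attribution idea might be sketched as follows. A plain correlation stands in here for the Bayesian model mentioned above, and the feature names, feature vectors and satisfaction ratings are invented:

```python
# Sketch: given per-result-list feature vectors and the user's session
# ratings, rank features by how strongly they co-vary with satisfaction.
# Pearson correlation stands in for a more elaborate (e.g. Bayesian)
# model; all features and ratings are invented.
from statistics import mean

def feature_correlations(features, ratings):
    out = {}
    for name in features[0]:
        xs = [f[name] for f in features]
        mx, my = mean(xs), mean(ratings)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ratings))
        vx = sum((x - mx) ** 2 for x in xs) ** 0.5
        vy = sum((y - my) ** 2 for y in ratings) ** 0.5
        out[name] = cov / (vx * vy) if vx and vy else 0.0
    # strongest (absolute) association with satisfaction first
    return sorted(out.items(), key=lambda kv: -abs(kv[1]))

lists = [{"top_hit": 1, "ads": 3}, {"top_hit": 0, "ads": 1},
         {"top_hit": 1, "ads": 2}, {"top_hit": 0, "ads": 3}]
scores = [5, 2, 4, 1]
print(feature_correlations(lists, scores))  # "top_hit" ranks first
```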
Document-based measures have, in some ways, the opposite problems. They are "recomposable under ranking function changes" (Ali, Chang and Juan 2005, p. 365); that is, it can be directly calculated how the metric results will change if the retrieval mechanism is adjusted in a certain way. However, this comes at a price. As factors like already encountered information – or, for that matter, any distinction on the context dimension – are generally not accounted for in a document-based metric, its correlation with user satisfaction is by no means self-evident. Of course, the distinction is not binary but gradual; as metrics become less user-centered, they become more usable for practical purposes, but presumably less reliably correlated with actual user satisfaction, and vice versa. This is one of the reasons why understanding the connections between different measures is so important; if we had a user-centered and a system-centered measure with a high correlation, we could use them to predict, with high confidence, the effect of changes made to a retrieval algorithm.
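Recomposability can be illustrated briefly: with fixed per-document judgments in hand, the metric value under an adjusted ranking follows by mere recomputation, with no new user study. The judgments and rankings below are invented:

```python
# Sketch of "recomposability under ranking function changes": once
# per-document relevance judgments exist, the effect of a ranking
# change on a document-based metric is directly computable.
# Judgments and rankings are invented for illustration.

def precision_at_k(relevance, k):
    return sum(relevance[:k]) / k

judged = {"a": 1, "b": 0, "c": 1, "d": 1}   # fixed document judgments
old_ranking = ["a", "b", "c", "d"]
new_ranking = ["a", "c", "d", "b"]          # adjusted retrieval mechanism

for ranking in (old_ranking, new_ranking):
    rels = [judged[doc] for doc in ranking]
    print(ranking, precision_at_k(rels, 3))  # 0.666... -> 1.0
```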
An additional complication is introduced by the abstracts provided by the search engines on the result page. The user obviously only clicks on a result if he considers the abstract to indicate a useful document. Therefore, even the best document-based metric is useless if we do not know whether the user will see the document at all. This might suggest that, for a complete evaluation of a search engine to be both strongly related to real user satisfaction and helpful for practical purposes, all types should be used. The evaluation of user behavior confirms this notion. “User data […] shows that 14% of highly relevant and 31% of relevant documents are never examined because their summary is judged irrelevant. Given that most modern search engines display some sort of summary to the user, it seems unrealistic to judge system performance based on experiments using document relevance judgments alone. Our re-evaluation of TREC data confirms that systems rankings alter when summary relevance judgments are added” (Turpin et al. 2009, p. 513).
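The point of the quotation can be made concrete with a toy metric in which a document contributes its relevance only if its summary would also have been judged worth a click; both judgment lists are invented for illustration:

```python
# Sketch of the combined judgment idea from Turpin et al.: a document
# only counts as a hit if its summary AND the document itself are
# judged relevant. Both judgment lists are invented.

def combined_precision(summary_rel, document_rel):
    """Precision counting a hit only when summary and document agree."""
    hits = sum(1 for s, d in zip(summary_rel, document_rel) if s and d)
    return hits / len(document_rel)

summaries = [1, 0, 1, 1]   # would the surrogate be clicked?
documents = [1, 1, 1, 0]   # is the landing page relevant?
print(combined_precision(summaries, documents))  # 0.5
print(sum(documents) / len(documents))           # 0.75, document-only view
```

The document at rank 2 is relevant but invisible behind an unconvincing summary, so the document-only figure overstates what the user experiences.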
8.1 Evaluation Criteria

As discussed in the beginning of this work, how to judge evaluation metrics is an intricate question which is nevertheless (or rather: precisely for that reason) vital to the whole process of evaluation. To get a meaningful and quantifiable answer, one has to ask the question with more precision.
Mostly, studies are not interested in an absolute quality value of a result list. Rather, the relevant question tends to be whether one result list has a higher quality than another. It can be asked to compare different search engines; or to see whether a new algorithm actually improves the user experience; or to compare a new development with a baseline. The common feature is the comparison of the quality of two result lists.
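In its simplest form, this comparative question can be operationalized as a count of per-session preferences between the two lists; the preference data below are invented for illustration:

```python
# Sketch of the pairwise question asked above: given per-session
# preferences between result lists A and B, a simple majority count
# decides which list "won". Preference data are invented.

def preference_winner(prefs):
    """prefs: iterable of 'A', 'B', or 'tie', one entry per comparison."""
    a = sum(1 for p in prefs if p == "A")
    b = sum(1 for p in prefs if p == "B")
    if a == b:
        return "tie"
    return "A" if a > b else "B"

print(preference_winner(["A", "A", "tie", "B", "A"]))  # A
```

A real study would of course add a significance test over these counts rather than take the raw majority at face value.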
But how to get such a judgment? I opted for the direct approach of asking the users. This has some disadvantages; for example, the user might not know what he is missing, or might misjudge the quality of the results. These problems reflect the difference between real and perceived information need (as discussed in detail by Mizzaro (1998)). However, as discussed in Chapter 7, a judgment based on a real information need (RIN) would be more problematic.