On Search Engine Evaluation Metrics. Inaugural dissertation for the degree of Doctor of Philosophy (Dr. Phil.), submitted to the Philosophische ...
9.2 The Queries

The raters contributed a total of 42 queries. There were no restrictions placed on the kind of query to be entered, since the intention was to gather queries as close as possible to those employed for real-life information needs. In particular, this means no query type was predetermined. However, the detailed information need statements allowed for an unequivocal identification of the original intent behind the queries. According to Broder's widely used taxonomy, 31 queries were informational, 6 were transactional and 2 were navigational.
Two queries are similar to what has been described as "closed directed informational queries" (Rose and Levinson 2004), commonly called "factual", as they are generally geared towards finding a specific fact (Spink and Ozmultu 2002). Such a query aims not at finding as much information as possible on a certain topic, but rather at a (presumably definitive and singular) answer to a well-defined question. Another example would be a query like "How long is the Nile". I would like to further narrow down the definition of such queries, to distinguish them from informational ones and to highlight special features particular to them. These features are:
- The query can easily be reformulated as a question starting with "who", "when", or "where".74
- The originator of the query knows exactly what kind of information he is looking for.
- The answer is expected to be short and concise.
- The answer can be found in a snippet, eliminating the need to examine the actual results.
- One authoritative result is enough to cover the information need.
- There are many pages providing the needed information.
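The first of these features suggests a simple heuristic for spotting factual queries. The following sketch is purely illustrative (the word lists and the function name are my own, not part of the study), and it deliberately treats open-ended "how" questions as non-factual:

```python
import re

# Question words that typically signal a factual query; "how" is only
# included together with a narrowing modifier, since open-ended
# "how do you..." questions are not factual.
FACTUAL_START = re.compile(
    r"^(who|when|where|how (long|big|old|many|much|tall|far))\b",
    re.IGNORECASE,
)

def looks_factual(query: str) -> bool:
    """Rough heuristic: does the query read like a fact-seeking question?"""
    return bool(FACTUAL_START.match(query.strip()))
```

On the example from the text, looks_factual("How long is the Nile") returns True, while an open-ended "how do you..." query falls through. Of course, the question word is at best a shortcut for the real criterion: a short, precise expected answer.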
The last three properties can be used to derive some interesting predictions for this type of query. If all the needed information can (and, ideally, will) be found on the result page of the search engine, the sessions with the highest user satisfaction will tend to have low durations and no clicks – features also typical of "abandoned sessions".
In other words, sessions without any clicks are typically assumed to have been worthless for the user (Radlinski, Kurup and Joachims 2008), while for factual queries, the opposite might be true:
the absence of a click can indicate a high-quality result list. Furthermore, the last two points indicate that, while factual queries resemble navigational ones in that a single result will satisfy the user's information need, they are unlike navigational queries in that there are many possible results which might provide the fact the user is looking for.

Footnote: Here and further, information needs are translated from German where needed.

Footnote 74: "What" and "how" are more complex cases. Obviously, my own example ("How long is the Nile") starts with "how". Mostly, questions aiming for a concise answer ("how long", "how big", "how old" etc.) are quite typical of factual queries, whereas open-ended, general questions ("how do you...") are not. A similar division holds for "what", with "What is the capital of Madagascar" being a prototypical factual query, but not "What is the meaning of life". In general, the factuality of a query is more meaningfully defined by the expected preciseness and brevity of the answer, with the question word being, at best, a shortcut.
Another type of information need not covered by traditional classifications is represented by what I will call "meta-queries". The single example is provided by a query intended to check the position of a certain page in the result list. The exceptional property of these queries is that a result list cannot be "good" or "bad". Whether the result sought in that example appears in first position, somewhere in the tail, or not in the result list at all, the information need is satisfied equally well. Another example would be the searches done for the comic shown in Figure 9.8.
Here, the user was looking for the number of Google hits for a particular query. The number of returned results and the results themselves do not have any effect on user satisfaction; more than that, it is unclear how any single result can be relevant or non-relevant to that query. The information the user is looking for is not found on any web page except for the search engine's result page; it is either exclusively a result of the search algorithm (as in the case of the query aiming to find out the rank of a certain page in Google), or else a service provided by the search engine which can be described as "added value" compared to the information found on the web itself (as with submitting a query to determine the popularity of a certain phrase). The search, in short, is not a web search about real life; it is a search engine search about the web.75

Figure 9.8. The queries submitted to research this xkcd comic (Munroe 2008) fall in the meta-query category.
There is only one meta-query in the present study, so it will not play a major role. However, it would be interesting to determine how frequent this type of query is in real life, and particularly whether it occurs often enough to warrant special consideration in further studies.
It is also interesting to note that the web comic shown in Figure 9.8 explicitly states that the relevant information is the number of Google hits: not the number of web pages it is supposed to represent, not the popularity of the phrases implied by that number, and not the general attitudes of the population that might be derived from this popularity. This is in no way a criticism of the xkcd comic, which is generally very much in tune with the general attitudes of the population (at least, to judge from what can be seen online); rather, it points out the extent to which this population relies on the notion that Google is an accurate reflection of the web, or even of a large part of the modern world.
Table 9.2 shows the average query length for the different kinds of queries.
Leaving aside the factual, navigational and meta-queries, whose small numbers make them a statistically insignificant minority, we see that the figures are well in accord with the two-to-three-term average found in other studies (Jansen and Spink 2006; Yandex 2008). Also in accord with other studies (White and Morris 2007), no operators were used, which was to be expected for a relatively small number of queries.
There are surely not enough factual, navigational and meta-queries to allow for their meaningful evaluation on their own, and probably also not enough transactional queries.76 For this reason, in considering the study's results, I will evaluate two sets: the informational queries on their own, and all the queries taken together.
9.3 User Behavior

Another set of data concerns user behavior during the search sessions. An interesting question is how long it takes the raters to conduct a session in a single result list or side-by-side condition and to decide upon their satisfaction or preference judgment. Figure 9.9 shows the session length for single result list evaluation. Almost half of all sessions were concluded in less than 30 seconds, and around 80% took up to 90 seconds. Less than 5% of queries took more than 5 minutes. The situation is somewhat different in the side-by-side evaluation (Figure 9.10); here, about 50% of all sessions were completed within a minute, and about 75% within three and a half minutes. 10% of sessions took more than 6 minutes; it has to be noted, however, that a number of sessions had extremely long durations of up to 24 hours. These are cases where the raters started a session but seem to have temporarily abandoned the task. These sessions are included in the evaluation, as all of them were later correctly finished with a satisfaction or preference judgment.77 This might well be in line with actual user behavior, since a user can start a search session, be distracted, and come back to an open result list page the next day. However, these outliers greatly distort the average session time, so median values are more meaningful in these cases: 32 seconds for single result lists and 52 seconds for side-by-side result lists. While the raters – understandably – needed more time to conduct a query with two result lists, the session duration does not seem so high as to indicate an excessive cognitive burden.
Footnote 76: Although there have been studies published with fewer than six different queries.
Footnote 77: These might be considered to consist of multiple sub-sessions, with significantly lower overall durations. However, since in this study there was no clear way of determining the end of the first sub-session and the start of the last one (and since these sessions were rare), I opted to keep them with their full duration.
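The distortion caused by the day-long outliers can be illustrated with a small sketch; the duration values below are invented for illustration (only the medians reported above come from the study):

```python
from statistics import mean, median

# Hypothetical session durations in seconds; the 86 400 s entry stands in
# for a session left open for a day and resumed later.
durations = [12, 20, 25, 32, 40, 55, 70, 90, 300, 86_400]

avg = mean(durations)    # dominated by the single day-long outlier
med = median(durations)  # stays close to the typical session length
```

In this toy example, one abandoned-and-resumed session is enough to push the mean beyond two hours, while the median remains under a minute; hence the use of medians above.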
The first notable result is the large number of sessions in which no results were clicked. As Figure 9.11 indicates, in both single and side-by-side evaluations, almost half of all sessions end without any result being selected. Of the sessions that did have clicks, most had very few; sessions with one to three clicks made up 75% and 63%, respectively, of those with any clicks at all. The graph for non-zero-click sessions can be seen in Figure 9.12. Side-by-side evaluations generally had more clicks than single ones, with averages of 2.4 and 1.9 clicks, respectively. With typical click numbers per session given in the literature ranging from 0.3-1.5 (Dupret and Liao 2010) to 2-3 (Spink and Jansen 2004), these results fall within the expected intervals.
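The two ways of counting used above (an average over all sessions versus a share among only those sessions with at least one click) can be made explicit with a small sketch; the click counts are invented for illustration:

```python
from statistics import mean

# Hypothetical per-session click counts; roughly half the sessions end
# without a click, as in the study.
clicks = [0, 0, 0, 0, 1, 1, 2, 2, 3, 5]

overall = mean(clicks)                     # includes zero-click sessions
clicked = [c for c in clicks if c > 0]     # sessions with at least one click
share_1_to_3 = sum(1 <= c <= 3 for c in clicked) / len(clicked)
```

Depending on which denominator is used, the same log yields quite different numbers, which is worth keeping in mind when comparing click statistics across studies.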
(Figure caption: Clicks per session. No session had more than 20 clicks.)
These numbers indicate that, while raters click more often in side-by-side settings, the click frequency is not so high as to suggest a total change of search habits. With more results to choose from, it is only to be expected that more results are chosen for closer inspection; but since the raters had been instructed to abandon the session whenever they felt probable improvements were not worth the extra effort, they had more chances of seeing good results and satisfying their information needs early on, keeping the increase in clicks to a moderate 0.5 per session.
Figure 9.14. Sample click trajectories (left: randomized; right: original). The arrow coming from the left shows the first click; further arrows indicate later clicks.
9.4 Ranking Algorithm Comparison

Apart from evaluating user behavior, it seems interesting to evaluate the performance of the two result list types. Obviously, the original ranking would be expected to perform much better than the randomized list in preference as well as satisfaction. When I submitted a preliminary paper on this topic (Sirotkin 2011) to a conference, an anonymous reviewer raised a logical issue: "It seems obvious that users always prefer the ranked lists. I would be interested in knowing whether this was indeed the case, and if so, whether this 'easy guess' could have an influence on the evaluation conducted." We can approach the matter of rankings coming from the log statistics; namely, with an evaluation of clicks depending on the rank of the clicked results.

(Figure caption: Click ranks for original and randomized rankings.)
Obviously, clicks are not the best indication of the raters' evaluation of the two result list types – especially given that we explicitly asked the raters to state their satisfaction and preference levels. Figure 9.16 shows the user preference for the original and randomized result lists; the results are not what the reviewer (or I) expected. Of course, the original result list is preferred most of the time; but in over a quarter of all cases the lists were deemed to be of comparable quality, and in almost 15% of judgments the randomized result list was actually preferred to the original. The reasons for this are not easy to fathom. One simple explanation would be that the quality of the original list was simply not good enough, so that a randomization might create a better sequence of results about one in six times.
However, a detailed examination of queries where the randomized result list was preferred by at least some raters shows another pattern. Almost all of those queries are informational or transactional and stated quite broadly, so that the number of potentially interesting hits can go into the thousands or tens of thousands; some examples are shown in Table 9.3. The case of the query "korrellation", though not statistically representative, may nevertheless be typical. It is a factual query, for which any web site with an authoritative feel mentioning the word will probably be not only highly relevant, but also sufficient. The result list preference might be ...

Footnote: Not that this notion needs much support, mind you; it is universally accepted.
The difference between result list types becomes smaller still if we switch from preference to satisfaction (shown in Figure 9.17). While the original result lists had a higher proportion of queries which satisfied most raters (an average satisfaction above 0.5 in 80% versus 64% of cases), the randomized option had a higher proportion of average-quality result lists. Result lists which entirely failed to satisfy the users were equally uncommon (17% and 18%, respectively). This can be taken to mean that, while the original result list is indeed more satisfactory, many (or, ...

Footnote: In case you wonder: it is spelled "Korrelation".