# «On Search Engine Evaluation Metrics Inaugural-Dissertation zur Erlangung des Doktorgrades der Philosophie (Dr. Phil.) durch die Philosophische ...»

8.2 and see that the PIR is highest (0.835) when the threshold is zero; so 0.835 is the value we use for the metric comparison. For a cut-off value of 5, precision peaks (or rather plateaus) at thresholds 0.01 to 0.04, also with a value of 0.835; so in the current graph, Precision PIR returns to 0.835 at cut-off 5. This method has two disadvantages: firstly, the threshold values used cannot be deduced from the graph;[65] and secondly, it almost certainly overestimates the metrics’ success. Consider that in order to be realistic, we have to have enough data to reliably discern the most successful threshold value for each metric and cut-off value. The finer the threshold steps, the less significant any difference found is likely to be. In Figure 8.2, you can see that for the cut-off value of 5, the 0.01 steps might be just about fine enough to capture the ups and downs of the PIR; but at the same time, most differences between neighbouring data points are not statistically significant. While the PIR with the threshold set at 0.2 is significantly lower than that for t=0.1, the PIR rise from 0.785 to 0.795 between t=0.08 and t=0.09 is not, at least in this study.

If we make the steps larger, we will, in most cases, see the same picture: the PIR tends to be largest with the lowest thresholds, and to decline as the threshold increases. Therefore, the second method of determining which threshold values to use for inter-metric PIR comparison is using t=0. As you will see for yourself on the evaluation graphs, this is the most obvious candidate if you are looking for a constant threshold value; it has the maximal PIR in most of the cases (see the discussion in Section 12.2.2). Those two methods constitute convenient upper and lower boundaries for the PIR estimate since the former relies too much on threshold “quality” for individual metrics and cut-off values, while the latter does not make any use of it at all.
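The two ways of choosing a threshold can be contrasted in a few lines of Python. This is only a minimal sketch: the function names are hypothetical, and the PIR values are illustrative numbers (a few of them chosen to echo the figures quoted above), not data from the study.

```python
from typing import Dict, Tuple

# Illustrative PIR values for one metric at one cut-off, keyed by threshold t.
# These numbers are assumptions made for the sketch, not study data.
PIR_BY_THRESHOLD: Dict[float, float] = {
    0.00: 0.825,
    0.01: 0.835,
    0.02: 0.835,
    0.03: 0.835,
    0.04: 0.835,
    0.08: 0.785,
    0.09: 0.795,
}


def best_threshold(pir_by_threshold: Dict[float, float]) -> Tuple[float, float]:
    """Method 1: take the threshold with the highest PIR.

    Ties are resolved in favour of the lowest threshold.
    """
    return max(pir_by_threshold.items(), key=lambda item: (item[1], -item[0]))


def zero_threshold(pir_by_threshold: Dict[float, float]) -> Tuple[float, float]:
    """Method 2: always use the constant threshold t = 0."""
    return 0.0, pir_by_threshold[0.0]


upper = best_threshold(PIR_BY_THRESHOLD)   # optimistic estimate
lower = zero_threshold(PIR_BY_THRESHOLD)   # estimate without threshold tuning
print(f"best threshold: t={upper[0]}, PIR={upper[1]}")
print(f"zero threshold: t={lower[0]}, PIR={lower[1]}")
```

By construction, the first estimate can never be lower than the second, which is why the two can serve as upper and lower boundaries.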

Why this happens will be the topic of much discussion in the evaluation section itself.

[65] Unless an additional threshold value is given for each data point, which would make the graphs harder to read than Joyce’s Ulysses.


In this section, the layout of the study will be explained. After that, some results will be provided which do not directly concern the main questions of the present study, but are interesting in themselves or may provide useful information for other issues in evaluation or IR studies. Furthermore, many of these findings will help to put the main result into the proper context.

9.1 Gathering the Data

The study was conducted in July 2010. 31 Information Science undergraduates from Düsseldorf University participated in the study. The users were between 19 and 29 years old; the average age was 23 years. 18 participants (58%) were female, while 13 (42%) were male.

The raters were also asked to estimate their experience with the web in general and with search engines in particular on a scale from 1 to 6, 1 signifying the largest possible amount of experience, and 6 the lowest.[66] The overwhelming majority considered themselves to be quite to very experienced (Figure 9.1); this should not be surprising since their age and choice of study indicate them to be “digital natives” (Prensky 2001).

[66] The scale is familiar to German students as it is the standard grading system in German education. The scale is explained in more detail later in this section.

Figure 9.1. Self-reported user experience, 1 being highest and 6 lowest.

The study was conducted entirely online using a web-based application designed and implemented by the author for this very purpose. Each user logged on to the system using a unique ID. Both the front-end and back-end were written in PHP, with data stored in a MySQL database. The present section contains screenshots of the front-end (web interface used by the participants) to help the gentle reader visualize the procedures (apropos visualization: some of the visualizations in this work have been created directly from the back-end using the JpGraph PHP library; others were prepared and created using Microsoft Excel). As a result of this set-up, the raters could participate in the study from any computer with internet access. After a supervised introductory and practice session, they were instructed to proceed with the study on their own schedule; the only constraint was that all tasks were to be finished within one week. If any questions or other issues arose, the participants were encouraged to use the contact form, the link to which was provided on every page.

The raters were asked to submit queries for information needs they had at the time or in the recent past, together with a description of the information need itself.[67] They were also to indicate the language they would prefer the results to be in; the choice was only between English and German. Queries aimed at other languages were not accepted; as all students, through their background, could be expected to be proficient in German and English, queries issued by one rater could be understood (at least insofar as the language is concerned) by another.

Figure 9.2. Query/information need entry form

The query was then forwarded, via the BOSS (“Build your Own Search Service”) API,[68] to the now-defunct Yahoo search index.[69] The top 50 results were fetched; in particular, the result URL, the page title, and the query-specific snippet constructed by Yahoo were obtained. The snippet and URL contained formatting which was also query-specific and mostly highlighted terms contained in the query. Then, the source code of every document was downloaded; this was later used to ensure the documents had not changed unduly between evaluations.[70]

[67] Of course, if the user had already attempted to satisfy his information need before the study, this would influence the judgments. This would not make a large difference, as most students stated they provided a current information need; furthermore, as will become clear later in this section, every result set was rated by the originator as well as by others. Still, this constitutes one of the possible methodological problems with the present evaluation.

[68] http://developer.yahoo.com/search/boss/

[69] http://www.yahoo.com; the Yahoo portal and the search engine obviously still exist, but now they use Bing technology and index.

As one of the criteria for evaluation was to be user preferences, multiple result lists were needed for each query to provide the user with a choice. From the 50 results obtained for each query (no query had fewer than 50 hits), two result lists were constructed. One contained the results in the original order; for the other, they were completely randomized. Thus, the second result list could theoretically contain the results in the original order (though this never happened).
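The construction of the two result lists can be illustrated with a short Python sketch. The study's application was written in PHP, so this is not the original code; the function name, the placeholder result identifiers and the seed are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def build_result_lists(
    results: Sequence[str], rng: random.Random
) -> Tuple[List[str], List[str]]:
    """Return the results in their original order and in a fully random order."""
    original = list(results)
    randomized = list(results)
    # A uniform shuffle can, in principle, reproduce the original order.
    rng.shuffle(randomized)
    return original, randomized


# Hypothetical stand-ins for the 50 results fetched for one query.
results = [f"result-{rank}" for rank in range(1, 51)]
original, randomized = build_result_lists(results, random.Random(2010))
print(original[:3], randomized[:3])
```

Both lists contain the same 50 results; only the ranking differs.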

After this, the evaluation proper began. After logging into the system, the raters were presented with a User Center, where they could see whether and how many queries they needed to submit, how many evaluations they had already performed, what the total number of evaluations to be performed was, and what the next evaluation task type would be (Figure 9.3). With a click on the relevant link, the rater could then commence the next evaluation task.

Figure 9.3. User Center.

The first type of evaluation was preference evaluation. The user was presented with a query as well as an information need statement and desired language on top of the page. The main part of the page consisted of two columns, each containing one version of the result list for the current query. The left-right-ordering of the result lists was randomized to exclude a bias for one side. Also, the result lists were anonymized, so that it was not discernible that they were provided by Yahoo, to exclude potential search engine bias (Jansen, Zhang and Zhang 2007).

Instead, the design of each column was based on that used by Google for the presentation of its result list.[71] This did not include any branding or advanced search features, but rather the then-current styling and colors of the result presentation. Each result column contained 10 results (consisting of title, snippet and URL with term highlighting as provided by Yahoo); below the last result, the raters were provided with a link to navigate to the next result page containing results ranked 11-20 on the corresponding result lists.[72] At the bottom of the page, the users were provided with a box. It contained the question “Session completed?” and three links: the left to indicate that the left-side result list was considered better, the right to indicate the preference for the right-side result list, and the central link to signal that neither result list was preferred.

[70] A small content variation (up to 5% of characters) was tolerated, as many documents contained advertisements or other changeable information not crucial to the main content.

[71] http://www.google.com/search?name=f&hl=en&q=the+actual+query+we+pose+is+not+important%2C+rather%2C+the+aim+of+this+reference+is+to+demonstrate+the+design+of+a+Google+result+page

[72] Obviously, if the user was on page 2 to 5, he could also return to a previous page. The number of pages was limited to five.
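The tolerance for content variation mentioned in this section (up to 5% of characters) could be checked along the following lines. How the variation was actually measured is not stated; the sketch below assumes a character-level similarity ratio from Python's standard `difflib` module, and the function names are hypothetical.

```python
from difflib import SequenceMatcher

# Tolerance taken from the text: up to 5% of characters may differ.
MAX_CHANGED_FRACTION = 0.05


def changed_fraction(old_source: str, new_source: str) -> float:
    """Approximate share of characters that differ between two versions.

    Assumption: 1 - SequenceMatcher.ratio() is used as the change measure.
    """
    matcher = SequenceMatcher(None, old_source, new_source, autojunk=False)
    return 1.0 - matcher.ratio()


def is_unchanged_enough(
    old_source: str, new_source: str, limit: float = MAX_CHANGED_FRACTION
) -> bool:
    """True if the document changed by no more than the tolerated fraction."""
    return changed_fraction(old_source, new_source) <= limit


stored = "a" * 100                 # page source saved at download time
refetched = "a" * 98 + "bb"        # same page with a small changed region
print(is_unchanged_enough(stored, refetched))
```

A stricter or looser measure (for example, comparing only the visible text) would be equally compatible with the description given here.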


Figure 9.4. Preference evaluation.

The raters were instructed to use the result page as they liked, attempting to satisfy the given information need. They were also told to end the session when they felt continuing would not be worthwhile, either because they had already found what they were looking for or because they thought it unlikely that the rest of the results contained good documents. Simplifying somewhat, the session was to end when the results were thought to be good enough or else too bad. The end of the session was marked by a click on one of the three options described above, marking the user’s preference or the lack thereof. This click took the rater back to the User Center. There have been reports of search engines using such side-by-side result lists for their internal evaluations; later, this was confirmed, at least for Google (see Figure 9.5).

Figure 9.5. A side-by-side evaluation screen used by Google (Google 2011c). Unfortunately, the screenshot is only available from an online video, hence the poor image quality.

In the second type of evaluation, the aim was to get an assessment of user satisfaction. The procedure was similar to that employed for preference evaluation, but the raters were provided with just one result list, as is usual for search engines. The list could be either the “original” or the “random” one; the distribution of different result list types was randomized here as well. A user only evaluated one result list type for a query (if he had already worked with a side-by-side evaluation for a certain query, his satisfaction evaluation would surely not be realistic any more, as he knew the result list and individual results already). The only other difference to the preference evaluation was the feedback box which featured only two links, one indicating satisfaction with the search session and one stating dissatisfaction. The users were instructed to judge satisfaction on a purely subjective level. The layout can be seen in Figure 9.6. Again, a click on one of the “Session end” links redirected the rater to the User Center.

Figure 9.6. Satisfaction evaluation.

For both preference and satisfaction evaluation, log data was recorded. It included the session start timestamp (the time when the result page was first shown to the user), the clicks the user performed (in particular, result clicked on and the time of the click), and the session end timestamp (the time when the user clicked on a link in the “session finished” box) as well as the user ID. Also, for every session the query and information need and the display mode of the result lists were included.
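The log record described above might be modelled roughly as follows. This is a sketch only: the study stored its data in a MySQL database, and all class and field names here, as well as the choice to identify a clicked result by its rank, are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Click:
    result_rank: int        # assumed identifier of the clicked result
    clicked_at: datetime    # time of the click


@dataclass
class SessionLog:
    user_id: str
    query: str
    information_need: str
    display_mode: str       # e.g. which result list types were shown
    started_at: datetime    # result page first shown to the user
    ended_at: datetime      # click on a link in the "session finished" box
    clicks: List[Click] = field(default_factory=list)

    def duration_seconds(self) -> float:
        """Length of the session from first display to the final click."""
        return (self.ended_at - self.started_at).total_seconds()

    def seconds_to_first_click(self) -> Optional[float]:
        """Time until the first result click, or None if nothing was clicked."""
        if not self.clicks:
            return None
        first = min(click.clicked_at for click in self.clicks)
        return (first - self.started_at).total_seconds()


# Hypothetical example session.
log = SessionLog(
    user_id="u01",
    query="example query",
    information_need="example information need",
    display_mode="original",
    started_at=datetime(2010, 7, 1, 12, 0, 0),
    ended_at=datetime(2010, 7, 1, 12, 2, 30),
    clicks=[Click(result_rank=3, clicked_at=datetime(2010, 7, 1, 12, 0, 40))],
)
print(log.duration_seconds(), log.seconds_to_first_click())
```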

The third type of evaluation was result evaluation. Here, the rater was presented with the query and information need and just one result description, again consisting of title, snippet and URL with the appropriate highlighting and kept in the Google layout. After the result description, the user was confronted with two questions. The first, which he was to ponder before actually clicking on the result link, was whether the result seemed relevant for the query under consideration, or, put in another way, whether he would click on this result given the query and information need. After that, the user was to click on the result and examine the actual document. The pertinence of the document to the query was then to be rated on a scale from 1 to 6, with 1 being the best imaginable result and 6 a result not providing any value to the rater. This scale, while unusual in web search evaluation, has the advantage of being very well-established in Germany, as it is the standard academic grading scale for education in schools as well as at universities. Users generally have problems in differentiating between neighboring grades; the employment of a scale they already know how to use and interpret was meant to reduce the uncertainty and the decrease of judgment quality associated with it.

For evaluation purposes, the scale was internally converted into the usual 0..1 range (with a rating of 1 corresponding to 1.0, 2 to 0.8, and so forth). After selecting a grade, the user submitted the form and was automatically provided with the next result to judge.
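The conversion can be written as a one-line formula. The sketch below assumes that the mapping continues linearly beyond the two values given in the text, so that a grade of 6 corresponds to 0.0; the function name is hypothetical.

```python
def grade_to_score(grade: int) -> float:
    """Map a German-style grade (1 = best, 6 = worst) onto the 0..1 range.

    Assumes a linear mapping: 1 -> 1.0, 2 -> 0.8, ..., 6 -> 0.0.
    """
    if not 1 <= grade <= 6:
        raise ValueError(f"grade must be between 1 and 6, got {grade}")
    return (6 - grade) / 5


for grade in range(1, 7):
    print(grade, grade_to_score(grade))
```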

Figure 9.7. Result evaluation.

After all the required evaluations had been performed, the raters received the message that they had completed the evaluation. The requirements included 6 preference evaluation judgments, 12 satisfaction evaluation judgments and up to 350 result evaluation judgments.

Overall, the 31 users contributed 42 queries, which they evaluated by providing 147 preference judgments, 169 satisfaction judgments, and 2403 individual result ratings.

