FREE ELECTRONIC LIBRARY - Dissertations, online materials

Pages:     | 1 |   ...   | 2 | 3 || 5 | 6 |   ...   | 26 |

«On Search Engine Evaluation Metrics Inaugural-Dissertation zur Erlangung des Doktorgrades der Philosophie (Dr. Phil.) durch die Philosophische ...»

-- [ Page 4 ] --

More interesting is issue five, which deals with queries. One may use queries submitted by the very person who is to assess the results; in fact, this is mostly regarded as the optimal solution. However, this is not always easy to do in a laboratory experiment; also, one might want to examine a certain type of query, and so needs to be selective. As an alternative, the queries might be constructed by the researcher; or the researcher might define an information need and let the assessors pose the query themselves. As a third option, real queries might be available if the researcher has access to search engine logs (or statistics of popular queries); in this case, however, the information need can only be guessed. The question of information need versus query will be discussed in more detail in Chapter 7 which deals with the types of relevance.

Step six deals with query processing; my answer to this problem is explained in Section 9.1.

Step seven is about assigning treatments to experimental units; that is, balancing assessors, queries, and other experimental conditions so that the results are as reliable as possible. Step eight deals with the means and methods of collecting the data (observation, questionnaires, and so forth); steps nine and ten are about the analysis and presentation of the results. These are interesting problems, but I will not examine them in much detail in the theoretical part, while trying to pay attention to them in the actual evaluations we undertake.

Tague-Sutcliffe’s paradigm was conceived for classical information retrieval systems. I have already hinted at some of the differences search engine evaluation brings with it, and more are to come when evaluation measures will be discussed. To get a better idea of the entities search engine evaluation actually deals with, I feel this is a good time to introduce our protagonists.

Text REtrieval Conference, http://trec.nist.gov Cross-Language Evaluation Forum, http://www.clef-campaign.org

–  –  –

The evaluation methods that are most popular and easiest to conceptualize tend to be explicit, or judgment-based. This is to be expected; when faced with the task of determining which option is more relevant for the user, the obvious solution is to ask him. The approach brings with it some clear advantages. By selecting appropriate test persons, one can ensure that the information obtained reflects the preferences of the target group. One can also focus on certain aspects of the retrieval system or search process, and – by carefully adjusting the experimental setting – limit the significance of the answers to the questions at hand. The main alternative will be discussed in Chapter 5 which deals with implicit metrics.

–  –  –

We will first recapitulate the well-known concept of recall and precision and consider the measures derived from it (Section 4.1). Then we will describe some further measures which focus on the system being evaluated (Section 4.2), before turning to more user-based measures (Section 4.3). Finally, we will discuss problems common to all explicit measures and ways suggested to overcome them (Section 4.4). An overview of all described metrics can be found in Table 4.1.

- 23 Recall, Precision and Their Direct Descendants Precision, probably the earliest of formally stated IR evaluation measures, has been proposed in 1966 as part of the Cranfield Project (Cleverdon and Keen 1968). The basic idea is very simple: the precision of a retrieval system for a certain query is the proportion of results that are relevant. A second proposed and widely used metric was recall, defined as the proportion of relevant results that have been retrieved.

Formula 4.1.

Precision Formula 4.2. Recall The recall and precision measures are generally inversely proportional: if a retrieval system returns more results, the recall can only increase (as the number of relevant results in the database does not change), but precision is likely to fall. This is often captured in the so-called recall-precision-graphs (see Figure 4.1).

Figure 4.1.

Example of a recall-precision graph showing the performances of different retrieval systems (from Voorhees and Harman 2000, p. 14). The graph visualizes the typical trade-off between recall and precision faced by any IR system. When the recall reaches 1, that is, when every document is retrieved, the precision is at its lowest;

when precision is highest, it is recall that plummets. Every system has to find a balance reflecting the expected needs of its users.

- 24 To combine these two aspects of retrieval, the F-Measure was introduced (Van Rijsbergen 1979). It is an indicator for both recall and precision, with the relative importance set by the weighting constant β. In the special but frequent case of recall and precision being equally important (β=1) the formula reduces to the so-called F1 score. In this case (or with any β close to 1), either low precision or low recall leads to a low overall rating. With a β closer to 0, precision has a larger influence on the F-Measure, while a β higher than 1 means recall is more important.

Formula 4.3.

F-Measure for precision P and recall R. β is a constant which can be used to influence the relative importance of recall and precision.

Formula 4.4.

F1-Measure for precision P and recall R; it corresponds to the F-Measure with β set to 1.

As has often been noted, “since the 1950’s the adequacy of traditional measures, precision and recall, has been a topic of heated debates […], and at the same time of very little experimentation” (Su 2003, p. 1175). It was argued that these measures might work well for a relatively small-scale IR system used by information professionals or experienced researchers.

But for other uses, their practical and theoretical limitations become apparent. It stands to reason that for large databases with millions of documents recall cannot be easily determined, as it depends on knowing the precise number of relevant documents in the database. 23 Furthermore, as the quantity of returned results increases, it becomes difficult to rate the relevance of each one manually. Especially in web search engines, where there may be millions of relevant results,24 it is impossible to evaluate each one. The solution comes from studies showing that users look only at a small number of results, with most users restricting themselves to just the first result page (Jansen and Spink 2006). Thus, evaluating all returned results becomes at least unnecessary, and, if one focuses on the user experience, may even be counterproductive.

One further problem remains, however. Precision treats all returned results equally; but as the user’s attention decreases, pages in the lower parts of the result list become less important, and may not be considered at all (Jansen et al. 1998). Therefore, most modern metrics, precision-based metrics among them, include some kind of damping constant that discounts the value of later results. One of the most popular measures is the Mean Average Precision (MAP). As its name suggests, it averages mean precisions 25 of multiple queries. In words, This problem was at least partially alleviated by the employment of methods such as Relative Recall (Clarke and Willett 1999). As recall is not a primary issue of the present work, it will not be discussed further.

At least, those are the numbers estimated and displayed by major search engines. In practice, only a subset of those is actually returned in the result list, mostly under a thousand pages. This aggravates the problem, as the results are usually still too many to rate, but now precision cannot be calculated even in theory, lacking the complete list of returned results.

“It means average precisions” would have the words in a more appropriate order, but seems a bit ambiguous.

–  –  –

Formula 4.5.

MAP formula with queries Q and documents D. dr is a document at rank r. rel is a relevance function assigning 1 to relevant and 0 to non-relevant results.

An example may be appropriate do demonstrate the calculation of MAP. Given the imaginary relevance judgments in Table 4.2, and supposing no further relevant documents are known to exist, the explicit calculation is given in Figure 4.2. For comparison: The precision for each of the queries is 0.6.

–  –  –

(( )) (( )) Figure 4.2. MAP calculation for values in Table 4.2.

MAP is one of the most-used metrics and is employed in single studies as well as in large efforts such as TREC (Clarke, Craswell and Soboroff 2009). However, it is also not without its problems. In particular, some studies have detected a lack of correlation between MAP and actual user performance (Hersh et al. 2000; Turpin and Hersh 2001). These studies are based upon data collected in the TREC Interactive Track (see Hersh and Over 2000), which is assessed by asking users to find one or more answers to predefined questions in a given time.

The number of answers or instances they collect can be used as a standard against which to rate other, more system-centered metrics. The results show that an increase in MAP need not correspond to a significant, or indeed any, increase in user performance. Turpin and Hersh have also looked in more detail at the possible reasons for this discrepancy; their finding is that, while users encounter more non-relevant results, the number of relevant results stays constant. They conclude that the users are not hindered much by a low Mean Average Precision.

The two studies had a relatively small number of queries; furthermore, it could be argued that the setting of a fixed search time significantly distorts the users’ real-life behavior and cannot be transferred to other fields such as classical web search. For example, the assessors were not supposed to abandon the search ahead of time if they grew frustrated with the high number of non-relevant results. However, the user model implied by this limit is conceptually not less Note that this is not the number of returned relevant results but the number of all relevant results in the database, or at least the number of relevant results known to exist.

–  –  –

The findings were also strengthened in a study concerning itself with more simple – and presumably more common – tasks (Turpin and Scholer 2006). A first task was precisionbased; it required the users to find a single relevant result for a pre-defined TREC query. The ranking algorithms were selected as to reflect a range of MAP values from 55% to 95%. For every algorithm, the average time elapsed before answer is found and the average percentage of sessions where no relevant document was found were calculated. Neither had any significant correlation to MAP, or to precision at ranks 2, 3, 4 and 10. The only (weak) correlation was observed for precision at rank 1; however, the authors note that this might be an effect of the study design. A second task was recall-based; the users were asked to identify as many relevant documents as possible in a fixed time period. Again, the different ranking algorithms were employed. The only statistically significant differences were observed between the pairs 55% vs. 75% and 65% vs. 75%. However, since the absolute improvement was very low (extra 0.3 documents per session), and the user performance fell again in the conditions featuring MAP values of 85% and 95%, the correlation does not seem impressive.

Though some objections against the methodology of the study could be made – the authors mention the different user groups for the TREC-rated documents and the student-performed retrieval task – it clearly adds much doubt to MAPs appropriateness as a measure of probable user performance.

Slightly better correlation between precision and user satisfaction was shown by Kelly, Fu and Shah (2007). They performed a test on four topics from TREC newspaper data. Each subject was given a topic statement and was instructed to construct a query; however, the user was presented with a result list that was fixed in advance and did not depend on the actual query but solely on the topic. There were four result lists; in one condition, they significantly differed in precision, while being similar in MAP (more scattered relevant results versus few relevant results at top ranks), while in another, the precision was equal, but the ranking different (e.g. ranks 1-5 relevant and 6-10 irrelevant, or vice versa). The users were asked to rate single documents as well as the overall performance of every “search engine”. The results showed that, with higher MAP as constructed from users’ document ratings, the rating of the result list also tend to grow. However, the correlation was only significant for less than half of all users; in fact, users with average ratings for documents as well as result lists were the only group showing this significance. Furthermore, the study has some serious methodological issues. First, there were only four topics, a very small sample. Second, the result lists were not query-specific but topic-specific; this must have lead to different search experiences and different relevance ratings among users. Finally, the result lists were constructed using expected precision values derived from TREC ratings; however, no statistics on their correlation with user assessments of result set quality are provided.

4.2 Other System-based Metrics A simple measure intended for a specific task is the Mean Reciprocal Rank (MRR).

Reciprocal Rank (RR) is defined as zero if no relevant results are returned, or else one

- 27 through the rank of the first encountered relevant result (Formula 4.6). For MRR, the mean of RR values for single queries is taken (Voorhees 1999). MRR assumes that the user is only interested in one relevant result, and anything following that result is irrelevant to his search experience (see Formula 4.7). This seems to be an intuitive assumption for some types of queries, particularly navigational ones.

{ Formula 4.6. Reciprocal Rank (RR), with r the first rank where a relevant result is found.

∑ || Formula 4.7. Mean Reciprocal Rank (MRR), Q being the set of queries and RRq the Reciprocal Rank measured for query q.

A metric proposed explicitly for web search evaluation is the Quality of result ranking (QRR;

Vaughan 2004). It is quite unusual among explicit measures in the way it constructs its rating.

Pages:     | 1 |   ...   | 2 | 3 || 5 | 6 |   ...   | 26 |

Similar works:

«Can Relational Aggression and Victimization Help to Explain the Emergence of the Sex Difference in Depression During Adolescence? A DISSERTATION SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY Lindsay Catherine Mathieson IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY Nicki R. Crick, Adviser & Bonnie Klimes-Dougan, Co-Adviser August 2012 © Lindsay C. Mathieson 2011 i Acknowledgements First of all, I would like to thank my...»

«Community Cellular Networks by Kurtis Heimerl A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate Division of the University of California, Berkeley Committee in charge: Professor Eric Brewer, Co-chair Assistant Professor Tapan Parikh, Co-chair Associate Professor Jenna Burrell Professor Ali Niknejad Fall 2013 Community Cellular Networks Copyright 2013 by Kurtis Heimerl Abstract Community Cellular...»

«The Relationship Between Dominance and Vocal Communication in the Male Ring-Tailed Lemur (Lemur catta) by Laura McLachlan Bolt A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Department of Anthropology University of Toronto © Copyright by Laura McLachlan Bolt, 2013 The Relationship Between Dominance and Vocal Communication in the Male Ring-Tailed Lemur (Lemur catta) Laura McLachlan Bolt Doctor of Philosophy Department of Anthropology University of...»

«NOTE TO USERS This reproduction is the best copy available. UMI' Touching Fiction: Embodied Narrative Self-Reflexivity and Eighteenth-Century British Sentimental Novels By Alex Wetmore B.A. (Hon), M.A. A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfilment of the requirements for the degree of Doctor of Philosophy Cultural Mediations Carleton University Ottawa, Canada August, 2009 © Copyright 2009, Alex Wetmore 1*1 Library and Archives Bibliotheque et Archives...»

«Der Sechste Zhva dmar pa Chos kyi dbang phyug (1584–1630) und sein Reisebericht aus den Jahren 1629/1630: Studie, Edition und Übersetzung Inaugural-Dissertation zur Erlangung des Doktorgrades der Philosophie an der Ludwig-Maximilians-Universität München vorgelegt von: Navina Lamminger Datum der mündlichen Prüfung: 28. Januar 2013 Gutachter: Prof. Dr. Franz-Karl Ehrhard Prof. Dr. Jens-Uwe Hartmann Inhaltsverzeichnis Danksagung vii 1. Allgemeine Einführung und Aufbau der Arbeit 1 1.1...»

«International Journal of Humanities and Social Science Invention ISSN (Online): 2319 – 7722, ISSN (Print): 2319 – 7714 www.ijhssi.org ||Volume 4 Issue 10 || October. 2015 || PP.40-43 The Style of Gita Govinda Recital and Odissi Music Dheeraj Kumar Mohapatra Guest Faculty, Odissi Vocal Department, Utkal University Of Culture, India Abstract : In Gita Govinda, Sri Jayadeva has embellished the philosophy, metaphysics, ontology and mysticism in erotic words, melodious versifications, ardent...»

«Richard L. W. Clarke LITS2306 Notes 05A RENÉ DESCARTES MEDITATIONS ON FIRST PHILOSOPHY (1641) Descartes, René. Meditations on First Philosophy. Selected Philosophical Writings. Trans. John Cottingham, Robert Stoothoff and Dugald Murdoch. Cambridge: CUP, 1988. 73-122. Meditation I: What can be called into Doubt (“The General Demolition of My Opinions” [76]) Here, Descartes’s concerns are epistemological in nature as he plunges into the depth of skepticism, coming to the view that almost...»

«On the Structure of Classical Mechanics Thomas William Barrett∗† February 21, 2013 Abstract Jill North (North, 2009) has recently argued that Hamiltonian mechanics ascribes less structure to the world than Lagrangian mechanics does. I will argue that North’s argument is not sound. In doing so, I will present some obstacles that must be navigated by anyone interested in comparing the amounts of structure that different physical theories ascribe to the world. 1 Introduction Different...»

«Cultural Considerations in the Delivery of Homecare Services: Beyond 2 kitchens and a disability/ più di due cucine e disabilità. by (Hedy) Anna Walsh A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Factor-Inwentash Faculty of Social Work University of Toronto © Copyright by (Hedy) Anna Walsh (2014) Cultural Considerations in the Delivery of Homecare Services: Beyond 2 kitchens and a disability/ più di due cucine e disabilità. Doctor of...»

«Middle English Lyrics: Lyric Manuscripts 1200–1400 and Chaucer’s Lyric by Emma Kate Charters Gorst A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Department of English University of Toronto © Copyright by Emma Kate Charters Gorst 2013 Middle English Lyrics: Lyric Manuscripts 1200–1400 and Chaucer’s Lyric Emma Kate Charters Gorst Doctor of Philosophy Department of English University of Toronto Abstract This thesis endeavours to understand...»

«La libertà nella filosofia di L P. Sartre Gli esistenzialisti si sono presentati alla massa facilmente riconoscibili per il loro aspetto, disobbedienti ad ogni vincolo di moda, di costumi e di educazione, nella varietà di uomini e donne provenienti dai più diversi strati della società. Così, sulla scorta dei resoconti e delle curiosità giornalistiche, la massa si è abituata a vedere l'Esistenzialismo come movimento bizzarro, e più ancora, come atteggiamento di una casta di artisti...»

«1 27 August 2009 Mereologies as the Grammars of Chemical Discourses ROM HARRÉ AND JEAN-PIERRE LLORED `If you cut a crumb in half do you have two new crumbs or two halves of a crumb?’ John Palmer, quoted in the Sunday Times, 28 June 2009, News Review, p. 16. Since Robert Boyle’s corpuscularian philosophy, chemistry has been a mereological science. Displacing the metaphysics of `continuous substances’ and `qualities’ as the expression of “principles”, chemistry has been built on a...»

<<  HOME   |    CONTACTS
2016 www.dissertation.xlibx.info - Dissertations, online materials

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.