

On Search Engine Evaluation Metrics
Inaugural-Dissertation zur Erlangung des Doktorgrades der Philosophie (Dr. Phil.) durch die Philosophische ...


First, human judgments are obtained for the search results in the usual way. Preferably, the results will be pooled from several search engines’ lists for the same query. In the next step, however, those judgments are used to construct an alternative result list, one that is sorted by user-assigned relevance. This is compared to the original result lists provided by the search engines, and a correlation coefficient is calculated. Vaughan also provides an evaluation of her metric: she applies QRR to three search engines and finds that the results are significantly different. From that, she concludes that it is “able to distinguish search engine performance” (Vaughan 2004, p. 689). This, however, seems to be an example of a vicious circle; the metric is valid because it “recognizes” a difference, which is assumed on the basis of the metric. QRR itself has not been widely used for evaluations; however, it deserves attention for its attempt to detect correlations. We will encounter more correlation-based measures when dealing with implicit measurements; in particular, the notion of Normalized Discounted Cumulated Gain later in this section follows a similar logic.

A measure which has enjoyed some exposure, for example at a few of the TREC tracks, is bpref,27 so called “because it uses binary relevance judgments to define the preference relation (any relevant document is preferred over any nonrelevant document for a given topic)” (Buckley and Voorhees 2004, p. 26). This is unusual among explicit metrics, as it does not attempt to assign an absolute rating to results, but merely to find pairs where one document should be preferable to another. The metric, given in Formula 4.8, calculates the number of non-relevant documents n ranked higher than a relevant document r, and averages the number over the first R relevant results.28 Bpref was shown to correlate well with MAP (with a Kendall τ of 0.9), while being more stable when fewer judgments are available. Also, increases in bpref were correlated to lower times for task completion and higher instance recall in a passage retrieval task, although the connections are not linear, and may not apply to other retrieval tasks (Allan, Carterette and Lewis 2005).

27 The authors write “bpref” with a lower-case “b” except at the beginning of a sentence, where it is capitalized. I do not see any reason not to follow their approach; after all, they should know best.

28 To ensure that the values fall into the [0..1] interval, n is also restricted to the first R occurrences.

\mathrm{bpref} = \frac{1}{R} \sum_{r} \left( 1 - \frac{|n \text{ ranked higher than } r|}{R} \right)

Formula 4.8. Bpref (Buckley and Voorhees 2004). Its distinguishing characteristic is the comparison of results within a result list. For each relevant result r (up to a predetermined number of relevant results R), the number of nonrelevant results n above it in the list is calculated, and the results are averaged over the different r’s. Thus, every nonrelevant result ranked higher than a relevant one causes bpref to decrease from its maximum value of 1.
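As a concrete illustration, the bpref computation just described can be sketched in a few lines of Python. The function name and argument layout are my own, and the sketch assumes, as in the original setting, that only judged documents appear in the ranked list:

```python
def bpref(ranking, relevant, R=None):
    """Sketch of bpref over a ranked list of judged documents.

    ranking: document ids in rank order (judged documents only).
    relevant: set of ids judged relevant.
    R: number of relevant documents considered (defaults to all known).
    """
    R = R if R is not None else len(relevant)
    score = 0.0
    nonrel_seen = 0  # non-relevant documents encountered so far (n)
    rel_seen = 0     # relevant documents scored so far (r)
    for doc in ranking:
        if doc in relevant:
            if rel_seen < R:
                # penalty grows with each non-relevant document ranked
                # above r, capped at R so the contribution stays in [0, 1]
                score += 1.0 - min(nonrel_seen, R) / R
                rel_seen += 1
        else:
            nonrel_seen += 1
    return score / R
```

A perfect ordering (all relevant documents first) yields 1, and each non-relevant document pushed above a relevant one lowers the score, exactly as the formula prescribes.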

A modification, the rpref metric, has been introduced mainly to allow for graded instead of binary relevance (De Beer and Moens 2006). However, neither bpref nor this modification will play a significant role in the current evaluation. The reason is that these metrics are explicitly designed for situations where relevance judgments are incomplete. In the study presented in Part II, this problem does not occur, for reasons that will be discussed in the methodological Section 9.1. Additionally, it has been suggested that even in the very circumstances it was conceived for, bpref performs worse than traditional metrics. When non-judged results are simply disregarded, metrics like Average Precision or NDCG show more robustness and discriminatory power than bpref or rpref (Sakai 2007a).

An early measure comparing a system’s actual result list to an idealized output is the Sliding Ratio (Pollock 1968). It is calculated as the ratio of the actual result list’s cumulated score to that of an ideal ranking of the same documents, for every rank up to a certain threshold; hence the name.

In Formula 4.9, c is the cut-off value (the number of results considered), rel(d_r) the relevance (or weight) of the document at rank r, and rel(d_r^ideal) the relevance of the r-th result in an ideally ranked result list. The simple example shown in Table 4.3 illustrates that since only the retrieved documents are considered for the construction of the ideal ranking, the SR at rank n is always 1.

\mathrm{SR}(c) = \frac{\sum_{r=1}^{c} rel(d_r)}{\sum_{r=1}^{c} rel(d_r^{ideal})}

Formula 4.9. Sliding Ratio (Korfhage 1997)
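The Sliding Ratio can be sketched directly from this definition; the function name and argument layout below are my own. Since the ideal ranking is built by sorting the same retrieved weights, the ratio at the final rank is necessarily 1:

```python
def sliding_ratio(rels, cutoff):
    """Sketch of the Sliding Ratio at a given cut-off rank.

    rels: relevance weights of the retrieved documents, in the
          system's rank order. The ideal ranking is the same set of
          weights sorted in descending order, so only retrieved
          documents enter the comparison.
    """
    ideal = sorted(rels, reverse=True)
    denominator = sum(ideal[:cutoff])
    if denominator == 0:
        return 0.0  # no relevant documents among those retrieved
    return sum(rels[:cutoff]) / denominator
```

For example, a list with weights [1, 0, 2] scores 1/3 at cut-off 2 (the ideal order would be [2, 1, 0]) and 1.0 at cut-off 3, illustrating the property noted above.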


A measure which has enjoyed wide popularity since its introduction is Discounted Cumulated Gain, or DCG for short (Järvelin and Kekäläinen 2000). The more basic measure upon which it is constructed is the Cumulated Gain, which is a simple sum of the relevance judgments of all results up to a certain rank. DCG enhances this rather simple method by introducing “[a] discounting function [...] that progressively reduces the document score as its rank increases but not too steeply (e.g., as division by rank) to allow for user persistence in examining further documents” (Järvelin and Kekäläinen 2002, p. 425). In practice, the authors suggest a logarithmic function, which can be adjusted (by selecting its base) to provide a more or less strong discount, depending on the expectations of users’ persistence.

\mathrm{DCG}_r = \begin{cases} \mathrm{CG}_r & \text{if } r < b \\ \mathrm{DCG}_{r-1} + \dfrac{rel(r)}{\log_b r} & \text{if } r \ge b \end{cases}

Formula 4.10. DCG with logarithm base b (based on Järvelin and Kekäläinen 2002). CG_r is the Cumulated Gain at rank r, and rel(r) a relevance function assigning 1 to relevant and 0 to non-relevant results.


One weak point of DCG is its lack of comparability to other metrics, as well as between different DCG-evaluated queries. Should one query be “easy” and have more possible relevant hits than another, its DCG would be expected to be higher; the difference, however, would not signify any difference in general retrieval performance. A measure which indicates retrieval quality independently of the quality of available results (that is, of the “difficulty” of the search task) would be more helpful. To remedy the situation, Normalized DCG (NDCG) can be employed. NDCG works by pooling the results from multiple search engines’ lists and sorting them by relevance. This provides an “ideal result list”, under the assumption that all the most relevant results have been retrieved by at least one of the search engines. The DCG values of the single search engines can then be divided by the ideal DCG to put them into the [0..1] interval, with 0 meaning no relevant results and 1 the ideal result list.29 Note, however, that NDCG is quite similar to a version of Sliding Ratio with an added discount for results at later ranks.
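The DCG recursion and its normalization can be sketched as follows; function names and the pooled-list approximation of the ideal ranking are my own choices, made under the assumptions described above:

```python
import math

def dcg(rels, b=2):
    """Sketch of DCG with logarithm base b (Järvelin/Kekäläinen style).

    rels: relevance scores in rank order. Ranks below b receive no
    discount; from rank b onwards the gain is divided by log_b(rank).
    """
    total = 0.0
    for rank, rel in enumerate(rels, start=1):
        if rank < b:
            total += rel
        else:
            total += rel / math.log(rank, b)
    return total

def ndcg(rels, pooled_rels, b=2):
    """Normalize by the DCG of an ideal list, here approximated by
    sorting the pooled relevance judgments in descending order."""
    ideal = dcg(sorted(pooled_rels, reverse=True), b)
    if ideal == 0:
        return 0.0  # no relevant pages known for this query
    return dcg(rels, b) / ideal
```

Raising the base b extends the undiscounted prefix and weakens the discount at each rank, which is how the measure models more persistent users.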

The authors evaluated the different CG measures (Järvelin and Kekäläinen 2000; Järvelin and Kekäläinen 2002). However, this was not done by comparing the new measure with a standard, or with established measures; instead, it was used to evaluate different IR systems, where one was hypothesized to outperform the others. The CG measures indeed showed a significant difference between the systems, and were considered to have been validated. I think this methodology is not quite satisfactory. It seems that evaluating the hypothesis with the new measure while at the same time evaluating the new measure against the hypothesis may produce a positive correlation without necessarily signifying a meaningful connection to any outside entity. However, a more stringent evaluation of DCG was performed by Al-Maskari, Sanderson and Clough (2007). In that study, Precision, CG, DCG and NDCG were compared to three explicit measures of user satisfaction with the search session, called “accuracy”, “coverage” and “ranking”. The results were mixed. Of the overall 12 relations between metric and user satisfaction, only two showed a significant correlation, namely Precision and CG with the ranking of results. Unfortunately, the authors provide no details as to what the assessors were instructed to rate by their satisfaction measures; it seems possible that, say, a user’s perception of accuracy may well be influenced by the result ranking. Still, the study is the only one I know of that directly compares the CG family to user satisfaction, and its results are only partly satisfactory. These results notwithstanding, (N)DCG is conceptually sound, and provides more flexibility than MAP. Since its introduction in 2000, it has become one of the most popular search engine evaluation measures; and we definitely do not mean to suggest throwing it overboard as soon as some initial doubt is cast on its correlation to real-world results.

29 Obviously, if the ideal DCG is zero, the calculation is not possible. However, this is not a large problem, since this value would mean no relevant pages are known, and such a query would probably best be excluded from the evaluation altogether. Alternatively, if relevant documents are supposed to exist, they can be explicitly added into the ideal ranking, either on a per-query basis or as a default baseline across all queries.

\mathrm{ADM}(q) = 1 - \frac{\sum_{d \in D} |SRS(d) - URS(d)|}{|D|}

Formula 4.11. ADM for retrieved documents D, System Relevance Score SRS and User Relevance Score URS.

A measure which is rarely used for actual evaluation but provides some interesting aspects for the current study is the Average Distance Measure (ADM). It has been introduced explicitly to replace existing measures which were considered to rely too heavily on binary distinctions.

The distinction between relevant and non-relevant documents is opposed to documents lying along a scaled or continuous relevance axis, and that between retrieved and non-retrieved documents is to be smoothed by considering the rank of the retrieved result (Mizzaro 2001; Della Mea et al. 2006). The ADM measure is, in short, the average difference between a system’s and a user’s relevance scores for the documents returned for a query. The interesting characteristic of ADM is that, while it attempts to distinguish itself from precision in some aspects, it is similar in that it is an extremely system-based measure, focusing on single system ratings rather than on user experience. To provide a simple example, a retrieval system providing a bad result list but recognizing it as such can get an ideal ADM (since its evaluation of the results’ quality is perfect); one that provides ideal results but regards them as mediocre performs worse (since it underestimates the result quality).30 Evaluation results calculated using ADM have been compared to those produced by Average Precision, and found not to correlate with it (Della Mea, Di Gaspero and Mizzaro 2004); however, this merely means that the two measures do not evaluate the same aspects of search engines. Both still may be good at specified but distinctly different tasks.
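Formula 4.11 translates almost literally into code. The sketch below uses my own function and argument names and assumes both score lists refer to the same returned documents, in the same order, with values in [0, 1]:

```python
def adm(system_scores, user_scores):
    """Sketch of the Average Distance Measure for one query.

    system_scores: the system's own relevance estimates (SRS).
    user_scores:   the user-assigned judgments (URS) for the same
                   returned documents, in the same order.
    """
    if not system_scores:
        return 0.0
    distances = [abs(s - u) for s, u in zip(system_scores, user_scores)]
    return 1.0 - sum(distances) / len(distances)
```

The system-oriented behavior discussed above falls out directly: a bad result list whose badness the system estimates correctly scores a perfect 1.0, while ideal results that the system rates as mediocre are penalized.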

30 Also, ADM obviously depends on a system’s relevance scores for documents, which are not readily available.

User-based Metrics

Another class of metrics goes beyond relevance by employing models of user behavior. The distinction is somewhat vague since any discounted metric such as MAP or DCG already makes some assumption about the behavior, viz. the user being more satisfied with relevant results in earlier than in later ranks. DCG even provides a method for adjusting for user persistence by modifying the logarithm base of its formula. Still, it can be argued that the measures to be presented in this section differ significantly by explicitly stating a user model and constructing a metric to reflect it. The rationale for turning away from purely system-based measures was nicely formulated by Carbonell and Goldstein:

Conventional IR systems rank and assimilate documents based on maximizing relevance to the user query. In cases where relevant documents are few, or cases where very-high recall is necessary, pure relevance ranking is very appropriate. But in cases where there is a vast sea of potentially relevant documents, highly redundant with each other or (in the extreme) containing partially or fully duplicative information we must utilize means beyond pure relevance for document ranking.31 (Carbonell and Goldstein 1998, p. 335)

An early metric focusing on the user’s information need is Expected Search Length (ESL). It is defined as the number of non-relevant documents in a weakly ordered document list a user has to examine before finding n relevant ones (Cooper 1968). It may be viewed as a more general version of Mean Reciprocal Rank (MRR), which is analogous (or, rather, inversely related) to ESL with n=1. It should also be noted that the number of retrieved relevant documents is not the only variable; the definition of relevance can also be set so as to model a user’s information need. One problem with ESL is the underlying assumption that user satisfaction is primarily influenced by non-relevant results. This is not self-evident; indeed, some studies suggest that at least in some circumstances, the number of non-relevant results plays hardly any role (Turpin and Hersh 2001).
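For a strictly ordered list, the ESL definition reduces to a simple count; the sketch below uses my own naming and ignores Cooper’s averaging over ties within weakly ordered ranks:

```python
def expected_search_length(ranking, relevant, n=1):
    """Sketch of Expected Search Length for a strictly ordered list:
    the number of non-relevant documents examined before the n-th
    relevant one is found (ties in weak orderings are not modeled).
    """
    nonrelevant_seen = 0
    found = 0
    for doc in ranking:
        if doc in relevant:
            found += 1
            if found == n:
                return nonrelevant_seen
        else:
            nonrelevant_seen += 1
    return None  # fewer than n relevant documents in the list
```

With n=1 the count is determined by the rank of the first relevant document, which is the quantity MRR inverts; larger n models users who need several relevant documents.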

For this and other reasons, ESL was expanded by Dunlop (1997) into an “Expected Search Time” (EST) measure. By estimating the time needed to interact with the search engine (to read an abstract, to read a document and so forth), he constructed a metric for the probable duration of a user’s search session before he finds the n relevant results he needs. However, further limitations of EST as well as ESL remain; the number of desired relevant documents may vary heavily depending on the information need as well as on the individual user. Also, a user himself probably does not know how many documents he will require; he may prefer all the information to be in a single document, but if it is scattered, he might really want all five or ten relevant documents.

A measure concerned with the diversity of the result list is the Maximal Marginal Relevance, or MMR (Carbonell and Goldstein 1998). It assumes that the user wishes to encounter documents which are as similar as possible to the query, but as dissimilar as possible to each other.

31 Here, “relevance” is taken to mean the objective relevance of a single document to a query (topicality). Relevance is discussed in more detail in Chapter 7.
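The trade-off MMR expresses, between similarity to the query and dissimilarity to documents already selected, is usually realized as a greedy re-ranking. The sketch below is my own formulation; the function name, the dictionary-based similarity inputs, and the default λ are illustrative assumptions, not part of the original paper:

```python
def mmr_select(query_sim, pair_sim, k, lam=0.7):
    """Sketch of greedy Maximal Marginal Relevance selection.

    query_sim: dict mapping document id -> similarity to the query.
    pair_sim:  dict mapping frozenset({d1, d2}) -> similarity between
               two documents (0.0 assumed when a pair is missing).
    lam: trade-off between query relevance and novelty.
    """
    selected = []
    candidates = set(query_sim)
    while candidates and len(selected) < k:
        def marginal(d):
            # redundancy: similarity to the most similar already-selected doc
            redundancy = max(
                (pair_sim.get(frozenset({d, s}), 0.0) for s in selected),
                default=0.0,
            )
            return lam * query_sim[d] - (1 - lam) * redundancy
        best = max(candidates, key=marginal)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With λ balanced at 0.5, a document highly similar to one already chosen is skipped in favor of a less relevant but novel one, which is exactly the diversity effect described above.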
