
On Search Engine Evaluation Metrics. Inaugural dissertation for the degree of Doctor of Philosophy (Dr. Phil.).


There is no obvious way to assess RIN. Expert judgments, which are often used as part of the Cranfield paradigm, are not just laborious. They may work if the information needs are carefully selected to be judged by certain specialists; if the study goes for real-life information needs, this becomes unfeasible, even if descriptions of the personal information needs are available. For some queries, like the one looking for the definition of the term “panic attack”, expert judgment can be expected to work. But even for science-based queries like the one looking for information about language acquisition, they would be problematic, as linguists passionately disagree about which answers are correct. More obviously, a potential visitor looking for information about a city might not notice a web site’s omission of an attractive sight, but he is surely a better judge than any expert of whether the information he sees is interesting and useful to him. The argument is similar for transactional queries, where an expert might know whether the price is too high, but probably not the searcher’s exact requirements and preferences. In short, to provide “good” assessments, an expert would have to interview the searcher extensively, and then – at least in the case of web search, where no one is likely to be aware of all potentially useful sites – conduct research of his own before rating the results provided by the search engine under review. And even then, the judgments will not reflect real usefulness if, say, the searcher would not have come to grips with a site’s interface, or failed to find the relevant information on a page.54 And, of course, expert evaluations tend to be on the expensive side. Still, even when the resources are available, one has to consider carefully whether experts are the best solution.


If real web users with real information needs could be presented with one of two possible result lists, how well is a particular metric suited to determine which result list would be preferred by this particular user, given explicit relevance judgments for the individual results?

8.2 Evaluation Methods

Some of the main decisions when evaluating a metric’s relationship to user judgments concern the input, the metric parameters, and the threshold and cut-off values.

Some metrics (like precision) were constructed for use with binary relevance, and this is the judgment input they are mostly provided with. However, it has been suggested that scaled ratings might be more useful (Järvelin and Kekäläinen 2002; Zhou and Yao 2008). In the present study, scaled ratings have been used for the simple reason that they can be easily reduced to binary, while the opposite process is not yet within the reach of mathematicians.

The complications arise when one considers the different possibilities of conflating the ratings; the two probably most obvious are mapping the lowest rating to “irrelevant” and all others to “relevant”, or mapping the lower half of the ratings to “irrelevant” and the upper half to “relevant”. I approach this issue in the way that makes as few assumptions as possible: I use multiple conflation methods separately, and look at which provide better results, that is, which are closer to the users’ real assessments.
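The two conflation schemes can be sketched as follows; the five-point 0–4 scale and the function names are my own illustrative assumptions, not the study’s actual setup:

```python
# Sketch: two ways of conflating a scaled relevance rating (here 0-4)
# into a binary judgment. Scale and names are illustrative.

def conflate_lowest(rating, lowest=0):
    """Map only the lowest rating to irrelevant (0); all others to relevant (1)."""
    return 0 if rating == lowest else 1

def conflate_halves(rating, scale_max=4):
    """Map the lower half of the scale to irrelevant, the upper half to relevant."""
    return 1 if rating > scale_max / 2 else 0

ratings = [0, 1, 2, 3, 4]
print([conflate_lowest(r) for r in ratings])   # [0, 1, 1, 1, 1]
print([conflate_halves(r) for r in ratings])   # [0, 0, 0, 1, 1]
```

The reverse direction is indeed impossible: once the scale is collapsed, the distinction between, say, ratings 1 and 4 cannot be recovered.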

Formula 8.1 (alias Formula 4.10). DCG with logarithm base b (based on Järvelin and Kekäläinen 2002):

\[
\mathrm{DCG}[i] =
\begin{cases}
\mathrm{CG}[i] & \text{if } i < b \\
\mathrm{DCG}[i-1] + \dfrac{G[i]}{\log_b i} & \text{if } i \ge b
\end{cases}
\]

where G[i] is the gain (rating) of the result at rank i, and CG[i] is the cumulated gain up to rank i.
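As a sketch of this definition (my own illustration, not code from the study), DCG with a configurable logarithm base can be computed as:

```python
import math

def dcg(gains, b=2):
    """DCG as in Järvelin and Kekäläinen (2002): gains at ranks below the
    logarithm base b are accumulated undiscounted; from rank b on, the gain
    at rank i is divided by log_b(i)."""
    total = 0.0
    for i, g in enumerate(gains, start=1):
        total += g if i < b else g / math.log(i, b)
    return total

# With b = 2: ranks 1 and 2 are effectively undiscounted (log2(2) = 1),
# and a gain at rank 1024 is divided by log2(1024) = 10.
print(dcg([3, 2, 3, 0, 1], b=2))
```

Changing `b` changes both where discounting begins and how steep it is, which is exactly the adjustment the authors call for.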

The second issue is metric parameters.55 Some metrics, DCG perhaps most prominently among them, have elements that are supposed to be changed to reflect the assumptions about user behavior or user aims.

To illustrate that, we will reproduce the DCG formula as Formula 8.1, and quote once more, now more extensively, from the authors of DCG:

A discounting function is needed which progressively reduces the document value as its rank increases but not too steeply (e.g., as division by rank) to allow for user persistence in examining further documents. A simple way of discounting with this requirement is to divide the document value by the log of its rank. For example log2 2 = 1 and log2 1024 = 10, thus a document at the position 1024 would still get one tenth of its face value.

55 That is, the parameters of metrics; this has nothing to do with the conversion of pints into liters.


I see two problems with this excerpt. The first is the authors’ statement that a division by rank would be too steep a discount. The question not explicitly considered is: Too steep for what?

It seems logical to assume that the benchmark is user behavior, or user preference. But, as the authors indicate in the last sentence quoted above, user behavior varies. To make a point: as described in Section 2.2, hardly any users of web search engines go beyond the first result page, that is, the first 10 results. Surely, for them a good result at rank 1 is more useful than ten good results soon after rank 1000 – though the authors’ example suggests they would be equal for the purposes of DCG.56 The authors themselves do not provide any information on which users they mean, or whether there has been any research into the most appropriate discounting function for those users. It is at least conceivable that division by rank, dismissed by Järvelin and Kekäläinen, might yet work well; or that some more complex function, reflecting the viewing and clicking patterns measured for real people, might work better.

The second problem is that, though the authors bring up base 2 only as an example, and call for adjustments wherever necessary, b is routinely set to 2 in studies. Thus, even the adjustments actually suggested are not made, and I am not aware of any work looking either at the appropriateness of different discounting functions, or even of different logarithm bases, for any scenario. I do not claim that the widely used parameters are inappropriate in all or even any particular circumstances; I just point out that, as far as I know, no one has tried to find that out. Again, my own approach is to preselect as little as possible. Obviously, one cannot test every function and base; however, I will attempt to cover a variety of both.

Which discount functions should one examine? Obviously, there is an indefinite number of them, so I will concentrate on a small sample which attempts to cover the possible assumptions one can make about user behavior. It should not be forgotten that user behavior is the raison d’être of discount functions; they are supposed to reflect the waning perseverance of the user, whether from declining attention, increasing retrieval effort, or any other influence that causes the marginal utility57 of results to decline. The question relevant to discounting is: How steep does the discount have to be to correctly account for this behavior?

Therefore, the considered discount functions have to cover a wide range of steepness rates. The ones I consider are, in approximate order of increasing steepness:

• No discount. The results on all ranks are considered to be of equal value. This is a special case, since the absence of a discount in NDCG makes it analogous to classical precision: both metrics consist of a normalized total sum of individual result ratings.

56 The definition of a “good result” is, in this case, not important, since the argument holds regardless of what is considered “good” or “relevant”.

57 Defined as “the additional satisfaction or benefit (utility) that a consumer derives from buying an additional unit of a commodity or service. The concept implies that the utility or benefit to a consumer of an additional unit of a product is inversely related to the number of units of that product he already owns” (Encyclopædia Britannica 2011). In the case of IR, it is almost universally assumed that the user gains less and less after each additional result, at least after some rank.

A small disparity in absolute scores results from different normalizing, which leads to different absolute numbers and thus to different performances at different thresholds. This similarity will be discussed in more detail in Section 10.2.

• log5. A shallow function. For ranks below the base (five, in this case), the logarithm is smaller than one (and zero at rank 1), so dividing by it would boost rather than discount a result; taking the lead from standard DCG usage, results up to and including rank 5 are therefore not discounted at all.

• log2. The standard discount function of DCG. Starts discounting at rank 3.

• Square root.

• Division by rank. A function used, among others, by MAP.

• Square rank. A very steep discounting function; the third result carries just one ninth of the first result’s weight.

• Click-based discount. The data for this function comes from Hotchkiss, Alston and Edwards (2005), who looked at the click frequencies for different result ranks. They found that click rates for the second result are only a small fraction of those for the first. This function is unusual in that it does not decrease monotonically; instead, the click rates rise, for example, from rank 2 to 3 and from rank 6 to 7. The authors explain this result with certain user and layout properties; for example, most display resolutions allow the user to see the first six results on the screen, and after scrolling the seventh result can become the first visible one. For a more detailed discussion, I refer to the original study. This function was chosen to represent a more user-centered approach not just to evaluation, but also to the calculation of the metric itself.58
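The discount schemes listed above, except the click-based one (which relies on empirical data), can be sketched as weight functions; the scheme names are my own shorthand:

```python
import math

def discount_weight(rank, scheme):
    """Weight assigned to a result at the given 1-based rank under the
    discount schemes discussed above. The click-based scheme is omitted,
    as its weights come from empirical click data."""
    if scheme == "none":
        return 1.0
    if scheme.startswith("log"):
        base = int(scheme[3:])
        # As in standard DCG usage: no discount up to and including the base.
        return 1.0 if rank <= base else 1.0 / math.log(rank, base)
    if scheme == "sqrt":
        return 1.0 / math.sqrt(rank)
    if scheme == "rank":
        return 1.0 / rank
    if scheme == "rank2":
        return 1.0 / rank ** 2
    raise ValueError(scheme)

# Steepness at rank 1000: log2 still gives about a tenth of the face value,
# division by rank a thousandth, squared rank a millionth.
for s in ("none", "log5", "log2", "sqrt", "rank", "rank2"):
    print(s, round(discount_weight(1000, s), 6))
```

The printed weights make the ordering by steepness visible at a glance, and illustrate why the choice of function matters far more at deep ranks than at the top of the list.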

Figure 8.1. The discount rates of different functions. The Y-axis shows the weights assigned to the ranks shown on the X-axis.

58 Of course, the data from Hotchkiss, Alston and Edwards (2005) is not perfect, and might not be representative of general click behavior (if such a generalization exists at all); I use it as an example of a log-based discount function.

The third question concerns threshold values. When considering the relationship between individual metrics and explicit preference judgments, we want to look at whether, for a pair of result lists, both methods pick the same “preferable” result list. However, result lists can also be equally good (or bad). With explicit preference judgments, there are no problems; users can simply indicate such a situation. For classical metrics, however, one has to decide when the quality of two result lists is to be considered similar. Are, for example, result lists with a MAP of 0.42 and 0.43 equally good? It can be argued that a user is unlikely to recognize such a minute difference. Where, then, to draw the line? Once more, I do not make any a priori judgments; instead, I will try out different thresholds and evaluate their suitability.
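A minimal sketch of this threshold decision, assuming a generic score-based metric (the function name and the threshold values are illustrative, not the study’s):

```python
def compare_lists(score_a, score_b, threshold):
    """Decide which of two result lists a metric prefers. Differences at or
    below the threshold count as a tie, mirroring a user's 'equally good'
    judgment. The threshold itself is what is to be varied and evaluated."""
    diff = score_a - score_b
    if abs(diff) <= threshold:
        return "tie"
    return "A" if diff > 0 else "B"

print(compare_lists(0.43, 0.42, threshold=0.05))  # "tie"
print(compare_lists(0.60, 0.30, threshold=0.05))  # "A"
```

With a threshold of zero, every minute difference counts as a preference; with a large threshold, almost everything is a tie. The suitable value lies somewhere in between and has to be found empirically.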

The fourth problem concerns cut-off values. It is a commonplace that more data is better, but one cannot possibly hope to evaluate all of the estimated 59 thousand results for “why a duck?”,59 and even the 774 results actually returned will be more than any but the most determined (and well-paid) rater can manage. On the other hand, research has shown that people would not be likely to go to result №775 even if they could (see Section 2.2). And even if they would, the marginal gain from evaluating the additional results would not necessarily justify the extra effort. Thus, there might be multiple points of interest between the extremes of evaluating only the first result and evaluating everything out there. There might be a minimal number of evaluated results for a study to have any validity; there probably is a point beyond which nothing at all will be gained by evaluating more results; and there can also be multiple points where the marginal gain from rating the next result does not warrant the pains. I speak of multiple points because different studies have different objectives, for which different cut-off values might be more or less appropriate; but at the very least, there should be a span in which this “usefulness peak” falls for most studies.
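To make the cut-off question concrete, here is a sketch (my own illustration, with hypothetical binary judgments) of how a simple metric such as precision changes with the cut-off value:

```python
def precision_at_k(relevant_flags, k):
    """Precision at cut-off k: the share of relevant results among the top k.
    A minimal sketch; the metrics and cut-offs of the study itself vary."""
    return sum(relevant_flags[:k]) / k

judged = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]  # hypothetical binary judgments
for k in (1, 3, 5, 10):
    print(k, precision_at_k(judged, k))
```

The same result list can thus score anywhere from 0.4 to 1.0 depending solely on where one stops evaluating, which is why the choice of cut-off is a substantive decision rather than a technicality.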

The four problems outlined above are aggravated by the fact that they might be interconnected. Perhaps a base-2 logarithm calculation of DCG works well with binary judgments, while other judgment scales require different discount functions. While it would be possible to conduct all of the necessary evaluations in one study, I will not attempt to do that.

The amount of data needed to provide robust results would probably be larger than the current study could muster; and the interpretation and presentation would be a challenge in itself. Such an evaluation would have to consider scale versus discount versus threshold versus cut-off value versus the metric itself versus the actual main criterion of the present study, user preference or satisfaction. This six-dimensional matrix is something I happily leave to the researchers to come, content with smugly pointing out that the method introduced in this study seems well suited for the task. Nevertheless, the issues cannot be completely ignored; for this reason, I attempt to combine evaluations of these features on a smaller scale whenever it seems possible and appropriate.

To evaluate the different metrics and their possible parameters, two different methods are possible. The first is statistical correlation in its various forms. In short, correlation coefficients indicate how strong the link between different data sets is. This method would bring with it some methodological problems. For example, most popular correlation measures capture the connection between two data sets; however, for the preference evaluation, we have three (the preference judgments and the metric values for two different result lists). For example, the user might have indicated a preference for result list 1 over result list 2, with the former’s precision at 0.6 and the latter’s at 0.3. Of course, the two metric values might be conflated into a single number; however, it is unclear how best to do that. Do we take the difference, the ratio, or some other measure of discrepancy? Another problem is that statistical correlations are significant only in a mathematical sense.

59 For Google (http://www.google.com/search?q=%22why+a+duck%22) on July 8th 2011.
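To illustrate the conflation problem, here is a sketch (all names and data are hypothetical) that reduces the two metric values to a single number, by difference or by ratio, and checks agreement with the stated preferences; this is a simple agreement rate, not one of the correlation measures just discussed:

```python
def agreement_rate(preferences, scores_a, scores_b, conflate="difference"):
    """Share of query pairs where the metric's verdict, derived from a
    conflated single number, matches the stated user preference ('A' or 'B';
    ties are skipped here). Conflation by difference or by ratio."""
    hits, total = 0, 0
    for pref, a, b in zip(preferences, scores_a, scores_b):
        if pref == "tie":
            continue
        value = (a - b) if conflate == "difference" else (a / b)
        neutral = 0.0 if conflate == "difference" else 1.0
        verdict = "A" if value > neutral else "B"
        hits += verdict == pref
        total += 1
    return hits / total if total else float("nan")

prefs = ["A", "B", "A", "tie"]       # hypothetical user preferences
pa = [0.6, 0.2, 0.5, 0.4]            # metric scores for result list 1
pb = [0.3, 0.4, 0.6, 0.4]            # metric scores for result list 2
print(agreement_rate(prefs, pa, pb))
```

Difference and ratio can disagree: a pair scoring 0.02 versus 0.01 has the same ratio as one scoring 0.8 versus 0.4, but a far smaller difference, so the choice of conflation is itself a parameter to be examined.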
