
On Search Engine Evaluation Metrics. Inaugural-Dissertation zur Erlangung des Doktorgrades der Philosophie (Dr. Phil.) durch die Philosophische ...


Arguably, this approach ignores a lot of fine detail. For example, if the user clicks on all of the top five results, it could be a sign of higher result quality than if only the third (or even only the first) result is clicked. Yet the average click rank would be the same in the third-result case, and would actually be lower (that is, nominally better) in the first-result case. But a metric can still be useful irrespective of how far it goes in its attempt to capture all possible influences.
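As a minimal sketch of the computation behind this metric (the function name and structure are my own; the rank-21 penalty for no-click sessions follows the convention used in the figure captions), average click rank might be calculated like this:

```python
def average_click_rank(click_ranks, no_click_rank=21):
    """Mean rank of the results clicked in a session (rank 1 = top).
    Sessions without any clicks receive a penalty rank of 21, i.e.
    worse than any click within the top 20 results."""
    if not click_ranks:
        return float(no_click_rank)
    return sum(click_ranks) / len(click_ranks)

# Clicks on all of the top five results and a click on the third
# result alone yield the same average rank, illustrating the lost
# detail discussed above:
assert average_click_rank([1, 2, 3, 4, 5]) == average_click_rank([3])
```

Lower values count as better, which is why a no-click session must be assigned a rank below the deepest evaluated position rather than, say, zero.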

Figure 11.8. Preference Identification Ratio based on average click rank. PIR is shown on the Y-axis and the threshold (difference in rank average) on the X-axis. The PIR indicates the correct preference identification if average click ranks which lie higher in the result list are considered to be better. No-click sessions are regarded as having an average click rank of 21, so as to have a lower score than sessions with any clicks in the first 20 results.

And indeed, Figure 11.8 shows average click rank to perform well above chance when predicting user preferences. At low threshold values (with an average difference of up to 3 ranks regarded as sufficient), the PIR scores lie at around 0.65. For thresholds of 4 ranks and more, the score declines to around 0.55 (and for thresholds starting with 10 ranks, it lies around the baseline score of 0.50).

Figure 11.9. Preference Identification Ratio based on first click rank. PIR is shown on the Y-axis and the threshold (difference in first click rank) on the X-axis. The PIR indicates the correct preference identification if clicks which lie higher in the result list are considered to be better. No-click sessions are regarded as having a first click rank of 21, so as to have a lower score than sessions with any clicks in the first 20 results.

Instead of using the average click rank, we could also consider just the first click. By the same logic as before, if the first result a user clicks on is at rank 1, the result list should be preferable to one where the first click falls on rank 10. This metric is probably even more simplistic than average click rank; however, as Figure 11.9 shows, it produces at least some results.
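The first-click variant is even simpler to state (again a sketch with my own naming, using the same rank-21 convention for no-click sessions):

```python
def first_click_rank(click_ranks, no_click_rank=21):
    """Rank of the earliest (topmost) click in a session; sessions
    without clicks again score 21, below any click in the top 20."""
    return min(click_ranks) if click_ranks else no_click_rank
```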

Figure 11.10. Preference Identification Ratio based on first click rank. PIR is shown on the Y-axis and the threshold (difference in first click rank) on the X-axis. The PIR indicates the correct preference identification if clicks which lie higher in the result list are considered to be better. No-click sessions are regarded as having a first click rank of 21, so as to have a lower score than sessions with any clicks in the first 20 results. Only informational queries have been evaluated.


In the previous chapters, I have presented the results of the study in some detail; perhaps sometimes in more detail than would be strictly necessary. I feel this is justified, since my main goal is to show the possibilities of the approach, which is to judge metrics by comparing them to users’ explicit statements regarding their preferences. I have also commented briefly on the meaning of most results. In this section, I want to pull together these results to provide a broader picture. I will pronounce judgment on individual metrics and types of evaluation that were studied; however, a very important caveat (which I will repeat a few times) is that this is but one study, and not a large one at that. Before anyone throws up his hands in despair and wails “Why, oh why have I ever used method X! Woe unto me!”, he should attempt to reproduce the results of this study and see whether they really hold. On the other hand, most of the findings do not contradict earlier results; at most, they contradict views widely held on theoretical grounds. That is, while the results may not be as solid as I would like, they are sometimes the best empirical results on particular topics, if only because there are no others.

The discussion will span three topics. First, I will sum up the lessons gained from the general results, particularly the comparison of the two result list types. Then, I will offer an overview of and some further thoughts on the performance of individual metrics and parameters.

Finally, I will turn to the methodology used and discuss its merits and shortcomings, for the evaluation presented in this study as well as for further uses.

12.1 Search Engines and Users

The evaluation of user judgments shows some quite important results. These concern three areas: search engine performance, user judgment peculiarities, and the consequent appropriateness of these judgments for particular evaluation tasks.

The conclusions regarding search engine performance come from the evaluation of judgments for the original result list versus the randomized one (to recall: the latter was constructed by shuffling the top 50 results of the former). The judgments, for most evaluations, were significantly different from each other; but, also for most evaluations, much less so than one would expect for an ordered versus an unordered result list from a top search engine. In click numbers and click ranks, as well as in user satisfaction and user preference, the randomized result lists performed quite acceptably.

This was most frequent for queries which have a large number of results that might be relevant. This can be the case for queries where a broad range of options is sought. Informational queries that have a wide scope (“What about climate change?”) fall into that category. In this case, the user presumably wants to look at a variety of documents, and there are also likely to be more than enough pages in the search engine index to fill the top 50 ranks with more or less relevant results. Another possibility is that the user looks for something concrete which can be found at any number of pages. This can be a factual query (“When was Lenin born?”), or a large subset of transactional queries (there are hundreds of online shops selling similar products at similar prices, or offering the same downloads). All in all, those options cover a significant part of all queries. The queries for which there is likely to be a large difference between the performance of a normally ranked and a randomized list are those which have a small number of relevant results; these can be narrow informational or transactional queries (“Teaching music to the hard of hearing”), or, most obviously, navigational queries, which tend to have only one relevant result. In the present study, only 2 of the 42 queries were navigational, compared to the 10% to 25% given in the literature (see Section 2.2); a higher proportion may well have resulted in a larger difference in result list performance.

One evaluation where the two result list types performed very alike was the number of result lists that failed to satisfy any rater (17% versus 18%). This and the relatively high percentage of queries rated as very satisfactory in the randomized result list (57%) indicate that user satisfaction might not be a very discriminating measure. Of course, the satisfaction was binary in this case, and a broader rating scale might produce more diverse results.

For clicks, the results are twofold. On the one hand, the original result lists had many more clicks at earlier ranks than the randomized ones. But after rank five, the numbers are virtually indistinguishable. They generally decline at the same pace, apparently due to the users’ growing satisfaction with the information they have already got, or to their growing tiredness, or just to good old position bias; but result quality seems to play no role. Almost the same picture arises when we consider relevance, whether macro or micro: at first, the original result list has more relevant results, but, halfway through the first result page, the quality difference between it and the randomized result list all but disappears.

An important lesson from these findings is that authors should be very careful when, in their studies, they use result lists of different quality to assess a metric. The a priori difference in quality can turn out to be much smaller than expected, and any correlation it shows with other values might turn out to be, in the worst case, worthless. The solution is to define the quality criteria in an unambiguous form; in most cases, a definition in terms of explicit user ratings (or perhaps user behavior) will be appropriate.

12.2 Parameters and Metrics

Which metric performs best? That is, which metric is best at capturing the real user preferences between two result lists? This is a relatively clear-cut question; unfortunately, there is no equally unequivocal answer. However, if we look closer and move away from unrealistically perfect conditions, differences between the metrics’ performances become visible. Then, towards the end of this section, we will return to the question of general metric quality.

12.2.1 Discount Functions

One important point regards the discount functions. This is especially important in the case of MAP, where classical, widely-used MAP (with result scores discounted by rank) is consistently outperformed by a no-discount version.102 However, this by no means implies that no discount is the best discount. In NDCG and ESL, for example, the no-discount versions tend to perform worst. All in all, the hardly world-changing result is that different discount functions go down well with different metrics; though once again, a larger study may find more regularities.

However, this lack of clarity does not extend to different usage situations of most metrics. That is, no-discount MAP outperforms rank-discounted MAP for all queries lumped together and for informational queries on their own, for result ratings given by the same user who made the preference judgment and for those made by another one, for ratings made on a six-point scale and for binary ratings. If this finding is confirmed, it would mean that you only need to determine the best discount function for a given metric in one situation, and would then be able to use it in any.

Another point to be noted is that there is rarely a single discount function for any metric and situation performing significantly better than all others. Rather, there tend to be some well-performing and some not-so-well-performing discounts, with perhaps one or two not-so-well-but-also-not-so-badly-performing ones in between. For NDCG, for example, the shallowest discounts (no discount and log5) perform worst, the steepest (square and click-based) do better, with the moderately discounting functions providing the highest results.
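The role of the discount can be made concrete with a small sketch. This is a generic NDCG with a pluggable discount function; the discount shapes named below mirror those discussed above, but their exact definitions (in particular the log5 clamp) are my own assumptions, not necessarily those used in the study:

```python
import math

def ndcg(gains, discount):
    """NDCG with a pluggable discount: DCG divided by the DCG of the
    ideally reordered list. discount(rank) divides the gain at rank."""
    def dcg(gs):
        return sum(g / discount(r) for r, g in enumerate(gs, start=1))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0

# Hypothetical discount functions of the kinds discussed above:
no_discount = lambda rank: 1.0
log5 = lambda rank: max(1.0, math.log(rank, 5))  # shallow
by_rank = lambda rank: rank                      # moderate
square = lambda rank: rank ** 2                  # steep
```

One observation this makes tangible: over a full (uncut) list, the no-discount version cannot distinguish orderings at all, since DCG then equals the plain sum of gains for every permutation; its discriminating power comes only from cut-offs.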

All of these results are hard to explain. One thing that seems clear is that the properties of a metric make it work better with a certain kind of discount; however, this is a reformulated description rather than an explanation. If pressed, I think I could come up with a non-contradictory explanation of why MAP works best without discount, NDCG with a slight one, and ERR likes it steep. For example: MAP already regards earlier results as more important, even without an explicit discount function. As it averages over precision at ranks 1 to 3, say, the relevance of the first result influences all three precision values, while that of the third result has an effect on the third precision value only. Thus, using an additional, explicit discount by rank turns the overall discount into something more like a squared-rank function.

However, the other cases are less clear-cut; and in all of them, the a posteriori nature of possible explanations should make us cautious not to put too much trust in them until they are corroborated by further results.
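The implicit discount built into MAP, described above, can be illustrated numerically. This is a textbook average precision over binary relevance judgments, not necessarily the exact variant evaluated in the study:

```python
def average_precision(rels):
    """Average precision for a binary relevance list: the mean of
    precision@k over the ranks k that hold a relevant result."""
    hits, precisions = 0, []
    for k, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# A relevant result at rank 1 raises every subsequent precision value,
# whereas one at rank 3 only affects precision from rank 3 onwards:
ap_first = average_precision([1, 0, 0, 1])   # 0.75
ap_third = average_precision([0, 0, 1, 1])   # ~0.42
```

Even without any explicit discount, moving the same relevant result from rank 1 to rank 3 lowers the score considerably, which is the built-in rank preference the explanation above appeals to.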

12.2.2 Thresholds

The thresholds, as employed in this study, have the function of providing some extra sensitivity to preference predictions. These predictions depend on differences between two values; and it should make immediate sense that a difference of 300% should be a pretty strong predictor, while a difference of 1% might or might not mean anything. Indeed, the smaller the difference, the larger the chance that the user preference actually goes in the opposite direction.103 The good news is that in many cases, the best predictions can be made with the threshold set to zero. For example, the inter-metric evaluation results given in Section 10.5 look very similar, whether we use the zero-threshold approach or take the best-performing thresholds for every individual metric, discount and cut-off value. If there are changes, they generally lie in the margin of 0.01-0.02 PIR points.

102 At later ranks, no-discount MAP sometimes has lower scores than rank-discounted MAP. However, this is not really a case of the latter improving with more data, but rather of the former falling. This is an important difference, since it means that no-discount MAP generally has a higher peak score (that is, its highest score is almost always higher than the highest score of rank-discounted MAP).

The bad news is that this is not the case when we use single-result and preference judgments by different users – the most likely situation for a real-life evaluation. For the six-point relevance scale as well as for binary and three-point versions, the zero-threshold scores are lower and more volatile. While the peak values are relatively stable, it becomes more important to either use precisely the right cut-off value, or go at least some way towards determining good threshold values; else, one can land at a point where PIR scores take a dip, and get significantly worse predictions than those using a slightly different parameter would yield.
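The threshold mechanism itself amounts to a simple decision rule, which might be sketched as follows (the names and the abstention behaviour are my own simplification, not the study's exact procedure):

```python
def predict_preference(score_a, score_b, threshold=0.0):
    """Predict the preferred result list from a metric score difference.
    If the gap does not exceed the threshold, abstain (return None)."""
    diff = score_a - score_b
    if abs(diff) <= threshold:
        return None  # difference too small to call
    return 'A' if diff > 0 else 'B'
```

With the threshold at zero, every nonzero score difference yields a prediction; raising it trades prediction coverage for reliability, which is exactly the tuning problem described above.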

12.2.2.1 Detailed Preference Identification

There is an additional twist to threshold values. Throughout this study, I have assumed that the goal, the performance we were trying to capture, was to provide a maximal rate of preference recognition; this is what PIR scores are about. Sensible as this assumption is, there can be others. One situation where PIR scores are not everything occurs when we are trying to improve our algorithm without worsening anyone’s search experience.


