
On Search Engine Evaluation Metrics. Inaugural dissertation for the degree of Doctor of Philosophy (Dr. phil.), submitted to the Philosophical …


It is no revelation that most changes, in search as elsewhere, make some things better and some worse. Unless we reach a PIR of 1 (and we won’t), some users’ preferences will not be recognized, or will even be inverted. If we have a PIR of 0.7, then 70% of user preferences are recognized correctly while 30% are reversed; or 40% are recognized correctly and 60% are treated as no preference at all; or anything in between. For a search engine operator, it makes a big difference whether you improve results for two fifths of your users, or improve them for about two thirds while antagonizing the remaining third. Users may be content with stagnating quality, but react badly if it actually deteriorates.
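The arithmetic behind these scenarios can be made explicit. The sketch below assumes, as the two 0.7 examples imply, that correctly recognized preferences receive full credit, preferences treated as ties receive half credit, and reversals receive none; the function name is mine, not the thesis’s.

```python
def pir(correct: float, reversed_: float, no_pref: float) -> float:
    """PIR over queries where the user has a preference: full credit for
    correct identifications, half credit for 'no preference' calls,
    no credit for reversals (assumed half-credit reading)."""
    assert abs(correct + reversed_ + no_pref - 1.0) < 1e-9
    return correct + 0.5 * no_pref

# Both scenarios from the text come out at (essentially) 0.7:
print(pir(0.7, 0.3, 0.0))  # 70% recognized, 30% reversed
print(pir(0.4, 0.0, 0.6))  # 40% recognized, 60% treated as no preference
```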

Consider the example of one of the best-performing metrics, NDCG with a default log2 discount [104] (the graph is reproduced for your convenience as Figure 12.1). Let us assume that you have gone for the cut-off value of 10 to use as much information as possible. You can see in the graph that the peak PIR value is reached at thresholds 0.01–0.03, and again at 0.08. Which of those should you choose if you want to maximize benefits, but only provided you cause a minimum of harm? Or is there another threshold that is best suited for that?

This example suggests that another method would be to determine thresholds by relative, not absolute values.

However, since the PIR scores are unlikely to differ by orders of magnitude, large differences in the results seem improbable.

This example has been picked at random; obviously, the same methods can be used to evaluate in detail any other metric and discount.

–  –  –

In this case, it might make sense to go beyond PIR. Figure 12.2 shows the detailed distribution of judgments at the first peak point (threshold 0.01). I will take a moment to explain what the colours mean.

- Dark green shows the proportion of queries where the result list preference is identified correctly by the difference in NDCG scores.

- Light green means that the user has no preference, and the NDCG scores’ difference is below the threshold, so that the equal quality of the result lists is correctly inferred. These queries are not used in the PIR calculation, for reasons stated below.

- Yellow marks the queries where the user considers the result lists to be equally good, but the difference in NDCG scores is larger than the threshold, falsely indicating a preference. This is the second kind of query not used in the PIR calculation. As explained in Section 8.2.1, if a user considers two result lists to be of equal use, then recognizing this (non-)preference or failing to do so has no direct influence on the user’s experience, since he will be neither better nor worse off whichever result list he gets.

- Orange means that the NDCG difference is once again below the threshold, but this time there is a user preference, which is thus not captured.

- Red means that there is a user preference, but the difference in NDCG scores actually predicts the opposite preference.

As an easy guide, the colours run from dark green (good) through yellow (relatively harmless) to red (really bad).
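A minimal sketch of this five-way classification, assuming a preference is called whenever the score difference exceeds the threshold (the function name and the encoding of preferences are mine):

```python
def classify(user_pref: int, score_diff: float, threshold: float) -> str:
    """Map one query to a colour category.
    user_pref: +1/-1 if the user prefers list A/B, 0 for no preference.
    score_diff: NDCG score of list A minus that of list B."""
    call = 0 if abs(score_diff) <= threshold else (1 if score_diff > 0 else -1)
    if user_pref == 0:
        return "light green" if call == 0 else "yellow"
    if call == 0:
        return "orange"  # an existing preference is missed
    return "dark green" if call == user_pref else "red"

print(classify(1, 0.05, 0.01))   # clear, correct preference: dark green
print(classify(1, -0.05, 0.01))  # reversed preference: red
```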

Figure 12.2. Preference identifications for NDCG discounted by log2 of result rank, at a cut-off value of ten and a threshold of 0.01.

The PIR value for the point shown in Figure 12.2 is 0.84, as can be read from Figure 12.1.

However, the pie graph provides more information. We see that for over 60% of all queries, users have preferences which are correctly identified by NDCG (dark green); that of the 23% of queries where the user does not have a preference, the NDCG scores reflect this in just 2% (light green), while for 21% they suggest a non-existent preference (yellow); that in 7% of cases, the NDCG scores miss existing preferences (orange); and that for 8% of all queries, NDCG actually picks the result list the user likes less. It is this last number that you are interested in minimizing in this scenario, while keeping the first one as high as possible. [105] But 8% seems quite large; one in thirteen users will actually experience a decline in result list quality. Therefore, we need a method to look at the number of reversed preference judgments depending on the threshold used.

[105] Of course, the erroneous “no preference” statement should also be kept as low as possible; but while this would mean missed chances to improve the user experience, reversed preference actually impairs it.
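Under the same half-credit reading of missed preferences (an assumption; the exact PIR formula is defined earlier in the thesis), the approximate shares read off Figure 12.2 do reproduce a PIR close to the 0.84 of Figure 12.1:

```python
# Approximate category shares from Figure 12.2, as fractions of all queries.
# 0.62 stands in for the "over 60%" mentioned in the text.
dark_green, light_green, yellow, orange, red = 0.62, 0.02, 0.21, 0.07, 0.08

# Light green and yellow (no user preference) are excluded from PIR.
used = dark_green + orange + red
pir_value = (dark_green + 0.5 * orange) / used  # assumed half credit for orange
print(round(pir_value, 2))  # close to the 0.84 read from Figure 12.1
```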


Figure 12.3. Preference identifications for NDCG discounted by log2 of result rank, at a cut-off value of 10.

This is where Figure 12.3 comes into play, and I suspect that if you regarded PIR graphs as too immersed in specifics, you are going to hate this one. It shows the relative development of the five categories introduced above as a function of threshold values. As above, we would like the green areas to be as large as possible, while keeping orange and red small.

Additionally, the black line shows the PIR at each threshold.

If we look at the PIR line, we will see that the peak scores described above are indeed quite different with regard to the number of queries which will have their user preference reversed.

With a threshold of 0.01, more than 8% of all users will experience a decline in result list quality; at 0.03, the number falls to about 6%, and at 0.08 to 4%. If our aim is to minimize wrong predictions, we should go with the latter threshold; or even raise it to as much as 0.24, when PIR is 0.05 points below its maximum, but no user will be worse off than before.

Figure 12.3 also illustrates an inherent property of PIR development.

With a threshold of 0, all queries (except those where metric scores are absolutely identical) are regarded as having a preferred result list. If we increase the threshold, the only possible change is that some of these queries move into one of the “no preference” categories. Thus, if PIR falls, some queries must have moved from correct preference identification (dark green) to mistaken equality (orange); and if PIR increases, some queries must have moved from wrong preference identification (red) to mistaken equality (orange). [106] This gives us an easy way to detect corrected mistakes: whenever PIR goes up as the threshold increases, the red share has shrunk. We cannot assume that PIR always increases when errors are corrected, however, since correct judgments may be dropped at the same point.

[106] To be more precise, more queries must have moved in the mentioned direction than in the opposite one.
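These mechanics can be illustrated with a toy threshold sweep (the half-credit tie handling is again an assumption, and the data points are invented):

```python
def pir_at(threshold, data):
    """PIR over (user_pref, score_diff) pairs: full credit for correct calls,
    half credit for 'no preference' calls, none for reversals."""
    credit = used = 0
    for pref, diff in data:
        if pref == 0:
            continue  # light green / yellow: not used in PIR
        used += 1
        call = 0 if abs(diff) <= threshold else (1 if diff > 0 else -1)
        credit += 1.0 if call == pref else (0.5 if call == 0 else 0.0)
    return credit / used

# Raising the threshold first silences a reversed pair (PIR rises),
# then a correctly identified one (PIR falls again).
data = [(1, 0.02), (1, -0.01), (1, 0.30)]
print([round(pir_at(t, data), 2) for t in (0.0, 0.015, 0.05)])  # [0.67, 0.83, 0.67]
```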

12.2.3 Rating Sources

A large difference exists between results depending on whether the explicit result ratings come from the same raters as the preference judgments. Not surprisingly, if a user’s own ratings are used to predict his preference, the PIR scores are significantly higher. While the highest attained PIR score in this condition is about 0.92, it drops to just 0.80 if the preference is inferred from other users’ judgments. This means that, with all other parameters being optimal, one in five users would actually experience worse results than in the absence of any judgments.

This result holds broadly across most parameters. It can be seen at any cut-off rank, with any metric, threshold and scale (with an exception mentioned in the next section). Moreover, the results are not only lower in absolute PIR scores; they also get much less stable. The increased fluctuations can be seen particularly well in threshold evaluations, though most parameters are affected.

What are we to make of these results? To me, they mainly suggest the importance of search personalization. I stated that explicit rating results from the real-life user issuing a query are a luxury hardly imaginable for a search engine. But neither is it necessary to base preference assessments on all (or a random subset) of other users’ judgements. The personalization of web search is a well-established topic which has grown in importance in the last decade (see, for example, Speretta and Gauch 2005; Teevan, Dumais and Horvitz 2005; Sieg, Mobasher and Burke 2007). Usually, the research focuses on generating useful changes to a result list drawing upon the user’s previous actions. For our purposes, some kind of user profile could be generated, and a user’s preference could be induced from the preferences of raters or real users with similar profiles. The resulting evaluation scores would then fall in between the conditions of same-user and different-user ratings.

12.2.4 Relevance Scales

The evaluation of relevance scales is somewhat less satisfactory than that of other parameters in that it is not direct. The better method would certainly have been to provide different rater groups with different scales, and then give different sub-groups different instructions as to what the individual relevance ratings are supposed to signify. However, for the relevance scales tested here, this would mean a total of six subgroups, each of which would be too small to provide reliable results. Therefore, I had to make do with conflating a six-point scale; a more direct evaluation would be appropriate before taking any drastic measures. [107]

Generally, the six-point scale performs better than any binary or three-point scale. However, there are large differences among the subgroups as well as for different situations. For binary scales, the subgroup where the division between relevant and non-relevant was drawn in the middle (between the six-point scale’s ratings 3 and 4) performed only slightly worse than the six-point scale itself, while the subgroups where the relevance criteria were either very strict or very lax did far worse. This is important since users tend to be generous with their ratings.

[107] Although I do not think any of the results are very revolutionary in the first place.
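The conflation of the six-point scale can be sketched as simple mappings; the split points below correspond to the subgroups described, and the function names are mine:

```python
def to_binary(rating: int, split: int = 3) -> int:
    """Collapse a six-point rating (1..6) to binary relevance.
    split=3 draws the line in the middle, between ratings 3 and 4;
    split=1 would be a very lax, split=5 a very strict criterion."""
    return 1 if rating > split else 0

def to_three_point(rating: int, low: int = 2, high: int = 4) -> int:
    """Collapse to three points: 0 = non-relevant (<= low),
    2 = relevant (> high), 1 = partially relevant in between."""
    return 0 if rating <= low else (2 if rating > high else 1)

print([to_binary(r) for r in range(1, 7)])       # [0, 0, 0, 1, 1, 1]
print([to_three_point(r) for r in range(1, 7)])  # [0, 0, 1, 1, 2, 2]
```

Narrowing the outer categories (e.g. `low=1, high=5`) would push more ratings into “partially relevant”, the variant that generally scored lower.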

–  –  –

For three-point relevance scales, an equal distribution of relevance ratings also provided results not far below those of the six-point scale. If “relevant” and “non-relevant” categories were narrower, with more ratings falling into the “partially relevant” category, the PIR scores dropped. However, there was a noticeable exception: in one condition (with result ratings and preference judgements coming from different users), the latter three-point scale was actually more successful than the six-point scale. The result is not easily explained, and awaits further examination.

12.2.5 Cut-off Ranks

The evaluation of cut-off ranks and their influence on the metrics’ performance was a surprise to me, and could lead to a change in the methodology of search engine evaluation if it is confirmed by further studies. Usually, the necessity of introducing a cut-off rank is regarded as just that – a necessary compromise between optimal quality (evaluating all results) and reality (having a limited amount of resources). Accordingly, my initial aim regarding cut-offs was to see whether there are certain points at which the marginal gain of evaluating one more rank falls significantly, making them good candidates for cut-off-rankdom.

I did not expect to find that for most metrics and discount functions, as well as other parameters, the metrics’ performance does not rise steadily from rank one to ten. Instead, it rises initially, levels out, and then declines. The peak cut-off rank differs, and I will come to the particulars shortly. But before that, I would like to illustrate how more information can actually decrease the correspondence between an explicit metric and user preference.

As an extreme example, consider a navigational query. In the first result list, the desired result is at rank one; in the second, it is at rank 10. It is reasonable to expect that users will prefer the first result list. However, with precision at a cut-off rank of 10, the two lists receive identical scores. As soon as we lower the cut-off rank, the “right” result list takes the lead.
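The navigational example can be checked with plain precision (binary relevance; the two lists are hypothetical):

```python
def precision_at_k(rels, k):
    """Fraction of relevant results among the top k (binary relevance)."""
    return sum(rels[:k]) / k

list_a = [1] + [0] * 9   # desired result at rank 1
list_b = [0] * 9 + [1]   # desired result at rank 10

print(precision_at_k(list_a, 10), precision_at_k(list_b, 10))  # 0.1 0.1 (a tie)
print(precision_at_k(list_a, 5), precision_at_k(list_b, 5))    # 0.2 0.0 (list A leads)
```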

Of course, it is also possible that the first result list has another really good result at rank 10, while the second result list has none at all. In this case, going all the way to cut-off rank 10 would increase the chance of getting the right prediction. But if we assume that both scenarios are equally likely [108], then we have a fifty-fifty chance of improving or decreasing PIR, all other things being equal. But usually, all other things are not equal. PIR, in all cases we have seen, lies above the baseline of 0.5, and mostly well above it, sometimes in regions around 0.9. This means that it has much more room for decline than for gain; and, in our example, if the first result list had the good result at rank ten, the PIR would not have improved.

There is also another tendency working in the same direction. As shown in Section 9.4, the average relevance of the results up to rank 5 is higher for the original result lists than for the

[108] That is not necessarily true; but the general logic holds in any case.

–  –  –

If we accept this “regression towards the mean” explanation, what parameters would determine a metric’s proneness to this cut-off decay? It would be the discount function and the PIR score itself.

The discount function’s influence is easy to see: the steeper the discount, the less heed is taken of later results. With square rank discount, for example, the tenth result has a hundred times less influence than the first; for most purposes, it might as well not have been included at all. This effect can be seen clearly in Figure 12.4, which shows the performance of NDCG with different discount functions. The legend lists them in order of increasing steepness, and it is easy to see that no-discount DCG is the first to go into decline, and has the largest fall. log5 starts to decline at the same point (cut-off rank 5), but does so less dramatically. log2 and root are next to go, at cut-off rank 8. The steeper discounts (rank and squared rank) do not decline at all, and the click-based discount actually manages a marginal rise as late as cut-off rank 10.

Figure 12.4. Performance of NDCG with different discount functions.
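The relative weight of rank 10 under each discount makes the steepness ordering concrete. The parameterization below is an assumption (a common DCG variant with weights clamped to 1 for the earliest ranks); the thesis’s exact definitions, and the click-based discount, may differ:

```python
import math

# Assumed discount weights for a result at rank r (ranks start at 1).
discounts = {
    "none":        lambda r: 1.0,
    "log5":        lambda r: 1.0 / max(1.0, math.log(r, 5)),
    "log2":        lambda r: 1.0 / max(1.0, math.log2(r)),
    "root":        lambda r: 1.0 / math.sqrt(r),
    "rank":        lambda r: 1.0 / r,
    "square rank": lambda r: 1.0 / r ** 2,
}

# Influence of rank 10 relative to rank 1 under each discount:
for name, d in discounts.items():
    print(f"{name:12s} {d(10) / d(1):.3f}")
```

Square rank indeed gives rank 10 one hundredth of rank 1’s weight, while the flat and shallow-log discounts leave it with a substantial share, which is why they are the most exposed to the cut-off decay described above.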
