

On Search Engine Evaluation Metrics (inaugural dissertation for the degree of Doctor of Philosophy, Dr. Phil.)


ERR has the lowest peak value, 0.88. While MAP (Rank) reaches 0.90, it does so only at cut-off rank 9; up to cut-off rank 7, it is the worst-performing metric. Somewhat surprising are the PIR values for classical Precision. While it does not attain the highest scores, it does very well up to rank 6; after that, though, its PIR declines sharply, to barely 0.75 towards cut-off rank 10. The no-discount version of MAP takes off rather slowly, but overtakes Precision by cut-off rank 7 and reaches a PIR score of 0.92 at ranks 8 and 9.

The best-performing metrics are NDCG (with the standard log2 discount function) and ESL (with rank-based discount and a cumulative relevance of 2.5). NDCG rises slightly more slowly at the start and declines after cut-off rank 8, while ESL stays stable; over the middle distance, their scores are identical.

Apart from the individual metrics’ scores, there are two important questions to be raised about this data, both of which have been mentioned in the NDCG discussion in Section 10.1. The first concerns the shape of the individual curves; the second concerns the absolute values.

None of the metrics in Figure 10.34 starts with high PIR values at low cut-off ranks. Instead, there are two basic forms: either the scores rise up to a point and then stay stable (MRR, MAP (Rank), and ESL (Rank, n=2.5)), or they rise to a point, plateau, and then decline again (Precision, NDCG). MAP (No discount) and ERR (Square) only decline at cut-off rank 10, the former more, the latter less strongly. The broad pattern that can be discerned is one of discount steepness: the metrics with steeper discounts tend to keep their high scores, while those with low or no discounts tend to do worse starting at cut-off ranks 6 to 9. The no-discount MAP and ERR (Square) metrics do not quite conform to this observation, but they do not depart from it strikingly, either. But is there, I hear you cry out, any possible reason for this strange phenomenon? Indeed, there might well be.
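The discount-steepness pattern can be made concrete with a small sketch. The weight functions below are the usual textbook forms for the discounts named above (my reconstruction, not necessarily the dissertation's exact formulas); the point is simply how much of a metric's total weight mass is left for rank 10 under each discount.

```python
import math

# Illustrative weight functions for the discounts named in the text; the
# exact formulas in the dissertation may differ. Each weight multiplies the
# relevance gain of the result at rank r.
DISCOUNTS = {
    "none":   lambda r: 1.0,
    "log2":   lambda r: 1.0 / math.log2(r + 1),
    "rank":   lambda r: 1.0 / r,
    "rank^2": lambda r: 1.0 / r ** 2,
}

for name, weight in DISCOUNTS.items():
    weights = [weight(r) for r in range(1, 11)]
    # The steeper the discount, the smaller the share of the total weight
    # contributed by rank 10 -- and the less a late result can move the score.
    print(f"{name:>6}: rank-10 share of total weight = "
          f"{weights[-1] / sum(weights):.1%}")
```

Under these forms, the rank-10 share drops from 10% with no discount to well under 1% with the squared-rank discount, which is consistent with the steep-discount metrics being harder to lead off course by late results.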

Figure 10.34.

Inter-metric PIR comparison using the best-threshold method. NDCG (log2) and ESL (Rank, n=2.5) have the same values for cut-off ranks 3 to 8.


Usually, it is assumed that it would be best to evaluate all available results if possible. In web search engines, this possibility rarely occurs, and researchers settle for the second-best option: evaluating as many results as they can afford. But this may turn out to be a relic of the good old days of professional databases used by professional researchers. Consider Precision, the most venerable of metrics. In olden days, the perfect solution would have been to evaluate all results returned by the information retrieval system. But, as study after study has shown, modern-day search engine users do not pay attention to all documents. Their attention dwindles rapidly, so that in most search sessions, a result at rank ten is unlikely to even be considered, much less clicked on. This means that if one result list has a top-quality page at rank 10 while a second has a completely irrelevant result there, the user is unlikely to care (or even notice). Precision, however, regards this difference as important; in fact, as important as a similar difference at rank 1. What does this mean for the relative evaluation of the two result lists? If a result at a later rank does not influence user preference, but is counted towards the metric score, the score may go up or down independently of the preference rating. And an indiscriminate fluctuation of scores pushes the PIR towards its baseline of 0.5, which, as discussed in Section 8.2.1, constitutes precisely the case where user preference and metric score are independent.
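The Precision argument can be illustrated with a toy computation, assuming binary relevance and the common 1/log2(rank+1) discount (the result list and function names are hypothetical):

```python
import math

def precision_at_k(rels, k):
    """Fraction of relevant results among the top k (binary relevance)."""
    return sum(rels[:k]) / k

def dcg_at_k(rels, k):
    """Discounted cumulative gain with the usual 1/log2(rank+1) discount."""
    return sum(rel / math.log2(r + 1) for r, rel in enumerate(rels[:k], 1))

# A hypothetical result list with relevant results at ranks 1 and 10 (among others).
base = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]

def delta(metric, rank):
    """Score change when the single result at `rank` flips to irrelevant."""
    flipped = base.copy()
    flipped[rank - 1] = 0
    return metric(base, 10) - metric(flipped, 10)

# Precision weighs a difference at rank 10 exactly as much as one at rank 1;
# the discounted metric shrinks the rank-10 difference relative to a rank-1 one.
print(delta(precision_at_k, 10) / delta(precision_at_k, 1))  # 1.0
print(delta(dcg_at_k, 10) / delta(dcg_at_k, 1))              # ~0.289
```

A swap the user never notices moves Precision as much as a swap at the top of the list, while the discounted metric registers less than a third of that movement.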

More generally, this line of reasoning suggests that, while later ranks have the potential to improve preference identification, this diminishes the farther away one gets from the first few results; the chance of leading the metric off course with a result ignored by the user will, in contrast, rise. If the metric takes too little notice of lower-ranked results, it might miss an

–  –  –

The result of this view on discounting is that considering only a small part of the results returned by a search engine might not be a necessary evil that saves resources at the cost of study quality. It might instead be a way to save resources while improving study quality: a trade-off most researchers would not consider too disadvantageous.

The second important question is that of the metrics’ absolute values. As already mentioned in Section 10.1, the scores seem quite high. There might be multiple explanations for this: metric quality, query type, threshold advantages, the lack of non-preferable result lists, and unrealistic rater availability. I will discuss each in the following paragraphs.

First, let us call to mind what the numbers actually mean. The “good” metrics (that is, every metric but MRR) peak at PIR scores of 0.88 to 0.92, so I will use a value of 0.9 as an average.

A PIR score of 0.9 means that, given a query, two result lists (one of which is preferable to the user), and explicit result judgments by this user, a metric will correctly pick the preferred result list nine times out of ten; one time in ten, it will pick the less desirable one. Alternatively, the metric might pick the preferred result list eight times out of ten and return no preference for the other two cases (leading to guessing, and thus a preference recognition rate of 50%, for those cases).
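Both readings of a PIR of 0.9 follow from the same arithmetic if no-preference cases are counted as coin flips, as the guessing argument implies. A minimal sketch (function name is mine):

```python
def pir(correct, ties, total):
    """Share of pairs where the metric picks the preferred list, with
    no-preference cases counted as a coin flip (50% recognition)."""
    return (correct + 0.5 * ties) / total

# Reading 1: nine clear correct picks, one wrong pick.
print(pir(correct=9, ties=0, total=10))  # 0.9
# Reading 2: eight correct picks, two no-preference cases left to guessing.
print(pir(correct=8, ties=2, total=10))  # 0.9
```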

The most straightforward explanation is that the numbers simply reflect the high quality of the metrics. After all, many of these metrics have been used extensively, often for decades, and there is no compelling intrinsic reason to assume they must be poor predictors of user preference. However, previous studies examining the relationship between these metrics and user satisfaction and quality have produced results which are hard to reconcile with such high scores (e.g. Turpin and Hersh (2001) or Al-Maskari, Sanderson and Clough (2007); see Section 4 for detailed discussions of these studies).

The next possibility is that the mix of different query types used for the evaluation pushed the scores higher than they would be otherwise, assuming that different metrics should be used for different query types. We can check this by comparing Figure 10.34 with its informational-only counterpart (Figure 10.36), as we did before. The results for informational queries show slightly higher PIR scores than, but no major differences from, the all-query results.

Figure 10.36. Inter-metric PIR comparison using the best-threshold method and only informational queries. NDCG (log2) and ESL (Rank, n=2.5) have the same values for cut-off ranks 3 to 8. Again.

Another conceivable reason for the high scores is the best-threshold approach: by cherry-picking the best thresholds, the results are likely to be at the upper bound of what can be determined on a relatively modest data set. This possibility can also be easily tested. In Figure 10.35, the zero-threshold approach is used instead of best-threshold, and the results are so similar that it takes a bit of time to find the differences.92 The peak values remain practically unchanged, and so we can shelve this possibility for good.

A fourth reason to doubt the high values comes from the pre-selected nature of the result lists used for this evaluation. As explained in Section 8.2.1, result list pairs where the user does not have a preference have been excluded from the study. The reason for this is simple: if the user does not care which result list he gets, the metric can neither improve nor degrade the user experience. On the downside, this means that PIR measures only a specific part of a metric’s ability to correctly predict the user’s judgments. The question of judging metric quality in other, more detailed evaluations will be discussed in a later section. But we can already hypothesize the results in this case: as we have seen, most high PIR scores are reached at very low thresholds. However, low thresholds mean that we judge almost any difference in metric scores to be important, and the likelihood of assuming a user preference where there is none is quite considerable in these cases.
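The role of the threshold can be sketched with a hypothetical decision rule (the actual PIR computation is given in Formula 8.2 of this work): the metric is credited with a predicted preference only when the score difference exceeds the threshold t.

```python
def predicted_preference(score_a, score_b, t):
    """Predict a preference only if the metric score difference exceeds
    the threshold t; otherwise report no preference."""
    diff = score_a - score_b
    if diff > t:
        return "A"
    if diff < -t:
        return "B"
    return None  # no predicted preference -> guessing, counted as 50%

# With a near-zero threshold, almost any score difference is taken to signal
# a user preference -- even one the user would never notice.
print(predicted_preference(0.81, 0.80, t=0.0))   # A
print(predicted_preference(0.81, 0.80, t=0.05))  # None
```

A tiny threshold thus converts metric noise into confident (and often spurious) preference predictions, which is exactly the risk described above for pairs where the user has no preference.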

A final reason for possibly inflated scores is grounded in the relevance ratings and preference judgments used. As it is probably the most complex one, it has a section of its very own.

Footnote 92: Yes, there are some. I leave it to the reader to find them as a home assignment.

Preference Judgments and Extrinsic Single-result Ratings

As mentioned before, the results discussed in the previous section were all based on a comparison of result lists preferred by users with ratings of single results by those same users. As was also briefly mentioned, this is an unrealistic situation. If a user has already seen the individual results, he is unlikely to be interested in seeing result lists containing links to those results, even with the added benefit of abstracts. If an evaluation effort uses any single-result-based metric to determine the better result list, its ratings will not come from precisely the people for whom the result list is intended. Accordingly, a more realistic estimate of a metric’s usefulness comes from comparing user preferences with metric scores based on other users’ ratings. The calculation is straightforward: the standard PIR is used, as stated in Formula 8.2, but with the relevance ratings of individual results being the average relevance ratings of all users except the user whose result list preference is being used.
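The leave-one-out averaging described above can be sketched as follows (the data, user names, and function name are hypothetical; the actual evaluation then applies Formula 8.2 to these averaged ratings):

```python
from statistics import mean

def loo_relevance(ratings_by_user, held_out_user):
    """Average relevance ratings over all users except the one whose
    result-list preference is being predicted (leave-one-out)."""
    results = next(iter(ratings_by_user.values())).keys()
    return {
        result: mean(ratings[result]
                     for user, ratings in ratings_by_user.items()
                     if user != held_out_user)
        for result in results
    }

# Toy ratings: three users rating two results on a 0-1 relevance scale.
ratings = {
    "u1": {"doc_a": 1.0, "doc_b": 0.0},
    "u2": {"doc_a": 0.5, "doc_b": 0.5},
    "u3": {"doc_a": 1.0, "doc_b": 0.5},
}

# When predicting u1's preference, only u2's and u3's ratings are used.
print(loo_relevance(ratings, "u1"))  # {'doc_a': 0.75, 'doc_b': 0.5}
```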

As in the same-user rating condition, I will start with threshold graphs. Only a few will be shown, as the graphs share many of the same properties. One difference, however, can be seen in graphs such as Figure 10.37 (NDCG with log2 discount) and Figure 10.38 (NDCG with square-rank discount). Compared to their same-user-rating counterparts (Figure 10.3 and Figure 10.6), they show much shallower curves for individual cut-off ranks. This is especially true for higher-discount functions: while the PIR scores diminish rapidly with increasing thresholds in the same-user condition, here they stay relatively stable or even increase slightly at threshold values up to 0.1, and decline relatively slowly after that.

Figure 10.37.

NDCG PIR values discounted by log2 of result rank. Result and preference ratings come from different users.


Figure 10.39.

NDCG PIR scores for different discount functions, with the best-threshold approach. Result and preference ratings come from different users.


This change has predictable implications for the PIR comparison graphs using the different threshold approaches, since t=0 no longer necessarily provides scores close to the best-threshold method, as it did almost universally in the same-user condition. While the general shapes of the curves in Figure 10.39 and Figure 10.40 are still quite similar, the differences are immediately visible. The differences between discount functions are larger with the zero-threshold approach, as are the score amplitude and score fluctuations for individual functions. The best-threshold approach produces more stable results, but it also shows some changes: the PIR scores peak earlier, mostly at cut-off rank 6; the declines after the peak are less steep and less universal; and, because of the stronger score fluctuations, it is harder to determine which metrics perform better. The low-discount functions (no discount and log5) seem to perform worst when viewed over all cut-offs, but their peak score is higher than that of the high-discount functions (rank square, click-based). Arguably, the average-discount functions (log2, root, rank) perform best; for this reason, I will keep NDCG log2 as the NDCG reference metric.

A similar picture arises for MAP. The zero-threshold graph (Figure 10.42) appears even more chaotic than that for NDCG. Scores still tend to peak around cut-off 6, but there are some large increases at ranks 7 or 8. The best-threshold evaluation (Figure 10.41) is again slightly more stable, but still hard to read. The low-discount functions tend to perform better; the “No discount” function can still be considered to perform best in the best-threshold case, while falling behind for zero-threshold; the “Rank” discount, equivalent to traditional MAP, still performs rather poorly.


Figure 10.42.

MAP PIR scores for different discount functions, with the zero-threshold approach. Result and preference ratings come from different users.

Our next test case is ESL. The curves in Figure 10.43 and Figure 10.44 differ less from their same-user-rating counterparts. The latter graph, showing the PIR scores for cumulative relevance n=3, is somewhat more jagged, and more of its lines show a slight decline at later ranks (n=3 produced better results than n=2.5 in this condition, and will be used for the purposes of this section). Nevertheless, both graphs are much more orderly than those for MAP or even NDCG. The scores obtained with the zero-threshold approach are less stable for n=3 (Figure 10.46), but otherwise both these results and those for n=1 (Figure 10.45) are quite similar to the results of the best-threshold approach.

Figure 10.43.

ESL PIR scores for cumulative relevance n=1 with the best-threshold approach. Result and preference ratings come from different users.


Figure 10.45.

ESL PIR scores for cumulative relevance n=1 with the zero-threshold approach. Result and preference ratings come from different users.



