
NDCG PIR scores for different discount functions, with the best-threshold approach. As many of the discount functions produce identical PIR scores over parts of the range, some description of the overlaps is needed. Due to the discount function definitions, the “No discount”, “log5” and “log2” conditions produce the same results for cut-off values 1 and 2, and “No discount” and “log5” have the same values for cut-off values 1 to 5. Furthermore, the “Root” and “Rank” conditions have the same scores for cut-off ranks 1 to 8, and “Square” and “Click-based” coincide for cut-off values 3 to 4. “log2” has the same results as “Root” for cut-off values 4 to 10 and the same results as “Rank” for cut-off values 4 to 8. Reproduction of Figure 10.8.
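To make the overlaps in the figure easier to follow, the following sketch lists plausible forms of the discount functions being compared. The exact definitions are those given earlier in the dissertation; the forms below are assumptions chosen to be consistent with the caption (in particular, log-based discounts that only take effect once the logarithm exceeds 1), and the “Click-based” discount is omitted because its definition is not reproduced here.

```python
import math

# Assumed reconstructions of the discount functions compared in Figure 10.8;
# the dissertation's own definitions may differ in detail.
DISCOUNTS = {
    "No discount": lambda rank: 1.0,
    "log2":        lambda rank: 1.0 / max(1.0, math.log2(rank)),
    "log5":        lambda rank: 1.0 / max(1.0, math.log(rank, 5)),
    "Root":        lambda rank: 1.0 / math.sqrt(rank),
    "Rank":        lambda rank: 1.0 / rank,
    "Square":      lambda rank: 1.0 / rank ** 2,
}

# Under these definitions, log2(rank) <= 1 for ranks 1-2 and log5(rank) <= 1 for
# ranks 1-5, so those discounts equal 1 there and coincide with "No discount",
# matching the overlaps described in the caption.
for rank in range(1, 6):
    print(rank, {name: round(f(rank), 3) for name, f in DISCOUNTS.items()})
```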

Of course, the picture is not always as clear as that. When the curves become more volatile (for example, when result ratings and preferences come from different users), it is harder to discern patterns, but they mostly remain visible. The second parameter which can influence the decline in PIR scores, however, is harder to observe in an actual evaluation.

It seems logical to assume that if the explanation given above for the tendency of PIR scores to decline after some rank is true, then the peak PIR score should play a role in the extent of this decline. After all, if the PIR is at 0.5 to begin with, it has nothing to fear from results that randomize its predictive power. Conversely, the higher the peak score, the larger the possible (and probable) fall could be expected to be. However, this tendency is hard to see in the data. A possible explanation is that a metric with a high PIR peak is, on some level, a “good” metric. And a “good” metric would also be more likely than a “bad” metric to provide better preference recognition at later ranks, which counterbalances any possible “the higher the fall” effect.109 This argument is not very satisfactory, since it would imply that testing the effect is generally not possible. However, a more confident explanation will require more data.

12.2.6 Metric Performance

We are now approaching the results which are probably of most immediate importance for search engine evaluation. Which metric should be employed? I hope it has by now become abundantly clear that there is no single answer to this question (and perhaps not even much sense in the question itself, at least in such an oversimplified form). However, some metrics are doubtless more suitable than others for certain evaluations, and there are even metrics that seem to perform significantly better (or worse) than others in a wide range of situations.

Two of the log-based metrics that have been examined, Session Duration and Click Count, did not perform much better than chance. They might be more useful in specific situations (e.g. when considering result lists which received clicks versus no-click result lists). The third, Click Rank, had PIR scores significantly above the baseline, but its peak PIR values of 0.66 were only as good as those of the worst-performing explicit metrics.

Of the explicit metrics evaluated in this study, the one which performed by far the worst was Mean Reciprocal Rank. This is not too surprising; it is a very simple metric, considering only the rank of the first relevant result. But for other metrics, the picture is less clear.

ERR was a metric I expected to do well, as it tries to incorporate more facets of user behaviour than the others. Perhaps it makes too many assumptions; in any case, it is never the best in its class. In some cases it can keep up with the best-performing metrics, and sometimes, as in the general same-user evaluation (Figure 10.34), its peak value lies significantly below all others but MRR.

109 “Good” and “bad”, in this case, can be taken to mean “tend to produce high PIR scores” and “tend to produce low PIR scores”. The argument says only that if a metric produces a higher PIR score at its peak value than another metric, it can also be expected to produce a higher PIR score at a later rank.

Precision did not produce any miracles; but its performance was nevertheless better than I expected, considering its relatively simple nature.110 Its peak values have not been much lower than those of most other metrics in same-user evaluations, and actually led the field when the explicit result ratings and preference judgments came from different users. Precision’s weak point is its poor performance at later cut-off ranks; as it lacks a discounting function, the results at rank 10 are almost always significantly worse than those at an (earlier) peak rank.
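For concreteness, here is a minimal sketch of precision at a cut-off rank, assuming the graded generalisation implied above (the mean rating of the top k results; the classic definition uses binary relevance). Without a discount, a poor result at rank 10 weighs exactly as much as one at rank 1, which is what drags the scores down at later cut-offs.

```python
def precision_at_k(ratings, k):
    """Graded precision at cut-off k: the mean relevance rating of the top k results.
    A sketch assuming ratings in [0, 1]; every rank contributes equally, so the score
    tends to deteriorate at later cut-off ranks."""
    top_k = ratings[:k]
    return sum(top_k) / len(top_k)

ratings = [1.0, 1.0, 0.5, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0]
print(precision_at_k(ratings, 3))   # 0.833...
print(precision_at_k(ratings, 10))  # 0.3
```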

Another positive surprise was the performance of ESL. The metric itself is almost as old as precision, yet it performed significantly better. It is not quite the traditional version of ESL that has been evaluated; in Section 10.4, I described some changes I made to allow for non-binary relevance and a discount factor, as well as for normalization. However, I think that its essence is still the same. Be that as it may, ESL’s PIR scores are impressive. In many situations it is one of the best-performing metrics. However, its scores are somewhat lower when explicit result ratings and preference judgments come from different users. It is also the metric most susceptible to threshold changes; that is, taking a simple zero-threshold approach as opposed to a best-threshold one has slightly more of a detrimental effect on ESL than on other metrics.





MAP, on the other hand, performed surprisingly poorly. As explained in Section 10.3, two versions of MAP have been evaluated, one with the traditional rank-based discount and one without a discount. The no-discount MAP performed unspectacularly, having mostly average PIR scores and reaching other metrics’ peak scores in some circumstances (although at later cut-off ranks). The PIR scores of traditional MAP, discounted by rank, were outright poor. It lagged behind the respective top metrics in every evaluation. It came as a bit of a shock to me that one of the most widely used evaluation metrics performed so poorly, even though there have been other studies pointing in the same direction (see Section 4.1 for an overview). Taken together with the current study, however, those results seem convincing.
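As a rough illustration of the difference between the two variants, here is a sketch assuming the graded form suggested by footnote 111 (the addend at rank k is the relevance accumulated up to k, divided by k in the rank-discounted version); the exact definitions are those given in Section 10.3 and may differ.

```python
def map_like(relevances, discount_by_rank=True):
    """Sketch of an average-precision-style score over one result list, in the graded
    form hinted at in footnote 111: the addend at rank k is the relevance accumulated
    up to k, divided by k when the traditional rank discount is applied. This is an
    assumed reconstruction, not the exact definition from Section 10.3."""
    addends, accumulated = [], 0.0
    for k, rel in enumerate(relevances, start=1):
        accumulated += rel
        addends.append(accumulated / k if discount_by_rank else accumulated)
    return sum(addends) / len(addends)

ratings = [1.0, 0.0, 0.5, 1.0]
print(map_like(ratings, discount_by_rank=True))    # traditional, rank-discounted variant
print(map_like(ratings, discount_by_rank=False))   # no-discount variant
```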

While in most cases, I would advocate attempting to replicate or substantiate the findings of the current study before taking any decisions, for MAP (at least, for the usual version discounted by rank), I recommend further research before continuing to use the metric.

Finally, there is (N)DCG. In earlier sections, I pointed out some theoretical reservations about the metric itself and about the general lack of reflection when selecting its parameters (in particular, the discount function).

However, these potential shortcomings did not have a visible effect on the metric’s performance. If there is a metric which can be recommended for default use as a result of this study, it is NDCG, and in particular NDCG with its customary log2 discount. It has been in the best-performing group in virtually every single evaluation; not always providing the very best scores, but being close behind in the worst of cases.
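For reference, here is a minimal sketch of NDCG with the customary log2 discount. This is one common formulation; the dissertation's exact variant, including the parameter choices discussed in Chapter 10, may differ in detail.

```python
import math

def dcg(ratings, cutoff):
    """DCG with the customary log2 discount: the result at rank 1 is undiscounted,
    later results are divided by log2(rank)."""
    return sum(rel / max(1.0, math.log2(rank))
               for rank, rel in enumerate(ratings[:cutoff], start=1))

def ndcg(ratings, cutoff):
    """NDCG: DCG normalized by the DCG of the ideal (descending) ordering,
    so that a perfectly ordered list scores 1."""
    ideal = dcg(sorted(ratings, reverse=True), cutoff)
    return dcg(ratings, cutoff) / ideal if ideal > 0 else 0.0

print(ndcg([0.5, 1.0, 0.0, 1.0], cutoff=4))
```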

If saying which metrics perform better and which do worse can only be done with significant reservations, explaining the why is much harder still. Why does NDCG generally perform well, why does ERR perform worse? To find general tendencies, we can try to employ the Web Evaluation Relevance model described in Chapter 7.

110 I do not think I was alone in my low expectations, either, as the use of precision in evaluations has steadily declined over time.

The WER model has three dimensions. The first of those is representation. While different evaluation situations had different representation styles (information need or request), these are identical for all metrics. The same can be said about the information resources; all explicit metrics are document-based. The only differences, then, lie in the third category: context.

Context is somewhat of a catch-all concept, encompassing many aspects of the evaluation environment. On an intuitive level, it reflects the complexity of the user model used by a metric.

One component of context that is present in most of the evaluated metrics is a discount function. However, as described above, the same discount functions have been tested for all metrics. A second component is taking into consideration the quality of earlier results.

This happens, in different ways, with ERR (where a relevant result contributes less if previous results were also relevant) and ESL (where the number of relevant results up to a rank is the defining feature).111 But this does not shed too much light on the reasons for the metrics’ performances, either. The two metrics handle the additional input in quite different ways that do not constitute parts of a continuum which could be rated for performance improvement. And the use of additional input itself does not provide any clues, either; ESL generally performs well, while ERR’s PIR scores are average. This might be explained by the fact that ESL has an extra parameter, the number of required relevant results,112 which can be manipulated to select the best-performing value for any evaluation environment. All in all, there is not enough evidence that metrics with a more complex user model (as exemplified by ERR and ESL) provide better results.
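To make the two ways of using earlier-result quality concrete, here is a sketch of both mechanisms. The ERR part follows the standard cascade idea (a result contributes less if the user was likely to have stopped earlier); the ESL part follows the accumulated-relevance reading of footnote 112. Both are assumed simplifications: the stopping probabilities, the required-relevance parameter, and the omitted discounting and normalization of the modified ESL are not the dissertation's exact choices.

```python
def err(stop_probabilities):
    """Expected Reciprocal Rank (cascade sketch): each rank contributes 1/rank weighted
    by the probability that the user reaches it and stops there, so a relevant result
    counts for less when earlier results were also relevant."""
    score, p_reach = 0.0, 1.0
    for rank, p_stop in enumerate(stop_probabilities, start=1):
        score += p_reach * p_stop / rank
        p_reach *= 1.0 - p_stop
    return score

def esl(ratings, required_relevance):
    """ESL sketch in the accumulated-relevance form of footnote 112: the number of
    results that have to be examined until the accumulated relevance reaches the
    required amount (list length + 1 if it never does)."""
    accumulated = 0.0
    for rank, rel in enumerate(ratings, start=1):
        accumulated += rel
        if accumulated >= required_relevance:
            return rank
    return len(ratings) + 1

print(err([0.9, 0.9, 0.1]))                                # second relevant result adds little
print(esl([1.0, 0.0, 0.5, 1.0], required_relevance=1.5))   # -> 3
```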

12.3 The Methodology and Its Potential

The results presented in the two previous sections, if confirmed, can provide valuable information for search engine operators and evaluators. The different impacts of user satisfaction and user preference, or the performance differences between MAP and NDCG, are important results; however, I regard the methodology developed and used in this study as its most important contribution.

In my eyes, one of the most significant drawbacks of many previous studies was the lack of a clear definition of what the study was actually going to measure in the real world.

Take as an example the many evaluations where the standard against which a metric is evaluated is another metric, or result lists postulated to be of significantly different quality. The question which these evaluations attempt to answer is “How similar are this metric’s results to those provided by another metric?”, which is of limited use if the reference metric does not have a well-defined meaning itself; or “If we are right that the result lists are of different quality, how closely does this metric reflect the difference?”, which depends on the satisfaction of the condition. Moreover, even if the reference values are meaningful, such evaluations provide results which are at best twice removed from actual, real-life users and use cases.

111 MAP could be argued to actually rate results higher if previous results have been more relevant (since the addend for each rank partly depends on the sum of previous relevancies). However, as the addends are then averaged, the higher scores at earlier ranks mean that the MAP itself declines. For an illustration, you might return to the example in Table 4.2 and Figure 4.2 and consider the opposed impacts of the results at rank 4 in both result lists.

112 Or rather, in the form used here, the total amount of accumulated relevance (see Formula 10.5).

This is the reason for the Preference Identification Ratio and the methodology that goes with it. The idea behind PIR is to define a real-life, user-based goal (in this case, providing users with better result lists by predicting user preferences) and to measure directly how well a metric performs with regard to that goal. I would like to make it clear that PIR is by no means the only or the best measure imaginable. One could concentrate on user satisfaction instead of preference, for example; or on reducing (or increasing) the number of clicks, or on any number of other parameters important for a search engine operator. However, the choice of measure does not depend on theoretical considerations, but only on the goal of the evaluation. And the question of how well PIR reflects a metric’s ability to predict user preference is answered by the very definition of PIR.
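As an illustration only (the formal definition of PIR is the one given earlier in this dissertation), a PIR-style computation might look as follows. The handling of pairs where the score difference stays within the threshold, counted here as chance-level half-credit, is an assumption.

```python
def pir(score_pairs, preferences, threshold=0.0):
    """Sketch of a Preference Identification Ratio computation (assumed form): the
    share of preference pairs for which the metric's score difference points to the
    result list the user actually preferred. score_pairs holds (score_a, score_b)
    per pair; preferences holds "A" or "B" per pair; differences within the threshold
    yield no prediction and are counted as chance (0.5)."""
    correct = 0.0
    for (score_a, score_b), preferred in zip(score_pairs, preferences):
        difference = score_a - score_b
        if abs(difference) <= threshold:
            correct += 0.5
        elif (difference > 0) == (preferred == "A"):
            correct += 1.0
    return correct / len(preferences)

print(pir([(0.8, 0.6), (0.4, 0.5), (0.7, 0.7)], ["A", "A", "B"], threshold=0.05))
```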

Another important method employed in this study is parameter variation. Metric parameters are routinely used in studies without much reflection. I have mostly used the log2 discount of NDCG as an example, but it is only one instance, and probably not the most significant one. I am not aware of any research on whether the discount by rank used in MAP for decades is really the best one; the results of this evaluation suggest it emphatically is not. Nor am I aware of a systematic evaluation of different cut-off ranks to determine which ones produce the best effort/benefit ratio, or simply the best results.

I hope I have shown a few important facts about parameters. They should not be taken for granted and used without reflection; doing so may mean missing an opportunity to reduce effort or improve performance, or simply producing bad results. But just as important is the realization that it is not necessarily outlandishly hard to find the best values. In this study, there were two main sets of data: user preferences and result judgments. But with just these two input sets, it was possible to evaluate a large range of parameters: half a dozen discount functions, ten cut-off ranks, thirty threshold values, two types of rating input, three rating scales (with multiple subscales), and five main metrics, which can also be viewed as just another parameter.
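A minimal sketch of this kind of parameter sweep is given below, with parameter ranges mirroring those listed above; evaluate_pir is a hypothetical stand-in for a full PIR evaluation run over the fixed preference and rating data, and the concrete values are illustrative assumptions rather than the study's exact settings.

```python
from itertools import product

METRICS    = ["precision", "ndcg", "map", "err", "esl"]
DISCOUNTS  = ["none", "log2", "log5", "root", "rank", "square"]
CUTOFFS    = range(1, 11)                    # cut-off ranks 1..10
THRESHOLDS = [t / 10 for t in range(30)]     # thirty threshold values

def best_setting(evaluate_pir):
    """Return the (metric, discount, cutoff, threshold) combination with the highest
    PIR, where evaluate_pir(metric, discount, cutoff, threshold) runs one evaluation
    over the same two input sets (preferences and ratings)."""
    return max(product(METRICS, DISCOUNTS, CUTOFFS, THRESHOLDS),
               key=lambda combo: evaluate_pir(*combo))
```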

Granted, the evaluation’s results are not necessarily as robust as I would have liked. However, that is not a result of the large number of evaluated parameters, but of the small scale of the study itself. I expect that a study with, say, fifty participants and a hundred queries would have stronger evidence for its findings; and, should its findings corroborate those of the present evaluation, be conclusive enough to act upon the results.

12.4 Further Research Possibilities

What could be the next steps in the research framework described here? There are quite a lot; while some could revisit or deepen research presented in this study, others might consider new questions or parameters. Below, I will suggest a few possibilities going further than the obvious (and needed) one of attempting to duplicate this evaluation’s results.

One avenue of research discussed in Section 12.2.2.1 is detailed preference identification. It goes beyond a simple, single score to provide a faceted view of a metric’s performance. By


