FREE ELECTRONIC LIBRARY - Dissertations, online materials

Pages:     | 1 |   ...   | 14 | 15 || 17 | 18 |   ...   | 26 |

«On Search Engine Evaluation Metrics Inaugural-Dissertation zur Erlangung des Doktorgrades der Philosophie (Dr. Phil.) durch die Philosophische ...»

-- [ Page 16 ] --

A further property of the graphs that has at first surprised me was the large number of identical data points for different discount functions (for example, NDCG Square and Clickbased coinciding at cut-offs 3, 4, and 6), necessitating the detailed verbal descriptions found in Figure 10.8 and Figure 10.9.83 However, this becomes less surprising when one considers that the discount functions start out with the same performance at cut-off 1, and then improve when for a number of sessions, the better result list can be identified correctly. The improvement in PIR score is then a multiple of one over the overall number of sessions with stated result list preferences. If, for example, we had 50 sessions where a user has indicated a preference for one of the possible result lists, the PIR would always be a multiple of 1/50 or

0.02. If we furthermore consider that the PIR baseline is 0.5, there are only 25 possible PIR scores left. Therefore, it is not really surprising that different discount functions might have the same score. Also, the functions with the same scores tend to be ones with relatively similar discounts (“Square” and “Click-based”, for example, are the two functions which discount most strongly).

Overall, the “No discount”, “log5” and “Square” discounts can be said to perform worse, while “Root”, “log2” and “Rank” do better, the “Click-based” function falling in between. For the purposes of later inter-metric comparison, “log2” will be used for two reasons. It is one of the more successful functions; though it performs slightly worse that “Rank” and “Root” at cut-off 2, it is better at cut-off 3 and equally good up to cut-off 8. We will have to bear in mind, though, that its PIR scores decline at cut-offs 9 and 10, falling almost 0.08 points below “Rank” at 10. The reason to nevertheless pick it rather than “Rank” for future prominence is its role in the literature in the past and present. As discussed in Section 8.2, log2 is routinely For me, at least. Perhaps you suspected all along that the results would turn out to be just like this.

And in many of the upcoming graphs; though I omit them later when we get a feel for the graphs and what the course of each line is.

–  –  –

An issue with the data requires more graphs. As stated in Section 9.2, only 31 of the 42 evaluated queries were informational, with the others classified as transactional, navigational, factual, or meta-query. While the latter categories contain too few queries to be evaluated on their own, they might have disturbed the picture if their PIR score patterns look significantly different from those of informational queries. Therefore, it is important to examine the issue.

Luckily, Figure 10.10 (log2 PIR threshold graph for informational queries only) and Figure 10.11 (comparison of PIR scores for different discount functions for informational queries only) show such a strong similarity to their respective counterparts (Figure 10.3 and Figure 10.8) that it is unnecessary to reproduce all the graphs in an “informational-only” form.

Figure 10.10.

NDCG PIR values for informational queries only, discounted by log2 of result rank.

- 95 Figure 10.11. NDCG PIR scores for informational queries only, with the best-threshold approach. As many of the discount functions often produce identical PIR scores, a description is needed. Due to the discount function definitions, the “No discount”, “log5” and “log2” conditions produce the same results for cut-off values 1 and 2, and “No discount” and “log5” have the same values for cut-off values 1 to 5. Furthermore, the “Root” and “Rank” conditions have the same scores for cut-off ranks 1 to 8, and “Square” and “Click-based” coincide for cut-off values 3 to 4. “log2” has the same results as “Root” for cut-off values 4 to 10 and the same results as “Rank” for cut-off values 4 to 8.

10.2 Precision As mentioned in Section 8.2, classical Precision has quite a lot of common with a no-discount version of NDCG (alternatively, one might say standard NDCG has a lot in common with Precision with a log2 discount function). Both metrics rely on simple addition of relevance scores, which are then normalized. The main difference is that Precision, in dividing the sum of scores by their number, has its highest possible score of one in a situation where all results are perfectly relevant; NDCG, instead, considers the situation where all known relevant results are presented in the optimal order to be the ideal. Therefore, Precision tends to have lower absolute scores than NDCG.

At first glance, Figure 10.12, showing the PIR scores of classical Precision depending on cutoff and threshold values, looks significantly different from its NDCG counterpart (Figure 10.1). And the difference becomes extreme when one compares the two measures with steep discounting functions, such as the squared rank (Figure 10.13 versus Figure 10.6). While the no-discount versions appears basically similar, though with a more marked decrease in PIR scores at higher threshold values, the strongly discounted Precision graph shows not just a strong PIR decline – every line showing PIR scores for cut-off values of 5 and above hits the

0.5 baseline. This means that at this point, Precision performs no better than chance in selecting the preferred result list. However, this is easily explained by the differences in

- 96 absolute values mentioned above. If, for example, the NDCG scores for all queries tend to lie in the interval between 0.2 and 0.9, while the Precision scores are between 0.1 and 0.5, then the maximum difference is 0.7 and 0.4 respectively. With the Precision differences spread over a shorter score span, the same threshold steps will provide less fine distinctions between result lists; in the above example, any threshold above 0.4 will fail to distinguish between any result lists. The effect is much stronger for high-discount versions, where the discounts reduce the score range even further. One could change this situation by choosing a finer threshold step for Precision, and selecting a different denominator for discounted Precision metrics to compensate; however, I think that this would combat a non-existent problem.

To see why the threshold issues do not matter a lot, we will examine two Precision PIR graphs comparing different discount functions, just as we did with NDCG. Figure 10.14, using the best-threshold approach, is very similar to its NDCG counterpart (Figure 10.8).

There are a few minor differences (mostly some PIR scores being slightly lower in the Precision graph), but the overall picture is clearly the same. And when we compare the graphs where threshold values of 0 are used (Figure 10.15 and Figure 10.9), we see that they are completely identical. This should not really surprise; if no threshold is used, any difference between two result lists is considered significant, and the magnitude of absolute scores does not matter anymore. For all future purposes, then, Precision and NDCG can be regarded as the same metric with different discount functions. I will still refer to them by their familiar names, but the connection should be kept in mind. For the later inter-metric comparison, I will also show classical Precision as a separate metric for the simple reason that it is an all-time favourite, and is still used in evaluations. It will be interesting and instructive to see how it performs against more modern metrics.

–  –  –

Figure 10.15.

Precision PIR scores for different discount functions, with the threshold values set to zero. The graph is identical to the one in Figure 10.9 (NDCG PIR scores with t=0).

–  –  –

Formula 10.1. MAP formula with queries Q, relevant documents R and documents D (at rank r). rel is a relevance function assigning 1 to relevant and 0 to non-relevant results, or, in this study, one of six values in the range from 1 to 0.

–  –  –

Formula 10.2. Modified MAP formula with queries Q, relevant documents R and documents D (at rank r). rel is a relevance function assigning 1 to relevant and 0 to non-relevant results, or, in this study, one of six values in the range from 1 to 0. disc(r) is a discount function depending on the rank r.

Note that it is not the relevance of later results that is being discounted, but rather the importance of later addends. In a way, the formula is not really using Precision any more; the addends (which are later averaged) may range from a simple sum of relevance judgment sums (if no discount is used) to a sum of cumulative gains discounted, for later addends, by the square of their rank. If the function discounts by rank, it is obviously the same as standard MAP.85 Also note that for disc(r)r, MAP does not necessarily fall into the [0..1] range. This is not a problem for the present study for two reasons. Firstly, this would only have a significant effect for consistently high relevance ratings on all ranks, which hardly ever happens. Second, even where MAP scores do exceed 1, they will do so for both result lists of a comparison, thereby preserving the preference.86 The two graphs showing MAP by threshold are somewhere between those for NDCG and those for Precision. Both Figure 10.16 (no discount) and Figure 10.17 (discount by rank, that As mentioned before, for our purposes (without cross-query comparison), MAP is the same as Average Precision. I will use the more familiar term MAP throughout this work.

Well... Not quite. In this first batch of evaluations, a six-point relevance scale is used. While this usually only influences the metrics by providing a more faceted relevance distribution, AP is different. The relevance function occurs twice in Formula 10.2; the second time it is as a multiplier for the entire precision fraction. Thus, its influence is higher than for other metrics. This issue will be dealt with in Section 10.7, which concerns itself with relevance scales.

The only problem could be threshold steps, which might be too small to reach an optimum level. However, the graphs in this section clearly show that this is not the case.

- 100 is, traditional MAP) show a marked PIR decline at higher thresholds, though it is not as steep as that of Precision PIR graphs. Otherwise, they are rather unspectacular, which is why I omit the detailed threshold graphs for the other five discount functions and proceed straight to the comparative graphs of the different discount metrics.

With MAP, many results are analogous to those of NDCG. The graphs using the t=0 and the best-threshold approach are quite similar; and the peak scores lie at about 0.92 once again.

The tendency of PIR scores to decline after a peak has been reached is less pronounced; it only happens at the very last rank, and only with two curves. These two discount functions, “No discount” and “log5”, are also the best-performing ones. The equivalent of traditional AP, the “Rank” discount, lies in the middle, and the only discount function it constantly outperforms is the “Square” discount. All in all, the tendency in this case seems to be for more steeply discounting functions to perform worse, while the tendency seemed to go in the opposite direction in NDCG evaluations.

Figure 10.16.

MAP PIR values without any discount for results at later ranks.

–  –  –

Figure 10.18.

MAP PIR scores for different discount functions, with the best-threshold approach. MAP Rank is the traditional MAP metric. As many of the discount functions often produce identical PIR scores, a description is needed. Due to the discount function definitions, the “No discount”, “log5” and “log2” conditions produce the same

–  –  –

Figure 10.19.

MAP PIR scores for different discount functions, with the threshold values set to zero. Again, MAP Rank is the traditional MAP metric.

–  –  –

Formula 10.3. MAP calculation for the “No discount” version.

As with NDCG, I would like to compare these results with those for informational queries only, shown in Figure 10.20. The difference to Figure 10.18 (MAP PIR for all types of queries) can be discerned with the naked eye; while the overall shapes of the individual lines are similar, they lie consistently higher for the informational queries. The reason for this is not quite clear. As returning to the graph the few navigational, factual and meta-queries does not change the scores in any way, the only remaining influence are the transactional queries. The main difference between informational and transactional queries is that, while with the former the user is generally supposed to have an information need to be satisfied by multiple sources, the “transactional need” in the latter case is often satisfied by the first good result. By

–  –  –

Figure 10.20.

MAP PIR scores for informational queries only, with the best-threshold approach.

10.4 Other Metrics With the more popular metrics safely tucked, we can now turn towards others, which have not been used as extensively in evaluations. These metrics sound promising on paper, and it is not easy to tell whether they have not been used more often because other researchers found some intrinsic flaws in them, or because they did not see why they should abandon more familiar metrics, of perhaps just because of some unlucky circumstances.

–  –  –

Formula 10.4. ERR formula. For each rank r, the probabilities of a relevant result at each earlier rank i are multiplied; the inverse probability is used as a damping factor for the gain at the current rank.

Pages:     | 1 |   ...   | 14 | 15 || 17 | 18 |   ...   | 26 |

Similar works:

«AUTHORIZATION AND TRUST IN SOFTWARE SYSTEMS A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy by Kevin A. Walsh January 2012 c 2012 Kevin A. Walsh ALL RIGHTS RESERVED AUTHORIZATION AND TRUST IN SOFTWARE SYSTEMS Kevin A. Walsh, Ph.D. Cornell University 2012 Nexus Authorization Logic (NAL) provides a principled basis for specifying and reasoning about credentials and authorization...»

«Printed in Anna Marmodoro (ed.) The Metaphysics of Powers, Routledge 2010: 143-59 A Powerful Theory of Causation Rani Lill Anjum Department of Philosophy, University of Tromsø and Nottingham Stephen Mumford Department of Philosophy, University of Nottingham Abstract Hume thought that if you believed in powers, you believed in necessary connections in nature. He was then able to argue that there were none such because anything could follow anything else. But Hume wrong-footed his opponents. A...»

«UNIVERSITY OF CALIFORNIA, SAN DIEGO Gravity changes associated with underground injection of CO2 at the Sleipner storage reservoir in the North Sea, and other marine geodetic studies. A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Earth Sciences By Scott L. Nooner Committee in charge: Mark A. Zumberge, Chair Donna K. Blackman Ahmed-Waeil Elgamal Robert L. Parker Glenn S. Sasagawa Fred N. Spiess 2005 Copyright Scott L. Nooner, 2005 All...»


«INVESTIGATING STRUCTURE-FUNCTION RELATIONSHIPS IN FAMILY 7 CELLULASES BY MOLECULAR SIMULATION By Courtney Barnett Taylor Dissertation Submitted to the Faculty of the Graduate School of Vanderbilt University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY in Chemical Engineering August, 2012 Nashville, Tennessee Approved: Clare McCabe Peter T. Cummings Eugene LeBoeuf Kenneth A. Debelak To Mom, Dad, John, and Mal, unwavering in their support and To Trent, for 13...»

«Justifications and Excuses Marcia Baron∗ The distinction between justifications and excuses is a familiar one to most of us who work either in moral philosophy or legal philosophy. But exactly how it should be understood is a matter of considerable disagreement. My aim in this paper is, first, to sort out the differences and try to figure out what underlying disagreements account for them. I give particular attention to the following question: Does a person who acts on a reasonable but...»

«Energy Efficiency in Hybrid Mobile and Wireless Networks Ziaul Haq Abbas Energy Efficiency in Hybrid Mobile and Wireless Networks A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Philosophiae Doctor (PhD) in Information and Communication Technology Department of Information and Communication Technology Faculty of Engineering and Science University of Agder 2012 iii Doctoral Dissertation at the University of Agder 49 ISBN: 978-82-7117-712-6 ISSN: 1504-9272...»

«The Effects of Humor on Cognitive Learning in a Computer-Based Environment Robert D. Whisonant Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Curriculum and Instruction Dr. John K. Burton, Chair Dr. Glen A. Holmes Dr. Franklin M. Jones Dr. Susan G. Magliaro Dr. D. Mike Moore June 2, 1998 Blacksburg, Virginia Keywords: Humor, CBI, Instructional Technology, Comic...»

«ABSTRACT Title of Dissertation: ALD-ENABLED CATHODE-CATALYST ARCHITECTURES FOR LI-O2 BATTERIES Marshall Adam Schroeder, Doctor of Philosophy, 2015 Directed By: Professor Gary W. Rubloff Minta Martin Professor of Engineering Department of Materials Science and Engineering Institute for Systems Research The Li-O2 electrochemical redox couple is one of the prime candidates for next generation energy storage. Known for its impressive theoretical metric for specific energy, even current practically...»

«Curriculum Vitae Name: Justin K Taylor Concact: Email – Justin.k.taylor87@gmail.com Ph.D. 2014 Collegiate Institutions Attended: 2009 2014 University of Maryland, Baltimore, Doctor of Philosophy, June 2014 2005 2009 University of Colorado, Boulder, Bachelor of Arts, May 2009 Major: Molecular Microbiology and Immunology Publications 1. Christopher C, Liu Y, Mu H, Taylor J, Massare M, Flyer D, Smith G, & Frieman M. (2014) Purified coronavirus Spike protein nanoparticles induce coronavirus...»

«ARCHITECTURES FOR RUN-TIME VERIFICATION OF CODE INTEGRITY by MILENA MILENKOVIC A DISSERTATION Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in The Shared Computer Engineering Program of The University of Alabama in Huntsville The University of Alabama at Birmingham to The School of Graduate Studies of The University of Alabama in Huntsville HUNTSVILLE, ALABAMA 2005 In presenting this dissertation in partial fulfillment of the requirements for a...»

«Tel Aviv University The Lester & Sally Entin Faculty of Humanities The Shirley & Leslie Porter School of Cultural Studies SIMILARITY, VARIATION, AND CHANGE: INSTABILITY IN HEBREW WEAK VERBS THESIS SUBMITTED FOR THE DEGREE “DOCTOR OF PHILOSOPHY” by Gila Zadok Submitted to the Senate of Tel Aviv University December 2012 This work was carried out under the supervision of Prof. Outi Bat-El TABLE OF CONTENTS Abstract Acknowledgements CHAPTER 1. Introduction 1.1. Overview 1.2. Language Background...»

<<  HOME   |    CONTACTS
2016 www.dissertation.xlibx.info - Dissertations, online materials

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.