WWW.DISSERTATION.XLIBX.INFO
FREE ELECTRONIC LIBRARY - Dissertations, online materials
 
<< HOME
CONTACTS



Pages:     | 1 |   ...   | 18 | 19 || 21 | 22 |   ...   | 26 |

«On Search Engine Evaluation Metrics Inaugural-Dissertation zur Erlangung des Doktorgrades der Philosophie (Dr. Phil.) durch die Philosophische ...»

-- [ Page 20 ] --

These evaluations seem to suggest that single result evaluations based on binary relevance perform worse than those based on six-point relevance. If the distinction between “relevant” and “non-relevant” is drawn in the middle, the decline in preference identification is moderate; if “relevant” is defined as “somehow relevant” or “highly relevant”, the losses in preference prediction quality become very large indeed.

If we confine the evaluation to informational queries, the picture stays the same. Figure 10.64 through Figure 10.67 show the graphs; the lines are somewhat more jumpy (probably simply due to the reduced dataset), but all the trends remain.

- 137 Figure 10.60. Reproduction of Figure 10.34 with a modified Y-scale. Inter-metric PIR comparison using the bestthreshold method and six-point relevance. NDCG (log2) and ESL (Rank, n=2.5) have the same values for cut-off ranks 3 to 8.

Figure 10.61.

Inter-metric PIR comparison using the best-threshold method and binary R23 relevance. NDCG (log2) and ESL (Rank, n=2.5) have the same values for cut-off ranks 3 to 9; ERR and ESL have exactly the same scores.

–  –  –

Figure 10.63.

Inter-metric PIR comparison using the best-threshold method and binary R21 relevance. ERR and ESL have exactly the same scores.

- 139 Figure 10.64. Reproduction of Figure 10.36 with a modified Y-scale. Inter-metric PIR comparison using the bestthreshold method and six-point relevance, for informational queries only. NDCG (log2) and ESL (Rank, n=2.5) have the same values for cut-off ranks 3 to 8.

Figure 10.65.

Inter-metric PIR comparison using the best-threshold method and binary R23 relevance for informational queries only.

–  –  –

Figure 10.67.

Inter-metric PIR comparison using the best-threshold method and binary R25 relevance for informational queries only.

- 141 The next step is to examine what happens if we use binary relevance judgments coming from users other than those providing preference ratings, thus adding the effects described in the previous section. The first graph in the next cluster (Figure 10.68) again shows the six-point relevance version, which is the reference point. 96 For the R23 binary relevance, the peak values are somewhat lower (around 0.77, down from 0.8); but not all metrics perform worse.

ERR scores improve a bit, and Precision now has a top score of 0.77 as early as cut-off rank

4. The largest overall decline can be seen in both MAP metrics, which perform worst.

The situation is different for R25 relevance (Figure 10.70). At higher ranks (mostly up to rank 6), the PIR scores are even lower than for R23, apart from the MAP scores which do not perform as bad. In fact, the no-discount MAP performs best at this stage, while NDCG does worst. At about cut-off rank 7, the scores for R23 and R25 are about similar; and after that, most R25 scores improve, while those for R23 stay approximately the same or decline. The PIR scores at cut-off rank 10 are similar in both cases, except for Precision (lower for R2 5) and ESL. By ranks 9 and 10, ESL reaches 0.8, which is also the peak value in the six-point relevance evaluation.

How does it come about that the results at cut-off ranks 7 and 8 seem to contribute more towards the correct preference predictions than those at ranks 4 and 5? As relevance is defined as “not totally irrelevant” for R25, the only possibility to distinguish between two result lists is a situation when one has more totally irrelevant results, or (depending on the discount function) the totally irrelevant results occur earlier. One possible interpretation is that, at earlier ranks, the weak relevance criteria used carry little distinguishing power. Almost any result may be regarded as “concerning the query to some extent”. After a while, either the number of totally irrelevant results changes (which might be true for the original ranking result lists – but hardly for randomized ones), or the raters’ standards rise with growing acquaintance with the topic at hand and information saturation. The latter explanation cannot be true in this case since we are dealing with relevance ratings obtained from users who were not exposed to any of the result lists; they saw only single results, in random order, and mostly only two or three results from each query. However, the detailed evaluation of relevance ratings in Section 9.4 (Figure 9.19 in particular) does show some patterns similar to the issue under consideration; it is around rank 7 when the relative frequency of results judged to be absolutely irrelevant stops rising. This means that from rank 7 on, the relevance scores of the original result lists stop declining towards the levels of the randomized result lists. This is but a weak connection, and might be just coincidence – but at the present time, I do not see a more vested explanation of the results.

Evaluations using the zero-threshold approach, or informational queries only, produce very similar results. The former have slightly lower but more volatile scores (Figure 10.72), while the scores in the latter are marginally higher (Figure 10.73). Graphs are shown for R23 only since the results for R25 and R21 have the same properties.





For better readability, the Y-scales of different-user rating evaluation graphs run not from 0.5 to 1, as in the same-user rating condition, but from 0.5 to 0.8. Since this time no metric exceeds the PIR score of 0.8, this does not constitute a problem.

–  –  –

Figure 10.69.

Inter-metric PIR comparison using the best-threshold method and binary R23 relevance. Result and preference ratings come from different users.

–  –  –

Figure 10.71.

Inter-metric PIR comparison using the best-threshold method and binary R21 relevance. Result and preference ratings come from different users.

–  –  –

Figure 10.73.

Inter-metric PIR comparison using the best-threshold method and binary R23 relevance for informational queries only. Result and preference ratings come from different users.

- 145 All in all, the results of this section’s evaluation suggest that metrics perform better when using six-point relevance rankings than when they use binary relevance. This is less of an issue when the relevance ratings and preference judgments come from different users (which is mostly the case in real-life evaluations), when the absolute PIR values are not that high to begin with. But even in this case, there are differences. If binary relevance is employed for whatever reasons, then it is worth the time to provide the raters with guidelines as to what constitutes relevance.

10.7.2 Three-point Relevance I now proceed to three-point relevance. As discussed above, the two possibilities of converting six-point relevance to three-point relevance are an equal division (1 and 2 as 1.0; 3 and 4 as 0.5; 5 and 6 as 0) or an extreme division (1 as 1.0; 2, 3, 4 and 5 as 0.5; 6 as 0). These possibilities will be examined analogous to those for binary relevance in the previous section.

The threshold-based evaluations do not show any marked differences. The rank-based NDCG scores serve as an example, but the situation is similar for other metrics and discounts. The graphs for R32 (Figure 10.74) and R31 (Figure 10.75) show some differences to each other and to the six-point relevance evaluation (Figure 10.5); but the differences are those of degree, not of kind, and consist largely of changes in absolute values, which I shall deal with immediately, in an inter-metric comparison. The same is true for discount function comparison, exemplified by NDCG in Figure 10.76 and Figure 10.77.

Figure 10.74.

NDCG PIR values without any discount for results at later ranks. R32 relevance (1 and 2 as relevant, 3 and 4 as partially relevant, 5 and 6 as non-relevant) is used.

–  –  –

Figure 10.76.

NDCG PIR scores for different discount functions, with the best-threshold approach. R32 relevance (1 and 2 as relevant, 3 and 4 as partially relevant, 5 and 6 as non-relevant) is used.

–  –  –

The small-scale issues taken care of, we can now look at the larger picture, that is, the intermetric PIR comparisons. Once again, I start with a reminder of what the original, six-point relevance evaluation looked like (Figure 10.78). The R32 relevance evaluation shown in Figure 10.79 shows some changes reminiscent of those we saw in the binary relevance evaluations. Most metrics show a more or less pronounced decline in PIR scores, in particular those that had performed best (ESL, NDCG, no-discount MAP). MRR and (at earlier results) rank-discount MAP show slightly higher scores than in the six-point relevance evaluation. As in the case of binary relevance, the improvements in the performance of MRR can be explained easily. The MRR score is determined by the first occurrence of a result which is not irrelevant. In the six-point relevance scale, this means any result with a rating from 1 to 5; in the R32 scale this changes to the equivalent of the six-point scale’s 1 to 4 ratings. Thus, the criteria are stricter, and the discriminatory power of MRR rises, albeit not to a point where it can outperform any other metric. Overall, the top PIR scores for R32 relevance are lower than those for six-point relevance by 0.03 to 0.05 points; not much, but still quite noticeable.

If we are looking for an evaluation which does perform much worse, we do not have to search for long; R31 relevance does the job quite nicely. As Figure 10.80 shows, almost all the scores are significantly lower than even in the R32 condition. The top PIR score (NDCG at rank 8) lies at 0.88, and the other metrics do not reach 0.85. Otherwise, the shape of the lines is quite similar to that of six-point evaluation, and (perhaps because of the broad band of “partly relevant” results) the graph is less erratic.

–  –  –

As the results for informational queries only are very similar to those for all queries, I proceed immediately to the next evaluation. Again, it will be the condition in which relevance ratings and preference judgments come from different users. Though the differences between metrics are lower than in the same-user condition, this time I will keep the Y-scale at PIR values 0.60 to 1, to show more clearly the differences between the two categories of evaluation.

Even the graph based on the six-point relevance evaluation (Figure 10.81) is much denser than any other previously shown in the present section. With all PIR values in the range from

0.66 to 0.80, it also gives an impression of more evenness. Again, NDCG performs best, followed by Precision and ESL, while ERR and rank-discounted MAP do worst; however, the differences between peak PIR values of the different metrics do not exceed 0.1. The discrepancies shrink further when we switch to the R32 evaluation (Figure 10.82); except for rank-discounted MAP, whose scores improve, the other metrics’ PIR mostly declines.

This is, however, not the case with R31 (Figure 10.83). Some metrics’ PIR scores did decline a bit, especially at earlier cut-off ranks; at cut-off rank 4, all the scores are almost equal. But starting with cut-off rank 6, all metrics’ performances (with the exception of Precision) actually improve. This situation is analogous to that in the R25 evaluation with different users providing relevance ratings and preference judgments (cp. Figure 10.70). An obvious common feature of the two evaluations is the use of only the lowest possible relevance rating as “completely non-relevant”. However, as has been argued in the discussion of the R25 graph, there is no clear reason why this should lead to an increase in PIR scores from this particular cut-off rank on. Thus, this issue remains a riddle for now.

–  –  –

Figure 10.82.

Inter-metric PIR comparison using the best-threshold method and three-point R32 relevance. Result and preference ratings come from different users.

–  –  –

As I have described in Part I, explicit metrics based on users’ relevance ratings of individual results are but one of the two major search engine evaluation methods. The other is based on the data provided more or less unwittingly by the users who go about their usual business of performing web searches. As mentioned in Section 9.1, some of that data has been gathered during the study. The important types of data, the ones that are going to be evaluated in this section, are session duration and clicks.

Most click metrics described in Section 5 are relative metrics; that is, they do not provide a numeric value for individual results or the result list as a whole, but rather indicate whether one result in the list is better than another. This can be a useful feature if one wants to improve a ranking algorithm, because if it works, it provides not only an indication that something is wrong, but also a concrete way of fixing the problem (for example, by swapping the two results’ positions). However, relative log metrics are not easy to evaluate, at least using the methods of this study. This methods do not show which of two given result lists is better; instead, they take as their input one result list, and give as their output an improved

version. Therefore, evaluating them would require three steps:

 letting users interact with a result list in the usual way;

 using the relative metric, constructing a second, presumably superior result list;

 asking (other) users which of the two result lists is preferable.

This method is not impracticable; but it requires an evaluation structure quite different from what is needed for the explicit (and log-based absolute) metrics which constitute the main body of this study. Therefore, the evaluation of relative, click-based metrics will be left to further studies to undertake.



Pages:     | 1 |   ...   | 18 | 19 || 21 | 22 |   ...   | 26 |


Similar works:

«LEAVING HOME IN LATE LIFE: VOLUNTARY HOUSING TRANSITIONS OF OLDER ADULTS AS GIFT GIVING PRACTICES IN THE MIDWESTERN UNITED STATES by Tam Elisabeth Perry A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Social Work and Anthropology) in the University of Michigan Doctoral Committee: Professor Ruth Dunkle, Co-Chair Professor Judith T. Irvine, Co-Chair Professor Berit Ingersoll-Dayton Professor Webb Keane Lecturer IV Holly Peters-Golden ©...»

«SOCIAL COGNITIVE INTENTIONS OF BECOMING AN ATHLETIC DIRECTOR: AN INVESTIGATION OF NON-TRADITIONAL SENIOR ATHLETIC ADMINISTRATORS By JANELLE E. WELLS A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA © 2012 Janelle E. Wells To all non-traditional leaders, Maya Angelou said it best “all great achievements require time.” Thank you for your perseverance, it serves...»

«Accidents The Dewey lectures are supposed to be autobiographical and also reflective about our profession hence, I guess the hope is, somewhat instructive. I don=t think my own history very instructive, but it did have some amusing twists, and I do have some concerns about the current state of the profession that I would like to share. So I thank the Dewey Society very much indeed for inviting me to give this lecture. It is an unexpected and a very pleasant honor! Current research has it that...»

«Ghent University Faculty of Arts and Philosophy FACT OR FICTION? Possible implications of choosing one subject over the other Paper submitted in partial fulfilment of the requirements for the degree of “Master in de Taal-en Letterkunde: Engels-Spaans” by Supervisor: Dr. Kate Macdonald May 2008 Katrien Persoons Preface This thesis came about in a rather unorthodox way. About five years ago, some time before the well-known films were released, I took a book from a shelf in my local library...»

«Queering Home: Domestic Space and Sexuality in Postmodern American Literature by Jason Bryant A Dissertation Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy Approved November 2013 by the Graduate Supervisory Committee: Deborah Clarke, Chair Robert Sturges Lee Bebout ARIZONA STATE UNIVERSITY December 2013 ABSTRACT This dissertation explores narratives of homosexuals and trans men and women occupying domestic spaces, discerning the ways that “home”...»

«K-12 Education as a Hermeneutic Adventurous Endeavor: Theorizing the Instrumental Approach, Education as a Sovereign Agent for Humanity, and the Relationships with Philosophy by Doron Yosef-Hassidim A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Department of Leadership, Higher and Adult Education Ontario Institute for Studies in Education University of Toronto © Copyright by Doron Yosef-Hassidim 2015 K-12 Education as a Hermeneutic Adventurous...»

«UNIVERSITY OF CALIFORNIA, IRVINE Evolution of Cooperation: Comparative Study of Kinship Behavior DISSERTATION submitted in partial satisfaction of the requirements for the degree of DOCTOR OF PHILOSOPHY in Mathematical Behavioral Science by ¨ Bahattin Tolga Oztan Dissertation Committee: Professor Douglas R. White, Chair Professor Cailin O’Connor Professor Louis Narens 2016 ¨ c 2016 Bahattin Tolga Oztan DEDICATION To Mom, Dad and Noyan Kalyoncu ii TABLE OF CONTENTS Page LIST OF FIGURES v...»

«“IN THEIR PLACE”: MARKING AND UNMARKING SHI’ISM IN PAHLAVI IRAN by Aaron Vahid Sealy A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (History) in The University of Michigan 2011 Doctoral Committee: Professor Juan R. Cole, Chair Professor Paul C. Johnson Associate Professor Kathryn Babayan Visiting Professor Gerard J. Libaridian Copyright © 2011 Aaron V. Sealy For my wife and kids ii Acknowledgements This dissertation would not...»

«The Diverse Geographies of Jewishness: Exploring the Intersections between Race, Religion, and Citizenship among Israeli Migrants in Toronto by Tamir Arviv A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Department of Geography and Planning University of Toronto © Copyright by Tamir Arviv 2016 ii The Diverse Geographies of Jewishness: Exploring the Intersections between Race, Religion, and Citizenship among Israeli Migrants in Toronto Tamir Arviv...»

«Syed ​Zafar​ ul Hussan ​Gilani William Gates Building, 15 JJ Thomson Ave, Cambridge CB3 0FD, UK +44 7459 789978 | ​Cambridge email​ | ​Personal email​ | ​Webpage​ | ​Github​ | ​LinkedIn​ | ​Google Scholar CURRENT RESEARCH INTERESTS: Distributed Systems, Social Network Analysis, Information Propagation/Dissemination EDUCATION: ● Aug 2017, University of Cambridge Doctor of Philosophy (Ph.D.) in Computer Science Dissertation: “Measuring and Modelling the impact of...»

«University of Alberta Cultural Identity and Transnational Networks in a Chinese Diaspora Society in Sibu, Sarawak, Malaysia by Hsu, Yu-tsuen A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree of Doctor of Philosophy Department of Anthropology ©Hsu, Yu-tsuen Fall 2012 Edmonton, Alberta Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such...»

«PREPARATION AND CHARACTERIZATION OF POLYMER COMPOSITES CONTAINING GOLD NANOPARTICLES a dissertation submitted to the department of chemistry and the graduate school of engineering and science of bilkent university in partial fulfillment of the requirements for the degree of doctor of philosophy By Eda Yılmaz September, 2011 I certify that I have read this thesis and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of doctor of philosophy. Prof....»





 
<<  HOME   |    CONTACTS
2016 www.dissertation.xlibx.info - Dissertations, online materials

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.