FREE ELECTRONIC LIBRARY - Dissertations, online materials

Pages:     | 1 |   ...   | 6 | 7 || 9 | 10 |   ...   | 26 |

«On Search Engine Evaluation Metrics Inaugural-Dissertation zur Erlangung des Doktorgrades der Philosophie (Dr. Phil.) durch die Philosophische ...»

-- [ Page 8 ] --

A major problem with the cited experimental confirmations of log data evaluation is that the confirmation comes from their correlation with measures based on explicit document relevance rankings, whose significance, in turn, is often based on not very persuasive evidence; indeed, sometimes their vindication comes from log data evaluations. Strictly speaking, this would imply only that explicit and log-based ratings correlate with each other, but not necessary with any satisfaction levels on the user side. The claim that implicit and explicit measures do not reflect any user preferences is disproportionate; but the point that there is hardly any validation of their significance outside this vicious circle remains. Another strategy is to compare explicit metrics among each other (e.g. Kekäläinen 2005; Della Mea et al. 2006; Sakai 2007b); I consider this to be valuable, once the validity of the primary metric has been established.

There have been some attempts to break this circle. One interesting method is to create two result lists so that one of them is deemed to be better than the other “by construction” (Radlinski, Kurup and Joachims 2008). In one of the conditions of this study, the authors implemented the “good” result list by calculating the cosine similarity between the query and multiple fields (e.g. title, abstract and full text) of the arxiv.org database of scientific articles.

A “worse” result list was created by removing from the score the content of record fields considered to be important, such as title and abstract. Those were still included in the full text and thus matched; but in the “better” list, the inclusion of further fields increased the match count for every word in title and abstract. A “worst” list was created by randomizing the top 11 results of the “worse” list. In a second condition, the “worse” list was constructed from the same “good” list by randomly swapping two of the results in positions 2-6 with two in the positions 7-11. In the “worst” list, four instead of two results were swapped. One of those lists were randomly shown to users of a search engine implemented for the very purpose, and the data from their interaction with it recorded. The data was evaluated using “absolute metrics”, that is, metrics attempting to put a concrete number on a user’s satisfaction with the result list.

Among this metrics were e.g. the Abandonment Rate (queries without clicks on any result), Clicks per Query, and the Mean Reciprocal Rank (one over the mean rank of clicked results).

The authors formulated expectations for the measures’ behavior when confronted with a worse result list; for example, the Clicks per Query and the Mean Reciprocal Rank were expected to decline. In each of the two conditions, there are three result lists which can be compared to each other; thus, the overall number of comparisons is six. The results are given in Table 6.1; it is obvious that, while the metrics tend to agree with the hypothesized quality difference more than contradict them, the results are by no means conclusive. This is especially true if only statistically significant differences (right column) are considered; of 48 comparisons, only 10 do significantly correlate with the assumed difference, while one contradicts it. The authors draw the conclusion that, while the problems may be “due to

–  –  –

Table 6.1.

Absolute metric results and their significance with check mark meaning agreement and danger symbol disagreement with hypothesized quality (Radlinski, Kurup and Joachims 2008).

In the next step, the three result lists of a condition were shuffled into one using a novel method (which, ingenious as it is, does not directly concern us here) so that the resulting result list contains at any point almost an equal number of results from each list in the original order. If we assume that the three lists were completely different (which they were not) and call them A, B and C, than the combined result list might look like A1, B1, C1, C2, B2, A2, A3, B3, C3… As we further assume that users scan the list progressively from top to bottom, a click on C1 without one on A1 or B1 would indicate that the user has seen the first two results but has not found them relevant, in contrast to the third. From this we can in turn follow that the first result of list C is more relevant than those of the other lists. This combined list was also shown to users of the same search engine, with results which were much more encouraging. Without delving too deep into the methodology, the results from better-conceived results lists were preferred in every statistic, and this preference was statistically significant in over 80% of evaluations. The authors conclude that, in contrast to absolute metrics, this so-called “paired comparison tests” deliver stable results and are well suitable for establishing the preferences even of relatively small user groups.

I have described this study in some detail because it addresses a central question of the present work, namely, the validity of evaluation metrics, and, furthermore, it does so in a novel and promising way. Despite, or, perhaps, precisely because of that, it is important to realize that the study is not as conclusive as it seems, and at almost every step objections can be made to the interpretation of its findings. The problems start at the very onset, when two result list triples are constructed. In each triple, there is supposed to be a significant difference in retrieval quality. But does this really follow from the construction of the result lists? The “good” list is the result of a relatively simple cosine similarity measure. We do not know much about the usual retrieval behavior of users who made up the test group; but for many queries, such a ranking algorithm that does not include, for example, any citation data (which a major academic search engine like Google Scholar does quite heavily), may provide a result list where the average decline of result quality from rank to rank may be very shallow. If the relevance of the 10th-ranked result is, on average, indeed not much lower than that of the first

- 44 one, then a slight decrease of the importance of some record fields like title and abstract might

not lead to a significant decline in perceived retrieval quality. And even a completely random re-scrambling of the first 11 results might not produce large distortions if they are of close enough quality to start with.39 Of course, that would also hold for the “swap 2” and “swap 4” lists of the second triple. If we consider what might be termed the bibliographic equivalent of navigational queries, that is, search for a concrete document, the situation becomes still more complicated. If the query is not precise enough, the document may land anywhere in the result list, and the “worsening” of the list (which, as we recall, consists of shuffling some or all of the results) may move it up or down. If, on the other hand, the query does precisely identify the document, it is likely to land on top of the “good” result list. It is also likely to land on top of the “worse” result list in the first condition, and of both “worse” and “worst” lists in the second condition, as the top result is never shuffled. Of course, there can be queries which are somewhere between those two states. The point is that the quality difference between the lists might not be as self-evidently significant as suggested by the authors. The statement that “[i]n fact, our subjective impression is that these three ranking functions deliver substantially different retrieval quality […] and any suitable evaluation method should be able to detect this difference” (Radlinski, Kurup and Joachims 2008, p. 44) stands, and, indeed, falls with its first part.

More doubt awaits us at the evaluation of absolute measures. One issue follows directly from the above argument; if the differences in ranking quality are insignificant, than insignificant metric differences may still reflect the situation correctly. The evaluation might also be taken to support the thesis that the “good” result lists are not that good to begin with. The Abandonment Rate for these result lists lies at 68 to 70%; that is, over two thirds of searches with the retrieval quality assumed to be superior do not lead the user to any useful results.40 A second problem is that once again, the assumptions made by the authors may, but need not be correct. The hypothesized metric changes from one result list to another are founded in sensible logical arguments; but sensible logical arguments could also be made for changes in the other direction. Indeed, the authors inadvertently illustrate this ambiguity in the brief motivations they provide for their hypotheses. For Clicks per Query, the prediction is “Decrease (fewer relevant results)”; for Mean Reciprocal Rank, it is “Decrease (more need for many clicks)” (Radlinski, Kurup and Joachims 2008, p. 46), which, of course, implies an increase in Clicks per Query.

These problems should make us view the final evaluation, that involving “paired comparison tests”, with caution. Again, one issue stems from the study’s difficulties in creating result lists which significantly differ in retrieval quality a priori. The evaluation undoubtedly shows Note that the result quality at any rank is always meant to be averaged over all queries; if a ranking algorithm is not very effective, the relevance of results at any rank will be virtually unpredictable for any single query.

There seems to be some error in the data, which may or may not weaken this line of reasoning. The authors state that the search engine received ca. 700 queries per day, with about 600 clicks on results; this would imply an overall click rate of ca. 0.86. The tables, however, give us an Abandonment Rate of 0.68 to 0.70; and the remaining queries where at least one result was selected (ca. 250 of them) had a click rate of about 0.72. Now,

0.72 is by itself significantly lower than 0.86; but if the queries with no clicks were indeed excluded as stated by the authors, the Clicks per Query metric cannot be lower than 1. Furthermore, 250 queries with a click rate of

0.72 would mean around 180 clicks overall, not 600.

- 45 significant correlation of the measure with result list types; but we cannot be certain this correlation is with retrieval quality. Instead, it might correlate with one of a number of other properties, such as the appearance of search keywords in the title. 41 A further problem is the method of assembling a joint result list. As the results from two separate lists are rearranged, de-duplicated, and re-joined, the original ordering and half of the results are lost for every result page. Thus, even if the comparison tests can be used to judge which list has more relevant results,42 they may well be unable to compare the usefulness of the two original result lists as cohesive, interrelated units. By neglecting the structure of the result list, the evaluation method turns away from user models which can be seen as one of main advantages of log data evaluation.

It should be stated once more that the study by Radlinski, Kurup et al. is a promising contribution, and I do not intend to claim its results are wrong. They are, however, based on premises that are less self-evident than the authors assume, and therefore may be not very conclusive.

Another study that addresses, albeit indirectly, the connection between explicit and implicit measures, has been conducted at Microsoft (Dou et al. 2008). The authors once again try to determine relative judgments from click-through data, but their method is slightly different.

Instead of calculating the preferences for every session, the clicks on every document are aggregated for every query, and only then the preference is established. That is, if a query q is encountered 10 times in the log data, and document d1 is clicked 4 times, while document d2 is clicked once, a preference for d1 over d2 is assumed.

In a first step, the authors compare such preferences derived from click data with explicit judgments. The latter are graded ratings on a 0..4 scale. However, the prediction accuracy of the implicit judgments is determined by comparing simple preferences, without regard to the actual difference in rating. Still, the results are not very encouraging; the Kendall tau-b measure (Kendall and Smith 1938) shows a correlation of 0.36 in the test condition. This includes pairs where one document received some clicks, while the other was not clicked at all; if those are excluded, the correlation drops to 0.19. Neither condition includes statistics for pairs where none of the documents receive clicks; those obviously do not provide means for determining preferences, and would lower the correlation still further. There are multiple possible reasons for the results. As the authors note, their method is very simple and disregards many known issues such as position bias (Joachims et al. 2005). A more sophisticated approach might bring better results. Another explanation might be a generally low predictive power of click data; a third is that implicit judgments are relevant, but the explicit judgments they are compared to are not.

Remember that clicks on a result mean not positive feedback on the document but positive feedback on its representation in the result list. The “good” retrieval function with its focus on matching query keywords with title and abstract can be seen as increasing not necessary the quality of retrieved documents, but the attractiveness of the result list.

That is, either more results which are relevant or results which are more relevant. Clicks provide a binary distinction at best, so we don’t know which of these are being measured.

- 46 Figure 6.1. The correlation between preferences as determined by CT-Gn and explicitly stated preferences for different values of n (from Dou et al. 2008).

Pages:     | 1 |   ...   | 6 | 7 || 9 | 10 |   ...   | 26 |

Similar works:

«The Class of 2014 Doctor of Philosophy Degrees College of Engineering and Mines D r. D o u g l a s J. G o e r i n g, D e a n Liangbiao Chen ** Ph.D. Engineering: Mechanical Engineering B.C.E., Shaghai Jiao Tong University, 2006; M.C.E., University of Alaska Fairbanks, 2011. Thesis: Using the Generalized Interpolation Material Point Method for FluidSolid Interactions Induced by Surface Tension Numerical algorithms were developed to handle surface tension and contact (wetting) in fluid-solid...»

«KRITIKE An Online Journal of Philosophy Volume 10, Number 1 June 2016 ISSN 1908-7330 KRITIKE An Online Journal of Philosophy Volume 10, Number 1 June 2016 ISSN 1908-7330 THE DEPARTMENT OF PHILOSOPHY University of Santo Tomas Philippine Commission on Higher Education COPYRIGHTS All materials published by KRITIKE are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License KRITIKE supports the Open Access Movement. The copyright of an article published by the journal...»

«Expected Firm Performance and IPO Price Formation by Bradley Eric Hendricks A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Business Administration) in the University of Michigan 2015 Doctoral committee: Professor Gregory S. Miller, Chair Professor Amy K. Dittmar Associate Professor Yusuf C. Masatlioglu Associate Professor Catherine Shakespeare Assistant Professor Christopher D. Williams © Bradley Eric Hendricks 2015 Acknowledgements...»

«Justification, Reasons and Truth By Ian Paul Schnee A dissertation submitted in partial satisfaction of the Requirements of the degree of Doctor of Philosophy in Philosophy in the Graduate Division of the University of California, Berkeley Committee in charge: Professor Barry Stroud, Chair Professor Niko Kolodny Professor Anthony A. Long Fall 2010 © 2010 Copyright Ian Paul Schnee All rights reserved Abstract Justification, Reasons and Truth By Ian Paul Schnee Doctor of Philosophy in Philosophy...»

«Page 163 Samuel Beckett and Poland 10 Marek Kedzierski ˛ I Concerning Beckett, my attitude is rather negative. One of the tendencies of the capitalist world leans undoubtedly towards the complete alienation of man. For Beckett, it is the fundamental tendency against which no resistance is possible. On this basis he makes formal experiments. He reminds me of determinism and naturalism [...] He belongs to those who isolate man from the rest, as the young Maeterlinck. This severe judgement by...»

 Entertainment-Education and the Fotonovela: A New Strategy to Increase Help-Seeking Behaviors for Depression among Immigrant Latinas By Maria Yesenia Hernandez A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Social Welfare in the Graduate Division of the University of California, Berkeley Committee in charge: Professor Kurt C. Organista, Chair Assistant Professor Adrian Aguilera Assistant Professor Qing Zhou Fall 2012 

«RVP Newsletter — Winter 2012/Spring 2013 Message from the President: The work of the Council for Research in Values and Philosophy (RVP) has been advancing rapidly with the needs of the times. You will find greater detail on the following pages, but below is a summary of some key RVP efforts and their themes. We find ourselves at a point of major transition from a world order based upon separate and competing nations to a global order in which modes of cooperation become newly possible and...»

«Border Partial Citizenship, Border Towns, and Thai-Myanmar Cross-Border Development: Case Studies at the Thai Border Towns by Pitch Pongsawat B.A. (Chulalongkorn University), 1991 M.Phil. (University of Cambridge), 1995 A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in City and Regional Planning in the GRADUATE DIVISION of the UNIVERSITY OF CALIFORNIA, BERKELEY Committee in charge: Professor AnnaLee Saxenian, Chair Professor David...»

«FEMALE EDUCATION IN THE LATE SIXTEENTH AND EARLY SEVENTEENTH CENTURIES IN ENGLAND AND WALES: A STUDY OF ATTITUDES AND PRACTI CE Thesis submitted for the degree of Doctor of Philosophy: January 1996 Caroline Mary Kynaston Bowden Institute of Education, University of London \ Abstract OF THESIS This thesis provides a study of attitudes and practice in respect of female education in England and Wales in the late sixteenth and early seventeenth centuries. It begins with a review of primary and...»

«Development of complex sound representations in the primary auditory cortex by Michele Nerissa Insanally A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Neuroscience in the Graduate Division of the University of California, Berkeley Committee in charge: Professor Shaowen Bao, PhD., Chair Professor Mu-ming Poo Professor Daniel Feldman Professor Keith Johnson Fall 2011 Copyright 2011 Michele Nerissa Insanally All rights reserved....»

«A COLD OF THE HEART: JAPAN STRIVES TO NORMALIZE DEPRESSION by George Kendall Vickery BA, College of William and Mary, 1986 Submitted to the Graduate Faculty of Arts and Sciences in partial fulfillment of the requirements for the degree of Doctor of Philosophy University of Pittsburgh 2005 UNIVERSITY OF PITTSBURGH FACULTY OF ARTS AND SCIENCES This dissertation was presented by George Kendall Vickery It was defended on August 1, 2005 and approved by Dr. Joseph S. Alter Dr. Richard Scaglion Dr....»

«Student Handbook College of Urban and Public Affairs Nohad A. Toulan School of Urban Studies and Planning Doctor of Philosophy: Urban Studies (Ph.D.) Doctor of Philosophy: Urban Studies: Regional Science (Ph.D.) Academic Year 2012-2013 For additional information, contact: Nohad A. Toulan School of Urban Studies and Planning Urban Center Building Room 350 503-725-5477 susp@pdx.edu Latest Revision Date: 10/17/2012 Urban Studies Ph.D. Handbook 2012 Page 2 I. CONTENTS College and School Overview I....»

<<  HOME   |    CONTACTS
2016 www.dissertation.xlibx.info - Dissertations, online materials

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.