
The second category is the representation of the user’s problem; again, it is based on earlier research (Taylor 1967). The original form is called the Real Information Need (RIN): broadly speaking, the real-world problem which the user attempts to solve through his search. The user transforms this into a Personal Information Need (PIN), his perception of his information need. Then he formulates his PIN as a request (a verbal formulation of the PIN), before formalizing it into a query (a form that can be processed by the information retrieval system).

The third category is time. The user’s requirements may change over time, whether as a result of changes in the world and therefore in the RIN, or because of the progress of the user’s search itself. For example, the user may hit upon a promising branch of information and then wish to encounter more documents from this area.

The fourth category is constituted by what Mizzaro calls components. He identifies three of these: topic, “the subject area interesting for the user”; task, “the activity that the user will execute with the retrieved documents”; and context, “everything not pertaining to topic and task”. (Obviously, “physical entities” here include electronic entities.)


Figure 7.1. Mizzaro’s relevance model without the temporal dimension (Mizzaro 1998, p. 314).

For the purposes of this study, I propose some changes to this model. First, as mentioned above, some of its aspects are not appropriate for modern web search. Second, the present work concerns itself only with a certain part of web search evaluation, making it possible to omit parts of the model, while acknowledging their importance for other aspects of evaluation. To distinguish my model from others, I will call it the Web Evaluation Relevance model, or WER model for short.

First, let us address the question of representation. The possibility of differences between RIN and PIN is undeniable; as an extreme example, the user can simply misunderstand the task at hand. But while “find what I mean, not what I say” has been mentioned quite often (it refers to inducing the PIN from the query, and has even been the title of at least two articles: Feldman 2000; Lewandowski 2001), its equivalent for the RIN-PIN distinction, “find what I need, not what I want”, does not seem to be very relevant to the field of Information Retrieval (although Google, for one, has had a vacancy for an autocompleter position, with a “certificate in psychic reading strongly preferred”; Google 2011a; Google 2011b). Granted, there are cases when a user learns during his search session and recognizes that his information needs are different from what he assumed. But most of these lie outside the scope of this work. The easy cases might be caught by a spell checker or another suggestion system.

In difficult cases, the user might need to see the “wrong” results to realize his error, and thus those results will still be “right” for his RIN as well as his PIN. All in all, it seems justifiable in these circumstances to drop the distinction and to speak, quite generally, of a user’s information need (IN). The request, on the contrary, has its place in search evaluation. If the raters are not the initiators of the query, they might be provided with an explicit statement of the information need.


Next, let us consider the information resources. Surrogates and documents are obviously the most frequently used evaluation objects. Lately, however, there has been another type: the result set (studies directly measuring absolute or relative result set quality include Fox et al. 2005; Radlinski, Kurup and Joachims 2008). Different definitions of “result set” are possible; we shall use it to mean those surrogates and documents provided by a search engine in response to a query that have been examined, clicked or otherwise interacted with by the user. Usually, that means all surrogates up to the one viewed last,51 and all visited documents.

The evaluation of a result set as a whole might be modeled as the aggregated evaluation of surrogates as well as documents within their appropriate contexts; but, given the number of often quite different methods for obtaining set judgments from individual results, the process is nothing like straightforward. Thus, for our purposes, we shall view set evaluation as separate from that of other resources, while keeping in mind the prospect of equating it with a surrogate/document metric, should one be shown to measure the same qualities. It is clear that the user is interested in his overall search experience, meaning that set evaluation is the most user-centered type. Surrogates and documents, however, cannot be easily ordered by their closeness to the actual user interest. While the user usually derives the information he needs from documents, for simple, fact-based informational queries (such as “length of Nile”) all the required data may be available in the snippet. Furthermore, the user only views documents whose surrogates he has already examined and found promising. Therefore, I will regard surrogates and documents as equidistant from the user’s interests.
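The aggregation of individual surrogate or document judgments into a set judgment can take many forms. As a minimal illustration (not a method proposed in this work), one might weight per-result judgments by a rank discount; the discount function and the 0-to-1 judgment scale are assumptions made purely for the sketch:

```python
def set_score(judgments, discount=lambda rank: 1.0 / rank):
    """Aggregate per-result relevance judgments (given in rank order,
    each between 0 and 1) into a single result-set score.

    This is only one of many conceivable aggregation schemes; the
    reciprocal-rank discount is an illustrative assumption, not part
    of the WER model.
    """
    if not judgments:
        return 0.0
    weights = [discount(rank) for rank in range(1, len(judgments) + 1)]
    return sum(w * j for w, j in zip(weights, judgments)) / sum(weights)

# A set whose good results come first scores higher than one where
# the same good results come last.
front_loaded = set_score([1.0, 1.0, 0.0, 0.0])
back_loaded = set_score([0.0, 0.0, 1.0, 1.0])
```

Even this toy example shows why the choice of aggregation matters: a different discount function would order the same result sets differently, which is precisely why set evaluation is treated here as separate from surrogate and document evaluation.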

I do not distinguish between components. Topic and task are parts of the user’s information need, and as such are already covered by the model.52 I also do not use context in the way Mizzaro did. Instead, I combine it with the time dimension to produce quite another context, which I will use to mean any change in the user’s searching behavior as a result of his interaction with the search engine. In the simplest form, this might just mean the lesser usefulness of results at later ranks, or of information already encountered during the session. On a more complex level, it can signify anything from the user finding an interesting subtopic and desiring to learn more about it, to a re-formulation of the query, to better surrogate evaluation by the user caused by an increased acquaintance with the topic or with the search engine’s summarization technique.

51 The result list is normally examined from the top down, in a linear fashion (Joachims et al. 2005).

52 One might reverse the reasoning and define the information need as the complex of all conceivable components. There may be many of these, such as the desired file type of the document, its language, and so forth. These are all considered part of the user’s information need.

Figure 7.2. The Web Evaluation Relevance (WER) model (one axis runs from “No context” to “Context”). The point at the intersection of the three axes indicates maximal user-centricity.

Using this model, we can now return to the metrics described above and categorize them.

Precision, for example, is document-based (though a surrogate precision is also conceivable) and context-free; the representation type may be anything from a query to the information need, depending on whether the originator of the query is identical with the assessor. MAP is similar, but allows for a small amount of context: it discounts later results, reflecting the finding that users find those less useful, if they find them at all. Click-based measures, on the other hand, are based on the information need and incorporate the entire context, but regard only the surrogate (since the relevance of the document itself is not assessed).
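The contrast can be made concrete. A minimal sketch of precision@k and per-query average precision (AP, whose mean over a set of queries is MAP) shows how AP’s rank discount introduces a little context, while precision@k ignores result order entirely; binary relevance judgments are assumed:

```python
def precision_at_k(rels, k):
    """Fraction of the top-k results judged relevant (binary judgments,
    listed in rank order)."""
    return sum(rels[:k]) / k

def average_precision(rels):
    """Mean of precision@rank over the ranks at which a relevant result
    appears. Relevant results at later ranks contribute less, which is
    the small amount of 'context' built into MAP."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Two rankings with the same relevant documents in different positions:
early = [1, 1, 0, 0]   # relevant results first
late = [0, 0, 1, 1]    # relevant results last
# precision@4 is identical for both (0.5), but AP rewards the earlier
# placement: AP(early) = 1.0, AP(late) = (1/3 + 2/4) / 2 ≈ 0.417.
```

This is why, in the terms of the WER model, plain precision sits at the context-free end of the context axis while MAP sits slightly away from it.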

What is the WER model good for? We can postulate that the more user-centric metrics are better at discerning a user’s actual search experience. For me, it stands to reason that a user’s satisfaction with a result list as a whole provides a more appropriate view of his web search session than his satisfaction with individual results; or that a result list should be, if possible, evaluated with regard to the information need rather than the query itself.53 Similarly, metrics that consider more aspects of context might be supposed to be better. However, this last assertion is more problematic. We can hypothesize about the superiority of some context-sensitive user model, but, as the variety of user models employed in evaluations suggests, they cannot all be appropriate in all circumstances. Therefore, this is another issue awaiting testing: do metrics which employ a more user-centric model of relevance generally perform better than those that do not?

53 Of course, there are cases when a query is obviously and blatantly inappropriate for an information need (e.g. a “me” search to find information about oneself). But these queries do occur, and all search algorithms are in the same situation with regard to them.

The discussions in the earlier sections indicate that user-based measures can generally be expected to be closer to the users’ real satisfaction than system-based ones. It is not very hard to obtain very user-centered measures; all one has to do is ask the user about his satisfaction after a search session. However, this metric has one very important disadvantage: it is hard to use to improve the search engine, which, after all, is what evaluation should be about. If a user states he is unsatisfied or hardly satisfied with a search session, nothing is learned about the reasons for this dissatisfaction. One possible way to overcome this problem is to submit the ratings, along with a list of features for every rated result list, to an algorithm (e.g. a Bayesian model) which will try to discern the key features connected to the user’s satisfaction. However, this seems too complex a task, given the amount of data that can possibly influence the results; and we do not know of any attempts to derive concrete proposals for improvement from explicit set-based results alone.
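As a much-simplified stand-in for such an algorithm (the feature names, the toy data, and the difference-of-means heuristic are purely illustrative assumptions, not a technique from the literature), one could compare feature averages between satisfied and unsatisfied sessions:

```python
from collections import defaultdict

def feature_signals(sessions):
    """Given rated sessions as (features_dict, satisfied: bool) pairs,
    compute for each feature the difference between its mean value in
    satisfied and in unsatisfied sessions. A crude stand-in for the
    Bayesian model mentioned in the text, meant only to illustrate
    tracing satisfaction ratings back to result-list features."""
    # per feature: [satisfied sum, satisfied count, unsatisfied sum, unsatisfied count]
    sums = defaultdict(lambda: [0.0, 0, 0.0, 0])
    for features, satisfied in sessions:
        for name, value in features.items():
            s = sums[name]
            if satisfied:
                s[0] += value; s[1] += 1
            else:
                s[2] += value; s[3] += 1
    return {name: (s[0] / s[1] if s[1] else 0.0) - (s[2] / s[3] if s[3] else 0.0)
            for name, s in sums.items()}

# Hypothetical data: sessions with fresher results tend to satisfy,
# sessions with more duplicates tend not to.
sessions = [
    ({"freshness": 0.9, "duplicates": 0.1}, True),
    ({"freshness": 0.8, "duplicates": 0.3}, True),
    ({"freshness": 0.2, "duplicates": 0.6}, False),
]
signals = feature_signals(sessions)
```

A positive signal marks a feature associated with satisfaction, a negative one with dissatisfaction. The sketch also makes the objection from the text tangible: with realistically many features and few ratings, such differences become too noisy to yield concrete proposals for improvement.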

Document-based measures have, in some ways, the opposite problems. They are “recomposable under ranking function changes” (Ali, Chang and Juan 2005, p. 365); that is, it can be directly calculated how the metric results will change if the retrieval mechanism is adjusted in a certain way. However, this comes at a price. As factors like already encountered information or, for that matter, any distinction on the context dimension are generally not accounted for in a document-based metric, its correlation with user satisfaction is by no means self-evident. Of course, the distinction is not binary but gradual; as metrics become less user-centered, they become more usable for practical purposes, but presumably less reliably correlated with actual user satisfaction, and vice versa. This is one of the reasons why understanding the connections between different measures is so important: if we had a user-centered and a system-centered measure with a high correlation, we could use them to predict, with high confidence, the effect of changes made to a retrieval algorithm.
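Recomposability can be illustrated with precision@k: once per-document judgments are stored, the score of any new ranking follows by direct computation, with no new user study. The document ids and judgments here are hypothetical:

```python
def precision_at_k(judged, ranking, k=5):
    """Score a ranking against stored per-document binary judgments.
    Unjudged documents count as non-relevant (a simplifying assumption)."""
    return sum(judged.get(doc, 0) for doc in ranking[:k]) / k

# Stored judgments from an earlier assessment round (hypothetical).
judged = {"d1": 1, "d2": 0, "d3": 1, "d4": 1, "d5": 0, "d6": 1}

old_ranking = ["d1", "d2", "d3", "d4", "d5"]
new_ranking = ["d1", "d3", "d4", "d6", "d2"]  # after adjusting the retrieval function

# The effect of the adjustment is known immediately: 3/5 -> 4/5.
```

A click-based or session-satisfaction measure offers no such shortcut: the new ranking would have to be shown to users before its score could be known, which is exactly the practical trade-off described above.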

An additional complication is introduced by the abstracts provided by the search engines on the result page. The user obviously only clicks on a result if he considers the abstract to indicate a useful document. Therefore, even the best document-based metric is useless if we do not know whether the user will see the document at all. This might suggest that, for a complete evaluation of a search engine to be both strongly related to real user satisfaction and helpful for practical purposes, all types should be used. The evaluation of user behavior confirms this notion. “User data […] shows that 14% of highly relevant and 31% of relevant documents are never examined because their summary is judged irrelevant. Given that most modern search engines display some sort of summary to the user, it seems unrealistic to judge system performance based on experiments using document relevance judgments alone. Our re-evaluation of TREC data confirms that systems rankings alter when summary relevance judgments are added” (Turpin et al. 2009, p. 513).


8.1 Evaluation Criteria

As discussed in the beginning of this work, how to judge evaluation metrics is an intricate question which is nevertheless (or rather: precisely for that reason) vital to the whole process of evaluation. To get a meaningful and quantifiable answer, one has to ask the question with more precision.

Mostly, studies are not interested in an absolute quality value of a result list. Rather, the relevant question tends to be whether one result list has a higher quality than another. The question can be asked to compare different search engines; to see whether a new algorithm actually improves the user experience; or to compare a new development against a baseline. The common feature is the comparison of the quality of two result lists.

But how to get such a judgment? I opted for the direct approach of asking the users. This has some disadvantages; for example, the user might not know what he is missing, or might misjudge the quality of the results. These problems reflect the difference between real and perceived information need (as discussed in detail by Mizzaro (1998)). However, as discussed in Chapter 7, a judgment based on a real information need (RIN) would be more problematic.

