
«On Search Engine Evaluation Metrics Inaugural-Dissertation zur Erlangung des Doktorgrades der Philosophie (Dr. Phil.) durch die Philosophische ...»


Many major search engines offer other types of results than the classical web page list – images, videos, news, maps and so forth. However, the present work only deals with the document lists. As far as I know, there have been no attempts to provide a single framework for evaluating all aspects of a search engine, and such an approach seems unfeasible. There are also other features I will have to ignore to present a concise case, for example spellcheckers and query suggestions, question-answering features, relevant advertising etc. I do not attempt to discuss the usefulness of the whole output a search engine is capable of; when talking of “result lists”, I will refer only to the organic document listing that is still the search engines’ stock-in-trade.

This is as much depth as we need to consider search evaluation and its validity; for a real introduction to search engines, I recommend the excellent textbook “Search Engines: Information Retrieval in Practice” (Croft, Metzler and Strohman 2010).

2.2 Search Engine Usage

For any evaluation of search engine performance it is crucial to understand user behavior. This is true at the experimental design stage, when it is important to formulate a study object which corresponds to something the users do out there in the real world, as well as in the evaluation process itself, when one needs to know how to interpret the gathered data.

Therefore, I will present some findings that attempt to shed light on the question of actual search engine usage. This data comes mainly from two sources: log data evaluation (log data is discussed in Chapter 5) and eye-tracking studies. Most other laboratory experiment types, as well as questionnaires, interviews, and other explicit elicitations, have rarely been used in recent years.[12] Unfortunately, in recent years search engine logs have not been made available to the scientific community as often as ten years ago. The matter of privacy alone, which is in any case hotly debated by the interested public, surely is enough to make a search engine provider think twice before releasing potentially sensitive data.[13]

What, then, does this data show? The users’ behavior is unlike that experienced in classical information retrieval systems, where the search was mostly performed by intermediaries or other information professionals skilled at retrieval tasks and familiar with the database, query language and so forth. Web search engines aim at the average internet user, or, more precisely, at any internet user, whether new to the web or a seasoned Usenet veteran.

[12] This is probably due to multiple reasons: the availability (at least for a while) of large amounts of log data; the difficulty in inferring the behaviour of millions of users from a few test persons (who are rarely representative of the average surfer); and the realization that explicit questions may induce the test person to behave unnaturally.

[13] The best-known case is probably the release by AOL for research purposes of anonymized log data from 20 million queries made by about 650,000 subscribers. Despite the anonymization, some users were soon identified (Barbaro and Zeller Jr. 2006). Its attempt at public-mindedness brought AOL not only an entry in CNN’s “101 Dumbest Moments in Business” (Horowitz et al. 2007), but also a Federal Trade Commission complaint by the Electronic Frontier Foundation (2006) and a class-action lawsuit which was settled out of court for an undisclosed amount (Berman DeValerio 2010).

For our purposes, a search session starts with the user submitting a query. It usually contains two to three terms (see also Figure 2.2), and query length has been slowly increasing during the time web search has been observed (Jansen and Spink 2006; Yandex 2008). The average query has no operators; the only one used with any frequency is the phrase operator (most major engines use quotation marks for that), which occurs in under 2% of the queries, with 9% of users issuing at least one query containing operators (White and Morris 2007).[14]

The queries are traditionally divided into navigational (the user looks for a specific web page known or supposed to exist), transactional (not surprisingly, in this case the user looks to perform a transaction like buying or downloading) and informational, which should need no further explanation (Broder 2002). While this division is by no means final, and will be modified after an encounter with the current state of affairs in Section 9.2, it is undoubtedly useful (and widely used). Broder’s estimate of the query type distribution is that 20-25% are navigational, 39-48% are informational, and 30-36% are transactional. A large-scale study with automatic classification found a much larger share of informational queries (around 80%), while navigational and transactional queries were at about 10% each (Jansen, Booth and Spink 2007).[15] Another study with similar categories (Rose and Levinson 2004) also found informational queries to be more frequent than originally assumed by Broder (62-63%), and the other two types less frequent (navigational queries 12-15%, transactional queries 22[...]

Figure 2.2. Query length distribution (from Yandex 2008, p. 2). Note that the data is for Russian searches.
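The two operator figures cited above (under 2% of queries, but 9% of users) are counted at different levels: one per query, the other per user over the whole observation period. The following minimal sketch shows how both shares could be derived from a query log. The log format (pairs of user ID and query string), the function name `operator_rates`, and the toy entries are assumptions made purely for illustration; they are not taken from White and Morris (2007), and only the phrase operator is detected.

```python
def operator_rates(log):
    """Return (share of queries with a phrase operator,
    share of users who issued at least one such query).

    `log` is a hypothetical list of (user_id, query) pairs.
    """
    total_queries = 0
    queries_with_op = 0
    users = set()
    users_with_op = set()
    for user_id, query in log:
        total_queries += 1
        users.add(user_id)
        # A phrase operator needs an opening and a closing quotation mark.
        if query.count('"') >= 2:
            queries_with_op += 1
            users_with_op.add(user_id)
    if total_queries == 0:
        return 0.0, 0.0
    return queries_with_op / total_queries, len(users_with_op) / len(users)


# Invented toy log, not real data.
toy_log = [
    ("u1", "weather berlin"),
    ("u1", '"to be or not to be" author'),
    ("u1", "cheap flights"),
    ("u1", "train timetable"),
    ("u2", "python tutorial"),
    ("u2", "news"),
    ("u3", "pizza recipe"),
    ("u3", "football results"),
]
query_share, user_share = operator_rates(toy_log)
print(f"queries with operator: {query_share:.1%}")
print(f"users with operator:   {user_share:.1%}")
```

Because a user counts as soon as a single one of his queries contains an operator, the user-level share is never lower than it would be if every user issued only one query, and it tends to grow with the length of the observation period.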

[14] For the 13 weeks during which the data was collected. A study of log data collected 7-10 years earlier had significantly higher numbers of queries using operators, though they were still under 10% (Spink and Jansen 2004).

[15] While the study was published in 2007, the log data it is based on is at least five years older, and partly originates in the late 1990s.

[16] Note that all cited statistics are for English-speaking users. The picture may or may not be different for other languages; a study using data from 2004 showed no fundamental differences in queries originating from Greece (Efthimiadis 2008). A study of Russian log data additionally showed 4% of queries containing a full web address, which would be sufficient to reach the page via the browser’s address bar instead of the search engine. It also stated that about 15% of all queries contained some type of mistake, with about 10% being typos (Yandex 2008).

Once the user submits his query, he is presented with the result page. Apart from advertising, “universal search” features and other elements mentioned in Section 2.1, it contains a result list with usually ten results. Each result consists of three parts: a page title taken directly from the page’s title tag, a so-called “snippet” which is a query-dependent extract from the page, and a URL pointing to the page itself. When the result page appears, the user tends to start by scanning the first-ranked result; if this seems promising, the link is clicked and the user (at least temporarily) leaves the search engine and its hospitable result page. If the first snippet seems irrelevant, or if the user returns for more information, he proceeds to the second snippet, which he processes in the same way. Generally, if the user selects a result at all, the first click falls within the first three or four results. The time before this first click is, on average, about 6.5 seconds; and in this time, the user catches at least a glimpse of about 4 results (Hotchkiss, Alston and Edwards 2005).[17] As Thomas and Hawking note, “a quick scan of part of a result set is often enough to judge its utility for the task at hand. Unlike relevance assessors, searchers very seldom read all the documents retrieved for them by a search engine” (Thomas and Hawking 2006, p. 95). There is a sharp drop in user attention after rank 6 or 7; this is caused by the fact that only these results are directly visible on the result page without scrolling.[18] The remaining results up to the end of the page receive approximately equal amounts of attention (Granka, Joachims and Gay 2004; Joachims et al. 2005). If a user scrolls to the end of the result page, he has an opportunity to move to the next one, which contains further results; however, this option is used very rarely – fewer than 25% of users visit more than one result page,[19] the others confining themselves to the (usually ten) results on the first page (Spink and Jansen 2004).

Figure 2.3. Percentage of times an abstract[20] was viewed/clicked depending on the rank of the result (taken from Joachims et al. 2005, p. 156). Data is based on original and manipulated Google results, in an attempt to eliminate quality bias. Note that the study was laboratory-based, not log-based.

[17] Turpin, Scholer et al. (2009) mention 19 seconds per snippet; however, this is to consciously judge the result and produce an explicit rating. When encountering a result list in a “natural” search process, the users tend to spend much less time deciding whether to click on a result.

[18] Obviously, this might not be the case for extra high display resolutions; however, this does not invalidate the observations.

[19] Given the previous numbers in this section, a quarter of all users going beyond the first result page seems like a lot. The reason probably lies in the different sources of data in the various studies; in particular, the data used by Spink and Jansen (2004) mostly stems from the 1990s.

[20] “Abstract” and “snippet” have the same meaning in this case.

The number of pages a user actually visits is also not high. Spink and Jansen (2004) estimate it at 2 or 3 per query; and only about one in 35,000 users clicks on all ten results on the first result page (Thomas and Hawking 2006). This fact will be very important when we discuss the problems with traditional explicit measures in Chapter 4. Most of those clicks are on results in the very highest ranks; Figure 2.3 gives an overview of fixation (that is, viewing one area of the screen for ca. 200-300 milliseconds) and click rates for the top ten results. Note that the click rate falls much faster than the fixation rate. While rank 2 gets almost as much attention as rank 1, it is clicked less than a third as often.
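Per-rank fixation and click rates of the kind discussed here can be tabulated from session records in a few lines. The sketch below is only an illustration under stated assumptions: the session format (a dict with the sets of viewed and clicked ranks), the function name `rank_rates`, and the four toy sessions are all invented, and the numbers are not the data of Joachims et al. (2005).

```python
def rank_rates(sessions, max_rank=10):
    """Return a list of (view_rate, click_rate) tuples, one per rank.

    Each session is a hypothetical dict holding the set of ranks the
    user fixated ("viewed") and the set of ranks he clicked ("clicked").
    """
    viewed = [0] * max_rank
    clicked = [0] * max_rank
    for session in sessions:
        for rank in session["viewed"]:
            viewed[rank - 1] += 1
        for rank in session["clicked"]:
            clicked[rank - 1] += 1
    n = len(sessions)
    if n == 0:
        return [(0.0, 0.0)] * max_rank
    return [(v / n, c / n) for v, c in zip(viewed, clicked)]


# Invented toy sessions, not real data.
toy_sessions = [
    {"viewed": {1, 2}, "clicked": {1}},
    {"viewed": {1, 2, 3}, "clicked": {1}},
    {"viewed": {1, 2}, "clicked": {2}},
    {"viewed": {1}, "clicked": {1}},
]
rates = rank_rates(toy_sessions, max_rank=3)
for rank, (view_rate, click_rate) in enumerate(rates, start=1):
    print(f"rank {rank}: viewed {view_rate:.0%}, clicked {click_rate:.0%}")
```

Keeping the two rates side by side makes the pattern described in the text easy to check on real data: a rank can be looked at nearly as often as the one above it and still be clicked far less often.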

–  –  –

The evaluation of a web search engine has the same general goal as that of any other retrieval system: to provide a measure of how good the system is at providing the user with the information he needs. This knowledge can then be used in numerous ways: to select the most suitable of available search engines, to learn about the users or the data searched upon, or – most often – to find weaknesses in the system and ways to eliminate them.

However, web search engine evaluation also has the same potential pitfalls as its more general counterpart. The classical overview of an evaluation methodology was developed by Tague-Sutcliffe (Tague 1981; Tague-Sutcliffe 1992). Her ten points have been extremely influential, and we will now briefly recount them, commenting on their relevance to the present work.

The first issue is “To test or not to test”, which bears more weight than it seems at first glance. Obviously, this step includes reviewing the literature to check if the questions asked already have answers. However, before this can be done, one is forced to actually formulate the question. The task is anything but straightforward. “Is that search engine good?” is not a valid formulation; one has to be as precise as possible. If the question is “How many users with navigational queries are satisfied by their search experience?”, the whole evaluation process will be very different than if we ask “Which retrieval function is better at providing results covering multiple aspects of a legal professional’s information need?”. These questions are still not as detailed as they will need to be when the concrete evaluation methods are devised in the next steps; but they show the minimal level of precision required to even start thinking about evaluation. It is my immodest opinion that quite a few of the studies presented in this chapter would have profited from providing exact information on what it is they attempt to quantify. As it stands, neither the introduction of new metrics nor the evaluation of existing ones is routinely accompanied by a statement on what precisely is to be captured. There are some steps in the direction of a goal definition, but these tend to refer to behavioral observations (“in reality, the probability that a user browses to some position in the ranked list depends on many other factors other than the position alone”, Chapelle et al. 2009, p. 621) or general theoretical statements (“When examining the ranked result list of a query, it is obvious that highly relevant documents are more valuable than marginally relevant”, Järvelin and Kekäläinen 2002, p. 424). To be reasonably certain to find an explicit phenomenon to be measured, one needs to turn to the type of study discussed in Chapter 6, which deals precisely with the relationship of evaluation metrics and the real world.

The second point mentioned by Tague-Sutcliffe concerns the decision on what kind of test is to be performed, the basic distinction being that between laboratory experiments and operational tests. In laboratory experiments, the conditions are controlled, and the aim of the experiment can be addressed more precisely. Operational tests run with real users on real systems, and thus can be said to be generally closer to “real life”. Of course, in practice the distinction is not binary but gradual; still, explicit measures (Chapter 4) tend to be obtained by laboratory experiments, while implicit measures (Chapter 5) often come from operational tests.

The third issue “is deciding how actually to observe or measure […] concepts – how to operationalize them” (Tague-Sutcliffe 1992, p. 469). This means deciding which variables one controls for, and which are going to be measured. The features of the database, the type of user to be modeled, the intended behavioral constraints for the assessor (e.g. “Visit at least five results”), and – directly relevant to the present work – what and how to evaluate are all questions for this step.

The fourth question, what database to select, is not very relevant for us; mostly, either one of the popular web search engines is evaluated, or the researcher has a retrieval method of his own whose index is filled with web pages, of which there is no shortage. Sometimes, databases assembled by large workshops such as TREC[21] or CLEF[22] can be employed.
