including all sessions, not just those where a user preference exists, and providing more detail, it allows for a finer evaluation and the consideration of more precise goals. In Section 12.2.3, I mentioned personalization efforts; if such mechanisms could be implemented and tested, they might provide an important bridge between the relatively low results of different-user evaluation and those of same-user evaluation. In Section 12.2.4, a useful possible study was suggested, which would provide user groups with different relevance scales and different instructions; it could show whether an intuitive six-point relevance scale is really better than a binary or three-point one, and which subcategory of the latter scales produces the most accurate preference predictions.
Another interesting topic is log-based, and in particular click-based, metrics. In Chapter 11, I explained that the study layout was not suited for most click metrics proposed and used in recent years. They do not provide absolute scores for a result or result list; rather, given a result list and click data, they construct a result list that is postulated to be of better quality. Thus, we cannot take two result lists and compare them using, say, the “click skip above” model (see Chapter 5); instead, we have to take one result list, collect the log data, then construct a second result list, and obtain a user preference between the two lists.
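To make this concrete, here is a minimal sketch, in Python, of how such pairwise preferences can be extracted from click data, following a “click > skip above” heuristic of the kind discussed in Chapter 5; the function and document names are illustrative, not the thesis's exact formulation.

```python
def click_skip_above_preferences(ranking, clicked):
    """Derive pairwise preferences from click data using the
    'click > skip above' heuristic: a clicked result is taken to be
    preferred over every result ranked above it that was skipped."""
    clicked = set(clicked)
    prefs = []
    for i, doc in enumerate(ranking):
        if doc in clicked:
            for above in ranking[:i]:
                if above not in clicked:
                    prefs.append((doc, above))  # doc preferred over 'above'
    return prefs

# Example: five results, the user clicked the third and fifth.
ranking = ["d1", "d2", "d3", "d4", "d5"]
prefs = click_skip_above_preferences(ranking, clicked={"d3", "d5"})
# -> [('d3', 'd1'), ('d3', 'd2'), ('d5', 'd1'), ('d5', 'd2'), ('d5', 'd4')]
```

A second result list that satisfies these derived preferences can then be constructed and pitted against the original, as described above.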
Obviously, there are also possible research topics apart from those already mentioned in this study. Perhaps the most intriguing of them concerns result snippets. If you recall, one of the ratings required from the evaluators in the present study was whether the result descriptions they encountered in the result lists were “good”, that is, whether the user would click on this result for this query. I have not performed the evaluation (yet), as this is a large, separate task with many possibilities of its own.
Snippets have been the object of some research lately; however, that research has mostly focused on snippet creation (e.g. Turpin et al. 2007; Teevan et al. 2009). There has also been work on snippet-based evaluation (Lewandowski 2008; Höchstötter and Lewandowski 2009), which mostly focused on the evaluation of snippets on their own, as well as on comparing ratings for snippets and documents. But there is more to be done. One possibility for which the PIR framework seems well suited is combining snippet and document judgments.
It is well established that the user only examines a small subset of the available results, even within a cut-off rank of, say, ten. One method for determining which results will be clicked on is evaluating the snippets. It stands to reason that if a user does not regard a snippet as relevant, he will not click on it; he will not see the document itself; and he will not gain anything from the result (except where the snippet itself contains the sought-after information, as may be the case with factual queries; this case can be accounted for by using three snippet relevance categories: “unpromising”, “promising” and “useful on its own”). Thus, the results with unattractive snippets will be assigned a relevance score of zero. This is an approach that has received attention (Turpin et al. 2009), although not as much as its possible impact would warrant.
In the next step, there are at least two ways of dealing with unattractive snippets. We can consider them as having no influence on the user at all, and discard all documents which will not be seen; in this case, the ranks of the following documents will move up. Or we can assume that the results do not provide any benefit to the user, and furthermore distract him from the possibly more useful results; then, we would just set the relevance scores of those documents to zero. PIR should be able to provide an answer as to which of these models (or perhaps some other) better predicts user preferences, and whether any of them performs better than the current model, which does not consider snippets at all.
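As a concrete illustration, here is a minimal sketch of the two variants in Python; the relevance grades, the snippet_ok flag and the log-based DCG are illustrative assumptions rather than the thesis's exact setup.

```python
import math

def dcg(gains):
    # Standard log2 discount; the thesis discusses alternative discount functions.
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def discard_variant(results):
    """Variant 1: results with unattractive snippets are invisible;
    the documents below them move up in rank."""
    return [r["relevance"] for r in results if r["snippet_ok"]]

def zero_variant(results):
    """Variant 2: results with unattractive snippets keep their rank
    but contribute a relevance of zero."""
    return [r["relevance"] if r["snippet_ok"] else 0 for r in results]

results = [
    {"relevance": 3, "snippet_ok": True},
    {"relevance": 2, "snippet_ok": False},  # good document, unpromising snippet
    {"relevance": 1, "snippet_ok": True},
]
print(dcg(discard_variant(results)))  # the rank-3 document moves up to rank 2
print(dcg(zero_variant(results)))     # the rank-3 document keeps the stronger discount
```

Either variant produces an adjusted gain vector that any of the metrics discussed in this thesis could consume, so both can be compared against the snippet-agnostic baseline within the PIR framework.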
There are, of course, many more open questions out there, such as different result types (e.g. image or news search). I think that the framework introduced in this thesis, which combines direct, session-based user opinion as a reference point with the variation of as much data as possible, so as to assess not just one metric but as many parameters as needed, can help answer some of them.
In this thesis, I describe...
an overview of the metrics used in search engine evaluation;
the theoretical issues which can be raised about them;
previous studies which evaluate evaluation metrics;
a meta-evaluation measure, the Preference Identification Ratio (PIR), which captures a metric’s ability to correctly recognize explicitly stated user preferences (a minimal computational sketch follows these summary lists);
an evaluation method which varies metrics and parameters, allowing one set of data to be used for dozens or hundreds of different evaluations;
a new category of query, called “meta-query”, which requires information on the search engine itself and cannot produce a “good” or “bad” result list.
I find that...
randomizing the top 50 results of a leading search engine (Yahoo) leads to result lists that are regarded as equal or superior to the original ones for over 40% of queries;
after the first five ranks, further results of a leading search engine (Yahoo) are on average no better than the average of the first 50 results;
a cut-off rank slightly smaller than 10 not only reduces the data-gathering effort but in most cases actually improves the evaluation accuracy;
the widely used Mean Average Precision metric is in most cases a poor predictor of user preference, worse than Discounted Cumulated Gain and not better than Precision;
the kind of discount function employed in a metric can be crucial for its ability to predict user preference;
a six-point relevance scale intuitively understandable to raters produces better results than a binary or three-point scale;
session duration and click count on their own are not useful in predicting user preference.
I find preliminary evidence that...
(normalized) Discounted Cumulated Gain and a variant of Estimated Search Length are, with appropriate discount functions, the best predictors of user preference;
if a binary or three-point relevance scale is used, the precise rater instructions as to what the individual ratings mean can significantly influence user preference prediction quality;
depending on the metric and other evaluation parameters, cut-off ranks as low as 4 may provide the best effort-quality ratio.
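As the sketch announced above, here is a minimal Python illustration of the PIR measure and of how the choice of discount function can change a metric’s verdict. The sample data, the tie handling (ties count as misses) and the linear discount are illustrative assumptions, not the thesis’s exact setup.

```python
import math

def preference_identification_ratio(pairs, metric):
    """PIR sketch: the fraction of explicitly stated user preferences
    that a metric reproduces. Each pair is (preferred, other), two result
    lists given as lists of relevance grades; 'metric' maps such a list
    to a score. Ties count as misses here; the thesis's exact tie
    handling may differ."""
    hits = sum(1 for preferred, other in pairs if metric(preferred) > metric(other))
    return hits / len(pairs)

def dcg_with(discount):
    """Build a DCG-style metric from an arbitrary rank-discount function."""
    return lambda gains: sum(g / discount(r) for r, g in enumerate(gains, start=1))

log_discount = lambda r: math.log2(r + 1)  # the classic DCG discount
linear_discount = lambda r: r              # a steeper, rank-proportional discount

# Hypothetical data: the user preferred the first list of each pair.
pairs = [([3, 2, 0], [2, 0, 3]),
         ([1, 1, 1], [3, 0, 0])]
print(preference_identification_ratio(pairs, dcg_with(log_discount)))
print(preference_identification_ratio(pairs, dcg_with(linear_discount)))
```

Running many such evaluations while varying the metric, the discount function, the relevance scale and the cut-off rank is exactly the kind of parameter sweep the method described above is meant to support.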
Al-Maskari, A., M. Sanderson and P. Clough (2007). The Relationship Between IR Effectiveness Measures and User Satisfaction. Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, ACM: 773-774.
Al-Maskari, A., M. Sanderson, P. Clough and E. Airio (2008). The Good and the Bad System: Does the Test Collection Predict Users' Effectiveness? Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Singapore, Singapore. New York, ACM: 59-66.
Ali, K., C.-C. Chang and Y. Juan (2005). Exploring Cost-Effective Approaches to Human Evaluation of Search Engine Relevance. Advances in Information Retrieval. Berlin; Heidelberg, Springer: 360-374.
Allan, J., B. Carterette and J. Lewis (2005). When Will Information Retrieval Be "Good Enough"? Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil. New York, ACM: 433-440.
Alpert, J. and N. Hajaj (2008). We Knew the Web Was Big... Retrieved 2010-10-04, from http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html.
archive.org (2006). Wayback Machine - dmoz.org 2005-06-22. Retrieved 2010-10-05, from http://web.archive.org/web/20050622050456/http://www.dmoz.org/.
Asano, Y., Y. Tezuka and T. Nishizeki (2008). Improvements of HITS Algorithms for Spam Links. Proceedings of the Joint 9th Asia-Pacific Web and 8th International Conference on Web-Age Information Management Conference on Advances in Data and Web Management, Huang Shan, China, Springer-Verlag: 200-208.
Baeza-Yates, R. (2004). Query Usage Mining in Search Engines. Web Mining: Applications and Techniques. A. Scime (ed.). Hershey, PA, IGI Publishing: 307-321.
Baeza-Yates, R., C. Hurtado and M. Mendoza (2005). Query Recommendation Using Query Logs in Search Engines. Current Trends in Database Technology - EDBT 2004 Workshops. W. Lindner, M. Mesiti, C. Türker, Y. Tzitzikas and A. Vakali (ed.). New York, Springer: 395-397.
Bailey, P., N. Craswell, I. Soboroff, P. Thomas, A. P. d. Vries and E. Yilmaz (2008). Relevance Assessment: Are Judges Exchangeable and Does It Matter? Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Singapore, Singapore, ACM: 667-674.
Barbaro, M. and T. Zeller Jr. (2006). A Face Is Exposed for AOL Searcher No. 4417749. The New York Times. Retrieved 2011-10-31.
Barboza, D. (2010). Baidu’s Gain from Departure Could Be China’s Loss. The New York Times. New York: B1.
Berman DeValerio (2010). Cases: AOL Privacy. Retrieved 2011-10-31.
Breithut, J. (2011). Drei gegen Google. Retrieved 2011-10-30.
Brin, S. and L. Page (1998). The Anatomy of a Large-scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems 30(1-7): 107-117.
Broder, A. (2002). A Taxonomy of Web Search. SIGIR Forum 36(2): 3-10.
Buckley, C. and E. M. Voorhees (2000). Evaluating Evaluation Measure Stability. Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Athens, Greece, ACM: 33-40.
Buckley, C. and E. M. Voorhees (2004). Retrieval Evaluation with Incomplete Information. Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Sheffield, United Kingdom, ACM: 25-32.
Carbonell, J. and J. Goldstein (1998). The Use of MMR, Diversity-based Reranking for Reordering Documents and Producing Summaries. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Melbourne, Australia, ACM: 335-336.
Chapelle, O., D. Metzler, Y. Zhang and P. Grinspan (2009). Expected Reciprocal Rank for Graded Relevance. Proceedings of the 18th ACM Conference on Information and Knowledge Management. Hong Kong, China, ACM: 621-630.
Chu, H. (2011). Factors Affecting Relevance Judgment: A Report from TREC Legal Track. Journal of Documentation 67(2): 264-278.
Clarke, C. L. A., N. Craswell and I. Soboroff (2009). Overview of the TREC 2009 Web Track. TREC 2009, Gaithersburg, Maryland, NIST.
Clarke, C. L. A., M. Kolla, G. V. Cormack, O. Vechtomova, A. Ashkan, S. Büttcher and I. MacKinnon (2008). Novelty and Diversity in Information Retrieval Evaluation. Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Singapore, Singapore, ACM: 659-666.
Clarke, S. J. and P. Willett (1999). Estimating the Recall Performance of Web Search Engines. Aslib Proceedings 49(7): 184-189.
Cleverdon, C. W. and M. Keen (1968). Factors Determining the Performance of Indexing Systems. Cranfield, England, Aslib Cranfield Research Project.
CNET News (2006). Yahoo's Steady Home Page Transformation. Retrieved 2010-10-05, from http://news.cnet.com/2300-1032_3-6072801.html.
comScore (2009). comScore Releases June 2009 U.S. Search Engine Rankings. Retrieved 2009-10-23, from http://www.comscore.com/Press_Events/Press_releases/2009/7/comScore_Releases_June_2009_U.S._Search_Engine_Rankings.
comScore (2010). Global Search Market Draws More than 100 Billion Searches per Month. Retrieved 2010-10-05, from http://www.comscore.com/Press_Events/Press_Releases/2009/8/Global_Search_Market_Draws_More_than_100_Billion_Searches_per_Month.
Cooper, W. S. (1968). Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems. American Documentation 19(1): 30-41.
Craswell, N., O. Zoeter, M. Taylor and B. Ramsey (2008). An Experimental Comparison of Click Position-bias Models. Proceedings of the International Conference on Web Search and Web Data Mining, Palo Alto, California, USA, ACM.
Croft, B., D. Metzler and T. Strohman (2010). Search Engines: Information Retrieval in Practice, Addison-Wesley Publishing Company.
Dang, H. T., J. Lin and D. Kelly (2006). Overview of the TREC 2006 Question Answering Track. Proceedings of the 15th Text REtrieval Conference, Gaithersburg, Maryland.
Davison, B. D., A. Gerasoulis, K. Kleisouris, Y. Lu, H.-J. Seo, W. Wang and B. Wu (1999). DiscoWeb: Applying Link Analysis to Web Search. Poster Proceedings of the Eighth International World Wide Web Conference, Elsevier: 148-149.
De Beer, J. and M.-F. Moens (2006). Rpref: A Generalization of Bpref Towards Graded Relevance Judgments. Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Seattle, Washington, USA, ACM: 637-638.
de Kunder, M. (2010). The Size of the World Wide Web. Retrieved 2010-10-04, from http://www.worldwidewebsize.com/.
Della Mea, V., G. Demartini, L. Di Gaspero and S. Mizzaro (2006). Measuring Retrieval Effectiveness with Average Distance Measure (ADM). Information: Wissenschaft und Praxis 57(8): 433-443.
Della Mea, V., L. Di Gaspero and S. Mizzaro (2004). Evaluating ADM on a Four-level Relevance Scale Document Set from NTCIR. Proceedings of NTCIR Workshop 4 Meeting - Supplement 2: 30.
Diaz, A. (2008). Through the Google Goggles: Sociopolitical Bias in Search Engine Design. Web Search. A. Spink and M. Zimmer (ed.). Berlin, Heidelberg, Springer. 14: 11-34.
Dou, Z., R. Song, X. Yuan and J.-R. Wen (2008). Are Click-through Data Adequate for Learning Web Search Rankings? Proceedings of the 17th ACM Conference on Information and Knowledge Management. Napa Valley, California, USA, ACM: 73-82.