
On Search Engine Evaluation Metrics. Inaugural dissertation for the attainment of the degree of Doctor of Philosophy (Dr. Phil.) through the Philosophische ...


11.1 Session Duration Evaluation

The amount of time a user spends on examining the results presented by the search engine has been proposed as an indicator of those results' quality. However, a problem that arises immediately is the definition and determination of "session duration". One might reasonably define it as the time between the user submitting the query97 and his completion or abortion of the search.98 But while the moment of query submission is routinely recorded in search engine logs, the end point is less obvious. If we see a session as a sequence of (1) submitting the query, (2) examining the result list, (3) clicking on a result, (4) examining that result, and then possibly (5) returning to the result list and proceeding with step (2), then the session ends after step (4), whether on the first or a subsequent iteration. But whereas steps (1) and (3) can be (and are) recorded by search engines, there is no obvious way to know when the user has stopped examining the last result and decided he will not return to the result list. Furthermore, if the user's information need is not yet satisfied, but he looks at the result list and finds no promising results, the session might also end after step (2). Particular cases of this scenario are no-click sessions, which may have an additional twist; as described in Section 9.2, the absence of clicks can also mean a result list which provides enough information to satisfy the user's information need without delving into the results themselves, especially in the case of factual queries or meta-queries.

In this study, two possible end points of a session can be evaluated. We happen to have definite end points for every session, as raters were required to submit a preference or satisfaction judgment when they were "done" (see Chapter 9.1 for more details). However, since this information is not usually available to search engine operators, I will also evaluate session duration with the last recordable point in the session process as described above; that is, the time of the last recorded click.

The session duration itself is a fairly straightforward metric: one takes the timestamp of the session end, subtracts the time of the session start, and – voilà! – one has the duration. It can then be judged against the actual user preference, just as has been done with the explicit metrics in Chapter 10. Here, too, the threshold approach can be used in a way analogous to that employed for the explicit metrics: there might be a difference in session duration for which it does not make sense to assume an underlying preference for one of the result lists (say, 65 seconds versus 63 seconds).

There is also the possibility that a longer session is not necessarily indicative of a worse result list; longer sessions might conceivably even be generally preferable. A long session can indicate that the user has found many relevant results and spent a lot of time interacting with the web pages provided. In this view, a short session might just mean there wasn't much useful information in the result list.

97 Since modern web search engines return their results in a negligible time upon query submission, the time points of submitting the query and seeing the result list are the same for all practical purposes.

98 In the present evaluation, I am concerned only with single-query sessions. Multi-query sessions constitute a significant part of all search activity, and are quite likely to be a more realistic model of user behavior than single-query sessions. There is also no obvious reason why they could not be evaluated using the same PIR framework used here. However, multi-query sessions require an approach to study design and data gathering different from the one employed here, and so they are left unevaluated.
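As a minimal sketch of this threshold-based PIR evaluation (the session pairs and durations below are invented for illustration and are not data from the study; treating a sub-threshold difference as a tie worth 0.5 is an assumption consistent with the 0.50 baseline seen in the graphs):

```python
# Threshold-based PIR for session duration, assuming "shorter is better".
# Each pair holds the two result lists' session durations (in seconds)
# and the user's actual preference ("A" or "B"). A duration difference
# below the threshold predicts no preference and scores 0.5.

def pir(pairs, threshold):
    score = 0.0
    for dur_a, dur_b, preferred in pairs:
        if abs(dur_a - dur_b) < threshold:
            score += 0.5                 # tie: no predicted preference
        elif (dur_a < dur_b) == (preferred == "A"):
            score += 1.0                 # the shorter session's list was preferred
    return score / len(pairs)

pairs = [(30, 70, "A"), (65, 63, "B"), (120, 40, "B"), (10, 55, "A")]
print(pir(pairs, threshold=5))           # → 0.875
```

Sweeping the threshold over a range of values and plotting the resulting PIR produces graphs of the kind shown in the figures below.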

Figure 11.1. Preference Identification Ratio based on session duration (measured with the user-indicated session end).

PIR is shown on the Y-axis and the threshold (difference in session duration, in seconds) on the X-axis. The PIR indicates the correct preference identification if shorter sessions are considered to be better. For most queries, the graph seems to indicate the opposite (a "longer is better" reading could be evaluated by mirroring the graph at 0.50 PIR), but the differences are far too small to be of significance.

Figure 11.2. Preference Identification Ratio based on session duration (measured with the user-indicated session end).

PIR is shown on the Y-axis and the threshold (difference in session duration, in seconds) on the X-axis. The PIR indicates the correct preference identification if shorter sessions are considered to be better. Only informational queries have been assessed.

Figure 11.1 can put our minds to rest about the first assumption. Indeed, it shows a result that is, at least to me, quite surprising: The differences in session duration do not seem to indicate user preference in either direction. The PIR scores range approximately from 0.45 to 0.52;

that is, the difference from a baseline performance is negligible. The same can be said for an evaluation of the informational queries only (Figure 11.2); while the PIR scores are slightly higher (0.48 to 0.56), the overall shape of the graph is the same as before.

Does this mean that session duration is no good at all as a predictor of user preference? Not necessarily. It might be that for certain specific durations, there are meaningful connections between the two. For example, sessions of less than five seconds may indicate a much worse result list than those from 20 seconds upwards, while for durations of more than a minute, shorter might be better.99 Unfortunately, the more specific the situations, the narrower the duration bands become, and the fewer queries fall into each. In the end, the present study turns out not to be large enough to permit this kind of analysis. All that can be said is that there seems to be no direct, general and significant connection between session duration difference and user preference.





Figure 11.3. Preference Identification Ratio based on session duration (measured with the last click as session end).

PIR is shown on the Y-axis and the threshold (difference in session duration, in seconds) on the X-axis. The PIR indicates the correct preference identification if shorter sessions are considered to be better.

99 Or perhaps the opposites are true; this is meant not as a prediction of what duration means, but just as an indication of possible connections the analysis in the previous paragraphs would not have captured.

Figure 11.4. Preference Identification Ratio based on session duration (measured with the last click as session end).

PIR is shown on the Y-axis and the threshold (difference in session duration, in seconds) on the X-axis. The PIR indicates the correct preference identification if shorter sessions are considered to be better. Only informational queries have been assessed.

Figure 11.3 shows the PIR graph with a different method of calculating session duration, namely, determining the end of the session by the last click made.

Here, the range of PIR scores is a bit wider, from 0.44 to 0.56; however, once again this is nothing to get excited about. The scores are slightly higher than before, but still not high enough to provide meaningful clues as to user preference. A similar picture emerges in Figure 11.4, which shows the graph for informational queries only. In that case, the maximum is about 0.59 (for session duration differences of 35 seconds and above); still not enough to be useful, especially since for most session pairs, the PIR score is closer to chance.

The conclusion from this section is simple: In the present study, differences in session duration were not a good indicator for user preference; and mostly they were no indicator at all. However, this study was relatively small; in particular, it did not have enough data to evaluate sessions of particular durations only (say, sessions shorter than 10 seconds versus those taking more than half a minute). Such measures could conceivably produce better results.

11.2 Click-based Evaluations

Even if session duration does not cut the mustard, we can still turn to click data to provide the cutting edge. Some methods used for evaluating click data to infer the quality of a result list have been described in Chapter 5; they, and a few others not widely employed in studies, will be dealt with in this section. The methods will be the same as those for explicit result ratings

and session duration; that is, the relevant quality-indicating numbers will be calculated from clicks performed by users in the single result list condition, and then the difference in these numbers will be compared to actual user preferences using PIR. If you are still uncertain what that looks like in practice, read on; the evaluation should be a clear example of the method employed.

11.2.1 Click Count

The simplest way to measure something about the clicks a user performs is counting them. As is so often the case with evaluations (or at least with the approach to them taken in this study), there are at least two possible ways to interpret the number of clicks a result list gets. A low number of clicks might indicate that the user found the sought information quickly; in the extreme case, no clicks can mean that the information was in the result list itself, eliminating the need to view any result.100 Or, a low number of clicks might be a sign that most results (at least as presented in the result list) just weren't attractive, and the user was left unsatisfied.

And, as usual, I will turn to evaluation to provide the conclusion as to which of the views is closer to the truth.

Figure 11.5.

Preference Identification Ratio based on a simple click count. PIR is shown on the Y-axis and the threshold (difference in click counts) on the X-axis. The PIR indicates the correct preference identification if fewer clicks are considered to be better.

100 This would necessarily be the case for meta-queries (described in Section 9.2).

Figure 11.6. Preference Identification Ratio based on a simple click count. PIR is shown on the Y-axis and the threshold (difference in click counts) on the X-axis. The PIR indicates the correct preference identification if fewer clicks are considered to be better. Only informational queries have been assessed.

Figure 11.5 indicates that, as with the session duration, the answer is probably “none”.

The PIR lies between 0.44 and 0.54, never far from the baseline.101 The picture is very similar if we restrict the evaluation to informational queries (Figure 11.6). Both show that the sheer number of clicks is not a reliable indicator of user preference; both smaller and larger click numbers can indicate superior as well as inferior result list quality.

101 There were no queries where the difference in clicks for the two result lists was larger than 4; hence the stable value of 0.50 from a threshold of 5 on. Numbers for individual result lists (averaged among different sessions/users) were as high as 10 clicks, but the disparity was never too large. I have kept the larger thresholds in the graph for easier comparison to following graphs.

Figure 11.7. Preference Identification Ratio based on a simple click count, for queries where only one of the result lists received clicks. PIR is shown on the Y-axis and the threshold (difference in click counts) on the X-axis. The PIR indicates the correct preference identification if fewer clicks are considered to be better.

An additional possibility is considering that particular subset of sessions where one of the result lists does not receive any clicks. As Figure 11.7 shows, this does reveal a very pronounced tendency. The assumption "fewer clicks are better" is strongly refuted in this case;

looking at the results from the other direction, we can say that the PIR for the “more clicks are better” metric is higher than 0.85. This is a high value indeed; but it has to be noted that it is not directly comparable to the other PIR scores we have seen. One reason for that is that the number of sessions fitting the criteria of one and only one of the result lists having a click count of zero is relatively small; these 35 sessions are about 1/4 of those used, for example, in the general click count evaluation. Even more importantly, the excluded cases are in large part precisely those where the metric performs poorly in terms of preference identification; for example, the sessions where neither result list receives clicks have, on their own, a PIR of 0.50.

This is not to say that the comparison of sessions with a click/no click difference is not valuable or meaningful. But previous PIR evaluations concerned the whole population of queries, which are supposed to be at least a rough approximation of real-life query variety.

This result, however, stems from sessions preselected in a way that strongly favours a certain outcome. The evaluations which regard only informational queries also narrow the session population; however, they might be closer to or further from the baseline, while selecting based on the metric scores' differences necessarily nudges PIR towards a better performance.

What does this high score mean, then? Obviously, it means that if one result list has no clicks but another one does, the latter is very likely to be preferred by the user. By the way – if you wonder where the PIR evaluation for informational queries only is: all the queries that remained in this evaluation were informational. This also means that the possibility that,

especially for factual queries and meta-queries, a no-click session can be a good indicator, has not really been tested. But for the main body of queries, the sought result does not seem to be found in the result list itself. Also, it does not seem to happen with any frequency that all the clicks a user makes in a result list are for nothing; that is, here, the users seem to be quite good at scanning the result list and recognizing the absence of any useful results.
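The subset selection described above can be sketched as follows (the click counts and preferences are hypothetical; the comparison implements the "more clicks are better" reading that the figure supports):

```python
# Restrict the evaluation to pairs where exactly one result list
# received no clicks, then check how often the clicked list was
# the one the user actually preferred.

pairs = [
    {"clicks_a": 0, "clicks_b": 3, "preferred": "B"},
    {"clicks_a": 2, "clicks_b": 0, "preferred": "A"},
    {"clicks_a": 1, "clicks_b": 2, "preferred": "B"},  # excluded: both lists clicked
    {"clicks_a": 0, "clicks_b": 1, "preferred": "A"},
]

subset = [p for p in pairs
          if (p["clicks_a"] == 0) != (p["clicks_b"] == 0)]  # exactly one is zero

# "More clicks are better": predict that the clicked list is preferred.
hits = sum((p["clicks_a"] > p["clicks_b"]) == (p["preferred"] == "A")
           for p in subset)
print(hits / len(subset))               # 2 of the 3 retained pairs are correct
```

Note how the filtering step itself discards exactly the pairs on which the metric cannot perform well, which is why the resulting score is not directly comparable to the PIR values of the unrestricted evaluations.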

11.2.2 Click Rank

Another relatively straightforward metric is the average click rank. As with many of the explicit result ratings, the assumption here is that the user prefers it if good results appear at the top of the result list. If this is the case (and if the user can recognize good results from the snippets presented in the result list), the clicks in better result lists should occur at earlier ranks, while those in inferior ones would be in the lower regions.
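As a sketch of the metric (the ranks are illustrative, with rank 1 denoting the top result):

```python
# Average click rank: the mean position of the results a user clicked.
# Lower values suggest the clicks concentrated near the top of the list.

def average_click_rank(clicked_ranks):
    if not clicked_ranks:
        return None      # a no-click session leaves the metric undefined
    return sum(clicked_ranks) / len(clicked_ranks)

print(average_click_rank([1, 2, 4]))     # clicks on ranks 1, 2 and 4
```

As before, the per-list averages would then be differenced and fed into the PIR evaluation at varying thresholds.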


