
«University of California Los Angeles Bridging the Gap Between Tools for Learning and for Doing Statistics A dissertation submitted in partial ...»


For a histogram, the bin size and width would be draggable, providing an affordance for the user to manipulate these parameters, rather than encouraging them to keep the default, as R does. Similarly, spatial binning would be manipulable, again providing the user the capability of exploring and gaining intuition about the ‘true’ underlying spatial distribution. Essentially, every level of abstraction could be manipulated. A histogram is generalizing into bins, so the generalization (binning) can be manipulated. A box plot has a lot of abstraction, so the user could manipulate the various parameters there, by changing the mapping of center from median to mean, for example, or by redefining how outliers are determined.
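To make the dependence on bin width concrete, the following is a minimal sketch in plain Python, not part of any system described here. The function name `histogram_counts`, the `origin` parameter, and the sample values are all hypothetical; the point is only that the same data yield visibly different summaries as the bin width is "dragged" from one value to another.

```python
import math


def histogram_counts(values, bin_width, origin=0.0):
    """Count values per bin, where bin k covers [origin + k*w, origin + (k+1)*w)."""
    counts = {}
    for v in values:
        k = math.floor((v - origin) / bin_width)  # index of the bin holding v
        counts[k] = counts.get(k, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical sample with two clumps of values.
data = [1.0, 1.5, 2.0, 2.5, 3.0, 7.0, 7.5, 8.0]

# Simulate a user dragging the bin width through three settings.
for width in (1.0, 2.0, 4.0):
    print(f"bin width {width}: {histogram_counts(data, width)}")
```

With a width of 1.0 the two clumps are separated by several empty bins, while a width of 4.0 collapses most of the data into a single bin. An interactive system would redraw this continuously instead of requiring the loop to be re-run.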

Interaction is one of the weakest elements in R. The R programming paradigm means if a user wants to manipulate a parameter value they must modify the code and re-run it, making the comparison between states in their head. Comparing two states in this way may be possible, but comparing more than two is very difficult. Additionally, because of the static way code is saved, there is no user incentive to return to the beginning of the analysis to see how a code modification would trickle down.

The development of RMarkdown has improved this somewhat, as users can change elements in their code and re-knit the report to see the full scope of effects from the change. However, the edit/compile cycle is still quite clunky. As Biehler suggests, we want to encourage direct manipulation over modifying a script (Biehler, 1997). This concept has been given a lot of attention. One of the main tenets of Michael Kölling’s Greenfoot (the integrated development environment designed for novices to learn Java) was shortening the feedback loop (Kölling, 2010). Bret Victor has made the shortening of the loop one of his driving design principles, to provide users with the ability to see the direct results of their actions without waiting for something to compile (Victor, 2012).

Deborah Nolan and Duncan Temple Lang make the distinction between dynamic documents (those that are compiled and then automatically include the results of embedded code), and interactive documents (those that let a reader interact with components like graphics) (Nolan and Temple Lang, 2007). Given the goals of interactivity all the way down, and the importance of publishing, the system we are imagining should provide dynamic-interactive graphics. Users could interact with any component of the document and have the results re-run in real time.

R programmers have been wrestling with the issue of making interactive graphics from within R (Nolan and Temple Lang, 2012). Several efforts have been made, the most successful being the R packages Shiny and ggvis. However, these packages are not accessible for novices. While they allow expert users to create dynamic graphics, they are too complicated for a beginner. For that use case, manipulate is the closest to the functionality we are imagining.

Existing tools like the IPython notebook and RMarkdown provide some of this dynamic-interactive functionality. They can be used to create interactive documents that respond to user input, but the process is not dynamic for the author. Authors must modify code and then re-run it in order to see results, so direct manipulation is not being achieved, and readers only have access to dynamic behavior programmed in by the author.

TinkerPlots and Fathom are dynamic and interactive, but they break down on this requirement when it comes to publishing interactive documents. TinkerPlots makes it easy to interactively develop analysis and play with it, but there is no way to share an interactive result.

4.7 Inherent visual documentation

Requirement 7: Inherent visual documentation.

Each component of the system should show the user what it is going to do, versus simply telling them. Another version of this goal would be ‘help that is helpful,’ because most documentation in systems like R is unintelligible to novices. However, the idea of a self-articulating system goes one step further.

The system should show the user what it is going to do, not just use textual labels. For example, if it is going to perform k-means clustering, instead of a box with the words “k-means” on it, the user should see a visual representation of the algorithm, and as it is applied to the data interim steps should be visualized (Mühlbacher et al., 2014). There also needs to be a nice model object and likely an associated visualization (not residual plots, etc.) to give the user a sense of the model in order to increase intuition.
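As a sketch of what exposing interim steps might involve, the snippet below runs a one-dimensional k-means (Lloyd's algorithm) and keeps the cluster centers from every iteration instead of returning only the final answer. It is an illustrative assumption, not the approach of Mühlbacher et al. The function name `kmeans_1d_trace`, the starting centers, and the data are all hypothetical. A visual tool could animate the returned trace.

```python
def kmeans_1d_trace(points, centers, max_iter=20):
    """Run 1-D k-means and return the list of centers after every iteration."""
    trace = [list(centers)]  # iteration 0: the starting centers
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        # An empty cluster keeps its previous center.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        trace.append(new_centers)
        if new_centers == centers:  # no movement, so the algorithm has converged
            break
        centers = new_centers
    return trace


# Hypothetical data with two obvious groups and deliberately poor starting centers.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
for step, centers in enumerate(kmeans_1d_trace(points, [0.0, 5.0])):
    print(f"iteration {step}: centers = {centers}")
```

Each printed line corresponds to one frame a user could watch: the centers start at 0 and 5, drift toward the two groups, and stop moving once the assignments no longer change.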

This is similar to the idea of scented widgets, which are embedded visualizations providing hints to users about what all elements are capable of (Willett et al., 2007). Scented widgets are really a specific platform for implementing this idea, but the concept is generalizable.

4.8 Simple support for narrative, publishing, and reproducibility

Requirement 8: Simple support for narrative, publishing, and reproducibility.

The products of the system should be as easy to understand as the process of creating them, and they should be simple to share with others.

4.8.1 Narrative

Many programming systems tend toward a paradigm of writing code in one document and narrative in another, such as performing analysis in Excel and writing about the results in Word. Code comments notwithstanding, narrative and analysis are usually kept separate. However, in order to create compelling reports, it must be easier to combine the two. This means the system should encourage documentation alongside or mixed in with the code in order to facilitate the integration of storytelling with words and statistical products.

Data science is so much about storytelling that it should be built into the analysis process. Andrew Gelman and Thomas Basboll have written about how stories can be a way to develop hypotheses (Gelman and Basboll, 2013), which is one of the powers of data journalism. Where statisticians usually think of data as a pre-existing object, journalists are more likely to ‘interview’ their data source and research the contextual story surrounding it. This process should be integrated into the documentation and analysis.

In this sense, documentation does not refer to the indecipherable comments I am guilty of inserting into my code (e.g., “#Don’t know what this does”), but rather a supporting narrative that surrounds the analysis when it is complete.

Instead of encouraging a process where analysts create their data product first, then go back and try to interpret it, a good statistical programming tool should create major incentive to do the hard work of thinking as you go.

Students can learn how to write about their data analysis early on. While not ideal, RMarkdown and knitr are beginning to make it possible to integrate this into introductory courses (Baumer et al., 2014).

4.8.2 Publishing

Similarly, data analysis products should be easy to publish. Rather than having to translate the document to another format, it should be as simple as a button click to make the finished product available to an audience.

It should be easy for a journalist to create a data-driven website, or a citizen scientist to share the cool thing they found in the data they helped create. On top of this philosophy, the publishing format should allow for exploration. In fact, the ideal case would be a system that would look nearly identical to the person accessing the ‘publication’ as it did to the person producing it. In this way, users could continue to explore the data, modify the analysis, and perhaps move sliders to see the effects of changes in the analysis and visualizations.

Integrated documentation and one-click publishing will necessarily encourage reproducibility. Anyone who reads the published product will not only be able to see the code, but it will be easy to understand, given the integrated documentation.

Again, RStudio is working toward making this possible. Their RPubs website makes it possible, particularly for students, to write RMarkdown documents and then publish them to be viewable to their class or on the web at large.

4.8.3 Reproducibility

Most important in this requirement is the focus on reproducibility. While it has been a goal of research for some time, reproducibility is being more explicitly valued in the scientific publishing community (Buckheit and Donoho, 1995; Ince et al., 2012). Jan de Leeuw ends his summary of statistical programming tools by making it clear reproducibility is the next frontier (De Leeuw, 2009).

People agree modern data analysis should be reproducible. However, there is some debate about what reproducibility really means. There are two main perspectives on this issue.

The first is the perspective that posits reproducibility means someone other than the author of the analysis should be able to take the data and code used by the author and get exactly the same result. The goal of reproducing an analysis using the same data and code seems like it should be simple to achieve, but there are many factors that can make it difficult. Namely, software versions can change, package dependencies can get broken, and most disruptive to the process, authors often do not manage to document their entire process. There may have been data cleaning steps that took place outside the main software package (e.g., the bulk of the analysis takes place in R but the author does some data cleaning in Excel before the analysis), or analysis steps run elsewhere and not added to the code. The provided code might not be the current version, or it might have bugs that need to be addressed before the code will run (and often, code is poorly documented, so it is hard to debug someone else’s code). The R package packrat is attempting to fix some of these problems (Ushey et al., 2015).
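One way to picture this first sense of reproducibility is a small manifest recorded next to each result. The sketch below is a hypothetical illustration in plain Python, not a description of how packrat works. The helper names `fingerprint`, `bootstrap_mean`, and `run_with_manifest` are invented for the example, and the bootstrap is only a stand-in for a real analysis. It fixes the random seed, hashes the input data and the result, and notes the interpreter version, so a second person can check whether they obtained exactly the same output from the same inputs.

```python
import hashlib
import json
import platform
import random


def fingerprint(obj):
    """SHA-256 of a canonical JSON encoding, so equal content gives equal hashes."""
    encoded = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def bootstrap_mean(data, seed, n_resamples=200):
    """Stand-in analysis: average of bootstrap means, deterministic for a fixed seed."""
    rng = random.Random(seed)  # private generator, unaffected by other code
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(data) for _ in data]
        means.append(sum(sample) / len(sample))
    return sum(means) / len(means)


def run_with_manifest(data, seed):
    """Run the analysis and record what is needed to check a later re-run."""
    result = bootstrap_mean(data, seed)
    manifest = {
        "python_version": platform.python_version(),
        "seed": seed,
        "data_sha256": fingerprint(data),
        "result_sha256": fingerprint(result),
    }
    return result, manifest


result, manifest = run_with_manifest([1, 2, 3, 4], seed=42)
print(json.dumps(manifest, indent=2))
```

If a re-run on another machine produces a different `data_sha256`, the data were altered somewhere outside the code (for example, cleaned by hand in a spreadsheet). A different `result_sha256` with identical data and seed points instead to a change in the software environment.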

The second view of reproducibility is that an independent researcher should be able to replicate the results, either with new data and code, or at least with new code. In this framework, another researcher would replicate the study that produced the data or go about procuring it in the same way as the original researcher (e.g., making API calls to a service, or using FOIA requests to access government data). They would then attempt to perform the same analysis, writing the code from scratch or using a different analysis platform, to see if the results were the same.

As we are attempting to expose students to an authentic experience of data science, it is crucial they experience reproducibility. This aim has implications both pedagogically and technologically. Pedagogically, the aim of reproducibility needs to be written into the curriculum, and teachers need to be trained in order to understand the aim. Technologically, the tool students are using needs to support reproducibility. Supporting reproducibility means, first and foremost, the process that created a data product needs to be savable. Some teaching tools, like applets, are more for interaction than anything else, and the creation process cannot be saved. Other tools, like TinkerPlots and Fathom, allow the user to save the environment that produced the product, but do not document the steps taken within the environment. An ‘independent researcher’ (in this context, another student) could potentially use this environment and manage to reproduce the steps required to produce the analysis product, but it would be much harder to do so.

The current tools for learning statistics fall quite short in this regard. In most systems, there is no encouragement of documentation, and analysis is not reproducible because it was all produced interactively. This tension is addressed in Section 5.2, where we attempt a method of tracking interactions. Tools like Excel also fail to meet this criterion, because they do not make it clear what modifications have been made to the data, nor do they provide a method for reproducing the analysis with different data.

Even in courses where introductory college students are using R, they often document their data process in Microsoft Word, copying and pasting results into their final document. This can be a very frustrating experience, because if a student realizes they made a mistake early in their analysis they must replace all the dependent pieces of the analysis in their Word document. In introductory college statistics courses there is a movement toward students using tools like RMarkdown from the beginning in order to get them in the habit of reproducible research (Baumer et al., 2014). While RMarkdown takes some cognitive effort to learn, the payoffs are typically great enough that students see the usefulness.


Much of the work of data analysis is gaining access to data and getting it into a format a tool can work with, which can take up to 80% of the time in data projects (Kandel et al., 2011a). This ties into reproducibility, as well. If, once a user has cleaned some data and neatened it into a ‘tidy’ format (Wickham, 2014b), they are given a new version of the data set, are they able to perform those same actions on these new data? Thus, another benchmark is to be able to reproduce wrangling steps and share them with others (Kandel et al., 2011a).

Once data have been wrangled, it should be clear what steps have been taken, and how to re-do them on another data set. ‘Consumers’ of the product should be able to look at the process and assess whether it was done in good faith.
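A minimal sketch of this idea, under the assumption that each wrangling action can be expressed as a named function, is given below. It is not how Data Wrangler or any other tool is implemented, and the step functions, column names, and data are hypothetical. The steps are stored as an ordered list, so the same list can be replayed on a new version of the data, and the accompanying log gives a consumer a plain-language record of what was done.

```python
def drop_missing(rows):
    """Remove any row that contains a missing (None) value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def add_celsius(rows):
    """Add a temp_c column computed from the existing temp_f column."""
    return [{**r, "temp_c": round((r["temp_f"] - 32) * 5 / 9, 1)} for r in rows]


# The recorded wrangling process: an ordered list of (description, function).
PIPELINE = [
    ("drop rows with missing values", drop_missing),
    ("add Celsius column", add_celsius),
]


def run_pipeline(rows, pipeline):
    """Apply each step in order and return the result plus a readable log."""
    log = []
    for description, step in pipeline:
        before = len(rows)
        rows = step(rows)
        log.append(f"{description}: {before} -> {len(rows)} rows")
    return rows, log


# Original data, then a new version of the data set run through the same steps.
original = [{"city": "A", "temp_f": 212.0}, {"city": "B", "temp_f": None}]
updated = [{"city": "C", "temp_f": 32.0}, {"city": "D", "temp_f": 50.0}]

for name, rows in (("original", original), ("updated", updated)):
    cleaned, log = run_pipeline(rows, PIPELINE)
    print(name, cleaned)
    for line in log:
        print("  ", line)
```

Because the process is data (a list of steps) and not a sequence of untracked mouse actions, it can be shared, inspected, and rerun without relying on the analyst's memory.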

Again, this is a shortcoming of tools designed for learning statistics, as they make it difficult to share and reproduce research. Several products are working to address this problem, most notably Data Wrangler and OpenRefine.

4.9 Flexibility to build extensions

Requirement 9: Flexibility to build extensions.

Again, this requirement comes from the paper on computational thinking tools referenced in the introduction to this chapter. In the words of Repenning et al., the system should have a “high ceiling” (Repenning et al., 2010). This means users should not ‘age out’ or ‘experience out’ of the system. Instead, it should be possible to build almost anything from the components the system provides.

As discussed in Section 4.4, it should be possible to develop new visualization types, building from a series of primitives. Similarly, it should be possible to build new data processes from other modular pieces.
