University of California Los Angeles. Bridging the Gap Between Tools for Learning and for Doing Statistics. A dissertation submitted in partial ...


Another bespoke data visualization system is Lyra, which makes it easy for novices to create graphics in a drag-and-drop manner (Satyanarayan and Heer, 2014). Lyra was developed at the University of Washington Interactive Data Lab. Interestingly, Jeffrey Heer was a member of the Stanford Visualization Group that created Data Wrangler and is one of the founders of Trifacta; he has since moved to the University of Washington, where he is a member of the Interactive Data Lab. Lyra is built on top of Vega, an abstraction layer on top of d3, a JavaScript library.

d3 is a library for “manipulating documents based on data,” where ‘documents’ refers to the document object model (DOM) of the web (Bostock et al., 2011; Bostock, 2013). It is commonly used to create interactive web visualizations. d3 is generally credited to Mike Bostock, but the paper introducing the library also lists Vadim Ogievetsky and Jeffrey Heer as authors. d3 is a very general library, and is not really a plotting library at all: it does not provide primitives like bars, boxes, or axes, as standard visualization systems do. Instead, it binds data to the DOM of a web page. Many of the pieces by the New York Times mentioned in Section 3.2 are based on d3, and are co-authored by Bostock himself. Bostock has created a site where users can share ‘blocks’ they have built in d3, and he has contributed many of them (Bostock, 2015). While shared code examples help users get started, d3 is generally considered to have a steep initial learning curve.

Vega is an attempt to make it easier for novices to create the beautiful interactive graphics associated with d3 (Heer, 2014). It provides the sorts of graphical primitives more typically associated with data visualization tools: rect, area, and line. However, even with these primitives, Vega can be difficult for novices in the same way all textual programming languages can be.
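To make this concrete, a Vega chart is described declaratively in JSON rather than drawn programmatically. The fragment below is a sketch only: it follows the general shape of early Vega specifications (data, scales, and marks), but the exact field names vary between Vega versions and are not taken from the Satyanarayan and Heer paper.

```json
{
  "width": 200,
  "height": 100,
  "data": [
    {"name": "table", "values": [{"x": "A", "y": 3}, {"x": "B", "y": 7}]}
  ],
  "scales": [
    {"name": "xs", "type": "ordinal", "range": "width",
     "domain": {"data": "table", "field": "x"}},
    {"name": "ys", "type": "linear", "range": "height",
     "domain": {"data": "table", "field": "y"}}
  ],
  "marks": [
    {"type": "rect", "from": {"data": "table"},
     "properties": {"enter": {
       "x": {"scale": "xs", "field": "x"},
       "y": {"scale": "ys", "field": "y"},
       "y2": {"value": 0},
       "width": {"value": 30}
     }}}
  ]
}
```

In principle the same data could be rendered as an area or line chart by changing the mark type, and this economy of representation is what Lyra builds on.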

Enter Lyra, a tool that makes the creation of Vega graphics very simple. It supports simple data transformations, like grouping on a variable, but should generally be considered a visualization tool only, because it does not provide functionality for data cleaning, modeling, etc. It is a reproducible tool: the resulting graphics are Vega graphics, and can therefore be interrogated the way any standard Vega graphic can (i.e., by looking at the code).

Lyra does not support interactive graphics creation, but it is likely the group will soon go in that direction. Figure 3.34 shows how the tool can be used to reproduce the famous Minard visualization of Napoleon’s march to Moscow (Satyanarayan and Heer, 2014).

Figure 3.34: Lyra

Bespoke data tools like these are great sources of inspiration for new ways to approach data cleaning, modeling, and visualization. Many of these projects are open source, so while they do not cover the entire analysis trajectory, they show promise as tools for particular data needs.

3.11 Summary of currently available tools

As is probably clear, my preference for doing statistical analysis is R, for its status as a free and open source project, as well as its flexibility, extensibility, and community of users. However, I acknowledge that R is a very difficult tool to begin learning. There have been attempts to ease the transition to R at a variety of levels, including GUIs and IDEs for R, pedagogical frameworks like swirl and Data Desk, and fencing attempts such as mosaic and the MobilizeSimple package, discussed further in a later section. However, none of these attempts has fully removed the barrier to entry for R. On the learning end of the spectrum, tools like TinkerPlots and Fathom provide flexible and creative ways to explore data. They offer little barrier to entry, but do not support reproducible analysis or the sharing of results.

Many of the tools we have examined are inspirational: TinkerPlots and Fathom in particular, but also the bespoke tools Data Wrangler, OpenRefine, Tableau, and Lyra. All of these tools foreground methods to make analysis more visual and to simplify it for novices. However, none of the tools we have seen is ideal. In the next chapter, we look to the future of statistical programming.

This idea—that programming will provide exercise for the highest mental faculties, and that the cognitive development thus assured for programming will generalize or transfer to other content areas in the child’s life—is a great hope. Many elegant analyses offer reasons for this hope, although there is an important sense in which the arguments ring like the overzealous prescriptions for studying Latin in Victorian times.

–  –  –

Given the capabilities of current tools, it is possible to imagine a new system which combines the strengths of existing tools with some of the abilities not yet possible. Considering these strengths and weaknesses, we can develop a list of requirements for the statistical programming tools of the future. The remainder of this chapter describes these requirements.

One major inspiration for the qualities that follow is a paper in which Repenning et al. outlined what they saw as the requirements for a “computational thinking tool” (Repenning et al., 2010). They posit that a computational thinking tool must fulfill all of the following conditions:

• “Has low threshold.” The tool does not take much time to get up to speed with, and students can easily jump into really ‘doing’ whatever it is the tool helps them do (in this case, statistics).

• “Has high ceiling.” The tool allows students to learn as much as they want and have access to the industry-standard methods.

• “Scaffolds flow.” The tool, together with the curriculum accompanying it, allows pieces to build on one another.

• “Enables transfer.” The tool teaches skills useful in other contexts (generally, computer science contexts).

• “Supports equity.” The tool should be easy to access for all types of students.

• “Systemic and sustainable.” The tool can be used to teach students at a variety of levels, and aligns with standards.

Also inspiring was John Tukey’s 1965 paper on the “technical tools of statistics” (Tukey, 1965), in which he describes his vision for the future of statistical programming tools. He argues statisticians should be looking for “(1) More of the essential erector-set character of data-analysis techniques, in which a kit of pieces are available for assembly into any of a multitude of analytical schemes, (2) an increasing swing toward greater emphasis on graphicality and informality of inference, (3) a greater and greater role for graphical techniques as aids to exploration and incisiveness, (4) steadily increasing emphasis on flexibility and on fluidity, (5) wider and deeper use of empirical inquiry, of actual trials on potentially interesting data, as a way to discover new analytic techniques, (6) greater emphasis on parsimony of representation and inquiry, on the focusing, in each individual analysis, of most of our attention on relatively specific questions, usually in combination with a broader spreading of the remainder of our attention to the exploration of more diverse possibilities” (Tukey, 1965). Given these requirements for a computational thinking tool and the various positive qualities of current tools for doing and teaching statistics, we hold that a statistical thinking tool bridging the gap between learning and doing statistics must provide the following:

–  –  –

All these requirements will be discussed in more detail in their respective sections.

4.1 Easy entry for novice users

Requirement 1: Easy entry for novice users.

This requirement comes directly from Repenning’s work on tools for computational thinking (Repenning et al., 2010). Tools to be used by novices – and really, all tools – should make it easy to get started. It should be clear what the tool does, how to use it, and what the most salient components are. The tool should provide immediate gratification, rather than a period of frustration that leads to success only if the user perseveres.

Some systems that are supposedly easy to get started with have startup times of around a week. In this context, we want novices to be examining data in a rich way within the first 10 or 15 minutes. With tools like TinkerPlots and Fathom this is possible within the first minute, so 10–15 minutes should not be unreasonable. In fact, depending on the curricular structure, novices can begin making plots within their first hour of using R, but the first lesson typically gets hung up unnecessarily on installation issues.

In the context of statistical programming tools, users should be able to jump directly into ‘doing’ data analysis without having to think about the minutiae of a particular data import function or the name of a plot type. As Biehler says, “In secondary education, but also in introductory statistics in higher education, using a command language system is problematical. We are convinced that a host system with a graphical user interface offers a more adequate basis” (Biehler, 1997). Thus, by Biehler’s definition, a system that provides easy entry for novices will likely have a visual component, either initially or throughout.

4.2 Data as a first-order persistent object

Requirement 2: Data as a first-order persistent object.

Perhaps the most important component of any data analysis platform is how it deals with data, or, more specifically, the way data are formatted and represented within the system. (In this context, we are thinking most specifically of how the data appear to the user, not how they are stored in the computer’s memory.) The issue of data representation is important at a number of levels.

First, the system must find a balance between the sanctity of data and its fallibility. When novices begin engaging with data, they often perceive data as infallible (Hancock et al., 1992). In a pedagogical setting, it is important for students to move toward an awareness of data’s subjectivity and to learn how to critique data and data products. However, the realization of the subjectivity of data can send students into a state of extreme skepticism, making arguments like “you can say anything with statistics!” In this nihilistic state, it can be hard to see the difference between the inherently subjective nature of data and the effects of intentional manipulation. Yet it is vitally important to the scientific process that data are not modified in this way. We need to treat data sets as complete objects without room for modification, while also identifying the weaknesses and biases that may be present within them.

As discussed in Sections 3.1 and 3.8, each analysis system gives data particular affordances. Excel does not privilege data as a complete object: once a user has a spreadsheet of data open, modification is just a click away, and the original value is lost forever. In contrast, R, which uses data as its primary object, makes it very difficult to modify the original data file. This paradigm implicitly encourages preserving the original data.
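The distinction is easy to demonstrate. The snippet below is not R but a plain-Python analogy (illustrative only, with invented data): a transformation that returns a new object leaves the original data intact, which is the behavior the text attributes to R.

```python
# Sketch of the "original data stay intact" paradigm described above,
# using plain Python as an analogy for R's copy-on-modify behavior.

original = [{"name": "A", "score": 41}, {"name": "B", "score": 47}]

# The transformation builds a *new* object rather than editing in place,
# the way an R function returns a modified copy of a data frame.
rescaled = [dict(row, score=row["score"] * 2) for row in original]

print(original[0]["score"])   # the source data are untouched: 41
print(rescaled[0]["score"])   # the derived data hold the new value: 82
```

A spreadsheet, by contrast, behaves like editing `original` in place: after the change, the value 41 is simply gone.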

A system privileging initial data should also be committed to providing data as the result of all actions. In systems like SAS and SPSS, results are often just text outputs that cannot be saved or incorporated into additional stages of the analysis. In R, by contrast, almost every result is itself data that can be used again. This is a design decision on the part of the language authors, and could be implemented in other systems. The main exception to this rule in R is base plots, which are ephemeral: they cannot be saved, other than by exporting them as image files. ggplot2 plots remedy this; they can be saved into objects, which is in fact the suggested use case. In the tool of the future, all results should be re-usable data, even plots.

Another important consideration is how the data are represented. One of the most common representations is a flat file, or rectangular data set. This representation is composed of rows and columns – observations and variables – and can generally be visualized as a spreadsheet. The data most naturally used in R are rectangular, particularly data stored as comma-separated values (.csv files). Hadley Wickham wrote a paper on ‘tidy’ data describing how a rectangular or flat data file should be structured (Wickham, 2014b). It specifies that every row should represent one case (e.g., a person, gene expression, or experiment), and every column should be a variable (i.e., something measured or recorded about the case). Wickham’s tidy requirements necessarily exclude hierarchical structures, but they do lead to neat rectangular datasets that avoid many sources of error.
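Wickham's one-row-per-case, one-column-per-variable rule can be sketched without any special library; the variable names below are invented for illustration (this is an analogy in Python, not the tidyr implementation).

```python
# An untidy "wide" layout: the pre/post columns hide a variable
# (which test was taken) in the column names.
wide = [
    {"name": "A", "pre": 10, "post": 14},
    {"name": "B", "pre": 12, "post": 11},
]

# "Melting" those columns yields one row per (case, test) pair:
# every row is one observation, every column is one variable.
tidy = [
    {"name": row["name"], "test": t, "score": row[t]}
    for row in wide
    for t in ("pre", "post")
]

print(tidy[0])  # -> {'name': 'A', 'test': 'pre', 'score': 10}
```

The tidy version has twice the rows but makes "test" an explicit variable, which is what lets generic grouping and plotting operations apply uniformly.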

However, novices who have not encountered data before often default to a list-based or hierarchical format for their data (Lehrer and Schauble, 2007; Finzer, 2014, 2013). This suggests that rectangular data may not be the most natural representation. Adults, particularly those who have taken a statistics class, will default to the format they were taught, typically a flat spreadsheet-like file, although they also tend to find the format challenging. So it is clear the affordances of a data analysis system have far-reaching implications for the people who learn on that system.

There are popular hierarchical and list-based formats, such as JSON and XML, but they are typically not introduced to novices. A more modern data analysis system might include these data types, and should attempt to find ways to represent them naturally.
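For instance, a hierarchical JSON record can be given a rectangular representation by repeating the parent fields on every child row. The record below is invented for illustration, using only the standard library.

```python
import json

# A small hierarchical (JSON) record of the kind mentioned above:
# a parent object containing a list of child records.
raw = json.loads("""
{"school": "Lincoln",
 "students": [{"name": "A", "grade": 7},
              {"name": "B", "grade": 8}]}
""")

# One way to flatten it into a rectangular form: one row per student,
# with the parent field repeated on each row.
rows = [
    {"school": raw["school"], "name": s["name"], "grade": s["grade"]}
    for s in raw["students"]
]
print(rows)
```

The flattening is lossy in spirit (the grouping must be re-derived from the repeated column), which is one reason a system might prefer to represent the hierarchy directly.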

One reason rectangular formatting has been popular with statisticians is it allows us to think of many operations on data as matrix manipulations (e.g., take the inverse, do a multiplication, decompose the whole thing and take some pieces out, find eigenvalues, etc.). Hierarchical data will likely require new metaphors or operations to clarify how the pieces fit together.
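The matrix operations listed above can be sketched for the 2x2 case in pure Python (illustrative only; a real analysis would use a linear-algebra library, and the matrix here is invented).

```python
import math

# A rectangular data set viewed as a matrix supports the operations
# the text lists: multiplication, inversion, eigenvalues.
A = [[4.0, 1.0],
     [2.0, 3.0]]

def matmul(X, Y):
    """Multiply two 2x2 matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inverse2(M):
    """Invert a 2x2 matrix via the adjugate / determinant formula."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[ M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det,  M[0][0] / det]]

def eigenvalues2(M):
    """Eigenvalues of a 2x2 matrix: roots of x^2 - trace*x + det."""
    tr = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    disc = math.sqrt(tr * tr - 4 * det)
    return ((tr + disc) / 2, (tr - disc) / 2)

print(matmul(A, inverse2(A)))  # identity matrix, up to rounding
print(eigenvalues2(A))         # -> (5.0, 2.0)
```

Nothing analogous exists yet for a nested record: there is no widely shared "multiply" or "decompose" for hierarchies, which is exactly the gap in metaphors the text points to.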
