«University of California Los Angeles Bridging the Gap Between Tools for Learning and for Doing Statistics A dissertation submitted in partial ...»
Another bespoke data visualization system is Lyra, which makes it easy for novices to create graphics in a drag-and-drop manner (Satyanarayan and Heer, 2014). Lyra was developed at the University of Washington Interactive Data Lab. Interestingly, Jeffrey Heer was a member of the Stanford Visualization Group that created Data Wrangler, and is now one of the founders of Trifacta.
d3 is a library for “manipulating documents based on data,” where here ‘documents’ refers to the document object model (DOM) of the web (Bostock et al., 2011; Bostock, 2013). It is commonly used to create interactive web visualizations. d3 is generally considered to be the work of Mike Bostock, but the paper introducing the library also lists Vadim Ogievetsky and Jeffrey Heer. d3 is a very general library, and cannot be considered to be a plotting library at all. It does not provide primitives like bars, boxes, or axes, as standard visualization systems do. Instead, it binds data to the DOM of a web page. Many of the pieces by the New York Times mentioned in Section 3.2 are based on d3, and are co-authored by Mike Bostock himself. Bostock has created a site where users can share ‘blocks’ they have created in d3, and he has contributed many of them (Bostock, 2015). While the sharing of code examples helps users get started, d3 is generally considered to be quite difficult to learn.
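To make the idea of binding data to document elements concrete, the following is a minimal conceptual sketch in Python. It is not d3's JavaScript API: the `Element` class and the `bind` function are hypothetical names introduced only for illustration. The point it tries to show is that each datum is attached to an element, and the element's visual attributes are then computed from that datum.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """A stand-in for a DOM node (hypothetical, for illustration only)."""
    tag: str
    datum: object = None
    attrs: dict = field(default_factory=dict)


def bind(elements, data, tag="rect"):
    """Pair each datum with an element, creating elements when too few exist."""
    bound = []
    for i, d in enumerate(data):
        # Reuse an existing element if one is available, otherwise create one.
        el = elements[i] if i < len(elements) else Element(tag)
        el.datum = d
        bound.append(el)
    return bound  # elements beyond len(data) are simply not returned


# Three data values, but only one pre-existing element in the "document".
document = [Element("rect")]
bars = bind(document, [4, 8, 15])

# Visual attributes are derived from the bound datum rather than drawn directly.
for i, bar in enumerate(bars):
    bar.attrs = {"x": i * 20, "height": bar.datum * 10}

print([(b.attrs["x"], b.attrs["height"]) for b in bars])
```

Nothing in the sketch knows what a bar chart is. The chart emerges only from how attributes are computed from data, which is one way to read the claim that a library of this kind is not a plotting library.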
Vega is an attempt to make it easier for novices to create the beautiful interactive graphics associated with d3 (Heer, 2014). It provides the sorts of graphical primitives more typically associated with data visualization tools: rect, area, and line. However, even with these primitives, Vega can be diﬃcult for novices in the same way all textual programming languages can be.
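As an illustration of what a declarative specification built from such primitives can look like, the sketch below assembles a small bar chart description as a Python dictionary and prints it as JSON. The layout (`data`, `scales`, `marks`, `encode`) follows recent Vega releases as far as I am aware; the version described above may have used different property names, and the data values are made up for the example.

```python
import json

# A minimal bar-chart description built from "rect" marks.
# The data values are invented for illustration.
spec = {
    "width": 200,
    "height": 100,
    "data": [
        {
            "name": "table",
            "values": [
                {"category": "A", "amount": 28},
                {"category": "B", "amount": 55},
                {"category": "C", "amount": 43},
            ],
        }
    ],
    "scales": [
        {"name": "x", "type": "band", "range": "width",
         "domain": {"data": "table", "field": "category"}},
        {"name": "y", "type": "linear", "range": "height",
         "domain": {"data": "table", "field": "amount"}},
    ],
    "marks": [
        {
            "type": "rect",  # one rectangle per row of the data
            "from": {"data": "table"},
            "encode": {
                "enter": {
                    "x": {"scale": "x", "field": "category"},
                    "width": {"scale": "x", "band": 1},
                    "y": {"scale": "y", "field": "amount"},
                    "y2": {"scale": "y", "value": 0},
                }
            },
        }
    ],
}

print(json.dumps(spec, indent=2))
```

Even in this small case the novice must name scales, fields, and mark properties correctly, which is the kind of textual overhead a drag-and-drop front end aims to hide.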
Enter Lyra, a tool to make the creation of Vega graphics very simple. It supports simple data transformation, like grouping based on a variable, but generally should only be considered to be a visualization tool, because it does not provide functionality for data cleaning, modeling, etc. It is a reproducible tool, because the resulting graphics are Vega graphics and can therefore be interrogated in the way standard Vega graphics can be (i.e., by looking at the code).
Lyra does not support interactive graphics creation, but it is likely the group will soon go in that direction. Figure 3.34 shows how the tool can be used to reproduce the famous Minard visualization of Napoleon’s march to Moscow (Satyanarayan and Heer, 2014).
Figure 3.34: Lyra
Bespoke data tools like these are great sources of inspiration for new ways to improve data cleaning, modeling, and visualization. Many of these projects are open-source, so while they do not cover the entire analysis trajectory, they show promise as tools for particular data needs.
3.11 Summary of currently available tools
As is probably clear, my preference for doing statistical analysis is R, for its status as a free and open source project, as well as its flexibility, extensibility, and community of users. However, I acknowledge that R is a very difficult tool to begin learning. There have been attempts to ease the transition to R at a variety of levels. These attempts include GUIs and IDEs for R, pedagogical frameworks like swirl and Data Desk, and fencing attempts including mosaic and the MobilizeSimple package discussed further in Section 22.214.171.124. However, none of these attempts has been able to fully remove the barrier to entry for R. On the learning end of the spectrum, tools like TinkerPlots and Fathom provide flexible and creative ways to explore data. They offer little barrier to entry, but do not support reproducible analysis or the sharing of results.
Many of the tools we have examined are inspirational, TinkerPlots and Fathom in particular, but also the bespoke tools Data Wrangler, Open Refine, Tableau, and Lyra. All of these tools foreground methods to increase the visual representation of analysis and to simplify it for novices. However, none of the tools we have seen is ideal. In the next chapter, we look to the future of statistical programming.
This idea—that programming will provide exercise for the highest mental faculties, and that the cognitive development thus assured for programming will generalize or transfer to other content areas in the child’s life—is a great hope. Many elegant analyses oﬀer reasons for this hope, although there is an important sense in which the arguments ring like the overzealous prescriptions for studying Latin in Victorian times.
Given the capabilities of current tools, it is possible to imagine a new system which combines the strengths of existing tools with some of the abilities not yet possible. Considering these strengths and weaknesses, we can develop a list of requirements for the statistical programming tools of the future. The remainder of this chapter describes these requirements.
One major inspiration for the qualities that follow is a paper in which Repenning et al. outlined what they saw as the requirements for a “computational thinking tool” (Repenning et al., 2010). They posit a computational thinking tool must fulfill all the following conditions:
• “Has low threshold.” It does not take much time to get up to speed with the tool, and students can easily jump into really ‘doing’ whatever it is the tool helps them do (in this case, statistics).
• “Has high ceiling.” The tool allows students to learn as much as they want and have access to the industry-standard methods.
• “Scaﬀolds ﬂow.” As related to the curriculum accompanying the tool, it allows for pieces to build on one another.
• “Enables transfer.” The tool teaches skills useful in other contexts (generally, computer science contexts).
• “Supports equity.” The tool should be easy to access for all types of students.
• “Systemic and sustainable.” The tool can be used to teach students at a variety of levels, and aligns with standards.
Also inspiring was John Tukey’s 1965 paper about the “technical tools of statistics” (Tukey, 1965), in which he describes his vision for the future of statistical programming tools. He argues statisticians should be looking for, “(1) More of the essential erector-set character of data-analysis techniques, in which a kit of pieces are available for assembly into any of a multitude of analytical schemes, (2) an increasing swing toward greater emphasis on graphicality and informality of inference, (3) a greater and greater role for graphical techniques as aids to exploration and incisiveness (4) steadily increasing emphasis on flexibility and on fluidity, (5) wider and deeper use of empirical inquiry, of actual trials on potentially interesting data, as a way to discover new analytic techniques (6) greater emphasis on parsimony of representation and inquiry, on the focusing, in each individual analysis, of most of our attention on relatively specific questions, usually in combination with a broader spreading of the remainder of our attention to the exploration of more diverse possibilities.” (Tukey, 1965)
Given these requirements for a computational thinking tool and the various positive qualities existing in current tools for doing and teaching statistics, we hold that a statistical thinking tool bridging the gap between learning and doing statistics must provide the following:
All these requirements will be discussed in more detail in their respective sections.
4.1 Easy entry for novice users
Requirement 1: Easy entry for novice users.
This requirement comes directly from Repenning’s work on tools for computational thinking (Repenning et al., 2010). Tools to be used by novices – and really, all tools – should make it easy to get started. It should be clear what the tool does, how to use it, and what the most salient components are. The tools should provide immediate gratification, rather than a period of frustration eventually leading to success assuming the user perseveres.
Some systems that are supposedly easy to get started using have startup times around a week – in this context, we want novices to be examining data in a rich way within the ﬁrst 10 or 15 minutes. With tools like TinkerPlots and Fathom, this is possible within the ﬁrst minute, so 10-15 minutes should not be unreasonable. In fact, depending on the curricular structure, novices can begin making plots within their ﬁrst hour of R, but typically the ﬁrst lesson is unnecessarily hung up on installation issues.
In the context of statistical programming tools, users should be able to jump directly into ‘doing’ data analysis without having to think about the minutiae of a particular data import function or the name of a plot type. As Biehler says, “In secondary education, but also in introductory statistics in higher education, using a command language system is problematical. We are convinced that a host system with a graphical user interface oﬀers a more adequate basis” (Biehler, 1997). Thus, by Biehler’s deﬁnition, a system that provides easy entry for novices will likely have a visual component, either initially or throughout.
4.2 Data as a first-order persistent object
Requirement 2: Data as a first-order persistent object.
Perhaps the most important component of any data analysis platform is how it deals with data, or, more specifically, the way data are formatted and represented within the system. The issue of data representation is important at a number of levels. (In this context, we are thinking most specifically of how the data appear to the user, not how they are stored within the computer’s memory system.)
First, the system must find a balance between the sanctity of data and its fallibility. When novices begin engaging with data, they often perceive data as infallible (Hancock et al., 1992). In a pedagogical setting, it is important for students to move toward an awareness of data’s subjectivity and to learn how to critique data and data products. However, the realization of the subjectivity of data can send students into a state of extreme skepticism, making arguments like “you can say anything with statistics!” In this nihilistic state, it can be hard to see the difference between the inherently subjective nature of data and the effects of intentional manipulation. However, it is vitally important to the scientific process that data are not modified in this way. We need to treat data sets as complete objects without room for modification, while also identifying the weaknesses and biases that may be present within them.
As discussed in Sections 3.1 and 3.8, data are given particular aﬀordances in each analysis system. Excel does not privilege data as a complete object, because once a user has a spreadsheet of data open, modiﬁcation is just a click away, and the original value is lost forever. In contrast, R, which uses data as its primary object, makes it very diﬃcult to modify the original data ﬁle. This paradigm helps provide implicit valuing of maintaining original data.
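The contrast can be sketched in a few lines. The example below is in Python rather than R, and the file name and column names are made up for illustration. It assumes only the general pattern described above: reading a file produces an in-memory copy, and transformations act on that copy, so the original file is untouched unless the user explicitly writes it back.

```python
import csv
import os
import tempfile

# Create a small data file to stand in for the "original data".
path = os.path.join(tempfile.mkdtemp(), "heights.csv")
with open(path, "w", newline="") as f:
    f.write("name,height_cm\nAna,160\nBen,175\n")

with open(path, newline="") as f:
    original_text = f.read()

# Reading produces an in-memory copy of the data.
with open(path, newline="") as f:
    rows = list(csv.DictReader(f))

# Transformations build new objects and never write back to the file.
in_meters = [
    {"name": r["name"], "height_m": int(r["height_cm"]) / 100} for r in rows
]

# The file on disk is byte-for-byte what it was before the analysis.
with open(path, newline="") as f:
    unchanged = f.read() == original_text

print(in_meters, unchanged)
```

In a spreadsheet, by comparison, the converted values would typically overwrite the cells they came from, and the data and the analysis would no longer be separable.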
A system privileging initial data should also have a commitment to providing data as the result of all actions. In systems like SAS and SPSS, results are often just text outputs that cannot be saved or incorporated into additional stages of the analysis. However, in R, almost every result is itself data that can be used again. This is a design decision on the part of the language authors, and could be implemented in other systems. The main exception to this rule in R is base plots, which are ephemeral. They cannot be saved, other than exporting them as image files. However, ggplot2 plots remedy this. They can be saved into objects, which is in fact the suggested use case. In the tool of the future, all results should be reusable data, even plots.
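A minimal sketch of the “results are data” idea follows, again in Python with made-up values. The `summarize` and `plot_spec` functions are hypothetical helpers, not part of any of the systems discussed. The summary is returned in the same record format as its input, so it can be summarized again, and the plot is represented as a plain data structure that can be stored and inspected rather than only drawn.

```python
from statistics import mean

observations = [
    {"group": "a", "value": 2.0},
    {"group": "a", "value": 4.0},
    {"group": "b", "value": 10.0},
]


def summarize(rows):
    """Return group means as a list of records, i.e., as data."""
    groups = {}
    for row in rows:
        groups.setdefault(row["group"], []).append(row["value"])
    return [{"group": g, "value": mean(v)} for g, v in groups.items()]


def plot_spec(rows):
    """Describe a bar plot as a data structure instead of drawing it."""
    return {
        "type": "bar",
        "x": [r["group"] for r in rows],
        "y": [r["value"] for r in rows],
    }


# The result of one step is valid input to the next.
means = summarize(observations)
overall = summarize([{"group": "all", "value": r["value"]} for r in means])

# The plot is itself an object that can be saved, inspected, or modified.
plot = plot_spec(means)
print(means, overall, plot)
```

A text-only output, by contrast, would have to be retyped or parsed before it could feed a second stage of analysis.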
Another important consideration is how the data are represented. One of the most common data representations is a ﬂat ﬁle or rectangular data set. This representation is composed of rows and columns – observations and variables – and can generally be visualized as a spreadsheet. The data most naturally used in R are rectangular, particularly those data that come stored as comma separated values (.csv ﬁles). Hadley Wickham wrote a paper on ‘tidy’ data which describes the way a rectangular or ﬂat data ﬁle should be structured (Wickham, 2014b). It speciﬁes that for a ﬂat ﬁle, every row should represent one case (e.g., a person, gene expression, or experiment), and every column should be a variable (i.e., something measured or recorded about the case). Wickham’s tidy requirements necessarily exclude hierarchical structures, but do lead to neat rectangular datasets that avoid many error sources.
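As a small illustration of the one-row-per-case, one-column-per-variable rule, the sketch below reshapes a “wide” table, in which values of a variable (here, years) have ended up as column headers, into a tidy one. The data and the `to_tidy` helper are hypothetical and written in plain Python; they are meant as one reading of the tidy rules, not as Wickham’s own tooling.

```python
# An untidy layout: the year, which is a variable, is spread across columns.
wide = [
    {"person": "Ana", "2013": 160, "2014": 162},
    {"person": "Ben", "2013": 175, "2014": 175},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Turn every non-id column into its own row of (id, variable, value)."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return tidy


# Each resulting row is one case: one person measured in one year.
tidy = to_tidy(wide, "person", "year", "height_cm")
for row in tidy:
    print(row)
```

After reshaping, every column holds exactly one kind of measurement, which is what lets later steps refer to “year” or “height_cm” by name.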
However, novices who have not encountered data before often default to a list-based or hierarchical format for their data (Lehrer and Schauble, 2007; Finzer, 2014, 2013). This suggests that rectangular data may not be the most natural representation. Adults, particularly those who have taken a statistics class, will default to the format they were taught, typically a flat spreadsheet-like file, although they also tend to find the format challenging. So it is clear the affordances of a data analysis system have far-reaching implications for the people who learn on that system.
There are popular hierarchical and list-based formats, such as JSON and XML, but they are typically not introduced to novices. A more modern data analysis system might include these data types, and should attempt to ﬁnd ways to represent them naturally.
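To see why hierarchical data do not map cleanly onto rows and columns, consider the sketch below. It is in Python, and the classroom record is invented for illustration. Flattening a nested JSON record into a rectangle forces the outer values to be repeated on every row and requires a placeholder wherever an inner list is empty.

```python
import json

# A nested record: one class, several students, each with a list of pets.
text = """
{"class": "Period 1",
 "students": [
   {"name": "Ana", "pets": ["cat", "dog"]},
   {"name": "Ben", "pets": []}
 ]}
"""
record = json.loads(text)

rows = []
for student in record["students"]:
    # A student with no pets still needs a row, so pad with None.
    pets = student["pets"] or [None]
    for pet in pets:
        rows.append(
            {"class": record["class"], "student": student["name"], "pet": pet}
        )

for row in rows:
    print(row)
```

The nested form states each fact once, while the flat form repeats the class and the student name. This is part of why a system that represents hierarchy directly may feel more natural to novices.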
One reason rectangular formatting has been popular with statisticians is that it allows us to think of many operations on data as matrix manipulations (e.g., take the inverse, do a multiplication, decompose the whole thing and take some pieces out, find eigenvalues, etc.). Hierarchical data will likely require new metaphors or operations to clarify how the pieces fit together.
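A familiar instance of this is least-squares regression, where the coefficients can be written as the product of an inverse and two multiplications, (X'X)^(-1) X'y. The sketch below works this out in plain Python for a made-up data set with one predictor. The helper functions are hypothetical, and the 2x2 inverse is hard-coded to keep the example short, so this is an illustration rather than a general implementation.

```python
def transpose(m):
    """Swap the rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse_2x2(m):
    """Invert a 2x2 matrix using the closed-form formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Each row is one observation: a constant 1 for the intercept, then x.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [[1.0], [3.0], [5.0], [7.0]]  # generated exactly from y = 1 + 2x

# beta = (X'X)^(-1) X'y
Xt = transpose(X)
beta = matmul(inverse_2x2(matmul(Xt, X)), matmul(Xt, y))

# Prints values close to an intercept of 1 and a slope of 2.
print(beta)
```

The whole computation relies on the data being a rectangle: rows are cases, columns are variables, and every cell is filled. A nested structure offers no such direct algebra.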