University of California, Los Angeles
Bridging the Gap Between Tools for Learning and for Doing Statistics
A dissertation submitted in partial ...
In one data collection exercise from Mobilize, students collect data on their snacking habits, and data are aggregated by class on a centralized server. When they download the aggregated data, it is represented in a rectangular format (Figure 4.1). However, it could also be represented in a more hierarchical format, as shown in Figure 4.2. Considering the current curricular tasks we expect students to complete, some seem difficult using this representation. For example, subsetting is less straightforward: selecting only the snacks where the reason given was "hungry" would require removing the matching pieces and creating an entirely new data structure with the time of day attached.
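The contrast can be sketched with toy data. The field names below (time_of_day, snack, reason) are invented for illustration, not the actual Mobilize schema: the flat, rectangular form filters in a single pass, while the nested form must be rebuilt to keep the time of day attached to each matching snack.

```python
# Toy snack data in rectangular (flat) form: one row per snack event.
flat = [
    {"time_of_day": "morning", "snack": "chips", "reason": "hungry"},
    {"time_of_day": "morning", "snack": "gum", "reason": "bored"},
    {"time_of_day": "evening", "snack": "fruit", "reason": "hungry"},
]

# Subsetting the flat form is one filtering pass.
hungry_flat = [row for row in flat if row["reason"] == "hungry"]

# The same data in hierarchical form: snacks nested under time of day.
nested = {
    "morning": [
        {"snack": "chips", "reason": "hungry"},
        {"snack": "gum", "reason": "bored"},
    ],
    "evening": [{"snack": "fruit", "reason": "hungry"}],
}

# Subsetting the nested form means rebuilding the whole structure so that
# each matching snack keeps its time of day.
hungry_nested = {
    time: [s for s in snacks if s["reason"] == "hungry"]
    for time, snacks in nested.items()
}
```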
Another common initial assumption is data should always be visualized in such a way that all of the values can immediately be read. In conversations with data journalists, they often cite reading data values in a spreadsheet as their ﬁrst line of inquiry, so this practice may be useful. However, there are other possibilities for the highest level view of data. For example, Victor Powell’s CSV Fingerprint project provides an extremely high-level view of data, and uses colors to indicate data types and missing data (Powell, 2014). See Figure 4.3 for an example of this visualization.
Figure 4.3: ‘CSV Fingerprint’ dataset visualization by Victor Powell
Because so much of the interesting data on the web are made available through application programming interfaces (APIs), it is crucial to provide smooth integration with them. An API connection helps users quickly engage with interesting topics while requiring little prior contextual knowledge (Gould, 2010). Several of the existing tools for statistical analysis provide API connections, Fathom being the most notable on the statistical education side (Finzer, 2002a). However, the interfaces are often clunky.
Another open area of research focuses on how data could be surrounded by more inherent documentation. When using API data, a user will typically perform a direct call to the API, store the data locally during their work session, and save results. However, they often do not cache the API data. The next time the user runs the analysis, they make another API call and get the most current data. This workﬂow means the analysis must be very reproducible, as the data cleaning steps need to work for slight variations in the supplied data.
However, because APIs rely on internet services, there is the possibility the data provider could 'go dark' and stop providing data. To protect against this, there is a growing desire to provide functionality for caching API data calls so the most recent data can continue to be used in these cases (Ogden et al., 2015). (In this context, APIs specifically refer to data access APIs. The term API is more general, and can be used to refer to any programming language specification, but statisticians do not use it this way.)
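A minimal sketch of this caching idea, using only the standard library. The fetch argument stands in for a hypothetical API client, and the cache file name is invented: a successful call stores the response locally, and if the provider later goes dark, the most recent cached copy is used instead.

```python
import json
import os

CACHE_FILE = "api_cache.json"  # hypothetical local cache location

def fetch_with_cache(fetch, cache_file=CACHE_FILE):
    """Try a live API call; on success cache the result, on failure
    fall back to the most recently cached data."""
    try:
        data = fetch()  # live call to the data provider
        with open(cache_file, "w") as f:
            json.dump(data, f)
        return data
    except OSError:
        # Provider unreachable ('gone dark'): reuse the last good copy.
        if os.path.exists(cache_file):
            with open(cache_file) as f:
                return json.load(f)
        raise
```

A fuller version might also record when each cached call was made, which connects to the metadata question below.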
There is also a need for better metadata about API data, and about data in general. Ideally, API data would be accompanied with information about how it was retrieved and when, and perhaps more context about how it was collected initially. Fuller data documentation ﬁts in with the idea of documentation as part of the process (discussed in Section 4.8), because the data would come with more of a story from the start.
4.3 Support for a cycle of exploratory and confirmatory analysis

Requirement 3: Support for a cycle of exploratory and confirmatory analysis.
Statistical thinking tools should always promote exploratory analysis. Exploratory Data Analysis (EDA) was proposed by John Tukey in his 1977 book of the same name. Although it has its own acronym, EDA is not very complicated; it is simply the idea that we can glean important information about data by exploring it. Instead of starting with high-powered statistical models, Tukey proposes doing simple descriptive statistics and making many simple graphs – of one variable or several – in order to look for trends. Indeed, humans looking at graphs are often better than computers at identifying trends in data (Kandel et al., 2011a).
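In Tukey's spirit, a first pass needs nothing more than simple summaries and crude displays. A sketch using only Python's standard library (the data values are invented for illustration):

```python
import statistics

# Invented example measurements, e.g. snack counts per student.
values = [2, 3, 3, 4, 5, 5, 5, 7, 9, 12]

# Simple descriptive statistics, in the spirit of Tukey's hand methods.
summary = {
    "min": min(values),
    "median": statistics.median(values),
    "mean": statistics.mean(values),
    "max": max(values),
}

# A crude text display, one bar per value: no plotting library needed,
# yet skew and clustering are already visible.
for v in values:
    print(f"{v:>3} | " + "*" * v)
```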
EDA is an engaging and empowering way to introduce novices to statistics, but introductory courses do not always include it, perhaps because it requires computational skills, or because it can strike teachers as too 'soft' a skill to be truly useful. In particular, the math teachers we have trained as part of the Mobilize grant (Section 5.1) are often wary of situations where there is more than one possible answer.
Although EDA can appear as a ‘soft’ or subjective practice, there are situations where it is the best and richest method for analysis. For example, to make inference using formal statistical tests, data must be randomly collected.
But in many situations (including the participatory sensing context discussed in Section 5.1.1) data are either non-random, or comprise a census. In these situations, EDA is the best method for ﬁnding patterns in the data and performing informal inference.
When Tukey wrote his book, computers were not in everyday use, so his text suggests using pencil and paper to produce descriptive statistics and simple plots. Today, the tasks listed in the book are even easier because of the many computer tools facilitating them, and the results are analogous to those made by hand (Curry, 1995; Friel, 2008). However, simply being able to implement the things Tukey did by hand in the 1970s is not enough; computers should be enabling even more powerful types of exploration (Pea, 1985).
Many tools for teaching statistics do support rapid exploration and prototyping; in fact, this is one area where tools for doing statistics fall short. In R, for example, creating multiple graphs takes effort, as does experimenting with parameter values.
A sense of play is important to data analysis, which is never a linear process from beginning to end. Instead, data scientists repeatedly cycle back through questioning, exploration, and conﬁrmation or inference.
There are many opinions as to what the cycle entails. In the Mobilize Introduction to Data Science course we use a cycle of statistical questions, collecting data, analyzing data, and interpretation, based on the GAISE guidelines discussed in Section 1.2 (Franklin et al., 2005). Because this is a cycle, interpreting leads back to questioning. Ben Fry suggests data analysis takes the shape of a cycle comprising the steps of acquiring, parsing, ﬁltering, mining, representing, reﬁning, and interacting (Fry, 2004). These cycles can be thought of as either exploratory or conﬁrmatory, and the two complement each other. If users ﬁnd something interesting in a cycle of exploratory analysis, they need to follow with conﬁrmatory analysis.
The complementary exploratory and conﬁrmatory cycles were suggested by Tukey, and have been re-emphasized by current educators (Biehler et al., 2013). There are several ways to bring the two cycles together. One method uses graphics to perform inference, discussed in Section 4.5, and thus brings exploratory and conﬁrmatory analysis together. Andrew Gelman thinks modelers should be using simulation-based methods to check their models, and people doing exploratory data analysis should be ﬁtting models and making graphs to show the patterns left behind (Gelman, 2004).
The diﬀerence between exploratory and conﬁrmatory analysis (or informal and formal inference) is like the diﬀerence between sketching or taking notes and the act of writing an essay. One is more creative and expansive, and the other tries to pin down the particular information to be highlighted in the ﬁnal product. A system supporting exploration and conﬁrmation should provide a workﬂow connecting these two types of activities. Users need ‘scratch paper,’ or a place to play with things without them being set in stone. While data analysis needs to leave a clear trail of what was done so someone else can reproduce it, a scratch paper environment might allow a user to do things not ‘allowed’ in the ﬁnal product, like moving data points around. This connects to Biehler’s goal of easily modiﬁable draft results (Biehler, 1997).
Many current systems for teaching statistics provide simple sketching-like functionality (allowing users to manipulate data or play with graphic representations), but the transition to an essay-like format is more complex. An essay requires a solid line of reasoning strung through it, and the question then becomes how readers interact with it. We also want the user to be able to see the multiplicity of possibilities (the 'what-ifs'), while maintaining the ability to reset to reality.
If the system kept a transcript of all the actions taken in the sketching area, the resulting analysis possibilities would be analogous to the set of sketches an artist or designer produces in the first stages of a project (a 'charrette'). The goal is to produce and document many ideas, ideally all very different from one another. The charrette process is useful in art because it allows an artist to provide provocations instead of prototypes (Bardzell et al., 2012). Provocations drive thinking forward, while prototypes tend to lock in a particular direction.
In order to support this, the system could provide a way to save many analyses and look through the collection.
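One way such a system could work, sketched with invented names: every action in the sketching area is appended to a transcript, named snapshots let the user save many candidate analyses and browse the collection, and a reset always returns to reality.

```python
class Sketchbook:
    """Hypothetical 'scratch paper' session: records every action taken
    and lets the user snapshot candidate analyses by name."""

    def __init__(self, data):
        self.original = list(data)   # 'reality', always recoverable
        self.data = list(data)
        self.transcript = []         # ordered log of actions taken
        self.snapshots = {}          # named saved analyses

    def apply(self, description, action):
        """Apply a transformation to the working data and log it."""
        self.data = action(self.data)
        self.transcript.append(description)

    def snapshot(self, name):
        """Save the current state and its history under a name."""
        self.snapshots[name] = (list(self.data), list(self.transcript))

    def reset(self):
        """Return to reality: discard sketch edits, keep the transcript."""
        self.data = list(self.original)
        self.transcript.append("reset to original data")
```

For example, a user might apply a sort, snapshot the result as one candidate analysis, then reset and try a different direction, accumulating a browsable charrette of attempts.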
In the context of art, once the charrette is complete, one sketch is converted into a finished product. In statistics, we want to support the same type of trajectory. Artists look back and forth between a chosen sketch and their in-progress painting as they complete it. With data, the user needs to make sure the trajectory of analysis is solid, so the relationship between a less-detailed sketch and a finished product is less clear.
The book “Infographic Designers’ Sketchbooks” provides a glimpse into how designers of infographics and data visualizations work. Designers of infographics typically start with pen-and-paper sketches, and they use digital tools as digital paint (occasionally, they will speak about sketching in Photoshop or Illustrator).
In contrast, designers of visualizations grounded in data (Mike Bostock, Tony Chu, Cooper Smith, Moritz Stefaner) begin by 'sketching' using code. They use a variety of computer tools to do this (e.g., Bostock's sketches look to be done in R, and Smith sketches in Processing), but they are not explicitly mapping color to paper the way the infographic designers are (Heller and Landers, 2014).
While it is not precisely clear how these diﬀerent styles of creation could be incorporated into one tool, the LivelyR work discussed in Section 5.2 shows some methods for allowing both in the same interface.
4.4 Flexible plot creation

Requirement 4: Flexible plot creation.
To fully support exploratory data analysis, a tool needs to emphasize plotting.
Computers make it possible to visually explore large datasets in a way not possible before. For example, Jacques Bertin developed a method for reordering matrices of unordered data in the 1960s. At the time, Bertin’s method involved cutting up paper representations of matrices or creating custom physical representations, then reordering the rows and columns by hand (Bertin, 1983). Much like the graphical methods John Tukey developed by hand in the ‘60s, Bertin’s methods are now easily accessible on the computer.
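Bertin's hand process is easy to approximate on a computer. The sketch below uses one simple heuristic (a stand-in, not Bertin's actual procedure): sort the rows, then the columns, of a 0/1 matrix by descending totals, so that heavy rows and columns cluster together and structure becomes visible.

```python
def reorder(matrix):
    """Sort rows, then columns, of a 0/1 matrix by descending totals:
    a simple computational stand-in for Bertin-style manual reordering."""
    # Order rows by row sums, largest first.
    rows = sorted(matrix, key=sum, reverse=True)
    # Order columns by column sums, largest first.
    cols = sorted(range(len(rows[0])),
                  key=lambda j: sum(r[j] for r in rows),
                  reverse=True)
    return [[row[j] for j in cols] for row in rows]

# An unordered toy matrix; after reordering, the 1s pile up
# toward the top-left corner.
example = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
tidy = reorder(example)
```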
Providing easy plotting functionality is a goal of almost any tool, both for learning and for doing statistics. However, there are two perspectives on plotting: one is to provide appropriate default plots based on data types, and the other is to allow the flexible creation of any plot a user can think of.
The use of appropriate defaults is the perspective driving base R graphics, where the generic plot() function produces different graph types depending on the data passed to it.
There are also eﬀorts to provide default visualizations representing all variables in a particular data set, such as the generalized pairs plot, which can be
used to automatically visualize all two-variable relationships (Emerson et al., 2013). There are also many bespoke data visualization systems (discussed further in Section 3.10) that automatically oﬀer the ‘best’ plots for each variable in the data (Miller, 2014; Mackinlay et al., 2007). Some of the most common defaults are enumerated in Table 4.1.
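The default-plot perspective amounts to a lookup from variable types to chart types, in the spirit of R's generic plot(). The table below is illustrative only (it is not Table 4.1, though the pairings follow common convention):

```python
# Illustrative mapping from variable type(s) to a conventional default plot.
# Keys are stored in sorted order so mixed pairs match regardless of order.
DEFAULT_PLOT = {
    ("numeric",): "histogram",
    ("categorical",): "bar chart",
    ("numeric", "numeric"): "scatterplot",
    ("categorical", "numeric"): "side-by-side boxplots",
    ("categorical", "categorical"): "mosaic plot",
}

def default_plot(*var_types):
    """Pick a conventional default chart from the variable types,
    a stand-in for plot()-style dispatch on data types."""
    key = tuple(sorted(var_types))
    return DEFAULT_PLOT.get(key, "no standard default")
```

Under this paradigm, the student is asked to internalize exactly this lookup table.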
The matching of plots to data is in line with most educational standards.
When mathematical educational standards reference statistical graphics, they tend to do so in a rote way, suggesting students should learn how to pick the most appropriate standard plots for their data. Educators of this mindset will often critique students' visualization choices, and emphasize that students should be able to choose appropriate plots for their data or otherwise make, read, and interpret standard plots (e.g., scatterplots, bar charts, histograms) (National Governors Association Center for Best Practices and Council of Chief State School Officers, 2010). In other words, students should serve as the plot-choosing algorithm, learning to make the mappings outlined in Table 4.1.
This is also the paradigm spreadsheet tools like Excel, or route-type tools (Bakker, 2002) implicitly expect, because they only oﬀer standard visualization types. However, there is a growing body of research suggesting learning by rote, both in general and particularly with respect to statistical graphics, is not an eﬀective way to embed knowledge. Instead, researchers suggest students should learn to develop their own representations of data.
The development of new data representations falls on the other end of the
spectrum. In Leland Wilkinson’s Grammar of Graphics, Hadley Wickham’s ggplot2, TinkerPlots, and Fathom, users can build up novel plots from any type of data encoding they ﬁnd appropriate (Wilkinson, 2005; Wickham, 2008; Konold and Miller, 2005; Finzer, 2002a). While the Grammar of Graphics and ggplot2 are intended for use by advanced data analysts, the inclusion of this type of ﬂexible plotting in the learning tools TinkerPlots and Fathom points to the cognition research about visual encodings.
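The grammar-of-graphics idea can be sketched as data, plus a mapping of variables to visual encodings, plus a geometry, composed freely rather than chosen from a fixed chart menu. The structure below is an invented, minimal analogue of ggplot2's layered aes()/geom design, not its actual API:

```python
# A plot is data + aesthetic mappings + a geometry, assembled piecewise.
def plot_spec(data, mapping, geom):
    """Build a declarative plot specification in grammar-of-graphics style."""
    return {"data": data, "mapping": mapping, "geom": geom}

data = [{"time": "morning", "snacks": 2}, {"time": "evening", "snacks": 5}]

# The same data yield many different plots just by swapping the
# encodings or the geometry, rather than picking from a chart menu.
bars = plot_spec(data, {"x": "time", "y": "snacks"}, "bar")
dots = plot_spec(data, {"x": "time", "y": "snacks", "size": "snacks"}, "point")
```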