# «University of California Los Angeles Bridging the Gap Between Tools for Learning and for Doing Statistics A dissertation submitted in partial ...»

Learners tend to move through a standard sequence of levels of understanding as they begin to map data to visual representations. Typically, their ﬁrst representations are idiosyncratic (Watson and Fitzallen, 2010). That is, they have some relation to the data at hand, but typically do not explicitly represent it. For example, when students are asked to represent a set of random draws from a collection, an idiosyncratic representation might be a drawing of a person taking a slip out of a bowl.

Students move on to a case-value plot. In case-value plots, every data value is explicitly shown on the plot. In other words, there is no abstraction away from the raw data, simply a visual encoding. Visualizations like dot plots and scatterplots are barely abstractions. They allow a reader to easily retrieve the exact values of the data. Studies show students start with case-value plots and gradually move toward more abstract representations (Kader and Mamer, 2008).

Students will often initially think a graph representing speed is actually representing location, or a graph of growth is a graph of height. They have diﬃculty with the abstractions. Instead, they think the graph is representing the most ‘real’ thing possible (Shah and Hoeﬀner, 2002).

The next level of abstraction is classifying (e.g., stacking similar cases together and then abstracting them as a bar, as in bar charts and histograms) (Konold et al., 2014). These visualizations add a count abstraction. The original data are still retrievable, but the reader must understand the height of the bar has encoded how many times a particular value has been seen. Even more abstract is the histogram. A reader could generate data that would produce the histogram (randomly choosing n values between a and a + x for all y bins in the histogram, where n is the height of the bar in the histogram, and (a, a + x] is the range of the bin), but the resulting data would not necessarily be the same as the original. It would preserve some qualities of the data, but not necessarily even the true summary statistics of center and spread.
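The regeneration procedure described in the parenthetical above can be sketched in a few lines. This is only an illustration under stated assumptions: the data values, the bin start, and the bin width are made up, and the helper names `bin_counts` and `regenerate` are hypothetical. It shows that the regenerated data reproduce the bar heights while the individual values, and therefore the summary statistics, can drift.

```python
import random
import statistics


def bin_counts(values, start, width, n_bins):
    """Count how many values fall in each half-open bin (a, a + width]."""
    counts = [0] * n_bins
    for v in values:
        for i in range(n_bins):
            a = start + i * width
            if a < v <= a + width:
                counts[i] += 1
                break
    return counts


def regenerate(counts, start, width, rng):
    """Draw n uniform values inside each bin, where n is that bin's bar height."""
    out = []
    for i, n in enumerate(counts):
        a = start + i * width
        # 1 - rng.random() lies in (0, 1], so each draw lands in (a, a + width].
        out.extend(a + width * (1.0 - rng.random()) for _ in range(n))
    return out


# Made-up data, binned into (0, 1], (1, 2], (2, 3], (3, 4].
original = [0.1, 0.2, 0.3, 1.9, 2.0, 2.05, 2.1, 3.5]
counts = bin_counts(original, start=0.0, width=1.0, n_bins=4)
synthetic = regenerate(counts, start=0.0, width=1.0, rng=random.Random(1))

print("bar heights:", counts)
print("original mean:", statistics.fmean(original))
print("regenerated mean:", statistics.fmean(synthetic))
```

Both datasets would draw the same histogram, yet nothing forces their means or spreads to agree.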

The final level of abstraction is where data are fully abstracted to summary values, as in a box plot (Konold et al., 2014). In these plots, there is no way to map back to the original data. The reader could generate data that would produce the same box plot, but those data might have almost no correspondence to the original data. Modifications to the box plot, such as violin and vase plots, reduce the abstraction level slightly, but still often leave the size of the dataset abstract (Wickham and Stryjewski, 2011). Some examples of common data representations and their abstraction level are shown in Table 4.2.
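To make the loss of information concrete, the following sketch computes a five-number summary for two made-up datasets of different sizes. The quartile rule used here (medians of the lower and upper halves, excluding the middle value when the count is odd) is one common convention and is an assumption; plotting software may use a different one. The function name `five_number_summary` is hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (min, lower hinge, median, upper hinge, max)."""
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]
    # Skip the middle value when the number of observations is odd.
    upper = s[half + 1:] if len(s) % 2 else s[half:]
    return (
        s[0],
        statistics.median(lower),
        statistics.median(s),
        statistics.median(upper),
        s[-1],
    )


# Two made-up datasets with different values and different sizes.
evenly_spread = [1, 2, 3, 4, 5, 6, 7, 8, 9]
clumped = [1, 2.5, 2.5, 5, 7.5, 7.5, 9]

print(five_number_summary(evenly_spread))
print(five_number_summary(clumped))
```

Under this quartile rule, both datasets yield the same five numbers and so would produce the same box plot, even though one has nine observations and the other seven.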

By providing a tool that grows as a user learns, we can support learners on their path toward more abstraction. Rolf Biehler emphasized the importance of “co-evolution” of user and tool (Biehler, 1997). TinkerPlots intentionally supports the natural trajectory of building up from less abstract plots, as its default plot of a single variable is just a set of dots displayed randomly in the plot window. However, while TinkerPlots supports the natural sequence of building graph understandings, it is characterized as a landscape-type tool (Bakker, 2002), because it does not prescribe this particular trajectory. If a student had a different natural inclination, she could build representations that felt natural to her. However, even with a landscape-type tool, the affordances of a tool such as TinkerPlots impact the sorts of plots and reasoning users develop (Hammerman and Rubin, 2004).

Table 4.2: Levels of abstraction in standard statistical plots

Again, the standard (named) plot types mentioned in Table 4.2 do not represent an exhaustive list of possible visualization methods, and users should have the opportunity to build their own encodings.

In fact, being able to create unique visual data representations may help users to better understand nonstandard visualizations they encounter in the wild. That is, learning to encode data visually will help them decode other visuals (Meirelles, 2011).

Most of the plot types described here are simple visualizations of one or two variables. But again, we want computational tools to do more than amplify human abilities (Pea, 1985). Methods like the Grand Tour (Cook et al., 1995; Buja and Asimov, 1986; Asimov, 1985) or the generalized pairs plot (Emerson et al., 2013) can allow humans to look for patterns in more dimensions.

Beyond providing an interface to ﬂexibly create novel plot types, the tool should support graphs as an interface to the data (Biehler, 1997). Behaviors like brushing and linking should do dynamic subsetting (Few, 2010).

As Biehler suggests, the tool should provide functionality for formatting, interacting with, and enhancing graphics (Biehler, 1997). Formatting covers settings like zoom, scale, data symbols, and graph elements. Interaction should allow for actions like select, group, modify, and measure. Enhancing should allow for labeling and for the inclusion of statistical information and other variables (Biehler, 1997).

Some current tools for learning statistics allow users to draw pictures on top of their data, circling interesting features and providing annotations. Drawing functionality could be enhanced to become another method for interacting with the system. Researchers in the Communications Design Group are thinking about the ways in which drawing could become another level of vocabulary for humans to use as they interact with the computer. Ken Perlin has developed a tool he calls Chalktalk, which allows him to draw from a vocabulary of simple drawings to create the impression of interaction in his presentations (Perlin, 2015). For example, he might draw a crude pendulum and set it swinging. The behavior of these gestures has been coded behind the scenes, but they allow him to quickly and fluidly show examples almost like sketches. There is also work being done to add ‘smart’ drawing features to Lively Web (Section 5.2). This work is so new there is not yet a good pointer to it, but Calmez et al. (2013) describe the initial work making it possible.

The system should also make it possible to see multiple coordinated views of everything in the user’s environment. Rolf Biehler suggests a multiple window environment to allow for easy comparisons (Biehler, 1997). The importance of a coordinated view is supported by researchers who suggest allowing for multiple views of the same data may help students gain a more intuitive understanding (Shah and Hoeﬀner, 2002; Bakker, 2002). In many systems, this is supported by brushing and linking (Wilkinson, 2005).
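One way to picture brushing and linking is as a single shared selection that every view of the data listens to. The sketch below is a toy model under that assumption, not a description of how any existing tool implements it; the class names `LinkedSelection` and `ColumnView` and the data values are hypothetical.

```python
class LinkedSelection:
    """Shared set of selected row indices; every attached view is notified."""

    def __init__(self):
        self.selected = set()
        self._views = []

    def attach(self, view):
        self._views.append(view)

    def brush(self, row_ids):
        # Brushing in any one view replaces the shared selection ...
        self.selected = set(row_ids)
        # ... and linking pushes it to every view of the same data.
        for view in self._views:
            view.highlight(self.selected)


class ColumnView:
    """Stand-in for a plot of one column; it only records what it would highlight."""

    def __init__(self, rows, column):
        self.rows = rows
        self.column = column
        self.highlighted = {}

    def highlight(self, selected):
        self.highlighted = {i: self.rows[i][self.column] for i in sorted(selected)}


# Made-up data: each row is one case.
rows = [
    {"height": 150, "weight": 52},
    {"height": 172, "weight": 70},
    {"height": 181, "weight": 85},
]
selection = LinkedSelection()
height_view = ColumnView(rows, "height")
weight_view = ColumnView(rows, "weight")
selection.attach(height_view)
selection.attach(weight_view)

# Brush the tall cases in the height view; the weight view subsets to the same rows.
selection.brush(i for i, row in enumerate(rows) if row["height"] > 160)
print(height_view.highlighted)
print(weight_view.highlighted)
```

Because the selection is stored as row indices rather than as values of one variable, any number of coordinated views can subset to the same cases.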

4.5 Support for randomization throughout

Requirement 5: Support for randomization throughout.

Computers have made it possible to use randomization and bootstrap methods where approximating formulas would once have been the only recourse. These methods are not only more ﬂexible than traditional statistical tests, but can also be more intuitive for novices to understand (Pfannkuch et al., 2014; Tintle et al., 2012).
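As a rough illustration of how little machinery the bootstrap needs, the sketch below resamples a small made-up sample with replacement and reads a percentile interval for the mean off the sorted resampled means. The sample values, the number of resamples, and the function names `bootstrap_means` and `percentile_interval` are all assumptions chosen for illustration; the percentile interval is only one of several bootstrap intervals.

```python
import random
import statistics


def bootstrap_means(sample, n_resamples, rng):
    """Means of n_resamples samples drawn with replacement from the data."""
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)]


def percentile_interval(stats, level=0.95):
    """Cut off equal tails of the sorted bootstrap statistics."""
    s = sorted(stats)
    lo_idx = int((1 - level) / 2 * len(s))
    hi_idx = int((1 + level) / 2 * len(s)) - 1
    return s[lo_idx], s[hi_idx]


# Made-up sample of eight measurements.
sample = [12, 15, 9, 20, 14, 11, 17, 13]
means = bootstrap_means(sample, n_resamples=2000, rng=random.Random(0))
low, high = percentile_interval(means)
print("sample mean:", statistics.fmean(sample))
print("95% percentile interval:", (low, high))
```

No distributional formula appears anywhere in the procedure, which is part of what makes it approachable for novices.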

Randomization and bootstrap methods can help make inferences from data, even if those data are from small sample sizes or non-random collection methods (Efron and Tibshirani, 1986; Lunneborg, 1999; Ernst, 2004). While these methods are not new, they have recently been extended into a ﬁeld of visual inference. Statisticians at Iowa State University have been working on methods to use randomized or null data to provide graphical inference protocols (Wickham et al., 2010; Majumder et al., 2013; Buja et al., 2009).
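A randomization test for a difference between two groups follows the same pattern: shuffle the group labels many times and see how often the shuffled difference is at least as large as the observed one. The following is a minimal sketch, assuming a difference in means as the test statistic; the data and the function name `permutation_p_value` are hypothetical.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_shuffles, rng):
    """Two-sided randomization test for a difference in group means."""
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_shuffles):
        # Shuffling the pooled values is equivalent to reassigning group labels.
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Count the observed arrangement itself so the p-value is never zero.
    return (extreme + 1) / (n_shuffles + 1)


# Made-up groups that are clearly separated.
control = [1, 2, 3, 4, 5]
treatment = [11, 12, 13, 14, 15]
p_value = permutation_p_value(control, treatment, n_shuffles=999, rng=random.Random(0))
print("p-value:", p_value)
```

The number of shuffles (999 here) is an arbitrary choice; more shuffles give a finer-grained p-value.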

Humans are very adept at finding visual patterns, whether the patterns are real or artifacts. Graphical inference helps people train their eyes to more accurately judge whether a pattern is real or not. In these protocols, one plot created with the original data is displayed in a matrix with n decoy plots. If the user sets n = 19, the matrix holds 20 plots, so the chance of picking the true plot at random is 1/20 = 0.05, which is the traditional boundary for statistical significance (Wickham et al., 2010).

The method for generating the null plots can vary. Sometimes it is as simple as randomizing one of the variables to break any linear relationship that might have existed between the two. However, this does not work on data where the relationship is more complex, e.g., a quadratic relationship. In those cases, data are generated from the null model and compared to the true data.
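The simplest null-generating mechanism mentioned above, randomizing one variable, can be sketched as follows. The data are made up, and `pearson_r` and `null_datasets` are hypothetical helper names. Note that this sketch covers only the permutation case; simulating from a fitted null model, as needed for more complex relationships, is not shown.

```python
import math
import random
import statistics


def pearson_r(xs, ys):
    """Pearson correlation, used here only to show the association being broken."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def null_datasets(xs, ys, n, rng):
    """Return n decoy datasets in which y has been shuffled relative to x."""
    decoys = []
    for _ in range(n):
        shuffled = list(ys)
        rng.shuffle(shuffled)
        decoys.append((list(xs), shuffled))
    return decoys


# Made-up data with a strong linear relationship.
xs = list(range(20))
ys = [2 * x + (x % 3) for x in xs]
decoys = null_datasets(xs, ys, n=19, rng=random.Random(0))

print("true r:", round(pearson_r(xs, ys), 3))
print("decoy r values:", [round(pearson_r(dx, dy), 2) for dx, dy in decoys[:5]])
```

Each decoy keeps the marginal distributions of both variables intact, so the only thing that distinguishes the true data is the relationship between them.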

Using the null data, plots analogous to the one ﬁlled with real data are generated. If the user creates a matrix of decoy plots with the true plot randomly placed within the matrix, identifying the true plot means it is somehow diﬀerent from randomness. These methods have been extended for use in validating models (Majumder et al., 2013; Buja et al., 2009; Gelman, 2004).
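Assembling the matrix itself amounts to hiding the true panel at a random position among the decoys. A minimal sketch, with strings standing in for the plots and hypothetical function names `make_lineup` and `chance_of_guessing`:

```python
import random


def make_lineup(true_panel, decoy_panels, rng):
    """Insert the true panel at a random position among the decoys."""
    position = rng.randrange(len(decoy_panels) + 1)
    panels = list(decoy_panels)
    panels.insert(position, true_panel)
    return panels, position


def chance_of_guessing(n_decoys):
    """Probability of picking the true panel purely by chance."""
    return 1 / (n_decoys + 1)


# Strings stand in for plots; a real tool would hold plot objects here.
decoy_panels = [f"decoy {i}" for i in range(19)]
panels, position = make_lineup("true data", decoy_panels, random.Random(0))

print("number of panels:", len(panels))
print("chance of a lucky guess:", chance_of_guessing(19))
```

The position is returned separately so that it can be kept hidden from the viewer until after a panel has been chosen.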

Of course, protocols must be followed so as not to introduce bias. For example, if the user has been working with the data or doing exploratory data analysis, familiarity will make it easier to recognize the true plot in the matrix.

However, even with prior knowledge, picking the real data out of a lineup of plots is a very compelling exercise. I have introduced the idea to groups of high school teachers and students, and the ‘game’ of picking the right plot has proven to be very engaging. In fact, showing randomized plots can be a great extension to the ‘making the call’ activity of trying to determine the difference between groups, in this case between real and randomized versions (Pfannkuch, 2006; Wild et al., 2011).

The application of randomization and the bootstrap is another place where tools for teaching statistics shine. All the popular applet collections provide functionality for simply randomizing or bootstrapping data (Chance and Rossman, 2006; Morgan et al., 2014). TinkerPlots and Fathom also provide interfaces for this (Finzer, 2002a; Konold and Miller, 2005). However, the tools for doing statistics have lagged behind. R provides the most complete functionality, but it has not been simple to use. Tim Hesterberg has prepared a document explaining how bootstrap methods could be integrated into the undergraduate curriculum (Hesterberg, 2014), as well as an R package called resample providing a simpler syntax.

Because of their intuitive nature and generalizability, randomization and bootstrap methods are ideal for novices. They can be used in a variety of contexts, including graphical inference methods bridging the gap between exploratory and conﬁrmatory analysis.

4.6 Interactivity at every level

Requirement 6: Interactivity at every level.

Interactivity is becoming the standard for the web, and data analysis should be no diﬀerent. It should be possible for users to interactively develop an analysis, e.g., building up a plot by using drag-and-drop elements. The results of this analysis in the session should themselves be interactive. All graphs should be zoomable, it should be easy to change the data cleaning methods and see how that change is reﬂected in the analysis afterward, and parameters should be easily manipulable. This type of simple parameter manipulation will further support exploratory data analysis.

Finally, the product from the tool should also be interactive. Interactivity in published analysis would be of particular use for data journalism and academic publishing. As reproducibility becomes more valued in the academic community, data products are more often accompanied with fully reproducible code, and if the code were interactive, the audience – even if they do not know much about statistics – could play with the parameters and convince themselves the data were not doctored.

With a tool that made it simple to publish fully interactive results of data analysis, it would be easy to imagine data-driven newspaper articles accompanied by the reproducible code that produced them, allowing readers to audit the story. As noted in Chapter 3, bespoke projects such as the IEEE programming language ratings (Cass et al., 2014) provide readers access to the process used to create an analysis.

The draw of interactivity was also clear to Rolf Biehler, who inspirationally wrote,

“The concept of slider pushes the tool a step in the direction of a method construction tool where one can operate with general parameters. [...] It may be considered a weakness of systems like Data Desk that the linkage structure is not explicitly documented as it is the case with explicit programming or if we had written the list of commands in an editor. An improvement would be if a list of commands or another representation of the linkage structure would be generated automatically.” (Biehler, 1997)

The implementation of this vision may require something akin to the Shiny reactive environment, allowing the system to keep track of all the downstream elements that depend on the one above.

The power and usefulness of this type of functionality is easy to imagine, and likely the possibilities are even greater than can be imagined at present.

For example, if a user had used a cut point to create a categorical variable from a continuous variable, and then fed that categorical variable into a regression model, the system would allow them to manipulate the cut point to see the effect on regression parameters, interaction eﬀects, etc. This example is examined further in Section 5.3.2.
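The cut point example can be mimicked, very roughly, by a function that rebuilds every downstream step whenever the cut point changes. This is a sketch under simplifying assumptions: the data are made up, the model is a simple least-squares line with a single 0/1 predictor, and `fit_line` and `refit` are hypothetical names. A real system would track these dependencies automatically instead of relying on an explicit function call.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cut point puts every case in the same category")
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def refit(continuous, outcome, cut):
    """Recompute everything downstream of the cut point."""
    # Step 1: derive the categorical (0/1) variable from the continuous one.
    indicator = [1 if value > cut else 0 for value in continuous]
    # Step 2: feed the derived variable into the regression.
    return fit_line(indicator, outcome)


# Made-up data.
ages = [20, 25, 30, 35, 40, 45, 50, 55]
scores = [1, 2, 2, 3, 5, 6, 6, 7]

# Moving the cut point, as a slider would, changes the fitted parameters.
for cut in (27, 37, 47):
    intercept, slope = refit(ages, scores, cut)
    print(f"cut={cut}: intercept={intercept:.2f}, slope={slope:.2f}")
```

With a single 0/1 predictor, the intercept is the mean outcome of the low group and the slope is the difference between the group means, so the printed values show directly how sensitive the conclusion is to the choice of cut point.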

Many other possibilities would be available in the world opened up by this type of functionality. All plots would be resizable, zoomable, and pan-able.

Clicking on an element in a plot would highlight the associated element in the data representation, while clicking on a non-data element (e.g., an axis, tick line or model line) would oﬀer information about the element, and that information would be manipulable as well.