«University of California Los Angeles Bridging the Gap Between Tools for Learning and for Doing Statistics A dissertation submitted in partial ...»
The experience of developing and deploying the MobilizeSimple package underscored my earlier understanding, which is that users need support as they learn a language. Exactly how to support people is the salient question.
5.2 LivelyR
Through my work with the Communications Design Group, I was able to collaborate with Aran Lunzer to develop a tool we are calling LivelyR. The Communications Design Group is an independent research lab headed by Alan Kay.
It draws together researchers from Kay’s non-profit, Viewpoints Research Institute, as well as employees from SAP.
The product of this work is an interface we are calling LivelyR. The interface is a bespoke system (see Section 3.10 for other examples of such systems).
An R server is running in the background, either locally on a user’s computer or on a centralized server. Results are much quicker (and therefore, interaction is smoother) when R is run locally, but of course that requires a local installation.
As discussed in Section 5.1.4, relying on a local installation is a barrier to accessibility. However, because LivelyR is more provocation than prototype, we were not as concerned with the realities of use cases.
The behavior of LivelyR was based on my commitment to interactive statistical programming tools, and Lunzer’s longstanding work on subjunctive interfaces (Lunzer and Hornbæk, 2008). For an overview of the functionality of LivelyR as of May 2014, see Lunzer (2014).
5.2.1 Histogram cloud
One feature that has garnered a lot of attention when we have shown this work is what I call the ‘histogram cloud.’ The data used in the example shown in Figure 5.5 are from the R sample dataset mtcars.
In the screenshot, a scatterplot of weight (wt) versus miles per gallon (mpg) is shown in the center of the plot window. At the top of the image, a few simple summary statistics are displayed: the number of data points currently in use (which changes as subsets are applied), the mean and standard deviation of both variables, and the Pearson correlation coefficient. Green arrows indicate the range of the data included. The arrows can be modified interactively to subset the data, but the view shown is using the complete data set.
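The summary statistics in that header are simple to reproduce. The following is a minimal Python sketch, not LivelyR's own code (which delegates to R); the function name `summarize` is hypothetical, and the six (mpg, wt) pairs are the first six rows of mtcars, used only for illustration.

```python
import math

def summarize(xs, ys):
    """Return n, the means, the sample standard deviations, and Pearson's r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return {
        "n": n,
        "mean_x": mean_x,
        "mean_y": mean_y,
        "sd_x": math.sqrt(sxx / (n - 1)),  # n - 1 denominator, as in R's sd()
        "sd_y": math.sqrt(syy / (n - 1)),
        "r": sxy / math.sqrt(sxx * syy),
    }

# First six rows of mtcars: miles per gallon and weight (1000 lbs).
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1]
wt = [2.620, 2.875, 2.320, 3.215, 3.440, 3.460]

stats = summarize(mpg, wt)
print(stats)
```

Recomputing these values on whatever subset is currently selected is what makes the displayed count and correlation change as the arrows are dragged.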
At the bottom of the screen is a rectangular display of the data set, with the x- and y-variables noted, and ranges for each of the variables. The 0 and 100 indicate 100% of the data are included for each variable. Again, this could be interactively modified to include only a particular percentile range of the data.
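Percentile-based subsetting of this kind might look like the sketch below. It is written in Python under stated assumptions: the helper names are hypothetical, percentiles are linearly interpolated between order statistics, and both cut points are inclusive. LivelyR's actual rule may differ.

```python
def percentile(sorted_vals, pct):
    """Linearly interpolated percentile (0 to 100) of pre-sorted values."""
    pos = pct * (len(sorted_vals) - 1) / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def subset_by_percentile(rows, key, lo_pct=0, hi_pct=100):
    """Keep rows whose `key` value lies between the two percentile cut points."""
    vals = sorted(row[key] for row in rows)
    lo = percentile(vals, lo_pct)
    hi = percentile(vals, hi_pct)
    return [row for row in rows if lo <= row[key] <= hi]

# First six rows of mtcars, for illustration only.
cars = [
    {"mpg": 21.0, "wt": 2.620},
    {"mpg": 21.0, "wt": 2.875},
    {"mpg": 22.8, "wt": 2.320},
    {"mpg": 21.4, "wt": 3.215},
    {"mpg": 18.7, "wt": 3.440},
    {"mpg": 18.1, "wt": 3.460},
]

everything = subset_by_percentile(cars, "mpg")        # 0 to 100: all rows
middle = subset_by_percentile(cars, "mpg", 20, 80)    # trims both tails
print(len(everything), len(middle))
```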
Video (Lunzer, 2014): https://vimeo.com/93535802
For this version of LivelyR, we made it possible to plot several different types of plot in the same plotting window. Therefore, the left vertical axis is the wt axis, but the right vertical axis is the count for the histogram(s) of mpg.
Researchers have found that this type of layering of multiple plots is difficult for users to understand, so this functionality would ideally not be included in a production tool (Isenberg et al., 2011; Few, 2008). However, providing histograms along the margins of scatterplots (typically outside the plot region, rather than inside as shown here) is a common visualization feature (Emerson et al., 2013).
Typically, statistical packages provide default bin widths and bin offsets, and the affordances of the system provide a disincentive to modify the parameters.
For example, in base R, the hist() command uses a default bin width based on the Sturges algorithm (Sturges, 1926). The geom_bar() command in the R package ggplot2 uses range/30, and though it does provide a warning,

    stat_bin: binwidth defaulted to range/30. Use ‘binwidth = x’ to adjust this.

many people stick with what is given.
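To make the two defaults concrete, the sketch below computes both for the mpg variable of mtcars (32 observations ranging from 10.4 to 33.9). It is written in Python for illustration and the function names are hypothetical. Note that R's hist() also rounds the Sturges breaks to ‘pretty’ values, so the bins R actually draws need not match the raw width computed here.

```python
import math

def sturges_bin_count(n):
    """Sturges' rule: ceil(log2(n)) + 1 bins for n observations."""
    return math.ceil(math.log2(n)) + 1

def sturges_binwidth(n, lo, hi):
    """Raw bin width implied by Sturges' rule, before any rounding of breaks."""
    return (hi - lo) / sturges_bin_count(n)

def range_over_30_binwidth(lo, hi):
    """The range/30 default named in the stat_bin warning."""
    return (hi - lo) / 30

n, lo, hi = 32, 10.4, 33.9  # mtcars mpg: sample size, minimum, maximum

sturges_w = sturges_binwidth(n, lo, hi)
default_w = range_over_30_binwidth(lo, hi)
print(sturges_bin_count(n), round(sturges_w, 3), round(default_w, 3))
```

The two rules differ by a factor of five on these data, which illustrates how strongly a histogram's appearance depends on a default the user may never examine.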
There are several other accepted algorithms for choosing optimal histogram bin width (Wand, 1997), but generally the parameters should be tuned to a particular data set. Choosing appropriate bin widths is one of the pieces of data science that ends up being more of an art.
Therefore, we sought to make it so easy to modify the defaults there would be no reason not to. This is in line with Biehler’s idea of a “stretchy” histogram, allowing users to pull the height of bars up or down (Biehler, 1997). In this interface, the bars cannot be directly manipulated, but the parameters are easily manipulable.
Outside the scope of the screenshot in Figure 5.5 but visible in Figure 5.7 are the slider controls for bin width and bin offset. Users can interactively manipulate these sliders independently, to see a series of static histograms with the particular parameter value. However, Lunzer’s work often gives users the ability to modify several parameters together, in order to see more ‘what if?’ possibilities. In this case, it is possible to make a ‘sweep’ of one of the parameters. In Figure 5.5 the sweep is of the bin offset parameter. Therefore, each histogram in the cloud has the same bin width, but they all have slightly different bin offsets, that is, different positions where the bins begin. Once this sweep is in place, the user can use the slider to modify the bin width of the cloud, which will produce a new set of histograms with the same (new) bin width, but a variety of bin offsets.
Figure 5.5: LivelyR interface showing a histogram cloud of miles per gallon
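The computation behind such a cloud can be sketched as follows. This is an illustrative Python sketch rather than LivelyR's implementation: the function names are hypothetical, the offsets are assumed to be spread evenly across one bin width, and bins are treated as closed on the left (R's hist() closes them on the right by default).

```python
import math
from collections import Counter

def histogram(values, binwidth, offset):
    """Count values in bins [offset + k*binwidth, offset + (k+1)*binwidth)."""
    counts = Counter(math.floor((v - offset) / binwidth) for v in values)
    return {offset + k * binwidth: c for k, c in sorted(counts.items())}

def histogram_cloud(values, binwidth, n_offsets):
    """One histogram per offset: same bin width, offsets swept over one bin."""
    offsets = [i * binwidth / n_offsets for i in range(n_offsets)]
    return {off: histogram(values, binwidth, off) for off in offsets}

# The mpg column of mtcars.
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,
       16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5,
       15.2, 13.3, 19.2, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4]

cloud = histogram_cloud(mpg, binwidth=5, n_offsets=5)
for off, hist in cloud.items():
    print(off, hist)
```

Moving the bin width slider corresponds to calling `histogram_cloud` again with a new `binwidth`, which regenerates the whole set of offsets.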
When people are presented with the histogram cloud for the first time, they typically express a sense of wonder. Because the full range of possible histograms for a particular data set has never been accessible to them, they find it fascinating how many variations are possible. The cloud also gives a sense of the ‘true’ shape of the distribution. Obviously, kernel density estimation can provide similar information about the true distribution of the data, but understanding kernel densities requires another layer of abstraction. Understanding histograms is a complicated task for novices (Watson and Fitzallen, 2010; Friel, 2008), but the histogram cloud only requires an additional small cognitive step in order to be understood.
Inspired by the popularity of the histogram cloud when showing LivelyR, I have begun to think about how the concept could be extended to a 2-D setting.
Mapmaking is often appealing to novices because it is very grounded in reality.
As Tom MacWright noted, “The problem with maps is that the world looks like a map. We don’t have that problem with other visualizations.” One of the challenges with mapmaking, particularly of choropleth maps, is that the areal units used do not have much meaning in terms of the variable being mapped. For example, mapping incidences of traffic accidents by zipcode or Census block is typically not useful, because traffic accidents tend to happen along streets. Because data are often measured in standard areal units, there is typically not much that can be done.
In geography, this is called the Modifiable Areal Unit Problem (MAUP). Geographers and geostatisticians have developed some methods for dealing with these data. The methods are usually spoken of in terms of ‘scaling.’ Upscaling is the easiest task, and it involves making a map at a less detailed spatial resolution than the data collection method (e.g., taking a map aggregated at the county level and turning it into one aggregated at the state level). Side-scaling is somewhat more complex, as it involves taking two similarly-sized areal units and translating between them (e.g., moving from zipcodes to Census tracts).
The most complex is down-scaling, which involves taking data at a less-detailed level down to a more detailed level (Atkinson, 2013).
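Of the three operations, upscaling is simple enough to sketch directly. The Python example below sums counts from a finer unit into its parent unit; the unit names and counts are invented placeholders, and the function name is hypothetical. Summation is only appropriate for additive quantities such as counts; rates or averages would need to be weighted.

```python
from collections import defaultdict

def upscale(counts_by_unit, parent_of):
    """Sum counts from fine units (e.g., counties) into parents (e.g., states)."""
    totals = defaultdict(int)
    for unit, count in counts_by_unit.items():
        totals[parent_of[unit]] += count
    return dict(totals)

# Hypothetical placeholder data, for illustration only.
county_counts = {"county_a": 12, "county_b": 7, "county_c": 4}
county_to_state = {"county_a": "state_x",
                   "county_b": "state_x",
                   "county_c": "state_y"}

state_counts = upscale(county_counts, county_to_state)
print(state_counts)
```

Down-scaling has no such direct inverse, because the aggregated total does not say how the count was distributed among the finer units.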
There are a variety of methods to deal with this problem, including data fusion and area-to-point kriging. Also relevant are efforts using data augmentation, much like what was discussed in Section 5.1.1, which use auxiliary information to help with disaggregation. For example, the Disser project helps disaggregate Census data by bringing in information about zoning to determine housing density (Martin-Anderson, 2014).
Because viewers found it appealing to move bins in the histogram cloud and see the resulting changes in distribution in 1-D, it seemed likely that an analogous action would be appealing in 2-D. However, this extension is still a work in progress.
5.2.2 Regression guess line
Another feature built into LivelyR is the ability to create a ‘regression guess’ line. This is another feature suggested by Biehler, who writes, “eye-fitted lines with residual analysis should precede the method of least squares” (Biehler, 1997). As in the previous section, a video of the interaction is available to more clearly display the functionality (Lunzer, 2014).
Similar to the histogram cloud discussed above, this feature allows users to manipulate one or multiple parameters in order to find the best fit line. In this case, the parameters are the start and end points of the line the user is guessing. Choosing the best line can be done by eye by modifying first one end of the line and then the other, then judging how well the line fits the scatterplot of points. The interface provides additional information in the form of the residual sum of squares (RSS) value, which a user can also use as a guide, manually attempting to optimize RSS.
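The RSS for a guessed line follows directly from its two endpoints. The Python sketch below is illustrative only: the function names and the two example guesses are hypothetical, and the data are the first six rows of mtcars with mpg on the horizontal axis and wt on the vertical axis, matching the orientation described for the LivelyR plot.

```python
def line_from_endpoints(p0, p1):
    """Slope and intercept of the line through two (x, y) endpoints."""
    (x0, y0), (x1, y1) = p0, p1
    slope = (y1 - y0) / (x1 - x0)
    return slope, y0 - slope * x0

def rss(xs, ys, slope, intercept):
    """Residual sum of squares of the points about the given line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1]
wt = [2.620, 2.875, 2.320, 3.215, 3.440, 3.460]

# Two guesses: a downward-sloping line and a flat one.
sloped = line_from_endpoints((18.0, 3.5), (23.0, 2.4))
flat = line_from_endpoints((18.0, 3.0), (23.0, 3.0))

rss_sloped = rss(mpg, wt, *sloped)
rss_flat = rss(mpg, wt, *flat)
print(rss_sloped, rss_flat)
```

A smaller RSS indicates a closer fit, so on these points the sloped guess would be preferred over the flat one.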
Because LivelyR complies with Lunzer’s conception of a subjunctive interface (Lunzer and Hornbæk, 2008), it also allows users to try a ‘sweep’ of parameter values. In this case, the user defines a sweep of point locations on one end of the line, then moves the other end by hand. This allows the user to visually compare a selection of lines with a more dynamic set of end points. Additionally, the interface provides an ephemeral plot of the RSS value, which allows the user to more directly attempt to optimize the value by aiming for the minimum on the plot. A screenshot of this functionality is shown in Figure 5.6. Again, this falls into the category of allowing learners to discover things by themselves, as Biehler suggests.
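A sweep of this kind amounts to evaluating the RSS once per candidate endpoint. The Python sketch below holds one endpoint fixed and sweeps the height of the other; the function names, the fixed endpoint, and the candidate heights are all hypothetical choices for illustration, and the data are the first six rows of mtcars. Since least squares seeks the smallest RSS, the sketch picks the candidate with the lowest value.

```python
def rss_for_endpoints(xs, ys, p0, p1):
    """RSS of the points about the line through endpoints p0 and p1."""
    (x0, y0), (x1, y1) = p0, p1
    slope = (y1 - y0) / (x1 - x0)
    intercept = y0 - slope * x0
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

def sweep_endpoint(xs, ys, fixed, x_end, y_candidates):
    """Return (height, RSS) pairs for each candidate height of the swept end."""
    return [(y_end, rss_for_endpoints(xs, ys, fixed, (x_end, y_end)))
            for y_end in y_candidates]

mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1]
wt = [2.620, 2.875, 2.320, 3.215, 3.440, 3.460]

fixed = (18.0, 3.5)                              # end held in place
candidates = [2.0 + 0.2 * i for i in range(8)]   # heights 2.0, 2.2, ..., 3.4
profile = sweep_endpoint(mpg, wt, fixed, 23.0, candidates)

best_y, best_rss = min(profile, key=lambda pair: pair[1])
print(best_y, best_rss)
```

Plotting the second element of each pair against the first gives the kind of RSS profile a user could read to locate the best candidate in the sweep.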
Of course, even the LivelyR implementation does not do the most superb job of supporting discoverability. The interface provides RSS as a measure to optimize without ever explaining why one might want to optimize it, and there is no visual support to suggest why the movement of the line is increasing or decreasing the RSS. A second generation of this type of tool should include a more self-articulating version of this, where the residuals or the squares are visually represented on the screen. More work should be done to study how novices conceptualize the sum of squares, and to determine the most effective way of conveying the concept visually.
5.2.3 Small multiple callout plots
In all the plot scenarios involving sweeps of parameters, LivelyR makes it possible to see each of the possible scenarios broken out into small multiple plots (Tufte, 2001). An example of this is shown in Figure 5.7, where each of the histograms from the histogram cloud in section 5.2.1 is broken out into an individual plot.
Once again, this feature allows for exploration over multiple parameters. In this case, although the small multiples each display one of the set of histograms from the histogram cloud, the user can use a second input device (in Lunzer’s experiments, an iPad configured for use as a left-hand input) to select a histogram they want to compare with all others. In Figure 5.7, there is a light ghosted histogram shown in the background of all the small multiples. This is the histogram being used for comparison. In this example, the histogram being used for comparison is the beginning of the sweep (notice the small multiple in the upper left does not have a ghost histogram), but the tool is more generic.
5.2.4 Documentation of interaction history
Many of the features discussed above are available in other software packages, particularly those for learning statistics (see Section 2.2 for more on these tools).
However, LivelyR offers a feature we have not witnessed in any common interactive tools: a history. Each time a user performs an action in the LivelyR interface, whether changing the subset of the data, modifying the histogram bin width, or sweeping values for a regression line guess, a line is added to the history list. The history list is visible on the right side of the screenshot in Figure 5.7 and is shown in more detail in Figure 5.8.
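One simple way to model such a history is as an append-only log with one entry per user action. The Python sketch below is a hypothetical illustration of that idea; the class, the method names, the action names, and the line format are all assumptions and do not describe how LivelyR stores or displays its history.

```python
from dataclasses import dataclass, field

@dataclass
class History:
    """Append-only log of user actions (hypothetical structure)."""
    entries: list = field(default_factory=list)

    def record(self, action, **params):
        """Add one entry describing an action and its parameters."""
        self.entries.append({"step": len(self.entries) + 1,
                             "action": action,
                             "params": params})

    def lines(self):
        """Render each entry as one line of the history list."""
        return [
            "{}. {}({})".format(
                e["step"], e["action"],
                ", ".join(f"{k}={v!r}" for k, v in e["params"].items()))
            for e in self.entries
        ]

history = History()
history.record("subset", variable="mpg", percentiles=(0, 100))
history.record("set_binwidth", value=5)
history.record("sweep", parameter="bin_offset", steps=5)

for line in history.lines():
    print(line)
```

Because each entry keeps the action together with its parameters, a log of this shape could in principle be replayed to reconstruct an analysis.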