«University of California Los Angeles Bridging the Gap Between Tools for Learning and for Doing Statistics A dissertation submitted in partial ...»
Lexical scoping is a quality either lauded or critiqued in R. In general, the term means variables are only available in particular environments. In R, it means the language looks within the local environment first (for example, within the function being used); if it cannot find a particular variable there, it continues outward to the next enclosing environment (the environment in which that function was defined), and so on until it reaches the global scope. This type of scoping can be good or bad: it makes R less restrictive than some other languages, but it can also lead to unintended side effects.
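The lookup order described above can be illustrated with a minimal sketch. All names here (scale_factor, rescale, make_scaler, rescale_local) are hypothetical and chosen only for illustration; the last lines show the kind of unintended side effect that scoping of this sort can produce.

```r
# All names below are hypothetical, for illustration only.
scale_factor <- 10                # defined in the global environment

rescale <- function(x) {
  # scale_factor is not defined inside this function,
  # so R looks outward and finds the global value.
  x * scale_factor
}

make_scaler <- function() {
  scale_factor <- 2               # local to make_scaler()
  # The inner function finds scale_factor in the environment
  # where it was defined, before R ever reaches the global one.
  function(x) x * scale_factor
}
rescale_local <- make_scaler()

global_result <- rescale(3)        # uses the global scale_factor
local_result  <- rescale_local(3)  # uses the enclosed scale_factor

# A possible side effect: changing the global value silently
# changes what rescale() returns, with no change to its code.
scale_factor <- 100
changed_result <- rescale(3)
```

The inner function returned by make_scaler() is unaffected by the later change to the global variable, because it resolves scale_factor in its own enclosing environment first.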
R work typically takes place at a command-line interface (CLI). In fact, R can be used directly at the command line, as seen in Figure 3.11. A slightly friendlier interface is the regular R graphical user interface (GUI) that comes with the language when it is downloaded (Figure 3.12). GUIs are discussed in more depth in Section 3.8.5, but essentially the GUI allows a user to move some tasks from command line typing tasks to more menu-oriented tasks. The regular R GUI provides a ﬁle menu, a red stop button for when code gets out of hand, and not much else. Section 3.8.5 examines more advanced GUIs for R as well as the integrated development environment RStudio. All these approaches add support for users, but still require syntactic knowledge.
Figure 3.11: R at the command line
Figure 3.12: Regular R graphical user interface
Like many programming languages, R has both a language base and additional libraries, allowing users to extend its functionality. The additional libraries are called ‘packages’ and most are hosted on a centralized server called the Comprehensive R Archive Network (CRAN) (R Core Team, 2015). Having CRAN makes it simple for users to install new packages, as they do not have to go hunting for where a particular package is hosted. Additionally, R provides great library management support. By using the command install.packages(), a particular package can be pulled from CRAN, unpacked, and installed in precisely the right location. This simplifies the installation process and makes it much more straightforward than library management in Python, for example.
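As a rough sketch of that workflow, the snippet below installs a package only when it is not already available and then loads it. The choice of lattice is an assumption made for illustration (it ships with most R distributions, so the install step is usually skipped); the install step also assumes a CRAN mirror is configured.

```r
# "lattice" is used purely as an illustration.
pkg <- "lattice"

# Install from CRAN only if the package is not already present.
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)   # download, unpack, and install into the library
}

# Load the package so its functions are available in the session.
library(pkg, character.only = TRUE)
```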
Because R has the statistical community invested in it, and because it is open-source and easy to modify, there are many additional packages for R. As of this writing, CRAN hosts over 6,500 packages. However, there has been a recent movement away from hosting packages on CRAN toward hosting them on GitHub. Using Hadley Wickham’s devtools package, package installation from GitHub is as seamless as from CRAN, although there are fewer checks on packages prior to their installation. The reduced checks could allow unscrupulous package writers to distribute packages with nefarious aims (a worst case scenario would be some type of trojan horse virus). However, because of the high bar packages must pass to be included on CRAN, the relaxed checking also promotes more creative packages and allows users to try out packages still in development. The ﬂexibility of being able to install packages from GitHub has outweighed the risks for many users.
One good affordance of R is that it makes it very difficult to modify original data. Instead, R loads a copy of the data into the work session and the user works with the copy. Once it is loaded, it is possible to ‘clean’ the data, i.e., standardize some of the fields so labels are consistent, convert 0s to NAs, etc. None of these actions are taken on the data file itself. Instead, they take place on the copy the user has loaded. Because R is a language, it is possible to follow the trail of actions taking the original data file to the cleaned version. Code can be saved and verified by another party if there are questions of reproducibility. Of course, it is still possible to save the modified data file over the original, but the process of making changes to a local copy makes it much less likely.
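A small sketch of this behavior, using a made-up file written to a temporary location (the file contents and the names original_path and scores are hypothetical): the cleaning step changes the in-session copy, while the file on disk stays as it was.

```r
# Create a small "original" data file (hypothetical contents).
original_path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,0", "2,7", "3,0"), original_path)
before <- readLines(original_path)

# read.csv() loads a copy of the data into the session.
scores <- read.csv(original_path)

# Clean the copy: treat 0 as a missing value.
scores$score[scores$score == 0] <- NA

# The file on disk has not changed.
after <- readLines(original_path)
file_untouched <- identical(before, after)
```

Only an explicit write step, such as write.csv(scores, original_path), would overwrite the original file.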
There are many other affordances of data analysis systems that shape the way users think about and work with data. For example, while many languages rely on constructs like ‘for’ loops and ‘while’ statements (often called control statements), other languages, like R, support vectorized operations. Instead of having to state explicitly that an operation should be done for every entry in a list, R allows the operation to be applied to the list itself, and the language carries it out element by element. R includes control statements like ‘for’ loops, but they are used less often than in other similar languages.
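To make the contrast concrete, here is a minimal sketch that converts the car weights in mtcars (recorded in units of 1000 lbs) to kilograms, once with an explicit loop and once with a vectorized expression. The variable names and the approximate conversion factor are assumptions for illustration.

```r
weights_lb <- mtcars$wt * 1000    # wt is stored in units of 1000 lbs
kg_per_lb  <- 0.4536              # approximate conversion factor

# Explicit loop: visit each entry in turn.
weights_kg_loop <- numeric(length(weights_lb))
for (i in seq_along(weights_lb)) {
  weights_kg_loop[i] <- weights_lb[i] * kg_per_lb
}

# Vectorized: apply the operation to the whole vector at once.
weights_kg_vec <- weights_lb * kg_per_lb
```

Both approaches give the same result; the vectorized form is shorter and is the style more commonly seen in R code.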
As with any tool, R has its shortcomings as well. The main drawback of R is its status as a programming language. Many of the other tools discussed here lean much more toward graphical user interfaces, while R is a language, meaning users need to provide the correct function calls with appropriate syntax and arguments. Adding to this is R’s inconsistent syntax, which makes it hard to learn for novices and programming experts alike. This is discussed below in Section 3.8.1.
There have been eﬀorts to simplify the coding aspects of R over the years.
Some of these eﬀorts are curricular, reducing the number of commands to which novices are exposed, or providing more consistent syntax (Verzani, 2005; Kaplan and Shoop, 2013). Other eﬀorts are Graphical User Interfaces (GUIs) like Deducer (Fellows, 2012) and RCommander (Fox, 2004), discussed further in Section 3.8.5. However, none of these eﬀorts have truly solved the problem.
One complex aspect of R is the multitude of syntaxes it supports. Where most programming languages would have one standard syntax, R has many.
The two main syntaxes most users encounter are the dollar sign syntax and the formula syntax. The dollar sign syntax uses the $ operator to denote when an object is within another object. For example, mtcars$wt indicates the wt variable within the mtcars dataset. The formula syntax is so named because it is most commonly found in functions performing modeling, which use a formula specification. This syntax uses a ~ operator, but the entire syntax is different.
Instead of referring to variables within datasets, the user refers to the variables directly and then notes the dataset later.
For a more thorough example, see below. In this example, a set of three plots are made to compare the weight (wt) and miles per gallon (mpg) of cars with diﬀerent numbers of cylinders (cyl).
First, the dollar sign syntax:
par(mfrow=c(1,3))
plot(mtcars$wt[mtcars$cyl == 4], mtcars$mpg[mtcars$cyl == 4])
plot(mtcars$wt[mtcars$cyl == 6], mtcars$mpg[mtcars$cyl == 6])
plot(mtcars$wt[mtcars$cyl == 8], mtcars$mpg[mtcars$cyl == 8])
Then, the formula syntax:
library(lattice)
xyplot(mpg~wt | as.factor(cyl), data=mtcars)
The plot from the dollar sign example is seen in Figure 3.13 and the plot from the formula example is seen in Figure 3.14. Perhaps the most obvious observation about these two plots is that they do not appear to be ‘the same.’ This is because the formula syntax example standardizes the axes automatically, while the dollar sign syntax example generates the best axes for each plot.
[Figures 3.13 and 3.14: plots from the dollar sign and formula syntax examples; first panel y-axis: mtcars$mpg[mtcars$cyl == 4]]
The plotting example makes the formula syntax appear far superior, but there are tasks that are much easier in the dollar sign syntax as well. For example, creating a new variable in the formula syntax requires something like mtcars = mutate(mtcars, avgWT = mean(wt)) rather than mtcars$avgWT = mean(mtcars$wt) in the dollar sign syntax.
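The two styles of creating a variable can be compared side by side in a short sketch. To keep it free of extra packages, base R's transform() is used here as a stand-in for mutate(); that substitution, and the names cars_dollar and cars_bare, are assumptions for illustration rather than the approach used in the text.

```r
# Dollar sign syntax: name the dataset each time a variable is used.
cars_dollar <- mtcars
cars_dollar$avgWT <- mean(cars_dollar$wt)

# Bare-variable style: name the dataset once, then refer to wt directly.
# transform() is a base R stand-in for mutate() in this sketch.
cars_bare <- transform(mtcars, avgWT = mean(wt))
```

Both versions add the same avgWT column; they differ only in how the variable wt is referenced.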
One method for simplifying R for novices is to expose them to only one syntax. In projects limiting exposure to only one syntax (Project MOSAIC and Mobilize), the choice has been made to use the formula syntax (Pruim et al., 2015a; Gould et al., 2015), although either choice would have been eﬀective.
In order to focus on the formula syntax, graphics are made using the lattice package (Sarkar, 2008), summary statistics are computed with the mosaic package (Pruim et al., 2015b), and modeling can continue to be done in base R.
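As a hedged illustration of how one formula pattern can carry across tasks, the sketch below computes a grouped summary and fits a model using only base R. aggregate() is used as a stand-in for the mosaic summary functions (an assumption made to avoid requiring that package), and the object names are hypothetical.

```r
# Grouped summary with a formula: mean mpg for each number of cylinders.
# aggregate() is a base R stand-in for a mosaic-style summary call.
mean_mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Modeling in base R uses the same response ~ predictor pattern.
fit <- lm(mpg ~ wt, data = mtcars)
slope <- coef(fit)[["wt"]]   # estimated change in mpg per unit of wt
```

In both calls the variables are named directly in the formula and the dataset is supplied afterward through the data argument.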
Again, one of the strengths of R is the large quantity of additional packages available to extend its functionality.
126.96.36.199 mosaic
Project MOSAIC and its associated R package, mosaic, have advanced the state of R for education (Pruim et al., 2015a,b). By making summary statistics available in the formula-based R syntax, the mosaic package allows for a standardization within the introductory curricula. By using the mosaic package, along with lattice graphics (Sarkar, 2008), students can stay firmly within the formula-based syntax for an entire introductory statistics course.
In addition, the project has been improving RMarkdown templates (discussed further in Section 3.8.3), and creating interactive widgets (what Biehler might call microworlds) to allow users to interact with R more dynamically (Biehler, 1997). However, even with these advantages, students still need to type syntactically correct code and cannot easily create their own interactive graphics.
188.8.131.52 The Hadleyverse
In recent years, simpler and more ﬂexible R packages have become more plentiful. A main driver of this trend is Hadley Wickham, the Chief Scientist at RStudio (RStudio is discussed more in Section 184.108.40.206). Wickham developed the ﬂexible graphics package ggplot2 as his doctoral dissertation (Wickham, 2009).
ggplot2 is an implementation of The Grammar of Graphics (Wilkinson, 2005), which means it supports the creation of novel plot types within a structured syntax.
Wickham has also developed packages to help users deal with data manipulation, such as plyr, reshape2 and dplyr (Wickham, 2011, 2007; Wickham and Francois, 2015). Wickham has said he wants to build tools that allow users to easily express 90% of what they want to be able to do, while only losing 10% of the flexibility. He acknowledges there will always be edge cases falling outside the capabilities of his packages. However, he is not trying to create a complete language, only domain-specific languages for data analysis tasks.
220.127.116.11 Shiny and manipulate
Shiny is an R package developed by the RStudio team that enables R programmers to create interactive visualizations for the web (Chang et al., 2015).
In order to develop a Shiny app, an author must create two R ﬁles, one called ui.R and one called server.R. These two ﬁles must be developed in parallel, taking particular care to match variable names between the processes happening on the back end (in the server ﬁle) and those visible in the app (in the UI ﬁle).
In the server ﬁle, authors can deﬁne reactive expressions. The reactive expressions can be used to build a system that responds to user input, updating only those values that depend on the modiﬁcations made by the user.
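The paired structure can be sketched in a single script, with comments marking what would go in each file. This is a minimal sketch that assumes the shiny package is installed; the names demo_ui, demo_server, launch_demo, and the IDs "cyl" and "scatter" are hypothetical. The functions are only defined here, so no app starts until launch_demo() is called.

```r
# Assumes the shiny package is installed; nothing runs until launch_demo().

# What would live in ui.R: one input (a slider) and one output (a plot).
demo_ui <- function() {
  shiny::fluidPage(
    shiny::sliderInput("cyl", "Number of cylinders",
                       min = 4, max = 8, value = 4, step = 2),
    shiny::plotOutput("scatter")   # ID must match output$scatter below
  )
}

# What would live in server.R.
demo_server <- function(input, output) {
  # Reactive expression: recomputed only when input$cyl changes.
  selected <- shiny::reactive({
    mtcars[mtcars$cyl == input$cyl, ]
  })
  # The name "scatter" ties this output to plotOutput("scatter") in the UI.
  output$scatter <- shiny::renderPlot({
    plot(selected()$wt, selected()$mpg, xlab = "wt", ylab = "mpg")
  })
}

# Hypothetical helper that combines the two pieces into an app object.
launch_demo <- function() {
  shiny::shinyApp(ui = demo_ui(), server = demo_server)
}
```

The IDs "cyl" and "scatter" appear on both sides, which is the name matching between the UI and server that the text describes.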
Shiny supports interface features like sliders, radio buttons, check boxes, even text input. Typically, though, the resulting visualizations are themselves static. The user cannot zoom into them naturally in the way they would with a d3 web graphic. Instead, the designer would have to incorporate sliders for the x- and y-ranges, and the user would manipulate those to impact the zoom.
Shiny also supports simple publishing, as local Shiny apps can be ported to RStudio’s shinyapps.io hosting site with one button click.
However, authoring Shiny apps is not a task for novices. In order to create an interactive graphic, the user ﬁrst needs to understand R syntax and which parameters she wants to manipulate. The user must also have a basic understanding of reactive programming, and must be able to match up variable names and outputs in the paired server/UI paradigm Shiny uses. Much more useful would be a tool where users could create interactive graphics using direct manipulation of objects on the screen, without needing to know R syntax.
Its challenges notwithstanding, Shiny is a very powerful tool, and it has been enabling R programmers to build interactive tools that have gained viral success, such as the dialect map (Katz and Andrews, 2013) published in the New York Times, which eventually received more views than any article in the history of the paper (Leonhardt, D. (@DLeonhardt), 2014).
As it stands, Shiny can be useful as what Biehler calls a “meta-tool,” enabling teachers to “adapt and modify material and software for their students” (Biehler, 1997). There are many nice examples of these types of tools, including the gallery being curated by Mine Çetinkaya-Rundel (Çetinkaya-Rundel, 2014) and tools for exploring database joins (Ribeiro, 2014). Two Shiny apps I developed are discussed in Section ??.
A simpler package with a similar idea is the manipulate package, which was also developed at RStudio (Allaire, 2014). manipulate is easier for novices to use, although it still requires some knowledge of R. However, instead of requiring the server/UI paradigm Shiny requires (which is useful for publishable web documents), manipulate works directly at the command-line to produce interaction that cannot be published but can be used for educational purposes.
3.8.3 knitr/RMarkdown
While not specifically for doing statistical analysis, several projects by Yihui Xie are extending the capabilities for reproducible research, both with R and more generally.
During his doctoral dissertation, Xie wrote the knitr package, which expanded the capabilities of previous functionality called Sweave from base R (Xie, 2013). knitr makes it possible to combine text, code, and the results from the code. This is much like the iPython notebooks discussed in Section 3.3, but is more flexible in terms of languages and output formats. The most canonical examples are including R code in LaTeX or Markdown text, but the package is much more flexible (Xie, 2013). In fact, knitr allows users to combine any type of code (Python, C++, etc.) with any textual format.
Users write text and code (delimited as such by particular syntax depending on the textual format they are using), then ‘knit’ the source document to create a fully formatted HTML or PDF document.
Figure 3.15: Code and associated output from RMarkdown