«University of California Los Angeles Bridging the Gap Between Tools for Learning and for Doing Statistics A dissertation submitted in partial ...»
RStudio
On the other hand, RStudio is an Integrated Development Environment (IDE) for R (RStudio Team, 2014). It was initially developed by J.J. Allaire, who is now supported by a team of expert R programmers, including several who have been mentioned previously in this document (Yihui Xie and Hadley Wickham in particular).
RStudio offers many useful features, such as code completion, file management, and comprehensive code history. Many professors have found that the support RStudio provides makes it much easier for introductory college students to pick up R (Baumer et al., 2014; Muller and Kidd, 2014; Pruim et al., 2014; Horton et al., 2014a). It is also easier for high school teachers and students, as discovered during the Mobilize project and in McNamara and Hansen (2014).
RStudio can be run as a desktop application for Mac, Windows, or Linux, but it is also available as a server install. By using a server version, instructors can manage package installations from a central location and provide quick bug fixes to all students at once. Users go to a particular web address, log in, and find their R session just how they left it. All files can be hosted on a central server, which means students can do their work from any computer without having to worry about moving data from place to place.
Because of how simple it makes access, many colleges use RStudio servers for their students. In particular, Smith College, Mount Holyoke College, Duke University, and Macalester College all use this arrangement. For Mobilize, a server version was used for our high school teachers and students (discussed further in Section 5.1).
When a user opens RStudio, whether in the desktop version of the application or through a server, the initial screen will look much like what is seen in Figure 3.22. In fact, in the book “Start Teaching with R,” the authors warn that the server version and desktop version “look so similar that if you run them both, you will have to pay attention to make sure you are working in the one you intend to be working in” (Pruim et al., 2014).
The RStudio screen has four panes, which are shown in their default arrangement in Figure 3.22 (see footnote 3). RStudio does allow the user to move panes in the options menu, but this layout is the default when a user first launches the program, so we will act as though it is standard.
The first pane allows users to view files and data. This is a useful feature because it allows users to view a spreadsheet-like representation of their data, which they can scroll through (Figure 3.23). In standard R, views of data are much more piecemeal, and users must type commands like head(data) and tail(data) to view the first (or last) few rows. In contrast, RStudio helps smooth the transition from a spreadsheet tool. This is also the pane where documentation can be written by users, such as .R files or RMarkdown documents (discussed at more length in Section 3.8.3).
The second pane allows users to view a list of all the objects loaded into the working environment, as in Figure 3.24a. For example, all datasets loaded will appear here, along with any other objects that have been created (special text or spatial formats, vectors, etc.). Again, this helps make using R more concrete.
Every time a user creates a new variable, they see its name appear in the environment tab, and can click on its name to inspect it more closely. The other tab in the second pane is a code history, which includes the complete history of all code typed, over all R sessions, as in Figure 3.24b. The history is searchable, so the user can use the search box at the upper right of the pane to search through their code history.
Footnote 3: The following text has been modified from McNamara (2013a).
Figure 3.24: Second pane in RStudio: environment and history
The third pane provides integrated views of files, plots, packages, and help.
When a user runs code in the Console that creates a plot, the Plots tab will be automatically selected, as in Figure 3.25b. The automatic plot tab selection makes it very simple for users to know when the code they have run created a plot. In contrast with the standard R GUI, which has a floating window for plots that can easily get ‘lost,’ this integrated tab keeps everything together.
The packages tab (Figure 3.25c) gives a visual summary of all the packages the user has installed and which are loaded into the current working session.
Again, the ability to visually keep track of packages supports users. In the standard R GUI, users have to type installed.packages() to see the list of packages they have installed, and this command is often not taught. By making it simple to see the list of packages, RStudio is encouraging users, even novices, to view their installed packages. Similar to the plots tab, the help tab will be automatically selected whenever the user runs help code in the Console, like help(plot) or ?plot.
The fourth pane is the console, and provides the command line for users to enter R code. The console pane is where the majority of work takes place – the other panes provide support for the user, but no code interpretation. The console looks similar to the standard R GUI, but it also provides support for users.
For example, if a user begins to type a function and then hits the ‘tab’ key on their keyboard, RStudio will do code completion and provide a hovering hint of the documentation of the function (Figure 3.26).
RStudio provides many support features, in particular a unified interface where windows cannot get ‘lost.’ It also provides visual cues to objects in the working environment, to installed packages, and to files in the working directory. The data preview functionality helps ease the transition from spreadsheet programs. And even in the most programming-oriented area, the Console, RStudio provides coding support features like tab completion and code hints. RStudio has been used successfully in many introductory college statistics classes (Baumer et al., 2014; Muller and Kidd, 2014; Pruim et al., 2014; Horton et al., 2014a) and with high school teachers and students through the Mobilize Project (Section 5.1). However, even though it lowers the barrier to entry for R, RStudio still requires users to code, so there is a startup cost associated with using it.
Other commonly used tools for doing statistical analysis are Stata, SAS, and SPSS. All three tools are stand-alone software, and all combine elements of graphical user interfaces with command-line tools. They are used in a variety of disciplinary contexts, so the argument for teaching them is ‘students will need to use this in the future.’ They are often popular in industry, because they come with guarantees of validity and technical support.
Stata (Figure 3.27) is often the tool of choice for introductory statistics courses taught in an economics department, as it is used routinely by economists. Stata does support users writing their own routines, so it is extensible. It also includes a command-line language, so it can be used to create reproducible research, in the sense that analyses can be re-run to get the same results.
However, Stata does not integrate with tools like the IPython notebook or knitr, so there is no easy way to produce reproducible reports (Rising, 2014).
Another major drawback to Stata is its price. As of 2015, educational pricing was $445 for an annual license or $895 for a perpetual license. A single business license costs $845 per year or $1,695 for a perpetual license. The company does offer group discounts, but these are also expensive (for example, a 10-user lab costs $1,850 plus $160 each, unless purchasing the “small” version which only deals with data up to 1,200 observations).
Figure 3.27: Stata interface
SAS has similar benefits and drawbacks to Stata. It is often used in pharmaceutical and business applications because it comes with a guarantee of accuracy. SAS is also hugely expensive for corporate use, in part because of the guarantee of accuracy and included support. For an individual it costs $9,000, and enterprise and government licenses require users to submit a request for a quote.
However, the company makes the software free for educational use, both as desktop software and via the cloud, so students can access it via a web browser, much like RStudio. This is a shrewd business decision, because it grooms students to become experts at their software. A screenshot of SAS is shown in Figure 3.28.
The other product from SAS often used in an educational setting is JMP.
JMP is a drag-and-drop, menu-driven graphical user interface. Some features of JMP are shown in Figure 3.29. The backbone of JMP is SAS, but JMP provides a simple visual interface. Again, JMP is expensive ($1,540 for an individual), but the company offers academic discounts: $50 for a yearly license for undergraduate and graduate students.
JMP provides many features useful for novices, like interactive brushing and linking, generalizable data cleaning, and visual model support.
Like TinkerPlots and Fathom, while JMP does produce interactive graphics within an individual session, these interactive results cannot be exported. Instead, a work session can be printed or pasted into a document. The student version of JMP does not support exporting graphics, but individual licenses do.
SPSS (Figure 3.30) is typically used by social scientists and is focused on a menu-driven interface, although it does have a proprietary command-line syntax allowing for reproducibility. However, the syntax is hard to understand, and code is generally created only by copying and pasting rather than by users writing it themselves (Academic Technology Services, 2013). SPSS is also very expensive: $5,760 for a 12-month individual license, or $95.75 for a one-year student license.
Although Stata, SAS, and SPSS are commonly used in industry, none of them seem to be supportive of learners. They all provide specific types of graphics, and most work is done using menus and wizards, so they do not make clear what the tool is actually doing. Using these tools creates ‘users’ not ‘creators’ of statistics (see Section 1.5 for more on the distinction). All three tools obscure the underlying computational processes and reduce statistical procedures to button clicks. Although they all provide some capability of extending the software with scripting, none of them have the community of statisticians sharing work that R has. Further, they all suffer from a lack of transparency about how internal routines were coded, they do not produce reproducible reports, and their pricing is prohibitive for the secondary school use-case.
(a) JMP dynamic querying
The most inspirational is JMP (Figure 3.29), which makes data analysis visual and interactive, providing many of the features of software for learning statistics with the power of a tool for really doing statistics.
3.10 Bespoke tools
In addition to the tools discussed above, there are a number of ‘bespoke’ tools for doing particular things with data. The most salient examples are Data Wrangler, Open Refine, Tableau, and Lyra.
Data Wrangler (Figure 3.31) began as a project from the Stanford Visualization Group in 2011 (Kandel et al., 2011b). Their goal was to provide a visual representation of data transforms, as well as a reproducible history of those transforms. For example, a user could select an empty row and indicate it should be deleted, at which point the Wrangler interface would suggest a variety of generalizable transformations that could be built from that one ‘rule’ (e.g., delete all empty rows, or always delete the 7th row). Once the user specifies a transform, it is applied to the data and added to the interaction history. The interaction history can be exported as a data transformation script in a variety of languages.
Wrangler can also perform simple database manipulations, in the same way dplyr manipulates data in R. The tools Wrangler provided were so useful the authors were able to convert their academic research project into a corporate venture, which is now known as Trifacta.
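The idea of generalizing one user action into a reusable rule, and recording it in a replayable history, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it is not how Data Wrangler is implemented, and the names History, delete_empty_rows, and delete_row_at are hypothetical, as are the two small example datasets.

```python
# Illustrative sketch only: this is not Data Wrangler's or Open Refine's API.
# All names here (History, delete_empty_rows, delete_row_at) are hypothetical.

def is_empty(row):
    """A row is empty when every cell is None or whitespace."""
    return all(cell is None or str(cell).strip() == "" for cell in row)


def delete_empty_rows(rows):
    """General rule: drop every empty row, wherever it occurs."""
    return [row for row in rows if not is_empty(row)]


def delete_row_at(index):
    """Positional rule: always drop the row at one fixed (0-based) index."""
    def rule(rows):
        return [row for i, row in enumerate(rows) if i != index]
    return rule


class History:
    """Records each applied rule so the same cleaning can be replayed."""

    def __init__(self):
        self.steps = []  # list of (description, rule) pairs

    def apply(self, description, rule, rows):
        """Apply a rule to the data and add it to the history."""
        self.steps.append((description, rule))
        return rule(rows)

    def replay(self, rows):
        """Re-run every recorded rule, in order, on another dataset."""
        for _description, rule in self.steps:
            rows = rule(rows)
        return rows

    def export(self):
        """Return the history as a numbered, human-readable script."""
        return "\n".join(
            f"{n}. {description}"
            for n, (description, _rule) in enumerate(self.steps, start=1)
        )


# Two made-up datasets with empty rows in different places.
first = [["name", "score"], ["ana", "90"], ["", ""], ["ben", "85"]]
second = [["name", "score"], ["", ""], ["cy", "70"], ["", ""], ["di", "60"]]

history = History()
cleaned_first = history.apply("delete all empty rows", delete_empty_rows, first)
cleaned_second = history.replay(second)          # the general rule carries over
positional_on_second = delete_row_at(2)(second)  # the positional rule does not

print(history.export())
print(cleaned_second)
print(positional_on_second)
```

In this sketch both rules would clean the first dataset identically, since its only empty row sits at index 2. On the second dataset the positional rule removes a row of real data and leaves the empty rows in place, while the replayed general rule still removes exactly the empty rows.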
Very similar to Data Wrangler is Open Refine (Verborgh and Wilde, 2013). The project was initially called Google Refine, but has since been turned into an open source package. Like Wrangler, Open Refine can help clean data and document the data cleaning process. It can also be used for data exploration and data matching, including geocoding. Open Refine is shown in Figure 3.32. Again, the results of the refining process are available as a reusable script.
Both Data Wrangler and Open Refine provide great alternatives to the spreadsheet paradigm. They privilege data as a complete object, and document all modifications. By suggesting methods of generalizing data transformations, they remove much of the grunge work of spreadsheet analysis. The other benefit of generalized data transformations is they encourage the user to think computationally. Instead of just doing ‘whatever works,’ there is user incentive to find a way to describe the data cleaning rule in a way that works generally.
Tableau (Figure 3.33; see footnote 4) is a bespoke system for data visualization. As such, it does not provide much support for data cleaning. Tableau makes it simple for users to create interactive graphics that can be easily published on the web.
Tableau will suggest the ‘best’ plot for particular data, which is both a blessing and a curse (Mackinlay et al., 2007). It can lead to much more appropriate uses of standard plots, but it also does not support novices’ learning trajectory.
A user can make a plot without having any idea of what it means. Similarly, Tableau makes it possible to fit models to data, but again does not make it clear what these models mean or how appropriate they may be. Like the tools discussed in Section 3.9, Tableau is expensive: $999 for an individual license or $1,999 for an individual professional license. However, as with SAS, the company makes the tool free to students.
Footnote 4: Screenshot from Lin (2012).
Figure 3.33: Tableau