University of California, Los Angeles. Bridging the Gap Between Tools for Learning and for Doing Statistics (dissertation).


In one data collection exercise from Mobilize, students collect data on their snacking habits, and the data are aggregated by class on a centralized server. When students download the aggregated data, they are represented in a rectangular format (Figure 4.1). However, the same data could also be represented in a more hierarchical format, as shown in Figure 4.2. Considering the current curricular tasks we expect students to complete, some seem difficult using this representation. For example, subsetting is less straightforward: selecting only the snacks where the reason given was “hungry” would require removing the matching pieces and creating an entirely new data structure with the time of day attached.
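The contrast can be made concrete with a small sketch. The column and field names below (“time”, “snack”, “reason”) are hypothetical, not the actual Mobilize schema:

```python
# The same snack observations in rectangular vs. hierarchical form.
# Field names are invented for illustration.

rectangular = [
    {"time": "morning",   "snack": "chips",  "reason": "hungry"},
    {"time": "afternoon", "snack": "apple",  "reason": "bored"},
    {"time": "evening",   "snack": "cookie", "reason": "hungry"},
]

# Subsetting the rectangular form is a one-line filter: every row
# carries all of its context with it.
hungry = [row for row in rectangular if row["reason"] == "hungry"]

# The hierarchical form nests snacks under time of day, so the same
# subset requires walking the tree and rebuilding a new structure that
# re-attaches the time of day to each matching snack.
hierarchical = {
    "morning":   [{"snack": "chips",  "reason": "hungry"}],
    "afternoon": [{"snack": "apple",  "reason": "bored"}],
    "evening":   [{"snack": "cookie", "reason": "hungry"}],
}

hungry_from_tree = {
    time: [s for s in snacks if s["reason"] == "hungry"]
    for time, snacks in hierarchical.items()
    if any(s["reason"] == "hungry" for s in snacks)
}
```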


Another common initial assumption is that data should always be visualized in such a way that all of the values can be read immediately. Data journalists often cite reading data values in a spreadsheet as their first line of inquiry, so this practice may be useful. However, there are other possibilities for the highest-level view of data. For example, Victor Powell’s CSV Fingerprint project provides an extremely high-level view, using colors to indicate data types and missing data (Powell, 2014). See Figure 4.3 for an example of this visualization.

Figure 4.3: ‘CSV Fingerprint’ dataset visualization by Victor Powell
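The idea behind CSV Fingerprint can be sketched in a few lines: classify each cell by type and map types to colors instead of showing values. The type rules and palette below are assumptions for illustration, not Powell’s actual implementation:

```python
# Classify each cell of a CSV by type and replace values with colors,
# giving a high-level "fingerprint" of the data.
import csv
import io

PALETTE = {"number": "blue", "text": "green", "missing": "red"}

def cell_type(value):
    if value.strip() == "":
        return "missing"
    try:
        float(value)
        return "number"
    except ValueError:
        return "text"

def fingerprint(csv_text):
    """Return a grid of color names, one per cell, instead of values."""
    rows = csv.reader(io.StringIO(csv_text))
    return [[PALETTE[cell_type(cell)] for cell in row] for row in rows]

grid = fingerprint("name,age\nada,36\nbob,\n")
```

Even this crude version makes a missing value (the red cell) visible at a glance without reading a single number.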

Because so much of the interesting data on the web are made available through application programming interfaces (APIs), it is crucial to provide smooth integration with them. An API connection helps users quickly engage with interesting topics while requiring little prior contextual knowledge (Gould, 2010). Several of the existing tools for statistical analysis provide API connections; Fathom is the most notable on the statistical education side (Finzer, 2002a). However, the interfaces are often clunky.
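Smooth integration, at minimum, means one call that returns analysis-ready records. A minimal sketch, where the URL and the injectable `opener` parameter are illustrative assumptions rather than any tool’s actual interface:

```python
# A thin data-API helper: one call yields parsed records.
import json
import urllib.request

def fetch_records(url, opener=urllib.request.urlopen):
    """Fetch a JSON array of records from a data API.

    `opener` is injectable so the function can be exercised without a
    live network connection; by default it performs a real HTTP request.
    """
    with opener(url) as response:
        return json.loads(response.read().decode("utf-8"))
```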

Another open area of research focuses on how data could be surrounded by more inherent documentation. When using API data, a user will typically perform a direct call to the API, store the data locally during their work session, and save results. However, they often do not cache the API data. The next time the user runs the analysis, they make another API call and get the most current data. This workflow means the analysis must be very reproducible, as the data cleaning steps need to work for slight variations in the supplied data.

However, because APIs rely on internet services, there is the possibility the data provider could ‘go dark’ and stop providing data. To protect against this, there is a growing desire to provide functionality for caching API data calls, so the most recent data can continue to be used in these cases (Ogden et al., 2015). (In this context, APIs refer specifically to data-access APIs. The term is more general, and can be used to refer to any programming language specification, but statisticians do not use it this way.)
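The caching workflow described above might be sketched as follows. The file layout and function names are assumptions for illustration, not the interface of any particular package:

```python
# Call the API, but keep a local copy so the analysis can still run
# if the provider goes dark.
import hashlib
import json
import os

def cached_call(url, fetch, cache_dir="api_cache"):
    """Return the API response for `url`, falling back to the cached
    copy if `fetch` raises (e.g. the provider is unreachable)."""
    os.makedirs(cache_dir, exist_ok=True)
    name = hashlib.sha256(url.encode()).hexdigest() + ".json"
    path = os.path.join(cache_dir, name)
    try:
        data = fetch(url)
        with open(path, "w") as f:
            json.dump(data, f)          # refresh the cache on success
        return data
    except OSError:
        with open(path) as f:           # provider is down: reuse last copy
            return json.load(f)
```

On success the cache is silently refreshed; on failure the analysis proceeds with the most recently retrieved data, exactly the behavior the reproducibility discussion above calls for.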

There is also a need for better metadata about API data, and about data in general. Ideally, API data would be accompanied with information about how it was retrieved and when, and perhaps more context about how it was collected initially. Fuller data documentation fits in with the idea of documentation as part of the process (discussed in Section 4.8), because the data would come with more of a story from the start.
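The “data with a story” idea could be as simple as wrapping a dataset with provenance metadata recording how and when it was retrieved. The field names below are illustrative, not a standard:

```python
# Wrap records with provenance metadata so the data carry their story.
import datetime

def with_provenance(records, source_url, notes=""):
    return {
        "retrieved_at": datetime.datetime.now(
            datetime.timezone.utc).isoformat(),
        "source": source_url,
        "notes": notes,      # e.g. how the data were originally collected
        "records": records,
    }

snapshot = with_provenance(
    [{"x": 1}], "https://example.org/api",
    notes="student-collected snack observations")
```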

4.3 Support for a cycle of exploratory and confirmatory analysis

Requirement 3: Support for a cycle of exploratory and confirmatory analysis.

Statistical thinking tools should always promote exploratory analysis. Exploratory Data Analysis (EDA) was proposed by John Tukey in his 1977 book of the same name. Although it has its own acronym, EDA is not very complicated; it is simply the idea that we can glean important information about data by exploring it. Instead of starting with high-powered statistical models, Tukey proposes doing simple descriptive statistics and making many simple graphs – of one variable or several – in order to look for trends. Indeed, humans looking at graphs are often better than computers at identifying trends in data (Kandel et al., 2011a).
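In this spirit, a first EDA pass needs nothing more than counts, order statistics, and a crude text display. A minimal sketch, with invented snack-calorie values:

```python
# A minimal EDA pass in Tukey's spirit: no models, just simple
# summaries and a crude text display. The sample values are invented.
import statistics

snack_calories = [90, 150, 110, 300, 95, 140, 120]

summary = {
    "n": len(snack_calories),
    "min": min(snack_calories),
    "median": statistics.median(snack_calories),
    "max": max(snack_calories),
}

# A crude dot-chart in text: one star per 50 calories. Even this is
# enough to spot the unusually large value.
lines = ["{:>3} {}".format(v, "*" * (v // 50))
         for v in sorted(snack_calories)]
```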

EDA is an engaging and empowering way to introduce novices to statistics, but introductory courses do not always include it, perhaps because it requires computational skill to achieve, or because it can seem to teachers too ‘soft’ a skill to be truly useful. In particular, the math teachers we have trained as part of the Mobilize grant (Section 5.1) are often wary of situations where there is more than one possible answer.

Although EDA can appear to be a ‘soft’ or subjective practice, there are situations where it is the best and richest method of analysis. For example, making inference with formal statistical tests requires randomly collected data.

But in many situations (including the participatory sensing context discussed in Section 5.1.1) data are either non-random, or comprise a census. In these situations, EDA is the best method for finding patterns in the data and performing informal inference.

When Tukey wrote his book, computers were not in everyday use, so his text suggests using pencil and paper to produce descriptive statistics and simple plots. Today, the tasks listed in the book are even easier because of the many computer tools facilitating them, and the results are analogous to those made by hand (Curry, 1995; Friel, 2008). However, simply being able to implement the things Tukey did by hand in the 1970s is not enough; computers should be enabling even more powerful types of exploration (Pea, 1985).

Many tools for teaching statistics do support rapid exploration and prototyping. In fact, this is one area where tools for doing statistics fall short: in R, creating multiple graphs takes effort, as does playing with parameter values.

A sense of play is important to data analysis, which is never a linear process from beginning to end. Instead, data scientists repeatedly cycle back through questioning, exploration, and confirmation or inference.

There are many opinions as to what the cycle entails. In the Mobilize Introduction to Data Science course we use a cycle of statistical questions, collecting data, analyzing data, and interpretation, based on the GAISE guidelines discussed in Section 1.2 (Franklin et al., 2005). Because this is a cycle, interpreting leads back to questioning. Ben Fry suggests data analysis takes the shape of a cycle comprising the steps of acquiring, parsing, filtering, mining, representing, refining, and interacting (Fry, 2004). These cycles can be thought of as either exploratory or confirmatory, and the two complement each other. If users find something interesting in a cycle of exploratory analysis, they need to follow with confirmatory analysis.

The complementary exploratory and confirmatory cycles were suggested by Tukey, and have been re-emphasized by current educators (Biehler et al., 2013). There are several ways to bring the two cycles together. One method uses graphics to perform inference, discussed in Section 4.5, and thus brings exploratory and confirmatory analysis together. Andrew Gelman thinks modelers should be using simulation-based methods to check their models, and people doing exploratory data analysis should be fitting models and making graphs to show the patterns left behind (Gelman, 2004).

The difference between exploratory and confirmatory analysis (or informal and formal inference) is like the difference between sketching or taking notes and the act of writing an essay. One is more creative and expansive, and the other tries to pin down the particular information to be highlighted in the final product. A system supporting exploration and confirmation should provide a workflow connecting these two types of activities. Users need ‘scratch paper,’ or a place to play with things without them being set in stone. While data analysis needs to leave a clear trail of what was done so someone else can reproduce it, a scratch paper environment might allow a user to do things not ‘allowed’ in the final product, like moving data points around. This connects to Biehler’s goal of easily modifiable draft results (Biehler, 1997).

Many current systems for teaching statistics provide simple sketching-like functionality (allowing users to manipulate data or play with graphic representations), but the transition to an essay-like format is more complex. An essay requires a solid line of reasoning strung through. The question then becomes, how do readers interact with it? We also want the user to be able to see the multiplicity of possibilities (the ‘what-ifs’), while maintaining the ability to reset to reality.

If the system kept a transcript of all the actions taken in the sketching area, the resulting analysis possibilities would be analogous to the set of sketches an artist or designer produces in the first stages of a project (a ‘charrette’). The goal is to produce and document many ideas, ideally all very different from one another. The charrette process is useful in art because it allows an artist to provide provocations instead of prototypes (Bardzell et al., 2012). Provocations drive thinking forward, while prototypes tend to lock in a particular direction.

In order to support this, the system could provide a way to save many analyses and look through the collection.
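One way such a collection might work is a minimal transcript-and-snapshot store. The interface below is an assumption about how such a system could look, not an existing tool:

```python
# A "charrette" store: keep a transcript of sketch-area actions and
# snapshot many candidate analyses so they can be browsed later.

class Charrette:
    def __init__(self):
        self.transcript = []    # every action, in order
        self.snapshots = {}     # named analyses worth revisiting

    def act(self, description):
        self.transcript.append(description)

    def snapshot(self, name):
        """Save the actions so far as one candidate analysis."""
        self.snapshots[name] = list(self.transcript)

session = Charrette()
session.act("histogram of calories")
session.act("log-scale x axis")
session.snapshot("skewed-calories sketch")
session.act("boxplot by time of day")
session.snapshot("time-of-day sketch")
```

Because every snapshot is just a slice of the transcript, the user can always ‘reset to reality’ by replaying fewer steps.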

In the context of art, once the charrette is complete, one sketch is converted into a finished product. In statistics, we want to support this same type of trajectory. Artists take one sketch and look back and forth between it and their in-progress painting as they complete it. With data, the user needs to make sure the trajectory of analysis is solid, so the relationship between a less-detailed sketch and a finished product is less clear.

The book “Infographic Designers’ Sketchbooks” provides a glimpse into how designers of infographics and data visualizations work. Designers of infographics typically start with pen-and-paper sketches, and they use digital tools as digital paint (occasionally, they will speak about sketching in Photoshop or Illustrator).

In contrast, designers of visualizations grounded in data (Mike Bostock, Tony Chu, Cooper Smith, Moritz Stefaner) begin by ‘sketching’ using code. They use a variety of computer tools to do this (e.g., Bostock’s sketches appear to be done in R; Smith sketches in Processing), but they are not explicitly mapping color to paper the way the infographic designers are (Heller and Landers, 2014).

While it is not precisely clear how these different styles of creation could be incorporated into one tool, the LivelyR work discussed in Section 5.2 shows some methods for allowing both in the same interface.

4.4 Flexible plot creation

Requirement 4: Flexible plot creation.

To fully support exploratory data analysis, a tool needs to emphasize plotting.

Computers make it possible to visually explore large datasets in a way not possible before. For example, Jacques Bertin developed a method for reordering matrices of unordered data in the 1960s. At the time, Bertin’s method involved cutting up paper representations of matrices or creating custom physical representations, then reordering the rows and columns by hand (Bertin, 1983). Much like the graphical methods John Tukey developed by hand in the ‘60s, Bertin’s methods are now easily accessible on the computer.
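Bertin’s reordering is now a few lines of code. His actual methods were richer and more varied; sorting rows and columns by their marginal sums, as sketched below, is one simple instance of the idea:

```python
# Bertin-style matrix reordering: sort rows and columns by their
# totals so blocks of similar values line up visually.

def reorder(matrix):
    # Order row indices by row sums.
    rows = sorted(range(len(matrix)), key=lambda i: sum(matrix[i]))
    # Order column indices by column sums.
    cols = sorted(range(len(matrix[0])),
                  key=lambda j: sum(row[j] for row in matrix))
    return [[matrix[i][j] for j in cols] for i in rows]

m = [
    [0, 1, 1],
    [1, 0, 0],
    [0, 1, 1],
]
reordered = reorder(m)
```

What took Bertin scissors and paper is an instant, repeatable operation, which is exactly the kind of amplification Pea argues computers should provide.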

Providing easy plotting functionality is a goal of almost any tool, both for learning and doing statistics. However, there are two perspectives on plotting.

One is to provide appropriate default plots based on data types; the other is to provide a system allowing for the flexible creation of any plot a user can think of.

The use of appropriate defaults is the perspective driving base R graphics, where the generic plot() function produces different graph types depending on the class of the data passed to it.
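The dispatch idea can be sketched as a small function choosing a standard plot from the types of the variables involved, much as R’s generic plot() dispatches on its argument’s class. The mapping below mirrors common defaults (one numeric variable yields a histogram, and so on); it is illustrative, not Table 4.1 itself:

```python
# Choose a default plot type from variable types, in the spirit of
# R's generic plot() dispatch. The mapping is illustrative.

def default_plot(x_type, y_type=None):
    if y_type is None:
        return "histogram" if x_type == "numeric" else "bar chart"
    if x_type == "numeric" and y_type == "numeric":
        return "scatterplot"
    if x_type == "categorical" and y_type == "numeric":
        return "side-by-side boxplots"
    return "mosaic plot"
```

In this paradigm the tool (or, per the standards discussed below, the student) acts as the plot-choosing algorithm.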

There are also efforts to provide default visualizations representing all variables in a particular data set, such as the generalized pairs plot, which can be used to automatically visualize all two-variable relationships (Emerson et al., 2013). Many bespoke data visualization systems (discussed further in Section 3.10) likewise automatically offer the ‘best’ plots for each variable in the data (Miller, 2014; Mackinlay et al., 2007). Some of the most common defaults are enumerated in Table 4.1.

The matching of plots to data is in line with most educational standards.

When mathematical educational standards reference statistical graphics, they tend to do so in a rote way, suggesting students should learn how to pick the most appropriate standard plots for their data. Educators of this mindset will often critique students’ visualization choices, and emphasize students should be able to choose appropriate plots for their data or otherwise make, read, and interpret standard plots (e.g., scatterplots, bar charts, histograms) (National Governors Association Center for Best Practices and Council of Chief State School Officers, 2010). In other words, students should serve as the plot-choosing algorithm, and learn to make the mappings outlined in Table 4.1.

This is also the paradigm that spreadsheet tools like Excel, or route-type tools (Bakker, 2002), implicitly expect, because they offer only standard visualization types. However, a growing body of research suggests that rote learning, both in general and particularly with respect to statistical graphics, is not an effective way to embed knowledge. Instead, researchers suggest students should learn to develop their own representations of data.

The development of new data representations falls on the other end of the spectrum. In Leland Wilkinson’s Grammar of Graphics, Hadley Wickham’s ggplot2, TinkerPlots, and Fathom, users can build up novel plots from any type of data encoding they find appropriate (Wilkinson, 2005; Wickham, 2008; Konold and Miller, 2005; Finzer, 2002a). While the Grammar of Graphics and ggplot2 are intended for use by advanced data analysts, the inclusion of this type of flexible plotting in the learning tools TinkerPlots and Fathom points to the cognition research about visual encodings.
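The grammar-of-graphics idea is that a plot is built by composing independent pieces (data, aesthetic mappings, geometric layers) rather than chosen from a fixed menu. The toy spec builder below mimics only the composition style of ggplot2; it is not ggplot2’s actual API:

```python
# A toy grammar-of-graphics spec: a plot is data + mappings + layers,
# composed with `+` in the style popularized by ggplot2.

class Plot:
    def __init__(self, data, mapping):
        self.data = data
        self.mapping = mapping      # e.g. {"x": "time", "y": "calories"}
        self.layers = []

    def __add__(self, geom):
        self.layers.append(geom)    # each `+` adds a layer
        return self

p = (Plot(data=[{"time": 1, "calories": 90}],
          mapping={"x": "time", "y": "calories"})
     + "geom_point"
     + "geom_smooth")
```

Because any mapping can be combined with any layer, users are not limited to the standard chart menu; this is the flexibility the learning tools above aim for.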
