# University of California, Los Angeles

Bridging the Gap Between Tools for Learning and for Doing Statistics

A dissertation submitted in partial ...

On the Mobilize project, we have done our best to provide satisfactory user rewards. In fact, being able to analyze their own data is one of the main payoffs students receive for participating in the collection exercise. The dashboards produced by the technical team (discussed in Section 5.1.4.1) help with this. But again, the tools we are using fall short: they encourage students to become users of a dashboard tool rather than true producers of statistics. It is also clear that other participatory projects could benefit from an easy-to-use data analysis platform.

**5.1.1.3 Analyzing participatory sensing data**

While participatory sensing data has many benefits, in particular giving power back to data creators, it is almost always messy. It rarely represents a random sample, either in terms of the people collecting data or their collection methods.

For example, classes involved in the Mobilize project have collected data on snacking habits every year since the grant began in 2010. If students are completely successful at collecting every snack they eat, the data represent a census rather than a sample, and it is hard to say what statistics should be used to analyze it. However, even the most conscientious of data collectors usually miss a few observations. It is unknown if those observations are missing at random or missing not at random, and even less clear what to do with the data.

Up to this point, the grant has side-stepped this issue by focusing on exploratory data analysis rather than formal inference. The IDS course has used randomization methods to draw limited conclusions from the data. However, it is often unsatisfying for teachers, students, and even non-statistician PIs to learn that we cannot make formal inferences from these data.
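The randomization approach can be illustrated with a short sketch: repeatedly shuffle the group labels and ask how often a difference as extreme as the observed one arises by chance. This is a generic two-group permutation test written in Python for concreteness (the IDS course itself uses R, and the snack counts below are invented for illustration):

```python
import random

def randomization_test(group_a, group_b, reps=5000, seed=42):
    """Approximate a two-sided permutation test of the difference in means."""
    rng = random.Random(seed)
    observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(reps):
        rng.shuffle(pooled)  # re-randomize which observations fall in each group
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(diff) >= abs(observed):
            extreme += 1
    return extreme / reps  # approximate two-sided p-value

# Hypothetical weekly snack counts for two classes
a = [3, 5, 4, 6, 5, 4]
b = [1, 2, 2, 1, 3, 2]
p = randomization_test(a, b)
```

A small estimated p-value suggests the observed difference is unlikely under random relabeling; with messy, non-random participatory data, this supports only the kind of limited conclusions described above.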

Many examples of dealing with missing participatory sensing data come from ornithology. The Cornell Laboratory of Ornithology has found that data mining methods can help deal with missing data (Caruana et al., 2006). Decision trees, bagging, and boosting can all be used to fill in data that are missing not at random. For example, bird data tend to be biased toward more densely populated areas, where more people are available to record sightings (Hochachka et al., 2010). Another approach is data augmentation: combining two datasets, one collected using a participatory method and one collected more rigorously (perhaps by paid researchers). With these two data sets, researchers fit two models, one to each data set, and compare the predictions (Munson et al., 2010).

In environmental contexts, where the variable of interest is assumed to be smoothly distributed, researchers have used interpolation (Mendez et al., 2013). If interpolation yields a model with high variability in certain areas, researchers can then incentivize data collectors to gather data in those highly variable spatial areas (Mendez and Labrador, 2012). There are also examples of predicting election outcomes using non-representative polls (Wang et al., 2014).
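As a concrete, heavily simplified sketch of the interpolation idea, inverse-distance weighting estimates a smoothly varying quantity at an unobserved location from nearby readings. The function and data below are hypothetical illustrations, not the method of the cited work:

```python
def idw(points, query, power=2):
    """Inverse-distance-weighted estimate at `query` from (x, y, value) readings."""
    num = den = 0.0
    for x, y, v in points:
        d2 = (x - query[0]) ** 2 + (y - query[1]) ** 2
        if d2 == 0:
            return v  # query coincides with a reading; return it exactly
        w = 1.0 / d2 ** (power / 2)  # nearer readings get larger weights
        num += w * v
        den += w
    return num / den

# Two hypothetical sensor readings; estimate the value midway between them
estimate = idw([(0, 0, 10.0), (2, 0, 20.0)], (1, 0))  # 15.0
```

Areas where nearby estimates disagree sharply are exactly the "highly variable spatial areas" where additional collection could be incentivized.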

Finally, because many psychological researchers use the Amazon Mechanical Turk, there is a growing field of research on dealing with those data.

While Turkers do not represent a random sample of humanity, they do seem to represent the demographics of the internet well, although they skew somewhat young (Ipeirotis, 2010; Ross et al., 2010). (The Amazon Mechanical Turk is an online marketplace for finding humans to perform tasks that are difficult for computers. It was first developed to create reference data sets for image recognition research, since humans can, for example, easily identify a number shown in an image.) One method for increasing the quality of data from Turkers is repeated labeling (i.e., having several Turkers answer the same question to see what the consensus is). But too much repeated labeling is costly. An alternative is to use the EM algorithm to convert 'hard' labels to 'soft' (probabilistic) labels (Ipeirotis et al., 2010). A simpler method is to compute some measure of trustworthiness for each data collector and use weights to put emphasis on the more trusted data (Welinder et al., 2010).
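The repeated-labeling and trust-weighting ideas can be sketched together: each item receives labels from several workers, and votes are optionally weighted by a per-worker trust score. This is an illustrative Python sketch (worker ids and trust values are invented), not the EM-based method of Ipeirotis et al.:

```python
from collections import defaultdict

def weighted_consensus(labels, trust=None):
    """Combine repeated labels for one item into a consensus label.

    `labels` maps worker id -> label; `trust` optionally maps
    worker id -> weight (workers default to weight 1.0).
    """
    votes = defaultdict(float)
    for worker, label in labels.items():
        votes[label] += trust.get(worker, 1.0) if trust else 1.0
    return max(votes, key=votes.get)

labels = {"w1": "cat", "w2": "dog", "w3": "cat"}
print(weighted_consensus(labels))               # plain majority vote: cat
print(weighted_consensus(labels, {"w2": 5.0}))  # heavily trusting w2: dog
```

With uniform trust this reduces to majority voting; unequal trust lets a demonstrably reliable worker outvote several unreliable ones.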

**5.1.2 Computational and statistical thinking**

“Computational thinking” is a term first described by Jeannette Wing, the head of the computer science department at Carnegie Mellon (Wing, 2006). Wing is concerned with access to computer science for the general population, and developed the concept of computational thinking to describe the skills she believes are foundational. Computational thinking encompasses much more than the traditional view of computer science; Wing describes it as “a fundamental skill for everyone” (Wing, 2006). It includes many facets of thinking like a computer scientist: problem solving, algorithmic thinking, recursive thinking, and abstraction. Computational thinking is therefore one of the crucial skills in today’s economy. Not coincidentally, true statistical literacy requires many computational thinking skills.

‘Statistical thinking’ is a similarly general term, encompassing fundamental habits of thought related to statistical concepts. For example, statistical thinking means considering the tendency of phenomena in a distribution or over time, as well as the variation within data or across repeated samples. While these topics are included in introductory statistics curricula, they are often presented as skills to be applied within class, not in life more generally. Students presented with data visualizations often do not connect them to the context of the data from which they came (Meirelles, 2011; Wickham, 2010).

Statistical thinking also includes what Darrell Huff called “talking back to a statistic”: checking for bias in reporting on statistical issues, and asking whether a statistical argument makes sense when presented with one (Huff, 1954). On a broader level, it can be argued that people should develop a “data habit of mind” and look for evidence to ground claims in their lives, even outside of a statistics class (Finzer, 2013; Pfannkuch et al., 2014).

Computational and statistical thinking tie together because computers are now indispensable for statistics. There is no ‘data science’ without computation, and statistics provides an excellent place to introduce computing because it is inherently contextualized. Traditional methods of motivating programming in the classroom (e.g., building a game) are often less appealing to women (Kelleher and Pausch, 2005; Cooper and Cunningham, 2010). However, couching statistics in data in which students are interested (particularly data in which they can see themselves, like participatory sensing data) makes the desire for inference natural (Wild et al., 2011). Speaking about the difference between mathematics and statistics, Cobb and Moore note, “in mathematics, context obscures structure. [...] In data analysis, context provides meaning” (Cobb and Moore, 1997).

**5.1.3 Curriculum**

As previously noted, Mobilize has created curricular units across content areas.

They are all grounded in participatory sensing, computational thinking, and statistical thinking. While the grant has units to insert into computer science, algebra, and biology courses, as well as a stand-alone, year-long Introduction to Data Science curriculum, my primary contributions were to the ECS and IDS material. The ECS and IDS curricula were also the two most computationally based courses.

**5.1.3.1 Exploring Computer Science unit**

The first Mobilize curriculum to be developed was a six-week-long unit on data analysis, written to fit within a year-long curriculum called Exploring Computer Science (ECS). ECS initially piloted in the LAUSD, but has now grown to include schools in Chicago, Oregon, Utah, Washington, D.C., and New York.

Thousands of high school students are exposed to ECS annually. ECS includes six-week-long units on human-computer interaction, problem solving, HTML and web design, Scratch programming (animation and game design), LEGO Mindstorms robotics, and data analysis.

Members of the Mobilize team, including myself, helped develop the data analysis unit for ECS. In the unit, students engage in exploratory data analysis, creatively interacting with their own data.

The initial version of the curriculum involved students collecting data using one of two ‘canonical’ surveys provided: one for collecting data about advertising in the community, and one for collecting data about personal snacking habits. In the example of the advertising survey, a student would take a photo of an ad (e.g., a billboard), and then answer questions about the demographic they believe the ad is targeting, what product it is selling, how much they want the product, etc.

In keeping with participatory sensing, the survey is implemented on a smartphone, and the data are automatically uploaded to a server, along with information gathered by the phone. The unit incorporated student-collected data alongside previously-collected data from sources like the CDC to expose students to a variety of data analysis topics.

The curriculum and its implementation struggled with many issues. A major stumbling block for the teachers in professional development was learning the data analysis tool. However, as we moved through the various technical tools discussed in Section 5.1.4, we realized not only that our professional development was too short to get the teachers up to speed, but also that the six-week unit was not enough time for students to truly engage with R. Additionally, as ECS grew nationally (Mobilize is essentially a sibling grant to ECS, with many overlapping PIs), it became harder to support a particular data analysis tool.

The curriculum was later re-written to be tool agnostic, and the need for smartphones for data collection was minimized. In the most current version, students decide what topic they want to study, collect data using pencil and paper, then analyze it using a tool of their teacher’s choice.

This experience underscored the need for a longer data science curriculum, which formed the inspiration for the Introduction to Data Science curriculum.

**5.1.3.2 Math and science units**

The Mobilize math and science curricula comprise shorter units, and were designed to be inserted into existing courses: Algebra I and Biology, respectively.

In math, students collect participatory sensing data on their snacking habits, and learn to connect linear modeling and prediction to the equation of a line, y = mx + b. In science, the participatory sensing campaign concerns trash: whether recyclables are being put into the incorrect container or not. This connects to environmental concerns biology teachers are comfortable discussing.
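The link the math unit draws between a fitted line and y = mx + b is ordinary least squares. A minimal sketch, written in Python with invented data (the units themselves use dashboards rather than code), recovers the slope m and intercept b from observed points:

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope m and intercept b for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y over the variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The fitted line passes through the point of means
    b = mean_y - m * mean_x
    return m, b

# Hypothetical data: number of classes attended vs. snacks eaten that day
m, b = fit_line([1, 2, 3, 4], [2, 4, 6, 8])  # exactly y = 2x + 0
```

Predictions then come straight from the equation: for a new x, the predicted y is m * x + b, which is precisely the connection the unit asks students to make.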

The math and science units were loosely based on the ECS curriculum I helped develop. For more on the curricula, see (Board et al., 2015; Perez et al., 2015).

Because these curricula run on much shorter time scales, and because the teachers we trained were even less comfortable with statistics, neither the Algebra I nor the Biology unit includes true computational statistics. Instead, students use bespoke dashboards to analyze their data. The dashboards let them ask and answer questions with data, but the questions are somewhat limited by the abilities of the tool.

**5.1.3.3 Introduction to Data Science course**

The most exciting curricular development in Mobilize is the year-long course, Introduction to Data Science (IDS). This course piloted in 10 schools in the 2014-2015 school year, and will expand to 25 teachers in 34 schools in 2015-2016. Unlike the previous curricula mentioned, the IDS course is a full, free-standing course for high school students.

Historically, it has been difficult to schedule students into statistics and computer science courses because these courses do not typically fulfill graduation requirements. The state of California adds an additional hitch, as all high school courses must be approved by the University of California Office of the President (UCOP) to be used for admission to college within the UC system, the Cal States, or the California community colleges. One boon to the IDS course is that we were able to get it approved by the UCOP, so as of fall 2014, taking the IDS course counts for “C” credit. As a result of the UCOP approval, high school counselors are interested in scheduling students into the course, and students are interested in taking it.

I was involved in the overall planning for the curriculum, the UCOP application, and the detailed development of the first two units of the curriculum. However, I was not involved in the detailed development of the second two units. The rest of the IDS team includes Suyen Moncada-Machado, Robert Gould, Terri Johnson, and James Molyneux.

(California has “A-G” requirements, which require high school students to take two years of history, four years of English, three years of math, two years of science, two years of a foreign language, one year of visual or performing arts, and one year of an elective. “C” credit indicates the IDS course satisfies one of the three years of college-preparatory math.)

The topics covered by the class include visualizing data in one and multiple dimensions, statistical questioning, randomization methods, linear modeling, classification and regression trees, and k-means clustering. For the full curriculum, see (Gould et al., 2015).

The entire curriculum is based on R within RStudio, with regular labs for students to work through the techniques they are learning in class. There are also numerous ‘hands-on’ activities allowing students to enact data analysis physically, whether creating a human box plot as captured in (Menezes, 2015) or doing randomization activities with notecards.

**5.1.3.4 Labs**

All of the IDS coursework is grounded in computational labs (Figure 5.1), which take place in RStudio (Molyneux et al., 2014). The labs take advantage of the features of RStudio, and provide an integrated method for viewing lab prompts and accomplishing the associated tasks.