Data sources - StatPREP Workshop

Real data

At the October 2018 StatPREP meeting at the Mathematical Association of America’s DC headquarters the new MAA deputy executive director, Rachel Levy asked a simple question: What’s real data?

A core recommendation of the American Statistical Association’s GAISE report is to “use real data” when teaching statistics. Prof. Levy wasn’t looking to prompt a philosophical discussion of the nature of reality, but to define a benchmark. If a widely lauded, consensus report from the world’s leading organization of statisticians calls for every introductory course to use real data, we need a way for instructors to know, for sure, whether they are in compliance. And so we discussed what is “real” when it comes to teaching statistics using real data.

Our conclusion:

Data is real when it has at least 1000 rows, at least 5 variables, and was not initially collected with a primary purpose of teaching statistics.

How did we come up with this definition? In part, we looked at the examples of “real data” in the GAISE report, for instance a dataset on housing with 2930 rows and 80 variables, or a dataset on 53,940 diamonds with 10 variables. But mainly, we looked at the reasons motivating the recommendation to teach with real data: which practices are encouraged and which discouraged. These are:

teach statistics as an investigative process,
foster active learning,
give students experience with multivariate thinking,
use technology but focus on concepts.

Why 1000 rows? Working with data on this scale requires using appropriate technology, the sort used in the data workplace. Graphics with 1000 points can be rich enough to see relationships, even when there are multiple variables. And with 1000 rows, a central concept in statistical reasoning, sampling variation, can be shown directly using random selection.

Why 5 variables? “Multivariate” is at least three. There are three basic roles played by variables in data analysis: response variable, explanatory variable(s), covariate(s). But we need more than 3 because both categorical and quantitative variables can star in any of these roles. And we need room for students to explore actively. This can be as simple as letting them choose which variables to relate to which.

As appropriate for an operational definition, it is readily applied to any data set and provides a yes-or-no answer. There’s no call to be fanatical about the numbers 1000 and 5; anything close will do. But the common textbook practice of using on the order of 10 rows and only one or two variables is not close.

Using real data

Now it’s time to talk about the meaning of the word “use” in the phrase “use real data.” Much of the data that appears in textbooks is intended to provide numbers for statistical calculations. Any other numbers could be used instead. But there are far richer ways to make data central to a lesson or exercise, ways that involve the actual handling of data, the search for patterns or anomalies, data display, and the interpretation of data in its actual context.

The data should be “raw.” That is, students should work directly with the data themselves rather than with a statistical summary. Statistics like means and proportions are not data but summaries of data.
The data should be in a professional form. It’s important for students to learn about the conventional ways in which data are organized, the most important of which is as a spreadsheet like table, with a columns for each variable and a row for each “unit of observation.” They should see examples of good practice, for instance putting meta-data such as measurement units and descriptions in a separate codebook file.
The data should be rich enough to reward exploration and encourage hypothesis formation.

Data collections

CAUSEWEB The CAUSE website includes links to hundreds of locations for data. On the Home page in the upper right corner type “Datasets” in the Search field.
Awesome public datasets Consortium for the Advancement of Undergraduate Statistics Education (CAUSE) - https://www.causeweb.org/
New York City Open Data More than 1000 datasets on various aspects of life in the Big Apple. You can search a specific term or click to view all available datasets.
Larry Winner data Larry Winner from the Department of Statistics at the University of Florida has amassed hundreds of datasets. Each dataset includes a description.
OzDASL Datasets categorized by statistical topic.
Revolution Analytics R Data Links to various data sources.
Kaggle Website that hosts data analysis competitions. Many datasets here are quite large and very messy. Establishing a free account is necessary for access to the data.
Curated research data
The FiveThirtyEight Blog
State and local government “open data” repositories. Searching on “open data [your-city-name]” will often lead you to a site.

Searching for social justice

Many instructors understandably seek to find data that is relevant, particularly that relates to issues in the news. For many instructors, data concerning matters of social justice is particularly valuable. There are good and bad sources for such data.

Care is required. The passion for social justice is not always consistent with the available data. To illustrate, consider what ABC News (for instance) calls the lead public-health crisis in Flint, MI. For data, you can go to the Centers for Disease Control lead publication site. Here is a research paper, Blood Lead Levels of Children in Flint, Michigan: 2006-2016 from the Journal of Pediatrics, that shows, contrary to conventional wisdom, that whatever problems are associated with water supply in Flint, lead poisoning is not one of them.

Wrangling and Cleaning

Basics of how to wrangle and clean data

Join trick

Separating columns

Dates and times

Real stories

The ASA and Miami Univ of Ohio host a series of videos where professionals talk about their work in applied statistics: https://statsandstories.net. As of May 2019, there are about 100 videos on work in many different areas of research, the economy, government, and society.