Simulated or Real: What type of data should we use when teaching social science statistics?

I just finished teaching a new course on collaborative data science to social science students. The materials are on GitHub if you're interested.

What did we do and why?

Maybe the most unusual thing about this class from a statistics pedagogy perspective was that it was entirely focused on real world data that the students gathered themselves. I gave them virtually no instruction on what data to gather. They gathered data they felt would help them answer their research questions.

Students directly confronted the data warts that usually consume a large proportion of researchers' actual time. My intention was that the students systematically learn tools and best practices for how to address these warts.

This is in contrast to many social scientists' statistics education. Typically, students are presented with pre-arranged data. They are then asked to perform some statistical function with it. The end.

This leaves students underprepared for actually using statistics in an undirected project (their thesis, a job). Typically, when confronted with data gathering and transformation issues in the real world, most muddle through, piecing together ad hoc techniques as they go along in a decidedly inefficient manner and often with poor results. A fair number of students will become frustrated and may never actually succeed in using any of the statistical tools they did learn.

What kind of data?

How does this course fit into a broader social science statistical education?

Zachary Jones had a really nice post the other day advocating that statistics courses use Monte Carlo simulation rather than real world data. The broad argument is that the messiness of real world data distracts students from carefully learning the statistical properties that instructors intend them to learn.

Superficially, it would seem that the course I just finished and Zachary's prescription are opposed. We could think of stats courses as using one of two different types of data:

simulated --- real world

Simulated vs. Real?

As you'll see, I almost entirely agree with Zachary's post, but I think there is a more important difference between the status quo social science statistics course and less commonly taught courses such as mine and (I think) the one Zachary is proposing. The difference is where the data comes from: is it gathered/generated by students or is it prepackaged by an instructor?

Many status quo courses use data that is prepackaged by instructors. Both simulated and real world data can be prepackaged. I suppose there are many motivations for this, but an important one surely is that it is easier to teach. As an instructor, you know what the results will be and you know the series of clicks or code that will generate this answer. There are no surprises. Students may also find prepackaged data comforting as they know that there is a correct answer out there. They just need to decode the series of clicks to get it.

Though prepackaged data is easier for instructors and students, it surely is counterproductive in terms of learning how to actually answer research questions with data analysis.

Students will not learn the skills needed to gather and transform real world data so that it can be analysed. Students who simply load a prepackaged data set of simulated values will often not understand where it came from. They can succumb to the temptation to just click through until they get the right answer.

On the other hand, when teaching with student-simulated data, I've definitely had the experience that Zachary describes:

I think many students find [hypothesis testing] unintuitive and end up leaving with a foggy understanding of what tests do. With simulation I don't think it is so hard to explain since you can easily show confidence interval coverage, error rates, power, etc.
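To make that concrete, here is a minimal sketch of the kind of exercise a student could write to see confidence interval coverage for themselves. It is only an illustration, not material from either course: it uses Python's standard library, and the function name `ci_coverage`, the sample size, the number of simulations, and the assumption that the population standard deviation is known (so a simple z-interval applies) are all my own choices for the example.

```python
import random
from statistics import NormalDist, fmean


def ci_coverage(true_mean=0.0, sigma=1.0, n=30, n_sims=2000, level=0.95, seed=1):
    """Share of simulated samples whose z-interval contains the true mean.

    Assumption for this sketch: sigma is known, so the interval is
    x_bar +/- z * sigma / sqrt(n).
    """
    rng = random.Random(seed)                  # fixed seed so results are reproducible
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(n_sims):
        # Draw one sample from a population the student fully controls
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        x_bar = fmean(sample)
        # Count the interval as a hit if it contains the true mean
        if x_bar - half_width <= true_mean <= x_bar + half_width:
            hits += 1
    return hits / n_sims


coverage = ci_coverage()
print(f"Share of intervals covering the true mean: {coverage:.3f}")
```

Because the student sets the true mean, the share of intervals that contain it should land close to the nominal level, and changing `level` or `n` shows how the intervals respond.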

The distinction that actually matters when thinking about what makes social science statistics education more or less effective is:

student gathered/generated --- instructor gathered/generated

Prepackaged vs. student generated data

There is of course a pedagogical difference between data that students gathered from the real world and data they simulated with a computer. Simulated data is useful for teaching the behaviour of statistical methods. Real world data is useful for teaching students how to plan and execute a project using these methods to answer research questions in a way that is reproducible and introduces fewer data munging biases into estimates. Though almost certainly too much to take on together in one course, both should be central to a well-rounded social science statistics education.
