
source_GitHubData: a simple function for downloading data from GitHub into R

Update 31 January: I've folded source_GitHubData into the repmis package. See this post.


Update 7 January 2012: I updated the internal workings of source_GitHubData so that it now relies on httr rather than RCurl. It is also more directly descended from devtools' source_url command.

This has two advantages.

  • Shortened URLs can be used instead of the data set's full GitHub URL,
  • The ssl.verifypeer issue is resolved (though please let me know if you have problems).

The post has been rewritten to reflect these changes.


In previous posts I've discussed how to download data stored in plain-text data files (e.g. CSV, TSV) on GitHub directly into R.

Not sure why it took me so long to get around to this, but I've finally created a little function that simplifies the process of downloading plain-text data from GitHub. It's called source_GitHubData. (The name mimics the devtools syntax for functions like source_gist and source_url. The function's syntax is actually just a modified version of source_url.)

The function is stored in a GitHub Gist (ID 4466237). You can load it directly into R with devtools' source_gist command.
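For a sense of how it works under the hood, here is a minimal sketch of the approach, built on httr's GET. This is an illustrative approximation, not the exact gist code, so the real function's argument names and error handling may differ:

# A rough sketch of the httr-based approach (not the exact gist code)
library(httr)

source_GitHubData_sketch <- function(url, sep = ",", header = TRUE) {
    # GET follows redirects (so shortened URLs work) and handles
    # https without RCurl's ssl.verifypeer workaround
    request <- GET(url)
    stop_for_status(request)
    # Parse the downloaded text into a data frame
    read.table(text = content(request, as = "text"),
               sep = sep, header = header)
}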

Here is an example of how to use the function to download the electoral disproportionality data I discussed in an earlier post.


# Load source_GitHubData
library(devtools)

# The function's gist ID is 4466237
source_gist("4466237")

# Create Disproportionality data UrlAddress object
# Make sure the URL is for the "raw" version of the file
# The URL was shortened using bitly
UrlAddress <- "http://bit.ly/Ss6zDO"

# Download data
Data <- source_GitHubData(url = UrlAddress)

# Show Data variable names
names(Data)

## [1] "country"            "year"               "disproportionality"

There you go.

Note that the function is set by default to load comma-separated data (CSV). This can easily be changed with the sep argument.
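For example, to read tab-separated (TSV) data you could do something like this (the URL below is hypothetical; it just needs to point to the raw version of a TSV file):

# Hypothetical URL for the raw version of a tab-separated file
TsvUrl <- "https://raw.github.com/username/repo/master/data.tsv"

# Download tab-separated data
TsvData <- source_GitHubData(url = TsvUrl, sep = "\t")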

Comments

Fr. said…
This also works with Google Spreadsheets if they have been published to the Web (from the File menu in Google Docs). All you need is the CSV path. This should work, for example:

source_GitHubData("https://docs.google.com/spreadsheet/pub?key=0Agz-ZYJ5rH_WdG9oR2Y3T3U1Y3I5YlgzUmNBSVFrRUE&single=true&gid=0&output=csv")

A function to get data from Google Spreadsheets could be a useful addition to your package IMHO. People also use Dropbox, so that could be another addition. Curious to know your thoughts on that.

Also, your download method seems better than the one I was using with RCurl, because RCurl's getURL() needs ssl.verifypeer=FALSE to work properly in some (HTTPS) cases. It seems httr's GET does not encounter the issue.
Fr. said…
P.S. The ssl.verifypeer problem is documented on the R Revolutions blog somewhere.
Unknown said…
Hi Fr.

Yep, it should work for Google spreadsheets published to CSV.

I might add a wrapper to the repmis package (source_GoogleData or something like that), but it would basically be the same as source_GitHubData.

You actually don't need source_GitHubData to download data stored in a plain-text format on a Dropbox public folder. They use non-secure (http) URLs, so you can just use read.table. (source_GitHubData works for https sites)
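For example (with a made-up Public folder URL):

# Hypothetical Dropbox Public folder URL; any direct http link
# to a plain-text file works the same way
Data <- read.table("http://dl.dropbox.com/u/12345/example.csv",
                   sep = ",", header = TRUE)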

Data stored in non-Public folders on Dropbox cannot be easily downloaded into R, because their URLs take you to a page that is more than just the text file. They have lots of HTML that needs to be scraped away.
Fr. said…
I believe adding the Google Docs wrapper would be useful, because (1) not everyone is on GitHub, (2) both storage options have more or less the same permanence, and (3) Google Docs supports multiple data sheets per file.

GitHub, of course, is a better option, because it does not require "publishing to the web" to share the file. Perhaps the Google Docs API makes it possible to stopifnot(publish = TRUE).

Dropbox is less necessary IMHO because it's (a) less secure and (b) easier to move or delete things on it. I did not know their URLs were mere HTTP; it does not sound very wise.

All that just to say that your package is inspiring.
