id | title | date | author | layout | permalink
---|---|---|---|---|---
41085 | RSocrata | 2014-01-13 01:00:28 -0600 | Tom Schenk | post | /index.php/rsocrata/

R is a powerful statistics program that is a favorite among data scientists. Using R with the City of Chicago data portal has always been possible, but R users needed to handle some residual issues after loading files from the data portal. Chicago’s data science team ran into the same issues, so we’re excited to release the RSocrata package to make working with the Chicago data portal, and any other Socrata data portal, easier for R users.
RSocrata is available on CRAN and can be installed and loaded with:
install.packages("RSocrata")
library(RSocrata)
Just pass the URL of a dataset from any Socrata site to read.socrata() to load the data. Below, the Towed Vehicles dataset is loaded as a data frame:
towed.vehicles <- read.socrata("https://data.cityofchicago.org/Transportation/Towed-Vehicles/ygr5-vcbg")
You can also use the API Access Endpoint address to load data. Locate the API Access Endpoint address under the Export button and the API menu. You will need to change the “.json” extension to “.csv”. For example, after that change the API Access Endpoint for Towed Vehicles is http://data.cityofchicago.org/resource/ygr5-vcbg.csv.
To use with RSocrata, type:
towed.socrata <- read.socrata("http://data.cityofchicago.org/resource/ygr5-vcbg.csv")
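If you have the “.json” endpoint on hand, the extension swap can also be done in code. This is a minimal sketch in base R; `to_csv_endpoint` is a hypothetical helper name used here for illustration and is not part of RSocrata.

```r
# Hypothetical helper (not part of RSocrata): turn a ".json" API Access
# Endpoint into its ".csv" counterpart.
to_csv_endpoint <- function(endpoint) {
  # Replace a trailing ".json" only; any other URL is returned unchanged.
  sub("\\.json$", ".csv", endpoint)
}

json.endpoint <- "http://data.cityofchicago.org/resource/ygr5-vcbg.json"
csv.endpoint <- to_csv_endpoint(json.endpoint)
print(csv.endpoint)
```

The result can then be passed to read.socrata() in the same way as the URL above.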
Either the human-readable URL or the API Access Endpoint results in the same call to Socrata, which is designed to minimize throttling.
There are a couple of benefits from RSocrata. First, date values are loaded in R as POSIX formatted dates, which is not the case when using read.csv. Comparing the two methods, date columns loaded with read.csv will usually be classified as factors:
towed.csv <- read.csv("http://data.cityofchicago.org/api/views/ygr5-vcbg/rows.csv") # Reading CSV input
class(towed.csv$Tow.Date) # Check the date classification for 'Tow Date' column
[1] "factor"
class(towed.socrata$Tow.Date) # Loaded with read.socrata
[1] "POSIXlt" "POSIXt"
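To see why this matters without hitting the portal, here is a small offline sketch of the manual conversion that a factor column would otherwise need. The toy data frame `towed.sample` and the MM/DD/YYYY date format are assumptions made for illustration, not the portal's actual output.

```r
# Toy data standing in for a date column read as factors.
# The values and the MM/DD/YYYY format are assumptions for illustration.
towed.sample <- data.frame(
  Tow.Date = c("01/10/2014", "01/11/2014", "01/12/2014"),
  stringsAsFactors = TRUE
)
print(class(towed.sample$Tow.Date))

# Convert to character first, then parse with an explicit format.
tow.dates <- as.POSIXlt(as.character(towed.sample$Tow.Date),
                        format = "%m/%d/%Y", tz = "UTC")
print(class(tow.dates))

# Once parsed, date arithmetic works as expected.
span <- difftime(tow.dates[3], tow.dates[1], units = "days")
print(span)
```

With a factor column, this conversion has to be repeated for every date field, which is the step read.socrata() is meant to save.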
Second, the RSocrata package uses a loop and Socrata’s $offset parameter to request a dataset in pages, which minimizes throttling from the data portal.
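The general idea of looping with an offset can be sketched as follows. This is an illustration of the technique under stated assumptions, not RSocrata's actual source code: `read_paged`, its `page.size` default, and the `fetch_page` argument are hypothetical names, and the fake fetcher and example.invalid URL exist only so the loop can be tried offline. The `$limit` and `$offset` query parameters are Socrata's own.

```r
# Sketch of offset-based paging; NOT RSocrata's actual implementation.
# `fetch_page` is a hypothetical argument so the loop can run offline.
read_paged <- function(endpoint, page.size = 1000, fetch_page = read.csv) {
  pages <- list()
  offset <- 0L
  repeat {
    # Build the page request; %d avoids scientific notation in large offsets.
    page.url <- sprintf("%s?$limit=%d&$offset=%d",
                        endpoint, as.integer(page.size), offset)
    page <- fetch_page(page.url)
    pages[[length(pages) + 1]] <- page
    # A short (or empty) page means there is nothing left to request.
    if (nrow(page) < page.size) break
    offset <- offset + as.integer(page.size)
  }
  do.call(rbind, pages)
}

# Offline stand-in for a portal: serves slices of a 25-row data frame.
fake.data <- data.frame(id = 1:25)
fake_fetch <- function(page.url) {
  limit <- as.integer(sub(".*\\$limit=([0-9]+).*", "\\1", page.url))
  offset <- as.integer(sub(".*\\$offset=([0-9]+).*", "\\1", page.url))
  rows <- seq_len(nrow(fake.data))
  fake.data[rows > offset & rows <= offset + limit, , drop = FALSE]
}

all.rows <- read_paged("http://example.invalid/resource/abcd-1234.csv",
                       page.size = 10, fetch_page = fake_fetch)
print(nrow(all.rows))
```

Here the loop makes three requests (10, 10, and 5 rows) and stops at the first short page. Against a real portal, a stable sort order is generally advisable when paging so that rows are not skipped or repeated between requests.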
RSocrata is available on CRAN and is open for pull requests on GitHub.