---
engine: knitr
---
# Datasets {#sec-datasets}

Students often struggle to pick a dataset. In general, it is better to stay away from datasets on Kaggle, the UCI Machine Learning Repository, and other commonly used sources. From a data science perspective, using a dataset as it is available from such a source means that almost all the important decisions have already been made, potentially without documentation. And from a career perspective, it does not set your portfolio apart because so many other people use the same datasets. Some alternatives include:

- [AidData](https://www.aiddata.org/datasets) provides a large number of datasets related to research on development and foreign aid.
- [Alex Cookson's datasets](https://github.com/tacookson/data).
- @andrews2012data provide a variety of datasets, which are available [here](https://www.york.ac.uk/depts/maths/histstat/pml1/r/andrews.htm).
- [*APIs for social scientists*](https://bookdown.org/paul/apis_for_social_scientists/) provides a variety of APIs that could be used to gather data.
- @Bombieri2023 provide a dataset about more than 5,000 large carnivore attacks on humans.
- The British Library's [catalogue of world newspapers](https://bl.iro.bl.uk/concern/datasets/943bd083-6355-44a1-97eb-b8ff898f87d5?locale=en) contains information about the start and end years of publication, the places of publication, variant titles and editions, and the language of publication.
- BuzzFeed News provides [access](https://github.com/BuzzFeedNews/nics-firearm-background-checks) to many datasets underpinning their articles.
- The [Canadian Municipal Elections Database](https://dataverse.scholarsportal.info/dataset.xhtml?persistentId=doi:10.5683/SP2/4MZJPQ) contains complete municipal election results for municipalities across Canada [@CanadianMunicipalElectionsDatabase].
- [Congressindata](https://www.congressindata.com) provides datasets about US Congress Members from 2005 to 2015.
- The [Congress.gov API](https://blogs.loc.gov/law/2022/09/introducing-the-congress-gov-api/) is an especially useful source of data about the US Congress, particularly bills and other text data.
- [COVerAGE-DB](https://osf.io/mpwjq/) is a global demographic database of COVID-19 cases and deaths [@Riffe2021].
- `cricketdata` [@cricketdata] provides functions for downloading data about international and other major cricket matches.
- [The Data And Story Library](https://dasl.datadescription.com) provides access to hundreds of datasets.
- [Data Is Plural](https://www.data-is-plural.com) provides a weekly newsletter of interesting datasets with archives back to 2015.
- The [Data Liberation Project](https://www.data-liberation-project.org) focuses on using FOI requests to build US government datasets.
- The [Demographic and Health Surveys](https://dhsprogram.com) (DHS) Program provides survey data for 90 countries beginning in 1984.
- Duolingo [provides](https://research.duolingo.com) access to datasets that underpin its research papers.
- The Economist provides [access](https://github.com/orgs/TheEconomist/repositories) to many datasets underpinning their articles.
- [EH.net](https://eh.net/databases/) provides a variety of interesting historical economic datasets.
- The EPA provides occurrence data from the [Unregulated Contaminant Monitoring Rule](https://www.epa.gov/dwucmr/occurrence-data-unregulated-contaminant-monitoring-rule#5).
- [European NUTS-Level Election Database (EU-NED)](https://eu-ned.com) provides national and European parliamentary election results from 1990 to 2020.
- Federal Reserve Economic Data (FRED) [provides](https://fred.stlouisfed.org) US economic data, and there is an R package `fredr` [@fredr] for accessing the API.
- FiveThirtyEight provides [access](https://github.com/fivethirtyeight/data) to many datasets underpinning their articles.
- [Goodreads Datasets](https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home) are a scrape from 2017 of public data about more than two million books including meta-data and reviews [@goodreadsone; @goodreadstwo].
- [Historical Social Conflict Database](https://www.unicaen.fr/hiscod/accueil.html) provides data about more than 20,000 conflicts, largely focused on Europe [@historicalsocialconflictdatabase].
- [Historical Statistics](https://www.historicalstatistics.org/) provides links to historical statistics.
- [Human Mortality Database](https://www.mortality.org) provides detailed mortality and population data for a variety of countries.
- ICANN's [Centralized Zone Data Service](https://czds.icann.org/home) provides access to all domain names, after an application and approval process that can take a few days.
- [IPCC Data Distribution Centre](https://www.ipcc-data.org).
- The Irish Social Science Data Archive has a wide variety of datasets [available](https://www.ucd.ie/issda/data/).
- J-PAL (Abdul Latif Jameel Poverty Action Lab) maintains a [catalog](https://www.povertyactionlab.org/catalog-administrative-data-sets) of administrative data.
- [NFL Savant](http://nflsavant.com/index.php) provides team-specific data about the NFL, including play-by-play data since 2013, combine data since 1999, and weather data.
- The Markup's [Show Your Work](https://themarkup.org/series/show-your-work) series often includes links to GitHub repos with the data that underpin the articles. A notable one is [The Secret Bias Hidden in Mortgage-Approval Algorithms](https://github.com/the-markup/investigation-redlining).
- The Massachusetts Water Resources Authority makes its Wastewater COVID-19 Tracking data available [here](https://www.mwra.com/biobot/biobotdata.htm), with the raw data available in a PDF that could be parsed.
- The Museum of Modern Art (MoMA) makes datasets about their [collection](https://github.com/MuseumofModernArt/collection) and [exhibitions](https://github.com/MuseumofModernArt/exhibitions) available.
- NASA's [Planetary Data System](https://pds.nasa.gov).
- [ProPublica Data Store](https://www.propublica.org/datastore/) provides an extensive number of datasets about the US, some of which are quite large. For instance, the [Open Payments Data (2016)](https://www.propublica.org/datastore/dataset/cms-open-payments-data-2016) is 6 GB.
- The Notable People [dataset](https://medialab.github.io/bhht-datascape/) of @bhht3 provides a cross-verified database of notable people from 3500BC to 2018AD.
- The OECD [provides](https://data.oecd.org) economic data.
- The ParlEE dataset contains annotated full-text of millions of speeches in the EU legislative chambers [@parlee].
- The Prison Policy Initiative provides many [datasets](https://www.prisonpolicy.org/data/) about US prisons and jails.
- The Pudding makes many of the datasets underpinning their articles [available](https://github.com/the-pudding/data). A few notable ones include [The Naked Truth](https://github.com/the-pudding/data/tree/master/foundation-names) and [The Evolution of the American Census](https://github.com/the-pudding/data/tree/master/census-history).
- The [Pushshift Reddit Dataset](https://files.pushshift.io/reddit/) is a collection of Reddit posts since 2015 [@pushshiftreddit].
- The [Refugee Law Lab](https://refugeelab.ca) provides the full text of Supreme Court of Canada decisions in JSON format [@rehaag].
- The Rijksmuseum [provides](https://data.rijksmuseum.nl) a variety of data about their collections.
- The Socioeconomic High-resolution Rural-Urban Geographic Platform [(SHRUG)](http://devdatalab.org/shrug) is an open data platform that provides data about socioeconomic development across 600,000 villages and towns in India [@shrugdataset].
- Tom Cardoso's [Bias behind bars](https://www.theglobeandmail.com/canada/article-investigation-racial-bias-in-canadian-prison-risk-assessments/) provides data about Black and Indigenous inmates in Canada.
- Tracking (In)Justice is a dataset that tracks police-involved deaths in Canada [@alexmclelandagain].
- The US Centers for Disease Control and Prevention (CDC) National Vital Statistics System provides a variety of datasets, including [Linked Birth and Infant Death Data](https://www.cdc.gov/nchs/nvss/linked-birth.htm#Two_Formats).
- The United States Sentencing Commission [Individual Offender Data Sets](https://www.ussc.gov/research/datafiles/commission-datafiles) are available as cleaned and prepared by [Kevin Wilson](https://kevinhayeswilson.com/data.html).
- [Women's Activities in Armed Rebellion](https://www.waarproject.com) provides access to measures of women's participation in rebel organizations between 1946 and 2015 [@lokenintroducing].
- The Washington Post provides [access](https://github.com/washingtonpost) to many datasets underpinning their articles. Especially of interest may be [congress slaveowners](https://github.com/washingtonpost/data-congress-slaveowners), [fatal force shooting](https://github.com/washingtonpost/data-police-shootings), [school shootings](https://github.com/washingtonpost/data-school-shootings), and [Why FEMA is denying aid to Black disaster survivors in the Deep South](https://github.com/wpinvestigative/fema_ihp_denials).
- The [Wordbank database](http://wordbank.stanford.edu) is an open database of children's vocabulary growth. Access is additionally available using `wordbankr` [@wordbankr], and [Alison Presmanes Hill](https://www.apreshill.com) provides useful [background](https://apreshill.github.io/data-vis-labs-2018/03-colors.html) and [cleaning code](https://apreshill.github.io/data-vis-labs-2018/03a-meow-cleaning.html).
- The World Bank provides an extensive range of [global development data](https://data.worldbank.org) and a [Microdata Library](https://microdata.worldbank.org/index.php/home/).
- Yale's International Center for Finance datasets: [Historical Financial Research Data](https://som.yale.edu/centers/international-center-for-finance/data/historical-financial-research-data), and [Stock Market Confidence Indices](https://som.yale.edu/centers/international-center-for-finance/data/stock-market-confidence-indices).