forked from paulcbauer/apis_for_social_scientists_a_review
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path07-Github_api.Rmd
263 lines (196 loc) · 17.6 KB
/
07-Github_api.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
# GitHub.com API
<chauthors>Markus Konrad (WZB)</chauthors>
<br><br>
```{r github-api-1, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE, cache=TRUE)
```
You will need to install the following packages for this chapter (run the code):
```{r github-api-2, echo=FALSE, comment=NA}
# Run this and copy output into the p_load function in next chunk
file_name <- dir()[stringr::str_detect(dir(), "Github_api.Rmd")]
lines_text <- readLines(file_name)
packages <- gsub("library\\(|\\)", "", unlist(str_extract_all(lines_text, "library\\([a-zA-z0-9]*\\)|p_load\\([a-zA-z0-9]*\\)")))
packages <- packages[packages!="pacman"]
packages <- paste("# install.packages('pacman')", "library(pacman)", "p_load('", paste(packages, collapse="', '"), "')",sep="")
packages <- str_wrap(packages, width = 80)
packages <- gsub("install.packages\\('pacman'\\)", "install.packages\\('pacman'\\)\n", packages)
packages <- gsub("library\\(pacman\\)", "library\\(pacman\\)\n", packages)
cat(packages)
```
## Provided services/data
* *What data/service is provided by the API?*
The [GitHub.com API](https://docs.github.com/en/rest) provides a service for interacting with the social coding platform [GitHub](https://github.com). A social coding platform is a website that allows users to work on software projects collaboratively. Users can share their work, engage in discussions and track activities of other users and projects. GitHub is currently the largest such platform with more than 50M user accounts (as of January 2022). GitHub's API allows for retrieving user-generated data from its platform, which is probably of main interest for social scientists. It also provides tools for controlling your account, organizations and projects on the platform in order to automate workflows. This is more of interest for professional software development.
From a social scientist's perspective, the API may be of interest when studying online communities, working methods, organizational structures, communication and discussions, etc. with a focus on (open-source) software development. Many projects that are hosted on GitHub are open-source projects with a transparent development process and communications. For private projects, which can also be hosted on GitHub, there's understandably only a few aggregate data available.
When collecting data on GitHub, you should follow [GitHub's terms of service](https://docs.github.com/github/site-policy/github-terms-of-service), especially the [API terms](https://docs.github.com/en/github/site-policy/github-terms-of-service#h-api-terms). GitHub users agree that, when using the service, anything they post publicly may be viewed and used by other users (see [ToS section D5](https://docs.github.com/github/site-policy/github-terms-of-service#5-license-grant-to-other-users)). Still, you should be cautious when collecting and analyzing such user-generated data, as the user was not informed about contributing to a research project. If research data should later be published for replication, only aggregated and/or anonymized data should be made public. Depending on the data you collect, an ethics review should also be considered.
Special care should be taken about sampling techniques and replicability when working with this API. The sheer amount of data shared on GitHub requires sampling in most cases, but this is often hard to implement with the API, since results are usually sorted in a way that introduces bias when you can only fetch a certain number of top results (e.g. searching for users gives you the "most popular" users matching your criteria). @cosentino2016 show that many studies that use the GitHub API fail to implement proper sampling.
There are several competing platforms to GitHub. Many of them also provide an API which allow you to retrieve data. To list a few:
- [BitBucket](https://bitbucket.org/) – [API documentation](https://developer.atlassian.com/cloud/bitbucket/)
- [SourceForge](https://sourceforge.net/) – [API documentation](https://sourceforge.net/p/forge/documentation/API/)
- [Launchpad](https://launchpad.net/) – [API documentation](https://help.launchpad.net/API)
## Prerequisites
The API can be used for free and you can send up to 60 requests per hour if you're not authenticated (i.e. if you don't provide an API key). For serious data collection, this is not much, so it is recommended to sign up on GitHub and generate a [personal access token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token) that acts as API key. This token can then be used to authenticate your API requests. Your quota is then 5000 requests per hour.
## Simple API call
* *What does a simple API call look like?*
First of all, it's important to note that GitHub actually provides two APIs: A [REST API](https://docs.github.com/en/rest) and a [GraphQL API](https://docs.github.com/en/graphql). They differ mainly in how requests are performed and how the responses are formatted, but also in *what* can be requested. The REST API is the default interface and provides a way of access akin to most other APIs in this book. The GraphQL API provides a quite different interface which requires you to more specifically describe what kind of information you want to retrieve. Though some very detailed information can only be retrieved via the GraphQL API, we will stick to the REST API for this chapter due to the complexity of the GraphQL API.
We can perform a request using the `curl` command in a terminal or by simply visiting the URL in a browser. We will start by retrieving public profile data from the GitHub account of software developer and hacker Lilith Wittmann:^[The author asked for permission for fetching and displaying this data.]
```{bash github-api-3, eval=FALSE}
curl https://api.github.com/users/LilithWittmann
```
The result is an HTTP response with JSON formatted data which contains user profile information that can also be found on the respective [public profile page](https://github.com/LilithWittmann):
```{text github-api-4, eval=FALSE}
{
"login": "LilithWittmann",
"name": "Lilith Wittmann",
"location": "Berlin, Germany",
"bio": "freelance developer & consultant",
"twitter_username": "LilithWittmann",
"public_repos": 37,
"public_gists": 6,
"followers": 440,
"following": 52,
// [ more data ... ]
}
```
The GitHub API offers extensive search capabilities. You can search for users, repositories, discussions and more by constructing a [search query](https://docs.github.com/en/rest/reference/search#constructing-a-search-query). Here, we search for users that use R and report being located in Berlin:
```{bash github-api-5, eval=FALSE}
curl 'https://api.github.com/search/users?q=language:r+location:berlin'
```
```{text github-api-6, eval=FALSE}
{
"total_count": 541,
"incomplete_results": false,
"items": [
{
"login": "IndrajeetPatil",
"id": 11330453,
// [...]
},
{
"login": "christophergandrud",
"id": 1285805,
// [...]
},
{
"login": "RobertTLange",
"id": 20374662,
// [...]
},
// [ more data ... ]
}
```
We could then continue fetching profile information for each account in the result set.
Note that by default, the results are [ordered by "best match"](https://docs.github.com/en/rest/reference/search#ranking-search-results). This can be changed with an additional "sort" parameter. The GitHub API doesn't provide a way of sampling search results – they are always sorted in some way and hence may introduce bias when you only fetch a limited number of results (which is often necessary since the result set is too large). You may consider to narrow the criteria for your search requests, e.g. by sampling geographical locations, so that you can always obtain the whole search results for a place.
Another example would be retrieving data about a project repository. We will use the [GitHub repository for the popular R package dplyr](https://github.com/tidyverse/dplyr) as an example:
```{bash github-api-7, eval=FALSE}
curl https://api.github.com/repos/tidyverse/dplyr
```
```{text github-api-8, eval=FALSE}
{
"id": 6427813,
"name": "dplyr",
"full_name": "tidyverse/dplyr",
"description": "dplyr: A grammar of data manipulation",
"homepage": "https://dplyr.tidyverse.org",
"size": 50767,
"stargazers_count": 3959,
"watchers_count": 3959,
"language": "R",
// [ more data ... ]
}
```
Finally, let's fetch data about an *issue* in the dplyr repository. On GitHub, issues are code or documentation problems as well as development tasks that users and collaborators bring up and discuss. Here, we collect all data on [dplyr issue #5958](https://github.com/tidyverse/dplyr/issues/5958):
```{bash github-api-9, eval=FALSE}
curl https://api.github.com/repos/tidyverse/dplyr/issues/5958
```
```{text github-api-10, eval=FALSE}
{
"url": "https://api.github.com/repos/tidyverse/dplyr/issues/5958",
"number": 5958,
"title": "Inaccurate documentation for `name` argument of `count()`",
"user": {
"login": "sfirke",
"id": 7569808,
// [ more data ... ]
},
"comments": 13,
"body": "The documentation for `count()` says of the [...]", // truncated
// [ more data ... ]
}
```
As you can see in the example above, a response may contain nested data (the "user" entry shows information about the author of the issue as a nested structure).
## API access
* *How can we access the API from R?*
Note that there's no output provided for the code examples in this section. When you're learning to use this API, it's recommended to copy and paste the code examples and run them in your own R environment.
There are [packages for many programming languages](https://docs.github.com/en/rest/overview/libraries) that provide convenient access for communicating with the GitHub API. Unfortunately, there are no such packages for R at the time of writing.^[Please note that the [git2r](https://cran.r-project.org/web/packages/git2r/index.html) package does *not* provide access to the GitHub API. It provides access to git repositories which is something completely different to accessing the web API of a social coding platform.] This means we can only access the API directly, e.g. by using the [jsonlite package](https://cran.r-project.org/web/packages/jsonlite/index.html) to fetch the data and convert it to an R list or dataframe.
We start with translating the first API call from the previous section to R:
```{r github-api-11, eval=FALSE}
library(jsonlite)
profile_data <- fromJSON('https://api.github.com/users/LilithWittmann')
# this gives a list
head(profile_data)
```
The JSON response from the GitHub API was directly converted to an R list object. When a JSON result only consists of arrays, `fromJSON` automatically converts the result to a dataframe by default. We can observe that when fetching the repositories of an user:
```{r github-api-12, eval=FALSE}
profile_repos <- fromJSON('https://api.github.com/users/LilithWittmann/repos')
# this gives a dataframe
head(profile_repos[,1:3]) # selecting only the first 3 columns
```
Let's also repeat the search query from the previous section. We want to obtain accounts that use R and entered "Berlin" in their profile's location field. This time, the result is a list that reports the number of search results as `search_results$total_count` and the search results details as a dataframe in `search_results$items`:
```{r github-api-13, eval=FALSE}
search_results <- fromJSON('https://api.github.com/search/users?q=language:r+location:berlin')
# this gives a list with a dataframe in "items"
str(search_results, list.len = 3, vec.len = 3)
```
Let's suppose we want to collect profile data for each account in the result set. First, let's investigate the search results in `search_results$items`:
```{r github-api-14, eval=FALSE}
head(search_results$items[,1:3])
```
To retrieve the profile details of each user, we only need the account name which we can then feed into the `https://api.github.com/users/<account>` API query. For demonstration purposes, let's only select the first ten users in the result set:
```{r github-api-15, eval=FALSE}
r_users_berlin <- search_results$items$login[1:10]
# gives a character vector of 10 account names
head(r_users_berlin)
```
We can now build the API queries to fetch the profile information for each user:
```{r github-api-16, eval=FALSE}
users_query <- paste0('https://api.github.com/users/', r_users_berlin)
# gives a character vector of 10 API queries
head(users_query)
```
The `fromJSON` function only accepts a single character string as query, but we have a vector of ten queries for the ten users. Hence, we resort to `lapply` which passes each query to `fromJSON` and stores the result as a list of ten lists (i.e. a nested structure – a list of lists). Each item in `users_details` is then a list that contains the profile information for the respective user.
```{r github-api-17, eval=FALSE}
users_details <- lapply(users_query, fromJSON)
# gives a list of 10 lists (one for each user)
str(users_details, list.len = 3)
```
A nested list is hard to work with, so we convert the result to a dataframe. On the way, we select only the information that we actually want to work with. For demonstration purposes, we only select the account name, the real name, the location, the number of public repositories and the number of followers of each user.
```{r github-api-18, eval=FALSE}
users_rows <- lapply(users_details, function(d) {
# select information that we need and generate a dataframe with a single row
data.frame(login = d$login,
name = d$name,
loc = d$location,
repos = d$public_repos,
followers = d$followers,
stringsAsFactors = FALSE)
})
# combine all dataframe rows to form the final dataframe
users_df <- do.call(rbind, users_rows)
# gives a dataframe with login, name, location, number of repos, number of followers
users_df
```
As we can see, working with the results from the API requires some effort in R, since we have to deal with the often complex structure of the provided data. There's no "convenience wrapper" package for the GitHub API in R so far. If you're familiar with other programming languages, you may consider using one of the [recommended packages](https://docs.github.com/en/rest/overview/libraries).^[For example, the [PyGithub](https://github.com/PyGithub/PyGithub) package for Python provides a comprehensive set of tools for interacting with the GitHub API which usually requires much less programming effort than with R and jsonlite.]
### Provide API key
Authenticating with the GitHub API via an API key allows you to send much more requests to the API (5000 requests per hour instead of 60 at the time of writing). API access keys for the GitHub API are called *personal access tokens (PAT)* and the [documentation explains how to generate a PAT](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token) once you've logged into your GitHub account. **Please be careful with your PATs and never publish them.**
When you want to use authenticated requests, you need to pass along authentication information (your GitHub user account and your PAT) with every request. This is not possible with `fromJSON` – you have to send HTTP requests directly, e.g. via `GET` from the [httr](https://cran.r-project.org/web/packages/httr/index.html) package and then need to handle the HTTP response. The following example shows this. Here, we have the secret access token `PAT` stored as environment variable and fetch it via `Sys.getenv`. Next, we use `GET` to make an authenticated HTTP response to the GitHub API. We use the [`/user` API endpoint](https://docs.github.com/en/rest/reference/users#get-the-authenticated-user), which simply reports information about the currently authenticated user (which is your own account). If we were to use this API endpoint in an unauthenticated manner, we'd receive the error message *"Requires authentication"*.^[Try that out yourself via `curl https://api.github.com/user` in the terminal.] We then need to get the content of the response as text (`content(response, as = 'text')`) and pass this to `fromJSON` in order to get the result as list object.
```{r github-api-19, eval=FALSE}
library(httr)
PAT <- Sys.getenv("PAT")
response <- GET('https://api.github.com/user', authenticate('<account name>', PAT))
account_details <- fromJSON(content(response, as = 'text'))
account_details
```
You can use that code the same way as for the other API requests, only that when you use authenticated requests the request quotas are much higher.
## Social science examples
* *Are there social science research examples using the API?*
When the GitHub API is used in a research context, this is mainly done in the fields of computer science, (IT) business economics and social media studies. @lima2014coding analyze social ties, collaboration patterns and the geographical distribution of users on the platform. @chatziasimidis2015 try to find "success rules" for software projects by analyzing GitHub data and relating it to software download numbers. Importantly, @cosentino2016 provide a meta-analysis of 93 research papers and find *"concerns regarding the dataset collection process and size, the low level of replicability, poor sampling techniques, lack of longitudinal studies and scarce variety of methodologies."* @hipp_has_2021 used the GitHub API to collect data about open-source software development contributions before and during the COVID-19 pandemic. This was then used to analyze the different impact of the pandemic on the productivity of female and male software developers. They tackled the user sampling issue by employing geographic sampling and narrowing search results so that always the full result set of a user search for a place could be obtained.