The project provides a set of APIs to retrieve the coauthorship graph of a given author and to build the equivalent of the Erdős distance for the requested person. Starting from the first request, the system keeps collecting every coauthor recursively in two parallel ways: through the direct coauthors declared by the person on Google Scholar, and by analyzing the publications and extracting every coauthor.
A live version can be found at http://104.155.115.87/; precomputed graphs are available for:
- Daniele Miorandi
- Ivano Bison
- Alberto Montresor
The project is organized into six dockerized microservices, located in the following folders (Redis is not listed as it does not need any particular configuration); every folder contains at least one Dockerfile.
- `client` contains the frontend in React, composed of three pages:
  - homepage (`SearchPage`), which allows entering the desired author
  - search (`ListResults`), which displays the found authors, both on Google Scholar and in our database
  - author (`DisplayResults`), which displays the graph and lists the coauthors and their distances
- `flask` contains the API written in Python with Flask:
  - `app.py` contains the entrypoint of the Flask application and its endpoints:
    - `/search?name=name` allows searching for a person; the system will check against both Google Scholar and Neo4j
    - `/author?id=id&node_id=node_id` allows retrieving the graph; `id` refers to the Google ID of the person, while `node_id` refers to the node ID on Neo4j (the first one overrides the second)
  - `aragog.py` contains the Google Scholar crawler: it gets the list of matching authors, using `requests` and `BeautifulSoup` to scrape the search page on Google Scholar
  - `sherlock` contains the logic for Neo4j, calculating the graph and determining the response. If the author is not present in the system, he or she is enqueued for analysis on Redis with the maximum priority, so the two consumers will analyze him or her as soon as possible. The graph calculation merges information coming from both direct coauthorship and publication coauthorship, taking the shortest path when multiple paths exist.
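The merge of the two coauthorship sources can be pictured as a breadth-first search over the union of both edge sets: since BFS reaches every author in order of increasing hop count, the first path found is automatically the shortest one. The following is a minimal pure-Python sketch of that idea with made-up authors and edges; the real system performs this computation in Neo4j.

```python
from collections import deque

def coauthor_distances(start, direct_edges, publication_edges, max_hops=5):
    """BFS over the union of the direct and publication coauthorship edges.

    BFS visits nodes in order of increasing hop count, so the first time
    an author is reached is via a shortest path, which implements the
    "take the shortest one" merge rule for free.
    """
    adjacency = {}
    for u, v in list(direct_edges) + list(publication_edges):
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] >= max_hops:
            continue  # honor the maximum-distance cutoff
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances

# Hypothetical data: "A" co-signed a paper with "C", while the declared
# coauthor lists only connect them through "B"; the merge keeps distance 1.
direct = [("A", "B"), ("B", "C")]
papers = [("A", "C")]
print(sorted(coauthor_distances("A", direct, papers).items()))
# [('A', 0), ('B', 1), ('C', 1)]
```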
- `stc` (Short-Term Consumer) analyzes only the publicly available coauthor list, to provide a fast answer to the user. Knowing the Google ID of a person, the coauthors page can be accessed directly and scraped with `BeautifulSoup`. The priority of the analysis is determined by queues on Redis, and every coauthor of the person is enqueued again with lower priority, so as to favor new incoming requests.
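The priority discipline, where fresh user requests come first and transitively discovered coauthors come later, can be sketched with an in-memory heap; the class and names below are purely illustrative stand-ins for the Redis queues the consumers actually use.

```python
import heapq
import itertools

# Lower number = higher priority, mirroring a sorted-set score.
MAX_PRIORITY = 0

class AnalysisQueue:
    """In-memory stand-in for the Redis priority queue shared by the consumers."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order per priority

    def enqueue(self, author_id, priority):
        heapq.heappush(self._heap, (priority, next(self._counter), author_id))

    def pop(self):
        priority, _, author_id = heapq.heappop(self._heap)
        return author_id, priority

queue = AnalysisQueue()
# A direct user request gets maximum priority...
queue.enqueue("requested-author", MAX_PRIORITY)
# ...while coauthors discovered during the crawl are re-enqueued lower,
# so new incoming requests are always served first.
queue.enqueue("coauthor-1", MAX_PRIORITY + 1)
queue.enqueue("coauthor-2", MAX_PRIORITY + 1)

print(queue.pop())  # ('requested-author', 0)
```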
- `ltc` (Long-Term Consumer), based on Selenium, scrapes paper by paper and extracts every coauthor of a person. Like `stc`, it uses Redis priority queues to extract the most relevant person to analyze, and enqueues every coauthor back with lower priority. It simulates the human activity of clicking in order to avoid automated blocking by Google.
- `neo4j` contains Neo4j data and configuration; at the moment it is configured for a GCP instance with 64 GB of RAM. For local execution, `conf/neo4j.conf` should be updated according to your machine.
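The settings that most need adjusting for a smaller machine are the heap and page-cache sizes. The fragment below is only an example for a hypothetical 16 GB host, not the deployed configuration:

```properties
# conf/neo4j.conf -- example memory settings for a 16 GB machine
dbms.memory.heap.initial_size=4g
dbms.memory.heap.max_size=4g
dbms.memory.pagecache.size=6g
```

A common rule of thumb is to leave a few gigabytes to the OS and split the rest between heap and page cache, favoring the page cache when the graph is large.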
Alongside the code there is `.gitlab-ci.yml`, which sets up a Continuous Integration environment on GitLab that builds the images of the services and pushes them onto the GitLab Container Registry.
The system can be started service by service or, alternatively, with one of the two docker-compose files:

- `docker-compose-build.yml` builds the images locally from the source contained in the repo.
- `docker-compose.yml` pulls the images from the GitLab Container Registry, so it is updated to the latest commit that triggered a build.

The second one is the easier way and can be started with `docker-compose up`.
We carefully designed the system to elude Google's anti-bot measures: `stc` requests are delayed over time so as not to flood Google Scholar, while `ltc`, using Selenium, appears to Google like a human. We have run it for several hours on our machines and on the deployed version without being blocked.
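The delaying of `stc` requests can be pictured as a randomized pause between consecutive fetches, so requests never arrive at a fixed, machine-like rate. The interval bounds and helper names below are invented for illustration; the real values are tuned against Google Scholar's rate limiting and are not documented here.

```python
import random
import time

# Hypothetical bounds, chosen only for this sketch.
MIN_DELAY_S = 0.5
MAX_DELAY_S = 1.5

def next_delay(rng=random):
    """Pick a random pause; a fixed request rate is an easy
    pattern for anti-bot systems to detect."""
    return rng.uniform(MIN_DELAY_S, MAX_DELAY_S)

def polite_fetch(fetch, url):
    """Sleep a randomized interval, then perform the actual request."""
    time.sleep(next_delay())
    return fetch(url)

# Example with a stub fetcher, so the sketch runs without network access.
print(polite_fetch(lambda url: f"fetched {url}", "https://scholar.google.com/"))
# fetched https://scholar.google.com/
```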
Because we could not bulk download the Google Scholar archive, we need a starting point from which to begin our analysis; from there, the system will crawl endlessly.
If the number of vertices is below around 10,000, the coauthorship graph can be built in a few seconds; a higher number of nodes leads to higher computation time. In our tests, with more than 30,000 nodes, calculating the coauthorship graph with a maximum distance of 5 took several minutes, since 5 hops can reach almost every other author, especially if publications were analyzed.
Graph of Alberto Montresor (Daniele Miorandi's graph was too dense to display).