Currently implemented: starting from a publisher page (e.g. https://nature.com), scrape the list of journals; for each journal, scrape the list of articles (papers) and save the HTML of the individual papers to disk.
Meta information about the publishers is stored in meta/publishers.yaml.
The scraping pipeline is built with the scrapy framework.
There are currently 3 simple spiders:
- journal index spider
- paper url spider
- paper download spider
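The exact schema of meta/publishers.yaml is defined by the repo and not reproduced here; as a rough illustration, an entry might look like the following (all field names and URLs are hypothetical):

```yaml
# Hypothetical publishers.yaml entry -- field names are illustrative only,
# not necessarily the schema this repo actually uses.
nature:
  base_url: https://www.nature.com
  journal_index_url: https://www.nature.com/siteindex   # page listing the journals
  journal_link_css: "a.journal-link"                    # selector for journal links
```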
Scrape all publishers specified in the YAML file passed via the publishers argument (-a) and save the list of journals to ./data/journals.csv:

scrapy crawl journal_indexes -a publishers=./meta/publishers.yaml -o ./data/journals.csv
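Scrapy passes -a arguments to the spider's __init__ as keyword arguments, and the -o flag writes the yielded items to the given file via feed exports. The repo's actual spider code is not reproduced here; the following is a minimal sketch of how a journal_indexes spider could consume the publishers file, where the YAML fields and the a.journal-link selector are assumptions:

```python
import scrapy
import yaml


class JournalIndexSpider(scrapy.Spider):
    """Hypothetical sketch of the journal index step; the real spider may differ."""
    name = "journal_indexes"

    def __init__(self, publishers=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # -a publishers=./meta/publishers.yaml arrives here as a plain string path
        with open(publishers) as f:
            self.publishers = yaml.safe_load(f)

    def start_requests(self):
        # assumes each publisher entry provides a journal_index_url field
        for name, cfg in self.publishers.items():
            yield scrapy.Request(cfg["journal_index_url"], meta={"publisher": name})

    def parse(self, response):
        # assumes journal links can be picked out with a per-publisher CSS selector
        for href in response.css("a.journal-link::attr(href)").getall():
            yield {
                "publisher": response.meta["publisher"],
                "journal_url": response.urljoin(href),
            }
```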
Based on the journal URLs from step 1, scrape all paper URLs and save them to ./data/papers.csv:

scrapy crawl paper_urls -a publishers=./meta/publishers.yaml -a journals=./data/journals.csv -o ./data/papers.csv
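A sketch of what this step could look like, assuming journals.csv has a journal_url column (as produced by the sketch above) and that article links are reachable via a CSS selector (a.article-link is a placeholder):

```python
import csv

import scrapy


class PaperUrlSpider(scrapy.Spider):
    """Hypothetical sketch of the paper URL step; column and selector names are assumed."""
    name = "paper_urls"

    def __init__(self, publishers=None, journals=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.journals_file = journals  # ./data/journals.csv from step 1

    def start_requests(self):
        with open(self.journals_file, newline="") as f:
            for row in csv.DictReader(f):
                yield scrapy.Request(row["journal_url"], meta={"journal": row})

    def parse(self, response):
        # assumes article links are identifiable with a placeholder CSS selector
        for href in response.css("a.article-link::attr(href)").getall():
            yield {
                "journal_url": response.meta["journal"]["journal_url"],
                "paper_url": response.urljoin(href),
            }
```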
Once you have the list of papers you want to download (created in steps 1 and 2), download them with this command:
scrapy crawl papers -a papers=./data/papers.csv -a save_to=./data/papers/ -o ./data/papers/index.csv
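A sketch of the download step, assuming papers.csv has a paper_url column; the file-naming scheme (SHA-1 of the URL) is an arbitrary choice for illustration, not necessarily what the repo does:

```python
import csv
import hashlib
from pathlib import Path

import scrapy


class PaperSpider(scrapy.Spider):
    """Hypothetical sketch of the paper download step."""
    name = "papers"

    def __init__(self, papers=None, save_to=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.papers_file = papers          # ./data/papers.csv from step 2
        self.save_to = Path(save_to)       # ./data/papers/
        self.save_to.mkdir(parents=True, exist_ok=True)

    def start_requests(self):
        with open(self.papers_file, newline="") as f:
            for row in csv.DictReader(f):
                yield scrapy.Request(row["paper_url"])

    def parse(self, response):
        # store the raw HTML under a name derived from the URL (illustrative scheme)
        filename = hashlib.sha1(response.url.encode()).hexdigest() + ".html"
        (self.save_to / filename).write_bytes(response.body)
        # with -o, these records become ./data/papers/index.csv
        yield {"paper_url": response.url, "file": filename}
```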
- Check out the repo
- Use Python 3
- Install scrapy and the other dependencies (a virtual environment is recommended; see the sketch below):
  pip install -r requirements.txt
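For example, a typical virtual-environment setup might look like this (the .venv directory name is just a convention, not mandated by the repo):

```bash
# create and activate a virtual environment, then install the dependencies
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```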