Commit

Update README and resource text
titipata committed Feb 17, 2020
1 parent 58ac743 commit 1376aa6
Showing 2 changed files with 12 additions and 15 deletions.
8 changes: 3 additions & 5 deletions README.md
@@ -4,12 +4,10 @@
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3660006.svg)](https://doi.org/10.5281/zenodo.3660006) [![Build Status](https://travis-ci.com/titipata/pubmed_parser.svg?branch=master)](https://travis-ci.com/titipata/pubmed_parser)

Pubmed Parser is a Python library for parsing the [PubMed Open-Access (OA) subset](http://www.ncbi.nlm.nih.gov/pmc/tools/ftp/)
, [MEDLINE XML](https://www.nlm.nih.gov/bsd/licensee/) repositories, and [Entrez Programming Utilities (E-utils)](https://eutils.ncbi.nlm.nih.gov/). It uses `lxml` library to parse this information into a Python dictionary which can be easily used for research such in text mining and natural language processing pipelines. See
[wiki page](https://github.com/titipata/pubmed_parser/wiki) on how to download and
process dataset using the repository.
, [MEDLINE XML](https://www.nlm.nih.gov/bsd/licensee/) repositories, and [Entrez Programming Utilities (E-utils)](https://eutils.ncbi.nlm.nih.gov/). It uses `lxml` library to parse this information into a Python dictionary which can be easily used for research such in text mining and natural language processing pipelines.

For available APIs and details about the dataset, please see [documentation page](http://titipata.github.io/pubmed_parser/) for more details.
Below, we list some of the core funtionalities and examples code.
For available APIs and details about the dataset, please see our [wiki page](https://github.com/titipata/pubmed_parser/wiki) or
[documentation page](http://titipata.github.io/pubmed_parser/) for more details. Below, we list some of the core funtionalities and examples code.

## Available Parsers

19 changes: 9 additions & 10 deletions docs/resources.rst
@@ -2,15 +2,14 @@
Resources
=========

Here are some useful resources for downloading data.
Here are some useful resources for downloading MEDLINE and PubMed Open Access (PubMed OA) XML data.

Links to download PubMed OA and MEDLINE dataset
-----------------------------------------------

Links to download Pubmed and MEDLINE dataset
--------------------------------------------
Below, we provide links for downloading PubMed OA and MEDLINE data

Here are links for downloading PubMed OA and MEDLINE data

- `PubMed Open-Access (OA) <http://www.ncbi.nlm.nih.gov/pmc/tools/ftp/>`_ dataset is available at ``http://www.ncbi.nlm.nih.gov/pmc/tools/ftp/``. Here is the `FTP link <ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/>`_ for downloading the bulk of dataset. You can check `oa_bulk folder <ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk//>`_ to see the full tar files.
- `PubMed Open-Access (OA) <http://www.ncbi.nlm.nih.gov/pmc/tools/ftp/>`_ dataset is available at ``http://www.ncbi.nlm.nih.gov/pmc/tools/ftp/``. Here is the `FTP link <ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/>`_ for downloading the bulk of dataset. In the FTP link, you can go to `oa_bulk folder <ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/>`_ to see the full available tar files.
- the MEDLINE XMLs are available here ``ftp://ftp.nlm.nih.gov/nlmdata/.medleasebaseline/gz/``
- the MEDLINE XMLs weekly updates are available here ``ftp://ftp.nlm.nih.gov/nlmdata/.medlease/gz/``
- MEDLINE Document Type Definitions (DTDs) file is available at this `link <https://www.nlm.nih.gov/databases/dtd/>`_. We can use it to see available tags from a given MEDLINE XML.
@@ -19,17 +18,17 @@ Here are links for downloading PubMed OA and MEDLINE data
Download PubMed OA figures
--------------------------

Here, we explain how to download PubMed OA figures corresponded to the parsed information
Here, we explain how to download PubMed OA figures corresponded to the parsed information from ``parse_pubmed_caption`` function

- In ``pubmed_parser``, you can use ``parse_pubmed_caption`` to parse figures (to be specific ``figure_id``) and captions corresponding to a manuscript.
- To download the images corresponding to a given ``PMC`` or ``PMID``, you can download a CSV file from ``ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_file_list.csv`` first. - The file will have columns ``PMID``, ``Accession ID`` (``PMC``), and ``File`` where it looks something like ``oa_package/08/e0/PMC13900.tar.gz``.
- You can then download a tar file for a given ``PMID`` or ``PMC`` from ``ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/08/e0/PMC13900.tar.gz``. You can check out ``ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/`` to get all the access for tar files.
- To download the images corresponding to a given ``PMC`` or ``PMID``, you can download a CSV file from ``ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_file_list.csv`` first. The file will have columns ``PMID``, ``Accession ID`` (``PMC``), and ``File``. In ``File`` column, you can see the path to download a tar file of an XML and corresponding figures in the following format ``oa_package/08/e0/PMC13900.tar.gz``.
- You can use the path to download a tar file for a given ``PMID`` or ``PMC`` in a following format: ``ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/08/e0/PMC13900.tar.gz``. If you want to download all the tar files, check out ``ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/`` to see all the files.
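
The lookup described in the list above can be sketched in Python. This is a minimal illustration and not part of ``pubmed_parser``: the helper name ``find_package_url``, the inline sample CSV, and the exact column order are assumptions made for the example, and the PMID value is a placeholder. The real ``oa_file_list.csv`` may contain additional columns.

```python
import csv
import io

# Base FTP location named in the text above.
BASE_URL = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/"

# Tiny inline stand-in for oa_file_list.csv (assumed layout; PMID is a placeholder).
SAMPLE_CSV = """File,Accession ID,PMID
oa_package/08/e0/PMC13900.tar.gz,PMC13900,PMID_PLACEHOLDER
"""


def find_package_url(csv_text, accession_id):
    """Return the tar file URL for a PMC accession ID, or None if it is not listed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        if row["Accession ID"] == accession_id:
            # The File column holds a path relative to the FTP base directory.
            return BASE_URL + row["File"]
    return None


url = find_package_url(SAMPLE_CSV, "PMC13900")
print(url)
```

With the full file list downloaded locally, the same function could be given the file's text instead of ``SAMPLE_CSV``. The resulting URL can then be fetched one article at a time, which avoids bulk downloading.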


PMC Copyright Notice
--------------------

Wehn you use Pubmed Parser to parse information from the website, do not download them as a bulk. Your IP might get banned from doing it.
When you use Pubmed Parser to parse information from the website, do not download them as a bulk. Your IP might get banned from doing it.
Please see copyright notice when you scrape data from website `here <https://www.ncbi.nlm.nih.gov/pmc/about/copyright/#copy-PMC/>`_.

Alternative implementation of MEDLINE parsers
