Update README
johnne committed May 6, 2021
1 parent e3225b7 commit e1124e9
Showing 1 changed file with 331 additions and 7 deletions.
338 changes: 331 additions & 7 deletions README.md
@@ -18,14 +18,95 @@ conda install -c bioconda coidb
2. Clone git repository and build

```bash
git clone https://github.com/johnne/coidb.git
# enter the cloned directory before building
cd coidb
python setup.py install
```

## Quick start
To generate a file `bold_clustered.fasta` with COI-5P sequences, run:

```bash
coidb
```

This will download, filter and cluster sequences from [GBIF Hosted Datasets](https://hosted-datasets.gbif.org/ibol/).

See below for configuration and more options.

## Configuration
A few configurable parameters modify how sequences are filtered and clustered.
You can set these parameters using a config file in `yaml` format. The default
setup looks like this:

```yaml
database:
  # url to download info and sequence files from
  url: "https://hosted-datasets.gbif.org/ibol/ibol.zip"
  # gene of interest (will be used to filter sequences)
  gene:
    - "COI-5P"
  # phyla of interest (omit this in order to include all phyla)
  phyla: []
  # Percent identity to cluster seqs in the database by
  pid: 1.0
```
### Gene types
By default, only sequences named 'COI-5P' are included in the
final output. To modify this behaviour you can supply a config file in `yaml`
format via `-c <path-to-configfile.yaml>`. For example, to also include
'COI-3P' sequences you can create a config file, _e.g._ named `config.yaml` with
these contents:

```yaml
database:
  gene:
    - 'COI-5P'
    - 'COI-3P'
```

Then run `coidb` as:

```bash
coidb -c config.yaml
```

Typical gene names and their occurrence in the database are shown in
[this table](#common-gene-types).

### Phyla

The default is to include sequences from all taxa. However, you can filter the
resulting sequences to only those from one or more phyla. For instance, to only
include sequences from the phyla 'Arthropoda' and 'Chordata' you supply a
config file with these contents:

```yaml
database:
  phyla:
    - 'Arthropoda'
    - 'Chordata'
```

Typical phyla and their occurrence in the database are shown in
[this table](#common-phyla).

### Clustering

After sequences have been filtered to the genes and phyla of interest, they are
clustered on a per-species (or BOLD `BIN` id where applicable) basis using
`vsearch`. By default this clustering is performed at 100% identity. To change
this, _e.g._ to 95% identity, make sure your config file contains:

```yaml
database:
  pid: 0.95
```
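
The gene, phyla and identity settings above can presumably be combined in one file, since the default configuration lists all of them under the same `database` key. The following is a minimal sketch under that assumption: it writes a `config.yaml` to the current directory (overwriting any existing file of that name) and, if `coidb` is installed, previews the workflow with a dry run. The chosen values are only illustrations taken from the examples above.

```bash
# Write a config that combines the three settings shown above.
# NOTE: this overwrites config.yaml in the current directory.
cat > config.yaml <<'EOF'
database:
  gene:
    - 'COI-5P'
    - 'COI-3P'
  phyla:
    - 'Arthropoda'
    - 'Chordata'
  pid: 0.95
EOF

# Preview the workflow with a dry run (-n) if coidb is available.
if command -v coidb >/dev/null 2>&1; then
    coidb -n -c config.yaml
else
    echo "coidb not found on PATH; wrote config.yaml only"
fi
```

Dropping `-n` from the command performs the actual download, filtering and clustering.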

## Command line options

The `coidb` tool is a wrapper for a small snakemake workflow that handles
all the downloading, filtering and clustering.

```
usage: coidb [-h] [-n] [-j CORES] [-f] [-u] [-c [CONFIG_FILE ...]] [--cluster-config CLUSTER_CONFIG] [--workdir WORKDIR] [-p] [targets ...]
@@ -48,7 +129,39 @@ optional arguments:
-p, --printshellcmds Print shell commands
```

Explanation:

`-n, --dryrun`: Only print what will be done, don't actually do anything.

`-j, --cores`: The number of cores to run the workflow with. Because the download
and filtering steps have to be run in sequential order this only affects the
clustering step using `vsearch`.

`-f, --force`: Force the execution of the workflow even though files already
exist.

`-u, --unlock`: Release a working directory lock (which could result from a
previously interrupted run).

`-c, --configfile`: Supply a configuration file to alter the behaviour of the
tool.

`--workdir`: Specify the directory in which to read/write output files.
Defaults to the current directory.

`-p, --printshellcmds`: Shows the actual commands as they are being executed.
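
As an illustration of how these flags combine, here is a small sketch that previews a run on 4 cores while printing the shell commands. The `run` helper is hypothetical and not part of `coidb`: it echoes each command and executes it only if the program is found on `PATH`, so the snippet is safe to try on a machine without `coidb`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a command, and execute it only when the
# program named in its first argument is installed.
run() {
    echo "+ $*"
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    fi
}

# Preview the workflow on 4 cores without executing anything (-n),
# showing the shell commands that would be run (-p).
run coidb -n -j 4 -p

# If a previous run was interrupted and left the directory locked,
# release the lock first (-u):
# run coidb -u
```

Removing `-n` turns the preview into a real run; adding `-f` forces execution even if output files already exist.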

### Step-by-step

You can also run the `coidb` tool in steps, _e.g._ if you are only interested
in some of the files or if you want to inspect the results before proceeding
to the next step. This is done using the positional argument `targets`.
Valid targets are `download`, `filter` and `cluster`.

#### Step 1: Download

For example, to only download files from GBIF you can run:

```bash
coidb download
```

@@ -57,10 +170,221 @@ coidb download
This should produce two files `bold_info.tsv` and `bold_seqs.txt` containing
metadata and nucleotide sequences, respectively.

#### Step 2: Filter

To also filter the `bold_info.tsv` and `bold_seqs.txt` files (according to the
default 'COI-5P' gene or any other genes/phyla you've defined in the optional
config file) you can run:

```bash
coidb filter
```

This filters sequences in `bold_seqs.txt` and entries in `bold_info.tsv` to
potential genes and phyla of interest, respectively. Entries are then merged
so that only sequences with relevant information are kept. Output files from
this step are `bold_filtered.fasta` and `bold_info_filtered.tsv`.


#### Step 3: Clustering

The final step clusters sequences in `bold_filtered.fasta` on a per-species
basis. This means that for each species, the sequences are gathered,
clustered with `vsearch` and only the representative sequences are kept. In this
step a sequence can be labelled with either a species name or a BOLD `BIN` ID
(_e.g._ `BOLD:AAY5017`); both kinds of label are treated as equivalent.

To run the clustering step, do:

```bash
coidb cluster
```

The end result is a file `bold_clustered.fasta`.
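
To inspect how much the clustering step reduced the data, one option is to compare the number of records before and after. The sketch below assumes the output files are in the current directory and follow the usual FASTA convention of one header line starting with `>` per sequence; `count_seqs` is a hypothetical helper, not something shipped with `coidb`.

```bash
# Hypothetical helper: count FASTA records by counting header lines.
# grep -c exits non-zero when the count is 0, hence the '|| true'.
count_seqs() {
    grep -c '^>' "$1" || true
}

# Report the record count for the filtered and clustered files,
# or a hint if a file has not been produced yet.
for f in bold_filtered.fasta bold_clustered.fasta; do
    if [ -f "$f" ]; then
        echo "$f: $(count_seqs "$f") sequences"
    else
        echo "$f: not found (run the corresponding coidb step first)"
    fi
done
```

With the default 100% identity the reduction only reflects identical sequences within a species or `BIN`; a lower `pid` should give fewer representatives.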

## Common gene types

| #seqs | gene |
| ------- | ------ |
| 6074566 | COI-5P |
| 153409 | COI-3P |
| 146758 | ITS |
| 114124 | matK |
| 110798 | ITS2 |
| 86915 | rbcL |
| 66793 | rbcLa |
| 14192 | 16S |
| 13496 | CYTB |
| 10675 | trnH-psbA |
| 9787 | COII |
| 9140 | 28S |
| 9066 | COXIII |
| 6166 | ND2 |
| 5872 | ND1 |
| 5868 | ND5-0 |
| 5868 | ND3 |
| 5867 | ND4 |
| 5863 | ND4L |
| 5843 | ND6 |
| 5772 | ITS1 |
| 4866 | 28S-D2 |
| 3940 | 12S |
| 3751 | 18S |
| 3547 | atp6 |
| 3459 | 5-8S |
| 3135 | trnL-F |
| 3027 | D-loop |
| 2870 | EF1-alpha |
| 1991 | Wnt1 |
| 1822 | Rho |
| 1722 | COI-PSEUDO |
| 1716 | H3 |
| 1326 | CAD |
| 1241 | rpoC1 |
| 1236 | atpF-atpH |
| 968 | tufA |
| 944 | COI-LIKE |
| 865 | rpoB |
| 865 | UPA |
| 749 | psbK-psbI |
| 597 | 28S-D2-D3 |
| 470 | CAD4 |
| 449 | PSBA |
| 431 | PGD |
| 393 | DBY-EX7-8 |
| 383 | GAPDH |
| 353 | RpS5 |
| 336 | ycf1 |
| 309 | AATS |
| 296 | 28S-D1-D2 |
| 240 | 28S-D9-D10 |
| 223 | MDH |
| 194 | TPI |
| 192 | trnD-trnY-trnE |
| 188 | LWRHO |
| 186 | RAG1 |
| 172 | H4 |
| 171 | COII-COI |
| 168 | ND6-ND3 |
| 167 | RAG2 |
| 167 | 16S-ND2 |
| 166 | IDH |
| 154 | RpS2 |
| 144 | 18S-V4 |
| 141 | 28S-D3-D5 |
| 137 | RNF213 |
| 132 | MC1R |
| 132 | MB2-EX2-3 |
| 125 | fbpA |
| 124 | ND4L-MSH |
| 124 | ArgKin |
| 120 | CADH |
| 117 | CHD-Z |
| 107 | ENO |
| 103 | 28S-D3 |
| 101 | CHOLC |
| 99 | VDAC |
| 98 | ADR |
| 95 | RPB2 |
| 94 | atpB-rbcL |
| 94 | atp6-atp8 |
| 92 | DYN |
| 91 | H3-NUMT |
| 88 | COI-NUMT |
| 86 | PSA |
| 86 | CYTB-NUMT |
| 81 | AOX-fmt |
| 72 | trnK |
| 69 | matR |
| 65 | CsIV |
| 64 | nucLSU |
| 64 | EF2 |
| 61 | TYR |
| 61 | ARK |
| 56 | ATP1A |
| 55 | petD-intron |
| 55 | matK-trnK |
| 53 | PLAGL2 |
| 47 | psbA-3P |
| 38 | PER |
| 31 | matK-like |
| 31 | FL-COI |
| 30 | CAD1 |
| 30 | 18S-3P |
| 25 | rbcL-like |
| 24 | DDC |
| 21 | HfIV |
| 20 | R35 |
| 17 | COII-COIII |
| 16 | RBM15 |
| 16 | NGFB |
| 16 | CK1 |
| 15 | WSP |
| 14 | psaB |
| 14 | TULP |
| 10 | rpL32-trnL |
| 10 | PY-IGS |
| 9 | EF1-alpha-5P |
| 7 | NBC-COI-5P |
| 4 | COI-5PNMT1 |
| 2 | TMO-4C4 |
| 2 | PKD1 |
| 1 | S7 |
| 1 | RPL37 |
| 1 | RPB1 |
| 1 | RBCL-5P |
| 1 | COI-5PNMT2 |
| 1 | Beta-tubulin |

## Common phyla

| #entries | Phylum |
| -------- | ------ |
| 5886491 | Arthropoda |
| 505704 | Chordata |
| 270743 | Magnoliophyta |
| 180809 | Mollusca |
| 76301 | Ascomycota |
| 58536 | Annelida |
| 48727 | Basidiomycota |
| 29817 | Rhodophyta |
| 28723 | Echinodermata |
| 28105 | Platyhelminthes |
| 21786 | Nematoda |
| 19453 | Cnidaria |
| 16321 | Bryophyta |
| 9116 | Rotifera |
| 8368 | Pteridophyta |
| 7122 | Chlorophyta |
| 5863 | Pinophyta |
| 4877 | Porifera |
| 4863 | Heterokontophyta |
| 3770 | Nemertea |
| 3516 | Glomeromycota |
| 2934 | Zygomycota |
| 2095 | Acanthocephala |
| 1787 | Bryozoa |
| 1671 | Tardigrada |
| 1512 | Pyrrophycophyta |
| 1339 | Chaetognatha |
| 1248 | Onychophora |
| 952 | Lycopodiophyta |
| 711 | Gastrotricha |
| 640 | Sipuncula |
| 573 | Ciliophora |
| 393 | Kinorhyncha |
| 370 | Nematomorpha |
| 276 | Chytridiomycota |
| 273 | Cycliophora |
| 223 | Myxomycota |
| 202 | Brachiopoda |
| 153 | Ctenophora |
| 149 | Hemichordata |
| 104 | Priapulida |
| 102 | Phoronida |
| 61 | Chlorarachniophyta |
| 48 | Rhombozoa |
| 21 | Entoprocta |
| 16 | Xenacoelomorpha |
| 16 | Gnathostomulida |
| 12 | Placozoa |
