Km buildindices docs #1158

Merged
merged 4 commits into from
Dec 20, 2023
122 changes: 122 additions & 0 deletions website/docs/Pipelines/BuildIndices_Pipeline/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,122 @@
---
sidebar_position: 1
slug: /Pipelines/BuildIndices_Pipeline/README
---

# BuildIndices Overview

| Pipeline Version | Date Updated | Documentation Author | Questions or Feedback |
| :----: | :---: | :----: | :--------------: |
| [BuildIndices_v3.0.0](https://github.com/broadinstitute/warp/releases) | December, 2023 | Kaylee Mathews | Please file GitHub issues in warp or contact [documentation authors](mailto:[email protected]) |

![BuildIndices_diagram](./buildindices_diagram.png)


## Introduction to the BuildIndices workflow

The [BuildIndices workflow](https://github.com/broadinstitute/warp/blob/master/pipelines/skylab/build_indices/BuildIndices.wdl) is an open-source, cloud-optimized pipeline developed in collaboration with the [BRAIN Initiative Cell Census Network](https://biccn.org/) (BICCN) and the BRAIN Initiative Cell Atlas Network (BICAN).

Overall, the workflow filters GTF files for selected gene biotypes, calculates chromosome sizes, and builds reference bundles with required files for [STAR](https://github.com/alexdobin/STAR) and [bwa-mem2](https://github.com/bwa-mem2/bwa-mem2) aligners.

## Quickstart table
The following table provides a quick glance at the BuildIndices pipeline features:

| Pipeline features | Description | Source |
| --- | --- | --- |
| Overall workflow | Reference bundle creation for STAR and bwa-mem2 aligners | Code available on [GitHub](https://github.com/broadinstitute/warp/blob/master/pipelines/skylab/build_indices/BuildIndices.wdl) |
| Workflow language | WDL 1.0 | [openWDL](https://github.com/openwdl/wdl) |
| Genomic Reference Sequence | GRCh38 human genome primary sequence, M32 (GRCm39) mouse genome primary sequence, and release 103 (GCF_003339765.1) macaque genome primary sequence | GENCODE [human reference files](https://www.gencodegenes.org/human/release_43.html), GENCODE [mouse reference files](https://www.gencodegenes.org/mouse/release_M32.html), and NCBI [macaque reference files](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_003339765.1/) |
| Gene annotation reference (GTF) | Reference containing gene annotations | GENCODE [human GTF](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_43/gencode.v43.annotation.gtf.gz), GENCODE [mouse GTF](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M32/gencode.vM32.primary_assembly.annotation.gtf.gz), and NCBI [macaque GTF](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_003339765.1/) |
| Reference builders | STAR, bwa-mem2 | [Dobin et al. 2013](https://pubmed.ncbi.nlm.nih.gov/23104886/), [Vasimuddin et al. 2019](https://ieeexplore.ieee.org/document/8820962) |
| Data input file format | File format in which reference files are provided | FASTA, GTF, TSV |
| Data output file format | File formats in which BuildIndices output is provided | GTF, TAR, TXT |

## Set-up

### BuildIndices installation

To download the latest BuildIndices release, see the release tags prefixed with "BuildIndices" on the WARP [releases page](https://github.com/broadinstitute/warp/releases). All BuildIndices pipeline releases are documented in the [BuildIndices changelog](https://github.com/broadinstitute/warp/blob/master/pipelines/skylab/build_indices/BuildIndices.changelog.md).

To search releases of this and other pipelines, use the WARP command-line tool [Wreleaser](https://github.com/broadinstitute/warp/tree/master/wreleaser).

If you’re running a BuildIndices workflow version prior to the latest release, the accompanying documentation for that release may be downloaded with the source code on the WARP [releases page](https://github.com/broadinstitute/warp/releases) (see the folder `website/docs/Pipelines/BuildIndices_Pipeline`).

The BuildIndices pipeline can be deployed using [Cromwell](https://cromwell.readthedocs.io/en/stable/), a GA4GH-compliant, flexible workflow management system that supports multiple computing platforms. The workflow can also be run in [Terra](https://app.terra.bio), a cloud-based analysis platform.

### Inputs

The BuildIndices workflow inputs are specified in JSON configuration files. Configuration files for [macaque](https://github.com/broadinstitute/warp/blob/master/pipelines/skylab/build_indices/Macaque.json) and [mouse](https://github.com/broadinstitute/warp/blob/master/pipelines/skylab/build_indices/Mouse.json) references can be found in the WARP repository.

#### Input descriptions

| Parameter name | Description | Type |
| --- | --- | --- |
| genome_source | Describes the source of the reference genome listed in the GTF file; used to name output files; can be set to “NCBI” or “GENCODE”. | String |
| gtf_annotation_version | Version or release of the reference genome listed in the GTF file; used to name STAR output files; ex. “M32”, “103”. | String |
| genome_build | Assembly accession (NCBI) or version (GENCODE) of the reference genome listed in the GTF file; used to name output files; ex. “GRCm39”, “GCF_003339765.1”. | String |
| organism | Organism of the reference genome; used to name the output files; can be set to “Macaque”, “Mouse”, “Human”, or any other organism matching the reference genome. | String |
| annotations_gtf | GTF file containing gene annotations; used to build the STAR reference files. | File |
| genome_fa | Genome FASTA file used for building indices. | File |
| biotypes | TSV file containing gene biotypes attributes to include in the modified GTF file; the first column contains the biotype and the second column contains “Y” to include or “N” to exclude the biotype; [GENCODE biotypes](https://www.gencodegenes.org/pages/biotypes.html) are used for GENCODE references and RefSeq biotypes are used for NCBI references. | File |
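
As a rough illustration of the table above, the sketch below assembles an inputs JSON for a mouse GENCODE reference. The `BuildIndices.` key prefix assumes the usual Cromwell convention of `<workflow name>.<input name>`, and the `gs://my-bucket/...` paths are hypothetical placeholders; the Macaque.json and Mouse.json files in the WARP repository remain the authoritative examples.

```python
import json

# Hypothetical values for a mouse GENCODE reference.
# The key prefix "BuildIndices." is an assumption (workflow name),
# and the gs:// paths are placeholders, not real locations.
inputs = {
    "BuildIndices.genome_source": "GENCODE",
    "BuildIndices.gtf_annotation_version": "M32",
    "BuildIndices.genome_build": "GRCm39",
    "BuildIndices.organism": "Mouse",
    "BuildIndices.annotations_gtf": "gs://my-bucket/annotation.gtf",
    "BuildIndices.genome_fa": "gs://my-bucket/genome.fa",
    "BuildIndices.biotypes": "gs://my-bucket/Biotypes.tsv",
}

# Serialize to the JSON text that would be saved as the configuration file.
inputs_json = json.dumps(inputs, indent=2)
print(inputs_json)
```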

## BuildIndices tasks and tools

Overall, the BuildIndices workflow:
1. Checks inputs, modifies reference files, and creates STAR index.
2. Calculates chromosome sizes.
3. Builds reference bundle for bwa-mem2.

The tasks and tools used in the BuildIndices workflow are detailed in the table below.

To see specific tool parameters, select the [workflow WDL link](https://github.com/broadinstitute/warp/blob/master/pipelines/skylab/build_indices/BuildIndices.wdl); then find the task and view the `command {}` section of the task in the WDL script. To view or use the exact tool software, see the task's Docker image which is specified in the task WDL `# runtime values` section as `docker: `.

| Task name | Tool | Software | Description |
| --- | --- | --- | --- |
| BuildStarSingleNucleus | [modify_gtf.py](https://github.com/broadinstitute/warp-tools/blob/develop/3rd-party-tools/build-indices/modify_gtf.py), STAR | [warp-tools](https://github.com/broadinstitute/warp-tools/tree/develop), [STAR](https://github.com/alexdobin/STAR) | Checks that the input GTF file contains input genome source, genome build version, and annotation version with correct build source information, modifies files for the STAR aligner, and creates STAR index file. |
| CalculateChromosomeSizes | faidx | [Samtools](http://www.htslib.org/) | Reads the genome FASTA file to create a FASTA index file that contains the genome chromosome sizes. |
| BuildBWAreference | index | [bwa-mem2](https://github.com/bwa-mem2/bwa-mem2) | Builds the reference bundle for the bwa-mem2 aligner. |

#### 1. Check inputs, modify reference files, and create STAR index file

**Check inputs**

The BuildStarSingleNucleus task reads the input GTF file and verifies that the `genome_source`, `genome_build`, and `gtf_annotation_version` listed in the file match the input values provided to the pipeline.
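
The exact check lives in the task's `command {}` section; the sketch below only illustrates the idea, assuming the values are expected to appear somewhere in the GTF header comment lines. The function name `missing_from_header` and the example header lines are hypothetical, and plain substring matching is a simplification (a short value such as “103” could match unrelated text).

```python
def missing_from_header(gtf_lines, expected):
    """Return the expected values that do not appear in the GTF header comments."""
    # Header lines in a GTF file start with "#".
    header = " ".join(line for line in gtf_lines if line.startswith("#"))
    return [value for value in expected if value not in header]


# Illustrative GTF content; the header text is an example, not real pipeline input.
example_gtf = [
    "##description: annotation of the mouse genome (GRCm39), version M32",
    "##provider: GENCODE",
    'chr1\tHAVANA\tgene\t100\t200\t.\t+\t.\tgene_id "G1";',
]

# genome_source, genome_build, gtf_annotation_version as passed to the workflow.
missing = missing_from_header(example_gtf, ["GENCODE", "GRCm39", "M32"])
if missing:
    raise ValueError(f"GTF header does not mention: {missing}")
```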

**Modify reference files and create STAR index**

The BuildStarSingleNucleus task uses a custom Python script, [`modify_gtf.py`](https://github.com/broadinstitute/warp-tools/blob/develop/3rd-party-tools/build-indices/modify_gtf.py), and a list of biotypes ([example](https://github.com/broadinstitute/warp-tools/blob/develop/3rd-party-tools/build-indices/Biotypes.tsv)) to filter the input GTF file for only the biotypes indicated in the list with the value “Y” in the second column. The defaults in the custom code produce reference outputs that are similar to those built with 10x Genomics reference scripts.
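
The snippet below is a minimal sketch of the Y/N selection described above, not the actual `modify_gtf.py` logic, which may apply additional rules. It assumes the biotype is stored in the `gene_type` attribute (GENCODE) or the `gene_biotype` attribute (NCBI); the helper names `load_biotypes` and `filter_gtf` are hypothetical.

```python
import re

# Matches gene_type "..." (GENCODE) or gene_biotype "..." (NCBI) in the attribute column.
BIOTYPE_RE = re.compile(r'gene_(?:type|biotype) "([^"]+)"')


def load_biotypes(tsv_lines):
    """Collect biotypes whose second column is "Y"."""
    keep = set()
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1].strip() == "Y":
            keep.add(fields[0].strip())
    return keep


def filter_gtf(gtf_lines, keep):
    """Yield header lines and records whose gene biotype is in `keep`."""
    for line in gtf_lines:
        if line.startswith("#"):
            yield line
            continue
        match = BIOTYPE_RE.search(line)
        if match and match.group(1) in keep:
            yield line


# Illustrative inputs only.
example_biotypes = ["protein_coding\tY", "lncRNA\tY", "TEC\tN"]
example_gtf = [
    "##provider: GENCODE",
    'chr1\tHAVANA\tgene\t100\t200\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";',
    'chr1\tHAVANA\tgene\t300\t400\t.\t-\t.\tgene_id "G2"; gene_type "TEC";',
]

keep = load_biotypes(example_biotypes)
kept = list(filter_gtf(example_gtf, keep))
```

Here `kept` retains the header line and the `protein_coding` record, while the `TEC` record is dropped because its biotype is marked “N”.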

The task uses the filtered GTF file and STAR `--runMode genomeGenerate` to generate the index file for the STAR aligner. Outputs of the task include the modified GTF and compressed STAR index files.
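
For orientation, the sketch below builds (but does not run) a STAR index command of this kind. The flags shown are standard STAR options, but the file names, thread count, and `--sjdbOverhang` value are illustrative assumptions; the parameters the workflow actually uses are in the task's `command {}` section.

```python
import shlex


def star_genome_generate_cmd(genome_fa, gtf, genome_dir, threads=16, sjdb_overhang=100):
    """Build an argument list for STAR index generation (values are illustrative)."""
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", genome_fa,
        "--sjdbGTFfile", gtf,
        "--sjdbOverhang", str(sjdb_overhang),
        "--runThreadN", str(threads),
    ]


# Hypothetical file names; the command is only printed, not executed.
cmd = star_genome_generate_cmd("genome.fa", "modified.annotation.gtf", "star_index")
print(shlex.join(cmd))
```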

#### 2. Calculate chromosome sizes

The CalculateChromosomeSizes task uses Samtools to create and output a FASTA index file that contains the genome chromosome sizes, which can be used in downstream tools like SnapATAC2.
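
A FASTA index (`.fai`) produced by `samtools faidx` is tab-delimited, with the sequence name in the first column and its length in the second. The sketch below shows how chromosome sizes can be read from such a file; the function name `chrom_sizes_from_fai` and the example values are hypothetical, and whether the workflow derives `chrom.sizes` exactly this way is an assumption.

```python
def chrom_sizes_from_fai(fai_lines):
    """Return (name, length) pairs from the first two columns of a .fai file."""
    sizes = []
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        sizes.append((fields[0], int(fields[1])))
    return sizes


# Made-up index lines: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH.
example_fai = [
    "chrA\t1000\t6\t60\t61",
    "chrB\t250\t1030\t60\t61",
]

sizes = chrom_sizes_from_fai(example_fai)
for name, length in sizes:
    print(f"{name}\t{length}")
```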

#### 3. Build reference bundle for bwa-mem2

The BuildBWAreference task uses the chromosome sizes file and bwa-mem2 to prepare the genome FASTA file for alignment and builds, compresses, and outputs the reference bundle for the bwa-mem2 aligner.
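
The bundling step can be pictured as collecting the index files in one directory and archiving them. The sketch below uses Python's `tarfile` module with placeholder files standing in for the output of `bwa-mem2 index`; the helper name `bundle_reference` and the file names are hypothetical, and the real bundle's contents are determined by the task itself.

```python
import tarfile
import tempfile
from pathlib import Path


def bundle_reference(index_dir, tar_path):
    """Archive every file in index_dir into tar_path and return the member names."""
    index_dir = Path(index_dir)
    with tarfile.open(tar_path, "w") as tar:
        for path in sorted(index_dir.iterdir()):
            if path.is_file():
                # Store files at the top level of the archive.
                tar.add(path, arcname=path.name)
    with tarfile.open(tar_path) as tar:
        return tar.getnames()


# Demonstration with placeholder files, not real index output.
with tempfile.TemporaryDirectory() as tmp:
    index_dir = Path(tmp) / "index"
    index_dir.mkdir()
    for name in ("genome.fa", "genome.fa.ann", "genome.fa.pac"):
        (index_dir / name).write_text("placeholder\n")
    names = bundle_reference(index_dir, Path(tmp) / "bundle.tar")
```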

## Outputs

The following table lists the output variables and files produced by the pipeline.

| Output name | Filename, if applicable | Output format and description |
| ------ | ------ | ------ |
| snSS2_star_index | `modified_star2.7.10a-<organism>-<genome_source>-build-<genome_build>-<gtf_annotation_version>.tar` | TAR file containing a species-specific reference genome and GTF file for [STAR](https://github.com/alexdobin/STAR) alignment. |
| pipeline_version_out | `BuildIndices_v<pipeline_version>` | String describing the version of the BuildIndices pipeline used. |
| snSS2_annotation_gtf_modified | `modified_v<gtf_annotation_version>.annotation.gtf` | GTF file containing gene annotations filtered for selected biotypes. |
| reference_bundle | `bwa-mem2-2.2.1-<organism>-<genome_source>-build-<genome_build>.tar` | TAR file containing the reference index files for [bwa-mem2](https://github.com/bwa-mem2/bwa-mem2) alignment. |
| chromosome_sizes | `chrom.sizes` | Text file containing chromosome sizes for the genome build. |
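
To make the filename patterns in the table concrete, the sketch below fills them in from the workflow's string inputs. The templates are copied from the table; the function name `output_names` is hypothetical.

```python
def output_names(organism, genome_source, genome_build, gtf_annotation_version):
    """Fill in the filename templates from the outputs table."""
    return {
        "snSS2_star_index": (
            f"modified_star2.7.10a-{organism}-{genome_source}"
            f"-build-{genome_build}-{gtf_annotation_version}.tar"
        ),
        "snSS2_annotation_gtf_modified": f"modified_v{gtf_annotation_version}.annotation.gtf",
        "reference_bundle": f"bwa-mem2-2.2.1-{organism}-{genome_source}-build-{genome_build}.tar",
        "chromosome_sizes": "chrom.sizes",
    }


# Example using the mouse values mentioned in the input descriptions.
names = output_names("Mouse", "GENCODE", "GRCm39", "M32")
for output, filename in names.items():
    print(f"{output}: {filename}")
```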

## Versioning and testing

All BuildIndices pipeline releases are documented in the [BuildIndices changelog](https://github.com/broadinstitute/warp/blob/master/pipelines/skylab/build_indices/BuildIndices.changelog.md) and tested manually using [reference JSON files](https://github.com/broadinstitute/warp/tree/master/pipelines/skylab/build_indices).

## Consortia support
This pipeline is supported by the [BRAIN Initiative](https://braininitiative.nih.gov/) (BICCN and BICAN).

If your organization also uses this pipeline, we would like to list you! Please reach out to us by contacting the [WARP Pipeline Development team](mailto:[email protected]).

## Feedback

Please help us make our tools better by contacting the [WARP Pipelines Team](mailto:[email protected]) for pipeline-related suggestions or questions.
4 changes: 4 additions & 0 deletions website/docs/Pipelines/BuildIndices_Pipeline/_category_.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
{
"label": "BuildIndices",
"position": 2
}
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
{
"label": "CEMBA",
"position": 2
"position": 3
}
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
{
"label": "Exome Germline Single Sample",
"position": 3
"position": 4
}
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
{
"label": "GDC Whole Genome Somatic Single Sample",
"position": 4
"position": 5
}
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
{
"label": "Illumina Genotyping Array",
"position": 5
"position": 6
}
2 changes: 1 addition & 1 deletion website/docs/Pipelines/Imputation_Pipeline/_category_.json
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
{
"label": "Imputation",
"position": 6
"position": 7
}
2 changes: 1 addition & 1 deletion website/docs/Pipelines/Multiome_Pipeline/_category_.json
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
{
"label": "Multiome scATAC and GEX",
"position": 7
"position": 8
}
2 changes: 1 addition & 1 deletion website/docs/Pipelines/Optimus_Pipeline/_category_.json
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
{
"label": "Optimus",
"position": 8
"position": 9
}
5 changes: 2 additions & 3 deletions website/docs/Pipelines/RNA_with_UMIs_Pipeline/README.md
Original file line number Diff line number Diff line change
@@ -7,7 +7,7 @@ slug: /Pipelines/RNA_with_UMIs_Pipeline/README

| Pipeline Version | Date Updated | Documentation Authors | Questions or Feedback |
| :----: | :---: | :----: | :--------------: |
| [RNAWithUMIsPipeline_v1.0.6](https://github.com/broadinstitute/warp/releases?q=RNAwithUMIs&expanded=true) | April, 2022 | [Elizabeth Kiernan](mailto:[email protected]) & [Kaylee Mathews](mailto:[email protected])| Please file GitHub issues in warp or contact [the WARP team](mailto:[email protected]) |
| [RNAWithUMIsPipeline_v1.0.15](https://github.com/broadinstitute/warp/releases?q=RNAwithUMIs&expanded=true) | December, 2023 | [Elizabeth Kiernan](mailto:[email protected]) & [Kaylee Mathews](mailto:[email protected])| Please file GitHub issues in warp or contact [the WARP team](mailto:[email protected]) |

![RNAWithUMIs_diagram](rna-with-umis_diagram.png)

@@ -235,8 +235,7 @@ Workflow outputs are described in the table below.
| Output variable name | Description | Type |
| ------ | ------ | ------ |
| sample_name | Sample name extracted from the input unmapped BAM file header. | String |
| transcriptome_bam | Duplicate-marked BAM file containing alignments from STAR translated into transcriptome coordinates. | BAM |
| transcriptome_bam_index | Index file for the transcriptome_bam output. | BAM Index |
| transcriptome_bam | Duplicate-marked BAM file containing alignments from STAR translated into transcriptome coordinates and postprocessed for RSEM. | BAM |
| transcriptome_duplicate_metrics | File containing duplication metrics. | TXT |
| output_bam | Duplicate-marked BAM file containing alignments from STAR translated into genome coordinates. | BAM |
| output_bam_index | Index file for the output_bam output. | BAM Index |
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
{
"label": "RNA with UMIs",
"position": 9
"position": 10
}