[MycoSNP-WDL] Update README.md to delineate workflow I/O and usage #6

Open
wants to merge 27 commits into
base: main
480b16a
initialize README.md update with detailed workflow inputs and outputs…
Jan 17, 2025
6a8809c
Update README.md to reflect changes in WDL workflows and inputs for M…
Jan 23, 2025
a29cea5
Update README.md title to MycoSNP-WDL Workflow Series
Jan 23, 2025
5bc57e9
remove explicit Terra mention
Jan 23, 2025
c40b02c
change out of searchable table
Jan 31, 2025
5cf3b46
update table I/O to correspond with PR 7
Jan 31, 2025
035a6fd
formatting
Jan 31, 2025
f43ae3f
add internal links
Jan 31, 2025
4b3298a
include blurbs about workflows
Jan 31, 2025
16ac560
expand inputs and explicitly delineate that variant calling is an ini…
Jan 31, 2025
0f146f9
include reference clades
Jan 31, 2025
2f073a3
delineate directory structure appropriately
Jan 31, 2025
827bc80
add back the searchable table
Jan 31, 2025
c2f2a4b
update mycosnp_tree tables to correspond with terra
Jan 31, 2025
63f88f0
update mycosnp_variants tables to correspond to Terra i/o
Jan 31, 2025
3ea8790
change release to v1.5
Jan 31, 2025
9daffec
update function
Jan 31, 2025
54b17db
update input notes
Jan 31, 2025
885c534
test new table inputs
Jan 31, 2025
70535d7
update input delineation in tables
Jan 31, 2025
c760aa0
formatting
Jan 31, 2025
c069470
expand on reference info
Jan 31, 2025
67ad891
capitalize fasta
Jan 31, 2025
b3f5128
conform to PHB formatting
Jan 31, 2025
80b4b12
add note on genome requirements for mycosnp_tree in README
Feb 4, 2025
ec553ca
incorporate Fraser's proposed changes for higher quality I/O delineation
Feb 6, 2025
d5f9776
doesnt fail anymore
Feb 6, 2025
171 changes: 169 additions & 2 deletions README.md
@@ -1,2 +1,169 @@
# mycosnp-wdl
A WDL wrapper of [CDCGov/mycosnp-nf for](https://github.com/CDCgov/mycosnp-nf) Terra.bio
# MycoSNP-WDL Workflow Series

## Quick Facts

| **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** |
|---|---|---|---|---|
| mycosnp_variants | Fungi | v1.5 | Yes | Sample-level |
| mycosnp_tree | Fungi | v1.5 | Yes | Set-level |


## MycoSNP-WDL
WDL wrappers of [CDCGov/mycosnp-nf](https://github.com/CDCgov/mycosnp-nf) designed for [Terra.bio](https://terra.bio) integration. These workflows conduct *Candidozyma (Candida) auris* [variant calling](#wf_mycosnp_variantswdl) and subsequent single nucleotide polymorphism (SNP) [phylogenetic tree reconstruction](#wf_mycosnp_treewdl).

<br/>

### wf_mycosnp_variants.wdl
`mycosnp_variants` calls variants from input reads against the *C. auris* B11205 assembly (accession [GCA_016772135](https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_016772135/)) by default. Users can optionally supply a different reference: a prebuilt *C. auris* clade [data directory](https://github.com/theiagen/mycosnp-wdl/tree/main/data/reference), a FASTA file, or a gzipped tar archive of a reference directory, as described below.

Note that `mycosnp_tree` requires at least four genomes whose variants were called against the same reference in `mycosnp_variants`.

#### Inputs

- **reference** optionally takes the name of a presupplied reference clade directory, listed [here](https://github.com/theiagen/mycosnp-wdl/tree/main/data/reference). The default is `GCA_016772135`.
- **ref_fasta** optionally takes a reference FASTA (requires suffix `.fa`) that is indexed via BWA to generate a reference directory.
- **ref_tar** optionally takes a gzipped tar archive (`.tar.gz`) with the same directory structure as the provided reference clades:

```
data/reference
├── B11221                        # Prebuilt clade directory
├── Clade1
│   ├── bwa
│   │   └── bwa                   # BWA index for alignment
│   │       ├── reference.amb
│   │       ├── reference.ann
│   │       ├── reference.bwt
│   │       ├── reference.pac
│   │       └── reference.sa
│   ├── dict
│   │   └── reference.dict        # Picard dictionary
│   ├── fai
│   │   └── reference.fa.fai      # FASTA index file
│   ├── masked
│   │   └── reference.fa          # Masked reference sequence
│   └── Clade1.fasta
├── Clade2
├── Clade3
├── Clade4
├── Clade5
└── GCA_016772135                 # Default reference
```

Review comment: recommend,

```
data/reference
├── B11221                    # Prebuilt clade directory
├── Clade1
│   ├── bwa                    # BWA index for alignment
│   ├── dict                   # Picard dictionary
│   ├── fai                    # FASTA index file
│   ├── masked                 # Masked reference sequence
│   └── Clade1.fasta           # Main reference FASTA
├── Clade2
├── Clade3
├── Clade4
├── Clade5
└── GCA_016772135              # Default reference
```

- **strain** optionally delineates the strain name for VCF gene name annotation. MycoSNP currently only annotates with respect to the default strain, "B11205", so changing this option will simply bypass VCF annotation.
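As an illustration of preparing a `ref_tar` input, the following is a minimal Python sketch that checks a custom reference directory for the four subdirectories shown in the layout above and packs it into a `.tar.gz`. The function names and the directory name `MyClade` are hypothetical, and placing the clade directory at the archive root is an assumption; the workflow may expect a different root, so verify against the WDL before relying on it.

```python
import tarfile
from pathlib import Path

# Subdirectories that appear under each clade in the layout above.
REQUIRED_SUBDIRS = ("bwa", "dict", "fai", "masked")


def missing_subdirs(ref_dir):
    """Return the required subdirectories that are absent from ref_dir."""
    ref = Path(ref_dir)
    return [name for name in REQUIRED_SUBDIRS if not (ref / name).is_dir()]


def pack_reference(ref_dir, out_path):
    """Write ref_dir into a gzipped tar archive and return the member names."""
    ref = Path(ref_dir)
    # Assumption: the clade directory itself (e.g. MyClade/) is the archive root.
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(ref, arcname=ref.name)
    with tarfile.open(out_path, "r:gz") as tar:
        return tar.getnames()
```

Calling `missing_subdirs("MyClade")` before `pack_reference("MyClade", "MyClade.tar.gz")` catches an incomplete directory before the archive is uploaded.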


<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| mycosnp_variants | **read1** | File | Illumina forward read file in FASTQ format (compression optional) | | Required |
| mycosnp_variants | **read2** | File | Illumina reverse read file in FASTQ format (compression optional) | | Required |
| mycosnp_variants | **samplename** | String | Name of sample to be analyzed | | Required |
| mycosnp | **coverage** | Int | Coverage is used to calculate a down-sampling rate that results in the specified coverage. For example, if coverage is 70, then FASTQ files are down-sampled such that, when aligned to the reference, the result is approximately 70x coverage | 0 | Optional |
| mycosnp | **cpu** | Int | Number of CPUs to allocate to the task | 8 | Optional |
| mycosnp | **debug** | Boolean | If true, keeps `.nextflow/` and `work/` directories | false | Optional |
| mycosnp | **disk_size** | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| mycosnp | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/mycosnp:1.5" | Optional |
| mycosnp | **memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 64 | Optional |
| mycosnp | **min_depth** | Int | Min depth for a base to be called as the consensus sequence, otherwise it will be called as an N; set to 0 to disable | 10 | Optional |
| mycosnp | **reference** | String | Reference clade | "GCA_016772135" | Optional |
| mycosnp | **sample_ploidy** | Int | Ploidy of sample (GATK) | 1 | Optional |
| mycosnp | **strain** | String | Reference strain | "B11205" | Optional |
| mycosnp_variants | **ref_fasta** | File | Reference FASTA file | | Optional |
| mycosnp_variants | **ref_tar** | File | Gzipped tar archive (`.tar.gz`) of a reference directory | | Optional |
| version_capture | **timezone** | String | Alternative timezone | | Optional |

</div>
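For command-line runs, the inputs in the table above can be collected into an inputs JSON. The sketch below is one hedged way to assemble it in Python. It assumes Cromwell-style `workflow.input` keys, that the workflow is named `mycosnp_variants`, and that task-level inputs are addressed through the `mycosnp` call name from the "Terra Task Name" column. It also assumes that only one of `reference`, `ref_fasta`, and `ref_tar` should be set per run. The sample and file names are placeholders.

```python
import json

WORKFLOW = "mycosnp_variants"  # assumed workflow name used as the key prefix


def build_variants_inputs(samplename, read1, read2,
                          reference=None, ref_fasta=None, ref_tar=None):
    """Return a Cromwell-style inputs dict for one sample."""
    sources = {"reference": reference, "ref_fasta": ref_fasta, "ref_tar": ref_tar}
    chosen = {name: value for name, value in sources.items() if value is not None}
    # Assumption: the three reference sources are mutually exclusive.
    if len(chosen) > 1:
        raise ValueError(f"set only one reference source, got: {sorted(chosen)}")

    inputs = {
        f"{WORKFLOW}.samplename": samplename,
        f"{WORKFLOW}.read1": read1,
        f"{WORKFLOW}.read2": read2,
    }
    for name, value in chosen.items():
        # `reference` is listed under the `mycosnp` task; the others under the workflow.
        prefix = f"{WORKFLOW}.mycosnp" if name == "reference" else WORKFLOW
        inputs[f"{prefix}.{name}"] = value
    return inputs


if __name__ == "__main__":
    example = build_variants_inputs(
        "sample01", "sample01_R1.fastq.gz", "sample01_R2.fastq.gz",
        reference="Clade1",
    )
    print(json.dumps(example, indent=2))
```

Omitting all three reference arguments leaves the workflow on its default reference, `GCA_016772135`.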

#### Outputs

<div class="searchable-table" markdown="1">

| **Variable** | **Type** | **Description** |
|---|---|---|
| analysis_date | String | Date of the analysis |
| assembly_size | Int | Size of the assembly |
| average_q_score_after_trimming | Float | Average quality score after trimming |
| average_q_score_before_trimming | Float | Average quality score before trimming |
| consensus_n_variant_min_depth | Int | Minimum depth for consensus N variant |
| full_results | File | Full results file |
| gc_after_trimming | Float | GC content after trimming |
| gc_before_trimming | Float | GC content before trimming |
| mean_coverage_depth | Float | Mean coverage depth |
| multiqc | File | MultiQC report |
| myco_bam | File | BAM file |
| myco_bam_bai | File | BAM index file |
| mycosnp_docker | String | Docker image used for MycoSNP |
| mycosnp_variants_analysis_date | String | Date of the MycoSNP variants analysis |
| mycosnp_variants_version | String | Version of the MycoSNP variants |
| mycosnp_version | String | Version of MycoSNP |
| number_n | Int | Number of N bases |
| paired_reads_after_trimming | Int | Number of paired reads after trimming |
| paired_reads_after_trimming_percent | String | Percentage of paired reads after trimming |
| percent_reference_coverage | Float | Percentage of reference coverage |
| reads_after_trimming | Int | Number of reads after trimming |
| reads_after_trimming_percent | String | Percentage of reads after trimming |
| reads_before_trimming | Int | Number of reads before trimming |
| reads_mapped | Int | Number of reads mapped |
| reference_length_coverage_after_trimming | Float | Reference length coverage after trimming |
| reference_length_coverage_before_trimming | Float | Reference length coverage before trimming |
| reference_name | String | Name of the reference genome used |
| reference_strain | String | Reference strain used |
| unpaired_reads_after_trimming | Int | Number of unpaired reads after trimming |
| unpaired_reads_after_trimming_percent | String | Percentage of unpaired reads after trimming |
| vcf | File | Compressed variant call format (VCF) file depicting SNPs |
| vcf_index | File | Compressed index file for the VCF |

</div>

<br/>

### wf_mycosnp_tree.wdl
`mycosnp_tree` reconstructs an IQ-TREE SNP phylogenetic tree that incorporates representative genomes of Clade1-Clade5 *C. auris*. VCF data generated from [wf_mycosnp_variants.wdl](#wf_mycosnp_variantswdl) are used as inputs.


Review comment: The tree will fail with fewer than 4 samples, so I think we should add this in. IQ-TREE won't run if fewer than 4 samples are in the file, as I saw in the log output.

NOTE: At least four samples, including reference, are required

#### Inputs

- **reference** optionally takes a presupplied reference clade directory delineated [here](https://github.com/theiagen/mycosnp-wdl/tree/main/data/reference).
- **ref_fasta** optionally takes a reference FASTA (requires suffix `.fa`) that is indexed via BWA to generate a reference directory.
- **strain** is passed to output but does not change workflow function.

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| mycosnp_tree | **vcf** | Array[File] | VCF files (.vcf.gz) containing SNP data for phylogenetic analysis. These files can be generated from `wf_mycosnp_variants.wdl` | | Required |
| mycosnp_tree | **vcf_index** | Array[File] | Index files for the VCF files | | Required |
| mycosnp_tree | **ref_fasta** | File | Reference FASTA input | | Optional |
| mycosnptree | **cpu** | Int | Number of CPUs to allocate to the task | 8 | Optional |
| mycosnptree | **disk_size** | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| mycosnptree | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/mycosnp:1.5" | Optional |
| mycosnptree | **memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 64 | Optional |
| mycosnptree | **reference** | String | Preexisting [reference directory](https://github.com/theiagen/mycosnp-wdl/tree/main/data/reference) | "GCA_016772135" | Optional |
| mycosnptree | **strain** | String | mycosnp-nf reference strain name | "B11205" | Optional |
| version_capture | **timezone** | String | Alternative timezone | | Optional |

</div>
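Because the tree step fails with too few samples, it can help to validate the VCF arrays before submitting a run. The Python sketch below is a hedged illustration. It assumes Cromwell-style input keys with a workflow named `mycosnp_tree` and a task call named `mycosnptree`, as in the table above. It conservatively requires four input VCFs, although the note above may also count the reference toward the minimum. File names are placeholders.

```python
import json

MIN_SAMPLES = 4  # from the note above; assumed here to count input VCFs only


def build_tree_inputs(vcfs, vcf_indexes, reference=None):
    """Validate the VCF arrays and return a Cromwell-style inputs dict."""
    if len(vcfs) != len(vcf_indexes):
        raise ValueError("each VCF needs exactly one index file")
    if len(vcfs) < MIN_SAMPLES:
        raise ValueError(f"need at least {MIN_SAMPLES} VCFs, got {len(vcfs)}")

    inputs = {
        "mycosnp_tree.vcf": list(vcfs),
        "mycosnp_tree.vcf_index": list(vcf_indexes),
    }
    if reference is not None:
        # Assumed key path: `reference` is listed under the `mycosnptree` task.
        inputs["mycosnp_tree.mycosnptree.reference"] = reference
    return inputs


if __name__ == "__main__":
    vcfs = [f"sample{i}.vcf.gz" for i in range(1, 5)]
    indexes = [f"{path}.tbi" for path in vcfs]
    print(json.dumps(build_tree_inputs(vcfs, indexes, reference="Clade1"), indent=2))
```

All VCFs passed in should have been called against the same reference in `mycosnp_variants`; this sketch does not check that.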

#### Outputs

<div class="searchable-table" markdown="1">

| **Variable** | **Type** | **Description** |
|---|---|---|
| mycosnp_alignment | File | Concatenated SNP alignment file |
| mycosnp_docker | String | Docker image used for MycoSNP |
| mycosnp_fastree_tree | File | Phylogenetic tree inferred using FastTree (heuristic maximum likelihood) |
| mycosnp_iqtree_tree | File | Phylogenetic tree inferred using IQ-TREE (high quality maximum likelihood) |
| mycosnp_rapidnj_tree | File | Phylogenetic tree inferred using RapidNJ (neighbor-joining method) |
| mycosnp_tree_analysis_date | String | Date of the analysis |
| mycosnp_tree_full_results | File | Full results file |
| mycosnp_tree_vcf_csv | File | SNP variants formatted as a CSV table |
| mycosnp_tree_version | String | Version of the `mycosnp_tree` WDL workflow |
| mycosnp_version | String | Version of MycoSNP |
| mycosnptree_snpdists | File | SNP distances file |
| reference_name | String | Name of the reference |
| reference_strain | String | Reference strain used |

</div>