[MycoSNP-WDL] Update README.md to delineate workflow I/O and usage #6

Open
wants to merge 27 commits into
base: main

@xonq (Member) commented Jan 17, 2025

This PR updates the main repository README.md to delineate workflow I/O and usage.

🗑️ This dev branch should be deleted after merging to main.

🧠 Summary

Delineate workflow I/O and usage in the README

⚡ Impacted Workflows/Tasks

None

This PR may lead to different results in pre-existing outputs: No

This PR uses an element that could cause duplicate runs to have different results: No

🛠️ Changes

Delineate workflow I/O and usage in README.md

⚙️ Algorithm

n/a

➡️ Inputs

n/a

⬅️ Outputs

n/a

🧪 Testing

n/a

Suggested Scenarios for Reviewer to Test

n/a

🔬 Final Developer Checklist

  • The workflow/task has been tested and results, including file contents, are as anticipated
  • The CI/CD has been adjusted and tests are passing (Theiagen developers)
  • Code changes follow the style guide
  • Documentation and/or workflow diagrams have been updated if applicable
    • You have updated the latest version for any affected workflows in the respective workflow documentation page and for every entry in the three workflows_overview tables.

🎯 Reviewer Checklist

  • All changed results have been confirmed
  • You have tested the PR appropriately (see the testing guide for more information)
  • All code adheres to the style guide
  • MD5 sums have been updated
  • The PR author has addressed all comments
  • The documentation has been updated

@xonq xonq marked this pull request as ready for review January 31, 2025 22:41
@xonq xonq requested a review from a team as a code owner January 31, 2025 22:41

@fraser-combe fraser-combe left a comment

Some ideas on wording that may add more context for the reader. Otherwise great documentation!

- **ref_tar** optionally takes a gzipped tarchive (`.tar.gz`) with the same directory structure as the provided reference clades:

```
data/reference
```

recommend,

```
data/reference
├── B11221                    # Prebuilt clade directory
├── Clade1
│   ├── bwa                    # BWA index for alignment
│   ├── dict                   # Picard dictionary
│   ├── fai                    # FASTA index file
│   ├── masked                 # Masked reference sequence
│   └── Clade1.fasta           # Main reference FASTA
├── Clade2
├── Clade3
├── Clade4
├── Clade5
└── GCA_016772135              # Default reference
```
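For readers preparing their own **ref_tar**, here is a minimal Python sketch of packaging a clade directory into a `.tar.gz`. It assumes the subdirectories in the annotated tree above (`bwa`, `dict`, `fai`, `masked`) are the expected layout and that the clade directory itself should be the archive root; neither is confirmed by the workflow code. `pack_reference` and `MyClade` are hypothetical names.

```python
import tarfile
import tempfile
from pathlib import Path

# Subdirectories shown in the annotated tree above. Treating them as the
# required layout is an assumption of this sketch, not a documented contract.
EXPECTED_SUBDIRS = ("bwa", "dict", "fai", "masked")


def pack_reference(clade_dir, out_path):
    """Write clade_dir into a gzipped tar archive and return its member names."""
    clade_dir = Path(clade_dir)
    missing = [d for d in EXPECTED_SUBDIRS if not (clade_dir / d).is_dir()]
    if missing:
        raise ValueError(f"{clade_dir} is missing subdirectories: {missing}")
    with tarfile.open(out_path, "w:gz") as tar:
        # arcname keeps the clade directory as the archive root instead of
        # embedding the full local path.
        tar.add(clade_dir, arcname=clade_dir.name)
    with tarfile.open(out_path, "r:gz") as tar:
        return sorted(tar.getnames())


# Demo on a throwaway layout; "MyClade" is a hypothetical clade name.
with tempfile.TemporaryDirectory() as tmp:
    clade = Path(tmp) / "MyClade"
    for sub in EXPECTED_SUBDIRS:
        (clade / sub).mkdir(parents=True)
    (clade / "MyClade.fasta").write_text(">contig1\nACGT\n")
    members = pack_reference(clade, Path(tmp) / "MyClade.tar.gz")
    print(members)
```

Listing the archive members after writing makes it easy to compare the result against one of the provided clade directories before uploading it.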


### wf_mycosnp_tree.wdl
`mycosnp_tree` reconstructs an IQ-TREE SNP phylogenetic tree that incorporates representative genomes of Clade1-Clade5 *C. auris*. VCF data generated from [wf_mycosnp_variants.wdl](#wf_mycosnp_variantswdl) are used as inputs.

The tree workflow will fail with fewer than 4 samples, so I think we should add this in. I saw in the log output that IQ-TREE won't run if fewer than 4 samples are in the file.
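A pre-flight check along these lines could catch the problem before the tree workflow is launched. This is only a sketch: the threshold of 4 comes from the observation in this comment rather than from IQ-TREE's documentation, and `check_tree_inputs` is a hypothetical helper, not part of the workflow.

```python
from pathlib import Path

# Threshold taken from the review comment above; it has not been checked
# against IQ-TREE's own documentation.
MIN_TREE_SAMPLES = 4


def check_tree_inputs(vcf_paths, minimum=MIN_TREE_SAMPLES):
    """Return the de-duplicated VCF paths, or raise if there are too few."""
    unique = sorted({str(Path(p)) for p in vcf_paths})
    if len(unique) < minimum:
        raise ValueError(
            f"mycosnp_tree needs at least {minimum} samples, got {len(unique)}"
        )
    return unique


print(check_tree_inputs(["a.vcf.gz", "b.vcf.gz", "c.vcf.gz", "d.vcf.gz"]))
```

De-duplicating first means that passing the same VCF twice does not mask a sample set that is too small.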

README.md Outdated

#### Inputs

- **reference** optionally takes a presupplied reference clade directory delineated [here](https://github.com/theiagen/mycosnp-wdl/tree/main/data/reference). Currently, this option will fail the workflow with "GCA_016772135" set as the reference - use "B11205" instead.

We can update this. I have successful runs using the default settings for variants and tree. Suggested wording:

- **reference** optionally takes a presupplied reference clade directory delineated [here](https://github.com/theiagen/mycosnp-wdl/tree/main/data/reference). The default reference GCA_016772135 is fully supported, but users may specify an alternative reference, such as B11205 or another clade-specific reference, if desired.

README.md Outdated
| mycosnp_variants | **samplename** | String | Name of sample to be analyzed | | Required |
| mycosnp | **coverage** | Int | Coverage is used to calculate a down-sampling rate that results in the specified coverage. For example, if coverage is 70, then FASTQ files are down-sampled such that, when aligned to the reference, the result is approximately 70x coverage | 0 | Optional |
| mycosnp | **cpu** | Int | Number of CPUs to allocate to the task | 8 | Optional |
| mycosnp | **debug** | Boolean | Keeps `.nextflow/` and `work/` directories | false | Optional |

If true, keeps .nextflow/ and work/ directories for debugging purposes.
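On the **coverage** row in the same excerpt: the description says a down-sampling rate is derived from the requested coverage, but it does not give the formula. The sketch below shows one plausible reading, where the rate is the target coverage divided by the current coverage (total read bases over reference length), capped at 1. The formula, the treatment of the default `0` as "no down-sampling", and the name `downsample_rate` are all assumptions, not taken from the workflow code.

```python
def downsample_rate(target_coverage, reference_length, total_read_bases):
    """Fraction of reads to keep so aligned depth is roughly target_coverage.

    Assumed formula: target coverage / current coverage, capped at 1.0.
    """
    if reference_length <= 0 or total_read_bases <= 0:
        raise ValueError("reference_length and total_read_bases must be positive")
    if target_coverage <= 0:
        # Assumption: the default of 0 means "do not down-sample".
        return 1.0
    current_coverage = total_read_bases / reference_length
    return min(1.0, target_coverage / current_coverage)


# Hypothetical numbers: a 12 Mb reference sequenced to 140x, target 70x.
print(downsample_rate(70, 12_000_000, 1_680_000_000))
```

With these hypothetical inputs the reads start at 140x, so keeping half of them approximates the 70x example given in the table.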

README.md Outdated
| reference_strain | String | Reference strain used |
| unpaired_reads_after_trimming | Int | Number of unpaired reads after trimming |
| unpaired_reads_after_trimming_percent | String | Percentage of unpaired reads after trimming |
| vcf | File | VCF file |

Final variant call format (VCF) file containing SNPs.

README.md Outdated
| reads_mapped | Int | Number of reads mapped |
| reference_length_coverage_after_trimming | Float | Reference length coverage after trimming |
| reference_length_coverage_before_trimming | Float | Reference length coverage before trimming |
| reference_name | String | Name of the reference |

Name of the reference genome used.

README.md Outdated

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| mycosnp_tree | **vcf** | Array[File] | VCF files for analysis | | Required |

Maybe add this above the inputs:

Compressed VCF files (`.vcf.gz`) containing SNP data for phylogenetic analysis. These files should be generated from `wf_mycosnp_variants.wdl`.
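If it helps users, a quick sanity check for this input could look like the sketch below. It relies on the file name ending in `.vcf.gz` and on the VCF convention that the first line starts with `##fileformat=VCF`; `looks_like_vcf_gz` is a hypothetical helper and says nothing about whether the file actually came from `wf_mycosnp_variants.wdl`.

```python
import gzip
from pathlib import Path


def looks_like_vcf_gz(path):
    """Return True if path is named *.vcf.gz and decompresses to a VCF header."""
    path = Path(path)
    if not path.name.endswith(".vcf.gz"):
        return False
    try:
        with gzip.open(path, "rt") as handle:
            first_line = handle.readline()
    except (OSError, EOFError, UnicodeDecodeError):
        # Not gzip-compressed, truncated, or not text once decompressed.
        return False
    return first_line.startswith("##fileformat=VCF")
```

Checking the decompressed header rather than only the extension catches plain-text VCFs that were renamed without being compressed.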

README.md Outdated

| **Variable** | **Type** | **Description** |
|---|---|---|
| mycosnp_alignment | File | Alignment file |

Concatenated SNP alignment used for tree inference.

README.md Outdated
| mycosnp_rapidnj_tree | File | RapidNJ tree file |
| mycosnp_tree_analysis_date | String | Date of the analysis |
| mycosnp_tree_full_results | File | Full results file |
| mycosnp_tree_vcf_csv | File | VCF to CSV file |

SNP variants formatted as a CSV table for external analysis

For the tree methods:

| mycosnp_fastree_tree | File | Phylogenetic tree inferred using FastTree. |
| mycosnp_iqtree_tree | File | Phylogenetic tree inferred using IQ-TREE (maximum likelihood method). |
| mycosnp_rapidnj_tree | File | Phylogenetic tree inferred using RapidNJ (neighbor-joining method). |

@xonq (Member, Author) commented

these changes have been incorporated into the most recent commit (ec553ca) - thank you for the suggestions
