Update CombineBatches workflow #732

Open
wants to merge 8 commits into
base: main

Collaborator

@mwalker174 mwalker174 commented Oct 3, 2024

Replaces most methods in the CombineBatches workflow with a greatly simplified set of tasks that utilize GATK SVCluster and the new GroupedSVCluster tool (see PR). SVCluster replaces most of the current functionality including VCF joining and clustering, while GroupedSVCluster introduces refined clustering (a.k.a. "reclustering") that has become a best practice for larger call sets.

Clustering refinement is critical for consolidating redundant variants in repetitive sequence contexts such as simple repeats and segmental duplications. This also addresses an issue with duplicate insertions that share coordinates but have slightly different split read signatures (i.e. different END positions).

Genotype merging was improved slightly in this PR as well.

In addition, this PR makes some minor improvements to VCF formatting and parsing:

  • In GenotypeBatch, CNVs are now formatted the same way as in CleanVcf: no genotypes, a <CNV> ALT allele, and SVTYPE=CNV, rather than ALT alleles <CN0>,<CN1>,… with SVTYPE=DUP. For VCFs in the old format, the legacy_vcfs flag in CombineBatches converts records to the new format prior to processing.
  • Within CombineBatches, records are annotated with HIGH_SR_BACKGROUND and BOTHSIDE_PASS INFO flags rather than being tracked in separate lists, which was cumbersome to pass around.
  • Minor improvements to some downstream scripts, which now use .get() rather than brackets to access FORMAT fields. This was required in some cases because GATK omits a FORMAT field if it is null for all samples in a given record. With bracket access, pysam then raises an error because the requested key does not exist, whereas .get() returns None.
  • The VCF is converted back to the "old" format at the end of CombineBatches to minimize risk of bugs in downstream workflows.
  • Minor change to the breakpoint overlap filter: variants are prioritized on BOTHSIDE_PASS status (binary) rather than fraction of supporting batches.
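
For reviewers unfamiliar with the .get() change, here is a minimal sketch of the difference. It uses a plain dict as a stand-in for a pysam per-sample FORMAT mapping (an assumption made so the snippet runs without pysam or a VCF file), and the helper name read_format_field is hypothetical, not something in this PR.

```python
# Stand-in for one sample's FORMAT fields in a record where the writer
# omitted RD_CN because it was null for every sample in that record.
# A plain dict is used here as an assumption; it is not a pysam object.
sample_fields = {"GT": (0, 1), "GQ": 99}


def read_format_field(fields, key):
    """Return the FORMAT value for key, or None if the field is absent."""
    return fields.get(key)


# .get() tolerates the missing field and returns None.
rd_cn = read_format_field(sample_fields, "RD_CN")
gq = read_format_field(sample_fields, "GQ")

# Bracket access on the same missing key raises KeyError instead.
try:
    sample_fields["RD_CN"]
    bracket_raised = False
except KeyError:
    bracket_raised = True
```

Callers then need to handle None explicitly (for example, by skipping the sample) rather than relying on an exception.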

@mwalker174 mwalker174 force-pushed the mw_gatk_combine_batches branch from b5de9f0 to 3a170b6 Compare November 1, 2024 15:40
@mwalker174 mwalker174 force-pushed the mw_gatk_combine_batches branch from 3a170b6 to 6a6d00b Compare November 18, 2024 16:47
@mwalker174 mwalker174 force-pushed the mw_gatk_combine_batches branch from 2a39fcb to 9b729de Compare December 2, 2024 16:05
Add reformatting to GenotypeBatch

Expose reformat_script

Start ripping stuff out

Finish rewriting wdl and template

Add TODO and delete unused task

Don't assign genotypes to CNVs in add_genotypes.py

Set mixed_breakend_window to 1mb

Fixes

Start implementing ContextAwareClustering

Fix wdl

Update wdl and gatk docker

Comment

Set context overlap parameters

Update docker

Fix size in ReformatGenotypedVcf

Update to reformat_genotyped_vcf.py

Update legacy reformatter, fix GenotypeBatchMetrics wdl

Remove reformat step from GenotypeBatch

Use RD_CN if CN is unavailable for mCNVs in svtk vcf2bed

Add additional_args to svcluster and groupedsvcluster

Add join step

Update runtime attr

Filter legacy records with invalid coords (needs testing)

Fix record dropping; add --fix-end to wdl call

Representative breakpoint summary strategy

Update gatk docker

Integrate SR flags into VCF

Update dockers

Parse last column in SR flags lists

Gatk to svtk formatting

Fix CNV strands and overlap breakpoint filter bothside pass file parsing

Breakpoint overlap filter now sorts by BOTHSIDES_SUPPORT status rather than fraction

Set empty FILTER statuses to PASS

Use safer get() methods instead of brackets for accessing FORMAT fields

Delete unused VcfClusterSingleChromsome.wdl

Remove other unused wdls

Do not require CN or RD_CN to be defined for all samples for CNVs in get_called_samples()

Fix multi-allelic ALTs and genotype parsing

Fix multi-allelic formatting in cleanvcf5

Clean vcf 5 script override

Add SR1POS and SR2POS to gatk format to recover INS END coordinate

Reset dockers to main

Fix mCNV alts again

Update gatk docker

context to track

Integrate into top-level WDLs and update json templates

Update terra config

Fix MakeCohortVcf inputs

Remove MakeCohortVcf json templates

Remove duplicate runtime_override_clean_background_fail input in GATKSVPipelineSingleSample

Update yaml for testing

Remove duplicate runtime_override_breakpoint_overlap_filter input
@mwalker174 mwalker174 force-pushed the mw_gatk_combine_batches branch from 54a0f97 to e3adbd0 Compare December 3, 2024 16:08
@mwalker174 mwalker174 marked this pull request as ready for review December 9, 2024 20:54
1 participant