Update CombineBatches workflow #732
Open
mwalker174 wants to merge 8 commits into main from mw_gatk_combine_batches
mwalker174 force-pushed the mw_gatk_combine_batches branch from b5de9f0 to 3a170b6 (November 1, 2024 15:40)
mwalker174 force-pushed the mw_gatk_combine_batches branch from 3a170b6 to 6a6d00b (November 18, 2024 16:47)
mwalker174 force-pushed the mw_gatk_combine_batches branch from 2a39fcb to 9b729de (December 2, 2024 16:05)
Add reformatting to GenotypeBatch
- Expose reformat_script
- Start ripping stuff out
- Finish rewriting wdl and template
- Add TODO and delete unused task
- Don't assign genotypes to CNVs in add_genotypes.py
- Set mixed_breakend_window to 1mb
- Fixes
- Start implementing ContextAwareClustering
- Fix wdl
- Update wdl and gatk docker
- Comment
- Set context overlap parameters
- Update docker
- Fix size in ReformatGenotypedVcf
- Update to reformat_genotyped_vcf.py
- Update legacy reformatter, fix GenotypeBatchMetrics wdl
- Remove reformat step from GenotypeBatch
- Use RD_CN if CN is unavailable for mCNVs in svtk vcf2bed
- Add additional_args to svcluster and groupedsvcluster
- Add join step
- Update runtime attr
- Filter legacy records with invalid coords (needs testing)
- Fix record dropping; add --fix-end to wdl call
- Representative breakpoint summary strategy
- Update gatk docker
- Integerate SR flags into VCF
- Update dockers
- Parse last column in SR flags lists
- Gatk to svtk formatting
- Fix CNV strands and overlap breakpoint filter bothside pass file parsing
- Breakpoint overlap filter now sorts by BOTHSIDES_SUPPORT status rather than fraction
- Set empty FILTER statuses to PASS
- Use safer get() methods instead of brackets for accessing FORMAT fields
- Delete unused VcfClusterSingleChromsome.wdl
- Remove other unused wdls
- Do not require CN or RD_CN to be defined for all samples for CNVs in get_called_samples()
- Fix multi-allelic ALTs and genotype parsing
- Fix multi-allelic formatting in cleanvcf5
- Clean vcf 5 script override
- Add SR1POS and SR2POS to gatk format to recover INS END coordinate
- Reset dockers to main
- Fix mCNV alts again
- Update gatk docker context to track
- Integrate into top-level WDLs and update json templates
- Update terra config
- Fix MakeCohortVcf inputs
- Remove MakeCohortVcf json templates
- Remove duplicate runtime_override_clean_background_fail input in GATKSVPipelineSingleSample
- Update yaml for testing
- Remove duplicate runtime_override_breakpoint_overlap_filter input
mwalker174 force-pushed the mw_gatk_combine_batches branch from 54a0f97 to e3adbd0 (December 3, 2024 16:08)
Replaces most methods in the CombineBatches workflow with a greatly simplified set of tasks that utilize GATK `SVCluster` and the new `GroupedSVCluster` tool (see PR). `SVCluster` replaces most of the current functionality including VCF joining and clustering, while `GroupedSVCluster` introduces refined clustering (a.k.a. "reclustering") that has become a best practice for larger call sets.
Clustering refinement is critical for consolidating redundant variants in repetitive sequence contexts such as simple repeats and segmental duplications. This also addresses an issue with duplicate insertions that share coordinates but have slightly different split read signatures (i.e. different END positions).
Genotype merging was improved slightly in this PR as well.
In addition, this PR makes some minor improvements to VCF formatting and parsing:
- In `GenotypeBatch`, CNVs are now formatted the same way as in `CleanVcf`, i.e. no genotypes, a `<CNV>` ALT allele, and `SVTYPE=CNV`, rather than ALT alleles `<CN0>,<CN1>,…` with `SVTYPE=DUP`. In case a user needs to run on a VCF with the old format, there is a `legacy_vcfs` flag in `CombineBatches` that will update to the new format prior to processing.
- In `CombineBatches`, records are annotated with `HIGH_SR_BACKGROUND` and `BOTHSIDE_PASS` INFO field flags rather than passing around separate lists, which is cumbersome.
- Use `.get()` for accessing FORMAT fields rather than brackets. This was required in some cases because GATK omits a FORMAT field if it is null for all samples in a given record. Pysam then throws an error since the requested key does not exist, whereas `.get()` returns `None`.
- […] `CombineBatches` to minimize risk of bugs in downstream workflows.
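To make the CNV formatting change concrete, here is a minimal sketch of the legacy-to-new conversion on a single tab-delimited VCF data line. It is not the code in this PR: the function names `is_legacy_cnv_alt` and `legacy_cnv_to_new` are hypothetical, the example record is made up, and writing "no genotypes" as `./.` is an assumption about how null genotypes are represented.

```python
def is_legacy_cnv_alt(alt: str) -> bool:
    """True for legacy copy-number ALT alleles such as <CN0>, <CN1>, ..."""
    return alt.startswith("<CN") and alt.endswith(">") and alt[3:-1].isdigit()


def legacy_cnv_to_new(line: str) -> str:
    """Rewrite one legacy multi-allelic CNV record to the new format.

    Hypothetical helper: ALT becomes <CNV>, SVTYPE becomes CNV, and GT is
    nulled out. Records whose ALTs are not all <CN#> are returned unchanged.
    """
    fields = line.rstrip("\n").split("\t")
    alts = fields[4].split(",")
    if not all(is_legacy_cnv_alt(a) for a in alts):
        return "\t".join(fields)

    fields[4] = "<CNV>"
    # Replace only the SVTYPE key; leave all other INFO entries untouched.
    fields[7] = ";".join(
        "SVTYPE=CNV" if kv.startswith("SVTYPE=") else kv
        for kv in fields[7].split(";")
    )
    # Null out GT for every sample (assumed representation: "./.").
    if len(fields) > 9:
        fmt_keys = fields[8].split(":")
        if "GT" in fmt_keys:
            gt_idx = fmt_keys.index("GT")
            for i in range(9, len(fields)):
                values = fields[i].split(":")
                if gt_idx < len(values):
                    values[gt_idx] = "./."
                fields[i] = ":".join(values)
    return "\t".join(fields)


legacy = (
    "chr1\t1000\tcnv_1\tN\t<CN0>,<CN1>,<CN2>,<CN3>\t.\tPASS\t"
    "END=5000;SVTYPE=DUP;SVLEN=4000\tGT:CN\t0/1:1\t0/0:2"
)
converted = legacy_cnv_to_new(legacy)
print(converted)
```

Copy-number information stays in the `CN` FORMAT field, so nothing is lost by collapsing the ALT alleles to `<CNV>`.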
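The difference between bracket access and `.get()` for FORMAT fields can be illustrated with plain dictionaries. This is only a sketch: the dicts stand in for the per-sample FORMAT mappings that pysam exposes, and the sample names, values, and helper functions are hypothetical.

```python
# Stand-in for per-sample FORMAT data on one record. The CN field is absent
# for every sample, mimicking a record where GATK omitted a FORMAT field
# because it was null for all samples.
samples = {
    "sample_a": {"GT": (0, 1), "GQ": 99},
    "sample_b": {"GT": (0, 0), "GQ": 80},
}


def copy_number_strict(sample_format):
    # Bracket access fails when the key is missing from the record.
    return sample_format["CN"]


def copy_number_safe(sample_format):
    # .get() returns None for a missing key instead of raising.
    return sample_format.get("CN")


for name, fmt in samples.items():
    try:
        cn = copy_number_strict(fmt)
    except KeyError:
        cn = "error: CN not defined for this record"
    print(name, "strict:", cn, "| safe:", copy_number_safe(fmt))
```

Callers of the safe variant still need to handle `None` explicitly, for example by skipping the sample or falling back to another field.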