Extend functionality for outlier sample exclusion workflow #496
base: main
Thanks @RCollins13! Could you please undo the changes to the `dockers.json` file? Docker images will be rebuilt as needed, and the new images will be added to `dockers.json` automatically as part of the CI/CD when this PR is merged.
bba805f to cdca408
Thanks for pointing that out, @VJalili! This made me realize I had rebased versus the old |
Thanks, @RCollins13, that looks great. |
@RCollins13 if you rebase your branch on the latest main, some errors in the |
This PR extends the existing outlier sample exclusion workflow, `wdl/FilterOutlierSamples.wdl`, in several respects:

- Accept one or more VCFs. This is necessary for large cohorts (e.g., gnomAD) where each chromosome is stored as a separate VCF for improved parallelization in the cloud, but we need to define outliers based on the sum of variants across all VCFs.
- Add an optional VCF preprocessing step (with bcftools) prior to collecting sample counts. This is necessary in situations where we want to restrict to certain subsets of variants for defining outliers. Two empirical use cases from gnomAD v3 include: (a) restricting to rare (`AF` < 1%), non-singleton (`AC` > 1) deletions between 300 bp and 1 kb to deal with the artifact deletion bump we sometimes see in various callsets, and (b) restricting to `PASS`-only variants for defining our final set of samples at the very end of all QC & post-processing.
- Allow outlier samples to be defined on one or more independent subsets of samples within the same VCF. This is necessary when cohorts contain a mixture of samples with different properties (e.g., PCR+ vs. PCR-) and we want to fit their SV count distributions separately when defining outliers. These are optional inputs to the workflow; if no values of `sample_subset_prefixes` and `sample_subset_lists` are provided, then outliers will be defined for all samples in the VCF together.
- Enable (optional) plotting of outlier distributions from within the workflow. I understand that there is a separate workflow for plotting outlier distributions (`PlotSVCountsPerSample.wdl`) and the intention is for users to optionally run that workflow after collecting per-sample counts, but from a convenience perspective I thought it would be useful to have plotting as an option of the main outlier workflow. If there is a design reason why this is undesirable, we can remove the plotting, but I know I have enjoyed having this convenience added for gnomAD.

I have tested the above changes on gnomAD v3 (24 VCFs & ~120k samples) and can confirm that they work as expected. I have not tested a cohort with a single VCF, but I believe it should work as long as the VCF is passed as an array of one element.
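
For reviewers who want a concrete picture of the multi-VCF and per-subset logic described above, here is a minimal Python sketch. It is not the code in the WDL. The function names `sum_counts` and `flag_outliers`, the `n_iqr` parameter, and the interquartile-range cutoff itself are assumptions made for illustration; the description above does not state which outlier rule is applied.

```python
from collections import defaultdict
from statistics import quantiles


def sum_counts(per_vcf_counts):
    """Sum per-sample variant counts across several VCFs (e.g., one per chromosome).

    per_vcf_counts: iterable of {sample_id: count} dicts, one dict per VCF.
    """
    totals = defaultdict(int)
    for counts in per_vcf_counts:
        for sample, n in counts.items():
            totals[sample] += n
    return dict(totals)


def flag_outliers(totals, subsets=None, n_iqr=1.5):
    """Flag outlier samples independently within each subset.

    subsets: optional {prefix: [sample_id, ...]} mapping (hypothetical stand-in
    for the subset prefixes and lists). If omitted, all samples are evaluated
    together under the label "all".
    The IQR rule used here is an assumption, not taken from the workflow.
    """
    if not subsets:
        subsets = {"all": list(totals)}
    outliers = {}
    for prefix, samples in subsets.items():
        present = [s for s in samples if s in totals]
        if len(present) < 2:
            # Too few samples to estimate a distribution; flag nothing.
            outliers[prefix] = []
            continue
        q1, _, q3 = quantiles([totals[s] for s in present], n=4)
        margin = n_iqr * (q3 - q1)
        low, high = q1 - margin, q3 + margin
        outliers[prefix] = sorted(
            s for s in present if not low <= totals[s] <= high
        )
    return outliers
```

In this sketch, counts are summed over all VCFs before any cutoff is computed, so a sample is judged on its cohort-wide total and not per chromosome. Passing `subsets` fits each group (e.g., PCR+ and PCR-) on its own distribution; omitting it treats all samples together.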