Extend functionality for outlier sample exclusion workflow #496

Open: RCollins13 wants to merge 7 commits into main from rlc_outlier_exclusion_preprocessing

RCollins13
Contributor

This PR extends the existing outlier sample exclusion workflow, wdl/FilterOutlierSamples.wdl, in several respects:

  1. Accept one or more VCFs. This is necessary for large cohorts (e.g., gnomAD) where each chromosome is stored as a separate VCF for improved parallelization in the cloud, but outliers must be defined from each sample's total variant count summed across all VCFs.

  2. Add an optional VCF preprocessing step (with bcftools) prior to collecting sample counts. This is necessary when we want to define outliers on a specific subset of variants. Two empirical use cases from gnomAD v3 are: (a) restricting to rare (AF<1%), non-singleton (AC>1) deletions between 300 bp and 1 kb to deal with the artifactual deletion bump we sometimes see in various callsets, and (b) restricting to PASS-only variants when defining our final set of samples at the very end of all QC & post-processing.

  3. Allow outlier samples to be defined on one or more independent subsets of samples within the same VCF. This is necessary when cohorts contain a mixture of samples with different properties (e.g., PCR+ vs. PCR-) and we want to fit their SV count distributions separately when defining outliers. These are optional inputs to the workflow; if no values of sample_subset_prefixes and sample_subset_lists are provided, then outliers will be defined for all samples in the VCF together.

  4. Enable (optional) plotting of outlier distributions from within the workflow. I understand that there is a separate workflow for plotting outlier distributions (PlotSVCountsPerSample.wdl) and the intention is for users to optionally run that workflow after collecting per-sample counts, but from a convenience perspective I thought it would be useful to have plotting as an option of the main outlier workflow. If there is a design reason why this is undesirable, we can remove the plotting, but I know I have enjoyed having this convenience added for gnomAD.
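For readers who want a concrete picture of items 1 to 3, here is a minimal Python sketch (not the WDL in this PR). It builds, but does not run, a bcftools filter command, sums per-sample counts across per-chromosome VCFs, and flags outliers separately within each sample subset. The function names, the INFO field names in the filter expression, and the Tukey-style cutoff (quartiles plus or minus a multiple of the IQR) are all assumptions made for illustration; the workflow's actual outlier rule is not described in this thread.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical bcftools expression for use case (a). The INFO field names
# (SVTYPE, AF, AC, SVLEN) are assumptions about how the VCF is annotated.
RARE_SMALL_DEL = ('INFO/SVTYPE="DEL" && INFO/AF<0.01 && INFO/AC>1 '
                  '&& INFO/SVLEN>=300 && INFO/SVLEN<=1000')


def bcftools_filter_cmd(vcf_in, vcf_out, expr):
    """Build (but do not run) a bcftools command keeping records that match expr."""
    return ["bcftools", "view", "-i", expr, "-O", "z", "-o", vcf_out, vcf_in]


def sum_counts(per_vcf_counts):
    """Item 1: add up each sample's variant count across all VCFs."""
    totals = Counter()
    for counts in per_vcf_counts:
        totals.update(counts)  # Counter.update adds counts, it does not overwrite
    return totals


def find_outliers(totals, subsets=None, n_iqr=3.0):
    """Item 3: flag outliers within each subset (or across all samples if none given).

    Assumed rule: a sample is an outlier if its total falls outside
    [Q1 - n_iqr * IQR, Q3 + n_iqr * IQR] for its own subset.
    Each subset needs at least two samples for the quartiles to be defined.
    """
    if not subsets:
        subsets = {"all": list(totals)}
    outliers = {}
    for prefix, samples in subsets.items():
        values = [totals.get(s, 0) for s in samples]
        q1, _, q3 = quantiles(values, n=4)
        iqr = q3 - q1
        low, high = q1 - n_iqr * iqr, q3 + n_iqr * iqr
        outliers[prefix] = sorted(
            s for s in samples if not low <= totals.get(s, 0) <= high
        )
    return outliers
```

Fitting each subset separately matters when the groups have very different baseline counts: a sample that is extreme relative to its own group can sit comfortably inside the pooled distribution and go undetected when all samples are analyzed together.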

I have tested the above changes on gnomAD v3 (24 VCFs & ~120k samples) and can confirm that they work as expected. I have not tested a cohort with a single VCF, but I believe it should work as long as the VCF is passed as an array of one element.

Member

@VJalili VJalili left a comment

Thanks @RCollins13! Could you please undo the changes to the dockers.json file? Docker images will be rebuilt as needed and the new images will be added to dockers.json automatically as part of the CI/CD when this PR is merged.

RCollins13 force-pushed the rlc_outlier_exclusion_preprocessing branch from bba805f to cdca408 on February 14, 2023, 20:11
@RCollins13
Contributor Author

Thanks for pointing that out, @VJalili! This made me realize I had rebased onto the old master branch before submitting this PR, so I just pulled the main branch and rebased this PR onto main. That should have reverted all of the changes to dockers.json, but let me know if anything else looks odd!

@VJalili
Member

VJalili commented Feb 14, 2023

Thanks, @RCollins13, that looks great.

@VJalili
Member

VJalili commented Feb 16, 2023

@RCollins13 if you rebase your branch on the latest main, some errors in the Test WDLs action should be resolved.
