[User Story] Apply strandbias filter for WGS TN workflow #1505

Open
3 tasks
mathiasbio opened this issue Nov 26, 2024 · 2 comments

@mathiasbio
Collaborator

mathiasbio commented Nov 26, 2024

Need

As a clinician interpreting variants from BALSAMIC, I want to avoid interpreting false-positive artefacts.

Currently the BALSAMIC WGS T+N workflow applies no strand bias filtering, which has caused many likely artefacts to be called in at least a couple of cases. One example is this deviation, where a very large number of variants were called with 100% strand bias: https://github.com/Clinical-Genomics/Deviations/issues/719

Admittedly, these samples seem to have some technical lab issues, something to do with the index pairs, but they highlighted the need for a SOR filter.

Suggested approach

Apply a bcftools filter similar to the one applied in the tumor-only WGS workflow:

| bcftools filter --threads {threads} --include "INFO/SOR < {params.sor[0]}" --soft-filter '{params.sor[1]}' --mode '+' \

in the rule: bcftools_quality_filter_tnscope_tumor_normal_wgs
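
As a rough illustration of what this soft filter does to a record, here is a minimal Python sketch. It assumes the documented bcftools behaviour that records failing the `--include` expression get the soft-filter label, and that `--mode '+'` appends the label to existing FILTER values instead of replacing them. The function name, the label `high_sor`, and the threshold value are hypothetical placeholders, not the values BALSAMIC uses in `params.sor`.

```python
def soft_filter_sor(filter_field, sor, max_sor, label):
    """Mimic an append-mode soft filter on INFO/SOR for a single record.

    Simplified sketch: records that satisfy the include expression
    (SOR < max_sor) are returned unchanged here.
    """
    if sor < max_sor:
        return filter_field

    # The record fails the include expression: append the label.
    # A bare PASS (or missing value) is replaced by the label itself.
    if filter_field in (".", "PASS", ""):
        existing = []
    else:
        existing = filter_field.split(";")
    if label not in existing:
        existing.append(label)
    return ";".join(existing)


# Hypothetical threshold and label, for illustration only.
print(soft_filter_sor("PASS", 4.8, 3.0, "high_sor"))
print(soft_filter_sor("in_normal", 4.8, 3.0, "high_sor"))
print(soft_filter_sor("PASS", 0.7, 3.0, "high_sor"))
```

Because the filter is soft, flagged variants stay in the VCF and can still be inspected, which is useful while the threshold is being evaluated.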

Considered alternatives

No response

Deviation

No response

System requirements assessed

  • Yes, I have reviewed the system requirements

Requirements affected by this story

No response

Risk assessment needed

  • Needed
  • Not needed

Risk assessment

No response

SOUPs

No response

Can be closed when

No response

Blockers

No response

Anything else?

No response

@mathiasbio mathiasbio added the User-Story A User-Story describing new functionality label Nov 26, 2024
@github-project-automation github-project-automation bot moved this to Todo in BALSAMIC Nov 26, 2024
@mathiasbio mathiasbio added this to the Release 17 milestone Jan 20, 2025
@mathiasbio mathiasbio moved this from Todo to In Progress in BALSAMIC Jan 20, 2025
@mathiasbio
Collaborator Author

I have done some testing of this filter in this sheet: https://docs.google.com/spreadsheets/d/1jBSoqJR1IcSw1w5sDXeAnMSZJMrZpuxEkzW3DJeQYns/edit?gid=0#gid=0

But I'll summarise the relevant parts here.

I looked into how applying the same filter that we use in the WGS TO analysis affects the number of variants in 5 different cases, where the bottom one is the case from the deviation https://github.com/Clinical-Genomics/Deviations/issues/719

If we apply the filter as it is in WGS TO, a LOT of variants will be filtered from the WGS TN workflow, ranging from 7.8% to 82% of the final clinical filtered variants.

I also wanted to check how the SOR parameter behaves for low-AD variants, by checking whether the fraction of variants flagged by the SOR > 3 filter was higher among variants with an AD below 6 than in the total set of variants. I didn't see any clear link, which was promising.

| Case | Step | # Variants | # SOR filtered | # AD below 6 | # SOR filtered and AD below 6 | Fraction SOR filtered (all variants) | Fraction SOR filtered (AD below 6) |
|---|---|---|---|---|---|---|---|
| WGSTN1 | clinical.filtered.pass | 5471 | 3794 | 502 | 140 | 0.693 | 0.279 |
| WGSTN1 | quality_filtered | 13537 | 7009 | 2972 | 1020 | 0.518 | 0.343 |
| WGSTN2 | clinical.filtered.pass | 3283 | 257 | 485 | 87 | 0.078 | 0.179 |
| WGSTN2 | quality_filtered | 10488 | 2876 | 2725 | 760 | 0.274 | 0.279 |
| WGSTN3 | clinical.filtered.pass | 5611 | 4058 | 344 | 82 | 0.723 | 0.238 |
| WGSTN3 | quality_filtered | 14008 | 7505 | 2428 | 771 | 0.536 | 0.318 |
| WGSTN4 | clinical.filtered.pass | 13335 | 1642 | 1234 | 183 | 0.123 | 0.148 |
| WGSTN4 | quality_filtered | 25354 | 5781 | 4421 | 1220 | 0.228 | 0.276 |
| WGSTN5 | clinical.filtered.pass | 5717 | 4690 | 627 | 244 | 0.820 | 0.389 |
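
For reference, the two fraction columns appear to be simple ratios of the count columns: SOR-filtered variants over all variants, and SOR-filtered low-AD variants over all low-AD variants. A small sketch that reproduces them from the counts (the function name is just an illustrative placeholder):

```python
def sor_fractions(n_variants, n_sor, n_low_ad, n_sor_low_ad):
    """Return (fraction SOR filtered overall, fraction SOR filtered among AD below 6)."""
    return n_sor / n_variants, n_sor_low_ad / n_low_ad


# Counts copied from the WGSTN1 and WGSTN2 clinical filtered rows above.
for case, counts in {
    "WGSTN1": (5471, 3794, 502, 140),
    "WGSTN2": (3283, 257, 485, 87),
}.items():
    overall, low_ad = sor_fractions(*counts)
    print(case, round(overall, 3), round(low_ad, 3))
```

If SOR mainly penalised low-depth variants, the second fraction would be consistently higher than the first, which is what the comparison is checking for.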

Then, to look more closely at the relationship between ALT_F1R2, ALT_F2R1 and the SOR parameter, I generated some plots. For this plot I merged the top 4 WGS cases (excluding the one from the deviation) and focused on the quality_filtered VCFs.

Image

And as usual it's quite frustrating to understand the parameters in TNscope. The SOR parameter does seem to be linked to the ALT_F1R2 and ALT_F2R1 values, but that's not the whole story. It leaves us looking at variants like these:

Image

For instance, in the 3rd row both strands are equally represented in the variant. Should we just trust that TNscope is doing some clever math behind the scenes? Or should we try to create our own filter that makes sense given the variant data?

I tested this method:

df['CUSTOM_STRANDBIAS'] = ((df.ALT_F1R2_F2R1ratio > 3.2) | (df.ALT_F1R2_F2R1ratio < 0.3125)) & (df.V_AD > 6)

That would make more sense when looking at these variant read-strand data, and we could probably implement it in bcftools. But at the moment I don't know which method to prefer.
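
To make the rule above easier to try on individual variants, here is a dependency-free sketch of the same logic. The thresholds 3.2, 0.3125 and 6 come from the pandas expression; the function name and the handling of zero counts are assumptions (a zero ALT_F2R1 is treated as an infinite ratio, which matches what float division would give in pandas, and a variant with no ALT reads on either orientation is not flagged).

```python
import math


def custom_strandbias(alt_f1r2, alt_f2r1, ad,
                      max_ratio=3.2, min_ratio=0.3125, min_ad=6):
    """Flag a variant whose ALT F1R2/F2R1 ratio is skewed and whose AD exceeds min_ad."""
    if alt_f1r2 == 0 and alt_f2r1 == 0:
        # Ratio is undefined; assumed not flagged (NaN comparisons are False in pandas).
        return False
    # Assumption: no F2R1 support at all counts as an infinite ratio.
    ratio = alt_f1r2 / alt_f2r1 if alt_f2r1 else math.inf
    skewed = ratio > max_ratio or ratio < min_ratio
    return skewed and ad > min_ad


print(custom_strandbias(10, 0, 10))  # all ALT reads in one orientation
print(custom_strandbias(5, 5, 10))   # balanced orientations
print(custom_strandbias(10, 0, 5))   # skewed, but AD is not above 6
```

Note that this rule only looks at the ALT reads, so unlike SOR it is unaffected by how the reference allele is distributed across strands.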

Image

@mathiasbio mathiasbio self-assigned this Jan 22, 2025
@mathiasbio
Collaborator Author

mathiasbio commented Jan 22, 2025

I guess this SOR parameter is inspired by, or taken directly from, GATK StrandOddsRatio: https://gatk.broadinstitute.org/hc/en-us/articles/360036361772-StrandOddsRatio

There the strand bias is also based on the reference allele. But I wonder whether that strand bias metric is optimal for cases such as the one in the deviation, where the reference allele appears to have no strand bias.

Update on this: I tested the method posted on that page on one of these TNscope variants, and the resulting SOR did not agree at all with the value TNscope reported. So I'm guessing TNscope calculates it in a slightly different way.

It's also quite clear that some variants exist with quite significant strand bias based on the alt allele strandedness values, but which have quite good SOR values and are consequently not filtered out:
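
For anyone who wants to repeat that comparison, below is a sketch of the StrandOddsRatio calculation as I understand it from the GATK description: add a pseudocount of 1 to each cell of the 2x2 strand table, take the symmetric odds ratio, then scale by the ref and alt strand ratios. This is not TNscope's implementation, and the function name and example counts are hypothetical.

```python
import math


def gatk_style_sor(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Strand odds ratio following the published GATK description (sketch)."""
    # Pseudocount of 1 avoids division by zero for empty cells.
    rf, rr, af, ar = (c + 1.0 for c in (ref_fwd, ref_rev, alt_fwd, alt_rev))

    # Symmetric odds ratio: R + 1/R
    r = (rf * ar) / (rr * af)
    symmetric_ratio = r + 1.0 / r

    # Scale by how balanced the ref and alt reads are across strands.
    ref_ratio = min(rf, rr) / max(rf, rr)
    alt_ratio = min(af, ar) / max(af, ar)

    return math.log(symmetric_ratio) + math.log(ref_ratio) - math.log(alt_ratio)


# Hypothetical counts: balanced ref reads, ALT reads on one strand only.
print(gatk_style_sor(50, 50, 10, 0))
# Hypothetical counts: everything balanced, which gives ln(2).
print(gatk_style_sor(50, 50, 5, 5))
```

With a balanced reference allele, as in the deviation case, the ref term contributes nothing and the value is driven by the ALT strand imbalance.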

Image

I have emailed Sentieon to see if they have any information on this.

This script from SMD-Bioinformatics-Lund could be worth taking a look at: https://github.com/SMD-Bioinformatics-Lund/SomaticPanelPipeline/blob/01c5bd8ae916bc5a27b20353c0bc8e9b4fae3e3e/bin/filter_tnscope_somatic.pl#L4
