[User Story] Add sex check for all workflows #1517
Comments
It's of course good if the methods used here are reliable, but it's valuable to clarify what the consequences would be if they were not. What happens if the sex prediction is wrong at some point?

The biggest issue is probably false warnings of mix-ups. The production team would see that the case has failed the QC step, and the sample would be flagged as a potential mix-up. This would prompt an investigation to confirm the sex prediction, which would be easier or more difficult depending on the type of sample. For WGS it should be quite easy to verify with a glance at the BAM file whether the sex prediction is correct, so I think for WGS we don't need to worry as much about occasionally getting the sex wrong. Since we have much more data in WGS than in TGA, a wrong prediction should also be less likely to happen.

I'm going to add results from tests to verify these methods in this sheet: https://docs.google.com/spreadsheets/d/18vzEjZe14OAV9bKw9SGkxN4fuu-kVhS-Yd5OV5UE8SU/edit?gid=0#gid=0

For now I have this for the WGS TN method using Ascat, with 12 WGS TN cases which were all correctly predicted.
In the same sheet I have added another section for the TGA prediction testing, with 45 TGA samples from a spread of different panels, and all 45 were correctly predicted. There are more details in the sheet, but basically the script tries to consolidate X and Y coverage stats from 2 files per sample, and it takes the best quality prediction in either direction, towards male or female.
A boxplot of only the antitarget values, separated by the given sex, shows that for the majority of samples there is a clear distinction in the median and mean coverage values in the antitarget bins, but a few samples from both sexes have quite similar fractions.

For the target bins the separation is a bit clearer, at least if you only look at samples from panels with more than 9 target regions on the Y chromosome. But for samples with fewer than 10 targets on the Y chromosome, the metric becomes practically useless for distinguishing between males and females. This is why I'm planning to use both the antitarget and target CNN files, and to prioritise the target file when there's sufficient data.

So far all 45 samples were correctly assigned, even those from very small panels like gi_cfdna_3.1, so hopefully this means that if we put this method in production it will not give many false sex check fails.
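The prioritisation described above could be sketched roughly like this. It is only an illustration, not the actual script: the function name `choose_ratio` and its parameters are hypothetical, and the only value taken from the comment is the cut-off of 10 Y-chromosome targets.

```python
from typing import Optional

# Cut-off taken from the observation above: with fewer than 10 target
# regions on the Y chromosome the target-based metric is not informative.
MIN_Y_TARGETS = 10


def choose_ratio(
    n_y_targets: int,
    target_ratio: Optional[float],
    antitarget_ratio: Optional[float],
) -> Optional[float]:
    """Pick which Y/X coverage ratio to trust (hypothetical helper).

    Prefer the ratio from the target file when the panel has enough
    Y targets, otherwise fall back to the antitarget ratio.
    """
    if n_y_targets >= MIN_Y_TARGETS and target_ratio is not None:
        return target_ratio
    return antitarget_ratio
```

With this shape, a small panel such as one with 3 Y targets would always be judged on its antitarget bins, while larger panels would use the target bins.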
Finally, for the tumor-only WGS, I'm calculating the ratio of the median per-base coverage of the Y chromosome to that of the X chromosome, using the same samples as for the TN cases. Again all samples pass the sex prediction, and based on these stats it seems to be a very stable way of predicting the sex.
Need
In production we sometimes encounter mixed-up samples. When running cases as a tumor + normal matched analysis we can detect these with somalier, but in tumor-only cases we have no such check. With a sex check we could detect a mix-up whenever the sample was switched with a sample of the opposite sex, which would be roughly 50% of the time.
To achieve this we need to start predicting the sex of our samples and compare it to the gender specified by the customer in the config file.
Previous issue: #1125
Suggested approach
We currently have different resources available in BALSAMIC to achieve this goal depending on the workflow, but in all approaches a "simple way" would be to compare the relative coverage of the Y and X chromosomes and, depending on a threshold, predict the sample as "male", "female", or in edge cases "unknown". This prediction can then be compared to the gender specified in the config.
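As a minimal sketch of that idea, assuming two thresholds on the Y/X coverage ratio: the threshold values, function names and the `None` return for inconclusive cases are all hypothetical and would need to be calibrated and agreed on before use.

```python
from typing import Optional

# Hypothetical thresholds, chosen only for illustration; real values
# would need to be calibrated on samples with a verified sex.
FEMALE_MAX_RATIO = 0.1
MALE_MIN_RATIO = 0.3


def predict_sex(y_cov: float, x_cov: float) -> str:
    """Predict sex from the Y/X coverage ratio."""
    if x_cov <= 0:
        return "unknown"
    ratio = y_cov / x_cov
    if ratio <= FEMALE_MAX_RATIO:
        return "female"
    if ratio >= MALE_MIN_RATIO:
        return "male"
    # Ratios between the two thresholds are treated as edge cases.
    return "unknown"


def sex_check_passes(predicted: str, given: str) -> Optional[bool]:
    """Compare the prediction to the gender given in the config.

    Returns True on a match, False on a mismatch, and None when the
    prediction is "unknown" and the check is inconclusive.
    """
    if predicted == "unknown":
        return None
    return predicted == given.lower()
```

Keeping "unknown" separate from a mismatch means an ambiguous ratio does not by itself flag the sample as a potential mix-up.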
For WGS T+N we get the predicted sex directly from Ascat:
...ascat.samplestatistics.txt
For WGS T only we don't have this, as Ascat is only run for TN cases, but we produce per-base coverage files from the Sentieon WGS metrics rule. From these we can extract the median X and Y coverage across the whole chromosomes and calculate a fraction we can use.
For TGA samples we could use files from CNVkit, the target and antitarget CNN files, in a similar way to the WGS T sex prediction, to calculate relative coverage fractions.
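A rough sketch of the median extraction follows. It assumes a simplified, hypothetical row layout of (chromosome, position, depth); the real Sentieon per-base output may be laid out differently and would need its own parser. Depths are collected into a histogram per chromosome so that whole chromosomes do not have to be held in memory.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def median_from_histogram(hist: Counter) -> float:
    """Median of the depths stored as {depth: number_of_positions}."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("no positions found for this chromosome")
    # 0-based indices of the middle element(s) in sorted order.
    lo_idx, hi_idx = (total - 1) // 2, total // 2
    seen = 0
    lo_val = None
    for depth in sorted(hist):
        seen += hist[depth]
        if lo_val is None and seen > lo_idx:
            lo_val = depth
        if seen > hi_idx:
            return (lo_val + depth) / 2
    raise AssertionError("unreachable for a non-empty histogram")


def chromosome_medians(
    rows: Iterable[Sequence[str]], chroms: Sequence[str] = ("X", "Y")
) -> Dict[str, float]:
    """Median depth per chromosome from (chrom, pos, depth) rows.

    The three-column row layout is an assumption for this sketch.
    """
    hists = {chrom: Counter() for chrom in chroms}
    for chrom, _pos, depth in rows:
        name = chrom[3:] if chrom.startswith("chr") else chrom
        if name in hists:
            hists[name][int(depth)] += 1
    return {chrom: median_from_histogram(h) for chrom, h in hists.items()}
```

For a tab-separated file in that layout, the rows could come from `csv.reader(handle, delimiter="\t")`, and the fraction would then be `medians["Y"] / medians["X"]`.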
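For the CNVkit files, a sketch could look like the following. It assumes the CNN files are tab-separated with a header containing `chromosome` and `depth` columns, which should be checked against the files BALSAMIC actually produces; the function names and the tiny inline example are hypothetical.

```python
import csv
import io
import statistics
from typing import Dict, List, Optional, TextIO


def sex_chrom_depths(handle: TextIO) -> Dict[str, List[float]]:
    """Collect per-bin depths for X and Y from a CNN-style table.

    Assumes a tab-separated file with "chromosome" and "depth" columns.
    """
    depths: Dict[str, List[float]] = {"X": [], "Y": []}
    for row in csv.DictReader(handle, delimiter="\t"):
        chrom = row["chromosome"]
        name = chrom[3:] if chrom.startswith("chr") else chrom
        if name in depths:
            depths[name].append(float(row["depth"]))
    return depths


def y_to_x_ratio(depths: Dict[str, List[float]]) -> Optional[float]:
    """Median Y depth divided by median X depth, or None if undefined."""
    if not depths["X"] or not depths["Y"]:
        return None
    x_median = statistics.median(depths["X"])
    if x_median == 0:
        return None
    return statistics.median(depths["Y"]) / x_median


# Tiny made-up example, only to show the expected input shape.
example = io.StringIO(
    "chromosome\tstart\tend\tgene\tdepth\tlog2\n"
    "chr1\t100\t200\tA\t50\t0.0\n"
    "chrX\t100\t200\tB\t20\t-1.0\n"
    "chrX\t300\t400\tB\t40\t-0.5\n"
    "chrY\t100\t200\tC\t3\t-4.0\n"
)
ratio = y_to_x_ratio(sex_chrom_depths(example))
```

The same two functions would be run on both the target and the antitarget file, which also gives the number of Y bins in each (`len(depths["Y"])`) for deciding which ratio to trust.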
Considered alternatives
No response
Deviation
No response
System requirements assessed
Requirements affected by this story
No response
Risk assessment needed
Risk assessment
No response
SOUPs
No response
Can be closed when
No response
Blockers
No response
Anything else?
No response