[User Story] Add sex check for all workflows #1517

Open
5 tasks
mathiasbio opened this issue Jan 7, 2025 · 3 comments · May be fixed by #1516

@mathiasbio
Collaborator

Need

In production we sometimes encounter mixed-up samples. When a case is run as a matched tumor + normal analysis we can detect these with somalier, but for tumor-only cases we have no such check. If the sample was switched with a sample of the opposite sex, a sex check could detect the mix-up, which would be the case roughly 50% of the time.

To achieve this we need to start predicting the sex of our samples and compare it to the gender the customer specified in the config file.

Previous issue: #1125

Suggested approach

BALSAMIC currently has different resources available for this depending on the workflow, but a simple approach common to all of them is to compare the relative coverage of the Y and X chromosomes. Based on a threshold, we predict the sample as "male", "female", or, in edge cases, "unknown", and then compare the prediction to the gender specified in the config.
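A minimal sketch of this comparison, assuming the prediction is driven by a single Y/X coverage fraction. The threshold values, the function names, and the `UNKNOWN` status are illustrative assumptions and not part of BALSAMIC; the real thresholds still need to be evaluated.

```python
# Hypothetical thresholds, for illustration only.
FEMALE_MAX_Y_X = 0.1  # at or below this fraction: predict "female"
MALE_MIN_Y_X = 0.5    # at or above this fraction: predict "male"


def predict_sex(y_x_fraction, female_max=FEMALE_MAX_Y_X, male_min=MALE_MIN_Y_X):
    """Predict sex from a Y/X coverage fraction; anything in between is 'unknown'."""
    if y_x_fraction is None:
        return "unknown"
    if y_x_fraction <= female_max:
        return "female"
    if y_x_fraction >= male_min:
        return "male"
    return "unknown"


def sex_check(predicted_sex, given_sex):
    """Compare the predicted sex with the gender given in the config."""
    if predicted_sex == "unknown":
        return "UNKNOWN"  # assumed status for boundary cases
    return "PASS" if predicted_sex == given_sex else "FAIL"


for fraction, given in [(0.0, "female"), (1.05, "male"), (0.3, "male"), (1.0, "female")]:
    predicted = predict_sex(fraction)
    print(fraction, given, predicted, sex_check(predicted, given))
```

Keeping the prediction and the comparison in separate functions means the same check can be reused regardless of which workflow produced the fraction.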

For WGS T+N we get the predicted sex directly from Ascat: ...ascat.samplestatistics.txt

NormalContamination 0.1459652664553748
Ploidy 2.125379612054747
rho 0.55
psi 2.7
goodnessOfFit 94.0355936448371
GenderChr Y
GenderChrFound N
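As a rough sketch, the `GenderChrFound` value could be read from this file as shown below. The parser assumes each line is a key followed by a single value separated by whitespace, as in the excerpt above, and it assumes `Y` maps to male and `N` to female. The function names are hypothetical.

```python
def parse_ascat_statistics(text):
    """Parse 'key value' lines from an Ascat samplestatistics file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = parts[1]
    return stats


def ascat_predicted_sex(stats):
    """Map GenderChrFound to a predicted sex (assumed: Y -> male, N -> female)."""
    mapping = {"Y": "male", "N": "female"}
    return mapping.get(stats.get("GenderChrFound"), "unknown")


example = """NormalContamination 0.1459652664553748
Ploidy 2.125379612054747
rho 0.55
psi 2.7
goodnessOfFit 94.0355936448371
GenderChr Y
GenderChrFound N
"""

stats = parse_ascat_statistics(example)
print(ascat_predicted_sex(stats))  # prints "female"
```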

For WGS tumor-only cases this is not available, as Ascat is only run for TN cases. We do, however, produce per-base coverage files in the Sentieon WGS metrics rule. From these we can extract the median coverage across the whole of the X and Y chromosomes and calculate a Y/X fraction to use for the prediction.

  • Evaluate what is a good threshold for defining male, female, and unknown for boundary cases.
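The median-based fraction could look roughly like the sketch below. It does not parse the Sentieon per-base coverage file, since that layout is not shown here; instead it assumes the coverage has already been read into `(chromosome, depth)` pairs. The function name and the toy depths are illustrative only.

```python
from statistics import median


def y_x_median_fraction(per_base_depths):
    """Return median Y depth / median X depth, or None if it cannot be computed.

    per_base_depths: iterable of (chromosome, depth) pairs, one per position.
    """
    depths = {"X": [], "Y": []}
    for chrom, depth in per_base_depths:
        # Accept both "X"/"Y" and "chrX"/"chrY" naming.
        name = chrom[3:] if chrom.startswith("chr") else chrom
        if name in depths:
            depths[name].append(depth)
    if not depths["X"] or not depths["Y"]:
        return None
    x_median = median(depths["X"])
    if x_median == 0:
        return None
    return median(depths["Y"]) / x_median


# Toy values, not real data.
toy = [("chr1", 40), ("chrX", 16), ("chrX", 15), ("chrX", 17),
       ("chrY", 15), ("chrY", 16), ("chrY", 14)]
print(y_x_median_fraction(toy))  # prints 0.9375
```

Holding every per-base depth in a list is fine for a sketch, but for whole chromosomes a depth histogram (counts per depth value) would likely be a more memory-friendly way to get the same median.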

For TGA samples we could use files from CNVkit, namely the target and antitarget CNN files. These can be used in a similar way to the WGS tumor-only sex prediction, to calculate relative coverage fractions.

  • Evaluate what is a good threshold for defining male, female, and unknown for boundary cases.
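A possible starting point for the CNN-based fractions is sketched below. It assumes the CNN file is tab-separated with a header row containing `chromosome` and `depth` columns; the function name and the example rows are made up for illustration.

```python
import csv
import io
from statistics import median


def xy_depth_summary(cnn_text):
    """Summarise X and Y bin depths from the text of a CNVkit CNN file."""
    depths = {"X": [], "Y": []}
    reader = csv.DictReader(io.StringIO(cnn_text), delimiter="\t")
    for row in reader:
        chrom = row["chromosome"]
        name = chrom[3:] if chrom.startswith("chr") else chrom
        if name in depths:
            depths[name].append(float(row["depth"]))
    return {
        "n_x_bins": len(depths["X"]),
        "n_y_bins": len(depths["Y"]),
        "x_median": median(depths["X"]) if depths["X"] else None,
        "y_median": median(depths["Y"]) if depths["Y"] else None,
    }


# Made-up rows, not real data.
rows = [
    "chromosome\tstart\tend\tgene\tdepth\tlog2",
    "1\t100\t200\tGENE1\t250.0\t0.1",
    "X\t100\t200\tGENE2\t200.0\t-0.2",
    "X\t300\t400\tGENE2\t220.0\t-0.1",
    "Y\t100\t200\tGENE3\t100.0\t-1.0",
]
summary = xy_depth_summary("\n".join(rows) + "\n")
print(summary)
```

Returning the number of Y bins alongside the medians makes it possible to judge later whether a panel has enough Y data for the fraction to be trusted.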

Considered alternatives

No response

Deviation

No response

System requirements assessed

  • Yes, I have reviewed the system requirements

Requirements affected by this story

No response

Risk assessment needed

  • Needed
  • Not needed

Risk assessment

No response

SOUPs

No response

Can be closed when

No response

Blockers

No response

Anything else?

No response

@mathiasbio mathiasbio added the User-Story A User-Story describing new functionality label Jan 7, 2025
@github-project-automation github-project-automation bot moved this to Todo in BALSAMIC Jan 7, 2025
@mathiasbio mathiasbio linked a pull request Jan 7, 2025 that will close this issue
55 tasks
@mathiasbio mathiasbio added this to the Release 17 milestone Jan 7, 2025
@mathiasbio mathiasbio self-assigned this Jan 7, 2025
This was referenced Jan 7, 2025
@mathiasbio
Collaborator Author

It's of course good if the methods used here are reliable, but it's valuable to clarify what the consequences would be if they were not. What happens if the sex prediction is wrong at some point? The biggest issue is probably false warnings of mix-ups.

The production team would see that the case has failed the QC step, and the sample would be flagged as a potential mix-up. This would prompt an investigation to try to confirm the sex prediction, which would be easier or more difficult depending on the type of sample.

For WGS it should be quite easy to verify with a glance at the BAM file whether the sex prediction is correct. So I think for WGS we don't need to worry as much about occasionally getting the sex wrong, and since we have much more data in WGS than in TGA, this should also be less likely to happen.

I'm going to add results from tests to verify these methods in this sheet: https://docs.google.com/spreadsheets/d/18vzEjZe14OAV9bKw9SGkxN4fuu-kVhS-Yd5OV5UE8SU/edit?gid=0#gid=0

For now I have this for the WGS TN method using Ascat, with 12 WGS TN cases which were all correctly predicted:

wgs_tn_case GenderChrFound given_sex status
1 N female PASS
2 Y male PASS
3 N female PASS
4 Y male PASS
5 Y male PASS
6 N female PASS
7 N female PASS
8 Y male PASS
9 N female PASS
10 Y male PASS
11 N female PASS
12 N female PASS

@mathiasbio
Collaborator Author

mathiasbio commented Jan 8, 2025

In the same sheet I have added another section for the TGA prediction testing, with 45 TGA samples spread across different panels, and all 45 were correctly predicted. There are more details in the sheet, but in short the script consolidates X and Y coverage stats from 2 files per sample and takes the best-quality prediction in either direction, towards male or female.

sex capture_kit predicted_sex predicted_sex_confidence status
male exome_comp_10.2 male high PASS
male exome_comp_10.2 male high PASS
male gmck_solid_4.2 male high PASS
male gmck_solid_4.2 male low PASS
male gms_lymphoid_7.3 male high PASS
male gms_lymphoid_7.3 male medium PASS
male gms_lymphoid_7.3 male high PASS
male gms_lymphoid_7.3 male high PASS
male gms_lymphoid_7.3 male high PASS
male gms_lymphoid_7.3 male high PASS
male gms_lymphoid_7.3 male high PASS
male gms_lymphoid_7.3 male high PASS
male gms_myeloid_5.3 male high PASS
male gms_myeloid_5.3 male high PASS
male gms_myeloid_5.4 male high PASS
male gms_myeloid_5.3 male high PASS
male gms_myeloid_5.3 male high PASS
male gms_myeloid_5.3 male high PASS
male gms_myeloid_5.3 male high PASS
male gms_myeloid_5.3 male high PASS
male gms_myeloid_5.4 male low PASS
male gms_myeloid_5.4 male low PASS
male gms_myeloid_5.4 male high PASS
male gms_myeloid_5.4 male high PASS
male gi_cfdna_3.1 male medium PASS
female gms_lymphoid_7.3 female high PASS
female gms_lymphoid_7.3 female medium PASS
female gms_lymphoid_7.3 female medium PASS
female gms_lymphoid_7.3 female high PASS
female gms_lymphoid_7.3 female high PASS
female gms_lymphoid_7.3 female high PASS
female gms_lymphoid_7.3 female high PASS
female gmck_solid_4.2 female high PASS
female gms_lymphoid_7.3 female high PASS
female gms_myeloid_5.3 female high PASS
female gms_myeloid_5.3 female high PASS
female gms_myeloid_5.3 female high PASS
female gms_myeloid_5.3 female high PASS
female exome_comp_10.2 female high PASS
female exome_comp_10.2 female high PASS
female gms_myeloid_5.3 female high PASS
female gms_myeloid_5.3 female high PASS
female gms_myeloid_5.3 female high PASS
female gms_myeloid_5.3 female high PASS
female gi_cfdna_3.1 female high PASS

Here is a boxplot of only the anti-target values, separated into the given sexes:

image

For the majority of samples there is a clear distinction in the median and mean coverage values in the anti-target bins, but a few samples of both sexes have quite similar fractions.

For the target bins it's a bit clearer, at least if you only look at samples from panels with more than 9 target regions on the Y chromosome:

image

But for samples from panels with fewer than 10 targets on the Y chromosome, the metric becomes practically useless for distinguishing between males and females:

image

This is why I'm planning to use both the antitarget and target CNN files, prioritising the target file when there's sufficient data. So far all 45 samples were correctly assigned, even those from very small panels like gi_cfdna_3.1, so hopefully this means that this method will not produce many false sex check failures if we put it in production.
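One way the prioritisation could be expressed is sketched below. This is a simplification and not the actual script: it only picks which fraction to use based on the number of Y-chromosome targets, with the cut-off of 10 taken from the observation above. The function name and signature are hypothetical.

```python
def select_y_x_fraction(target_fraction, n_y_targets, antitarget_fraction,
                        min_y_targets=10):
    """Prefer the target-based fraction when the panel has enough Y targets.

    Returns a (source, fraction) tuple; the source is "target", "antitarget" or "none".
    """
    if target_fraction is not None and n_y_targets >= min_y_targets:
        return "target", target_fraction
    if antitarget_fraction is not None:
        return "antitarget", antitarget_fraction
    return "none", None


# Panel with plenty of Y targets: the target fraction is used.
print(select_y_x_fraction(0.8, 25, 0.6))
# Small panel with few Y targets: fall back to the antitarget fraction.
print(select_y_x_fraction(0.8, 3, 0.6))
```

Returning the source together with the fraction keeps it visible in the QC output which file the prediction was based on.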

@mathiasbio
Collaborator Author

mathiasbio commented Jan 8, 2025

Finally, for tumor-only WGS I'm calculating the fraction of the per-base median coverage of the Y chromosome over that of the X chromosome (Y / X), using the same samples as for the TN cases. Again all samples pass the sex prediction, and based on these stats it seems to be a very stable way of predicting the sex:

sex predicted_sex tumor_y_x_median_frac status
female female 0 PASS
female female 0 PASS
male male 0.9375 PASS
male male 1.13333 PASS
female female 0 PASS
female female 0 PASS
male male 1 PASS
female female 0 PASS
female female 0 PASS
male male 1.05 PASS
female female 0 PASS
male male 1.10526 PASS
female female 0 PASS
female female 0 PASS
male male 0.98333 PASS
male male 1.05455 PASS
female female 0 PASS
female female 0 PASS
male male 1.04615 PASS
female female 0 PASS
male male 1.01786 PASS
male male 0.97917 PASS
female female 0 PASS
female female 0 PASS
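As a rough aid for the threshold evaluation, the gap between the two groups in the table above can be computed directly. The fractions are copied from the table; taking the midpoint is just one possible heuristic and not a validated threshold.

```python
# Y/X median fractions from the table above, grouped by given sex.
female_fractions = [0.0] * 14
male_fractions = [0.9375, 1.13333, 1.0, 1.05, 1.10526,
                  0.98333, 1.05455, 1.04615, 1.01786, 0.97917]

highest_female = max(female_fractions)
lowest_male = min(male_fractions)
midpoint = (highest_female + lowest_male) / 2

print(f"highest female fraction: {highest_female}")
print(f"lowest male fraction:    {lowest_male}")
print(f"midpoint of the gap:     {midpoint}")
```

These 24 samples contain no boundary cases, so the gap shows how well separated the groups are here but says little about where an "unknown" band should sit.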

@mathiasbio mathiasbio moved this from Todo to In Progress in BALSAMIC Jan 10, 2025