[User Story] Add sex check for all workflows #1517

mathiasbio · 2025-01-07T16:13:14Z

Need

In production we sometimes encounter mixed-up samples. When running cases as a tumor + normal matched analysis we can detect these with somalier, but for tumor-only cases we have no such check. If the sample was switched with a sample of the opposite gender, a sex check would detect the mix-up, which should catch roughly 50% of these cases.

To achieve this we need to start predicting the sex of our samples and compare it to the gender specified by the customer in the config file.

Previous issue: #1125

Suggested approach

BALSAMIC currently has different resources available for this depending on the workflow. In all approaches, a "simple way" would be to compare the relative coverage of the Y and X chromosomes and, depending on a threshold, predict the sample as "male", "female", or in edge cases "unknown". We can then compare the prediction to the gender specified in the config.
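As a minimal sketch of the final comparison step, assuming the prediction and the config value are plain strings: the function name `sex_check_status` and the `WARN` status for an "unknown" prediction are hypothetical and not taken from BALSAMIC.

```python
def sex_check_status(given_sex: str, predicted_sex: str) -> str:
    """Compare the predicted sex with the one given in the config.

    Hypothetical helper: the status names are assumptions for illustration.
    """
    given = given_sex.strip().lower()
    predicted = predicted_sex.strip().lower()
    if predicted == "unknown":
        # Assumption: an inconclusive prediction is reported, not failed.
        return "WARN"
    return "PASS" if predicted == given else "FAIL"
```

How an "unknown" prediction should be handled in QC is an open design choice; failing it outright would increase false mix-up warnings.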

For WGS T+N we get the predicted sex directly from Ascat: ...ascat.samplestatistics.txt

NormalContamination 0.1459652664553748
Ploidy 2.125379612054747
rho 0.55
psi 2.7
goodnessOfFit 94.0355936448371
GenderChr Y
GenderChrFound N
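A minimal sketch of reading this file, assuming it always consists of whitespace-separated key/value lines as in the excerpt above. The function names are hypothetical, and the mapping of `GenderChrFound` to a sex (Y to male, N to female) is an assumption that matches the test results reported further down in this issue.

```python
def parse_ascat_statistics(text: str) -> dict[str, str]:
    """Parse 'key value' lines from an Ascat sample statistics file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = parts[1]
    return stats


def predicted_sex_from_ascat(stats: dict[str, str]) -> str:
    """Map GenderChrFound to a predicted sex (assumed mapping)."""
    found = stats.get("GenderChrFound")
    if found == "Y":
        return "male"
    if found == "N":
        return "female"
    return "unknown"
```

For the excerpt above, `GenderChrFound N` would give a prediction of "female".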

For WGS T only we don't have this as Ascat is only run for TN cases, but we produce the per-base coverage files from the Sentieon WGS metrics rule. From there we can extract the median X and Y coverages across the whole chromosomes and calculate a fraction we can use.

Evaluate what is a good threshold for defining male, female, and unknown for boundary cases.
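A minimal sketch of the Y/X fraction and the threshold step, assuming the per-base depths for each chromosome have already been read into lists. The function names and the thresholds `female_max=0.2` and `male_min=0.6` are placeholders for illustration, not evaluated values.

```python
from statistics import median


def y_x_median_fraction(x_depths, y_depths):
    """Return median(Y depth) / median(X depth), or None if X has no coverage."""
    x_median = median(x_depths)
    if x_median == 0:
        return None
    return median(y_depths) / x_median


def predict_sex(fraction, female_max=0.2, male_min=0.6):
    """Classify a Y/X fraction; both thresholds are hypothetical placeholders."""
    if fraction is None:
        return "unknown"
    if fraction <= female_max:
        return "female"
    if fraction >= male_min:
        return "male"
    # Anything between the two thresholds is treated as a boundary case.
    return "unknown"
```

Keeping a gap between the two thresholds is one way to get the "unknown" class for boundary cases; where the gap should sit is exactly what needs to be evaluated.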

For TGA samples we could use two files from CNVkit, the target and antitarget CNN files, in a similar way as for the WGS T sex prediction, to calculate relative coverage fractions.

Evaluate what is a good threshold for defining male, female, and unknown for boundary cases.
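A minimal sketch of extracting the X and Y median depths from a CNN file, assuming the file is tab-separated with a header row containing `chromosome` and `depth` columns. The function name is hypothetical, and stripping a `chr` prefix is an assumption about the reference naming.

```python
import csv
import io
from statistics import median


def median_depth_by_chrom(cnn_text: str, chroms=("X", "Y")) -> dict:
    """Return the median bin depth per chromosome, or None if it has no bins."""
    depths = {chrom: [] for chrom in chroms}
    reader = csv.DictReader(io.StringIO(cnn_text), delimiter="\t")
    for row in reader:
        # Assumption: chromosome names may or may not carry a 'chr' prefix.
        chrom = row["chromosome"].removeprefix("chr")
        if chrom in depths:
            depths[chrom].append(float(row["depth"]))
    return {chrom: (median(vals) if vals else None) for chrom, vals in depths.items()}
```

A `None` value for Y signals that the panel has no bins there at all, which should lead to an "unknown" prediction rather than a division error.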

Considered alternatives

No response

Deviation

No response

System requirements assessed

Yes, I have reviewed the system requirements

Requirements affected by this story

No response

Risk assessment needed

Needed
Not needed

Risk assessment

No response

SOUPs

No response

Can be closed when

No response

Blockers

No response

Anything else?

No response


mathiasbio · 2025-01-08T09:39:30Z

It's of course good if the methods used here are reliable, but it's valuable to clarify what the issues would be if they were not. What happens if the sex prediction is wrong at some point? The biggest issue is probably false warnings of mix-ups.

The production team would see that the case has failed the QC step, and the sample would be flagged as a potential mix-up. This would spur some investigation to try to confirm the sex prediction, which would be easier or more difficult depending on the type of sample.

For WGS it should be quite easy to verify with a glance at the BAM file whether the sex prediction is correct, so I think we don't need to worry as much about sometimes getting the sex wrong. Since we have much more data in WGS than in TGA, a wrong prediction should also be less likely.

I'm going to add results from tests to verify these methods in this sheet: https://docs.google.com/spreadsheets/d/18vzEjZe14OAV9bKw9SGkxN4fuu-kVhS-Yd5OV5UE8SU/edit?gid=0#gid=0

For now I have this for the WGS TN method using Ascat, with 12 WGS TN cases which were all correctly predicted:

WGS TN case	GenderChrFound	given sex	status
1	N	female	PASS
2	Y	male	PASS
3	N	female	PASS
4	Y	male	PASS
5	Y	male	PASS
6	N	female	PASS
7	N	female	PASS
8	Y	male	PASS
9	N	female	PASS
10	Y	male	PASS
11	N	female	PASS
12	N	female	PASS

mathiasbio · 2025-01-08T11:48:56Z

In the same sheet I have added another section for the TGA prediction testing, with 45 TGA samples from a spread of different panels, and all 45 were correctly predicted. There are more details in the sheet, but basically the script tries to consolidate X and Y coverage stats from 2 files per sample, and it takes the best-quality prediction in either direction, towards male or female.

sex	capture_kit	predicted sex	predicted sex confidence	status
male	exome_comp_10.2	male	high	PASS
male	exome_comp_10.2	male	high	PASS
male	gmck_solid_4.2	male	high	PASS
male	gmck_solid_4.2	male	low	PASS
male	gms_lymphoid_7.3	male	high	PASS
male	gms_lymphoid_7.3	male	medium	PASS
male	gms_lymphoid_7.3	male	high	PASS
male	gms_lymphoid_7.3	male	high	PASS
male	gms_lymphoid_7.3	male	high	PASS
male	gms_lymphoid_7.3	male	high	PASS
male	gms_lymphoid_7.3	male	high	PASS
male	gms_lymphoid_7.3	male	high	PASS
male	gms_myeloid_5.3	male	high	PASS
male	gms_myeloid_5.3	male	high	PASS
male	gms_myeloid_5.4	male	high	PASS
male	gms_myeloid_5.3	male	high	PASS
male	gms_myeloid_5.3	male	high	PASS
male	gms_myeloid_5.3	male	high	PASS
male	gms_myeloid_5.3	male	high	PASS
male	gms_myeloid_5.3	male	high	PASS
male	gms_myeloid_5.4	male	low	PASS
male	gms_myeloid_5.4	male	low	PASS
male	gms_myeloid_5.4	male	high	PASS
male	gms_myeloid_5.4	male	high	PASS
male	gi_cfdna_3.1	male	medium	PASS
female	gms_lymphoid_7.3	female	high	PASS
female	gms_lymphoid_7.3	female	medium	PASS
female	gms_lymphoid_7.3	female	medium	PASS
female	gms_lymphoid_7.3	female	high	PASS
female	gms_lymphoid_7.3	female	high	PASS
female	gms_lymphoid_7.3	female	high	PASS
female	gms_lymphoid_7.3	female	high	PASS
female	gmck_solid_4.2	female	high	PASS
female	gms_lymphoid_7.3	female	high	PASS
female	gms_myeloid_5.3	female	high	PASS
female	gms_myeloid_5.3	female	high	PASS
female	gms_myeloid_5.3	female	high	PASS
female	gms_myeloid_5.3	female	high	PASS
female	exome_comp_10.2	female	high	PASS
female	exome_comp_10.2	female	high	PASS
female	gms_myeloid_5.3	female	high	PASS
female	gms_myeloid_5.3	female	high	PASS
female	gms_myeloid_5.3	female	high	PASS
female	gms_myeloid_5.3	female	high	PASS
female	gi_cfdna_3.1	female	high	PASS

Here is a boxplot of only the anti-target values, separated into the given sexes:

For the majority of samples there is a clear distinction in the median and mean coverage values in the anti-target bins, but there are a few samples of both sexes that have quite similar fractions.

For the target bins it's a bit clearer, at least if you only look at samples from panels with more than 9 target regions on the Y chromosome:

But for samples from panels with fewer than 10 targets on the Y chromosome, the metric becomes practically useless for distinguishing between males and females:

This is why I'm planning to use both the antitarget and target CNN files, prioritising the target file when there's sufficient data. So far all 45 samples were correctly assigned, even those from very small panels like gi_cfdna_3.1, so hopefully this means that if we put this method in production it will not give many false sex check fails.
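A minimal sketch of that prioritisation, assuming a Y/X fraction has already been computed from each file. The function name is hypothetical, and the cutoff of 10 Y-chromosome targets is taken from the observation above rather than from the actual script, whose consolidation logic may differ.

```python
def pick_y_x_fraction(target_frac, antitarget_frac, n_y_targets, min_y_targets=10):
    """Prefer the target-based fraction when the panel has enough Y targets.

    Returns a (source, fraction) tuple; min_y_targets is an assumed cutoff.
    """
    if n_y_targets >= min_y_targets and target_frac is not None:
        return "target", target_frac
    # Too few Y targets: fall back to the antitarget bins.
    return "antitarget", antitarget_frac
```

Returning the source alongside the fraction makes it possible to report which file the prediction was based on.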

mathiasbio · 2025-01-08T14:51:51Z

Finally, for tumor-only WGS I'm calculating the fraction of the per-base median coverage of the Y chromosome over that of the X chromosome, using the same samples as for the TN cases. Again all samples pass the sex prediction, and based on these stats it seems to be a very stable way of predicting the sex:

sex	predicted_sex	tumor_y_x_median_frac	status
female	female	0	PASS
female	female	0	PASS
male	male	0.9375	PASS
male	male	1.13333	PASS
female	female	0	PASS
female	female	0	PASS
male	male	1	PASS
female	female	0	PASS
female	female	0	PASS
male	male	1.05	PASS
female	female	0	PASS
male	male	1.10526	PASS
female	female	0	PASS
female	female	0	PASS
male	male	0.98333	PASS
male	male	1.05455	PASS
female	female	0	PASS
female	female	0	PASS
male	male	1.04615	PASS
female	female	0	PASS
male	male	1.01786	PASS
male	male	0.97917	PASS
female	female	0	PASS
female	female	0	PASS

mathiasbio added the User-Story label Jan 7, 2025

mathiasbio added this to BALSAMIC Jan 7, 2025

github-project-automation bot moved this to Todo in BALSAMIC Jan 7, 2025

mathiasbio linked a pull request Jan 7, 2025 that will close this issue

feat: add sex check #1516

Open

55 tasks

mathiasbio added this to the Release 17 milestone Jan 7, 2025

mathiasbio self-assigned this Jan 7, 2025

This was referenced Jan 7, 2025

Add Sex QC check #1125

Closed

feat: add sex check #1516

Open

mathiasbio moved this from Todo to In Progress in BALSAMIC Jan 10, 2025
