-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add sex check #1516
base: develop
Are you sure you want to change the base?
feat: add sex check #1516
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## develop #1516 +/- ##
===========================================
+ Coverage 99.48% 99.50% +0.01%
===========================================
Files 40 40
Lines 1932 2000 +68
===========================================
+ Hits 1922 1990 +68
Misses 10 10
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nicely done! 🌟 It's a bit annoying that the sex check is not the same for all workflows but that they rely in different files and methods. Since the most general way seems to be the one using sentieon outputs, would that work for all of them (using different thresholds if needed)? I know you haven't run those tests, but do you have a feeling for it?
The other things that might be problematic is that the sex check is not written in multiqc but only on the yalm file. If I'm not wrong the point was to stop using the yalm file eventually and only feed multiqc files into janus. Would it make sense to make a local multiqc module to output this value?
see other small comments below
python {params.collect_qc_metrics_script} {params.config_path} {output.yaml} {input.json} {input.bcftools_counts} | ||
""" | ||
|
||
if config["analysis"]["analysis_workflow"] != "balsamic-qc": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, why is this needed now? why are we diverging between balsamic and balsamic-qc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could just limit this if-statement to WGS TN and TGA.
But there it's because the sex_prediction.json is reliant on CNV tools which are not run in the balsamic-qc workflow. But I didn't want to bother too much about allowing the sex-check for specifically WGS TO cases in balsamic-qc because I don't think anyone is ever running balsamic-qc anyway, and the logic for starting it is also being removed from CG. So it feels like a dead workflow 🤷
but I guess if I did not rely on the CNV tools but only the per base coverage stats like in WGS TO I wouldn't need to make this distinction. But we're not generating this file for the TGA workflow so it would need to be added, and at that point it just starts to feel like extra work with little benefit. But for sure I could use this file in WGS TN similar to WGS TO, and only have 2 ways of getting the prediction instead of 3. It wouldn't be that difficult to make that change, it was just nice to rely on a more sophisticated tool when there was one available, but then I'm not actually sure how Ascat determines if the Y-chrom is present or not, only that it works so far...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aha I see your point. Could the senteion file be generated to the TGA workflow? If not, I agree that is extra work for not much benefit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it can be added but I would need to investigate what the file looks like for TGA and rerun a bunch of cases to get the files and find a suitable threshold like I did with the CNVkit files. I agree that it would be cleaner, and I'm sure it could work! But for now it feels like we have more pressing things to add to release 17, and I kind of just wanted to sneak this feature in as it was requested by prodbioinfo for so long, but we didn't really plan for it to be included in the release 😬 and it seems to be working so I'm happy with this compromise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but if I could start over I'd generate the per base coverage files! Now it feels like I don't have time anymore to make the change and finish the remaining features 🥲
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see! That's fine then. Thanks for clarifying though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's a bit ugly to have this repeated and slightly different for the different workflows, I wonder if we could simplify it to have a single rule and dynamically determine the input and the arguments for the scripts. Would that be cleaner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would force me to do a lot of snakemake wildcard stuff, which would end up converting the input files to a list so I couldn't use arguments anymore to the script, and it would end up transferring the logic if figuring out which files goes where to the script and it could get messy I think 🤔 I kind of think this repeated rule structure gives a nice overview of what's actually happening but maybe that's just me 😂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Repeated structures are difficult to maintain, so generally I would argue against them. However in this case it seems that it might help readability and understanding of how things are done so I guess it's ok to leave it
I made some changes now, to remove ascat for TN WGS and instead use the same method as for WGS TO, and to not use case_sex but instead just compare each samples sex to the sex in the config. This should be a bit cleaner. Thanks for the suggestions Eva! |
Quality Gate passedIssues Measures |
Description
Adds sex checks to all workflows. See issue: #1517
In the above issue I also post some results from tests using the 3 different methods implemented here:
Added
Documentation
Tests
Feature Tests
Test that the sex check is working!
More tests showing that the prediction is working in all workflows in this issue: #1517
all.0.sh.123456.err
all.0.sh.123456.err
Both Somalier and the sex prediction fails as expected.
From predicted_sex.json:
Pipeline Integrity Tests
.hk
file)Clinical Genomics Stockholm
Documentation
Panel of Normal specific criteria
User Changes
Infrastructure Changes
Checklist
Important
Ensure that all checkboxes below are ticked before merging.
For Developers
For Reviewers
conditions where applicable, with satisfactory results.