feat: add sex check #1516

Open
wants to merge 41 commits into
base: develop

mathiasbio
Collaborator

@mathiasbio mathiasbio commented Jan 3, 2025

Description

Adds sex checks to all workflows; see issue #1517.
That issue also contains test results for the 3 different methods implemented here:

  • WGS TN: using Ascat
  • WGS TO: using a fraction of the median per-base X and Y coverage
  • TGA: using the CNVkit CNN target and antitarget files
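To illustrate the coverage-based idea behind the WGS TO method, here is a minimal sketch in Python. It assumes one plausible reading of the description: the ratio of median per-base Y coverage to median per-base X coverage is compared against cutoffs. The function name and both thresholds are hypothetical, chosen only for illustration; they are not the values or the code used in this PR.

```python
from statistics import median
from typing import Sequence


def predict_sex_from_coverage(
    x_depths: Sequence[float],
    y_depths: Sequence[float],
    female_max_ratio: float = 0.1,  # hypothetical cutoff, not from the PR
    male_min_ratio: float = 0.3,  # hypothetical cutoff, not from the PR
) -> str:
    """Predict sex from per-base coverage on chromosomes X and Y.

    Uses the ratio median(Y) / median(X): close to zero when no Y
    chromosome is present, clearly above zero otherwise.
    """
    x_median = median(x_depths)
    y_median = median(y_depths)
    if x_median == 0:
        # Without X coverage the ratio is undefined, so make no call.
        return "unknown"
    ratio = y_median / x_median
    if ratio <= female_max_ratio:
        return "female"
    if ratio >= male_min_ratio:
        return "male"
    # Ratios between the two cutoffs are treated as inconclusive.
    return "unknown"
```

Keeping an explicit "unknown" zone between the two cutoffs means borderline samples are flagged for inspection rather than forced into a call.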

Added

  • Sex prediction tools and sex verification for all workflows

Documentation

  • N/A, WILL UPDATE ATLAS THOUGH!
  • Updated Balsamic documentation to reflect the changes as needed for this PR.
    • [Document Name]

Tests

Feature Tests

Test that the sex check is working!
More tests showing that the prediction works in all workflows are in issue #1517.

  • Manually changing the sex in the config file to the wrong one: after running metric validation, the error shows up in all.0.sh.123456.err
image
  • After running a switched-up case, the sex prediction shows as "conflicting" and the metric validation error is shown in all.0.sh.123456.err

Both Somalier and the sex prediction fail as expected.

image

From predicted_sex.json:

image

  • Sex metric shows up in deliverables yaml file

image

Pipeline Integrity Tests

  • Report deliver (generation of the .hk file)
    • N/A
    • Verified
  • TGA T/O Workflow
    • N/A
    • Verified
  • TGA T/N Workflow
    • N/A
    • Verified
  • UMI T/O Workflow
    • N/A
    • Verified
  • UMI T/N Workflow
    • N/A
    • Verified
  • WGS T/O Workflow
    • N/A
    • Verified
  • WGS T/N Workflow
    • N/A
    • Verified
  • QC Workflow
    • N/A
    • Verified
  • PON Workflow
    • N/A
    • Verified

Clinical Genomics Stockholm

Documentation

Panel of Normal specific criteria

User Changes

  • N/A
  • This PR affects the output files or results.
    • User feedback is considered unnecessary because [Justification].
    • Affected users have been included in the development process and given a chance to provide feedback.

Infrastructure Changes

  • Stored files in Housekeeper
    • N/A
    • Updated: [Link]
  • CG (CLI and delivered/uploaded files)
    • N/A
    • Updated: [Link]
  • Servers (configuration files on Hasta)
    • N/A
    • Updated: [Link]
  • Scout interface
    • N/A
    • Updated: [Link]

Checklist

Important

Ensure that all checkboxes below are ticked before merging.

For Developers

  • PR Description
    • Provided a comprehensive description of the PR.
    • Linked relevant user stories or issues to the PR.
  • Documentation
    • Verified and updated documentation if necessary.
  • Tests
    • Described and tested the functionality addressed in the PR.
    • Ensured integration of the new code with existing workflows.
    • Confirmed that meaningful unit tests were added for the changes introduced.
    • Checked that the PR has successfully passed all relevant code smells and coverage checks.
  • Review
    • Addressed and resolved all the feedback provided during the code review process.
    • Obtained final approval from designated reviewers.

For Reviewers

  • Code
    • Code implements the intended features or fixes the reported issue.
    • Code follows the project's coding standards and style guide.
  • Documentation
    • Pipeline changes are well-documented in the CHANGELOG and relevant documentation.
  • Tests
    • The author provided a description of their manual testing, including consideration of edge cases and boundary
      conditions where applicable, with satisfactory results.
  • Review
    • Confirmed that the developer has addressed all the comments during the code review.


codecov bot commented Jan 7, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.50%. Comparing base (7d529e6) to head (fc59c9e).
Report is 36 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1516      +/-   ##
===========================================
+ Coverage    99.48%   99.50%   +0.01%     
===========================================
  Files           40       40              
  Lines         1932     2000      +68     
===========================================
+ Hits          1922     1990      +68     
  Misses          10       10              
Flag Coverage Δ
unittests 99.50% <100.00%> (+0.01%) ⬆️


@mathiasbio mathiasbio linked an issue Jan 7, 2025 that may be closed by this pull request
5 tasks
@mathiasbio mathiasbio marked this pull request as ready for review January 9, 2025 14:19
@mathiasbio mathiasbio requested a review from a team as a code owner January 9, 2025 14:19
Contributor

@fevac fevac left a comment

Nicely done! 🌟 It's a bit annoying that the sex check is not the same for all workflows, but relies on different files and methods. Since the most general approach seems to be the one using Sentieon outputs, would that work for all of them (using different thresholds if needed)? I know you haven't run those tests, but do you have a feeling for it?

The other thing that might be problematic is that the sex check is not written to MultiQC but only to the yaml file. If I'm not wrong, the point was to eventually stop using the yaml file and feed only MultiQC files into janus. Would it make sense to make a local MultiQC module to output this value?

See other small comments below.

BALSAMIC/assets/scripts/collect_qc_metrics.py (5 outdated, resolved review threads)
python {params.collect_qc_metrics_script} {params.config_path} {output.yaml} {input.json} {input.bcftools_counts}
"""

if config["analysis"]["analysis_workflow"] != "balsamic-qc":
Contributor

Sorry, why is this needed now? Why are we diverging between balsamic and balsamic-qc?

Collaborator Author

I could just limit this if-statement to WGS TN and TGA.

It's there because the sex_prediction.json relies on CNV tools, which are not run in the balsamic-qc workflow. I didn't want to bother too much about allowing the sex check specifically for WGS TO cases in balsamic-qc, because I don't think anyone ever runs balsamic-qc anyway, and the logic for starting it is also being removed from CG. So it feels like a dead workflow 🤷

But I guess if I did not rely on the CNV tools but only on the per-base coverage stats, as in WGS TO, I wouldn't need to make this distinction. We're not generating this file for the TGA workflow, though, so it would need to be added, and at that point it just starts to feel like extra work with little benefit. I could certainly use this file in WGS TN as in WGS TO, and have only 2 ways of getting the prediction instead of 3. It wouldn't be that difficult to make that change; it was just nice to rely on a more sophisticated tool when one was available. Then again, I'm not actually sure how Ascat determines whether the Y chromosome is present, only that it works so far...

Contributor

Aha, I see your point. Could the Sentieon file be generated for the TGA workflow? If not, I agree that it is extra work for not much benefit.

Collaborator Author

I think it can be added, but I would need to investigate what the file looks like for TGA, then rerun a bunch of cases to get the files and find a suitable threshold, like I did with the CNVkit files. I agree that it would be cleaner, and I'm sure it could work! But for now it feels like we have more pressing things to add to release 17. I kind of just wanted to sneak this feature in, since prodbioinfo has requested it for so long, but we didn't really plan for it to be included in the release 😬 It seems to be working, so I'm happy with this compromise.

Collaborator Author

But if I could start over, I'd generate the per-base coverage files! Now it feels like I no longer have time to make the change and finish the remaining features 🥲

Contributor

I see! That's fine then. Thanks for clarifying though

Contributor

It's a bit ugly to have this repeated, and slightly different, for the different workflows. I wonder if we could simplify it into a single rule and dynamically determine the input and the arguments for the script. Would that be cleaner?

Collaborator Author

I think that would force me to do a lot of Snakemake wildcard stuff, which would end up converting the input files to a list, so I couldn't use arguments to the script anymore. It would move the logic of figuring out which files go where into the script, and I think it could get messy 🤔 I kind of think this repeated rule structure gives a nice overview of what's actually happening, but maybe that's just me 😂

Contributor

Repeated structures are difficult to maintain, so generally I would argue against them. However, in this case it seems that it might help readability and understanding of how things are done, so I guess it's OK to leave it.
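For reference, one possible shape of the single-rule alternative discussed in this thread is a plain Python function that returns named inputs per workflow. This is only a sketch under stated assumptions: the config keys `sequencing_type`, `analysis_type` and `case_id`, the function name, and every file path below are hypothetical and do not reflect the actual BALSAMIC layout.

```python
from typing import Dict


def sex_prediction_inputs(config: Dict) -> Dict[str, str]:
    """Return named input files for a hypothetical shared sex-prediction rule.

    All config keys and file names here are illustrative assumptions.
    """
    analysis = config["analysis"]
    case_id = analysis["case_id"]

    if analysis["sequencing_type"] == "wgs":
        # Coverage-based method: one per-base coverage file per sample.
        inputs = {"tumor_coverage": f"qc/tumor.{case_id}.per_base_coverage.txt"}
        if analysis["analysis_type"] == "paired":
            inputs["normal_coverage"] = f"qc/normal.{case_id}.per_base_coverage.txt"
        return inputs

    # Targeted (TGA) method: CNVkit target and antitarget coverage files.
    return {
        "target_cnn": f"cnv/tumor.{case_id}.targetcoverage.cnn",
        "antitarget_cnn": f"cnv/tumor.{case_id}.antitargetcoverage.cnn",
    }
```

In a Snakefile, a dictionary like this can be passed through Snakemake's `unpack()` so the keys remain available as named inputs rather than a flat list. Whether that reads better than the repeated rules is a judgment call, as noted above.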

tests/models/test_metric_models.py (outdated, resolved review thread)
tests/scripts/test_sex_prediction.py (resolved review thread)
@mathiasbio
Collaborator Author

I've now made some changes: I removed Ascat for TN WGS and use the same method as for WGS TO instead, and I no longer use case_sex but just compare each sample's sex to the sex in the config. This should be a bit cleaner. Thanks for the suggestions, Eva!
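The per-sample comparison described here could look roughly like the sketch below. The function names and the shape of the prediction dictionary are assumptions made for illustration, not the code in this PR.

```python
from typing import Dict


def find_sex_mismatches(config_sex: str, predicted_sex: Dict[str, str]) -> Dict[str, str]:
    """Return the samples whose predicted sex differs from the config sex."""
    return {
        sample: predicted
        for sample, predicted in predicted_sex.items()
        if predicted != config_sex
    }


def validate_sex(config_sex: str, predicted_sex: Dict[str, str]) -> None:
    """Raise ValueError if any sample's predicted sex conflicts with the config."""
    mismatches = find_sex_mismatches(config_sex, predicted_sex)
    if mismatches:
        details = ", ".join(
            f"{sample}: predicted {predicted}"
            for sample, predicted in sorted(mismatches.items())
        )
        raise ValueError(f"Sex check failed, expected {config_sex} ({details})")
```

Checking each sample against the config directly means a swapped tumor or normal sample is reported by name, with no case-level value needed.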

Successfully merging this pull request may close these issues.

[User Story] Add sex check for all workflows