Swap out semsimian for hpo3 #458

Open
wants to merge 9 commits into base: main

Collaborator

@MattWellie MattWellie commented Nov 11, 2024

https://cpg-populationanalysis.atlassian.net/browse/TAL-12

Fixes

  • Talos is going to be irritating to deploy due to multiple dependencies, and irritating to update as the input data files on https://hpo.jax.org/app/data/annotations can't be downloaded programmatically
  • HPO3 as a library is faster than Semsimian, and comes with all the required data files as part of the install. We can choose to use different files, but as a baseline this is a much more usable library.

Proposed Changes

  • Swaps out the semsimian package for hpo3, a Rust implementation of PyHPO
  • HPO3 comes with all the relevant JAX data bundled, so we don't need to source these files separately, or supply them as arguments
  • HPO3 can take a manual override of the files used, but will be a much lower friction default library

Changes

  1. src/talos/CPG/MakePhenopackets.py: still removes all terms which are children of HP:0000005 ("Mode of Inheritance"), but in a different way. There is no longer any conditional behaviour depending on whether an HPO ontology file is provided
  2. src/talos/FindGeneSymbolMap.py: the output dictionary is reversed, indexed on ENSG instead of Symbol. This is relevant in HPOFlagging
  3. src/talos/GeneratePanelData.py: much simpler. Instead of recursively calling a method to check for panel matches layer by layer, we take each participant HPO term, find all its parent terms up to the HPO root, and match any PanelApp panels whose HPO terms intersect with that list of parents. This is functionally equivalent to what was done before, and probably would have worked with obonet too.
  4. src/talos/HPOFlagging.py: Substantially different...
  • Previously this did a pairwise termset match between all HPO terms associated with the gene and all HPO terms assigned to the participant. Based on the strength of individual matches, we kept a set of all HPO: NAME entries for strong matches.
  • Now this collects the HPO terms from the participant and uses PyHPO to run a gene enrichment test. This test reports all genes with a high probability of enrichment, or a high fold enrichment. I'm applying the p-value as a label if the variant gene has a p-value below 0.05 for enriched association. This label is not as detailed as before, but is way faster to implement...
  • Crucially for redeployment, this no longer requires downloading the phenio_db file, or the collection of gene~phenotypes from JAX, as the data is distributed as part of HPO3
  • HPO3 contains a pile of individual or set-wise similarity comparison methods (here), but doesn't export the same format we had before (all HPO terms in the comparison, and their specific score)
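
For reviewers, the ancestor-walk matching described for GeneratePanelData.py can be sketched in plain Python. This is a minimal illustration, not the code in this PR: the toy ontology, the panel IDs, and the function names `ancestors` and `match_panels` are all hypothetical, and it assumes the participant's own term counts as a match alongside its parents.

```python
# Toy ontology: term -> direct parents. IDs and structure are invented.
TOY_PARENTS = {
    "HP:A": set(),              # stand-in for the HPO root
    "HP:B": {"HP:A"},
    "HP:C": {"HP:B"},
    "HP:D": {"HP:B", "HP:A"},
}

# Hypothetical mapping of HPO term -> IDs of panels tagged with that term.
TOY_PANELS = {"HP:B": {101}, "HP:D": {202}}


def ancestors(term, parents):
    """Return the term plus every ancestor up to the root (iterative walk)."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, ()))
    return seen


def match_panels(participant_terms, parents, panels_by_term):
    """Collect panels tagged with a participant term or any of its ancestors."""
    matched = set()
    for term in participant_terms:
        # Intersect the ancestor set with the terms that have panels attached.
        for hit in panels_by_term.keys() & ancestors(term, parents):
            matched |= panels_by_term[hit]
    return matched


panels_for_c = match_panels(["HP:C"], TOY_PARENTS, TOY_PANELS)
panels_for_d = match_panels(["HP:D"], TOY_PARENTS, TOY_PANELS)
```

Because each term's ancestors are gathered in one pass, there is no need for the layer-by-layer recursion used previously; the set intersection does the matching.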
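
To make the p-value labelling in HPOFlagging.py concrete, here is a rough standard-library sketch of an enrichment test. It assumes a one-sided hypergeometric test, which is a common choice for this kind of enrichment but is an assumption here rather than a statement of what hpo3 does internally. The term universe, the gene's term set, and the names `hypergeom_sf` and `enrichment_label` are hypothetical.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) when drawing `draws` items without replacement from
    `population` items, of which `successes` are of interest."""
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return tail / total


def enrichment_label(participant_terms, gene_terms, all_terms, alpha=0.05):
    """Return the p-value as a label if it is below alpha, else None."""
    overlap = len(participant_terms & gene_terms)
    p_value = hypergeom_sf(
        overlap, len(all_terms), len(gene_terms), len(participant_terms)
    )
    return p_value if p_value < alpha else None


# Toy universe of 100 invented terms; the gene is annotated with 5 of them.
ALL_TERMS = {f"TERM:{i}" for i in range(1, 101)}
GENE_TERMS = {f"TERM:{i}" for i in range(1, 6)}

# Participant with 4 terms, 3 of which overlap the gene's terms.
matching = {"TERM:1", "TERM:2", "TERM:3", "TERM:50"}
# Participant with 4 terms, none of which overlap.
unrelated = {"TERM:60", "TERM:61", "TERM:62", "TERM:63"}

label_hit = enrichment_label(matching, GENE_TERMS, ALL_TERMS)
label_miss = enrichment_label(unrelated, GENE_TERMS, ALL_TERMS)
```

The label carries only a single number per gene, which reflects the trade-off noted above: it is cheaper to compute than per-term similarity scores but does not say which HPO terms drove the match.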

Needs

[ ] Decide if we're happy with the trade-offs here (faster, easier to redeploy, vs. more granular similarity comparisons)
[ ] If so, remove all the phenioDB/genes2phenotypes reference files from prod-pipes and the nextflow demo
[ ] Check for performance differences between the two results

Checklist

  • Related Issue created
  • Tests covering new change
  • Linting checks pass

@MattWellie MattWellie requested a review from cassimons November 11, 2024 02:46