New linter. Readme stickers #355

Merged (4 commits), Feb 5, 2024
5 changes: 5 additions & 0 deletions .pre-commit-config.yaml
@@ -36,6 +36,11 @@ repos:
hooks:
- id: ruff

- repo: https://github.com/populationgenomics/pre-commits
rev: v0.1.3
hooks:
- id: cpg-id-checker

- repo: https://github.com/pre-commit/mirrors-mypy
rev: v1.8.0
hooks:
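The `cpg-id-checker` hook added above lives in the populationgenomics/pre-commits repo, and its real implementation is not shown in this diff. As a rough sketch only (the `CPG\d+` pattern and both function names are assumptions for illustration, not the hook's actual code), such a checker scans file contents for CPG-style identifiers and fails the commit when any are found:

```python
import re

# Assumed pattern: the real hook's regex lives in populationgenomics/pre-commits.
CPG_ID_PATTERN = re.compile(r"\bCPG\d+\b")


def find_cpg_ids(text: str) -> list:
    """Return all CPG-style identifiers found in the text, in order."""
    return CPG_ID_PATTERN.findall(text)


def check_files(contents: dict) -> list:
    """Pre-commit-style check: list '<path>: <id>' failures across files.

    An empty result means the hook would pass; any entries would fail the commit.
    """
    failures = []
    for path, text in contents.items():
        for hit in find_cpg_ids(text):
            failures.append(f"{path}: {hit}")
    return failures
```

In a real hook, `check_files` would read the staged file paths passed by pre-commit and exit non-zero on any failure.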
19 changes: 11 additions & 8 deletions README.md
@@ -1,5 +1,7 @@
# Automated Interpretation Pipeline (AIP)

[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff) ![test](https://github.com/populationgenomics/automated-interpretation-pipeline/actions/workflows/test.yaml/badge.svg) ![black](https://img.shields.io/badge/code%20style-black-000000.svg)

## Purpose

A variant prioritisation tool, aiming to assist clinical curators by sifting through large datasets and highlighting a
@@ -113,9 +115,9 @@
of all panels on which they have previously been seen. A gene will be treated as
before, or features in a phenotype-matched panel on which it has not been seen before. i.e.

- if a Gene is promoted to `Green` on the Mendeliome, it will be recorded as `New`, and the prior data will be extended
to show that the gene was seen on the Mendeliome.
Contributor:
IMO, just make this one line, and use soft-wrapping in VSCode:

```jsonc
// in the settings.json
"[markdown]": {
    // ... other settings
    "editor.wordWrap": "wordWrapColumn",
    "editor.wrappingIndent": "same",
    "editor.wordWrapColumn": 80,
},
```

Collaborator (Author):
This one was caused by markdownlint... or ruff. One or the other. On-save this got split off back onto separate lines. I've reformatted all markdown files in the repo to remove artificial line breaks

- if a Gene has previously appeared on the Mendeliome, and during current run it now appears on a new panel, the gene
will be recorded as `New`, and the new panel will be added to the prior data list.

### Variant Results

@@ -127,9 +129,10 @@
This data consists of the variant IDs, Categories they've previously been assigned and any variant IDs they
have formed a compound-het with. When reviewing the variants of a current run, we check for previously seen variants on
a per-sample basis, e.g.:

- If a variant has been seen as a `Cat.1` before, and appears again as a `Cat.1`, it will be removed
- If a variant has been seen as a `Cat.1` before, and now is both `Cat.1` & `Cat.2`, the `Cat.1` assignment will be
removed, and will be reported only as a `Cat.2`. The prior data will be extended to show that it has been
a `Cat.1 & 2`
- If a variant was never seen before, it will appear on the report with no removed Categories
- If a variant was seen in a compound-het now, and was previously partnered with a different variant, all `Categories`
will be retained, and the new partner ID will be added to the list of `support_vars` in the prior data
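The per-sample de-duplication rules above can be sketched in plain Python (the function names and set-based data model are illustrative, not AIP's actual implementation):

```python
def novel_categories(current: set, prior: set) -> set:
    """Report only the category labels not previously seen for this sample/variant."""
    return current - prior


def update_prior(prior: set, current: set) -> set:
    """Extend the prior data with every category observed in the current run."""
    return prior | current


def merge_support_vars(prior_partners: list, new_partner: str) -> list:
    """Record a newly observed compound-het partner, keeping earlier partners."""
    if new_partner in prior_partners:
        return prior_partners
    return prior_partners + [new_partner]
```

For example, a variant previously seen as `Cat.1` and now labelled both `Cat.1` & `Cat.2` is reported only as `Cat.2`, while its prior record grows to cover both.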
16 changes: 8 additions & 8 deletions design_docs/AddingNewCategories.md
@@ -1,21 +1,21 @@
# New Categories

This framework is designed to make the addition of new categories super simple. The minimal changes required to create a
new category are:

1. Add new Category name/number and description to the config file (
e.g. [here](https://github.com/populationgenomics/automated-interpretation-pipeline/blob/afcf1bfa2acc30803558fa2092fab4fd8b0a58a5/reanalysis/reanalysis_global.toml#L54))
2. If new fields are acted upon (e.g. a new annotation field), add those to the `CSQ` field in config to ensure the
values are exported in the labelled VCF (
e.g. [here](https://github.com/populationgenomics/automated-interpretation-pipeline/blob/afcf1bfa2acc30803558fa2092fab4fd8b0a58a5/reanalysis/reanalysis_global.toml#L36)).
Without this change, the values will not be pulled from the MT, and cannot be presented in the report
3. Add a new category method in the [hail_filter_and_label.py script](../reanalysis/hail_filter_and_label.py) (
e.g. [here](https://github.com/populationgenomics/automated-interpretation-pipeline/blob/afcf1bfa2acc30803558fa2092fab4fd8b0a58a5/reanalysis/hail_filter_and_label.py#L622-L658)).
This method should stand independently, and contain all the logic to decide whether the label is applied or not. This
encapsulation should also include the decision about whether a classification is Boolean (once per variant), Sample (
only relevant to a subset of Samples), or Support (a lesser level of significance)
4. Add a new diagram describing the decision tree to the [images folder](images), and reference it in
the [README](Hail_Filter_and_Label.md)
5. If you require new fields to be displayed in the HTML report, make the appropriate changes to the templates
6. If you need additional logic (e.g. when this category is assigned we should interpret the variant under a partial
penetrance model), that's... more complicated. Get in touch with the team to discuss.
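A plain-Python caricature of step 3's encapsulated decision (the real category methods operate on Hail expressions inside hail_filter_and_label.py; the score threshold and field names below are invented purely for illustration):

```python
def categorise_new_label(variant: dict):
    """Decide whether a hypothetical new category label applies.

    Returns the label kind: 'boolean' (once per variant), 'sample' (only
    relevant to a subset of samples), or None when the label should not be
    applied. The 28.0 cutoff and the 'cadd'/'de_novo_samples' fields are
    illustrative assumptions, not fields from the real pipeline.
    """
    # Illustrative gate: require a high in-silico score before labelling
    if variant.get("cadd", 0.0) < 28.0:
        return None
    # Sample-level label when the signal is genotype-specific
    if variant.get("de_novo_samples"):
        return "sample"
    # Otherwise the label applies once per variant
    return "boolean"
```

The point of the encapsulation is that all gating logic, including the Boolean/Sample/Support decision, sits inside the one method.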
8 changes: 4 additions & 4 deletions design_docs/Annotation.md
@@ -2,7 +2,7 @@

AIP leverages numerous functions within the Hail Batch and Hail Query libraries to organise workflows and efficiently
query large datasets. Unfortunately, at the time of writing, Hail's ability to run annotation using VEP is limited to GCS
DataProc instances. To allow for a more flexible implementation, the CPG has created an annotation workaround, within
the [cpg_workflows package](https://github.com/populationgenomics/production-pipelines/tree/main/cpg_workflows). This
acts directly on a VCF, fragmenting the raw data and annotating in parallelised jobs, forming the annotated data back
into a [Hail MatrixTable](https://hail.is/docs/0.2/hail.MatrixTable.html), which is the AIP starting point.
103 changes: 46 additions & 57 deletions design_docs/Clinvar.md
@@ -4,55 +4,48 @@
See [relevant development issue](https://github.com/populationgenomics/automated

## Context

ClinVar is a valuable resource for identifying known Pathogenic & Benign variants within a genomic dataset. By
aggregating evidence from a range of submitters, we can utilise the crowdsourced information to annotate current data
with established clinical relevance.

ClinVar entries consist of:

* Individual Submissions, representing an assertion made by a submitter about the impact of an individual allele.
* Allele summaries, which produce a top-line decision about each allele by aggregating all relevant submissions.

During benchmarking of this application we have run into numerous instances of failing to identify known pathogenic
variants, due to conflicting ClinVar submission results. On closer inspection, the aggregation logic for ClinVar
submissions seems too conservative for our needs. [An example](https://ncbi.nlm.nih.gov/clinvar/variation/10/): despite
24 Pathogenic submissions to only 2 Benign, the variant is given an overall status of `Conflicting interpretations`.
Whilst this is accurate, it obfuscates the bias towards pathogenicity present in the individual submissions. When we
annotate a dataset with ClinVar consequences, all we have is this top-line decision, meaning that we are unable to flag
such variants for more manual scrutiny.

The role of AIP is not to make clinical decisions, but to identify variants of interest for further review by analysts.
In this setting we want to flag variants where manual review of the submissions could signal a variant is worth
consideration, even if it doesn't appear pathogenic within the strict aggregation logic of ClinVar. To this end we
created a manual re-curation of ClinVar which:

* Allows for specific submitters to be removed from consideration (i.e. so that when we run benchmarking analysis on
cohorts, we can blind our ClinVar annotations to entries originating from that cohort)
* Accepts an optional date threshold, to simulate a 'latest' summary at the given point in time (discarding any
submissions added or edited after the date, i.e. to simulate different ClinVar time points using the same starting
files)
* Defers to submissions after mainstream acceptance of ACMG criteria (estimated start 2016)
* Performs a more decisive summary, preferring a decision towards Pathogenic/Benign instead of summarising any
disagreements as `conflicting`

## Process

The re-summary is rapid, and can be repeated at regular intervals, taking the latest available ClinVar submissions each
time it runs. The files used are the `submission_summary` and `variant_summary` present
on [the NCBI clinvar FTP site](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/). We bin submissions into a few
discrete categories; see [the script (L37)](../reanalysis/summarise_clinvar_entries.py) for the bins used.

1. Iterate over all individual submissions, removing any from blacklisted providers or which were submitted after the
date threshold. Collect all retained submissions per-allele.
2. For each allele, if any retained submissions were last edited after 2015 (representative ACMG date), reduce
submissions to only those. If no subs are from after 2015, retain all.
3. Find a summary 'rating' across all alleles, checking these scenarios until a match is found:

* If an Expert Review/Clinical Guideline submission is present - choose that rating.
@@ -71,17 +64,15 @@
* If any submissions `Criteria Provided` -> `1 stars`
* `0 Stars`
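The filtering and aggregation steps above can be sketched as follows (the `Submission` fields, the exact cutoff date, and the simple majority tie-break are simplifying assumptions; the real logic, including the Expert Review override, lives in summarise_clinvar_entries.py):

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Submission:
    submitter: str
    classification: str  # e.g. 'pathogenic', 'benign', 'uncertain'
    last_edited: date


# Representative ACMG adoption date (assumed end-of-2015 cutoff)
ACMG_CUTOFF = date(2015, 12, 31)


def retain(subs, blacklist, threshold):
    """Steps 1-2: drop blacklisted submitters and post-threshold entries,
    then prefer post-ACMG submissions when any exist."""
    kept = [
        s for s in subs
        if s.submitter not in blacklist and s.last_edited <= threshold
    ]
    recent = [s for s in kept if s.last_edited > ACMG_CUTOFF]
    return recent or kept


def summarise(subs):
    """Step 3 (simplified): lean towards a Pathogenic/Benign call rather
    than reporting any disagreement as 'conflicting'."""
    path = sum(s.classification == "pathogenic" for s in subs)
    benign = sum(s.classification == "benign" for s in subs)
    if path > benign:
        return "pathogenic"
    if benign > path:
        return "benign"
    return "uncertain"
```

Under this decisive summary, the example allele above (24 Pathogenic vs 2 Benign submissions) would resolve to Pathogenic rather than `Conflicting interpretations`.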

At this stage we have each allele with a new summary and star rating. The allele ID is matched up with the corresponding
variant coordinates and ref/alt alleles from the variant summary file, then the whole object is persisted as a Hail
Table, indexed on Locus and Alleles, ready to be used in annotation within Hail.

## Clinvar Runner [clinvar_runner.py](../reanalysis/clinvar_runner.py)

The script [clinvar_runner.py](../reanalysis/clinvar_runner.py) automates the regeneration of the re-summarised ClinVar
data table, as well as the PM5 annotations as described in [Hail_Filter_and_Label.md](Hail_Filter_and_Label.md#usp).
This sets up a workflow which is designed to run independently of an AIP run, generating and saving all the annotation
sources:

```commandline
@@ -95,8 +86,7 @@
* Annotate the VCF with VEP, and create MatrixTable

3. PM5 Re-Index
* Re-format the MatrixTable, finding all missense consequences and indexing variants by protein ID & altered residue, e.g. ENSP1234::123
* Aggregate all Clinvar Variants by affected allele
* Save results as a second Hail Table
```
@@ -114,14 +104,13 @@
The second is the PM5 data in the following format:
| ENSP12345::123 | AlleleID::#Stars |
| ENSP12345::678 | AlleleID::#Stars+AlleleID::#Stars |
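Building that index can be sketched as follows (the input record shape and the function name are assumptions; the `ProteinID::Residue` keys and `AlleleID::#Stars` values follow the table above, with `+` joining multiple alleles at one residue):

```python
from collections import defaultdict


def build_pm5_index(missense_records):
    """Index ClinVar missense variants by protein ID and altered residue.

    Each record is assumed to be a (protein_id, residue, allele_id, stars)
    tuple; values are rendered as 'AlleleID::#Stars' and '+'-joined when
    several alleles affect the same residue.
    """
    index = defaultdict(list)
    for protein_id, residue, allele_id, stars in missense_records:
        index[f"{protein_id}::{residue}"].append(f"{allele_id}::{stars}")
    return {key: "+".join(values) for key, values in index.items()}
```

In the real workflow the resulting mapping is persisted as a second Hail Table rather than a Python dict.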

The outputs of this process are all written into `cpg-common-<main/test>-analysis/clinvar_aip/YY-MM`, and a new round of
creation is triggered on the first day of each month. When AIP runs, the annotation Tables are collected from the latest
communal directory, and incorporated into the AIP results.

## Automation

To facilitate the regular regeneration of the ClinVar data without requiring manual intervention, a new GitHub CI YAML
file has been created using a CRON scheduler. This is located
at [.github/workflows/clinvar_runner.yml](../.github/workflows/clinvar_runner.yml). This file triggers
the `clinvar_runner.py` process, can be manually triggered at any time, and will not run if the outputs already exist.