New linter. Readme stickers #355

Merged (4 commits), Feb 5, 2024
5 changes: 5 additions & 0 deletions .pre-commit-config.yaml
@@ -36,6 +36,11 @@ repos:
hooks:
- id: ruff

- repo: https://github.com/populationgenomics/pre-commits
rev: v0.1.3
hooks:
- id: cpg-id-checker

- repo: https://github.com/pre-commit/mirrors-mypy
rev: v1.8.0
hooks:
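The `cpg-id-checker` hook added above lives in the populationgenomics/pre-commits repo, and its real implementation is not shown in this diff. As a rough sketch only (the `CPG\d+` pattern and both function names are assumptions for illustration, not the hook's actual code), such a checker scans file contents for CPG-style identifiers and fails the commit when any are found:

```python
import re

# Assumed pattern: the real hook's regex lives in populationgenomics/pre-commits.
CPG_ID_PATTERN = re.compile(r"\bCPG\d+\b")


def find_cpg_ids(text: str) -> list:
    """Return all CPG-style identifiers found in the text, in order."""
    return CPG_ID_PATTERN.findall(text)


def check_files(contents: dict) -> list:
    """Pre-commit-style check: list '<path>: <id>' failures across files.

    An empty result means the hook would pass; any entries would fail the commit.
    """
    failures = []
    for path, text in contents.items():
        for hit in find_cpg_ids(text):
            failures.append(f"{path}: {hit}")
    return failures
```

In a real hook, `check_files` would read the staged file paths passed by pre-commit and exit non-zero on any failure.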
19 changes: 11 additions & 8 deletions README.md
@@ -1,5 +1,7 @@
# Automated Interpretation Pipeline (AIP)

[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff) ![test](https://github.com/populationgenomics/automated-interpretation-pipeline/actions/workflows/test.yaml/badge.svg) ![black](https://img.shields.io/badge/code%20style-black-000000.svg)

## Purpose

A variant prioritisation tool, aiming to assist clinical curators by sifting through large datasets and highlighting a
@@ -113,9 +115,9 @@
of all panels on which they have previously been seen. A gene will be treated as
before, or features in a phenotype-matched panel on which it has not been seen before. i.e.

- if a Gene is promoted to `Green` on the Mendeliome, it will be recorded as `New`, and the prior data will be extended
to show that the gene was seen on the Mendeliome.
Contributor:
IMO, just make this one line, and use soft-wrapping in VSCode:

```jsonc
// in the settings.json
"[markdown]": {
    // ... other settings
    "editor.wordWrap": "wordWrapColumn",
    "editor.wrappingIndent": "same",
    "editor.wordWrapColumn": 80,
},
```

Collaborator (Author):
This one was caused by markdownlint... or ruff. One or the other. On-save this got split off back onto separate lines. I've reformatted all markdown files in the repo to remove artificial line breaks

- if a Gene has previously appeared on the Mendeliome, and during current run it now appears on a new panel, the gene
will be recorded as `New`, and the new panel will be added to the prior data list.

### Variant Results

@@ -127,9 +129,10 @@
This data consists of the variant IDs, Categories they've previously been assigned and any variant IDs they
have formed a compound-het with. When reviewing the variants of a current run, we check for previously seen variants on
a per-sample basis, e.g.:

- If a variant has been seen as a `Cat.1` before, and appears again as a `Cat.1`, it will be removed
- If a variant has been seen as a `Cat.1` before, and now is both `Cat.1` & `Cat.2`, the `Cat.1` assignment will be
removed, and will be reported only as a `Cat.2`. The prior data will be extended to show that it has been
a `Cat.1 & 2`
- If a variant was never seen before, it will appear on the report with no removed Categories
- If a variant was seen in a compound-het now, and was previously partnered with a different variant, all `Categories`
will be retained, and the new partner ID will be added to the list of `support_vars` in the prior data
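The per-sample de-duplication rules above can be sketched in plain Python (the function names and set-based data model are illustrative, not AIP's actual implementation):

```python
def novel_categories(current: set, prior: set) -> set:
    """Report only the category labels not previously seen for this sample/variant."""
    return current - prior


def update_prior(prior: set, current: set) -> set:
    """Extend the prior data with every category observed in the current run."""
    return prior | current


def merge_support_vars(prior_partners: list, new_partner: str) -> list:
    """Record a newly observed compound-het partner, keeping earlier partners."""
    if new_partner in prior_partners:
        return prior_partners
    return prior_partners + [new_partner]
```

For example, a variant previously seen as `Cat.1` and now labelled both `Cat.1` & `Cat.2` is reported only as `Cat.2`, while its prior record grows to cover both.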
16 changes: 8 additions & 8 deletions design_docs/AddingNewCategories.md
@@ -1,21 +1,21 @@
# New Categories

This framework is designed to make the addition of new categories super simple. The minimal changes required to create a
new category are:

1. Add new Category name/number and description to the config file (
e.g. [here](https://github.com/populationgenomics/automated-interpretation-pipeline/blob/afcf1bfa2acc30803558fa2092fab4fd8b0a58a5/reanalysis/reanalysis_global.toml#L54))
2. If new fields are acted upon (e.g. a new annotation field), add those to the `CSQ` field in config to ensure the
values are exported in the labelled VCF (
e.g. [here](https://github.com/populationgenomics/automated-interpretation-pipeline/blob/afcf1bfa2acc30803558fa2092fab4fd8b0a58a5/reanalysis/reanalysis_global.toml#L36)).
Without this change, the values will not be pulled from the MT, and cannot be presented in the report
3. Add a new category method in the [hail_filter_and_label.py script](../reanalysis/hail_filter_and_label.py) (
e.g. [here](https://github.com/populationgenomics/automated-interpretation-pipeline/blob/afcf1bfa2acc30803558fa2092fab4fd8b0a58a5/reanalysis/hail_filter_and_label.py#L622-L658)).
This method should stand independently, and contain all the logic to decide whether the label is applied or not. This
encapsulation should also include the decision about whether a classification is Boolean (once per variant), Sample (
only relevant to a subset of Samples), or Support (a lesser level of significance)
4. Add a new diagram describing the decision tree to the [images folder](images), and reference it in
the [README](Hail_Filter_and_Label.md)
5. If you require new fields to be displayed in the HTML report, make the appropriate changes to the templates
6. If you need additional logic (e.g. when this category is assigned we should interpret the variant under a partial
penetrance model), that's... more complicated. Get in touch with the team to discuss.
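A plain-Python caricature of step 3's encapsulated decision (the real category methods operate on Hail expressions inside hail_filter_and_label.py; the score threshold and field names below are invented purely for illustration):

```python
def categorise_new_label(variant: dict):
    """Decide whether a hypothetical new category label applies.

    Returns the label kind: 'boolean' (once per variant), 'sample' (only
    relevant to a subset of samples), or None when the label should not be
    applied. The 28.0 cutoff and the 'cadd'/'de_novo_samples' fields are
    illustrative assumptions, not fields from the real pipeline.
    """
    # Illustrative gate: require a high in-silico score before labelling
    if variant.get("cadd", 0.0) < 28.0:
        return None
    # Sample-level label when the signal is genotype-specific
    if variant.get("de_novo_samples"):
        return "sample"
    # Otherwise the label applies once per variant
    return "boolean"
```

The point of the encapsulation is that all gating logic, including the Boolean/Sample/Support decision, sits inside the one method.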
8 changes: 4 additions & 4 deletions design_docs/Annotation.md
@@ -2,7 +2,7 @@

AIP leverages numerous functions within the Hail Batch and Hail Query libraries to organise workflows and efficiently
query large datasets. Unfortunately, at the time of writing, Hail's ability to run annotation using VEP is limited to GCS
DataProc instances. To allow for a more flexible implementation, the CPG has created an annotation workaround, within
the [cpg_workflows package](https://github.com/populationgenomics/production-pipelines/tree/main/cpg_workflows). This
acts directly on a VCF, fragmenting the raw data and annotating in parallelised jobs, forming the annotated data back
into a [Hail MatrixTable](https://hail.is/docs/0.2/hail.MatrixTable.html), which is the AIP starting point.
103 changes: 46 additions & 57 deletions design_docs/Clinvar.md
@@ -4,55 +4,48 @@
See [relevant development issue](https://github.com/populationgenomics/automated

## Context

ClinVar is a valuable resource for identifying known Pathogenic & Benign variants within a genomic dataset. By
aggregating evidence from a range of submitters, we can utilise the crowdsourced information to annotate current data
with established clinical relevance.

ClinVar entries consist of:

* Individual Submissions, representing an assertion made by a submitter about the impact of an individual allele.
* Allele summaries, which produce a top-line decision about each allele by aggregating all relevant submissions.

During benchmarking of this application we have run into numerous instances of failing to identify known pathogenic
variants, due to conflicting ClinVar submission results. On closer inspection, the aggregation logic for ClinVar
submissions seems too conservative for our needs. [An example](https://ncbi.nlm.nih.gov/clinvar/variation/10/): despite
24 Pathogenic submissions to only 2 Benign, the variant is given an overall status of `Conflicting interpretations`.
Whilst this is accurate, it obfuscates the bias towards pathogenicity present in the individual submissions. When we
annotate a dataset with ClinVar consequences, all we have is this top-line decision, meaning that we are unable to flag
such variants for more manual scrutiny.

The role of AIP is not to make clinical decisions, but to identify variants of interest for further review by analysts.
In this setting we want to flag variants where manual review of the submissions could signal a variant is worth
consideration, even if it doesn't appear pathogenic within the strict aggregation logic of ClinVar. To this end we
created a manual re-curation of ClinVar which:

* Allows for specific submitters to be removed from consideration (i.e. so that when we run benchmarking analysis on
cohorts, we can blind our ClinVar annotations to entries originating from that cohort)
* Accepts an optional date threshold, to simulate a 'latest' summary at the given point in time (discarding any
submissions added or edited after the date, i.e. to simulate different ClinVar time points using the same starting
files)
* Defers to submissions after mainstream acceptance of ACMG criteria (estimated start 2016)
* Performs a more decisive summary, preferring a decision towards Pathogenic/Benign instead of summarising any
disagreements as `conflicting`

## Process

The re-summary is rapid, and can be repeated at regular intervals, taking the latest available ClinVar submissions each
time it runs. The files used are the `submission_summary` and `variant_summary` present
on [the NCBI clinvar FTP site](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/). We bin submissions into a few
discrete categories; see [the script (L37)](../reanalysis/summarise_clinvar_entries.py) for the bins used.

1. Iterate over all individual submissions, removing any from blacklisted providers or which were submitted after the
date threshold. Collect all retained submissions per-allele.
2. For each allele, if any retained submissions were last edited after 2015 (representative ACMG date), reduce
submissions to only those. If no subs are from after 2015, retain all.
3. Find a summary 'rating' across all alleles, checking these scenarios until a match is found:

* If an Expert Review/Clinical Guideline submission is present - choose that rating.
@@ -71,17 +64,15 @@
* If any submissions `Criteria Provided` -> `1 stars`
* `0 Stars`
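The filtering and aggregation steps above can be sketched as follows (the `Submission` fields, the exact cutoff date, and the simple majority tie-break are simplifying assumptions; the real logic, including the Expert Review override, lives in summarise_clinvar_entries.py):

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Submission:
    submitter: str
    classification: str  # e.g. 'pathogenic', 'benign', 'uncertain'
    last_edited: date


# Representative ACMG adoption date (assumed end-of-2015 cutoff)
ACMG_CUTOFF = date(2015, 12, 31)


def retain(subs, blacklist, threshold):
    """Steps 1-2: drop blacklisted submitters and post-threshold entries,
    then prefer post-ACMG submissions when any exist."""
    kept = [
        s for s in subs
        if s.submitter not in blacklist and s.last_edited <= threshold
    ]
    recent = [s for s in kept if s.last_edited > ACMG_CUTOFF]
    return recent or kept


def summarise(subs):
    """Step 3 (simplified): lean towards a Pathogenic/Benign call rather
    than reporting any disagreement as 'conflicting'."""
    path = sum(s.classification == "pathogenic" for s in subs)
    benign = sum(s.classification == "benign" for s in subs)
    if path > benign:
        return "pathogenic"
    if benign > path:
        return "benign"
    return "uncertain"
```

Under this decisive summary, the example allele above (24 Pathogenic vs 2 Benign submissions) would resolve to Pathogenic rather than `Conflicting interpretations`.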

At this stage we have each allele with a new summary and star rating. The allele ID is matched up with the corresponding
variant coordinates and ref/alt alleles from the variant summary file, then the whole object is persisted as a Hail
Table, indexed on Locus and Alleles, ready to be used in annotation within Hail.

## Clinvar Runner [clinvar_runner.py](../reanalysis/clinvar_runner.py)

The script [clinvar_runner.py](../reanalysis/clinvar_runner.py) automates the regeneration of the re-summarised ClinVar
data table, as well as the PM5 annotations as described in [Hail_Filter_and_Label.md](Hail_Filter_and_Label.md#usp).
This sets up a workflow which is designed to run independently of an AIP run, generating and saving all the annotation
sources:

```commandline
@@ -95,8 +86,7 @@
* Annotate the VCF with VEP, and create MatrixTable

3. PM5 Re-Index
* Re-format the MatrixTable, finding all missense consequences and indexing variants by protein ID & altered residue, e.g. ENSP1234::123
* Aggregate all Clinvar Variants by affected allele
* Save results as a second Hail Table
```
@@ -114,14 +104,13 @@
The second is the PM5 data in the following format:
| ENSP12345::123 | AlleleID::#Stars |
| ENSP12345::678 | AlleleID::#Stars+AlleleID::#Stars |
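Building that index can be sketched as follows (the input record shape and the function name are assumptions; the `ProteinID::Residue` keys and `AlleleID::#Stars` values follow the table above, with `+` joining multiple alleles at one residue):

```python
from collections import defaultdict


def build_pm5_index(missense_records):
    """Index ClinVar missense variants by protein ID and altered residue.

    Each record is assumed to be a (protein_id, residue, allele_id, stars)
    tuple; values are rendered as 'AlleleID::#Stars' and '+'-joined when
    several alleles affect the same residue.
    """
    index = defaultdict(list)
    for protein_id, residue, allele_id, stars in missense_records:
        index[f"{protein_id}::{residue}"].append(f"{allele_id}::{stars}")
    return {key: "+".join(values) for key, values in index.items()}
```

In the real workflow the resulting mapping is persisted as a second Hail Table rather than a Python dict.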

The outputs of this process are all written into `cpg-common-<main/test>-analysis/clinvar_aip/YY-MM`, and a new round of
creation is triggered on the first day of each month. When AIP runs, the annotation Tables are collected from the latest
communal directory, and incorporated into the AIP results.

## Automation

To facilitate the regular regeneration of the ClinVar data without requiring manual intervention, a new GitHub CI YAML
file has been created using a CRON scheduler. This is located
at [.github/workflows/clinvar_runner.yml](../.github/workflows/clinvar_runner.yml). This file triggers
the `clinvar_runner.py` process, can be manually triggered at any time, and will not run if the outputs already exist.