Introducing models (#344)
* MODELS!

* lazily load config stuff in models

* add toml into package installation

* Bump version: 2.1.1 → 3.0.0
MattWellie authored Dec 8, 2023
1 parent 5218689 commit 2fbb058
Showing 44 changed files with 2,638 additions and 2,251 deletions.
2 changes: 1 addition & 1 deletion .bumpversion.cfg
@@ -1,5 +1,5 @@
[bumpversion]
current_version = 2.1.1
current_version = 3.0.0
commit = True
tag = False

2 changes: 1 addition & 1 deletion .github/workflows/clinvar_runner.yaml
@@ -32,5 +32,5 @@ jobs:
curl --fail --silent --show-error -X POST \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type:application/json" \
-d '{"output": "generate_clinvar_${{ steps.date.outputs.date }}", "dataset": "talos", "accessLevel": "full", "repo": "automated-interpretation-pipeline", "commit": "${{ github.sha }}", "cwd": "reanalysis", "script": ["./clinvar_runner.py"], "description": "Generate Latest Clinvar Summaries", "image": "australia-southeast1-docker.pkg.dev/cpg-common/images/cpg_aip:2.1.1", "config": {"workflow": {"sequencing_type": "genome"}, "cohorts": {"talos": {"clinvar_filter": ["victorian clinical genetics services, murdoch childrens research institute"]}}}, "wait": false}' \
-d '{"output": "generate_clinvar_${{ steps.date.outputs.date }}", "dataset": "talos", "accessLevel": "full", "repo": "automated-interpretation-pipeline", "commit": "${{ github.sha }}", "cwd": "reanalysis", "script": ["./clinvar_runner.py"], "description": "Generate Latest Clinvar Summaries", "image": "australia-southeast1-docker.pkg.dev/cpg-common/images/cpg_aip:3.0.0", "config": {"workflow": {"sequencing_type": "genome"}, "cohorts": {"talos": {"clinvar_filter": ["victorian clinical genetics services, murdoch childrens research institute"]}}}, "wait": false}' \
https://server-a2pko7ameq-ts.a.run.app
6 changes: 3 additions & 3 deletions .github/workflows/docker.yaml
@@ -14,7 +14,7 @@ permissions:
contents: read

env:
VERSION: 2.1.1
VERSION: 3.0.0

jobs:
docker:
@@ -55,8 +55,8 @@ jobs:
- name: manual build
if: ${{ github.event_name == 'workflow_dispatch' }}
run: |
docker tag $IMAGE_NAME:$VERSION $DOCKER_DEV/$IMAGE_NAME:$VERSION
docker push $DOCKER_DEV/$IMAGE_NAME:$VERSION
docker tag $IMAGE_NAME:$VERSION $DOCKER_DEV/$IMAGE_NAME:${{ github.event.inputs.tag }}
docker push $DOCKER_DEV/$IMAGE_NAME:${{ github.event.inputs.tag }}
- name: push latest
if: ${{ github.event_name == 'push' && github.ref_name == 'main' }}
42 changes: 42 additions & 0 deletions CHANGELOG.md
@@ -14,6 +14,48 @@ Suggested headings per release (as appropriate) are:
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

[3.0.0] - 2023-12-08

This was a MASSIVE refactor affecting almost all parts of the codebase. Instead of characterising
data in various places using type hinting, I've now introduced Pydantic models to formally represent
the data used in each part of the application (see reanalysis/models.py). This has the benefit of
making the code more readable, and also allows for the use of Pydantic's validation and parsing.

A key note is that all serialisation/deserialisation of data now passes through a Pydantic validation
layer, forcing the data to conform to the models. This may require some objects (e.g. state/history
files) to be reformatted, as during this work I identified some inconsistencies in the data.
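
A minimal sketch of what this validation layer means for on-disk data (the model below is a
hypothetical stand-in, not the real reanalysis/models.py definitions):

```python
import json

from pydantic import BaseModel, ValidationError


class HistoryFile(BaseModel):
    """hypothetical stand-in for one of the state/history models"""

    version: str
    results: dict[str, list[str]] = {}


def load_history(path: str) -> HistoryFile:
    """JSON that doesn't conform to the model fails loudly at the boundary"""
    with open(path, encoding='utf-8') as handle:
        try:
            return HistoryFile.model_validate(json.load(handle))
        except ValidationError as err:
            # e.g. a pre-3.0.0 history file may need reformatting first
            raise ValueError(f'{path} does not fit the model: {err}') from err
```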

### Added

* Data models for every aspect of the analysis
* Pydantic validation layer for all serialised/deserialised data
* Specifically, all variants inherit from VariantCommon; Structural and Small Variants each have
  a fundamentally different object representation, with a common interface but some internal methods
  specific to each variant type. This is extensible (e.g. further inheritance for STRs) - see the sketch below
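
A rough sketch of that inheritance pattern (field and category names are illustrative, not the exact
models.py contents):

```python
from pydantic import BaseModel


class VariantCommon(BaseModel):
    """common interface shared by all variant models (illustrative fields)"""

    info: dict[str, str | int | float] = {}
    boolean_categories: list[str] = []

    def category_non_support(self) -> bool:
        raise NotImplementedError


class SmallVariant(VariantCommon):
    depths: dict[str, int] = {}

    def category_non_support(self) -> bool:
        # small variants check each of their own boolean category flags
        return any(self.info.get(cat) for cat in self.boolean_categories)


class StructuralVariant(VariantCommon):
    def category_non_support(self) -> bool:
        # SV-specific logic, e.g. an LoF-based 'sv1' category flag (assumed name)
        return bool(self.info.get('categorybooleansv1'))
```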

### Changed

* All minimal-case representations of Variants or Reportable events used in unit tests are now
  represented as full Pydantic models for the corresponding data type
* So many little things, this was a +3000/-2000 line change

[2.1.0] - 2023-11-20

This change involved the introduction of SV data, and the co-processing of SV and Small Variant
data in a unified analysis process. The expected SV data format is based on the GATK-SV pipeline
and its annotation.

This also includes the addition of the `sv1` category, which represents LoF structural variants.

### Added

* SV data is now processed alongside small variants
* `sv1` category specific to SV data

### Changed

* Some of the AbstractVariant loading is modified to allow for SV data (e.g. defaults for depths)

[2.0.0 & 2.0.1] - 2023-10-26

This bump is less substantial than the version number suggests, but in hindsight the previous
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -1,2 +1,3 @@
include reanalysis/*.json
include reanalysis/templates/*.jinja
include reanalysis/reanalysis_global.toml
86 changes: 46 additions & 40 deletions design_docs/MOI_Checks.md
@@ -20,24 +20,28 @@ variant to decide whether we include it in the final report.
Static Objects:

- Configuration File: used throughout the pipeline, this contains all runtime settings
  - When we are assessing each variant, we may choose to apply thresholds (e.g. MAF, frequency in a joint call). These
    may change with each run, so we take those parameters from a central configuration file
- PanelApp data: associates genes with evidenced inheritance patterns
  - For each 'green' (high evidence) gene, this contains the inheritance pattern to be applied & panels where the gene
    is 'new' (if any).

Dynamic Objects:

- AbstractVariant: Each variant in turn is read from the source (VCF) and cast into a custom AbstractVariant class
  representation. This is for a couple of reasons:

  1. CyVCF2 was chosen as a parsing library due to speed relative to pyVCF, but each library provides a slightly
     different representation. This normalises all logic to a common class, and allows for different sources/parsers
     to generate a common format.
  2. CyVCF2 and PyVCF variant object formats contain structures which cannot be pickled by default. This leads to
     issues with introducing parallelisation into the code. A dictionary/set/list-based class can be pickled easily,
     so async/await structure can be added in at any level.
  3. Making an abstract object from simple types will lead to simpler unit testing.

- Compound Het lookup: This object is built per-gene as we iterate over all genes independently

@@ -52,7 +56,7 @@ following details:
rather than storing static booleans
6. If physical phasing is evident for any sample calls, record the numerical phase set ID, and sample-specific GT
1. Within a phase set ID, two variants are in-phase if their GT is also the same, i.e. for phase set #1, variants
   with the GT `0|1` and `1|0` are explicitly out of phase, so knowing the PS ID alone is not sufficient

The AbstractVariant has a few internal methods:

@@ -63,31 +67,31 @@ The AbstractVariant has a few internal methods:
- `category_non_support`: returns True if any non-support category is assigned
- `is_classified`: returns True if any category is assigned
- `category_values`: returns a list of strings for each assigned category, i.e.
  - if a variant is True for category 2 and 3, the return will be `['2', '3']`
  - if a variant is True for category 2 and 4, the return will be `['2', 'de_novo']`
- `sample_specific_category_check`: returns True if the specific sample is _de novo_,
  or if the variant has a boolean category assigned - accepts a switch to also check for `Support`
- `get_sample_flags`: checks for any variant flags - currently this only implements one check - AB Ratio test
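
A rough sketch of how the category helpers relate to one another (bodies are illustrative; the real
logic lives on the variant models):

```python
class CategorisedVariant:
    """illustrative skeleton, sketching only the category helpers"""

    def __init__(self, categories: list[str], support_only: set[str] | None = None):
        self.categories = categories  # e.g. ['2', 'de_novo']
        self.support_only = support_only or {'support'}  # assumed support label

    def category_values(self) -> list[str]:
        return sorted(self.categories)

    def is_classified(self) -> bool:
        return bool(self.categories)

    def category_non_support(self) -> bool:
        return any(cat not in self.support_only for cat in self.categories)
```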

## Variant Gathering

The Variant gathering strategy observed in this application is:

- parse the VCF header to obtain all chromosome names
- for each contig in turn, extract all variants & create an Abstract representation of each
- group variants by gene ID, forming a structure `{'gene_name': [Variant1, Variant2, ...], }`
  - each variant contains exactly one gene annotation, from an earlier split, so grouping is accurate
  - each gene can be processed separately if required, allowing logical parallelisation breaks
- once all variants on a contig are extracted, iterate over variants grouped by gene (a rough sketch follows)
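
A rough sketch of that gathering loop (region queries assume an indexed VCF; `gene_id` is an assumed
INFO key rather than the real annotation name):

```python
from collections import defaultdict
from typing import Any

from cyvcf2 import VCF


def gather_by_gene(vcf_path: str, contig: str) -> dict[str, list[dict[str, Any]]]:
    """extract one contig's variants, abstracted and grouped by gene ID"""
    by_gene: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for variant in VCF(vcf_path)(contig):
        # cast to plain types so each gene's group can be pickled for parallel work
        abstracted = {
            'coords': f'{variant.CHROM}-{variant.POS}-{variant.REF}-{variant.ALT[0]}',
            'info': dict(variant.INFO),
        }
        by_gene[abstracted['info'].get('gene_id')].append(abstracted)
    return by_gene
```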

Justification:

1. Use of the AbstractVariant structure means each variant can be pickled, so each group can be processed in parallel
   (Cyvcf2 and pyvcf objects can't be pickled directly)
2. Allows us to collect the panelapp details once per gene, instead of once per variant; minor efficiency gain
3. Gathering all variants relevant to an MOI test means that when generating the results, we can access all attributes
   of the relevant variants for presentation, e.g. instead of just variant coordinates for the variant pair, we can
   show the exact consequence(s) that led to category labelling

## Compound-Heterozygous checks

Expand All @@ -106,11 +110,12 @@ for this sample. These objects can be retrieved directly from this dictionary, a
inheritance checks.

```python
from reanalysis.models import SmallVariant, StructuralVariant

_comp_het = {
    "sample": {
        "coords_string_1": [SmallVariant, StructuralVariant],
        "coords_string_2": [SmallVariant, SmallVariant]
}
}
```
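
Partner retrieval is then a pair of dictionary fetches, defaulting to an empty list so the MOI logic
can iterate unconditionally, e.g.:

```python
# hedged example: 'sample' and the coordinate string are placeholder keys
for partner in _comp_het.get('sample', {}).get('coords_string_1', []):
    ...  # run the paired-variant inheritance checks against each partner
```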
@@ -126,28 +131,28 @@ Includes one controlling class `MoiRunner`, and one functional class `BaseMoi` a
derivatives each define a single Mode of Inheritance e.g.

- DominantAutosomal - requires a variant to be exceptionally rare, and homozygotes to be absent from population
  databases. All samples with a heterozygous variant call are passed as fitting with the MOI

- XRecessive - Male samples are passed so long as the AF is below a stringent threshold, Female samples must be
  Homozygous or supported in a trans compound-het

The separation between the methods defining the filter algorithm, and the MoiRunner using one or more algorithms
allows multiple filters to be applied for a given MOI string e.g.

- "BOTH monoallelic and biallelic, autosomal or pseudoautosomal" from PanelApp is easy to interpret as
  2 filters, monoallelic, and biallelic
- "X-LINKED: hemizygous mutation in males, biallelic mutations in females" for male samples we call a
  DominantMOI filter, for females we call a Recessive filter

The usage paradigm is:

1. Create an instance of the `MoiRunner`, passing a target MOI string and some configuration parameters
2. During the setup, the MOI string is used to determine which filters to add into the filter list
3. The `MoiRunner` implements a `.run()` method, which takes a single variant and the lookup of all compound-hets within
   this gene, and passes through each filter in turn
4. Where a variant passes all conditions within a filter class, a 'result' object is created
5. The result of `.run()` is a list of valid modes of inheritance (ReportedVariant), each including details of the
   variant(s), and samples (a condensed sketch follows)
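
A condensed sketch of that paradigm (constructor arguments, thresholds, and the result shape are all
illustrative):

```python
class BaseMoi:
    """one filter == one concrete Mode of Inheritance check"""

    def run(self, variant: dict, comp_hets: dict) -> list[dict]:
        raise NotImplementedError


class DominantAutosomal(BaseMoi):
    def run(self, variant: dict, comp_hets: dict) -> list[dict]:
        if variant.get('af', 1.0) > 0.0001:  # illustrative rarity threshold
            return []
        return [{'sample': sam, 'variant': variant} for sam in variant.get('het_samples', [])]


class MoiRunner:
    def __init__(self, target_moi: str):
        self.filter_list: list[BaseMoi] = []
        if 'monoallelic' in target_moi.lower():
            self.filter_list.append(DominantAutosomal())
        # ... 'biallelic' / X-linked MOI strings would append further filters

    def run(self, variant: dict, comp_hets: dict) -> list[dict]:
        results: list[dict] = []
        for moi_filter in self.filter_list:
            results.extend(moi_filter.run(variant, comp_hets))
        return results
```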

### Family Checks

@@ -156,11 +161,11 @@ of participants. The inheritance rules are applied unilaterally to every member
directional manner, e.g. when considering each candidate variant:

- for a monoallelic variant, inherited with complete penetrance, we enforce a unilateral rule - every person with the
  variant must also be affected. If this rule is violated for any individual member, the variant does not pass the
  inheritance test (a small sketch of this rule follows below).
- for a biallelic variant pair, all participants with the variant pair must also be affected. If the variant considered
  is Homozygous, we permit unaffected members to be heterozygous. Similarly, for compound-het inheritance, unaffected
  participants may have either variant but not both.
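
A small sketch of the complete-penetrance rule from the first bullet (the pedigree representation is
assumed):

```python
def passes_complete_penetrance(carriers: set[str], affected: set[str]) -> bool:
    """monoallelic, complete penetrance: every carrier must also be affected"""
    return carriers <= affected


# a single unaffected carrier fails the variant for the whole family
assert passes_complete_penetrance({'proband'}, {'proband'})
assert not passes_complete_penetrance({'proband', 'father'}, {'proband'})
```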

Complete and Incomplete penetrance modes are available in this module. For the Complete penetrance
inheritance model:
@@ -206,7 +211,8 @@ which will generate a reported variant structure for the sample(s) which passed
## Flags

When a reportable event is found, a JSON blob representing the variant and sample is created. If the relevant gene was
in one or more of the additional panels requested (see [additional panels](PanelApp_interaction.md#per-participant-panels)),
the names/IDs of those panels are appended to the list of any variant-specific panels.

This leaves us the flexibility to mark individual samples/families as having disease-relevant panels, which can then be
6 changes: 5 additions & 1 deletion helpers/forbidden_gene_check.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,8 @@

from cpg_utils import to_path

from reanalysis.models import PhenotypeMatchedPanels

PANELAPP_BASE = 'https://panelapp.agha.umccr.org/api/v1/panels'


@@ -177,7 +179,9 @@ def main(panels: str | None, out_path: str, dates: list[str]):
assert dates, 'whats the point doing this with no dates?'

if panels:
all_panels = read_panels_from_participant_file(panels)
with open(panels) as handle:
panel_obj = PhenotypeMatchedPanels.model_validate(json.load(handle))
all_panels = panel_obj.all_panels
else:
all_panels = {137}

59 changes: 25 additions & 34 deletions helpers/minimise_output_for_seqr.py
@@ -16,9 +16,8 @@

import json
from argparse import ArgumentParser
from collections import defaultdict

import toml
from reanalysis.models import MiniForSeqr, MiniVariant, ResultData


def coord_to_string(coord: dict) -> str:
@@ -34,54 +33,46 @@ def coord_to_string(coord: dict) -> str:
return f"{coord['chrom']}-{coord['pos']}-{coord['ref']}-{coord['alt']}"


def main(input_file: str, output: str, config_file: str):
def main(input_file: str, output: str):
"""
reads in the input file, shrinks it, and writes the output file
Args:
input_file (str):
output (str):
config_file (str):
Returns:
"""
with open(config_file, encoding='utf-8') as f:
config = toml.load(f)

with open(input_file, encoding='utf-8') as f:
data = json.load(f)

lil_data = {
'metadata': {'categories': config['categories']},
'results': defaultdict(dict),
}

for individual, details in data['results'].items():
for variant in details['variants']:
var_data = variant['var_data']
lil_data['results'][individual][ # type: ignore
coord_to_string(var_data['coords'])
] = { # type: ignore
'categories': var_data['categories'],
# 'labels': variant['labels'],
'support_vars': variant['support_vars'],
# 'independent': variant['independent'],
}
data = ResultData.model_validate(json.load(f))

lil_data = MiniForSeqr(
**{
'metadata': {'categories': data.metadata.categories},
}
)

for individual, details in data.results.items():
lil_data.results[individual] = {}
for variant in details.variants:
var_data = variant.var_data
lil_data.results[individual][
var_data.coordinates.string_format
] = MiniVariant(
**{
'categories': variant.categories,
'support_vars': variant.support_vars,
'independent': variant.independent,
}
)

with open(output, 'w', encoding='utf-8') as f:
json.dump(lil_data, f, indent=4)
f.write(MiniForSeqr.model_validate(lil_data).model_dump_json(indent=4))


if __name__ == '__main__':
parser = ArgumentParser()
parser.add_argument('input_file', help='the input file to process')
parser.add_argument('output_file', help='the output file to write to')
parser.add_argument('config_file', help='the config file to use')
args = parser.parse_args()

main(
input_file=args.input_file,
output=args.output_file,
config_file=args.config_file,
)
main(input_file=args.input_file, output=args.output_file)
2 changes: 1 addition & 1 deletion helpers/prepare_aip_cohort.py
@@ -270,7 +270,7 @@ def main(

panel_remote = remote_root / 'participant_panels.json'
with panel_remote.open('w') as handle:
json.dump(hpo_panel_dict, handle, indent=4, default=list)
handle.write(hpo_panel_dict.model_dump_json(indent=4))
logging.info(f'Wrote panel file to {panel_remote}')

# get the list of all pedigree members as list of dictionaries