Introducing models (#344)
* MODELS!

* lazily load config stuff in models

* add toml into package installation

* Bump version: 2.1.1 → 3.0.0
MattWellie authored Dec 8, 2023
1 parent 5218689 commit 2fbb058
Showing 44 changed files with 2,638 additions and 2,251 deletions.
2 changes: 1 addition & 1 deletion .bumpversion.cfg
@@ -1,5 +1,5 @@
[bumpversion]
current_version = 2.1.1
current_version = 3.0.0
commit = True
tag = False

2 changes: 1 addition & 1 deletion .github/workflows/clinvar_runner.yaml
@@ -32,5 +32,5 @@ jobs:
curl --fail --silent --show-error -X POST \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type:application/json" \
-d '{"output": "generate_clinvar_${{ steps.date.outputs.date }}", "dataset": "talos", "accessLevel": "full", "repo": "automated-interpretation-pipeline", "commit": "${{ github.sha }}", "cwd": "reanalysis", "script": ["./clinvar_runner.py"], "description": "Generate Latest Clinvar Summaries", "image": "australia-southeast1-docker.pkg.dev/cpg-common/images/cpg_aip:2.1.1", "config": {"workflow": {"sequencing_type": "genome"}, "cohorts": {"talos": {"clinvar_filter": ["victorian clinical genetics services, murdoch childrens research institute"]}}}, "wait": false}' \
-d '{"output": "generate_clinvar_${{ steps.date.outputs.date }}", "dataset": "talos", "accessLevel": "full", "repo": "automated-interpretation-pipeline", "commit": "${{ github.sha }}", "cwd": "reanalysis", "script": ["./clinvar_runner.py"], "description": "Generate Latest Clinvar Summaries", "image": "australia-southeast1-docker.pkg.dev/cpg-common/images/cpg_aip:3.0.0", "config": {"workflow": {"sequencing_type": "genome"}, "cohorts": {"talos": {"clinvar_filter": ["victorian clinical genetics services, murdoch childrens research institute"]}}}, "wait": false}' \
https://server-a2pko7ameq-ts.a.run.app
6 changes: 3 additions & 3 deletions .github/workflows/docker.yaml
@@ -14,7 +14,7 @@ permissions:
contents: read

env:
VERSION: 2.1.1
VERSION: 3.0.0

jobs:
docker:
@@ -55,8 +55,8 @@ jobs:
- name: manual build
if: ${{ github.event_name == 'workflow_dispatch' }}
run: |
docker tag $IMAGE_NAME:$VERSION $DOCKER_DEV/$IMAGE_NAME:$VERSION
docker push $DOCKER_DEV/$IMAGE_NAME:$VERSION
docker tag $IMAGE_NAME:$VERSION $DOCKER_DEV/$IMAGE_NAME:${{ github.event.inputs.tag }}
docker push $DOCKER_DEV/$IMAGE_NAME:${{ github.event.inputs.tag }}
- name: push latest
if: ${{ github.event_name == 'push' && github.ref_name == 'main' }}
42 changes: 42 additions & 0 deletions CHANGELOG.md
@@ -14,6 +14,48 @@ Suggested headings per release (as appropriate) are:
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

[3.0.0] - 2023-12-08

This was a MASSIVE refactor affecting almost all parts of the codebase. Instead of characterising
data in various places using type hinting, I've now introduced Pydantic models to formally represent
the data used in each part of the application (see reanalysis/models.py). This has the benefit of
making the code more readable, and also allows for the use of Pydantic's validation and parsing.

A key note is that all serialisation/deserialisation of data now passes through a Pydantic validation
layer, forcing the data to conform to the models. This may require some objects (e.g. state/history
files) to be reformatted, as during this work I identified some inconsistencies in the data.
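
A minimal sketch of what this validation layer means for on-disk data (the model below is a
hypothetical stand-in, not the real reanalysis/models.py definitions):

```python
import json

from pydantic import BaseModel, ValidationError


class HistoryFile(BaseModel):
    """hypothetical stand-in for one of the state/history models"""

    version: str
    results: dict[str, list[str]] = {}


def load_history(path: str) -> HistoryFile:
    """JSON that doesn't conform to the model fails loudly at the boundary"""
    with open(path, encoding='utf-8') as handle:
        try:
            return HistoryFile.model_validate(json.load(handle))
        except ValidationError as err:
            # e.g. a pre-3.0.0 history file may need reformatting first
            raise ValueError(f'{path} does not fit the model: {err}') from err
```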

### Added

* Data models for every aspect of the analysis
* Pydantic validation layer for all serialised/deserialised data
* Specifically, all variants inherit from VariantCommon; Structural and Small Variants each have
  a fundamentally different object representation, with a common interface but some internal methods
  specific to each variant type. This is extensible (e.g. further inheritance for STRs) - see the sketch below
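
A rough sketch of that inheritance pattern (field and category names are illustrative, not the exact
models.py contents):

```python
from pydantic import BaseModel


class VariantCommon(BaseModel):
    """common interface shared by all variant models (illustrative fields)"""

    info: dict[str, str | int | float] = {}
    boolean_categories: list[str] = []

    def category_non_support(self) -> bool:
        raise NotImplementedError


class SmallVariant(VariantCommon):
    depths: dict[str, int] = {}

    def category_non_support(self) -> bool:
        # small variants check each of their own boolean category flags
        return any(self.info.get(cat) for cat in self.boolean_categories)


class StructuralVariant(VariantCommon):
    def category_non_support(self) -> bool:
        # SV-specific logic, e.g. an LoF-based 'sv1' category flag (assumed name)
        return bool(self.info.get('categorybooleansv1'))
```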

### Changed

* All minimal-case representations of Variants or Reportable events used in unit tests are now
  represented as full Pydantic models for the corresponding data type
* So many little things, this was a +3000/-2000 line change

[2.1.0] - 2023-11-20

This change involved the introduction of SV data, and the co-processing of SV and Small Variant
data in a unified analysis process. The expected SV data format is based on the GATK-SV pipeline
and its annotation.

This also includes the addition of the `sv1` category, which represents LoF structural variants.

### Added

* SV data is now processed alongside small variants
* `sv1` category specific to SV data

### Changed

* Some of the AbstractVariant loading is modified to allow for SV data (e.g. defaults for depths)

[2.0.0 & 2.0.1] - 2023-10-26

This bump is less substantial than the version number suggests, but in hindsight the previous
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -1,2 +1,3 @@
include reanalysis/*.json
include reanalysis/templates/*.jinja
include reanalysis/reanalysis_global.toml
86 changes: 46 additions & 40 deletions design_docs/MOI_Checks.md
@@ -20,24 +20,28 @@ variant to decide whether we include it in the final report.
Static Objects:

- Configuration File: used throughout the pipeline, this contains all runtime settings
  - When we are assessing each variant, we may choose to apply thresholds (e.g. MAF, frequency in a joint call). These
    may change with each run, so we take those parameters from a central configuration file
- PanelApp data: associates genes with evidenced inheritance patterns
  - For each 'green' (high evidence) gene, this contains the inheritance pattern to be applied & panels where the gene
    is 'new' (if any).

Dynamic Objects:

- AbstractVariant: Each variant in turn is read from the source (VCF) and cast into a custom AbstractVariant class
  representation. This is for a couple of reasons:

  1. CyVCF2 was chosen as a parsing library due to speed relative to pyVCF, but each library provides a slightly
     different representation. This normalises all logic to a common class, and allows for different sources/parsers
     to generate a common format.
  2. CyVCF2 and PyVCF variant object formats contain structures which cannot be pickled by default. This leads to
     issues with introducing parallelisation into the code. A dictionary/set/list-based class can be pickled easily,
     so async/await structure can be added in at any level.
  3. Making an abstract object from simple types will lead to simpler unit testing.

- Compound Het lookup: This object is built per-gene as we iterate over all genes independently

@@ -52,7 +56,7 @@ following details:
rather than storing static booleans
6. If physical phasing is evident for any sample calls, record the numerical phase set ID, and sample-specific GT
1. Within a phase set ID, two variants are in-phase if their GT is also the same, i.e. for phase set #1, variants
   with the GT `0|1` and `1|0` are explicitly out of phase, so knowing the PS ID alone is not sufficient

The AbstractVariant has a few internal methods:

@@ -63,31 +67,31 @@ The AbstractVariant has a few internal methods:
- `category_non_support`: returns True if any non-support category is assigned
- `is_classified`: returns True if any category is assigned
- `category_values`: returns a list of strings for each assigned category, i.e.
  - if a variant is True for category 2 and 3, the return will be `['2', '3']`
  - if a variant is True for category 2 and 4, the return will be `['2', 'de_novo']`
- `sample_specific_category_check`: returns True if the specific sample is _de novo_,
  or if the variant has a boolean category assigned - accepts a switch to also check for `Support`
- `get_sample_flags`: checks for any variant flags - currently this only implements one check - AB Ratio test
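
A rough sketch of how the category helpers relate to one another (bodies are illustrative; the real
logic lives on the variant models):

```python
class CategorisedVariant:
    """illustrative skeleton, sketching only the category helpers"""

    def __init__(self, categories: list[str], support_only: set[str] | None = None):
        self.categories = categories  # e.g. ['2', 'de_novo']
        self.support_only = support_only or {'support'}  # assumed support label

    def category_values(self) -> list[str]:
        return sorted(self.categories)

    def is_classified(self) -> bool:
        return bool(self.categories)

    def category_non_support(self) -> bool:
        return any(cat not in self.support_only for cat in self.categories)
```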

## Variant Gathering

The Variant gathering strategy observed in this application is:

- parse the VCF header to obtain all chromosome names
- for each contig in turn, extract all variants & create an Abstract representation of each
- group variants by gene ID, forming a structure `{'gene_name': [Variant1, Variant2, ...], }`
  - each variant contains exactly one gene annotation, from an earlier split, so grouping is accurate
  - each gene can be processed separately if required, allowing logical parallelisation breaks
- once all variants on a contig are extracted, iterate over variants grouped by gene (a rough sketch follows)
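
A rough sketch of that gathering loop (region queries assume an indexed VCF; `gene_id` is an assumed
INFO key rather than the real annotation name):

```python
from collections import defaultdict
from typing import Any

from cyvcf2 import VCF


def gather_by_gene(vcf_path: str, contig: str) -> dict[str, list[dict[str, Any]]]:
    """extract one contig's variants, abstracted and grouped by gene ID"""
    by_gene: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for variant in VCF(vcf_path)(contig):
        # cast to plain types so each gene's group can be pickled for parallel work
        abstracted = {
            'coords': f'{variant.CHROM}-{variant.POS}-{variant.REF}-{variant.ALT[0]}',
            'info': dict(variant.INFO),
        }
        by_gene[abstracted['info'].get('gene_id')].append(abstracted)
    return by_gene
```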

Justification:

1. Use of the AbstractVariant structure means each variant can be pickled, so each group can be processed in parallel
   (Cyvcf2 and pyvcf objects can't be pickled directly)
2. Allows us to collect the panelapp details once per gene, instead of once per variant; minor efficiency gain
3. Gathering all variants relevant to an MOI test means that when generating the results, we can access all attributes
   of the relevant variants for presentation, e.g. instead of just variant coordinates for the variant pair, we can
   show the exact consequence(s) that led to category labelling

## Compound-Heterozygous checks

Expand All @@ -106,11 +110,12 @@ for this sample. These objects can be retrieved directly from this dictionary, a
inheritance checks.

```python
from reanalysis.models import SmallVariant, StructuralVariant

_comp_het = {
    "sample": {
        "coords_string_1": [SmallVariant, StructuralVariant],
        "coords_string_2": [SmallVariant, SmallVariant]
}
}
```
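
Partner retrieval is then a pair of dictionary fetches, defaulting to an empty list so the MOI logic
can iterate unconditionally, e.g.:

```python
# hedged example: 'sample' and the coordinate string are placeholder keys
for partner in _comp_het.get('sample', {}).get('coords_string_1', []):
    ...  # run the paired-variant inheritance checks against each partner
```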
@@ -126,28 +131,28 @@ Includes one controlling class `MoiRunner`, and one functional class `BaseMoi` a
derivatives each define a single Mode of Inheritance e.g.

- DominantAutosomal - requires a variant to be exceptionally rare, and homozygotes to be absent from population
  databases. All samples with a heterozygous variant call are passed as fitting with the MOI

- XRecessive - Male samples are passed so long as the AF is below a stringent threshold, Female samples must be
  Homozygous or supported in a trans compound-het

The separation between the methods defining the filter algorithm, and the MoiRunner using one or more algorithms
allows multiple filters to be applied for a given MOI string e.g.

- "BOTH monoallelic and biallelic, autosomal or pseudoautosomal" from PanelApp is easy to interpret as
  2 filters, monoallelic, and biallelic
- "X-LINKED: hemizygous mutation in males, biallelic mutations in females" for male samples we call a
  DominantMOI filter, for females we call a Recessive filter

The usage paradigm is:

1. Create an instance of the `MoiRunner`, passing a target MOI string and some configuration parameters
2. During the setup, the MOI string is used to determine which filters to add into the filter list
3. The `MoiRunner` implements a `.run()` method, which takes a single variant and the lookup of all compound-hets within
   this gene, and passes through each filter in turn
4. Where a variant passes all conditions within a filter class, a 'result' object is created
5. The result of `.run()` is a list of valid modes of inheritance (ReportedVariant), each including details of the
   variant(s), and samples (a condensed sketch follows)
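
A condensed sketch of that paradigm (constructor arguments, thresholds, and the result shape are all
illustrative):

```python
class BaseMoi:
    """one filter == one concrete Mode of Inheritance check"""

    def run(self, variant: dict, comp_hets: dict) -> list[dict]:
        raise NotImplementedError


class DominantAutosomal(BaseMoi):
    def run(self, variant: dict, comp_hets: dict) -> list[dict]:
        if variant.get('af', 1.0) > 0.0001:  # illustrative rarity threshold
            return []
        return [{'sample': sam, 'variant': variant} for sam in variant.get('het_samples', [])]


class MoiRunner:
    def __init__(self, target_moi: str):
        self.filter_list: list[BaseMoi] = []
        if 'monoallelic' in target_moi.lower():
            self.filter_list.append(DominantAutosomal())
        # ... 'biallelic' / X-linked MOI strings would append further filters

    def run(self, variant: dict, comp_hets: dict) -> list[dict]:
        results: list[dict] = []
        for moi_filter in self.filter_list:
            results.extend(moi_filter.run(variant, comp_hets))
        return results
```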

### Family Checks

@@ -156,11 +161,11 @@ of participants. The inheritance rules are applied unilaterally to every member
directional manner, e.g. when considering each candidate variant:

- for a monoallelic variant, inherited with complete penetrance, we enforce a unilateral rule - every person with the
  variant must also be affected. If this rule is violated for any individual member, the variant does not pass the
  inheritance test (a small sketch of this rule follows below).
- for a biallelic variant pair, all participants with the variant pair must also be affected. If the variant considered
  is Homozygous, we permit unaffected members to be heterozygous. Similarly, for compound-het inheritance, unaffected
  participants may have either variant but not both.
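
A small sketch of the complete-penetrance rule from the first bullet (the pedigree representation is
assumed):

```python
def passes_complete_penetrance(carriers: set[str], affected: set[str]) -> bool:
    """monoallelic, complete penetrance: every carrier must also be affected"""
    return carriers <= affected


# a single unaffected carrier fails the variant for the whole family
assert passes_complete_penetrance({'proband'}, {'proband'})
assert not passes_complete_penetrance({'proband', 'father'}, {'proband'})
```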

Complete and Incomplete penetrance modes are available in this module. For the Complete penetrance
inheritance model:
@@ -206,7 +211,8 @@ which will generate a reported variant structure for the sample(s) which passed
## Flags

When a reportable event is found, a JSON blob representing the variant and sample is created. If the relevant gene was
in one or more of the additional panels requested (see [additional panels](PanelApp_interaction.md#per-participant-panels)),
the names/IDs of those panels are appended to the list of any variant-specific panels.

This leaves us the flexibility to mark individual samples/families as having disease-relevant panels, which can then be
6 changes: 5 additions & 1 deletion helpers/forbidden_gene_check.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,8 @@

from cpg_utils import to_path

from reanalysis.models import PhenotypeMatchedPanels

PANELAPP_BASE = 'https://panelapp.agha.umccr.org/api/v1/panels'


@@ -177,7 +179,9 @@ def main(panels: str | None, out_path: str, dates: list[str]):
assert dates, 'whats the point doing this with no dates?'

if panels:
all_panels = read_panels_from_participant_file(panels)
with open(panels) as handle:
panel_obj = PhenotypeMatchedPanels.model_validate(json.load(handle))
all_panels = panel_obj.all_panels
else:
all_panels = {137}

59 changes: 25 additions & 34 deletions helpers/minimise_output_for_seqr.py
@@ -16,9 +16,8 @@

import json
from argparse import ArgumentParser
from collections import defaultdict

import toml
from reanalysis.models import MiniForSeqr, MiniVariant, ResultData


def coord_to_string(coord: dict) -> str:
@@ -34,54 +33,46 @@ def coord_to_string(coord: dict) -> str:
return f"{coord['chrom']}-{coord['pos']}-{coord['ref']}-{coord['alt']}"


def main(input_file: str, output: str, config_file: str):
def main(input_file: str, output: str):
"""
reads in the input file, shrinks it, and writes the output file
Args:
input_file (str):
output (str):
config_file (str):
Returns:
"""
with open(config_file, encoding='utf-8') as f:
config = toml.load(f)

with open(input_file, encoding='utf-8') as f:
data = json.load(f)

lil_data = {
'metadata': {'categories': config['categories']},
'results': defaultdict(dict),
}

for individual, details in data['results'].items():
for variant in details['variants']:
var_data = variant['var_data']
lil_data['results'][individual][ # type: ignore
coord_to_string(var_data['coords'])
] = { # type: ignore
'categories': var_data['categories'],
# 'labels': variant['labels'],
'support_vars': variant['support_vars'],
# 'independent': variant['independent'],
}
data = ResultData.model_validate(json.load(f))

lil_data = MiniForSeqr(
**{
'metadata': {'categories': data.metadata.categories},
}
)

for individual, details in data.results.items():
lil_data.results[individual] = {}
for variant in details.variants:
var_data = variant.var_data
lil_data.results[individual][
var_data.coordinates.string_format
] = MiniVariant(
**{
'categories': variant.categories,
'support_vars': variant.support_vars,
'independent': variant.independent,
}
)

with open(output, 'w', encoding='utf-8') as f:
json.dump(lil_data, f, indent=4)
f.write(MiniForSeqr.model_validate(lil_data).model_dump_json(indent=4))


if __name__ == '__main__':
parser = ArgumentParser()
parser.add_argument('input_file', help='the input file to process')
parser.add_argument('output_file', help='the output file to write to')
parser.add_argument('config_file', help='the config file to use')
args = parser.parse_args()

main(
input_file=args.input_file,
output=args.output_file,
config_file=args.config_file,
)
main(input_file=args.input_file, output=args.output_file)
2 changes: 1 addition & 1 deletion helpers/prepare_aip_cohort.py
@@ -270,7 +270,7 @@ def main(

panel_remote = remote_root / 'participant_panels.json'
with panel_remote.open('w') as handle:
json.dump(hpo_panel_dict, handle, indent=4, default=list)
handle.write(hpo_panel_dict.model_dump_json(indent=4))
logging.info(f'Wrote panel file to {panel_remote}')

# get the list of all pedigree members as list of dictionaries