Genotype Queries

Dave Lawrence edited this page Feb 22, 2022 · 3 revisions

VCF sample data

Storing VCF sample data (zygosity, allele depth, phred etc.) in a separate row per sample would require a large number of joins when performing multi-sample queries such as Trio and Cohort analyses.

We use the technique from Gemini of packing all of the genotype information for multiple samples into a single row (the snpdb.CohortGenotype model), but we use Postgres arrays instead of binary blobs, which allows us to perform all queries in SQL.

Zygosity is stored in a string, which we can filter using Postgres regex extensions. We benchmarked storing zygosity as a bit field (2 bits to store ref/het/hom/unknown) but the gain was negligible.
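As a rough illustration of filtering a packed zygosity string with a regex, here is a minimal sketch in plain Python. The one-character codes, the `zygosity_regex` helper and the sample ordering are assumptions made for this example; they are not taken from the snpdb code, and Python's `re` module stands in for the Postgres regex operators.

```python
import re

# Hypothetical one-character zygosity codes, chosen for this sketch only
HOM_REF, HET, HOM_ALT, UNKNOWN = "R", "E", "O", "U"


def zygosity_regex(num_samples, required):
    """Build an anchored regex over a packed zygosity string.

    `required` maps a sample's position in the string to the set of
    zygosity codes allowed there; unlisted positions match any code.
    """
    parts = []
    for i in range(num_samples):
        allowed = required.get(i)
        if allowed:
            # Character class of the allowed codes for this sample
            parts.append("[" + "".join(re.escape(c) for c in sorted(allowed)) + "]")
        else:
            # No constraint on this sample: match any single character
            parts.append(".")
    return "^" + "".join(parts) + "$"


# Trio packed as mother, father, proband (an assumed ordering).
# Recessive pattern: both parents het, proband hom alt.
pattern = zygosity_regex(3, {0: {HET}, 1: {HET}, 2: {HOM_ALT}})

# One packed string per variant row (made-up data)
rows = {"variant_a": "EEO", "variant_b": "ERO", "variant_c": "EEE"}
matching = [name for name, zyg in rows.items() if re.match(pattern, zyg)]

# Constrain only the proband; the parents may hold any code
proband_only = zygosity_regex(3, {2: {HET, HOM_ALT}})
```

In Django, a pattern like this could be passed to a `__regex` field lookup, which PostgreSQL evaluates with its `~` operator, so a single filter over one row covers every sample in the cohort without a join per sample.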

CohortGenotype uses data partitioning, with partitions constrained by CohortGenotypeCollection.

Cohort and VCF

Cohorts and VCFs use the same underlying CohortGenotypeCollection data structure so we can run the same code on both.

You can create cohorts using samples from different VCFs (see snpdb.tasks.cohort_genotype_tasks.cohort_genotype_task), but this is done purely at the SQL level: if a sample has no data for a variant, its fields are filled in with NULL or "." (i.e. we do not store BAMs and re-call variants at those positions).
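The fill-in behaviour can be sketched in plain Python. This is only an illustration of the idea, not the SQL that the task runs: the `merge_genotypes` helper, the dictionary layout and the variant keys are all hypothetical, and allele depth stands in for the other per-sample array fields.

```python
MISSING_ZYGOSITY = "."  # placeholder character for "no data for this sample"


def merge_genotypes(vcf_a, vcf_b, n_a, n_b):
    """Combine packed genotype rows from two VCFs into cohort rows.

    Each input maps variant -> (zygosity_string, allele_depth_list),
    holding n_a and n_b samples respectively. A variant absent from one
    VCF gets "." zygosity and None (NULL) depths for that VCF's samples.
    """
    merged = {}
    for variant in sorted(vcf_a.keys() | vcf_b.keys()):
        zyg_a, ad_a = vcf_a.get(variant, (MISSING_ZYGOSITY * n_a, [None] * n_a))
        zyg_b, ad_b = vcf_b.get(variant, (MISSING_ZYGOSITY * n_b, [None] * n_b))
        # Concatenate so each sample keeps a fixed position in the row
        merged[variant] = (zyg_a + zyg_b, ad_a + ad_b)
    return merged


# Two single-sample VCFs with made-up data
vcf_a = {"chr1:100 A>T": ("E", [12])}
vcf_b = {"chr1:100 A>T": ("O", [30]), "chr2:200 G>C": ("E", [8])}

cohort = merge_genotypes(vcf_a, vcf_b, n_a=1, n_b=1)
```

Note that "." here means "not called in this sample's VCF", which is not the same as homozygous reference: without the BAM there is no way to tell whether the position was covered.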

Code

There are utility methods on CohortGenotypeCollection. Other useful ones are:

# These use FilteredRelation to make a JOIN ON to the table partition
# then inner join to restrict query to those records
sample.get_variant_qs()
vcf.get_variant_qs()

Cohort Genotype Common Filters

We split up VCF/Cohort genotype calls according to whether the variant is common (>5% gnomAD frequency); see Cohort Genotype Common Filters.