genotypooler

Description

This project implements a simulation of SNP genotype pooling with a simple shifted transversal design. The pooling design uses blocks of 4*4 samples, with 8 pools per block and a design weight of 2 (each sample is assigned to 2 pools). In the illustration of the encoding and decoding steps of the pooling procedure, {0, 1, 2, -1} are the allelic dosages of the true genotype values at one SNP for any sample in (a): 0 stands for homozygote for the reference allele, 1 for heterozygote, 2 for homozygote for the alternate allele, and -1 for a missing genotype.
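The encoding and decoding of one block can be sketched as follows. This is a minimal illustration and not the project's implementation: the function names are hypothetical, the input block is assumed to contain no missing genotypes, the 8 pools are assumed to be the 4 rows and 4 columns of the block, and a sample is marked as missing (-1) whenever both of its pools are heterozygous.

```python
def pool_genotype(genotypes):
    """Genotype observed for a pool of samples (assumed rule, hypothetical helper)."""
    genotypes = list(genotypes)
    if all(g == 0 for g in genotypes):
        return 0  # only reference alleles in the pool
    if all(g == 2 for g in genotypes):
        return 2  # only alternate alleles in the pool
    return 1      # both alleles are present in the pool


def encode_block(block):
    """Pool a square block of genotypes into row-pools and column-pools."""
    rows = [pool_genotype(row) for row in block]
    cols = [pool_genotype(col) for col in zip(*block)]
    return rows, cols


def decode_block(rows, cols):
    """Infer each sample from its row-pool and column-pool; -1 means ambiguous."""
    decoded = []
    for r in rows:
        line = []
        for c in cols:
            if r == 0 or c == 0:
                line.append(0)   # a homozygous-reference pool clears all its samples
            elif r == 2 or c == 2:
                line.append(2)   # a homozygous-alternate pool fixes all its samples
            else:
                line.append(-1)  # both pools are heterozygous: genotype is ambiguous
        decoded.append(line)
    return decoded


# One heterozygote and one homozygote for the alternate allele in a 4*4 block
block = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 2, 0, 0],
    [0, 0, 0, 0],
]
rows, cols = encode_block(block)
decoded = decode_block(rows, cols)
print(rows, cols)  # [1, 0, 1, 0] [1, 1, 0, 0]
for line in decoded:
    print(line)
```

In this example, only the four samples at the intersections of heterozygous pools remain ambiguous; all other samples are decoded as 0.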

Based on a Marginal Likelihood Maximization method, we implemented a refined version of decoding in which the missing true genotypes are converted to posterior genotype probabilities. These probabilities depend on the position of the sample in the block layout lambda and on the pooling pattern psi. In the above picture (b), lambda = (0, 2, 1, 0), i.e. the allelic dosages of the ambiguous samples after pooling, and psi = ((2, 2, 0), (2, 2, 0)) is the pooling pattern, i.e. 2 row-pools have genotype 0, 2 have genotype 1, and none has genotype 2, and likewise for the column-pools.
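As an illustration of the pooling pattern psi described above, the counts can be obtained from the pooled genotypes as in the sketch below. The function name `pooling_pattern` and the input lists are hypothetical and are not taken from the project's code; the sketch only counts how many row-pools and column-pools have genotype 0, 1 and 2.

```python
from collections import Counter


def pooling_pattern(row_pools, col_pools):
    """Count the pools with genotype 0, 1 and 2, for rows and then for columns."""
    def counts(pools):
        c = Counter(pools)
        return (c[0], c[1], c[2])
    return counts(row_pools), counts(col_pools)


# Hypothetical pooled genotypes for the 4 row-pools and 4 column-pools of a block
psi = pooling_pattern([1, 0, 1, 0], [1, 1, 0, 0])
print(psi)  # ((2, 2, 0), (2, 2, 0))
```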

Set up

A Python 3.6 environment with the packages listed in requirements.txt, e.g. for a Linux-based OS, from the genotypooler folder:

(if venv for Python 3.6 is not installed: apt install libpython3.6-dev python3.6-venv)

/usr/bin/python3.6 -m venv venv3.6

source venv3.6/bin/activate

pip install --upgrade pip

pip install -r requirements.txt

(see official venv documentation)

bcftools installed on the OS (see the official page).
tabix installed on the OS.
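A quick way to check that both command-line tools can be found on the PATH is sketched below. The helper name `missing_tools` is hypothetical and not part of genotypooler; the sketch only relies on `shutil.which` from the Python standard library and does not check the installed versions.

```python
import shutil


def missing_tools(tools=("bcftools", "tabix")):
    """Return the names of the tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("bcftools and tabix were found on PATH")
```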

Usage

Some data and scripts are provided as use cases in /examples. In particular, the following files can be found:

adaptive_gls.csv: posterior genotype probabilities of pooled individuals, computed by Marginal Likelihood Maximization with heterozygote degeneracy.
ALL.chr20.snps.gt.vcf.gz and its index .csi: a subset of 1000 diallelic SNPs on chromosome 20 for 2504 unrelated individuals from the 1000 Genomes Project phase 3.
TEST.chr20.snps.gt.vcf.gz and its index .csi: a subset of 100 diallelic SNPs on chromosome 20 for 240 unrelated individuals from the 1000 Genomes Project phase 3.
pooling-ex.py: a minimalistic command-line program for simulating SNP genotype pooling from VCF files.
pooling-imputing-ex.ipynb: a pipeline showing pooling simulation, imputation in pooled data with Beagle, and imputation quality visualization.

Larger data files can be found in /data. They can be used in the same way as the files created in /examples after executing pooling-ex.py. However, the processing needs to be run in parallel on chunked data:

From /data, run bash ../bin/bcfchunkpara.sh IMP.chr20.snps.gt.vcf.gz ./tmp 1000. You should get 53 chunks (0 to 52) in a tmp folder.
From /runtools, run the script parallel_pooling.py with python3 parallel_pooling.py ../data/IMP.chr20.snps.gt.vcf.gz ../data/IMP.chr20.pooled.snps.gl.vcf.gz 4 (if you have 4 cores available on your machine). This should output the pooled file /data/tmp/IMP.chr20.pooled.snps.gl.vcf.gz. You can copy this file wherever you want and then delete the /data/tmp folder.

References

DNA Sudoku pooling designs
Beagle 4.1 articles for phasing and imputation
Beagle 4.1 documentation and binaries
The 1000 Genomes Project and its VCF phase 3 data release.
Our paper in BMC Bioinformatics: "A joint use of pooling and imputation for genotyping SNPs"
