This project implements a simulation of SNP genotype pooling with a simple shifted transversal design. The pooling design uses blocks of 4*4 samples, with 8 pools and a design weight of 2 (each sample is assigned to 2 pools: one row-pool and one column-pool). The encoding and decoding steps of the pooling procedure are shown in the picture, where {0, 1, 2, -1} are the allelic dosages from the true genotype values at one SNP of any sample in (a). The values 0, 1, 2, and -1 stand for homozygous reference allele, heterozygous, homozygous alternate allele, and missing genotype, respectively.
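The encoding and decoding of one block can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names `encode_block` and `decode_block` are hypothetical, and the pooling rules are assumptions (a pool reads 0 or 2 only if all of its samples do, and 1 otherwise; a sample is decoded to 0 or 2 when one of its two pools is homozygous, and left missing otherwise). The input block is assumed to contain no missing genotypes.

```python
import numpy as np


def encode_block(block):
    """Pool a square block of allelic dosages into row-pools and column-pools.

    Assumed rule: a pool is 0 (or 2) only if every sample in it is 0 (or 2);
    any mixture is read as 1.
    """
    block = np.asarray(block)

    def pool(dosages):
        if np.all(dosages == 0):
            return 0
        if np.all(dosages == 2):
            return 2
        return 1

    rows = [pool(block[i, :]) for i in range(block.shape[0])]
    cols = [pool(block[:, j]) for j in range(block.shape[1])]
    return rows, cols


def decode_block(rows, cols):
    """Decode each sample from its row-pool and column-pool.

    Assumed rule: a homozygous pool fixes the genotype of all its samples;
    samples whose two pools both read 1 stay ambiguous (-1).
    """
    decoded = np.full((len(rows), len(cols)), -1)
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            if r == 0 or c == 0:
                decoded[i, j] = 0
            elif r == 2 or c == 2:
                decoded[i, j] = 2
    return decoded


# Illustrative 4*4 block with two heterozygous samples (made-up values).
true_block = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
row_pools, col_pools = encode_block(true_block)
decoded_block = decode_block(row_pools, col_pools)
print(row_pools, col_pools)
print(decoded_block)
```

With this block, the two heterozygous samples make four samples ambiguous after decoding, which is the situation the refined decoding is meant to address.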
Based on a Marginal Likelihood Maximization method, we implemented a refined version of decoding in which the missing true genotypes are converted to posterior genotype probabilities that depend on the position of the sample in the block layout, lambda, and on the pooling pattern, psi. In the above picture (b), lambda = (0, 2, 1, 0), i.e. the allelic dosages of the ambiguous samples after pooling, and psi = ((2, 2, 0), (2, 2, 0)) is the pooling pattern, i.e. 2 row-pools have genotype 0, 2 have genotype 1, and none has genotype 2, and likewise for the column-pools.
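As a small illustration of the pooling pattern psi described above, the sketch below counts how many row-pools and column-pools carry each genotype. The function name `pooling_pattern` and the input pool values are hypothetical; the computation of the posterior probabilities themselves is not shown.

```python
from collections import Counter


def pooling_pattern(row_pools, col_pools):
    """Return psi as ((n0, n1, n2) for row-pools, (n0, n1, n2) for column-pools)."""

    def counts(pools):
        c = Counter(pools)
        # Counter returns 0 for genotypes that do not occur in the pools.
        return (c[0], c[1], c[2])

    return (counts(row_pools), counts(col_pools))


# Hypothetical pooled genotypes for the 4 row-pools and 4 column-pools of a block.
psi = pooling_pattern([0, 1, 1, 0], [0, 1, 0, 1])
print(psi)
```

For these inputs, psi matches the pattern given in the text: two pools with genotype 0, two with genotype 1, and none with genotype 2, for rows and columns alike.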
- a Python 3.6 environment with the packages listed in requirements.txt, e.g. for a Linux-based OS, from the genotypooler folder (if venv for Python 3.6 is not installed: apt install libpython3.6-dev python3.6-venv):
/usr/bin/python3.6 -m venv venv3.6
source venv3.6/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
(see the official venv documentation)
- bcftools installed on the OS. See the official page.
- tabix
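One way to confirm that the external tools are reachable before running the examples is a short check with the Python standard library. This is only a convenience sketch; the helper name `find_tools` is hypothetical and is not part of the project.

```python
import shutil


def find_tools(names=("bcftools", "tabix")):
    """Map each tool name to its path on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
```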
Some data and scripts are provided as use cases in /examples. In particular, the following files can be found:
- adaptive_gls.csv: posterior genotype probabilities of pooled individuals, computed by Marginal Likelihood Maximization with heterozygote degeneracy.
- ALL.chr20.snps.gt.vcf.gz and its index .csi: a subset of 1000 diallelic SNPs on chromosome 20 for 2504 unrelated individuals from the 1000 Genomes Project phase 3.
- TEST.chr20.snps.gt.vcf.gz and its index .csi: a subset of 100 diallelic SNPs on chromosome 20 for 240 unrelated individuals from the 1000 Genomes Project phase 3.
- pooling-ex.py: a minimalistic command-line program for simulating SNP genotype pooling from VCF files.
- pooling-imputing-ex.ipynb: a pipeline showing pooling simulation, imputation of pooled data with Beagle, and visualization of the imputation quality.
Larger data files can be found in /data. They can be used in the same way as the ones created in /examples after executing pooling-ex.py. However, the processing needs to be run in parallel on chunked data:
- From /data, run
bash ../bin/bcfchunkpara.sh IMP.chr20.snps.gt.vcf.gz ./tmp 1000
You should get 53 chunks (0 to 52) in a tmp folder.
- From /runtools, run the script parallel_pooling.py with
python3 parallel_pooling.py ../data/IMP.chr20.snps.gt.vcf.gz ../data/IMP.chr20.pooled.snps.gl.vcf.gz 4
(if you have 4 cores available on your machine). This should output the pooled file /data/tmp/IMP.chr20.pooled.snps.gl.vcf.gz. You can copy this file wherever you want and then delete the /data/tmp folder.
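The chunk numbering above can be checked with a little arithmetic. The sketch below assumes that the last argument of bcfchunkpara.sh (1000) is the number of markers per chunk, which the text does not state explicitly; the helper `chunk_bounds` and the marker count are hypothetical and only illustrate how 53 chunks indexed 0 to 52 can arise.

```python
def chunk_bounds(n_markers, chunk_size=1000):
    """Return (start, end) marker indices of consecutive chunks, end exclusive."""
    return [
        (start, min(start + chunk_size, n_markers))
        for start in range(0, n_markers, chunk_size)
    ]


# Hypothetical marker count: any value from 52,001 to 53,000 yields 53 chunks.
bounds = chunk_bounds(52_500)
print(len(bounds), bounds[0], bounds[-1])
```

Under this assumption, the last chunk is usually shorter than the others, so the per-chunk processing time is not perfectly balanced across cores.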
- DNA Sudoku pooling designs
- Beagle 4.1 articles for phasing and imputation
- Beagle 4.1 documentation and binaries
- The 1000 Genomes Project and its VCF phase 3 data release.
- Our paper in BMC Bioinformatics: "A joint use of pooling and imputation for genotyping SNPs"