camcl/genotypooler

genotypooler

Description

This project implements a simulation of SNP genotype pooling with a simple shifted transversal design. The pooling design uses blocks of 4 x 4 samples, with 8 pools per block and a design weight of 2 (each sample belongs to exactly 2 pools). The encoding and decoding parts of the pooling procedure can be represented as follows:

[Figure: Pooling simulation on genotypes in the DNA Sudoku style.] In (a), {0, 1, 2, -1} are the allelic dosages from the true genotype values at one SNP of any sample. They stand for homozygote reference allele, heterozygote, homozygote alternate allele, and missing genotype, respectively.

Based on a Marginal Likelihood Maximization method, we implemented a refined version of decoding in which the missing true genotypes are converted to posterior genotype probabilities, depending on the position of the sample in the block layout lambda and on the pooling pattern psi. In picture (b) above, lambda = (0, 2, 1, 0), i.e. the allelic dosages of the ambiguous samples after pooling, and psi = ((2, 2, 0), (2, 2, 0)) is the pooling pattern, i.e. 2 row-pools have genotype 0, 2 have genotype 1, and none has genotype 2; likewise for the column-pools.
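As an illustration of the encoding and simple decoding steps on one 4 x 4 block, here is a minimal sketch in plain Python. It is not the code of this repository: the function names (pool_genotype, encode_block, pooling_pattern, decode_block) are hypothetical, the pooling rule (a pool is called 0 or 2 only when all of its samples are, and 1 otherwise) is an assumption, and the decoding is simplified so that every sample sitting at the intersection of two heterozygous pools is marked as missing (-1). The example block contains no missing input genotypes.

```python
from collections import Counter


def pool_genotype(dosages):
    """Assumed pooling rule: 0 or 2 only if every sample in the pool is, else 1."""
    unique = set(dosages)
    if unique == {0}:
        return 0
    if unique == {2}:
        return 2
    return 1


def encode_block(block):
    """Pool a square block along its rows and columns (4 + 4 = 8 pools for 4 x 4)."""
    row_pools = [pool_genotype(row) for row in block]
    col_pools = [pool_genotype(col) for col in zip(*block)]
    return row_pools, col_pools


def pooling_pattern(row_pools, col_pools):
    """Count the pools with genotype 0, 1 and 2 among rows, then among columns."""
    def counts(pools):
        c = Counter(pools)
        return (c[0], c[1], c[2])
    return counts(row_pools), counts(col_pools)


def decode_block(row_pools, col_pools):
    """Simplified decoding: a homozygous pool fixes the genotype of all its samples,
    and a sample in two heterozygous pools stays ambiguous (-1)."""
    decoded = []
    for r in row_pools:
        line = []
        for c in col_pools:
            if r == 0 or c == 0:
                line.append(0)
            elif r == 2 or c == 2:
                line.append(2)
            else:
                line.append(-1)
        decoded.append(line)
    return decoded


# Hypothetical true genotypes at one SNP for the 16 samples of a block
block = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 2, 0],
    [0, 0, 0, 0],
]

row_pools, col_pools = encode_block(block)
psi = pooling_pattern(row_pools, col_pools)
decoded = decode_block(row_pools, col_pools)

print(row_pools, col_pools)  # [0, 1, 1, 0] [0, 1, 1, 0]
print(psi)                   # ((2, 2, 0), (2, 2, 0))
for line in decoded:
    print(line)
```

In this toy block, the four samples at the intersections of the heterozygous row-pools and column-pools come out as -1. These are the genotypes that the refined decoding replaces with posterior genotype probabilities.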

Set up

  • a Python 3.6 environment with packages listed in requirements.txt, e.g. for a Linux-based OS from the genotypooler folder:

(if venv for Python 3.6 is not installed: apt install libpython3.6-dev python3.6-venv)

/usr/bin/python3.6 -m venv venv3.6

source venv3.6/bin/activate

pip install --upgrade pip

pip install -r requirements.txt

(see official venv documentation)

Usage

Some data and scripts are provided as use cases in /examples. In particular, the following files can be found:

  • adaptive_gls.csv: posterior genotype probabilities of pooled individuals, computed by Marginal Likelihood Maximization with heterozygote degeneracy.
  • ALL.chr20.snps.gt.vcf.gz and its index .csi: a subset of 1000 diallelic SNPs on chromosome 20 for 2504 unrelated individuals from the 1000 Genomes Project phase 3.
  • TEST.chr20.snps.gt.vcf.gz and its index .csi: a subset of 100 diallelic SNPs on chromosome 20 for 240 unrelated individuals from the 1000 Genomes Project phase 3.
  • pooling-ex.py: a minimalistic command-line program for simulating SNP genotype pooling from VCF files.
  • pooling-imputing-ex.ipynb: a pipeline showing pooling simulation, imputation in pooled data with Beagle, and visualization of the imputation quality.

Larger data files can be found in /data. They can be used in the same way as the files created in /examples after executing pooling-ex.py. However, the processing needs to be run in parallel on chunked data:

  1. From /data, run bash ../bin/bcfchunkpara.sh IMP.chr20.snps.gt.vcf.gz ./tmp 1000. You should get 53 chunks (0 to 52) in a tmp folder.
  2. From /runtools, run the script parallel_pooling.py with python3 parallel_pooling.py ../data/IMP.chr20.snps.gt.vcf.gz ../data/IMP.chr20.pooled.snps.gl.vcf.gz 4 (if you have 4 cores available on your machine). This should output the pooled file /data/tmp/IMP.chr20.pooled.snps.gl.vcf.gz. You can copy this file wherever you want and then delete the tmp folder.
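
The chunk-then-parallelize idea behind these two steps can be sketched as follows. This is only an illustration, not the content of bcfchunkpara.sh or parallel_pooling.py: the names chunk_ranges, pool_chunk and run_parallel are hypothetical, the worker is a placeholder that only counts variants, and reading 1000 as the number of variants per chunk is an assumption. The variant count of 52500 used in the comments is also hypothetical, chosen only because it yields 53 chunks.

```python
import math
from multiprocessing import Pool


def chunk_ranges(n_variants, chunk_size):
    """Return ordered (start, stop) index pairs that cover all the variants."""
    n_chunks = math.ceil(n_variants / chunk_size)
    return [
        (i * chunk_size, min((i + 1) * chunk_size, n_variants))
        for i in range(n_chunks)
    ]


def pool_chunk(bounds):
    """Placeholder worker: a real one would pool the genotypes of the variants
    in this range and write them to a chunk file."""
    start, stop = bounds
    return stop - start  # number of variants handled in this chunk


def run_parallel(n_variants, chunk_size, n_cores):
    """Process every chunk on n_cores worker processes."""
    ranges = chunk_ranges(n_variants, chunk_size)
    with Pool(processes=n_cores) as pool:
        # Pool.map returns the results in the order of the input chunks,
        # so the per-chunk outputs can be concatenated back in order.
        return pool.map(pool_chunk, ranges)


# Hypothetical variant count: 52500 variants in chunks of 1000 give 53 chunks
ranges = chunk_ranges(52500, 1000)
print(len(ranges))   # 53
print(ranges[-1])    # (52000, 52500)
```

When running the sketch as a script, run_parallel(52500, 1000, 4) should be called under an if __name__ == "__main__": guard, since multiprocessing may re-import the module in the worker processes.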
