pixy is a command-line tool for painlessly and correctly estimating average nucleotide diversity within (π) and between (dxy) populations from a VCF. In particular, pixy facilitates the use of VCFs containing invariant (AKA monomorphic) sites, which are essential for the correct computation of π and dxy in the face of missing data (i.e. always).
pixy is currently in active development and is not ready for general use. The software will be fully described in a forthcoming publication.
Kieran Samuk and Katharine Korunes
Duke University
https://pixy.readthedocs.io/en/latest/
pixy is currently available for installation on Linux/OSX systems via conda-forge. To install pixy using conda, you will first need to add conda-forge as a channel (if you haven't already):
conda config --add channels conda-forge
Then install pixy:
conda install pixy
You can test your pixy installation by running:
pixy --help
For information on installing conda, see:
anaconda (more features and initial modules): https://docs.anaconda.com/anaconda/install/
miniconda (lighter weight): https://docs.conda.io/en/latest/miniconda.html
Population geneticists are often interested in quantifying nucleotide diversity within populations and nucleotide differences between populations. The two most common summary statistics for these quantities were described by Nei and Li (1979), who discuss summarizing variation in the case of two populations (denoted 'x' and 'y'):
- π - Average nucleotide diversity within populations, also sometimes denoted πx and πy to indicate the population of interest.
- dxy - Average nucleotide difference between populations, also sometimes denoted πxy (pixy, get it?), to indicate that the statistic is a comparison between populations x and y.
Both of these statistics use the same basic formula:
$$\pi = \sum_{ij} x_i x_j \pi_{ij} = 2 \sum_{i=2}^{n} \sum_{j=1}^{i-1} x_i x_j \pi_{ij}$$
Here xi and xj are the respective frequencies of the ith and jth sequences, πij is the number of nucleotide differences per nucleotide site between the ith and jth sequences, and n is the number of sequences in the sample. (Source: Wikipedia)
In the case of π, all comparisons are made between sequences from the same population, whereas for dxy all comparisons are made between sequences from two different populations.
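To make the distinction concrete, here is a minimal Python sketch of both statistics for short, fully called, aligned sequences. It is an illustration only, not pixy's implementation: the function names and the toy sequences are hypothetical, and it averages over distinct pairs of sequences, which for n equally frequent sequences equals the frequency-weighted form described above multiplied by n/(n-1).

```python
from itertools import combinations, product


def per_site_differences(seq_a, seq_b):
    """Nucleotide differences per site between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def pi_within(seqs):
    """Average per-site difference over all pairs from ONE population."""
    pairs = list(combinations(seqs, 2))
    return sum(per_site_differences(a, b) for a, b in pairs) / len(pairs)


def dxy_between(seqs_x, seqs_y):
    """Average per-site difference over all pairs with one sequence from EACH population."""
    pairs = list(product(seqs_x, seqs_y))
    return sum(per_site_differences(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy data: 8 aligned sites, no missing calls.
pop_x = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
pop_y = ["ACGAACGT", "ACGAACGT"]

print("pi (x): ", pi_within(pop_x))
print("pi (y): ", pi_within(pop_y))
print("dxy:    ", dxy_between(pop_x, pop_y))
```

The only difference between the two functions is which pairs are enumerated: `combinations` within one population for π, `product` across two populations for dxy.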
In order to be comparable across samples/studies/genomic regions, π and dxy are generally standardized by dividing by the total number of sites in the sequences being compared. As such, one must explicitly keep track of variable vs. missing vs. monomorphic (invariant) sites. Failure to do this results in biased estimates of π and dxy. Prior to the genomic era, missing/invariant sites were almost always explicitly included in datasets because sequence data was in FASTA format (e.g. Sanger reads). However, most modern genomics tools encode variants as VCFs, which by design often omit invariant sites. With variants-only VCFs, there is often no way to distinguish missing sites from invariant sites. Further, when one does include invariant sites in a VCF, the resulting files are generally very large and difficult to manipulate with standard tools.
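The effect of missing data on the denominator can be seen in a small worked example. The sketch below is a hypothetical illustration, not pixy's code: it assumes haploid calls with `None` marking a missing genotype, counts differences and comparisons only among called genotypes, and contrasts that with a naive estimate that sees only variable sites and divides by the full window length.

```python
from itertools import combinations


def site_counts(alleles):
    """Return (differences, comparisons) for one site, skipping missing calls (None)."""
    called = [a for a in alleles if a is not None]
    pairs = list(combinations(called, 2))
    diffs = sum(1 for a, b in pairs if a != b)
    return diffs, len(pairs)


def pi_from_sites(sites):
    """Sum differences and comparisons across sites, then take the ratio."""
    total_diffs = 0
    total_comps = 0
    for alleles in sites:
        diffs, comps = site_counts(alleles)
        total_diffs += diffs
        total_comps += comps
    return total_diffs / total_comps if total_comps else float("nan")


def naive_pi(sites, window_length):
    """Use only variable sites and assume every other site is invariant and fully called."""
    per_site = []
    for alleles in sites:
        diffs, comps = site_counts(alleles)
        if diffs > 0:  # only these sites would appear in a variants-only VCF
            per_site.append(diffs / comps)
    return sum(per_site) / window_length


# Hypothetical toy data: 4 haploid samples, 5 sites.
sites = [
    ["A", "A", "A", "A"],      # invariant, fully called
    ["C", "T", "C", "T"],      # variable, fully called
    [None, None, None, None],  # missing in every sample
    ["G", "G", None, None],    # invariant, half missing
    ["T", "A", None, "T"],     # variable, one call missing
]

print("missing-aware pi:", pi_from_sites(sites))
print("naive pi:        ", naive_pi(sites, window_length=len(sites)))
```

In this toy case the naive estimate is lower than the missing-aware one, because sites with missing calls are silently counted as fully called invariant sites.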
pixy provides the following solutions to problems inherent in computing π and dxy from a VCF:
- Fast and efficient handling of invariant-sites VCFs via conversion to on-disk chunked databases (Zarr format).
- Standardized individual-level filtration of variant and invariant genotypes.
- Computation of π and dxy for arbitrary numbers of populations.
- Computation of all statistics in arbitrarily sized windows, with output that contains all raw information for all computations (e.g. numerators and denominators).
Most of this is made possible by extensive use of the existing data structures and functions found in the brilliant Python library scikit-allel.
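As a rough illustration of the windowed output described in the list above, the following sketch aggregates per-site counts into fixed-size windows and reports the numerator and denominator alongside the ratio. It is written in plain Python rather than with scikit-allel, and the record layout, function name, and output field names are assumptions made for this example, not pixy's actual output format.

```python
from collections import defaultdict


def windowed_pi(site_records, window_size):
    """Aggregate (position, differences, comparisons) records into windows.

    Positions are assumed to be 1-based, as in a VCF.
    """
    totals = defaultdict(lambda: [0, 0])
    for pos, diffs, comps in site_records:
        window_index = (pos - 1) // window_size
        totals[window_index][0] += diffs
        totals[window_index][1] += comps

    rows = []
    for window_index in sorted(totals):
        diffs, comps = totals[window_index]
        start = window_index * window_size + 1
        rows.append({
            "start": start,
            "end": start + window_size - 1,
            "differences": diffs,    # numerator
            "comparisons": comps,    # denominator
            "pi": diffs / comps if comps else float("nan"),
        })
    return rows


# Hypothetical per-site counts: (position, differences, comparisons).
records = [(1, 0, 6), (2, 4, 6), (3, 0, 0), (4, 0, 1), (5, 2, 3), (6, 0, 6), (7, 2, 3)]
rows = windowed_pi(records, window_size=5)
for row in rows:
    print(row)
```

Keeping the raw counts in each row means windows can later be merged by summing numerators and denominators, rather than by averaging ratios.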