`pixy`

pixy is a command-line tool for painlessly and correctly estimating average nucleotide diversity within (π) and between (d_xy) populations from a VCF. In particular, pixy facilitates the use of VCFs containing invariant (AKA monomorphic) sites, which are essential for the correct computation of π and d_xy in the face of missing data (i.e. always).

pixy is currently in active development and is not ready for general use. The software will be fully described in a forthcoming publication.

Authors

Kieran Samuk and Katharine Korunes

Duke University

Documentation

https://pixy.readthedocs.io/en/latest/

Installation

pixy is currently available for installation on Linux/OSX systems via conda-forge. To install pixy using conda, you will first need to add conda-forge as a channel (if you haven't already):

conda config --add channels conda-forge

Then install pixy:

conda install pixy

You can test your pixy installation by running:

pixy --help

For information on installing conda, see here:

anaconda (more features and initial modules): https://docs.anaconda.com/anaconda/install/

miniconda (lighter weight): https://docs.conda.io/en/latest/miniconda.html

Background

Population geneticists are often interested in quantifying nucleotide diversity within populations and nucleotide differences between populations. The two most common summary statistics for these quantities were described by Nei and Li (1979), who discuss summarizing variation in the case of two populations (denoted 'x' and 'y'):

π - Average nucleotide diversity within populations, also sometimes denoted π_x and π_y to indicate the population of interest.
d_xy - Average nucleotide difference between populations, also sometimes denoted π_xy (pixy, get it?), to indicate that the statistic is a comparison between populations x and y.

Both of these statistics use the same basic formula:

$$\pi = \sum_{ij} x_i x_j \pi_{ij}$$

where x_i and x_j are the respective frequencies of the i_th and j_th sequences, π_ij is the number of nucleotide differences per nucleotide site between the i_th and j_th sequences, and n is the number of sequences in the sample. (Source: Wikipedia)

In the case of π, all comparisons are made between sequences from the same population, whereas for d_xy all comparisons are made between sequences from two different populations.
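As a rough illustration of that distinction, the sketch below computes both statistics from short, fully aligned sequences with no missing data. It is not pixy's code: the function names and the toy sequences are made up for this example. It averages over distinct pairs of sequences, which is a common sample-based estimator and differs from a frequency-weighted sum over all pairs by a factor of n/(n-1).

```python
from itertools import combinations, product


def per_site_differences(seq_a, seq_b):
    """Proportion of sites that differ between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def pi_within(seqs):
    """Mean per-site difference over all distinct pairs within one population."""
    pairs = list(combinations(seqs, 2))
    return sum(per_site_differences(a, b) for a, b in pairs) / len(pairs)


def dxy_between(seqs_x, seqs_y):
    """Mean per-site difference over all pairs drawn one from each population."""
    pairs = list(product(seqs_x, seqs_y))
    return sum(per_site_differences(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy data: three sequences from population x, two from y.
pop_x = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
pop_y = ["ACGAACGT", "ACGAACGA"]

print(pi_within(pop_x))           # 1/12, about 0.0833
print(dxy_between(pop_x, pop_y))  # 9/48 = 0.1875
```

The only difference between the two functions is which pairs are enumerated: pairs within one population for π, and pairs spanning the two populations for d_xy.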

The problem

In order to be comparable across samples/studies/genomic regions, π and d_xy are generally standardized by dividing by the total number of sites in the sequences being compared. As such, one must explicitly keep track of variable vs. missing vs. monomorphic (invariant) sites. Failure to do this results in biased estimates of π and d_xy. Prior to the genomic era, missing/invariant sites were almost always explicitly included in datasets because sequence data was in FASTA format (e.g. Sanger reads). However, most modern genomics tools encode variants as VCFs, which by design often omit invariant sites. With variants-only VCFs, there is often no way to distinguish missing sites from invariant sites. Further, when one does include invariant sites in a VCF, it generally results in very large files that are difficult to manipulate with standard tools.
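The size of this bias can be seen in a small made-up example. The sketch below assumes haploid genotypes with `None` marking a missing call, and compares two estimates of π for a six-site region: one that only sees variant sites and divides by the region length (so missing sites are silently treated as invariant), and one that counts differences and comparisons at every site with data. The data, function names, and the second estimator are illustrative assumptions, not a description of pixy's internals.

```python
from itertools import combinations


def site_diffs_and_comps(genotypes):
    """Count pairwise differences and comparisons among called genotypes."""
    called = [g for g in genotypes if g is not None]
    pairs = list(combinations(called, 2))
    diffs = sum(a != b for a, b in pairs)
    return diffs, len(pairs)


def pi_variants_only(sites, region_length):
    """Naive estimate: sum per-site pi at variant sites, divide by region length."""
    total = 0.0
    for genotypes in sites:
        diffs, comps = site_diffs_and_comps(genotypes)
        if diffs > 0:  # a variants-only VCF contains only these sites
            total += diffs / comps
    return total / region_length


def pi_all_sites(sites):
    """Estimate using every site with data: total differences / total comparisons."""
    total_diffs = total_comps = 0
    for genotypes in sites:
        diffs, comps = site_diffs_and_comps(genotypes)
        total_diffs += diffs
        total_comps += comps
    return total_diffs / total_comps


# Hypothetical region: 6 sites x 4 haploid samples.
region = [
    ["A", "A", "A", "A"],      # invariant
    ["C", "T", "C", "T"],      # variant: 4 differences in 6 comparisons
    [None, None, None, None],  # missing in all samples
    [None, None, None, None],  # missing in all samples
    ["G", "G", None, "G"],     # invariant, one sample missing
    ["T", "T", "T", "A"],      # variant: 3 differences in 6 comparisons
]

print(pi_variants_only(region, region_length=6))  # 7/36, about 0.194
print(pi_all_sites(region))                       # 7/21, about 0.333
```

In this toy case the variants-only estimate is noticeably lower, because the two fully missing sites are counted in the denominator as if they had been observed to be invariant.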

The solution

pixy provides the following solutions to problems inherent in computing π and d_xy from a VCF:

Fast and efficient handling of invariant sites VCFs via conversion to on-disk chunked databases (Zarr format).
Standardized individual-level filtration of variant and invariant genotypes.
Computation of π and d_xy for arbitrary numbers of populations.
Computation of all statistics in arbitrarily sized windows, with output containing all raw information for all computations (e.g. numerators and denominators).

The majority of this is made possible by extensive use of the existing data structures and functions found in the brilliant Python library scikit-allel.
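To make the idea of windowed output with raw numerators and denominators concrete, here is a minimal pure-Python sketch. It does not use scikit-allel and does not reproduce pixy's actual output format: the function name, the input layout (per-site difference and comparison counts), and the column names are all assumptions chosen for illustration.

```python
from collections import defaultdict


def windowed_pi(site_counts, window_size):
    """Summarize per-site counts into fixed-size, non-overlapping windows.

    site_counts: iterable of (position, diffs, comparisons), 1-based positions.
    Returns one dict per window that contains at least one listed site.
    """
    totals = defaultdict(lambda: [0, 0])
    for pos, diffs, comps in site_counts:
        idx = (pos - 1) // window_size  # 0-based window index
        totals[idx][0] += diffs
        totals[idx][1] += comps

    rows = []
    for idx in sorted(totals):
        diffs, comps = totals[idx]
        rows.append({
            "window_start": idx * window_size + 1,
            "window_end": (idx + 1) * window_size,
            "count_diffs": diffs,        # numerator
            "count_comparisons": comps,  # denominator
            # No estimate is reported when a window has no comparisons.
            "avg_pi": diffs / comps if comps else None,
        })
    return rows


# Hypothetical per-site counts: (position, differences, comparisons).
site_counts = [(1, 0, 6), (2, 4, 6), (5, 0, 3), (6, 3, 6), (11, 0, 0)]

rows = windowed_pi(site_counts, window_size=5)
for row in rows:
    print(row)
```

Keeping the numerator and denominator in each row means windows can later be merged or re-weighted by summing counts, rather than by averaging already-averaged values.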
