Skip to content

Latest commit

 

History

History
153 lines (109 loc) · 11.6 KB

README.md

File metadata and controls

153 lines (109 loc) · 11.6 KB

ngsLD

ngsLD is a program to estimate pairwise linkage disequilibrium (LD) taking the uncertainty of genotype's assignation into account. It does so by avoiding genotype calling and using genotype likelihoods or posterior probabilities.

Citation

ngsLD has been published in Bioinformatics, so please cite it if you use it in your work:

Fox EA, Wright AE, Fumagalli M, and Vieira FG
ngsLD: evaluating linkage disequilibrium using genotype likelihoods
Bioinformatics (2019) 35(19):3855 - 3856

Installation

ngsLD can be easily installed but has some external dependencies:

  • Mandatory:
    • gcc: >= 4.9.2 tested on Debian 7.8 (wheezy)
    • zlib: v1.2.7 tested on Debian 7.8 (wheezy)
    • gsl : v1.15 tested on Debian 7.8 (wheezy)
  • Optional (only needed for testing or auxilliary scripts):
    • md5sum
    • Perl packages: Getopt::Long, Graph::Easy, Math::BigFloat, and IO::Zlib
    • R packages: optparse, tools, ggplot2, reshape2, plyr, gtools, and LDheatmap
    • python3 packages: graph-tool, and pandas

To install the entire package just download the source code:

% git clone https://github.com/fgvieira/ngsLD.git

and run:

% cd ngsLD
% make

To run the tests (only if installed through ngsTools):

% make test

Executables are built into the main directory. If you wish to clean all binaries and intermediate files:

% make clean

Usage

% ./ngsLD [options] --geno glf/in/file --n_ind INT --n_sites INT --out output/file

Parameters

  • --geno FILE: input file with genotypes, genotype likelihoods or genotype posterior probabilities.
  • --probs: is the input genotype probabilities (likelihoods or posteriors)?
  • --log_scale: is the input in log-scale?
  • --n_ind INT: sample size (number of individuals).
  • --n_sites INT: total number of sites.
  • --pos(H) FILE: input file with site coordinates (one per line), where the 1st column stands for the chromosome/contig and the 2nd for the position (bp); remaining columns will be ignored but included in output; --posH assumes there is a header.
  • --max_kb_dist DOUBLE: maximum distance between SNPs (in Kb) to calculate LD. Set to 0(zero) to disable filter. [100]
  • --max_snp_dist INT: maximum distance between SNPs (in number of SNPs) to calculate LD. Set to 0 (zero) to disable filter. [0]
  • --min_maf DOUBLE: minimum SNP minor allele frequency. [0]
  • --ignore_miss_data: ignore missing genotype data from analyses (e.g. MAF and haplotype frequency estimation).
  • --call_geno: call genotypes before running analyses.
  • --N_thresh DOUBLE: minimum threshold to consider position; if highest GL is lower, set as missing data (assumes -call_geno).
  • --call_thresh DOUBLE: minimum threshold to call genotype; if highest GL is lower, left as is (assumes -call_geno).
  • --rnd_sample DOUBLE: proportion of comparisons to randomly sample. [1]
  • --seed INT: random number generator seed for random sampling (--rnd_sample).
  • --extend_out: print extended output (see below).
  • --out FILE: output file name. [stdout]
  • --n_threads INT: number of threads to use. [1]
  • --verbose INT: selects verbosity level. [1]

Input data

As input, ngsLD accepts both genotypes, genotype likelihoods (GP) or genotype posterior probabilities (GP). Genotypes must be input as gziped TSV with one row per site and one column per individual n_sites.n_ind and genotypes coded as [-1, 0, 1, 2]. As for GL and GP, ngsLD accepts both gzipd TSV and binary formats, but with 3 columns per individual 3.n_sites.n_ind and, in the case of binary, the GL/GP coded as doubles.

It is advisable that SNPs be called first, since monomorphic sites are not informative and it will greatly speed up computation. If not, these comparisons will show up as nan or inf in the output.

Output

ngsLD outputs a TSV file with LD results for all pairs of sites for which LD was calculated, where the first two columns are positions of the SNPs, the third column is the distance (in bp) between the SNPs, and the following 4 columns are the various measures of LD calculated (r^2 from pearson correlation between expected genotypes, D from EM algorithm, D' from EM algorithm, and r^2 from EM algorithm). If the option --extend_out is used, then an extra 8 columns are printed with number of samples, minor allele frequency (MAF) of both loci, haplotype frequencies for all four haplotypes, and a chi2 (1 d.f.) for the strength of association (Collins et al., 1999).

If both --max_kb_dist and --max_snp_dist are set to 0, ngsLD will output all comparisons, even between different chromosomes/scaffolds/contigs.

Possible analyses

LD pruning

For some analyses, linked sites are typically pruned since their presence can bias results. You can use prune_graph to prune your dataset and get a list of unlinked sites. Alternatively, you can also use the auxiliary scripts scripts/prune_graph.pl or scripts/prune_ngsLD.py, but they are much slower (specially the perl script) and no longer supported.

prune_graph --header --in testLD_2.ld --weight-field "r2" --weight-filter "dist <= 50000 && r2 >= 0.5" --out testLD_unlinked.pos

For more advanced options, please check help (prune_graph --help).

NOTE: as of version 1.2.1 the MAF filtering has been disabled in ngsLD, due to potential misleading prunning results. Briefly, filtered sites due to low MAF are skipped by ngsLD and not included in the output file. As such, when pruning, if the output list of unlinked sites is used to extract independent sites from a dataset, all low MAF sites will also be excluded (since they were not in the ngsLD output). If you want to include them, either perform the MAF filtering when prunning (you need ngsLD extended output) or use the list of linked sites and exclude them from your dataset.

LD decay

If you are interested on the rate of LD decay, you can fit a distribution to your data using the script scripts/fit_LDdecay.R to fit LD decay models for r^2 (Hill and Weir, 1988 and Remington et al., 2001) and D' (Abecassis et al., 2001) over physical (or genetic) distance.

There are two models implemented for r^2 and one for D' decay. Briefly, the first is derived by adjusting the expected r^2 under a drift-recombination equilibrium for finite samples sizes and low level of mutation (Hill and Weir, 1988) with a single parameter (rate of decay):

E_r^2

The second formulation is an extension of the Sved, 1971 model (expected r^2 under drift-recombination equilibrium) with three parameters (rate of decay, maximum observed r^2 and minimum observed r^2) to account for the range of observed r^2 values:

E_r^2

where the rate of decay C. For D', we fit the expectation derived by Abecassis et al., 2001, assuming a recombination rate of cm-Mb and fixing D'0_1, resulting also on a three parameter model (rate of decay, maximum D' and minimum D'):

E_D'

For more information, please refer to the online published supplementary data.

% Rscript --vanilla --slave scripts/fit_LDdecay.R --ld_files ld_files.list --out plot.pdf
  • --ld_files FILE: list of LD files to fit and plot (if ommited, can be read from STDIN)
  • --out: Name of output plot
  • --n_ind: Only relevant when fitting r^2 decay to specify the first model

For more advanced options, please check script help (Rscript --vanilla --slave scripts/fit_LDdecay.R --help). The shape of the decay curve can give valuable insight into the sample's biology (e.g.):

  • Intersect (high): small Ne (or bottleneck), natural selection (local LD)*, non-random mating (or inbreeding), recent admixture
  • Decay Rate (high): large Ne (or population expansion), high mutation rate, high recombination rate*, random mating (or no inbreeding)*, not recently admixture
  • Asymptote (high): population structure*, small sample size*

NOTE: Please keep in mind that these are just general trends, and that it all depends on the biology/genetics of the sample. Since LD is influenced by many factors, it is usually less straightforward to derive exact predictions from it.

LD blocks

To plot LD blocks, we also provide a small script as an example for how it can be easily done in R using the LDheatmap package (by default, r^2 is plotted).

% cat testLD_2.ld | bash ../scripts/LD_blocks.sh chrSIM_21 2000 5000
Notes
  • LD can be affected by:
    • New mutations, genetic drift, population growth (reduces LD)
    • Admixture, population structure, inbreeding, natural selection (increases LD)
  • r^2 (preferred in popgen)
    • ranges between 0 (no linkage) and 1 (markers in complete linkage)
    • expected value of 1/2n
  • D (hard to interpret)
    • range depends on frequencies
    • sign is arbitrary
  • D'
    • ranges between -1 and 1
    • values of 1 or -1 mean complete linkage
    • if frequencies are similar, high values mean the markers are good surrogates for each other
    • overestimates LD in small samples
    • overestimates LD when one allele is rare

Hints

  • ngsLD performance seems to drop considerable under extremely low coverages (<1x); consider these cases only if you have large sample sizes (>100 individuals).
  • For some analyses (e.g. LD decay) consider sampling your data (--rnd_sample), since ngsLD will be much faster and you probably don't need all comparisons.
  • For the LD decay, as a rule-of-thumb, consider using at least 10'000 SNPs; check the confidence interval and, if too wide, increase number of SNPs.

Thread pool

The thread pool implementation was adapted from Mathias Brossard's and is freely available from: https://github.com/mbrossard/threadpool