Haploid Allele Frequency Estimate with EM algorithm

Identify the most common allele frequency and guess the base!

Aims of the project

The purpose of this project is to automate pipeline steps in order to:

Understand the GATK approach to base calling
Apply the EM algorithm to a practical bioinformatics case
Infer the allele frequency in a haploid organism

Likelihood model

The likelihood model follows the genotype likelihoods defined by the Genome Analysis Toolkit (GATK). For a given genotype $Z = z$, the probability of observing the data $X$ for an individual at a particular genomic site is proportional to the product, over all reads covering that site, of the probability of each read base $b_d$ given the genotype. The product runs over $d = 1, \dots, D$, where $D$ is the depth of coverage at the site.
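Written out, and assuming the reads are conditionally independent given the genotype (the assumption that justifies taking a product), the statement above reads:

$$
P(X \mid Z = z) \;\propto\; \prod_{d=1}^{D} P(b_d \mid Z = z)
$$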

The probability of a read base given the genotype depends on the read's error probability $\epsilon_d$. If the observed base matches the assumed genotype $z$, the probability is the complement of the error, $1 - \epsilon_d$. If it does not match, the probability is $\epsilon_d / 3$, because an error is assumed equally likely to yield any of the three other bases. Since the organisms studied are haploid, each genotype $z$ is a single base rather than a pair of alleles.
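The likelihood model can be sketched in a few lines. The project itself is built with R; the Python below is an illustration only and is not taken from the repository's code. The function names, the example reads, and the Phred conversion $\epsilon = 10^{-Q/10}$ are assumptions introduced for this sketch.

```python
import math

BASES = "ACGT"

def phred_to_error(q):
    """Convert a Phred quality score to an error probability (assumed encoding)."""
    return 10 ** (-q / 10)

def read_base_prob(base, genotype, eps):
    """P(b_d | Z = z): 1 - eps on a match, eps / 3 on a mismatch."""
    return 1 - eps if base == genotype else eps / 3

def genotype_log_likelihoods(bases, errors):
    """Log-likelihood of each haploid genotype, summed over all D reads at a site.

    Working in log space avoids underflow at high depth. Error probabilities
    are assumed to lie strictly between 0 and 1.
    """
    return {
        z: sum(math.log(read_base_prob(b, z, e)) for b, e in zip(bases, errors))
        for z in BASES
    }

# Hypothetical pileup column: four reads, the last one disagreeing.
bases = ["A", "A", "A", "C"]
errors = [phred_to_error(q) for q in (30, 30, 20, 10)]
loglik = genotype_log_likelihoods(bases, errors)
best = max(loglik, key=loglik.get)  # genotype with the highest likelihood
```

Here the single low-quality `C` read is outweighed by three higher-quality `A` reads, so `A` receives the largest log-likelihood.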

EM Algorithm

The Expectation-Maximization (EM) algorithm is employed to estimate allele frequencies in this project. The algorithm iteratively refines these estimates through two key steps:

Expectation Step: For each individual $i$, the algorithm computes the posterior probability of each genotype at site $j$, denoted $Q_i(Z_j)$, using the current allele frequency estimates $f_j^{(n)}$. $Q_i(Z_j)$ is the probability that genotype $Z$ is present at site $j$ given the reads observed for individual $i$; it combines the genotype likelihood of those reads with the current allele frequencies.
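One common way to write this step, assuming the current allele frequencies act as a prior over the four bases, is $Q_i(Z_j = z) = P(X_{ij} \mid z)\, f_j^{(n)}(z) \,/\, \sum_{z'} P(X_{ij} \mid z')\, f_j^{(n)}(z')$. The Python sketch below illustrates that form; the function name `e_step` and the example numbers are hypothetical, and it is not the repository's R implementation.

```python
def e_step(likelihoods, freqs):
    """Posterior Q_i(Z_j = z) for one individual at one site.

    likelihoods: dict mapping base -> P(X_i | Z = z)
    freqs:       dict mapping base -> current allele frequency f^(n)(z)
    """
    weighted = {z: likelihoods[z] * freqs[z] for z in likelihoods}
    total = sum(weighted.values())  # normalising constant
    return {z: w / total for z, w in weighted.items()}

# Hypothetical genotype likelihoods for one individual, with flat frequencies.
lik = {"A": 0.9, "C": 0.05, "G": 0.03, "T": 0.02}
flat = {z: 0.25 for z in lik}
posterior = e_step(lik, flat)
```

With flat frequencies the posterior is just the normalised likelihood; once the frequencies become uneven, rare alleles are down-weighted accordingly.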

Maximization Step: The algorithm then updates the allele frequency estimates, $f_j^{(n+1)}$, from the expectations computed in the previous step. For each site, the update pools the expected genotype probabilities $Q_i(Z_j)$ across all individuals.
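For haploid data, a natural form of this update is the average posterior over the $N$ individuals, $f_j^{(n+1)}(z) = \frac{1}{N} \sum_{i=1}^{N} Q_i(Z_j = z)$. This averaging form is an assumption of the sketch, since the text above does not spell out the update. The Python below illustrates it with a hypothetical `m_step` function and made-up posteriors; it is not the repository's R code.

```python
def m_step(posteriors):
    """Update allele frequencies at one site from per-individual posteriors.

    posteriors: list (one entry per individual) of dicts base -> Q_i(Z_j = z)
    Returns a dict base -> updated frequency f^(n+1)(z).
    """
    n = len(posteriors)
    bases = list(posteriors[0])
    return {z: sum(q[z] for q in posteriors) / n for z in bases}

# Hypothetical posteriors for three individuals at the same site.
posteriors = [
    {"A": 0.98, "C": 0.01, "G": 0.005, "T": 0.005},
    {"A": 0.90, "C": 0.08, "G": 0.01, "T": 0.01},
    {"A": 0.10, "C": 0.88, "G": 0.01, "T": 0.01},
]
new_freqs = m_step(posteriors)
```

Because each posterior sums to one, the updated frequencies also sum to one without any extra normalisation.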

The two steps are repeated until the estimates converge or a predefined maximum number of iterations is reached. The result is an allele frequency estimate for each genomic site in haploid organisms.
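The full iteration, including both stopping rules, can be sketched as follows. This is an illustrative Python version under stated assumptions, not the repository's R code: the function name `em_allele_freqs`, the uniform starting frequencies, the `tol` and `max_iter` defaults, and the example likelihoods are all hypothetical choices.

```python
def em_allele_freqs(site_likelihoods, max_iter=100, tol=1e-6):
    """Estimate allele frequencies at one site with EM.

    site_likelihoods: list (one entry per individual) of dicts
                      base -> P(X_i | Z = z)
    Returns (frequencies, number of iterations run).
    """
    bases = list(site_likelihoods[0])
    freqs = {z: 1 / len(bases) for z in bases}  # assumed uniform start
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # E-step: posterior genotype probabilities for each individual.
        posteriors = []
        for lik in site_likelihoods:
            weighted = {z: lik[z] * freqs[z] for z in bases}
            total = sum(weighted.values())
            posteriors.append({z: w / total for z, w in weighted.items()})
        # M-step: average the posteriors over individuals.
        new_freqs = {
            z: sum(q[z] for q in posteriors) / len(posteriors) for z in bases
        }
        # Stop once the largest change in any frequency falls below tol.
        change = max(abs(new_freqs[z] - freqs[z]) for z in bases)
        freqs = new_freqs
        if change < tol:
            break
    return freqs, iteration

# Hypothetical site: three individuals supporting A, one supporting C.
a_like = {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}
c_like = {"A": 0.01, "C": 0.97, "G": 0.01, "T": 0.01}
freqs, n_iter = em_allele_freqs([a_like, a_like, a_like, c_like])
```

In this toy case the estimate for `A` settles near three quarters and the frequencies of the unobserved bases shrink towards zero. A fixed tolerance on the largest frequency change is only one possible convergence criterion; monitoring the log-likelihood is a common alternative.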

These two components, the likelihood model and the EM algorithm, form the core of the project, allowing us to infer allele frequencies effectively.

Built With R

Still in development!
