Skip to content

Latest commit

 

History

History
172 lines (98 loc) · 3.95 KB

lecture_0.md

File metadata and controls

172 lines (98 loc) · 3.95 KB
title author institute date output
Introduction to pangenomics
Alexander Leonard
ETH Zürich
Day 1
beamer_presentation
theme keep_tex
Boadilla
true

What is a "genome"?

The definition of genome from genome.gov is

the entire set of DNA instructions found in a cell.

For us, that is approximately 3,000,000,000 A/T/C/G nucleotide bases.

Chromosome 1 of CHM13v2.0 looks like

CACCTAAACCCTAACCCCTAACCCTAACCCTAAC
...
GGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTT


Reference genomes

The human reference genome was assembled in 2001

  • International Human Genome Consortium. Initial sequencing and analysis of the human genome. Nature (2021).
  • Venter et al. The sequence of the human genome. Science (2021).

. . .

The human reference genome was actually completed much later.
Nurk et al. The complete sequence of a human genome. Science (2022).

. . .

Even that version has "limitations", including no Y chromosome!
Rhie et al. The complete sequence of a human Y chromosome. Nature (2023).


Reference genomes

What is the point of a reference genome?

The definition of reference sequence from genome.gov is

an accepted representation ... that is used by researchers as a standard for comparison to DNA sequences generated in their studies.

. . .

So a reference is a "good enough" framework for us to consistently refer to the same genomic sequence.


Reference genomes

What do we use a reference genome for?


Reference genomes?

The CHM13 genome represents one person, not all of us.

What about my genome?

. . .

What about my genome genomes?


Enter the pangenome

Pangenomes are one solution to this problem of representing many genomes, or more formally put is

a collection of well assembled genomes in a clade (typically a species) to be analysed together.

Historically, pangenomes were used primarily in bacteria — many E. coli genomes only share ~50% of genes, so the "set" of genes was the "pangenome".


What is a "pangenome"?

Now we might refer to a collection of many genomes as a pangenome.

. . .

Species pangenomes, genus pangenomes (superpangenome), etc.


What is a "pangenome"?

In reality there are many underlying types of pangenomes:

  • variation-based
  • sequence-based
  • gene-based
  • kmer-based

. . .

As a young field, no one definition has won yet.
Each may have their own strengths and weaknesses.


What is a "pangenome"?

BPC diagram{ width=85% }

. . .

At least a "sequence graph pangenome".


Why use a pangenome?

We want to mitigate reference bias, which is

when reads containing non-reference alleles fail to align to their true point of origin.

This is especially relevant to agriculture, where "the same species" can be highly diverse.

. . .

Consider what happens if our sample is diverged (either in one region or the whole genome) compared to the reference.


Why use a pangenome?

reference bias in alignment{ width=65% }

Genomics mutations/sequencing errors make aligners second-guess themselves...


Why just now?

Pangenomes may seem like an obvious solution, but

  • assemblies were extremely expensive/hard to produce
  • computational power/algorithms were not sufficiently advanced
  • the scale of diversity (in humans at least) was underestimated

Some guiding steps

The goals of this course are to gain familiarity with the rapidly evolving field of pangenomics, including:

  • performing analyses with long read data
  • constructing and assessing pangenomes
  • using pangenomes to find meaningful variation

Some guiding steps

If you don't find this challenging, I have a long list of issues!