title | author | institute | date
---|---|---|---
Genome assembly and pangenome graphs | Alexander Leonard | ETH Zürich | Day 2
There are not any pangenome "curriculums".
These are ideas we think are useful to help apply pangenomics to your own research.
. . .
Get involved and discuss any questions or ideas of your own!
By the end of the lecture, we should be able to:
- assemble genomes from long reads and evaluate them
- understand common pangenome terminology and formats
- build pangenome graphs from input assemblies
- Long read sequencing and assembly
- Some pangenome basics
- Building a pangenome
Genome sequencing falls into several "generations":
- Sanger (manual)
- "Next generation" (high throughput)
- Third generation (long reads)
. . .
Perhaps we are due for a fourth generation designation soon...
Improvements in sequencing happen at a breakneck pace.
We most likely work with chromosomes that are too large to sequence directly.
Genome sequencing provides us with many smaller fragments.
. . .
We need to reassemble all the reads to reconstruct the original genome sequence.
Consider the simple case
TTAGGCAA
    GCAAGTCCCA
         TCCCATTAA
. . .
The assembled sequence would be "TTAGGCAAGTCCCATTAA".
We call this a contig (referring to a contiguous region of the genome).
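The merging above can be sketched in a few lines. This is only a toy, assuming the reads are error-free and already in left-to-right order; the function names and the minimum overlap of 3 bases are illustrative choices, not how any real assembler works.

```python
def overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_in_order(reads: list[str], min_len: int = 3) -> str:
    """Merge reads that are already ordered left-to-right into one contig."""
    contig = reads[0]
    for read in reads[1:]:
        size = overlap(contig, read, min_len)
        if size == 0:
            raise ValueError(f"no overlap of at least {min_len} bases for {read}")
        # Append only the part of the read that extends the contig.
        contig += read[size:]
    return contig


reads = ["TTAGGCAA", "GCAAGTCCCA", "TCCCATTAA"]
contig = merge_in_order(reads)
print(contig)
```

A read with a single error, such as TCACATTAA, no longer has an exact overlap of 3 or more bases with the growing contig, so this naive exact-match approach fails.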
However, one possible type of issue
TTAGGCAA
    GCAAGTCATCAT
          CATCATCATCCC
             CATCATCATCCC
. . .
It is already ambiguous which read (the 3rd or 4th) is better, and so we have to "guess" the genome sequence.
Many genomes are unfortunately full of complex repeats that even with perfect reads cannot be resolved.
. . .
TTAGGCAA
    GCAAGTCCCA
         TCACATTAA
           ^
Now, we add in sequencing errors and assembly gets even harder.
There are various theoretical arguments for how much sequencing coverage is needed to "guarantee" we can reassemble the genome correctly.
. . .
In short, long+accurate+high coverage is the best.
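One of the simplest such arguments can be sketched as follows. It assumes an idealised model where reads start uniformly at random along the genome (in the spirit of Lander-Waterman statistics) and ignores repeats and sequencing errors. The genome size, read length, and read count are hypothetical numbers chosen for illustration.

```python
import math


def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads covering each base."""
    return n_reads * read_length / genome_size


def expected_uncovered_fraction(coverage: float) -> float:
    """Fraction of bases expected to be hit by no read at all (Poisson model)."""
    return math.exp(-coverage)


# Hypothetical example: 6 million 15 kb reads from a 3 Gb genome.
coverage = mean_coverage(n_reads=6_000_000, read_length=15_000, genome_size=3_000_000_000)
uncovered = expected_uncovered_fraction(coverage)
print(f"coverage: {coverage:.0f}x, expected uncovered fraction: {uncovered:.2e}")
```

Even when almost every base is covered, this says nothing about whether repeats can be spanned, which is why read length matters as much as depth.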
Confidently placing 150 (or smaller!) basepair sequencing reads was incredibly challenging.
Absolutely no hope of spanning complex repeats.
. . .
Highly fragmented assemblies, but each piece is generally correct.
Long reads solved some of the issues, but initially were extremely error-prone (>80% accuracy).
. . .
This required different assembly algorithms (de Bruijn graphs versus Overlap-Layout-Consensus).
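A toy sketch of the de Bruijn idea: break reads into k-mers and connect each k-mer's prefix to its suffix. The choice of k=4 is purely illustrative, and real assemblers also deal with reverse complements, k-mer counts, and errors, none of which are handled here.

```python
from collections import defaultdict


def de_bruijn(reads: list[str], k: int) -> dict[str, set[str]]:
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph: dict[str, set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i : i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


reads = ["TTAGGCAA", "GCAAGTCCCA", "TCCCATTAA"]
graph = de_bruijn(reads, k=4)

# Nodes with more than one successor are branches the assembler must resolve.
branches = {node: nexts for node, nexts in graph.items() if len(nexts) > 1}
print(branches)
```

Even in this tiny example the node TTA has two successors (TAG and TAA), because TTA occurs twice in the sequence. Overlap-Layout-Consensus instead compares whole reads to each other, which suits long reads better.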
With long and accurate reads, many problems disappeared.
Assemblers like hifiasm (https://github.com/chhylp123/hifiasm) enabled drastic improvements in:
- genome quality
- compute resources
- data required
. . .
Jarvis et al. Semi-automated assembly of high-quality diploid human reference genomes. Nature (2022)
Even more improvements with additional sequencing:
- ONT ultralong reads
- Hi-C
- SBB/6b4 reads
. . .
Rautiainen et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nature Biotechnology (2023)
Q100 project (https://github.com/marbl/HG002)
Accurate long reads also enabled distinguishing genomic variation from errors on each read.
Many samples of interest are not haploid, and we want to assemble each haplotype.
For diploids, this is easier.
We can use parental reads to assign phases over heterozygous regions.
. . .
For higher ploidy, this is challenging but feasible in theory.
How can we investigate how good our genome assemblies are?
. . .
Is the assembly:
- contiguous?
- complete?
- correct?
The easiest metric has the hardest definition:
Given a set of contigs, the N50 is defined as the sequence length of the shortest contig such that 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than this value.
Basically, are the chromosomes mostly in one piece each?
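The definition is easier to follow as code. A minimal sketch, with hypothetical contig lengths:

```python
def n50(lengths: list[int]) -> int:
    """Length of the contig at which the running total reaches half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


# Hypothetical contig lengths (total 300).
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
result = n50(contig_lengths)
print(result)
```

Here the two longest contigs (80 + 70) already hold half of the 300 bases, so the N50 is 70.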
We can exploit evolutionary conservation!
Search the assembly for the set of Universal Single-Copy Orthologs (USCOs).
. . .
Since these genes are conserved, anything that is missing contributes to "incompleteness".
We can also generate k-mers from short read sequencing.
. . .
Any sequencing k-mer missing from the assembly contributes to "incompleteness".
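A toy version of this idea is sketched below. It assumes error-free reads and a tiny k; real tools typically use canonical k-mers and filter out low-count read k-mers that are likely errors. The truncated assembly is a made-up example.

```python
def kmer_set(seq: str, k: int) -> set[str]:
    """All distinct k-mers in a sequence."""
    return {seq[i : i + k] for i in range(len(seq) - k + 1)}


def completeness(read_kmers: set[str], assembly_kmers: set[str]) -> float:
    """Fraction of sequencing k-mers that are also found in the assembly."""
    return len(read_kmers & assembly_kmers) / len(read_kmers)


K = 4
reads = ["TTAGGCAA", "GCAAGTCCCA", "TCCCATTAA"]
read_kmers = set().union(*(kmer_set(read, K) for read in reads))

full = completeness(read_kmers, kmer_set("TTAGGCAAGTCCCATTAA", K))
# Hypothetical assembly that is missing the last four bases.
partial = completeness(read_kmers, kmer_set("TTAGGCAAGTCCCA", K))
print(f"full: {full:.2f}, truncated: {partial:.2f}")
```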
We can align the genome to the reference and call SNPs.
The more SNPs, the more presumed errors.
. . .
† REFERENCE BIAS †
More diverged samples should have more SNPs.
We again can make use of k-mers from short read sequencing.
. . .
Assembly k-mers not found in the sequencing are more confidently errors.
. . .
Assembly k-mers found too often or not often enough are potential false collapses/duplications.
Can calculate a "QV" score from these.
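One common way to get such a score is sketched here, assuming a Merqury-style formula: treat the fraction of assembly k-mers supported by the reads as the probability that k consecutive bases are all correct. The counts below are hypothetical.

```python
import math


def kmer_qv(assembly_only_kmers: int, total_assembly_kmers: int, k: int) -> float:
    """Phred-scaled consensus quality estimated from unsupported assembly k-mers."""
    if assembly_only_kmers == 0:
        return math.inf
    # Probability that a single base is correct, given the k-mer-level support.
    p_base_correct = (1 - assembly_only_kmers / total_assembly_kmers) ** (1 / k)
    error_rate = 1 - p_base_correct
    return -10 * math.log10(error_rate)


# Hypothetical counts: 210 of 1,000,000 assembly 21-mers never seen in the reads.
qv = kmer_qv(assembly_only_kmers=210, total_assembly_kmers=1_000_000, k=21)
print(f"QV ~ {qv:.1f}")
```

A QV near 50 corresponds to roughly one error per 100,000 bases.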
There are many limitations to any "summary" metrics:
. . .
:::incremental
- Many of these are global measures, but many misassemblies are local.
- How do we pick the "best" assembly if the metrics are not strictly better in one option?
- They were developed based on an outdated "state of the art".
:::
Some of these metrics are quickly becoming pointless with routine, high-quality assemblies.
. . .
How will we assess genomes in the near future?
Will we even need to assess them?
And then we move into pangenomes!
As uncovered by Erik Garrison, inspired by Desmond and Colomb (2009).
Valerio Magrelli's "Campagna Romana" (1981):
We now have a lot of genome assemblies, what are we going to do?
. . .
We broadly want:
- similar regions to be represented once
- diverged regions to show that variability
. . .
This involves some alignment step and some collapsing step.
There is not just "one" pangenome we can build from the same input, unlike genome assembly.
. . .
Consider a region with dense variation, like SNPs every few bases.
How should that be represented?
Ideally some happy intermediate between nucleotide-level and redundant sequence.
The type of graph you want may differ depending on your needs:
- reference genomes
- pangenomes for a specific analysis
. . .
Keep in mind the needs of your project.
Pangenome: a collection of assemblies
Graph: a type of pangenome representation with nodes and edges
Nodes: some contiguous sequence
Edges: connection between contiguous sequences
Bubbles: regions of variation
Most sequencing data (or anything representing genomes) are in FASTA/FASTQ.
Sequence alignments are generally in SAM/BAM.
Other miscellaneous files like BED, GFF, etc.
New file formats!
GFA: Graphical Fragment Assembly.
. . .
Three main components:
- S-lines: the sequence of the nodes
- L-lines: how the graph is connected with edges
- P-lines: how a "sample" traverses the graph (optional)
. . .
S s1 AATTTACC
S s2 GGTAT
S s3 T
L s1 + s2 + 0M
L s1 + s3 + 0M
L s2 + s3 - 0M
P ME s1+,s2+,s3-
P YOU s1+,s3+
. . .
There are other, less used lines (Walk/Jump).
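To make the three line types concrete, here is a minimal parsing sketch for the toy graph. It assumes every path step carries an explicit orientation (so the ME path ends in s3-, matching the s2 + s3 - link), and it uses whitespace splitting although real GFA is tab-separated. The function names are illustrative and not from any GFA library.

```python
GFA_TEXT = """
S s1 AATTTACC
S s2 GGTAT
S s3 T
L s1 + s2 + 0M
L s1 + s3 + 0M
L s2 + s3 - 0M
P ME s1+,s2+,s3-
P YOU s1+,s3+
"""

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def parse_gfa(text: str):
    """Collect segments (S), links (L), and paths (P) from GFA text."""
    segments, links, paths = {}, [], {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "S":
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":
            links.append((fields[1], fields[2], fields[3], fields[4]))
        elif fields[0] == "P":
            # Each step is a segment name followed by its orientation.
            paths[fields[1]] = [(step[:-1], step[-1]) for step in fields[2].split(",")]
    return segments, links, paths


def spell_path(segments: dict, steps: list) -> str:
    """Concatenate node sequences along a path, flipping reverse-strand nodes."""
    return "".join(
        segments[name] if strand == "+" else reverse_complement(segments[name])
        for name, strand in steps
    )


segments, links, paths = parse_gfa(GFA_TEXT)
sequences = {sample: spell_path(segments, steps) for sample, steps in paths.items()}
print(sequences)
```

ME and YOU share node s1 and differ only in how they continue through the bubble.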
Most downstream tools have their own "efficient" representation:
.og
.vg
.xg
.gbz
These graphs contain a lot of information.
GFA is human-readable, but the graph can be stored more efficiently for computer operations.
GAF: Graph Alignment Format
A graph "superset" of PAF (Pairwise mApping Format).
Similar to SAM/BAM, broadly capturing:
- which read
- aligns to where
- and how good it was
Likewise, this is human-readable, and so some tools prefer the binary version .gam.
Several types of whole-genome sequence graph builders:
- minigraph (https://github.com/lh3/minigraph)
- cactus (https://github.com/ComparativeGenomicsToolkit/cactus)
- pggb (https://github.com/pangenome/pggb)
- pgr-tk (https://github.com/cschin/pgr-tk)
. . .
Some specialised types:
- pangene (https://github.com/lh3/pangene)
- pantools (https://git.wur.nl/bioinformatics/pantools)
- vg (https://github.com/vgteam/vg)
Tool | | | Reference-based | Lossless | N+1 | Compute
---|---|---|---|---|---|---
minigraph | Yes | No | Yes | No | Easy | Laptop
cactus | Yes | Yes | No-ish | Yes | Easy-ish | Cluster
pggb | Yes | Yes | No | Yes | Rebuild | Big cluster
. . .
We can perfectly reconstruct any assembly from a lossless graph.
Add bubbles to graph one-by-one.
Solve many pairwise problems.
Add SNP-level details to a minigraph graph.
"Transitive closure" (
minigraph can be shaped by:
- -j: divergence level
- -L: minimum "bubble" size
. . .
pggb can be shaped by:
- -p: similarity level
- -s: mapping segment length
. . .
Parameter recommendations frequently change as the software changes.
. . .
They are also rarely intuitive.
Having 100 assemblies with ">chr1" is unhelpful!
. . .
We can use PanSN-spec (https://github.com/pangenome/PanSN-spec).
[sample_name][delim][haplotype_id][delim][contig_or_scaffold_name]
. . .
For example:
- "sample_49#2#X" for chromosome X of the second haplotype of sample_49
- "BSW#0#1" for chromosome 1 of a primary Brown Swiss assembly
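A small sketch of building and splitting such names, assuming "#" as the delimiter as in the examples above. The helper names are hypothetical and not part of any PanSN tooling.

```python
from typing import NamedTuple


class PanSN(NamedTuple):
    sample: str
    haplotype: int
    contig: str


def parse_pansn(name: str, delim: str = "#") -> PanSN:
    """Split 'sample#haplotype#contig' into its three parts."""
    sample, haplotype, contig = name.split(delim, 2)
    return PanSN(sample, int(haplotype), contig)


def format_pansn(sample: str, haplotype: int, contig: str, delim: str = "#") -> str:
    """Join the three parts back into a PanSN-style sequence name."""
    return f"{sample}{delim}{haplotype}{delim}{contig}"


parsed = parse_pansn("sample_49#2#X")
renamed = format_pansn("BSW", 0, "1")
print(parsed, renamed)
```

Renaming FASTA headers this way before building the graph keeps 100 different "chr1" sequences distinguishable.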
Some programs (e.g., vg, odgi, wfmash, etc.) implicitly assume PanSN-spec.
We don't want to do impossible alignments.
. . .
Unless we want to (e.g., inter-chromosomal duplications).
Reference genomes allowed us to universally refer to the same location.
. . .
InDels within individuals mean that coordinates rarely match between assemblies.
. . .
Uneven assembly (or true biological variation) of telomeres/centromeres wildly changes coordinates.
One partial solution is rGFA (reference GFA), used by minigraph.
Can refer to any non-reference sequence relative to the reference "backbone".
. . .
Adds the SN, SO, SR tags to the GFA.
. . .
This still breaks if you update your graph!
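A sketch of how the tags can be read, assuming SN is the stable sequence name, SO the offset on that sequence, and SR the rank (0 for the reference backbone). The segment lines and sequence names below are made up for illustration.

```python
# Hypothetical rGFA segment lines (whitespace-separated here; real files use tabs).
RGFA_SEGMENTS = [
    "S s1 AATTTACC SN:Z:ref#0#chr1 SO:i:0 SR:i:0",
    "S s2 GGTAT SN:Z:sample#1#chr1 SO:i:8 SR:i:1",
    "S s3 T SN:Z:ref#0#chr1 SO:i:8 SR:i:0",
]


def parse_rgfa_segment(line: str) -> dict:
    """Return the segment name, length, and its typed tags (SN, SO, SR)."""
    fields = line.split()
    record = {"name": fields[1], "length": len(fields[2])}
    for tag in fields[3:]:
        key, kind, value = tag.split(":", 2)
        record[key] = int(value) if kind == "i" else value
    return record


segments = [parse_rgfa_segment(line) for line in RGFA_SEGMENTS]

# Rank-0 segments, ordered by offset, tile the reference backbone.
backbone = sorted((s for s in segments if s["SR"] == 0), key=lambda s: s["SO"])
for seg in backbone:
    start = seg["SO"]
    print(f'{seg["name"]}\t{seg["SN"]}:{start}-{start + seg["length"]}')
```

Segments with a rank above 0 keep the name and offset of the assembly they came from, which is how non-reference sequence stays addressable.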
Other tools like pggb and cactus avoid this feature/issue by being "reference-free".
. . .
Converting coordinates within a lossless graph is straightforward.
Except when it is undefined!
Many large scale efforts in progress:
- Human Pangenome Reference Consortium
- Vertebrate Genomes Project
- Darwin Tree of Life
Several hundreds of genomes can take millions of core hours!
The
- rebuild with annual data freezes?
- "stable" graph and "development" graph?
How do these problems scale for compute resources?
What will be bottlenecks in the near future?
. . .
- wfmash is $\mathcal{O}(n^2)$
- seqwish is memory/disk hungry
- GFAffix is almost IO bound
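To get a feel for the quadratic term, here is a rough count of genome pairs in an all-versus-all comparison. It assumes the cost is dominated by the number of unordered pairs, which is a simplification and not a measurement of any tool.

```python
def all_vs_all_pairs(n_genomes: int) -> int:
    """Number of unordered genome pairs in an all-versus-all comparison."""
    return n_genomes * (n_genomes - 1) // 2


pair_counts = {n: all_vs_all_pairs(n) for n in (10, 100, 1000)}
for n, pairs in pair_counts.items():
    print(f"{n:>5} genomes -> {pairs:>7} pairs")
```

Going from 10 to 1000 genomes is 100 times more input but roughly 10,000 times more pairs.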
. . .
Will rate of development beat the rate of sequencing?
Some genome assemblers start by building graphs representing variation in the sequencing reads.
. . .
Isn't that a type of pangenome?
. . .
Could we represent n=2+ genomes as graphs?
Could we represent somatic mutations as graphs?
The answer is: maybe?
"Graph-to-graph" alignment is a harder problem, but an intriguing idea to consider.
. . .
We will almost certainly never work with less variation, so what will that future look like?
. . .
:::incremental
- Accurate long reads have effectively "solved" genome assembly
- Some complex organisms (plants especially) still have challenges
- Graph pangenomes are great for representing all this newly accessible variation
- Different pangenomes have different (dis)advantages
:::
And then coffee