Commit

latex
leomrtns committed Jan 13, 2020
1 parent e77e68e commit 21db061
Showing 11 changed files with 322 additions and 140 deletions.
57 changes: 38 additions & 19 deletions docs/ms001/introduction.tex
@@ -6,28 +6,42 @@
%* On empirical data, cluster (either using an automated method, or quadrants, or by hand), and try to reconstruct a consensus tree for each cluster vs. overall consensus tree.
%* Potential discussion point: discrete clusters or continuum? ILS and “isolated” HGT events, missing data, inference errors -> probably better modeled as a continuum -> interesting to see the extremes?

A fundamental unit in phylogenomic analysis is the gene (or genomic locus), and the most detailed evolutionary
history of a gene includes the duplication and loss events by which an ancestral locus gave rise to all
observed diversity of loci and genes --- the so-called gene family, of which a single copy gene is a particular case.
Each gene family will then be described by all the loci within a species connected through a common ancestor
(i.e. inferred to be homologous to each other).
And their histories are expected to differ from one another due to the coalescent, duplications, losses, and
other biological events.
Therefore, even for the simplest case of single copy genes, we might still observe distinct patterns of their presence and
absence amongst species, and conflicting inferred phylogenies due to the coalescent, lateral transfers, and the very
inference process.
The accumulation of large-scale phylogenomic data sets leads to new challenges of comparison and visualisation of
distinct gene families, as well as of detecting the influence of each genomic region into the overall phylogenomic
signals.
State-of-the-art phylogenetic methods take gene trees as input, and model the incongruence among them in
various ways, based on various assumptions.
signals.
Many state-of-the-art phylogenetic methods take gene trees as input, and model the incongruence among them in
various ways, based on parametric and non-parametric assumptions \citep{astral, astrid}.
Many of these methods require the input gene trees to have at most one
representative from each species (e.g. by requiring the user to first run an orthology inference pipeline). This
limitation is hard to circumvent since almost all tree distance measures (required to measure incongruence between two
trees) assume that the same leaves are present on both trees.
representative from each species (e.g. by requiring the user to first run an orthology inference pipeline).
This limitation is hard to circumvent since almost all tree distance measures (required to measure incongruence between
two trees) assume that the same leaves are present on both trees.

There have been several attempts at describing phylogenetic trees as vectors of features, suitable for statistical
comparison, such as \cite{Leigh2008, Leigh2011, Susko2006, Narechania2016, Nye2011, Yoshida2015, Lewitus2015,
Kendall2016, Colijn2018}.
There are also a few methods that rely on pairwise tree distance matrices, which could then be projected into a new
coordinate system.

Unsupervised learning algorithms accept one of two forms of input: a design (also called feature) matrix X of size nXp
(n samples with p dimensions each), or a dissimilarity matrix D of size nXn describing the distances between each pair
of samples. Given D, one can project the samples into a feature space for further analysis (using multidimensional
scaling, for instance). However, this projection needs to be recalculated if new samples arise. Our method, on the other
Unsupervised learning algorithms accept one of two forms of input: a design (also called feature) matrix $X$ of size
$n\times p$ ($n$ samples with $p$ dimensions each), or a dissimilarity matrix $D$ of size $n\times n$
describing the distances between each pair of samples.
Given $D$, one can project the samples into a feature space for further analysis (using multidimensional
scaling, for instance).
However, this projection needs to be recalculated if new samples arise.
Our method, on the other
hand, allows for disentangling the acquisition of sample gene trees and their projection, since their feature space can
be described without resorting to the whole set of existing sample trees. This can become particularly relevant when the
be described without resorting to the whole set of existing sample trees.
This can become particularly relevant when the
number of sample gene trees greatly exceeds the number of reference species trees.
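
The contrast between the two input forms can be illustrated with a minimal sketch. Everything here is a toy assumption rather than the method of the manuscript: samples are encoded as sets of clade labels, `sym_diff` is a stand-in dissimilarity, and `extend_distance_matrix` and `feature_row` are hypothetical helper names. The point is only that a new sample adds a row and a column to $D$ (so any projection derived from $D$ would have to be refitted), whereas its feature row depends on the fixed references alone.

```python
def sym_diff(a, b):
    """Toy dissimilarity: number of clade labels found in only one of the two sets."""
    return len(a ^ b)


def extend_distance_matrix(D, samples, new_sample, dist):
    """Return a new (n+1) x (n+1) matrix; needs one distance per existing sample."""
    new_col = [dist(s, new_sample) for s in samples]
    rows = [row + [d] for row, d in zip(D, new_col)]
    rows.append(new_col + [0])
    return rows


def feature_row(sample, references, dist):
    """Coordinates of one sample: its distance to each fixed reference."""
    return [dist(sample, ref) for ref in references]


# Hypothetical toy data: each "tree" is just a set of clade labels.
samples = [{"AB", "ABC"}, {"AC", "ABC"}, {"BC", "ABC"}]
references = [{"AB", "ABC"}, {"AC", "ABC"}]

D = [[sym_diff(a, b) for b in samples] for a in samples]
X = [feature_row(s, references, sym_diff) for s in samples]

new_sample = {"AB", "ABD"}
D_new = extend_distance_matrix(D, samples, new_sample, sym_diff)   # touches all samples
X_new = X + [feature_row(new_sample, references, sym_diff)]        # touches references only
```

With $n$ existing samples and $p$ references, the first update costs $n$ distance evaluations and changes the shape of the matrix, while the second costs $p$ evaluations and leaves the existing rows untouched.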

Visualisation and comparison of gene trees has been increasingly recognised as a way to objectively partition
@@ -37,15 +51,19 @@
However in many cases we cannot or prefer not to decide beforehand the orthologous groups.
In these cases we must work with the so-called
multi-labelled trees (or mul-trees, for short), which are trees with potentially more than one leaf with the same label
(labelled by the same species, in our case). At the same time, dissimilarity matrices are not the only input for
(labelled by the same species, in our case).
At the same time, dissimilarity matrices are not the only input for
classification algorithms, and describing samples through a coordinate system can have advantages.

There are many new algos thanks to big data, and our data sets are also increasing, therefore we can make use of their
novelties if we write our problem as a big data one. <...> This analysis can also help in ‘gene shopping’, i.e. when
Many new algorithms have been developed for large data sets, and since our own data sets are also growing, we can make
use of these advances if we frame our problem as a big data one.
\red{to fill in something}
This analysis can also help in ``gene shopping'', i.e. when
only genomic regions with desired properties are selected \citep{Smith2018}.
On the other hand, we might be concerned if a certain selection of genes can be responsible for a bias in the results.

Each gene family tree is represented by a set of features, and may contain paralogs or missing species. Each gene family
Each gene family tree is represented by a set of features, and may contain paralogs or missing species.
Each gene family
can be represented by several trees, all sharing the same pattern of missing/duplicate species, as in Bayesian posterior
distributions. (However for testing purposes we might prune individual trees from a gene family.)

@@ -70,15 +88,16 @@
can "weight" these hypotheses by their representativity in the reference sptrees (minimal case is to use just two
sptrees, as "only" dimensions in the eigenbasis).

At the same time, choosing just "a few" sptrees allows our matrix to be lower dimensional than a full pairwise distance.
At the same time, choosing just ``a few'' sptrees allows our matrix to be lower dimensional than a full pairwise distance matrix.
This becomes more evident when 1) gene families are much larger than sptrees (more leaves), and 2) many samples from
many genefams are analysed (e.g. 1M trees per family).

The idea is that although we may lose a lot of resolution when comparing two gene family trees directly (assuming such
comparison can be accomplished), we may have higher resolution [signal] by comparing each gene family to a species tree.
The difference lies in the number of species in common: when comparing two gene families G1 and G2 representing
respectively n1 and n2 species (over possibly N species), they will have in the worst case only max(0,n1+n2-N) ≤
min(n1,n2) --- where min(n1,n2) is the worst case comparison between G1 or G2 and the species tree.
The difference lies in the number of species in common: when comparing two gene families $G_1$ and $G_2$ representing
respectively $n_1$ and $n_2$ species (over possibly $N$ species), they will have in the worst case only
$\max\left(0,n_1+n_2-N\right) \leq \min\left(n_1,n_2\right)$ species in common, where $\min\left(n_1,n_2\right)$
is the worst-case number of species shared between $G_1$ or $G_2$ and the species tree.
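
The bound can be checked numerically with a short sketch. The function names `min_shared_between_families` and `min_shared_with_species_tree` are hypothetical, and the second function assumes, as in the text, that the species tree contains all $N$ species.

```python
def min_shared_between_families(n1, n2, N):
    """Fewest species two gene families can share, given n1 and n2 out of N species."""
    if not (0 <= n1 <= N and 0 <= n2 <= N):
        raise ValueError("n1 and n2 must lie between 0 and N")
    return max(0, n1 + n2 - N)


def min_shared_with_species_tree(n1, n2):
    """Worst case over the two families when each is compared to a full species tree."""
    return min(n1, n2)


# Example with N = 10 species: families covering 6 and 7 species may overlap
# in as few as 3 species, while each shares at least 6 with the species tree.
overlap = min_shared_between_families(6, 7, 10)
against_reference = min_shared_with_species_tree(6, 7)
```

When $n_1 + n_2 \leq N$ the two families may share no species at all, in which case a direct comparison carries no information, while the comparison with the species tree still uses every species present in the family.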


\begin{figure}[!htbp]
31 changes: 20 additions & 11 deletions docs/ms001/methods.tex
@@ -1,19 +1,28 @@
\section{Methods}
The gene family trees represent orthogroups or root HOGs <ref>, that is, a tree describing all sequences assumed to
share a common ancestral sequence (including paralogs, or several individuals from the same population). These trees are
The gene family trees represent orthogroups or root HOGs [ref], that is, a tree describing all sequences assumed to
share a common ancestral sequence (including paralogs, or several individuals from the same population).
These trees are
the input to the algorithm and may have been estimated by any phylogenetic method --- the algorithm is agnostic to the
source of disagreement (and therefore to the reason for the multiple leaves with the same species label) or to the inference
procedure. The reference trees represent possible species trees, and must be on the same set of all species (i.e. all
species must be present in all reference trees). The set of reference trees should capture the variability between gene
trees with respect to the set of distances used to describe them. Therefore a good choice would be a set of optimal
procedure.
The reference trees represent possible species trees, and must be on the same set of all species (i.e. all
species must be present in all reference trees).
The set of reference trees should capture the variability between gene
trees with respect to the set of distances used to describe them.
Therefore a good choice would be a set of optimal
species trees for the most dissimilar sets of gene families, although here we usually limit ourselves to a few species
trees inferred from a single set of gene families, with further randomisation in a few cases. It is important to notice
trees inferred from a single set of gene families, with further randomisation in a few cases.
It is important to notice
that the reference species trees are not restricted to the optimal ones (under some notion of optimality) and do not
need to include all optimal species trees. However having good candidates for the species trees will help interpreting
the gene trees in terms of them. The gene trees are assumed to be unrooted since the basic phylogenetic inference models
can’t infer the root location, but our method can be easily adapted for rooted gene family trees. The reference species
trees are rooted, although some distances disregard this information. The reference species trees are fixed beforehand,
but once they are set they can be used to create the “tree signal” of any gene family online, as long as all species
need to include all optimal species trees.
However, having good candidates for the species trees will help in interpreting
the gene trees in terms of them.
The gene trees are assumed to be unrooted since the basic phylogenetic inference models
cannot infer the root location, but our method can be easily adapted for rooted gene family trees.
The reference species
trees are rooted, although some distances disregard this information.
The reference species trees are fixed beforehand,
but once they are set they can be used to create the ``tree signal'' of any gene family online, as long as all species
present in the gene family are represented in the reference species trees.
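
A minimal sketch of how such a signal could be assembled is given below. It is an illustration under strong simplifying assumptions, not the implementation described in this manuscript: trees are nested tuples of species labels, the gene tree is treated as rooted for convenience, and the two distances (`clade_mismatch` and `missing_species`) are toy stand-ins for the measures discussed later. All function names are hypothetical.

```python
def clades(tree):
    """Return (set of species, set of clades as species sets) for a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found |= child_found
    found.add(leaves)
    return leaves, found


def clade_mismatch(gene, ref):
    """Toy distance: clades present in only one tree, after restricting ref to the gene's species."""
    gene_species, gene_clades = clades(gene)
    _, ref_clades = clades(ref)
    restricted = {c & gene_species for c in ref_clades}
    restricted = {c for c in restricted if len(c) > 1}
    return len(restricted ^ {c for c in gene_clades if len(c) > 1})


def missing_species(gene, ref):
    """Toy distance: number of reference species absent from the gene tree."""
    return len(clades(ref)[0] - clades(gene)[0])


def tree_signal(gene, reference_trees, distances):
    """One feature vector: every distance evaluated against every reference tree."""
    ref_species = clades(reference_trees[0])[0]
    if any(clades(ref)[0] != ref_species for ref in reference_trees):
        raise ValueError("all reference trees must be on the same species set")
    unknown = clades(gene)[0] - ref_species
    if unknown:
        raise ValueError(f"species absent from the references: {sorted(unknown)}")
    return [d(gene, ref) for ref in reference_trees for d in distances]


ref1 = ((("A", "B"), "C"), "D")
ref2 = ((("A", "C"), "B"), "D")
gene = ((("A", "B"), "A"), "C")   # mul-tree: species A appears twice, D is missing
signal = tree_signal(gene, [ref1, ref2], [clade_mismatch, missing_species])
```

Because the references are fixed, each new gene family yields its row independently of all other gene families, and the rows can simply be stacked into the feature matrix as they arrive.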

\red{describe the vector}
27 changes: 15 additions & 12 deletions docs/ms001/phylosignal.aux
@@ -17,24 +17,25 @@
\providecommand\HyField@AuxAddToFields[1]{}
\providecommand\HyField@AuxAddToCoFields[2]{}
\bibstyle{plainnat}
\citation{Leigh2008,Leigh2011,Susko2006,Narechania2016,Nye2011,Yoshida2015,Lewitus2015,Kendall2016,Colijn2018}
\citation{astral,astrid}
\babel@aux{english}{}
\citation{Leigh2008,Leigh2011,Susko2006,Narechania2016,Nye2011,Yoshida2015,Lewitus2015,Kendall2016,Colijn2018}
\citation{Gori2016,Jombart2017,Huang2016}
\citation{Kendall2018}
\citation{Smith2018}
\@writefile{toc}{\contentsline {section}{\numberline {1}Methods}{3}{section.1}}
\@writefile{lof}{\contentsline {figure}{\numberline {1}{\ignorespaces Schematic representation of the tree signal calculation. In panel A we show two simple cases for a sample of 8 gene family trees: at the top we compare each gene tree to two distinct reference (species) trees using the minimum number of duplications (duplication distance), and at the bottom we compare all sample gene trees to a single reference tree, but using two distinct metrics --- the Robinson-Foulds (RF) distance and the duplication distance. Both comparisons provide little information in isolation, but when combined allow for distinguishing the two groups of gene families (represented by distinct colours). Panel B shows how the tree signal of a single gene family tree can be calculated, given a set of species trees and a set of distances. Notice that currently we work with unrooted gene trees and rooted species trees. In panel C we show that once we have the tree signal from each gene family, then we can create a feature matrix (\IeC {\textquoteleft }design matrix\IeC {\textquoteright }) which can be used in downstream analyses. \relax }}{4}{figure.caption.1}}
\@writefile{toc}{\contentsline {section}{\numberline {1}Methods}{4}{section.1}}
\@writefile{lof}{\contentsline {figure}{\numberline {1}{\ignorespaces Schematic representation of the tree signal calculation. In panel A we show two simple cases for a sample of 8 gene family trees: at the top we compare each gene tree to two distinct reference (species) trees using the minimum number of duplications (duplication distance), and at the bottom we compare all sample gene trees to a single reference tree, but using two distinct metrics --- the Robinson-Foulds (RF) distance and the duplication distance. Both comparisons provide little information in isolation, but when combined allow for distinguishing the two groups of gene families (represented by distinct colours). Panel B shows how the tree signal of a single gene family tree can be calculated, given a set of species trees and a set of distances. Notice that currently we work with unrooted gene trees and rooted species trees. In panel C we show that once we have the tree signal from each gene family, then we can create a feature matrix (\IeC {\textquoteleft }design matrix\IeC {\textquoteright }) which can be used in downstream analyses. \relax }}{5}{figure.caption.1}}
\providecommand*\caption@xref[2]{\@setref\relax\@undefined{#1}}
\newlabel{figure01}{{1}{4}{Schematic representation of the tree signal calculation. In panel A we show two simple cases for a sample of 8 gene family trees: at the top we compare each gene tree to two distinct reference (species) trees using the minimum number of duplications (duplication distance), and at the bottom we compare all sample gene trees to a single reference tree, but using two distinct metrics --- the Robinson-Foulds (RF) distance and the duplication distance. Both comparisons provide little information in isolation, but when combined allow for distinguishing the two groups of gene families (represented by distinct colours). Panel B shows how the tree signal of a single gene family tree can be calculated, given a set of species trees and a set of distances. Notice that currently we work with unrooted gene trees and rooted species trees. In panel C we show that once we have the tree signal from each gene family, then we can create a feature matrix (‘design matrix’) which can be used in downstream analyses. \relax }{figure.caption.1}{}}
\newlabel{figure01}{{1}{5}{Schematic representation of the tree signal calculation. In panel A we show two simple cases for a sample of 8 gene family trees: at the top we compare each gene tree to two distinct reference (species) trees using the minimum number of duplications (duplication distance), and at the bottom we compare all sample gene trees to a single reference tree, but using two distinct metrics --- the Robinson-Foulds (RF) distance and the duplication distance. Both comparisons provide little information in isolation, but when combined allow for distinguishing the two groups of gene families (represented by distinct colours). Panel B shows how the tree signal of a single gene family tree can be calculated, given a set of species trees and a set of distances. Notice that currently we work with unrooted gene trees and rooted species trees. In panel C we show that once we have the tree signal from each gene family, then we can create a feature matrix (‘design matrix’) which can be used in downstream analyses. \relax }{figure.caption.1}{}}
\@writefile{toc}{\contentsline {subsection}{\numberline {1.1}Distances}{6}{subsection.1.1}}
\@writefile{toc}{\contentsline {subsection}{\numberline {1.2}Normalisation}{7}{subsection.1.2}}
\@writefile{toc}{\contentsline {subsection}{\numberline {1.3}Choice of reference trees}{7}{subsection.1.3}}
\@writefile{toc}{\contentsline {section}{\numberline {2}Results}{8}{section.2}}
\@writefile{toc}{\contentsline {subsection}{\numberline {1.3}Choice of reference trees}{8}{subsection.1.3}}
\@writefile{toc}{\contentsline {section}{\numberline {2}Results}{9}{section.2}}
\@writefile{lof}{\contentsline {figure}{\numberline {2}{\ignorespaces SPR chain simulation with missing data. On the left panel we see the MDS projections of the samples when their signal is calculated from similar reference trees, and on the right we have the projections when random species trees are used. The colors represent the four groupings (consecutive trees in the SPR chain, separated from each other by more SPR branch swappings) defined in the simulation. \relax }}{10}{figure.caption.2}}
\newlabel{figure002}{{2}{10}{SPR chain simulation with missing data. On the left panel we see the MDS projections of the samples when their signal is calculated from similar reference trees, and on the right we have the projections when random species trees are used. The colors represent the four groupings (consecutive trees in the SPR chain, separated from each other by more SPR branch swappings) defined in the simulation. \relax }{figure.caption.2}{}}
\@writefile{toc}{\contentsline {section}{\numberline {3}Discussion}{10}{section.3}}
\@writefile{lof}{\contentsline {figure}{\numberline {3}{\ignorespaces Typical case (simphy simulation). Colors represent underlying species trees. \relax }}{11}{figure.caption.3}}
\newlabel{figure003}{{3}{11}{Typical case (simphy simulation). Colors represent underlying species trees. \relax }{figure.caption.3}{}}
\@writefile{toc}{\contentsline {section}{\numberline {3}Discussion}{11}{section.3}}
\@writefile{lof}{\contentsline {figure}{\numberline {4}{\ignorespaces Fungal data set \relax }}{12}{figure.caption.4}}
\newlabel{figure004}{{4}{12}{Fungal data set \relax }{figure.caption.4}{}}
\bibdata{references}
@@ -47,8 +48,10 @@
\bibcite{Leigh2008}{{7}{2008}{{Leigh et~al.}}{{Leigh, Susko, Baumgartner, and Roger}}}
\bibcite{Leigh2011}{{8}{2011}{{Leigh et~al.}}{{Leigh, Schliep, Lopez, and Bapteste}}}
\bibcite{Lewitus2015}{{9}{2015}{{Lewitus and Morlon}}{{}}}
\bibcite{Narechania2016}{{10}{2016}{{Narechania et~al.}}{{Narechania, Baker, DeSalle, Mathema, Kolokotronis, Kreiswirth, and Planet}}}
\bibcite{Nye2011}{{11}{2011}{{Nye}}{{}}}
\bibcite{Smith2018}{{12}{2018}{{Smith et~al.}}{{Smith, Brown, and Walker}}}
\bibcite{Susko2006}{{13}{2006}{{Susko et~al.}}{{Susko, Leigh, Doolittle, and Bapteste}}}
\bibcite{Yoshida2015}{{14}{2015}{{Yoshida et~al.}}{{Yoshida, Fukumizu, and Vogiatzis}}}
\bibcite{astral}{{10}{2015}{{Mirarab and Warnow}}{{}}}
\bibcite{Narechania2016}{{11}{2016}{{Narechania et~al.}}{{Narechania, Baker, DeSalle, Mathema, Kolokotronis, Kreiswirth, and Planet}}}
\bibcite{Nye2011}{{12}{2011}{{Nye}}{{}}}
\bibcite{Smith2018}{{13}{2018}{{Smith et~al.}}{{Smith, Brown, and Walker}}}
\bibcite{Susko2006}{{14}{2006}{{Susko et~al.}}{{Susko, Leigh, Doolittle, and Bapteste}}}
\bibcite{astrid}{{15}{2015}{{Vachaspati and Warnow}}{{}}}
\bibcite{Yoshida2015}{{16}{2015}{{Yoshida et~al.}}{{Yoshida, Fukumizu, and Vogiatzis}}}