latex

leomrtns · Jan 13, 2020 · 21db061 · 21db061
1 parent e77e68e
commit 21db061

Showing 11 changed files with 322 additions and 140 deletions.
diff --git a/docs/ms001/introduction.tex b/docs/ms001/introduction.tex
@@ -6,28 +6,42 @@
 %* On empirical data, cluster (either using an automated method, or quadrants, or by hand), and try to reconstruct a consensus tree for each cluster vs. overall consensus tree.
 %* Potential discussion point: discrete clusters or continuum? ILS and “isolated” HGT events, missing data, inference errors -> probably better modeled as a continuum -> interesting to see the extremes?
 
+A fundamental unit in phylogenomic analysis is the gene (or genomic locus), and the most detailed evolutionary
+history of a gene includes the duplication and loss events by which an ancestral locus gave rise to all
+observed diversity of loci and genes --- the so-called gene family, of which a single copy gene is a particular case.
+Each gene family will then be described by all the loci within a species connected through a common ancestor
+(i.e. inferred to be homologous to each other).
+And their histories are expected to differ from one another due to the coalescent, duplications, losses, and
+other biological events.
+Therefore, even for the simplest case of single copy genes, we might still observe distinct patterns of their presence and
+absence amongst species, and conflicting inferred phylogenies due to the coalescent, lateral transfers, and the very
+inference process. 
 The accumulation of large-scale phylogenomic data sets leads to new challenges of comparison and visualisation of
 distinct gene families, as well as of detecting the influence of each genomic region into the overall phylogenomic
-signals. 
-State-of-the-art phylogenetic methods take gene trees as input, and model the incongruence among them in
-various ways, based on various assumptions. 
+signals.
+Many state-of-the-art phylogenetic methods take gene trees as input, and model the incongruence among them in
+various ways, based on parametric and non-parametric assumptions \citep{astral, astrid}.
 Many of these methods require the input gene trees to have at most one
-representative from each species (e.g. by requiring the user to first run an orthology inference pipeline). This
-limitation is hard to circumvent since almost all tree distance measures (required to measure incongruence between two
-trees) assume that the same leaves are present on both trees.
+representative from each species (e.g. by requiring the user to first run an orthology inference pipeline). 
+This limitation is hard to circumvent since almost all tree distance measures (required to measure incongruence between
+two trees) assume that the same leaves are present on both trees.
 
 There have been several attempts at describing phylogenetic trees as vectors of features, suitable for statistical
 comparison, such as \cite{Leigh2008, Leigh2011, Susko2006, Narechania2016, Nye2011, Yoshida2015, Lewitus2015,
 Kendall2016, Colijn2018}. 
 There are also a few methods that rely on pairwise tree distance matrices, which could then be projected into a new
 coordinate system.
 
-Unsupervised learning algorithms accept one of two forms of input: a design (also called feature) matrix X of size nXp
-(n samples with p dimensions each), or a dissimilarity matrix D of size nXn describing the distances between each pair
-of samples. Given D, one can project the samples into a feature space for further analysis (using multidimensional
-scaling, for instance). However, this projection needs to be recalculated if new samples arise. Our method, on the other
+Unsupervised learning algorithms accept one of two forms of input: a design (also called feature) matrix $X$ of size
+$n\times p$ ($n$ samples with $p$ dimensions each), or a dissimilarity matrix $D$ of size $n\times n$
+describing the distances between each pair of samples. 
+Given $D$, one can project the samples into a feature space for further analysis (using multidimensional
+scaling, for instance). 
+However, this projection needs to be recalculated if new samples arise. 
+Our method, on the other
 hand, allows for disentangling the acquisition of sample gene trees and their projection, since their feature space can
-be described without resorting to the whole set of existing sample trees. This can become particularly relevant when the
+be described without resorting to the whole set of existing sample trees. 
+This can become particularly relevant when the
 number of sample gene trees greatly exceeds the number of reference species trees.
 
 Visualisation and comparison of gene trees has been increasingly recognised as a way to objectively partition
@@ -37,15 +51,19 @@
 However in many cases we cannot or prefer not to decide beforehand the orthologous groups. 
 In these cases we must work with  the so-called
 multi-labelled trees (or mul-trees, for short), which are trees with potentially more than one leaf with the same label
-(labelled by the same species, in our case). At the same time, dissimilarity matrices are not the only input for
+(labelled by the same species, in our case). 
+At the same time, dissimilarity matrices are not the only input for
 classification algorithms, and describing samples through a coordinate system can have advantages.
 
-There are many new algos thanks to big data, and our data sets are also increasing, therefore we can make use of their
-novelties if we write our problem as a big data one. <...> This analysis can also help in ‘gene shopping’, i.e. when
+There are many new algorithms thanks to big data, and our data sets are also increasing, therefore we can make use of their
+novelties if we write our problem as a big data one.
+\red{to fill in something}
+This analysis can also help in ‘gene shopping’, i.e. when
 only genomic regions with desired properties are selected \citep{Smith2018}.
 On the other hand, we might be concerned if a certain selection of genes can be responsible for a bias in the results.
 
-Each gene family tree is represented by a set of features, and may contain paralogs or missing species. Each gene family
+Each gene family tree is represented by a set of features, and may contain paralogs or missing species. 
+Each gene family
 can be represented by several trees, all sharing the same pattern of missing/duplicate species, as in Bayesian posterior
 distributions. (However for testing purposes we might prune individual trees from a gene family.)
 
@@ -70,15 +88,16 @@
 can "weight" these hypotheses by their representativity in the reference sptrees (minimal case is to use just two
 sptrees, as "only" dimensions in the eigenbasis).
 
-At the same time, choosing just "a few" sptrees allows our matrix to be lower dimensional than a full pairwise distance.
+At the same time, choosing just ``a few'' sptrees allows our matrix to be lower dimensional than a full pairwise distance.
 This becomes more evident when 1) gene families are much larger than sptrees (more leaves), and 2) many samples from
 many genefams are analysed (e.g. 1M trees per family).
 
 The idea is that although we may lose a lot of resolution when comparing two gene family trees directly (assuming such
 comparison can be accomplished), we may have higher resolution [signal] by comparing each gene family to a species tree.
-The difference lies in the number of species in common: when comparing two gene families G1 and G2 representing
-respectively n1 and n2 species (over possibly N species), they will have in the worst case only max(0,n1+n2-N) ≤
-min(n1,n2)  ---  where min(n1,n2) is the worst case comparison between G1 or G2 and the species tree.
+The difference lies in the number of species in common: when comparing two gene families $G_1$ and $G_2$ representing
+respectively $n_1$ and $n_2$ species (out of $N$ possible species), they will share in the worst case only
+$\max\left(0,n_1+n_2-N\right) \leq \min\left(n_1,n_2\right)$ species, where $\min\left(n_1,n_2\right)$
+is the worst case when comparing $G_1$ or $G_2$ with the species tree.
 
 
 \begin{figure}[!htbp]

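The overlap bound stated at the end of the introduction hunk follows from inclusion-exclusion, and a small worked case may help when reviewing it. The sketch below is not part of the commit: the symbols $S_1$ and $S_2$ (the species sets of $G_1$ and $G_2$) and the numbers are assumptions introduced only for illustration, and `align*` assumes that `amsmath` is loaded.

```latex
% Illustrative derivation; S_1, S_2 and the numeric values are assumptions.
\begin{align*}
|S_1 \cap S_2| &= |S_1| + |S_2| - |S_1 \cup S_2| \geq n_1 + n_2 - N
  && \text{since } |S_1 \cup S_2| \leq N, \\
|S_1 \cap S_2| &\geq \max\left(0, n_1 + n_2 - N\right)
  && \text{since an overlap cannot be negative.}
\end{align*}
% Example with N = 10, n_1 = 6, n_2 = 5:
\[
  \max\left(0, 6 + 5 - 10\right) = 1 \leq \min\left(6, 5\right) = 5 .
\]
```

In this example two gene families may share as little as one species, whereas each of them shares at least five species with a species tree that contains all ten.
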
diff --git a/docs/ms001/methods.tex b/docs/ms001/methods.tex
@@ -1,19 +1,28 @@
 \section{Methods}
-The gene family trees represent orthogroups or root HOGs <ref>, that is, a tree describing all sequences assumed to
-share a common ancestral sequence (including paralogs, or several individuals from the same population). These trees are
+The gene family trees represent orthogroups or root HOGs [ref], that is, a tree describing all sequences assumed to
+share a common ancestral sequence (including paralogs, or several individuals from the same population). 
+These trees are
 the input to the algorithm and may have been estimated by any phylogenetic method --- the algorithm is agnostic to the
 source of disagreement (and therefore to the reason for the multiple leaves with same species label) or to the inference
-procedure. The reference trees represent possible species trees, and must be on the same set of all species (i.e. all
-species must be present in all reference trees). The set of reference trees should capture the variability between gene
-trees with respect to the set of distances used to describe them. Therefore a good choice would be a set of optimal
+procedure. 
+The reference trees represent possible species trees, and must be on the same set of all species (i.e. all
+species must be present in all reference trees). 
+The set of reference trees should capture the variability between gene
+trees with respect to the set of distances used to describe them. 
+Therefore a good choice would be a set of optimal
 species trees for the most dissimilar sets of gene families, although here we usually limit ourselves to a few species
-trees inferred from a single set of gene families, with further randomisation in a few cases. It is important to notice
+trees inferred from a single set of gene families, with further randomisation in a few cases. 
+It is important to notice
 that the reference species trees are not restricted to the optimal ones (under some notion of optimality) and do not
-need to include all optimal species trees. However having good candidates for the species trees will help interpreting
-the gene trees in terms of them. The gene trees are assumed to be unrooted since the basic phylogenetic inference models
-can’t infer the root location, but our method can be easily adapted for rooted gene family trees. The reference species
-trees are rooted, although some distances disregard this information. The reference species trees are fixed beforehand,
-but once they are set they can be used to create the “tree signal” of any gene family online, as long as all species
+need to include all optimal species trees. 
+However, having good candidates for the species trees will help to interpret
+the gene trees in terms of them. 
+The gene trees are assumed to be unrooted since the basic phylogenetic inference models
+can’t infer the root location, but our method can be easily adapted for rooted gene family trees.
+The reference species
+trees are rooted, although some distances disregard this information.
+The reference species trees are fixed beforehand,
+but once they are set they can be used to create the ``tree signal'' of any gene family online, as long as all species
 present in the gene family are represented in the reference species trees.
 
 \red{describe the vector}
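
The Methods hunk describes the "tree signal" as something computed for one gene family at a time once the reference species trees are fixed. A minimal Python sketch of that idea follows. It is not the authors' implementation: trees are reduced to sets of clades over identical leaf sets, and `clade_distance` and `missing_clades` are toy stand-ins for the RF and duplication distances mentioned in the figure caption. All function and variable names are hypothetical.

```python
from typing import Callable, FrozenSet, List, Sequence

Clade = FrozenSet[str]
Tree = FrozenSet[Clade]  # toy tree: the set of its non-trivial clades
Distance = Callable[[Tree, Tree], float]


def toy_tree(*clades: str) -> Tree:
    """Build a toy tree from clade strings, e.g. toy_tree("AB", "ABC")."""
    return frozenset(frozenset(c) for c in clades)


def clade_distance(gene: Tree, ref: Tree) -> float:
    """Toy RF-like distance: clades present in exactly one of the two trees."""
    return float(len(gene ^ ref))


def missing_clades(gene: Tree, ref: Tree) -> float:
    """Toy second distance: reference clades absent from the gene tree."""
    return float(len(ref - gene))


def tree_signal(gene: Tree, references: Sequence[Tree],
                distances: Sequence[Distance]) -> List[float]:
    """One feature per (reference tree, distance) pair, in a fixed order."""
    return [d(gene, ref) for ref in references for d in distances]


def design_matrix(genes: Sequence[Tree], references: Sequence[Tree],
                  distances: Sequence[Distance]) -> List[List[float]]:
    """Stack the signals row by row; each row depends on one gene tree only."""
    return [tree_signal(g, references, distances) for g in genes]


# Two hypothetical reference species trees on leaves A, B, C, D.
references = [toy_tree("AB", "ABC"), toy_tree("CD", "BCD")]
distances = [clade_distance, missing_clades]

gene_1 = toy_tree("AB", "ABC")
gene_2 = toy_tree("CD", "ACD")

matrix = design_matrix([gene_1, gene_2], references, distances)
```

Because each row is computed from a single gene tree and the fixed references, a new gene family can be appended as a new row without recomputing existing rows, which is the contrast with a pairwise dissimilarity matrix drawn in the introduction.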

diff --git a/docs/ms001/phylosignal.aux b/docs/ms001/phylosignal.aux
@@ -17,24 +17,25 @@
 \providecommand\HyField@AuxAddToFields[1]{}
 \providecommand\HyField@AuxAddToCoFields[2]{}
 \bibstyle{plainnat}
-\citation{Leigh2008,Leigh2011,Susko2006,Narechania2016,Nye2011,Yoshida2015,Lewitus2015,Kendall2016,Colijn2018}
+\citation{astral,astrid}
 \babel@aux{english}{}
+\citation{Leigh2008,Leigh2011,Susko2006,Narechania2016,Nye2011,Yoshida2015,Lewitus2015,Kendall2016,Colijn2018}
 \citation{Gori2016,Jombart2017,Huang2016}
 \citation{Kendall2018}
 \citation{Smith2018}
-\@writefile{toc}{\contentsline {section}{\numberline {1}Methods}{3}{section.1}}
-\@writefile{lof}{\contentsline {figure}{\numberline {1}{\ignorespaces  Schematic representation of the tree signal calculation. In panel A we show two simple cases for a sample of 8 gene family trees: at the top we compare each gene tree to two distinct reference (species) trees using the minimum number of duplications (duplication distance), and at the bottom we compare all sample gene trees to a single reference tree, but using two distinct metrics --- the Robinson-Foulds (RF) distance and the duplication distance. Both comparisons provide little information in isolation, but when combined allow for distinguishing the two groups of gene families (represented by distinct colours). Panel B shows how the tree signal of a single gene family tree can be calculated, given a set of species trees and a set of distances. Notice that currently we work with unrooted gene trees and rooted species trees. In panel C we show that once we have the tree signal from each gene family, then we can create a feature matrix (\IeC {\textquoteleft }design matrix\IeC {\textquoteright }) which can be used in downstream analyses. \relax }}{4}{figure.caption.1}}
+\@writefile{toc}{\contentsline {section}{\numberline {1}Methods}{4}{section.1}}
+\@writefile{lof}{\contentsline {figure}{\numberline {1}{\ignorespaces  Schematic representation of the tree signal calculation. In panel A we show two simple cases for a sample of 8 gene family trees: at the top we compare each gene tree to two distinct reference (species) trees using the minimum number of duplications (duplication distance), and at the bottom we compare all sample gene trees to a single reference tree, but using two distinct metrics --- the Robinson-Foulds (RF) distance and the duplication distance. Both comparisons provide little information in isolation, but when combined allow for distinguishing the two groups of gene families (represented by distinct colours). Panel B shows how the tree signal of a single gene family tree can be calculated, given a set of species trees and a set of distances. Notice that currently we work with unrooted gene trees and rooted species trees. In panel C we show that once we have the tree signal from each gene family, then we can create a feature matrix (\IeC {\textquoteleft }design matrix\IeC {\textquoteright }) which can be used in downstream analyses. \relax }}{5}{figure.caption.1}}
 \providecommand*\caption@xref[2]{\@setref\relax\@undefined{#1}}
-\newlabel{figure01}{{1}{4}{Schematic representation of the tree signal calculation. In panel A we show two simple cases for a sample of 8 gene family trees: at the top we compare each gene tree to two distinct reference (species) trees using the minimum number of duplications (duplication distance), and at the bottom we compare all sample gene trees to a single reference tree, but using two distinct metrics --- the Robinson-Foulds (RF) distance and the duplication distance. Both comparisons provide little information in isolation, but when combined allow for distinguishing the two groups of gene families (represented by distinct colours). Panel B shows how the tree signal of a single gene family tree can be calculated, given a set of species trees and a set of distances. Notice that currently we work with unrooted gene trees and rooted species trees. In panel C we show that once we have the tree signal from each gene family, then we can create a feature matrix (‘design matrix’) which can be used in downstream analyses. \relax }{figure.caption.1}{}}
+\newlabel{figure01}{{1}{5}{Schematic representation of the tree signal calculation. In panel A we show two simple cases for a sample of 8 gene family trees: at the top we compare each gene tree to two distinct reference (species) trees using the minimum number of duplications (duplication distance), and at the bottom we compare all sample gene trees to a single reference tree, but using two distinct metrics --- the Robinson-Foulds (RF) distance and the duplication distance. Both comparisons provide little information in isolation, but when combined allow for distinguishing the two groups of gene families (represented by distinct colours). Panel B shows how the tree signal of a single gene family tree can be calculated, given a set of species trees and a set of distances. Notice that currently we work with unrooted gene trees and rooted species trees. In panel C we show that once we have the tree signal from each gene family, then we can create a feature matrix (‘design matrix’) which can be used in downstream analyses. \relax }{figure.caption.1}{}}
 \@writefile{toc}{\contentsline {subsection}{\numberline {1.1}Distances}{6}{subsection.1.1}}
 \@writefile{toc}{\contentsline {subsection}{\numberline {1.2}Normalisation}{7}{subsection.1.2}}
-\@writefile{toc}{\contentsline {subsection}{\numberline {1.3}Choice of reference trees}{7}{subsection.1.3}}
-\@writefile{toc}{\contentsline {section}{\numberline {2}Results}{8}{section.2}}
+\@writefile{toc}{\contentsline {subsection}{\numberline {1.3}Choice of reference trees}{8}{subsection.1.3}}
+\@writefile{toc}{\contentsline {section}{\numberline {2}Results}{9}{section.2}}
 \@writefile{lof}{\contentsline {figure}{\numberline {2}{\ignorespaces  SPR chain simulation with missing data. On the left panel we see the MDS projections of the samples when their signal is calculated from similar reference trees, and on the right we have the projections when random species trees are used. The colors represent the four groupings (consecutive trees in the SPR chain, separated from each other by more SPR branch swappings) defined in the simulation. \relax }}{10}{figure.caption.2}}
 \newlabel{figure002}{{2}{10}{SPR chain simulation with missing data. On the left panel we see the MDS projections of the samples when their signal is calculated from similar reference trees, and on the right we have the projections when random species trees are used. The colors represent the four groupings (consecutive trees in the SPR chain, separated from each other by more SPR branch swappings) defined in the simulation. \relax }{figure.caption.2}{}}
-\@writefile{toc}{\contentsline {section}{\numberline {3}Discussion}{10}{section.3}}
 \@writefile{lof}{\contentsline {figure}{\numberline {3}{\ignorespaces  Typical case (simphy simulation). Colors represent underlying species trees. \relax }}{11}{figure.caption.3}}
 \newlabel{figure003}{{3}{11}{Typical case (simphy simulation). Colors represent underlying species trees. \relax }{figure.caption.3}{}}
+\@writefile{toc}{\contentsline {section}{\numberline {3}Discussion}{11}{section.3}}
 \@writefile{lof}{\contentsline {figure}{\numberline {4}{\ignorespaces  Fungal data set \relax }}{12}{figure.caption.4}}
 \newlabel{figure004}{{4}{12}{Fungal data set \relax }{figure.caption.4}{}}
 \bibdata{references}
@@ -47,8 +48,10 @@
 \bibcite{Leigh2008}{{7}{2008}{{Leigh et~al.}}{{Leigh, Susko, Baumgartner, and Roger}}}
 \bibcite{Leigh2011}{{8}{2011}{{Leigh et~al.}}{{Leigh, Schliep, Lopez, and Bapteste}}}
 \bibcite{Lewitus2015}{{9}{2015}{{Lewitus and Morlon}}{{}}}
-\bibcite{Narechania2016}{{10}{2016}{{Narechania et~al.}}{{Narechania, Baker, DeSalle, Mathema, Kolokotronis, Kreiswirth, and Planet}}}
-\bibcite{Nye2011}{{11}{2011}{{Nye}}{{}}}
-\bibcite{Smith2018}{{12}{2018}{{Smith et~al.}}{{Smith, Brown, and Walker}}}
-\bibcite{Susko2006}{{13}{2006}{{Susko et~al.}}{{Susko, Leigh, Doolittle, and Bapteste}}}
-\bibcite{Yoshida2015}{{14}{2015}{{Yoshida et~al.}}{{Yoshida, Fukumizu, and Vogiatzis}}}
+\bibcite{astral}{{10}{2015}{{Mirarab and Warnow}}{{}}}
+\bibcite{Narechania2016}{{11}{2016}{{Narechania et~al.}}{{Narechania, Baker, DeSalle, Mathema, Kolokotronis, Kreiswirth, and Planet}}}
+\bibcite{Nye2011}{{12}{2011}{{Nye}}{{}}}
+\bibcite{Smith2018}{{13}{2018}{{Smith et~al.}}{{Smith, Brown, and Walker}}}
+\bibcite{Susko2006}{{14}{2006}{{Susko et~al.}}{{Susko, Leigh, Doolittle, and Bapteste}}}
+\bibcite{astrid}{{15}{2015}{{Vachaspati and Warnow}}{{}}}
+\bibcite{Yoshida2015}{{16}{2015}{{Yoshida et~al.}}{{Yoshida, Fukumizu, and Vogiatzis}}}