benchmark_against_WGCNA.md

Introduction

This worksheet aims to compare Li's SimpleTidy GeneCoEx workflow with the WGCNA package. WGCNA is a widely accepted gene co-expression analysis method. We will benchmark the two workflows using both use cases presented in this repository.

This worksheet mainly presents the results, not the underlying scripts. The scripts that generated the results can be found in the /Scripts folder. You will need the rmarkdown package to open them.

Table of contents

  1. Dependencies
  2. WGCNA - tomato fruit development series
  3. WGCNA - tepary leaf heat stress time course
  4. Tightness of module
  5. Discussion and Conclusion

Dependencies

library(tidyverse)
library(WGCNA)

library(ggalluvial)
library(ggbeeswarm)

library(patchwork)
library(RColorBrewer)
library(viridis)

set.seed(666)

I installed WGCNA using install.packages("BiocManager") followed by BiocManager::install("WGCNA", force = T), which installs WGCNA as well as all of its dependencies. The rest of the packages are used for data visualization and are not required for WGCNA per se.
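For reference, the install commands described above are:

install.packages("BiocManager")
BiocManager::install("WGCNA", force = T)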

WGCNA - tomato fruit development series

For WGCNA, we need a normalized gene expression matrix. WGCNA requires a matrix with libraries as rows and genes as columns. I provided a matrix where the biological replicates were already averaged up to the level of treatments. In this case, the treatments are tissue by developmental stage combinations. I also only used samples collected by hand, not by laser capture. For the purpose of benchmarking, I only used the top 5000 most variable genes, as in the main README page. For more details, please see the main README page.

To run WGCNA, I followed this tutorial. Again, I am using normalized, log transformed data where reps were already averaged up to the treatment as input.
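As a minimal sketch of the input format, a genes-by-treatments matrix of averaged, log-transformed values (exp_by_treatment, a hypothetical name) only needs to be transposed for WGCNA:

# WGCNA wants libraries (here, treatments) as rows and genes as columns
wide_exp <- t(exp_by_treatment)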

Power selection

WGCNA has its own edge selection method: rather than removing edges below a hard threshold, it raises the correlations to a soft-thresholding power, which down-weights weak edges. The math behind it is beyond me.

Tomato diagnostic stats

According to the tutorial, a power is picked based on where the diagnostic curves stabilize. I went with a power of 15.
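A minimal sketch of how these diagnostics can be generated, assuming a transposed expression matrix (wide_exp, a hypothetical name, with libraries as rows and genes as columns) and following the WGCNA tutorial:

# candidate soft-thresholding powers to test
powers <- c(1:10, seq(from = 12, to = 20, by = 2))

# scale-free topology fit and mean connectivity for each candidate power
sft <- pickSoftThreshold(wide_exp, powerVector = powers, verbose = 5)

# sft$fitIndices holds the values behind the diagnostic curves above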

Module detection
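A minimal sketch of the module detection step, assuming the same wide_exp matrix (a hypothetical name) and the power of 15 picked above; the remaining argument values follow the standard WGCNA tutorial and are not necessarily the exact settings used here:

# one-step network construction and module detection
net <- blockwiseModules(
  wide_exp,
  power = 15,
  TOMType = "unsigned",
  minModuleSize = 30,
  mergeCutHeight = 0.25,
  numericLabels = TRUE,
  verbose = 3
)

# convert the numeric module labels to the color names used in the figures
module_colors <- labels2colors(net$colors)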

After gene co-expression modules are detected, I looked at their correspondence with the treatments.

WGCNA_tomato_heatmap

In this heat map, each row is a gene co-expression module detected by WGCNA. The color strip on the right indicates the module names, which WGCNA represents as a set of colors. Each column is a tissue by developmental stage combination, which is also annotated by the color strips below the x-axis.

It appears that when all transcripts are used to detect modules (i.e., without pre-selecting which genes go into the analysis), WGCNA is better at pinpointing genes most highly enriched in individual tissue by stage combinations, as indicated by the red spots along the diagonal.

For a comparison, we can look at the module heat map detected by Li's Simple Tidy GeneCoEx method: tomato heatmap

Module QC

One way to QC the gene co-expression modules is to look at bait genes that are known to be involved in the same process, because genes involved in the same biological process are likely to be co-expressed and thus detected in the same module.

I picked two bait genes:

  1. PG: Solly.M82.10G020850.1, involved in making the fruit softer.
  2. PSY1: Solly.M82.03G005440.5, involved in making the fruit red.

They are both in module "blue", which is good to see.
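A quick sketch of how this can be checked, assuming a tidy module assignment table (module_assignment, with columns gene_ID and module, is a hypothetical name):

module_assignment %>% 
  filter(gene_ID %in% c(
    "Solly.M82.10G020850.1",  # PG
    "Solly.M82.03G005440.5"   # PSY1
  ))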

We can also graph the z score of genes across a couple modules.

WGCNA_tomato_module_line_graphs

In this figure, each large row is a module. I picked two modules, module blue and module turquoise, which are late and early development modules, respectively. Note that blue is the module our bait genes are assigned to. As you can see, the overall trend for blue is increasing through development, which is consistent with the known biology. Each big column is a tissue. The x-axis is developmental stages. Each thin grey line is an individual gene. Thick black lines are the average of the grey lines in each facet. The y-axis values are z scores, where each gene is adjusted to its own mean and standard deviation, that is $z = { (x - mean) \over sd(x) }$.
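A minimal sketch of how the z scores can be computed, assuming a long expression table (exp_long, with columns gene_ID, treatment, and mean_logExp, is a hypothetical name):

exp_long_z <- exp_long %>% 
  group_by(gene_ID) %>%   # z score each gene across treatments
  mutate(z = (mean_logExp - mean(mean_logExp)) / sd(mean_logExp)) %>% 
  ungroup()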

What stood out to me is that while the general trend is apparent, the grey lines are a bit spiky (or toothy). They remind me of the sharp teeth of sharks.

For a comparison, we can look at the module line graphs detected by Li's Simple Tidy GeneCoEx method: tomato line graphs.

We will come back to quantifying module tightness between the two methods later.

Module correspondence between two methods

Next, I looked at how modules detected by one method map to those detected by the other method. To do so, I correlated every module detected by WGCNA with every module detected by Li's method. Results are shown below:

Tomato_correspondence

Here each row is a WGCNA module, and each column is a module detected by Li's method. The colors indicate the correlation coefficient r. A high r value indicates that two modules have very similar expression patterns, and thus correspond to each other across the two methods.

As you can see, there is a clear red signal across the diagonal of this heatmap.

Obviously, both methods work; both detect modules with similar expression patterns. For example, module blue is highly correlated with module 5; both are upregulated during fruit ripening. Another example is that module brown is highly correlated with module 9; both are highly enriched in the seeds towards the later time points.
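A minimal sketch of how such a correspondence matrix can be computed, assuming each method's modules are summarized as a treatments-by-modules matrix of mean z scores (WGCNA_module_z and Li_module_z are hypothetical names):

# rows of both matrices are the same treatments, in the same order;
# cor() correlates every column of the first matrix with every column of the second
module_cor <- cor(WGCNA_module_z, Li_module_z)

# module_cor: rows = WGCNA modules, columns = Li's modules, values = r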

Membership correspondence between two methods

Next, in addition to looking at how modules map to each other between methods, I looked at how genes 'flow' between modules detected by the two methods. In other words, do the methods detect modules that contain the same genes?

WGCNA_tidy_memebership

This is called an alluvial plot. The top row is modules detected by WGCNA; the lower row is modules detected by Li's method. Each grey box is a module. Each ribbon is a block of genes shared by modules across the two methods. The ribbons are also color coded by their peak expression.

As you can see, large blocks of genes 'flow' between modules detected by the two methods. In some cases, the two methods detected modules that share practically the same membership.
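A minimal sketch of how such a plot can be drawn with ggalluvial, assuming a long-format ('lodes') membership table (membership_long, with one row per gene per method and columns gene_ID, method, module, and peak_exp, is a hypothetical name):

membership_long %>% 
  ggplot(aes(x = method, stratum = module, alluvium = gene_ID, y = 1)) +
  geom_flow(aes(fill = peak_exp), alpha = 0.8) +     # ribbons: genes shared across methods
  geom_stratum(fill = "grey80", color = "grey20") +  # boxes: modules
  theme_minimal()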

WGCNA - tepary leaf heat stress time course

I also benchmarked the 2nd use case, the tepary leaf heat stress time course. In this case, I did something slightly different: I constrained the WGCNA analysis to use the same genes that I used for co-expression in Li's method. The tutorial I followed does not come with a gene selection step, which implies the entire transcriptome can go into WGCNA. In contrast, Li's method has a gene selection section before going into gene co-expression; the selection criteria I used were high variance and/or high F statistics. For more info, refer to the gene selection section.

In this benchmarking experiment, I will only be using genes that are either high variance or high F for WGCNA.

Power selection

Again, I followed the tutorial.

Tepary diagnostic stats

The power curve on the left looks a bit funny. It might be because we are constraining the genes we put into WGCNA. Regardless of the reason, both curves stabilized at a power of 16, so I picked 16 and rolled with it.

Module detection

After gene co-expression modules are detected, I looked at their correspondence with the treatments.
Tepary WGCNA heatmap.z

In this heatmap, the big columns indicate the control vs heat stress treatments. Each row is a module, which is also annotated by the colors on the right. Each column is a time point.

For a comparison, I can pull out the heat map for Li's method as well:

Tepary_Li_heatmap

As you can see, when the genes are pre-selected, the overall patterns between the two methods are quite similar. In this case, Li's method detected more modules.

Module QC

We have some bait genes, all involved in trehalose metabolism.

  1. Phacu.CVR.003G017200 TPS6
  2. Phacu.CVR.003G183300 TPS11
  3. Phacu.CVR.009G053300 TPSJ
  4. Phacu.CVR.002G288900 TPSJ

Let's check which module(s) they are assigned to.

It turns out they are placed into different modules (modules 7, 8, and 11). This is not necessarily a problem, because the expression patterns of these 3 modules are somewhat similar, according to the heatmap.

We can graph a few modules to check. We will do modules 7, 8, and 11, because that is where our baits are. The corresponding 'colors' are light green, light yellow, and yellow.

Tepary_WGCNA_line_graphs

In this figure, each large row is the control or heat stress treatment. Each big column is a module. The x-axis is time points. Each thin grey line is an individual gene. Thick black lines are the average of grey lines in each facet. y-axis values are z score, which is adjusted to the mean and standard deviation of each gene, that is $z = { (x - mean) \over sd(x) }$.

It looks less spiky than the tomato WGCNA line graphs. Perhaps constraining the genes going into WGCNA helps. We will come back to quantifying module tightness between the two methods later.

Module correspondence between two methods

Next, I looked at how modules detected by one method map to those detected by the other method. To do so, I correlated every module detected by WGCNA with every module detected by Li's method. Results are shown below:

Tepary_correspondance

In this heatmap, each row is a module detected by WGCNA, which is also annotated by the color strip on the left. Each column is a module detected by Li's method. The color strips on the right and bottom annotate the peak time points of these modules. The colors in the heatmap indicate the correlation coefficient r. A high r value indicates that two modules have very similar expression patterns, and thus correspond to each other across the two methods.

This might look a bit messy at first glance, but it actually makes a lot of sense. In the diurnal sense, the last time point of the experiment (24 hr) is very similar to the first time point (1 hr). So it makes sense that modules that peak at the last time point also correlate well with modules that peak at the first time point.

Membership correspondence between two methods

Next, in addition to looking at how modules map to each other between methods, I looked at how genes 'flow' between modules detected by the two methods. In other words, do the methods detect modules that contain the same genes?

WGCNA_tepary_tidy_memebership

In this alluvial plot, the top row is modules detected by WGCNA and the lower row is modules detected by Li's method. Each grey box is a module. Each ribbon is a block of genes shared by modules across the two methods. The ribbons are also color coded by the time point at which their expression peaked.

While there is a lot of n-to-n correspondence, large blocks of genes 'flow' between corresponding modules detected by both methods. In some cases, the two methods detected modules that share practically the same membership.

Tightness of module

To evaluate module tightness for both methods, I used a version of the squared error loss, where tighter modules have lower squared errors. The squared loss function is widely used in statistics and machine learning; for an observed value $y$ and its expected value $\hat{y}$, it is simply $\left( y - \hat{y} \right)^2$.

To adapt the squared loss function to this context, I modified the loss function as follows:

For gene $i$ and treatment $j$ in module $m$, the mean sum of squares of module $m$, i.e., $msq_m$, is computed as:

$$msq_m = { \sum_{i,j} \left( z_{ijm} - \bar{z}_{jm} \right)^2 \over n_m }$$

where $z_{ijm}$ is the z score of gene $i$ at treatment $j$, $\bar{z}_{jm}$ is the mean z score across all genes of module $m$ at treatment $j$, and $n_m$ is the total number of genes in module $m$. The sum of squares is thus normalized by the number of genes in each module, giving a mean sum of squares.
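A minimal sketch of how $msq_m$ can be computed in the tidyverse, assuming a long table of z scores (z_scores, with columns gene_ID, treatment, module, and z, is a hypothetical name):

msq_per_module <- z_scores %>% 
  group_by(module, treatment) %>% 
  mutate(mean_z = mean(z)) %>%             # module mean at each treatment
  group_by(module) %>% 
  summarise(
    n_genes = n_distinct(gene_ID),
    msq = sum((z - mean_z)^2) / n_genes    # sum of squares / number of genes
  )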

Tomato dataset

Tomato_Comparison

Each dot on these graphs is a module. Modules are color coded by which method they were detected by (WGCNA vs. Simple Tidy). We can see that for the tomato dataset, the loss function values are very comparable between the two methods (median Li = 26.1; median WGCNA = 28.7; P = 0.24, Wilcoxon rank sum test). There is only a weak correlation between module size and loss function value. Even small modules detected by WGCNA have higher loss function values, indicating that the higher loss is not due to insufficient resolution or insufficient module separation.

Tepary dataset

Tepary Comparison

Again, each dot on these graphs is a module. Modules are color coded by which method they were detected by (WGCNA vs. Simple Tidy). Here we see that Li's method detected tighter modules, as quantified by lower loss function values (median Li = 1.71; median WGCNA = 2.85; P = 3.1e-5, Wilcoxon rank sum test). In this case, there is a mild correlation between loss function value and module size, indicating both methods could benefit from higher resolution or more module separation.

We can compare both methods after controlling for module size. This can be achieved by fitting a linear model with module size as a covariate, i.e.,

msq ~ (1|module_size) + method,

then passing the model to analysis of covariance (ANCOVA). The (1|module_size) syntax declares module size as a random effect in the linear model, which controls for the effect of module size on module tightness.
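A minimal sketch of how such a model could be fit and tested, assuming the lmerTest and emmeans packages and a combined table (module_msq, a hypothetical name) with one row per module from either method and columns msq, method, and module_size; the actual scripts are in the /Scripts folder:

library(lmerTest)   # lmer() with F tests and p-values
library(emmeans)    # estimated marginal means

# random intercept for module size, fixed effect for method
model_msq <- lmer(msq ~ method + (1 | module_size), data = module_msq)

anova(model_msq)                        # tests the method effect
emmeans(model_msq, pairwise ~ method)   # marginal means and their difference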

Even after controlling for module size, Li's method returns lower loss than WGCNA (F = 8.6, P = 0.0067, ANCOVA). The difference in estimated marginal means after controlling for module size is:

Simple Tidy - WGCNA = -0.939, CL = [-1.6, -0.276].

Discussion and Conclusion

The difference in module tightness between the two methods might be due to their module detection algorithms. WGCNA uses hierarchical clustering followed by tree cutting to detect modules (ref):

The default method is hierarchical clustering using the standard R function hclust; branches of the hierarchical clustering dendrogram correspond to modules and can be identified using one of a number of available branch cutting methods, for example the constant-height cut or two Dynamic Branch Cut methods.

In contrast, Simple Tidy GeneCoEx uses the Leiden algorithm to detect modules, which returns modules that are highly interconnected (ref).

We prove that the Leiden algorithm yields communities that are guaranteed to be connected.

As a result, the highly interconnected communities returned by the Leiden algorithm may translate into tighter modules.

In conclusion:

  1. Both methods detect gene co-expression modules with similar expression patterns.
  2. Large blocks of genes are shared between modules detected by the two methods.
  3. The Simple Tidy GeneCoEx method can detect tighter modules depending on the dataset, as quantified by a mean sum of squares loss function.