A simple gene co-expression analysis workflow powered by tidyverse and graph analyses
- Introduction
- Dependencies
- Required input
- Experimental Design
- Global view of the experiment
- Gene co-expression analyses
- Pull out candidate genes
- Conclusions
This is a gene co-expression analysis workflow powered by tidyverse and graph analyses. The essence of this workflow is simple and tidy. This is by no means the best workflow, but it is conceptually simple if you are familiar with tidyverse. The goal of this workflow is to identify genes co-expressed with known genes of interest.
- Author: Chenxin Li, Ph.D., Assistant Research Scientist at Department of Crop & Soil Sciences and Center for Applied Genetic Technologies, University of Georgia
- Contact: [email protected]
For this demonstration, we will be using data from Moghaddam et al., 2022. In this study, the authors looked at gene expression of tepary bean leaves under control and heat stress conditions. They sampled 5 time points during the stress treatment. One of the goals of the experiment is to understand gene regulation in response to heat stress.
library(tidyverse)
library(igraph)
library(ggraph)
library(readxl)
library(patchwork)
library(RColorBrewer)
library(viridis)
set.seed(666)
The tidyverse and igraph packages will be doing a lot of the heavy lifting.
ggraph is a grammar of graphics extension of igraph, which provides effective visualization of network graphs.
The rest of the packages are mainly for data visualization and are not required for the gene expression analyses.
The readxl package is only required if you have any files in .xlsx or .xls format (anything only Excel readable).
The Scripts/
directory contains .Rmd
files that generate the graphics shown below.
It requires R, RStudio, and the rmarkdown package.
- R: R Download
- RStudio: RStudio Download
- rmarkdown can be installed using the install packages interface in RStudio (or from the console; see below)
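For the console route, the standard CRAN one-liner also works:
install.packages("rmarkdown")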
The workflow requires 3 inputs:
- Gene expression matrix
- Metadata
- Bait genes (genes involved in the biological process of interest from previous studies)
Many programs can generate a gene expression matrix, for example Cufflinks, kallisto, and STAR.
My go-to is kallisto, but you do you. The requirements are:
- Estimation of gene expression abundance, in units of TPM or FPKM.
- Each row is a gene, and each column is a library (a toy example of this layout is sketched below).
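Just to illustrate the expected layout, here is a toy example with hypothetical gene IDs and values (not from the tepary dataset):
# Toy example of the expected wide format: one gene ID column, then one column of TPM/FPKM per library
tibble(
  LocusName = c("gene_A", "gene_B"),
  library_1 = c(10.5, 0),
  library_2 = c(2.3, 55.1)
)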
Exp_table <- read_excel("../Data/Moghaddam2022_data/Pacu.CVR.HeatStress.xlsx",
                        col_types = c("text", rep("numeric", 30)))
head(Exp_table)
dim(Exp_table)
## [1] 27541 31
Looks like we have 27,541 genes and 30 libraries (31 columns, one of which is the gene ID).
According to the Methods section of the paper, the table was generated using Cufflinks.
The values in the table are FPKM. I prefer TPM values, but this will work just fine.
According to the authors' notes, library C_3hr_B2
is flagged for a low correlation coefficient with the other reps,
so let's throw it out upfront.
Exp_table <- Exp_table %>%
select(-C_3hr_B2)
dim(Exp_table)
Exp_table %>%
group_by(LocusName) %>%
count() %>%
filter(n > 1) %>%
nrow()
## [1] 27541 30
## [1] 3
For whatever reason, there are 3 genes appearing more than once in this gene expression matrix. I have no idea why; they should not be there. Let's fix that.
Exp_table <- Exp_table %>%
distinct(LocusName, .keep_all = T)
I could go to SRA to fetch the metadata, but the authors have done a good job naming their libraries and providing descriptions.
- C vs. S in library names stands for control vs. stressed.
- Data were collected at 1, 3, 6, 12, and 24 hrs post treatment.
- Library names end with B1, B2, or B3, which correspond to the 3 bio-reps.
Let's parse those out.
Metadata <- data.frame(
library = colnames(Exp_table)[-1]
) %>%
mutate(treatment = case_when(
str_detect(library, "C_") ~ "control",
str_detect(library, "S_") ~ "heat"
)) %>%
mutate(time_point = case_when(
str_detect(library, "_1hr_") ~ "1",
str_detect(library, "_3hr_") ~ "3",
str_detect(library, "_6hr_") ~ "6",
str_detect(library, "_12hr_") ~ "12",
str_detect(library, "_24hr_") ~ "24"
))
dim(Metadata)
## [1] 29 3
We have 29 libraries left. We started with 30 and threw out 1.
I know very little about heat stress. But just for the sake of this example, let's use a few trehalose synthesis-related genes as bait genes. There is some evidence that trehalose protects the cell from stress, so let's run with that. Refs: Crowe et al., 1998; Magaz et al., 2012.
Baits <- read_delim("../Data/TeparyBaits.txt", delim = "\t", col_names = F, col_types = cols())
Baits
## A tibble:4 × 2
I pulled out 4 trehalose-6-phosphate synthase (TPS) genes from the genome annotation. It will be interesting to see their expression patterns in this heat stress time course.
Before I start doing any analysis, I would first try to wrap my head around the experimental design. Having a good understanding of the experimental design helps me decide how I want to analyze and visualize the data.
Key questions are:
- What are the sources of variation?
- What are the levels of replication?
This is where the metadata come in handy.
Metadata %>%
group_by(treatment) %>%
count()
## A tibble:2 × 2
## Groups:treatment [2]
We have two treatments: control and heat.
Metadata %>%
group_by(time_point) %>%
count()
## A tibble:5 × 2
## Groups:time_point [5]
We have 5 time points, 1, 3, 6, 12, and 24 hrs post treatment.
Metadata %>%
group_by(treatment, time_point) %>%
count()
## A tibble:10 × 3
## Groups:treatment, time_point [10]
We have 10 unique treatment by time point combinations. 2 * 5 = 10. We are good. One of the combinations only has 2 reps, because we threw out 1 library. The rest all have 3 reps. 10 * 3 - 1 = 29, matching the number of rows in the metadata table. This is very good.
This is a two factor experimental design: time point * treatment. The major sources of variations are time point, treatment, and replicates. I usually make a summary table to guide my downstream analyses.
source | type | levels |
---|---|---|
Treat | Qual | 2 |
Time | Num/qual | 5 |
Reps | EU, OU | 29 |
The source column indicates the sources of variation. This will become important when we try to understand the major driver of variance in this experiment. The type column indicates if the factor in question is a qualitative (discrete) or numeric variable. A special note is that time points can be analyzed as either a numeric variable or a qualitative variable. "EU" and "OU" in the Reps row stand for experimental unit and observational unit. In this case, the rep is both EU and OU. This is not always the case, especially if the same library is sequenced twice and uploaded under two different SRA numbers.
Now that we understand the experimental design, we will next figure out the major driver of variance in the experiment. In other words, between time point and treatment, which factor contributes more to the variance in this experiment? The answer to this question matters in terms of how we most effectively visualize our data.
A good way to have a global view of the experiment is doing a principal component analysis (PCA).
This is a tidyverse workflow, so I will be doing things in the tidyverse way. Brace yourself for %>%.
The first step in a tidyverse workflow is going from wide format to tidy (or long) format.
In tidy format, each row is an observation, and each column is a variable.
We can go from wide to long using the pivot_longer() function.
Exp_table_long <- Exp_table %>%
pivot_longer(cols = !LocusName, names_to = "library", values_to = "FPKM") %>%
mutate(FPKM = case_when(
is.na(FPKM) ~ 0,
T ~ FPKM
)) %>%
mutate(logFPKM = log10(FPKM + 1))
nrow(Exp_table_long)/nrow(Exp_table)
## [1] 29
In this code chunk, I also log transformed the FPKM values. I also set any missing values to 0. All in one pipe. We will come back to this long table later. This long table is the basis of all downstream analyses.
However, the input data for PCA is a numeric matrix, so we have to go from long to wide back again.
To do that, we use pivot_wider()
.
Exp_table_log_wide <- Exp_table_long %>%
select(LocusName, library, logFPKM) %>%
pivot_wider(names_from = library, values_from = logFPKM)
head(Exp_table_log_wide)
my_pca <- prcomp(t(Exp_table_log_wide[, -1]))
pc_importance <- as.data.frame(t(summary(my_pca)$importance))
head(pc_importance, 20)
##      Standard deviation Proportion of Variance Cumulative Proportion
## PC1           18.089046                0.50245               0.50245
## PC2            9.455056                0.13727               0.63972
## PC3            8.825565                0.11960               0.75932
## PC4            4.903644                0.03692               0.79625
## PC5            4.461639                0.03057               0.82681
## PC6            3.722977                0.02128               0.84810
## PC7            3.571763                0.01959               0.86769
## PC8            3.353137                0.01726               0.88495
## PC9            3.238616                0.01611               0.90106
## PC10           2.757507                0.01168               0.91273
prcomp()
performs PCA for you, given a numeric matrix, which is just the transposed Exp_table_log_wide
, but without the gene ID column.
as.data.frame(t(summary(my_pca)$importance))
saves the sd and proportion of variance into a data table.
In this case, the 1st PC accounts for 50% of the variance in this experiment.
The 2nd PC accounts for 13% of the variance. Taking a quick look, the first 20 PCs account for 97% of all the variation in the data.
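If you want a quick visual of pc_importance, here is an optional scree plot sketch (it just plots the Proportion of Variance column computed above for the first 10 PCs):
# Optional: scree plot of the proportion of variance explained by the first 10 PCs
pc_importance %>%
  mutate(PC = 1:n()) %>%
  filter(PC <= 10) %>%
  ggplot(aes(x = PC, y = `Proportion of Variance`)) +
  geom_col() +
  theme_classic()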
To make a PCA plot, we will graph the data stored in my_pca$x
, which stores the coordinates of each library in PC space.
Let's pull that data out and annotate them (with metadata).
PCA_coord <- my_pca$x[, 1:10] %>%
as.data.frame() %>%
mutate(library = row.names(.)) %>%
full_join(Metadata, by = "library")
head(PCA_coord)
For the purpose of visualization, I only pulled the first 10 PCs. In fact, I will only be plotting the first 2 or 3 PCs.
I noticed that the time point is recorded as a character, so let's fix that here. I am going to make it numeric for now. We can turn it into an ordered factor later if we need to.
PCA_coord <- PCA_coord %>%
mutate(time_point = as.numeric(time_point))
PCA_by_treatment <- PCA_coord %>%
ggplot(aes(x = PC1, y = PC2)) +
geom_point(aes(fill = treatment), color = "grey20", shape = 21, size = 3, alpha = 0.8) +
scale_fill_manual(values = c("grey70", "tomato1")) +
labs(x = paste("PC1 (", pc_importance[1, 2] %>% signif(3)*100, "% of Variance)", sep = ""),
y = paste("PC2 (", pc_importance[2, 2] %>% signif(3)*100, "% of Variance)", " ", sep = ""),
fill = "treatment") +
theme_bw() +
theme(
text = element_text(size= 14),
axis.text = element_text(color = "black")
)
PCA_by_time <- PCA_coord %>%
ggplot(aes(x = PC1, y = PC2)) +
geom_point(aes(fill = time_point), color = "grey20", shape = 21, size = 3, alpha = 0.8) +
scale_fill_gradientn(colors = viridis(10, option = "A")) +
labs(x = paste("PC1 (", pc_importance[1, 2] %>% signif(3)*100, "% of Variance)", sep = ""),
y = paste("PC2 (", pc_importance[2, 2] %>% signif(3)*100, "% of Variance)", " ", sep = ""),
fill = "time point") +
guides(fill = guide_colorsteps()) +
theme_bw() +
theme(
text = element_text(size= 14),
axis.text = element_text(color = "black")
)
wrap_plots(PCA_by_time, PCA_by_treatment, nrow = 1)
ggsave("../Results/Tepary_PCA_by_stage_tissue.svg", height = 3.5, width = 8.5, bg = "white")
ggsave("../Results/Tepary_PCA_by_stage_tissue.png", height = 3.5, width = 8.5, bg = "white")
It might not make sense at first glance, but time is circular! 24 hr is very close to 1 hr in the diurnal sense.
So PC1 separates early day from mid day. PC2 separates control vs. heat treated plants. So the major driver of variation in this dataset is time (in the diurnal sense), followed by our treatment. Seeing this, we can expect a lot of interaction between diurnal gene regulation and heat stress.
All of the above are preparatory work. It helps us understand the data. Now we are ready to do co-expression analyses.
There are multiple steps. Let's go over them one by one.
We will first average up the reps to the level of time point by treatment combination. We are interested in the biological variation among time point-treatment combinations, and less interested in the noise among reps of the same treatment. Again, this is a tidyverse based workflow.
Exp_table_long_averaged <- Exp_table_long %>%
full_join(PCA_coord,
by = "library") %>%
group_by(LocusName, time_point, treatment) %>%
summarise(mean.logFPKM = mean(logFPKM)) %>%
ungroup()
head(Exp_table_long_averaged)
We start from the long (tidy) table we made earlier. I also pulled in the metadata to guide the averaging process.
group_by()
followed by summarise(mean = ...)
takes each gene, time point, and treatment, and computes the mean.
The elegance of a tidyverse based workflow is that you do not have to do loops! You let group_by()
do the heavy lifting.
This could take a moment. This step is doing a lot of mean calculations.
Once we have averaged up the reps, we will standardize the expression patterns using z scores.
A z score is the difference from the mean divided by the standard deviation, i.e., z = (x - mean(x)) / sd(x), computed within each gene.
Exp_table_long_averaged_z <- Exp_table_long_averaged %>%
group_by(LocusName) %>%
mutate(z.score = (mean.logFPKM - mean(mean.logFPKM))/sd(mean.logFPKM)) %>%
ungroup()
head(Exp_table_long_averaged_z)
In this step, we are grouping by gene. Treatment-time points with higher expression will have a higher z score and vice versa.
Note that this is completely relative to each gene itself.
Again, the advantage of a tidyverse workflow is you let group_by()
do all the heavy lifting. No need for loops or apply()
.
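As a quick sanity check on a toy vector (hypothetical numbers), the per-gene z score computed above is the same as what base R's scale() gives:
# Toy example: z score by hand vs. base R scale()
x <- c(1, 2, 3, 10)
(x - mean(x)) / sd(x)
as.vector(scale(x)) # same values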
The next step is correlating each gene to every other gene. However, we have 27k genes in this dataset, and the number of correlations scales with the square of the number of genes. To make things faster and less cumbersome, we can select only the high variance genes. The underlying rationale is that if a gene is expressed at a similar level across all samples, it is unlikely to be involved in the biology at a particular time point or treatment.
Another method is to filter genes by F statistics. This method detects genes that are changing expression based on their signal (change) to noise ratio.
But before we do either of those, we can filter for expressed genes. Let's say we filter for genes that are expressed in all reps of at least one time point-treatment combination, using an arbitrary cutoff of 5 FPKM. Because our lowest level of replication is 2, let's filter for genes expressed at > 5 FPKM in >= 2 libraries.
Expressed_genes <- Exp_table_long %>%
filter(FPKM > 5) %>%
group_by(LocusName) %>%
count() %>%
filter(n >= 2)
dim(Expressed_genes)
## [1] 13196 2
If we use > 5 FPKM in >= 2 libraries, we are down to 13k genes, just about half of what we started with. Again, this is using some arbitrary cutoffs. You do you.
high_var_genes <- Exp_table_long_averaged_z %>%
filter(LocusName %in% Expressed_genes$LocusName) %>%
group_by(LocusName) %>%
summarise(var = var(mean.logFPKM)) %>%
ungroup() %>%
filter(var > quantile(var,0.5))
dim(high_var_genes)
## [1] 6598 2
Let's say we pick top 50% of high variance genes. We are down to ~6600 genes.
One way to check if we are keeping enough genes is to check if the bait genes are still there.
high_var_genes %>%
filter(LocusName %in% Baits$X1)
## A tibble:3 × 2
I put in 4 baits, and 3 of them are still there. So I guess we are good?
You might ask, why did I choose top 50%? Why not 33%? or all of it? The short answer is this is arbitrary.
However, if you want some sort of "objective" way of defining gene selection cutoffs, you can use the variance distribution and your bait genes.
all_var_and_ranks <- Exp_table_long_averaged_z %>%
group_by(LocusName) %>%
summarise(var = var(mean.logFPKM)) %>%
ungroup() %>%
mutate(rank = rank(var, ties.method = "average"))
bait_var <- all_var_and_ranks %>%
filter(LocusName %in% Baits$X1)
bait_var
In the 1st chunk of code, I calculate the variance for each gene and rank them. In the 2nd chunk, I look at the variance of the bait genes.
We can look at where your bait genes are along the variance distribution.
all_var_and_ranks %>%
ggplot(aes(x = var, y = rank)) +
geom_rect(
xmax = max(high_var_genes$var),
xmin = min(high_var_genes$var),
ymax = nrow(all_var_and_ranks),
ymin = nrow(all_var_and_ranks) - nrow(high_var_genes),
fill = "dodgerblue2", alpha = 0.2
) +
geom_line(size = 1.1) +
geom_hline(
data = bait_var, aes(yintercept = rank),
color = "tomato1", size = 0.8, alpha = 0.5
) +
geom_vline(
data = bait_var, aes(xintercept = var),
color = "tomato1", size = 0.8, alpha = 0.5
) +
labs(y = "rank",
x = "var(log10(FPKM))",
caption = "Blue box = high var expressed genes.\nRed lines = bait genes.") +
theme_classic() +
theme(
text = element_text(size = 14),
axis.text = element_text(color = "black"),
plot.caption = element_text(hjust = 0)
)
ggsave("../Results/Tepary_gene_var_distribution.svg", height = 3.5, width = 3.5)
ggsave("../Results/Tepary_gene_var_distribution.png", height = 3.5, width = 3.5)
It turns out our bait genes are not the most highly variable genes. But at least three of them are up there. Let's roll with this.
Exp_table_long_averaged_z_high_var <- Exp_table_long_averaged_z %>%
filter(LocusName %in% high_var_genes$LocusName)
head(Exp_table_long_averaged_z_high_var)
Exp_table_long_averaged_z_high_var %>%
group_by(LocusName) %>%
count() %>%
nrow()
## [1] 6598
The %in%
operator filters genes that are present in high_var_genes$LocusName
, thus retaining only high var genes.
It has been brought to my attention that high variance genes may not capture genes oscillating with low amplitude, which might be relevant for this experiment.
In an attempt to address this issue, I developed another method to filter genes.
The goal here is to detect genes whose expression levels are changing, and the metric I will be using is the F statistic.
The formalism of the statistics is not important, but the F value can be roughly understood as the ratio of between group variance and within group variance.
Genes with high signal to noise ratio will have a high F value, and we will be filtering based on F.
The F stat can be computed using anova(). We will compute that for each expressed gene.
compute_F <- function(data, gene){
anova_table = lm(logFPKM ~ time_point:treatment, data %>%
filter(LocusName == gene)) %>%
anova()
cbind(anova_table$`F value`[1], anova_table$`Pr(>F)`[1]) %>%
as.data.frame()
}
So I wrote this function to automate this. We will use this function to evaluate all the expressed genes.
The function first fits a linear model, modeling logFPKM as a function of the time point by treatment combination.
Then the linear model is passed onto anova()
.
The function returns two outputs: the 1st is the F value for the time point by treatment interaction, and the 2nd is the corresponding p value.
We will be looking at genes whose expression changes depending on time point and treatment.
We are interested in genes that are changing at all, so we won't compute separate F values for time point only or treatment only.
compute_F(data = Exp_table_long %>%
full_join(PCA_coord, by = "library"), gene = "Phacu.CVR.002G288900")
## V1 V2
## 25.34084 7.828205e-07
I threw in a bait gene to try it out. The bait gene is up there in terms of variance, so it should have a high signal to noise ratio.
The function returned F = 25, P = 7.8e-7. Under the null hypothesis, F is expected to be around 1, meaning between group variance = within group variance, or the lack of signal beyond noise. In this case our bait gene has F = 25, meaning the signal is 25x higher than the noise. Very good to see. Now we just need to run the function on all expressed genes.
ANOVA_results <- purrr::map_dfr(
.x = Expressed_genes$LocusName,
.f = compute_F,
data = Exp_table_long %>%
full_join(PCA_coord, by = "library")
) %>%
cbind(LocusName = Expressed_genes$LocusName) %>%
as.data.frame() %>%
rename(
F_stat = V1,
p.value = V2
) %>%
mutate(FDR = p.adjust(p.value, method = "fdr"))
head(ANOVA_results)
This could take a long time. If you have access to a computing cluster, you can submit this step as a job.
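Another option, if you would rather not wait, is to parallelize the map step. Below is a minimal sketch using the furrr package (this assumes furrr is installed; it is not used anywhere else in this workflow):
# Optional: parallel version of the ANOVA step using furrr
library(furrr)
library(future)
plan(multisession, workers = 4) # use 4 local cores; adjust to your machine
ANOVA_results_parallel <- future_map_dfr(
  .x = Expressed_genes$LocusName,
  .f = compute_F,
  data = Exp_table_long %>%
    full_join(PCA_coord, by = "library")
)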
We can make a scatter plot comparing the variance and the F value.
F_high_var_comparison <- ANOVA_results %>%
left_join(all_var_and_ranks, by = "LocusName") %>%
mutate(high_F = case_when(
F_stat >= 2 ~ "high F",
T ~ "low F"
)) %>%
mutate(high_var = case_when(
LocusName %in% high_var_genes$LocusName ~ "high var",
T ~ "low var"
)) %>%
mutate(type = case_when(
high_F == "high F" &
high_var == "high var" ~ "both",
high_F == "high F" &
high_var == "low var" ~ "high F only",
high_F == "low F" &
high_var == "high var" ~ "high var only",
T ~ "neither"
))
Here I set genes with F >= 2 as "high F", instead of using a p value cutoff. P value is a function of sample size. Since this experiment has a relatively low level of replication (n = 2 to 3), using p values strictly would be too stringent. The F statistic, on the other hand, is more or less independent of sample size.
F_var_scatter <- F_high_var_comparison %>%
ggplot(aes(x = F_stat, y = sqrt(var))) +
geom_point(aes(color = type), alpha = 0.5) +
scale_color_manual(values = brewer.pal(n = 8, "Set2")) +
labs(x = "F stat",
y = "sd(log10(FPKM))",
color = NULL) +
theme_classic() +
theme(
legend.position = c(0.75, 0.75),
text = element_text(size = 14),
axis.text = element_text(color = "black")
)
F_dist <- F_high_var_comparison %>%
ggplot(aes(x = F_stat)) +
geom_histogram(bins = 100, fill = brewer.pal(8, "Set2")[2]) +
labs(x = NULL) +
theme_classic() +
theme(
text = element_text(size = 14),
axis.text = element_text(color = "black")
)
var_dist <- F_high_var_comparison %>%
ggplot(aes(x = sqrt(var))) +
geom_histogram(bins = 100, fill = brewer.pal(8, "Set2")[3]) +
labs(x = NULL) +
theme_classic() +
theme(
text = element_text(size = 14),
axis.text = element_text(color = "black")
) +
coord_flip()
blank <- F_high_var_comparison %>%
ggplot(aes(x = -log10(FDR), y = sqrt(var))) +
theme_void()
wrap_plots(
F_dist, blank,
F_var_scatter, var_dist,
nrow = 2, ncol = 2,
heights = c(0.35, 1),
widths = c(1, 0.35)
)
ggsave("../Results/Tepary_var_F_scatter.svg", width = 5.5, height = 5.5)
ggsave("../Results/Tepary_var_F_scatter.png", width = 5.5, height = 5.5)
This scatter plot is a bit unexpected, but interesting. I graphed F stat on the x axis and sd (rather than var) on the y axis to make the dots spread out better. The distributions are still skewed (as indicated by the histograms on the top and right). I was expecting genes with high F to also have high var, which did not seem to be the case.
We can also look at the overlap between these two methods.
- method 1: top 50% high variance expressed genes.
- method 2: F > 2.
F_high_var_comparison %>%
group_by(type) %>%
count()
F_high_var_comparison %>%
group_by(high_var) %>%
count()
F_high_var_comparison %>%
group_by(high_F) %>%
count()
Pulling out the numbers: there are 6598 high var genes, 5424 high F genes, and 2373 genes in the overlap. The overlap is not very high, which is also indicated by the scatter plot.
Now we have to decide which gene selection method to use for downstream analyses. But before we do that, we should examine some of the common and distinct genes detected by the two methods so that we understand what we are getting.
high_F_only_examples <- F_high_var_comparison %>%
filter(type == "high F only") %>%
slice_max(order_by = F_stat, n = 1) %>%
inner_join(Exp_table_long, by = "LocusName") %>%
inner_join(PCA_coord, by = "library")
high_var_only_examples <- F_high_var_comparison %>%
filter(type == "high var only") %>%
slice_max(order_by = var, n = 1) %>%
inner_join(Exp_table_long, by = "LocusName") %>%
inner_join(PCA_coord, by = "library")
high_var_high_F_examples <- F_high_var_comparison %>%
filter(type == "both") %>%
slice_max(order_by = F_stat, n = 1) %>%
inner_join(Exp_table_long, by = "LocusName") %>%
inner_join(PCA_coord, by = "library")
neither_examples <- F_high_var_comparison %>%
filter(type == "neither") %>%
slice_max(order_by = var, n = 1) %>%
inner_join(Exp_table_long, by = "LocusName") %>%
inner_join(PCA_coord, by = "library")
high_F_only_graphs <- high_F_only_examples %>%
ggplot(aes(x = time_point, y = logFPKM)) +
geom_point(aes(fill = treatment), shape = 21, color = "white",
size = 3, alpha = 0.8, position = position_jitter(0.2, seed = 666)) +
scale_fill_manual(values = c("grey70", "tomato1")) +
labs(x = "time point",
y = "log10(FPKM)",
title = paste0("high F only example\n",
high_F_only_examples$LocusName)) +
theme_classic() +
theme(
text = element_text(size = 14),
axis.text = element_text(color = "black"),
plot.title = element_text(size = 12)
)
high_var_only_graphs <- high_var_only_examples %>%
ggplot(aes(x = time_point, y = logFPKM)) +
geom_point(aes(fill = treatment), shape = 21, color = "white",
size = 3, alpha = 0.8, position = position_jitter(0.2, seed = 666)) +
scale_fill_manual(values = c("grey70", "tomato1")) +
labs(x = "time point",
y = "log10(FPKM)",
title = paste0("high var only example\n",
high_var_only_examples$LocusName)) +
theme_classic() +
theme(
text = element_text(size = 14),
axis.text = element_text(color = "black"),
plot.title = element_text(size = 12)
)
high_var_high_F_graphs <- high_var_high_F_examples %>%
ggplot(aes(x = time_point, y = logFPKM)) +
geom_point(aes(fill = treatment), shape = 21, color = "white",
size = 3, alpha = 0.8, position = position_jitter(0.2, seed = 666)) +
scale_fill_manual(values = c("grey70", "tomato1")) +
labs(x = "time point",
y = "log10(FPKM)",
title = paste0("high F high var example\n",
high_var_high_F_examples$LocusName)) +
theme_classic() +
theme(
text = element_text(size = 14),
axis.text = element_text(color = "black"),
plot.title = element_text(size = 12)
)
low_var_low_F_graphs <- neither_examples %>%
ggplot(aes(x = time_point, y = logFPKM)) +
geom_point(aes(fill = treatment), shape = 21, color = "white",
size = 3, alpha = 0.8, position = position_jitter(0.2, seed = 666)) +
scale_fill_manual(values = c("grey70", "tomato1")) +
labs(x = "time point",
y = "log10(FPKM)",
title = paste0("low F low var example\n",
neither_examples$LocusName)) +
theme_classic() +
theme(
text = element_text(size = 14),
axis.text = element_text(color = "black"),
plot.title = element_text(size = 12)
)
wrap_plots(
high_var_only_graphs,
high_var_high_F_graphs,
low_var_low_F_graphs,
high_F_only_graphs,
nrow = 2, guides = "collect"
) & theme(legend.position = "bottom")
ggsave("../Results/Tepary_gene_select.svg", width = 6, height = 6)
ggsave("../Results/Tepary_gene_select.png", width = 6, height = 6)
There is a lot to unpack from these graphs.
- High var only genes seem to be strictly diurnal, with no separation between treatments.
- High F and high var genes change across both time and treatments.
- Low var low F genes are too noisy to say anything.
- High F only genes seem to differ more between treatments than across time.
...at least from the 4 examples we saw. Given the strong circadian/diurnal component in this experiment (PC1), perhaps we can take the union of high F and high var genes for co-expression analysis.
high_var_or_high_F_genes <- F_high_var_comparison %>%
filter(type != "neither")
dim(high_var_or_high_F_genes)
Exp_table_long_averaged_z_high_var_or_high_F <- Exp_table_long_averaged_z %>%
filter(LocusName %in% high_var_or_high_F_genes$LocusName)
head(Exp_table_long_averaged_z_high_var_or_high_F)
## [1] 9649 9
We are putting in 9649 genes for co-expression analysis.
Now we can correlate each gene to every other gene. The essence of this workflow is simple, so we will use a simple correlation. If you want, you can use fancier methods such as GENIE3.
We will use the cor() function in R. But cor() only takes a vector or matrix as input, so we need to go from long to wide again.
z_score_wide <- Exp_table_long_averaged_z_high_var_or_high_F %>%
mutate(tag = paste(time_point, treatment, sep = "-")) %>%
select(LocusName, tag, z.score) %>%
pivot_wider(names_from = tag, values_from = z.score) %>%
as.data.frame()
row.names(z_score_wide) <- z_score_wide$LocusName
head(z_score_wide)
The tag
column contains info for both time point and treatment.
After long to wide transformation, the tag
column now becomes the column name of this wide table.
Then we produce the correlation matrix. The underlying math here is that R takes each column of a matrix and correlates it to every other column.
To get this to work on our wide table, we remove the LocusName column, transpose it, and feed it into cor().
cor_matrix <- cor(t(z_score_wide[, -1]))
dim(cor_matrix)
## [1] 9649 9649
This step can take a while, because it is computing many correlation coefficients. We threw in 9,649 genes (high var or high F), so it is computing 9649^2 correlations. The correlation matrix should contain 9649 rows and 9649 columns.
Now we have this huge correlation matrix, what do we do next? Not all correlations are statistically significant (whatever that means), and definitely not all correlations are biologically meaningful. How do we select which correlations to use in downstream analyses? I call this step "edge selection", because this is building up to a network analysis, where each gene is a node, and each correlation is an edge.
I have two ways to do this.
- t distribution approximation
- Empirical determination using rank distribution.
It turns out that for each correlation coefficient r, you can approximate a t statistic, under some assumptions.
The equation is t = r * sqrt((n - 2)/(1 - r^2)), where n is the number of time point by treatment combinations and n - 2 is the degrees of freedom.
number_of_time_treatment <- ncol(z_score_wide) - 1
number_of_time_treatment
## [1] 10
In this case, n is 10. There are two ways to find it.
The first way is the number of columns in the z score wide table - 1, because the 1st column is gene ID.
The other way is using the parsed metadata, which is now part of PCA_coord
.
PCA_coord %>%
group_by(time_point, treatment) %>%
count() %>%
nrow()
## [1] 10
Both methods say we have 10 unique time point by treatment combinations. We are good to proceed.
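If you want to convince yourself that this t approximation matches what R's cor.test() computes, here is a quick toy check with two random vectors (purely illustrative):
# Toy check: the t approximation should match cor.test()'s t statistic
x <- rnorm(number_of_time_treatment)
y <- rnorm(number_of_time_treatment)
r_toy <- cor(x, y)
r_toy * sqrt((number_of_time_treatment - 2)/(1 - r_toy^2))
cor.test(x, y)$statistic # same t value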
cor_matrix_upper_tri <- cor_matrix
cor_matrix_upper_tri[lower.tri(cor_matrix_upper_tri)] <- NA
Before we select edges (correlations), we need to deal with some redundant data. The correlation matrix is symmetrical along its diagonal. The diagonal will be 1, because it is correlating with itself. Everything else appears twice. We can take care of that by setting the upper (or lower) triangle of this matrix to NA. This step can take a while. The larger the matrix, the slower it is.
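To see what masking the lower triangle does on a small scale, here is a toy 3 x 3 example (random values, purely illustrative):
# Toy 3 x 3 example of masking the lower triangle of a correlation matrix
toy <- cor(matrix(rnorm(30), ncol = 3))
toy
toy[lower.tri(toy)] <- NA
toy # the diagonal and upper triangle remain; the duplicated lower triangle is now NA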
Now we can compute a t statistic from r and compute a p value using the t distribution.
Again, this is a tidyverse workflow, so brace yourself for many %>%
.
edge_table <- cor_matrix_upper_tri %>%
as.data.frame() %>%
mutate(from = row.names(cor_matrix)) %>%
pivot_longer(cols = !from, names_to = "to", values_to = "r") %>%
filter(is.na(r) == F) %>%
filter(from != to) %>%
mutate(t = r*sqrt((number_of_time_treatment-2)/(1-r^2))) %>%
mutate(p.value = case_when(
t > 0 ~ pt(t, df = number_of_time_treatment-2, lower.tail = F),
t <=0 ~ pt(t, df = number_of_time_treatment-2, lower.tail = T)
)) %>%
mutate(FDR = p.adjust(p.value, method = "fdr"))
head(edge_table)
This chunk converts the correlation matrix into a data table.
Then it goes from wide to long using pivot_longer()
.
After that, everything is normal dplyr verbs, such as mutate() and filter().
P values are computed using the t distribution.
Depending on the sign of t, the upper or lower tail probability is taken.
Finally, the p values are adjusted for multiple comparisons using FDR.
This step can take a while. Turning a large wide table to a long table always takes a while.
Your computer may not have enough memory to run this step if you put in many genes.
You can look at various adjusted p value cutoffs and the corresponding r value before proceeding. Let's say we just look at positively correlated genes.
edge_table %>%
filter(r > 0) %>%
filter(FDR < 0.05) %>%
slice_min(order_by = abs(r), n = 10)
edge_table %>%
filter(r > 0) %>%
filter(FDR < 0.01) %>%
slice_min(order_by = abs(r), n = 10)
## from to r t p.value FDR
## Phacu.CVR.002G106700 Phacu.CVR.003G099300 0.6896428 2.693648 0.01367084 0.04999999
## from to r t p.value FDR
## Phacu.CVR.007G004700 Phacu.CVR.008G034800 0.8481535 4.528434 0.0009641728 0.009999979
If you cut off the FDR at 0.05, then your r values are 0.68 or larger. If you cut off the FDR at 0.01, then your r values are 0.84 or larger.
If I go into this analysis not knowing any biology, then I would proceed with a t approximation followed by some p value cutoff. I think in real life, this is hardly the case. We usually know something a priori. This is where bait genes can be helpful. You can use the bait genes to determine the cutoff if you know two bait genes are involved in the same process. The underlying assumption is if two bait genes are involved in the same process, they might be co-expressed. Because this selection method is based on empirical observations, I argue this is better than using an arbitrary p value cutoff.
edge_table %>%
filter(from == "Phacu.CVR.002G288900" &
to == "Phacu.CVR.003G017200" |
from == "Phacu.CVR.003G017200" &
to == "Phacu.CVR.002G288900" )
## from to r t p.value FDR
## Phacu.CVR.002G288900 Phacu.CVR.003G017200 0.7741764 3.459408 0.004287624 0.02426155
I tried out two bait genes. They are correlated at r = 0.77. Based on this empirical observation, we can cut off in the vicinity of 0.77, say r > 0.75.
You can also look at the distribution of r values.
edge_table %>%
slice_sample(n = 20000) %>%
ggplot(aes(x = r)) +
geom_histogram(color = "white", bins = 100) +
geom_vline(xintercept = 0.75, color = "tomato1", size = 1.2) +
theme_classic() +
theme(
text = element_text(size = 14),
axis.text = element_text(color = "black")
)
ggsave("../Results/Tepary_r_histogram.svg", height = 3.5, width = 5, bg = "white")
ggsave("../Results/Tepary_r_histogram.png", height = 3.5, width = 5, bg = "white")
Here I randomly sampled 20k edges and plotted a histogram. You can plot the whole edge table, but it will take a lot longer to make the graph. As long as you sample enough edges, the shape of the distribution does not change. Looks like at r > 0.75 (red line), the distribution trails off rapidly. So let's use r > 0.75 as the cutoff.
Why do I warn against determining cutoffs using p values alone? Because the p value is a function of both the effect size (r) and the degrees of freedom (df). Experiments with larger df produce smaller p values given the same effect size. The advantage of empirical determination using bait genes is that the correlation between baits is more or less independent of df.
Note that there are many negatively correlated genes; we can look at those as well. But for the sake of this example, let's just look at positively correlated genes.
edge_table_select <- edge_table %>%
filter(r >= 0.75)
dim(edge_table_select)
## [1] 6741133 6
We are now down to 6,741,133 edges. Still A LOT.
Is this a perfect cutoff calling method? No. Is this method grounded in sound understanding of statistics, heuristics, and guided by the biology? Yes.
Before we move forward, we can examine the correlation between two bait genes using a scatter plot.
Bait_cor_by_time <- z_score_wide %>%
filter(LocusName == "Phacu.CVR.002G288900" |
LocusName == "Phacu.CVR.003G017200") %>%
select(-LocusName) %>%
t() %>%
as.data.frame() %>%
mutate(tag = row.names(.)) %>%
separate(tag, c("time_point", "treatment"), sep = "-") %>%
mutate(time_point = as.numeric(time_point)) %>%
inner_join(PCA_coord, by = c("time_point", "treatment")) %>%
ggplot(aes(x = Phacu.CVR.002G288900,
y = Phacu.CVR.003G017200)) +
geom_point(aes(fill = time_point), color = "grey20",
size = 2, alpha = 0.8, shape = 21) +
scale_fill_gradientn(colors = viridis(10, option = "A")) +
guides(fill = guide_colorsteps()) +
labs(x = "Bait1 z score",
y = "Bait2 z score",
fill = "time point") +
theme_classic() +
theme(
legend.position = c(0.25, 0.7),
text = element_text(size = 14),
axis.text = element_text(color = "black")
)
Bait_cor_by_treat <- z_score_wide %>%
filter(LocusName == "Phacu.CVR.002G288900" |
LocusName == "Phacu.CVR.003G017200") %>%
select(-LocusName) %>%
t() %>%
as.data.frame() %>%
mutate(tag = row.names(.)) %>%
separate(tag, c("time_point", "treatment"), sep = "-") %>%
mutate(time_point = as.numeric(time_point)) %>%
inner_join(PCA_coord, by = c("time_point", "treatment")) %>%
ggplot(aes(x = Phacu.CVR.002G288900,
y = Phacu.CVR.003G017200)) +
geom_point(aes(fill = treatment), color = "grey20",
size = 2, alpha = 0.8, shape = 21) +
scale_fill_manual(values = c("grey70", "tomato1")) +
labs(x = "Bait1 z score",
y = "Bait2 z score",
fill = "treatment") +
theme_classic() +
theme(
legend.position = c(0.25, 0.7),
text = element_text(size = 14),
axis.text = element_text(color = "black")
)
wrap_plots(
Bait_cor_by_time,
Bait_cor_by_treat,
nrow = 1
)
ggsave("../Results/Tepary_Bait_correlation.svg", height = 4.5, width = 9, bg = "white")
ggsave("../Results/Tepary_Bait_correlation.png", height = 4.5, width = 9, bg = "white")
Here each dot is a library. You can annotate the libraries using metadata, which is now part of PCA_coord
.
We can see that these two bait genes (involved in trehalose biosynthesis) have similar expression patterns. They are both up-regulated by the heat treatment.
Perhaps this is consistent with the hypothesis that trehalose has protective roles under stress.
We will be using igraph
to do some of the downstream analyses. It will do a lot of the heavy lifting for us.
While you can get Leiden as a standalone package, Leiden is also part of the igraph
package.
The first thing to do is producing a graph object, also known as a network object.
To make a graph object, you need an edge table. We already made that: edge_table_select, an edge table filtered based on our r cutoff.
Optionally, we can also provide a node table, which contains information about all the nodes present in this network. We can make that.
We need two things:
- Non-redundant gene IDs from the edge table
- Functional annotation.
funct_anno <- read_delim("../Data/Moghaddam2022_data/Pacu.CVR.working_models.func_anno.txt",
delim = "\t", col_names = F, col_types = cols())
funct_anno <- funct_anno %>%
rename(gene = X1) %>%
mutate(LocusName = str_remove(gene, "\\.\\d+$")) %>%
distinct(LocusName, .keep_all = T)
head(funct_anno)
I noticed the annotation is at isoform level. So I removed the isoform number and took 1 isoform per gene.
node_table <- data.frame(
LocusName = c(edge_table_select$from, edge_table_select$to) %>% unique()
) %>%
left_join(funct_anno %>%
select(-gene), by = "LocusName") %>%
rename(functional_annotation = X2)
head(node_table)
dim(node_table)
## [1] 9649 2
We have 9649 genes in this network, along with 6,741,133 edges.
my_network <- graph_from_data_frame(
edge_table_select,
vertices = node_table,
directed = F
)
graph_from_data_frame()
is a function from the igraph
package.
It takes your edge table and node table and produce a graph (aka network) from it.
Note that I selected the directed = F
argument, because we made our network using correlation.
Correlation is non-directional, because cor(A,B) = cor(B,A).
The next step is to detect modules from the graph object. But first we need to optimize the resolution parameter for clustering. The optimal resolution for module detection differs between networks. A key factor that contributes to the difference in optimal resolution is the extent to which nodes are inter-connected.
Since this is a simple workflow, we can determine the optimal resolution using heuristics. We can test a range of resolutions and monitor two key performance indexes:
- Optimize number of modules that have >= 5 genes.
- Optimize number of genes that are contained in modules that have >= 5 genes.
Because:
- Too low resolution leads to forcing genes with different expression patterns into the same module.
- Too high resolution leads to many genes not contained in any one module.
optimize_resolution <- function(network, resolution){
modules = network %>%
cluster_leiden(resolution_parameter = resolution,
objective_function = "modularity")
parsed_modules = data.frame(
gene_ID = names(membership(modules)),
module = as.vector(membership(modules))
)
num_module_5 = parsed_modules %>%
group_by(module) %>%
count() %>%
arrange(-n) %>%
filter(n >= 5) %>%
nrow() %>%
as.numeric()
num_genes_contained = parsed_modules %>%
group_by(module) %>%
count() %>%
arrange(-n) %>%
filter(n >= 5) %>%
ungroup() %>%
summarise(sum = sum(n)) %>%
as.numeric()
cbind(num_module_5, num_genes_contained) %>%
as.data.frame()
}
Here I wrote a function to detect modules, pull out the number of modules that have >= 5 genes, and count the number of genes contained in modules that have >= 5 genes. All in one function.
Then I can test a list of resolutions in this function. Let's test a range of resolution from 0.25 to 5, in steps of 0.25.
optimization_results <- purrr::map_dfr(
.x = seq(from = 0.25, to = 5, by = 0.25),
.f = optimize_resolution,
network = my_network
) %>%
cbind(
resolution = seq(from = 0.25, to = 5, by = 0.25)
) %>%
as.data.frame()
head(optimization_results)
This could take a while. We have the results organized into one tidy data table. We can graph it.
Optimize_num_module <- optimization_results %>%
ggplot(aes(x = resolution, y = num_module_5)) +
geom_line(size = 1.1, alpha = 0.8, color = "dodgerblue2") +
geom_point(size = 3, alpha = 0.7) +
geom_vline(xintercept = 2, size = 1, linetype = 4) +
labs(x = "resolution parameter",
y = "num. modules\nw/ >=5 genes") +
theme_classic() +
theme(
text = element_text(size = 14),
axis.text = element_text(color = "black")
)
Optimize_num_gene <- optimization_results %>%
ggplot(aes(x = resolution, y = num_genes_contained)) +
geom_line(size = 1.1, alpha = 0.8, color = "violetred2") +
geom_point(size = 3, alpha = 0.7) +
geom_vline(xintercept = 2, size = 1, linetype = 4) +
labs(x = "resolution parameter",
y = "num. genes in\nmodules w/ >=5 genes") +
theme_classic() +
theme(
text = element_text(size = 14),
axis.text = element_text(color = "black")
)
wrap_plots(Optimize_num_module, Optimize_num_gene, nrow = 2)
ggsave("../Results/Tepary_Optimize_resolution.svg", height = 5, width = 3.2, bg ="white")
ggsave("../Results/tepary_Optimize_resolution.png", height = 5, width = 3.2, bg ="white")
Looking at these graphs, it is not clear what the optimum would be. The number of modules continues to increase for the most part, while the number of contained genes continues to decrease. There is a clear trade-off between the two metrics. In this example, we have a lot of genes, so we can afford to go with a higher resolution. But you do you.
modules <- cluster_leiden(my_network, resolution_parameter = 5,
objective_function = "modularity")
cluster_leiden()
runs the Leiden algorithm for you.
resolution_parameter
controls how many clusters you will get. The larger it is, the more clusters.
The underlying math of objective_function
is beyond me, but it specifies how the modules are computed.
Next, we need to link the module membership to the gene IDs.
my_network_modules <- data.frame(
LocusName = names(membership(modules)),
module = as.vector(membership(modules))
) %>%
inner_join(node_table, by = "LocusName")
my_network_modules %>%
group_by(module) %>%
count() %>%
arrange(-n) %>%
filter(n >= 5)
my_network_modules %>%
group_by(module) %>%
count() %>%
arrange(-n) %>%
filter(n >= 5) %>%
ungroup() %>%
summarise(sum = sum(n))
We have 28 modules, containing 6215 genes. Note that Leiden clustering has a stochastic aspect. The membership may be slightly different every time you run it. Moving forward we will only use modules that have 5 or more genes.
module_5 <- my_network_modules %>%
group_by(module) %>%
count() %>%
arrange(-n) %>%
filter(n >= 5)
my_network_modules <- my_network_modules %>%
filter(module %in% module_5$module)
head(my_network_modules)
We have a bunch of different modules now, how do we know if they make any sense? One way to QC these modules is looking at our bait genes.
my_network_modules %>%
filter(LocusName == "Phacu.CVR.003G017200" |
LocusName == "Phacu.CVR.002G288900")
Looks like they are in the same module, very good to see. Remember, they are correlated at r = 0.77; they should be in the same module.
## LocusName module functional_annotation
## Phacu.CVR.002G288900 4 Haloacid dehalogenase-like hydrolase (HAD) superfamily protein
## Phacu.CVR.003G017200 4 UDP-Glycosyltransferase / trehalose-phosphatase family protein
The next key task is understanding the expression pattern of the clusters. Again, the essence of this workflow is simple, so we will use a simple method: peak expression. To do that, we append the module membership data back to the long table containing z scores.
Exp_table_long_averaged_z_high_var_or_high_F_modules <- Exp_table_long_averaged_z_high_var_or_high_F %>%
inner_join(my_network_modules, by = "LocusName")
head(Exp_table_long_averaged_z_high_var_or_high_F_modules)
Now we can produce summary statistics for each cluster and look at their expression pattern using mean.
modules_mean_z <- Exp_table_long_averaged_z_high_var_or_high_F_modules %>%
group_by(module, time_point, treatment) %>%
summarise(mean.z = mean(z.score)) %>%
ungroup()
head(modules_mean_z)
Then we look at which time point and treatment each module is most highly expressed in.
module_peak_exp <- modules_mean_z %>%
group_by(module) %>%
slice_max(order_by = mean.z, n = 1)
module_peak_exp
Again, group_by()
is doing a lot of heavy lifting here.
You can also QC the clusters via line graphs. It will be too much to look at if we graph all the modules, so let's just pick 3.
I picked:
- module 18, which is the largest cluster.
- module 4, where our bait genes are. It appears to be a heat responsive module. Also the 2nd largest module.
- module 20, third largest module.
module_line_plot <- Exp_table_long_averaged_z_high_var_or_high_F_modules %>%
mutate(treatment = factor(treatment, levels = c("control", "heat"))) %>%
filter(module == "18" |
module == "4" |
module == "20") %>%
ggplot(aes(x = time_point, y = z.score)) +
facet_grid(treatment ~ module) +
geom_line(aes(group = LocusName), alpha = 0.3, color = "grey70") +
geom_line(
data = modules_mean_z %>%
filter(module == "18" |
module == "4"|
module == "20") %>%
mutate(treatment = factor(treatment, levels = c("control", "heat"))),
aes(y = mean.z, group = module),
size = 1.1, alpha = 0.8
) +
labs(x = "time point",
y = "z score") +
theme_classic() +
theme(
text = element_text(size = 14),
axis.text = element_text(color = "black"),
#axis.text.x = element_blank(),
panel.spacing = unit(1, "line")
)
module_line_plot
ggsave("../Results/Terpary_module_line_plots.svg", height = 4, width = 8.2, bg = "white")
ggsave("../Results/Terpary_module_line_plots.png", height = 4, width = 8.2, bg = "white")
Very interesting. Module 4 is messy, but it is more highly expressed in response to heat at the 12hr time point. Module 4 is pretty much a flat line in the control treatment, a good contrast from the heat treatment.
On the other hand, module 18 has a robust pattern peaking at the 12hr time point for both treatments. Similar to module 18, module 20 has a robust pattern peaking at the 24hr time point for both treatments.
A good way to present these modules is to make a heat map. To make an effective heatmap though, we need to take care of a few things.
- reorder x and y axis
- take care of outliers
modules_mean_z$mean.z %>% summary()
quantile(modules_mean_z$mean.z, c(0.05, 0.95))
Looking at the quartiles and extremes of z score, we can probably clip the z score at +/- 1.5.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.5865 -0.5289 0.1582 0.0000 0.6044 1.6226
## 5% 95%
## -1.636757 1.115807
modules_mean_z <- modules_mean_z %>%
  mutate(mean.z.clipped = case_when(
    mean.z > 1.5 ~ 1.5,
    mean.z < -1.5 ~ -1.5,
    T ~ mean.z
  ))
This sets z scores > 1.5 or < -1.5 to 1.5 or -1.5, respectively. The rest remain unchanged.
Let's say we graph modules on the y axis, and time/treatment on the x axis. Reordering the columns is easy; time point already has an order.
Ordering rows is not as straightforward.
What I usually do is I reorder the rows based on their peak expression.
We use the module_peak_exp
table that we already made.
modules_mean_z_reordered <- modules_mean_z %>%
full_join(module_peak_exp %>%
rename(peak_time = time_point) %>%
select(module, peak_time), by = "module") %>%
mutate(module = reorder(module, -peak_time))
head(modules_mean_z_reordered)
Because we know time point is the major driver of variance in this dataset, I only reordered the rows by peak expression across time points, rather than by both time point and treatment.
modules_mean_z_reordered %>%
ggplot(aes(x = as.factor(time_point), y = as.factor(module))) +
facet_grid(. ~ treatment, scales = "free", space = "free") +
geom_tile(aes(fill = mean.z.clipped)) +
scale_fill_gradientn(colors = rev(brewer.pal(11, "RdBu")), limits = c(-1.5, 1.5),
breaks = c(-1.5, 0, 1.5), labels = c("< -1.5", "0", "> 1.5")) +
labs(x = "Time point",
y = "Module",
fill = "z score") +
theme_classic() +
theme(
text = element_text(size = 14),
axis.text = element_text(color = "black"),
legend.position = "top"
)
ggsave("../Results/Tepary_module_heatmap.svg", height = 6, width = 5, bg = "white")
ggsave("../Results/Tepary_module_heatmap.png", height = 6, width = 5, bg = "white")
First of all, the 12hr time point appears to be the "hot spot" for heat response. We detected various heat responsive modules, with varying interaction with the diurnal pattern. For example, module 4 is not diurnal in the control treatment but is up-regulated during heat treatment. Module 10 is also up-regulated under heat treatment, but it has a more robust diurnal pattern. We also detected many heat repressed modules.
We can plot some of them. I think modules 3, 7, and 9 look interesting.
Exp_table_long_averaged_z_high_var_or_high_F_modules %>%
mutate(treatment = factor(treatment, levels = c("control", "heat"))) %>%
filter(module == "3" |
module == "7" |
module == "9") %>%
ggplot(aes(x = time_point, y = z.score)) +
facet_grid(treatment ~ module) +
geom_line(aes(group = LocusName), alpha = 0.3, color = "grey70") +
geom_line(
data = modules_mean_z %>%
filter(module == "3" |
module == "7"|
module == "9") %>%
mutate(treatment = factor(treatment, levels = c("control", "heat"))),
aes(y = mean.z, group = module),
size = 1.1, alpha = 0.8
) +
labs(x = "time point",
y = "z score") +
theme_classic() +
theme(
text = element_text(size = 14),
axis.text = element_text(color = "black"),
#axis.text.x = element_blank(),
panel.spacing = unit(1, "line")
)
ggsave("../Results/Terpary_module_line_plots_hr.svg", height = 4, width = 8.2, bg = "white")
ggsave("../Results/Terpary_module_line_plots_hr.png", height = 4, width = 8.2, bg = "white")
All 3 of these modules are heat responsive, but in different ways.
A common data visualization for gene co-expression analyses is network graphs.
We will be using ggraph
, a ggplot
extension of igraph
.
Our network has almost 10,000 genes and more than 6 million edges. It's too much to look at if we graph the full network. On the other hand, there is not much to look at anyway for very large networks. You just get messy hairballs.
Say we want to look at genes directly co-expressed with our bait genes.
We can pull out their neighbors using the neighbors()
function within igraph
.
igraph
comes with a set of network analysis functions that we can call.
For the sake of this example, let's add a couple of genes from other clusters as well. In addition to the bait genes (trehalose synthesis genes), I also picked a photosynthesis gene and a heat shock protein.
neighbors_of_bait <- c(
neighbors(my_network, v = "Phacu.CVR.003G017200"), # TPS
neighbors(my_network, v = "Phacu.CVR.002G288900"), # TPS
neighbors(my_network, v = "Phacu.CVR.001G000200"), # photosynthesis
neighbors(my_network, v = "Phacu.CVR.006G015700") # heat shock
) %>%
unique()
length(neighbors_of_bait)
## [1] 1333
We can make a sub-network object. First we subset edges in the network.
subnetwork_edges <- edge_table_select %>%
filter(from %in% names(neighbors_of_bait) &
to %in% names(neighbors_of_bait)) %>%
group_by(from) %>%
slice_max(order_by = r, n = 5) %>%
ungroup() %>%
group_by(to) %>%
slice_max(order_by = r, n = 5) %>%
ungroup()
subnetwork_genes <- c(subnetwork_edges$from, subnetwork_edges$to) %>% unique()
length(subnetwork_genes)
dim(subnetwork_edges)
## [1] 1332
## [1] 3667 6
We can constrain the edges such that both the start and end of each edge are neighbors of baits. I also filtered for highly correlated neighbors (top 5 edges/node based on r value). We still have 3667 edges and 1332 nodes. Note that the most correlated edges for each node may overlap, so the total number of edges remaining will be less than what you might expect.
Then we subset nodes in the network.
subnetwork_nodes <- node_table %>%
filter(LocusName %in% subnetwork_genes) %>%
left_join(my_network_modules, by = "LocusName") %>%
left_join(module_peak_exp, by = "module")
dim(subnetwork_nodes)
## [1] 1337 2
Then make sub-network object from subsetted edges and nodes.
my_subnetwork <- graph_from_data_frame(subnetwork_edges,
vertices = subnetwork_nodes,
directed = F)
Use graph_from_data_frame()
from igraph
to build the sub-network.
There are ways to directly filter existing networks, but I always find it more straightforward to build sub-network de novo from filtered edge and node tables.
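For the record, the direct route would be something like igraph's induced_subgraph(), which keeps all existing edges among a given set of vertices (shown only as an alternative; note it does not apply the top-5-edges-per-node filter used above):
# Alternative: subset the full network directly with igraph
my_subnetwork_alt <- induced_subgraph(my_network, vids = names(neighbors_of_bait))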
my_subnetwork %>%
ggraph(layout = "kk", circular = F) +
geom_edge_diagonal(color = "grey70", width = 0.5, alpha = 0.5) +
geom_node_point(alpha = 0.8, color = "grey30", shape = 21, size = 2,
aes(fill = time_point)) +
scale_fill_gradientn(colors = viridis(10, option = "A")) +
labs(fill = "Peak time point") +
guides(size = "none",
fill = guide_colorsteps()) +
theme_void()+
theme(
text = element_text(size = 14),
legend.position = "bottom",
legend.justification = 1,
title = element_text(size = 12)
)
ggsave("../Results/Tepary_subnetwork_graph.svg", height = 5, width = 4, bg = "white")
ggsave("../Results/Tepary_subnetwork_graph.png", height = 5, width = 4, bg = "white")
I don't know how useful this actually is, other than being somewhat visually impressive. If you are more familiar with the biology, this type of visualization will be more informative, for example if you highlight certain candidate genes on this network. This step could take a while; it is trying to draw many, many lines and dots. Unsurprisingly, we get a bunch of distinct hairballs. A good piece of advice here is to try different graph layouts. The layout can have a huge impact on the appearance of the network graph. See igraph layouts, ggraph layouts, and trying different layouts for more information.
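For example, switching layouts only requires changing the layout argument. Here is a quick sketch with the Fruchterman-Reingold force-directed layout ("fr"), keeping everything else the same:
# Same sub-network, different layout: swap "kk" for "fr"
my_subnetwork %>%
  ggraph(layout = "fr") +
  geom_edge_diagonal(color = "grey70", width = 0.5, alpha = 0.5) +
  geom_node_point(alpha = 0.8, color = "grey30", shape = 21, size = 2,
                  aes(fill = time_point)) +
  scale_fill_gradientn(colors = viridis(10, option = "A")) +
  theme_void()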
We did a bunch of analyses, now what?
A common "ultimate" goal for gene co-expression analyses is to find new candidate genes, which are genes co-expressed with bait genes.
After doing network analysis, this is very easy to find.
We can either look at what other genes are in module 4, which both our bait genes are in, or we can look at direct neighbors of the bait genes.
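The module route is a one-liner (using module 4 from the run above; your module number may differ because Leiden clustering is stochastic):
# All genes in the module that contains the bait genes
my_network_modules %>%
  filter(module == 4)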
igraph
comes with a set of network analysis functions that we can call.
And we already did that earlier for the sub-network.
neighbors_of_TPS <- c(
neighbors(my_network, v = "Phacu.CVR.003G017200"),
neighbors(my_network, v = "Phacu.CVR.003G183300"),
neighbors(my_network, v = "Phacu.CVR.002G288900")
) %>%
unique()
length(neighbors_of_TPS)
## [1] 533
Looks like there are 533 direct neighbors of our trehalose-6-phosphate synthase genes. We can take a quick look at their functional annotations.
Let's say you are interested in Protein kinases.
my_protein_kinase <- my_network_modules %>%
filter(LocusName %in% names(neighbors_of_TPS)) %>%
filter(str_detect(functional_annotation, "Protein kinase"))
nrow(my_protein_kinase)
## [1] 13
Looks like there are 13 of them.
Let's say you are interested in a few of those kinases you found. We can also graph them.
Exp_table_long %>%
filter(LocusName %in% my_protein_kinase[1:3, ]$LocusName) %>%
inner_join(PCA_coord, by = "library") %>%
mutate(tag = str_remove(LocusName, "Phacu.CVR.")) %>%
ggplot(aes(x = time_point, y = 10^logFPKM - 1)) + # back-transform log10(FPKM + 1) to FPKM
facet_grid(tag ~ ., scales = "free_y") +
geom_point(aes(fill = treatment), color = "white", size = 2,
alpha = 0.8, shape = 21, position = position_jitter(0.1, seed = 666)) +
stat_summary(geom = "line", aes(group = treatment, color = treatment),
fun = mean, alpha = 0.8, size = 1.1) +
scale_fill_manual(values = c("grey70", "tomato1")) +
scale_color_manual(values = c("grey70", "tomato1")) +
labs(x = "time point",
y = "FPKM") +
theme_bw() +
theme(
legend.position = "none",
panel.spacing = unit(1, "lines"),
text = element_text(size = 14),
axis.text = element_text(color = "black"),
strip.background = element_blank()
)
ggsave("../Results/Tepary_Candidate_genes.svg", height = 4.8, width = 4, bg = "white")
ggsave("../Results/Tepary_Candidate_genes.png", height = 4.8, width = 4, bg = "white")
I only graphed the first 3 protein kinases on the list. Looks like all 3 of them are heat inducible in some way.
Finally, I want to write out the neighbors of our bait genes as a table onto the hard drive. That's easy.
Bait_neighors <- funct_anno %>%
filter(LocusName %in% names(neighbors_of_TPS)) %>%
rename(annotation = X2) %>%
select(LocusName, annotation)
head(Bait_neighors)
write_excel_csv(Bait_neighors, "../Results/Tepary_TPS_neighbors.csv", col_names = T)
What would be really interesting is to take the list of neighbors and check if any of them overlap QTL for heat tolerance, for example along the lines of the sketch below.
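A minimal sketch of such an overlap check, assuming you have a gene coordinate table and a heat tolerance QTL table (the file names and column names below are hypothetical placeholders):
# Hypothetical inputs: gene coordinates (LocusName, Chr, start, end)
# and QTL intervals (Chr, QTL_start, QTL_end)
gene_coords <- read_delim("../Data/gene_coords.txt", delim = "\t", col_types = cols())
heat_QTL <- read_delim("../Data/heat_QTL.txt", delim = "\t", col_types = cols())
Bait_neighors %>%
  inner_join(gene_coords, by = "LocusName") %>%
  inner_join(heat_QTL, by = "Chr") %>%
  filter(start <= QTL_end & end >= QTL_start) # gene and QTL intervals overlap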
Well, we are pretty much done. Now you just need to send the list of candidate genes and the nice graphics to your wet lab or breeder folks. Hopefully they find something interesting.