markdown_coOccur.Rmd

---
title: "UCDavisHackaton"
output: pdf_document
---

```{r setup_1, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## R Markdown

This is an R Markdown document to make a co-occurence network analysis and stats.


In these analyses, presences of  OTUs are pair-wise compared across the different microbial samples. To limit the number of statistical comparisons, and thus constrain the number of potential false positives, OTUs that occurred in less than 5 samples will not be considered for network inferences. To establish edges (i.e. connections) between OTUs, and at the same time account for the limitations of scoring measures (Aitchison 1981; Legendre and Legendre, 1983), an ensemble method was used, based on non-parametric Spearman rank correlation and the Kullback-Leibler dissimilarity measure, as proposed by Faust et al. (2012). 

To start, initiate and create the working environment. This code will also install libraries if not yet there.

```{r setup_2, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(phyloseq)
library(ggplot2)
library(RColorBrewer)
library(tidyverse)
library(gplots)

library(vegan)
library(cooccur)
library(igraph)
library(flexmix)
library(tidyr)
library(randomcoloR)
```

## Creating a phyloseq object

### Read in some data
Here we're importing data that is more or less an otu table and a taxonomy table. These steps will be different if you use your own data.

```{r create_otu_1}
taxatable_raw <- read_csv('bact_alldata_taxatable_wTax.csv')
head(taxatable_raw)
```

As you can see, there is a missing column name, and R filled it in 'X1'. 

For the otu table, we want to have the OTU names as row names, and get rid of the taxa column. The final command turns the data frame into an otu object.
```{r create_otu_2}
otu <- as.data.frame(select(taxatable_raw, -X1,-taxonomy))
row.names(otu) <- taxatable_raw$X1
otu <- otu_table(otu, taxa_are_rows = T)
```


The taxonomy is in the same file, but phyloseq will want it separate. Further, we want to separate all the phylogenetic levels into separate columns. Row names should be the OTU ids. Phyloseq also wants the taxonomy to be a matrix, before converting to a taxonomy table.

```{r create_taxa_1}
taxonomy <- data.frame(taxonomy=taxatable_raw$taxonomy)
row.names(taxonomy)<- taxatable_raw$X1
# Split taxonomy into separate columns
taxonomy <- data.frame(separate(taxonomy, col=taxonomy, into= c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"),
                                sep = ";"))
# Change spaces to NA for missing data
taxonomy <- apply(taxonomy, 2, function(x) gsub("^$|^ $", NA, x))
taxonomy <- as.matrix(taxonomy)
taxonomy <- tax_table(taxonomy)
```

Finally, we want to get the sample meta data.

```{r create_meta_1}
map <- read_csv('bact_alldata_mapfile.csv')
head(map)
```

Again, there is a missing column header. Here, we change it to 'Sample'. We also want to convert some columns which were coded as integers into factors. 
Convert into sample data object.

```{r create_meta_2}
names(map)[1] <- 'Sample'
row.names(map) <- map$Sample
map[7:9] <- lapply(map[7:9] , factor)
head(map)
meta <- sample_data(map)
```

Finally, we can put it all together into a phyloseq object.

```{r create_phyloseq}
ps <- phyloseq(otu, taxonomy, meta)
```

... some further data formatting

``` {r format_data}
bacOTU<-data.frame(otu_table(ps))
bacTAX<-data.frame(tax_table(ps))

bacTAX$Domain <- gsub('k__', '', bacTAX$Domain, fixed=TRUE)
bacTAX$Phylum <- gsub('p__', '', bacTAX$Phylum, fixed=TRUE)
bacTAX$Class <- gsub('c__', '', bacTAX$Class, fixed=TRUE)
bacTAX$Order <- gsub('o__', '', bacTAX$Order, fixed=TRUE)
bacTAX$Family <- gsub('f__', '', bacTAX$Family, fixed=TRUE)
bacTAX$Genus <- gsub('g__', '', bacTAX$Genus, fixed=TRUE)
bacTAX$Species <- gsub('s__', '', bacTAX$Species, fixed=TRUE)

bacTAX[bacTAX=='']<-'Unassigned'
bacTAX[is.na(bacTAX)] <- 'Unassigned'
head(bacTAX)
```


``` {r  TEST}
##THIS STEP ONLY FOR TESTING
# only look at first XX OTUs to cut computation time: only to test code!!!
bacOTU_orig<-bacOTU
bacTAX_orig<-bacTAX

bacOTU<-bacOTU_orig[1:1000,]
bacTAX<-bacTAX_orig[1:1000,]

```

## Loading functions

Before we can get started, we need to load several custom functions, adapted from Karoline Faust (original code can be found at http://psbweb05.psb.ugent.be/conet/documents/networksFromSimCounts.R)

Before computing the Kullback-Leibler dissimilarity, OTU-counts should be normalized by dividing them by their total across the different samples.

These functions are used to compute the Kullback-Leibler dissimilarity and asses it's statistical significance, as described in Faust et al. (2012). To calculate a measure-specific p-value, bootstrap score distributions with 1000 iterations are compared to point null values that are computed by permutation (n=1000).


```{r functions_kld}

compute.kld=function(x, pseudocount=0.00000001){
  # diagonal is zero
  kld=matrix(data=0,nrow=nrow(x),ncol=nrow(x))
  for(i in 1:nrow(x)){
    for(j in 1:i){
      kld[i,j]=get.kld(x[i,],x[j,], pseudocount=pseudocount)   
      kld[j,i]=kld[i,j]  
    }
  }
  kld
}

get.kld=function(x,y, pseudocount=0.00000001){
  if(length(x) != length(y)){
    stop("The two vectors should have the same length!")
  }
  x[x==0]=pseudocount
  y[y==0]=pseudocount
  dis = 0
  x = x/sum(x)
  y = y/sum(y)
  for(i in 1:length(x)){
    if(!is.nan(x[i]) && !is.nan(y[i])){
      ratioxy = log(x[i]/y[i])
      ratioyx = log(y[i]/x[i])
      dis = x[i]*ratioxy+y[i]*ratioyx + dis
    }
  }
  dis
}

get.pval = function(matrix, x.index, y.index, N.rand=1000, method="spearman", renorm=F, permutandboot=F, plot=F, verbose=F) {
  x = matrix[x.index,]
  y = matrix[y.index,]
  lower.tail = TRUE
  # bray and kld are dissimilarities, so one-sided p-value needs to be computed from the upper tail
  if(method == "bray" || method == "kld"){
    lower.tail = FALSE
  }
  if(method == "spearman"){
    this.sim = cor(x, y, use="complete.obs", method="spearman")
  }else if(method == "pearson"){
    this.sim = cor(x, y, use="complete.obs", method="pearson")
  }else if(method == "bray"){
    this.sim= vegdist(rbind(x,y),method="bray")
  }else if(method == "kld"){
    this.sim=get.kld(x,y)
  }else{
    stop("Select either spearman, pearson, kld or bray as method.")
  }
  rand.sim = rep(NA, N.rand)
  boot.sim = rep(NA, N.rand)
  for (i in 1:N.rand) {
    rand = sample(x, length(x))
    if(renorm == T){
      mat.copy=matrix
      mat.copy[x.index,]=rand
      mat.copy = normalize(mat.copy)
      rand = mat.copy[x.index,]
      y = mat.copy[y.index,]
    }
    if(method == "spearman"){
      rand.sim[i] = cor(rand, y, method="spearman", use="complete.obs")
    }else if(method == "pearson"){
      rand.sim[i] = cor(rand, y, method="pearson",use="complete.obs")
    }else if(method == "bray"){
      rand.sim[i] = vegdist(rbind(rand,y),method="bray") 
    }else if(method == "kld"){
      rand.sim[i] = get.kld(rand,y)
    }
  }
  rand.sim = na.omit(rand.sim)
  if(plot == T){
    col1=rgb(0,0,1,1/3)
    col2=rgb(1,0,0,1/3)
    hist(rand.sim,col=col1)
    abline(v=mean(rand.sim),col="blue")
  }
  if(permutandboot){
    x=matrix[x.index,]
    y=matrix[y.index,]
    for (i in 1:N.rand) {
      rand.idx = sample(1:length(x),replace=TRUE)
      x.boot=x[rand.idx]
      y.boot=y[rand.idx]
      if(method == "spearman"){
        boot.sim[i] = cor(x.boot, y.boot, method="spearman", use="complete.obs")
      }else if(method == "pearson"){
        boot.sim[i] = cor(x.boot, y.boot, method="pearson",use="complete.obs")
      }else if(method == "bray"){
        boot.sim[i] = vegdist(rbind(x.boot,y.boot),method="bray") 
      }else if(method == "kld"){
        boot.sim[i] = get.kld(x.boot,y.boot)
      }
    }
    boot.sim = na.omit(boot.sim)
    if(plot == T){
      hist(boot.sim,col=col2,add=T)
      abline(v=mean(boot.sim),col="red")
      legend(x="topleft", c("Permut","Boot"), bg="white",col=c(col1,col2),lty=rep(1,2),merge=T)
    }
    # if we got enough non-NA permutation and bootstrap values, compute p-value
    if(length(rand.sim) > round(N.rand/3) && length(boot.sim) > round(N.rand/3)){
      pval = pnorm(mean(rand.sim),mean=mean(boot.sim),sd=sd(boot.sim), lower.tail=lower.tail)
    }else{
      pval = 0.5
    }
  }else{
    # if we got enough non-NA permutation values, compute p-value
    if(length(rand.sim) > round(N.rand/3)){
      if (lower.tail) {
        pval = (sum(this.sim > rand.sim) / length(rand.sim))
      } else {
        pval = (sum(this.sim < rand.sim) / length(rand.sim))
      }
    }else{
      pval = 0.5
    }
  }
  # set missing value (from constant vector) to intermediate p-value (worst possible p-value in this context)
  if(is.na(pval)){
    pval = 0.5
  }
  # p-values are one-sided, so high p-values signal mutual exclusion and are converted into low ones
  if(pval > 0.5){
    pval = 1 - pval
  }
  if(verbose == T){
    print(paste("p-value =",pval))
    print(paste("original score",this.sim))
    print(paste("mean of null distrib",mean(rand.sim)))
    print(paste("sd of null distrib",sd(rand.sim)))
    if(permutandboot == T){
      print(paste("mean of boot distrib",mean(boot.sim)))
      print(paste("sd of boot distrib",sd(boot.sim)))
    }
  }
  pval
}
```

# Getting started with the network analysis

First, we will need to make the data presence-absence, this is because the abundances can be skewed or biased. Samples need to be rows.

```{r presence_absence}
bacOTU_prab<-data.frame(t(bacOTU))
bacOTU_prab[bacOTU_prab>1]<-1

```

second, we need to remove non-informative OTUs, being those occuring in less than 5 samples, or occuring in all samples.

```{r non_informatives}
data_coOccur<-bacOTU_prab[,colSums(bacOTU_prab>0) >=5]
data_coOccur<-data_coOccur[,colSums(data_coOccur)<nrow(data_coOccur)]

```

# Creating the network matrix
this code will create the matrices to store the output of the co-occurence analysis

```{r network_1}
netmatrix <- data.frame(matrix(NA,nrow=ncol(data_coOccur),ncol=ncol(data_coOccur)))
rownames(netmatrix)<-colnames(data_coOccur)
colnames(netmatrix)<-colnames(data_coOccur)
p_matrx <- netmatrix
kldmatrix <- netmatrix
p_matrxkld <- netmatrix
```

Then, the correlations can be calculated. 

NOTICE! THIS STEP IS VERY COMPUTATIONALLY INTENSIVE, MAY TAKE HOURS TO DAYS

```{r network_2, message=FALSE}
for (sp1 in colnames(data_coOccur)){
  for (sp2 in colnames(data_coOccur)){
    #spearman
    if(sp1!=sp2){
      cor_out<-cor.test(data_coOccur[,sp1], data_coOccur[,sp2], method="spearman") 
      p_matrx[sp1, sp2]<-cor_out$p.value 
      
      #Kullback-Leibler distance
      y<-t(as.matrix(data.frame(spp1=data_coOccur[,sp1]/sum(data_coOccur[,sp1]),spp2=data_coOccur[,sp2]/sum(data_coOccur[,sp2]))))
      kld_value<-compute.kld(y)[1,2]
      pval_kld<-as.numeric(get.pval(y, 1, 2, N.rand=1000, permutandboot=F, method="kld"))
      p_matrxkld[sp1, sp2]<-pval_kld
      
      if (cor_out$p.value > 0.05 | pval_kld > 0.05){
        netmatrix[sp1,sp2] <- as.numeric(0)
        kldmatrix[sp1,sp2] <- as.numeric(0)
      }
      else{
          netmatrix[sp1,sp2] <- cor_out$estimate
          kldmatrix[sp1,sp2] <- kld_value
        }
    } 
    else{
      netmatrix[sp1,sp2] <- as.numeric(0)
      kldmatrix[sp1,sp2] <- as.numeric(0)
      p_matrx[sp1,sp2] <- as.numeric(0)
      p_matrxkld[sp1,sp2] <- as.numeric(0)
    }
  }
}
```

because the analysis takes so much time, it's best to save the outcome here before we continue

```{r network_3}
write.csv2(netmatrix, "UCDavisCourseNetmatrix.csv")
```

# Correcting p-values with the Benjamini-hochberg method and remove new non-significants correlations

To further reduce the number of false positives, the Benjamini-Hochberg method (Benjamini and Hochberg, 1995) was implemented to correct for multiple testing. We also need to execute this for the two correlation methods that were used (Spearman and kld)

After this correction, edges with an adjusted p-value above 0.05 will be discarded. This way, only edges remain that were significant for both the Spearman correlation and kld dissimilarity measures.


```{r pvalues}
netmatrix2<-netmatrix
pvalsadj2<-p.adjust(unlist(p_matrxkld), 'BH')   
pvalsadj2<-matrix(nrow=nrow(p_matrxkld), ncol=ncol(p_matrxkld), data=pvalsadj2)
for(i in 1:nrow(pvalsadj2)){
  for(j in 1: ncol(pvalsadj2)){
    if(pvalsadj2[i,j]>0.05){
      netmatrix2[i,j]<-as.numeric(0)
    }
  }
}

pvalsadj2<-p.adjust(unlist(p_matrx), 'BH')   
pvalsadj2<-matrix(nrow=nrow(p_matrx), ncol=ncol(p_matrx), data=pvalsadj2)
for(i in 1:nrow(pvalsadj2)){
  for(j in 1: ncol(pvalsadj2)){
    if(pvalsadj2[i,j]>0.05){
      netmatrix2[i,j]<-as.numeric(0)
    }
  }
}

```

# Going through the edges

after these corrections, we need to remove OTUs with no edges

```{r edges_1}
netmatrix_parse<-netmatrix[,colSums(netmatrix)>0]
netmatrix_parse<-netmatrix_parse[rowSums(netmatrix_parse)>0,]
```

We should also discard edges which have low correlation coeficients (both positive and negative). Choosing this cutoff is an arbitrary descision. The more stringent it is, the fewer edges, but the higher the chance the correlation is indicating a true correlation. In this step, we also change the correlation matrix to a matrix with presence or absences of edeges (1 or 0). 

In this example, only edges that were significantly supported by both methods and with a Spearman rank correlation >0.6 were retained in the network (Barber??n et al., 2012).

```{r edges_2}
netmatrix_parse[netmatrix_parse>=0.60]<-1
netmatrix_parse[netmatrix_parse<0.60]<-0

netmatrix_parse<-netmatrix_parse[,colSums(netmatrix_parse)>0]
netmatrix_parse<-netmatrix_parse[rowSums(netmatrix_parse)>0,]

```

# Visualizing the network

This code will create the network object to be used in igraph.

```{r igraph}
subi_network <- graph.adjacency(as.matrix(netmatrix_parse),mode="undirected",weighted=NULL) 

```

We can use taxonomy as a vertex object (vertices are the same as the nodes). We will also assign a random colour to each phylum.

```{r tax_to_vertex_1}
tax_sub<-bacTAX[,c("Phylum", "Class")]
colnames(tax_sub)<-c("Phylum", "Class")
tax_sub<-tax_sub[names(netmatrix_parse),]

V(subi_network)$Phylum=as.character(tax_sub$Phylum[match(V(subi_network)$name,row.names(tax_sub))])
V(subi_network)$color=as.character(tax_sub$Phylum[match(V(subi_network)$name,row.names(tax_sub))])

for(n in names(table(V(subi_network)$Phylum))){
  randomcol<-randomColor(count = 1)
  V(subi_network)$color=gsub(n,randomcol,V(subi_network)$color, fixed=TRUE) 
  
}

```

To create a circular network, we need to set a few more visual variables, like vertex and label size. It is also better to give each taxon an order in the graph, so that taxonomic groups stay together. As ggplot usually orders data alphabethically, we can use this attribute by assigning a letter (in this case random) to each taxonomic group.


```{r tax_to_vertex_2}

V(subi_network)$label.cex = 0.6

V(subi_network)$circord=as.character(tax_sub$Phylum[match(V(subi_network)$name,row.names(tax_sub))])

for(n in names(table(V(subi_network)$Phylum))){
  randomalpha<-sample(LETTERS, 1, TRUE)
  V(subi_network)$circord=gsub(n,randomalpha,V(subi_network)$circord, fixed=TRUE) 
}
layout <- layout_in_circle(subi_network, order=order(V(subi_network)$circord))
```

and this is how the resulting network looks like

Note: as a good alternative to a circular representation: check out the fruchterman reingold representation of a network graph. This will better show the individual clusters. This can be done with the parameter layout=layout.fruchterman.reingold.

```{r plot_network}
plot(subi_network, layout=layout, vertex.size=5, vertex.label=V(subi_network)$Phylum)

```

# Network statistics

As you can see, this network has `r ecount(subi_network)` edges and `r vcount(subi_network)` nodes.

We can calculate some basic statistics on the network, like the modularity or the clustering coefficient (transivity). The modularity is a measure that indicates to what the degree the network exist of distinct modules, while the clustering coefficient is the average fraction of pairs of species one link away from a species that are also linked to one another.

```{r stats}
table(V(subi_network)$Phylum)
modularity(subi_network, walktrap.community(subi_network)$membership)
transitivity(subi_network, type="global")
```

# Analysis of statistical overrepresentation

Overrepresentation of higher level taxa (i.e. phyla or classes) in the networks can assessed with the hypergeometric distribution implemented in the R function phyper (at p adj.<0.05). This discrete probability distribution describes the chance of finding the observed number of nodes (i.e. OTUs) belonging to higher level taxa, when OTUs would be randomly drawn from the population without replacement. We can perform the analysis of overrepresentation of edges between higher level taxa using the binomial distribution (i.e. with replacement) implemented in the R function pbinom, and using the frequency of edges in the network as background probabilities (Lima-Mendez et al., 2015). A p-value can be calculated as the probability of obtaining the observed number of edges between two higher level taxa. All p-values should also be adjusted for multiple testing, for instance using the Benjamini-Hochberg correction (Benjamini and Hochberg, 1995).

To refresh our minds: although the original OTU table was stored as bacOTU with the taxonomy in bacTAX, we started the analysis with the data_coOccur object, where some non-informative OTUs were removed. this latter is our starting point to which we need to compare.

## over/under representation of taxa

In this first chunck of code, we'll calculate the expected frequencies

```{r overrepresentation_exp}
plotdata<-data.frame(exp=as.numeric(table(bacTAX$Phylum)), obs=0, row.names=names(table(bacTAX$Phylum)))

```

to this, we'll add the observed frequencies

```{r overrepresentation_obs}
for(i in 1:nrow(plotdata)){
  if(rownames(plotdata)[[i]] %in% names(table(V(subi_network)$Phylum))){
    plotdata[i,'obs']<-as.numeric(table(V(subi_network)$Phylum)[c(rownames(plotdata)[[i]])])
  }
}

```

This looks like this:
```{r overrepresentation_obs_2}
head(plotdata)
```

Here, we calculate what taxa (i.e. OTUs assigned to a particular phylum) are overrepresented in the network. First, we can have a quick look.

```{r overrepresentation_Taxa_1}
plot(plotdata$exp, plotdata$obs);text(plotdata$exp, plotdata$obs, labels=rownames(plotdata))
```

Then, we can calculate what taxonomic groups are overrepresented. In this process, we will also correct the p-values for multiple testing.

```{r overrepresentation_Taxa_2}
plotdata$phypMore<-phyper(plotdata[,'obs'], plotdata[,'exp'], sum(plotdata$exp)-plotdata[,'exp'], sum(plotdata$obs), lower.tail = FALSE)
plotdata$phypMore<-p.adjust(plotdata$phypMore, 'BH')
plotdata$phypLess<-phyper(plotdata[,'obs'], plotdata[,'exp'], sum(plotdata$exp)-plotdata[,'exp'], sum(plotdata$obs), lower.tail = TRUE)
plotdata$phypLess<-p.adjust(plotdata$phypLess, 'BH')

plotdata$statexp<-plotdata[,'exp']/sum(plotdata[,'exp'])*sum(plotdata[,'obs'])

```
the overrepresented taxonomic groups are:
```{r overrepresentation_Taxa_3, echo=FALSE}
rownames(plotdata[plotdata$phypMore<0.05,])
```

and the underrepresented taxonomic groups are:
```{r overrepresentation_Taxa_4, echo=FALSE}
rownames(plotdata[plotdata$phypLess<0.05,])
```

## over/under representation of edges
We can also asses if there are edges that are significantle over or underrepresented compared to randomly drawing edges between OTUs

```{r overrepresentation_Edges_1}
edgdata<-data.frame(taxon1=NA, taxon2=NA, exp=NA, obs=NA, prob=NA, statexp=NA)
ttedg<-length(colnames(data_coOccur))*length(colnames(data_coOccur))

cnt=0
for(i in 1:length(colnames(data_coOccur))){
  for(j in 1:length(colnames(data_coOccur))){
    cnt=cnt+1
    tax1name=names(table(colnames(data_coOccur)))[[i]]
    tax2name=names(table(colnames(data_coOccur)))[[j]]
    
    edgdata[cnt,"taxon1"]<-tax1name
    edgdata[cnt,"taxon2"]<-tax2name
    
    totedg<-length(colnames(data_coOccur))*length(colnames(data_coOccur))
    edgdata[cnt,"exp"]<-totedg
    
    if(totedg>0){
      
      taxnames<-colnames(data_coOccur)
      OTUnames1<-intersect(rownames(netmatrix_parse), rownames(bacTAX[bacTAX$Phylum==tax1name,]))
      OTUnames2<-intersect(rownames(netmatrix_parse), rownames(bacTAX[bacTAX$Phylum==tax2name,]))
      nt_mtrx_test<- netmatrix_parse[OTUnames1,OTUnames2]
      
      
      if(length(taxnames)>0){
        obsedg<-sum(nt_mtrx_test!=0)
        edgdata[cnt,"obs"]<-obsedg
        
        edgdata[cnt,"prob"]<-as.numeric(pbinom(obsedg, totedg, obsedg/ecount(subi_network)))
        edgdata[cnt,"statexp"]<-qbinom(0.5, ecount(subi_network), totedg/ttedg)
        
      } else{
        edgdata[cnt,"prob"]<-NA
      }
      
    } else{
      edgdata[cnt,"prob"]<-NA
    }
    
  }
}
edgdata<-edgdata[complete.cases(edgdata),]
edgdata$prob<-p.adjust(edgdata$prob, 'BH')   
```


These edges are significantly over or underreperesented
```{r overrepresentation_Edges_2, echo=FALSE}
edgdata_sign<-edgdata[edgdata$prob<0.05 & edgdata$obs>0,];edgdata_sign
```