Alina_Analysis.Rmd

---
title: "Alina's PhD Data Analysis Notebook"
output:
  html_notebook: default
  html_document:
    df_print: paged
  pdf_document: default
---

```{r}

library(tidyverse)
library(tidyr)
library(dplyr)
```

## Setting the Stage

1. A set of 13 different tables provided
2. Objective: To determine relationships/correlations between different variables of the table.  


## EDA

1. Reduce Variables and clean up names
2. Clean up missing Values
3. Perform Basic Visualization
4. Remove Outliers - 3 different approaches
4. Filter out outlier values based on statistical approaches
5. Hypothesis to check?

```{r echo=TRUE, message=TRUE, warning=TRUE, paged.print=TRUE}
d1 <- read.csv("~/R/Rtuts/Data/C57BL6_476_vs_C57BL6_ctl.csv")
head(d1)
```
```{r}
str(d1)
```

Removing unneccesary columns to reduce the table!

```{r}
d2 <- select(d1, -lfcSE, -stat, -symbol)
head(d2)
```

```{r}
str(d2)
```

 Rename the Variable names into more readable ones!
```{r}
d3 <- d2 %>% rename(CPM_476_R1 = CPM_.C57BL6_476_R1.) %>% rename(CPM_476_R2 = CPM_.C57BL6_476_R2.) %>% rename(CPM_ctl_R1 = CPM_.C57BL6_ctl_R1.)  %>% rename(CPM_ctl_R2 = CPM_.C57BL6_ctl_R2.)
head(d3)
```




```{r}
summary(d3)
```


### Question1 :

What does the value 0.0 in the min for CPM indicate? Is it a real value or worthy to be discarded?

### Checking Missing Values

```{r}
glimpse(d3)
```


###
```{r}
d3 %>% summarise(c1 = sum(is.na(CPM_476_R1)),
                 c2 = sum(is.na(CPM_476_R2)),
                 c3 = sum(is.na(CPM_ctl_R1)),
                 c4 = sum(is.na(CPM_ctl_R1))
                 )
```

Therefore, there are no missing values in the table.


## Basic Visualization to get a sneak peak at Data!

```{r}
ggplot(data = d3, mapping = aes(x = log2FoldChange, y = baseMean)) + geom_point(alpha = 0.1)

```




```{r}
ggplot(data = d3, mapping = aes(x = log2FoldChange, y = pvalue)) + geom_point(alpha = 0.1)
```


```{r}
ggplot(data = d3, mapping = aes(x = log2FoldChange, y = padj)) + geom_point(alpha = 0.1)
```



```{r}
ggplot(data = d3, mapping = aes(x = CPM_476_R1, y = CPM_ctl_R1)) + geom_point() + geom_smooth()
```

```{r}
ggplot(data = d3, mapping = aes(x = CPM_476_R2, y = CPM_ctl_R2)) + geom_point() + geom_smooth()
```


```{r}
ggplot(data = d3, mapping = aes(x = CPM_ctl_R1, y = CPM_ctl_R2)) + geom_point() + geom_smooth()
```



```{r}
ggplot(data = d3, mapping = aes(x = CPM_476_R1, y = CPM_476_R2)) + geom_point() + geom_smooth()
```


### Hence, there exists major issue with "Outliers" in data, making the data skewed on the wrong end!

## Removing Outliers


Calculating Standard Deviation from current table gives misplaced picture. Hence, remove outliers and then calculate standard deviation to place meaningful thresholds!

The best way to look for outliers in two numeric variables is using a scatter plot.

```{r}
ggplot(data = d3, mapping = aes(x = log2FoldChange, y = pvalue)) + geom_boxplot() + geom_point(alpha = 0.1, color = "blue")
```


```{r}
ggplot(data = d3, mapping = aes(x = log2FoldChange, y = pvalue)) + geom_boxplot() + geom_point(alpha = 0.009, color = "blue")
```
```{r}
ggplot(data = d3, mapping = aes(x = CPM_476_R1)) + geom_dotplot(binwidth = 400, color = "brown") 
```


```{r}
ggplot(data = d3, mapping = aes(x = baseMean)) + geom_dotplot(binwidth = 8000, color = "green")
```

```{r}
ggplot(data = d3, mapping = aes(x = padj)) + geom_histogram( color = "green")

```
```{r}
ggplot(data = d3, mapping = aes(x = pvalue)) + geom_histogram( color = "green")
```

```{r}
ggplot(data = d3, mapping = aes(x = log2FoldChange)) + geom_histogram(binwidth = 0.15, color = "blue")
```

### Sorting the Data : in Ascending

```{r}
sorted_d3 <- d3 %>% arrange(baseMean, log2FoldChange, pvalue, CPM_476_R1, CPM_476_R2, CPM_ctl_R1, CPM_ctl_R2)
sorted_d3
```

### Rows at tail-end
```{r}
tail(sorted_d3)
```

### Question 2: 
The values of other variables are derived from a (some particular) transformation of CPM values? If yes, then filtering outliers from CPM will also remove values from other variables and could lead elimination of certain genes. Would that create a problem?


### Question 3: 
1. Are there any wrongly tabulated or mistakenly tabulated data? 

### Question 4: Removing Outliers: More an Art than Science!
#### Which approach to pick for removing outliers? 
1. Hard cut as a threshold based on visualization.
2. Based on Z-scores.
3. Based on IQR.
4. Based on Spread and variability of data : 1sigma, 2 sigma and 3 sigma spread.
5. If one needs to keep the data as it is, one can use MAD instead of SD -> More robust statistical measure!


### Question 4:
1. what about lfcse and stat variables? Cannot waste so much data, could be quite useful (continous) data!

### Next Steps:

1. After Removing outliers for this table, perform the same task for other 12 tables as well based on a common uniform approach.
2. Then combine the 13 tables (with relevant columns as shown here) into a single table as the MASTER table.
3. Determine correlations between different parameters and try to identify the genes in play.




# Round2:


```{r}

```