---
title: "Alina's PhD Data Analysis Notebook"
output:
  html_notebook: default
  html_document:
    df_print: paged
  pdf_document: default
---
```{r}
library(tidyverse)  # attaches dplyr, tidyr, and ggplot2, among others
```
## Setting the Stage
1. A set of 13 different tables is provided.
2. Objective: determine relationships/correlations between the variables in the tables.
## EDA
1. Reduce Variables and clean up names
2. Clean up missing Values
3. Perform Basic Visualization
4. Remove Outliers - 3 different approaches
5. Filter out outlier values based on statistical approaches
6. Hypotheses to check?
```{r echo=TRUE, message=TRUE, warning=TRUE, paged.print=TRUE}
d1 <- read.csv("~/R/Rtuts/Data/C57BL6_476_vs_C57BL6_ctl.csv")
head(d1)
```
```{r}
str(d1)
```
Remove unnecessary columns to reduce the table.
```{r}
d2 <- select(d1, -lfcSE, -stat, -symbol)
head(d2)
```
```{r}
str(d2)
```
Rename the variables to more readable names.
```{r}
d3 <- d2 %>%
  rename(CPM_476_R1 = CPM_.C57BL6_476_R1., CPM_476_R2 = CPM_.C57BL6_476_R2.,
         CPM_ctl_R1 = CPM_.C57BL6_ctl_R1., CPM_ctl_R2 = CPM_.C57BL6_ctl_R2.)
head(d3)
```
```{r}
summary(d3)
```
### Question 1:
What does a minimum CPM of 0.0 indicate? Is it a real measurement, or should it be discarded?
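One way to start on this question is to count how often a zero CPM occurs and whether it is zero in every sample or only in some. The sketch below uses a small made-up table (`toy`, with hypothetical gene names and values), not the real data. A CPM of 0 means no reads were counted for that gene in that sample, so a gene that is zero everywhere reads differently from one that is zero in a single replicate.

```r
library(dplyr)

# Hypothetical toy table standing in for d3 (all values are made up)
toy <- tibble(
  gene       = c("g1", "g2", "g3", "g4"),
  CPM_476_R1 = c(0, 12.5, 0, 3.1),
  CPM_476_R2 = c(0, 11.9, 0.4, 2.8),
  CPM_ctl_R1 = c(0, 10.2, 0, 3.3),
  CPM_ctl_R2 = c(0, 13.0, 0, 2.9)
)

# For each gene, count how many of the four CPM columns are exactly zero,
# then tabulate how many genes fall into each count
zero_summary <- toy %>%
  mutate(n_zero = rowSums(across(starts_with("CPM_")) == 0)) %>%
  count(n_zero, name = "n_genes")

zero_summary
```

Applied to `d3`, the same pipeline would show whether the zeros cluster in genes that are zero across all four samples.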
### Checking Missing Values
```{r}
glimpse(d3)
```
### Counting NAs in the CPM Columns
```{r}
d3 %>% summarise(c1 = sum(is.na(CPM_476_R1)),
                 c2 = sum(is.na(CPM_476_R2)),
                 c3 = sum(is.na(CPM_ctl_R1)),
                 c4 = sum(is.na(CPM_ctl_R2))
)
```
With all four counts at zero, the CPM columns contain no missing values. The other columns (such as `pvalue` and `padj`) are not covered by this check.
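To extend the check to every column at once, `across()` can be combined with `summarise()`. This is a minimal sketch on a made-up table (`toy`; the NA positions are hypothetical), not a statement about what the real data contains.

```r
library(dplyr)

# Hypothetical toy table; the NA positions are made up for illustration
toy <- tibble(
  baseMean   = c(10.1, 250.3, 0.8),
  pvalue     = c(0.04, NA, 0.61),
  padj       = c(0.20, NA, NA),
  CPM_476_R1 = c(1.2, 30.5, 0.1)
)

# One row, one column per variable: the number of NA values in each
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

Running the same `summarise()` call on `d3` would report the NA count for each of its columns in a single row.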
## Basic Visualization to Get a Sneak Peek at the Data
```{r}
ggplot(data = d3, mapping = aes(x = log2FoldChange, y = baseMean)) + geom_point(alpha = 0.1)
```
```{r}
ggplot(data = d3, mapping = aes(x = log2FoldChange, y = pvalue)) + geom_point(alpha = 0.1)
```
```{r}
ggplot(data = d3, mapping = aes(x = log2FoldChange, y = padj)) + geom_point(alpha = 0.1)
```
```{r}
ggplot(data = d3, mapping = aes(x = CPM_476_R1, y = CPM_ctl_R1)) + geom_point() + geom_smooth()
```
```{r}
ggplot(data = d3, mapping = aes(x = CPM_476_R2, y = CPM_ctl_R2)) + geom_point() + geom_smooth()
```
```{r}
ggplot(data = d3, mapping = aes(x = CPM_ctl_R1, y = CPM_ctl_R2)) + geom_point() + geom_smooth()
```
```{r}
ggplot(data = d3, mapping = aes(x = CPM_476_R1, y = CPM_476_R2)) + geom_point() + geom_smooth()
```
### Hence, the data has a major outlier problem: a few extreme values heavily skew the distributions.
## Removing Outliers
Computing the standard deviation on the current table gives a distorted picture, because the outliers themselves inflate it. Remove the outliers first, then compute the standard deviation to set meaningful thresholds.
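The effect is easy to see on a handful of numbers. The sketch below uses hypothetical values (nine ordinary measurements plus one extreme one) and compares the standard deviation with the median absolute deviation, which base R provides as `mad()`.

```r
# Hypothetical values: nine ordinary measurements plus one extreme value
x_clean   <- c(10, 12, 11, 13, 9, 10, 12, 11, 10)
x_outlier <- c(x_clean, 5000)

sd(x_clean)      # spread of the ordinary values
sd(x_outlier)    # dominated by the single extreme value

# The median absolute deviation stays on the scale of the ordinary values
mad(x_clean)
mad(x_outlier)
```

A threshold such as "mean plus 3 SD" computed on `x_outlier` would therefore sit far above the bulk of the data.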
A scatter plot is a good way to look for outliers across two numeric variables.
```{r}
ggplot(data = d3, mapping = aes(x = log2FoldChange, y = pvalue)) + geom_boxplot() + geom_point(alpha = 0.1, color = "blue")
```
```{r}
ggplot(data = d3, mapping = aes(x = log2FoldChange, y = pvalue)) + geom_boxplot() + geom_point(alpha = 0.009, color = "blue")
```
```{r}
ggplot(data = d3, mapping = aes(x = CPM_476_R1)) + geom_dotplot(binwidth = 400, color = "brown")
```
```{r}
ggplot(data = d3, mapping = aes(x = baseMean)) + geom_dotplot(binwidth = 8000, color = "green")
```
```{r}
ggplot(data = d3, mapping = aes(x = padj)) + geom_histogram(bins = 30, color = "green")
```
```{r}
ggplot(data = d3, mapping = aes(x = pvalue)) + geom_histogram(bins = 30, color = "green")
```
```{r}
ggplot(data = d3, mapping = aes(x = log2FoldChange)) + geom_histogram(binwidth = 0.15, color = "blue")
```
### Sorting the Data in Ascending Order
```{r}
sorted_d3 <- d3 %>% arrange(baseMean, log2FoldChange, pvalue, CPM_476_R1, CPM_476_R2, CPM_ctl_R1, CPM_ctl_R2)
sorted_d3
```
### Rows at the Tail End
```{r}
tail(sorted_d3)
```
### Question 2:
Are the other variables derived from some transformation of the CPM values? If so, filtering outliers on CPM also removes the corresponding values of the other variables and could eliminate certain genes entirely. Would that create a problem?
### Question 3:
1. Are any values in the data tabulated incorrectly?
### Question 4: Removing Outliers: More an Art than Science!
#### Which approach to pick for removing outliers?
1. Hard cut at a threshold chosen from the visualizations.
2. Based on z-scores.
3. Based on the interquartile range (IQR).
4. Based on the spread and variability of the data: 1-sigma, 2-sigma, and 3-sigma bands.
5. If the data must be kept as it is, use the median absolute deviation (MAD) instead of the standard deviation (SD), since it is a more robust measure of spread.
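The approaches in this list can be compared side by side by flagging rows rather than deleting them. This is only a sketch on made-up numbers: the table `toy`, the hard cut of 100, and the cutoffs of 3 and 1.5 are all assumptions chosen for illustration, not thresholds derived from the real data. The sigma bands of approach 4 correspond to z-score cutoffs of 1, 2, and 3.

```r
library(dplyr)

# Hypothetical CPM-like values: mostly small, with two extreme entries
toy <- tibble(
  gene = paste0("g", 1:12),
  cpm  = c(2, 3, 5, 4, 6, 3, 4, 5, 2, 4, 900, 15000)
)

flagged <- toy %>%
  mutate(
    # Approach 1: hard cut at an assumed threshold read off a plot
    out_hard = cpm > 100,
    # Approaches 2 and 4: z-score, flag values more than 3 SD from the mean
    out_z    = abs((cpm - mean(cpm)) / sd(cpm)) > 3,
    # Approach 3: IQR rule, flag values beyond 1.5 * IQR from the quartiles
    out_iqr  = cpm < quantile(cpm, 0.25, names = FALSE) - 1.5 * IQR(cpm) |
               cpm > quantile(cpm, 0.75, names = FALSE) + 1.5 * IQR(cpm),
    # Approach 5: robust z-score built from the median and the MAD
    out_mad  = abs((cpm - median(cpm)) / mad(cpm)) > 3
  )

flagged
```

On these toy values the z-score rule flags only the largest entry, because that entry inflates the SD enough to hide the second extreme value. The hard cut, the IQR rule, and the MAD rule flag both.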
### Question 5:
1. What about the `lfcSE` and `stat` variables? They were dropped earlier, but they are continuous and could be quite useful, so discarding them may waste information.
### Next Steps:
1. After removing outliers from this table, perform the same task on the other 12 tables using a common, uniform approach.
2. Then combine the 13 tables (with the relevant columns as shown here) into a single MASTER table.
3. Determine correlations between the different parameters and try to identify the genes in play.
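Steps 1 and 2 could be sketched as one cleaning function applied to a named list of tables, followed by `bind_rows()`. Everything here is hypothetical: the two toy tables, the names `cmp_A` and `cmp_B`, the helper `clean_table()`, and its `max_abs_lfc` cutoff stand in for the real 13 files and whichever outlier rule is finally chosen.

```r
library(dplyr)

# Hypothetical stand-ins for the per-comparison tables; in practice each
# element would come from read.csv() on one of the 13 files
tables <- list(
  cmp_A = tibble(gene = c("g1", "g2", "g3"), log2FoldChange = c(0.5, -1.2, 40)),
  cmp_B = tibble(gene = c("g1", "g2", "g3"), log2FoldChange = c(0.1, 0.3, -0.2))
)

# Hypothetical cleaning step, applied identically to every table
clean_table <- function(df, max_abs_lfc = 10) {
  df %>% filter(abs(log2FoldChange) <= max_abs_lfc)
}

# Clean each table, then stack them; .id records which table a row came from
master <- tables %>%
  lapply(clean_table) %>%
  bind_rows(.id = "comparison")

master
```

Keeping the source table in a `comparison` column means the master table can still be split or grouped by comparison when the correlations are computed.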
# Round 2:
```{r}
```