-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathinspecting_multidataset.qmd
414 lines (311 loc) · 13.8 KB
/
inspecting_multidataset.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
# Inspecting the `MultiDataSet` object {#sec-inspecting-multidataset}
```{r}
#| child: "_setup.qmd"
```
```{r loading-packages}
#| include: false
library(targets)
library(moiraine)
library(Biobase)
library(purrr)
library(circlize)
library(ggplot2)
```
```{r setup-visible}
#| eval: false
library(targets)
library(moiraine)
library(Biobase)
## For working with lists
library(purrr)
## For custom colour palettes
library(circlize)
## For custom colour palettes
library(ggplot2)
```
In this chapter, we will show how to inspect the `MultiDataSet` object that we created in @sec-importing-data and how to extract information from it.
As a reminder, here is what the `_targets.R` script should look like so far:
<details>
<summary>`_targets.R` script</summary>
```{r previous-vignettes}
#| echo: false
#| results: asis
knitr::current_input() |>
get_previous_content() |>
cat()
```
</details>
At this stage, our different omics datasets (and associated metadata) are stored in a `MultiDataSet` object, saved in the target called `mo_set`:
```{r print-mo-set}
tar_load(mo_set)
mo_set
```
## Querying datasets names and dimensions
The names of the omics datasets stored in a `MultiDataSet` object can be obtained with:
```{r get-datasets-names}
names(mo_set)
```
It is also possible to query the number of features and samples in each dataset via `n_features()` and `n_samples()`. Both functions return a named integer vector:
```{r n-features}
n_features(mo_set)
```
```{r n-samples}
n_samples(mo_set)
```
The feature and sample IDs for each dataset can be extracted with the `get_features()` and `get_samples()` functions. Both functions return a named list of features or samples ID for each omics dataset:
```{r get-features}
get_features(mo_set) |> str()
```
```{r get-samples}
get_samples(mo_set) |> str()
```
## Extracting datasets and metadata
We can extract the dataset matrices from a `MultiDataSet` object with the `get_datasets()` function, which returns a named list of matrices, each with features as rows and samples as columns:
```{r get-datasets}
get_datasets(mo_set) |> str()
```
```{r get-datasets-example}
get_datasets(mo_set)[["snps"]][1:5, 1:5]
```
To obtain the matrix for a single dataset from the `MultiDataSet` object, the `get_dataset_matrix()` function can be used instead:
```{r get-dataset-matrix-example}
get_dataset_matrix(mo_set, "snps")[1:5, 1:5]
```
Similarly, the functions `get_features_metadata()` and `get_samples_metadata()` each return a named list of feature or sample metadata data-frames, one per omics dataset:
```{r get-features-metadata}
get_features_metadata(mo_set) |> str()
```
```{r get-samples-metadata}
get_samples_metadata(mo_set) |> str()
```
For the samples metadata, it is possible to extract a single data-frame that combines the metadata from the different datasets with the function `get_samples_metadata_combined()`. The `only_common_cols` argument controls whether only the columns that are common to the samples metadata of the different omics datasets should be returned. For this example, as the samples metadata is identical across the datasets, it makes no difference:
```{r get-samples-metadata-combined}
get_samples_metadata_combined(mo_set) |> head()
```
## Summary plots {#sec-inspecting-multidataset-summary-plots}
A number of plotting functions have been implemented to obtain a quick overview of the omics datasets in a `MultiDataSet` object.
### Samples upset plot
First, the `plot_samples_upset()` function displays the number of common and unique samples across the datasets with an [UpSet plot](https://upset.app):
```{r plot-samples-upset}
#| fig.width: 10
#| fig.height: 7
plot_samples_upset(mo_set)
```
As can be seen in the upset plot above, 135 samples have measurements across all three omics datasets. In addition, 4 samples have both transcriptomics and metabolomics measurements, but no transcriptomics information; 3 samples are present in the genomics and metabolomics datasets but not the transcriptomics dataset, and the genomics and transcriptomics datasets each have a unique sample not present in the other omics datasets.
### Datasets density plots
Next, we can show the density plot of each omics dataset with the `plot_density_data()` function. By default, all datasets are plotted onto the same axes, which is not very useful if they have very different scales. We can change that by setting the `combined` argument to `FALSE`, which splits the plot into one facet per dataset, and by setting `scales` to `'free'` in order to give its own scale to each dataset:
```{r plot-density-data}
#| fig.width: 9
#| fig.height: 4
plot_density_data(mo_set, combined = FALSE, scales = "free")
```
By default, all datasets are represented in the density plot, but it is possible to focus on one or a subset of them via the `datasets` argument. This is useful here as the plots for the transcriptomics and metabolomics could benefit from a log10 transformation for the x-axis:
```{r plot-density-data-xlog10}
#| fig.width: 8
#| fig.height: 4
plot_density_data(
mo_set,
datasets = c("rnaseq", "metabolome"),
combined = FALSE,
scales = "free"
) +
scale_x_log10()
```
Note that as the `plot_density_data()` function returns a ggplot, it can be further customised with other `ggplot2` functions as shown above.
### Datasets mean-sd plots
It is also possible to assess for each dataset whether there exists a relationship between the features mean and standard deviation, with the `plot_meansd_data()` function. The presence of such relationship indicates that the dataset should be transformed, via a log or variance-stabilising transformation. The function requires the `hexbin` package to be installed:
```{r plot-meansd-data, eval = requireNamespace("hexbin")}
#| fig.width: 8
#| fig.height: 4
plot_meansd_data(mo_set)
```
In our case, we can see a very strong relationship between features mean and standard deviation in both the transcriptomics and metabolomics datasets, which suggest that a log or variance-stabilising transformation will be necessary in both cases (datasets transformation are covered in @sec-preprocessing).
Note that the hexplots are only drawn for datasets with at least 30 features, and the trend curve (in pink) is only drawn for datasets with at least 10 features.
## Assessing missing values
Finally, one very important aspect to check is the presence of missing values in the datasets. The function `check_missing_values()` provide a summary of the number of missing values in each dataset:
```{r check-missing-values}
#| message: true
check_missing_values(mo_set)
```
The function returns an invisible character vector containing the messages printed above, which is useful for automatic reporting.
In @sec-preprocessing, we will see how to impute missing values.
## Visualising the datasets {#sec-inspecting-plot-data}
Once the omics datasets are stored in a `MultiDataSet` object, we can easily visualise the measurements for a set of features of interest. As an example, we will randomly select three features from each of the omics datasets:
```{r select-random-features}
set.seed(32)
random_features <- get_features(mo_set) |>
map(\(x) sample(x, size = 3, replace = FALSE)) |>
unlist() |>
unname()
random_features
```
### As a heatmap
The function `plot_data_heatmap()` allows us to view the data for these features as a heatmap. It relies on the `ComplexHeatmap::Heatmap()` function, and can be customised by passing arguments to this function (for example to remove the column labels):
```{r plot-data-heatmap}
#| fig.width: 10
plot_data_heatmap(
mo_set,
random_features,
center = TRUE,
scale = TRUE,
show_column_names = FALSE
)
```
Note that we specified that the data should be centred and scaled before plotting, to represent features from different datasets on a similar scale.
By default, all samples all represented, including those that are only present in some of the omics datasets (hence the warning about columns clustering). We can instead restrict the plot to only samples that are present across all datasets (`only_common_samples` argument), or to specific samples by passing a list of samples ID to the `samples` argument:
```{r plot-data-heatmap-samples-subset}
#| fig.width: 6
plot_data_heatmap(
mo_set,
random_features,
center = TRUE,
scale = TRUE,
show_column_names = FALSE,
samples = c("O4713", "Y3660", "R5979")
)
```
We can also add samples and/or features information to the sides of the heatmap through the `samples_info` and `features_info` arguments. These two arguments take a vector of column names from the samples or features metadata table, respectively. The `ComplexHeatmap::Heatmap()` picks random colours for these annotations, but we can set specific colour palettes by passing a list of colour palettes through the argument `colours_list`. For continuous annotations, the colour palette must be generated with `circlize::colorRamp2()`.
```{r plot-data-heatmap-annotations}
#| fig.width: 10
#| fig.height: 6
plot_data_heatmap(
mo_set,
random_features,
center = TRUE,
scale = TRUE,
show_column_names = FALSE,
only_common_samples = TRUE,
samples_info = c("status", "day_on_feed"),
features_info = c("chromosome"),
colours_list = list(
"status" = c("Control" = "gold", "BRD" = "lightblue"),
"day_on_feed" = colorRamp2(c(5, 70), c("white", "pink3"))
)
)
```
We can also use information from the features metadata tables to give a more meaningful label to the features. For example, we can use the column `Name` from the transcriptomics features metadata and the column `name` from the metabolomics features metadata to label the features. This is done by passing a named list through the `label_cols` argument, where each element is the name of the column to use and the name of the element gives the name of the dataset in the MultiDataSet object. If these labels are too long, we can truncate them through the `truncate` argument (see the function help).
```{r plot-data-heatmap-labels}
#| fig.width: 10
#| fig.height: 6
plot_data_heatmap(
mo_set,
random_features,
center = TRUE,
scale = TRUE,
show_column_names = FALSE,
only_common_samples = TRUE,
samples_info = c("status", "day_on_feed"),
features_info = c("chromosome"),
colours_list = list(
"status" = c("Control" = "gold", "BRD" = "lightblue"),
"day_on_feed" = colorRamp2(c(5, 70), c("white", "pink3"))
),
label_cols = list(
"rnaseq" = "Name",
"metabolome" = "name"
),
truncate = 20
)
```
Note that because we didn't include the `snps` dataset in the list passed through `label_cols`, the ID of the features are used as labels.
### Against samples covariates
Alternatively, we can display the features' measurements against some samples covariate, with the `plot_data_covariate()` function. As for the `plot_data_heatmap()` function, the plot shows data from all samples, unless otherwise specified (through either the `common_samples_only` or `samples` arguments). The covariate is specified as a column name from the samples metadata (can be from any dataset's samples metadata). If the covariate is categorical, the function generates violin plots. For example, we can represent the feature's measurements against the animal disease status:
```{r plot-data-covariate-categorical, fig.height = 6}
plot_data_covariate(
mo_set,
"status",
random_features,
only_common_samples = TRUE
)
```
We can use other columns from the samples metadata to customise the points colour and shape. For the colour, the constructed plot will depend on whether the corresponding in categorical or numeric:
```{r plot-data-covariate-categorical-colours, fig.height = 6}
plot_data_covariate(
mo_set,
"status",
random_features,
only_common_samples = TRUE,
colour_by = "gender",
shape_by = "feedlot"
)
plot_data_covariate(
mo_set,
"status",
random_features,
only_common_samples = TRUE,
colour_by = "day_on_feed",
shape_by = "feedlot"
)
```
If instead the covariate is numerical, the function produces scatterplots with a loess curve for each feature:
```{r plot-data-covariate-numeric, fig.height = 6}
plot_data_covariate(
mo_set,
"day_on_feed",
random_features,
only_common_samples = TRUE
)
```
Again, we can use other samples information to specify the colour or shapes of the samples. Note that if the covariate used for points colour is discrete, a loess curve will be fitted for each category. If the covariate is continuous, or if changing the shape of the points, only one loess curve will be fitted for all data points.
```{r plot-data-covariate-numeric-colours, fig.height = 6}
plot_data_covariate(
mo_set,
"day_on_feed",
random_features,
only_common_samples = TRUE,
colour_by = "status",
shape_by = "status"
)
plot_data_covariate(
mo_set,
"day_on_feed",
random_features,
only_common_samples = TRUE,
colour_by = "day_on_feed",
shape_by = "feedlot"
)
```
The features can be renamed using features metadata through the `label_cols` argument in the same way that with the `plot_data_heatmap()` function:
```{r plot-data-covariate-labels}
plot_data_covariate(
mo_set,
"day_on_feed",
random_features,
only_common_samples = TRUE,
label_cols = list(
"rnaseq" = "Name",
"metabolome" = "name"
)
)
```
## Recap -- targets list
Although we didn't create any new target in this section, we can turn some plots into targets.
<details>
<summary>Targets list for inspecting a MultiDataSet object</summary>
::: {.targets-chunk}
```{targets recap-targets-list}
list(
## Creating a density plot for each dataset
tar_target(
density_plots,
plot_density_data(
mo_set,
combined = FALSE,
scales = "free"
)
),
## Plotting the relationship between features mean and standard deviation
## for each dataset
tar_target(
mean_sd_plots,
plot_meansd_data(mo_set)
),
## Assessing missing values
tar_target(
n_missing_values,
check_missing_values(mo_set)
)
)
```
:::
</details>