# Data Structures (Part II) {#sec-data-structure-2}
## Questions
- Which data structures can be used in R?
- How do we define the various data structures?
- What operations can be performed on the different data structures?
- How do these data structures differ from one another in terms of storage and functionality?
## Learning Objectives
- Learn about data structures in R, specifically matrices, arrays, and factors.
- Define and manipulate matrices, arrays, and factors.
- Perform common operations on each data structure.
- Identify and differentiate between the diverse data structures offered by R.
- Choose the appropriate data structure based on the type and organization of your data.
- Understand the strengths and limitations of each structure for efficient analysis.
## Lesson Content
### Introduction
Data structures in R are the formats used to organize, process, retrieve, and store data. Choosing a suitable structure makes the stored data easier to use effectively. Data structures differ in their number of dimensions and in whether they hold a single data type (homogeneous) or mixed data types (heterogeneous).
The primary data structures are:
- Vectors
- Lists
- Data frames
- Matrices
- Arrays
- Factors
In this lesson, we will review the second set of data structures: matrices, arrays, and factors.
### Matrices
A matrix is a rectangular, two-dimensional (2D), homogeneous data structure arranged in a fixed number of rows and columns. All elements must be of the same type, most often numeric, although character and logical matrices are also possible. Matrices are widely used in mathematical and statistical applications.
a. Creation of matrices
Using the `matrix()` function
Format: `matrix(data, nrow = _, ncol = _)`, where `data` is the vector of values (for example a range such as `1:9`)
```{r}
m1 <- matrix(1:9, nrow = 3, ncol = 3)
m2 <- matrix(21:29, nrow = 3, ncol = 3)
m3 <- matrix(1:12, nrow = 2, ncol = 6)
m1
m2
m3
```
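By default, `matrix()` fills the values column by column. The short sketch below, which uses the illustrative names `m_bycol` and `m_byrow` (not part of the lesson), shows how the `byrow = TRUE` argument fills row by row instead.

```{r}
# Default: values fill down each column first
m_bycol <- matrix(1:6, nrow = 2, ncol = 3)
# byrow = TRUE: values fill across each row first
m_byrow <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m_bycol
m_byrow
```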
Using `cbind()` and `rbind()`
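```{r}
# Illustrative vectors, assumed for this example
v1 <- c(1, 2, 3)
v2 <- c(4, 5, 6)
# cbind() joins the vectors as columns (3 rows, 2 columns)
m_cols <- cbind(v1, v2)
# rbind() joins the vectors as rows (2 rows, 3 columns)
m_rows <- rbind(v1, v2)
m_cols
m_rows
```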
b. Obtain the dimensions of the matrices `m1` and `m3`
```{r}
nrow(m1)
ncol(m1)
dim(m1)
```
```{r}
nrow(m3)
ncol(m3)
dim(m3)
```
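c. Arithmetic with matrices
The operators below work element by element on matrices with the same dimensions. The `==` operator compares corresponding elements and returns a logical matrix.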
```{r}
m1 + m2
m1 - m2
m1 * m2
m1 / m2
m1 == m2
```
d. Matrix multiplication
```{r}
m5 <- matrix(1:10, nrow = 5)
m6 <- matrix(43:34, nrow = 5)
m5
m6
```
```{r}
m5 * m6
```
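Knowledge of dimensions is very important when working with matrices. The `*` operator above multiplies element by element, which works because `m5` and `m6` have the same dimensions. True matrix multiplication uses `%*%` and requires the number of columns of the first matrix to equal the number of rows of the second, so `m5 %*% m6` is not possible, as the dimensions show: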
```{r}
dim(m5)
dim(m6)
```
```{r}
# m5 %*% m6  # Not run: fails with "non-conformable arguments" because both matrices are 5 x 2
```
The matrix `m6` needs to be transposed with `t()` before multiplication.
Transpose
```{r}
dim(m5)
dim(t(m6))
```
Now we have a 5 by 2 matrix that can be multiplied by a 2 by 5 matrix, giving a 5 by 5 result.
```{r}
m5 %*% t(m6)
```
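The dimension rule for `%*%` can be checked before multiplying. The following is a minimal sketch using small illustrative matrices `a` and `b`, which are not part of the lesson.

```{r}
a <- matrix(1:6, nrow = 2, ncol = 3)   # 2 x 3
b <- matrix(1:12, nrow = 3, ncol = 4)  # 3 x 4
# The matrices are conformable when ncol(a) equals nrow(b)
ncol(a) == nrow(b)
# The product has nrow(a) rows and ncol(b) columns
prod_ab <- a %*% b
dim(prod_ab)
```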
e. Generate an identity matrix
```{r}
diag(5)
```
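Multiplying a matrix by an identity matrix of matching size leaves it unchanged, which is a quick way to see what `diag()` produces. A small sketch with an illustrative matrix `sq` (an assumed name, not from the lesson):

```{r}
sq <- matrix(1:9, nrow = 3)   # illustrative 3 x 3 matrix
id3 <- diag(3)                # 3 x 3 identity matrix
# all.equal() compares the numeric values within a small tolerance
all.equal(sq %*% id3, sq)
```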
f. Column and row names
Dimensions of `m5`
```{r}
dim(m5)
```
Current column names of `m5`
```{r}
colnames(m5)
```
Set column names
```{r}
colnames(m5) <- c("AA", "BB")
```
Display the matrix `m5` and new column names
```{r}
m5
colnames(m5)
```
Set row names
Get the dimensions of `m6`
```{r}
dim(m6)
```
Display the current row names of `m6`
```{r}
rownames(m6)
```
Set the row names of `m6`
```{r}
rownames(m6) <- c("ABC", "BCA", "CAB", "ACB", "BAC")
```
```{r}
m6
rownames(m6)
```
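Row and column names can also be assigned in one step with `dimnames()`, and named rows and columns can then be selected by name. A sketch using an illustrative matrix `scores`, with names made up for the example:

```{r}
scores <- matrix(1:6, nrow = 3)
# dimnames() takes a list: row names first, then column names
dimnames(scores) <- list(c("r1", "r2", "r3"), c("AA", "BB"))
scores
# Select a single element by row name and column name
scores["r2", "BB"]
```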
g. Arithmetic within matrices (column and row sums)
```{r}
colSums(m5)
```
```{r}
rowSums(m6)
```
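`colSums()` and `rowSums()` have mean counterparts, and `apply()` extends the same idea to other summary functions. A brief sketch on an illustrative matrix `mm`, which has the same shape as `m5` but is defined here so the example stands alone:

```{r}
mm <- matrix(1:10, nrow = 5)   # 5 x 2, illustrative
colMeans(mm)                   # mean of each column
rowMeans(mm)                   # mean of each row
# apply(x, 1, f) applies f to each row; apply(x, 2, f) to each column
apply(mm, 2, max)
```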
h. Subsetting matrices
```{r}
subset_m5 <- m5[1:4, 1:2]
subset_m5
```
```{r}
subset_m6 <- m6[3:5, 1:2]
subset_m6
```
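Selecting a single row or column returns a plain vector by default. If the result should remain a matrix, `drop = FALSE` keeps the dimensions. A sketch with an illustrative matrix `mm2` (an assumed name):

```{r}
mm2 <- matrix(1:10, nrow = 5)
col_vec <- mm2[, 2]                 # integer vector of length 5
col_mat <- mm2[, 2, drop = FALSE]   # 5 x 1 matrix
dim(col_vec)                        # NULL: a vector has no dim attribute
dim(col_mat)
# Negative indices exclude rows or columns
mm2[-1, ]
```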
i. Matrix division
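The `/` operator divides element by element. Below, `m5` and `m6` have the same dimensions, so each element of `m5` is divided by the corresponding element of `m6`. When a matrix is divided by a vector instead, the vector is recycled down the columns, so a vector with one value per row divides each row by its matching value.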
```{r}
m5 / m6
```
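The matrix-by-vector case can be sketched as follows, using the illustrative objects `x`, `row_div`, and `col_div` (none of which appear in the lesson). To divide each column by its own value, `sweep()` is one option.

```{r}
x <- matrix(c(2, 4, 6, 8, 10, 12), nrow = 2)   # 2 x 3
row_div <- c(2, 4)       # one divisor per row
col_div <- c(2, 2, 10)   # one divisor per column
# Recycling down the columns: row 1 is divided by 2, row 2 by 4
x / row_div
# sweep() with MARGIN = 2 divides each column by its own divisor
sweep(x, 2, col_div, "/")
```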
### Arrays
An array is a multidimensional vector that stores homogeneous data. It can be thought of as a stack of matrices and can store data in more than two dimensions (n-dimensional); a matrix is the special case of an array with two dimensions. A three-dimensional array is described as rows by columns by matrices (layers).
Example: an array with `dim = c(2, 3, 3)` has 2 rows, 3 columns, and 3 matrices.
a. Creating arrays
```{r}
arr_1 <- array(1:12, dim = c(2,3,2))
```
```{r}
arr_1
```
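The dimensions of an array can be checked with `dim()`, and the `dimnames` argument can label each dimension. The sketch below builds the 2 by 3 by 3 array described in the example above; the name `arr_2` and the labels are made up for illustration.

```{r}
arr_2 <- array(1:18, dim = c(2, 3, 3),
               dimnames = list(c("r1", "r2"),
                               c("c1", "c2", "c3"),
                               c("k1", "k2", "k3")))
dim(arr_2)                 # rows, columns, matrices
arr_2["r2", "c3", "k1"]    # one element, selected by name
length(arr_2)              # total number of elements: 2 * 3 * 3
```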
b. Subset an array by index
```{r}
arr_1[1 , , ]
```
```{r}
arr_1[1, ,1]
```
```{r}
arr_1[, , 1]
```
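Each index position corresponds to one dimension, so leaving a position empty keeps that whole dimension. Summaries across a dimension can be computed with `apply()`. A sketch on an illustrative array `arr_demo`, defined here with the same values as `arr_1` so the example stands alone:

```{r}
arr_demo <- array(1:12, dim = c(2, 3, 2))
# Sum of all elements within each of the two matrices (third dimension)
apply(arr_demo, 3, sum)
# A single element: row 2, column 3, second matrix
arr_demo[2, 3, 2]
```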
### Factors
Factors store categorical data, that is, values drawn from a fixed set of categories called levels. Internally, a factor holds integer codes together with the level labels. This form of storage is useful for statistical modelling and for handling qualitative data. Examples include TRUE or FALSE and male or female.
**NOTE**: The `forcats` package provides additional tools for handling categorical variables. This will be discussed in a subsequent series focusing on the `tidyverse`.
```{r}
vector <- c("Male", "Female")
```
```{r}
factor_1 <- factor(vector)
factor_1
```
OR
```{r}
factor_2 <- as.factor(vector)
factor_2
```
```{r}
as.numeric(factor_2)
```
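By default, the levels of a factor built from a character vector are sorted alphabetically, so the integer codes follow that order. The `levels` argument of `factor()` sets the order explicitly. A sketch with an illustrative vector `sizes` (assumed for this example):

```{r}
sizes <- c("medium", "small", "large", "small")
# Specify the level order instead of the alphabetical default
size_f <- factor(sizes, levels = c("small", "medium", "large"))
levels(size_f)
# Integer codes follow the level order
as.integer(size_f)
# Count the observations in each level
table(size_f)
```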
**NOTE**: To view the internal structure of the data structures described above, the learner can use the `str()` function.
Example
```{r}
str(m5)
```
```{r}
str(arr_1)
```
```{r}
str(factor_1)
```
## Exercises
- How do you create a matrix in R? Give an example.
- What are the column and row dimensions of a matrix called in R?
- How do you access elements in a matrix?
- Generate a 3x3 identity matrix using `matrix()`.
- Write code to create a 3x3 matrix with sequential numeric values.
- Convert a character vector to a factor with 3 levels.
- Describe the purpose of factors in R.
- How do you create a factor variable with specified levels?
- What is an array, and how does it differ from a matrix in R?
- Provide an example of creating a three-dimensional array.
## Summary
In this chapter, we have completed our review of data structures. These data structures have different properties that influence how they are used in various computational tasks. Additionally, we have looked at the strengths and limitations of each structure for data analysis. However, we haven’t discussed an important aspect of data structures: what do we do with missing data? In the final chapter, we will tackle this issue and provide solutions.