-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathmissing.qmd
102 lines (53 loc) · 4.4 KB
/
missing.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
# Handling missing data {#sec-missing}
## Questions
- What is missing data?
- Why is missing data a common challenge in data analysis?
- How can we identify and handle missing data effectively in R?
- What are the consequences of ignoring or mishandling missing data?
- What are the best practices for imputing missing values?
- When should I choose deletion, imputation, or alternative approaches?
## Learning Objectives
- Define the main types of missing data.
- Identify missing data in R and assess its impact on analyses.
- Work with datasets that contain missing data.
- Understand the importance of addressing missing data in data analysis.
- Apply best practices for handling missing data in your R projects.
- Apply functions like `is.na()` and `complete.cases()` to identify `NA` values.
## Lesson Content
### Introduction
One of the most common tasks/activities in data analysis is handling missing values. Missing data are found everywhere; therefore, it is important to learn how to identify and summarize them. Missing data can arise from different sources, including nonresponses in surveys or technical difficulties during data collection.
- Unanswered survey questions
- Data entry errors
- Errors in measurements
- Data merge issues
Missingness presents itself in various ways in R. The most common way to represent a missing value in R is with the value `NA`. Other reserved values include:
- `NULL` = neither TRUE or FALSE (`is.null()` identifies a null value and `as.null()` converts to a null value).
- `NaN` = impossible values
- `Inf` = infinite value obtained by dividing by zero (`is.infinite()` identifies an infinite value and `as.infinite()` converts to an infinite value).
Examples of the missing values
`NA` = `1/NA`
`NaN` = `0/0`
`Inf` = `10/0`
`NULL` = `5/Inf`
Common strategies for handling missing data
When investigating a dataset, there are several functions that can assist with the process. To find missing and non-missing values, the user can work with `is.na()` and `!is.na()`, respectively. If any rows contain missing values, one can use `na.omit()` or `drop_na()`. If there are missing values in the dataset, the arguments `na.rm = TRUE`, `remove_na = TRUE`, or the function `complete.cases()` can be used to remove missing values.
- Deletion: Remove rows or columns with missing values (simplest method but it can lead to bias).
- Imputation: Replace missing values with estimated values based on statistical methods (can introduce artificial data).
R has two types of missing data, `NA` and `NULL`.
Missing values can represent a fixed value like 0 (zero) or a negative number.
Missing can be `NA` or `NaN` and represented by `is.na()` or `is.nan()`
### NA
R uses `NA` to represent missing data. The `NA` appears as another element of a vector. To test each element for missingness we use `is.na()`. Generally, we can use tools such as `mi`, `mice`, and `Amelia` (which will be discussed later) to deal with missing data. The deletion of this missing data may lead to bias or data loss, so we need to be careful when we choose this option. In subsequent blog posts, we will look at the use of imputation to deal with missing data.
### NULL
NULL represents nothingness or the “absence of anything”. 2 It does not mean missing but represents nothing. NULL cannot exist within a vector because it disappears.
## Exercises
i. Identify missing values in a dataset using the `is.na()` and `complete.cases()` functions.
ii. Practice removing missing data using the `na.omit()` function.
iii. How can you handle missing values when calculating summary statistics like mean or median?
iv. If you take the mean of a vector with `NA` values, what will the result be?
v. How are missing values represented in R?
vi. What is `NA` in R? How is it different from `NULL`?
vii. What are some common reasons you may get `NA` values in a dataset?
viii. How can you check if a value is `NA` in R?
## Summary
The learner should now have a firm grasp of what missing data is and why it is a common challenge in data analysis. Additionally, after completing this chapter, the learner should know how to identify and handle missing data with various R functions. This is the final chapter in the book, and I hope you have enjoyed your learning journey so far. In the next chapter, I provide a conclusion to the book and some next steps you can take to continue on your R learning journey.