forked from rdpeng/RepData_PeerAssessment1
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathPA1_template.Rmd
143 lines (92 loc) · 6.08 KB
/
PA1_template.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
---
title: "Reproducible Research - Week 2 Project"
author: "Diana"
date: "23 septembre 2017"
output: md_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(lattice)
```
# Activity monitoring data
This assignment makes use of data from a personal activity monitoring device.
This device collects data at **5 minute intervals** through out the day.
The data consists of two months of data from an anonymous individual collected during the months of October and November, 2012 and include the number of steps taken in 5 minute intervals each day.
### Loading and preprocessing the data
The activity data has been downloaded from [here](https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip) and stored in the working directory.
The variables included in this dataset are:
- **steps**: Number of steps taking in a 5-minute interval (missing values are coded as NA)
- **date**: The date on which the measurement was taken in YYYY-MM-DD format
- **interval**: Identifier for the 5-minute interval in which measurement was taken
The dataset is stored in a comma-separated-value (CSV) file and there are a total of 17,568 observations in this dataset.
The following code first loads the data from the Working Directory to the **activityData** dataframe.
It then converts the **date** column of the dataframe to the Date format.
```{r activity}
activityData <- read.csv("./activity.csv")
activityData$date <- as.Date(activityData$date)
```
### What is mean total number of steps taken per day?
For now, we will ignore the missing values in this dataset.
We first calculate the total number of steps per day and contruct a histogram with the result.
```{r}
totalStepsperDay <- activityData %>% group_by(date) %>% summarize(totalSteps = sum(steps, na.rm = TRUE))
hist(totalStepsperDay$totalSteps, xlab = "Total number of steps", col = "#2b8cbe", border = "white", main = "Total number of steps per day")
```
We can also compute the mean and the median of the total number of steps taken each day.
```{r, results='hide'}
meanSteps <- format(mean(totalStepsperDay$totalSteps, na.rm = TRUE), digits = 7)
medianSteps <- format(median(totalStepsperDay$totalSteps, na.rm = TRUE), digits = 7)
```
The mean is **`r meanSteps`** and the median **`r medianSteps`**.
### What is the average daily activity pattern?
To answer this question, we first group the data by interval. Then for each interval, we compute the mean of the number of taken steps. This will give us the dataframe **avgDailyPattern**, representing the average daily pattern.
We can also compute the maximum average and the interval for which it is reached and store the coordinates in **maxInterval**.
```{r}
avgDailyPattern <- activityData %>% group_by(interval) %>% summarize(avg = mean(steps,na.rm = TRUE))
maxInterval <- filter(avgDailyPattern, avg == max(avg))
```
The following plot displays the average number of taken steps per interval, across all days.
The green point represents the maximum, which is reached for the interval **`r maxInterval$interval`**.
```{r}
with(avgDailyPattern, plot(interval, avg, type="l", xlab = "Interval", ylab = "Average number of taken steps", main = "Average Daily Pattern", col = "#2b8cbe", lwd = 2))
points(maxInterval, pch = 19, col = "#99d594")
text(x = maxInterval, labels = maxInterval$interval, col = "#99d594", adj = 1, font = 2)
```
## Imputing missing values
In this step, we are trying to replace the missing values by suitable values.
```{r}
numberNA <- count(filter(activityData, is.na(steps)))
```
There are **`r numberNA`** missing values in this dataset.
We create another dataset, called **activityDataNoNA**, which will contain the new data without NA values. We first merge the initial dataset with the previously computed **avgDailyPattern** dataset, to ensure easily retrieval of the average number of steps taken per interval. We then use this value to replace the NA ones.
```{r}
activityDataNoNA <- merge(activityData,avgDailyPattern,all = TRUE)
activityDataNoNA <- activityDataNoNA %>% mutate(steps = ifelse(is.na(steps),avg,steps))
```
Same as in the second question, we compute the total number of steps taken per day, plot a histogram and then compute the mean and median.
```{r}
totalStepsperDayNoNA <- activityDataNoNA %>% group_by(date) %>% summarize(totalSteps = sum(steps))
hist(totalStepsperDayNoNA$totalSteps, xlab = "Total number of steps", col = "#2b8cbe", border = "white", main = "Total number of steps per day")
mean(totalStepsperDayNoNA$totalSteps)
median(totalStepsperDayNoNA$totalSteps)
```
We now notice that the mean and median have changes and are equal.
## Are there differences in activity patterns between weekdays and weekends?
We want to compare the activity patterns between weekdays and weekends.
First, let's write the isWeekday() function, which returns "weekday" or "weekend", depending on the date. We then use this function to add a new column to the dataframe, representing the type of day and simply called **day**. Lastly, this column is converted to Factor.
```{r}
isWeekday <- function(day) {
if(weekdays(day) %in% c("samedi","dimanche"))
return("weekend")
else return("weekday")
}
activityDataNoNA <- activityDataNoNA %>% mutate(day = sapply(date,isWeekday))
activityDataNoNA$day <- as.factor(activityDataNoNA$day)
```
After grouping the data by type of day and then by interval, we compute the average number of steps taken per type of day and per interval. We then plot the new dataframe with the lattice system.
```{r}
avgDailyPatternTypeOfDay <- activityDataNoNA %>% group_by(day,interval) %>% summarize(avg = mean(steps))
xyplot(avg ~ interval | day, data = avgDailyPatternTypeOfDay, layout = c(1,2),type = "l", col = "#2b8cbe", lwd = 2, par.settings = list(strip.background=list(col="#99d594")), main = "Average daily pattern per type of day", ylab = "Average number of taken steps")
```
We notice that in the second half of a day, the activity is more important during weekends then during weekdays.