forked from rdpeng/RepData_PeerAssessment1
# Reproducible Research: Peer Assessment 1
## Loading and preprocessing the data
This assignment utilizes the source data file from:
https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip
Although the file is already available within the repository, the code first checks whether `activity.csv` exists. If it does not, the code:
- Downloads the zip file to a local copy whose name includes **today's date**.
- Unzips the file to extract `activity.csv`.
The csv is then read into a data frame with `read.csv()`, and the date strings are converted to `Date` objects for later manipulation.
```{r}
options(scipen=999) # Disable scientific notation in output
if (!file.exists("activity.csv")) {
  zipfileurl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip"
  zipfilename <- paste0("rep-data-activity-",strftime(Sys.time(),"%Y%m%d"),".zip")
  # mode="wb" writes the zip as binary so it is not corrupted on Windows;
  # the default download method handles https in current versions of R
  download.file(url=zipfileurl,destfile=zipfilename,mode="wb")
  unzip(zipfile=zipfilename)
}
activity <- read.csv("activity.csv",stringsAsFactors=FALSE)
activity$date <- as.Date(activity$date)
```
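Before any analysis, it can help to confirm that the loaded data frame has the expected shape. The following is a minimal sketch on a toy data frame, not part of the original analysis: `check_activity` is a hypothetical helper, and the expected column names (`steps`, `date`, `interval`) are taken from how the columns are used later in this report.

```r
# Hypothetical helper: verify that an activity data frame has the columns
# used later in this report and that `date` has been converted to Date.
check_activity <- function(df) {
  required <- c("steps", "date", "interval")
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("missing columns: ", paste(missing_cols, collapse = ", "))
  }
  if (!inherits(df$date, "Date")) {
    stop("`date` must be of class Date; convert it with as.Date()")
  }
  invisible(TRUE)
}

# Toy data standing in for activity.csv (values are made up)
toy <- data.frame(
  steps = c(NA, 0, 12),
  date = c("2012-10-01", "2012-10-02", "2012-10-02"),
  interval = c(0, 0, 5),
  stringsAsFactors = FALSE
)
toy$date <- as.Date(toy$date)  # ISO "YYYY-MM-DD" strings parse without a format
check_activity(toy)
```

Calling `check_activity(activity)` right after the loading chunk would stop the report early if the file layout ever changed.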
## What is mean total number of steps taken per day?
Create a new dataframe for the steps per day with the `ddply` from the `plyr` package. The results are a summarized dataframe with the total steps per day.
```{r}
library(plyr)
stepsperday <- ddply(activity, .(date),summarise,steps=sum(steps))
```
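Because `sum(steps)` is called without `na.rm`, any day that contains a missing value gets a total of `NA`. The sketch below reproduces that behavior on toy data with base R's `aggregate`, as an alternative to `ddply` rather than the method used in this report. One difference to be aware of: the formula interface of `aggregate` drops rows with `NA` by default, so `na.action = na.pass` is needed to keep the missing days.

```r
# Toy data (made up): the first day is entirely missing, the second is complete
toy <- data.frame(
  date = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")),
  steps = c(NA, NA, 10, 20)
)

# Default na.action = na.omit removes NA rows first, so the missing day vanishes
per_day_default <- aggregate(steps ~ date, data = toy, FUN = sum)

# na.pass keeps the NA rows, so the missing day survives with a total of NA
per_day <- aggregate(steps ~ date, data = toy, FUN = sum, na.action = na.pass)

nrow(per_day_default)  # 1
per_day$steps          # NA 30
```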
Plot the histogram of the number of steps per day. A bin width of (max steps)/20 was selected after various exploratory graphs.
```{r}
library(ggplot2)
ggplot(stepsperday,aes(steps)) +
  geom_histogram(binwidth=max(stepsperday$steps,na.rm=TRUE)/20) +
  labs(title="Histogram of Steps Per Day",x="Steps",y="Frequency")
```
Find the mean and median of the daily totals. Days with missing observations have a total of `NA`, so pass `na.rm=TRUE` to exclude them.
```{r}
meanperday <- mean(stepsperday$steps,na.rm=TRUE)
medianperday <- median(stepsperday$steps,na.rm=TRUE)
meanperday
medianperday
```
The mean is **`r meanperday`** and the median is **`r medianperday`**.
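Excluding the missing days with `na.rm=TRUE` is not the same as counting them as zero-step days. A minimal sketch with made-up daily totals shows how much the choice can move the mean:

```r
# Toy daily totals (made up); NA marks a day with no recorded data
daily <- c(NA, 30, 50)

mean_plain   <- mean(daily)                            # NA: the missing day propagates
mean_dropped <- mean(daily, na.rm = TRUE)              # 40: the missing day is excluded
mean_zero    <- mean(ifelse(is.na(daily), 0, daily))   # about 26.67: missing counted as 0

c(plain = mean_plain, dropped = mean_dropped, zero = mean_zero)
```

Treating missing days as zeros pulls the mean down, which is why this report excludes them here.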
## What is the average daily activity pattern?
Use `ddply` again, this time to average the steps for each interval across all days. `which.max` then locates the interval with the highest average. In the plot, the blue lines mark the maximum average and the interval where it occurs.
```{r}
stepsperinterval <- ddply(activity,.(interval),summarise,steps=mean(steps,na.rm=TRUE))
indexofmax <- which.max(stepsperinterval$steps)
maxinterval <- stepsperinterval[indexofmax,"interval"]
maxsteps <- stepsperinterval[indexofmax,"steps"]
ggplot(stepsperinterval,aes(interval,steps)) +
  geom_line() +
  geom_hline(yintercept=maxsteps,color="blue") +
  geom_vline(xintercept=maxinterval,color="blue") +
  labs(title="Average Steps Per Interval",x="Interval",y="Steps")
```
The interval with the greatest average number of steps was **`r maxinterval`** with **`r maxsteps`** steps.
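The interval identifier is easier to read as a time of day. The sketch below assumes that the identifiers encode clock time as HHMM (so 100 means 01:00); that encoding is an assumption about the dataset, and `to_clock` is a hypothetical helper. The values are made up.

```r
# Toy per-interval averages (made up); the peak is at interval 100
toy <- data.frame(interval = c(0, 5, 55, 100, 105),
                  steps = c(2, 0, 150, 200, 180))

# which.max returns the position of the first maximum, not the value itself
idx <- which.max(toy$steps)
peak_interval <- toy$interval[idx]

# Assumption: interval labels encode clock time as HHMM
to_clock <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

to_clock(peak_interval)  # "01:00"
```

Under the same assumption, the labels jump from 55 to 100 at each hour boundary, so an x-axis that plots the raw identifier is not evenly spaced in time.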
## Imputing missing values
First, find the number of observations with missing values (**`NA`**):
```{r}
missingcount <- sum(!complete.cases(activity))
```
With `r missingcount` missing observations, let's see if there is a pattern. Summarising the missing values by date provides:
```{r}
missingintervalcount <- ddply(activity,.(date),summarise,missing=sum(is.na(steps)))
missingintervalcount[missingintervalcount$missing>0,]
```
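To interpret the per-date counts, it helps to know how many five-minute intervals a full day contains. This sketch does the arithmetic and checks, on made-up counts, whether every date is either complete or entirely missing:

```r
# Five-minute intervals in a 24-hour day
intervals_per_day <- 24 * 60 / 5   # 288

# Toy missing-per-date counts (made up): each day is either complete or empty
missing_by_date <- c(day_a = 288, day_b = 0, day_c = 288)

# TRUE when every date with missing data is missing all of its intervals
whole_days_only <- all(missing_by_date %in% c(0, intervals_per_day))
whole_days_only
```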
A day contains 288 five-minute intervals, and each date in the summary is missing all 288, so the data is missing in whole days rather than in scattered intervals. Because individual movement patterns tend to be cyclical by week, a reasonable approach is to impute each missing interval with the average for the same interval on the same day of the week. First, add the day of the week as a new `day` column and summarize the missing steps by day of the week.
```{r}
activity$day <- weekdays(activity$date)  # call the generic rather than the S3 method directly
ddply(activity,.(day),summarise,missing=sum(is.na(steps)))
```
This shows that at most 2 days are missing for any given day of the week, and that Tuesday has no missing data. Next, create a new data frame with the mean for each interval on each day of the week, then merge it into the new `imputed` data frame to fill in the missing values.
```{r}
weekdaymeans <- ddply(activity,.(day,interval),summarise,meanstepweekday=mean(steps,na.rm=TRUE))
imputed <- merge(activity,weekdaymeans,by=c("day","interval"))
imputed$steps[is.na(imputed$steps)] <- imputed$meanstepweekday[is.na(imputed$steps)]
imputed$meanstepweekday <- NULL
```
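The same group-mean imputation can be sketched without a merge. The example below uses base R's `ave` on made-up data; it is an alternative to the `merge` approach used in this report, not a description of it. One practical difference is that `merge` sorts its result by the key columns by default, while `ave` leaves the rows in their original order.

```r
# Toy data (made up): three Mondays and two Tuesdays, one interval each
toy <- data.frame(
  day = c("Monday", "Monday", "Monday", "Tuesday", "Tuesday"),
  interval = c(0, 0, 0, 0, 0),
  steps = c(10, 20, NA, 4, NA)
)

# Mean of the observed steps within each (day, interval) group,
# returned as a vector aligned with the rows of `toy`
group_mean <- ave(toy$steps, toy$day, toy$interval,
                  FUN = function(x) mean(x, na.rm = TRUE))

# Fill only the missing entries; observed values are left untouched
na_rows <- is.na(toy$steps)
toy$steps[na_rows] <- group_mean[na_rows]

toy$steps  # 10 20 15 4 4
```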
Now recompute the histogram, mean, and median with the `imputed` activity data.
```{r}
imputedstepsperday <- ddply(imputed, .(date),summarise,steps=sum(steps))
library(ggplot2)
ggplot(imputedstepsperday,aes(steps)) +
  geom_histogram(binwidth=max(imputedstepsperday$steps,na.rm=TRUE)/20) +
  labs(title="Histogram of Steps Per Day (imputed data)",x="Steps",y="Frequency")
```
```{r}
imputedmeanperday <- mean(imputedstepsperday$steps)
imputedmedianperday <- median(imputedstepsperday$steps)
imputedmeanperday
imputedmedianperday
```
The mean of the imputed data is **`r imputedmeanperday`** and the median is **`r imputedmedianperday`**. The original dataset, with `NA` values removed, had a mean of **`r meanperday`** and a median of **`r medianperday`**, so both the mean and the median increased by a small amount after imputing the data with this method.
## Are there differences in activity patterns between weekdays and weekends?
Create a new `daytype` variable to categorize weekday versus weekend and use this to facet a plot to find patterns among intervals. Lines are added to represent the mean and max of the respective sets of weekend versus weekday data to highlight differences.
```{r}
imputed$daytype <- as.factor(ifelse(weekdays(imputed$date) %in% c("Saturday","Sunday"), "weekend", "weekday"))
stepsperimpinterval <- ddply(imputed,.(interval,daytype),summarise,steps=mean(steps,na.rm=TRUE))
# Mean and max per day type for the reference lines; stat="hline" was removed in ggplot2 2.0.0
daytypestats <- ddply(stepsperimpinterval,.(daytype),summarise,meansteps=mean(steps),peaksteps=max(steps))
ggplot(stepsperimpinterval,aes(interval,steps)) +
  geom_line() +
  facet_wrap(~ daytype,nrow=2) +
  geom_hline(data=daytypestats,aes(yintercept=meansteps),linetype="dotted") +
  geom_hline(data=daytypestats,aes(yintercept=peaksteps),linetype="dotted") +
  labs(title="Average Steps Per Interval",x="Interval",y="Steps")
```
The plot shows distinct differences between weekdays and weekends. On weekdays, more steps are taken in the early part of the day and fewer in the later part than on weekends. Weekdays have a higher maximum number of steps per interval but a lower average across the entire day.
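One caveat about the weekday/weekend split: `weekdays()` returns day names in the current locale, so comparing against `"Saturday"` and `"Sunday"` only works in an English locale. A minimal sketch of a locale-independent alternative follows; `day_type` is a hypothetical helper, not part of the original analysis, and it relies on the `wday` field of `POSIXlt`, which is 0 for Sunday through 6 for Saturday.

```r
# Hypothetical helper: classify dates without relying on locale-specific
# day names. POSIXlt's wday field is 0 for Sunday through 6 for Saturday.
day_type <- function(dates) {
  wday <- as.POSIXlt(dates)$wday
  factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
         levels = c("weekday", "weekend"))
}

# Friday, Saturday, Sunday, Monday
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))
day_type(dates)  # weekday weekend weekend weekday
```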