forked from rdpeng/RepData_PeerAssessment1
Processing and interpreting personal activity monitoring data
=================================================================
## Introduction
Loading and preprocessing the data: read the CSV file and convert the `date` column to the `Date` class.
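```{r}
# Use the C locale so that weekdays() returns English day names
Sys.setlocale(category = "LC_ALL", locale = "C")
# Machine-specific path; adjust it to the folder that contains activity.csv
setwd("C:/Work/bibliography/courses/2014/Data Specialization/Reproducible Research/Project-1")
# Treat the string "NA" as a missing value
ActivityData <- read.table("./activity.csv", sep = ",", header = TRUE, na.strings = "NA")
# Convert the date column to the Date class
ActivityData$date <- as.Date(ActivityData$date)
head(ActivityData, n = 10)
```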
## What is the mean total number of steps taken per day?
We do the following:

1. Sum the steps for each day using the plyr package.
2. Plot a histogram of the daily totals.
3. Calculate the mean and the median of the daily totals and mark them on the histogram.

```{r histogram, echo=TRUE}
library(plyr)
# Total steps per day; a day with only missing values gets a total of 0
df <- ddply(ActivityData, .(date), summarize, sumSteps = sum(steps, na.rm = TRUE))
hist(df$sumSteps, col = "green", breaks = 10, xlab = "Number of steps per day",
     main = "Histogram of the total number of steps taken each day")
stepsMean <- mean(df$sumSteps, na.rm = TRUE)
stepsMedian <- median(df$sumSteps, na.rm = TRUE)
# Mark the mean (red) and the median (magenta) on the histogram
abline(v = stepsMean, col = "red", lwd = 4)
abline(v = stepsMedian, col = "magenta", lwd = 4)
```
The mean value of the total number of steps taken per day is **`r stepsMean`** and the median value of the total number of steps taken per day is **`r stepsMedian`**.
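
The daily totals above are computed with `sum(steps, na.rm = TRUE)`, which returns 0 for a day whose values are all missing. The following sketch uses a small made-up data frame (`toy`, not taken from `activity.csv`) to show how counting such days as zero-step days can pull the mean down compared with leaving them out. Whether this matters for the real data depends on how the missing values are distributed.

```{r}
# Toy data (hypothetical): day "B" contains only missing values
toy <- data.frame(
  date  = c("A", "A", "B", "B", "C", "C"),
  steps = c(10, 20, NA, NA, 30, 40),
  stringsAsFactors = FALSE
)
# With na.rm = TRUE the all-NA day "B" gets a total of 0
totalsWithZeros <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)
# Alternative: drop the missing rows first, so day "B" is not summarized at all
observed <- !is.na(toy$steps)
totalsObserved <- tapply(toy$steps[observed], toy$date[observed], sum)
mean(totalsWithZeros)  # averages the totals 30, 0 and 70
mean(totalsObserved)   # averages the totals 30 and 70
```

The first mean is lower because the empty day enters the average as a zero.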
### What is the average daily activity pattern?
Make a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis)
```{r timesSeries, echo=TRUE}
library(plyr)
library(ggplot2)
# Average steps for each 5-minute interval across all days
df2 <- ddply(ActivityData, .(interval), summarize, avgSteps = mean(steps, na.rm = TRUE))
# Refer to columns by name inside aes() rather than through df2$
g <- ggplot(df2, aes(x = interval, y = avgSteps))
g + geom_line() +
  theme(panel.background = element_rect(colour = "pink")) +
  labs(x = "Interval", y = "Average steps per interval",
       title = "Time series plot of the 5-minute interval and the average number of steps")
```
### Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?
```{r echo=TRUE}
# Average steps per interval (same summary as in the previous chunk)
df2 <- ddply(ActivityData, .(interval), summarize, avgSteps = mean(steps, na.rm = TRUE))
# which.max() gives the position of the largest average (the first one if there are ties)
maxInterval <- df2$interval[which.max(df2$avgSteps)]
maxInterval
```
The **`r maxInterval`** 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps.
### Imputing missing values
```{r missingValues, echo=TRUE}
# Count the rows whose step value is missing
totalNumOfNAs <- sum(is.na(ActivityData$steps))
```
There are **`r totalNumOfNAs`** missing values in the dataset (i.e. the total number of rows with NAs).
### Filling in all of the missing values in the dataset with the average number of steps in the same interval
```{r missingvaluesRefill, echo=TRUE}
library(plyr)
# Average steps for each 5-minute interval, ignoring missing values
intervalMeans <- ddply(ActivityData, .(interval), summarize, avgSteps = mean(steps, na.rm = TRUE))
# Replace every missing value with the average of its interval
newActivityData <- ActivityData
missing <- is.na(newActivityData$steps)
newActivityData$steps[missing] <-
  intervalMeans$avgSteps[match(newActivityData$interval[missing], intervalMeans$interval)]
# Total steps per day after the missing values have been filled in
df3 <- ddply(newActivityData, .(date), summarize, sumSteps = sum(steps, na.rm = TRUE))
hist(df3$sumSteps, col = "green", breaks = 10, xlab = "Number of steps per day",
     main = "Histogram of the total number of steps taken each day")
nstepsMean <- mean(df3$sumSteps, na.rm = TRUE)
nstepsMedian <- median(df3$sumSteps, na.rm = TRUE)
# Mark the mean (red) and the median (magenta) on the histogram
abline(v = nstepsMean, col = "red", lwd = 4)
abline(v = nstepsMedian, col = "magenta", lwd = 4)
```
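
To make the imputation rule easier to follow, here is a minimal sketch on a made-up data frame (`toyActivity`, hypothetical values). It uses base R's `ave()` instead of plyr. That is a swapped-in technique for illustration, not the code used for the report.

```{r}
# Toy data (hypothetical): two intervals observed on three days
toyActivity <- data.frame(
  interval = c(0, 5, 0, 5, 0, 5),
  steps    = c(2, NA, 4, 10, NA, 20)
)
# ave() returns, for every row, the mean of that row's interval group
intervalMean <- ave(toyActivity$steps, toyActivity$interval,
                    FUN = function(x) mean(x, na.rm = TRUE))
# Copy the data and overwrite only the missing entries
toyFilled <- toyActivity
isMissing <- is.na(toyActivity$steps)
toyFilled$steps[isMissing] <- intervalMean[isMissing]
toyFilled$steps
```

Interval 0 has the observed values 2 and 4, so its missing entry becomes 3. Interval 5 has 10 and 20, so its missing entry becomes 15.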
### What is the mean total number of steps taken per day with the missing values filled in?
The mean value of the total number of steps taken per day is **`r nstepsMean`** and the median value of the total number of steps taken per day is **`r nstepsMedian`**.
## Are there differences in activity patterns between weekdays and weekends?
Create a new factor variable in the dataset with two levels **weekday** and **weekend** indicating whether a given date is a weekday or weekend day.
```{r weekdays, echo=TRUE}
# weekdays() works on the whole date column at once, so no loop is needed
isWeekend <- weekdays(as.Date(newActivityData$date)) %in% c("Saturday", "Sunday")
newActivityData$typeOfDay <- factor(ifelse(isWeekend, "Weekend", "Weekday"))
```
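
The classification above relies on English day names, which is why the locale was set to `C` when the data was loaded. As a hedged alternative, the sketch below classifies a few hand-picked dates (`toyDates`, chosen only for illustration) by weekday number, which does not depend on the locale.

```{r}
# Four consecutive dates: a Friday, a Saturday, a Sunday and a Monday
toyDates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))
# The wday field of POSIXlt counts days from 0 (Sunday) to 6 (Saturday)
dayNumber <- as.POSIXlt(toyDates)$wday
toyType <- factor(ifelse(dayNumber %in% c(0, 6), "Weekend", "Weekday"))
toyType
```

The result has the same two levels, `Weekday` and `Weekend`, as the `typeOfDay` column.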
Make a panel plot containing a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis).
```{r plots, echo=TRUE}
library(plyr)
library(ggplot2)
# Average (not total) steps per interval, separately for weekdays and weekends
df4 <- ddply(newActivityData, .(interval, typeOfDay), summarize, avgSteps = mean(steps, na.rm = TRUE))
g <- ggplot(df4, aes(x = interval, y = avgSteps))
g + geom_line() + facet_grid(typeOfDay ~ .) +
  theme(panel.background = element_rect(colour = "pink")) +
  labs(x = "Interval", y = "Average steps per interval",
       title = "Time series plot of the 5-minute interval and the average number of steps")
```