Processing and interpreting the personal activity monitoring data
=================================================================

## Introduction

Loading and preprocessing the data: read the data from the CSV file and convert the `date` column to the `Date` class.
```{r}
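# Use the "C" locale so that weekdays() returns English day names later on
Sys.setlocale(category = "LC_ALL", locale = "C")
# Machine-specific path: adjust it to the folder that contains activity.csv
setwd("C:/Work/bibliography/courses/2014/Data Specialization/Reproducible Research/Project-1")
ActivityData <- read.table("./activity.csv", sep = ",", header = TRUE, na.strings = "NA")
# Convert the date column from text to the Date class
ActivityData$date <- as.Date(ActivityData$date)
head(ActivityData, n = 10)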
```

## What is mean total number of steps taken per day?

We do the following:

1. Sum the steps per day using the plyr package.

2. Plot a histogram of the daily totals.

3. Calculate the mean and the median of the daily totals and mark them on the histogram.
```{r histogram, echo=TRUE}
library(plyr)
# Total steps per day (a day with only missing values sums to 0)
df <- ddply(ActivityData, .(date), summarize, sumSteps = sum(steps, na.rm = TRUE))
hist(df$sumSteps, col = "green", breaks = 10,
     xlab = "Number of steps per day",
     main = "Histogram of the total number of steps taken each day")
stepsMean <- mean(df$sumSteps, na.rm = TRUE)
stepsMedian <- median(df$sumSteps, na.rm = TRUE)
# Mark the mean (red) and the median (magenta) on the histogram
abline(v = stepsMean, col = "red", lwd = 4)
abline(v = stepsMedian, col = "magenta", lwd = 4)
```

The mean value of the total number of steps taken per day is **`r stepsMean`** and the median value of the total number of steps taken per day is **`r stepsMedian`**.
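
One detail worth checking in the chunk above: `sum(steps, na.rm = TRUE)` returns 0 for a day on which every observation is missing, so such days enter the histogram, the mean, and the median as zero-step days. The following is a minimal sketch on made-up data (the `toy` data frame and its values are hypothetical, not taken from `activity.csv`) that shows the effect and one possible alternative, which keeps all-missing days as `NA` so they can be excluded.

```r
# Hypothetical toy data: the first day is fully observed, the second is entirely missing
toy <- data.frame(
  date  = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")),
  steps = c(10, 20, NA, NA)
)

# Same idea as the ddply() call above: an all-NA day sums to 0
totalsWithZeros <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

# Alternative (an assumption, not the method used in this report):
# keep all-NA days as NA so they can be dropped from the summary
totalsWithNA <- tapply(toy$steps, toy$date, function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
})

mean(totalsWithZeros)             # 15: the empty day counts as 0 steps
mean(totalsWithNA, na.rm = TRUE)  # 30: the empty day is ignored
```

Which of the two is appropriate depends on whether a day without any recorded data should be read as "no steps" or as "unknown".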

## What is the average daily activity pattern?
Make a time series plot (i.e. type = "l") with the 5-minute interval on the x-axis and the number of steps taken in that interval, averaged across all days, on the y-axis.

```{r timesSeries, echo=TRUE}
library(plyr)
library(ggplot2)
# Average steps per 5-minute interval across all days
df2 <- ddply(ActivityData, .(interval), summarize, avgSteps = mean(steps, na.rm = TRUE))
# Refer to columns by name inside aes() rather than with df2$...
g <- ggplot(df2, aes(x = interval, y = avgSteps))
g + geom_line() +
  theme(panel.background = element_rect(colour = "pink")) +
  labs(x = "Interval", y = "Average steps per interval",
       title = "Time series plot of the 5-minute interval and the average number of steps")
```

### Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?
```{r echo=TRUE}
# df2 (average steps per interval) was computed in the previous chunk
maxAvgSteps <- max(df2$avgSteps, na.rm = TRUE)
# Interval(s) whose average equals the maximum
maxInterval <- df2$interval[df2$avgSteps == maxAvgSteps]
maxInterval
```
On average across all the days in the dataset, the 5-minute interval **`r maxInterval`** contains the maximum number of steps.
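
There are two common ways to pick the interval with the largest average, and they differ when several intervals tie. A small sketch on hypothetical values (the `toyAvg` data frame is made up for illustration): comparing against the maximum returns every tied interval, whereas `which.max()` returns only the first one.

```r
# Hypothetical per-interval averages with a tie for the maximum
toyAvg <- data.frame(interval = c(0, 5, 10, 15), avgSteps = c(1.5, 7, 3, 7))

# Comparison against the maximum keeps every tied interval
allMax <- toyAvg$interval[toyAvg$avgSteps == max(toyAvg$avgSteps)]

# which.max() returns the index of the first maximum only
firstMax <- toyAvg$interval[which.max(toyAvg$avgSteps)]

allMax    # 5 15
firstMax  # 5
```

If the inline sentence is meant to name a single interval, `which.max()` is probably the safer choice; if ties should be reported, the comparison form is.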

## Imputing missing values
```{r missingValues, echo=TRUE}
totalNumOfNAs <- sum(is.na(ActivityData$steps))
```
There are **`r totalNumOfNAs`** missing values in the dataset (i.e. the total number of rows with NAs).

### Filling in all of the missing values in the dataset with the average number of steps for the same interval

```{r missingvaluesRefill, echo=TRUE}
library(plyr)
# Mean steps per 5-minute interval, ignoring missing values
intervalMeans <- ddply(ActivityData, .(interval), summarize, avgSteps = mean(steps, na.rm = TRUE))
newActivityData <- ActivityData
# Replace each missing value with the mean of its interval
missingRows <- is.na(newActivityData$steps)
newActivityData$steps[missingRows] <-
     intervalMeans$avgSteps[match(newActivityData$interval[missingRows], intervalMeans$interval)]
# Total steps per day after imputation
df3 <- ddply(newActivityData, .(date), summarize, sumSteps = sum(steps, na.rm = TRUE))
hist(df3$sumSteps, col = "green", breaks = 10,
     xlab = "Number of steps per day",
     main = "Histogram of the total number of steps taken each day")
nstepsMean <- mean(df3$sumSteps, na.rm = TRUE)
nstepsMedian <- median(df3$sumSteps, na.rm = TRUE)
# Mark the mean (red) and the median (magenta) on the histogram
abline(v = nstepsMean, col = "red", lwd = 4)
abline(v = nstepsMedian, col = "magenta", lwd = 4)
```
### What is the mean total number of steps taken per day with the missing values filled in?
The mean value of the total number of steps taken per day is **`r nstepsMean`** and the median value of the total number of steps taken per day is **`r nstepsMedian`**.
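
To make the imputation rule concrete, here is a minimal sketch on made-up data (the `toy` data frame is hypothetical and only mirrors the `steps` and `interval` columns of `ActivityData`). It uses base R's `ave()` to compute, for each row, the mean of the non-missing steps in that row's interval, and then fills only the missing entries.

```r
# Hypothetical toy data: three days with two intervals each
toy <- data.frame(
  interval = c(0, 5, 0, 5, 0, 5),
  steps    = c(2, NA, 4, 10, NA, 20)
)

# For every row, the mean of the non-missing steps in that row's interval
intervalMean <- ave(toy$steps, toy$interval,
                    FUN = function(x) mean(x, na.rm = TRUE))

# Fill only the missing entries; observed values are left untouched
filled <- toy
filled$steps <- ifelse(is.na(toy$steps), intervalMean, toy$steps)

filled$steps  # 2 15 4 10 3 20
```

Interval 0 has observed values 2 and 4, so its missing entry becomes 3; interval 5 has 10 and 20, so its missing entry becomes 15.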

## Are there differences in activity patterns between weekdays and weekends?
Create a new factor variable in the dataset with two levels **weekday** and **weekend** indicating whether a given date is a weekday or weekend day.

```{r weekdays, echo=TRUE}
# weekdays() is vectorized, so no explicit loop over rows is needed
isWeekend <- weekdays(as.Date(newActivityData$date)) %in% c("Saturday", "Sunday")
newActivityData$typeOfDay <- as.factor(ifelse(isWeekend, "Weekend", "Weekday"))
```
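
`weekdays()` returns day names in the current locale, so matching against "Saturday" and "Sunday" only works when the names come back in English (presumably the reason the locale is set to "C" at the top of this document). A hedged alternative sketch that does not depend on day names uses the numeric weekday from `as.POSIXlt()`, where 0 is Sunday and 6 is Saturday; the `dates` vector below is hypothetical.

```r
# Hypothetical dates: Friday, Saturday, Sunday, Monday
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# wday is the day of the week as a number: 0 = Sunday, ..., 6 = Saturday
wday <- as.POSIXlt(dates)$wday

# Two-level factor, independent of the locale's day names
typeOfDay <- factor(ifelse(wday %in% c(0, 6), "Weekend", "Weekday"),
                    levels = c("Weekday", "Weekend"))

as.character(typeOfDay)  # "Weekday" "Weekend" "Weekend" "Weekday"
```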

Make a panel plot containing a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis).
```{r plots, echo=TRUE}
library(ggplot2)
# Average (not total) steps per interval, separately for weekdays and weekends
df4 <- ddply(newActivityData, .(interval, typeOfDay), summarize, avgSteps = mean(steps, na.rm = TRUE))
g <- ggplot(df4, aes(x = interval, y = avgSteps))
g + geom_line() + facet_grid(typeOfDay ~ .) +
  theme(panel.background = element_rect(colour = "pink")) +
  labs(x = "Interval", y = "Average steps per interval",
       title = "Time series plot of the 5-minute interval and the average number of steps")
```