Statistics-Using-R/Regression at master · saurabh0501/Statistics-Using-R · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
---
title: "IMT 573 Lab: Regression"
author: "Saurabh Sharma"
date: "November 7th, 2019"
output:
  tufte_handout:
    highlight: tango
---

\marginnote{\textcolor{blue}{Don't forget to list the full names of your collaborators!}}

# Collaborators:

# \textbf{Instructions:}

Before beginning this assignment, please ensure you have access to R and/or RStudio.

1. Download the `week7b_lab.Rmd` file from Canvas. Open `week7b_lab.Rmd` in RStudio (or your favorite editor) and supply your solutions to the assignment by editing `week7b_lab.Rmd`.

2. Replace the "Insert Your Name Here" text in the `author:` field with your own full name.

3. Be sure to include code chucks, figures and written explanations as necessary. Any collaborators must be listed on the top of your assignment. Any figures should be clearly labeled and appropriately referenced within the text.

4. When you have completed the assignment and have **checked** that your code both runs in the Console and knits correctly when you click `Knit`, rename the R Markdown file to `YourLastName_YourFirstName_lab5b.Rmd`, and knit it into a PDF. Submit the compiled PDF on Canvas.

In this lab, you will use whichever libraries are most useful to you.

```{r Setup, message=FALSE}
# Load some helpful libraries
install.packages("dummies", repos = "http://cran.us.r-project.org")
library(tidyverse)

library(dummies)
```

Let's begin by reading in data from the following course:  http://data.princeton.edu/wws509/datasets/#salary. Note that this url will have information on variables in the dataset. Also note that the dataset we will be using was previously used as part of a textbook and is a rather well-known dataset for practicing regression.

## Inspecting the data

#### 1. Read in the data. Is it clean and ready for analysis? What do the variables look like?
```{r}
data <- read.table("https://data.princeton.edu/wws509/datasets/salary.dat")
# the above command has headers in the first row
data <- read.table("https://data.princeton.edu/wws509/datasets/salary.dat" , header = TRUE)
summary(data)
#The data is
```


## Univariate regression

#### 2. Plot the salary vs years in current academic position.
```{r}
ggplot(data,aes(x=yr, y=sl)) + geom_point(color = 'blue', size = 5)
mean(data$yr)
cor(data$sl, data$yr)
cor(data$yr, data$sl)
fit <- lm(sl ~ yr, data)
summary(fit) # we don't have the confidence interval here.
confint(fit, 'yr', level = 0.95)
```


#### 3. What is the salary increase associated with each additional year in current academic position? What do the confidence intervals look like?

```{r}
residuals <- resid(fit)
plotResiduals <- ggplot(data = data.frame(x = data$sl, y= residuals), aes(x = x, y=y)) + geom_point(color = 'red',size = 5)
plotResiduals
# adding additional layers to the plots we already have.
```

#### 4. What do the residuals look like? Plot and describe them.

#### 5. Add the regression line to your plot of the data

## Multivariate regression

#### 6. Run a multiple regression on salary while including the following variables: yr, yd, rk, dg. What is now the effect of each additional year in current academic position?
```{r}
plotResiduals <- plotResiduals +
  stat_smooth(method = 'lm',se = FALSE, color = 'red')+ xlim(14000, 38000)+ ylim(-7500, 7500)+labs(title = "First residual plot", y = 'residual', x = 'salary')
plotResiduals

DgDummies <- dummy(data$dg)
RkDummies <- dummy(data$rk)


dataDummy <- subset(data, select = c('sl', 'yr', 'yd'))

dataDummy <- cbind(dataDummy, RkDummies)
dataDummy <- cbind(dataDummy, DgDummies)


fitDummy <- lm(sl ~., dataDummy)
# using . includes in our fit every column in the dataDummy

summary(fitDummy)


dataDummy2 <- subset(dataDummy, select = c('sl', 'yr', 'yd', 'rkassociate', 'rkfull', 'dgdoctorate'))

fitDummy2 <- lm(sl ~., dataDummy2)


```

#### 7. Did the coefficent change from the previous regression? Is it still statistically significant? Can you explain this?

#### 8. What do the residuals look like? Plot and describe them.

```{r}

residuals2 <- resid(fitDummy2)

plotResiduals2 <- ggplot(data = data.frame(x = data$sl, y= residuals2), aes(x = x, y=y)) + geom_point(color = 'red',size = 5)
plotResiduals2


plotResiduals2 <- plotResiduals2 +
  stat_smooth(method = 'lm',se = FALSE, color = 'red')+ xlim(14000, 38000)+ ylim(-7500, 7500)+labs(title = "Secondresidual plot", y = 'residual', x = 'salary')
plotResiduals2


```

## Adding randomness

#### 9. create 20 new variables, each with randomly sampled data from a distribution of your choice. Add these to the above regression.

#### 10. Did the coefficent change from the previous regression? Is it still statistically significant?

#### 11. What do the residuals look like? Plot and describe them.

#### 12. Did the model fit improve (R-sq)? Is this expected? Why?

#### 13. How did the model fare compared to previous models with respect to AIC/BIC?

## Forward and backward selection

#### Run forward selection on all the data using salary as your exogenous variable. What does your final model look like?

#### Run backward selection on all the data using salary as your exogenous variable. What does your final model look like?