DataScienceInR/5-MultipleRegression.qmd at main · TheoreticalEcology/DataScienceInR · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
---
output: html_document
editor_options:
  chunk_output_type: console
---

```{r, include=FALSE}
set.seed(42)
```

# Multiple regression

![](resources/mlm.png)

We want to understand what happens if there are more variables in the system that also affect the response and maybe also other predictors.

The lm can be then extended to:

$$
y = a_0 + a_1*x_1 + a_2*x_2
$$

This is important because of the omitted variable bias: If there is a confounder which has an effect on the predictor and the response, and we don't condition the model on it, the effect will be absorbed by the predictor, potentially causing a spurious correlation. Conditioning means that we need to include the variables even though we are not really interested in it! (Those variables are called *nuisance parameters*.)

In the worst case it can lead to a Simpson's paradox: An unobserved variable purports the effect of a predictor on the response variable and removes the predictor's effect or even changes its direction in the opposite direction to the true correlation.

## Confounder

Confounders have effects on the response and another predictor.

Let's simulate some data about `Growth` of plants influenced by `Climate` and `Temperature`. However, we will make `Temperature` also depending on `Climate` to some extent:

```{r}
Climate = runif(100)
Temp = Climate + rnorm(100, sd = 0.2)
Growth = 0.5*Temp - 1.0*Climate + rnorm(100, sd = 0.2)
```

Simple linear model of Growth to temperature

```{r}
summary(lm(Growth~Temp))
```

Take a look at the effects of temperature and compare with the simulated one. What happened here??

Now, we introduce the Climate variable (our confounder):

```{r}
summary(lm(Growth~Temp+Climate)) # correct effects!!
```

The effect of temperature is now positive as we have created in the data!

Identifying confounders is the most important challenge in observational studies: For example, several studies showed that overweight adults (BMI) have lower mortality. However, another study shows that these earlier results might have come up due to confounding: smoking!

-   smokers: higher mortality and lower BMI -\> people with lower BMI have higher mortality rates

-   When we correct for the confounder *smoking*, the correlation between BMI and mortality goes in the other direction, i.e. obese people have higher mortality!

Confounders can even lead to observed correlations where in reality there is no such correlation. This is called *spurious correlation*.

::: callout-warning
Conclusion: Confounders can cause correlations where no causal relationship exists.
:::

## Multiple LM

::: callout-note
A linear regression with a quadratic effect (or any polinomial) of X is a "multiple regression" in the sense that the squared X can be considered another "variable": $$  y = a_0 + a_1*x_1 + a_2*x_1^2  $$
:::

Multiple linear regression expands simple linear regression to a polynomial of several explanatory variables x1, x2... e.g.: $$
    y = a_0 + a_1*x_1 + a_2*x_2 + a_3*x_3
    $$

-   Idea: if we jointly consider "all" variables in the model formula, the influence of confounding variables is incorporated.

Let's see an example with the `airquality` data:

```{r}
## first remove observations with NA values
newAirquality = airquality[complete.cases(airquality),]
#summary(newAirquality)
```

The simple regression of Ozone with temperature:

```{r}
# simple regression
m0 = lm(Ozone ~ Temp , data = newAirquality)
summary(m0)

plot(Ozone ~ Temp , data = newAirquality)
abline(m0, col = "blue", lwd = 3)
```

Including `Wind` effect:

```{r}
m1 = lm(Ozone ~ Temp + Wind , data = newAirquality)
summary(m1)
```

To vizualize the effects of each predictor on the ozone, we can use the package `effects`. The package will plot the effects of each variable separately (plots called partial slopes) controling for (or adjusting for) the other variables in the model. Controling here means that for ploting the effect of `Temp`, the variable `Wind` was fixed at it's average value (the fixed value can be changed by the user, but the default is to take the mean of a continuous variable).

Plotting multiple regression outputs using the package `effects`:
```{r}
library(effects)
plot(allEffects(m1))
```

A predictor eﬀect plot summarizes the role of a selected focal predictor in a fitted regression model. These graphs are an alternative to tables of fitted coeﬃcients, which can be much harder to interpret than predictor eﬀect plots. (Info taken from the vignette of the `effects` package, take a look at it!).

Let's interprete the partial slope plot for `Temp`. It shows that there is a positive relationship of temperature and Ozone. The shaded blue area.

The intercept of the line aﬀects only the height of the line, and is determined by the choices made for averaging over the fixed predictors, but for any choice of averaging method, the slope of the line would be the same (because the model has no interactions between variables - next topic). The shaded area is a pointwise confidence band for the fitted values, based on standard errors computed from the covariance matrix of the fitted regression coeﬃcients. The rug plot at the bottom of the graph shows the location of the `Temp` data values.

If we omit `Wind` will we have a different effect of `Temp` in Ozone?

```{r}
## Omitted variable bias
both = lm(Ozone ~ Wind + Temp, newAirquality)
wind = lm(Ozone ~ Wind , newAirquality)
temp = lm(Ozone ~ Temp, newAirquality)
#summary(both)
#summary(wind)

slopes <- data.frame(
  predictor = c("Wind", "Temp"),
  both.pred = round(coef(both)[2:3], digits = 2),
  only.wind = c(round(coef(wind)[2], digits = 2), "NA"),
  only.temp = c("NA", round(coef(temp)[2], digits = 2))
)
slopes
```

Yes, omitting Wind makes the effect of Temperature larger.

**Problem:** Multiple regression can separate the effect of collinear explanatory variables, but only, if collinearity is not too strong.

**Solution:** If the correlation is really strong, we can omit one variable and interpret the remaining collinear variable as representing both.

## Interactions between variables

If one predictor influences the effect of the other predictor, we can include an interaction term into our model:

$$
y \sim a + b + a:b
$$

or:

$$
y \sim a*b
$$

Let's include an interaction effect of `Wind` and `Temp` to the `Ozone` model.

::: callout-note
## Scale your continous variables!

if including interactions, always scale your predictor variables! scale means: subtracts the mean and divides by standard deviation
:::

```{r}
m2 = lm(Ozone ~ scale(Temp)* scale(Wind) , data = newAirquality)
summary(m2)
```

The influence of temperature on ozone depends on the amount of wind. When wind is low, the relationship is strongly positive, but when wind is high this relationship becomes slightly negative.

```{r, warning=F}
plot(allEffects(m2, xlevels=3))
```

### Interactions with categorical variables

How does everything change, if we have factorial predictors?

```{r}
newAirquality$MonthFactor = factor(newAirquality$Month)
m4 = lm(Ozone ~ MonthFactor * scale(Temp),
        data = newAirquality)
summary(m4)
plot(allEffects(m4))
```

## Sequential (type I) ANOVA

Let's take our old model with an interaction term between temperature and wind:


```{r}
m2 = lm(Ozone ~ scale(Temp)* scale(Wind) , data = newAirquality)
summary(m2)
```

The total adjusted R² of the model is `r round(summary(m2)$adj.r.squared,2)`, which is better than the model that only included `Temp` as a predictor (`r round(summary(temp)$adj.r.squared,2)`). However, we might be interested in knowing **how much** each predictor contributes to the model. This is where ANOVA comes in.

An ANOVA (Analysis of Variance), or more specifically a **sequential (type I) ANOVA**, evaluates how the model's explanatory power increases as we add predictors one at a time. It allows us to assess the **incremental contribution** of each variable, **in the order they are added**.

Let's run a type I ANOVA for m2

```{r}
anova(m2)
```

This output shows the reduction in residual sum of squares as each term is added to the model. Importantly, because this is **sequential**, the order of variables matters: the second variable is evaluated **after accounting for the first**, and so on.

::: callout-note
As the name type I suggests, there are also other (non-sequential) types of ANOVA. More on this in the Advanced Regression Model lecture notes [here](https://theoreticalecology.github.io/AdvancedRegressionModels/2B-ANOVA.html#partitioning-variance-if-order-matters).
:::

## Model selection

Model selection is a **controversial and nuanced** area of statistics. The criteria you use depend heavily on your goal: are you trying to **explain relationships** (inference) or make **accurate predictions**?

Here, we will briefly present two commonly used tools for model selection: the already mentioned `anova` function and the **Akaike Information Criterion (AIC)**.

Let’s return to our `airquality` dataset and fit four increasingly complex models:
```{r}
m0 = lm(Ozone ~ 1 , data = newAirquality)
m1 = lm(Ozone ~ Temp  , data = newAirquality)
m2 = lm(Ozone ~ Wind , data = newAirquality)
m3 = lm(Ozone ~ Temp + Wind , data = newAirquality)
```

We can compare the nested models (one is the smaller version of the other) with the anova:

```{r}
anova(m0, m1, m3)
```

This tells us whether each added variable significantly improves the model. However, we **cannot** use ANOVA to compare **non-nested models** — i.e., models that include different predictors but are not subsets of one another:


```{r}
anova(m1, m2) # This doesn't make sense: models are not nested
```

To compare models regardless of nesting, we can use **AIC (Akaike Information Criterion)**. Lower AIC values indicate better trade-off between goodness-of-fit and model complexity:

```{r}
AIC(m0, m1, m2,m3)
```

This allows us to compare all models side by side and select the one that balances fit and parsimony.

## Formula syntax

| Formula | Meaning | Details |
|------------------------|------------------------|------------------------|
| `y~x_1` | $y=a_0 +a_1*x_1$ | Slope+Intercept |
| `y~x_1 - 1` | $y=a_1*x_1$ | Slope, no intercept |
| `y~I(x_1^2)` | $y=a_0 + a_1*(x_1^2)$ | Quadratic effect |
| `y~x_1+x_2` | $y=a_0+a_1*x_1+a_2*x_2$ | Multiple linear regression (two variables) |
| `y~x_1:x_2` | $y=a_0+a_1*(x_1*x_2)$ | Interaction between x~1~ and x~2~ |
| `y~x_1*x_2` | $y=a_0+a_1*(x_1*x_2)+a_2*x_1+a_3*x_2$ | Interaction and main effects |

: Formula syntax