---
output: html_document
editor_options:
chunk_output_type: console
---
```{r, include=FALSE}
set.seed(42)
```
# Linear Regression
## Fitting a simple linear regression
Let's start with the basics. For this chapter, we will rely a lot on the airquality data, which is one of the built-in datasets in R.
::: {.callout-tip collapse="true"}
#### More info on the airquality data
Most datasets in R have a help file. Thus, if you don't know the data set, have a look at the description by pressing F1 with the cursor on the word, or by typing
```{r chunk_chapter3_chunk2, echo=TRUE, eval=FALSE}
?airquality
```
As you can read, the dataset comprises daily air quality measurements in New York, May to September 1973. You can see the structure of the data via
```{r chunk_chapter3_chunk3, echo=TRUE, eval=FALSE}
str(airquality)
```
Variables are described in the help. Note that Month / Day are currently coded as integers; they would be better coded as dates or (ordered) factors (something you could do later if you use this dataset).
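As a small sketch of that later step (the year 1973 is taken from the data description), the two columns could be recoded like this on a copy of the data:

```{r}
airquality2 = airquality  # work on a copy, leaving the original dataset untouched
airquality2$fMonth = factor(airquality2$Month)  # month as a (categorical) factor
# combine year, month and day into a proper date
airquality2$date = as.Date(paste(1973, airquality2$Month, airquality2$Day, sep = "-"))
str(airquality2[, c("fMonth", "date")])
```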
:::
Let's say we want to examine the relationship between Ozone and Wind in these data.
```{r chunk_chapter3_chunk4, echo=TRUE}
plot(Ozone ~ Wind, data = airquality)
```
OK, I would say there is some dependency here. To quantify this numerically, we want to fit a linear regression model through the data with the `lm()`{.R} function of R. We can do this in R by typing
```{r chunk_chapter3_chunk12, echo=TRUE, eval=TRUE}
fit = lm(Ozone ~ Wind, data = airquality)
```
This command creates a linear regression model that fits a straight line, adjusting slope and intercept to get the best fit through the data. We can see the fitted coefficients if we print the model object
```{r}
fit
```
::: callout-note
The function name lm is short for "linear model". Remember from your basic stats course: This model is **not** called linear because we necessarily fit a linear function. It's called linear as long as we express the response (in our case Ozone) as a polynomial of the predictor(s). A polynomial ensures that when estimating the unknown parameters (we call them "*effects*"), they all affect predictions linearly, and can thus be solved as a system of linear equations.
So, an lm is any regression of the form $y = \operatorname{f}(x) + \mathcal{N}(0, \sigma)$, where $\operatorname{f}$ is a polynomial, e.g. ${a}_{0} + {a}_{1} \cdot x + {a}_{2} \cdot {x}^{2}$, and $\mathcal{N}(0, \sigma)$ means that we assume the data scattering as a normal (Gaussian) distribution with unknown standard deviation $\sigma$ around $\operatorname{f}(x)$.
:::
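To make the "linear in the parameters" point concrete, here is a small sketch (not from the original text) that solves the normal equations $\hat{a} = (X^T X)^{-1} X^T y$ by hand for a quadratic polynomial and compares the result with `lm()`:

```{r}
x = runif(50, -2, 2)
y = 1 + 2 * x - 0.5 * x^2 + rnorm(50, sd = 0.3)  # simulated quadratic relationship
X = cbind(1, x, x^2)                   # design matrix of the polynomial
a_hat = solve(t(X) %*% X, t(X) %*% y)  # solve the normal equations
cbind(byHand = a_hat, lm = coef(lm(y ~ x + I(x^2))))
```

Both approaches give the same coefficients, because fitting a polynomial lm is just solving a system of linear equations.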
::: {.callout-tip collapse="true"}
## Understanding how the lm works
To understand how the lm works, we will estimate a lm using a custom likelihood function:
1. **Create some toy data:**
```{r}
Xobs = rnorm(100, sd = 1.0)
Y = 0.0+1.0*Xobs + rnorm(100,sd = 0.2)
```
2. **Define the likelihood function and optimize the parameters (slope+intercept):**
```{r}
likelihood = function(par) { # three parameters now
lls = dnorm(Y, mean = Xobs*par[2]+par[1], sd = par[3], log = TRUE)
# calculate for each observation the probability to observe the data given our model
# we use the log-likelihood for numerical reasons
return(sum(lls))
}
likelihood(c(0, 0, 0.2))
opt = optim(c(0.0, 0.0, 1.0), fn = function(par) -likelihood(par), hessian = TRUE )
opt$par
```
3. **Test the estimated parameters against 0 (standard errors can be derived from the Hessian of the MLE):** Our true parameters are 0.0 for the intercept, 1.0 for the slope, and 0.2 for the sd of the observational error:
1. calculate standard error
2. calculate t-statistic
3. calculate p-value
```{r}
st_errors = sqrt(diag(solve(opt$hessian)))
st_errors[2]
t_statistic = opt$par[2] / st_errors[2]
pt(t_statistic, df = 100-3, lower.tail = FALSE)*2
```
The p-value is smaller than $\alpha = 0.05$, so the effect is significant! Let's compare it against the output of the lm function:
```{r}
model = lm(Y~Xobs) # 1. Get estimates, MLE
model
summary(model) # 2. Calculate standard errors, CI, and p-values
```
:::
## Interpreting the fitted model
We can get a more detailed summary of the statistical estimates with the `summary()`{.R} function.
```{r}
summary(fit)
```
In interpreting this, recall that:
- "**Call**" repeats the regression formula.
- "**Residuals**" gives you an indication about how far the observed data scatters around the fitted regression line / function.
- The *regression table* (starting with "**Coefficients**") provides the estimated parameters, one row for each fitted parameter. The first column is the estimate, the second is the standard error (for 0.95 confidence interval multiply with 1.96), and the fourth column is the p-value for a two-sided test with ${H}_{0}$: "Estimate is zero". The t-value is used for calculation of the p-value and can usually be ignored.
- The last section of the summary provides information about the *model fit*.
- Residual error = Standard deviation of the residuals,
    - 114 df = degrees of freedom = number of observations minus number of fitted parameters.
- R-squared $\left({R}^{2}\right)$ = How much of the signal, respective variance is explained by the model, calculated by $\displaystyle 1 - \frac{\text{residual variance}}{\text{total variance}}$.
- Adjusted R-squared = Adjusted for model complexity.
- F-test = Test against intercept only model, i.e. is the fitted model significantly better than the intercept only model (most relevant for models with \> 1 predictor).
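As a sketch of the 1.96 rule for confidence intervals mentioned above, we can compute approximate 95% intervals by hand from the regression table and compare them with R's `confint()`:

```{r}
fit = lm(Ozone ~ Wind, data = airquality)  # same model as above
co = coef(summary(fit))                    # the regression table as a matrix
cbind(lower = co[, "Estimate"] - 1.96 * co[, "Std. Error"],
      upper = co[, "Estimate"] + 1.96 * co[, "Std. Error"])  # normal approximation
confint(fit)  # exact t-based intervals, nearly identical at this sample size
```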
::: {.callout-caution icon="false"}
#### Question
1. What is the meaning of "An effect is not significant"?
2. Is an effect with three \*\*\* more significant / certain than an effect with one \*?
:::
::: {.callout-tip collapse="true" appearance="minimal" icon="false"}
#### Solution
1. You should NOT say that the effect is zero, or that the null hypothesis has been accepted. Official language is "there is no significant evidence for an effect (p = XXX)". If we want to assess what that means, some people do a post-hoc power analysis (which effect size could have been detected), but it is typically better to just discuss the confidence interval, i.e. look at the confidence interval and say: if there is an effect, we are relatively certain that it is smaller than X, given the confidence interval of XYZ.
2. Many people view it that way, and some even write "highly significant" for \*\*\* . It is probably true that we should have a slightly higher confidence in a very small p-value; strictly speaking, however, there is only *significant* or *not significant*. Interpreting the p-value as a measure of certainty is a slight misinterpretation. Again, if we want to say how certain we are about the effect, it is better to look at the confidence interval, i.e. the standard error, and use this to discuss the precision of the estimate (small confidence interval / standard error = high precision / certainty).
:::
## Visualizing the results
For a simple lm and one predictor, we can visualize the results via
```{r}
plot(Ozone ~ Wind, data = airquality)
abline(fit)
```
A more elegant way to visualize the regression, which also works once we move to generalized linear models (GLM) and multiple regression, is using the effects package. The base command to visualize the fitted model is:
```{r, message=FALSE}
library(effects)
plot(allEffects(fit, partial.residuals = T))
```
The blue line is the fitted model (with confidence interval in light blue). When using the argument partial.residuals = T, purple circles are the data, and the purple line is a nonparametric fit to the data. If the two lines deviate strongly, this indicates that we may have a misspecified model (see next section on residual checks).
::: {.callout-caution icon="false"}
#### Question
Fit Ozone \~ Temp, and look at summary, residuals and visualizations. What would you conclude?
:::
::: {.callout-tip collapse="true" appearance="minimal" icon="false"}
#### Solution
```{r}
fit = lm(Ozone ~ Temp, data = airquality)
plot(allEffects(fit, partial.residuals = T))
summary(fit)
```
Temperature has a significant positive effect on Ozone. However, the effect seems nonlinear; one could consider adding a quadratic term or fitting a GAM to improve the fit. The model explains nearly 49% of the variance of the given data ($R^2$). 37 observations have missing data and are omitted. The model explains the data significantly better compared to a model with only an intercept (F-statistic).
:::
## Residual checks
The regression line of an lm, including p-values and all that, is estimated under the assumptions that the residuals scatter iid normal. If that is not (approximately) the case, the results may be nonsense. Looking at the results of the effect plots above, we can already see a first indication that there may be a problem.
What we see highlighted here is that the data seems to follow a completely different curve than the fitted model. The conclusion here would be: The model we are fitting does not fit to the data, we should not interpret its outputs, but rather say that we reject it, it's the wrong model, we have to search for a more appropriate description of the data (see next section, "Adjusting the functional form").
### LM residual plots
To better analyse these residuals (and potential problems), R offers a function for residual plots. It produces 4 plots. I think it's most convenient to plot them all into one figure, via the command `par(mfrow = c(2, 2))`, which produces a figure with 2 x 2 = 4 panels.
```{r, echo=TRUE, eval=TRUE}
par(mfrow = c(2, 2))
plot(fit)
```
**Interpretation of the panels**:
- **Residuals vs Fitted:** Shows misfits and wrong functional form. Scattering should be uniformly distributed.
- **Normal Q-Q:** Checks if residuals follow an overall normal distribution. Bullets should lie on the line in the middle of the plot and may scatter a little bit at the ends.
- **Scale - Location:** Checks for heteroskedasticity. Does the variance change with predictions/changing values? Scattering should be uniformly distributed.
- **Residuals vs Leverage:** How much impact do outliers have on the regression? Data points with high leverage should not have high residuals and vice versa. Bad points lie in the upper right or in the lower right corner. This is measured via the Cook's distance. Distances higher than 0.5 indicate candidates for relevant outliers or strange effects.
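The quantities behind these panels can also be extracted directly. As a sketch, the Cook's distances from the fourth panel can be checked against the 0.5 rule of thumb via:

```{r}
fit_wind = lm(Ozone ~ Wind, data = airquality)  # helper refit for this sketch
cd = cooks.distance(fit_wind)      # influence of each observation on the fit
sort(cd, decreasing = TRUE)[1:3]   # the three most influential observations
which(cd > 0.5)                    # rule-of-thumb candidates for relevant outliers
```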
::: callout-note
The plot(model) function should ONLY be used for linear models, because it is built around the assumption of normally distributed residuals. Unfortunately, it also works on the output of the glm() function, but when used on a GLM beyond the linear model, it is not interpretable.
:::
### DHARMa residual plots
The residual plots above only work for testing linear models, because they explicitly test for the assumptions of a normal distribution, which we will relax once we go to more complex models. The DHARMa package provides a general way to perform residual checks for practically any model from the generalized linear mixed-effect model (GLMM) family.
```{r, echo=TRUE, eval=TRUE, message=FALSE}
library(DHARMa)
res <- simulateResiduals(fit, plot = T)
```
The `simulateResiduals()` function uses simulations to generate standardized residuals for any GLMM. Standardized means that a uniform distribution of residuals signifies perfect fit. Here, the residuals are saved in the variable res.
The argument `plot = T` creates a plot with two panels (alternatively, you could also type `plot(res)`). The left panel is a uniform qq plot (calling plotQQunif), and the right panel shows residuals against predicted values (calling plotResiduals), with outliers highlighted in red.
Very briefly, we would expect that a correctly specified model shows:
1. a straight 1-1 line, as well as n.s. of the displayed tests in the qq-plot (left) -\> evidence for the correct overall residual distribution (for more details on the interpretation of this plot, see `?plotQQunif`)
2. visual homogeneity of residuals in both vertical and horizontal direction, as well as n.s. of quantile tests in the res \~ predictor plot (for more details on the interpretation of this plot, see `?plotResiduals`)
Deviations from these expectations can be interpreted similar to a linear regression. See the DHARMa vignette for detailed examples.
## Adjusting the Functional Form
### Mean vs. variance problems
Looking at the residual plots of our regression model, there are two types of problems, and there is a clear order in which you should solve them:
1. **Systematic misfit of the mean:** the first thing we should worry about is if the model describes the mean of the data well, i.e. if our function form (we assumed a straight line) fits to the data. You can see misfit for our model in a) the effects plot, b) the res \~ fitted plot, and c) the DHARMa res \~ fitted plot.
2. **Distributional problems:** only if your model fits the mean of the data well should you look at distributional problems. Distributional problems mean that the residuals do not scatter as assumed by the regression (for an lm: iid normal). The reason why we look at the distribution only after solving 1) is that a misfit can easily create distributional problems.
Here, and in this part of the book in general, we will first be concerned with adjusting the model to describe the mean correctly. In the second part of the book, we will then also turn to advanced options to model the variance.
::: callout-caution
**Important**: Residuals are always getting better for more complex models. They should therefore NOT solely be used for automatic model selection, i.e. you shouldn't assume that the model with the best residuals is necessarily preferable (on how to select models, see section on model selection). The way you should view residual checks is as a rejection test: the question you are asking is if the residuals are so bad that the model needs to be rejected!
:::
### R regression syntax
If we see a misfit of the mean, we should adjust the functional form of the model. Here a few options (see @tbl-syntax for more details):
```{r chunk_chapter3_task_10, message=FALSE, warning=FALSE}
fit = lm(Ozone ~ Wind, data = airquality) # Intercept + slope.
fit = lm(Ozone ~ 1, data = airquality) # Only intercept.
fit = lm(Ozone ~ Wind - 1 , data = airquality) # Only slope.
fit = lm(Ozone ~ log(Wind), data = airquality) # Predictor variables can be transformed.
fit = lm(Ozone^0.5 ~ Wind, data = airquality) # Output variables can also be transformed.
fit = lm(Ozone ~ Wind + I(Wind^2), data = airquality) # Mathematical functions with I() command.
```
Basically, you can use all these tools to search for a better fit through the data.
Note that if you transform the response variable, you change both the functional form and the spread of the residuals. Thus, this option can change (and often improve) both the mean and distributional problems, and is a common strategy to deal with non-symmetric residuals etc. There is a small convenience function in the package MASS that automatically searches for the best-fitting power transformation of the response.
```{r}
library(MASS)
boxcox(fit)
```
In the result, you can see that the best fit, assuming a power transformation Ozone\^k, is delivered by k \~= 0.35.
::: {.callout-caution icon="false"}
#### Task
Modify the formula with the tools above to get (as far as possible) an acceptable fit to the data.
:::
::: {.callout-tip collapse="true" appearance="minimal" icon="false"}
#### Solution
Possible solution, adding a quadratic predictor and choosing a power transformation of 0.35 based on the boxcox function:
```{r chunk_chapter3_task_11, message=FALSE, warning=FALSE}
fit1 = lm(Ozone^0.35 ~ Wind + I(Wind^2), data = airquality)
plot(allEffects(fit1, partial.residuals = T), selection = 1)
```
Note that if you `plot(fit1)`, there still seems to be a bit of a pattern in the scale-location plot (heteroskedasticity). I would say that this is not particularly concerning, but if you are concerned and don't manage to address it via changing the functional form, you could fit a regression with variable variance. We will discuss this in a later chapter.
You could get an even better fit by adding more and more predictors, for example:
```{r chunk_chapter3_task_16, message=FALSE, warning=FALSE}
fit2 = lm(Ozone^0.35 ~ Wind + I(Wind^2) + I(Wind^3) + I(Wind^4) + I(Wind^5) +
I(Wind^6) + I(Wind^7) + I(Wind^8), data = airquality)
plot(allEffects(fit2, partial.residuals = T), selection = 1)
```
However, as noted above, and as we will further discuss in the section on model selection, this is not a good idea, because while a more complicated model always improves residuals, it has other disadvantages. This model is likely too complex for the data (aka it overfits). We can see this by looking at common model selection indicators (again, more in the section on model selection).
AIC comparison (lower = better)
```{r}
AIC(fit1)
AIC(fit2)
```
Likelihood ratio test (is there evidence for the more complex model?)
```{r}
anova(fit1, fit2)
```
:::
### Generalized additive models (GAMs)
Another option to get the fit right is a Generalized Additive Model (GAM). The idea is to fit a smooth function to the data, to automatically find the "right" functional form. The smoothness of the function is automatically optimized.
```{r chunk_chapter4_chunk4, echo=TRUE, eval=TRUE, message=FALSE, warning=FALSE}
library(mgcv)
fit = gam(Ozone ~ s(Wind) , data = airquality)
# allEffects doesn't work here.
plot(fit, pages = 0, residuals = T, pch = 20, lwd = 1.8, cex = 0.7,
col = c("black", rep("red", length(fit$residuals))))
```
In the summary(), you still get significance for the smoothing term (i.e. a p-value), but given that you fit a line that could go up and down, you don't get an effect direction.
```{r}
summary(fit)
```
## Categorical Predictors
The `lm()`{.R} function can handle both numerical and categorical variables. To understand what happens if the predictor is categorical, we'll use another data set here, `PlantGrowth`{.R} (type `?PlantGrowth`{.R} or F1 help if you want details). We visualize the data via:
```{r chunk_chapter3_chunk27, echo=TRUE, eval=TRUE}
boxplot(weight ~ group, data = PlantGrowth)
```
### Understanding contrasts
Let's fit an `lm()`{.R} now with the categorical explanatory variable group. The syntax is the same as before
```{r chunk_chapter3_chunk28, echo=TRUE, eval=TRUE}
fit = lm(weight ~ group, data = PlantGrowth)
summary(fit)
```
but the interpretation of the results often leads to confusion. We see effects for trt1 and trt2, but where is ctrl? The answer is: R is calculating so-called **treatment contrasts** for categorical predictors. In treatment contrasts, we have a **reference level** (in our case ctrl), which is the intercept of the model, and then we estimate the **difference to the reference** for all other factor levels (in our case trt1 and trt2).
::: callout-tip
By default, R orders factor levels alphabetically, and thus the factor level that is first in the alphabet will be used as the reference level in `lm()`. In our example, ctrl is before trt in the alphabet and thus used as reference, which also makes sense, because it is scientifically logical to compare treatments to the control. In other cases, it may be that the logical reference level does not correspond to the alphabetical order of the factors. If you want to change which factor level is the reference, you can re-order your factor using the `relevel()` function. In this case, estimates and p-values will change, because different comparisons (contrasts) are being made, but model predictions will stay the same!
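As a sketch of this (the column name group2 is just for illustration), re-fitting with trt1 as the reference level looks like:

```{r}
PlantGrowth$group2 = relevel(PlantGrowth$group, ref = "trt1")
fit_relevel = lm(weight ~ group2, data = PlantGrowth)
coef(fit_relevel)  # intercept is now the mean of trt1; other rows are differences to trt1
```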
:::
This also becomes clear if we plot the results.
```{r}
plot(allEffects(fit))
```
Note that in this case, the effects package depicts the CI by purple error bars.
So, in our example, the intercept estimates the mean weight in the ctrl group (with a p-value tested against zero, usually not interesting), and the effects of trt1 and trt2 are the differences of weight compared to the control, with a p-value for a test against an effect size of zero.
::: {.callout-tip collapse="true"}
#### More details on contrasts
Treatment contrasts are not the only option to deal with categorical variables. There are many other options to set up contrasts, which may be appropriate in certain situations. A simple way to change contrasts is the following syntax
```{r}
fit = lm(weight ~ 0 + group, data = PlantGrowth)
summary(fit)
```
which results in fitting the mean for each factor level. Unfortunately, this syntax option doesn't generalize to multiple regressions. There are a number of further options for specifying contrasts. You can tell R by hand how the levels should be compared or use some of the pre-defined contrasts. Here is an example:
```{r}
PlantGrowth$group3 = PlantGrowth$group
contrasts(PlantGrowth$group3) = contr.helmert
fit = lm(weight ~ group3, data = PlantGrowth)
summary(fit)
```
What we are using here is Helmert contrasts, which contrast the second level with the first, the third with the average of the first two, and so on. Which contrasts make most sense depends on the question. For more details, see chapter 10 in Stéphanie M. van den Berg's [book "Analysing Data using Linear Models"](https://bookdown.org/pingapang9/linear_models_bookdown/contrasts.html#introduction-2) and also [this paper](https://besjournals.onlinelibrary.wiley.com/doi/epdf/10.1111/j.2041-210X.2010.00012.x).
:::
### ANOVA
For categorical predictors, one nearly always performs an **ANOVA** on top of the linear regression estimates. The underlying ideas behind this will be explained in more detail in our section on ANOVA, but just very quickly:
An ANOVA starts with a base model (in this case intercept only) and adds the variable group. It then measures:
- How much the model fit improves, in terms of explained sum of squares (this is in the column Sum Sq), which relates directly to ${R}^{2}$.
- If this increase of model fit is significant.
We can run an ANOVA on the fitted model object, using the `aov()` function:
```{r chunk_chapter3_chunk33, echo=TRUE, eval=TRUE}
anov = aov(fit)
summary(anov)
```
In this case, we would conclude that the variable group (3 levels) significantly improves model fit, i.e. the group overall has a significant effect, even though the individual contrasts in the original model were not significant.
::: callout-note
The situation that the ANOVA is significant but the linear regression contrasts are not is often perceived as confusing, but this is perfectly normal and not a contradiction.
In general, the ANOVA is more sensitive, because it tests for an overall effect (thus also has a larger n), whereas the individual contrasts test with a reduced n. Moreover, as we have seen, in the summary of a `lm()`, for \>2 levels, not all possible contrasts are tested. In our example, we estimate p-values for crtl-trt1 and ctrl-trt2, which are both n.s., but we do not test for differences of trt1-trt2 (which happens to be significant).
Thus, the general idea is: 1. ANOVA tells us if there is an effect of the variable at all, 2. `summary()` tests for specific group differences.
:::
### Post-Hoc Tests
Another common test performed on top of an ANOVA or a categorical predictor of an `lm()` is a so-called post-hoc test. Basically, a post-hoc test computes p-values for all possible combinations of factor levels, usually corrected for multiple testing:
```{r chunk_chapter3_chunk36, echo=FALSE, eval=TRUE}
TukeyHSD(anov)
```
In the result, we see, as hinted at before, that there is a significant difference between trt1 and trt2. Note that the p-values of the post-hoc tests are larger than those of the lm `summary()`. This is because post-hoc tests nearly always correct p-values for multiple testing, while regression estimates are usually not corrected for multiple testing (you could, of course, if you wanted).
::: callout-note
Remember: a significance test with a significance threshold of alpha = 0.05 has a type I error rate of 5%. Thus, if you perform 20 tests, you would expect on average one significant result, even if there are no effects whatsoever. Corrections for multiple testing are used to compensate for this, usually aiming at controlling the **family-wise error rate** (**FWER**), the probability of making one or more type I errors when performing multiple hypothesis tests.
:::
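A minimal sketch of such a correction with base R's `p.adjust()` (the raw p-values here are made up for illustration):

```{r}
p_raw = c(0.003, 0.02, 0.04, 0.3)       # hypothetical raw p-values from 4 tests
p.adjust(p_raw, method = "bonferroni")  # multiply by the number of tests (capped at 1)
p.adjust(p_raw, method = "holm")        # less conservative, but still controls the FWER
```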
### Compact Letter Display
It is common to visualize the results of the post-hoc tests with the so-called **Compact Letter Display** (cld). This doesn't work with the base `TukeyHSD`{.R} function, so we will use the `multcomp`{.R} package:
```{r chunk_chapter3_chunk37, echo=TRUE, eval=TRUE, fig.show='hide', message=FALSE, warning=FALSE}
library(multcomp)
fit = lm(weight ~ group, data = PlantGrowth)
tuk = glht(fit, linfct = mcp(group = "Tukey"))
summary(tuk) # Standard display.
tuk.cld = cld(tuk) # Letter-based display.
plot(tuk.cld)
```
The cld gives a new letter for each group of factor levels that are statistically indistinguishable. You can see the output via `tuk.cld`{.R}; here I only show the plot:
```{r chunk_chapter3_chunk38, echo=FALSE, eval=TRUE}
plot(tuk.cld)
```
## Specifying a multiple regression
Multiple (linear) regression means that we consider more than 1 predictor in the same model. The syntax is very easy: Just add your predictors (numerical or categorical) to your regression formula, as in the following example for the airquality dataset. To be able to also add a factor, I created a new variable fMonth to have month as a factor (categorical):
```{r chunk_chapter3_chunk47, echo=TRUE, eval=TRUE}
airquality$fMonth = factor(airquality$Month)
fit = lm(Ozone ~ Temp + Wind + Solar.R + fMonth, data = airquality)
```
In principle, everything is interpreted as before: First of all, residual plots:
```{r}
par(mfrow =c(2,2))
plot(fit)
```
::: callout-tip
Maybe you have noticed that the residual plots use fitted values on the x axis. This is in general the most parsimonious choice, as there can be many predictors in a multiple regression, each of which influences the response. However, when running a multiple regression, you should additionally plot residuals against each predictor separately - if you do so, you will often see a misfit that doesn't show up in the res \~ predicted plot.
:::
The resulting regression table already looks a bit intimidating,
```{r}
summary(fit)
```
Luckily, we also have the effect plots to make sense of this:
```{r}
library(effects)
plot(allEffects(fit, partial.residuals = T) )
```
::: callout-tip
Note that for the multiple regression, the scatter around the regression line displayed by the effects package is not the raw data, but the so-called partial residuals. Essentially, what this means is that we plot the fitted line for the respective predictor, and then calculate the residuals of the full model (including the other predictors) and add them to the regression line. The advantage is that this subtracts the effect of the other predictors, and removes possible spurious correlations that occur due to collinearity between predictors (see next section).
:::
| Formula | Meaning | Details |
|------------------|-------------------------|-----------------------------|
| `y~x_1` | $y=a_0 +a_1*x_1$ | Slope+Intercept |
| `y~x_1 - 1` | $y=a_1*x_1$ | Slope, no intercept |
| `y~I(x_1^2)` | $y=a_0 + a_1*(x_1^2)$ | Quadratic effect |
| `y~x_1+x_2` | $y=a_0+a_1*x_1+a_2*x_2$ | Multiple linear regression (two variables) |
| `y~x_1:x_2` | $y=a_0+a_1*(x_1*x_2)$ | Interaction between x~1~ and x~2~ |
| `y~x_1*x_2` | $y=a_0+a_1*(x_1*x_2)+a_2*x_1+a_3*x_2$ | Interaction and main effects |
: Formula syntax {#tbl-syntax}
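To check which design matrix a given formula produces, `model.matrix()` is handy; a small sketch with made-up data:

```{r}
d = data.frame(x_1 = c(1, 2, 3), x_2 = c(0, 1, 1))
model.matrix(~ x_1 * x_2, data = d)  # columns: intercept, x_1, x_2, and the product x_1:x_2
```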
## The Effect of Collinearity
A common misunderstanding is that a multiple regression is just a convenient way to run several independent univariate regressions. This, however, is wrong. A multiple regression is doing a fundamentally different thing. Let's compare the results of the multiple regression above to the simple regression.
```{r chunk_chapter3_chunk50, echo=TRUE, eval=TRUE}
fit = lm(Ozone ~ Wind , data = airquality)
summary(fit)
```
The simple regression estimates an effect of Wind of -5.55, while in the multiple regression, we had -3.1.
The reason is that a multiple regression estimates the effect of Wind, **corrected for the effect of the other variables**, while a univariate regression estimates the raw correlation of Ozone and Wind. This can make a big difference if Wind is correlated with other variables (the technical term is **collinear**). We can see that Wind and Temp are indeed correlated by running
```{r chunk_chapter3_chunk51, echo=TRUE, eval=TRUE}
plot(Wind ~ Temp, data = airquality)
```
In such a case, if Temp is not included in the regression, **the effect of Temp will be absorbed in the Wind effect to the extent to which both variables are collinear**. In other words, if one variable is missing, there will be a spillover of the effect of this variable to all collinear variables.
::: callout-tip
To give a simple example of this spillover, let's say
- Wind has a positive effect of 1
- Temp has a negative effect of -1
- Wind and Temp are correlated 50%
A multiple regression would estimate 1 and -1. In a single regression against Wind only, the collinear part (50%) of the Temp effect would be absorbed by Wind, so we would estimate a Wind effect of 0.5.
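This spillover is easy to reproduce in a small simulation. A sketch with standardized predictors constructed to be correlated at about 0.5; the variable names are made up for the example:

```{r}
set.seed(1)
temp = rnorm(1000)
wind = 0.5 * temp + sqrt(1 - 0.5^2) * rnorm(1000)  # cor(wind, temp) approx. 0.5
y = 1 * wind - 1 * temp + rnorm(1000, sd = 0.1)
coef(lm(y ~ wind + temp))  # recovers approximately 1 and -1
coef(lm(y ~ wind))         # wind absorbs the collinear temp part: around 0.5
```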
:::
Note that the same happens in univariate plots of the correlation, i.e. this is not a problem of the linear regression, but rather of looking at the univariate correlation of Ozone \~ Wind without correcting for the effect of Temp.
Also, note that collinearity is not fundamentally a problem for the regression: estimates stay unbiased, and p-values and CIs remain calibrated under arbitrarily high collinearity. However, the more collinearity we have, the more uncertain the regression estimates become (with correspondingly large p-values). Here is an example, where we simulate data under the assumption that x1 and x2 each have an effect of 1:
```{r}
set.seed(123)
x1 = runif(100)
x2 = 0.95 *x1 + 0.05 * runif(100)
y = x1 + x2 + rnorm(100)
summary(lm(y ~ x1 + x2))
```
We see that the multiple regression is not terribly wrong, but very uncertain about the true values. Some authors therefore advise removing collinear predictors. This can sometimes be useful, in particular for prediction (see the section on model selection); however, note that the effect of the removed predictor will then be absorbed by the remaining predictor(s). In our case, the univariate regression would be very sure that there is an effect, but note that its CI does not cover the true effect of x1 (which was 1).
```{r}
summary(lm(y ~ x1))
```
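This degree of collinearity can also be quantified with the variance inflation factor (VIF), which measures how much the variance of each estimate is inflated relative to uncorrelated predictors. A sketch using the `vif()` function from the car package (assuming it is installed); values above roughly 10 are commonly taken as a warning sign:

```{r}
library(car)
vif(lm(y ~ x1 + x2))  # very large here, as x1 and x2 are nearly identical
```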
::: {.callout-tip collapse="true"}
#### Understanding collinearity spillover in more detail
We can understand the problem of one variable influencing the effect of the other in more detail if we simulate some data. Let's create 2 positively collinear predictors:
```{r chunk_chapter3_chunk52, echo=TRUE, eval=TRUE}
x1 = runif(100, -5, 5)
x2 = x1 + 0.2*runif(100, -5, 5)
```
We can check whether this has worked:
```{r chunk_chapter3_chunk53, echo=TRUE, eval=TRUE}
cor(x1, x2)
```
Now, the first case I want to look at is when the effects of x1 and x2 have equal signs. Let's create such a situation by simulating a normal response $y$, where the intercept is 0, and both predictors have effect = 1:
```{r chunk_chapter3_chunk54, echo=TRUE, eval=TRUE}
y = 0 + 1*x1 + 1*x2 + rnorm(100)
```
If you look at the formula, you can almost see visually what the problem of the univariate regression is: as x2 is nearly identical to x1, the apparent slope between y and either x1 or x2 will be around 2 (although the true effect is 1).
Consequently, the univariate models will estimate effect sizes that are too high. We also say that by taking out one predictor, the remaining one absorbs the effect of the other predictor.
```{r chunk_chapter3_chunk55, echo=TRUE, eval=TRUE}
coef(lm(y ~ x1))
coef(lm(y ~ x2))
```
The multivariate model, on the other hand, gets the right estimates (with a bit of error):
```{r chunk_chapter3_chunk57, echo=TRUE, eval=TRUE}
coef(lm(y~x1 + x2))
```
You can also see this visually:
```{r chunk_chapter3_chunk56, echo=TRUE, eval=TRUE}
par(mfrow = c(1, 2))
plot(x1, y, main = "x1 effect", ylim = c(-12, 12))
abline(lm(y ~ x1))
# Draw a line with intercept 0 and slope 1,
# just like we simulated the true dependency of y on x1:
abline(0, 1, col = "red")
legend("topleft", c("fitted", "true"), lwd = 1, col = c("black", "red"))
plot(x2, y, main = "x2 effect", ylim = c(-12, 12))
abline(lm(y ~ x2))
abline(0, 1, col = "red")
legend("topleft", c("fitted", "true"), lwd = 1, col = c("black", "red"))
```
Now check what happens if the two predictors are negatively correlated, so that their contributions to the response work against each other.
```{r , message=FALSE, warning=FALSE}
x1 = runif(100, -5, 5)
x2 = -x1 + 0.2*runif(100, -5, 5)
y = 0 + 1*x1 + 1*x2 + rnorm(100)
cor(x1, x2)
coef(lm(y ~ x1))
coef(lm(y ~ x2))
par(mfrow = c(1, 2))
plot(x1, y, main = "x1 effect", ylim = c(-12, 12))
abline(lm(y ~ x1))
abline(0, 1, col = "red")
legend("topleft", c("fitted", "true"), lwd = 1, col = c("black", "red"))
plot(x2, y, main = "x2 effect", ylim = c(-12, 12))
abline(lm(y ~ x2))
abline(0, 1, col = "red")
legend("topleft", c("fitted", "true"), lwd = 1, col = c("black", "red"))
coef(lm(y~x1 + x2))
```
*Both effects cancel out*.
:::
## Centering and Scaling of Predictor Variables
### Centering
In our multiple regression, we saw an intercept of -74. By definition, the intercept is the predicted value for $y$ (Ozone) when all predictor variables are 0. It's fine to report this, as long as we are interested in this value, but often the value at predictor = 0 is not particularly interesting. In this specific case, a value of -74 clearly doesn't make sense, as Ozone concentrations can only be positive.
To explain what's going on, let's look at the univariate regression against Temp:
```{r chunk_chapter3_chunk14, echo=TRUE, eval=TRUE}
fit = lm(Ozone ~ Temp, data = airquality)
summary(fit)
```
Now, the intercept is -146. We can see the reason more clearly when we plot the results:
```{r chunk_chapter3_chunk15, echo=TRUE, eval=TRUE}
plot(Ozone ~ Temp, data = airquality, xlim = c(-10, 110), ylim = c(-200, 170))
abline(fit)
abline(h = 0, lty = 2)
abline(v = 0, lty = 2)
```
That shows us that the value 0 is far outside the range of our observed values for Temp, which is measured in Fahrenheit. Thus, we are extrapolating Ozone far beyond the observed data. To avoid this, we can simply redefine the x-axis by subtracting the mean, which is called **centering**:
```{r, echo=TRUE, eval=TRUE}
airquality$cTemp = airquality$Temp - mean(airquality$Temp)
```
Alternatively, you can center with the built-in R command `scale()`:
```{r chunk_chapter3_chunk17, echo=TRUE, eval=TRUE}
airquality$cTemp = scale(airquality$Temp, center = T, scale = F)
```
Fitting the model with the centered variable moves the intercept to the middle of the predictor range,
```{r chunk_chapter3_chunk18, echo=TRUE, eval=TRUE}
fit = lm(Ozone ~ cTemp, data = airquality)
plot(Ozone ~ cTemp, data = airquality)
abline(fit)
abline(v = 0, lty = 2)
```
which produces a more interpretable value for the intercept: when we center, the intercept can be interpreted as the Ozone concentration at the mean temperature. For a univariate regression, this value will also typically be very similar to the grand mean `mean(airquality$Ozone)`.
```{r}
summary(fit)
```
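We can check the claim about the grand mean directly. Note that Ozone contains missing values, hence `na.rm = TRUE`; the intercept and this mean will differ slightly, because `lm()` drops the rows with missing Ozone when fitting:

```{r}
mean(airquality$Ozone, na.rm = TRUE)
```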
::: callout-caution
There is no simple way to center categorical variables in a fixed effect model. Thus, note that if you center, the intercept is predicted for the **reference group** of all categorical variables, evaluated at the mean of all numeric variables. However, as we will see later, **random effects**, among other things, are a means to center categorical variables, as they estimate a **grand mean** over all groups.
:::
### Scaling
Another very common transformation is to divide the x axis by a certain value to bring all variables on a similar scale. This is called **scaling**.
The main motivation for scaling is to make **effect sizes in a multiple regression more comparable**. As an example, look again at the multiple regression with 4 predictors:
```{r chunk_chapter3_chunk58, echo=TRUE, eval=TRUE}
airquality$fMonth = factor(airquality$Month)
fit = lm(Ozone ~ Temp + Wind + Solar.R + fMonth, data = airquality)
summary(fit)
```
So, which of the predictors is the strongest (= has the largest effect on the response)? Superficially, it looks as if Month has the highest values. But we have to remember that the effect on the response is $y = \text{regression estimate} * \text{predictor}$, i.e. a predictor with a large range (difference between min/max values) may have a strong effect even though its estimate is small. So, if the predictors have different ranges, we cannot compare their estimates directly regarding their "effective influence" on y.
To make the effect sizes comparable, we can scale all numeric predictors by dividing them by their standard deviation, which will bring them all roughly onto a range between -2 and 2. You can do this by hand, or use the `scale()`{.R} function in R:
::: callout-tip
By default, the `scale(...)`{.R} function will scale and center. As discussed before, centering is nearly always useful, as it improves the interpretability of the intercept, so I would suggest using it as a default when scaling.
:::
```{r}
airquality$sTemp = scale(airquality$Temp)
airquality$sWind = scale(airquality$Wind)
airquality$sSolar.R = scale(airquality$Solar.R)
```
::: callout-tip
Here, I create a new variable for each scaled predictor. It is possible to use the `scale()` command inside the regression formula as well, but that sometimes creates problems with downstream plotting functions, so I recommend creating new variables in your data for the scaled predictors.
:::
Running the regression:
```{r chunk_chapter3_chunk60, echo=TRUE, eval=TRUE}
fit = lm(Ozone ~ sTemp + sWind + sSolar.R + fMonth, data = airquality)
summary(fit)
```
As you see, the effect sizes have now changed considerably. The difference in interpretation is the following: for the unscaled variables, we estimate the effect of a 1-unit change (e.g. 1 degree for Temp). For the scaled variables, we estimate the effect of a change of 1 standard deviation. If we look at the sd of temperature
```{r}
sd(airquality$Temp)
```
we can conclude that the scaled regression estimates the effect of a change of about 9.47 degrees in temperature.
While the latter is numerically a bit confusing, the advantage is that we can interpret the effect on Ozone relative to typical temperature differences in the data, and likewise for all other variables. Consequently, it makes sense to compare the scaled effect sizes between variables, which here suggests that Temp is the most important predictor.
What about categorical variables, do we need to scale them as well? The answer is no, because they are effectively already scaled to a range of 1: in the standard contrasts (treatment contrasts), we estimate the effect of going from one level to another. However, note that because a variable scaled by its sd typically spans about 2 units (+/- 1 sd), some authors scale by 2 sd instead, so that numeric and categorical variables are more comparable in their effect sizes.
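If you want to follow that convention, you can pass the value to divide by directly to `scale()`. Here a sketch for Temp; the variable name `s2Temp` is just for illustration:

```{r}
airquality$s2Temp = scale(airquality$Temp,
                          scale = 2 * sd(airquality$Temp))
sd(airquality$s2Temp)  # 0.5, so a 1-unit change corresponds to 2 sd
```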
::: {.callout-caution icon="false"}
#### Discuss
Under which circumstances should you center / scale, and how should you discuss the estimated coefficients in a paper?
:::
::: {.callout-tip collapse="true" appearance="minimal" icon="false"}
#### Solution
Scaling: with scaled predictors, you get an estimate of relative importance, while unscaled, effects are interpretable in their original units. Depending on what you are more interested in, you may report one of the two, or both. Centering: changes the location of the origin, and thus the interpretation of the intercept (and possibly also of the main effects if interactions are present, see next chapter). By default, I would scale.
:::
::: {.callout-caution icon="false"}
#### Exercise
How does adding or multiplying a factor on the predictor change the intercept and slope (effect, CI, p-values) of a regression?
:::
::: {.callout-tip collapse="true" appearance="minimal" icon="false"}
#### Solution
Original model
```{r, message=FALSE, warning=FALSE}
fit = lm(Ozone ~ Temp, data = airquality)
summary(fit)
```
```{r, message=FALSE, warning=FALSE}
fit = lm(Ozone ~ I(Temp + 100), data = airquality)
summary(fit)
```
Additive transformations change the intercept value. All p-values and CIs stay the same (except for the intercept, as the test contrast changes).
```{r, message=FALSE, warning=FALSE}
fit = lm(Ozone ~ I(Temp * 10), data = airquality)
summary(fit)
```
Multiplicative transformations change the slope value. P-values and relative CIs for intercept and slope stay the same. Combining both transformations has both effects together.
:::
## Interactions
### Syntax
When we have multiple variables, we can have the situation that the value of one variable influences the effect of the other(s). Technically, this is called an **interaction**. In situations where the causal direction is known, this is also called a **moderator**. An example: Imagine we observe that the effect of aspirin differs depending on the weight of the subject. Technically, we have an interaction between aspirin and weight. Physiologically, we know the causal direction is "weight -\> effect of aspirin", so we can say weight is a moderator for the effect of aspirin.
```{r chunk_chapter3_chunk68, echo=TRUE, eval=TRUE}
fit = lm(Ozone ~ Temp * Wind, data = airquality)
plot(allEffects(fit))
```
We will have a look at the summary later, but for the moment, let's just look at the output visually. In the effect plots, we see the effect of Temperature on Ozone for different values of Wind. We also see that the slope changes. For low Wind, we have a strong effect of Temperature. For high Wind, the effect is basically gone.
Let's look at the interaction syntax in more detail. The "\*" operator in an `lm()`{.R} is a shorthand for main effects + interactions. You can write equivalently:
```{r chunk_chapter3_chunk69, echo=TRUE, eval=FALSE}
fit = lm(Ozone ~ Wind + Temp + Wind:Temp, data = airquality)
```
What is fit here is literally a third predictor that is specified as Wind \* Temp (normal multiplication). The above syntax would allow you to also have interactions without main effects, e.g.:
```{r chunk_chapter3_chunk70, echo=TRUE, eval=FALSE}
fit = lm(Ozone ~ Wind + Wind:Temp, data = airquality)
```
This, however, is generally **not** advisable, because the main effect influences the interaction, unless you are sure that the main effect must be zero.
There is another important syntax in R:
```{r chunk_chapter3_chunk71, echo=TRUE, eval=TRUE}
fit = lm(Ozone ~ (Wind + Temp + Solar.R)^2 , data = airquality)
summary(fit)
plot(allEffects(fit), selection = 1)
plot(allEffects(fit), selection = 2)
plot(allEffects(fit), selection = 3)
```
This creates all main effects and second-order (aka two-way) interactions between the variables. You can also use `^3`{.R} to create all possible 2-way and 3-way interactions between the variables in the parentheses. By the way: The `()^2`{.R} syntax for interactions is the reason why we have to write `I(x^2)`{.R} if we want to specify a quadratic effect in an lm.
### Interactions with categorical variables
When you include an interaction with a categorical variable, that means a separate effect will be fit for each level of the categorical variable, as in
```{r chunk_chapter3_chunk72, echo=TRUE, eval=TRUE}
fit = lm(Ozone ~ Wind * fMonth, data = airquality)
summary(fit)
```
The interpretation is as for a single categorical predictor: the Wind effect is the effect for the reference month (month 5), and the Wind:fMonth6 estimate, for example, tests for a difference in the Wind effect between month 5 (reference) and month 6. As before, you can change this behavior by changing the contrasts, e.g. to mean contrasts for the main effects:
```{r echo=TRUE, eval=TRUE}
fit = lm(Ozone ~ Wind * fMonth - 1, data = airquality)
```
and to change to mean contrasts for both main effects and interactions, we can type:
```{r echo=TRUE, eval=TRUE}
fit = lm(Ozone ~ Wind * fMonth - 1 - Wind, data = airquality)
```
### Interactions and centering
A super important topic when working with numeric interactions is centering. To get this started, let's make a small experiment:
::: {.callout-caution icon="false"}
#### Task
Compare the estimates for Wind / Temp for the following models
- Ozone \~ Wind
- Ozone \~ Temp
- Ozone \~ Wind + Temp
- Ozone \~ Wind \* Temp
How do you explain the differences in the estimates for the main effects of Wind and Temp? What do you think corresponds most closely to the "true" effect of Wind and Temp?
:::
What you should have seen in the models above is that the main effects for Wind / Temp change significantly when the interaction is introduced.
Maybe you know the answer already. If not, consider the following simulation, where we create data with known effect sizes:
```{r chunk_chapter3_chunk73, echo=TRUE, eval=TRUE}
# Create predictor variables.
x1 = runif(100, -1, 1)
x2 = runif(100, -1, 1)
# Create response for lm, all effects are 1.
y = x1 + x2 + x1*x2 + rnorm(100, sd = 0.3)
# Fit model, but shift the mean of the predictor.
fit = lm(y ~ x1 * I(x2 + 5))
summary(fit)
plot(allEffects(fit))
```
Play around with the shift in x2, and observe how the effects change. See how the estimates change when centering the variables via the `scale()`{.R} command. If you understand what's going on, you will realize that you should **always center** your variables whenever you use interactions.\
Excellent explanations of these issues can also be found in the following paper:\
<a href="https://besjournals.onlinelibrary.wiley.com/doi/epdf/10.1111/j.2041-210X.2010.00012.x" target="_blank" rel="noopener">https://besjournals.onlinelibrary.wiley.com/doi/epdf/10.1111/j.2041-210X.2010.00012.x</a>.
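To see the remedy in action, here is the same simulation refit after centering the shifted predictor; the main effects return to values close to the true effects of 1 (variable names `x2shifted` and `cx2` are just for this example):

```{r}
set.seed(42)
x1 = runif(100, -1, 1)
x2 = runif(100, -1, 1)
y = x1 + x2 + x1*x2 + rnorm(100, sd = 0.3)
x2shifted = x2 + 5
coef(lm(y ~ x1 * x2shifted))       # x1 main effect distorted (around -4)
cx2 = x2shifted - mean(x2shifted)  # centering the shifted predictor
coef(lm(y ~ x1 * cx2))             # main effects close to 1 again
```

The distortion arises because the x1 main effect is the slope of x1 at the point where the other predictor is 0, which after the shift lies far outside the data; centering moves that point back to the middle of the data.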
### Interactions in GAMs
The equivalent of an interaction in a GAM is a **tensor spline**. It describes how the response varies as a smooth function of two predictors. You can specify a tensor spline with the `te()` command, as in
```{r, echo=TRUE, eval=TRUE}
library(mgcv)
fit = gam(Ozone ~ te(Wind, Temp) , data = airquality)
plot(fit, pages = 0, residuals = T, pch = 20, lwd = 1.9, cex = 0.4)
```
## Predictions
To predict with a fitted model
```{r}
fit = lm(Ozone ~ Wind, data = airquality)
```
we can use the `predict()` function.
```{r,eval=F}
predict(fit) # predicts on current data
predict(fit, newdata = X) # predicts on new data
predict(fit, se.fit = T)
```
To generate confidence intervals on predictions, we can use `se.fit = T`, which returns standard errors of the predictions. As always, the 95% confidence interval (CI) is approximately the prediction ± 1.96 \* se.
So, let's assume we want to predict Ozone with a 95% CI for Wind between 0 and 10:
```{r}
Wind = seq(0,10,0.1)
newData = data.frame(Wind = Wind)
pred = predict(fit, newdata = newData, se.fit = T)
plot(Wind, pred$fit, type = "l")
lines(Wind, pred$fit - 1.96 * pred$se.fit, lty = 2)
lines(Wind, pred$fit + 1.96 * pred$se.fit, lty = 2)
```
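Alternatively, `predict.lm()` can compute the confidence interval directly via its `interval` argument (it uses the exact t-quantile rather than the 1.96 approximation), returning a matrix with columns `fit`, `lwr` and `upr`:

```{r}
predInt = predict(fit, newdata = newData, interval = "confidence")
head(predInt)
```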
## Nonparametrics and the bootstrap
The linear model (LM) has the beautiful property that it delivers unbiased estimates, calibrated p-values and so on from analytical expressions that remain valid regardless of the sample size. Unfortunately, very few extensions of the LM completely maintain these desirable properties for limited sample sizes. The underlying problem is that for more complicated models, there are usually no closed-form solutions for the expected null or sampling distributions, or those distributions depend on the observed data. For that reason, non-parametric methods to generate confidence intervals and p-values are increasingly used for more complicated models, which motivates this first small excursion into non-parametric methods, in particular cross-validation and the bootstrap.
There are two non-parametric methods that you should absolutely know and that form the backbone of basically all non-parametric regression analyses with statistical and machine learning models. Those two are:
1. the bootstrap: generates the variance of estimates, allowing us to calculate non-parametric CIs and p-values
2. cross-validation: generates non-parametric estimates of the predictive error of a model (connects to ANOVA and AIC, see the later sections on ANOVA and model selection)
Here, we will look at the bootstrap. Cross-validation will be introduced in the chapter on model selection.
::: {.callout-tip collapse="true"}
#### Excursion: randomization null models
Another non-parametric method that you may encounter are randomization null models. Let's look at this with an example: here, we create data from a normal and a lognormal distribution. We are interested in the difference of the mean of the two (test statistic).
```{r}
set.seed(1337)
groupA = rnorm(50)
groupB = rlnorm(50)
dat = data.frame(value = c(groupA, groupB), group = factor(rep(c("A", "B"), each = 50)))
plot(value ~ group, data = dat)
# test statistic: difference of the means
reference = mean(groupA) - mean(groupB)
```
We can show mathematically that means of repeated draws from a normal distribution are Student-t distributed, which is why we can use this distribution in the t-test to calculate the probability of a certain deviation from a null expectation. If we work with different distributions, however, we may not be so sure what the null distribution should be. Either we have to do the (possibly very complicated) math to get a closed-form expression for this problem, or we try to generate a non-parametric expectation for the null distribution via a **randomization null model**.
The idea is that we generate a null expectation for the test statistic by re-shuffling the data in a way that conforms to the null assumption. In this case, we can just randomize the group labels, which means that we assume that both distributions (A,B) are the same (note that this is a bit more than asking if they have the same mean, but for the sake of the example, we are happy with this simplification).
```{r}
nSim = 5000
nullDistribution = rep(NA, nSim)
for(i in 1:nSim){
sel = dat$value[sample.int(100, size = 100)]
nullDistribution[i] = mean(sel[1:50]) - mean(sel[51:100])
}
hist(nullDistribution, xlim = c(-2,2))
abline(v = reference, col = "red")
ecdf(nullDistribution)(reference) # 1-sided p-value
```
Voila, we have a p-value, and it is significant. Randomization null models are used in many R packages where analytical p-values are not available, e.g. in the packages vegan (ordinations, community analysis) and bipartite (networks).
:::
### Non-parametric bootstrap
The key problem to calculate parametric p-values or CIs is that we need to know the distribution of the statistics of interest that we would obtain if we (hypothetically) repeated the experiment many times. The bootstrap generates new data so that we can approximate this distribution. There are two variants of the bootstrap:
1. Non-parametric bootstrap: we sample new data from the old data with replacement
2. Parametric bootstrap: we sample new data from the fitted model
Let's first look at the non-parametric bootstrap, using a simple regression of Ozone \~ Wind
```{r}
fit = lm(Ozone ~ Wind, data = airquality)
```
First, for convenience, we write a function that extracts the estimate that we want to bootstrap from a fitted model. I decided to extract the slope estimate for Wind, but you could make different choices.
```{r}
getEstimate <- function(model){
coef(model)[2]
}
getEstimate(fit)
```
Next, we perform the non-parametric bootstrap
```{r}
performBootstrap = function(){
sel = sample.int(n = nrow(airquality), replace = T)
fitNew = lm(Ozone ~ Wind, data = airquality[sel,])
return(getEstimate(fitNew))
}
bootstrappedEstimate = replicate(1000, performBootstrap())
```
The bootstrapped estimates of the slope can basically be interpreted as the uncertainty of the slope estimate.
```{r}
hist(bootstrappedEstimate, breaks = 50)
abline(v = mean(bootstrappedEstimate), col = "red", lwd = 2)
abline(v = getEstimate(fit), col = "red", lwd = 2, lty = 2)
sd(bootstrappedEstimate)
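# From the bootstrapped distribution we can also read off a non-parametric
# 95% confidence interval by taking the 2.5% and 97.5% quantiles
# (the so-called percentile interval):
quantile(bootstrappedEstimate, probs = c(0.025, 0.975))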