Watching Claus Wilke’s rstudio::conf2019 talk on his new ungeviz R package to visualize uncertainty with ggplot2, I realized that I only have a fuzzy idea of what bootstrapping is. For some reason I didn’t find the Wikipedia entry particularly illuminating for what I assumed to be a simple procedure. Also, it is Sunday and I wanted to watch a video rather than read something.
Resampling and Bootstrapping
Looking for someone to explain this to me in terms I can understand on a Sunday, I found Christina Knudson’s YouTube channel, where she does exactly that in small, easily digestible chunks. Thanks! After watching her five-part series on bootstrapping, it turns out that bootstrapping really just means resampling a random sample with replacement at the same size as the original sample.
Or, in the words of the rsample help: “A bootstrap sample is a sample that is the same size as the original data set that is made using replacement. This results in analysis samples that have multiple replicates of some of the original rows of the data.”
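To make that concrete, here is a minimal sketch of the idea in base R (my own illustration, not from the rsample docs): draw the row indices of a data frame with replacement, at the original size.
idx = sample(nrow(mtcars), size = nrow(mtcars), replace = TRUE)
resample = mtcars[idx, ]   # same size as mtcars, with some rows repeated
sum(duplicated(idx))       # how many indices are repeats in this resample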
Bootstrapping is used, for example, to estimate the uncertainty of statistical properties of a sample without having to make a lot of assumptions about it (e.g. distribution type, sample size, etc.).
There are lots of different R packages for resampling and bootstrapping, but it so happens that Alex Hayes, in his talk about the spectacular broom package at the same rstudio conference, uses the rsample package from the tidymodels family to bootstrap. So this seems a good choice.
With this, creating a number of bootstrap samples from the mtcars dataset is straightforward.
library(tidyverse)
library(rsample)
samples = 1000
boots = bootstraps(mtcars, times = samples)
head(boots)
## # A tibble: 6 x 2
## splits id
## * <list> <chr>
## 1 <split [32/14]> Bootstrap0001
## 2 <split [32/13]> Bootstrap0002
## 3 <split [32/10]> Bootstrap0003
## 4 <split [32/12]> Bootstrap0004
## 5 <split [32/13]> Bootstrap0005
## 6 <split [32/8]> Bootstrap0006
boots is a compact but somewhat daunting return value. It is a tibble that contains a list column (splits) and an id column. The id column simply holds an identifier for each bootstrap resample. Each element of splits is itself a list of length 4: the first element points to the original data set, the second holds the indices of the resample, and the rest are irrelevant (at this point). So the pointer to the original data frame in the first row can be accessed like this: boots$splits[[1]]$data. This is really fast because (as I understand it) nothing is created or copied yet; we only have a pointer to the original data and a vector of integers with the sampled indices.
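To poke at this structure yourself, something like the following should work (a sketch assuming the element holding the indices is called in_id, as it is in the rsample versions I have seen):
first_split = boots$splits[[1]]
identical(first_split$data, mtcars)  # TRUE: just a reference to the original data
head(first_split$in_id)              # the sampled row indices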
To replicate Christina’s example in the video, I simply want to bootstrap to estimate the density distribution of the mean of the wt variable. To do this, I need to calculate the mean of wt for each of the 1000 bootstrap samples.
Because I want to do this for each resample and boots$splits is a list column, the purrr package will be helpful if I want to avoid writing lots of loops or, god forbid, using the l/s/apply functions, which I have regrettably never become friends with. If you are unfamiliar with purrr, Charlotte Wickham’s tutorial is a great place to start, especially considering that I am only linking videos today.
The first step in figuring out how to map something with purrr is to figure it out for a single row. (What I am doing now comes with no guarantees for efficiency.) What I want is the actual data frame for the resample, not just the index vector and pointer to the original data currently stored in splits. Conveniently, the somewhat cryptically named analysis function in the rsample package does just that.
bootstrap_sample = analysis(boots$splits[[1]])
head(bootstrap_sample)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
So calculating the mean of wt in the first resample works simply like this:
mean(analysis(boots$splits[[1]])$wt)
## [1] 2.929
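For comparison, the mean of wt in the original data, around which the bootstrap means should scatter:
mean(mtcars$wt)
## [1] 3.21725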
To generalize this to the purrr recipe for all resamples, I simply replace the first element [[1]] with .x.
boots_mean = boots %>%
  mutate(mn = map_dbl(splits, ~ mean(analysis(.x)$wt)))
head(boots_mean)
## # A tibble: 6 x 3
## splits id mn
## * <list> <chr> <dbl>
## 1 <split [32/9]> Bootstrap0001 3.46
## 2 <split [32/12]> Bootstrap0002 3.10
## 3 <split [32/13]> Bootstrap0003 3.55
## 4 <split [32/13]> Bootstrap0004 2.96
## 5 <split [32/8]> Bootstrap0005 3.06
## 6 <split [32/13]> Bootstrap0006 3.14
Having added the column containing all the bootstrap sample means, I can use ggplot to approximate the sampling distribution of the mean.
boots_mean %>%
  ggplot(aes(mn)) +
  geom_density() +
  labs(title = "Approximated sampling distribution of mean wt",
       x = "mean wt",
       y = "density") +
  theme_minimal()
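As a small addition that is not in Christina’s video: the same vector of bootstrap means directly yields a simple 95% percentile interval for the mean of wt.
quantile(boots_mean$mn, probs = c(0.025, 0.975))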
For completeness’ sake, I am also replicating Alex’s example here, where he uses bootstrapping and the augment function of the broom package to visualize the uncertainty of a model fit. His example is the relationship between mpg and wt in mtcars.
mtcars %>%
  ggplot(aes(wt, mpg)) +
  geom_point() +
  theme_minimal()
Again, first working things out for a single case to feed to purrr, we construct the helper function fit_nls_on_bootstrap that fits a nonlinear model to a single split. We then map it over all resamples and use broom’s tidy function to extract the model coefficients k and b.
fit_nls_on_bootstrap <- function(split) {
  nls(
    mpg ~ k / wt + b,
    analysis(split),
    start = c(k = 1, b = 0)
  )
}
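Before mapping over all resamples, a quick sanity check that the helper behaves on a single split:
single_fit = fit_nls_on_bootstrap(boots$splits[[1]])
coef(single_fit)  # point estimates for k and b on this one resample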
library(broom)  # tidy() below comes from broom, so load it before use
boots_fit = boots %>%
  mutate(fit = map(splits, fit_nls_on_bootstrap),
         coef_info = map(fit, tidy))
head(boots_fit)
## # A tibble: 6 x 4
## splits id fit coef_info
## * <list> <chr> <list> <list>
## 1 <split [32/14]> Bootstrap0001 <S3: nls> <tibble [2 × 5]>
## 2 <split [32/13]> Bootstrap0002 <S3: nls> <tibble [2 × 5]>
## 3 <split [32/10]> Bootstrap0003 <S3: nls> <tibble [2 × 5]>
## 4 <split [32/12]> Bootstrap0004 <S3: nls> <tibble [2 × 5]>
## 5 <split [32/13]> Bootstrap0005 <S3: nls> <tibble [2 × 5]>
## 6 <split [32/8]> Bootstrap0006 <S3: nls> <tibble [2 × 5]>
This returns an even more frightening object than before. Fortunately, we can use unnest to flatten this horrendous thing into something more sane.
boots_coefs = boots_fit %>%
  unnest(coef_info)
head(boots_coefs)
## # A tibble: 6 x 6
## id term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Bootstrap0001 k 43.3 5.20 8.33 2.69e- 9
## 2 Bootstrap0001 b 5.72 1.93 2.96 5.92e- 3
## 3 Bootstrap0002 k 46.3 4.12 11.2 2.91e-12
## 4 Bootstrap0002 b 3.91 1.42 2.75 1.00e- 2
## 5 Bootstrap0003 k 38.9 3.93 9.89 5.95e-11
## 6 Bootstrap0003 b 6.64 1.45 4.59 7.31e- 5
And plot the histograms for k and b:
boots_coefs %>%
  ggplot(aes(estimate)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~ term, scales = "free") +
  labs(title = "Approximated sampling distributions of k and b",
       x = "value",
       y = "count") +
  theme_minimal()
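If numbers are preferred over histograms, the same tibble can be condensed into bootstrap means and 95% percentile intervals (my own addition, not part of Alex’s example):
boots_coefs %>%
  group_by(term) %>%
  summarise(mean  = mean(estimate),
            lower = quantile(estimate, 0.025),
            upper = quantile(estimate, 0.975))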
Finally, we use broom::augment to get fitted values from each model and plot the fits of the first 100 bootstrap samples.
boot_aug = boots_fit %>%
  mutate(augmented = map(fit, augment)) %>%
  unnest(augmented)

boot_aug %>%
  filter(id %in% unique(boot_aug$id)[1:100]) %>%
  ggplot(aes(wt, mpg)) +
  geom_point() +
  geom_line(aes(y = .fitted, group = id), alpha = 0.1) +
  theme_minimal()
Sadly, it seems the ungeviz package has to wait for another day…