Statistical computing with vectorised operations on distributions

9th July 2024 @ UseR! 2024

Mitchell O’Hara-Wild, Monash University

Supervised by Rob Hyndman and George Athanasopoulos

Using distributions in R

R is a statistical computing language, so analysis code written in R often involves distributions.

Common uses of distributions

  • Predictions from models
  • Simulation studies
  • Hypothesis testing
  • Teaching statistics

Plentiful packages

The probability distributions CRAN Task View lists over 300 packages for probability distributions!

Using distributions in R

Problems with probability

Using distributions in R is unnecessarily complex!

Difficult distributions

There are two fundamental problems that make distributions difficult to use in R.

  1. The design of distribution functions
  2. The distribution outputs of models and tests

A better way?

An improved distributional design overcomes these issues.

R’s distribution functions

The distributions included in R (and many extension packages) provide p/d/q/r functions for statistical operations on distributions.

The p/d/q/r functions

These functions let you compute statistical quantities from distributions:

  • p: The ‘probability’ (CDF)
  • d: The ‘density’ (PDF)
  • q: The ‘quantiles’
  • r: The ‘random’ samples

Some packages also define m functions for moments!

R’s distribution functions

Using p/d/q/r functions

These operation prefixes are combined with the distribution’s abbreviated name (its ‘shape’). The general form is:

<op><shape>(<args>, <parameters>)

For example, the cumulative probability (p) at q = 0.5 of a Normal (norm) distribution with mean 1 and standard deviation 3 is:

pnorm(q = 0.5, mean = 1, sd = 3)

[1] 0.4338162

The quantile at probability 0.975 is:

qnorm(p = 0.975, mean = 1, sd = 3)

[1] 6.879892
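The remaining two prefixes work the same way. As a quick sketch for the same N(1, 9) distribution, the density (d) and random sampling (r) counterparts are:

```r
# Density (PDF) of N(1, 9) evaluated at 0.5
dnorm(x = 0.5, mean = 1, sd = 3)
# approximately 0.13115

# Five random draws from N(1, 9)
rnorm(n = 5, mean = 1, sd = 3)
```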

Name conflicts

Ever saved your dataset as df or dt? Think again! Those names mask the F and Student’s t density functions df() and dt().

R’s distribution functions

Risky recycling

These p/d/q/r functions are vectorised and fast 🎉

How these functions vectorise inputs is surprising 👻

Let’s calculate 95% intervals of N(1,9).

qnorm(c(0.025, 0.975), mean = 1, sd = 3)
[1] -4.879892  6.879892

Great! What about multiple distributions N(1,9), N(3,4), N(-1, 16) and N(2, 1)?

qnorm(c(0.025, 0.975), 
      mean = c(1, 3, -1, 2), sd = c(3, 2, 4, 1))
[1] -4.879892  6.919928 -8.839856  3.959964
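The reason: the arguments recycle elementwise, so each probability is paired with just one distribution and we get only one bound per distribution. With base R, obtaining both bounds for every distribution needs an explicit loop — a sketch using mapply():

```r
# Both 95% interval bounds for each of the four normal distributions.
# mapply() loops over the parameter vectors, returning one column
# (lower and upper bound) per distribution.
mapply(
  function(m, s) qnorm(c(0.025, 0.975), mean = m, sd = s),
  m = c(1, 3, -1, 2),
  s = c(3, 2, 4, 1)
)
```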

R’s distribution functions

Well designed functions have…

  • Clear and descriptive function names
  • Well-defined and named parameters
  • Minimal code duplication
  • Predictable behaviour

Design drawbacks

The p/d/q/r design leaves much to be desired.

  • Short and confusing function names
  • Mixed parameters between operation and distribution
  • Surprising recycling behaviour

R’s modelling output

The distributional nature of predictions and hypothesis testing is de-emphasised in R.

Predicting penguins

Consider the output when using predict() on a lm() for the length of penguin bills using depth and species.

fit <- lm(
  bill_length_mm ~ species*bill_depth_mm, 
  data = palmerpenguins::penguins
)
predict(fit, tail(palmerpenguins::penguins))
       1        2        3        4        5        6 
46.10333 51.48517 48.21763 48.40983 49.94750 49.37088 

Where’s the uncertainty!

By default, predictions only return the expected value.

R’s modelling output

Finding uncertainty!

To obtain the other parameters for the distribution, we set se.fit = TRUE in predict().

predict(fit, tail(palmerpenguins::penguins), 
        se.fit = TRUE)
$fit
       1        2        3        4        5        6 
46.10333 51.48517 48.21763 48.40983 49.94750 49.37088 

$se.fit
        1         2         3         4         5         6 
0.4769975 0.4685605 0.3082198 0.3020842 0.3333431 0.3054341 

$df
[1] 336

$residual.scale
[1] 2.444663

This gives more than just standard errors… but still no probability distribution.

R’s modelling output

Here’s the code to obtain 95% prediction intervals for the penguins data:

pred <- predict(fit, tail(palmerpenguins::penguins), se.fit = TRUE)
sprintf(
  "[%f, %f]95",
  pred$fit + qt(0.025, df = pred$df) * pred$se.fit,
  pred$fit + qt(0.975, df = pred$df) * pred$se.fit
)
[45.165056, 47.041611]95
[50.563487, 52.406850]95
[47.611342, 48.823910]95
[47.815620, 49.004049]95
[49.291799, 50.603204]95
[48.770072, 49.971680]95

Not quite right

The above calculations are for confidence intervals, not prediction intervals! Did you notice? Probably not!

R’s modelling output

The correct code for prediction intervals is:

pred <- predict(fit, tail(palmerpenguins::penguins), se.fit = TRUE)
sprintf(
  "[%f, %f]95",
  pred$fit + qt(0.025, df = pred$df) * sqrt(pred$se.fit^2 + pred$residual.scale^2),
  pred$fit + qt(0.975, df = pred$df) * sqrt(pred$se.fit^2 + pred$residual.scale^2)
)
[41.203878, 51.002789]95
[46.588864, 56.381473]95
[43.370784, 53.064468]95
[43.564487, 53.255182]95
[45.094230, 54.800773]95
[44.524716, 54.217036]95

Making mistakes

There’s a lot to know about regression, distributions, and R functions to get correct prediction intervals.

It’s easy to make mistakes (or ignore uncertainty).

Better distributions for R

The distributional package makes it simpler to create and use distributions in R.

Design improvements

The package design overcomes all of the earlier problems by:

  • Creating distributions separately from operations

    (with clear function and variable names)

  • Computing statistics with common generic functions

    (the same function works for all distributions)

  • Applying predictable recycling rules

Better distributions for R

Creating a distribution

All distributions in the package start with dist_*().

To create a normal distribution, we use dist_normal().

library(distributional)
dist_normal(1, 3)
<distribution[1]>
[1] N(1, 9)

The package currently supports:

  • 21 continuous distributions,
  • 9 discrete distributions,
  • p/d/q/r functions via dist_wrap(),
  • sample, degenerate and percentile distributions.
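For distributions without a dedicated dist_*() constructor, dist_wrap() adapts an existing set of p/d/q/r functions. A minimal sketch, wrapping base R’s *norm functions (the parameter names are passed straight through to pnorm() and friends):

```r
library(distributional)

# Wrap the base R p/d/q/r "norm" functions into a distribution vector;
# dist_wrap() finds pnorm/dnorm/qnorm/rnorm and dispatches to them.
x <- dist_wrap("norm", mean = 1, sd = 3)
cdf(x, 0.5)
# [1] 0.4338162  (same as pnorm(q = 0.5, mean = 1, sd = 3))
```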

Vectorised distributions

Creating multiple distributions

The package supports multiple distributions in a vector.

dist_normal(mu = c(1, 3, -1, 2), sigma = c(3, 2, 4, 1))
<distribution[4]>
[1] N(1, 9)   N(3, 4)   N(-1, 16) N(2, 1)  

You can even mix distributions of different shapes.

c(
  dist_normal(mu = c(1, 3, -1, 2), sigma = c(3, 2, 4, 1)),
  dist_gamma(2,1)
)
<distribution[5]>
[1] N(1, 9)     N(3, 4)     N(-1, 16)   N(2, 1)     Gamma(2, 1)

Vectorised distributions

Since distributions are vectors, they work great with data frames and the tidyverse.

tibble::tibble(
  x = 1:4,
  y = dist_normal(mu = c(1, 3, -1, 2), sigma = c(3, 2, 4, 1))
)
# A tibble: 4 x 2
      x         y
  <int>    <dist>
1     1   N(1, 9)
2     2   N(3, 4)
3     3 N(-1, 16)
4     4   N(2, 1)

Distributional predictions from models

This allows model predictions to return entire distributions, making it easier to get the correct results.

Vectorised distributions

The fable package produces forecasts with distributions, making later analysis easy.

library(fable)
as_tsibble(sunspot.year) |> 
  model(ARIMA(value)) |> 
  forecast(h = "10 years")
# A fable: 10 x 4 [1Y]
# Key:     .model [1]
   .model       index       value .mean
   <chr>        <dbl>      <dist> <dbl>
 1 ARIMA(value)  1989 N(145, 240) 145. 
 2 ARIMA(value)  1990 N(165, 582) 165. 
 3 ARIMA(value)  1991 N(161, 819) 161. 
 4 ARIMA(value)  1992 N(136, 915) 136. 
 5 ARIMA(value)  1993  N(98, 928)  98.3
 6 ARIMA(value)  1994  N(62, 929)  61.8
 7 ARIMA(value)  1995  N(38, 938)  37.9
 8 ARIMA(value)  1996  N(33, 941)  33.4
 9 ARIMA(value)  1997  N(48, 947)  48.5
10 ARIMA(value)  1998 N(77, 1006)  77.0

Statistical computation

The p/d/q/r functions have more descriptive alternatives:

  • p->cdf(): The CDF
  • d->density(): The density (PDF)
  • q->quantile(): The quantile
  • r->generate(): Random samples

Distributional operations

The same generic functions work for any distribution.
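For example, the earlier pnorm()/qnorm() calculations translate directly to the generics:

```r
library(distributional)
x <- dist_normal(1, 3)

cdf(x, 0.5)         # replaces pnorm(q = 0.5, mean = 1, sd = 3)
# [1] 0.4338162
quantile(x, 0.975)  # replaces qnorm(p = 0.975, mean = 1, sd = 3)
# [1] 6.879892
```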

Statistical computation

To calculate statistics like intervals:

y <- dist_normal(mu = c(1, 3, -1, 2), sigma = c(3, 2, 4, 1))
quantile(y, c(0.025, 0.975))
[[1]]
[1] -4.879892  6.879892

[[2]]
[1] -0.919928  6.919928

[[3]]
[1] -8.839856  6.839856

[[4]]
[1] 0.04003602 3.95996398

Or simply use the hilo() function:

hilo(y, 95)
<hilo[4]>
[1] [-4.87989195, 6.879892]95 [-0.91992797, 6.919928]95
[3] [-8.83985594, 6.839856]95 [ 0.04003602, 3.959964]95
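Returning to the penguin example, the correct prediction intervals can be built the same way — a hedged sketch, assuming the fit object from earlier and using distributional’s location-scale Student’s t constructor dist_student_t():

```r
library(distributional)

pred <- predict(fit, tail(palmerpenguins::penguins), se.fit = TRUE)

# One Student's t distribution per prediction; sigma combines estimation
# and residual uncertainty, giving *prediction* (not confidence) intervals.
dist <- dist_student_t(
  df = pred$df,
  mu = pred$fit,
  sigma = sqrt(pred$se.fit^2 + pred$residual.scale^2)
)
hilo(dist, 95)
```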

Statistical computation

Other operations

There are many more statistics than p/d/q/r.

  • log_likelihood()/likelihood()
  • hilo()
  • hdr()
  • support()
  • mean()
  • variance()/covariance()
  • skewness()
  • kurtosis()
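For instance, closed-form moments are returned directly where known:

```r
library(distributional)
y <- dist_normal(mu = c(1, 3, -1, 2), sigma = c(3, 2, 4, 1))

mean(y)      # [1]  1  3 -1  2
variance(y)  # [1]  9  4 16  1
```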

Vectorised operations

Vectorised operations in distributional are safer than the p/d/q/r equivalents.

Vectorising in two ways

There are two types of operation arguments:

  • vector/matrix inputs

    Vectorises across distributions, then arguments.

    This approach is simpler, especially for single arguments.

  • list/data.frame inputs

    Vectorises across arguments, then distributions.

    This approach is more flexible and powerful.

Vectorised operations (vectors)

(y <- c(dist_normal(0, 1), dist_beta(5, 1), dist_gamma(2, 1)))
<distribution[3]>
[1] N(0, 1)     Beta(5, 1)  Gamma(2, 1)

Vectors/matrices apply the same operation inputs to each distribution.

density(y, at = 0.65)
[1] 0.3229724 0.8925312 0.3393298
density(y, at = c(0.65, 0.9))
[[1]]
[1] 0.3229724 0.2660852

[[2]]
[1] 0.8925312 3.2805000

[[3]]
[1] 0.3393298 0.3659127

Vectorised operations (vectors)

Distributions in data analysis

This also works well with data frames.

tibble::tibble(y) |> 
  dplyr::mutate(density(y, at = 0.65))
# A tibble: 3 x 2
            y `density(y, at = 0.65)`
       <dist>                   <dbl>
1     N(0, 1)                   0.323
2  Beta(5, 1)                   0.893
3 Gamma(2, 1)                   0.339

Although it’s a bit tricky for more than one input.

tibble::tibble(y) |> 
  dplyr::mutate(density(y, at = c(0.65, 0.9)))
# A tibble: 3 x 2
            y `density(y, at = c(0.65, 0.9))`
       <dist> <list>                         
1     N(0, 1) <dbl [2]>                      
2  Beta(5, 1) <dbl [2]>                      
3 Gamma(2, 1) <dbl [2]>                      

Vectorised operations (lists)

(y <- c(dist_normal(0, 1), dist_beta(5, 1), dist_gamma(2, 1)))
<distribution[3]>
[1] N(0, 1)     Beta(5, 1)  Gamma(2, 1)

Lists/data.frames recycle each input argument to match the number of distributions.

density(y, at = list(d1 = 0.65))
         d1
1 0.3229724
2 0.8925312
3 0.3393298
density(y, at = list(d1 = 0.65, d2 = 0.9))
         d1        d2
1 0.3229724 0.2660852
2 0.8925312 3.2805000
3 0.3393298 0.3659127

Vectorised operations (lists)

(y <- c(dist_normal(0, 1), dist_beta(5, 1), dist_gamma(2, 1)))
<distribution[3]>
[1] N(0, 1)     Beta(5, 1)  Gamma(2, 1)

This also allows vectorisation across both inputs and distributions.

density(y, at = list(dens = c(0.65, 0.9, 0.3)))
       dens
1 0.3229724
2 3.2805000
3 0.2222455

Reliable recycling

density(y, list(dens = c(0.65, 0.9)))
Error in `FUN()`:
! Cannot recycle input of size 2 to match the distributions (size 3).

Vectorised operations (lists)

Distributions in data analysis

This also works really well with data frames.

tibble::tibble(y) |> 
  dplyr::mutate(density(y, at = list(d1 = 0.65)))
# A tibble: 3 x 2
            y    d1
       <dist> <dbl>
1     N(0, 1) 0.323
2  Beta(5, 1) 0.893
3 Gamma(2, 1) 0.339

mutate() automatically unpacks the resulting columns if the expression is left unnamed.

tibble::tibble(y) |> 
  dplyr::mutate(
    density(y, at = list(d1 = 0.65, d2 = 0.9))
  )
# A tibble: 3 x 3
            y    d1    d2
       <dist> <dbl> <dbl>
1     N(0, 1) 0.323 0.266
2  Beta(5, 1) 0.893 3.28 
3 Gamma(2, 1) 0.339 0.366

Modifying distributions

Distributions can be used in the creation of new distributions.

  • dist_inflated()
  • dist_truncated()
  • dist_transformed()
  • dist_mixture()
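A brief sketch of two of these modifiers — truncating a normal to the non-negative half-line, and mixing two distributions with given weights (the same pattern used in the plotting example later):

```r
library(distributional)

# Truncate N(0, 1) to [0, Inf)
dist_truncated(dist_normal(0, 1), lower = 0)

# A 40/60 mixture of a Gamma and a Normal distribution
dist_mixture(dist_gamma(2, 1), dist_normal(5, 1), weights = c(0.4, 0.6))
```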

Modifying distributions

Transforming distributions

Where possible, the transformation directly changes the underlying distribution.

dist_normal(1,3)
<distribution[1]>
[1] N(1, 9)
2 + dist_normal(1,3)
<distribution[1]>
[1] N(3, 9)
3 * (2 + dist_normal(1,3))
<distribution[1]>
[1] N(9, 81)
exp(3 * (2 + dist_normal(1,3)))
<distribution[1]>
[1] lN(9, 81)

Modifying distributions

Transforming distributions

In other cases, the transformation falls back to a ‘transformed’ distribution.

(3 * (2 + dist_normal(1,3)))^2
<distribution[1]>
[1] t(N(9, 81))

The statistics are calculated numerically, with symbolic differentiation where possible.

x <- (3 * (2 + dist_normal(1,3)))^2
mean(x)
[1] 162
cdf(x, 0.5)
[1] 0.1784123

Visualising distributions

The ggdist package by Matthew Kay (@mjskay) extends ggplot2 with support for visualising distributions in many ways.

It works directly with distributional vectors!

Visualising distributions

library(ggdist)
library(ggplot2)
df <- tibble::tibble(
  name = c("Gamma(2,1)", "Normal(5,1)", "Mixture"),
  dist = c(dist_gamma(2,1), dist_normal(5,1),
           dist_mixture(dist_gamma(2,1), dist_normal(5, 1), weights = c(0.4, 0.6)))
)
ggplot(df, aes(y = factor(name, levels = rev(name)))) +
  stat_dist_halfeye(aes(dist = dist)) + 
  labs(title = "Density function for a mixture of distributions", y = NULL, x = NULL)

Summary

Key ideas

  1. Be careful when using p/d/q/r functions
  2. Good design can prevent some common mistakes
  3. Consider using distributional in analysis/packages
  4. It’s also a great teaching tool!

Future work

  • Add more distributions (help please!)
  • Convolutions of distributions
  • Add moments of distributions
  • Create a hex sticker

Thanks for your time!

Final remarks

  • Uncertainty shouldn’t be an afterthought.

  • Easier statistics is better statistics.

  • Made with ❤️ and vctrs.

  • Consider contributing to the package.

    (distributions, documentation, ideas…)

Unsplash credits

Thanks to these Unsplash contributors for their photos