
Vectorised Statistics with Composite Data Types
4th December 2025 (Perth, Australia)
Mitchell O’Hara-Wild, Monash University
Most programming languages require explicit loops to compute statistics on vectors.
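In R, by contrast, many functions are vectorised and apply elementwise with no loop at all. For example:

```r
# log() is applied to every element of the vector at once - no loop needed
log(1:5)
#> [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379
```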
A vector is a set of values with the same data type.
⚛️ Atomic vectors (aka primitive types)
Simple data types, such as:
c(TRUE, FALSE)
c(1L, 2L, 3L)
c(1.618, 3.14)
c("a", "b", "c")
🏗️ Classed vectors
Other data types are built from them, such as:
Many data types have intrinsic structure, requiring special care in statistical analysis.
🔢 ordinal: ordered, {forcats}
🍀 uncertainty: p/d/q/r, {distributional}
⌛ time: Date, POSIXt, {hms}, {mixtime}
🕸️ graph: {igraph}, {tidygraph}, {graphvec}
🗺️ spatial: {sp}, {sf}
Statistical operations leverage their structure to prevent analyses that violate their semantics.
The included distributions in R (and many extension packages) use p/d/q/r functions for statistical operations on distributions.
The p/d/q/r functions
These functions allow you to calculate statistical operations from distributions:
p: The ‘probability’ (CDF)
d: The ‘density’ (PDF)
q: The ‘quantiles’
r: The ‘random’ samples
Some packages also define m functions for moments!
Using p/d/q/r functions
These operation prefixes are used in conjunction with the distribution’s shape. The general form is:
<op><shape>(<args>, <parameters>)
For example, the density (d) at 0.5 of a Normal (norm) distribution with mean 1 and standard deviation 3 is:
dnorm(x = 0.5, mean = 1, sd = 3)
[1] 0.1311466
Semantics not included
These functions are error-prone, since the burden of performing valid distributional operations falls entirely on the user.
Short and confusing function names
The p/d/q/r function names must be memorised for each distribution shape.
Risky recycling
These p/d/q/r functions are vectorised and fast 🎉
How these functions vectorise inputs is surprising 👻
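For example, shorter parameter vectors are recycled to the length of the longest input, usually without any warning, which can silently mask mistakes. A small sketch with made-up values:

```r
# Three quantiles but only two means: the means are recycled as c(0, 10, 0),
# typically with no warning to alert you to the mismatch
dnorm(c(0, 1, 2), mean = c(0, 10))
```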
R’s model prediction output
Worst of all is how distributions are obtained from models.
🐧 Predicting penguins
Consider the output when using predict() on a lm() for the length of penguin bills using depth and species.
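A sketch of such a model, assuming the penguins data from the {palmerpenguins} package:

```r
library(palmerpenguins)

# Model bill length using bill depth and species
fit <- lm(bill_length_mm ~ bill_depth_mm + species, data = penguins)

# Only point predictions (expected values) are returned by default
head(predict(fit))
```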
Where’s the uncertainty!
By default, predictions only return the expected value.
Finding uncertainty!
Use se.fit = TRUE in predict() to get standard errors.
$fit
1 2 3 4 5 6
46.10333 51.48517 48.21763 48.40983 49.94750 49.37088
$se.fit
1 2 3 4 5 6
0.4769975 0.4685605 0.3082198 0.3020842 0.3333431 0.3054341
$df
[1] 336
$residual.scale
[1] 2.444663
This gives more than just standard errors…
but we’re still missing the shape of the distribution!
Here’s the code to obtain 95% prediction intervals for the penguins data:
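A sketch of the calculation, assuming the hypothetical penguins lm() model from earlier and the se.fit output shown above:

```r
library(palmerpenguins)
fit <- lm(bill_length_mm ~ bill_depth_mm + species, data = penguins)
pred <- predict(fit, se.fit = TRUE)

# 95% interval: point estimate +/- t quantile * standard error of the fit
margin <- qt(0.975, df = pred$df) * pred$se.fit
head(cbind(lower = pred$fit - margin, upper = pred$fit + margin))
```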
[45.165056, 47.041611]95
[50.563487, 52.406850]95
[47.611342, 48.823910]95
[47.815620, 49.004049]95
[49.291799, 50.603204]95
[48.770072, 49.971680]95
Not quite right
The above calculations are for confidence intervals, not prediction intervals! Did you notice? Probably not!
The correct code for prediction intervals is:
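The key difference is adding the residual variance to the standard error of the fit, since a new observation varies around the fitted mean. A sketch using the same hypothetical penguins model:

```r
library(palmerpenguins)
fit <- lm(bill_length_mm ~ bill_depth_mm + species, data = penguins)
pred <- predict(fit, se.fit = TRUE)

# Prediction intervals must also include the residual variation of new observations
se_pred <- sqrt(pred$se.fit^2 + pred$residual.scale^2)
margin <- qt(0.975, df = pred$df) * se_pred
head(cbind(lower = pred$fit - margin, upper = pred$fit + margin))
```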
[41.203878, 51.002789]95
[46.588864, 56.381473]95
[43.370784, 53.064468]95
[43.564487, 53.255182]95
[45.094230, 54.800773]95
[44.524716, 54.217036]95
Error-prone analysis
There’s a lot to know about regression, distributions, and R functions to get correct prediction intervals.
It’s easy to make mistakes (or ignore uncertainty).
What if model predictions could directly produce a vector of distributions?
<distribution[6]>
[1] t(336, 46, 2.5) t(336, 51, 2.5) t(336, 48, 2.5)
[4] t(336, 48, 2.5) t(336, 50, 2.5) t(336, 49, 2.5)
{vctrs} offers two vector types:
🏗️ Vectors: new_vctr()
Vectors extend a single atomic vector with a class and attributes.
This enables single-parameter distributions:
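A minimal sketch with new_vctr(), where each element is fully described by one parameter (the class name here is hypothetical):

```r
library(vctrs)

# One rate parameter per element is all a Poisson distribution needs
x <- new_vctr(c(2, 4, 6), class = "my_poisson")
vec_data(x)  # the underlying atomic vector of parameters
```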
What about multi-parameter distributions?

💿 Records: new_rcrd()
Records use information from multiple vectors (of the same length).
e.g. POSIXlt contains the parts of a datetime.
List of 11
$ sec : num 0
$ min : int 0
$ hour : int 11
$ mday : int 4
$ mon : int 11
$ year : int 125
$ wday : int 4
$ yday : int 337
$ isdst : int 0
$ zone : chr "AWST"
$ gmtoff: int NA
- attr(*, "tzone")= chr [1:3] "" "AWST" "AWST"
- attr(*, "balanced")= logi TRUE
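A minimal sketch of a record vector built with new_rcrd(); the class name and format() method are hypothetical:

```r
library(vctrs)

# A record vector: several same-length parameter vectors, one field set per element
x <- new_rcrd(list(mu = c(0, 1), sigma = c(1, 2)), class = "my_normal")

# Record vectors need a format() method before they can be printed
format.my_normal <- function(x, ...) {
  sprintf("N(%g, %g)", field(x, "mu"), field(x, "sigma"))
}
x
```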
This enables multi-parameter distributions:
How can distributions of different shapes coexist within the same vector? That’s trickier!


# A tibble: 5 × 5
family dob_child1 dob_child2 name_child1 name_child2
<int> <date> <date> <chr> <chr>
1 1 1998-11-26 2000-01-29 Susan Jose
2 2 1996-06-22 NA Mark <NA>
3 3 2002-07-11 2004-04-05 Sam Seth
4 4 2004-10-10 2009-08-27 Craig Khai
5 5 2000-12-05 2005-02-28 Parker Gracie

# A tibble: 25 × 2
name value
<chr> <vecvec>
1 family 1
2 dob_child1 1998-11-26
3 dob_child2 2000-01-29
4 name_child1 Susan
5 name_child2 Jose
6 family 2
7 dob_child1 1996-06-22
8 dob_child2 NA
9 name_child1 Mark
10 name_child2 NA
# ℹ 15 more rows

{vecvec} is perhaps only useful for mixed-type vectors that share common semantics.
Suitable semantic data-types include:
📊 Distributional (different shapes)
📅 Temporal (different chronons / calendars)
🗺️ Spatial (different geometries)
🗳️ Preferential (different candidates)
The distributional package makes it simpler to create and use distributions in R.
Creating distributions
All distributions in the package start with dist_*().
The package currently provides constructors for many common distributions, as well as dist_wrap() for wrapping external p/d/q/r functions.
The p/d/q/r functions have more descriptive alternatives:
p -> cdf(): The CDF
d -> density(): The density (PDF)
q -> quantile(): The quantile
r -> generate(): Random samples
Distributional operations
These functions are the same for any distribution.
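A sketch contrasting the two styles with the distributional package:

```r
library(distributional)

# One vector can hold distributions of different shapes
d <- c(dist_normal(mu = 1, sigma = 3), dist_gamma(shape = 2, rate = 1))

density(d, at = 0.5)   # replaces dnorm()/dgamma()
cdf(d, q = 0.5)        # replaces pnorm()/pgamma()
quantile(d, p = 0.5)   # replaces qnorm()/qgamma()
generate(d, times = 3) # replaces rnorm()/rgamma()
```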
Other operations
There are many more statistics than p/d/q/r.
log_likelihood()/likelihood()
hilo()
hdr()
support()
mean()
variance()/covariance()
skewness()
kurtosis()
Perform statistics alongside distributions.
# A tibble: 6 × 5
dist mean var pdf cdf
<dist> <dbl> <dbl> <dbl> <dbl>
1 Pois(4) 4 4 0.0733 0.0916
2 Pois(2) 2 2 0.271 0.406
3 Pois(6) 6 6 0.0149 0.0174
4 N(1, 9) 1 9 0.133 0.5
5 N(3, 4) 3 4 0.121 0.159
6 N(-1, 16) -1 16 0.0880 0.691
Distributions can be transformed using mathematical operations.
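For example, exponentiating or scaling a distribution yields a new, transformed distribution:

```r
library(distributional)
d <- dist_normal(mu = 0, sigma = 1)

# Mathematical operations produce transformed distributions
exp(d)      # a log-normal distribution
d * 2 + 1   # a scaled and shifted normal
```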
📅 Temporal
🗳️ Preferential
[1] [Beverly Hills Cop > Mean Girls > Mission: Impossible II > The Mummy Returns]
[2] [Mean Girls > Beverly Hills Cop > Mission: Impossible II > The Mummy Returns]
[3] [Beverly Hills Cop > Mean Girls > The Mummy Returns > Mission: Impossible II]

🗺️📅 Spatio-temporal data:
{mixtime} + {sf}
🕸️📅 Graph-temporal data:
{graphvec} + {mixtime}
🕸️📊 Network modelling:
{graphvec} + {distributional}
📊🗺️📅🕸️
Probabilistic-spatio-temporal-graph:
{distributional} + {sf} + {mixtime} + {graphvec}
Forecasts from {fable} combine {mixtime} and {distributional} vectors.
# A fable: 10 x 4 [1Y]
# Key: .model [1]
.model index value .mean
<chr> <mixtime> <dist> <dbl>
1 ARIMA(value) 1989 N(145, 240) 145.
2 ARIMA(value) 1990 N(165, 582) 165.
3 ARIMA(value) 1991 N(161, 819) 161.
4 ARIMA(value) 1992 N(136, 915) 136.
5 ARIMA(value) 1993 N(98, 928) 98.3
6 ARIMA(value) 1994 N(62, 929) 61.8
7 ARIMA(value) 1995 N(38, 938) 37.9
8 ARIMA(value) 1996 N(33, 941) 33.4
9 ARIMA(value) 1997 N(48, 947) 48.5
10 ARIMA(value) 1998 N(77, 1006) 77.0

The {ggdist} and {ggdibbler} packages extend {ggplot2} with distributional graphics.






Final remarks

Vectorised operations in distributional are safer than the p/d/q/r equivalents.
Vectorising in two ways
There are two types of operation arguments:
vector/matrix inputs
Vectorises across distributions, then arguments.
This approach is simpler, especially for single arguments.
list/data.frame inputs
Vectorises across arguments, then distributions.
This approach is more flexible and powerful.
<distribution[3]>
[1] N(0, 1) Beta(5, 1) Γ(2, 1)
Vectors/matrices apply the same operation inputs to each distribution.
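A sketch reproducing the output above, where one scalar input is evaluated against every distribution in the vector:

```r
library(distributional)
y <- c(dist_normal(0, 1), dist_beta(5, 1), dist_gamma(2, 1))

# The same input (0.65) is applied to each distribution in turn
density(y, at = 0.65)  # approx. 0.323, 0.893, 0.339
```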
Distributions in data analysis
This also works well with data frames.
# A tibble: 3 × 2
y `density(y, at = 0.65)`
<dist> <dbl>
1 N(0, 1) 0.323
2 Beta(5, 1) 0.893
3 Γ(2, 1) 0.339
<distribution[3]>
[1] N(0, 1) Beta(5, 1) Γ(2, 1)
Lists/data.frames recycle each input argument to the length of distributions.
<distribution[3]>
[1] N(0, 1) Beta(5, 1) Γ(2, 1)
This also allows vectorisation across both inputs and distributions.
Distributions in data analysis
This also works really well with data frames.
# A tibble: 3 × 2
y d1
<dist> <dbl>
1 N(0, 1) 0.323
2 Beta(5, 1) 0.893
3 Γ(2, 1) 0.339
library(ggdist)
library(ggplot2)
library(distributional)
df <- tibble::tibble(
  name = c("Gamma(2,1)", "Normal(5,1)", "Mixture"),
  dist = c(dist_gamma(2, 1), dist_normal(5, 1),
           dist_mixture(dist_gamma(2, 1), dist_normal(5, 1), weights = c(0.4, 0.6)))
)
ggplot(df, aes(y = factor(name, levels = rev(name)))) +
  stat_dist_halfeye(aes(dist = dist)) +
  labs(title = "Density function for a mixture of distributions", y = NULL, x = NULL)
