Power Analysis and Sample Size Determination

Author

Mark Andrews

Abstract

Determining an adequate sample size for a multilevel study is more complex than in single-level designs because power depends on the number of groups, the number of observations per group, the variance components, and the effect size of interest. Closed-form analytical solutions are not available in general. The simulation-based approach implemented in the simr package constructs a hypothetical mixed effects model with specified parameters and estimates power by repeatedly simulating data from it and counting how often the key effect is detected at the chosen significance level.

library(tidyverse)
library(lme4)
library(simr)

Why simulation is needed

In a simple two-sample \(t\)-test, power depends on the effect size (the standardised mean difference), the sample size, and the significance level. A closed-form relationship between these quantities exists and can be solved analytically.

In a multilevel design, power depends on more quantities: the fixed effect of interest, the variance components at each level, the number of groups, the number of observations per group, and the correlation structure of the random effects. There is no single formula that captures all of this, so simulation is the standard approach.

The strategy is to construct a mixed effects model that reflects the design you are planning, specify the effect sizes and variance components you expect or consider plausible, and then simulate many datasets from this model. For each simulated dataset, fit the model and test the effect of interest. The proportion of simulations in which the test is significant at the chosen level is the estimated power.

The simr package

The simr package provides tools for this workflow. The key functions are:

makeLmer() — constructs a hypothetical lmer model with specified parameters.
powerSim() — estimates power by simulation.
extend() — modifies the sample size of the fake data to explore the power curve.

Specifying the hypothetical model

We will use the sleep deprivation study as a template. Suppose we are planning a new study with the same design — 10 days of measurement per subject — and we want to estimate the power to detect a slope of 10 ms per day, given realistic variance components.

First, construct a dataset with the planned design: 10 subjects, 10 days each.

fake_sleep_data <- expand_grid(subject = 1:10, days = 0:9)
fake_sleep_data

# A tibble: 100 × 2
   subject  days
     <int> <int>
 1       1     0
 2       1     1
 3       1     2
 4       1     3
 5       1     4
 6       1     5
 7       1     6
 8       1     7
 9       1     8
10       1     9
# ℹ 90 more rows

Now specify the parameter values for the hypothetical model. These should come from a combination of prior literature and the planned effect size.

fixed_b <- c(250, 10)  # intercept = 250 ms, slope = 10 ms/day

V <- matrix(
  c(625,  0,
      0, 25),
  nrow = 2
)  # tau_0 = 25, tau_1 = 5, rho = 0

s <- 25  # residual SD

The intercept of 250 ms and residual SD of 25 ms are plausible for a reaction time study. The slope of 10 ms per day is the effect we want to detect. The variance components specify how much subjects differ in their baselines (\(\tau_0 = 25\)) and how much they differ in their rate of slowing (\(\tau_1 = 5\)).

Construct the model object.

fake_model <- makeLmer(
  y ~ days + (days | subject),
  data    = fake_sleep_data,
  fixef   = fixed_b,
  VarCorr = V,
  sigma   = s
)

summary(fake_model)

Linear mixed model fit by REML ['lmerMod']
Formula: y ~ days + (days | subject)
   Data: fake_sleep_data

REML criterion at convergence: 944.4

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-3.09040 -0.41630 -0.06215  0.53974  2.21466 

Random effects:
 Groups   Name        Variance Std.Dev. Corr 
 subject  (Intercept) 625      25            
          days         25       5       0.00 
 Residual             625      25            
Number of obs: 100, groups:  subject, 10

Fixed effects:
            Estimate Std. Error t value
(Intercept)  250.000      9.170  27.262
days          10.000      1.805   5.541

Correlation of Fixed Effects:
     (Intr)
days -0.206

Estimating power

With the model constructed, powerSim simulates data from it repeatedly and fits the model to each simulated dataset. The default test is a likelihood ratio test for all fixed effects; test = fixed("days") restricts it to the days slope.

The nsim argument controls how many simulations to run. More simulations give a more precise power estimate but take longer. For a quick check use 100; for a final report use at least 1000.

powerSim(fake_model, test = fixed("days"), nsim = 1000)

The output reports the estimated power with a 95% confidence interval. Power depends directly on the effect size (the slope of 10 ms/day relative to the noise). If the true slope were smaller, power would be lower.

Varying sample size

To find the sample size needed for a target power (typically 80% or 90%), repeat the simulation for different numbers of subjects. The extend function scales up the sample.

fake_model_20 <- extend(fake_model, along = "subject", n = 20)
powerSim(fake_model_20, test = fixed("days"), nsim = 1000)

fake_model_30 <- extend(fake_model, along = "subject", n = 30)
powerSim(fake_model_30, test = fixed("days"), nsim = 1000)

Running powerSim across a range of sample sizes and plotting the results gives a power curve, which shows how power increases with sample size.

A design with two conditions

Many experiments compare two conditions across subjects and time. The model extends naturally.

fake_sleep_data2 <- expand_grid(
  subject = 1:10,
  days    = 0:9,
  x       = c("A", "B")
)

fixed_b2 <- c(250, 10, 5, 2)  # intercept, days, xB, days:xB

fake_model2 <- makeLmer(
  y ~ x * days + (days | subject),
  data    = fake_sleep_data2,
  fixef   = fixed_b2,
  VarCorr = V,
  sigma   = s
)

summary(fake_model2)

Linear mixed model fit by REML ['lmerMod']
Formula: y ~ x * days + (days | subject)
   Data: fake_sleep_data2

REML criterion at convergence: 1879.2

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.14822 -0.60620 -0.07965  0.69322  2.51141 

Random effects:
 Groups   Name        Variance Std.Dev. Corr 
 subject  (Intercept) 625      25            
          days         25       5       0.00 
 Residual             625      25            
Number of obs: 200, groups:  subject, 10

Fixed effects:
            Estimate Std. Error t value
(Intercept)  250.000      9.170  27.262
xB            10.000      6.571   1.522
days           5.000      1.805   2.770
xB:days        2.000      1.231   1.625

Correlation of Fixed Effects:
        (Intr) xB     days  
xB      -0.358              
days    -0.206  0.287       
xB:days  0.302 -0.843 -0.341

The effect of primary interest here is the interaction xB:days — whether the rate of slowing differs between conditions. Its coefficient is 2 ms/day in this specification.

powerSim(fake_model2, test = fixed("xB:days"), nsim = 1000)

Sensitivity to variance components

Power in multilevel designs is sensitive to the variance components as well as to the fixed effects. If between-subject variability in slopes (\(\tau_1\)) is large, the fixed slope is harder to detect because subjects vary so much that the average effect is obscured.

Explore this by varying \(\tau_1\) in the model specification and re-running powerSim. This sensitivity analysis reveals which assumptions are driving the power estimate and helps identify the conditions under which the study is feasible.

Practical advice

The key inputs to a power analysis are:

The minimum effect size that would be scientifically meaningful.
The variance components, estimated from pilot data or prior literature.
The planned design (number of groups, observations per group).

If prior data are available, fit a model to them and use the estimated parameters as the basis for makeLmer. The fixed effect can then be set to a value smaller than the estimated one to represent a conservative lower bound on the true effect.

Power analysis for multilevel designs should always consider the number of groups as well as the number of observations per group. In many designs, power is more sensitive to the number of groups than to the number of observations per group, because the degrees of freedom for testing fixed effects depends primarily on the number of independent units at the highest relevant level.