Building Statistical Applications

This guide demonstrates how to build Shiny interfaces around realistic statistical analyses. We introduce reactive expressions as a way to share computations between outputs, examine the reactive dependency graph more carefully, and build complete applications for exploring likelihood functions, sampling distributions, and regression diagnostics.

Reactive expressions

A reactive expression is defined with reactive({...}) and behaves like a function that caches its result until one of its reactive dependencies changes. It is the primary tool for sharing a computation between multiple outputs without running it twice.

This directly addresses the problem noted at the end of the inputs and outputs guide. There, the scatter plot and the summary statistics each generated their own independent sample, so the two outputs were describing different data. A reactive expression solves this: the data is generated once, stored in samples(), and both outputs read from the same object.

Here is a complete application demonstrating the pattern:

library(shiny)
library(ggplot2)

ui <- fluidPage(
  titlePanel("Shared reactive sample"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("n", "Sample size", min = 10, max = 1000, value = 200),
      sliderInput("mean", "Mean", min = -5, max = 5, value = 0, step = 0.5),
      sliderInput("sd", "SD", min = 0.1, max = 5, value = 1, step = 0.1)
    ),
    mainPanel(
      plotOutput("hist"),
      verbatimTextOutput("stats")
    )
  )
)

server <- function(input, output) {
  samples <- reactive({
    rnorm(input$n, mean = input$mean, sd = input$sd)
  })

  output$hist <- renderPlot({
    df <- data.frame(x = samples())
    ggplot(df, aes(x = x)) +
      geom_histogram(bins = 30, fill = "steelblue", colour = "white") +
      theme_minimal()
  })

  output$stats <- renderPrint({
    summary(samples())
  })
}

shinyApp(ui, server)

samples() is called with parentheses because reactive expressions behave like functions. Both renderPlot and renderPrint read samples(), so both depend on input$n, input$mean, and input$sd. When any of those inputs changes, samples() re-evaluates once, and both outputs receive the same values.
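The caching can also be observed outside a running app. Recent versions of shiny provide reactiveConsole(), which lets reactive expressions run interactively at the console; in this sketch the message prints only when the cached value has been invalidated:

```r
library(shiny)
reactiveConsole(TRUE)   # enable reactivity at the console (recent shiny versions)

n <- reactiveVal(10)
samples <- reactive({
  message("re-sampling")        # emitted only when the cached value is invalid
  rnorm(n())
})
length(samples())   # prints "re-sampling", returns 10
length(samples())   # cache hit: no message this time
n(50)               # changing the dependency invalidates the cache
length(samples())   # prints "re-sampling" again, returns 50
```

This is the same invalidate-and-recompute cycle that drives the app above, just made visible one step at a time.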
Exploring a likelihood function
The likelihood function expresses, for each possible value of a parameter, how probable the observed data would be if that parameter value were true. For a normal distribution with unknown mean mu and known standard deviation, the log-likelihood as a function of mu is a downward-opening parabola. Its peak is the maximum likelihood estimate (MLE) of mu, which for the normal distribution equals the sample mean.
The natural question is: how does the shape of this curve change as the data changes? More data should sharpen the peak (less uncertainty about mu). A larger spread in the data should flatten it. A different sample mean should shift the whole curve. Shiny lets us explore all of these directly.
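To see why the curve is a parabola, write out the log-likelihood with known sigma: ll(mu) = constant - sum((x_i - mu)^2) / (2 * sigma^2), a quadratic in mu maximised at mu = mean(x). A quick console sketch (not part of the app) confirms that a grid search over mu lands on the sample mean:

```r
set.seed(1)
x <- rnorm(25, mean = 1)                      # synthetic data, sigma known to be 1
loglik <- function(mu) sum(dnorm(x, mean = mu, sd = 1, log = TRUE))
mu_grid <- seq(-2, 4, length.out = 6001)      # grid step of 0.001
mle <- mu_grid[which.max(sapply(mu_grid, loglik))]
abs(mle - mean(x))                            # agrees with mean(x) up to grid resolution
```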
The three sliders control the sample mean, the sample standard deviation, and the sample size of the observed data. The application generates a synthetic dataset with those characteristics, computes the log-likelihood of mu across a grid of values, and plots the resulting curve. The red dashed line marks the sample mean, which is also the MLE.
library(shiny)
library(ggplot2)

ui <- fluidPage(
  titlePanel("Normal log-likelihood: the role of data"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("xbar", "Observed sample mean",
                  min = -3, max = 3, value = 1, step = 0.1),
      sliderInput("s", "Observed sample SD",
                  min = 0.2, max = 3, value = 1, step = 0.1),
      sliderInput("n", "Sample size",
                  min = 5, max = 100, value = 20, step = 5)
    ),
    mainPanel(
      plotOutput("llplot"),
      verbatimTextOutput("llinfo")
    )
  )
)

server <- function(input, output) {
  observed <- reactive({
    set.seed(1)
    x <- rnorm(input$n)
    (x - mean(x)) / sd(x) * input$s + input$xbar  # rescale to target mean and SD
  })

  output$llplot <- renderPlot({
    x <- observed()
    s <- input$s
    mu_grid <- seq(-5, 5, length.out = 200)
    ll_grid <- sapply(mu_grid, function(m)
      sum(dnorm(x, mean = m, sd = s, log = TRUE)))
    df <- data.frame(mu = mu_grid, ll = ll_grid)
    ggplot(df, aes(x = mu, y = ll)) +
      geom_line(colour = "steelblue", linewidth = 1) +
      geom_vline(xintercept = input$xbar, linetype = "dashed",
                 colour = "firebrick", linewidth = 0.8) +
      labs(x = expression(mu), y = "Log-likelihood",
           caption = "Red line: sample mean = MLE") +
      theme_minimal()
  })

  output$llinfo <- renderPrint({
    cat(sprintf("Sample mean (MLE of mu): %.2f\n", input$xbar))
    cat(sprintf("Sample SD: %.2f\n", input$s))
    cat(sprintf("Sample size: %d\n", input$n))
    cat(sprintf("SE of mean (s/sqrt(n)): %.3f\n", input$s / sqrt(input$n)))
  })
}

shinyApp(ui, server)

Move the sample mean slider and the entire curve shifts so its peak follows it: the MLE always equals the sample mean. Increase n and the curve sharpens: more data means less uncertainty about mu. Increase the sample SD and the curve flattens: noisier data means less precision.
From a Shiny perspective, this is the same reactive expression pattern already introduced. The observed() reactive generates the data once and is consumed by both output$llplot and output$llinfo. What the application adds is a statistically meaningful context for the pattern.
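The sharpening with n has a precise form: the second derivative of the normal log-likelihood at its peak is -n/sigma^2, so the implied variance of the estimate is sigma^2/n, which is exactly the squared standard error printed by llinfo. A small numeric check, independent of the app:

```r
set.seed(1)
n <- 20; sigma <- 1.5
x <- rnorm(n, sd = sigma)
ll <- function(mu) sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
# central second difference approximates the curvature at the peak
curv <- function(f, mu, h = 1e-3) (f(mu + h) - 2 * f(mu) + f(mu - h)) / h^2
c(numeric = -1 / curv(ll, mean(x)), theory = sigma^2 / n)  # implied variance, both ways
```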
Visualising sampling distributions
The central limit theorem states that the distribution of the sample mean approaches a normal distribution as sample size grows, regardless of the shape of the population. The convergence happens faster for symmetric distributions and more slowly for skewed ones. This is a foundational result but one that is genuinely hard to internalise from a statement alone.
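The rate of narrowing is sigma/sqrt(n). Before making this interactive, it can be checked in a few lines of plain R with an exponential(1) population (which has sigma = 1); the simulation sizes here are arbitrary choices:

```r
set.seed(42)
# standard error of the mean, estimated by simulation
se_hat <- function(n, reps = 4000) sd(replicate(reps, mean(rexp(n))))
round(c(est_n10  = se_hat(10),  theory_n10  = 1 / sqrt(10),
        est_n100 = se_hat(100), theory_n100 = 1 / sqrt(100)), 3)
```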
The application below lets you draw repeated samples of a given size from a chosen population distribution, compute the mean of each sample, and display the distribution of those means. Increasing n narrows the distribution and makes it look more normal. Switching from a normal population to an exponential one lets you see how the skewness of the population is inherited by the sampling distribution at small n but washes out at large n.
From a Shiny perspective this introduces nothing new in terms of technique: it is the same reactive expression pattern already seen. means() runs the simulation once and is shared between the plot and the text output. The application’s value is in making an abstract statistical theorem directly manipulable.
library(shiny)
library(ggplot2)

ui <- fluidPage(
  titlePanel("Sampling distribution of the mean"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("n", "Sample size", min = 2, max = 200, value = 10),
      sliderInput("reps", "Number of replications", min = 100, max = 5000, value = 1000),
      selectInput("dist", "Population distribution",
                  choices = c("Normal" = "norm",
                              "Exponential" = "exp",
                              "Uniform" = "unif"))
    ),
    mainPanel(
      plotOutput("sdplot"),
      verbatimTextOutput("sdstats")
    )
  )
)

server <- function(input, output) {
  means <- reactive({
    draw <- switch(input$dist,
                   norm = function(n) rnorm(n),
                   exp  = function(n) rexp(n),
                   unif = function(n) runif(n))
    replicate(input$reps, mean(draw(input$n)))
  })

  output$sdplot <- renderPlot({
    df <- data.frame(m = means())
    se <- sd(df$m)
    ggplot(df, aes(x = m)) +
      geom_histogram(aes(y = after_stat(density)),
                     bins = 50, fill = "steelblue", colour = "white") +
      stat_function(fun = dnorm,
                    args = list(mean = mean(df$m), sd = se),
                    colour = "firebrick", linewidth = 1) +
      labs(x = "Sample mean", y = "Density") +
      theme_minimal()
  })

  output$sdstats <- renderPrint({
    m <- means()
    # population SDs: normal and exponential(1) have sigma = 1, uniform(0, 1) has 1/sqrt(12)
    sigma <- switch(input$dist, norm = 1, exp = 1, unif = 1 / sqrt(12))
    cat(sprintf("Mean of means: %.4f\n", mean(m)))
    cat(sprintf("SD of means: %.4f (theoretical SE = sigma/sqrt(n) = %.4f)\n",
                sd(m), sigma / sqrt(input$n)))
  })
}

shinyApp(ui, server)

Interactive regression diagnostics
Fitting a statistical model is typically more expensive than drawing a random sample, and the results need to feed into multiple outputs simultaneously: a plot and a coefficient table, for example. This is the case where a reactive expression is most obviously the right tool.
The application below fits a simple linear regression of mpg on a user-chosen predictor from mtcars. The fitted model object is produced once in model() and consumed by both the plot and the coefficient table. When the user selects a new predictor, the model is refitted exactly once and both outputs update from the same fitted object.
This is the same pattern as before but in a context where getting it wrong would be noticeable: if the plot and the table were using independently fitted models, the code would be doing duplicate work and the example would fail to demonstrate reactive sharing. The principle is the same in more expensive settings: share costly computations through a reactive expression rather than repeating them in each render block.
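Outside Shiny, what model() computes for a single slider setting can be reproduced in a few lines; here "wt" stands in for whichever predictor the user would select:

```r
xvar <- "wt"                                   # stands in for input$xvar
fit  <- lm(as.formula(paste("mpg ~", xvar)), data = mtcars)
grid <- seq(min(mtcars[[xvar]]), max(mtcars[[xvar]]), length.out = 5)
newd <- setNames(data.frame(grid), xvar)       # column name must match the formula term
predict(fit, newdata = newd, interval = "confidence")  # matrix with fit, lwr, upr columns
```

The setNames() step matters: predict() looks up predictor columns in newdata by name, so the grid column must carry the same name as the term in the formula.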
library(shiny)
library(ggplot2)

ui <- fluidPage(
  titlePanel("Simple linear regression"),
  sidebarLayout(
    sidebarPanel(
      selectInput("xvar", "Predictor",
                  choices = c("wt", "hp", "disp", "qsec")),
      checkboxInput("ci", "Show confidence band", value = TRUE)
    ),
    mainPanel(
      plotOutput("regplot"),
      tableOutput("coeftable")
    )
  )
)

server <- function(input, output) {
  model <- reactive({
    fmla <- as.formula(paste("mpg ~", input$xvar))
    lm(fmla, data = mtcars)
  })

  output$regplot <- renderPlot({
    df <- mtcars
    df$x <- df[[input$xvar]]
    grid <- data.frame(
      x = seq(min(df$x), max(df$x), length.out = 200)
    )
    pred_data <- setNames(data.frame(grid$x), input$xvar)
    if (input$ci) {
      pred <- predict(model(), newdata = pred_data, interval = "confidence")
      fit_df <- data.frame(
        x   = grid$x,
        fit = pred[, "fit"],
        lwr = pred[, "lwr"],
        upr = pred[, "upr"]
      )
    } else {
      pred <- predict(model(), newdata = pred_data)
      fit_df <- data.frame(
        x   = grid$x,
        fit = as.numeric(pred)
      )
    }
    p <- ggplot(df, aes(x = x, y = mpg)) +
      geom_point(colour = "steelblue") +
      labs(x = input$xvar, y = "mpg") +
      theme_minimal()
    if (input$ci) {
      p <- p +
        geom_ribbon(data = fit_df,
                    aes(x = x, ymin = lwr, ymax = upr),
                    inherit.aes = FALSE,
                    fill = "firebrick", alpha = 0.15)
    }
    p +
      geom_line(data = fit_df,
                aes(x = x, y = fit),
                inherit.aes = FALSE,
                colour = "firebrick", linewidth = 1)
  })

  output$coeftable <- renderTable({
    m <- model()
    cf <- coef(summary(m))
    data.frame(Term = rownames(cf), round(cf, 4), check.names = FALSE)
  }, rownames = FALSE)
}

shinyApp(ui, server)

Notice that model() is shared between output$regplot and output$coeftable, so the plot and the coefficient table always describe the same fitted object.
Managing the reactive graph
When building these applications it is useful to think explicitly about the reactive dependency graph: which outputs depend on which inputs, and whether any reactive expressions sit between them. A reactive expression that depends on input$n and is consumed by two outputs creates a diamond-shaped graph, which is the efficient structure. If both outputs read input$n directly and repeat the same computation, the graph has two parallel edges that each trigger the same work. For cheap computations this does not matter. For expensive computations (model fitting, simulation, reading data) it matters a great deal, and a reactive expression is the correct solution.