Statistical Analysis

Author

Mark Andrews

Abstract

This guide introduces statistical data analysis in R using a dataset of height, weight, age, and gender measurements. It covers descriptive statistics with skimr, hypothesis tests (t-test and correlation), and linear models ranging from simple regression through to models with interactions. It also introduces one-way ANOVA with post-hoc pairwise comparisons.

Setup

Load the packages and read in the data. The weight.csv file should be in your project folder.

library(tidyverse)
library(skimr)
library(emmeans)

weight_df <- read_csv("weight.csv")

Get a first look at the data with glimpse():

glimpse(weight_df)
Rows: 6,068
Columns: 8
$ subjectid         <dbl> 10027, 10032, 10033, 10092, 10093, 10115, 10117, 102…
$ gender            <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Mal…
$ height            <dbl> 177.6, 170.2, 173.5, 165.5, 191.4, 172.0, 181.0, 185…
$ height_selfreport <dbl> 180.34, 172.72, 172.72, 167.64, 195.58, 175.26, 182.…
$ weight            <dbl> 81.5, 72.6, 92.9, 79.4, 94.6, 80.2, 116.2, 95.4, 99.…
$ weight_selfreport <dbl> 81.66969, 72.59528, 93.01270, 79.40109, 96.64247, 79…
$ age               <dbl> 41, 35, 42, 31, 21, 39, 32, 23, 36, 23, 32, 28, 36, …
$ race              <dbl> 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1…

The dataset contains measurements of height (cm), weight (kg), age, and gender for a sample of individuals.

Descriptive statistics

The skim() function from the skimr package provides a rich summary of every variable in a data frame, including means, standard deviations, quantiles, and a small histogram for numeric variables.

skim(weight_df)
── Data Summary ────────────────────────
                           Values   
Name                       weight_df
Number of rows             6068     
Number of columns          8        
_______________________             
Column type frequency:              
  character                1        
  numeric                  7        
________________________            
Group variables            None     

── Variable type: character ────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min max empty n_unique whitespace
1 gender                0             1   4   6     0        2          0

── Variable type: numeric ──────────────────────────────────────────────────────
  skim_variable     n_missing complete_rate     mean        sd      p0     p25
1 subjectid                 0             1 20757.   13159.    10027   14842. 
2 height                    0             1   171.       9.00    141.    165. 
3 height_selfreport         0             1   173.       9.81    142.    168. 
4 weight                    0             1    79.7     15.7      35.8    68.2
5 weight_selfreport         0             1    79.3     15.3       0      68.1
6 age                       0             1    29.8      8.67     17      23  
7 race                      0             1     1.62     0.979     1       1  
      p50     p75    p100 hist 
1 20064.  27234.  920103  ▇▁▁▁▁
2   172.    178.     199. ▁▃▇▆▁
3   173.    180.     239. ▂▇▂▁▁
4    78.5    89.6    144. ▁▇▇▂▁
5    78.5    88.9    146. ▁▁▇▃▁
6    28      36       58  ▇▆▃▂▁
7     1       2        8  ▇▁▁▁▁

Hypothesis tests

Independent-samples t-test

To test whether the mean height differs between genders, use t.test(). The formula syntax height ~ gender means “model height as a function of gender”:

result_1 <- t.test(height ~ gender, data = weight_df)
result_1

    Welch Two Sample t-test

data:  height by gender
t = -71.115, df = 4173.4, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 -13.12629 -12.42197
sample estimates:
mean in group Female   mean in group Male 
            162.8473             175.6215 

The output shows the t-statistic, degrees of freedom, p-value, and a 95% confidence interval for the difference in means.

Correlation

cor.test() tests the Pearson correlation between two variables:

result_2 <- cor.test(~ height + weight, data = weight_df)
result_2

    Pearson's product-moment correlation

data:  height and weight
t = 68.472, df = 6066, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6458323 0.6742251
sample estimates:
      cor 
0.6602645 

For Spearman’s rank correlation, pass method = "spearman":

result_3 <- cor.test(~ height + weight, data = weight_df,
                     exact = FALSE,
                     method = "spearman")
result_3

    Spearman's rank correlation rho

data:  height and weight
S = 1.2536e+10, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.6633652 

Linear regression

Simple linear regression

lm() fits a linear model. Here we model weight as a linear function of height:

result_4 <- lm(weight ~ height, data = weight_df)
summary(result_4)

Call:
lm(formula = weight ~ height, data = weight_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-35.563  -8.163  -0.594   7.239  46.482 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -117.12802    2.87869  -40.69   <2e-16 ***
height         1.14814    0.01677   68.47   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.76 on 6066 degrees of freedom
Multiple R-squared:  0.4359,    Adjusted R-squared:  0.4359 
F-statistic:  4688 on 1 and 6066 DF,  p-value: < 2.2e-16

summary() shows the estimated intercept and slope, their standard errors and p-values, and the overall model fit (R-squared and F-statistic).

To get confidence intervals for the coefficients:

confint(result_4)
                  2.5 %      97.5 %
(Intercept) -122.771280 -111.484761
height         1.115266    1.181009

For a 99% confidence interval instead of the default 95%:

confint(result_4, level = 0.99)
                  0.5 %      99.5 %
(Intercept) -124.545375 -109.710666
height         1.104932    1.191343

Multiple regression

Add more predictors by extending the formula with +:

result_5 <- lm(weight ~ height + age, data = weight_df)
summary(result_5)

Call:
lm(formula = weight ~ height + age, data = weight_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-42.147  -7.540  -0.616   6.763  47.749 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -125.52848    2.80022  -44.83   <2e-16 ***
height         1.13395    0.01617   70.14   <2e-16 ***
age            0.36399    0.01678   21.70   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.33 on 6065 degrees of freedom
Multiple R-squared:  0.4766,    Adjusted R-squared:  0.4764 
F-statistic:  2761 on 2 and 6065 DF,  p-value: < 2.2e-16

Each coefficient now represents the effect of that variable holding the others constant.

Adding a categorical predictor

Including gender alongside height gives an additive model with a separate intercept for each gender:

result_6 <- lm(weight ~ height + gender, data = weight_df)
summary(result_6)

Call:
lm(formula = weight ~ height + gender, data = weight_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.246  -7.811  -0.693   7.165  47.203 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -87.72937    3.61930  -24.24   <2e-16 ***
height        0.95481    0.02217   43.07   <2e-16 ***
genderMale    5.56894    0.42523   13.10   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.59 on 6065 degrees of freedom
Multiple R-squared:  0.4515,    Adjusted R-squared:  0.4513 
F-statistic:  2496 on 2 and 6065 DF,  p-value: < 2.2e-16

The coefficient for genderMale (or whichever level R treats as non-reference) is the estimated difference in weight between genders at any given height.

Interaction

An interaction term (height:gender or equivalently height * gender) allows the slope of height to differ between genders:

result_7 <- lm(weight ~ height * gender, data = weight_df)
summary(result_7)

Call:
lm(formula = weight ~ height * gender, data = weight_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.388  -7.811  -0.664   7.130  47.042 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -80.88583    6.60624 -12.244   <2e-16 ***
height              0.91278    0.04054  22.518   <2e-16 ***
genderMale         -4.42314    8.08057  -0.547    0.584    
height:genderMale   0.05995    0.04842   1.238    0.216    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.59 on 6064 degrees of freedom
Multiple R-squared:  0.4516,    Adjusted R-squared:  0.4513 
F-statistic:  1665 on 3 and 6064 DF,  p-value: < 2.2e-16

The interaction coefficient tells you how much the slope of height changes when moving from one gender to the other.

To compare the additive and interaction models formally, use anova():

anova(result_6, result_7)
Analysis of Variance Table

Model 1: weight ~ height + gender
Model 2: weight ~ height * gender
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1   6065 815391                           
2   6064 815185  1    206.12 1.5333 0.2157

This tests the null hypothesis that the simpler (additive) model fits as well as the more complex (interaction) model. A small p-value is evidence in favour of the interaction.

One-way ANOVA

One-way ANOVA tests whether group means differ across more than two groups. The example here uses the built-in PlantGrowth dataset, which records plant weight across three treatment groups.

result_8 <- aov(weight ~ group, data = PlantGrowth)
summary(result_8)
            Df Sum Sq Mean Sq F value Pr(>F)  
group        2  3.766  1.8832   4.846 0.0159 *
Residuals   27 10.492  0.3886                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-statistic and its p-value test whether any group means differ. A significant result tells us that some differences exist, but not which pairs.

Post-hoc pairwise comparisons

To identify which groups differ from which, use emmeans() with a Bonferroni correction for multiple comparisons:

emmeans(result_8, specs = pairwise ~ group, adjust = "bonferroni")
$emmeans
 group emmean    SE df lower.CL upper.CL
 ctrl    5.03 0.197 27     4.63     5.44
 trt1    4.66 0.197 27     4.26     5.07
 trt2    5.53 0.197 27     5.12     5.93

Confidence level used: 0.95 

$contrasts
 contrast    estimate    SE df t.ratio p.value
 ctrl - trt1    0.371 0.279 27   1.331  0.5832
 ctrl - trt2   -0.494 0.279 27  -1.772  0.2630
 trt1 - trt2   -0.865 0.279 27  -3.103  0.0134

P value adjustment: bonferroni method for 3 tests 

The output shows the estimated marginal means for each group and the pairwise differences with adjusted p-values and confidence intervals.