library(tidyverse)
library(skimr)
library(emmeans)
weight_df <- read_csv("weight.csv")Statistical Analysis
This guide introduces statistical data analysis in R using a dataset of height, weight, age, and gender measurements. It covers descriptive statistics with skimr, hypothesis tests (t-test and correlation), and linear models ranging from simple regression through to models with interactions. It also introduces one-way ANOVA with post-hoc pairwise comparisons.
Setup
Load the packages and read in the data. The weight.csv file should be in your project folder.
Get a first look at the data with glimpse():
glimpse(weight_df)Rows: 6,068
Columns: 8
$ subjectid <dbl> 10027, 10032, 10033, 10092, 10093, 10115, 10117, 102…
$ gender <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Mal…
$ height <dbl> 177.6, 170.2, 173.5, 165.5, 191.4, 172.0, 181.0, 185…
$ height_selfreport <dbl> 180.34, 172.72, 172.72, 167.64, 195.58, 175.26, 182.…
$ weight <dbl> 81.5, 72.6, 92.9, 79.4, 94.6, 80.2, 116.2, 95.4, 99.…
$ weight_selfreport <dbl> 81.66969, 72.59528, 93.01270, 79.40109, 96.64247, 79…
$ age <dbl> 41, 35, 42, 31, 21, 39, 32, 23, 36, 23, 32, 28, 36, …
$ race <dbl> 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1…
The dataset contains measurements of height (cm), weight (kg), age, and gender for a sample of individuals.
Descriptive statistics
The skim() function from the skimr package provides a rich summary of every variable in a data frame, including means, standard deviations, quantiles, and a small histogram for numeric variables.
skim(weight_df)── Data Summary ────────────────────────
Values
Name weight_df
Number of rows 6068
Number of columns 8
_______________________
Column type frequency:
character 1
numeric 7
________________________
Group variables None
── Variable type: character ────────────────────────────────────────────────────
skim_variable n_missing complete_rate min max empty n_unique whitespace
1 gender 0 1 4 6 0 2 0
── Variable type: numeric ──────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25
1 subjectid 0 1 20757. 13159. 10027 14842.
2 height 0 1 171. 9.00 141. 165.
3 height_selfreport 0 1 173. 9.81 142. 168.
4 weight 0 1 79.7 15.7 35.8 68.2
5 weight_selfreport 0 1 79.3 15.3 0 68.1
6 age 0 1 29.8 8.67 17 23
7 race 0 1 1.62 0.979 1 1
p50 p75 p100 hist
1 20064. 27234. 920103 ▇▁▁▁▁
2 172. 178. 199. ▁▃▇▆▁
3 173. 180. 239. ▂▇▂▁▁
4 78.5 89.6 144. ▁▇▇▂▁
5 78.5 88.9 146. ▁▁▇▃▁
6 28 36 58 ▇▆▃▂▁
7 1 2 8 ▇▁▁▁▁
Hypothesis tests
Independent-samples t-test
To test whether the mean height differs between genders, use t.test(). The formula syntax height ~ gender means “model height as a function of gender”:
result_1 <- t.test(height ~ gender, data = weight_df)
result_1
Welch Two Sample t-test
data: height by gender
t = -71.115, df = 4173.4, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
-13.12629 -12.42197
sample estimates:
mean in group Female mean in group Male
162.8473 175.6215
The output shows the t-statistic, degrees of freedom, p-value, and a 95% confidence interval for the difference in means.
Correlation
cor.test() tests the Pearson correlation between two variables:
result_2 <- cor.test(~ height + weight, data = weight_df)
result_2
Pearson's product-moment correlation
data: height and weight
t = 68.472, df = 6066, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6458323 0.6742251
sample estimates:
cor
0.6602645
For Spearman’s rank correlation, pass method = "spearman":
result_3 <- cor.test(~ height + weight, data = weight_df,
exact = FALSE,
method = "spearman")
result_3
Spearman's rank correlation rho
data: height and weight
S = 1.2536e+10, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.6633652
Linear regression
Simple linear regression
lm() fits a linear model. Here we model weight as a linear function of height:
result_4 <- lm(weight ~ height, data = weight_df)
summary(result_4)
Call:
lm(formula = weight ~ height, data = weight_df)
Residuals:
Min 1Q Median 3Q Max
-35.563 -8.163 -0.594 7.239 46.482
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -117.12802 2.87869 -40.69 <2e-16 ***
height 1.14814 0.01677 68.47 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.76 on 6066 degrees of freedom
Multiple R-squared: 0.4359, Adjusted R-squared: 0.4359
F-statistic: 4688 on 1 and 6066 DF, p-value: < 2.2e-16
summary() shows the estimated intercept and slope, their standard errors and p-values, and the overall model fit (R-squared and F-statistic).
To get confidence intervals for the coefficients:
confint(result_4) 2.5 % 97.5 %
(Intercept) -122.771280 -111.484761
height 1.115266 1.181009
For a 99% confidence interval instead of the default 95%:
confint(result_4, level = 0.99) 0.5 % 99.5 %
(Intercept) -124.545375 -109.710666
height 1.104932 1.191343
Multiple regression
Add more predictors by extending the formula with +:
result_5 <- lm(weight ~ height + age, data = weight_df)
summary(result_5)
Call:
lm(formula = weight ~ height + age, data = weight_df)
Residuals:
Min 1Q Median 3Q Max
-42.147 -7.540 -0.616 6.763 47.749
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -125.52848 2.80022 -44.83 <2e-16 ***
height 1.13395 0.01617 70.14 <2e-16 ***
age 0.36399 0.01678 21.70 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.33 on 6065 degrees of freedom
Multiple R-squared: 0.4766, Adjusted R-squared: 0.4764
F-statistic: 2761 on 2 and 6065 DF, p-value: < 2.2e-16
Each coefficient now represents the effect of that variable holding the others constant.
Adding a categorical predictor
Including gender alongside height gives an additive model with a separate intercept for each gender:
result_6 <- lm(weight ~ height + gender, data = weight_df)
summary(result_6)
Call:
lm(formula = weight ~ height + gender, data = weight_df)
Residuals:
Min 1Q Median 3Q Max
-34.246 -7.811 -0.693 7.165 47.203
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.72937 3.61930 -24.24 <2e-16 ***
height 0.95481 0.02217 43.07 <2e-16 ***
genderMale 5.56894 0.42523 13.10 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.59 on 6065 degrees of freedom
Multiple R-squared: 0.4515, Adjusted R-squared: 0.4513
F-statistic: 2496 on 2 and 6065 DF, p-value: < 2.2e-16
The coefficient for genderMale (or whichever level R treats as non-reference) is the estimated difference in weight between genders at any given height.
Interaction
An interaction term (height:gender or equivalently height * gender) allows the slope of height to differ between genders:
result_7 <- lm(weight ~ height * gender, data = weight_df)
summary(result_7)
Call:
lm(formula = weight ~ height * gender, data = weight_df)
Residuals:
Min 1Q Median 3Q Max
-34.388 -7.811 -0.664 7.130 47.042
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -80.88583 6.60624 -12.244 <2e-16 ***
height 0.91278 0.04054 22.518 <2e-16 ***
genderMale -4.42314 8.08057 -0.547 0.584
height:genderMale 0.05995 0.04842 1.238 0.216
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.59 on 6064 degrees of freedom
Multiple R-squared: 0.4516, Adjusted R-squared: 0.4513
F-statistic: 1665 on 3 and 6064 DF, p-value: < 2.2e-16
The interaction coefficient tells you how much the slope of height changes when moving from one gender to the other.
To compare the additive and interaction models formally, use anova():
anova(result_6, result_7)Analysis of Variance Table
Model 1: weight ~ height + gender
Model 2: weight ~ height * gender
Res.Df RSS Df Sum of Sq F Pr(>F)
1 6065 815391
2 6064 815185 1 206.12 1.5333 0.2157
This tests the null hypothesis that the simpler (additive) model fits as well as the more complex (interaction) model. A small p-value is evidence in favour of the interaction.
One-way ANOVA
One-way ANOVA tests whether group means differ across more than two groups. The example here uses the built-in PlantGrowth dataset, which records plant weight across three treatment groups.
result_8 <- aov(weight ~ group, data = PlantGrowth)
summary(result_8) Df Sum Sq Mean Sq F value Pr(>F)
group 2 3.766 1.8832 4.846 0.0159 *
Residuals 27 10.492 0.3886
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-statistic and its p-value test whether any group means differ. A significant result tells us that some differences exist, but not which pairs.
Post-hoc pairwise comparisons
To identify which groups differ from which, use emmeans() with a Bonferroni correction for multiple comparisons:
emmeans(result_8, specs = pairwise ~ group, adjust = "bonferroni")$emmeans
group emmean SE df lower.CL upper.CL
ctrl 5.03 0.197 27 4.63 5.44
trt1 4.66 0.197 27 4.26 5.07
trt2 5.53 0.197 27 5.12 5.93
Confidence level used: 0.95
$contrasts
contrast estimate SE df t.ratio p.value
ctrl - trt1 0.371 0.279 27 1.331 0.5832
ctrl - trt2 -0.494 0.279 27 -1.772 0.2630
trt1 - trt2 -0.865 0.279 27 -3.103 0.0134
P value adjustment: bonferroni method for 3 tests
The output shows the estimated marginal means for each group and the pairwise differences with adjusted p-values and confidence intervals.