Data Wrangling

Author

Mark Andrews

Abstract

This guide introduces data wrangling with the dplyr package. It covers the core verbs for reshaping and filtering data frames: select, rename, relocate, slice, filter, mutate, and summarise. It also shows how the pipe operator chains these operations into readable, step-by-step pipelines.

Setup

Load tidyverse and read in the dataset. The British Lexicon Project (BLP) data records reaction times and responses in a lexical decision task, where participants had to decide whether a letter string was a real English word.

library(tidyverse)

blp_df <- read_csv(
  "https://raw.githubusercontent.com/mark-andrews/isurr/refs/heads/isurr25/data/blp-trials-short.txt"
)
glimpse(blp_df)
Rows: 1,000
Columns: 7
$ participant <dbl> 20, 9, 47, 103, 45, 73, 24, 11, 32, 96, 82, 37, 52, 96, 96…
$ lex         <chr> "N", "N", "N", "N", "W", "W", "W", "W", "W", "N", "W", "W"…
$ spell       <chr> "staud", "dinbuss", "snilling", "gancens", "filled", "jour…
$ resp        <chr> "N", "N", "N", "N", "W", "W", "W", "W", "W", "N", "W", "N"…
$ rt          <dbl> 977, 565, 562, 572, 659, 538, 626, 566, 922, 555, 657, NA,…
$ prev.rt     <dbl> 511, 765, 496, 656, 981, 1505, 546, 717, 1471, 806, 728, 5…
$ rt.raw      <dbl> 977, 565, 562, 572, 659, 538, 626, 566, 922, 555, 657, 712…

The key columns are participant (a numeric ID), lex (whether the string was a word, W, or non-word, N), resp (the participant’s response), rt (the cleaned reaction time in ms), and rt.raw (the raw reaction time).

Selecting columns

select() keeps only the columns you name:

select(blp_df, lex, resp, rt)
# A tibble: 1,000 × 3
   lex   resp     rt
   <chr> <chr> <dbl>
 1 N     N       977
 2 N     N       565
 3 N     N       562
 4 N     N       572
 5 W     W       659
 6 W     W       538
 7 W     W       626
 8 W     W       566
 9 W     W       922
10 N     N       555
# ℹ 990 more rows

You can also select by column number, by range, or using helper functions:

select(blp_df, 1, 5, 3)
# A tibble: 1,000 × 3
   participant    rt spell   
         <dbl> <dbl> <chr>   
 1          20   977 staud   
 2           9   565 dinbuss 
 3          47   562 snilling
 4         103   572 gancens 
 5          45   659 filled  
 6          73   538 journals
 7          24   626 apache  
 8          11   566 flake   
 9          32   922 reliefs 
10          96   555 sarves  
# ℹ 990 more rows
select(blp_df, participant:rt)
# A tibble: 1,000 × 5
   participant lex   spell    resp     rt
         <dbl> <chr> <chr>    <chr> <dbl>
 1          20 N     staud    N       977
 2           9 N     dinbuss  N       565
 3          47 N     snilling N       562
 4         103 N     gancens  N       572
 5          45 W     filled   W       659
 6          73 W     journals W       538
 7          24 W     apache   W       626
 8          11 W     flake    W       566
 9          32 W     reliefs  W       922
10          96 N     sarves   N       555
# ℹ 990 more rows
select(blp_df, starts_with("r"))
# A tibble: 1,000 × 3
   resp     rt rt.raw
   <chr> <dbl>  <dbl>
 1 N       977    977
 2 N       565    565
 3 N       562    562
 4 N       572    572
 5 W       659    659
 6 W       538    538
 7 W       626    626
 8 W       566    566
 9 W       922    922
10 N       555    555
# ℹ 990 more rows
select(blp_df, ends_with("t"))
# A tibble: 1,000 × 3
   participant    rt prev.rt
         <dbl> <dbl>   <dbl>
 1          20   977     511
 2           9   565     765
 3          47   562     496
 4         103   572     656
 5          45   659     981
 6          73   538    1505
 7          24   626     546
 8          11   566     717
 9          32   922    1471
10          96   555     806
# ℹ 990 more rows
select(blp_df, contains("rt"))
# A tibble: 1,000 × 4
   participant    rt prev.rt rt.raw
         <dbl> <dbl>   <dbl>  <dbl>
 1          20   977     511    977
 2           9   565     765    565
 3          47   562     496    562
 4         103   572     656    572
 5          45   659     981    659
 6          73   538    1505    538
 7          24   626     546    626
 8          11   566     717    566
 9          32   922    1471    922
10          96   555     806    555
# ℹ 990 more rows

To drop columns, prefix their name with -:

select(blp_df, -participant)
# A tibble: 1,000 × 6
   lex   spell    resp     rt prev.rt rt.raw
   <chr> <chr>    <chr> <dbl>   <dbl>  <dbl>
 1 N     staud    N       977     511    977
 2 N     dinbuss  N       565     765    565
 3 N     snilling N       562     496    562
 4 N     gancens  N       572     656    572
 5 W     filled   W       659     981    659
 6 W     journals W       538    1505    538
 7 W     apache   W       626     546    626
 8 W     flake    W       566     717    566
 9 W     reliefs  W       922    1471    922
10 N     sarves   N       555     806    555
# ℹ 990 more rows
select(blp_df, -starts_with("rt"))
# A tibble: 1,000 × 5
   participant lex   spell    resp  prev.rt
         <dbl> <chr> <chr>    <chr>   <dbl>
 1          20 N     staud    N         511
 2           9 N     dinbuss  N         765
 3          47 N     snilling N         496
 4         103 N     gancens  N         656
 5          45 W     filled   W         981
 6          73 W     journals W        1505
 7          24 W     apache   W         546
 8          11 W     flake    W         717
 9          32 W     reliefs  W        1471
10          96 N     sarves   N         806
# ℹ 990 more rows

Relocating columns

relocate() moves columns without dropping others. By default it moves the named column to the front:

relocate(blp_df, rt)
# A tibble: 1,000 × 7
      rt participant lex   spell    resp  prev.rt rt.raw
   <dbl>       <dbl> <chr> <chr>    <chr>   <dbl>  <dbl>
 1   977          20 N     staud    N         511    977
 2   565           9 N     dinbuss  N         765    565
 3   562          47 N     snilling N         496    562
 4   572         103 N     gancens  N         656    572
 5   659          45 W     filled   W         981    659
 6   538          73 W     journals W        1505    538
 7   626          24 W     apache   W         546    626
 8   566          11 W     flake    W         717    566
 9   922          32 W     reliefs  W        1471    922
10   555          96 N     sarves   N         806    555
# ℹ 990 more rows

Use .after or .before to place it next to a specific column:

relocate(blp_df, starts_with("rt"), .after = participant)
# A tibble: 1,000 × 7
   participant    rt rt.raw lex   spell    resp  prev.rt
         <dbl> <dbl>  <dbl> <chr> <chr>    <chr>   <dbl>
 1          20   977    977 N     staud    N         511
 2           9   565    565 N     dinbuss  N         765
 3          47   562    562 N     snilling N         496
 4         103   572    572 N     gancens  N         656
 5          45   659    659 W     filled   W         981
 6          73   538    538 W     journals W        1505
 7          24   626    626 W     apache   W         546
 8          11   566    566 W     flake    W         717
 9          32   922    922 W     reliefs  W        1471
10          96   555    555 N     sarves   N         806
# ℹ 990 more rows

Renaming columns

rename() changes column names, keeping all other columns:

rename(blp_df, ID = participant)
# A tibble: 1,000 × 7
      ID lex   spell    resp     rt prev.rt rt.raw
   <dbl> <chr> <chr>    <chr> <dbl>   <dbl>  <dbl>
 1    20 N     staud    N       977     511    977
 2     9 N     dinbuss  N       565     765    565
 3    47 N     snilling N       562     496    562
 4   103 N     gancens  N       572     656    572
 5    45 W     filled   W       659     981    659
 6    73 W     journals W       538    1505    538
 7    24 W     apache   W       626     546    626
 8    11 W     flake    W       566     717    566
 9    32 W     reliefs  W       922    1471    922
10    96 N     sarves   N       555     806    555
# ℹ 990 more rows
rename(blp_df, ID = participant, reaction_time = rt)
# A tibble: 1,000 × 7
      ID lex   spell    resp  reaction_time prev.rt rt.raw
   <dbl> <chr> <chr>    <chr>         <dbl>   <dbl>  <dbl>
 1    20 N     staud    N               977     511    977
 2     9 N     dinbuss  N               565     765    565
 3    47 N     snilling N               562     496    562
 4   103 N     gancens  N               572     656    572
 5    45 W     filled   W               659     981    659
 6    73 W     journals W               538    1505    538
 7    24 W     apache   W               626     546    626
 8    11 W     flake    W               566     717    566
 9    32 W     reliefs  W               922    1471    922
10    96 N     sarves   N               555     806    555
# ℹ 990 more rows

Note: select() can also rename while selecting — select(blp_df, ID = participant) keeps only ID. rename() keeps everything.

Slicing rows

slice() selects rows by position:

slice(blp_df, c(1, 200, 250))
# A tibble: 3 × 7
  participant lex   spell   resp     rt prev.rt rt.raw
        <dbl> <chr> <chr>   <chr> <dbl>   <dbl>  <dbl>
1          20 N     staud   N       977     511    977
2          74 N     hooths  N       711    1935    711
3         105 N     fondism N       827     541    827
slice(blp_df, 1:15)
# A tibble: 15 × 7
   participant lex   spell     resp     rt prev.rt rt.raw
         <dbl> <chr> <chr>     <chr> <dbl>   <dbl>  <dbl>
 1          20 N     staud     N       977     511    977
 2           9 N     dinbuss   N       565     765    565
 3          47 N     snilling  N       562     496    562
 4         103 N     gancens   N       572     656    572
 5          45 W     filled    W       659     981    659
 6          73 W     journals  W       538    1505    538
 7          24 W     apache    W       626     546    626
 8          11 W     flake     W       566     717    566
 9          32 W     reliefs   W       922    1471    922
10          96 N     sarves    N       555     806    555
11          82 W     deceits   W       657     728    657
12          37 W     nothings  N        NA     552    712
13          52 N     chuespies N       427     539    427
14          96 N     mowny     N      1352    1020   1352
15          96 N     cranned   N       907     573    907
slice_head(blp_df, n = 15)
# A tibble: 15 × 7
   participant lex   spell     resp     rt prev.rt rt.raw
         <dbl> <chr> <chr>     <chr> <dbl>   <dbl>  <dbl>
 1          20 N     staud     N       977     511    977
 2           9 N     dinbuss   N       565     765    565
 3          47 N     snilling  N       562     496    562
 4         103 N     gancens   N       572     656    572
 5          45 W     filled    W       659     981    659
 6          73 W     journals  W       538    1505    538
 7          24 W     apache    W       626     546    626
 8          11 W     flake     W       566     717    566
 9          32 W     reliefs   W       922    1471    922
10          96 N     sarves    N       555     806    555
11          82 W     deceits   W       657     728    657
12          37 W     nothings  N        NA     552    712
13          52 N     chuespies N       427     539    427
14          96 N     mowny     N      1352    1020   1352
15          96 N     cranned   N       907     573    907
slice_tail(blp_df, n = 15)
# A tibble: 15 × 7
   participant lex   spell     resp     rt prev.rt rt.raw
         <dbl> <chr> <chr>     <chr> <dbl>   <dbl>  <dbl>
 1          15 N     tiped     N       654     508    654
 2          22 N     cuess     N       553     520    553
 3          98 N     rotord    N       402     497    402
 4          37 W     wrenched  W       516     515    516
 5          29 N     khandles  N       511     777    511
 6          88 N     ixcurs    N       504     552    504
 7          50 N     homply    N       518     583    518
 8         103 W     baste     W       683     454    683
 9          67 W     tall      W       476     572    476
10          45 W     gardens   W       586    1023    586
11         105 W     goldfinch N        NA     903    775
12          72 W     varmint   N        NA     507    653
13           3 W     lurked    W       537     520    537
14           3 W     village   W       538     522    538
15          17 W     fudge     W       410     437    410

Filtering rows

filter() keeps rows that satisfy a logical condition:

filter(blp_df, rt > 500)
# A tibble: 650 × 7
   participant lex   spell    resp     rt prev.rt rt.raw
         <dbl> <chr> <chr>    <chr> <dbl>   <dbl>  <dbl>
 1          20 N     staud    N       977     511    977
 2           9 N     dinbuss  N       565     765    565
 3          47 N     snilling N       562     496    562
 4         103 N     gancens  N       572     656    572
 5          45 W     filled   W       659     981    659
 6          73 W     journals W       538    1505    538
 7          24 W     apache   W       626     546    626
 8          11 W     flake    W       566     717    566
 9          32 W     reliefs  W       922    1471    922
10          96 N     sarves   N       555     806    555
# ℹ 640 more rows
filter(blp_df, rt >= 500, rt <= 1000)
# A tibble: 610 × 7
   participant lex   spell    resp     rt prev.rt rt.raw
         <dbl> <chr> <chr>    <chr> <dbl>   <dbl>  <dbl>
 1          20 N     staud    N       977     511    977
 2           9 N     dinbuss  N       565     765    565
 3          47 N     snilling N       562     496    562
 4         103 N     gancens  N       572     656    572
 5          45 W     filled   W       659     981    659
 6          73 W     journals W       538    1505    538
 7          24 W     apache   W       626     546    626
 8          11 W     flake    W       566     717    566
 9          32 W     reliefs  W       922    1471    922
10          96 N     sarves   N       555     806    555
# ℹ 600 more rows
filter(blp_df, lex == "W", rt >= 500, rt <= 1000)
# A tibble: 279 × 7
   participant lex   spell    resp     rt prev.rt rt.raw
         <dbl> <chr> <chr>    <chr> <dbl>   <dbl>  <dbl>
 1          45 W     filled   W       659     981    659
 2          73 W     journals W       538    1505    538
 3          24 W     apache   W       626     546    626
 4          11 W     flake    W       566     717    566
 5          32 W     reliefs  W       922    1471    922
 6          82 W     deceits  W       657     728    657
 7          46 W     mirage   W       778     571    778
 8          75 W     poser    W       696     519    696
 9          36 W     outposts W       560     461    560
10          54 W     curbs    W       697     653    697
# ℹ 269 more rows

Multiple conditions separated by commas are combined with AND — a row must satisfy all of them to be kept. Use | for OR conditions.

Mutating columns

mutate() adds new columns or modifies existing ones. The new column is computed row by row from existing columns:

mutate(blp_df, accuracy = lex == resp)
# A tibble: 1,000 × 8
   participant lex   spell    resp     rt prev.rt rt.raw accuracy
         <dbl> <chr> <chr>    <chr> <dbl>   <dbl>  <dbl> <lgl>   
 1          20 N     staud    N       977     511    977 TRUE    
 2           9 N     dinbuss  N       565     765    565 TRUE    
 3          47 N     snilling N       562     496    562 TRUE    
 4         103 N     gancens  N       572     656    572 TRUE    
 5          45 W     filled   W       659     981    659 TRUE    
 6          73 W     journals W       538    1505    538 TRUE    
 7          24 W     apache   W       626     546    626 TRUE    
 8          11 W     flake    W       566     717    566 TRUE    
 9          32 W     reliefs  W       922    1471    922 TRUE    
10          96 N     sarves   N       555     806    555 TRUE    
# ℹ 990 more rows

Here accuracy is TRUE where the participant’s response matched the correct answer. Use .after to control placement:

mutate(blp_df, accuracy = lex == resp, .after = resp)
# A tibble: 1,000 × 8
   participant lex   spell    resp  accuracy    rt prev.rt rt.raw
         <dbl> <chr> <chr>    <chr> <lgl>    <dbl>   <dbl>  <dbl>
 1          20 N     staud    N     TRUE       977     511    977
 2           9 N     dinbuss  N     TRUE       565     765    565
 3          47 N     snilling N     TRUE       562     496    562
 4         103 N     gancens  N     TRUE       572     656    572
 5          45 W     filled   W     TRUE       659     981    659
 6          73 W     journals W     TRUE       538    1505    538
 7          24 W     apache   W     TRUE       626     546    626
 8          11 W     flake    W     TRUE       566     717    566
 9          32 W     reliefs  W     TRUE       922    1471    922
10          96 N     sarves   N     TRUE       555     806    555
# ℹ 990 more rows

if_else() applies a condition and returns different values for TRUE and FALSE:

mutate(blp_df,
       rt_speed = if_else(rt < median(rt, na.rm = TRUE), "fast", "slow"))
# A tibble: 1,000 × 8
   participant lex   spell    resp     rt prev.rt rt.raw rt_speed
         <dbl> <chr> <chr>    <chr> <dbl>   <dbl>  <dbl> <chr>   
 1          20 N     staud    N       977     511    977 slow    
 2           9 N     dinbuss  N       565     765    565 fast    
 3          47 N     snilling N       562     496    562 fast    
 4         103 N     gancens  N       572     656    572 fast    
 5          45 W     filled   W       659     981    659 slow    
 6          73 W     journals W       538    1505    538 fast    
 7          24 W     apache   W       626     546    626 slow    
 8          11 W     flake    W       566     717    566 fast    
 9          32 W     reliefs  W       922    1471    922 slow    
10          96 N     sarves   N       555     806    555 fast    
# ℹ 990 more rows

Summarising

summarise() collapses the data frame to a single row of summary statistics:

summarise(blp_df,
          avg_rt   = mean(rt.raw),
          mad_rt   = mad(rt.raw),
          accuracy = mean(lex == resp))
# A tibble: 1 × 3
  avg_rt mad_rt accuracy
   <dbl>  <dbl>    <dbl>
1   708.   153.    0.852

Use .by to compute summaries separately for each level of a grouping variable:

summarise(blp_df,
          avg_rt   = mean(rt.raw),
          mad_rt   = mad(rt.raw),
          accuracy = mean(lex == resp),
          .by = lex)
# A tibble: 2 × 4
  lex   avg_rt mad_rt accuracy
  <chr>  <dbl>  <dbl>    <dbl>
1 N       686.   158.    0.930
2 W       730.   150.    0.773
summarise(blp_df,
          avg_rt   = mean(rt.raw),
          mad_rt   = mad(rt.raw),
          accuracy = mean(lex == resp),
          .by = resp)
# A tibble: 2 × 4
  resp  avg_rt mad_rt accuracy
  <chr>  <dbl>  <dbl>    <dbl>
1 N       706.   165.    0.805
2 W       711.   136.    0.917

The pipe

The pipe |> sends the result of one expression into the first argument of the next function. This becomes particularly powerful when chaining several dplyr verbs together.

Without the pipe, nested calls quickly become hard to read:

primes <- c(2, 3, 5, 7, 11, 13)
log(sum(sqrt(log(primes))))
[1] 2.040487

With the pipe, the same operations read left to right:

primes |> log() |> sqrt() |> sum() |> log()
[1] 2.040487

Building a pipeline

Here is a complete data processing pipeline: read the data, create a new column, select and rename columns, convert a column type, drop rows with missing values, and summarise — all in a single readable chain:

read_csv("https://raw.githubusercontent.com/mark-andrews/isurr/refs/heads/isurr25/data/blp-trials-short.txt") |>
  mutate(accuracy = lex == resp) |>
  select(ID = participant, lex, resp, rt = rt.raw, accuracy) |>
  mutate(ID = as.factor(ID)) |>
  drop_na() |>
  summarise(avg_rt = mean(rt), sd_rt = sd(rt), .by = c(lex, accuracy))

Each step does one clear thing. Reading the pipeline from top to bottom tells you the full story of what happens to the data. This is the core pattern of tidyverse data wrangling: a sequence of small, named operations chained together in a way that is easy to read, check, and modify.