library(tidyverse)
blp_df <- read_csv(
"https://raw.githubusercontent.com/mark-andrews/isurr/refs/heads/isurr25/data/blp-trials-short.txt"
)Data Wrangling
This guide introduces data wrangling with the dplyr package. It covers the core verbs for reshaping and filtering data frames: select, rename, relocate, slice, filter, mutate, and summarise. It also shows how the pipe operator chains these operations into readable, step-by-step pipelines.
Setup
Load tidyverse and read in the dataset. The British Lexicon Project (BLP) data records reaction times and responses in a lexical decision task, where participants had to decide whether a letter string was a real English word.
glimpse(blp_df)Rows: 1,000
Columns: 7
$ participant <dbl> 20, 9, 47, 103, 45, 73, 24, 11, 32, 96, 82, 37, 52, 96, 96…
$ lex <chr> "N", "N", "N", "N", "W", "W", "W", "W", "W", "N", "W", "W"…
$ spell <chr> "staud", "dinbuss", "snilling", "gancens", "filled", "jour…
$ resp <chr> "N", "N", "N", "N", "W", "W", "W", "W", "W", "N", "W", "N"…
$ rt <dbl> 977, 565, 562, 572, 659, 538, 626, 566, 922, 555, 657, NA,…
$ prev.rt <dbl> 511, 765, 496, 656, 981, 1505, 546, 717, 1471, 806, 728, 5…
$ rt.raw <dbl> 977, 565, 562, 572, 659, 538, 626, 566, 922, 555, 657, 712…
The key columns are participant (a numeric ID), lex (whether the string was a word, W, or non-word, N), resp (the participant’s response), rt (the cleaned reaction time in ms), and rt.raw (the raw reaction time).
Selecting columns
select() keeps only the columns you name:
select(blp_df, lex, resp, rt)# A tibble: 1,000 × 3
lex resp rt
<chr> <chr> <dbl>
1 N N 977
2 N N 565
3 N N 562
4 N N 572
5 W W 659
6 W W 538
7 W W 626
8 W W 566
9 W W 922
10 N N 555
# ℹ 990 more rows
You can also select by column number, by range, or using helper functions:
select(blp_df, 1, 5, 3)# A tibble: 1,000 × 3
participant rt spell
<dbl> <dbl> <chr>
1 20 977 staud
2 9 565 dinbuss
3 47 562 snilling
4 103 572 gancens
5 45 659 filled
6 73 538 journals
7 24 626 apache
8 11 566 flake
9 32 922 reliefs
10 96 555 sarves
# ℹ 990 more rows
select(blp_df, participant:rt)# A tibble: 1,000 × 5
participant lex spell resp rt
<dbl> <chr> <chr> <chr> <dbl>
1 20 N staud N 977
2 9 N dinbuss N 565
3 47 N snilling N 562
4 103 N gancens N 572
5 45 W filled W 659
6 73 W journals W 538
7 24 W apache W 626
8 11 W flake W 566
9 32 W reliefs W 922
10 96 N sarves N 555
# ℹ 990 more rows
select(blp_df, starts_with("r"))# A tibble: 1,000 × 3
resp rt rt.raw
<chr> <dbl> <dbl>
1 N 977 977
2 N 565 565
3 N 562 562
4 N 572 572
5 W 659 659
6 W 538 538
7 W 626 626
8 W 566 566
9 W 922 922
10 N 555 555
# ℹ 990 more rows
select(blp_df, ends_with("t"))# A tibble: 1,000 × 3
participant rt prev.rt
<dbl> <dbl> <dbl>
1 20 977 511
2 9 565 765
3 47 562 496
4 103 572 656
5 45 659 981
6 73 538 1505
7 24 626 546
8 11 566 717
9 32 922 1471
10 96 555 806
# ℹ 990 more rows
select(blp_df, contains("rt"))# A tibble: 1,000 × 4
participant rt prev.rt rt.raw
<dbl> <dbl> <dbl> <dbl>
1 20 977 511 977
2 9 565 765 565
3 47 562 496 562
4 103 572 656 572
5 45 659 981 659
6 73 538 1505 538
7 24 626 546 626
8 11 566 717 566
9 32 922 1471 922
10 96 555 806 555
# ℹ 990 more rows
To drop columns, prefix their name with -:
select(blp_df, -participant)# A tibble: 1,000 × 6
lex spell resp rt prev.rt rt.raw
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 N staud N 977 511 977
2 N dinbuss N 565 765 565
3 N snilling N 562 496 562
4 N gancens N 572 656 572
5 W filled W 659 981 659
6 W journals W 538 1505 538
7 W apache W 626 546 626
8 W flake W 566 717 566
9 W reliefs W 922 1471 922
10 N sarves N 555 806 555
# ℹ 990 more rows
select(blp_df, -starts_with("rt"))# A tibble: 1,000 × 5
participant lex spell resp prev.rt
<dbl> <chr> <chr> <chr> <dbl>
1 20 N staud N 511
2 9 N dinbuss N 765
3 47 N snilling N 496
4 103 N gancens N 656
5 45 W filled W 981
6 73 W journals W 1505
7 24 W apache W 546
8 11 W flake W 717
9 32 W reliefs W 1471
10 96 N sarves N 806
# ℹ 990 more rows
Relocating columns
relocate() moves columns without dropping others. By default it moves the named column to the front:
relocate(blp_df, rt)# A tibble: 1,000 × 7
rt participant lex spell resp prev.rt rt.raw
<dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 977 20 N staud N 511 977
2 565 9 N dinbuss N 765 565
3 562 47 N snilling N 496 562
4 572 103 N gancens N 656 572
5 659 45 W filled W 981 659
6 538 73 W journals W 1505 538
7 626 24 W apache W 546 626
8 566 11 W flake W 717 566
9 922 32 W reliefs W 1471 922
10 555 96 N sarves N 806 555
# ℹ 990 more rows
Use .after or .before to place it next to a specific column:
relocate(blp_df, starts_with("rt"), .after = participant)# A tibble: 1,000 × 7
participant rt rt.raw lex spell resp prev.rt
<dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl>
1 20 977 977 N staud N 511
2 9 565 565 N dinbuss N 765
3 47 562 562 N snilling N 496
4 103 572 572 N gancens N 656
5 45 659 659 W filled W 981
6 73 538 538 W journals W 1505
7 24 626 626 W apache W 546
8 11 566 566 W flake W 717
9 32 922 922 W reliefs W 1471
10 96 555 555 N sarves N 806
# ℹ 990 more rows
Renaming columns
rename() changes column names, keeping all other columns:
rename(blp_df, ID = participant)# A tibble: 1,000 × 7
ID lex spell resp rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 20 N staud N 977 511 977
2 9 N dinbuss N 565 765 565
3 47 N snilling N 562 496 562
4 103 N gancens N 572 656 572
5 45 W filled W 659 981 659
6 73 W journals W 538 1505 538
7 24 W apache W 626 546 626
8 11 W flake W 566 717 566
9 32 W reliefs W 922 1471 922
10 96 N sarves N 555 806 555
# ℹ 990 more rows
rename(blp_df, ID = participant, reaction_time = rt)# A tibble: 1,000 × 7
ID lex spell resp reaction_time prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 20 N staud N 977 511 977
2 9 N dinbuss N 565 765 565
3 47 N snilling N 562 496 562
4 103 N gancens N 572 656 572
5 45 W filled W 659 981 659
6 73 W journals W 538 1505 538
7 24 W apache W 626 546 626
8 11 W flake W 566 717 566
9 32 W reliefs W 922 1471 922
10 96 N sarves N 555 806 555
# ℹ 990 more rows
Note: select() can also rename while selecting — select(blp_df, ID = participant) keeps only ID. rename() keeps everything.
Slicing rows
slice() selects rows by position:
slice(blp_df, c(1, 200, 250))# A tibble: 3 × 7
participant lex spell resp rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 20 N staud N 977 511 977
2 74 N hooths N 711 1935 711
3 105 N fondism N 827 541 827
slice(blp_df, 1:15)# A tibble: 15 × 7
participant lex spell resp rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 20 N staud N 977 511 977
2 9 N dinbuss N 565 765 565
3 47 N snilling N 562 496 562
4 103 N gancens N 572 656 572
5 45 W filled W 659 981 659
6 73 W journals W 538 1505 538
7 24 W apache W 626 546 626
8 11 W flake W 566 717 566
9 32 W reliefs W 922 1471 922
10 96 N sarves N 555 806 555
11 82 W deceits W 657 728 657
12 37 W nothings N NA 552 712
13 52 N chuespies N 427 539 427
14 96 N mowny N 1352 1020 1352
15 96 N cranned N 907 573 907
slice_head(blp_df, n = 15)# A tibble: 15 × 7
participant lex spell resp rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 20 N staud N 977 511 977
2 9 N dinbuss N 565 765 565
3 47 N snilling N 562 496 562
4 103 N gancens N 572 656 572
5 45 W filled W 659 981 659
6 73 W journals W 538 1505 538
7 24 W apache W 626 546 626
8 11 W flake W 566 717 566
9 32 W reliefs W 922 1471 922
10 96 N sarves N 555 806 555
11 82 W deceits W 657 728 657
12 37 W nothings N NA 552 712
13 52 N chuespies N 427 539 427
14 96 N mowny N 1352 1020 1352
15 96 N cranned N 907 573 907
slice_tail(blp_df, n = 15)# A tibble: 15 × 7
participant lex spell resp rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 15 N tiped N 654 508 654
2 22 N cuess N 553 520 553
3 98 N rotord N 402 497 402
4 37 W wrenched W 516 515 516
5 29 N khandles N 511 777 511
6 88 N ixcurs N 504 552 504
7 50 N homply N 518 583 518
8 103 W baste W 683 454 683
9 67 W tall W 476 572 476
10 45 W gardens W 586 1023 586
11 105 W goldfinch N NA 903 775
12 72 W varmint N NA 507 653
13 3 W lurked W 537 520 537
14 3 W village W 538 522 538
15 17 W fudge W 410 437 410
Filtering rows
filter() keeps rows that satisfy a logical condition:
filter(blp_df, rt > 500)# A tibble: 650 × 7
participant lex spell resp rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 20 N staud N 977 511 977
2 9 N dinbuss N 565 765 565
3 47 N snilling N 562 496 562
4 103 N gancens N 572 656 572
5 45 W filled W 659 981 659
6 73 W journals W 538 1505 538
7 24 W apache W 626 546 626
8 11 W flake W 566 717 566
9 32 W reliefs W 922 1471 922
10 96 N sarves N 555 806 555
# ℹ 640 more rows
filter(blp_df, rt >= 500, rt <= 1000)# A tibble: 610 × 7
participant lex spell resp rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 20 N staud N 977 511 977
2 9 N dinbuss N 565 765 565
3 47 N snilling N 562 496 562
4 103 N gancens N 572 656 572
5 45 W filled W 659 981 659
6 73 W journals W 538 1505 538
7 24 W apache W 626 546 626
8 11 W flake W 566 717 566
9 32 W reliefs W 922 1471 922
10 96 N sarves N 555 806 555
# ℹ 600 more rows
filter(blp_df, lex == "W", rt >= 500, rt <= 1000)# A tibble: 279 × 7
participant lex spell resp rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 45 W filled W 659 981 659
2 73 W journals W 538 1505 538
3 24 W apache W 626 546 626
4 11 W flake W 566 717 566
5 32 W reliefs W 922 1471 922
6 82 W deceits W 657 728 657
7 46 W mirage W 778 571 778
8 75 W poser W 696 519 696
9 36 W outposts W 560 461 560
10 54 W curbs W 697 653 697
# ℹ 269 more rows
Multiple conditions separated by commas are combined with AND — a row must satisfy all of them to be kept. Use | for OR conditions.
Mutating columns
mutate() adds new columns or modifies existing ones. The new column is computed row by row from existing columns:
mutate(blp_df, accuracy = lex == resp)# A tibble: 1,000 × 8
participant lex spell resp rt prev.rt rt.raw accuracy
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <lgl>
1 20 N staud N 977 511 977 TRUE
2 9 N dinbuss N 565 765 565 TRUE
3 47 N snilling N 562 496 562 TRUE
4 103 N gancens N 572 656 572 TRUE
5 45 W filled W 659 981 659 TRUE
6 73 W journals W 538 1505 538 TRUE
7 24 W apache W 626 546 626 TRUE
8 11 W flake W 566 717 566 TRUE
9 32 W reliefs W 922 1471 922 TRUE
10 96 N sarves N 555 806 555 TRUE
# ℹ 990 more rows
Here accuracy is TRUE where the participant’s response matched the correct answer. Use .after to control placement:
mutate(blp_df, accuracy = lex == resp, .after = resp)# A tibble: 1,000 × 8
participant lex spell resp accuracy rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <lgl> <dbl> <dbl> <dbl>
1 20 N staud N TRUE 977 511 977
2 9 N dinbuss N TRUE 565 765 565
3 47 N snilling N TRUE 562 496 562
4 103 N gancens N TRUE 572 656 572
5 45 W filled W TRUE 659 981 659
6 73 W journals W TRUE 538 1505 538
7 24 W apache W TRUE 626 546 626
8 11 W flake W TRUE 566 717 566
9 32 W reliefs W TRUE 922 1471 922
10 96 N sarves N TRUE 555 806 555
# ℹ 990 more rows
if_else() applies a condition and returns different values for TRUE and FALSE:
mutate(blp_df,
rt_speed = if_else(rt < median(rt, na.rm = TRUE), "fast", "slow"))# A tibble: 1,000 × 8
participant lex spell resp rt prev.rt rt.raw rt_speed
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 20 N staud N 977 511 977 slow
2 9 N dinbuss N 565 765 565 fast
3 47 N snilling N 562 496 562 fast
4 103 N gancens N 572 656 572 fast
5 45 W filled W 659 981 659 slow
6 73 W journals W 538 1505 538 fast
7 24 W apache W 626 546 626 slow
8 11 W flake W 566 717 566 fast
9 32 W reliefs W 922 1471 922 slow
10 96 N sarves N 555 806 555 fast
# ℹ 990 more rows
Summarising
summarise() collapses the data frame to a single row of summary statistics:
summarise(blp_df,
avg_rt = mean(rt.raw),
mad_rt = mad(rt.raw),
accuracy = mean(lex == resp))# A tibble: 1 × 3
avg_rt mad_rt accuracy
<dbl> <dbl> <dbl>
1 708. 153. 0.852
Use .by to compute summaries separately for each level of a grouping variable:
summarise(blp_df,
avg_rt = mean(rt.raw),
mad_rt = mad(rt.raw),
accuracy = mean(lex == resp),
.by = lex)# A tibble: 2 × 4
lex avg_rt mad_rt accuracy
<chr> <dbl> <dbl> <dbl>
1 N 686. 158. 0.930
2 W 730. 150. 0.773
summarise(blp_df,
avg_rt = mean(rt.raw),
mad_rt = mad(rt.raw),
accuracy = mean(lex == resp),
.by = resp)# A tibble: 2 × 4
resp avg_rt mad_rt accuracy
<chr> <dbl> <dbl> <dbl>
1 N 706. 165. 0.805
2 W 711. 136. 0.917
The pipe
The pipe |> sends the result of one expression into the first argument of the next function. This becomes particularly powerful when chaining several dplyr verbs together.
Without the pipe, nested calls quickly become hard to read:
primes <- c(2, 3, 5, 7, 11, 13)
log(sum(sqrt(log(primes))))[1] 2.040487
With the pipe, the same operations read left to right:
primes |> log() |> sqrt() |> sum() |> log()[1] 2.040487
Building a pipeline
Here is a complete data processing pipeline: read the data, create a new column, select and rename columns, convert a column type, drop rows with missing values, and summarise — all in a single readable chain:
read_csv("https://raw.githubusercontent.com/mark-andrews/isurr/refs/heads/isurr25/data/blp-trials-short.txt") |>
mutate(accuracy = lex == resp) |>
select(ID = participant, lex, resp, rt = rt.raw, accuracy) |>
mutate(ID = as.factor(ID)) |>
drop_na() |>
summarise(avg_rt = mean(rt), sd_rt = sd(rt), .by = c(lex, accuracy))Each step does one clear thing. Reading the pipeline from top to bottom tells you the full story of what happens to the data. This is the core pattern of tidyverse data wrangling: a sequence of small, named operations chained together in a way that is easy to read, check, and modify.