Six Verbs for Data Transformation

EE BIOL C177/C234

Chuliang Song

Today’s Menu 🎯

  1. The pipe operator |>
  2. Row verbs: arrange(), filter()
  3. Column verbs: select(), mutate()
  4. Group verbs: group_by(), summarize()

Setup 📦

Packages for this chapter

We’ll be relying on tidyverse and palmerpenguins. We’ll also use tidylog, a neat package that gives feedback on your data manipulations!

If you haven’t installed tidylog yet, run this in your R console:

pak::pak("tidylog")
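A typical setup chunk might look like the sketch below. It assumes all three packages are already installed. tidylog is loaded last because it wraps the dplyr and tidyr verbs and needs to mask them.

```r
library(tidyverse)       # dplyr, tidyr, ggplot2, ...
library(palmerpenguins)  # provides the `penguins` data frame
library(tidylog)         # load last so its verbose wrappers mask dplyr/tidyr

glimpse(penguins)
```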

The Pipe |>

Without Pipe vs With Pipe

😵 Nested (hard to read)

eat(get_dressed(shower(
    brush(wakeup(you))
)))

✅ Piped (clear flow)

you |>
    wakeup() |>
    brush() |>
    shower() |>
    get_dressed() |>
    eat()

Read |> as “then”
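The same idea with real functions, as a small sketch: the base pipe (available from R 4.1) passes the left-hand side as the first argument of the call on its right, so the two expressions below are equivalent. The vector `x` is only an illustrative value.

```r
x <- c(1, 4, 9)

# Nested: read from the inside out
nested <- round(sum(sqrt(x)), 1)

# Piped: read top to bottom, "take x, then sqrt, then sum, then round"
piped <- x |>
    sqrt() |>
    sum() |>
    round(1)

identical(nested, piped)
```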

Row Verbs

arrange() — Sort Rows
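A minimal sketch of an ascending sort, assuming dplyr and palmerpenguins are installed. `arrange()` reorders rows by the listed columns and places missing values last.

```r
library(dplyr)
library(palmerpenguins)

# Lightest penguins first; rows with NA body mass go to the end
by_mass <- penguins |>
    arrange(body_mass_g)

head(by_mass, 3)
```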

arrange() — Descending

Use - (numeric columns only) or desc() (any column type) for descending order:
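A sketch of two spellings of a descending sort, assuming dplyr and palmerpenguins are installed. `desc()` works on any column type, while the unary minus only works on numeric columns.

```r
library(dplyr)
library(palmerpenguins)

# Heaviest penguins first, using desc()
heavy_first <- penguins |>
    arrange(desc(body_mass_g))

# Same ordering of body mass with the unary minus shortcut
heavy_first_minus <- penguins |>
    arrange(-body_mass_g)

identical(heavy_first$body_mass_g, heavy_first_minus$body_mass_g)
```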

Your Turn! 🏋️

Sort the penguins dataset by island first, then by descending bill_depth_mm.

filter() — Keep Matching Rows
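A minimal sketch, assuming dplyr and palmerpenguins are installed: `filter()` keeps the rows for which the condition is TRUE and drops the rest.

```r
library(dplyr)
library(palmerpenguins)

# Keep only the rows where the condition is TRUE
adelie <- penguins |>
    filter(species == "Adelie")

nrow(adelie)
```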

filter() — Multiple Conditions
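A sketch of combining conditions, assuming dplyr and palmerpenguins are installed. Comma-separated conditions are combined with AND, and rows where a condition evaluates to NA are dropped. The 5000 g cutoff is an arbitrary value chosen for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Commas between conditions act like & (AND)
big_gentoo <- penguins |>
    filter(species == "Gentoo", body_mass_g > 5000)

# The same filter written with an explicit &
big_gentoo_and <- penguins |>
    filter(species == "Gentoo" & body_mass_g > 5000)

identical(big_gentoo, big_gentoo_and)
```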

Common Operators

  • == (Equal): species == "Adelie"
  • != (Not equal): island != "Torgersen"
  • <, > (Less/Greater): body_mass_g > 4000
  • & (AND): sex == "male" & year == 2007
  • | (OR): species == "Adelie" | species == "Gentoo"
  • %in% (In group): island %in% c("Biscoe", "Dream")

Test these operators out here:
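One way to try them, as a sketch assuming dplyr and palmerpenguins are installed. The first two filters select the same rows, once with | and once with %in%. Since there are only three islands, the != filter also returns the same rows.

```r
library(dplyr)
library(palmerpenguins)

# OR written out with |
with_or <- penguins |>
    filter(island == "Biscoe" | island == "Dream")

# The same selection with %in%, which scales better to many values
with_in <- penguins |>
    filter(island %in% c("Biscoe", "Dream"))

# != keeps everything except the named level
not_torgersen <- penguins |>
    filter(island != "Torgersen")

c(nrow(with_or), nrow(with_in), nrow(not_torgersen))
```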

Your Turn! 🏋️

Filter male penguins with bill length < 40 mm, living on either Torgersen or Biscoe island.

Column Verbs

select() — Pick Columns
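A minimal sketch, assuming dplyr and palmerpenguins are installed: `select()` keeps only the named columns, in the order you list them.

```r
library(dplyr)
library(palmerpenguins)

# Keep only the named columns, in the order given
bills <- penguins |>
    select(species, bill_length_mm, bill_depth_mm)

names(bills)
```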

select() — Exclude Columns

Use - to exclude specific columns. You can also reorder columns!
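Both ideas in a short sketch, assuming dplyr and palmerpenguins are installed. `everything()` is a selection helper that stands for all columns not yet mentioned.

```r
library(dplyr)
library(palmerpenguins)

# Drop two columns with -
no_meta <- penguins |>
    select(-year, -island)

# Reorder: put sex first, then everything else in its original order
sex_first <- penguins |>
    select(sex, everything())

names(sex_first)
```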

Your Turn! 🏋️

Select the species, year, and body_mass_g columns from the penguins dataset.

mutate() — Create New Columns
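A minimal sketch, assuming dplyr and palmerpenguins are installed: `mutate()` adds a column computed from existing ones and keeps all other columns. The column name `body_mass_kg` is chosen here for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Add a new column computed from an existing one
with_kg <- penguins |>
    mutate(body_mass_kg = body_mass_g / 1000)

with_kg |>
    select(species, body_mass_g, body_mass_kg) |>
    head(3)
```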

if_else() — Conditional Mutate

Best for simple YES / NO conditions:
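A sketch of a two-way split, assuming dplyr and palmerpenguins are installed. The 4000 g cutoff and the labels are arbitrary choices for illustration. Rows with a missing body mass get NA rather than either label.

```r
library(dplyr)
library(palmerpenguins)

# if_else(condition, value_if_true, value_if_false)
tagged <- penguins |>
    mutate(size = if_else(body_mass_g > 4000, "heavy", "light"))

count(tagged, size)
```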

case_when() — Complex Conditions
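A sketch with three categories, assuming dplyr and palmerpenguins are installed. Conditions are checked top to bottom and the first TRUE one wins; rows that match no condition (here, those with a missing body mass) get NA. The cutoffs are arbitrary values chosen for illustration.

```r
library(dplyr)
library(palmerpenguins)

sized <- penguins |>
    mutate(
        size = case_when(
            body_mass_g < 3500 ~ "small",
            body_mass_g < 4500 ~ "medium",   # reached only if not "small"
            body_mass_g >= 4500 ~ "large"
        )
    )

count(sized, size)
```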

Your Turn! 🏋️

Create bill_area (bill_length_mm * bill_depth_mm). Then create bill_category: “big” if bill_area > 700, otherwise “small”. Finally, select species, bill_area, and bill_category.

Group Verbs

Grouped filter()

Filter penguins by comparing them to their own species average:
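A sketch of the pattern, assuming dplyr and palmerpenguins are installed. Body mass is used here purely as an example variable. Because the data is grouped, `mean()` is evaluated separately for each species.

```r
library(dplyr)
library(palmerpenguins)

# Keep penguins heavier than the average of their own species
above_avg <- penguins |>
    group_by(species) |>
    filter(body_mass_g > mean(body_mass_g, na.rm = TRUE)) |>
    ungroup()

count(above_avg, species)
```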

Grouped mutate()

Standardize bill length within each species:
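A sketch of a within-species z-score, assuming dplyr and palmerpenguins are installed. The column name `bill_z` is chosen here for illustration.

```r
library(dplyr)
library(palmerpenguins)

# z-score: subtract the species mean, divide by the species standard deviation
standardized <- penguins |>
    group_by(species) |>
    mutate(
        bill_z = (bill_length_mm - mean(bill_length_mm, na.rm = TRUE)) /
            sd(bill_length_mm, na.rm = TRUE)
    ) |>
    ungroup()

standardized |>
    select(species, bill_length_mm, bill_z) |>
    head(3)
```

After this step, bill_z has mean 0 and standard deviation 1 within each species, so values are comparable across species.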

⚠️ The Golden Rule of Grouping

Always ungroup()!

Whenever you use group_by() with filter() or mutate(), always finish with ungroup().

Data that stays grouped silently changes how every later verb behaves, which is a common source of frustrating bugs in data science!
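A sketch of the kind of bug this rule prevents, assuming dplyr and palmerpenguins are installed. The column names `species_mean` and `overall_mean` are hypothetical, chosen for illustration.

```r
library(dplyr)
library(palmerpenguins)

grouped <- penguins |>
    group_by(species) |>
    mutate(species_mean = mean(body_mass_g, na.rm = TRUE))
# `grouped` is still grouped by species here

# Intended as the overall mean, but silently computed per species
still_grouped <- grouped |>
    mutate(overall_mean = mean(body_mass_g, na.rm = TRUE))

# After ungroup(), the same call uses all rows
fixed <- grouped |>
    ungroup() |>
    mutate(overall_mean = mean(body_mass_g, na.rm = TRUE))

n_distinct(still_grouped$overall_mean)  # one value per species
n_distinct(fixed$overall_mean)          # a single value
```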

Your Turn! 🏋️ (Grouped Filter)

Group by species and sex, then keep the penguins whose bill_length_mm is greater than their group's mean plus one standard deviation.

group_by() + summarize()

Calculate statistics per group:
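A minimal sketch, assuming dplyr and palmerpenguins are installed. `summarize()` collapses each group to a single row; `na.rm = TRUE` is needed because two penguins have no recorded body mass.

```r
library(dplyr)
library(palmerpenguins)

# One output row per species
mass_by_species <- penguins |>
    group_by(species) |>
    summarize(mean_mass = mean(body_mass_g, na.rm = TRUE))

mass_by_species
```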

Multiple Summaries
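A sketch of several statistics in one call, assuming dplyr and palmerpenguins are installed. The output column names are chosen here for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Several statistics at once; n() counts the rows in each group
flipper_stats <- penguins |>
    group_by(species) |>
    summarize(
        n = n(),
        mean_flipper = mean(flipper_length_mm, na.rm = TRUE),
        sd_flipper = sd(flipper_length_mm, na.rm = TRUE)
    )

flipper_stats
```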

Your Turn! 🏋️ (Summarize)

Group by island and sex. Calculate average bill_length_mm and bill_depth_mm.

Combining All Verbs

Logic vs. Logistics

dplyr

penguins |>
    drop_na() |>
    group_by(sex, species) |>
    summarize(
        mean_bill = mean(bill_length_mm),
        sd_bill = sd(bill_length_mm)
    ) |>
    ungroup() |>
    select(sex, species, mean_bill, sd_bill) |>
    arrange(-mean_bill)

Base R

# The manual "Split-Apply-Combine" pattern
p_clean <- penguins[!is.na(penguins$bill_length_mm), ]

# 1. Split into a list of groups
groups <- split(p_clean, list(p_clean$sex, p_clean$species))

# 2. Apply summary to each group
stats <- lapply(groups, function(d) {
    if (nrow(d) == 0) return(NULL)
    data.frame(
        sex = d$sex[1], species = d$species[1],
        mean = mean(d$bill_length_mm), sd = sd(d$bill_length_mm)
    )
})

# 3. Combine back into a data frame and sort
res <- do.call(rbind, stats)
res[order(res$mean, decreasing = TRUE), ]

pandas

# Named aggregation is the closest pandas gets to summarize()
# Assumes `penguins` is already loaded as a pandas DataFrame
(penguins
    .dropna(subset=['bill_length_mm'])
    .groupby(['sex', 'species'])
    .agg(
        mean_bill=('bill_length_mm', 'mean'),
        sd_bill=('bill_length_mm', 'std')
    )
    .reset_index()
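    .sort_values('mean_bill', ascending=False))

NumPy

import numpy as np

# Data as separate arrays. bill_list, sex_list, and spec_list are assumed to be
# equal-length Python lists, with missing bill lengths stored as NaN or None.
bill = np.array(bill_list, dtype=float)
sex_arr = np.array(sex_list)
spec_arr = np.array(spec_list)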

# 1. Manual NA filtering
mask = ~np.isnan(bill)
bill, sex_arr, spec_arr = bill[mask], sex_arr[mask], spec_arr[mask]

# 2. Manual grouping
keys = np.array([f"{s}_{sp}" for s, sp in zip(sex_arr, spec_arr)])
unique_keys = np.unique(keys)

# 3. Manual aggregation
results = []
for k in unique_keys:
    group_vals = bill[keys == k]
    results.append({
        'group': k,
        'mean': np.mean(group_vals),
        'sd': np.std(group_vals, ddof=1)
    })

[!NOTE] dplyr provides a declarative syntax that mirrors our mental model of the data flow, abstracting away the “how” so we can focus on the “what”.

R’s Superpowers for Data Science

Why does dplyr feel so much more “natural” than the alternatives?

  • Non-Standard Evaluation (NSE): The “magic” that lets you use column names like bill_length_mm without quotes or complex lambda functions.
  • First-Class Missing Values (NA): In R, NA is a built-in value that “poisons” calculations (1 + NA is NA), so missing data cannot silently slip into a result.
  • Copy-on-Write Safety: R copies an ordinary data frame as soon as a function modifies it, so the caller's object stays unchanged. You can’t accidentally overwrite your raw data by passing it to a function.
  • Vectorization is the Default: R is built around vectors, so basic array math works without external libraries (unlike Python, which needs NumPy for it).
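Three of these points can be seen directly in base R, as the small sketch below shows. The function `add_one()` and the data frame `raw` are hypothetical names used only for illustration.

```r
# Missing values propagate unless you opt out explicitly
x <- c(1, 2, NA)
mean(x)                 # NA
mean(x, na.rm = TRUE)   # 1.5

# Vectorized arithmetic without any extra library
c(1, 2, 3) * 10         # 10 20 30

# Copy-on-modify: changing a data frame inside a function
# does not touch the caller's copy
add_one <- function(df) {
    df$value <- df$value + 1
    df
}
raw <- data.frame(value = c(1, 2, 3))
modified <- add_one(raw)
raw$value               # still 1 2 3
modified$value          # 2 3 4
```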

“I want to specify at a conceptual level how the data should be analyzed… I don’t want to have to think about the logistics of how the computation is performed.” — Claus Wilke

Summary