Six Verbs for Data Transformation

EE BIOL C177/C234

Chuliang Song

Setup 📦

Packages for this chapter

We’ll be relying on tidyverse and palmerpenguins. We’ll also use tidylog, a neat package that gives feedback on your data manipulations!

If you haven’t installed tidylog yet, run this in your R console:

pak::pak("tidylog")
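
Once everything is installed, loading the packages might look like the sketch below. It assumes all three packages are available. Loading tidylog last is a deliberate choice here so that its verbose wrappers mask the dplyr verbs of the same name.

```r
library(tidyverse)
library(palmerpenguins)
library(tidylog)  # loaded last so its wrappers report what each verb did

# Quick look at the dataset used throughout this chapter
glimpse(penguins)
```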

The Pipe `|>`

Without Pipe vs With Pipe

😵 Nested (hard to read)

eat(get_dressed(shower(
    brush(wakeup(you))
)))

✅ Piped (clear flow)

you |>
    wakeup() |>
    brush() |>
    shower() |>
    get_dressed() |>
    eat()

Read |> as “then”
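
The same contrast on real data, as a small sketch assuming dplyr and palmerpenguins are installed. Both versions do the same work; only the reading order differs.

```r
library(dplyr)
library(palmerpenguins)

# Nested: must be read from the inside out
nested <- head(arrange(filter(penguins, species == "Adelie"), body_mass_g), 3)

# Piped: filter, then arrange, then take the first three rows
piped <- penguins |>
    filter(species == "Adelie") |>
    arrange(body_mass_g) |>
    head(3)

identical(nested, piped)
```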

Row Verbs

`arrange()` — Sort Rows
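
A minimal sketch of the basic call, assuming dplyr and palmerpenguins are installed. arrange() sorts in ascending order by default and places NA values last.

```r
library(dplyr)
library(palmerpenguins)

# Sort rows from lightest to heaviest penguin
by_mass <- penguins |>
    arrange(body_mass_g)

head(by_mass, 3)
```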

`arrange()` — Descending

Use - for descending order on numeric columns; desc() does the same for any column type:
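
A short sketch of both spellings, assuming dplyr and palmerpenguins are installed. The unary minus only makes sense for numeric columns, while desc() also handles factors and strings.

```r
library(dplyr)
library(palmerpenguins)

# Heaviest first, using unary minus on a numeric column
heaviest <- penguins |>
    arrange(-body_mass_g)

# The same ordering with desc()
heaviest_desc <- penguins |>
    arrange(desc(body_mass_g))

head(heaviest, 3)
```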

Your Turn! 🏋️

Sort the penguins dataset by island first, then by descending bill_depth_mm.

`filter()` — Keep Matching Rows
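
A minimal sketch, assuming dplyr and palmerpenguins are installed. filter() keeps the rows where the condition is TRUE and drops rows where it is FALSE or NA.

```r
library(dplyr)
library(palmerpenguins)

# Keep only the Adelie penguins
adelie <- penguins |>
    filter(species == "Adelie")

nrow(adelie)
```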

`filter()` — Multiple Conditions
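
One way to combine conditions, sketched under the assumption that dplyr and palmerpenguins are installed. The 4000 g cutoff is only an illustrative value. Conditions separated by commas are combined with AND.

```r
library(dplyr)
library(palmerpenguins)

# Comma-separated conditions: both must be TRUE
heavy_adelie <- penguins |>
    filter(species == "Adelie", body_mass_g > 4000)

# The same filter written with an explicit &
heavy_adelie_and <- penguins |>
    filter(species == "Adelie" & body_mass_g > 4000)

identical(heavy_adelie, heavy_adelie_and)
```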

Common Operators

== (Equal): species == "Adelie"
!= (Not equal): island != "Torgersen"
<, > (Less/Greater): body_mass_g > 4000
& (AND): sex == "male" & year == 2007
| (OR): species == "Adelie" | species == "Gentoo"
%in% (In group): island %in% c("Biscoe", "Dream")

Test these operators out here:
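
A few of the operators above in action, as a sketch assuming dplyr and palmerpenguins are installed. It also shows that %in% is a compact replacement for a chain of | comparisons.

```r
library(dplyr)
library(palmerpenguins)

# OR written out in full
two_species <- penguins |>
    filter(species == "Adelie" | species == "Gentoo")

# %in% expresses the same idea more compactly
two_species_in <- penguins |>
    filter(species %in% c("Adelie", "Gentoo"))

# != keeps every island except the named one
not_torgersen <- penguins |>
    filter(island != "Torgersen")
```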

Your Turn! 🏋️

Filter male penguins with bill length < 40 mm, living on either Torgersen or Biscoe island.

Column Verbs

`select()` — Pick Columns
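
A minimal sketch, assuming dplyr and palmerpenguins are installed. The result keeps only the named columns, in the order they are listed.

```r
library(dplyr)
library(palmerpenguins)

# Keep three columns and drop the rest
picked <- penguins |>
    select(species, island, flipper_length_mm)

names(picked)
```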

`select()` — Exclude Columns

Use - to exclude specific columns. You can also reorder columns!
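
Both ideas can be sketched as follows, assuming dplyr and palmerpenguins are installed. everything() is a tidyselect helper that stands for all columns not yet mentioned.

```r
library(dplyr)
library(palmerpenguins)

# Drop two columns with -
no_year <- penguins |>
    select(-year, -island)

# Reorder: move sex to the front, keep the other columns after it
sex_first <- penguins |>
    select(sex, everything())

names(sex_first)
```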

Your Turn! 🏋️

Select the species, year, and body_mass_g columns from the penguins dataset.

`mutate()` — Create New Columns
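
A minimal sketch, assuming dplyr and palmerpenguins are installed. The column name body_mass_kg is made up for this example. mutate() adds the new column and leaves the existing ones untouched.

```r
library(dplyr)
library(palmerpenguins)

# Add body mass in kilograms next to the original grams column
with_kg <- penguins |>
    mutate(body_mass_kg = body_mass_g / 1000)

with_kg |>
    select(species, body_mass_g, body_mass_kg) |>
    head(3)
```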

`if_else()` — Conditional Mutate

Best for simple YES / NO conditions:
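
A small sketch, assuming dplyr and palmerpenguins are installed. The 4000 g cutoff and the labels are illustrative choices. Rows where the condition is NA stay NA in the new column.

```r
library(dplyr)
library(palmerpenguins)

# Two-way split on a single condition
sized <- penguins |>
    mutate(size = if_else(body_mass_g > 4000, "heavy", "light"))

count(sized, size)
```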

`case_when()` — Complex Conditions
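
When there are more than two outcomes, case_when() checks the conditions in order and uses the first one that is TRUE. This is a sketch assuming dplyr and palmerpenguins are installed; the thresholds and labels are hypothetical.

```r
library(dplyr)
library(palmerpenguins)

# Three-way split; rows matching no condition (missing mass) become NA
classed <- penguins |>
    mutate(
        mass_class = case_when(
            body_mass_g < 3500 ~ "small",
            body_mass_g < 4500 ~ "medium",
            body_mass_g >= 4500 ~ "large"
        )
    )

count(classed, mass_class)
```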

Your Turn! 🏋️

Create bill_area (bill_length_mm * bill_depth_mm). Then create bill_category: “big” if bill_area > 700, otherwise “small”. Finally, select species, bill_area, and bill_category.

Group Verbs

Grouped `filter()`

Filter penguins by comparing them to their own species average:
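
One possible version, sketched under the assumption that dplyr and palmerpenguins are installed. Body mass is used purely as an illustrative column. Inside a grouped filter(), mean() is computed separately for each species.

```r
library(dplyr)
library(palmerpenguins)

# Keep penguins heavier than the average of their own species
above_avg <- penguins |>
    group_by(species) |>
    filter(body_mass_g > mean(body_mass_g, na.rm = TRUE)) |>
    ungroup()

count(above_avg, species)
```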

Grouped `mutate()`

Standardize bill length within each species:
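
A sketch of a within-species z-score, assuming dplyr and palmerpenguins are installed. The column name bill_z is made up for this example.

```r
library(dplyr)
library(palmerpenguins)

# Subtract the species mean and divide by the species standard deviation
standardized <- penguins |>
    group_by(species) |>
    mutate(
        bill_z = (bill_length_mm - mean(bill_length_mm, na.rm = TRUE)) /
            sd(bill_length_mm, na.rm = TRUE)
    ) |>
    ungroup()

standardized |>
    select(species, bill_length_mm, bill_z) |>
    head(3)
```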

⚠️ The Golden Rule of Grouping

Always ungroup()!

Whenever you use group_by() with filter() or mutate(), always finish with ungroup().

Data that is left implicitly grouped is a common source of frustrating bugs in data science!
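
To see why this matters, here is a small sketch assuming dplyr and palmerpenguins are installed. The same summarize() call gives different answers depending on whether the grouping is still attached.

```r
library(dplyr)
library(palmerpenguins)

# A grouped mutate() that forgets to ungroup()
grouped <- penguins |>
    group_by(species) |>
    mutate(mean_mass = mean(body_mass_g, na.rm = TRUE))

# Still grouped: the count is computed once per species (3 rows)
per_species <- grouped |>
    summarize(n = n())

# After ungroup(): the count covers the whole table (1 row)
overall <- grouped |>
    ungroup() |>
    summarize(n = n())
```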

Your Turn! 🏋️ (Grouped Filter)

Group by species and sex, then keep the penguins whose bill_length_mm is more than one standard deviation above their group mean.

`group_by()` + `summarize()`

Calculate statistics per group:
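
A minimal sketch, assuming dplyr and palmerpenguins are installed. summarize() collapses each group to a single row, so the result has one row per species.

```r
library(dplyr)
library(palmerpenguins)

# Average body mass for each species
mass_by_species <- penguins |>
    group_by(species) |>
    summarize(mean_mass = mean(body_mass_g, na.rm = TRUE))

mass_by_species
```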

Multiple Summaries
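
Several statistics can be computed in one summarize() call. This is a sketch assuming dplyr and palmerpenguins are installed; the output column names are arbitrary.

```r
library(dplyr)
library(palmerpenguins)

# Count, mean, and standard deviation of body mass per species
species_stats <- penguins |>
    group_by(species) |>
    summarize(
        n = n(),
        mean_mass = mean(body_mass_g, na.rm = TRUE),
        sd_mass = sd(body_mass_g, na.rm = TRUE)
    )

species_stats
```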

Your Turn! 🏋️ (Summarize)

Group by island and sex. Calculate average bill_length_mm and bill_depth_mm.

Combining All Verbs

penguins |>
    drop_na() |>
    group_by(sex, species) |>
    summarize(
        mean_bill = mean(bill_length_mm),
        sd_bill = sd(bill_length_mm)
    ) |>
    ungroup() |>
    select(sex, species, mean_bill, sd_bill) |>
    arrange(-mean_bill)

# Base R: the manual "Split-Apply-Combine" pattern
p_clean <- penguins[!is.na(penguins$bill_length_mm), ]

# 1. Split into a list of groups
groups <- split(p_clean, list(p_clean$sex, p_clean$species))

# 2. Apply summary to each group
stats <- lapply(groups, function(d) {
    if (nrow(d) == 0) return(NULL)
    data.frame(
        sex = d$sex[1], species = d$species[1],
        mean = mean(d$bill_length_mm), sd = sd(d$bill_length_mm)
    )
})

# 3. Combine back into a data frame and sort
res <- do.call(rbind, stats)
res[order(res$mean, decreasing = TRUE), ]

# pandas: named aggregation is the closest pandas gets to summarize()
# (assumes `penguins` is already loaded as a pandas DataFrame)
(penguins
    .dropna(subset=['bill_length_mm'])
    .groupby(['sex', 'species'])
    .agg(
        mean_bill=('bill_length_mm', 'mean'),
        sd_bill=('bill_length_mm', 'std')
    )
    .reset_index()
    .sort_values('mean_bill', ascending=False))

import numpy as np

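# NumPy: data as separate arrays
# (assumes bill_list, sex_list, and spec_list are Python lists holding the three
# columns, with missing bill lengths stored as float('nan'))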
bill = np.array(bill_list)
sex_arr = np.array(sex_list)
spec_arr = np.array(spec_list)

# 1. Manual NA filtering
mask = ~np.isnan(bill)
bill, sex_arr, spec_arr = bill[mask], sex_arr[mask], spec_arr[mask]

# 2. Manual grouping
keys = np.array([f"{s}_{sp}" for s, sp in zip(sex_arr, spec_arr)])
unique_keys = np.unique(keys)

# 3. Manual aggregation
results = []
for k in unique_keys:
    group_vals = bill[keys == k]
    results.append({
        'group': k,
        'mean': np.mean(group_vals),
        'sd': np.std(group_vals, ddof=1)
    })

# 4. Manual sort, to match arrange(-mean_bill)
results.sort(key=lambda r: r['mean'], reverse=True)

Note: dplyr provides a declarative syntax that mirrors our mental model of the data flow, abstracting away the “how” so we can focus on the “what”.

R’s Superpowers for Data Science

Why does dplyr feel so much more “natural” than the alternatives?

Non-Standard Evaluation (NSE): The “magic” that lets you refer to columns such as bill_length_mm by bare name, without quotes or lambda functions.
First-Class Missing Values (NA): In R, NA is a built-in value that “poisons” calculations (1 + NA is NA), so missing data cannot silently slip into a result.
Copy-on-Write Safety: R treats ordinary objects such as data frames as if they were immutable. Modifying a data frame inside a function changes a copy, so you can’t accidentally overwrite your raw data by passing it to a function.
Vectorization is the Default: Unlike most general-purpose languages, R is built around vectors. You don’t need external libraries to do basic array math.
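
Three of these points can be checked in a few lines of base R. This is a sketch with made-up values, and double_x() is a hypothetical helper written only for this illustration.

```r
# Missing values propagate unless you explicitly remove them
na_sum <- 1 + NA
na_mean <- mean(c(1, 2, NA))
clean_mean <- mean(c(1, 2, NA), na.rm = TRUE)

# Copy-on-modify: changing the argument inside a function
# leaves the caller's data frame untouched
double_x <- function(df) {
    df$x <- df$x * 2
    df
}
raw <- data.frame(x = 1:3)
doubled <- double_x(raw)
raw$x      # unchanged
doubled$x  # doubled copy

# Vectorized arithmetic with no extra packages
scaled <- c(1, 2, 3) * 10
```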

“I want to specify at a conceptual level how the data should be analyzed… I don’t want to have to think about the logistics of how the computation is performed.” — Claus Wilke

Summary

|> chains operations: data flows left to right
Rows: arrange() sorts, filter() subsets
Columns: select() picks, mutate() creates
Groups: group_by() + summarize() = aggregation
Combine verbs for complex transformations!
