EE BIOL C177/C234
- `|>`
- `arrange()`, `filter()`
- `select()`, `mutate()`
- `group_by()`, `summarize()`

## arrange() — Sort Rows

### arrange() — Descending

Use `-` for descending order:
Sort the penguins dataset by island first, then by descending bill_depth_mm.
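One possible solution — a sketch assuming dplyr and the palmerpenguins package (which supplies the `penguins` data frame) are installed:

```r
library(dplyr)
library(palmerpenguins)  # provides `penguins`

# Sort by island (ascending), then by descending bill depth
penguins |>
  arrange(island, -bill_depth_mm)
```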
## filter() — Keep Matching Rows

### filter() — Multiple Conditions

- `==` (equal): `species == "Adelie"`
- `!=` (not equal): `island != "Torgersen"`
- `<`, `>` (less/greater): `body_mass_g > 4000`
- `&` (AND): `sex == "male" & year == 2007`
- `|` (OR): `species == "Adelie" | species == "Gentoo"`
- `%in%` (in group): `island %in% c("Biscoe", "Dream")`

Test these operators out here:
Filter male penguins with bill length < 40 mm, living on either Torgersen or Biscoe island.
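One possible solution, assuming dplyr and palmerpenguins are loaded:

```r
library(dplyr)
library(palmerpenguins)

# Male penguins with bills under 40 mm on Torgersen or Biscoe
penguins |>
  filter(sex == "male",
         bill_length_mm < 40,
         island %in% c("Torgersen", "Biscoe"))
```

Comma-separated conditions inside `filter()` are combined with `&`, so this is equivalent to one long AND expression.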
## select() — Pick Columns

### select() — Exclude Columns

Use `-` to exclude specific columns. You can also reorder columns!
Select the species, year, and body_mass_g columns from the penguins dataset.
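One possible solution (dplyr and palmerpenguins assumed):

```r
library(dplyr)
library(palmerpenguins)

# Keep only the three requested columns, in this order
penguins |>
  select(species, year, body_mass_g)
```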
## mutate() — Create New Columns

### if_else() — Conditional Mutate

Best for simple YES / NO conditions:
### case_when() — Complex Conditions

Create `bill_area` (length × depth). If `bill_area > 700`, tag it as "big", else "small". Then select `species`, `bill_area`, `bill_category`.
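One possible solution using `if_else()` — with only two categories it does the job, though a `case_when()` version works the same way (dplyr and palmerpenguins assumed):

```r
library(dplyr)
library(palmerpenguins)

penguins |>
  mutate(
    bill_area = bill_length_mm * bill_depth_mm,
    bill_category = if_else(bill_area > 700, "big", "small")
  ) |>
  select(species, bill_area, bill_category)
```

Rows with a missing bill measurement get `NA` in both new columns, since `NA` propagates through the multiplication and `if_else()`.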
### group_by() + filter()

Filter penguins by comparing them to their own species average:
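A sketch of a grouped filter, keeping penguins heavier than their own species' average body mass (the column choice here is illustrative):

```r
library(dplyr)
library(palmerpenguins)

penguins |>
  group_by(species) |>
  filter(body_mass_g > mean(body_mass_g, na.rm = TRUE)) |>
  ungroup()
```

Inside a grouped `filter()`, `mean(body_mass_g, na.rm = TRUE)` is computed per species, not over the whole dataset.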
### group_by() + mutate()

Standardize bill length within each species:
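A sketch of a grouped mutate (the new column name `bill_z` is my choice, not fixed by the exercise):

```r
library(dplyr)
library(palmerpenguins)

penguins |>
  group_by(species) |>
  mutate(bill_z = (bill_length_mm - mean(bill_length_mm, na.rm = TRUE)) /
                  sd(bill_length_mm, na.rm = TRUE)) |>
  ungroup()
```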
### Always ungroup()!

Whenever you use `group_by()` with `filter()` or `mutate()`, always finish with `ungroup()`. Leaving data grouped implicitly is the source of many frustrating bugs in data science!
Group by species and sex, then filter for penguins whose bill_length_mm is > the mean + 1 standard dev.
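One possible solution, keeping only penguins more than one standard deviation above their group's mean bill length (dplyr and palmerpenguins assumed):

```r
library(dplyr)
library(palmerpenguins)

penguins |>
  group_by(species, sex) |>
  filter(bill_length_mm > mean(bill_length_mm, na.rm = TRUE) +
                          sd(bill_length_mm, na.rm = TRUE)) |>
  ungroup()
```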
## group_by() + summarize()

Calculate statistics per group:
Group by island and sex. Calculate average bill_length_mm and bill_depth_mm.
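One possible solution (the output column names are my choice; `.groups = "drop"` returns an ungrouped result):

```r
library(dplyr)
library(palmerpenguins)

penguins |>
  group_by(island, sex) |>
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
            mean_bill_depth  = mean(bill_depth_mm, na.rm = TRUE),
            .groups = "drop")
```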
```r
# The manual "Split-Apply-Combine" pattern
p_clean <- penguins[!is.na(penguins$bill_length_mm), ]

# 1. Split into a list of groups
groups <- split(p_clean, list(p_clean$sex, p_clean$species))

# 2. Apply summary to each group
stats <- lapply(groups, function(d) {
  if (nrow(d) == 0) return(NULL)
  data.frame(
    sex = d$sex[1], species = d$species[1],
    mean = mean(d$bill_length_mm), sd = sd(d$bill_length_mm)
  )
})

# 3. Combine back into a data frame and sort
res <- do.call(rbind, stats)
res[order(res$mean, decreasing = TRUE), ]
```

```python
import numpy as np

# Data as separate arrays
bill = np.array(bill_list)
sex_arr = np.array(sex_list)
spec_arr = np.array(spec_list)

# 1. Manual NA filtering
mask = ~np.isnan(bill)
bill, sex_arr, spec_arr = bill[mask], sex_arr[mask], spec_arr[mask]

# 2. Manual grouping
keys = np.array([f"{s}_{sp}" for s, sp in zip(sex_arr, spec_arr)])
unique_keys = np.unique(keys)

# 3. Manual aggregation
results = []
for k in unique_keys:
    group_vals = bill[keys == k]
    results.append({
        'group': k,
        'mean': np.mean(group_vals),
        'sd': np.std(group_vals, ddof=1)
    })
```

> [!NOTE]
> `dplyr` provides a declarative syntax that mirrors our mental model of the data flow, abstracting away the "how" so we can focus on the "what".
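For contrast, the same computation as the manual split-apply-combine pattern above, written as a single dplyr pipeline (up to how `NA` values of `sex` are grouped):

```r
library(dplyr)
library(palmerpenguins)

penguins |>
  filter(!is.na(bill_length_mm)) |>
  group_by(sex, species) |>
  summarize(mean = mean(bill_length_mm),
            sd   = sd(bill_length_mm),
            .groups = "drop") |>
  arrange(desc(mean))
```

Each verb replaces one manual step: `filter()` the NA cleanup, `group_by()` the split, `summarize()` the apply, and the pipeline itself the combine.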
## Why does dplyr feel so much more "natural" than the alternatives?

- Bare column names: you refer to `bill_length_mm` without quotes or complex lambda functions.
- Missing values (`NA`): in R, `NA` is a native type that correctly "poisons" calculations (`1 + NA` is `NA`), preventing silent errors.

> "I want to specify at a conceptual level how the data should be analyzed… I don't want to have to think about the logistics of how the computation is performed." — Claus Wilke
## dplyr: Six Verbs

- `|>` chains operations: data flows left to right
- `arrange()` sorts, `filter()` subsets
- `select()` picks, `mutate()` creates
- `group_by()` + `summarize()` = aggregation