EE BIOL C177/C234
Tip
Nesting lets you organize data by grouping variables, keeping related datasets bundled together.
Before nesting tibbles, we must understand R lists. A list is a generic vector that can hold elements of any type:
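As a minimal sketch in base R, the list below mixes several element types, including a data frame. The names `my_list`, `count`, `label`, `flags`, and `table` are made up for illustration.

```r
# A list can mix element types, including whole data structures
my_list <- list(
  count = 3L,                                   # integer
  label = "Adelie",                             # character
  flags = c(TRUE, FALSE),                       # logical vector
  table = data.frame(x = 1:2, y = c("a", "b"))  # a data frame as one element
)

# Show the top-level structure: one line per element
str(my_list, max.level = 1)

# Class of each element
element_classes <- sapply(my_list, class)
element_classes
```

Because a data frame fits in a single list element, a list of tibbles is nothing special: it is simply a list whose elements all happen to be tibbles.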
Nested data is a specific, clean structure where:
1. Each element of a list is a tibble.
2. All those tibbles have the same column structure.
3. That list is stored as a list-column inside a parent tibble!
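To make the three conditions concrete, here is a sketch that builds a tiny nested tibble by hand, assuming the tibble package is installed. The measurement values and the names `adelie`, `gentoo`, and `nested_by_hand` are invented for illustration only.

```r
library(tibble)

# Two small tibbles with the same column structure (illustrative values)
adelie <- tibble(bill_length_mm = c(39.1, 39.5), body_mass_g = c(3750, 3800))
gentoo <- tibble(bill_length_mm = c(46.1, 50.0), body_mass_g = c(4500, 5700))

# The parent tibble stores them in a list-column called `data`
nested_by_hand <- tibble(
  species = c("Adelie", "Gentoo"),
  data = list(adelie, gentoo)
)

nested_by_hand            # one row per species
nested_by_hand$data[[2]]  # pull out the Gentoo sub-tibble
```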
To create this, we use the nest() function from tidyr:
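A minimal sketch of the call, assuming `penguins` comes from the palmerpenguins package and that dplyr and tidyr are installed. The name `nested` is just a label for this example.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)  # assumed source of the `penguins` data

# Group by species, then fold every other column into a list-column
nested <- penguins |>
  group_by(species) |>
  nest()

nested                   # one row per species, with a `data` list-column
is.list(nested$data)     # the new column is a list of tibbles
nested$data[[1]]         # the sub-tibble stored in the first row
```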
Let’s look at the result of penguins |> group_by(species) |> nest():
species: a normal factor column.
data: a list-column containing a tibble for each species!
In data for Adelie, we have a 152 × 7 sub-tibble containing all measurements (bill length, bill depth, flipper length, etc.) for only Adelie penguins.
Now that we have nested boxes, how do we compute values on them? We use rowwise() to step through each species row-by-row:
penguins |>
  group_by(species) |>
  nest() |>
  rowwise() |>
  mutate(mean_bill = mean(data$bill_length_mm, na.rm = TRUE)) |>
  ungroup()
# A tibble: 3 × 3
species data mean_bill
<fct> <list> <dbl>
1 Adelie <tibble [152 × 7]> 38.8
2 Gentoo <tibble [124 × 7]> 47.5
3 Chinstrap <tibble [68 × 7]> 48.8
Key Detail: Since data is a column containing tibbles, data$bill_length_mm reaches inside the sub-tibble to grab the vector!
Why write nest() |> rowwise() |> mutate(...) when we could just write a flat summary?
✅ Summarize (Flat)
❌ Nested (Overcomplicated for simple tasks)
Warning
Rule of Thumb: Don’t overcomplicate simple calculations. If you only need basic averages or sums, stick to group_by() |> summarize()!
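For comparison, a sketch of both routes to the per-species mean bill length, assuming `penguins` comes from the palmerpenguins package. The names `flat` and `nested_means` are made up for this example.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)  # assumed source of the `penguins` data

# Flat: one grouped summary
flat <- penguins |>
  group_by(species) |>
  summarize(mean_bill = mean(bill_length_mm, na.rm = TRUE))

# Nested: same answer, more machinery
nested_means <- penguins |>
  group_by(species) |>
  nest() |>
  rowwise() |>
  mutate(mean_bill = mean(data$bill_length_mm, na.rm = TRUE)) |>
  ungroup()

# Both approaches produce the same three means (row order may differ)
all.equal(sort(flat$mean_bill), sort(nested_means$mean_bill))
```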
Nesting is a game-changer when we perform complex data science tasks—like fitting a separate linear regression model for each species!
penguins |>
  group_by(species) |>
  nest() |>
  rowwise() |>
  mutate(
    model = list(lm(bill_length_mm ~ bill_depth_mm, data = data)),
    r_squared = summary(model)$r.squared
  ) |>
  ungroup()
# A tibble: 3 × 4
species data model r_squared
<fct> <list> <list> <dbl>
1 Adelie <tibble [152 × 7]> <lm> 0.153
2 Gentoo <tibble [124 × 7]> <lm> 0.414
3 Chinstrap <tibble [68 × 7]> <lm> 0.427
Let’s focus on how we fit a separate regression model for each row:
lm(bill_length_mm ~ bill_depth_mm, data = data): Fits a linear model using the row’s own sub-tibble (data = data).
list() Wrapper: A normal R column can only hold simple vectors (numeric, character). A regression model is a complex list structure.
list() tells R: “store this complex model object as an element inside a list-column!”
Next, we extract statistical parameters from the fitted model:
summary(model)$r.squared: Grabs the model from the current row, summarises it, and extracts its goodness-of-fit metric.
Writing nested loops in base R is notorious for hard-to-find bugs. In a nested tidyverse workflow, R tells you exactly where the failure occurred!
Suppose we have a list containing numeric inputs, but one element has a character string:
Because the error message explicitly points to Row 3, we don’t have to guess or inject print statements inside a loop. We can inspect the element directly:
The output [1] "a" shows immediately that the element is a character string, which causes sum() to fail. We can fix it by filtering or correcting the raw data at its source!
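A minimal sketch that reproduces this kind of failure, assuming dplyr and tibble are installed. The tibble `inputs` and its columns `id` and `values` are hypothetical stand-ins for the example described above; `tryCatch()` is used only so the script keeps running after the error.

```r
library(dplyr)
library(tibble)

# Hypothetical inputs: the third element is a character string by mistake
inputs <- tibble(
  id = 1:3,
  values = list(c(1, 2, 3), c(4, 5), "a")
)

# Capture the error instead of stopping the script
result <- tryCatch(
  inputs |>
    rowwise() |>
    mutate(total = sum(values)) |>
    ungroup(),
  error = function(e) e
)

print(result)        # the dplyr error reports where the failure happened
inputs$values[[3]]   # inspect the offending element directly
```

Once the bad element is located, the repair belongs in the raw data rather than in the pipeline.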
Use Nested Data when:
Your task is too complex for a simple group_by() |> summarize().
You want to avoid nested loops (i, j) and index errors.
Tip
Pro Tip: Nested data and flat summary operations are not mutually exclusive! You can nest data first to fit complex models, extract model parameters, and then flatten/summarize the parameters using normal tidyverse verbs.
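One way this combination might look, as a sketch assuming `penguins` comes from the palmerpenguins package. The `slope` column and the names `fits` and `params` are additions for illustration and are not part of the earlier example.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)  # assumed source of the `penguins` data

# Step 1: nest and fit one model per species
fits <- penguins |>
  group_by(species) |>
  nest() |>
  rowwise() |>
  mutate(
    model = list(lm(bill_length_mm ~ bill_depth_mm, data = data)),
    slope = coef(model)[["bill_depth_mm"]],   # illustrative extra parameter
    r_squared = summary(model)$r.squared
  ) |>
  ungroup()

# Step 2: drop the list-columns and work with a flat table again
params <- fits |>
  select(species, slope, r_squared) |>
  arrange(desc(r_squared))

params
params |> summarize(mean_r_squared = mean(r_squared))
```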
Nested Data and Vectorization