Nested Data and Vectorization

EE BIOL C177/C234

Chuliang Song

The Closet Analogy 🧥

Imagine organising your bedroom closet:
- Flat Data: Throwing all clothes, shoes, socks, and accessories into one massive, messy pile. Finding a matching pair of socks is a nightmare!
- Nested Data: Using modular organiser boxes.
  - The main closet holds labeled bins (e.g., “Shirts”, “Shoes”).
  - Inside the “Shoes” bin, you have separate smaller boxes for sneakers, boots, and sandals.
Nested Data in R stores a mini-dataset (tibble) inside the cell of a parent dataset!

Tip

This allows you to organize data by grouping variables, keeping related datasets perfectly bundled together.

What Is Nested Data?

Lists — Flexible Containers 📦

Before nesting tibbles, we must understand R lists. A list is a generic vector whose elements can be of any type and any length:

# 1. Vectors of different lengths
list(c(1), c(1, 2), c(1, 2, 3))

# 2. Tibbles of identical structures
list(
  tibble(x = 1, y = 2),
  tibble(x = 3, y = 4)
)

# 3. Mixed types (number, string, tibble)
list(1, "a", tibble(x = 1, y = 2))
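A short sketch of how elements come back out of a list may help here. The object name items is illustrative only; the point is the difference between single and double brackets.

```r
library(tibble)

# A toy list mixing a number, a string, and a tibble
items <- list(1, "a", tibble(x = 1, y = 2))

length(items)      # number of elements in the list
class(items[3])    # single brackets keep the list wrapper
class(items[[3]])  # double brackets return the element itself (a tibble)
items[[3]]$x       # once extracted, the tibble's columns are reachable with $
```

Double brackets are what you will use most often when opening one "box" of a list-column.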

From Lists to Nested Data

Nested data is a specific, clean structure where:
1. Each element of a list is a tibble.
2. All those tibbles have the same column structure.
3. That list is stored as a list-column inside a parent tibble!

To create this, we use the nest() function from tidyr:

library(tidyverse)
library(palmerpenguins)

penguins |> 
  group_by(species) |> 
  nest()

# A tibble: 3 × 2
# Groups:   species [3]
  species   data              
  <fct>     <list>            
1 Adelie    <tibble [152 × 7]>
2 Gentoo    <tibble [124 × 7]>
3 Chinstrap <tibble [68 × 7]>

Anatomy of Nested Output 🔎

Let’s look at the result of penguins |> group_by(species) |> nest():

We get a tibble with exactly 3 rows (one for each penguin species).
The columns are:
- species: a normal factor column (shown as <fct> in the output).
- data: a list-column containing a tibble for each species!
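Under data for Adelie, we have a 152 × 7 sub-tibble containing all measurements (bill length, bill depth, flipper length, etc.) for only Adelie penguins. The species column is not repeated inside it, which is why it has 7 columns rather than the original 8.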
The groups are partitioned cleanly into independent boxes.
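One way to look inside the boxes is sketched below. The names nested, adelie, and flat are illustrative and not part of the lecture code; unnest() is the tidyr function that reverses nest().

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

nested <- penguins |>
  group_by(species) |>
  nest() |>
  ungroup()

# Open one box: the sub-tibble stored in the Adelie row
adelie <- nested$data[[which(nested$species == "Adelie")]]
dim(adelie)  # rows and columns of the Adelie sub-tibble

# unnest() reverses nest(), giving back one row per penguin
flat <- nested |> unnest(data)
nrow(flat)
```

Nesting therefore loses nothing: the original rows can always be recovered by unnesting.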

Working with Nested Data

Simple: Mean Bill Length

Now that we have nested boxes, how do we compute values on them? We use rowwise() to step through each species row-by-row:

penguins |>
  group_by(species) |>
  nest() |>
  rowwise() |>
  mutate(mean_bill = mean(data$bill_length_mm, na.rm = TRUE)) |>
  ungroup()

# A tibble: 3 × 3
  species   data               mean_bill
  <fct>     <list>                 <dbl>
1 Adelie    <tibble [152 × 7]>      38.8
2 Gentoo    <tibble [124 × 7]>      47.5
3 Chinstrap <tibble [68 × 7]>       48.8

Key Detail: After rowwise(), data refers to the current row’s sub-tibble rather than the whole list-column, so data$bill_length_mm reaches inside that sub-tibble to grab the vector!
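For comparison, the same per-box calculation can be written without rowwise() by mapping over the list-column. This is an alternative technique, not the approach used in the lecture: map_dbl() comes from purrr (loaded with the tidyverse), and the names mean_bills and d are illustrative.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(palmerpenguins)

mean_bills <- penguins |>
  group_by(species) |>
  nest() |>
  ungroup() |>
  # map_dbl() calls the function once per sub-tibble and returns a numeric vector
  mutate(mean_bill = map_dbl(data, \(d) mean(d$bill_length_mm, na.rm = TRUE)))

mean_bills
```

Here d plays the role that data plays inside rowwise(): it is one sub-tibble at a time.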

Simple Summaries: A Better Way? ❓

Why write nest() |> rowwise() |> mutate(...) when we could just write a flat summary?

✅ Summarize (Flat)

penguins |>
  group_by(species) |>
  summarize(mean_bill = mean(bill_length_mm, na.rm = TRUE))

❌ Nested (Overcomplicated for simple tasks)

penguins |>
  group_by(species) |> nest() |>
  rowwise() |>
  mutate(mean_bill = mean(data$bill_length_mm, na.rm = TRUE)) |>
  ungroup()

Warning

Rule of Thumb: Don’t overcomplicate simple calculations. If you only need basic averages or sums, stick to group_by() |> summarize()!

When Nesting Shines: Model Fitting 🚀

Nesting is a game-changer when we perform complex data science tasks, such as fitting a separate linear regression model for each species!

penguins |>
  group_by(species) |>
  nest() |>
  rowwise() |>
  mutate(
    model = list(lm(bill_length_mm ~ bill_depth_mm, data = data)),
    r_squared = summary(model)$r.squared
  ) |>
  ungroup()

# A tibble: 3 × 4
  species   data               model  r_squared
  <fct>     <list>             <list>     <dbl>
1 Adelie    <tibble [152 × 7]> <lm>       0.153
2 Gentoo    <tibble [124 × 7]> <lm>       0.414
3 Chinstrap <tibble [68 × 7]>  <lm>       0.427

Decoding the Code: Model Fitting 🧠

Let’s focus on how we fit a separate regression model for each row:

mutate(
  model = list(lm(bill_length_mm ~ bill_depth_mm, data = data)),
  r_squared = summary(model)$r.squared
)

lm(bill_length_mm ~ bill_depth_mm, data = data): Fits a linear model using the row’s own sub-tibble (data = data).
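The list() Wrapper: An ordinary column is an atomic vector (numeric, character, logical, etc.) holding one simple value per row. A fitted regression model is a complex list structure with many components, so it cannot be stored in such a column directly.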
Wrapping the model in list() tells R: “store this complex model object as an element inside a list-column!”
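The same idea can be seen outside of mutate() in a minimal sketch. It uses R's built-in cars dataset rather than the penguins, and the names fit and models are illustrative.

```r
library(tibble)

# Fit one model on the built-in cars dataset
fit <- lm(dist ~ speed, data = cars)

# list(fit) is a list of length 1, so it fills exactly one cell of the column
models <- tibble(name = "cars", model = list(fit))

class(models$model)       # the column itself is a list
class(models$model[[1]])  # the element stored in it is the lm object
```

The list-column acts as the outer box; each cell holds one complete model object.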

Decoding the Code: Extracting Metrics 📈

Next, we extract statistical parameters from the fitted model:

mutate(
  model = list(lm(bill_length_mm ~ bill_depth_mm, data = data)),
  r_squared = summary(model)$r.squared
)

summary(model)$r.squared: Grabs the model from the current row, summarises it, and extracts its goodness-of-fit metric.
Vectorized Magic: You write this logic once, and rowwise() applies it to each row in turn, so the model fitting and parameter extraction happen for every species without an explicit loop!
The results stay perfectly grouped alongside the raw data.
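Other quantities can be pulled from the stored models in the same way. The sketch below extracts the fitted slope with coef(); the column name slope and the object name fits are illustrative additions, not part of the lecture code.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

fits <- penguins |>
  group_by(species) |>
  nest() |>
  rowwise() |>
  mutate(
    model = list(lm(bill_length_mm ~ bill_depth_mm, data = data)),
    # coef() returns a named vector; pick the slope by its name
    slope = coef(model)[["bill_depth_mm"]]
  ) |>
  ungroup()

fits |> select(species, slope)
```

Inside rowwise(), model refers to the single lm object in the current row, so any function that accepts an lm object can be used here.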

Debugging & Best Practices

Debugging Made Easier 🐛

Writing nested loops in base R is notorious for hard-to-find bugs. In a nested tidyverse workflow, the error message from dplyr tells you which row the failure occurred in!

Suppose we have a list containing numeric inputs, but one element has a character string:

df <- tibble(
  value = list(1, 2, "a", 4)  # "a" is in row 3!
)

df |>
  rowwise() |>
  mutate(sum_value = sum(value))

Error in `mutate()`:
ℹ In argument: `sum_value = sum(value)`.
ℹ In row 3.
Caused by error in `sum()`:
! invalid 'type' (character) of argument

Isolating and Fixing the Bug 🛠️

Because the error message explicitly points to Row 3, we don’t have to guess or inject print statements inside a loop. We can inspect the element directly:

# Pull the third list element out of the tibble
df$value[[3]]

[1] "a"

We see immediately that the third element, "a", is a character string, which causes sum() to fail. We can fix it by filtering out the bad row or correcting the raw data at its source!
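One possible way to do the filtering is sketched below. It assumes that dropping non-numeric elements is acceptable for your analysis; the helper column is_num and the object cleaned are illustrative names, and map_lgl() comes from purrr.

```r
library(dplyr)
library(purrr)
library(tibble)

df <- tibble(
  value = list(1, 2, "a", 4)
)

# Flag which list elements are numeric
df <- df |> mutate(is_num = map_lgl(value, is.numeric))
which(!df$is_num)  # positions of the problematic rows

# Keep only the numeric rows, then rerun the original calculation
cleaned <- df |>
  filter(is_num) |>
  rowwise() |>
  mutate(sum_value = sum(value)) |>
  ungroup()

cleaned
```

If the string was a data-entry error rather than a row to discard, correcting it in the raw data is usually the better fix.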

Summary Checklist 📋

Use Nested Data when:

You need to perform complex operations (e.g. fitting statistical models) across subsets of data.
The operation is too complex for simple group_by() |> summarize().
You want to maintain a clean relationship between raw datasets and summary outputs.
You want to completely avoid manual loop counters (i, j) and index errors.

Tip

Pro Tip: Nested data and flat summary operations are not mutually exclusive! You can nest data first to fit complex models, extract model parameters, and then flatten/summarize the parameters using normal tidyverse verbs.
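A minimal sketch of that combined workflow follows: it reuses the model-fitting code from earlier, then drops the list-columns and ranks the species with ordinary verbs. The object name ranked is illustrative.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

ranked <- penguins |>
  group_by(species) |>
  nest() |>
  rowwise() |>
  mutate(
    model = list(lm(bill_length_mm ~ bill_depth_mm, data = data)),
    r_squared = summary(model)$r.squared
  ) |>
  ungroup() |>
  # From here on the result is an ordinary flat tibble
  select(species, r_squared) |>
  arrange(desc(r_squared))

ranked
```

Once the parameters are plain numeric columns, everything you already know about flat tibbles applies again.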
