17  Nested Data and Vectorization

Class Objectives
  1. Understand the concept of lists and nested data
  2. Learn how to work with list columns using nest()
  3. Master vectorization with nested data using rowwise()
  4. Learn debugging strategies for nested data operations

17.1 What is Nested Data?

Picture this: you’re organizing your closet. You could throw everything into one big pile (that’s your flat data structure), OR you could use those neat organizing boxes where smaller boxes fit inside bigger ones (that’s nested data!). In R, nested data is exactly that - it’s like having organized boxes for your data, where each box can contain its own little dataset.

17.1.1 Understanding Lists First

Before we dive into nested data, we need to understand lists. A list in R is like a flexible container that can hold different types of data:

list(c(1), c(1, 2), c(1, 2, 3))

list(
    tibble(x = 1, y = 2),
    tibble(x = 1, y = 3),
    tibble(x = 1, y = 4)
)

list(1, "a", tibble(x = 1, y = 2))
1
Each element is a vector of increasing length
2
Each element is a tibble with the same structure
3
Elements can be of different types (number, text, tibble)

17.1.2 From Lists to Nested Data

Nested data is a special case of lists where:

  1. Each element is a tibble
  2. All tibbles have the same structure
  3. The list itself is stored in a column of another tibble

Example 2 above is almost a nested data structure, but it’s just a list. To create proper nested data, we use the nest() function:

library(tidyverse)
library(palmerpenguins)

# Create nested data from the penguins dataset
penguins |> 
    group_by(species) |>
    nest()
1
Group by species
2
Nest the remaining columns into a list column

Now we have a tibble where: - Each row represents a penguin species - The data column contains a tibble for each species with all their measurements

Exercise: Creating Nested Data

Let’s practice creating nested data with the mtcars dataset. We’ll group cars by both cylinder count (cyl) and gear count (gear).

17.2 Working with Nested Data

17.2.1 Simple Example: Mean Bill Length

Let’s start with a simple example to understand how to work with nested data:

library(tidyverse)
library(palmerpenguins)

# Step 1: Nest those penguins!
penguins_nested <- penguins |>
    group_by(species) |>
    nest()

penguins_nested
1
Nest the penguins data by species
2
Look at the structure

As we learned in the previous chapter, we can use rowwise() to apply a function to each row. This applies to nested data (a list of tibbles) as well.

penguins_nested |>
    rowwise() |>
    mutate(mean_bill_length = mean(data$bill_length_mm, na.rm = TRUE))

You may wonder why we would want to do this. Why not just use group_by() and summarize()?

penguins |>
    group_by(species) |>
    summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE))

For simple calculations like means, group_by() and summarize() is clearer. So when should we use nested data?

17.2.2 Complex Example: Linear Regression

Nested data shines when we need to do complex operations that can’t be done with simple summary functions:

penguins |>
    group_by(species) |>
    nest() |>
    rowwise() |>
    mutate(
        # Fit a linear model for each species
        model = list(lm(bill_length_mm ~ bill_depth_mm, data = data)),
        # Extract the R-squared value
        r_squared = summary(model)$r.squared
    ) |>
    ungroup()
1
For each species, fit a model predicting bill length from bill depth
2
Calculate how well the model fits (R-squared value)

This is not a stats class, so we won’t go into the details of the model. But briefly, what’s happening here:

  1. lm() fits a linear regressionmodel using the formula bill_length_mm ~ bill_depth_mm
  2. The model is stored in the model column
  3. We extract the R-squared value using summary(model)$r.squared

The beauty is that we only need to write the analysis code once, and R applies it to each species automatically! Thus, vectorization and nested data help you focus on the creative part of your work, and strip away the tedious parts (like keeping track of the looping index).

Essentially, you just need to know how to perform the operation (like complicated stats) on one of the nested tibbles, and you can use rowwise() to apply it to each other cases. It is a general formula to save all your energy on the important stuff (like doing the stats) instead of getting lost in keeping track of the data index i.

Exercise: Bill Area Calculation

Calculate the mean bill area (length × depth) for each species using nested data.

17.3 Debugging Made Easier

Vectorization and nested data make debugging easier. Let’s look at an example:

# Create a problematic tibble
df <- tibble(
    value = list(1, 2, "a", 4)  # Note the "a" in position 3
)

# Try to sum each value
df |>
    rowwise() |>
    mutate(sum_value = sum(value))
Error in `mutate()`:
ℹ In argument: `sum_value = sum(value)`.
ℹ In row 3.
Caused by error in `sum()`:
! invalid 'type' (character) of argument

The error message clearly tells us there’s a problem in row 3, and the issue is that we can’t sum a character value.

This is a simple example, but it can be a pain to debug without vectorization and nested data (again, speaking from tears!).

17.4 Summary

Use nested data when:

  1. You need to perform complex operations (like fitting models)
  2. The operation can’t be done with simple group_by() and summarize()
  3. You want to keep all related data together
  4. You need to maintain relationships between different levels of data

Remember: for simple calculations (like means, sums, counts), stick with group_by() and mutate()/summarize()! Don’t make things more complicated than they need to be. 😉

Also remember, these two approaches are not mutually exclusive. You can use nested data first to perform complex operations, and then use group_by() and mutate()/summarize() to summarize the results.