Master vectorization with nested data using rowwise()
Learn debugging strategies for nested data operations
17.1 What is Nested Data?
Picture this: you’re organizing your closet. You could throw everything into one big pile (that’s your flat data structure), OR you could use those neat organizing boxes where smaller boxes fit inside bigger ones (that’s nested data!). In R, nested data is exactly that - it’s like having organized boxes for your data, where each box can contain its own little dataset.
17.1.1 Understanding Lists First
Before we dive into nested data, we need to understand lists. A list in R is like a flexible container that can hold different types of data:
library(tibble)
# 1. Each element is a vector of increasing length
list(c(1), c(1, 2), c(1, 2, 3))
# 2. Each element is a tibble with the same structure
list(tibble(x = 1, y = 2), tibble(x = 1, y = 3), tibble(x = 1, y = 4))
# 3. Elements can be of different types (number, text, tibble)
list(1, "a", tibble(x = 1, y = 2))
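Since lists come up constantly when working with nested data, it helps to know how to look inside one. The sketch below reuses the mixed list from example 3; the variable name `mixed` is made up for illustration.

```r
library(tibble)

# A list mixing a number, a string, and a tibble (same as example 3)
mixed <- list(1, "a", tibble(x = 1, y = 2))

# How many elements does the list hold?
length(mixed)

# Double brackets extract the element itself
class(mixed[[1]])

# Single brackets return a shorter list that still wraps the element
class(mixed[1])

# The first class of each element, to see what the list contains
sapply(mixed, function(el) class(el)[1])
```

The difference between `[[ ]]` and `[ ]` matters later: a list column gives you the tibble itself only when you extract a single element with `[[ ]]`.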
17.1.2 From Lists to Nested Data
Nested data is a special case of lists where:
Each element is a tibble
All tibbles have the same structure
The list itself is stored in a column of another tibble
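To make the three conditions concrete, here is a minimal sketch that builds such a structure by hand. The names `measurements`, `nested_by_hand`, and the `group` labels are invented for this illustration.

```r
library(tibble)

# A list of tibbles that all share the same structure
measurements <- list(
  tibble(x = 1, y = 2),
  tibble(x = 1, y = 3),
  tibble(x = 1, y = 4)
)

# Store the list in a column of another tibble.
# `group` is a hypothetical label column, one label per nested tibble.
nested_by_hand <- tibble(
  group = c("a", "b", "c"),
  data = measurements
)

nested_by_hand

# Each cell of the `data` column is itself a tibble
nested_by_hand$data[[2]]
```

Building the column by hand works, but it gets tedious quickly, which is why a dedicated function is usually the better route.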
Example 2 above is almost a nested data structure, but it’s just a list. To create proper nested data, we use the nest() function:
library(tidyverse)
library(palmerpenguins)

# Create nested data from the penguins dataset
penguins |>
  group_by(species) |>  # 1. Group by species
  nest()                # 2. Nest the remaining columns into a list column
Now we have a tibble where:
Each row represents a penguin species
The data column contains a tibble for each species with all their measurements
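To check what actually sits inside the data column, you can pull out one of the nested tibbles, or reverse the nesting with unnest(). This is only a sketch; the names `nested`, `first_group`, and `unnested` are made up here.

```r
library(tidyverse)
library(palmerpenguins)

nested <- penguins |>
  group_by(species) |>
  nest()

# Pull out the nested tibble for the first species
first_group <- nested$data[[1]]
dim(first_group)

# The grouping column is not repeated inside the nested tibble
"species" %in% names(first_group)

# unnest() reverses nest(), giving back one row per penguin
unnested <- nested |>
  unnest(data) |>
  ungroup()

nrow(unnested) == nrow(penguins)
```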
Exercise: Creating Nested Data
Let’s practice creating nested data with the mtcars dataset. We’ll group cars by both cylinder count (cyl) and gear count (gear).
For simple calculations like means, group_by() and summarize() are clearer. So when should we use nested data?
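For comparison, here is a sketch of the same simple calculation done both ways, using mean bill length per species as an assumed example. The names `flat_means` and `nested_means` are invented for illustration.

```r
library(tidyverse)
library(palmerpenguins)

# Flat approach: one grouped summary, no list column needed
flat_means <- penguins |>
  group_by(species) |>
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE))

# Nested approach: the same numbers, but with more machinery
nested_means <- penguins |>
  group_by(species) |>
  nest() |>
  rowwise() |>
  mutate(mean_bill_length = mean(data$bill_length_mm, na.rm = TRUE)) |>
  ungroup() |>
  select(species, mean_bill_length) |>
  arrange(species)

# Both approaches agree on the result
all.equal(flat_means$mean_bill_length, nested_means$mean_bill_length)
```

The nested version needs four extra steps to reach the same answer, which is why it is not worth it for a plain mean.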
17.2.2 Complex Example: Linear Regression
Nested data shines when we need to do complex operations that can’t be done with simple summary functions:
penguins |>
  group_by(species) |>
  nest() |>
  rowwise() |>
  mutate(
    # 1. For each species, fit a model predicting bill length from bill depth
    model = list(lm(bill_length_mm ~ bill_depth_mm, data = data)),
    # 2. Calculate how well the model fits (R-squared value)
    r_squared = summary(model)$r.squared
  ) |>
  ungroup()
This is not a stats class, so we won’t go into the details of the model. But briefly, what’s happening here:
lm() fits a linear regression model using the formula bill_length_mm ~ bill_depth_mm
The model is stored in the model column
We extract the R-squared value using summary(model)$r.squared
The beauty is that we only need to write the analysis code once, and R applies it to each species automatically! Thus, vectorization and nested data help you focus on the creative part of your work, and strip away the tedious parts (like keeping track of the looping index).
Essentially, you only need to know how to perform the operation (like complicated stats) on one of the nested tibbles, and rowwise() then applies it to every other case. This general recipe lets you save your energy for the important stuff (like doing the stats) instead of getting lost in keeping track of a loop index i.
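The "work it out on one tibble, then apply it to all" recipe might look like the sketch below. Extracting the slope with coef() is an assumed extension of the example, and the names `nested`, `one`, `fit_one`, and `slopes` are made up.

```r
library(tidyverse)
library(palmerpenguins)

nested <- penguins |>
  group_by(species) |>
  nest()

# Step 1: work out the operation on a single nested tibble
one <- nested$data[[1]]
fit_one <- lm(bill_length_mm ~ bill_depth_mm, data = one)
coef(fit_one)[["bill_depth_mm"]]

# Step 2: reuse the same code on every row with rowwise()
slopes <- nested |>
  rowwise() |>
  mutate(
    model = list(lm(bill_length_mm ~ bill_depth_mm, data = data)),
    slope = coef(model)[["bill_depth_mm"]]
  ) |>
  ungroup()

slopes |>
  select(species, slope)
```

Once step 1 works for one tibble, step 2 is mostly copy and paste, with `data = one` swapped for `data = data`.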
Exercise: Bill Area Calculation
Calculate the mean bill area (length × depth) for each species using nested data.
Vectorization and nested data make debugging easier. Let’s look at an example:
# Create a problematic tibble
df <- tibble(
  value = list(1, 2, "a", 4)  # Note the "a" in position 3
)

# Try to sum each value
df |>
  rowwise() |>
  mutate(sum_value = sum(value))
Error in `mutate()`:
ℹ In argument: `sum_value = sum(value)`.
ℹ In row 3.
Caused by error in `sum()`:
! invalid 'type' (character) of argument
The error message clearly tells us there’s a problem in row 3, and the issue is that we can’t sum a character value.
This is a simple example, but it can be a pain to debug without vectorization and nested data (again, speaking from tears!).
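Once the error has pointed to the offending row, one possible follow-up (a sketch, not the only way to handle it) is to check each value's type first and skip the ones that cannot be summed. The columns `is_num` and `sum_value` and the name `checked` are invented for this illustration.

```r
library(tidyverse)

df <- tibble(
  value = list(1, 2, "a", 4)
)

checked <- df |>
  rowwise() |>
  mutate(
    # Flag whether this row's value is numeric
    is_num = is.numeric(value),
    # Sum only numeric values; use NA for everything else
    sum_value = if (is_num) sum(value) else NA_real_
  ) |>
  ungroup()

# Which rows need attention?
which(!checked$is_num)
```

Because each row is handled on its own, the one bad value ends up as an NA instead of stopping the whole pipeline.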
17.4 Summary
Use nested data when:
You need to perform complex operations (like fitting models)
The operation can’t be done with simple group_by() and summarize()
You want to keep all related data together
You need to maintain relationships between different levels of data
Remember: for simple calculations (like means, sums, counts), stick with group_by() and mutate()/summarize()! Don’t make things more complicated than they need to be. 😉
Also remember, these two approaches are not mutually exclusive. You can use nested data first to perform complex operations, and then use group_by() and mutate()/summarize() to summarize the results.
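A minimal sketch of combining the two approaches: fit one model per species with nested data, then summarize the results with ordinary verbs. The `n_penguins` column and the summary columns are assumptions added for illustration.

```r
library(tidyverse)
library(palmerpenguins)

# Step 1: nested data for the complex part (one model per species)
model_results <- penguins |>
  group_by(species) |>
  nest() |>
  rowwise() |>
  mutate(
    model = list(lm(bill_length_mm ~ bill_depth_mm, data = data)),
    r_squared = summary(model)$r.squared,
    n_penguins = nrow(data)
  ) |>
  ungroup()

# Step 2: plain summarize() on the per-species results
model_results |>
  summarize(
    mean_r_squared = mean(r_squared),
    total_penguins = sum(n_penguins)
  )
```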