library(tidyverse)
library(tidylog)
# if you haven't installed dcldata, you can do so by running:
# pak::pkg_install("dcl-docs/dcldata")
library(dcldata)
8 Five verbs for tidying data
tidyr verbs
- Reshape data
  - pivot_longer(): Transform wide data into a longer, "tall" format.
  - pivot_wider(): Transform long data into a wider format.
- Split cells
  - separate(): Split one column into multiple columns.
- Deal with missing values
  - drop_na(): Drop rows with missing values.
  - replace_na(): Replace missing values with a specified value.
It's often said that data cleaning and preparation consume the lion's share of an analyst's time, with some claiming as much as 80%. When real-world data arrives on your desk, chances are it isn't in a convenient format. Columns may be mislabeled or scattered, variable names might be confusing, and rows may contain missing information. Over the years, analysts have discovered that a consistent "tidy" format for data is invaluable: it lets you focus on extracting insights rather than wrangling columns and rows into submission.
"Tidy datasets are all alike, but every messy dataset is messy in its own way." - Hadley Wickham
In practical terms, and echoing our earlier discussion, a tidy dataset has these defining traits:
- Every row represents one observation.
- Every column represents one variable.
- Every cell contains a single value.
The tidyr package (part of the tidyverse) is designed to help transform messy data into this tidy structure. In this chapter, we'll explore some of the most commonly used tidyr functions, using example datasets that demonstrate where each function shines. We'll pause our use of the penguins dataset for the moment, because it's already tidy; instead, we'll rely on some datasets from dcldata to illustrate typical data-wrangling challenges.
Hadley Wickham originally developed the reshape2 package in the 2000s to handle this task. Later, in the tidyr package, he introduced gather() and spread() as an improved approach. These were eventually replaced by pivot_longer() and pivot_wider(), which offer greater clarity and consistency. While the older functions still work, the newer ones are recommended.
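If you are curious what the older interface looked like, here is a minimal, optional sketch on a made-up tibble (toy_wide is invented for illustration, and the tidyverse is assumed to be loaded as above). Both calls produce the same long data (row order aside), but the pivot_longer() arguments read more clearly; pivot_longer() itself is covered properly in the next section.
# A tiny made-up wide table: one row per city, one column per year
toy_wide <- tibble(
  city   = c("Oslo", "Lima"),
  `2007` = c(10, 20),
  `2009` = c(12, 25)
)

# Legacy interface: still works, but no longer recommended
toy_wide |>
  gather(key = "year", value = "count", -city)

# Current interface: the same reshape, with clearer argument names
toy_wide |>
  pivot_longer(cols = -city, names_to = "year", values_to = "count")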
8.1 pivot_longer()
Let's begin with the example_eagle_nests dataset from the U.S. Fish and Wildlife Service:
example_eagle_nests
Notice that 2007 and 2009 are stored as column names, but they're really values (i.e., years). This is a classic "wide" format. We can use pivot_longer() to convert it into a tidy format:
eagle_nests_longer <- example_eagle_nests |>
  pivot_longer(
    cols = c(`2007`, `2009`),
    names_to = "year",
    values_to = "nests"
  )
eagle_nests_longer
1. pivot_longer() transforms wide data into a longer, "tall" format.
2. cols specifies the columns to pivot.
3. names_to creates a new column holding the former column names.
4. values_to creates a new column holding the values previously spread across multiple columns.
Why are "year" and "nests" in quotes? Because they are new column names that we are creating. As a general rule, when pivoting, we should always use quotes when creating new columns.
Specifying the columns can sometimes feel like using select(). For instance, you can list the columns to include, or exclude the columns you want to keep untouched. The code below achieves the same result by excluding everything but region:
example_eagle_nests |>
  pivot_longer(
    cols = -c(region),
    names_to = "year",
    values_to = "nests"
  )
Exercise
Try your hand at the billboard dataset (in the tidyverse). This dataset has columns labeled by week (wk1 through wk76), but we want a tidy format that moves those week columns into a single "week" variable and the rankings into another "rank" variable. Use pivot_longer() to make it happen.
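One possible sketch follows for checking your work (it is not the only solution); names_prefix, names_transform, and values_drop_na are optional refinements that strip the "wk" prefix, store the week as an integer, and drop weeks where a song had already left the chart.
billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    names_prefix = "wk",
    names_transform = list(week = as.integer),
    values_drop_na = TRUE
  )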
8.2 pivot_wider()
The pivot_wider() function reverses what pivot_longer() does, turning long datasets into wide formats. Suppose you want to revert our eagle_nests_longer data back to its original shape:
eagle_nests_longer |>
  pivot_wider(
    names_from = "year",
    values_from = "nests"
  )
1. pivot_wider() transforms long data into a wider format.
2. names_from identifies which column contains the new column names.
3. values_from identifies the column whose values will fill the new cells.
Let's see a more practical example with our penguins dataset. Perhaps we want to compare bill length between male and female penguins for each species. We can create a summary of the mean bill length by species and sex, then convert that summary into a wide layout.
We first use our hard-earned data transformation skills to calculate the mean bill length by group:
library(palmerpenguins)
penguins |>
  group_by(species, sex) |>
  summarise(mean_bill_length = mean(bill_length_mm))
Because comparing values side by side can be clearer, let's pivot the data wider:
penguins |>
  group_by(species, sex) |>
  summarise(mean_bill_length = mean(bill_length_mm)) |>
  ungroup() |>
  pivot_wider(
    names_from = "sex",
    values_from = "mean_bill_length"
  )
We can go one step further and visualize how male and female bill lengths compare:
penguins |>
  group_by(species, sex) |>
  summarise(mean_bill_length = mean(bill_length_mm)) |>
  ungroup() |>
  pivot_wider(
    names_from = "sex",
    values_from = "mean_bill_length"
  ) |>
  ggplot(aes(male, female)) +
  geom_point(size = 4) +
  ggrepel::geom_text_repel(aes(label = species), box.padding = 1) +
  geom_abline(intercept = 0, slope = 1) +
  jtools::theme_nice()
Exercise
The us_rent_income dataset (from the tidyverse) contains 2017 American Community Survey data, where variable indicates whether it's yearly income or monthly rent, and estimate gives the numeric value. Use pivot_wider() to reshape the data, then calculate the ratio of annual rent to income, and finally sort the results by that ratio.
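Here is one possible sketch, assuming we annualize the monthly rent by multiplying by 12 and drop the GEOID and moe columns first so each state collapses to a single row; the rent_ratio name is an arbitrary choice.
us_rent_income |>
  select(NAME, variable, estimate) |>
  pivot_wider(
    names_from = "variable",
    values_from = "estimate"
  ) |>
  mutate(rent_ratio = (rent * 12) / income) |>
  arrange(desc(rent_ratio))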
8.3 separate()
The separate() function allows you to split one column into multiple columns. Consider the example_gymnastics_2 dataset from dcldata, which tracks women's Olympic gymnastics scores for 2012 and 2016:
example_gymnastics_2
The column names aren't tidy. First, let's pivot everything into a long format:
example_gymnastics_2 |>
  pivot_longer(
    cols = -c(country),
    names_to = "event_year",
    values_to = "score"
  )
We see that the new event_year column contains both an event name and a year (like vault_2012). We can split this into two columns, event and year, using separate():
example_gymnastics_2 |>
  pivot_longer(
    cols = -c(country),
    names_to = "event_year",
    values_to = "score"
  ) |>
  separate(
    col = "event_year",
    into = c("event", "year"),
    sep = "_"
  )
unite()
While separate() splits columns, unite() does the opposite: it combines multiple columns into one.
For a simple example, imagine that we've separated out the month, day, and year of a date into three columns (month, day, year), but now we want a single date column again. Let's illustrate with a small toy dataset:
toy_dates <- tibble(
  month = c(1, 2, 3),
  day = c(15, 20, 25),
  year = c(2020, 2020, 2020)
)
We can re-combine these columns into one:
toy_dates |>
  unite(
    col = "full_date",
    month, day, year,
    sep = "-"
  )
1. unite() combines the listed columns into a single new column.
2. The new column is named "full_date".
3. The columns to combine are month, day, and year.
4. sep = "-" indicates how to join the values (here, with a dash).
Exercise
Now let's tidy the who2 dataset (from the tidyverse). Columns like sp_m_014 combine multiple pieces of information: diagnosis method (sp), gender (m), and age group (014). Your task:
- Convert the data into a long format using pivot_longer().
- Split the combined column into three separate columns (diagnosis, gender, age) using separate().
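If you get stuck, here is one possible sketch; the intermediate column names ("key" and "count") are arbitrary choices, and values_drop_na = TRUE is an optional extra.
who2 |>
  pivot_longer(
    cols = !c(country, year),
    names_to = "key",
    values_to = "count",
    values_drop_na = TRUE
  ) |>
  separate(
    col = "key",
    into = c("diagnosis", "gender", "age"),
    sep = "_"
  )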
8.4 drop_na() and replace_na()
When dealing with real-world data, missing values (often represented by NA in R) are unavoidable. Handling them well is a broad topic and usually question-specific, but here we show two functions that help you manage missing data. The main point is that you should ALWAYS be explicit about how you handle missing values.
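As a quick illustration of why being explicit matters, base R's mean() silently returns NA whenever any value is missing; this small sketch reuses the penguins data loaded earlier.
# A single missing value makes the whole summary NA by default
mean(penguins$bill_length_mm)
# returns NA

# Opting in to na.rm = TRUE states, in code, that we are discarding the missing values
mean(penguins$bill_length_mm, na.rm = TRUE)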
8.4.1 drop_na()
The function drop_na() removes rows that contain any missing values in the specified columns. This can be handy when a row is so incomplete that it's unusable for analysis.
For example, let's take a look at the penguins dataset (pretend we didn't want to keep partial observations):
penguins
We can drop any row with missing data simply by doing:
penguins |>
  drop_na()
1. drop_na() removes rows with missing values.
drop_na: removed 11 rows (3%), 333 rows remaining
If you only care about missing values in particular columns (say, bill_length_mm), you can specify them as arguments:
penguins |>
  drop_na(bill_length_mm)
1. drop_na() removes rows with missing values in the bill_length_mm column.
drop_na: removed 2 rows (1%), 342 rows remaining
8.4.2 replace_na()
Sometimes you don't want to drop missing values; you might prefer to fill them with a default value or placeholder. The replace_na() function allows you to do exactly that.
For instance, if we believe that the penguins with a missing sex are most likely male, we can replace the missing values with that assumption:
penguins |>
  replace_na(
    list(sex = 'male')
  )
1. Replaces missing values in the sex column as specified.
replace_na: changed 11 values (3%) of 'sex' (11 fewer NAs)
8.5 Tidy a real dataset
Let's explore the famous Datasaurus dataset, which perfectly illustrates why visualizing your data is crucial. This dataset contains multiple distributions that have identical summary statistics but look completely different when plotted.
library(tidyverse)
# pak::pkg_install("datasauRus")
library(datasauRus)
datasaurus_dozen_wide
1. datasaurus_dozen_wide is the wide format of the Datasaurus dataset.
Let's tidy this dataset in three steps:
- First, we'll transform it from wide to long format:
datasaurus_dozen_long <- datasaurus_dozen_wide |>
  mutate(row = row_number()) |>
  pivot_longer(
    cols = -"row",
    names_to = "variable",
    values_to = "value"
  )
1. mutate(row = row_number()) adds a row identifier.
2. pivot_longer() pivots the data from wide to long format.
3. names_to creates a new column holding the former column names.
4. values_to creates a new column holding the values previously spread across multiple columns.
mutate: new variable 'row' (integer) with 142 unique values and 0% NA
pivot_longer: reorganized (away_x, away_y, bullseye_x, bullseye_y, circle_x, …) into (variable, value) [was 142x27, now 3692x3]
- Next, we'll split the variable column into the dataset name and the coordinate, and finally pivot the coordinates back into x and y columns:
datasaurus <- datasaurus_dozen_long |>
  separate(
    variable,
    into = c("dataset", "coord"),
    sep = "_(?=[^_]*$)"
  ) |>
  pivot_wider(
    names_from = "coord",
    values_from = "value"
  )
1. separate() splits variable at the last underscore (the lookahead matches only the final underscore).
2. pivot_wider() pivots the data from long to wide format.
pivot_wider: reorganized (coord, value) into (x, y) [was 3692x4, now 1846x4]
datasaurus
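A brief aside on that sep argument: "_(?=[^_]*$)" uses a regular-expression lookahead to match only the underscore with no further underscores after it, i.e. the last one. The toy tibble below (invented for illustration) shows why this matters for names like wide_lines_x.
toy <- tibble(variable = c("wide_lines_x", "dots_y"), value = c(1, 2))

# sep = "_" would split "wide_lines_x" into three pieces and warn about the extra piece;
# splitting at the last underscore keeps the multi-word dataset name intact
toy |>
  separate(variable, into = c("dataset", "coord"), sep = "_(?=[^_]*$)")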
Now we can demonstrate why this dataset is famous. First, let's check the summary statistics:
# Calculate summary statistics for each dataset
datasaurus |>
  group_by(dataset) |>
  summarise(
    mean_x = mean(x),
    mean_y = mean(y),
    sd_x = sd(x),
    sd_y = sd(y),
    cor = cor(x, y)
  )
group_by: one grouping variable (dataset)
summarise: now 13 rows and 6 columns, ungrouped
Despite having nearly identical summary statistics, when we visualize the data, we see that each dataset forms a completely different shape:
# Create faceted scatter plots
datasaurus |>
  ggplot(aes(x, y, color = dataset)) +
  geom_point() +
  facet_wrap(vars(dataset)) +
  jtools::theme_nice() +
  theme(legend.position = "none") +
  labs(title = "Same summary statistics, very different distributions")
This example demonstrates why it's essential to both analyze AND visualize your data!