library(tidyverse)
library(palmerpenguins)
library(tidylog)
7 Six verbs for data transformation
- Rows
arrange
: Arrange rows based on one or more variables.filter
: Filter rows based on a condition.
- Columns
select
: Select columns from a data frame.mutate
: Create new columns or modify existing columns.
- Groups
group_by
: Group rows based on one or more variables.summarize
: Summarize data within groups.
We’ll pause our visualization work for a moment to focus on a crucial phase of data analysis: data wrangling. Data wrangling is the process of importing, tidying, and transforming data so that it’s easier to analyze and visualize. By mastering these skills, you’ll be able to spend less time on annoying details but more time on actual research. We will start with transofmation.
The dplyr
package offers an especially elegant way to handle data transformation in R. In this section, we’ll introduce six core functions (often called “verbs”) that cover the basics: they let you arrange rows, filter rows, select columns, create or modify columns, group data, and summarize data. We’ll use the penguins dataset as our guide, along with tidylog
package to keep track of what happens under the hood.
7.1 What is pipe
?
In the tidyverse, the pipe operator |>
is a central tool that allows you to link together multiple functions in a step-by-step sequence. Think of it as a chain: the output of one function becomes the input of the next. This approach makes code more readable and easier to understand. When you use a pipe:
- You start with a tibble as your first argument.
- Every additional function that follows manipulates the tibble in some way.
- The result is always a new tibble.
If you’re using R version 4.1 or later, you can write:
|>
data function1() |>
function2()
- 1
-
The
|>
pipe operator is used to pass the output of one function as the input to the next function.
If you have an older version of R, you might still use the magrittr
pipe %>%
(automatically loaded with tidyverse
):
%>%
data function1() %>%
function2()
Given how often you’ll use the pipe, it helps to memorize the shortcut: Ctrl + Shift + M
(Windows) or Cmd + Shift + M
(Mac) in RStudio automatically inserts a pipe operator. You can also change which pipe operator you use in RStudio by going to Tools
> Global Options
> Code
> Editing
> Use native pipe operator
.
7.2 arrange()
arrange()
reorders the rows of your tibble based on the values in one or more columns. For example, to sort the penguins
dataset by bill length (from smallest to largest):
|>
penguins arrange(bill_length_mm)
- 1
-
Within
arrange
function, put in the variable name (column) that you want to sort by.
To sort in descending order, add a minus sign (-
) before the column name:
|>
penguins arrange(-bill_length_mm)
- 1
-
The minus sign
-
before the column name will sort the column in descending order.
Exercise
Sort the penguins
dataset by island
first, then by descending bill_depth_mm
.
7.3 filter()
filter()
selectively retains rows based on certain conditions. Suppose you want to keep only the Adelie penguins:
|>
penguins filter(species == "Adelie")
- 1
-
Within
filter
function, put in the condition you want to filter by. Note that it is==
for equal not=
.
filter: removed 192 rows (56%), 152 rows remaining
You can combine multiple conditions with logical operators. For instance, filter for Adelie penguins that are male:
|>
penguins filter(species == "Adelie" & sex == "male")
- 1
-
Use the
&
operator to combine multiple conditions.
filter: removed 271 rows (79%), 73 rows remaining
To allow for “either/or” conditions, you can use the |
operator or the %in%
function. Below, we keep rows where the species is Adelie or Chinstrap:
|>
penguins filter(species == "Adelie" | species == "Chinstrap")
- 1
-
Use the
|
operator to combine multiple conditions.
filter: removed 124 rows (36%), 220 rows remaining
This is particularly useful when you have many values to filter.
|>
penguins filter(species %in% c("Adelie", "Chinstrap"))
- 1
-
Use the
%in%
function to filter rows based on multiple values.
filter: removed 124 rows (36%), 220 rows remaining
Exercise
Filter the penguins
dataset for male penguins with bill length less than 40 mm, living on either Torgersen or Biscoe island.
7.4 select()
Where filter()
chooses rows, select()
helps you choose columns. For instance, if you’re only interested in species, island, and the two bill measurements:
|>
penguins select(species, island, bill_length_mm, bill_depth_mm)
- 1
- List the columns you want inside select().
select: dropped 4 variables (flipper_length_mm, body_mass_g, sex, year)
If you want to remove specific columns, put a minus sign in front of each column name:
|>
penguins select(-island)
- 1
-
The minus sign
-
before the column name will exclude the column from the output.
select: dropped one variable (island)
You can rearrange columns simply by listing them in the order you want:
|>
penguins select(bill_depth_mm, bill_length_mm, species)
select: dropped 5 variables (island, flipper_length_mm, body_mass_g, sex, year)
Exercise
Select the species
, year
, body_mass_g
columns from the penguins
dataset.
7.5 mutate()
mutate()
allows you to create new columns or change existing ones. This name is often considered the least intutive in tidyverse, but I guess it makes the most sense for our department :)
For example, suppose you want to create a new column called bill_ratio—which is bill length divided by bill depth—and also adjust the year column so that the year 2017 becomes your new “zero” starting point:
|>
penguins mutate(bill_ratio = bill_length_mm / bill_depth_mm) |>
mutate(year = year - 2017)
- 1
-
Within
mutate
function, put in the new column name and the calculation. - 2
- We subtract 2017 from the existing year column so that 2017 becomes year 0. This new column replaces the previous one
mutate: new variable 'bill_ratio' (double) with 338 unique values and 1% NA
mutate: converted 'year' from integer to double (0 new NA)
A very useful trick is to combine mutate()
with conditional functions like if_else()
(two conditions) or case_when()
(multiple conditions):
|>
penguins mutate(bill_ratio = bill_length_mm / bill_depth_mm) |>
mutate(bill_category = if_else(bill_ratio > 2, "long", "short")) |>
select(species, bill_category)
- 1
-
Create a new column
bill_ratio
that is the ratio ofbill_length_mm
tobill_depth_mm
. - 2
-
Create a new column
bill_category
that islong
ifbill_ratio
is greater than 2, otherwiseshort
. - 3
-
Select only the
species
andbill_category
columns.
mutate: new variable 'bill_ratio' (double) with 338 unique values and 1% NA
mutate: new variable 'bill_category' (character) with 3 unique values and 1% NA
select: dropped 8 variables (island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, …)
|>
penguins mutate(bill_ratio = bill_length_mm / bill_depth_mm) |>
mutate(bill_category = case_when(
> 2 ~ "long",
bill_ratio < 1.5 ~ "short",
bill_ratio TRUE ~ "medium"
|>
)) select(species, bill_category)
- 1
-
Create a new column
bill_category
based on the conditions. - 2
-
If
bill_ratio
is greater than 2, thenbill_category
islong
. - 3
-
If
bill_ratio
is less than 1.5, thenbill_category
isshort
. - 4
-
If none of the above conditions are met, then
bill_category
ismedium
.
mutate: new variable 'bill_ratio' (double) with 338 unique values and 1% NA
mutate: new variable 'bill_category' (character) with 2 unique values and 0% NA
select: dropped 8 variables (island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, …)
Exercise
Create a new column bill_area
(the product of bill_length_mm and bill_depth_mm). Label those with bill_area
> 700 as “big” and everything else as “small”. Finally, select only three columns: species
, bill_area
, and bill_category
.
7.6 group_by()
We’ve dealt with rows and columns individually, but one of the greatest strengths of dplyr
is its ability to handle groups. With group_by()
, you tell R to treat subsets of your data as separate groups based on one or more columns. For example:
|>
penguins group_by(species)
- 1
-
Within
group_by
function, specify the column(s) you want to group by.
group_by: one grouping variable (species)
Once data is grouped, you can apply other operations such as filter()
or mutate()
within each group. For example, to keep only penguins whose bill length is above the mean bill length for their own species (use filter()
):
|>
penguins group_by(species) |>
filter(bill_length_mm > mean(bill_length_mm, na.rm = TRUE)) |>
ungroup()
- 1
-
Filter the
penguins
dataset to only penguins withbill_length_mm
greater than the meanbill_length_mm
for each species. - 2
-
Use
ungroup
to remove the grouping.
group_by: one grouping variable (species)
filter (grouped): removed 172 rows (50%), 172 rows remaining (removed 0 groups, 3 groups remaining)
ungroup: no grouping variables remain
Note that it is generally a good practice to use ungroup()
after you are done with the grouping. This hs been the source of many bugs in real-world data analysis.
Similarly, you might standardize bill lengths by species for better comparisons (use mutate()
):
|>
penguins group_by(species) |>
mutate(scaled_bill_length =
- mean(bill_length_mm, na.rm = TRUE)) /
(bill_length_mm sd(bill_length_mm, na.rm = TRUE) ) |>
ungroup() |>
select(species, bill_length_mm, scaled_bill_length)
- 1
-
Create a new column
scaled_bill_length
that is the standardizedbill_length_mm
for each species. - 2
-
Select only the
species
,bill_length_mm
, andscaled_bill_length
columns.
group_by: one grouping variable (species)
mutate (grouped): new variable 'scaled_bill_length' (double) with 209 unique values and 1% NA
ungroup: no grouping variables remain
select: dropped 6 variables (island, bill_depth_mm, flipper_length_mm, body_mass_g, sex, …)
Exercise
Group by both species
and sex
, then filter for penguins whose bill_length_mm
is greater than the mean plus one standard deviation for their species-and-sex group.
7.7 summarize()
Once you’ve grouped your data, summarize
helps you collapse each group into summary results per grouping category. For instance, to compute the average bill_length_mm
by species:
|>
penguins group_by(species) |>
summarize(
avg_bill_length = mean(bill_length_mm, na.rm = TRUE)
)
- 1
-
Within
summarize
function, put in the new column name and the calculation. - 2
-
Calculate the average
bill_length_mm
for each species. Thena.rm = TRUE
argument is used to remove missing values (NAs).
group_by: one grouping variable (species)
summarize: now 3 rows and 2 columns, ungrouped
Notice that it has also removed the ungrouped columns. This is a feature, not a bug.
We can also get more than one summary results. For example, we can get the mean and variance of bill_length_mm
for each species and each sex:
|>
penguins group_by(species, sex) |>
summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
var_bill_length = var(bill_length_mm, na.rm = TRUE))
group_by: 2 grouping variables (species, sex)
summarize: now 8 rows and 4 columns, one group variable remaining (species)
Exercise
Group the penguins
dataset by island
and sex
. Then calculate the average bill_length_mm
and bill_depth_mm
for each group.
7.8 Combine verbs for complicated tasks
By now, you’ve seen how each of these six verbs works. In practice, you’ll often chain them together to tackle more involved questions. Here’s an example where we examine the difference in bill area between male and female penguins in the year 2009. Specifically, we’ll compute the mean and standard deviation of bill area per species, reorder the columns to place sex before species, and sort by descending mean bill size:
|>
penguins filter(year == 2009) |>
mutate(bill_area = bill_length_mm * bill_depth_mm) |>
group_by(species, sex) |>
summarize(
mean_bill_size = mean(bill_area, na.rm = TRUE),
sd_bill_size = sd(bill_area, na.rm = TRUE)
|>
) ungroup() |>
select(sex, species, mean_bill_size, sd_bill_size) |>
arrange(-mean_bill_size)
filter: removed 224 rows (65%), 120 rows remaining
mutate: new variable 'bill_area' (double) with 120 unique values and 1% NA
group_by: 2 grouping variables (species, sex)
summarize: now 7 rows and 4 columns, one group variable remaining (species)
ungroup: no grouping variables remain
select: columns reordered (sex, species, mean_bill_size, sd_bill_size)
By weaving together these verbs, you can accomplish nuanced data transformations in just a few lines of code.