library(tidyverse)
library(palmerpenguins)
library(tidylog)7 Six verbs for data transformation
- Rows
arrange: Arrange rows based on one or more variables.filter: Filter rows based on a condition.
- Columns
select: Select columns from a data frame.mutate: Create new columns or modify existing columns.
- Groups
group_by: Group rows based on one or more variables.summarize: Summarize data within groups.
We’ll pause our visualization work for a moment to focus on a crucial phase of data analysis: data wrangling. Data wrangling is the process of importing, tidying, and transforming data so that it’s easier to analyze and visualize. By mastering these skills, you’ll be able to spend less time on annoying details but more time on actual research. We will start with transofmation.
The dplyr package offers an especially elegant way to handle data transformation in R. In this section, we’ll introduce six core functions (often called “verbs”) that cover the basics: they let you arrange rows, filter rows, select columns, create or modify columns, group data, and summarize data. We’ll use the penguins dataset as our guide, along with tidylog package to keep track of what happens under the hood.
7.1 What is pipe?
In the tidyverse, the pipe operator |> is a central tool that allows you to link together multiple functions in a step-by-step sequence. Think of it as a chain: the output of one function becomes the input of the next. This approach makes code more readable and easier to understand. When you use a pipe:
- You start with a tibble as your first argument.
- Every additional function that follows manipulates the tibble in some way.
- The result is always a new tibble.
If you’re using R version 4.1 or later, you can write:
data |>
function1() |>
function2()- 1
-
The
|>pipe operator is used to pass the output of one function as the input to the next function.
If you have an older version of R, you might still use the magrittr pipe %>% (automatically loaded with tidyverse):
data %>%
function1() %>%
function2()Given how often you’ll use the pipe, it helps to memorize the shortcut: Ctrl + Shift + M (Windows) or Cmd + Shift + M (Mac) in RStudio automatically inserts a pipe operator. You can also change which pipe operator you use in RStudio by going to Tools > Global Options > Code > Editing > Use native pipe operator.
7.2 arrange()
arrange() reorders the rows of your tibble based on the values in one or more columns. For example, to sort the penguins dataset by bill length (from smallest to largest):
penguins |>
arrange(bill_length_mm)- 1
-
Within
arrangefunction, put in the variable name (column) that you want to sort by.
To sort in descending order, add a minus sign (-) before the column name:
penguins |>
arrange(-bill_length_mm)- 1
-
The minus sign
-before the column name will sort the column in descending order.
Exercise
Sort the penguins dataset by island first, then by descending bill_depth_mm.
7.3 filter()
filter() selectively retains rows based on certain conditions. Suppose you want to keep only the Adelie penguins:
penguins |>
filter(species == "Adelie")- 1
-
Within
filterfunction, put in the condition you want to filter by. Note that it is==for equal not=.
filter: removed 192 rows (56%), 152 rows remaining
You can combine multiple conditions with logical operators. For instance, filter for Adelie penguins that are male:
penguins |>
filter(species == "Adelie" & sex == "male")- 1
-
Use the
&operator to combine multiple conditions.
filter: removed 271 rows (79%), 73 rows remaining
To allow for “either/or” conditions, you can use the | operator or the %in% function. Below, we keep rows where the species is Adelie or Chinstrap:
penguins |>
filter(species == "Adelie" | species == "Chinstrap")- 1
-
Use the
|operator to combine multiple conditions.
filter: removed 124 rows (36%), 220 rows remaining
This is particularly useful when you have many values to filter.
penguins |>
filter(species %in% c("Adelie", "Chinstrap"))- 1
-
Use the
%in%function to filter rows based on multiple values.
filter: removed 124 rows (36%), 220 rows remaining
Exercise
Filter the penguins dataset for male penguins with bill length less than 40 mm, living on either Torgersen or Biscoe island.
7.4 select()
Where filter() chooses rows, select() helps you choose columns. For instance, if you’re only interested in species, island, and the two bill measurements:
penguins |>
select(species, island, bill_length_mm, bill_depth_mm)- 1
- List the columns you want inside select().
select: dropped 4 variables (flipper_length_mm, body_mass_g, sex, year)
If you want to remove specific columns, put a minus sign in front of each column name:
penguins |>
select(-island)- 1
-
The minus sign
-before the column name will exclude the column from the output.
select: dropped one variable (island)
You can rearrange columns simply by listing them in the order you want:
penguins |>
select(bill_depth_mm, bill_length_mm, species)select: dropped 5 variables (island, flipper_length_mm, body_mass_g, sex, year)
Exercise
Select the species, year, body_mass_g columns from the penguins dataset.
7.5 mutate()
mutate() allows you to create new columns or change existing ones. This name is often considered the least intutive in tidyverse, but I guess it makes the most sense for our department :)
For example, suppose you want to create a new column called bill_ratio—which is bill length divided by bill depth—and also adjust the year column so that the year 2017 becomes your new “zero” starting point:
penguins |>
mutate(bill_ratio = bill_length_mm / bill_depth_mm) |>
mutate(year = year - 2017)- 1
-
Within
mutatefunction, put in the new column name and the calculation. - 2
- We subtract 2017 from the existing year column so that 2017 becomes year 0. This new column replaces the previous one
mutate: new variable 'bill_ratio' (double) with 338 unique values and 1% NA
mutate: converted 'year' from integer to double (0 new NA)
A very useful trick is to combine mutate() with conditional functions like if_else() (two conditions) or case_when() (multiple conditions):
penguins |>
mutate(bill_ratio = bill_length_mm / bill_depth_mm) |>
mutate(bill_category = if_else(bill_ratio > 2, "long", "short")) |>
select(species, bill_category)- 1
-
Create a new column
bill_ratiothat is the ratio ofbill_length_mmtobill_depth_mm. - 2
-
Create a new column
bill_categorythat islongifbill_ratiois greater than 2, otherwiseshort. - 3
-
Select only the
speciesandbill_categorycolumns.
mutate: new variable 'bill_ratio' (double) with 338 unique values and 1% NA
mutate: new variable 'bill_category' (character) with 3 unique values and 1% NA
select: dropped 8 variables (island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, …)
penguins |>
mutate(bill_ratio = bill_length_mm / bill_depth_mm) |>
mutate(bill_category = case_when(
bill_ratio > 2 ~ "long",
bill_ratio < 1.5 ~ "short",
TRUE ~ "medium"
)) |>
select(species, bill_category)- 1
-
Create a new column
bill_categorybased on the conditions. - 2
-
If
bill_ratiois greater than 2, thenbill_categoryislong. - 3
-
If
bill_ratiois less than 1.5, thenbill_categoryisshort. - 4
-
If none of the above conditions are met, then
bill_categoryismedium.
mutate: new variable 'bill_ratio' (double) with 338 unique values and 1% NA
mutate: new variable 'bill_category' (character) with 2 unique values and 0% NA
select: dropped 8 variables (island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, …)
Exercise
Create a new column bill_area (the product of bill_length_mm and bill_depth_mm). Label those with bill_area > 700 as “big” and everything else as “small”. Finally, select only three columns: species, bill_area, and bill_category.
7.6 group_by()
We’ve dealt with rows and columns individually, but one of the greatest strengths of dplyr is its ability to handle groups. With group_by(), you tell R to treat subsets of your data as separate groups based on one or more columns. For example:
penguins |>
group_by(species)- 1
-
Within
group_byfunction, specify the column(s) you want to group by.
group_by: one grouping variable (species)
Once data is grouped, you can apply other operations such as filter() or mutate() within each group. For example, to keep only penguins whose bill length is above the mean bill length for their own species (use filter()):
penguins |>
group_by(species) |>
filter(bill_length_mm > mean(bill_length_mm, na.rm = TRUE)) |>
ungroup()- 1
-
Filter the
penguinsdataset to only penguins withbill_length_mmgreater than the meanbill_length_mmfor each species. - 2
-
Use
ungroupto remove the grouping.
group_by: one grouping variable (species)
filter (grouped): removed 172 rows (50%), 172 rows remaining (removed 0 groups, 3 groups remaining)
ungroup: no grouping variables remain
Note that it is generally a good practice to use ungroup() after you are done with the grouping. This hs been the source of many bugs in real-world data analysis.
Similarly, you might standardize bill lengths by species for better comparisons (use mutate()):
penguins |>
group_by(species) |>
mutate(scaled_bill_length =
(bill_length_mm - mean(bill_length_mm, na.rm = TRUE)) /
sd(bill_length_mm, na.rm = TRUE) ) |>
ungroup() |>
select(species, bill_length_mm, scaled_bill_length)- 1
-
Create a new column
scaled_bill_lengththat is the standardizedbill_length_mmfor each species. - 2
-
Select only the
species,bill_length_mm, andscaled_bill_lengthcolumns.
group_by: one grouping variable (species)
mutate (grouped): new variable 'scaled_bill_length' (double) with 209 unique values and 1% NA
ungroup: no grouping variables remain
select: dropped 6 variables (island, bill_depth_mm, flipper_length_mm, body_mass_g, sex, …)
Exercise
Group by both species and sex, then filter for penguins whose bill_length_mm is greater than the mean plus one standard deviation for their species-and-sex group.
7.7 summarize()
Once you’ve grouped your data, summarize helps you collapse each group into summary results per grouping category. For instance, to compute the average bill_length_mm by species:
penguins |>
group_by(species) |>
summarize(
avg_bill_length = mean(bill_length_mm, na.rm = TRUE)
) - 1
-
Within
summarizefunction, put in the new column name and the calculation. - 2
-
Calculate the average
bill_length_mmfor each species. Thena.rm = TRUEargument is used to remove missing values (NAs).
group_by: one grouping variable (species)
summarize: now 3 rows and 2 columns, ungrouped
Notice that it has also removed the ungrouped columns. This is a feature, not a bug.
We can also get more than one summary results. For example, we can get the mean and variance of bill_length_mm for each species and each sex:
penguins |>
group_by(species, sex) |>
summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
var_bill_length = var(bill_length_mm, na.rm = TRUE))group_by: 2 grouping variables (species, sex)
summarize: now 8 rows and 4 columns, one group variable remaining (species)
Exercise
Group the penguins dataset by island and sex. Then calculate the average bill_length_mm and bill_depth_mm for each group.
7.8 Combine verbs for complicated tasks
By now, you’ve seen how each of these six verbs works. In practice, you’ll often chain them together to tackle more involved questions. Here’s an example where we examine the difference in bill area between male and female penguins in the year 2009. Specifically, we’ll compute the mean and standard deviation of bill area per species, reorder the columns to place sex before species, and sort by descending mean bill size:
penguins |>
filter(year == 2009) |>
mutate(bill_area = bill_length_mm * bill_depth_mm) |>
group_by(species, sex) |>
summarize(
mean_bill_size = mean(bill_area, na.rm = TRUE),
sd_bill_size = sd(bill_area, na.rm = TRUE)
) |>
ungroup() |>
select(sex, species, mean_bill_size, sd_bill_size) |>
arrange(-mean_bill_size)filter: removed 224 rows (65%), 120 rows remaining
mutate: new variable 'bill_area' (double) with 120 unique values and 1% NA
group_by: 2 grouping variables (species, sex)
summarize: now 7 rows and 4 columns, one group variable remaining (species)
ungroup: no grouping variables remain
select: columns reordered (sex, species, mean_bill_size, sd_bill_size)
By weaving together these verbs, you can accomplish nuanced data transformations in just a few lines of code.