7  Six verbs for data transformation

Class Objectives: Mastering six verbs
  1. Rows
    • arrange: Arrange rows based on one or more variables.
    • filter: Filter rows based on a condition.
  2. Columns
    • select: Select columns from a data frame.
    • mutate: Create new columns or modify existing columns.
  3. Groups
    • group_by: Group rows based on one or more variables.
    • summarize: Summarize data within groups.

We’ll pause our visualization work for a moment to focus on a crucial phase of data analysis: data wrangling. Data wrangling is the process of importing, tidying, and transforming data so that it’s easier to analyze and visualize. By mastering these skills, you’ll be able to spend less time on annoying details but more time on actual research. We will start with transofmation.

The dplyr package offers an especially elegant way to handle data transformation in R. In this section, we’ll introduce six core functions (often called “verbs”) that cover the basics: they let you arrange rows, filter rows, select columns, create or modify columns, group data, and summarize data. We’ll use the penguins dataset as our guide, along with tidylog package to keep track of what happens under the hood.

library(tidyverse)
library(palmerpenguins)
library(tidylog)

7.1 What is pipe?

In the tidyverse, the pipe operator |> is a central tool that allows you to link together multiple functions in a step-by-step sequence. Think of it as a chain: the output of one function becomes the input of the next. This approach makes code more readable and easier to understand. When you use a pipe:

  1. You start with a tibble as your first argument.
  2. Every additional function that follows manipulates the tibble in some way.
  3. The result is always a new tibble.

If you’re using R version 4.1 or later, you can write:

data |>
    function1() |> 
    function2()
1
The |> pipe operator is used to pass the output of one function as the input to the next function.

If you have an older version of R, you might still use the magrittr pipe %>% (automatically loaded with tidyverse):

data %>%
    function1() %>%
    function2()

Given how often you’ll use the pipe, it helps to memorize the shortcut: Ctrl + Shift + M (Windows) or Cmd + Shift + M (Mac) in RStudio automatically inserts a pipe operator. You can also change which pipe operator you use in RStudio by going to Tools > Global Options > Code > Editing > Use native pipe operator.

7.2 arrange()

arrange() reorders the rows of your tibble based on the values in one or more columns. For example, to sort the penguins dataset by bill length (from smallest to largest):

penguins |>
    arrange(bill_length_mm)
1
Within arrange function, put in the variable name (column) that you want to sort by.

To sort in descending order, add a minus sign (-) before the column name:

penguins |>
    arrange(-bill_length_mm)
1
The minus sign - before the column name will sort the column in descending order.

Exercise

Sort the penguins dataset by island first, then by descending bill_depth_mm.

7.3 filter()

filter() selectively retains rows based on certain conditions. Suppose you want to keep only the Adelie penguins:

penguins |>
    filter(species == "Adelie")
1
Within filter function, put in the condition you want to filter by. Note that it is == for equal not =.
filter: removed 192 rows (56%), 152 rows remaining

You can combine multiple conditions with logical operators. For instance, filter for Adelie penguins that are male:

penguins |>
    filter(species == "Adelie" & sex == "male")
1
Use the & operator to combine multiple conditions.
filter: removed 271 rows (79%), 73 rows remaining

To allow for “either/or” conditions, you can use the | operator or the %in% function. Below, we keep rows where the species is Adelie or Chinstrap:

penguins |>
    filter(species == "Adelie" | species == "Chinstrap")
1
Use the | operator to combine multiple conditions.
filter: removed 124 rows (36%), 220 rows remaining

This is particularly useful when you have many values to filter.

penguins |>
    filter(species %in% c("Adelie", "Chinstrap"))
1
Use the %in% function to filter rows based on multiple values.
filter: removed 124 rows (36%), 220 rows remaining

Exercise

Filter the penguins dataset for male penguins with bill length less than 40 mm, living on either Torgersen or Biscoe island.

7.4 select()

Where filter() chooses rows, select() helps you choose columns. For instance, if you’re only interested in species, island, and the two bill measurements:

penguins |>
    select(species, island, bill_length_mm, bill_depth_mm)
1
List the columns you want inside select().
select: dropped 4 variables (flipper_length_mm, body_mass_g, sex, year)

If you want to remove specific columns, put a minus sign in front of each column name:

penguins |>
    select(-island)
1
The minus sign - before the column name will exclude the column from the output.
select: dropped one variable (island)

You can rearrange columns simply by listing them in the order you want:

penguins |> 
    select(bill_depth_mm, bill_length_mm, species)
select: dropped 5 variables (island, flipper_length_mm, body_mass_g, sex, year)

Exercise

Select the species, year, body_mass_g columns from the penguins dataset.

7.5 mutate()

mutate() allows you to create new columns or change existing ones. This name is often considered the least intutive in tidyverse, but I guess it makes the most sense for our department :)

For example, suppose you want to create a new column called bill_ratio—which is bill length divided by bill depth—and also adjust the year column so that the year 2017 becomes your new “zero” starting point:

penguins |>
    mutate(bill_ratio = bill_length_mm / bill_depth_mm) |>
    mutate(year = year - 2017)
1
Within mutate function, put in the new column name and the calculation.
2
We subtract 2017 from the existing year column so that 2017 becomes year 0. This new column replaces the previous one
mutate: new variable 'bill_ratio' (double) with 338 unique values and 1% NA
mutate: converted 'year' from integer to double (0 new NA)

A very useful trick is to combine mutate() with conditional functions like if_else() (two conditions) or case_when() (multiple conditions):

penguins |>
    mutate(bill_ratio = bill_length_mm / bill_depth_mm) |>
    mutate(bill_category = if_else(bill_ratio > 2, "long", "short"))  |>
    select(species, bill_category)
1
Create a new column bill_ratio that is the ratio of bill_length_mm to bill_depth_mm.
2
Create a new column bill_category that is long if bill_ratio is greater than 2, otherwise short.
3
Select only the species and bill_category columns.
mutate: new variable 'bill_ratio' (double) with 338 unique values and 1% NA
mutate: new variable 'bill_category' (character) with 3 unique values and 1% NA
select: dropped 8 variables (island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, …)
penguins |>
    mutate(bill_ratio = bill_length_mm / bill_depth_mm) |>
    mutate(bill_category = case_when(
        bill_ratio > 2 ~ "long",
        bill_ratio < 1.5 ~ "short",
        TRUE ~ "medium"
    ))  |> 
    select(species, bill_category)
1
Create a new column bill_category based on the conditions.
2
If bill_ratio is greater than 2, then bill_category is long.
3
If bill_ratio is less than 1.5, then bill_category is short.
4
If none of the above conditions are met, then bill_category is medium.
mutate: new variable 'bill_ratio' (double) with 338 unique values and 1% NA
mutate: new variable 'bill_category' (character) with 2 unique values and 0% NA
select: dropped 8 variables (island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, …)

Exercise

Create a new column bill_area (the product of bill_length_mm and bill_depth_mm). Label those with bill_area > 700 as “big” and everything else as “small”. Finally, select only three columns: species, bill_area, and bill_category.

7.6 group_by()

We’ve dealt with rows and columns individually, but one of the greatest strengths of dplyr is its ability to handle groups. With group_by(), you tell R to treat subsets of your data as separate groups based on one or more columns. For example:

penguins |>
    group_by(species)
1
Within group_by function, specify the column(s) you want to group by.
group_by: one grouping variable (species)

Once data is grouped, you can apply other operations such as filter() or mutate() within each group. For example, to keep only penguins whose bill length is above the mean bill length for their own species (use filter()):

penguins |>
    group_by(species) |>
    filter(bill_length_mm > mean(bill_length_mm, na.rm = TRUE))  |>
    ungroup()
1
Filter the penguins dataset to only penguins with bill_length_mm greater than the mean bill_length_mm for each species.
2
Use ungroup to remove the grouping.
group_by: one grouping variable (species)
filter (grouped): removed 172 rows (50%), 172 rows remaining (removed 0 groups, 3 groups remaining)
ungroup: no grouping variables remain

Note that it is generally a good practice to use ungroup() after you are done with the grouping. This hs been the source of many bugs in real-world data analysis.

Similarly, you might standardize bill lengths by species for better comparisons (use mutate()):

penguins |>
    group_by(species) |>
    mutate(scaled_bill_length =
    (bill_length_mm - mean(bill_length_mm, na.rm = TRUE)) /
    sd(bill_length_mm, na.rm = TRUE) ) |>
    ungroup()  |> 
    select(species, bill_length_mm, scaled_bill_length)
1
Create a new column scaled_bill_length that is the standardized bill_length_mm for each species.
2
Select only the species, bill_length_mm, and scaled_bill_length columns.
group_by: one grouping variable (species)
mutate (grouped): new variable 'scaled_bill_length' (double) with 209 unique values and 1% NA
ungroup: no grouping variables remain
select: dropped 6 variables (island, bill_depth_mm, flipper_length_mm, body_mass_g, sex, …)

Exercise

Group by both species and sex, then filter for penguins whose bill_length_mm is greater than the mean plus one standard deviation for their species-and-sex group.

7.7 summarize()

Once you’ve grouped your data, summarize helps you collapse each group into summary results per grouping category. For instance, to compute the average bill_length_mm by species:

penguins |>
    group_by(species) |>
    summarize(
        avg_bill_length = mean(bill_length_mm, na.rm = TRUE)
    ) 
1
Within summarize function, put in the new column name and the calculation.
2
Calculate the average bill_length_mm for each species. The na.rm = TRUE argument is used to remove missing values (NAs).
group_by: one grouping variable (species)
summarize: now 3 rows and 2 columns, ungrouped

Notice that it has also removed the ungrouped columns. This is a feature, not a bug.

We can also get more than one summary results. For example, we can get the mean and variance of bill_length_mm for each species and each sex:

penguins |>
    group_by(species, sex)  |> 
    summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE), 
              var_bill_length = var(bill_length_mm, na.rm = TRUE))
group_by: 2 grouping variables (species, sex)
summarize: now 8 rows and 4 columns, one group variable remaining (species)

Exercise

Group the penguins dataset by island and sex. Then calculate the average bill_length_mm and bill_depth_mm for each group.

7.8 Combine verbs for complicated tasks

By now, you’ve seen how each of these six verbs works. In practice, you’ll often chain them together to tackle more involved questions. Here’s an example where we examine the difference in bill area between male and female penguins in the year 2009. Specifically, we’ll compute the mean and standard deviation of bill area per species, reorder the columns to place sex before species, and sort by descending mean bill size:

penguins |> 
    filter(year == 2009) |>
    mutate(bill_area = bill_length_mm * bill_depth_mm) |>
    group_by(species, sex)  |>
    summarize(
        mean_bill_size = mean(bill_area, na.rm = TRUE),
        sd_bill_size = sd(bill_area, na.rm = TRUE)
    )  |> 
    ungroup() |>
    select(sex, species, mean_bill_size, sd_bill_size) |>
    arrange(-mean_bill_size)
filter: removed 224 rows (65%), 120 rows remaining
mutate: new variable 'bill_area' (double) with 120 unique values and 1% NA
group_by: 2 grouping variables (species, sex)
summarize: now 7 rows and 4 columns, one group variable remaining (species)
ungroup: no grouping variables remain
select: columns reordered (sex, species, mean_bill_size, sd_bill_size)

By weaving together these verbs, you can accomplish nuanced data transformations in just a few lines of code.