PSet 2

Welcome back! In this problem set you’ll flex your data-wrangling muscles. You will import, clean, join, and transform real ecological data, then visualize the results in publication-quality figures.

How You’ll Be Graded
Criterion | Weight | What we’re looking for
Required elements | 50% | Does each step include every bullet-pointed requirement listed below?
Readability | 20% | Are axis labels, titles, and annotations large enough to read comfortably? (Recall the Claus Wilke quote from Chapter 6!)
Aesthetics & creativity | 20% | Did you go beyond the minimum? Custom theme, smart color palette, clean layout, etc.
Reproducibility | 10% | Does your script run top-to-bottom inside the RStudio project and produce the submitted figures via ggsave() with explicit width and height?

The data come from a recent paper studying land-snail diversity across the Galápagos Islands. You can read the original paper here. Let’s get started!

Step 1: Project Setup

  1. Create a new RStudio project for this assignment.
  2. Download the data from here. The archive contains two CSV files:
    • PSet2_snail.csv: snail community data (species diversity, functional diversity, habitat type, per island).
    • PSet2_vegzonetotals.csv: total species counts per vegetation zone for each island.
  3. Organize your files: place both CSV files in a subfolder named data/ within your project directory.

Your project folder should look like this:

PSet2_project/
├── PSet2_project.Rproj
├── data/
│   ├── PSet2_snail.csv
│   └── PSet2_vegzonetotals.csv
└── analysis.R

Using here::here() inside your script builds file paths relative to the project root, so the same code works on any machine (see Chapter 9):

library(tidyverse)  # provides read_csv() via readr
library(here)
data_snail <- read_csv(here("data", "PSet2_snail.csv"))

Step 2: Data Import

  1. Load both CSV files into R using read_csv() (from the readr/tidyverse package).
  2. Clean the column names with janitor::clean_names() for consistency.

Step 1: Import with read_csv(). Use the RStudio import shortcut (File → Import Dataset → From Text (readr)) to preview the data, then copy-paste the generated read_csv() code into your script:

library(tidyverse)

data_snail <- read_csv(here("data", "PSet2_snail.csv"))
data_veg   <- read_csv(here("data", "PSet2_vegzonetotals.csv"))
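If you prefer to check the import from code rather than from the RStudio dialog, the sketch below shows one way to do it. It writes a tiny made-up CSV to a temporary file (the island codes and values are invented, not taken from the assignment data) and then inspects what readr parsed.

```r
library(readr)

# Write a tiny made-up CSV to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("island,habitat,spdiv", "XX,arid,2", "YY,humid,3"), tmp)

toy <- read_csv(tmp, show_col_types = FALSE)

spec(toy)      # column types readr guessed for each column
problems(toy)  # rows readr could not parse (zero rows means a clean read)
```

Running spec() and problems() on data_snail and data_veg right after import is a quick way to catch a numeric column that was read as text.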

Step 2: Clean column names. The janitor package standardizes column names to snake_case:

library(janitor)

data_snail <- data_snail |> clean_names()
data_veg   <- data_veg   |> clean_names()
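To see what clean_names() actually changes, here is a minimal sketch on a toy data frame. The messy column names are made up for illustration and are not the real headers in the assignment files.

```r
library(janitor)

# Toy data frame with deliberately messy, made-up column names
messy <- data.frame(
  `Island Code`  = c("XX", "YY"),
  `Habitat.Type` = c("arid", "humid"),
  check.names = FALSE
)

names(messy)               # names before cleaning

cleaned <- clean_names(messy)
names(cleaned)             # names after cleaning: lowercase, underscores only
```

Comparing names() before and after on your own data tells you exactly which column names to use in the later steps.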

Step 3: Data Cleaning: Filter Out Specific Islands

  1. Remove rows from data_snail where the island column is "CH", "ED", or "GA".
  2. Use filter(), and think about why filter() is the right verb here instead of select() (see Chapter 7).

Filtering rows with %in%. The %in% operator checks whether each value appears in a vector. Negate it with ! to exclude matches:

data_snail <- data_snail |>
  filter(!island %in% c("CH", "ED", "GA"))

Why filter() and not select()? Because filter() removes rows based on a condition, while select() removes columns. Here we want to drop specific islands (rows), not variables (columns).
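A small sketch of the same pattern on toy data may help. The codes "XX" and "YY" and the spdiv values are invented for illustration; only "CH", "ED", and "GA" come from the assignment.

```r
library(dplyr)

# Toy data: three islands to drop plus two made-up codes to keep
toy <- tibble(
  island = c("CH", "ED", "GA", "XX", "YY"),
  spdiv  = c(1, 2, 3, 4, 5)
)

drop_islands <- c("CH", "ED", "GA")

kept <- toy |>
  filter(!island %in% drop_islands)

kept$island

# Sanity check: none of the excluded islands should remain
stopifnot(!any(kept$island %in% drop_islands))
```

Adding a stopifnot() like this to your own script makes the cleaning step self-checking when you rerun it from the top.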

Step 4: Join Data

  1. Join data_snail with data_veg using the common column island.
  2. Choose the right join: consider whether a left_join(), inner_join(), or full_join() is most appropriate (see Chapter 9).

left_join() keeps all snail rows. We want to keep every row in data_snail and attach the vegetation totals where available, so left_join() is the natural choice:

data_joined <- data_snail |>
  left_join(data_veg, by = "island")

If an island in data_snail has no match in data_veg, its arid_total and humid_total columns will be NA. That’s fine: it means the vegetation data wasn’t collected for that island.
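The NA behavior is easy to confirm on toy data. In the sketch below the island codes and all numbers are made up; anti_join() is used as one possible way to list the islands that found no match.

```r
library(dplyr)

# Made-up snail rows: island "ZZ" has no entry in the vegetation table
snail <- tibble(
  island  = c("XX", "XX", "YY", "ZZ"),
  habitat = c("arid", "humid", "arid", "humid"),
  spdiv   = c(2, 3, 1, 4)
)

veg <- tibble(
  island      = c("XX", "YY"),
  arid_total  = c(10, 5),
  humid_total = c(12, 8)
)

joined <- snail |>
  left_join(veg, by = "island")

# A left join onto a table with one row per island keeps the row count unchanged
nrow(joined) == nrow(snail)

# Islands in snail that have no match in veg
missing_veg <- snail |>
  anti_join(veg, by = "island") |>
  distinct(island)
missing_veg
```

If the row count grows after the join, the right-hand table has more than one row for some island, which is worth checking before you continue.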

Step 5: Data Transformation: Normalize Species Diversity

  1. Create a new variable normalized_spdiv that normalizes spdiv (species diversity) based on habitat type:
    • If habitat is "arid" → normalized_spdiv = spdiv / arid_total
    • If habitat is "humid" → normalized_spdiv = spdiv / humid_total
  2. Use mutate() together with case_when() or if_else() (see Chapter 7).

case_when() for conditional logic inside mutate(). This is cleaner than nested if_else() when you have more than two cases:

data_final <- data_joined |>
  mutate(
    normalized_spdiv = case_when(
      habitat == "arid"  ~ spdiv / arid_total,
      habitat == "humid" ~ spdiv / humid_total
    )
  )

case_when() evaluates conditions top to bottom and returns the right-hand value for the first match. Any row that doesn’t match either condition gets NA.
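The difference between case_when() and a two-branch if_else() shows up when a habitat value is neither "arid" nor "humid". The sketch below uses made-up numbers and a hypothetical "mixed" habitat (not a value from the assignment data) to make that visible.

```r
library(dplyr)

# Toy data; the "mixed" habitat is hypothetical and only there to show the fall-through
toy <- tibble(
  island      = c("XX", "XX", "YY"),
  habitat     = c("arid", "humid", "mixed"),
  spdiv       = c(2, 3, 1),
  arid_total  = c(10, 10, 5),
  humid_total = c(12, 12, 8)
)

toy <- toy |>
  mutate(
    # No condition matches "mixed", so that row becomes NA
    normalized_spdiv = case_when(
      habitat == "arid"  ~ spdiv / arid_total,
      habitat == "humid" ~ spdiv / humid_total
    ),
    # A two-branch if_else() silently treats every non-arid row as humid
    normalized_alt = if_else(
      habitat == "arid",
      spdiv / arid_total,
      spdiv / humid_total
    )
  )

toy |> select(habitat, normalized_spdiv, normalized_alt)
```

With only two habitat values both versions agree, but case_when() fails more visibly if an unexpected category ever appears.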

Step 6: Data Visualization: Replicate Figure 5

Create a scatter plot that shows the relationship between normalized species diversity (normalized_spdiv) and functional diversity (funcdiv). Your plot must include:

  1. geom_text() instead of geom_point(): label each point with the island code.
  2. Color by habitat type: distinguish arid vs. humid points.
  3. A descriptive title that communicates the main message (use labs(title = ...)).
  4. Readable text sizes: axis labels, titles, and point labels should be big enough to read comfortably. The defaults are almost always too small (Chapter 6).
  5. A non-default theme: pick one from jtools, hrbrthemes, ggthemr, etc. (Chapter 6).
  6. Export with ggsave(): specify explicit width and height (Chapter 6).

Step 1: Labeled scatter plot. Use geom_text() to display island codes instead of dots:

ggplot(data_final, aes(x = normalized_spdiv, y = funcdiv)) +
  geom_text(aes(label = island, color = habitat), size = 4)
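One detail that affects readability: the size argument of geom_text() is measured in millimeters, while theme text sizes are in points. The sketch below, on made-up data, uses ggplot2's .pt constant to request a label size in points; the 14 pt and base_size = 16 values are illustrative choices, and theme_minimal() is only a stand-in for whichever theme you pick.

```r
library(ggplot2)

# Made-up values purely for illustration
toy <- data.frame(
  island           = c("XX", "YY", "ZZ"),
  habitat          = c("arid", "humid", "arid"),
  normalized_spdiv = c(0.20, 0.25, 0.10),
  funcdiv          = c(1.1, 1.8, 0.7)
)

label_pt <- 14  # desired label size in points

p <- ggplot(toy, aes(x = normalized_spdiv, y = funcdiv)) +
  # Dividing by .pt converts points to the millimeters geom_text() expects
  geom_text(aes(label = island, color = habitat), size = label_pt / .pt) +
  # base_size scales axis text, titles, and legend text together
  theme_minimal(base_size = 16)

p
```

Setting the label size this way keeps the point labels in the same units as the rest of the text in the figure.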

Step 2: Custom color palette. Use scale_color_manual() with colors you like:

scale_color_manual(
  values = c("arid" = "#E69F00", "humid" = "#56B4E9")
)

Step 3: Informative labels. A good title tells the reader the takeaway:

labs(
  title = "Functional diversity tracks normalized species diversity",
  x = "Normalized species diversity",
  y = "Functional diversity",
  color = "Habitat",
  caption = "Data: Kraemer et al. (2022), Nature Communications"
)

Step 4: Export. Save with explicit dimensions:

ggsave("figure5_replication.pdf", width = 8, height = 6)
ggsave("figure5_replication.png", width = 8, height = 6, dpi = 300)
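The four pieces above are meant to be chained with + into a single plot. The sketch below shows one way to assemble them on made-up data and pass the plot object to ggsave() explicitly rather than relying on the last plot drawn. The data values, the placeholder title, the output file name, and theme_minimal() are all stand-ins, not the expected answer.

```r
library(ggplot2)

# Made-up values; replace with data_final in your own script
toy <- data.frame(
  island           = c("XX", "YY", "ZZ"),
  habitat          = c("arid", "humid", "arid"),
  normalized_spdiv = c(0.20, 0.25, 0.10),
  funcdiv          = c(1.1, 1.8, 0.7)
)

p <- ggplot(toy, aes(x = normalized_spdiv, y = funcdiv)) +
  geom_text(aes(label = island, color = habitat), size = 5) +
  scale_color_manual(values = c("arid" = "#E69F00", "humid" = "#56B4E9")) +
  labs(
    title = "Placeholder title (toy data)",
    x     = "Normalized species diversity",
    y     = "Functional diversity",
    color = "Habitat"
  ) +
  theme_minimal(base_size = 16)  # stand-in for the theme you choose

# Save to a temporary folder here; in your project, use a path inside the project
out_file <- file.path(tempdir(), "figure5_sketch.png")
ggsave(out_file, plot = p, width = 8, height = 6, dpi = 300)
```

Passing plot = p makes the export independent of whatever was last displayed in the Plots pane, which helps the script run cleanly from top to bottom.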

As one example, below is a figure I created. This is just a reference to help you navigate. I’m looking forward to seeing your more creative versions!

Step 7: Create One More Publication-Quality Figure

  1. Explore another interesting relationship in the dataset, for example how species richness (s) varies across islands, the distribution of functional diversity by habitat, or any other pattern you find compelling.
  2. Apply what you’ve learned: use appropriate plot types (amounts, distributions, trends, associations; see Chapters 10–13) and make the figure informative and aesthetically pleasing.
  3. A non-default theme (Chapter 6).
  4. Export with ggsave(): specify explicit width and height.
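As a starting point only, here is a minimal sketch of one of the suggested ideas, the distribution of functional diversity by habitat, drawn as a boxplot with the raw points on top. The values are made up, and the plot type, colors, theme, and file name are illustrative choices that you are encouraged to change.

```r
library(ggplot2)

# Made-up functional diversity values for two habitats
toy <- data.frame(
  habitat = rep(c("arid", "humid"), each = 4),
  funcdiv = c(0.7, 1.1, 0.9, 1.3, 1.6, 1.8, 1.4, 2.0)
)

p2 <- ggplot(toy, aes(x = habitat, y = funcdiv, fill = habitat)) +
  # Hide boxplot outliers so they are not drawn twice alongside the raw points
  geom_boxplot(alpha = 0.6, outlier.shape = NA) +
  geom_jitter(width = 0.1, size = 3) +
  scale_fill_manual(values = c("arid" = "#E69F00", "humid" = "#56B4E9")) +
  labs(x = "Habitat", y = "Functional diversity") +
  theme_minimal(base_size = 16) +
  theme(legend.position = "none")  # the x axis already names the habitats

out_file <- file.path(tempdir(), "figure_extra_sketch.png")
ggsave(out_file, plot = p2, width = 6, height = 5, dpi = 300)
```

Showing the individual points alongside the boxes is useful when each group has only a handful of observations.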