Writing Functions

EE BIOL C177/C234

Chuliang Song

Today’s Menu 🎯

  1. Why Write Functions: The rule of three
  2. Anatomy of a Function: Arguments and returns
  3. Vectorized Functions: Writing efficient code
  4. Tidy Evaluation: Embracing curly-curly {{ }}
  5. Best Practices: Validation & documentation

Why Write Functions?

The Rule of Three 💡

When should you write a function? Look for these three signs:

  1. Copy-Paste Trap: You’ve copied and pasted the same code more than twice.
  2. Muted Constants: You find yourself changing the same numbers in multiple places.
  3. Cognitive Load: You want to make your code more readable, modular, and maintainable.

Tip

Writing functions helps you avoid manual errors (e.g., forgetting to update one value in a copy-pasted block).
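As a minimal sketch of the copy-paste trap, suppose we rescale several columns to the 0-1 range. The data frame `df` and the helper name `rescale01()` are hypothetical, chosen only for illustration. Writing the formula once removes the risk of updating one column name but not another:

```r
# Hypothetical measurements, for illustration only
df <- data.frame(a = c(1, 5, 9), b = c(2, 4, 10))

# One definition of the rescaling logic instead of a copy per column
rescale01 <- function(x) {
    rng <- range(x, na.rm = TRUE)
    (x - rng[1]) / (rng[2] - rng[1])
}

df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
```

If the formula ever needs to change, there is only one place to edit.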

Anatomy of a Function

Basic Function Structure 🔎

Here is a simple function to calculate Body Mass Index (BMI):

calculate_bmi <- function(weight, height) {
    bmi <- weight / (height^2)
    return(bmi)
}

calculate_bmi(weight = 70, height = 1.75)
[1] 22.85714

Styling R Functions 🎨

In R, a function automatically returns the value of the last expression it evaluates.

Therefore, you can omit return():

calculate_bmi <- function(weight, height) {
    weight / (height^2)
}

Note

Pro Tip: The Tidyverse style guide recommends only using return() for early returns, and omitting it otherwise.
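A small sketch of what an early return can look like. The helper `mean_or_na()` is hypothetical and not part of the lecture code: it uses return() only for the special case and lets the last expression supply the normal result.

```r
# Hypothetical helper: mean of a vector, or NA for empty input
mean_or_na <- function(x) {
    if (length(x) == 0) {
        return(NA_real_) # early return: nothing to average
    }
    mean(x, na.rm = TRUE) # normal case: last expression is returned
}

mean_or_na(numeric(0))
mean_or_na(c(1, 2, 3))
```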

Vectorized Functions

Automatic Vectorization ⚡

If you write your function using vectorized operations (like /, ^, +, *), it will automatically be vectorized!

Our calculate_bmi() is fully vectorized and works with tibbles/vectors out of the box:

library(tidyverse)

tibble(
    weight = c(70, 80, 90),
    height = c(1.75, 1.80, 1.85)
) |>
    mutate(bmi = calculate_bmi(weight, height))
# A tibble: 3 × 3
  weight height   bmi
   <dbl>  <dbl> <dbl>
1     70   1.75  22.9
2     80   1.8   24.7
3     90   1.85  26.3
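Not every construct is vectorized, though. A plain if () tests a single condition, and recent versions of R raise an error when it is given a longer logical vector. The sketch below uses base R's ifelse() instead. The helper `classify_bmi()` and the cutoff of 25 are assumptions made only for this example:

```r
# Hypothetical helper: label each BMI value against an illustrative cutoff
classify_bmi <- function(bmi, cutoff = 25) {
    # ifelse() evaluates the condition element by element
    ifelse(bmi >= cutoff, "high", "not high")
}

classify_bmi(c(22.9, 24.7, 26.3))
```

Because ifelse() works element-wise, this function can be used inside mutate() just like calculate_bmi().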

Non-Vectorized Functions 📦

Some functions work on entire datasets (e.g., fitting a model) rather than element-by-element. They are not automatically vectorized:

calculate_r2 <- function(data) {
    model <- lm(bill_length_mm ~ bill_depth_mm, data = data)
    return(summary(model)$r.squared)
}

If you call this on individual columns inside mutate(), it fails: the function expects a whole data frame, not a vector. One way to hand it one data frame per group is to nest the data and use rowwise().

Using Non-Vectorized Functions 🛠️

We chain nest() |> rowwise() |> mutate() |> ungroup() to apply a non-vectorized function to each nested sub-tibble:

library(palmerpenguins) # assumed source of the penguins data used here

penguins |> 
    group_by(species) |> 
    nest() |> 
    rowwise() |>  
    mutate(r2 = calculate_r2(data)) |>
    ungroup()
# A tibble: 3 × 3
  species   data                  r2
  <fct>     <list>             <dbl>
1 Adelie    <tibble [152 × 7]> 0.153
2 Gentoo    <tibble [124 × 7]> 0.414
3 Chinstrap <tibble [68 × 7]>  0.427
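rowwise() is one way to do this. As an alternative sketch, not the approach used in these slides, purrr::map_dbl() can apply the function to each element of the list column. To keep the example runnable without the penguins data, it uses the built-in iris dataset and a hypothetical `calculate_r2_iris()` with the same shape as calculate_r2():

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical counterpart of calculate_r2() for the built-in iris data
calculate_r2_iris <- function(data) {
    model <- lm(Sepal.Length ~ Sepal.Width, data = data)
    summary(model)$r.squared
}

r2_by_species <- iris |>
    nest(data = -Species) |> # one row per species, data in a list column
    mutate(r2 = map_dbl(data, calculate_r2_iris)) # one number per sub-tibble

r2_by_species
```

map_dbl() also guarantees that each call returns a single number, which can catch mistakes early.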

Tidy Evaluation

The Column Name Problem ⚠️

In the Tidyverse, we refer to columns directly without quotes (species, not "species"). But what happens if we write a function that takes a column name as an argument?

grouped_mean <- function(data, variable) {
    data |>
        group_by(variable) |>
        summarise(mean = mean(bill_length_mm, na.rm = TRUE))
}

grouped_mean(penguins, species)
Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `variable` is not found.

Why Did It Fail? 🧐

  • R tried to look for a column literally named variable inside the dataset.
  • Since no such column exists, it threw an error!
  • We need a way to tell R: “Hey, use the column represented by the argument variable, not the word ‘variable’!”

The Fix: Embrace Curly-Curly {{ }} 🪄

Wrapping our variable argument in {{ }} tells tidyverse functions such as group_by() to use the column the caller passed, rather than looking for a column named variable:

grouped_mean_corrected <- function(data, variable) {
    data |>
        group_by({{ variable }}) |> 
        summarise(mean = mean(bill_length_mm, na.rm = TRUE))
}

grouped_mean_corrected(penguins, species)
# A tibble: 3 × 2
  species    mean
  <fct>     <dbl>
1 Adelie     38.8
2 Chinstrap  48.8
3 Gentoo     47.5
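The same idea extends to more than one column argument. The sketch below embraces both the grouping column and the summarised column. The function name `grouped_mean_of()` and its argument names are hypothetical, and the built-in mtcars data is used so the example runs on its own:

```r
library(dplyr)

# Hypothetical generalisation: both columns are supplied by the caller
grouped_mean_of <- function(data, group_var, value_var) {
    data |>
        group_by({{ group_var }}) |>
        summarise(mean = mean({{ value_var }}, na.rm = TRUE))
}

# Mean miles per gallon for each cylinder count
grouped_mean_of(mtcars, cyl, mpg)
```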

{ } vs. {{ }}: When to use which? 🤔

It is easy to get confused! Here is the difference:

Single { } (R Code Block / Function Body)

  • Groups multiple expressions together.
  • Defines what code to run (e.g., function bodies, if statements).
my_fun <- function(x) {
    # This is a code block!
    y <- x + 1
    return(y)
}

Double {{ }} (Curly-Curly / Tidy Eval)

  • Used inside tidyverse functions.
  • Tells R: “Evaluate the column name I passed here.”
my_summary <- function(df, col) {
    df |> summarise(mean_val = mean({{ col }})) # Tidy eval!
}

Tidy Eval: Behind the Scenes 🛠️

  • Tidy evaluation (tidy-eval) is incredibly powerful but was notoriously difficult to design.
  • The Tidyverse core team spent years refining this syntax, making it as simple as {{ }} today.
  • It ensures that R functions can match the natural, declarative feel of standard dplyr pipelines.

Best Practices

Names Matter 🏷️

  • Use Verbs: Functions perform actions. Choose descriptive names like calculate_bmi(), plot_trends(), or convert_temp() (not bmi(), trends()).
  • Naming Style: Be consistent! Stick to one style like snake_case (my_function) or camelCase (myFunction).
  • Clarity vs Brevity: Prefer a descriptive name over an ambiguous shortcut.

Flexible Arguments & Defaults 🎛️

Place the most important arguments first, and provide default values when they make sense:

calculate_bmi <- function(weight, height, units = "metric") {
    if (units == "imperial") {
        # Convert pounds to kg and inches to meters
        weight <- weight * 0.453592
        height <- height * 0.0254
    }
    return(weight / (height^2))
}

calculate_bmi(70, 1.75) # metric (default)
calculate_bmi(154, 69, units = "imperial") # imperial
[1] 22.85714
[1] 22.74157

Input Validation: stop() 🛑

Help the user (and yourself!) by checking inputs early and throwing helpful error messages with stop():

calculate_bmi <- function(weight, height, units = "metric") {
    # Check inputs; any() keeps the check valid when vectors are passed
    if (any(weight <= 0, na.rm = TRUE) || any(height <= 0, na.rm = TRUE)) {
        stop("Weight and height must be positive numbers")
    }

    if (units == "imperial") {
        weight <- weight * 0.453592
        height <- height * 0.0254
    } else if (units != "metric") {
        stop('Units must be either "metric" or "imperial"')
    }
    return(weight / (height^2))
}
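A variant sketch of the same validation using base R helpers: match.arg() restricts units to the listed choices and takes the first as the default, and stopifnot() checks the input types. The name `calculate_bmi_checked()` is hypothetical, chosen to avoid overwriting the function above.

```r
# Hypothetical variant of calculate_bmi() with stricter checks
calculate_bmi_checked <- function(weight, height,
                                  units = c("metric", "imperial")) {
    units <- match.arg(units) # errors if units is not one of the choices
    stopifnot(is.numeric(weight), is.numeric(height))

    if (any(weight <= 0, na.rm = TRUE) || any(height <= 0, na.rm = TRUE)) {
        stop("Weight and height must be positive numbers")
    }

    if (units == "imperial") {
        # Convert pounds to kg and inches to meters
        weight <- weight * 0.453592
        height <- height * 0.0254
    }
    weight / (height^2)
}

calculate_bmi_checked(70, 1.75)

# Capture the error object instead of halting the script
bad_units <- tryCatch(
    calculate_bmi_checked(70, 1.75, units = "stone"),
    error = function(e) e
)
```

Listing the allowed values in the signature also documents them for anyone reading the function definition.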

Documenting Functions: Roxygen2 📝

Use roxygen2 comments (#') directly above your function definition:

#' Calculate Body Mass Index (BMI)
#'
#' @param weight Weight in kilograms
#' @param height Height in meters
#' @return BMI value (kg/m^2)
#' @examples
#' calculate_bmi(70, 1.75)
calculate_bmi <- function(weight, height) {
    weight / (height^2)
}

Summary Checklist 📋

  • Write functions when copying code more than twice.
  • Vectorize your functions by default.
  • Use rowwise() for non-vectorized operations on nested tibbles.
  • Wrap column arguments in {{ }} for tidy evaluation.
  • Validate inputs using stop(), document with roxygen2, and pick clear, verb-based names!