18  Writing Functions

Class Objectives
  1. Understand when and why to write functions
  2. Learn the basic structure of R functions
  3. Master writing vectorized functions
  4. Learn best practices for function documentation

18.1 Why Write Functions?

Remember our discussion about vectorization and nested data? We’ve been using functions all along - mean() and sum(). But sometimes, you need a function that doesn’t exist yet. That’s when you write your own!

Here are three signs you need a function:

  1. You’ve copied and pasted code more than twice
  2. You find yourself changing the same numbers in multiple places
  3. You want to make your code more readable and maintainable

Writing functions also helps you avoid mistakes. For example, if you need to change a number in multiple places, you may forget to change one of them (speaking from experience :-(

18.2 Basic Function Structure

Let’s start with a simple function:

calculate_bmi <- function(weight, height) {
    bmi <- weight / (height^2)
    return(bmi)
}

calculate_bmi(weight = 70, height = 1.75)
[1] 22.85714

Let’s break down the components:

  1. calculate_bmi: The function name (make it descriptive!)
  2. function(weight, height): The arguments your function needs
  3. The code block between { and }: What your function does
  4. return(bmi): What your function gives back

Importantly, the last line of your function is the value that is returned. So you can also write:

calculate_bmi <- function(weight, height) {
    weight / (height^2)
}

It is recommened to not use return() in your function.

Exercise: Your First Function

Write a function to convert Fahrenheit to Celsius.

18.3 Writing Vectorized Functions

Remember how we talked about the importance of vectorization? When writing your own functions, you want them to be vectorized too! The good news is that R makes this easy - if you use vectorized operations inside your function, your function will be vectorized automatically. For example, our function calculate_bmi() is vectorized automatically:

library(tidyverse)
tibble(
    weight = c(70, 80, 90),
    height = c(1.75, 1.80, 1.85)
) |>
    mutate(bmi = calculate_bmi(weight, height))

Sometimes it might be not obvious that a function is vectorized. The most simple solution is to test your function to a vector or a tibble. If it works, it’s (almost certainly) vectorized!

Exercise: Vectorized Function

Write a function to calculate bill area (length × depth) that works with the penguins dataset.

18.4 Working with non-vectorized functions

Some functions work with entire datasets rather than element by element. For these cases, we need to use rowwise() when working with nested data:

# Function that works with a dataset
calculate_r2 <- function(data) {
    model <- lm(bill_length_mm ~ bill_depth_mm, data = data)
    return(summary(model)$r.squared)
}

library(palmerpenguins)
# Need rowwise() because calculate_r2 works on entire datasets
penguins |> 
    group_by(species) |> 
    nest() |> 
    rowwise() |>  # Important! Because calculate_r2 is not vectorized
    mutate(
        r2 = calculate_r2(data)
    ) |>
    ungroup()

18.5 Programming with dplyr: Tidy Evaluation

We now touch a (used to be scary) topic known as tidy evaluation. In tidyverse, a gereat feature is that we do not need to use "variable" to refer to a variable. Instead, we can use the variable itself without quoting. However, this feature can be a bit tricky when we want to write functions that uses variables as arguments.

I know it sounds abstract. Let’s imagine you’re analyzing the penguins dataset and want to calculate the mean bill_length_mm grouped by different groups. Your code looks like this:

# Group by species
penguins |>
    group_by(species) |>
    summarise(mean = mean(bill_length_mm, na.rm = TRUE))
# Group by sex
penguins |>
    group_by(sex) |>
    summarise(mean = mean(bill_length_mm, na.rm = TRUE))
# ... and so on for island, year, etc. 😩

It gets a bit tedious to write the same code over and over again. Let’s write a function to automate this.

18.5.1 The Naive Attempt (Spoiler: It Breaks)

With what we know so far, you might try this:

grouped_mean <- function(data, variable) {
    data |>
        group_by(variable) |>
        summarise(mean = mean(bill_length_mm, na.rm = TRUE))
}

However, this function does not work:

grouped_mean(penguins, "species")
Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `variable` is not found.

R thinks you’re literally grouping by a column named “variable” in the dataset (which doesn’t exist).

18.5.2 The Fix: Embrace the Curly-Curly {{ }}

Luckily, it is super easy to fix this. To tell R, “Hey, use the variable I’m giving you, not the word ‘variable’!” wrap your argument in { }:

grouped_mean_corrected <- function(data, variable) {
    data |>
        group_by({{ variable }}) |>
        summarise(mean = mean(bill_length_mm, na.rm = TRUE))
}

grouped_mean_corrected(penguins, species)
1
✅ Magic happens here!

In other words, {{ }} tells tidyverse: “Treat variable as the column I pass into the function, not as a literal name.”

It does not look complicated, right? Actually, the tidyverse team spent ages perfecting this feature. Check here for its history.

From here

18.6 Best Practices

When writing functions, follow these guidelines:

  • Use verbs for function names: calculate_bmi(), not bmi()
  • Be consistent with your naming style
  • Make names descriptive but not too long
  • Put the most important arguments first
  • Use clear argument names
  • Provide default values when it makes sense
# Good function design
calculate_bmi <- function(weight, height, units = "metric") {
    if (units == "imperial") {
        # Convert pounds to kg and inches to meters
        weight <- weight * 0.453592
        height <- height * 0.0254
    }
    return(weight / (height^2))
}

# Now works with both metric and imperial units
calculate_bmi(70, 1.75) # metric (default)
[1] 22.85714
calculate_bmi(154, 69, units = "imperial") # imperial
[1] 22.74157

It is a good idea to check the input validity of your function. This can be done by using stop() to stop the function and return an error message.

calculate_bmi <- function(weight, height, units = "metric") {
    # Input validation
    if (weight <= 0 || height <= 0) {
        stop("Weight and height must be positive numbers")
    }

    if (units == "imperial") {
        weight <- weight * 0.453592
        height <- height * 0.0254
    } else if (units != "metric") {
        stop('Units must be either "metric" or "imperial"')
    }

    return(weight / (height^2))
}

Good functions need good documentation, especially if you want others (or even you in the future) what the function does.

In R, we use roxygen2 style comments to document our functions:

#' Calculate Body Mass Index (BMI)
#'
#' @param weight Weight in kilograms
#' @param height Height in meters
#' @return BMI value (kg/m^2)
#' @examples
#' calculate_bmi(70, 1.75)
calculate_bmi <- function(weight, height) {
    bmi <- weight / (height^2)
    return(bmi)
}

The documentation includes:

  1. A brief description
  2. Parameter descriptions (@param)
  3. What the function returns (@return)
  4. Example usage (@examples)

18.7 Summary

Writing functions is a crucial skill that builds on our understanding of vectorization and nested data:

  1. Write functions when you find yourself repeating code
  2. Make your functions vectorized when possible
  3. Use clear names and good documentation
  4. Include input validation and helpful error messages
  5. Test your functions with simple cases first

Remember: A well-written function is like a good tool - it should do one thing, do it well, and be easy to use! 🛠️