Basics of Vectorization

EE BIOL C177/C234

Chuliang Song

Today’s Menu 🎯

  1. The Loop-Free Mystery: Why have we avoided loops?
  2. Vectorization 101: How R handles operations element-by-element
  3. The Silent Trap: Scalar recycling and common pitfalls
  4. The Rescue: Writing row-wise operations using rowwise()

Wait… Where are the Loops? 🔍

  • If you’ve taken standard programming courses, you usually learn for and while loops in Week 1 or 2.
  • In this course, we’ve had 18 lectures without writing a single for loop!
  • How did we:
    • Filter and mutate thousands of data rows?
    • Transform and tidy complex ecological tables?
    • Create publication-quality plots?
  • Answer: R and the Tidyverse rely fundamentally on vectorization!

Vectorization 101

A Simple Example: Adding Vectors

Suppose we want to add two numeric vectors element-by-element:

Vectorized (Logic)

x <- c(1, 2, 3)
y <- c(4, 5, 6)
x + y
[1] 5 7 9

Loop (Logistics)

z <- c()
for (i in 1:3) {
    z[i] <- x[i] + y[i]
}
z
[1] 5 7 9

Loop: Under the Hood ⚙️

What is R doing in our loop?

z <- c()
for (i in 1:3) {
    z[i] <- x[i] + y[i]
}
  1. Initialize: Creates an empty vector z.
  2. Loop Overhead: Sets up loop counter i = 1.
  3. Element Lookup: Retrieves x[1], retrieves y[1], adds them.
  4. Resizing (Slow!): Allocates new memory to grow z to size 1, stores result.
  5. Repeat: Increments i = 2, repeats lookup, grows z again, etc.
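
When a loop really is needed, the resizing in step 4 can be avoided. A minimal sketch, assuming the same x and y as above: allocate the full result once with numeric(), then fill it in place.

```r
x <- c(1, 2, 3)
y <- c(4, 5, 6)

# Preallocate the whole result once instead of growing it inside the loop
z <- numeric(length(x))

# seq_along(x) yields an empty sequence when x is empty, unlike 1:length(x)
for (i in seq_along(x)) {
  z[i] <- x[i] + y[i]
}
z
# [1] 5 7 9
```

This still pays the per-iteration interpreter cost; it only removes the repeated reallocation.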

Vectorized: Under the Hood ⚡

What is R doing with the vectorized approach?

x <- c(1, 2, 3)
y <- c(4, 5, 6)
x + y
  • R allocates a result vector of size 3 all at once (no resizing).
  • It hands the addition task down to a pre-compiled, highly-optimized loop in C/C++.
  • The iteration runs at native hardware speed.
  • The loop counter, indexing, and per-element interpreter overhead are handled inside compiled code instead of in your script!

Logic vs. Logistics 🧩

  • Logistics: The mechanics of iteration
    • How do we get to each element?
    • Managing loop counters (i), allocating empty vectors, and index offsets.
  • Logic: The actual scientific computation
    • What do we want to do to the data? (e.g., x + y, log(x))
  • Vectorization hides the logistics so you can focus entirely on the logic!

Focus on the Concept 🧠

“I want to specify at a conceptual level how the data should be analyzed… I don’t want to have to think about the logistics of how the computation is performed.”
— Claus Wilke

  • Loops force us to think like a computer (managing indices and memory allocations).
  • Vectorization allows us to think like scientists (specifying high-level mathematical formulas and data transformations).

What is Vectorization, Really?

Conceptual Definition

Say we have a function f() and we pass a vector x to it. If f() is vectorized, it automatically applies to each element of x and returns a vector of results:
c(f(x[1]), f(x[2]), f(x[3]), ...)
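
The definition can be checked directly. A small sketch using sqrt() as the vectorized f(); the helper classify_depth() is a hypothetical, deliberately non-vectorized function added only for contrast.

```r
x <- c(1, 4, 9)

# sqrt() is vectorized: one call handles every element
vectorized <- sqrt(x)

# The same result, spelled out element by element
by_hand <- c(sqrt(x[1]), sqrt(x[2]), sqrt(x[3]))
identical(vectorized, by_hand)
# [1] TRUE

# Hypothetical non-vectorized helper: if () only accepts a single TRUE/FALSE
classify_depth <- function(d) if (d > 5) "deep" else "shallow"

# vapply() applies it to one element at a time and checks each result's type
labels_each <- vapply(x, classify_depth, character(1))

# ifelse() is the vectorized counterpart and needs no explicit iteration
labels_vec <- ifelse(x > 5, "deep", "shallow")
identical(labels_each, labels_vec)
# [1] TRUE
```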

Pitfalls & Recycling

Recycling: A Hidden Trap 🪤

What happens when we add a vector and a single number?

x <- c(1, 2, 3)
x + 2
[1] 3 4 5
  • R silently recycles (repeats) the scalar 2 to match the length of x.
  • Behind the scenes: c(1, 2, 3) + c(2, 2, 2).
  • While convenient, this silent duplication can lead to dangerous bugs when vectors have mismatched, unexpected lengths!
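
Recycling is not limited to scalars. A short sketch with made-up vectors a and b shows the two cases worth knowing: R says nothing when the longer length is a multiple of the shorter one, and only warns (but still returns a result) when it is not.

```r
a <- c(1, 2, 3, 4)
b <- c(10, 20)

# Length 4 is a multiple of length 2, so b is silently reused as 10, 20, 10, 20
silent <- a + b
silent
# [1] 11 22 13 24

# Length 3 is not a multiple of length 2: R warns; here we capture the message
partial <- tryCatch(
  c(1, 2, 3) + b,
  warning = function(w) conditionMessage(w)
)
partial
```

If two vectors are meant to line up element by element, a check such as stopifnot(length(a) == length(b)) before the arithmetic turns the silent case into an error.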

💡 Pro Tip: Modern Alternatives

Some newer languages, such as Julia, do not recycle implicitly: element-wise operations between an array and a scalar must be requested explicitly.

# Julia code
x = [1, 2, 3]

x + 2   # ❌ ERROR: MethodError (no method matching +(::Vector{Int64}, ::Int64))
x .+ 2  # ✅ Works! The dot (.) explicitly requests element-wise vectorization.

R is extremely permissive, which means the programmer must be extra careful!

When Vectorization Fails: The Tibble Bug ⚠️

Suppose we have a tibble with two depth measurements per site (columns x and y), and we want the average depth in each row (the mean of x and y):

library(tidyverse)
tibble(
  x = c(1, 2, 3),
  y = c(4, 5, 6)
) |> 
  mutate(row_mean = mean(c(x, y)))
# A tibble: 3 × 3
      x     y row_mean
  <dbl> <dbl>    <dbl>
1     1     4      3.5
2     2     5      3.5
3     3     6      3.5

Wait… why is the row mean 3.5 for every single row?

Dissecting the Bug 🔍

Let’s look at how R evaluates mutate(row_mean = mean(c(x, y))):

  1. Concatenate: R evaluates c(x, y) first by combining the entire x column and the entire y column: c(1, 2, 3, 4, 5, 6).
  2. Summarize: mean() is a summary function, not an element-wise one! It takes that combined 6-element vector and collapses it to a single value: 3.5.
  3. Recycle: Because each new column needs one value per row, mutate() silently recycles the length-1 result 3.5 to fill all 3 rows.
  4. The code runs with zero errors or warnings, but the science is completely wrong!
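
The steps above can be replayed outside mutate() in plain R. This is a sketch: the variable names combined and overall are illustrative, and rep() only imitates the recycling that mutate() performs internally.

```r
x <- c(1, 2, 3)
y <- c(4, 5, 6)

# Step 1: c() glues the two columns end to end; the row structure is lost
combined <- c(x, y)
combined
# [1] 1 2 3 4 5 6

# Step 2: mean() collapses all six values into a single number
overall <- mean(combined)
overall
# [1] 3.5
length(overall)
# [1] 1

# Step 3: that single number is repeated once per row
rep(overall, length(x))
# [1] 3.5 3.5 3.5
```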

The Rescue: rowwise()

Introducing rowwise() 🎯

rowwise() is a special grouping function in dplyr that forces operations inside mutate() (and other verbs) to be applied individually, row by row:

tibble(
  x = c(1, 2, 3),
  y = c(4, 5, 6)
) |> 
  rowwise() |> 
  mutate(row_mean = mean(c(x, y))) |> 
  ungroup()
# A tibble: 3 × 3
      x     y row_mean
  <dbl> <dbl>    <dbl>
1     1     4      2.5
2     2     5      3.5
3     3     6      4.5
  • Row 1: c(1, 4) → mean is 2.5
  • Row 2: c(2, 5) → mean is 3.5
  • Row 3: c(3, 6) → mean is 4.5
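
For arithmetic this simple, a vectorized expression gives the same per-row answer without any row-wise grouping. A sketch assuming dplyr is installed; the column names mean_arith and mean_rowmeans are illustrative.

```r
library(dplyr)

df <- tibble(x = c(1, 2, 3), y = c(4, 5, 6))

out <- df |>
  mutate(
    # Element-wise arithmetic: each row uses its own x and y
    mean_arith = (x + y) / 2,
    # Base R rowMeans() on a two-column matrix built from x and y
    mean_rowmeans = rowMeans(cbind(x, y))
  )

out$mean_arith
# [1] 2.5 3.5 4.5
```

rowwise() remains the general tool when the per-row computation has no vectorized equivalent.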

⚠️ The Golden Rule of Grouping

Always ungroup()!

rowwise() is a special type of grouping. If you forget to ungroup(), every later operation on that tibble will continue to run row by row!

This can make your code much slower and cause surprising results down the line. Always pair them: rowwise() |> ... |> ungroup().
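
A small sketch of what a lingering row-wise grouping does, assuming dplyr is installed; the names still_rowwise, per_row, cleaned, and whole are illustrative.

```r
library(dplyr)

df <- tibble(x = c(1, 2, 3), y = c(4, 5, 6))

# No ungroup() at the end: the result keeps its row-wise grouping
still_rowwise <- df |>
  rowwise() |>
  mutate(row_mean = mean(c(x, y)))

inherits(still_rowwise, "rowwise_df")
# [1] TRUE

# A later summary function therefore sees one row at a time
per_row <- still_rowwise |> mutate(total = sum(x))
per_row$total
# [1] 1 2 3

# After ungroup(), sum(x) sees the whole column again
cleaned <- ungroup(still_rowwise)
whole <- cleaned |> mutate(total = sum(x))
whole$total
# [1] 6 6 6
```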

Why Avoid Loops in R?

1. Readability: Focus on the “What”

By using vectorized functions and dplyr verbs, your code reads like a list of conceptual steps:

Vectorized (Declarative)

penguins |>
  filter(species == "Adelie") |>
  mutate(ratio = bill_length_mm / bill_depth_mm)

Loop (Imperative)

# Allocate, filter, check NA, compute...
res <- data.frame()
for (i in 1:nrow(penguins)) {
  if (penguins$species[i] == "Adelie") {
    # ... tedious indexing ...
  }
}
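
To make the contrast concrete, here is one way the loop could be completed. This is a sketch, not the course's reference solution: toy is a made-up three-row stand-in for penguins, and the loop body is only one of several possible ways to do the indexing.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins
toy <- tibble(
  species = c("Adelie", "Gentoo", "Adelie"),
  bill_length_mm = c(39.1, 46.1, NA),
  bill_depth_mm = c(18.7, 13.2, 17.4)
)

# Imperative: preallocate, visit each row, track which rows to keep
keep <- logical(nrow(toy))
ratio <- rep(NA_real_, nrow(toy))
for (i in seq_len(nrow(toy))) {
  # isTRUE() guards against an NA species, which would make if () fail
  if (isTRUE(toy$species[i] == "Adelie")) {
    keep[i] <- TRUE
    ratio[i] <- toy$bill_length_mm[i] / toy$bill_depth_mm[i]
  }
}
loop_result <- toy[keep, ]
loop_result$ratio <- ratio[keep]

# Declarative: the same two steps, stated as what we want
pipe_result <- toy |>
  filter(species == "Adelie") |>
  mutate(ratio = bill_length_mm / bill_depth_mm)

identical(loop_result$ratio, pipe_result$ratio)
# [1] TRUE
```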

Tip

Vectorized code describes what you want to achieve, rather than details of how to traverse the computer’s memory.

2. Performance: R Loops are Slow 🐢

  • R is a high-level, interpreted language.
  • Every time R runs a line inside a standard for loop, it has to evaluate types, check variables, and manage memory under the hood.
  • Vectorized operations delegate the loop directly to highly-optimized compiled C/C++ or Fortran code.
  • If you find your R script taking minutes or hours to run, look for a standard for loop to vectorize!
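
The difference can be measured on your own machine. A rough sketch using base R's system.time(); the vector length n is an arbitrary choice, and the timings depend on hardware and R version, so none are shown here.

```r
n <- 1e5
x <- runif(n)
y <- runif(n)

# Loop that grows its result one element at a time
loop_time <- system.time({
  z_loop <- c()
  for (i in seq_len(n)) {
    z_loop[i] <- x[i] + y[i]
  }
})

# Vectorized addition: the iteration happens in compiled code
vec_time <- system.time(z_vec <- x + y)

# Both approaches compute exactly the same values
identical(z_loop, z_vec)
# [1] TRUE

# Compare elapsed seconds (values vary from run to run)
loop_time["elapsed"]
vec_time["elapsed"]
```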

Messy Control Flow ⚠️

Managing manual index offsets and tracking variables leads to complicated control flow:

“Complicated control flows confuse programmers. Messy code often hides bugs.”
— Bjarne Stroustrup (Creator of C++)

Core Rule

Let the language handle the logistics of traversal, so you can focus on writing correct scientific logic.

Summary Checklist 📋

  • Vectorization operates on entire vectors or groups of data at once, bypassing manual loops.
  • Tidyverse verbs (like mutate(), filter()) are designed to work on whole columns with vectorized functions.
  • Recycling repeats a shorter vector to match a longer one; it is convenient, but it can silently hide bugs.
  • Summary functions like mean() and sum() are not element-wise: they collapse their entire input into a single value.
  • Use rowwise() |> ... |> ungroup() to force row-by-row evaluation in tibbles!