16  Basics of Vectorization

You might have noticed we haven’t talked about loops so far. This is quite atypical for a programming course. In most courses, loops are introduced early and used throughout the course.

This choice is intentional. In data science (and especially in R with the tidyverse), we lean heavily on vectorization. Rather than painstakingly iterating over each element with loops, vectorized functions let you operate on whole groups of data at once. The result? Code that’s cleaner, faster, and less prone to bugs.

Before diving into examples, as usual, let’s load our libraries first:

library(tidyverse)
library(tidylog)

16.1 A Simple Example: Adding Vectors

Vectorization might sound like a fancy buzzword, but you’ve been using it all along! Let’s start with a classic: adding two vectors element-by-element.

When you add two vectors in R using the + operator, R automatically performs the addition element-by-element.

x <- c(1, 2, 3)
y <- c(4, 5, 6)
x + y
1
create a vector x with elements 1, 2, 3.
2
create a vector y with elements 4, 5, 6.
[1] 5 7 9

What happens under the hood? R takes the first element of x and adds it to the first element of y, the second with the second, and so on. VoilĂ !

Now, if you were to do the same thing using a loop, it might look like this:

z <- c()
for (i in 1:3) {
    z[i] <- x[i] + y[i]
}
z
1
create an empty vector z to store the result.
2
loop through each element.
3
in each iteration, add the corresponding elements of x and y and store the result in z[i].
[1] 5 7 9

While this loop does the job, it’s more code and more room for errors—especially as your tasks grow more complex.

Exercise

Multiply two vectors element-by-element. Let c <- c(2, 4, 6) and d <- c(3, 5, 7).

16.2 What Is Vectorization, Really?

At its heart, vectorization means that a function automatically applies to each element of a vector:

Definition of vectorized

Say we have a function f() and we pass a vector x to it. If f() is vectorized, it will apply f() to each element of x and return a vector of results: f(x[1]), f(x[2]), ....

Some common examples of vectorized functions in R include arithmetic operators like +, -, *, /. In contrast, sum() is not vectorized as it takes a vector as a single input (and returns a single value instead of a vector of results).

More importantly, most verbs we have learned–like mutate() and summarize()–in the tidyverse are vectorized!

16.3 Beware the Pitfalls

Sometimes, vectorization can lead to confusing or unexpected behavior—especially when functions behave differently depending on context.

Example: Automatic Vectorization

x <- c(1, 2, 3)
x + 2
[1] 3 4 5

What’s happening here? R “recycles” the scalar 2 across the vector x and adds it to each element. Despite it seems convenient, this is actually a design flaw in R as it can lead to the worst type of bugs: code runs successfully, but gives you a wrong answer.

In more modern languages like Julia, the above code will throw an error, and you need to do

# Julia language 
x = [1, 2, 3]
x .+ 2
1
create a vector x with elements 1, 2, 3.
2
add 2 to each element of x.

where . is a placeholder to indicate that + is a vectorized operation.

Example: Non-Vectorized Function

Some functions are not vectorized. That is, the function f(x) is not applied to each element of the vector x but to the entire vector. It is not always obvious whether a function is vectorized or not.

This has been a common source of bugs in R. You may wonder how come this is a problem? Doesn’t the non-vectorized function give a vector of results while the vectorized function gives a single result?

Well, let’s consider a tibble with two columns containing numeric vectors, and we want to compute the mean of each row.

tibble(
    x = c(1, 2, 3),
    y = c(4, 5, 6)
) |> 
    mutate(mean = mean(c(x, y)))
mutate: new variable 'mean' (double) with one unique value and 0% NA

Uh-oh! Instead of calculating the mean of each individual element, mean(c(x, y)) computes the mean of the entire vector. R then recycles that single result across all rows. Thus, the code runs successfully but gives you a wrong answer.

16.4 Fixing Issues with rowwise()

The solution? rowwise(). This function treats each row as its own group, ensuring that operations inside mutate() (or other verbs) are applied individually. In other words, it will always apply the operation to each row of the tibble.

tibble(
    x = c(1, 2, 3),
    y = c(4, 5, 6)
) |> 
    rowwise() |> 
    mutate(mean = mean(c(x, y))) |> 
    ungroup()
mutate: new variable 'mean' (double) with 3 unique values and 0% NA
ungroup: no grouping variables remain

As you can see, mean is now correctly computed as the mean of each row.

Note: Don’t forget to ungroup() after rowwise(). As a rule of thumb, you should always ungroup() after grouping operations (rowwise() is a type of grouping operation!).

Exercise

For a tibble with a list-column containing numeric vectors, add a column that contains the sum of each inner vector.

16.5 So, Why Avoid Loops?

Vectorization is a game changer in R. To summarize, vectorized code is better than loops because:

  • Readability: You focus on what you want to compute rather than how to iterate.
  • Performance: R optimizes vectorized operations, making your code run faster.
  • Fewer Bugs: No manual index tracking means fewer chances for off-by-one errors or other loop-related bugs.

When you write a loop, you’re busy managing an index—something that, in a vectorized approach, is completely abstracted away. The real computation is what matters, and vectorization lets you concentrate on that.

“Complicated control flows confuse programmers. Messy code often hides bugs.” — Bjarne Stroustrup