library(tidyverse)
library(tidylog)
library(palmerpenguins)
theme_set(cowplot::theme_cowplot())
13 Visualizing Association
Understanding how variables are associated is one of the most important aspects of data analysis. We have already seen some examples of association in the previous chapters. But given its importance, we will greatly expand on this topic. In this chapter, we will explore a variety of techniques to visualize these associations.
13.1 Marginal distributions
Marginal distributions show the distribution of each variable along the axes of a scatter plot. They provide additional context to understand the spread and density of the data.
A simple way to add marginal information is geom_rug(), which adds tick marks along the axes:
penguins |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point() +
  geom_rug()
With many points, the rug ticks overlap and the marginal distributions become difficult to see. I prefer to use geom_hdr_rug() from the ggdensity package:
penguins |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point() +
  ggdensity::geom_hdr_rug()
Useful as it is, this approach makes it difficult to tell the marginal distributions of different groups apart.
For a clearer separation of marginal distributions by group, the ggside package provides density plots along each axis:
library(ggside)
penguins |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  geom_xsidedensity(aes(y = after_stat(density))) + # add density plot along the x-axis
  geom_ysidedensity(aes(x = after_stat(density))) + # add density plot along the y-axis
  theme(
    ggside.axis.text = element_blank() # remove side axis text
  )
13.2 Addressing Overplotting
When there are too many points in a scatter plot, overplotting can obscure patterns and lead to misinterpretation (a related real science story here).
We will use the lending_club dataset from the modeldata package to illustrate several strategies for managing overplotting. We first load the data and apply a log10() transformation so that variables spanning several orders of magnitude are easier to visualize, then drop rows whose transformed values are not finite:
data(lending_club, package = "modeldata")
lending_club_log10 <- lending_club |>
  mutate(annual_inc = log10(annual_inc)) |>
  mutate(total_il_high_credit_limit = log10(total_il_high_credit_limit)) |>
  filter(is.finite(annual_inc), is.finite(total_il_high_credit_limit))

lending_club_log10 |>
  ggplot(aes(x = annual_inc, y = total_il_high_credit_limit)) +
  geom_point()
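The is.finite() filter in the data preparation is there presumably because log10() turns zeros into -Inf, which cannot be placed on an axis. A minimal base R sketch of that behavior follows; the vector `income` holds made-up values for illustration and is not taken from lending_club:

```r
# Hypothetical values for illustration only; not taken from lending_club
income <- c(0, 100, 25000, 1e6)

# log10(0) is -Inf, so the first transformed value is not finite
log_income <- log10(income)
log_income

# Keep only the finite values, mirroring filter(is.finite(...)) in the pipeline
finite_log_income <- log_income[is.finite(log_income)]
finite_log_income
```

Dropping these rows changes the sample, so it is worth checking how many rows the filter removes (tidylog reports this automatically when it is loaded).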
By lowering the opacity (alpha) of points, overlapping points become less dominant:
lending_club_log10 |>
  ggplot(aes(x = annual_inc, y = total_il_high_credit_limit)) +
  geom_point(alpha = 0.05) # tune the transparency
As a rule of thumb, alpha = 0.05 means that it takes about 20 overlapping points to look roughly as dark as 1 opaque point.
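The 20-point figure is best read as an approximation. Assuming the graphics device uses standard "over" alpha compositing, n stacked points with the same alpha have a combined opacity of 1 - (1 - alpha)^n, so 20 points at alpha = 0.05 reach about 64% opacity rather than 100%. The helper `stacked_opacity()` below is a hypothetical name introduced only for this sketch:

```r
# Combined opacity of n stacked points drawn with the same alpha,
# assuming standard "over" alpha compositing (an assumption of this sketch)
stacked_opacity <- function(alpha, n) {
  1 - (1 - alpha)^n
}

alpha <- 0.05
n <- c(1, 5, 20, 50, 100)

# Tabulate how opacity builds up as more points overlap
data.frame(
  n_points = n,
  opacity = round(stacked_opacity(alpha, n), 2)
)
```

In practice this means the darkest regions keep gaining contrast well beyond 20 overlapping points, which is useful when choosing alpha for very dense data.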
Instead of plotting every individual point, you can visualize the density of points using contour plots. We will use the ggdensity package to help with this.
lending_club_log10 |>
  ggplot(aes(x = annual_inc, y = total_il_high_credit_limit)) +
  ggdensity::geom_hdr() # add a 2D contour plot
Alternatively, you can overlay density contours on top of a scatter plot:
lending_club_log10 |>
  ggplot(aes(x = annual_inc, y = total_il_high_credit_limit)) +
  ggdensity::geom_hdr_lines() + # add density contours only
  geom_point(pch = ".", color = "black", size = 0.5, alpha = 0.05) # add points with low transparency
Note that we used pch = ".". This draws each data point as a single non-aliased pixel, which greatly speeds up rendering (rendering can otherwise take a very long time if you have a lot of points).
The scattermore package (link) provides an insanely fast way to render scatter plots. For most use cases, it won’t matter that much, but it is always good to know you have options.
The ggpointdensity package colors points based on the local point density, highlighting areas of high concentration. It combines the best of both worlds: individual points remain visible, while color conveys density.
lending_club_log10 |>
  ggplot(aes(x = annual_inc, y = total_il_high_credit_limit)) +
  ggpointdensity::geom_pointdensity() +
  scale_color_viridis_c() # use the viridis color palette (generally preferred over the default)
13.3 Axis Transformation
Because data can span several orders of magnitude, we often use log-transformed axes. It is straightforward to do so in ggplot2. We will test the famous Kleiber’s law on how metabolic rate scales with body mass. We use the ex0826 dataset from the Sleuth3 package.
data(ex0826, package = "Sleuth3")
ex0826 |>
  ggplot(aes(x = Mass, y = Metab)) +
  geom_point() +
  scale_x_log10(
    breaks = scales::trans_breaks("log10", function(x) 10^x),
    labels = scales::trans_format("log10", scales::math_format(10^.x))
  ) +
  scale_y_log10(
    breaks = scales::trans_breaks("log10", function(x) 10^x),
    labels = scales::trans_format("log10", scales::math_format(10^.x))
  ) +
  annotation_logticks() +
  geom_smooth(method = "lm") +
  ggpmisc::stat_quant_eq()
Log plots are very useful. However, they warrant caution, as they may not be as intuitive as they seem (see a recent study).
Additionally, visual associations cannot replace rigorous statistical tests. As scientists, we should always hold ourselves to a high standard to avoid artefacts (see two related scientific debates here and here).
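To make the link between the log-log plot and a scaling law concrete: a power law y = a * x^b becomes a straight line with slope b after taking logs of both variables, so the slope of a linear fit on the log scale estimates the exponent. The sketch below uses simulated data, not the ex0826 measurements; the exponent of 0.75 (the value usually quoted for Kleiber’s law), the constant 70, and the noise level are assumptions chosen only for illustration:

```r
# Simulated data for illustration only (not the ex0826 measurements).
# Assumed model: metab = 70 * mass^0.75 with multiplicative noise.
set.seed(1)
mass <- 10^runif(200, min = -2, max = 3)
metab <- 70 * mass^0.75 * exp(rnorm(200, sd = 0.2))

# A power law y = a * x^b is a straight line on log-log axes:
# log10(y) = log10(a) + b * log10(x)
fit <- lm(log10(metab) ~ log10(mass))

coef(fit)    # the slope estimates the exponent b
confint(fit) # confidence intervals for the intercept and the slope
```

With real data, the confidence interval for the slope gives a first check of whether the estimate is compatible with a hypothesized exponent, which the plot alone cannot provide.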
13.4 High-Dimensional Data
When dealing with multiple variables, it is often insightful to explore relationships beyond simple 2D plots. We illustrate two common approaches with the penguins dataset.
penguins_small <- penguins |>
  select(species, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) |>
  drop_na()
select: dropped 3 variables (island, sex, year)
drop_na: removed 2 rows (1%), 342 rows remaining
13.4.1 Pairwise plots
Pairwise plots (or scatterplot matrices) allow you to visualize the relationships between all pairs of variables in a dataset. They are especially useful for exploratory data analysis. We will use the GGally package to create these plots.
penguins_small |>
  GGally::ggpairs()
13.4.2 Dimension reduction
Dimension reduction techniques help summarize high-dimensional data into two or three dimensions, making it easier to visualize complex relationships. Although many packages exist, we will use the ggpca package as it offers several options (e.g., PCA, t-SNE, UMAP) through a consistent interface.
library(ggpca)
penguins_small |>
  ggpca(
    metadata_cols = "species", # column to use for grouping/metadata
    mode = "pca",              # choose PCA; change to "tsne" or "umap" as desired
    color_var = "species",     # color points by species
    ellipse = TRUE             # add ellipses for group boundaries
  )
While dimension reduction techniques are powerful, they can sometimes be misleading. They simplify complex relationships and might hide important nuances of the data. For more insights on potential pitfalls, see this article.