Muted spaghetti line charts with R's ggplot2

12 April 2021

If someone tells me their sales last month were $10M, what do I make of it? With just the bare number, I don't know what to think. To make sense of the number I need context, perhaps over time, perhaps compared to comparable companies. A data visualization can put a number into a context that allows me to make sense of it.

One particularly useful form of context is context over time. How does today's figure compare with the same measure in the past? A line chart, plotted against time, helps me see this.

Here is a rather more sombre example than sales revenues: the daily deaths per million due to covid in the state of Massachusetts.

This is valuable, as I can now put today's figure in historical context, comparing recent figures to those in the last two peaks. It's also very easy to plot this chart in R, needing just a few lines of code.

death_pp %>%
    filter(state == "MA") %>%
    ggplot(aes(date, death_pm_rm)) +
    labs(y = "deaths per million") +
    geom_line(color = "blue")
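
The code to load death_pp:

library(tidyverse)
library(lubridate)  # mdy()
library(zoo)        # rollmean()

# pops is assumed to be a data frame with one row per state and columns name, state, pop
# CDC covid data records New York City separately from New York state
cdc_pops <- pops %>%
  mutate(pop = if_else(state == "NY", pop - 8400000, pop)) %>%
  add_row(name = "New York City", state = "NYC", pop = 8400000)

# http  -d "" > cdc_cases.csv
cdc_cases <- read_csv("cdc_cases.csv") %>%
  select(state, submission_date, new_death, tot_death) %>%
  mutate(date = mdy(submission_date)) %>%
  arrange(date)

death_pp <- cdc_cases %>%
  left_join(cdc_pops, by = "state") %>%
  drop_na(pop) %>%
  # daily deaths per million residents
  mutate(death_pm = new_death * 1000000 / pop) %>%
  # 7-day trailing mean, computed per state so that states don't bleed into each other
  group_by(state) %>%
  mutate(death_pm_rm = rollmean(death_pm, 7, fill = NA, align = "right")) %>%
  ungroup()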

But I can show more context than just time. To better understand how the epidemic has been in Massachusetts, I can compare it to how things have gone in the other states. A good way to do this is to show the line chart for every other US state as a muted background.

As far as I can tell, there's no generally accepted term for this kind of plot. Putting multiple lines on a line chart is sometimes referred to as a spaghetti line chart. So I'll refer to this as a muted-spaghetti chart.

In R it's pretty easy to plot this. The key is to add another geom_line that uses a different data source from the primary line we're looking at.

death_pp %>%
    filter(state == "MA") %>%
    ggplot(aes(date, death_pm_rm)) +
    labs(y = "deaths per million") +
    geom_line(data = death_pp, aes(group = state), color = "grey", size = 1, alpha = 0.5) +
    geom_line(aes(y = death_pm_rm), color = "blue")

Note that I plot the background before the foreground line to ensure the foreground line pops clearly on top.
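
To try the technique without downloading the CDC data, here is a minimal sketch on made-up data. The toy data frame and its column names (day, series, value) are assumptions for illustration only, not part of the covid dataset.

```r
library(ggplot2)

# Made-up data: five hypothetical series ("A" to "E") as random walks over 30 days
set.seed(42)
toy <- expand.grid(day = 1:30, series = LETTERS[1:5], stringsAsFactors = FALSE)
toy$value <- ave(rnorm(nrow(toy)), toy$series, FUN = cumsum)

p <- ggplot(subset(toy, series == "A"), aes(day, value)) +
  # background layer: every series, muted grey, added first so it is drawn underneath
  geom_line(data = toy, aes(group = series), color = "grey", alpha = 0.5) +
  # foreground layer: only the series filtered in the ggplot() call
  geom_line(color = "blue")
p
```

ggplot2 draws layers in the order they are added, so swapping the two geom_line calls would put the grey lines over the blue one.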

Doing this with a grid (facets)

Showing this with one state is good, but it's often useful to be able to look at several states in this way. ggplot2 provides the very nifty facet_wrap function to plot a line chart for every value in a set, but it requires a little trickery to make it work with a muted-spaghetti background like this.

The trickery comes with the way I need to specify the grouping for the spaghetti.

death_pp %>%
  filter(state %in% c("MA", "VT", "CT", "RI", "NH")) %>% 
  ggplot(aes(date, death_pm_rm)) +
  labs(y = "deaths per million") +
  geom_line(data = death_pp %>% rename(s = state),
              aes(group = s), color = "grey", size = 1, alpha = 0.5) +
  geom_line(color = "blue") +
  facet_wrap(~state, ncol = 3)

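Once the grouping column is renamed, the background data no longer has a column called state, which is the faceting variable. ggplot2 draws a layer that lacks the faceting variable in every facet, so the spaghetti appears behind each panel while the primary line is still split across the facets. [1]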
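
The same behaviour can be checked on made-up data. This is a sketch under assumptions: the toy data frame and the names day, series, value and s are hypothetical. It uses layer_data() to count how many rows each layer contributes to each panel.

```r
library(ggplot2)

# Made-up data: five hypothetical series as random walks over 30 days
set.seed(1)
toy <- expand.grid(day = 1:30, series = LETTERS[1:5], stringsAsFactors = FALSE)
toy$value <- ave(rnorm(nrow(toy)), toy$series, FUN = cumsum)

# Background copy with the faceting column renamed, so facet_wrap(~series) cannot split it
background <- toy
names(background)[names(background) == "series"] <- "s"

p <- ggplot(subset(toy, series %in% c("A", "B", "C")), aes(day, value)) +
  geom_line(data = background, aes(group = s), color = "grey", alpha = 0.5) +
  geom_line(color = "blue") +
  facet_wrap(~series)

# Rows per panel for each layer of the built plot
table(layer_data(p, 1)$PANEL)  # background: all five series repeated in every panel
table(layer_data(p, 2)$PANEL)  # foreground: one series per panel
```

If the background layer kept the column name series, each panel would only receive the grey line for its own series, hidden under the blue one.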


1: It took me ages of experimenting and web searching to find how to do this with facets. Eventually I found the answer at From Data to Viz.