This is the second blog post in the “Teaching the Tidyverse in 2020” series. The first post was on getting started, and today our focus is data visualisation. In this post, I’ll highlight some new(ish) features of ggplot2 as well as not-so-new features that I see rarely in introductory teaching materials that I think belong there.
Let’s start by loading the tidyverse and palmerpenguins packages.
We’ll be using the
penguins data from the palmerpenguins package. You can read about the variables in the data frame
here and a quick peek is provided below.
## Rows: 344 ## Columns: 7 ## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ade… ## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgers… ## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1,… ## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1,… ## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 18… ## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475,… ## $ sex <fct> male, female, female, NA, female, male, female, mal…
At some point in ggplot2 history, making a box plot of a single variable required adding
factor = 1 to the aesthetic mapping, which I never found to be easy to explain.
ggplot(penguins, aes(y = body_mass_g, factor = 1)) + geom_boxplot()
You can now (or more like for the
last couple years!) omit
factor = 1!
ggplot(penguins, aes(y = body_mass_g)) + geom_boxplot()
Horizontal bar plots can be really useful, especially for categorical data whose levels have long names that overlap if placed on the x-axis. Previously, making horizontal bar plots required mapping the variable to be plotted to the
x aesthetic and then applying a
coord_flip() layer to flip the axes, e.g.
ggplot(penguins, aes(x = species)) + geom_bar() + coord_flip()
geom_bar() now works in both directions, so the categorical variable can be directly mapped to the
y aesthetic to achieve the horizontal box plot.
ggplot(penguins, aes(y = species)) + geom_bar()
You can read more about what this change enables in other geoms here.
One of the biggest strengths of ggplot2 is that it has fantastic defaults and your plots look pretty good without trying hard. This is very attractive especially for teaching introductory audiences when you just want to focus on the basics. However, when teaching data visualisation (even in the simplest sense) it is important to discuss data visualisation good practices, e.g. present data in ascending/descending order, label axes in a way that matches your narrative, etc. Packages like scales and forcats are incredibly helpful for implementing these small but impactful improvements to visualisations.
Suppose we want to plot the percentage distribution of different species of penguins. We can first calculate these percentages in a dplyr pipeline.
penguins %>% count(species) %>% mutate(prop = n / sum(n))
## # A tibble: 3 x 3 ## species n prop ## <fct> <int> <dbl> ## 1 Adelie 152 0.442 ## 2 Chinstrap 68 0.198 ## 3 Gentoo 124 0.360
We can then feed the result to the
ggplot() function to be represented using
penguins %>% count(species) %>% mutate(prop = n / sum(n)) %>% ggplot(aes(x = prop, y = species)) + geom_col()
First, let’s reorder the bars in descending order, with the penguin species with the highest proportion on top. We use
fct_reorder() function from the forcats package for this. We can read
fct_reorder(species, prop) as “reorder the levels of
species based on the values of
penguins %>% count(species) %>% mutate(prop = n / sum(n)) %>% ggplot(aes(x = prop, y = fct_reorder(species, prop))) + geom_col()
The forcats package contains a
variety of functions starting with the prefix
fct_ that operate on factor levels. Perhaps unsurprisingly, they’re named pretty intuitively (at least for English speakers). (Side note: After
fct_reorder(), my favourite function for improving visualisations of messy survey data with factors with many levels is
Next, let’s think about how we would interpret this visualisation. I would probably say something like “Adelie penguins make up over 40% of the sample, followed by Gentoo penguins at roughly over 35%, and Chinstrap penguins make up the remaining 20%". But the x-axis is showing proportions instead of percentages.
The scales package offers a
variety of functions, starting with the prefix
label_* that help with this sort of task.
The function we want here is
label_percent(). We can also specify the number of decimal places to show with the
penguins %>% count(species) %>% mutate(prop = n / sum(n)) %>% ggplot(aes(x = prop, y = fct_reorder(species, prop))) + geom_col() + scale_x_continuous(labels = label_percent(accuracy = 1))
And finally, let’s fix up the axis labels.
penguins %>% count(species) %>% mutate(prop = n / sum(n)) %>% ggplot(aes(x = prop, y = fct_reorder(species, prop))) + geom_col() + scale_x_continuous(labels = label_percent(accuracy = 1)) + labs( x = "Percentage", y = "Species", title = "Species distribution of penguins", subtitle = "Adelie, Gentoo, and Chinstrap Penguins at Palmer Station LTER", caption = "Source: allisonhorst.github.io/palmerpenguins" )
Another very attractive feature of ggplot2 is that you get legends for free. But sometimes the factor levels displayed in the legends may not be optimally ordered. In these cases
fct_reorder2() might be helpful. See
this gist from Jenny Bryan for an example using this function to reorder levels displayed on a legend of a line plot.
ggplot2 is a vast package, and I don’t think it’s feasible to teach the complete functionality of the package within an intro stats or data science curriculum. That being said, I think it is important to not just teach how to make plots, but how to make good plots that follow data visualisation best practices. Leveraging packages like forcats and scales allows you to do that without adding too much cognitive burden on the learners.
There are many packages that extend ggplot2 (e.g. gganimate, ggridges, ggrepel, GGally) and just like it’s not possible to cover all of ggplot2 in an intro course, it’s definitely not possible to cover the entire gg-verse. However, it is useful to introduce your students to some of these packages so they are aware of the existence of the larger ecosystem.
A particularly good place to start is when you find a pain point with ggplot2 and the solution is using one of these other packages (e.g. ggrepel for labelling data points on a plot without overlapping text).
Another good approach for introducing these packages are assignments where students independently explore packages you haven’t explicitly taught them. I like doing this using the following two assignments:
You gotta pick a package or two: Provide a list of fun and/or easy to onboard packages and ask students to read the package documentation and use the package to do something.
Mirror, mirror on the wall, who’s the ugliest of them all? I got the idea for this assignment Give a simple plot and ask students to make the plot as ugly as possible. I like doing this with a disclaimer that says “I realize that ugly is subjective, so we’re mostly looking to see if you can figure out how to change the look of a plot using help files of functions you haven’t learned before. As you do, try to make your plot as visually unappealing as possible.“I got the idea for this assignment from Drew Steen, who shared on Twitter sample student plots that are, well, awful, in the best way possible. Alison Horst also has a lovely blog post where she shares her experience with a similar assignment here.
You can find a more detailed write up of these assignments here.
That’s it for today, next up on the “Teaching the Tidyverse in 2020” series is data wrangling and tidying.