A Summer of RStudio and ggplot2

Photo by kazuend

Dewey Dunnington

This post was written by Dewey Dunnington about his work during his 2019 RStudio internship. Dewey’s original post was published on his blog and is republished here, with minor edits and with Dewey’s consent, as part of our series highlighting the work of RStudio’s summer interns.

This past summer, I had the incredible opportunity to work as an RStudio intern with Hadley Wickham on the ggplot2 package. It was a welcome change of pace from writing articles about mud in lakes, and I’m sad the internship is over.

Figure 1: RStudio summer interns and software engineers meet up at the Boston office

I had the opportunity to work alongside a lot of great interns at a fantastic company, prepare tons of issues for tidy-dev-day at UseR!, become an RStudio-certified tidyverse trainer, spiff up my website considerably with blogdown, and of course develop a few humble new features for ggplot2! Here are a few of them:

library(ggplot2)

New vignette on using ggplot2 in packages

I joined as an intern just prior to the release of ggplot2 3.2.0. Before every ggplot2 release, we run R CMD check on every reverse dependency (there are 2,622 of them as of the last revdep check), once with the CRAN version and once with the release candidate, to make sure we don’t introduce new failures. In sifting through the failures, it became clear that there was no documentation about how to use ggplot2 in a package in a way that wouldn’t be likely to break in the future. Thus, the Using ggplot2 in packages vignette was born! It covers how to refer to ggplot2 functions, how to create a mapping without triggering an R CMD check error, and best practices for common ggplot2 uses in packages (like creating a theme or visualizing an object).
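As a rough sketch of the kind of pattern the vignette describes (not a quote from it), a package function might namespace-qualify every ggplot2 call and build its mapping with the `.data` pronoun, so that bare column names are not flagged as undefined global variables. The function name `plot_hwy_by_class()` is hypothetical, and a real package would also list ggplot2 in its Imports and import `.data` from rlang.

```r
# Hypothetical package function; the name plot_hwy_by_class() is illustrative.
# In a real package, ggplot2 would be listed in Imports and `.data` would be
# imported from rlang (for example via the roxygen tag `@importFrom rlang .data`).
plot_hwy_by_class <- function(data = ggplot2::mpg) {
  # Every ggplot2 function is referred to with the ggplot2:: prefix, and the
  # columns are referenced through .data instead of as bare names.
  ggplot2::ggplot(data, ggplot2::aes(.data$class, .data$hwy)) +
    ggplot2::geom_boxplot()
}

p_pkg <- plot_hwy_by_class()
```

The result is an ordinary ggplot object, so users of the package can keep adding layers, scales, or themes to it.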

coord_trans() improvements

The difference between a ggplot with scale_(x|y)_log10() and a ggplot with coord_trans((x|y) = "log10") is a common reason that issues get opened in ggplot2. In short, scale_(x|y)_log10() applies log10() to the x and/or y aesthetics before anything happens, including computing any statistics. Using coord_trans((x|y) = "log10") applies log10() after everything happens. This means that a geom_boxplot() with scale_(x|y)_log10() is going to have different outliers (say) than a geom_boxplot() with coord_trans((x|y) = "log10").

p <- ggplot(diamonds, aes(cut, price)) + geom_boxplot()

patchwork::wrap_plots(
  p + scale_y_log10(),
  p + coord_trans(y = "log10"),
  nrow = 1
)
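The same difference can be checked numerically without drawing anything. The sketch below uses base R’s boxplot.stats() as a stand-in for what geom_boxplot() computes (the two use slightly different quartile definitions, so the counts are only indicative) and looks at a single cut of the diamonds data.

```r
# Prices for one cut, to mirror a single box in the plots above
price <- ggplot2::diamonds$price[ggplot2::diamonds$cut == "Fair"]

# Like scale_y_log10(): transform first, then compute the boxplot statistics
n_out_scale <- length(boxplot.stats(log10(price))$out)

# Like coord_trans(y = "log10"): compute the statistics on the raw data;
# the log10 transformation is only applied when drawing
n_out_coord <- length(boxplot.stats(price)$out)

# Number of points flagged as outliers under each approach
c(scale = n_out_scale, coord = n_out_coord)
```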

It’s common for an issue to be opened for cases where this is non-intuitive (stat_summary() comes to mind - it’s not intuitive that summary statistics are not calculated on the original data), and the response is often that coord_trans() should be used instead of a transformed scale. However, there were problems with the expansion of discrete scales in coord_trans() that prevented coord_trans() from being a viable solution. In the PR fixing this, I also fixed a problem with second axes in coord_trans(), and made sure that the "reverse" trans worked (it didn’t before, but it doesn’t appear that anybody noticed). Hopefully coord_trans() is now ready to serve as a drop-in replacement when scale_(x|y)_log10() gives non-intuitive results!

NA limits in coord_cartesian()

Another common source of confusion in ggplot2 is the difference between scale_(x|y)_continuous(limits = ...) and coord_cartesian((x|y)lim = ...). When setting scale limits (this includes xlim() and ylim()), data are “censored” by default, meaning values outside this range magically turn into NA and disappear; when setting the coordinate system limits, the data still exist, but data outside the (expanded) limits are not shown.
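The censoring step can be seen in isolation with scales::censor(), which is the default out-of-bounds handler for continuous scales. This is a small sketch with made-up values, not the code ggplot2 runs internally.

```r
# Three made-up values; the last one is outside the limits used below
y <- c(500, 5000, 15000)

# Scale limits: out-of-range values are replaced with NA before any
# statistics are computed, so the stat never sees them
y_censored <- scales::censor(y, range = c(0, 10000))
y_censored

# Coordinate limits: the values themselves are untouched; only the visible
# region of the plot changes
y
```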

patchwork::wrap_plots(
  p + scale_y_continuous(limits = c(0, 10000)),
  p + coord_cartesian(ylim = c(0, 10000))
)
## Warning: Removed 5222 rows containing non-finite values (stat_boxplot).

In this example, using scale limits (on the left) leads to displaying spurious information about where the min and max of the data are. When this issue comes up, the response is usually that the user should use coord_cartesian(ylim = ...) (as shown on the right) instead of scale_y_continuous(limits = ...). Scale limits have this awesome feature where you can pass NA as one or more of the limits to refer to the minimum or maximum of the data, but this previously wasn’t possible for coordinate system limits. Now it is! It’s particularly useful with facets where scales = "free":

ggplot(diamonds, aes(color, price)) +
  geom_boxplot() +
  facet_wrap(vars(cut), scales = "free_y") +
  coord_cartesian(ylim = c(0, NA))

Axis guide improvements

When I started my internship, there was a long-standing open issue about overlapping axis text. Previously, the only way to customize axes was to change the breaks and/or labels in the scale_*() functions and tweak their appearance a bit using theme(); anything else was a crazy workaround. Now that this pull request has been merged, axes use the same guide system that powers guide_legend() and guide_colourbar(), so you will be able to customize how axes are drawn (and, in the future, create custom ones!). This feature comes with a couple of improvements for dealing with overlapping text in the new guide_axis() function:

# you'll need the current development version of the package
# remotes::install_github("tidyverse/ggplot2")
ggplot(mpg, aes(hwy, cty)) +
  geom_point() +
  # create closely-spaced breaks that will overlap
  scale_x_continuous(breaks = seq(10, 50, 2)) +
  facet_wrap(vars(drv)) +
  guides(x = guide_axis(check.overlap = TRUE))
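Another option for crowded labels, assuming a ggplot2 version that includes guide_axis() (3.3.0 or later), is the n.dodge argument, which staggers the labels across several rows instead of dropping some of them. The object name `p_dodge` is only for illustration.

```r
library(ggplot2)

p_dodge <- ggplot(mpg, aes(hwy, cty)) +
  geom_point() +
  # same closely-spaced breaks as in the previous example
  scale_x_continuous(breaks = seq(10, 50, 2)) +
  facet_wrap(vars(drv)) +
  # alternate the labels between two rows instead of hiding overlapping ones
  guides(x = guide_axis(n.dodge = 2))

p_dodge
```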
