This post was written by Dewey Dunnington about his work during his 2019 RStudio internship. Dewey’s original post was published on his blog and is published here with some minor edits with Dewey’s consent, as part of our series highlighting the work of RStudio’s summer interns.
This past summer, I had the incredible opportunity to work as an RStudio intern with Hadley Wickham on the ggplot2 package. It was a welcome change of pace from writing articles about mud in lakes, and I’m sad the internship is over.
I had the opportunity to work alongside a lot of great interns at a fantastic company, prepare tons of issues for tidy-dev-day at UseR!, become an RStudio-certified tidyverse trainer, spiff up my website considerably with blogdown, and of course develop a few humble new features for ggplot2! Here are a few of them:
New vignette on using ggplot2 in packages
I joined as an intern just prior to the release of ggplot2 3.2.0. Before all ggplot2 releases, we run R CMD check on every reverse dependency (there are 2,622 of them as of the last revdep check) with the CRAN version and with the release candidate to make sure we don’t introduce new failures. In sifting through the failures, it became clear that there was no documentation about how to use ggplot2 in a package in a way that wouldn’t be likely to break in the future. Thus, the Using ggplot2 in packages vignette was born! It covers how to refer to ggplot2 functions, how to create a mapping without triggering an R CMD check error, and best practices for common ggplot2 uses in packages (like creating a theme or visualizing an object).
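As a rough sketch of the kind of pattern the vignette covers (the function name mpg_drv_summary() is only an illustration), a package function can namespace every ggplot2 call and refer to columns through the .data pronoun, so that R CMD check does not see bare column names as undefined variables:

```r
# Sketch of a package-style plotting function.
# Every ggplot2 function is referred to with an explicit ggplot2:: prefix,
# and the column is mapped through the `.data` pronoun instead of a bare name.
# In a real package, `.data` would additionally be imported with
# `#' @importFrom rlang .data` so that R CMD check knows where it comes from.
mpg_drv_summary <- function(data = ggplot2::mpg) {
  ggplot2::ggplot(data) +
    ggplot2::geom_bar(ggplot2::aes(x = .data$drv)) +
    ggplot2::coord_flip()
}

p_pkg <- mpg_drv_summary()
```

Printing p_pkg draws a bar chart of the drv column. The point of the sketch is only the calling convention, not the plot itself.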
The difference between a ggplot with scale_(x|y)_log10() and a ggplot with coord_trans((x|y) = "log10") is a common reason that issues get opened in ggplot2. In short, scale_(x|y)_log10() applies log10() to the x or y aesthetics before anything happens, including computing any statistics. Using coord_trans((x|y) = "log10") applies log10() after everything happens. This means that a plot with scale_(x|y)_log10() is going to have different outliers (say) than a plot with coord_trans((x|y) = "log10").
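    library(ggplot2)

    # Base plot reused in the examples that follow
    p <- ggplot(diamonds, aes(cut, price)) +
      geom_boxplot()

    # Left: transform the scale; right: transform the coordinate system
    patchwork::wrap_plots(
      p + scale_y_log10(),
      p + coord_trans(y = "log10"),
      nrow = 1
    )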
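The order matters because most summary statistics do not commute with log10(). Here is a minimal base R sketch with made-up numbers (the price vector below is hypothetical, not the diamonds data) that mimics the two orders of operations:

```r
# Hypothetical values spanning several orders of magnitude
price <- c(10, 100, 1000, 100000)

# Scale-style: transform first, then compute the statistic
mean_of_logs <- mean(log10(price))

# Coord-style: compute the statistic first, then transform it
log_of_mean <- log10(mean(price))

# The two disagree, so the same stat lands in a different place on the plot
c(scale_style = mean_of_logs, coord_style = log_of_mean)
```

The same reasoning applies to the quantiles and outlier cutoffs that geom_boxplot() computes, which is why the two panels can flag different points as outliers.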
It’s common for an issue to be opened for cases where this is non-intuitive (stat_summary() comes to mind - it’s not intuitive that summary statistics are not calculated on the original data), and the response is often that coord_trans() should be used instead of a transformed scale. However, there were problems with the expansion of discrete scales in coord_trans() that prevented coord_trans() from being a viable solution. In the PR fixing this, I also fixed a problem with second axes in coord_trans(), and made sure that the "reverse" trans worked (it didn’t before, but it doesn’t appear that anybody noticed). Hopefully coord_trans() is now ready to serve as a drop-in replacement when scale_(x|y)_log10() gives non-intuitive results!
NA limits in coord_cartesian()
Another common source of confusion in ggplot2 is the difference between scale_(x|y)_continuous(limits = ...) and coord_cartesian((x|y)lim = ...). When setting scale limits (this includes ylim()), data is “censored” by default, meaning values outside this range magically turn into NA and disappear; when setting the coordinate system limits, the data still exist, but data outside the (expanded) limits are not shown.
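To see what censoring does to a statistic, here is a small sketch with made-up numbers (the price vector is hypothetical). It calls scales::censor(), the function continuous scales use by default for out-of-bounds values, directly on a vector:

```r
# Hypothetical prices, two of which fall outside the limits c(0, 10000)
price <- c(500, 2000, 8000, 12000, 15000)

# Scale limits: out-of-range values become NA before any stat is computed
censored <- scales::censor(price, range = c(0, 10000))

# The median of what is left differs from the median of the full data,
# which is what coord_cartesian() would still be working with
median_censored <- median(censored, na.rm = TRUE)
median_full <- median(price)
```

With these numbers the censored median is 2000 while the full-data median is 8000, so a boxplot drawn after censoring summarises a different dataset than the one you supplied.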
    # Left: scale limits (censoring); right: coordinate limits (zooming)
    patchwork::wrap_plots(
      p + scale_y_continuous(limits = c(0, 10000)),
      p + coord_cartesian(ylim = c(0, 10000))
    )
## Warning: Removed 5222 rows containing non-finite values (stat_boxplot).
In this example, using scale limits (on the left) leads to displaying spurious information about where the min and max of the data are. When this issue comes up, the response is usually that the user should use coord_cartesian(ylim = ...) (as shown on the right) instead of scale_y_continuous(limits = ...). Scale limits have this awesome feature where you can pass NA as one or more of the limits to refer to the minimum or maximum of the data, but this previously wasn’t possible for coordinate system limits. Now it is! It’s particularly useful with facets where scales = "free":
    # Pin the lower limit at 0 and let each facet pick its own upper limit
    ggplot(diamonds, aes(color, price)) +
      geom_boxplot() +
      facet_wrap(vars(cut), scales = "free_y") +
      coord_cartesian(ylim = c(0, NA))
Axis guide improvements
When I started my internship, there was a long-standing open issue about overlapping axis text. Previously, it was impossible to customize axes beyond changing the labels in the scale_*() functions and adjusting their appearance a bit using theme(); anything else was a crazy workaround. Now that this pull request has been merged, axes will use the same guide system that powers guide_colourbar(), such that you will be able to customize how axes are drawn (and in the future create custom ones!). This feature comes with a couple of improvements for dealing with overlapping text in the new guide_axis():
    # you'll need the current development version of the package
    # remotes::install_github("tidyverse/ggplot2")
    ggplot(mpg, aes(hwy, cty)) +
      geom_point() +
      # create closely-spaced breaks that will overlap
      scale_x_continuous(breaks = seq(10, 50, 2)) +
      facet_wrap(vars(drv)) +
      guides(x = guide_axis(check.overlap = TRUE))
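Dropping labels is not the only option. As a sketch, assuming a ggplot2 version that includes guide_axis() (3.3.0 or later), the n.dodge argument keeps every label and staggers them across rows instead:

```r
library(ggplot2)

# Same closely-spaced breaks as above, but instead of dropping
# overlapping labels, stagger them across two rows
p_dodge <- ggplot(mpg, aes(hwy, cty)) +
  geom_point() +
  scale_x_continuous(breaks = seq(10, 50, 2)) +
  facet_wrap(vars(drv)) +
  guides(x = guide_axis(n.dodge = 2))
```

Which option reads better depends on the plot: check.overlap = TRUE keeps the axis compact at the cost of hiding some labels, while n.dodge trades vertical space for completeness.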