flowchart TD
A((Components))
A -->|Data to visual|AA{{Aesthetics}}:::api
A -->|Multiple panels|AB{{Facet}}:::api
A -->|Visual representation|AC{{Geometries}}:::api
A -->|Non-data elements|AD{{Parameters}}
AD -->ADA(Axis):::api
AD -->ADB(Annotations)
AD -->ADC(Title)
AD -->AG(Theme)
AD -->AF(Legend)
classDef api fill:#f96,color:#fff
Lecture 8 - Data Visualization Advanced
Summary
Details
flowchart LR
A((Components))
A -->|Data to visual|AA{{Aesthetics}}:::api
AA -->AAD(x, y):::api
AA -->AAA(color):::api
AA -->AAB(shape):::api
AA -->AAC(size):::api
AA -->AAE(...)
A -->|Multiple panels|AB{{Facet}}:::api
AB -->ABA(facet_wrap):::api
AB -->ABB(facet_grid):::api
A -->|Visual representation|AC{{Geometries}}:::api
AC -->ACA(geom_point):::api
AC -->ACB(geom_line):::api
AC -->ACC(geom_bar):::api
AC -->ACD(geom_histogram):::api
AC -->ACE(geom_boxplot):::api
AC -->ACF(geom_smooth):::api
AC -->ACG(geom_freqpoly):::api
AC -->ACH(geom_density):::api
A -->|Non-data elements|AD{{Parameters}}
AD -->ADA(Axis):::api
ADA -->ADAA(lim):::api
ADA -->ADAB(lab):::api
ADA -->ADAC(coord_flip)
ADA -->ADAD(scale)
AD -->ADB(Annotations)
AD -->ADC(Title)
ADC -->AEA(ggtitle)
AD -->AG(Theme)
AD -->AF(Legend)
classDef api fill:#f96,color:#fff
1 Aesthetics
Tip
flowchart TD
AA{{Aesthetics}}
AA -->AAD(x, y)
AA -->AAA(color)
AA -->AAB(shape)
AA -->AAC(size)
AA -->AAE(...)
The goal of this module is to teach you how to produce useful graphics with ggplot2 as quickly as possible. You’ll exploit its grammar a little further and learn some useful “recipes” to make the most important plots.
Again, let us first load the tidyverse package.
library(tidyverse)
1.1 Fuel economy data
In this module, we’ll mostly use one data set that’s bundled with ggplot2: mpg. It includes information about the fuel economy of popular car models in 1999 and 2008, collected by the US Environmental Protection Agency, http://fueleconomy.gov. You can access the data as long as you have loaded ggplot2:
mpg
The variables are mostly self-explanatory:
- cty and hwy record miles per gallon (mpg) for city and highway driving.
- displ is the engine displacement in litres.
- drv is the drivetrain: front wheel (f), rear wheel (r) or four wheel (4).
- model is the model of car. There are 38 models, selected because they had a new edition every year between 1999 and 2008.
- class is a categorical variable describing the “type” of car: two seater, SUV, compact, etc.
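The description above can be checked against the data itself. The following is a minimal sketch that assumes only that the tidyverse is installed; the object names n_models and drv_counts are illustrative and not part of the original notes.

```r
library(tidyverse)

# Column names and types at a glance
glimpse(mpg)

# Number of distinct car models in the data
n_models <- n_distinct(mpg$model)
n_models

# How many cars fall into each drivetrain category (f, r, 4)
drv_counts <- count(mpg, drv)
drv_counts
```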
Recall that we can create a scatterplot using
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
This produces a scatterplot defined by:
- Data: mpg.
- Aesthetic mapping: engine size mapped to \(x\) position, fuel economy to \(y\) position.
- Layer: observations rendered as points.
Exercise A
1.2 color, shape, size and other aesthetic attributes
To add additional variables to a plot, we can use other aesthetics like color, shape, and size (Note: Both American and British spellings are accepted by ggplot2). These work in the same way as the \(x\) and \(y\) aesthetics, and are added into the call to aes():
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()
aes(displ, hwy, color = class) gives each point a unique color corresponding to its class. The legend allows us to read data values from the color, showing us that the group of cars with unusually high fuel economy for their engine size are two seaters: cars with big engines, but lightweight bodies.
ggplot(mpg, aes(x = displ, y = hwy, shape = drv)) +
  geom_point()
ggplot2 takes care of the details of converting data (e.g., “f”, “r”, “4”) into aesthetics (e.g., “triangles”, “squares”, “circles”) with a scale.
ggplot(mpg, aes(x = displ, y = hwy, size = cyl)) +
  geom_point()
There is one scale for each aesthetic mapping in a plot. The scale is also responsible for creating a guide, an axis or legend, that allows you to read the plot, converting aesthetic values back into data values. For now, we’ll stick with the default scales provided by ggplot2.
If you want to set an aesthetic to a fixed value, without scaling it, do so in the individual layer outside of aes(). Compare the following two plots:
ggplot(mpg, aes(x = displ, y = hwy)) + geom_point(aes(color = "blue"))
ggplot(mpg, aes(x = displ, y = hwy)) + geom_point(color = "blue")
In the first plot, the value “blue” is scaled to a pinkish color, and a legend is added. In the second plot, the points are given the R color blue. This is an important technique. See vignette("ggplot2-specs") for the values needed for color and other aesthetics.
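One way to see the difference between mapping and setting without looking at the rendered plots is to inspect the colors each layer actually draws. This is a minimal sketch using layer_data() from ggplot2; the names p_mapped, p_set, mapped_colours and set_colours are illustrative.

```r
library(tidyverse)

p_mapped <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point(aes(color = "blue"))
p_set    <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point(color = "blue")

# layer_data() returns the data a layer draws, after scales have been applied
mapped_colours <- unique(layer_data(p_mapped)$colour)
set_colours    <- unique(layer_data(p_set)$colour)

mapped_colours  # a color picked by the default discrete scale, not "blue"
set_colours     # the literal R color "blue"
```

In the mapped version the string "blue" is treated as a data value with a single level, which is why it passes through a scale and gains a legend.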
Different types of aesthetic attributes work better with different types of variables. For example, color and shape work well with categorical variables, while size works well for continuous variables. The amount of data also makes a difference: if there is a lot of data it can be hard to distinguish different groups. An alternative solution is to use faceting, as described next.
When using aesthetics in a plot, less is usually more. It’s difficult to see the simultaneous relationships among color and shape and size, so exercise restraint when using aesthetics. Instead of trying to make one very complex plot that shows everything at once, see if you can create a series of simple plots that tell a story, leading the reader from ignorance to knowledge.
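As an illustration of matching an aesthetic to the type of variable, the sketch below maps cyl to color twice. It relies on cyl being stored as a number in mpg, so the first plot gets a continuous gradient, while factor(cyl) in the second is treated as categorical. The name cyl_levels is illustrative.

```r
library(tidyverse)

# cyl is numeric, so color = cyl uses a continuous color gradient
ggplot(mpg, aes(x = displ, y = hwy, color = cyl)) +
  geom_point()

# factor(cyl) makes it categorical: one distinct color per cylinder count
ggplot(mpg, aes(x = displ, y = hwy, color = factor(cyl))) +
  geom_point()

# The distinct cylinder counts that become legend entries in the second plot
cyl_levels <- levels(factor(mpg$cyl))
cyl_levels
```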
Exercise B
Work with mpg. Try to generate a scatterplot with displ on the \(x\) axis and cty on the \(y\) axis.
2 Faceting
Tip
flowchart LR
AB{{Facet}}
AB -->ABA(facet_wrap)
AB -->ABB(facet_grid)
classDef api fill:#f96,color:#fff
library(tidyverse)
Another technique for displaying additional categorical variables on a plot is faceting. Faceting creates tables of graphics by splitting the data into subsets and displaying the same graph for each subset.
There are two types of faceting: grid and wrapped. The differences between facet_wrap() and facet_grid() are illustrated below.
facet_grid() (left) is fundamentally 2d, being made up of two independent components. facet_wrap() (right) is 1d, but wrapped into 2d to save space.
facet_wrap() makes a long ribbon of panels (generated by any number of variables) and wraps it into 2d. This is useful if you have a single variable with many levels and want to arrange the plots in a more space efficient manner. It takes the name of a variable preceded by ~.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~class)
You can control how the ribbon is wrapped into a grid with ncol, nrow, as.table and dir.
ncol and nrow control how many columns and rows (you only need to set one).
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~class, ncol = 4)
as.table controls whether the facets are laid out like a table (TRUE), with highest values at the bottom-right, or a plot (FALSE), with the highest values at the top-right.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~class, ncol = 4, as.table = FALSE)
dir controls the direction of wrap: horizontal or vertical.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~class, ncol = 4, dir = "v")
facet_grid() lays out plots in a 2d grid, as defined by a formula:
a ~ b spreads a down the rows and b across the columns. You’ll usually want to put the variable with the greatest number of levels in the columns, to take advantage of the aspect ratio of your screen.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)
. ~ a spreads the values of a across the columns. This direction facilitates comparisons of \(y\) position, because the vertical scales are aligned.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(. ~ cyl)
b ~ . spreads the values of b down the rows. This direction facilitates comparison of \(x\) position because the horizontal scales are aligned. This makes it particularly useful for comparing distributions.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ .)
You can use multiple variables in the rows or columns, by “adding” them together, e.g. a + b ~ c + d. Variables appearing together on the rows or columns are nested in the sense that only combinations that appear in the data will appear in the plot. Variables that are specified on rows and columns will be crossed: all combinations will be shown, including those that didn’t appear in the original data set: this may result in empty panels.
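The nesting and crossing described above can be sketched with mpg. The choice of drv + cyl for the rows and year for the columns is only an illustration, and the names observed and possible are hypothetical helpers for counting combinations.

```r
library(tidyverse)

# Rows: drv and cyl together (nested, so only combinations present in the data);
# columns: year. Rows and columns are crossed, so some panels may be empty.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv + cyl ~ year)

# Combinations of drv and cyl that actually occur in the data
observed <- distinct(mpg, drv, cyl)
# Number of combinations there would be if every pairing occurred
possible <- n_distinct(mpg$drv) * n_distinct(mpg$cyl)

nrow(observed)
possible
```

Comparing nrow(observed) with possible shows how many row strips nesting leaves out.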
Exercise C
3 Geometries
Tip
flowchart LR
AC{{Geometries}}
AC -->ACA(geom_point)
AC -->ACB(geom_line)
AC -->ACC(geom_bar)
AC -->ACD(geom_histogram)
AC -->ACE(geom_boxplot)
AC -->ACF(geom_smooth)
AC -->ACG(geom_freqpoly)
AC -->ACH(geom_density)
library(tidyverse)
In the previous module, we learned geom_point(), geom_line(), geom_bar(), geom_histogram(), and geom_boxplot(). Here let us explore more geom_*() functions.
3.1 Adding a smoother to a plot
If you have a scatterplot with a lot of noise, it can be hard to see the dominant pattern. In this case it’s useful to add a smoothed line to the plot with geom_smooth(), which fits a smoother to the data and displays the smooth and its standard error.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
This overlays the scatterplot with a smooth curve, including an assessment of uncertainty in the form of point-wise confidence intervals shown in grey. If you’re not interested in the confidence interval, turn it off with geom_smooth(se = FALSE).
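For completeness, here is a minimal sketch of the same plot with the confidence band turned off. The method and formula are spelled out only to avoid the informational message; p_no_band is an illustrative name.

```r
library(tidyverse)

# Same smoother as above, but without the grey confidence band
p_no_band <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)
p_no_band
```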
An important argument to geom_smooth() is the method, which allows you to choose which type of model is used to fit the smooth curve:
method = "loess", the default for small \(n\), uses a smooth local regression (as described in ?loess). The wiggliness of the line is controlled by the span parameter, which ranges from 0 (exceedingly wiggly) to 1 (not so wiggly).
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, span = 0.2)
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, span = 1)
loess does not work well for large data sets, so an alternative smoothing algorithm is used when \(n\) is greater than 1,000.
method = "gam" fits a generalized additive model provided by the mgcv package. You need to first load mgcv, then use a formula like formula = y ~ s(x) or y ~ s(x, bs = "cs") (for large data). This is what ggplot2 uses when there are more than 1,000 points.
library(mgcv)
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "gam", formula = y ~ s(x))
method = "lm" fits a linear model, giving the line of best fit.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
method = "rlm" works like lm(), but uses a robust fitting algorithm so that outliers don’t affect the fit as much. It’s part of the MASS package, so remember to load that first.
library(MASS)
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "rlm", formula = y ~ x)
Exercise D
3.2 Histograms and frequency polygons
Histograms and frequency polygons show the distribution of a single numeric variable. They provide more information about the distribution of a single group than boxplots do, at the expense of needing more space.
ggplot(mpg, aes(x = hwy)) + geom_histogram(binwidth = 2.5)
ggplot(mpg, aes(x = hwy)) + geom_freqpoly(binwidth = 2.5)
As with the histogram, it is important to specify the binwidth for geom_freqpoly().
An alternative to the frequency polygon is the density plot, geom_density(). It is sometimes nice to add the density curve on the histogram.
ggplot(mpg, aes(x = hwy)) +
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 2.5, alpha = 0.6) +
  geom_density(alpha = 0.5)
Note here that we need to specify the aesthetic of \(y\) in geom_histogram(). Otherwise, the y scales of the two graphs do not match.
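To see why the histogram has to be put on the density scale, it helps to look at the numbers behind the bars. A density curve encloses an area of 1, whereas raw counts add up to the number of observations. The sketch below uses layer_data() to read the values computed for each bin; the names p_hist, bins, total_count and total_area are illustrative.

```r
library(tidyverse)

binwidth <- 2.5
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = binwidth)

# layer_data() exposes the per-bin values computed by the histogram's stat
bins <- layer_data(p_hist)

# Counts add up to the number of cars ...
total_count <- sum(bins$count)
# ... while bar height (density) times bar width adds up to 1
total_area <- sum(bins$density * (bins$xmax - bins$xmin))

total_count
total_area
```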
To compare the distributions of different subgroups, you can map a categorical variable to either fill (for geom_histogram()) or color (for geom_freqpoly()).
However, note that by default, the bars will be stacked in a histogram. If you want to compare the distributions side by side, you might consider setting the position argument to “dodge”.
geom_freqpoly(), on the other hand, is more effective for comparing distributions as it overlays line plots of frequencies for each category. It can handle overlapping data better than histograms in some cases.
ggplot(mpg, aes(x = hwy, color = drv, fill = drv)) +
  geom_histogram(binwidth = 2.5, alpha = 0.6, position = "dodge")
ggplot(mpg, aes(x = hwy, color = drv, fill = drv)) +
  geom_freqpoly(binwidth = 2.5)
Exercise E
library(tidyverse)
4 Modifying the axes
Tip
flowchart LR
ADA(Axis)
ADA -->ADAA(lim):::api
ADA -->ADAB(lab):::api
ADA -->ADAC(coord_flip)
ADA -->ADAD(scale)
classDef api fill:#f96,color:#fff
library(tidyverse)
Here let us focus on two families of useful helpers that let you make the most common modifications. xlab() and ylab() modify the \(x\)- and \(y\)-axis labels:
ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 1 / 3)
ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 1 / 3) +
  xlab("city driving (mpg)") +
  ylab("highway driving (mpg)")
# Remove the axis labels with NULL
ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 1 / 3) +
  xlab(NULL) +
  ylab(NULL)
xlim() and ylim() modify the limits of axes:
ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot()
ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot() +
  xlim("f", "r") +
  ylim(20, 30)
# Use NA for the lower limit in ylim() to automatically calculate
# it based on the minimum value in the data,
# while setting the upper limit to 30.
ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot() +
  ylim(NA, 30)
Changing the axis limits in ggplot2 converts values outside the specified range to NA. To suppress warnings associated with these NA values, you can use na.rm = TRUE. However, exercise caution: this conversion to NA occurs before the computation of summary statistics like the sample mean. As a result, it may affect the accuracy of these statistics, leading to potentially misleading interpretations.
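The effect on summary statistics can be made concrete by computing the group medians by hand, once on all cars and once on only the cars whose hwy lies within the limits. This is a sketch under stated assumptions: median_all, in_range and median_limited are illustrative names, and the final plot uses coord_cartesian(), an alternative that is not covered in the text above and zooms the view without discarding data.

```r
library(tidyverse)

# Median highway mpg per drivetrain, using every car
median_all <- tapply(mpg$hwy, mpg$drv, median)

# What the boxplot is computed from after ylim(20, 30):
# values outside the limits have been turned into NA and dropped
in_range <- mpg$hwy >= 20 & mpg$hwy <= 30
median_limited <- tapply(mpg$hwy[in_range], mpg$drv[in_range], median)

median_all
median_limited

# Assumed alternative (not from the original notes): zoom in on the same
# range while the boxes are still computed from all observations
ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(20, 30))
```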
# This plot uses ylim() to restrict the y-axis to the range 20 to 30,
# suppressing NA-related warnings with na.rm = TRUE.
ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot(na.rm = TRUE) +
  ylim(20, 30)
Exercise F
library(tidyverse)
5 Output
Most of the time you create a plot object and immediately render it, but you can also save a plot to a variable and manipulate it:
p <- ggplot(mpg, aes(x = displ, y = hwy, color = factor(cyl))) +
  geom_point()
Once you have a plot object, there are a few things you can do with it:
- Render it on screen by calling its name or with print(). This happens automatically when running interactively, but inside a loop or function, you’ll need to print() it yourself.
p
print(p)
- Save it to disk with ggsave().
# Save png to disk
ggsave("Output/plot.png", p, width = 5, height = 5)
- Briefly describe its structure with summary().
summary(p)
data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy, fl,
class [234x11]
mapping: x = ~displ, y = ~hwy, colour = ~factor(cyl)
faceting: <ggproto object: Class FacetNull, Facet, gg>
compute_layout: function
draw_back: function
draw_front: function
draw_labels: function
draw_panels: function
finish_data: function
init_scales: function
map_data: function
params: list
setup_data: function
setup_params: function
shrink: TRUE
train_scales: function
vars: function
super: <ggproto object: Class FacetNull, Facet, gg>
-----------------------------------
geom_point: na.rm = FALSE
stat_identity: na.rm = FALSE
position_identity
- Save a cached copy of it to disk, with saveRDS(). This saves a complete copy of the plot object, so you can easily re-create it with readRDS().
saveRDS(p, "Output/plot.rds")
q <- readRDS("Output/plot.rds")
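The point about print() inside a loop is easy to miss, so here is a minimal sketch that draws one plot per drivetrain and then saves and restores the last one. The temporary file paths and the names p_d, pdf_path, rds_path and p_restored are assumptions made for illustration; a PDF is used instead of the PNG above only so the example does not depend on a PNG graphics device.

```r
library(tidyverse)

# Inside a loop, a plot is only drawn when print() is called explicitly
for (d in unique(mpg$drv)) {
  p_d <- ggplot(filter(mpg, drv == d), aes(x = displ, y = hwy)) +
    geom_point() +
    ggtitle(paste("drv =", d))
  print(p_d)
}

# Save the last plot to a temporary PDF, then round-trip it through an RDS file
pdf_path <- tempfile(fileext = ".pdf")
rds_path <- tempfile(fileext = ".rds")
ggsave(pdf_path, p_d, width = 5, height = 5)
saveRDS(p_d, rds_path)
p_restored <- readRDS(rds_path)
```

Without the print() call the loop would run silently and produce no plots.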