Lecture 8 - Data Visualization Advanced

Summary

flowchart TD
  A((Components)) 

  A -->|Data to visual|AA{{Aesthetics}}:::api
  A -->|Multiple panels|AB{{Facet}}:::api
  A -->|Visual representation|AC{{Geometries}}:::api
  A -->|Non-data elements|AD{{Parameters}}
  AD -->ADA(Axis):::api
  AD -->ADB(Annotations)
  AD -->ADC(Title)
  AD -->AG(Theme)
  AD -->AF(Legend)

  classDef api fill:#f96,color:#fff

Details

flowchart LR
  A((Components)) 

  A -->|Data to visual|AA{{Aesthetics}}:::api
  AA -->AAD(x, y):::api
  AA -->AAA(color):::api
  AA -->AAB(shape):::api
  AA -->AAC(size):::api
  AA -->AAE(...)

  A -->|Multiple panels|AB{{Facet}}:::api
  AB -->ABA(facet_wrap):::api
  AB -->ABB(facet_grid):::api

  A -->|Visual representation|AC{{Geometries}}:::api
  AC -->ACA(geom_point):::api
  AC -->ACB(geom_line):::api
  AC -->ACC(geom_bar):::api
  AC -->ACD(geom_histogram):::api
  AC -->ACE(geom_boxplot):::api
  AC -->ACF(geom_smooth):::api
  AC -->ACG(geom_freqpoly):::api
  AC -->ACH(geom_density):::api

  A -->|Non-data elements|AD{{Parameters}}
  AD -->ADA(Axis):::api
  ADA -->ADAA(lim):::api
  ADA -->ADAB(lab):::api
  ADA -->ADAC(coord_flip)
  ADA -->ADAD(scale)

  AD -->ADB(Annotations)
  AD -->ADC(Title)
  ADC -->AEA(ggtitle)

  AD -->AG(Theme)
  AD -->AF(Legend)

  classDef api fill:#f96,color:#fff

1 Aesthetics

Tip

flowchart TD
  AA{{Aesthetics}}
  AA -->AAD(x, y)
  AA -->AAA(color)
  AA -->AAB(shape)
  AA -->AAC(size)
  AA -->AAE(...)

The goal of this module is to teach you how to produce useful graphics with ggplot2 as quickly as possible. You’ll exploit its grammar a little further and learn some useful “recipes” to make the most important plots.

Again, let us first load the tidyverse package.

library(tidyverse)

1.1 Fuel economy data

In this module, we’ll mostly use one data set that’s bundled with ggplot2: mpg. It includes information about the fuel economy of popular car models in 1999 and 2008, collected by the US Environmental Protection Agency, http://fueleconomy.gov. You can access the data as long as you have loaded ggplot2:

mpg

The variables are mostly self-explanatory:

  • cty and hwy record miles per gallon (mpg) for city and highway driving.
  • displ is the engine displacement in litres.
  • drv is the drivetrain: front wheel (f), rear wheel (r) or four wheel (4).
  • model is the model of car. There are 38 models, selected because they had a new edition every year between 1999 and 2008.
  • class is a categorical variable describing the “type” of car: two seater, SUV, compact, etc.

Recall that we can create a scatterplot using

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point()

This produces a scatterplot defined by:

  1. Data: mpg.
  2. Aesthetic mapping: engine size mapped to \(x\) position, fuel economy to \(y\) position.
  3. Layer: observations rendered as points.

Exercise A

Q1

Describe the key components: data, aesthetic mappings and layers used for each of the following plots. You’ll need to guess a little because you haven’t seen all the data sets and functions yet, but use your common sense! See if you can predict what the plot will look like before running the code.

  1. ggplot(mpg, aes(cty, hwy)) + geom_point()
  2. ggplot(diamonds, aes(carat, price)) + geom_point()
  3. ggplot(economics, aes(date, unemploy)) + geom_line()
  4. ggplot(mpg, aes(cty)) + geom_histogram()

1.2 color, shape, size and other aesthetic attributes

To add additional variables to a plot, we can use other aesthetics like color, shape, and size (Note: Both American and British spellings are accepted by ggplot2). These work in the same way as the \(x\) and \(y\) aesthetics, and are added into the call to aes():

ggplot(mpg, aes(x = displ, y = hwy, color = class)) + 
  geom_point()

aes(displ, hwy, color = class) colors each point according to its class, so all cars of the same class share one color. The legend allows us to read data values from the color, showing us that the group of cars with unusually high fuel economy for their engine size are two seaters: cars with big engines, but lightweight bodies.

ggplot(mpg, aes(x = displ, y = hwy, shape = drv)) + 
  geom_point()

ggplot2 takes care of the details of converting data (e.g., “f”, “r”, “4”) into aesthetics (e.g., “triangles”, “squares”, “circles”) with a scale.

ggplot(mpg, aes(x = displ, y = hwy, size = cyl)) + 
  geom_point()

There is one scale for each aesthetic mapping in a plot. The scale is also responsible for creating a guide, an axis or legend, that allows you to read the plot, converting aesthetic values back into data values. For now, we’ll stick with the default scales provided by ggplot2.

If you want to set an aesthetic to a fixed value, without scaling it, do so in the individual layer outside of aes(). Compare the following two plots:

ggplot(mpg, aes(x = displ, y = hwy)) + geom_point(aes(color = "blue"))

ggplot(mpg, aes(x = displ, y = hwy)) + geom_point(color = "blue")

In the first plot, the value “blue” is scaled to a pinkish color, and a legend is added. In the second plot, the points are given the R color blue. This is an important technique. See vignette("ggplot2-specs") for the values needed for color and other aesthetics.

Different types of aesthetic attributes work better with different types of variables. For example, color and shape work well with categorical variables, while size works well for continuous variables. The amount of data also makes a difference: if there is a lot of data it can be hard to distinguish different groups. An alternative solution is to use faceting, as described next.
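
As a small illustration of the point about variable types, the sketch below maps the same variable, cyl, to color in two ways. It assumes (this is not stated above) that wrapping a numeric variable in factor() is an acceptable way to have ggplot2 treat it as categorical. The names p_continuous and p_discrete are only illustrative.

```r
library(ggplot2)

# cyl is stored as a number, so ggplot2 uses a continuous color gradient
p_continuous <- ggplot(mpg, aes(x = displ, y = hwy, color = cyl)) +
  geom_point()

# factor(cyl) is categorical, so each cylinder count gets its own color
p_discrete <- ggplot(mpg, aes(x = displ, y = hwy, color = factor(cyl))) +
  geom_point()

p_continuous
p_discrete
```

Comparing the two legends side by side shows how the type of the variable, not just its values, decides which scale ggplot2 picks.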

When using aesthetics in a plot, less is usually more. It’s difficult to see the simultaneous relationships among color and shape and size, so exercise restraint when using aesthetics. Instead of trying to make one very complex plot that shows everything at once, see if you can create a series of simple plots that tell a story, leading the reader from ignorance to knowledge.

Exercise B

Work with mpg. Try to generate a scatterplot with displ on the \(x\) axis and cty on the \(y\) axis.

Q1

Color code the observations according to year. What do you see? How would you make corrections so that the color scale is more appropriate?

Q2

Use different shapes to scale observations according to manufacturer. What do you see? Would you recommend using this visualization? Why/Why not?

Q3

Render the observations with various sizes according to drv. What do you see? Would you recommend using this visualization? Why/Why not?

2 Faceting

Tip

flowchart LR
  AB{{Facet}}
  AB -->ABA(facet_wrap)
  AB -->ABB(facet_grid)
  classDef api fill:#f96,color:#fff

library(tidyverse)

Another technique for displaying additional categorical variables on a plot is faceting. Faceting creates tables of graphics by splitting the data into subsets and displaying the same graph for each subset.

There are two types of faceting: grid and wrapped. The differences between facet_wrap() and facet_grid() are illustrated below.

facet_grid() (left) is fundamentally 2d, being made up of two independent components. facet_wrap() (right) is 1d, but wrapped into 2d to save space.

facet_wrap() makes a long ribbon of panels (generated by any number of variables) and wraps it into 2d. This is useful if you have a single variable with many levels and want to arrange the plots in a more space efficient manner. It takes the name of a variable preceded by ~.

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  facet_wrap(~class)

You can control how the ribbon is wrapped into a grid with ncol, nrow, as.table and dir.

ncol and nrow control how many columns and rows (you only need to set one).

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  facet_wrap(~class, ncol = 4)

as.table controls whether the facets are laid out like a table (TRUE), with highest values at the bottom-right, or a plot (FALSE), with the highest values at the top-right.

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  facet_wrap(~class, ncol = 4, as.table = FALSE)

dir controls the direction of wrap: horizontal or vertical.

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  facet_wrap(~class, ncol = 4, dir = "v")
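
Since the ribbon of panels can be generated by more than one variable, a minimal sketch of that case may help. The choice of drv and year is only an assumption made for illustration, and p_wrap2 is a hypothetical name.

```r
library(ggplot2)

# One panel per drv/year combination present in the data,
# laid out as a wrapped ribbon with three columns
p_wrap2 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ drv + year, ncol = 3)

p_wrap2
```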

facet_grid() lays out plots in a 2d grid, as defined by a formula:

  • a ~ b spreads a down the rows and b across the columns (the formula reads rows ~ columns). You’ll usually want to put the variable with the greatest number of levels in the columns, to take advantage of the aspect ratio of your screen.
ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  facet_grid(drv ~ cyl)

  • . ~ a spreads the values of a across the columns. This direction facilitates comparisons of \(y\) position, because the vertical scales are aligned.
ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  facet_grid(. ~ cyl)

  • b ~ . spreads the values of b down the rows. This direction facilitates comparison of \(x\) position because the horizontal scales are aligned. This makes it particularly useful for comparing distributions.
ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  facet_grid(drv ~ .)

You can use multiple variables in the rows or columns, by “adding” them together, e.g. a + b ~ c + d. Variables appearing together on the rows or columns are nested in the sense that only combinations that appear in the data will appear in the plot. Variables that are specified on rows and columns will be crossed: all combinations will be shown, including those that didn’t appear in the original data set: this may result in empty panels.
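
The difference between nested and crossed variables may be easier to see in a concrete plot. The following is a sketch only: the variables drv, year and cyl are chosen for illustration, and p_grid2 is a hypothetical name.

```r
library(ggplot2)

# Rows: combinations of drv and year that occur in the data (nested)
# Columns: cyl, crossed with the rows, so some panels may be empty
p_grid2 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv + year ~ cyl)

p_grid2
```

Panels without any points correspond to row and column combinations that never occur together in mpg.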


Exercise C

Q1

What happens if you try to facet by a continuous variable like cty?

Q2

Use faceting to explore how the relationship between city fuel economy (cty) and engine size (displ) varies across year (year) and fuel type (fl). How does faceting by fuel type and year change your assessment of the relationship between engine size and fuel economy?

3 Geometries

Tip

flowchart LR
  AC{{Geometries}}
  AC -->ACA(geom_point)
  AC -->ACB(geom_line)
  AC -->ACC(geom_bar)
  AC -->ACD(geom_histogram)
  AC -->ACE(geom_boxplot)
  AC -->ACF(geom_smooth)
  AC -->ACG(geom_freqpoly)
  AC -->ACH(geom_density)

library(tidyverse)

In the previous module, we learned geom_point(), geom_line(), geom_bar(), geom_histogram() and geom_boxplot(). Here, let us explore more geom_*() functions.

3.1 Adding a smoother to a plot

If you have a scatterplot with a lot of noise, it can be hard to see the dominant pattern. In this case it’s useful to add a smoothed line to the plot with geom_smooth(), which fits a smoother to the data and displays the smooth and its standard error.

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

This overlays the scatterplot with a smooth curve, including an assessment of uncertainty in the form of point-wise confidence intervals shown in grey. If you’re not interested in the confidence interval, turn it off with geom_smooth(se = FALSE).
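
To make the effect of se concrete, here is a minimal sketch that builds the same smoothed scatterplot with and without the confidence band. The names p_band and p_no_band are illustrative, and method and formula are spelled out only to avoid the informational message.

```r
library(ggplot2)

# Default behaviour: smooth curve plus grey point-wise confidence band
p_band <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x)

# Same smoother, but with the confidence band turned off
p_no_band <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)

p_band
p_no_band
```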

An important argument to geom_smooth() is the method, which allows you to choose which type of model is used to fit the smooth curve:

  • method = "loess", the default for small \(n\), uses a smooth local regression (as described in ?loess). The wiggliness of the line is controlled by the span parameter, which ranges from 0 (exceedingly wiggly) to 1 (not so wiggly).
ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth(method = "loess", formula = y ~ x, span = 0.2)

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth(method = "loess", formula = y ~ x, span = 1)

loess does not work well for large data sets, so an alternative smoothing algorithm is used when \(n\) is greater than 1,000.

  • method = "gam" fits a generalized additive model provided by the mgcv package. You need to first load mgcv, then use a formula like formula = y ~ s(x) or y ~ s(x, bs = "cs") (for large data). This is what ggplot2 uses when there are more than 1,000 points.
library(mgcv)
ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth(method = "gam", formula = y ~ s(x))

  • method = "lm" fits a linear model, giving the line of best fit.
ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth(method = "lm", formula = y ~ x)

  • method = "rlm" works like lm(), but uses a robust fitting algorithm so that outliers don’t affect the fit as much. It’s part of the MASS package, so remember to load that first.
library(MASS)
ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth(method = "rlm", formula = y ~ x)

Exercise D

Q1

Add a smoother to the scatterplot addressing the relationship between cty and displ. You may explore different methods.

3.2 Histograms and frequency polygons

Histograms and frequency polygons show the distribution of a single numeric variable. They provide more information about the distribution of a single group than boxplots do, at the expense of needing more space.

ggplot(mpg, aes(x = hwy)) + geom_histogram(binwidth = 2.5)

ggplot(mpg, aes(x = hwy)) + geom_freqpoly(binwidth = 2.5)

As with histograms, it is important to specify the binwidth for geom_freqpoly().
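
A quick way to see why the bin width matters is to draw the same frequency polygon with two different values. The widths 1 and 5 below are arbitrary choices for illustration, not recommendations.

```r
library(ggplot2)

# Narrow bins: more detail, but a noisier outline
p_narrow <- ggplot(mpg, aes(x = hwy)) +
  geom_freqpoly(binwidth = 1)

# Wide bins: smoother outline, but fine structure is lost
p_wide <- ggplot(mpg, aes(x = hwy)) +
  geom_freqpoly(binwidth = 5)

p_narrow
p_wide
```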

An alternative to the frequency polygon is the density plot, geom_density(). It is sometimes nice to add the density curve on the histogram.

ggplot(mpg, aes(x = hwy)) + 
  geom_histogram(aes(y = after_stat(density)), 
                 binwidth = 2.5, alpha = 0.6) +
  geom_density(alpha = 0.5)

Note here that we need to specify the aesthetic of \(y\) in geom_histogram(). Otherwise, the histogram would show counts while the density curve shows densities, so the y scales of the two layers would not match.

To compare the distributions of different subgroups, you can map a categorical variable to either fill (for geom_histogram()) or color (for geom_freqpoly()).

However, note that by default, the bars will be stacked in a histogram. If you want to compare the distributions side by side, you might consider setting the position argument to “dodge”.

geom_freqpoly(), on the other hand, is more effective for comparing distributions as it overlays line plots of frequencies for each category. It can handle overlapping data better than histograms in some cases.

ggplot(mpg, aes(x = hwy, color = drv, fill = drv)) + 
  geom_histogram(binwidth = 2.5, alpha = 0.6, position = "dodge")

ggplot(mpg, aes(x = hwy, color = drv, fill = drv)) + 
  geom_freqpoly(binwidth = 2.5)

Exercise E

Q1

Work with the mpg data. Use faceting to create subplots so as to compare the distributions of displ for different drive trains drv. In each subplot, you can either use a histogram or a frequency polygon.

Q2

Work with the iris data. Draw a histogram to compare the distributions of Petal.Length for different Species. This time let us plot everything on one graph and add the density curves on top of the histograms.


4 Modifying the axes

Tip

flowchart LR
  ADA(Axis)
  ADA -->ADAA(lim):::api
  ADA -->ADAB(lab):::api
  ADA -->ADAC(coord_flip)
  ADA -->ADAD(scale)

  classDef api fill:#f96,color:#fff

library(tidyverse)

Here let us focus on two families of useful helpers that let you make the most common modifications. xlab() and ylab() modify the \(x\)- and \(y\)-axis labels:

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 1 / 3)

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 1 / 3) + 
  xlab("city driving (mpg)") + 
  ylab("highway driving (mpg)")

# Remove the axis labels with NULL
ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 1 / 3) + 
  xlab(NULL) + 
  ylab(NULL)

xlim() and ylim() modify the limits of axes:

ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot()

ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot() + 
  xlim("f", "r") + 
  ylim(20, 30)

# Use NA for the lower limit in ylim() to automatically calculate
# it based on the minimum value in the data,
# while setting the upper limit to 30.
ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot() + 
  ylim(NA, 30)

Changing the axis limits in ggplot2 converts values outside the specified range to NA. To suppress warnings associated with these NA values, you can use na.rm = TRUE. However, exercise caution: this conversion to NA occurs before the computation of summary statistics like the sample mean. As a result, it may affect the accuracy of these statistics, leading to potentially misleading interpretations.

# This plot uses ylim() to restrict the y-axis to the range 20 to 30,
# suppressing NA-related warnings with na.rm = TRUE.
ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot(na.rm = TRUE) + 
  ylim(20, 30)
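
The caution about summary statistics can be checked directly. The sketch below compares the boxplot medians computed with and without the limits, using layer_data() to extract the values ggplot2 computes for a layer. The object names are hypothetical.

```r
library(ggplot2)

p_full <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot()
p_limited <- p_full + ylim(20, 30)

# layer_data() returns the values computed for a layer,
# including the boxplot median in the column `middle`
stats_full <- layer_data(p_full, 1)
stats_limited <- suppressWarnings(layer_data(p_limited, 1))

# Compare the medians per drivetrain with and without the limits
data.frame(
  drv = sort(unique(mpg$drv)),
  median_full = stats_full$middle,
  median_limited = stats_limited$middle
)
```

Any median that differs between the two columns was recomputed from the reduced data, which is exactly the effect to watch out for.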

Exercise F

Q1

Change the \(x\) label and \(y\) label for the scatterplot of cty and displ using appropriate text.

Q2

First render a frequency plot demonstrating the counts of cars grouped by manufacturer in the mpg data. Then, set the \(x\)-axis limit so that the chart only displays the counts for “audi”, “mercury” and “volkswagen”.


5 Output

Most of the time you create a plot object and immediately render it, but you can also save a plot to a variable and manipulate it:

p <- ggplot(mpg, aes(x = displ, y = hwy, color = factor(cyl))) +
  geom_point()

Once you have a plot object, there are a few things you can do with it:

  • Render it on screen by calling its name or with print(). This happens automatically when running interactively, but inside a loop or function, you’ll need to print() it yourself.
p

print(p)

  • Save it to disk with ggsave().
# Save png to disk
ggsave("Output/plot.png", p, width = 5, height = 5)
  • Briefly describe its structure with summary().
summary(p)
data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy, fl,
  class [234x11]
mapping:  x = ~displ, y = ~hwy, colour = ~factor(cyl)
faceting: <ggproto object: Class FacetNull, Facet, gg>
    compute_layout: function
    draw_back: function
    draw_front: function
    draw_labels: function
    draw_panels: function
    finish_data: function
    init_scales: function
    map_data: function
    params: list
    setup_data: function
    setup_params: function
    shrink: TRUE
    train_scales: function
    vars: function
    super:  <ggproto object: Class FacetNull, Facet, gg>
-----------------------------------
geom_point: na.rm = FALSE
stat_identity: na.rm = FALSE
position_identity 
  • Save a cached copy of it to disk, with saveRDS(). This saves a complete copy of the plot object, so you can easily re-create it with readRDS().
saveRDS(p, "Output/plot.rds")
q <- readRDS("Output/plot.rds")
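
The point about print() inside a loop or function can be sketched as follows. The helper plot_economy() is hypothetical and exists only for this illustration. It uses the .data pronoun to look up a column by its name given as a string.

```r
library(ggplot2)

# Hypothetical helper: draws displ against the chosen fuel economy column
plot_economy <- function(y_var) {
  p <- ggplot(mpg, aes(x = displ, y = .data[[y_var]])) +
    geom_point() +
    ylab(y_var)
  print(p)      # needed: auto-printing does not happen inside a function
  invisible(p)  # return the plot object without printing it again
}

# Draw one plot per variable
for (y_var in c("cty", "hwy")) {
  plot_economy(y_var)
}
```

Without the call to print(), the loop would run without drawing anything.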

Exercise G

Q1

Save any of the graphs you have produced earlier to your local Output folder, assuming that you have a sub-folder called Output in your current working directory.