Lecture 7 - Data Visualization Basics

Summary

flowchart TD
A((Gallery)) 
  A -->AA(Scatter):::api
  A -->AB(Line):::api
  A -->AC(Bar):::api
  A -->AD(Histogram):::api
  A -->AE(Boxplot):::api

flowchart TD
  A((ggplot2 Components))
  A -->AA(aesthetics):::apid
  A -->AB(geometric):::apid
  A -->AC(layer):::apid
  A -->AF(facet):::apid
  A -->AD(stats)
  A -->AE(theme)
  classDef api fill:#f96,color:#fff
  classDef apid fill:#ab6,color:#fff

1 Introduction to Data Visualization

Like descriptive statistics, the purpose of a visualization is to answer a question about a data set of interest. So naturally, the first thing to do before creating a visualization is to formulate the question about the data you are trying to answer. A good visualization will clearly answer your question without distraction; a great visualization will suggest even what the question was itself without additional explanation. Imagine your visualization as part of a poster presentation for a project; even if you aren’t standing at the poster explaining things, an effective visualization will convey your message to the audience.

With the visualizations we will cover in this chapter, we will again focus on addressing descriptive and exploratory questions. Be careful to not answer any predictive, inferential, causal or mechanistic questions with the visualizations presented here, as we have not learned the tools necessary to do that properly just yet.

As with most coding tasks, it is totally fine (and quite common) to make mistakes and iterate a few times before you find the right visualization for your data and question.

Graphics is a great strength of R. The graphics package is part of the standard distribution and contains many useful functions for creating a variety of graphic displays. The base functionality has been expanded and made easier with ggplot2, part of the tidyverse of packages. In this module, we will focus on examples using ggplot2, and we will occasionally suggest other packages.

As usual, let us first load the tidyverse package by calling

library(tidyverse)

1.1 `ggplot2` basics

While the package is called ggplot2, the primary plotting function in the package is called ggplot. It is important to understand the basic pieces of a ggplot2 graph. Let us see a simple example:

data_ex1 <- data.frame(x = 1:5, y = 1:5)
ggplot(data_ex1, aes(x, y)) +
  geom_point()

In the preceding example, you can see that we pass data into ggplot, then define how the graph is created by stacking together small phrases that describe some aspect of the plot. This stacking together of phrases is part of the “grammar of graphics” ethos (that’s where the gg comes from). To learn more, you can read “A Layered Grammar of Graphics” written by ggplot2 author Hadley Wickham. The “grammar of graphics” concept originated with Leland Wilkinson, who articulated the idea of building graphics up from a set of primitives (i.e., verbs and nouns).

As we talk about ggplot graphics, it’s worth defining the components of a ggplot graph:

aesthetics
The aesthetics, or aesthetic mappings, communicate to ggplot which fields in the source data get mapped to which visual elements in the graphic. This is the aes line in a ggplot call.
geometric object functions
These are geometric objects that describe the type of graph being created. These start with geom_ and examples include geom_line, geom_boxplot, and geom_point, along with dozens more.
layer
A layer is a combination of data, aesthetics, a geometric object, a stat, and other options to produce a visual layer in the ggplot graphic.
facet functions
Facets are subplots where each small plot represents a subgroup of the data. The faceting functions include facet_wrap and facet_grid.
stats
Stats are statistical transformations that are done before displaying the data. Not all graphs will have stats, but a few common stats are stat_ecdf (the empirical cumulative distribution function) and stat_identity, which tells ggplot to pass the data without doing any stats at all.
themes
Themes are the visual elements of the plot that are not tied to data. These might include titles, margins, legend locations, or font choices.

“Long” Versus “Wide” Data with ggplot

One of the first sources of confusion for new ggplot users is that they are inclined to reshape their data to be “wide” before plotting it. “Wide” here means every variable they are plotting is its own column in the underlying data frame. This is an approach that many users develop while using Excel and then bring with them to R. ggplot works most easily with “long” data where additional variables are added as rows in the data frame rather than columns. The great side effect of adding more measurements as rows is that any properly constructed ggplot graphs will automatically update to reflect the new data without changing the ggplot code. If each additional variable were added as a column, then the plotting code would have to be changed to introduce additional variables. This idea of “long” versus “wide” data will become more obvious in the examples in the rest of this module.

1.2 Creating a Scatter Plot

A scatter plot is used when you have paired observations: \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\). It’s associated with a question: Is there a relationship between X and Y? We can plot the data by calling ggplot, passing in the data frame, and invoking a geometric point function:

ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point()

A scatter plot is a common first attack on a new dataset. It’s a quick way to see the relationship, if any, between \(x\) and \(y\).

Plotting with ggplot requires telling ggplot what data frame to use, then what type of graph to create, and which aesthetic mapping (aes) to use. The aes in this case defines which field from iris goes into which axis on the plot. Then the command geom_point communicates that you want a point graph, as opposed to a line or other type of graphic.

Exercise A

Use the built-in mtcars data set to create a scatter plot with horsepower (hp) on the x-axis and fuel economy (mpg) on the y-axis.

1.3 Creating a Line Plot

Line plots are used to visualize trends with respect to an independent, ordered quantity (e.g. time). A typical question could be: Does the X change over time, and are there any interesting patterns to note?

data_ex2 <- data.frame(year = seq(2000, 2021),
                       value = rnorm(22, mean = 80, sd = 2)) # generate 22 random values from a normal distribution
ggplot(data_ex2, aes(x = year, y = value)) + 
  geom_line()

You can also add points using geom_point().

ggplot(data_ex2, aes(x = year, y = value)) + 
  geom_line() +
  geom_point()

Note it’s common with ggplot to split the command on multiple lines, ending each line with a + so that R knows that the command will continue on the next line.

Exercise B

Work with the Weekly data from the ISLR package.

Obtain the average weekly return in each year and store the values in a data frame called avg_return.

Draw a line graph of the average weekly percentage return overtime from 1990 to 1999, based on the data frame ave_return.

1.4 Creating a Bar Plot

Bar plots are good for visualizing comparisons of amounts. A typical question can be something like: Do cars with more cylinders have lower miles per gallon? Hence, we can make a bar plot of the average miles per gallon by number of cylinders, we first create a data frame containing the mean values.

avg_mpg <- mtcars |> 
  group_by(cyl) |> 
  summarize(mean_mpg = mean(mpg))
avg_mpg

Then with ggplot2 we can use geom_col(). Notice the difference in the output when the \(x\) variable is continuous and when it is discrete:

# Bar plot of values. This uses the avg_mpg data frame, with the
# "cyl" column for x values and the "mean_mpg" column for y values.
ggplot(avg_mpg, aes(x = cyl, y = mean_mpg)) +
  geom_col()

# Convert the x variable to a factor, so that it is treated as discrete
ggplot(avg_mpg, aes(x = factor(cyl), y = mean_mpg)) +
  geom_col()

ggplot2 can also be used to plot the count of the number of data rows in each category, by using geom_bar() instead of geom_col(). Once again, notice the difference between a continuous \(x\)-axis and a discrete one. For some kinds of data, it may make more sense to convert the continuous \(x\) variable to a discrete one, with the factor() function.

# Bar plot of counts. This uses the mtcars data frame, with the 
# "cyl" column for x position. The y position is calculated by 
# counting the number of rows for each value of cyl.
ggplot(mtcars, aes(x = cyl)) +
  geom_bar()

# Bar graph of counts
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

Exercise C

Create a data frame with values of mean Sepal.Length grouped by Species from the iris data. Assign the data frame to an object named avg_sepal.

Draw a bar plot comparing the mean Sepal.Length among different iris Species.

Draw a bar plot counting the number of each Species in the iris data.

Tip

Alternatively, you can use geom_bar() to create the bar chart with the y-axis displaying mean_mpg, but you will need to set additional arguments in the geom_bar() function.

ggplot(avg_mpg, aes(x = cyl, y = mean_mpg)) +
  geom_bar(stat = "identity")

By default, geom_bar() uses the stat = “count” argument, which automatically counts the number of observations in each category of a factor.

geom_col() is essentially a shortcut for geom_bar(stat = "identity"), where it directly takes the y-values provided without performing any statistical transformation.

While geom_bar() and geom_col() can be used to achieve similar visual outcomes, their default functionalities cater to different types of data input - raw and summarized data, respectively.

1.5 Creating a Histogram

Histograms are used to visualize the distribution of a quantitative variable, i.e., all its possible values and how often they occur over different intervals. It is associated with the question: what is a typical value for miles per gallon? What’s the range of it?

With the ggplot2, you can produce a histogram using geom_histogram().

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

When you create a histogram without specifying the bin width, ggplot() prints out a message telling you that it’s defaulting to 30 bins, and to pick a better bin width. This is because it’s important to explore your data using different bin widths; the default of 30 bins may or may not show you something useful about your data. Now let us try a wider bin width which corresponds to smaller number of bins.

# With wider bins
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 4)

Exercise D

Draw a histogram of the Sepal.Width in the iris dataset. Find an appropriate binwidth that can nicely represent your data.

Draw a histogram of the Petal.Length in the iris dataset. Find an appropriate binwidth that can nicely represent your data.

1.6 Creating a Boxplot

Boxplots are another tool for visualizing data distributions, so you can use them to answer similar questions from above. The following diagram depicts how a histogram and a box plot is created.

A boxplot directly displays summary statistics including first quartile, median, and third quartile. The upper whisker extends from the hinge to the largest value no further than 1.5 * IQR from the hinge (where IQR is the inter-quartile range, or distance between the first and third quartiles). The lower whisker extends from the hinge to the smallest value at most 1.5 * IQR of the hinge. Data beyond the end of the whiskers are called “outlying” points and are plotted individually. Boxplots are especially useful when comparing distributions across different categories or groups side by side.

ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

It’s also possible to make box plots for a combination of two grouping variables, with the interaction() function:

ggplot(mtcars, aes(x = interaction(factor(cyl), factor(am)), 
                   y = mpg)) +
  geom_boxplot()

Exercise E

Sketch boxplots of Sepal.Length grouped by different iris Species.

Sketch boxplots of Petal.Width grouped by different iris Species.