Lecture 5 - Data Manipulation Advanced

Summary

#1 Data manipulation basics (Single Dataset)

flowchart TD
  A((dplyr)) --> AA{{selection}}
  AA -->|columns|AAA(select):::api
  AA -->|rows|AAB(filter):::api
  A -->AB{{transformation}}
  AB -->|columns|ABA(mutate):::api
  AB -->|rows|ABB(arrange):::api
  A -->AC{{aggregation}}
  AC -->ACA(summarize):::api
  ACA --o ACAA(group_by):::api
  classDef api fill:#f96,color:#fff

#2 Joining Data (Two Datasets)

flowchart TD
  A((dplyr)) --> AD{{join}}
  AD --> ADA{{mutating}}
  ADA --> ADAA(inner):::api
  ADA --> ADAB(outer):::api
  ADA --> ADAC(left):::api
  ADA --> ADAD(right):::api
  ADA --> ADAE(full):::api
  AD --> ADB{{filtering}}
  ADB --> ADBA(semi):::api
  ADB --> ADBB(anti):::api
  classDef api fill:#f96,color:#fff

#3 Pivot Data (Reshape Dataset)

flowchart TD
  A((tidyr)) -->B{{reshaping}}
  B --> C(pivot_wider):::api
  B --> D(pivot_longer):::api
  classDef api fill:#f96,color:#fff

1 Join data using dplyr

flowchart TD
  A((dplyr)) -->|two datasets|AD{{join}}
  AD --> ADA{{mutating}}
  ADA --> ADAA(inner):::api
  ADA --> ADAB(outer):::api
  ADA --> ADAC(left):::api
  ADA --> ADAD(right):::api
  ADA --> ADAE(full):::api
  AD --> ADB{{filtering}}
  ADB --> ADBA(semi):::api
  ADB --> ADBB(anti):::api
  classDef api fill:#f96,color:#fff

Previously, we focused on the five core verb functions in dplyr and got familiar with the pipe operator. Here, we will continue to explore a few more functions and discover what else we can do with dplyr.

It is rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in. This section will first introduce you to two important types of joins:

  • Mutating joins, which add new variables to one data frame from matching observations in another.
  • Filtering joins, which filter observations from one data frame based on whether or not they match an observation in another.

We’ll begin by discussing keys, the variables used to connect a pair of data frames in a join. We cement the theory with an examination of the keys in two simple data sets, then use that knowledge to start joining data frames together.

Let us load the tidyverse package first.

library(tidyverse)

In this section, we will be working with two data sets. The first one, order, contains details of orders placed by different customers. The second data set, customer, contains details of each customer. The lists below show the variables in each data set.

order

  • customer_id
  • order_date
  • amount

customer

  • customer_id
  • first_name
  • city

Let us import both data sets using read_delim, which allows us to read in files with any delimiter.

order <- read_delim('https://raw.githubusercontent.com/rsquaredacademy/datasets/master/order.csv', delim = ';')
order
customer <- read_delim('https://raw.githubusercontent.com/rsquaredacademy/datasets/master/customer.csv', delim = ';')
customer

1.1 Keys

Every join involves a pair of keys: a primary key and a foreign key. A primary key is a variable or set of variables that uniquely identifies each observation. When more than one variable is needed, the key is called a compound key. For example, customer records three pieces of data about each customer: their id, first name, and city. You can identify a customer by the id alone, making customer_id the primary key.
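
To make the idea of a compound key concrete, here is a minimal sketch using a hypothetical weather table (not part of our order/customer data): neither station nor date alone identifies a row, but the pair does.

```r
library(dplyr)

# Hypothetical data: each station is measured on each date,
# so only the pair (station, date) uniquely identifies a row
weather <- data.frame(
  station = c("A", "A", "B", "B"),
  date    = c("2024-01-01", "2024-01-02", "2024-01-01", "2024-01-02"),
  temp    = c(3.1, 4.5, 2.8, 3.9))

# count() accepts several variables; an empty result means
# no (station, date) pair occurs more than once
weather |>
  count(station, date) |>
  filter(n > 1)
```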

Exercise A

Q1

Identify the primary key in order.

A foreign key is a variable (or set of variables) that corresponds to a primary key in another table. For example: order$customer_id is a foreign key that corresponds to the primary key customer$customer_id.

Now that we’ve identified the primary keys in each table, it’s good practice to verify that they do indeed uniquely identify each observation. One way to do that is to count() the primary keys and look for entries where n is greater than one. This reveals that customer looks good:

customer |>   
  count(customer_id) |> 
  filter(n > 1)

You should also check for missing values in your primary keys - if a value is missing then it can’t identify an observation!

customer |>  
  filter(is.na(customer_id))

Exercise B

Q1

Verify your previously specified primary key in order.

To join data sets, we will need to use the family of *_join() functions provided in dplyr.

  • inner_join
  • left_join
  • right_join
  • full_join
  • semi_join
  • anti_join

They all have the same interface: they take a pair of data frames (\(x\) and \(y\)) and return a data frame. The order of the rows and columns in the output is primarily determined by \(x\).

1.2 Mutating joins

Inner join

The simplest type of join is the inner join. An inner join matches pairs of observations whenever their keys are equal. The most important property of an inner join is that unmatched rows are not included in the result. This means that inner joins are usually not appropriate for use in analysis because it’s too easy to lose observations.

We will use two sample data sets to illustrate how inner join works. You can view the example data sets below.

Age <- data.frame(
  Name = c("John", "Jenny", "Jacob"),
  Age = c(26, 24, 28))

Height <- data.frame(
  Name = c("John", "Jenny", "Janet"),
  Height = c(170, 174, 166))

Inner join will return all rows from Age where there are matching values in Height, and all columns from Age and Height. If there are multiple matches between Age and Height, all combinations of the matches are returned.

Age |> inner_join(Height, by = join_by(Name))

Exercise C

Q1

Using order and customer data sets, can you generate a data frame which contains both order and customer info for all customers who have placed orders in the past?

Outer joins

An inner join keeps observations that appear in both tables. An outer join keeps observations that appear in at least one of the tables. There are three types of outer joins:

  • A left join keeps all observations in x.
  • A right join keeps all observations in y.
  • A full join keeps all observations in x and y.

These joins work by adding an additional “virtual” observation to each table. This observation has a key that always matches (if no other key matches) and a value filled with NA.

The most commonly used join is the left join: you use this whenever you look up additional data from another table, because it preserves the original observations even when there isn’t a match. The left join should be your default join: use it unless you have a strong reason to prefer one of the others.
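
To see why the left join is the safer default, compare it directly with an inner join on our example data: Jacob has no height recorded, so an inner join silently drops him, while a left join keeps his row and fills Height with NA.

```r
library(dplyr)

# Same example data as above
Age <- data.frame(
  Name = c("John", "Jenny", "Jacob"),
  Age = c(26, 24, 28))
Height <- data.frame(
  Name = c("John", "Jenny", "Janet"),
  Height = c(170, 174, 166))

Age |> inner_join(Height, by = join_by(Name)) # 2 rows: Jacob is dropped
Age |> left_join(Height, by = join_by(Name))  # 3 rows: Jacob kept, Height = NA
```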

Left join

In our example, left join returns all rows from Age, and all columns from Age and Height. Rows in Age with no match in Height will have NA in the new columns. If there are multiple matches between Age and Height, all combinations of the matches are returned.

Age |> left_join(Height, by = join_by(Name))

Right join

In our example, right join returns all rows from Height, and all columns from Age and Height. Rows in Height with no match in Age will have NA in the new columns. If there are multiple matches between Age and Height, all combinations of the matches are returned.

Age |> right_join(Height, by = join_by(Name))

Full join

Full join returns all rows and all columns from both Age and Height. Where there is no match, NA is returned for the missing values.

Age |> full_join(Height, by = join_by(Name))

Exercise D

Working with our order and customer data, first indicate the *_join() function that should be used for each of the following tasks and then implement them in your own R session.

Q1

Obtain data that includes all order records and their corresponding customers’ info - no need to include customers who did not place an order.

Q2

Obtain data that includes all customers as well as their order info (if available) - if an order is placed, include its info, if no order is placed, include NA for this customer.

Q3

Obtain data that includes all customers and all orders. Fill in missing values with NA for non-matching items.

1.3 Filtering joins

Filtering joins match observations in the same way as mutating joins (i.e., inner and outer joins), but they affect the observations, not the variables. There are two types:

  • semi_join(x, y) keeps all observations in x that have a match in y.
  • anti_join(x, y) drops all observations in x that have a match in y.

Semi join

Semi join connects the two tables like a mutating join, but instead of adding new columns, it only keeps the rows in x that have a match in y.

Only the existence of a match is important; it doesn’t matter which observation is matched. This means that filtering joins never duplicate rows like mutating joins do.
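
We can illustrate the no-duplication property with a small sketch. Height2 below is a made-up variant of Height with a duplicated key: the inner join repeats John once per match, while the semi join keeps him exactly once.

```r
library(dplyr)

Age <- data.frame(
  Name = c("John", "Jenny", "Jacob"),
  Age = c(26, 24, 28))

# Hypothetical table with a duplicated key, for illustration only
Height2 <- data.frame(
  Name   = c("John", "John", "Jenny"),
  Height = c(170, 171, 174))

Age |> inner_join(Height2, by = join_by(Name)) # 3 rows: John appears twice
Age |> semi_join(Height2, by = join_by(Name))  # 2 rows: John appears once
```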

For example, semi join returns all rows from Age that have matching values in Height, keeping just the columns from Age. A semi join differs from an inner join: an inner join returns one row of Age for each matching row of Height, whereas a semi join never duplicates rows of Age.

Age |> semi_join(Height, by = join_by(Name))

Anti join

The inverse of a semi join is an anti join. An anti join keeps the rows that don’t have a match:

Anti joins are useful for diagnosing join mismatches. In our example, anti join returns all rows from Age that have no matching values in Height, keeping just the columns from Age.

Age |> anti_join(Height, by = join_by(Name))

Exercise E

Working with our order and customer data, first indicate the *_join() function that should be used for each of the following tasks and then implement them in your own R session.

Q1

Identify orders for which the corresponding customers are present in the data.

Q2

Identify customers who have not placed orders.

2 Pivot data using tidyr

flowchart TD
  A((tidyr)) -->B{{reshaping}}
  B --> C(pivot_wider):::api
  B --> D(pivot_longer):::api
  classDef api fill:#f96,color:#fff

In addition, we will learn to pivot data using the tidyr package, which is also part of the tidyverse meta-package.

Let us load the tidyverse package first.

library(tidyverse)

dplyr pairs nicely with tidyr, which enables you to swiftly convert between different data formats for plotting and analysis.

The package tidyr addresses the common problem of wanting to reshape your data for plotting and use by different R functions. Sometimes we want data sets where we have one row per observation. Sometimes we want a data frame where each observation type has its own column, and rows are instead more aggregated groups - like surveys, where each column represents an answer. Moving back and forth between these formats is nontrivial, and tidyr gives you tools for this and more sophisticated data manipulation.

To learn more about tidyr, you may want to check out this cheatsheet about tidyr.

2.1 Long vs Wide table format

The ‘long’ format is where:

  • each column is a variable
  • each row is an observation

In the ‘long’ format, you usually have 1 column for the observed variable and the other columns are ID variables.

In the ‘wide’ format, a row could be, for example, a research subject for which you have multiple observation variables containing the same type of data, such as responses to a set of survey questions, repeated observations over time, or a mix of both.

You may find data input to be simpler in the ‘wide’ format, and some other applications may prefer it. However, many of R’s functions have been designed assuming you have ‘long’ format data. This tutorial will help you efficiently transform your data regardless of the original format.

The choice of data format affects readability. For humans, the wide format is often more intuitive, since we can often see more of the data on the screen due to its shape. However, the long format is more machine readable and is closer to the formatting of databases. The ID variables in our data frames are similar to the fields in a database and observed variables are like the database values.
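
As a concrete illustration, here is the same made-up survey data (not part of our course data sets) laid out both ways: wide has one row per subject and one column per question; long has one row per subject-question pair.

```r
# Wide: one row per subject, one column per question
survey_wide <- data.frame(
  subject = c("s1", "s2"),
  q1      = c(4, 5),
  q2      = c(3, 2))

# Long: one row per observation; "question" is the ID variable,
# "answer" is the single observed-value column
survey_long <- data.frame(
  subject  = c("s1", "s1", "s2", "s2"),
  question = c("q1", "q2", "q1", "q2"),
  answer   = c(4, 3, 5, 2))
```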

2.2 Long to Wide with pivot_wider

Let us see this in action using the built-in data ToothGrowth. We will be using a summary of the data, taking the mean tooth length by supplement and dose.

avg_ToothGrowth <- ToothGrowth |> 
  group_by(supp, dose) |> 
  summarise(avg_len = mean(len), .groups = "drop")
avg_ToothGrowth

Now, to make this long data wide, we use pivot_wider from tidyr to turn the supplement types into columns. In addition to our data table we provide pivot_wider with two arguments: names_from describes which column to use for the names of the output columns, and values_from tells it which column to get the cell values from. We’ll use a pipe so we can omit the data argument.

supp_wide <- avg_ToothGrowth |> 
  pivot_wider(names_from = supp,
              values_from = avg_len)
supp_wide

Exercise F

Q1

Convert avg_ToothGrowth to a wide format by turning dose into columns.

Q2

Work with the built-in data OrchardSprays. Convert this data frame to a wide format by turning treatment into columns.

2.3 Wide to Long with pivot_longer

What if we had the opposite problem, and wanted to go from a wide to a long format? For that, we use pivot_longer, which will increase the number of rows and decrease the number of columns. We provide the function with three arguments: cols, the columns we want to pivot into the long format; names_to, a string specifying the name of the column to create from the data stored in the column names; and values_to, also a string, specifying the name of the column to create from the data stored in the cell values. So, to go backwards from supp_wide, excluding dose from the pivot, we would do the following:

supp_long <- supp_wide |> 
  pivot_longer(cols = -dose, # exclude the dose column
               names_to = "supp", # name is a string!
               values_to = "avg_len") # also a string
supp_long

We could also have used a specification for what columns to include. This can be useful if you have a large number of identifying columns, and it’s easier to specify what to gather than what to leave alone. And if the columns are adjacent to each other, we don’t even need to list them all out - we can use the : operator!

supp_long <- supp_wide |> 
  pivot_longer(cols = OJ:VC, # columns from OJ to VC
               names_to = "supp",
               values_to = "avg_len")
supp_long

Exercise G

Q1

Convert both data sets you created in Exercise F to a long format.

Q2

Work with the data tidyr::world_bank_pop. Convert it to a long format by turning the years into rows and place them under a column named year. Store population in a new column called population, and name the resulting long data frame pop_long.

2.4 Exporting data

Similar to the read_csv() function used for reading CSV files into R, there is a write_csv() function that generates CSV files from data frames.

Before using write_csv(), make sure you have created a DataOutput folder in the current working directory to store this generated data set. We don’t want to write generated data sets in the same directory as our raw data. It’s good practice to keep them separate. The Data folder should only contain the raw, unaltered data, and should be left alone to make sure we don’t delete or modify it. In contrast, our script will generate the contents of the DataOutput directory, so even if the files it contains are deleted, we can always re-generate them.
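
Rather than creating the folder by hand, one option is to create it from the script itself, so the code also works on a fresh copy of the project:

```r
# Create the output folder if it does not already exist;
# the dir.exists() check avoids a warning on re-runs
if (!dir.exists("DataOutput")) {
  dir.create("DataOutput")
}
```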

We can now save the table generated above in our DataOutput folder:

write_csv(supp_wide, "DataOutput/supp_wide.csv")