31  Data Cleaning

Data rarely comes in a clean and model-ready format. Cleaning data might involve handling duplicates, missing values, erroneous data, and outliers.

31.1 Handling Duplicates

Identifying and handling duplicate records is crucial to ensure the accuracy of analysis. Duplicates can skew results, lead to incorrect conclusions, and generally compromise the integrity of your data analysis. Thus, detecting and appropriately dealing with duplicates is a fundamental step in preparing your dataset.

Duplicates refer to rows in a dataset where certain or all fields are repeated. The approach to handling duplicates depends on the context: sometimes all duplicate records need to be removed, while other times, duplicates may be valid but need verification.
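When duplicates need verification rather than outright removal, it helps to see every copy of each repeated row, not just the later ones. A minimal base-R sketch (using a small hypothetical transactions table) that flags all copies by combining `duplicated()` with its `fromLast = TRUE` variant:

```r
# Hypothetical transactions with one accidentally repeated record
transactions <- data.frame(
  TransactionID = c(1, 2, 2, 3),
  Amount = c(100, 150, 150, 200))

# duplicated() alone marks only the later copies; adding fromLast = TRUE
# marks every copy, so all repeated records can be inspected together
all_copies <- transactions[duplicated(transactions) |
                             duplicated(transactions, fromLast = TRUE), ]
all_copies
```

Inspecting `all_copies` before deleting anything lets you confirm the repeats really are data-entry errors and not legitimate repeated transactions.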

Typically, the following functions are helpful for identifying and handling duplicates.

  • duplicated(): Returns a logical vector indicating whether each row of a dataframe (or selected columns) is a duplicate of a row with a smaller subscript.
  • unique(): Removes duplicate rows from a dataframe based on all or selected columns.
  • dplyr::distinct(): Removes duplicate rows from a dataframe; you can specify which columns to consider when identifying duplicates, making it versatile for datasets where only certain fields need to be unique.

Let’s say you have a dataset of customer transactions where each transaction is supposed to be unique, but due to data collection errors, some transactions have been recorded more than once.

# Load the dplyr package for filter()
library(dplyr)

# Create a sample dataframe
transactions <- data.frame(
  TransactionID = c(1, 2, 2, 3, 4, 5, 5, 5),
  CustomerID = c(101, 102, 102, 103, 104, 105, 105, 105),
  Amount = c(100, 150, 150, 200, 250, 300, 300, 300))

# Checking for duplicates
duplicates <- duplicated(transactions)

# Displaying duplicates
transactions[duplicates, ]

# Removing duplicates
transactions_clean <- transactions[!duplicated(transactions), ]

# Alternatively, using filter()
transactions_clean <- transactions |> filter(!duplicated(transactions))

# Displaying the cleaned data
transactions_clean

unique() removes duplicates in a similar way.

# Using unique()
transactions_unique <- unique(transactions)

# Display the cleaned data
transactions_unique

Now, let’s look at an example with dplyr::distinct(). Imagine you have a dataset recording customer interactions, and you want to ensure there are no duplicate records based on customer IDs and the interaction dates.

# Load the dplyr package
library(dplyr)

# Create a sample dataframe
interactions <- data.frame(
  CustomerID = c(101, 102, 102, 103, 104, 105, 105, 105),
  InteractionDate = as.Date(c('2021-01-01', '2021-01-02', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-05', '2021-01-05')),
  IssueResolved = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE))

# Removing duplicates based on CustomerID and InteractionDate
unique_interactions <- interactions |> 
  distinct(CustomerID, InteractionDate, .keep_all = TRUE)

# Display the cleaned data
unique_interactions
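Note that distinct() keeps the first row it encounters within each group of duplicates. When the duplicated rows differ in other columns (as with IssueResolved above), sorting before deduplicating controls which record survives. A short sketch, assuming you prefer to keep the resolved interaction:

```r
library(dplyr)

# Three records of the same interaction; only one is marked resolved
interactions <- data.frame(
  CustomerID = c(105, 105, 105),
  InteractionDate = as.Date(c('2021-01-05', '2021-01-05', '2021-01-05')),
  IssueResolved = c(FALSE, FALSE, TRUE))

# Sort so TRUE comes first, then distinct() keeps that row
resolved_first <- interactions |>
  arrange(desc(IssueResolved)) |>
  distinct(CustomerID, InteractionDate, .keep_all = TRUE)
resolved_first
```

The same arrange-then-distinct pattern works for keeping, say, the most recent record when duplicates carry different timestamps.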

Poll Time

Q1

Suppose you have a dataset containing sales transactions data for a retail store. The dataset includes columns for TransactionID, CustomerID, TransactionDate, and Amount. Due to an error in data entry, some transactions have been recorded more than once.

Given the dataset described, which dplyr function and approach would you use to ensure that each transaction is uniquely counted based on TransactionID?

  1. Use filter() to select rows based on a specific condition that identifies duplicates.
  2. Use distinct() to remove duplicate rows based solely on the TransactionID column.
  3. Use distinct() with all columns specified to ensure that all data fields are considered in removing duplicates.
  4. Use summarise() to aggregate data by TransactionID and calculate the sum of Amounts for duplicates.

Access the live poll here: https://PollEv.com/weihongni276

31.2 Handling Missing Values

In R, you can handle missing values using functions such as:

  • na.omit(): Removes all rows from the data frame (or object) that contain any missing values (NA).
  • tidyr::drop_na(): Drops all rows that contain NA values in specified columns, or across the entire data frame if no columns are specified.
  • tidyr::replace_na(): Replaces NA values with specified values, allowing different replacement values for different columns.

# Example data frame
data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", NA),
  Age = c(25, NA, 30, 22),
  Salary = c(NA, 50000, 60000, 45000))

# Remove rows with any NAs
clean_data <- na.omit(data)

# Display the cleaned data
clean_data

# Load tidyr package
library(tidyr)

# Example data frame
data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David"),
  Age = c(25, NA, 30, 22),
  Salary = c(NA, 50000, 60000, 45000))

# Remove rows where NAs appear in the 'Salary' column
clean_data <- data |> drop_na(Salary)

# Display the cleaned data
clean_data
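Called without any column names, drop_na() removes every row containing an NA anywhere, which matches the behaviour of na.omit(). A quick sketch on a small hypothetical data frame:

```r
library(tidyr)

data <- data.frame(
  Name = c("Alice", "Bob"),
  Age = c(25, NA))

# No columns specified: any row with an NA in any column is dropped
all_complete <- data |> drop_na()
all_complete
```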

# Load dplyr (for mutate()) and tidyr (for replace_na())
library(dplyr)
library(tidyr)

# Example data frame
data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David"),
  Age = c(25, NA, 30, 22),
  Salary = c(NA, 50000, 60000, 45000))

# Replace NAs in 'Age' with the average age and in 'Salary' with the median salary
clean_data <- data |>
  mutate(
    Age = replace_na(Age, mean(Age, na.rm = TRUE)),
    Salary = replace_na(Salary, median(Salary, na.rm = TRUE)))

# Display the cleaned data
clean_data
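Before choosing between removal and replacement, it is worth measuring how much data each column would lose. A minimal sketch that counts NAs per column with base R:

```r
# Same hypothetical data frame as above
data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David"),
  Age = c(25, NA, 30, 22),
  Salary = c(NA, 50000, 60000, 45000))

# is.na() gives a logical matrix; colSums() totals the NAs per column
na_counts <- colSums(is.na(data))
na_counts
```

If the counts are small relative to the number of rows, dropping rows may be acceptable; if a column has many NAs, replacement (or a different imputation strategy) preserves more data.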

Poll Time

Q1

Imagine you have a dataset containing employment information for a company. The dataset includes the following columns: Employee Name, Age, Department, and Salary. There are missing values in both the Department and Salary columns.

Given the dataset described, which data cleaning function would you use if you need to preserve as much data as possible, yet ensure that analyses involving department and salary are not affected by missing values?

  1. Use na.omit() to remove all rows with any missing values.
  2. Use tidyr::drop_na() and specify only the Department and Salary columns.
  3. Use tidyr::replace_na() to replace missing departments with the most common department in this dataset and missing salaries with the median salary.
  4. Do nothing and leave the dataset as is.

Access the live poll here: https://PollEv.com/weihongni276