Data rarely comes in a clean and model-ready format. Cleaning data might involve handling duplicates, missing values, erroneous data, and outliers.
31.1 Handling Duplicates
Identifying and handling duplicate records is crucial to ensure the accuracy of analysis. Duplicates can skew results, lead to incorrect conclusions, and generally compromise the integrity of your data analysis. Thus, detecting and appropriately dealing with duplicates is a fundamental step in preparing your dataset.
Duplicates refer to rows in a dataset where certain or all fields are repeated. The approach to handling duplicates depends on the context: sometimes all duplicate records need to be removed, while other times, duplicates may be valid but need verification.
Typically, the following functions are helpful in identifying and handling duplicates.
duplicated(): This function returns a logical vector indicating whether each row of a dataframe or specific columns are duplicates of rows with smaller subscripts.
unique(): Removes duplicate rows from a dataframe, comparing all columns; to deduplicate on selected columns, subset those columns first.
dplyr::distinct(): Removes duplicate rows from a dataframe, letting you specify which columns to consider when identifying duplicates. This makes it versatile for datasets where only certain fields need to be unique.
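To see what duplicated() actually returns, here is a minimal sketch on a small illustrative vector. Combining it with fromLast = TRUE flags every copy of a repeated value, which is handy for inspecting duplicates before deleting anything:

```r
# A small vector with repeats, for illustration
x <- c("a", "b", "a", "c", "b")

# duplicated() flags each element that repeats an earlier one
duplicated(x)
#> FALSE FALSE TRUE FALSE TRUE

# OR-ing with fromLast = TRUE flags every copy of a repeated value,
# so you can inspect all duplicates before removing any
x[duplicated(x) | duplicated(x, fromLast = TRUE)]
#> "a" "b" "a" "b"
```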
Let’s say you have a dataset of customer transactions where each transaction is supposed to be unique, but due to data collection errors, some transactions have been recorded more than once.
# Load the dplyr package for filter()
library(dplyr)
# Removing duplicates
transactions_clean <- transactions[!duplicated(transactions), ]

# Alternatively, using filter()
transactions_clean <- transactions |>
  filter(!duplicated(transactions))

# Displaying the cleaned data
transactions_clean
unique() removes duplicates in a similar way.
# Using unique()
transactions_unique <- unique(transactions)

# Display the cleaned data
transactions_unique
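Because unique() always compares entire rows, deduplicating on selected columns requires subsetting those columns first (which drops the others). A minimal sketch with a made-up visits data frame:

```r
# Illustrative data: customer 101 appears on two different dates
visits <- data.frame(
  CustomerID = c(101, 101, 102),
  VisitDate  = as.Date(c("2021-01-01", "2021-01-02", "2021-01-01"))
)

# Unique rows across all columns: every row differs, so nothing is removed
unique(visits)

# Unique values of a selected column only (other columns are dropped)
unique(visits["CustomerID"])
```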
Now, let’s look at an example with dplyr::distinct(). Imagine you have a dataset recording customer interactions, and you want to ensure there are no duplicate records based on customer IDs and the interaction dates.
# Load the dplyr package
library(dplyr)

# Create a sample dataframe
interactions <- data.frame(
  CustomerID = c(101, 102, 102, 103, 104, 105, 105, 105),
  InteractionDate = as.Date(c('2021-01-01', '2021-01-02', '2021-01-02',
                              '2021-01-03', '2021-01-04', '2021-01-05',
                              '2021-01-05', '2021-01-05')),
  IssueResolved = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)
)

# Removing duplicates based on CustomerID and InteractionDate
unique_interactions <- interactions |>
  distinct(CustomerID, InteractionDate, .keep_all = TRUE)

# Display the cleaned data
unique_interactions
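One detail worth noting: without .keep_all = TRUE, distinct() returns only the columns named in the call. A quick sketch with a small illustrative data frame:

```r
library(dplyr)

# Illustrative data: customer 102 appears twice
interactions <- data.frame(
  CustomerID = c(101, 102, 102),
  IssueResolved = c(TRUE, TRUE, FALSE)
)

# Only CustomerID survives; IssueResolved is dropped
interactions |> distinct(CustomerID)

# .keep_all = TRUE keeps all columns, taking the first row of each group
interactions |> distinct(CustomerID, .keep_all = TRUE)
```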
Poll Time
Q1
Suppose you have a dataset containing sales transactions data for a retail store. The dataset includes columns for TransactionID, CustomerID, TransactionDate, and Amount. Due to an error in data entry, some transactions have been recorded more than once.
Given the dataset described, which dplyr function and approach would you use to ensure that each transaction is uniquely counted based on TransactionID?
Use filter() to select rows based on a specific condition that identifies duplicates.
Use distinct() to remove duplicate rows based solely on the TransactionID column.
Use distinct() with all columns specified to ensure that all data fields are considered in removing duplicates.
Use summarise() to aggregate data by TransactionID and calculate the sum of Amounts for duplicates.
31.2 Handling Missing Values
In R, you can handle missing data using functions such as:
na.omit(): Removes all rows from the data frame (or object) that contain any missing values (NA).
tidyr::drop_na(): Drops all rows that contain NA values in specified columns or across the entire data frame if no specific columns are mentioned.
tidyr::replace_na(): Replaces NA values in a data frame with specified values, allowing for different replacement values for different columns.
# Example data frame
data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", NA),
  Age = c(25, NA, 30, 22),
  Salary = c(NA, 50000, 60000, 45000)
)

# Remove rows with any NAs
clean_data <- na.omit(data)

# Display the cleaned data
clean_data
# Load tidyr package
library(tidyr)

# Example data frame
data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David"),
  Age = c(25, NA, 30, 22),
  Salary = c(NA, 50000, 60000, 45000)
)

# Remove rows where NAs appear in the 'Salary' column
clean_data <- data |>
  drop_na(Salary)

# Display the cleaned data
clean_data
# Load dplyr (for mutate()) and tidyr (for replace_na())
library(dplyr)
library(tidyr)

# Example data frame
data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David"),
  Age = c(25, NA, 30, 22),
  Salary = c(NA, 50000, 60000, 45000)
)

# Replace NAs in 'Age' with the average age and in 'Salary' with the median salary
clean_data <- data |>
  mutate(
    Age = replace_na(Age, mean(Age, na.rm = TRUE)),
    Salary = replace_na(Salary, median(Salary, na.rm = TRUE))
  )

# Display the cleaned data
clean_data
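Before choosing between dropping rows and imputing values, it helps to see how much data is actually missing. A minimal sketch using is.na() with colSums() and complete.cases():

```r
# Example data frame with scattered NAs
data <- data.frame(
  Name   = c("Alice", "Bob", "Charlie", "David"),
  Age    = c(25, NA, 30, 22),
  Salary = c(NA, 50000, 60000, 45000)
)

# Count of missing values per column
colSums(is.na(data))
#> Name: 0, Age: 1, Salary: 1

# Share of rows that na.omit() would discard
mean(!complete.cases(data))
#> 0.5
```

Here na.omit() would throw away half the rows even though each column is only 25% missing, which is why targeted drop_na() or imputation often preserves more data.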
Poll Time
Q1
Imagine you have a dataset containing employment information for a company. The dataset includes the following columns: Employee Name, Age, Department, and Salary. There are missing values in both the Department and Salary columns.
Given the dataset described, which data cleaning function would you use if you need to preserve as much data as possible, yet ensure that analyses involving department and salary are not affected by missing values?
Use na.omit() to remove all rows with any missing values.
Use tidyr::drop_na() and specify only the Department and Salary columns.
Use tidyr::replace_na() to replace missing departments with the most common department in the dataset and missing salaries with the median salary.
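Base R has no built-in mode function, so replacing a missing category with its most common value takes a small helper. A sketch of that idea, using a made-up employees data frame for illustration:

```r
library(tidyr)

# Illustrative data with a missing department
employees <- data.frame(
  Name = c("Ann", "Ben", "Cara", "Dan"),
  Department = c("Sales", "Sales", NA, "IT")
)

# Most frequent non-missing department ("Sales" here)
mode_dept <- names(which.max(table(employees$Department)))

# Fill the missing department with that value
employees$Department <- replace_na(employees$Department, mode_dept)
employees
```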