# Create a sample dataframe
transactions <- data.frame(
  CustomerID = c(101, 102, 102, 103, 104, 105, 105, 105),
  Age = c(22, 45, 45, 18, 62, 34, 34, 34),
  TransactionDate = as.Date(c('2021-01-01', '2021-12-02', '2022-02-05', '2023-01-03', '2023-11-14', '2024-02-02', '2024-04-15', '2024-07-25')),
  ItemCategory = c("beauty", "grocery", "electronics", "grocery", "home", "electronics", "grocery", "toys"),
  Amount = c(143, 24, 365, 27, 88, 589, 90, 104))
transactions
32 Feature Engineering
Feature engineering is a critical step in the predictive modeling process. It might involve:
- Removing unnecessary columns
- Transforming raw data into features that can better represent the underlying problem to predictive models, enhancing model accuracy and performance
- Scaling or standardizing the data
The goal is to establish a set of features that capture important patterns in the data, which may not be readily apparent in the raw data itself.
We have learned that dplyr is a powerful package in R that facilitates data manipulation and transformation. It is particularly useful for feature engineering because it allows you to efficiently create new variables, modify existing ones, and condense or expand information to better suit modeling needs.
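As a small illustration of the scaling and standardizing step listed above, the following is a minimal sketch using `mutate()`. The data frame `toy` and its `Amount` values are hypothetical and exist only for this example; z-score and min-max scaling are two common choices, not the only ones.

```r
library(dplyr)

# Hypothetical toy data; names and values are assumptions for illustration only
toy <- data.frame(Amount = c(10, 20, 30, 40, 50))

toy <- toy |>
  mutate(
    # z-score: subtract the mean, then divide by the sample standard deviation
    AmountStd = (Amount - mean(Amount)) / sd(Amount),
    # min-max scaling: the smallest value maps to 0 and the largest to 1
    AmountMinMax = (Amount - min(Amount)) / (max(Amount) - min(Amount))
  )

toy
```

After standardizing, `AmountStd` has mean 0 and standard deviation 1, which puts features measured in different units on a comparable scale.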
Suppose you are working with a dataset of retail store transactions that includes each transaction’s date, customer age, purchased item category, and the amount spent. You want to predict future sales based on these features, but first, you need to engineer features that better capture customer purchasing behaviors and seasonal trends.
- Creating Time-Based Features:
Extracting day of the week, month, and year from transaction dates can help capture seasonal trends and weekly cycles in purchasing behavior.
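In base R, date parts can be read from a `POSIXlt` object, whose fields use offsets that are easy to overlook. The sketch below uses a single example date (chosen only for illustration) to show what each field holds and how `format()` gives the same parts as strings.

```r
# A single example date, used only for illustration
d <- as.Date("2024-07-25")
lt <- as.POSIXlt(d)

year_num  <- lt$year + 1900  # $year counts years since 1900
month_num <- lt$mon + 1      # $mon runs 0-11, so add 1 for calendar months
wday_num  <- lt$wday         # $wday runs 0-6, with 0 = Sunday

# format() returns the same parts as character strings
year_chr  <- format(d, "%Y") # four-digit year
month_chr <- format(d, "%m") # two-digit month

c(year_num, month_num, wday_num)
```

The numeric versions are usually more convenient as model inputs, while the `format()` strings are handy for labels.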
# Load dplyr for mutate(), group_by(), case_when(), and related verbs
library(dplyr)
# Extract the year, month, week day info from TransactionDate
transactions <- transactions |>
  mutate(
    Year = as.POSIXlt(TransactionDate)$year + 1900, # calendar year
    Month = as.POSIXlt(TransactionDate)$mon + 1,    # so that months are shown from 1-12
    DayOfWeek = as.POSIXlt(TransactionDate)$wday)   # 0 = Sunday, 1 = Monday, etc.
# Displaying the data after transformation
transactions
- Aggregating Customer Spending:
Creating a feature for total spending per customer can provide insights into customer loyalty and spending habits.
# Update the 'transactions' dataframe with a new feature
transactions <- transactions |>
  group_by(CustomerID) |>                # Group data by 'CustomerID'
  mutate(TotalSpending = sum(Amount)) |> # Calculate the total spending for each customer
  ungroup()                              # Remove the grouping so further operations are not confined to groups
transactions
- Categorizing Age Groups:
Binning customer ages into groups like ‘Youth’, ‘Adult’, ‘Senior’ can help in tailoring marketing strategies and understanding demographic preferences.
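One way to sketch such binning is base R's `cut()`, which assigns each value to an interval and returns an ordered set of labels as a factor. The vector `ages` and the break points at 25 and 60 are assumptions made for this illustration.

```r
# Hypothetical ages, used only for illustration
ages <- c(22, 45, 18, 62, 34, 25, 60)

# Intervals are right-closed by default: (-Inf, 25], (25, 60], (60, Inf)
age_group <- cut(
  ages,
  breaks = c(-Inf, 25, 60, Inf),
  labels = c("Youth", "Adult", "Senior")
)

data.frame(Age = ages, AgeGroup = age_group)
```

Because the result is a factor with levels in a fixed order, it tends to sort and plot in the intended Youth, Adult, Senior sequence rather than alphabetically.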
# Add a new column 'AgeGroup' based on 'Age'
transactions <- transactions |>
  mutate(
    AgeGroup = case_when(
      Age <= 25 ~ "Youth",            # Assign "Youth" to 'AgeGroup' if 'Age' is 25 or younger
      Age > 25 & Age <= 60 ~ "Adult", # Assign "Adult" to 'AgeGroup' if 'Age' is between 26 and 60
      TRUE ~ "Senior"))               # Assign "Senior" to 'AgeGroup' for all other cases (i.e., older than 60)
# Display only the 'Age' and 'AgeGroup' columns from the updated dataframe
transactions |> select(Age, AgeGroup)
- Item Popularity:
Flagging items based on their popularity (e.g., top 20% most frequently bought items) might reveal patterns useful for stocking and promotion.
# Calculate the frequency of each item category and identify the top 20% most popular items
popular_items <- transactions |>
  count(ItemCategory) |>                 # Count the number of occurrences of each 'ItemCategory'
  slice_max(order_by = n, prop = 0.2) |> # Select the top 20% items based on frequency
  select(ItemCategory)                   # Keep only the 'ItemCategory' column
# Display the dataframe containing the top 20% most popular item categories
popular_items
# Add a new column 'PopularItem' to flag popular items
transactions <- transactions |>
  mutate(PopularItem = ItemCategory %in% popular_items$ItemCategory)
# Display only the 'ItemCategory' and 'PopularItem' columns
transactions |> select(ItemCategory, PopularItem)
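One detail worth checking when flagging a top share of items is how ties are handled. The sketch below uses a hypothetical table of counts (not the transactions data) in which two categories tie for first place, and compares the default behavior of `slice_max()` with `with_ties = FALSE`.

```r
library(dplyr)

# Hypothetical category counts with a tie for first place (illustration only)
counts <- data.frame(
  ItemCategory = c("grocery", "electronics", "toys", "home", "beauty"),
  n = c(3, 3, 1, 1, 1)
)

# 20% of 5 rows is 1 row, but ties are kept by default, so both leaders are returned
top_with_ties <- counts |> slice_max(order_by = n, prop = 0.2)

# with_ties = FALSE returns exactly the requested number of rows
top_no_ties <- counts |> slice_max(order_by = n, prop = 0.2, with_ties = FALSE)

top_with_ties
top_no_ties
```

Keeping ties avoids an arbitrary choice between equally popular categories, at the cost of flagging more than the nominal 20%.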