32 Feature Engineering

Feature engineering is a critical step in the predictive modeling process. It might involve:

Removing unnecessary columns
Transforming raw data into features that better represent the underlying problem, which can improve model accuracy and performance
Scaling or standardizing the data

The goal is to establish a set of features that capture important patterns in the data, which may not be readily apparent in the raw data itself.

We have learned that dplyr is a powerful package in R that facilitates data manipulation and transformation. It is particularly useful for feature engineering because it allows you to efficiently create new variables, modify existing ones, and condense or expand information to better suit modeling needs.

Suppose you are working with a dataset of retail store transactions that includes each transaction’s date, customer age, purchased item category, and the amount spent. You want to predict future sales based on these features, but first, you need to engineer features that better capture customer purchasing behaviors and seasonal trends.

# Create a sample dataframe
transactions <- data.frame(
  CustomerID = c(101, 102, 102, 103, 104, 105, 105, 105),
  Age = c(22, 45, 45, 18, 62, 34, 34, 34),
  TransactionDate = as.Date(c('2021-01-01', '2021-12-02', '2022-02-05', '2023-01-03', '2023-11-14', '2024-02-02', '2024-04-15', '2024-07-25')),
  ItemCategory = c("beauty", "grocery", "electronics", "grocery", "home", "electronics", "grocery", "toys"),
  Amount = c(143, 24, 365, 27, 88, 589, 90, 104))

transactions

Creating Time-Based Features:

Extracting day of the week, month, and year from transaction dates can help capture seasonal trends and weekly cycles in purchasing behavior.

# Load the dplyr package
library(dplyr)


# Extract the year, month, and day of the week from TransactionDate
transactions <- transactions |> 
  mutate(
    Year = as.POSIXlt(TransactionDate)$year + 1900, # calendar year
    Month = as.POSIXlt(TransactionDate)$mon + 1, # so that months are shown from 1-12
    DayOfWeek = as.POSIXlt(TransactionDate)$wday) # 0 = Sunday, 1 = Monday, etc.

# Displaying the data after transformation
transactions
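The same date parts can also be pulled out with base R's `format()`, which some readers find easier to read than the `as.POSIXlt()` fields. The sketch below is self-contained and reuses the dates from the sample data. The `time_features` data frame and the `IsWeekend` flag are illustrative names that are not part of the original example.

```r
# Self-contained sketch: the same dates as in the sample 'transactions' data
dates <- as.Date(c("2021-01-01", "2021-12-02", "2022-02-05", "2023-01-03",
                   "2023-11-14", "2024-02-02", "2024-04-15", "2024-07-25"))

# 'time_features' and 'IsWeekend' are hypothetical names used for illustration
time_features <- data.frame(
  TransactionDate = dates,
  Year = as.integer(format(dates, "%Y")),          # four-digit calendar year
  Month = as.integer(format(dates, "%m")),         # 1-12
  DayOfWeek = as.integer(format(dates, "%w")),     # 0 = Sunday, ..., 6 = Saturday
  IsWeekend = format(dates, "%w") %in% c("0", "6") # TRUE for Saturday or Sunday
)

time_features
```

A weekend flag like this is one possible way to give a model direct access to the weekly cycle. Whether it helps depends on the data.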

Aggregating Customer Spending:

Creating a feature for total spending per customer can provide insights into customer loyalty and spending habits.

# Update the 'transactions' dataframe with a new feature
transactions <- transactions |> 
  group_by(CustomerID) |> # Group data by 'CustomerID'
  mutate(TotalSpending = sum(Amount)) |> # Calculate the total spending for each customer
  ungroup() # Remove the grouping so further operations are not confined to groups

transactions
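Using `mutate()` after `group_by()` keeps one row per transaction and repeats the customer total on each row. If a model needs one row per customer instead, `summarise()` collapses each group to a single row. The sketch below is a minimal illustration on a reduced copy of the sample data. The names `purchases`, `customer_summary`, `NumTransactions`, and `AvgSpending` are assumptions made for this example.

```r
library(dplyr)

# Reduced copy of the sample data: only the columns needed here
purchases <- data.frame(
  CustomerID = c(101, 102, 102, 103, 104, 105, 105, 105),
  Amount = c(143, 24, 365, 27, 88, 589, 90, 104)
)

# One row per customer; the column names are hypothetical
customer_summary <- purchases |>
  group_by(CustomerID) |>
  summarise(
    TotalSpending = sum(Amount),   # total amount spent by the customer
    NumTransactions = n(),         # number of transactions by the customer
    AvgSpending = mean(Amount)     # average amount per transaction
  )

customer_summary
```

The choice between `mutate()` and `summarise()` depends on the unit of prediction: transaction-level models usually want the former, customer-level models the latter.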

Categorizing Age Groups:

Binning customer ages into groups such as ‘Youth’, ‘Adult’, and ‘Senior’ can help in tailoring marketing strategies and understanding demographic preferences.

# Add a new column 'AgeGroup' based on 'Age'
transactions <- transactions |> 
  mutate(
    AgeGroup = case_when(
      Age <= 25 ~ "Youth", # Assign "Youth" to 'AgeGroup' if 'Age' is 25 or younger
      Age > 25 & Age <= 60 ~ "Adult", # Assign "Adult" to 'AgeGroup' if 'Age' is between 26 and 60
      TRUE ~ "Senior")) # Assign "Senior" to 'AgeGroup' for all other cases (i.e., older than 60)

# Display only the 'Age' and 'AgeGroup' columns from the updated dataframe
transactions |> select(Age, AgeGroup)
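Base R's `cut()` is an alternative to `case_when()` for this kind of binning. The sketch below assumes the same cut points as above (25 and 60) and uses the ages from the sample data. `ages` and `age_group` are illustrative names. Unlike `case_when()`, `cut()` returns a factor, and its intervals are closed on the right by default, so 25 falls in "Youth" and 60 in "Adult".

```r
# Ages from the sample data
ages <- c(22, 45, 45, 18, 62, 34, 34, 34)

# Bins: (-Inf, 25], (25, 60], (60, Inf); right-closed intervals are the default
age_group <- cut(
  ages,
  breaks = c(-Inf, 25, 60, Inf),
  labels = c("Youth", "Adult", "Senior")
)

# Compare each age with its assigned group
data.frame(Age = ages, AgeGroup = age_group)
```

A factor with ordered levels can be convenient when the groups are later used in plots or model formulas.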

Item Popularity:

Flagging item categories by popularity (e.g., the top 20% most frequently purchased categories) might reveal patterns useful for stocking and promotion. With the five categories in the sample data, the top 20% is a single category.

# Count how often each item category appears and keep the top 20% most frequent categories
popular_items <- transactions |> 
  count(ItemCategory) |> # Count the number of occurrences of each 'ItemCategory'
  slice_max(order_by = n, prop = 0.2) |> # Keep the top 20% of categories by count (ties are kept by default)
  select(ItemCategory) # Keep only the 'ItemCategory' column

# Display the dataframe containing the top 20% most popular item categories
popular_items

# Add a new column 'PopularItem' to flag popular items
transactions <- transactions |> 
  mutate(PopularItem = ItemCategory %in% popular_items$ItemCategory)

# Display only the 'ItemCategory' and 'PopularItem' columns
transactions |> select(ItemCategory, PopularItem)
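The introduction lists scaling or standardizing as a feature engineering step, but the examples above do not show it. The following is a minimal sketch using the `Amount` values from the sample data. The names `amount_z` and `amount_minmax` are hypothetical, and which scaling (if any) is appropriate depends on the model being used.

```r
# Amount values from the sample data
amount <- c(143, 24, 365, 27, 88, 589, 90, 104)

# Standardizing (z-score): subtract the mean, divide by the standard deviation
amount_z <- (amount - mean(amount)) / sd(amount)

# Min-max scaling: rescale values to the range [0, 1]
amount_minmax <- (amount - min(amount)) / (max(amount) - min(amount))

# Compare the original and rescaled values
data.frame(Amount = amount, AmountZ = amount_z, AmountMinMax = amount_minmax)
```

Base R's `scale()` performs the same standardization and returns a one-column matrix, so `as.numeric(scale(amount))` gives the same values as `amount_z`.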

Poll Time

Imagine you are working with a dataset containing customer transaction data at a retail store. The dataset includes the following columns: TransactionDate, CustomerID, AmountSpent, and ProductCategory. You are tasked with predicting future spending amounts by the customers.

Given the dataset described, which feature engineering strategy would likely be useful for improving a model’s ability to predict future customer spending?

In addition, identify the key functions to be used when implementing the selected feature engineering strategies.

Time of Day Feature: Extract the time of day from TransactionDate to see if spending habits vary between morning, afternoon, and evening.
Customer Loyalty Feature: Calculate the total number of transactions per CustomerID to create a ‘Loyalty Score’.
Seasonal Spending Feature: Extract the month from TransactionDate and calculate average spending per month to identify seasonal trends.
Customer Visit Frequency Feature: Calculate the number of visits per customer per year from TransactionDate to identify ‘frequent visitors’ and assess if visit frequency correlates with spending habits.

Access the live poll here: https://PollEv.com/weihongni276