π€΅ Author: Tri Nguyen
π Date: Jan. 10, 2025
π» Tools Used: Python
- π Background & Overview
- π Dataset Description & Data Structure
- π§ Design Thinking Process
- π Key Insights & Visualizations
- π Final Conclusion & Recommendations
Understanding customer behavior through RFM (Recency, Frequency, Monetary) analysis helps businesses make informed decisions and enhance customer loyalty. The Marketing department needs to segment customers for holiday campaigns, and the Marketing Director suggests using RFM analysis.
RFM Segmentation is a method to analyze customer behavior based on three key metrics:
- Recency (R): Time since the last purchase. Recent buyers are more likely to purchase again.
- Frequency (F): Number of purchases within a period. Frequent buyers are more loyal.
- Monetary Value (M): Total money spent. High spenders are more valuable to the business.
- The Marketing Department who need to understand customer behavior and thier values for new marketing statergy.
- Customer Segmentation for Marketing Campaigns: How can the Marketing department classify customer segments effectively to deploy tailored marketing campaigns for Christmas and New Year, appreciating loyal customers and attracting potential ones?
- Implementing RFM Model: How can the RFM (Recency, Frequency, Monetary) model be utilized to analyze and segment customers to enhance the effectiveness of marketing campaigns?
Summarize key findings and insights/ trends/ themes in a concise, bullet-point format.
Example:
βοΈ Sales Trends: The top X% of products generate Y% of revenue.
βοΈ Inventory Optimization: Certain products are frequently out-of-stock, causing revenue loss.
βοΈ Customer Behavior: Returning customers spend Z% more per transaction than new customers.
- Source: (Mention where the dataset is obtained fromβKaggle, company database, government sources, etc.)
- Size: (Mention the number of rows & columns)
- Format: (.csv, .sql, .xlsx, etc.)
Mention how many tables are in the dataset.
Table 1: ecommerce_retail
| Fields | DataType | Description |
|---|---|---|
| InvoiceNo | String | Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'C', it indicates a cancellation. |
| StockCode | String | Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product. |
| Description | String | Product (item) name. Nominal. |
| Quantity | Integer | The quantities of each product (item) per transaction. Numeric. |
| InvoiceDate | DateTime | Invoice Date and time. Numeric, the day and time when each transaction was generated. |
| UnitPrice | Decimal | Unit price. Numeric, Product price per unit in sterling. |
| CustomerID | Integer | Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer. |
| Country | String | Country name. Nominal, the name of the country where each customer resides. |
Table 2: segmentation
| Segment | RFM Score |
|---|---|
| Champions | 555, 554, 544, 545, 454, 455, 445 |
| Loyal | 543, 444, 435, 355, 354, 345, 344, 335 |
| Potential Loyalist | 553, 551, 552, 541, 542, 533, 532, 531, 452, 451, 442, 441, 431, 453, 433, 432, 423, 353, 352, 351, 342, 341, 333, 323 |
| New Customers | 512, 511, 422, 421, 412, 411, 311 |
| Promising | 525, 524, 523, 522, 521, 515, 514, 513, 425, 424, 413, 414, 415, 315, 314, 313 |
| Need Attention | 535, 534, 443, 434, 343, 334, 325, 324 |
| About To Sleep | 331, 321, 312, 221, 213, 231, 241, 251 |
| At Risk | 255, 254, 245, 244, 253, 252, 243, 242, 235, 234, 225, 224, 153, 152, 145, 143, 142, 135, 134, 133, 125, 124 |
| Cannot Lose Them | 155, 154, 144, 214, 215, 115, 114, 113 |
| Hibernating customers | 332, 322, 233, 232, 223, 222, 132, 123, 122, 212, 211 |
| Lost customers | 111, 112, 121, 131, 141, 151 |
1οΈβ£ Load Dataset
ecm = pd.read_excel('ecommerce_retail.xlsx', sheet_name='ecommerce_retail')
ecm.head(10)2οΈβ£ Exploratory Data Analysis (EDA)
# @title Get infor about data type & data value
print(ecm.info())
print('---')
# detect data value (min, max, count,...)
ecm.describe()
Using ProfileReport
# @title Using ProfileReport to Understand more about Category Data Type
profile = ProfileReport(ecm)
profile
Detect why "Quantity" and "Price" have negative values (<0)
# subset dataframe Δα» xem qua cΓ‘c rΖ‘w cΓ³ giΓ‘ trα» Γ’m
print('DΓ²ng cΓ³ Quantity Γ’m (<0)')
ecm[ecm['Quantity'] < 0]
# Δược biαΊΏt kΓ½ hiα»u C α» "InvoiceN0" nghΔ©a lΓ ΔΖ‘n bα» cancel, check xem Quantity Γ’m thΓ¬ cΓ³ phαΊ£i ΔΖ‘n cancel khΓ΄ng?
# tαΊ‘o cα»t check tΓ¬nh trαΊ‘ng cancel cα»§a data
# .astype(str) de khong bi loi TypeError: 'int' object is not subscriptable
ecm['InvoiceNo'] = ecm['InvoiceNo'].astype(str)
ecm['check_cancel'] = ecm['InvoiceNo'].apply(lambda x: True if x[0] == 'C' else False)
# check lΓ½ do cα»t Quantity < 0 cΓ³ phαΊ£i do ΔΖ‘n bα» cancel hay khΓ΄ng
ecm[(ecm['check_cancel'] == True) & (ecm['Quantity'] < 0)].sort_values('Quantity')
ecm[(ecm['check_cancel'] == True) & (ecm['Quantity'] <= 0)].sort_values('Quantity')
# cΓ³ nhα»―ng ΔΖ‘n nΓ o khΓ΄ng bα» cancel "C" mΓ vαΊ«n cΓ³ quantity Γ’m khΓ΄ng
ecm[(ecm['check_cancel'] == False) & (ecm['Quantity'] < 0)].sort_values('Quantity')
ecm[(ecm['check_cancel'] == False) & (ecm['Quantity'] <= 0)].sort_values('Quantity')
# check lΓ½ do
ecm[(ecm['check_cancel'] == False) & (ecm['Quantity'] < 0)]['Description'].value_counts()
Handling abnormal data values
# drop abnormal values
# UnitPrice < 0 , unitprice = 0 cΓ³ thα» lΓ hΓ ng tαΊ·ng
ecm = ecm[ecm['UnitPrice'] >= 0]
# xα» lΓ½ loαΊ‘i bα» data cα»§a hΓ ng khΓ΄ng thuα»c nhΓ³m ΔΖ‘n cancel "C" nhΖ°ng cΓ³ quantity Γ’m
ecm_cancel = ecm[ecm['check_cancel'] == True]
ecm_p_qty = ecm[ecm['Quantity'] >= 0]
ecm = pd.concat([ecm_cancel, ecm_p_qty], copy= False)
ecm.sort_values('CustomerID')
Checking, Detecting and Handling missing values and duplicates
# Thα»ng kΓͺ nhα»―ng cα»t cΓ³ missing values
print('Thα»ng kΓͺ sα» lượng vΓ tα» lα» missing data')
missing_dict = {'volume':ecm.isnull().sum(),
'percent':ecm.isnull().sum() / (ecm.shape[0])}
missing_df = pd.DataFrame.from_dict(missing_dict)
missing_df
# detect why missing values were very high
ecm[ecm['CustomerID'].isnull()].head()
ecm[ecm['CustomerID'].isnull()].tail()
# tαΊ‘o cα»t yyyy-mm-dd sau ΔΓ³ dΓΉng apply Δα» bα» 3 giΓ‘ trα» cuα»i cα»§a cα»t nΓ y, chα» lαΊ₯y yyyy-mm
ecm['Date'] = pd.to_datetime(ecm['InvoiceDate']).dt.date
ecm['Month'] = ecm['Date'].apply(lambda x: str(x)[:-3])
# tΓnh tα»ng ΔΖ‘n hΓ ng bα» missing customerID theo thΓ‘ng
ecm_group_month = ecm[ecm['CustomerID'].isnull()][['Month','InvoiceNo']].groupby(['Month']).count().reset_index().sort_values('Month')
ecm_group_month
# mα»i thΓ‘ng Δα»u cΓ³ mα»t sα» lượng missing customerID, khΓ΄ng bα» tαΊp trung vΓ o mα»t sα» thΓ‘ng nhαΊ₯t Δα»nh nΓ o
# cΓ‘c ΔΖ‘n missing customerID vα»«a cΓ³ thα» lΓ ΔΖ‘n bα» cancel, vα»«a cΓ³ thα» lΓ ΔΖ‘n bΓ¬nh thΖ°α»ng
# do yΓͺu cαΊ§u lΓ chia nhΓ³m khΓ‘ch hΓ ng, nΓͺn nhα»―ng ΔΖ‘n bα» mαΊ₯t customerID sαΊ½ khΓ΄ng tΓnh Δược -> nΓͺn bα» Δi
# xα» lΓ½ missing values -> bα» Δi
ecm = ecm[ecm['CustomerID'].notnull()]
ecm.sort_values('CustomerID')
# check duplicated values trong dataset
# thαΊΏ nΓ o lΓ duplicate data (Δiα»u kiα»n duplicate ΔΖ°a ra nhα»―ng cα»t nΓ o)
ecm_dups = ecm.duplicated(subset=["InvoiceNo", "StockCode","InvoiceDate","CustomerID"])
print(ecm[ecm_dups].shape)
print('')
print(ecm.shape)
# dat gia thuyet de kiem tra loi
# vi du de kiem tra
ecm[(ecm['InvoiceNo'] == '581538') & (ecm['StockCode'] == 23343)]
# vi du de kiem tra
ecm[(ecm['InvoiceNo'] == '581538') & (ecm['StockCode'] == 21194)]
# vi du de kiem tra
ecm[(ecm['InvoiceNo'] == '538341') & (ecm['StockCode'] == 22988)]
# khi kiα»m tra thα» 1 ΔΖ‘n cancel, khΓ΄ng tΓ¬m thαΊ₯y ΔΖ‘n mua tΖ°Ζ‘ng tα»±
# -> thay Δα»i phΖ°Ζ‘ng Γ‘n, bα» luΓ΄n nhα»―ng ΔΖ‘n cancel
# Δα»i vα»i duplicates, sαΊ½ chα» giα»― lαΊ‘i dΓ²ng ΔαΊ§u tiΓͺn
ecm = ecm.drop_duplicates(subset=["InvoiceNo", "StockCode","InvoiceDate","CustomerID"], keep= 'first')
# convert nhα»―ng columns sang vα» ΔΓΊng format string -> trΓ‘nh mαΊ₯t dα»― liα»u vΓ¬ sα» dα»
bα» lΓ m trΓ²n
ecm.columns
# col_list to be converted to 'str' format
str_col_list = ['InvoiceNo', 'StockCode', 'Description', 'CustomerID', 'Country']
for c in str_col_list:
ecm[c] = ecm[c].astype(str)
ecm.info()
Data Processing - RFM calculating
# create new column "cost"
ecm['Cost'] = ecm['Quantity'] * ecm['UnitPrice']
last_day = ecm['Date'].max()
RFM_ecm = ecm.groupby('CustomerID').agg(
Recency = ('Date', lambda x: last_day - x.max()),
Frequency = ('CustomerID', 'count'),
Monetary = ('Cost', 'sum')
).reset_index()
# astype column "recency"
RFM_ecm['Recency'] = RFM_ecm['Recency'].dt.days.astype('int16')
RFM_ecm.dtypes
# use qcut to divide the dataset into 5 ranges
RFM_ecm['R'] = pd.qcut(RFM_ecm['Recency'], 5, labels= range(1,6)).astype(str)
RFM_ecm['F'] = pd.qcut(RFM_ecm['Frequency'], 5, labels= range(1,6)).astype(str)
RFM_ecm['M'] = pd.qcut(RFM_ecm['Monetary'], 5, labels= range(1,6)).astype(str)
RFM_ecm['RFM'] = RFM_ecm.apply(lambda x: x['R'] + x['F'] + x['M'], axis= 1)
%matplotlib inline
order_ = ['Champions', 'Loyal', 'Potential Loyalist', 'New Customers', 'Promising', 'Need Attention', 'About To Sleep', 'At Risk', 'Cannot Lose Them', 'Hibernating customers', 'Lost customers']
pivot_ecm = pivot_ecm[order_]
plt.figure(figsize=(18,9))
r = sns.lineplot(data= ecm_distribution_table, x= 'Segment', y= 'avg_recc')
f = sns.lineplot(data= ecm_distribution_table, x= 'Segment', y= 'avg_freq')
plt.xticks([])
plt.ylabel('avg_recc and avg_freq')
plt.xlabel('')
plt.title('Avg Recency and Frequency by Segmentation (days)')
rcolors = np.full(len(pivot_ecm.index), 'linen')
ccolors = np.full(len(pivot_ecm.columns), 'lavender')
table = plt.table(cellText=pivot_ecm.values,
rowLabels= pivot_ecm.index,
colLabels= pivot_ecm.columns,
bbox=(0, -.5, 1, .5), rowLoc= 'center', colLoc= 'center', loc= 'center',
rowColours= rcolors,
colColours= ccolors)
table.scale(1, 1)
table.auto_set_font_size(False)
table.set_fontsize(9)
table.set_label('Customer Segmentation Detail')
Avg Recency and Frequency by Segmentation
π Analysis 1:
- Highly Engaged: Segments like "Cannot Lose Them" and "Champions" are highly engaged with recent interactions. Keep nurturing these relationships.
- At Risk & Inactive: "At Risk" and "Lost customers" show signs of declining interest. Prioritize re-engagement and understand reasons for inactivity.
- New Customer Focus: "New Customers" have high volume but low engagement. Encourage repeat purchases and build brand loyalty.
order_ = ['Champions', 'Loyal', 'Potential Loyalist', 'New Customers', 'Promising', 'Need Attention', 'About To Sleep', 'At Risk', 'Cannot Lose Them', 'Hibernating customers', 'Lost customers']
pivot_ecm = pivot_ecm[order_]
plt.figure(figsize=(18,9))
hue_colors = sns.color_palette("Paired")
hue_colors2 = sns.color_palette("flare")
p1 = sns.barplot(data= ecm_distribution, x='Segment', y='distribution_percent',
hue='type', palette= hue_colors,
order= order_)
for container in p1.containers :
p1.bar_label(container)
p2 = sns.lineplot(data= ecm_distribution, x='Segment', y='distribution_percent',
hue='type', palette= hue_colors2, solid_capstyle= 'round')
plt.ylabel('')
plt.xlabel('')
plt.xticks([])
plt.title('Customer Segmentation Contribution (%)')
rcolors = np.full(len(pivot_ecm.index), 'linen')
ccolors = np.full(len(pivot_ecm.columns), 'lavender')
table = plt.table(cellText=pivot_ecm.values,
rowLabels= pivot_ecm.index,
colLabels= pivot_ecm.columns,
bbox=(0, -.5, 1, .5), rowLoc= 'center', colLoc= 'center', loc= 'center',
rowColours= rcolors,
colColours= ccolors)
table.scale(1, 1)
table.auto_set_font_size(False)
table.set_fontsize(9)
table.set_label('Customer Segmentation Detail')
plt.show()
Customer Segmentation Contribution using Seaborn
π Analysis 2:
- High-Value Customers: Keep engaged with loyalty programs and strong relationships.
- At Risk & Growth Potential: Re-engage, target with personalized offers, and convert to loyalty.
- New & Inactive Customers: Encourage repeat purchases, build brand loyalty, and re-engage.
# total value for recency
f_total = len(RFM_ecm['Frequency'])
# Create the plot for Distribution of Frequency
binsF = [0, 2, 5, 20, np.inf]
labelsF = ['1-2', '2-5', '5-20', '20+']
RFM_ecm['FrequencyGroup'] = pd.cut(RFM_ecm['Frequency'], bins=binsF, labels=labelsF)
fig, ax = plt.subplots(figsize=(8, 3))
sns.countplot(x='FrequencyGroup', data=RFM_ecm, ax=ax)
ax.set_title('Distribution of Frequency')
ax.yaxis.set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Add percentage label
for container in ax.containers:
labels = [f'{int(height)}\n({height/r_total:.1%})' for height in container.datavalues]
ax.bar_label(container, labels=labels, label_type='edge', padding=2)
plt.show()
Distribution of Frequency
π Analysis 3:
- Observation: The first bin (0-50 days) shows the highest frequency with 1884 customers (43,4%). This suggests a strong and recent engagement with a significant portion of the customer base actively making purchases.
- Recommendation:
- Continue to nurture and engage the segment that has recently made purchases through personalized offers, loyalty programs, and exceptional customer service.
- Regularly analyze customer purchasing behavior to identify trends and adjust strategies accordingly. Use predictive analytics to forecast potential churn and proactively address it.
# total value for recency
r_total = len(RFM_ecm['Recency'])
# Create the plot for Distribution of Recency
fig, ax = plt.subplots(figsize=(12, 3))
sns.histplot(data=RFM_ecm, x='Recency', bins =10, ax=ax)
ax.set_title('Distribution of Recency')
ax.set_xlim(left=0)
ax.yaxis.set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Add percentage label
for container in ax.containers:
labels = [f'{int(height)}\n({height/r_total:.1%})' for height in container.datavalues]
ax.bar_label(container, labels=labels, label_type='edge', padding=2)
plt.show()
π Analysis 4:
- Observation:
- The "20+ days" group has the highest frequency with 3044 occurrences, making up 70.2% of the total. The majority of customers fall into the "20+ days" frequency group, indicating that a significant portion of your customer base takes longer intervals between purchases. This segment may need special attention to encourage more frequent buying.
- The "1-2 days" and "2-5 days" groups have the fewest occurrences, with 125 (2.9%) and 175 (4.0%) occurrences respectively. The "1-2 days" and "2-5 days" groups have the fewest occurrences, indicating very frequent buyers. These customers are highly engaged and should be rewarded for their loyalty.
- Recommendation:
- "20+ days": Implement targeted marketing campaigns to encourage more frequent purchases. Offer personalized incentives or discounts to reduce the time between purchases.
- "1-2 days" and "2-5 days": Recognize and reward their loyalty through VIP programs or special deals. Maintain their engagement with personalized communication and exclusive offers.
# total value for recency
m_total = len(RFM_ecm['Monetary'])
#Distribution of Monetary
binsM = [0, 100, 1000, 10000, np.inf]
labelsM = ['0-100', '100-1k', '1k-10k', '10k+']
RFM_ecm['MonetaryGroup'] = pd.cut(RFM_ecm['Monetary'], bins=binsM, labels=labelsM)
fig, ax = plt.subplots(figsize=(8, 3))
sns.countplot(x='MonetaryGroup', data=RFM_ecm, ax=ax)
ax.set_title('Distribution of Monetary', pad=20)
ax.yaxis.set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
for container in ax.containers:
labels = [f'{int(height)}\n({height/m_total:.1%})' for height in container.datavalues]
ax.bar_label(container, labels=labels, label_type='edge', padding=2)
plt.show()
π Analysis 5:
- Observation:
- The majority of customers fall into the "100-1k" spending group, indicating that most of your customer base spends within this range. This segment contributes significantly to your overall revenue and should be nurtured and engaged effectively.
- The "1k-10k" group is also a significant segment, representing 35.8% of the total occurrences. These customers are moderate spenders who have the potential to be moved into higher spending categories with proper engagement strategies.
- The "10k+" group also has a relatively low frequency, representing 2.4% of the total. These are your high-end spenders who make substantial purchases. Special attention and personalized offers can help retain these valuable customers.
- Recommendation:
- "100-1k" Group: Continue engaging these customers with personalized offers, loyalty programs, and exceptional customer service. Recognize their value and provide incentives for repeat purchases.
- "1k-10k" Group: Implement targeted marketing campaigns to encourage higher spending. Offer exclusive discounts and promotions to move them into higher spending categories.
- "10k+" Group: Recognize and reward their loyalty through VIP programs or special deals. Maintain their engagement with personalized communication and exclusive offers.
#Assign color
segment_colors = {
'Champions': '#a7a6dd',
'Loyal': '#e2a5de',
'Potential Loyalist': '#ffa6c5',
'At Risk': '#ffb49e',
'Hibernating customers': '#ffd379',
'Lost customers': '#f9f871',
'Need Attention': '#aba9bc',
'About To Sleep': '#efedff',
'New Customers': '#ffefca',
'Promising': '#2c8f8c',
'Cannot Lose Them': '#e29896'
}
# % and Monetary values by Segment
segment_monetary = ecm_f.groupby('Segment')['Monetary'].sum().reset_index()
total_monetary = segment_monetary['Monetary'].sum()
segment_monetary['Percentage'] = segment_monetary['Monetary'] / total_monetary * 100
segment_monetary = segment_monetary.sort_values('Monetary', ascending=False)
# Create the treemap
fig, ax = plt.subplots(1, figsize=(15,10))
squarify.plot(sizes=segment_monetary['Monetary'],
label=[f"{s}\n${int(m):,}\n{int(p)}%"
for s, m, p in zip(segment_monetary['Segment'],
segment_monetary['Monetary'],
segment_monetary['Percentage'])],
color=[segment_colors[segment] for segment in segment_monetary['Segment']],
alpha=0.8,
bar_kwargs=dict(linewidth=1.5, edgecolor="white"))
plt.title('Monetary contribution by RFM segment - Treemap', fontsize=16)
plt.axis('off')
plt.show()
π Analysis 6:
- Observation:
- Cannot Lose Them: Contribute 45% of total revenue. High spenders with frequent purchases.
- At risk: Contribute 24% of total revenue. Significant spenders but showing signs of potential churn.
- Recommendation:
- Cannot Lose Them: Provide exclusive offers and VIP treatment to ensure exceptional customer service and regular engagement.
- At risk: Gather feedback to understand their concerns. Implement targeted win-back campaigns with personalized offers.
| Segment | Description | Recommendation |
|---|---|---|
| Champions | high purchase frequency and value | Reward them, ask for feedback, and promote referrals -> keep them always be loyal customers -> their feedback will help improving products and services |
| Loyal | consistently buy from you but aren't your highest spenders | Offer exclusive deals and keep them engaged with loyalty programs -> encourage them to buy more products and services |
| Potential Loyalist | recently purchased and show potential to become loyal | Encourage repeat purchases with personalized offers and excellent service -> encourage them to buy more products and services |
| New Customers | customers who have made their first purchase recently | Exceptional service and follow-up communication -> Make a great first impression to customer and encourage to keep considering using our product and services |
| Promising | who have shown interest but haven't yet made significant | Nurture them with targeted marketing campaigns and special promotions -> Encourage them to make the first purchase |
| Need Attention | who used to purchase frequently but have slowed down | Reach out with re-engagement campaigns and personalized offers, Communication to understand reasons why they haven't purchase frequently |
| About To Sleep | who are at risk of becoming inactive | Send reminders, special promotions, and reactivation incentives -> Try encouraging them repurchase with any product or service |
| At Risk | who haven't purchased in a while and are close to becoming inactive | Actively re-engage them with strong offers and personal outreach -> Try encouraging them repurchase with any product or service |
| Cannot Lose Them | High-paying customer About to leave |
Take immediate action with special offers, personalized communication, and excellent service -> To understand their problems, solve their problems and keep them using service |
| Hibernating customers | who haven't purchased for a long time | Send win-back campaigns and try to understand their reasons for inactivity -> Try encouraging them repurchase with any product or service |
| Lost customers | who have not engaged or purchased for a very long time | Decide whether to invest in re-engagement or focus on new customer acquisition -> Send reminder messages to encourage customers to reuse the service. |




















