Starbucks Capstone Challenge
Udacity Data Scientist Capstone project exploring Starbucks data and building a prediction model.
Wrangle and explore simulated Starbucks data on members, the offer portfolio, and offer event logs. Build a model to predict how a user will respond to an offer (view / complete), and find the most important features of that prediction model.
The program used to create the data simulates how people make purchasing decisions and how those decisions are influenced by promotional offers. Each person in the simulation has some hidden traits that influence their purchasing patterns and are associated with their observable traits. People produce various events, including receiving offers, opening offers, and making purchases.
As a simplification, there are no explicit products to track. Only the amounts of each transaction or offer are recorded. There are three types of offers that can be sent: buy-one-get-one (BOGO), discount, and informational. In a BOGO offer, a user needs to spend a certain amount to get a reward equal to that threshold amount. In a discount, a user gains a reward equal to a fraction of the amount spent. In an informational offer, there is no reward, but neither is there a requisite amount that the user is expected to spend. Offers can be delivered via multiple channels.
The basic task is to use the data to identify which groups of people are most responsive to each type of offer, and how best to present each type of offer.
Rewards program users (17000 users x 5 fields)
id: (string/hash) member id
gender: (categorical) M, F, O, or null
age: (numeric) missing value encoded as 118
became_member_on: (date) format YYYYMMDD
income: (numeric) annual income; null where demographics are missing
Offers sent during 30-day test period (10 offers x 6 fields)
reward: (numeric) money awarded for the amount spent
channels: (list) web, email, mobile, social
difficulty: (numeric) money required to be spent to receive reward
duration: (numeric) time for offer to be open, in days
offer_type: (string) bogo, discount, informational
id: (string/hash) offer id
Event log (306648 events x 4 fields)
person: (string/hash) member id
event: (string) offer received, offer viewed, transaction, offer completed
value: (dictionary) different values depending on event type
offer id: (string/hash) present for offer events; not associated with any “transaction”
amount: (numeric) money spent in “transaction”
reward: (numeric) money gained from “offer completed”
time: (numeric) hours after start of test
I have four problem statements for this analysis:
- Build a model to predict customer action on an offer: completion or viewing.
- Explore members’ gender, income, and age against their offer completion choices.
- Calculate the ‘offer view rate’ and ‘offer completion rate’ for bogo and discount offers.
- Understand the most important features driving the customer action of viewing or completing an offer.
Strategy to solve problem
I wrangled the data and explored the features around customer profiles, offer completion, and offer viewing. This gave me a better idea of which features to include in the model. I tested different classifiers: Logistic Regression, K-Neighbors, Decision Tree, AdaBoost, and Random Forest. I also tried the faster, more modern XGBoost and LightGBM models. The F1 scores of these models helped me select the right model to focus on.
I passed the model with the highest F1 score to GridSearchCV to identify its best hyperparameters and obtain the best estimator. I then used this model to understand which features most influence the prediction.
- Channels were hot encoded into columns: web, email, mobile and social
- Event was hot encoded into columns: offer_received, offer_viewed, offer_completed and transaction
- 2175 duplicate entries were removed
- value dictionary was converted into columns: reward, amount and offer_id
- time was converted to days
- User profiles with age 118 (the missing-value marker) were removed
- id was renamed to person
- became_member_on was converted to datetime format from int
- Income was grouped into categorical field income_group:
  - < 49000 as Low
  - ≥ 49000 and ≤ 80000 as Mid
  - > 80000 as High
- Age was grouped into categorical field age_generation:
  - ≤ 24 as Gen Z
  - ≥ 25 and ≤ 40 as Millennials
  - ≥ 41 and ≤ 56 as Gen X
  - ≥ 57 and ≤ 75 as Baby Boomers
  - > 75 as Gen Silent
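The wrangling steps above can be sketched with pandas on a toy sample. The column names follow the text, but the sample values and exact bin-boundary handling are illustrative, not the project's exact code:

```python
import pandas as pd

# Toy rows mimicking the raw event log (the real data has 306648 events).
transcript = pd.DataFrame({
    "person": ["a1", "a1", "b2"],
    "event": ["offer received", "transaction", "offer completed"],
    "value": [{"offer id": "o9"}, {"amount": 12.5}, {"offer_id": "o9", "reward": 5}],
    "time": [0, 132, 168],  # hours after the start of the test
})

# Expand the value dictionary into columns; note the inconsistent
# 'offer id' vs 'offer_id' keys, merged into a single column here.
values = pd.json_normalize(transcript["value"].tolist())
values["offer_id"] = values["offer_id"].fillna(values.pop("offer id"))
transcript = pd.concat([transcript.drop(columns="value"), values], axis=1)

# Convert time from hours to days.
transcript["time_days"] = transcript["time"] / 24

# Group income into Low/Mid/High and age into generations (bin edges
# approximate the text; pd.cut intervals are right-closed by default).
profile = pd.DataFrame({"income": [35000, 60000, 95000], "age": [22, 45, 70]})
profile["income_group"] = pd.cut(
    profile["income"], bins=[0, 49000, 80000, float("inf")],
    labels=["Low", "Mid", "High"])
profile["age_generation"] = pd.cut(
    profile["age"], bins=[0, 24, 40, 56, 75, float("inf")],
    labels=["Gen Z", "Millennials", "Gen X", "Baby Boomers", "Gen Silent"])
```

Because pd.cut's intervals are right-closed by default, the exact boundary values (e.g. 49000, 80000) land in the lower bin; pass `right=False` if the other convention is needed.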
Exploratory Data Analysis
I merged transcript and portfolio data and explored interesting features.
1. Cumulative count of members over time
From 2013 through 2017, the number of members increased steadily over time. However, in 2018 the overall member count fell below 38,000.
2. Distribution of member’s income
The membership income distribution looks multimodal. There is a large peak around 68k, while the mean income is about 65k. Overall, incomes range from 30k to 120k.
When income is grouped, most members fall into the mid level (≥ 49000 and ≤ 80000).
3. Age Distribution per gender of members
The age distribution of male members looks bimodal, with the largest peak at 55 years and a smaller peak at 30 years. There are 8,484 male members in total.
The age distribution of female members has its biggest peak at 58 years and a much smaller edge peak at 20 years. There are 6,129 female members in total.
Most members of the Other gender are around age 60. That distribution looks unimodal, and there are 212 such members in total.
Mapping ages onto generations shows the Other gender group is quite small. Around 49.49% of members fall into the Silent Generation bucket (above 56 years). The second largest group (28.23%) is Gen X (41 to 56 years of age), followed by Millennials (25 to 40 years of age) at 16.36%. Just 4.9% are Gen Z (20 to 24 years of age).
4. Percentage of offers received, viewed and completed.
I wanted to build a funnel analysis of the three steps of an offer, to get an idea of the Offer-View-Rate (ratio of the number of offers viewed to the number of offers received) and the Offer-Completion-Rate (ratio of the number of offers completed to the number of offers received).
Customers received 30543 discount offers and slightly fewer (30499) bogo (Buy-One-Get-One-Free) offers. They also received 15235 informational offers.
However, did customers receive some offers more than other offers?
There are 10 different offers numbered from 1 to 10.
It seems all offer_ids were received roughly equally by the members, making for a fair experiment. Offers 2 (discount), 10 (discount), 5 (bogo) and 9 (bogo) are the most viewed.
Offer IDs 2 and 10 are also the most completed. Surprisingly, offer #5 (bogo), the third most viewed offer, is the least completed. The most difficult offer, where the user needs to spend 20 to get a reward of 5, is the least viewed and second least completed.
Bogo offers accounted for a larger share of all views than discount offers (about 7 percentage points more) and informational offers (about 25 points more).
Share of views for Bogo = 44.08%
Share of views for Discount = 37.15%
Share of views for Informational = 18.76%
Surprisingly, discount offers made up 53% of all completions versus 46% for bogo. Informational offers have no further action beyond a view, so they do not appear in the last chart.
Share of completions for Bogo = 46.71%
Share of completions for Discount = 53.28%
Offer view rate = # of offers viewed / # of offers received
The Offer view rate calculated for Bogo was 83.44% and for Discount offers was 70.21%.
Offer completion rate = # of offers completed / # of offers received
The offer completion rate for Bogo was 50.82%, lower than the completion rate for Discount (57.88%).
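Under these definitions, both rates can be computed by counting events per offer type. A minimal sketch on a toy event log (the figures above come from the full data, with roughly 76k offer events):

```python
import pandas as pd

# Toy merged event log with offer_type attached to each offer event.
events = pd.DataFrame({
    "offer_type": ["bogo"] * 5 + ["discount"] * 4,
    "event": ["offer received", "offer received", "offer viewed",
              "offer viewed", "offer completed",
              "offer received", "offer received", "offer viewed",
              "offer completed"],
})

# Count events per offer type, then take ratios against 'offer received'.
counts = events.groupby(["offer_type", "event"]).size().unstack(fill_value=0)
view_rate = counts["offer viewed"] / counts["offer received"]
completion_rate = counts["offer completed"] / counts["offer received"]
```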
Mapping completions onto member generations, it was primarily the Silent Generation who completed the most offers, favoring discount offers first and bogo offers second. The other generations completed relatively few bogo or discount offers.
Offer completion by gender shows that female and male members chose bogo offers about equally; for both, discount offers were more popular than bogo.
I joined transcript_df and portfolio_df on offer_id, which added the details of each corresponding offer. I then merged the member profile details into this new dataframe on person. The final dataframe had 21 columns and 272388 entries.
Further, I converted the became_member_on field to tenure field which provides the number of days of membership.
I filtered out the offer_received and transaction events and hot encoded all the categorical fields.
I did not consider transaction fields like amount and reward relevant to offer completion or viewing, so I dropped them.
I removed rows with NaN values.
I split the data into training and test data with test size 30%.
I standardized all my numerical features to zero mean and unit variance. I chose standardization over normalization because my data distributions were roughly Gaussian. I standardized the X_train and X_test numerical columns after doing the train_test_split.
Standardization is a scaling technique that centers values around the mean with a unit standard deviation: the mean of each attribute becomes zero and the resulting distribution has a standard deviation of one.
Now I merged the X_train and X_test data with the categorical columns to get the datasets in the final format.
More information on standardization and normalization can be found here.
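A minimal sketch of the split-then-standardize step, using synthetic numeric columns as a stand-in for the real features. Fitting the scaler on the training split only keeps test-set information out of the transformation:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic income-scale numeric features (illustrative, not the real data).
rng = np.random.default_rng(42)
X = rng.normal(loc=65000, scale=20000, size=(1000, 3))
y = rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits,
# so no information from the test set leaks into the transformation.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```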
Finally, I built a prediction model to predict the customer response (view/completion) to an offer.
Variable to predict
action: completed = 1, viewed = 0
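After the offer_received and transaction events are filtered out, this label can be derived directly from the event column; a minimal sketch:

```python
import pandas as pd

# With 'offer received' and 'transaction' events filtered out, only
# 'offer viewed' and 'offer completed' remain in the event column.
events = pd.Series(["offer viewed", "offer completed", "offer viewed"])

# action: completed = 1, viewed = 0
action = (events == "offer completed").astype(int)
```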
The shape of my features and labels was:
Training Feature Shape: (61447, 18)
Training Labels Shape: (61447,)
Testing Feature Shape: (20483, 18)
Testing Labels Shape: (20483,)
I did not select accuracy as a metric for this analysis because the classes are imbalanced: there are 1.5 times more offer-viewed events than offer-completed events. This ruled out accuracy as the right measure for comparing models.
I based this analysis on this blog.
What is more important to Starbucks True Positive or True Negative?
True Negative: the customer did not act on (complete) the offer and the model correctly predicted no action.
True Positive: the customer acted on (completed) the offer and the model predicted it correctly.
Both True Positives and True Negatives matter, as mistakes in either direction mean Starbucks could lose opportunities to target potential customers. However, the data is imbalanced, with 1.5 times more offer views than offer completions, so accuracy would be a misleading metric.
Which one has a higher costs to business, False Positives or False Negatives?
False Positive: the model predicts a customer will act on the offer, but they do not.
False Negative: the model predicts a customer will not act on the offer, but they do.
A False Positive could mean a marketing campaign targets customers who would not act on the offer. This is bad because it lowers the offer-view-rate and offer-completion-rate, which impacts the business.
A False Negative could mean Starbucks fails to identify a customer who would have responded to the offer. The impact of this is smaller than that of a False Positive, but it still matters for the business.
I will select F1-score as the metric.
The F1 score is the harmonic mean of precision and recall. It is a number between 0 and 1, reaching its best value at 1 and its worst at 0, with precision and recall contributing equally. I assumed equal weights for both.
I also printed the classification report for each model, but I relied mainly on the F1 score to compare model performance.
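A toy example of why accuracy misleads here: with a roughly 1.5:1 class imbalance, a model that always predicts the majority class (viewed) still scores 60% accuracy while its F1 score collapses to zero:

```python
from sklearn.metrics import accuracy_score, f1_score

# Imbalanced toy labels: viewed (0) outnumbers completed (1) about 1.5:1.
y_true = [0] * 60 + [1] * 40
# A lazy model that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)            # looks acceptable
f1 = f1_score(y_true, y_pred, zero_division=0)  # reveals the failure
```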
Based purely on the F1 score on test data, XGBClassifier and LGBMClassifier had the best performance at 0.61. I selected both models for hyperparameter tuning, aiming to find parameters that improve their scores further.
I used GridSearchCV to find the best parameters for tuning my models.
I found an amazing article on Kaggle that helped me understand which parameters to hypertune for the LightGBM model.
To improve F1 score of the model, I focused on :
- Use a large max_bin (may be slower).
- Use a small learning_rate with a large num_iterations.
- Use a large num_leaves (may cause overfitting).
- Use more training data.
- num_leaves: max number of leaves in one tree
- min_child_samples: minimal number of samples in one leaf. Can be used to deal with over-fitting
- num_iterations: number of boosting iterations
- application: binary since I have 2 classes to classify
- boosting: gbdt (traditional Gradient Boosting Decision Tree) or dart (Dropouts meet Multiple Additive Regression Trees)
- learning_rate: step size at each iteration while moving toward a minimum of a loss function.
I found the best_estimator_ model with an F1 score of 0.61 again.
I ran the search over a number of different iterations and parameter ranges, but the F1 score remained the same.
Next, I tried hypertuning the XGBoost model.
For XGBoost, I selected the following tuning parameters:
- n_estimators : number of trees you want to build.
- gamma : controls whether a given node will split based on the expected reduction in loss after the split. A higher value leads to fewer splits. Supported only for tree-based learners.
- max_depth : determines how deeply each tree is allowed to grow during any boosting round.
- min_child_weight: minimum sum of instance weight (hessian) needed in a child.
I found this tutorial quite helpful to hypertune the XGBoost parameters.
As observed, the F1 score after GridSearchCV for XGBoost dropped to 0.60, which showed LightGBM was the better model.
I selected the LightGBM best_estimator_ model for my subsequent feature importance analysis.
I applied the model to get the most influencing features affecting the customer action.
Features most affecting customer actions on offers are as follows:
The time at which the customer acted on the offer (measured from the start of the test) was the most important feature for predicting customer action. The member's tenure (membership duration) was the second most important, and the time for which the offer was open (duration) was third.
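Tree-based estimators like the tuned LightGBM model expose a feature_importances_ attribute, and ranking features looks like the sketch below. The feature names and the RandomForest stand-in are illustrative, not the project's exact setup:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names echoing the text above.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           random_state=42)
feature_names = ["time", "tenure", "duration", "difficulty", "reward"]

model = RandomForestClassifier(random_state=42).fit(X, y)

# feature_importances_ sums to 1; sort to rank features by influence.
importances = pd.Series(model.feature_importances_, index=feature_names)
ranked = importances.sort_values(ascending=False)
```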
The Starbucks Capstone data was quite interesting to explore.
First, I assessed the data and looked for quality and tidiness issues.
Second, I explored the data to have more understanding about field distribution. Data exploration also helped me answer interesting questions.
Then I merged the three datasets, portfolio, transcript and profile, to create a master dataset. I hot encoded the categorical columns and removed the transaction and offer received events.
Then I tried six different classifier models to find the one with the highest F1 score. I passed the XGBoost and LightGBM models to GridSearchCV to find the best tuning parameters. Hyperparameter tuning did not produce the expected increase in F1 score, so I kept the LightGBM model as the best model for further analysis.
I used this model to identify the features that most affect customer action. This will help Starbucks understand which member characteristics make a member most likely to view or complete an offer.
Using the LightGBM model, I was able to identify the most important features affecting offer completion and offer viewing. I can also use the model to predict customer action. The F1 score on offer viewed was higher than on offer completed, which shows the model can predict offer views better than offer completions. However, the overall average F1 score was 0.7, which was acceptable.
The F1 score I achieved was 0.61. Hypertuning the LightGBM model had no impact on improving it. The score might eventually improve as I integrate more data into the model's supervised learning.
As a next step, I want to try a custom ensemble model to improve the F1 score further, and then test the model's performance with k-fold cross-validation to validate its robustness.
The model could include more features related to customer transactions, such as the total amount spent by the customer, as well as more customer demographics. I would reconsider my decision to remove the amount and reward fields from the model.
I considered the overall tenure of the members but did not break it down into joining years (e.g. 2018, 2019) as categorical fields. Since tenure is an important feature, hot encoding the joining year could surface more specific important features.
The profile data contained many NaN values and ages beyond 100. The data collection methods should be revised to avoid such faulty data. However, once I removed the faulty rows, the data was very clean.
Merging all three datasets was helpful in building the model. The dictionary values in some columns made analysis difficult; in particular, the same field appeared as both ‘offer id’ and ‘offer_id’ in different events even though they held the same value.
The most difficult part was hypertuning the models. The model score did not change, and different parameters showed little impact even in the classification report. This is something I want to work on in the next iteration of this project.
If you want to further build the analysis, you can find the entire code and dataset on my Github here.