Nanodegree Capstone Project — Starbucks

Chapman McDaniel
22 min read · Jul 11, 2021

This version of the post is structured around the project rubric for easier evaluation. The narrative version of the post is here.

Project Definition

Project Overview

This project is an analysis of a simulated dataset created by Starbucks. Starbucks uses this simulation and the resulting data to test the ability of different people and algorithms to identify patterns in customer behavior.

The simulation tracks 17,000 customers of a fictional company (Barstucks? Buckstars? Stuckbars? … Stuckbars) as they purchase coffee and souvenir mugs over the course of 30 days. While philosophers may be unsure if we live in a simulation, these customers definitely do. Over the same 30-day period, Stuckbars is running a test to improve their marketing and loyalty programs. Periodically, each customer is randomly presented with one of 10 different offers. These offers can be for a discount (spend-$X-get-$Y-off), buy-one-get-one for a certain product, or simply “informational.” As a control, the option of not sending an offer is also included in the randomization.

Problem Statement

While there are many ways to look at this dataset, for this project:

The objective is to determine the best offer to present to each customer.
The best offer is the one that yields maximum return for Stuckbars.

To determine which offer yields the maximum return, we will perform exploratory data analysis, process the data to prepare for supervised learning, then build a model to predict, for any given customer, the return associated with each offer. This will allow us to pick the offer with the highest predicted return, fulfilling the objective.

Metrics

The primary metric for evaluating the offers will be:

Return = (10-Day Profit) - (10-Day Marketing Rewards)

The profit margin for Stuckbars is not given as part of the problem statement, but this article from Visual Capitalist gives a margin of 25% excluding marketing, and Starbucks’ 2019 Annual Report gives a margin of 20% in the Americas including marketing. Given these, a 25% margin seems reasonable and will be used. So the final version of the primary metric is:

Return = (10-Day Revenue * 0.25) - (10-Day Marketing Rewards)
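For concreteness, here is a minimal sketch of the metric as a Python helper; the argument names are my own and the 25% margin is the estimate discussed above.

```python
# Minimal sketch of the primary metric; argument names are illustrative.
PROFIT_MARGIN = 0.25  # margin estimate discussed above

def offer_return(revenue_10d, rewards_10d):
    """Return = (10-day revenue * margin) - (10-day marketing rewards paid out)."""
    return revenue_10d * PROFIT_MARGIN - rewards_10d
```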

A wide variety of other metrics will be used to break down why offers may be performing the way they do:

  • 10-Day Revenue and 10-Day Rewards are important separately to understand the components of the total return
  • Fraction of users who viewed an offer is important to understand the reach of the marketing channels
  • Fraction of users who completed an offer will be cited but is a rough metric that we will improve on
  • Fraction of users who completed an offer after viewing it is a better metric of an offer’s attractiveness, because it removes the impact of different marketing channels
  • Fraction of users who completed an offer without viewing it is important to understand what parts of the customer base may not respond to offers

The primary metric for evaluating the supervised learning models will be the Mean Squared Error, since predicting return is a regression task.

Analysis

Data Exploration

The data is presented in three parts:

  1. A list of the possible offers, with associated characteristics
  2. A list of customers, with associated characteristics (if known)
  3. A time-series of the events which occurred during the test, such as transactions, offers sent, and offers completed.

Offers Data

Let’s start by meeting the offers:

Our Contestant Offers

As we can see, the offers have other features in addition to the 3 types:

  • Marketing channels used to reach customers
  • Duration (in days) that the offer is valid
  • ‘Difficulty’ — the spend (in.. $? £? €? ₿?) required to complete the offer
  • Reward — the credit received (in…. we’ll go with $) upon completion

The option space is well covered by the offers, with a wide variety of difficulty/reward/duration combinations across the offer types. Very little processing is required here to prepare this data table for analysis. The only truly necessary transformation is to turn the ‘channels’ column into 4 boolean True/False columns: channel_web, channel_social, etc. It also helps to establish a nickname for the offers to aid in plotting later, as was done above.
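A minimal sketch of that transformation, assuming the offers table is a pandas DataFrame called `portfolio` with a list-valued `channels` column, could look like this:

```python
# Sketch: expand the list-valued 'channels' column into boolean flags.
# Assumes portfolio['channels'] holds lists such as ['web', 'email', 'mobile'].
for channel in ["web", "email", "mobile", "social"]:
    portfolio[f"channel_{channel}"] = portfolio["channels"].apply(
        lambda chs, c=channel: c in chs
    )
portfolio = portfolio.drop(columns="channels")
```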

The offers dataset is fully populated (no missing data), which makes sense as it is an internal summary dataset not relying on any customer input.

Customer Profile Data

With the offers in hand, we can seek to understand the customers. The customer profiles have only 4 attributes: age, membership start date, gender, and income.

Sample of customer profiles

Right from the beginning, we can see that some customers have declined to provide any personal information — the only thing known about them is their membership start date. Those customers have a gender of None, an age of 118, and an income of NaN. In terms of missing data, that is all there is — all customers who provided personal information provided all possible information. The only data transformation necessary is to make the membership start date more palatable to modeling by turning it into an integer value — “days of membership.”
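A rough sketch of that clean-up, with assumed raw column names, might look like:

```python
import numpy as np
import pandas as pd

# Sketch of the profile clean-up, with assumed raw column names.
# Anonymous customers: gender None, age 118, income NaN.
profile["is_anonymous"] = profile["income"].isna()
profile.loc[profile["is_anonymous"], "age"] = np.nan  # 118 is a placeholder, not a real age

# Convert the membership start date into an integer 'days of membership',
# measured against an assumed snapshot date at the end of the records.
profile["became_member_on"] = pd.to_datetime(profile["became_member_on"], format="%Y%m%d")
snapshot = profile["became_member_on"].max()
profile["days_of_membership"] = (snapshot - profile["became_member_on"]).dt.days
```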

The clean division of missing data within customer profiles into ‘anonymous’ customers with very little data and profiled customers with all data available is a major simplification in the simulation from real-world customer data. In most situations missing data will be scattered across all the columns and will not necessarily divide so evenly into customers with data / customers without data.

Meeting the customers

The customers are gender-imbalanced in favor of men; just over 10% of profiles are anonymous, and about 1% of customers have gender=Other. There are notable peaks in the distributions of the continuous variables, likely reflecting the different groups created in the simulation setup.

The continuous variables differ somewhat by gender — the women in the dataset skew older and higher-income than the men, but are slightly newer in terms of membership.

While not visible in this graph, the ‘gender:Other’ customers fall between men and women on age and income and are slightly newer members than women.

Event Data — the ‘Transcript’

The final dataset to process is the events data — called the “transcript.” The transcript is by far the most complex to analyze and process. It is a time series of 4 different event types:

  • Transaction: customer makes a purchase at Stuckbars
  • Offer Received: customer receives a marketing offer
  • Offer Viewed: customer views a marketing offer
  • Offer Completed: customer fulfills requirements for an offer and receives the associated reward

The lifecycle of an offer can take many forms, which complicates the analysis. The classical case is for the offer to be received, viewed and then completed. But since offers bring only benefits, customers do not need to opt in to them. It is possible for a customer to complete an offer without ever viewing it, or to complete an offer before viewing it. This occurs in the example below: the customer completes their last offer at t=624 hours before viewing it at t=690 hours.

A sample transcript from a random customer

Other elements of customer behavior also complicate the analysis. Above, the customer completes their first offer at time t=30 with a purchase of $339.69. There are a significant number of such large purchases in the dataset. As plotted below, the median transaction amount is about $10, and amounts range from $0.05 to over $1000. About 10% of transactions are for less than $1. On the high side, less than 0.5% of transactions are high-end outliers by the ‘fence’ definition of Q3 + (IQR * {1.5, 3}), yet because of their size these outlier transactions make up more than 10% of total revenue. Very few customers made more than one outlier transaction. Only 13 customers registered 2 transactions over $53 during the 30-day test, and none made more than 2. There are no low-end outliers in the dataset as the lower outlier fences are negative.
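The fence calculation described above is straightforward; a sketch, assuming a transactions DataFrame with an `amount` column, is:

```python
# Sketch: flag high-end outlier transactions using the Tukey fence definition
# quoted above (inner fence Q3 + 1.5*IQR, outer fence Q3 + 3*IQR).
q1, q3 = transactions["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
inner_fence = q3 + 1.5 * iqr
outer_fence = q3 + 3.0 * iqr

outliers = transactions[transactions["amount"] > inner_fence]
outlier_share_of_revenue = outliers["amount"].sum() / transactions["amount"].sum()
```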

Distribution of transaction sizes and their share of Stuckbars’ revenue

The outliers bear more consideration: what type of purchase do they represent? A number of options come to mind. They could be business or organization accounts, making large orders of coffee for events. If Stuckbars offers catering, they could be catering orders. Perhaps Stuckbars offers ground coffee for sale and these represent customers stocking up on coffee beans or ground coffee to then brew at home. They could even be errors (charging $530.00 instead of $53.00), although this is less likely because there are no negative transactions or transaction reversals in the dataset. One thing is clear — these large outliers represent a completely different type of transaction than the normal. This leads to our first key recommendation to Stuckbars:

  1. Establish what type(s) of purchase(s) the large transactions represent and create different classes of offers around them.

The need for a different type of offer becomes clear when we ask ourselves the following question: When customer d53717f54... purchased $340 worth of coffee, were they motivated by the offer of $5 back when spending $20 or more? Almost certainly not. So in this case the $5 reward was sent out without any real gain. But this doesn’t mean that the large purchases cannot be influenced and increased by other types of offers such as volume discounts, free delivery, or promotions to add on a different type of product to a large order. Since the outlier transactions represent a significant fraction of revenue, it is worth investigating how to increase this segment of the business. However, for the remainder of this analysis, outlier transactions will be excluded since they are not the type of purchase this marketing program is aiming to drive.

Checking Invariant Metrics

To ensure the results from the test are valid, it is important to check the invariant metrics on the offer populations, confirming that the population receiving each offer is representative of the customer population as a whole. The test generally passes the invariant metric checks — only two variables have statistically significant differences from the population. This is somewhat to be expected, as we are making 88 comparisons (11 offers × 8 invariant metrics). An overview of the invariant metrics per offer is below — more detailed invariant metric comparisons are available in the notebook.

Overview of the invariant metrics, compared by offer

The distributions and proportions in each offer (in different colors) match the distributions of the profiles dataset (in black) nicely. The test is representative and can be analyzed directly.
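The exact statistical tests live in the notebook; a hedged sketch of how a single continuous invariant check could be formalized (variable names are illustrative) might be:

```python
from scipy import stats

# Sketch of one invariant-metric check: compare the income distribution of
# customers who received a given offer against the full customer population.
# 88 such comparisons were run in total (11 offers x 8 invariant metrics).
def check_continuous_invariant(offer_values, population_values, alpha=0.05):
    statistic, p_value = stats.ks_2samp(offer_values.dropna(), population_values.dropna())
    return p_value, p_value < alpha  # True means a statistically significant difference
```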

Data Visualization

Some of the visualizations are listed above, but many of the important data visualizations to explore the customers and the profiles are only available after data processing. Once processing is complete, we will return to exploration of the implications from a customer and offer lens.

Methodology

Data Processing

Processing of the offers and profiles datasets is relatively trivial — it was mentioned in passing during the exploration phase. The primary processing to be done in this project is on the transcript data. The most difficult obstacle in processing and analyzing the transcript comes from the test design itself: offers were sent in batches at exactly 0, 7, 14, 17, 21 and 24 days into the 30-day test. Given that the offers have durations of up to 10 days, at day 21 a customer could have 3 offers active at the same time. Given that the offers are randomized, at day 21 a customer could have 3 copies of the same offer active.

This complexity is amplified by the fact that Stuckbars does not have a uniqueID to distinguish each time a customer receives a specific instance of an offer. Due to the repetition of offers, this renders the data ambiguous in certain cases. Consider the following scenarios of back-to-back presentation of the same offer, with a 7-day interval between offers and 10-day offer validity:

UniqueIDs: Not just for people and offers

Situations 1, 2 and 3 are ambiguous with respect to offer completion — the offer completion could be applied to either instance 1 or instance 2. It’s possible that the customer’s later behavior could make the situation clearer (as in Situation 4) — but a third copy of the same offer could come in later and complicate the situation again, as in Situation 5.

To resolve this, consistent logic has to be applied. Luckily in this case a sensible logic is relatively straightforward to establish — offer completions are applied to the oldest uncompleted unexpired offer of the same type. This gives each customer the freedom to continue and complete the later versions of the same offer, which is a benefit for them and the marketing department.

Offer views, unfortunately, are harder. They carry the additional complexity that viewing an offer can occur after said offer has expired. Consider Situation 6 above — clearly the customer completed the second instance of the offer. But which instance did they view? Intuition tells us they viewed the active offer that they later completed, but since offers can be viewed after expiry it’s not clear. Situation 7 is muddier still — both offers were active at the time of viewing, but only one at the time of completion. It’s not even clear that offer views are individual. One possible interpretation is that in Situations 6 and 7 the customer viewed both prior offers.

Again, a consistent logic has to be applied. Our logic will be to apply offer views individually:

  1. To the oldest unexpired unviewed offer
  2. If no unexpired unviewed offers exist, to the oldest unviewed offer
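A minimal sketch of both matching rules, operating on an assumed per-customer list of offer-instance dicts (not the project's actual data structures):

```python
# Sketch of the matching rules. 'open_offers' is assumed to be a list of dicts
# for one customer and one offer_id, each tracking received time, expiry and flags.
def match_completion(open_offers, t):
    """Apply a completion to the oldest uncompleted, unexpired instance."""
    candidates = [o for o in open_offers
                  if not o["completed"] and o["received"] <= t <= o["expires"]]
    if candidates:
        oldest = min(candidates, key=lambda o: o["received"])
        oldest["completed"] = True
        oldest["completed_at"] = t

def match_view(open_offers, t):
    """Apply a view to the oldest unexpired unviewed instance, falling back to
    the oldest unviewed instance (views can occur after expiry)."""
    unviewed = [o for o in open_offers if not o["viewed"]]
    unexpired = [o for o in unviewed if t <= o["expires"]]
    candidates = unexpired or unviewed
    if candidates:
        oldest = min(candidates, key=lambda o: o["received"])
        oldest["viewed"] = True
        oldest["viewed_at"] = t
```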

This complexity leads to our second key recommendation to Stuckbars:

2. To improve data integrity, establish a uniqueID to distinguish each offer presentation (meaning a unique combination of offer_id/person/time_received) and reference offer views and offer completions to the appropriate offer presentation uniqueID.

Returning to the test design complexity — offers were sent in batches at 0, 7, 14, 17, 21 and 24 days, which allows multiple offers to be valid for a customer at the same time.

Customer 3526938fb466470 gets a surprise reward bonanza

The design also raises questions on handling transactions — are they applied to only one offer or all valid ones? It seems to be the latter, since one customer completed 3 copies of the spend-$20-get-$5-back offer with a single $22 transaction. Notably, the customer did not even view the offers so the ~75% discount they received is hardly a case of effective marketing spend.

How common is this, and how much does it matter? For most offers, simultaneous completions make up between 10 and 15% of the total, which could simply be random chance. In the case of the above spend-$20-get-$5 offer it makes up more than 20%. This could also be random chance — since more spend is required to fulfill this offer, it increases the likelihood for a single large transaction to finish this offer and other smaller offers. Even if this is due to random chance, there is a meaningful difference between the offers on this point and so the effect must be taken into account.

How often different offers are completed at the same time as another offer

With the complexities on the table, it’s time to fully process the transcript data. What should the transcript data be transformed into? As the objective is to understand the performance of the offers, we need each instance of an offer on its own line. As for features, we need to know whether the offer was viewed and completed (with associated times), as well as how much revenue came in and the reward amounts paid out during the offer’s influence period.

The desired output dataframe looks like this:

Dataframe Utopia

Once the need for line-by-line processing of the transcript data was identified, that implementation was relatively straightforward. The major challenges in generating the desired data output above were in identifying the complications in the base data and working methodologies around them.

For example, the initial attempt to determine if an offer was completed or not was quite simple:
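The original snippet isn't reproduced here, but a merge-based first pass along these lines (column names assumed after flattening the transcript) captures the idea:

```python
# First-pass sketch: join offers received to offers completed on person + offer_id.
# This looks correct, but it duplicates rows whenever the same customer receives
# the same offer more than once, as described below.
received = transcript[transcript["event"] == "offer received"]
completed = transcript[transcript["event"] == "offer completed"]

offers = received.merge(
    completed[["person", "offer_id", "time"]],
    on=["person", "offer_id"],
    how="left",
    suffixes=("_received", "_completed"),
)
offers["was_completed"] = offers["time_completed"].notna()
```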

This will produce a dataframe that appears correct, but has garbage results in the case where a user receives the same offer multiple times. This is because the join occurs on only offer_id and person, duplicating rows where that combination of values is not unique. To handle this, the first hints of iteration work their way in. The next attempt was to map a function across the rows of the list of the offers received. This queried the transcript data for every offer sent, to look for whether an offer completion existed within the offer’s validity duration.

Getting warmer…..

This is getting closer — it gets the durations right but now fails in cases where the user has two or more copies of the same offer active and completes one of them. At this point I realized the ambiguity in the case of back-to-back identical offers, created the scenarios to map out the possibilities, and set the logic. From there, iterating through the dataset line-by-line was straightforward for offer completion. Offer views were a final thorn: it took quite a while to realize that offers can be viewed post-expiration (given the repeated offers, limiting the matching window to the offer duration had seemed to make a lot of sense).

Reshaping the data into the required output was by far the most difficult and error-prone part of the process. A critical success factor here was that from the beginning I created a number of assertions to make sure the output dataframe was correct. This key step from test-driven development prevented the errors above from making it into the results.

The assertions that saved the day
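The exact assertion block isn't shown here, but checks of this flavor (with hypothetical column names) would catch the reshaping errors described above:

```python
# Illustrative sanity checks on the reshaped offer-level dataframe; column
# names are hypothetical stand-ins for the ones used in the project.
n_received = (transcript["event"] == "offer received").sum()
n_viewed = (transcript["event"] == "offer viewed").sum()
n_completed = (transcript["event"] == "offer completed").sum()

assert len(offer_df) == n_received, "exactly one row per offer received"
assert offer_df["viewed"].sum() == n_viewed, "every view assigned to exactly one instance"
assert offer_df["completed"].sum() == n_completed, "every completion assigned exactly once"

# Completions must land inside the offer's validity window.
done = offer_df[offer_df["completed"]]
assert (done["completed_at"] >= done["received_at"]).all()
assert (done["completed_at"] <= done["expires_at"]).all()
```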

Once the transcript dataframe was reshaped into the desired format, the averages and confidence intervals for each offer were calculated through the use of seaborn’s pointplot function, which runs 1000 bootstrap iterations and calculates 95% confidence intervals before displaying the results visually. To compare across customer segments, seaborn’s catplot function was used, which maps pointplot across facets.

Tabular data analysis was done via pandas’ style.background_gradient function to visually show offer rankings.
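For reference, a sketch of those seaborn and pandas calls, with assumed dataframe and column names:

```python
import seaborn as sns

# Sketch of the plotting calls described above; dataframe and column names
# are assumed. pointplot bootstraps the mean and draws 95% confidence intervals.
sns.pointplot(data=offer_df, x="offer_nickname", y="return_10d")

# catplot maps the same point estimates across facets, e.g. by gender.
sns.catplot(data=offer_df, x="offer_nickname", y="return_10d",
            col="gender", kind="point")

# Tabular ranking view, shading each column of an offer summary table.
offer_summary.style.background_gradient(axis=0)
```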

Model Implementation

To apply machine learning, a model was built to predict the 10-day Revenue-Reward for customers with a profile. For anonymous customers, heuristics will be used. The model was a neural net implemented using tensorflow's keras API. As preprocessing for supervised learning (a sketch in code follows the list):

  • Primary features were composed of the customers’ profile data:
    [age, income, gender, days_of_membership]
  • The offer_id was added as a categorical feature to allow prediction of revenue for all offers with a single neural net
  • The dataset was split into training, validation and test sets
  • Continuous features were standardized using scikit-learn's StandardScaler
  • Categorical features were one-hot encoded using pd.get_dummies
  • Transformations were combined into a pipeline function for easier iterative preprocessing
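A rough sketch of these preprocessing steps, with assumed feature and target names:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Rough sketch of the preprocessing steps above; column names are assumed,
# not taken verbatim from the project code.
CONTINUOUS = ["age", "income", "days_of_membership"]
CATEGORICAL = ["gender", "offer_id"]
TARGET = "return_10d"

def preprocess(df):
    X = pd.get_dummies(df[CONTINUOUS + CATEGORICAL], columns=CATEGORICAL)
    y = df[TARGET]

    # Split into train / validation / test sets.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

    # Standardize the continuous features, fitting only on the training set.
    scaler = StandardScaler().fit(X_train[CONTINUOUS])
    for split in (X_train, X_val, X_test):
        split[CONTINUOUS] = scaler.transform(split[CONTINUOUS])

    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```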

As the task is relatively straightforward regression, a standard densely-connected neural net was used, with one hidden layer. The layer node count was 32/16/1.
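A minimal keras sketch of that architecture (the exact training configuration isn't reproduced here):

```python
import tensorflow as tf

# Minimal sketch of the architecture described: dense 32 -> 16 -> 1 regression net.
# The input width depends on the one-hot encoded feature set.
def build_model(n_features):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(n_features,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1),  # linear output for regression
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```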

As a baseline, the naive prediction approach was taken to be:
Based on the offer, predict the average return for that offer
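A sketch of how such a baseline could be computed, with assumed column names:

```python
from sklearn.metrics import mean_squared_error

# Sketch of the naive baseline: each offer's mean return on the training data
# is used as the prediction for every customer who received that offer.
offer_means = train_df.groupby("offer_id")["return_10d"].mean()
naive_preds = test_df["offer_id"].map(offer_means)
naive_mse = mean_squared_error(test_df["return_10d"], naive_preds)
```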

The MSE on the test set for the naive approach was 58.

The initial model achieved a MSE of 48.8 on the validation set, indicating better performance than baseline even without refinement.

Model Refinement

Two alternative model architectures were tested:

  • A neural net with 2 hidden layers instead of 1 (64/32/16/1)
  • A different neural net with 1 hidden layer, using fewer nodes (16/8/1)

The MSE did not improve with the alternative model architectures, and all results were quite close.

To improve the model further, an additional continuous feature was brought in — the customer’s median transaction amount. Stuckbars could reasonably know each customer’s median transaction amount based on their order history prior to the test. Inclusion of this feature significantly improved the MSE from 48.8 to 46.

Finally, an alternative approach was tested for predicting revenue across the different offers. Instead of training a single neural net to predict return with offer_id as an input feature, a separate neural net was trained for each offer to make the same predictions. This approach led to slightly worse MSE performance, and it also did not help with the biased predictions for different offers.

The actual/predicted results for the refined model and the naive prediction are below.

Refined-Model Performance and Naive-Prediction Performance

Results

Exploratory Data Analysis

Before looking at the offer performance and model performance, I returned to some exploratory-type data analysis about the behavior of users. I plotted different user behavior characteristics (number of transactions, total transaction amount, etc) separated out by different categorical variables. An example is below:

Behavior comparison: anonymous users vs users with a profile

From this, we can see that relative to users with a completed profile, anonymous users make slightly fewer transactions, spend far less, receive the same number of offers, view slightly more offers, complete fewer offers, and as a result complete far fewer offers without viewing them. This analysis provided important background before jumping right into the offer performance. Other analyses and insights of this type are in the notebook.

From there, the results were tabulated:

Tables: an information-dense food

There’s a lot to absorb here, so let’s go column-by-column:

  • Across all users, the spend-$10-get-$2-back offer gives the highest average return, and it’s not particularly close
  • All offers brought in incremental revenue — “no offer” is dead last on revenue. The informational offers rank low on the revenue metric but high on profit less reward because no reward is offered for them.
  • The bogo offers bring in good revenue (they rank reasonably well on that metric), but the rewards given out are disproportionate to the incremental revenue, so they rank poorly on profit less reward.
  • There is a wide variation in the fraction of customers who viewed an offer (Column: fr_view). This fraction has a strong relationship with channel usage — offers that did not use the social and mobile channels were viewed significantly less. Offers that did not use the web channel showed a modest decline.
  • Offers that used all channels were viewed by > 95% of customers, indicating very good reach.
  • The fraction of customers completing an offer without viewing (Column: fr_compl_unview) is also primarily driven by the channel usage. Offers with low view fraction have high unviewed completion percentage, and vice-versa.
  • To separate the attractiveness of the offer from the channel usage, we can calculate the fraction of offer viewers who later completed the offer (Column: fr_of_view_who_compl).

The observation about channel usage leads to our third key recommendation to Stuckbars:

3. Evaluate marketing channels and personalize their use to individual customers.

In this scenario the costs of the individual marketing channels are not given, so it is not possible to fully evaluate the cost-benefit of different channels. However, it becomes clear that channel effectiveness does vary from customer to customer when we compare offers that differ only by one channel. Some examples (shown below) are that the social media channel is more important for younger users and that the web channel is more important for older users. As a result, there is clearly an opportunity to optimize channel usage. A future test could be designed and run with evaluation of offer channels front and center.

Those youngsters — always on social media. Back in my day we used the web, and by golly we liked it.

Returning to the offer performance: our objective is not only to understand the best offer overall, but also the best offer to present to each customer. Breaking down the performance by customer profile, there are a number of cases where the disc-10/2/10 offer is not the best option to send.

For example, for anonymous customers the informational offers (and possibly bogo offers) work better.

Anonymous customers crave information

New customers also respond differently to the offers. For customers who have been members less than about 6 months, the informational offers and even the option of not sending an offer have overlapping confidence intervals with the standard top offer. Given the overlapping confidence interval on the primary metric, the choice is clear: choose the informational offers or no offer, since for them no reward is handed out. For customers with more than 6 months of membership tenure, the top offer is the standard disc-10/2/10.

Offer performance for different membership durations (not all shown)

The most interesting feature that impacts offer performance is not in the customer’s profile data at all. When segmenting customers by median transaction amount, we find that for customers who normally spend relatively little in a transaction the $10 bogo offers perform much better than normal and even surpass the other options — although the confidence intervals do overlap somewhat, so the result may not be statistically significant. So for customers who don’t spend much the bogo offers look good, but to be statistically certain a new test will have to be designed and run.

Hey! … Little Spender

In the case of gender and income, the best overall offer (disc-10/2/10) remained on top for all slices of the customer base — so no deviation from the global optimum there. As a result, that’s as much as we can do with single-category segmentation.

Model Evaluation and Validation

Returning to the neural net predictions — the refined model is a densely-connected neural net in regression mode (no activation function on the final layer), with a single hidden layer, relu activation on the input and hidden layers, and a small number of nodes (layer structure: 32/16/1). This simple architecture is aligned with the simple nature of the regression task: a single prediction based on only 18 input features.

In addition to evaluating on the primary metric of MSE, we must keep in mind that the plan is to use this model for comparisons between offers to determine which performs best.

This means it is also important to evaluate whether the model is biased in favor of or against certain offers.

Model bias evaluation for the single model

The results of the single model are reasonable, in that the 95% intervals overlap significantly across the offers. However, we would like to see the errors more evenly distributed across the different offers.

Model bias evaluation for the multi-model approach

The multi-model approach yields a different set of biases, but not necessarily a better one. The total range is similar, and as noted above the multi-model approach has a slightly worse MSE. As a result, the single-model approach is accepted.

In addition to checking the bias, we must also evaluate what predictions the model is making — do they make sense with what was learned from exploratory data analysis?

To answer this question, revenue predictions were made for every offer and every customer, and the top offer per customer selected.
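A sketch of that scoring loop, with an assumed feature-building helper:

```python
import pandas as pd

# Sketch: score every (customer, offer) pair and pick the offer with the
# highest predicted return per customer. build_features is a hypothetical
# helper standing in for the project's preprocessing pipeline.
customers = profile[profile["income"].notna()]  # the model applies to profiled customers only
offer_ids = list(portfolio["offer_id"]) + ["no_offer"]

rows = []
for offer_id in offer_ids:
    X = build_features(customers, offer_id)   # assumed preprocessing helper
    preds = model.predict(X).ravel()
    rows.append(pd.DataFrame({"person": customers["id"].values,
                              "offer_id": offer_id,
                              "pred": preds}))

all_preds = pd.concat(rows, ignore_index=True)
best_offer = all_preds.loc[all_preds.groupby("person")["pred"].idxmax()]
```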

Model’s recommendations across the customer base

This makes sense — the offer found to be generally best by exploratory data analysis is the one most recommended by the model. The informational, no_offer, and bogo offers also showed up in the heuristics as situationally better, so their representation makes sense as well.

As another check, we can break down the predictions by membership tenure, matching the chart made in exploratory analysis.

Model predictions of best offer for different membership tenures

This also generally matches our observations from exploratory analysis — for the newer members, informational offers and no offer were very competitive with the standard top offer. It also appears that for the longer-term members the model has identified a subset where one of the bogo offers is best.

Justification / Results Summary

Exploratory Analysis and Heuristics:
To summarize the recommendations for offer targeting based on heuristics we have:

4. For anonymous customers, send informational offers

5. For customers with < 6mo of membership, send informational offers, no offer, or alternatively the top disc-10/2/10 offer

6. For customers who normally spend < ~$2.50, send one of the $10-for-$10 bogo offers

7. For all other customers, send the disc-10/2/10 offer (best overall)

I was surprised that gender was not a useful separator in terms of offer performance: the top offer of disc-10/2/10 was best across all genders. The customer behavior plots (shown in this post for anonymous/non-anonymous users) helped to illustrate why this is. There are differences in customer behavior by gender: the median woman spends the most and completes the most transactions. Customers with gender=Other view the most offers. However, the differences within genders are larger than the differences between them. This becomes apparent when comparing the separation between genders (where no difference in offer performance was found) to the separation for anonymous users and across median transaction bins (where we did see a difference in offer performance).

Modeling:
Our neural net significantly outperforms the heuristics on the modeling metric of Mean Squared Error (46 vs 58). This allows for more personalized offers to be sent to customers, improving results. The model’s predictions align generally with the observations in the base data, improving confidence in its more specific predictions.

The lack of improvement for different model structures indicates that even the initial model is reaching the regression limits inherent in the base data — that it is not possible to achieve further separation without additional features. This interpretation is further supported by the fact that including the median transaction amount as a feature improved the result.

This model was built only for customers with a completed profile: as a result, the heuristic of ‘send informational offers’ will be applied for anonymous customers.

Conclusion

Reflection

To summarize, our analysis of the offer test done by Stuckbars has yielded a number of useful elements. We identified the best offer overall, as well as a number of subsets of the customer base who respond better to a different offer. We modeled the results of the test using a neural net to assess the best offer for each customer as an individual, and evaluated those predictions in a general way against the learnings from exploratory analysis. We also identified a number of other areas to look into — a system improvement, an opportunity for a different rewards program, and an opportunity to improve marketing channel use.

The project was interesting in a number of ways — chief among them being how many different opportunities there were to understand the customer data. A great example is the difference in channel performance for different user demographics. That aspect was a sidenote for this objective, but an entire project could definitely be done looking at just that facet of the data.

Improvement

There are a number of interesting ways analysis of this same test could be amplified:

  • Customers could be clustered according to their purchase patterns in order to find new subsets of customers who may respond better to different offers. (Clustering could be done on the base customer data or on extracted PCA latent features). If meaningful clusters are discovered, they could be useful in all approaches to analyzing offer performance.
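A hedged sketch of that clustering idea, with illustrative behavioral features:

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Sketch: reduce behavioral features with PCA, then cluster customers with
# k-means. Feature columns and cluster count are illustrative only.
behavior = customer_stats[["n_transactions", "total_spend", "median_txn", "offers_viewed"]]
scaled = StandardScaler().fit_transform(behavior.fillna(0))
latent = PCA(n_components=2).fit_transform(scaled)
customer_stats["cluster"] = KMeans(n_clusters=5, random_state=42).fit_predict(latent)
```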

Machine learning could be used on the current data structure to:

  • Model the performance of offer configurations that weren’t used during the test. Take, for example, the disc-20/5/10 offer, which was viewed only 35% of the time due to low channel usage and so was heavily disadvantaged. Incremental revenue and the offer completion rate could be estimated for this offer using machine learning, then combined with the view rates for offers that used all channels to estimate how the offer would have performed under different marketing channel usage.

The transcript data could be kept closer to its original time-series format and used as input to a Recurrent Neural Net. This RNN would:

  • Take in offer receipts, views, completions and transaction amounts
  • Predict future transaction amounts, given that different offers are sent at t=0

The RNN approach would have the advantage of explicitly handling the multiple-simultaneous-offers aspect of the test design, at the disadvantage of additional complexity.

Code and data for this project are available in the github repo.
