You may be in the planning phase of building your own machine learning model, be it a propensity model to help you better target prospects or a clustering model to better segment your addressable market. This post covers the key stages of building a successful machine learning model and includes specific questions to answer and tasks to accomplish in each stage. This post assumes you understand what machine learning is and how it impacts marketing analytics.
Once you’re done reading this, you can try building a live model using our guide: How to build a propensity model.
Pre-processing
During the pre-processing stage, you will need to assemble your dataset by aggregating relevant data sources into a single table. Be sure to clean and validate your data so it can yield high-quality predictions. You will also need to convert categorical variables into dummy variables and normalize quantitative variables as appropriate, so that your data is structured into usable independent variables.
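To make these steps concrete, here is a minimal sketch using pandas and scikit-learn; the file name and column names (e.g., `industry`, `annual_revenue`) are hypothetical placeholders for your own data:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("prospects.csv")  # your aggregated, single-table dataset (hypothetical file)

# Convert categorical variables into dummy (one-hot) variables.
df = pd.get_dummies(df, columns=["industry", "region"], drop_first=True)

# Normalize quantitative variables so they share a common scale.
numeric_cols = ["annual_revenue", "employee_count"]
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
```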
Questions to ask:
- Have you clearly identified your training dataset? This is particularly important for propensity models. While you may have an extensive database of prospects, for training purposes you want to include only prospect accounts that have been marketed to, and exclude prospects that have not been exposed to any marketing or sales touches, since a negative outcome won’t be meaningful in this case. In addition, make sure you have a clearly defined outcome metric, e.g., we want to predict which accounts will yield a closed won deal.
- Are your fill rates high enough? Datasets are rarely, if ever, complete in real life. As you aggregate different datasets with missing records (i.e., low fill rates), any gaps in your final dataset can quickly become a problem as you’re left with a very small number of complete records. Consider filling in missing data with zeros, median values, or a placeholder value in the case of a categorical variable, as in the imputation sketch after this list. This will allow you to increase the size of your usable data without compromising the predictive value of your model.
- Are lagging effects significant? Not all variables affect the outcome you’re predicting within the same timeframe. If you’re trying to factor in advertising spend as an independent variable, you may want to consider using various trailing windows, for example, trailing 6 months for brand spend and trailing 3 weeks for promotional spend (the rolling-window sketch after this list shows one way to compute these).
- How collinear are the input variables? Different machine learning algorithms react differently to collinearity. You may want to find out whether the algorithm you plan to use is sensitive to collinearity and whether your variables are highly collinear (the VIF check after this list is one common diagnostic).
- Are the data sources diverse enough? If you have conversion-rich data only for segment A of your market and you’re trying to predict segment B, your input data may not be diverse enough. You’ll also want to ensure that your data sources are likely to provide the information you’re looking for. If all you have is billing data, but you’re trying to understand which prospects will convert, you may not have the right data.
- Is the sample size adequate? How much data you need is driven by how many variables you’re using and how accurate you want the model to be. At a minimum, try to use a sample of at least 50x to 100x the number of variables you’re using in your model.
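On the fill-rate question, a minimal imputation sketch (again with hypothetical file and column names) might look like this:

```python
import pandas as pd

df = pd.read_csv("prospects.csv")  # your aggregated dataset (hypothetical file)

# Numeric gaps: use the median, or zero where absence genuinely means zero.
df["annual_revenue"] = df["annual_revenue"].fillna(df["annual_revenue"].median())
df["ad_clicks"] = df["ad_clicks"].fillna(0)

# Categorical gaps: use an explicit placeholder level.
df["industry"] = df["industry"].fillna("unknown")
```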
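For lagging effects, pandas rolling windows over a date index are one way to build trailing features; the weekly spend table below is assumed purely for illustration:

```python
import pandas as pd

# Hypothetical weekly spend table with a "week" date column.
spend = pd.read_csv("ad_spend.csv", parse_dates=["week"])
spend = spend.sort_values("week").set_index("week")

# Different trailing windows for different spend types.
spend["brand_trailing_6m"] = spend["brand_spend"].rolling("182D").sum()
spend["promo_trailing_3w"] = spend["promo_spend"].rolling("21D").sum()
```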
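And for collinearity, the variance inflation factor (VIF) from statsmodels is one common diagnostic; the columns here are hypothetical:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Candidate quantitative predictors (hypothetical columns).
X = pd.read_csv("prospects.csv")[["annual_revenue", "employee_count", "ad_clicks"]].dropna()

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.sort_values(ascending=False))  # values above ~5-10 suggest strong collinearity
```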
Feature identification
At the feature identification stage, you will need to conduct data discovery and identify major features (variables) and clusters (segments) showing predictive value. To do this you will need to analyze interactions between variables. You may also need to generate new variables, such as in the lagged advertising spend example we mentioned earlier.
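One way to screen candidate features for predictive value is to rank them by mutual information with the outcome; here is a sketch with scikit-learn, using hypothetical file and column names (mutual information is just one possible screening method, not the only one):

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

df = pd.read_csv("prospects.csv").dropna()  # hypothetical prepared dataset
X = df[["annual_revenue", "employee_count", "ad_clicks"]]
y = df["closed_won"]  # binary outcome (e.g., closed won deal)

scores = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
print(scores.sort_values(ascending=False))  # higher = more predictive signal
```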
Questions to ask:
- Are known major business drivers captured in the existing dataset? Make sure you use all of the domain knowledge that is available on the outcome you are trying to predict. Talk to business experts about which drivers (variables) should be factored in to predict the outcome. Model training will subsequently validate whether the conventional wisdom was correct.
- Are you accounting for known biases? In many cases, your dataset will include a lot of biases that may affect its predictive value. These include selection bias (is the data representative?), metric bias (is the outcome measured correctly?), and system bias (did a system migration affect the data?).
- Are the drivers actionable? If your model will be used to optimize an outcome, make sure your independent variables include drivers you can manage and act on, such as price or spend, rather than only non-actionable variables, such as inflation.
- Are there significant differences in dynamics across markets? Be careful when combining data from different markets. You may be trying to predict B2B propensity to buy during the COVID pandemic. If so, you may find out that the California market and the Florida market are very different due to different COVID policies, so it’s best to model them separately.
- Is there non-stationarity and/or market saturation? The market you’re trying to model may be going through significant changes, so history may be a poor indicator of future performance. For instance, if the market is getting saturated and growth rates are trending down for all competitors, you may need to adjust predicted propensities down.
Algorithm and parameter selection
At this stage, you will want to select an algorithm along with its tuning parameters. While it may be tempting to use the AutoML feature common to many platforms, in practice you will want to select the best algorithm yourself based on your specific use case to build an effective model. Then you will need to tune your model’s parameters through cross-validation and compute feature importances.
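As a sketch of cross-validated tuning plus feature importances, here is one possible setup using scikit-learn’s GridSearchCV with a random forest (the algorithm, grid, and column names are illustrative assumptions, not a recommendation for your use case):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

df = pd.read_csv("prospects.csv").dropna()  # hypothetical prepared dataset
X = df[["annual_revenue", "employee_count", "ad_clicks"]]
y = df["closed_won"]

# Cross-validated search over a small tuning grid.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [200, 500], "max_depth": [3, 6, None]},
    cv=5,
    scoring="roc_auc",
)
search.fit(X, y)

# Feature importances from the best model found by the search.
importances = pd.Series(search.best_estimator_.feature_importances_, index=X.columns)
print(search.best_params_)
print(importances.sort_values(ascending=False))
```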
Questions to ask:
- Is the observed behavior captured in feature importance consistent with business experience? This question is paramount as you will often identify data integrity issues by answering this question. For instance, you may have chosen an independent variable that is a system artifact and therefore highly dependent on the outcome you are predicting. This variable would show up as “high importance,” but quick due diligence would show it needs to be excluded in order to avoid circular logic.
- Is data sparseness an issue? If you’ve previously identified sparseness as an issue, you may want to select an algorithm that works better with sparse datasets, such as XGBoost.
- Is there consistent behavior across algorithms? If the same variables keep coming up at the top of your feature importance chart with multiple algorithms, chances are these variables are the true drivers of the outcome you are trying to predict. If a variable shows up with one algorithm only, it may be correct, but it may also be an artifact that you won’t be able to replicate later.
- Is the model optimally tuned? Make sure that you’re confident in your cross-validation approach, and that you ran a broad enough search to tune your model and maximize its predictive performance.
Error analysis
Error analysis is an important task during model training. By looking at an array of error metrics generated during cross-validation, you will be able to identify and solve modeling issues. For a classifier (e.g., a propensity model), you will need to look at the model’s confusion matrix, precision, accuracy, recall, and F1 score. For a regression, you will need to look at your model’s R², RMSE, MAE, and MAPE. In addition, determine your preferences and/or penalties for false positives and false negatives.
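All of these metrics are available in scikit-learn; a minimal sketch, with toy values standing in for your cross-validation output, might look like this:

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score,
    mean_absolute_error, mean_absolute_percentage_error, mean_squared_error, r2_score,
)

# Toy out-of-fold labels for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Classifier (e.g., propensity model) metrics.
print(confusion_matrix(y_true, y_pred))
print(precision_score(y_true, y_pred), accuracy_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))

# Regression metrics on toy continuous values.
y_true_r = np.array([10.0, 12.0, 9.0, 15.0])
y_pred_r = np.array([11.0, 11.5, 10.0, 14.0])
print(r2_score(y_true_r, y_pred_r))                        # R²
print(mean_squared_error(y_true_r, y_pred_r) ** 0.5)       # RMSE
print(mean_absolute_error(y_true_r, y_pred_r))             # MAE
print(mean_absolute_percentage_error(y_true_r, y_pred_r))  # MAPE (scikit-learn >= 0.24)
```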
Questions to ask:
- How does the model performance compare to naïve trending or existing predictive estimates? While machine learning often outperforms simpler prediction techniques, it is not a panacea. You should always ask yourself whether a simpler approach or the legacy approach yields better results.
- Is the model correctly predicting known outcomes? To find out, you can often compare predictions with actual outcomes in your training data, such as in these benchmark comparisons.
- If your model is incorrect, do you understand why? Often looking at patterns in incorrect predictions of known outcomes is the best way to identify missing variables.
- Is there evidence of bias? This is particularly important for individual consumer outcomes, such as credit or banking. It’s also critical in a B2B environment to ask yourself the same question. For instance, is propensity high in a particular segment because the company has historically advertised more to that segment? In this case, high propensity simply reflects a past choice with advertising funds, and directing funds to another segment might yield similar results even though it might be considered low propensity based on past data.
- Are you accounting for most of the observed variability in the data? For example, if your R² is less than 50%, this means your model can explain less than 50% of the variability in your dataset. It’s a sign you are likely missing important variables.
Driver contribution analysis
Once your model is trained and you’ve completed the error analysis phase, you need to ensure you can interpret modeling results in business terms that can be compared to the lived experience of domain experts and field professionals. By analyzing the contribution or impact of each driver (e.g., with the feature importance chart or through sensitivity analysis), you can interpret predictive results in business terms and identify key business insights.
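One way to run such a sensitivity analysis is permutation importance, sketched here with scikit-learn (the model choice and column names are illustrative assumptions):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

df = pd.read_csv("prospects.csv").dropna()  # hypothetical prepared dataset
X = df[["annual_revenue", "employee_count", "ad_clicks"]]
y = df["closed_won"]

model = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffle each feature in turn and measure how much performance drops:
# bigger drops mean the driver contributes more to the prediction.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
contributions = pd.Series(result.importances_mean, index=X.columns)
print(contributions.sort_values(ascending=False))
```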
Questions to ask:
- Can you interpret results in business terms? You want to be able to tell which variables matter, and those variables should align with your field experience. If there are surprises, you’ll want to rule out data integrity and system artifacts as the possible cause.
- Are the insights relevant, material, and actionable? Remember, all three conditions must be met for your insights to have business value.
How Can We Help?
Feel free to check us out and start your free trial at https://app.g2m.ai or contact us below!