ML needs both a training pipeline and an inference pipeline. The training pipeline (discussed later) creates our model, and the inference pipeline makes the predictions.
Build an Inference Pipeline: Start by deploying an inference pipeline. By quickly delivering a basic model, we can begin to collect usage data that enables further training and performance gains. This initial model is based on simple heuristics from subject matter expert (SME) knowledge.
Using an HTTP request time of 1250 ms as a threshold, let’s run our training data through this basic model. Because of the spread in the data, we’ll plot the log of request time to make the results more consumable.
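Here’s a minimal sketch of that heuristic in Python; the file name training_data.csv is a hypothetical placeholder, and httpRequestTimeMs is the timing column referenced later in this section:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file; httpRequestTimeMs comes from our timing data
df = pd.read_csv("training_data.csv")

def heuristic_model(request_time_ms: pd.Series, threshold_ms: float = 1250.0) -> pd.Series:
    """Flag any request slower than the SME-chosen threshold as anomalous."""
    return request_time_ms > threshold_ms

df["anomaly"] = heuristic_model(df["httpRequestTimeMs"])

# Plot log(request time) so the long tail doesn't dominate the chart
plt.hist(np.log(df["httpRequestTimeMs"]), bins=50)
plt.axvline(np.log(1250.0), color="red", linestyle="--", label="1250 ms threshold")
plt.xlabel("log(httpRequestTimeMs)")
plt.ylabel("count")
plt.legend()
plt.show()
```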
Test Workflows: Once we can serve our simple model, let’s investigate our assumptions about the user experience and model results.
User experience is an important consideration because usability will drive usage of our model. Asking “What’s the best way to present our model? How can we ensure that results are presented usefully? What data would actually be helpful if returned?” forces us to consider our frontend design. Spamming users with anomalies that carry no causal context contributes to alert fatigue and instantly turns people away from our product. Instead, useful results contain data about the significance of an anomaly and the specific factors that caused the anomalous classification.
Model results are a necessary consideration as well. Ask: Are we getting non-trivial results? Is the training data accurate and representative? How can we remove bias? Initial runs show that the seriously bad anomalies get picked up, but much of the variation and nuance remains unexamined. Because many other metrics contribute to an anomalous classification (not just HTTP request time), we need to develop a better model that considers these features.
Understanding your data will lead to the biggest performance gains. Be efficient in your exploration, starting small with a dataset that’s easy to work with. Looking at your dataset to learn about its features is the easiest way to develop a solid model and feature engineering pipeline. This typically consumes the majority of development time.
Ensure that your data is consistently formatted, with clear inputs and outputs that will be available at prediction time. Make sure that data fields aren’t missing, corrupted, or imprecise by manually verifying labels or feature values. Examine the data quantity (aim for at least 10k examples), generating synthetic data if needed. Look at clusters and summary statistics. Examining our training set, we can drop all examples without meaningful (empty or NA) timing data.
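A quick cleaning pass might look like the following sketch; the file path is a placeholder and the exact set of timing columns is assumed from our schema:

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical path

# Timing fields we expect to be present at prediction time (names assumed)
timing_cols = ["tlsHandshakeMs", "httpRequestTimeMs"]

# Treat empty strings as missing, then drop rows without meaningful timing data
df = df.replace("", pd.NA).dropna(subset=timing_cols)

# Check quantity and look at summary statistics before modeling
print(len(df), "examples remain")
print(df[timing_cols].describe())
```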
Create features from seasonality, feature crosses, or other meaningful patterns you observe. We can also vectorize raw data to create new features and reduce dimensionality (representing vectors in fewer dimensions while preserving as much structure as possible). These features make it easier for models such as neural networks to detect non-linear relationships. Make sure the transformations and the process for creating new features are saved, as they must run on input data at prediction time too, or the model will not function properly.
We’ll create features for the hour of the day and the day of the week, as well as a feature cross between the two. Further, we’ll create a new feature called ‘total’ that is the sum of tlsHandshakeMs, httpRequestTimeMs, and the other timing metrics.
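One way to implement this, assuming a timestamp column in our data, is to keep every transformation in a single function so the exact same code runs at training and prediction time:

```python
import pandas as pd

TIMING_COLS = ["tlsHandshakeMs", "httpRequestTimeMs"]  # plus the other *Ms metrics

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Same function runs at training and prediction time, so features always match."""
    out = df.copy()
    ts = pd.to_datetime(out["timestamp"])  # 'timestamp' column is an assumption
    out["hour"] = ts.dt.hour               # 0-23
    out["day_of_week"] = ts.dt.dayofweek   # 0 = Monday
    # Feature cross: one bucket per (day, hour) pair, 168 in total
    out["day_hour_cross"] = out["day_of_week"] * 24 + out["hour"]
    # 'total' sums the timing metrics into a single latency feature
    out["total"] = out[TIMING_COLS].sum(axis=1)
    return out

df = engineer_features(df)
```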
Creating additional models can aid the training process. An ‘error model’ can be trained to detect the types of examples your model fails on. A ‘labeling model’ can find the best examples to label next if you’re working with supervised training. It distinguishes labeled from unlabeled examples, allowing you to label the unlabeled examples most different from the labeled ones. Because we’re sticking with a simple, unsupervised approach, we can skip these models for now.
Now that we have our engineered features, we can train our model by splitting our data, judging prediction performance, and evaluating feature performance.
Split Data: To ensure we can validate the results of the trained model, we need to set aside data. Specifically, 70% of our data is used for training, 20% for validation, and 10% for testing. The training data optimizes the weights of the model, the validation set tunes the hyperparameters (network depth, number of filters per layer, weight decay), and the test set is how we evaluate the final model.
Make sure that the validation and test datasets are as close to production data as possible, and prevent data leakage (future data used as a training feature, or duplicates across sets, which produce outsized performance through overfitting). Ensure that the data is split properly, as sketched below.
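With scikit-learn, a 70/20/10 split might look like this sketch; the second split takes one third of the remaining 30% to yield 20% validation and 10% test:

```python
from sklearn.model_selection import train_test_split

# 70/20/10 split with a fixed seed so the split is reproducible
train, rest = train_test_split(df, test_size=0.30, random_state=42)
val, test = train_test_split(rest, test_size=1 / 3, random_state=42)  # 20% val, 10% test

# For time-ordered data like ours, consider splitting on timestamp instead,
# so no "future" rows leak into training; also verify the sets are disjoint
assert not (set(train.index) & set(val.index))
assert not (set(train.index) & set(test.index))
assert not (set(val.index) & set(test.index))
```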
Debug: Abnormally high performance typically indicates data leakage or bugs. Google has written extensively on potential tests for an ML system. To debug a model (see the sketch after this list):
- First, ensure the model is wired properly so that the data flows through from input to prediction
- Next, make sure that the model can fit the training data
- Finally, check if the model can fit unseen data, assuming it’s within a reasonable range
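Each of these checks can be scripted. Here’s a minimal sketch assuming hypothetical X_train/y_train/X_val/y_val arrays (with both classes present even in the tiny subset) and a scikit-learn style classifier:

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)

# 1. Wiring: data should flow from input to prediction without errors
model.fit(X_train[:32], y_train[:32])
assert model.predict(X_train[:1]).shape == (1,)

# 2. Fit: the model should (over)fit a tiny training subset; expect near 1.0
print("tiny-subset accuracy:", model.score(X_train[:32], y_train[:32]))

# 3. Generalization: held-out performance should be in a plausible range
model.fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))
```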
Judge Performance: Choose an appropriate cost function to optimize your model. Comparing your cost function on training and test data helps estimate the bias-variance trade-off in the model: the degree to which our model has learned valuable generalized information without just memorizing specific details of the training set. Specifically, examine:
- Confusion Matrix: a performance measurement for classification problems where the output can be two or more classes. It is a table of the combinations of predicted and actual values (in the binary case: TP, FP, FN, TN)
- ROC Curves: show how capable a model is of distinguishing between classes. Product managers benefit from adding vertical or horizontal lines that correspond to allowed false positive (FP) or false negative (FN) rates based on product requirements (see the sketch after this list)
- Calibration Curves: the fraction of true positives relative to the model’s confidence level
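For a supervised classifier, the first two metrics might be computed like this; y_test and y_score are hypothetical labels and confidences, and the 5% FP cap is an assumed product requirement:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, confusion_matrix, roc_curve

# Confusion matrix: tn, fp, fn, tp for a binary classifier at a 0.5 cutoff
tn, fp, fn, tp = confusion_matrix(y_test, y_score > 0.5).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")

# ROC curve with a vertical line marking the allowed false positive rate
fpr, tpr, _ = roc_curve(y_test, y_score)
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
plt.axvline(0.05, linestyle="--", label="allowed FP rate (assumed 5%)")
plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.legend()
plt.show()
```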
Evaluate Features: Inspect which features aid classification by examining the classifier. If using a neural net, use black-box explainers such as LIME or SHAP.
Because we’re using unsupervised methods, we can skip all of this. As our data doesn’t contain labels, we can’t actually validate a model with these methods, so we send all data into training. Instead, we can use the silhouette score to check our output. On our KMeans model, this metric returns 0.802 (where 1 is best and -1 is worst). Visualizing the classified labels shows that this unsupervised approach produced reasonable results.
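A sketch of this unsupervised approach follows; the feature list and the choice of two clusters (normal versus anomalous) are assumptions for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Engineered features from earlier; scale them so no single feature dominates
feature_cols = ["total", "hour", "day_of_week", "day_hour_cross"]
scaler = StandardScaler().fit(df[feature_cols])
X = scaler.transform(df[feature_cols])

# k=2 assumes one 'normal' and one 'anomalous' cluster
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Silhouette ranges from -1 (worst) to 1 (best); we observed 0.802
print("silhouette:", silhouette_score(X, kmeans.labels_))
```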
The benefits of the simple model are made clear in our latency metrics. On a machine with 4 vCPUs and 15 GB of RAM, training took an average of 0.516 seconds. Inference for all the training data took 0.004 seconds. That’s well below any reasonable human reaction time.
However, our model is creating a lot of alerts. This underscores the necessity of collecting user feedback. If we had data on which anomalies were helpful, we could further tune our model with supervised methods toward only providing actionable alerts — the ones our users actually want.
Now that the model’s trained, we perform a final validation, build a production environment, and start monitoring. Here’s a high-level look at the necessary components.
Final validation is the last sanity check before moving a model to production. Ask: What assumptions is your model making from the training data? How was the training data collected? Does it differ meaningfully from production data? Is the dataset representative enough to produce a useful model?
It also requires considering the intended use and scope. Confirm the data used is authorized for collection, usage, and storage. Remove training bias by checking for measurement errors and corrupted data and by ensuring proper representation of all feature classes. Eliminate systematic bias and modeling concerns by employing feedback loops and benchmarking on all subsets of the training data. Defend against adversaries by deploying monitoring systems that detect shifts in usage activity.
Build a reliable production environment to properly run your model by engineering for:
- Failures (I/O checks): Sanitize input data by ensuring it falls within the range and distribution of the training data. Create a second ‘failure detection model’ that predicts the most likely failed inputs, or a ‘filtering model’ that pre-screens inferences. If these models indicate a potentially dubious result, fall back on a simple model or heuristic that provides a plausible output (see the sketch after this list).
- Performance: Speed up inference with caching if applicable (a least recently used or LRU cache is commonly used when dealing with repeated inputs). Also plan for model and data lifecycle management (when to retrain the model based on indications of drift), reproducibility, resilience, and pipeline flexibility.
- Feedback: Gathering implicit feedback is crucial for judging performance. Consider looking at actions users perform to infer whether a model provided useful results.
- CICD: To implement CICD for machine learning, consider deploying in shadow mode (in parallel with the existing model, checking results against the production model). Either deployment can use A/B testing or a multi-armed bandit (explore/exploit across multiple models in production).
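As an illustration of the failure-fallback and caching ideas above, here is a sketch of an inference function; the training-range bound, the anomaly cluster id, and the scaler and kmeans objects (from the training sketch earlier) are all assumptions:

```python
from functools import lru_cache

FALLBACK_THRESHOLD_MS = 1250   # the SME heuristic from our first model
TRAIN_MAX_TOTAL_MS = 60_000    # assumed max 'total' seen in training data
ANOMALY_CLUSTER = 1            # assumed id of the anomalous KMeans cluster

@lru_cache(maxsize=4096)  # repeated inputs skip inference entirely
def predict(total_ms: float, hour: int, day_of_week: int) -> bool:
    # I/O check: fall back to the heuristic when input is outside the training range
    if not 0 <= total_ms <= TRAIN_MAX_TOTAL_MS:
        return total_ms > FALLBACK_THRESHOLD_MS
    # Reuses scaler and kmeans saved from the training sketch
    x = scaler.transform([[total_ms, hour, day_of_week, day_of_week * 24 + hour]])
    return bool(kmeans.predict(x)[0] == ANOMALY_CLUSTER)
```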
Monitor models to keep the production environment resilient. Specifically, look at accuracy and usage over time. Accuracy over time informs the refresh rate, or when the model is retrained; most models need to be updated regularly to maintain performance, and monitoring can detect when a model is no longer fresh and needs retraining. Usage can show patterns of abuse (for example, an anomalous number of logins). Similarly, monitor the performance and business metrics outlined previously.
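One simple way to monitor for drift is to compare live feature statistics against baselines recorded at training time; the baseline numbers and the sample batch below are placeholders:

```python
import numpy as np

# Baseline statistics for the 'total' feature, recorded at training time (assumed)
TRAIN_TOTAL_MEAN = 850.0
TRAIN_TOTAL_STD = 310.0

def drift_score(live_values: np.ndarray) -> float:
    """z-score of the live batch mean against the training distribution."""
    stderr = TRAIN_TOTAL_STD / np.sqrt(len(live_values))
    return abs(live_values.mean() - TRAIN_TOTAL_MEAN) / stderr

# Alert (and consider retraining) when recent traffic drifts from training data
recent_totals = np.array([900.0, 1200.0, 875.0, 1430.0])  # placeholder batch
if drift_score(recent_totals) > 3.0:
    print("drift detected: schedule retraining")
```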