After fortifying the machine learning (ML) proof of concept (POC) and launching it into production, the next important steps are planning for its future and ensuring its continued success. Although the bulk of the effort and resources for a project like this will go into the first steps that we covered previously in the series—including determining metrics and objectives, building the dataset, establishing an experiment environment, and developing and deploying the POC model and code—more tasks remain after production.
This final blog covers what comes right at the “end,” once the project seemingly should be finished. In particular, as the model begins to be used, various elements should be tracked, such as input and output data, model performance, and relevant operational metrics like latency or compute usage. This tracking helps identify performance issues or data drift, which can then be corrected, when necessary, by retraining and redeploying the model after establishing that it performs as well as or better than previous versions. This blog highlights important elements to track post-production, some useful tools for this type of monitoring, and strategies for deciding when and how to upgrade model versions as well as for launching them successfully.
Once a model is in production, as with any piece of software, the discussion typically turns to what to monitor, particularly in terms of performance and failures. With ML, however, monitoring goes further than collecting information about things like resource usage or latency. Monitoring for post-production ML that goes beyond these low-level metrics is termed “observability” and is vital for determining not only when and how a model has gone astray but also how to fix it. Tracking for ML observability gives developers the ability to probe the underlying reasons why a model isn’t performing as expected in production. In addition to identifying model drift (when the model’s performance drops or gradually degrades over time), tracking under observability can help uncover the causes, in particular data drift. Data drift occurs when the input data in production no longer match the data on which the model was trained. For example, a vision model trained to extract certain information from scanned documents that all share one format suddenly fails because newer documents arrive in a different format.
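To make data drift concrete, the following is a minimal sketch of one common way to flag it: comparing the distribution of a logged production feature against the training distribution with a two-sample Kolmogorov–Smirnov test. The feature, window size, and significance threshold here are illustrative assumptions rather than recommendations from this post, and real systems typically run such checks per feature on a schedule.

```python
# Minimal data-drift check: compare a window of production values for one
# feature against the training distribution with a two-sample
# Kolmogorov-Smirnov test. Threshold and data are illustrative only.
import numpy as np
from scipy.stats import ks_2samp

def detect_feature_drift(train_values: np.ndarray,
                         production_values: np.ndarray,
                         p_value_threshold: float = 0.01) -> bool:
    """Return True if production values likely follow a different
    distribution than the training values."""
    statistic, p_value = ks_2samp(train_values, production_values)
    return p_value < p_value_threshold

# Synthetic example standing in for a logged numeric feature.
rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)  # training data
prod_feature = rng.normal(loc=0.5, scale=1.0, size=1_000)    # shifted in production

if detect_feature_drift(train_feature, prod_feature):
    print("Possible data drift detected; investigate and consider retraining.")
```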
Beyond the data collected for observability and model troubleshooting, there are also lower-level software-type metrics and higher-level business key performance indicator (KPI)–based metrics to consider. Examples of each type include the following:

- Software-type metrics: latency, compute and resource usage, and other operational health indicators
- ML observability metrics: model performance or accuracy, data drift, and model drift
- Business KPI–based metrics: the product- and business-level indicators tied to the objectives defined at the start of the project
Note that tracking model performance or accuracy also requires the true label, or “correct answer,” for what the model was trying to predict. Hence, incorporating a method for users to rate or correct the model’s output is vital for capturing that information and ensuring that decisions about model retraining and redeployment can be made.
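As a rough illustration of how that feedback loop can be captured, the sketch below logs each prediction with an identifier, attaches the user-provided label when it arrives, and computes live accuracy from the labeled records. The record structure and function names are hypothetical; in practice this would be backed by a database or an observability platform rather than an in-memory dictionary.

```python
# Hypothetical sketch: log predictions, attach user feedback later, and
# compute live accuracy from records that have a confirmed label.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PredictionRecord:
    prediction_id: str
    model_version: str
    prediction: str
    true_label: Optional[str] = None  # filled in once a user confirms or corrects

records: dict[str, PredictionRecord] = {}

def log_prediction(prediction_id: str, model_version: str, prediction: str) -> None:
    records[prediction_id] = PredictionRecord(prediction_id, model_version, prediction)

def log_feedback(prediction_id: str, true_label: str) -> None:
    # Called when the user rates or corrects the model's output.
    if prediction_id in records:
        records[prediction_id].true_label = true_label

def live_accuracy(model_version: str) -> Optional[float]:
    labeled = [r for r in records.values()
               if r.model_version == model_version and r.true_label is not None]
    if not labeled:
        return None
    correct = sum(r.prediction == r.true_label for r in labeled)
    return correct / len(labeled)
```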
This may seem like a lot to keep track of post-production, but, fortunately, these monitoring and observability features no longer need to be built from scratch. Many solutions exist across the different levels of monitoring. For the software-type metrics, tools like Grafana or Datadog integrate well and collect the necessary data seamlessly, typically with a helpful user interface on top. Other platforms, like Neptune, Evidently AI, and Arize, cover the more complex ML observability needs, from tracking model performance to providing features that help uncover issues like data drift.
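For the software-level metrics, instrumentation is often only a few lines of code. The sketch below uses the open-source prometheus_client library, one common metrics source that dashboards such as Grafana can visualize; the metric names, labels, and port are assumptions chosen for illustration, not requirements of any particular platform mentioned above.

```python
# Illustrative instrumentation of an inference function with Prometheus
# metrics (latency histogram and prediction counter). Metric names, labels,
# and the port are illustrative choices.
import time
from prometheus_client import Counter, Histogram, start_http_server

PREDICTION_LATENCY = Histogram(
    "model_prediction_latency_seconds", "Time spent producing a prediction")
PREDICTION_COUNT = Counter(
    "model_predictions_total", "Number of predictions served", ["model_version"])

def predict_with_metrics(model, features, model_version: str = "v1"):
    start = time.perf_counter()
    prediction = model.predict(features)
    PREDICTION_LATENCY.observe(time.perf_counter() - start)
    PREDICTION_COUNT.labels(model_version=model_version).inc()
    return prediction

if __name__ == "__main__":
    # Expose metrics at http://localhost:8000/metrics for scraping; in a real
    # service this would run alongside the model-serving loop.
    start_http_server(8000)
```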
Updating ML models and launching changes in production require a strategic approach to ensure optimal performance. Triggers for model updates include significant shifts in input data patterns, declining model accuracy, or the introduction of new features that could enhance predictions. When working with pre-trained or foundation models, such as large language models, retraining strategies might involve fine-tuning the model on a smaller, domain-specific dataset or using transfer learning to adapt it to new tasks. To safeguard against performance regressions, A/B testing and canary rollouts are effective methods for evaluating the new model: by gradually exposing a small subset of users to the updated model, teams can monitor its performance closely against the existing version and ensure that the new model meets or exceeds benchmarks before a full deployment. This systematic approach not only mitigates risk but also fosters confidence in the reliability of the ML system. Additionally, monitoring KPIs throughout the process and having a rollback plan in place to revert quickly to the previous model if needed are crucial.
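One simple way to implement a canary rollout is to route a small, deterministic fraction of requests to the candidate model and record which version served each request, so that the two versions can be compared on the same metrics and rolled back if needed. The split percentage, hashing scheme, and function names below are illustrative assumptions; production rollouts are often handled by the serving infrastructure itself.

```python
# Simple canary-routing sketch: a deterministic hash of the user ID sends a
# small fraction of traffic to the candidate model. Percentages and names
# are illustrative.
import hashlib

CANARY_FRACTION = 0.05  # expose roughly 5% of users to the new model

def assign_model_version(user_id: str) -> str:
    # Hash the user ID so each user consistently sees the same version.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "candidate" if bucket < CANARY_FRACTION * 100 else "current"

def predict(user_id: str, features, current_model, candidate_model):
    version = assign_model_version(user_id)
    model = candidate_model if version == "candidate" else current_model
    prediction = model.predict(features)
    # Logging the version alongside the prediction lets teams compare the
    # candidate against the current model and roll back if it underperforms.
    return prediction, version
```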
As you transition your successful ML POC to a fully operational production model, establishing a comprehensive strategy for ongoing monitoring and updates is essential. In particular, the key elements to monitor post-production include various metrics across several levels, from software performance to ML observability to business value, with particular attention toward ML-critical aspects such as data drift and model performance. Leveraging existing observability tools will accelerate the post-production work as well. Strategies for determining when to retrain models and best practices for testing new versions will ensure that your ML system remains effective and resilient over time.
Throughout this series, we’ve explored the critical steps necessary for a successful ML project, from setting goals and metrics to preparing datasets and developing the initial POC model. With this final blog, you should have all the knowledge and guidance you need to bring your ML ideas to fruition.
Becks Simpson is a Machine Learning Lead at AlleyCorp Nord, where developers, product designers, and ML specialists work alongside clients to bring their AI product dreams to life. In her spare time, she also works with Whale Seeker, another startup using AI to detect whales so that industry and these gentle giants can coexist profitably. She has worked across the spectrum of deep learning and machine learning, from investigating novel deep learning methods and applying research directly to solving real-world problems, to architecting pipelines and platforms for training and deploying AI models in the wild, to advising startups on their AI and data strategies.