Improving our chances at achieving true business value through MLOps

From June 3rd to 7th, ALTEN hosted its first International Tech Week, featuring a series of distinctive and insightful webinars on the theme: Understanding the Impact of AI & ML on Our Industries and Daily Life.

One standout session was Improving our chances at achieving true business value through MLOps, delivered by Jeroen Bleker, Machine Learning Consultant at ALTEN Netherlands. As a continuation of his previous article, Jeroen now addresses the challenges one may encounter when implementing MLOps, as well as the best practices not to overlook.

Essential MLOps Elements

Beginning of the ML lifecycle

The first organisational MLOps focus is the structure and archetype of data personnel. Working in silos and having unclear responsibility boundaries will negatively affect ML projects. Organisations should select a fitting organisational archetype, e.g., ML-first (UC Berkeley, 2021). After selecting the archetype, team roles should be clearly defined and filled where missing (Fig. 1). ML work is experimental: it can plateau or produce negative results, which clashes with traditional SWE lifecycle techniques. The ML project manager should therefore focus on end-to-end development, and project work should iterate from there.

Fig. 1: Overview of ML team roles and their levels of hard skills (SWE, ML) and soft skills (UC Berkeley Full Stack, 2021)

After organising the team roles, a scoping structure should be set up to make sure that accepted ML projects are an appropriate fit for ML. Using an intake or assessment is the best practice to frame the problem correctly, justify project cost versus value, and build stakeholder trust. The Data Project Checklist by fast.ai offers a structured approach to the intake.
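As a loose illustration of what such an intake can capture (this is not the fast.ai checklist itself, and all field names below are hypothetical), a minimal sketch in Python:

```python
from dataclasses import dataclass, field

@dataclass
class ProjectIntake:
    """Hypothetical intake record for scoping an ML project proposal."""
    problem_statement: str                  # what business problem would ML solve?
    success_metric: str                     # business metric the project is judged on
    data_sources: list[str] = field(default_factory=list)
    estimated_cost: float = 0.0             # rough project cost
    estimated_value: float = 0.0            # expected business value if it succeeds
    stakeholders: list[str] = field(default_factory=list)

    def is_worth_pursuing(self) -> bool:
        # Simple gate: the proposal must name data sources and justify its cost.
        return bool(self.data_sources) and self.estimated_value > self.estimated_cost
```

Even a record this simple forces the key scoping questions to be answered before a project is accepted.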

Finally, Exploratory Data Analysis (EDA) should be implemented at the beginning of the ML lifecycle. Starting a project with no relevant business data, or with noisy and poor-quality data, almost always leads to failure. Ask yourself questions such as: Does data exist? Where? What is the quality of this data (e.g., type checking, distribution expectations, value errors)? Verify that the EDA is documented and reproducible; a notebook with environment management is suitable.
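As a minimal sketch of the kind of checks these questions imply (the column names, expected types, and the 5% threshold are purely illustrative):

```python
import pandas as pd

def basic_data_quality_report(df: pd.DataFrame, expected_types: dict) -> dict:
    """Minimal EDA-style quality checks: existence, types, missing values."""
    report = {}
    for col, dtype in expected_types.items():
        report[col] = {
            "exists": col in df.columns,
            "dtype_ok": col in df.columns and str(df[col].dtype) == dtype,
            "missing_ratio": float(df[col].isna().mean()) if col in df.columns else None,
        }
    return report

# Toy data: flag columns whose missing ratio exceeds 5%
df = pd.DataFrame({"age": [34, 51, None], "income": [42000.0, 58000.0, 61000.0]})
report = basic_data_quality_report(df, {"age": "float64", "income": "float64"})
print({c: r for c, r in report.items() if r["missing_ratio"] and r["missing_ratio"] > 0.05})
```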

Middle of the ML lifecycle

During model development, a major challenge that many companies face is a mismatch between the technical metrics of the model and the business metrics used for decision-making. Often, models are optimised based on technical metrics, rather than being tuned to align with decision-making criteria. As a result, it becomes unclear to stakeholders how the model will improve business outcomes. To address this challenge, models should be calibrated using business metrics. Additionally, model errors should be analysed to understand their causes and implications.
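One simple way to link a classifier to a business metric is to choose its decision threshold by expected monetary value rather than a purely technical score. A minimal sketch, where the per-prediction values are illustrative placeholders that would normally come from domain experts:

```python
import numpy as np

def best_threshold(y_true, y_prob, value_tp=100.0, cost_fp=20.0):
    """Pick the classification threshold that maximises a simple business
    value function: value per true positive minus cost per false positive.
    """
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    thresholds = np.linspace(0.0, 1.0, 101)
    values = []
    for t in thresholds:
        pred = y_prob >= t
        tp = np.sum(pred & (y_true == 1))   # correctly flagged cases
        fp = np.sum(pred & (y_true == 0))   # costly false alarms
        values.append(tp * value_tp - fp * cost_fp)
    return thresholds[int(np.argmax(values))]

# Toy scores only, to show the mechanics
print(best_threshold([0, 0, 1, 1], [0.2, 0.6, 0.4, 0.9]))
```

A threshold chosen this way makes the model's contribution to business outcomes explicit to stakeholders.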

End of the ML lifecycle

While models are based on a static snapshot of the world, the world around them is anything but static. It is chaotic, complex, and constantly changing (Fig. 2). The data used for training might even differ significantly from the production data (Fig. 3). Model versioning, a model registry, and monitoring are essential for observability, reproducibility, and trust. Monitoring the distribution of the input data can be a quicker way to identify issues before the data is used for modelling, as sketched below. In addition to being used for calibration, business metrics should be included in monitoring. Optionally, a container registry can be used when containers are part of the deployment.

Fig. 2: Feature distribution shift
Fig. 3: Example of a possible difference between training (Then) and production (Now) data
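As a minimal sketch of such input monitoring, a two-sample Kolmogorov-Smirnov test (one common choice, not the only one) can compare a feature's live distribution against its training distribution; the data below is synthetic:

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_drift_check(train_values, live_values, alpha=0.05):
    """Flag a feature whose live distribution has drifted away from training.

    A small p-value suggests the two samples come from different
    distributions, so the feature should be inspected before trusting
    new predictions.
    """
    stat, p_value = ks_2samp(train_values, live_values)
    return {"statistic": float(stat), "p_value": float(p_value), "drift": p_value < alpha}

# Synthetic example: production data shifted upwards relative to training
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 1000)
live = rng.normal(0.5, 1.0, 1000)   # the kind of "Then vs. Now" shift in Fig. 3
print(feature_drift_check(train, live))
```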

The final essential MLOps end-of-lifecycle element is establishing some form of testing setup. ML models fail silently, so data scientists should consider beforehand what would define proper learning and prediction. It's important to create documentation to make the tests reproducible. Adoption of DevOps-based CI/CD should be the goal in the long run.
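A minimal pytest-style sketch of what defining proper prediction beforehand can look like; the placeholder model and tiny evaluation set are purely illustrative stand-ins for your own training code and a frozen test set:

```python
# test_model_behaviour.py -- run with pytest
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

def train_model(X, y):
    # Placeholder for the project's real training routine
    return DummyClassifier(strategy="most_frequent").fit(X, y)

def test_model_beats_minimum_accuracy():
    X, y = [[0], [1], [0], [1]], [0, 1, 0, 1]   # stand-in for a frozen eval set
    model = train_model(X, y)
    # "Proper prediction" defined up front: the model must clear a fixed bar.
    assert accuracy_score(y, model.predict(X)) >= 0.5

def test_prediction_shape_and_range():
    X, y = [[0], [1]], [0, 1]
    preds = train_model(X, y).predict(X)
    assert len(preds) == len(X)                 # silent failures surface here
    assert set(preds) <= {0, 1}                 # outputs stay in the label space
```

Because such tests are plain code, they slot directly into the CI/CD adoption mentioned above.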

Recommended MLOps Elements

Governance

The recommended MLOps elements start with governance. Without proper governance, compliance is challenging to check, and auditing becomes difficult. Organisations should start by setting up a data management strategy. This strategy should include elements of data retention, data protection, and data roles such as stewards and owners. From the strategy, create a data governance framework that includes data agreements, data standards, standard operating procedures, and monitoring plans. Both the strategy and the framework should be promoted within the team, well visualised, and easily accessible. The conviction that validation, documentation, and other external regulatory requirements are priorities should be nurtured among team members.

Managing models

While calibration and the use of business metrics are essential MLOps elements, model evaluation can be further improved. Manual processes, such as hand-run experiments, are unreliable and hard to reproduce and share. These issues are easily solved by using an experiment tracking tool. Additionally, it would be prudent to start all experimentation with a baseline set by a simple model that is reviewed together with domain experts. Ensure that the performance metric tracked during experimentation is linked to the business objective. Be wary of overreliance on aggregated metrics: as stated previously, the world is chaotic and complex, and models only approximate it, so aggregated metrics might not capture its full complexity.
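As one illustration, a tracking tool such as MLflow records each run's parameters and metrics so experiments stay reproducible and shareable; the experiment name, parameters, and metric values below are invented for the example:

```python
import mlflow

# Assumes a local MLflow installation; names and values are illustrative.
mlflow.set_experiment("churn-baseline")

with mlflow.start_run(run_name="logistic-regression-baseline"):
    mlflow.log_param("model_type", "logistic_regression")
    mlflow.log_param("C", 1.0)
    # Log the business-linked metric alongside the technical one,
    # as recommended above; 0.91 and 12500.0 are placeholder values.
    mlflow.log_metric("auc", 0.91)
    mlflow.log_metric("expected_value_eur", 12500.0)
```

Logging a business-linked metric next to the technical one keeps every experiment comparable on the terms stakeholders actually care about.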

The velocity of modelling and deployment is another factor in effective AI. Implementing dependency management and focusing on iteration within a developer sandbox environment can speed up projects. The sandbox can consist of notebooks, though these should usually not be used for production code, since refactoring notebooks into scripts afterwards takes too much time. Setting up feature stores or vector databases (for LLMs) can further accelerate development.

CI/CD for MLOps

As stated, traditional CI/CD should be adapted to ML, completing the testing setup and increasing its velocity. Furthermore, organisations should think about the details of deployment and scaling beforehand. Ask questions such as: How often can and should we retrain? What latency can we expect (e.g., batch vs. streaming)? Can we run inference at scale, and what can be done to achieve that (e.g., compiling and distilling)? Finally, start working on automation and orchestration: make pipelines automated and implement changes through tracked settings files instead of scripts.
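A minimal sketch of such a settings-driven pipeline, using a hypothetical YAML file; the keys and the `run_pipeline` helper are illustrative, and in practice an orchestrator would consume the same version-controlled file:

```python
import yaml  # PyYAML

# Hypothetical settings file: changing the retraining cadence or inference
# mode means editing (and version-controlling) this file, not the code.
SETTINGS = """
retrain_schedule: weekly
inference:
  mode: batch          # batch vs. streaming
  batch_size: 512
model:
  name: demand-forecast
  version: "3"
"""

config = yaml.safe_load(SETTINGS)

def run_pipeline(cfg: dict) -> None:
    # Placeholder step: a real orchestrator would dispatch retraining and
    # inference jobs based on these tracked settings.
    print(f"Retraining {cfg['model']['name']} v{cfg['model']['version']} "
          f"({cfg['retrain_schedule']}), inference mode: {cfg['inference']['mode']}")

run_pipeline(config)
```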

Knowledge sharing

While it won’t ruin projects on its own, poor or absent knowledge sharing can lead to duplicated work and time-consuming onboarding. Begin by finding the tool that best fits your organisation (e.g., a dashboard, a wiki, or Teams/Slack/Discord) and strive to generate enthusiasm among your team about using it.

Conclusion

MLOps is all about organising the right people and implementing the right processes. The use of MLOps elements should be guided by priorities, and the final platform should be flexible enough to support workflows while integrating with existing solutions. No true one-size-fits-all solution exists for MLOps, and each implementation of the mentioned elements should be tailored to your organisation. Simplicity and covering all necessary functionalities are key; use priorities and domain knowledge to define what is necessary. As the hype around AI continues to grow, organisations that embrace MLOps will thrive!

Write the story with us!