An Overview of Machine Learning Operations
Building Mature Machine Learning Production Systems through MLOps
Starting in 2018, the term “MLOps” (short for Machine Learning Operations) came into wider use to describe the discipline of standardizing and streamlining the machine learning lifecycle management process. MLOps advocates for bringing CI/CD practices into the machine learning lifecycle, and it lies at the intersection of data engineering, machine learning, and development operations.
It has famously been written that putting a machine learning model into production incurs substantial technical debt. As ‘Hidden Technical Debt in Machine Learning Systems’ puts it, “developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive.” Because MLOps aims to automate and monitor all parts of the ML lifecycle, it can introduce systems that reduce maintenance burden, complexity, and ultimately technical debt.
The popularization of MLOps can be seen as evidence of a larger maturation process that the field of machine learning is undergoing. Instead of Machine Learning Engineers and ML Platform teams providing infrastructure only for the model deployment process, we are widening the lens to take in the entire ML lifecycle. One could argue that MLOps is the foundation for more mature machine learning processes. With this broader perspective, focus can shift to what happens both before and after the model is put into production. New questions arise, such as:
- How can we determine when a model is degrading and needs to be retrained?
- How can we perform root cause analysis on degrading models?
- How can data scientists easily iterate on their models?
- How can data scientists track which iterations of their model they have already tested, in order to gain insight into which iterations to test in the future?
- How can we facilitate the sharing and reuse of models?
- How can we create and document model lineage and history?
- How can we A/B test models to determine whether a challenger model is ready to replace a champion model?
While many different functionalities can be considered part of the MLOps realm, the core components of a sound MLOps infrastructure include versioning, run tracking, A/B experimentation, and monitoring. For each of these core components, we’ll now review what defines the problem, key considerations, and potential solutions.
As data scientists iterate on models over time, tracking model versions and model runs is essential to documenting lineage. Tracking model versions refers to enumerating each production model and capturing its associated performance, artifacts, and code. Tracking runs involves recording the different tested (non-production) iterations of a model. Together, version and run tracking let data scientists understand how a model has evolved over time and point the way toward improvements: by looking back at what they’ve already tested, data scientists can spot patterns and form hypotheses about which techniques will work well in the future. As data science teams scale, models are often handed off to new team members, and tracking is instrumental in transferring the contextual knowledge built up over hours spent experimenting with the model.
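To make the idea concrete, here is a minimal, purely illustrative sketch of a run tracker. Real systems persist runs to a backing store and capture artifacts and code references as well; the `RunTracker` and `Run` names are hypothetical, not an actual library API.

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Run:
    """One tested (non-production) iteration of a model."""
    run_id: str
    params: dict                      # hyperparameters / feature choices tested
    metrics: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)

class RunTracker:
    """Hypothetical in-memory run tracker; production tools persist
    this history so it survives handoffs between team members."""
    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics):
        run = Run(run_id=uuid.uuid4().hex[:8], params=params, metrics=metrics)
        self.runs.append(run)
        return run.run_id

    def best_run(self, metric):
        """Look back over what has been tested to guide the next iteration."""
        return max(self.runs, key=lambda r: r.metrics[metric])

tracker = RunTracker()
tracker.log_run({"model": "xgboost", "max_depth": 4}, {"auc": 0.81})
tracker.log_run({"model": "xgboost", "max_depth": 8}, {"auc": 0.84})
best = tracker.best_run("auc")
```

Even a toy like this illustrates the payoff: the recorded history answers “what have we already tried, and what worked best?” without relying on any one person’s memory.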
Monitoring is a critical piece of a well-built MLOps system, as it enables data scientists, product owners, and executives to trust that critical systems are functioning correctly. Monitoring can be placed on many parts of the system: feature stores and feature pipelines, end-to-end model training and serving pipelines, and model performance itself. At large companies, data flows in from many sources, making it hard to guarantee quality and to know when definitions have changed. Monitoring at various steps as data flows to the feature store, as well as into and out of it, can reveal large shifts in data quality or definition that could impact models. The creation and monitoring of end-to-end training and serving pipelines is also a central piece of how Google thinks about MLOps.
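As a small sketch of what feature-pipeline monitoring can look like, the check below flags a batch of records whose null rate or value range falls outside expected bounds before the batch reaches the feature store. The function name and thresholds are illustrative assumptions, not a specific product’s API.

```python
def check_feature_batch(rows, feature, max_null_rate=0.05, valid_range=None):
    """Flag a feature batch whose null rate or value range has shifted.

    `rows` is a list of dicts, as records might arrive en route to a
    feature store. Thresholds here are illustrative, not tuned.
    """
    values = [r.get(feature) for r in rows]
    null_rate = sum(v is None for v in values) / len(values)
    alerts = []
    if null_rate > max_null_rate:
        alerts.append(f"{feature}: null rate {null_rate:.1%} exceeds {max_null_rate:.0%}")
    if valid_range is not None:
        lo, hi = valid_range
        out_of_range = [v for v in values if v is not None and not lo <= v <= hi]
        if out_of_range:
            alerts.append(f"{feature}: {len(out_of_range)} values outside [{lo}, {hi}]")
    return alerts

batch = [{"amount": 12.0}, {"amount": None}, {"amount": -3.0}, {"amount": 25.0}]
alerts = check_feature_batch(batch, "amount", max_null_rate=0.1, valid_range=(0, 1000))
```

In practice these alerts would feed a pager or dashboard; the point is that cheap, automated checks at each hop catch quality and definition shifts before they silently degrade models downstream.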
Model monitoring is critical to understanding when models are degrading in production. Standard metrics such as accuracy, precision, and recall are useful for detecting performance decline. However, many models at Square suffer from the problem that labels, or “ground truth,” are not received until anywhere from three months to a year after the prediction is acted on in the product. As a result, Square must rely on drift monitoring of the inputs (features) and outputs (predictions) of our models as a proxy for model performance. The benefit of tracking both is that a first level of root cause analysis can be performed when predictions drift: the user can explore which features may be driving the prediction or “concept” drift.
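One widely used drift statistic for this kind of label-free monitoring is the Population Stability Index (PSI), which compares a baseline sample of a feature or prediction distribution against a recent production sample. A minimal sketch, with the common (but rule-of-thumb, not universal) thresholds of <0.1 stable and >0.25 significant drift:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample (e.g. from
    training time) and a recent production sample of the same quantity."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    # Clip bucket proportions away from zero so the log term stays finite.
    e_pct = np.clip(np.histogram(expected, bins=edges)[0] / len(expected), 1e-6, None)
    a_pct = np.clip(np.histogram(actual, bins=edges)[0] / len(actual), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 5000)   # e.g. a feature at training time
stable = rng.normal(0.0, 1.0, 5000)     # production sample, no shift
shifted = rng.normal(0.8, 1.0, 5000)    # production sample after a shift
```

Running the same statistic over every feature and over the prediction distribution gives exactly the proxy signal described above: when prediction PSI spikes, the per-feature PSIs point to likely culprits for root cause analysis.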
Easily accessible, automated A/B experimentation is a critical piece of MLOps-focused model development. Data scientists often develop a new model and evaluate its performance on a test set or through backtesting. Unfortunately, training-serving skew is not uncommon in many ML stacks, so the ability to test models in production is vital. When true A/B testing isn’t available, a common approximation is to shadow-deploy the new model into production and compare what it would have predicted against what the existing model is predicting. In that scenario, however, we never see how users actually respond to the product changes driven by the new model’s predictions, so it’s hard to know its true impact. A/B experimentation is needed so data scientists can view the impact of the changed product (due to a prediction) on a customer’s behavior. Through A/B testing and traffic shaping, we can route some percentage of traffic to the existing model and some (usually smaller) percentage to the new model, enabling the data scientist to measure the effect of their model on user behavior for a subset of users in a controlled environment.
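The traffic-shaping step can be sketched in a few lines. This is an illustrative assumption about one common approach, not Square’s implementation: hashing a stable customer id buckets each customer deterministically, so the same customer always lands in the same arm for the duration of the experiment.

```python
import hashlib

def assign_variant(customer_id, challenger_pct=10):
    """Deterministically route a small, stable slice of traffic to the
    challenger model; everyone else stays on the champion. Hashing the
    customer id keeps each customer in one arm across requests."""
    bucket = int(hashlib.sha256(customer_id.encode()).hexdigest(), 16) % 100
    return "challenger" if bucket < challenger_pct else "champion"

# Over many customers, roughly challenger_pct percent hit the challenger.
counts = {"champion": 0, "challenger": 0}
for i in range(10_000):
    counts[assign_variant(f"customer-{i}")] += 1
```

Sticky, hash-based assignment matters because a customer who bounced between champion and challenger mid-experiment would contaminate the behavioral comparison the A/B test exists to measure.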
The need for MLOps has arisen from two key shifts in the machine learning space. First, there has been a significant increase in the number of models that companies maintain in production. Historically, companies managed a few models through custom implementations, but the massive increase in the use of machine learning for automated decisioning has changed the landscape. With many models in production, it becomes difficult to maintain a global view of their states without standardization. Among other standardization functions, MLOps ensures that during model development both technical and business teams can document, track, and stay on the same page about different model versions. It also ensures that models can be reproduced and explained. Square faces this problem directly: our Machine Learning Engineers manage at least 400 models in production. The need for central tracking and versioning systems has also grown with team size. On larger teams it’s less likely that a model is owned by only one person; instead, it becomes increasingly likely the model is passed between many people, and that handoff requires a centralized, standardized way to communicate key information such as model version and performance. As the size of teams and the number of models grow, so does the need for MLOps.
The second shift leading to the development of the MLOps discipline is the increased reliance on machine learning for automated decisioning. When relying heavily on automated decisioning, mitigating model risk becomes a very high priority. Our Data Scientists rely on Square’s Machine Learning Platform team to create a safe system that enables them to perform the “emotional job” of trusting the data and systems that make automated decisions affecting millions of customers and millions of dollars of revenue. Our Machine Learning Platform team’s risk decision service executes about 500 decisions per second, or roughly 43 million decisions per day, and we need to be able to trust the fidelity of those decisions. Unmonitored models are a real threat to Square’s business: for example, an underwriting model on our Capital product failed over Christmas break one year, leading to millions of dollars of poorly originated loans. And because many people are involved in the complex process of building ML models, a unified view and standardized methodologies for managing models become essential.
Over the last few years there has been an increased commodification of third-party tools for all parts of the ML lifecycle, including, arguably most critically, model serving. Examples of these tools include Google’s Vertex AI, Amazon’s SageMaker, Microsoft’s Azure, DataRobot, Seldon.io, and many others. Because of the ubiquity and commodification of serving and ML lifecycle tooling, it’s becoming increasingly less attractive for large companies to maintain their own end-to-end systems. Data Scientists require evergreen ML infrastructure that is always ready to serve models featuring the newest and fanciest ML capabilities; they don’t have time for infrastructure to take one to three years to catch up to their use cases. Keeping in-house ML infrastructure up to date can require tens to hundreds of engineers.
As a result, Square’s Machine Learning Platform is increasingly embracing a “buy over build” philosophy, which fits our needs for several reasons. First, when it comes to software it’s important not to reinvent the wheel (specialists can do it better!). Square is a payments and financial services company, not a model monitoring company. It would be naive of us to assume we can produce state-of-the-art ML lifecycle software better than a company that specializes in that field, and given the recent explosion of strong solutions, there is no lack of capable third-party tools. Second, Square’s MLP aims to embrace a philosophy of no “weirdware,” which is best ensured by buying well-built software instead of building custom code that is unusual or difficult to define. Finally, engineering maintenance is real. It is often said that for every hour spent building software, ten hours will be required to maintain it. While the actual build-to-maintenance ratio may differ, engineering maintenance is a considerable cost that one should not overlook. By buying software, we can reduce our maintenance burden to just the code required to integrate with the third-party software.
MLOps is a new field that expands the field of vision beyond feature stores and serving solutions to include more advanced ML lifecycle capabilities such as versioning, run tracking, A/B experimentation, and monitoring. MLOps aims to apply standard engineering CI/CD principles to the ML infrastructure space. The emergence of MLOps as a discipline demonstrates a maturation of the machine learning space, as we are able to support functionality beyond just serving. Companies that don’t provide support for MLOps functionality risk decreased model development and iteration efficiency among their data scientists, leading to subpar models deployed in production, which can significantly hurt a company’s bottom line. MLOps is a trendy field for a reason: it provides a roadmap for the next generation of engineering advancements that can further optimize the ML lifecycle and, subsequently, the ML models themselves. With strong MLOps functionality, companies can expect to maximize the learnings and predictive capabilities of their models, enabling them to “squeeze the most juice” from their data.