What is MLOps, and why do we need it?

Most machine learning models that get built never reach production. They sit in a notebook on someone’s laptop, perform well on a held-out test set, and then stall. The gap is not modelling skill — it is the absence of the operational scaffolding that turns a trained model into a service the rest of the business can actually use. MLOps, short for Machine Learning Operations, is the discipline that closes that gap: the deployment pipeline, the monitoring, the retraining loop, and the version control that together let a model run in production without a human babysitting it.

For organisations that have never operationalised a model before, the practical question is not “what is MLOps in the abstract” but “what is the smallest viable stack that still produces a production-quality deployment?” That is the question this article answers.

Understanding MLOps

MLOps borrows heavily from DevOps but adds three dimensions that traditional software operations do not have to handle: data pipelines that feed the model, statistical drift in the inputs after deployment, and rollback semantics where the artifact you are rolling back is a trained weight file rather than a code commit. The end-to-end lifecycle MLOps automates runs from data collection through feature engineering, training, validation, deployment, monitoring, and retraining.

In our experience, the first MLOps implementation in an organisation rarely needs the full lifecycle on day one. What it needs is a working path from a registered model artifact to a served prediction, plus enough monitoring to detect when that path breaks. Everything else — automated retraining, feature stores, shadow deployments — is layered on once the first model is genuinely running.

Why do most ML models never reach production?

The notebook-to-production gap has a specific shape, and naming it is the first step to closing it. We see the same pattern across our consulting engagements with teams attempting their first deployment:

The model lives in a Jupyter notebook with hard-coded paths and undocumented dependencies.
There is no separation between training code and serving code, so the inference logic has to be re-derived.
The data the model was trained on is not versioned, so the model itself cannot be rebuilt deterministically.
There is no agreed contract for how the rest of the business calls the model — REST, batch file, message queue.
Once deployed, no one knows whether the model is still working a month later.

Each of these is a missing MLOps capability. The reason most models stall is not that the team lacks data science talent; it is that none of the five items above is anyone’s explicit responsibility.

What does a first MLOps stack actually need?

A common mistake is to copy the reference architecture of a mature AI organisation — feature store, online inference cluster, automated retraining, drift detectors, canary deployments — and try to build it all before the first model ships. That is overengineering for a first project and almost always fails on cost or complexity grounds.

The minimum viable stack for a first deployment has four parts:

Capability	Why it matters on day one	Realistic first-deployment tools
Model registry	Lets you point a serving system at “the current model” without rebuilding it	MLflow Model Registry, or even a versioned S3 bucket
Containerised serving	Makes the model environment reproducible across machines	Docker + FastAPI or BentoML; Kubernetes only if you already run it
CI/CD for models	Automates the path from “new model file” to “deployed endpoint”	GitHub Actions or GitLab CI invoking the container build
Basic monitoring	Tells you when latency, error rate, or input distribution shifts	Prometheus + Grafana for ops metrics; simple input statistics logging

This is the floor. It is not glamorous, but in our experience it carries roughly 80% of the production benefit of a full MLOps stack at perhaps 20% of the build cost (a rule of thumb from our engagements, not a measured figure). Once this stack is running and a model is actually serving requests, the team has the operational vocabulary to decide what to add next: usually automated retraining or a feature store, not both at once.
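To make the "versioned bucket" end of the registry row concrete, here is a minimal sketch of a file-based registry using only the Python standard library. The class name `FileModelRegistry`, the `v<N>` directory layout, and the `CURRENT` pointer file are all assumptions made for illustration; this is not how MLflow Model Registry works internally, only the smallest thing that lets a serving process ask for "the current model".

```python
from __future__ import annotations

import json
import shutil
from pathlib import Path


class FileModelRegistry:
    """Toy registry: one numbered directory per version, plus a CURRENT pointer file."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _versions(self) -> list[int]:
        # Version directories are named v1, v2, ... (hypothetical layout).
        return sorted(
            int(p.name[1:])
            for p in self.root.glob("v*")
            if p.is_dir() and p.name[1:].isdigit()
        )

    def register(self, artifact: Path, metadata: dict) -> int:
        """Copy the artifact into the next version directory and return its number."""
        version = (self._versions() or [0])[-1] + 1
        target = self.root / f"v{version}"
        target.mkdir()
        shutil.copy2(artifact, target / "model.bin")
        (target / "meta.json").write_text(json.dumps(metadata, indent=2))
        return version

    def promote(self, version: int) -> None:
        """Mark a registered version as the one serving should load."""
        if version not in self._versions():
            raise ValueError(f"unknown version {version}")
        (self.root / "CURRENT").write_text(str(version))

    def current_path(self) -> Path:
        """Resolve 'the current model' to a concrete artifact path."""
        version = int((self.root / "CURRENT").read_text().strip())
        return self.root / f"v{version}" / "model.bin"
```

Because promotion only rewrites a pointer, rolling back is the same operation as rolling forward: `promote(1)` after `promote(2)` points serving at the older artifact again without rebuilding anything.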

Which MLOps capabilities are overengineering for a first project?

The capabilities most often built too early, in our observation, are:

Online feature stores — useful when many models share features, painful when only one model exists.
Shadow / canary deployment infrastructure — valuable at scale, expensive overhead for one model serving low traffic.
Automated retraining pipelines — premature until you have at least a month of production data showing the model actually needs retraining.
Custom drift detection frameworks — start with logging input distributions and reviewing them weekly; only build automation once you know what drift looks like in your data.

The discipline is to add each of these after a triggering signal from the running system, not before.
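The "log input distributions and review them weekly" starting point from the list above can be very small. The sketch below assumes each logged request is a dict of numeric features, with `None` or an absent key meaning "missing"; the function name `summarise_inputs` and the output fields are illustrative choices, not part of any particular monitoring tool.

```python
from __future__ import annotations

import math
from collections import defaultdict


def summarise_inputs(rows: list[dict]) -> dict:
    """Per-feature count, missing rate, mean and (population) standard deviation."""
    features = {name for row in rows for name in row}
    values = defaultdict(list)
    missing = defaultdict(int)
    for row in rows:
        for name in features:
            value = row.get(name)
            if value is None:
                missing[name] += 1  # absent key or explicit None
            else:
                values[name].append(float(value))

    summary = {}
    for name in sorted(features):
        xs = values[name]
        if xs:
            mean = sum(xs) / len(xs)
            std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
        else:
            mean = std = float("nan")
        summary[name] = {
            "count": len(xs),
            "missing_rate": missing[name] / len(rows),
            "mean": mean,
            "std": std,
        }
    return summary
```

Writing one such summary per day gives a reviewer something to compare week over week, and the features whose numbers move are the natural candidates for automated checks later.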

How does MLOps differ from DevOps?

The two share the same automation philosophy — version control, CI/CD, infrastructure-as-code, monitoring — but diverge on three axes that matter operationally.

Data is part of the artifact. A DevOps pipeline versions source code. An MLOps pipeline must also version the training data and the feature transformations, because the same code on different data produces a different model. Tools like DVC and lakeFS exist specifically to bring data under the same version control discipline.
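The idea that data belongs to the artifact can be illustrated with a content hash stored next to the model. This is only a sketch of the principle, assuming the training data fits in a single file; the names `file_sha256` and `training_manifest` are hypothetical, and the code is not a substitute for DVC or lakeFS, which handle large datasets, remotes, and branching.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def training_manifest(data_path: Path, code_version: str, params: dict) -> dict:
    """Record which data, code and parameters produced a model."""
    return {
        "data_sha256": file_sha256(data_path),
        "code_version": code_version,  # e.g. a git commit hash (assumed convention)
        "params": params,
    }
```

Two training runs with the same `code_version` but different `data_sha256` values are, for versioning purposes, different models, which is exactly the distinction a code-only pipeline cannot express.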

Drift is a first-class failure mode. A deployed web service that worked yesterday will, all else equal, work today. A deployed model that worked yesterday may quietly degrade today because the input distribution has shifted — a new product category, a seasonal pattern, a change in upstream data formatting. Monitoring has to cover not just latency and error rate but also the statistical properties of the inputs and outputs.
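One way to put a number on "the input distribution has shifted" is the population stability index (PSI), which compares how a feature's values fall into bins at training time versus in production. The article does not prescribe a drift metric, so treat PSI here as one common choice among several; the equal-width binning, the `1e-6` floor for empty bins, and the function name are assumptions of this sketch.

```python
from __future__ import annotations

import math


def population_stability_index(
    baseline: list[float], current: list[float], bins: int = 10
) -> float:
    """PSI of `current` against `baseline`, using equal-width bins over the baseline range."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def bucket_shares(sample: list[float]) -> list[float]:
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            # Values outside the baseline range land in the first or last bin.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor keeps log() and the ratio defined for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    expected = bucket_shares(baseline)
    actual = bucket_shares(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))
```

Identical samples give a PSI of zero, and the value grows as the two histograms diverge. Thresholds such as 0.1 and 0.25 are often quoted as rules of thumb, but the level that matters for a given model is something the first weeks of production data have to establish.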

Rollback semantics are different. Rolling back a code commit is well-understood. Rolling back a model means restoring a previous model version and understanding whether the input data the older model expects still matches what’s flowing through the pipeline. The model registry exists precisely to make this clean.
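The second half of a model rollback, checking that the older model's expected inputs still exist, can be sketched as a schema comparison. The sketch assumes each registered version stores a `schema` mapping of feature name to type name alongside the artifact; that field and both function names are hypothetical, not features of a specific registry.

```python
from __future__ import annotations


def contract_mismatches(expected: dict[str, str], live: dict[str, str]) -> list[str]:
    """List the ways the live pipeline no longer satisfies a model's input schema."""
    problems = []
    for name, dtype in expected.items():
        if name not in live:
            problems.append(f"missing feature: {name}")
        elif live[name] != dtype:
            problems.append(f"type changed: {name} {dtype} -> {live[name]}")
    # Extra live features are tolerated: an older model can simply ignore them.
    return problems


def safe_rollback(versions: dict[int, dict], target: int, live_schema: dict[str, str]) -> int:
    """Return the version to serve, refusing a rollback whose inputs are gone."""
    problems = contract_mismatches(versions[target]["schema"], live_schema)
    if problems:
        raise RuntimeError(f"cannot roll back to v{target}: " + "; ".join(problems))
    return target
```

The check is deliberately one-directional: what matters for a rollback is whether everything the older model needs is still being produced, not whether the pipeline has gained new fields since.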

Which MLOps tools are realistic, and which assume maturity?

The open-source MLOps landscape is crowded, and many of the most-cited tools assume a level of data engineering maturity that first-time deployers do not have. A realistic mapping:

Tool	Realistic for first deployment?	Why
MLflow	Yes	Lightweight, runs locally, covers tracking + registry without a Kubernetes cluster
Docker + FastAPI	Yes	Standard container serving; no MLOps-specific infrastructure needed
DVC	Yes	Git-like data versioning; low operational overhead
Kubeflow	Only if you already run Kubernetes	Powerful, but operating Kubernetes is itself a major undertaking
TensorFlow Extended (TFX)	No, for most first projects	Assumes TensorFlow throughout and a maturity of data pipelines that first-time teams rarely have
Feature stores (Feast, Tecton)	No	Pay off when multiple models share features; overhead for a single model
Airflow / Prefect	Yes, if batch	Useful for batch retraining or batch inference orchestration

The pattern is consistent: tools that wrap a single concern (registry, container, data version) are realistic on day one; platforms that assume an orchestrated ecosystem are realistic on day 200.

A recognisable first-deployment arc

In a typical first MLOps engagement, the arc looks like this. The team starts with a model in a notebook that demonstrably outperforms the existing rule-based system. We extract the inference logic into a Python module with a clear function signature, containerise it with FastAPI, register the trained model in MLflow, and wire a GitHub Actions workflow that rebuilds the container when a new model version is registered. Prometheus and Grafana cover latency and error rate; a separate logging job snapshots input feature distributions daily into a Parquet file that a data scientist can inspect when the model misbehaves.
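The "Python module with a clear function signature" step might look like the sketch below. The model is a hypothetical logistic-regression stand-in loaded from a JSON file so the example stays self-contained; the names `LinearModel` and `predict`, and the JSON layout, are assumptions for illustration. In the arc described above, a FastAPI route would be a thin wrapper that calls `predict` and returns its result.

```python
from __future__ import annotations

import json
import math
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class LinearModel:
    """Stand-in for a trained model: named features, weights and a bias."""

    feature_names: tuple[str, ...]
    weights: tuple[float, ...]
    bias: float

    @classmethod
    def load(cls, path: Path) -> "LinearModel":
        # Assumed artifact format: {"feature_names": [...], "weights": [...], "bias": x}
        raw = json.loads(Path(path).read_text())
        return cls(tuple(raw["feature_names"]), tuple(raw["weights"]), raw["bias"])


def predict(model: LinearModel, features: dict[str, float]) -> dict[str, float]:
    """Single entry point for serving: validate inputs, score, return a JSON-friendly dict."""
    missing = [name for name in model.feature_names if name not in features]
    if missing:
        raise ValueError(f"missing features: {missing}")
    score = model.bias + sum(
        weight * float(features[name])
        for name, weight in zip(model.feature_names, model.weights)
    )
    return {"probability": 1.0 / (1.0 + math.exp(-score))}
```

Keeping `predict` free of any web framework or notebook state is the point of the extraction: the same function can back a REST endpoint, a batch job, or a unit test without being re-derived.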

Each phase produces an artifact that has value independent of the final system. The containerised inference module is reusable for the next model. The CI/CD workflow is a template. The monitoring dashboard generalises. This is what we mean when we say the second model deployment costs less than the first: the infrastructure is the durable output, not just the served model.

What usually remains imperfect after the first deployment is automated retraining and drift response. Both require enough production data to calibrate, which by definition does not exist on day one. Naming that explicitly is part of the engagement — a first MLOps implementation closes the notebook-to-production gap, and prepares the team to close the drift-response gap next.

FAQ

What does MLOps actually mean for an organisation that has never operationalised a model?

It means installing the minimum operational scaffolding to move a trained model out of a notebook and into a service the rest of the business can call, monitor, and update. Concretely: a model registry, containerised serving, a CI/CD pipeline that builds and deploys the container, and basic monitoring of latency, error rate, and input distributions.

Which MLOps capabilities does a first project genuinely need, and which are overengineering?

A first project needs a model registry, container-based serving, CI/CD for the container, and basic ops + input monitoring. It does not need an online feature store, automated retraining, canary infrastructure, or a custom drift framework on day one. Each of those should be added after a specific signal from the running system, not before.

Which MLOps tools and frameworks are realistic for a first deployment, and which assume mature data engineering already in place?

Realistic: MLflow, Docker + FastAPI, DVC, GitHub Actions / GitLab CI, Prometheus + Grafana. Assumes maturity: Kubeflow (requires Kubernetes operations), TFX (assumes TensorFlow-first pipelines), Feast / Tecton (pay off only with multiple models sharing features).

What is the smallest viable MLOps stack that still produces a production-quality deployment?

Model registry + containerised serving + a CI/CD workflow that rebuilds the container on a new model version + ops and input-distribution monitoring. Four moving parts, all available as open source, none requiring a Kubernetes cluster.

How does MLOps differ from DevOps in the data-pipeline, drift, and rollback dimensions?

DevOps versions code; MLOps must also version data and feature transformations, because the same code on different data yields a different model. DevOps does not have to monitor for statistical drift; MLOps must. DevOps rollback restores a code commit; MLOps rollback restores a model version and must verify the expected input contract still holds.

Why do most ML models never reach production, and which MLOps gaps cause that?

Because the operational scaffolding is no one’s explicit job. The recurring gaps we see are: notebooks with hard-coded paths instead of a packaged inference module, no separation between training and serving code, unversioned training data, no agreed calling contract for the rest of the business, and no monitoring once the model is deployed. Each gap maps directly to an MLOps capability.

We cover this transition in more depth in our companion piece on MLOps for organisations that have never operationalised a model, and revisit the broader operational picture in our introduction to MLOps.
