What is MLOps, and why do we need it?

Introduction to MLOps

Many organisations that build machine-learning models never deploy them: the model sits in a notebook, the business case never materialises, and the team builds another model. MLOps is the operational discipline that closes this gap: the practices, tooling, and infrastructure that move models from notebook to production and keep them performing there. This article frames the problem as a “first MLOps implementation for a team that has models but no production pipeline”, a recognisable problem class with specific tools, specific failure points (the notebook-to-production gap), and specific outcomes, including what remained imperfect. See the services landing for the broader consulting programme.

The recommended approach is smallest-viable-stack first: deploy something that produces value, build the second deployment on the first, and expand the stack as deployments justify it. Over-engineering the platform before the first deployment is the failure mode this article warns against.

What this means in practice

The smallest viable MLOps stack is much smaller than vendor blueprints suggest.
The notebook-to-production gap is mostly engineering and ownership, not model science.
The second deployment should be cheaper than the first; that is the test of whether your MLOps actually works.
MLOps differs from DevOps specifically in its data-pipeline, drift, and rollback dimensions.

What does MLOps actually mean for an organisation that has never operationalised a model?

For a first-time team, MLOps means having repeatable answers to five operational questions:

How does a model move from notebook to production? Defined path: model artefact format, deployment target, serving infrastructure, version control.

How does production data reach the model? Data pipeline: source-to-feature transformation, batch or streaming, monitoring of pipeline health.

How is model performance measured in production? Metrics emitted, dashboards, alerts on degradation.

How is the model updated? Retraining trigger, training pipeline, evaluation, deployment.

How is the model rolled back if it misbehaves? Previous version retained, traffic switching, monitoring during rollout.

These five questions need answers before deployment. The questions don’t require a comprehensive platform; they require defined practices and minimal infrastructure.

What MLOps doesn’t mean for first-timers:

A comprehensive platform with feature stores, model registries, experiment tracking, automated retraining, A/B testing, shadow deployment, canary rollout — all integrated. This is mature MLOps; building all of it before the first deployment is over-engineering.

A team of dedicated MLOps engineers. First MLOps usually means one data scientist learning enough engineering and one engineer learning enough ML; dedicated MLOps teams come later.

Replicating the patterns of FAANG ML organisations. Those patterns optimise for scale (thousands of models, millions of requests/sec); first-time teams have different constraints and different priorities.

The first-implementation goal. Deploy one model, measure its production performance, learn what the team’s actual MLOps gaps are, address those gaps in the second deployment. The first deployment is also a learning exercise.

Which MLOps capabilities (CI/CD for models, monitoring, retraining, registry) does a first project genuinely need, and which are overengineering?

Genuinely needed for first deployment:

Version control for code. Git or equivalent. Non-negotiable.

Version control for model artefact. A defined storage location for trained model files; versioning by date or hash; manual is acceptable initially.

Deployment automation. A repeatable procedure to deploy the model to serving infrastructure. Can be a documented runbook plus a script; doesn’t need full CI/CD initially.

Production monitoring. Metrics on prediction latency, throughput, error rate; alerts on degradation.

Data pipeline reliability. The data feeding the model in production is reliable; pipeline health is monitored.

Performance evaluation in production. Some way to measure whether the model is doing what it should — at minimum, a manual review of sampled predictions.

Rollback capability. A way to revert to the previous model version if the new one misbehaves.

Documentation. Sufficient documentation that another team member could deploy or revert the model.
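As one way to make “versioning by date or hash” concrete, the sketch below copies a trained model file into a directory named after its content hash and writes a small metadata file next to it. The function name `store_artefact`, the 12-character version prefix, and the `metadata.json` layout are illustrative assumptions rather than a prescribed format; a local directory stands in for whatever storage location the team defines.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def store_artefact(model_file: Path, store_dir: Path, notes: str = "") -> str:
    """Copy a trained model file into store_dir under a content-hash version.

    Returns the version string (first 12 hex characters of the SHA-256 digest).
    """
    digest = hashlib.sha256(model_file.read_bytes()).hexdigest()
    version = digest[:12]  # assumption: a short prefix is unique enough for one team

    target_dir = store_dir / version
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(model_file, target_dir / model_file.name)

    # Plain-text metadata kept alongside the artefact.
    metadata = {
        "version": version,
        "sha256": digest,
        "file": model_file.name,
        "stored_at": datetime.now(timezone.utc).isoformat(),
        "notes": notes,
    }
    (target_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return version
```

Because the version is derived from the file contents, storing the same artefact twice yields the same version, and any change to the model file yields a new one. Rolling back then amounts to pointing the serving code at an earlier version directory.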

Over-engineering for first deployment:

Feature store with online and offline consistency. Useful for many-model organisations; over-engineering for one.

Comprehensive experiment tracking platform. MLflow, Weights & Biases — useful but not blocking for first deployment.

Model registry with metadata schemas, lineage tracking, approval workflows. Useful at scale; overhead for first deployment.

Automated retraining pipelines with continuous training. Premature; defer until manual retraining cadence is established.

A/B testing infrastructure with experiment management. Useful when you have multiple model candidates; premature for one model.

Shadow deployment with traffic mirroring. Useful for high-risk deployments; overhead for first deployment.

Multi-environment promotion (dev → staging → prod). Useful at scale; one well-monitored prod environment is acceptable for first deployment.

The pattern. First deployment focuses on getting one model live with reliable basics. Maturity accretes from there. Over-engineering the platform before the first deployment delays the first deployment, which delays the learning, which delays the team’s actual MLOps maturation.

Which MLOps tools and frameworks are realistic for a first deployment, and which assume mature data engineering already in place?

Realistic for first deployment (limited assumptions):

Containerisation. Docker is universal; the model serves in a container. Cloud-agnostic.

Serving frameworks. FastAPI for HTTP serving; ONNX Runtime, Triton Inference Server, BentoML for higher-performance serving. Pick one; start simple.

Cloud-managed deployment. AWS SageMaker, Azure ML, Google Vertex AI manage the deployment, scaling, monitoring infrastructure. Higher cost; lower operational burden.

Self-managed Kubernetes. KServe, Seldon Core, Ray Serve on K8s. Lower cost; higher operational burden; assumes K8s expertise.

Monitoring. Prometheus + Grafana for metrics; integrate with team’s existing monitoring stack.

CI/CD. GitHub Actions, GitLab CI, Jenkins — whatever the team already uses; extend with model-deployment workflow.

Experiment tracking. MLflow is open source and adequate; Weights & Biases is a commercial alternative; a first deployment can defer this.

Mature data engineering assumed by many MLOps tools:

Feature stores (Feast, Tecton, Vertex Feature Store) assume reliable feature pipelines feeding them. Without mature data engineering, the feature store is a complication.

Data warehouses with consistent schemas (Snowflake, BigQuery, Databricks) for feature derivation. Without these, the feature derivation is per-model engineering.

Streaming infrastructure (Kafka, Kinesis) for real-time features. Without this, the feature freshness must be batch-driven.

Workflow orchestration (Airflow, Dagster, Prefect) for pipeline reliability. Often present already; if not, batch retraining is run manually until a pipeline is built.

Data lineage tools (DataHub, Marquez, OpenMetadata) for tracking data flow. Useful at scale; first deployment can defer.

The first-deployment recommendation. Use cloud-managed deployment (SageMaker, Vertex, Azure ML) to reduce operational burden; the cost is justified by faster time-to-first-deployment. Add MLflow for experiment tracking if not already present. Monitor with the team’s existing observability stack. Defer feature store, streaming, and data lineage tools until the second or third deployment justifies the investment.

What is the smallest viable MLOps stack that still produces a production-quality deployment?

The minimum viable MLOps stack (representative):

Code. Git repository with model training code, deployment configuration, serving code.

Training. Jupyter notebook or script that produces a model artefact; runs on developer laptop or single VM.

Model artefact storage. S3 bucket (or equivalent) with versioned model files; metadata in a simple text file alongside.

Serving. Containerised FastAPI service running on a managed Kubernetes service (EKS, GKE, AKS) or cloud-managed inference endpoint (SageMaker Endpoint).

Monitoring. CloudWatch / Google Cloud Monitoring (formerly Stackdriver) / Azure Monitor for infrastructure metrics; custom metrics emitted by serving code for ML-specific signals (prediction distribution, latency).

Alerting. PagerDuty or Slack integration with monitoring; on-call rotation.

Data pipeline. Scheduled batch job (cron, Airflow if already in use) that pulls data and transforms it.

Documentation. README in the repository with deployment runbook; runbook for rollback; runbook for retraining.

Deployment automation. CI/CD pipeline that builds container, pushes to registry, updates Kubernetes deployment.

Retraining. Manual; data scientist triggers training when needed; deployment via the same automation.

Rollback. Kubernetes deployment rollback or model artefact version swap; documented procedure.
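The custom metrics mentioned under Monitoring can start very small. The sketch below is a minimal in-process illustration that wraps a prediction function and records latency, error count, and prediction distribution. The class name `ServingMetrics` and the snapshot fields are hypothetical, and exporting the snapshot to CloudWatch or a similar service is deliberately left out.

```python
import time
from collections import Counter
from statistics import quantiles


class ServingMetrics:
    """In-memory counters that serving code could export to a monitoring stack."""

    def __init__(self) -> None:
        self.latencies_ms: list[float] = []
        self.prediction_counts: Counter = Counter()
        self.errors = 0

    def observe(self, predict, features):
        """Call predict(features), recording latency, the predicted label, and errors."""
        start = time.perf_counter()
        try:
            label = predict(features)
        except Exception:
            self.errors += 1
            raise
        finally:
            # Latency is recorded for failed requests as well.
            self.latencies_ms.append((time.perf_counter() - start) * 1000.0)
        self.prediction_counts[label] += 1
        return label

    def snapshot(self) -> dict:
        """Summarise what has been observed so far."""
        total = sum(self.prediction_counts.values())
        # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
        p95 = quantiles(self.latencies_ms, n=20)[-1] if len(self.latencies_ms) >= 2 else None
        share = {label: count / total for label, count in self.prediction_counts.items()}
        return {
            "requests": len(self.latencies_ms),
            "errors": self.errors,
            "latency_p95_ms": p95,
            "prediction_share": share,
        }
```

A sudden change in `prediction_share` (for example, one class jumping from a small share to most of the traffic) is often the first visible sign that the data pipeline or the input distribution has changed, well before infrastructure metrics move.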

The minimum stack ships value:

The model is in production serving real traffic.

Performance is observable.

Failures are alerted.

Updates can be deployed.

Rollback is possible.

The minimum stack defers:

Feature store; multi-environment promotion; shadow deployment; automated retraining; A/B testing.

The maturity path from minimum:

Second deployment exposes gaps (e.g., the manual retraining is too slow). Build the specific capability that addresses the gap (a basic retraining pipeline). Third deployment exposes more gaps. Capability accumulates; in 12-18 months the team has built a mature MLOps platform that matches its specific needs.

This pattern is faster and produces better-fitting infrastructure than building the comprehensive platform first.

How does MLOps differ from DevOps in the data-pipeline, drift, and rollback dimensions?

DevOps focuses on software lifecycle: code → build → test → deploy → monitor → repeat. MLOps extends this with ML-specific dimensions:

Data pipeline as part of the system. In DevOps, the application has a defined API contract with data sources; data is mostly transactional and well-structured. In MLOps, the data pipeline is part of the model’s behaviour — the features the model receives are produced by code that itself is part of the system. Changes to the data pipeline change the model’s behaviour. Versioning data and pipeline together is required.

Training as part of build. In DevOps, the build is deterministic — same code in, same artefact out. In MLOps, training is stochastic — same code and data in, similar but different artefact out. Training is part of the build for ML systems; reproducibility requires explicit handling (random seeds, data ordering, framework version pinning).
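To illustrate the reproducibility point, here is a toy sketch in which both data ordering and weight initialisation depend on a single seed. The `train` function is a hypothetical stand-in for a real training loop (one pass of stochastic gradient descent on a one-parameter least-squares model), and framework version pinning is outside what a snippet can show.

```python
import random


def train(data, seed):
    """Toy training run: the result depends only on the data and the seed."""
    rng = random.Random(seed)        # local generator, no hidden global state
    shuffled = list(data)
    rng.shuffle(shuffled)            # data ordering is a function of the seed
    weight = rng.uniform(-1.0, 1.0)  # stand-in for random weight initialisation
    for x, y in shuffled:
        # One SGD step on squared error for the model y ~ weight * x.
        weight += 0.1 * (y - weight * x) * x
    return weight


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
first_run = train(data, seed=42)
second_run = train(data, seed=42)    # same code, data, and seed
other_seed = train(data, seed=7)     # same code and data, different seed
print(first_run == second_run, first_run, other_seed)
```

With the seed fixed, the two runs produce an identical weight; with a different seed the result is generally similar but not identical, which is the "similar but different artefact" behaviour described above. Recording the seed in the artefact metadata is what makes a training run repeatable later.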

Model drift. In DevOps, deployed software keeps behaving as it did at deploy time until the code or its environment changes. In MLOps, a deployed model typically performs less well over time as production data drifts away from the data it was trained on. Monitoring must include performance drift, not just infrastructure metrics. Retraining is part of normal operations, not exception handling.
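One commonly used drift indicator is the Population Stability Index (PSI), which compares the binned distribution of a feature or prediction score in production against a reference sample. The sketch below is a minimal standard-library version; the bin count, the floor used for empty bins, and any alert threshold are assumptions that need tuning per model, and PSI is only one of several possible indicators.

```python
import math


def psi(reference, current, bins=10):
    """Population Stability Index of `current` against `reference`.

    Bins are equal-width over the range of the reference sample; values
    outside that range are counted in the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for a constant reference

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


reference_scores = [i / 100 for i in range(100)]         # e.g. scores at training time
shifted_scores = [score + 0.5 for score in reference_scores]  # simulated drift

print(psi(reference_scores, reference_scores))  # identical distributions give 0.0
print(psi(reference_scores, shifted_scores))    # a large positive value
```

Each term of the sum is non-negative, so the index grows as the two distributions diverge. In a first deployment this can run as a scheduled job over a day's logged inputs, with the result emitted as one more metric alongside latency and error rate.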

Rollback complexity. In DevOps, rollback is straightforward — deploy previous artefact. In MLOps, rollback is more complex: model version, training data, feature pipeline version must all be consistent. Rolling back the model without rolling back the feature pipeline can produce mismatch. Coordinated rollback procedures are needed.
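A simple way to keep these pieces consistent is to treat them as one release unit, so that a rollback restores all of them together. The sketch below assumes hypothetical `Release` and `ReleaseHistory` types and made-up version strings; in practice the history would live in a file or database rather than in memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """Everything that must stay consistent for one deployed model."""
    model_version: str
    feature_pipeline_version: str
    training_data_snapshot: str


class ReleaseHistory:
    """Ordered record of releases; the last entry is what is live."""

    def __init__(self) -> None:
        self._releases: list[Release] = []

    def deploy(self, release: Release) -> Release:
        self._releases.append(release)
        return release

    @property
    def current(self) -> Release:
        return self._releases[-1]

    def rollback(self) -> Release:
        """Drop the live release and return the one that should be restored."""
        if len(self._releases) < 2:
            raise RuntimeError("no previous release to roll back to")
        self._releases.pop()
        return self.current


history = ReleaseHistory()
history.deploy(Release("model-001", "features-001", "snapshot-jan"))
history.deploy(Release("model-002", "features-002", "snapshot-feb"))

restored = history.rollback()
print(restored)  # the first release, including its feature pipeline version
```

The design choice is that the model version is never rolled back on its own: the rollback returns the full tuple, and the deployment script is expected to restore the matching feature pipeline version along with the model.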

Testing. In DevOps, unit tests, integration tests, end-to-end tests. In MLOps, the above plus: data validation tests (input distribution checks), model behaviour tests (expected predictions on sentinel cases), regression tests against historical predictions, fairness/bias tests where applicable.
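Two of these ML-specific tests are cheap to add from the first deployment. The sketch below shows a minimal input range check and a sentinel-case check; the function names, the dictionary-based row format, and the example feature are assumptions made for illustration, and a real data validation test would usually also cover types and distribution shape.

```python
def validate_inputs(rows, expected_ranges):
    """Return human-readable violations; an empty list means the batch passes."""
    problems = []
    for i, row in enumerate(rows):
        for name, (lo, hi) in expected_ranges.items():
            value = row.get(name)
            if value is None:
                problems.append(f"row {i}: missing {name}")
            elif not lo <= value <= hi:
                problems.append(f"row {i}: {name}={value} outside [{lo}, {hi}]")
    return problems


def check_sentinels(predict, sentinels):
    """Return (features, expected, actual) for every sentinel case that fails."""
    failures = []
    for features, expected in sentinels:
        actual = predict(features)
        if actual != expected:
            failures.append((features, expected, actual))
    return failures


def toy_model(features):
    # Hypothetical stand-in for a real model's predict function.
    return "approve" if features["age"] >= 18 else "reject"


ranges = {"age": (0, 120)}
print(validate_inputs([{"age": 30}, {"age": 300}, {}], ranges))
print(check_sentinels(toy_model, [({"age": 40}, "approve"), ({"age": 10}, "reject")]))
```

Running both checks in the deployment pipeline, and failing the deployment when either list is non-empty, gives a new model version a basic behavioural gate before it serves traffic.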

Monitoring. In DevOps, infrastructure metrics (CPU, memory, latency, error rate) and business metrics. In MLOps, the above plus: prediction distribution, feature distribution, ground truth comparison when feedback is available, drift indicators.

Deployment strategies. In DevOps, blue-green, canary, rolling deployments. In MLOps, the above plus: shadow deployment (new model receives traffic copy but doesn’t serve), A/B testing (model variants compared on traffic split), staged rollout based on cohort.

Compliance and audit. In DevOps, code changes, deployment events. In MLOps, the above plus: model lineage (which training data, which code version, which hyperparameters), explainability artefacts, fairness assessments.

The headline. MLOps is DevOps plus the data pipeline as part of the system, plus drift, plus stochastic training, plus model-specific testing and monitoring. Teams treating MLOps as DevOps applied to ML miss these dimensions; teams treating it as a separate discipline (without DevOps foundations) miss the operational reliability that DevOps practices contribute.

Why do most ML models never reach production, and which MLOps gaps cause that?

Observed reasons for ML projects that don’t reach production:

Business case missing or weak. The model was built without clear deployment intent; “interesting” but not “valuable enough to deploy”. Not an MLOps gap; a project framing gap.

Data pipeline doesn’t exist. Training data was assembled manually for the prototype; deployment requires production data pipeline; team didn’t build it; project stalls. MLOps gap: pipeline as part of the project from start.

Serving infrastructure unclear. Team built model but didn’t know how to deploy; no defined serving target; project sits in pre-deployment. MLOps gap: deployment target defined before training.

Ownership unclear. Model exists; nobody owns operating it in production; nobody picks up the work. MLOps gap: ownership defined from start.

Risk and compliance unresolved. Model exists but legal/compliance review hasn’t completed; deployment blocked. MLOps gap: risk review in parallel with development.

Integration unclear. Model exists; integration with the application that will use it is undefined; integration work doesn’t start. MLOps gap: integration plan from start.

Monitoring undefined. Model deployed but team can’t tell if it’s working; stakeholders don’t trust; deployment rolled back. MLOps gap: monitoring before deployment.

Maintenance unaffordable. Model deployed; the cost of keeping it running (engineering time, infrastructure cost) exceeds the value it produces; deprecated. Not strictly an MLOps gap; sometimes a value misjudgement.

Team turnover. Person who built the model leaves; nobody else can operate it. MLOps gap: documentation, runbooks, knowledge transfer.

Drift detected without remediation path. Model degrades; team doesn’t have retraining capability; model is deprecated. MLOps gap: retraining pipeline.

The pattern. Most projects don’t fail because the model science is wrong; they fail because the MLOps surrounding the model wasn’t planned from the start. The first MLOps implementation isn’t building a platform; it’s making these gaps part of the project plan from the beginning.

Limitations that remained

The smallest viable stack doesn’t scale to many models. As the organisation deploys its second, third, fifth model, the manual processes that worked for one become bottlenecks. The platform maturation must happen; the question is when (not whether).

Cloud-managed deployment has cost implications. Cloud-managed inference is convenient but billed per inference or per active endpoint. High-volume deployments can justify self-managed serving; low-volume deployments usually don’t. Cost monitoring is part of MLOps.
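The high-volume versus low-volume trade-off can be estimated with simple arithmetic. The sketch below assumes a deliberately simplified model: the managed service is billed purely per request, and self-managed serving has a fixed monthly cost that includes engineering time. Both function names and all figures are invented for illustration and are not real prices.

```python
def monthly_managed_cost(requests_per_month, price_per_1k_requests):
    """Managed-service cost under a pure per-request pricing assumption."""
    return requests_per_month / 1000 * price_per_1k_requests


def break_even_requests(self_managed_monthly_cost, price_per_1k_requests):
    """Monthly request volume at which both options cost the same."""
    return self_managed_monthly_cost / price_per_1k_requests * 1000


# Hypothetical figures, in the same currency unit:
# 4000 per month to run and staff a self-managed cluster, 0.25 per 1,000 managed requests.
volume = break_even_requests(self_managed_monthly_cost=4000, price_per_1k_requests=0.25)
print(volume)                               # 16000000.0 requests per month
print(monthly_managed_cost(volume, 0.25))   # 4000.0, equal to the self-managed cost
```

Below the break-even volume the managed endpoint is cheaper under these assumptions; above it, self-managed serving starts to pay for itself. Per-active-endpoint billing changes the shape of the calculation, so the provider's actual pricing model should replace the per-request assumption.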

Compliance varies by industry. Regulated industries (pharma, healthcare, finance) have requirements that go beyond minimum MLOps. The minimum stack works for low-regulation contexts; regulated contexts require additional validation, documentation, audit trail from the start.

Team skill gaps slow first deployment. Data scientists who haven’t deployed and engineers who haven’t deployed ML are both common; the skill gap is the actual bottleneck for many first-time efforts. Investing in one engineer learning ML or one data scientist learning engineering pays off across multiple deployments.

Mature MLOps takes 12-18 months. The “ship first, learn, mature” approach is not fast in absolute terms: expect the first deployment in about 3 months, the second in about 2 months, and a mature platform in 12-18 months. Organisational expectations must align.

How TechnoLynx Can Help

TechnoLynx works with engineering teams on first-time MLOps deployments: smallest-viable-stack design, deployment automation, monitoring setup, and the retraining loop. We focus on shipping the first model and letting platform maturity follow deployment learning. If your team has models but no production pipeline, contact us.

Image credits: Freepik