A first production deployment is where MLOps assumptions actually get tested

Part 1 of this series walked through how we frame a hospital staff tracking system as an MLOps problem: the data pipeline, the labelling effort, and the operational constraints that any computer vision system inside a clinical environment has to respect. This second part covers the side that catches most teams off guard: training the model, deploying it without breaking the rest of the hospital’s stack, and monitoring it once it is live.

Most ML models built inside hospitals never reach production. That is not because the model is poor, but because the team has no path from a notebook to a service that runs reliably behind authentication, behind rate limits, and behind a monitoring dashboard that someone actually looks at. That gap is the heart of MLOps for a first-time team, and it is the gap we work to close in our engagements.

The computer vision market in healthcare is projected to grow at a CAGR of 34.3% between 2024 and 2032 (Global Market Insights, 2024). That is a directional, industry-scale figure, not an operational benchmark for any single hospital. We treat numbers like that as evidence the category is moving, not as a guarantee of project ROI.

If you have not yet read it, the Part 1 article on building the data and infrastructure foundation sets up the constraints we assume here.

What does the model training stage actually require?

The model behind a staff tracking system has to do three jobs at once: detect people, hold an identity across frames, and do both fast enough to be useful for live decisions. That third constraint reshapes every earlier choice.

Preparing the dataset

Training starts from the labelled dataset described in Part 1. Two practical points carry most of the weight at this stage.

First, augmentation. Hospital CCTV streams vary in lighting, angle, and crowding far more than a clean academic dataset suggests.
Augmentation techniques (horizontal flips, brightness shifts, rotations, occasionally synthetic occlusions) are not a luxury here; they are the difference between a model that works in the corridor it was filmed in and one that works in the corridor next door.

Second, the split. We separate the data into training, validation, and test sets up front, and we keep the test set untouched until the final evaluation. Re-using validation data for final reporting is one of the most common ways teams accidentally inflate accuracy figures.

An example of how labelled data is split. Source: CloudFactory

A short OpenCV example illustrates the kind of augmentation we use as a baseline. It is deliberately simple (three transforms, no learned augmentation) because for a first deployment, simple and inspectable beats clever and opaque.

    import cv2

    # Load an image from file; cv2.imread returns None if the file cannot be read
    image = cv2.imread('path_to_your_image.jpg')
    if image is None:
        raise FileNotFoundError('Could not read path_to_your_image.jpg')

    # 1. Horizontal flip (flip code 1 mirrors around the vertical axis)
    flipped_image = cv2.flip(image, 1)

    # 2. Brightness adjustment: beta is added to every pixel, then clipped to 0..255
    brightness = 50  # positive to brighten, negative to darken
    bright_image = cv2.convertScaleAbs(image, beta=brightness)

    # 3. Rotation about the image centre, keeping the original canvas size
    #    (corners that rotate outside the canvas are cropped)
    angle = 45
    (h, w) = image.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated_image = cv2.warpAffine(image, M, (w, h))

    # Show the original next to each augmented variant
    cv2.imshow('Original', image)
    cv2.imshow('Flipped', flipped_image)
    cv2.imshow('Brightness Adjusted', bright_image)
    cv2.imshow('Rotated', rotated_image)
    cv2.waitKey(0)
    cv2.destroyAllWindows()

Figure panels: original image, flipped image, brightness-adjusted image, rotated image.

Choosing a computer vision model

Architecture choice is the point where many first-time MLOps teams overshoot. The realistic candidates for a real-time staff tracking system fall into three families:

CNN-based detectors: YOLO variants and Faster R-CNN. Mature tooling, well-understood deployment paths, predictable latency on commodity GPUs.

Sequence models: LSTM-based trackers that link detections across frames.
Useful when re-identification across occlusions matters more than raw detection accuracy.

Transformer-based detectors: DETR and its descendants. Strong on accuracy, but heavier to serve and less forgiving when latency budgets are tight.

For a first production system we usually start with a YOLO-family detector plus a light tracker on top. It is not the most accurate option on paper. It is the one with the shortest path to a stable inference service, and that matters more on a first deployment than a marginal accuracy gain.

Hyperparameter tuning and evaluation

Tuning is bounded by the same constraint: we are not chasing a benchmark, we are producing a model that has to fit a latency budget. Learning rate, batch size, and backbone depth are the dimensions we move first. Grid search is wasteful at this scale; random search or a small Bayesian-optimisation loop is enough.

Evaluation has two halves. Accuracy, precision, recall, and F1 on the held-out test set tell us whether the model identifies staff correctly. Inference time, measured on the hardware the model will actually run on, tells us whether the model is deployable. Sustained throughput under realistic load, not peak burst on an empty GPU, is the operationally relevant measure here, and it is the one we report back to the hospital.

How does model deployment differ in a clinical environment?

Three deployment topologies are realistic for a hospital. Each comes with different operational costs.

Topology | Strengths | Trade-offs
Edge (on-camera or local appliance) | Low latency, data stays inside the hospital, simpler privacy story | Limited compute, hard to update at scale, hardware drift
Cloud (AWS, GCP, Azure) | Elastic compute, simpler MLOps tooling, easier rollback | Network latency, ongoing cost, data-residency questions
On-premise servers | Full control, predictable cost after capex, fits existing IT governance | Up-front investment, the hospital owns the operations burden

For a first MLOps engagement we lean toward cloud deployment because the tooling around CI/CD, autoscaling, and monitoring is the most mature there. On-premise becomes the right answer when data-residency or network constraints rule cloud out, but it doubles the operational load on the team. The rest of this section assumes a cloud deployment.

Model serving

The model has to be reachable as a service. TensorFlow Serving, TorchServe, and NVIDIA Triton Inference Server are the three mature options. Triton has the broadest framework support and the best story for mixed CPU/GPU workloads, which matters when the same inference cluster serves more than one model.

Containerising with Docker is non-negotiable for a first deployment, not because Docker is fashionable, but because it removes an entire class of “works on my machine” failures between training and serving. Kubernetes handles autoscaling for the inference fleet. Prometheus and Grafana provide the operational telemetry: request rate, p50 / p95 / p99 latency, GPU utilisation, and per-model accuracy proxies.

Example of model deployment with TensorFlow Serving running in Docker and consumed by a Flask app. Source: Ubuntu

API integration

A REST API in front of the model server is where the system meets the rest of the hospital’s stack. Flask and FastAPI are both reasonable; FastAPI’s async model handles bursty traffic better, which matters when the CCTV gateway dumps a batch of frames in one connection.

The API layer is also where security lives.
That means OAuth for authentication, HTTPS for transport, rate limiting and request throttling to protect the inference cluster from accidental floods, and structured error responses so the consuming application can degrade gracefully when a frame cannot be processed. None of this is exciting work. All of it is the difference between a system that survives its first month in production and one that does not.

Web application integration

The clinical-facing application reads from the API, not from the model directly. That separation matters: it lets the model evolve (retrain, version, roll back) without forcing changes in the front-end. The application typically shows real-time staff locations on a hospital floor plan, with historical playback and role-based access for ward managers and operations staff.

Why is monitoring the part most first-time teams underestimate?

A model that worked on the test set is not a model that works in production six months later. The conditions it was trained on drift, sometimes slowly, sometimes overnight when a corridor is reorganised. Monitoring is what closes the loop.

We see this pattern regularly: the system is launched, looks good for the first few weeks, and then quietly degrades. Nobody notices until a ward manager mentions the dashboard is wrong. By that point the model has been making poor predictions for weeks. The fix is to assume drift will happen and instrument for it from day one. See our note on peak performance versus steady-state performance in AI systems for the broader argument.

Performance monitoring in the ML model lifecycle. Source: Evidently.ai

What drives data drift in hospital settings

Two categories dominate in our experience across hospital deployments (observed pattern, not a benchmarked rate):

Environmental drift. Camera angles change after maintenance. Lighting changes seasonally. New equipment appears in the frame. Each of these shifts the input distribution the model was trained on.

Behavioural drift.
Staff routines change under new protocols. Emergency patterns differ from normal flow. Shift patterns evolve over months.

Visualising data drift. Source: Evidently.ai

Detection methods we rely on in practice:

Statistical divergence measures (Kullback-Leibler divergence, Population Stability Index) comparing live input distributions to the training distribution.

Pixel-level distribution checks on the incoming video frames, a coarse but cheap signal.

Structured feedback channels for staff to flag obvious misidentifications. The qualitative signal often catches drift the statistical tests miss.

Retraining and redeployment

Retraining is not a one-off; it is a scheduled operation. Cadence depends on the rate of drift, but a quarterly retrain plus event-triggered retrains (after major environmental changes) is a workable starting point. Automated pipelines built with Kubeflow or MLflow handle the mechanics: data ingestion, training, validation, packaging, deployment.

The deployment of a retrained model is itself a moment of risk. We use a canary rollout: route a small fraction of traffic to the new model, compare its outputs against the incumbent on the live stream, and only promote it after the comparison passes. A full cutover on day one is the kind of move that ends with a rollback at 3am.

Closing the loop

A computer vision staff tracking system is a good test case for first-time MLOps because every weak point in the operational chain shows up quickly. Training the model is the easy half. Serving it under authentication, monitoring it for drift, and retraining it on a schedule are the parts that decide whether the deployment is still working in a year.

The pattern generalises. Whether the model tracks staff, reads scans, or routes patients, the same operational scaffolding applies. The first deployment is expensive because the scaffolding has to be built. The second is cheaper because the scaffolding is reused.
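
To make the drift-monitoring part of that scaffolding concrete, here is a minimal sketch of a Population Stability Index (PSI) check of the kind listed among the detection methods. It is an illustration under stated assumptions, not our production code. The function name population_stability_index, the choice of per-frame mean brightness as the monitored feature, the ten bins, and the synthetic numbers are all hypothetical. The 0.1 and 0.25 thresholds mentioned afterwards are a commonly quoted rule of thumb, not values taken from any deployment.

```python
import numpy as np


def population_stability_index(reference, live, bins=10, eps=1e-6):
    """PSI between a reference sample and a live sample of one scalar feature."""
    reference = np.asarray(reference, dtype=float)
    live = np.asarray(live, dtype=float)

    # Bin edges come from the reference data so both samples share the same bins.
    edges = np.histogram_bin_edges(reference, bins=bins)

    # Clip live values into the reference range so outliers land in the end bins.
    live = np.clip(live, edges[0], edges[-1])

    ref_counts, _ = np.histogram(reference, bins=edges)
    live_counts, _ = np.histogram(live, bins=edges)

    # Convert counts to proportions; eps keeps empty bins from producing log(0).
    ref_pct = np.maximum(ref_counts / ref_counts.sum(), eps)
    live_pct = np.maximum(live_counts / live_counts.sum(), eps)

    return float(np.sum((live_pct - ref_pct) * np.log(live_pct / ref_pct)))


# Synthetic stand-ins for per-frame mean brightness (hypothetical values).
rng = np.random.default_rng(0)
training_brightness = rng.normal(120, 20, 5000)  # conditions at training time
same_conditions = rng.normal(120, 20, 5000)      # live frames, nothing changed
dimmer_corridor = rng.normal(90, 20, 5000)       # live frames after a lighting change

psi_stable = population_stability_index(training_brightness, same_conditions)
psi_shifted = population_stability_index(training_brightness, dimmer_corridor)
print(f"stable: {psi_stable:.3f}, shifted: {psi_shifted:.3f}")
```

A common convention reads a PSI below roughly 0.1 as stable and above roughly 0.25 as a shift worth investigating. In practice the thresholds, the monitored features, and the comparison window would all need tuning against the hospital's own streams before they drive a retraining trigger.
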
If you are scoping a first MLOps engagement and want to talk through the trade-offs in your environment, get in touch.

Part 1 of this series, Building a Robust Staff Tracking System, covers the data and infrastructure groundwork that this article assumes is already in place.

Sources for the images

CloudFactory Limited (2024). Computer Vision Wiki.
Vasconcelos, R. (2021). ‘A guide to ML model serving’, Ubuntu, 17 May.
Evidently AI (n.d.). Model monitoring for ML in production: a comprehensive guide.
Evidently AI (n.d.). What is data drift in ML, and how to detect and handle it.

References

Ghaemmaghami, M.P. (2017). Tracking of Humans in Video Stream Using LSTM Recurrent Neural Network. Degree Project in Computer Science and Engineering, Stockholm.
Global Market Insights (2024). Computer Vision in Healthcare Market.
Potrimba, P. (2023). What is DETR? Roboflow Blog.