GAMP 5 was not designed for software that learns
The original GAMP 5 framework (2008) classifies software into categories based on complexity and configurability. Category 1 is infrastructure software (operating systems, database engines). Category 3 is non-configured products used as-is. Category 4 is configured products (ERP systems, LIMS, MES configured for the specific facility). Category 5 is custom-developed software built specifically for the intended use. Each category carries a prescribed validation approach: lower categories require less testing; higher categories require more.
This classification assumes a fundamental property of traditional software: deterministic behaviour. The same input produces the same output, the behaviour is fully defined by the code, and the validation evidence from version 1.0 remains valid until someone changes the code. An ML model violates all three assumptions. It learns from data rather than being explicitly programmed. Its behaviour is shaped by the training dataset, not just the source code. And that behaviour changes every time the model is retrained on new data — which is the expected operational mode, not an exception.
The regulatory landscape reflects this shift. The FDA's public list of AI/ML-enabled medical devices records over 1,000 regulatory authorisations (FDA, Artificial Intelligence and Machine Learning (AI/ML)-Enabled Medical Devices, last updated October 2024), with the majority requiring validation approaches beyond the traditional GAMP 5 categories.
ISPE estimates that pharmaceutical companies spend 6–18 months validating Category 5 systems under traditional CSV, compared to 2–6 months under risk-based approaches aligned with ISPE’s GAMP 5 Second Edition (2022).
The GAMP 5 Second Edition is now the de facto validation framework across 40+ countries, with a Community of Practice of over 10,000 members.
Forcing an ML model into Category 4 or Category 5 without acknowledging these differences produces one of two failures: a validation approach that tests the wrong properties (verifying deterministic input-output behaviour that the model was not designed to exhibit), or a revalidation burden so heavy that every model update triggers a months-long validation cycle, making the system unmaintainable in practice. We have seen both outcomes.
The Second Edition reframe
The GAMP 5 Second Edition (2022) and the accompanying ISPE GAMP guidance for AI/ML systems address this gap directly. The core change is a shift from category-based validation (which type of software is this?) to risk-based validation (what is the impact if this system fails?).
For AI/ML systems, the Second Edition establishes several principles that the original framework did not accommodate:
Critical thinking over prescriptive testing. The Second Edition explicitly advocates “critical thinking” in validation planning — assessing what needs to be tested based on risk, rather than following a prescribed set of test types based on software category. For an ML model in a GxP environment, this means the validation plan should focus on the failure modes that matter (model drift, data distribution shift, adversarial inputs, performance degradation over time) rather than on verifying input-output pairs that a deterministic system would produce.
Unscripted testing as a valid approach. Traditional CSV relies heavily on scripted test cases: pre-defined inputs with expected outputs, executed and documented in traceability matrices. The Second Edition recognises that unscripted testing — exploratory testing, error-based testing, and scenario-based testing — is valid for moderate- and lower-risk systems. For ML models, unscripted testing is often more informative than scripted testing: exploring model behaviour at class boundaries, testing with adversarial or out-of-distribution inputs, and evaluating performance across data subsets (sliced evaluation) reveals weaknesses that scripted pass/fail tests would miss.
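Sliced evaluation of the kind described above can be sketched in a few lines. This is a minimal illustration, not a validation tool: the slice keys (camera stations) and the records are hypothetical, and a real protocol would report every documented metric per slice, not just accuracy.

```python
from collections import defaultdict

def sliced_accuracy(records):
    """Group predictions by a slice key and report per-slice accuracy.

    `records` is a list of (slice_key, y_true, y_pred) tuples; the slice
    key is any metadata dimension worth auditing (camera, lot, shift).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for slice_key, y_true, y_pred in records:
        totals[slice_key] += 1
        hits[slice_key] += int(y_true == y_pred)
    return {k: hits[k] / totals[k] for k in totals}

# Hypothetical inspection results sliced by camera station:
records = [
    ("camera_A", "pass", "pass"),
    ("camera_A", "fail", "fail"),
    ("camera_B", "fail", "pass"),  # misses concentrated on one station
    ("camera_B", "pass", "pass"),
]
print(sliced_accuracy(records))  # {'camera_A': 1.0, 'camera_B': 0.5}
```

An aggregate accuracy of 0.75 here would hide the fact that one station is responsible for every miss, which is exactly the kind of weakness a scripted pass/fail suite tends not to surface.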
Continuous validation. The most significant departure from the original framework. Traditional validation is a point-in-time event: validate once, maintain through change control. ML models that are retrained on new data — which is the normal operating mode for production ML systems — require continuous validation: ongoing performance monitoring against documented acceptance criteria, with automated alerts when performance degrades. The GxP validation frameworks that accommodate AI must include monitoring infrastructure as a validation component, not as a post-validation operational concern.
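The core of continuous validation is mechanically simple: compare each monitoring window's metrics against the documented acceptance criteria and alert on any breach. A minimal sketch, with illustrative thresholds and metric names that are assumptions, not regulatory values:

```python
# Acceptance criteria as documented in the validation plan
# (illustrative thresholds, not regulatory values).
ACCEPTANCE_CRITERIA = {"accuracy": 0.97, "recall_fail_class": 0.99}

def check_acceptance(rolling_metrics, criteria=ACCEPTANCE_CRITERIA):
    """Return (metric, observed, threshold) for every criterion breached
    by the latest monitoring window; an empty list means no alert."""
    return [
        (name, rolling_metrics[name], threshold)
        for name, threshold in criteria.items()
        if rolling_metrics.get(name, 0.0) < threshold
    ]

alerts = check_acceptance({"accuracy": 0.981, "recall_fail_class": 0.986})
# recall on the fail class is below its documented threshold → alert fires
```

The non-trivial part is everything around this check: computing trustworthy rolling metrics in production (which usually requires delayed ground-truth labels) and the documented response protocol for when the alert fires.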
How do you classify an AI/ML system under the current framework?
The practical classification of an AI/ML system under GAMP 5 Second Edition follows the risk-based approach rather than the category-based approach. The methodology:
Step 1: Define the intended use. What does the AI/ML system do in the GxP context? This must be specific: “The system classifies visual inspection images of sterile injectable products as pass or fail, with the classification used to support — but not replace — the human inspector’s release decision.” The intended use statement bounds the validation scope — the system is validated for what it is intended to do, not for everything it could theoretically do.
Step 2: Assess the GxP impact. Using the three-dimension framework — product quality impact, patient safety impact, data integrity impact — classify the system’s GxP scope. This determines the overall risk tier and the proportionate validation intensity.
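One simple way to operationalise the three-dimension assessment is a worst-dimension-dominates mapping, sketched below. The dominance rule and the three-level scale are simplifying assumptions for illustration; a real assessment is a documented, justified judgment, not a lookup.

```python
def gxp_risk_tier(product_quality, patient_safety, data_integrity):
    """Map the three GxP impact dimensions (each 'low'/'medium'/'high')
    to an overall risk tier. The worst dimension dominates -- a
    simplifying assumption for this sketch."""
    score = {"low": 0, "medium": 1, "high": 2}
    worst = max(score[product_quality], score[patient_safety],
                score[data_integrity])
    return ["low", "moderate", "high"][worst]

# A vision model supporting (not replacing) a human release decision
# might plausibly be assessed as:
tier = gxp_risk_tier(product_quality="medium", patient_safety="medium",
                     data_integrity="low")
# tier == "moderate"
```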
Step 3: Identify the ML-specific risks. Beyond the standard GxP risks that apply to any software system, ML systems introduce specific risk categories that must be assessed:
- Training data risk: Is the training data representative of the production environment? Is it labelled consistently? Has it been audited for bias or gaps?
- Model drift risk: How quickly does the model’s performance degrade when the production data distribution changes? What is the monitoring strategy for detecting drift?
- Retraining risk: When the model is retrained, how is the new version validated? What acceptance criteria must the retrained model meet before it replaces the production version?
- Explainability risk: Can the model’s decisions be understood well enough to investigate failures? For GxP-critical systems, the quality team must be able to determine why the model produced a specific output — not at the individual-weight level, but at the feature-importance or decision-boundary level.
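The model drift risk above is typically monitored with a statistical comparison between the training and production distributions of each input feature. A common choice is the Population Stability Index; the sketch below implements it with NumPy, and the 0.2 trigger is a widely used rule of thumb, not a regulatory threshold.

```python
import numpy as np

def population_stability_index(train_values, prod_values, bins=10):
    """Population Stability Index between the training and production
    distributions of one feature. PSI > 0.2 is a common rule-of-thumb
    trigger for a drift investigation (an assumption, not a standard)."""
    edges = np.histogram_bin_edges(train_values, bins=bins)
    t, _ = np.histogram(train_values, bins=edges)
    p, _ = np.histogram(prod_values, bins=edges)
    t = np.clip(t / t.sum(), 1e-6, None)  # avoid log(0) on empty bins
    p = np.clip(p / p.sum(), 1e-6, None)
    return float(np.sum((p - t) * np.log(p / t)))
```

In a monitoring pipeline this would run per feature on each monitoring window, with the per-feature PSI values logged alongside the performance metrics so that a drift alert can point at which inputs moved.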
Step 4: Design the validation approach proportionate to the risk. High-risk ML systems (direct GxP impact, autonomous decisions) receive comprehensive validation with documented acceptance criteria, scripted and unscripted testing, and mandatory continuous monitoring. Moderate-risk systems (supporting GxP decisions, with human oversight) receive risk-based testing focused on the ML-specific risks identified in Step 3. Low-risk systems (minimal GxP impact, fully mitigated by other controls) receive minimal validation — typically a documented risk assessment and performance verification against basic acceptance criteria.
The ISPE AI maturity model
The ISPE GAMP guidance for AI/ML introduces a maturity model for pharmaceutical organisations adopting AI. The model is useful not as a prescriptive roadmap but as a diagnostic: it identifies where an organisation’s current practices have gaps relative to the regulatory expectations for AI in GxP environments.
The maturity levels relevant to validation:
Awareness. The organisation recognises that AI/ML systems require different validation approaches than deterministic software, but has not yet developed policies or procedures. Most pharmaceutical companies that have deployed AI in non-GxP contexts (scheduling, supply chain) but not yet in GxP contexts are at this level. In our work with pharma organisations, this is the most common starting point.
Defined. The organisation has developed policies for AI/ML validation — including risk assessment templates, acceptance criteria guidelines, and change control procedures for model retraining. The policies are documented but may not yet have been tested through a production GxP deployment.
Managed. The organisation has deployed AI/ML in GxP contexts using the defined policies, has validated at least one system through the full lifecycle, and has operational experience with continuous monitoring, drift detection, and model retraining under change control. This is the level at which the organisation has practical evidence — not just policy documents — that its AI validation approach works.
The practical value of the maturity model is in identifying the specific gaps between an organisation’s current state and the managed level. For organisations at the awareness level, the gap is policy development. For organisations at the defined level, the gap is operational experience — which is best acquired through a first deployment on a moderate-risk system where the validation effort is proportionate and the learning is transferable to higher-risk deployments later.
What a validated ML system looks like in practice
A production ML model operating in a GxP pharmaceutical environment with validated status includes the following artifacts and controls:
Validation documentation. Intended use statement, risk assessment (including ML-specific risks), validation plan specifying testing approach and acceptance criteria, test execution records (both scripted and unscripted), and validation summary report with documented pass/fail against criteria.
Model artifacts under version control. The trained model (weights, architecture definition), the preprocessing pipeline (feature engineering, normalisation, augmentation logic), the training dataset (or documented dataset provenance with reproducibility information), the hyperparameter configuration, and the evaluation metrics on the validation dataset. All artifacts are version-controlled with traceable change history.
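Traceability over these artifacts is often anchored by a content-hash manifest committed alongside the change-control record, so an auditor can verify that the deployed weights are byte-identical to the validated ones. A minimal sketch with hypothetical file paths:

```python
import hashlib
import json
import pathlib

def artifact_manifest(paths):
    """SHA-256 manifest of the versioned model artifacts, suitable for
    committing next to the validation record. Paths are illustrative."""
    return {
        str(p): hashlib.sha256(pathlib.Path(p).read_bytes()).hexdigest()
        for p in paths
    }

# Hypothetical usage:
# manifest = artifact_manifest(["model/weights.onnx",
#                               "model/preprocess.py",
#                               "model/hyperparams.json"])
# pathlib.Path("model/MANIFEST.json").write_text(
#     json.dumps(manifest, indent=2))
```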
Continuous monitoring infrastructure. Automated performance tracking against documented acceptance criteria (accuracy, precision, recall, and domain-specific metrics), data drift detection (statistical comparison between production data distribution and training data distribution), alert mechanisms for performance degradation or drift detection, and a documented response protocol for when alerts fire.
Change control for retraining. Every model retrain triggers a documented change control process that includes: the rationale for retraining (new data availability, drift detection, expanded intended use), the training dataset for the new version, performance comparison between new and current production versions, acceptance criteria evaluation, and approval workflow before the new version enters production.
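The acceptance-criteria evaluation inside that change control can be expressed as a simple gate: the candidate must meet every documented criterion and must not regress materially against the current production version. The non-inferiority margin and metric names below are illustrative assumptions.

```python
def approve_retrain(candidate, production, criteria, non_inferiority=0.005):
    """Gate a retrained model: it must meet every documented acceptance
    criterion AND not regress by more than `non_inferiority` on any
    metric relative to the current production version.
    Thresholds and margin are illustrative, not regulatory values."""
    meets_criteria = all(candidate[m] >= v for m, v in criteria.items())
    no_regression = all(candidate[m] >= production[m] - non_inferiority
                        for m in production)
    return meets_criteria and no_regression
```

Even when this gate passes, the approval workflow (documented rationale, dataset provenance, QA sign-off) still runs; the code only automates the numeric comparison, not the decision.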
Audit trail. Every model inference in the GxP context is logged with: timestamp, model version, input data reference, output (prediction/classification), confidence score, and whether the output was accepted or overridden by a human operator.
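An append-only JSON-lines log is one straightforward way to capture those fields per inference. The field names below are illustrative, not a regulatory schema, and a production implementation would also need tamper-evidence (e.g. write-once storage or record chaining) to satisfy data integrity expectations.

```python
import json
from datetime import datetime, timezone

def log_inference(log_file, model_version, input_ref, output, confidence,
                  human_override=None):
    """Append one audit-trail record per inference as a JSON line.
    `log_file` is any writable text stream; field names are illustrative."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "input_ref": input_ref,            # e.g. image ID or sample ID
        "output": output,                  # prediction / classification
        "confidence": confidence,
        "human_override": human_override,  # None until reviewed
    }
    log_file.write(json.dumps(record) + "\n")
    return record

# Hypothetical usage:
# with open("audit_trail.jsonl", "a") as f:
#     log_inference(f, "v1.3.0", "img-2024-00042", "pass", 0.97)
```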
This is the operational state that regulatory auditors expect to find for a GxP-validated AI/ML system. The documentation burden is proportionate to the risk — but the core elements (intended use, risk assessment, continuous monitoring, change control, audit trail) are non-negotiable regardless of the risk tier.
30-day GAMP 5 AI/ML validation fast-start
A moderate-risk first deployment can move from policy gap to validated operational state in 30 days when the effort is structured around the risk-based methodology described above.
- Week 1 — Risk classification and intended use definition. Write the intended use statement for the target AI/ML system, bounding the validation scope to what the system is intended to do. Complete the three-dimension GxP impact assessment (product quality, patient safety, data integrity). Identify the ML-specific risks: training data representativeness, model drift exposure, retraining frequency, and explainability requirements.
- Week 2 — Validation planning and acceptance criteria. Design the risk-proportionate validation approach (Step 4): define scripted test cases for high-risk failure modes and unscripted testing protocols for boundary exploration, adversarial inputs, and sliced evaluation across data subsets. Document acceptance criteria for accuracy, precision, recall, and domain-specific metrics. Draft the validation plan linking each test to the risks identified in Week 1.
- Week 3 — Test execution and monitoring infrastructure. Execute the scripted and unscripted test protocols against the model. Deploy continuous monitoring infrastructure: automated performance tracking against the documented acceptance criteria, statistical drift detection comparing production data distribution to training data distribution, and alert mechanisms for degradation. Configure the audit trail to log every inference with model version, input reference, output, confidence score, and human override status.
- Week 4 — Change control, documentation, and operational handoff. Implement the change control procedure for model retraining: documented rationale, dataset provenance, performance comparison, acceptance criteria evaluation, and approval workflow. Compile the validation summary report with pass/fail results. Place all model artifacts (weights, preprocessing pipeline, hyperparameter configuration, training dataset provenance) under version control with traceable change history.
The methodology for getting from no ML validation experience to this operational state is best learned on a moderate-risk first deployment — one where the GxP impact is real but bounded, the validation effort produces transferable templates, and the continuous monitoring infrastructure becomes reusable across subsequent deployments. If your pharma AI use cases are identified but the validation pathway for the first GxP deployment is not yet defined, a GxP Regulatory Scope Analysis produces the classification and validation approach per system.