The evaluation problem
Choosing an AI consulting firm is a decision made with significant information asymmetry. The buyer typically has less technical AI expertise than the seller (that is why they are buying consulting), which means the buyer cannot independently evaluate the seller’s technical claims. The result: purchasing decisions are influenced by signals that correlate weakly with delivery quality — brand recognition, partnership badges, slide deck polish, and the number of logos on the “clients” page.
The firms that deliver are not always the ones that present best, and vice versa. The evaluation needs structure — a set of criteria that a non-technical buyer can assess and that correlate with actual delivery quality.
Criterion 1: Technical depth, not expertise claims
Every AI consulting firm claims expertise in machine learning, deep learning, NLP, computer vision, generative AI, and MLOps. The claims are not differentiating because everyone makes them. What differentiates is the depth behind the claim.
How to assess depth: Ask the firm to describe a specific technical decision they made on a recent project and why they made it. Not “we built a computer vision model” — but “we chose a YOLOv8 architecture over Faster R-CNN because the client’s latency requirement was 40ms per frame on a Jetson Orin, and YOLOv8-nano achieves 35ms at INT8 quantisation while Faster R-CNN exceeded 80ms even after optimisation.” The specificity of the answer reveals whether the firm’s team has hands-on implementation experience or whether they are reselling subcontracted work with a slide deck overlay.
Red flag: The firm cannot name the specific technologies, architectures, or tools used on their projects. Answers remain at the level of “we used advanced machine learning techniques” without specifics.
Green flag: The firm’s technical team can discuss trade-offs — why they chose one approach over another, what alternatives they considered, what the limitations of their chosen approach were, and what they would do differently on a similar future project.
What sophisticated buyers systematically miss: Depth signals are easy to rehearse. We have seen firms deliver impressive technical walk-throughs of their best project during evaluation, then staff the actual engagement with junior engineers who had no involvement in that project. The anti-gaming check is not just “can they describe technical depth?” but “can the specific people proposed for your project describe it on demand, unrehearsed, for their own recent work?”
Criterion 2: Delivery evidence, not capability claims
A firm’s capability deck describes what they can do. Delivery evidence shows what they have done. The gap between the two is often substantial.
How to assess delivery: Request case studies that include specific, measurable outcomes — not “we improved accuracy” but “we reduced false-positive rates from 12% to 3.2% on the client’s production defect detection system, measured over 90 days of production operation.” Request references from clients who can speak to the delivery experience — not just the outcome, but the process: was the team responsive, did they meet timelines, did they communicate problems early?
Red flag: All case studies describe pilot projects and POCs. None describe production deployments that operated for months or years. This pattern suggests the firm is good at demos but has not solved the production engineering problems.
Green flag: Case studies describe production systems with operational metrics (uptime, accuracy over time, maintenance burden) and the transition from pilot to production. The firm’s work is still running in production, not just sitting in a report.
Criterion 3: Knowledge transfer, not dependency creation
An AI consulting firm that delivers a model but does not transfer the knowledge to operate and maintain it has created a dependency — the client must return to the firm for every update, every retraining cycle, and every debugging session. This dependency is profitable for the firm and expensive for the client.
How to assess knowledge transfer intent: Ask what the firm’s delivery includes beyond the model itself. Does it include documentation (architecture decisions, training procedures, evaluation criteria, monitoring setup)? Does it include training for the client’s team (how to retrain, how to evaluate, how to debug)? Does the delivery include the complete codebase with clear documentation, or is it a deployed model with opaque configuration?
Red flag: The firm’s engagement model is ongoing managed service with no option for the client to take over operation. The “deliverable” is access to a running system, not the system itself.
Green flag: The firm explicitly plans for disengagement — the engagement includes knowledge transfer milestones, the client’s team is involved in development from the start, and the firm’s goal is to make itself unnecessary for ongoing operations. This is how well-structured engagements work — long-term client dependency is not a sustainable model for either party.
Criterion 4: Honest scoping, not optimistic estimation
The firm’s proposal should reflect realistic effort estimates based on the project’s actual complexity, data readiness, and integration requirements. A proposal that is significantly cheaper or faster than competitors may be underestimating the work — and the project will either blow past the estimate or deliver a cut-scope version that does not meet the original requirements.
How to assess scoping honesty: Compare the proposal against the predictable failure patterns — does the proposal include data readiness assessment, clear success criteria, integration scoping, and risk identification? Or does it jump directly to model development without addressing the prerequisites?
Red flag: The proposal does not mention data assessment, does not define success criteria, and estimates the project at 6–8 weeks for a problem that clearly requires data engineering, model development, integration, and production deployment. Either the firm is planning to deliver a POC and call it done, or the estimate is unrealistic.
Green flag: The proposal includes a scoping phase before committing to the full project, identifies specific risks and mitigation strategies, and provides a range of effort estimates with the factors that determine where in the range the project will fall.
Criterion 5: Team composition, not firm size
The quality of the consulting engagement depends on the people who do the work, not the firm’s total headcount. A 500-person firm that assigns junior consultants to your project will deliver worse results than a 20-person firm that assigns senior engineers with relevant domain experience.
How to assess team composition: Ask who specifically will work on the project. Request CVs or profiles. Ask about their relevant project experience — not in general, but on projects similar to yours (same industry, same technical approach, same scale). Ask whether the proposed team will remain assigned for the project’s duration, or whether team members may be rotated to other projects.
Red flag: The firm cannot name the specific people who will work on the project until after the contract is signed. The proposal lists senior people who disappear after the kick-off meeting.
Green flag: The proposed team is named, their relevant experience is documented, and the firm commits to team continuity for the engagement duration.
The evaluation process
A structured evaluation scores each firm against these five criteria, with evidence requirements for each:
- Technical depth — score based on specificity of technical discussion
- Delivery evidence — score based on production case studies with measurable outcomes
- Knowledge transfer — score based on explicit transfer plan and disengagement strategy
- Scoping honesty — score based on proposal realism and risk identification
- Team composition — score based on named team with relevant experience
Weighted scoring rubric with anti-gaming checks
Not all criteria matter equally. Technical depth and delivery evidence carry more weight because they are harder to fake and correlate most strongly with actual project outcomes. Use this rubric to score each firm on a 1–5 scale per criterion, then multiply by the weight to get the weighted score.
| Criterion | Weight | Score 1 | Score 3 | Score 5 | Anti-Gaming Check |
|---|---|---|---|---|---|
| Technical depth | 3 | Answers stay at buzzword level (“advanced ML techniques”) with no architecture or trade-off detail | Names specific tools and architectures but cannot explain why they were chosen over alternatives | Describes architecture decisions, quantified trade-offs, and limitations on a recent project unprompted | Ask the team to walk through a real technical decision live — not from slides. Probe with “why not X?” follow-ups to test whether depth is rehearsed or genuine |
| Delivery evidence | 3 | Only capability decks and pilot-stage case studies; no production metrics | Production case studies exist but metrics are vague (“improved accuracy”) or unverified | Case studies include quantified production outcomes (e.g., “false-positive rate from 12% to 3.2% over 90 days”) with referenceable clients | Request a reference call with a client whose project is still in production. Ask the client whether the system is still running and what maintenance looks like |
| Knowledge transfer | 2 | Deliverable is access to a running system with no documentation, code, or training plan | Documentation and code are included but no structured training or disengagement plan | Engagement includes architecture docs, retraining procedures, client team training, and explicit disengagement milestones | Ask to see a sample deliverable package from a past engagement. Verify it includes runnable code, not just a deployed endpoint |
| Scoping honesty | 2 | Proposal jumps to model development with no data assessment, no success criteria, and an unrealistically short timeline | Proposal mentions data readiness and success criteria but does not include a scoping phase or risk identification | Proposal includes a paid scoping phase, named risks with mitigations, and effort ranges tied to specific contingencies | Compare the timeline against at least two other firms. If one estimate is half the others, ask what is excluded — data engineering, integration, or production deployment |
| Team composition | 2 | Firm cannot name who will work on the project; team is “to be assigned” | Team is named but relevant project experience is generic or unverifiable | Named individuals with documented experience on similar projects (same domain, scale, and technical approach), with a continuity commitment | Ask for named individuals, not roles. Request LinkedIn profiles or CVs. Ask whether the same people presented in the proposal will remain through delivery |
How to use: Score each firm 1–5 per criterion, multiply by the weight, and sum. Maximum possible score is 60. A firm scoring below 36 (60% of maximum) on this rubric has significant gaps that should be addressed before contracting. Pay particular attention to any criterion where the anti-gaming check reveals a discrepancy between the firm’s claims and verifiable evidence.
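The weighted-scoring arithmetic can be sketched in a few lines of Python. The weights and the 36-point threshold come from the rubric above; the criterion keys and the example firm's scores are illustrative, not from the source.

```python
# Weights from the rubric above; keys are illustrative names for the criteria.
WEIGHTS = {
    "technical_depth": 3,
    "delivery_evidence": 3,
    "knowledge_transfer": 2,
    "scoping_honesty": 2,
    "team_composition": 2,
}

MAX_SCORE = 5 * sum(WEIGHTS.values())  # 5 points per criterion -> 60
THRESHOLD = 0.6 * MAX_SCORE            # below 36: significant gaps

def weighted_score(scores: dict) -> int:
    """Sum of (1-5 score) * weight across the five criteria."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("score every criterion exactly once")
    if not all(1 <= s <= 5 for s in scores.values()):
        raise ValueError("scores must be on the 1-5 scale")
    return sum(scores[c] * WEIGHTS[c] for c in WEIGHTS)

# Hypothetical firm: strong on depth, weak on knowledge transfer.
firm = {
    "technical_depth": 5,
    "delivery_evidence": 4,
    "knowledge_transfer": 2,
    "scoping_honesty": 3,
    "team_composition": 4,
}
total = weighted_score(firm)  # 5*3 + 4*3 + 2*2 + 3*2 + 4*2 = 45
print(total, total >= THRESHOLD)  # 45 True -> above the 60% bar
```

Keeping the weights in one place makes it easy to re-run the comparison if your organisation decides, say, that knowledge transfer deserves a weight of 3.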
The total score is more informative than any single criterion, and the rubric turns vendor selection from an impressionistic exercise into an evidence-based comparison — the same discipline these firms should be bringing to your AI projects.