AI Assistants and the Feasibility Question Behind Productivity Gains

The productivity claim, and the question it skips

AI assistants are sold as productivity multipliers. Say “Hey Siri”, “Hey Google”, or open ChatGPT, and the assistant drafts your email, summarises your meeting, or pulls a chart out of a spreadsheet. The pitch is seductive enough that buyers approve budgets before anyone has asked the harder question: which of the tasks we want this assistant to do are technically feasible with current models, and which are not?

That gap — between what an assistant looks like it can do in a demo and what it can reliably do in production — is where most generative AI budgets quietly disappear. In our experience across generative AI engagements, the failure mode is almost never “the model didn’t work at all”. It is “the model worked on three of the eight use cases we scoped, and the other five were never feasible in the first place”. The buyer who approved the eight-use-case scope owns the wasted spend.

This article reframes the productivity conversation around feasibility. The history and the wow-moments are real, but they are not the deciding factor. The deciding factor is whether the specific task you want automated falls inside the envelope of what large language models, speech recognition, and adjacent components can do today, on your data, at your latency and accuracy bar.

Figure 1 – Is a brain faster than AI assistants? (618media, 2024)

Where assistants actually came from

The concept is older than the marketing suggests. ELIZA, a chatbot that simulated a psychotherapist, was built at MIT in 1966 (Epstein and Klinkenberg, 2001). Assistants became mainstream when Siri shipped on the iPhone 4S in 2011, and the modern wave (ChatGPT, Perplexity, Copilot, Claude) is built on transformer-based natural language processing trained on internet-scale corpora.

The mechanics matter because they bound the feasibility envelope. A modern assistant tokenises the input, runs it through an attention-based neural network (typically served via PyTorch or TensorRT on GPU, often with FlashAttention kernels for throughput), generates a probability distribution over the next token, and samples from it. That pipeline is excellent at plausibility, meaning output that reads correctly, and structurally weak at factual guarantees and at bounded reasoning over private data the model has never seen. Every feasibility judgement starts there.
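The last two steps of that pipeline, turning scores into a probability distribution and sampling from it, can be sketched in a few lines. This is a toy illustration only: the three-word vocabulary and the scores are invented for the example, and a real model produces scores over tens of thousands of tokens.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution over tokens."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(vocab, logits, temperature=1.0, rng=None):
    """Pick one token at random, weighted by its probability."""
    rng = rng or random.Random()
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]


# Hypothetical scores for the token after "The capital of Australia is".
vocab = ["Canberra", "Sydney", "Melbourne"]
logits = [2.0, 1.5, 0.2]

probs = softmax(logits)
token = sample_next_token(vocab, logits, rng=random.Random(0))
```

With these made-up scores, the plausible but wrong "Sydney" carries roughly a third of the probability mass. Nothing in the sampling step knows which answer is true, which is why plausibility and factual guarantee are separate properties.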

How to classify a use case before you build it

The decision framework we use on engagements is a per-use-case classification, not a per-organisation one. A single department’s wishlist usually splits across three buckets, and the budget should split the same way.

Class	Definition	What to do
Automatable	Current models can perform the task reliably on representative inputs from your environment, given available data.	Proceed to build. Define measurable outcomes (accuracy, latency, cost per task) before development starts.
Speculative	The task requires capability beyond what current models reliably deliver — multi-step reasoning over proprietary data, super-human judgement, or strict factual guarantees the model architecture cannot provide.	Do not commit to a build. Either drop the use case or fund a bounded research phase first.
Research	The feasibility is unknown without investigation — the data exists but its quality is untested, or the task is borderline.	Run a time-boxed investigation with explicit go/no-go criteria. Spend is capped; commitment is deferred.

This classification is the core of a structured GenAI feasibility assessment. It produces a defensible artifact: if the project proceeds, the assessment justifies the spend; if it does not, the assessment is what prevents the waste. The point is not to be pessimistic. The point is to put each candidate task in the right column before procurement signs a contract scoped to all three columns at once.

What “automatable” actually requires

For a use case to land in the automatable column, three things have to hold. First, the task has to be expressible in the assistant’s input/output modality: text in, text out, optionally with a tool call. Second, the data the assistant needs to do the task must be available, clean enough, and accessible at runtime (retrieval pipelines, vector stores, structured APIs). Third, the consequences of a wrong answer must be bounded: either the assistant’s output gets a human review pass, or the cost of a wrong answer is small enough to absorb.

Drafting first-pass marketing copy, summarising a meeting transcript, classifying inbound support tickets, generating boilerplate code — these tend to be automatable because all three conditions hold. The assistant produces a draft, a human checks it, and the productivity gain is real and measurable.

What pushes a use case into “speculative”

A use case becomes speculative the moment one of those three conditions breaks. “The assistant should diagnose customer issues end-to-end with no human review” breaks the third condition. “The assistant should answer accurately about our internal product spec that lives in twelve unindexed Confluence spaces” breaks the second. “The assistant should reason about a multi-step legal scenario and produce a defensible conclusion” breaks the first, because next-token plausibility is not a substitute for legal reasoning under accountability.
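The three-condition test can be written down as a small classifier. The sketch below is one possible encoding, not a prescribed tool: the names `UseCase`, `classify`, and the field names are hypothetical, and treating an unknown condition (`None`) as research-class is an assumption drawn from the table's description of that class.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FeasibilityClass(Enum):
    AUTOMATABLE = "automatable"
    SPECULATIVE = "speculative"
    RESEARCH = "research"


@dataclass
class UseCase:
    name: str
    # Each condition is True (holds), False (breaks) or None (not yet known).
    fits_modality: Optional[bool]
    data_available: Optional[bool]
    error_cost_bounded: Optional[bool]


def classify(use_case: UseCase) -> FeasibilityClass:
    conditions = [
        use_case.fits_modality,
        use_case.data_available,
        use_case.error_cost_bounded,
    ]
    if any(c is False for c in conditions):
        return FeasibilityClass.SPECULATIVE  # a known break
    if any(c is None for c in conditions):
        return FeasibilityClass.RESEARCH  # assumption: unknown means investigate
    return FeasibilityClass.AUTOMATABLE


wishlist = [
    UseCase("Classify inbound support tickets", True, True, True),
    UseCase("Diagnose customer issues with no human review", True, True, False),
    UseCase("Answer from unindexed Confluence spaces", True, False, True),
    UseCase("Summarise contracts (data quality untested)", True, None, True),
]
results = {uc.name: classify(uc).value for uc in wishlist}
```

Running every item on a wishlist through the same function makes the per-use-case split explicit, and the budget can then be divided along the same lines.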

Continental’s in-car voice assistant (Continental AG, 2019) is a useful illustration of staying inside the envelope. The assistant remembers user preferences, suggests refuelling on Fridays for a regular weekend driver, and finds restaurants when the driver says “I’m hungry”. Each of those is a well-bounded retrieval or recommendation task, and therefore automatable. It does not try to diagnose engine faults from natural-language complaints, which would push it into speculative territory.

Data readiness is the silent gate

The most common reason a use case slips from “looks automatable” to “actually speculative” is that the underlying data is not ready. This is the readiness check we run before any commitment:

Coverage — does the data the assistant needs at runtime actually exist in a machine-readable form, or does it live in PDFs, screenshots, and tribal knowledge?
Freshness — how stale can the data be before the assistant’s answers become wrong? Daily? Hourly? Real-time?
Access — can the runtime system reach the data with acceptable latency? Retrieval-augmented generation over a poorly indexed corpus produces confident wrong answers.
Governance — who is allowed to see what, and can the assistant enforce that boundary? An assistant that answers “what is Alex’s salary” to the wrong person is a breach, not a feature.

Data readiness is a project-specific outcome to measure before development starts, not a question to discover during it. The companion question — whether the organisation is ready to run an AI project at all — is a different assessment, and we cover it in enterprise AI readiness for genuine business outcomes. Per-use-case feasibility runs once organisational readiness is established.
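The four readiness properties lend themselves to an explicit checklist that reports which gates are failing. The following is a minimal sketch under stated assumptions: `DataReadiness` and `is_fresh` are hypothetical names, the dates and the 24-hour staleness limit are invented, and in practice each boolean would come from a real audit rather than a hard-coded value.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


def is_fresh(last_updated: datetime, max_age: timedelta, now: datetime) -> bool:
    """True if the data was refreshed within the staleness the task tolerates."""
    return now - last_updated <= max_age


@dataclass
class DataReadiness:
    coverage: bool    # data exists in machine-readable form
    freshness: bool   # data is refreshed often enough for the task
    access: bool      # runtime can reach the data with acceptable latency
    governance: bool  # per-user access rules can be enforced

    def failing(self) -> List[str]:
        """Names of the properties that do not hold, in declaration order."""
        return [name for name, ok in vars(self).items() if not ok]


# Hypothetical knowledge base: last refreshed four days ago, daily freshness needed.
now = datetime(2024, 6, 1, 12, 0)
kb_check = DataReadiness(
    coverage=True,
    freshness=is_fresh(datetime(2024, 5, 28, 9, 0), timedelta(hours=24), now),
    access=True,
    governance=False,
)
failing = kb_check.failing()
```

Here `failing` lists `freshness` and `governance`, so the use case would not be treated as automatable until both are fixed.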

Figure 3 – Slack Project Management Workspace has incorporated AI tools into its pipeline (Peterson, 2024).

Productivity wins that survive the feasibility filter

The use cases that consistently pass classification — and produce measurable productivity gains — share a profile. They sit alongside a human, they have a bounded scope, and they have a feedback loop.

Project coordination. Tools like Slack now ship AI-driven conversation summaries and thread digests (Peterson, 2024). The task is bounded (summarise this thread), the data is in-scope by construction (the assistant already has access to the channel), and the user reads the summary before acting on it. Automatable.

Customer service triage. Studies suggest a meaningful share of online shoppers ask for help during purchase (Stats, 2013), in contrast with in-store behaviour where most prefer to be left alone (Turner, 2018). A retrieval-augmented assistant that surfaces the right answer from a product knowledge base, with a human handoff for edge cases, is automatable. The same assistant pretending to resolve a billing dispute autonomously is speculative.

Content generation with human review. Canva’s image generator wraps multiple backends (DALL·E, Imagen, Dream Lab) and produces twenty variants per prompt. The user picks one. The task is bounded, the failure mode (a bad image) is cheap, the productivity gain is real. ChatGPT used as a marketing-copy draft tool, as Coca-Cola has explored (Marr, 2023), follows the same shape.

Industrial assistance. Harley-Davidson reported a sales uplift after deploying the Albert AI marketing tool (Marr, 2018). The use case — campaign optimisation over structured ad-platform data — is well-scoped and measurable. Insurance examples are the same shape: Allstate using AI to flag suspicious claims, Zurich using predictive analytics over climate and geographic data to warn policyholders (Poleo, 2023). In each case, the AI produces a signal; a human acts on it.
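The human handoff in the customer service triage case above can be sketched as a routing rule. Everything here is illustrative: `RetrievedAnswer`, `route`, the 0.75 score threshold, and the list of sensitive terms are assumptions made for the example, not values from any product or from the cited sources.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RetrievedAnswer:
    text: str
    score: float  # retrieval similarity, assumed to lie in [0, 1]


def route(
    question: str,
    answer: Optional[RetrievedAnswer],
    min_score: float = 0.75,
    sensitive_terms: Sequence[str] = ("billing", "refund", "dispute"),
) -> str:
    """Return "assistant" only when the answer is confident and low-risk."""
    lowered = question.lower()
    if any(term in lowered for term in sensitive_terms):
        return "human"  # consequential topics always go to a person
    if answer is None or answer.score < min_score:
        return "human"  # nothing retrieved, or a weak match
    return "assistant"


decision = route(
    "How do I reset my password?",
    RetrievedAnswer("Use the 'Forgot password' link.", 0.91),
)
```

The design choice is that the assistant has to earn the right to answer: the default path is the human one, and the automated path is the exception that needs both a strong retrieval match and a low-stakes topic.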

These are not exciting because they are autonomous. They are exciting because they are real. The productivity gain is the integral of many small, well-scoped automations, not one heroic agent.

What measurable outcome should you define?

Before development starts on any automatable use case, write down the answer to four questions. If the assessment cannot answer them, the use case is not ready to build.

What does the assistant produce? A draft, a classification, a retrieval, a recommendation. Be specific about the output type.
What is the accuracy bar? Not “high”. A number, or a measurable proxy — error rate on a held-out sample, human-edit distance on drafts, escalation rate on tickets.
What is the latency and cost budget? Per task, at expected volume. A workflow that needs sub-second response constrains the model choice far more than a workflow that tolerates ten seconds.
What does failure look like, and who catches it? A human in the loop, a downstream check, a confidence threshold. Without this, every wrong answer is a production incident.

These four answers are what make the spend defensible later. They are also what separate a feasibility assessment from a wishlist.
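One way to keep those four answers honest is to record them as a specification and check pilot results against it. The sketch below assumes hypothetical names (`OutcomeSpec`, `PilotResult`, `evaluate`) and invented numbers, and it uses 95th-percentile latency as the latency measure, which is a choice made for this example rather than a requirement.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class OutcomeSpec:
    output_type: str          # what the assistant produces
    max_error_rate: float     # accuracy bar, as a number
    max_latency_s: float      # latency budget per task
    max_cost_per_task: float  # cost budget per task
    failure_owner: str        # who catches a wrong answer


@dataclass
class PilotResult:
    tasks: int
    errors: int
    p95_latency_s: float
    total_cost: float


def evaluate(spec: OutcomeSpec, result: PilotResult) -> Dict[str, bool]:
    """Compare measured pilot numbers against the agreed bars."""
    error_rate = result.errors / result.tasks
    cost_per_task = result.total_cost / result.tasks
    return {
        "accuracy": error_rate <= spec.max_error_rate,
        "latency": result.p95_latency_s <= spec.max_latency_s,
        "cost": cost_per_task <= spec.max_cost_per_task,
    }


# Invented figures for a ticket-classification pilot.
spec = OutcomeSpec("classification", 0.05, 2.0, 0.02, "support team lead")
pilot = PilotResult(tasks=500, errors=20, p95_latency_s=1.8, total_cost=12.5)
report = evaluate(spec, pilot)
```

In this example the pilot meets the accuracy and latency bars (4% errors, 1.8 seconds) but misses the cost bar (0.025 per task against a budget of 0.02), which is exactly the kind of finding that is cheaper to surface before a full build.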

FAQ

How do I judge whether a specific generative AI use case is technically feasible with current models?

Classify the use case against three conditions: the task fits the model’s input/output modality, the data the model needs exists and is accessible at runtime, and the consequences of a wrong answer are bounded by a human review pass or a small cost-per-error. If all three hold, the use case is automatable. If one breaks, it is speculative or research-class.

What does a structured GenAI feasibility assessment look like, and what does it answer?

It is a per-use-case classification (automatable, speculative, or research) plus a data-readiness check, plus explicit measurable outcomes for each automatable item — accuracy bar, latency budget, cost per task, failure-handling path. The output is a defensible artifact: it justifies the spend if the project proceeds and prevents the waste if it does not.

Which use cases should we classify as automatable, speculative, or research — and why?

Automatable: bounded tasks with available data and human review (draft generation, ticket triage, summarisation, structured retrieval). Speculative: tasks needing super-human judgement, strict factual guarantees, or end-to-end autonomy on consequential decisions. Research: tasks where data quality or feasibility is unknown and a time-boxed investigation is cheaper than a commitment.

How do I assess data readiness before committing to a GenAI build?

Check four properties of the data the assistant will need at runtime: coverage (does it exist in machine-readable form), freshness (how stale can it be before answers go wrong), access (latency and indexing for retrieval), and governance (who is allowed to see what, and can the assistant enforce that). If any of the four is failing, the use case is research-class until it is fixed.

What measurable outcomes should we define before development starts so the spend is defensible later?

Output type, accuracy bar (as a number or a measurable proxy), latency and cost budget per task at expected volume, and the failure-handling path with explicit ownership. Writing these down before any code is written is the difference between a feasibility assessment and a wishlist.

How does per-use-case feasibility relate to (and depend on) organisational AI readiness?

They are sequenced. Organisational readiness — the question of whether the organisation can run an AI project at all (data foundations, governance, change capacity) — is the gate on whether to start. Per-use-case feasibility is the filter on which use cases inside that project to pursue. Readiness is covered separately under our enterprise AI readiness work; feasibility runs after it.

What we offer

We run structured feasibility assessments for organisations standing at a generative AI investment decision point. The output is the artifact described above — a per-use-case classification, a data-readiness check, and measurable outcomes for each item that survives — produced before development commitment, not discovered during it. If you have a GenAI budget approved and a list of candidate use cases longer than it should be, tell us where you are and we will help you split the list into the columns that matter.

A common pattern is the buyer who has been sold an eight-use-case scope and suspects, correctly, that only three of them are real. Naming the other five — and the reason each one is speculative or research-class — is what a feasibility assessment is for.