“NLP in our computer vision pipeline” is one of the most ambiguous requests we see in scoping calls. The phrase covers at least four distinct engineering problems: optical character recognition (text from images), captioning (text about scenes), visual question answering (text plus image, answered in text), and grounded scene reasoning (text as input to a structured scene graph). Each has its own data shape, its own evaluation metric, and its own build-versus-buy economics. Teams that specify which one they actually need buy or build the right component. Teams that ask broadly burn quarters integrating capabilities they did not want. This piece separates the four, names the architectural patterns that fuse vision and language, and points at where production multimodal models shift the build-versus-buy decision.

What does “NLP in computer vision” actually mean?

The label is doing too much work. In our experience reviewing computer-vision scopes, the same phrase shows up to mean very different things depending on who is asking:

- OCR and document AI: the input is an image, the output is a text string or a structured extraction (invoice fields, ID document fields, radiology-report headers). The “NLP” part is downstream of pixel-to-text and looks like classical text processing.
- Image and video captioning: the input is a scene, the output is a natural-language description. Used heavily for accessibility (Facebook’s automatic alt-text is the canonical example) and for indexing media libraries.
- Visual question answering (VQA): the input is an image plus a free-text question, the output is a free-text answer. Used in customer-service flows where users send a product photo with a question attached.
- Grounded scene reasoning: the input is text and an image, the output is a structured grounding (bounding boxes tied to noun phrases, scene graphs, action labels). This is the layer underneath things like robotic instruction following or referring-expression segmentation.

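The four problem classes above differ mainly in whether text accompanies the image and in what kind of output is expected. A minimal sketch of that distinction as a lookup follows; the names `ProblemClass`, `Output`, `Request`, and `classify` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class ProblemClass(Enum):
    OCR_DOCUMENT_AI = "OCR and document AI"
    CAPTIONING = "image and video captioning"
    VQA = "visual question answering"
    GROUNDED_REASONING = "grounded scene reasoning"


class Output(Enum):
    EXTRACTED_TEXT = "text string or structured fields read out of the image"
    DESCRIPTION = "natural-language description of the scene"
    FREE_TEXT_ANSWER = "free-text answer to a question"
    STRUCTURED_GROUNDING = "bounding boxes, scene graph, or action labels"


@dataclass(frozen=True)
class Request:
    has_text_input: bool  # does a question or instruction accompany the image?
    output: Output        # what the system is expected to produce


# Mapping taken from the input/output descriptions in the list above.
_TABLE = {
    (False, Output.EXTRACTED_TEXT): ProblemClass.OCR_DOCUMENT_AI,
    (False, Output.DESCRIPTION): ProblemClass.CAPTIONING,
    (True, Output.FREE_TEXT_ANSWER): ProblemClass.VQA,
    (True, Output.STRUCTURED_GROUNDING): ProblemClass.GROUNDED_REASONING,
}


def classify(request: Request) -> ProblemClass:
    """Return the problem class for a request, or raise if the spec is ambiguous."""
    key = (request.has_text_input, request.output)
    if key not in _TABLE:
        raise ValueError(f"Spec does not match one of the four classes: {key}")
    return _TABLE[key]


photo_with_question = Request(has_text_input=True, output=Output.FREE_TEXT_ANSWER)
print(classify(photo_with_question).value)
```

Raising on an unmatched combination is a deliberate choice in this sketch: a request that fits none of the four rows is a sign the scope still needs narrowing.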
These are not interchangeable. An OCR pipeline does not give you VQA. A captioning model does not give you grounded reasoning. Conflating them is the most common source of mis-scoped multimodal projects we see.

Which architectures actually fuse vision and language?

Two architectural families dominate the production landscape today, and a third sits above them.

Dual-encoder contrastive models (CLIP-style). A vision encoder (typically a Vision Transformer, or a CNN backbone in older deployments) and a text encoder are trained to map images and captions into a shared embedding space. What comes out is a similarity score between the two embeddings, not a generated sentence. CLIP-style models are the workhorse for image-text retrieval, zero-shot classification, and content-moderation gating: anywhere you need “does this image match this description” without generating new text.

Multimodal transformers with cross-attention. A single transformer stack ingests both visual tokens (from a ViT patch encoder or a CNN feature map) and language tokens, with cross-attention layers that let language tokens attend to visual ones and vice versa. This is the architecture behind production VQA systems and captioning models. The implementation detail that matters is how the visual tokens are produced (patch embeddings, region proposals, or learned queries via a Perceiver-style resampler), because that choice drives both latency and accuracy on small-object tasks.

Multimodal LLMs (vision-language models). GPT-4V, Gemini, Claude with vision, and the open LLaVA and Qwen-VL families wrap a vision encoder into a chat-style LLM. The interface is a free-form prompt that can contain images. For many product teams this collapses the captioning / VQA / grounding distinction into a single API call, which is exactly why the build-versus-buy question has changed.

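The dual-encoder pattern described above reduces, at inference time, to comparing embeddings. The sketch below shows zero-shot classification as a softmax over cosine similarities. It is illustrative only: the three-dimensional vectors are made-up stand-ins for real encoder outputs, and the `temperature` value is an assumption (CLIP-style models learn their own scaling).

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def zero_shot_scores(image_embedding, label_embeddings, temperature=0.07):
    """Softmax over temperature-scaled image-text similarities."""
    logits = {
        label: cosine(image_embedding, emb) / temperature
        for label, emb in label_embeddings.items()
    }
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - top) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Toy embeddings (assumed values); a real system would take these
# from the vision encoder and the text encoder respectively.
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a cat": [0.1, 0.9, 0.3],
}

scores = zero_shot_scores(image_embedding, label_embeddings)
best = max(scores, key=scores.get)
print(best)
```

The same similarity function serves retrieval and moderation gating: rank a corpus by score, or compare the score against a threshold, without generating any text.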
When build-versus-buy actually changes

Three years ago, building a VQA system meant assembling a vision encoder, a language model, a fusion layer, training data, and a serving stack. Today, a multimodal LLM API call covers most of that for prototype-quality workloads. What the API call does not cover, and what we still see teams build in-house: domain-specific OCR on degraded documents (the failure mode is not “the model is less accurate”, it is “the model hallucinates plausible-looking fields”), real-time on-device captioning for accessibility on low-power hardware (latency and offline operation rule out hosted APIs), and any workflow where the visual data cannot leave a controlled environment (medical imaging, defence, regulated finance). The build case has narrowed but not disappeared.

Where this matters in production

A few practical pairings show up repeatedly in our scoping work. We’ve covered the broader landscape of computer vision in action across industries; the items below are specifically the cases where the language layer is what makes the application work.

| Application | What NLP contributes | What CV contributes | Why the pair matters |
| --- | --- | --- | --- |
| Image captioning for accessibility | Generates the descriptive sentence | Identifies objects, scene, and relationships | Alt-text at the scale of a social platform cannot be authored by humans |
| Visual question answering | Parses the user’s question; generates the answer | Extracts visual features the question refers to | Customer-service flows where the photo is the query |
| Document AI | Structures extracted text; reasons over fields | Detects layout, tables, signatures | Classical OCR + downstream NLP, often the cheapest path |
| Multimodal RAG over visual data | Embeds queries; generates grounded answers | Embeds images and frames into the same space | Search across mixed corpora (engineering docs with diagrams, video archives) |
| Content moderation | Classifies the textual context and the OCR’d overlay | Classifies the underlying imagery | User-generated content combines image + caption + comments; each can independently violate policy |
| Radiology workflow audit | Extracts findings from the report | Localises the corresponding regions in the scan | Cross-checking that what the radiologist wrote matches what the image shows |

Which CV applications require an NLP layer to be useful?

The clearest cases are the ones where the output must be readable by a human or by a downstream agent, not by another model. Accessibility captioning is the canonical example: a bounding box is useless to a screen-reader user. Multimodal retrieval-augmented generation over a visual corpus is another: indexing the images is a CV problem, but answering “show me the failure mode that looks like this but in last quarter’s batch” requires a language interface on top of the embeddings. And document understanding at any scale beyond toy demos requires NLP downstream of OCR, because the raw text is rarely the deliverable.

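To make the last point concrete, here is a minimal sketch of a step that turns raw OCR text into structured fields. Regular expressions stand in for the downstream NLP stage, and the field names, patterns, and sample text are all assumptions made for illustration, not a recommended extraction scheme.

```python
import re

# Hypothetical patterns for two invoice fields. The \b before "Total"
# keeps the pattern from matching inside "Subtotal".
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
    "total": re.compile(
        r"\bTotal(?:\s+Due)?\s*[:\-]?\s*\$?\s*([0-9][0-9,]*\.[0-9]{2})",
        re.IGNORECASE,
    ),
}


def extract_fields(ocr_text):
    """Map raw OCR text to named fields; a missing field is None, never a guess."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        fields[name] = match.group(1) if match else None
    return fields


# Made-up OCR output for a single invoice page.
sample_ocr_text = (
    "ACME Ltd\n"
    "Invoice No: INV-2041\n"
    "Subtotal: 1,150.00\n"
    "Total Due: $1,200.50\n"
)

print(extract_fields(sample_ocr_text))
```

Returning `None` for a field that cannot be found keeps the failure visible to whatever consumes the output.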
NLP is not required, despite frequently being asked for, in pure detection and tracking pipelines (the output is structured coordinates, not language), industrial inspection (pass/fail with a defect class), and most surveillance and analytics deployments.

NLP-in-CV stack versus classical OCR + NLP

For document understanding specifically, the choice between a multimodal LLM and a classical OCR-then-NLP pipeline is the most common build decision we see, and it does not have a universal answer.

The classical pipeline, a dedicated OCR engine (Tesseract, AWS Textract, Azure Document Intelligence) feeding a downstream NLP stage, has two structural advantages: the OCR step is auditable (you can see exactly what text was extracted before any reasoning happens), and the failure modes are bounded (low-confidence characters fail visibly rather than being smoothly hallucinated). The classical path is also dramatically cheaper at high volume.

The multimodal-LLM path collapses the two stages and handles layout, handwriting, and mixed languages in one shot. It is the right choice when the document variety is too wide to train a specialised OCR for, when the downstream task requires reasoning over the document (not just extraction), and when audit requirements allow it. It is the wrong choice when “did this field contain ‘O’ or ‘0’” matters for downstream processing, because language models are confident in ways OCR engines are not.

The relevant production lesson is that the choice is rarely either/or. The strongest stacks we have seen pair a deterministic OCR layer with a multimodal model used only for the reasoning step it is genuinely better at.

What this means for scoping a project

When a team comes to us asking for “NLP in our computer vision pipeline”, the first hour is usually spent on three questions: What is the input modality at inference time (image only, image plus text, or text only)?
What does the system produce (a structured output, a free-text response, or a binary decision)? And who or what consumes that output (a human, an agent, or another model)? The answers select which of the four problem classes is actually in scope, and from there the architectural choice is mostly mechanical.

The same scoping pattern shows up across our computer-vision engagement work in regulated industries, where ambiguity in the language layer of the spec drives more cost overruns than ambiguity in the CV layer.

FAQ

Where do NLP and CV actually meet in production today: captioning, VQA, document AI, multimodal LLMs?

All four, and the boundaries between them are blurring. Captioning and VQA used to require separate models; multimodal LLMs now cover both behind a single prompt interface. Document AI remains its own track because audit and cost constraints push it toward classical OCR + NLP pipelines, with the multimodal model used only where it is genuinely better. The honest answer is that the label is one thing and the engineering problem is four, and you scope the engineering problem.

What architectural patterns fuse vision and language (CLIP-style, multimodal transformers)?

Three families dominate. Dual-encoder contrastive models (CLIP and its descendants) map images and text into a shared embedding space and are the workhorse for retrieval and zero-shot classification. Multimodal transformers with cross-attention let language tokens attend directly to visual tokens and back, which is the architecture under VQA and captioning. Multimodal LLMs (GPT-4V, Gemini, LLaVA, Qwen-VL) wrap a vision encoder into a chat-style language model and collapse the prior distinctions into a single prompt interface.

How do production multimodal models change the build-versus-buy decision for CV apps?

They narrow the build case substantially. Prototype-quality VQA, captioning, and basic document understanding are now an API call.
The remaining build cases are domain-specific OCR on degraded documents (hallucination is the failure mode, not accuracy), real-time on-device workloads where latency or offline operation rules out hosted APIs, and any workflow where the visual data cannot leave a controlled environment. The build case has narrowed; it has not disappeared.

Which CV applications now require an NLP layer to be useful (for example, RAG over visual data)?

The ones where the output must be readable by a human or by a downstream agent: accessibility captioning, multimodal retrieval-augmented generation over visual corpora, and document understanding beyond toy extractions. Pure detection, tracking, and industrial inspection do not require an NLP layer and frequently should not have one bolted on.

What are concrete real-world CV applications that fail without NLP integration?

Accessibility tooling (Facebook’s automatic alt-text and similar features cannot deliver a bounding box to a screen reader), customer-service flows where the user sends a photo with a question attached (the question is the spec; without parsing it the CV model has no task), and any retrospective audit workflow that compares a written report against the underlying imagery. Radiology cross-checking is the clearest example.

How does the NLP-in-CV stack compare with classical OCR + NLP pipelines for document understanding?

The classical pipeline is auditable, cheap at scale, and fails visibly. The multimodal path handles layout variety and downstream reasoning in one shot but can hallucinate confidently where an OCR engine would emit a low-confidence flag. The strongest production stacks we have seen pair a deterministic OCR layer with a multimodal model used only for the reasoning step it is genuinely better at. The choice is not either/or.
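The hybrid pairing described in the last answer can be sketched as a small routing step. This is one possible arrangement under stated assumptions, not a reference design: `OcrField`, `run_hybrid`, the 0.9 threshold, and the `reason` callable (a placeholder for whatever multimodal or language model performs the reasoning) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class OcrField:
    name: str
    text: str
    confidence: float  # assumed to be in [0, 1], as reported by the OCR engine


def run_hybrid(fields, reason: Callable[[dict], str], threshold: float = 0.9):
    """Run the reasoning step only on text the OCR layer is confident about.

    Low-confidence fields are surfaced for review instead of being passed
    on, so a doubtful 'O' versus '0' fails visibly rather than being
    smoothed over by the model.
    """
    trusted = {f.name: f.text for f in fields if f.confidence >= threshold}
    flagged = [f.name for f in fields if f.confidence < threshold]
    if flagged:
        return {"status": "needs_review", "flagged": flagged, "extracted": trusted}
    return {"status": "ok", "extracted": trusted, "answer": reason(trusted)}


def stub_reasoner(extracted: dict) -> str:
    # Stand-in for a model call; it only reports what it was given.
    return f"{len(extracted)} fields checked"


clean_page = [
    OcrField("invoice_number", "INV-2041", 0.99),
    OcrField("total", "1,200.50", 0.97),
]
smudged_page = [
    OcrField("invoice_number", "INV-2O41", 0.55),
    OcrField("total", "1,200.50", 0.97),
]

result_clean = run_hybrid(clean_page, stub_reasoner)
result_smudged = run_hybrid(smudged_page, stub_reasoner)
print(result_clean["status"], result_smudged["status"])
```

Because the extracted text is returned alongside the answer, the audit trail of what the OCR layer read stays separate from what the model concluded.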