Search engines used to be ranked-link engines with a thin query parser on top. They are becoming something else: a retrieval system fused to a synthesis system, where the user’s query is answered by a model that has just read a handful of documents. The visible change is the answer box. The structural change is that the engine now takes on editorial responsibility for the words on the page, not just the ranking of pages. That shift reorganises what works, what breaks, and what teams should measure when they integrate generative AI into their own search surfaces.

This is the same disambiguation we keep returning to in generative AI for analytics and business workflows: co-pilot patterns (summarise, rephrase, surface) behave very differently from agent patterns (decide, execute, route). Search has both, glued together by an answer box that papers over which one is doing the work.

What actually changed under the hood

Classical web search is a retrieval problem with a relevance ranker on top. The output is a list: ten blue links, an entity card, a few rich snippets. The user does the synthesis: skim, click, read, decide.

Generative search inserts a second stage. The retriever still pulls candidate documents, but a language model then reads them and writes a paragraph back to the user. This is retrieval-augmented generation in the consumer-facing form that Google, Bing Copilot, Perplexity, and ChatGPT Search have all converged on.

That structure has two consequences worth naming directly.

First, the answer is no longer a ranked list of someone else’s content; it is a new piece of content produced at query time. The engine is now authoring. Citations are a footnote on that authorship, not the primary artefact. This is the operationally relevant measure of the change: in our experience working on enterprise search surfaces, the moment a model writes the answer, the system inherits an editorial liability it did not have when it only ranked links.
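A minimal sketch of the two-stage structure described above: a retriever picks candidate documents, then a synthesis step writes one passage with citations, declining when nothing relevant was retrieved. Everything here is an assumption for illustration: the `Document` and `SearchResult` types, the term-overlap retriever, and above all `synthesise`, which only quotes passages and stands in for a real LLM call.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str


@dataclass(frozen=True)
class SearchResult:
    mode: str                # "answer" (synthesised) or "links" (fallback)
    text: Optional[str]      # the synthesised passage, if any
    sources: list[Document]  # documents cited by the answer


def tokenize(text: str) -> set[str]:
    """Lowercase, split on whitespace, drop surrounding punctuation."""
    return {w.strip(".,?!:;()\"'").lower() for w in text.split()} - {""}


def retrieve(query: str, corpus: list[Document], k: int = 3) -> list[Document]:
    """Stage 1: toy lexical retriever that ranks by query-term overlap."""
    terms = tokenize(query)
    scored = [(len(terms & tokenize(d.text)), d) for d in corpus]
    hits = [(score, d) for score, d in scored if score > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in hits[:k]]


def synthesise(docs: list[Document]) -> str:
    """Stage 2: hypothetical stand-in for the LLM call.

    It only quotes each retrieved passage with its citation, which is
    the heavily constrained (extractive) end of the spectrum.
    """
    return " ".join(f"{d.text} [{d.doc_id}]" for d in docs)


def search(query: str, corpus: list[Document], k: int = 3) -> SearchResult:
    docs = retrieve(query, corpus, k)
    if not docs:
        # Nothing relevant retrieved: decline to synthesise.
        return SearchResult("links", None, [])
    return SearchResult("answer", synthesise(docs), docs)


# Hypothetical three-document corpus.
CORPUS = [
    Document("d1", "Retrieval-augmented generation pairs a retriever with a language model."),
    Document("d2", "Classical search returns a ranked list of links."),
    Document("d3", "A vector store holds embeddings for semantic retrieval."),
]

result = search("How does retrieval-augmented generation work?", CORPUS)
print(result.mode, "|", result.text)
```

The point of the sketch is the shape, not the components: once `synthesise` is a generative model rather than a quoting function, the text in `SearchResult.text` is authored by the system, and `sources` becomes a claim about provenance that has to be checked.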
Second, the cost profile inverts. Classical search is cheap to serve and expensive to index. Generative search is moderately expensive to index (you still need a good retriever, often a vector store on top of a lexical index) and significantly more expensive to serve, because every query runs an LLM. This is not a marketing detail: it shapes which queries get the generative treatment and which fall back to links.

Where the answer surface is genuinely useful

Some query classes benefit cleanly from a synthesised answer. The pattern is consistent: queries where the user wants a digest rather than a destination, and where the underlying facts are stable enough that synthesis does not drift.

| Query class | Why the answer surface helps | Risk class |
|---|---|---|
| Definitional (“what is X”) | One coherent paragraph beats ten overlapping intros | Low: well-trodden ground |
| Comparison (“X vs Y”) | Side-by-side framing is hard to assemble from links | Medium: depends on document recency |
| How-to with named steps | Sequenced instructions, with source links per step | Medium: version drift on tools |
| Disambiguation (“did you mean…”) | Model can ask back in natural language | Low |
| Re-summarisation of long documents | Reads the PDF the user would not | Medium: hallucination risk on figures |

Across these classes the productivity uplift is real but bounded. The user saves the click-skim-bounce loop. That is worth something, particularly on mobile and voice, where the loop is expensive. It is not, however, the same as the model “understanding” the query; it is the model rewriting a small set of retrieved passages into one passage.

Where it leaks

The failure modes cluster in three places.

Freshness mismatch between retriever and generator. The retriever may pull a document indexed yesterday; the generator may still smooth it through its pretraining priors, which are months or years old. The output reads fluent and confident even when the underlying fact is stale.
Filters and reranking help, but the structural issue (that the generator has its own implicit beliefs) does not go away.

Citation-to-claim drift. The model lists three sources. Two of them support the sentence they are attached to. The third was retrieved but did not actually contribute. The user, scanning the citations as a trust signal, assumes three-way corroboration where only two sources actually support the claim. This is observable across most consumer generative-search products today.

Long-tail collapse. The blue-link list naturally surfaces niche, specialist, contrarian, or non-English-first sources. A synthesised answer averages over the corpus and tends toward whatever the bulk of indexed English-language documents say. For long-tail queries this is a loss, not a gain: the user wanted the niche source and got a centroid.

These are not arguments against generative search. They are the boundary conditions a search team has to instrument for if they ship one.

What enterprise teams should actually do

Most teams reading this are not building a consumer web search engine. They are deciding whether to bolt a generative answer layer on top of their internal documentation, support portal, or product catalogue.

The decision rule we use with clients is co-pilot-first, the same rule we apply to GenAI in analytics workflows: ship the synthesis case where the cost of a wrong answer is low and the productivity uplift is measurable, before attempting the agent case where the model takes action on the user’s behalf.

Concretely, that means starting with retrieval over your own corpus, with a generator that is heavily constrained (extractive summarisation, quote-with-attribution, “answer only if confident” patterns), and instrumenting four things from day one:

Citation hit rate: of the sources the model cites, what share actually contain the claim attached to them. This is the single most useful internal metric we have seen. It catches drift before users do.
Deflection rate: share of queries where the user did not click through to a source after seeing the answer. High deflection is a signal, not a goal: it means either the answer was sufficient or the user gave up.
Answer latency at p95: generative answers can be slow. Users will tolerate a second or two; beyond that, the experience degrades faster than it does for a link list.
Fallback share: how often the system declines to synthesise and falls back to a link list. A healthy generative search system declines sometimes. One that always synthesises is over-confident.

These are observed patterns from search-surface engagements rather than benchmarked rates, but they have held across the cases we have seen.

The shape of the next two years

Generative search is not replacing classical search; it is sitting on top of it. The retriever is more important than ever: a bad retriever feeds a good generator garbage, and the generator’s fluency hides the problem. The interesting engineering work has moved into the boundary: rerankers that understand the generator’s failure modes, evaluation harnesses that measure citation-to-claim alignment, and policies for when not to synthesise.

Multilingual coverage, voice input, and rich media (charts, tables, generated diagrams) will continue to expand the surface. The harder problem (which queries deserve an LLM-written answer and which are better served by the old blue links) will not be solved by a bigger model. It will be solved by instrumentation.

How TechnoLynx can help

We work with teams shipping retrieval-and-synthesis surfaces into production search, support, and analytics workflows. The work is rarely about the model itself; it is about the retrieval quality, the citation discipline, the fallback policy, and the instrumentation that tells you when the system is drifting. If that is the shape of the problem you are working on, get in touch.

Failure class: citation-to-claim drift in retrieval-augmented generation.
Artifact: GenAI Feasibility Audit.
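
As a closing illustration, the four day-one metrics described earlier (citation hit rate, deflection rate, answer latency at p95, fallback share) can be computed from per-query logs roughly as follows. This is a sketch under stated assumptions, not a reference implementation: the `QueryLog` schema and the sample rows are hypothetical, the per-citation support labels are assumed to come from human review or a separate checker, deflection and latency are computed over synthesised answers only, and p95 uses the nearest-rank method.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class QueryLog:
    fell_back: bool       # True if the system declined to synthesise
    clicked_source: bool  # True if the user clicked through to a source
    latency_ms: float     # time until the response was rendered
    # One label per cited source: does it contain the claim attached to it?
    # Assumed to be produced by human review or a separate checker.
    citation_supported: list[bool] = field(default_factory=list)


def citation_hit_rate(logs: list[QueryLog]) -> Optional[float]:
    """Share of cited sources that actually contain their attached claim."""
    labels = [ok for log in logs for ok in log.citation_supported]
    return sum(labels) / len(labels) if labels else None


def deflection_rate(logs: list[QueryLog]) -> Optional[float]:
    """Share of synthesised answers after which the user did not click through."""
    answered = [log for log in logs if not log.fell_back]
    if not answered:
        return None
    return sum(not log.clicked_source for log in answered) / len(answered)


def latency_p95(logs: list[QueryLog]) -> Optional[float]:
    """Nearest-rank 95th percentile of latency over synthesised answers."""
    latencies = sorted(log.latency_ms for log in logs if not log.fell_back)
    if not latencies:
        return None
    rank = math.ceil(0.95 * len(latencies))
    return latencies[rank - 1]


def fallback_share(logs: list[QueryLog]) -> Optional[float]:
    """Share of all queries where the system fell back to a link list."""
    if not logs:
        return None
    return sum(log.fell_back for log in logs) / len(logs)


# Hypothetical sample: four synthesised answers and one fallback.
LOGS = [
    QueryLog(False, False, 900.0, [True, True, False]),
    QueryLog(False, True, 1400.0, [True]),
    QueryLog(True, True, 120.0),
    QueryLog(False, False, 2600.0, [True, False]),
    QueryLog(False, False, 1100.0, [True, True]),
]

print("citation hit rate:", citation_hit_rate(LOGS))
print("deflection rate:  ", deflection_rate(LOGS))
print("latency p95 (ms): ", latency_p95(LOGS))
print("fallback share:   ", fallback_share(LOGS))
```

On these five made-up rows the functions return 0.75, 0.75, 2600.0 and 0.2 respectively. The labelling step behind `citation_supported` is the expensive part in practice; the arithmetic is deliberately trivial so that the definitions, especially the choice of denominator, stay visible and can be argued about.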