## LLM architecture type determines suitability, not just size

When comparing LLMs, parameter count and benchmark scores dominate the discussion. Architecture type (decoder-only, encoder-decoder, or encoder-only) is equally important for matching a model to a use case and understanding its deployment constraints. These are not interchangeable architectures with the same capabilities at different scales.

### Decoder-only models

Examples: GPT-4, Llama 3, Claude, Gemini, Mistral

The decoder-only architecture generates text autoregressively: each output token is predicted from all previous tokens, and the model sees only past context (causal attention mask). This is the dominant architecture for general-purpose LLMs.

Strengths: Natural fit for open-ended text generation, instruction following, reasoning, and code generation. Scales well. Works well with RLHF alignment techniques.

Limitations: Not inherently suited to tasks requiring bidirectional understanding (e.g., classification over full documents). Inference latency scales with output length.

### Encoder-decoder models

Examples: T5, FLAN-T5, mT5, BART

The encoder processes the input bidirectionally (it sees all input tokens simultaneously) and produces a representation. The decoder then generates the output autoregressively from that representation.

Strengths: Well suited to tasks with a clear input/output structure: summarization, translation, question answering from context, structured extraction. The encoder's bidirectional attention captures input semantics better than causal attention for tasks where full input comprehension matters.

Limitations: Requires more careful per-task fine-tuning. Less amenable to few-shot prompting than decoder-only models. Smaller ecosystem of production-ready models.

### Encoder-only models

Examples: BERT, RoBERTa, DeBERTa, sentence-transformers

Processes the input bidirectionally and produces contextualized representations. It does not generate text; it represents it.

Strengths: Fast inference, small footprint. Excellent for classification, named entity recognition, semantic search (embedding generation), and tasks requiring full-document understanding.

Limitations: Cannot generate text. Requires fine-tuning for most tasks rather than prompting.

### Architecture comparison

| Architecture | Generation | Classification | Embedding | Inference cost | Best for |
|---|---|---|---|---|---|
| Decoder-only | ✓ Excellent | ✓ Possible | ✓ With pooling | High (KV cache) | General tasks, instruction following |
| Encoder-decoder | ✓ Structured | ✓ With head | Limited | Medium | Translation, summarization |
| Encoder-only | ✗ | ✓ Excellent | ✓ Excellent | Low | Search, classification, NER |

### Which type to choose?

For production systems, the architecture choice follows the task (a brief code sketch of this mapping follows below):

- Semantic search / retrieval: Encoder-only, or decoder-only with embedding pooling. Bi-encoders (encoder-only) are typically faster for large-scale retrieval.
- Classification / extraction: Encoder-only or a small encoder-decoder, fine-tuned. Far cheaper to run than decoder-only at scale.
- Summarization / translation: Encoder-decoder models fine-tuned on the task, or decoder-only with a specific prompt.
- Open-ended generation / instruction following / RAG generation: Decoder-only. The modern LLM ecosystem is built on this architecture.

For more on how these model types fit into the broader generative AI landscape, what types of generative AI models exist beyond LLMs covers the full taxonomy.
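A minimal sketch of this task-to-architecture mapping, assuming the Hugging Face transformers and sentence-transformers libraries are installed. The small checkpoints named here (distilgpt2, t5-small, a DistilBERT sentiment model, all-MiniLM-L6-v2) are illustrative stand-ins, not recommendations for production.

```python
# Route each task to the architecture class that fits it.
# Checkpoints are small public stand-ins chosen only for illustration.
from transformers import pipeline
from sentence_transformers import SentenceTransformer

document = ("Encoder-only models represent text without generating it. "
            "Decoder-only models generate text token by token from past context.")

# Decoder-only (causal LM): open-ended generation, instruction following.
generator = pipeline("text-generation", model="distilgpt2")
print(generator("The main difference between encoder and decoder models is",
                max_new_tokens=40)[0]["generated_text"])

# Encoder-decoder (seq2seq): input-to-output transformations such as summarization.
summarizer = pipeline("summarization", model="t5-small")
print(summarizer(document, max_length=40)[0]["summary_text"])

# Encoder-only: classification and embeddings, no generation step at all.
classifier = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier(document))

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
print(embedder.encode([document]).shape)  # (1, 384) dense vector for semantic search
```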
### How does architecture choice affect deployment cost?

The architecture type directly determines inference cost through two mechanisms: memory footprint and computational complexity per token.

Decoder-only models (GPT-style) generate tokens autoregressively: each new token requires attending to all previous tokens, creating an O(n²) attention cost that grows with sequence length. Encoder-decoder models (T5-style) compute the encoder output once and reuse it during decoding, making them more efficient for tasks where the input is long relative to the output.

For summarization (long input, short output), encoder-decoder models are computationally cheaper because the encoder processes the input once and the decoder generates only the summary. We have measured 40–60% lower inference cost for T5-family models versus GPT-family models on summarization tasks with input lengths above 2,000 tokens. The crossover point, where decoder-only becomes cheaper, occurs when output length exceeds input length, which is rare in production applications.

Encoder-only models (BERT-style) occupy a different cost tier entirely. They process the full input in a single forward pass with no autoregressive generation, making them 5–10× cheaper per inference than generative models for classification, embedding, and extraction tasks. We deploy BERT-family models for tasks that require understanding but not generation: document classification, named entity recognition, and semantic search. Using a generative model for these tasks wastes compute on generation capability that is not needed.

The memory footprint difference also affects hardware requirements. A 7B-parameter decoder-only model requires approximately 14 GB of GPU memory at FP16 just for the model weights. An encoder-only model with equivalent understanding capability (e.g., DeBERTa-v3-large at 304M parameters) requires less than 1 GB. For deployment scenarios where GPU memory is the constraining resource, architecture selection determines how many models can be co-located on a single GPU.

Our architecture selection framework: start with encoder-only for classification and extraction tasks, use encoder-decoder for input-to-output transformation tasks (translation, summarization), and reserve decoder-only for open-ended generation where the output length and content are unpredictable. This task-architecture matching reduces inference costs by 30–70% compared with defaulting to decoder-only for all tasks.
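The weight and KV-cache arithmetic behind the memory figures above is simple enough to sanity-check in a few lines. The sketch below is a back-of-envelope estimator, not a profiler: the Llama-7B-like shape (32 layers, 4096 hidden size) used for the decoder-only example is an assumption for illustration, and the KV-cache formula assumes full multi-head attention at FP16 (grouped-query attention and quantization would shrink it).

```python
def weight_memory_gb(n_params: float, bytes_per_param: float = 2.0) -> float:
    """FP16/BF16 weight footprint: 2 bytes per parameter."""
    return n_params * bytes_per_param / 1e9

def kv_cache_gb(n_layers: int, hidden_size: int, seq_len: int,
                batch: int = 1, bytes_per_elem: float = 2.0) -> float:
    """Decoder-only KV cache: one key and one value vector per layer, per token.
    Assumes full multi-head attention; GQA or quantized caches are smaller."""
    return 2 * n_layers * hidden_size * seq_len * batch * bytes_per_elem / 1e9

# 7B decoder-only model (assumed Llama-7B-like shape: 32 layers, 4096 hidden)
print(weight_memory_gb(7e9))                                      # ~14.0 GB of weights
print(kv_cache_gb(n_layers=32, hidden_size=4096, seq_len=2000))   # ~1.0 GB cache at 2k context

# DeBERTa-v3-large encoder-only backbone: no KV cache, one forward pass
print(weight_memory_gb(304e6))                                     # ~0.6 GB of weights
```

The same two functions make the co-location point concrete: on a 24 GB GPU, one 7B decoder-only model plus its cache fills most of the card, while dozens of sub-1 GB encoder-only models fit alongside each other.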