What are transformers in deep learning?

Transformers are the neural-network family that quietly absorbed most of deep learning between 2017 and today. The architecture is small enough to sketch on a napkin — attention, feedforward, residual, normalisation — but it now sits underneath nearly every system labelled “AI” in a product roadmap, from chat assistants and image generators to protein folding and speech transcription. Understanding what a transformer actually does, and where it sits relative to older sequence models, is one of the cheapest investments a technical leader can make before scoping a generative-AI project.

The short version: a transformer reads an entire input at once, computes how strongly every position should attend to every other position, and uses those weights to build a contextual representation. Everything else — multi-head attention, positional encoding, the encoder-decoder split — is engineering scaffolding around that single idea.

What problem does the transformer actually solve?

Before 2017, sequence models were dominated by recurrent neural networks (RNNs) and their gated variants (LSTMs, GRUs). These networks processed inputs one step at a time, carrying a hidden state forward. That design had two structural problems that became binding as models grew. First, sequential processing meant the network could not parallelise across the time axis, which left modern GPUs under-utilised. Second, long-range dependencies — relationships between tokens far apart in a sequence — had to survive many hops through the hidden state, and in practice they tended to fade.

Convolutional networks (CNNs) helped on the parallelism front but introduced a different limit: with standard (undilated) convolutions, the receptive field grows only linearly with depth, so capturing genuinely long-range structure required deep stacks.

The transformer, introduced in “Attention Is All You Need” (Vaswani et al., 2017), removed both constraints in one move. By replacing recurrence with self-attention, every output position can directly attend to every input position in a single layer, and the whole computation parallelises cleanly across positions. The trade-off is that comparing every position with every other makes attention cost grow quadratically with sequence length. The pattern reported across the architecture literature is that attention-based models train faster on parallel hardware and capture long-range dependencies more reliably than recurrent baselines on the same compute budget.

How does self-attention work, in one paragraph?

For each input token, the model produces three learned projections: a query, a key, and a value. The query for token A is compared against the keys of every token in the sequence, including A itself, via a dot product scaled by the square root of the key dimension, producing a vector of compatibility scores. Softmax turns those scores into a probability distribution: the attention weights. The output for token A is then a weighted sum of all the values, using those weights. The geometric reading: each token “looks around” the sequence and pulls in a context-specific blend of information from the positions it cares about.
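The paragraph above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not a production implementation: the projection matrices `w_q`, `w_k`, and `w_v` are hypothetical fixed values (identity matrices here), whereas in a real model they are learned, and a real system would use a tensor library rather than Python lists.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    tokens: list of input vectors, one per position.
    w_q, w_k, w_v: projection matrices (hypothetical fixed values here).
    Returns (outputs, weights); weights[i][j] is how strongly
    position i attends to position j.
    """
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    scale = math.sqrt(len(keys[0]))  # square root of the key dimension

    outputs, weights = [], []
    for q in queries:
        # Compare this query against every key, including its own.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[d] for w, v in zip(row, values))
                        for d in range(len(values[0]))])
    return outputs, weights

# Toy example: three 2-dimensional tokens, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, identity, identity, identity)
for position, row in enumerate(weights):
    print(position, [round(w, 3) for w in row])
```

Each printed row is one token's attention distribution over the whole sequence and sums to 1. In this toy input the first token attends equally to itself and to the third token, because both keys have the same dot product with its query.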

Multi-head attention runs this process in parallel across several subspaces (8 heads in the original paper, often 32 or more in large modern models). Different heads tend to specialise: some track syntactic structure, some track entity coreference, some attend almost uniformly. The model concatenates the heads and projects back to the model dimension.
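A rough sketch of the split, attend, concatenate, and project sequence follows. It assumes the common layout in which the model dimension is sliced evenly across heads; the helper names (`split_heads`, `attend`) and the identity weight matrices are illustrative choices, not taken from any particular library.

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def attend(queries, keys, values):
    """Scaled dot-product attention for a single head."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs

def split_heads(vectors, num_heads):
    """Slice each d_model vector into num_heads chunks of equal size."""
    head_dim = len(vectors[0]) // num_heads
    return [[v[h * head_dim:(h + 1) * head_dim] for v in vectors]
            for h in range(num_heads)]

def multi_head_attention(tokens, w_q, w_k, w_v, w_o, num_heads):
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    # Each head runs attention independently on its own slice.
    per_head = [attend(q, k, v)
                for q, k, v in zip(split_heads(queries, num_heads),
                                   split_heads(keys, num_heads),
                                   split_heads(values, num_heads))]
    # Concatenate the heads position by position, then project back to d_model.
    concatenated = [sum((head[i] for head in per_head), [])
                    for i in range(len(tokens))]
    return [matvec(w_o, c) for c in concatenated]

def identity(n):
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]

# Toy example: d_model = 4 split across 2 heads (hypothetical values).
tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
eye = identity(4)
mixed = multi_head_attention(tokens, eye, eye, eye, eye, num_heads=2)
print(len(mixed), len(mixed[0]))  # prints "3 4"
```

The output has the same shape as the input, which is what lets the layer be stacked.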

The four repeating ingredients

A transformer layer has the same four pieces, repeated as a stack:

Ingredient	Role	Modern variants
Multi-head self-attention	Mix information across positions	FlashAttention v3, sliding-window, grouped-query attention
Position-wise feedforward	Per-token non-linear transformation	SwiGLU, GeGLU in place of the original ReLU MLP
Residual connections	Preserve gradient flow through depth	Standard; sometimes scaled (DeepNet)
Layer normalisation	Stabilise training	RMSNorm replaces LayerNorm in most 2024+ systems
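To show how the four ingredients in the table fit together, here is a minimal sketch of one layer in plain Python. Several things are assumptions made for illustration: it uses the pre-norm arrangement (normalise before each sub-block, which is common in recent models, whereas the original paper normalised after), a single attention head, RMSNorm, a SwiGLU feedforward, and random weights from a hypothetical `random_params` helper standing in for learned parameters.

```python
import math
import random

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide by the root mean square, then apply a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def self_attention(xs, p):
    """Single-head self-attention (multi-head splitting omitted for brevity)."""
    queries = [matvec(p["w_q"], x) for x in xs]
    keys = [matvec(p["w_k"], x) for x in xs]
    values = [matvec(p["w_v"], x) for x in xs]
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        mixed = [sum(e / total * v[d] for e, v in zip(exps, values))
                 for d in range(len(values[0]))]
        outputs.append(matvec(p["w_o"], mixed))
    return outputs

def swiglu_ffn(x, p):
    """Position-wise feedforward: down(silu(gate(x)) * up(x))."""
    gate = [g / (1.0 + math.exp(-g)) for g in matvec(p["w_gate"], x)]  # SiLU
    up = matvec(p["w_up"], x)
    return matvec(p["w_down"], [g * u for g, u in zip(gate, up)])

def transformer_block(xs, p):
    """Pre-norm layer: x + Attn(Norm(x)), then x + FFN(Norm(x))."""
    attn_out = self_attention([rms_norm(x, p["norm1"]) for x in xs], p)
    xs = [add(x, a) for x, a in zip(xs, attn_out)]      # residual around attention
    ffn_out = [swiglu_ffn(rms_norm(x, p["norm2"]), p) for x in xs]
    return [add(x, f) for x, f in zip(xs, ffn_out)]     # residual around feedforward

def random_params(d_model, d_ff, rng):
    """Hypothetical random weights standing in for learned parameters."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {"w_q": mat(d_model, d_model), "w_k": mat(d_model, d_model),
            "w_v": mat(d_model, d_model), "w_o": mat(d_model, d_model),
            "w_gate": mat(d_ff, d_model), "w_up": mat(d_ff, d_model),
            "w_down": mat(d_model, d_ff),
            "norm1": [1.0] * d_model, "norm2": [1.0] * d_model}

# Stack two layers over three 4-dimensional token vectors.
rng = random.Random(0)
layers = [random_params(d_model=4, d_ff=8, rng=rng) for _ in range(2)]
hidden = [[1.0, 0.0, -1.0, 0.5], [0.2, 0.3, 0.1, -0.4], [0.0, 1.0, 1.0, 0.0]]
for params in layers:
    hidden = transformer_block(hidden, params)
print(len(hidden), len(hidden[0]))  # prints "3 4"
```

Because each sub-block only adds to its input, zeroing the output projections `w_o` and `w_down` turns the whole layer into the identity function. That is the sense in which residual connections preserve a direct path through depth.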

The original encoder-decoder layout still applies to translation-style tasks, but most large language models today are decoder-only stacks (GPT, Llama, Mistral, DeepSeek). Encoder-only stacks (BERT, ModernBERT) remain the workhorse for classification and retrieval, and full encoder-decoder models (T5, Whisper) dominate where the input and output sequences are clearly distinct.
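At the level of the attention computation, the main difference between an encoder-only and a decoder-only stack is the mask applied before softmax. The sketch below is illustrative only: `attention_weights` is a hypothetical helper, and the all-zero score matrix is chosen so the effect of the mask is easy to read.

```python
import math

def attention_weights(scores, causal):
    """Row-wise softmax over a square score matrix.

    With causal=True, position i may only attend to positions j <= i,
    which is what lets a decoder-only model generate left to right.
    """
    weights = []
    for i, row in enumerate(scores):
        masked = [s if (not causal or j <= i) else -math.inf
                  for j, s in enumerate(row)]
        peak = max(masked)  # always finite: position i can see itself
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Equal raw scores everywhere, so only the mask shapes the result.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
encoder_style = attention_weights(scores, causal=False)
decoder_style = attention_weights(scores, causal=True)
print(decoder_style[0])  # [1.0, 0.0, 0.0]
print(decoder_style[1])  # [0.5, 0.5, 0.0]
```

In the bidirectional (encoder-style) case every row is uniform over all three positions; in the causal (decoder-style) case each position spreads its attention only over itself and earlier positions.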

Why positional encoding matters

Self-attention on its own is permutation-equivariant: shuffle the input tokens and the attention output shuffles with them, otherwise unchanged, so the model has no way to tell which token came first. That is fatal for language, code, or any sequence where order is part of the meaning. Positional encoding injects order information back in.
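The shuffle property can be checked directly. The sketch below uses attention with identity projections and no positional signal (a simplifying assumption; learned projections would not change the result) and compares the outputs before and after reordering the tokens.

```python
import math

def self_attention(tokens):
    """Self-attention with identity projections and no positional signal."""
    scale = math.sqrt(len(tokens[0]))
    outputs = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        outputs.append([sum(e / total * v[d] for e, v in zip(exps, tokens))
                        for d in range(len(tokens[0]))])
    return outputs

tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
order = [2, 0, 1]  # a shuffle of the positions
shuffled = [tokens[i] for i in order]

original_out = self_attention(tokens)
shuffled_out = self_attention(shuffled)

# The output at new position p should equal the original output of token order[p]:
# the same vectors come out, just reordered.
same = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
           for p, i in enumerate(order)
           for a, b in zip(shuffled_out[p], original_out[i]))
print(same)  # True
```

Nothing in the computation depends on where a token sits, only on which tokens are present, which is why order has to be supplied separately.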

The original transformer used fixed sinusoidal encodings added to the token embeddings. Modern systems have largely moved to rotary position embeddings (RoPE), which rotate the query and key vectors in a position-dependent way before the dot product, so that the attention score depends on the relative offset between two tokens. RoPE is now standard in Llama, Mistral, Qwen, DeepSeek, and most open-weight families; extending it to sequence lengths well beyond those seen in training usually relies on additional scaling techniques such as position interpolation. ALiBi (attention with linear biases) is the other commonly seen variant, especially in long-context models.
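Both schemes are short enough to sketch. The code below assumes an even vector dimension, the conventional base of 10000, and the interleaved pairing of dimensions for RoPE (real implementations differ in how they pair dimensions); the vectors `q` and `k` are arbitrary illustrative values.

```python
import math

def sinusoidal_encoding(position, d_model, base=10000.0):
    """Fixed encoding from the original transformer: sin on even dims, cos on odd."""
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / base ** (i / d_model)
        encoding.extend([math.sin(angle), math.cos(angle)])
    return encoding

def rope(vector, position, base=10000.0):
    """Rotate each consecutive pair of dimensions by a position-dependent angle."""
    d = len(vector)
    rotated = []
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vector[i], vector[i + 1]
        rotated.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return rotated

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

print(sinusoidal_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]

# With RoPE, the query-key score depends only on the offset between positions.
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]
near = dot(rope(q, 5), rope(k, 2))      # positions 5 and 2, offset 3
far = dot(rope(q, 105), rope(k, 102))   # same offset, shifted by 100
print(math.isclose(near, far, abs_tol=1e-9))  # True
```

The sinusoidal vector is added to the token embedding once at the input, while the rotation is applied to queries and keys inside every attention layer.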

Where transformers sit in the broader taxonomy

This is where a lot of stakeholder conversations get muddled. “AI”, “machine learning”, “deep learning”, “transformers”, and “generative AI” are not synonyms, but they are routinely treated as such. We explore the full mapping in Symbolic vs Generative vs Traditional ML: A Working Taxonomy for Practitioners; the short read is:

Machine learning is the umbrella discipline of learning patterns from data.
Deep learning is the subset of ML using deep neural networks.
Transformers are one architecture family within deep learning.
Large language models (LLMs) are transformers trained on very large text corpora with a generative objective.
Generative AI is the broader category of systems whose output is itself content (text, image, audio, video, code). Most current generative systems are transformer-based, but not all transformer-based systems are generative.

A vision transformer fine-tuned for defect classification is a transformer, deep learning, and machine learning — but it is not generative AI. Confusing the two leads to scoping conversations where the team commits to “a GenAI solution” when what the problem actually needs is a discriminative classifier.

Beyond language: where the architecture has spread

Transformers spread beyond natural language quickly. The pattern is consistent: tokenise the input into a sequence, apply attention, optionally cross-attend across modalities.

Vision: Vision Transformer (ViT), DINOv2, SigLIP, and Segment Anything tokenise an image into patches and treat them as a sequence.
Audio: Audio Spectrogram Transformer (AST) applies attention over spectrogram patches, and Whisper over frames of a log-Mel spectrogram; Whisper is now the de facto open transcription baseline.
Protein structure: AlphaFold 2 and AlphaFold 3 use transformer-style attention over residue pairs and have reshaped structural biology.
Reinforcement learning: Decision Transformer and Trajectory Transformer reframe RL as sequence modelling.
Multi-modal: CLIP, LLaVA, and the Gemini-class and GPT-4o-class systems fuse text, image, audio, and sometimes video into a single attention stack.
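The "tokenise the input into a sequence" step is simple to illustrate for images. This sketch splits a single-channel image, represented as a list of rows, into flattened non-overlapping patches; `patchify` is a hypothetical helper, and a real vision transformer would follow it with a learned linear projection of each patch plus positional information.

```python
def patchify(image, patch_size):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order; each patch plays the role of one token.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([image[r][c]
                            for r in range(top, top + patch_size)
                            for c in range(left, left + patch_size)])
    return patches

# A 4 x 4 toy image whose pixel values are 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, patch_size=2)
print(len(tokens), tokens[0])  # prints "4 [0, 1, 4, 5]"
```

From this point on, the sequence of patch vectors is processed by the same attention layers used for text.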

In our experience across image-processing and generative engagements, the practical implication for engineering teams is that the same operational toolchain — PyTorch, CUDA, FlashAttention, TensorRT or vLLM for serving, ONNX for portability — now spans far more of the workload than it did in the CNN era. Investments in transformer inference infrastructure amortise across modalities in a way that earlier architecture-specific stacks did not.

Are transformers still the dominant architecture?

Yes, with growing diversification. This is an observed pattern in the 2024–2026 architecture literature rather than a benchmark claim, but it is stable enough to plan against. The frontier in 2026 is hybrid:

State-space models (Mamba, Mamba-2) compete with attention on long sequences at lower compute cost and now appear inside transformer hybrids (Jamba, Zamba).
Mixture-of-experts (MoE) routing (DeepSeek-V3, Mixtral, Llama 4 Scout/Maverick) keeps activated parameter counts down while raising total capacity.
Linear or sparse attention variants (RetNet, Lightning Attention, sliding-window attention in Mistral and Gemma) trade some quality for sub-quadratic cost on long contexts.
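Hardware-aware kernels (FlashAttention v3, PagedAttention in vLLM) change the practical economics of serving without changing the architecture on paper; grouped-query attention pursues the same goal with a small architectural change, sharing key and value heads across groups of query heads to shrink the key-value cache.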
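As an illustration of the mixture-of-experts idea from the list above, the sketch below routes one token to the top two of four experts and blends their outputs. It shows one common gating choice (softmax over the selected router scores) and is not a description of any specific model; the experts are hypothetical scaling functions and the router scores are fixed by hand, whereas a real router computes them from the token with a learned layer.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalise their scores."""
    chosen = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)[:k]
    peak = max(router_logits[e] for e in chosen)
    exps = {e: math.exp(router_logits[e] - peak) for e in chosen}
    total = sum(exps.values())
    return {e: w / total for e, w in exps.items()}

def moe_layer(x, experts, router_logits, k=2):
    """Run only the selected experts and blend their outputs by gate weight."""
    gates = top_k_route(router_logits, k)
    out = [0.0] * len(x)
    for expert_index, gate in gates.items():
        y = experts[expert_index](x)  # unselected experts are never evaluated
        out = [o + gate * v for o, v in zip(out, y)]
    return out, sorted(gates)

# Four hypothetical experts that simply scale their input by 1, 2, 3, or 4.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 2.0]  # hand-picked for illustration

out, used = moe_layer([1.0, -2.0], experts, router_logits, k=2)
print(used, out)  # prints "[1, 3] [3.0, -6.0]"
```

Only two of the four experts run for this token, so the compute per token tracks the activated parameters rather than the total, which is the trade the list item describes.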

The pattern: attention is no longer the whole system, but it remains the central component. Most production deployments in 2026 are still recognisably transformer-shaped.

What this means for an engineering team

A few practical implications worth keeping in scope:

If the problem is sequential or has a tokenisable structure, the transformer family is the default starting point — even outside language.
Architecture choice (encoder-only vs decoder-only vs encoder-decoder) follows the task structure: classification and retrieval favour encoder-only; generation favours decoder-only; translation-style mapping favours encoder-decoder.
The dominant cost driver in production is no longer training but inference. Attention-aware serving infrastructure (FlashAttention, PagedAttention, speculative decoding, quantisation) is where most operational wins now come from.
The “transformer vs RNN” debate is largely settled in favour of attention, but the “pure transformer vs hybrid” question is genuinely open. Plan for architectures that mix attention with state-space or MoE components over the next two cycles.
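A back-of-envelope calculation helps explain why inference dominates cost discussions. During generation the model caches the keys and values of every previous token, and that cache grows with sequence length. The sketch below computes its size for a hypothetical 32-layer model; all configuration numbers are illustrative assumptions, not measurements of any named system.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Memory for cached keys and values: two tensors (K and V) per layer."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical model: 32 layers, head dimension 128, 8192-token context, 16-bit cache.
full_mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=8192)
grouped = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)

gib = 2 ** 30
print(full_mha / gib, grouped / gib)  # prints "4.0 1.0"
```

Under these assumptions a single 8192-token sequence needs 4 GiB of cache with one key-value head per query head, and 1 GiB when four query heads share each key-value head. Savings of this kind, multiplied across concurrent requests, are what attention-aware serving techniques target.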

For broader programme context on how we structure these systems across our engagements, explore our Generative & Agentic AI R&D practice.

Frequently asked questions

What are transformers in deep learning?

Transformers are a neural network family built around self-attention: every output position can attend to every input position, weighted by learned compatibility scores. Introduced in the 2017 “Attention Is All You Need” paper for machine translation, they replaced recurrent networks as the default sequence model and now dominate language, vision, audio, and multi-modal tasks. The architecture is conceptually small (attention + feedforward + residual + normalisation) but operationally rich.

What are the main parts of a transformer model?

Four repeating ingredients per layer: (1) multi-head self-attention that computes query, key, value projections and the attention-weighted output; (2) a position-wise feedforward network (usually 2 linear layers with a non-linearity, or a SwiGLU variant); (3) residual connections around both blocks; (4) layer normalisation (or RMSNorm in modern variants). Positional information is added separately — sinusoidal in the original, learned or rotary (RoPE) in modern systems.

Are transformers used for anything besides language?

Yes, extensively: Vision Transformers (ViT, DINOv2, SigLIP) for image understanding; AST and Whisper for audio; video transformers for action recognition; AlphaFold for protein structure; Decision Transformer and Trajectory Transformer for reinforcement learning; multi-modal systems (CLIP, LLaVA, Gemini-class, GPT-4o-class) that fuse text, image, audio, and video. The pattern across these is the same: tokenise the input, apply attention, optionally cross-attend across modalities.

Are transformers still the dominant architecture in 2026?

Yes, with growing diversification. Pure transformers remain dominant in the largest deployed models, but the 2026 frontier mixes in state-space layers (Mamba, Mamba-2), mixture-of-experts routing (DeepSeek, Mixtral, Llama 4), linear or sparse attention (RetNet, Lightning Attention), and hardware-specific kernels (FlashAttention v3, PagedAttention). The headline is hybrid architectures with attention as one component, not the whole system.