The GenAI landscape is wider than LLMs
When organisations say “generative AI,” they usually mean large language models — GPT-4, Claude, Gemini, Llama. This is understandable. LLMs are the most visible, most commercially deployed, and most discussed category of generative model. But the generative AI landscape includes entire families of models that generate images, audio, video, 3D assets, molecular structures, and code — each using architectures that differ fundamentally from the autoregressive token prediction that defines LLMs.
Understanding what exists beyond LLMs matters for two reasons. First, the use case you need to address may be better served by a non-LLM generative model — and defaulting to an LLM for every generative task is like using a hammer for every fastener. Second, the architectural differences between model families have practical implications for deployment: inference cost, latency characteristics, fine-tuning requirements, and output control differ across architectures in ways that affect build decisions.
The scale of these architectural differences is concrete. Stable Diffusion runs its denoising process in a 64×64 latent space rather than the full 512×512 pixel space, reducing the number of values processed per step by roughly 48× (Rombach et al., 2022). StyleGAN3 (Karras et al., 2021) achieves FID scores below 5 on FFHQ, the standard benchmark for unconditional face generation.
How do diffusion models generate images?
Diffusion models generate images by iteratively denoising a random noise sample. The model learns to reverse a noising process: given a noisy image, predict what the image looked like one step less noisy. Applied iteratively from pure noise, this produces a clean image that matches the model’s learned distribution. Stable Diffusion (Stability AI), DALL-E 3 (OpenAI), Imagen (Google), and Midjourney all use diffusion-based architectures.
How they work. The training process adds Gaussian noise to images at increasing levels, and the model learns to predict and remove the noise at each level. Generation starts from pure noise and applies the denoising prediction repeatedly (typically 20–50 steps) to produce a clean image. Text conditioning (using a text encoder like CLIP or T5 to convert a text prompt into an embedding that guides the denoising) enables text-to-image generation.
Practical characteristics. Inference is iterative — each image requires multiple forward passes through the model, making generation slower than single-pass architectures. A 512×512 image at 50 denoising steps takes 2–10 seconds on a consumer GPU (depending on model size and optimisation). Quality scales with compute: more denoising steps generally produce higher-quality images. Fine-tuning for specific styles or subjects (using techniques like DreamBooth or LoRA) requires 5–50 images of the target subject and produces models that generate that subject consistently.
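The iterative denoising loop above can be sketched in a few lines. This is a toy DDPM-style sampler, not a production model: the noise schedule is real, but `predict_noise` is a hypothetical stand-in for a trained U-Net (it analytically recovers the noise relative to a fixed target image, so the loop has a visible effect).

```python
import numpy as np

T = 50                                  # number of denoising steps
betas = np.linspace(1e-4, 0.02, T)      # noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

target = np.full((8, 8), 0.5)           # hypothetical "learned" image

def predict_noise(x, t):
    # A real model is a neural network conditioned on t (and on a text
    # embedding for text-to-image). This closed form recovers the noise
    # that would take `target` to x under the forward process.
    return (x - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])

def sample(rng):
    x = rng.standard_normal((8, 8))     # start from pure noise
    for t in reversed(range(T)):        # iterate: each step is one forward pass
        eps = predict_noise(x, t)
        # DDPM posterior mean update
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                       # inject noise except at the final step
            x = x + np.sqrt(betas[t]) * rng.standard_normal((8, 8))
    return x

img = sample(np.random.default_rng(0))
```

The loop structure is the point: every image costs T model evaluations, which is why diffusion inference is slower than single-pass architectures.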
Where they are used. Marketing and advertising (product visualisation, campaign imagery), entertainment (concept art, game asset generation), e-commerce (product photography replacement, virtual try-on), and design (architecture visualisation, interior design exploration). We have worked with clients who use diffusion models for retail product visualisation and manufacturing documentation illustration.
GANs: adversarial generation with sharp outputs
Generative Adversarial Networks (GANs) train two networks simultaneously: a generator that produces synthetic images, and a discriminator that tries to distinguish synthetic images from real ones. The adversarial training process pushes both networks to improve — the generator produces increasingly realistic images, and the discriminator becomes increasingly discriminating. StyleGAN (NVIDIA), BigGAN, and GigaGAN are prominent examples.
How they differ from diffusion. GANs generate images in a single forward pass — no iterative denoising. This makes generation fast (milliseconds per image). The trade-off: GANs are harder to train (mode collapse, training instability, sensitivity to hyperparameters), less diverse in output (the generator may learn to produce high-quality images from a narrow subset of the distribution), and harder to condition on specific inputs (text-to-image control is less natural than in diffusion models).
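The adversarial dynamic can be sketched with a toy 1-D example: a linear generator shifts noise onto the real data's mean while a logistic-regression discriminator tries to tell real from fake. Everything here (the data, the linear models, the learning rate) is illustrative, not a production GAN, but note the contrast with diffusion: generation is one matrix multiply.

```python
import numpy as np

rng = np.random.default_rng(0)

real_data = rng.normal(loc=4.0, scale=0.5, size=(256, 1))  # target distribution

G_w, G_b = np.ones((1, 1)), np.zeros(1)        # generator parameters
D_w, D_b = np.full((1, 1), 0.1), np.zeros(1)   # discriminator parameters
lr = 0.05

def generate(z):
    return z @ G_w + G_b                       # one forward pass per sample

def discriminate(x):
    return 1.0 / (1.0 + np.exp(-(x @ D_w + D_b)))  # estimated P(x is real)

for _ in range(500):
    z = rng.standard_normal((256, 1))
    fake = generate(z)
    # Discriminator step: logistic-loss gradients, real labelled 1, fake 0.
    err_real = discriminate(real_data) - 1.0
    err_fake = discriminate(fake)
    D_w -= lr * (real_data.T @ err_real + fake.T @ err_fake) / 256
    D_b -= lr * (err_real.mean() + err_fake.mean())
    # Generator step: non-saturating loss, push D towards calling fakes real.
    d_fake = discriminate(generate(z))
    grad_fake = -(1.0 - d_fake) * D_w[0, 0]    # dL/d(fake) for -log D(fake)
    G_w -= lr * (z.T @ grad_fake) / 256
    G_b -= lr * grad_fake.mean()

fake_mean = generate(rng.standard_normal((1000, 1))).mean()
```

Even in this toy setting the fragility is visible: progress depends on the discriminator staying informative, which is the root of the instability and mode-collapse problems noted above.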
Where they remain relevant. Despite diffusion models’ dominance for text-to-image generation, GANs remain the architecture of choice for tasks that require single-pass generation speed: real-time image translation (pix2pix, CycleGAN), super-resolution (ESRGAN), face generation and manipulation (StyleGAN), and data augmentation for training other models. The GAN vs diffusion comparison covers the architectural trade-offs in detail.
VAEs: structured latent spaces for controlled generation
Variational Autoencoders (VAEs) learn a compressed latent representation of the data and generate new samples by decoding points from the latent space. Unlike GANs, VAEs optimise a well-defined probabilistic objective (the evidence lower bound — ELBO), making training stable and reproducible.
How they work. The encoder compresses input data into a distribution in latent space. The decoder generates data from points sampled from this distribution. The latent space is continuous and structured — nearby points in latent space produce similar outputs, enabling smooth interpolation between generated samples and controlled manipulation of output attributes.
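The encode → sample → decode structure can be sketched as follows. The linear maps are untrained stand-ins for the encoder and decoder networks (all weights here are illustrative); what the sketch shows is the reparameterised sampling, the closed-form KL term of the ELBO, and the latent interpolation the paragraph describes.

```python
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_LAT = 16, 4
W_mu = rng.standard_normal((D_IN, D_LAT)) * 0.1    # encoder mean head
W_lv = rng.standard_normal((D_IN, D_LAT)) * 0.1    # encoder log-variance head
W_dec = rng.standard_normal((D_LAT, D_IN)) * 0.1   # decoder

def encode(x):
    return x @ W_mu, x @ W_lv          # parameters of q(z|x) = N(mu, e^logvar)

def reparameterise(mu, logvar):
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps   # sampling stays differentiable

def decode(z):
    return z @ W_dec

x = rng.standard_normal((1, D_IN))
mu, logvar = encode(x)
z = reparameterise(mu, logvar)
recon = decode(z)

# The KL term of the ELBO has a closed form against the N(0, I) prior.
kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

# Nearby latent points decode to nearby outputs, so interpolation is smooth.
z2 = rng.standard_normal((1, D_LAT))
midpoint = decode(0.5 * z + 0.5 * z2)
```

With a trained decoder, walking a line between two latent points produces a gradual morph between the two outputs, which is what makes VAE latent spaces useful for controlled manipulation.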
Practical characteristics. VAE outputs tend to be smoother and less sharp than GAN or diffusion outputs, because the VAE’s objective includes a reconstruction term that encourages averaging over possibilities. This makes standalone VAEs less suitable for high-fidelity image generation but well-suited for tasks where the latent structure matters more than output sharpness: anomaly detection (inputs that reconstruct poorly or receive low likelihood under the model are flagged as outliers), data compression, drug discovery (generating molecular structures by sampling the latent space), and representation learning.
In modern architectures. Stable Diffusion uses a VAE as its image encoder/decoder: images are compressed to a latent space by the VAE encoder, the diffusion process operates in this latent space (which is much smaller than pixel space), and the VAE decoder converts the denoised latent back to pixel space. The combination — VAE for compression, diffusion for generation — is more efficient than operating directly in pixel space.
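The composition is easiest to see as a shape-level pipeline. The functions below are stand-ins for the trained components (their names and bodies are illustrative); what the sketch shows is where the diffusion loop runs and how much smaller its workspace is.

```python
import numpy as np

F = 8                                   # Stable Diffusion's downsampling factor

def vae_encode(pixels):                 # (512, 512, 3) -> (64, 64, 4) latent
    h, w, _ = pixels.shape
    return np.zeros((h // F, w // F, 4))

def denoise_in_latent(latent, steps=50):
    for _ in range(steps):              # each step: one U-Net forward pass
        latent = latent * 0.99          # stand-in for the denoising update
    return latent

def vae_decode(latent):                 # (64, 64, 4) -> (512, 512, 3) pixels
    h, w, _ = latent.shape
    return np.zeros((h * F, w * F, 3))

image = vae_decode(denoise_in_latent(vae_encode(np.zeros((512, 512, 3)))))

# The latent workspace holds 48x fewer values than pixel space:
ratio = (512 * 512 * 3) // (64 * 64 * 4)
```

Every one of the 50 denoising passes runs over 64×64×4 values instead of 512×512×3, which is where the efficiency gain comes from.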
Neural audio and speech models
Generative models for audio span text-to-speech (TTS), music generation, and sound effect synthesis. The architectures differ from image generation:
Autoregressive models (WaveNet, AudioLM) generate audio sample-by-sample or token-by-token, similar to how LLMs generate text. High quality, but slow inference due to the sequential generation process.
Diffusion models adapted for audio (AudioLDM, Stable Audio) apply the diffusion framework to spectrograms or latent audio representations. Text-to-audio generation follows the same conditioning approach as text-to-image.
Neural codec models (EnCodec by Meta, SoundStream by Google) compress audio into discrete tokens that can be modelled by autoregressive or masked models. This approach powers recent voice cloning and music generation systems — the audio is tokenised, a language model generates new token sequences, and the codec decoder converts tokens back to waveforms.
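The tokenise → generate → decode pipeline can be sketched end-to-end. A toy uniform scalar quantiser stands in for a learned codec like EnCodec or SoundStream, and random sampling stands in for the trained language model (both stand-ins are illustrative); the structure of the pipeline is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

CODEBOOK_SIZE = 256
codebook = np.linspace(-1.0, 1.0, CODEBOOK_SIZE)   # toy scalar codebook

def tokenise(waveform):
    # Each sample maps to the index of its nearest codebook entry.
    return np.abs(waveform[:, None] - codebook[None, :]).argmin(axis=1)

def detokenise(tokens):
    return codebook[tokens]                         # indices -> waveform

def language_model(prompt_tokens, n_new):
    # Stand-in for a trained autoregressive model over codec tokens.
    return rng.integers(0, CODEBOOK_SIZE, size=n_new)

audio = np.sin(np.linspace(0.0, 8.0 * np.pi, 1000))      # toy waveform
tokens = tokenise(audio)                                 # audio -> tokens
continuation = language_model(tokens, n_new=500)         # LM extends tokens
waveform = detokenise(np.concatenate([tokens, continuation]))  # tokens -> audio
```

Real codecs use learned residual vector quantisation rather than a fixed scalar codebook, but the division of labour is the same: the codec handles fidelity, and the language model handles long-range structure.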
Video generation models
Video generation extends image generation to the temporal dimension, with additional complexity: temporal consistency (objects must maintain their appearance and physics across frames), motion coherence (movement must be physically plausible), and compute cost (generating 30 frames per second of video requires 30× the computation of a single image).
Current approaches include: diffusion models extended with temporal attention layers (Sora by OpenAI, Runway Gen-2, Stable Video Diffusion), autoregressive video generation (producing frames sequentially with each frame conditioned on the previous), and frame interpolation approaches that generate keyframes and fill in intermediate frames. The technology is advancing rapidly but remains compute-intensive and quality-variable — production-quality video generation at scale is not yet practical for most commercial applications.
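The temporal attention mechanism mentioned above can be sketched in isolation: each spatial position attends across the time axis, letting every frame see every other frame. The projections here are untrained identities (illustrative), but the einsum pattern is the real mechanism.

```python
import numpy as np

def temporal_attention(frames):          # (T, N, d): T frames, N positions
    q = k = v = frames                   # untrained identity projections
    # For each spatial position n, score frame t against every frame s.
    scores = np.einsum('tnd,snd->nts', q, k) / np.sqrt(frames.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over time
    return np.einsum('nts,snd->tnd', weights, v)

T, N, d = 8, 16, 4                       # 8 frames, 16 positions, dim 4
out = temporal_attention(np.random.default_rng(0).standard_normal((T, N, d)))
```

Mixing information across frames this way is what lets the model keep an object's appearance consistent over time, and it is also why the compute cost grows with clip length: the attention is quadratic in the number of frames.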
3D generation models
3D asset generation — producing 3D meshes, textures, and materials from text or image prompts — is the newest frontier of generative AI. Models like Point-E, Shap-E (OpenAI), and DreamFusion generate 3D representations using various approaches: point cloud generation, neural radiance fields (NeRFs), and score distillation sampling (optimising a 3D representation to match a diffusion model’s learned distribution from multiple viewpoints).
The practical maturity is limited: generated 3D assets typically require significant manual cleanup before they are usable in production pipelines (games, film, industrial design). The technology’s trajectory suggests production-quality 3D generation within 2–3 years.
Choosing the right generative architecture
The architecture choice depends on the output modality and the deployment constraints:
| Output | Architecture | Key trade-off |
|---|---|---|
| Text | LLM (autoregressive) | Quality vs inference cost |
| Images | Diffusion model | Quality vs generation speed |
| Real-time image transforms | GAN | Speed vs training stability |
| Structured generation | VAE | Control vs output sharpness |
| Audio/speech | Neural codec + LM | Quality vs latency |
| Video | Temporal diffusion | Quality vs compute cost |
Defaulting to an LLM for every GenAI use case is a common mistake. If your use case involves image, audio, video, or 3D generation, the appropriate architecture is likely not an LLM — and the deployment characteristics (cost, latency, infrastructure) will differ accordingly.
If your team is evaluating GenAI use cases across multiple modalities, a GenAI Feasibility Assessment maps each use case to the appropriate model architecture and provides deployment cost and capability estimates. Our generative AI practice covers the full spectrum of generative model architectures.