## Diffusion is not just for images

Diffusion models became prominent through image generation, but the underlying mathematical framework (learning to reverse a noise process) applies to any continuous data distribution. The past two years have produced working diffusion models for audio, protein structures, drug-like molecules, and even tabular data. Each application reveals different strengths and limitations.

### Audio generation

**What works:** Text-to-speech synthesis (Voicebox, AudioBox, E2 TTS), music generation (Stable Audio, AudioLDM), sound effect synthesis.

**How it differs from image diffusion:** Audio is one-dimensional in time but has distinct frequency-domain structure. Many audio diffusion models operate in mel-spectrogram space (a 2D time-frequency representation) rather than on raw waveforms, using a vocoder to reconstruct audio from the spectrogram.

**Key challenge:** Temporal coherence. An image can be generated as a single coherent whole; audio requires maintaining consistency over seconds to minutes. Autoregressive conditioning or multi-scale approaches address this.

**Production status:** Text-to-speech diffusion models are in production deployment. Long-form music generation remains high-latency.

### Protein structure prediction and design

**What works:** RFdiffusion, Chroma, and FrameDiff for de novo protein backbone generation; generating protein sequences conditioned on structural constraints.

**Why diffusion fits proteins:** Protein backbone geometry is continuous (bond angles, torsion angles) and follows a learned distribution. Diffusion can generate diverse structures that satisfy conditioning constraints (binding sites, symmetry) while covering the full distribution of valid proteins.

**Production status:** Used in pharmaceutical research pipelines for protein design. Not real-time; generation takes minutes per structure.
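To make "backbone geometry is continuous" concrete, here is a minimal numpy sketch (not taken from RFdiffusion or any library mentioned above) that computes a torsion angle from four atom positions. Angles like this are the kind of continuous coordinate that protein diffusion models noise and denoise.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Torsion angle (radians) defined by four consecutive atoms.

    Backbone torsions such as phi and psi are continuous quantities,
    which is what makes Gaussian noising applicable to proteins.
    """
    b0 = p0 - p1
    b1 = (p2 - p1) / np.linalg.norm(p2 - p1)
    b2 = p3 - p2
    # Project b0 and b2 onto the plane perpendicular to the central bond b1
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    return np.arctan2(np.dot(np.cross(b1, v), w), np.dot(v, w))

# Four coplanar atoms with the outer two on the same side give torsion 0;
# twisting the last atom 90 degrees out of plane gives pi/2
pts = [np.array(p, dtype=float) for p in
       [(0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0)]]
planar = dihedral(*pts)
pts[3] = np.array([1.0, 0.0, 1.0])
twisted = dihedral(*pts)
```

Because the angle varies smoothly as atoms move, small Gaussian perturbations of the coordinates produce small changes in the torsion, which is exactly the property the gradual noising process relies on.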
### Molecule and drug design

**What works:** GEOM, DiffSBDD for 3D molecular generation conditioned on protein binding sites; MolDiff for drug-like molecule generation.

**How it differs:** Molecules are graphs with both discrete (atom type) and continuous (3D coordinate) components. Hybrid approaches use diffusion for coordinates and discrete processes for atom types.

**Key advantage over prior approaches:** Coverage of chemical space is better than sequential SMILES generation, which suffers from mode-coverage problems.

### Tabular data

**What works:** TabDDPM for tabular data augmentation and synthetic data generation.

**Why it’s harder:** Tabular data mixes continuous and categorical features with complex dependency structures. Diffusion on continuous features requires separate handling of categorical columns.

**Use cases:** Generating synthetic training data for imbalanced classes, privacy-preserving data release, data augmentation.

### Comparison across domains

| Domain | Diffusion advantage | Key limitation | Production maturity |
|---|---|---|---|
| Images | Quality, diversity | Inference speed | High |
| Audio (TTS) | Naturalness, speaker control | Latency | High |
| Audio (music) | Coherent long-form | Very slow generation | Medium |
| Protein design | Structure diversity | Minutes per sample | Research/pharma |
| Molecules | 3D validity | Discrete-continuous mix | Research |
| Tabular | Mixed-type handling | Categorical challenges | Low |

For the foundational architecture behind diffusion, GAN vs diffusion model architecture differences covers why the diffusion approach displaced GANs for most generative tasks.

### What makes diffusion models work for non-image domains?

The core mechanism of diffusion models (learning to reverse a gradual noise process) is domain-agnostic. The forward process adds Gaussian noise to any continuous data representation until the signal is destroyed. The reverse process, parameterised by a neural network, learns to denoise step by step.
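The forward process just described has a well-known closed form, x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) eps, and this minimal numpy sketch shows it acting on an arbitrary continuous vector. The linear beta schedule is the standard DDPM choice; the learned denoising network is omitted, and the specific values here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Standard DDPM linear beta schedule; audio and protein variants tune
# these values, but the mechanism is identical.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal-retention factor

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.standard_normal(10_000)  # any continuous data vector
x_early = q_sample(x0, 10)        # mostly signal
x_late = q_sample(x0, T - 1)      # essentially pure noise

corr_early = np.corrcoef(x0, x_early)[0, 1]  # close to 1
corr_late = np.corrcoef(x0, x_late)[0, 1]    # close to 0
```

The correlations confirm the claim in the text: early timesteps retain nearly all of the signal, while by the final timestep the signal is destroyed. Nothing in the code depends on `x0` being an image, which is why the same machinery transfers to waveforms, coordinates, and table rows.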
This framework applies to any data type that can be represented as continuous vectors: audio waveforms, protein structure coordinates, molecular conformations, and tabular data.

For audio generation, diffusion models operate on mel spectrograms or raw waveform samples. The noise schedule (how quickly noise is added during the forward process) requires adaptation: audio signals have different frequency distributions from images, and the perceptually important information concentrates in specific frequency bands. Models like AudioLDM and Stable Audio demonstrate that adapted diffusion produces audio quality competitive with autoregressive models at lower inference cost.

Protein structure prediction uses diffusion over 3D coordinates. RFdiffusion applies the denoising process to protein backbone coordinates, generating novel protein structures that satisfy specified constraints (binding site geometry, secondary structure requirements). The diffusion framework is particularly well suited here because protein structure space is continuous and smooth: similar structures have similar biochemical properties, which aligns with the gradual refinement of reverse diffusion.

For tabular data, diffusion models generate synthetic records that preserve the statistical properties of real datasets. This addresses a genuine need in regulated industries (healthcare, finance) where real data cannot be shared for model development. The challenge is mixed data types: continuous features (age, income) and categorical features (gender, diagnosis code) require different noise processes. Current approaches either discretise continuous features or use separate diffusion processes for each type. We have applied diffusion-based tabular synthesis for a pharmaceutical client who needed to share clinical trial data patterns with an external ML vendor without sharing actual patient records.
The synthetic data preserved feature correlations and marginal distributions closely enough for the vendor to develop and validate their ML pipeline, while substantially reducing the risk of exposing individual records. (Formal privacy guarantees require additional mechanisms, such as differentially private training, on top of the generative model itself.)
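The fidelity checks mentioned above (marginal distributions and feature correlations) can be sketched as follows. A Gaussian fitted to the real data stands in for a trained tabular diffusion model such as TabDDPM, and the column meanings and numbers are illustrative, not from the client engagement.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "real" table: two correlated continuous columns (e.g. age and a
# lab value). A Gaussian fitted to the real rows stands in for a
# trained tabular diffusion sampler.
true_cov = np.array([[1.0, 0.6], [0.6, 1.0]])
real = rng.multivariate_normal([50.0, 10.0], true_cov, size=5000)

fitted_mean = real.mean(axis=0)
fitted_cov = np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(fitted_mean, fitted_cov, size=5000)

# Fidelity checks: marginal means and pairwise correlations should match
mean_gap = np.abs(real.mean(axis=0) - synthetic.mean(axis=0)).max()
corr_gap = np.abs(np.corrcoef(real, rowvar=False)
                  - np.corrcoef(synthetic, rowvar=False)).max()
```

Checks like these are a necessary condition, not a sufficient one: matching marginals and pairwise correlations does not guarantee that higher-order dependencies survive, which is why downstream model validation on the synthetic data is still required.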