Generating New Faces

From VAE to deployed face-generation web app: model choice, safety, cost, and the human review path that decides whether image-gen survives production.

Written by TechnoLynx Published on 06 Oct 2023

Introduction

Generating new faces with a Variational Autoencoder (VAE) and shipping the result as a web app, the Face Mixing demo, is a useful applied example because the demo-to-production gap is concrete and bounded. The VAE training is tractable, the latent-space interpolation produces the visually striking outputs that justify the project, and the web stack is small enough to fit in one head. What the demo does not show is the production stack that every team shipping image-gen ends up owning, whether they planned to or not: model choice, prompt or latent management, safety filters, cost accounting, and the human review path that decides whether the feature is rolled back after the first incident. This article walks through the Face Mixing applied example with the full stack visible. See generative AI for the broader feasibility framing.

The naive read is that the model is the project. The expert read is that the model is one of five layers in a production image-gen system, and the four other layers decide whether the feature survives past month one.

What this means in practice

  • Model choice (VAE vs GAN vs diffusion class) drives quality, latency, and cost — pick against the use-case envelope, not the technical interest.
  • Latent or prompt management is real infrastructure — versioned, validated, governed like any other API surface.
  • Safety filters are mandatory for any image-gen feature touching real users; “we’ll add them later” is the path to rollback.
  • Cost accounting at request volume is an order of magnitude harder than at demo volume.
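
To make the "versioned, validated" point concrete, here is a minimal sketch of a latent request that is checked before it reaches a decoder. Everything in it is an assumption for illustration: the `LatentRequest` type, the version tags, the latent size, and the value bound are hypothetical and are not taken from the Face Mixing demo.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

# All names and limits below are hypothetical, chosen only for illustration.
LATENT_DIM = 4                                        # a real VAE latent is much larger
SUPPORTED_VERSIONS = {"face-vae-1.0", "face-vae-1.1"}  # model tags the service accepts
MAX_ABS_VALUE = 4.0                                    # reject latents far outside the prior


@dataclass(frozen=True)
class LatentRequest:
    model_version: str
    latent: Tuple[float, ...]


def validate(req: LatentRequest) -> List[str]:
    """Return a list of problems; an empty list means the request is acceptable."""
    problems: List[str] = []
    if req.model_version not in SUPPORTED_VERSIONS:
        problems.append(f"unknown model version: {req.model_version}")
    if len(req.latent) != LATENT_DIM:
        problems.append(f"expected {LATENT_DIM} latent values, got {len(req.latent)}")
    if any(not math.isfinite(v) for v in req.latent):
        problems.append("latent contains NaN or infinity")
    elif any(abs(v) > MAX_ABS_VALUE for v in req.latent):
        problems.append(f"latent value outside [-{MAX_ABS_VALUE}, {MAX_ABS_VALUE}]")
    return problems
```

Pinning the model version in the request matters because a latent vector only has meaning relative to the decoder that was trained with it; the same numbers sent to a retrained model may decode to a different face.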

What are the latest advancements in AI image generation in 2026, and which are production-ready?

Three model families are production-ready in 2026. Diffusion models (Stable Diffusion XL/Turbo, SDXL fine-tunes, FLUX-class, Midjourney-class as managed APIs) are the dominant general-purpose stack: high quality, controllable through ControlNet-style conditioning, with mature open-weights variants for self-hosted deployment. VAEs (the family the Face Mixing demo uses) remain the right call for low-cost, low-latency, structurally constrained generation where the latent space is the controllable surface. GANs survive as the right call for narrow domain-specific generation where diffusion’s cost is excessive.

What is not yet production-ready in 2026: text-to-video at long durations, multi-modal generation with consistent identity across long sequences, and fully autonomous “creative” agents. These work in demos and break in production, where the quality bar is consistency across thousands of generations rather than a curated handful.

How does explainable AI fit into generative diffusion models for regulated and high-stakes use?

Diffusion models are opaque by default — the iterative denoising trajectory does not produce a step-by-step explanation a regulator or a human reviewer can audit. Explainability layers that have proven useful: attribution to training data via influence functions or nearest-neighbour retrieval against the training set (answers “what training images shaped this output”), structural conditioning logging via ControlNet (answers “what constraints shaped the geometry”), and prompt-influence analysis (answers “which prompt tokens drove which output features”).
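
The nearest-neighbour retrieval idea can be sketched in a few lines, assuming every training image and every generated output has already been mapped to an embedding vector by some image encoder. The encoder is out of scope here, and the function names and the toy index are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_training_items(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k training item ids most similar to the query embedding."""
    scored = [(cosine_similarity(query, emb), item_id) for item_id, emb in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(item_id, score) for score, item_id in scored[:k]]


# Toy index with hypothetical ids and 2-dimensional embeddings.
training_index = {"img_a": [1.0, 0.0], "img_b": [0.0, 1.0], "img_c": [1.0, 1.0]}
neighbours = nearest_training_items([1.0, 0.1], training_index, k=2)
```

A brute-force scan like this is only workable for small training sets; at real dataset sizes the same interface would normally sit in front of an approximate nearest-neighbour index. The retrieved ids are what a reviewer would inspect when asking which training images an output most resembles.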

For regulated use (advertising in regulated markets, medical or scientific illustration, anything touching brand safety), the explainability layer plus the human review path is the gate that lets the diffusion stack ship. Image-gen without these layers is unregulatable; image-gen with them ships into contexts that would otherwise be off-limits.
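
One way to picture the gate is as a routing function that sits between generation and publication. The sketch below is an assumption-heavy illustration: the risk score is presumed to come from some upstream safety classifier, and the thresholds, the `Route` names, and the audit-record flag are all hypothetical.

```python
from enum import Enum


class Route(Enum):
    PUBLISH = "publish"
    HUMAN_REVIEW = "human_review"
    BLOCK = "block"


# Hypothetical thresholds; real values would be tuned against reviewed samples.
BLOCK_THRESHOLD = 0.9
REVIEW_THRESHOLD = 0.3


def route_output(risk_score: float, has_audit_record: bool) -> Route:
    """Decide what happens to one generated image.

    risk_score: value in [0, 1] from an assumed upstream safety classifier.
    has_audit_record: True if the explainability artefacts (prompt, conditioning
    inputs, retrieved neighbours) were logged for this generation.
    """
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must be in [0, 1]")
    if risk_score >= BLOCK_THRESHOLD:
        return Route.BLOCK
    # Anything borderline, or anything that cannot be audited later, goes to a person.
    if risk_score >= REVIEW_THRESHOLD or not has_audit_record:
        return Route.HUMAN_REVIEW
    return Route.PUBLISH
```

The design choice worth noting is that a missing audit record alone is enough to force human review: an output that cannot be explained afterwards is treated as unshippable even when its risk score is low.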

Where does AI art generation sit between consumer tools (Adobe, Playground) and engineering pipelines?

Three positions exist in production. Consumer tools (Adobe Firefly, Playground, Midjourney via Discord/web) are the fastest path to value for a creative team: no engineering required, latency and quality acceptable for one-off creative output, and cost structured per image or per seat. Managed API pipelines (Stable Diffusion API, FLUX API, DALL-E 3) have engineering integrate the API into a product surface; cost scales linearly with generation volume, and model choice is constrained by what the provider exposes.

Self-hosted engineering pipelines (Stable Diffusion on-prem or in the customer’s cloud, ComfyUI workflows wrapped in services) carry the highest engineering investment and the lowest marginal cost at scale, with full control over model choice, fine-tuning, and safety layers. The Face Mixing demo is a self-hosted pipeline. The right choice depends on volume (high volume → self-host), customisation (deep custom models → self-host), and engineering capacity (low → managed). Most production deployments end up mixing all three.
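
The volume argument reduces to simple arithmetic. Assuming a managed API charges a flat price per image, and a self-hosted pipeline has a fixed monthly cost plus a smaller marginal cost per image, the break-even volume is the fixed cost divided by the per-image saving. The function names and all prices in the sketch below are hypothetical placeholders, not quotes from any provider.

```python
import math
from typing import Optional


def managed_cost(images: int, price_per_image: float) -> float:
    """Monthly cost of a managed API billed per image."""
    return images * price_per_image


def self_hosted_cost(images: int, fixed_monthly: float, marginal_per_image: float) -> float:
    """Monthly cost of self-hosting: fixed infrastructure plus per-image compute."""
    return fixed_monthly + images * marginal_per_image


def break_even_images(
    price_per_image: float, fixed_monthly: float, marginal_per_image: float
) -> Optional[int]:
    """Smallest monthly volume at which self-hosting is no more expensive.

    Returns None if self-hosting never catches up, that is, when its marginal
    cost per image is not below the managed price.
    """
    saving = price_per_image - marginal_per_image
    if saving <= 0:
        return None
    return math.ceil(fixed_monthly / saving)


# Hypothetical inputs: 0.04 per managed image, 2000 fixed per month, 0.005 marginal.
volume = break_even_images(0.04, 2000.0, 0.005)
```

The fixed term should include engineering time as well as GPU rental, which is usually what pushes the real break-even point well above the figure hardware costs alone suggest.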

What is the use-case map for diffusion models beyond consumer art — prototyping, simulation, synthetic data?

Product prototyping: rapid visual mock-ups of product variations before committing to physical or CAD work. The diffusion model generates a wide design space; humans select and refine. Synthetic data generation for CV training: diffusion models generate labelled synthetic images for training object-detection and segmentation models when real data is scarce or sensitive (medical imaging is the prominent example). Simulation: scene generation for testing autonomy systems, robotics, and AR/VR applications where real-world capture is expensive or unsafe.

Marketing personalisation: generating audience-specific creative at scale (with the safety-filter and brand-control layers that production deployment requires). Each use case has its own quality bar, latency budget, and cost structure, and the model choice and pipeline architecture follow from those. The pattern is the same as for any AI capability: identify the use case, scope the requirements, then pick the model.

How do AI image generators compare on quality, latency, controllability, and licence terms for enterprise use?

Quality: in 2026 the top managed APIs (DALL-E 3, Midjourney v6, FLUX Pro) and the best self-hosted models (SDXL with high-quality fine-tunes, FLUX Schnell/Dev open weights) are within visual-quality range of each other for most general-purpose tasks. Domain-specific fine-tunes can outperform general models on their domain.

Latency: managed APIs typically take 5–20 s per image at production quality; self-hosted pipelines on adequate GPUs (a single A100/H100) reach 1–5 s per image when optimised; SDXL Turbo and Lightning-class models reach sub-second on appropriate hardware.

Controllability: open-weights models with the ControlNet ecosystem offer the most fine-grained control; managed APIs offer increasingly capable but bounded control surfaces.

Licence terms: this is the make-or-break for enterprise. Open-weights models with permissive licences (FLUX Schnell, SDXL) avoid the per-image royalty exposure that closed-licence and some open-weights-with-restrictions models carry. Read the licence before the pilot.

What does control (ControlNet, structural conditioning) buy in stable-diffusion-class pipelines for product work?

ControlNet adds structural conditioning to the diffusion process — pose, depth, edges, sketches, segmentation maps. The practical consequence: the diffusion model generates in the visual style the prompt requests while conforming to the structural input the ControlNet provides. For product work this lets the team generate variations that respect a known geometry (a product on a defined background, a character in a defined pose, a scene in a defined layout) rather than the diffusion model’s free interpretation.

The Face Mixing demo uses latent-space interpolation as its control mechanism; ControlNet generalises this to structural control over any diffusion model. For production product workflows — e-commerce imagery, marketing creative variations, simulation scene generation — ControlNet is the difference between “interesting demo” and “feature the creative team actually adopts.”
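
Latent-space interpolation itself is a small operation. The sketch below assumes two faces have already been encoded to latent vectors and shows plain linear interpolation between them; the encoder and decoder are omitted, the function names are hypothetical, and the source does not say which interpolation scheme the demo used.

```python
from typing import List, Sequence


def lerp(z_a: Sequence[float], z_b: Sequence[float], t: float) -> List[float]:
    """Linear blend of two latent vectors: t=0 gives z_a, t=1 gives z_b."""
    if len(z_a) != len(z_b):
        raise ValueError("latent vectors must have the same length")
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def interpolation_path(
    z_a: Sequence[float], z_b: Sequence[float], steps: int
) -> List[List[float]]:
    """Evenly spaced latents from z_a to z_b, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")
    return [lerp(z_a, z_b, i / (steps - 1)) for i in range(steps)]


# Toy 2-dimensional latents; a real VAE would use a much larger vector.
path = interpolation_path([0.0, 2.0], [1.0, 0.0], steps=5)
```

Each vector in `path` would be passed to the VAE decoder to render one intermediate face. Whether those intermediates look like meaningful blends depends on how well structured the learned latent space is, not on the interpolation code.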

Limitations that remained

The Face Mixing demo shipped useful generated faces but the limitations were specific. The training dataset constrained the demographic distribution of generated faces — a known dataset-bias problem that the team mitigated by augmentation but did not eliminate. The latent-space interpolation between two faces produces intermediate faces that are not always meaningful blends — the latent space is structured but not perfectly disentangled. Generation cost at the demo’s web-app traffic volume was tolerable; at production volume it would have needed the cost-accounting layer the demo did not implement. Safety filtering for generated face content (avoiding accidentally generating identifying likenesses, handling user-uploaded inputs responsibly) was bounded by the demo context; a production face-generation feature would need a substantially more developed safety layer.

How TechnoLynx Can Help

TechnoLynx builds production image-generation systems from the model-choice decision through prompt and latent management, safety filters, cost accounting, and the human review path that decides whether the feature survives in production. If you are scoping an image-gen feature and need the full stack visible before commitment, contact us for a feasibility audit.

Image credits: Freepik
