Research Article
Diffusion Model Accelerators: Efficient Sampling Beyond Brute-Force Denoising
Problem framing
Diffusion models generate each sample through many sequential denoising steps, so inference is costly in both latency and energy. Even with few-step samplers, deployment at scale remains challenging.
Hardware-software co-design idea
Create an accelerator specialized for denoising loops:
- fused UNet operator pipelines,
- timestep-aware scheduler,
- on-chip latent buffering,
- reusable noise-conditioning units.
Pair this with sampler algorithms designed for the hardware, e.g., adaptive schedules with a fixed upper bound on the number of steps, so that control flow stays predictable.
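As one illustration of what a bounded-step adaptive schedule could look like, the sketch below always runs exactly max_steps iterations, so the loop trip count is known in advance. Each iteration takes an adaptively proposed step, but never a step so small that the schedule could not finish within the budget. The function name bounded_adaptive_schedule, the propose_dt callback, and the normalized time range from 1.0 down to 0.0 are all hypothetical choices made for this example, not part of any specific sampler.

```python
from typing import Callable, List


def bounded_adaptive_schedule(
    propose_dt: Callable[[float], float],
    max_steps: int,
    t_start: float = 1.0,
    t_end: float = 0.0,
) -> List[float]:
    """Return exactly max_steps step sizes covering t_start down to t_end.

    propose_dt(t) is a hypothetical callback that suggests a step size at
    time t (for example, from a local error estimate).
    """
    t = t_start
    steps: List[float] = []
    for i in range(max_steps):  # fixed trip count: no data-dependent exit
        remaining = t - t_end
        steps_left = max_steps - i
        # Smallest step that still lets the schedule finish on time.
        floor_dt = remaining / steps_left
        # Honor the proposal, but never undershoot the floor or overshoot t_end.
        dt = min(max(propose_dt(t), floor_dt), remaining)
        steps.append(dt)
        t -= dt
    return steps
```

If the proposals are large, the schedule reaches t_end early and the remaining iterations become zero-length steps that hardware could mask as no-ops. If the proposals are too small, the floor forces a uniform fallback, so the step budget is never exceeded.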
Potential differentiators
- Step fusion: execute multiple denoising micro-steps per pass to reduce off-chip memory traffic.
- Latent locality: keep hot latent tiles on-chip across successive timesteps.
- Conditioning acceleration: optimize cross-attention and text-conditioning for diffusion guidance.
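The effect of step fusion and latent locality on memory traffic can be estimated with a rough back-of-envelope model. The sketch below assumes, purely for illustration, that each off-chip round trip reads and writes the latent once, and that fusing k micro-steps keeps the latent on-chip for those k steps. It deliberately ignores weights and intermediate activations, which would need to be modeled separately. The function name latent_traffic_bytes and the example latent size (4 x 64 x 64 values at 2 bytes each) are assumptions for this example.

```python
import math


def latent_traffic_bytes(latent_bytes: int, steps: int, fused: int = 1) -> int:
    """Estimate off-chip latent traffic for a denoising loop.

    Assumes one read and one write of the latent per off-chip round trip,
    and one round trip per group of `fused` consecutive steps.
    """
    round_trips = math.ceil(steps / fused)
    return 2 * latent_bytes * round_trips


# Illustrative latent: 4 channels x 64 x 64 values, 2 bytes per value.
example_latent = 4 * 64 * 64 * 2

unfused = latent_traffic_bytes(example_latent, steps=50)
fused_by_4 = latent_traffic_bytes(example_latent, steps=50, fused=4)
print(unfused, fused_by_4)
```

Under these assumptions the latent traffic scales with the number of round trips rather than the number of steps, which is the saving that step fusion and on-chip latent buffering aim for.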
What to measure
- latency per image at a target quality level,
- energy per image,
- quality metrics (FID, CLIP score) under an equal power budget.
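A minimal sketch of how the per-image figures might be derived from a benchmark run is shown below. It assumes the run reports a wall-clock time, an image count, and an average power draw; the RunMeasurement class and its field names are hypothetical. The latency here is amortized over the batch, which is not the same as the time to first image.

```python
from dataclasses import dataclass


@dataclass
class RunMeasurement:
    """One benchmark run (hypothetical record layout)."""

    images: int             # images generated in the run
    wall_seconds: float     # total wall-clock time
    avg_power_watts: float  # average power draw over the run

    @property
    def latency_per_image(self) -> float:
        """Amortized seconds per image."""
        return self.wall_seconds / self.images

    @property
    def energy_per_image(self) -> float:
        """Joules per image: average power times time, divided by images."""
        return self.avg_power_watts * self.wall_seconds / self.images

    def within_power_budget(self, budget_watts: float) -> bool:
        """True if the run is eligible for an equal-power comparison."""
        return self.avg_power_watts <= budget_watts


run = RunMeasurement(images=8, wall_seconds=4.0, avg_power_watts=150.0)
print(run.latency_per_image, run.energy_per_image)
```

Quality metrics such as FID and CLIP score would then be compared only between runs that pass the same power-budget check.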