Problem framing

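Diffusion models generate each sample through many sequential denoising steps, and every step is a full forward pass of the denoising network, so inference is expensive in both latency and energy. Fewer-step samplers reduce the step count but do not remove the per-step cost, so deployment at scale remains challenging.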

Hardware-software co-design idea

Create an accelerator specialized for denoising loops:

  • fused UNet operator pipelines,
  • timestep-aware scheduler,
  • on-chip latent buffering,
  • reusable noise-conditioning units.

Pair this with sampler algorithms optimized for hardware (e.g., bounded-step adaptive schedules with predictable control flow).
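
A minimal sketch of what a bounded-step adaptive schedule with predictable control flow could look like. It is not a proposed sampler. The function name, the error_at callback (a stand-in for whatever local error estimate a real sampler would use), and the "double the stride when the error is low" rule are all assumptions made for illustration. The point is that the loop always runs exactly max_steps iterations, with adaptivity expressed as data (a stride and an active flag) rather than as an early exit.

```python
from typing import Callable, List, Tuple

Slot = Tuple[float, float, bool]  # (t_from, t_to, active)


def bounded_adaptive_schedule(
    error_at: Callable[[float], float],  # hypothetical error estimate at time t
    max_steps: int = 8,
    tol: float = 0.5,                    # illustrative threshold, not a tuned value
) -> List[Slot]:
    """Plan exactly max_steps slots that take t from 1.0 down to 0.0."""
    if max_steps < 1:
        raise ValueError("max_steps must be at least 1")
    t = 1.0
    slots: List[Slot] = []
    for i in range(max_steps):           # fixed trip count, no data-dependent break
        remaining = max_steps - i
        # Smallest stride that still reaches t = 0 within the remaining slots.
        base = t / remaining
        # Adaptive part: take a double stride where the estimated error is low.
        stride = 2.0 * base if error_at(t) < tol else base
        t_next = max(0.0, t - stride)
        # Slots after t has reached 0 are kept as inactive padding, so the
        # hardware can predicate them off instead of branching out of the loop.
        slots.append((t, t_next, t > 0.0))
        t = t_next
    return slots


if __name__ == "__main__":
    for slot in bounded_adaptive_schedule(lambda t: t, max_steps=8):
        print(slot)
```

Because the stride is never smaller than t / remaining, the schedule is guaranteed to reach t = 0 within the step bound, and the worst-case latency is known before sampling starts.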

Potential differentiators

  1. Step fusion: execute multiple denoising micro-steps with reduced memory traffic.
  2. Latent locality: keep hot latent tiles on-chip across successive timesteps.
  3. Conditioning acceleration: optimize cross-attention and text conditioning for diffusion guidance.
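
A rough back-of-the-envelope model for the first two items, assuming the latent is read from and written back to off-chip memory once per group of fused steps. It counts latent traffic only. Weights and intermediate activations are ignored and may well dominate in practice. The function name and the 4x64x64 fp16 latent shape are illustrative assumptions, not measurements.

```python
import math


def latent_traffic_bytes(steps: int, latent_bytes: int, fused_steps: int = 1) -> int:
    """Off-chip bytes moved for the latent alone over a full sampling run.

    Assumes one read and one write of the whole latent per group of
    fused_steps consecutive denoising steps (a simplifying assumption).
    """
    if steps < 1 or fused_steps < 1 or latent_bytes < 0:
        raise ValueError("steps and fused_steps must be >= 1, latent_bytes >= 0")
    round_trips = math.ceil(steps / fused_steps)
    return round_trips * 2 * latent_bytes


if __name__ == "__main__":
    # Assumed latent: 4 channels x 64 x 64 elements, 2 bytes each (fp16).
    latent = 4 * 64 * 64 * 2
    for k in (1, 5, 50):
        print(k, latent_traffic_bytes(steps=50, latent_bytes=latent, fused_steps=k))
```

Under these assumptions, latent traffic scales with the number of round trips, so fusing k steps cuts it by roughly a factor of k, and keeping the latent resident for the whole run reduces it to a single read and a single write.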

What to measure

  • latency/image at target quality,
  • energy/image,
  • quality metrics (FID, CLIP score) under an equal power budget.
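
A small sketch of how the first two metrics could be derived from raw measurements, and how a run could be checked against a shared power budget before quality numbers are compared. The class and field names are hypothetical. Energy per image is taken as average power times wall-clock time divided by image count, and latency per image assumes images are generated one after another (with batching, the same ratio is inverse throughput rather than latency).

```python
from dataclasses import dataclass


@dataclass
class RunMeasurement:
    """Raw numbers from one benchmark run (hypothetical record layout)."""
    images: int
    wall_seconds: float
    avg_power_watts: float

    @property
    def latency_per_image_s(self) -> float:
        # Valid as latency only when images are generated sequentially.
        return self.wall_seconds / self.images

    @property
    def energy_per_image_j(self) -> float:
        # Joules = watts * seconds, averaged over the images produced.
        return self.avg_power_watts * self.wall_seconds / self.images

    def within_budget(self, power_budget_watts: float) -> bool:
        # Quality comparisons are only meaningful between runs that
        # respect the same power budget.
        return self.avg_power_watts <= power_budget_watts


if __name__ == "__main__":
    run = RunMeasurement(images=10, wall_seconds=20.0, avg_power_watts=150.0)
    print(run.latency_per_image_s, run.energy_per_image_j, run.within_budget(200.0))
```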