Accelerating Conditional Image Generation with CachedAttention

Accepted by ICCVML 2025.

Authors: Yi Pan, Ziyi Xu, Shihan Fang

Abstract

Conditional image generation produces images from specific input conditions, such as text or other modalities. Diffusion Transformers (DiTs) are widely used for this task, relying on cross-attention or unified self-attention to align generated images with their conditions. While effective, these mechanisms incur substantial computational cost, particularly with long prompts or complex conditions. We identify notable condition redundancy in this process: the attention outputs between image and condition tend to be similar across timesteps. To address this issue, we propose CacheAttention, a training-free attention mechanism for accelerating DiTs. Because this similarity varies across timesteps and across condition tokens, CacheAttention dynamically caches and reuses intermediate results in the attention operators, thereby reducing redundant computation. We integrate CacheAttention into PixArt-alpha and OmniGen, two popular image generation models. Evaluation results show that our method improves overall throughput by 34%, effectively mitigating condition redundancy in DiTs and improving the computational efficiency of conditional image generation.
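
The caching idea in the abstract can be illustrated with a small pure-Python sketch. It is not the project's implementation: the class `CachedCrossAttention`, its `threshold` and `max_reuse` parameters, and the whole-output cosine test are assumptions made for illustration, and the sketch caches at the granularity of a full attention output rather than per token.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    # Plain scaled dot-product attention: image queries attend to condition tokens.
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out


def mean_cosine(xs, ys):
    # Average row-wise cosine similarity between two attention outputs.
    total = 0.0
    for x, y in zip(xs, ys):
        dot = sum(a * b for a, b in zip(x, y))
        norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
        total += dot / norm if norm > 0 else 0.0
    return total / len(xs)


class CachedCrossAttention:
    # Hypothetical wrapper (an assumption, not the authors' API): when two
    # consecutively computed outputs are similar enough, the next `max_reuse`
    # timesteps return the cached output instead of recomputing attention.
    def __init__(self, threshold=0.99, max_reuse=2):
        self.threshold = threshold
        self.max_reuse = max_reuse
        self.cache = None
        self.reuse_left = 0
        self.computed = 0  # number of real attention evaluations

    def __call__(self, queries, keys, values):
        if self.cache is not None and self.reuse_left > 0:
            self.reuse_left -= 1
            return self.cache
        out = cross_attention(queries, keys, values)
        self.computed += 1
        if self.cache is not None and mean_cosine(out, self.cache) >= self.threshold:
            self.reuse_left = self.max_reuse
        self.cache = out
        return out


if __name__ == "__main__":
    attn = CachedCrossAttention(threshold=0.99, max_reuse=2)
    q = [[1.0, 0.0], [0.0, 1.0]]
    k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[1.0, 2.0], [3.0, 4.0]]
    for _ in range(6):
        attn(q, k, v)
    print("attention evaluations over 6 timesteps:", attn.computed)
```

In this toy setting the inputs never change, so the wrapper recomputes only when its reuse budget runs out. The `max_reuse` cap is a design choice of the sketch: it bounds how long a stale output can be served before similarity is checked again.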

Credit

SJTU Course AI3604: Computer Vision (2024 Fall) Team F Project.