Research Article
GPU-FPGA Heterogeneous Systems for Disaggregated LLM Inference: Memory Processing Pipeline Acceleration
Original link: arXiv:2603.29002
Abstract
Modern large language models increasingly depend on efficient long-context processing mechanisms, including sparse attention, retrieval-augmented generation (RAG), and compressed contextual memory. This paper shows that these optimizations can be unified into a four-step memory processing pipeline: Prepare Memory, Compute Relevancy, Retrieval, and Apply to Inference. Through systematic profiling, the authors find that memory processing accounts for 22%-97% of LLM inference latency and propose GPU-FPGA heterogeneous systems to accelerate memory-bound operations. Evaluated on an AMD MI210 GPU paired with an Alveo U55C FPGA, the system achieves a 1.04-2.2× end-to-end speedup and a 1.11-4.7× energy reduction.
1. Problem Definition
“Modern LLMs can process and generate 128k to 1 million tokens per request when users prompt for paper reading, deep reasoning, and creative writing. However, LLMs typically maintain all contexts as key-value (KV) caches, incurring substantial hardware costs and runtime overhead.”
Key Challenge: Storing the KV cache for 1M tokens requires up to 69 GB of GPU memory for the GPT-OSS-120B model, and repeatedly accessing the cache amplifies memory pressure during autoregressive decoding.
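The scale of this cost follows from simple arithmetic: each token stores one key vector and one value vector per layer and per KV head. A minimal sketch of the estimate is shown below. The function name `kv_cache_bytes` and the model configuration (layer count, KV heads, head dimension, 16-bit values) are illustrative assumptions, not values reported in the source.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Estimate KV cache size: one key and one value vector per token, layer, and KV head."""
    # The leading factor of 2 accounts for storing both the key and the value.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


# Hypothetical configuration, assumed for illustration only (not from the source).
size = kv_cache_bytes(num_layers=36, num_kv_heads=8, head_dim=64, num_tokens=1_000_000)
print(f"{size / 2**30:.1f} GiB")
```

Under these assumed values the estimate lands in the tens of gigabytes, the same order of magnitude as the figure quoted above. Because the size is linear in `num_tokens`, a 10× longer context costs 10× the memory.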
Existing Optimizations:
- Sparse Attention: Selectively attends to a subset of tokens (DeepSeek Attention, SeerAttention-R, LServe)
- Retrieval-Augmented Generation (RAG): Offloads static knowledge to an external database (FLARE, DRAGIN)
- Compressed Contextual Memory: Compresses past tokens into embeddings (MemAgent, Titans, HMT)
- Test-time Training (TTT): Adapts model parameters during inference (LaCT)
However, prior work treats these as isolated techniques without systematic understanding of their computational characteristics and hardware efficiency implications.
2. Method Framework
2.1 Four-Step Memory Processing Pipeline
The paper unifies diverse LLM inference optimizations under a common four-step pipeline:
1. Prepare Memory ($\mathrm{prep}(M_{<t}) = I_{<t}$)
- Preprocesses raw memory into compact or structured format
- Examples: Linear projections + RoPE (DeepSeek), Page-wise Min/Max pooling (LServe), Tokenization (RAG)
- Typically compute-intensive with regular, consecutive memory accesses
2. Compute Relevancy ($\mathrm{comp}(I_{<t}, x_t) = S$)
- Assigns importance scores to each memory entry
- Examples: Multi-headed inner product (DeepSeek), BM25 scoring (RAG), Linear projection + inner product (Memory as Context)
- Memory-bound with irregular access patterns
3. Retrieval ($\mathrm{ret}(M_{<t}, S) = M'_{<t}$)
- Selects and extracts information based on scores
- Examples: Top-k selection, threshold-based filtering, weighted sum
- Memory-bound with data-dependent access patterns
4. Apply to Inference ($\mathrm{apply}(M'_{<t}, x_t) = O_{<t}$)
- Integrates retrieved content into decoding process
- Examples: Fine-grain sparse attention, appending to query, model prefilling
- Compute-intensive with regular accesses
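The four steps can be sketched end to end on a toy key-value memory. The example below is a simplified illustration rather than any of the cited systems: it assumes mean pooling per page for Prepare Memory (LServe is described above as using min/max pooling), a plain inner product for Compute Relevancy, top-k page selection for Retrieval, and softmax attention over the selected tokens for Apply to Inference. All function names and data are hypothetical.

```python
import heapq
import math


def prepare_memory(keys, page_size):
    """Prepare Memory: summarize each page of keys by its mean vector (assumed pooling)."""
    pages = [keys[i:i + page_size] for i in range(0, len(keys), page_size)]
    return [[sum(col) / len(page) for col in zip(*page)] for page in pages]


def compute_relevancy(index, query):
    """Compute Relevancy: inner product between the query and every page summary."""
    return [sum(a * b for a, b in zip(summary, query)) for summary in index]


def retrieve(scores, k):
    """Retrieval: indices of the k highest-scoring pages, in ascending order."""
    return sorted(heapq.nlargest(k, range(len(scores)), key=scores.__getitem__))


def apply_to_inference(keys, values, query, page_ids, page_size):
    """Apply to Inference: softmax attention restricted to tokens in the selected pages."""
    token_ids = [t for p in page_ids
                 for t in range(p * page_size, min((p + 1) * page_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(a * b for a, b in zip(keys[t], query)) for t in token_ids]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [sum(w * values[t][d] for w, t in zip(weights, token_ids)) / total
            for d in range(len(values[0]))]


# Toy memory: six tokens with 2-dimensional keys and values, grouped into pages of two.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [-1.0, 0.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [3.0, 0.0], [0.0, 5.0], [0.0, 7.0], [9.0, 9.0], [9.0, 9.0]]
query = [1.0, 0.0]

index = prepare_memory(keys, page_size=2)
scores = compute_relevancy(index, query)
selected = retrieve(scores, k=1)
output = apply_to_inference(keys, values, query, selected, page_size=2)
print(selected, output)
```

Only the first page aligns with the query, so the final attention touches two of the six tokens. The scoring and selection steps still read every page summary, which is why they tend to be the memory-bound part of the pipeline.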
2.2 Computational Heterogeneity Analysis
The paper reveals significant heterogeneity across pipeline steps:
| Step | Arithmetic Intensity | Access Pattern | Data Requirement |
|---|---|---|---|
| Prepare Memory | 10-100 (compute-bound) | Regular | Local Memory |
| Compute Relevancy | 1-10 (memory-bound) | Irregular | Across Memories |
| Retrieval | ~1 (memory-bound) | Irregular | Across Memories |
| Apply to Inference | 10-100 (compute-bound) | Regular | Local Memory |
Key Insight: Sparse attention and RAG are dominated by the memory-bound Compute Relevancy and Retrieval steps, while MemAgent spends up to 97% of its latency in the compute-bound Prepare Memory step.
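The compute-bound versus memory-bound labels can be read through a roofline-style check: an operation is memory-bound when its arithmetic intensity falls below the ratio of peak compute to peak memory bandwidth. The sketch below assumes the intensities in the table are in FLOPs per byte, which the source does not state, and uses a hypothetical device. The helper names are illustrative.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte moved to or from memory."""
    return flops / bytes_moved


def is_memory_bound(intensity, peak_flops, peak_bandwidth):
    """Memory-bound when intensity is below the ridge point peak_flops / peak_bandwidth."""
    return intensity < peak_flops / peak_bandwidth


# Relevancy scoring: one query of dimension `dim` against `n_keys` keys in 16-bit precision.
n_keys, dim, bytes_per_value = 1_000_000, 128, 2
flops = 2 * n_keys * dim                      # one multiply and one add per key element
bytes_moved = n_keys * dim * bytes_per_value  # every key is read once; the query is negligible
ai = arithmetic_intensity(flops, bytes_moved)

# Hypothetical device (assumed): 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
print(ai, is_memory_bound(ai, peak_flops=100e12, peak_bandwidth=1e12))
```

With one multiply-add per 16-bit element read, the scoring step works out to about 1 FLOP per byte, far below the ridge point of the assumed device. That matches the low intensities listed for Compute Relevancy and Retrieval.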
2.3 GPU-FPGA Heterogeneous Architecture
The authors propose offloading sparse, irregular, and memory-bound operations to FPGAs while retaining compute-intensive operations on GPUs:
FPGA Advantages:
- Larger SRAM capacity with higher bandwidth
- Flexible data control with minimized scheduling overhead
- Low static power consumption
- Custom microarchitecture for irregular data accesses
System Design:
- AMD MI210 GPU + Alveo U55C FPGA connected via PCIe
- Memory processing pipeline mapped to heterogeneous system
- Consideration of computational heterogeneity and data locality
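Because the two devices are connected via PCIe, offloading a step only helps when the FPGA's time saving exceeds the cost of moving data across the link. The following is a back-of-the-envelope cost model, not the paper's mapping policy. The function name, the timings, and the link bandwidth are hypothetical values chosen for illustration.

```python
def offload_pays_off(gpu_time_s, fpga_time_s, bytes_to_move,
                     link_bandwidth_bytes_per_s, link_latency_s=0.0):
    """Return True if running a step on the FPGA, including transfer, beats the GPU."""
    transfer_s = link_latency_s + bytes_to_move / link_bandwidth_bytes_per_s
    return fpga_time_s + transfer_s < gpu_time_s


# Hypothetical step: 10 ms on the GPU, 2 ms on the FPGA, over a 16 GB/s link (assumed).
small = offload_pays_off(0.010, 0.002, bytes_to_move=64e6, link_bandwidth_bytes_per_s=16e9)
large = offload_pays_off(0.010, 0.002, bytes_to_move=256e6, link_bandwidth_bytes_per_s=16e9)
print(small, large)
```

In this toy setting, moving 64 MB costs 4 ms and the offload wins, while moving 256 MB costs 16 ms and it does not. This is why data locality matters when mapping pipeline steps to devices.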
3. Experimental Results
3.1 Performance Speedup
| Optimization | Memory Processing Speedup | End-to-End Speedup |
|---|---|---|
| Sparse Attention (SeerAttention-R, DeepSeek) | 1.5-5.7× | Up to 1.49× |
| RAG (DRAGIN) | 5.16-7.65× | Up to 2.2× |
| Memory as Context (HMT, Titans) | 1.3-1.6× | 1.8× (MemAgent) |
| Geometric Mean | 3.2× | 1.04-2.2× |
3.2 Energy Efficiency
| Metric | Improvement |
|---|---|
| Energy per request | 1.11-4.66× reduction |
| Geometric mean energy cost | 1.11-4.7× lower |
3.3 Memory Processing Overhead
| Method | Memory Processing % (4K tokens) | Memory Processing % (1M tokens) |
|---|---|---|
| Sparse Attention | 1-11% | 22-81% |
| RAG (20M documents) | - | 40-61% |
| Parameterized Memory (Titans/HMT) | High | High |
| MemAgent | - | Up to 97% |
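The gap between memory processing speedup and end-to-end speedup can be reasoned about with Amdahl's law: if memory processing takes a fraction f of the runtime and is accelerated by a factor s, the overall speedup is 1 / ((1 - f) + f / s). The sketch below applies this simplified model. It ignores GPU-FPGA overlap and transfer costs, so it is an illustration rather than a reproduction of the reported results. The speedup factor of 3.0 is a hypothetical input.

```python
def end_to_end_speedup(memory_fraction, memory_speedup):
    """Amdahl's law: overall speedup when only the memory processing part is accelerated."""
    if not 0.0 <= memory_fraction <= 1.0:
        raise ValueError("memory_fraction must be in [0, 1]")
    if memory_speedup <= 0.0:
        raise ValueError("memory_speedup must be positive")
    return 1.0 / ((1.0 - memory_fraction) + memory_fraction / memory_speedup)


# Fractions taken from the overhead ranges above; the 3.0x factor is assumed.
for fraction in (0.22, 0.61, 0.97):
    print(f"f={fraction:.2f}: {end_to_end_speedup(fraction, 3.0):.2f}x end to end")
```

The same memory processing speedup yields a much larger end-to-end gain when the overhead fraction is high, which is why the benefit grows with context length.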
4. Strengths and Limitations
Strengths
- Unified framework: First systematic understanding of memory processing across diverse LLM optimizations
- Heterogeneous acceleration: Demonstrates a practical GPU-FPGA system for memory-bound workloads
- Significant speedup: 1.04-2.2× end-to-end improvement across multiple optimizations
- Energy efficiency: 1.11-4.7× lower energy cost per request
- General applicability: Same paradigm can accelerate existing and future LLM inference methods
Limitations
- Requires FPGA hardware (not universally available)
- PCIe bandwidth may limit data transfer between GPU and FPGA
- Implementation complexity for custom FPGA microarchitecture
- Results may vary across different GPU/FPGA combinations
5. Why This Matters for AI Hardware
This paper has significant implications for next-generation AI chip design:
- Memory Processing as First-Class Citizen: The paper establishes that memory processing accounts for 22%-97% of LLM inference latency, making it a critical optimization target. Future AI accelerators should include dedicated hardware for:
- Top-k selection and filtering
- BM25 scoring and relevance computation
- Sparse matrix-vector multiplication
- Heterogeneous Computing Paradigm: The success of GPU-FPGA systems demonstrates that no single architecture can efficiently handle all LLM workloads. Future systems may integrate:
- GPUs for dense compute (Prepare Memory, Apply to Inference)
- FPGAs or specialized accelerators for irregular memory operations (Compute Relevancy, Retrieval)
- High-bandwidth interconnects for efficient data movement
- Energy Efficiency Focus: With 1.11-4.7× energy reduction, heterogeneous systems offer a path to sustainable LLM serving at scale. This is critical as “cumulative demand of millions of daily requests may scale to annual petawatt-hour levels by 2026.”
- Design Guidelines for Future Hardware: The computational heterogeneity analysis provides clear guidance:
- Memory-bound operations need high-bandwidth, flexible memory access
- Compute-bound operations benefit from dense matrix units
- Irregular access patterns require custom dataflow architectures
- Implications for Edge AI: The energy efficiency gains suggest heterogeneous systems could enable long-context LLM inference on edge devices with constrained power budgets.
References
- He, Z., Ma, R., Sun, Y., & Cong, J. (2026). Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference. arXiv:2603.29002.
- Liu, A., et al. (2025). DeepSeek Attention: Efficient Long-Context LLM Inference. arXiv:2501.xxxxx.
- Su, W., et al. (2024). DRAGIN: Dynamic Retrieval-Augmented Generation. arXiv:2401.xxxxx.
- Behrouz, A., et al. (2025). Titans: Memory as Context for Long-Sequence Modeling. arXiv:2501.xxxxx.
- Song, Y., et al. (2022). FPGA Acceleration of Sparse Neural Networks. FPGA.