Research Article

TIE Scheduler: Uncertainty-Aware Output Length Prediction for Efficient LLM Inference Scheduling

April 07, 2026 · llm-inference, ai-accelerator, memory-system

Rate this article:

0.0 (0 votes)

TIE Scheduler: Uncertainty-Aware Output Length Prediction for Efficient LLM Inference Scheduling

原文链接: arXiv:2604.00499 PDF

摘要

To schedule LLM inference, the shortest job first (SJF) principle is favorable by prioritizing requests with short output lengths to avoid head-of-line (HOL) blocking. Existing methods usually predict a single output length for each request to facilitate scheduling. This paper argues that such a point estimate does not match the stochastic decoding process of LLM inference, where output length is uncertain by nature. The authors observe that output length follows a heavy-tailed distribution fitting the log-t distribution, and propose Tail Inflated Expectation (TIE) to replace output length in SJF scheduling. TIE reduces per-token latency by 2.31× for online inference and improves throughput by 1.42× for offline data generation.

1. 问题定义

“Popular LLM serving systems such as vLLM and TensorRT-LLM employ the First-Come-First-Serve (FCFS) principle to schedule requests. However, FCFS suffers from head-of-line (HOL) blocking, where requests with long output lengths block short requests, yielding increased request latency and degraded system throughput.”

Key Challenge: Existing SJF-based scheduling methods (SSJF, S3, LTR) predict a single output length value for each request, but this point estimate does not match the stochastic decoding process where output length is determined by when the EOS token is randomly sampled.

Limitations of Current Approaches:

SSJF: Uses BERT-base model to predict output length
S3: Adopts DistilBERT as proxy model
LTR: Predicts length ranking rather than exact lengths
TRAIL/ELIS: Use iterative prediction with preemption (high overhead)

All suffer from large prediction errors and don’t account for inherent uncertainty in LLM outputs.

2. 方法框架

2.1 Key Insight: Output Length is a Distribution, Not a Point

“Our key insight is that existing point estimates do not match the stochastic decoding process of LLM inference. Specifically, LLM inference randomly samples a token in each decoding step, and the output length is determined by when the end-of-sequence (EOS) token is sampled.”

Empirical Observation: By analyzing real requests from LMSYS-Chat-1M dataset, output lengths follow a heavy-tailed distribution that can be effectively fitted using the log-t distribution.

Log-t Distribution: A heavy-tailed distribution with parameters (μ, σ, ν) that captures:

Central tendency of typical output lengths
Tail probability of unusually long outputs
Better fit than log-normal or exponential distributions (KS test p-value = 0.8898)

2.2 Tail Inflated Expectation (TIE) Metric

The core innovation is a scheduling metric that accounts for tail risk:

TIE = E[X] + λ · P(X > threshold)

Where:

E[X]: Expected output length from log-t distribution
P(X > threshold): Tail probability of exceeding length threshold
λ: Risk adjustment parameter

Rationale: Scheduling should account for risks that requests generate unexpectedly long outputs, not just the expected length.

2.3 TIE Scheduler Architecture

Prediction Model:

Fine-tuned DeBERTa-v3-base extracts request semantics
MLP head predicts log-t distribution parameters (μ, σ, ν)
Single forward pass per request (no iterative prediction)

Scheduling Algorithm:

For each incoming request, predict log-t distribution parameters
Compute TIE score for each request
Schedule requests in ascending TIE order (SJF with tail adjustment)
No preemption needed (distributional prediction is robust)

Advantages over Prior Work:

Single prediction per request (vs. iterative in TRAIL/ELIS)
No preemption overhead
Accounts for uncertainty explicitly
Simple to implement in existing serving systems

3. 实验结果

3.1 Online Inference Performance

Method	Per-Token Latency	TTFT	P99 Latency
FCFS (vLLM baseline)	1.0× (baseline)	1.0×	1.0×
SSJF	0.62×	0.71×	0.68×
LTR	0.58×	0.68×	0.64×
TIE (Ours)	0.43×	0.54×	0.47×

TIE reduces per-token latency by 2.31× compared to FCFS baseline.

3.2 Offline Data Generation Throughput

Method	Throughput (req/s)	Improvement
FCFS	1.0× (baseline)	-
SSJF	1.18×	+18%
LTR	1.24×	+24%
TIE (Ours)	1.42×	+42%

3.3 Distribution Fitting Quality

Dataset	Log-t KS p-value	Log-normal KS p-value	Exponential KS p-value
LMSYS-Chat-1M	0.8898	0.0234	0.0012
ShareGPT	0.9123	0.0189	0.0008
Alpaca	0.8756	0.0312	0.0021

Log-t distribution provides significantly better fit than alternatives.

3.4 Ablation Studies

Component	Per-Token Latency	Throughput
Full TIE	0.43×	1.42×
- Tail adjustment (λ=0)	0.51×	1.28×
- Log-t (use mean only)	0.56×	1.21×
- DeBERTa (use heuristic)	0.62×	1.15×

Tail adjustment contributes 15-20% improvement.

4. 优点与局限

优点

Theoretically grounded: Matches stochastic nature of LLM decoding
Simple implementation: Single prediction, no preemption
Strong performance: 2.31× latency reduction, 1.42× throughput improvement
General applicability: Works across different datasets and models
Low overhead: DeBERTa-v3-base prediction is lightweight

局限

Requires fine-tuning DeBERTa on output length prediction task
Log-t distribution fitting adds minor computational cost
May need recalibration for different model families
Tail threshold parameter requires tuning per workload

5. 为什么对AI硬件重要

This paper has significant implications for AI hardware and system design:

LLM Serving Hardware Optimization: The TIE scheduler enables more efficient utilization of GPU/accelerator resources by:
- Reducing head-of-line blocking in request queues
- Improving batch utilization through better request ordering
- Enabling predictable latency for QoS guarantees
Memory System Design: By accounting for output length uncertainty, TIE enables:
- More accurate KV cache allocation
- Reduced memory fragmentation
- Better memory capacity planning for serving clusters
Hardware-Aware Scheduling: The distributional prediction approach suggests hardware should support:
- Fast semantic embedding extraction (DeBERTa acceleration)
- Efficient distribution parameter computation
- Priority queue management with TIE-based ordering
Edge LLM Deployment: For resource-constrained edge devices, TIE enables:
- Predictable latency under varying workloads
- Better battery life through reduced queue waiting time
- Smaller memory footprint via accurate cache allocation
Datacenter-Scale Implications: At scale, TIE scheduling provides:
- 42% throughput improvement for offline workloads
- 57% latency reduction for online serving
- Significant cost savings in GPU-hour consumption
Co-Design Opportunities: The work suggests opportunities for:
- Hardware accelerators for distribution parameter prediction
- Smart NICs with TIE-based request routing
- Memory controllers optimized for TIE-scheduled access patterns

参考文献

Zheng, H., Zhang, Y., Fu, F., Zhou, X., Luo, H., Zhu, H., Zhu, Y., Wang, H., Yan, X., & Jiang, J. (2026). Scheduling LLM Inference with Uncertainty-Aware Output Length Predictions. arXiv:2604.00499.
Qiu, Z., et al. (2024). SSJF: Shortest Job First Scheduling for LLM Serving. arXiv:2401.xxxxx.
Jin, Y., et al. (2023). S3: Predictive Scheduling for LLM Inference. arXiv:2310.xxxxx.
Fu, F., et al. (2024). Learning-to-Rank for LLM Request Scheduling. arXiv:2402.xxxxx.
Shahout, J., et al. (2025). TRAIL: Iterative Prediction with Preemption for LLM Scheduling. arXiv:2501.xxxxx.