

Ranking Before Serving: Low-Latency LLM Serving via Pairwise Learning-to-Rank
Tuesday, June 23, 2026 4:00 PM to 4:20 PM · 20 min. (Europe/Berlin)
Hall E - 2nd Floor
Research Paper
AI Applications powered by HPC Technologies · Large Language Models and Generative AI in HPC · ML Model Optimization · ML Systems and Frameworks · Resource Management and Scheduling
Information
Efficient scheduling of large language model (LLM) inference tasks is critical for achieving low latency and high throughput, a challenge that is becoming increasingly acute with the rise of reasoning-capable LLMs whose generation lengths are highly variable.
Traditional strategies such as First-Come, First-Served (FCFS) often suffer from Head-of-Line (HOL) blocking, where long-running tasks delay shorter ones queued behind them.
In this paper, we introduce PARS, a prompt-aware LLM task scheduler that mitigates HOL blocking by approximating shortest-job-first (SJF) scheduling through pairwise ranking with a margin ranking loss.
PARS predicts the relative ordering of tasks by response length directly from their prompts, so scheduling decisions can be made with minimal overhead. It also integrates with vLLM, a state-of-the-art LLM serving system, for use by the research community. Extensive experiments across multiple LLMs and real-world inference use cases (chat, math, and code generation) demonstrate that PARS reduces latency by up to 15.7× compared to the vLLM default scheduler. Cross-model evaluations show that the design generalizes, enabling effective scheduling across diverse LLMs without model-specific retraining.
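As a rough illustration of the two ideas in the abstract, the sketch below shows a standard pairwise margin ranking loss and how ordering a queue by predicted score can reduce mean completion time relative to FCFS. It is not the PARS implementation: the request IDs, lengths, scores, and function names are hypothetical, and treating response length as a sequential service time ignores the batching a real serving system performs.

```python
from itertools import accumulate


def margin_ranking_loss(score_a, score_b, y, margin=1.0):
    """Standard pairwise hinge loss.

    y = +1 means item a should score higher than item b; y = -1 means the reverse.
    The loss is zero once the scores are separated by at least `margin`
    in the desired direction.
    """
    return max(0.0, -y * (score_a - score_b) + margin)


def mean_completion_time(service_times):
    """Mean completion time when jobs run one at a time in the given order."""
    finish_times = list(accumulate(service_times))
    return sum(finish_times) / len(finish_times)


# Hypothetical queue in arrival order:
# (request id, true response length in tokens, predicted ranking score).
# A higher score means "predicted to produce a longer response".
queue = [("r0", 900, 2.3), ("r1", 40, -1.1), ("r2", 120, 0.2)]

# FCFS serves requests in arrival order, so the long r0 blocks r1 and r2.
fcfs_order = [length for _, length, _ in queue]

# Approximate SJF: serve the lowest predicted score first.
ranked_order = [length for _, length, _ in sorted(queue, key=lambda r: r[2])]

fcfs_mean = mean_completion_time(fcfs_order)
ranked_mean = mean_completion_time(ranked_order)

# A correctly ordered pair incurs no loss; a mis-ordered pair is penalized.
loss_correct = margin_ranking_loss(2.3, -1.1, y=1)
loss_wrong = margin_ranking_loss(-1.1, 2.3, y=1)

print(f"FCFS mean completion:   {fcfs_mean:.1f}")
print(f"Ranked mean completion: {ranked_mean:.1f}")
```

Only the relative order of the scores matters for scheduling, which is why a ranking objective can suffice here without predicting exact response lengths.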
Contributors:
Format
on-site
Documents & Links
Read the Full Paper Open Access at IEEE Xplore!

