
vLLM, TensorRT-LLM, TGI v3, and LMDeploy: A Technical Breakdown of LLM Inference Performance

NVIDIA’s TensorRT-LLM achieves over 10,000 output tokens/s on H100 GPUs, delivering up to 4.6× higher throughput and 4.4× faster time to first token than the A100 running the same models.

As production-level large language model (LLM) serving matures, it is increasingly clear that the challenge lies less in the generate() loop itself than in the inference stack around it, which determines tokens per second, tail latency, and ultimately the cost per million tokens on a given GPU fleet.

This article examines four prominent inference stacks currently in use:

  • vLLM
  • NVIDIA TensorRT-LLM
  • Hugging Face Text Generation Inference (TGI v3)
  • LMDeploy

1. vLLM: PagedAttention as the Open Baseline

The core innovation behind vLLM is PagedAttention, which treats the key-value (KV) cache like paged virtual memory rather than a single contiguous buffer. This sharply reduces memory fragmentation and lets more concurrent sequences fit in the same VRAM.

  • vLLM divides the KV cache into fixed-size blocks.
  • It maintains a block table that maps logical tokens to physical blocks.
  • It shares physical blocks across sequences when their prefixes overlap (illustrated in the sketch below).
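
A toy Python sketch of what such a block table does, purely illustrative and not vLLM’s actual internals: each sequence keeps an ordered list of physical block IDs, a logical token position resolves to a (block, offset) pair, and two sequences with a shared prefix can point at the same physical blocks.

    # Toy illustration of a PagedAttention-style block table (not vLLM's internals).
    BLOCK_SIZE = 16  # tokens per physical KV block

    class BlockTable:
        """Maps a sequence's logical token positions to physical KV-cache blocks."""
        def __init__(self, blocks: list[int]):
            self.blocks = blocks  # physical block IDs, in logical order

        def physical_slot(self, token_pos: int) -> tuple[int, int]:
            # logical position -> (physical block ID, offset within that block)
            block_idx, offset = divmod(token_pos, BLOCK_SIZE)
            return self.blocks[block_idx], offset

    # Two sequences sharing a 32-token prefix reuse physical blocks 0 and 1;
    # each then appends its own private block for the divergent suffix.
    seq_a = BlockTable([0, 1, 2])
    seq_b = BlockTable([0, 1, 3])

    print(seq_a.physical_slot(20))  # (1, 4): token 20 lands in shared block 1
    print(seq_b.physical_slot(40))  # (3, 8): token 40 lands in seq_b's private block 3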

This architecture yields a 2–4× throughput improvement over systems like FasterTransformer and Orca, especially for longer sequences. vLLM also supports continuous batching, merging incoming requests into the running GPU batch instead of waiting for the current batch to drain.
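
For reference, a minimal sketch of serving prompts through vLLM’s offline Python API (the model name is illustrative); the engine applies PagedAttention and continuous batching automatically:

    from vllm import LLM, SamplingParams

    # Load a model; gpu_memory_utilization controls how much VRAM is set aside
    # for the weights plus the paged KV cache.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", gpu_memory_utilization=0.90)

    params = SamplingParams(temperature=0.7, max_tokens=256)
    prompts = [
        "Explain PagedAttention in two sentences.",
        "Why does continuous batching raise GPU utilization?",
    ]

    # generate() batches the prompts internally; outputs arrive per prompt.
    for output in llm.generate(prompts, params):
        print(output.outputs[0].text)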

2. TensorRT-LLM: Maximizing NVIDIA GPU Performance

TensorRT-LLM is NVIDIA’s specialized inference library for extracting maximum performance from its GPUs. It combines custom attention kernels, in-flight batching, and quantization down to INT4 and FP4, and in particular leverages the FP8 tensor cores on Hopper and Blackwell architectures.

Performance metrics reveal that on H100 GPUs with FP8, TensorRT-LLM achieves over 10,000 output tokens/s at peak throughput for 64 concurrent requests, with a time to first token around 100 ms. Notably, it offers up to 4.6× higher maximum throughput and 4.4× faster first token latency compared to the A100 on the same models.
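
A quick back-of-envelope reading of those numbers, assuming the peak figure is aggregate decode throughput across the 64 concurrent requests and using a purely hypothetical hourly GPU price:

    # Back-of-envelope on the reported peak numbers.
    # Assumes ~10,000 output tokens/s is aggregate across 64 concurrent requests;
    # the GPU price below is hypothetical and only for illustration.
    aggregate_tps = 10_000
    concurrency = 64
    gpu_hourly_usd = 3.00  # hypothetical H100 on-demand rate

    per_stream_tps = aggregate_tps / concurrency
    tokens_per_hour = aggregate_tps * 3600
    usd_per_million_tokens = gpu_hourly_usd / (tokens_per_hour / 1_000_000)

    print(f"~{per_stream_tps:.0f} output tokens/s per concurrent request")  # ~156
    print(f"~${usd_per_million_tokens:.3f} per million output tokens")      # ~$0.083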

3. Hugging Face TGI v3: Specializing in Long Prompts

Text Generation Inference (TGI) v3 is Hugging Face’s serving stack, built in Rust and Python. This release focuses on handling long prompts efficiently through prompt chunking and prefix caching.

According to published benchmarks, TGI v3 can serve in about 2 seconds a conversation reply that takes vLLM roughly 27.5 seconds, a ~13× speedup for long prompts exceeding 200,000 tokens. The gain comes largely from keeping the conversation’s context in a prefix cache, so subsequent turns skip most of the prefill work.
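
A minimal sketch of a multi-turn exchange against a TGI v3 endpoint using the huggingface_hub client; the URL and prompt contents are illustrative, and the server is assumed to already be running with prefix caching enabled (the default in v3):

    from huggingface_hub import InferenceClient

    # Assumes a TGI v3 server is already serving a model at this address.
    client = InferenceClient("http://localhost:8080")

    long_context = "<very long retrieved document>\n"  # e.g. a large RAG context

    prompt_1 = long_context + "User: Summarize this document.\nAssistant:"
    turn_1 = client.text_generation(prompt_1, max_new_tokens=256)

    # The second turn repeats the same long prefix; TGI v3's prefix cache lets the
    # server skip re-prefilling it, which is where the long-prompt speedup comes from.
    prompt_2 = prompt_1 + turn_1 + "\nUser: Now list three key risks.\nAssistant:"
    turn_2 = client.text_generation(prompt_2, max_new_tokens=256)
    print(turn_2)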

4. LMDeploy: TurboMind with Blocked KV and Aggressive Quantization

LMDeploy, part of the InternLM ecosystem, is built around its TurboMind engine and focuses on high-throughput request serving, combining a blocked KV cache with continuous batching. It leans on aggressive quantization, such as 4-bit AWQ weights and KV-cache quantization, to lift performance.

Reportedly, LMDeploy delivers up to 1.8× higher request throughput than vLLM, aided by its blocked KV cache, dynamic split-and-fuse scheduling, and optimized CUDA kernels. The stack also supports multi-model deployments, with routing logic that selects a model based on request metadata.
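
A minimal sketch of LMDeploy’s pipeline API with the TurboMind backend; the model ID and engine settings are illustrative, and a real deployment would point at a quantized checkpoint it actually has access to:

    from lmdeploy import pipeline, TurbomindEngineConfig

    # Engine settings: serve 4-bit AWQ weights and reserve a share of free VRAM
    # for the blocked KV cache.
    engine = TurbomindEngineConfig(
        model_format="awq",
        cache_max_entry_count=0.8,
    )

    # Illustrative model ID; any AWQ-quantized checkpoint supported by TurboMind works.
    pipe = pipeline("internlm/internlm2_5-7b-chat-4bit", backend_config=engine)

    responses = pipe(["Explain what a blocked KV cache is in one sentence."])
    print(responses[0].text)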

Choosing the Right Stack

  • For maximum throughput and low time to first token on NVIDIA GPUs, TensorRT-LLM is the optimal choice, leveraging advanced features like FP8 and speculative decoding.
  • If handling long, reusable prompts, especially in RAG over large contexts, TGI v3 stands out due to its prefix caching method.
  • For an open, straightforward engine that provides a solid baseline performance, vLLM remains a strong candidate.
  • For deploying open models with a focus on aggressive quantization, LMDeploy is a fitting choice, particularly when working with models like InternLM.

In practice, many teams mix these systems, matching throughput, latency, and KV-cache behavior to their specific workloads. Understanding those trade-offs is what ultimately drives cost and performance in LLM serving.
