
Distributed Speculative Decoding Achieves 1.1x Speedup and 9.7% Throughput Gain for LLMs

Researchers at Franklin and Marshall College and NYU unveil Distributed Speculative Decoding, achieving up to a 1.1x speedup and a 9.7% throughput gain for LLMs across diverse environments

Researchers from Franklin and Marshall College and New York University have unveiled a new framework for accelerating large language model (LLM) inference across varied computing environments. The framework, named Distributed Speculative Decoding (DSD), addresses the persistent challenges of slow processing and limited scalability when serving LLMs in settings ranging from high-powered data centers to mobile devices. The team, led by Fengze Yu, Leshu Li, Brad McDanel, and Saiqian Zhang, accelerates text generation by coordinating inference across multiple devices and predicting likely token sequences in advance.
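As background, speculative decoding pairs a small, fast draft model with the large target model: the draft proposes a short run of tokens, and the target verifies them, committing the longest accepted prefix. The sketch below illustrates this standard accept/reject loop in Python; `draft_model`, `target_model`, and the token-by-token verification are illustrative stand-ins, not the authors' implementation, which would batch the verification into a single forward pass.

```python
# Minimal sketch of one speculative decoding step (greedy variant).
# `draft_model` and `target_model` are hypothetical callables that
# return the next token for a given token sequence; they stand in
# for a small draft LLM and the large target LLM.

def speculative_step(prompt_tokens, draft_model, target_model, window=4):
    # 1. The draft model speculates `window` tokens ahead, cheaply.
    context = list(prompt_tokens)
    proposed = []
    for _ in range(window):
        tok = draft_model(context)
        proposed.append(tok)
        context.append(tok)

    # 2. The target model verifies the proposals. Real systems do
    #    this in a single batched forward pass; it is shown token
    #    by token here for clarity.
    accepted = []
    verified = list(prompt_tokens)
    for tok in proposed:
        expected = target_model(verified)   # expensive prediction
        if tok == expected:
            accepted.append(tok)            # speculation was correct
            verified.append(tok)
        else:
            accepted.append(expected)       # use the target's token
            break                           # discard the rest

    return accepted  # always yields >= 1 token per verification pass
```

A larger window amortizes the target model's cost over more accepted tokens but wastes draft work whenever a prediction is rejected; that trade-off is exactly what the Adaptive Window Control policy described below manages.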

To optimize DSD's distributed approach, the team recognized the need for dedicated simulation tooling and built DSD-Sim, a discrete-event simulator that models the network dynamics, batching, and scheduling involved in multi-device LLM deployments. By simulating interactions among devices during decoding, DSD-Sim surfaces performance bottlenecks and opportunities for optimization. The researchers also introduced an Adaptive Window Control (AWC) policy, which adjusts the size of the speculation window during inference. This data-driven method improves throughput by balancing the gains from deeper speculation against the cost of rejected predictions, preserving both performance and stability.
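For concreteness, a discrete-event simulator of the kind DSD-Sim is described as being advances a virtual clock through a priority queue of timestamped events. The skeleton below is a generic sketch of that pattern; the event kinds, payload fields, and timings are invented for illustration and are not the DSD-Sim API.

```python
import heapq
import itertools

# Generic discrete-event simulation skeleton in the style DSD-Sim
# is described as using. The event kinds ("draft_done",
# "verify_done"), payload fields, and timings are illustrative
# assumptions, not the DSD-Sim API.

class Simulator:
    def __init__(self):
        self.clock = 0.0
        self.events = []               # min-heap ordered by timestamp
        self._seq = itertools.count()  # tie-breaker for equal times

    def schedule(self, delay, kind, payload):
        heapq.heappush(self.events,
                       (self.clock + delay, next(self._seq), kind, payload))

    def run(self, until):
        while self.events and self.events[0][0] <= until:
            self.clock, _, kind, payload = heapq.heappop(self.events)
            if kind == "draft_done":
                # Draft tokens cross the network to the verifier;
                # latency is modeled as a scheduling delay.
                self.schedule(payload["rtt"], "verify_done", payload)
            elif kind == "verify_done":
                # Verified tokens are committed and the next
                # speculation round begins on the draft device.
                self.schedule(payload["draft_time"], "draft_done", payload)

sim = Simulator()
sim.schedule(0.0, "draft_done", {"rtt": 0.02, "draft_time": 0.005})
sim.run(until=0.5)
print(f"simulated up to t = {sim.clock:.3f}s")
```

Because every network hop and device-side delay becomes an explicit event, a simulator of this shape can replay many scheduling and batching policies quickly without touching real hardware.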

Extensive testing confirmed the effectiveness of DSD and AWC, with results showing up to a 1.1x speedup and a 9.7% throughput increase over existing speculative decoding methods. These gains improve both latency and scalability, underscoring DSD's potential to enable more responsive applications of large language models in diverse environments.

The DSD framework not only accelerates LLM inference but also scales effectively across edge and cloud platforms. Traditional speculative decoding is typically confined to single-node execution; DSD extends the technique to multi-device coordination, allowing for more agile and efficient LLM serving. DSD-Sim, in turn, captures the network, batching, and scheduling effects that arise in this distributed setting.

Building on insights from DSD-Sim, the AWC policy leverages a Window Control Deep Neural Network (WC-DNN) that processes system state data, including queue depth, utilization rates, and round-trip time statistics. The WC-DNN predicts the optimal speculation window size through supervised regression techniques, ensuring efficient performance under varying loads. The researchers implemented measures such as clamping window size predictions, applying exponential smoothing, and introducing hysteresis for mode switching to maintain stable execution and minimize fluctuations in predicted window sizes.
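Those three stabilization measures can be sketched directly. In the snippet below, the clamp bounds, smoothing factor, and hysteresis band are illustrative assumptions rather than the paper's actual values, and `raw_prediction` stands in for the WC-DNN's output.

```python
# Hedged sketch of the stabilization measures described for AWC:
# clamping the predicted window, exponential smoothing, and
# hysteresis on mode switching. All constants are illustrative
# assumptions, not values from the paper, and `raw_prediction`
# stands in for the WC-DNN's output.

MIN_W, MAX_W = 1, 16   # assumed clamp range for the window size
ALPHA = 0.3            # assumed smoothing factor
LOW, HIGH = 2.0, 3.0   # assumed hysteresis band for mode switches

class WindowController:
    def __init__(self):
        self.smoothed = 4.0      # running window-size estimate
        self.speculating = True  # current decoding mode

    def update(self, raw_prediction):
        # 1. Clamp the raw WC-DNN prediction to a safe range.
        clamped = max(MIN_W, min(MAX_W, raw_prediction))

        # 2. Exponential smoothing damps step-to-step jitter.
        self.smoothed = ALPHA * clamped + (1 - ALPHA) * self.smoothed

        # 3. Hysteresis: switch modes only when the smoothed value
        #    clearly leaves the [LOW, HIGH] band, preventing rapid
        #    flapping between speculative and plain decoding.
        if self.speculating and self.smoothed < LOW:
            self.speculating = False
        elif not self.speculating and self.smoothed > HIGH:
            self.speculating = True

        return round(self.smoothed), self.speculating
```

Separating the low and high thresholds is what gives the controller its stability: a noisy prediction hovering near a single cutoff would otherwise toggle the decoding mode on every step.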

The implications of this research are profound for the future of large language models. By overcoming the inherent limitations of existing methods and facilitating distributed processing, DSD represents a significant step forward in the quest for fast, scalable, and efficient AI applications. As organizations increasingly adopt LLMs for a variety of purposes—from customer service automation to content generation—the advancements made through DSD could pave the way for broader deployment and innovation in the field of artificial intelligence.

Written by AiPressa Staff
