NVIDIA researchers have unveiled ProRL AGENT, a scalable infrastructure for reinforcement learning (RL) training of multi-turn large language model (LLM) agents. The system adopts a ‘Rollout-as-a-Service’ approach, which separates agent rollout orchestration from the training loop. I/O-intensive environment interactions and GPU-intensive policy updates otherwise compete for the same resources, and by pulling them apart the architecture aims to remove a persistent bottleneck in agent development.
Multi-turn agent tasks commonly involve complex interactions with external environments, such as code repositories or operating systems, requiring iterative tool usage. Many current frameworks, including SkyRL, VeRL-Tool, Agent Lightning, rLLM, and GEM, tightly couple rollout control with the training process. This tight coupling leads to two main issues: conflicting system requirements and maintenance barriers. Rollouts depend heavily on I/O operations, necessitating sandbox creation and asynchronous coordination, while training is centered on GPU-intensive tasks like forward and backward passes, causing inefficiencies when managed concurrently. Additionally, embedding rollout logic within the trainer complicates transitions to different training backends or runtime environments.
ProRL AGENT operates as a standalone HTTP service that manages the complete rollout lifecycle independently of the RL trainer, which communicates with the server through an API. Because the trainer remains agnostic to the underlying rollout infrastructure, either side can be swapped without touching the other. To optimize throughput, ProRL AGENT uses an asynchronous three-stage assembly line: initialization workers set up sandbox containers, rollout workers run the multi-turn agent loop, and evaluation workers score results to produce reward signals. The stages overlap across jobs, so a slow evaluation on one job does not stall rollouts for the others.
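The three-stage assembly line can be illustrated with a minimal sketch built on Python's `asyncio` queues. This is not ProRL AGENT's code: the function names (`init_sandbox`, `run_rollout`, `evaluate`), the job fields, and the use of one worker per stage are all assumptions made for illustration, and the sleeps merely stand in for real container start-up, agent turns, and scoring.

```python
import asyncio

# Hypothetical stage functions; the sleeps stand in for real I/O-bound work.
async def init_sandbox(job):
    await asyncio.sleep(0.01)   # e.g. starting a sandbox container
    return {**job, "sandbox": f"sandbox-{job['id']}"}

async def run_rollout(job):
    await asyncio.sleep(0.01)   # e.g. the multi-turn agent loop
    return {**job, "trajectory": [f"turn-{t}" for t in range(job["turns"])]}

async def evaluate(job):
    await asyncio.sleep(0.02)   # e.g. running tests to score the result
    return {**job, "reward": float(len(job["trajectory"]))}

async def worker(work, inbox, outbox):
    """Move jobs from inbox to outbox, applying `work` to each one."""
    while True:
        job = await inbox.get()
        if job is None:         # sentinel: pass it on and stop
            await outbox.put(None)
            return
        await outbox.put(await work(job))

async def run_pipeline(jobs):
    q_init, q_roll, q_eval, q_done = (asyncio.Queue() for _ in range(4))
    workers = [
        asyncio.create_task(worker(init_sandbox, q_init, q_roll)),
        asyncio.create_task(worker(run_rollout, q_roll, q_eval)),
        asyncio.create_task(worker(evaluate, q_eval, q_done)),
    ]
    for job in jobs:
        await q_init.put(job)
    await q_init.put(None)
    await asyncio.gather(*workers)

    # Drain the finished jobs; the sentinel marks the end of the queue.
    results = []
    while (item := q_done.get_nowait()) is not None:
        results.append(item)
    return results

def demo():
    jobs = [{"id": i, "turns": i + 1} for i in range(3)]
    return asyncio.run(run_pipeline(jobs))
```

While the evaluation worker is scoring job 0, the rollout worker is already free to process job 1, which is the overlap the article describes. A real deployment would presumably run many workers per stage; the single-worker version keeps the shutdown logic simple.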
For sandboxing, ProRL AGENT uses Singularity, which allows rootless execution, a requirement for deployment on shared high-performance computing (HPC) clusters managed by Slurm. The system also incorporates several optimizations aimed at reducing tool execution latency, which often dominates total rollout time. These include replacing traditional terminal multiplexing with a more efficient approach, connecting directly to persistent IPython kernels to eliminate network overhead, and using Unix Domain Sockets for communication within the execution environment to further reduce latency.
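To make the last optimization concrete, here is a small sketch of a request and reply over a Unix Domain Socket using Python's standard `socket` module. It only illustrates the transport: the socket file name `tool.sock`, the `ran: ...` reply format, and the helper names are hypothetical and do not reflect ProRL AGENT's actual protocol. It assumes a Unix-like host.

```python
import os
import socket
import tempfile
import threading

def serve_once(server_sock):
    """Accept one connection, read a command, and send back a reply."""
    conn, _ = server_sock.accept()
    with conn:
        command = conn.recv(4096).decode()
        conn.sendall(f"ran: {command}".encode())

def round_trip(command):
    """Send `command` over a Unix Domain Socket and return the reply."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "tool.sock")   # hypothetical socket file
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
            server.bind(path)
            server.listen(1)
            thread = threading.Thread(target=serve_once, args=(server,))
            thread.start()
            with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
                client.connect(path)            # a filesystem path, not host:port
                client.sendall(command.encode())
                reply = client.recv(4096).decode()
            thread.join()
    return reply
```

Because the endpoint is a path on the local filesystem, the exchange stays inside the kernel and skips the TCP/IP stack, which is the usual reason for preferring this transport between processes on the same machine.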
ProRL AGENT also includes features designed to improve training stability and hardware utilization. The server manages a pool of LLM inference backends and routes requests to maximize prefix cache reuse, which cuts inference time across an agent's successive turns. To prevent re-tokenization drift, where decoding generated tokens to text and tokenizing that text again yields a different token sequence, the system keeps token IDs as the canonical representation throughout the process, ensuring consistency between rollout and training. It also supports Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO), whose dynamic sampling step filters out non-informative prompts, and it pairs this with an asynchronous replenishment mechanism to maintain high throughput.
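The filtering idea behind dynamic sampling can be sketched in a few lines. In DAPO, a prompt whose sampled rollouts all receive the same reward yields no relative advantage within its group and therefore no gradient signal, so such prompts are dropped and more are drawn until the batch is full. The sketch below is a simplified, synchronous version under that assumption; `is_informative`, `fill_batch`, and `sample_rewards` are hypothetical names, and ProRL AGENT's replenishment is described as asynchronous.

```python
from typing import Callable, Dict, Iterable, List

def is_informative(rewards: List[float]) -> bool:
    """A group carries a learning signal only if its rewards are not all equal."""
    return len(set(rewards)) > 1

def fill_batch(
    prompts: Iterable[str],
    sample_rewards: Callable[[str], List[float]],
    batch_size: int,
) -> Dict[str, List[float]]:
    """Keep drawing prompts until `batch_size` informative groups are collected."""
    batch: Dict[str, List[float]] = {}
    for prompt in prompts:
        rewards = sample_rewards(prompt)   # hypothetical: one reward per rollout
        if is_informative(rewards):
            batch[prompt] = rewards
        if len(batch) == batch_size:
            break
    return batch
```

For example, with rewards `[0, 0, 0, 0]` for one prompt and `[1, 0, 1, 0]` for another, only the second prompt is kept; the first is replaced by the next prompt in the stream.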
Experimental results validate the effectiveness of ProRL AGENT, demonstrating significant performance improvements across various model scales. For instance, the Qwen3-8B model saw its performance nearly double on the SWE-Bench Verified benchmark, increasing from 9.6% to 18.0%. Similarly, the Qwen3-14B model improved from 15.4% to 23.6%. The system showcased not only advancements in software engineering but also its applicability across STEM, math, and coding domains, with steady reward growth observed during RL training. Scalability tests confirmed that rollout throughput increases nearly linearly as additional compute nodes are introduced.
The introduction of ProRL AGENT signifies a meaningful step in the evolution of reinforcement learning infrastructures, effectively decoupling the rollout lifecycle from policy training. It offers substantial performance gains, reduces system latency, and ensures consistent tokenization, all while facilitating native deployment on HPC clusters. As the demand for more sophisticated AI models grows, innovations like ProRL AGENT could play a pivotal role in optimizing the training processes necessary to develop advanced AI systems.