LLMs Achieve Over 99% Accuracy as World Models for AI Agent Training, Study Reveals

Researchers show that fine-tuned large language models can exceed 99% accuracy as world models in some environments, making it practical to train AI agents on simulated rather than real-world interactions.

Recent research has revealed that large language models (LLMs) can effectively simulate environments, addressing a significant challenge in the training of autonomous AI agents. Autonomous AI systems depend on real-world interactions to gain experience, yet these environments can be limited, difficult to replicate, and often too rigid for diverse learning. Researchers from the Southern University of Science and Technology, Microsoft Research, Princeton University, the University of Edinburgh, and others explored whether LLMs could serve as internal simulators—termed “world models”—to enable training through simulated experiences instead of solely relying on real-world data.

A world model predicts the outcome of an action taken by an AI agent, allowing it to learn in a controlled, synthetic environment. This approach reframes language modeling from predicting the next word to forecasting the next state of the environment following an action. The researchers aimed to demonstrate that this capability allows LLMs to function as precise world simulators, potentially improving the efficiency of AI training.
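
In practice, that reframing amounts to prompting the model with the current state and a candidate action and asking it to generate the resulting observation. The paper does not publish its exact interface; the sketch below is an illustrative outline only, and call_llm is a hypothetical stand-in for whatever LLM client is available.

# Minimal sketch of using an LLM as a text world model: given the current
# textual state and a proposed action, the model predicts the next observation.
# call_llm() is a hypothetical placeholder, not the paper's actual interface.

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM completion call."""
    raise NotImplementedError("plug in your own LLM client here")

WORLD_MODEL_PROMPT = """You are simulating a text-based environment.
Current observation:
{state}

The agent takes the action: {action}

Predict the next observation the environment would return."""

def predict_next_state(state: str, action: str) -> str:
    # Next-state prediction instead of next-word prediction.
    return call_llm(WORLD_MODEL_PROMPT.format(state=state, action=action))

def rollout(initial_state: str, plan: list[str]) -> list[str]:
    """Simulate an entire plan inside the world model, never touching the real environment."""
    states = [initial_state]
    for action in plan:
        states.append(predict_next_state(states[-1], action))
    return states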

The study evaluated LLMs across five different text-based environments: ALFWorld, where agents perform household tasks; SciWorld, a simulation for scientific experiments; TextWorld, which presents narrative puzzles; WebShop, a shopping site where agents search for products; and StableToolBench, focused on API tool usage. This diverse set of environments provided a mix of structured tasks with clear rules and more variable scenarios, allowing the team to assess the models’ predictive accuracy over longer sequences, their scalability with increased data, and their practical utility in actual training scenarios.

Initial findings showed that pre-trained models already have some capacity for modeling environments: Claude Sonnet 4.5 reached 77 percent accuracy at predicting outcomes in ALFWorld's household tasks after seeing just three in-context examples. That accuracy, however, was insufficient for more complicated tasks. The breakthrough came from additional fine-tuning on real interaction data, which pushed models such as Qwen2.5-7B and Llama-3.1-8B past 99 percent accuracy in ALFWorld, to roughly 98.6 percent in SciWorld, and to around 70 percent in TextWorld.
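
The article does not detail the fine-tuning recipe, but the general shape of the data is clear: each step of a logged trajectory becomes a supervised pair mapping a (state, action) to the next observation. The sketch below is a hypothetical illustration of that conversion; the field names and prompt format are assumptions, not the study's actual schema.

# Illustrative only: one way to turn logged agent trajectories into supervised
# fine-tuning pairs for a world model. Field names ("observation", "action")
# and the prompt layout are assumptions.

def trajectories_to_examples(trajectories):
    """Each trajectory is a list of {"observation": str, "action": str} steps;
    the target for step t is the observation recorded at step t+1."""
    examples = []
    for traj in trajectories:
        for prev, nxt in zip(traj, traj[1:]):
            examples.append({
                "prompt": f"State:\n{prev['observation']}\nAction: {prev['action']}\nNext state:",
                "completion": nxt["observation"],
            })
    return examples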

Reliability also held up over longer action sequences. In structured environments the consistency ratio exceeded 90 percent, meaning that plans developed inside the world model succeeded when executed in the real environment at rates comparable to plans developed through direct interaction. The e-commerce simulation proved harder, with consistency averaging around 70 percent and varying significantly between agents. When simulations were initialized with real observations, consistency improved dramatically, approaching 100 percent even with a GPT-4o agent.
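
The article does not spell out how the consistency ratio is computed. One plain reading is the fraction of plans whose simulated outcome (success or failure) matches the outcome when the same plan is executed in the real environment, as in the rough sketch below; treat this as an assumption about the metric, not the paper's definition.

# Rough sketch of a consistency check between simulated and real rollouts,
# under the assumption that "consistency" means matching success/failure outcomes.

def consistency_ratio(simulated_success: list[bool], real_success: list[bool]) -> float:
    assert len(simulated_success) == len(real_success)
    matches = sum(s == r for s, r in zip(simulated_success, real_success))
    return matches / len(real_success)

# Example: 9 of 10 plans behave the same in simulation and reality -> 0.9
print(consistency_ratio([True] * 10, [True] * 9 + [False]))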

As the researchers explored scaling, they identified distinct patterns. In structured environments, accuracy plateaued after about 20,000 training trajectories (recorded sequences of agent actions). Open environments such as the shopping site, by contrast, kept improving with more data, up to 70,000 trajectories. Similar effects appeared with model size: 1.5-billion-parameter models performed well in structured settings, while more complex scenarios required larger models. The findings underscore that both data volume and model size must scale with the complexity of the environment for effective world modeling.

This research feeds a growing debate about the future direction of AI, echoing concerns raised by Turing Award winner Richard Sutton. Sutton has argued that the industry is at a crossroads and should shift toward continuous learning from experience rather than relying on pre-existing knowledge. In "Welcome to the Era of Experience," an essay co-authored with DeepMind researcher David Silver, he advocates for AI agents that learn from their own experience, with world models serving as their internal simulators.

While this study provides empirical evidence that LLMs can simulate environmental dynamics, it does not fully address Sutton’s concern regarding the necessity for continuous learning without the risk of forgetting past knowledge—an essential aspect for achieving true intelligence in AI systems.
