LLMs Achieve Over 99% Accuracy as World Models for AI Agent Training, Study Reveals

Researchers show that fine-tuned large language models can exceed 99% accuracy as world models in some text-based environments, which could let AI agents train on simulated experience.

Recent research has found that large language models (LLMs) can effectively simulate environments, addressing a significant challenge in the training of autonomous AI agents. Autonomous AI systems gain experience by interacting with real environments, yet those environments can be limited, difficult to replicate, and often too rigid for diverse learning. Researchers from the Southern University of Science and Technology, Microsoft Research, Princeton University, the University of Edinburgh, and others explored whether LLMs could serve as internal simulators, termed “world models,” so that agents can train on simulated experience rather than relying solely on real-world data.

A world model predicts the outcome of an action taken by an AI agent, allowing it to learn in a controlled, synthetic environment. This approach reframes language modeling from predicting the next word to forecasting the next state of the environment following an action. The researchers aimed to demonstrate that this capability allows LLMs to function as precise world simulators, potentially improving the efficiency of AI training.
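The idea of forecasting the next state from an action can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' implementation: the `TextWorldModel` class, the prompt layout, and the `stub_generate` function (a stand-in for a real fine-tuned LLM) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class TextWorldModel:
    """Wraps any text generator so that it predicts the next observation."""

    # Hypothetical interface: takes a prompt string, returns a completion.
    generate: Callable[[str], str]
    # (action, predicted observation) pairs seen so far in this episode.
    history: List[Tuple[str, str]] = field(default_factory=list)

    def build_prompt(self, initial_observation: str, action: str) -> str:
        """Lay out the episode so far, ending where the next state belongs."""
        lines = [f"Observation: {initial_observation}"]
        for past_action, past_observation in self.history:
            lines.append(f"Action: {past_action}")
            lines.append(f"Observation: {past_observation}")
        lines.append(f"Action: {action}")
        lines.append("Observation:")
        return "\n".join(lines)

    def step(self, initial_observation: str, action: str) -> str:
        """Predict the observation that follows `action` and record it."""
        prompt = self.build_prompt(initial_observation, action)
        prediction = self.generate(prompt).strip()
        self.history.append((action, prediction))
        return prediction


def stub_generate(prompt: str) -> str:
    """Stand-in for an LLM: answers from a lookup keyed on the last action."""
    last_action = prompt.rsplit("Action: ", 1)[1].split("\n", 1)[0]
    canned = {
        "open fridge": "The fridge is open. You see an apple.",
        "take apple": "You pick up the apple.",
    }
    return canned.get(last_action, "Nothing happens.")


world_model = TextWorldModel(generate=stub_generate)
print(world_model.step("You are in a kitchen.", "open fridge"))
print(world_model.step("You are in a kitchen.", "take apple"))
```

An agent could call `step` in place of a real environment; swapping `stub_generate` for a call to an actual model is the only change the sketch would need.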

The study evaluated LLMs across five different text-based environments: ALFWorld, where agents perform household tasks; SciWorld, a simulation for scientific experiments; TextWorld, which presents narrative puzzles; WebShop, a shopping site where agents search for products; and StableToolBench, focused on API tool usage. This diverse set of environments provided a mix of structured tasks with clear rules and more variable scenarios, allowing the team to assess the models’ predictive accuracy over longer sequences, their scalability with increased data, and their practical utility in actual training scenarios.

Initial findings showed that pre-trained models had some capacity for modeling environments, with Claude-sonnet-4.5 achieving 77 percent accuracy in predicting outcomes in the household tasks of ALFWorld when given just three examples. However, this accuracy was insufficient for more complicated tasks. The breakthrough came with additional fine-tuning on real interaction data, which enabled models like Qwen2.5-7B and Llama-3.1-8B to exceed 99 percent accuracy in ALFWorld and to reach approximately 98.6 percent in SciWorld and around 70 percent in TextWorld.
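As a rough illustration of what such an accuracy figure measures, the sketch below scores predicted next states against the states a real environment returned. Exact match after whitespace and case normalization is an assumption made here for simplicity; the article does not specify how the study scores a prediction, and the function names are hypothetical.

```python
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def next_state_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of steps where the predicted state matches the real one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        return 0.0
    matches = sum(normalize(p) == normalize(a) for p, a in zip(predicted, actual))
    return matches / len(predicted)


# Illustrative strings only, not data from the study.
predicted_states = [
    "The fridge is open.",
    "You pick up the apple.",
    "The drawer is closed.",
    "You see a mug.",
]
actual_states = [
    "The fridge is open.",
    "You pick up the  apple.",
    "The drawer is open.",
    "You see a mug.",
]
print(next_state_accuracy(predicted_states, actual_states))
```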

Longer action sequences also maintained high reliability. In structured environments, the consistency ratio surpassed 90 percent, indicating that plans developed through the world model succeeded in the real environments at rates comparable to those achieved through direct interaction. However, the WebShop e-commerce simulation presented more challenges, with consistency rates averaging around 70 percent and varying significantly between agents. When simulated processes were initialized with real observations, consistency improved dramatically, nearing 100 percent even with a GPT-4o agent.
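One plausible reading of the consistency ratio is the success rate of world-model plans when replayed in the real environment, divided by the success rate of direct interaction. The sketch below computes it under that assumption; the exact definition used in the study may differ, and the function names and outcome lists are illustrative only.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Share of episodes that ended in success."""
    if not outcomes:
        raise ValueError("need at least one episode")
    return sum(outcomes) / len(outcomes)


def consistency_ratio(replayed: Sequence[bool], direct: Sequence[bool]) -> float:
    """Success of world-model plans in the real environment, relative to
    the success of the same agent interacting with that environment directly."""
    direct_rate = success_rate(direct)
    if direct_rate == 0:
        raise ValueError("direct success rate is zero; ratio is undefined")
    return success_rate(replayed) / direct_rate


# Illustrative outcomes only: 9 of 20 replayed plans succeed, 10 of 20 direct runs do.
replayed_outcomes = [True] * 9 + [False] * 11
direct_outcomes = [True] * 10 + [False] * 10
print(round(consistency_ratio(replayed_outcomes, direct_outcomes), 2))
```

A ratio near 1.0 would mean that planning inside the simulator costs the agent little or nothing once its plans meet the real environment.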

As the researchers explored scaling, they identified distinct patterns. For structured environments, accuracy plateaued after about 20,000 training trajectories, which are recorded sequences of agent actions. In contrast, open environments, such as the shopping site, kept improving as data increased, up to 70,000 trajectories. Similar scaling effects were observed in model size: while 1.5-billion-parameter models performed well in structured settings, more complex scenarios required larger models. The findings underscore that both data volume and model size must scale with the complexity of the environment for effective world modeling.
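To make the notion of a training trajectory concrete, the sketch below turns one recorded episode into supervised examples, each asking a model to produce the next observation given everything before it. The record layout, the prompt format, and the `trajectory_to_examples` name are assumptions for illustration and do not describe the study's actual data pipeline.

```python
from typing import Dict, List, Tuple


def trajectory_to_examples(
    initial_observation: str,
    steps: List[Tuple[str, str]],
) -> List[Dict[str, str]]:
    """Convert one trajectory into (prompt, target) pairs.

    `steps` holds (action, next_observation) pairs in the order they occurred.
    Each example's prompt contains the full episode up to the current action.
    """
    examples = []
    context = f"Observation: {initial_observation}"
    for action, next_observation in steps:
        prompt = f"{context}\nAction: {action}\nObservation:"
        examples.append({"prompt": prompt, "target": next_observation})
        # Extend the context with the real outcome before the next step.
        context = f"{prompt} {next_observation}"
    return examples


# Illustrative episode only.
examples = trajectory_to_examples(
    "You are in a kitchen.",
    [
        ("open fridge", "The fridge is open. You see an apple."),
        ("take apple", "You pick up the apple."),
    ],
)
for example in examples:
    print(example["prompt"])
    print("->", example["target"])
```

Under this layout a trajectory with n actions yields n training examples, which is one reason trajectory counts and example counts should not be confused when comparing dataset sizes.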

This research feeds into a growing debate about the future direction of AI, echoing concerns raised by Turing Award winner Richard Sutton. He has stated that the AI industry is at a crossroads, arguing for a shift towards continuous learning from experience rather than relying on pre-existing knowledge. In “Welcome to the Era of Experience,” an essay co-authored with DeepMind researcher David Silver, Sutton advocates for AI agents that learn from their own experiences, with world models playing a crucial role as internal simulators.

While this study provides empirical evidence that LLMs can simulate environmental dynamics, it does not fully address Sutton’s concern that AI systems need to learn continuously without forgetting past knowledge, which he regards as essential for achieving true intelligence.

Written by AiPressa Staff

© 2025 AIPressa · Part of Buzzora Media · All rights reserved.