Evaluations of frontier AI agents have produced a widely cited capability metric: the task-completion time horizon. The metric maps an agent's reliability onto task length, measured by how long human experts take to complete the same tasks. An agent's 50%-time horizon, for example, is the human task duration at which the agent is expected to succeed half the time. Recent analyses report both 50%- and 80%-time horizons for a range of AI agents, based on their performance across a large set of software tasks.
These time horizons are estimated by fitting a logistic curve that predicts an agent's probability of completing a task as a function of how long that task takes a human. The durations at which the fitted curve crosses 50% and 80% success are the respective time horizons. The fit draws on data from more than a hundred diverse software tasks spanning domains such as software engineering, machine learning, and cybersecurity.
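As a concrete illustration, the fitting step can be sketched in a few lines of Python. This is a minimal sketch, not the researchers' actual pipeline: the sigmoid parameterization over log2(duration), the plain gradient-descent fit, and the helper names (`fit_logistic`, `time_horizon`) are all illustrative assumptions.

```python
import math

def fit_logistic(durations, successes, lr=0.1, steps=20000):
    """Fit p(success) = sigmoid(a - b * log2(duration)) by gradient descent
    on the log-loss. Longer tasks should succeed less often, so b > 0."""
    xs = [math.log2(t) for t in durations]
    n = len(xs)
    a, b = 0.0, 1.0
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(xs, successes):
            p = 1.0 / (1.0 + math.exp(-(a - b * x)))
            err = p - y              # derivative of log-loss w.r.t. the logit
            ga += err / n
            gb += -err * x / n
        a -= lr * ga
        b -= lr * gb
    return a, b

def time_horizon(a, b, level):
    """Duration at which the fitted curve predicts success rate `level`
    (e.g. 0.5 for the 50%-time horizon, 0.8 for the 80% one)."""
    logit = math.log(level / (1.0 - level))
    return 2.0 ** ((a - logit) / b)
```

On synthetic data where an agent succeeds on tasks up to ~10 human-minutes and fails beyond that, the 50% crossing lands between the longest success and the shortest failure, and the 80%-time horizon is necessarily shorter than the 50% one.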
To derive human task duration estimates, contracted professionals attempt the tasks while following the same instructions provided to AI agents. Their completion times, aggregated through geometric means, form a baseline against which AI performance is measured. However, these estimates may overstate the actual time experienced professionals would take, as the evaluators often lack the contextual knowledge typically held by individuals in their everyday roles. For tasks lacking reliable human completion times, expert estimates or quality assurance data are utilized.
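The geometric-mean aggregation mentioned above is straightforward to sketch; the function name and the minutes unit here are illustrative assumptions, not the evaluators' actual code.

```python
import math

def geometric_mean(times):
    """Aggregate several baseliners' completion times (e.g. in minutes)
    into one human-duration estimate for a task."""
    assert times and all(t > 0 for t in times), "need positive durations"
    return math.exp(sum(math.log(t) for t in times) / len(times))
```

The geometric mean is a natural choice for durations because it averages in log-space, so a single unusually slow baseliner skews the estimate far less than an arithmetic mean would. (Python 3.8+ also ships `statistics.geometric_mean`, which behaves the same way.)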
It is crucial to clarify that the term “time horizon” does not imply the length of time AI agents can operate autonomously. Instead, the 50%-time horizon signifies the duration of tasks that AI agents can complete with a 50% reliability rate, reflecting task complexity rather than the actual time taken by AI during task execution. In practice, AI agents generally outperform humans in terms of speed, often completing tasks several times faster.
Agent performance depends on the specific task and on the agent's configuration. AI agents tend to succeed by taking fewer actions than humans and by completing steps such as writing code in a single attempt. The human baseliners who supply the task-duration measurements have, on average, around five years of relevant experience and degrees from top-tier universities, making them a reasonably strong reference point for task difficulty.
Despite the promising metrics, it is essential to recognize that the evaluated tasks are largely confined to software engineering, machine learning, and cybersecurity, and do not encompass the full spectrum of intellectual tasks performed in real-world scenarios. The evaluation results indicate that while AI capabilities exhibit exponential growth, their effectiveness remains uneven across various domains. This unevenness suggests that while an AI may achieve an 8-hour time horizon, it does not necessarily translate into the ability to automate all job functions.
Concerns about the reliability of these metrics arise from limitations in task design. Many real jobs involve complex, interdependent tasks whose success criteria elude algorithmic scoring. The evaluation tasks are therefore simpler than actual work environments, where prior context and collaboration with others play significant roles.
In their recent assessments, researchers have opted not to report time horizons at higher success rates such as 99%, owing to the difficulty of measuring them accurately: doing so would require many very short tasks, which complicates task design and the reliability of human baselines. The evaluation therefore centers on the more tractable 50%- and 80%-time horizons, which exhibit similar trends.
The process of evaluating time horizons involves a systematic approach that begins with setting up access to AI models, understanding their behavior, and eliciting their capabilities through a curated set of tasks. Following this initial phase, the evaluation expands into a larger test set, with multiple independent runs conducted to ensure reliability. The overall process typically spans several weeks, reflecting the complexities involved in accurately measuring AI performance.
This ongoing evaluation and reporting of time horizons for various AI models, including recent additions like Claude Opus 4.6 and GPT-5.3-Codex, highlight the dynamic landscape of AI capabilities. However, several notable models remain unassessed, indicating the limitations in coverage due to resource constraints. As the field continues to evolve, the implications of these findings will significantly influence discussions around AI autonomy and its role in various professional domains.