AI Research

Human Researchers Outperform AI in Medical Systematic Reviews, Study Finds

Human researchers outperform large language models in systematic literature reviews: one LLM reached 93% accuracy in data extraction, but none produced a satisfactory final manuscript.

Staff

Published

7 December, 2025

A recent study published in the journal Scientific Reports highlights the continuing superiority of human researchers over large language models (LLMs) in conducting systematic literature reviews. The findings emphasize that while LLMs have shown impressive capabilities in various applications, they are best utilized as supervised support tools rather than as independent authors for critical research tasks.

Large language models, which employ deep learning techniques to generate human-like text, have gained significant traction since the debut of OpenAI’s ChatGPT in 2022. These models are now frequently employed in sectors such as healthcare and education for their ability to interpret and generate text, with applications ranging from language translation to medical report drafting. Despite their rapid adoption, the potential risks and challenges associated with their integration into scientific research demand careful consideration.

The study aimed to assess whether LLMs could outperform human researchers in systematic literature reviews—a fundamental process in evidence-based medicine. Researchers compared the outputs of six different LLMs against an original systematic review conducted by human experts. The evaluation included tasks such as literature searches, article screening and selection, data extraction, and the final drafting of the review, with each task repeated to monitor improvements over time.

Among the LLMs tested, Gemini excelled in the initial literature search and selection phase, successfully identifying 13 out of 18 articles that human researchers included in their review. However, the study revealed significant limitations in the LLMs’ performance across other tasks, particularly in data summarization and drafting the final manuscript. These shortcomings are likely tied to the restricted access that many LLMs have to scientific article databases and the limited scope of their training datasets, which often lack sufficient original research articles.
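To put the search result in proportion, the 13-of-18 figure can be read as a recall rate, meaning the share of the human-selected articles that the model also found. The short sketch below works through that arithmetic. The term "recall" and the function name are illustrative assumptions, not terminology taken from the study itself.

```python
def recall(found: int, total_relevant: int) -> float:
    """Fraction of the reference articles that a search recovered.

    Hypothetical helper for illustration; not part of the study's methods.
    """
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    return found / total_relevant


# Figures reported in the article: 13 of the 18 articles
# included by the human reviewers were identified by Gemini.
gemini_recall = recall(13, 18)
print(f"{gemini_recall:.1%}")  # prints 72.2%
```

In other words, the best-performing model in this phase still missed 5 of the 18 articles, or roughly 28% of the human reviewers' selection.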

Despite the gaps in the search phase, LLMs extracted relevant articles faster than human researchers, suggesting their potential utility for preliminary literature screening. During the data extraction and analysis phase, the model DeepSeek achieved an overall accuracy rate of 93%, but it required complex prompts and multiple uploads to yield results, a clear indicator of inefficiency relative to human efforts.

When it came to drafting the final manuscript, none of the LLMs succeeded in producing fully satisfactory content. The generated articles often did not follow the structured format required for systematic reviews. Although the text was well organized and used correct scientific language, it lacked the depth and nuance expected from expert analysis. That polished surface could mislead readers unfamiliar with the rigorous standards demanded in systematic reviews and meta-analyses.

Overall, the study concludes that modern LLMs are not yet capable of independently generating systematic reviews in the medical domain without the aid of well-designed prompts. However, the incremental improvements observed between evaluation rounds suggest that, under appropriate supervision, LLMs could serve as valuable adjuncts in certain aspects of the review process. Recent evidence supports the notion that guided prompting strategies can enhance LLM performance in specific review tasks.

The scope of this study, which focused solely on a single systematic review in the medical field, may limit the generalizability of the findings. Further research is needed to evaluate multiple systematic reviews across various biomedical and non-biomedical disciplines to strengthen the robustness and external validity of the results. As the integration of AI tools continues to evolve, understanding their strengths and limitations will be pivotal for advancing research practices in an increasingly technology-driven landscape.

For more details, refer to the study by Sollini et al., published in Scientific Reports, DOI: 10.1038/s41598-025-28993-5.
