
Human Researchers Outperform AI in Medical Systematic Reviews, Study Finds

Human researchers still outperform large language models in systematic literature reviews: while one LLM reached 93% accuracy in data extraction, none produced a satisfactory final manuscript.

A recent study published in the journal Scientific Reports highlights the continuing superiority of human researchers over large language models (LLMs) in conducting systematic literature reviews. The findings emphasize that while LLMs have shown impressive capabilities in various applications, they are best utilized as supervised support tools rather than as independent authors for critical research tasks.

Large language models, which employ deep learning techniques to generate human-like text, have gained significant traction since the debut of OpenAI’s ChatGPT in 2022. These models are now frequently employed in sectors such as healthcare and education for their ability to interpret and generate text, with applications ranging from language translation to medical report drafting. Despite their rapid adoption, the potential risks and challenges associated with their integration into scientific research demand careful consideration.

The study aimed to assess whether LLMs could outperform human researchers in systematic literature reviews—a fundamental process in evidence-based medicine. Researchers compared the outputs of six different LLMs against an original systematic review conducted by human experts. The evaluation included tasks such as literature searches, article screening and selection, data extraction, and the final drafting of the review, with each task repeated to monitor improvements over time.

Among the LLMs tested, Gemini excelled in the initial literature search and selection phase, successfully identifying 13 out of 18 articles that human researchers included in their review. However, the study revealed significant limitations in the LLMs’ performance across other tasks, particularly in data summarization and drafting the final manuscript. These shortcomings are likely tied to the restricted access that many LLMs have to scientific article databases and the limited scope of their training datasets, which often lack sufficient original research articles.

Despite their limitations in the initial search task, the LLMs screened relevant articles faster than human researchers, suggesting potential utility for preliminary literature screening. In the data extraction and analysis phase, the model DeepSeek achieved an overall accuracy of 93%, but it required complex prompts and multiple file uploads to produce results, a clear inefficiency relative to human effort.

When it came to drafting the final manuscript, none of the LLMs produced fully satisfactory content. The generated articles, though well organized and written in correct scientific language, failed to adhere to the structured format required for systematic reviews and lacked the depth and nuance expected of expert analysis. Such polished but superficial outputs could mislead readers unfamiliar with the rigorous standards demanded of systematic reviews and meta-analyses.

Overall, the study concludes that modern LLMs are not yet capable of independently generating systematic reviews in the medical domain without the aid of well-designed prompts. However, the incremental improvements observed between evaluation rounds suggest that, under appropriate supervision, LLMs could serve as valuable adjuncts in certain aspects of the review process. Recent evidence supports the notion that guided prompting strategies can enhance LLM performance in specific review tasks.

The scope of this study, which examined only a single systematic review in the medical field, may limit the generalizability of the findings. Further research evaluating multiple systematic reviews across biomedical and non-biomedical disciplines is needed to strengthen the robustness and external validity of the results. As the integration of AI tools continues to evolve, understanding their strengths and limitations will be pivotal for advancing research practices in an increasingly technology-driven landscape.

For more details, refer to the study by Sollini et al., published in Scientific Reports, DOI: 10.1038/s41598-025-28993-5.

Written by AiPressa Staff

The AiPressa Staff team brings you comprehensive coverage of the artificial intelligence industry, including breaking news, research developments, business trends, and policy updates. Our mission is to keep you informed about the rapidly evolving world of AI technology.


© 2025 AIPressa · Part of Buzzora Media · All rights reserved.