
Human Researchers Outperform AI in Medical Systematic Reviews, Study Finds

Human researchers still outperform large language models in systematic literature reviews: although one LLM reached 93% accuracy in data extraction, none produced a satisfactory final manuscript.

A recent study published in the journal Scientific Reports highlights the continuing superiority of human researchers over large language models (LLMs) in conducting systematic literature reviews. The findings emphasize that while LLMs have shown impressive capabilities in various applications, they are best utilized as supervised support tools rather than as independent authors for critical research tasks.

Large language models, which employ deep learning techniques to generate human-like text, have gained significant traction since the debut of OpenAI’s ChatGPT in 2022. These models are now frequently employed in sectors such as healthcare and education for their ability to interpret and generate text, with applications ranging from language translation to medical report drafting. Despite their rapid adoption, the potential risks and challenges associated with their integration into scientific research demand careful consideration.

The study aimed to assess whether LLMs could outperform human researchers in systematic literature reviews—a fundamental process in evidence-based medicine. Researchers compared the outputs of six different LLMs against an original systematic review conducted by human experts. The evaluation included tasks such as literature searches, article screening and selection, data extraction, and the final drafting of the review, with each task repeated to monitor improvements over time.

Among the LLMs tested, Gemini excelled in the initial literature search and selection phase, successfully identifying 13 out of 18 articles that human researchers included in their review. However, the study revealed significant limitations in the LLMs’ performance across other tasks, particularly in data summarization and drafting the final manuscript. These shortcomings are likely tied to the restricted access that many LLMs have to scientific article databases and the limited scope of their training datasets, which often lack sufficient original research articles.

Despite these challenges in the initial search task, the LLMs extracted relevant articles faster than the human researchers did, suggesting potential utility for preliminary literature screening. During the data extraction and analysis phase, DeepSeek achieved an overall accuracy of 93%, but it required complex prompts and multiple file uploads to yield results, a clear sign of inefficiency relative to human effort.

When it came to drafting the final manuscript, none of the LLMs produced fully satisfactory content. The generated articles often failed to adhere to the structured format required for systematic reviews, yielding outputs that, while well organized and written in correct scientific language, lacked the depth and nuance expected of expert analysis. Such outputs could mislead readers unfamiliar with the rigorous standards of systematic reviews and meta-analyses.

Overall, the study concludes that modern LLMs are not yet capable of independently generating systematic reviews in the medical domain without the aid of well-designed prompts. However, the incremental improvements observed between evaluation rounds suggest that, under appropriate supervision, LLMs could serve as valuable adjuncts in certain aspects of the review process. Recent evidence supports the notion that guided prompting strategies can enhance LLM performance in specific review tasks.

The study's scope, a single systematic review in the medical field, may limit the generalizability of the findings. Further research evaluating multiple systematic reviews across biomedical and non-biomedical disciplines is needed to strengthen the robustness and external validity of the results. As the integration of AI tools continues to evolve, understanding their strengths and limitations will be pivotal for advancing research practice in an increasingly technology-driven landscape.

For more details, refer to the study by Sollini et al., published in Scientific Reports, DOI: 10.1038/s41598-025-28993-5.

Written by the AiPressa Staff.
