
Human Researchers Outperform AI in Medical Systematic Reviews, Study Finds

Human researchers still outperform large language models in systematic literature reviews: while one LLM reached 93% accuracy in data extraction, none produced a satisfactory final manuscript.

A recent study published in the journal Scientific Reports highlights the continuing superiority of human researchers over large language models (LLMs) in conducting systematic literature reviews. The findings emphasize that while LLMs have shown impressive capabilities in various applications, they are best utilized as supervised support tools rather than as independent authors for critical research tasks.

Large language models, which employ deep learning techniques to generate human-like text, have gained significant traction since the debut of OpenAI’s ChatGPT in 2022. These models are now frequently employed in sectors such as healthcare and education for their ability to interpret and generate text, with applications ranging from language translation to medical report drafting. Despite their rapid adoption, the potential risks and challenges associated with their integration into scientific research demand careful consideration.

The study aimed to assess whether LLMs could outperform human researchers in systematic literature reviews—a fundamental process in evidence-based medicine. Researchers compared the outputs of six different LLMs against an original systematic review conducted by human experts. The evaluation included tasks such as literature searches, article screening and selection, data extraction, and the final drafting of the review, with each task repeated to monitor improvements over time.

Among the LLMs tested, Gemini excelled in the initial literature search and selection phase, successfully identifying 13 out of 18 articles that human researchers included in their review. However, the study revealed significant limitations in the LLMs’ performance across other tasks, particularly in data summarization and drafting the final manuscript. These shortcomings are likely tied to the restricted access that many LLMs have to scientific article databases and the limited scope of their training datasets, which often lack sufficient original research articles.

Despite their limitations in the initial search task, the LLMs screened relevant articles faster than human researchers, suggesting potential utility for preliminary literature screening. In the data extraction and analysis phase, the model DeepSeek achieved an overall accuracy of 93%, but it required complex prompts and multiple file uploads to produce results, a clear inefficiency relative to human effort.

When it came to drafting the final manuscript, none of the LLMs produced fully satisfactory content. The generated articles, though well organized and written in correct scientific language, failed to adhere to the structured format required for systematic reviews and lacked the depth and nuance expected of expert analysis. Such polished but superficial outputs could mislead readers unfamiliar with the rigorous standards demanded of systematic reviews and meta-analyses.

Overall, the study concludes that modern LLMs are not yet capable of independently generating systematic reviews in the medical domain without the aid of well-designed prompts. However, the incremental improvements observed between evaluation rounds suggest that, under appropriate supervision, LLMs could serve as valuable adjuncts in certain aspects of the review process. Recent evidence supports the notion that guided prompting strategies can enhance LLM performance in specific review tasks.

The scope of this study, which examined only a single systematic review in the medical field, may limit the generalizability of the findings. Further research evaluating multiple systematic reviews across biomedical and non-biomedical disciplines is needed to strengthen the robustness and external validity of the results. As the integration of AI tools continues to evolve, understanding their strengths and limitations will be pivotal for advancing research practices in an increasingly technology-driven landscape.

For more details, refer to the study by Sollini et al., published in Scientific Reports, DOI: 10.1038/s41598-025-28993-5.

Written by AiPressa Staff

The AiPressa Staff team brings you comprehensive coverage of the artificial intelligence industry, including breaking news, research developments, business trends, and policy updates. Our mission is to keep you informed about the rapidly evolving world of AI technology.


© 2025 AIPressa · Part of Buzzora Media · All rights reserved.