Can artificial intelligence (AI) tools handle the complexities of cancer care? A systematic evaluation from Shanghai rigorously examines this question by comparing two prominent Chinese large language models (LLMs) in the context of ovarian cancer diagnosis and treatment. The study, titled “Decoding AI Competence: Benchmarking Large Language Models (LLMs) in Ovarian Cancer Diagnosis and Treatment—A Systematic Evaluation of Generative AI Accuracy and Completeness,” was published in Diagnostics and presents one of the most structured assessments of generative AI systems in gynecologic oncology to date.
Conducted by a research team at the Shanghai First Maternity and Infant Hospital and affiliated institutions, the study utilized a controlled benchmark to evaluate how well these models align with established international clinical guidelines. Ovarian cancer, known for its high mortality rate and complex treatment pathways, demands precision across surgery, chemotherapy, genetic testing, and long-term follow-up.
The researchers designed a 20-question evaluation framework based on NCCN, FIGO, and ESMO guidelines, dividing the questions into four equally weighted domains: Risk Factors and Prevention, Surgical Management, Medical Treatment, and Surveillance. The two models answered the same set of questions independently, with responses submitted in separate sessions to minimize interaction bias. Five senior gynecologic oncology chief physicians rated the responses on a 10-point scale, collecting a total of 200 expert ratings.
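The arithmetic behind the study’s rating counts follows directly from this design. The sketch below is illustrative only (not the authors’ code), and the exact cutoff convention for an “Excellent” rating is our assumption:

```python
# Illustrative sketch of how the study's rating tallies arise from its design.
N_QUESTIONS = 20          # benchmark questions, 5 per domain
N_RATERS = 5              # senior gynecologic oncology chief physicians
N_MODELS = 2              # DeepSeek-R1 and Doubao-1.5-pro
EXCELLENT_THRESHOLD = 7   # on a 10-point scale; the >= convention is our assumption

# Each rater scores every question for each model.
ratings_per_model = N_QUESTIONS * N_RATERS   # 100 ratings per model
total_ratings = ratings_per_model * N_MODELS # 200 expert ratings overall

def excellent_rate(scores, threshold=EXCELLENT_THRESHOLD):
    """Fraction of ratings at or above the Excellent threshold."""
    return sum(s >= threshold for s in scores) / len(scores)
```

Under this scheme, a model’s domain-level performance is simply the Excellent rate computed over the 25 ratings (5 questions × 5 raters) in that domain.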
The results revealed a significant advantage for DeepSeek-R1, which received 98 “Excellent” ratings out of 100, and all of whose responses surpassed the seven-point threshold. In contrast, Doubao-1.5-pro garnered only 41 Excellent ratings, with just nine of its 20 answers exceeding the seven-point mark. The performance gap was particularly pronounced in the Medical Treatment domain, where DeepSeek-R1 achieved near-universal excellence.
While Doubao-1.5-pro showed relative strength in Risk Factors and Prevention—addressing questions about BRCA mutation testing and family history assessment—it faltered in Medical Treatment and Surveillance, with only 12 percent of its ratings reaching the Excellent threshold in Medical Treatment. Statistical testing underscored these discrepancies: DeepSeek-R1 performed consistently across all domains, whereas Doubao-1.5-pro’s scores indicated uneven knowledge depth.
The models were also compared question by question across all 20 items. DeepSeek-R1 outperformed Doubao-1.5-pro in 19 cases; Doubao-1.5-pro edged ahead only on a single surgical protocol question. Despite the generally detailed and structured responses from both models, the researchers noted specific inaccuracies and omissions in DeepSeek-R1’s outputs. For instance, it oversimplified indications for surgical eligibility and applied staging language more narrowly than current guidelines permit.
While these errors were considered minor, they highlight a critical concern: even high-performing LLMs can incorporate outdated or oversimplified interpretations due to static training data. The study’s authors attributed some discrepancies to insufficiently updated datasets, a known limitation in medical AI systems. Doubao-1.5-pro exhibited broader weaknesses, providing general medical explanations rather than professional clinical guidance and lacking crucial decision-making criteria in high-risk areas.
Despite these limitations, DeepSeek-R1 shows potential as a supplementary educational tool and assistive clinical support system. Its strengths in risk assessment, treatment planning, and follow-up management suggest that high-performing LLMs could enhance information synthesis and guideline referencing in clinical settings. However, the study firmly concludes that LLMs are not yet ready for independent clinical deployment, emphasizing that human clinicians must retain ultimate responsibility for diagnostic and therapeutic decisions.
The research advocates for ongoing model updates, integration of guideline-based retrieval systems, and multidimensional safety assessments. Developers are urged to address issues such as hallucination risks, outdated references, and excessive verbosity to enhance model clarity and communication. Future research directions include testing output stability through repeated responses, evaluating adaptability to varied question phrasing, and including leading international models for broader benchmarking.
While the findings present a clear picture of the strengths and weaknesses of these AI models, the authors also acknowledge limitations in their study design, such as a narrow focus on just 20 questions and a single institution’s expert ratings. The subjective nature of the seven-point excellence threshold also invites scrutiny. Overall, this research contributes to the evolving discourse on the role of AI in oncology, emphasizing that while the technology presents promising possibilities, it is not yet a substitute for human expertise.