Can artificial intelligence (AI) tools handle the complexities of cancer care? A systematic evaluation from Shanghai rigorously examines this question by comparing two prominent Chinese large language models (LLMs) in the context of ovarian cancer diagnosis and treatment. The study, titled “Decoding AI Competence: Benchmarking Large Language Models (LLMs) in Ovarian Cancer Diagnosis and Treatment—A Systematic Evaluation of Generative AI Accuracy and Completeness,” was published in Diagnostics and presents one of the most structured assessments of generative AI systems in gynecologic oncology to date.
Conducted by a research team at the Shanghai First Maternity and Infant Hospital and affiliated institutions, the study utilized a controlled benchmark to evaluate how well these models align with established international clinical guidelines. Ovarian cancer, known for its high mortality rate and complex treatment pathways, demands precision across surgery, chemotherapy, genetic testing, and long-term follow-up.
The researchers designed a 20-question evaluation framework based on NCCN, FIGO, and ESMO guidelines, dividing the questions into four equally weighted domains: Risk Factors and Prevention, Surgical Management, Medical Treatment, and Surveillance. The two models answered the same set of questions independently, with responses submitted in separate sessions to minimize interaction bias. Five senior gynecologic oncology chief physicians rated the responses on a 10-point scale, collecting a total of 200 expert ratings.
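The arithmetic behind the study’s rating counts follows directly from this design. The sketch below is illustrative only (not the authors’ code), and the exact cutoff convention for an “Excellent” rating is our assumption:

```python
# Illustrative sketch of how the study's rating tallies arise from its design.
N_QUESTIONS = 20          # benchmark questions, 5 per domain
N_RATERS = 5              # senior gynecologic oncology chief physicians
N_MODELS = 2              # DeepSeek-R1 and Doubao-1.5-pro
EXCELLENT_THRESHOLD = 7   # on a 10-point scale; the >= convention is our assumption

# Each rater scores every question for each model.
ratings_per_model = N_QUESTIONS * N_RATERS   # 100 ratings per model
total_ratings = ratings_per_model * N_MODELS # 200 expert ratings overall

def excellent_rate(scores, threshold=EXCELLENT_THRESHOLD):
    """Fraction of ratings at or above the Excellent threshold."""
    return sum(s >= threshold for s in scores) / len(scores)
```

Under this scheme, a model’s domain-level performance is simply the Excellent rate computed over the 25 ratings (5 questions × 5 raters) in that domain.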
The results revealed a significant advantage for DeepSeek-R1, which received 98 “Excellent” ratings out of 100, and all of whose responses surpassed the seven-point threshold. In contrast, Doubao-1.5-pro garnered only 41 Excellent ratings, with just nine of its 20 answers exceeding the seven-point mark. The performance gap was particularly pronounced in the Medical Treatment domain, where DeepSeek-R1 achieved near-universal excellence.
While Doubao-1.5-pro showed relative strength in Risk Factors and Prevention—addressing questions about BRCA mutation testing and family history assessment—it faltered in Medical Treatment and Surveillance, with only 12 percent of its ratings reaching the Excellent threshold in Medical Treatment. Statistical testing underscored these discrepancies: DeepSeek-R1 performed consistently across all domains, whereas Doubao-1.5-pro’s scores indicated uneven knowledge depth.
The models were also compared question by question across all 20 items. DeepSeek-R1 outperformed Doubao-1.5-pro in 19 cases; Doubao-1.5-pro edged ahead only on a single surgical protocol question. Despite the generally detailed and structured responses from both models, the researchers noted specific inaccuracies and omissions in DeepSeek-R1’s outputs. For instance, it oversimplified indications for surgical eligibility and applied staging language more narrowly than current guidelines permit.
While these errors were considered minor, they highlight a critical concern: even high-performing LLMs can incorporate outdated or oversimplified interpretations due to static training data. The study’s authors attributed some discrepancies to insufficiently updated datasets, a known limitation in medical AI systems. Doubao-1.5-pro exhibited broader weaknesses, providing general medical explanations rather than professional clinical guidance and lacking crucial decision-making criteria in high-risk areas.
Despite these limitations, DeepSeek-R1 shows potential as a supplementary educational tool and assistive clinical support system. Its strengths in risk assessment, treatment planning, and follow-up management suggest that high-performing LLMs could enhance information synthesis and guideline referencing in clinical settings. However, the study firmly concludes that LLMs are not yet ready for independent clinical deployment, emphasizing that human clinicians must retain ultimate responsibility for diagnostic and therapeutic decisions.
The research advocates for ongoing model updates, integration of guideline-based retrieval systems, and multidimensional safety assessments. Developers are urged to address issues such as hallucination risks, outdated references, and excessive verbosity to enhance model clarity and communication. Future research directions include testing output stability through repeated responses, evaluating adaptability to varied question phrasing, and including leading international models for broader benchmarking.
While the findings present a clear picture of the strengths and weaknesses of these AI models, the authors also acknowledge limitations in their study design, such as a narrow focus on just 20 questions and a single institution’s expert ratings. The subjective nature of the seven-point excellence threshold also invites scrutiny. Overall, this research contributes to the evolving discourse on the role of AI in oncology, emphasizing that while the technology presents promising possibilities, it is not yet a substitute for human expertise.