Google’s AI search overviews, which rely on the company’s Gemini large-language model (LLM), reportedly contain significant inaccuracies, raising concerns within the tech community. A recent analysis conducted by AI startup Oumi and commissioned by the New York Times claims that while 91 percent of searches return accurate results, the remaining errors still translate to tens of millions of incorrect answers, given that Google processes over five trillion searches annually.
The volume of misinformation is alarming, with Futurism describing the situation as a potential “misinformation crisis.” Google spokesperson Ned Adriance has contested the report, calling it flawed. He criticized the methodology, which involved one AI grading another, and described the underlying test as an “old benchmark that is known for being full of errors” that, he argues, does not adequately reflect how people actually use Google Search.
The research used a system called SimpleQA, a benchmark from OpenAI that assesses how accurately an LLM can answer short, fact-based questions. Although OpenAI maintains that SimpleQA is reliable, its scope is narrow: it only measures questions with a single verifiable answer. As the report notes, whether performance on concise factual questions predicts the quality of longer, more comprehensive responses remains an open question.
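To make the contested methodology concrete, here is a minimal sketch of a SimpleQA-style evaluation loop. It is an illustration of the general LLM-as-judge pattern, not OpenAI’s actual harness: ask_model is a hypothetical stand-in for any chat-completion call, and the grading prompt and labels are simplified assumptions.

    # A minimal SimpleQA-style grading loop (illustrative sketch only).
    # ask_model is a hypothetical stand-in for any chat-completion call;
    # the prompt wording and labels are simplified, not OpenAI's harness.

    def ask_model(prompt: str) -> str:
        """Placeholder for a call to the model under test or to the grader."""
        raise NotImplementedError("wire this to an actual model API")

    def grade(question: str, gold: str, predicted: str) -> str:
        """Have a second model label an answer CORRECT, INCORRECT, or NOT_ATTEMPTED."""
        prompt = (
            f"Question: {question}\n"
            f"Reference answer: {gold}\n"
            f"Model answer: {predicted}\n"
            "Reply with one word: CORRECT, INCORRECT, or NOT_ATTEMPTED."
        )
        return ask_model(prompt).strip().upper()

    def accuracy(dataset: list[dict]) -> float:
        """Fraction of questions graded CORRECT. Note the two failure modes
        at issue in the article: the graded model can be wrong, but so can
        the grader or the 'gold' reference answer itself."""
        correct = sum(
            grade(item["question"], item["gold"], ask_model(item["question"])) == "CORRECT"
            for item in dataset
        )
        return correct / len(dataset)

Because both the answering step and the grading step go through a model, any systematic error in the grader, or in the reference answers themselves, propagates directly into the headline accuracy figure. That is precisely the weakness Google’s rebuttal points to.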
Despite the high accuracy rate reported, Oumi’s examination of Google’s AI revealed instances where verifiable questions resulted in incorrect responses. The report highlighted several factual errors, with the AI sometimes citing unreliable sources or misinterpreting information from credible sites. In some cases, while the initial answer was correct, the additional context provided was inaccurate. Furthermore, the AI’s susceptibility to manipulation was evident, as even a blog post could mislead it into recognizing someone as an expert in an unrelated field.
Google has also pointed out flaws within the SimpleQA framework itself, citing a study by researchers at Google DeepMind that identified incorrect “ground truths,” the reference answers the benchmark treats as verified facts. The company emphasized the irony of using one imperfect AI model to evaluate another, a point that raises broader questions about the reliability of AI assessment methods in general.
Adriance pointed to two specific examples from the New York Times report. In one, Gemini incorrectly stated that Bob Marley’s house became a museum in 1987, when the correct date is 1986. Google provided a screenshot of the Wikipedia entry Gemini drew on, which at the time contained conflicting dates; the entry has since been corrected and now consistently states 1986.
In another example, Gemini reportedly misidentified the Neuse River’s course in North Carolina, claiming it ran “west” of Goldsboro. Google contended that while the river primarily flows south, it does run southwest in places, making the answer “plausible” rather than entirely incorrect. The defense itself underscores how difficult it is for AI systems to capture nuanced geographical information.
The ongoing debate over AI accuracy highlights how high the stakes have become for platforms like Google, which underpin much of the web’s information ecosystem. As scrutiny of AI systems continues, the industry must grapple with the balance between innovation and reliability, ensuring that users receive accurate information in an era increasingly reliant on automated systems.