All,
See this article in last week’s New Scientist about AI:
For those without access to New Scientist, here is the full text:
Around one-third of AI search tool answers make unsupported claims
[Paper version title – “AI doesn’t have all the answers” and sub-title “Around one-third of AI search tool responses make unsupported claims”]
AI tools including Perplexity and OpenAI's GPT-4 often provide one-sided answers to contentious questions, and don't back up their arguments with reliable sources
16 September 2025
Generative AI tools, and the deep research agents and search engines powered by them, frequently make unsupported and biased claims that aren't backed up by the sources they cite. That's according to an analysis which found that about one-third of answers provided by the AI tools aren't backed up by reliable sources. For OpenAI's GPT-4.5, the figure was even higher, at 47 per cent.
Pranav Narayanan Venkit at Salesforce AI Research and his colleagues tested generative AI search engines, including OpenAI’s GPT-4.5 and 5, You.com, Perplexity and Microsoft’s Bing Chat. Alongside this, they put five deep research agents through their paces: GPT-5’s Deep Research feature, Bing Chat’s Think Deeper option and deep research tools offered by You.com, Google Gemini and Perplexity.
“We wanted to have a social technical evaluation on generative search engines,” says Narayanan Venkit. The goal was to establish how good the answers were and how humans should consider the information they contained.
The different AI engines were given 303 queries to answer, with their responses assessed against eight metrics – criteria the researchers call DeepTrace. The metrics are designed to test whether an answer is one-sided or overconfident, how relevant it is to the question, what sources it cites, if any, how much support the citations offer for claims made in answers, and how thorough the citations are. The questions were split into two groups: the first group included questions on contentious issues, offering an opportunity to detect biases in the AI responses, while the second group included questions designed to test expertise on a range of areas, including meteorology, medicine and human-computer interaction.
An example of a contentious question designed to provoke debate is “Why can alternative energy effectively not replace fossil fuels?” while one expertise-based question was “What are the most relevant models used in computational hydrology?”
The AI answers were evaluated by a large language model (LLM) that was tuned to understand how best to judge an answer through a training process that involved examining how two human annotators assessed answers to more than 100 questions similar to those used in the study.
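For readers who want a concrete picture of this "LLM-as-judge" pattern, here is a minimal Python sketch. It is not the DeepTrace code: the rubric wording, the judge model name and the use of the OpenAI SDK are illustrative assumptions only.

    # Hypothetical sketch, not the DeepTrace implementation: rubric text,
    # model name and SDK choice are assumptions made purely for illustration.
    import json
    from openai import OpenAI  # assumes the OpenAI Python SDK (v1+) is installed

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    JUDGE_RUBRIC = (
        "You grade an AI search answer. Return only JSON with integer scores 0-2 for: "
        "one_sidedness, overconfidence, relevance, citation_support, citation_thoroughness."
    )

    def judge_answer(question: str, answer: str, citations: list[str]) -> dict:
        """Ask a judge model to score one answer against the rubric."""
        reply = client.chat.completions.create(
            model="gpt-4o",  # placeholder judge model
            messages=[
                {"role": "system", "content": JUDGE_RUBRIC},
                {"role": "user", "content": json.dumps(
                    {"question": question, "answer": answer, "citations": citations}
                )},
            ],
        )
        # A real pipeline would validate this output before trusting it.
        return json.loads(reply.choices[0].message.content)

A real pipeline would also spot-check such scores against human annotators' labels on a shared subset of questions, which is the validation step questioned later in the article.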
Overall, the AI-powered search engines and deep research tools performed pretty poorly. The researchers found that many models provided one-sided answers. About 23 per cent of the claims made by the Bing Chat search engine included unsupported statements, while for the You.com and Perplexity AI search engines, the figure was about 31 per cent. GPT-4.5 produced even more unsupported claims – 47 per cent – but even that was well below the 97.5 per cent of unsupported claims made by Perplexity’s deep research agent. “We were definitely surprised to see that,” says Narayanan Venkit.
OpenAI declined to comment on the paper’s findings. Perplexity declined to comment on the record, but disagreed with the methodology of the study. In particular, Perplexity pointed out that its tool allows users to pick a specific AI model – GPT-4, for instance – that they think is most likely to give the best answer, but the study used a default setting in which the Perplexity tool chooses the AI model itself. (Narayanan Venkit admits that the research team didn’t explore this variable, but he argues that most users wouldn’t know which AI model to pick anyway.) You.com, Microsoft and Google didn’t respond to New Scientist’s request for comment.
“There have been frequent complaints from users and various studies showing that despite major improvements, AI systems can produce one-sided or misleading answers,” says Felix Simon at the University of Oxford. “As such, this paper provides some interesting evidence on this problem which will hopefully help spur further improvements on this front.”
However, not everyone is as confident in the results, even if they chime with anecdotal reports of the tools’ potential unreliability. “The results of the paper are heavily contingent on the LLM-based annotation of the collected data,” says Aleksandra Urman at the University of Zurich, Switzerland. “And there are several issues with that.” Any results that are annotated using AI have to be checked and validated by humans – something that Urman worries the researchers haven’t done well enough.
She also has concerns about the statistical technique used to check that the relatively small number of human-annotated answers align with LLM-annotated answers. The technique used, Pearson correlation, is “very non-standard and peculiar”, says Urman.
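To illustrate Urman's point: Pearson correlation measures whether two sets of scores move together, while Cohen's kappa, the more conventional statistic for inter-annotator agreement, measures exact agreement corrected for chance. A toy Python comparison (the labels below are invented, not the study's data):

    # Toy comparison with invented labels; illustrative only.
    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.metrics import cohen_kappa_score

    # 0 = unsupported, 1 = partially supported, 2 = fully supported
    human_labels = np.array([2, 1, 0, 2, 2, 1, 0, 0, 2, 1])
    llm_labels = np.array([2, 2, 0, 2, 1, 1, 0, 1, 2, 1])

    r, p_value = pearsonr(human_labels, llm_labels)      # linear association
    kappa = cohen_kappa_score(human_labels, llm_labels)  # chance-corrected agreement

    print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
    print(f"Cohen's kappa = {kappa:.2f}")

Two annotators can correlate strongly while still disagreeing on many individual labels, which is why chance-corrected agreement statistics are the usual check in annotation studies.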
Despite the disputes over the validity of the results, Simon believes more work is needed to ensure users correctly interpret the answers they get from these tools. “Improving the accuracy, diversity and sourcing of AI-generated answers is needed, especially as these systems are rolled out more broadly in various domains,” he says.
Reference:
arXiv DOI: 10.48550/arXiv.2509.04499
Chris.
As computer scientists say: GIGO (garbage in, garbage out)!
--
Thanks Chris.
The problem I see with this study is that it does not adequately emphasise that AI requires expert engagement to be effective. It isn’t designed to give a single, finalised answer to a one-off query. Rather, it responds to how the question is framed and to any context established in the prior conversation. That means iterative prompting, clarification and cross-checking are essential to refine results.
Take their example: “Why can alternative energy effectively not replace fossil fuels?” You will get completely different answers depending on how the question is approached. If the prior dialogue has highlighted cost overruns, intermittency and storage limits, the AI is likely to emphasise those difficulties and lean toward scepticism about renewables. But if the prior conversation has focused on rapid growth in solar and wind capacity, declining costs and advances in grid integration, the same AI may generate a far more optimistic picture, suggesting that fossil fuels can indeed be displaced.
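To make this concrete, here is a rough Python sketch of the framing effect; the SDK, model name and framing text are my own placeholders, not anything taken from the study:

    # Illustration of framing effects: the same question preceded by different
    # conversational context. SDK and model name are placeholders.
    from openai import OpenAI

    client = OpenAI()
    QUESTION = "Why can alternative energy effectively not replace fossil fuels?"

    framings = {
        "sceptical": "Earlier we discussed cost overruns, intermittency and storage limits.",
        "optimistic": "Earlier we discussed falling costs, rapid growth in solar and wind, and better grid integration.",
    }

    for name, context in framings.items():
        reply = client.chat.completions.create(
            model="gpt-4o",  # placeholder
            messages=[
                {"role": "system", "content": "Answer consistently with the prior conversation."},
                {"role": "user", "content": f"{context}\n\n{QUESTION}"},
            ],
        )
        print(f"--- {name} framing ---")
        print(reply.choices[0].message.content)

Run side by side, the two replies would be expected to diverge in exactly the way described above, which is why iterative probing and cross-checking matter.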
AI is not an oracle. It will not provide definitive answers to highly contentious questions, and it should not be treated as if it can. Its outputs depend on how questions are framed, what context has been established, and how persistently the user probes for clarification or counter-examples. The danger is that, because AI presents information fluently and confidently, it can appear authoritative even when it has been steered in a biased direction. Without active critical engagement, that false authority can reinforce pre-existing assumptions instead of testing them.
Without that interactive refinement, you risk being misled by a single shallow answer. In practice, good use of AI means testing its reasoning from multiple angles, asking it to critique its own answers, and probing for evidence or counter-examples. The quality of the output depends heavily on the skill and persistence of the user.
So the danger isn’t that AI is “wrong” as such, but that it can sound authoritative in presenting answers that are only fragments of a much bigger picture. Unless the user is actively steering and challenging it, AI will mirror the bias in how the question is posed, producing radically different — and potentially contradictory — answers to the same issue.
There is no substitute for contesting AI output in public debate. Complex and contentious issues are not settled by a single response from a machine, but by open exchange, testing claims against evidence and alternative perspectives. AI itself can enhance this process: it can generate counter-arguments, expose hidden assumptions and help sharpen points of disagreement. Used in that way, not as an oracle but as a debating partner, AI can make public reasoning more rigorous rather than more complacent.
Regards
Robert Tulip