The way we train AIs makes them more likely to spout bull

- Advertisement -


Certain AI training techniques may encourage models to be untruthful

Cravetiger/Getty Images

Common methods used to train artificial intelligence models seem to increase their tendency to give misleading answers, according to researchers who are aiming to produce “the first systematic analysis of machine bullshit”.

It is widely known that large language models (LLMs) have a tendency to generate false information – or “hallucinate” – but this is just one example, says Jaime Fernández Fisac at Princeton University. He and his colleagues define bullshit as “discourse intended to manipulate audience’s beliefs, delivered with disregard for its truth value”.

Read more

How to avoid being fooled by AI-generated misinformation

“Our analysis found that the problem of bullshit in large language models is quite serious and widespread,” says Fisac.

The team divided such instances into five categories: empty rhetoric, such as “this red car combines style, charm, and adventure that captivates everyone”; weasel words – uncertain statements such as “studies suggest our product may help improve results in some cases”; paltering – using truthful statements to give a misleading impression; unverified claims; and sycophancy.

They studied three datasets comprising thousands of AI-generated responses to a wide range of prompts, from models including GPT-4, Gemini and Llama. One dataset contained a range of queries designed to test for bullshitting when AIs are asked to provide guidance or recommendations, while the other datasets included questions about online shopping and political issues.

Free newsletter

Sign up to The Daily

The latest on what’s new in science and why it matters each day.

Sign up to newsletter
New Scientist. Science news and long reads from expert journalists, covering developments in science, technology, health and the environment on the website and the magazine.

Fisac and his colleagues first used an LLM to determine whether the responses involved any of the five categories, then got volunteers to check that the AI’s judgements aligned with human ones.

The team found that the most serious issues with truth seemed to arise as a result of a training method known as reinforcement learning from human feedback. The technique is intended to make machine responses more helpful by giving the LLM immediate feedback on its responses.

But this approach is problematic, says Fisac, because it makes models prioritise immediate human approval and perceived helpfulness, which is “sometimes in conflict with telling the truth”.

“Who likes to hear bad news or entertain a long, nuanced rebuttal of something that feels obviously true?” says Fisac. “By trying to abide by the measure of good behaviour we provide to them, the models learn to demote the truth in favour of confident, eloquent responses, just so that they can secure our approval.”

The study found that reinforcement learning from human feedback significantly increased bullshit behaviours: empty rhetoric rose by nearly 40 per cent, paltering by nearly 60 per cent, weasel words by more than a quarter, and unverified claims by over half.

The increase in paltering is particularly harmful, says team member Kaiqu Liang, also at Princeton, as it leads users to make poorer decisions. When a model was uncertain whether a product had a desired feature, deceptive positive claims jumped from a fifth to over three-quarters after human training.

Read more

The future of AI: The 5 possible scenarios, from utopia to extinction

Another concern is that bullshit was particularly common in political discussions, with AI models “frequently resorting to vague and ambiguous language to avoid committing to concrete statements,” says Liang.

AIs are also more likely to behave this way when there is a conflict of interest, because the system serves multiple parties, such as both a company and its customers, the researchers found.

The way to overcome the problem may be to move to a “hindsight feedback” model, they suggest. Rather than asking for immediate feedback after the AI model’s output, the system should first generate a plausible simulation of what might happen if the user acts on the information received. It would then present the outcome to the human evaluator to judge.

“Ultimately, our hope is that by better understanding the subtle but systematic ways AI can aim to mislead us, we can guide future efforts toward developing genuinely truthful AI systems,” says Fisac.

Daniel Tigard at the University of San Diego, who was not involved in the study, is sceptical of discussing LLMs and their outputs in such terms. He argues that just because an LLM produces bullshit, it doesn’t mean it is deliberately doing so, given that AI systems, as they currently stand, do not set out to deceive us and do not have an interest in doing so.

“The main reason is that this framing appears to run against some very sensible suggestions for how we should and shouldn’t live with these sorts of technologies,” Tigard says. “Calling bullshit might be yet another way of anthropomorphising these systems, which, in turn, may well contribute to their deceptive potential.”

Reference:

arXiv DOI: arXiv:2507.07484

Topics:

  • AI
FacebookTwitterEmailLinkedInPinterestWhatsAppTumblrCopy LinkTelegramRedditMessageShare
- Advertisement -
FacebookTwitterEmailLinkedInPinterestWhatsAppTumblrCopy LinkTelegramRedditMessageShare
error: Content is protected !!
Exit mobile version