
Be cautious about asking AI for advice on when to see a doctor
Chong Kee Siong/Getty Images
Should you see a doctor about your sore throat? AI’s advice may depend on how carefully you typed your question. When artificial intelligence models were tested on simulated writing from would-be patients, they were more likely to advise against seeking medical care if the writer made typos, included emotional or uncertain language – or was female.
“Insidious bias can shift the tenor and content of AI advice, and that can lead to subtle but important differences” in how medical resources are distributed, says Karandeep Singh at the University of California, San Diego, who was not involved in the study.
Abinitha Gourabathina at the Massachusetts Institute of Technology and her colleagues used AI to help create thousands of patient notes in different formats and styles. For example, some messages included extra spaces and typos to mimic patients with limited English proficiency or less ease with typing. Other notes used uncertain language in the style of writers with health anxiety, colourful expressions that lent a dramatic or emotional tone, or gender-neutral pronouns.
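The paper details how these note variants were generated; purely as a loose illustration (the function names, substitution rules and example note below are invented for this article, not taken from the study), surface-level perturbations of that kind could be produced with a few lines of Python:

```python
import random

def add_typos_and_spacing(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Crudely mimic typos and stray whitespace: occasionally swap
    adjacent letters and double some spaces."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    out = []
    for ch in chars:
        out.append(ch)
        if ch == " " and rng.random() < rate:
            out.append(" ")  # stray extra space
    return "".join(out)

def add_uncertain_tone(text: str) -> str:
    """Prefix the message with hedging language, in the style of an anxious writer."""
    return "I'm not sure this is worth asking about, but... " + text + " Maybe it's nothing?"

# Hypothetical patient message, not from the study's dataset
note = "I have had a sore throat and a mild fever for three days."
variants = {
    "baseline": note,
    "typos_and_spacing": add_typos_and_spacing(note),
    "uncertain_tone": add_uncertain_tone(note),
}
```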
The researchers then fed the notes to four large language models (LLMs) commonly used to power chatbots and asked each model whether the patient should manage their condition at home or visit a clinic, and whether the patient should receive certain lab tests and other medical resources. The models were OpenAI’s GPT-4, Meta’s Llama-3-70b and Llama-3-8b, and Palmyra-Med, a model developed for the healthcare industry by the AI company Writer.
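The study’s exact prompts and evaluation pipeline are described in the paper; as a hedged sketch only, assuming the OpenAI Python SDK and an invented triage prompt (neither the prompt wording nor the helper function below comes from the study), the querying step might look something like this:

```python
from openai import OpenAI  # assumes the openai package is installed and OPENAI_API_KEY is set

client = OpenAI()

# Hypothetical prompt wording, not the study's actual instructions
TRIAGE_PROMPT = (
    "You are assisting with triage. Based on the patient message below, "
    "answer with exactly one word, HOME or CLINIC, indicating whether the "
    "patient should manage the condition at home or visit a clinic.\n\n"
    "Patient message: {note}"
)

def triage_recommendation(note: str, model: str = "gpt-4") -> str:
    """Ask a chat model for a one-word triage recommendation."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": TRIAGE_PROMPT.format(note=note)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Comparing the answer for a baseline note against each perturbed variant is
# one way to check whether surface style alone shifts the recommendation.
```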
The tests showed that the various format and style changes made all the AI models between 7 and 9 per cent more likely to recommend that patients stay at home instead of seeking medical attention. The models were also more likely to recommend that female patients remain at home, and follow-up research showed they were more likely than human clinicians to change their treatment recommendations based on the gender and language style of the messages.
OpenAI and Meta did not respond to a request for comment. Writer does not “recommend or support” using LLMs – including the company’s Palmyra-Med model – for clinical decisions or health advice “without a human in the loop”, says Zayed Yasin at Writer.
Most operational AI tools currently used in electronic health record systems rely on OpenAI’s GPT-4o, which was not specifically studied in this research, says Singh. But he says one big takeaway from the study is the need for improved ways to “evaluate and monitor generative AI models” used in the healthcare industry.
Journal reference:
FAccT ’25: Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency DOI: 10.1145/3715275.3732121
Topics:
- artificial intelligence
- medical technology
- health