‘Embarrassingly simple’ probe finds AI in medical image diagnosis ‘worse than random’

ylai ( @ylai@lemmy.ml ) · 5 months ago


spaduf ( @spaduf@slrpnk.net ) · edit-2 5 months ago

Yeah, this is probably just straight-up misinformation. By no means is a diagnosis going to be made by a generalist multimodal LLM. Diagnosis is literally a binary classification (although that is an oversimplification), and in medical CV you are optimizing for that directly.

snooggums ( @snooggums@midwest.social ) · edit-2 5 months ago

They did not use an LLM.

In a recent experiment, they set out to determine how reliable LMMs are in medical diagnosis — asking both general and more specific diagnostic questions — as well as whether models were even being evaluated correctly for medical purposes.

Curating a new dataset and asking state-of-the-art models questions about X-rays, MRIs and CT scans of human abdomens, brain, spine and chests, they discovered “alarming” drops in performance.

Thorry84 ( @Thorry84@feddit.nl ) · 5 months ago

You’ve quoted them stating they used LLMs while claiming they did not use an LLM? What am I missing here?

everett ( @everett@lemmy.ml ) · 5 months ago

What am I missing here?

“L” “M” “M”

spaduf ( @spaduf@slrpnk.net ) · 5 months ago

Which in this context just means multimodal LLM, correct?

blindsight ( @blindsight@beehaw.org ) · 5 months ago

Correct.

large language models (LLM) vs. large multi-modal models (LMM)

Regardless, they both use an LLM as the main driver. Multimodal just means that the LLM is interfaced with generative and/or predictive AIs for other types of content, like images, sound, and video.

This is using a generalist tool for a specialized job. I’d expect the limit for LMMs is telling you whether your picture is a heart or a kidney… Maybe. With low accuracy. Diagnosing? lol, hell no.