Hidden Flaws Behind Expert-Level Accuracy of Multimodal GPT-4 Vision in Medicine

The advent of multimodal AI, especially models like GPT-4 with Vision (GPT-4V), has brought striking new capabilities to medical diagnostics. Recent research, however, shows that while GPT-4V can outperform human physicians on certain challenge tasks, the rationales underlying its decisions often contain significant flaws.
This article examines those hidden flaws in GPT-4V’s performance, drawing on a detailed analysis published in npj Digital Medicine.
Expert-Level Performance with Caveats
GPT-4V has demonstrated high accuracy on multiple-choice medical questions, reaching 81.6% and surpassing the 77.8% achieved by human physicians in a closed-book setting.
Its performance is particularly noteworthy on questions that physicians often answer incorrectly, where GPT-4V maintained over 78% accuracy. However, the study underscores a crucial caveat: the rationales GPT-4V provides, even for correctly answered questions, frequently contain significant errors.
Evaluating GPT-4V’s Multimodal Capabilities
The research extends beyond mere accuracy, exploring GPT-4V’s abilities in three critical areas: image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning. Using 207 multiple-choice questions from the New England Journal of Medicine (NEJM) Image Challenge, the study provides a granular analysis of GPT-4V’s performance:
- Image Comprehension: This involves describing patient images accurately. GPT-4V, however, showed a high error rate (27.2%) in this area, often misinterpreting visual information.
- Medical Knowledge Recall: Here, the model must recall and apply relevant medical knowledge. This area proved more reliable than image comprehension, but errors still occurred in 8.9% of cases.
- Step-by-Step Reasoning: This evaluates the logical sequence GPT-4V uses to arrive at a conclusion. Reasoning errors were less frequent than image-comprehension errors (12.4%), but they still highlight room for improvement.
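To make this categorization concrete, here is a minimal, purely illustrative Python sketch of how per-category rationale-error rates could be tallied from human annotations of model outputs. The record structure, field names, and sample values are assumptions for illustration only and do not come from the study’s data.

```python
from dataclasses import dataclass

# Hypothetical annotation record for one image-challenge question.
# Field names and values are illustrative assumptions, not the study's schema.
@dataclass
class Annotation:
    answer_correct: bool    # did the model pick the right option?
    image_error: bool       # flawed image comprehension in the rationale
    knowledge_error: bool   # flawed recall of medical knowledge
    reasoning_error: bool   # flawed step-by-step reasoning

def error_rates(annotations: list[Annotation]) -> dict[str, float]:
    """Fraction of rationales containing each error type."""
    n = len(annotations)
    return {
        "image_comprehension": sum(a.image_error for a in annotations) / n,
        "knowledge_recall": sum(a.knowledge_error for a in annotations) / n,
        "reasoning": sum(a.reasoning_error for a in annotations) / n,
    }

# Example: restrict the tally to correctly answered questions, mirroring the
# study's observation that even correct answers can rest on flawed rationales.
sample = [
    Annotation(True, True, False, False),
    Annotation(True, False, False, True),
    Annotation(True, False, False, False),
    Annotation(False, True, True, False),
]
correct_only = [a for a in sample if a.answer_correct]
print(error_rates(correct_only))
```

Computing the rates separately for correctly and incorrectly answered questions is what exposes the gap the study describes: a high headline accuracy can coexist with frequent errors in the supporting rationales.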
Findings
The study reveals that despite high overall accuracy, GPT-4V often provides flawed rationales. For instance, it correctly answered a question about malignant syphilis but failed to recognize that the presented skin lesions were manifestations of the same pathology. This discrepancy between the final answer and the rationale raises concerns about GPT-4V’s true understanding and reliability in clinical decision-making.
Moreover, the research singles out image comprehension as a particular challenge; in one case, GPT-4V miscounted the number of CT images provided. Such errors underline the need for thorough validation before AI models are deployed in clinical settings.
Implications for Clinical Practice
While GPT-4V’s performance is promising, the study emphasizes the importance of scrutinizing AI-generated rationales. The authors suggest that physicians relying on AI assistance must remain vigilant, cross-verifying AI outputs against their own clinical expertise.
The research also highlights a key limitation of large language models such as GPT-4V: they are only as ‘intelligent’ as the data they were trained on. It remains to be seen whether larger models, such as GPT-5, which OpenAI is currently developing, will overcome this limitation.