Man vs Machine: ChatGPT completes USMLE medical exam

Written by Emma Hall (Digital Editor)

A recent study has tested the performance of ChatGPT on the United States Medical Licensing Exam (USMLE), and the results are astonishing.

In the study, published in PLOS Digital Health, researchers at AnsibleHealth (CA, USA) put ChatGPT through the USMLE. Its responses were scored for accuracy, concordance and insight across the three exams that make up the USMLE (Steps 1, 2CK and 3), and ChatGPT gave impressive answers to every question, covering topics ranging from basic medical science to clinical knowledge and ethics.

We are living through a shift to a digital world: a transforming era of flickering screens and intricate machines, where refusal to adapt means being left behind. In this digital world, artificial intelligence (AI) technologies offer remarkable potential to enhance healthcare and medicine. However, their integration is met with substantial opposition, including ethical concerns about privacy, safety, bias and trust. How can we rely on something that lacks human judgement?

One way to build confidence that clinical AI meets standards of trust, transparency and explainability is to benchmark its medical knowledge against that of human clinicians, for example by testing a language model on the clinicians' licensing exam. ChatGPT is an AI natural language processing model that generates conversational text by repeatedly predicting which word is most likely to come next and appending it to its response.
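For readers curious about what "predicting the next word" actually looks like, the minimal sketch below uses the openly available GPT-2 model as a stand-in for ChatGPT (the model, prompt and single-step greedy choice are illustrative assumptions, not details from the study).

```python
# Minimal sketch: how a causal language model produces text one token at a time.
# GPT-2 is used here purely as an open, downloadable stand-in for ChatGPT.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "A 45-year-old patient presents with chest pain. The most likely diagnosis is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits      # a score for every token in the vocabulary
next_token_id = logits[0, -1].argmax()   # greedily pick the most likely next token
print(tokenizer.decode(next_token_id))

# Repeating this predict-and-append loop many times is what lets such a model
# write out a full, conversational answer to an exam question.
```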

ChatGPT was tested on 350 questions taken from the June 2022 USMLE sample exam, after 26 questions containing images or graphs had been removed. Question outputs were marked by two physician moderators.
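In spirit, the filtering and scoring step looks something like the sketch below; the data structure and field names are hypothetical, and in the study the grading was done by the physician moderators rather than by a script.

```python
# Rough sketch of the question-filtering and accuracy calculation described above.
# Field names and example items are hypothetical placeholders.
sample_exam = [
    {"id": 1, "has_image": False, "model_answer": "B", "key": "B"},
    {"id": 2, "has_image": True,  "model_answer": None, "key": "D"},
    # ... remaining items from the June 2022 sample exam
]

# Drop questions that rely on images or graphs, which ChatGPT cannot see.
text_only = [q for q in sample_exam if not q["has_image"]]

# Accuracy is the share of the remaining questions answered correctly.
correct = sum(q["model_answer"] == q["key"] for q in text_only)
print(f"{correct}/{len(text_only)} correct ({correct / len(text_only):.1%})")
```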

Astonishingly, without the arduous, demanding years of study that medical students need to prepare for the exam and obtain a medical licence, and without any specialised training, ChatGPT performed at or near the passing threshold of 60% for all three exams, scoring above 50% on each and reaching 75% accuracy on Step 3.

To avoid memory retention bias, the researchers also deleted the previous chat session and started a fresh session for each question they entered.
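The same idea can be pictured with the short sketch below. The study used the ChatGPT web interface and wiped the chat between questions; this OpenAI API version is only an analogy (the model name and placeholder questions are assumptions), but sending each question in its own request with no prior messages has the same effect of stopping earlier answers from influencing later ones.

```python
# Sketch of the "fresh session per question" idea using the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_without_memory(question: str) -> str:
    # Each call starts from an empty conversation, so no context carries over.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model name, not the one used in the study
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

usmle_questions = ["<USMLE item 1 text>", "<USMLE item 2 text>"]  # placeholders
answers = [answer_without_memory(q) for q in usmle_questions]
```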

Interestingly, ChatGPT also scored highly on concordance (94.6% of all responses) and demonstrated extensive reasoning and logical clinical insight in 88.9% of its responses.

Particularly notable was the comparison with PubMedGPT, a fellow language processing model trained solely on biomedical literature. On a previous set of USMLE exam questions, PubMedGPT scored 50.3%, a result ChatGPT surpassed.

Although it is early days, these results suggest that ChatGPT could enhance medical education and may even contribute to clinical decision-making in the future. The medical applications of language processing models are likely to expand as they continue to develop and improve.