Testing the Ability of Large Language Models to Achieve the Ethical Education Requirements of Preclinical Medical Students (V)
Dhaval Patel1, Kamran Khan1, Venkata Paruchuri1 — 1University of Queensland-Ochsner Clinical School, Brisbane, Queensland, Australia
Abstract
Nature and Scope
Large language models (LLMs) such as ChatGPT and Google Bard have received considerable attention from the medical community due to their potential to improve clinical care. The ability of these LLMs to answer clinical questions has been tested frequently, with many studies showing exceptional performance. However, little attention has been given to whether these LLMs can handle the intricate ethical dilemmas that are intertwined with delivering healthcare.
Purpose and Issue Considered
In this presentation, we evaluate how LLMs perform on ethical benchmark assessments designed for preclinical medical students (PCMS), to facilitate understanding of LLMs' limitations in answering complex ethical questions.
Methods
Five different PCMS examinations were input into both ChatGPT and Bard to determine whether each model would achieve a passing score. The two models' performance was then compared to see if one was superior.
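The abstract does not state which statistical test was used to compare the two models. As a minimal sketch of one plausible approach, the comparison could be run as a two-proportion z-test on counts of correctly answered questions; the counts below are purely illustrative and are not the study's data.

```python
import math

def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """z statistic and two-sided p-value for H0: the two models
    answer questions correctly at the same underlying rate."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    # Pooled proportion under the null hypothesis.
    p_pool = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF (via erf).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts only: model A answers 70/100 correctly, model B 65/100.
z, p = two_proportion_z(70, 100, 65, 100)
```

With counts like these, p exceeds 0.05, i.e. no statistically significant difference would be declared at the conventional threshold; the study's own test and data may differ.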
Outcomes
Both LLMs achieved a passing score on all five examinations, with scores ranging from 63.2% to 89.2%. No statistically significant difference was found between the two models' performance.
Conclusion
Testing LLMs on basic ethical dilemmas helps us understand the real-life limitations of incorporating AI into clinical care. While both LLMs passed the minimum standard expected at the PCMS level, this was likely due to the examinations' high proportion of first-order questions; performance was much lower when only higher-order questions were considered. Future research will examine LLMs' ability to handle the more complex ethical situations that can arise in healthcare.
Biography
Bio to come