192 Using large language models to evaluate the offer of options in clinical encounters by using a single item of observer option-5, a measure of shared decision-making
  1. Sai Prabhakar Pandi Selvaraj1,
  2. Renata W Yen2,3,
  3. Rachel C Forcino4,
  4. Glyn Elwyn2

  1. Tempus Labs, Redwood City, CA, USA
  2. The Dartmouth Institute for Health Policy and Clinical Practice, Geisel School of Medicine at Dartmouth, Lebanon, NH, USA
  3. The Center for Technology and Behavioral Health, Geisel School of Medicine at Dartmouth, Lebanon, NH, USA
  4. Department of Population Health, University of Kansas School of Medicine, Kansas City, KS, USA

Abstract

Introduction Assessing recordings of clinical encounters using observer-based measures of shared decision-making, such as Observer OPTION-5 (OO5), is expensive. In this study, we aimed to assess the potential of using large language models (LLMs) to automate the rating of the OO5 item focused on offering options (figure 1).

Methods We used a dataset of 287 clinical encounter transcripts of women diagnosed with early breast cancer talking with their surgeon to discuss treatments. Each transcript had been previously scored by two researchers using OO5. We set up two rule-based baselines, one random and one using trigger words, and classified option talk instances using GPT-3.5 Turbo, GPT-4, and PaLM 2. To develop and compare the performance of these models, we randomly selected 16 transcripts for additional human annotation focusing on option talk instances (binary). To assess performance, we calculated Spearman correlations between the researcher-generated item 1 scores for the remaining 271 transcripts and the item 1 instances predicted by the LLMs.
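To make the classification step concrete, here is a minimal sketch of a few-shot option-talk classifier, assuming the OpenAI chat completions API; the prompt wording, example utterances, model parameters, and function name are illustrative assumptions rather than the authors' actual pipeline.

```python
# Hypothetical few-shot classifier for option talk instances.
# Everything here (prompt, examples, helper name) is an assumption
# for illustration, not the study's actual code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative few-shot examples: (clinician utterance, label)
FEW_SHOT_EXAMPLES = [
    ("There are two ways we could treat this: lumpectomy or mastectomy.", "yes"),
    ("Let's go over the results of your biopsy.", "no"),
]

def classify_option_talk(utterance: str) -> bool:
    """Return True if the clinician utterance offers treatment options."""
    messages = [{
        "role": "system",
        "content": ("You label clinician utterances from breast cancer "
                    "consultations. Answer 'yes' if the utterance offers the "
                    "patient a choice between treatment options, otherwise 'no'."),
    }]
    # Interleave the few-shot examples as prior user/assistant turns.
    for text, label in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": utterance})
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # deterministic labels, useful for reproducible scoring
        messages=messages,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```

Per-transcript option talk counts could then be obtained by applying such a classifier to each clinician utterance and summing the positive labels.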

Results We observed high levels of correlation between the LLMs and researcher-generated scores.

GPT-3.5 Turbo with few-shot examples achieved rs=0.60 with the mean of the two scorers' ratings (see figure 1). The other LLMs had slightly lower correlations.
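As an illustration of the evaluation step, a minimal sketch of the Spearman correlation computation, assuming scipy; the arrays are placeholder values, not study data.

```python
# Correlate researcher-assigned item 1 scores with LLM-predicted option
# talk counts, one value per transcript. Placeholder data for illustration.
from scipy.stats import spearmanr

researcher_item1 = [2, 3, 1, 4, 2, 3]    # mean of the two scorers, per transcript
llm_option_counts = [3, 5, 1, 6, 2, 4]   # option talk instances predicted by the LLM

rs, p_value = spearmanr(researcher_item1, llm_option_counts)
print(f"Spearman rs = {rs:.2f} (p = {p_value:.3f})")
```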

Discussion The LLMs identified option talk instances more accurately than the baseline models, with GPT-3.5 Turbo with few-shot examples performing best and achieving the highest precision and recall.

Conclusion Further improvements in score correlations may be possible through advances in LLMs and a better understanding of how to apply them.

Abstract 192 Figure 1

Spearman correlations between researcher-generated Observer OPTION-5 scores and GPT-3.5 Turbo predictions
