Abstract
Introduction Assessing recordings of clinical encounters using observer-based measures of shared decision-making, such as Observer OPTION-5 (OO5), is expensive. In this study, we aimed to assess the potential of using large language models (LLMs) to automate the rating of the OO5 item focused on offering options (figure 1).
Methods We used a dataset of 287 clinical encounter transcripts of women diagnosed with early breast cancer talking with their surgeons to discuss treatments. Each transcript had been previously scored by two researchers using OO5. We set up two rules-based baselines, one random and one using trigger words, and classified option talk instances using GPT-3.5 Turbo, GPT-4, and PaLM 2. To develop and compare the performance of these models, we randomly selected 16 transcripts for additional human annotation focusing on option talk instances (binary). To assess performance, we calculated Spearman correlations between the researcher-generated item 1 scores for the remaining 271 transcripts and the item 1 (option talk) instances predicted by the LLMs.
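The exact prompts, trigger words and few-shot examples are described in the full paper; as a rough illustration of the approach, a minimal sketch of the trigger-word baseline and a few-shot chat-completion classification call might look like the following (the trigger words, prompt wording and example utterances here are placeholders, not the materials used in the study):

```python
# Illustrative sketch only: trigger words, prompt text and few-shot examples
# are hypothetical placeholders, not the materials used in the study.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TRIGGER_WORDS = {"option", "options", "choice", "alternative"}  # hypothetical list

def trigger_word_baseline(utterance: str) -> bool:
    """Rules-based baseline: flag an utterance as option talk if it contains a trigger word."""
    return any(w.strip(".,?!") in TRIGGER_WORDS for w in utterance.lower().split())

FEW_SHOT = [
    # One hypothetical positive and one hypothetical negative example of option talk.
    {"role": "user", "content": "Clinician: You could have a lumpectomy or a mastectomy; both are reasonable."},
    {"role": "assistant", "content": "yes"},
    {"role": "user", "content": "Clinician: Your pathology report came back this morning."},
    {"role": "assistant", "content": "no"},
]

def classify_option_talk(utterance: str) -> bool:
    """Ask the LLM whether a single clinician utterance offers treatment options (binary)."""
    messages = [
        {"role": "system",
         "content": "Answer 'yes' if the clinician utterance offers or lists treatment options, otherwise 'no'."},
        *FEW_SHOT,
        {"role": "user", "content": utterance},
    ]
    response = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages, temperature=0
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```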
Results We observed high levels of correlation between the LLM-predicted and researcher-generated scores.
GPT-3.5 Turbo with a few-shot example achieved rs=0.60 against the mean of the two researchers' scores (see figure 1). The other LLMs had slightly lower correlations.
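The reported rs values correspond to a standard Spearman rank correlation between per-transcript counts of predicted option talk instances and the researchers' item 1 scores; a minimal sketch with hypothetical placeholder data is:

```python
from scipy.stats import spearmanr

# Hypothetical inputs: one value per transcript in the 271-transcript evaluation set.
# predicted_counts[i]  = number of option talk instances the LLM flagged in transcript i
# researcher_scores[i] = mean of the two researchers' OO5 item 1 scores for transcript i
predicted_counts = [4, 2, 7, 0, 3]              # placeholder data
researcher_scores = [3.0, 2.0, 4.0, 0.5, 2.5]   # placeholder data

rs, p_value = spearmanr(predicted_counts, researcher_scores)
print(f"Spearman rs = {rs:.2f} (p = {p_value:.3f})")
```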
Discussion The LLMs outperformed the baseline models in identifying option talk instances. GPT-3.5 Turbo with few-shot examples performed best, achieving the highest precision and recall.
Conclusion Further gains in score correlations may be possible as LLMs improve and as their behaviour on this task becomes better understood.