Triage Performance Across Large Language Models, ChatGPT, and Untrained Doctors in Emergency Medicine: Comparative Study

Background: Large language models (LLMs) have demonstrated impressive performance in various medical domains, prompting an exploration of their potential utility within the high-demand setting of emergency department (ED) triage. This study evaluated the triage proficiency of different LLMs and ChatGPT, an LLM-based chatbot, compared with professionally trained ED staff and untrained personnel. We further explored whether LLM responses could guide untrained staff in effective triage.

Objective: This study aimed to assess the efficacy of LLMs and the associated product ChatGPT in ED triage compared with personnel of varying training status and to investigate whether the models' responses can enhance the triage proficiency of untrained personnel.

Methods: A total of 124 anonymized case vignettes were triaged by untrained doctors; different versions of currently available LLMs; ChatGPT; and professionally trained raters, who subsequently agreed on a consensus set according to the Manchester Triage System (MTS). The prototypical vignettes were adapted from cases at a tertiary ED in Germany. The main outcome was the level of agreement between raters' MTS level assignments, measured via quadratic-weighted Cohen κ. The extent of over- and undertriage was also determined. Notably, instances of ChatGPT were prompted using zero-shot approaches, without extensive background information on the MTS. The tested LLMs included raw GPT-4, Llama 3 70B, Gemini 1.5, and Mixtral 8x7B.

Results: GPT-4–based ChatGPT and untrained doctors showed substantial agreement with the consensus triage of professional raters (κ=mean 0.67, SD 0.037 and κ=mean 0.68, SD 0.056, respectively), significantly exceeding the performance of GPT-3.5–based ChatGPT (κ=mean 0.54, SD 0.024; P<.001). When untrained doctors used this LLM for second-opinion triage, performance increased slightly but not significantly (κ=mean 0.70, SD 0.047; P=.97). The other tested LLMs performed similarly to or worse than GPT-4–based ChatGPT, or showed erratic triaging behavior with the parameters used. The LLMs and ChatGPT models tended toward overtriage, whereas the untrained doctors tended toward undertriage.

Conclusions: While LLMs and the LLM-based product ChatGPT do not yet match professionally trained raters, the triage proficiency of their best models equals that of untrained ED doctors. In their current form, the LLMs and ChatGPT thus did not demonstrate gold-standard performance in ED triage and, in the setting of this study, failed to significantly improve untrained doctors' triage when used as decision support. Notable performance gains of newer LLM versions over older ones hint at future improvements with further technological development and specific training.
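The main outcome measure, quadratic-weighted Cohen κ, rewards near-misses on the ordinal MTS scale more than distant misclassifications. As a minimal sketch of how such agreement can be computed (not the study's analysis code; the ratings below are invented for illustration), using scikit-learn:

```python
# Illustrative sketch: quadratic-weighted Cohen's kappa between a consensus
# set and one rater's Manchester Triage System (MTS) level assignments.
# The ratings are made-up examples, not study data.
from sklearn.metrics import cohen_kappa_score

# MTS levels: 1 (Red, immediate) .. 5 (Blue, non-urgent), one per vignette
consensus = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
rater     = [1, 2, 3, 3, 2, 3, 4, 5, 5, 4]

# weights="quadratic" penalizes a disagreement by the squared distance
# between the ordinal levels, so a 2-level miss costs 4x a 1-level miss.
kappa = cohen_kappa_score(consensus, rater, weights="quadratic")
print(round(kappa, 3))
```

With identical rating vectors the statistic is exactly 1; chance-level agreement yields values near 0.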


Multimedia Appendix 1 Overview

Figure S1: Triage Results per Individual Rater (Doctors and ChatGPT Versions). This plot shows the individual assessment per rater for each triage case. The y-axis shows the case number and the x-axis the individual raters. The color code represents the triage class according to the Manchester Triage System levels, from level 1 'Red' (immediate assessment) to level 5 'Blue' (non-urgent assessment). GPT-4 and GPT-3.5 refer to GPT-4 based ChatGPT and GPT-3.5 based ChatGPT, respectively.

Figure S2: Triage Results per Individual Rater (Other LLMs). This plot shows the individual assessment per large language model (LLM) queried via application programming interface for each triage case. The consensus set is displayed as a reference on the far left. The y-axis shows the case number and the x-axis the individual raters. The color code represents the triage class according to the Manchester Triage System levels, from level 1 'Red' (immediate assessment) to level 5 'Blue' (non-urgent assessment).

Figure S3: Quadratic Weighted Cohen's Kappa Agreement for Individual Raters and Three Scenarios. a. The bar chart presents the quadratic weighted Cohen's kappa score for each individual rater (doctor or large language model) and the performance of untrained doctors using GPT-4 based ChatGPT for a second opinion. b. This bar chart uses the average quadratic weighted Cohen's kappa values for the two ChatGPT groups (GPT-4 based or GPT-3.5 based), the other queried large language models, and the untrained doctors (blue). Improved performance for untrained doctors using GPT-4 based ChatGPT as a second opinion (orange), as well as the hypothetical best possible score merging their assessments (green), are depicted.

Figure S4: Overtriage and Undertriage Rate per Group and Individual Rater. a. The upper bar chart presents the percentage of overtriaged cases for each rater group with the corresponding standard deviation within the group; the lower chart shows the percentage of undertriaged cases. b. The bar charts present the overtriage and undertriage rates as in panel a, but for each individual rater.
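Assuming the usual definitions on the MTS scale (overtriage: assigning a more urgent level than the consensus, i.e. a lower number; undertriage: a less urgent level), the rates can be sketched as follows with invented example ratings:

```python
# Illustrative sketch (assumed definitions, made-up ratings):
# overtriage  = rater assigned a MORE urgent MTS level than consensus
# undertriage = rater assigned a LESS urgent MTS level than consensus
consensus = [1, 2, 2, 3, 3, 4, 4, 5]
rater     = [1, 1, 3, 3, 2, 4, 5, 5]

n = len(consensus)
overtriage = sum(r < c for r, c in zip(rater, consensus)) / n * 100
undertriage = sum(r > c for r, c in zip(rater, consensus)) / n * 100
print(f"overtriage: {overtriage:.1f}%, undertriage: {undertriage:.1f}%")
# → overtriage: 25.0%, undertriage: 25.0%
```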

Figure S5: Average Distribution of Manchester Triage System Levels per Individual Rater. The figure presents stacked bar plots illustrating the average count of triage categories according to the Manchester Triage System, arranged in ascending order of severity. The least severe level, Level 5 ('Blue'), is located at the bottom, while the most severe level, Level 1 ('Red'), is positioned at the top.

Ordered by appearance in main text:
1. Textbox S1: Example Case Vignettes
2. Textbox S2: Prompt used for the Generation of LLM Triages
3. Textbox S3: Example ChatGPT Output (for Example Case Vignette 1) used as Second Opinion
4. Table S1: Large Language Models and Respective Settings
5. Figure S1: Triage Results per Individual Rater (Doctors and ChatGPT)
6. Figure S2: Triage Results per Individual Rater (Other LLMs)
7. Table S2: Results of Tukey's Honest Significant Difference Test
8. Figure S3: Quadratic Weighted Cohen's Kappa Agreement for Individual Raters and Three Scenarios
9. Figure S4: Overtriage and Undertriage Rate per Group and Individual Rater
10. Figure S5: Average Distribution of Manchester Triage System Levels per Individual Rater

Textbox S2: Prompt used for the Generation of LLM Triages

Prompt used for ChatGPT: "I would like to test your ability to perform an emergency room triage according to the Manchester Triage System. I will present you several cases. The provided information is all you get. If there is something missing, please assume it is not relevant to the case. As an answer, please tell me the triage color and very briefly explain your decision."

Prompt used for APIs: "You are a professional medical chatbot.

Textbox S1: Example Case Vignettes

2. "A 27-year-old male patient reports increased ear pain on his right side following a respiratory infection last week. He took an unspecified antibiotic until yesterday. He has a mild hearing impairment in his right ear but no nausea, dizziness, or otorrhoea. He doesn't have a fever or any neurological deficits. He rates the pain as 5/10 on the NAS scale."

3. "A 68-year-old female patient walks in with severe left-sided pain (NAS 10/10). Has a history of kidney stones, and describes the current episode to be very similar. Currently hypertensive (180 mmHg sys), other vitals normal. No other symptoms, no other relevant medical history. No immunosuppression or international journeys."

Textbox S3: Example ChatGPT Output (for Example Case Vignette 1) used as Second Opinion

"Triage Color: Red. The patient is presenting with severe respiratory distress, indicated by dyspnea (difficulty breathing) and inspiratory stridor (a high-pitched, wheezing sound due to obstructed air flow). These symptoms suggest she may be experiencing a severe asthma attack or possibly an allergic reaction that's causing her airways to tighten. Immediate medical intervention is needed to open her airways and normalize her breathing."