Journal List > Healthc Inform Res > v.30(3) > 1516088101

Oh, Cha, Seo, Choi, Kim, Chung, Suh, Lee, Oh, Park, Lim, and Ko: ChatGPT Predicts In-Hospital All-Cause Mortality for Sepsis: In-Context Learning with the Korean Sepsis Alliance Database

Abstract

Objectives

Sepsis is a leading global cause of mortality, and predicting its outcomes is vital for improving patient care. This study explored the capabilities of ChatGPT, a state-of-the-art natural language processing model, in predicting in-hospital mortality for sepsis patients.

Methods

This study utilized data from the Korean Sepsis Alliance (KSA) database, collected between 2019 and 2021, focusing on adult intensive care unit (ICU) patients and aiming to determine whether ChatGPT could predict all-cause mortality after ICU admission at 7 and 30 days. Structured prompts enabled ChatGPT to engage in in-context learning, with the number of patient examples varying from zero to six. The predictive capabilities of ChatGPT-3.5-turbo and ChatGPT-4 were then compared against a gradient boosting model (GBM) using various performance metrics.

Results

From the KSA database, 4,786 patients formed the 7-day mortality prediction dataset, of whom 718 died, and 4,025 patients formed the 30-day dataset, with 1,368 deaths. Age and clinical markers (e.g., Sequential Organ Failure Assessment score and lactic acid levels) showed significant differences between survivors and non-survivors in both datasets. For 7-day mortality predictions, the area under the receiver operating characteristic curve (AUROC) was 0.70–0.83 for GPT-4, 0.51–0.70 for GPT-3.5, and 0.79 for GBM. The AUROC for 30-day mortality was 0.51–0.59 for GPT-4, 0.47–0.57 for GPT-3.5, and 0.76 for GBM. Zero-shot predictions of mortality from ICU admission to day 30 showed AUROCs from the mid-0.60s to 0.75 for GPT-4 and mainly from 0.47 to 0.63 for GPT-3.5.

Conclusions

GPT-4 demonstrated potential in predicting short-term in-hospital mortality, although its performance varied across different evaluation metrics.

Introduction

Sepsis is a life-threatening organ dysfunction caused by a dysregulated host response to infection [1]. Globally, sepsis is a leading cause of mortality and poses a significant challenge for health systems [2]. Predicting the outcomes of sepsis patients is crucial for guiding treatment decisions, allocating resources, and improving patient care [3]. Although various regression and machine learning models have been developed to estimate mortality risk in sepsis patients, they often require extensive datasets, lack sufficient interpretability and explainability, and depend on more features than necessary [4–6]. Accordingly, they are seldom used in practical clinical settings.
ChatGPT is a state-of-the-art large language model (LLM) with over 100 billion parameters that can perform various tasks such as text generation, summarization, and question-answering [7]. Since its launch in late 2022, ChatGPT has shown impressive capabilities in numerous fields, particularly in medicine, where it has achieved results comparable to, or even surpassing, those of medical experts in answering medical questions and examinations [8–10]. However, applying LLMs in real-world clinical settings involves more than retrieving information; it also requires clinical reasoning and decision support. NYUTron, a recently developed LLM enhanced with medical knowledge, has shown significant promise in predicting in-hospital mortality and 30-day all-cause readmission [26]. Nonetheless, the advances it offers are limited by the substantial additional investment needed for such refinement.
In this study, we investigated the potential of ChatGPT to overcome these challenges by utilizing its pre-trained parameters and extensive dataset to generate natural language responses to structured prompts. We did not require additional training or fine-tuning of ChatGPT; instead, we employed in-context learning to tailor it to the specific task. We evaluated ChatGPT’s effectiveness in predicting in-hospital mortality among sepsis patients using clinical data and scenarios.

Methods

1. Dataset

This study was a secondary analysis of data prospectively collected in the Korean Sepsis Alliance (KSA) database between September 2019 and December 2021. The KSA database is a nationwide registry of sepsis cases from 16 tertiary or university-affiliated hospitals across South Korea [11]. The Institutional Review Boards of all participating hospitals, including Samsung Medical Center (Approval No. 2018-05-108), waived the requirement for informed consent because of the study's observational nature. The research was conducted in accordance with the principles of the Declaration of Helsinki.
Adult patients (over the age of 18) who were admitted to the intensive care unit (ICU) were included in the study. Clinical characteristics, including age (stratified by the Charlson comorbidity index), sex, Sequential Organ Failure Assessment (SOFA) score, serum lactic acid levels, and in-hospital mortality, were collected. All-cause mortality was assessed from the first day of the ICU stay up to the 30th day. Patients who were transferred to other hospitals were excluded from the analysis because their final outcomes, whether death or survival, were unknown.
The research employed two sub-datasets from the KSA database, each aimed at predicting all-cause mortality before discharge at two different time points: day 7 and day 30. In the dataset used for predicting 7-day mortality, patients who were transferred out before day 7 were excluded. Similarly, in the dataset for 30-day mortality prediction, patients transferred before day 30 were also excluded.
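The exclusion rule above can be sketched as a small filtering step. This is an illustrative reconstruction, not the authors' code: the column names (transfer_day, death_day, counted in days from ICU admission) are assumptions, not the actual KSA schema.

```python
import pandas as pd

def build_mortality_dataset(df: pd.DataFrame, horizon_days: int) -> pd.DataFrame:
    """Return patients whose outcome by `horizon_days` after ICU admission is known.

    Patients transferred to another hospital before the horizon are excluded,
    since their final outcome (death or survival) is unknown.
    """
    transferred_early = df["transfer_day"].notna() & (df["transfer_day"] < horizon_days)
    cohort = df.loc[~transferred_early].copy()
    # Label all-cause mortality within the prediction horizon.
    cohort["died"] = cohort["death_day"].notna() & (cohort["death_day"] <= horizon_days)
    return cohort

# Toy example: three patients; the second is transferred on day 3 and is
# therefore excluded from the 7-day dataset.
toy = pd.DataFrame({"transfer_day": [None, 3.0, 10.0], "death_day": [2.0, None, None]})
d7 = build_mortality_dataset(toy, horizon_days=7)
```

Applying the same function with `horizon_days=30` would reproduce the 30-day sub-dataset construction.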

2. Task

The objective of this study was to experimentally determine whether ChatGPT can predict patients’ future outcomes from provided clinical data. We assessed its ability to predict survival or death at 7 days and 30 days after the first day of ICU admission, using the SOFA score and lactic acid level from the first day. This analysis used the two sub-datasets extracted from the KSA database. For the 30-day dataset, we additionally collected survival or death status at multiple time points after ICU admission (0, 1, 2, 3, 4, 5, 6, 7, 14, 21, 28, and 30 days) and assessed how the accuracy of predictions based on the first day’s data varied over time.

3. Prompt

The prompts were structured into two parts, providing examples for in-context learning and then asking for a prediction for a new case.
Examples were provided as follows:
“sex (male or female), age score in Charlson Comorbidity Index (1–5), SOFA score (X), lactic acid level (Y) at the first day of ICU. The patient was observed (survived or died) after 7 days from ICU admission.”
And for the prediction, the questions for new cases were as follows:
“(male or female), (Charlson age), (SOFA score), (lactic acid level) at the first day of ICU. What is your prediction on the outcome of survival or death at the 7 days after ICU admission?”
Figure 1 offers a schematic illustration of the interaction between the user and ChatGPT. It shows how examples for in-context learning and questions were presented, followed by the corresponding responses. This figure helps clarify the process by which the model uses the provided data to make predictions.
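The two prompt parts can be assembled programmatically before being submitted to the model. The sketch below follows the templates quoted above; the helper names and exact field ordering are illustrative assumptions, and the resulting string would then be sent as a chat message via the OpenAI API.

```python
def example_line(sex, charlson_age, sofa, lactate, outcome, horizon=7):
    """One in-context learning example, following the paper's template."""
    return (f"{sex}, Charlson age score {charlson_age}, SOFA score {sofa}, "
            f"lactic acid level {lactate} at the first day of ICU. "
            f"The patient was observed to have {outcome} after {horizon} days "
            f"from ICU admission.")

def question_line(sex, charlson_age, sofa, lactate, horizon=7):
    """The prediction question for a new, unlabeled case."""
    return (f"{sex}, Charlson age score {charlson_age}, SOFA score {sofa}, "
            f"lactic acid level {lactate} at the first day of ICU. "
            f"What is your prediction on the outcome of survival or death "
            f"at the {horizon} days after ICU admission?")

# One-shot prompt: a single labeled example followed by the new case.
prompt = "\n".join([
    example_line("male", 3, 9, 2.4, "survived"),
    question_line("female", 4, 13, 7.1),
])
```

In the zero-shot setting, the prompt would consist of the question line alone; the few-shot variants simply prepend more example lines.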

4. In-Context Learning

To evaluate ChatGPT’s predictive capabilities under various conditions, we manipulated the number of examples provided as follows: in the zero-shot scenario, ChatGPT was tasked with predicting a patient’s survival or death without any prior examples. In the one-shot scenario, a single representative patient example was provided, based on which ChatGPT made its prediction. In the two-shot scenario, two patient examples were presented, and in the few-shot scenario, six patient examples were used.
The examples of patients selected for in-context learning consisted of representative cases from those included in the study. Since the SOFA score was normally distributed, the mean and standard deviation (SD) were used (mean – SD, mean, mean + SD). Conversely, because the distribution of lactic acid levels was skewed, representative values were selected using the first quartile, median, and third quartile (Supplementary Table S1, Figure S2).
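The selection rule above can be illustrated with synthetic data; this is a sketch of the described procedure, with randomly generated values standing in for the actual KSA distributions.

```python
import numpy as np

rng = np.random.default_rng(0)
sofa = rng.normal(10, 4, size=1000)           # roughly normal, like the SOFA score
lactate = rng.lognormal(1.0, 0.8, size=1000)  # right-skewed, like lactic acid

# Normally distributed feature: mean - SD, mean, mean + SD
sofa_reps = [sofa.mean() - sofa.std(), sofa.mean(), sofa.mean() + sofa.std()]

# Skewed feature: first quartile, median, third quartile
lactate_reps = np.percentile(lactate, [25, 50, 75]).tolist()
```

Pairing these three SOFA values with the three lactate values (and the corresponding observed outcomes) yields the representative patient examples used in the one-, two-, and few-shot prompts.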

5. Experiment and Comparison

The predictions were conducted using ChatGPT-3.5-turbo and ChatGPT-4 through the OpenAI API. Additionally, a gradient boosting model (GBM) was employed to demonstrate the performance of a conventional machine learning model. The GBM was trained using the scikit-learn package, with an 8:2 dataset split for training and validation purposes. To ensure a fair comparison, the performance of ChatGPT was assessed using a randomly selected sample of 100 patients from the validation set, while the examples of patients used for in-context learning were drawn from the training set. For each validation, we calculated accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUROC) score to compare the performance of ChatGPT-3.5-turbo, ChatGPT-4, and gradient boosting (Figure 2).
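The GBM baseline and evaluation loop can be sketched as follows. This is a minimal reconstruction under stated assumptions: the synthetic two-column feature matrix stands in for the SOFA score and lactic acid level, and default scikit-learn hyperparameters are assumed since none are reported.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Synthetic stand-in for [SOFA, lactate] features and a mortality label.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=1000) > 0).astype(int)

# 8:2 split for training and validation, as described in the text.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)
gbm = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)

# The same metric set used to score ChatGPT's predictions.
pred = gbm.predict(X_va)
proba = gbm.predict_proba(X_va)[:, 1]
metrics = {
    "accuracy": accuracy_score(y_va, pred),
    "precision": precision_score(y_va, pred),
    "recall": recall_score(y_va, pred),
    "f1": f1_score(y_va, pred),
    "auroc": roc_auc_score(y_va, proba),
}
```

For ChatGPT, the same metrics would be computed over the 100 randomly sampled validation patients, with the in-context examples drawn only from the training split.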

Results

1. Patients and Baseline Characteristics

The KSA database included a total of 11,981 adult sepsis patients, of whom 4,890 were admitted to the ICU during their hospital stay (Supplementary Figure S1). By day 7, 104 patients had been transferred to another hospital, and by day 30, the number of transfers had increased to 865. Consequently, the dataset used for predicting all-cause mortality before discharge by day 7 included 4,786 patients (7D dataset), while the dataset for the 30-day mortality prediction comprised 4,025 patients (30D dataset).
Out of the 4,786 patients in the 7D dataset, 718 died within 7 days (Table 1). The sex distribution did not differ significantly between the two groups (p = 0.825). Among survivors, patients in their 70s were the most common, comprising 30.0% (1,221/4,068); in the mortality group, individuals in their 80s were the most prevalent, at 36.8% (264/718), with a significant difference in age distribution between the two groups (p < 0.001). The mortality group also had a higher SOFA score (12.6 ± 3.7) than the survivors (9.3 ± 3.6; p < 0.001). Additionally, lactic acid levels were higher in the mortality group (median [interquartile range], 7.1 [3.9–11.8]) than in the survivors (2.4 [1.5–4.4]; p < 0.001).
Of the 4,025 patients in the 30D dataset, 1,368 died within 30 days (Table 2). Males accounted for 58.0% (1,540/2,657) of the survivors and 61.8% (845/1,368) of the mortality group, a significant difference (p = 0.022). In the survivor group, patients in their 70s constituted the largest proportion (29.6%; 787/2,657), whereas in the mortality group the largest proportion consisted of individuals in their 80s (33.0%; 452/1,368), with a significant difference in age distribution between the groups (p < 0.001). The SOFA score was higher in the mortality group (12.0 ± 3.8) than in the survivors (9.0 ± 3.5; p < 0.001). Similarly, the lactic acid level was higher in the mortality group, at 5.0 (2.5–9.9), than in the survivor group, at 2.3 (1.5–4.1) (p < 0.001).
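The group comparisons above report p-values without naming the statistical tests, so the sketch below uses standard choices implied by the reported summaries, which is an assumption: a t-test for the mean ± SD SOFA score, a Mann-Whitney U test for the skewed lactic acid level (both on synthetic data with the reported group parameters), and a chi-square test on the sex-by-outcome counts from Table 1.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# SOFA score: roughly normal, compared with an independent-samples t-test.
sofa_surv = rng.normal(9.3, 3.6, size=400)
sofa_dead = rng.normal(12.6, 3.7, size=100)
t_p = stats.ttest_ind(sofa_surv, sofa_dead).pvalue

# Lactic acid: right-skewed, compared with a Mann-Whitney U test.
lact_surv = rng.lognormal(0.9, 0.7, size=400)
lact_dead = rng.lognormal(1.9, 0.7, size=100)
u_p = stats.mannwhitneyu(lact_surv, lact_dead, alternative="two-sided").pvalue

# Sex vs. outcome (counts from Table 1): chi-square test on a 2x2 table.
table = np.array([[2381, 424],    # male: survived, died
                  [1687, 294]])   # female: survived, died
chi2, chi_p, dof, expected = stats.chi2_contingency(table)
```

With these inputs, the SOFA and lactate comparisons yield very small p-values while the sex comparison does not, matching the pattern of significance reported in Table 1.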

2. Predicting All-Cause Mortality before Discharge by Day 7

The predictive performance of GPT-4 for 7-day mortality showed accuracy values ranging from 0.48 to 0.77, precision from 0.98 to 1.00, recall from 0.40 to 0.76, F1-scores from 0.57 to 0.86, and AUROC values from 0.70 to 0.83. The model with the highest recall (0.76), F1-score (0.86), and AUROC (0.83) was the one-shot model (7D_S_3). In contrast, GPT-3.5 demonstrated a wide range of prediction performance, with accuracy, precision, recall, F1-score, and AUROC values ranging from 0.13 to 0.67, 0.92 to 1.00, 0.02 to 0.69, 0.04 to 0.79, and 0.51 to 0.70, respectively (Table 3, Figure 3A). GBM achieved an AUROC of 0.79 (Supplementary Table S2).

3. Predicting All-Cause Mortality before Discharge by Day 30

The predictive performance of GPT-4 for 30-day mortality showed accuracy values ranging from 0.44 to 0.57, precision from 0.65 to 0.73, recall from 0.21 to 0.67, F1-scores from 0.32 to 0.68, and AUROC values from 0.51 to 0.59. The model with the highest recall (0.67), F1-score (0.68), and AUROC (0.59) was the one-shot model (30D_S_3). Meanwhile, GPT-3.5’s performance for 30-day mortality exhibited accuracy values ranging from 0.35 to 0.57, precision from 0.00 to 1.00, recall from 0.00 to 0.61, F1-scores from 0.00 to 0.64, and AUROC values from 0.47 to 0.57 (Table 4, Figure 3B). GBM achieved an AUROC of 0.76 (Supplementary Table S3).

4. Predicting All-Cause Mortality from ICU Admission to Day 30 with a Zero-Shot Approach

In a zero-shot scenario without any patient examples, ChatGPT was tasked with predicting mortality on the first day of ICU admission using the SOFA score and lactic acid levels from that day. GPT-4 achieved an AUROC of 0.75, compared to 0.49 for GPT-3.5. The models were likewise evaluated for predictions through the seventh day of admission using the initial SOFA score and lactic acid levels: GPT-4’s AUROCs were 0.73, 0.73, 0.71, 0.71, 0.72, 0.70, and 0.69, while GPT-3.5’s were 0.54, 0.53, 0.63, 0.49, 0.57, 0.53, and 0.54. For predicting mortality at 14, 21, 28, and 30 days, GPT-4 maintained an AUROC of 0.66 at each time point, while GPT-3.5 showed AUROCs of 0.47, 0.59, 0.58, and 0.54 (Table 5, Figure 4).

Discussion

This study aimed to investigate the potential of ChatGPT for predicting in-hospital mortality among sepsis patients using clinical data from the KSA database. The findings indicate that ChatGPT can accurately forecast 7-day and 30-day clinical outcomes from data collected on a patient’s first day in the ICU. Among the models tested, GPT-4 exhibited superior performance in predicting 7-day mortality with a one-shot example, achieving an AUROC of 0.83, an F1-score of 0.86, a precision of 0.98, and a recall of 0.76. Remarkably, GPT-4 also demonstrated strong predictive ability using a zero-shot approach, with an AUROC of 0.81, an F1-score of 0.76, a precision of 1.00, and a recall of 0.61. This level of performance, attained without any tailored training and relying solely on pre-trained knowledge, is comparable to that of specialized machine learning models such as GBM, which required training on 80% of the dataset to achieve an AUROC of 0.79.
Predicting 30-day mortality based solely on data from the first day of ICU admission proved to be challenging. The AUROC for GPT-4 ranged from 0.51 to 0.59, significantly lower than its performance in 7-day predictions. In contrast, the GBM machine learning model demonstrated a more robust performance, with an AUROC of 0.76. Further analysis indicated a temporal dependency in the predictions: GPT-4’s predictive accuracy declined as the time between the data collection and the targeted prediction date increased (Figure 4). A similar trend was observed in the GBM model. These findings suggest that the relevance of specific features in model predictions may vary over time, a characteristic inherent to time-series data [12,13]. This highlights potential limitations in using initial ICU data for long-term predictions.
A central methodological feature of this study was the use of in-context learning. Generally, when language models are prompted with examples, their performance tends to improve owing to their capacity to identify underlying patterns, a phenomenon summarized as “language models are few-shot learners” [14–16]. This was particularly noticeable in GPT-3.5, especially in its predictions of 7-day mortality. However, GPT-4 demonstrated superior performance in a zero-shot scenario. Contrary to expectations, as the number of examples increased, the model’s performance did not improve but instead declined, as illustrated in Figure 3. While the precise cause of this trend is unclear, one hypothesis is that GPT-4, with its extensive training data and larger parameter count, may already possess a significant inherent capacity for inference; the examples provided during in-context learning could then inadvertently introduce a negative bias, which might explain the observed decrease in predictive performance.
Throughout the study, we utilized GBM as a representative benchmark for classical machine learning predictions. This decision was based on comparative evaluations in which GBM consistently outperformed logistic regression, random forest, and decision tree (Supplementary Tables S2, S3). When predicting 7-day and 30-day mortality across various datasets, GBM generally surpassed GPT-4. Given GBM’s specific design for predictive tasks, its proficiency in extracting and leveraging information from the data is to be expected [17,18]. This robustness of GBM as a predictive tool was clearly demonstrated in our study. In contrast, ChatGPT was not originally trained for predictive tasks but was developed to generate coherent text sequences [19]. However, it was noteworthy that in certain tasks, GPT-4 showed a predictive capability that rivaled that of GBM.
The prediction of mortality in sepsis patients is a crucial component of personalized medicine, facilitating tailored treatment strategies and optimal resource allocation [20–22]. Recent advances in machine learning and artificial intelligence have enhanced the use of these technologies for prognostic assessments. Notably, the fine-tuning of the large language model NYUTron has been reported to improve in-hospital mortality predictions [4,6,23–26]. However, such AI models generally require large datasets and substantial computational expertise for their development. In contrast, ChatGPT, a pretrained large language model, simplifies user interaction by providing direct responses to prompts, thereby eliminating the need for specialized computer science knowledge. Available through a web-based platform, ChatGPT also offers interpretable explanations for its mortality predictions, making it more accessible and useful for users without technical backgrounds [8,19,27].
This study represents one of the initial efforts to validate ChatGPT’s predictive capabilities in the clinical domain. Nevertheless, it has several limitations. The performance of GPT-4 and GPT-3.5 varied considerably depending on the evaluation metric used, with notable discrepancies between precision and recall. This variation can be attributed to the uneven distribution of mortality cases among the included patients (less than 30%). The study also attempted to predict in-hospital mortality using only the SOFA score and lactic acid levels; relying on just two of the many available clinical variables is a significant limitation. In clinical practice, numerous factors influence patient outcomes, and incorporating a broader range of variables would likely improve predictive accuracy and reliability. While predicting mortality is crucial, it is equally important to evaluate ChatGPT’s predictive capacity across a diverse range of clinical scenarios. Because of the rate limits and cost of the OpenAI API, not all cases were verified, and the use of random sampling may have reduced the study’s reliability. Moreover, focusing exclusively on ChatGPT, given the availability of various other LLMs, raises questions about the generalizability of the findings to other models. Despite these limitations, this study explored the predictive capacity of ChatGPT using clinical data from the KSA database and demonstrated its potential for interpreting clinical data and predicting future clinical outcomes.
In conclusion, this experimental study evaluated ChatGPT’s ability to predict all-cause in-hospital mortality among sepsis patients. GPT-4 showed promise in forecasting short-term in-hospital mortality, although its performance differed across various evaluation metrics. Therefore, additional research is necessary to fully ascertain its capabilities, limitations, and optimal uses in the medical field.

Data Availability

Raw data were generated from the Korean Sepsis Alliance (KSA) and the data are available on request from the KSA. The data are not publicly available because of privacy or ethical restrictions.

Acknowledgments

The following people and institutions participated in the Korean Sepsis Alliance (KSA): Steering Committee, Chae-Man Lim (Chair), Kyeongman Jeon, Dong Kyu Oh, Sunghoon Park, Yeon Joo Lee, Sang-Bum Hong, Gee Young Suh, Young-Jae Cho, Ryoung-Eun Ko, and Sung Yoon Lim; Participating Persons and Centers, Kangwon National University Hospital, Jeongwon Heo; Korea University Anam Hospital, Jae-myeong Lee; Daegu Catholic University Hospital, Kyung Chan Kim; Seoul National University Bundang Hospital, Yeon Joo Lee; Inje University Sanggye Paik Hospital, Youjin Chang; Samsung Medical Center, Kyeongman Jeon; Seoul National University Hospital, Sang-Min Lee; Asan Medical Center, Chae-Man Lim and Suk-Kyung Hong; Pusan National University Yangsan Hospital, Woo Hyun Cho; Chonnam National University Hospital, Sang Hyun Kwak; Jeonbuk National University Hospital, Heung Bum Lee; Ulsan University Hospital, Jong-Joon Ahn; Jeju National University Hospital, Gil Myeong Seong; Chungnam National University Hospital, Song-I Lee; Hallym University Sacred Heart Hospital, Sunghoon Park; Hanyang University Guri Hospital, Tai Sun Park; Severance Hospital, Su Hwan Lee; Yeungnam University Medical Center, Eun Young Choi; Chungnam National University Sejong Hospital, Jae Young Moon.

Notes

Conflict of Interest

No potential conflict of interest relevant to this article was reported.

Funding

This work was supported by the “Future Medicine 2030 Project” of the Samsung Medical Center (No. SMX1230771), the “Bio&Medical Technology Development Program” of the Korean government (MSIT) (No. RS-2023-00222838), and a Research Program funded by the Korea Disease Control and Prevention Agency (Fund Code No. 2019E280500, 2020E280700, and 2021-10-026).

References

1. Singer M, Deutschman CS, Seymour CW, Shankar-Hari M, Annane D, Bauer M, et al. The Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3). JAMA. 2016; 315(8):801–10. https://doi.org/10.1001/jama.2016.0287.
2. Fleischmann C, Scherag A, Adhikari NK, Hartog CS, Tsaganos T, Schlattmann P, et al. Assessment of global incidence and mortality of hospital-treated sepsis: current estimates and limitations. Am J Respir Crit Care Med. 2016; 193(3):259–72. https://doi.org/10.1164/rccm.201504-0781OC.
3. Sakr Y, Jaschinski U, Wittebole X, Szakmany T, Lipman J, Namendys-Silva SA, et al. Sepsis in intensive care unit patients: worldwide data from the intensive care over nations audit. Open Forum Infect Dis. 2018; 5(12):ofy313. https://doi.org/10.1093/ofid/ofy313.
4. Hu C, Li L, Huang W, Wu T, Xu Q, Liu J, et al. Interpretable machine learning for early prediction of prognosis in sepsis: a discovery and validation study. Infect Dis Ther. 2022; 11(3):1117–32. https://doi.org/10.1007/s40121-022-00628-6.
5. Park H, Lee J, Oh DK, Park MH, Lim CM, Lee SM, et al. Serial evaluation of the serum lactate level with the SOFA score to predict mortality in patients with sepsis. Sci Rep. 2023; 13(1):6351. https://doi.org/10.1038/s41598-023-33227-7.
6. van Doorn WP, Stassen PM, Borggreve HF, Schalkwijk MJ, Stoffers J, Bekers O, et al. A comparison of machine learning models versus clinical evaluation for mortality prediction in patients with sepsis. PLoS One. 2021; 16(1):e0245157. https://doi.org/10.1371/journal.pone.0245157.
7. OpenAI. GPT-4 technical report [Internet]. Ithaca (NY): arXiv.org;2023. [cited at 2023 Dec 1]. Available from: https://arxiv.org/abs/2303.08774.
8. Oh N, Choi GS, Lee WY. ChatGPT goes to the operating room: evaluating GPT-4 performance and its potential in surgical education and training in the era of large language models. Ann Surg Treat Res. 2023; 104(5):269–73. https://doi.org/10.4174/astr.2023.104.5.269.
9. Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How does ChatGPT perform on the United States Medical Licensing Examination (USMLE)?: the implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023; 9:e45312. https://doi.org/10.2196/45312.
10. Mihalache A, Popovic MM, Muni RH. Performance of an artificial intelligence chatbot in ophthalmic knowledge assessment. JAMA Ophthalmol. 2023; 141(6):589–97. https://doi.org/10.1001/jamaophthalmol.2023.1144.
11. Jeon K, Na SJ, Oh DK, Park S, Choi EY, Kim SC, et al. Characteristics, management and clinical outcomes of patients with sepsis: a multicenter cohort study in Korea. Acute Crit Care. 2019; 34(3):179–91. https://doi.org/10.4266/acc.2019.00514.
12. Rooke C, Smith J, Leung KK, Volkovs M, Zuberi S. Temporal dependencies in feature importance for time series predictions [Internet]. Ithaca (NY): arXiv.org;2021. [cited at 2024 Jul 10]. Available from: https://doi.org/10.48550/arXiv.2107.14317.
13. Theissler A, Spinnato F, Schlegel U, Guidotti R. Explainable AI for time series classification: a review, taxonomy and research directions. IEEE Access. 2022; 10:100700–24. https://doi.org/10.1109/ACCESS.2022.3207765.
14. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al. Language models are few-shot learners. Adv Neural Inf Process Syst. 2020; 33:1877–901.
15. Liu H, Tam D, Muqeeth M, Mohta J, Huang T, Bansal M, et al. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Adv Neural Inf Process Syst. 2022; 35:1950–65.
16. Min S, Lyu X, Holtzman A, Artetxe M, Lewis M, Hajishirzi H, et al. Rethinking the role of demonstrations: what makes in-context learning work? [Internet]. Ithaca (NY): arXiv.org;2022. [cited at 2024 Jul 10]. Available from: https://doi.org/10.48550/arXiv.2202.12837.
17. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001; 29(5):1189–232. https://doi.org/10.1214/aos/1013203451.
18. Zhang Z, Zhao Y, Canes A, Steinberg D, Lyashevska O. written on behalf of AME Big-Data Clinical Trial Collaborative Group. Predictive analytics with gradient boosting in clinical medicine. Ann Transl Med. 2019; 7(7):152. https://doi.org/10.21037/atm.2019.03.29.
19. Ray PP. ChatGPT: a comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet Things Cyber Phys Syst. 2023; 3:121–54. https://doi.org/10.1016/j.iotcps.2023.04.003.
20. Sanderson M, Chikhani M, Blyth E, Wood S, Moppett IK, McKeever T, et al. Predicting 30-day mortality in patients with sepsis: an exploratory analysis of process of care and patient characteristics. J Intensive Care Soc. 2018; 19(4):299–304. https://doi.org/10.1177/1751143718758975.
21. Burdick H, Pino E, Gabel-Comeau D, McCoy A, Gu C, Roberts J, et al. Effect of a sepsis prediction algorithm on patient mortality, length of stay and readmission: a prospective multicentre clinical outcomes evaluation of real-world patient data from US hospitals. BMJ Health Care Inform. 2020; 27(1):e100109. https://doi.org/10.1136/bmjhci-2019-100109.
22. Wu Y, Huang S, Chang X. Understanding the complexity of sepsis mortality prediction via rule discovery and analysis: a pilot study. BMC Med Inform Decis Mak. 2021; 21(1):334. https://doi.org/10.1186/s12911-021-01690-9.
23. Islam MM, Nasrin T, Walther BA, Wu CC, Yang HC, Li YC. Prediction of sepsis patients using machine learning approach: a meta-analysis. Comput Methods Programs Biomed. 2019; 170:1–9. https://doi.org/10.1016/j.cmpb.2018.12.027.
24. Kong G, Lin K, Hu Y. Using machine learning methods to predict in-hospital mortality of sepsis patients in the ICU. BMC Med Inform Decis Mak. 2020; 20(1):251. https://doi.org/10.1186/s12911-020-01271-2.
25. Li K, Shi Q, Liu S, Xie Y, Liu J. Predicting in-hospital mortality in ICU patients with sepsis using gradient boosting decision tree. Medicine (Baltimore). 2021; 100(19):e25813. https://doi.org/10.1097/MD.0000000000025813.
26. Jiang LY, Liu XC, Nejatian NP, Nasir-Moin M, Wang D, Abidin A, et al. Health system-scale language models are all-purpose prediction engines. Nature. 2023; 619(7969):357–62. https://doi.org/10.1038/s41586-023-06160-y.
27. Haleem A, Javaid M, Singh RP. An era of ChatGPT as a significant futuristic support tool: a study on features, abilities, and challenges. BenchCouncil Trans Benchmarks Stand Eval. 2022; 2(4):100089. https://doi.org/10.1016/j.tbench.2023.100089.

Figure 1
Example of in-context learning and prediction using ChatGPT. This figure illustrates the method of in-context learning and subsequent prediction using ChatGPT for assessing sepsis outcomes in ICU patients. It presents an example of how the model was provided with specific patient data, including age (age score in Charlson Comorbidity Index), SOFA score, and lactic acid levels at the time of ICU admission, along with the known outcomes after 7 days. SOFA: Sequential Organ Failure Assessment, ICU: intensive care unit.
hir-2024-30-3-266f1.gif
Figure 2
Experimental design. The Korean Sepsis Alliance (KSA) database was used in an experiment to predict in-hospital mortality for patients admitted to the intensive care unit (ICU). The predictive performance of the gradient boosting machine, ChatGPT-3.5, and ChatGPT-4 was compared.
hir-2024-30-3-266f2.gif
Figure 3
Performance evaluations of GPT-4 and GPT-3.5 for predicting all-cause mortality before discharge by (A) day 7 and (B) day 30, with accuracy, precision, recall, F1-score, and AUROC. Changes in predictive performance are shown as the number of examples for in-context learning varies. AUROC: area under the receiver operating characteristic curve.
hir-2024-30-3-266f3.gif
Figure 4
Temporal dependency in the predictive performance of GPT-4, GPT-3.5, and the gradient boosting machine (GBM). GPT-4 and GPT-3.5 were tested with zero shots, while GBM was trained with 80% of the dataset and tested with the remaining 20%. Thus, a simple performance comparison between ChatGPT and GBM should be interpreted with caution. AUROC: area under the receiver operating characteristic curve.
hir-2024-30-3-266f4.gif
Table 1
Dataset for predicting all-cause mortality before discharge by day 7

| Characteristic | Overall (n = 4,786) | Survival (n = 4,068) | Mortality (n = 718) | p-value |
|---|---|---|---|---|
| Age (yr) | | | | <0.001 |
| <50 | 372 (7.8) | 333 (8.2) | 39 (5.4) | |
| 50–59 | 549 (11.5) | 483 (11.9) | 66 (9.2) | |
| 60–69 | 1,043 (21.8) | 924 (22.7) | 119 (16.6) | |
| 70–79 | 1,451 (30.3) | 1,221 (30.0) | 230 (32.0) | |
| ≥80 | 1,371 (28.6) | 1,107 (27.2) | 264 (36.8) | |
| Sex | | | | 0.825 |
| Male | 2,805 (58.6) | 2,381 (58.5) | 424 (59.1) | |
| Female | 1,981 (41.4) | 1,687 (41.5) | 294 (40.9) | |
| Septic shock | 1,339 (28.0) | 1,044 (25.7) | 295 (41.1) | <0.001 |
| SOFA score | 9.8 ± 3.8 | 9.3 ± 3.6 | 12.6 ± 3.7 | <0.001 |
| Lactic acid (mmol/L) | 2.6 (1.6–5.4) | 2.4 (1.5–4.4) | 7.1 (3.9–11.8) | <0.001 |
| Initial vital signs | | | | |
| SBP (mmHg) | 90.0 (78.0–112.0) | 91.0 (79.0–116.0) | 88.0 (72.5–100.0) | <0.001 |
| DBP (mmHg) | 56.0 (46.0–69.0) | 57.0 (47.0–70.0) | 52.0 (44.0–63.0) | <0.001 |
| MBP (mmHg) | 67.7 (56.7–83.3) | 68.3 (57.3–84.3) | 64.0 (53.9–75.8) | <0.001 |
| HR (beats/min) | 106.6 ± 26.1 | 106.5 ± 25.9 | 107.1 ± 26.8 | 0.553 |
| RR (breaths/min) | 24.0 (20.0–28.0) | 23.0 (20.0–27.0) | 24.0 (21.5–28.0) | <0.001 |
| BT (°C) | 37.1 (36.4–38.1) | 37.2 (36.5–38.1) | 36.8 (36.1–37.9) | <0.001 |
| Comorbidities | | | | |
| Cardiovascular disease | 1,196 (25.0) | 1,030 (25.3) | 166 (23.1) | 0.227 |
| Respiratory disease | 657 (13.7) | 551 (13.5) | 106 (14.8) | 0.415 |
| Chronic neurologic disease | 1,163 (24.3) | 990 (24.3) | 173 (24.1) | 0.927 |
| Chronic liver disease | 480 (10.0) | 404 (9.9) | 76 (10.6) | 0.638 |
| Diabetes mellitus | 1,821 (38.0) | 1,555 (38.2) | 266 (37.0) | 0.577 |
| Chronic kidney disease | 747 (15.6) | 637 (15.7) | 110 (15.3) | 0.861 |
| Connective tissue disease | 140 (2.9) | 118 (2.9) | 22 (3.1) | 0.905 |
| Type of infection | | | | <0.001 |
| Community | 2,562 (53.5) | 2,116 (52.0) | 446 (62.1) | |
| Nursing home acquired | 262 (5.5) | 209 (5.1) | 53 (7.4) | |
| Nursing hospital acquired | 548 (11.5) | 438 (10.8) | 110 (15.3) | |
| Hospital acquired | 1,414 (29.5) | 1,305 (32.1) | 109 (15.2) | |
| Appropriateness of initial empirical therapy | | | | 0.001 |
| Appropriate | 4,193 (87.6) | 3,592 (88.3) | 601 (83.7) | |
| Inappropriate | 577 (12.1) | 466 (11.5) | 111 (15.5) | |
| Not applicable | 16 (0.3) | 10 (0.2) | 6 (0.8) | |

Values are presented as number (%), mean ± standard deviation, or median (interquartile range).

SOFA: Sequential Organ Failure Assessment, SBP: systolic blood pressure, DBP: diastolic blood pressure, MBP: mean blood pressure, HR: heart rate, RR: respiratory rate, BT: body temperature.

Table 2
Dataset for predicting all-cause mortality before discharge by day 30

| Characteristic | Overall (n = 4,025) | Survival (n = 2,657) | Mortality (n = 1,368) | p-value |
|---|---|---|---|---|
| Age (yr) | | | | <0.001 |
| <50 | 348 (8.6) | 270 (10.2) | 78 (5.7) | |
| 50–59 | 492 (12.2) | 353 (13.3) | 139 (10.2) | |
| 60–69 | 913 (22.7) | 647 (24.4) | 266 (19.4) | |
| 70–79 | 1,220 (30.3) | 787 (29.6) | 433 (31.7) | |
| ≥80 | 1,052 (26.1) | 600 (22.6) | 452 (33.0) | |
| Sex | | | | 0.022 |
| Male | 2,385 (59.3) | 1,540 (58.0) | 845 (61.8) | |
| Female | 1,640 (40.7) | 1,117 (42.0) | 523 (38.2) | |
| SOFA score | 10.0 ± 3.9 | 9.0 ± 3.5 | 12.0 ± 3.8 | <0.001 |
| Lac_D1 (mmol/L) | 2.8 (1.7–5.7) | 2.3 (1.5–4.1) | 5.0 (2.5–9.9) | <0.001 |

Values are presented as number (%), mean ± standard deviation, or median (interquartile range).

SOFA: Sequential Organ Failure Assessment, Lac_D1: lactic acid on the first day.

Table 3
Diagnostic performance of ChatGPT for predicting 7-day in-hospital mortality

| Example | GPT-4 Accuracy | GPT-4 Precision | GPT-4 Recall | GPT-4 F1-score | GPT-4 AUROC | GPT-3.5 Accuracy | GPT-3.5 Precision | GPT-3.5 Recall | GPT-3.5 F1-score | GPT-3.5 AUROC |
|---|---|---|---|---|---|---|---|---|---|---|
| Zero-shot: none | 0.65 | 1.00 | 0.61 | 0.76 | 0.81 | 0.32 | 0.92 | 0.26 | 0.40 | 0.54 |
| One-shot: 7D_S_1 | 0.62 | 1.00 | 0.56 | 0.72 | 0.78 | 0.43 | 0.97 | 0.37 | 0.54 | 0.64 |
| One-shot: 7D_S_2 | 0.67 | 0.98 | 0.64 | 0.77 | 0.77 | 0.54 | 0.98 | 0.49 | 0.66 | 0.70 |
| One-shot: 7D_S_3 | 0.77 | 0.98 | 0.76 | 0.86 | 0.83 | 0.67 | 0.92 | 0.69 | 0.79 | 0.62 |
| One-shot: 7D_D_1 | 0.46 | 1.00 | 0.40 | 0.57 | 0.70 | 0.13 | 1.00 | 0.02 | 0.04 | 0.51 |
| One-shot: 7D_D_2 | 0.50 | 1.00 | 0.44 | 0.61 | 0.72 | 0.13 | 1.00 | 0.02 | 0.04 | 0.51 |
| One-shot: 7D_D_3 | 0.61 | 0.98 | 0.57 | 0.72 | 0.74 | 0.14 | 1.00 | 0.03 | 0.07 | 0.52 |
| Two-shot: 7D_S_2 & 7D_D_2 | 0.55 | 1.00 | 0.49 | 0.66 | 0.75 | 0.40 | 0.97 | 0.34 | 0.50 | 0.63 |
| Two-shot: 7D_S_1 & 7D_D_3 | 0.57 | 1.00 | 0.52 | 0.68 | 0.76 | 0.40 | 1.00 | 0.33 | 0.49 | 0.66 |
| Six-shot: all | 0.48 | 1.00 | 0.42 | 0.59 | 0.71 | 0.48 | 0.97 | 0.43 | 0.59 | 0.67 |

AUROC: area under the receiver operating characteristic curve, SOFA: Sequential Organ Failure Assessment.

7D_S_1 indicates a male survivor in his 60s with a SOFA score of 6 and a lactic acid level of 1.5 mmol/L; 7D_S_2, a male survivor in his 70s with a SOFA score of 9 and lactic acid of 2.4; 7D_S_3, a male survivor in his 80s with a SOFA score of 13 and lactic acid of 4.4; 7D_D_1, a female non-survivor in her 70s with a SOFA score of 9 and lactic acid of 3.9; 7D_D_2, a male non-survivor in his 80s with a SOFA score of 13 and lactic acid of 7.1; and 7D_D_3, a male non-survivor in his 50s with a SOFA score of 16 and lactic acid of 11.8.
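As a purely hypothetical illustration of how in-context examples such as 7D_S_2 and 7D_D_2 might be serialized into a few-shot prompt: the paper's actual prompt template is not reproduced here, and the field names and wording below are invented for the sketch.

```python
# Hypothetical few-shot prompt construction for 7-day mortality
# prediction. Patient values come from the Table 3 footnote; the
# template itself is illustrative, not the study's actual prompt.
def format_example(age_group, sex, sofa, lactate, outcome):
    return (f"Patient: {age_group} {sex}, SOFA score {sofa}, "
            f"lactic acid {lactate} mmol/L. Outcome: {outcome}.")

# In-context examples 7D_S_2 (survivor) and 7D_D_2 (non-survivor)
shots = [
    format_example("70s", "male", 9, 2.4, "survived"),
    format_example("80s", "male", 13, 7.1, "died"),
]

def build_prompt(shots, target):
    header = ("Predict 7-day in-hospital mortality for a sepsis ICU "
              "patient. Answer 'survived' or 'died'.")
    # Zero-shot is the same prompt with an empty shots list
    return "\n".join([header, *shots, f"Patient: {target}. Outcome:"])

prompt = build_prompt(
    shots, "60s female, SOFA score 11, lactic acid 5.0 mmol/L")
print(prompt)
```

Varying the `shots` list from empty to all six examples corresponds to the zero-shot through six-shot rows of Table 3.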

Table 4
Diagnostic performance of ChatGPT for predicting 30-day in-hospital mortality

| Example | GPT-4 Accuracy | GPT-4 Precision | GPT-4 Recall | GPT-4 F1-score | GPT-4 AUROC | GPT-3.5 Accuracy | GPT-3.5 Precision | GPT-3.5 Recall | GPT-3.5 F1-score | GPT-3.5 AUROC |
|---|---|---|---|---|---|---|---|---|---|---|
| Zero-shot: none | 0.51 | 0.67 | 0.44 | 0.53 | 0.53 | 0.52 | 0.68 | 0.43 | 0.53 | 0.55 |
| One-shot: 30D_S_1 | 0.55 | 0.73 | 0.47 | 0.57 | 0.59 | 0.47 | 0.68 | 0.30 | 0.42 | 0.53 |
| One-shot: 30D_S_2 | 0.52 | 0.67 | 0.46 | 0.55 | 0.54 | 0.55 | 0.70 | 0.49 | 0.58 | 0.57 |
| One-shot: 30D_S_3 | 0.61 | 0.70 | 0.67 | 0.68 | 0.59 | 0.57 | 0.67 | 0.61 | 0.64 | 0.55 |
| One-shot: 30D_D_1 | 0.44 | 0.68 | 0.21 | 0.32 | 0.52 | 0.39 | 1.00 | 0.03 | 0.06 | 0.52 |
| One-shot: 30D_D_2 | 0.45 | 0.65 | 0.27 | 0.38 | 0.51 | 0.37 | 0.00 | 0.00 | 0.00 | 0.50 |
| One-shot: 30D_D_3 | 0.52 | 0.67 | 0.46 | 0.55 | 0.54 | 0.35 | 0.00 | 0.00 | 0.00 | 0.47 |
| Two-shot: 30D_S_2 & 30D_D_2 | 0.53 | 0.68 | 0.48 | 0.56 | 0.55 | 0.42 | 0.60 | 0.24 | 0.34 | 0.48 |
| Two-shot: 30D_S_1 & 30D_D_3 | 0.57 | 0.69 | 0.57 | 0.63 | 0.57 | 0.43 | 0.64 | 0.22 | 0.33 | 0.50 |
| Six-shot: all | 0.47 | 0.73 | 0.25 | 0.38 | 0.55 | 0.47 | 0.66 | 0.33 | 0.44 | 0.52 |

AUROC: area under the receiver operating characteristic curve, SOFA: Sequential Organ Failure Assessment.

30D_S_1 indicates a male survivor in his 80s with a SOFA score of 6 and a lactic acid level of 1.5 mmol/L; 30D_S_2, a female survivor in her 60s with a SOFA score of 9 and lactic acid of 2.3; 30D_S_3, a female survivor in her 70s with a SOFA score of 13 and lactic acid of 4.1; 30D_D_1, a male non-survivor in his 70s with a SOFA score of 8 and lactic acid of 2.5; 30D_D_2, a male non-survivor in his 60s with a SOFA score of 12 and lactic acid of 5.0; and 30D_D_3, a female non-survivor in her 70s with a SOFA score of 16 and lactic acid of 9.9.

Table 5
Diagnostic performance of ChatGPT (zero-shot) and the gradient boosting model for predicting all-cause mortality from ICU admission to day 30

| Day | GPT-3.5 Accuracy | GPT-3.5 AUROC | GPT-4 Accuracy | GPT-4 AUROC | Gradient boosting Accuracy | Gradient boosting AUROC |
|---|---|---|---|---|---|---|
| 0 | 0.26 | 0.49 | 0.55 | 0.75 | 0.92 | 0.88 |
| 1 | 0.36 | 0.54 | 0.59 | 0.73 | 0.90 | 0.88 |
| 2 | 0.40 | 0.53 | 0.63 | 0.73 | 0.88 | 0.86 |
| 3 | 0.48 | 0.63 | 0.62 | 0.71 | 0.87 | 0.85 |
| 4 | 0.38 | 0.49 | 0.62 | 0.71 | 0.85 | 0.85 |
| 5 | 0.41 | 0.57 | 0.62 | 0.72 | 0.85 | 0.83 |
| 6 | 0.41 | 0.53 | 0.61 | 0.70 | 0.85 | 0.84 |
| 7 | 0.41 | 0.54 | 0.61 | 0.69 | 0.84 | 0.83 |
| 14 | 0.40 | 0.47 | 0.61 | 0.66 | 0.79 | 0.79 |
| 21 | 0.54 | 0.59 | 0.63 | 0.66 | 0.77 | 0.78 |
| 28 | 0.52 | 0.58 | 0.63 | 0.66 | 0.75 | 0.78 |
| 30 | 0.48 | 0.54 | 0.63 | 0.66 | 0.73 | 0.77 |

ICU: intensive care unit, AUROC: area under the receiver operating characteristic curve.
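The measures reported in Tables 3–5 are standard binary classification metrics. As a minimal sketch, they can be computed with scikit-learn from hard predictions and probability-like scores; the labels and scores below are toy values, not study data.

```python
# Computing the metric set from Tables 3-5 (accuracy, precision,
# recall, F1-score, AUROC) on toy data. AUROC is computed from
# continuous scores; the other metrics use thresholded predictions.
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = died, 0 = survived
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # thresholded model output
scores = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]  # predicted risk

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "auroc": roc_auc_score(y_true, scores),
}
print(metrics)  # accuracy, precision, recall, and F1 are each 0.75 here
```

Note that a language model's categorical answers yield only hard predictions; obtaining an AUROC additionally requires a score, such as a stated probability or the rate of positive answers across repeated queries.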
