Ahn: Large Language Model Advances in Transfusion Medicine: From Answering Questions to Supporting Clinical Decisions
Large language models (LLMs) have demonstrated increasing utility in supporting clinical decisions, particularly in complex medical domains requiring pattern recognition and knowledge synthesis [1]. LLMs operate through next-token prediction, iteratively generating a probable next token (a word or word fragment) conditioned on the input prompt and the text generated so far, to form coherent responses. Although this approach represents stochastic pattern matching rather than true understanding, their ability to process extensive medical information renders LLMs valuable for augmenting clinical reasoning [2]. By carefully engineering prompts, clinicians can modulate the probability distributions of subsequent tokens to guide models toward clinically accurate and contextually appropriate outputs.
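The next-token mechanism described above can be illustrated with a minimal sketch. The score table `TOY_LOGITS`, its tokens, and the function names are hypothetical and stand in for a trained model; real LLMs compute such scores with a neural network over a vocabulary of many thousands of tokens and usually sample from the distribution rather than always taking the top token.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw token scores into a probability distribution."""
    scaled = [score / temperature for score in logits.values()]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {token: e / total for token, e in zip(logits, exps)}

# Hypothetical toy "model": next-token scores keyed by the previous token.
TOY_LOGITS = {
    "<start>": {"RhD-negative": 2.0, "RhD-positive": 1.0},
    "RhD-negative": {"patient": 3.0, "units": 1.0},
    "RhD-positive": {"patient": 2.0, "units": 2.5},
    "patient": {"<end>": 4.0},
    "units": {"<end>": 4.0},
}

def greedy_decode(start="<start>", max_tokens=5):
    """Repeatedly pick the most probable next token until <end>."""
    tokens, current = [], start
    for _ in range(max_tokens):
        probs = softmax(TOY_LOGITS[current])
        current = max(probs, key=probs.get)
        if current == "<end>":
            break
        tokens.append(current)
    return tokens

print(greedy_decode())                # ['RhD-negative', 'patient']
print(greedy_decode("RhD-positive"))  # ['units']
```

Changing the starting context changes the distribution over continuations, which is the effect that prompt engineering exploits. Lowering `temperature` sharpens the distribution toward the highest-scoring token.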
Transfusion medicine presents challenges well suited for artificial intelligence (AI)-augmented decision support. The field involves high-risk decisions where errors can cause hemolytic events, whereas overly conservative approaches result in inefficient use of scarce resources. The complexity of blood group systems, compatibility requirements, and patient-specific factors creates opportunities for targeted AI assistance. The growing interest in systems such as ChatGPT has led to a conflation of LLMs with the broader AI field. This conflation disregards the computational demands, probabilistic nature, and limitations of LLMs for certain clinical tasks.
The optimal AI approach depends on the nature of the clinical problem (Fig. 1). For structured tasks involving tabular data, such as surgical blood loss prediction, traditional machine-learning models provide faster and more deterministic solutions [3, 4]. For logic-based processes such as ABO compatibility checks or transfusion thresholds, rule-based engines are reliable, interpretable, and efficient. For text-based tasks with predictable structure, regular expressions can be used to efficiently extract information and provide structured input data for downstream models. The unique value of LLMs becomes apparent in scenarios requiring semantic comprehension, such as parsing complex clinical narratives or making decisions under ambiguity. In essence, LLMs are particularly suited to tasks previously dependent on human intelligence to manage complexity and ambiguity, not as a universal solution for all computational challenges.
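As a hedged illustration of the two non-LLM options named above, the sketch below pairs a rule-based ABO check with a regular expression. It covers only standard ABO compatibility for red cell units and deliberately ignores RhD, other antigens, and plasma products. The note format, the pattern `HB_PATTERN`, and all function names are assumptions made for this example, not part of any cited system.

```python
import re

# Recipient ABO group -> donor red cell ABO groups that are compatible.
# Simplified: red cells only; RhD and other antigens are not considered.
RBC_COMPATIBLE_DONORS = {
    "O": {"O"},
    "A": {"A", "O"},
    "B": {"B", "O"},
    "AB": {"A", "B", "AB", "O"},
}

def abo_rbc_compatible(recipient: str, donor: str) -> bool:
    """Deterministic lookup: same input always gives the same answer."""
    return donor in RBC_COMPATIBLE_DONORS[recipient]

# Hypothetical pattern for notes written like "Hb 7.2 g/dL" or "Hgb: 10 g/dL".
HB_PATTERN = re.compile(
    r"\b(?:Hb|Hgb|hemoglobin)\s*[:=]?\s*(\d+(?:\.\d+)?)\s*g/dL",
    re.IGNORECASE,
)

def extract_hemoglobin(note: str):
    """Return the first hemoglobin value found in the note, or None."""
    match = HB_PATTERN.search(note)
    return float(match.group(1)) if match else None

print(abo_rbc_compatible("A", "O"))   # True
print(abo_rbc_compatible("O", "AB"))  # False
print(extract_hemoglobin("Post-op day 1, Hb 7.2 g/dL, stable."))  # 7.2
```

Both functions are fully interpretable and run in microseconds, which is why such tasks rarely justify an LLM. The regular expression fails silently on notes that deviate from the assumed format, which marks the boundary where semantic comprehension becomes useful.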
In this issue of Annals of Laboratory Medicine, Lee et al. [5] provide a rigorous evaluation of six leading LLMs for clinical-decision support in RhD blood-type transfusion scenarios. The findings establish a performance baseline based on benchmarking against 22 human specialists. Human experts averaged 80% accuracy, whereas the top-performing LLM (GPT-4o) achieved 70%, followed by Gemini 1.5 (63%) and GPT-4 (60%). Comparable performance across Korean and English tasks challenges assumptions regarding “anglocentric bias” in AI, highlighting the robust multilingual capabilities of state-of-the-art models.
In contrast to their performance on single-select questions, the models demonstrated near-total failure on multi-select questions, with an average accuracy of 4%. Minimal gains from basic prompt engineering highlight a significant limitation that must be addressed before clinical application. The detailed analysis of failure modes is the most important contribution of the study. By identifying specific shortcomings rather than relying solely on aggregate metrics, the findings offer a framework for examining recent advancements. This article outlines those LLM advancements and their implications for supporting clinical decisions in transfusion medicine.
Retrieval-augmented generation enhances LLMs by integrating generative capabilities with dynamic external information retrieval, addressing hallucinations, outdated knowledge, and a lack of domain expertise [6]. Through semantic search, relevant text segments from curated databases are retrieved and inserted into prompts, enabling real-time guideline updates, region-specific protocols, and transparent source citations [7]. Kaplinsky et al. [8] demonstrated 72% precision and recall in detecting major bleeding events in electronic health records (EHRs) with interpretable reasoning.
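The retrieve-then-insert step can be sketched as follows. This is a minimal illustration under stated assumptions: word-overlap cosine similarity stands in for the embedding-based semantic search used in practice, the passages in `PASSAGES` are placeholders rather than real protocol text, and all names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(question, passages, k=2):
    """Return the k (id, text) pairs most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        passages.items(),
        key=lambda item: cosine(q, Counter(tokenize(item[1]))),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question, passages, k=2):
    """Insert retrieved passages, tagged with source IDs, into the prompt."""
    context = "\n".join(
        f"[{pid}] {text}" for pid, text in retrieve(question, passages, k)
    )
    return (
        "Answer using only the sources below and cite their IDs.\n"
        f"{context}\nQuestion: {question}"
    )

# Placeholder passages; a real system would index curated guidelines.
PASSAGES = {
    "protocol-rhd": "Placeholder excerpt on selecting red cells for RhD negative recipients.",
    "protocol-platelets": "Placeholder excerpt on platelet storage and inventory.",
    "protocol-mtp": "Placeholder excerpt on activating the massive transfusion protocol.",
}

print(build_prompt("Which red cells for an RhD negative recipient?", PASSAGES, k=1))
```

Because each passage carries its identifier into the prompt, the model can be asked to cite it, and updating a guideline only requires replacing the stored passage rather than retraining the model.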
Chain-of-thought prompting instructs models to decompose complex problems into sequential logical steps [9]. Test-time compute scaling (e.g., OpenAI’s o1/o3) allocates additional computational resources during inference to enable deeper reasoning chains. Yang et al. [10] reported superior diagnostic reasoning across 30 laboratory medicine cases, and Huang et al. [11] observed accuracy improvements with increased reasoning budgets.
Techniques that explore multiple reasoning paths can further improve reliability. Self-consistency generates multiple step-by-step answers and selects the most frequent outcome via majority vote, reducing the impact of random LLM variations [12]. Notably, high variability in the generated answers can itself be a useful signal, indicating low model confidence or multiple acceptable but suboptimal answers. The tree-of-thought reasoning framework explores multiple reasoning paths in parallel, evaluates progress, and backtracks from unpromising approaches, which is valuable for problems requiring consideration of multiple possibilities [13].
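The voting step of self-consistency is simple enough to sketch directly. In the example below, the sampled answers are hard-coded stand-ins for the final answers of several independent model runs, and the review threshold of 0.6 is an arbitrary assumption for illustration, not a validated cutoff.

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over sampled answers, plus the agreement ratio."""
    if not answers:
        raise ValueError("at least one sampled answer is required")
    counts = Counter(answers)
    best, votes = counts.most_common(1)[0]
    return best, votes / len(answers)

def needs_review(agreement, threshold=0.6):
    """Flag low agreement as a signal of low model confidence."""
    return agreement < threshold

# Hypothetical final answers from five independent reasoning runs.
samples = ["B", "B", "A", "B", "C"]
answer, agreement = self_consistency(samples)
print(answer, agreement)        # B 0.6
print(needs_review(agreement))  # False
```

The agreement ratio makes the variability signal explicit: a case in which five runs produce four different answers can be routed to a human specialist instead of being answered automatically.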
Domain-specific adaptation is an effective approach for aligning general-purpose LLMs with specialized medical requirements without extensive retraining. Med-PaLM achieved 67.6% accuracy on MedQA using instruction prompt tuning with just 40 curated examples, marking the first time the passing score was surpassed [2]. Med-PaLM 2 combined an improved base model with medical fine-tuning, achieving 86.5% accuracy and surpassing physician performance [14]. MedGemma demonstrated 2.6–10% gains in accuracy through continued pre-training combined with task-specific fine-tuning, using fewer parameters [15]. These approaches enable smaller models with medical specialization and institutional customization without extensive computational resources.
LLM agents represent a sophisticated evolution in AI systems, combining reasoning capabilities with the ability to use external tools and take actions [16]. Frameworks like Reasoning and Acting (ReAct) enable a model to generate a reasoning trace (“I need to check the patient’s antibody-screening results”) and then execute a corresponding action (writing and executing commands to query the laboratory information system) in an interleaved cycle [17]. This enables the agent to actively gather information, verify hypotheses, and adapt its strategy based on new data. An agent could be equipped with tools to access the laboratory information system for real-time test results, the blood bank inventory system for blood product availability, and the EHR for patient history. A multi-agent framework can further enhance this process by incorporating an orchestrator agent that spawns specialized sub-agents or invokes the appropriate tools for each sub-task (Fig. 1) [18]. This multi-agent framework mirrors the collaborative nature of medical teams, where different experts contribute their specialized knowledge. This decomposition and delegation approach is particularly effective for complex tasks that require the compilation and consideration of multiple pieces of information from multiple sources.
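The interleaved reason-then-act cycle can be sketched as a short loop. Everything below is hypothetical: `scripted_policy` replaces the LLM with a fixed script, and `query_lis` and `query_inventory` are stubs that return canned strings rather than calling a real laboratory information system or inventory system.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, str]  # (thought, action, observation)

def query_lis(patient_id: str) -> str:
    """Stub for a laboratory information system lookup."""
    return f"antibody screen for {patient_id}: negative (stub)"

def query_inventory(product: str) -> str:
    """Stub for a blood bank inventory lookup."""
    return f"{product}: 4 units available (stub)"

TOOLS: Dict[str, Callable[[str], str]] = {
    "lis": query_lis,
    "inventory": query_inventory,
}

def scripted_policy(history: List[Step]) -> Tuple[str, str, str]:
    """Stand-in for the LLM: returns (thought, action, argument)."""
    if len(history) == 0:
        return ("I need to check the patient's antibody-screening results",
                "lis", "patient-001")
    if len(history) == 1:
        return ("I should confirm product availability",
                "inventory", "RhD-negative red cells")
    return ("I have enough information to summarize", "finish", "")

def run_agent(policy, tools, max_steps: int = 5) -> List[Step]:
    """Alternate reasoning and tool calls until the policy finishes."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, action, argument = policy(history)
        if action == "finish":
            break
        observation = tools[action](argument)  # act, then observe
        history.append((thought, action, observation))
    return history

for thought, action, observation in run_agent(scripted_policy, TOOLS):
    print(f"Thought: {thought}\nAction: {action}\nObservation: {observation}\n")
```

In a real agent, the policy would be an LLM call that receives the accumulated history, so each observation can change the next thought. The `max_steps` cap and the explicit tool registry are common safeguards that limit what the agent can do and how long it can run.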
The evaluation by Lee et al. [5] demonstrated both the current capabilities and limitations of LLMs in transfusion medicine. The observed performance improvement from GPT-3.5 to GPT-4 and GPT-4o reflects rapid advancements in model capabilities, primarily through scaling. More recently, agentic-reasoning models that utilize test-time compute scaling and tool calling to solve more complex problems have foreshadowed continued improvements and the likely integration of these systems into clinical-decision support workflows.
Achieving clinically reliable AI systems in transfusion medicine will require active participation from medical professionals. Such participation includes developing comprehensive test sets that capture edge cases and unusual scenarios (such as RhD variants), establishing clear standards for AI tool evaluation and deployment, and creating feedback mechanisms for continuous improvement. The future of AI-augmented transfusion medicine lies in recognizing this as a bidirectional learning opportunity where AI systems can capture and systematize the implicit reasoning patterns of experienced clinicians, and physicians can leverage the capacity of AI to simultaneously evaluate multiple decision pathways and integrate vast amounts of data that would overwhelm human cognitive processing.

Notes

AUTHOR CONTRIBUTIONS

The author confirms sole responsibility for manuscript conception and preparation.

CONFLICTS OF INTEREST

None declared.

REFERENCES

1. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. 2023; Large language models in medicine. Nat Med. 29:1930–40. DOI: 10.1038/s41591-023-02448-8. PMID: 37460753.
2. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. 2023; Large language models encode clinical knowledge. Nature. 620:172–80. DOI: 10.1038/s41586-023-06291-2. PMID: 37438534. PMCID: PMC10396962.
3. You J, Seok HS, Kim S, Shin H. 2025; Advancing laboratory medicine practice with machine learning: swift yet exact. Ann Lab Med. 45:22–35. DOI: 10.3343/alm.2024.0354. PMID: 39587856. PMCID: PMC11609717.
4. Maynard S, Farrington J, Alimam S, Evans H, Li K, Wong WK, et al. 2024; Machine learning in transfusion medicine: a scoping review. Transfusion. 64:162–84. DOI: 10.1111/trf.17582. PMID: 37950535. PMCID: PMC11497333.
5. Lee JK, Choi S, Park S, Hwang SH, Cho D. 2025; Evaluation of six large language models for clinical decision support: application in transfusion decision-making for RhD blood-type patients. Ann Lab Med. 45:520–29. DOI: 10.3343/alm.2024.0588. PMID: 40289855.
6. Gargari OK, Habibi G. 2025; Enhancing medical AI with retrieval-augmented generation: a mini narrative review. Digit Health. 11:20552076251337177. DOI: 10.1177/20552076251337177. PMID: 40343063. PMCID: PMC12059965.
7. Ahn S. 2025; A guide to evade hallucinations and maintain reliability when using large language models for medical research: a narrative review. Ann Pediatr Endocrinol Metab. 30:115–18. DOI: 10.6065/apem.2448278.139. PMID: 40624912. PMCID: PMC12235426.
8. Kaplinsky P, Singh R, Fusillo TF, Leader A, Zwicker JI, Mantha S. 2024; Retrieval augmented generation for the detection of major bleeding events in the electronic health record. Blood. 144(S1):2263. DOI: 10.1182/blood-2024-203911.
9. Wei J, Wang X, Schuurmans D, Bosma M, Xia F, Chi E, Le QV, Zhou D. 2022; Chain-of-thought prompting elicits reasoning in large language models. Adv Neural Inf Process Syst. 35:24824–37.
10. Yang HS, Li J, Yi X, Wang F. 2025; Performance evaluation of large language models with chain-of-thought reasoning ability in clinical laboratory case interpretation. Clin Chem Lab Med. 63:e199–201. DOI: 10.1515/cclm-2025-0055. PMID: 40023838.
11. Huang X, Wu J, Liu H, Tang X, Zhou Y. m1: Unleash the potential of test-time scaling for medical reasoning with large language models. arXiv [preprint] 2025;2504.00869. DOI: 10.48550/arXiv.2504.00869.
12. Wang X, Wei J, Schuurmans D, Le Q, Chi E, Narang S, Chowdhery A, Zhou D. Self-consistency improves chain of thought reasoning in language models. arXiv [preprint] 2025;2203.11171. DOI: 10.48550/arXiv.2203.11171.
13. Yao S, Yu D, Zhao J, Shafran I, Griffiths T, Cao Y, Narasimhan K. 2023; Tree of thoughts: deliberate problem solving with large language models. Adv Neural Inf Process Syst. 36:11809–22.
14. Singhal K, Tu T, Gottweis J, Sayres R, Wulczyn E, Amin M, Hou L, Clark K, Pfohl SR, Cole-Lewis H, Neal D. 2025; Toward expert-level medical question answering with large language models. Nat Med. 31:943–50. DOI: 10.1038/s41591-024-03423-7. PMID: 39779926. PMCID: PMC11922739.
15. Sellergren A, Kazemzadeh S, Jaroensri T, Kiraly A, Traverse M, Kohlberger T, Xu S, Jamil F, Hughes C, Lau C, Chen J. MedGemma technical report. arXiv [preprint] 2025;2507.05201. DOI: 10.48550/arXiv.2507.05201.
16. Yuan H. 2025; Agentic large language models for healthcare: current progress and future opportunities. Med Adv. 3:37–41. DOI: 10.1002/med4.70000.
17. Yao S, Zhao J, Yu D, Du N, Shafran I, Narasimhan K, Cao Y. 2023; ReAct: synergizing reasoning and acting in language models. International Conference on Learning Representations. 10451467.
18. Anthropic. How we built our multi-agent research system. https://www.anthropic.com/engineering/built-multi-agent-research-system. Updated on Jun 2025.

Fig. 1

Decision framework for selecting AI approaches in clinical tasks.

Abbreviations: AI, artificial intelligence; LLM, large language model.