Abstract
Objectives
This study aimed to develop and evaluate a retrieval-augmented generation (RAG)-based chatbot system designed to optimize hospital operations. By leveraging electronic medical record (EMR) manuals, the system seeks to streamline administrative workflows and enhance healthcare delivery.
Methods
The system integrated fine-tuned multilingual embedding models (Multilingual-E5-Large and BGE-M3) for indexing and retrieving information from EMR manuals. A dataset comprising 5,931 question-document pairs was constructed through query augmentation and validated by domain experts. Fine-tuning was performed using contrastive learning to enhance semantic understanding, with performance assessed using top-k accuracy metrics. The Solar Mini Chat API was adopted for text generation, prioritizing Korean-language responses and cost efficiency.
Results
The fine-tuned models demonstrated marked improvements in retrieval accuracy, with BGE-M3 achieving 97.6% and Multilingual-E5-Large reaching 89.7%. The chatbot achieved high performance, with query latency under 10 ms and robust retrieval precision, effectively addressing operational EMR queries. Key applications included administrative task support and billing process optimization, highlighting its potential to reduce staff workload and enhance healthcare service delivery.
Conclusions
The RAG-based chatbot system successfully addressed critical challenges in healthcare administration, improving EMR usability and operational efficiency. Future research should focus on real-world deployment and longitudinal studies to further evaluate its impact on administrative burden reduction and workflow improvement.
Recent advancements in artificial intelligence (AI) have significantly transformed healthcare by addressing both clinical and administrative challenges. Among these innovations, chatbots have emerged as promising tools for improving healthcare service delivery and patient engagement [1–6]. By leveraging natural language processing and machine learning, these systems can assist with tasks such as providing medical guidance, scheduling appointments, and supporting remote monitoring [7].
However, conventional chatbot architectures often struggle with contextual reasoning, limiting their effectiveness in handling complex medical inquiries. They may generate responses that lack relevance or consistency, undermining user trust and reliability [8–10]. Retrieval-augmented generation (RAG) technology addresses these limitations by combining retrieval-based search with generative capabilities, enabling chatbots to access relevant information and generate contextually appropriate responses [11]. This approach enhances information retrieval efficiency, reduces response times, and improves the overall usability of chatbots in healthcare scenarios, including clinical decision support and patient education [12–20].
In this study, we developed and evaluated a RAG-based chatbot system aimed at improving hospital operations by integrating electronic medical record (EMR) system manuals as external knowledge sources. By optimizing both information retrieval and response generation, the system seeks to streamline administrative processes such as billing and insurance procedures, ultimately reducing the workload for healthcare staff and improving service efficiency.
The study is structured as follows: Section II describes the methodology, including data preparation, embedding model fine-tuning, vector database implementation, query processing, and response generation. Section III presents the experimental results and model evaluations. Section IV discusses the implications, potential applications, and limitations, and offers directions for future research.
The proposed system architecture integrates a RAG-based chatbot with the EMR system to enhance both operational efficiency and information accessibility (Figure 1).
At its core, the system employs a vector database to manage embedding generation, vectorized storage, and similarity search. This design enables seamless interaction between user queries and the EMR database while maintaining high retrieval accuracy. Indexed EMR documents are stored together with their corresponding vector embeddings, which are generated by a fine-tuned, domain-specific embedding model (see Section II-2). Incoming user queries are processed using the same embedding model to ensure alignment with the pre-indexed data, thus enabling accurate similarity search and retrieval (Section II-3). Retrieved documents are subsequently processed by the response generation module, which utilizes a pre-trained large language model (LLM) to generate coherent, contextually relevant responses. This approach ensures both precision and usability for healthcare professionals and patients (Section II-4). The LLM is further fine-tuned to prioritize medical accuracy and clarity, while adhering to domain-specific requirements. By integrating vector-based retrieval and language generation, the system delivers accurate, efficient, and secure access to EMR data, streamlining both administrative and clinical workflows.
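The retrieve-then-generate flow described above can be sketched as follows. This is a toy illustration only: `embed()` stands in for the fine-tuned embedding model and `generate()` for the Solar Mini Chat API call, and neither reflects the actual model interfaces.

```python
import math

# Toy stand-ins for the real components: embed() replaces the fine-tuned
# embedding model and generate() replaces the LLM API call.

def embed(text: str, dim: int = 8) -> list[float]:
    # Deterministic character-frequency "embedding", L2-normalised so
    # cosine similarity reduces to a dot product.
    vec = [0.0] * dim
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def retrieve(query: str, index: list[tuple[str, list[float]]], k: int = 1) -> list[str]:
    # Rank indexed documents by cosine similarity to the query embedding.
    q = embed(query)
    scored = sorted(index, key=lambda item: sum(a * b for a, b in zip(q, item[1])), reverse=True)
    return [doc for doc, _ in scored[:k]]

def generate(query: str, context: list[str]) -> str:
    # A real system would send the query plus retrieved context to the LLM.
    return f"Based on the manual ({context[0]}): ..."

# Index two manual snippets, then answer a query with retrieved context.
pages = ["patient transfer procedure", "billing code entry steps"]
index = [(p, embed(p)) for p in pages]
context = retrieve("patient transfer procedure", index, k=1)
answer = generate("How do I request a patient transfer?", context)
```

The essential property shown here is that the query and the documents pass through the same embedding function, which is what makes the similarity search meaningful.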
To maximize retrieval accuracy and efficiency, the embedding model is fine-tuned through a process tailored to the unique manual data of each participating hospital. This hospital-specific approach maintains dedicated, fine-tuned models for each facility, ensuring optimal performance. The fine-tuning process converts textual information from hospital manuals into high-dimensional embedding vectors, which are systematically stored in the vector database to facilitate efficient query processing. The following steps detail the fine-tuning workflow, reflecting best practices and recent advancements in the field.
This study evaluated several multilingual embedding models for EMR manual processing, ultimately selecting Multilingual-E5-Large [15] and BGE-M3 [16] based on retrieval performance, computational efficiency, and multilingual capabilities. Commercial models such as text-embedding-ada-002 and text-embedding-3-large were excluded due to higher costs, limited options for fine-tuning, and lower performance in non-English languages.
Both Multilingual-E5-Large and BGE-M3 were chosen for their support of long sequences and computational efficiency, ensuring smooth integration into the RAG-based chatbot system.
The primary dataset consisted of the EMR system usage manual, which covered system functionality, user operations, maintenance procedures, and troubleshooting. Its structured content provided a robust foundation for training embedding models specialized in healthcare administration. The manual included detailed descriptions of system architecture, operational workflows, and step-by-step guidelines, enabling the models to effectively process EMR-related queries.
The dataset, summarized in Table 1, underwent thorough preprocessing. Text was extracted from 267 PDFs (totaling 1,631 pages) covering system functionality and maintenance topics. After extraction, a cleaning process removed irrelevant characters, formatting artifacts, and errors to ensure high data quality. Query augmentation generated three unique queries per page, expanding the dataset to 5,931 entries.
The final dataset was structured into three main columns: query, file name, and document text. This rigorous preprocessing workflow ensured both consistency and diversity, providing a strong foundation for embedding model fine-tuning.
The training dataset was developed using a structured query generation and validation process. The GPT-3.5-turbo API [19] was utilized to generate question-document pairs from the EMR manual, segmenting the text and generating contextually relevant questions via structured prompts. For each manual page, three distinct questions were generated to simulate real-world queries from healthcare professionals, resulting in 5,931 question-document pairs covering all key EMR functionalities and use cases. To ensure accuracy and clinical relevance, three experienced nurses validated the dataset, refining terminology and clarifying context as needed. Their expert review ensured that the dataset was well-aligned with practical healthcare information needs.
The GPT-3.5-turbo prompt template was constructed in both Korean and English to maintain consistency in question generation and enhance the effectiveness of model training (Figure 2).
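The augmentation loop can be sketched as below. Here `call_llm()` is a stub standing in for the GPT-3.5-turbo request, and the prompt wording is illustrative rather than the exact bilingual template of Figure 2.

```python
# call_llm() is a stub for the GPT-3.5-turbo request; the prompt text is
# illustrative, not the exact template shown in Figure 2.

def build_prompt(page_text: str, n_questions: int = 3) -> str:
    return (
        f"Generate {n_questions} distinct questions that a healthcare "
        "professional might ask, each answerable from the document below.\n"
        f"Document:\n{page_text}"
    )

def call_llm(prompt: str) -> list[str]:
    # Stub: the real pipeline sends the prompt to the GPT-3.5-turbo API.
    return [f"Stub question {i + 1}" for i in range(3)]

def make_pairs(pages: list[tuple[str, str]]) -> list[dict]:
    # Three queries per page, stored in the dataset's three-column schema:
    # query, file name, and document text.
    pairs = []
    for file_name, text in pages:
        for question in call_llm(build_prompt(text)):
            pairs.append({"query": question, "file_name": file_name, "document": text})
    return pairs

pairs = make_pairs([("transfer_manual.pdf", "Steps for inpatient transfer registration.")])
```

With 1,977 pages and three questions per page, this loop yields the 5,931 pairs reported above.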
A test set of 134 question-answer pairs was constructed by medical experts to evaluate retrieval performance. These questions encompassed factual, procedural, and scenario-based queries related to EMR system functionality, with each answer linked to a ground truth reference in the manual for accuracy. Three experienced nurses validated this dataset as well, reviewing each pair for clinical relevance and accuracy. Their feedback refined terminology and improved contextual clarity, ensuring alignment with real-world healthcare workflows. The validated test set was used as a benchmark to assess retrieval accuracy and relevance, ensuring that the chatbot could effectively support healthcare professionals’ information needs.
A structured fine-tuning approach was applied to the two candidate embedding models, Multilingual-E5-Large and BGE-M3. The training dataset consisted of 5,931 question-document pairs, carefully curated from the EMR manual. Both the questions and document texts were tokenized using the respective native tokenizers of the embedding models to ensure optimal data formatting, which enhances the models’ ability to generate meaningful embeddings. The fine-tuning process was based on contrastive learning, with cosine similarity used as the optimization objective. This approach encouraged the models to maximize similarity between matched question-document pairs while minimizing similarity for unmatched pairs.
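The contrastive objective can be illustrated with a minimal in-batch-negatives loss: each query's cosine similarity to every document in the batch is soft-maxed, and the loss is the negative log-probability of the matched document. This is a sketch of the general technique, not the models' actual training code; the temperature of 0.02 mirrors the BGE-M3 setting reported later.

```python
import math

# Minimal InfoNCE-style contrastive loss with in-batch negatives.
# Matched question-document pairs sit on the diagonal of the
# query-document similarity matrix.

def normalise(v: list[float]) -> list[float]:
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def info_nce(query_embs: list[list[float]], doc_embs: list[list[float]],
             temperature: float = 0.02) -> float:
    q = [normalise(v) for v in query_embs]
    d = [normalise(v) for v in doc_embs]
    loss = 0.0
    for i in range(len(q)):
        # Cosine similarities of query i to every document in the batch.
        logits = [sum(a * b for a, b in zip(q[i], dj)) / temperature for dj in d]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]  # -log softmax probability of the match
    return loss / len(q)

# Loss is near zero when each query is closest to its own document,
# and large when the pairing is wrong.
good = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
bad = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimising this loss pushes matched pairs together and unmatched pairs apart, which is exactly the behaviour the retrieval stage relies on.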
We employed the top-K accuracy metric to evaluate the models on a validated test set of 46 queries. Top-K accuracy was calculated as follows:

\text{Top-}K\ \text{accuracy} = \frac{1}{N} \sum_{i=1}^{N} \max_{1 \le j \le K} \mathbf{1}(\text{doc}_{ij} = y_i),

where N is the total number of test queries, K is the number of documents retrieved for each query, doc_ij denotes the j-th retrieved document for query i, and y_i denotes the correct document for query i. The indicator function 1(doc_ij = y_i) equals 1 if doc_ij matches y_i, and 0 otherwise. This metric was chosen to reflect the study's goal of developing an efficient information retrieval system, in which presenting the correct result within the top five responses is essential for practical usability in healthcare workflows.
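The metric admits a direct implementation: a query counts as correct if its ground-truth document appears anywhere among the K retrieved documents. The document names below are illustrative.

```python
# Top-K accuracy: fraction of queries whose correct document appears
# among the first K retrieved documents.

def top_k_accuracy(retrieved: list[list[str]], truth: list[str], k: int = 5) -> float:
    n = len(truth)
    hits = sum(1 for docs, y in zip(retrieved, truth) if y in docs[:k])
    return hits / n

retrieved = [
    ["doc_a", "doc_b", "doc_c"],  # correct document at rank 1
    ["doc_x", "doc_y", "doc_z"],  # correct document at rank 3
    ["doc_p", "doc_q", "doc_r"],  # correct document not retrieved
]
truth = ["doc_a", "doc_z", "doc_m"]
acc = top_k_accuracy(retrieved, truth, k=3)  # 2 of 3 queries hit
```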
The comparative evaluation of the two fine-tuned models highlighted their respective strengths and weaknesses. These results informed the selection of the optimal embedding model for integration into the RAG-based chatbot system, as detailed in the Results section.
The vector database infrastructure was constructed through a systematic series of processes designed to maximize the performance of the RAG system. Figure 1 illustrates the detailed pipeline for vector database indexing and query processing.
The EMR system manual was preprocessed to prepare the text for embedding generation. A sliding window approach, using a 512-token window with a 128-token overlap, was employed to maintain contextual integrity while minimizing redundancy. Standard text normalization steps—including lowercase conversion, special character handling, and whitespace normalization—were applied to ensure clean, structured input.
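The sliding-window step can be sketched as follows: a 512-token window advanced by 384 tokens, so consecutive chunks share a 128-token overlap. Synthetic token strings are used here for illustration; the real pipeline operated on the embedding models' native tokenisations.

```python
# Sliding-window chunking: window of 512 tokens, 128-token overlap
# (i.e., a stride of 384 tokens between chunk starts).

def chunk_tokens(tokens: list[str], window: int = 512, overlap: int = 128) -> list[list[str]]:
    stride = window - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # last window already covers the end of the document
    return chunks

tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
```

The overlap ensures that a sentence falling on a chunk boundary is still seen whole in at least one chunk, which is the "contextual integrity" property noted above.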
Two multilingual models, Multilingual-E5-Large and BGE-M3, were used to generate 1024-dimensional embeddings. Unlike E5-Large, which only supports English, Multilingual-E5-Large accommodates multiple languages. Both models underwent identical preprocessing to allow for precise performance comparisons. The embedding generation pipeline was optimized for parallel execution, which enhanced computational efficiency and increased processing speed for large-scale document embedding.
The vector database implemented a structured indexing and similarity search approach. Various strategies—including the Hierarchical Navigable Small World (HNSW) index—were evaluated. Distance metrics such as inner product and L2 (Euclidean) were tested to identify the most effective indexing method. Ultimately, the database adopted an IndexFlatL2 index, using a unified 1024-dimensional structure across models to ensure consistent vector representation and facilitate comparative analysis.
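To make the chosen index concrete, the pure-Python sketch below mimics what a flat L2 index does: store every vector and rank all of them by squared Euclidean distance at query time. FAISS's IndexFlatL2 performs the same exhaustive search, vectorised; this class is only an illustration of its semantics, not the FAISS API.

```python
# Semantics of a flat (exhaustive) L2 index, as used via IndexFlatL2:
# no approximation, every stored vector is compared against the query.

class FlatL2Index:
    def __init__(self, dim: int):
        self.dim = dim
        self.vectors: list[list[float]] = []

    def add(self, vecs: list[list[float]]) -> None:
        for v in vecs:
            assert len(v) == self.dim
            self.vectors.append(v)

    def search(self, query: list[float], k: int) -> list[tuple[int, float]]:
        # Squared L2 distance to every stored vector, smallest first.
        dists = [
            (sum((a - b) ** 2 for a, b in zip(query, v)), i)
            for i, v in enumerate(self.vectors)
        ]
        dists.sort()
        return [(i, d) for d, i in dists[:k]]

index = FlatL2Index(dim=2)
index.add([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
nearest = index.search([0.9, 0.1], k=2)
```

Exhaustive search guarantees exact nearest neighbours, which is why a flat index is a sound baseline for comparing embedding models, at the cost of linear scan time as the corpus grows.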
A comprehensive metadata management system was developed to provide rich contextual information for each document embedding. The metadata schema included detailed document-specific attributes to improve retrieval precision and contextual understanding. Key metadata elements comprised unique document identifiers, sequential chunk positioning, verbatim content preservation, source section classification, source manual filename, and specific page numbers.
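A minimal sketch of such a metadata record is shown below; the field names are illustrative, not the system's actual attribute names.

```python
from dataclasses import dataclass, asdict

# Illustrative metadata record attached to each document embedding.
# Field names are hypothetical; they mirror the elements listed above.

@dataclass
class ChunkMetadata:
    doc_id: str        # unique document identifier
    chunk_index: int   # sequential chunk position within the document
    content: str       # verbatim chunk text
    section: str       # source section classification
    source_file: str   # source manual filename
    page: int          # page number in the source manual

meta = ChunkMetadata(
    doc_id="emr-0042",
    chunk_index=3,
    content="Select the transfer option and enter the target ward.",
    section="Inpatient Management",
    source_file="transfer_manual.pdf",
    page=17,
)
record = asdict(meta)  # dictionary form, ready to attach to a vector entry
```

Carrying the source filename and page number with each embedding is what lets the chatbot cite the exact manual page alongside its answer, as shown in Figure 3.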
The pipeline processed user queries in parallel for both Multilingual-E5-Large and BGE-M3 models. Each query was converted into a vector representation and matched with stored document embeddings. FAISS (Facebook AI Similarity Search) indices were used initially for performance evaluation, enabling direct comparisons between models under identical conditions [21]. For the final system, Weaviate replaced FAISS, offering improved scalability and advanced metadata management. Relevant passages were retrieved based on similarity scores and then used as context for response generation. This hybrid approach—FAISS for benchmarking and Weaviate for deployment—ensured rigorous evaluation and a scalable production system.
Retrieved content was passed to the LLM for response generation. The LLM synthesized accurate, contextually appropriate natural language responses. The system demonstrated robust performance, consistently maintaining an average query latency below 10 ms while delivering high retrieval accuracy.
The parallel query processing architecture enabled comparative analysis of retrieval performance across embedding models. This systematic design established a solid foundation for evaluating the real-world performance of the RAG-based chatbot system in EMR operational settings.
The Upstage Solar Mini Chat API [20] was selected over OpenAI’s GPT-3.5-turbo for its superior Korean text generation capabilities, making it particularly well-suited for RAG-based healthcare chatbots. Solar Mini Chat excels at generating contextually accurate responses by effectively incorporating supplementary input data beyond the initial query. It offers faster text generation than GPT-4-turbo, ensuring real-time responsiveness crucial for interactive chatbot applications. Additionally, its customization features support domain-specific models in law, finance, and healthcare, enhancing adaptability to diverse use cases. Comparative analysis demonstrated that Solar Mini Chat outperformed GPT-3.5-turbo in both text quality and cost efficiency, making it a practical choice for large-scale healthcare deployments. Its rapid response times, advanced handling of the Korean language, and significant cost advantages were key factors in its adoption as the text generation module for our RAG-based chatbot system. Ultimately, Solar Mini Chat ensured accurate, user-friendly, and cost-effective responses for end users.
A series of systematic hyperparameter optimization experiments were conducted to maximize the accuracy of the embedding models. The optimization results for Multilingual-E5-Large and BGE-M3 are summarized below.
For Multilingual-E5-Large, optimal performance was achieved by fine-tuning 12 training layers with a gradient accumulation of 2. The model was trained with a learning rate of 1e-4 and a batch size of 8 for 7 epochs.
For BGE-M3, a distinct set of optimal parameters emerged from our optimization process. The model performed best with a smaller train batch size of 1 and a more conservative learning rate of 1e-5. Training was performed with the fp32 data type over 5 epochs. The temperature was set to 0.02, and maximum lengths for queries and passages were configured at 64 and 1,024 tokens, respectively. Notably, self-distillation was enabled during training, which provided additional benefits to model performance.
Performance results for both models, using these optimized configurations, are presented in Table 2.
The fine-tuning experiments demonstrated substantial improvements in retrieval performance for both models when evaluated on the EMR system manual corpus. Both models began from an identical baseline accuracy of 84.126% and showed marked improvements after fine-tuning, though the magnitude of enhancement differed significantly.
For Multilingual-E5-Large, our hyperparameter optimization indicated that an aggressive learning rate of 1e-4 combined with a moderate batch size of 8 yielded optimal performance. After fine-tuning, the model achieved an accuracy of 89.682%, representing a 5.556 percentage point increase over baseline. The best configuration, involving 12 training layers and gradient accumulation of 2, suggests that substantial parameter updating was necessary for the model to adapt to the domain-specific characteristics of the EMR manual content.
For BGE-M3, systematic optimization showed that a conservative learning rate of 1e-5, combined with self-distillation, produced superior results. After fine-tuning, the model reached an accuracy of 97.619%, a significant 13.493 percentage point improvement from baseline. The optimal configuration utilized a longer passage context (1,024 tokens) and a shorter query length (64 tokens), which was particularly effective in capturing the hierarchical structure of EMR manual content. The use of the fp32 data type and a low temperature of 0.02 during training contributed to stable convergence and robust performance.
The dramatic performance gap between the two models (a final accuracy difference of 7.937 percentage points) suggests that BGE-M3’s architecture and training strategy are especially well-suited for Korean EMR manual embedding tasks. Its superior performance can be attributed to effective self-distillation and optimal handling of long passage contexts, all while maintaining computational efficiency.
As detailed above, comparative analysis of the two embedding models showed that BGE-M3 outperformed Multilingual-E5-Large in both retrieval accuracy and overall system performance. Consequently, BGE-M3 was selected as the optimal embedding model for integration into the RAG-based chatbot system (Figure 3).
Figure 3A displays the chat interface where users interact with the chatbot. For example, a user inquires, “How do I request a patient transfer?” and the chatbot responds with a detailed explanation of the steps involved, including selection options and the entry of relevant details in the EMR system. Beneath the chat, several files are listed with their respective names and page numbers, likely providing links to related manuals or supplementary instructions. Figure 3B presents a document viewer displaying a page titled “입원 환자리스트 - 전과전동/현위치 등록 (1/2),” which translates as “Inpatient List – Transfer/Current Location Registration (1/2).” The document includes structured guidelines, step-by-step instructions, and checkboxes for various options required for patient transfers, thereby offering clear procedural guidance to hospital staff.
This study fine-tuned an embedding model to enhance the accuracy and contextual relevance of responses for EMR system users, extending beyond basic search capabilities to better support healthcare administration.
One major application of this approach is the development of an administrative assistant chatbot. Healthcare professionals often spend substantial amounts of time on documentation, record management, and patient discharge procedures. By leveraging the proposed embedding model, an EMR-aware chatbot can efficiently retrieve procedural guidelines and relevant information, thereby reducing staff workload and enabling medical professionals to devote more time to patient care.
Another significant application is in billing and insurance support. Accurate retrieval of medical codes and billing procedures is critical for maintaining operational efficiency. The model’s high precision in retrieving domain-specific information can help minimize billing errors and accelerate insurance claim processing. Furthermore, rapid access to coding guidelines, such as Current Procedural Terminology (CPT) codes, can further streamline administrative workflows and increase efficiency.
Collectively, these applications illustrate how embedding models can optimize healthcare operations by facilitating accurate and timely information retrieval, ultimately improving both administrative processes and the quality of patient care delivery.
A key limitation of this study is the absence of direct human evaluation. Although automated assessments (a Precision@5 of 82.6%) demonstrated strong performance, the system has yet to be validated with real-world healthcare users. User feedback is critical for assessing practical usability and identifying areas for improvement. Further studies are required to evaluate the long-term impact on healthcare efficiency, cost reduction, and administrative workload. Real-world testing should focus on metrics such as time savings, error rates, and user satisfaction. Moreover, this study did not assess the model’s potential role in supporting clinical decision-making. Future research should examine integration into clinical workflows to expand the model’s utility beyond administrative tasks.
Future studies should incorporate structured user evaluations, including usability testing with healthcare professionals, to assess the system’s practical effectiveness. Qualitative assessments in clinical environments can provide valuable insights into real-world performance and areas for further refinement. By addressing these limitations and exploring expanded applications, the model can be further improved to better support healthcare organizations and enhance overall administrative efficiency.
References
1. Oh N, Cha WC, Seo JH, Choi SG, Kim JM, Chung CR, et al. ChatGPT predicts in-hospital all-cause mortality for sepsis: in-context learning with the Korean Sepsis Alliance Database. Healthc Inform Res. 2024; 30(3):266–76. https://doi.org/10.4258/hir.2024.30.3.266.

2. Jovanovic M, Baez M, Casati F. Chatbots as conversational healthcare services. IEEE Internet Comput. 2020; 25(3):44–51. https://doi.org/10.1109/MIC.2020.3037151.

3. Sun G, Zhou YH. AI in healthcare: navigating opportunities and challenges in digital communication. Front Digit Health. 2023; 5:1291132. https://doi.org/10.3389/fdgth.2023.1291132.

4. Xu L, Sanders L, Li K, Chow JCL. Chatbot for health care and oncology applications using artificial intelligence and machine learning: systematic review. JMIR Cancer. 2021; 7(4):e27850. https://doi.org/10.2196/27850.

5. Habicht J, Viswanathan S, Carrington B, Hauser TU, Harper R, Rollwage M. Closing the accessibility gap to mental health treatment with a personalized self-referral chatbot. Nat Med. 2024; 30(2):595–602. https://doi.org/10.1038/s41591-023-02766-x.

6. Fietta V, Rizzi S, De Luca C, Gios L, Pavesi MC, Gabrielli S, et al. A chatbot-based version of the World Health Organization-validated self-help plus intervention for stress management: co-design and usability testing. JMIR Hum Factors. 2024; 11:e64614. https://doi.org/10.2196/64614.

7. Ohannessian R, Duong TA, Odone A. Global telemedicine implementation and integration within health systems to fight the COVID-19 pandemic: a call to action. JMIR Public Health Surveill. 2020; 6(2):e18810. https://doi.org/10.2196/18810.

8. Olszewski R, Watros K, Manczak M, Owoc J, Jeziorski K, Brzezinski J. Assessing the response quality and readability of chatbots in cardiovascular health, oncology, and psoriasis: a comparative study. Int J Med Inform. 2024; 190:105562. https://doi.org/10.1016/j.ijmedinf.2024.105562.

9. Ji Z, Lee N, Frieske R, Yu T, Su D, Xu Y, et al. Survey of hallucination in natural language generation. ACM Comput Surv. 2023; 55(12):248. https://doi.org/10.1145/3571730.

10. Majeed A, Hwang SO. A data-centric AI paradigm for socio-industrial and global challenges. Electronics. 2024; 13(11):2156. https://doi.org/10.3390/electronics13112156.

11. Zhou Q, Liu C, Duan Y, Sun K, Li Y, Kan H, et al. GastroBot: a Chinese gastrointestinal disease chatbot based on the retrieval-augmented generation. Front Med (Lausanne). 2024; 11:1392555. https://doi.org/10.3389/fmed.2024.1392555.

12. Ranasinghe S, De Silva D, Mills N, Alahakoon D, Manic M, Lim Y, et al. Addressing the productivity paradox in healthcare with retrieval augmented generative AI chatbots. In : Proceedings of 2024 IEEE International Conference on Industrial Technology (ICIT); 2024 Mar 25–27; Bristol, UK. p. 1–6. https://doi.org/10.1109/ICIT58233.2024.10540818.

13. Bora A, Cuayahuitl H. Systematic analysis of retrieval-augmented generation-based LLMs for medical chatbot applications. Mach Learn Knowl Extr. 2024; 6(4):2355–74. https://doi.org/10.3390/make6040116.

14. Ke Y, Jin L, Elangovan K, Abdullah HR, Liu N, Sia AT, et al. Development and testing of retrieval augmented generation in large language models: a case study report [Internet]. Ithaca (NY): arXiv.org; 2024. [cited at 2025 Jul 1]. Available from: https://arxiv.org/abs/2402.01733.

15. Wang L, Yang N, Huang X, Yang L, Majumder R, Wei F. Multilingual E5 text embeddings: a technical report [Internet]. Ithaca (NY): arXiv.org; 2024. [cited at 2025 Jul 1]. Available from: https://arxiv.org/abs/2402.05672.

16. Chen J, Xiao S, Zhang P, Luo K, Lian D, Liu Z. BGE M3-embedding: multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation [Internet]. Ithaca (NY): arXiv.org; 2024. [cited at 2025 Jul 1]. Available from: https://arxiv.org/abs/2402.03216v1.

17. Greene R, Sanders T, Weng L, Neelakantan A. New and improved embedding model [Internet]. San Francisco (CA): OpenAI; 2022. [cited at 2025 Jul 1]. Available from: https://openai.com/index/new-and-improved-embedding-model/.

18. OpenAI. New embedding models and API updates [Internet]. San Francisco (CA): OpenAI; 2024. [cited at 2025 Jul 1]. Available from: https://openai.com/index/new-embedding-models-and-api-updates/.

19. OpenAI. GPT-3.5 Turbo API [Internet]. San Francisco (CA): OpenAI; 2024. [cited at 2025 Jul 1]. Available from: https://platform.openai.com/docs/models/gpt-3.5-turbo.

20. Choi E. Introducing Solar Mini: compact yet powerful [Internet]. Yongin, Korea: Upstage; 2024. [cited at 2025 Jul 1]. Available from: https://www.upstage.ai/blog/en/introducing-solar-mini-compact-yet-powerful.

21. Meta. Faiss (Facebook AI Similarity Search) [Internet]. Menlo Park (CA): Meta; 2024. [cited at 2025 Jul 1]. Available from: https://ai.meta.com/tools/faiss/.
Figure 1
System architecture for the RAG-based EMR chatbot service. The proposed system architecture integrates a vector database management system, utilizing the Weaviate vector database to store vectorized medical text and semantic document clusters. The query processing pipeline preprocesses user input and employs cosine similarity to retrieve relevant documents from the vector database efficiently. Retrieved documents are then passed to the Response Generation module, which utilizes the Upstage Solar LLM to synthesize contextually accurate and natural language responses. The system operates within a secure EMR framework, enforcing robust access control measures and prioritizing patient data privacy to ensure compliance with healthcare regulations. RAG: retrieval-augmented generation, EMR: electronic medical record, LLM: large language model.
Figure 2
Prompt template used for the GPT-3.5-turbo API. The template consists of two parts: Korean on the left and English translation on the right. The content is structured to guide the user or system (e.g., GPT-3.5-turbo) by generating contextually relevant questions from the document.
Figure 3
RAG-based EMR chatbot system. This screenshot highlights the functionality of the RAG-based EMR chatbot system, showcasing how it aids users in navigating and performing tasks within an EMR system. RAG: retrieval-augmented generation, EMR: electronic medical record.