Abstract
Objective
This study aimed to evaluate the reliability and usefulness of information generated by Chat Generative Pre-Trained Transformer (ChatGPT) on temporomandibular joint disorders (TMD).
Methods
We asked ChatGPT about the diseases specified in the TMD classification and scored the responses using Likert reliability and usefulness scales, the modified DISCERN (mDISCERN) scale, and the Global Quality Scale (GQS).
Results
The highest Likert scores for both reliability and usefulness were for masticatory muscle disorders (mean ± standard deviation [SD] 6.0 ± 0), and the lowest scores were for inflammatory disorders of the temporomandibular joint (mean ± SD 4.3 ± 0.6 for reliability, 4.0 ± 0 for usefulness). The median Likert reliability score was 6 (4–6) for rater 1 and 5 (4–6) for rater 2, indicating that the responses were highly reliable. The median Likert usefulness score was 5 (4–6), indicating that the responses were moderately useful. A comparative analysis was performed, and no statistically significant differences were found in any subject for either reliability or usefulness (P = 0.083–1.000). The median mDISCERN score was 4 (3–5) for the two raters. A statistically significant difference was observed in the mean mDISCERN scores between the two raters (P = 0.046). The GQS scores indicated moderate to high quality (mean ± SD 3.8 ± 0.8 for rater 1, 4.0 ± 0.5 for rater 2). No statistically significant correlation was found between the mDISCERN and GQS scores (r = –0.006, P = 0.980).
Introduction
Temporomandibular joint disorders (TMD) represent a significant public health concern, affecting an estimated 5–12% of the general population. They comprise a group of conditions characterized by pain and loss of function and are associated with headaches, restricted mouth opening, hypermobility syndromes, fibromyalgia, anxiety, and sleep disorders.1-4 Given the impact of TMD on daily life, patients seek prompt solutions. In recent years, several innovations have emerged in treatment modalities and patient management strategies for TMD; however, the complex nature of these diseases and their etiologies still poses notable challenges for health professionals. Studies show that 50–80% of patients obtain information about their disease online before visiting a doctor.5
Artificial intelligence (AI) refers to technology that can solve complex problems in a manner resembling human thinking and mimic human cognitive processes.6 In recent years, the development of large language models (LLMs) has revolutionized the field of AI.7,8 A multitude of chatbots, such as Woebot, Your.MD, HealthTap, Cancer Chatbot, VitaminBot, Babylon Health, Safedrugbot, Microsoft Bing, and Google Bard, are utilized for diverse purposes within the realm of health.9-11 In dentistry, AI is used in appointment scheduling, clinical diagnosis and treatment planning, malocclusion detection in orthodontics, automatic classification of restorations in panoramic radiographs, and the detection of periodontal diseases, root caries, and maxillofacial abnormalities.12
The most discussed and used chatbot is Chat Generative Pre-Trained Transformer (ChatGPT), developed by the San Francisco-based company OpenAI and released in November 2022. The chatbot was trained on a large dataset and converses with users in a manner analogous to that of a human. ChatGPT uses AI to respond to natural language queries in a human-like way.13-15 Its popularity is due to its detail, speed, and convenience. ChatGPT can provide numerous services to professionals in dentistry and the wider healthcare field.16-18 ChatGPT has been tested on all three steps of the United States Medical Licensing Examination, and its capabilities have been demonstrated.19
In addition, it is important to note that AI models, particularly LLMs, can produce varying results depending on the phrasing of the input and the quality of the datasets on which they were trained. These models may occasionally demonstrate inconsistencies in their responses and, in certain instances, generate false or incomplete information, a phenomenon known as hallucination. Hallucinations are more prevalent when a model lacks sufficient or consistent data on a given topic, particularly in cases involving obscure or controversial subjects.20,21
Several articles on AI and LLMs have recently been published in various fields.8-19 These publications evaluated the competence and reliability of ChatGPT, focusing on the views of health professionals. However, a comprehensive literature review identified no such studies on TMD. This study therefore evaluated the reliability and usefulness of ChatGPT-4 responses to TMD keywords, assessing ChatGPT's efficacy in informing patients and professionals about this growing concern.
We propose the following hypothesis: ChatGPT, with its extensive training on multiple topics, provides reliable and useful information for patients with TMD and can serve as an effective supplementary resource for understanding TMD symptoms, management options, and self-care strategies, thereby increasing patient knowledge and improving patient outcomes.
Materials and Methods
This study was conducted in accordance with the principles of the Declaration of Helsinki. Ethics committee approval was not required because no human or animal data were used.
TMD comprises subheadings with various etiologies. The subgroups in the most recent guideline from the American Academy of Orofacial Pain and the International Headache Society, as modified by Okeson, were used as keywords.22 These are "masticatory muscle disorders," "disc-condyle complex irregularities," "structural disorders of the articular surfaces," "inflammatory disorders of temporomandibular joint (TMJ)," "chronic mandibular hypermobility," "ankyloses," and "growth disorders of TMJ." A dialogue was initiated with the AI service (ChatGPT-4) using the entry "TMD," and each specific etiology was then discussed in detail. To obtain comprehensive information, we first asked ChatGPT about the disease itself and then about its causes, symptoms, and treatment. We created a new account to ensure that ChatGPT provided impartial answers. Each keyword was initiated as a new conversation and recorded for analysis. ChatGPT responses were obtained from the version released on October 12. The wording of the questions was standardized and consistent. Responses were examined for inconsistencies and hallucinations (Supplementary data).
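For illustration only, the query protocol above could be scripted as follows. This is a minimal sketch assuming the OpenAI Python SDK; the study itself used the ChatGPT web interface, and the model name, question wording, and output file are placeholder assumptions rather than the authors' procedure.

```python
# Hypothetical sketch of the query protocol (the study used the ChatGPT
# web interface; model name and file layout are illustrative assumptions).
import json
from openai import OpenAI

KEYWORDS = [
    "masticatory muscle disorders",
    "disc-condyle complex irregularities",
    "structural disorders of the articular surfaces",
    "inflammatory disorders of temporomandibular joint (TMJ)",
    "chronic mandibular hypermobility",
    "ankyloses",
    "growth disorders of TMJ",
]
# Standardized, consistently worded questions asked for each keyword.
QUESTIONS = [
    "What is {kw}?",
    "What causes {kw}?",
    "What are the symptoms of {kw}?",
    "How is {kw} treated?",
]

client = OpenAI()  # reads OPENAI_API_KEY from the environment

records = []
for kw in KEYWORDS:
    history = []  # each keyword starts a fresh conversation
    for template in QUESTIONS:
        question = template.format(kw=kw)
        history.append({"role": "user", "content": question})
        response = client.chat.completions.create(model="gpt-4", messages=history)
        answer = response.choices[0].message.content
        history.append({"role": "assistant", "content": answer})
        records.append({"keyword": kw, "question": question, "answer": answer})

# Save the transcripts so the raters can score them later.
with open("tmd_chatgpt_responses.json", "w") as f:
    json.dump(records, f, indent=2)
```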
The content of each response was evaluated for reliability using the ChatGPT reliability score, a Likert-type scale with scores ranging from 1 to 7.14 Responses were verified against medical and scientific sources to ensure they were free of incomplete or erroneous information. Higher scores indicate greater reliability (Table 1).
The ChatGPT usefulness score was used to assess the utility of each response for patients.17 This is also a Likert-type scale with values ranging from 1 to 7, with higher scores indicating greater usefulness (Table 1).
The modified DISCERN (mDISCERN) scale was used to evaluate the reliability of the ChatGPT responses. It comprises five questions with yes/no answers (Table 1), and total scores range from 0 to 5. According to the mDISCERN criteria, a total score of 1 was considered negative, a score of 5 positive, and scores of 2–4 partially reliable, with higher scores indicating greater reliability.
The Global Quality Scale (GQS), initially presented by Bernard et al., was modified with a specific emphasis on assessing the accuracy and utility of the information. The responses generated by ChatGPT were evaluated using this modified GQS (Table 1).23
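Because each of the four instruments yields a single bounded rating per response, a rater's scores can be represented as one record. The sketch below is a hypothetical convenience, not part of the study; the field bounds follow the scale descriptions above.

```python
# Hypothetical container for one rater's scores on one ChatGPT response,
# with range checks matching the scale definitions above.
from dataclasses import dataclass

@dataclass
class ResponseScores:
    keyword: str
    reliability: int  # Likert-type, 1-7 (higher = more reliable)
    usefulness: int   # Likert-type, 1-7 (higher = more useful)
    mdiscern: int     # five yes/no items summed, 0-5
    gqs: int          # Global Quality Scale, 1-5

    def __post_init__(self):
        bounds = {
            "reliability": (self.reliability, 1, 7),
            "usefulness": (self.usefulness, 1, 7),
            "mdiscern": (self.mdiscern, 0, 5),
            "gqs": (self.gqs, 1, 5),
        }
        for name, (value, lo, hi) in bounds.items():
            if not lo <= value <= hi:
                raise ValueError(f"{name}={value} outside [{lo}, {hi}]")

# Example with plausible values (not actual study data).
scores = ResponseScores("masticatory muscle disorders", 6, 6, 4, 3)
```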
Two specialists from independent fields, an orthodontist and a physical medicine specialist, evaluated the screenshots to avoid bias. If the raters disagreed, a third independent rater assessed the data and made the final decision.
Analyses were performed using MedCalc® Statistical Software version 19.7.2 (MedCalc Software Ltd., Ostend, Belgium; https://www.medcalc.org). Quantitative data were summarized as the mean, standard deviation (SD), and median (minimum–maximum). The Wilcoxon signed-rank test was used to compare two dependent variables that did not conform to a normal distribution. Inter-rater agreement was analyzed using Cronbach's α; agreement coefficients from 0 to 0.2 were interpreted as poor, 0.2 to 0.4 as fair, 0.4 to 0.6 as moderate, 0.6 to 0.8 as good, and 0.8 to 1.0 as very good agreement. The level of statistical significance was set at P < 0.05.
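As a minimal sketch of the analysis described above, the paired rater comparison and the agreement coefficient could be reproduced as follows; scipy provides the Wilcoxon signed-rank test, Cronbach's α is computed directly from its definition, and the example arrays are placeholders rather than the study data.

```python
# Sketch of the statistical analysis: Wilcoxon signed-rank test for paired
# rater scores and Cronbach's alpha for inter-rater agreement.
import numpy as np
from scipy.stats import wilcoxon

def cronbach_alpha(ratings: np.ndarray) -> float:
    """Cronbach's alpha; rows = rated responses, columns = raters."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                     # number of raters
    item_vars = ratings.var(axis=0, ddof=1)  # variance of each rater's scores
    total_var = ratings.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rater1 = np.array([6, 6, 5, 4, 5, 6, 6])  # placeholder scores, not study data
rater2 = np.array([6, 5, 5, 4, 4, 5, 5])

stat, p = wilcoxon(rater1, rater2)  # paired, non-parametric comparison
alpha = cronbach_alpha(np.column_stack([rater1, rater2]))
print(f"Wilcoxon P = {p:.3f}, Cronbach alpha = {alpha:.3f}")
```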
Results
Two experienced, independent raters evaluated the responses. The results are presented in the following tables and figure.
The distribution of responses on the Likert-type reliability scale is shown in Table 2. The median Likert reliability scores were 6 (4–6) for rater 1 and 5 (4–6) for rater 2, indicating that the responses were highly reliable. The median reliability scores showed no statistically significant differences between the groups (P > 0.05; Table 3). Inter-rater reliability of the scoring process was evaluated for the entire content, and a high level of agreement was observed between the two raters (Cronbach α: 0.829).
Table 3 presents the distribution of responses on the Likert-type usefulness scale. The median Likert usefulness score was 5 (4–6) for both raters, indicating that the responses were moderately useful. The agreement of usefulness scoring between the two raters was examined for all content, and moderate agreement was observed (Cronbach α: 0.693). The median usefulness scores showed no statistically significant differences between the groups (P > 0.05; Table 3).
The highest Likert scores for both the reliability and the usefulness of the information provided by ChatGPT were for masticatory muscle disorders (mean ± SD: 6.0 ± 0). The lowest scores for both reliability and usefulness were for inflammatory disorders of the TMJ (mean ± SD: 4.3 ± 0.6 for reliability and 4.0 ± 0 for usefulness). The scores given by the two raters ranged from 4 to 6 (P = 0.317–1.000).
A comparative analysis was performed on the total reliability and usefulness scores for all topics. No statistically significant difference was found for any subject in either the total reliability or the total usefulness scores (P = 0.083–1.000; Table 3).
Table 2 presents the distribution of ChatGPT response scores according to the mDISCERN scale. No statistically significant difference was identified between the median mDISCERN values of the groups (P > 0.05). Regarding reliability as determined using the mDISCERN tool, rater 1 identified very high reliability, whereas rater 2 identified high reliability (mean ± SD: 4.1 ± 0.7 and 3.8 ± 0.6, respectively). The median mDISCERN score of the responses was 4 (3–5) for both raters. A statistically significant difference was observed in the mean mDISCERN scores between raters 1 and 2 (P = 0.046; Table 3). There was no statistically significant correlation between the mDISCERN and GQS scores (r = –0.006, P = 0.980).
The mean GQS scores for raters 1 and 2 were 3.8 ± 0.8 and 4.0 ± 0.5, respectively; the raters assessed the responses as being of moderate to high quality. As shown in Table 3, the GQS scores showed only moderate agreement between the two raters (Cronbach α: 0.452). No statistically significant differences were observed between the groups (P > 0.05).
The reliability and usefulness indices were calculated from the mean scores assigned by the raters; the resulting distribution is shown in Figure 1. The highest mean reliability was observed for masticatory muscle disorders and ankylosis (mean ± SD: 6.0 ± 0), and the lowest for chronic mandibular hypermobility (mean ± SD: 4.0 ± 0), as determined by the separate scoring of the two independent raters. For usefulness, the highest mean value was seen for masticatory muscle disorders and disc-condyle complex derangement (mean ± SD: 6.0 ± 0), and the lowest for chronic mandibular hypermobility (mean ± SD: 4.0 ± 0). There were no significant differences between the raters in the total reliability or usefulness scores.
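In outline, Figure 1 can be reproduced by averaging the raters' scores per disease and attaching interquartile-range error bars. The sketch below uses randomly generated placeholder scores, not the published data, and clips the error bars at zero in case a mean falls outside its IQR.

```python
# Sketch of Figure 1: mean score per disease with IQR error bars
# (placeholder values, not the study data).
import numpy as np
import matplotlib.pyplot as plt

diseases = ["Masticatory muscle", "Disc-condyle complex", "Articular surfaces",
            "Inflammatory TMJ", "Hypermobility", "Ankylosis", "Growth disorders"]

rng = np.random.default_rng(0)
scores = rng.integers(4, 7, size=(7, 6))   # 7 diseases x 6 pooled ratings
means = scores.mean(axis=1)
q1, q3 = np.percentile(scores, [25, 75], axis=1)
err = np.vstack([np.maximum(means - q1, 0),   # lower bars, clipped at zero
                 np.maximum(q3 - means, 0)])  # upper bars, clipped at zero

plt.bar(range(len(diseases)), means, yerr=err, capsize=4)
plt.xticks(range(len(diseases)), diseases, rotation=45, ha="right")
plt.ylabel("Mean score (IQR error bars)")
plt.tight_layout()
plt.show()
```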
Discussion
Recently, AI has brought about innovative tools such as ChatGPT, which are designed to engage in conversations and provide information on various topics. ChatGPT can be used in various areas of dentistry, such as symptom assessment and temporary treatment suggestions, appointment scheduling and reminders, treatment planning and assessment, image analysis and diagnostics, patient monitoring and follow-up, and literature reviews.14,24
The effects of AI on dentistry span various clinical and administrative domains. Clinically, AI is primarily used for diagnostic purposes, enhancing the accuracy and efficiency of identifying oral conditions, such as dental caries and periodontal diseases.25 Moreover, AI-driven imaging technologies have advanced the detection capabilities of dental radiographs, enabling the early and accurate diagnosis of complex oral issues. AI contributes to practice management by streamlining processes, such as appointment scheduling, patient communication, and claims processing, allowing dental professionals to devote more time to direct patient care.26,27 Innovations like AI-powered live coaching systems and computer vision technologies for pediatric and orthodontic care are anticipated to improve patient outcomes and practice efficiency.28
Despite these benefits, the integration of AI into dentistry is challenging. Ethical concerns, data privacy issues, and potential errors are the prominent controversies surrounding the adoption of AI in clinical practice.29
In addition, the quality of training data heavily influences the efficacy of AI tools, necessitating continuous updates. These challenges highlight the importance of adopting a cautious approach emphasizing the transparency and reliability of AI applications. As AI technology matures, its potential to transform dental-care delivery and patient engagement will continue to grow, fostering a new era of precision and enhanced patient-centered care.30 Health-related misinformation can lead to misdiagnosis, treatment errors, and serious consequences.31
Given the above scientific data, we aimed to assess whether ChatGPT displays a high level of knowledge on specific topics. To this end, we examined the reliability and usefulness of the information provided by ChatGPT concerning the various subtopics related to TMD affecting the daily lives of patients.
This study had several strengths. We employed a combination of objective, validated tools to assess reliability (Likert scales, the GQS, and the mDISCERN criteria), whereas previously published studies were mainly descriptive.32,33 The Likert scale is an effective tool for evaluating the reliability and usefulness of generated responses because it allows clear differentiation among levels of accuracy, and it is commonly used in the existing literature for comparable assessments.34-36
DISCERN is a validated instrument recognized as an effective method for assessing the reliability and quality of written health information, intended for both health professionals and laypeople (Institute of Health Sciences, University of Oxford, Oxford, UK).37 In the present study, as suggested by previous studies, the mDISCERN scale was used to assess the reliability of the information, focusing on specific criteria through a detailed analysis of TMD.38
TMJ problems require multidisciplinary treatment involving the cooperation of many specialists. Therefore, in our study the responses were evaluated separately by an orthodontist and a physical medicine specialist. This approach can provide valuable insights into the strengths and limitations of the ChatGPT platform in addressing the diverse aspects of TMJ problems and helps ensure a more reliable and holistic evaluation.
Our results indicate that masticatory muscle disorders exhibited the highest reliability and usefulness scores; the responses on their treatment received a mean score of 6.0 ± 0. The literature describes a wide range of treatment options for masticatory muscle diseases, from conservative treatments such as occlusal splints and medication to more interventional approaches such as Botox injections and surgical procedures.39 ChatGPT's treatment suggestions for masticatory muscle disorders were comprehensive and accurate. However, some topics required additional detail: accurate information was provided on conservative treatments, whereas the information on advanced surgical options remained superficial, and in complex areas such as the TMJ the model's guidance was incomplete. This is a limitation of ChatGPT. The lowest mean usefulness and reliability scores were associated with inflammatory TMJ diseases. The reasons for this lower score could vary, including the complexity of the subject matter, the quality of the available information, and the presentation of the content.
Although ChatGPT can provide helpful information, it is vital to highlight the significance of seeking guidance from qualified healthcare experts and reliable medical sources when making decisions regarding TMD. ChatGPT should be considered an additional source of information; it does not replace the expertise of actual doctors. Previous research noted that ChatGPT frequently recommended contacting an orthodontist, dentist, or doctor at the end of its responses, which is consistent with our findings.31
Although the raters agreed on the usefulness and reliability scores, they disagreed on some GQS and mDISCERN scores; where the scores were incompatible, a third evaluator made the final decision. The GQS grading of the responses revealed that the level of agreement between the two evaluators was not particularly strong (Cronbach α: 0.452). In particular, the scores related to patient treatment differed, possibly because the raters treat patients using different clinical approaches.
Balel40 used the GQS to evaluate ChatGPT-generated answers on maxillofacial surgery as rated by physicians; the results were of moderate quality. Kılınç and Mansız41 used the Flesch-Kincaid and DISCERN tools to evaluate the reliability and readability of information about orthodontics generated by ChatGPT-4; the mean DISCERN value was reported to be 2.96 ± 0.05 for general questions and 3.04 ± 0.06, 2.38 ± 0.27, and 2.82 ± 0.31 for treatment-related questions. In our study, the mDISCERN tool indicated high reliability for both raters (mean ± SD: 4.1 ± 0.7 and 3.8 ± 0.6, respectively). This discrepancy between evaluations may be attributed to the evaluators' different specialties.
Dursun and Bilici Geçer35 used a Likert scale, mDISCERN, the GQS, and the Flesch Reading Ease Score to assess ChatGPT-3.5, ChatGPT-4, Gemini, and Copilot. Responses were moderately reliable and of good quality: ChatGPT-4's mean GQS score was 3.8 ± 0.62, indicating high quality, and the highest mean Likert score for the ChatGPT-4 responses was 4.5 ± 0.61. These results are consistent with our findings; in our study, ChatGPT-4's responses on TMD were of moderate to high quality, with mean Likert scores of 5.4 ± 0.7 for reliability and 4.9 ± 0.8 for usefulness.
Tanaka et al.42 used a Likert scale to evaluate the reliability of ChatGPT-4 in answering questions on clear aligners, temporary anchorage devices, and digital imaging in orthodontics. That study found only slight agreement between assessors, with a combined Fleiss' κ of 0.004 (P < 0.001), whereas our study demonstrated a high level of inter-rater agreement (Cronbach α: 0.829).
Khanagar et al.43 noted that in the context of AI applications for diagnosis and treatment planning in orthodontics, AI achieved accuracy and precision comparable to that of trained examiners. Abu Arqub et al.44 reported that ChatGPT generated responses to questions about clear aligner therapy in orthodontics with less-than-optimal accuracy. However, none of these studies referred to TMD.
LLM’s accuracy depends on the questions and training data despite their large datasets. Some inconsistencies are due to the model’s answers to the questions. The formulation of questions can result in discrepancies in model responses. Hallucinations, a prevalent issue in language models, must be considered. The generation of false or fabricated information by the model presents a significant risk, particularly in clinical applications.45
ChatGPT can generate coherent responses but may fabricate details or provide outdated information. A systematic review found that 96.7% of the included studies raised concerns about the accuracy of ChatGPT; these concerns highlight its risks, including misinformation, lack of originality, and hallucinations.46 ChatGPT's effectiveness depends on its training data: it struggles with rare or novel medical conditions but shows potential in certain contexts, including clinical documentation and healthcare efficiency. Additionally, studies have reported remarkable accuracy on specific medical examinations, suggesting that it could be useful in medical education and decision-making.47
In this study, some of the model's responses were incomplete or incorrect. Responses to complex topics, such as inflammatory disorders of the TMJ, were incomplete or lacked detail. This demonstrates the need to evaluate language models rigorously when they are used to provide health information. The aforementioned studies similarly highlighted the phenomenon of hallucination in ChatGPT responses.
Our study has several limitations. Although ChatGPT-4 can provide information on many subjects, its knowledge base extends only to April 2023, so it may be unaware of recent developments, and it cannot verify or update the accuracy of the information it provides. ChatGPT does not assess individual health status or private medical histories and provides general information rather than personalized medical advice; to evaluate individual health conditions accurately, it is essential to consult qualified healthcare professionals. Finally, the absence of published studies on this topic prevented us from discussing it in detail.48
A previous study of TMD content on YouTube found the information to be inadequate. It is important to evaluate sources, as they may not provide accurate information, particularly for complex medical conditions.49 ChatGPT's main advantage is its conversational interactivity.50 Studies are needed to compare the information provided by ChatGPT with that available on YouTube and other patient information platforms; such comparisons may highlight areas for improvement and help patients make informed decisions.
Although AI has immense potential in healthcare, it should always be used together with human expertise. Future research should focus on analyzing various LLMs. The proliferation of misinformation has become a major concern in the digital age, and academic institutions play an important role in ensuring that patients with TMD are directed toward the correct diagnostic and therapeutic processes. AI models have limitations, including variability in responses and the risk of hallucinations; therefore, the information these models provide must be verified by experts before clinical application.
Notes
AUTHOR CONTRIBUTIONS
Conceptualization: All authors. Data curation: All authors. Formal analysis: All authors. Investigation: All authors. Methodology: All authors. Project administration: All authors. Resources: All authors. Software: All authors. Supervision: All authors. Validation: All authors. Visualization: All authors. Writing–original draft: All authors. Writing–review & editing: All authors.
References
1. Andre A, Kang J, Dym H. Pharmacologic treatment for temporomandibular and temporomandibular joint disorders. Oral Maxillofac Surg Clin North Am 2022;34:49–59. https://doi.org/10.1016/j.coms.2021.08.001
2. Shaffer SM, Brismée JM, Sizer PS, Courtney CA. Temporomandibular disorders. Part 1: anatomy and examination/diagnosis. J Man Manip Ther 2014;22:2–12. https://doi.org/10.1179/2042618613Y.0000000060
3. Thomas DC, Khan J, Manfredini D, Ailani J. Temporomandibular joint disorder comorbidities. Dent Clin North Am 2023;67:379–92. https://doi.org/10.1016/j.cden.2022.10.005
4. Valesan LF, Da-Cas CD, Réus JC, Denardin ACS, Garanhani RR, Bonotto D, et al. Prevalence of temporomandibular joint disorders: a systematic review and meta-analysis. Clin Oral Investig 2021;25:441–53. https://doi.org/10.1007/s00784-020-03710-w
5. AlGhamdi KM, Moussa NA. Internet use by the public to search for health-related information. Int J Med Inform 2012;81:363–73. https://doi.org/10.1016/j.ijmedinf.2011.12.004
6. Mintz Y, Brodie R. Introduction to artificial intelligence in medicine. Minim Invasive Ther Allied Technol 2019;28:73–81. https://doi.org/10.1080/13645706.2019.1575882
7. Kung JE, Marshall C, Gauthier C, Gonzalez TA, Jackson JB 3rd. Evaluating ChatGPT performance on the orthopaedic in-training examination. JB JS Open Access 2023;8:e23. https://doi.org/10.2106/JBJS.OA.23.00056
8. Sharma S, Pajai S, Prasad R, Wanjari MB, Munjewar PK, Sharma R, et al. A critical review of ChatGPT as a potential substitute for diabetes educators. Cureus 2023;15:e38380. https://doi.org/10.7759/cureus.38380
9. Palanica A, Flaschner P, Thommandram A, Li M, Fossat Y. Physicians' perceptions of chatbots in health care: cross-sectional web-based survey. J Med Internet Res 2019;21:e12887. https://doi.org/10.2196/12887
10. Acar AH. Can natural language processing serve as a consultant in oral surgery? J Stomatol Oral Maxillofac Surg 2024;125:101724. https://doi.org/10.1016/j.jormas.2023.101724
11. Davenport T, Kalakota R. The potential for artificial intelligence in healthcare. Future Healthc J 2019;6:94–8. https://doi.org/10.7861/futurehosp.6-2-94
12. Ahmed N, Abbasi MS, Zuberi F, Qamar W, Halim MSB, Maqsood A, et al. Artificial intelligence techniques: analysis, application, and outcome in dentistry-a systematic review. Biomed Res Int 2021;2021:9751564. https://doi.org/10.1155/2021/9751564
13. Alhaidry HM, Fatani B, Alrayes JO, Almana AM, Alfhaed NK. ChatGPT in dentistry: a comprehensive review. Cureus 2023;15:e38317. https://doi.org/10.7759/cureus.38317
14. Eggmann F, Weiger R, Zitzmann NU, Blatz MB. Implications of large language models such as ChatGPT for dental medicine. J Esthet Restor Dent 2023;35:1098–102. https://doi.org/10.1111/jerd.13046
15. Strunga M, Urban R, Surovková J, Thurzo A. Artificial intelligence systems assisting in the assessment of the course and retention of orthodontic treatment. Healthcare (Basel) 2023;11:683. https://doi.org/10.3390/healthcare11050683
16. Schwendicke F, Samek W, Krois J. Artificial intelligence in dentistry: chances and challenges. J Dent Res 2020;99:769–74. https://doi.org/10.1177/0022034520915714
17. Cankurtaran RE, Polat YH, Aydemir NG, Umay E, Yurekli OT. Reliability and usefulness of ChatGPT for inflammatory bowel diseases: an analysis for patients and healthcare professionals. Cureus 2023;15:e46736. https://doi.org/10.7759/cureus.46736
18. Uz C, Umay E. "Dr ChatGPT": is it a reliable and useful source for common rheumatic diseases? Int J Rheum Dis 2023;26:1343–9. https://doi.org/10.1111/1756-185X.14749
19. Hasnain M, Hayat A, Hussain A. Revolutionizing chronic obstructive pulmonary disease care with the open AI application: ChatGPT. Ann Biomed Eng 2023;51:2100–2. https://doi.org/10.1007/s10439-023-03238-6
20. Chelli M, Descamps J, Lavoué V, Trojani C, Azar M, Deckert M, et al. Hallucination rates and reference accuracy of ChatGPT and Bard for systematic reviews: comparative analysis. J Med Internet Res 2024;26:e53164. https://doi.org/10.2196/53164
21. Alkaissi H, McFarlane SI. Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus 2023;15:e35179. https://doi.org/10.7759/cureus.35179
22. Yaltırık M, Palancıoğlu A, Koray M, Turgut CT. Temporomandibular joint disorders and diagnosis. Yeditepe J Dent 2017;13:43–50. https://doi.org/10.5505/yeditepe.2017.07078
23. Bernard A, Langille M, Hughes S, Rose C, Leddin D, Veldhuyzen van Zanten S. A systematic review of patient inflammatory bowel disease information resources on the World Wide Web. Am J Gastroenterol 2007;102:2070–7. https://doi.org/10.1111/j.1572-0241.2007.01325.x
24. Agrawal P, Nikhade P. Artificial intelligence in dentistry: past, present, and future. Cureus 2022;14:e27405. https://doi.org/10.7759/cureus.27405
25. Anil S, Sudeep K, Saratchandran S, Sweety VK. Revolutionizing dental caries diagnosis through artificial intelligence. In: Chibinski ACR, editor. Dental caries perspectives - a collection of thoughtful essays. London: IntechOpen; 2023. https://doi.org/10.5772/intechopen.112979
26. Musleh D, Almossaeed H, Balhareth F, Alqahtani G, Alobaidan N, Altalag J, et al. Advancing dental diagnostics: a review of artificial intelligence applications and challenges in dentistry. Big Data Cogn Comput 2024;8:66. https://doi.org/10.3390/bdcc8060066
27. Ghaffari M, Zhu Y, Shrestha A. A review of advancements of artificial intelligence in dentistry. Dent Rev 2024;4:100081. https://doi.org/10.1016/j.dentre.2024.100081
28. Balaban C, Inam W, Kennedy R, Faiella R. The future of dentistry: how AI is transforming dental practices. Compend Contin Educ Dent 2021;42:14–7. https://pubmed.ncbi.nlm.nih.gov/33481621/
29. Xie B, Xu D, Zou XQ, Lu MJ, Peng XL, Wen XJ. Artificial intelligence in dentistry: a bibliometric analysis from 2000 to 2023. J Dent Sci 2024;19:1722–33. https://doi.org/10.1016/j.jds.2023.10.025
30. Kukalakunta Y, Thunki P, Yellu RR. Integrating artificial intelligence in dental healthcare: opportunities and challenges. J Deep Learn Genom Data Anal 2024;4:34–41. https://aithor.com/paper-summary/integrating-artificial-intelligence-in-dental-healthcare-opportunities-and-challenges
31. Kessels RP. Patients' memory for medical information. J R Soc Med 2003;96:219–22. https://doi.org/10.1177/014107680309600504
32. Vinufrancis A, Al Hussein H, Patel HV, Nizami A, Singh A, Nunez B, et al. Assessing the quality and reliability of AI-generated responses to common hypertension queries. Cureus 2024;16:e66041. https://doi.org/10.7759/cureus.66041
33. Onder CE, Koc G, Gokbulut P, Taskaldiran I, Kuskonmaz SM. Evaluation of the reliability and readability of ChatGPT-4 responses regarding hypothyroidism during pregnancy. Sci Rep 2024;14:243. https://doi.org/10.1038/s41598-023-50884-w
34. Xie Y, Seth I, Hunter-Smith DJ, Rozen WM, Seifman MA. Investigating the impact of innovative AI chatbot on post-pandemic medical education and clinical assistance: a comprehensive analysis. ANZ J Surg 2024;94:68–77. https://doi.org/10.1111/ans.18666
35. Dursun D, Bilici Geçer R. Can artificial intelligence models serve as patient information consultants in orthodontics? BMC Med Inform Decis Mak 2024;24:211. https://doi.org/10.1186/s12911-024-02619-8
36. Hatia A, Doldo T, Parrini S, Chisci E, Cipriani L, Montagna L, et al. Accuracy and completeness of ChatGPT-generated information on interceptive orthodontics: a multicenter collaborative study. J Clin Med 2024;13:735. https://doi.org/10.3390/jcm13030735
37. Alan R, Alan BM. Utilizing ChatGPT-4 for providing information on periodontal disease to patients: a DISCERN quality analysis. Cureus 2023;15:e46213. https://doi.org/10.7759/cureus.46213
38. Zengin O, Onder ME. Educational quality of YouTube videos on musculoskeletal ultrasound. Clin Rheumatol 2021;40:4243–51. https://doi.org/10.1007/s10067-021-05793-6
39. Wadhwa S, Kapila S. TMJ disorders: future innovations in diagnostics and therapeutics. J Dent Educ 2008;72:930–47. https://doi.org/10.1002/j.0022-0337.2008.72.8.tb04569.x
40. Balel Y. Can ChatGPT be used in oral and maxillofacial surgery? J Stomatol Oral Maxillofac Surg 2023;124:101471. https://doi.org/10.1016/j.jormas.2023.101471
41. Kılınç DD, Mansız D. Examination of the reliability and readability of Chatbot Generative Pretrained Transformer's (ChatGPT) responses to questions about orthodontics and the evolution of these responses in an updated version. Am J Orthod Dentofacial Orthop 2024;165:546–55. https://doi.org/10.1016/j.ajodo.2023.11.012
42. Tanaka OM, Gasparello GG, Hartmann GC, Casagrande FA, Pithon MM. Assessing the reliability of ChatGPT: a content analysis of self-generated and self-answered questions on clear aligners, TADs and digital imaging. Dental Press J Orthod 2023;28:e2323183. https://doi.org/10.1590/2177-6709.28.5.e2323183.oar
43. Khanagar SB, Al-Ehaideb A, Vishwanathaiah S, Maganur PC, Patil S, Naik S, et al. Scope and performance of artificial intelligence technology in orthodontic diagnosis, treatment planning, and clinical decision-making: a systematic review. J Dent Sci 2021;16:482–92. https://doi.org/10.1016/j.jds.2020.06.018
44. Abu Arqub S, Al-Moghrabi D, Allareddy V, Upadhyay M, Vaid N, Yadav S. Content analysis of AI-generated (ChatGPT) responses concerning orthodontic clear aligners. Angle Orthod 2024;94:263–72. https://doi.org/10.2319/071123-484.1
45. Siontis KC, Attia ZI. ChatGPT hallucinating: can it get any more humanlike? Eur Heart J 2024;45:321–3. https://doi.org/10.1093/eurheartj/ehad548
46. Sallam M. ChatGPT utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. Healthcare (Basel) 2023;11:887. https://doi.org/10.3390/healthcare11060887
47. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health 2023;2:e0000198. https://doi.org/10.1371/journal.pdig.0000198
48. Fatima A, Shafi I, Afzal H, Díez IT, Lourdes DRM, Breñosa J, et al. Advancements in dentistry with artificial intelligence: current clinical applications and future perspectives. Healthcare (Basel) 2022;10:2188. https://doi.org/10.3390/healthcare10112188
49. Vaira LA, Sergnese S, Salzano G, Maglitto F, Arena A, Carraturo E, et al. Are YouTube videos a useful and reliable source of information for patients with temporomandibular joint disorders? J Clin Med 2023;12:817. https://doi.org/10.3390/jcm12030817
50. Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How does ChatGPT perform on the United States medical licensing examination (USMLE)? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ 2023;9:e45312. https://doi.org/10.2196/45312
Figure 1
Assessment of average reliability and usefulness scores of each disease (interquartile range error bars).
TMJ, temporomandibular joint.

Table 1
Reliability score scale, usefulness score scale, modified DISCERN scale, Global Quality Scale
Modified Global Quality Scale*
1. The information exhibits poor quality, lacks a smooth flow, and is missing significant information. It does not offer any value to patients.
2. The information is generally of low quality with an unstructured flow. While some information is correctly presented, it omits many important topics, making it of very limited use for patients.
3. The information demonstrates moderate quality with suboptimal flow. It adequately covers some crucial information but poorly addresses others, making it somewhat useful for patients.
4. The information is of good quality and generally flows well. Most relevant information is included, but some topics are still missing. It is considered useful for patients.
5. The information showcases excellent quality and flows seamlessly, making it highly beneficial for patients. It covers a comprehensive range of essential information and is very valuable.
*Data from Bernard et al. (Am J Gastroenterol 2007;102:2070-7).23
Table 2
Inter-rater reliability, usefulness, mDISCERN, and GQS scores
Table 3
Inter-rater differences in Likert-type reliability, Likert-type usefulness, mDISCERN, and GQS scores by subject
| | Masticatory muscle disorders | Disc-condyle complex derangement | Structural disorders of the articular surfaces | Inflammatory disorders of TMJ | Chronic mandibular hypermobility | Ankylosis | Growth disorders of TMJ |
|---|---|---|---|---|---|---|---|
| Reliability (Likert-type) | | | | | | | |
| Rater 1 | | | | | | | |
| Mean ± SD | 6.0 ± 0 | 5.7 ± 0.6 | 5.3 ± 0.6 | 4.3 ± 0.6 | 4.7 ± 0.6 | 6.0 ± 0 | 5.7 ± 0.6 |
| Median (min–max) | 6 (6–6) | 6 (5–6) | 5 (5–6) | 4 (4–5) | 5 (4–5) | 6 (6–6) | 6 (5–6) |
| Rater 2 | | | | | | | |
| Mean ± SD | 6.0 ± 0 | 5.0 ± 1.0 | 4.7 ± 0.6 | 4.3 ± 0.6 | 4.0 ± 0 | 5.3 ± 0.6 | 5.0 ± 1.0 |
| Median (min–max) | 6 (6–6) | 5 (4–6) | 5 (4–5) | 4 (4–5) | 4 (4–4) | 5 (5–6) | 5 (4–6) |
| Usefulness (Likert-type) | | | | | | | |
| Rater 1 | | | | | | | |
| Mean ± SD | 6.0 ± 0 | 6.0 ± 0 | 5.7 ± 0.6 | 4.0 ± 0 | 4.7 ± 0.6 | 5.3 ± 0.6 | 4.7 ± 0.6 |
| Median (min–max) | 6 (6–6) | 6 (6–6) | 6 (5–6) | 4 (4–4) | 5 (4–5) | 5 (5–6) | 5 (4–5) |
| Rater 2 | | | | | | | |
| Mean ± SD | 5.7 ± 0.6 | 5.0 ± 1.0 | 4.7 ± 0.6 | 4.3 ± 0.6 | 4.0 ± 0 | 4.3 ± 0.6 | 4.7 ± 0.6 |
| Median (min–max) | 6 (5–6) | 5 (4–6) | 5 (4–5) | 4 (4–5) | 4 (4–4) | 4 (4–5) | 5 (4–5) |
| mDISCERN | | | | | | | |
| Rater 1 | | | | | | | |
| Mean ± SD | 4.3 ± 0.6 | 4.0 ± 1.0 | 4.0 ± 1.0 | 3.7 ± 0.6 | 4.0 ± 1.0 | 4.7 ± 0.6 | 4.3 ± 0.6 |
| Median (min–max) | 4 (4–5) | 4 (3–5) | 4 (3–5) | 4 (3–4) | 4 (3–5) | 5 (4–5) | 4 (4–5) |
| Rater 2 | | | | | | | |
| Mean ± SD | 4.3 ± 0.6 | 4.0 ± 0 | 3.7 ± 0.6 | 3.7 ± 0.6 | 3.3 ± 0.6 | 4.0 ± 1.0 | 3.3 ± 0.6 |
| Median (min–max) | 4 (4–5) | 4 (4–4) | 4 (3–4) | 4 (3–4) | 3 (3–4) | 4 (3–5) | 3 (3–4) |
| GQS | | | | | | | |
| Rater 1 | | | | | | | |
| Mean ± SD | 3.0 ± 1.0 | 3.3 ± 0.6 | 3.3 ± 0.6 | 4.7 ± 0.6 | 4.3 ± 0.6 | 4.3 ± 0.6 | 3.3 ± 0.6 |
| Median (min–max) | 3 (2–4) | 3 (3–4) | 3 (3–4) | 5 (4–5) | 4 (4–5) | 4 (4–5) | 3 (3–4) |
| Rater 2 | | | | | | | |
| Mean ± SD | 4.3 ± 0.6 | 4.0 ± 0 | 4.3 ± 0.6 | 3.3 ± 0.6 | 4.0 ± 0 | 3.7 ± 0.6 | 4.0 ± 0 |
| Median (min–max) | 4 (4–5) | 4 (4–4) | 4 (4–5) | 3 (3–4) | 4 (4–4) | 4 (3–4) | 4 (4–4) |
| Reliability* | 1.000 | 0.157 | 0.157 | 1.000 | 0.157 | 0.157 | 0.157 |
| Usefulness* | 0.317 | 0.180 | 0.083 | 0.317 | 0.157 | 0.083 | 1.000 |
| mDISCERN* | 1.000 | 1.000 | 0.564 | 1.000 | 0.157 | 0.157 | 0.083 |
| GQS* | 0.102 | 0.157 | 0.180 | 0.157 | 0.317 | 0.317 | 0.157 |

SD, standard deviation; TMJ, temporomandibular joint; mDISCERN, modified DISCERN; GQS, Global Quality Scale.
*P value for the comparison between the two raters (Wilcoxon signed-rank test).