TY - JOUR
T1 - Large Language Models in Dental Licensing Examinations
T2 - Systematic Review and Meta-Analysis
AU - Liu, Mingxin
AU - Okuhara, Tsuyoshi
AU - Huang, Wenbo
AU - Ogihara, Atsushi
AU - Nagao, Hikari Sophia
AU - Okada, Hiroko
AU - Kiuchi, Takahiro
N1 - Publisher Copyright:
© 2024 The Authors
PY - 2025/2
Y1 - 2025/2
N2 - Introduction and aims: This systematic review and meta-analysis evaluates the performance of various large language models (LLMs) in dental licensing examinations worldwide. The aim is to assess the accuracy of these models across different linguistic and geographical contexts and thereby inform their potential application in dental education and diagnostics. Methods: Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines, we conducted a comprehensive search of PubMed, Web of Science, and Scopus for studies published from 1 January 2022 to 1 May 2024. Two authors independently screened the literature against the inclusion and exclusion criteria, extracted data, and evaluated study quality using the Quality Assessment of Diagnostic Accuracy Studies-2 tool. We conducted qualitative and quantitative analyses to evaluate the performance of LLMs. Results: Eleven studies met the inclusion criteria, encompassing dental licensing examinations from eight countries. GPT-3.5, GPT-4, and Bard achieved pooled accuracy rates of 54%, 72%, and 56%, respectively. GPT-4 outperformed GPT-3.5 and Bard, passing more than half of the dental licensing examinations. Subgroup analyses and meta-regression showed that GPT-3.5 performed significantly better in English-speaking countries, whereas GPT-4's performance remained consistent across regions. Conclusion: LLMs, particularly GPT-4, show potential in dental education and diagnostics, yet their accuracy remains below the threshold required for clinical application. Insufficient dental training data and reliance on image-based diagnostics limit LLM accuracy, so performance on dental licensing examinations is lower than on medical licensing examinations. Moreover, LLMs sometimes provided more detailed explanations for incorrect answers than for correct ones. Overall, current LLMs are not yet suitable for use in dental education and clinical diagnosis.
AB - Introduction and aims: This systematic review and meta-analysis evaluates the performance of various large language models (LLMs) in dental licensing examinations worldwide. The aim is to assess the accuracy of these models across different linguistic and geographical contexts and thereby inform their potential application in dental education and diagnostics. Methods: Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines, we conducted a comprehensive search of PubMed, Web of Science, and Scopus for studies published from 1 January 2022 to 1 May 2024. Two authors independently screened the literature against the inclusion and exclusion criteria, extracted data, and evaluated study quality using the Quality Assessment of Diagnostic Accuracy Studies-2 tool. We conducted qualitative and quantitative analyses to evaluate the performance of LLMs. Results: Eleven studies met the inclusion criteria, encompassing dental licensing examinations from eight countries. GPT-3.5, GPT-4, and Bard achieved pooled accuracy rates of 54%, 72%, and 56%, respectively. GPT-4 outperformed GPT-3.5 and Bard, passing more than half of the dental licensing examinations. Subgroup analyses and meta-regression showed that GPT-3.5 performed significantly better in English-speaking countries, whereas GPT-4's performance remained consistent across regions. Conclusion: LLMs, particularly GPT-4, show potential in dental education and diagnostics, yet their accuracy remains below the threshold required for clinical application. Insufficient dental training data and reliance on image-based diagnostics limit LLM accuracy, so performance on dental licensing examinations is lower than on medical licensing examinations. Moreover, LLMs sometimes provided more detailed explanations for incorrect answers than for correct ones. Overall, current LLMs are not yet suitable for use in dental education and clinical diagnosis.
KW - Dental education
KW - Dentistry
KW - Healthcare
KW - Oral medicine
KW - Systematic review
UR - http://www.scopus.com/inward/record.url?scp=85208762667&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85208762667&partnerID=8YFLogxK
U2 - 10.1016/j.identj.2024.10.014
DO - 10.1016/j.identj.2024.10.014
M3 - Article
C2 - 39532572
AN - SCOPUS:85208762667
SN - 0020-6539
VL - 75
SP - 213
EP - 222
JO - International Dental Journal
JF - International Dental Journal
IS - 1
ER -