Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study

doi:10.2196/69910

Published on 20.May.2025 in Vol 27 (2025)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/69910, first published 11.Dec.2024.


Kaitlin Hanss¹; Karthik V Sarma¹; Anne L Glowinski¹; Andrew Krystal¹; Ramotse Saunders¹; Andrew Halls¹; Sasha Gorrell¹; Erin Reilly¹



Citation

Please cite as:

Hanss K, Sarma KV, Glowinski AL, Krystal A, Saunders R, Halls A, Gorrell S, Reilly E
Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study
J Med Internet Res 2025;27:e69910
doi: 10.2196/69910 PMID: 40392576 PMCID: 12134693


This paper is in the following e-collection/theme issue:

Artificial Intelligence; Psychiatry; Generative Language Models Including ChatGPT; AI Language Models in Health Care
