Enhancing Healthcare Interoperability Using Large Language Models: A Generative Proof-of-Concept Framework to Extract Medical Information from Unstructured Clinical Text
Date Submitted: Jan 29, 2026
Open Peer Review Period: Feb 2, 2026 - Mar 30, 2026
Background: Unstructured clinical text remains a major barrier to interoperable data reuse and large-scale secondary analysis in healthcare. Large language models (LLMs) have the potential to automate the extraction of structured clinical information; however, their application is limited by the scarcity of high-quality annotated training data.

Objective: To develop and evaluate a generative proof-of-concept framework that uses synthetic, FHIR-derived training data to fine-tune an LLM for extracting structured, interoperable clinical information from unstructured discharge letters.

Methods: We evaluated an LLM-based pipeline for extracting structured clinical information from cancer-related discharge letters and mapping it to representations compatible with Fast Healthcare Interoperability Resources (FHIR). To enable large-scale supervised training, we developed a random sample generator that creates synthetic discharge letters with Qwen3 235B by randomly sampling and aggregating structured FHIR data from 41,175 cancer patients. The resulting synthetic discharge letters (n=75,000) were paired with their originating structured data, forming a large-scale dataset for fine-tuning MedGemma 27B. Evaluation comprised a synthetic test dataset (n=7,500), real-world discharge letters (n=30) annotated by physicians and a medical student, and a one-shot comparison with open-source models (Qwen3, LLaMA, and GPT-OSS).

Results: The fine-tuned model achieved high extraction performance across multiple clinical entities, including full ICD diagnosis codes (F1 = 0.84), tumor-related information (0.99), laboratory values (0.99), medication names and dosages (0.99), and ATC medication codes (0.94). Extraction of procedure-related information was more challenging but remained reliable, with F1 scores of 0.63 for OPS codes and 0.90 for procedure descriptions. In the one-shot comparison, the fine-tuned model outperformed the general-purpose LLMs in nearly all extraction categories. On real-world discharge letters, performance remained robust, with F1 scores of 0.789 for ICD diagnoses, 0.861 for tumor-related information, 0.930 for medications, and 0.613 for procedures.

Conclusions: These results demonstrate that synthetic text generation from structured clinical data enables effective and scalable training of LLMs for extracting interoperable, multi-entity clinical information from unstructured documentation.
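
The Methods describe a sample-and-pair scheme: structured FHIR data are randomly sampled, aggregated into a generation prompt, and the resulting synthetic letter is paired with its originating structured data as a supervised training example. The sketch below illustrates that scheme only; all names (sample_patient_bundle, build_generation_prompt, make_training_pair, call_llm) and the record layout are hypothetical, since the abstract does not specify the generator's prompts or how Qwen3 235B is invoked.

```python
import json
import random


def sample_patient_bundle(patients: list[dict]) -> dict:
    """Randomly sample one patient's structured data (illustrative
    entity types mirroring those evaluated in the abstract)."""
    patient = random.choice(patients)
    return {
        "diagnoses": patient.get("diagnoses", []),      # ICD codes
        "labs": patient.get("labs", []),                # laboratory values
        "medications": patient.get("medications", []),  # names, doses, ATC
        "procedures": patient.get("procedures", []),    # OPS codes
    }


def build_generation_prompt(bundle: dict) -> str:
    """Aggregate the sampled structured data into an instruction asking
    the generator LLM to write a discharge letter covering all of it."""
    return (
        "Write a realistic oncology discharge letter that mentions all of "
        "the following structured findings:\n" + json.dumps(bundle, indent=2)
    )


def make_training_pair(patients: list[dict], call_llm) -> dict:
    """One fine-tuning example: the synthetic letter is the model input,
    and the structured data that produced it is the extraction target."""
    bundle = sample_patient_bundle(patients)
    letter = call_llm(build_generation_prompt(bundle))  # e.g. Qwen3 235B
    return {"input": letter, "target": bundle}


if __name__ == "__main__":
    # Stub generator so the sketch runs without a model endpoint.
    demo_patients = [{
        "diagnoses": ["C34.1"],
        "labs": [{"Hb": "10.2 g/dL"}],
        "medications": [{"name": "cisplatin", "atc": "L01XA01"}],
        "procedures": ["5-324"],
    }]
    pair = make_training_pair(
        demo_patients, lambda prompt: "SYNTHETIC LETTER: " + prompt[:60]
    )
    print(json.dumps(pair, indent=2))
```

Substituting a real model client for the call_llm stub would reproduce the pairing logic at scale; the stub is included only so the sketch is runnable as-is.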
