
GMS Medizinische Informatik, Biometrie und Epidemiologie

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS)

1860-9171


Research Article
EFMI STC 2025

Performance evaluation of Meditron3-70B in medical coding: Current limitations and integration perspectives for clinical practice

Coralie Galland-Decker 1
Muaziza Usenbacher 1
Christophe Nunes 1
François Bouche 1
Giorgia Carra 2,3
Noémie Boillat-Blanco 3
Mary-Anne Hartley 4
Alexandre Sallinen 4
Jean Louis Raisaro 2
François Bastardot 1

1 Medical informatics, Lausanne University Hospital, Lausanne, Switzerland
2 Biomedical Data Science Center, Lausanne University Hospital, Lausanne, Switzerland
3 Infectious Diseases Service, Lausanne University Hospital, Lausanne, Switzerland
4 Laboratory for Intelligent Global Health and Humanitarian Response Technologies (LiGHT), EPFL, Lausanne, Switzerland

Abstract

The emergence of large language models (LLMs) presents challenges for their integration into clinical practice, particularly for medical coding. This study evaluated the performance of Meditron3-70B, a state-of-the-art open-source medical LLM, in generating SNOMED CT and ICD-10 codes using 200 fictitious emergency department consultation vignettes. Healthcare professionals assessed the accuracy of the outputs. Although Meditron performed well on standard benchmarks such as MedQA, significant shortcomings were observed in the relevance and completeness of the generated diagnostic codes, with only 2% of responses deemed acceptable. While LLMs show promise for supporting clinical decision-making, their current ability to produce accurate and comprehensive medical codes remains limited. Integrating specialized retrieval tools through hybrid approaches could enhance coding accuracy and warrants further investigation in real-world clinical settings.


Keywords

large language models, emergency room visits, clinical coding, SNOMED CT, ICD-10

Introduction

Medical coding involves converting clinical information from electronic health records (EHRs) – often unstructured free text – into standardized codes according to established classification systems. This process is crucial for administrative and public health purposes, such as statistical reporting, reimbursement, and epidemiological surveillance [1]. However, it imposes a significant documentation burden on healthcare providers, thereby contributing to professional fatigue and dissatisfaction. Meditron3-70B [2] is a large language model (LLM) fine-tuned on a range of biomedical and clinical datasets, with the aim of supporting various healthcare-related natural language processing tasks.

Our study investigates a central research question: How well does Meditron3-70B perform in real-world medical coding tasks in SNOMED CT [3] and ICD-10 [4], based on emergency department (ED) anamnesis? We focus on its ability to assign specific, accurate codes aligned with current classification standards.

Methodology

This study was conducted at Lausanne University Hospital (CHUV) between September and December 2024. We generated 200 fictitious clinical vignettes reflecting common presenting complaints in the emergency department. Each vignette simulated a pre-admission scenario, and the model was prompted to assign specific SNOMED CT and ICD-10 codes, per upcoming requirements for administrative coding of entry diagnoses used to determine reimbursement categories and care package allocations in the Swiss outpatient system. The outputs were evaluated by two physicians and three nurses (from internal medicine, pediatrics, psychiatry, and emergency medicine) using a 9-item evaluation grid based on a 5-point Likert scale.
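The study's actual prompt is the standardized one shown in Figure 1. As a purely illustrative sketch of how such a coding request might be assembled and how a model's free-text reply could be parsed into structured codes, consider the following; the template wording, line format, and function names are our own assumptions, not the pipeline used in the study:

```python
import re

# Illustrative prompt template (the study's actual prompt is shown in Figure 1).
PROMPT_TEMPLATE = (
    "You are a medical coding assistant. Read the emergency department "
    "anamnesis below and return the most specific diagnosis codes.\n"
    "Format each line as: <SYSTEM>: <CODE> | <LABEL>\n"
    "Systems: SNOMED CT, ICD-10.\n\n"
    "Anamnesis:\n{vignette}\n"
)

# Expected reply format, one code per line, e.g. "SNOMED CT: 386661006 | Fever"
CODE_LINE = re.compile(r"^(SNOMED CT|ICD-10):\s*(\S+)\s*\|\s*(.+)$")

def build_prompt(vignette: str) -> str:
    """Fill the template with a single clinical vignette."""
    return PROMPT_TEMPLATE.format(vignette=vignette.strip())

def parse_codes(model_output: str) -> list[dict]:
    """Extract (system, code, label) triples from the model's free-text reply,
    ignoring any lines that do not match the requested format."""
    codes = []
    for line in model_output.splitlines():
        m = CODE_LINE.match(line.strip())
        if m:
            codes.append({"system": m.group(1),
                          "code": m.group(2),
                          "label": m.group(3).strip()})
    return codes
```

Structuring the reply this way makes the human rating step easier, since each proposed code can be checked individually against the vignette.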

To elicit the model’s output, a standardized prompt was used (Figure 1 [Fig. 1]).

Figure 1: Standardized prompt

Results

The evaluation of the model’s performance in identifying SNOMED CT and ICD-10 codes revealed heterogeneous results across the assessment criteria (Figure 2 [Fig. 2]).

Figure 2: Evaluation criteria: Unsatisfactory in red (1–2 points on the Likert scale), satisfactory in yellow (3 points), and highly satisfactory in green (4–5 points)

The model demonstrated significant shortcomings in the relevance and completeness of its responses, with only 2% rated as acceptable (4–5 points on the Likert scale). Confidence in the model was moderate, with only 19% of responses considered satisfactory (4–5 points). In contrast, the model performed well in question understanding (76%) and contextual awareness (86%). Finally, it achieved excellent results regarding fairness (100%) and absence of harm (98%).

Discussion

Our results show that the model struggles to produce relevant and complete diagnostic codes from the patient anamnesis alone, despite a good general understanding of the clinical questions and context. Medical coding is a complex task that requires not only surface-level comprehension but also nuanced clinical reasoning and the ability to synthesize information. The limited quality of the generated codes explains the moderate confidence reported by healthcare professionals and highlights a key barrier to integrating LLMs into real-world coding workflows.

These findings are consistent with recent studies, which also report that current LLMs (e.g., ChatGPT-4.5) often fall short in tasks requiring high precision and domain-specific reasoning [5]. While prompt engineering can help clarify expectations, it does not sufficiently compensate for the model’s limited access to up-to-date medical knowledge. Hybrid approaches such as Retrieval-Augmented Generation (RAG), which enable dynamic access to curated external sources during inference, appear particularly promising [6], [7], [8]. They could help improve the specificity and accuracy of generated codes and better align model outputs with clinical documentation requirements.
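To illustrate the retrieval step such a hybrid system would add, the sketch below ranks entries of a toy terminology table by lexical overlap with the anamnesis, so that the LLM would only have to choose among grounded candidates rather than recall codes from memory. The four-entry table and all names are illustrative assumptions, not a real terminology release or part of any evaluated system:

```python
import re

# Toy terminology table; a real system would query a full SNOMED CT /
# ICD-10 release, typically via a dedicated terminology server.
TERMINOLOGY = [
    {"system": "ICD-10", "code": "R07.4", "label": "chest pain, unspecified"},
    {"system": "ICD-10", "code": "R50.9", "label": "fever, unspecified"},
    {"system": "SNOMED CT", "code": "29857009", "label": "chest pain"},
    {"system": "SNOMED CT", "code": "386661006", "label": "fever"},
]

def _tokens(text: str) -> set:
    """Lowercased alphanumeric tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve_candidates(anamnesis: str, k: int = 3) -> list[dict]:
    """Return the k terminology entries whose labels share the most
    tokens with the anamnesis (a crude stand-in for semantic retrieval)."""
    query = _tokens(anamnesis)
    scored = []
    for entry in TERMINOLOGY:
        overlap = len(query & _tokens(entry["label"]))
        if overlap:
            scored.append((overlap, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:k]]
```

In a full RAG pipeline, the retrieved candidates would be inserted into the prompt, constraining the model to codes that actually exist in the current classification release.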

Systematic comparisons with general-purpose models such as ChatGPT are needed to better characterize the strengths and limitations of specialized versus broadly trained language models. This study also highlights ethical and legal concerns inherent to generative AI in clinical settings. These include transparency of model outputs, accountability for errors or omissions, data privacy, and bias mitigation.

Conclusion

Meditron3-70B showed clear limitations in generating relevant and comprehensive diagnostic codes from emergency department anamnesis alone. These shortcomings, consistent with other recent findings, suggest that current LLMs are not yet reliable for standalone use in complex medical coding tasks. Future research should focus on hybrid systems that combine LLMs with structured retrieval tools to enhance performance and increase trust in AI-assisted documentation within clinical settings.

Notes


Competing interests

The authors declare that they have no competing interests.


References

[1] Dong H, Falis M, Whiteley W, Alex B, Matterson J, Ji S, Chen J, Wu H. Automated clinical coding: what, why, and where we are? NPJ Digit Med. 2022 Oct;5(1):159. DOI: 10.1038/s41746-022-00705-7
[2] Chen Z, Cano AH, Romanou A, Bonnet A, Matoba K, Salvi F, Pagliardini M, Fan S, Köpf A, Mohtashami A, Sallinen A, Sakhaeirad A, Swamy V, Krawczuk I, Bayazit D, Marmet A, Montariol S, Hartley MA, Jaggi M, Bosselut A. MEDITRON-70B: Scaling Medical Pretraining for Large Language Models [Preprint]. arXiv. 2023 Nov 27. DOI: 10.48550/arXiv.2311.16079
[3] SNOMED International. SNOMED CT. International Edition.
[4] World Health Organization. ICD-10. International Statistical Classification of Diseases and Related Health Problems. 10th Revision. WHO; 2019.
[5] Soroush A, Glicksberg BS, Zimlichman E, Barash Y, Freeman R, Charney AW, Nadkarni GN, Klang E. Large Language Models Are Poor Medical Coders – Benchmarking of Medical Code Querying. NEJM AI. 2024;1(5):AIdbp2300040. DOI: 10.1056/AIdbp2300040
[6] Ng KKY, Matsuba I, Zhang PC. RAG in Health Care: A Novel Framework for Improving Communication and Decision-Making by Addressing LLM Limitations. NEJM AI. 2025;2(1):AIra2400380. DOI: 10.1056/AIra2400380
[7] Puts S, Zegers CML, Dekker A, Bermejo I. Developing an ICD-10 Coding Assistant: Pilot Study Using RoBERTa and GPT-4 for Term Extraction and Description-Based Code Selection. JMIR Form Res. 2025 Feb;9:e60095. DOI: 10.2196/60095
[8] Kwan K. Large language models are good medical coders, if provided with tools [Preprint]. arXiv. 2024 Jul 6. DOI: 10.48550/arXiv.2407.12849