Training and Assessing a Foundation Model for Electronic Hospital Records: Vector Representation of Patients with Transformer Architecture (Preprint)
Abstract
BACKGROUND Advancements in natural language processing have been adopted to encode patient information from electronic health records (EHR), leading to efforts to train foundation models (FMs) for EHR. While FMs for EHR have been reported, their robustness has received little scrutiny, and model architectures are not fully optimized for EHR representation. In this study, we aim to show how a minor architectural change that better describes the nature of EHR data can improve model performance, and we introduce additional methods for evaluating pretrained models. OBJECTIVE To assess whether providing more information about data structure leads to a better foundation model for EHR when applying techniques derived from natural language processing. METHODS EHR data converted to the Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM) format were first transformed into a sequence of concept IDs for each patient. The model was pretrained with masked language modeling (MLM). The pretrained model was evaluated on predicting neurologic complications in diabetic patients and myocardial infarction in hyperlipidemic patients. One percent of the patient representations were clustered and visualized using Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) and Uniform Manifold Approximation and Projection (UMAP). Class-based Term Frequency-Inverse Document Frequency (cTF-IDF) was used to identify representative concept IDs for each cluster. Finally, the patient representations containing the representative concept IDs were visualized. RESULTS When fine-tuned, the pretrained model performed on par with traditional statistical models trained from scratch. Twenty-two clusters and one outlier cluster were formed. cTF-IDF revealed consistent, recognizable concept IDs representing each cluster.
Visualization of patients containing the representative concept IDs matched the clustering results in a proportion of the data. CONCLUSIONS While achieving a true foundation model for EHR data remains a challenging task, incremental modifications can advance the ultimate goal of understanding how medical records can be represented with deep learning models. Future research should explore alternative pretraining strategies tailored to medical records, as well as architectural modifications that may better capture the complexities of structured and unstructured medical data.
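The MLM pretraining step described in METHODS can be illustrated with a minimal sketch. This is not the authors' implementation: the reserved mask ID, the 15% masking rate, the -100 ignore index, and the example OMOP concept IDs are all assumptions, chosen to mirror common BERT-style MLM conventions applied to a patient's concept-ID sequence.

```python
import random

MASK_ID = 0          # hypothetical reserved concept ID standing in for [MASK]
MASK_PROB = 0.15     # assumed BERT-style masking rate

def mask_sequence(concept_ids, rng):
    """Randomly mask a patient's concept-ID sequence for MLM pretraining.

    Returns (inputs, labels): labels keep the original ID at masked
    positions and use -100 (a common 'ignore' index) elsewhere, so the
    loss is computed only on positions the model must reconstruct.
    """
    inputs, labels = [], []
    for cid in concept_ids:
        if rng.random() < MASK_PROB:
            inputs.append(MASK_ID)   # hide the concept from the model
            labels.append(cid)       # model is trained to recover it
        else:
            inputs.append(cid)
            labels.append(-100)      # position excluded from the loss
    return inputs, labels

rng = random.Random(42)
patient = [4329847, 201826, 437312, 255573, 4185932, 313217]  # illustrative OMOP concept IDs
inputs, labels = mask_sequence(patient, rng)
```

Each patient thus yields a corrupted input sequence and a sparse label sequence; a transformer trained to fill in the masked concept IDs learns a vector representation of the patient as a side effect, which is the representation later clustered with HDBSCAN and visualized with UMAP.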