Su Xian
Graduated: June 13, 2025
Thesis/Dissertation Title:
Use of the electronic health records to facilitate phenotyping, comorbidity analysis, and genomics
Since the wide adoption of electronic health records (EHR) in 2010, many topics regarding the secondary use of the EHR have received attention. Secondary use of the EHR usually means repurposing EHR data for research, including information extraction, phenotyping, disease surveillance and forecasting, and policy making. In this work, we explored unsupervised methods and a rule-based (supervised) method, demonstrating the use of EHR data in phenotyping, comorbidity analysis, and genomics. In aim 1, we present an unsupervised method for embedding high-dimensional EHR data at the patient level to help characterize patients and identify new disease patterns. Inspired by the transformer, a modern language model architecture built on the attention mechanism, we use patient diagnosis and procedure codes as the vocabulary and treat each patient as a sentence to perform the patient embedding. Using 34,851 medical codes for 1,046,649 longitudinal patient events, we performed embedding for 102,739 patients in the electronic MEdical Records and GEnomics (eMERGE) Network. In aim 2, we demonstrated excellent performance in the prediction of future disease events (median AUROC = 0.87, within one year into the future) and in bulk phenotyping (median AUROC = 0.84). We then illustrated the use of patient vectors to reveal heterogeneous comorbidity patterns (disease subtypes) within a defined phenotype and captured their disease trajectories longitudinally. Our model was externally validated using the EHR dataset from the University of Washington, showing robust and stable performance. Together, these results pave the way for using representation learning in the EHR to characterize patients and associated clinical outcomes, which can improve disease forecasting performance and facilitate personalized medicine. In aim 3, we focused on an EHR-derived and validated rule-based phenotyping algorithm and illustrated its application to a genomic study.
We took depression, a complex psychiatric disease and a leading cause of disability, as an example to study genetic predisposition using data from the EHR. Large-scale genomic studies have identified common variants associated with depression. However, because of the complexity of the depression phenotype, these studies suffer from inconsistent cohort definitions and limited sample sizes. There is a need for a validated, automated EHR phenotyping algorithm that can accurately identify depression in the clinic. Here, we implemented a validated EHR phenotyping algorithm to construct a depression cohort (11,532 cases and 39,631 controls, total n = 51,163) and conducted a genome-wide association study (GWAS) using this cohort. Our study reproduced previously identified genetic associations (PHF5A, KCNG2) with depression susceptibility. We also identified novel SNPs falling into the HLA region and the IGVH region, indicating an association between immune function and the depression phenotype. In addition, we demonstrated the robustness of our phenotyping algorithm through genetic correlation analysis, using a large meta-analysis of major depressive disorder as a standard. Together, this work serves as a non-exhaustive but powerful demonstration of the use of EHR data, in both a supervised and an unsupervised manner, to facilitate many downstream clinical applications, including phenotyping, comorbidity analysis, and genomics.
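The "patient as a sentence" framing from aim 1 can be illustrated with a minimal sketch. This is not the thesis implementation: the toy records, the code strings, the `<pad>`/`<unk>` tokens, and the helper names `build_vocab` and `encode` are all hypothetical, chosen only to show how medical codes might be mapped to a vocabulary and each patient turned into a fixed-length token sequence suitable as input to a transformer-style model.

```python
from collections import Counter

# Hypothetical toy data: each patient is a time-ordered list of medical codes.
patients = {
    "p1": ["ICD10:E11.9", "CPT:83036", "ICD10:I10"],
    "p2": ["ICD10:F32.9", "ICD10:I10"],
}

# Assumed special tokens (not taken from the thesis).
PAD, UNK = "<pad>", "<unk>"


def build_vocab(records, min_count=1):
    """Map every code seen at least `min_count` times to an integer id."""
    counts = Counter(code for codes in records.values() for code in codes)
    vocab = {PAD: 0, UNK: 1}
    for code in sorted(counts):  # sorted for a deterministic id assignment
        if counts[code] >= min_count:
            vocab[code] = len(vocab)
    return vocab


def encode(codes, vocab, max_len):
    """Turn one patient's code sequence into a padded list of token ids."""
    ids = [vocab.get(code, vocab[UNK]) for code in codes][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


vocab = build_vocab(patients)
encoded = {pid: encode(codes, vocab, max_len=4) for pid, codes in patients.items()}
```

In this sketch each patient becomes a sequence of integer ids, in the same way a sentence becomes a sequence of word ids; a model would then learn an embedding for each id and pool them into a patient-level vector.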