Abstract
Background: Obesity affects approximately 40% of adults and 15%‐20% of children and adolescents in the United States, and poses significant economic and psychosocial burdens. Currently, patient responses to any single antiobesity medication (AOM) vary significantly, making obesity deep phenotyping and associated precision medicine important targets of investigation.
Objective: This study aimed to evaluate the potential of electronic health records (EHR) as a primary data source for obesity deep phenotyping. We conducted an in-depth analysis of the data elements and quality available from obesity patients prior to pharmacotherapy and applied a multimodal longitudinal deep autoencoder to investigate the feasibility, data requirements, clustering patterns, and challenges associated with EHR-based obesity deep phenotyping.
Methods: We analyzed 53,688 pre-AOM periods from 32,969 patients with obesity or overweight who underwent medium- to long-term AOM treatment. A total of 92 laboratory and vital measurements, along with 79 ICD (International Classification of Diseases)-derived clinical classifications software (CCS) codes recorded within one year prior to AOM treatment, were used to train a gated recurrent unit with decay-based longitudinal autoencoder (GRU-D-AE) to generate dense embeddings for each pre-AOM record. Principal component analysis and Gaussian mixture modeling (GMM) were applied to identify clusters.
Results: Our analysis identified at least 9 clusters, with 5 exhibiting distinct and explainable clinical relevance. Certain clusters show characteristics overlapping with phenotypes from traditional phenotyping strategies. Results from multiple training folds demonstrated stable clustering patterns in 2D space and reproducible clinical significance. However, challenges persist regarding the stability of missing data imputation across folds, maintaining consistency in input features, and effectively visualizing complex diseases in low-dimensional spaces.
Conclusions: In this proof-of-concept study, we demonstrated that longitudinal EHR data are a valuable resource for deep phenotyping of the pre-AOM period at the per-patient-visit level. Our analysis revealed clusters with distinct clinical significance, which could have implications for AOM treatment selection. Further research using larger, independent cohorts is necessary to validate the reproducibility and clinical relevance of these clusters and to uncover more detailed substructures and the corresponding AOM treatment responses.
doi:10.2196/70140
Introduction
Obesity affects approximately 40% of adults and 15%‐20% of children and adolescents in the United States [,]. By 2030, 49% of US adults are projected to have obesity and 24.2% to have severe obesity []. Obesity increases the risk of a wide spectrum of chronic diseases and imposes profound economic and psychosocial burdens. It arises from a complex interplay of genetic [-], nongenetic [,], and epigenetic factors [-]. Because of this multifaceted nature, no single therapy, whether noninvasive (eg, antiobesity medications [AOMs], dietary control, hydrogel) or invasive (eg, bariatric surgery), produces a robustly predictable patient response. For example, meta-analysis indicates wide interindividual variation in response to various bariatric surgeries during longer-term follow-up [,]. In a study of 6000 patients who underwent Roux-en-Y gastric bypass, 23.1% were nonresponders 5 years after surgery []. A 7-year follow-up of 652 patients with sleeve gastrectomy found weight recidivism in 27.8% []. For AOMs, even tirzepatide, the most effective glucagon-like peptide-1 (GLP-1) drug, achieved 10% weight loss in only 69% of participants after 1 year at a relatively safe dosage []. Other AOMs [] and treatment strategies such as hydrogels [] and vagal nerve blockade [,] generally show broader interindividual variation, which often relegates them to an auxiliary role in a weight management plan [-].
The inconsistent therapeutic response makes obesity phenotyping (ie, classifying obesity into subtypes) and the associated precision management important targets of investigation. Earlier advances in obesity staging [] and phenotyping-guided pragmatic trials [] have moved beyond the oversimplified classification by BMI [] and demonstrated clinical value [-]. However, health care practitioners (HCPs) still face major obstacles in adopting them for obesity precision medicine []. One obstacle is the complexity of these strategies. For example, the trial by [] involves tedious (eg, satiation measured by an ad libitum buffet meal) and subjective (eg, hunger measured by a visual analog scale) measurement processes. There are also considerable gaps between the desired and actual levels of acceptance of obesity treatment guidelines by HCPs [,]. Another major obstacle is that the phenotypic information lacks the granularity needed for a personalized context []. Phenotype-based pharmacotherapy has been successful only in rare monogenic obesity []. To date, the sole predictor of long-term response to pharmacotherapy is the short-term weight loss result, which cannot guide the initial treatment selection [].
Recent review articles [,] concluded that obesity deep phenotyping is necessary for precision medicine, as “when considering obesity, every person should be assessed based on their own specific and unique circumstances” []. Echoing this statement, our recent study covering more than 1 million patients with obesity or overweight indicates wide interindividual responses to any single AOM regardless of exposure length []. To date, there is a dearth of research on important questions such as the clinical context in which “deep phenotyping” should be discussed (eg, at the time of diagnosis, before initiating pharmacotherapy, or during postsurgical weight management); the type, source, availability, and quality of relevant data elements; the computational methodologies; and how “deep” we should and can go. As a first step into this intricate and multifaceted research question, we narrow the scope here to deep phenotyping before initiating pharmacotherapy, a period directly related to drug response and thus of potential clinical impact. We propose that electronic health records (EHRs) are a valuable data source for this purpose, as they provide detailed information across patient journeys and have inspired relevant research [,]. Using real-world EHR data from 444,219 patients with obesity or overweight diagnosed between 2005 and 2023, we analyzed commonly available data elements and their quality before pharmacotherapy. We also tested a multimodal longitudinal deep autoencoder to examine the feasibility, data requirements, clustering patterns, and challenges of EHR-based obesity deep phenotyping.
Methods
EHR Database and the Case Cohort
The study is based on EHR data from the outpatient practice of the University of Texas Health Science Center at Houston’s McGovern Medical School. The data are transformed (from Allscripts Touchworks EHR pre-2021, GE Centricity EHR pre-2021 for billing, EPIC EHR post-2021) to the Observational Medical Outcomes Partnership (OMOP) common data model (CDM) on a nightly basis to harmonize the data query format. Currently, the OMOP CDM instance covers 6.5 million patients, of whom approximately 3.7 million were documented between January 1, 2005, and December 31, 2023. From these patients, we identified 456,890 patients with either an obesity (n=217,040; ) or overweight (n=239,850) diagnosis. For the purpose of the study, overweight was defined as the presence of certain comorbidities () within 1 year prior to a BMI measurement ≥27, to reflect a pragmatic, risk-based approach increasingly used in personalized obesity care. From the 456,890 patients, we defined the case cohort as the 32,969 patients with medium (>112 d) to long exposure to AOMs and no bariatric surgery. Obesity diagnoses, comorbidities, and demographic information (age, race, and gender) were extracted from EHR data collected during routine clinical practice.
OMOP concept IDs used to identify obesity diagnoses:
- Obesity: 433736;
- Morbid obesity: 434005;
- Localized adiposity: 438731;
- Extreme obesity with alveolar hypoventilation: 4100857;
- Drug-induced obesity: 4097996;
- Simple obesity: 4217557.
Comorbidities used to define overweight: prediabetes, diabetes, hypertension, metabolic syndrome, obstructive sleep apnea, polycystic ovary syndrome, insulin resistance, hyperlipidemia, fatty liver, non-alcoholic steatohepatitis, coronary artery disease, cerebrovascular accident, stroke, peripheral vascular disease, congestive heart failure, colon or breast or renal or endometrial or liver cancer, osteoarthritis, or a glycated hemoglobin level of ≥5.7% and <6.5%.
Food and Drug Administration–Approved AOMs (F-AOMs) and Off-Label AOMs (O-AOMs)
For Food and Drug Administration (FDA)-approved AOMs (F-AOMs), we considered bupropion, naltrexone, orlistat, phentermine, phentermine-topiramate, liraglutide, semaglutide, and tirzepatide, which are approved by the FDA for obesity treatment. For off-label AOMs (O-AOMs), we considered bupropion, canagliflozin, dapagliflozin with or without saxagliptin, empagliflozin with or without linagliptin, lisdexamfetamine, metformin with or without (liflozin, liptin, litazone, and statin), topiramate, and zonisamide. Specifically, O-AOM exposure records shorter than 30 days without a neighboring record (within 30 d before or after) were removed to account for occasional use of O-AOMs for other purposes. Although most O-AOMs are used for diabetes treatment, we made no explicit distinction between treatment purposes when the treatment course was longer than 30 days, in which case an impact on body weight should be anticipated. For each drug ingredient, we searched the RxNav [] database for all aliases. We then searched the Observational Health Data Sciences and Informatics Athena [] database for the concept IDs of the original ingredient and all aliases. The concept IDs were combined to represent a single corresponding ingredient.
Definition of AOM Treatment Session
An AOM treatment session represents a continuous period of exposure to generally the same active ingredients. Specifically, given an existing exposure record RECA, a new exposure record RECB belongs to the same treatment session as RECA if it (1) has exactly the same ingredient(s) as RECA and the gap between the end of RECA and the start of RECB is ≤40 days; (2) has fewer ingredients than RECA, and both its exposure length and its gap to RECA are ≤40 days; or (3) adds new ingredient(s) within 40 days of the initiation of RECA and has an exposure length of ≤40 days. In all other circumstances, we consider RECB the start of a new treatment session. For the 32,969 patients in the case cohort, we identified a total of 53,688 AOM treatment sessions.
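The three session rules can be sketched in Python as follows. This is a minimal illustration rather than the study's implementation: the `Exposure` class, the function names, the reading of rule 3 as "RECB contains all of RECA's ingredients plus new ones", and the choice to compare each record only with the most recent record of the current session are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date

MAX_DAYS = 40  # threshold used by all three rules in the text


@dataclass(frozen=True)
class Exposure:
    """Hypothetical exposure record: ingredient set plus start and end dates."""
    ingredients: frozenset
    start: date
    end: date

    @property
    def length_days(self) -> int:
        return (self.end - self.start).days


def same_session(rec_a: Exposure, rec_b: Exposure) -> bool:
    """Return True if rec_b continues the treatment session of rec_a."""
    gap = (rec_b.start - rec_a.end).days
    if rec_b.ingredients == rec_a.ingredients:
        # Rule 1: identical ingredients and a short gap.
        return gap <= MAX_DAYS
    if rec_b.ingredients < rec_a.ingredients:
        # Rule 2: fewer ingredients (proper subset), short exposure, short gap.
        return gap <= MAX_DAYS and rec_b.length_days <= MAX_DAYS
    if rec_b.ingredients > rec_a.ingredients:
        # Rule 3 (assumed reading): ingredients added soon after rec_a started,
        # and the new record itself is short.
        since_start = (rec_b.start - rec_a.start).days
        return since_start <= MAX_DAYS and rec_b.length_days <= MAX_DAYS
    return False


def group_sessions(records):
    """Group one patient's records into sessions, processing them by start date."""
    sessions = []
    for rec in sorted(records, key=lambda r: r.start):
        if sessions and same_session(sessions[-1][-1], rec):
            sessions[-1].append(rec)
        else:
            sessions.append([rec])
    return sessions
```

Under these assumptions, two metformin records separated by 19 days fall into one session, while a later semaglutide record opens a new one.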
Pre-AOM Period, and Data Points Sampling Scheme
We defined the “pre-AOM period” as the 1 year (ie, 365 d) before the start of each AOM treatment session. Thus, there are 53,688 pre-AOM periods, one for each of the 53,688 AOM treatment sessions. For each pre-AOM period, we sampled data points every 30 days with a maximum look-back period of 30 days, resulting in 13 sampling points and 12 intervals for each feature. Notably, the interval closest to the treatment session start date spans from 35 days to 5 days before initiation. Throughout the rest of the text, we refer to the date 5 days before AOM initiation as the index date. We designed this 5-day lead time to allow any final data collection or assessments to be completed without interfering with the treatment process.
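The sampling scheme can be illustrated with a short Python sketch. It assumes that the most recent sampling point falls 5 days before initiation, which is what the stated Day −35 to −5 interval implies; the function name and parameters are hypothetical.

```python
from datetime import date, timedelta


def sampling_windows(aom_start, n_points=13, step_days=30,
                     lead_days=5, lookback_days=30):
    """Return (window_start, sampling_date) pairs, oldest first.

    Each sampling point looks back at most `lookback_days` days for the most
    recent observation of a feature. `lead_days` is an assumed offset of the
    last sampling point from the treatment start date.
    """
    windows = []
    for k in range(n_points - 1, -1, -1):
        sampling_date = aom_start - timedelta(days=lead_days + k * step_days)
        window_start = sampling_date - timedelta(days=lookback_days)
        windows.append((window_start, sampling_date))
    return windows
```

With these defaults, the 13 sampling dates run from 365 days to 5 days before initiation, and the last look-back window covers Day −35 to Day −5.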
Normal BMI Control Period
We randomly selected 10,000 case pre-AOM periods to match with normal BMI control periods from patients with no obesity or overweight diagnosis and no BMI >25 kg/m². The matching followed these criteria: (1) the case and control periods were matched by gender, (2) the control patient’s age at the normal BMI measurement was within one year of the case’s age at AOM initiation, and (3) the normal BMI measurement date was within one year of the case’s AOM initiation date. In the end, 8773 case pre-AOM periods were successfully matched one-to-one with normal BMI control periods belonging to 8185 normal-BMI patients.
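The three matching criteria amount to a simple eligibility check, sketched below. The class and field names are hypothetical, and "within one year" is assumed to mean at most 1.0 year of age difference and at most 365 days between dates. The sketch does not cover how a control was chosen among several eligible candidates.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CasePeriod:
    gender: str
    age_at_aom_start: float  # years
    aom_start: date


@dataclass
class ControlMeasurement:
    gender: str
    age_at_measurement: float  # years
    measured_on: date


def is_eligible_match(case: CasePeriod, control: ControlMeasurement,
                      max_age_gap_years: float = 1.0,
                      max_date_gap_days: int = 365) -> bool:
    """Check the gender, age, and calendar-date criteria for one candidate pair."""
    same_gender = case.gender == control.gender
    close_age = abs(case.age_at_aom_start - control.age_at_measurement) <= max_age_gap_years
    close_date = abs((case.aom_start - control.measured_on).days) <= max_date_gap_days
    return same_gender and close_age and close_date
```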
Explored Features
Measurements
We examined all numerical measurements recorded in the OMOP “measurement” table, which includes all lab and vital records (referred to as “measurements” in the remaining text) from the original EHR databases. The units of each measurement were harmonized to the primary unit type through manual inspection. The data quality of each pre-AOM period was defined as the proportion of observed measurements, averaged across the 13 sampling points before AOM initiation.
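As a small worked example of the data quality definition, the sketch below computes the score from a binary observation mask. The function name and the mask layout (one row per sampling point, one 0/1 flag per measurement) are assumptions for illustration.

```python
def data_quality(observed_mask):
    """Average, over sampling points, of the fraction of measurements observed.

    observed_mask: list of rows (one per sampling point); each row is a list of
    0/1 flags, one per measurement (1 = observed at that sampling point).
    """
    per_point = [sum(row) / len(row) for row in observed_mask]
    return sum(per_point) / len(per_point)


# Example: 13 sampling points, 4 measurements, half of them observed each time.
example_mask = [[1, 0, 1, 0] for _ in range(13)]
example_score = data_quality(example_mask)  # 0.5
```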
Diagnosis Code
All ICD-9 (International Classification of Diseases, Ninth Revision) diagnosis codes were mapped into 285 clinical classifications software (CCS) categories []. All ICD-10 (International Statistical Classification of Diseases, Tenth Revision) diagnosis codes were first mapped to ICD-9 codes through the General Equivalence Mappings provided by the Centers for Medicare & Medicaid Services. CCS is a tool for clustering patient ICD-9 diagnoses and procedures into a manageable number of clinically meaningful categories for easy presentation and statistical analysis. The CCS codes were one-hot encoded before being presented to the autoencoder model.
Embedding of Longitudinal EHR Data
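We re-engineered the GRU-D []-based longitudinal deep learning architecture to function as an autoencoder (GRU-D-AE) for embedding the longitudinal EHR data from the pre-AOM period. The basic architecture of the GRU-D model was systematically described by []; here we recapitulate the equations for handling missing values:

$$\hat{x}_t^d = m_t^d x_t^d + (1 - m_t^d)\left(\gamma_t^d x_{t'}^d + (1 - \gamma_t^d)\,\tilde{x}^d\right)$$

where $m_t^d$ is the missing value indicator for feature $d$ at timestep $t$. $m_t^d$ takes value 1 when $x_t^d$ is observed, or 0 otherwise, in which case the function resorts to a weighted sum of the last observed value $x_{t'}^d$ and the empirical mean $\tilde{x}^d$ calculated from the training data for the $d$th feature. Furthermore, the weighting factor $\gamma_t$ (with elements $\gamma_t^d$) is determined by

$$\gamma_t = \exp\{-\max(0,\, W_\gamma \delta_t + b_\gamma)\}$$

where $W_\gamma$ is a trainable weights matrix, $b_\gamma$ is the accompanying bias term of the original GRU-D formulation, and $\delta_t$ is the time interval from the last observation to the current timestep (written dt in the architecture description below). When $\delta_t$ is large (ie, the last observation is far away from the current timestep), $\gamma_t$ is small, resulting in smaller weights on the last observed value $x_{t'}^d$ and higher weights on the empirical mean $\tilde{x}^d$ (ie, decay to mean).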
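The decay-to-mean rule described above can be illustrated for a single feature with scalar parameters. This is a sketch, not the trained model: in GRU-D the weight and bias are learned, whereas here `w` and `b` are hypothetical fixed values, and the function names are chosen for the example.

```python
import math


def decay_weight(delta, w, b=0.0):
    """Weighting factor gamma = exp(-max(0, w * delta + b)), always in (0, 1]."""
    return math.exp(-max(0.0, w * delta + b))


def impute(x, m, x_last, x_mean, delta, w, b=0.0):
    """Return the value fed to the model for one feature at one timestep.

    x: current value (ignored when m == 0); m: 1 if observed, 0 if missing;
    x_last: last observed value; x_mean: empirical training mean;
    delta: time since the last observation.
    """
    if m == 1:
        return x
    gamma = decay_weight(delta, w, b)
    return gamma * x_last + (1.0 - gamma) * x_mean


# A recent last observation dominates; an old one decays toward the mean.
recent = impute(None, 0, x_last=3.0, x_mean=1.0, delta=0.0, w=0.5)    # 3.0
stale = impute(None, 0, x_last=3.0, x_mean=1.0, delta=1000.0, w=0.5)  # close to 1.0
```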
The architecture of GRU-D-AE is illustrated in , where Xt represents all input data at timestep t, xt is the normalized feature value (eg, lab measurements converted to the corresponding z scores; age scaled to a 0‐1 range), mt is the missing indicator (0 for missing and 1 for present), and dt is the time since the last actual observation. The bottleneck layer hT contains the dense embedding vector output by GRU-D at the last time step T. hT is then passed to a native RNN (in this case, a native GRU model) to generate the reconstructed feature value x̂t. shows the data processing flow, from raw EHR data to principal components, and how GRU-D-AE functions in the process.

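Specifically, the loss function for GRU-D-AE is expressed as

$$\mathcal{L} = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\left\lVert \hat{x}_{i,t} - x_{i,t} \right\rVert_2^2$$

where $N$ represents the number of training samples, $T$ the number of longitudinal time steps during the pre-AOM period, $\hat{x}_{i,t}$ the reconstructed feature value for sample $i$ at time step $t$, and $x_{i,t}$ the original observation or, if missing, the value imputed through the GRU-D missing-value parameterization mechanism.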
Embedding of Static-Transformed Longitudinal EHR Data
As a baseline comparison, we conducted experiments using an atemporal sparse autoencoder (SAE) to embed static-transformed longitudinal EHR data. The static transformation involved extracting the last observed feature within the 1-year period preceding AOM initiation. Features with no observations during this period were imputed with 0 and excluded from the loss function through masking. The experiment pipeline, including the training sample indices, was identical to that used in the GRU-D-AE model ().
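The static transformation and loss masking can be sketched as follows. The function names are hypothetical, and the use of a mean squared error is an assumption, since the text does not state the exact SAE reconstruction loss; the sparsity penalty of the SAE is omitted.

```python
def last_observed(series):
    """Collapse one feature's series (oldest to newest, None = missing).

    Returns (value, mask): the most recent observed value with mask 1, or
    (0.0, 0) when the feature was never observed in the period.
    """
    for value in reversed(series):
        if value is not None:
            return value, 1
    return 0.0, 0


def masked_mse(reconstructed, target, mask):
    """Mean squared error over observed features only (mask == 1)."""
    errors = [(r - t) ** 2 for r, t, m in zip(reconstructed, target, mask) if m == 1]
    return sum(errors) / len(errors) if errors else 0.0
```

For example, a feature observed as `[1.0, None, 2.5, None]` is reduced to 2.5, while a never-observed feature becomes 0 and contributes nothing to the loss.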
Principal Component Analysis and Clustering Algorithm
Principal component analysis (PCA) was conducted on the embeddings generated from the training data to derive eigenvectors. These eigenvectors were then applied to the embeddings from both the training and test datasets to calculate the corresponding principal components (PCs; ). We used the Gaussian mixture model (GMM) to perform probability-based clustering on the top 40 PCs of each embedding. To visualize the embeddings in two-dimensional space, we explored 2 methods: a scatterplot of the top two PCs and a t-SNE plot of the top 40 PCs. Cluster characteristics, including CCS prevalence rates, z score–transformed mean measurement levels, and missing data proportions, were visualized using a 2-way heatmap with Euclidean distance applied to both rows and columns.
Computational Environment
The computational environment for data analysis consisted of a Unix-based high-performance computing system with 192 CPUs. Deep learning models were developed using PyTorch version 2.3.1, implemented in Python version 3.12.2. Statistical analyses were conducted using R version 4.3.3. PCA was performed with the built-in stats package in R, while GMM was implemented using the mclust package (version 6.1.1). Two-way heatmap was implemented with the R pheatmap package (version 1.0.12). t-SNE dimensionality reduction was carried out with the Rtsne package (version 0.17). Data visualization tasks were completed using the ggplot2 package (version 3.5.0) in R.
Ethical Considerations
This study was approved by The University of Texas institutional review board (IRB), and the Ethics Committee waived the need for written informed consent from participants.
The research involved secondary analysis of previously collected EHR data and was determined to be exempt from full IRB review, as it involved de-identified data with minimal risk to participants.
Informed consent was obtained from all individuals at the time of original data collection. The IRB determined that additional informed consent was not required for this secondary analysis, as the study used de-identified data and was consistent with the scope of the original consent.
All study data were de-identified prior to analysis in accordance with institutional and federal privacy regulations. No direct identifiers were used or linked, and all analyses were conducted in secure, access-controlled environments to ensure confidentiality.
No compensation was provided to individuals as this was a secondary analysis of existing data, and no direct interaction with participants occurred.
No images or other media containing identifiable information about individuals were included in the manuscript or supplementary materials. If any potentially identifiable images are included in future submissions, appropriate consent forms will be obtained and uploaded.
Results
Baseline Characteristics of the Study Population
From January 1, 2005, to December 31, 2023, we identified 444,309 (12%) patients with either an obesity (n=204,688) or overweight (n=239,621) diagnosis. Among these patients, 72,089 (16%) were exposed to AOM therapy, with a total of 136,728 AOM treatment sessions (ie, an average of ~2 treatment sessions per patient). Of these sessions, 53,688 (39%), from 32,969 patients, involved medium- or long-term exposure (≥112 d). The baseline characteristics of the study population are shown in . The sample processing flow is illustrated in .
| Characteristic | Total cohort | With AOM exposure | With medium- or long-term AOM exposure |
| Patient count | 444,219 | 72,089 | 32,969 |
| AOM session count | — | 136,728 | 53,688 |
| Gender, n (%) | | | |
| Male | 178,097 (40) | 26,891 (37) | 11,924 (36) |
| Female | 266,057 (60) | 45,188 (63) | 21,037 (64) |
| Race, n (%) | | | |
| White | 150,802 (34) | 30,425 (42) | 12,752 (39) |
| Black or African American | 108,170 (24) | 17,501 (24) | 8815 (27) |
| Asian | 9499 (2.1) | 1896 (2.6) | 1089 (3.3) |
| Others | 1758 (0.4) | 279 (0.4) | 133 (0.4) |
| Unknown | 173,990 (39) | 21,988 (31) | 10,180 (31) |
| Age at first diagnosis (years), median (IQR) | 51 (37, 63) | 56 (42, 66) | 51 (39, 64) |
| Diagnosis, n (%) | | | |
| Obesity | 139,880 (31) | 20,537 (28) | 11,842 (36) |
| Morbid obesity | 61,495 (14) | 12,399 (17) | 5989 (18) |
| Localized adiposity | 1738 (0.4) | 288 (0.4) | 155 (0.5) |
| Extreme obesity | 1032 (0.2) | 136 (0.2) | 63 (0.2) |
| Drug-induced obesity | 456 (0.1) | 141 (0.2) | 83 (0.3) |
| Simple obesity | 17 (0.004) | 7 (0.01) | 6 (0.1) |
| Overweight | 239,600 (54) | 38,581 (54) | 14,831 (45) |
aAOM: antiobesity medication.
bNot applicable.
Overall Feature Availability During Pre-AOM Period
During the pre-AOM period, which potentially has the most relevant data for deep phenotyping, we identified 92 measurements with a ≥5% overall presence rate and nonzero standard deviation (Table S1 in ). The measurements with a ≥70% presence rate are body weight, body surface area, BMI, height, blood pressure (BP), and heart rate, whereas lab measurements for basic metabolic panels (eg, glucose, calcium, sodium) and lipid levels (eg, high-density-lipoprotein cholesterol, low-density-lipoprotein cholesterol, triglyceride) generally have presence rates around or above 50%. The distribution of overall feature presence rates shows no differences between males and females or across racial groups (ie, White, Black or African American, Asian, etc).
Out of 285 CCS categories, we identified 79 with ≥5% overall presence rate (Table S2 in ). The most common CCS categories are diabetes mellitus without complications (63%), hypertension (55%), hypertension with complications (55%), and hyperlipidemia (51%). Nutritional disorders, including any type of obesity diagnosis, were present in 40% of cases within one year prior to the AOM session, generally aligning with the percentage (46%) of obesity patients in our case cohort (ie, obesity + overweight).
Temporal Windows Based Feature Presence Rate
We sampled features every 30 days within one year on either side of the treatment session initiation date to examine the feature presence rates across the 24 corresponding time intervals. For both measurements (, Figure S1 in ) and CCS codes (, Figure S2 in ), the feature presence rates were markedly higher in the time interval immediately before the index date (ie, Day −35 to −5) and slightly elevated around 180, 90, and 30 days before the index date. This pattern of increased feature presence was not evident after the index date and was not seen in the normal BMI control periods (data not shown). For the pre-AOM periods, 50% of measurements were contributed by 22% of treatment sessions. On the other hand, 9.9% of patients and 11% of treatment sessions had no measurements at all during this period.

Feature Presence Rate Throughout Years
Analyzing feature presence rates from 2005 to 2023 revealed a general upward trend for both measurements and CCS codes over the years. Specifically, the Mann-Kendall nonparametric trend test showed that all 92 measurements have significantly increased presence rates (P≤.05) over time. The top five measurements with the highest increase are microalbumin, creatinine in urine, high-density-lipoprotein cholesterol, total cholesterol, and ferritin in serum or plasma. Additionally, 50 out of the 79 CCS categories demonstrated a significant increase in presence rates over the years, with the top five being anxiety disorders, nutritional deficiencies, other acquired deformities, other non-traumatic joint disorders, and screening and history of mental health and substance abuse codes (Figure S3 in ). Two CCS categories—chest pain and skin and subcutaneous tissue infections—showed a significant (P≤.05) or marginally significant (P≤.10) decrease in presence rates over the years.
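For readers unfamiliar with the Mann-Kendall test, the sketch below shows the core computation on a yearly series of presence rates. It is a simplified version that uses the normal approximation and omits the tie correction, so it is not necessarily identical to the routine used in the analysis; the function name is hypothetical.

```python
import math


def mann_kendall(values):
    """Return (S, z, two-sided p) for a time-ordered series, without tie correction."""
    n = len(values)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)  # sign of each pairwise difference
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return s, z, p
```

A positive S with a small P value indicates an increasing trend over the years, and a negative S a decreasing one.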
Encoding and Reconstruction of Longitudinal EHR Data
We experimented with various bottleneck layer sizes (n=40, 60, and 120 neurons) for the GRU-D-AE model. Generally, larger bottleneck sizes allow the autoencoder to capture more intricate temporal structures. Figure S4 in shows the GRU-D-AE based reconstructions of 171 temporal features (92 measurements and 79 CCS codes) from a single patient, using 5-fold cross-validation with a bottleneck layer of 120 neurons. Comparing results across the 5 models indicates slightly different imputation patterns across temporal steps but successful reconstruction of the overall input feature patterns. For the remainder of the paper, we present results based exclusively on the 120-neuron bottleneck layer, as this configuration performed best among the sizes tested.
Clustering of Case Pre-AOM Periods
By encoding the pre-AOM periods and projecting them onto principal component (PC) spaces, we visualized clustering patterns in lower-dimensional spaces, as shown in . Specifically, highlights the distribution of pre-AOM periods across different data quality quartiles. Data points from the bottom quartile (red) are densely packed in the lower-left corner, while those from the top quartile (purple) are more dispersed and form distinct clusters. The t-SNE plot of the top 40 PCs (explaining 80% of the variance) reveals a complete separation of bottom quartile data points and partial separation of 2nd quartile data points. Importantly, applying the same pipeline () to each of the 5-fold models shows highly reproducible clustering patterns, with different projection angles (Figure S5 in ). After removing low-quality pre-AOM periods (ie, the 1st and 2nd quartiles), the remaining high-quality periods (ie, the 3rd and 4th quartiles) form clusters that are less driven by data quality (). GMM clustering of the 40 PCs of the high-quality periods suggests at least 9 distinct clusters within the case group (). Notably, GMM-based clusters may align with (eg, clusters 2, 5, 7) or differ from (eg, clusters 1, 6, 9) the visually identifiable cluster centers based on the two PCs (Figure S6 in ). The visual separation between clusters improves slightly with t-SNE-based dimensionality reduction (Figure S7 in ). When the clustering results are colored by traditional obesity diagnosis categories (eg, drug-induced obesity, extreme obesity, localized adiposity, morbid obesity), no clear relationship emerges between traditional obesity diagnoses and the EHR-based clustering of pre-AOM periods (Figure S8 in ).

Clustering of Case Versus Control Periods
Visualization of the top two PCs and t-SNE plot () from the case pre-AOM periods and the control periods showed clear separation of the majority of control periods from the high-quality case periods. However, a minor proportion of the control periods formed clusters with centers overlapping the case group on the top two PCs (), with no clear pattern of overlap on the t-SNE plot ().

Clustering of Multiple Pre-AOM Periods From the Same Patient
For patients with multiple pre-AOM periods from corresponding multiple AOM treatment sessions, their clustering behavior is influenced by both data quality and intrinsic physiological factors reflected through measurements and diagnosis status. In , all three patients exhibit improved data quality over time (later pre-AOM periods are indicated by larger dots, which generally have better data quality), with varying degrees of clustering tightness in the high-quality pre-AOM periods (represented by cyan and purple points).

Characterizing GMM-Based Clusters of Pre-AOM Periods
We conducted two-way clustering of the GMM-based clusters of pre-AOM periods against CCS prevalence rates (), average measurement values (), and temporal measurement presence rate (Figure S9 in ). Here we briefly describe the 5 clusters with remarkable clinical relevance.
The 1st cluster is distinguished by the highest prevalence of chronic kidney disease (CKD), coronary atherosclerosis, and diabetes mellitus with complications (). This cluster also shows notably low glomerular filtration rates, elevated blood urea nitrogen, and comparatively lower levels of low-density lipoprotein cholesterol and myeloperoxidase ().
The 5th cluster is primarily characterized by the highest levels of low-density lipoprotein, very-low-density lipoprotein, triglyceride, systolic or diastolic blood pressure (SBP and DBP), and myeloperoxidase among all clusters. This cluster also has slightly higher glomerular filtration rates than other clusters except cluster 6 (). However, its disease prevalence rates are not significantly distinct from those of other clusters ().
The 6th cluster is marked by a higher prevalence of reproductive and genital health disorders (eg, female genital disorders, menstrual disorders, and postabortion complications; ). It also features higher heart rates, the highest body weight and glomerular filtration rates, and increased levels of leukocytes, lymphocytes, monocytes, neutrophils, and glucose ().
The 7th cluster is characterized by a high prevalence of primary malignancies (eg, head and neck cancer, bone and connective tissue cancer, and colon cancer) (), the highest presence rate of temporal anthropometric and physiological measurements (eg, body height, body weight, body temperature, BP, heart rate), and a lower to average presence rate of other more advanced lab tests. It also features a mildly higher body weight (Figure S9 X7 in ).
The 9th cluster is notable for significantly lower prevalence rates of primary malignancies (), and remarkably lower leukocyte levels (including lymphocytes, monocytes, and neutrophils) alongside elevated globulin levels ().
No notable difference was observed between the clusters in average Area Deprivation Index (ADI)-based social determinants of health (SDOH) factors, including median household income, mean education percentage, and mean insurance percentage (data not shown). Inspecting demographic characteristics (ie, age, gender, race) indicates a relatively lower male proportion in cluster 6, a slightly higher proportion of Black patients in cluster 9, and slightly older age in cluster 1 (data not shown).
Interestingly, clusters characterized by distinct clinical traits (eg, reproductive and genital health disorders in Cluster 6 and the absence of primary cancer in Cluster 9) were not visually distinguishable on the first two PCs (Figure S6 in ) and only formed vague boundaries in the 40-PC based t-SNE plot (Figure S7 in ). In contrast, clusters that showed clear separation in the low-dimensional spaces (eg, Cluster 2) may lack clinical characteristics as distinct as those of other clusters, at least based on the available study features. We also tried GMM clustering using just the top two PCs, which produced visually distinguishable clusters that lacked meaningful clinical differentiation (data not shown).


SAE-Based Embedding and Clustering of Pre-AOM Periods
As a baseline comparison, we evaluated the SAE-based embedding of static-transformed pre-AOM periods and summarize the results here. Using the same bottleneck layer size (n=120), the SAE model reliably reconstructed the static-transformed features (Figure S11 in ). However, embeddings generated by the SAE model do not exhibit visually distinct clusters on the top two PCs (Figure S12a in ). When colored by data quality quartiles, pre-AOM periods with lower data quality appear less separable (Figure S12a in ), aligning with the results observed in the GRU-D-AE model (). Additionally, SAE-based embeddings show weaker differentiation between pre-AOM periods across different data quality quartiles (Figure S12 in ). For high-quality pre-AOM periods, GMM clustering reveals less defined cluster boundaries on both the top two PCs (Figure S13a in ) and the t-SNE plot (Figure S13b in ). Overall, the SAE-based model demonstrates less distinct separation between normal BMI controls and high-quality case pre-AOM periods (Figure S14 in ).
Discussion
Background
In this preliminary work, we outlined a workflow for obtaining EHR-based fine-grained obesity phenotypes at the per-patient-visit level prior to initiating AOM therapy. Because EHRs chronicle a patient’s physiological and pathological status, such phenotypes may help with AOM treatment decision-making. This study aligns with the growing interest in classifying obesity into distinct subtypes and the adoption of the plural term “obesities” [,]. It also lays the foundation for using EHR-based digital fingerprints as an alternative to traditional phenotyping approaches, which have shown potential in certain contexts [,] but have not yet been widely adopted by health care providers for precision obesity treatment because of their labor-intensive nature and insufficient granularity [].
Principal Findings
Our analysis identified at least 9 distinct clusters before AOM initiation. Five clusters show clear clinical relevance independent of traditional obesity diagnoses (eg, extreme obesity, localized adiposity). Below is a brief overview of the clinical significance of these clusters.
Cluster 1 primarily includes patients with CKD and metabolic disorders. Obesity drives risk factors for CKD, most notably hypertension and type 2 diabetes, and also acts through obstructive sleep apnea. Hypertension is the leading modifiable risk factor for CKD [], while type 2 diabetes remains the leading cause of CKD worldwide []. Obstructive sleep apnea contributes to nocturnal hypoxia and worsens hypertension, both of which accelerate kidney injury. Beyond these indirect pathways, visceral adiposity itself promotes a proinflammatory state (marked by TNF-alpha, IL-6, and MCP-1) that drives renal fibrosis and activates the renin-angiotensin-aldosterone system, leading to glomerular hyperfiltration. Additionally, increased intra-abdominal pressure from central obesity can directly alter renal hemodynamics, further contributing to hyperfiltration and progressive kidney damage. Conversely, CKD can lead to water retention and increased body weight. This bidirectional relationship between obesity and CKD [] makes the selection of appropriate antiobesity therapies critical []. For example, a recent review concluded that bariatric surgery should be considered in adults with morbid obesity and CKD to improve cardiometabolic status and kidney outcomes [].
Cluster 5 includes patients with obesity-related hyperlipidemia, elevated atherogenic lipoproteins, and higher blood pressure, but fewer severe comorbidities and clinical measurements than other clusters, possibly reflecting better overall health, shorter obesity duration, or fewer complications. Limited health care access may also contribute to underreported comorbidities. These patients may have a lower genetic predisposition to visceral fat accumulation, allowing them to tolerate a higher BMI without metabolic dysfunction. This underscores the limitations of BMI as a sole indicator of obesity-related morbidity and highlights the need for more precise phenotyping to capture the true burden of disease.
Cluster 6 includes individuals with reproductive and genital health disorders, often related to hormonal imbalances and insulin resistance. Unlike those in Cluster 5, these individuals are more likely to seek medical care because of the impact of these conditions on reproductive health []. Recent evidence has linked maternal diabetes to neurodevelopmental disorders such as autism, ADHD, intellectual disability, and other learning disorders []. Given that maternal obesity is a major risk factor for gestational diabetes, this cluster represents an important demographic for prenatal intervention. For example, metformin and GLP-1 receptor agonists are potential AOM options for patients in this cluster. By enhancing insulin sensitivity, metformin not only supports weight loss but also helps regulate menstrual cycles and improve ovulation in women with obesity []. Similarly, GLP-1 receptor agonists may contribute to hormonal balance and enhance fertility outcomes [,].
In Cluster 7, the predominance of primary malignancies combined with a low frequency of cancer-related laboratory tests may indicate patients who have a history of cancer and are seeking medical weight management to reduce their risk of recurrence []. Among patients with obesity, those who underwent bariatric surgery had a 57% lower risk of cancer-related death than nonsurgical patients matched for preoperative BMI []. Bariatric surgery recipients also had a 25% lower incidence of cancer than a nonsurgical comparison group matched for preoperative BMI []. Both findings illustrate the effect of weight loss interventions on cancer risk and mortality.
The remarkably lower inflammatory response of Cluster 9 may be caused by less visceral adiposity and more subcutaneous adiposity. This may be an overall healthier cluster with a lower threshold for seeking treatment. Compared with visceral adiposity, subcutaneous adiposity allows for more flexible treatment options. For instance, central abdominal adiposity, a type of subcutaneous fat, can be addressed through noninvasive or minimally invasive procedures such as cryolipolysis, high-intensity focused ultrasound, nonthermal ultrasound, radiofrequency, and injection lipolysis []. Prospective surveillance of Cluster 9 may help determine whether early clinical features can predict if these individuals later separate into new clusters or join previously defined clusters.
These observations highlight the method’s potential to uncover unique patient groups, marking an important first step in digital phenotyping. Future steps include testing cluster stability in a larger cohort and incorporating additional environmental, psychological, behavioral, and functional factors, and even microbiome profiles or genetic markers. Finally, linking these phenotypes to treatment responses would provide more comprehensive insights and reveal any subtle subgroup patterns.
Comparison to Prior Work
The EHR-based clusters emphasize clinical manifestations and comorbidities, whereas traditional phenotyping, as described by [], classifies obesity within a behavioral and metabolic framework (ie, hungry brain, emotional hunger, hungry gut, and slow burn). Although there is no exact mapping between the two, certain overlaps exist. For example, Cluster 1 shares features with both the hungry brain and hungry gut phenotypes, reflecting advanced metabolic dysfunction (eg, diabetes and coronary atherosclerosis) and CKD associated with chronic overeating. Cluster 5 aligns with the emotional hunger phenotype, given its hyperlipidemia, high levels of atherogenic lipoproteins, elevated blood pressure, and fewer comorbidities. Cluster 9 corresponds to the slow burn phenotype, with its low prevalence of malignancies, reduced systemic inflammation, preserved renal function, and absence of dramatic metabolic derangements, consistent with slower metabolic turnover. These potential overlaps suggest that combining the strengths of each approach may provide a more complete understanding of obesity and its related disorders.
Future Direction of GRU-D-AE for Encoding Longitudinal EHR
We demonstrate that the proposed GRU-D-AE architecture, along with its error function, effectively captures the nuances of high-dimensional, extremely sparse longitudinal EHR data. We also show that embeddings from native longitudinal data provide more clearly separated and distinct clusters than embeddings from static-transformed data. These characteristics make the architecture a valuable candidate for encoding longitudinal EHRs of obesity, as well as various other diseases. However, several caveats deserve discussion and further investigation.
First, the model exhibits a certain level of arbitrariness in handling missing values across training folds. This behavior is expected, as the error function primarily minimizes the difference between input features (both observed and imputed) and reconstructed features. As a result, the model may exploit shortcuts (eg, imputing all missing values as zero, as an extreme example) to satisfy the error function. A similar concern was raised by Jarrett et al [], who proposed target-embedding autoencoders, which co-train the autoencoder with a target value to enhance stability. In another study [], using target variables and clinical confounders improved the stability of embeddings in complex biomedical datasets. Nonetheless, selecting appropriate co-training targets (eg, clinical outcomes, BMI trajectories, treatment responses) for deep phenotyping during pre-AOM periods remains challenging and is an important area for future investigation.
Second, in contrast to the arbitrariness in missing data imputation, the top two PCs from the five training folds consistently formed similar clustering patterns, differing only in orientation. Moreover, examination of the diagnosis and measurement characteristics of clusters in other training folds revealed highly consistent clinical traits (data not shown). These findings suggest that the bottleneck layer may capture essential temporal feature characteristics independent of the missing data imputation methods. Future studies should carefully assess the impact of imputation mechanisms on cluster stability.
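To make the first caveat concrete, the following minimal Python sketch shows the decay-based imputation described for GRU-D (Che et al) for a single feature, together with two ways of scoring a reconstruction. It is an illustration under stated assumptions, not the implementation used in this study: the decay parameters `w` and `b` are fixed scalars here rather than learned, the function names are hypothetical, and the observed-only error is included purely for contrast with an error computed over all (observed and imputed) inputs.

```python
import math

def decay_impute(values, mask, times, mean, w=1.0, b=0.0):
    """Fill missing entries of one feature by decaying the last observation toward the mean.

    values: raw readings (entries where mask is 0 are ignored)
    mask:   1 if the reading was observed, else 0
    times:  timestamp of each step (any consistent unit)
    w, b:   decay parameters; learned in GRU-D, fixed here (assumption)
    """
    filled = []
    last_value, last_time = mean, None  # assumption: use the mean before any observation
    for x, m, t in zip(values, mask, times):
        if m:
            filled.append(x)
            last_value, last_time = x, t
            continue
        delta = 0.0 if last_time is None else t - last_time
        gamma = math.exp(-max(0.0, w * delta + b))  # decay rate in (0, 1]
        filled.append(gamma * last_value + (1.0 - gamma) * mean)
    return filled

def reconstruction_mse(inputs, recon, mask=None):
    """Mean squared error over all entries, or over observed entries only if a mask is given."""
    errors = [(x - r) ** 2
              for i, (x, r) in enumerate(zip(inputs, recon))
              if mask is None or mask[i]]
    return sum(errors) / len(errors) if errors else 0.0

# Toy feature: observed at t=0 and t=6, missing at t=1 and t=5.
values = [1.0, 0.0, 0.0, 3.0]
mask = [1, 0, 0, 1]
times = [0.0, 1.0, 5.0, 6.0]
filled = decay_impute(values, mask, times, mean=2.0)
```

In this toy example, the two missing steps are filled with values that move from the last observation (1.0) toward the mean (2.0) as the gap grows. An error computed over every entry of `filled` also rewards reproducing the imputed values, which is one way to read the fold-to-fold arbitrariness noted above. Restricting the error to observed entries is a common alternative, although it is not necessarily better suited to this task.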
Challenges
Visualization of Complex Disease Phenotypes
At this stage, we observed no obvious connection between visual separation in low-dimensional spaces, particularly the top two PCs, and clinical relevance. Even with advanced dimensionality reduction techniques such as t-SNE, there are generally no clear-cut boundaries between clinically distinct clusters. These findings highlight the challenges of visualizing complex diseases, which often cannot be adequately captured in low-dimensional representations. Further research using tools such as MapperPlus [] and other open frameworks [] will help explore how to visualize these clusters in a more clinically meaningful way, beyond the static heatmap employed in the current study.
Considerations for Low-Quality Samples
In this study, we defined data quality as the observed proportion of lab or vital measurements within 1 year before AOM initiation. While this definition may be arbitrary given the expected variation in feature availability across time and health care systems, it provides an intuitive reflection of how well a patient is documented by the EHR system during the pre-AOM period. Given the observation that pre-AOM periods with an above-median measurement proportion (ie, above 0.0453) formed clusters generally independent of data quality levels, we propose 5% as a cutoff threshold, below which pre-AOM periods might be less distinguishable from each other. Further research is also needed to carefully assess the source of these low-quality pre-AOM periods (eg, limited access to health care, first-time visits, and fewer comorbidities).
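The threshold described above can be sketched in a few lines of Python. This is a hedged illustration only: it assumes that each pre-AOM period is stored as a feature-by-time-step grid of observation flags and that the proportion is taken over all cells of that grid, which may differ from the exact denominator used in the study. The names `measurement_proportion` and `is_low_quality` are hypothetical.

```python
def measurement_proportion(mask):
    """Share of feature-by-time cells in one pre-AOM period that hold an observation."""
    cells = [cell for row in mask for cell in row]
    return sum(cells) / len(cells) if cells else 0.0

def is_low_quality(mask, threshold=0.05):
    """Flag a pre-AOM period whose observed proportion falls below the proposed 5% cutoff."""
    return measurement_proportion(mask) < threshold

# Toy pre-AOM period: 4 features x 10 time steps, 1 = observed, 0 = missing.
sparse_period = [[0] * 10 for _ in range(4)]
sparse_period[0][3] = 1               # one observation: 1/40 = 0.025

denser_period = [row[:] for row in sparse_period]
denser_period[1][0] = 1
denser_period[2][7] = 1               # three observations: 3/40 = 0.075
```

With these toy grids, `sparse_period` falls below the 5% cutoff and `denser_period` does not. In practice, the cutoff would need to be re-examined whenever the feature set or the time resolution changes.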
Feature Variability in High-Dimensional EHR
The laboratory and vital measurements used in this study are derived from the most commonly available features among obesity patients in the current EHR system. However, feature availability can vary over time and across healthcare systems. This variability, combined with the high dimensionality of EHR data, poses challenges for ensuring feature consistency, which can significantly affect the stability and reliability of clustering results. Bellman first introduced this issue in 1957, referring to it as the “curse of dimensionality” []. Similar challenges [-] are prevalent in the medical field, where large feature sets and limited sample sizes are common. Specifically, Loftus et al [] reviewed these challenges in disease phenotyping, highlighting the importance of cohort representativeness and discussing caveats in data outliers, missing data, categorical variables, scaling, data transformation, and feature selection. For obesity, while certain strategies may improve cluster stability and potential clinical utility in the pre-AOM period, substantial challenges remain. Further research combining large cohorts with diverse demographic and SDOH backgrounds and intricately designed clustering approaches is needed to generate a holistic view that can inform clinical practice.
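One well-known facet of the curse of dimensionality, the tendency of pairwise distances to become nearly equal as the number of features grows, can be illustrated with a small synthetic experiment. The sketch below is not derived from the study data: it assumes independent, uniformly distributed features, and the function name `relative_contrast` is hypothetical.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)

low_dim = relative_contrast(2)      # nearest and farthest points differ greatly
high_dim = relative_contrast(500)   # distances concentrate around a common value
```

Under these assumptions, the contrast between the nearest and farthest points shrinks sharply in the higher-dimensional case. This is one reason why distance-based clustering can become less stable as feature sets grow, and real EHR features, which are correlated and sparse, may behave differently.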
Limitations
In this exploratory study, we shed light on only a few aspects of obesity deep phenotyping from the perspective of the EHR. Below, we outline key limitations and suggest areas for future research. First, although Houston is one of the most ethnically and racially diverse cities in the United States, this study is still subject to the limitations inherent in a single-site study, such as selection bias, environmental and institutional influences, and regulatory differences. Broader studies conducted at the state or national level are necessary to validate the generalizability of the findings and to uncover more nuanced clustering structures. Second, because of the limited sample size, we considered only laboratory and vital measurements and diagnosis codes for deep phenotyping, which do not provide comprehensive coverage of the factors that may influence obesity phenotypes (eg, medications, social determinants of health, genetic and epigenetic information, and demographics). Future research should include these factors as phenotyping features or examine clustering patterns across different subgroups (eg, age subgroups). Third, this study focuses on clustering patterns during the pre-AOM period and does not explore how different clusters respond to various AOMs in terms of body weight, BMI, lipid levels, and side effects. Given the numerous types, combinations, and exposure lengths of AOMs, this area needs to be thoroughly investigated in larger cohorts. Fourth, we did not distinguish treatment intent when the course of O-AOMs (eg, bupropion and metformin) extended beyond 30 days, based on the assumption that a sustained treatment duration is likely to affect body weight. A more refined study could exclude non–obesity-related treatment purposes by reviewing clinical notes. Finally, we demonstrated that data quality significantly influences the visual separability of the top two PCs. However, because of space limitations, we did not delve deeply into the sources of low-quality pre-AOM periods (eg, limited access to health care, first-time visits, fewer comorbidities) or how changes in data quality affect a given patient’s clustering behavior. Further investigation into these aspects is needed to better understand the contexts in which EHR-based deep phenotyping can be faithfully applied in clinical practice.
Conclusions
Obesity is a complex, chronic condition, and its multifaceted nature poses significant challenges for deep phenotyping and precision medicine. Here we demonstrated that longitudinal EHR data can potentially serve as a valuable resource for deep phenotyping during the pre-AOM period at the individual patient visit level. Our analysis revealed clusters with distinct clinical significance, which could have implications for the choice of AOM treatment. Further research using larger, independent cohorts is necessary to validate the reproducibility and clinical relevance of these clusters, uncover more detailed substructures, and assess cluster-specific responses to AOM treatment.
Acknowledgments
The study was funded by The University of Texas Health Science Center at Houston (CPRIT RR230020) and National Institutes of Health (grant R01LM011934). Generative artificial intelligence was not used in the preparation of this manuscript.
Data Availability
The data supporting this study are not publicly available due to restrictions related to patient privacy and confidentiality. Access to the deidentified UT-Physician data may be granted upon reasonable request and with appropriate institutional approvals.
Authors' Contributions
XR led the study design, conducted data analysis, and drafted the manuscript. SL conducted data analysis on atemporal clustering. LW contributed to manuscript preparation. AW oversaw the harmonization of the UT-Physician EHR with the OMOP CDM. SM provided expert insights as an obesity clinician. HL contributed to the study design and coordinated the overall project.
Conflicts of Interest
None declared.
92 measurements within 1 year before exposure to antiobesity medication.
XLSX File, 13 KB
Clinical classifications software codes within 1 year before exposure to antiobesity medication.
XLSX File, 11 KB
Additional figures.
DOCX File, 18230 KB
References
- Centers for Disease Control and Prevention (CDC). Obesity and Severe Obesity Prevalence in Adults: United States, August 2021–August 2023. URL: https://www.cdc.gov/nchs/products/databriefs/db508.htm [Accessed 2025-08-15]
- Sanyaolu A, Okorie C, Qi X, Locke J, Rehman S. Childhood and Adolescent Obesity in the United States: A Public Health Concern. Glob Pediatr Health. 2019;6:2333794X19891305. [CrossRef] [Medline]
- Ward ZJ, Bleich SN, Cradock AL, et al. Projected U.S. State-Level Prevalence of Adult Obesity and Severe Obesity. N Engl J Med. Dec 19, 2019;381(25):2440-2450. [CrossRef] [Medline]
- Stunkard AJ, Foch TT, Hrubec Z. A twin study of human obesity. JAMA. Jul 4, 1986;256(1):51-54. [Medline]
- Stunkard AJ, Sørensen TI, Hanis C, et al. An adoption study of human obesity. N Engl J Med. Jan 23, 1986;314(4):193-198. [CrossRef] [Medline]
- Paolacci S, Pompucci G, Paolini B, et al. Mendelian non-syndromic obesity. Acta Biomed. Sep 30, 2019;90(10-S):87-89. [CrossRef] [Medline]
- Górczyńska-Kosiorz S, Kosiorz M, Dzięgielewska-Gęsiak S. Exploring the Interplay of Genetics and Nutrition in the Rising Epidemic of Obesity and Metabolic Diseases. Nutrients. Oct 21, 2024;16(20):3562. [CrossRef] [Medline]
- World Health Organization. Obesity: Preventing and Managing the Global Epidemic: Report of a WHO Consultation. World Health Organization; 2000. ISBN: 9789241208949
- Segal NL, Allison DB. Twins and virtual twins: bases of relative body weight revisited. Int J Obes Relat Metab Disord. Apr 2002;26(4):437-441. [CrossRef] [Medline]
- Barres R, Kirchner H, Rasmussen M, et al. Weight loss after gastric bypass surgery in human obesity remodels promoter methylation. Cell Rep. Apr 25, 2013;3(4):1020-1027. [CrossRef] [Medline]
- Keller M, Hopp L, Liu X, et al. Genome-wide DNA promoter methylation and transcriptome analysis in human adipose tissue unravels novel candidate genes for obesity. Mol Metab. Jan 2017;6(1):86-100. [CrossRef] [Medline]
- Nilsson E, Jansson PA, Perfilyev A, et al. Altered DNA methylation and differential expression of genes influencing metabolism and inflammation in adipose tissue from subjects with type 2 diabetes. Diabetes. Sep 2014;63(9):2962-2976. [CrossRef] [Medline]
- Wahl S, Drong A, Lehne B, et al. Epigenome-wide association study of body mass index, and the adverse outcomes of adiposity. Nature. Jan 5, 2017;541(7635):81-86. [CrossRef] [Medline]
- Chang SH, Stoll CRT, Song J, Varela JE, Eagon CJ, Colditz GA. The effectiveness and risks of bariatric surgery: an updated systematic review and meta-analysis, 2003-2012. JAMA Surg. Mar 2014;149(3):275-287. [CrossRef] [Medline]
- Courcoulas AP, King WC, Belle SH, et al. Seven-Year Weight Trajectories and Health Outcomes in the Longitudinal Assessment of Bariatric Surgery (LABS) Study. JAMA Surg. May 1, 2018;153(5):427-434. [CrossRef] [Medline]
- Brissman M, Beamish AJ, Olbers T, Marcus C. Prevalence of insufficient weight loss 5 years after Roux-en-Y gastric bypass: metabolic consequences and prediction estimates: a prospective registry study. BMJ Open. Mar 2, 2021;11(3):e046407. [CrossRef] [Medline]
- Clapp B, Wynn M, Martyn C, Foster C, O’Dell M, Tyroch A. Long term (7 or more years) outcomes of the sleeve gastrectomy: a meta-analysis. Surg Obes Relat Dis. Jun 2018;14(6):741-747. [CrossRef] [Medline]
- Jastreboff AM, Aronne LJ, Ahmad NN, SURMOUNT-1 Investigators, et al. Tirzepatide Once Weekly for the Treatment of Obesity. N Engl J Med. Jul 21, 2022;387(3):205-216. [CrossRef] [Medline]
- Chakhtoura M, Haber R, Ghezzawi M, Rhayem C, Tcheroyan R, Mantzoros CS. Pharmacotherapy of obesity: an update on the available medications and drugs under investigation. EClinicalMedicine. Apr 2023;58:101882. [CrossRef] [Medline]
- Aronne LJ, Anderson JE, Sannino A, Chiquette E. Recent advances in therapies utilizing superabsorbent hydrogel technology for weight management: A review. Obes Sci Pract. Jun 2022;8(3):363-370. [CrossRef] [Medline]
- Morton JM, Shah SN, Wolfe BM, et al. Effect of Vagal Nerve Blockade on Moderate Obesity with an Obesity-Related Comorbid Condition: the ReCharge Study. Obes Surg. May 2016;26(5):983-989. [CrossRef] [Medline]
- Apovian CM, Shah SN, Wolfe BM, et al. Two-Year Outcomes of Vagal Nerve Blocking (vBloc) for the Treatment of Obesity in the ReCharge Trial. Obes Surg. Jan 2017;27(1):169-176. [CrossRef] [Medline]
- Raynor HA, Champagne CM. Position of the Academy of Nutrition and Dietetics: Interventions for the Treatment of Overweight and Obesity in Adults. J Acad Nutr Diet. Jan 2016;116(1):129-147. [CrossRef] [Medline]
- Koliaki C, Spinos T, Spinou M, Brinia M, Mitsopoulou D, Katsilambros N. Defining the Optimal Dietary Approach for Safe, Effective and Sustainable Weight Loss in Overweight and Obese Adults. Healthcare (Basel). Jun 28, 2018;6(3):73. [CrossRef] [Medline]
- Wharton S, Lau DCW, Vallis M, et al. Obesity in adults: a clinical practice guideline. CMAJ. Aug 4, 2020;192(31):E875-E891. [CrossRef] [Medline]
- Sharma AM, Kushner RF. A proposed clinical staging system for obesity. Int J Obes. Mar 2009;33(3):289-295. [CrossRef]
- Acosta A, Camilleri M, Abu Dayyeh B, et al. Selection of Antiobesity Medications Based on Phenotypes Enhances Weight Loss: A Pragmatic Trial in an Obesity Clinic. Obesity (Silver Spring). Apr 2021;29(4):662-671. [CrossRef] [Medline]
- Garvey WT, Mechanick JI. Proposal for a Scientifically Correct and Medically Actionable Disease Classification System (ICD) for Obesity. Obesity (Silver Spring). Mar 2020;28(3):484-492. [CrossRef] [Medline]
- Padwal RS, Pajewski NM, Allison DB, Sharma AM. Using the Edmonton obesity staging system to predict mortality in a population-representative cohort of people with overweight and obesity. CMAJ. Oct 4, 2011;183(14):E1059-E1066. [CrossRef] [Medline]
- Rodríguez-Flores M, Goicochea-Turcott EW, Mancillas-Adame L, et al. The utility of the Edmonton Obesity Staging System for the prediction of COVID-19 outcomes: a multi-centre study. Int J Obes. Mar 2022;46(3):661-668. [CrossRef]
- Atlantis E, John JR, Hocking SL, et al. Development and internal validation of the Edmonton Obesity Staging System-2 Risk screening Tool (EOSS-2 Risk Tool) for weight-related health complications: a case-control study in a representative sample of Australian adults with overweight and obesity. BMJ Open. Jun 2022;12(6):e061251. [CrossRef]
- Portincasa P, Frühbeck G. Phenotyping the obesities: reality or utopia? Rev Endocr Metab Disord. Oct 2023;24(5):767-773. [CrossRef] [Medline]
- Turner M, Jannah N, Kahan S, Gallagher C, Dietz W. Current Knowledge of Obesity Treatment Guidelines by Health Care Professionals. Obesity (Silver Spring). Apr 2018;26(4):665-671. [CrossRef]
- Kaplan LM, Golden A, Jinnett K, et al. Perceptions of Barriers to Effective Obesity Care: Results from the National ACTION Study. Obesity (Silver Spring). Jan 2018;26(1):61-69. [CrossRef] [Medline]
- Oral EA, Simha V, Ruiz E, et al. Leptin-replacement therapy for lipodystrophy. N Engl J Med. Feb 21, 2002;346(8):570-578. [CrossRef] [Medline]
- Hocking S, Sumithran P. Individualised prescription of medications for treatment of obesity in adults. Rev Endocr Metab Disord. Oct 2023;24(5):951-960. [CrossRef] [Medline]
- Trang K, Grant SFA. Genetics and epigenetics in the obesity phenotyping scenario. Rev Endocr Metab Disord. Oct 2023;24(5):775-793. [CrossRef] [Medline]
- Preda A, Carbone F, Tirandi A, Montecucco F, Liberale L. Obesity phenotypes and cardiovascular risk: From pathophysiology to clinical management. Rev Endocr Metab Disord. Oct 2023;24(5):901-919. [CrossRef] [Medline]
- Ruan X, Li R, Wang L, et al. Current status of anti-obesity medications and performance, an EHR based survey. Health Informatics. Preprint posted online on 2024. [CrossRef]
- Alzoubi H, Alzubi R, Ramzan N, West D, Al-Hadhrami T, Alazab M. A Review of Automatic Phenotyping Approaches using Electronic Health Records. Electronics (Basel). 2019;8(11):1235. [CrossRef]
- Weng C, Shah NH, Hripcsak G. Deep phenotyping: Embracing complexity and temporality-Towards scalability, portability, and interoperability. J Biomed Inform. May 2020;105:103433. [CrossRef] [Medline]
- Zeng K, Bodenreider O, Kilbourne J, Nelson S. RxNav: a web service for standard drug information. AMIA Annu Symp Proc. 2006;2006:1156. [Medline]
- Reich C, Ostropolets A, Ryan P, et al. OHDSI Standardized Vocabularies—a large-scale centralized reference ontology for international data harmonization. J Am Med Inform Assoc. Feb 16, 2024;31(3):583-590. [CrossRef]
- Healthcare Cost and Utilization Project (HCUP). Encyclopedia of Health Services Research. 2009. [CrossRef]
- Che Z, Purushotham S, Cho K, Sontag D, Liu Y. Recurrent Neural Networks for Multivariate Time Series with Missing Values. Sci Rep. Apr 17, 2018;8(1):6085. [CrossRef] [Medline]
- Yárnoz-Esquiroz P, Olazarán L, Aguas-Ayesa M. “Obesities”: Position statement on a complex disease entity with multifaceted drivers. Eur J Clin Invest. Jul 2022;52(7):e13811. [CrossRef] [Medline]
- Salmón-Gómez L, Catalán V, Frühbeck G, Gómez-Ambrosi J. Relevance of body composition in phenotyping the obesities. Rev Endocr Metab Disord. Oct 2023;24(5):809-823. [CrossRef] [Medline]
- Cifuentes L, Ghusn W, Feris F, et al. Phenotype tailored lifestyle intervention on weight loss and cardiometabolic risk factors in adults with obesity: a single-centre, non-randomised, proof-of-concept study. EClinicalMedicine. Apr 2023;58:101923. [CrossRef] [Medline]
- Hamrahian SM, Falkner B. Hypertension in Chronic Kidney Disease. Adv Exp Med Biol. 2017;956:307-325. [CrossRef] [Medline]
- Chaudhry K, Karalliedde J. Chronic kidney disease in type 2 diabetes: The size of the problem, addressing residual renal risk and what we have learned from the CREDENCE trial. Diabetes Obes Metab. Oct 2024;26 Suppl 5:25-34. [CrossRef] [Medline]
- Prasad R, Jha RK, Keerti A. Chronic Kidney Disease: Its Relationship With Obesity. Cureus. Oct 2022;14(10):e30535. [CrossRef] [Medline]
- Chintam K, Chang AR. Strategies to Treat Obesity in Patients With CKD. Am J Kidney Dis. Mar 2021;77(3):427-439. [CrossRef] [Medline]
- Parvathareddy VP, Ella KM, Shah M, Navaneethan SD. Treatment options for managing obesity in chronic kidney disease. Curr Opin Nephrol Hypertens. Sep 1, 2021;30(5):516-523. [CrossRef] [Medline]
- Zheng L, Yang L, Guo Z, Yao N, Zhang S, Pu P. Obesity and its impact on female reproductive health: unraveling the connections. Front Endocrinol. 2024;14:1326546. [CrossRef]
- Ye W, Luo C, Zhou J, et al. Association between maternal diabetes and neurodevelopmental outcomes in children: a systematic review and meta-analysis of 202 observational studies comprising 56·1 million pregnancies. Lancet Diabetes Endocrinol. Jun 2025;13(6):494-504. [CrossRef]
- Practice Committee of the American Society for Reproductive Medicine. Role of metformin for ovulation induction in infertile patients with polycystic ovary syndrome (PCOS): a guideline. Fertil Steril. Sep 2017;108(3):426-441. [CrossRef] [Medline]
- Cena H, Chiovato L, Nappi RE. Obesity, Polycystic Ovary Syndrome, and Infertility: A New Avenue for GLP-1 Receptor Agonists. J Clin Endocrinol Metab. Aug 1, 2020;105(8):e2695-e2709. [CrossRef] [Medline]
- Han Y, Li Y, He B. GLP-1 receptor agonists versus metformin in PCOS: a systematic review and meta-analysis. Reprod Biomed Online. Aug 2019;39(2):332-342. [CrossRef]
- Demark-Wahnefried W, Schmitz KH, Alfano CM, et al. Weight management and physical activity throughout the cancer care continuum. CA Cancer J Clin. Jan 2018;68(1):64-89. [CrossRef] [Medline]
- Aminian A, Wilson R, Al-Kurd A, et al. Association of Bariatric Surgery With Cancer Risk and Mortality in Adults With Obesity. JAMA. Jun 28, 2022;327(24):2423-2433. [CrossRef] [Medline]
- Adams TD, Meeks H, Fraser A, et al. Long-term cancer outcomes after bariatric surgery. Obesity (Silver Spring). Sep 2023;31(9):2386-2397. [CrossRef] [Medline]
- Friedmann DP. A review of the aesthetic treatment of abdominal subcutaneous adipose tissue: background, implications, and therapeutic options. Dermatol Surg. Jan 2015;41(1):18-34. [CrossRef] [Medline]
- Jarrett D, Schaar M. Target-embedding autoencoders for supervised representation learning. Presented at: International Conference on Learning Representations 2019. URL: https://openreview.net/pdf?id=BygXFkSYDH [Accessed 2024-10-09]
- Yu T. AIME: Autoencoder-based integrative multi-omics data embedding that allows for confounder adjustments. PLoS Comput Biol. Jan 2022;18(1):e1009826. [CrossRef] [Medline]
- Datta E, Ballal A, López JE, Izu LT. MapperPlus: Agnostic clustering of high-dimension data for precision medicine. PLOS Digit Health. 2023;2(8):e0000307. [CrossRef]
- Abdullah SS, Rostamzadeh N, Sedig K, Garg AX, McArthur E. Visual Analytics for Dimension Reduction and Cluster Analysis of High Dimensional Electronic Health Records. Informatics (MDPI). May 27, 2020;7(2):17. [CrossRef]
- Bellman R. Dynamic Programming. Princeton University Press; 1957. ISBN: 9780691079516
- Alelyani S. Stable bagging feature selection on medical data. J Big Data. Dec 2021;8(1):1-18. [CrossRef]
- Ma X, Chu X, Wang Y, et al. MedFACT: modeling medical feature correlations in patient health representation learning via feature clustering. Preprint posted online on Apr 21, 2022. URL: http://arxiv.org/abs/2204.10011 [Accessed 2025-08-15] [CrossRef]
- Loftus TJ, Shickel B, Balch JA, et al. Phenotype clustering in health care: A narrative review for clinicians. Front Artif Intell. Aug 12, 2022;5:842306. [CrossRef]
- Zhang X, Zhang H, Wang Z, Ma X, Luo J, Zhu Y. PWSC: a novel clustering method based on polynomial weight-adjusted sparse clustering for sparse biomedical data and its application in cancer subtyping. BMC Bioinformatics. Dec 21, 2023;24(1). [CrossRef]
Abbreviations
| AOM: antiobesity medication |
| BP: blood pressure |
| CCS: clinical classifications software |
| CDM: Common Data Model |
| CKD: chronic kidney disease |
| EHR: electronic health record |
| F-AOM: Food and Drug Administration–approved antiobesity medication |
| FDA: Food and Drug Administration |
| GMM: Gaussian mixture model |
| HCP: health care practitioner |
| ICD-10: International Statistical Classification of Diseases, Tenth Revision |
| ICD-9: International Classification of Diseases, Ninth Revision |
| IRB: institutional review board |
| O-AOM: off-label AOM |
| OMOP: Observational Medical Outcomes Partnership |
| PC: principal component |
| PCA: principal component analysis |
| SAE: atemporal sparse autoencoder |
| SDOH: social determinants of health |
Edited by Andrew Coristine; submitted 16.12.24; peer-reviewed by Jerry John Thayil, Nur Maamor, Saritha Kondapally, Sheyu Li; final revised version received 08.05.25; accepted 08.05.25; published 20.08.25.
Copyright © Xiaoyang Ruan, Shuyu Lu, Liwei Wang, Andrew Wen, Sameer Murali, Hongfang Liu. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 20.8.2025.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.

