This is an openaccess article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
In epidemiological studies, finding the best subset of factors is challenging when the number of explanatory variables is large.
Our study had two aims. First, we aimed to identify essential depressionassociated factors using the extreme gradient boosting (XGBoost) machine learning algorithm from big survey data (the Korea National Health and Nutrition Examination Survey, 20122016). Second, we aimed to achieve a comprehensive understanding of multifactorial features in depression using network analysis.
An XGBoost model was trained and tested to classify “current depression” and “no lifetime depression” for a data set of 120 variables for 12,596 cases. The optimal XGBoost hyperparameters were set by an automated machine learning tool (TPOT), and a highperformance sparse model was obtained by feature selection using the feature importance value of XGBoost. We performed statistical tests on the model and nonmodel factors using surveyweighted multiple logistic regression and drew a correlation network among factors. We also adopted statistical tests for the confounder or interaction effect of selected risk factors when it was suspected on the network.
The XGBoostderived depression model consisted of 18 factors with an area under the weighted receiver operating characteristic curve of 0.86. Two nonmodel factors could be found using the model factors, and the factors were classified into direct (
XGBoost and network analysis were useful to discover depressionrelated factors and their relationships and can be applied to epidemiological studies using big survey data.
Depression is a common debilitating psychiatric condition characterized by a lowspirited mood, loss of interest, and a range of emotional, cognitive, physical, and behavioral symptoms. It has a high global disease burden and had been projected to become the second most common cause of disabilityadjusted life years worldwide by 2020 [
Given the complex biological, psychological, and sociocultural factors underlying the pathogenesis of depression, an integrated model with confounder adjustment may provide a better understanding and multifaceted individualized approach for depression. In a typical study, surveyweighted logistic regression is used to identify depressionassociated factors. A simple regression model for each candidate factor is built to adjust for age and sex. Then, a complex model is presented by adding more covariates to control potential confounders [
A confounder has associations with both exposure and disease but is not in the causal pathway between exposure and outcome [
For big survey data with numerous variables, the machine learning model can detect principle factors for a condition by attempting to maximize its predicting performance. During the training, dimensionality reduction (which reduces the number of features) is essential because many features may cause overfitting, and the generated model may not be generalized appropriately [
We speculate that extreme gradient boosting (XGBoost) can help select an optimal subset of essential variables from a large volume of survey data. XGBoost is an advanced implementation of a gradientboosting decisiontree algorithm. XGBoost has been used in a few studies to predict or screen depression [
However, XGBoost has many hyperparameters that should be manually specified, and optimal parameter tuning can be challenging. To solve this problem, we have chosen an automated machine learning tool, the treebased pipeline optimization tool (TPOT), along with genetic programming [
The overall study flowchart is provided in
Flow diagram of the study. AUC: area under the curve; EBICglasso: extended Bayesian information criterium graphical lasso; KNHANES: Korea National Health and Nutrition Examination Survey.
In the annual KNHANES (20122016) following a multistage clustered probability design, 192 primary sampling units (PSUs) were generated from approximately 200,000 geographically defined PSUs for the entire country; subsequently, 20 final target households were sampled for each PSU as secondary sampling units (ID_fam) [
Positive depression diagnosed by a physician (DF2_dg) and positive present depression (DF2_pr) represented the positive case “current depression,” and these indicated that the individual had a diagnosis of depression and was currently experiencing depression, respectively. In terms of the negative case, both DF2_dg and DF2_pr were negative and represented “no lifetime depression,” which indicated that the individual had never experienced depression. DF2_dg and DF2_pr were reported by the participants in this survey.
We included as many variables as possible (
Physical examination
Waist circumference, BMI, regularity of pulse, and periodontal disease
Blood tests
Anemia (hemoglobin <13.0 g/dL for men; hemoglobin <12.0 g/dL for nonpregnant women; hemoglobin <11.0 g/dL for pregnant women), white blood cell count, platelet count, red blood cell count, aspartate aminotransferase, alanine aminotransferase, creatinine, urea nitrogen, hepatitis B surface antigen, and hepatitis C antibody
Fasting (≥8 hours) blood tests
Sugar level, highdensity lipoprotein cholesterol (HDL), and triglyceride (TG)
Urine strip test
pH, nitrite, specific gravity, protein, glucose, ketones, bilirubin, blood, urobilinogen
Pulmonary function test
Hypercholesterolemia: total cholesterol (TC) ≥240 mg/dL or lipidlowering medication
Hypertension: systolic pressure ≥140 mmHg or diastolic pressure ≥90 mmHg or medication
Diabetes mellitus: fasting blood sugar level ≥126 mg/dL, diagnosis, medications, or insulin injections
Prediabetes: fasting blood sugar level ≥100 and <126 mg/dL
Lowdensity lipoprotein cholesterol (LDL) = TC – (HDL + TG/5), where TG ≤400 mg/dL
Age, marital status, educational level, occupational class, household income, weight changes in the past year, alcohol consumption frequency, sleep hours per day, perceived stress, current smoking status, walking for more than 10 minutes (days/week), musclestrengthening activities (days/week), childbirth experience, and EuroQol fivedimension threelevel (EQ5D3L) questionnaire (mobility, usual activities, selfcare, and pain/discomfort; three levels for each) [
Morbidity
Chronic obstructive pulmonary disease, stroke, ischemic heart disease, osteoarthritis, rheumatoid arthritis, pulmonary tuberculosis, asthma, thyroid disease, gastric cancer, hepatoma, colon cancer, breast cancer, cervix cancer, lung cancer, thyroid cancer, atopic dermatitis, renal failure, hepatitis B, hepatitis C, and liver cirrhosis
Oral health
Difficulty chewing, caries treatment within the last 1 year, root canal treatment within the last 1 year, and prosthetic treatment within the recent 1 year
Carbohydrates, proteins, saturated fatty acids, monounsaturated fatty acids, polyunsaturated fatty acids, n3 fatty acids, n6 fatty acids, n6:n3 fatty acid ratio, cholesterol, fiber, vitamin A, vitamin B1, vitamin B2, vitamin C, niacin, iron, calcium, potassium, phosphorus, and sodium
The training set included 80% “current depression” and 80% “no lifetime depression” (10,076 cases). The remaining cases were assigned to the test set (2520 cases). We used TPOT to choose the best XGBoost hyperparameters from the training set to predict “current depression” from “no lifetime depression.” Then, the model features were selected further using the feature importance of XGBoost. The XGBoost model was explained using shapley additive explanation (SHAP) values that showed each feature’s impact on the model prediction [
The weighted area under the receiver operating characteristic curve (AUC) of the final model was calculated using the sample weight [
Surveyweighted multiple logistic regression was performed using the model features from all samples (training and test sets). We defined the “direct” factor of depression as the model feature whose multiple logistic regression coefficient significantly differs from zero (
A correlation matrix of model features and nonmodel factors was generated using all samples. Based on the correlation matrix, a thresholded EBICglasso (extended Bayesian information criterium graphical lasso) network was plotted, in which the network model was estimated using graphical LASSO regularization with EBIC model selection. Each edge weight was the correlation coefficient between two nodes after controlling all other network correlations [
Three centrality indices (strength, closeness, and betweenness) were computed. Strength centrality is the absolute sum of the edge weights connected to the node. Closeness centrality is the sum of the shortest distances from the node to all other network nodes. Betweenness centrality is the number of times when the node lies on the shortest path between two other nodes [
If an indirect risk factor was positively connected to direct risk factors, the direct factor confounding effects were tested. If the indirect factor’s coefficient on surveyweighted multiple logistic regression became significant without the direct factors, we assumed they were the indirect factor’s confounders.
If an indirect risk factor was positively connected to a direct preventive factor or negatively related to a direct risk factor, the interaction effect was tested using the interaction term on surveyweighted multiple logistic regression.
We used the following two programming languages: Python (version 3.8.5, Python Software Foundation) and R (version 3.6.3, R Foundation for Statistical Computing).
The following scikitlearn tools [
TPOT starts from a collection of random models (first generation). Subsequently, those with higher performance are chosen and copied into the next generation’s population. The offspring crossover with other offspring or are randomly changed by mutation. The algorithm repeats this evaluateselectcrossovermutate process for multiple generations. Finally, the best model is selected from the run [
We used TPOT to tune the hyperparameters of the XGBClassifier, such as the number of boosting rounds (“n_estimators”), boosting learning rate (“learning_rate”), maximum tree depth for base learners (“max_depth”), minimum sum of instance weight needed in a child (“min_child_weight”), minimum loss reduction required to make a further partition on a leaf node of the tree (“gamma”), subsample ratio of the training instance (“subsample”), subsample ratio of columns when constructing each tree (“colsample_bytree”), L1 regularization term on weights (“reg_alpha”), and L2 regularization term on weights (“reg_lambda”). To control the balance of positive and negative weights, we set “scale_pos_weight” as the number of negative cases/the number of positive cases. The “objective” option was set to a default value, “binary:logistic” (ie, logistic regression for binary classification).
Two parameters of TPOTClassifier, namely, number of iterations to run the XGBoost optimization process (“generation”) and number of individuals to retain in the genetic programming population every generation (“puplation_size”), were set to 100 (default) each. Finally, TPOT evaluated models using AUC (“roc_auc”).
XGBoost incorporates an ensemble technique (boosting) of adding new models to correct errors made by previous models. When adding new models, gradient boosting uses a gradient descent algorithm to minimize the loss. More details on XGBoost’s working principles can be found in published articles [
The
Receiver operating characteristic (ROC) curves were generated using the
For a correlation matrix in network analysis, a correlation between two variables was calculated using the wtd.cor function with the sample weight for “weight” in R’s weights package, which produced weighted Pearson correlation coefficients for survey data [
The population was estimated to be 22,262,880 (SE 355,977) (97.3%) for no depression reported during lifetime and 616,082 (SE 40,273) (2.7%) for current depression (complexsurveydesignbased estimation). After TPOT training, the number of XGBoost model features was reduced from 120 to 81 (33% reduction) by the L1 regularization term of XGBoost. Furthermore, we decreased the feature number to 18 (78% reduction) by applying feature selection without performance loss.
The final model features included mental health, quality of life (QoL), socioeconomic factors, morbidity, sex, marital status, urinalysis, and health behavior (
XGBoost model for the depression and performance test. Red dots represent positive for a binary factor or high feature values for a continuous factor on the beeswarm plot of shapley additive explanation (SHAP) values (log odds of the current depression) for the training data (A). On the weighted receiver operating characteristic curve (ROC) (B), sensitivity is 0.78 and specificity is 0.82 at the best threshold of probability of 0.461 for test samples. The model is also tested using the Patient Health Questionnaire9 (depression when score ≥10) for 6098 samples (C). AUC: area under the curve.
Confusion matrix for the test data set.
Actual depression  Predicted depression^{a}  

Current  No lifetime 
Current  103,182  28,933 
No lifetime  804,195  3,657,604 
^{a}Case numbers are estimated using complexsurveydesign weights at the best threshold.
Perceived stress and current asthma were topranked statistically significant risk factors (
Surveyweighted multiple logistic regression analysis of depressionrelated factors.
Variable  No lifetime depression^{a} (N=22,262,880)  Current depression^{a} (N=616,082)  Odds ratio (95% CI)^{b}  







Yes  5,999,970 (26.95%)  372,138 (60.40%)  3.3 (2.54.3)  <.001  

No (reference)  16,262,910 (73.05%)  243,944 (39.60%)  N/A^{c}  N/A  







Male  11,072,884 (49.74%)  165,770 (26.91%)  0.5 (0.30.7)  <.001  

Female (reference)  11,189,996 (50.26%)  450,312 (73.09%)  N/A  N/A  







Separated or divorced  852,170 (3.83%)  105,567 (17.14%)  2.2 (1.43.5)  <.001  

Single  6,153,788 (27.64%)  133,952 (21.74%)  1.4 (0.92.2)  .11  

Widowed  397,387 (1.78%)  36,923 (5.99%)  1.4 (0.82.6)  .25  

Married (reference)  14,859,535 (66.75%)  339,640 (55.13%)  N/A  N/A  







Nonmanual  10,304,597 (46.29%)  161,758 (26.26%)  0.6 (0.40.8)  <.001  

Manual  4,394,739 (19.74%)  102,028 (16.56%)  0.7 (0.41.0)  .06  

Farm^{d}  498,239 (2.24%)  10,848 (1.76%)  0.4 (0.20.8)  .02  

Unemployed (reference)  7,065,305 (31.73%)  341,448 (55.42%)  N/A  N/A  







Current  205,771 (0.92%)  34,769 (5.64%)  3.1 (1.56.5)  .002  

Past  303,213 (1.36%)  10,387 (1.69%)  1.3 (0.53.2)  .57  

No lifetime (reference)  21,753,896 (97.72%)  570,926 (92.67%)  N/A  N/A  







Yes  2,963,312 (13.31%)  146,742 (23.82%)  1.4 (0.91.9)  .10  

No (reference)  19,299,568 (86.69%)  469,340 (76.18%)  N/A  N/A  







Current  825,130 (3.71%)  100,307 (16.28%)  1.4 (0.92.1)  .17  

Past  155,266 (0.70%)  8,385 (1.36%)  0.7 (0.14.6)  .71  

No lifetime (reference)  21,282,484 (95.59%)  507,390 (82.36%)  N/A  N/A  







Yes  1,466,133 (6.59%)  84,663 (13.74%)  1.4 (0.92.2)  .11  

Prediabetes  4,504,230 (20.23%)  125,783 (20.42%)  1.0 (0.71.4)  .81  

No (reference)  16,292,517 (73.18%)  405,636 (65.84%)  N/A  N/A  







Yes  3,398,898 (15.27%)  173,323 (28.13%)  0.9 (0.71.3)  .65  

No (reference)  18,863,982 (84.73%)  442,759 (71.87%)  N/A  N/A  







Yes  5,178,518 (23.26%)  139,892 (22.71%)  1.2 (0.81.8)  .47  

No  17,084,362 (76.74%)  476,190 (77.29%)  N/A  N/A  
Pain or discomfort level (1 to 3 points), mean (SE)  1.2 (0.004)  1.6 (0.034)  1.3 (1.11.5)  <.001  
Fasting serum triglyceride (mg/dL)^{d}, mean (SE)  134.8 (1.3)  161.8 (12.0)  1.2 (1.11.3)  .002  
Weight gain level within 1 year (0 to 3 points), mean (SE)  0.4 (0.009)  0.6 (0.047)  1.2 (1.11.3)  .003  
Urine specific gravity, mean (SE)  1.020 (0.00007)  1.018 (0.0004)  0.8 (0.70.9)  .003  
Usual activity problem level (1 to 3 points), mean (SE)  1.0 (0.002)  1.3 (0.029)  1.1 (1.01.3)  .006  
Income quintiles (household) (1 to 5 points), mean (SE)  3.4 (0.02)  2.6 (0.09)  0.8 (0.70.9)  .007  
Educational level (1 to 4 points), mean (SE)  3.2 (0.01)  2.6 (0.06)  0.8 (0.71.0)  .01  
Urine pH, mean (SE)  5.7 (0.009)  5.8 (0.050)  1.1(1.01.3)  .07  
Mobility problem level (1 to 3 points), mean (SE)  1.1 (0.002)  1.3 (0.032)  1.1 (1.01.2)  .22  
Selfcare problem level (1 to 3 points), mean (SE)  1.0 (0.001)  1.1 (0.017)  1.0 (0.91.1)  .98  
Age (years)^{e}, mean (SE)  40.7 (0.2)  45.1 (0.8)  1.1 (0.81.3)  .61 
^{a}Population counts (n), means, and standard errors are estimated using complexsurveydesign weights.
^{b}For continuous factors, the value was standardized by removing the mean and scaling to unit variance.
^{c}N/A: not applicable.
^{d}Nonmodel factors were found by controlling for model features and age.
^{e}Not a depressionrelated factor, but included to control confounding effects of age.
Based on the centrality indexes, the network had two highcentrality nodes, namely, “educational level” and “male,” which meant both nodes were highly connected with various factors (
Partial correlation network graph and centrality indices for depressionassociated factors. The factors can be positively (risk) or negatively (protective) related to current depression. Node size is proportional to the effective size of the odds ratio. Green and red edges represent positive and negative correlations, respectively. The edge with the highest absolute weight has fullcolor saturation and the widest width.
Two indirect factors, namely, hypercholesterolemia and diabetes, were positively connected with triglyceride on the network, and triglyceride was a confounder of hypercholesterolemia and diabetes (
Possible confounders on indirect factors in depression.
Indirect factors  Confounders^{a}  Odds ratio (95% CI)  
With confounders  Without confounders  
Current osteoarthritis  Pain or discomfort level and mobility problem level  1.4 (0.92.1)  1.6 (1.12.5)  
Hypercholesterolemia  Triglyceride  1.4 (1.01.9)  1.5 (1.02.1)  
Diabetes  Triglyceride and hypercholesterolemia  1.4 (0.92.2)  1.7 (1.12.6) 
^{a}A group of confounders to make a coefficient of the indirect factor statistically insignificant.
Another indirect factor, “current smoker,” interacted with sex and was statistically significant in females (odds ratio 2.2, 95% CI 1.33.7;
The proportion of current depression according to sex and smoking (A) and according to weight gain and diabetes (B). The interaction effects between them are significant (
We found an interesting relationship between urine specific gravity (USG) and depression. Because USG was associated with the “male” node on the network (
Linear regression plots between age and urine specific gravity in females and males. In the images, 95% confidence intervals for the regression estimate are drawn using translucent bands around the regression line.
Linear regression plots between age and daily water intake in females and males. In the images, 95% confidence intervals for the regression estimate are drawn using translucent bands around the regression line.
Our data set was highly imbalanced, with a 2.7% prevalence of depression. Therefore, it may be difficult for a machine learning algorithm to predict depression because this diagnosis is not sufficiently represented [
There would have been a generalization problem with this survey data because we did not train the model using the sample weight. However, the weighted AUC for the test set (0.86) was not lower than the mean unweighted AUC for the training set (0.84, SD 0.02; seven repeats of fivefold crossvalidation). Therefore, our model was not overfitted, and we expect that our model can be applied to the general population.
A model may be less reliable when it is based on selfreported depression than on clinically diagnosed depression. However, for PHQ9 scores, a clinical module, the model performance was still satisfactory when the model was retested with available cases (
XGBoost L1 regularization did not ultimately reduce the feature number; however, feature selection using XGBoost feature importance could decrease the feature number more at a higher reduction rate. Therefore, we suggest feature reduction using XGBoost feature importance to obtain a sparser model that contains the most principle factors of the disease.
Because of the relatively small number of predictors, a sparse model is interpretable without irrelevant features, which could be shown by the impact on the depression of each feature (
Additionally, nonmodel factors (triglyceride, current asthma, and farm worker) exhibited statistical significance by multiple regression with depression model factors. By controlling principle depression factor effects (eg, confounding effects), nonmodel factors’ significance was reliable. Therefore, we might use a sparse disease model as a testing tool for candidate factors.
Our network was composed of differently sized nodes (statistical strength of the association) and edges (net correlation between two factors). “Perceived stress” and “current asthma” were prioritized in depression risk factor control because of their large nodes (
In terms of the network’s centrality indexes, “male” and “educational level” showed the highest values. Scalefree networks are characterized by growth and preferential attachment in which earlier nodes in the network increase their connectivity at a higher rate [
Clusters in the disease network can be potent risk intervention targets because factors in a cluster are connected and controlled together [
If an indirect risk factor is connected to a direct risk factor, the direct factor could be the confounder. Furthermore, if the direct factor is on the causal pathway, it can be a mediator of the indirect factor action. We found potential confounders of current osteoarthritis, hypercholesterolemia, and diabetes using the network (
Obesity and metabolic syndrome may mediate the relationship between diabetes and depression [
If an indirect risk factor is connected to a direct protective factor or negatively associated with a direct risk factor, its role could be revealed by the interaction effect. “Current smoker,” which is an indirect factor strongly connected to “male,” was a significant risk factor in females, but not in males (
To the best of our knowledge, USG was a novel factor of depression, and the level was low in female depressive subjects (
This crosssectional study involved people aged 19 to 64 years and could analyze only associations, not causalities. Therefore, the associations established in this study, such as weight gain in diabetes, elevated triglyceride, low USG, and smoking in females, require future clinical research to prove their efficacy in depression control. We used Bonferroni correction for nonmodel factors, and this conservative method might miss other possibly significant factors. Several cancers (hepatoma, gastric cancer, colon cancer, breast cancer, and lung cancer), renal failure, pulmonary tuberculosis, liver cirrhosis, and hepatitis C might not have enough positive cases to be tested. Finally, psychiatrists’ structured interviews to diagnose depression would be more valid than the selfreported approach to identify depression used in this study.
We successfully created a sparse model for depression using TPOTassisted XGBoost training and feature selection based on the feature importance of XGBoost from a large number of variables in KNHANES data. Because of the datadriven approach, we could discover a novel factor. We constructed a network of the depressionassociated factors using association strength and interfactor correlations. The model factors were classified into direct and indirect according to their statistical significance, and the role of indirect factors was explained by confounding or interaction effects. The network also indicated predisposing factors by high centrality and cluster factors by a closely connected edge. Therefore, XGBoost and network analysis can be useful for discovering and understanding diseaseassociated factors in epidemiological studies.
area under the receiver operating characteristic curve
extended Bayesian information criterium graphical lasso
EuroQol fivedimension threelevel
Korea Centers for Disease Control and Prevention
Korea National Health and Nutrition Examination Survey
least absolute shrinkage and selection operator
Patient Health Questionnaire9
primary sampling unit
quality of life
receiver operating characteristic
socioeconomic status
treebased pipeline optimization tool
urine specific gravity
extreme gradient boosting
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (2019R1C1C1007663, 2020R1F1A106842312, and 2019R1A2C1084611).
HWH and JIK contributed equally as corresponding authors.
None declared.