This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Machine learning techniques are increasingly being applied in health research. It is not clear how useful these approaches are for modeling continuous outcomes. Child quality of life is associated with parental socioeconomic status and physical activity and may be associated with aerobic fitness and strength. It is unclear whether diet or academic performance is associated with quality of life.
The purpose of this study was to compare the predictive performance of machine learning techniques with that of linear regression in examining the extent to which continuous outcomes (physical activity, aerobic fitness, muscular strength, diet, and parental education) are predictive of academic performance and quality of life and whether academic performance and quality of life are associated.
We modeled data from children attending 9 schools in a quasi-experimental study. We split data randomly into training and validation sets. Curvilinear, nonlinear, and heteroscedastic variables were simulated to examine the performance of machine learning techniques compared to that of linear models, with and without imputation.
We included data for 1711 children. Regression models explained 24% of academic performance variance in the real complete-case validation set, and up to 15% in quality of life. While machine learning techniques explained high proportions of variance in training sets, in validation, machine learning techniques explained approximately 0% of academic performance and 3% to 8% of quality of life. With imputation, machine learning techniques improved to 15% for academic performance. Machine learning outperformed regression for simulated nonlinear and heteroscedastic variables. The best predictors of academic performance in adjusted models were the child’s mother having a master-level education (
Linear regression was less prone to overfitting and outperformed commonly used machine learning techniques. Imputation improved the performance of machine learning, but not sufficiently to outperform regression. Machine learning techniques outperformed linear regression for modeling nonlinear and heteroscedastic relationships and may be of use in such cases. Regression with splines performed almost as well in nonlinear modeling. Lifestyle variables, including physical exercise, television and computer use, and parental education, are predictive of academic performance or quality of life. Academic performance is associated with quality of life after adjusting for lifestyle variables and may offer another promising intervention target to improve quality of life in children.
In trials and quasi-experimental designs, reported sample sizes range from less than 100 to several thousand [
Quality of life is an important health outcome in trials [
We used data from fifth-year students attending 9 schools in Norway between 2015 and 2019, within the Health Oriented Pedagogical Project (HOPP), which is an ongoing quasi-experimental study (ClinicalTrials.gov; NCT02495714) in which data up to 2019 were captured [
Physical activity level (defined by movement counts per minute: sedentary 0-99, light 100-1999, moderate 2000-4999, and hard or vigorous ≥5000 [
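The intensity bands quoted above can be written as a simple lookup. The following is a minimal sketch only; the function name `classify_intensity` is hypothetical and is not taken from the study's software.

```python
def classify_intensity(counts_per_minute: float) -> str:
    """Map accelerometer counts per minute to the intensity bands quoted in the text."""
    if counts_per_minute < 0:
        raise ValueError("counts per minute cannot be negative")
    if counts_per_minute < 100:
        return "sedentary"   # 0-99
    if counts_per_minute < 2000:
        return "light"       # 100-1999
    if counts_per_minute < 5000:
        return "moderate"    # 2000-4999
    return "vigorous"        # 5000 or more (hard or vigorous)
```

For example, a minute with 2500 counts would fall in the moderate band.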
We split the data set randomly into training (70%) and validation (30%) sets in order to train models and subsequently evaluate performance. We expected missing data (approximately 20% overall, with few variables >50%). Full imputation may often be performed with machine learning techniques regardless of the extent of missing data or whether or not data are missing at random. We performed a sensitivity analysis using single-mean imputation for continuous predictor variables and mode for nonbinary or categorical predictors (stratified by school) under the assumption that observations were missing at random. We tested this assumption for variables in final models by fitting a dummy variable for variable missingness, examining effect on outcome using 2-tailed independent
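The split and the single-value imputation described above might be sketched as follows. This is an illustration under stated assumptions, not the study's code: the helper names, the `school` key, the use of `None` for missing values, and the `_missing` flag (standing in for the missingness dummy variable) are all hypothetical. The text does not specify whether imputation preceded the split, so the two steps are kept as independent helpers.

```python
import random
from collections import Counter, defaultdict
from statistics import mean


def split_train_validation(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and cut it into training and validation sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def impute_by_school(rows, variable, categorical=False):
    """Fill missing values (None) with the school-level mean, or mode if categorical.

    A 0/1 flag named '<variable>_missing' records which values were imputed.
    """
    observed = defaultdict(list)
    for row in rows:
        if row[variable] is not None:
            observed[row["school"]].append(row[variable])

    imputed = []
    for row in rows:
        new_row = dict(row)
        was_missing = row[variable] is None
        new_row[variable + "_missing"] = int(was_missing)
        values = observed[row["school"]]
        if was_missing and values:  # stays None if the school has no observed values
            if categorical:
                new_row[variable] = Counter(values).most_common(1)[0][0]
            else:
                new_row[variable] = mean(values)
        imputed.append(new_row)
    return imputed
```

The missingness flag can then be entered as a predictor to check whether missingness itself is related to the outcome.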
We took a pragmatic approach to regression modeling that we judged to approximate best practice. In cases of high between-predictor correlations (ρ>0.75), we selected 1 variable for modeling. In the absence of strong clinical or theoretical indications, we chose the variable that explained the most variance. To enable comparisons to regression approaches in which individuals are clustered by site, we fitted linear mixed models with a random intercept by school. We also built nonhierarchical models, without this random effect, to compare adjusted
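The screening rule for highly correlated predictors can be illustrated with a short sketch. Several details here are assumptions rather than the authors' procedure: Pearson correlation is used for ρ, "variance explained" is taken as the squared univariate correlation with the outcome, and the greedy ordering and the name `screen_correlated` are hypothetical. Clinical or theoretical reasons for preferring a variable, which the text gives priority, are not modeled.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def screen_correlated(predictors, outcome, threshold=0.75):
    """Keep one predictor from each highly correlated set.

    Predictors are visited in order of variance explained (squared correlation
    with the outcome); one is kept only if its absolute correlation with every
    predictor already kept does not exceed the threshold.
    """
    ranked = sorted(
        predictors,
        key=lambda name: pearson(predictors[name], outcome) ** 2,
        reverse=True,
    )
    kept = []
    for name in ranked:
        if all(abs(pearson(predictors[name], predictors[k])) <= threshold for k in kept):
            kept.append(name)
    return kept
```

Visiting predictors in order of variance explained means that, within any correlated pair, the stronger univariate predictor is the one retained.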
The diet and lifestyle variables from the Ungkost-2000 questionnaire have multiple quasi-continuous responses (eg
We evaluated the performance of 4 machine learning techniques (
Variables that it did not make sense to include were removed (eg
Machine learning techniques that were evaluated in this study.
Algorithm | Description |
k-Nearest neighbors | A classification technique that assigns class or predicts a continuous value based on the classes or values of |
Neural network | A technique in which artificial neuron cores are connected with |
Random forest | An iteratively grown set of decision trees, where each tree outputs outcome means, with branches split by variable characteristics, and where each tree is formed from randomly bootstrapping data, with averages taken from all trees. |
Support vector machine | A technique that minimizes error to individualize a hyperplane. |
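To make the first technique in the table above concrete for a continuous outcome, the sketch below predicts a value as the mean outcome of the nearest training observations. It is a toy illustration, assuming Euclidean distance and unweighted averaging; it is not the implementation or the tuning used in the study, and `knn_predict` is a hypothetical name.

```python
from math import dist


def knn_predict(train_x, train_y, query, k=3):
    """Predict a continuous value as the mean outcome of the k nearest training points."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort training points by Euclidean distance to the query and keep the closest k.
    nearest = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], query))[:k]
    return sum(y for _, y in nearest) / k
```

Because each prediction depends only on nearby observations, small values of k track the training data closely, which is one route to the overfitting discussed in this paper.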
We simulated data to explore types of relationship that were not present in our real data but that, we reasoned, might be modeled better by either regression or machine learning techniques. We simulated, without missing data, (1) a variable with a quadratic relationship with academic performance; (2) a variable with a true nonlinear relationship with academic performance; and (3) a variable with marked heteroscedasticity (ie
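The three kinds of simulated relationship can be illustrated as follows. The functional forms, noise levels, and the name `simulate_relationships` are assumptions chosen for illustration; the generating equations used in the study are not reproduced here. In this sketch, `x` stands in for a simulated variable and each other column for an outcome related to it in one of the three ways.

```python
import math
import random


def simulate_relationships(n=1000, seed=1):
    """Generate complete data with quadratic, nonlinear, and heteroscedastic relationships."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(-3, 3)
        rows.append({
            "x": x,
            # Curvilinear: the outcome depends on the square of x.
            "quadratic": x ** 2 + rng.gauss(0, 0.5),
            # Nonlinear: a sine wave that no polynomial of low degree fits well.
            "nonlinear": math.sin(2 * x) + rng.gauss(0, 0.2),
            # Heteroscedastic: linear trend whose noise grows with the size of x.
            "heteroscedastic": x + rng.gauss(0, 0.2 + 0.5 * abs(x)),
        })
    return rows
```

Fixing the seed makes the simulated set reproducible, so that every modeling approach is compared on identical data.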
To compare performance, we calculated RMSE and
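The performance measures can be sketched as below. This assumes RMSE is the square root of the mean squared residual and that the proportion of variance explained is computed as 1 minus the ratio of residual to total sum of squares; the exact estimators used in the study's Stata and R code may differ. The function names are hypothetical.

```python
from math import sqrt


def rmse(observed, predicted):
    """Square root of the mean squared difference between observed and predicted values."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def r_squared(observed, predicted):
    """Proportion of outcome variance explained by the predictions."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, n_predictors):
    """Penalize R-squared for the number of predictors; can fall below zero."""
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
```

A model that predicts no better than the outcome mean has an R² of 0, and the adjustment for the number of predictors can then push the adjusted value below 0, which is how slightly negative values can arise in validation.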
To aid interpretation of adjusted regression model outputs for those unfamiliar with the outcome scales, we calculated Cohen
All analyses were performed using Stata (version 15.1; StataCorp LLC) and R (version 3.6; R Foundation for Statistical Computing). The HOPP project received approval from the Norwegian Regional Ethical Committee (2014/2064/REK south-east), and parents of all children provided written informed consent for their child’s participation.
Data comprised outcomes from 1711 year 5 (11- and 12-year-old) children (Tables S1 and S2 in
Academic performance was approximately normally distributed (
In real complete-case data, nonhierarchical and mixed models explained approximately 30% of the variance in the training set and 22% to 24% of the variance in the validation set (
Histogram of average national test scores.
Adjusted effects in selected mixed regression model for predicting academic performance.
Variable | β (95% CI) | n | P value |
Stroop test congruent (milliseconds) | −0.0037 (−0.0047 to −0.0027) | 384 | <.001 |
Effect of master-level education for father | 1.59 (−0.06 to 3.25) | 384 | .06 |
Effect of master-level education for mother | 1.98 (0.25 to 3.71) | 384 | <.001 |
Average hand strength (kilograms) | 0.21 (0.08 to 0.34) | 384 | .001 |
Hours of physical activity (self-reported; dichotomized) | 2.47 (1.08 to 3.87) | 384 | .001 |
Effect of mother having higher education | 1.82 (0.07 to 3.57) | 384 | .04 |
Hours of television per week (self-reported; 7-level quasi-continuous) | 1.19 (0.25 to 3.71) | 384 | .03 |
Performance indicators in real data and real data augmented with simulated data (quadratic, nonlinear, or heteroscedastic) for academic performance.
Model | Training (n=962) | | | Validation (n=406) | | |
| RMSEa | Adjusted R²b | n | RMSE | Adjusted R²b | n |
| 0.81 | 0.30 | 384 | 0.85 | 0.22 | 163 |
Quadratic | 0.45 | 0.78 | 384 | 0.40 | 0.83 | 163 |
Nonlinear | 0.55 | 0.68 | 384 | 0.53 | 0.70 | 163 |
Heteroscedastic | 0.53 | 0.70 | 384 | 0.61 | 0.61 | 163 |
| 0.83 | 0.30 | 384 | 0.86 | 0.24 | 163 |
Quadratic | 0.46 | 0.79 | 384 | 0.39 | 0.84 | 163 |
Nonlinear | 0.56 | 0.68 | 384 | 0.53 | 0.72 | 163 |
Heteroscedastic | 0.54 | 0.70 | 384 | 0.62 | 0.62 | 163 |
—c | — | — | — | — | — | ||||||||
Nonlinear | 0.41 | 0.82 | 384 | 0.39 | 0.84 | 163 |
| 0.61 | 0.62 | 121 | 0.95 | −0.02 | 63 |
Quadratic | 0.32 | 0.91 | 121 | 0.51 | 0.75 | 63 |
Nonlinear | 0.36 | 0.89 | 121 | 0.57 | 0.64 | 63 |
Heteroscedastic | 0.34 | 0.89 | 121 | 0.67 | 0.53 | 63 |
| 0.55 | 0.63 | 116 | 0.89 | −0.05 | 58 |
Quadratic | 0.33 | 0.87 | 116 | 0.53 | 0.62 | 58 |
Nonlinear | 0.46 | 0.77 | 116 | 0.77 | 0.18 | 58 |
Heteroscedastic | 0.35 | 0.85 | 116 | 0.62 | 0.52 | 58 |
| 0.90 | 0.13 | 133 | 1.02 | −0.01 | 66 |
Quadratic | 0.37 | 0.84 | 133 | 0.48 | 0.75 | 66 |
Nonlinear | 0.41 | 0.81 | 133 | 0.48 | 0.75 | 66 |
Heteroscedastic | 0.43 | 0.79 | 133 | 0.61 | 0.60 | 66 |
| 0.73 | 0.35 | 124 | 1.03 | −0.02 | 66 |
Quadratic | 0.38 | 0.82 | 124 | 0.40 | 0.85 | 66 |
Nonlinear | 0.41 | 0.79 | 124 | 0.46 | 0.79 | 66 |
Heteroscedastic | 0.43 | 0.77 | 124 | 0.70 | 0.53 | 66 |
aRMSE: residual mean square error.
bUnlike unadjusted
cNot performed.
Regression performed best for modeling real data augmented with simulations (
Scatter plots of average national test score and simulated (A) curvilinear, (B) nonlinear, and (C) heteroscedastic variables.
Crude performance of simulated variables.
Model | Training (n=962) | | | Validation (n=406) | | |
| RMSEa | R² | n | RMSEa | R² | n |
| | | | | | |
Quadratic | 0.45 | 0.79 | 962 | 0.43 | 0.82 | 406 |
Nonlinear | 0.56 | 0.68 | 962 | 0.58 | 0.68 | 406 |
Heteroscedastic | 0.59 | 0.64 | 962 | 0.63 | 0.62 | 406 |
| | | | | | |
Quadratic | 0.45 | 0.79 | 962 | 0.43 | 0.83 | 406 |
Nonlinear | 0.57 | 0.68 | 962 | 0.58 | 0.68 | 406 |
Heteroscedastic | 0.59 | 0.64 | 962 | 0.63 | 0.62 | 406 |
| | | | | | |
Nonlinear | 0.41 | 0.83 | 962 | 0.39 | 0.85 | 406 |
| | | | | | |
Quadratic | 0.25 | 0.94 | 962 | 0.49 | 0.78 | 406 |
Nonlinear | 0.24 | 0.94 | 962 | 0.45 | 0.81 | 406 |
Heteroscedastic | 0.32 | 0.90 | 962 | 0.66 | 0.58 | 406 |
| | | | | | |
Quadratic | 0.44 | 0.80 | 962 | 0.42 | 0.83 | 406 |
Nonlinear | 0.44 | 0.81 | 962 | 0.43 | 0.82 | 406 |
Heteroscedastic | 0.57 | 0.68 | 962 | 0.61 | 0.66 | 406 |
| | | | | | |
Quadratic | 0.40 | 0.84 | 962 | 0.45 | 0.81 | 406 |
Nonlinear | 0.34 | 0.88 | 962 | 0.43 | 0.82 | 406 |
Heteroscedastic | 0.49 | 0.76 | 962 | 0.65 | 0.60 | 406 |
| | | | | | |
Quadratic | 0.44 | 0.80 | 962 | 0.42 | 0.83 | 406 |
Nonlinear | 0.40 | 0.84 | 962 | 0.38 | 0.86 | 406 |
Heteroscedastic | 0.56 | 0.68 | 962 | 0.59 | 0.66 | 406 |
aRMSE: residual mean square error.
Performance indicators for academic performance in sensitivity analyses (single-mean imputation).
Model | Training (n=962) | | | Validation (n=406) | | |
| RMSEa | R² | n | RMSEa | R² | n |
Nonhierarchical linear model | 0.88 | 0.20 | 962 | 0.92 | 0.15 | 406 | |
Mixed model | 0.89 | 0.21 | 962 | 0.92 | 0.18 | 406 | |
Random forest | 0.76 | 0.48 | 962 | 0.94 | 0.14 | 406 | |
Support vector machine | 0.82 | 0.32 | 962 | 0.95 | 0.12 | 406 | |
k-Nearest neighbors | 0.89 | 0.20 | 962 | 0.86 | 0.12 | 406 | |
Neural network | 0.90 | 0.18 | 962 | 0.97 | 0.09 | 406 |
aRMSE: residual mean square error.
Despite a ceiling effect, we judged the distribution of child-reported quality of life (
Histogram of child-reported quality of life scores.
Adjusted effects of modifiable risk factors in mixed regression model for predicting quality of life.
Variable | β (95% CI) | n | P value |
Frequency of physical activity (7-level quasi-continuous) | 1.09 (0.53 to 1.66) | 676 | <.001 |
Hours of television per week (self-reported; 7-level quasi-continuous) | −0.95 (−1.55 to −0.36) | 676 | .002 |
Hard exercise (minutes) | 0.02 (0.002 to 0.03) | 676 | .008 |
Percentage of time in moderate exercise | 0.29 (0.002 to 0.59) | 676 | .048 |
Our parsimonious 3-variable mixed model explained 12% of variance in the training set and 15% of the variance in the validation set. Machine learning techniques retained more observations than the first regression model due to our selection of the fish oil variable, which had fewer observations (
Performance indicators by modeling approach for quality of life.
Model | Training (n=1107) | | | Validation (n=453) | | |
| RMSEa | R² | n | RMSEa | R² | n |
Regression model 1 | 0.89 | 0.11 | 293 | 0.85 | 0.13 | 111 | |
Mixed model 1 | 0.89 | 0.12 | 293 | 0.85 | 0.15 | 111 | |
Regression model 2 | 0.91 | 0.08 | 676 | 0.95 | 0.06 | 275 | |
Mixed model 2 | 0.91 | 0.08 | 676 | 0.96 | 0.07 | 275 | |
Random forest | 0.66 | 0.74 | 481 | 0.89 | 0.03 | 190 | |
Support vector machine | 0.85 | 0.14 | 524 | 0.97 | 0.08 | 208 | |
k-Nearest neighbors | 0.78 | 0.33 | 295 | 0.97 | 0.08 | 117 | |
Neural network | 0.80 | 0.28 | 319 | 0.99 | 0.07 | 123 |
aRMSE: residual mean square error.
Performance indicators by modeling approach for quality of life in sensitivity analysis (single-mean imputation).
Model | Training (n=1107) | | | Validation (n=453) | | |
| RMSEa | R² | n | RMSEa | R² | n |
Regression model | 0.95 | 0.09 | 1107 | 0.93 | 0.13 | 453 |
Mixed model | 0.95 | 0.09 | 1107 | 0.93 | 0.14 | 453 |
Random forest | 0.80 | 0.59 | 1107 | 0.96 | 0.05 | 453 |
Support vector machine | 0.92 | 0.17 | 1107 | 0.96 | 0.07 | 453 |
k-Nearest neighbors | 0.94 | 0.12 | 1107 | 0.96 | 0.06 | 453 |
Neural network | 0.96 | 0.09 | 1107 | 0.97 | 0.05 | 453 |
aRMSE: residual mean square error.
In modeling continuous health outcomes in a data set containing some missing data, linear regression was less prone to overfitting, retained more observations, and outperformed common machine learning techniques. In validation, regression explained approximately one-quarter of the variance in academic performance and up to 15% of the variance in quality of life, using exercise, lifestyle, and parental education data. Imputation improved machine learning performance, but the improvements were not sufficient to outperform regression. Machine learning techniques outperformed regression for modeling nonlinear and heteroscedastic simulations and may be of use when there are no missing data or imputation is plausible, and where complex nonlinearity or heteroscedasticity exists. However, regression with splines performed almost as well for nonlinear modeling.
Multiple comparisons exist between machine learning techniques and logistic regression, multiclass, and survival analysis models, which taken together suggest similar results and an increased risk of overfitting with machine learning techniques [
We found very strong evidence that reported physical activity, time recorded in vigorous exercise, and percentage of time spent in moderate exercise are positively associated with quality of life as continuous health outcomes in typical circumstances when adjusted for each of the other modeled variables. Associations between socioeconomic status, increased physical activity, and child quality of life are well established [
We found very strong evidence that reported physical activity, increased hand strength, a mother having a master’s education or above, and decreased Stroop time are associated with increases in academic performance. We found some evidence that a mother having a university education and increases in television and computer use are associated with increased academic performance. Reporting exercise that causes a sweat for at least 2 hours per week, 10 kg greater hand strength, a mother having a university or master’s education, an increase of 1 television and computer use level, or a decrease in Stroop time of 1 second were each associated with small or small-to-medium increases in academic performance. Socioeconomic status variables have been shown, in a meta-analysis [
Diet may affect both quality of life and academic performance via mechanisms related to the consumption of adequate micronutrients [
The rising popularity of machine learning techniques is understandable given the general abundance of data and a need for fewer assumptions. Machine learning techniques may be useful simply by virtue of the amount of data available. However, in public health research and health services research, data are less abundant and often missing. When modeling continuous outcomes in such circumstances, machine learning techniques are likely to perform worse unless marked nonlinear or heteroscedastic relationships exist. We have shown that the tendency to overfit that is often demonstrated in binary and multiclass machine learning techniques is also a challenge when modeling continuous outcomes. Furthermore, an innate inability for parameter estimation hampers interpretation and may make machine learning techniques generally less useful. At the time of writing, machine learning techniques have made relatively little impact in public health research on COVID-19 (with either continuous or categorical outcomes) where there is a pressing and immediate need for good modeling. We find this unsurprising—in most cases, public health data have normal distributions, and marked nonlinearity is rare. In these cases, traditional regression methods use the most efficient estimators and will lead to better models.
Interventions aiming to improve activity levels in children may have a positive effect on both child quality of life and academic performance. The small association between academic performance and quality of life could follow from the satisfaction of achievement, although reverse causation or residual confounding is also plausible. In addition to increasing physical activity, new interventions to improve quality of life might target improvements in academic performance. Television and computer use is associated with decreases in quality of life but improvements in academic performance, and these factors should be examined separately to clarify other promising intervention targets.
We provide like-for-like comparisons between machine learning techniques and regression for modeling continuous health outcomes, with a larger sample size than those used in previous research and with separate validation. Nevertheless, our work has limitations. We used an average of reading, math, and English tests as a proxy for academic performance. Not including subjects such as science may impair construct coverage of academic performance. Using single-mean imputation and last observation carried forward (for missing Ungkost variables) allowed us to avoid using multiple imputation (which is based on regression approaches) for data used in machine learning models (ie, to avoid mixing methods). However, multiple imputation provides better coverage than single-mean imputation, and last observation carried forward is known to be problematic [
Future focus on comparisons to other machine learning techniques, separate analysis of academic performance components, and iteratively varying the size of the training set to explore how training set size affects overfitting will provide further useful knowledge. The Ungkost item on television and computer use combines 2 activities. We found large positive associations between the item and academic performance and a small negative association with quality of life. We suspect the positive associations may be grounded in computer use for education, and the negative associations may be grounded in uses for leisure. Separation of these exposures will provide clarity. Some machine learning techniques retained diet variables that we did not select for adjusted models. One strength of machine learning techniques may be an ability to detect mild and easily missed nonlinear relationships, which is worth further exploration.
For modeling continuous health outcomes when some data are missing, linear regression is less prone to overfitting and outperforms common machine learning techniques. Imputation improves the performance of machine learning techniques, but improvements are not sufficient to outperform regression. Machine learning techniques outperform regression in modeling nonlinear and heteroscedastic relationships and may be of use in cases where imputation is sensible or there are no or few missing data. Otherwise, regression is preferred. Regression with splines performs almost as well in nonlinear modeling. Lifestyle variables, including physical activity, television and computer use, muscular strength, and parental education, were predictive of academic performance or quality of life, explaining up to 24% and 15% of the variance in these outcomes, respectively. Targeting these areas in future interventions may help improve child quality of life and academic performance.
Supplementary tables and technical notes.
COVID-19: coronavirus disease 2019
HOPP: Health Oriented Pedagogical Project
RMSE: residual mean square error
Thanks are due to Kristiania University College for providing seed funding for this work and to Gary Abel (University of Exeter), Sandra Eldridge (Queen Mary, University of London), and George Bouliotis (University of Warwick) for helpful discussions related to this work.
RF conceived the study, applied for internal seed funding, conducted some analyses, and wrote the first draft of the manuscript. SH conducted most of the machine learning analyses, and HK conducted remaining machine learning analyses. JF set up and maintained study software and server. LF provided input on educational components. PMF provided data and input on the HOPP study and obtained ethics approval for the HOPP study activities. All authors contributed to interpretation of the findings and approved the final manuscript.
RF is a director and shareholder and JF is a shareholder of Clinvivo Ltd, a University of Warwick spin-out company. Neither Clinvivo services nor Clinvivo software products were used in this study.