Journal of Medical Internet Research (J Med Internet Res; JMIR), ISSN 1438-8871

JMIR Publications, Toronto, Canada

Article ID: v27i1e77721; PMID: 41191908; DOI: 10.2196/77721

Article type: Viewpoint

Beyond Comparing Machine Learning and Logistic Regression in Clinical Prediction Modelling: Shifting from Model Debate to Data Quality

Amaryllis Mavragani

Scott Silvey

Jieni

Yanan Hu, BCom^1; Monash Centre for Health Research and Implementation, Faculty of Medicine, Nursing and Health Sciences, Monash University, 43-51 Kanooka Grove, Melbourne, 3168, Australia; Phone: 61 438555775; Email: yanan.hu@monash.edu; ORCID: https://orcid.org/0000-0003-1794-7789

Xin Zhang, BEng, MEng, PhD^2; ORCID: https://orcid.org/0000-0002-0675-6891

Valerie Slavin, BSc, PhD^3,4,5,6; ORCID: https://orcid.org/0000-0002-7688-2498

Yitayeh Belsti, PhD^1; ORCID: https://orcid.org/0000-0001-8984-1495

Sofonyas Abebaw Tiruneh, MPH^1; ORCID: https://orcid.org/0000-0001-5867-5105

Emily Callander, BA, PhD^1,7; ORCID: https://orcid.org/0000-0001-7233-6804

Joanne Enticott, PhD^1; ORCID: https://orcid.org/0000-0002-4480-5690

^1Monash Centre for Health Research and Implementation, Faculty of Medicine, Nursing and Health Sciences, Monash University, Melbourne, Australia

^2Department of Electrical and Computer Systems Engineering, Monash University, Melbourne, Australia

^3Gold Coast Hospital and Health Service, Gold Coast Hospital, Gold Coast, Australia

^4School of Nursing and Midwifery, Griffith University, Gold Coast, Australia

^5School of Nursing and Midwifery, University of Technology Sydney, Sydney, Australia

^6School of Nursing, Midwifery and Social Work, The University of Queensland, Brisbane, Australia

^7School of Public Health, University of Technology Sydney, Sydney, Australia

Corresponding Author: Yanan Hu yanan.hu@monash.edu

Published: 5 November 2025 (e77721)

Article history dates: 19 May 2025; 24 July 2025; 28 July 2025; 14 August 2025

©Yanan Hu, Xin Zhang, Valerie Slavin, Yitayeh Belsti, Sofonyas Abebaw Tiruneh, Emily Callander, Joanne Enticott. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 05.11.2025.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.

The rapid uptake of supervised machine learning (ML) in clinical prediction modelling, particularly for binary outcomes based on tabular data, has sparked debate about its comparative advantage over traditional statistical logistic regression. Although ML has demonstrated superiority in unstructured data domains, its performance gains in structured, tabular clinical datasets remain inconsistent and context dependent. This viewpoint synthesizes recent comparative studies and simulation findings to argue that there is no universal best modelling approach. Model performance depends heavily on dataset characteristics (eg, linearity, sample size, number of candidate predictors, minority class proportion) and data quality (eg, completeness, accuracy). Consequently, we argue that efforts to improve data quality, not model complexity, are more likely to enhance the reliability and real-world utility of clinical prediction models.

Keywords: machine learning; clinical prediction models; logistic regression; clinical data; data quality; binary classification

Introduction

The increasing adoption of supervised machine learning (ML) in clinical prediction models (diagnostic or prognostic) based on tabular data has sparked considerable debate regarding its comparative performance against traditional statistical logistic regression (LR), particularly for binary outcomes such as mortality or the occurrence of adverse events [1]. Although supervised ML approaches have demonstrated clear superiority in classifying unstructured clinical data such as medical images and texts [2], their added value for the classification of clinical tabular data, where structured data are organized in tables, typically with rows representing individual cases and columns representing individual characteristics, remains uncertain and context-dependent [3].

Definition of Statistical LR Versus ML-Based LR

The distinction between statistical LR and ML-based LR is frequently blurred in both literature and practice [1]. Many studies loosely refer to any penalized LR model as ML, despite fundamental methodological differences. To enable valid comparisons between these approaches, it is therefore crucial to clearly delineate their boundaries (Table 1).

Table 1

Definitions of statistical logistic regression and machine learning–based logistic regression.

Method	Definition
Statistical logistic regression	A parametric model operating under conventional statistical assumptions, including linearity and independence, employing fixed hyperparameters without data-driven optimization, using prespecified candidate predictors based on clinical or theoretical justification. This aligns with traditional epidemiological approaches where model specification precedes data analysis.
Machine learning–based logistic regression	An adaptive variant where model specification becomes part of the analytical process itself, hyperparameters like penalty terms are tuned through cross-validation, predictors may be selected algorithmically from a broader set of candidates, and the analytical focus shifts decisively toward predictive performance. While mathematically similar to statistical logistic regression, this approach embodies the machine learning philosophy of learning from data.

In this paper, we adopt the definition proposed in a previous systematic review [1], which characterizes statistical LR as a theory-based model that operates under strict assumptions and does not involve data-driven optimization of predictive performance through hyperparameter tuning but relies on subject knowledge from researchers or experts to specify the model structure. Although penalized LR models may involve hyperparameter tuning and variable selection, they remain theory-involved and do not intrinsically capture nonlinearities or interactions, and the core assumptions of LR still apply. Conversely, ML models are defined as methods that autonomously learn patterns from data (ie, data-driven hyperparameter tuning or predictor selection) [4], including ML-based LR and other supervised ML methods (eg, random forest, boosting, neural networks, support vector machine) that intrinsically handle complex interactions without manual specification beforehand.

Comparative Performance: What Was Already Known

A 2019 meta-regression of 145 low-risk-of-bias comparisons between statistical LR and ML binary clinical prediction models on tabular data showed no performance benefit of ML over statistical LR [1]. However, this comparison was limited to discrimination measured by the area under the receiver operating characteristic curve (AUROC), as other performance metrics were not frequently reported. Notably, 79% (56/71) of the studies did not report calibration performance, and only one study reported clinical utility. Clinical utility is commonly assessed through decision curve analysis, which estimates the clinical value of a prediction model at the population level by weighing the consequences of decisions based on its output [5]: specifically, the benefit of correctly identifying true positives against the harm of incorrectly flagging false positives. A step-by-step guide to this method is available [6].
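To make the trade-off behind decision curve analysis concrete, the following is a minimal Python sketch of the standard net benefit calculation, in which false positives are weighted by the odds of the chosen threshold probability. The outcomes, predicted risks, and thresholds are hypothetical values chosen for illustration only; they are not taken from any study cited here.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating patients whose predicted risk is >= threshold."""
    n = len(y_true)
    flagged = [y for y, p in zip(y_true, y_prob) if p >= threshold]
    true_pos = sum(flagged)
    false_pos = len(flagged) - true_pos
    # False positives are weighted by the odds of the threshold probability.
    return true_pos / n - (false_pos / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: treat everyone regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical outcomes (1 = event) and predicted risks for 10 patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.90, 0.70, 0.20, 0.60, 0.30, 0.10, 0.10, 0.05, 0.05, 0.02]

# A decision curve plots these quantities over a range of thresholds.
for pt in (0.10, 0.25, 0.50):
    model_nb = net_benefit(y_true, y_prob, pt)
    all_nb = net_benefit_treat_all(y_true, pt)
    print(f"threshold={pt:.2f} model={model_nb:.3f} treat_all={all_nb:.3f}")
```

A model adds clinical value at a given threshold only if its net benefit exceeds both the treat-all strategy and the treat-none strategy, whose net benefit is 0 by definition.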

Each performance metric captures a distinct aspect of model performance, with its own strengths and limitations. A model may achieve a high AUROC yet still have poor calibration and potentially harmful clinical consequences if the predicted probabilities are systematically overestimated or underestimated; conversely, a well-calibrated model may still have poor discriminative ability. This highlights the need for comprehensive evaluation across multiple performance domains, including discrimination, calibration, classification metrics, clinical utility, and fairness. Therefore, focusing solely on marginal gains in AUROC between LR and ML can be misleading and inadequate for guiding future research.
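The point that discrimination and calibration can diverge can be illustrated with a small simulation. The sketch below is an illustration under stated assumptions, not an analysis from the cited studies: the simulated risks, the sample size of 500, and the square-root distortion are all hypothetical. Because the distortion preserves the ranking of patients, the AUROC is unchanged while the mean predicted risk drifts away from the observed event rate.

```python
import random


def auroc(y_true, y_prob):
    """Probability that a random event is ranked above a random non-event."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def mean_calibration_gap(y_true, y_prob):
    """Mean predicted risk minus observed event rate (0 = calibrated in the large)."""
    return sum(y_prob) / len(y_prob) - sum(y_true) / len(y_true)


# Hypothetical population: each patient has a true risk between 5% and 60%.
rng = random.Random(0)
true_risk = [rng.uniform(0.05, 0.60) for _ in range(500)]
y = [1 if rng.random() < r else 0 for r in true_risk]

calibrated = true_risk                         # predictions equal the true risks
overestimated = [r ** 0.5 for r in true_risk]  # same ranking, inflated risks

for name, preds in (("calibrated", calibrated), ("overestimated", overestimated)):
    print(name, round(auroc(y, preds), 3), round(mean_calibration_gap(y, preds), 3))
```

Reporting the AUROC alone would make these two sets of predictions look equivalent, although the second would systematically overstate risk to clinicians and patients.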

Furthermore, the meta-regression of AUROC did not explore the underlying sources of performance differences (eg, sample size, number of predictors, use of hyperparameter tuning). Therefore, it is still unclear whether the observed variation in performance reflects true algorithmic superiority or is instead driven by dataset characteristics or modelling procedures. For example, the systematic review highlighted that more than half of the included studies did not clearly report their hyperparameter tuning strategies.

Comparisons should address not only differences in performance but also the stability of that performance [7]. Even when statistical LR outperforms ML on certain metrics (eg, AUROC), its predictions are not necessarily stable or reproducible: applying the same model development procedure to different samples of the same size drawn from the same underlying population can yield substantially different predictions for the same individual. This issue is particularly pronounced with small development datasets, which give rise to a wider range of plausible models in the "multiverse," often with highly unstable individual predictions [8]. Adherence to minimum sample size recommendations is one way to mitigate this issue [9,10]. Notably, a 2023 systematic review reported that 73% of the binary clinical prediction models using statistical LR had sample sizes below the recommended minimum threshold [11]. ML algorithms are generally more data-hungry than LR and need larger samples to achieve stable performance. For example, one study demonstrated that random forest may require more than 20 times the number of events per candidate predictor compared with statistical LR [10].
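Instability of individual predictions can be sketched with a simple simulation: the same one-predictor LR is refitted on repeated development samples of a given size, and the predicted risk for one fixed individual is tracked across refits. Everything in this sketch is an assumption made for illustration, including the simulated population (logit of risk = -1 + 0.8x), the sample sizes of 50 and 2000, the 30 replicates, and the hand-written Newton-Raphson fit; it is not the procedure used in the cited stability studies.

```python
import math
import random


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def draw_sample(n, rng):
    """Simulate n patients from an assumed population: logit(risk) = -1 + 0.8 * x."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if rng.random() < sigmoid(-1.0 + 0.8 * x) else 0 for x in xs]
    return xs, ys


def fit_logistic(xs, ys, iterations=15):
    """Fit intercept a and slope b of a one-predictor LR by Newton-Raphson."""
    a = b = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p          # gradient with respect to the intercept
            g1 += (y - p) * x    # gradient with respect to the slope
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det < 1e-12:          # information matrix is numerically singular
            break
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def prediction_range(n, x_new=1.5, replicates=30, seed=1):
    """Range of predicted risks for one individual across repeated development samples."""
    rng = random.Random(seed)
    preds = []
    for _ in range(replicates):
        a, b = fit_logistic(*draw_sample(n, rng))
        preds.append(sigmoid(a + b * x_new))
    return min(preds), max(preds)


for n in (50, 2000):
    low, high = prediction_range(n)
    print(f"n={n}: predicted risk for the same individual ranges from {low:.2f} to {high:.2f}")
```

The spread of predictions for the same individual narrows as the development sample grows, which is the behavior that minimum sample size recommendations aim to secure.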

Research Recommendation for Reporting: Improving the Transparency of Each Step

In light of the methodological issues identified in the current literature, greater attention should be directed toward ensuring the rigor and transparency of modelling procedures rather than focusing solely on determining the inherent superiority of one modelling approach over another. This includes clear documentation of data preprocessing steps, sample size justifications, modelling decisions, hyperparameter tuning strategies (eg, grid or random search), feature selection techniques (including filter methods like correlation analysis, wrapper methods like recursive feature elimination, or embedded methods such as least absolute shrinkage and selection operator [LASSO]), model performance evaluation methods and metrics, and model explanation methods (eg, Shapley Additive Explanations [SHAP] [12], Submodular Pick Local Interpretable Model-agnostic Explanations [SP-LIME] [13], and Counterfactual Explanations for Robustness, Transparency, Interpretability, and Fairness of Artificial Intelligence [CERTIFAI] [14]).

No Free Lunch Theorem

There is no universally best method for clinical prediction models [15,16]; whether the benefits of a given algorithm (statistical or ML) are realized depends heavily on dataset characteristics (eg, sample size, class imbalance, nonlinearity, number of candidate predictors [10]) and data quality (eg, completeness, accuracy [17]).

Each algorithm has its unique strengths and limitations in handling different data characteristics (Table 2). For example, Categorical Boosting is particularly effective for datasets with high-cardinality categorical variables, as it includes built-in techniques to encode categories without extensive preprocessing [18]. eXtreme Gradient Boosting [19] and Light Gradient-Boosting Machine [20] are known for their computational efficiency, performance in data with complex feature interactions, and native handling of missing data, but are less interpretable than LR. Deep learning, a subfield of ML, uses multilayered neural networks to simulate human decision-making [21]. While capable of learning highly complex nonlinear relationships from extremely large and high-dimensional datasets, deep learning models are generally more data-hungry, less interpretable, and require significantly more computational resources than traditional ML methods, which may limit their transparency and clinical applicability [22]. Although efforts in Explainable Artificial Intelligence (XAI) are advancing, current ML models often fall short of the level of clarity and trust required for clinical implementation [23]. On the other hand, LR is highly interpretable and performs well on small sample sizes when predictors have an approximately linear relationship with the outcome, but it may struggle with complex nonlinearities or large numbers of correlated predictors. The smaller the sample size available, the more we must rely on external information or inputs from experts to determine the features/predictors.

Table 2

Strengths and weaknesses of statistical logistic regression and machine learning in binary clinical prediction models based on tabular data.

Aspect	Statistical logistic regression	Supervised machine learning
Learning process	Theory-driven; relies on expert knowledge for model specification and candidate predictor selection	Data-driven; learns relationships directly and automatically from data
Assumptions in data structure	High (eg, interactions, linearity)	Low; handles complex, nonlinear relationships
Assumptions in model specification	High; uses default values	Low; data-driven hyperparameter tuning
User input in creation and selection of candidate predictors	High; before developing the model, researchers need to investigate nonlinearity of continuous variables and interaction effects and to identify candidate predictors through systematic review or expert opinion	Low; models automatically capture nonlinearity and interactions, so researchers do not need to investigate nonlinearity and interaction effects between variables
Flexibility	Low; constrained by linearity assumptions but can be improved by adding a penalty term	High
Complexity	Low; simple, parametric model	High
Performance on complex data	Low	High
Sample size requirement for stable performance	Low	High; data-hungry
Interpretability (in-processing decision-making process)	High; white-box nature, model coefficients are directly interpretable, can also be presented using graphical score charts or nomograms	Low; black-box nature, decision-making process is not transparent
Explainability (postprocessing explanation)	High	Low; complex to explain to end users, requires post hoc methods like Shapley Additive Explanations for explanation
Deployment ease	High	Low
Computational cost	Low	High

Therefore, the choice of the algorithm should be tailored to the structure, quality, and characteristics of the dataset. Ultimately, the development of clinical prediction models involves unavoidable trade-offs. There is no single algorithm that excels across all performance metrics (fairness, accuracy, generalizability, stability, parsimony, and interpretability). Researchers must prioritize certain metrics depending on the model’s intended application and target population. For instance, model parsimony, where a model with fewer predictors may sacrifice some accuracy for simplicity, can be crucial in enhancing user acceptance, as overly complex models may reduce usability. Additionally, discussions with stakeholders (eg, health care providers, patients) regarding the most relevant features or desired trade-offs can guide model development.

Clinical tabular datasets often exhibit characteristics that tend to favor LR over ML models [9]. These include small to moderate sample sizes, relatively high levels of noise, a limited number of candidate predictors (ie, low dimension), and typically binary outcomes (Table 3). Such conditions can constrain the ability of complex ML algorithms to demonstrate superior performance. Moreover, LR’s well-recognized interpretability and trustworthiness [24] further reinforce its widespread use in clinical prediction modelling, and it is typically used as a reference model for performance benchmarking in ML studies. However, ML approaches may warrant consideration when they demonstrate clear superiority in performance, supported by model explainability to help build trust among clinicians and end users. To date, no consensus exists on how to evaluate or compare model interpretability and explainability across different methods [23].

Table 3

Mismatches between the characteristics of clinical data and supervised machine learning’s strengths.

Aspect	Characteristics in clinical data	Supervised machine learning’s relative strength compared to statistical logistic regression	Comments
Data modality	Mostly single modal data (tabular data)	Excels with multimodal data (image, scan, text, or signal)	Clinical datasets often lack the multimodal richness that enables ML^a models to fully demonstrate their advantages
Data quality	High noise due to errors, missingness, or inconsistent measurement (low signal-to-noise ratio)	Performs better for data with high signal-to-noise ratio	Noise dilutes true signals, and ML models tend to overfit to noisy artifacts without careful data preprocessing
Sample size	Often small to moderate	Benefits from large-scale datasets; more “data-hungry”	Although sample sizes are improving in some registries, they are often insufficient to train complex ML architectures robustly
Predictors	Typically, a small set of clinically meaningful predictors with high linearity and low order of interaction terms	Excels with high-dimensional, nonlinear, temporally rich data with complex interactions	ML's strength lies in handling high-dimensional, time-series data with greater nonlinearity and higher-order interaction terms, which such predictor sets rarely contain
Prediction	Predominantly binary classification (eg, event occurrence: yes/no)	Advantage in multiclass classification and regression	Simple binary classification problems often diminish the value of the additional complexity that ML can handle

^aML: machine learning.

Policy and Research Recommendation: Shifting From Model Debate to Data Quality

Amid increasing interest in complex models, it is crucial to reorient clinical prediction modelling from a model-centric to a data-centric paradigm [25]. The quality, structure, and representativeness of data are far more critical to model performance than the complexity of models. In clinical settings, prediction models serve best as a second set of eyes, complementing clinical judgment rather than replacing it [23]. However, without high-quality data, even the most sophisticated models will propagate existing biases and limitations; as the saying goes, “garbage in, garbage out.”

As the number of clinical prediction models continues to grow, policymakers and funding bodies should prioritize investment in data quality infrastructure, including standardized phenotyping, consistent variable definitions, and robust data curation practices. Since all models are trained on historical data that inherently reflect systemic limitations, model complexity cannot resolve errors rooted in the data; in fact, complex models might amplify bias and produce unfair decisions for underrepresented or marginalized groups, such as those defined by sex, ethnicity, or deprivation [26]. In contrast, thoughtful data preprocessing and transparent reporting of modelling strategies are foundational to developing reliable, generalizable, reproducible, and trustworthy decision support tools [27]. In addition, more effort is needed to expand the candidate predictors available in health data [26], such as integrating lifestyle factors collected through wearables and a range of medical devices [28].
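As a small illustration of what routine data quality checks might look like before any model is fitted, the sketch below audits completeness and flags implausible values in a toy dataset. The variable names, records, and plausible ranges are hypothetical and would need to be replaced by clinically agreed definitions in practice.

```python
# Hypothetical patient records; None marks a missing value.
records = [
    {"age": 34, "bmi": 27.1, "systolic_bp": 118},
    {"age": 29, "bmi": None, "systolic_bp": 122},
    {"age": 41, "bmi": 31.4, "systolic_bp": 1200},  # likely a data entry error
    {"age": None, "bmi": 24.9, "systolic_bp": 110},
]

# Assumed plausible ranges, for illustration only.
plausible_ranges = {"age": (0, 120), "bmi": (10, 80), "systolic_bp": (50, 300)}


def completeness(rows, variable):
    """Proportion of records with a non-missing value for the variable."""
    present = sum(1 for row in rows if row.get(variable) is not None)
    return present / len(rows)


def out_of_range(rows, variable, low, high):
    """Indices of records whose recorded value falls outside the plausible range."""
    return [
        i for i, row in enumerate(rows)
        if row.get(variable) is not None and not (low <= row[variable] <= high)
    ]


for variable, (low, high) in plausible_ranges.items():
    share = completeness(records, variable)
    flagged = out_of_range(records, variable, low, high)
    print(f"{variable}: completeness={share:.2f}, implausible records={flagged}")
```

Summaries of this kind, reported alongside model performance, make it easier to judge whether apparent differences between algorithms reflect the models or the data they were given.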

This shift in emphasis from modelling sophistication to data stewardship is essential to ensure that clinical prediction tools genuinely enhance, rather than undermine, the quality and equity of patient-centered care.

Abbreviations

AUROC: area under the receiver operating characteristic curve

CERTIFAI: Counterfactual Explanations for Robustness, Transparency, Interpretability, and Fairness of Artificial Intelligence

LASSO: least absolute shrinkage and selection operator

LR: logistic regression

ML: machine learning

SHAP: Shapley Additive Explanations

SP-LIME: Submodular Pick Local Interpretable Model-agnostic Explanations

XAI: Explainable Artificial Intelligence

Conflicts of Interest: None declared.

References

1. Christodoulou, Collins, Steyerberg, Verbakel, Van Calster. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019 Jun;110:12-22. doi: 10.1016/j.jclinepi.2019.02.004. PMID: 30763612.

2. Chakraborty, Bhattacharya, Pal, Lee. From machine learning to deep learning: advances of the recent data-driven paradigm shift in medicine and healthcare. Current Research in Biotechnology. 2024;7:100164. doi: 10.1016/j.crbiot.2023.100164.

3. Mann, Lyons, O'Rourke, Davies. Machine learning or traditional statistical methods for predictive modelling in perioperative medicine: a narrative review. J Clin Anesth. 2025 Mar;102:111782. doi: 10.1016/j.jclinane.2025.111782. PMID: 39977974.

4. Breiman. Statistical modeling: the two cultures (with comments and a rejoinder by the author). Statist Sci. 2001 Aug 1;16(3):199-231. doi: 10.1214/ss/1009213726.

5. Vickers, Elkin. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making. 2006 Nov 1;26(6):565-574. doi: 10.1177/0272989x06295361.

6. Vickers, van Calster, Steyerberg. A simple, step-by-step guide to interpreting decision curve analysis. Diagn Progn Res. 2019;3(1):18. doi: 10.1186/s41512-019-0064-7. PMID: 31592444. PMCID: PMC6777022.

7. Riley, Collins. Stability of clinical prediction models developed using statistical or machine learning methods. Biom J. 2023 Dec;65(8):e2200302. doi: 10.1002/bimj.202200302. PMID: 37466257. PMCID: PMC10952221.

8. Riley, Pate, Dhiman, Archer, Martin, Collins. Clinical prediction models and the multiverse of madness. BMC Med. 2023 Dec 18;21(1):502. doi: 10.1186/s12916-023-03212-y. PMID: 38110939. PMCID: PMC10729337.

9. Riley, Ensor, Snell KIE, Harrell, Martin, Reitsma, Moons KGM, Collins, van Smeden. Calculating the sample size required for developing a clinical prediction model. BMJ. 2020 Mar 18;368:m441. doi: 10.1136/bmj.m441. PMID: 32188600.

10. Silvey, Liu. Sample size requirements for popular classification algorithms in tabular clinical data: empirical study. J Med Internet Res. 2024 Dec 17;26:e60231. doi: 10.2196/60231. PMID: 39689306. PMCID: PMC11688588.

11. Dhiman P, Ma J, Qi C, Bullock G, Sergeant JC, Riley RD, Collins GS. Sample size requirements are not being considered in studies developing prediction models for binary outcomes: a systematic review. BMC Med Res Methodol. 2023 Aug 19;23(1):188. doi: 10.1186/s12874-023-02008-1. PMID: 37598153. PMCID: PMC10439652.

12. Lundberg SM, Lee SI. A unified approach to interpreting model predictions. ArXiv. Preprint posted online on November 25, 2017. doi: 10.48550/arXiv.1705.07874.

13. Ribeiro MT, Singh S, Guestrin C. "Why should I trust you?": explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; August 13, 2016; San Francisco, California, USA. p. 1135-1144. doi: 10.1145/2939672.2939778.

14. Sharma S, Henderson J, Ghosh J. CERTIFAI: a common framework to provide explanations and analyse the fairness and robustness of black-box models. In: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society; February 7, 2020; New York, NY, USA. p. 166-172. doi: 10.1145/3375627.3375812.

15. Wolpert. The supervised learning no-free-lunch theorems. In: Soft Computing and Industry. London: Springer; 2002. p. 25-42.

16. Fernández-Delgado M, Cernadas E, Barro S, Amorim D. Do we need hundreds of classifiers to solve real world classification problems? JMLR. 2014. URL: https://jmlr.org/papers/volume15/delgado14a/delgado14a.pdf [accessed 2025-11-03].

17. Mohammed, Budach, Feuerpfeil, Ihde, Nathansen, Noack, Patzlaff, Naumann, Harmouch. The effects of data quality on machine learning performance on tabular data. Information Systems. 2025 Jul;132:102549. doi: 10.1016/j.is.2025.102549.

18. Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. CatBoost: unbiased boosting with categorical features. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems; December 3, 2018; Montréal, Canada. p. 6639-6649.

19. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; August 13, 2016; San Francisco, California, USA. p. 785-794. doi: 10.1145/2939672.2939785.

20. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu TY. LightGBM: a highly efficient gradient boosting decision tree. In: Proceedings of the 31st International Conference on Neural Information Processing Systems; December 4, 2017; Long Beach, California, USA. p. 3149-3157.

21. Choi, Coyner AS, Kalpathy-Cramer J, Chiang MF, Campbell JP. Introduction to machine learning, neural networks, and deep learning. Transl Vis Sci Technol. 2020 Mar 27;9(2):14. doi: 10.1167/tvst.9.2.14. PMID: 32704420. PMCID: PMC7347027.

22. Shwartz-Ziv, Armon. Tabular data: deep learning is not all you need. Information Fusion. 2022 May;81:84-90. doi: 10.1016/j.inffus.2021.11.011.

23. Antoniadi, Guendouz, Wei, Mazo, Becker, Mooney. Current challenges and future opportunities for XAI in machine learning-based clinical decision support systems: a systematic review. Applied Sciences. 2021 May 31;11(11):5088. doi: 10.3390/app11115088.

24. Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Mach Intell. 2019 May;1(5):206-215. doi: 10.1038/s42256-019-0048-x. PMID: 35603010. PMCID: PMC9122117.

25. Zha, Bhat, Lai, Yang, Jiang, Zhong. Data-centric artificial intelligence: a survey. ACM Comput Surv. 2025 Jan 24;57(5):1-42. doi: 10.1145/3711118.

26. Johnson, Wei, Weeraratne, Frisse, Misulis, Rhee, Zhao, Snowdon. Precision medicine, AI, and the future of personalized health care. Clin Transl Sci. 2021 Jan;14(1):86-93. doi: 10.1111/cts.12884. PMID: 32961010. PMCID: PMC7877825.

27. Collins GS, Moons KGM, Dhiman P, Riley RD, Beam AL, Van Calster B, Ghassemi M, Liu X, Reitsma JB, van Smeden M, Boulesteix AL, Camaradou JC, Celi LA, Denaxas S, Denniston AK, Glocker B, Golub RM, Harvey H, Heinze G, Hoffman MM, Kengne AP, Lam E, Lee N, Loder EW, Maier-Hein L, Mateen BA, McCradden MD, Oakden-Rayner L, Ordish J, Parnell R, Rose S, Singh K, Wynants L, Logullo P. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ. 2024 Apr 16;385:e078378. doi: 10.1136/bmj-2023-078378. PMID: 38626948. PMCID: PMC11019967.

28. Rudrapatna, Butte AJ. Opportunities and challenges in using real-world data for health care. J Clin Invest. 2020 Mar 3;130(2):565-574. doi: 10.1172/JCI129197. PMID: 32011317. PMCID: PMC6994109.