This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Supervised machine learning (ML) is being featured in the health care literature, with study results frequently reported using metrics such as accuracy, sensitivity, specificity, recall, or F1 score. Although each metric provides a different perspective on performance, all are overall measures for the whole sample, discounting the uniqueness of each case or patient. Intuitively, we know that not all cases are equal, but present evaluative approaches do not take case difficulty into account.
A more case-based, comprehensive approach is warranted to assess supervised ML outcomes and forms the rationale for this study. This study aims to demonstrate how item response theory (IRT) can be used to stratify the data based on how difficult each case is to classify.
Two large, public intensive care unit data sets, Medical Information Mart for Intensive Care III and electronic intensive care unit, were used to showcase this method in predicting mortality. For each data set, a balanced sample (n=8078 and n=21,940, respectively) and an imbalanced sample (n=12,117 and n=32,910, respectively) were drawn. A 2-parameter logistic model was used to provide scores for each case. Several ML algorithms were used in the demonstration to classify cases based on their health-related features: logistic regression, linear discriminant analysis, K-nearest neighbors, decision tree, naive Bayes, and a neural network. Generalized linear mixed model analyses were used to assess the effects of case difficulty strata, ML algorithm, and the interaction between them in predicting accuracy.
The results showed significant effects (
This demonstration shows that using IRT is a viable method for understanding the data that are provided to ML algorithms, independent of outcome measures, and highlights how well classifiers differentiate cases of varying difficulty. This method explains which features are indicative of healthy states and why. It enables end users to select the classifier appropriate to the difficulty level of the patient for personalized medicine.
This study aims to demonstrate an approach to assess the effectiveness of binary machine learning (ML) classification, which is an alternative to the more traditional single scalar measures in the literature. Our approach uses an item response theory (IRT) model to enhance the understanding of the data set on which ML protocols are run as well as the results of the classification outcomes. Aspects of IRT’s utility have recently surfaced in the ML literature, including comparisons of collaborative filtering [
The varied and numerous contexts (eg, business, finance, medicine, home, government agencies) in which ML is being used are nothing short of staggering [
Despite the advances in metric development, there is an interest in developing more extensive descriptions of ML classification outcomes. For example, it has been argued that “... any single scalar measure has significant limitations” and “that such measures ... oversimplify complex questions and combine things that should be kept separate” [
These comments are consistent with general calls for a fuller explanation regarding the interpretability of ML studies [
The ML criticism of the lack of attention to the unique characteristics of the individual cases is the focus of this study. We propose to address this challenge using a more comprehensive, case-nuanced approach. Although there has been some work in this regard, such as the now accepted wisdom that standard classifiers do not work well with imbalanced data [
This lack of attention is highlighted when various ML models are assessed: they often produce comparable outcomes, with similar percentages of cases misclassified regardless of the model used [
There are 2 fundamental building blocks to any ML system: the features of interest and the cases in the data set. To investigate the research question in this study, methods derived from IRT were employed as they simultaneously estimate the characteristics of both features and cases. Understanding this phenomenon allows medical professionals to tailor the classifier to the patient.
Following an examination, there are discussions by students about the test items; often they remark, “that question was hard, what did you put?” or “that was an easy question.” Such comments reflect the purposeful construction of the test items. Some items are designed to be relatively easy to pass, whereas others are designed to be more difficult such that only a few can pass. Similarly, students talk about the test takers: “she always gets the highest score in the class” or “I think I have a 50-50 chance of passing.” Test takers are quite cognizant of the fact that not all test items are created equal and that not all test takers have the same ability. These fundamental assumptions give rise to IRT, where the characteristics of the items and of the students are modeled together, providing a clearer picture about which items discriminate between which test takers.
A parallel can be drawn between a set of students passing or failing a test based on their performance on a set of items and a set of patients being classified into 1 of 2 categories (alive or not alive) based on their scores on a set of health-related features. Using the
As not all test takers score 100% or 0% on an examination, some combination of right and wrong answers to questions provides an index of individual test-taker ability in completing the test. The term ability (symbolized by theta, θ) is used in the psychometric literature where IRT evolved and describes any latent construct of interest being measured. In this study, within-range or out-of-range laboratory values and vital signs as well as demographic information comprise the features in our data sets. Thus, we can ascertain a case’s placement with respect to the underlying distribution of unhealthiness.
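The feature scoring described above can be sketched in a few lines. The helper and the dictionary of ranges below are illustrative, not the authors' code; the normal ranges echo those in the feature tables.

```python
def dichotomize(value, low, high):
    """Score a lab value or vital sign against its normal range:
    0 = within range ("healthy"), 1 = out of range ("unhealthy")."""
    return 0 if low <= value <= high else 1

# Illustrative normal ranges (mirroring the feature tables); units omitted.
NORMAL_RANGES = {
    "na_min": (135, 145),     # sodium, mmol/L
    "k_min": (3.5, 5.0),      # potassium, mmol/L
    "heart_rate": (60, 100),  # mean heart rate, bpm
}

# A hypothetical patient: low sodium, normal potassium, tachycardic.
patient = {"na_min": 128, "k_min": 4.1, "heart_rate": 112}
scores = {f: dichotomize(v, *NORMAL_RANGES[f]) for f, v in patient.items()}
# scores -> {"na_min": 1, "k_min": 0, "heart_rate": 1}
```

The resulting vector of 0s and 1s plays the role of a test taker's right and wrong answers in the IRT analysis.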
The process of generating CDIs on the unhealthiness continuum will be carried out without using the outcome variable of mortality itself; that is, IRT provides case-based classification difficulty index (CDI) scores that can be examined before the data as a collective are subjected to an ML protocol.
The IRT analysis provides case-based CDIs using a set of feature characteristics that do not use the information on the outcome classification variable. CDIs for the sample are generated along the normal distribution, with a mean of 0.0 (SD 1.0). It is hypothesized that cases with more centrally located CDIs will be less likely to be classified correctly, whereas cases with more peripherally located CDIs will be more likely to be classified correctly. One research question is as follows: Will some ML classifiers be more accurate in classifying cases at all CDIs? Another research question is as follows: Will some ML classifiers be more accurate than others in classifying cases at different CDIs? Identifying these cases
Data were obtained through 2 large, freely available data sets. One was the MIMIC-III (Medical Information Mart for Intensive Care III) database housing health data of >40,000 critical care unit patients at the Beth Israel Deaconess Medical Center admitted between 2001 and 2012 [
Databases were queried using the SQL plugin for Python (Python Software Foundation). Case inclusion criteria were as follows: (1) age ≥16 years, (2) at least three-fourths of the features of interest available for a given case (patient), with missing values subsequently imputed, and (3) first hospital visit in the case of repeated patients. Features of predictive interest were selected based on 2 common severity of illness scores: Simplified Acute Physiology Score II and Acute Physiology and Chronic Health Evaluation IV for MIMIC-III and eICU, respectively. To test the hypothesis with both balanced and imbalanced data sets, the number of
For the MIMIC-III data set, there were 4039 cases that experienced
The features included demographic information, procedures, preexisting conditions, and laboratory values (
Medical Information Mart for Intensive Care III variables based on Simplified Acute Physiology Score II.
Feature name  Description  Normal values, units
AIDS  Preexisting diagnosis  Absent: 0, 0 or 1
Heme malignancy  Preexisting diagnosis  Absent: 0, 0 or 1
Metastatic cancer  Preexisting diagnosis  Absent: 0, 0 or 1
Minimum GCS^{a}  Glasgow Coma Scale  15^{b}, 1–15
WBC^{c} minimum  Lowest white blood cell  4–10, 10^{9}/L
WBC maximum  Highest white blood cell  4–10, 10^{9}/L
Na minimum  Sodium minimum  135–145, mmol/L
Na maximum  Sodium maximum  135–145, mmol/L
K minimum  Potassium minimum  3.5–5, mmol/L
K maximum  Potassium maximum  3.5–5, mmol/L
Bilirubin maximum  Bilirubin maximum  ≤1.5–2, mg/dL
HCO_{3} minimum  Bicarbonate minimum  24–30, mmol/L
HCO_{3} maximum  Bicarbonate maximum  24–30, mmol/L
BUN^{d} minimum  Blood urea nitrogen minimum  7–22, mg/dL
BUN maximum  Blood urea nitrogen maximum  7–22, mg/dL
PO_{2}  Partial pressure of oxygen  85–105, mm Hg
FiO_{2}  Fraction of inspired oxygen  21, %
Heart rate mean  Mean heart rate  60–100, bpm
BP mean  Mean systolic blood pressure  95–145, mm Hg
Max temp  Maximum temperature  36.5–37.5, ℃
Urine output  Urine output  800–2000^{e}, mL/24 h
Sex  Male or female  Male: 1; female: 0
Age  Age in years  ≤65: 0, years
Admission type  Emergency or elective  Emergency: 1; else: 0, N/A^{f}
^{a}GCS: Glasgow Coma Scale.
^{b}Teasdale and Jennett, 1974 [
^{c}WBC: white blood cell.
^{d}BUN: blood urea nitrogen.
^{e}Medical CMP, 2011 [
^{f}N/A: not applicable.
Electronic intensive care unit data set variables based on Acute Physiology and Chronic Health Evaluation IV.
Feature name  Description  Normal values, units
GCS^{a}  Glasgow Coma Scale  15^{b}, 1–15
Urine output  Urine output in 24 hours  800–2000^{c}, mL/24 hour
WBC^{d}  White blood cell count  4–10, 10^{9}/L
Na  Serum sodium  135–145, mmol/L
Temperature  Temperature in Celsius  36.5–37.5^{e}, ℃
Respiration rate  Breaths per minute  12–20^{f}, breaths/min
Heart rate  Heart rate/min  60–100^{f}, bpm
Mean blood pressure  Mean arterial pressure  70–100^{g}, mm Hg
Creatinine  Serum creatinine  0.57–1.02 (F^{h}); 0.79–1.36 (M^{i}), mEq/L
pH  Arterial pH  7.35–7.45, N/A^{j}
Hematocrit  Red blood cell volume  37–46 (F); 38–50 (M), %
Albumin  Serum albumin  3.5–5.0, g/dL
PO_{2}  Partial pressure of oxygen  85–105, mm Hg
PCO_{2}  Partial pressure carbon dioxide  35–45, mm Hg
BUN^{k}  Blood urea nitrogen maximum  7–22, mg/dL
Glucose  Blood sugar level  68–200, mg/dL
Bili  Serum bilirubin  ≤1.5–2, mg/dL
FiO_{2}  Fraction of inspired oxygen  21^{l}, %
Sex  Male or female  Male: 1; female: 0, M or F 
Age  Age in years  ≤65: 0, years 
Leukemia  Preexisting diagnosis  Absent: 0, 0 or 1 
Lymphoma  Preexisting diagnosis  Absent: 0, 0 or 1 
Cirrhosis  Preexisting diagnosis  Absent: 0, 0 or 1 
Hepatic failure  Preexisting diagnosis  Absent: 0, 0 or 1 
Metastatic cancer  Preexisting diagnosis  Absent: 0, 0 or 1 
AIDS  Preexisting diagnosis  Absent: 0, 0 or 1 
Thrombolytics  Medical intervention  Absent: 0, 0 or 1 
Ventilator  Medical intervention  Absent: 0, 0 or 1 
Dialysis  Medical intervention  Absent: 0, 0 or 1 
Immunosuppressed  Medical intervention  Absent: 0, 0 or 1 
Elective surgery  Medical intervention  Absent: 0, 0 or 1 
^{a}GCS: Glasgow Coma Scale.
^{b}Teasdale and Jennett, 1974 [
^{c}Medical CMP, 2011 [
^{d}WBC: white blood cell.
^{e}Lapum et al. 2018 [
^{f}MDCalc [
^{g}Healthline [
^{h}F: female.
^{i}M: male.
^{j}N/A: not applicable.
^{k}BUN: blood urea nitrogen.
^{l}eICU Collaborative Research Database [
Using the IRTPRO (Scientific Software International) program, a 2-parameter logistic model (2PL) was run on the dichotomous data. The program uses a marginal maximum likelihood estimation procedure to calculate feature and case parameters [
Equation 1 shows a 2PL model in IRT; slope (a_{i}) captures the
Characteristic curve using a 2-parameter logistic model.
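The body of equation 1 did not survive extraction; the standard 2PL item characteristic function, consistent with the slope and location parameters reported later, is:

```latex
P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}
```

where θ is the latent trait (here, the unhealthiness continuum on which CDIs are placed), a_i is the slope (discrimination) of feature i, and b_i is its location (difficulty).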
CDI estimation in a 2PL model is calculated based on equation 2, where the probability of obtaining the correct answer is based on the item scores u_{i} weighted by the slopes a_{i}.
Equation 3, where u_{i} ∈ (0, 1) is the score on item i, is called the likelihood function. It is the probability of a response pattern given the CDIs and the item parameters across cases. There is 1 likelihood function for each response pattern, and the sum of all such functions equals 1 at any value of the distribution. On the basis of the pattern of each case’s values on the features, the program uses a Bayesian estimation process that provides a CDI on the unhealthiness continuum for each case in the data set.
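The likelihood of equation 3 and the Bayesian scoring step can be sketched in a few lines; the item parameters below are hypothetical, the quadrature is deliberately simple, and IRTPRO's actual estimator differs in detail.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a 1-response ("out of range") at trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def eap_cdi(responses, items, grid_points=81, lo=-4.0, hi=4.0):
    """Expected a posteriori (EAP) estimate of theta (the CDI) for one response pattern.

    responses: list of 0/1 feature scores; items: list of (slope a, location b).
    Prior: standard normal, integrated by simple quadrature on a grid.
    """
    step = (hi - lo) / (grid_points - 1)
    num = den = 0.0
    for k in range(grid_points):
        theta = lo + k * step
        prior = math.exp(-0.5 * theta * theta)
        # Likelihood of this response pattern at theta (equation 3's product form).
        like = 1.0
        for u, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            like *= p if u == 1 else (1.0 - p)
        w = prior * like
        num += theta * w
        den += w
    return num / den

# Hypothetical 3-feature example: an all-out-of-range pattern should pull the
# CDI toward the unhealthy (positive) end, an all-in-range pattern the other way.
items = [(1.5, 0.0), (0.8, -0.5), (2.0, 0.5)]
cdi_unhealthy = eap_cdi([1, 1, 1], items)
cdi_healthy = eap_cdi([0, 0, 0], items)
```

With these toy parameters, `cdi_unhealthy` lands well above 0 and `cdi_healthy` well below it, which is the behavior the hypothesis relies on.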
CDIs are reported on the standard normal distribution and typically range between −2.50 and +2.50. Each case’s CDI has its own individual SE around it based on the individual’s pattern of results across all features and their unique characteristics. Using the results from the 2PL model, it was possible to identify which of the cases were more centrally or more peripherally located on the distribution and thus would be less or more likely to be accurately classified into their respective categories (no death or death).
To allow for easy visualization and testing of effects, several strata bins were created into which continuous IRT CDIs could be assigned. These
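The stratification can be sketched as snapping each continuous CDI to the nearest 0.5-wide bin center; the bin width is an inference from the strata labels reported in the results tables, not a rule stated here.

```python
def cdi_bin(cdi, width=0.5):
    """Snap a continuous CDI to the nearest stratum center (a multiple of `width`)."""
    return round(cdi / width) * width

# A case with CDI 0.74 falls into the 0.5 stratum; -1.3 into the -1.5 stratum.
examples = [cdi_bin(0.74), cdi_bin(-1.3), cdi_bin(1.1)]
```

Counting cases per returned bin center then reproduces the "Number of cases" column of the stratified accuracy tables.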
Multiple ML algorithms were tested using the original feature values for both the MIMIC-III and eICU data sets. These included logistic regression, linear discriminant analysis, K-nearest neighbors, decision tree, naive Bayes, and a neural network. Both the K-nearest neighbors and the neural network had their hyperparameters optimized by a grid search. In the case of the K-nearest neighbors, the search grid included K from 1 to 40 and the distance methods Minkowski, Hamming, and Manhattan. The grid investigated for the neural network included the activation functions softmax, softplus, softsign, relu, tanh, sigmoid, and hard sigmoid; learning rates of 0.001, 0.01, 0.1, 0.2, and 0.3; and 1, 5, 10, 15, 20, 25, or 30 hidden neurons in a single hidden layer. For each of these methods, a 10-fold cross-validation was performed, and the numerical prediction was extracted for each case and then reassociated with its subject ID number for graphical plotting. The evaluation metrics accuracy, precision, recall, F1 score, and AUC were calculated. Accuracy was used to assess the hypotheses and research questions.
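The K-nearest neighbors search described above amounts to enumerating a grid and keeping the best-scoring configuration. In this sketch, `cv_accuracy` is a hypothetical stand-in for the 10-fold cross-validated accuracy (its toy surface is rigged, purely for illustration, to peak at the paper's reported MIMIC-III optimum); it is not the authors' pipeline.

```python
from itertools import product

# Search space mirroring the text: K from 1 to 40, three distance metrics.
K_VALUES = range(1, 41)
METRICS = ["minkowski", "hamming", "manhattan"]

def cv_accuracy(k, metric):
    """Hypothetical stand-in: the real pipeline would run 10-fold
    cross-validation of a K-nearest-neighbors classifier here and
    return its mean accuracy for (k, metric)."""
    bonus = {"minkowski": 0.00, "hamming": 0.01, "manhattan": 0.02}[metric]
    return 0.70 - abs(k - 27) * 0.001 + bonus

# Exhaustive enumeration of the grid, keeping the argmax configuration.
best = max(product(K_VALUES, METRICS), key=lambda kv: cv_accuracy(*kv))
# best -> (27, "manhattan") for this toy surface
```

The same enumerate-and-argmax pattern extends to the neural network grid (activation × learning rate × hidden neurons).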
To test the main effects of CDI and the repeated measure of the ML classifier as well as their interaction on each case’s accuracy score (0,1), generalized linear mixed model (GLMM) [
In equation 5, g(µ) is the logit link function that defines the relationship between the mean response µ and the linear combination of predictors. X represents the fixed effects matrix, and Z is the random effects matrix; the remaining term is simply an offset to the model.
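Equation 5 itself did not survive extraction; a standard GLMM specification consistent with the description (logit link, fixed effects X, random effects Z, and an offset o) is:

```latex
g(\mu) = \log\!\left(\frac{\mu}{1-\mu}\right) = X\beta + Zu + o
```

Here β denotes the fixed effect coefficients and u the random effects; the symbol names are conventional and not necessarily those used in the original.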
The models specified that (1) all effects are fixed, (2) the dependent variable follows a binomial distribution, and thus the predictors and criterion are linked via a logit function, (3) the residual covariance matrix for the repeated measure (ML classifier) is diagonal, and (4) the reference category was set to 0. Follow-up paired-comparison tests on the estimated marginal and cell means used a
Descriptive results of case CDIs are shown in
It should be noted that the 2 data sets have different distributions, and this fingerprint is inherently unique to the data set processed.
Item response theory case classification difficulty index results.
Data set  CDI^{a} range  Overall, mean (SD)  Point-biserial correlation^{b}  P value  No death, mean (SD)  Death, mean (SD)  t test (df)^{c}  P value
MIMIC-III^{d} balanced  −1.81 to +2.16  0.00 (0.85)  0.37  <.001  −0.32 (0.79)  0.32 (0.80)  35.76 (8077)  <.001
MIMIC-III imbalanced  −1.70 to +2.27  0.00 (0.85)  0.35  <.001  −0.21 (0.80)  0.42 (0.80)  40.88 (12116)  <.001
eICU^{e} balanced  −2.63 to +2.83  0.00 (0.80)  0.50  <.001  −0.40 (0.73)  0.40 (0.64)  86.18 (21939)  <.001
eICU imbalanced  −2.55 to +2.93  0.00 (0.81)  0.51  <.001  −0.29 (0.73)  0.59 (0.61)  109.09 (32909)  <.001
^{a}CDI: classification difficulty index.
^{b}Between CDI and outcome (no death or death).
^{c}Difference between no death and death means.
^{d}MIMIC-III: Medical Information Mart for Intensive Care III.
^{e}eICU: electronic intensive care unit.
Classification Difficulty Indexes in MIMIC-III (A) balanced and (B) imbalanced data. CDI: classification difficulty index; MIMIC: Medical Information Mart for Intensive Care.
Classification Difficulty Indexes in eICU (A) balanced and (B) imbalanced data. eICU: electronic Intensive Care Unit; DT: decision tree; KNN: Knearest neighbors; LDA: linear discriminant analysis; LR: logistic regression; NB: naive Bayes; NN: neural network.
Using the feature parameter estimates and case CDI, the unique differentiating capacity for each feature can be depicted by calculating the probability of each case falling into the 0 (no death) or 1 (death) categories. For example, the slope and location parameters for the blood urea nitrogen (BUN) minimum and urine output for the 2 MIMICIII data sets are shown in
Medical Information Mart for Intensive Care III feature parameters.
Feature parameters  Slope  Location
Balanced
Blood urea nitrogen (minimum)  5.64  0.09
Urine output  0.15  −2.23
Imbalanced
Blood urea nitrogen (minimum)  5.22  0.02
Urine output  0.09  −3.59
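The contrast between a discriminating and a non-discriminating feature can be made concrete by plugging the balanced MIMIC-III parameters above into the 2PL curve; the probabilities below follow directly from those published slopes and locations, though the helper itself is illustrative.

```python
import math

def p2pl(theta, a, b):
    """2PL probability that a case at trait level theta scores 1 on the feature."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Balanced MIMIC-III parameters from the table above.
bun_a, bun_b = 5.64, 0.09       # steep slope: sharply separates cases near b
urine_a, urine_b = 0.15, -2.23  # near-flat slope: barely separates anyone

# Evaluate half a CDI unit on either side of the BUN location:
low, high = p2pl(-0.5, bun_a, bun_b), p2pl(0.5, bun_a, bun_b)   # ~0.03 vs ~0.91
flat_low = p2pl(-0.5, urine_a, urine_b)
flat_high = p2pl(0.5, urine_a, urine_b)
# BUN swings from near 0 to near 1 across one unit of CDI;
# urine output hardly moves, so it contributes little to separating cases.
```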
Similar to the MIMIC-III results, the IRT analyses of the eICU showed that BUN was a highly discriminating feature whereas urine output was not (
Electronic intensive care unit feature parameters.
Feature parameter  Slope  Location
Balanced
Blood urea nitrogen (minimum)  1.55  −0.33
Urine output  0.04  −1.19
Imbalanced
Blood urea nitrogen (minimum)  1.49  −0.1
Urine output  0.03  −1.39
The K-nearest neighbors grid search selected Manhattan distance with 27 nearest neighbors for MIMIC-III and Manhattan distance with 19 neighbors for eICU. The neural network grid search returned an optimal learning rate of 0.001, the softmax activation function, and a single hidden layer of 15 hidden nodes for MIMIC-III and 17 for eICU.
Traditional metrics of accuracy, precision, recall, F1, and AUC are presented for MIMIC-III in
Medical Information Mart for Intensive Care III classification performance in traditional metrics.
Metric  LR^{a} (%)  LDA^{b} (%)  KNN^{c} (%)  DT^{d} (%)  NB^{e} (%)  NN^{f} (%)
Balanced
Accuracy  75.3  75.0  67.2  70.9  70.4  76.1
Precision  75.8  75.6  69.3  71.1  79.5  75.6
Recall  74.3  73.8  61.8  70.6  54.9  77.2
F1  75.0  74.7  65.3  70.8  64.9  76.4
AUC^{g}  75.3  75.0  67.2  70.9  70.4  76.5
Imbalanced
Accuracy  78.3  77.9  72.8  73.7  75.3  80.5
Precision  73.3  73.8  63.1  60.6  67.7  72.7
Recall  54.8  52.1  44.4  60.6  49.6  66.6
F1  62.7  61.1  52.2  60.6  57.3  69.5
AUC  72.4  71.4  65.7  70.9  68.9  76.9
^{a}LR: logistic regression.
^{b}LDA: linear discriminant analysis.
^{c}KNN: Knearest neighbor.
^{d}DT: decision tree.
^{e}NB: naive Bayes.
^{f}NN: neural network.
^{g}AUC: area under the curve.
In both the balanced and imbalanced MIMIC-III data sets, the neural network outperformed the other classifiers (balanced accuracy: 76.1%; imbalanced accuracy: 80.5%) using traditional metrics. It is worth highlighting the effect of an imbalanced data set: accuracy increases whereas precision, recall, and F1 score decrease.
In both the balanced and the imbalanced eICU data sets (
Item response theory–based Medical Information Mart for Intensive Care III mortality prediction accuracy stratified by classification difficulty index.
Number of cases  CDI^{a}  LR^{b} (%)  LDA^{c} (%)  KNN^{d} (%)  DT^{e} (%)  NB^{f} (%)  NN^{g} (%)
Balanced
1  2.5  100.0  100.0  100.0  100.0  100.0  100.0
13  2.0  92.3  92.3  84.6  92.3  92.3  92.3
316  1.5  90.2  88.2  80.4  80.4  89.2  88.3
1884  1.0  75.6  74.9  68.2  68.8  68.4  77.0
1321  0.5  70.5  70.6  63.5  65.9  65.4  71.1
952  0.0  72.0  72.4  62.8  68.8  66.2  73.9
1346  −0.5  70.9  70.6  60.4  67.1  63.7  72.1
1955  −1.0  77.0  77.1  70.9  75.4  75.2  78.3
288  −1.5  94.8  94.8  83.3  91.0  95.5  94.5
3  −2.0  100.0  100.0  100.0  100.0  100.0  100.0
Imbalanced
1  2.5  100.0  100.0  100.0  100.0  100.0  100.0
30  2.0  93.3  93.3  76.7  73.3  93.3  93.3
571  1.5  77.4  75.7  64.1  71.1  77.4  78.3
1886  1.0  70.6  70.3  63.9  65.0  64.6  73.3
1537  0.5  76.3  75.5  67.3  71.2  72.7  79.7
1251  0.0  78.7  78.0  75.6  74.5  76.8  80.3
2794  −0.5  75.0  74.5  71.0  72.1  72.3  78.4
2722  −1.0  88.3  88.3  85.0  83.3  87.1  89.1
325  −1.5  99.1  99.1  96.6  98.2  99.1  98.8
^{a}CDI: classification difficulty index.
^{b}LR: logistic regression.
^{c}LDA: linear discriminant analysis.
^{d}KNN: Knearest neighbor.
^{e}DT: decision tree.
^{f}NB: naive Bayes.
^{g}NN: neural network.
Electronic intensive care unit classification performance in traditional metrics.
Metric  LR^{a} (%)  LDA^{b} (%)  KNN^{c} (%)  DT^{d} (%)  NB^{e} (%)  NN^{f} (%)
Balanced
Accuracy  77.9  77.4  67.2  76.7  66.6  84.7
Precision  77.9  78.1  67.9  76.7  73.7  84.5
Recall  77.9  76.3  65.3  76.8  51.6  84.9
F1  77.8  77.2  66.6  76.7  60.7  84.7
AUC^{g}  77.9  77.4  67.2  77.1  66.6  85.9
Imbalanced
Accuracy  78.0  80.1  73.6  81.6  73.3  89.5
Precision  73.6  75.1  64.1  72.1  62.0  84.7
Recall  62.1  60.2  47.2  72.9  51.5  83.5
F1  67.4  66.8  54.4  72.5  56.3  84.1
AUC  75.5  75.1  67.0  79.3  67.9  87.8
^{a}LR: logistic regression.
^{b}LDA: linear discriminant analysis.
^{c}KNN: Knearest neighbor.
^{d}DT: decision tree.
^{e}NB: naive Bayes.
^{f}NN: neural network.
^{g}AUC: area under the curve.
Item response theory–based electronic intensive care unit mortality prediction accuracy stratified by classification difficulty index.
Number of cases  CDI^{a}  LR^{b} (%)  LDA^{c} (%)  KNN^{d} (%)  DT^{e} (%)  NB^{f} (%)  NN^{g} (%)
Balanced
2  3.0  100.0  100.0  100.0  50.0  100.0  100.0
61  2.5  82.0  82.0  75.4  78.7  86.9  85.2
160  2.0  81.3  82.5  75.0  76.3  81.9  83.4
621  1.5  86.2  86.8  74.5  79.2  83.7  87.9
3167  1.0  83.7  82.9  72.1  78.3  66.3  85.4
4998  0.5  74.0  72.7  64.7  73.1  55.2  80.9
4776  0.0  70.9  70.1  58.5  71.5  57.3  80.0
3864  −0.5  73.8  74.5  63.3  74.4  67.4  84.3
2858  −1.0  85.4  85.5  74.4  84.8  83.1  91.8
1183  −1.5  92.5  92.6  84.3  91.7  91.9  96.4
240  −2.0  97.1  97.1  91.7  95.8  96.3  97.9
10  −2.5  100.0  100.0  100.0  100.0  100.0  100.0
Imbalanced
6  3.0  66.7  83.3  83.3  66.6  66.6  83.3
58  2.5  82.8  81.0  69.0  75.9  87.9  84.5
215  2.0  79.1  78.6  67.0  72.6  76.3  82.3
1369  1.5  79.8  79.0  65.4  75.2  72.8  85.7
4776  1.0  72.2  72.4  61.6  74.8  58.4  83.9
6657  0.5  67.3  67.0  72.1  57.3  57.3  83.1
7068  0.0  76.4  76.9  70.0  78.8  70.3  88.5
6396  −0.5  87.1  87.3  83.2  87.3  83.4  93.7
4265  −1.0  94.8  95.0  92.0  94.3  92.7  97.7
1763  −1.5  98.0  98.0  97.1  97.9  97.3  99.4
317  −2.0  99.1  99.1  98.4  98.4  98.4  99.1
20  −2.5  100.0  100.0  100.0  100.0  100.0  100.0
^{a}CDI: classification difficulty index.
^{b}LR: logistic regression.
^{c}LDA: linear discriminant analysis.
^{d}KNN: Knearest neighbor.
^{e}DT: decision tree.
^{f}NB: naive Bayes.
^{g}NN: neural network.
The CDI group sizes at the extreme ends were too small and were collapsed into the next level down for each data set. Tests of the effects for MIMIC-III are reported in
The MIMIC-III balanced data showed significantly better accuracies for the more peripheral than central CDI bins. K-nearest neighbors and decision tree were the poorest classifiers. Although there was a small significant interaction effect, by and large, the main effects were borne out.
Tests of the effects of classification difficulty index, classifier, and their interaction for the Medical Information Mart for Intensive Care III data set.
Effect  Test statistic (df)  P value  Significant paired comparisons
Balanced
CDI^{a}  123 (6,48456)  <.001  −1.5 vs −1.0, −0.5, 0.0; −1.0 vs −0.5, 0.0; +1.0 vs +0.5, 0.0; +1.5 vs +1.0, +0.5, 0.0
ML^{b} classifier  52 (5,48456)  <.001  LR^{c}, LDA^{d}, NB^{e}, NN^{f} vs KNN^{g}, DT^{h}; DT vs KNN
CDI×ML classifier  2 (30,48456)  <.001  −1.5: LR, LDA, NB, NN, DT vs KNN; −1.0: LR, LDA, NB, NN, DT vs KNN; −0.5: LR, LDA, DT, NN vs NB, KNN; 0.0: LR, LDA, DT, NN vs NB, KNN; +0.5: LR, LDA, NN vs NB, KNN, DT; +1.0: LR, LDA, NN vs NB, KNN, DT; +1.5: LR, LDA, NB, NN vs KNN, DT
Imbalanced
CDI  314 (6,72660)  <.001  −1.5 vs −1.0, −0.5, 0.0; −1.0 vs −0.5, 0.0; 0.0 vs −0.5, +0.5, +1.0; +0.5 vs +1.0; +1.5 vs +1.0
ML classifier  12 (5,72660)  <.001  LR, LDA, NB, NN vs KNN, DT
CDI×ML classifier  2 (30,72660)  .004  −1.5: no differences; −1.0: LR, LDA, NB, NN vs KNN, DT; −0.5: LR, LDA, NN vs NB, KNN, DT; 0.0: NN vs DT
^{a}CDI: classification difficulty index.
^{b}ML: machine learning.
^{c}LR: logistic regression.
^{d}LDA: linear discriminant analysis.
^{e}NB: naive Bayes.
^{f}NN: neural network.
^{g}KNN: Knearest neighbor.
^{h}DT: decision tree.
Medical Information Mart for Intensive Care (MIMIC) III generalized linear mixed model (GLMM) accuracy results; machine learning classifier against CDI for (A) balanced and (B) imbalanced data. DT: decision tree; KNN: Knearest neighbors; LDA: linear discriminant analysis; LR: logistic regression; NB: naive Bayes; NN: neural network.
The MIMIC-III imbalanced data set showed that at the healthier end of the CDI continuum, more peripheral cases were accurately classified. This was not the case at the central and unhealthier end of the continuum. Like the balanced data set, K-nearest neighbors and decision tree were the poorest classifiers. Although the interaction was significant, most of the paired comparisons supported the main effect findings.
Tests of the effects from eICU are reported in
Tests of the effects of classification, classifier, and their interaction for the electronic intensive care unit data set.
Effect  Test statistic (df)  P value  Significant paired comparisons
Balanced
CDI^{a}  382 (8,131586)  <.001  −2.0 vs −1.5, −1.0, −0.5, 0.0; −1.5 vs −1.0, −0.5, 0.0; −1.0 vs −0.5, 0.0; +1.0 vs +0.5, 0.0; +1.5 vs +1.0, +0.5, 0.0; +2.0 vs +0.5, 0.0
ML^{b} classifier  58 (5,131586)  <.001  NN^{c} vs LR^{d}, LDA^{e}, DT^{f} vs NB^{g} vs KNN^{h}
CDI×ML classifier  9 (40,131586)  <.001  −2.0: NN vs KNN; −1.5: NN vs LR, LDA, NB, DT vs KNN; −1.0: NN vs LR, LDA, NB, DT vs KNN; −0.5: NN vs LR, LDA, DT vs NB vs KNN; 0.0: NN vs LR, LDA, DT vs NB vs KNN; +0.5: NN vs LR, LDA, DT vs KNN vs NB; +1.0: NN vs LR, LDA vs DT vs KNN vs NB; +1.5: NN, LR, LDA vs NB vs DT vs KNN; +2.0: NN vs KNN
Imbalanced
CDI  1138 (8,197406)  <.001  −2.0 vs −1.0, −0.5, 0.0; −1.5 vs −1.0, −0.5, 0.0; −1.0 vs −0.5, 0.0; −0.5 vs 0.0; 0.0 vs +0.5, +1.0; +1.0 vs +0.5; +1.5 vs +0.5, +1.0; +2.0 vs +1.0, +0.5
ML classifier  28 (5,197406)  <.001  NN vs LR, LDA vs DT vs NB, KNN
CDI×ML classifier  4 (40,197406)  <.001  −2.0: no differences; −1.5: NN vs LR, LDA, NB, KNN, DT; −1.0: NN vs LR, LDA, DT vs KNN, NB; −0.5: NN vs LR, LDA, DT vs KNN, NB; 0.0: NN vs LR, LDA vs DT vs KNN, NB; +0.5: NN vs LR, LDA vs DT vs KNN, NB; +1.0: NN vs LR, LDA, DT vs KNN, NB; +1.5: NN, LR vs LDA vs DT, NB vs KNN; +2.0: NN, LR vs KNN
^{a}CDI: classification difficulty index.
^{b}ML: machine learning.
^{c}NN: neural network.
^{d}LR: logistic regression.
^{e}LDA: linear discriminant analysis.
^{f}DT: decision tree.
^{g}NB: naive Bayes.
^{h}KNN: Knearest neighbor.
Electronic intensive care unit (eICU) generalized linear mixed model (GLMM) accuracy results; machine learning classifier against CDI for (A) balanced and (B) imbalanced data. DT: decision tree; KNN: Knearest neighbors; LDA: linear discriminant analysis; LR: logistic regression; NB: naive Bayes; NN: neural network.
For the eICU balanced data set, moving away from the central bin showed significantly better accuracy, except at the +2.0 level, which was similar to the +1.5 level. The ML classifier estimated means showed that the neural network had significantly better accuracy than all other classifiers. The overall interaction effect was significant, but the paired comparisons were similar to the main effects.
For the eICU imbalanced data set, more peripheral cases were accurately classified at the healthier end of the distribution, whereas there was only a slight improvement at the unhealthier end. Similar to the other analyses, the neural network showed the best classification accuracy. Although the overall interaction was significant, the neural network continued to be the best classifier.
The results generally supported the hypothesis that cases with more extreme IRT-based CDI values are more likely to be correctly classified than cases with more central CDI values. This provides a unique manner to evaluate the utility of ML classifiers in a health context. We were able to demonstrate that ML classifiers performed similarly for the extreme cases, whereas for the centrally located cases, there were more differences between classifiers. Thus, ML classifiers can be evaluated based on their relative performance with cases of varying difficulty.
Although these were the general results, several specific findings are worth noting. First, the neural network classifier was the best across all situations. The logistic regression and linear discriminant analysis classifiers were the closest second-best classifiers, whereas K-nearest neighbors almost always performed the worst. It is possible, as found in this study, that classifiers may turn out to be consistent over all levels of difficulty. However, owing to the unique characteristics of both the data sets and the classifiers selected, some algorithms may yield better results at various levels of case difficulty in other samples.
It was also clear that the
On the basis of the IRT analysis results, easier- and harder-to-classify cases were identified. This has implications for research and clinical practice. Once the cases have been identified, other information gathered from their patient-specific data may provide clues about why they are easier or harder to classify, diagnose, or treat. The features themselves, which carry varying weighted importance in the indexing process, can be examined to assess for any differences in a patient’s CDI, that is, not just how many they got
As an example of how one could examine more closely the
An IRT analysis can assist in providing a better understanding of why the classification process works well or falls short on the set of features and cases under investigation. This moves the field closer to having interpretable and explainable results [
Limitations of this research include the fact that the classifiers showcased here were not exhaustive, only ICU data sets were used, and converting an out-of-range laboratory value as either
There are several ways to extend this work. Future research calls for (1) applying this method to other data sets to generalize its use, (2) using polytomous IRT models (eg, 0=in range, 1=somewhat out of range, and 2=very out of range) for more fine-grained case CDI scoring, (3) using multidimensional IRT models to obtain CDIs on >1 underlying dimension, and (4) using this approach to compare human versus machine classification accuracy across case difficulty. We can extend the intersection of ML with clinical medicine if we liken a physician to an ML classifier using feature data. It would be particularly interesting to compare case accuracies based on traditional ML versus clinical classifiers for cases of varying difficulty using an approach similar to that demonstrated in this study. Identifying which cases clinical classifiers are better suited to address, and which cases should be offloaded to an automated system, allows for the optimal use of scarce resources. As clinical expertise is developed over time, the use of ML algorithms to assist any single individual would be a moving target and would also serve as a source of future research.
Another way to improve the veracity of the findings would be to address the issue of extraneous features. Several of the features in MIMICIII and eICU had very low (<0.35) discrimination (slope) parameters, suggesting that there was a lot of
As more ML methods are investigated in the health care sphere, concerns have risen because of a lack of understanding regarding why they are successful, especially when compared with physician counterparts. This study has suggested an IRTbased methodology as one way to address this issue by examining the case difficulty in a data set that allows for followup into possible reasons why cases are or are not classified correctly.
Using the methods described in this study would signal a change in the way we evaluate supervised ML. Adopting them would move the field toward more of an evaluation system that characterizes the entire data set on which the classifiers are being trained and tested. Doing so circumvents the pitfalls associated with 1 classifier being cited as more accurate or more precise and generates a more tailored approach to ML classifier comparisons. In addition, this methodology lends itself well to
The method here presents an intersection of personalized medicine and ML that maintains its explainability and transparency in both feature selection and modeled accuracy, both of which are pivotal to their uptake in the health sphere.
2-parameter logistic
area under the curve
blood urea nitrogen
classification difficulty index
electronic intensive care unit
generalized linear mixed model
intensive care unit
item response theory
Medical Information Mart for Intensive Care
machine learning
AK contributed to idea generation, study and method design, literature search, data acquisition (MIMIC-III data set), figures, tables, data analysis, and writing. TK contributed to data analysis, writing, and proofing the manuscript. ZA contributed to data acquisition of the eICU data set. JL contributed to proofing and journal selection.
None declared.