This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Supervised machine learning (ML) is being featured in the health care literature, with study results frequently reported using metrics such as accuracy, sensitivity, specificity, recall, or F1 score. Although each metric provides a different perspective on performance, all are overall measures for the whole sample, discounting the uniqueness of each case or patient. Intuitively, we know that not all cases are equal, but present evaluative approaches do not take case difficulty into account.
A more case-based, comprehensive approach is warranted to assess supervised ML outcomes and forms the rationale for this study. This study aims to demonstrate how item response theory (IRT) can be used to stratify the data based on how difficult each case is to classify.
Two large, public intensive care unit data sets, Medical Information Mart for Intensive Care III and electronic intensive care unit, were used to showcase this method in predicting mortality. For each data set, a balanced sample (n=8078 and n=21,940, respectively) and an imbalanced sample (n=12,117 and n=32,910, respectively) were drawn. A 2-parameter logistic model was used to provide scores for each case. Several ML algorithms were used in the demonstration to classify cases based on their health-related features: logistic regression, linear discriminant analysis, K-nearest neighbors, decision tree, naive Bayes, and a neural network. Generalized linear mixed model analyses were used to assess the effects of case difficulty strata, ML algorithm, and the interaction between them in predicting accuracy.
The results showed significant effects (
This demonstration shows that using IRT is a viable method for understanding the data that are provided to ML algorithms, independent of outcome measures, and highlights how well classifiers differentiate cases of varying difficulty. This method explains which features are indicative of healthy states and why. It enables end users to select the classifier appropriate to the difficulty level of the patient for personalized medicine.
This study aims to demonstrate an approach to assess the effectiveness of binary machine learning (ML) classification, which is an alternative to the more traditional single scalar measures in the literature. Our approach uses an item response theory (IRT) model to enhance the understanding of the data set on which ML protocols are run as well as the results of the classification outcomes. Aspects of IRT’s utility have recently surfaced in the ML literature, including comparisons of collaborative filtering [
The varied and numerous contexts (eg, business, finance, medicine, home, government agencies) in which ML is being used are nothing short of staggering [
Despite the advances in metric development, there is an interest in developing more extensive descriptions of ML classification outcomes. For example, it has been argued that “... any single scalar measure has significant limitations” and “that such measures ... oversimplify complex questions and combine things that should be kept separate” [
These comments are consistent with general calls for a fuller explanation regarding the interpretability of ML studies [
The ML criticism of the lack of attention to the unique characteristics of the individual cases is the focus of this study. We propose to address this challenge using a more comprehensive, case-nuanced approach. Although there has been some work in this regard, such as the now accepted wisdom that standard classifiers do not work well with imbalanced data [
This lack of attention is highlighted when various ML models are assessed: they often produce comparable outcomes, with similar percentages of cases misclassified regardless of the model used [
There are 2 fundamental building blocks to any ML system: the features of interest and the cases in the data set. To investigate the research question in this study, methods derived from IRT were employed as they simultaneously estimate the characteristics of both features and cases. Understanding this phenomenon allows medical professionals to tailor the classifier to the patient.
Following an examination, there are discussions by students about the test items; often they remark, “that question was hard, what did you put?” or “that was an easy question.” Such comments reflect the purposeful construction of the test items. Some items are designed to be relatively easy to pass, whereas others are designed to be more difficult such that only a few can pass. Similarly, students talk about the test takers: “she always gets the highest score in the class” or “I think I have a 50-50 chance of passing.” Test takers are quite cognizant of the fact that not all test items are created equal and that not all test takers have the same ability. These fundamental assumptions give rise to IRT, where the characteristics of the items and of the students are modeled together, providing a clearer picture about which items discriminate between which test takers.
A parallel can be drawn between a set of students passing or failing a test based on their performance on a set of items and a set of patients being classified into 1 of 2 categories (alive or not alive) based on their scores on a set of health-related features. Using the
As not all test takers score 100% or 0% on an examination, some combination of right and wrong answers to questions provides an index of individual test-taker ability in completing the test. The term ability (symbolized by theta, θ) is used in the psychometric literature where IRT evolved and describes any latent construct of interest being measured. In this study, within-range or out-of-range laboratory values and vital signs as well as demographic information comprise the features in our data sets. Thus, we can ascertain a case’s placement with respect to the underlying distribution of unhealthiness.
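The feature scoring described above can be sketched in a few lines. The helper and the dictionary of ranges below are illustrative, not the authors' code; the normal ranges echo those in the feature tables.

```python
def dichotomize(value, low, high):
    """Score a lab value or vital sign against its normal range:
    0 = within range ("healthy"), 1 = out of range ("unhealthy")."""
    return 0 if low <= value <= high else 1

# Illustrative normal ranges (mirroring the feature tables); units omitted.
NORMAL_RANGES = {
    "na_min": (135, 145),     # sodium, mmol/L
    "k_min": (3.5, 5.0),      # potassium, mmol/L
    "heart_rate": (60, 100),  # mean heart rate, bpm
}

# A hypothetical patient: low sodium, normal potassium, tachycardic.
patient = {"na_min": 128, "k_min": 4.1, "heart_rate": 112}
scores = {f: dichotomize(v, *NORMAL_RANGES[f]) for f, v in patient.items()}
# scores -> {"na_min": 1, "k_min": 0, "heart_rate": 1}
```

The resulting vector of 0s and 1s plays the role of a test taker's right and wrong answers in the IRT analysis.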
The process of generating CDIs on the unhealthiness continuum will be carried out without using the outcome variable of mortality itself; that is, IRT provides case-based classification difficulty index (CDI) scores that can be examined before the data as a collective are subjected to an ML protocol.
The IRT analysis provides case-based CDIs using a set of feature characteristics that do not use the information on the outcome classification variable. CDIs for the sample are generated along the normal distribution, with a mean of 0.0 (SD 1.0). It is hypothesized that cases with more centrally located CDIs will be less likely to be classified correctly, whereas cases with more peripherally located CDIs will be more likely to be classified correctly. One research question is as follows: Will some ML classifiers be more accurate in classifying cases at all CDIs? Another research question is as follows: Will some ML classifiers be more accurate than others in classifying cases at different CDIs? Identifying these cases
Data were obtained through 2 large, freely available data sets. One was the MIMIC-III (Medical Information Mart for Intensive Care III) database housing health data of >40,000 critical care unit patients at the Beth Israel Deaconess Medical Center admitted between 2001 and 2012 [
Databases were queried using the SQL plugin for Python (Python Software Foundation). Case inclusion criteria were as follows: (1) age ≥16 years, (2) at least three-fourths of the features of interest available for a given case (patient), with missing values subsequently imputed, and (3) first hospital visit in the case of repeated patients. Features of predictive interest were selected based on 2 common severity of illness scores: Simplified Acute Physiology Score II and Acute Physiology and Chronic Health Evaluation IV for MIMIC-III and eICU, respectively. To test the hypothesis with both balanced and imbalanced data sets, the number of
For the MIMIC-III data set, there were 4039 cases that experienced
The features included demographic information, procedures, preexisting conditions, and laboratory values (
Medical Information Mart for Intensive Care III variables based on Simplified Acute Physiology Score II.
Feature name  Description  Normal values, units
AIDS  Preexisting diagnosis  Absent: 0, 0 or 1
Heme malignancy  Preexisting diagnosis  Absent: 0, 0 or 1
Metastatic cancer  Preexisting diagnosis  Absent: 0, 0 or 1
Minimum GCS^{a}  Glasgow Coma Scale  15^{b}, 1–15
WBC^{c} minimum  Lowest white blood cell  4–10, 10^{9}/L
WBC maximum  Highest white blood cell  4–10, 10^{9}/L
Na minimum  Sodium minimum  135–145, mmol/L
Na maximum  Sodium maximum  135–145, mmol/L
K minimum  Potassium minimum  3.5–5, mmol/L
K maximum  Potassium maximum  3.5–5, mmol/L
Bilirubin maximum  Bilirubin maximum  ≤1.5–2, mg/dL
HCO_{3} minimum  Bicarbonate minimum  24–30, mmol/L
HCO_{3} maximum  Bicarbonate maximum  24–30, mmol/L
BUN^{d} minimum  Blood urea nitrogen minimum  7–22, mg/dL
BUN maximum  Blood urea nitrogen maximum  7–22, mg/dL
PO_{2}  Partial pressure of oxygen  85–105, mm Hg
FiO_{2}  Fraction of inspired oxygen  21, %
Heart rate mean  Mean heart rate  60–100, bpm
BP mean  Mean systolic blood pressure  95–145, mm Hg
Max temp  Maximum temperature  36.5–37.5, ℃
Urine output  Urine output  800–2000^{e}, mL/24 h
Sex  Male or female  Male: 1; female: 0
Age  Age in years  ≤65: 0, years
Admission type  Emergency or elective  Emergency: 1; else: 0, N/A^{f}
^{a}GCS: Glasgow Coma Scale.
^{b}Teasdale and Jennett, 1974 [
^{c}WBC: white blood cell.
^{d}BUN: blood urea nitrogen.
^{e}Medical CMP, 2011 [
^{f}N/A: not applicable.
Electronic intensive care unit data set variables based on Acute Physiology and Chronic Health Evaluation IV.
Feature name  Description  Normal values, units
GCS^{a}  Glasgow Coma Scale  15^{b}, 1–15
Urine output  Urine output in 24 hours  800–2000^{c}, mL/24 hour
WBC^{d}  White blood cell count  4–10, 10^{9}/L
Na  Serum sodium  135–145, mmol/L
Temperature  Temperature in Celsius  36.5–37.5^{e}, ℃
Respiration rate  Breaths per minute  12–20^{f}, breaths/min
Heart rate  Heart rate/min  60–100^{f}, bpm
Mean blood pressure  Mean arterial pressure  70–100^{g}, mm Hg
Creatinine  Serum creatinine  0.57–1.02 (F^{h}); 0.79–1.36 (M^{i}), mEq/L
pH  Arterial pH  7.35–7.45, N/A^{j}
Hematocrit  Red blood cell volume  37–46 (F); 38–50 (M), %
Albumin  Serum albumin  3.5–5.0, g/dL
PO_{2}  Partial pressure of oxygen  85–105, mm Hg
PCO_{2}  Partial pressure carbon dioxide  35–45, mm Hg
BUN^{k}  Blood urea nitrogen maximum  7–22, mg/dL
Glucose  Blood sugar level  68–200, mg/dL
Bili  Serum bilirubin  ≤1.5–2, mg/dL
FiO_{2}  Fraction of inspired oxygen  21^{l}, %
Sex  Male or female  Male: 1; female: 0, M or F 
Age  Age in years  ≤65: 0, years 
Leukemia  Preexisting diagnosis  Absent: 0, 0 or 1 
Lymphoma  Preexisting diagnosis  Absent: 0, 0 or 1 
Cirrhosis  Preexisting diagnosis  Absent: 0, 0 or 1 
Hepatic failure  Preexisting diagnosis  Absent: 0, 0 or 1 
Metastatic cancer  Preexisting diagnosis  Absent: 0, 0 or 1 
AIDS  Preexisting diagnosis  Absent: 0, 0 or 1 
Thrombolytics  Medical intervention  Absent: 0, 0 or 1 
Ventilator  Medical intervention  Absent: 0, 0 or 1 
Dialysis  Medical intervention  Absent: 0, 0 or 1 
Immunosuppressed  Medical intervention  Absent: 0, 0 or 1 
Elective surgery  Medical intervention  Absent: 0, 0 or 1 
^{a}GCS: Glasgow Coma Scale.
^{b}Teasdale and Jennett, 1974 [
^{c}Medical CMP, 2011 [
^{d}WBC: white blood cell.
^{e}Lapum et al. 2018 [
^{f}MDCalc [
^{g}Healthline [
^{h}F: female.
^{i}M: male.
^{j}N/A: not applicable.
^{k}BUN: blood urea nitrogen.
^{l}eICU Collaborative Research Database [
Using the IRTPRO (Scientific Software International) program, a 2-parameter logistic model (2PL) was run on the dichotomous data. The program uses a marginal maximum likelihood estimation procedure to calculate feature and case parameters [
Equation 1 shows a 2PL model in IRT; slope (a_{i}) captures the
Characteristic curve using a 2-parameter logistic model.
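The body of equation 1 did not survive extraction; the standard 2PL item characteristic function, consistent with the slope and location parameters reported later, is:

```latex
P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}
```

where θ is the latent trait (here, the unhealthiness continuum on which CDIs are placed), a_i is the slope (discrimination) of feature i, and b_i is its location (difficulty).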
CDI estimation in a 2PL model is calculated based on equation 2, where the probability of obtaining the correct answer is based on the item scores u_{i} weighted by the slopes a_{i}.
Equation 3, where u_{i} ∈ (0, 1) is the score on item i, is called the likelihood function. It is the probability of a response pattern given the CDIs and the item parameters across cases. There is 1 likelihood function for each response pattern, and the sum of all such functions equals 1 at any value of the distribution. On the basis of the pattern of each case’s values on the features, the program uses a Bayesian estimation process that provides a CDI on the unhealthiness continuum for each case in the data set.
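The likelihood of equation 3 and the Bayesian scoring step can be sketched in a few lines; the item parameters below are hypothetical, the quadrature is deliberately simple, and IRTPRO's actual estimator differs in detail.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a 1-response ("out of range") at trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def eap_cdi(responses, items, grid_points=81, lo=-4.0, hi=4.0):
    """Expected a posteriori (EAP) estimate of theta (the CDI) for one response pattern.

    responses: list of 0/1 feature scores; items: list of (slope a, location b).
    Prior: standard normal, integrated by simple quadrature on a grid.
    """
    step = (hi - lo) / (grid_points - 1)
    num = den = 0.0
    for k in range(grid_points):
        theta = lo + k * step
        prior = math.exp(-0.5 * theta * theta)
        # Likelihood of this response pattern at theta (equation 3's product form).
        like = 1.0
        for u, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            like *= p if u == 1 else (1.0 - p)
        w = prior * like
        num += theta * w
        den += w
    return num / den

# Hypothetical 3-feature example: an all-out-of-range pattern should pull the
# CDI toward the unhealthy (positive) end, an all-in-range pattern the other way.
items = [(1.5, 0.0), (0.8, -0.5), (2.0, 0.5)]
cdi_unhealthy = eap_cdi([1, 1, 1], items)
cdi_healthy = eap_cdi([0, 0, 0], items)
```

With these toy parameters, `cdi_unhealthy` lands well above 0 and `cdi_healthy` well below it, which is the behavior the hypothesis relies on.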
CDIs are reported on the standard normal distribution and typically range between −2.50 and +2.50. Each case’s CDI has its own individual SE around it based on the individual’s pattern of results across all features and their unique characteristics. Using the results from the 2PL model, it was possible to identify which of the cases were more centrally or more peripherally located on the distribution and thus would be less or more likely to be accurately classified into their respective categories (no death or death).
To allow for easy visualization and testing of effects, several strata bins were created into which continuous IRT CDIs could be assigned. These
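The stratification can be sketched as snapping each continuous CDI to the nearest 0.5-wide bin center; the bin width is an inference from the strata labels reported in the results tables, not a rule stated here.

```python
def cdi_bin(cdi, width=0.5):
    """Snap a continuous CDI to the nearest stratum center (a multiple of `width`)."""
    return round(cdi / width) * width

# A case with CDI 0.74 falls into the 0.5 stratum; -1.3 into the -1.5 stratum.
examples = [cdi_bin(0.74), cdi_bin(-1.3), cdi_bin(1.1)]
```

Counting cases per returned bin center then reproduces the "Number of cases" column of the stratified accuracy tables.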
Multiple ML algorithms were tested using the original feature values for both the MIMIC-III and eICU data sets. These included logistic regression, linear discriminant analysis, K-nearest neighbors, decision tree, naive Bayes, and a neural network. Both the K-nearest neighbors and the neural network had their hyperparameters optimized by a grid search. In the case of the K-nearest neighbors, the search grid included K from 1 to 40 and the distance methods Minkowski, Hamming, and Manhattan. The grid investigated for the neural network included the activation functions softmax, softplus, softsign, relu, tanh, sigmoid, and hard sigmoid; learning rates of 0.001, 0.01, 0.1, 0.2, and 0.3; and 1, 5, 10, 15, 20, 25, or 30 hidden neurons in a single hidden layer. For each of these methods, a 10-fold cross-validation was performed, and the numerical prediction was extracted for each case and then reassociated with its subject ID number for graphical plotting. The evaluation metrics accuracy, precision, recall, F1 score, and AUC were calculated. Accuracy was used to assess the hypotheses and research questions.
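The K-nearest neighbors search described above amounts to enumerating a grid and keeping the best-scoring configuration. In this sketch, `cv_accuracy` is a hypothetical stand-in for the 10-fold cross-validated accuracy (its toy surface is rigged, purely for illustration, to peak at the paper's reported MIMIC-III optimum); it is not the authors' pipeline.

```python
from itertools import product

# Search space mirroring the text: K from 1 to 40, three distance metrics.
K_VALUES = range(1, 41)
METRICS = ["minkowski", "hamming", "manhattan"]

def cv_accuracy(k, metric):
    """Hypothetical stand-in: the real pipeline would run 10-fold
    cross-validation of a K-nearest-neighbors classifier here and
    return its mean accuracy for (k, metric)."""
    bonus = {"minkowski": 0.00, "hamming": 0.01, "manhattan": 0.02}[metric]
    return 0.70 - abs(k - 27) * 0.001 + bonus

# Exhaustive enumeration of the grid, keeping the argmax configuration.
best = max(product(K_VALUES, METRICS), key=lambda kv: cv_accuracy(*kv))
# best -> (27, "manhattan") for this toy surface
```

The same enumerate-and-argmax pattern extends to the neural network grid (activation × learning rate × hidden neurons).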
To test the main effects of CDI and the repeated measure of the ML classifier as well as their interaction on each case’s accuracy score (0,1), generalized linear mixed model (GLMM) [
In equation 5, g(µ) is the logit link function that defines the relationship between the mean response µ and the linear combination of predictors. X represents the fixed effects matrix, and Z is the random effects matrix; the remaining term is simply an offset to the model.
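Equation 5 itself did not survive extraction; a standard GLMM specification consistent with the description (logit link, fixed effects X, random effects Z, and an offset o) is:

```latex
g(\mu) = \log\!\left(\frac{\mu}{1-\mu}\right) = X\beta + Zu + o
```

Here β denotes the fixed effect coefficients and u the random effects; the symbol names are conventional and not necessarily those used in the original.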
The models specified that (1) all effects are fixed, (2) the dependent variable follows a binomial distribution, and thus the predictors and criterion are linked via a logit function, (3) the residual covariance matrix for the repeated measure (ML classifier) is diagonal, and (4) the reference category was set to 0. Follow-up paired-comparison tests on the estimated marginal and cell means used a
Descriptive results of case CDIs are shown in
It should be noted that the 2 data sets have different distributions, and this fingerprint is inherently unique to the data set processed.
Item response theory case classification difficulty index results.
Data set  CDI^{a} range  Overall, mean (SD)  Point-biserial correlation^{b}  P value  No death, mean (SD)  Death, mean (SD)  t test (df)^{c}  P value
MIMIC-III^{d} balanced  −1.81 to +2.16  0.00 (0.85)  0.37  <.001  −0.32 (0.79)  0.32 (0.80)  35.76 (8077)  <.001
MIMIC-III imbalanced  −1.70 to +2.27  0.00 (0.85)  0.35  <.001  −0.21 (0.80)  0.42 (0.80)  40.88 (12116)  <.001
eICU^{e} balanced  −2.63 to +2.83  0.00 (0.80)  0.50  <.001  −0.40 (0.73)  0.40 (0.64)  86.18 (21939)  <.001
eICU imbalanced  −2.55 to +2.93  0.00 (0.81)  0.51  <.001  −0.29 (0.73)  0.59 (0.61)  109.09 (32909)  <.001
^{a}CDI: classification difficulty index.
^{b}Between CDI and outcome (no death or death).
^{c}Difference between no death and death means.
^{d}MIMIC-III: Medical Information Mart for Intensive Care III.
^{e}eICU: electronic intensive care unit.
Classification Difficulty Indexes in MIMIC-III (A) balanced and (B) imbalanced data. CDI: classification difficulty index; MIMIC: Medical Information Mart for Intensive Care.
Classification Difficulty Indexes in eICU (A) balanced and (B) imbalanced data. eICU: electronic Intensive Care Unit; DT: decision tree; KNN: Knearest neighbors; LDA: linear discriminant analysis; LR: logistic regression; NB: naive Bayes; NN: neural network.
Using the feature parameter estimates and case CDI, the unique differentiating capacity for each feature can be depicted by calculating the probability of each case falling into the 0 (no death) or 1 (death) categories. For example, the slope and location parameters for the blood urea nitrogen (BUN) minimum and urine output for the 2 MIMICIII data sets are shown in
Medical Information Mart for Intensive Care III feature parameters.
Feature parameters  Slope  Location
Balanced
Blood urea nitrogen (minimum)  5.64  0.09
Urine output  0.15  −2.23
Imbalanced
Blood urea nitrogen (minimum)  5.22  0.02
Urine output  0.09  −3.59
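The contrast between a discriminating and a non-discriminating feature can be made concrete by plugging the balanced MIMIC-III parameters above into the 2PL curve; the probabilities below follow directly from those published slopes and locations, though the helper itself is illustrative.

```python
import math

def p2pl(theta, a, b):
    """2PL probability that a case at trait level theta scores 1 on the feature."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Balanced MIMIC-III parameters from the table above.
bun_a, bun_b = 5.64, 0.09       # steep slope: sharply separates cases near b
urine_a, urine_b = 0.15, -2.23  # near-flat slope: barely separates anyone

# Evaluate half a CDI unit on either side of the BUN location:
low, high = p2pl(-0.5, bun_a, bun_b), p2pl(0.5, bun_a, bun_b)   # ~0.03 vs ~0.91
flat_low = p2pl(-0.5, urine_a, urine_b)
flat_high = p2pl(0.5, urine_a, urine_b)
# BUN swings from near 0 to near 1 across one unit of CDI;
# urine output hardly moves, so it contributes little to separating cases.
```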
Similar to the MIMIC-III results, the IRT analyses of the eICU showed that BUN was a highly discriminating feature whereas urine output was not (
Electronic intensive care unit feature parameters.
Feature parameter  Slope  Location
Balanced
Blood urea nitrogen (minimum)  1.55  −0.33
Urine output  0.04  −1.19
Imbalanced
Blood urea nitrogen (minimum)  1.49  −0.1
Urine output  0.03  −1.39
The K-nearest neighbors grid search selected Manhattan distance with 27 nearest neighbors for MIMIC-III and Manhattan distance with 19 neighbors for eICU. The neural network grid search returned an optimal learning rate of 0.001, the softmax activation function, and a single hidden layer of 15 hidden nodes for MIMIC-III and 17 for eICU.
Traditional metrics of accuracy, precision, recall, F1, and AUC are presented for MIMIC-III in
Medical Information Mart for Intensive Care III classification performance in traditional metrics.
Metric  LR^{a} (%)  LDA^{b} (%)  KNN^{c} (%)  DT^{d} (%)  NB^{e} (%)  NN^{f} (%)
Balanced
Accuracy  75.3  75.0  67.2  70.9  70.4  76.1
Precision  75.8  75.6  69.3  71.1  79.5  75.6
Recall  74.3  73.8  61.8  70.6  54.9  77.2
F1  75.0  74.7  65.3  70.8  64.9  76.4
AUC^{g}  75.3  75.0  67.2  70.9  70.4  76.5
Imbalanced
Accuracy  78.3  77.9  72.8  73.7  75.3  80.5
Precision  73.3  73.8  63.1  60.6  67.7  72.7
Recall  54.8  52.1  44.4  60.6  49.6  66.6
F1  62.7  61.1  52.2  60.6  57.3  69.5
AUC  72.4  71.4  65.7  70.9  68.9  76.9
^{a}LR: logistic regression.
^{b}LDA: linear discriminant analysis.
^{c}KNN: Knearest neighbor.
^{d}DT: decision tree.
^{e}NB: naive Bayes.
^{f}NN: neural network.
^{g}AUC: area under the curve.
In both the balanced and imbalanced MIMIC-III data sets, the neural network outperformed the other classifiers (balanced accuracy: 76.1%; imbalanced accuracy: 80.5%) using traditional metrics. It is worth highlighting the effect of an imbalanced data set: accuracy increases whereas precision, recall, and F1 score decrease.
In both the balanced and the imbalanced eICU data sets (
Item response theory–based Medical Information Mart for Intensive Care III mortality prediction accuracy stratified by classification difficulty index.
Number of cases  CDI^{a}  LR^{b} (%)  LDA^{c} (%)  KNN^{d} (%)  DT^{e} (%)  NB^{f} (%)  NN^{g} (%)
Balanced
1  2.5  100.0  100.0  100.0  100.0  100.0  100.0
13  2.0  92.3  92.3  84.6  92.3  92.3  92.3
316  1.5  90.2  88.2  80.4  80.4  89.2  88.3
1884  1.0  75.6  74.9  68.2  68.8  68.4  77.0
1321  0.5  70.5  70.6  63.5  65.9  65.4  71.1
952  0.0  72.0  72.4  62.8  68.8  66.2  73.9
1346  −0.5  70.9  70.6  60.4  67.1  63.7  72.1
1955  −1.0  77.0  77.1  70.9  75.4  75.2  78.3
288  −1.5  94.8  94.8  83.3  91.0  95.5  94.5
3  −2.0  100.0  100.0  100.0  100.0  100.0  100.0
Imbalanced
1  2.5  100.0  100.0  100.0  100.0  100.0  100.0
30  2.0  93.3  93.3  76.7  73.3  93.3  93.3
571  1.5  77.4  75.7  64.1  71.1  77.4  78.3
1886  1.0  70.6  70.3  63.9  65.0  64.6  73.3
1537  0.5  76.3  75.5  67.3  71.2  72.7  79.7
1251  0.0  78.7  78.0  75.6  74.5  76.8  80.3
2794  −0.5  75.0  74.5  71.0  72.1  72.3  78.4
2722  −1.0  88.3  88.3  85.0  83.3  87.1  89.1
325  −1.5  99.1  99.1  96.6  98.2  99.1  98.8
^{a}CDI: classification difficulty index.
^{b}LR: logistic regression.
^{c}LDA: linear discriminant analysis.
^{d}KNN: Knearest neighbor.
^{e}DT: decision tree.
^{f}NB: naive Bayes.
^{g}NN: neural network.
Electronic intensive care unit classification performance in traditional metrics.
Metric  LR^{a} (%)  LDA^{b} (%)  KNN^{c} (%)  DT^{d} (%)  NB^{e} (%)  NN^{f} (%)
Balanced
Accuracy  77.9  77.4  67.2  76.7  66.6  84.7
Precision  77.9  78.1  67.9  76.7  73.7  84.5
Recall  77.9  76.3  65.3  76.8  51.6  84.9
F1  77.8  77.2  66.6  76.7  60.7  84.7
AUC^{g}  77.9  77.4  67.2  77.1  66.6  85.9
Imbalanced
Accuracy  78.0  80.1  73.6  81.6  73.3  89.5
Precision  73.6  75.1  64.1  72.1  62.0  84.7
Recall  62.1  60.2  47.2  72.9  51.5  83.5
F1  67.4  66.8  54.4  72.5  56.3  84.1
AUC  75.5  75.1  67.0  79.3  67.9  87.8
^{a}LR: logistic regression.
^{b}LDA: linear discriminant analysis.
^{c}KNN: Knearest neighbor.
^{d}DT: decision tree.
^{e}NB: naive Bayes.
^{f}NN: neural network.
^{g}AUC: area under the curve.
Item response theory–based electronic intensive care unit mortality prediction accuracy stratified by classification difficulty index.
Number of cases  CDI^{a}  LR^{b} (%)  LDA^{c} (%)  KNN^{d} (%)  DT^{e} (%)  NB^{f} (%)  NN^{g} (%)
Balanced
2  3.0  100.0  100.0  100.0  50.0  100.0  100.0
61  2.5  82.0  82.0  75.4  78.7  86.9  85.2
160  2.0  81.3  82.5  75.0  76.3  81.9  83.4
621  1.5  86.2  86.8  74.5  79.2  83.7  87.9
3167  1.0  83.7  82.9  72.1  78.3  66.3  85.4
4998  0.5  74.0  72.7  64.7  73.1  55.2  80.9
4776  0.0  70.9  70.1  58.5  71.5  57.3  80.0
3864  −0.5  73.8  74.5  63.3  74.4  67.4  84.3
2858  −1.0  85.4  85.5  74.4  84.8  83.1  91.8
1183  −1.5  92.5  92.6  84.3  91.7  91.9  96.4
240  −2.0  97.1  97.1  91.7  95.8  96.3  97.9
10  −2.5  100.0  100.0  100.0  100.0  100.0  100.0
Imbalanced
6  3.0  66.7  83.3  83.3  66.6  66.6  83.3
58  2.5  82.8  81.0  69.0  75.9  87.9  84.5
215  2.0  79.1  78.6  67.0  72.6  76.3  82.3
1369  1.5  79.8  79.0  65.4  75.2  72.8  85.7
4776  1.0  72.2  72.4  61.6  74.8  58.4  83.9
6657  0.5  67.3  67.0  72.1  57.3  57.3  83.1
7068  0.0  76.4  76.9  70.0  78.8  70.3  88.5
6396  −0.5  87.1  87.3  83.2  87.3  83.4  93.7
4265  −1.0  94.8  95.0  92.0  94.3  92.7  97.7
1763  −1.5  98.0  98.0  97.1  97.9  97.3  99.4
317  −2.0  99.1  99.1  98.4  98.4  98.4  99.1
20  −2.5  100.0  100.0  100.0  100.0  100.0  100.0
^{a}CDI: classification difficulty index.
^{b}LR: logistic regression.
^{c}LDA: linear discriminant analysis.
^{d}KNN: Knearest neighbor.
^{e}DT: decision tree.
^{f}NB: naive Bayes.
^{g}NN: neural network.
The CDI group sizes at the extreme ends were too small and were collapsed into the next level down for each data set. Tests of the effects for MIMIC-III are reported in
The MIMIC-III balanced data showed significantly better accuracies for the more peripheral than central CDI bins. K-nearest neighbors and decision tree were the poorest classifiers. Although there was a small significant interaction effect, by and large, the main effects were borne out.
Tests of the effects of classification difficulty index, classifier, and their interaction for the Medical Information Mart for Intensive Care III data set.
Effect  Test statistic (df)  P value  Significant paired comparisons
Balanced
CDI^{a}  123 (6,48456)  <.001  −1.5 vs −1.0, −0.5, 0.0; −1.0 vs −0.5, 0.0; +1.0 vs +0.5, 0.0; +1.5 vs +1.0, +0.5, 0.0
ML^{b} classifier  52 (5,48456)  <.001  LR^{c}, LDA^{d}, NB^{e}, NN^{f} vs KNN^{g}, DT^{h}; DT vs KNN
CDI×ML classifier  2 (30,48456)  <.001  −1.5: LR, LDA, NB, NN, DT vs KNN; −1.0: LR, LDA, NB, NN, DT vs KNN; −0.5: LR, LDA, DT, NN vs NB, KNN; 0.0: LR, LDA, DT, NN vs NB, KNN; +0.5: LR, LDA, NN vs NB, KNN, DT; +1.0: LR, LDA, NN vs NB, KNN, DT; +1.5: LR, LDA, NB, NN vs KNN, DT
Imbalanced
CDI  314 (6,72660)  <.001  −1.5 vs −1.0, −0.5, 0.0; −1.0 vs −0.5, 0.0; 0.0 vs −0.5, +0.5, +1.0; +0.5 vs +1.0; +1.5 vs +1.0
ML classifier  12 (5,72660)  <.001  LR, LDA, NB, NN vs KNN, DT
CDI×ML classifier  2 (30,72660)  .004  −1.5: no differences; −1.0: LR, LDA, NB, NN vs KNN, DT; −0.5: LR, LDA, NN vs NB, KNN, DT; 0.0: NN vs DT
^{a}CDI: classification difficulty index.
^{b}ML: machine learning.
^{c}LR: logistic regression.
^{d}LDA: linear discriminant analysis.
^{e}NB: naive Bayes.
^{f}NN: neural network.
^{g}KNN: Knearest neighbor.
^{h}DT: decision tree.
Medical Information Mart for Intensive Care (MIMIC) III generalized linear mixed model (GLMM) accuracy results; machine learning classifier against CDI for (A) balanced and (B) imbalanced data. DT: decision tree; KNN: Knearest neighbors; LDA: linear discriminant analysis; LR: logistic regression; NB: naive Bayes; NN: neural network.
The MIMIC-III imbalanced data set showed that at the healthier end of the CDI continuum, more peripheral cases were accurately classified. This was not the case at the central and unhealthier end of the continuum. Like the balanced data set, K-nearest neighbors and decision tree were the poorest classifiers. Although the interaction was significant, most of the paired comparisons supported the main effect findings.
Tests of the effects from eICU are reported in
Tests of the effects of classification, classifier, and their interaction for the electronic intensive care unit data set.
Effect  Test statistic (df)  P value  Significant paired comparisons
Balanced
CDI^{a}  382 (8,131586)  <.001  −2.0 vs −1.5, −1.0, −0.5, 0.0; −1.5 vs −1.0, −0.5, 0.0; −1.0 vs −0.5, 0.0; +1.0 vs +0.5, 0.0; +1.5 vs +1.0, +0.5, 0.0; +2.0 vs +0.5, 0.0
ML^{b} classifier  58 (5,131586)  <.001  NN^{c} vs LR^{d}, LDA^{e}, DT^{f} vs NB^{g} vs KNN^{h}
CDI×ML classifier  9 (40,131586)  <.001  −2.0: NN vs KNN; −1.5: NN vs LR, LDA, NB, DT vs KNN; −1.0: NN vs LR, LDA, NB, DT vs KNN; −0.5: NN vs LR, LDA, DT vs NB vs KNN; 0.0: NN vs LR, LDA, DT vs NB vs KNN; +0.5: NN vs LR, LDA, DT vs KNN vs NB; +1.0: NN vs LR, LDA vs DT vs KNN vs NB; +1.5: NN, LR, LDA vs NB vs DT vs KNN; +2.0: NN vs KNN
Imbalanced
CDI  1138 (8,197406)  <.001  −2.0 vs −1.0, −0.5, 0.0; −1.5 vs −1.0, −0.5, 0.0; −1.0 vs −0.5, 0.0; −0.5 vs 0.0; 0.0 vs +0.5, +1.0; +1.0 vs +0.5; +1.5 vs +0.5, +1.0; +2.0 vs +1.0, +0.5
ML classifier  28 (5,197406)  <.001  NN vs LR, LDA vs DT vs NB, KNN
CDI×ML classifier  4 (40,197406)  <.001  −2.0: no differences; −1.5: NN vs LR, LDA, NB, KNN, DT; −1.0: NN vs LR, LDA, DT vs KNN, NB; −0.5: NN vs LR, LDA, DT vs KNN, NB; 0.0: NN vs LR, LDA vs DT vs KNN, NB; +0.5: NN vs LR, LDA vs DT vs KNN, NB; +1.0: NN vs LR, LDA, DT vs KNN, NB; +1.5: NN, LR vs LDA vs DT, NB vs KNN; +2.0: NN, LR vs KNN
^{a}CDI: classification difficulty index.
^{b}ML: machine learning.
^{c}NN: neural network.
^{d}LR: logistic regression.
^{e}LDA: linear discriminant analysis.
^{f}DT: decision tree.
^{g}NB: naive Bayes.
^{h}KNN: Knearest neighbor.
Electronic intensive care unit (eICU) generalized linear mixed model (GLMM) accuracy results; machine learning classifier against CDI for (A) balanced and (B) imbalanced data. DT: decision tree; KNN: Knearest neighbors; LDA: linear discriminant analysis; LR: logistic regression; NB: naive Bayes; NN: neural network.
For the eICU balanced data set, moving away from the central bin showed significantly better accuracy, except at the +2.0 level, which was similar to the +1.5 level. The ML classifier estimated means showed that the neural network had significantly better accuracy than all other classifiers. The overall interaction effect was significant, but the paired comparisons were similar to the main effects.
For the eICU imbalanced data set, more peripheral cases were accurately classified at the healthier end of the distribution, whereas there was only a slight improvement at the unhealthier end. Similar to the other analyses, the neural network showed the best classification accuracy. Although the overall interaction was significant, the neural network continued to be the best classifier.
The results generally supported the hypothesis that cases with more extreme IRT-based CDI values are more likely to be correctly classified than cases with more central CDI values. This provides a unique manner to evaluate the utility of ML classifiers in a health context. We were able to demonstrate that ML classifiers performed similarly for the extreme cases, whereas for the centrally located cases, there were more differences between classifiers. Thus, ML classifiers can be evaluated based on their relative performance with cases of varying difficulty.
Although these were the general results, several specific findings are worth noting. First, the neural network classifier was the best across all situations. The logistic regression and linear discriminant analysis classifiers were the closest second-best classifiers, whereas K-nearest neighbors almost always performed the worst. It is possible, as found in this study, that classifiers may turn out to be consistent over all levels of difficulty. However, owing to the unique characteristics of both the data sets and the classifiers selected, some algorithms may yield better results at various levels of case difficulty in other samples.
It was also clear that the
On the basis of the IRT analysis results, easier- and harder-to-classify cases were identified. This has implications for research and clinical practice. Once the cases have been identified, other information gathered from their patient-specific data may provide clues about why they are easier or harder to classify, diagnose, or treat. The features themselves, which carry varying weighted importance in the indexing process, can be examined to assess for any differences in a patient’s CDI, that is, not just how many they got
As an example of how one could examine more closely the
An IRT analysis can assist in providing a better understanding of why the classification process works well or falls short on the set of features and cases under investigation. This moves the field closer to having interpretable and explainable results [
Limitations of this research include the fact that the classifiers showcased here were not exhaustive, only ICU data sets were used, and converting an out-of-range laboratory value as either
There are several ways to extend this work. Future research calls for (1) applying this method to other data sets to generalize its use, (2) using polytomous IRT models (eg, 0=in range, 1=somewhat out of range, and 2=very out of range) for more fine-grained case CDI scoring, (3) using multidimensional IRT models to obtain CDIs on >1 underlying dimension, and (4) using this approach to compare human versus machine classification accuracy across case difficulty. We can extend the intersection of ML with clinical medicine if we liken a physician to an ML classifier using feature data. It would be particularly interesting to compare case accuracies based on traditional ML versus clinical classifiers for cases of varying difficulty using an approach similar to that demonstrated in this study. Identifying which cases clinical classifiers are better suited to address, and which cases should be offloaded to an automated system, allows for the optimal use of scarce resources. As clinical expertise is developed over time, the use of ML algorithms to assist any single individual would be a moving target and would also serve as a source of future research.
Another way to improve the veracity of the findings would be to address the issue of extraneous features. Several of the features in MIMICIII and eICU had very low (<0.35) discrimination (slope) parameters, suggesting that there was a lot of
As more ML methods are investigated in the health care sphere, concerns have risen because of a lack of understanding regarding why they are successful, especially when compared with physician counterparts. This study has suggested an IRTbased methodology as one way to address this issue by examining the case difficulty in a data set that allows for followup into possible reasons why cases are or are not classified correctly.
Using the methods described in this study would signal a change in the way we evaluate supervised ML. Adopting them would move the field toward more of an evaluation system that characterizes the entire data set on which the classifiers are being trained and tested. Doing so circumvents the pitfalls associated with 1 classifier being cited as more accurate or more precise and generates a more tailored approach to ML classifier comparisons. In addition, this methodology lends itself well to
The method here presents an intersection of personalized medicine and ML that maintains its explainability and transparency in both feature selection and modeled accuracy, both of which are pivotal to their uptake in the health sphere.
2-parameter logistic
area under the curve
blood urea nitrogen
classification difficulty index
electronic intensive care unit
generalized linear mixed model
intensive care unit
item response theory
Medical Information Mart for Intensive Care
machine learning
AK contributed to idea generation, study and method design, literature search, data acquisition (MIMIC-III data set), figures, tables, data analysis, and writing. TK contributed to data analysis, writing, and proofing the manuscript. ZA contributed to data acquisition of the eICU data set. JL contributed to proofing and journal selection.
None declared.