The use of artificial intelligence (AI) in the medical domain has attracted considerable research interest. Inference applications in the medical domain require energy-efficient AI models. In contrast to the unstructured data used in visual AI, medical laboratory data usually comprise features with strong signals. Numerous energy optimization techniques have been developed to relieve the hardware burden of deploying complex learning models. However, the energy efficiency of different AI models used for medical applications has not been studied.
The aim of this study was to explore and compare the energy efficiency levels of commonly used machine learning algorithms—logistic regression (LR), k-nearest neighbor, support vector machine, random forest (RF), and extreme gradient boosting (XGB) algorithms, as well as four different variants of neural network (NN) algorithms—when applied to clinical laboratory datasets.
We applied the aforementioned algorithms to two distinct clinical laboratory data sets: a mass spectrometry data set regarding the methicillin resistance of Staphylococcus aureus and a urinalysis data set regarding urinary tract infections.
The experimental results indicated that the RF and XGB algorithms achieved the two highest AUROC values for both data sets (84.7% and 83.9%, respectively, for the mass spectrometry data set; 91.1% and 91.4%, respectively, for the urinalysis data set). The XGB and LR algorithms exhibited the shortest inference time for both data sets (0.47 milliseconds for both in the mass spectrometry data set; 0.39 and 0.47 milliseconds, respectively, for the urinalysis data set). Compared with the RF algorithm, the XGB and LR algorithms exhibited a 45% and 53%-60% reduction in inference time for the mass spectrometry and urinalysis data sets, respectively. In terms of energy efficiency, the XGB algorithm exhibited the lowest power consumption for the mass spectrometry data set (9.42 watts) and the LR algorithm exhibited the lowest power consumption for the urinalysis data set (9.98 watts). Compared with a five-hidden-layer NN, the XGB and LR algorithms achieved 16%-24% and 9%-13% lower power consumption levels for the mass spectrometry and urinalysis data sets, respectively. In all experiments, the XGB algorithm exhibited the best performance in terms of accuracy, run time, and energy efficiency.
The XGB algorithm achieved balanced performance levels in terms of AUROC, run time, and energy efficiency for the two clinical laboratory data sets. Considering the energy constraints in real-world scenarios, the XGB algorithm is ideal for medical AI applications.
Machine learning (ML) methods have been successfully employed in various medical fields.
An optimal ML model should achieve balanced predictive performance and energy efficiency. However, most relevant studies have only focused on comparing the predictive performance of different ML algorithms.
A partial explanation for the poor understanding of energy efficiency is that estimating energy consumption is more difficult than estimating other metrics (eg, accuracy).
In this study, we estimated the power consumption of nine algorithms during ML inference: logistic regression (LR), k-nearest neighbor (kNN), support vector machine (SVM), random forest (RF), extreme gradient boosting (XGB), and four NN-based algorithms. These algorithms were used to classify two clinical laboratory data sets: a large binary feature set and a small integer feature set. The following performance measures were recorded: accuracy, area under the receiver operating characteristic curve (AUROC), time consumption, and power consumption. The time and power consumption were determined using performance counter data from Intel Power Gadget 3.5. Finally, we performed statistical tests to validate our results. The results indicated the energy efficiency of each investigated ML algorithm in medical applications.
Process flowchart of this study. LR: logistic regression; kNN: k-nearest neighbor; SVM: support vector machine; RF: random forest; XGB: extreme gradient boosting; NN1: one-hidden-layer neural network; QNN: quantized five-hidden-layer neural network; PNN: pruned five-hidden-layer neural network; NN5: five-hidden-layer neural network; AUROC: area under the receiver operating characteristic curve.
All experiments were run on a Windows 10 personal computer with 4 GB RAM and a 2.3 GHz Intel Core i5-8300H central processing unit (CPU). All ML models were implemented using Python 3.7.1 with the following Python libraries: scikit-learn 0.23.1 and xgboost 0.90; the NN-based models were implemented with PyTorch.
The mass spectrometry and urinalysis data sets adopted in this study represent distinct feature patterns in ML; the mass spectrometry data set is a large set of binary features, whereas the urinalysis data set is a small set of predominantly integer features. The characteristics of both data sets are summarized in the table below.
Characteristics of the final mass spectrometry and urinalysis data sets.
Data set | Cases, n | Features, n | Binary features, n | Integer features, n | Majority class | Percentage of majority class (Nmax/N) | Gini impurity
Mass spectrometry | 3338 | 268 | 268 | 0 | Methicillin-resistant Staphylococcus aureus | 53.0% | 0.50
Urinalysis | 2898 | 9 | 2 | 7 | | 57.1% | 0.49
The mass spectrometry data set comprises mass spectral information on methicillin-resistant and methicillin-sensitive Staphylococcus aureus.
In the original mass spectral data, intensity values are a function of the mass-to-charge ratio. We preprocessed these data using a validated binning method, which converted each spectrum into binary features indicating the presence or absence of peaks within fixed mass-to-charge intervals.
The urinalysis data set comprises routine urinalysis data (including leukocyte esterase, nitrite, protein, occult blood, red blood cell count, white blood cell count, and epithelial cell count) and demographic data (age and sex) of patients with and without urinary tract infections.
In the urinalysis data set, “sex” and “nitrite” are binary features, which are represented by 1 and 0. “Leukocyte esterase,” “protein,” and “occult blood” data are semiquantitative features, which are represented on scales ranging from 0 to 4 (“negative,” “trace,” 1+, 2+, and 3+), 0 to 5 (“negative,” “trace,” 1+, 2+, 3+, and 4+), and 0 to 5 (“negative,” “trace,” 1+, 2+, 3+, and 4+), respectively. “Age,” “red blood cell count,” “white blood cell count,” and “epithelial cell count” are nonnegative integer features with maximum values of 103, 501, 501, and 101, respectively. The urinalysis data set does not contain missing values, and we did not perform further feature selection for this data set.
LR is one of the simplest binary classification algorithms. In the LR algorithm, the predictive outcome ŷ(x) for a given data point x is obtained by applying the logistic function to a linear combination of the input features:

\hat{y}(\mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^{\top}\mathbf{x} + b)}}

where w denotes the weight vector and b denotes the bias of the model.
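As a concrete illustration, the following is a minimal LR sketch with scikit-learn (the library named in the Methods); the feature matrices and labels are randomly generated placeholders rather than the study data.

```python
# Minimal LR sketch; X_train, y_train, and X_test are hypothetical
# placeholders standing in for the preprocessed data sets.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(100, 268))  # 268 binary features, as in the mass spectrometry set
y_train = rng.integers(0, 2, size=100)
X_test = rng.integers(0, 2, size=(10, 268))

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # sigmoid of w^T x + b
```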
The kNN algorithm is a memory-based algorithm. Accordingly, predictions of the kNN algorithm are directly based on the training data set, and no additional training is required. The predictive outcome ŷ(x) is the average of the labels of the nearest neighbors:

\hat{y}(\mathbf{x}) = \frac{1}{k} \sum_{\mathbf{x}_i \in K} y_i

where K denotes the set of k instances in the training data set that are most similar to the given data point x and y_i denotes the label of instance x_i. Similarity is commonly quantified using the Minkowski distance:

d(\mathbf{x}, \mathbf{x}') = \left( \sum_{j} |x_j - x'_j|^{q} \right)^{1/q}

where q is a given positive number. The Manhattan distance (for q=1) and the Euclidean distance (for q=2) are the two most common distance metrics. In this study, the numbers of nearest neighbors (k) were 27 and 7 in the final models of the mass spectrometry and urinalysis data sets, respectively.
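A minimal sketch of such a model in scikit-learn is shown below; k=27 and the Euclidean metric (q=2) follow the text above, and the data are hypothetical placeholders.

```python
# Minimal kNN sketch; the Minkowski parameter p corresponds to q above.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(200, 268))
y_train = rng.integers(0, 2, size=200)

knn = KNeighborsClassifier(n_neighbors=27, p=2)  # q=2: Euclidean distance
knn.fit(X_train, y_train)        # memory-based: fit only stores the data
probs = knn.predict_proba(X_train[:5])[:, 1]  # fraction of positive neighbors
```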
The SVM algorithm is a commonly used binary classification method. The purpose of the SVM algorithm is to find a hyperplane that separates two classes of data with the maximum margin in the feature space. The predictive outcome for a given data point x is determined by the side of the hyperplane on which the point falls:

\hat{y}(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^{\top}\mathbf{x} + b)

where w denotes the normal vector of the separating hyperplane and b denotes its bias.
The SVM algorithm is frequently applied with kernel transformations. Kernel functions represent the similarity between two data points. The radial basis function is one of the most commonly used kernel functions and is defined as follows:
K(\mathbf{x}, \mathbf{x}') = e^{-\gamma \lVert \mathbf{x} - \mathbf{x}' \rVert^{2}}

where γ denotes a given positive number. With a kernel function, the decision function of the SVM can be expressed as

\hat{y}(\mathbf{x}) = \operatorname{sign}\left( \sum_{i} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \right)

where α_i denotes the learned coefficient of training instance x_i (nonzero only for the support vectors) and y_i denotes its label.
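The following is a minimal sketch of an RBF-kernel SVM in scikit-learn; gamma corresponds to γ in the kernel above, and the data are hypothetical placeholders.

```python
# Minimal RBF-kernel SVM sketch; only the support vectors receive
# nonzero coefficients alpha_i in the decision function.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(100, 268))
y_train = rng.integers(0, 2, size=100)

svm = SVC(kernel="rbf", gamma="scale")  # K(x, x') = exp(-gamma * ||x - x'||^2)
svm.fit(X_train, y_train)
print(svm.n_support_)  # number of support vectors per class
```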
RF is an ensemble decision tree classifier. The prediction of the RF algorithm, namely ŷ(x), is obtained by averaging the predictions of its constituent decision trees:

\hat{y}(\mathbf{x}) = \frac{1}{|T|} \sum_{T_i \in T} T_i(\mathbf{x})

where T represents the set of decision trees in the RF and T_i(x) represents the prediction of the i-th decision tree for the data point x. RF training involves the "bagging" (ie, bootstrap aggregating) technique, in which each decision tree is trained on a bootstrap sample of the training data and a random subset of features is considered at each split; this procedure decorrelates the trees and reduces the variance of the ensemble.
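A minimal RF sketch in scikit-learn follows; bootstrap sampling implements the bagging described above, while n_estimators=100 and the data are assumed placeholders rather than the tuned values from this study.

```python
# Minimal RF sketch; each tree is trained on a bootstrap sample and the
# ensemble prediction averages the per-tree predictions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(100, 268))
y_train = rng.integers(0, 2, size=100)

rf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
rf.fit(X_train, y_train)
# Average tree depth (cf the RF depths of 29 and 21 reported in the Discussion)
print(sum(tree.get_depth() for tree in rf.estimators_) / len(rf.estimators_))
```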
The XGB algorithm is a type of ensemble algorithm, which uses the "boosting" technique to reduce the overall bias by sequentially combining weak classifiers into a model. The prediction of the XGB algorithm is the sum of the outputs of its decision trees:

\hat{y}(\mathbf{x}) = \sum_{i} T_i(\mathbf{x})

where T_i(x) denotes the prediction of the i-th decision tree for the data point x.
In each iteration, the training of decision trees in XGB is equivalent to a process of minimizing a certain objective function. Because XGB is a regularized algorithm, this objective function penalizes model complexity; for the tree added in a given iteration, it can be written as

\mathrm{Obj} = \sum_{j=1}^{N} \left[ G_j w_j + \tfrac{1}{2} (H_j + \lambda) w_j^{2} \right] + \gamma N

where N denotes the number of leaf nodes in a decision tree, w_j denotes the output weight of leaf j, and γ and λ denote given positive numbers. The parameters Gj and Hj are defined as follows:

G_j = \sum_{i \in I_j} \partial_{\hat{y}_i} l(y_i, \hat{y}_i), \qquad H_j = \sum_{i \in I_j} \partial_{\hat{y}_i}^{2} l(y_i, \hat{y}_i)

where ŷi represents the predictive outcome of training data point x_i, y_i represents its label, l denotes the loss function, and I_j denotes the set of training instances that fall into leaf j.
In this study, we implemented the XGB model by using the “xgboost 0.90” Python library.
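A minimal sketch of such a model is shown below; max_depth=6 matches the value reported for the mass spectrometry model in the Discussion, whereas the remaining hyperparameters and the data are assumed placeholders.

```python
# Minimal XGB sketch; reg_lambda and gamma correspond to lambda and gamma
# in the regularized objective above.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(100, 268))
y_train = rng.integers(0, 2, size=100)

xgb = XGBClassifier(n_estimators=100, max_depth=6, reg_lambda=1.0, gamma=0.0)
xgb.fit(X_train, y_train)
probs = xgb.predict_proba(X_train[:5])[:, 1]  # sum of tree outputs through a sigmoid
```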
NN is a type of ML model that is inspired by the human nervous system. An NN consists of multiple layers of nodes (or neurons). The layer that receives the initial data is the input layer, and the layer that exports the predictive results is the output layer. One or more hidden layers lie between the input and output layers. The outputs of each node in an NN are obtained from the outputs of the nodes in the previous layer.
Several types of connection patterns are possible between two adjacent layers. For example, in a classic fully connected layer, a series of weighted sums of the inputs is first calculated according to the given model parameters. These weighted sums are then subjected to a nonlinear transformation to obtain the output of the layer. In practice, these steps are implemented using vectorized expressions, and the output vector of the l-th layer, a^(l), is calculated as follows:

\mathbf{a}^{(l)} = f\!\left( \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)} \right)

where W^(l) and b^(l) denote the weight matrix and bias vector of the l-th layer, respectively, and f denotes a nonlinear activation function.
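The layer equation can be expressed directly in PyTorch, which is assumed here and in the pruning and quantization sketches below; ReLU is an assumed activation, and the layer width of 64 is a placeholder.

```python
# Minimal sketch of a fully connected layer: a_l = f(W_l a_{l-1} + b_l).
import torch
import torch.nn as nn

layer = nn.Linear(in_features=268, out_features=64)  # holds W and b
f = nn.ReLU()                                        # nonlinear activation

a_prev = torch.randn(1, 268)  # hypothetical output of the previous layer
a = f(layer(a_prev))          # weighted sums followed by the nonlinearity
print(a.shape)                # torch.Size([1, 64])
```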
NNs are ML models that are flexible in terms of the numbers of hidden layers and nodes in each layer. According to previous studies, one hidden layer is sufficient for approximating most continuous functions. In this study, we therefore implemented both a one-hidden-layer NN (NN1) and a five-hidden-layer NN (NN5).
To determine the appropriate architecture of an NN model, some previous studies have offered theoretical heuristics regarding the number of hidden units in an NN layer. However, the recommended number of nodes in a hidden layer varied widely across these studies.
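The NN1 and NN5 architectures can be sketched as follows; because the tuned hidden-layer widths are not restated here, a width of 64 is an assumed placeholder.

```python
# Hypothetical sketch of the NN1 (one hidden layer) and NN5 (five hidden
# layers) architectures used for binary classification.
import torch.nn as nn

def make_nn(n_hidden_layers, n_features=268, width=64):
    layers, in_dim = [], n_features
    for _ in range(n_hidden_layers):
        layers += [nn.Linear(in_dim, width), nn.ReLU()]
        in_dim = width
    layers.append(nn.Linear(in_dim, 1))  # single logit for binary classification
    return nn.Sequential(*layers)

nn1 = make_nn(1)
nn5 = make_nn(5)
```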
Pruning is a method for eliminating redundant connections in NNs. By removing connections whose weights have small magnitudes, pruning reduces the number of parameters and the amount of computation required during inference.
In this study, we applied global unstructured pruning to eliminate connections from the entire NN. The pruned NNs displayed in the figures are NN5s with a sparsity of 50%. However, we implemented pruning with sparsity values of 25%, 50%, and 75% for the NN1 and NN5 models. The detailed results regarding the other pruned NNs are provided in the multimedia appendices.
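A minimal sketch of global unstructured pruning with PyTorch's pruning utilities is given below, reusing the hypothetical nn5 model from the architecture sketch above; amount=0.5 corresponds to the 50% sparsity of the pruned NNs shown in the figures.

```python
# Globally prune 50% of the Linear-layer weights by L1 magnitude.
import torch.nn as nn
import torch.nn.utils.prune as prune

parameters_to_prune = [
    (module, "weight") for module in nn5.modules() if isinstance(module, nn.Linear)
]
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,  # remove smallest-magnitude weights
    amount=0.5,                           # overall sparsity of 50%
)
for module, name in parameters_to_prune:
    prune.remove(module, name)  # make the pruned weights permanent
```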
Quantization is a common method for model compression. In this method, the model size is reduced by computing and storing parameters with low bit widths.
In this study, the parameters of the original NN1 and NN5 models were tensors in the single-precision floating-point format; the quantized models used a quantized 8-bit signed integer data format. The quantized NNs displayed in the figures are NN5s. However, we implemented dynamic quantization for both the NN1 and NN5 models, and the detailed results regarding the quantized NNs are provided in the multimedia appendices.
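Dynamic quantization of this kind can be sketched in PyTorch as follows, again using the hypothetical nn5 model from above; Linear-layer weights are converted from 32-bit floats to 8-bit signed integers.

```python
# Post-training dynamic quantization: float32 Linear weights -> qint8.
import torch
import torch.nn as nn

qnn = torch.quantization.quantize_dynamic(
    nn5,                # the original single-precision model
    {nn.Linear},        # layer types to quantize
    dtype=torch.qint8,  # quantized 8-bit signed integer format
)
```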
In this study, we selected the aforementioned supervised ML algorithms according to their maturity and popularity. For every ML model, we tuned the hyperparameters in each algorithm through 5-fold cross-validation. The cutoff with the highest Youden index was selected as the final cutoff in each model.
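A minimal sketch of the cutoff selection follows; the Youden index is J = sensitivity + specificity − 1 = TPR − FPR, and y_true and y_score are hypothetical validation labels and predicted probabilities.

```python
# Select the cutoff that maximizes the Youden index on ROC coordinates.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_score = rng.random(200)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
youden = tpr - fpr                          # J = TPR - FPR
best_cutoff = thresholds[np.argmax(youden)]
```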
We evaluated the predictive performance of all final models using independent testing data sets. We selected accuracy and AUROC as the predictive performance metrics. The 95% CIs of both accuracy and AUROC were calculated.
We derived the inference time and power data from Intel Power Gadget 3.5, a software tool that estimates processor power consumption through the Running Average Power Limit (RAPL) interface.
We ran the ML models from the command line and logged the time and power data using PowerLog3.0.exe, a command line version of Intel Power Gadget that records these data for a specified command. In addition, because Intel Power Gadget only provides the energy data of the entire processor, all testing procedures were performed without background programs. The measurement for each algorithm was repeated 100 times.
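A hypothetical sketch of one such measurement run is shown below; the -file and -cmd options follow Intel Power Gadget's command line interface, and inference_script.py is a placeholder for the actual inference command.

```python
# Launch the inference command under PowerLog3.0.exe, which samples
# processor power and writes time-stamped readings to a CSV log.
import subprocess

subprocess.run([
    "PowerLog3.0.exe",
    "-file", "power_log.csv",                  # output log of time and power samples
    "-cmd", "python", "inference_script.py",   # command to be measured
])
```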
We initially employed the Shapiro-Wilk test to check for normality. If the assumption of normality did not hold, we subsequently adopted the Friedman test to compare the means of different groups. The pairwise Wilcoxon signed-rank test was used to identify which groups were different.
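This workflow can be sketched with SciPy as follows; the power arrays are hypothetical stand-ins for the 100 repeated measurements per algorithm.

```python
# Shapiro-Wilk for normality, Friedman across algorithms, and a pairwise
# Wilcoxon signed-rank test, applied to hypothetical power measurements.
import numpy as np
from scipy.stats import shapiro, friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
power_xgb = 9.4 + 0.2 * rng.standard_normal(100)   # 100 repetitions each
power_lr = 10.0 + 0.2 * rng.standard_normal(100)
power_nn5 = 11.5 + 0.3 * rng.standard_normal(100)

_, p_normal = shapiro(power_xgb)                   # normality check
_, p_friedman = friedmanchisquare(power_xgb, power_lr, power_nn5)
_, p_pairwise = wilcoxon(power_xgb, power_lr)      # pairwise comparison
print(p_normal, p_friedman, p_pairwise)
```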
Classification accuracy rates of different algorithms implemented on the mass spectrometry and urinalysis data sets. The black bars indicate the 95% CIs of the classification accuracy. LR: logistic regression; kNN: k-nearest neighbor; SVM: support vector machine; RF: random forest; XGB: extreme gradient boosting; NN1: one-hidden-layer neural network; QNN: quantized five-hidden-layer neural network; PNN: pruned five-hidden-layer neural network; NN5: five-hidden-layer neural network.
AUROC values of different algorithms implemented on the mass spectrometry and urinalysis data sets. The black bars indicate the 95% CIs of the AUROC. LR: logistic regression; kNN: k-nearest neighbor; SVM: support vector machine; RF: random forest; XGB: extreme gradient boosting; NN1: one-hidden-layer neural network; QNN: quantized five-hidden-layer neural network; PNN: pruned five-hidden-layer neural network; NN5: five-hidden-layer neural network; AUROC: area under the receiver operating characteristic curve.
Time consumed in single prediction for the mass spectrometry and urinalysis data sets. LR: logistic regression; kNN: k-nearest neighbor; SVM: support vector machine; RF: random forest; XGB: extreme gradient boosting; NN1: one-hidden-layer neural network; QNN: quantized five-hidden-layer neural network; PNN: pruned five-hidden-layer neural network; NN5: five-hidden-layer neural network.
Power consumption levels of the different algorithms implemented on the mass spectrometry and urinalysis data sets. LR: logistic regression; kNN: k-nearest neighbor; SVM: support vector machine; RF: random forest; XGB: extreme gradient boosting; NN1: one-hidden-layer neural network; QNN: quantized five-hidden-layer neural network; PNN: pruned five-hidden-layer neural network; NN5: five-hidden-layer neural network.
Predictive performance (AUROC)–power consumption plot of the nine algorithms for the mass spectrometry data set. The two tree-based algorithms (RF and XGB) achieved a balanced predictive performance and power consumption. The horizontal and vertical dashed axes indicate the mean energy consumption and mean AUROC of the nine predictive models, respectively. Each algorithm is located in one of the four quadrants. The gray rectangle around each data point denotes the 95% CI of the AUROC and power consumption. LR: logistic regression; kNN: k-nearest neighbor; SVM: support vector machine; RF: random forest; XGB: extreme gradient boosting; NN1: one-hidden-layer neural network; QNN: quantized five-hidden-layer neural network; PNN: pruned five-hidden-layer neural network; NN5: five-hidden-layer neural network; AUROC: area under the receiver operating characteristic curve.
Predictive performance (AUROC)–power consumption plot of the nine algorithms for the urinalysis data set. The two tree-based algorithms (ie, RF and XGB) achieved a balanced predictive performance and power consumption. The horizontal and vertical dashed axes indicate the mean energy consumption and mean AUROC of the nine predictive models, respectively. Each algorithm is located in one of the four quadrants. The gray rectangle around each data point denotes the 95% CI of the AUROC and power consumption. LR: logistic regression; kNN: k-nearest neighbor; SVM: support vector machine; RF: random forest; XGB: extreme gradient boosting; NN1: one-hidden-layer neural network; QNN: quantized five-hidden-layer neural network; PNN: pruned five-hidden-layer neural network; NN5: five-hidden-layer neural network; AUROC: area under the receiver operating characteristic curve.
In this study, we compared the predictive performance, time consumption, and power consumption of nine algorithms using two clinical laboratory data sets. The XGB algorithm achieved a balanced performance with respect to the aforementioned metrics, indicating that the XGB algorithm is ideal for medical artificial intelligence applications with energy constraints.
In addition to this study, previous studies have performed comparative analyses of various ML algorithms in the medical domain.
The RF and XGB algorithms exhibited higher AUROC values than did the other algorithms for both data sets. This finding is similar to those of previous studies. In a study that considered 11 performance metrics, the RF algorithm and probability-calibrated boosted trees exhibited the best performance among 10 algorithms.
In this study, the XGB and LR algorithms exhibited the shortest run times (both 0.47 milliseconds for the mass spectrometry data set; 0.39 and 0.47 milliseconds, respectively, for the urinalysis data set). The SVM and RF algorithms exhibited the highest time consumption levels for the mass spectrometry and urinalysis data sets, respectively. Notably, although the XGB and RF algorithms are ensemble algorithms based on decision trees, the XGB algorithm consumed less time than the RF algorithm. This finding is possibly due to differences in the depth and number of trees between these algorithms. For both data sets, the XGB model had shallower trees than did the RF model (for the mass spectrometry and urinalysis data sets, the maximum depths of the XGB decision trees were 6 and 10, respectively, and the average depths of the RF trees were 29 and 21, respectively). An explanation for this finding is that boosting reduces the bias of weak classifiers; the individual trees in an XGB ensemble can therefore remain shallow, whereas the fully grown trees in an RF must be deep to attain a low bias.
The NN algorithms (NN1, QNN, PNN, and NN5) and the kNN algorithm exhibited the highest power consumption levels in this study, and the two tree-based algorithms (ie, RF and XGB) exhibited the lowest power consumption levels. Tree-based algorithms use the data structure of search trees for making inferences. The inference process mainly involves comparison operations at tree nodes and irregular memory access operations for subtree retrievals. In contrast to several other ML algorithms, tree-based algorithms typically do not use multiplication operations. The comparison and memory access operations in tree-based algorithms consume less energy than do multiplication operations.
NNs have been regarded as the main tools for implementing ML in the last few years. The development of different NN architectures (eg, convolutional NNs and recurrent NNs) has contributed to considerable improvements in unstructured data analyses.
Several methods are available for reducing the power consumption of NNs. NNs have diverse architectures, and constructing an NN with a small architecture is an effective method for improving energy efficiency, as reflected by the difference in power consumption between the NN1 and NN5 models in this study.
In summary, the XGB algorithm achieved balanced predictive performance and energy efficiency levels.
Deep learning models are currently the main ML algorithms in use. These algorithms achieve state-of-the-art predictive performance for unstructured data sets (eg, data sets for computer vision and natural language processing). However, for the structured clinical laboratory data sets examined in this study, the tree-based algorithms achieved comparable predictive performance at a lower energy cost.
This study has some limitations. First, because Intel Power Gadget 3.5 only provides the energy consumption of the entire processor, the measured power may include residual consumption from system processes, even though all testing procedures were performed without background programs.
This study comprehensively compared various ML algorithms in terms of their predictive performance, time consumption, and power consumption when implemented on two clinical laboratory data sets. According to the results, the XGB algorithm attained balanced performance levels in terms of the aforementioned parameters for the two data sets. Thus, the XGB algorithm is ideal for application in real-world clinical settings.
Time complexity of some common algorithms.
Classification performance of nonneural network–based machine learning algorithms implemented on the mass spectrometry and urinalysis data sets.
Classification performance of different neural networks (NNs) implemented on the mass spectrometry and urinalysis data sets.
Inference time and average power consumption levels of nonneural network–based algorithms implemented on the mass spectrometry and urinalysis data sets.
Inference time and average power consumption levels of different neural networks (NNs) implemented on the mass spectrometry and urinalysis data sets.
AUROC: area under the receiver operating characteristic curve
CPU: central processing unit
kNN: k-nearest neighbor
LR: logistic regression
ML: machine learning
NN: neural network
NN1: one-hidden-layer neural network
NN5: five-hidden-layer neural network
PNN: pruned five-hidden-layer neural network
QNN: quantized five-hidden-layer neural network
RAPL: Running Average Power Limit
RF: random forest
SVM: support vector machine
XGB: extreme gradient boosting
This work was supported by Chang Gung Memorial Hospital (Linkou) (CMRPG3J1791, CMRPG3L0401, CMRPG3L0431, and CMRPG3L1011) and Ministry of Science and Technology, Taiwan (MOST 110-2636-E-008-008). This manuscript was edited by Wallace Academic Editing.
HW conceptualized the study. JY, CHC, and TH wrote the manuscript, analyzed the data, plotted the figures, and created the tables. JY performed the experiments. TH, JL, CRC, TL, MW, YT, and HW reviewed and edited the manuscript for important intellectual content. YT and HW obtained funding and supervised the study. All authors discussed the results and revised the manuscript.
None declared.