Original Paper
Abstract
Background: Despite excellent prediction performance, noninterpretability has undermined the value of applying deep-learning algorithms in clinical practice. To overcome this limitation, the attention mechanism has been introduced to clinical research as an explanatory modeling method. However, the potential limitations of using this attractive method have not been clarified for clinical researchers. Furthermore, there has been a lack of introductory information explaining attention mechanisms to clinical researchers.
Objective: The aim of this study was to introduce the basic concepts and design approaches of attention mechanisms. In addition, we aimed to empirically assess the potential limitations of current attention mechanisms in terms of prediction and interpretability performance.
Methods: First, the basic concepts and several key considerations regarding attention mechanisms were identified. Second, four approaches to attention mechanisms were suggested according to a two-dimensional framework based on the degrees of freedom and uncertainty awareness. Third, the prediction performance, probability reliability, concentration of variable importance, consistency of attention results, and generalizability of attention results to conventional statistics were assessed in the diabetic classification modeling setting. Fourth, the potential limitations of attention mechanisms were considered.
Results: Prediction performance was very high for all models. Probability reliability was high in models with uncertainty awareness. Variable importance was concentrated in several variables when uncertainty awareness was not considered. The consistency of attention results was high when uncertainty awareness was considered. The generalizability of attention results to conventional statistics was poor regardless of the modeling approach.
Conclusions: The attention mechanism is an attractive technique that holds considerable promise for the future. However, it may not yet be desirable to rely on this method to assess variable importance in clinical settings. Therefore, along with theoretical studies enhancing attention mechanisms, more empirical studies investigating potential limitations should be encouraged.
doi:10.2196/18418
Keywords
Introduction
In recent years, there has been significant evidence that deep-learning algorithms can outperform other machine-learning algorithms and conventional statistics in the medical field [
, ]. Despite the better prediction accuracy than conventional algorithms, the implications of using deep learning have been limited owing to the inability to explain the models [ , ]. Particularly in the medical environment, where the association between a disease and symptoms must be identified to provide adequate treatments, the interpretability of models is very important [  ]. To overcome these limitations, interpretable deep-learning algorithms such as Shapley Additive Explanations (SHAP), Local Interpretable Model-agnostic Explanations (LIME), and attention mechanisms have been introduced [  ]. What these three methodologies have in common is that they provide interpretability in the form of variable importance [  ]. The difference between them is that with SHAP or LIME, variable importance is measured through simulations that change the data after model training is completed [ , ], whereas under attention mechanisms, variable importance is inferred during model training, which improves model performance by weighting several important variables [ , ].

Based on this advantage, attention mechanisms have started to gain appeal in the clinical research field [
]. However, there is a gap between the application of attention mechanisms in clinical research and up-to-date attention algorithms in development. Specifically, most recent attention studies have focused on improving the theoretical robustness, design approaches, and model accuracy of attention mechanisms [  ]. However, clinical researchers are more interested in the potential limitations that may arise when attention mechanisms are applied, and in how they may differ from conventional statistics, than in how robust and sophisticated the attention mechanisms being developed are. A few studies have introduced the potential limitations of attention mechanisms [ , ]. However, these studies have been theoretical, making it difficult for clinical researchers to understand and accept the results. Thus, it is increasingly necessary to provide a discussion of what clinical researchers should be aware of when applying the new concept of attention mechanisms in their research.

With the goal of reducing this gap, the aim of this study was to evaluate attention mechanisms in terms of prediction performance and interpretability. In addition, there remains a lack of guidance for clinical researchers in the implementation of attention mechanisms; therefore, to facilitate understanding, this study first provides basic concepts, key considerations, and code for attention mechanisms. Finally, a case analysis was performed in a cross-sectional and structured data environment, which is the simplest possible data setting for clinical researchers.
This study was conducted according to the following procedure. First, the scope of the study was established in terms of the data structure, and a brief introduction to attention mechanisms was provided along with several important considerations. Second, based on previous research, a two-dimensional framework was established to guide the four modeling approaches to attention mechanisms. Third, five empirical tests with attention mechanisms were performed using the four models: prediction performance, probability reliability, concentration of variable importance, consistency of attention results, and generalizability of attention results to conventional statistics. Finally, potential limitations that may arise when using attention mechanisms were identified.
Methods
Research Scope
Since the design approaches of attention mechanisms differ greatly depending on the data structure, the scope of this study was established in terms of data structure. Specifically, attention mechanism research in the medical field can be divided into two main categories from a data point of view. The first category is an unstructured data area where data containing natural language and images cannot be stored in a row and column table structure [
, ]. In the field of natural language processing, attention mechanisms have been applied to determine the relationships between words or between words and diseases in clinical notes [ , , ]. In the image area, attention mechanisms have been used to highlight which parts of clinical images were related to clinical events, or to annotate the images [  ]. The second category is a structured data area where data can be organized in table formats with a row and column structure [ ]. In this area, attention mechanisms have been applied to electronic health records to determine variables that are strongly associated with clinical events [ , , ].

Structured data familiar to clinical researchers are widely applicable to most statistical analyses, including linear regression analysis and analysis of variance (ANOVA). Since one purpose of this study was to compare the results of attention mechanisms and conventional statistical methods, the scope of the study was limited to structured data. Furthermore, most previous attention mechanism studies using structured data have been conducted in time-series settings [
, , ]. However, this study was conducted in a cross-sectional data setting, which is simpler and easier than a time-series data setting, and can therefore help readers less familiar with attention mechanisms to better understand the results of the case study.

Introduction to Attention
Concepts of Attention Mechanisms
Attention, one of the layers in a neural network model, quantifies the importance of input variables in terms of their impact on outcomes (
Attention is mostly calculated based on the Softmax function (Notation A1 in ), such that each node in the layer has a value between 0 and 1 and the sum of all node values must be 1 [ , , ]. When the node size of attention is equal to the number of input variables, the influence of the input variables can be transferred toward the model outcome by multiplying the attention values with the corresponding input variables (Context layer in A) [ , , ]. Accordingly, in the case of binary classification, all values in the context layer are summed together to produce a single value (Summation in A). A more efficient model design without context layers is possible through the dot product between the input and attention values ( B). Finally, the single value may be converted to a value between 0 and 1 through the sigmoid function (Sigmoid transformation in A and B).

The attention value of a certain variable indicates the relative importance of that variable compared with that of other variables. If the attention value of a particular variable is large, a large influence of that variable is transmitted when predicting the outcome variable. As an extreme example, when the attention value of a variable is 1, only that variable is used to predict the outcome variable, whereas if the attention value of a variable is 0, that variable is not used to predict the outcome variable.
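The computation described above can be sketched in a few lines of numpy. This is an illustration of the mechanism only, not the Keras implementation referenced in Codes A1; the input values and attention scores below are arbitrary, untrained numbers chosen for demonstration.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())          # numerically stable softmax
    return e / e.sum()               # values in [0, 1] that sum to 1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_forward(x, scores):
    """Single-attention forward pass for binary classification.

    x      : input variables, shape (n_vars,)
    scores : raw scores feeding the attention layer, shape (n_vars,)
    """
    a = softmax(scores)              # attention values (one per input variable)
    context = a * x                  # context layer: importance-weighted inputs
    summed = context.sum()           # summation (equivalently np.dot(a, x))
    return sigmoid(summed)           # predicted probability in (0, 1)

x = np.array([0.2, 0.9, 0.4])        # hypothetical normalized inputs
scores = np.array([0.1, 2.0, -1.0])  # untrained scores, illustration only
p = attention_forward(x, scores)
```

Note the extreme case from the text: pushing one score far above the others drives its attention value toward 1, so the prediction depends on that variable alone.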
A shows the basic architecture of a model with attention mechanisms; the code for the model implemented in Keras is provided in Codes A1 of .

Consideration in Attention Modeling
Attention mechanisms can be implemented in various ways, because the key feature of deep-learning modeling is that users can freely design the structure [
, ,  ]. However, there is one especially important consideration when implementing attention mechanisms: in some cases, the influence of variable importance in the context layer can be distorted. For instance, if all w_{i} values are close to 0, the value of C_{1} has a minor effect on the next layer even if that value is the highest in the context layer ( C). Moreover, even if the value of C_{2} is the lowest, if all v_{i}s have very large positive values, a large influence of C_{2} can be passed to the next layer ( C). As such, context values can be skewed as they are computed through a weight matrix in the process of being passed to the next layer (ie, Dense layer in C). As a result, the skewed effects can be propagated to the model output if the output is inferred from that layer. Therefore, it is very important to design a structure in which the outputs are not computed through weight matrices [ , ].

Modeling Approaches
Although deep-learning models can be developed in various ways depending on developer preference, two approaches have been commonly applied in recent attention studies: increasing the degrees of freedom and uncertainty awareness (UA).
Increase in the Degrees of Freedom
The mechanism for increasing the degrees of freedom is to design multi-attention layers; representative algorithms that reflect such a mechanism include the transformer and bidirectional encoder representations from transformers (BERT) [
, ]. Our intuition regarding the effectiveness of the mechanism relies on the idea that models can learn the importance of input variables from various perspectives [ , ]. Given the randomness inherent in deep-learning training, the result from one attention layer can be unreliable. However, the multi-attention model offers multiple result sets with variable importance so that a reliable set of results may compensate for an unreliable set. Consequently, models in which multi-attention layers are applied have recently shown better performance than other models [ ,  ].

UA
Deep-learning algorithms are not free from the uncertainty issue, which concerns the fact that prediction results have the potential to be incomplete in terms of accuracy and consistency [
,  ]. The major sources of uncertainty include data with noise and omissions, the complexity of the model associated with the parameters (ie, number of weights and type of activation functions), and the structures (ie, degree of depth) [ , ]. One way to alleviate this issue is to consider the presence of uncertainty in modeling [ , , , ]. Specifically, we may assume that node values (ie, attention values) in a certain layer come from a distribution with a mean (μ) and a variance (σ^{2}; D) [ , , , ]. A normal distribution (ie, a Gaussian distribution), which is theoretically clear and can be computed efficiently, is often assumed [ , , , ]. Based on this assumption, values with high probability are estimated, which may mitigate the random nature of deep-learning training [ ,  ]. A representative model designed under these assumptions is the variational autoencoder [ , ].

Framework for Empirical Tests
Based on the discussion above, two directions (ie, degree of freedom and UA) were considered for attention modeling (
). In this two-dimensional framework, four cases were suggested for empirical tests ( ). Degree of freedom is related to model structures and UA is related to the estimation approach.

Empirical test entries for the four models in the framework were categorized into two broad categories: outcome and attention (
). In terms of model outcome, a receiver operating characteristic (ROC) test, which expresses model accuracy based on the relationship between sensitivity and specificity, was employed for prediction performance [ ]. In addition, the performance of probability reliability, which measures the degree of agreement between predicted and actual probability, was assessed using a reliability diagram and Brier scores [  ].

In terms of attention, the degree to which variable importance was concentrated in particular variables was measured (ie, Concentration in
). The Herfindahl index, which represents the degree of concentration with values ranging from near 0 (least concentrated) to 1 (most concentrated), was employed for this measure [ ]. Furthermore, correlation analysis was conducted to evaluate the consistency of attention results between multiple instances. Lastly, the generalizability of attention results was tested in two ways. First, the variable effect sizes obtained from conventional statistical methods (t test, Cohen d; chi-square test, Cramer V) were compared with variable importance (ie, attention values) [ , ]. For clear comparison from a clinical point of view, only the top 5% of the variables in terms of effect size (ie, conventional methods) and variable importance (ie, attention) were compared. Second, regression analysis was used to determine the overall relationship between attention values and effect sizes from conventional methods.

Entries (measures)          Methods
Outcome
  Prediction performance    ROC test (area under the curve)
  Probability reliability   Reliability diagram; Brier score
Attention
  Concentration             Herfindahl index
  Consistency               Correlation analysis among fold sets
  Generalizability          Comparison with effect sizes (Cohen d, Cramer V); regression analysis
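The UA estimation approach, in which attention values are drawn from a Gaussian distribution, can be sketched as follows. This is a minimal numpy illustration of the reparameterization trick used in variational autoencoders; the μ and log-variance values are hypothetical, and the KL term shown is the standard Gaussian form, not necessarily the exact cost function given in Notation A2.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def ua_attention(mu, log_var, n_samples=100):
    """Sample attention values from N(mu, sigma^2) via z = mu + sigma * eps
    (the reparameterization trick), then average over Monte Carlo samples
    for a stable estimate."""
    sigma = np.exp(0.5 * log_var)
    samples = []
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)  # fresh noise per trial
        z = mu + sigma * eps                 # sampled layer values
        samples.append(softmax(z))           # attention values for this trial
    return np.mean(samples, axis=0)

def kl_term(mu, log_var):
    """KL divergence between N(mu, sigma^2) and a standard normal prior;
    the extra similarity term added to the cost function of UA models."""
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

mu = np.array([0.5, 0.0, -0.5])              # hypothetical learned means
log_var = np.array([-2.0, -2.0, -2.0])       # hypothetical learned log-variances
att = ua_attention(mu, log_var)
```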
Model Specifications
Model Designs
Four models were developed according to the framework presented in
. The letters A, B, C, and D represent quadrants on the framework that correspond to the letters representing the model designs in and . Model A (without any uncertainty considerations) has only a single attention layer ( A). The basic design of model B is the same as that of model A; however, it differs in that it has additional layers for UA (see layers with μ, σ^{2}, and z in B). Thus, attention values in model B were estimated from the Gaussian distribution [ , ].

The two models in which the degree of freedom is considered are presented in
. The difference between models C and D is that uncertainty in attention estimation is considered in model D (see layers with μ, σ^{2}, and z in D).

Since these models have multi-attention layers (ie, Local attention in
) with a heuristic size of 20, multiple attention sets are estimated. Thus, a novel structure was designed to convey the multiple values in the direction of model outcomes. Specifically, a context layer was created as the dot product of the local attention layer and the input layer (see Context layer in ). Each value on the context layer represents the summed impact of the corresponding local attention layer. Next, a "Weights of each local attention layer" was formed, whose role is to weigh (with weights between 0 and 1) the summed impact values in the context layer ( ). Lastly, the outcome layer was created as the dot product of the weights of each local attention layer and the context layer.

This somewhat complex structure ensures that the influence of one variable is passed only once to the model outcome, even if attention values are inferred multiple times (20 times in this case). Furthermore, using both the local attention layer and the weights corresponding to each vector, a unique attention value for each variable can be obtained, which facilitates interpretation.
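The dot-product structure described above can be mirrored in numpy as a sketch. Eight input variables and untrained random scores are assumed purely for illustration; the study models use 20 local attention layers over the full variable set, and the actual Keras implementations are in Codes A2-A5.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(v, axis=-1):
    e = np.exp(v - v.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n_vars, n_heads = 8, 20                        # 20 local attention layers

x = rng.random(n_vars)                         # one observation (hypothetical)
scores = rng.standard_normal((n_heads, n_vars))  # untrained raw scores

local_att = softmax(scores, axis=1)            # each of the 20 rows sums to 1
context = local_att @ x                        # summed impact per local layer, shape (20,)
head_w = softmax(rng.standard_normal(n_heads))  # "weights of each local attention layer"
outcome = head_w @ context                     # each variable's influence passed exactly once
prob = 1.0 / (1.0 + np.exp(-outcome))          # sigmoid transformation

# a unique (global) attention value per variable, for interpretation
global_att = head_w @ local_att                # shape (n_vars,), still sums to 1
```

By associativity, `head_w @ (local_att @ x)` equals `(head_w @ local_att) @ x`, which is why a single global attention vector per variable can be read off without changing the outcome.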
Graphical and mathematical notations are provided for obtaining a set of unique values (global attention in Figure A1 of
. In addition, details of the four models are provided as Keras codes in Codes A2-A5 of .

Settings for Rigorous Analysis
A 10-fold test was performed to assess the empirical test entries. The dataset was divided into 10 test sets (10% of the total set each) and 10 training sets (90% of the total set each). The training sets were then subdivided, with 80% used directly for model training and 20% for validation.
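The splitting scheme can be sketched as follows. This is a plain numpy version for illustration; the study's exact fold assignment is not published here, and the sample size is taken from the Results section.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 33065                                    # sample size reported in the study
idx = rng.permutation(n)
folds = np.array_split(idx, 10)              # 10 test sets of ~10% each

splits = []
for k in range(10):
    test = folds[k]                          # held-out 10%
    rest = np.concatenate([folds[j] for j in range(10) if j != k])
    cut = int(len(rest) * 0.8)               # 80% of the remaining 90% for training
    train, val = rest[:cut], rest[cut:]      # 20% of the remainder for validation
    splits.append((train, val, test))
```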
Entries related to the model outcome (
) were evaluated using all predicted probabilities of the entire sample. In other words, all of the values estimated from the 10 test sets were combined into one total set, which was then used for testing. Entries related to model attention were assessed at the level of fold sets. Specifically, all of the estimated attention values were aggregated so that each of the 10 fold sets had a representative value. For rigorous estimation, both outcome and attention were estimated through Monte Carlo simulations with 100 trials [ , ]. Detailed procedures for estimating outcome and attention values are provided in the form of pseudo-algorithms (Algorithms A1-A3) in .

Cost Functions
The binary cross-entropy function, which is generally used in binary outcome settings [
, ], was employed for models A and C, where UA was not considered. However, the loss functions for models B and D, given their UA, were specified differently.

The UA models assume that the model outcome is dependent on the normal distribution (ie, layers with μ, σ^{2}, and z), which infers the attention value [
, , ]. Therefore, the distribution associated with attention should be considered in the cost function. The cost function under these assumptions was derived through Bayesian inference theory [ , , , ]. According to this theory, the network weights should be learned so that the distributions in the z layer generated by the weights (see z layer in B) become similar to the true distributions in the z layer [ , , , ]. Therefore, the cost function for UA models consists of two terms: the loss associated with the model outcome and the degree of similarity associated with the z distribution [ , , , ]. The cost function, with its description, is presented in Notation A2 of .

Learning Environments and Parameters
Attention models were developed, learned, and tested on Keras 2.3.1, TensorFlow 2.1.0, and Python 3.7.6. Adam with a learning rate of 0.001 was employed as the optimizer to train all models. A training dataset with a batch size of 5000 was provided to the model. An early-stop rule was applied to stop training at the optimal epoch; thus, model training was terminated when the loss value of the validation set did not improve further during the last 1200 epochs. Other details about activation functions and the structure of nodes and layers are provided in Code A2 to Code A5 of
. For the effect sizes of conventional statistical methods, the values for Cohen d and Cramer V were obtained from researchpy 0.2.3, a third-party Python library. Additionally, regression analysis was performed in Stata 13, a commercial statistical analysis software package.

Data
The case analysis was performed in a setting where the relationship between a disease and other variables is well established: an 8-year (2010 to 2017 inclusive) cumulative Korea National Health and Nutrition Examination Survey dataset, which assesses the nutrition and health status of Koreans and collects information about major chronic diseases such as metabolic syndrome and diabetes [
]. Since the association between diabetes and other variables has been well established through prior studies using these data, this selection facilitated a clear assessment of the empirical test results of this study [  ]. Thus, a diabetes diagnosis (1=diabetes, 0=no diabetes) was set as the outcome variable for the four attention models. The subjects were classified as having or not having diabetes based on whether they were diagnosed by a doctor, or received diabetes medication or insulin injections. Fasting blood glucose level, which is a very strong indicator for diagnosing diabetes, was intentionally used as an input variable to evaluate the power of the attention mechanism for determining important variables.

In the 8-year cumulative data, only variables with consistent labels during that period were included. Variables with no change in value, variables with an omission rate above 50%, and variables containing subject identification information were excluded from the study set. Categorical variables of both nominal and ordinal types were integerized using integer encoding [
]. In other words, the class labels of each categorical variable were converted into integers. Missing values were encoded as the extreme value 99,999. Since deep-learning algorithms can learn nonlinear relations among variables [ ], these encoding approaches are workable and efficient in settings where preprocessing is demanding owing to many variables. All values of the input variables, except for the missing value indicator (ie, 99,999), were normalized to be between 0 and 1 and were then fed into the deep-learning models.

Results
Data Preprocessing
There were 238 variables with consistent labels in the 8-year cumulative dataset, of which 128 were selected by preprocessing: 22 variables with no change in value, 84 variables with more than 50% missing values, and 4 variables containing identification information were excluded from the analysis. The total number of observations (ie, the number of subjects) was 33,065, with an average age of 48.89 years and with men accounting for 40.41% (n=13,361) of the sample. Only 6 variables had no omissions, and the average missing rate of variables with omissions was 10.38%.
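The integer encoding, missing-value flagging, and normalization steps described in the Data section can be sketched as follows. The helper names and example column are ours, not from the study code; the 99,999 missing indicator matches the paper.

```python
import numpy as np

MISSING = 99999.0  # extreme value used to encode omissions

def integer_encode(col):
    """Map the class labels of a categorical variable to integers;
    None marks an omission."""
    labels = sorted({v for v in col if v is not None})
    lut = {v: i for i, v in enumerate(labels)}
    return np.array([lut[v] if v is not None else MISSING for v in col])

def minmax_normalize(col):
    """Scale observed values to [0, 1]; the missing indicator is left as-is."""
    mask = col != MISSING
    lo, hi = col[mask].min(), col[mask].max()
    out = col.copy()
    out[mask] = (col[mask] - lo) / (hi - lo) if hi > lo else 0.0
    return out

raw = ["a", "c", None, "b", "a"]     # hypothetical categorical column
col = minmax_normalize(integer_encode(raw))
```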
Outcome
Prediction Performance
represents the results of the ROC test and area under the curve (AUC) values of the five models. The results are based on the combined sets of predicted probabilities of the 10 test sets of each model. According to the AUC results, the accuracy of the five models in terms of sensitivity and specificity was excellent overall. The AUCs of the base model without the attention mechanism, the single attention model, the multi-attention model, the single attention model with UA, and the multi-attention model with UA were 0.977, 0.948, 0.968, 0.965, and 0.976, respectively.
Probability Reliability
shows the performance of probability reliability for the four models in the form of a reliability diagram. A characteristic of the UA models is that most fractions of positives were plotted above the diagonal. By contrast, models without UA showed more fractions of positives below the diagonal than the other models. The fraction of positives of the multi-attention with UA model was the closest to the diagonal. The Brier scores of the single attention, multi-attention, single attention with UA, and multi-attention with UA models were 0.018, 0.0171, 0.0148, and 0.0142, respectively.
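A Brier score and the binned fractions of positives plotted in a reliability diagram can be computed as follows. This is a minimal numpy sketch with toy labels and probabilities, not the study's aggregation over Monte Carlo trials.

```python
import numpy as np

def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome."""
    return np.mean((p_pred - y_true) ** 2)

def reliability_curve(y_true, p_pred, n_bins=10):
    """Mean predicted probability and fraction of positives per bin,
    the quantities plotted against the diagonal in a reliability diagram."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(p_pred, edges[1:-1]), 0, n_bins - 1)
    mean_pred, frac_pos = [], []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():                       # skip empty bins
            mean_pred.append(p_pred[mask].mean())
            frac_pos.append(y_true[mask].mean())
    return np.array(mean_pred), np.array(frac_pos)

y = np.array([0, 0, 1, 1])                   # toy outcomes
p = np.array([0.1, 0.2, 0.8, 0.7])           # toy predicted probabilities
bs = brier_score(y, p)
```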
Attention
Concentration
shows stacked Herfindahl indices sorted by value size for the 10-fold sets in each model. The values for each fold are presented in Table A1 of . In general, models without UA showed relatively large Herfindahl indices: the average Herfindahl index values for the single attention and multi-attention models were 0.236 and 0.048, respectively. However, models with UA had very small values regardless of the degree of freedom: the average Herfindahl index values for the single attention with UA and multi-attention with UA models were both 0.01. These results indicate that influence is more concentrated in several variables in models where uncertainty is not considered than in those where uncertainty is considered.
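The Herfindahl index used above can be computed directly from a set of attention values; a sketch over 128 variables, matching the study's variable count (the two example vectors are constructed extremes, not study results):

```python
import numpy as np

def herfindahl(attention):
    """Herfindahl index of a set of attention values: the sum of squared
    shares, near 1/n when importance is spread evenly over n variables
    and 1 when a single variable dominates."""
    shares = attention / attention.sum()
    return np.sum(shares ** 2)

even = np.full(128, 1 / 128)            # importance spread over all variables
peaked = np.zeros(128)
peaked[0] = 1.0                         # all importance on one variable
```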
Consistency
shows histograms for the 45 correlations ({[10×10]–10}/2) among the 10-fold sets for each model. In general, the correlations of the fold sets from the models with UA were higher than those from the models without UA. The average correlations of the fold sets from the two models without UA were close to zero (ie, 0.01 and 0.1), whereas the average correlations from the two models with UA were 0.99 and 0.66, respectively.
Generalizability
shows the results of the variable importance learned by the attention models and the effect sizes measured by conventional statistical methods. Definitions of each variable are provided in Table A2 of . The top 5% (127×0.05≈6) of variables, sorted by the magnitude of values obtained from each method, are reported. Since Cohen d from the conventional methods may take on negative values, the absolute value was applied when sorting. Overall, the models in which uncertainty was not considered were trained to have high attention values. Furthermore, variables such as "allownc" (whether to receive a basic living allowance) and "house" (whether to have own house), which bear little relation to health status, were included in the results.
Variable^{a}  Attention value^{b}/Effect size^{c}  
Single attention model  
sm_presnt  0.0615  
BD1  0.0519  
pa_walk  0.0479  
HE_Upro  0.0468  
HE_alt  0.0430  
Npins  0.0413  
Multi-attention model  
HE_HB  0.041  
Sex  0.033  
pa_walk  0.032  
HE_HBsAg  0.027  
allownc  0.026  
HE_sbp  0.026  
Single attention model with UA^{d}  
pa_walk  0.050  
BH9_11  0.017  
HE_THfh2  0.011  
HE_THfh1  0.010  
HE_THfh3  0.010  
DI5_dg  0.010  
Multi-attention model with UA  
pa_walk  0.050  
HE_THfh2  0.019  
BH9_11  0.015  
HE_THfh1  0.013  
HE_ast  0.010  
house  0.010  
Conventional statistics^{e}  
age  1.536  
HE_glu  1.214  
HE_HbA1c  –1.014  
Wt_pool_1  –0.516  
Wt_itvex  –0.516  
HE_Uglu  0.319 
^{a}See Table A2 in
for variable label descriptions.
^{b}Average over the 10 fold sets.
^{c}Effect size is presented only for the conventional statistics.
^{d}UA: uncertainty awareness.
^{e}Null hypothesis of categorical variables=no relationships between diabetes and a categorical variable; null hypothesis of continuous variables=no difference in variables between the diabetes and no diabetes groups.
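The effect sizes reported for the conventional statistics above can be reproduced with the standard formulas for Cohen d (pooled standard deviation) and Cramer V (from the chi-square statistic). The helper functions below are our own numpy sketch, which should mirror what the researchpy library computes.

```python
import numpy as np

def cohens_d(x1, x2):
    """Standardized mean difference between two groups,
    using the pooled standard deviation."""
    n1, n2 = len(x1), len(x2)
    s_pooled = np.sqrt(((n1 - 1) * np.var(x1, ddof=1) +
                        (n2 - 1) * np.var(x2, ddof=1)) / (n1 + n2 - 2))
    return (np.mean(x1) - np.mean(x2)) / s_pooled

def cramers_v(table):
    """Effect size for a contingency table, derived from chi-square."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    expected = np.outer(table.sum(1), table.sum(0)) / n
    chi2 = np.sum((table - expected) ** 2 / expected)
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))
```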
shows the overall relationship between the effect sizes of variables and variable importance. Since the attention values from the single attention and multi-attention models with UA had a high correlation (0.943), two regression models in which these two variables did not overlap were specified. The regression results showed no association between the variable importance from the attention models and the effect sizes of variables from the conventional methods.
Regression model variables^{a}  Regression 1  Regression 2  
Coefficient  P > t  Coefficient  P > t  
Single attention  −0.696  0.635  −0.668  0.652 
Multi-attention  1.897  0.487  2.019  0.476 
Single attention with UA^{b}  −1.351  0.797  —^{c}  — 
Multi-attention with UA  —  —  −1.541  0.772 
Intercept  0.075  0.09  0.075  0.075 
^{a}The dependent variable is the absolute value of effect size, calculated by Cohen d for continuous variables and Cramer V for categorical variables. The total number of observations is equal to the number of variables.
^{b}UA: uncertainty awareness.
^{c}—: variable not included in the regression model.
Discussion
Principal Findings
Reliability
A difference in performance according to the degree of freedom was prominent in the probability reliability diagram (
). The fraction of positives located above the diagonal indicates that probabilities are predicted to be larger than expected, while the fraction of positives located below the diagonal means that probabilities are estimated to be smaller than expected [ , ]. In this regard, overall probabilities from the two attention models without UA tended to be underestimated, whereas the attention models with UA tended to overestimate probabilities. Although no clear causal relationship has been identified, several lines of empirical evidence suggest that over- and underestimation are associated with data noise, estimation methods, and parameter settings [ ,  ].

Since the difference appeared along the UA axis, the over- or underestimation tendency of the models may be related to UA. Furthermore, the Brier scores of the two models with UA were smaller than those of the two models without UA, indicating that models with UA tend to estimate more reliable probabilities than models without UA. These findings are consistent with the results of recent research that estimated reliable outcomes with an emphasis on UA [
,  ]. Theoretically, the most probable values are inferred from a distribution that takes means and variances into account under UA [ ,  ]. Thus, awareness of uncertainty may bring reliability to the prediction results of deep-learning models, which are vulnerable to randomness during the learning process.

Consistency
UA produced noticeable differences in the consistency of results and the concentration of variable importance. Specifically, in UA models with low Herfindahl indices, variable importance appeared to be distributed over many variables, in contrast to models that did not consider uncertainty (
). In addition, high correlations between the 10-fold sets were found in the attention results from the UA models, whereas no correlations were found in the results from the models without UA ( ). Furthermore, the attention values in the models with UA were generally smaller than those of the models without UA.

These results suggest that the consistency of results from the UA models is high because variable importance with overall low values is distributed evenly over most variables. This result is closely associated with the assumption that attention values were estimated based on a normal distribution within the cost function (see the equation for the Kullback–Leibler divergence D_{KL} in Notation A2 of
). According to this equation, as both μ and σ^{2} approach zero, the model parameters forming the normal distribution approximate the true theoretical distribution, indicating that the models are well learned [ , ]. Consequently, the overall attention values were small because the overall values of μ were small.

Spurious Correlations
As with conventional statistical methods, the attention models were unable to control spurious correlations during attention learning. Specifically, of the top 5% of variables obtained from conventional statistics, wt_pool_1 (interview weight combined years) and wt_itvex (interview weight for a single year) have little to do with health status (
). These variables are weights that compensate for errors due to differences in the number of households and populations between the sample design time and the survey time. In addition, the variables "allownc" (whether to receive a basic living allowance) and "house" (whether to have own house) were obtained from the attention models ( ). These results may suggest spurious correlations in the dataset itself [ ]. In other words, these variables, with little relation to diabetes, have a relatively close relationship with diabetes only by chance.
In terms of clinically relevant variables, no significant association between the results of conventional statistics and attention models was found. Specifically, the variables age, HE_glu (fasting blood sugar), HE_HbA1c (hemoglobin A1c), and HE_Uglu (urine glucose) selected by the conventional methods are well known to have a direct association with diabetes [
 ]. In addition, several variables obtained by the attention models, including pa_walk (amount of walking) and BH9_11 (vaccination status against influenza virus), are less directly related to diabetes. These variables may represent behavioral characteristics of patients with diabetes who are trying to manage their health.

Furthermore, there was no intersection between the variables selected by the attention models and those selected by conventional statistical methods (
). In particular, HE_glu, which was intentionally included as an input variable for testing purposes, was not identified as a major variable by the attention mechanism models, in contrast to the conventional statistical methods. Additionally, no variable was statistically significant in the regression analysis that evaluated the positive association between attention values and effect sizes ( ). Taken together, these results suggest that the variable importance obtained from attention mechanisms may not generalize to the effect sizes of conventional statistics.

Lessons from the Findings
Hyperparameters
The model structure and the weights of the terms in the cost function are hyperparameters to be adjusted. In terms of the model structure, the degree of freedom of the attention layers was evaluated by comparing two extreme cases: 1 attention layer and 20 attention layers. Although the number of attention layers did not make a significant difference here, the results could differ significantly when the number of attention layers varies under other conditions.
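To make the notion of an attention layer concrete, the following is a minimal sketch of a single feature-wise attention layer, not the exact architecture used in this study; `score_weights` stands in for a hypothetical learned parameter. Each input variable receives a score, the scores are normalized into importance weights, and the input is reweighted accordingly.

```python
import math

def softmax(scores):
    # Numerically stable softmax: exponentiate shifted scores, then normalize.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def feature_attention(x, score_weights):
    # Score each input variable, normalize the scores into attention
    # weights (summing to 1), and reweight the input accordingly.
    scores = [w * xi for w, xi in zip(score_weights, x)]
    alpha = softmax(scores)  # variable-importance weights
    attended = [a * xi for a, xi in zip(alpha, x)]
    return attended, alpha
```

Stacking several such layers, with the output of one layer feeding the next, is one way the degree of freedom discussed above can be increased.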
Furthermore, by taking uncertainty into account in the models, a term (ie, the degree of similarity associated with the normal distribution) was added to the cost function. However, as discussed previously, this term may interfere with the assignment of high importance values to variables by making all μ values small. To alleviate this issue, the weight of the term may be lowered so that the term is reflected less during model training [
 , , ]. However, hyperparameter tuning is not conducted on a theoretical basis but rather on a heuristic basis. In other words, there is no standard for how many attention layers should be specified or for how much the weight should be adjusted to obtain better results. If the goal of building the model is to maximize accuracy, various hyperparameter settings can be tested in the direction of increasing model accuracy. However, there is no clear criterion for maximizing interpretability performance. In other words, although various hyperparameter settings can be tested, finding the best-optimized hyperparameter setting from a statistical point of view is challenging. Therefore, the variable importance should be understood in a limited way, only within the framework of this experiment.
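The structure of the cost discussed above can be sketched as follows. This is a minimal illustration, assuming a standard-normal prior on the attention distribution (the exact cost function used in this study is given in Notation A2 of the multimedia appendix); `beta` stands for the adjustable weight of the similarity term.

```python
import math

def kl_to_standard_normal(mu, sigma2):
    # Closed-form KL divergence D_KL(N(mu, sigma2) || N(0, 1)) --
    # the similarity term added to the cost function under UA.
    return 0.5 * (sigma2 + mu ** 2 - 1.0 - math.log(sigma2))

def attention_cost(cross_entropy, mu, sigma2, beta=1.0):
    # Total cost = prediction loss + weighted similarity term.
    # Lowering beta relaxes the pull of mu toward zero, allowing
    # individual variables to take on larger attention values.
    return cross_entropy + beta * kl_to_standard_normal(mu, sigma2)
```

Raising `beta` shrinks the attention values overall, whereas lowering it lets a few variables stand out; this is the trade-off behind the heuristic weight adjustment described above.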
Potential Limitations of Interpretability
There was no significant association between the variable importance obtained from the attention mechanism and the effect size obtained from conventional statistics. One of the most probable reasons for this result is that the assumption of the association among input variables is different between conventional methods and deeplearning algorithms. Specifically, conventional statistical methodologies such as linear regression analysis and ANOVA basically estimate effect sizes based on the assumption of independence between input variables [
 , ]. Thus, if a particular input variable has nothing to do with an outcome variable, that variable has little effect on the outcome. In contrast, neural network–based algorithms, including deep learning, infer outcome variables by taking the dependencies between input variables into account [ , ]. Therefore, a variable that is not directly related to an outcome variable, but is associated with other variables that are, may have a somewhat greater effect on the outcome variable. Owing to these differences, attention results should not be assumed to have meanings and tendencies similar to the variable effect sizes obtained from conventional methodologies.

Recent new technologies such as sensors (ie, wearable devices or facilities in operating rooms) have produced new types of data. Since the associations between variables in such data have not yet been fully explored, relying solely on attention mechanisms may lead to the false judgment that variables with minimal association with the outcome variable are important. Hence, it is advisable to consider the results of attention and conventional statistics together.
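This dependency effect can be illustrated with a toy simulation (all variable names are hypothetical): z has no direct effect on the outcome y, yet it correlates with y because it shares variation with x, the variable that actually drives the outcome.

```python
import random

def pearson(a, b):
    # Sample Pearson correlation coefficient.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

random.seed(0)
x = [random.gauss(0, 1) for _ in range(5000)]    # directly drives the outcome
z = [xi + random.gauss(0, 1) for xi in x]        # related to y only through x
y = [2 * xi + random.gauss(0, 0.1) for xi in x]  # outcome depends on x alone

# z correlates with y despite having no direct effect on it.
r = pearson(z, y)
```

With these parameters the theoretical correlation between z and y is roughly 0.7 even though y was generated from x alone; a model that conditions on dependencies between inputs can therefore assign substantial importance to such an indirectly related variable.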
Furthermore, in situations where there is a spurious correlation, neither method provides good explanatory power. Spurious correlations can only be eliminated through data preprocessing based on domain knowledge. Hence, care must be taken when implementing both attention models and conventional statistical methods in environments with numerous variables that cannot be preprocessed (ie, included or excluded) using definite knowledge. In particular, finding new features using attention mechanisms may not be adequate in environments where the data are susceptible to spurious correlations owing to a large number of variables but few observations, such as in the field of genetic engineering [
 , ]. In such cases, it may be appropriate to employ the results of attention mechanisms to reaffirm existing findings from previous research or to support informed knowledge.

Future Direction for Medical Informatics
The results of this study provide several points of guidance for future research in the medical field. First, more empirical evidence should be secured based on various structures in terms of the degree of freedom. It may be desirable to test what attention results are produced when different degrees of freedom are employed. In particular, given that the medical field has various data types, such as images, natural language, and numerical values, attention results should be assessed according to the degree of freedom with consideration of the data characteristics [
 ].

Second, attention models with more sophisticated UA should be tested. In this study, the model outcome variable was assumed to depend on the distribution of the attention layer; that is, P(diabetes|z). However, current state-of-the-art Bayesian estimation assumes that the model outcomes depend on all network weights and the data; that is, P(outcome|z, weight, data) [
 , ]. Thus, it is necessary to evaluate how the variable importance is formed when more up-to-date estimation methods are applied.

Third, more research that strictly evaluates variable importance based on attention mechanisms across diverse disease domains is needed. As found in this study, attention has limitations in terms of generalizability to conventional statistics and control of spurious correlations. However, since this case study was conducted with a single cohort of Korean patients with diabetes, more empirical evidence from various cohorts or diseases should be examined to confirm whether attention mechanisms provide significant meaning. Importantly, for elaborate empirical research, a more in-depth understanding of the association between covariates and health outcomes is needed. Hence, domain experts on specific diseases, along with data scientists, should be actively involved in these studies.
Fourth, methods for controlling the distribution of variable importance should be studied. As revealed in this analysis, the variable importance can be distributed over many variables or concentrated on a few variables depending on the model structure (
). When examining the overall relationships between covariates and health outcomes, such as in a comprehensive review of national health status [ ], it may be desirable to detect many potentially important variables. By contrast, when the relationship between a small number of key variables and the outcome is important, such as in the development of targeted therapy [ , ], the importance should be focused on a few variables. However, to the best of our knowledge, most existing attention studies have not considered controlling the distribution of variable importance [ , , , , , , , ]. Therefore, more studies on this subject are needed.

Limitations
There are several limitations to be aware of when assessing the academic value of this study. First, well-behaved data with excellent predictive performance owing to the data characteristics were employed for the analysis. For this reason, the overall AUC performance (see the ROC test in
) might have been good for all approaches (ie, all degrees of freedom and UA settings). When attention mechanisms are applied to ill-behaved data without manipulation, such as the intentional use of the variable HE_glu as an input, model accuracy may be reduced. If the accuracy of the model is moderate and domain expertise exists for the disease, it is still advisable to attempt a variable importance interpretation. However, if the model accuracy becomes too poor, it may not be worthwhile to interpret the variable importance. Furthermore, categorical variables of both nominal and ordinal types were integerized, and missing values were encoded as an extreme value in this study. Although this operationalization can be efficient in deep-learning algorithms that can learn nonlinear relationships, it is not a robust approach. Thus, it is necessary to identify problems with this approach and to discuss how to deal with them when ill-behaved data with robust operationalization are employed. Moreover, since data from a single cohort were used, the results of this study, which point out the limitations of the interpretive power of attention mechanisms, should not be generalized. Rather, it should be recognized that accuracy performance and interpretive power may vary depending on the modeling approach and the data.

Second, this study does not guarantee that state-of-the-art estimation methods for UA were applied. Specifically, the models’ outcomes do not depend on the network weights. In addition, research on estimation methodologies in deep learning is in progress, and new methodologies are still being developed. Accordingly, the value of this study lies in the framework proposals that suggest a research direction for attention modeling rather than in the details of the attention estimation methods.
Third, the design of weights of each local attention layer is not as sophisticated as the design of local attention layers (
). Specifically, uncertainty considerations are not assumed in the weights layer. Moreover, this layer does not have to be dependent on the local attention layer; in other words, the weights layer may be designed as an independent layer that does not derive from the local attention layer. We plan to perform various investigations in this area.

Conclusions
Attention mechanisms have the potential to make a significant contribution to the medical field, where explanatory power is important, by overcoming the noninterpretability of deep-learning algorithms. However, the potential problems that may arise when attention mechanisms are applied in practice have not been well studied. Thus, we hope that this study will serve as a cornerstone for raising potential issues, and that many similar studies will be conducted in the future. A cohesive awareness of the potential problems arising from attention mechanisms in applied fields will provide theoretical researchers with new goals for problem solving.
Acknowledgments
This study was supported by a grant (20103801) from the National Cancer Center of Korea.
Conflicts of Interest
None declared.
Notations, global attention inference procedure, Herfindahl index values, variable label descriptions, and reparameterization trick concept description.
PDF File (Adobe PDF File), 525 KB
Codes and algorithms for inferring outcomes.
PDF File (Adobe PDF File), 381 KB

References
 Norgeot B, Glicksberg BS, Trupin L, Lituiev D, Gianfrancesco M, Oskotsky B, et al. Assessment of a Deep Learning Model Based on Electronic Health Record Data to Forecast Clinical Outcomes in Patients With Rheumatoid Arthritis. JAMA Netw Open 2019 Mar 01;2(3):e190606 [FREE Full text] [CrossRef] [Medline]
 Shamai G, Binenbaum Y, Slossberg R, Duek I, Gil Z, Kimmel R. Artificial Intelligence Algorithms to Assess Hormonal Status From Tissue Microarrays in Patients With Breast Cancer. JAMA Netw Open 2019 Jul 03;2(7):e197700 [FREE Full text] [CrossRef] [Medline]
 Montavon G, Samek W, Müller K. Methods for interpreting and understanding deep neural networks. Digit Signal Process 2018 Feb;73:1-15. [CrossRef]
 Zhang Z, Beck MW, Winkler DA, Huang B, Sibanda W, Goyal H, written on behalf of AME BigData Clinical Trial Collaborative Group. Opening the black box of neural networks: methods for interpreting neural network models in clinical applications. Ann Transl Med 2018 Jun;6(11):216. [CrossRef] [Medline]
 Caruana R, Lou Y, Gehrke J, Koch P, Sturm M, Elhadad N. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. 2015 Presented at: Proceedings of the 21st ACM SIGKDD international conference on knowledge discovery and data mining; August 2015; Sydney, Australia p. 1721-1730. [CrossRef]
 Lundberg SM, Lee SI. A unified approach to interpreting model predictions. 2017 Presented at: Advances in neural information processing systems; December 2017; Long Beach, CA p. 4765-4774.
 Ribeiro M, Singh S, Guestrin C. Why should I trust you? Explaining the predictions of any classifier. 2016 Presented at: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining; August 2016; New York, NY p. 1135-1144.
 Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A, et al. Attention is all you need. 2017 Presented at: Advances in Neural Information Processing Systems; December 2017; Long Beach, CA p. 5998-6008.
 Heo J, Lee H, Kim S, Lee J, Kim K, Yang E, et al. Uncertainty-aware attention for reliable interpretation and prediction. 2018 Presented at: Advances in Neural Information Processing Systems; December 2018; Montréal, Canada p. 909-918.
 Tomita N, Abdollahi B, Wei J, Ren B, Suriawinata A, Hassanpour S. Attention-Based Deep Neural Networks for Detection of Cancerous and Precancerous Esophagus Tissue on Histopathological Slides. JAMA Netw Open 2019 Nov 01;2(11):e1914645 [FREE Full text] [CrossRef] [Medline]
 Hwang EJ, Park S, Jin K, Kim JI, Choi SY, Lee JH, DLAD Development and Evaluation Group. Development and Validation of a Deep Learning-Based Automated Detection Algorithm for Major Thoracic Diseases on Chest Radiographs. JAMA Netw Open 2019 Mar 01;2(3):e191095 [FREE Full text] [CrossRef] [Medline]
 Kaji DA, Zech JR, Kim JS, Cho SK, Dangayach NS, Costa AB, et al. An attention based deep learning model of clinical events in the intensive care unit. PLoS One 2019;14(2):e0211057 [FREE Full text] [CrossRef] [Medline]
 Li L, Zhao J, Hou L, Zhai Y, Shi J, Cui F. An attention-based deep learning model for clinical named entity recognition of Chinese electronic medical records. BMC Med Inform Decis Mak 2019 Dec 05;19(Suppl 5):235 [FREE Full text] [CrossRef] [Medline]
 You R, Liu Y, Mamitsuka H, Zhu S. BERTMeSH: Deep Contextual Representation Learning for Large-scale High-performance MeSH Indexing with Full Text. Bioinformatics 2020 Sep 25:btaa837. [CrossRef] [Medline]
 Schlemper J, Oktay O, Schaap M, Heinrich M, Kainz B, Glocker B, et al. Attention gated networks: Learning to leverage salient regions in medical images. Med Image Anal 2019 Apr;53:197207 [FREE Full text] [CrossRef] [Medline]
 Oktay O, Schlemper J, Folgoc L, Lee M, Heinrich M, Misawa K, et al. Attention U-Net: Learning where to look for the pancreas. ArXiv 2018:1804.03999 [FREE Full text]
 Hasan S, Ling Y, Liu J, Sreenivasan R, Anand S, Arora T, et al. Attention-based medical caption generation with image modality classification and clinical concept mapping. 2018 Presented at: International Conference of the Cross-Language Evaluation Forum for European Languages; September 2018; Avignon, France p. 224-230.
 Hahn M. Theoretical Limitations of Self-Attention in Neural Sequence Models. Trans Assoc Comput Ling 2020 Jul;8:156-171. [CrossRef]
 Ghorbani A, Abid A, Zou J. Interpretation of Neural Networks Is Fragile. 2019 Jul 17 Presented at: Proceedings of the AAAI Conference on Artificial Intelligence; February 7-12, 2020; New York, NY p. 3681-3688. [CrossRef]
 Feldman R, Sanger J. The text mining handbook: Advanced approaches in analyzing unstructured data. Cambridge, UK: Cambridge University Press; 2007.
 Ryan M. Deep learning with structured data. Shelter Island, NY: Manning Publications; 2020.
 Li Y, Rao S, Solares J, Hassaine A, Ramakrishnan R, Canoy D, et al. BEHRT: Transformer for Electronic Health Records. Sci Rep 2020 Apr 28;10(1):7155. [CrossRef] [Medline]
 Choi E, Bahadori M, Sun J, Kulas J, Schuetz A, Stewart W. RETAIN: An interpretable predictive model for healthcare using reverse time attention mechanism. 2016 Presented at: Advances in Neural Information Processing Systems; December 2016; Barcelona, Spain p. 3504-3512.
 Chorowski J, Bahdanau D, Serdyuk D, Cho K, Bengio Y. Attention-based models for speech recognition. 2015 Presented at: Advances in Neural Information Processing Systems; December 2015; Quebec, Canada p. 577-585.
 Stollenga M, Masci J, Gomez F, Schmidhuber J. Deep networks with internal selective attention through feedback connections. 2014 Presented at: Advances in Neural Information Processing Systems; December 2014; Quebec, Canada p. 3545-3553.
 Devlin J, Chang M, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv 2018:1810.04805 [FREE Full text]
 Song H, Rajan D, Thiagarajan J, Spanias A. Attend and diagnose: Clinical time series analysis using attention models. arXiv 2018:preprint 1711.03905 [FREE Full text]
 Xiong Y, Du B, Yan P. Reinforced Transformer for Medical Image Captioning. 2019 Presented at: International Workshop on Machine Learning in Medical Imaging; October 2019; Shenzhen, China p. 680. [CrossRef]
 Gal Y, Ghahramani Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. 2016 Presented at: International conference on machine learning; June 2016; New York, NY p. 1050-1059.
 Gal Y. Uncertainty in deep learning. University of Cambridge, PhD Thesis. 2016. URL: http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf [accessed 2020-12-07]
 Kläs M, Vollmer A. Uncertainty in machine learning applications: A practice-driven classification of uncertainty. 2018 Presented at: International Conference on Computer Safety, Reliability, and Security; September 2018; Västerås, Sweden p. 431-438.
 Ehsan AM, Dick A, van den Hengel A. Infinite variational autoencoder for semi-supervised learning. 2017 Presented at: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; July 2017; Honolulu, USA p. 5888-5897. [CrossRef]
 Kingma D, Welling M. Auto-encoding variational Bayes. arXiv 2013:1312.6114 [FREE Full text]
 Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982 Apr;143(1):29-36. [CrossRef] [Medline]
 Niculescu-Mizil A, Caruana R. Predicting good probabilities with supervised learning. 2005 Presented at: Proceedings of the 22nd international conference on machine learning; August 2005; New York, NY p. 625-632. [CrossRef]
 Bröcker J, Smith L. Increasing the reliability of reliability diagrams. Weather Forecast 2007;22(3):651-661 [FREE Full text]
 Benedetti R. Scoring rules for forecast verification. Mon Weather Rev 2010;138(1):203-211 [FREE Full text]
 Kwoka JJ. The Herfindahl index in theory and practice. Antitrust Bull 1985;30:915.
 Cramér H. Mathematical Methods of Statistics (PMS-9). Princeton, NJ: Princeton University Press; 2016.
 Rosenthal R. Parametric measures of effect size. In: Cooper H, Hedges LV, editors. The handbook of research synthesis. New York, NY: Russel Sage Foundation; 1994:231-244.
 Brownlee J. Deep learning with Python: develop deep learning models on Theano and TensorFlow using Keras. Vermont, Victoria, Australia: Machine Learning Mastery; 2016.
 Chollet F. Deep learning with Python. Shelter Island, NY: Manning Publications Co; 2018.
 Kingma D, Salimans T, Welling M. Variational dropout and the local reparameterization trick. 2015 Presented at: Advances in neural information processing systems; December 2015; Quebec, Canada p. 2575-2583.
 Kweon S, Kim Y, Jang M, Kim Y, Kim K, Choi S, et al. Data resource profile: the Korea National Health and Nutrition Examination Survey (KNHANES). Int J Epidemiol 2014 Feb;43(1):69-77 [FREE Full text] [CrossRef] [Medline]
 Moon S. Low skeletal muscle mass is associated with insulin resistance, diabetes, and metabolic syndrome in the Korean population: the Korea National Health and Nutrition Examination Survey (KNHANES) 2009-2010. Endocr J 2014;61(1):61-70 [FREE Full text] [CrossRef] [Medline]
 Jee D, Lee WK, Kang S. Prevalence and risk factors for diabetic retinopathy: the Korea National Health and Nutrition Examination Survey 2008-2011. Invest Ophthalmol Vis Sci 2013 Oct 17;54(10):6827-6833. [CrossRef] [Medline]
 Choi YJ, Lee MS, An SY, Kim TH, Han SJ, Kim HJ, et al. The Relationship between Diabetes Mellitus and Health-Related Quality of Life in Korean Adults: The Fourth Korea National Health and Nutrition Examination Survey (2007-2009). Diabetes Metab J 2011 Dec;35(6):587-594 [FREE Full text] [CrossRef] [Medline]
 Hwang J, Shon C. Relationship between socioeconomic status and type 2 diabetes: results from Korea National Health and Nutrition Examination Survey (KNHANES) 2010-2012. BMJ Open 2014 Aug 19;4(8):e005710 [FREE Full text] [CrossRef] [Medline]
 Ahn JH, Yu JH, Ko S, Kwon H, Kim DJ, Kim JH, Taskforce Team of Diabetes Fact Sheet of the Korean Diabetes Association. Prevalence and determinants of diabetic nephropathy in Korea: Korea National Health and Nutrition Examination Survey. Diabetes Metab J 2014 Apr;38(2):109-119 [FREE Full text] [CrossRef] [Medline]
 Au TC. Random forests, decision trees, and categorical predictors: the "Absent Levels" Problem. J Mach Learn Res 2018;19(1):1737-1766 [FREE Full text]
 Maass P. Deep learning for trivial inverse problems. In: Compressed Sensing and its Applications. Cham: Birkhäuser; 2019:195-209.
 Thrun S, Schwartz A. Issues in using function approximation for reinforcement learning. 1993 Presented at: Proceedings of the 1993 Connectionist Models Summer School; December 1993; Hillsdale, NJ.
 Hayward RA, Heisler M, Adams J, Dudley RA, Hofer TP. Overestimating outcome rates: statistical estimation when reliability is suboptimal. Health Serv Res 2007 Aug;42(4):1718-1738 [FREE Full text] [CrossRef] [Medline]
 Dehmer M, Basak S. Statistical and Machine Learning Approaches for Network Analysis. Hoboken, NJ: Wiley Online Library; 2012.
 Sander J, de Vos BD, Wolterink JM, Išgum I. Towards increased trustworthiness of deep learning segmentation methods on cardiac MRI. 2019 Presented at: Medical Imaging 2019: Image Processing; March 2019; San Diego, CA p. 1094919.
 Kuleshov V, Liang P. Calibrated structured prediction. 2015 Presented at: Advances in Neural Information Processing Systems; December 2015; Quebec, Canada p. 3474-3482.
 Seo S, Seo P, Han B. Learning for single-shot confidence calibration in deep neural networks through stochastic inferences. 2019 Presented at: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; June 2019; Long Beach, CA p. 9030-9038.
 Simon HA. Spurious Correlation: A Causal Interpretation. J Am Stat Assoc 1954 Sep;49(267):467. [CrossRef]
 Kirkman MS, Briscoe VJ, Clark N, Florez H, Haas LB, Halter JB, et al. Diabetes in older adults. Diabetes Care 2012 Dec;35(12):2650-2664 [FREE Full text] [CrossRef] [Medline]
 Koenig RJ, Cerami A. Hemoglobin A1c and diabetes mellitus. Annu Rev Med 1980;31(1):29-34. [CrossRef] [Medline]
 LevRan A. Glycohemoglobin. Arch Intern Med 1981 May 01;141(6):747. [CrossRef]
 Heiss F. Using R for introductory econometrics. South Carolina: Createspace Independent Publishing Platform; 2016.
 Cannon A, Cobb G, Hartlaub B, Legler J, Lock R, Moore T, et al. Stat2: modeling with regression ANOVA. Second edition. New York, NY: WH Freeman; 2018.
 Lee KE, Sha N, Dougherty ER, Vannucci M, Mallick BK. Gene selection: a Bayesian variable selection approach. Bioinformatics 2003 Jan;19(1):90-97. [CrossRef] [Medline]
 Wimmer V, Lehermeier C, Albrecht T, Auinger H, Wang Y, Schön CC. Genome-wide prediction of traits with different genetic architecture through efficient variable selection. Genetics 2013 Oct;195(2):573-587 [FREE Full text] [CrossRef] [Medline]
 Wang H, Wang Y, Liang C, Li Y. Assessment of Deep Learning Using Nonimaging Information and Sequential Medical Records to Develop a Prediction Model for Nonmelanoma Skin Cancer. JAMA Dermatol 2019 Sep 04;155(11):1277-1283 [FREE Full text] [CrossRef] [Medline]
 BeaulieuJones B, Finlayson SG, Chivers C, Chen I, McDermott M, Kandola J, et al. Trends and Focus of Machine Learning Applications for Health Research. JAMA Netw Open 2019 Oct 02;2(10):e1914051 [FREE Full text] [CrossRef] [Medline]
 Ehteshami Bejnordi B, Veta M, Johannes van Diest P, van Ginneken B, Karssemeijer N, Litjens G, the CAMELYON16 Consortium, et al. Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer. JAMA 2017 Dec 12;318(22):2199-2210 [FREE Full text] [CrossRef] [Medline]
 Taggart M, Chapman WW, Steinberg BA, Ruckel S, PregenzerWenzler A, Du Y, et al. Comparison of 2 Natural Language Processing Methods for Identification of Bleeding Among Critically Ill Patients. JAMA Netw Open 2018 Oct 05;1(6):e183451 [FREE Full text] [CrossRef] [Medline]
 Jerby-Arnon L, Pfetzer N, Waldman Y, McGarry L, James D, Shanks E, et al. Predicting cancer-specific vulnerability via data-driven detection of synthetic lethality. Cell 2014 Aug 28;158(5):1199-1209 [FREE Full text] [CrossRef] [Medline]
 Hyman DM, Taylor BS, Baselga J. Implementing Genome-Driven Oncology. Cell 2017 Feb 09;168(4):584-599 [FREE Full text] [CrossRef] [Medline]
Abbreviations
ANOVA: analysis of variance 
AUC: area under the curve 
LIME: Local Interpretable Model-agnostic Explanations 
ROC: receiver operating characteristic 
SHAP: Shapley Additive Explanations 
UA: uncertainty awareness 
Edited by G Eysenbach, R Kukafka; submitted 25.02.20; peer-reviewed by SY Shin, M Feng, S Azimi; comments to author 07.08.20; revised version received 13.09.20; accepted 11.11.20; published 16.12.20
Copyright © Junetae Kim, Sangwon Lee, Eugene Hwang, Kwang Sun Ryu, Hanseok Jeong, Jae Wook Lee, Yul Hwangbo, Kui Son Choi, Hyo Soung Cha. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 16.12.2020.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.