This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Despite excellent prediction performance, non-interpretability has undermined the value of applying deep-learning algorithms in clinical practice. To overcome this limitation, the attention mechanism has been introduced to clinical research as an explanatory modeling method. However, the potential limitations of using this attractive method have not been clarified for clinical researchers. Furthermore, there has been a lack of introductory information explaining attention mechanisms to clinical researchers.
The aim of this study was to introduce the basic concepts and design approaches of attention mechanisms. In addition, we aimed to empirically assess the potential limitations of current attention mechanisms in terms of prediction and interpretability performance.
First, the basic concepts and several key considerations regarding attention mechanisms were identified. Second, four approaches to attention mechanisms were suggested according to a two-dimensional framework based on the degrees of freedom and uncertainty awareness. Third, prediction performance, probability reliability, concentration of variable importance, consistency of attention results, and generalizability of attention results to conventional statistics were assessed in a diabetes classification modeling setting. Fourth, the potential limitations of attention mechanisms were considered.
Prediction performance was very high for all models. Probability reliability was high in models with uncertainty awareness. Variable importance was concentrated in several variables when uncertainty awareness was not considered. The consistency of attention results was high when uncertainty awareness was considered. The generalizability of attention results to conventional statistics was poor regardless of the modeling approach.
The attention mechanism is an attractive technique with potential to be very promising in the future. However, it may not yet be desirable to rely on this method to assess variable importance in clinical settings. Therefore, along with theoretical studies enhancing attention mechanisms, more empirical studies investigating potential limitations should be encouraged.
In recent years, there has been significant evidence that deep-learning algorithms can outperform other machine-learning algorithms and conventional statistics in the medical field [
Based on this advantage, attention mechanisms have started to gain appeal in the clinical research field [
With the goal of reducing this gap, the aim of this study was to evaluate attention mechanisms in terms of prediction performance and interpretability. In addition, there remains a lack of guidance for clinical researchers on implementing attention mechanisms; therefore, to facilitate understanding, this study first provides basic concepts, key considerations, and code for attention mechanisms. Finally, a case analysis was performed in a cross-sectional and structured data environment, which is the simplest possible data setting for clinical researchers.
This study was conducted according to the following procedure. First, the scope of the study was established in terms of the data structure, and a brief introduction and several important considerations regarding attention mechanisms were discussed. Second, based on previous research, a two-dimensional framework was established to guide the four modeling approaches to attention mechanisms. Third, five empirical tests were performed using the four models: prediction performance, probability reliability, concentration of variable importance, consistency of attention results, and generalizability of attention results to conventional statistics. Finally, potential limitations that may arise when using attention mechanisms were identified.
Since the design approaches of attention mechanisms differ greatly depending on the data structure, the scope of this study was established in terms of data structure. Specifically, attention mechanism research in the medical field can be divided into two main categories from a data point of view. The first category is an unstructured data area, where data containing natural language and images cannot be stored in a row-and-column table structure [
Structured data familiar to clinical researchers are widely applicable to most statistical analyses, including linear regression analysis and analysis of variance (ANOVA). Since one purpose of this study was to compare the results of attention mechanisms and conventional statistical methods, the scope of the study was limited to structured data. Furthermore, most previous attention mechanism studies using structured data have been conducted in time-series settings [
Attention, one of the layers in a neural network model, quantifies the importance of input variables in terms of their impact on outcomes (
Model structures for attention implementation. (A) Basic architecture of an attention mechanism model. (B) Model architecture where the dot product is employed for transferring the influence of input variables toward the outcome. (C) Model architecture where the importance of input variables may be decayed. (D) Model architecture that is aware of uncertainty. The percentages in the circles show examples of attention values.
The attention value of a certain variable indicates the relative importance of that variable compared with that of other variables. If the attention value of a particular variable is large, the large influence of that variable is transmitted when predicting the outcome variable. As an extreme example, when the attention value of a variable is 1, only that variable is used to predict the outcome variable, whereas if the attention value of a variable is 0, that variable is not used to predict the outcome variable.
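As an illustrative sketch (not the exact implementation used in this study), attention values with these properties can be produced by applying a softmax over learned relevance scores, so that they are non-negative and sum to 1; all numbers below are hypothetical:

```python
import math

def softmax(scores):
    # Exponentiate and normalize so the attention values sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(values, scores):
    # Weight each input variable by its attention value and sum the result.
    attn = softmax(scores)
    context = sum(a * v for a, v in zip(attn, values))
    return attn, context

# Three input variables with raw relevance scores (hypothetical numbers).
attn, context = attend(values=[0.2, 0.8, 0.5], scores=[2.0, 1.0, 0.5])
```

A variable with a larger score receives a larger attention value, and therefore transmits a larger share of its influence to the predicted outcome.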
Attention mechanisms can be implemented in various ways because the key feature of deep-learning modeling is that users can freely design the model structure [
Although deep-learning models can be developed in various ways depending on developers' preferences, two approaches have been commonly applied in recent attention studies: increasing the degrees of freedom and uncertainty awareness (UA).
The mechanism for increasing the degrees of freedom is to design multi-attention layers; representative algorithms reflecting this mechanism include the transformer and Bidirectional Encoder Representations from Transformers (BERT) [
Deep-learning algorithms are not free from the issue of uncertainty, which concerns the fact that prediction results may be incomplete in terms of accuracy and consistency [
Based on the discussion above, two directions (ie, degree of freedom and UA) were considered for attention modeling (
Framework for empirical tests.
Empirical test entries for the four models in the framework were categorized into two broad categories: outcome and attention (
In terms of attention, the degree to which variable importance was concentrated in particular variables was measured (ie, Concentration in
Empirical test entries for measuring the performance of four models.
Entries (measures)  Methods

Prediction performance  Receiver operating characteristic
Probability reliability  Reliability diagrams
Concentration  Herfindahl index (near 0, least concentrated; 1, most concentrated)
Consistency  Correlation
Generalizability  Effect size (Cohen); regression analysis (dependent variable: effect size obtained from conventional methods; independent variable: attention values)
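The Herfindahl index used for the concentration entry is the sum of squared importance shares; a minimal sketch with hypothetical attention values:

```python
def herfindahl(attention_values):
    # Sum of squared shares: near 1/n when importance is spread evenly
    # over n variables, and 1.0 when one variable receives all importance.
    total = sum(attention_values)
    shares = [a / total for a in attention_values]
    return sum(s * s for s in shares)

concentrated = herfindahl([0.97, 0.01, 0.01, 0.01])  # one dominant variable
uniform = herfindahl([0.25, 0.25, 0.25, 0.25])       # evenly distributed
```

A model whose attention is spread over many variables therefore yields a low Herfindahl index, and a model that attends to only a few variables yields a high one.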
Four models were developed according to the framework presented in
Model designs for single attention mechanisms. UA: uncertainty awareness. The concept of "reparameterization trick" is described in Concept A1 of
The two models in which the degree of freedom is considered are presented in
Since these models have multi-attention layers (ie, Local attention in
Model designs for multi-attention mechanisms. UA: uncertainty awareness. The concept of "reparameterization trick" is described in Concept A1 of
This somewhat complex structure ensures that the influence of one variable is passed only once to the model outcome, even if attention values are inferred multiple times (20 times in this case). Furthermore, using both the local attention layer and the weights corresponding to each vector, a unique attention value for each variable can be obtained, which facilitates interpretation.
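The "reparameterization trick" referenced in the figure captions can be sketched as follows: a value is drawn as mu + sigma * eps with eps ~ N(0, 1), so that gradients can flow through mu and the log-variance while the randomness is isolated in eps. This is a generic sketch, not the exact layer used in this study:

```python
import math
import random

def reparameterize(mu, log_var, rng=random):
    # Sample z ~ N(mu, sigma^2) as mu + sigma * eps with eps ~ N(0, 1).
    # Because the randomness is isolated in eps, mu and log_var stay
    # differentiable during training (the "reparameterization trick").
    sigma = math.exp(0.5 * log_var)
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps

random.seed(0)
# Draw many samples from N(1.0, 0.25) via the trick (hypothetical parameters).
samples = [reparameterize(mu=1.0, log_var=math.log(0.25)) for _ in range(10000)]
sample_mean = sum(samples) / len(samples)
```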
Graphical and mathematical notations are provided for obtaining a set of unique values (global attention in Figure A1 of
A 10-fold test was performed to assess the empirical test entries. The dataset was divided into 10 folds, yielding 10 test sets (each 10% of the data) and 10 corresponding training sets (each the remaining 90%). Each training set was then subdivided, with 80% used directly for model training and 20% for validation.
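The split described above can be sketched as follows (index bookkeeping only; in practice, shuffling or stratification would normally be added):

```python
def ten_fold_splits(n_samples, n_folds=10):
    # Yield (train, validation, test) index lists: each fold's 10% is the
    # test set; the remaining 90% is split 80/20 into training/validation.
    indices = list(range(n_samples))
    fold_size = n_samples // n_folds
    for k in range(n_folds):
        test = indices[k * fold_size:(k + 1) * fold_size]
        rest = indices[:k * fold_size] + indices[(k + 1) * fold_size:]
        cut = int(len(rest) * 0.8)
        yield rest[:cut], rest[cut:], test

# For 100 samples: 10 splits of 72 training, 18 validation, 10 test indices.
splits = list(ten_fold_splits(100))
```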
Entries related to the model outcome (
The binary cross-entropy function, which is generally used in binary outcome settings [
The UA models assume that the model outcome is dependent on the normal distribution (ie, layers with
Attention models were developed, trained, and tested on Keras 2.3.1, TensorFlow 2.1.0, and Python 3.7.6. Adam with a learning rate of 0.001 was employed as the optimizer to train all models. The training dataset was provided to the model with a batch size of 5000. An early-stop rule was applied to stop training each model at the optimal epoch; thus, model training was terminated when the loss value of the validation set did not improve during the last 1200 epochs. Other details about activation functions and the structure of nodes and layers are provided in Code A2 to Code A5 of
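As a reference point for the loss described in this section, a minimal pure-Python sketch of the binary cross-entropy function (the models themselves used the Keras implementation):

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    # Mean negative log-likelihood for binary outcomes; eps guards log(0).
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical labels and predicted probabilities.
loss = binary_cross_entropy([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3])
```

Lower values indicate predicted probabilities closer to the observed labels, which is the quantity minimized during training.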
The case analysis was performed in a setting where the relationship between a disease and other variables is well established: an 8-year (2010 to 2017, inclusive) cumulative Korea National Health and Nutrition Examination Survey dataset, which assesses the nutrition and health status of Koreans and collects information about major chronic diseases such as metabolic syndrome and diabetes [
In the 8-year cumulative data, only variables with consistent labels during that period were included. Variables with no change in value, with a missing rate of more than 50%, or containing subject identification information were excluded from the study set. Categorical variables of both nominal and ordinal types were integerized using integer encoding [
There were 238 variables with consistent labels in the 8-year cumulative dataset, of which only 128 were retained after preprocessing: 22 variables with no change in value, 84 variables with more than 50% missing values, and 4 variables containing identification information were excluded from the analysis. The total number of observations (ie, the number of subjects) was 33,065, with an average age of 48.89 years and with men accounting for 40.41% (n=13,361) of the sample. Only 6 variables had no omissions, and the average missing rate of variables with omissions was 10.38%.
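The integer encoding applied to categorical variables can be sketched as follows (the category labels are hypothetical):

```python
def integer_encode(values):
    # Map each distinct category to an integer in order of first appearance.
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return [mapping[v] for v in values], mapping

# A hypothetical nominal variable such as smoking status.
codes, mapping = integer_encode(["never", "former", "current", "former"])
```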
Receiver operating characteristic (ROC) test for the four models. All values estimated from 10 test sets are combined into one set for each model. The base indicates the deeplearning model without attention mechanisms. UA: uncertainty awareness.
Reliability diagrams for four models. The Brier score measures the overall reliability of probabilistic predictions. UA: uncertainty awareness.
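The Brier score reported in the reliability diagrams is the mean squared difference between predicted probabilities and observed binary outcomes; a minimal sketch with hypothetical values:

```python
def brier_score(y_true, y_prob):
    # Lower is better; 0 means perfectly calibrated, confident predictions.
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

# Hypothetical labels and predicted probabilities.
score = brier_score([1, 0, 1, 0], [0.9, 0.2, 0.7, 0.1])
```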
Herfindahl index values of the 10-fold sets for each model. UA: uncertainty awareness.
Histograms of correlations among the 10-fold sets for each model. UA: uncertainty awareness.
Top 5% variable importance results estimated by different methods.
Variable^{a}  Attention value^{b}/Effect size^{c}

Single attention
  sm_presnt  0.0615
  BD1  0.0519
  pa_walk  0.0479
  HE_Upro  0.0468
  HE_alt  0.0430
  Npins  0.0413

Multi-attention
  HE_HB  0.041
  Sex  0.033
  Pa_walk  0.032
  HE_HBsAg  0.027
  Allownc  0.026
  HE_sbp  0.026

Single attention with UA^{d}
  pa_walk  0.050
  BH9_11  0.017
  HE_THfh2  0.011
  HE_THfh1  0.010
  HE_THfh3  0.010
  DI5_dg  0.010

Multi-attention with UA
  pa_walk  0.050
  HE_THfh2  0.019
  BH9_11  0.015
  HE_THfh1  0.013
  HE_ast  0.010
  house  0.010

Conventional statistics^{e}
  age  1.536
  HE_glu  1.214
  HE_HbA1c  –1.014
  Wt_pool_1  –0.516
  Wt_itvex  –0.516
  HE_Uglu  0.319
^{a}See Table A2 in
^{b}Average over the 10-fold sets.
^{c}Effect size is presented only for the conventional statistics.
^{d}UA: uncertainty awareness.
^{e}Null hypothesis of categorical variables=no relationships between diabetes and a categorical variable; null hypothesis of continuous variables=no difference in variables between the diabetes and no diabetes groups.
Regression analysis results for assessing an association between attention values and effect sizes.
Regression model variables^{a}  Regression 1: coefficient (P value)  Regression 2: coefficient (P value)

Single attention  −0.696 (.635)  −0.668 (.652)
Multi-attention  1.897 (.487)  2.019 (.476)
Single attention with UA^{b}  −1.351 (.797)  —^{c}
Multi-attention with UA  —  −1.541 (.772)
Intercept  0.075 (.09)  0.075 (.075)
^{a}The dependent variable is the absolute value of effect size, calculated by Cohen
^{b}UA: uncertainty awareness.
^{c}—: variable not included in the regression model.
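Assuming the effect size for continuous variables is a standardized mean difference in the style of Cohen d (the exact formula used in the study is not reproduced here), it can be sketched with hypothetical group values:

```python
import math

def cohens_d(group_a, group_b):
    # Standardized mean difference using the pooled standard deviation.
    def mean(x):
        return sum(x) / len(x)
    def var(x):
        m = mean(x)
        return sum((v - m) ** 2 for v in x) / (len(x) - 1)
    na, nb = len(group_a), len(group_b)
    pooled = math.sqrt(
        ((na - 1) * var(group_a) + (nb - 1) * var(group_b)) / (na + nb - 2)
    )
    return (mean(group_a) - mean(group_b)) / pooled

# Hypothetical continuous measurements for a diabetes vs no-diabetes group.
d = cohens_d([5.9, 6.1, 6.3, 6.0], [5.2, 5.4, 5.1, 5.3])
```

In the generalizability test, absolute effect sizes of this kind serve as the dependent variable, and attention values serve as the independent variable.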
A difference in performance according to the degree of freedom was prominent in the probability reliability diagram (
Since the difference appeared to be based on the UA axis, the over- or underestimation tendency of the models may be related to UA. Furthermore, the Brier scores of the two models with UA were smaller than those of the two models without UA, indicating that models with UA tend to estimate more reliable probabilities than models without UA. These findings are consistent with the results of recent research that estimated reliable outcomes with an emphasis on UA [
UA produced noticeable differences in the consistency of results and the concentration of variable importance. Specifically, in the UA models with low Herfindahl indices, variable importance appeared to be distributed over many variables, in contrast to the models that did not consider uncertainty (
These results suggest that the consistency of results from the UA models is high because the variable importance with overall low values is distributed evenly over most variables. This result is closely associated with the assumption that attention values were estimated based on a normal distribution within the cost function (see equation for the Kullback–Leibler divergence
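For reference, the Kullback–Leibler divergence of a univariate Gaussian from the standard normal has the closed form KL(N(mu, sigma^2) || N(0, 1)) = 0.5 * (sigma^2 + mu^2 − 1 − ln sigma^2); a sketch follows (the exact weighting of this term within the study's cost function is not reproduced here):

```python
import math

def kl_to_standard_normal(mu, log_var):
    # KL(N(mu, sigma^2) || N(0, 1)); zero when mu = 0 and sigma = 1, and it
    # grows as the learned distribution drifts from the standard normal.
    return 0.5 * (math.exp(log_var) + mu ** 2 - 1.0 - log_var)

# Penalty is zero when the attention distribution matches N(0, 1).
penalty_at_prior = kl_to_standard_normal(mu=0.0, log_var=0.0)
penalty_shifted = kl_to_standard_normal(mu=1.0, log_var=0.0)
```

Minimizing such a term pulls all attention distributions toward a common prior, which is consistent with the evenly distributed, low-valued importances observed for the UA models.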
As with conventional statistical methods, the attention models were unable to control spurious correlations during attention learning. Specifically, of the top 5% of variables obtained from conventional statistics, wt_pool_1 (interview weight combined years) and wt_itvex (interview weight for a single year) have little to do with health status (
In terms of clinically relevant variables, no significant association between the results of conventional statistics and attention models was found. Specifically, the variables age, HE_glu (fasting blood sugar), HE_HbA1c (hemoglobin AIC), and HE_Uglu (urine glucose) selected by conventional methods are well known to have a direct association with diabetes [
Furthermore, there was no intersection of variables selected by both attention models and conventional statistical methods (
The model structure and the weights of terms in the cost function are hyperparameters to be adjusted. In terms of model structure, the degree of freedom of the attention layers was evaluated by comparing two extreme cases: 1 attention layer and 20 attention layers. Although the number of attention layers did not make a significant difference here, the results may differ significantly with other numbers of attention layers under other conditions.
Furthermore, by taking uncertainty into account in the models, a term (ie, the degree of similarity associated with the normal distribution) was added to the cost function. However, as discussed previously, this term may interfere with the assignment of great importance values to variables by making all
However, hyperparameter tuning is conducted on a heuristic rather than a theoretical basis; there is no standard for how many attention layers should be specified or how much the weights should be adjusted to obtain better results. If the goal of model building is to maximize accuracy, various hyperparameter settings can be tested in the direction of increasing model accuracy. However, there is no comparable criterion for maximizing interpretability: even when various hyperparameter settings are tested, finding the best-optimized setting from a statistical point of view remains challenging. Therefore, the variable importance results should be understood only within the framework of this experiment.
There was no significant association between the variable importance obtained from the attention mechanism and the effect size obtained from conventional statistics. One of the most probable reasons for this result is that the assumption about associations among input variables differs between conventional methods and deep-learning algorithms. Specifically, conventional statistical methodologies such as linear regression analysis and ANOVA estimate effect sizes based on the assumption of independence between input variables [
Recent new technologies such as sensors (ie, wearable devices or facilities in operating rooms) have produced new types of data. Since the associations between variables have not yet been fully explored, relying solely on attention mechanisms may lead to a false judgment that variables that have minimal association with the outcome variable are important. Hence, it is advisable to consider the results of attention and conventional statistics together.
Furthermore, in situations where there is a spurious correlation, neither method provides good explanatory power. Spurious correlations can only be eliminated through data preprocessing based on domain knowledge. Hence, care must be taken when implementing both attention models and conventional statistical methods in environments with manifold variables that cannot be preprocessed (ie, included or excluded) using definite knowledge. In particular, finding new features using attention mechanisms may not be adequate in environments where the data are susceptible to spurious correlations owing to a large number of variables but few observations such as in the field of genetic engineering [
The results of this study provide several points of guidance for future research in the medical field. First, more empirical evidence should be secured based on various structures in terms of the degree of freedom. It may be desirable to test what attention results are produced when different values of degree of freedom are employed. Particularly, given that the medical field has various data types such as images, natural languages, and numerical values, attention results should be assessed according to the degree of freedom with consideration of the data characteristics [
Second, attention models with more sophisticated UA should be tested. In this study, the model outcome variable was assumed to depend on the distribution of the attention layer; that is,
Third, more research that strictly evaluates variable importance based on attention mechanisms over diverse disease domains is needed. As found in this study, attention has its limitations in terms of generalizability to conventional statistics and control of spurious correlations. However, since this case study was conducted with a single cohort of Korean patients with diabetes, more empirical evidence from various cohorts or diseases should be tested to confirm that attention mechanisms may not provide any significant meaning. Importantly, for elaborate empirical research, a greater indepth understanding of the association between covariates and health outcomes is needed. Hence, more domain experts on a specific disease along with data scientists should be actively involved in these studies.
Fourth, methods for controlling the distribution of variable importance should be studied. As revealed in this analysis, the variable importance can be distributed over many variables or concentrated on a few variables depending on the model structure (
There are several limitations to be aware of when assessing the academic value of this study. First, well-behaved data with excellent predictive performance owing to the data characteristics were employed for the analysis. For this reason, the overall AUC performance (see ROC test in
Second, this study does not guarantee that state-of-the-art estimation methods for UA were applied. Specifically, the models' outcomes do not depend on network weights. In addition, research on estimation methodologies in deep learning is in progress, and new methodologies are still being developed. Accordingly, the value of this study lies in the proposed framework suggesting a research direction for attention modeling rather than in the details of the attention estimation methods.
Third, the design of weights of each local attention layer is not as sophisticated as the design of local attention layers (
Attention mechanisms have the potential to make a significant contribution to the medical field, where explanatory power is important, by overcoming the limitation of the non-interpretability of deep-learning algorithms. However, potential problems that may arise when attention mechanisms are applied in practice have not been well studied. Thus, we hope that this study will serve as a cornerstone for raising potential issues, and that many similar studies will be conducted in the future. A cohesive awareness of the potential problems arising from attention mechanisms in the field of application will provide theoretical researchers with new goals for problem-solving.
Notations, global attention inference procedure, Herfindahl index values, variable label descriptions, and reparameterization trick concept description.
Codes and algorithms for inferring outcomes.
analysis of variance
area under the curve
Local Interpretable Model-agnostic Explanations
receiver operating characteristic
Shapley Additive Explanations
uncertainty awareness
This study was supported by a grant (20103801) from the National Cancer Center of Korea.
None declared.