A Questionnaire-Based Ensemble Learning Model to Predict the Diagnosis of Vertigo: Model Development and Validation Study

Background: Questionnaires have been used in the past 2 decades to predict the diagnosis of vertigo and assist clinical decision-making. A questionnaire-based machine learning model is expected to improve the efficiency of diagnosis of vestibular disorders.

Objective: This study aims to develop and validate a questionnaire-based machine learning model that predicts the diagnosis of vertigo.

Methods: In this multicenter prospective study, patients presenting with vertigo entered a consecutive cohort at their first visit to the ENT and vertigo clinics of 7 tertiary referral centers from August 2019 to March 2021, with a follow-up period of 2 months. All participants completed a diagnostic questionnaire after eligibility screening. Patients who received only 1 final diagnosis from their treating specialists for their primary complaint were included in model development and validation. The data of patients enrolled before February 1, 2021 were used for modeling and cross-validation, while patients enrolled afterward entered external validation.

Results: A total of 1693 patients were enrolled, with a response rate of 96.2% (1693/1760). The median age was 51 (IQR 38-61) years, and 991 (58.5%) were female; 1041 (61.5%) patients received a final diagnosis during the study period. Among them, 928 (54.8%) patients were included in model development and validation, and 113 (6.7%) patients who enrolled later were used as a test set for external validation. They were classified into 5 diagnostic categories. We compared 9 candidate machine learning methods, and the recalibrated light gradient boosting machine model achieved the best performance, with an area under the curve of 0.937 (95% CI 0.917-0.962) in cross-validation and 0.954 (95% CI 0.944-0.967) in external validation.

Conclusions: The questionnaire-based light gradient boosting machine was able to predict common vestibular disorders and assist decision-making in ENT and vertigo clinics.
Further studies with a larger sample size and the participation of neurologists will help assess the generalization and robustness of this machine learning method.


Introduction
Dizziness and vertigo are the major complaints of patients with vestibular disorders, with an estimated lifetime prevalence of dizziness (including vertigo) of 15%-35% [1]. Dizziness and vertigo are incapacitating and considerably impact patients' quality of life. These conditions often lead to activity restriction and are closely associated with psychiatric disorders such as anxiety, phobic, and somatoform disorders [1][2][3]. Patients with dizziness and vertigo are also at a higher risk of falls and fall-related injuries, especially among older people [4]. However, the diagnosis of vestibular disorders is challenging and time-consuming. It involves a variety of vestibular and neurological causes and complex pathological processes, leading to misdiagnosis and potentially widespread overuse of imaging among vertiginous patients [5][6][7][8]. Consequent delays in diagnosis can worsen the functional and psychological consequences of the disease.
The application of artificial intelligence in diagnosing dizziness and vertigo dates back more than 30 years. Expert systems such as Vertigo [9], Carrusel [10], and One [11] consist of knowledge bases with fixed diagnostic rules; they infer through nonadaptive algorithms that cannot learn from patient data. Different machine learning algorithms, including genetic algorithms, neural networks, Bayesian methods, k-nearest neighbors, and support vector machines, have also been employed to analyze patient data from One [12][13][14][15][16]. The predictive accuracy was 90%-97% for 6 common otoneurologic diagnoses and 76.8%-82.4% for 9 diagnostic categories. EMBalance is a comprehensive platform launched in 2015 to assist the diagnosis, treatment, and evaluation of balance disorders by using ensemble learning methods based on decision trees (Adaptive Boosting) [17,18]. There has been a shift from purely knowledge-driven to data-driven methodology in computer-aided diagnosis of vestibular disorders.
Except for Vertigo, all of the models mentioned above are based on patients' medical history and examinations combined with necessary tests; in practice, however, patient history alone provides important clues to the possible diagnosis and further evaluation [19]. Numerous questionnaires for dizziness and vertigo have emerged during the past 2 decades to assist the clinical diagnosis of vestibular disorders [20][21][22][23][24][25][26][27]. Most of these studies used simple statistical models, typically logistic regression, validated on the same data used for modeling [26][27][28]. Few studies have tried to apply machine learning algorithms; however, the accuracy of these models was not as good as that of simple statistical models owing to small data sets or an inappropriate choice of modeling data [29,30].
This study is part of the Otogenic Vertigo Artificial Intelligence Research (OVerAIR) study, in which the overarching purpose is to build a comprehensive platform that integrates diagnosis, treatment, rehabilitation, and follow-up in a cohort of patients with otogenic vertigo by using artificial intelligence. The specific aims of this study include developing and verifying a diagnostic platform for vertigo and assisting clinical decision-making by using machine learning techniques and further exploring the effectiveness and clinical utility of the proposed platform.

Study Design
Patients presenting with a new complaint of vertigo or dizziness according to the classification of vestibular symptoms by the Barany Society [31] were enrolled consecutively from the ENT and vertigo clinics of Eye & ENT Hospital of Fudan University, The Second Hospital of Anhui Medical University, The First Affiliated Hospital of Xiamen University, Shengjing Hospital of China Medical University, Shanghai Pudong Hospital, Shenzhen Second People's Hospital, and The First Affiliated Hospital of Chongqing Medical University from August 2019 through March 2021. At their first interview with an ENT specialist, patients completed the electronic version of the questionnaire via a tablet or smartphone after giving informed consent. Those who were unable to read and complete the questionnaire by themselves answered the questions read aloud by the researchers. We did not interfere with the normal medical procedures of the patients. Follow-up visits were scheduled as the treating specialist considered necessary rather than at fixed intervals.
Further examination (including computed tomography and magnetic resonance imaging) was prescribed when necessary. The clinical diagnosis given by ENT specialists with more than 5 years of clinical experience who were blinded to the questionnaire responses was used as the reference diagnosis. The reference diagnostic standards included the practice guideline for benign paroxysmal positional vertigo (BPPV) by the American Academy of Otolaryngology-Head and Neck Surgery [33] and the diagnostic criteria for vestibular disorders (including vestibular migraine [34], Meniere disease [35], persistent postural-perceptual dizziness [36], vestibular paroxysmia [37], and bilateral vestibulopathy [38]) by the Barany Society. Patients with typical clinical features who did not meet the criteria for a definite diagnosis were given a probable diagnosis. Patients without a specific diagnosis within 2 months or who stopped coming for visits before reaching a final diagnosis were labeled undetermined.

Questionnaire Development
The diagnostic questionnaire was developed through an iterative process that mainly consisted of the following 3 stages.
1. Focus group and panel meetings: First, a focus group discussion and 3 follow-up panel meetings were convened to identify the peripheral vestibular disorders commonly seen in ENT clinics. In this process, 16 disorders were identified, and the featured manifestations of each disorder were listed. The literature on diagnostic or practice guidelines for each disorder was searched, and the pertinent guidelines were carefully reviewed. After that, an initial questionnaire composed of 43 items was drafted.
2. Patient interviews: Fifteen patients who presented with vertigo in our ENT clinic were interviewed about the understandability of the questionnaire and the ease of completing it. Two patients reported that it was too long and time-consuming. Another 3 complained of being asked too many questions that seemed unrelated to their vertigo, such as those about heart disease and medications taken. At this stage, the wording of the questionnaire was thoroughly simplified and 6 questions were deleted.
3. Expert group meeting: At a national conference, 12 experts (from ENT, neurology, vestibular examination, and rehabilitation) were invited to evaluate the suitability and clarity of the questionnaire, and they put forward suggestions for further revision. During this process, the items were reordered and some were combined or omitted.

Statistical Analysis
We compared 9 candidate machine learning methods to screen for the one with the best performance. Five non-ensemble learning algorithms were considered, namely, decision tree [39], ridge regression [40], logistic regression (with L2-regularization) [41], support vector classification [42], and support vector classification with stochastic gradient descent [43]. Ensemble learning refers to a general meta-approach that strategically improves predictive performance by combining the predictions from multiple models. Four ensemble learning methods were implemented, namely, random forest [44], Adaptive Boosting [45], gradient boosting decision tree [46], and light gradient boosting machine (LGBM) [47]. We used bootstrapped cross-validation, randomly splitting the data into training and validation sets at a ratio of 7:3; this was repeated 100 times with replacement [48]. Models were trained on the training set and evaluated based on their prediction performance on the validation set. The best model was selected and tuned based on the average prediction performance over the 100 validation sets. The area under the curve (AUC) was used to evaluate the performance of the models. In multiclass prediction, sensitivity, specificity, likelihood ratios, and AUC were calculated through a one-vs-rest scheme (microaverage). Then, recalibration was performed using calibration curves [49] and Brier scores [50] to adjust for the difference between the predicted probability and the observed proportion of each diagnostic category. External validation was performed using the data of the newest patients in the cohort (enrolled during the last 2 months), which constituted the test set. The 95% CIs of all the metrics were calculated through bootstrapping.
The missing values of Boolean variables were imputed with False in the main results, and a sensitivity analysis was conducted by comparing different imputation strategies (ie, without imputation or imputation with True). All machine learning algorithms were implemented in Python, and the code is available in online resources. Hyperparameters were set to their defaults in the state-of-the-art machine learning package scikit-learn (sklearn).
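As an illustration of these imputation strategies (the data frame and column name here are hypothetical, not the study's actual variables):

```python
# Sketch of the three handling strategies for missing Boolean answers:
# impute with False (main analysis), impute with True, or leave missing
# (gradient boosting implementations such as LGBM accept NaN directly).
import numpy as np
import pandas as pd

def impute_booleans(df, strategy="false"):
    out = df.copy()
    if strategy == "false":
        out = out.fillna(0.0)   # False encoded as 0
    elif strategy == "true":
        out = out.fillna(1.0)   # True encoded as 1
    elif strategy != "none":
        raise ValueError(f"unknown strategy: {strategy}")
    return out
```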

Robustness and Sample Size Analysis
As a data-driven approach intended to support clinical diagnosis, the model requires verification that the number of samples is sufficient for development and validation. Following Riley [51] and Riley et al [52], we quantified the sufficiency of the sample size in terms of the global shrinkage factor and the minimal number of samples. The criterion for a sufficient sample size is a global shrinkage factor >0.9. Further, given an acceptable shrinkage factor (eg, 0.9), the sample size necessary to develop a prediction model can be estimated based on the Cox-Snell ratio of explained variance.
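As a worked illustration of this criterion, the minimum development sample size follows from the number of candidate predictor parameters p, the anticipated Cox-Snell R², and the target shrinkage factor S as n = p / ((S − 1) ln(1 − R² / S)). The numeric inputs below are illustrative, not the study's values.

```python
# Riley et al sample-size criterion: the minimum n such that the expected
# global shrinkage factor is at least S (0.9 here). Inputs are illustrative.
import math

def min_sample_size(p, r2_cs, shrinkage=0.9):
    """p: candidate predictor parameters; r2_cs: anticipated Cox-Snell R^2."""
    return math.ceil(p / ((shrinkage - 1.0) * math.log(1.0 - r2_cs / shrinkage)))

# eg, 20 parameters with an anticipated Cox-Snell R^2 of 0.3
n_required = min_sample_size(p=20, r2_cs=0.3)  # 494 patients
```

A stricter shrinkage target (eg, 0.95) raises the required sample size, reflecting the heavier penalty on overfitting.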
Further, the increased flexibility of modern techniques implies that larger sample sizes may be required for reliable estimation compared with classical methods such as logistic regression. We therefore followed the approach of van der Ploeg et al [53] to evaluate the sensitivity of our best model, LGBM, to sample size. Training sets of different sizes were subsampled from the development set; each training set size was repeated 30 times to reduce the effect of randomness, and performance was measured as the average AUC on the test set.

Important Variables
To measure the importance of variables, we first evaluated multivariate feature importance according to information gain in cross-validation and selected the top 20 important variables. Then, to assess feature importance within individual diagnostic categories, each selected variable was used on its own to predict the 5 diagnostic categories, and univariate variable importance was measured in terms of AUC.
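The univariate step can be sketched as follows; ranking cases by the raw item value and reporting an orientation-free AUC are assumptions about implementation detail, not statements of the study's exact code.

```python
# Sketch of univariate variable importance: one questionnaire variable is
# used alone to discriminate one diagnostic category (one-vs-rest), and
# its discriminative ability is reported as an AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

def univariate_auc(x, y, category):
    target = (np.asarray(y) == category).astype(int)
    auc = roc_auc_score(target, np.asarray(x, dtype=float))
    return max(auc, 1.0 - auc)  # the direction of the item should not matter
```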

Overview of the Diagnostic Questionnaire
The final questionnaire consists of 23 items that incorporated branching logic. The full version of the questionnaire is available in Multimedia Appendix 1. The contents of the items are shown in Textbox 1.

Textbox 1.
Items in the diagnostic questionnaire.
• One question on the characteristic of the symptom: was the head spinning or not? If not, the kind of dizziness needs to be specified (heavy/muddled head, staggering, or other)
• Three questions on the frequency, duration, and the time elapsed since the first vertigo attack
• One question on the condition of hearing loss, that is, which side and how it changes
• Three questions on the conditions of tinnitus, aural fullness, and earache, that is, which side and whether they change before and after the attack (aggravate before/during the attack, relieve after the attack)

Of the 1041 patients, 928 were classified into the training set (for modeling and cross-validation) and 113 were included in the test set (Table 2). Figure 1 shows the study flowchart. The details of the training set and test set are described in Table 2.

Development and Validation of the Model
The LGBM model had the highest AUC of 0.937 (95% CI 0.917-0.962) and the lowest Brier score of 0.057 (95% CI 0.049-0.068) among the 9 models in cross-validation (Table 3). Therefore, it was recalibrated and used as the final predictive model.
For the sensitivity analysis, when the missing values were imputed with the mode (the most frequent label), the AUC and Brier scores of all 9 methods dropped (Table 4). Note that LGBM does not rely on imputation; it can directly use the information carried by missingness to achieve better prediction performance.
LGBM without imputation performed as well as the recalibrated LGBM (imputed with False), which verifies the robustness of our method. Ensemble learning methods performed better than non-ensemble learning methods, except logistic regression with LASSO, in cross-validation, indicating that ensemble learning is effective for vertigo diagnosis regardless of the specific ensemble approach. Further, LGBM outperformed the other methods in both AUC and Brier score.
The receiver operating characteristic curves of the recalibrated LGBM model in cross-validation are shown in Figure 2. Table 5 presents the AUC, sensitivity, specificity, likelihood ratios, and accuracy for the different diagnostic categories in both cross-validation and external validation. The model made highly accurate predictions for sudden sensorineural hearing loss with vertigo (SSNHL-V) (AUC>0.98, positive likelihood ratio [+LR]>20, negative likelihood ratio [-LR]<0.05), accurate predictions for BPPV and Meniere disease (AUC>0.95, sensitivity>0.8, specificity>0.9, accuracy>0.9, +LR>10, -LR<0.2), and fair discriminative ability for vestibular migraine (AUC 0.9, 95% CI 0.87-0.92). The prediction of other diagnoses was unstable owing to the limited sample size and great heterogeneity in this category, with an AUC ranging from 0.771 to 0.929 in cross-validation and from 0.879 to 0.957 in external validation.
The calibration curves in cross-validation (Figure 3) showed that the model properly estimated the probabilities of Meniere disease and vestibular migraine and slightly underestimated those of SSNHL-V and BPPV. The predictions for other diagnoses were relatively conservative, as the model was less likely to give probabilities close to 0 or 1. The Brier score was 0.058 (95% CI 0.049-0.068) in cross-validation, which suggests that the predicted probabilities fitted well with the actual proportions of the diagnoses. We also applied our methods to the external data set. The results indicated that the selected best model, LGBM, generalized well in predicting the diagnosis of vertigo, achieving an AUC of 0.958 (95% CI 0.951-0.969). LGBM also performed better than the second-best method, logistic regression, which achieved an AUC of 0.939 (95% CI 0.925-0.956) in external validation. The multivariable feature importance in terms of information gain is shown in Table 6.
The analysis of the global shrinkage factor of each diagnostic category and the sensitivity analysis indicated that the sample size of this study was sufficient for model development; see Multimedia Appendix 2 for details of the sample size analysis. To assess feature importance within individual diagnostic categories, each of the top 20 contributing variables in Table 6 was used on its own to predict the 5 diagnostic categories, and univariate variable importance was measured in terms of AUC (Figure 4).

Principal Findings
In this multicenter prospective cohort study, a questionnaire was developed to diagnose vertigo, and an LGBM model was developed using patients' historical data collected through the questionnaire. This is, to our knowledge, the first questionnaire-based machine learning model to predict multiple diagnoses of vertigo. Because all the patients in this study were from ENT and vertigo clinics, the distribution of diagnoses differs from that in previous studies conducted in neurology and balance clinics [19][20][21][26]. There was a much higher prevalence of SSNHL-V (173/1693, 10.2%) and a lower prevalence of vestibular neuritis (22/1693, 1.3%) in our study.
Our model outperformed previously reported questionnaire-based statistical models in predicting common vestibular diagnoses [20,21,26]. A possible explanation is that machine learning methods are better at dealing with potentially nonlinear relationships and at controlling overfitting. Additionally, given the subjectivity of patient-reported historical information, data-driven models are better suited to questionnaire-based prediction than knowledge-driven models [9,11,54,55]. Compared with previous machine learning diagnostic systems that used comprehensive patient history data, physical examinations, and laboratory tests, our questionnaire-based diagnostic model has its merits [13][14][15][16][17]. First, medical history provides important clues to the cause of vertigo, based on which the doctor will try to confirm or exclude a presumptive diagnosis; a questionnaire-based diagnostic tool can therefore provide early decision support according to patient history and help reduce unnecessary workup. Further, since the questionnaire data come directly from patients, the model's performance does not rely on the accurate interpretation of patient history by professionals. Besides, considering the limited accessibility of specific tests (eg, pure tone audiometry, caloric test, video head impulse test), a questionnaire requiring no special equipment is suitable across different clinical settings.

However, a questionnaire-based diagnostic model also has intrinsic limitations. Patient-reported medical history can be imprecise because it is easily affected by recall bias, misinterpretation, the emotional state of the patient, and other subjective factors. Meanwhile, for patients with only nonspecific symptoms, physical examination and laboratory testing are more important diagnostic tools. Patient history should always be combined with objective evidence to make a more reliable diagnosis.
Therefore, it is necessary to introduce physical examination and laboratory test results into the system in the future to make a comprehensive stepwise diagnostic prediction.

Limitations
This study had the following limitations. The uneven distribution of diagnoses made it difficult for the model to give accurate predictions for rare diagnoses. To reduce potential noise, we included only patients with 1 final diagnosis in modeling. The exclusion of patients with an undetermined diagnosis was a potential source of bias. There were several reasons that these patients did not receive a specific diagnosis. In some cases, patients with BPPV might have experienced spontaneous remission while waiting for the scheduled positional test and treatment (1-2 weeks later), which also explains the relatively low prevalence of BPPV in our cohort compared with that in other ENT clinics [56]. The exclusion of these patients could reduce noise and improve model performance. Besides, some patients only experienced transient symptoms without observable structural, functional, or psychological changes; therefore, no specific diagnosis was given. Moreover, while a majority of patients completed all the necessary examinations within the follow-up period, it is also possible that some rare causes were not determined within 2 months, possibly adding to the imbalance of the data. Nevertheless, as the cohort expands, more patients with rare diagnoses will be included, which will enable the model to predict rare diagnoses with higher accuracy; the influence of imbalanced data can also be managed during modeling. Meanwhile, the observed AUC in external validation was higher than that in cross-validation, which could be accounted for by the relatively small sample size of the test set. More participants with a definite diagnosis are needed to provide further validation. Finally, since this study was conducted in the ENT and vertigo clinics of tertiary centers, the predictive power of the model is yet to be verified in different clinical settings.

Conclusion
This study presents the first questionnaire-based machine learning model for the prediction of common vestibular disorders. The model achieved strong predictive ability for BPPV, vestibular migraine, Meniere disease, and SSNHL-V by using an ensemble learning method, LGBM. As part of the OVerAIR platform, it can be used to assist clinical decision-making in ENT clinics and help with the remote diagnosis of BPPV. We have also been working on a smartphone app that integrates the questionnaire with referral, follow-up, treatment, and rehabilitation to improve the health outcomes of patients with vertigo. The next phase of the OVerAIR study will involve the participation of neurologists, which is expected to improve the model's predictive ability for central vertigo and help assess its generalization and robustness.