The Role of Machine Learning in Diagnosing Bipolar Disorder: Scoping Review

Background: Bipolar disorder (BD) is the 10th most common cause of frailty in young individuals and is a source of morbidity and mortality worldwide. Patients with BD have a life expectancy 9 to 17 years lower than that of the general population. BD is a predominant mental disorder, but it can be misdiagnosed as depressive disorder, which leads to difficulties in treating affected patients. Approximately 60% of patients with BD are treated for depression. Machine learning offers advanced techniques that may improve the diagnosis of BD.
Objective: This review aims to explore the machine learning algorithms used for the detection and diagnosis of bipolar disorder and its subtypes.
Methods: The study protocol adopted the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines. We explored 3 databases, namely Google Scholar, ScienceDirect, and PubMed. To enhance the search, we performed backward screening of all the references of the included studies. Based on the predefined selection criteria, 2 levels of screening were performed: title and abstract review, and full review of the articles that met the inclusion criteria. Data extraction was performed independently by all investigators. To synthesize the extracted data, a narrative synthesis approach was followed.
Results: We retrieved 573 potential articles from the 3 databases. After preprocessing and screening, only 33 articles met our inclusion criteria. The most commonly used data belonged to the clinical category (19, 58%). We identified different machine learning models used in the selected studies, including classification models (18, 55%), regression models (5, 16%), model-based clustering methods (2, 6%), natural language processing (1, 3%), clustering algorithms (1, 3%), and deep learning–based models (3, 9%). Magnetic resonance imaging data were most commonly used for classifying bipolar patients compared to other groups (11, 34%), whereas microarray expression data sets and genomic data were the least commonly used. The highest reported accuracy was 98%, whereas the lowest was 64%.
Conclusions: This scoping review provides an overview of recent studies based on machine learning models used to diagnose patients with BD, regardless of their demographics or of whether they were compared with patients with other psychiatric diagnoses. Further research can be conducted to provide clinical decision support in the health industry.


Introduction
Background
Bipolar disorder (BD) is a predominant mental disorder that involves dramatic shifts in mood and temper. It is the 10th most common cause of frailty in young adults and affects approximately 1% to 5% of the overall population [1]. It typically manifests as emotional states accompanied by disturbances in thinking, ranging from extreme mania and excitement to severe depression [2]. An epidemiological survey reported that its prevalence is rapidly increasing every year [3]. BD is associated with markedly higher early mortality [4]. Patients with BD have a life expectancy 9 to 17 years shorter than that of the general population [5]. Additionally, several studies from various countries, including Denmark and the United Kingdom, state that this mortality gap has been increasing continuously over recent decades [6]. Although most deaths among patients with BD are due to cardiovascular diseases and diabetes, some are due to unnatural causes. Suicide is also relatively common in patients with BD [6]. Suicide rates in patients with BD are 10%-20% higher than in the general population [4]. Together, these findings illustrate the substantial burden of BD.
To understand BD effectively and provide better treatment, early detection of mental disorders is a crucial step. Unlike the diagnosis of other chronic conditions, which relies on laboratory tests and statistical analysis, BD is typically detected based on patients' self-reports in specific questionnaires designed to uncover particular feelings, moods, and social interactions [4]. Owing to the growing availability of information on patients' mental health, artificial intelligence (AI) and machine learning (ML) techniques are proving useful for deepening our understanding of mental health conditions, and they are promising methods to support psychiatrists in making better clinical decisions and analyses [7]. In recent years, AI techniques have shown superior performance in many data-rich application areas, including BD [8,9].
In a previous review, Diego et al [10] discussed the applications of ML algorithms in diagnosing BD. They focused on 5 main application domains of ML in BD: diagnosis, prognosis, treatment, data-driven phenotypes plus research, and clinical direction. In contrast, the current review focuses only on the role of ML in diagnosing BD and its types, which has not been comprehensively reviewed in any other study. We also discuss the strengths and challenges associated with the present work and directions for future research to bridge the gap between the application of ML methods and patient diagnosis.

Research Problem
BD is often misdiagnosed as depressive disorder, which leads to difficulties and delays in the treatment of affected patients [1]. Approximately 60% of patients with BD seek treatment for major depressive disorder [11]. According to a National Chinese Mental Health Survey report, while the incidence of BD in China increased by 4.5% within a 12-month period, the recognition rate of BD as a depressive disorder increased to 39.9% [12]. Hence, there is an urgent need to diagnose BD correctly. ML increasingly provides advanced methods to diagnose BD at the individual level to achieve better clinical results [10]. Many researchers have used support vector machine (SVM) algorithms to build BD classification models that use neuroimaging information to differentiate BD from major depression [13]. In Taiwan, researchers have designed prediction algorithms using random forests that calculate the genetic risk scores of BD [14]. Nevertheless, a scoping review that covers all applications of ML for BD diagnosis is still needed. The current review aims to explore how ML algorithms are used for better diagnosis of BD.

Review Approach
The current scoping review was conducted to provide an understanding of the role of ML in diagnosing BD. A scoping review is a systematically executed approach that enables researchers to examine emerging evidence from available studies on a specific topic [15]. It is also helpful for identifying knowledge gaps in a given field [15]. This scoping review follows the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines recommended in 2016 [16].

Search Sources
We conducted a systematic search in 3 electronic databases: PubMed, Google Scholar, and ScienceDirect. We searched for articles published between January 2016 and December 2021. The search was conducted between March 16 and March 20, 2021. The reference lists of the included articles were reviewed to check for additional articles that could be included.

Search Terms
The search strategies applied differed depending on the nature of the databases chosen for the search and are given in Multimedia Appendix 1. For example, PubMed allows the application of limiters such as "humans" and "English" language articles. In addition, further search terms for BD were drawn from the Medical Subject Headings (MeSH) in PubMed. Google Scholar and ScienceDirect limit the number of search terms. Therefore, some search terms were not used when searching in these 2 databases. The intervention terms identified were ("Artificial Intelligence*" OR "Deep Learning" OR "Machine Learning" OR "Natural Language Processing" OR neural network* OR "unsupervised learning" OR "supervised learning"). The disorder terms identified were ("Bipolar disorder" OR "Bipolar 1 Disorder" OR "Bipolar 2 Disorder" OR "bipolar mood disorder" OR "bipolar affective disorder" OR "Cyclothymic Disorder" OR Cyclothym* OR manic*). For the studies' outcome, which was bipolar disorder diagnosis, the search terms used were (diagnos* OR recog* OR prognosis OR detect* OR screening*).
The articles obtained from the search were uploaded to the Rayyan intelligent review application (Rayyan Systems Inc) in an EndNote (Clarivate) format [17]. This application allows researchers to collaborate and review articles easily and at a faster pace [17]. Reviewers can create individual or collaborative reviews and make decisions regarding including or excluding the articles independently [17]. We considered 2 aspects when determining the key terms to be used for the current scoping review: population and interventions. The population we considered comprised individuals with or without any health condition, regardless of their gender, age, and ethnicity. The interventions considered include the ML models and algorithms used for diagnosing BD. The search terms were selected based on several scoping and systematic reviews we encountered during the preliminary search phase in the databases specified above.
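The three term groups above can be assembled into a single Boolean query. The following is a minimal sketch only: it assumes the groups are combined with AND, which is the usual convention but is not stated in the text (the exact strategies are in Multimedia Appendix 1), and the helper name `or_block` is hypothetical.

```python
def or_block(terms):
    """Join the terms of one concept with OR and wrap them in parentheses."""
    return "(" + " OR ".join(terms) + ")"

# Term groups copied from the review's search strategy.
intervention_terms = [
    '"Artificial Intelligence*"', '"Deep Learning"', '"Machine Learning"',
    '"Natural Language Processing"', "neural network*",
    '"unsupervised learning"', '"supervised learning"',
]
disorder_terms = [
    '"Bipolar disorder"', '"Bipolar 1 Disorder"', '"Bipolar 2 Disorder"',
    '"bipolar mood disorder"', '"bipolar affective disorder"',
    '"Cyclothymic Disorder"', "Cyclothym*", "manic*",
]
outcome_terms = ["diagnos*", "recog*", "prognosis", "detect*", "screening*"]

# Assumption: the three concept blocks are combined with AND.
query = " AND ".join(
    or_block(group)
    for group in (intervention_terms, disorder_terms, outcome_terms)
)
print(query)
```

Building the query from lists makes it easy to drop terms for databases that limit the number of search terms, as described for Google Scholar and ScienceDirect.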

Study Eligibility Criteria
Articles met the inclusion criteria if they addressed the main objective of this review, namely the role of ML in diagnosing BD. The criteria identified for the inclusion and exclusion phases are given in Textbox 1.

Study Selection
In the first phase, 3 researchers (NA, OM, and ZJ) independently screened the titles and abstracts of the retrieved articles. In the second phase, the reviewers read the full text of the articles retained from the first phase. The retrieved articles were uploaded to the Rayyan intelligent review application in an EndNote format [17]. Disagreements were discussed among the 3 reviewers, and decisions were made by consensus.

Data Extraction
For data extraction, a form was developed to include all the different data considered for the scoping review, such as the ML model, accuracy, and type of data used. A description of the data extraction fields is included in Multimedia Appendices 2 and 3. Data extraction was performed independently by the 3 reviewers (NA, OM, and ZJ) using Microsoft Excel (Microsoft Corporation). Any disagreements regarding the extracted data were resolved via consensus. A summary of all the data extracted from the included studies is given in Multimedia Appendix 4.

Data Synthesis
This scoping review follows a narrative synthesis approach to synthesize the extracted data of the studies that made it to the final phase of inclusion and exclusion. From this analysis, we included studies that used ML models to assess participants with BD compared with other psychiatric disorders and healthy controls. The studies were classified based on the ML model used to diagnose BD, whether the model was an existing one or a novel one, BD type, data used, accuracy of diagnosis, other statistical measures, and whether the data used were private (gathered by the researchers) or public (open-access data). We also summarized the characteristics of the selected articles. Furthermore, we categorized the ML models into 10 categories and identified the characteristics of the selected studies that fitted under each category for the diagnosis of BD.

Search Outcomes
In this scoping review, we retrieved 573 potential articles from 3 different databases and included 33 studies for data synthesis, as shown in Figure 1. Of the 573 retrieved articles, 488 remained after 85 duplicates were eliminated. In the first phase, screening of titles and abstracts, 430 records were excluded (wrong intervention=130 articles, population=137 articles, outcome=73 articles, study design=24 articles, publication type=40 articles, publication year=25 articles, and language=1 article). In the second phase, we reviewed the full text of 58 articles and included 31 articles. Then, 2 additional studies were added after checking the reference lists. Finally, 33 articles were selected for data synthesis.
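The screening counts reported above can be checked with simple arithmetic. The sketch below uses only the numbers stated in the text; the variable names are illustrative, and the number of articles excluded at the full-text stage is derived here rather than reported in the text.

```python
# Counts as reported in the search outcomes.
retrieved = 573
duplicates = 85
excluded_at_title_abstract = {
    "wrong intervention": 130,
    "population": 137,
    "outcome": 73,
    "study design": 24,
    "publication type": 40,
    "publication year": 25,
    "language": 1,
}
included_after_full_text = 31
added_from_reference_lists = 2

after_deduplication = retrieved - duplicates                 # 488
excluded_total = sum(excluded_at_title_abstract.values())    # 430
full_text_reviewed = after_deduplication - excluded_total    # 58
# Derived value, not stated in the text: exclusions at the full-text stage.
excluded_at_full_text = full_text_reviewed - included_after_full_text  # 27
final_included = included_after_full_text + added_from_reference_lists  # 33

print(after_deduplication, excluded_total, full_text_reviewed, final_included)
```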

Regression Models
The 33 included studies used 4 different types of regression models. Baseline logistic regression was used in only 1 (3.03%) study for diagnosing BD and other psychiatric disorders [14]. Linear regression models were used in 3 (9.09%) studies [33,34,47] to diagnose type 1, type 2, and unspecified BD. In 2 (6.06%) studies [33,47], the elastic net method and least absolute shrinkage and selection operator (LASSO) [19,34] were used for diagnosing type 1, type 2, and other unspecified BD types.
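For readers unfamiliar with the penalized regression methods named above, one widely used formulation of the elastic net objective is given below. This is a general textbook form, not necessarily the parametrization used in the cited studies; the symbols (response y, design matrix X, coefficients β, sample size n, penalty strength λ, and mixing parameter α) are notation introduced here for illustration.

```latex
\hat{\beta} = \arg\min_{\beta}\;
  \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
  + \lambda \left( \alpha \lVert \beta \rVert_1
  + \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 \right),
\qquad \lambda \ge 0,\; 0 \le \alpha \le 1
```

Setting α = 1 recovers the LASSO penalty, which can shrink some coefficients exactly to zero, whereas α = 0 gives ridge regression.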

Natural Language-Based Model
A natural language processing model was employed by 1 (3.03%) study [48] to diagnose type 1 and type 2 BD.

BD Assessment Tools
Only 1 (3.03%) study [33] used SCID (Structured Clinical Interview for DSM-IV), a BD assessment tool, for diagnosing type 1 and type 2 BD.

Features of the Data Used in the Included Studies
The sample sizes were not consistent across the included articles and ranged from 15 to 25,000. In 18 (56%) of the 33 studies, the sample size was less than 300, whereas in 12 (36.4%) studies, the sample size was above 300, as indicated in Table 3 and Multimedia Appendix 4. The most important feature of the included studies was the data type. Multidimensional data were used in the selected articles: 19 (61.3%) studies used data belonging to the clinical category, whereas 12 (38.7%) studies involved nonclinical data such as genomic data and genome-wide association studies (GWAS). Private data sources (nongovernment sources or any other clinical data that are not publicly available) were the most commonly used in the included studies, whereas the least commonly used data sources were public (government sources, public databases, online websites, and freely available databases). Most of the included studies used already existing ML models for data evaluation (10, 30.3%), whereas the second most common purpose was model adaptation (6, 18.2%). Only a few studies developed novel ML models (2, 6.1%), as shown in Multimedia Appendix 4. The most common BD types mentioned in the selected studies were type 1 and type 2, whereas the least common types were chronic bipolar, first-episode bipolar, and psychotic bipolar disorders, as observed in Table 1 and Multimedia Appendix 4.

Table 3. Features of data used in the included studies (N=33).
Feature: Value
Data set size (sample size) a, n (%)
<100: 9 (28)
b Data types were only mentioned in 31 studies. Clinical data include blood samples, electronic medical records, neurological data, magnetic resonance imaging data, electroencephalography, and microarray expression data, whereas nonclinical data include phenotype data, genotype data, genomic data, and genome-wide association studies.
c Public data include government sources, public databases, websites, and freely available databases, whereas private data include nongovernment sources, personal information, or data of specific hospitals or research organizations. Private data include databases that are not available in the public domain.
d More than 90% of the samples used in the included studies were bipolar disorder samples (regardless of type), whereas 10% of the samples were healthy control samples.
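Because the proportions in this section use different denominators (all 33 included studies for some features, and the 31 studies that reported a data type for others), it can help to recompute them explicitly. The helper `n_pct` below is a hypothetical name, and the choice of denominator for each feature is an assumption based on the table footnotes.

```python
def n_pct(n, total, digits=1):
    """Format a count as 'n (p%)' using the given denominator."""
    return f"{n} ({round(100 * n / total, digits)}%)"

STUDIES_TOTAL = 33           # all included studies
STUDIES_WITH_DATA_TYPE = 31  # studies that reported a data type (footnote b)

clinical = n_pct(19, STUDIES_WITH_DATA_TYPE)
nonclinical = n_pct(12, STUDIES_WITH_DATA_TYPE)
existing_models = n_pct(10, STUDIES_TOTAL)
adapted_models = n_pct(6, STUDIES_TOTAL)
novel_models = n_pct(2, STUDIES_TOTAL)

print(clinical, nonclinical, existing_models, adapted_models, novel_models)
```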

Types of Data Sets Used in the Included Studies
Data types were only mentioned in 31 of the 33 studies. As shown in Table 4

Principal Findings
Previous studies stressed the importance of ML classifiers to aid in diagnosing BD accurately, as it is frequently misdiagnosed. Approximately 60% of BD cases are misdiagnosed as major depressive disorders, and a proper diagnosis may take up to 10 years [46]. AI and ML exhibit considerable potential in clinical decision support and analysis with the help of big data, especially in mental health [7].
In this review, we explored the uses of ML techniques in diagnosing BD. From the 573 studies retrieved, 33 studies were included in this review. To explore the use of ML in diagnosing BD, the information was classified into 3 main categories as follows:

Machine Learning Models Used for Diagnosing BD
This review identified ML models, methods, and tools used for diagnosing BD, some of which did not use ML methods as the primary tool for diagnosis but used them as a supportive tool.
SVMs were the most commonly used ML models for diagnosing BD, appearing in 9 (27%) of the 33 studies, followed by artificial neural networks (ANNs; 5, 15%), ensemble models (3, 9%), linear regression (3, 9%), and the Gaussian process model (2, 6%). Further, natural language processing, linear discriminant analysis, and logistic regression were each used in 1 study (3 studies in total, 9%). Additionally, 7 studies applied other ML models that were emerging models or used a program to perform the diagnoses. However, only 1 study used a BD assessment tool, SCID, for the diagnosis of BD and an ML model as a supportive tool. Further, 1 study did not specify which ML model was employed. Hence, the choice of ML model appears to be influenced by the diagnostic task at hand, which is why studies have been exploring different ML models to better diagnose such mental disorders.

Data Sets Used in the Included Studies
The included studies used 2 types of data in diagnosing BD (clinical and nonclinical data). Clinical data were the most widely used, in 19 (58%) of the 33 studies. Among these 19 studies, 10 used magnetic resonance imaging (MRI) to classify bipolar patients compared to other groups. Other less commonly used data are mentioned in Table 4.
Nonclinical data were used in 12 studies (36%); some examples of nonclinical data used are large-scale GWAS (2, 6%), phenotypic data sets (2, 6%), diffusion tensor images (DTIs) (2, 6%) and other less commonly used data (Table 4). It is not surprising that nonclinical data are less commonly used because they mainly depend on surveys and tests related to mental disorders, which may lead to some biased results.

Validation of ML Models
The retrieved studies used 4 main measures to validate the ML models: accuracy, sensitivity, specificity, and area under the curve (AUC).
The accuracy of the ML models and algorithms was reported in 24 studies. The accuracy ranged from ≤70% to >91%. The highest accuracy achieved was 98% in only 1 study, whereas the lowest accuracy was 64%. Most studies achieved an accuracy of 83%-90% (9, 37.5%). The mean accuracy was 82.06%. Sensitivity was only reported in 15 studies; it ranged from ≤60% to >90%. The mean sensitivity was 78.26%, and most studies (8, 53.3%) achieved sensitivity values between 80% and 88%. Specificity was only mentioned in 13 studies and ranged from ≤70% to 92%. The mean specificity was 85.4%, and most studies (6, 46.15%) achieved specificity values of 80%-90%. Finally, the AUC value was only reported in 10 studies, ranging from ≤69% to >97%. The maximum AUC value was 97%, whereas the minimum value was 65%. The mean AUC value was 81%. An important limitation is that we were unable to compare the ML models and better categorize them owing to the variety of validation methods used in the reviewed studies. However, accuracy tended to be the most used measure in validating the ability of ML models to diagnose BD.
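To make the 4 validation measures concrete, the following minimal sketch computes them for a toy binary task (1 = BD, 0 = comparison group). The labels and scores are invented for illustration and do not come from any reviewed study; the 0.5 decision threshold is likewise an assumption. AUC is computed here as the proportion of positive-negative pairs in which the positive case receives the higher score, with ties counted as one half.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 1 and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def auc(y_true, scores):
    """Pairwise (rank-based) estimate of the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: true labels and model scores for 8 participants.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + tn + fn)
sensitivity = tp / (tp + fn)   # proportion of BD cases correctly identified
specificity = tn / (tn + fp)   # proportion of non-BD cases correctly identified
auc_value = auc(y_true, scores)

print(accuracy, sensitivity, specificity, auc_value)
```

Unlike accuracy, sensitivity, and specificity, the AUC does not depend on the chosen threshold, which is one reason why studies reporting different measures are difficult to compare directly.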

Comparison With Prior Work
Diego et al [10] conducted a systematic review that explored the applications of ML in diagnosing BD. The authors included articles from PubMed, Embase, and Web of Science published in any language up to 2017. They extracted 757 articles and included 51 studies in their review. They focused on categorizing the studies based on the data used to diagnose, treat, and prevent BD. Our focus was on providing insight into the ML techniques used to diagnose various types of BD, including bipolar 1, bipolar 2, chronic bipolar, and first-episode bipolar. However, the articles lack information on the type of BD used to train and test the ML models (20 of the 33 studies did not specify the BD type). Thus, the data were categorized based on the ML model used to classify bipolar patients. Furthermore, we highlighted the advantages of the different data types used for different ML models. MRI data that were specifically used for SVMs and Gaussian process models showed good accuracy. However, electroencephalography (EEG) data used with SVMs showed higher accuracy (98%) than MRI data, whereas DTI data used with SVMs showed lower accuracy (68.3%) than both MRI and EEG data. Hence, we can infer that the predictive power and accuracy of ML models depend on the type of input data, as summarized in Table 6.

Future Research and Practical Implications
This review categorized the most common ML models and data used in diagnosing BD. Based on our findings, ML models can diagnose BD using clinical and nonclinical data. Future research should explore the studies involving patients in clinical and nonclinical settings to better evaluate the accuracy of the ML models.
Moreover, future studies should explore how external factors, such as social media and societal pressures, affect patients with mental disorders and the performance of the ML models.
Furthermore, ML models should be compared with other traditional techniques for diagnosing BD like the Affective Disorder Evaluation (ADE) scale and Structured Clinical Interview for DSM-IV.
Only 2 of the reviewed studies used data sets with sizes above 2000, which is not surprising considering that most studies reported data size as a limitation. In future studies, the ML models should be trained and validated on larger data sets with a larger healthy control sample, as healthy controls made up less than 10% of the samples in the reviewed studies.
As AI use in the health sector is growing rapidly, physicians should pay careful attention to data ownership and security issues, which complicate the handling of sensitive data such as medical information.
BD symptoms overlap with those of other mood disorders, specifically major depressive disorder (MDD), and this leads to the misdiagnosis of BD [20]. Future research should explore the main indicators of BD; for example, studies showed that patients diagnosed with BD have abnormal gray matter density in MRI images of the brain. Another major indicator is regional homogeneity (ReHo), which indicates the activity of the brain while at rest [20,23]. Although some studies explored ML techniques that use binary classification methods, such as SVMs and logistic regression, it is still not clear how ML techniques can distinguish among patients with BD, healthy people, and patients with other mood disorders without reducing the problem to 2 groups (binary classification).
In addition, clinicians and researchers should explore the use of ML technology in clinical settings and address the clinical implications and outcomes of ML in diagnosing BD. Future investigations should focus on understanding people's physiological and psychological behavior regarding the use of these technologies and the level of acceptance shown by physicians and patients. Finally, clinicians should explore the effectiveness of diagnostic models in clinical settings and develop predictive models that can predict mental disorders like BD.
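One common way to extend binary classifiers to more than 2 groups is the one-vs-rest scheme: a separate binary model is trained for each group against all others, and the group whose model gives the highest score is predicted. The sketch below illustrates only the label handling and the final decision; the group names, labels, and scores are hypothetical, and this approach is offered as a general illustration rather than a method used in the reviewed studies.

```python
def binarize(labels, positive):
    """Recode multi-group labels as 1 (the positive group) or 0 (all others)."""
    return [1 if label == positive else 0 for label in labels]

def one_vs_rest_predict(scores_by_group):
    """Return the group whose 'this group vs the rest' score is highest."""
    return max(scores_by_group, key=scores_by_group.get)

# Hypothetical labels: bipolar disorder, major depressive disorder,
# and healthy controls (HC).
labels = ["BD", "MDD", "HC", "BD"]
bd_vs_rest = binarize(labels, "BD")  # training targets for the BD model

# Hypothetical outputs of the 3 binary models for one new participant.
scores = {"BD": 0.62, "MDD": 0.55, "HC": 0.10}
predicted = one_vs_rest_predict(scores)

print(bd_vs_rest, predicted)
```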

Strengths
The present review was conducted to address the lack of scoping reviews that gather and categorize ML models used in diagnosing BD. The importance of this review stems from the fact that the traditional ways of diagnosing BD may lead to late diagnosis (an average of 10 years delay until formal diagnosis). This review explored studies that examined the ability of ML models to diagnose BD using a variety of data.
The most recent reviews of ML in patients with BD focused either on a specific ML model (neural networks) [51] or on the application of ML using MRI data [52]. This review explored the application of ML models in diagnosing BD without any limitations in terms of the technique or the type of data used, which gives a deeper insight into the technologies used in this field.
This review considered the most recent studies to reduce bias in terms of date selection. We also conducted a backward reference check, through which we found 2 studies. Finally, study selection involved 3 reviewers working independently, and any disagreements were discussed and resolved by consensus; this reduced selection bias.

Limitations
This review included only 3 databases (PubMed, Google Scholar, and ScienceDirect), and other databases were not included, such as Embase, IEEE, Scopus, and the ACM Digital Library. This may have led to the absence of some studies that might be relevant to our review; for example, we did not include XGBoost or LightGBM (LGBM), which are among the most common ensemble models used for diagnosis purposes. Some of these databases were not included because of inaccessibility and time constraints. Moreover, we only considered articles published in the last 5 years (2016-2021). We also did not categorize the ML models as supervised or unsupervised; for example, logistic regression is a supervised learning method.
We retrieved studies published in English only, which potentially led to the absence of other relevant studies published in other languages, especially French. The included studies used data from the United States, United Kingdom, China, Germany, Japan, Turkey, Korea, Italy, India, Canada, Norway, Egypt, Australia, Brazil, and the Netherlands. Data from other populations were not included, which made our results less comprehensive.
Furthermore, this review focused mainly on ML models diagnosing BD, regardless of what the patients were compared to in the training and testing sets (other psychiatric diagnoses) and regardless of the demographics of the patients. This may bias the results when patients with BD are compared only with patients with other psychiatric diagnoses, without a healthy control sample. Moreover, our search queries lacked terms related to specific ML algorithms or models. Hence, we did not retrieve articles that used these terms in the title or abstract instead of ML. This again reduced the diversity of our scoping review.

Conclusions
This scoping review grouped recent studies based on the ML model used to diagnose patients with BD regardless of their demographics or their assessments compared to patients with other psychiatric diagnoses. We have also provided information about the data used and summarized the data that were most commonly used in diagnosing BD. The goal of this review was to provide insights into how these technologies can help in faster and better diagnosis of BD and to promote their use in making clinical decisions in the health industry.