This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Bipolar disorder (BD) is the 10th most common cause of frailty in young individuals and is associated with morbidity and mortality worldwide. Patients with BD have a life expectancy 9 to 17 years lower than that of the general population. BD is a predominant mental disorder, but it can be misdiagnosed as depressive disorder, which leads to difficulties in treating affected patients. Approximately 60% of patients with BD are treated for depression. Machine learning offers techniques that may support better diagnosis of BD.
This review aims to explore the machine learning algorithms used for the detection and diagnosis of bipolar disorder and its subtypes.
The study protocol adopted the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines. We explored 3 databases, namely Google Scholar, ScienceDirect, and PubMed. To enhance the search, we performed backward screening of all the references of the included studies. Based on the predefined selection criteria, 2 levels of screening were performed: title and abstract review, and full review of the articles that met the inclusion criteria. Data extraction was performed independently by all investigators. To synthesize the extracted data, a narrative synthesis approach was followed.
We retrieved 573 potential articles from the 3 databases. After preprocessing and screening, only 33 articles that met our inclusion criteria were identified. The most commonly used data belonged to the clinical category (19, 58%). We identified different machine learning models used in the selected studies, including classification models (18, 55%), regression models (5, 16%), model-based clustering methods (2, 6%), natural language processing (1, 3%), clustering algorithms (1, 3%), and deep learning–based models (3, 9%). Magnetic resonance imaging data were most commonly used for classifying bipolar patients compared to other groups (11, 34%), whereas microarray expression data sets and genomic data were the least commonly used. The highest reported accuracy was 98%, whereas the lowest was 64%.
This scoping review provides an overview of recent studies based on machine learning models used to diagnose patients with BD, regardless of their demographics or whether they were compared with patients with other psychiatric diagnoses. Further research can be conducted to provide clinical decision support in the health industry.
Bipolar disorder (BD) is a predominant mental disorder that involves dramatic shifts in mood and temper. It is the 10th most common cause of frailty in young adults and affects approximately 1% to 5% of the overall population [
Early detection of mental disorders is a crucial step toward understanding BD and providing better treatment. Unlike other chronic conditions, which are identified through laboratory tests and statistical analysis, BD is typically detected from patients’ self-reports in questionnaires designed to uncover specific feelings, moods, and social interactions [
In a previous review, Diego et al [
BD is often misdiagnosed as depressive disorder, which leads to difficulties and delays in the treatment of affected patients [
The current scoping review was conducted to provide an understanding regarding the role of ML in diagnosing BD. A scoping review is an approach that is systematically executed to enable researchers to examine emerging evidence from available studies on a specific topic [
We conducted a systematic search in 3 electronic databases: PubMed, Google Scholar, and ScienceDirect. We searched for articles published between January 2016 and December 2021. The search was conducted from March 16 to March 20, 2021. The reference lists of the included articles were reviewed to identify additional articles that could be included.
The search strategies applied differed depending on the nature of the databases chosen for the search and are given in
The articles obtained from the search were uploaded to the Rayyan intelligent review application (Rayyan Systems Inc) in an EndNote (Clarivate) format [
Articles met the inclusion criteria if they achieved the main objective, namely providing an overview on the role of ML in diagnosing BD. The criteria identified for the inclusion and exclusion phases are given in
Empirical studies
Peer-reviewed articles, theses, dissertations, and reports
No restrictions related to machine learning algorithms and models
No restrictions on country of study
English language
No restrictions related to population
Bipolar disorder
Newspapers, magazines, reviews, proposals, and posters
Any language other than English
Machine learning algorithms that do not detect bipolar disorder
Nonhuman subjects
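The inclusion and exclusion criteria listed above can be read as a simple filter over candidate records. The following is a minimal sketch under stated assumptions, not the screening procedure the reviewers ran: the `Record` fields, the `meets_criteria` helper, and the example records are all hypothetical, and the year range is taken from the search period reported in this review.

```python
from dataclasses import dataclass

# Publication types named in the exclusion criteria.
EXCLUDED_TYPES = {"newspaper", "magazine", "review", "proposal", "poster"}


@dataclass
class Record:
    """Hypothetical screening record; field names are illustrative only."""
    title: str
    language: str
    publication_type: str
    year: int
    human_subjects: bool
    uses_ml_for_bd: bool  # ML algorithm applied to detect or diagnose BD


def meets_criteria(record: Record) -> bool:
    """Return True when a record passes every criterion listed above."""
    return (
        record.language.lower() == "english"
        and record.publication_type.lower() not in EXCLUDED_TYPES
        and 2016 <= record.year <= 2021
        and record.human_subjects
        and record.uses_ml_for_bd
    )


# Made-up records used only to exercise the filter.
records = [
    Record("ML classifier for BD from MRI", "English", "research article", 2019, True, True),
    Record("Poster on mood tracking", "English", "poster", 2020, True, True),
    Record("Rodent model of mania", "English", "research article", 2018, False, True),
    Record("Older SVM study", "English", "research article", 2012, True, True),
]

included = [r for r in records if meets_criteria(r)]
print([r.title for r in included])
```

In practice, criteria such as "empirical study" require human judgment at the title, abstract, and full-text stages, so a filter like this would only support, not replace, the two-phase screening.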
In the first phase, 3 researchers (NA, OM, and ZJ) independently screened the titles and abstracts of the retrieved articles. In the second phase, the reviewers read the full text of the articles included from the first phase. The retrieved articles were uploaded to the Rayyan intelligent review application in an EndNote format [
For data extraction, a form was developed to include all the different data considered for the scoping review such as the ML model, accuracy, and type of data used. A description of the data extraction fields is included in
This scoping review follows a narrative synthesis approach to synthesize the extracted data of the studies that made it to the final phase of inclusion and exclusion. From this analysis, we included studies that used ML models to assess participants with BD compared with other psychiatric disorders and healthy controls. The studies were classified based on the ML model used to diagnose BD, whether the model was an existing one or a novel one, BD type, data used, accuracy of diagnosis, other statistical measures, and whether the data used were private (gathered by the researchers) or public (open-access data). We also summarized the characteristics of the selected articles. Furthermore, we categorized the ML models into 10 categories and identified the characteristics of the selected studies that fitted under each category for the diagnosis of BD.
In this scoping review, we retrieved 573 potential articles from 3 different databases and included 33 studies for data synthesis, as shown in
PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flow diagram.
Among the 33 included articles, 30 were research articles (91%) [
General characteristics of the included studies (N=33).
Characteristic | Studies, n (%)

Research articles | 30 (91)
Conference proceedings | 3 (9)

Published | 33 (100)

China | 8 (24)
United States | 7 (21)
United Kingdom | 3 (9)
Canada | 2 (6)
Germany | 2 (6)
Brazil | 1 (3)
Japan | 1 (3)
Australia | 1 (3)
Italy | 1 (3)
Turkey | 1 (3)
Korea | 2 (6)
Norway | 1 (3)
Netherlands | 1 (3)
India | 1 (3)
Egypt | 1 (3)

2021 | 6 (18)
2020 | 5 (15)
2019 | 7 (21)
2018 | 7 (21)
2017 | 3 (9)
2016 | 5 (15)

Model development | 24 (73)
Evaluation | 5 (15)
Data analysis | 3 (9)
Model adaptation | 2 (6)

Bipolar disorder type 1 | 27 (82)
Bipolar disorder type 2 | 27 (82)
Psychotic bipolar | 3 (9)
Chronic bipolar | 2 (6)
First episode bipolar | 1 (3)

Machine learning | 33 (100)
Deep learning | 3 (9)

Diagnosis and detection | 33 (100)
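The n (%) cells in the table above follow from raw counts and the total of 33 included studies. As a minimal sketch, assuming percentages are rounded to whole numbers (the helper name `n_pct` is hypothetical and not part of the review):

```python
def n_pct(count: int, total: int) -> str:
    """Format a count as 'n (%)' with the percentage rounded to a whole number."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Python's round() uses round-half-to-even; none of these values sit on a .5 tie.
    return f"{count} ({round(100 * count / total)})"


TOTAL_STUDIES = 33  # N reported in the table caption

for label, count in [("Research articles", 30), ("China", 8), ("Model development", 24)]:
    print(label, "|", n_pct(count, TOTAL_STUDIES))
```

Recomputing cells this way is a quick consistency check when the same count is quoted with slightly different percentages in different places.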
Publications by year and country.
As shown in
Machine learning models and algorithms, methods, and tools used in the included studies (N=33).a,b
Model categories | Number of studies, n (%) | Study ID

Support vector machine | 9 (28) | [
Artificial neural network | 4 (12.12) | [
Artificial neural network-particle swarm optimization | 1 (3.03) | [
Random forest | 4 (12.12) | [
Prediction rule ensembles | 1 (3.03) | [
Gaussian process model | 2 (6.06) | [
Nearest neighbor classification algorithm | 1 (3.03) | [
Naive Bayes algorithm | 1 (3.03) | [
Decision tree algorithm | 1 (3.03) | [

Growth mixture modeling | 1 (3.03) | [
Linear discriminant analysis | 1 (3.03) | [

Baseline logistic regression | 1 (3.03) | [
Linear regression | 3 (9.09) | [
Elastic net method | 2 (6.06) | [
Least absolute shrinkage and selection operator | 2 (6.06) | [
Fuzzy TOPSIS method | 1 (3.03) | [

K-means clustering | 1 (3.03) | [

Deep neural network | 2 (6.06) | [
Convolutional neural network | 1 (3.03) | [
DeepBipolar | 1 (3.03) | [

Natural language processing | 1 (3.03) | [

Structured clinical interview for DSM-IVd | 1 (3.03) | [
aMachine learning models/algorithms were not reported in 2 studies, of which 1 study used a novel machine learning approach to diagnose bipolar disorder type I. The name of the model is not mentioned.
bMachine learning methods were only reported in 8 studies.
cThis is an interview-based assessment tool for diagnosis.
dDSM-IV:
The included studies employed 9 different types of classification models. In 9 (28%) of the 33 studies, SVM-based models were used to diagnose BD (specific types are not mentioned) [
The 33 included studies used 4 different types of regression models. Baseline logistic regression was used in only 1 (3.03%) study for diagnosing BD and other psychiatric disorders [
Linear discriminant analysis (LDA) and growth mixture modeling (GMM) were employed in 2 (6.06%) studies [
Among the 33 studies, 1 (3.03%) used deep neural networks and convolutional neural network algorithms [
A natural language processing model was employed by 1 (3.03%) study [
Only 1 (3.03%) study [
The Fuzzy TOPSIS method was employed in 1 (3.03%) study [
In 1 study (3.03%) [
Sample sizes varied widely across the included articles, ranging from 15 to 25,000. In 18 (56%) of the 33 studies, the sample size was less than 300, whereas in 12 (36.4%) studies, the sample size was above 300, as indicated in
Features of data used in the included studies (N=33).
Feature | Value
Data set size, n (%)a
<100 | 9 (28)
100-200 | 9 (28)
200-600 | 7 (21)
700-1000 | 3 (9)
>2000 | 2 (6)
Data type, n (%)b
Clinical data | 19 (58)
Nonclinical data | 12 (36)
Data source, n (%)c
Private | 21 (64)
Public | 9 (28)
Sample composition, %d
Disorder samples | >90
Healthy control | 10
aData set size was only reported in 30 studies.
bData types were only mentioned in 31 studies. Clinical data include blood samples, electronic medical records, neurological data, magnetic resonance imaging data, electroencephalography and microarray expression data, whereas nonclinical data include phenotype data, genotype data, genomic data, and genome wide association studies.
cPublic data include government sources, public databases, websites, and freely available databases, whereas private data include nongovernment sources, personal information, or data of specific hospitals or research organizations. Private data include databases that are not available in the public domain.
dMore than 90% of the samples used in the included studies were bipolar disorder samples (regardless of type), whereas 10% of the samples were healthy control samples.
Data types were mentioned in only 31 of the 33 studies. As shown in
Data set types used in the included studies (N=33).
Data typea | Study reference

Immune-inflammatory signature | [
Blood samples (serum) | [
Neuropsychological data | [
Neurocognitive data | [
Affective Disorder Evaluation scale | [
Magnetic resonance imaging (structural and functional) | [
Electroencephalography | [
PGBI-10Mb manic symptom data | [
Microarray expression data set | [

CANTABc cognitive scores | [
Large-scale genome-wide association | [
Phenotypic data set | [
Fractional anisotropy | [
Radial diffusivity | [
Axial diffusivity | [
Electronic medical record | [
Passive digital phenotypes | [
Bipolarity index | [
Daily mood ratings survey | [
Diffusion tensor images | [
Affective Disorder Evaluation scale | [
Activity monitoring | [
Genomic data | [
aIn several studies, more than one data type was used.
bPGBI-10M: Parent General Behavior Inventory-10-Item Mania Scale.
cCANTAB: Cambridge Neuropsychological Test Automated Battery.
The accuracies of the ML models and algorithms were reported in 24 studies, as shown in
Sensitivity was reported in only 15 studies; it ranged from ≤60% to >90%. Sensitivity was ≤60% in 1 study [
The area under the curve (AUC) value was reported in only 10 studies, ranging from ≤69% to >97%. In 3 studies, the AUC was ≤70% [
Statistical validation.
Statistics | Study reference
Accuracy, %a
≤70 | [
71-78 | [
83-90 | [
>91 | [
Sensitivity, %b
≤60 | [
65-67 | [
75-78 | [
80-88 | [
>90 | [
Specificity, %c
≤70 | [
74-77 | [
81-89 | [
>92 | [
AUCd, %
≤70 | [
74-78 | [
84-88 | [
>91 | [
aAccuracy was not reported in 9 studies. Some studies reported more than one value, so the counts do not sum to the total.
bSensitivity was not mentioned in 18 studies.
cSpecificity was not mentioned in 20 studies.
dAUC: area under the curve, a summary measure of how well a model discriminates between classes. AUC values were not reported in 23 studies.
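The review does not state which curve the AUC refers to; assuming it is the area under the receiver operating characteristic (ROC) curve, it can be computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below illustrates that definition; the function name `roc_auc` and the example scores are hypothetical and not taken from any included study.

```python
def roc_auc(labels, scores):
    """Rank-based ROC AUC: fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class (for example BD), 0 for the comparison class.
    scores: model outputs where higher means "more likely positive".
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correctly ranked pair
    return wins / (len(pos) * len(neg))


# Made-up scores: one positive (0.4) is ranked below one negative (0.7).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

Because the AUC depends only on the ranking of scores, it is unaffected by the decision threshold, which is one reason it is often reported alongside accuracy, sensitivity, and specificity.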
Previous studies stressed the importance of ML classifiers to aid in diagnosing BD accurately, as it is frequently misdiagnosed. Approximately 60% of BD cases are misdiagnosed as major depressive disorders, and a proper diagnosis may take up to 10 years [
In this review, we explored the uses of ML techniques in diagnosing BD. From the 573 studies retrieved, 33 studies were included in this review. To explore the use of ML in diagnosing BD, the information was classified into 3 main categories as follows:
This review identified ML models, methods, and tools used for diagnosing BD, some of which did not use ML methods as the primary tool for diagnosis but used them as a supportive tool.
SVMs were the most commonly used ML models for diagnosing BD, appearing in 9 (27%) of the 33 studies, followed by ANNs (5, 15%), ensemble models (3, 9%), linear regression (3, 9%), and the Gaussian process model (2, 6%). Further, natural language processing, linear discriminant analysis, and logistic regression were each used in 1 study (3 studies in total, 9%). Additionally, 7 studies applied other ML models that were emerging models or used a program to perform the diagnoses. However, only 1 study used a BD assessment tool, SCID, for the diagnosis of BD and an ML model as a supportive tool. Further, 1 study did not specify which ML model was employed. Hence, the choice of ML model appears to depend on the diagnostic task at hand, which is why studies have been exploring different ML models to better diagnose such mental disorders.
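To make concrete what an SVM classifier does, the following is a minimal sketch of a linear SVM trained by stochastic subgradient descent on the hinge loss. It is a simplified stand-in written for illustration: the included studies may have used kernels, other solvers, or library implementations, and the function names, hyperparameters, and two-feature toy data are all assumptions.

```python
import random


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=50, seed=0):
    """Fit a linear SVM by stochastic subgradient descent on the hinge loss.

    X: list of feature vectors; y: labels in {-1, +1}.
    Returns the weight vector w and the bias b.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # L2 regularization shrinks the weights slightly on every step.
            w = [wj * (1.0 - lr * lam) for wj in w]
            # A point inside the margin (or misclassified) pulls the boundary.
            if margin < 1.0:
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical, already-standardized two-feature data (not from any study):
# +1 stands for a BD case and -1 for a comparison participant.
X = [[2.0, 1.0], [1.0, 2.0], [2.0, 2.0], [1.0, 1.0],
     [-2.0, -1.0], [-1.0, -2.0], [-2.0, -2.0], [-1.0, -1.0]]
y = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
print(predictions == y)
```

Real imaging or clinical data are rarely this cleanly separable, so performance should be estimated on held-out participants rather than on the training set used here.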
The included studies used 2 types of data in diagnosing BD (clinical and nonclinical data). Clinical data were the most widely used, in 19 (58%) of the 33 studies. Among these 19 studies, 10 used magnetic resonance imaging (MRI) to classify bipolar patients compared to other groups. Other less commonly used data are mentioned in
Nonclinical data were used in 12 studies (36%); some examples of nonclinical data used are large-scale GWAS (2, 6%), phenotypic data sets (2, 6%), diffusion tensor images (DTIs) (2, 6%) and other less commonly used data (
The retrieved studies used 4 main validation measures to validate the ML models; these measures are accuracy, sensitivity, specificity, and AUC.
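The first three of these measures follow directly from a confusion matrix. As a minimal sketch, assuming BD is the positive class and using made-up counts (the function name `diagnostic_metrics` is hypothetical):

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, and specificity from confusion-matrix counts.

    tp: BD cases correctly flagged; fn: BD cases missed;
    tn: non-BD participants correctly cleared; fp: non-BD participants flagged.
    """
    total = tp + fp + tn + fn
    if total == 0 or (tp + fn) == 0 or (tn + fp) == 0:
        raise ValueError("need at least one positive and one negative case")
    return {
        "accuracy": (tp + tn) / total,     # share of all cases classified correctly
        "sensitivity": tp / (tp + fn),     # share of true BD cases detected
        "specificity": tn / (tn + fp),     # share of non-BD cases correctly cleared
    }


# Made-up example: 50 BD cases and 50 comparison participants.
metrics = diagnostic_metrics(tp=40, fp=5, tn=45, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Reporting all three together matters because a model can reach high accuracy while missing many BD cases or flagging many participants who do not have BD.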
The accuracy of the ML models and algorithms was reported in 24 studies. The accuracy ranged from ≤70% to >91%. The highest accuracy achieved was 98% in only 1 study, whereas the lowest accuracy was 64%. The largest group of studies (9, 37.5%) achieved an accuracy of 83%-90%. The mean accuracy was 82.06%. Moreover, sensitivity was reported in only 15 studies; it ranged from ≤60% to >90%. The mean sensitivity was 78.26%, and most of these studies (8, 53.3%) achieved sensitivity values between 80% and 88%. Furthermore, specificity was reported in only 13 studies. Specificity ranged from ≤70% to 92%. The mean specificity was 85.4%, and the largest group of studies (6, 46.15%) achieved specificity values of 80%-90%. Finally, the AUC value was reported in only 10 studies, ranging from ≤69% to >97%. The maximum AUC value was 97%, whereas the minimum value was 65%. The mean AUC value was 81%. An important limitation is that we were unable to compare the ML models and better categorize them owing to the variety of validation methods used in the reviewed studies. However, accuracy tended to be the most used measure in validating the ability of ML models to diagnose BD.
Diego et al [
Model performance metrics.
Data type | Study ID | Proposed model | Sensitivity, % | Specificity, % | Accuracy, % | AUCa
GWASb | [ | Random forest | 77.7 | 85.4 | 85.2 | NRc
Neuropsychological data | [ | SVMd | 76 | 77 | 77.0 | NR
ADEe and BPxf | [ | SVM | NR | NR | 96.0 | 92.1
MRIg | [ | SVM | 85 | 85 | 85 | NR
MRI | [ | SVM | 82.3 | 92.7 | 87.6 | NR
MRI | [ | SVM | 87.5 | 97.1 | 92.4 | NR
MRI | [ | SVM | NR | NR | 76.0 | 74
MRI | [ | SVM | 84.6 | 92.3 | 83.5 | NR
MRI | [ | Gaussian process model | 66.4 | 74.2 | 70.3 | NR
EEGh | [ | SVM | NR | NR | 98.0 | NR
 | [ | ANNi | 83.87 | NR | 89.89 | NR
DTIj | [ | SVM | NR | NR | 68.3 | NR
Activity monitoring | [ | RF,k CNN,l and ANN | 82 | 84 | 84 | NR
Genomic data | [ | ANN-PSOm | 83.87 | NR | 89.89 | NR
Immune-inflammatory signature | [ | Linear regression and elastic net methods | NR | NR | 86 | 97
EMRn | [ | Linear regression and elastic net methods | 75 | 81 | 78 | 84
CANTABo cognitive score | [ | Linear regression and LASSOp | NR | NR | 71.0 | NR
Phenotypic data set (passive digital phenotype) | [ | RF | NR | NR | 65 | 67
Fractional anisotropy, radial diffusivity, and axial diffusivity | [ | Gaussian process model | 66.67 | 84.21 | 75.0 | NR
PGBI-10Mq manic symptom data | [ | Growth mixture modeling | 83 | 89 | NR | NR
aAUC: area under the curve.
bGWAS: genome-wide association.
cNR: not reported in the article.
dSVM: support vector machine.
eADE: Affective Disorder Evaluation.
fBPx: bipolarity index.
gMRI: magnetic resonance imaging.
hEEG: electroencephalography.
iANN: artificial neural network.
jDTI: diffusion tensor images.
kRF: random forest.
lCNN: convolutional neural network.
mANN-PSO: ANN-particle swarm optimization.
nEMR: electronic medical record.
oCANTAB: Cambridge Neuropsychological Test Automated Battery.
pLASSO: least absolute shrinkage and selection operator.
qPGBI-10M: Parent General Behavior Inventory-10-Item Mania Scale.
This review categorized the most common ML models and data used in diagnosing BD. Based on our findings, ML models can diagnose BD using clinical and nonclinical data. Future research should include studies involving patients in both clinical and nonclinical settings to better evaluate the accuracy of the ML models.
Moreover, future studies should explore how external factors, such as social media and society at large, affect patients with mental disorders and the performance of the ML models.
Furthermore, ML models should be compared with other traditional techniques for diagnosing BD like the Affective Disorder Evaluation (ADE) scale and Structured Clinical Interview for DSM-IV.
Only 2 studies reviewed used data sets with sizes above 2000, which is not surprising considering that most studies had data size as a limitation. In future studies, the ML models should be trained and validated on a larger data set and have a larger healthy control sample, as it was less than 10% in the reviewed studies.
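The small share of healthy controls matters because accuracy alone can look strong on an imbalanced sample. The sketch below is a hypothetical illustration, not data from any included study: it assumes a sample of 90 BD cases and 10 healthy controls and a trivial model that labels everyone as BD, and it reports balanced accuracy (the mean of sensitivity and specificity) alongside the other measures.

```python
def evaluate(y_true, y_pred):
    """Return accuracy, sensitivity, specificity, and balanced accuracy.

    Labels: 1 = BD case, 0 = healthy control.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Assumed composition: 90 BD cases and 10 healthy controls.
y_true = [1] * 90 + [0] * 10
# A trivial "model" that predicts BD for every participant.
y_pred = [1] * 100

result = evaluate(y_true, y_pred)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Here the trivial model reaches 90% accuracy while identifying no healthy controls at all, which is why a larger control sample and measures beyond accuracy make reported performance easier to interpret.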
As AI use in the health sector grows rapidly, physicians should pay careful attention to the data ownership and security issues that arise when dealing with sensitive data such as medical information.
BD symptoms overlap with other mood disorders, specifically MDD, and this leads to the misdiagnosis of BD [
In addition, clinicians and researchers should explore the use of ML technology in clinical settings and address the clinical implications and outcomes of ML in diagnosing BD. Future investigations should focus on understanding people’s physiological and psychological behavior regarding the use of these technologies and the level of acceptance shown by physicians and patients. Finally, clinicians should explore the effectiveness of diagnostic models in clinical settings and develop predictive models for mental disorders like BD.
The present review was conducted to address the lack of scoping reviews that gather and categorize ML models used in diagnosing BD. The importance of this review stems from the fact that the traditional ways of diagnosing BD may lead to late diagnosis (an average of 10 years delay until formal diagnosis). This review explored studies that examined the ability of ML models to diagnose BD using a variety of data.
The most recent reviews on the implications of ML in patients with BD focused either on a specific ML model (neural networks) [
The studies considered in this review were the most recent ones, which reduced bias in terms of date selection. We also conducted a backward reference check, through which we found 2 studies. Finally, study selection involved 3 reviewers working independently; any disagreements were discussed and resolved by consensus, which reduced selection bias.
This review included only 3 databases (PubMed, Google Scholar, and ScienceDirect), and other databases were not included, such as Embase, IEEE, Scopus, and the ACM Digital Library. This may have led to the absence of some studies that might be relevant to our review; for example, we did not include XGBoost or LightGBM, which are widely used ensemble models for diagnosis purposes. Some of these databases were not included because of inaccessibility and time constraints. Moreover, we only considered articles published in the last 5 years (2016-2021). We also did not categorize the ML models as supervised or unsupervised; for example, logistic regression is a supervised learning method.
We retrieved studies published in English only, which potentially led to the absence of other relevant studies published in other languages, especially French. Our study included data belonging to the United States, United Kingdom, China, Germany, Japan, Turkey, Korea, Italy, India, Canada, Norway, Egypt, Australia, Brazil, and the Netherlands. We did not include data from other populations, which makes our results less comprehensive.
Furthermore, this review focused mainly on ML models diagnosing BD, regardless of what the patients were compared to in the training and testing sets (other psychiatric diagnoses) and regardless of the demographics of the patients. Without a healthy control sample, models evaluated only against other psychiatric diagnoses may yield biased decisions. Moreover, our search queries lacked terms related to specific ML algorithms or models. Hence, we did not retrieve articles that used these terms in the title or abstract instead of ML. This again reduced the diversity of our scoping review.
This scoping review grouped recent studies based on the ML model used to diagnose patients with BD regardless of their demographics or their assessments compared to patients with other psychiatric diagnoses. We have also provided information about the data used and summarized the data that were most commonly used in diagnosing BD. The goal of this review was to provide insights into how these technologies can help in faster and better diagnosis of BD and to promote their use in making clinical decisions in the health industry.
List of queries used in various databases.
Description of data extraction fields.
Characteristics of the included studies and purposes of machine learning techniques used in the studies.
Summary of all the data extracted from the included studies.
Fractions of articles by publication type.
Fractions of numbers of articles published by year.
axial diffusivity
Affective Disorder Evaluation
artificial intelligence
Artificial neural network-particle swarm optimization
bipolar disorder
Cambridge Neuropsychological Test Automated Battery
diffusion tensor images
electroencephalography
electronic health record
fractional anisotropy
functional magnetic resonance imaging
Gaussian process classifier
genome-wide association data
logistic regression
machine learning
magnetic resonance imaging
natural language processing
obsessive compulsive disorder
Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews
radial diffusivity
random forest
resting state functional magnetic resonance imaging
support vector machine
Yale-Brown Obsessive Compulsive Scale
We thank the faculty of the Division of Information Technology, College of Science and Engineering, Hamad Bin Khalifa University, Doha, for providing the opportunity to conduct this review.
The review was developed under the supervision and guidance of MH, AA, and AAA. Each reviewer independently carried out the study selection and data extraction phase. NAA reviewed OM’s work in both phases, OM revised ZJ’s work, and ZJ revised NAA’s work. Any disagreements with the decisions made were discussed and a decision was made upon consensus. All reviewers collaborated equally on the manuscript writeup and data extraction. TA helped with the classification of machine learning models as well as the designing of performance metrics. ZJ prepared the final manuscript file, and AAA and MH reviewed the final version of the manuscript.
None declared.