This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Automatic diagnosis of depression based on speech can complement mental health treatment methods in the future. Previous studies have reported that acoustic properties can be used to identify depression. However, few studies have attempted a large-scale differential diagnosis of patients with depressive disorders using acoustic characteristics of non-English speakers.
This study proposes a framework for automatic depression detection using large-scale acoustic characteristics based on the Korean language.
We recruited 153 patients who met the criteria for major depressive disorder and 165 healthy controls without current or past mental illness. Participants' voices were recorded on a smartphone while performing the task of reading predefined text-based sentences. Three approaches were evaluated and compared to detect depression using data sets with text-dependent read speech tasks: conventional machine learning models based on acoustic features, a proposed model that trains and classifies log-Mel spectrograms by applying a deep convolutional neural network (CNN) with a relatively small number of parameters, and models that train and classify log-Mel spectrograms by applying well-known pretrained networks.
The proposed CNN model automatically detected depression from the acoustic characteristics of the predefined text-based sentence reading tasks. The highest accuracy achieved with the proposed CNN on the speech data was 78.14%. Our results show that deep-learned acoustic characteristics lead to better performance than the conventional approach and pretrained models.
Monitoring the mood of patients with major depressive disorder and checking its consistency with objective descriptions are important research topics. This study suggests that the analysis of speech data recorded while reading text-dependent sentences could help predict depression status automatically by capturing the characteristics of depression. Our method is smartphone based, is easily accessible, and can contribute to the automatic identification of depressive states.
Depression is a serious psychiatric illness affecting >300 million people worldwide. It leads to a variety of negative health outcomes in individuals [
A promising approach to address the abovementioned problems is to identify depression markers using advanced machine learning (ML) techniques and real-world accessible sensors (eg, wearables, cameras, and phones). These approaches may make it easier for nonspecialists to effectively identify symptoms in patients with depression and accordingly direct them toward appropriate treatment or management. Previous studies have explored a spectrum of behavioral signal approaches, such as speech [
Automatic depression detection (ADD) has gained popularity with the advent of publicly available data sets [
Text-dependent read speech helps to reduce acoustic variability, as the reading material can be designed to elicit behavior in a controlled manner, such as identical length and content [
To address these issues, we used audio samples based on text-dependent reading modes to determine whether depressive symptoms could be detected based on differences in acoustic characteristics. We believe that an autosensing approach using deep learning to describe text-dependent reading modes can significantly contribute to this field of research. Therefore, in this study, a Korean-based speech cohort was used to diagnose depression using text-dependent read speech. We then applied a convolutional neural network (CNN) layer and a dense layer to model the depressive state. Finally, the trained model was used to predict depression among unknown data samples. The performance of the model was compared with that of other approaches, namely conventional ML models and pretrained models. The results show that our approach is effective in easily and automatically assessing depressive states in speech.
The participants were 153 patients with major depressive disorder (MDD) and 165 healthy controls. All participants were from South Korea. The inclusion criterion was that study participants should be aged ≥19 years. All patients with MDD were evaluated by board-certified psychiatrists according to the Diagnostic and Statistical Manual of Mental Disorders criteria to identify their current mood states. The severity of depressive symptoms was assessed using the Hamilton Depression Rating Scale (HAM-D) and Patient Health Questionnaire-9 (PHQ-9) [
This study was reviewed and approved by the Institutional Review Boards of Inje University Ilsan Paik Hospital (number 2019-12-011-015) and Chungnam National University Hospital (number 2019-10-101-018), which specializes in mental health in South Korea.
The experimental protocol in this study was designed to assess speech responses to identify depression using predefined text-dependent speech tasks (
The reading tasks of each participant were recorded using a smartphone (built-in microphone of a Samsung Galaxy S10). Speech samples captured by the microphone were saved as mono-PCM (Pulse-Code Modulation; 32 bits) .wav files, sampled at 44.1 kHz. The speech samples were recorded in a quiet room, and the microphone was positioned at a distance of approximately 30 cm from the participants to ensure recording quality. We created slides with scripts for each task and placed them on a table where the participants could easily see them. The participants were instructed to read the script from the slides at a comfortable, self-selected pitch and volume. The collected audio data set was used for the experiments in this study.
Experimental procedure based on the read-dependent speech tasks.
The recorded audio signal included the speaker’s speech and nonspeech parts, such as silence and background noise caused by environmental factors. These nonspeech parts can be an obstacle to training acoustic features for depression recognition [
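A minimal sketch of one common way to remove such nonspeech parts is shown below, assuming an energy-based trimming step with librosa; the 30 dB threshold and the exact procedure used in this study are illustrative assumptions, not the documented pipeline.

```python
import librosa
import numpy as np

def trim_nonspeech(path, top_db=30):
    """Keep only segments whose energy is within `top_db` dB of the peak,
    discarding leading/trailing silence and low-level background noise."""
    y, sr = librosa.load(path, sr=None)                  # keep the native 44.1 kHz rate
    intervals = librosa.effects.split(y, top_db=top_db)  # (start, end) sample indices
    if len(intervals) == 0:
        return y, sr
    speech = np.concatenate([y[start:end] for start, end in intervals])
    return speech, sr
```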
Time information of audio samples in the major depressive disorder (MDD) and healthy controls (HC) groups during the vowel, digit, and passage tasks.
|  | Vowel | Digit | Passage |
| MDD | | | |
| Number of samples, n (%) | 153 (48.1) | 153 (48.1) | 153 (48.1) |
| Speech time (seconds), mean (SD) | 14.67 (4.07) | 16.02 (5.39) | 91.80 (17.46) |
| Total speech time (seconds) | 2307 | 2373 | 14,150 |
| HC | | | |
| Number of samples, n (%) | 165 (51.9) | 165 (51.9) | 165 (51.9) |
| Speech time (seconds), mean (SD) | 13.59 (4.06) | 14.39 (4.44) | 72.71 (7.91) |
| Total speech time (seconds) | 2325 | 2361 | 13,110 |
The goal of the proposed approach is to capture a series of acoustic features from audio samples using text-dependent speech tasks and map them into a similar representation space to determine the presence of depression. As CNN is a powerful framework for learning a feature hierarchy, it can provide a representation space capable of detecting depression. Thus, we chose the CNN architecture to learn spatially invariant features of audio samples. We trained a deep CNN from scratch and evaluated its classification performance.
The pipeline of the proposed method for depression detection. (A) Image representation using spectrograms of audio samples, (B) Log-Mel spectrogram-based convolutional neural network (CNN) and depression detection, and (C) architecture of the proposed CNN model in this study. CONV: convolution; ReLU: rectified linear unit; FC: fully connected.
The audio signal must be converted into a form suitable for deep CNN input, and deep feature extraction with a CNN requires a large amount of labeled audio data for training. To address this issue, we segmented the audio signal into segments of optimal length so that depression could be recognized in each segment, as shown in
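The sketch below illustrates this preprocessing, assuming librosa for the log-Mel (LM) conversion; the segment length, number of Mel bands, and hop length are illustrative values rather than the study's exact settings.

```python
import librosa
import numpy as np

def logmel_segments(y, sr, seg_seconds=5.0, n_mels=128, hop_length=512):
    """Split a waveform into fixed-length segments and convert each segment
    into a log-Mel (LM) spectrogram for CNN input. seg_seconds, n_mels, and
    hop_length are illustrative assumptions."""
    seg_len = int(seg_seconds * sr)
    segments = []
    for start in range(0, len(y), seg_len):
        seg = y[start:start + seg_len]
        if len(seg) < seg_len:                              # zero-pad the final short segment
            seg = np.pad(seg, (0, seg_len - len(seg)))
        mel = librosa.feature.melspectrogram(y=seg, sr=sr, n_mels=n_mels,
                                             hop_length=hop_length)
        segments.append(librosa.power_to_db(mel, ref=np.max))  # log (dB) scale
    return np.stack(segments)                               # (num_segments, n_mels, frames)
```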
We adopted augmentation techniques to increase the size of the training set and improve generalization. Augmentation, which generates additional training data, is an effective way to reduce model overfitting on sparse data and improve generalization performance. SpecAugment is a recent augmentation method developed by Google that operates on the spectrograms of the input audio [
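A simplified SpecAugment-style masking routine is sketched below; the number and width of the frequency and time masks are assumptions, as the study's exact augmentation parameters are not restated here.

```python
import numpy as np

def spec_augment(logmel, num_freq_masks=2, num_time_masks=2,
                 max_freq_width=8, max_time_width=16, rng=None):
    """Apply SpecAugment-style frequency and time masking to a log-Mel
    spectrogram of shape (n_mels, frames). Mask counts and widths are
    illustrative assumptions."""
    rng = rng or np.random.default_rng()
    aug = logmel.copy()
    n_mels, n_frames = aug.shape
    fill = aug.mean()                          # replace masked bins with the mean value
    for _ in range(num_freq_masks):            # mask contiguous frequency bands
        width = int(rng.integers(0, max_freq_width + 1))
        start = int(rng.integers(0, max(n_mels - width, 1)))
        aug[start:start + width, :] = fill
    for _ in range(num_time_masks):            # mask contiguous time steps
        width = int(rng.integers(0, max_time_width + 1))
        start = int(rng.integers(0, max(n_frames - width, 1)))
        aug[:, start:start + width] = fill
    return aug
```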
We designed a CNN architecture to extract features from an LM spectrogram. The detailed model structure of the proposed CNN is shown in
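Because the exact layer configuration is given in the figure, the following Keras sketch is only illustrative: a small CNN over LM spectrogram inputs with the dense classifier head described later in this section. The filter counts, input shape, and number of convolution blocks are assumptions.

```python
from tensorflow.keras import layers, models

def build_small_cnn(input_shape=(128, 431, 1)):
    """Illustrative small CNN over log-Mel spectrogram segments
    (n_mels x frames x 1); filter counts and block depth are assumptions."""
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(128, activation="relu"),   # dense classifier head (see Methods)
        layers.Dropout(0.5),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),  # MDD vs healthy control
    ])
```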
In addition to the proposed CNN architecture, pretrained CNN models were used as feature extractors to evaluate depression detection in this study. We tested the following additional pretrained CNNs designed to extract audio representations: 4 state-of-the-art image CNN models, namely, VGG16, VGG19 [
The image CNN models Inception-v3, VGG16, VGG19, ResNet50, and MobileNet-v2 were also used with LM spectrogram images as the 2D input. The LM spectrograms were resized to 224×224 images with 3 channels (R, G, and B) because of their suitability for deep feature extraction. For the classification of the image CNN models, we applied global average pooling instead of using a flattening layer after the feature extractor. For all these models, the final fully connected layers were eliminated and redesigned in the new classifier.
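A hedged sketch of this setup with one of the image CNNs is shown below (MobileNet-v2; the other models follow the same pattern), assuming frozen ImageNet weights and the 224×224×3 input described above.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

def build_pretrained_classifier(input_shape=(224, 224, 3)):
    """Pretrained image CNN as a feature extractor over resized 3-channel
    LM spectrogram images, with global average pooling and a new dense
    classifier replacing the original fully connected layers."""
    base = MobileNetV2(include_top=False, weights="imagenet", input_shape=input_shape)
    base.trainable = False                       # assumed: used as a fixed feature extractor
    return models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),         # instead of a flattening layer
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),
    ])
```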
VGGish is a deep audio embedding method for training a modified visual geometry group architecture to predict video tags from the YouTube-8M data set [
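As an illustration of extracting embeddings from one of these pretrained audio networks, the sketch below loads YAMNet from TensorFlow Hub and averages its frame-level embeddings into a clip-level feature vector. Resampling to 16 kHz and mean pooling are assumptions about the pipeline, not the study's documented steps.

```python
import numpy as np
import librosa
import tensorflow_hub as hub

# YAMNet (trained on AudioSet) exposes per-frame 1024-dimensional embeddings.
yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

def yamnet_embedding(path):
    """Return a clip-level embedding by averaging YAMNet's frame embeddings.
    YAMNet expects a mono float32 waveform sampled at 16 kHz in [-1, 1]."""
    waveform, _ = librosa.load(path, sr=16000, mono=True)
    scores, embeddings, spectrogram = yamnet(waveform.astype(np.float32))
    return embeddings.numpy().mean(axis=0)     # (1024,) clip-level feature vector
```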
For the proposed CNN and the pretrained models, the classifier used fully connected dense layers with 128 and 64 units, respectively, with rectified linear unit activation. A dropout probability of 0.5 was used between the dense layers to prevent overfitting to the training data. The output layer consisted of 1 unit to classify an image into 2 classes, MDD and control, using a sigmoid activation function for binary classification.
The training process was set up to run 20 epochs for all the CNN models. For model training, the Adam optimizer was selected with a fixed learning rate of 1e-3. Early stopping was performed based on the validation accuracy of the model. The same parameters used in the training stage were applied in the testing stage. The deep CNN was implemented using Python (version 3.7.11) and TensorFlow (version 2.5.0) on a Quadro RTX 8000 48GB GPU (NVIDIA).
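A minimal training sketch consistent with this description (Adam at 1e-3, 20 epochs, batch size 32, early stopping on validation accuracy) follows; the binary cross-entropy loss, early-stopping patience, and array variable names are assumptions.

```python
import tensorflow as tf

# build_small_cnn is the illustrative model from the sketch above.
model = build_small_cnn()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy",          # assumed loss for the sigmoid output
              metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_accuracy",
                                              patience=5,  # assumed patience
                                              restore_best_weights=True)

# x_train/y_train and x_val/y_val are hypothetical arrays of LM spectrogram
# segments and binary labels (1 = MDD, 0 = control).
history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=20,
                    batch_size=32,
                    callbacks=[early_stop])
```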
For the baseline, 3 widely used acoustic feature representations were extracted from the audio signals—Mel-frequency cepstral coefficients (MFCCs), extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [
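One plausible way to extract these baseline feature sets is sketched below, assuming librosa for MFCC statistics and the openSMILE toolkit for eGeMAPS and ComParE functionals; the specific feature-set versions, the utterance-level statistics, and the file path are assumptions.

```python
import librosa
import numpy as np
import opensmile

def mfcc_functionals(path, n_mfcc=12):
    """Utterance-level MFCC statistics (mean and SD of each coefficient);
    n_mfcc=12 follows the 12 MFCCs analyzed in the Results."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# eGeMAPS and ComParE functionals via openSMILE (assumed backend and versions).
egemaps = opensmile.Smile(feature_set=opensmile.FeatureSet.eGeMAPSv02,
                          feature_level=opensmile.FeatureLevel.Functionals)
compare = opensmile.Smile(feature_set=opensmile.FeatureSet.ComParE_2016,
                          feature_level=opensmile.FeatureLevel.Functionals)

egemaps_features = egemaps.process_file("speech.wav")   # 88-dimensional eGeMAPS set
compare_features = compare.process_file("speech.wav")   # 6373-dimensional ComParE set
```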
The 318 audio data sets were randomly split into 254 training sets, 32 validation sets, and 32 testing sets in a balanced ratio between patients with MDD and healthy controls during the experiments. Participant-independent splits were used for training, validation, and testing.
A 10-fold cross-validation method was used to split the data and evaluate the performance of each classifier. We compared the predicted results with test data to evaluate their performance. In our study, we adopted the accuracy, precision, recall, receiver operating characteristic curve, and
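A simplified sketch of this 10-fold evaluation for the conventional ML baselines with scikit-learn is given below (random forest shown; SVM, LDA, and kNN follow the same pattern). The study used participant-independent splits, which would require grouping folds by participant (eg, GroupKFold); the stratified split and random seed here are simplifying assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

def cross_validate(X, y, n_splits=10, seed=42):
    """10-fold cross-validation of a baseline classifier, reporting the
    mean (SD) of each metric; the random seed is an assumption."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = {"acc": [], "prec": [], "rec": [], "f1": [], "auc": []}
    for train_idx, test_idx in skf.split(X, y):
        clf = RandomForestClassifier(random_state=seed)
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        prob = clf.predict_proba(X[test_idx])[:, 1]
        scores["acc"].append(accuracy_score(y[test_idx], pred))
        scores["prec"].append(precision_score(y[test_idx], pred))
        scores["rec"].append(recall_score(y[test_idx], pred))
        scores["f1"].append(f1_score(y[test_idx], pred))
        scores["auc"].append(roc_auc_score(y[test_idx], prob))
    return {k: (np.mean(v), np.std(v)) for k, v in scores.items()}
```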
The demographic and clinical characteristics of the age, education, HAM-D, and PHQ-9 factors were not normally distributed; these factors were compared between the control and MDD groups using the Mann-Whitney
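For illustration, group comparisons of this kind can be run with SciPy as follows; the continuous arrays are placeholders, and only the sex counts are taken from the demographics table below.

```python
import numpy as np
from scipy.stats import mannwhitneyu, chi2_contingency

# Mann-Whitney test for a non-normally distributed continuous factor
# (placeholder values, not the study data).
age_mdd = np.array([38.0, 41.0, 29.0])
age_hc = np.array([35.0, 37.0, 40.0])
u_stat, p_age = mannwhitneyu(age_mdd, age_hc, alternative="two-sided")

# Chi-square test for the sex distribution (counts from the demographics table).
sex_table = np.array([[51, 42],     # male counts (MDD, HC)
                      [102, 123]])  # female counts (MDD, HC)
chi2, p_sex, dof, expected = chi2_contingency(sex_table)
print(f"Age: U={u_stat:.1f}, P={p_age:.3f}; Sex: chi2={chi2:.2f}, P={p_sex:.3f}")
```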
The descriptive demographic and clinical characteristics of the participants in this study are summarized in
Demographic characteristics of the major depressive disorder (MDD) and healthy controls (HC) groups.
| Factors | MDD | HC | P value |
| Sex, n (%) | | | .16a |
| Male | 51 (33.3) | 42 (25.5) | |
| Female | 102 (66.7) | 123 (74.5) | |
| Age (years), mean (SD) | 38.64 (13.26) | 37.15 (12.18) | .06 |
| Education (years), mean (SD) | 13.10 (2.57) | 14.71 (2.19) | |
| Nonalcoholic, n (%) | 55 (35.95) | 83 (50.30) | |
| Nonsmoker, n (%) | 45 (29.41) | 21 (12.73) | |
| HAM-Dd, mean (SD) | 22.18 (6.34) | 2.07 (4.61) | |
| PHQ-9e, mean (SD) | 16.92 (6.99) | 2.14 (3.44) | |
aChi-square test.
b
c
dHAM-D: Hamilton Depression Rating Scale.
ePHQ-9: Patient Health Questionnaire-9.
In this section, we analyze the classification capability of the proposed approach, including the conventional ML models and the CNN method, to recognize depression in unknown audio samples (test set) obtained from each text-based speech task.
First, we investigated the ability of the conventional ML methods to classify the samples as MDD or non-MDD for each data set of the vowel, digit, and passage tasks.
We further analyzed the variability of MFCCs between the MDD and control groups in each task to explore the differences in MFCCs between the 2 groups. Among the 12 MFCCs, MFCC3 showed the greatest difference between the groups.
Second, we compared the performance of depression detection for each CNN-based representation model using our data sets. The best performances were obtained with a batch size of 32 in our experiment; thus, all results provided below refer to the models trained with this batch size.
The receiver operating characteristic curves and AUCs based on the CNN models in the passage task are shown in
In addition, benchmark experiments were performed to compare the performance of depression detection and PHQ-9 score prediction in the benchmark model and the proposed CNN model on audio data sets in a text-dependent setting (read-dependent speech mode) and a text-independent setting (spontaneous mode). A previously reported model [
We first trained both the benchmark and proposed models in a text-dependent setting and then compared the mean performance over 10 folds for depression detection and PHQ-9 score prediction. Here, the performance of depression detection and PHQ-9 score prediction was evaluated using the accuracy or
These results demonstrate that the model proposed in this study can classify depression better than the other models in the text-dependent setting.
Comparisons of the performance of machine learning classifiers for Mel-frequency cepstral coefficients (MFCCs), extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS), and Interspeech Computational Paralinguistics Challenge (COMPARE) feature sets of vowel task (N=318).
| Feature and classifier | Test set | | | |
| | ACCa (%), mean (SD) | PRECb (%), mean (SD) | RECc (%), mean (SD) | F1d (%), mean (SD) |
| MFCC | | | | |
| SVMe | 56.56 (3.81) | 56.22 (4.01) | 55.82 (3.84) | 55.43 (3.91) |
| LDAf | 54.69 (2.88) | 54.25 (2.99) | 53.98 (2.80) | 53.65 (2.75) |
| kNNg | 57.19 (5.60) | 56.84 (5.85) | 56.65 (5.58) | 56.47 (5.80) |
| RFh | 60.63 (1.53) | 61.24 (2.07) | 59.41 (1.44) | 58.40 (1.31) |
| eGeMAPS | | | | |
| SVM | 59.38 (5.93) | 59.27 (6.37) | 58.59 (6.16) | 57.90 (6.58) |
| LDA | 59.69 (2.60) | 59.44 (2.66) | 59.16 (2.66) | 59.05 (2.71) |
| kNN | 59.37 (3.13) | 59.30 (3.17) | 59.29 (3.18) | 59.23 (3.18) |
| RF | 61.25 (2.86) | 61.67 (3.20) | 60.16 (2.99) | 59.32 (3.39) |
| COMPARE | | | | |
| SVM | 48.75 (3.48) | 48.57 (3.62) | 48.67 (3.44) | 48.26 (3.70) |
| LDA | 51.25 (3.19) | 51.37 (3.46) | 51.37 (3.40) | 51.07 (3.38) |
| kNN | 59.38 (5.23) | 59.12 (5.42) | 58.78 (5.32) | 58.61 (5.40) |
| RF | 72.80 (2.44) | 73.70 (2.19) | 72.14 (2.64) | 72.04 (2.77) |
aACC: accuracy.
bPREC: precision.
cREC: recall.
dF1: F1-score.
eSVM: support vector machine.
fLDA: linear discriminate analysis.
gkNN: k-nearest neighbor.
hRF: random forest.
Comparisons of the performance of machine learning classifiers for Mel-frequency cepstral coefficients (MFCCs), extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS), and Interspeech Computational Paralinguistics Challenge (COMPARE) feature sets of digit task (N=318).
| Feature and classifier | Test set | | | |
| | ACCa (%), mean (SD) | PRECb (%), mean (SD) | RECc (%), mean (SD) | F1d (%), mean (SD) |
| MFCC | | | | |
| SVMe | 53.75 (4.80) | 53.82 (4.96) | 53.76 (4.87) | 53.62 (4.74) |
| LDAf | 56.25 (5.76) | 56.07 (5.84) | 55.92 (5.66) | 55.87 (5.64) |
| kNNg | 51.25 (5.80) | 50.80 (4.41) | 50.82 (4.36) | 50.61 (4.44) |
| RFh | 52.81 (3.26) | 51.66 (4.09) | 51.63 (3.48) | 50.24 (4.24) |
| eGeMAPS | | | | |
| SVM | 51.84 (4.62) | 51.32 (4.58) | 51.30 (4.56) | 51.17 (4.60) |
| LDA | 48.01 (2.50) | 47.95 (2.69) | 48.04 (2.64) | 47.92 (2.69) |
| kNN | 45.36 (3.42) | 45.60 (3.48) | 45.52 (3.39) | 45.13 (3.30) |
| RF | 56.21 (3.56) | 56.15 (4.89) | 55.10 (3.68) | 52.96 (4.25) |
| COMPARE | | | | |
| SVM | 57.81 (5.09) | 57.64 (5.69) | 57.08 (5.78) | 55.73 (6.77) |
| LDA | 62.19 (4.30) | 62.32 (4.49) | 62.25 (4.41) | 62.09 (4.31) |
| kNN | 45.31 (4.01) | 45.14 (4.03) | 45.16 (4.03) | 45.10 (4.03) |
| RF | 73.44 (1.56) | 73.84 (1.79) | 72.96 (1.49) | 72.99 (1.53) |
aACC: accuracy.
bPREC: precision.
cREC: recall.
dF1: F1-score.
eSVM: support vector machine.
fLDA: linear discriminate analysis.
gkNN: k-nearest neighbor.
hRF: random forest.
Comparisons of the performance of machine learning classifiers for Mel-frequency cepstral coefficients (MFCCs), extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS), and Interspeech Computational Paralinguistics Challenge (COMPARE) feature sets of passage task (N=318).
| Feature and classifier | Test set | | | |
| | ACCa (%), mean (SD) | PRECb (%), mean (SD) | RECc (%), mean (SD) | F1d (%), mean (SD) |
| MFCC | | | | |
| SVMe | 70.73 (5.93) | 70.00 (6.29) | 68.18 (6.19) | 68.63 (6.63) |
| LDAf | 69.95 (8.38) | 70.57 (8.23) | 69.79 (8.34) | 69.49 (8.60) |
| kNNg | 63.45 (7.76) | 63.55 (8.05) | 63.25 (7.91) | 63.13 (7.87) |
| RFh | 68.54 (7.75) | 66.45 (7.40) | 67.98 (8.09) | 67.27 (8.74) |
| eGeMAPS | | | | |
| SVM | 59.38 (2.08) | 59.35 (2.15) | 58.89 (2.15) | 58.58 (2.34) |
| LDA | 58.14 (2.99) | 57.96 (3.34) | 57.21 (2.96) | 56.62 (2.99) |
| kNN | 61.88 (4.37) | 62.11 (4.85) | 64.05 (4.71) | 60.66 (4.37) |
| RF | 57.81 (2.21) | 57.33 (2.91) | 56.74 (2.17) | 56.10 (2.36) |
| COMPARE | | | | |
| SVM | 63.44 (2.44) | 63.44 (2.53) | 62.96 (2.58) | 62.80 (2.71) |
| LDA | 62.19 (4.93) | 62.17 (4.94) | 62.10 (4.87) | 62.03 (4.90) |
| kNN | 65.63 (3.13) | 65.61 (3.19) | 65.49 (3.07) | 65.44 (3.06) |
| RF | 68.75 (1.40) | 69.69 (1.27) | 67.84 (1.49) | 67.60 (1.63) |
aACC: accuracy.
bPREC: precision.
cREC: recall.
dF1: F1-score.
eSVM: support vector machine.
fLDA: linear discriminate analysis.
gkNN: k-nearest neighbor.
hRF: random forest.
The Mel-frequency cepstral coefficient 3 (MFCC3) variability between the major depressive disorder (MDD) and control groups in the (A) vowel, (B) digit, and (C) passage tasks.
Comparisons of the depression detection performance of convolutional neural network (CNN) models in all tasks (N=318).
| Data set and models | Test set | | | |
| | ACCa (%), mean (SD) | PRECb (%), mean (SD) | RECc (%), mean (SD) | F1d (%), mean (SD) |
| Vowel | | | | |
| Proposed CNNs | 65.44 (6.58) | 64.04 (21.56) | 46.66 (24.55) | 51.13 (24.38) |
| VGG16 | 47.70 (1.76) | 53.16 (1.26) | 41.06 (3.83) | 46.24 (2.46) |
| VGG19 | 47.97 (0.95) | 53.29 (0.83) | 43.06 (6.30) | 47.37 (4.30) |
| ResNet50 | 47.02 (3.98) | 10.99 (21.99) | 20.00 (40.00) | 14.89 (28.38) |
| Inception-v3 | 56.83 (1.33) | 63.13 (1.81) | 51.81 (3.01) | 56.84 (1.87) |
| MobileNet-v2 | 57.53 (0.85) | 63.34 (0.99) | 54.20 (4.66) | 58.29 (2.24) |
| VGGish | 55.93 (0.78) | 59.93 (0.77) | 60.01 (1.16) | 59.99 (0.75) |
| YAMNet | 62.54 (0.98) | 65.78 (0.82) | 66.43 (1.96) | 66.09 (1.16) |
| Digit | | | | |
| Proposed CNNs | 66.60 (7.10) | 56.79 (28.83) | 47.72 (24.99) | 51.53 (26.23) |
| VGG16 | 61.69 (1.24) | 62.65 (1.19) | 48.49 (7.53) | 54.32 (5.20) |
| VGG19 | 53.20 (1.62) | 60.62 (1.66) | 35.94 (6.53) | 44.79 (5.90) |
| ResNet50 | 54.15 (9.25) | 11.89 (18.16) | 30.00 (45.83) | 17.03 (26.00) |
| Inception-v3 | 61.69 (1.24) | 65.99 (1.50) | 58.72 (3.76) | 62.06 (2.04) |
| MobileNet-v2 | 61.92 (2.03) | 65.59 (1.64) | 60.49 (3.81) | 62.9 (2.70) |
| VGGish | 46.00 (1.74) | 49.83 (1.60) | 51.26 (2.94) | 50.52 (2.14) |
| YAMNet | 57.37 (0.83) | 61.32 (0.73) | 55.18 (3.06) | 58.04 (1.76) |
| Passage | | | | |
| Proposed CNNse | | | | |
| VGG16 | 66.80 (0.40) | 65.97 (1.48) | 65.48 (3.11) | 65.64 (1.05) |
| VGG19 | 65.58 (2.52) | 58.14 (2.69) | 66.38 (5.16) | 61.80 (2.08) |
| ResNet50 | 56.62 (4.84) | 79.98 (9.15) | 17.11 (18.59) | 23.54 (21.24) |
| Inception-v3 | 65.46 (0.28) | 63.00 (0.69) | 69.86 (2.04) | 66.23 (0.60) |
| MobileNet-v2 | 67.56 (1.28) | 65.94 (2.13) | 68.98 (3.07) | 67.29 (0.97) |
| VGGish | 62.17 (0.69) | 59.96 (0.81) | 66.70 (1.70) | 63.13 (0.75) |
| YAMNet | 62.65 (0.52) | 60.38 (0.46) | 67.39 (1.46) | 63.69 (0.74) |
aACC: accuracy.
bPREC: precision.
cREC: recall.
dF1: F1-score.
eThe accuracy, precision, recall, and
Mean and SD of the receiver operating characteristic curve and area under receiver operating characteristic curve (AUC) based on the convolutional neural network (CNN) models with 10-fold cross-validation in the passage task.
Comparison of the classification accuracy according to the model parameter size in convolutional neural network (CNN) models. ResNet: residual neural network; VGG: visual geometry group; YAMNet: yet another mobile network; FC: fully connected.
Performance comparisons of a benchmark model [
| Models | Text-dependent setting (mean of 10-fold) | | | | Text-independent setting (single fold) | | | |
| | ACCa (%) | F1b (%) | CCCc | RMSEd | ACC (%) | F1 (%) | CCC | RMSE |
| Proposed CNNs model | 78.14 | 77.27 | 0.28 | 9.21 | 56.82 | 37.84 | 0.287 | 5.53 |
| GCNN-LSTMe | 51.65 | 50.90 | 0.43 | 8.10 | 58.57 | 39.78 | 0.497 | 5.70 |
aACC: accuracy.
bF1: F1-score.
cCCC: Concordance Correlation Coefficient.
dRMSE: Root Mean Square Error.
eGCNN-LSTM: Gated Convolutional Neural Network-Long Short Term Memory.
Our results suggest that the speech characteristics obtained through text-dependent speech can be a promising biomarker for MDD. In this study, we recorded speech data using a mobile phone with predefined text-based reading speech tasks and confirmed the potential for automatically detecting depression based on the recorded data and a deep learning approach. As previous studies have suggested the possibility of detecting depression in speech data [
In the passage task, the approach that adopted the CNN model improved performance by approximately 7% or more compared with the conventional ML approach. Compared with conventional ML methods that evaluate handcrafted feature sets, the CNN-based acoustic feature extraction approach captures features across a broader spectrum and achieves higher classification accuracy. When the performances of the 3 text-dependent reading tasks were compared (
The features of the audio samples were also extracted from well-known pretrained models (VGG16, VGG19, Inception-v3, ResNet50, MobileNet-v2, VGGish, and YAMNet), in the same manner as the features extracted through the customized CNN model, and the classification performance was compared. We found that the feature sets extracted through the pretrained models from all data sets of the text-based speech tasks did not significantly affect the classification performance, as shown in
Audio samples in the passage task achieved 78% accuracy in depression detection (
Previous studies have focused on developing depression detection models based on open cohort data [
Our study had some limitations. First, we used a small number of voice samples in the experiments; thus, we are currently recruiting additional participants to collect more data for deep learning. With further research in this cohort, we plan to report the outcomes of developing ML techniques for disease diagnosis and severity prediction. Second, all the audio samples were recorded in a quiet environment. Extended studies are needed to apply this approach to other recording environments (eg, in-the-wild and noisy conditions). In addition, we plan to explore different strategies for combining our speech-based systems with other information, such as video or physiological signals. A multimodal detection approach would provide a robust framework that operates under more precise and natural conditions. We believe that these efforts will help build more robust predictors of MDD for daily life in the future.
This study has opened new opportunities to identify speech markers related to the assessment of depression through readily obtainable speech patterns. This study also presented an approach to automatically detect whether a person has depression by analyzing their speech. We acquired audio samples from 318 participants with depression and healthy controls based on the Korean text-dependent read speech tasks using a smartphone and analyzed their association with depression. We found that learning acoustic patterns from audio with deep learning offers many benefits for detecting depression. This approach has the potential to help reduce the burden of depression and is powerful and effective for ADD via speech.
ADD: automatic depression detection
AUC: area under receiver operating characteristic curve
CCC: Concordance Correlation Coefficient
CNN: convolutional neural network
COMPARE: Interspeech Computational Paralinguistics Challenge
eGeMAPS: extended Geneva Minimalistic Acoustic Parameter Set
HAM-D: Hamilton Depression Rating Scale
kNN: k-nearest neighbor
LDA: linear discriminate analysis
LM: log-Mel
MDD: major depressive disorder
MFCC: Mel-frequency cepstral coefficient
ML: machine learning
PCM: Pulse-Code Modulation
PHQ-9: Patient Health Questionnaire-9
RF: random forest
SVM: support vector machine
This work was supported by the Institute for Information and Communications Technology Planning and Evaluation and Basic Science Research Program through the National Research Foundation of Korea grant funded by the Ministry of Science and Information and Communications Technology, Korean government (grant 2019-0-01376 and grant 2020R1A2C1101978) and the Electronics and Telecommunications Research Institute’s internal funds (grant 21YR2500).
None declared.