Background: Access to neurological care for Parkinson disease (PD) is a rare privilege for millions of people worldwide, especially in resource-limited countries. In 2013, there were just 1200 neurologists in India for a population of 1.3 billion people; in Africa, the average population per neurologist exceeds 3.3 million people. In contrast, 60,000 people receive a diagnosis of PD every year in the United States alone, and similar patterns of rising PD cases—fueled mostly by environmental pollution and an aging population—can be seen worldwide. The current projection of more than 12 million patients with PD worldwide by 2040 is only part of the picture given that more than 20% of patients with PD remain undiagnosed. Timely diagnosis and frequent assessment are key to ensure timely and appropriate medical intervention, thus improving the quality of life of patients with PD.
Objective: In this paper, we propose a web-based framework that can help anyone anywhere around the world record a short speech task and analyze the recorded data to screen for PD.
Methods: We collected data from 726 unique participants (PD: 262/726, 36.1% were women; non-PD: 464/726, 63.9% were women; average age 61 years) from all over the United States and beyond. A small portion of the data (approximately 54/726, 7.4%) was collected in a laboratory setting to compare the performance of the models trained with noisy home environment data against high-quality laboratory-environment data. The participants were instructed to utter a popular pangram containing all the letters in the English alphabet, “the quick brown fox jumps over the lazy dog.” We extracted both standard acoustic features (mel-frequency cepstral coefficients and jitter and shimmer variants) and deep learning–based embedding features from the speech data. Using these features, we trained several machine learning algorithms. We also applied model interpretation techniques such as Shapley additive explanations to ascertain the importance of each feature in determining the model’s output.
Results: We achieved an area under the curve of 0.753 for determining the presence of self-reported PD by modeling the standard acoustic features through the XGBoost—a gradient-boosted decision tree model. Further analysis revealed that the widely used mel-frequency cepstral coefficient features and a subset of previously validated dysphonia features designed for detecting PD from a verbal phonation task (pronouncing “ahh”) influence the model’s decision the most.
Conclusions: Our model performed equally well on data collected in a controlled laboratory environment and in the wild across different gender and age groups. Using this tool, we can collect data from almost anyone anywhere with an audio-enabled device and help the participants screen for PD remotely, contributing to equity and access in neurological care.
Parkinson disease (PD) is the fastest-growing neurological disease worldwide. Unfortunately, an estimated 20% of patients with PD remain undiagnosed. This can be largely attributed to the shortage of neurologists worldwide [, ] and limited access to health care. Early diagnosis and continuous monitoring, which allows medication dosage adjustment, are key to managing the symptoms of this incurable disease. The current standard of diagnosis requires in-person clinic visits where an expert assesses the disease symptoms while observing the patients as they perform tasks from the Movement Disorder Society-Unified Parkinson’s Disease Rating Scale (MDS-UPDRS) [ ]. The MDS-UPDRS includes 24 motor-related tasks to assess speech, facial expression, limb movements, walking, memory, and cognitive abilities. Although many studies have shown success by analyzing hand movements [ ], limb movement patterns [ ], and facial expressions [ ], speech is especially important because approximately 90% of patients with PD exhibit vocal impairment [ , ], which is often one of the earliest indicators of PD [ ].
For speech analysis, researchers have looked into phonations (pronouncing “ahhh”) from audio recordings  to quantify rhythm, stress, and intonation [ ]. Little et al [ ] introduced pitch period entropy (PPE) as a measure of dysphonia to distinguish healthy people from people with PD with 91% accuracy. Later, Tsanas [ ] expanded on this by calculating 132 dysphonia measures to classify PD versus control with almost 99% accuracy. In addition, Peker et al [ ] used a novel feature selection technique with a complex-valued artificial neural network. Rueda and Krishnan [ ] identified a set of mel-frequency cepstral coefficients (MFCCs) and intrinsic mode functions to represent the characteristics of PD. In the domain of analyzing real-life audio data, Wroge et al [ ] analyzed verbal phonation data collected from smartphones; Vaiciukynas et al [ ] used a convolutional neural network to detect PD from speech; Vásquez-Correa et al [ ] collected speech samples with a handheld device in uncontrolled noise conditions; and Dubey et al [ ] designed a smartwatch-based voice and speech monitoring system for people with PD receiving speech therapies from speech-language pathologists. Although the current state-of-the-art method has shown promising results, it has limitations such as small sample size [ , , ], oversampling from the same participants [ ], noise-controlled data collection [ , ], and age discrepancy between PD and control [ ].
In this paper, we present our analysis of 726 audio recordings of speech from 36.1% (262/726) individuals with PD and 63.9% (464/726) without PD. The speech recordings were collected using a web-based tool called Parkinson’s Analysis with Remote Kinetic-tasks (PARK) . The PARK tool instructed the participants to utter a popular pangram containing all the letters in the English alphabet, “The quick brown fox jumps over the lazy dog,” and recorded it. This allowed us to rapidly collect a data set that is more likely to contain real-world variability associated with geographical boundaries, socioeconomic status, age groups, and a wide variety of heterogeneous recording devices. The findings in this study build on this unique real-world data set; thus, we believe it could potentially be generalized for real-world deployments.
Collecting audio data from individuals often requires in-person visits to the clinic, limiting the number of data points and the diversity within the data. Recent advancements have allowed the collection of tremor data from wearable sensors  and sleep data from radio frequency signals [ ]. The existing work with speech and audio analysis uses sophisticated equipment for collecting data that are often noise-free [ , ] and do not contain real-world variability. As a significant portion of the population has access to a mobile device with recording capability (eg, 81% of Americans own a smartphone) [ ], we opted to use a framework that allows participants to record data from their homes. From the recorded audio files, we extracted acoustic features including MFCCs, which represent the short-term power spectrum of a sound, jitter or shimmer variants (representing pathological voice quality), pitch-related features, spectral power, and dysphonia-related features designed to capture PD-induced vocal impairment [ ].
In addition, we extracted features from a deep-learning–based encoder—problem agnostic speech encoder (PASE) , which represents the information contained in a raw audio instance through a list of encoded vectors. These features are modeled with four different machine learning models—support vector machine (SVM), random forest, LightGBM, and XGBoost—to classify individuals with and without PD.
provides an outline of the data analysis system. Our contributions can be summarized as follows:
- We report findings from one of the largest data sets with real-world variability, containing 726 unique participants mostly from their homes.
- We analyzed the audio features of speech to predict PD versus non-PD with an area under the curve (AUC) score of 0.753.
- We provide evidence that our model prioritizes MFCC features and a subset of dysphonia features [ , ], which is consistent with previous literature.
- Our model performs consistently well when tested on gender- and age-stratified data collected in a controlled laboratory environment and in the wild.
Recruitment and Data Collection
We collected data from 726 unique participants uttering the sentences, “The quick brown fox jumps over the lazy dog. The dog wakes up and follows the fox into the forest, but again the quick brown fox jumps over the lazy dog,” using PARK  website. provides a brief overview of the data collection, storage, transfer, and analysis mechanisms. The users are recorded via a webcam and a microphone connected to the PC or laptop, and the recordings are uploaded to a server. shows images of some of the study participants, and shows the age distribution of the participants. The number of participants without PD is 1.8 times the number of participants with PD. Most of the participants with PD are concentrated in the age range of 40 to 80 years. However, participants without PD outnumber those with PD in the age group 20 to 40 years. Similarly, participants with PD significantly outnumber those without PD in the age group of 80 to 90 years. provides the demographic information of the study participants.
|Characteristics||Participants with PDa||Participants without PD|
|Total, n (%)||262 (36.1)||464 (63.9)|
|Gender, n (%)|
|Female||101 (38.5)||300 (64.6)|
|Male||161 (61.4)||164 (22.5)|
|Age (years), mean (SD)||65.92 (9.2)||57.98 (14.2)|
|Country, n (%)|
|United States||199 (75.9)||419 (90.3)|
|Other||63 (24)||45 (9.7)|
|Years since diagnosed, mean (SD)||7.88 (5.41)||N/Ab|
aPD: Parkinson disease.
bN/A: not applicable.
Among our 726 unique participants, 262 (36.1%) had received the diagnosis of PD, whereas the others were participants without PD. We obtained contact information of patients with PD from local clinics and various PD support groups. Participants without PD were recruited from Amazon Mechanical Turk. During data collection, informed consent of all participants was obtained in accordance with the protocol agreed upon between the researchers and the institutional review board of the University of Rochester. Among the 726 unique participants, only 54 (7.4%) provided data in the laboratory under the guidance of a study coordinator using the PARK tool; the rest of the 672 (92.6%) participants used the PARK system from their homes to provide data. Having participants perform the tasks at home and in the laboratory allowed us to compare the results across both conditions. No participant appeared in either set, and all of our participants used the identical PARK protocol.
The gender distribution in the data set was skewed, especially for female participants. Among all participants, 55.2% (401/726) were women, and 44.8% (325/726) were men. However, among participants with PD, only 38.5% (101/262) were women, and for participants without PD, 64.6% (300/464) were women. Most of the participants with PD were in the age range of 40 to 80 years, but most of the younger (20-40 years) and older (80-90 years) participants were from the group without PD and the group with PD, respectively.
As our data were collected through a web-based framework, we do not have the MDS-UPDRS scores for our participants because collecting them requires additional input from doctors. Among all the participants in our PD group, only 3 participants replied in affirmative that they had taken medication 2 hours before taking the test; others replied in the negative. Therefore, we assumed that medication effects were negligible for the purposes of this study.
During data collection, participants often took some additional time to start to utter the task sentences and stop the recording once the sentences were uttered. Hence, we had a substantial amount of noisy and irrelevant data at the beginning and end of most data instances (). To remove irrelevant data, we used the Penn Phonetics Lab Forced Aligner Toolkit (P2FA) [ ]. Given an audio file and transcript, it attempts to predict the time boundaries in which each of the words in the transcript was pronounced. P2FA applies a combination of hidden Markov models [ ] to predict the most likely sequence of phonemes for a given audio and Gaussian mixture models to combine those phonemes into words and obtain their time boundary using a predefined dictionary [ ]. From the output of this system, we can obtain the starting time of the first word recognized by the P2FA and the ending time of the last word recognized by the P2FA. We used the audio segments between them for further analysis.
Acoustic Features Extraction
We extracted features by combining outputs of multiple sources: Praat features  obtained through the Parselmouth Python interface [ ] and the previously used dysphonia features relevant for PD analysis [ , , ]. After calculating all the features, we constructed a correlation matrix of the feature values to calculate the degree of correlation between them. Then, we iterated over each pair of features in an unordered fashion, and if the correlation coefficient between them was above 0.9, we dropped one of those features from further analysis [ ]. contains a short overview of the features used in our analysis; the feature names in italicized text are those used for building the models. We provide a more comprehensive description of the features in [ , , , ]. Some of our definitions were adapted from the official Praat documentation [ ].
|Feature||Code source||Short description|
|MedianPitchb||Little et al ||Median principal frequency|
|MeanPitch||Boersma and Weenink ||Mean principal frequency|
|StdDevPitch||Little et al ||SD in principal frequency|
|MeanJitter||Little et al ||Perturbation in principal frequency (mean variation)|
|MedianJitter||Little et al ||Perturbation in principal frequency (median variation)|
|LocalJitter||Boersma and Weenink ||Average absolute difference between consecutive periods, divided by the average period|
|LocalAbsoluteJitter||Boersma and Weenink ||Average absolute difference between consecutive periods measured in seconds|
|RapJitter||Boersma and Weenink ||Average absolute difference between a period and the average of it and its two neighbors|
|Ppq5Jitter||Boersma and Weenink ||5-point period perturbation quotient|
|DdpJitter||Boersma and Weenink ||Difference of differences of periods of principal frequency|
|MeanShimmer||Little et al ||Amplitude perturbation (using mean)|
|MedianShimmer||Little et al ||Amplitude perturbation (using median)|
|LocalShimmer||Boersma and Weenink ||Average absolute difference between amplitudes of consecutive period|
|LocaldbShimmer||Boersma and Weenink ||Average absolute base-10 logarithm of the difference between amplitudes of consecutive period|
|Apq3Shimmer||Boersma and Weenink ||3-point amplitude perturbation quotient|
|Apq5Shimmer||Boersma and Weenink ||5-point amplitude perturbation quotient|
|Apq11Shimmer||Boersma and Weenink ||11-point amplitude perturbation quotient|
|DdaShimmer||Boersma and Weenink ||Shimmer calculated by difference in differences of amplitude|
|MeanMFCC (0-12)||Little et al ||13 features of mean MFCC|
|VariationMFCC (0-12)||Little et al ||13 features of mean variation of MFCC|
|RelBandPower (0-3)||Tsanas et al ||4 features capturing relative band power in 4 spectrum ranges|
|HNRd||Boersma and Weenink ||Signal-to-noise ratio|
|RPDEe||Little et al ||Pitch estimation uncertainty|
|DFAf||Little et al ||Measure of stochastic self-similarity in turbulent noise|
|PPEg||Little et al ||Measure of inability of maintaining constant pitch|
aWe collected the features using the code or methodology described in the corresponding code-source entry.
bThe correlated features were removed, and features in italicized text were used to build the models. Feature names are preceded by the loosely defined umbrella category they belong to.
cMFCC: mel-frequency cepstral coefficient.
dHNR: harmonic-to-noise ratio.
eRPDE: recurrence period density entropy.
fDFA: detrended fluctuation analysis.
gPPE: pitch period entropy.
Embedding Features Extraction
We extracted deep learning–based PASE embeddings  for our audio files. PASE represents the information contained in a raw audio instance through a list of encoded vectors. To ensure that the encoded vectors contain the same information as the input audio file, it decodes various properties of the audio file, including the audio waveform, log power spectrum, MFCCs, four prosody features (interpolated logarithm of the fundamental frequency, voiced or unvoiced probability, zero-crossing rate, and energy), and local InfoMax, from the encoded vectors. The encoded vectors must retain relevant information about the input audio file to decode all these properties successfully.
As these properties represent the inherent characteristics of the input audio file rather than any task-specific features, they have been used to solve a host of downstream tasks, such as speech classification, speaker recognition, and emotion recognition. Therefore, we also used them for PD detection.
For each feature set, we applied a standard set of machine learning algorithms, such as SVM , XGBoost [ ], LightGBM [ ], and random forest [ ], to classify PD versus non-PD. SVM separates the data into several classes while maintaining the maximum possible margin among the classes. A random forest is built as an ensemble of decision trees; each decision tree builds a tree using a subset of features and learns if-else type decision rules to make a prediction. We also used XGBoost and LightGBM, algorithms based on gradient boosting where they build successively better models by refining the models at hand.
We used a leave-one-out cross-validation training strategy; using this strategy, one sample of data from the data set is left out, and the other n−1 samples are used to create a model and predict the remaining sample. We used metrics such as binary accuracy and AUC to evaluate our model’s performance. AUC is the area under the receiver operating characteristics (ROC) curve. The ROC curve is constructed by calculating the AUC produced by taking the ratio of the true-positive rate and the false-positive rate while varying the decision threshold. AUC has the highest value of 1, which denotes that the two classes can be separated perfectly, whereas an AUC value of 0.5 indicates that the model cannot distinguish between the two classes. Because our data set was imbalanced, AUC is a much better metric for understanding the true performance of our model. To reduce the effects of data imbalance, we used data augmentation techniques such as the synthetic minority oversampling technique  and SVM synthetic minority oversampling technique [ ].
Model Interpretation Technique
To interpret the models, we used the Shapley additive explanations (SHAP) technique based on the Shapley value. The Shapley value is a game-theoretic concept of distributing the payouts fairly among players . In the machine learning context, each individual feature of an instance can be thought of as a player, and the payout is the difference between an instance’s prediction and the average prediction. We choose SHAP for two reasons: (1) it is well suited for explaining the output of any machine learning model, and (2) it is the only feature attribution method that fulfills the mathematical definition of fairness.
To use SHAP to explain gradient boosting and tree-based models such as XGBoost, Lundberg et al  introduced methods to provide a polynomial time model to compute optimized explanations. Their method can generate local interpretations—how the features affect one particular prediction of a data instance—and then combine those local interpretations to make global interpretations about features present in the entire data set. By setting up a class of features to condition on, they traverse the tree in the following manner: if we are traversing on a node that was split based on a feature we are conditioning on, we simply follow the decision path; otherwise, the results from the left and right subtrees originating from the current node are computed recursively, and their results are added through a weighted summation strategy, thus computing the SHAP value for the feature in question.
The data were preprocessed, and both standard acoustic features (eg, pitch, jitter, shimmer, and MFCC) and deep learning–based audio embedding features, representing an audio clip as a feature vector, were extracted; henceforth, we call these standard features and embedding features. In the rest of the Results section, we discuss the results from the models built on the entire data set, interpretations of the best model on the entire data set, and results and interpretation from specialized models on gender-stratified and age-trimmed data sets.
Detecting PD From the Entire Data Set
contains the AUC and accuracy scores of the four machine learning models trained on the standard features and embedding features separately. Applying XGBoost on the standard features showed the best performance of 0.75 AUC and 0.74 accuracy. We also noticed that models trained on standard features work better than those trained on embedding features.
|Algorithm||Standard features||Embedding features|
aModels using standard features perform better than the models using embedding features in terms of both binary accuracy and area under the curve. Although the performance of the models is almost similar in terms of area under the curve metric, XGBoost outperforms others by considering both the area under the curve and accuracy metrics simultaneously.
bAUC: area under the curve.
cSVM: support vector machine.
dVariable outperforms all others by taking both area under the curve and accuracy into account.
The goal of SHAP is to explain the model’s prediction of any instance as a sum of contributions from its feature values; if a data instance can be thought of as Xi=[f1, f2,...fN], SHAP will assign a number to each of these fj features, denoting the impact of that feature—both the magnitude and direction—on the model’s prediction. Then, all these local explanations are aggregated to create a global interpretation for the entire data set. A global interpretation is presented in the first part of; the top 10 most impactful features, ranked by having the most impact to the least, are presented. To calculate each feature’s impact, all of its SHAP values across all the data instances are gathered, and then the mean of their absolute values is calculated.
Features that have an impact on the model’s performance are typically the spectral features: the mean values or the variation of MFCC in each spectrum range. Apart from that, some other complex features such as recurrence period density entropy (RPDE; a measure of uncertainty in F0 estimation), PPE (a measure of inability to maintain a constant F0), and harmonic-to-noise ratio (HNR) also affected the interpretation presented in the first part of.
Gender- and Age-Stratified Analysis
The characteristics of a person’s voice are greatly influenced by age and gender. In, we see that men and women display a changing characteristic in their voices as they get older. Female voices have a higher F0 value, but it decreases with age. Males typically have a higher F0 value in their youth, which decreases with age and then increases roughly after the age of 45 years. Therefore, it can produce confounding effects when analyzing PD from audio, where the machine learning model uses audio features to detect PD. To minimize the effect of confounding factors, researchers trained separate models on data from male and female participants [ ] or analyzed an age-trimmed data set by considering data from participants older than 50 years [ , ].
Building Specialized Models for Each Gender- and Age-Trimmed Analysis
The performance metrics of the machine learning models trained on the male, female, and age-trimmed data sets are shown in. By comparing the performance with metrics presented in , we can see that the models that used male- or age-trimmed data sets performed at par or better than the models that used the whole data set to train. However, there was a performance drop in the models using the female data set. shows that females are overrepresented in the non-PD group and underrepresented in the PD group, leading to data imbalance and possibly lowering performance for the female-only model.
|Algorithm||Male (n=415)||Female (n=477)||Age-trimmed (male, n=366 and female, n=426)|
aAUC: area under the curve.
bSVM: support vector machine.
We also analyzed the features that drive the performance of these specialized models through SHAP analysis. The second part ofshows the most salient features ranked by their SHAP value and the distribution of the impact of each feature on the model’s decision-making. The most important features are still dominated by the MFCC-related features or complex features such as the HNR, relative band power in different frequency ranges (RelBandPower1, RelBandPower3), RPDE (uncertainty in F0 estimation), perturbation in F0 (DdpJitter), or perturbation in amplitude (Apq11Shimmer). However, one noticeable fact is that three-pitch and jitter-related features—MedianPitch (median principal frequency, StdDevPitch (SD in principal frequency) and MedianJitter (median variation in F0)—also affected the model’s prediction, which was not noticed in the SHAP analysis run on the all-data-model.
Similarly, we interpreted the salient features of the age-trimmed data set in the third part of. We noticed that the most salient features usually come from the MFCC feature groups, complex features (RPDE, PPE, and HNR), relative band power (RelBandPower1, RelBandPower2, and RelBandPower3), and pitch-related features. In addition, the pitch-related features also drove the prediction of the model.
In all of our experiments, we chose leave-one-out cross-validation to maintain uniform experimental settings between different models, data augmentation, and data set combinations because K-fold cross-validation can show high variance in performance based on how stratified the folds are. However, we acknowledge that our choice introduces the problem of overfitting and increases the computational complexity manifold. Therefore, we could not run extensive hyperparameter tuning to improve the performance of our models. By carefully stratifying the folds and maintaining the same fold settings for comparable experimental settings, K-fold cross-validation may enable us to achieve higher performance with lower computational complexity.
Moreover, our chosen metrics, accuracy and ROC AUC, can be overly optimistic because of the unbalanced nature of our data set. Metrics such as ROC–Precision-Recall balanced accuracy, and more information about sensitivity and specificity may shed more light on how we are performing in the minority class. In the future, we plan to apply other cross-validation techniques and better metrics to conduct our experiments and report our performance.
Detecting PD From Regular Conversation
Some of the most common voice disorders induced by PD are dysphonia (distortion or abnormality of voice), dysarthria (problems with speech articulation), and hypophonia (reduced voice volume). Two speech-related diagnostic tasks are commonly used to detect PD by exploiting the changing vocal pattern caused by these disorders: (1) sustained phonation (the participant is supposed to utter a single vowel for a long time with constant pitch) and (2) running speech (the participant speaks a standard sentence). Little et al  developed features for detecting dysphonia in patients with PD. Tsanas et al [ ] focused on the telemonitoring of self-administered sustained vowel phonation task to predict the Unified Parkinson’s Disease Rating Scale rating (on Rating Scales for Parkinson’s Disease 2003), a commonly used indicator for quantifying PD symptoms. These studies trained their models with data captured by sophisticated devices (eg, wearable devices, high-resolution video recorder, and the Intel at-home–testing-device telemonitoring system) that are often not accessible to all and are difficult to scale. The performance of these models can be significantly reduced when classifying data collected in home acoustics. In addition, correctly completing the sustained phonation task requires following a specific set of guidelines, such as completing the task in one breath, which can be difficult for older individuals.
In contrast, we analyzed the running speech task from the data collected by using a web-based data collection platform that can be accessed by anyone anywhere in the world and requires only an internet-connected device with an integrated camera and microphone. In addition, the running speech task does not require conforming to specific instructions and is more similar to regular conversation; therefore, the model can be potentially augmented to predict PD from a regular conversation—a potential game-changer in PD assessment. In the future, user-consented plug-ins could be developed for apps such as Alexa, Google Home, or Zoom, where audio is transmitted between persons. Anyone who consents to download the plug-in and uses it while on the phone, over Zoom, or giving virtual or in-person presentations could benefit from receiving an informal referral to see a neurologist, when appropriate. The plug-in would not store the participants’ data unless they opt to build a personalized profile to ensure privacy and ethical usage of our framework.
Validating Model Interpretation
The features that SHAP found to have an impact on modeling decisions are well supported by previous research. For example, MFCC features have already proven to be useful in a wide range of audio tasks such as speaker recognition , music information retrieval [ ], voice activity detection [ ], and most importantly, in voice quality assessment [ ]. Similarly, the high impact of the HNR and the measures of uncertainty in F0 estimation (RPDE) and inability to maintain a constant F0 (PPE) on the model’s output are in congruence with the findings from Little et al [ ]. However, explaining the first part of in light of PD-induced vocal impairment is a difficult task. MFCC features are calculated by converting the audio signal into the frequency domain, and they denote how energy in the signal is distributed within various frequency ranges. Therefore, providing a physical interpretation of the SHAP values corresponding to the MFCC features is not straightforward. Similarly, Little et al [ ] designed the RPDE and PPE features to model a sustained phonation task (uttering “ahh”) with the assumption that healthy participants will be able to maintain a smooth and regular voice pattern. In contrast, uttering multiple sentences introduces considerable variation in the data, adding a wide set of heterogeneous patterns. Therefore, the underlying assumptions behind constructing these features do not hold for our task of uttering multiple sentences.
We conduct an empirical validation of the SHAP output shown in the first part of. We incrementally add one feature at a time to build a dynamic feature set, train successive models on that feature set, and report the accuracy and AUC performances. We see that the performance of our model saturates after adding 7 to 8 features. Therefore, we can say that the SHAP analysis teases out the most important features driving the model’s performance successfully.
Why Not Stratified Analysis Only?
We built models inclusive of all genders for several reasons. First, there are potential shared characteristics among vocal patterns of all genders that can be relevant for detecting PD. Second, dividing the data set into two portions reduces the available training data for each model, which may, in turn, reduce the generalization capability of each model. In addition, our model analyzes data from patients of all ages. Although most people diagnosed with PD are older than 60 years, approximately 10% to 20% of the population diagnosed with PD are younger than 50 years, and approximately half of them are younger than 40 years . As anecdotal evidence, Michael J Fox was diagnosed with PD at the age of 29 years [ ], and Muhammad Ali had PD by 42 years [ ]. In our data set, there were also a minority of patients with PD who were younger than 50 years ( ). On the basis of these observations, we believe that our system should provide access to all people, irrespective of age. PD does not discriminate by age while affecting a person, and an automated system should not discriminate based on age and provide equitable services to people of all ages. However, these factors can function as confounders in PD analysis. Therefore, we provided additional analysis to ensure that our model does not use the idiosyncrasies of group-specific information to make predictions.
Performance Excluding Laboratory Environment Data
When the data were collected in the laboratory, participants had access to a clinician providing support using a consistent recording setup and dedicated bandwidth. In contrast, the data collected in the home setting involved no assistance and included the real-world variability of heterogeneous recording setups and inconsistent internet speed. In theory, the data collected at the laboratory and home were very different from each other.
To ensure that our model works equally well without the clean lab data, we designed two experiments. In experiment 1, we removed clean lab data, which was approximately 7.4% (54/726) of the entire data set, retrained our model on the remaining 672 participants using the leave-one-out validation procedure, and calculated the performance metrics. In experiment 2, we randomly removed 7% (roughly 54 data points) of home data from the entire data set (while keeping the lab data intact), building a model with the remaining 93% data with the leave-one-out cross-validation method. Then, we conducted experiment 2 10 times and calculated the average performance from these 10 runs. We found that the AUC metric across these three experiments was fairly consistent, with a very small 0.015 decrease in AUC when removing lab data, demonstrating that our framework performs equally well across the lab and home data.
Label Inconsistency and Predicting Tremor Score
Using the PARK  protocol, we collected one of the largest data sets of participants conducting a series of motor, facial expression, and speech tasks following the MDS-UPDRS PD assessment protocol [ ]. Although we analyzed only the speech task in this study, the data set can be potentially used to automate the assessment of a large set of MDS-UPDRS tasks and facilitate early-stage PD detection, thus improving the quality of life for millions of people worldwide. However, deploying the data collection protocol on the web and facilitating access to anyone anywhere around the world comes at a cost. To date, all of our participants with PD have been clinically verified to be diagnosed with PD. Therefore, the labels of the PD data points are reliable. However, the participants without PD did not undergo clinical verification. Our data collection protocol asked them appropriate questions to check whether they had been diagnosed with PD and collected data when they answered in the negative. However, we cannot discount the possibility that a small subset of our population without PD is in the very early stage of PD and is oblivious about it. At present, there are estimated to be around 1 million patients with PD in the United States, out of a population of 330 million [ ], yielding a PD prevalence rate of 0.3%. However, as our non-PD data set was largely tilted toward people older than 50 years, the rate in our data set could be higher than 0.3%. Even if we consider a liberal 1% prevalence rate, the number of individuals with undiagnosed PD in our control population is likely low (at most 4.6 persons). Therefore, we believe that the non-PD data labels are generally reliable. In the future, we plan to model tremor score in (0-4 range for each task; 0 for no tremor, and 4 for severe tremor) instead of a binary label following the MDS-UPDRS protocol to address this problem more thoroughly.
Building a More Representative Data Set
The PARK protocol is web enabled, allowing anyone with access to the internet to contribute data. We plan to augment our data set by adding more non–native English speakers, females, and participants with PD. As our PD data are collected through contacts from local PD clinics and non-PD data through Amazon Mechanical Turk, most of our participants were from the United States or other English-speaking countries. To make our model more robust on data from non–native English speakers, we are in process of collecting both PD and non-PD data from non–native English-speaking countries. In addition, our current protocol can collect data from people with computer and internet access only, and therefore, it can potentially exclude underserved people. In the future, we plan to build desktop and mobile apps that can collect data offline to build a more inclusive framework.
Our best model for female data performed worse than its male counterparts, as shown in. We attribute this degraded performance PD or non-PD imbalance for female participants in our data set: the PD-to-non-PD ratio for females was 101 out of 300 ( ). Previous epidemiological studies have shown that both the incidence and prevalence of PD are 1.5 to 2 times higher in men than in women [ , ]. Therefore, any randomly sampled data set for PD will have a higher prevalence of males, contributing to models that are more biased toward males. Our immediate plan is to prioritize collecting balanced data from all genders, ages, and races across geographical boundaries, leading to a balanced data set.
Our data set also has the ubiquitous problem of data imbalance in diagnostic tests: the amount of data from participants without PD is 1.8 times more than their PD counterparts. Therefore, there is a risk that the model will be biased toward predicting the majority non-PD class as default and yield a high false-negative score. To address this, we plan to recruit more participants with PD in the future to make our data set more balanced. Another potential approach is to run the analysis on a subset of the data set produced by conducting proper age, gender, and matching of patients with and without PD.
As shown in, our participants with PD were diagnosed 7.88 (SD 5.41) years ago. Owing to limited data availability, we did not conduct an analysis on early-stage PD prediction. In the future, we plan to include more participants in the early stages of PD and build models that can detect them. We will especially focus on recruiting participants with PD in the age range of 20 to 40 years to analyze the very early-stage onset of PD among young adults. Furthermore, we plan to collect and analyze multiple speech instances from each person to reduce person-specific variability and obtain a more holistic view of their PD state.
Increasing Model Performance
Although we consider AUC to be a better metric for our data set, our model performs 10% better than always choosing non-PD as the prediction in terms of binary accuracy. To be practically deployable in clinical settings, the performance needs to be improved further. We will focus on four promising avenues: making the data set balanced, designing better features capable of modeling the nuanced pattern in our data, making our model resilient to noise present in our data, and deconfounding the PD prediction from age and gender variables.
To remove noise, we plan to augment the techniques proposed in Poorjam et al  to automatically enhance our data quality by detecting the segments of data that conform to our experimental design. Moreover, as discussed above, gender and age can appear as confounding variables in the PD prediction task. In this paper, we have shown that our unified model and stratified sex-specific models have similar performance. However, we plan to build better models to systematically deconfound the effects of both age and gender variables while benefiting from them simultaneously. We can achieve this by incorporating the causal bootstrapping technique—a resampling method that considers the causal relationship between variables and negates the effect of spurious, indirect interactions, as outlined by Little and Badawy [ ].
This research was funded by the National Institute of Neurological Disorders and Stroke of the National Institutes of Health under award number P50NS108676.
WR, SL, MSI, and VNA worked in data analysis, feature extraction, model training, model interpretation, and manuscript preparation. HR, AAM, EW, and SJR helped build, maintain, and coordinate the data collection procedure. MRA, MAL, and RD helped improve the manuscript; suggested important experiments; and provided access to critical resources such as code and data. All the other authors helped in data collection. EH was the principal investigator of the project; he facilitated the entire project and helped shape the narrative of the manuscript.
Conflicts of Interest
Description of the features.DOCX File , 20 KB
- Khadilkar S. Neurology in India. Ann Indian Acad Neurol 2013 Oct;16(4):465-466 [FREE Full text] [CrossRef] [Medline]
- Howlett WP. Neurology in Africa. Neurology 2014 Aug 11;83(7):654-655. [CrossRef]
- Goetz CG, Tilley BC, Shaftman SR, Stebbins GT, Fahn S, Martinez-Martin P, Movement Disorder Society UPDRS Revision Task Force. Movement Disorder Society-sponsored revision of the Unified Parkinson's Disease Rating Scale (MDS-UPDRS): scale presentation and clinimetric testing results. Mov Disord 2008 Nov 15;23(15):2129-2170. [CrossRef] [Medline]
- Ali M, Hernandez J, Dorsey E, Hoque E, McDuff D. Spatio-temporal attention and magnification for classification of Parkinson’s disease from videos collected via the internet. In: Proceedings of the 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020). 2020 Presented at: 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020); November 16-20, 2020; Buenos Aires, Argentina. [CrossRef]
- Lonini L, Dai A, Shawen N, Simuni T, Poon C, Shimanovich L, et al. Wearable sensors for Parkinson's disease: which data are worth collecting for training symptom detection models. NPJ Digit Med 2018 Nov 23;1(1):64-68 [FREE Full text] [CrossRef] [Medline]
- Bandini A, Giovannelli F, Orlandi S, Barbagallo S, Cincotta M, Vanni P, et al. Automatic identification of dysprosody in idiopathic Parkinson's disease. Biomed Signal Process Control 2015 Mar;17:47-54. [CrossRef]
- Ho AK, Iansek R, Marigliani C, Bradshaw JL, Gates S. Speech impairment in a large sample of patients with Parkinson’s disease. Behav Neurol 1999;11(3):131-137. [CrossRef]
- Logemann JA, Fisher HB, Boshes B, Blonsky ER. Frequency and cooccurrence of vocal tract dysfunctions in the speech of a large sample of Parkinson patients. J Speech Hear Disord 1978 Feb;43(1):47-57. [CrossRef] [Medline]
- Duffy J. Motor Speech Disorders E-Book: Substrates, Differential Diagnosis, and Management. 4th Edition. Amsterdam: Elsevier; 2019:201.
- Arora S, Venkataraman V, Zhan A, Donohue S, Biglan K, Dorsey E, et al. Detecting and monitoring the symptoms of Parkinson's disease using smartphones: A pilot study. Parkinsonism Relat Disord 2015 Jun;21(6):650-653. [CrossRef] [Medline]
- Playfer J, Hindle JV. Parkinson's Disease in the Older Patient. London: CRC Press; 2008:1-432.
- Little M, McSharry P, Hunter E, Spielman J, Ramig L. Suitability of dysphonia measurements for telemonitoring of Parkinson’s disease. Nat Prec 2008 Sep 12:1. [CrossRef]
- Tsanas A. Accurate telemonitoring of Parkinson’s disease symptom severity using nonlinear speech signal processing and statistical machine learning. Doctoral Dissertation, Oxford University, UK. 2012. URL: https://ora.ox.ac.uk/objects/uuid:2a43b92a-9cd5-4646-8f0f-81dbe2ba9d74 [accessed 2021-09-08]
- Peker M, Sen B, Delen D. Computer-aided diagnosis of parkinson's disease using complex-valued neural networks and mrmr feature selection algorithm. J Healthc Eng 2015 Aug;6(3):281-302 [FREE Full text] [CrossRef] [Medline]
- Rueda A, Krishnan S. Feature analysis of dysphonia speech for monitoring Parkinson’s disease. In: Proceedings of the 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). 2017 Presented at: 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC); July 11-15, 2017; Jeju, Korea (South) p. 2308-2311. [CrossRef]
- Wroge T, Özkanca Y, Demiroglu C, Si D, Atkins D, Ghomi R. Parkinson’s disease diagnosis using machine learning and voice. In: Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (SPMB). 2018 Presented at: IEEE Signal Processing in Medicine and Biology Symposium (SPMB); Dec. 1, 2018; Philadelphia, PA, USA. [CrossRef]
- Vaiciukynas E, Gelzinis A, Verikas A, Bacauskiene M. Parkinson’s disease detection from speech using convolutional neural networks. In: Proceedings of the International Conference on Smart Objects and Technologies for Social Good. 2017 Presented at: International Conference on Smart Objects and Technologies for Social Good; November 29-30, 2017; Pisa, Italy p. 206-215. [CrossRef]
- Vásquez-Correa JC, Orozco-Arroyave J, Bocklet T, Nöth E. Towards an automatic evaluation of the dysarthria level of patients with Parkinson's disease. J Commun Disord 2018 Nov;76:21-36. [CrossRef] [Medline]
- Dubey H, Goldberg J, Abtahi M, Mahler L, Mankodiya K. EchoWear: smartwatch technology for voice and speech treatments of patients with Parkinson’s disease. In: Proceedings of the Conference on Wireless Health. 2015 Presented at: WH '15: Wireless Health 2015 Conference; October 14 - 16, 2015; Bethesda Maryland p. 1-8. [CrossRef]
- Vásquez-Correa J, Arias-Vergara T, Orozco-Arroyave J, Vargas-Bonilla J, Arias-Londoño J, Nöth E. Automatic detection of Parkinson’s disease from continuous speech recorded in non-controlled noise conditions. In: Proceedings of the Sixth Annual Conferecence on International Speech Communication Association. 2015 Presented at: Sixth Annual Conferecence on International Speech Communication Association; 2015; Dresden, Germany. [CrossRef]
- Chen H, Wang G, Ma C, Cai Z, Liu W, Wang S. An efficient hybrid kernel extreme learning machine approach for early diagnosis of Parkinson׳s disease. Neurocomputing 2016 Apr;184:131-144. [CrossRef]
- Park Test. URL: https://parktest.net/ [accessed 2020-09-02]
- Kubota K, Chen J, Little M. Machine learning for large-scale wearable sensor data in Parkinson's disease: Concepts, promises, pitfalls, and futures. Mov Disord 2016 Sep;31(9):1314-1326. [CrossRef] [Medline]
- Yue S, Yang Y, Wang H, Rahul H, Katabi D. BodyCompass:Monitoring Sleep Posture with Wireless Signals. In: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies. 2020 Jun 15 Presented at: ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies; June 15, 2020; New York, USA p. 1-25. [CrossRef]
- Tsanas A, Little M, McSharry P, Ramig L. Accurate telemonitoring of Parkinson's disease progression by noninvasive speech tests. IEEE Trans Biomed Eng 2010 Apr;57(4):884-893. [CrossRef]
- Mobile fact sheet. Pew Research Center, Internet & Technology. 2019. URL: https://www.pewresearch.org/internet/fact-sheet/mobile/ [accessed 2021-09-08]
- Pascual S, Ravanelli M, Serrà J, Bonafonte A, Bengio Y. Learning problem-agnostic speech representations from multiple self-supervised tasks. In: Proceedings of the Annual Conference on Internatioal Speech Communication Association - INTERSPEECH International. 2019 Presented at: Annual Conference on Internatioal Speech Communication Association - INTERSPEECH International; Apr 6, 2019; Graz, Austria URL: http://arxiv.org/abs/1904.03416 [CrossRef]
- Tsanas A, Little M, McSharry P, Ramig L. Using the cellular mobile telephone network to remotely monitor Parkinson's disease symptom severity. Semantic Scholar. URL: https://www.semanticscholar.org/paper/Using-the-cellular-mobile-telephone-network-to-%E2%80%9F-s-Tsanas-Little/17180d351b33f13f7fe681ddf8411cf6aad74b64 [accessed 2021-09-08]
- Langevin R, Ali MR, Sen T, Snyder C, Myers T, Dorsey ER, et al. The PARK Framework for automated analysis of Parkinson's disease characteristics. Proc ACM Interact Mob Wearable Ubiquitous Technol 2019 Jun 21;3(2):1-22. [CrossRef]
- Penn Phonetics Lab Forced Aligner Toolkit (P2FA) for Python3. GitHub. URL: https://github.com/jaekookang/p2fa_py3 [accessed 2021-09-02]
- HTK Speech Recognition Toolkit. URL: http://htk.eng.cam.ac.uk/ [accessed 2021-09-02]
- Yuan J, Liberman M. Speaker identification on the SCOTUS corpus. J Acoust Soc Am 2008 May;123(5):3878. [CrossRef]
- Moltu C, Stefansen J, Svisdahl M, Veseth M. Negotiating the coresearcher mandate - service users' experiences of doing collaborative research on mental health. Disabil Rehabil 2012;34(19):1608-1616. [CrossRef] [Medline]
- Jadoul Y, Thompson B, de Boer B. Introducing Parselmouth: A Python interface to Praat. J Phonetics 2018 Nov;71:1-15 [FREE Full text] [CrossRef]
- Robust parsimonious selection of dysphonia measures for telemonitoring of Parkinson's disease symptom severity. Semantic Scholar. URL: https://www.semanticscholar.org/paper/Robust-parsimonious-selection-of-dysphonia-measures-Tsanas-Little/43b91147f0b5ab900b29ea7943d58fe761f0e26b [accessed 2019-10-14]
- Little MA, McSharry PE, Roberts SJ, Costello DA, Moroz IM. Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection. BioMed Eng OnLine 2007;6(1):23. [CrossRef]
- Goldberger A. A Course in Econometrics. Cambridge, MA: Harvard University Press; 1991:1-426.
- Müller M. Information Retrieval for Music and Motion. Switzerland: Springer; 2007.
- Tsanas A, Little MA, McSharry PE, Spielman J, Ramig LO. Novel speech signal processing algorithms for high-accuracy classification of Parkinson's disease. IEEE Trans Biomed Eng 2012 May;59(5):1264-1271. [CrossRef]
- Cortes C, Vapnik V. Support-vector networks. Mach Learn 1995 Sep;20(3):273-297. [CrossRef]
- Chen T, Guestrin C. Xgboost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016 Presented at: KDD '16: The 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; August 13 - 17, 2016; San Francisco California p. 785-794. [CrossRef]
- Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. Lightgbm: A highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst 2017:3154 [FREE Full text]
- Ho T. Random decision forests. In: Proceedings of 3rd International Conference on Document Analysis and Recognition. 1995 Presented at: 3rd International Conference on Document Analysis and Recognition; Aug. 14-16, 1995; Montreal, QC, Canada. [CrossRef]
- Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic minority over-sampling technique. J Artif Intell Res 2002 Jun 01;16:321-357. [CrossRef]
- Nguyen HM, Cooper EW, Kamei K. Borderline over-sampling for imbalanced data classification. Int J Knowl Eng Soft Data Paradig Inderscience Publishers 2011;3(1):4-21. [CrossRef]
- Shapley L. A value for n-person games. Contrib Theory Games 1953;2(28):307-317. [CrossRef]
- Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell 2020 Jan 17;2(1):56-67 [FREE Full text] [CrossRef] [Medline]
- Bhattarai K, Prasad P, Alsadoon A, Pham L, Elchouemi A. Experiments on the MFCC application in speaker recognition using Matlab. In: Seventh International Conference on Information Science and Technology (ICIST). 2017 Presented at: Seventh International Conference on Information Science and Technology (ICIST); April 16-19, 2017; Da Nang, Vietnam p. 32-37. [CrossRef]
- Kinnunen T, Chernenko E, Tuononen M, Fränti P, Li H. Voice activity detection using MFCC features and support vector machine. In: Proceedings of the International Conference Speech and Computer (SPECOM07). 2007 Presented at: International Conference Speech and Computer (SPECOM07); 2007; Moscow, Russia p. 556-561 URL: http://cs.uef.fi/sipu/pub/svm-vad-SPECOM07.pdf
- Tsanas A, Little MA, McSharry PE, Ramig LO. Nonlinear speech analysis algorithms mapped to a standard metric achieve clinically useful quantification of average Parkinson's disease symptom severity. J R Soc Interface 2011 Jun 06;8(59):842-855 [FREE Full text] [CrossRef] [Medline]
- Kinnunen T, Chernenko E, Tuononen M, Fränti P, Li H. Voice activity detection using MFCC features and support vector machine. In: Proceedings of the International Conference Speech and Computer (SPECOM07). 2007 Presented at: International Conference Speech and Computer (SPECOM07); 2007; Moscow, Russia p. 556-561 URL: http://cs.uef.fi/sipu/pub/svm-vad-SPECOM07.pdf
- Michael's Story. Michael J Fox. URL: https://www.michaeljfox.org/michaels-story [accessed 2021-09-02]
- McCallum K. Muhammad Ali’s advocacy for Parkinson’s disease endures with boxing legacy. Parkinson’s News Today. 2016. URL: https://parkinsonsnewstoday.com/2016/06/10/muhammad-alis-advocacy-parkinsons-disease-endures-boxing-legacy/ [accessed 2021-09-02]
- Van Den Eeden SK, Tanner C, Bernstein A, Fross R, Leimpeter A, Bloch D, et al. Incidence of Parkinson's disease: variation by age, gender, and race/ethnicity. Am J Epidemiol 2003 Jun 01;157(11):1015-1022. [CrossRef] [Medline]
- Haaxma CA, Bloem BR, Borm GF, Oyen WJ, Leenders KL, Eshuis S, et al. Gender differences in Parkinson's disease. J Neurol Neurosurg Psychiatry 2007 Aug 01;78(8):819-824 [FREE Full text] [CrossRef] [Medline]
- Poorjam AH, Kavalekalam MS, Shi L, Raykov JP, Jensen JR, Little MA, et al. Automatic quality control and enhancement for voice-based remote Parkinson’s disease detection. Speech Commun 2021 Mar;127:1-16. [CrossRef]
- Little M, Badawy R. Causal bootstrapping. arXiv. 2019. URL: https://arxiv.org/abs/1910.09648 [accessed 2021-09-08]
|AUC: area under the curve|
|HNR: harmonic-to-noise ratio|
|MDS-UPDRS: Movement Disorder Society-Unified Parkinson’s Disease Rating Scale|
|MFCC: mel-frequency cepstral coefficient|
|P2FA: Penn Phonetics Lab Forced Aligner Toolkit|
|PARK: Parkinson’s Analysis with Remote Kinetic-tasks|
|PASE: problem agnostic speech encoder|
|PD: Parkinson disease|
|PPE: pitch period entropy|
|ROC: receiver operating characteristics|
|RPDE: recurrence period density entropy|
|SHAP: Shapley additive explanations|
|SVM: support vector machine|
Edited by R Kukafka, G Eysenbach; submitted 19.02.21; peer-reviewed by M Goni, C Fincham, D Zhai; comments to author 03.04.21; revised version received 13.04.21; accepted 07.08.21; published 19.10.21Copyright
©Wasifur Rahman, Sangwu Lee, Md Saiful Islam, Victor Nikhil Antony, Harshil Ratnu, Mohammad Rafayet Ali, Abdullah Al Mamun, Ellen Wagner, Stella Jensen-Roberts, Emma Waddell, Taylor Myers, Meghan Pawlik, Julia Soto, Madeleine Coffey, Aayush Sarkar, Ruth Schneider, Christopher Tarolli, Karlo Lizarraga, Jamie Adams, Max A Little, E Ray Dorsey, Ehsan Hoque. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 19.10.2021.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.