This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Autism spectrum disorder (ASD) is currently diagnosed using qualitative methods that measure 20 to 100 behaviors, can span multiple appointments with trained clinicians, and take several hours to complete. In our previous work, we demonstrated that machine learning classifiers can accelerate this process: we collected home videos of US-based children, identified a reduced subset of behavioral features that untrained raters can score, and used a machine learning classifier on those scores to determine children’s “risk scores” for autism. We achieved an accuracy of 92% (95% CI 88%-97%) on US videos using a classifier built on five features.
Using videos of Bangladeshi children collected from Dhaka Shishu Children’s Hospital, we aim to scale our pipeline to another culture and other developmental delays, including speech and language conditions.
Although our previously published and validated pipeline and set of classifiers perform reasonably well on Bangladeshi videos (75% accuracy, 95% CI 71%-78%), this work improves on that accuracy through the development and application of a powerful new technique for adaptive aggregation of crowdsourced labels. We enhance both the utility and performance of our model by building two classification layers: the first layer distinguishes between typical and atypical behavior, and the second layer distinguishes between ASD and non-ASD. In each of the layers, we use a unique rater weighting scheme to aggregate classification scores from different raters based on their expertise. We also determine Shapley values for the most important features in the classifier to understand how the classifiers’ decisions align with clinical intuition.
Using these techniques, we achieved an accuracy (area under the curve [AUC]) of 76% (SD 3%) and sensitivity of 76% (SD 4%) for identifying atypical children from among developmentally delayed children, and an accuracy (AUC) of 85% (SD 5%) and sensitivity of 76% (SD 6%) for identifying children with ASD from those predicted to have other developmental delays.
These results show promise for using a mobile video-based and machine learning–directed approach for early and remote detection of autism in Bangladeshi children. This strategy could provide important resources for developmental health in developing countries with few clinical resources for diagnosis, helping children get access to care at an early age. Future research aimed at extending the application of this approach to identify a range of other conditions and determine the population-level burden of developmental disabilities and impairments will be of high value.
Autism spectrum disorder (ASD) is a heterogeneous developmental disorder that includes deficits in social communication, repetitive behaviors, and restrictive interests, all of which lead to significant social and occupational impairments throughout the lifespan. Autism is one of the fastest growing developmental disorders in the United States [
The current models for diagnosing autism in Bangladesh, as in the United States, are often administered by trained clinical professionals using standard assessments [
In our previous works, we have developed tools for rapid mobile detection of ASD in short home videos of US children by using supervised machine learning approaches to identify minimal sets of behaviors that align with clinical diagnoses of ASD [
Additionally, an independent validation set consisting of 66 videos (33 ASD, 33 TD) was labeled by a separate set of video raters in order to validate the results. The top-performing classifier maintained similar results, achieving an overall accuracy of 89% (95% CI 81%-95%).
The current study aimed to show generalizability of video-based machine learning procedures for ASD detection that have established validity among US-based children [
The study received ethical clearance under Dr Naila Khan from the Bangladesh Institute of Child Health, Dhaka Shishu Children’s Hospital (DSH) and the Stanford University Institutional Review Board. We aimed to recruit 150 children for this study: 50 with ASD, 50 with a speech and language condition (SLC), and 50 with neurotypical development (TD). All participants were recruited after they provided consent (in Bengali) for participation at the DSH, and their children were screened for the presence of ASD or SLC. Participants were enrolled if they were parents above 18 years of age, had a child between the ages of 18 months and 4 years, could attend an appointment at the DSH to complete the study procedures, and were willing to submit a brief video of their child to the study team. Enrolled families provided demographic information (see
Brief videos (2-5 minutes) were recorded during evaluation of the children who presented to the Child Development Center of the Bangladesh Institute of Child Health with neurodevelopmental concerns. We administered the Modified Checklist for Autism in Toddlers (Bangla version [
Acquired videos and supporting demographic measures were securely sent from DSH to Stanford University. Videos were assessed for quality by trained clinical researchers at Stanford University. Criteria included video, sound, and image quality in addition to video length and content (ie, ensuring that the video was long enough to answer necessary questions, that the child was present in the video, etc). Furthermore, videos were assessed to meet the following criteria: (1) it captured the child’s face and hands, (2) it involved social interaction or attempts of social interaction, and (3) it involved an interaction between the child and a toy/object.
Nine non-Bengali-speaking, US-based raters with no clinical training used a secure, HIPAA (Health Insurance Portability and Accountability Act)-compliant online website to watch the videos and answer a set of 31 multiple-choice questions corresponding to the behavioral features of autism [
We assembled eight published machine learning classifiers to test their viability for use in the rapid mobile detection of autism through the use of short home videos of US children [
In an effort to improve the results on the Bangladeshi dataset after attempting to validate previously built classifiers on these data, we constructed new classifiers while controlling for potential noise resulting from inaccurate ratings and constructed separate layers for each step of the classification for a streamlined approach. Our dataset contained three classes—TD, ASD, and SLC—assigned by screening via clinical evaluation at the DSH [
Given the raters’ lack of formal clinical training, we hypothesized that some raters might be more adept at identifying certain risk factors in some videos than others. Regardless of whether these interrater differences in identification accuracy for certain subsets of behaviors arise naturally or by chance, we hypothesized that this heterogeneous rater performance could be leveraged to yield increased model performance. For example, if one rater is especially capable of labeling a child’s level of eye contact and another rater does a poor job of rating eye contact but excels at rating language ability, then a model trained on each individual rater’s labels alone might perform poorly; however, an ensemble that considers the outputs of both rater’s models could perform substantially better. Achieving this improved performance is the focus of our proposed novel rater-adaptive weighting scheme.
For each of the three raters in the dataset, we trained a Random Forest classifier to predict a child’s class label (TD, SLC, or ASD) based on the rater’s annotations of that child’s behavior in a given video. The Random Forest classifier adapts to each rater’s expertise and labeling patterns; a basic analysis revealed that each rater had a different feature set that they rated well. In addition to (and, in part, because of) these interrater differences in labeling ability, each rater’s model had varying levels of accuracy. We wanted the ensemble to weight the predictions from the most accurate rater models more heavily. Therefore, we first trained each rater’s model and calculated its accuracy relative to a majority-vote baseline, and then used that difference to up- or downweight that rater’s vote relative to the other raters’ votes.
Specifically, we let zj represent the difference in accuracy of rater j’s model relative to the majority-vote baseline. Each rater’s weight wj is then computed as a softmax over these differences:

wj = exp(zj) / Σk exp(zk)

This ensures that all the raters’ weights are positive and collectively sum up to 1, so that the ensemble prediction will be a linear combination of each rater’s predictions. Using these weights, the final ensemble prediction for child i is:

ŷi = Σj wj ŷi,j

where ŷi,j is the predicted class-probability vector for child i from rater j’s model.
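As an illustrative sketch, the rater-weighting and ensembling scheme described above might be implemented in Python as follows. Function and variable names are ours, and the softmax-style normalization is one plausible way to obtain positive weights that sum to 1; it is not necessarily the study's exact implementation.

```python
# Sketch of the rater-adaptive weighting scheme (illustrative names).
import numpy as np
from collections import Counter

def majority_baseline_accuracy(y):
    """Accuracy of always predicting the most common class label."""
    most_common_count = Counter(y).most_common(1)[0][1]
    return most_common_count / len(y)

def rater_weights(model_accuracies, baseline):
    """Softmax over each rater model's accuracy gain vs. the baseline,
    yielding positive weights that sum to 1 (an assumed normalization)."""
    z = np.asarray(model_accuracies) - baseline
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def ensemble_predict_proba(per_rater_probas, weights):
    """Weighted linear combination of each rater model's predicted
    class probabilities, one (n_samples, n_classes) array per rater."""
    stacked = np.stack(per_rater_probas)  # (n_raters, n_samples, n_classes)
    return np.tensordot(weights, stacked, axes=1)
```

More accurate rater models receive larger weights, so the ensemble leans toward raters whose annotations have historically mapped well onto the clinical labels.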
In order to reflect the differences in both the conceptualization and use cases of predicting (1) TD vs atypical development and (2) ASD from other developmental delays, we decided to create a stacked approach to classification. In the first layer, we built classifiers to distinguish between TD and atypical development (ASD/other SLCs). The cases classified as atypical from the first layer were then used as input for the second layer to distinguish between ASD and other SLCs.
We wanted to optimize the model for sensitivity in the first layer to ensure that no atypical case was misclassified. In the second layer, we wanted to optimize for both sensitivity and specificity, so that children with ASD would be effectively distinguished from children with other developmental delays. After training these classifiers for each rater, we tested them on the held-out test set and aggregated rater scores using the rater weights calculated in the previous step. For each of these layers, we used a three-fold cross-validation approach to select the training and test sets randomly in order to ensure that the reported accuracy is stable across different splits.
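A minimal sketch of this stacked, two-layer design, using scikit-learn Random Forests with the class codes and layer roles described above; the hyperparameters are placeholders, and the three-fold cross-validation loop over splits is omitted for brevity.

```python
# Two-layer "stacked" classification: layer 1 separates TD from
# atypical development (ASD/SLC); layer 2 separates ASD from SLC
# among the cases that layer 1 flags as atypical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

TD, SLC, ASD = 0, 1, 2  # illustrative class codes

def stacked_predict(X_train, y_train, X_test):
    # Layer 1: binary TD vs atypical (ASD or SLC).
    atypical_train = (y_train != TD).astype(int)
    layer1 = RandomForestClassifier(n_estimators=100, random_state=0)
    layer1.fit(X_train, atypical_train)
    pred = np.full(len(X_test), TD)
    flagged = layer1.predict(X_test) == 1

    # Layer 2: ASD vs SLC, trained only on atypical training cases.
    mask = y_train != TD
    layer2 = RandomForestClassifier(n_estimators=100, random_state=0)
    layer2.fit(X_train[mask], y_train[mask])
    if flagged.any():
        pred[flagged] = layer2.predict(X_test[flagged])
    return pred
```

Because layer 2 only ever sees cases flagged as atypical, a miss in layer 1 propagates downstream, which is why the text emphasizes optimizing layer 1 for sensitivity.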
To determine the impact of each video’s annotations on the classifier’s predicted label for that video, we used a recently developed method for efficiently calculating approximate Shapley values [
In other words, any video’s final predicted class probability is the average predicted class probability of the dataset plus the sum of the Shapley values associated with each element of that video’s input vector. This property, called local accuracy, indicates that feature importance can be easily measured and compared. Additionally, because each (video, feature, model) triple is associated with a single scalar-valued feature importance, we can understand how each annotation of each child’s video affected his/her predicted probability of TD/ASD/SLC at an individual level, and we can estimate a feature’s overall importance to the model by summing the absolute values of that feature’s Shapley values over all videos. The features with the highest sum of absolute Shapley values are considered the most important to the model. Finally, given the way in which we ensembled individual raters’ models, we can extract Shapley values for the multirater ensemble by employing the same weights. Specifically, for feature f and video i, the ensemble-level Shapley value is the rater-weighted sum of the per-rater values:

φf,i = Σj wj φf,i,j

where φf,i,j is the Shapley value of feature f for video i under rater j’s model and wj is rater j’s ensemble weight.
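Given per-rater Shapley matrices (which, in practice, a tool such as the shap library's tree explainer would supply), the ensemble-level values follow directly from the same rater weights. A sketch, with the per-rater Shapley values assumed precomputed:

```python
# Rater-weighted aggregation of per-rater Shapley values into
# ensemble-level Shapley values, plus the global importance measure
# (sum of absolute Shapley values over videos) described in the text.
import numpy as np

def ensemble_shapley(per_rater_shap, weights):
    """per_rater_shap: (n_raters, n_videos, n_features) Shapley values;
    weights: (n_raters,) ensemble weights summing to 1."""
    return np.tensordot(weights, np.asarray(per_rater_shap), axes=1)

def global_importance(shap_matrix):
    """Per-feature importance: sum of absolute Shapley values over videos."""
    return np.abs(shap_matrix).sum(axis=0)
```

Because the ensemble prediction is a linear combination of the raters' models, this weighted sum of Shapley values inherits the local-accuracy property at the ensemble level.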
To test whether our classifier’s decisions align with clinical intuition, we calculated Shapley values for the 159 videos for the second layer of the classifier when distinguishing ASD from non-ASD.
In order to determine the generalizability of one dataset’s characteristics to the other, we trained logistic regression classifiers with elastic net regularization on the Bangladeshi data and on the US data to predict ASD versus the non-autism class. We trained the model on the Bangladeshi data and tested it on the US data, and vice versa. For both classifiers, we randomly split the dataset into training and testing sets, reserving 20% for the latter, while using cross-validation on the training set to tune the hyperparameters associated with elastic net regularization. Note that while traditional logistic regression seeks a set of model coefficients, β, that minimizes the logarithmic loss (we will denote this loss as Llog(β)), elastic net regularization adds a mixture of L2 and L1 penalties to this objective:

minimize over β: Llog(β) + λ[(1 − α)(1/2)Σp βp^2 + α Σp |βp|]

Here, the first sum corresponds to an L2-loss, the second sum corresponds to an L1-loss, λ controls the overall strength of the regularization, and α ∈ [0, 1] controls the balance between the two penalties.
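The cross-dataset procedure can be sketched with scikit-learn's elastic net penalized logistic regression, where C = 1/λ is the inverse regularization strength and l1_ratio plays the role of α. The grids shown are illustrative, not the study's tuned values.

```python
# Elastic net logistic regression with cross-validated hyperparameters,
# suitable for training on one cohort and testing on the other.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def fit_elastic_net_logreg(X_train, y_train):
    base = LogisticRegression(penalty="elasticnet", solver="saga",
                              l1_ratio=0.5, max_iter=5000)
    # Cross-validate C (inverse of lambda) and l1_ratio (alpha).
    grid = GridSearchCV(base,
                        {"C": [0.01, 0.1, 1.0],
                         "l1_ratio": [0.2, 0.5, 0.8]},
                        cv=3)
    grid.fit(X_train, y_train)
    return grid.best_estimator_
```

Training on one cohort's ratings and calling `.score()` on the other cohort's held-out data then mirrors the transfer evaluation described in the text.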
Analyses were performed in Python 3.6.7; we used pandas 0.23.4 to prepare the data for analysis [
We collected 159 videos in total: 55 videos were of children with ASD, 50 were of children with SLC, and 54 were of children with TD. The parent-submitted home videos were an average of 3 minutes 11 seconds long (SD 1 minute 57 seconds). Of the 159 videos submitted, all were manually inspected and found to be of good, scorable quality in terms of length, resolution, and content. Demographic data were missing for 9 subjects, who were excluded from analysis; all other data were complete. Video rating staff were able to rate all videos.
We first sought to distinguish ASD from non-ASD cases. Our top performing classifiers from our previous analysis of the videos from 162 US children [
Since we used a three-fold cross-validation approach, we trained and tested the models for each of the raters across three different splits. The training set consisted of 114 randomly selected videos, and the average demographic information for the three splits for the training set was as follows: average age, 2 years 7 months (SD 7 months); proportion of males, 64%; proportion of children with TD, 34%; proportion of children with SLC, 31%; and proportion of children with ASD, 35%. The demographic information for the test set for layer 1 (distinguishing TD from ASD/SLC) and layer 2 (distinguishing ASD from SLC) can be found in
Participant demographics collected from Dhaka Shishu Hospital, Bangladesh.
Demographic | Full cohort (N=150) | ASDa cohort (N=50) | TDb cohort (N=50) | SLCc cohort (N=50)
Age (years), mean (SD) | 2.55 (0.62) | 2.51 (0.70) | 2.40 (0.59) | 2.73 (0.51)
Gender (male), n (%) | 90 (60) | 36 (72) | 23 (46) | 31 (62)
Preterm (ie, <37 weeks), n (%) | 11 (7.3) | 5 (10) | 0 (0) | 6 (12)
Monthly income (takad), n (%) | | | |
1,000-10,000 | 16 (10.7) | 0 (0) | 16 (32) | 0 (0)
>10,000-30,000 | 33 (22) | 2 (4) | 21 (42) | 10 (20)
>30,000 | 101 (67.3) | 48 (96) | 13 (26) | 40 (80)
Area of residence, n (%) | | | |
Urban | 139 (92.7) | 50 (100) | 50 (100) | 39 (78)
Semiurban | 8 (5.3) | 0 (0) | 0 (0) | 8 (16)
Rural | 3 (2) | 0 (0) | 0 (0) | 3 (6)
Religion, n (%) | | | |
Muslim | 141 (94) | 44 (88) | 49 (98) | 48 (96)
Hindu | 6 (4) | 4 (8) | 0 (0) | 2 (4)
Christian | 1 (0.7) | 1 (2) | 0 (0) | 0 (0)
Buddhist | 2 (1.3) | 1 (2) | 1 (2) | 0 (0)
Stunting, n (%) | | | |
Missing stunting information | 60 (40) | 4 (8) | 50 (100) | 6 (12)
No stunting | 49 (32.7) | 30 (60) | 0 (0) | 19 (48)
Stunting | 41 (27.3) | 16 (32) | 0 (0) | 25 (50)
ADOSf scores, mean (SD)g | | | |
Social affect | N/Ah | 11.57 (5.30) | N/A | N/A
Restricted and repetitive behavior | N/A | 3.46 (3.29) | N/A | N/A
Composite | N/A | 5.14 (2.08) | N/A | N/A
Type of SLC, n (%) | | | |
Receptive language delay | N/A | N/A | N/A | 2 (4)
Expressive language delay | N/A | N/A | N/A | 5 (10)
Both receptive and expressive language delay | N/A | N/A | N/A | 37 (74)
Receptive and expressive language disorder | N/A | N/A | N/A | 6 (12)
aASD: autism spectrum disorder.
bTD: neurotypical development.
cSLC: speech and language condition.
d1 US $=84 taka.
eMCHAT: Modified Checklist for Autism in Toddlers
fADOS: Autism Diagnostic Observation Schedule.
gADOS was only performed on a subset of 28 children with ASD.
hN/A: not available.
Results from the top performing classifiers trained on US clinical score sheet data and tested on Bangladeshi data with an objective to distinguish between ASD and non-ASD. ROC: receiver operating characteristic; AUC: area under the curve; ASD: autism spectrum disorder.
Average demographic information of the test set calculated by testing the model on 45 videos for both layers.
Demographic | Layer 1 (distinguishing TDa from ASDb/SLCc) | Layer 2 (distinguishing ASD from SLC) |
Age, mean (SD) | 2 years 7 months (5 months) | 2 years 6 months (3 months)
Proportion of males, mean % | 62 | 70 |
Proportion of TD children, mean % | 33 | 22 |
Proportion of children with ASD, mean % | 33 | 44 |
Proportion of children with SLC, mean % | 33 | 34 |
aTD: neurotypical development.
bASD: autism spectrum disorder.
cSLC: speech and language condition.
(A) ROC curve for layer 1 (distinguishing between children with TD and children with ASD or SLC). (B) ROC curve for layer 2 (distinguishing between ASD and SLC). ASD: autism spectrum disorder; AUC: area under the curve; SLC: speech and language condition; TD: neurotypical development; ROC: receiver operating characteristic.
Results from classifiers to distinguish among autism spectrum disorder, speech and language conditions, and neurotypical development. The results distinguish layer 1 (distinguishing neurotypical development from atypical conditions [autism spectrum disorder/speech and language conditions]) and layer 2 (distinguishing autism spectrum disorder from other delays [speech and language conditions]) from those classified as atypical in layer 1.
Classifier layer | Sensitivity, % (SD) | Specificity, % (SD) | Unweighted average recall, % (SD) | Area under the curve, % (SD) | Accuracy, % (SD)
Layer 1a | 76 (4) | 58 (3) | 67 (1) | 76 (3) | 70 (2)
Layer 2b | 76 (6) | 77 (24) | 77 (9) | 85 (5) | 76 (11)
aDistinguishing neurotypical development from autism spectrum disorder/speech and language conditions.
bDistinguishing autism spectrum disorder from other developmental delays (speech and language conditions).
Layer 1 of the stacked classifier, which sought to distinguish children with TD from children with atypical development, achieved 76% (SD 4%) sensitivity and 58% (SD 3%) specificity with an AUC of 76% (SD 3%) and an accuracy of 70% (SD 2%;
The most important features in our rater-adaptive ensemble for predicting ASD, as measured by the Shapley value, align with clinical intuition.
Shapley value distributions for two of the most important features in the rater-adaptive ensemble model. These features measure the child’s stereotyped behaviors/repetitive interests and eye contact. They demonstrate that clinical intuition and the inner workings of our classifier align closely. ASD: autism spectrum disorder.
For the classifier trained on the Bangladeshi data, the performance on the held-out test set (20% of Bangladeshi data) was 84.4% and its performance when validated on US data was 72.5% (
We trained a similar classifier on our dataset of 162 US videos and validated it on the Bangladeshi data (
While performing hyperparameter tuning on these classifiers, we conducted further analysis to determine which of the behavioral features were selected most often for each cross-fold of US videos and Bangladeshi videos in order to draw a comparison. It is apparent from
Logistic regression (Elastic Net penalty) classifier, trained on Bangladeshi data and tested on US data as well as a held-out test set of the Bangladeshi data. AUC: area under the curve.
Logistic regression (Elastic Net penalty) classifier, trained on US data and tested on Bangladeshi data as well as a held-out test set of the US data.
Feature selection analysis. Numbers within the cells indicate the frequency of selection. (A) Feature frequency comparison during cross-fold validation with alpha value 0.1 between Bangladeshi data and US data. (B) Feature frequency comparison during cross-fold validation with alpha value 0.01 between Bangladeshi data and US data.
We were able to demonstrate the potential of video-based machine learning methods to detect developmental delay and autism in a collection of videos of Bangladeshi children at risk for autism. Despite language, cultural, and geographic barriers, this outcome shows promise for remote autism detection in a developing country. More testing and refinement will be needed, but the method could eventually be made virtual and run entirely on mobile devices, increasing the capacity to detect and provide more immediate diagnostics to children in need of therapeutic interventions.
An important result of our work is that we were able to gather 159 videos from Bangladeshi parents collected via mobile phone through our collaboration with DSH. This suggests feasibility of expanding this study to a larger sample size across Bangladesh and other low-resource settings and the ability to rely on the use of mobile phones in developing countries like Bangladesh, where 95% of the population are mobile phone subscribers [
A useful and novel contribution of our work was our method for ensembling predictions from models trained on and adapted to each individual rater. This method demonstrates several advantageous properties. First, because each classification model was trained to map an individual rater’s annotation patterns to a predicted class label, these rater-adaptive models can capitalize on features reflecting a rater’s strengths while ignoring features on which the rater shows weaker performance. Furthermore, because raters’ models are trained independently from one another, in a distributed setting with a large corpus of videos where each rater annotates only a small subset of them, our method can make predictions on each video by applying and ensembling the models from each rater without any need for additional imputation. By weighting each rater’s model according to its accuracy on a rater-specific held-out validation set, the overall ensemble can lean more heavily on those raters whose models consistently demonstrate the best classification performance. Finally, because the final ensemble’s prediction is a linear combination of all of the raters’ models and we are able to calculate Shapley values for every feature in each of these models, we can use the same weights from the ensemble of rater-specific predictions to generate ensemble-level Shapley values as well. Thus, if a child’s video is distributed to several different raters and those raters’ annotations are fed into the ensemble model, one can interpret how each of the child’s behavioral annotations contributed to both the final ensemble classification label and each rater’s predicted label individually.
We found that while models trained on videos of US children and models trained on Bangladeshi children both relied on many of the same clinically relevant features (eg, sensory seeking, stereotyped interests, and actions), some features were more prominent in one model compared to the other. For example, models trained on US data tended to rely more heavily on social participation and stereotyped speech, while models trained on Bangladeshi data relied more on eye contact. These patterns make sense, as raters could rely on a mutual understanding of the language (English) to evaluate behaviors like stereotyped speech and social interaction in US videos and may not have needed to rely as heavily on physical cues like eye contact, whereas when US raters viewed Bangladeshi videos, nonlanguage-based cues became more important. Even without the ability to confidently evaluate all aspects of the child’s behavior, the rater ensemble demonstrated that the set of behavioral features needed to make an accurate diagnosis of developmental delays, including ASD, may be narrower than previously thought. Nevertheless, the difficulty in assessing certain sociolinguistic patterns in the cross-cultural context may have been the cause of comparatively lower performance in the Bangladeshi dataset. We hypothesize that, when trained on annotations provided by raters who share a common linguistic and sociocultural background with the Bangladeshi children, our ensemble’s performance will improve and become comparable to the models trained and evaluated on the US dataset.
Although accuracy achieved using our source classifiers originally trained on US datasets was lower when applied to Bangladeshi videos, it still indicated a signal in the Bangladeshi dataset. The relatively low accuracy is most likely a result of three factors. First, these original classifiers were trained on clinical scoresheets, not on features obtained from live video data. Second, these scoresheets were obtained from formal clinical assessments of US children, and therefore they do not capture a culturally diverse set of behavioral nuances. Third, these classifiers were trained to distinguish between typically developing children and children with autism. However, this dataset consists of delays other than autism (eg, SLCs), which may be why these classifiers were unable to classify these cases with higher accuracy.
Although the potential uses for a method of crowdsourced annotation and classification of developmental disorders like the one we established in this work are myriad, we wish to highlight a few uses. First, in areas where resources are scarce, and with a disorder like ASD, where early intervention is the key to successful treatment, our framework could be essential in performing cost-effective and reliable triage. Parents could send short home videos of their children to the cloud, at which point the video would be routed to several raters who perform feature tagging of the child’s behavior. Based on the raters’ previous annotation patterns and their associated models, the child would receive a predicted risk probability of developmental delay or ASD and a clinical team nearby could then be alerted, as appropriate. Since 2008, Dr Khan and her team have assisted the government to establish multidisciplinary Child Development Centers in tertiary hospitals across Bangladesh [
An exciting second consequence of a deployment like this would be the steady development of a large corpus of annotated videos. No such dataset exists to date; however, the potential impact of such a dataset could be substantial. Modern algorithms from machine vision and speech recognition like convolutional and recurrent neural networks could use these annotations to
Another important effect of such a pipeline would be that, with location-tagged videos, we could develop more accurate epidemiological statistics on the prevalence and onset of developmental disorders like ASD worldwide. Better information like this may increase awareness, positively impact policy change, and advance progress for addressing unmet needs of the children with developmental delays. This can have important applications in the developing world by helping countries identify the proportion of the population affected by such delays or impairments and therefore inform policy and gather actionable insights for health sector responses.
Formulae used in creating stacked rater-weighted classifiers.
Autism Diagnostic Interview-Revised
Autism Diagnostic Observation Schedule
autism spectrum disorder
area under the curve
Dhaka Shishu Children’s Hospital
Health Insurance Portability and Accountability Act
Modified Checklist for Autism in Toddlers
speech and language condition
neurotypical development
We are grateful for the generous support and participation of all 159 families who provided video and other phenotypic records. This work was supported by the Bill and Melinda Gates Foundation. It was also supported, in part, by funds to DW from the NIH (1R01EB025025-01 and 1R21HD091500-01), The Hartwell Foundation, the Coulter Foundation, Lucile Packard Foundation, and program grants from Stanford’s Human Centered Artificial Intelligence Program, Precision Health and Integrated Diagnostics Center (PHIND), Beckman Center, Bio-X Center, Predictives and Diagnostics Accelerator (SPADA) Spectrum, and Wu Tsai Neurosciences Institute Neuroscience: Translate Program. We also acknowledge the generous support from Peter Sullivan.
QT: data curation, formal analysis, investigation, methodology, software, validation, visualization, writing (original draft, review, and editing). SLF: formal analysis, investigation, methodology, software, validation, visualization, writing (review and editing). JNS: data curation, investigation, methodology, project administration, resources, and writing (review and editing). KD: data curation, investigation, methodology, project administration, resources, writing (review and editing). CC: formal analysis, investigation, methodology, software, writing (review and editing). PW: data curation, formal analysis, resources, writing (review and editing). HK: data curation, formal analysis, resources, writing (review and editing). NZK: conceptualization, funding acquisition, investigation, data curation, methodology, writing (review and editing). GLD: conceptualization, funding acquisition, data curation, writing (review and editing). DPW: conceptualization, data curation, formal analysis, funding acquisition, investigation, methodology, project administration, resources, software, supervision, validation, visualization, writing (original draft, review, and editing)
DPW is the founder of Cognoa.com. This company is developing digital health solutions for pediatric care. All other authors declare no competing interests.