This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Autism spectrum disorder (ASD) is a widespread neurodevelopmental condition with a range of potential causes and symptoms. Standard diagnostic mechanisms for ASD, which involve lengthy parent questionnaires and clinical observation, often result in long waiting times for results. Recent advances in computer vision and mobile technology hold potential for speeding up the diagnostic process by enabling computational analysis of behavioral and social impairments from home videos. Such techniques can improve objectivity and contribute quantitatively to the diagnostic process.
In this work, we evaluate whether home videos collected from a game-based mobile app can be used to provide diagnostic insights into ASD. To the best of our knowledge, this is the first study attempting to identify potential social indicators of ASD from mobile phone videos without the use of eye-tracking hardware, manual annotations, structured scenarios, or clinical environments.
Here, we used a mobile health app to collect over 11 hours of video footage depicting 95 children engaged in gameplay in a natural home environment. We used automated data set annotations to analyze two social indicators that have previously been shown to differ between children with ASD and their neurotypical (NT) peers: (1) gaze fixation patterns, which represent regions of an individual’s visual focus and (2) visual scanning methods, which refer to the ways in which individuals scan their surrounding environment. We compared the gaze fixation and visual scanning methods used by children during a 90-second gameplay video to identify statistically significant differences between the 2 cohorts; we then trained a long short-term memory (LSTM) neural network to determine if gaze indicators could be predictive of ASD.
Our results show that gaze fixation patterns differ between the 2 cohorts; specifically, we could identify 1 statistically significant region of fixation (
Ultimately, our study demonstrates that heterogeneous video data sets collected from mobile devices hold potential for quantifying visual patterns and providing insights into ASD. We show the importance of automated labeling techniques in generating large-scale data sets while simultaneously preserving the privacy of participants, and we demonstrate that specific social engagement indicators associated with ASD can be identified and characterized using such data.
Autism spectrum disorder (ASD) is a neurodevelopmental disorder characterized by social impairments, communication difficulties, and restricted and repetitive patterns of behavior. Currently, 1 in 44 children in the United States has been diagnosed with ASD, with males 4 times more likely to be affected than females [
Standard diagnostic mechanisms for ASD are often accompanied by a range of issues that result in long waiting times for results [
We previously created a mobile app called GuessWhat, which yields video data of children engaged in socially motivated gameplay with parents in a natural home environment [
GuessWhat mobile app. (A) The parent places the mobile phone in a fixed location, allowing the recording of a semistructured gameplay video. (B) The child is presented with a variety of charades prompts, such as emotions and animals.
The data collection pipeline employed by GuessWhat provides several benefits that make the obtained information amenable to computational analysis. First, although children perform varied tasks in diverse environments, the gameplay format imposes an inherent structure: factors such as the position of the phone camera, the location of the child relative to the camera, and the game-based social interactions between parent and child remain generally consistent across videos. In addition, because children are in a home environment and are unencumbered by bulky hardware such as eye trackers or head mounts, they can interact with their parents and surroundings in a natural manner. As a result, we hypothesize that computer vision algorithms can be designed to monitor socially motivated facial engagement in children during gameplay, allowing effective identification of behaviors, eye contact events, and social interactions potentially correlated with the ASD phenotype.
In this work, we used computational techniques to analyze these videos and identify differences in social interactions between children with ASD and neurotypical (NT) children. We specifically analyzed 2 common social engagement signals that are included in standard clinical diagnostic instruments and can be identified through computer vision methodologies: (1) gaze fixation patterns, which represent the regions of an individual’s visual focus and (2) visual scanning methods, which refer to the ways in which individuals scan their surrounding environment. We performed these tasks without sharing participant videos or private patient information with human annotators.
Ultimately, the development of this system can help improve diagnosis of ASD through automated detection of impaired social interactions, mitigating the problems associated with limited diagnostic resources for neurodevelopmental disorders, especially in regions where access to care is limited [
Researchers have demonstrated the usefulness of video data in providing diagnostic insights into gaze and engagement behaviors associated with ASD. Prior work can generally be divided into 3 categories: (1) manual annotation methods, (2) eye-tracking systems, and (3) use of structured environments.
Some studies have used human annotators to label social interaction and engagement information in video frames. Several prior works, such as those by Tariq et al and Leblanc et al, performed manual annotation of behavioral features in home videos, which enabled the creation of classifiers that could identify ASD with high accuracy [
Several studies have used eye trackers to identify patterns in gaze and engagement behaviors that may be indicative of ASD or other developmental conditions [
Hashemi et al explored the use of computer vision algorithms to identify behaviors associated with ASD [
To the best of our knowledge, this is the first study that attempts to obtain diagnostic insights into ASD from social gameplay videos without the use of eye-tracking hardware, manual frame-level annotations, or structured scenarios and environments. We show that semistructured gameplay videos collected on mobile devices reveal specific regions of gaze fixation as well as visual scanning patterns that differ between children with ASD and NT children during social gameplay. With further research and development, our system could be deployed as a diagnostic tool in diverse settings on a large scale.
We used the GuessWhat mobile app to collect videos of children engaged in gameplay with a parent. Participants were recruited using social media advertisements and research email lists maintained by the study team. Approximately 1000 individuals proceeded to download the GuessWhat app, and we collected 449 videos from 95 children for this study. The participants ranged in age from 2 to 15 years and included 68 children (15 females, 53 males) diagnosed with ASD as well as 27 NT children (9 females, 18 males). Each child contributed a mean of 4.7 videos (SD 7.3), resulting in a total data set size of 1,084,267 individual frames and 11.1 hours of footage, presented in
Data set information. These graphs show the breakdown of our data set by age, diagnosis, and sex. In our data set, 1 NT male failed to provide his age, and this information has been excluded from this figure. ASD: autism spectrum disorder; NT: neurotypical.
Although the semistructured format of our video data set presents numerous advantages, home videos are naturally heterogeneous in quality; this results in several challenges that must be addressed prior to computational analysis. Specifically, excessive camera movement and poor lighting conditions rendered some frames in our data set too blurry for use. Moreover, other adults or siblings would often join in gameplay, resulting in multiple faces in the frame and making identification of the participating child challenging. Another major challenge arises from the lack of fine-grained annotations and ground truth labels; although the lack of eye-tracking hardware enables natural child motions and interactions, this also results in a lack of calibration information for obtaining accurate gaze locations.
We began our analysis with extensive quality control and data preprocessing. To preserve privacy, we annotated our data set solely using computational methods. We first used Amazon Rekognition, an off-the-shelf computer vision platform, to perform noisy labeling of key features in each still frame, including 30 facial landmarks and facial bounding boxes. Frames containing no faces or more than 2 faces were removed from the data set. We then used OpenFace, an open-source facial landmark annotation platform, to obtain automated estimates of gaze directions [
Finally, to discretize gaze annotation data, we divided the coordinate map into 16 distinct areas of interest (AOIs), as shown in
Gaze annotations. (A) Gaze coordinates range between –1 and 1 on the x- and y-axes. (B) To categorize gaze coordinates into discrete regions, we divided the gaze map into 16 buckets. Each area of interest is labeled with corresponding row and column letters.
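The bucketing step described above can be made concrete with a short sketch. The 4 × 4 grid over the [–1, 1] coordinate range follows the text; the function name and the specific row and column letter labels are illustrative assumptions rather than details of the original pipeline.

```python
ROWS = ["A", "B", "C", "D"]  # hypothetical row labels
COLS = ["a", "b", "c", "d"]  # hypothetical column labels

def gaze_to_aoi(x: float, y: float, grid_size: int = 4) -> str:
    """Map a gaze coordinate in [-1, 1] x [-1, 1] to one of 16 AOI labels."""
    # Rescale each coordinate from [-1, 1] to a bucket index in {0, ..., 3}.
    col = min(int((x + 1) / 2 * grid_size), grid_size - 1)
    row = min(int((y + 1) / 2 * grid_size), grid_size - 1)
    return ROWS[row] + COLS[col]

# Example: a gaze estimate slightly right of and above the frame center.
print(gaze_to_aoi(0.1, -0.2))
```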
Gaze fixation, which occurs when one's gaze is held on a single target for an extended period, plays an important role in social interaction by signaling communicative intent and enabling interpersonal relationships. In a dyadic social interaction, individuals usually fixate their gaze on the target's eyes. However, individuals with ASD often face difficulty with maintaining eye contact and instead tend to focus their visual attention on other regions of the target's face. Several studies involving eye trackers and visual stimuli have shown that children with ASD tend to fixate on the mouth or other body parts; this has even been observed in children aged as young as 2 to 6 months who were later diagnosed with ASD [
To determine the gaze fixation patterns of individuals during a single 90-second game, we used the coarse gaze annotations obtained from our preprocessed data set. For each video in our data set, we computed the percentage of time that the child fixated his or her gaze on each of the 16 predefined AOIs. A 2-sided permutation test was used at every AOI to identify statistically significant differences between the ASD and NT populations, with the null hypothesis that the fixation times for both populations followed an equivalent distribution; we calculated the difference in the mean fixation times for 100,000 rearrangements of the 2 groups. Bonferroni correction was applied to account for multiple hypothesis tests. It is important to note that because the AOIs are correlated, the Bonferroni correction is extremely stringent, reducing the likelihood of type I errors at the cost of statistical power.
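As a rough illustration of the test described above, the sketch below computes a 2-sided permutation P value for a single AOI, assuming asd_times and nt_times hold the per-video fixation percentages for that AOI. The 100,000 rearrangements follow the text; the function itself and the conventional .05 significance level mentioned in the comment are simplified stand-ins for the original analysis code.

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_p_value(asd_times, nt_times, n_perm=100_000):
    """2-sided permutation test on the difference in mean fixation times."""
    observed = abs(np.mean(asd_times) - np.mean(nt_times))
    pooled = np.concatenate([asd_times, nt_times])
    n_asd = len(asd_times)
    extreme = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = abs(np.mean(perm[:n_asd]) - np.mean(perm[n_asd:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_perm

# Bonferroni correction across the 16 AOIs: declare significance at P < .05 / 16.
```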
Humans tend to transition their gaze between various objects in their environments when encountering visual stimuli, a phenomenon called visual scanning. The patterns and frequencies with which humans scan their surroundings can provide insight into how individuals process the world around them. In the context of social interaction, prior research has shown that individuals with ASD vary in the way that they scan a target's facial landmarks during a social scenario, which may contribute to difficulty with interpreting emotional or nonverbal cues. This was shown by Pelphrey et al, who demonstrated that when presented with images of faces, NT individuals typically transitioned their gaze between core features, such as the eyes and nose, whereas individuals with ASD appeared to scan nonfeature areas of the face, such as the forehead and cheeks [
Modeling gaze transition patterns as a graph problem can provide insight into the regions that children focus on while scanning their environments [
Graph model of gaze transitions. We modeled the gaze transitions in each gameplay video as a graph, which was then used to generate a 16 × 16 adjacency matrix.
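The construction shown in the figure can be sketched as follows: one normalized 16 × 16 transition matrix is built per video from the per-frame sequence of AOI indices. The variable and function names are illustrative, and the normalization by total transition count anticipates the procedure described in the next paragraph.

```python
import numpy as np

N_AOI = 16

def transition_matrix(aoi_sequence):
    """Count gaze transitions between consecutive frames, then normalize.

    `aoi_sequence` is the ordered list of AOI indices (0-15), one per valid frame.
    """
    mat = np.zeros((N_AOI, N_AOI))
    for src, dst in zip(aoi_sequence[:-1], aoi_sequence[1:]):
        mat[src, dst] += 1
    total = mat.sum()
    return mat / total if total > 0 else mat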
We computed adjacency matrices for all gameplay videos and normalized each matrix by dividing each entry by the total number of transitions. Then, we computed the average of all matrices associated with the NT individuals in our data set, resulting in a single 16
Next, we used deep learning techniques to measure the predictive power of gaze fixation patterns. We began by converting fixation data points into feature matrices that could serve as the input to our classifiers. We first extracted the sequence of gaze coordinates from each video using the coarse annotation procedure described in the previous section. This resulted in a vector of n ordered pairs (x,y) for every video, where n represents the number of valid frames in the video, and x and y are the gaze fixation coordinates ranging from –1 to 1. We then matched each ordered pair with its associated AOI, as demonstrated in
We used a sliding window approach to divide
Gaze fixation feature representation. In this demonstrative example, we begin with a video consisting of 9 frames. Gaze coordinates are matched with corresponding area of interest (AOI) regions. Using a window of 4 and a shift value of 2 divides vector v into 3 feature vectors. Each feature vector is then one-hot encoded. All input feature matrices are assigned the same label.
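A minimal sketch of the windowing and one-hot encoding illustrated above is shown below, with the window length and shift as parameters; the function name and demo values are illustrative assumptions rather than the original implementation.

```python
import numpy as np

N_AOI = 16

def windowed_features(aoi_indices, w, s):
    """Cut a per-frame AOI index sequence into one-hot feature matrices of shape (w, 16)."""
    features = []
    for start in range(0, len(aoi_indices) - w + 1, s):
        window = aoi_indices[start:start + w]
        one_hot = np.zeros((w, N_AOI))
        one_hot[np.arange(w), window] = 1
        features.append(one_hot)
    return features

# The example from the figure: 9 frames, window of 4, shift of 2 -> 3 matrices.
demo = windowed_features(np.random.randint(0, 16, size=9), w=4, s=2)
print(len(demo))  # 3
```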
We then used deep learning models to determine if gaze fixation patterns could be predictive of ASD. We assigned 324 videos (275 ASD, 49 NT) in our data set to the training set, 71 videos (62 ASD, 9 NT) to the validation set, and 54 videos (43 ASD, 11 NT) to the held-out test set, ensuring that all videos corresponding to a single child were assigned to the same set. Input feature matrices were constructed using the approach described above. A binary label
To exploit the temporal nature of our data set, we used long short-term memory (LSTM) networks, which are a type of recurrent neural network that can model long-term dependencies. A
Model architecture. The model consists of a long short-term memory network with w cells. Each cell accepts a one-hot vector of size 16, represented in the figure by xi, and outputs a cell state ci and a hidden state hi. The final cell is connected to a fully connected layer, which generates a single class output. FC: fully connected layer; LSTM: long short-term memory.
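A minimal PyTorch sketch consistent with the figure: an LSTM over a window of one-hot vectors of size 16, with the final hidden state passed through a fully connected layer that emits a single logit. The hidden size, loss function, and choice of framework are our assumptions; the original study may have used different values.

```python
import torch
import torch.nn as nn

class GazeLSTM(nn.Module):
    def __init__(self, input_size=16, hidden_size=64):  # hidden size is an assumption
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):               # x: (batch, w, 16) one-hot windows
        _, (h_n, _) = self.lstm(x)      # h_n: (1, batch, hidden_size)
        return self.fc(h_n[-1])         # one logit per window

model = GazeLSTM()
logits = model(torch.zeros(8, 100, 16))            # batch of 8 windows with w = 100
loss = nn.BCEWithLogitsLoss()(logits.squeeze(1), torch.zeros(8))
```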
Finally, to characterize model performance, we report 4 metrics: macroaveraged recall, macroaveraged precision, weighted-average recall, and weighted-average precision. As our data set exhibits class imbalance, with cases outnumbering controls, these metrics provide the most accurate representation of model performance. Macroaveraged statistics compute the arithmetic mean of performance on each class, whereas weighted-average statistics compute the mean weighted by class support. We performed all parameter experiments on our validation set and evaluated our final best-performing models on the held-out test set.
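These 4 metrics can be computed directly with scikit-learn, as in the short sketch below; the library choice and the toy labels are assumptions for illustration, with y_true and y_pred standing in for the per-sample ground truth and model predictions.

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0]  # toy labels: 1 = ASD, 0 = NT
y_pred = [1, 0, 1, 1, 0, 1]  # toy model predictions

macro_recall = recall_score(y_true, y_pred, average="macro")
macro_precision = precision_score(y_true, y_pred, average="macro")
weighted_recall = recall_score(y_true, y_pred, average="weighted")
weighted_precision = precision_score(y_true, y_pred, average="weighted")
```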
This study was approved by the Stanford Institutional Review Board (eProtocol number: 39562).
We first analyzed gaze fixation patterns to determine if regions of focus differ between children with ASD and NT children during a single 90-second game. Coarse gaze annotations, which were obtained using the automated labeling procedure described in the Methods section, were grouped into 16 AOIs, and the percentage of time that the child fixated on each region was computed.
Gaze fixation results. The heat maps located at the upper left and lower left show the mean percentage of time that an individual fixated his or her gaze on each area of interest (AOI). The bar charts and the box and whisker plots show the distribution of fixation times across all videos. ASD: autism spectrum disorder; NT: neurotypical.
Next, we used graph methods to analyze the ways in which participants scanned their environments during gameplay. We modeled the gaze transitions in each gameplay video as a network and computed the mean adjacency matrices for the ASD and NT populations, which are shown in
Gaze transition heat maps. These heat maps show the percentage of gaze transitions that occur between each pair of AOIs during a 90-second game. AOI: area of interest; ASD: autism spectrum disorder; NT: neurotypical.
We measured the classification performance of models trained on gaze fixation patterns. Gaze fixation coordinates were encoded as one-hot vectors and passed as input to an LSTM network, which generated a single class output representing the likelihood of ASD. LSTM models were trained with a range of window and shift parameter values and evaluated on the validation set. Our results from the validation set allowed us to identify our top 3 models, which were trained with parameters (1)
Classifier performance on held-out test set with gaze fixation features.
Window | Shift | Macroaveraged recall | Macroaveraged precision | Weighted-average recall | Weighted-average precision
100 | 10 | 0.598 | 0.595 | 0.656 | 0.661 |
200 | 10 | 0.561 | 0.577 | 0.662 | 0.635 |
500 | 10 | 0.576 | 0.577 | 0.625 | 0.624 |
The model with parameters
In this study, we used computational techniques to analyze home videos and obtain diagnostic insights into ASD. We collected a large data set of semistructured videos featuring children engaged in gameplay with a parent, and we analyzed 2 key markers of social engagement that have been shown to differ between children with ASD and their NT peers: (1) gaze fixation and (2) visual scanning. For each marker, we identified statistically significant differences between the 2 cohorts and demonstrated that this information could be useful in identifying the presence of ASD.
Our study demonstrates the potential that mobile tools hold for quantifying visual patterns and providing insights into ASD. Despite the presence of high heterogeneity and varying quality in our data set, the automated labeling techniques and deep learning classifiers used in this work were able to extract usable signals and identify differences in gaze fixation and visual scanning patterns between the 2 cohorts. These methods also enabled us to preserve participant privacy by avoiding the use of human annotators. Our findings support prior works that have identified social and visual engagement differences between individuals with ASD and NT individuals [
This work has some limitations. First, due to the class imbalance in our data set, the predictive accuracy for children with ASD differs from that for NT controls; this is reflected in
Future directions for this work include expanding the size of the experimental population; analyzing additional motion-based features in gameplay videos, such as limb movements and coordination; performing qualitative human-centered investigations or pragmatic randomized controlled trials to evaluate clinical usability; and evaluating the real-world diagnostic capabilities of our approach across diverse environmental settings [
Overall, this study demonstrates the usefulness of game-based mobile apps and heterogeneous video data sets in aiding in the diagnosis of ASD. With further research and development, the system described in this work can ultimately serve as a low-cost and accessible diagnostic tool for a global population.
AOI: area of interest
ASD: autism spectrum disorder
LSTM: long short-term memory
NT: neurotypical
This work was supported in part by funds to DPW from the National Institutes of Health (1R01EB025025-01, 1R01LM013364-01, 1R21HD091500-01, 1R01LM013083), the National Science Foundation (Award 2014232), The Hartwell Foundation, Bill and Melinda Gates Foundation, Coulter Foundation, Lucile Packard Foundation, Auxiliaries Endowment, the ISDB Transform Fund, the Weston Havens Foundation, and program grants from Stanford’s Human Centered Artificial Intelligence Program, Precision Health and Integrated Diagnostics Center, Beckman Center, Bio-X Center, Predictives and Diagnostics Accelerator, Spectrum, Spark Program in Translational Research, MediaX, and from the Wu Tsai Neurosciences Institute's Neuroscience: Translate Program. We also acknowledge generous support from David Orr, Imma Calvo, Bobby Dekesyer and Peter Sullivan. MV is supported by the Knight-Hennessy Scholars program at Stanford University and the National Defense Science and Engineering Graduate (NDSEG) Fellowship. PW would like to acknowledge support from Mr. Schroeder and the Stanford Interdisciplinary Graduate Fellowship (SIGF) as the Schroeder Family Goldman Sachs Graduate Fellow.
DPW is the founder of Cognoa.com, a company developing digital health solutions for pediatric care. AK works as a part-time consultant with Cognoa.com. All other authors declare no conflicts of interest.