Previous studies have shown promising results in identifying individuals with autism spectrum disorder (ASD) by applying machine learning (ML) to eye-tracking data collected while participants viewed various types of images (ie, pictures, videos, and web pages). Although gaze behavior is known to differ between face-to-face interaction and image-viewing tasks, no study has investigated whether eye-tracking data from face-to-face conversations can also accurately identify individuals with ASD.
The objective of this study was to examine whether eye-tracking data from face-to-face conversations could classify children with ASD and typical development (TD). We further investigated whether combining features on visual fixation and length of conversation would achieve better classification performance.
Eye tracking was performed on children with ASD and TD while they engaged in face-to-face conversations (comprising four conversational sessions) with an interviewer. Four ML classifiers, combined with forward feature selection, were used to determine the maximum classification accuracy and the corresponding features: support vector machine (SVM), linear discriminant analysis, decision tree, and random forest.
A maximum classification accuracy of 92.31% was achieved with the SVM classifier by combining features on both visual fixation and session length. The classification accuracy of combined features was higher than that obtained using visual fixation features (maximum classification accuracy 84.62%) or session length (maximum classification accuracy 84.62%) alone.
Eye-tracking data from face-to-face conversations could accurately classify children with ASD and TD, suggesting that ASD might be objectively screened in everyday social interactions. However, these results will need to be validated with a larger sample of individuals with ASD (with varying symptom severity and a balanced sex ratio) using data collected from different modalities (eg, eye tracking, kinematic, electroencephalogram, and neuroimaging). In addition, individuals with other clinical conditions (eg, developmental delay and attention deficit hyperactivity disorder) should be included in similar ML studies for detecting ASD.
Autism spectrum disorder (ASD) is a complex neurodevelopmental condition characterized by social communication deficits along with restricted and repetitive behavior [
With respect to seeking objective biomarkers for ASD, recent studies reflect increasing interest in applying machine learning (ML) algorithms to examine whether features extracted from neuroimaging [
Recently, a few studies have revealed that eye-tracking data could be used to identify ASD by implementing ML algorithms [
The eye-tracking data used in these prior studies were primarily obtained by having participants watch images (ie, videos, pictures, web pages) [
The major novelty of this study is that we investigated the feasibility of using eye-tracking data from face-to-face conversations to classify children with ASD and TD. This research question is of practical significance because face-to-face interaction is omnipresent in everyday life. With the development of eye-tracking technology that enables the detection of natural social gaze behavior, ASD might be initially screened in daily life without the need for lengthy and sophisticated procedures in clinical settings. In addition, apart from visual fixation measures, we included the length of conversation as an input feature to investigate whether combining features from these two modalities would improve classification performance. The majority of prior eye-tracking ML research has focused on using gaze data alone to identify ASD. To the best of our knowledge, only two recent studies combined eye tracking with electroencephalography (EEG) or kinematic data, showing that combined features yielded better classification performance than features from a single modality [
Data used in this study were obtained from a research project aimed at identifying behavioral markers of ASD. Twenty children with ASD and 23 children with TD were enrolled in the study. Children with ASD were recruited from the Child Psychiatry Department of Shenzhen Kangning Hospital. Owing to limited access to instruments such as the Autism Diagnostic Observation Schedule or the Autism Diagnostic Interview-Revised, ASD was primarily diagnosed by a licensed psychiatrist with at least 5 years of clinical experience following the Diagnostic and Statistical Manual of Mental Disorders-IV criteria. In addition, the ASD diagnosis was further evaluated by a senior psychiatrist. A consultation with at least two additional senior psychiatrists was arranged if there was disagreement among the specialists. These procedures helped ensure the reliability of the ASD diagnoses for the children enrolled in our study. Additional inclusion criteria were as follows: (1) aged between 6 and 13 years; (2) at least average nonverbal intelligence (IQ level was initially screened by the psychiatrist, and measured with the Raven advanced progressive matrices [
Participants were asked to engage in a structured face-to-face conversation with a 33-year-old female interviewer who was blinded to the participant’s group membership. The interviewer was required to behave consistently across all interviews with all participants. Participants were required to wear a head-mounted eye tracker (Tobii Pro Glasses 2; sampling rate: 50 Hz; Tobii Technology, Stockholm, Sweden) during the conversation, and they were seated 80 cm away from the interviewer’s chair (
Experimental setup.
Participants were not informed of the function of the eye tracker, and they were asked to avoid moving the glasses or making any intense head movements during the conversation. A postexperiment interview confirmed that none of the participants was aware that their gaze behavior had been recorded. In addition, whenever the eye tracker was moved by a participant (particularly those with ASD), an accuracy test was performed at the end of the conversation to verify the accuracy of the eye-tracking recording. These verifications showed that the Tobii Pro Glasses 2 remained reliably accurate even when the glasses were moved by participants during the conversation.
The structured conversation consisted of four chronologically arranged sessions: general questions in the first session, hobby sharing in the second session, yes-no questions in the third session, and question raising in the fourth session. The first session allowed the interviewer and the child to become familiar with each other. The second session served the purpose of examining the participants’ behavior when speaking about their hobbies, which might induce different gaze behavior from that induced when discussing more generic topics [
What is your name?
How is your name written?
What is the name of your school and what grade are you in?
Who is your best friend? What is your favorite thing to do together?
Could you please share with me the most interesting thing that happened last week? Let me know the time, place, people, and the whole process of the event.
What is the plan for your summer vacation?
What is your favorite thing to do? And can you tell me why you like doing it?
Do you like apples? Do you like to go to the zoo? Do you like to go to school? Do you like reading? Do you like painting? Do you like watching cartoons? Do you like sports? Do you like watching movies? Do you like traveling? Do you like shopping?
Now that I have asked you many questions, do you have any questions for me?
Data of four participants (one with ASD and three with TD) were discarded due to technical problems that occurred during the eye-tracking process. Hence, the final dataset consisted of 20 children with TD and 19 children with ASD. The participants’ demographic information is presented in
The eye-tracking data were analyzed with Tobii Pro Lab software, which enables processing visual fixation data on dynamic stimuli. Note that the interviewer was also a dynamic stimulus as she was interacting with the participants throughout the conversation.
Two types of features were extracted from the eye-tracking data: visual fixation features and session length features. For the visual fixation features, four areas of interest (AOIs) were analyzed: the eyes, mouth, whole face, and whole body (
Comparison of demographic information and the area of interest (AOI)-based features in the autism spectrum disorder (ASD) and typical development (TD) groups.
Characteristic | ASD | TD | Group comparison | P value
Sex ratio, M:F | 17:2 | 17:3 | χ²₁=0.17 | .68
Age (months), mean (SD) | 99.6 (25.1) | 108.8 (27.0) | t₃₇=1.09 | .28
IQ, mean (SD) | 100.8 (22.7) | 116.1 (22.7) | t₃₇=2.45 | .02
AOI-based features, mean (SD)a
Mouth_Session 1 | 0.05 (0.06) | 0.19 (0.13) | | <.001
Eyes_Session 1 | 0.06 (0.06) | 0.08 (0.09) | | .63
Face_Session 1 | 0.21 (0.17) | 0.41 (0.18) | | .001
WholeBody_Session 1 | 0.33 (0.23) | 0.55 (0.21) | | .003
Mouth_Session 2 | 0.05 (0.09) | 0.16 (0.13) | | .004
Eyes_Session 2 | 0.04 (0.04) | 0.06 (0.07) | | .18
Face_Session 2 | 0.17 (0.16) | 0.39 (0.20) | | .001
WholeBody_Session 2 | 0.29 (0.26) | 0.52 (0.25) | | .008
Mouth_Session 3 | 0.12 (0.15) | 0.21 (0.17) | | .10
Eyes_Session 3 | 0.07 (0.06) | 0.08 (0.10) | | .91
Face_Session 3 | 0.33 (0.26) | 0.49 (0.21) | | .05
WholeBody_Session 3 | 0.46 (0.28) | 0.56 (0.20) | | .12
Mouth_Session 4 | 0.05 (0.06) | 0.12 (0.12) | | .05
Eyes_Session 4 | 0.06 (0.09) | 0.08 (0.11) | | .85
Face_Session 4 | 0.21 (0.20) | 0.32 (0.18) | | .07
WholeBody_Session 4 | 0.34 (0.25) | 0.47 (0.22) | | .07
aDue to a violation of the normality assumption, Mann-Whitney U tests were used for the group comparisons of the AOI-based features.
Four areas of interest.
To obtain the percentage of visual fixation time on each AOI, the first step was to draw a snapshot image from the eye-tracking video for the purpose of defining the AOIs. Once the AOIs were defined, Tobii Pro Lab’s real-world mapping algorithm automatically mapped each gaze point in the video onto the corresponding location in the snapshot image. The correctness of the mapping process was confirmed by a human observer. Manual mapping was performed when no fixation was automatically mapped onto the snapshot or when an automatically assigned fixation failed to match the correct location. In this way, the accuracy of the visual fixation mapping was reliably ensured. Note that we used the velocity-threshold identification (I-VT) fixation filter to define fixations: a fixation was detected when the velocity of the eye movement remained below 30 degrees per second for at least 60 milliseconds. Finally, the percentage of visual fixation time on each AOI in a session was computed as the fixation time on that AOI divided by the total duration of the session. Results of the group comparison on the AOI-based features in the different sessions are presented in
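For illustration, the following Python sketch shows how an I-VT filter and the AOI percentages described above could be computed from raw gaze samples. It is a simplified, hypothetical reconstruction: the one-dimensional `angles_deg` array and the per-sample `aoi_labels` lookup are assumptions (real gaze data are two-dimensional, and Tobii Pro Lab performs these steps internally).

```python
import numpy as np

SAMPLE_RATE = 50           # Hz, Tobii Pro Glasses 2
VELOCITY_THRESHOLD = 30.0  # degrees per second (I-VT filter)
MIN_FIX_DURATION = 0.060   # seconds (minimum fixation duration)

def detect_fixations(angles_deg):
    """Classify gaze samples into fixations with a simplified I-VT filter.
    angles_deg: 1-D array of gaze direction in degrees of visual angle,
    sampled at SAMPLE_RATE. Returns (start, end) sample-index pairs."""
    dt = 1.0 / SAMPLE_RATE
    velocity = np.abs(np.diff(angles_deg)) / dt  # deg/s between samples
    is_fix = velocity < VELOCITY_THRESHOLD
    fixations, start = [], None
    for i, fix in enumerate(is_fix):
        if fix and start is None:
            start = i                         # fixation candidate begins
        elif not fix and start is not None:
            if (i - start) * dt >= MIN_FIX_DURATION:
                fixations.append((start, i))  # long enough: keep it
            start = None
    if start is not None and (len(is_fix) - start) * dt >= MIN_FIX_DURATION:
        fixations.append((start, len(is_fix)))
    return fixations

def aoi_fixation_percentage(fixations, aoi_labels, aoi, session_duration_s):
    """Percentage of session time spent fixating a given AOI.
    aoi_labels[i] is the AOI ('mouth', 'eyes', 'face', 'body', or None)
    that sample i falls on after mapping onto the snapshot image."""
    dt = 1.0 / SAMPLE_RATE
    time_on_aoi = sum(dt for s, e in fixations
                      for i in range(s, e) if aoi_labels[i] == aoi)
    return time_on_aoi / session_duration_s
```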
The length of each session varied across participants. Mann-Whitney U tests were used to compare the session lengths between the two groups.
Sixteen visual fixation features (percentages of visual fixation time on the four AOIs [mouth, eyes, face, and whole body] in each of the four conversation sessions) and five session length features (the lengths of the four sessions and the total session length) were computed and fed into the ML procedure. Therefore, the original dataset for the ML procedure was a 39 (participants)×21 (features) matrix. Three types of ML models were established, one with visual fixation features alone, one with session length features alone, and one with combined features from both modalities, to investigate whether the combined features would yield better classification performance.
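A minimal sketch of how this 39×21 feature matrix could be assembled, assuming the per-session AOI percentages and session lengths have already been computed (the `fix_pct` and `session_len` structures and the `participants` container are hypothetical names):

```python
import numpy as np

AOIS = ["mouth", "eyes", "face", "body"]
SESSIONS = [1, 2, 3, 4]

def build_feature_vector(fix_pct, session_len):
    """fix_pct[(aoi, session)]: fixation percentage on an AOI in a session;
    session_len[session]: session duration in seconds."""
    visual = [fix_pct[(a, s)] for s in SESSIONS for a in AOIS]  # 16 features
    lengths = [session_len[s] for s in SESSIONS]                # 4 features
    lengths.append(sum(lengths))                                # total session length
    return np.array(visual + lengths)                           # 21 features in total

# One row per participant gives the 39 x 21 matrix, with binary labels
# (eg, ASD = 1, TD = 0):
# X = np.vstack([build_feature_vector(p.fix_pct, p.session_len) for p in participants])
# y = np.array([p.label for p in participants])
```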
The classification task was performed by implementing four ML classifiers: support vector machine (SVM), linear discriminant analysis (LDA), decision tree (DT), and random forest (RF). The description of these classifiers is detailed below.
SVM is a supervised learning algorithm that has been previously implemented in classifying individuals with and without ASD [
The task of distinguishing children with ASD from those with TD is a binary classification problem. In this case, the LDA classifier works as a dimension reduction technique that projects all data points from the high-dimensional feature space onto a straight line (ie, one dimension) learned from the training samples. Test samples are then classified into one of the two groups according to a threshold value along that line.
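As a rough illustration of this projection-and-threshold behavior, using scikit-learn with toy stand-in data (not the study's actual data or code):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy stand-ins for 21-dimensional feature vectors (hypothetical data).
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 1.0, (15, 21)),   # one class
                     rng.normal(1.0, 1.0, (15, 21))])  # the other class
y_train = np.array([0] * 15 + [1] * 15)
X_test = rng.normal(0.5, 1.0, (5, 21))

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)     # learns the one-dimensional projection
z = lda.transform(X_test)     # each test sample becomes a single coordinate
y_pred = lda.predict(X_test)  # a threshold along that line assigns the class
```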
The DT classifier is a tree-like flowchart. The nodes in the model represent tests on an attribute, the branches represent the outcomes of the tests, and the leaf nodes denote class labels. The DT classifier exhibits the advantage of strong interpretability, but it is prone to overfitting.
Instead of building a single tree, the RF classifier creates multiple simple trees from the training data. Test samples are assigned to the group that receives the majority of votes from these trees.
Forward feature selection (FFS) was applied to select features for model training and testing. Specifically, FFS is an iterative process that starts by evaluating the classification performance of each individual feature. The feature with the highest classification accuracy is preserved and then combined with each of the remaining features to form two-feature models, whose classification performance is evaluated in turn. The best-performing pair of features is retained and used to establish three-feature models by combining it with each of the remaining features. By repeating this procedure, one-feature, two-feature, …, and eventually 21-feature models were established and evaluated.
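A minimal sketch of this FFS loop, using leave-one-out cross-validation (LOOCV, as in the flowchart referenced below) to score candidate feature sets in scikit-learn; the linear-kernel SVM shown here is an assumption, since the classifier hyperparameters are not specified above:

```python
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

def forward_feature_selection(X, y, max_features=21):
    """Greedy FFS: at each step, add the single feature that maximizes
    the LOOCV classification accuracy of the model built so far."""
    selected, remaining, history = [], list(range(X.shape[1])), []
    while remaining and len(selected) < max_features:
        # Score every candidate extension of the current feature set.
        scores = [
            (cross_val_score(SVC(kernel="linear"), X[:, selected + [f]], y,
                             cv=LeaveOneOut()).mean(), f)
            for f in remaining
        ]
        best_acc, best_f = max(scores)
        selected.append(best_f)
        remaining.remove(best_f)
        history.append((list(selected), best_acc))  # n-feature model, accuracy
    return history

# Usage: history = forward_feature_selection(X, y) with the 39 x 21 matrix X
# and binary labels y; the best model is the entry with the highest accuracy.
```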
The entire ML procedure is schematically presented in
Flowchart of the machine learning procedure. LOOCV: leave-one-out cross-validation.
The variation in classification accuracy according to the number of features used in the model is illustrated in
Variation of the classification accuracy with the number of features. SVM: support vector machine; LDA: linear discriminant analysis; DT: decision tree; RF: random forest.
The classification performance of the SVM classifier was the highest among the four classifiers. The variation of the SVM classification performance according to the number of features is presented in
Variation of the support vector machine classification performance with different features.
Number of features | Added feature | Accuracy (%) | Sensitivity (%) | Specificity (%) |
1 | Total SLa | 79.49 | 68.42 | 90.00 |
2 | ~b +Mouth_Session 1 | 84.62 | 78.95 | 90.00 |
3 | ~ +WholeBody_Session 3 | 92.31 | 84.21 | 100.00
4 | ~ +Face_Session 3 | 92.31 | 84.21 | 100.00 |
5 | ~ +Face_Session 2 | 92.31 | 89.47 | 95.00 |
6 | ~ +Eyes_Session 4 | 92.31 | 89.47 | 95.00 |
7 | ~ +Face_Session 1 | 92.31 | 89.47 | 95.00 |
8 | ~ +SL_Session 2 | 92.31 | 89.47 | 95.00 |
9 | ~ +WholeBody_Session 1 | 89.74 | 89.47 | 90.00
10 | ~ +Face_Session 4 | 92.31 | 89.47 | 95.00 |
11 | ~ +Mouth_Session 2 | 92.31 | 89.47 | 95.00 |
12 | ~ +Eyes_Session 1 | 89.74 | 84.21 | 95.00 |
13 | ~ +Eyes_Session 2 | 89.74 | 84.21 | 95.00 |
14 | ~ +Mouth_Session 3 | 87.18 | 84.21 | 90.00 |
15 | ~ +SL_Session 3 | 89.74 | 84.21 | 95.00 |
16 | ~ +WholeBody_Session 4 | 89.74 | 84.21 | 95.00
17 | ~ +Mouth_Session 4 | 87.18 | 84.21 | 90.00 |
18 | ~ +Eyes_Session 3 | 84.62 | 78.95 | 90.00 |
19 | ~ +SL_Session 1 | 82.05 | 78.95 | 85.00 |
20 | ~ +SL_Session 4 | 79.49 | 78.95 | 80.00 |
21 | ~ +WholeBody_Session 2 | 76.92 | 73.68 | 80.00
aSL: session length.
bIn forward feature selection, ~ represents all features in the previous iteration; for example, ~ represents all 6 previously selected features in the 7th iteration.
The confusion matrix of this three-feature model that achieved the highest accuracy is presented in
Confusion matrix of the support vector machine classifier with the highest accuracy.a
Actual class | Predicted class: TDb | Predicted class: ASDc
TD | TNd=20 | FPe=0
ASD | FNf=3 | TPg=16
aAccuracy=(TP+TN)/(TP+FP+FN+TN); sensitivity=TP/(TP+FN); specificity=TN/(FP+TN).
bTD: typical development.
cASD: autism spectrum disorder.
dTN: true negative.
eFP: false positive.
fFN: false negative.
gTP: true positive.
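Plugging the values from this confusion matrix into the formulas in footnote a reproduces the reported performance of the three-feature SVM model:

```python
TP, TN, FP, FN = 16, 20, 0, 3  # from the confusion matrix above

accuracy = (TP + TN) / (TP + FP + FN + TN)  # 36/39
sensitivity = TP / (TP + FN)                # 16/19
specificity = TN / (FP + TN)                # 20/20

print(f"accuracy={accuracy:.2%}, sensitivity={sensitivity:.2%}, "
      f"specificity={specificity:.2%}")
# accuracy=92.31%, sensitivity=84.21%, specificity=100.00%
```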
Boxplots of three features that achieved the highest classification accuracy in the support vector machine classifier along with the three mislabeled samples. ASD: autism spectrum disorder; TD: typical development.
Following the same procedure but feeding only the AOI-based features into the ML classifiers, a maximum classification accuracy of 84.62% was achieved by the LDA classifier (specificity=80.00%, sensitivity=89.47%, AUC=0.86) with three features (mouth in session 1, face in session 2, and mouth in session 3) and by the DT classifier (specificity=80.00%, sensitivity=89.47%, AUC=0.86) with two features (face in session 2 and eyes in session 3).
When using only session length features to perform the classification task, the maximum classification accuracy of 84.62% was achieved by the SVM classifier (specificity=90.00%, sensitivity=78.95%, AUC=0.87) with four features (session length in sessions 1, 3, and 4, and total session length).
In this study, we extracted features describing visual fixation and session length from eye-tracking data collected during face-to-face conversations and investigated their capacity for classifying children with ASD and TD. The maximum classification accuracy of 92.31% was achieved by the SVM classifier when combining features on both visual fixation and session length. This was higher than the accuracy obtained using visual fixation features (maximum accuracy 84.62%) or session length features (maximum accuracy 84.62%) alone. Since 19 children with ASD and 20 children with TD were included in the final dataset, there was a slight class imbalance. Majority class prediction is typically used as a baseline for imbalanced classification; in the context of this study, it amounts to predicting “TD” for every participant, which would yield a classification accuracy of only 51.3% (ie, 20/39), far below the optimal accuracy of our models. This indicates that our results cannot be explained by majority class prediction.
The highest classification accuracy was achieved with three features: total session length, percentage of visual fixation time on the mouth AOI in the first session, and percentage of visual fixation time on the whole body AOI in the third session. As shown in
Notably, fixation measures on the mouth and whole body AOIs played important roles in the SVM classifier that produced the highest classification accuracy. The mouth AOI emerged as a prominent feature in this study, possibly because participants were engaged in a conversational task. Previous studies have shown that the mouth is an important feature that attracts gaze during conversations [
Apart from the fact that we used data from face-to-face interaction as opposed to data obtained from image-viewing tasks used in previous related studies, our study is different from other eye-tracking ML studies in two main aspects. First, this study recruited children aged between 6 and 13 years, whereas Wan et al [
To ensure that the participants would be able to converse with the interviewer, we recruited children within the age range of 6-13 years with at least average intellectual ability. Participants with severe symptoms of autism were not included. In addition, only five girls (2 with ASD and 3 with TD) were included in the final sample. Prior studies reported that males with ASD differ from females with ASD in many respects, including behavioral presentation, cognitive domains, and emotions [
This study used a head-mounted eye tracker to record gaze behavior, which might affect the social behavior of children with ASD to a larger extent. In general, individuals with ASD are more sensitive to wearing devices, and eye-tracking techniques usually require extensive calibration [
Our study only computed the percentage of visual fixation time on different AOIs as measures of gaze behavior. In fact, a variety of other features could be obtained from the gaze behavior, including the number of fixations, entropy, and number of revisits [
Using eye-tracking data from face-to-face interaction was a major novelty of this study. However, human interaction may introduce a variety of subjective factors that are difficult to control but that might influence the gaze behavior of participants. For example, the interviewer might have unconsciously behaved differently with the children with ASD than with the TD group, even though she was required to maintain a similar manner of behavior when interacting with participants in both groups. To examine whether the interviewer behaved consistently with both groups of participants, the overall amount of movement she made during the conversation was estimated using image differencing techniques applied to the video recordings [
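As an illustration of this kind of analysis, a simple frame-differencing estimate of overall movement might look like the following OpenCV sketch; the exact technique used in the study may differ, and the video path is a placeholder:

```python
import cv2
import numpy as np

def movement_score(video_path):
    """Estimate overall movement as the mean absolute difference
    between consecutive grayscale frames of a recording."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        raise IOError(f"cannot read {video_path}")
    prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    diffs = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        diffs.append(np.mean(cv2.absdiff(gray, prev)))  # per-frame change
        prev = gray
    cap.release()
    return float(np.mean(diffs))  # higher values indicate more movement

# Usage: comparing movement_score("interview_asd.mp4") with
# movement_score("interview_td.mp4") gives a rough consistency check.
```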
Our study extracted features from eye-tracking data recorded during face-to-face conversations to investigate their capacity to detect children with ASD. With a relatively small sample, our results showed that combining features on visual fixation and session length could accurately classify children with ASD and those with TD. It is proposed that future eye-tracking ML studies could use features from gaze-based measures [
ADHD: attention deficit hyperactivity disorder
AOI: area of interest
ASD: autism spectrum disorder
AUC: area under the receiver operating characteristic curve
DT: decision tree
EEG: electroencephalogram
FFS: forward feature selection
LDA: linear discriminant analysis
ML: machine learning
RF: random forest
SVM: support vector machine
TD: typical development
This study was financially supported by the SZU funding project (number 860-000002110259), Science and Technology Innovation Committee of Shenzhen (number JCYJ20190808115205498), Key Medical Discipline of GuangMing Shenzhen (number 12 Epidemiology), Sanming Project of Medicine in Shenzhen (number SZSM201612079), Key Realm R&D Program of Guangdong Province (number 2019B030335001), Shenzhen Key Medical Discipline Construction Fund (number SZXK042), and Shenzhen Double Chain Grant (number [2018]256).
ZZ, XZ, XQ, and JL designed the experiment. ZZ, HT, and XH performed the data analyses. ZZ, HT, and XQ wrote the manuscript.
None declared.