This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Electronic health records (EHRs) bring many opportunities for information utilization. One such use is the surveillance conducted by the Centers for Disease Control and Prevention to track cases of autism spectrum disorder (ASD). This process currently comprises manual collection and review of EHRs of 4- and 8-year-old children in 11 US states for the presence of ASD criteria. The work is time-consuming and expensive.
Our objective was to automatically extract from EHRs the descriptions of behaviors noted by clinicians as evidence of the diagnostic criteria in the Diagnostic and Statistical Manual of Mental Disorders (DSM). Previously, we reported on the classification of entire EHRs as ASD or not. In this work, we focus on the extraction of individual expressions of the different ASD criteria in the text. We intend to facilitate large-scale surveillance efforts for ASD and support analysis of changes over time as well as enable integration with other relevant data.
We developed a natural language processing (NLP) parser to extract expressions of 12 DSM criteria using 104 patterns and 92 lexicons (1787 terms). The parser is rule-based to enable precise extraction of the entities from the text. The entities themselves appear in the EHRs as very diverse expressions of the diagnostic criteria, written by different people (clinicians, speech pathologists, among others) at different times. Due to the sparsity of the data, a rule-based approach is best suited until larger datasets can be generated for machine learning algorithms.
We evaluated our rule-based parser and compared it with a machine learning baseline (decision tree). Using a test set of 6636 sentences (50 EHRs), we found that our parser achieved 76% precision, 43% recall (ie, sensitivity), and >99% specificity for criterion extraction. The performance was better for the rule-based approach than for the machine learning baseline (60% precision and 30% recall). For some individual criteria, precision was as high as 97% and recall 57%. Since precision was very high, we were assured that criteria were rarely assigned incorrectly, and our numbers presented a lower bound of their presence in EHRs. We then conducted a case study and parsed 4480 new EHRs covering 10 years of surveillance records from the Arizona Developmental Disabilities Surveillance Program. The social criteria (A1 criteria) showed the biggest change over the years. The communication criteria (A2 criteria) did not distinguish the ASD from the non-ASD records. Among behaviors and interests criteria (A3 criteria), 1 (A3b) was present with much greater frequency in the ASD than in the non-ASD EHRs.
Our results demonstrate that NLP can support large-scale analysis useful for ASD surveillance and research. In the future, we intend to facilitate detailed analysis and integration of national datasets.
Based on data from autism spectrum disorder (ASD) surveillance, it is estimated that the prevalence of ASD is approximately 1.5% [
Data on long-term trends, symptoms, diagnoses, evaluations, and treatments are critical for planning interventions and educational and health services. To understand and act upon such trends, large-scale studies are needed that can evaluate trends over time, integrate different types of data, and review large datasets. In recent years, data have been increasingly electronically encoded in electronic health records (EHRs) in structured fields and free text. Collection of such EHRs enables analyses that compare and contrast ASD prevalence in relation to other variables and over time.
Much of the published work on ASD leverages information in the structured fields of the EHRs such as gender, medication taken by the mother, birth complications, scores on a variety of tests, and others. The structured data portions are relatively easy to extract and are useful for large-scale studies. However, the results of the analysis are commonly limited to reviews and counts of the presence of conditions in certain populations [
EHRs of people with ASD contain extensive free text fields with important information that is often complementary to, and more detailed and explanatory than, the structured data. This is because in the absence of any biological laboratory test, diagnosis is generally made in person using specific test instruments, history, and differential diagnosis, and much of this information is recorded as narrative. Automatically extracting this information from the EHRs requires natural language processing (NLP). So far, a few NLP approaches have been used to analyze language generated by people on the spectrum [
The existing projects that focus on the text in EHRs fall into two groups. The first group focuses on using all the free text combined with structured fields to automatically assign case status (classification of patients as cases of autism or not) to an entire record. A variety of machine learning algorithms are useful for this task. Using a subset of the EHRs for training, these algorithms create a model that can be applied to future EHRs to assign case status. These models can be human-interpretable, such as decision trees, or can be black box approaches, such as neural networks. In our own work [
In addition to case status assignment, more detailed use of the information contained in the free text would be helpful for large-scale analysis, for example, cultural comparisons as suggested by Mandy et al [
In this project, we aimed to extract the expressions of the behaviors indicative of individual diagnostic criteria as described in the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition, Text Revision (DSM-IV-TR) [
We first describe the development and evaluation of the parser, including important decisions on using off-the-shelf tools and machine learning algorithms. Then, we present a case study where we applied our parser to almost 5000 EHRs to demonstrate usefulness for detailed analysis over time.
Our parser uses human-interpretable rules to match complex patterns that represent the DSM diagnostic criteria. These rule-based algorithms rely on the creation of patterns of terms, grammatical relationships, and the surrounding text to recognize the entities of interest in text.
We work with EHRs created by the Arizona Developmental Disabilities Surveillance Program (ADDSP) as part of the CDC multicenter Autism and Developmental Disabilities Monitoring Network surveillance. Our ADDSP records are collected from educational and clinical data sources in 11-15 school districts for 8-year-olds. From 2000 to 2010, a total of 27,515 records were reviewed and 6176 records were abstracted that included any of the 32 social behavioral triggers consistent with ASD as listed in the Abstraction Manual developed by the CDC. These records referred to 4491 children. The identified records for each child were further evaluated by trained clinical reviewers who applied standardized criteria to highlight criteria and determine ASD case status. This yielded 2312 confirmed cases.
We have access to the records and the case status of each child as determined through expert review of the information. For this study, we leveraged a subset of these records (n=93) that have been printed and have the diagnostic criteria annotated on the paper version. The electronic version does not include markings indicating the criteria. Therefore, we first created an electronic gold standard with all information combined. Records were loaded using WebAnno [
We intend to automate the extraction of the DSM-IV-TR [
Rules
A: A total of 6 or more items from (1), (2), and (3), with at least 2 from (1) and 1 each from (2) and (3):
1: Qualitative impairment in social interaction, as manifested by at least 2 of the following:
A1a: Marked impairment in the use of multiple nonverbal behaviors such as eye-to-eye gaze, facial expression, body postures, and gestures to regulate social interaction
A1b: Failure to develop peer relationships appropriate to developmental level
A1c: A lack of spontaneous seeking to share enjoyment, interests, or achievements with other people
A1d: Lack of social or emotional reciprocity
2: Qualitative impairments in communication as manifested by at least 1 of the following:
A2a: Delay in, or total lack of, the development of spoken language (not accompanied by an attempt to compensate through alternative modes of communication such as gesture or mime)
A2b: In individuals with adequate speech, marked impairment in the ability to initiate or sustain a conversation with others
A2c: Stereotyped and repetitive use of language or idiosyncratic language
A2d: Lack of varied, spontaneous make-believe play or social imitative play appropriate to developmental level
3: Restricted, repetitive, and stereotyped patterns of behavior, interests, and activities, as manifested by at least 1 of the following:
A3a: Encompassing preoccupation with 1 or more stereotyped and restricted patterns of interest that is abnormal either in intensity or focus
A3b: Apparently inflexible adherence to specific, nonfunctional routines or rituals
A3c: Stereotyped and repetitive motor mannerisms
A3d: Persistent preoccupation with parts of objects
To our knowledge, no parsers exist that identify DSM criteria in EHRs. As part of our development, we evaluated MetaMap [
Using 2 EHRs from our development set, we analyzed MetaMap’s outcome in the context of ASD. From the 2 EHRs, a total of 259 phrases were extracted and mapped to 632 UMLS concepts. Overall, 46.5% (294/632) of all candidate mappings for those phrases were correct and useful for our domain; 55.2% (143/259) of phrases were given a single candidate mapping to UMLS concepts, and for those single matches, the accuracy was high, with 81.1% (116/143) correct and useful matches for our domain. However, when the number of matched semantic types increases, it becomes increasingly complicated to identify the correct concept and associated semantic type. Furthermore, the majority of semantic types do not apply to our domain. Using a very lenient approach, we consider approximately 31 semantic types useful to match to DSM criteria (eg, Activity, Anatomical Structure, Behavior, Body Part, Organ or Organ Component, Body Substance, Clinical Attribute, Conceptual Entity, and Daily or Recreational Activity, among others). Even when the 259 phrases we analyzed are restricted to these 31 relevant semantic types, this is not enough to distinguish ASD diagnostic criteria from the rest of the text: only 27.0% (70/259) of phrases intersect with ASD diagnostic criteria. Because the number of types that are immediately useful is small and this MetaMap outcome would require significant development to adjust for our purpose, building an extraction system on top of it is impractical. Therefore, we decided to build all the components in-house.
When developing a new entity extraction system, either a rule-based or a machine learning approach is chosen as the starting point; the two can later be combined in ensemble methods. We performed a baseline test using a decision tree, chosen because it is a human-interpretable machine learning algorithm.
We formulated the problem as a multiclass sentence classification problem (12 diagnostic labels or a null label). We used Stanford CoreNLP (version 3.7.0) for NLP processing. We used a standard bag-of-words approach and trained the algorithm on 120 records containing 19,428 sentences. Because our records contain approximately 0.5%-5% sentences describing a DSM criterion, we undersampled negative examples during training to improve recall: for each positive example, we sampled 30 negative examples (except for criteria A2a and A2b, which occurred frequently enough to train on the entire training data). Our features were lemmas, as determined by CoreNLP, which appeared more than 5 times in the training data (2913 terms). We used a pruned decision tree (Weka version 3.8.0) with a pruning confidence threshold of 0.25. The size of the vocabulary, the undersampling ratio, and the pruning threshold were determined based on the best values we found during exploration.
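The following sketch illustrates the setup of this baseline. It is not the pipeline we actually ran (which used Stanford CoreNLP for lemmatization and Weka's pruned decision tree); it approximates the same steps with scikit-learn, and because Weka's pruning confidence threshold has no direct scikit-learn equivalent, a leaf-size constraint stands in for pruning.

```python
# Illustrative sketch of the baseline (assumption: not the original CoreNLP/Weka
# pipeline): bag-of-lemmas features, undersampled negatives, and a decision tree.
import random
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

def train_baseline(sentences, labels, ratio=30, min_count=5, seed=0):
    """sentences: lemmatized sentence strings; labels: a DSM code or 'null'.
    (The original work exempted A2a and A2b from undersampling; this sketch
    applies one ratio to all classes for brevity.)"""
    random.seed(seed)
    positives = [(s, y) for s, y in zip(sentences, labels) if y != "null"]
    negatives = [(s, y) for s, y in zip(sentences, labels) if y == "null"]
    # Undersample: keep roughly `ratio` negative sentences per positive sentence.
    negatives = random.sample(negatives, min(len(negatives), ratio * len(positives)))
    train = positives + negatives
    texts = [s for s, _ in train]
    y = [lab for _, lab in train]

    # Keep only lemmas appearing more than `min_count` times in the training data.
    counts = Counter(tok for text in texts for tok in text.lower().split())
    vocab = sorted(w for w, c in counts.items() if c > min_count)
    vectorizer = CountVectorizer(vocabulary=vocab)
    features = vectorizer.fit_transform(texts)

    # Leaf-size constraint as a stand-in for Weka's confidence-based pruning.
    tree = DecisionTreeClassifier(min_samples_leaf=2, random_state=seed)
    tree.fit(features, y)
    return vectorizer, tree
```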
This machine learning approach will require significant work to improve performance. We believe this cannot be attained with simple changes in the input, such as word embeddings, or by changing algorithms; it will require more sophisticated features and a much larger dataset. We, therefore, first created a rule-based parser, which may provide better results overall as well as insights related to lexicons and features useful for future combinations with machine learning in a classifier ensemble.
Decision tree evaluation for sentence classification.
Rule | Count of positive cases | % positive cases (of all sentences) | Precision | Recall | F-score | Specificity |
A1a | 120 | 0.021 | 0.70 | 0.52 | 0.59 | 0.99 |
A1b | 91 | 0.016 | 0.50 | 0.42 | 0.45 | 0.99 |
A1c | 35 | 0.006 | 0.16 | 0.17 | 0.17 | 0.99 |
A1d | 160 | 0.029 | 0.54 | 0.14 | 0.22 | 1.00 |
A2a | 388 | 0.069 | 0.71 | 0.39 | 0.50 | 1.00 |
A2b | 321 | 0.057 | 0.69 | 0.37 | 0.48 | 0.99 |
A2c | 120 | 0.021 | 0.54 | 0.47 | 0.51 | 0.99 |
A2d | 62 | 0.011 | 0.34 | 0.19 | 0.25 | 1.00 |
A3a | 64 | 0.011 | 0.20 | 0.09 | 0.13 | 1.00 |
A3b | 123 | 0.022 | 0.81 | 0.47 | 0.59 | 1.00 |
A3c | 66 | 0.012 | 0.70 | 0.32 | 0.44 | 1.00 |
A3d | 27 | 0.005 | 0.27 | 0.30 | 0.28 | 1.00 |
Microaverage | 1577 | 0.024 | 0.60 | 0.35 | 0.45 | 0.99 |
Lexicon overview.
Pattern use of lexicons | Lexicons | Number of terms | Example lexicon | Example terms |
All rules | 11 | 345 | Body_parts | arm, eye, hair, teeth, toe, tongue, finger, fingers, nose |
Group A1 | 7 | 105 | A1_interact | interact, interactions, communicate, relationship |
Group A2 | 3 | 72 | A2_positive | severe, significant, pervasive, marked |
Group A3 | 2 | 72 | A3_object | door, toys, vacuum, blocks, book, television, lights |
A1a | 4 | 42 | A1a_nonVerbalBehavior | eye contact, eye-to-eye gaze, gestures, nonverbal cues |
A1b | 2 | 11 | A1b_consistent | good, consistent, appropriately, satisfactory |
A1c | 5 | 61 | A1c_affect | excitement, feelings, satisfaction, concerns |
A1d | 12 | 159 | A1d_engage | recognize, recognizes, reacts, respond, regard, attend |
A2a | 4 | 117 | A2a_gained | gained, used, had, obtained, said, spoke |
A2b | 8 | 240 | A2b_recepLang | direction, instructions, questions, conversations |
A2c | 7 | 145 | A2c_idiosyncratic | breathy, echolalia, jargon, neologism, reduced |
A2d | 7 | 83 | A2d_actions | actions, routines, play, signs, gestures, movements |
A3a | 7 | 106 | A3a_obsess | obsessed, obsessive, perseverates, preoccupation |
A3b | 7 | 119 | A3b_nonFunctionalPlay | stack, stacks, lines, lined, nonfunctional, arrange |
A3c | 3 | 67 | A3c_abnormal | grind, grinds, rocks, twirls, spin, tap, clap, flap |
A3d | 3 | 43 | A3d_sensitive | defensiveness, sensitivity, hypersensitivities |
Total | 92 | 1787 | N/Aa | N/A |
aN/A: not applicable.
We developed a rule-based parser to extract all A1, A2, and A3 rules as listed in the DSM. Each DSM group contains 4 specific rules that are representative of the criterion (A1a-d, A2a-d, and A3a-d). Our tool comprises 2 components: (1) annotation of relevant ASD trigger words in free text and (2) recognition of diagnostic criteria based on a pattern of trigger words.
The parser was developed through collaboration between NLP experts and clinicians. Annotations from EHRs were translated into patterns by NLP experts. Then, extensions, abstractions, and generalizations were discussed, and the patterns were augmented and expanded. This iterative process continued until further changes to the patterns provided little or no improvement while increasing error rates. Several development rounds were completed, with the EHRs taken from the 2002 to 2010 surveillance years and 53% of records having a positive ASD case status. The ASD label itself is of little consequence because both development and testing are done at the sentence level (not the record level). For testing, new EHRs were used that had not been seen in previous development rounds. EHRs were selected randomly from those available to us.
Identifying ASD diagnostic criteria in text requires recognizing important trigger words (ie, words describing typical behaviors of ASD). We capture these words, as well as synonyms and singular or plural variants, in lexicons. Approximately 90 lexicons with about 20 terms each were manually created.
The lexicons are optimized for patterns for each DSM criterion, so the same terms may appear in multiple lexicons. However, a few lexicons are shared by all patterns and used for different DSM criteria. Currently, there are 11 lexicons shared by all patterns (eg, the lexicons containing body parts). In addition, the patterns for the A1, A2, and A3 criteria share, respectively, 7, 3, and 2 lexicons. For example, DSM rules A1a, A1b, A1c, and A1d all require identification of “impairment in social interaction,” and the relevant terms for this trigger are combined in the lexicon “A1_interact.” In addition to these shared patterns, each DSM pattern requires additional individual lexicons optimized for that pattern.
We used the General Architecture for Text Engineering (GATE) [
Tokenizer: recognize individual tokens in the text.
Sentence splitter: set boundaries on sentences so that parts of speech can be deduced for each word in a sentence.
POS tagger [
Visualization of 2 (of 7 existing) patterns for Diagnostic and Statistical Manual of Mental Disorders criterion A2c.
After processing all free text, terms are annotated using gazetteer lookup. Using the term’s POS tags and lexical labels from the 92 lexicons, the annotated text is processed to identify matching patterns. POS tags help narrow down candidate terms, for example, “object” fits in our lexicons when it is a noun but not when it is a verb. Using 43 annotated records from the ADDSP containing 4732 sentences, we developed 12 sets of patterns (total 104 patterns) for the 12 DSM criteria (see
All patterns are specified in a JAPE file, in which the patterns to be annotated in the text are described using GATE-specific formatting. GATE “reads” the JAPE file and applies it to the text: when a pattern is recognized in the text, the matching text is annotated with the labels specified in the JAPE file.
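For readers unfamiliar with GATE, the following minimal Python sketch mimics what a simple JAPE rule does conceptually: look up terms in lexicons (gazetteer lookup), optionally filter candidates by POS tag, and fire a criterion label when the required triggers co-occur in a sentence. The "negative" lexicon and the specific pattern logic are illustrative assumptions; the actual system uses GATE's gazetteer, 92 lexicons, and 104 JAPE patterns.

```python
# Conceptual sketch of lexicon lookup plus pattern matching, mimicking a simple JAPE rule.
from dataclasses import dataclass

LEXICONS = {
    "A1a_nonVerbalBehavior": {"eye contact", "eye-to-eye gaze", "gestures", "nonverbal cues"},
    "negative": {"poor", "lack", "limited", "no"},  # hypothetical trigger lexicon
}

@dataclass
class Token:
    text: str
    pos: str  # part-of-speech tag, eg, "NN", "JJ"

def lookup(tokens, lexicon, allowed_pos=None):
    """True if any unigram or bigram in the sentence matches the lexicon,
    optionally restricted to certain POS tags (eg, nouns only)."""
    words = [t.text.lower() for t in tokens]
    for i, tok in enumerate(tokens):
        if allowed_pos and tok.pos not in allowed_pos:
            continue
        unigram = words[i]
        bigram = " ".join(words[i:i + 2])
        if unigram in LEXICONS[lexicon] or bigram in LEXICONS[lexicon]:
            return True
    return False

def match_a1a(sentence_tokens):
    """Toy A1a pattern: a negative trigger plus a nonverbal-behavior term
    (restricted to nouns) occurring in the same sentence."""
    return (lookup(sentence_tokens, "negative")
            and lookup(sentence_tokens, "A1a_nonVerbalBehavior", allowed_pos={"NN", "NNS"}))

# Example: "He makes poor eye contact with peers."
sentence = [Token("He", "PRP"), Token("makes", "VBZ"), Token("poor", "JJ"),
            Token("eye", "NN"), Token("contact", "NN"), Token("with", "IN"),
            Token("peers", "NNS"), Token(".", ".")]
print(match_a1a(sentence))  # True
```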
Our testbed consists of 50 new EHRs, not used during development, containing 6634 sentences. The EHRs were randomly sampled from the 2000-2008 surveillance years, with 68% of records having positive ASD case status. Because evaluation is done at the sentence level and does not take record-level information into account, the case label itself is of little consequence. These are records that were annotated by the clinical experts and whose text and annotations we stored in electronic format. Of the entire set, 20.45% (1357/6634) of sentences contained annotations, with some sentences containing more than 1 annotation.
A human-created gold standard, such as our testbed, is seldom completely perfect and consistent: entities may have been missed by the human annotators. We noticed such inconsistencies in prior work by us [
Similar to evaluation standards by others [
For our evaluation, we calculated 4 metrics. Precision provides an indication of how correct the annotations made by the parser are; in other words, if the parser annotates sentences with a DSM label, this refers to the percentage of those labels that are correct. Recall (also referred to as sensitivity) provides an indication of how many of the annotations the parser is able to capture; in other words, of all the sentences that received a DSM label from the human annotators, what percentage the parser also labels correctly. We also calculate the F measure, the harmonic mean of precision and recall, and specificity, the proportion of sentences without a given DSM label that the parser correctly leaves unlabeled.
We calculate these metrics at the annotation and at the sentence level. A true positive at the annotation level requires the parser to reproduce an individual gold standard annotation with the correct DSM label, whereas a true positive at the sentence level requires only that a sentence containing a gold standard annotation receive the correct DSM label.
We also apply the most lenient evaluation: whether the parser identifies a sentence as describing any DSM criterion at all, regardless of which specific criterion it represents.
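The following sketch shows how the sentence-level metrics can be computed, assuming each sentence is represented by a set of gold standard DSM labels and a set of parser-assigned labels; this is a simplified illustration, not the evaluation code we used.

```python
# Sentence-level metrics per DSM criterion (simplified illustration).
def sentence_metrics(gold, predicted, label):
    """gold, predicted: lists of label sets, one set per sentence."""
    tp = fp = fn = tn = 0
    for g, p in zip(gold, predicted):
        has_gold, has_pred = label in g, label in p
        if has_gold and has_pred:
            tp += 1          # correct DSM label on an annotated sentence
        elif has_pred:
            fp += 1          # parser labeled a sentence the annotators did not
        elif has_gold:
            fn += 1          # parser missed an annotated sentence
        else:
            tn += 1          # correctly left unlabeled
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return precision, recall, f_measure, specificity
```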
Gold standard overview.
Diagnostic and Statistical Manual of Mental Disorders diagnostic criteria | Gold standard | ||
Rule | Theme | Total in records | Average per record |
A1a | Nonverbal behaviors | 126 | 2.52 |
A1b | Peer relationships | 91 | 1.82 |
A1c | Seeking to share | 37 | 0.74 |
A1d | Emotional reciprocity | 165 | 3.3 |
A2a | Spoken language | 406 | 8.12 |
A2b | Initiate or sustain conversation | 333 | 6.66 |
A2c | Stereotyped or idiosyncratic language | 127 | 2.54 |
A2d | Social imitative play | 66 | 1.32 |
A3a | Restricted patterns of interest | 62 | 1.24 |
A3b | Adherence to routines | 135 | 2.7 |
A3c | Stereotyped motor mannerisms | 68 | 1.36 |
A3d | Preoccupation with parts of objects | 28 | 0.56 |
Total | N/Aa | 1644 | 32.88 |
aN/A: not applicable.
Annotation-level results.
Annotationsa | Total in gold standard (number of annotationsb) | Evaluation | ||
Precision | Recall | F-measure | ||
A1a | 126 | 0.96 | 0.57 | 0.72 |
A1b | 91 | 0.63 | 0.27 | 0.38 |
A1c | 37 | 0.78 | 0.19 | 0.30 |
A1d | 165 | 0.62 | 0.27 | 0.37 |
A2a | 406 | 0.69 | 0.44 | 0.53 |
A2b | 333 | 0.79 | 0.44 | 0.57 |
A2c | 127 | 0.68 | 0.36 | 0.47 |
A2d | 66 | 0.79 | 0.56 | 0.65 |
A3a | 62 | 0.83 | 0.40 | 0.54 |
A3b | 135 | 0.75 | 0.51 | 0.61 |
A3c | 68 | 0.82 | 0.41 | 0.55 |
A3d | 28 | 0.53 | 0.29 | 0.37 |
Microaverage | N/Ac | 0.74 | 0.42 | 0.53 |
aBased on 6634 sentences.
bTotal annotations=1644.
cN/A: not applicable.
Sentence-level results.
Sentencesa | Total in gold standard (number of sentences)b | Evaluation | |||
Precision | Recall | F-measure | Specificity | ||
A1a | 120 | 0.97 | 0.59 | 0.74 | 1.00 |
A1b | 90 | 0.68 | 0.30 | 0.42 | 1.00 |
A1c | 35 | 0.78 | 0.20 | 0.32 | 1.00 |
A1d | 158 | 0.63 | 0.28 | 0.39 | 1.00 |
A2a | 391 | 0.71 | 0.45 | 0.55 | 0.99 |
A2b | 329 | 0.83 | 0.47 | 0.60 | 1.00 |
A2c | 121 | 0.67 | 0.37 | 0.48 | 1.00 |
A2d | 65 | 0.83 | 0.58 | 0.68 | 1.00 |
A3a | 61 | 0.73 | 0.36 | 0.48 | 1.00 |
A3b | 123 | 0.74 | 0.52 | 0.61 | 1.00 |
A3c | 64 | 0.82 | 0.42 | 0.56 | 1.00 |
A3d | 28 | 0.53 | 0.29 | 0.37 | 1.00 |
Microaverage | 1585 | 0.76 | 0.43 | 0.55 | 1.00 |
Any Rule | 1357 | 0.82 | 0.46 | 0.59 | 0.97 |
aBased on 6634 sentences.
bSentences with annotations =1357.
The annotation-level results are very similar to those for the sentence-level evaluation (
We conducted a final, more lenient approach by evaluating whether the system can identify the relevant sentences for DSM criteria, regardless of which criterion they represent. In this case, we found that our parser achieves 82% precision and 46% recall in identifying the 1357 sentences that were annotated for autism-like behavior (
Overall, the rule-based approach resulted in a better performance than the machine learning approach. Some criteria, such as A1c and A3d, showed very large differences in precision between the two approaches, while others, like A1d and A3a, showed a large difference in recall. This may be due to the sparsity of the examples available for training. Furthermore, we chose to evaluate decision trees because of their interpretability. More sophisticated algorithms will be tested when larger datasets become available, and these may provide better results.
As is expected with the development of a rule-based extraction system, the results for precision are higher than those for recall. False negatives represent the annotations that were missed by our algorithm and lowered recall. We noticed 3 types of false negatives due to annotations not seen in the training data. First, there are new examples of behaviors; for example, “being a picky eater” is an A1a criterion, but it did not appear in our training data. To solve this, we will write additional JAPE rules. Second, different lexical variants (ie, synonyms or related terms) are sometimes used to describe behaviors. To solve this, we will look into expanding our lexicons, for example, by using word embeddings (a sketch of this idea follows below). Third, complex language or longer interstitial text is sometimes used that is not captured by our patterns, for example, “Eye contact, while it was also present, was limited at times.” The solution will require further augmenting the patterns. In addition, some false negatives are the result of our patterns being local: the criteria annotated in the EHR are sometimes determined by the clinicians using information in the wider EHR context, which our JAPE patterns cannot cover because it does not appear in the same sentence or neighboring text.
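As an illustration of the lexicon expansion idea mentioned above, the sketch below suggests candidate terms by looking up the nearest embedding neighbors of existing lexicon entries, using gensim and a hypothetical pretrained word2vec-format model; any such candidates would still need clinical review before being added to a lexicon.

```python
# Sketch of lexicon expansion with word embeddings (future work). The embedding
# file path is hypothetical; any word2vec-format model trained on clinical or
# general text could be substituted.
from gensim.models import KeyedVectors

def suggest_lexicon_terms(terms, model_path="embeddings.bin", topn=10, threshold=0.6):
    """Return candidate terms: nearest embedding neighbors of existing lexicon
    entries with similarity above a threshold."""
    vectors = KeyedVectors.load_word2vec_format(model_path, binary=True)
    candidates = set()
    for term in terms:
        if term not in vectors:
            continue  # multiword or out-of-vocabulary entries are skipped here
        for neighbor, similarity in vectors.most_similar(term, topn=topn):
            if similarity >= threshold and neighbor not in terms:
                candidates.add(neighbor)
    return sorted(candidates)
```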
False positives usually occur for either of two reasons: accidental matches to nonsensical sentence fragments or plausible phrases with insufficient context. For example, Pattern 1 in
We see large differences between the various DSM-IV criteria. For example, criterion A1c, which refers to “a lack of spontaneous seeking to share enjoyment, interests, or achievements with other people,” is expressed completely differently in the test set and was not captured by our rules. This is not surprising because A1c is the criterion for which we have the least amount of training data (averaging 0.5 annotations per record). Additionally, the criterion covers a wide range of behaviors that can be expressed in many different ways. The variations and lack of data make describing patterns very difficult. Criterion A1a, which is related to nonverbal communication, obtained relatively high precision and recall. This is because clinicians tend to describe nonverbal communication in unambiguous, self-contained phrases, such as “eye contact” and “nonverbal communication,” for which we can create precise patterns. For a similar reason, we also obtained good results for criteria A3a, A3c, and A3d, which are about abnormal interests, stereotypical actions, and tactile sensitivities, respectively. Some of the criteria (eg, A3c) have precision near 90%. Criteria A2a and A2b, which describe expressive and receptive language issues, are most prevalent among the rules. Combined, they account for >40% of the gold standard annotations. Taking advantage of the large sample of gold standard annotations, we were able to develop many patterns and obtain relatively stable performance from development to testing.
In our case, we believe lower recall does not preclude useful applications of the parser. While some particular expression of a DSM criterion may be missed, it will be rare that all expressions of that particular DSM criterion in one record would be missed and, so, the detected DSM criterion would be taken into account for case assignment. Moreover, because of the high precision of the parser, when an expression of a DSM criterion is flagged, it is unlikely to be a false positive. As a result, large-scale analyses that focus on patterns of different criteria can be performed.
Given the high precision of our parser, we conducted a case study that shows insights into and the potential of the parser for future work. Our goal is to provide a broad overview of DSM criteria patterns found in existing EHRs over a 10-year span.
For our case study, we analyzed 4480 records available electronically from the ADDSP. These records had not been used during the development of the parser and contain at least a minimum amount of text (40 characters was empirically determined as the cutoff in this set, representing about 10 words or a complete sentence, which is required for a complete annotation). We focus only on the free text fields and the results from applying our parser.
Abstractor training has been consistent over the years, with the goal of entering only the information necessary to meet the project deadlines. Even so, the average length of the free text has increased over the years: the average number of words per record was 1427 before 2006 and increased to 2450 from 2006 until 2010, nearly double.
The records contained on average 5.76 different DSM criteria. We performed our analysis separately for records of children with ASD and of those labeled as non-ASD. All counts are normalized by record length: the number of criteria found is divided by the number of words in the document. This normalization avoids increasing the count of criteria solely due to having longer records, for example, when a child is seen multiple times for evaluation and the resulting EHR is longer, but the diversity of criteria may remain the same.
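A minimal sketch of this normalization, assuming a simple representation in which each record provides the list of extracted criterion labels and its word count:

```python
# Length-normalized criterion counts for one record (minimal sketch).
from collections import Counter

def normalized_counts(extracted_labels, word_count):
    """Criterion frequency per word, so that longer records do not inflate
    counts simply by containing more text."""
    counts = Counter(extracted_labels)
    return {criterion: n / word_count for criterion, n in counts.items()}

# Example: a 2000-word record with three A2a mentions and one A3b mention.
print(normalized_counts(["A2a", "A2a", "A2a", "A3b"], 2000))
# {'A2a': 0.0015, 'A3b': 0.0005}
```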
We first focus on the A1 DSM criteria. These criteria describe impairments in social interaction. For children with ASD (
We performed the same analysis for children without ASD (
We repeat the same analysis for A2 DSM criteria (
Finally, we show the analysis for A3 DSM criteria (
Descriptive information on 4480 records available electronically from the Arizona Developmental Disabilities Surveillance Program.
Electronic health record word count for autism spectrum disorder (ASD) and non-ASD cases.
Average A1 criteria per record. ASD: autism spectrum disorder; EHR: electronic health record.
Average A2 criteria per record. ASD: autism spectrum disorder; EHR: electronic health record.
Average A3 criteria per record. ASD: autism spectrum disorder; EHR: electronic health record.
The presence of a criterion in a record depends first on its presence in the child, second on whether the evaluator notes that criterion in the child, and third on whether the evaluator notes it in the record. The criteria that we identified with the greatest frequency were A2a (spoken language) and A2b (initiate or sustain conversation). Issues with language acquisition are the most frequently noted first cause of parental concern [
The frequencies of criteria in the ASD case records were not as different from the non-ASD records as may be expected. However, all children whose records were included in data collection had some type of diagnosis or special education qualification; no typically developing children are included in these data [
Some changes across the years of data collection were observed. The first was an increase in the number of words per record. This increase is likely to reflect a true increase in the words rather than any changes in data collection procedures, as increasing numbers of records to review have motivated efforts to improve efficiency and eliminate the collection of superfluous text. An increase in the number of criteria included will necessarily mean that more words are collected. Next was an increase in the frequency of some specific criteria among the ASD cases. Changes through time in the frequency of a specific criterion may reflect more children who exhibit the criterion or that evaluators may have a heightened awareness of the criterion and are, therefore, more likely to note it. Criteria that were increasing in frequency included A1a (nonverbal behaviors), A1d (social or emotional reciprocity), and A3b (adherence to routine), but the increase in A3b was noted only in the most recent year.
The increases in the frequencies of some criteria in this dataset contrast with results from a study in Sweden, which found fewer autism symptoms among children diagnosed in 2014 than among those diagnosed in 2004 [
The trend of increasing frequency of criteria A1a (nonverbal behaviors) and A1d (social or emotional reciprocity) in ASD-labeled records and the decreasing trend in those same criteria in non-ASD-labeled records may represent improvements in evaluators’ awareness of these as symptoms of ASD and the importance of documenting these criteria for children who have the characteristics of ASD cases.
We described the design and development of a rule-based NLP tool that can identify DSM criteria in text. In comparison with a baseline machine learning approach that used decision trees, the rule-based approach performed better. We evaluated our approach at the annotation level (ie, matching to each rule within a sentence) and at the sentence level (ie, matching to the correct sentence). The system performed reasonably well in identifying individual DSM rule matches, discovering approximately half of all individual criteria-specific annotations (44% recall) with few errors (79% precision). As expected with manually developed rules, precision was high, while recall was lower. In future work, we intend to increase both lexicons and patterns using machine learning approaches while retaining human-interpretable rules. This will increase the recall of our system. Furthermore, we intend to add negation as an explicit feature, which we believe will be necessary to maintain high precision.
We demonstrated our parser on almost 5000 records and compared the presence of different DSM criteria across several years. Changes in document length as well as in the presence of different DSM criteria are clear. Our analysis also showed that some DSM criteria are almost equally present in both ASD and non-ASD cases. In the future, we intend to increase the size of our records and combine the information extracted (ie, the DSM criteria matches) with other data from the structured fields in those EHRs as well as combine the information with external databases containing environmental and other types of data.
Our future work will be 2-fold. First, we will investigate the integration of our system into the surveillance workflow. For maximum usefulness, we will aim at extreme precision or extreme recall (while both are desirable, there tends to be a trade-off). With extremely high precision, the extracted diagnostic criteria can be used to make case decisions with high precision. Labeling a case as ASD can be automated for a large set of EHRs; only the set where no ASD label is assigned would require human review (due to low recall). In contrast, with extremely high recall, cases where diagnostic criteria are not extracted can be labeled as non-ASD with high confidence and only the cases where a label of ASD is assigned would need review (due to low precision). Second, because the development time of a rule-based system is substantial and application to a new domain would require starting over, we will investigate leveraging lessons learned from the parser to a machine learning approach that can transfer to different domains in mental health.
ADDSP: Arizona Developmental Disabilities Surveillance Program
ASD: autism spectrum disorder
CDC: Centers for Disease Control and Prevention
DSM-IV-TR: Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition, Text Revision
EHR: electronic health record
GATE: General Architecture for Text Engineering
ICD: International Classification of Diseases
JAPE: Java Annotation Pattern Engine
NLP: natural language processing
POS: part-of-speech
UMLS: Unified Medical Language System
The data presented in this paper were collected by the Centers for Disease Control and Prevention (CDC) Autism and Developmental Disabilities Monitoring Network, supported by CDC Cooperative Agreement Number 5UR3/DD000680. This project was supported by grant number R21HS024988 from the Agency for Healthcare Research and Quality. The content is solely the responsibility of the authors and does not necessarily represent the official views of the Agency for Healthcare Research and Quality.
None declared.