This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Automated disease code classification using free-text medical information is important for public health surveillance. However, traditional natural language processing (NLP) pipelines are limited, so we propose a method combining word embedding with a convolutional neural network (CNN).
Our objective was to compare the performance of traditional pipelines (NLP plus supervised machine learning models) with that of word embedding combined with a CNN in conducting a classification task identifying chapter-level ICD-10-CM diagnosis codes in discharge notes.
We used 2 classification methods: (1) extracting from discharge notes some features (terms, n-gram phrases, and SNOMED CT categories) that we used to train a set of supervised machine learning models (support vector machine, random forests, and gradient boosting machine), and (2) building a feature matrix, by a pretrained word embedding model, that we used to train a CNN. We used these methods to identify the chapter-level ICD-10-CM diagnosis codes.
In 5-fold cross-validation tests, our method had a higher testing accuracy (mean AUC 0.9696; mean F-measure 0.9086) than traditional NLP-based approaches (mean AUC range 0.8183-0.9571; mean F-measure range 0.5050-0.8739). A real-world simulation that split the training sample and the testing sample by date verified this result (mean AUC 0.9645; mean F-measure 0.9003 using the proposed method). Further analysis showed that the convolutional layers of the CNN effectively identified a large number of keywords and automatically extracted enough concepts to predict the diagnosis codes.
Word embedding combined with a CNN showed outstanding performance compared with traditional methods while requiring very little data preprocessing. This suggests that future studies will not be limited by incomplete dictionaries. Automated approaches will extract large amounts of unstructured information from free-text medical writing, and we believe that the health care field is about to enter the age of big data.
Public health surveillance systems are important for identifying unusual events of public health importance and provide information for public health action [
Automated surveillance methods are increasingly being researched because of the increasing volume and accessibility of electronic medical data, and a range of studies have proven the feasibility of extracting structured information from clinical narratives [
Another important challenge for automated surveillance algorithms is emerging disease. For example, influenza H1N1 broke out in 2009 and could not have been recorded in any medical records before 2008. Traditional automatic methods based on term vectors cannot use new terms [
Word embedding is a feature learning technique where vocabularies are mapped to vectors of real numbers [
This project aimed to compare traditional machine learning pipelines (NLP plus supervised machine learning models) versus word embedding combined with a CNN in order to identify chapter-level ICD-10-CM diagnosis codes from free-text discharge notes.
The Tri-Service General Hospital, Taipei, Taiwan, supplied deidentified free-text discharge notes from June 1, 2015 to January 31, 2017. The institutional ethical committee and medical records office of the Tri-Service General Hospital granted research ethics approval to collect data without individual consent for sites where data are directly collected. The Tri-Service General Hospital, located in the Neihu District of Taipei and operated under the National Defense Medical Center, provides medical services for service members, their family dependents, and civilians. The Ministry of Health and Welfare in Taiwan has rated it as a first-rate teaching hospital at the level of a medical center. The hospital has about 1700 beds and 6000 inpatients per month, most of whom are civilians. We collected a total of 103,390 discharge notes and corrected misspellings using the Hunspell version 2.3 package [
We used 2 testing procedures to assess the performance of the model. First, we conducted a 5-fold cross-validation test. Second, we created training and testing sets by splitting the sample by date (July 1, 2016), because this is more realistic. A classifier can only be trained using retrospective data in the real world, and it will be used to classify future data; the second testing process replicates this. All calculations were conducted on a Fujitsu RX2540M1 48-core CPU, 768 GB RAM server (Fujitsu Ltd, Tokyo, Japan), and the all-flash array was AccelStor NeoSapphire NS3505 (AccelStor, Inc, Taipei City, Taiwan) with a 5 TB serial advanced technology attachment-interface solid-state drive and connectivity of 56 GB/second FDR InfiniBand Quad Small Form-factor Pluggable (Fiberon Technologies, Inc, Westborough, MA, USA).
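The two evaluation protocols described above can be sketched as follows. This is a minimal illustration with hypothetical toy records, not the study's actual data handling; the cutoff date (July 1, 2016) is taken from the study design.

```python
from datetime import date
from sklearn.model_selection import KFold

# Toy records: (note_text, discharge_date) -- hypothetical stand-ins for
# the 103,390 discharge notes described in the paper.
records = [("note %d" % i, date(2015, 6, 1 + i % 28)) for i in range(10)]
records += [("note %d" % i, date(2016, 8, 1 + i % 28)) for i in range(10, 20)]

# Test 1: 5-fold cross-validation over the full sample.
folds = list(KFold(n_splits=5, shuffle=True, random_state=42).split(records))

# Test 2: "real-world" split -- train on notes before the cutoff date and
# test on notes from the cutoff date onward, replicating the fact that a
# deployed classifier is trained on retrospective data only.
cutoff = date(2016, 7, 1)
train = [r for r in records if r[1] < cutoff]
test = [r for r in records if r[1] >= cutoff]
```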
Prevalence of different chapter-level ICD-10-CM diagnosis codes at each stage of the study.
ICD-10-CM chapter | Definition | Before June 30, 2016 (n=64,023) | After July 1, 2016 (n=39,367) | Full study period (n=103,390)
A00-B99 | Certain infectious and parasitic diseases | 7731 (12.1%) | 5455 (13.9%) | 13,186 (12.8%) |
C00-D49 | Neoplasms | 20,585 (32.2%) | 13,993 (35.5%) | 34,578 (33.5%) |
D50-D89 | Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism | 4516 (7.1%) | 3132 (8.0%) | 7648 (7.4%) |
E00-E89 | Endocrine, nutritional, and metabolic diseases | 13,223 (20.7%) | 8765 (22.3%) | 21,988 (21.3%) |
F01-F99 | Mental, behavioral, and neurodevelopmental disorders | 4612 (7.2%) | 2942 (7.5%) | 7554 (7.3%) |
G00-G99 | Diseases of the nervous system | 3703 (5.8%) | 2602 (6.6%) | 6305 (6.1%) |
H00-H59 | Diseases of the eye and adnexa | 2337 (3.7%) | 1374 (3.5%) | 3711 (3.6%) |
H60-H95 | Diseases of the ear and mastoid process | 802 (1.3%) | 470 (1.2%) | 1272 (1.2%) |
I00-I99 | Diseases of the circulatory system | 17,650 (27.6%) | 11,465 (29.1%) | 29,115 (28.2%) |
J00-J99 | Diseases of the respiratory system | 7743 (12.1%) | 5584 (14.2%) | 13,327 (13.0%) |
K00-K95 | Diseases of the digestive system | 12,849 (20.1%) | 8444 (21.4%) | 21,293 (20.6%) |
L00-L99 | Diseases of the skin and subcutaneous tissue | 2568 (4.0%) | 1711 (4.3%) | 4279 (4.1%) |
M00-M99 | Diseases of the musculoskeletal system and connective tissue | 9170 (14.3%) | 5152 (13.1%) | 14,322 (13.9%) |
N00-N99 | Diseases of the genitourinary system | 9929 (15.5%) | 7325 (18.6%) | 17,254 (16.8%) |
O00-O9A | Pregnancy, childbirth, and the puerperium | 2509 (3.9%) | 1271 (3.2%) | 3780 (3.7%) |
P00-P96 | Certain conditions originating in the perinatal period | 793 (1.2%) | 493 (1.3%) | 1286 (1.2%) |
Q00-Q99 | Congenital malformations, deformations, and chromosomal abnormalities | 927 (1.4%) | 513 (1.3%) | 1440 (1.4%) |
R00-R99 | Symptoms, signs, and abnormal clinical and laboratory findings, not elsewhere classified | 5271 (8.2%) | 3824 (9.7%) | 9095 (8.9%) |
S00-T88 | Injury, poisoning, and certain other consequences of external causes | 6272 (9.8%) | 4564 (11.6%) | 10,836 (10.6%) |
V00-Y99 | External causes of morbidity | 791 (1.2%) | 68 (0.2%) | 859 (0.8%) |
Z00-Z99 | Factors influencing health status and contact with health services | 15,488 (24.2%) | 10,093 (25.6%) | 25,581 (24.8%) |
Traditional classification techniques often combine an NLP pipeline with a classifier to perform free-text medical writing classification tasks. We extracted the detailed features from the discharge notes with the NLP pipeline and then used these features to train the classifiers.
In this study, we used a 2-part NLP pipeline to extract the discharge note features. First, word-based features were directly extracted from the free-text description and n-gram phrases (n range 2-5) were generated by the RWeka version 0.4-30 package [
Support vector machines (SVMs) are common classifiers in the machine learning field. They map all samples into a space where the classes can be separated by a hyperplane with a clear gap, and kernel tricks extend this separation to nonlinear boundaries. SVMs have been shown to outperform naive Bayes classifiers, C4.5 decision trees, and adaptive boosting in free-text medical writing classification [
Random forests (RFs) construct multiple decision trees and aggregate the predictions of the individual trees. The RF was the best-performing classification model in a previous text classification study [
Gradient boosting machines (GBMs) are also ensembles of weak decision trees, in which gradient boosting fits each new tree to correct the errors of the current ensemble [
Using the “no free lunch” theorem [
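Comparing several supervised models in this way can be sketched as follows on a synthetic feature matrix. The hyperparameters are library defaults, not the study's tuned settings, and the data are randomly generated for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic feature matrix standing in for the extracted note features.
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# Candidate classifiers, echoing the paper's SVM / RF / GBM comparison.
models = {
    "SVM (linear)": SVC(kernel="linear"),
    "RF": RandomForestClassifier(random_state=42),
    "GBM": GradientBoostingClassifier(random_state=42),
}

# Mean cross-validated AUC per model; no single model wins on every task,
# which is why several candidates are evaluated.
aucs = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
        for name, m in models.items()}
```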
Traditional NLP pipelines are limited by their preexisting dictionary and require a complex processing flow. Herein, we propose a method combining a word embedding model and a CNN. Word embedding technology is useful for integrating synonyms, and we used a pretrained GloVe model (English Wikipedia plus Gigaword) to vectorize the words. We selected a 50-dimensional model with 400,000 words because of computing time constraints. However, we believe that this was sufficient because there were only 19,064 words in our 103,390 discharge notes. We transformed each discharge note into an n×50 matrix for subsequent classification (where n is the number of words in the discharge note) and trained a CNN using these labeled matrices.
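The note-to-matrix conversion can be sketched as follows, using a random toy embedding table in place of the pretrained GloVe vectors:

```python
import numpy as np

# Toy 50-dimensional embedding table standing in for the pretrained GloVe
# model (English Wikipedia + Gigaword) used in the study.
rng = np.random.default_rng(0)
vocab = ["fever", "cough", "pneumonia", "admitted"]
embedding = {w: rng.standard_normal(50) for w in vocab}

def note_to_matrix(note, table, dim=50):
    """Map a discharge note to an n x dim matrix, one row per known word."""
    rows = [table[w] for w in note.lower().split() if w in table]
    return np.vstack(rows) if rows else np.zeros((0, dim))

matrix = note_to_matrix("admitted with fever and cough", embedding)
# matrix.shape == (3, 50): "admitted", "fever", "cough" are in the table
```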
Although CNNs with various structures have been developed, we focused on a 1-layer CNN with a filter region size of 1-5 (corresponding to 1-5 n-gram phrases) to increase comparability with traditional machine learning technologies. In fact, these simple models have recently achieved remarkably strong performance [
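A minimal sketch of such a 1-layer CNN, written here in PyTorch (an assumed framework for illustration; the study used MXNet), with parallel filter region sizes 1-5 corresponding to 1- to 5-gram phrases. The filter count is an illustrative choice; the 21 outputs correspond to the chapter-level code groups listed in the prevalence table.

```python
import torch
import torch.nn as nn

class NoteCNN(nn.Module):
    """One-layer text CNN over an n x 50 word-embedding matrix, with
    parallel filter region sizes 1-5 followed by max-over-time pooling
    and one fully connected output layer."""

    def __init__(self, embed_dim=50, n_filters=32, n_classes=21):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, n_filters, kernel_size=k) for k in range(1, 6)
        )
        self.fc = nn.Linear(5 * n_filters, n_classes)

    def forward(self, x):           # x: (batch, words, embed_dim)
        x = x.transpose(1, 2)       # -> (batch, embed_dim, words)
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.fc(torch.cat(pooled, dim=1))  # chapter logits

logits = NoteCNN()(torch.randn(4, 30, 50))  # 4 notes, 30 words each
```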
We used the MXNet version 0.8.0 package [
Model architecture with 5 convolution channels and 1 full connection (FC) layer. ReLU: rectified linear unit.
We applied oversampling so that positive cases were sufficiently represented and the classifier was not skewed by the overwhelming number of negative cases [
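Random oversampling of this kind can be sketched as follows, with hypothetical class counts:

```python
import random

# Toy labeled sample: 2 positives among 10 notes -- hypothetical counts.
sample = [("note %d" % i, 1 if i < 2 else 0) for i in range(10)]

# Duplicate positive cases (sampling with replacement) until the classes
# balance, so positives are not swamped by the majority class in training.
random.seed(42)
positives = [s for s in sample if s[1] == 1]
negatives = [s for s in sample if s[1] == 0]
balanced = negatives + [random.choice(positives)
                        for _ in range(len(negatives))]
```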
Global (and lowest 5) means of training and testing AUCsa in the 5-fold cross-validation test.
Pipeline | Training AUCb | Training F-measure | Testing AUCb | Testing F-measure
NLPc + SVMd (linear) | 0.9947 (0.9836) | 0.9546 (0.8560) | 0.9571 (0.8891) | 0.8606 (0.6387) | |
NLP + SVM (polynomial) | 0.8627 (0.6736) | 0.5630 (0.2498) | 0.8183 (0.6332) | 0.5050 (0.2023) | |
NLP + SVM (radial basis) | 0.9565 (0.9146) | 0.7984 (0.6613) | 0.9363 (0.8582) | 0.7569 (0.5352) | |
NLP + SVM (sigmoid) | 0.9518 (0.9021) | 0.7852 (0.6368) | 0.9325 (0.8526) | 0.7498 (0.5313) | |
NLP + RFe | 0.9999 (0.9995)f | 0.9864 (0.9628) | 0.9570 (0.8800) | 0.8739 (0.6475) | |
NLP + GBMg | 0.9996 (0.9990) | 0.9868 (0.9660) | 0.9544 (0.8722) | 0.8691 (0.6458) | |
GloVeh + CNNi | 0.9964 (0.9890) | 0.9837 (0.9588) | 0.9696 (0.9135)f | 0.9086 (0.7651) |
aAUC: area under the curve, calculated using the receiver operating characteristic curve.
bThe results are presented as the mean AUC or F-measure (mean of the lowest 5 AUCs or F-measures). Detailed AUCs and F-measures for each chapter-level ICD-10-CM code are provided in the multimedia appendix.
cNLP: natural language processing for feature extraction (terms, n-gram phrases, and SNOMED CT categories).
dSVM: support vector machine.
eRF: random forest.
fThe best method for a specific index.
gGBM: gradient boosting machine.
hGloVe: a 50-dimensional word embedding model, pretrained using English Wikipedia and Gigaword.
iCNN: convolutional neural network.
We visualized 3 of the convolving filters selected for the real-world test, as shown below.
Global (and lowest 5) means of the training and testing AUCsa in the real-world test.
Pipeline | Training AUCb | Training F-measure | Testing AUCb | Testing F-measure
NLPc + SVMd (linear) | 0.9921 (0.9768) | 0.9365 (0.7983) | 0.9477 (0.8549) | 0.8458 (0.5984) | |
NLP + SVM (polynomial) | 0.9103 (0.7975) | 0.6316 (0.4045) | 0.8716 (0.7400) | 0.5761 (0.2802) | |
NLP + SVM (radial basis) | 0.9577 (0.9208) | 0.7954 (0.6484) | 0.9349 (0.8476) | 0.7588 (0.5258) | |
NLP + SVM (sigmoid) | 0.9522 (0.9058) | 0.7840 (0.6261) | 0.9259 (0.8196) | 0.7515 (0.5209) | |
NLP + RFe | 0.9996 (0.9985)f | 0.9869 (0.9664)f | 0.9483 (0.8484) | 0.8582 (0.5901) | |
NLP + GBMg | 0.9995 (0.9985) | 0.9821 (0.9562) | 0.9462 (0.8416) | 0.8568 (0.5948) | |
GloVeh + CNNi | 0.9956 (0.9868) | 0.9803 (0.9523) | 0.9645 (0.8952)f | 0.9003 (0.7204)f |
aAUC: area under the curve, calculated using the receiver operating characteristic curve.
bThe results are presented as the mean AUC or F-measure (mean of the lowest 5 AUCs or F-measures). Detailed AUCs and F-measures for each chapter-level ICD-10-CM code are provided in the multimedia appendix.
cNLP: natural language processing for feature extraction (terms, n-gram phrases, and SNOMED CT categories).
dSVM: support vector machine.
eRF: random forest.
fThe best method for a specific index.
gGBM: gradient boosting machine.
hGloVe: a 50-dimensional word embedding model, pretrained using English Wikipedia and Gigaword.
iCNN: convolutional neural network.
Visualization of selected convolving filters.
Information gains of the features extracted by the convolving filters in each classification task. AUC: area under the curve; IG: information gain.
The proposed method, which combines word embedding with a CNN, had a higher testing accuracy than all traditional NLP-based approaches in both testing scenarios. Further analysis showed that the convolving filters had fuzzy matching abilities, which greatly reduced the data dimension for the final classification task. Moreover, the training AUCs of the traditional methods were very close to 1, meaning there was no room for further improvement on the training data, and the larger gap between their training and testing performance implies overfitting.
Arbitrary free-text medical narratives include many word combinations, and there is no good way of integrating similar terms using the current NLP pipelines. Previous studies have highlighted this issue and suggested that improvements are possible by dealing more effectively with the idiosyncrasies of the clinical sublanguage [
Our proposed method not only increased accuracy compared with traditional methods but also avoided troublesome data preprocessing. This is possible because word embedding can learn semantics from external resources: vocabularies are mapped to vectors of real numbers, and the vectors for similar concepts lie close together. In our work, a discharge note is converted into an n×50 matrix, where n is the number of words, and the CNN classifies this matrix using our designed convolving filters. Because the vectors for similar concepts lie close together, the convolutional layers effectively identified a large number of keywords in a convolving filter (data shown in
All the classifiers used in this study performed poorly on V00-Y99 (external causes of morbidity) coding tasks, which may be attributed to sparse testing data (0.2%). A previous study found that classifier performance was better on common cancers than on rare cancers [
All traditional term-based classifiers face the problem that emerging diseases cannot possibly be correctly classified. For example, influenza H1N1 could not possibly have been recorded in clinical narratives from 2000 to 2007, so term-based classifiers could not have been aware of the H1N1 pandemic of 2009 [
Previous studies described the classification methods used by human experts, and several rule-based approaches have demonstrated superior performance [
Outbreaks of deliberate and natural infectious disease can lead to massive casualties unless public health actions are promptly instituted [
Several potential limitations of this study should be acknowledged. First, we used only a 50-dimensional GloVe model to process our data, to reduce computing time. However, even a 50-dimensional model has better performance than traditional methods. Thus, we believe that this will not affect our result and that our proposal is a better solution for conducting free-text medical narrative coding tasks. Second, this study included discharge notes from only a single hospital, so we cannot confirm how well it would generalize to other data sources. Although this study only provided a feasibility assessment for extrapolation over time, we believe that it still demonstrated the superiority of our method. Third, this study conducted the classification task only in discharge notes. Discharge notes describe only the presence of the disease, but do not include negative statements. Our CNN architecture includes 3- to 5-gram phrase identifiers, but further studies are still needed to apply this approach to patient progress notes to prove its ability.
Our study showed that combining CNNs with word embedding is a viable analysis pipeline for disease classification from free-text medical narratives. Moreover, it showed outstanding performance compared with traditional NLP employing machine learning classifiers and may avoid troublesome data preprocessing. More complex CNNs could be used to further improve predictive performance, and future studies will not be limited by incomplete dictionaries. Because our data were collected from a single center, further studies can implement this algorithm in other hospitals. We hope our experiment will lead to a range of studies toward developing more efficient automated classification approaches and that a large amount of unstructured information will be extracted from free-text medical writing. We have developed a Web app to demonstrate our work [
ICD-10-CM diagnosis code tutorial.
Detailed training and testing AUCs and F-measures for the 5-fold cross-validation test.
Detailed training and testing AUCs and F-measures for the real-world test.
AUC: area under the curve
CNN: convolutional neural network
GBM: gradient boosting machine
ICD-10-CM: International Classification of Diseases, Tenth Revision, Clinical Modification
NLP: natural language processing
SVM: support vector machine
RF: random forest
This study was supported by the Smart Healthcare Project from the Medical Affairs Bureau Ministry of National Defense, Taiwan. Funding was supported by the Ministry of Science and Technology (105-2314-B-016-053) and Medical Affairs Bureau Ministry of National Defense (MAB-104-013). The authors appreciate the Medical Records Office at Tri-Service General Hospital for providing the unlinked data source.
None declared.