This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
In question answering (QA) system development, question classification is crucial for identifying information needs and improving the accuracy of returned answers. Although the questions are domain-specific, they are asked by non-professionals, making the question classification task more challenging.
This study aimed to classify health care–related questions posted by the general public (Chinese speakers) on the Internet.
A topic-based classification schema for health-related questions was built by manually annotating randomly selected questions. The Kappa statistic was used to measure the interrater reliability of multiple annotation results. Using the above corpus, we developed a machine-learning method to automatically classify these questions into one of the following six classes:
The consumer health question schema was developed with four hierarchical levels of specificity, comprising 48 quaternary categories and 35 annotation rules. The 2000 sample questions were coded with 2000 major codes and 607 minor codes. Using natural language processing techniques, we expressed the Chinese questions as a set of lexical, grammatical, and semantic features. The most effective features were then selected to improve question classification performance. From the 6-category classification results, we achieved an average precision of 91.41%, recall of 89.62%, and
In this study, we developed an automatic method to classify questions related to Chinese health care posted by the general public. It enables Artificial Intelligence (AI) agents to understand Internet users’ information needs on health care.
The Internet is increasingly becoming a main resource for consumers to acquire health information. As of December 2015, there were 152 million Internet health users in China, indicating that 22.1% of Chinese Internet users have looked online for health information and services [
In general, a QA system consists of 3 modules: question analysis, information retrieval, and answer extraction. In the first module, question classification plays an important role in identifying the information needs of consumers, reducing the space of candidate answers, and further improving the accuracy of returned answers [
Several studies have addressed automatic question classification in the field of health and medicine in order to identify the general topics of clinical questions [
As one of the most common chronic diseases, hypertension has become the main risk factor of cardiovascular diseases. It was estimated that China had 270 million patients with hypertension in 2012, and the incidence rate was approximately 3% per year [
We collected questions posted by health consumers from 1st January to 10th August, 2014, with the tags “hypertension (高血压)” or “blood pressure (血压)” under the Q&A (有问必答) section on a Chinese health website with more than 35 million registered users [
In this study, “question” is defined as a request that a health consumer has posted on the website on a certain subject to elicit answers from physicians, which was identified based on meaning, not form. We focused on questions related to hypertension (
The website provides a template for users to generate questions, which includes three fields: (1) describe your health status (
A topic-based classification schema was developed based on TGCQ [
In round 2, four other annotators (two specialized in medicine and two specialized in informatics) independently annotated 200 questions randomly selected from the sample, using the classification schema. The authors compared the consistency of the five coding results (including the one in the first round) and categorized the 200 questions into three groups: (1) all annotators agreed (n=73), (2) only one disagreed (n=63), and (3) more than one disagreed (n=64). Then we focused on the last group. We addressed ambiguous elements by further specifying annotation rules and improving the descriptions of the question patterns.
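The three-way grouping used in round 2 can be sketched as follows. This is a minimal illustration, assuming that "disagreed" means departing from the code chosen by the largest group of annotators; the function name `agreement_group` and the sample codes are hypothetical and do not come from the study.

```python
from collections import Counter


def agreement_group(codes):
    """Classify one question by how many annotators departed from the most common code."""
    # Size of the largest group of annotators that chose the same code.
    largest = Counter(codes).most_common(1)[0][1]
    dissenters = len(codes) - largest
    if dissenters == 0:
        return "all_agree"
    if dissenters == 1:
        return "one_disagrees"
    return "multiple_disagree"


# Hypothetical codes from five annotators for three questions.
examples = {
    "q1": ["1.1.4.1"] * 5,
    "q2": ["1.1.4.1"] * 4 + ["2.1.2.1"],
    "q3": ["1.1.4.1", "1.1.4.1", "2.1.2.1", "2.1.2.1", "3.1"],
}
groups = {q: agreement_group(c) for q, c in examples.items()}
```

Questions that fall into the last group are the ones that would be reviewed to refine the annotation rules.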
In round 3, the revised classification was distributed to the five annotators who independently annotated another 300 questions randomly selected from the remaining sample of 1800 messages. This step was done to measure the interrater reliability of the classification schema as well as to further modify it.
In the last round, each of three annotators independently annotated 500 of the remaining 1500 messages. Thus, each of the 2000 sample questions was annotated by at least two annotators. The authors compared the coding results, and disparities were discussed until agreement was reached. The codes agreed upon during this step were regarded as the final schema. The number of questions in each category was calculated, and categories to which no questions were assigned were deleted (such as physical characteristics of drugs, pharmacodynamics, and mechanism of drug action).
A four-round annotating process to construct and modify the classification schema and annotated corpus.
The 2000 questions annotated by the above steps were used to train and test the classifiers for the primary level topics, including
We explored various features for machine-learning, including lexical, grammatical, semantic, and statistical information (
The word segmentation was obtained from Rwordseg [
We manually developed a dictionary of 42 Chinese interrogative words based on baike.baidu [
The controlled vocabulary of Chinese Medical Subject Headings (CMeSH) [
These were a combination of lexical and statistical features. We used three ways to extract the keywords from a question: (1) the first
These include question length, maximum, minimum, and average word length, maximum, minimum and average TF, maximum, minimum, and average IDF, and maximum, minimum and average TF-IDF. The corpus used to calculate the IDF of each word contained nearly 100 thousand hypertension-related messages that we had collected in our former research [
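As a rough sketch of how such statistical features might be computed for one segmented question, the example below derives the 13 values listed above. Several details are assumptions rather than the authors' definitions: question length is counted in words, TF is the relative frequency of a word within the question, IDF is log(N/df) with unseen words given a document frequency of 1, and the tokens and document frequencies are made up.

```python
import math
from collections import Counter


def summarize(name, values):
    """Return the max, min, and average of a collection of numbers as named features."""
    values = list(values)
    return {
        f"max_{name}": max(values),
        f"min_{name}": min(values),
        f"avg_{name}": sum(values) / len(values),
    }


def statistical_features(tokens, doc_freq, n_docs):
    """Question length plus max/min/avg of word length, TF, IDF, and TF-IDF."""
    counts = Counter(tokens)
    # Assumed TF: relative frequency of the word within the question.
    tf = {w: c / len(tokens) for w, c in counts.items()}
    # Assumed IDF: log(N / df); words unseen in the corpus get df = 1.
    idf = {w: math.log(n_docs / max(doc_freq.get(w, 0), 1)) for w in counts}
    tfidf = {w: tf[w] * idf[w] for w in counts}

    features = {"question_length": len(tokens)}
    features.update(summarize("word_length", [len(w) for w in tokens]))
    features.update(summarize("tf", tf.values()))
    features.update(summarize("idf", idf.values()))
    features.update(summarize("tfidf", tfidf.values()))
    return features


# Hypothetical segmented question and document frequencies in a 100-message corpus.
tokens = ["高血压", "吃", "什么", "药", "吃"]
doc_freq = {"高血压": 90, "吃": 50, "什么": 40, "药": 30}
features = statistical_features(tokens, doc_freq, n_docs=100)
```

In practice the document frequencies would come from the large reference corpus rather than from a toy dictionary.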
Mathematical equations.
Because the feature space was very large and some features could degrade the performance of the classifiers, we adopted the Φ-score, which measures the discrimination between two sets of real numbers, to select the most discriminative features [
Calculate Φ(t) of every feature
Calculate the avg Φ of each type of feature and, further, set it as the threshold of the corresponding feature type. The avg Φ was chosen as the feature selection threshold because the distribution of Φ differs greatly between different types of features, while this method can help to keep all the useful features in different types [
For each type of machine-learning feature, select features with Φ ≥ avg Φ of this type.
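The three steps above can be sketched as follows, assuming the Φ-score of every feature has already been computed. The scores, feature names, and the helper `select_features` are hypothetical; the sketch only illustrates the per-type average threshold, not the Φ-score itself.

```python
from collections import defaultdict


def select_features(scores):
    """Keep features whose score is at least the average score of their feature type.

    `scores` maps (feature_type, feature_name) to a precomputed score.
    Returns the selected keys and the per-type thresholds.
    """
    by_type = defaultdict(list)
    for (feature_type, _), phi in scores.items():
        by_type[feature_type].append(phi)
    # The threshold for each type is the average score within that type.
    thresholds = {t: sum(v) / len(v) for t, v in by_type.items()}
    selected = [key for key, phi in scores.items() if phi >= thresholds[key[0]]]
    return selected, thresholds


# Hypothetical precomputed scores for two feature types.
scores = {
    ("bag_of_words", "血压"): 0.0040,
    ("bag_of_words", "的"): 0.0002,
    ("bag_of_words", "药"): 0.0018,
    ("statistical", "question_length"): 0.0100,
    ("statistical", "max_tf"): 0.0040,
}
selected, thresholds = select_features(scores)
```

Because each type gets its own threshold, a type whose scores are uniformly small still contributes its relatively best features instead of being dropped wholesale by a single global cutoff.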
Since a question can be assigned to multiple topics, the task in this paper was a multi-label classification problem, which is usually transformed into one or more single-label classification or regression problems [
Because consumer questions were unevenly distributed across topics, we under-sampled the majority class so that each classifier was trained and tested on the same number of “positive” and “negative” questions. We reported classification performance using 10-fold cross-validation. The sample data for each binary classifier were divided equally into 10 folds: one fold was used as testing data and the remaining 9 folds as training data. The process was repeated 10 times (once per fold), and the average value and standard deviation were reported. All cases in the sample data were used for both training and validation, and each case was used for validation exactly once, which is a distinct advantage of this method [
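The sampling and splitting procedure can be sketched as below. This is an illustration under stated assumptions, not the authors' code: sampling is uniformly random with a fixed seed, folds are not stratified, and the items are placeholders. The class sizes (600 positive, 1400 negative) are those reported for the Diagnosis topic.

```python
import random


def undersample(positives, negatives, rng):
    """Randomly shrink the larger class so both classes have the same size."""
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n), rng.sample(negatives, n)


def k_fold_indices(n_items, k, rng):
    """Shuffle the indices 0..n_items-1 and deal them into k near-equal folds."""
    indices = list(range(n_items))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


rng = random.Random(0)  # fixed seed, chosen only to make the sketch repeatable
positives = [("pos", i) for i in range(600)]   # placeholder positive questions
negatives = [("neg", i) for i in range(1400)]  # placeholder negative questions

pos, neg = undersample(positives, negatives, rng)
data = pos + neg
folds = k_fold_indices(len(data), 10, rng)

# Each fold serves as the test set once; the other nine folds form the training set.
splits = []
for i, test_idx in enumerate(folds):
    train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
    splits.append((train_idx, test_idx))
```

A classifier would be trained and scored once per split, and the mean and standard deviation of the ten scores reported.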
The interrater reliability of the classification schema was evaluated with the kappa statistic, which corrects for agreement that occurs by chance: kappa = (Po − Pe)/(1 − Pe), where Po is the observed agreement and Pe is the agreement expected by chance [
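As a worked example of the kappa formula, the sketch below computes Cohen's kappa for two raters. This is a simplification chosen for illustration: the study involved five annotators, and the exact multi-rater variant used is not stated here. The labels are invented.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of the raters' label rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    pe = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (po - pe) / (1 - pe)


# Invented labels for 10 questions (D = Diagnosis, T = Treatment).
rater_a = ["D", "D", "D", "D", "T", "T", "T", "T", "T", "T"]
rater_b = ["D", "D", "D", "T", "T", "T", "T", "T", "T", "D"]
kappa = cohen_kappa(rater_a, rater_b)
```

Here the raters agree on 8 of 10 items (Po = 0.80), each uses D four times and T six times (Pe = 0.16 + 0.36 = 0.52), so kappa = 0.28/0.48 ≈ 0.58.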
The performance of automatic classification methods was evaluated by precision (p), recall (r) and
The final classification schema had four hierarchical levels of specificity, consisting of 48 quaternary categories (see
Table 1. An example of consumer health questions in Chinese with their pattern and annotated tags.
General Topics | Items | Contents |
Diagnosis | Question | 昨天不知道怎么事,突然感到心慌慌的,四肢发凉,全身冒冷汗,之后老婆扶我到小区医院那里去看,量了一下血压,血压比以往要高,之后医生叫我放松,休息了20分钟左右,又感觉没有什么事了。。 请问突然感觉到心慌,四肢发凉,血压升高,这是啥病啊? (Yesterday, my heart suddenly palpitated, my limbs became cold, and my whole body began to sweat. Then my wife accompanied me to the community hospital and checked my blood pressure; it was higher than before. The doctor told me to relax, and I feel much better after resting for about 20 minutes… suddenly felt flustered, limbs became cold, and blood pressure rose. What disease is it?) |
Pattern | 临床发现X1、X2、X3、……,这是啥病?(Clinical finding X1, X2, X3,… What disease is it?) | |
Tag | 1.1.4.1 “诊断(Diagnosis)→病因/临床发现的解释(Interpretation of clinical finding)→不具体的发现或多种发现(Uncertain/multiple findings)” | |
Treatment | Question | 65岁老人血压高经常不稳定,吃哪种降压药最好?(A 65-year-old man with unsteady high blood pressure… What’s the best blood pressure drug to take?) |
Pattern | 病情y,吃/用/服用哪种药最好?(Condition y: What’s the best drug to take or use?) | |
Tag | 2.1.2.1 “治疗(Treatment)→药物治疗(Drug therapy)→效力/适应症/药物选择(efficacy/indications/drug choosing)→治疗(Treatment)” |
This study found that although health consumers would ask numerous health questions about themselves or their families, the general topics of the questions were limited to a small number and each category of the topics had its particular question patterns. The 2000 Chinese consumer health questions were annotated with 2000 major codes and 607 minor codes. The distribution of the sample questions on the primary level category is shown in
Distribution of the 2000 consumer health questions in Chinese on the primary level of topics.
No. | General Topics | Positive | Negative | Total |
1 | Diagnosis | 600 | 1400 | 2000 |
2 | Treatment | 1167 | 833 | 2000 |
3 | Condition management | 136 | 1864 | 2000 |
4 | Epidemiology | 233 | 1767 | 2000 |
5 | Healthy lifestyle | 278 | 1722 | 2000 |
6 | Health provider choice | 45 | 1955 | 2000 |
7 | Other | 5 | ||
Total | 2000 | 2000 | 2000 |
The kappa statistic for the five annotators was 0.63 in the quaternary level of the classification, indicating “substantial” reliability, better than in several similar studies, such as assigning topics to general clinical questions (kappa=0.53) [
The Φ-score of each feature was calculated for each binary classifier. The distribution of Φ differed greatly between feature types. Classifiers using only the features with Φ ≥ avg Φ performed no worse than classifiers using all the features of the corresponding type, and some performed better. Taking the topic of
Number and Φ distribution of each type of feature for the Chinese consumer health question classification on the topic of
Levels | Feature types^a | Avg Φ | σ (Φ) | nAF | n (Φ ≥ avg Φ) |
Lexical | Bag-of-words | 0.0016 | 0.0067 | 4967 | 1301 |
Lexical | Part-of-speech | 0.0014 | 0.0060 | 6154 | 1490 |
Grammatical | Interrogative words | 0.0039 | 0.0204 | 97 | 13 |
Grammatical | Noun head chunks | 0.0011 | 0.0010 | 48 | 14 |
Grammatical | Verb head chunks | 0.0008 | 0.0007 | 19 | 6 |
Grammatical | Noun rear chunks | 0.0011 | 0.0019 | 73 | 14 |
Grammatical | Verb rear chunks | 0.0010 | 0.0013 | 22 | 3 |
Grammatical | Interrogative + noun head chunks | 0.0011 | 0.0013 | 328 | 86 |
Grammatical | Interrogative + verb head chunks | 0.0011 | 0.0010 | 312 | 85 |
Grammatical | Noun rear chunks + interrogative | 0.0010 | 0.0013 | 315 | 67 |
Grammatical | Verb rear chunks + interrogative | 0.0012 | 0.0024 | 318 | 74 |
Semantic | CMeSH concepts | 0.0016 | 0.0033 | 43 | 9 |
Semantic | CMeSH semantic types | 0.0124 | 0.0101 | 3 | 1 |
Lexical & Statistical | Keywords (TF) | 0.0008 | 0.0009 | 1510 | 282 |
Lexical & Statistical | Keywords (IDF) | 0.0007 | 0.0008 | 1137 | 192 |
Lexical & Statistical | Keywords (TF-IDF) | 0.0008 | 0.0008 | 1208 | 190 |
Statistical | Statistical features | 0.0073 | 0.0060 | 13 | 5 |
Total with duplicates replaced | | | | 15349 | 3656 |
^a For each type of feature, σ (Φ) is the standard deviation of Φ, nAF is the total number of features, and n (Φ ≥ avg Φ) is the number of features with Φ ≥ avg Φ.
Therefore, the features with Φ ≥ avg Φ in every feature type were selected as input features for machine learning, in order to keep the useful features of each type and to improve the performance of the classifiers. Thus, each classifier received a different feature set, and the number of features in each set is shown in the third column in
Feature reduction and the performance of each classifier.
General topics | N (all features) | N (selected features) | Feature reduction proportion | Avg | σ ( |
Diagnosis | 15349 | 5311 | 0.6540 | 0.9855 | 0.0164 |
Treatment | 15349 | 4216 | 0.7253 | 0.7602 | 0.0482 |
Condition management | 15349 | 3150 | 0.7948 | 0.9963 | 0.0117 |
Epidemiology | 15349 | 4194 | 0.7268 | 0.7177 | 0.0798 |
Healthy lifestyle | 15349 | 3656 | 0.7618 | 0.9913 | 0.0166 |
Health provider choice | 15349 | 2282 | 0.8513 | 0.9635 | 0.0594 |
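The feature reduction proportions in the table above are consistent with the formula 1 − N(selected features)/N(all features). The short check below recomputes them from the tabulated counts; the formula is inferred from the numbers rather than stated explicitly in the text.

```python
N_ALL = 15349  # total number of features before selection

# Number of selected features per topic, copied from the table.
n_selected = {
    "Diagnosis": 5311,
    "Treatment": 4216,
    "Condition management": 3150,
    "Epidemiology": 4194,
    "Healthy lifestyle": 3656,
    "Health provider choice": 2282,
}

# Proportion of the feature space removed for each topic, rounded to 4 decimals.
reduction = {topic: round(1 - n / N_ALL, 4) for topic, n in n_selected.items()}
```

For example, the Diagnosis classifier keeps 5311 of 15349 features, so about 65.40% of the feature space is removed.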
Performance of each feature type for Chinese consumer health question classification on the topic of Lifestyle.
Performance improvement of each classifier by selecting features above the threshold.
The results were obtained from SVMs in the kernlab package because it performed the best among all the classification algorithms available in the R project. The research findings showed that the feature spaces shrank by 65.40% to 85.13% when features under the threshold were dropped (
The performance on the classification of most topics of consumer health questions in Chinese was high. The evaluation metrics (average precision, recall, and
A classification schema of consumer health questions was built in this study and 2000 hypertension-related consumer health questions in Chinese were manually annotated based on this schema. The research findings demonstrated that health consumers were mainly concerned about what was wrong with their health (or the health of someone they cared about), why it was wrong, how to treat it (including choosing which provider to treat), whether the drugs they used had adverse effects or would do harm in some conditions (eg, pregnancy, breast feeding), whether they could recover from the illness, and what they could do to improve their health in everyday life (mainly diet suggestions).
We explored a machine-learning method to automatically classify these Chinese consumer health questions into one of the six primary level topics, using a novel scoring metric to select the most effective features from the many feature types we had explored. The results showed that selecting the features with Φ ≥ avg Φ in each feature type as input for machine learning both increased efficiency and improved the performance of the classifiers. From the 6-category classification results, we achieved an average precision of 91.41%, recall of 89.62%, and
Compared with the 1396 clinical questions annotated by Ely et al [
Compared with other related studies on automatic question classification in the domain of health and medicine, we explored an abundant number of feature types for automatic classifiers. For example, Cao et al [
The feature selection method in our work differed considerably from those used in related work, and our results suggest that it was both effective and simple to apply. Cao et al [
The performance of the classifiers trained in our study was satisfactory. The average
One limitation of this work is that the sample questions used to build the classification schema and to train the automatic classifiers came from only one Chinese health website and were restricted to questions related to hypertension or blood pressure. Therefore, the applicability of the classification schema and the validity of the automatic classifiers for questions from other websites and about other diseases remain to be tested. Another limitation is that some types of features, such as keywords and bag-of-words, might be correlated, and our feature selection algorithm did not take this correlation into account. Finally, we reached only moderate performance with the automatic classifiers for the general topics of Treatment and Epidemiology; the reasons for this remain to be explored in future work.
A distinguishing feature of this research is its focus on Chinese consumer health questions. We built a classification schema of consumer health questions consisting of 48 quaternary categories and 35 annotation rules, and we annotated 2000 questions in Chinese that were randomly selected from nearly 100,000 messages about hypertension. Using these annotated questions as the corpus, we explored a machine-learning method to automatically classify Chinese consumer health questions into six general topics, in order to support the analysis of users’ information needs and answer extraction. We explored a large number of feature types and adopted a novel method that selects the features with Φ ≥ avg Φ within each type. The results suggest that our classification approach was more efficient and effective than those in similar studies.
Supplementary tables.
Taxonomies of Generic Clinical Questions
Layered Model of Context for Consumer Health Information Searching
This research was supported by the Chinese Academy of Medical Sciences (Grant No. 2016ZX330011) and the National Social Science Foundation of China (Grant No. 14BTQ032). The authors would like to thank Dr Chao Xu for his helpful suggestions on data processing.
None declared.