This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Both healthy and sick people increasingly use electronic media to obtain medical information and advice. For example, Internet users may send requests to Web-based expert forums, or so-called “ask the doctor” services.
Our objective was to automatically classify lay requests to an Internet medical expert forum using a combination of different text-mining strategies.
We first manually classified a sample of 988 requests directed to an involuntary childlessness forum on the German website “Rund ums Baby” (“Everything about Babies”) into one or more of 38 categories belonging to two dimensions (“subject matter” and “expectations”). After creating start and synonym lists, we calculated the average Cramer’s V statistic for the association of each word with each category. We also used principal component analysis and singular value decomposition as further text-mining strategies. With these measures we trained regression models and determined, on the basis of the best regression models, the probability of any request belonging to each of the 38 different categories, with a cutoff of 50%. Recall and precision of a test sample were calculated as a measure of quality for the automatic classification.
According to the manual classification of 988 documents, 102 (10%) documents fell into the category “in vitro fertilization (IVF),” 81 (8%) into the category “ovulation,” 79 (8%) into “cycle,” and 57 (6%) into “semen analysis.” These were the four most frequent categories in the subject matter dimension (consisting of 32 categories). The expectation dimension comprised six categories; we classified 533 documents (54%) as “general information” and 351 (36%) as a wish for “treatment recommendations.” The generation of indicator variables based on the chi-square analysis and Cramer’s V proved to be the best approach for automatic classification in about half of the categories. In combination with the two other approaches, 100% precision and 100% recall were realized in 18 (47%) out of the 38 categories in the test sample. For 35 (92%) categories, precision and recall were better than 80%. For some categories, the input variables (ie, “words”) also included variables from other categories, most often with a negative sign. For example, absence of words predictive for “menstruation” was a strong indicator for the category “pregnancy test.”
Our approach suggests a way of automatically classifying and analyzing unstructured information in Internet expert forums. The technique can perform a preliminary categorization of new requests and help Internet medical experts to better handle the mass of information and to give professional feedback.
Both healthy and sick people increasingly use electronic media to obtain medical information and advice [
In the past, emails, e-consultations, and requests for medical advice via the Internet have been manually analyzed using quantitative or qualitative methods [
Text mining [
An automatic classification of lay requests to medical expert Internet forums is a challenge because these requests can be very long and unstructured as a result of mixing, for example, personal experiences with laboratory data. Very often, people simply require psychological help or are looking for emotional reassurance. Such heterogeneous samples of requests appear in the section “Wish for a Child” on the German
Although involuntary childlessness is not the focus of this paper, some introductory notes on this condition may be helpful. Infertility leading to involuntary childlessness is defined as the inability of a couple to achieve conception or bring a pregnancy to term after a year or more of regular, unprotected sexual intercourse. Infertile couples may pass through different stages of reactions and feelings, which include shock, surprise, anger, helplessness, and loss of control. Feelings of failure, embarrassment, shame, and stigmatization may lead to social isolation and to a breakdown in communication between the couple, including depressive reactions, anxiety, emotional instability, diminished self-confidence, sexual problems, and conflicts [
The vast majority of cases of male infertility are due to a low sperm count, often associated with poor motility and a high rate of abnormal sperm. However, in a large number of patients (25% to 30%), it is not possible to determine the cause of the problem. The main causes of female infertility are ovarian dysfunctions and disorders of the fallopian tubes and uterus. Frequently, two or even all three causes can be found in one patient. Before 1980, infertility due to low sperm quality was treated by performing insemination with the patient’s own sperm or donor sperm. This was followed by in vitro fertilization (IVF) in the early 1980s and intracytoplasmic sperm injection (ICSI) in the early 1990s. ICSI only requires one living sperm cell [
Like many other conditions, involuntary childlessness is often not caused by just one factor, nor can it always be cured with a single treatment regimen. Patients and doctors alike are often confronted with the fact that they cannot find a reason for childlessness and that a treatment for a particular case is not helpful for a person or couple with a similar problem [
Requests addressed to medical expert forums such as “Wish for a Child” can be classified according to (1) the subject matter or (2) the sender’s expectation (eg, to receive a summary of the current treatment options [second opinion], to get general information about a certain disease or biological process, or to ask for advice about where to seek adequate medical help). While the first aspect is of great importance to medical experts so that they can understand the contents of requests, the latter is of interest to public health experts to allow the analysis of information needs within the population.
We carried out an initial trial to automatically classify these requests using standard text-mining software such as that provided by SAS [
To make full use of text mining with complex data, different strategies and a combination of these strategies may refine automatic classification. The aim of this paper is to present a method for an automatic classification of requests to a medical expert forum and to evaluate its performance quality. A special focus of this method should be its flexibility to allow a precise and content-related input of expert knowledge.
The analysis is based on a sample of requests collected from the section “Wish for a Child” on the German
Visitors to the website ask questions directly to a group of medical experts via a Web-based interface. The expert team consists, at the moment, of eight persons who are experts in gynecology, urology, andrology, and/or embryology. Some of them work in outpatient departments, some in reproductive clinics, and some in university hospitals. So, the expert forum is well equipped to give medical advice in difficult situations, to provide help to make the correct decision, to offer a second opinion, or, in some instances, even to meet psychological needs not covered by doctors. The experts work on an honorary (unpaid) basis.
To date, more than 12,000 requests have been sent to the expert forum and have been published on the site. From these requests, we selected a random sample of 988 and classified them manually to provide a sound basis for training and evaluation.
Similar to Shuyler and Knight [
From the very beginning of the classification process, it became apparent that many requests belong to one subject matter category but fit into more than one category of the second dimension (“expectations”). For example, a visitor asked the experts to comment on the results of a semen analysis and, at the same time, wanted some advice about whether he or she should change doctors. We decided to provide as many categories per request as appropriate. In the first dimension (“subject matter”), this request could be categorized as “semen analysis,” and, in the second dimension (“expectations”), as “discussion of results” as well as “treatment options.”
Two of the authors (HWM, WH) independently classified the first set of 100 requests manually. Because of a high rate of differing results, we defined the categories more precisely, added and removed some categories, and agreed upon the use of multiple categories. We then classified another 100 requests. This time, strong classification discrepancies, such as each author classifying the text into a different category, occurred in only 12 cases. Some minor discrepancies also occurred, such as agreement in all categories except one additional category that was suggested by one author but not the other. This resulted in a degree of agreement of 0.69 according to the kappa statistic for overlapping categories [
Terms and parents
Terms (a) | Parents |
month | month |
months | month |
monthly | month |
all months (eg, January, February) | month |
all abbreviations (eg, Jan., Feb.) | month |
uterus | uterus |
uterus milleu | uterus |
utterus | uterus |
womb | uterus |
in utero | uterus |
uterine | uterus |
adrenal gland | adrenal gland |
temperature | temperature |
temperture | temperature |
temp. | temperature |
body temperature | temperature |
thermometry | temperature |
all temperature degrees (eg, 37.3°C) | temperature |
ultrasound | ultrasound |
ultra | ultrasound |
ultrasonic | ultrasound |
u-sound | ultrasound |
sound | ultrasound |
scan | ultrasound |
(a) Examples of single words, multi-word terms, synonyms, abbreviations, misspellings, etc, translated from the original German data.
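The mapping of terms to parents shown in the table above can be illustrated with a minimal Python sketch. It is not the SAS Text Miner procedure used in the study: the dictionary `PARENTS` is a hypothetical excerpt modelled on the table, and the regular expression for temperature readings and the function name `to_parents` are assumptions made for illustration only.

```python
import re

# Hypothetical excerpt of a synonym list, modelled on the "Terms and parents" table.
PARENTS = {
    "month": "month", "months": "month", "monthly": "month",
    "uterus": "uterus", "utterus": "uterus", "womb": "uterus",
    "uterine": "uterus", "in utero": "uterus",
    "temperature": "temperature", "temperture": "temperature",
    "temp.": "temperature", "body temperature": "temperature",
    "ultrasound": "ultrasound", "u-sound": "ultrasound", "scan": "ultrasound",
}

# Assumed pattern for temperature readings such as "37.3°C".
TEMPERATURE = re.compile(r"\d{2}(?:[.,]\d)?\s*°\s*c")


def to_parents(text):
    """Return the set of parent terms found in one request."""
    lowered = TEMPERATURE.sub(" temperature ", text.lower())
    found = set()
    # Try longer (multi-word) terms first so that "body temperature"
    # is consumed before the single word "temperature".
    for term in sorted(PARENTS, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.add(PARENTS[term])
            lowered = re.sub(pattern, " ", lowered)
    return found


example = to_parents("My temp. was 37.3°C and the scan of the womb was fine")
print(sorted(example))
```

Each request is reduced to the set of parents it mentions, so that misspellings, abbreviations, and synonyms all count toward the same column of the document-by-term dataset.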
For automatic classification, we created a dataset that contained the text from each request as a separate observation. The text was then parsed into separate words or noun groups. “Parsing” entails several techniques: (1) separation of the text into terms (eg, “uterus”) or multi-word terms (eg, “uterus milleu”), (2) normalization of different formats for dates (eg, 26/02/2008; Feb. 26, 2008) and data (eg, various degrees of temperature), (3) recognition of synonyms, and (4) stemming of verbs, nouns, or (in German) adjectives to their root form (eg, “transfer,” “transferred,” “transferring”). Programs, such as SAS Text Miner, perform this automatically and provide a complete list of all words, noun groups, and so on appearing in the text. The two authors who categorized the requests first by hand formed a detailed starting list [
To reduce the final dataset consisting of 988 rows and 4109 columns, we used three techniques (as different text-mining procedures): (1) indicator variables on the basis of Cramer’s V, (2) principal component analysis (PCA), and (3) singular value decomposition (SVD). The first strategy was developed by the authors. The second strategy used the indicator variables from the first strategy as input for PCA. The third strategy made use of a standard procedure for SVD from statistical software, SAS Text Miner (SAS, Cary, NC, USA).
For each of the 4109 “parents,” we calculated the average Cramer’s V statistic for its association with each category. We then generated indicator variables that, for each category, sum the Cramer’s V coefficients of all significant words. Cramer’s V is a chi-square-based measure of association between nominal variables, with “1” indicating a complete positive association and “0” indicating no association at all. The coefficients were normalized according to the length of the texts (ie, the number of words). The selection criterion for including a parent term’s Cramer’s V was the error probability of the corresponding chi-square test. Its significance level was alternatively set at 1%, 2%, 5%, 10%, 20%, 30%, and 40%, leading to seven indicator variables per category.
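The calculation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors’ SAS code: it assumes that each word-category association is summarized in a 2×2 table, that the signed form of Cramer’s V (the phi coefficient) is used, since the tables of predictive words report negative values, and that normalization means dividing by the number of terms in the request. The function names are hypothetical.

```python
import math


def cramers_v_2x2(a, b, c, d):
    """Signed Cramer's V (phi coefficient) for a 2x2 table.

    a: requests in the category that contain the word
    b: requests in the category that do not contain it
    c: other requests that contain the word
    d: other requests that do not contain it
    """
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0


def chi_square_p(v, n):
    """P value of the chi-square test with 1 degree of freedom (chi2 = n * V**2)."""
    return math.erfc(math.sqrt(n * v * v / 2.0))


def indicator(doc_terms, v_by_term, p_by_term, alpha):
    """Sum V over the request's terms that are significant at level alpha,
    divided by the number of terms in the request (assumed normalization)."""
    if not doc_terms:
        return 0.0
    total = sum(v_by_term[t] for t in doc_terms
                if t in v_by_term and p_by_term[t] < alpha)
    return total / len(doc_terms)


# Counts taken from the table for "general information": the word
# "X-chromosome" occurs in 70 of the 533 requests in the category and in
# 143 of the 455 remaining requests.
v = cramers_v_2x2(70, 533 - 70, 143, 455 - 143)
p = chi_square_p(v, 988)
print(round(v, 2), p < 0.001)
```

With these counts the sketch gives a value of −0.22 and a P value below .001, in line with the corresponding row of the table. Calling `indicator` once per significance level (1% to 40%) would yield the seven indicator variables per category.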
We conducted PCA to reduce the seven indicator variables of varying significance levels per category into five orthogonal dimensions. PCA transforms a number of correlated variables into a few uncorrelated variables [
The 500-dimensional SVD was based on the standard settings of the SAS Text Miner software [
The sample was split into 75% training data and 25% test data. On the basis of our predictor variables (ie, 38×7 Cramer’s V indicators per category, 38×5 principal components per category, and approximately 500 SVDs), we trained logistic regression models to predict the categories. However, if all these predictor variables were used in a single regression model, it would be rather unlikely to detect any significant variables since many of them are highly correlated. Therefore, we chose a more appropriate modelling approach, a stepwise logistic regression. The choice of predictive variables was carried out by an automatic procedure.
To assess the most appropriate model for a classification, we used the following selection methods: (1) Akaike Information Criterion, (2) Schwarz Bayesian Criterion, (3) cross validation misclassification of the training data (leave one out), (4) cross validation error of the training data (leave one out), and (5) variable significance based on an individually adjusted variable significance level for the number of positive cases. For a more detailed description of most of these selection criteria, see Beal [
We trained one logistic regression for each target category, each selection criterion, and each type of input variable (Cramer’s Vs, principal components, SVDs). This resulted in 1369 logistic regression models. The detailed notes and the table in the Multimedia Appendix make this procedure more transparent. For the final regression, we used meta-models, which proved the best for each of the 38 categories.
The complete training process produced an automatic method to evaluate both requests from the training sample and new requests. The corresponding software program is called score code. This score code is a function that generates, for any text (request), the probability of belonging to each of the 38 different categories.
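What such a scoring function does can be sketched as follows. The actual score code was generated by the SAS software and is not reproduced in this paper; the category model, variable names, intercept, and coefficients below are hypothetical values chosen only to illustrate how a fitted logistic regression turns predictor values into a probability.

```python
import math


def score(features, intercept, coefficients):
    """Logistic-regression probability that a request belongs to one category."""
    z = intercept + sum(weight * features.get(name, 0.0)
                        for name, weight in coefficients.items())
    return 1.0 / (1.0 + math.exp(-z))


def score_all(features, models):
    """Apply every category model to one request."""
    return {category: score(features, m["intercept"], m["coefficients"])
            for category, m in models.items()}


# Hypothetical model: an indicator variable from another category enters
# with a negative sign.
MODELS = {
    "pregnancy test": {
        "intercept": -1.0,
        "coefficients": {"pregnancy_test_v05": 40.0, "menstruation_v05": -45.0},
    },
}

with_menstruation_words = {"pregnancy_test_v05": 0.1, "menstruation_v05": 0.1}
without_menstruation_words = {"pregnancy_test_v05": 0.1}
print(score_all(with_menstruation_words, MODELS))
print(score_all(without_menstruation_words, MODELS))
```

In this illustration the same request scores higher for “pregnancy test” when words predictive of “menstruation” are absent, because the second coefficient is negative.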
To assess the accuracy of our approach, we calculated recall and precision as standard statistics in information retrieval and text mining for each of the 38 categories. Precision is the percentage of positive predictions that are correct (ie, the positive predictive value), whereas recall is the percentage of documents of a given category that were retrieved (ie, sensitivity). We calculated recall and precision at the maximum F-measure [
Quality of automatic classification
Dimension and category | Requests, No. | Training/Validation Ratio | Precision on Validation Data, % (a) | Recall on Validation Data, % (a) |
Subject matter | | | | |
abortion | 40 | 30:10 | 91 | 100 |
abrasion | 13 | 9:4 | 100 | 100 |
birth control pill | 23 | 17:6 | 100 | 100 |
charges | 25 | 18:7 | 100 | 100 |
clomifen | 26 | 19:7 | 100 | 100 |
cryo transfer | 13 | 9:4 | 100 | 75 |
cycle | 79 | 59:20 | 80 | 86 |
cysts | 16 | 12:4 | 100 | 100 |
endometriosis | 11 | 8:3 | 75 | 100 |
examination of the oviduct | 19 | 14:5 | 100 | 100 |
habitual abortion | 17 | 12:5 | 100 | 100 |
hormones | 36 | 27:9 | 78 | 78 |
insemination | 29 | 21:8 | 100 | 100 |
intermenstrual bleeding | 14 | 10:4 | 100 | 100 |
IVF | 102 | 76:26 | 81 | 88 |
luteal phase defects | 25 | 18:7 | 88 | 100 |
medical drugs | 47 | 35:12 | 92 | 100 |
menstruation | 35 | 26:9 | 90 | 100 |
multiples | 7 | 5:2 | 100 | 100 |
naturopathy | 33 | 24:9 | 90 | 100 |
nourishment | 9 | 6:3 | 100 | 100 |
oviduct | 16 | 12:4 | 100 | 100 |
ovulation | 81 | 60:21 | 90 | 86 |
PCO | 27 | 20:7 | 100 | 100 |
pregnancy symptoms | 36 | 27:9 | 100 | 100 |
pregnancy test | 30 | 22:8 | 88 | 88 |
pregnancy worries | 49 | 36:13 | 100 | 92 |
semen analysis | 57 | 42:15 | 88 | 93 |
sexual intercourse | 14 | 10:4 | 100 | 100 |
sexual intercourse, problems | 5 | 3:2 | 100 | 100 |
stimulation | 40 | 30:10 | 63 | 100 |
thyroid glands | 13 | 9:4 | 100 | 100 |
Expectations (b) | | | | |
current treatment | 331 | 248:83 | 85 | 72 |
discussion of results | 310 | 232:78 | 86 | 82 |
emotions | 90 | 67:23 | 100 | 61 |
general information | 533 | 399:134 | 92 | 84 |
interpretation of own situation | 242 | 181:61 | 78 | 69 |
treatment options | 351 | 263:88 | 82 | 81 |
(a) To calculate recall and precision, we first chose the best model according to the following selection criteria: Akaike’s Information Criterion, Schwarz Bayesian Criterion, cross validation misclassification of the training data, cross validation error of the training data; we then determined the optimum compromise between recall and precision by the F-measure.
(b) Multiple categories possible.
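The precision and recall columns of the table, and the choice of the cutoff at the maximum F-measure, can be illustrated with a short Python sketch. It assumes the balanced F-measure (harmonic mean of precision and recall), which the text does not state explicitly, and the function names and toy data are hypothetical.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for one category from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Balanced F-measure (assumed here): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_cutoff(y_true, scores, cutoffs):
    """Return (F, cutoff, precision, recall) for the cutoff with the highest F."""
    best = None
    for cutoff in cutoffs:
        y_pred = [s >= cutoff for s in scores]
        p, r = precision_recall(y_true, y_pred)
        f = f_measure(p, r)
        if best is None or f > best[0]:
            best = (f, cutoff, p, r)
    return best


# Toy validation set: four requests belong to the category; three are found
# and no request is assigned wrongly.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
print(precision_recall(y_true, y_pred))
```

The toy data give a precision of 100% and a recall of 75%, the same pattern as the “cryo transfer” row, which has four validation requests.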
We used different selection criteria to find the best regression models for training and validation. In about half of the categories, the generation of indicator variables based on the chi-square analysis proved to be the best approach for automatic classification. Other categories were best predicted by using either PCA or SVD. Statistical details are shown in the Multimedia Appendix. A 100% precision and 100% recall was realized in 18 out of 38 categories on the validation sample (see
For some categories, the input variables also included variables from other categories, most often with a negative sign. For example, the meta-model for “pregnancy test” included a sample of words (as an indicator variable) predictive for the category “menstruation” with a negative sign. This means that absence of words predictive for “menstruation” was a strong indicator for the category “pregnancy test”.
For other categories, consideration of a sender’s expectation also contributed to a better classification of requests. For example, the meta-model for “hormones” included a sum of relevant terms (on the basis of Cramer’s V) as well as significant terms demonstrating the expectation to learn more about one’s own situation or to have laboratory data interpreted (both with negative signs, meaning that the absence of these expectations was, among others, an indicator for “hormones”).
Most predictive words for the category “general information”
Word | Frequency in “General Information,” No. (%) | Frequency in Other Categories, No. (%) | Cramer’s V | P value |
X-chromosome | 70 (13) | 143 (31) | − 0.22 | < .001 |
injection | 17 (3) | 68 (15) | − 0.21 | < .001 |
utrogest | 7 (1) | 45 (10) | − 0.19 | < .001 |
clomifen | 32 (6) | 82 (18) | − 0.19 | < .001 |
prescribe | 10 (2) | 45 (10) | − 0.17 | < .001 |
write | 21 (4) | 59 (13) | − 0.16 | < .001 |
med | 45 (8) | 88 (19) | − 0.16 | < .001 |
drug | 24 (5) | 59 (13) | − 0.15 | < .001 |
pill | 20 (4) | 53 (12) | − 0.15 | < .001 |
value | 48 (9) | 88 (19) | − 0.14 | < .001 |
[places 11-50] | ||||
fertile | 36 (7) | 44 (10) | 0.12 | < .001 |
Most predictive words for the categories “oviduct” (total requests = 16) and “examination of the oviduct” (total requests = 19)
Category and word | Requests in Which This Word Occurs, No. (%) (a) | Cramer’s V | P value |
Oviduct | | | |
tube | 16 (100) | 0.44 | < .001 |
fallopian tube | 16 (100) | 0.44 | < .001 |
removed | 8 (50) | 0.40 | < .001 |
exception | 2 (13) | 0.35 | < .001 |
away | 8 (50) | 0.29 | < .001 |
link | 7 (44) | 0.28 | < .001 |
move | 1 (6) | 0.25 | < .001 |
obliterate | 1 (6) | 0.25 | < .001 |
inappropriate | 1 (6) | 0.25 | < .001 |
sterilisation | 1 (6) | 0.25 | < .001 |
secretion | 1 (6) | 0.25 | < .001 |
scar | 1 (6) | 0.25 | < .001 |
opportunity | 1 (6) | 0.25 | < .001 |
patent | 1 (6) | 0.25 | < .001 |
open | 1 (6) | 0.25 | < .001 |
consider | 1 (6) | 0.25 | < .001 |
extensive | 1 (6) | 0.25 | < .001 |
attachment | 1 (6) | 0.25 | < .001 |
abandon | 1 (6) | 0.25 | < .001 |
cut | 1 (6) | 0.25 | < .001 |
tubal pregnancy | 4 (25) | 0.24 | < .001 |
endoscopy | 7 (44) | 0.21 | < .001 |
level | 8 (50) | 0.21 | < .001 |
Examination of the oviduct | | | |
tube | 15 (79) | 0.37 | < .001 |
fallopian tube | 15 (79) | 0.37 | < .001 |
laparoscopy | 11 (58) | 0.35 | < .001 |
endoscopy | 12 (63) | 0.35 | < .001 |
X-ray | 3 (16) | 0.34 | < .001 |
angiography | 2 (11) | 0.32 | < .001 |
examination | 4 (21) | 0.32 | < .001 |
level | 12 (63) | 0.30 | < .001 |
penetrable | 4 (21) | 0.27 | < .001 |
stomach | 11 (58) | 0.27 | < .001 |
hsg | 2 (11) | 0.26 | < .001 |
structure | 12 (63) | 0.23 | < .001 |
cycle | 1 (5) | 0.23 | < .001 |
adhere | 1 (5) | 0.23 | < .001 |
(a) Some words in this table occur only once or twice (eg, “move”), but not at all in any of the other subject categories. Therefore, they still have predictive power (with a significant Cramer’s V).
To give a more vivid picture of the results of our method, we present some of the visitors’ requests, including our own manual classification and the automatic classification with scoring values for the probability of falling into a particular category (see
In several instances, and also in two of the three examples presented in
Sample visitor requests and their classification
We used a combination of different text-mining strategies to classify requests to a medical expert forum into one or several of 38 categories, representing either the subject matter or the sender’s expectations. This combined strategy yielded rates of precision and recall above 80% in nearly all categories. Even in the worst classified categories, the rates were above 60%.
In order to evaluate these results, the exceptional character of this text-mining process should be considered. The documents to be classified were complex, sometimes rather long, and, most importantly, needed to be classified not only according to content but also to their (sometimes subliminal) expectations. We were able to show that a combination of different text-mining procedures was superior to a single method. Two factors have particularly contributed to this success: (1) an elaborated starting list and (2) a combination of chi-square statistics, PCA, and an SVD method. These factors mirror a recommendation and an experience reported by Balbi and Meglio [
The creation of good starting (or stopping) lists is necessary to obtain valid and useful results, and comprehensive domain knowledge is essential for creating reasonable lists in the first place. The lists described here contain valuable expert knowledge in the field of involuntary childlessness. It seems reasonable to suppose that creating synonym lists in other medical areas could also be a powerful tool for successful text mining in other Internet forums. In their extensive paper on predictive data mining, Bellazzi and Zupan [
Nearly all words predictive of the category “general information” were negatively associated in the chi-square statistic. This seems to be a “perfect” finding and evidence for our content-related approach since any treatment with injections, for example, would belong to the categories “treatment options,” “interpretation,” or “current treatment” rather than the category “general information.” It is precisely the lack of technical terms or results from prior investigations that defined this category.
Experts usually classify requests, such as the ones we analyzed, in a dichotomous way (ie, either they do or do not belong to a respective category). In contrast, automatic classification with a scoring system similar to the one presented in this paper gives a probability for any given request to fall into any of the categories. Especially in the case of complex texts, it seems appropriate to classify them into multiple dimensions and multiple categories. We defined a cutoff of 50% for our scoring system (ie, we defined a request to fall into a category if the respective score was over 50%). At the same time, it is possible to change the cutoff according to the purpose of an analysis. For example, if we are interested in recognizing possible health needs, a 50% cutoff may contribute to a high recall (sensitivity) so that we do not miss relevant requests. If we are interested in high precision (ie, a high positive predictive value of the classification) to sort out the requests and thus to support the experts’ work, a higher cutoff may be reasonable. Our analysis procedure permits an easy assignment of different cutoff values.
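The effect of the cutoff can be made concrete with a small Python sketch. The scores below are hypothetical values for a single request, and the function name `assign_categories` is an assumption; only the rule “score over the cutoff” is taken from the text.

```python
def assign_categories(scores, cutoff=0.5):
    """Return the categories whose score exceeds the cutoff, highest score first."""
    chosen = [(category, s) for category, s in scores.items() if s > cutoff]
    chosen.sort(key=lambda pair: pair[1], reverse=True)
    return [category for category, _ in chosen]


# Hypothetical scores for one request about a semen analysis.
scores = {
    "semen analysis": 0.93,
    "discussion of results": 0.71,
    "treatment options": 0.55,
    "IVF": 0.12,
}

print(assign_categories(scores))              # default cutoff of 50%
print(assign_categories(scores, cutoff=0.8))  # stricter cutoff
```

With the default cutoff the request is assigned to three categories; with the stricter cutoff only “semen analysis” remains, which trades recall for precision.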
There is another reason why this scoring procedure seems adequate or even superior to a dichotomous expert classification. When we analyze the sender’s expectation, we are usually confronted with a mix of different expectations. In many cases, we classified a request into several expectation categories. This seems intuitively better represented by a scoring procedure such as the one presented in this paper. Even the subject matter classifications that we employed in our manual procedure as separate (disjoint) categories may not be as clear-cut as they seem in many requests. It is rather likely that a given request may also fall into more than one subject matter, as demonstrated by the examples in
As SVD is a powerful method for automatic classification, it seemed quite logical that this approach proved best to predict categories in about a quarter of instances. However, there is sometimes reluctance to use SVD-based classification strategies because this process can be controlled only to a limited degree [
In the last decade, the medical profession has witnessed new developments whereby patients have become their own experts, often through the adoption of strategies to empower themselves [
As a further advantage of our approach, we would like to emphasize our comprehensive list of categories. To date, analyses of email requests [
The classification of requests according to the senders’ expectations could be improved. That this process is not optimal may be due to the somewhat vague definition of what exactly constitutes a certain patient’s expectation, and this requires improvement if health experts are to make conclusions about the health needs of a population. However, the overall performance of the subject classification seems to be sufficient, so much so that semiautomated answers to senders’ requests, in this medical area, may be a realistic option for the future.
We consider there to be three relevant applications of our text-mining procedures in the near future:
If our scoring procedure proves successful in further tests, it could be integrated into the
A retrospective application of the scoring procedure to all accumulated requests would allow their mapping into different categories, thus providing an objective historical seismograph and allowing a better understanding of medical and psychological needs that have yet to be met by the current health care system.
The scored database forms the basis for a sophisticated FAQ Internet page that does not address those questions and issues considered by experts to be the most important, as is usually the case, but one which is more oriented to the real needs of visitors and patients.
We are not aware of any studies that have tried to analyze similarly complex texts in Internet forums. Further studies are therefore needed to compare and refine our methodology. Then it should also be possible to decide which aspects of our text-mining strategies—the expert-based synonym list or the combination of different strategies—were most important for the success of our automatic classification.
Our analysis suggests a way of classifying and analyzing complex documents to provide a significant as well as a valid information source for politicians, administrators, researchers, and/or counselors. In the case of involuntary childlessness, it will be possible not only to fulfill patients’ information and health needs with this Internet expert forum, but also to analyze and follow up on these needs over long periods of time. These techniques also seem promising for the analysis of large samples of documents from other Internet health forums, chat rooms, or email requests to doctors.
We are indebted both to Ulrich Schneider, operator of the Rund ums Baby website for his permission to use data from this Internet forum, and Christian Schulz, webmaster, for his technical support. In addition, the authors would like to thank Stephanie Heinemann, who carefully read and discussed the many drafts of this manuscript with the authors and helped them to express their ideas and results as exactly as possible.
HWM is one of the experts who work for the Rund ums Baby forum on an honorary basis. UR is an employee of SAS Institute Germany and works in the Enterprise Intelligence Competence Centre.
Statistical details of the automatic classification (explanation)
Statistical details of the automatic classification (table)
FAQ: frequently asked question
ICSI: intracytoplasmic sperm injection
IVF: in vitro fertilization
PCA: principal component analysis
SVD: singular value decomposition