This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
The COVID-19 pandemic has created a pressing need for integrating information from disparate sources in order to assist decision makers. Social media is important in this respect; however, to make sense of the textual information it provides and to automate the processing of large amounts of data, natural language processing methods are needed. Social media posts are often noisy, yet they may provide valuable insights regarding the severity and prevalence of the disease in the population. Here, we adopt a triage and diagnosis approach to analyzing social media posts using machine learning techniques for the purpose of disease detection and surveillance. Motivated by public health concerns, we thus obtain useful prevalence and incidence statistics that identify disease symptoms and their severities.
This study aims to develop an end-to-end natural language processing pipeline for triage and diagnosis of COVID-19 from patient-authored social media posts in order to provide researchers and public health practitioners with additional information on the symptoms, severity, and prevalence of the disease rather than to provide an actionable decision at the individual level.
The text processing pipeline first extracted COVID-19 symptoms and related concepts, such as severity, duration, negations, and body parts, from patients’ posts using conditional random fields. An unsupervised rule-based algorithm was then applied to establish relations between concepts in the next step of the pipeline. The extracted concepts and relations were subsequently used to construct 2 different vector representations of each post. These vectors were then used separately to build support vector machine (SVM) models that triage patients into 3 categories and diagnose them for COVID-19.
We reported macro- and microaveraged F1 scores in the range of 71%-96% and 61%-87%, respectively, for the triage and diagnosis of COVID-19 when the models were trained on human-labeled data. Our experimental results indicated that similar performance can be achieved when the models are trained using predicted labels from concept extraction and rule-based classifiers, thus yielding end-to-end machine learning. In addition, we highlighted important features uncovered by our diagnostic machine learning models and compared them with the most frequent symptoms revealed in another COVID-19 data set. In particular, we found that the most important features are not always the most frequent ones.
Our preliminary results show that it is possible to automatically triage and diagnose patients for COVID-19 from social media natural language narratives, using a machine learning pipeline in order to provide information on the severity and prevalence of the disease for use within health surveillance systems.
During the ongoing coronavirus pandemic, hospitals have been continuously at risk of being overwhelmed by the number of people developing serious illness. People in the United Kingdom were advised to stay at home if they had coronavirus symptoms and to seek assistance through the National Health Service (NHS) helpline if they needed to [
Herein, we take a diagnostic approach and propose an end-to-end NLP pipeline to automatically triage and diagnose COVID-19 cases from patient-authored medical social media posts. The triage may inform decision makers about the severity of COVID-19, and diagnosis could help in gauging the prevalence of infections in the population. Attempting a clinical diagnosis of influenza, or in our case a diagnosis of COVID-19, purely based on the information provided in a social media post is unlikely to be sufficiently accurate to be actionable at an individual level, since this information will typically be noisy and incomplete. However, it is not necessary to have actionable diagnoses at the individual level in order to identify interesting patterns at the population level, which may be useful within public health surveillance systems. For example, text messages from the microblogging site Twitter were used to identify influenza outbreaks [
One of our key concerns is the production of a high-quality human-labeled data set on which to build our pipeline. Here, we give a brief overview of our pipeline and how we developed our data set. The first step in the pipeline was to develop an annotation application that detects and highlights COVID-19-related symptoms with their severity and duration in a social media post, henceforth collectively termed as
One author manually annotated our data with concepts and relations, allowing us to present posts highlighted with identified concepts and relations to 3 experts, along with several questions, as shown in
The 3 experts are junior doctors working in the United Kingdom who were redeployed to work on COVID-19 wards during the first wave of the pandemic, between March and July 2020. Their roles involved the diagnosis and management of patients with COVID-19, including patients who were particularly unwell and required either noninvasive or invasive ventilation. There were some training sessions organized for doctors working in COVID-19 wards. However, these were only provided toward the end of the first wave, as there was initially little knowledge of the virus and how to treat it. In the hospital, the doctors followed local protocols, which were adjusted as more experience was gained about the virus.
We also asked the doctors to indicate whether the highlighted text was sufficient for reaching their decision, in order to understand its usefulness when incorporated into the annotation interface. The annotations were found to be sufficient in as many as 85% of the posts, on average, as indicated by the doctors’ answers to question 3 in
The posts labeled by the doctors were then used to construct 2 types of predictive machine learning model using
We trained the SVM models in 2 different ways: first with ground-truth annotations and second using predictions from the concept and relation extraction step described before. Predictions obtained from the concept extraction step make use of
We also discussed the feature importance obtained from the constructed COVID-19 diagnostic models and compared it with the most frequent symptoms from Sarker et al [
We showed that it is possible to augment public health surveillance systems with an approach that aims at disease detection, by constructing machine learning models to triage and diagnose COVID-19 from patients' natural language narratives. To the best of our knowledge, no previous work has attempted to triage or diagnose COVID-19 from social media posts.
We also built an end-to-end NLP pipeline by making use of automated concept and relation extraction. Our experiments showed that the models built using predictions from concept and relation extraction produce similar results to those built using ground-truth human concept annotation.
A patient-authored social media post is annotated with symptoms (light green), affected body parts (pale blue), duration (light yellow), and severities (pink). The phrases in square brackets show relations between a symptom and a body part/duration/severity when the distance is greater than 1. This annotated post was presented to 3 doctors to triage and diagnose the author of the post by answering questions 1 and 2, respectively. GP: general practitioner.
Data derived from social media have been successfully used to facilitate the detection of influenza epidemics [
At an individual diagnostic level, Zimmerman et al [
In Judson et al [
In Schwab et al [
In Wagner et al [
Finally, Sarker et al [
We collected social media posts discussing COVID-19 medical conditions from a forum called
The frequency distribution of annotated classes/concepts in the text is shown. We have also shown the percentage of each class after discounting the OTHER labels. The average number of tokens per post was 130.17 (SD 97.83). BPOC: body part, organ, or organ component; SYM: symptoms.
To measure the agreement between the answers (recommendations and ratings) of the 3 doctors to questions 1 and 2 of
We found that for question 1, the AC1 measure showed moderate agreement (in the middle of the moderate range) between A and B (0.55) and substantial agreement between A and C (0.72); see Landis and Koch [
It is important to note that COVID-19 is a novel viral disease, for which the doctors had no experience or training before the first wave of the pandemic, and thus one would expect some difference of opinion. (We bear in mind that in our setting, the doctors can only see the posts and thus cannot interact with the patients as they would in a normal scenario.) Moreover, there are probable differences in risk tolerance between the doctors, which would lead to potentially different decisions and diagnoses.
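AC1 is a chance-corrected agreement coefficient that, unlike Cohen κ, remains stable when category prevalences are highly skewed, which is relevant for imbalanced triage classes. The following minimal sketch, assuming two raters whose answers are given as plain Python lists, shows how Gwet's AC1 can be computed; the function and the example data are illustrative and not taken from our code base.

```python
from collections import Counter

def gwet_ac1(rater1, rater2):
    """Gwet's AC1 chance-corrected agreement for two raters (illustrative sketch)."""
    assert len(rater1) == len(rater2)
    n = len(rater1)
    categories = sorted(set(rater1) | set(rater2))
    q = len(categories)

    # Observed agreement: proportion of items both raters placed in the same category.
    pa = sum(a == b for a, b in zip(rater1, rater2)) / n

    # Average marginal proportion per category across the two raters.
    c1, c2 = Counter(rater1), Counter(rater2)
    pi = {c: (c1[c] + c2[c]) / (2 * n) for c in categories}

    # Chance agreement under Gwet's model.
    pe = sum(p * (1 - p) for p in pi.values()) / (q - 1)

    return (pa - pe) / (1 - pe)

# Example: hypothetical triage recommendations from two doctors for five posts.
print(gwet_ac1(["home", "GP", "home", "hospital", "GP"],
               ["home", "GP", "GP", "hospital", "GP"]))
```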
Pairwise agreement between pairs of doctors’ answers to questions 1 and 2; see
Pair | Question 1 | | | Question 2 | |
 | Agreement | κ | AC1 | Agreement | κ | AC1
AB | 0.65 | 0.26 | 0.55 | 0.73 | 0.64 | 0.67 |
BC | 0.63 | 0.14 | 0.53 | 0.73 | 0.64 | 0.67 |
AC | 0.77 | 0.28 | 0.72 | 0.51 | 0.40 | 0.40 |
We mapped the doctors’ recommendations from question 1 to ordinal values; the options
To diagnose whether a patient has COVID-19 from their post, we first estimated the probability of having the disease by normalizing the rating (ie, given a rating, r, the probability of COVID-19,
Given our GTP estimates were discrete, we investigated 3 decision boundaries, denoted by LE, LT, and NEQ, based on a threshold value of 0.5 to classify a post as follows:
LE: If Pr(COVID|r)≤0.5, then NO_COVID, else COVID.
LT: If Pr(COVID|r)<0.5, then NO_COVID, else COVID.
NEQ: If Pr(COVID|r)<0.5, then NO_COVID, elseif Pr(COVID|r)>0.5, then COVID.
Note that NEQ ignores cases on the 0.5 boundary.
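As a concrete illustration of the 3 decision functions, the snippet below maps a rating to a probability and applies LE, LT, and NEQ. The normalization shown (a 0-10 rating scale) is an assumption made purely for illustration; the actual normalization is described in the text.

```python
def covid_probability(rating, max_rating=10):
    # Assumed normalization: the 0-10 rating scale is illustrative, not the study's exact scale.
    return rating / max_rating

def le(p):   # LE: boundary cases (p == 0.5) are labeled NO_COVID
    return "NO_COVID" if p <= 0.5 else "COVID"

def lt(p):   # LT: boundary cases are labeled COVID
    return "NO_COVID" if p < 0.5 else "COVID"

def neq(p):  # NEQ: boundary cases are discarded
    if p < 0.5:
        return "NO_COVID"
    if p > 0.5:
        return "COVID"
    return None  # exactly 0.5: ignored

for r in (3, 5, 8):
    p = covid_probability(r)
    print(r, le(p), lt(p), neq(p))
```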
A schematic of our methodology to triage and diagnose patients based on their social posts is shown in
A block diagram of the COVID-19 triage-and-diagnosis text processing pipeline. CRF: conditional random field; RB: rule based; SVM: support vector machine.
In the first step, we preprocessed each patient’s post by splitting it into sentences and tokens using General Architecture for Text Engineering (GATE) software’s (University of Sheffield) [
The semantic relation between a symptom and other concepts, which we formally termed
The severity modifiers were mapped to a scale of 1-5; the semantic meaning of the scale was
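Conceptually, the rule-based relation step can be pictured as linking each modifier concept to the nearest symptom within a fixed token distance, in line with the distances of 2-7 tokens evaluated later. The sketch below is a simplified illustration under that assumption; the concept-span format and the severity lexicon are hypothetical.

```python
# Illustrative sketch of distance-based relation linking; concept spans are
# (label, start_token, end_token) tuples as produced by a concept extraction step.
SEVERITY_SCALE = {"mild": 1, "moderate": 3, "severe": 5}  # hypothetical lexicon

def link_relations(concepts, max_distance=5):
    """Attach each SEVERITY/DURATION/BPOC concept to the nearest symptom within max_distance tokens."""
    symptoms = [c for c in concepts if c[0] == "SYM"]
    relations = []
    for label, start, end in concepts:
        if label not in ("SEVERITY", "DURATION", "BPOC"):
            continue

        # Token distance between this modifier span and a symptom span.
        def distance(sym):
            return min(abs(start - sym[2]), abs(sym[1] - end))

        if symptoms:
            nearest = min(symptoms, key=distance)
            if distance(nearest) <= max_distance:
                relations.append((nearest, (label, start, end)))
    return relations

concepts = [("SYM", 2, 2), ("SEVERITY", 4, 4), ("SYM", 10, 11), ("DURATION", 13, 14)]
print(link_relations(concepts))
```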
Fixed-length vector representations suitable as input for SVM classifiers were built as follows:
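The exact construction is described in the text; purely as a rough illustration of the general idea, the sketch below builds a binary symptom-only vector and a symptom-modifier vector that additionally carries a severity weight per symptom. The vocabulary, ordering, and weighting scheme here are assumptions, not the study's encoding.

```python
import numpy as np

# Hypothetical, fixed symptom vocabulary defining the vector dimensions.
SYMPTOM_VOCAB = ["cough", "fever", "fatigue", "loss of smell"]

def symptom_only_vector(symptoms):
    """Binary indicator per vocabulary symptom (illustrative)."""
    return np.array([1.0 if s in symptoms else 0.0 for s in SYMPTOM_VOCAB])

def symptom_modifier_vector(symptom_severity):
    """One slot per symptom holding a severity weight on the 1-5 scale, 0 if absent (illustrative)."""
    return np.array([float(symptom_severity.get(s, 0)) for s in SYMPTOM_VOCAB])

print(symptom_only_vector({"cough", "fever"}))
print(symptom_modifier_vector({"cough": 2, "fever": 5}))
```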
We utilized SVM classification and regression models to triage and diagnose patients’ posts, respectively, from the vector representations described earlier. For question 1, the recommendation from a doctor or a combination of doctors was the class label of the post; see the Problem setting subsection in the Methods section for a description. To build a binary classifier, we first combined the
For diagnosing COVID-19 cases, we deployed a variant of the SVM, called
We evaluated the performance of the CRF and SVM classification algorithms using the standard measures of precision, recall, and macro- and microaveraged F1 scores [
Support ratio of triage classes across models for question 1 classification tasks. Absolute numbers for the "Send to a hospital" class in test sets were as follows: A=10, B=12, AB(R-a)=14, AB(R-t)=5, BC(R-a)=6, AC(R-a)=5, and ABC(R-a)=9; the value for the remaining models was 0. GP: general practitioner.
For the CRF, we reported 3-fold cross-validated macroaveraged results. Specifically, we trained each fold by a Python wrapper [
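The specific Python wrapper and feature set are cited in the text; as one possible realization, the sketch below trains a token-level CRF with sklearn-crfsuite (an assumption) on simple word-shape features, using the concept labels reported in the table below.

```python
import sklearn_crfsuite
from sklearn_crfsuite import metrics

def token_features(sentence, i):
    """Simple, illustrative per-token features for concept extraction."""
    word = sentence[i]
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "is_digit": word.isdigit(),
        "prev": sentence[i - 1].lower() if i > 0 else "<BOS>",
        "next": sentence[i + 1].lower() if i < len(sentence) - 1 else "<EOS>",
    }

# Toy training data: one tokenized post with its per-token concept labels.
sentences = [["severe", "cough", "for", "three", "days"]]
labels = [["SEVERITY", "SYM", "OTHER", "DURATION", "DURATION"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X, labels)

pred = crf.predict(X)
print(metrics.flat_f1_score(labels, pred, average="macro"))
```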
We constructed SVM binary classifiers, SVM classifier 1 and SVM classifier 2, using the Python wrapper for LIBSVM [
We simulated 2 cases for COVID-19 triage and diagnosis. First, the SVM and SVR models were trained with the ground truth to examine their predictive performance when deployed as stand-alone applications. Second, they were trained with the predictions from the CRF and RB classifiers, resembling an end-to-end NLP application. To obtain comparable results, the models were always tested against the ground truth. As a measure of performance, we reported macro- and microaveraged F1 scores for the SVM classifiers and SVR, respectively.
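To make the training and evaluation setup concrete, the sketch below trains an RBF-kernel SVM for triage and an SVR for the COVID-19 rating, and reports macro- and microaveraged F1 scores on a held-out set. scikit-learn's libsvm-backed SVC and SVR stand in for the cited LIBSVM wrapper, a single multiclass classifier stands in for the hierarchical pair of binary triage classifiers, and the data are synthetic; this is an illustrative sketch, not the study's exact configuration.

```python
import numpy as np
from sklearn.svm import SVC, SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.random((200, 20))             # post vectors (symptom-only or symptom-modifier)
y_triage = rng.integers(0, 3, 200)    # 3 triage classes
y_rating = rng.random(200)            # normalized ground-truth probability

X_tr, X_te, yt_tr, yt_te, yr_tr, yr_te = train_test_split(
    X, y_triage, y_rating, test_size=0.3, random_state=0)

triage = SVC(kernel="rbf", C=1.0).fit(X_tr, yt_tr)
pred = triage.predict(X_te)
print("triage macro F1:", f1_score(yt_te, pred, average="macro"))
print("triage micro F1:", f1_score(yt_te, pred, average="micro"))

diagnosis = SVR(kernel="rbf").fit(X_tr, yr_tr)
prob = diagnosis.predict(X_te)
labels = np.where(prob <= 0.5, 0, 1)  # LE decision function from above
print("diagnosis micro F1:",
      f1_score((yr_te > 0.5).astype(int), labels, average="micro"))
```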
The concept and relation extraction phases produced excellent and good predictive performances, respectively; see
Regarding question 2, when we trained the models with the symptom-modifier vector representation from the ground truth, the results of COVID-19 diagnosis were in the range of 72%-87%, 61%-76%, and 74%-87% for the LE, LT, and NEQ decision functions, respectively; see
In general, NEQ models perform better due to the omission of borderline cases where the GTPs are exactly 0.5. The support ratios for each model for different decision functions are shown in
Finally, we trained our models using a linear kernel but found that the RBF dominates in most of the cases; however, linear kernels are useful in finding feature importance [
Concept extraction using CRFa on 3-fold cross-validation.
Label | Precision | Recall | F1 score | Support |
SYMb | 0.94 | 0.97 | 0.95 | 1300 |
SEVERITY | 0.80 | 0.79 | 0.79 | 437 |
BPOCc | 0.92 | 0.83 | 0.87 | 356 |
DURATION | 0.87 | 0.91 | 0.89 | 667 |
INTENSIFIER | 0.88 | 0.97 | 0.92 | 494 |
NEGATION | 0.83 | 0.89 | 0.86 | 338 |
OTHER | 0.99 | 0.98 | 0.98 | 16892 |
Macroaverage | 0.89 | 0.89 | 0.89 | —d |
aCRF: conditional random field.
bSYM: symptoms.
cBPOC: body part, organ, or organ component.
dNot applicable.
Relation extraction using RBa classifier results on 3-fold cross-validation.
Distance | With stop words | | | Without stop words | |
 | Precision | Recall | F1 score | Precision | Recall | F1 score
2 | 0.74 | 0.63 | 0.68 | 0.74 | 0.64 | 0.69
3 | 0.75 | 0.67 | 0.71 | 0.75 | 0.67 | 0.71
4 | 0.75 | 0.69 | 0.72 | 0.75 | 0.69 | 0.72
5 | 0.75 | 0.71 | 0.73 | 0.74 | 0.71 | 0.73
6 | 0.74 | 0.72 | 0.73 | 0.74 | 0.72 | 0.73
7 | 0.73 | 0.73 | 0.73 | 0.73 | 0.73 | 0.73
aRB: rule based.
Question 1: hierarchical classification results for the RBFa kernel using the symptom-modifier relation vector.
Model | SVMb classifier 1 | | | SVM classifier 2 | |
 | Precision | Recall | F1 score | Precision | Recall | F1 score
Trained with ground truth
A | 0.82 | 0.91 | 0.86 | 0.73 | 0.95 | 0.83
B | 0.73 | 0.77 | 0.75 | 0.81 | 0.99 | 0.89
C | 0.85 | 0.98 | 0.91 | —c | — | —
AB(R-a) | 0.70 | 0.75 | 0.72 | 0.80 | 0.96 | 0.88
AB(R-t) | 0.84 | 0.96 | 0.89 | 0.85 | 1.00 | 0.92
BC(R-a) | 0.72 | 0.75 | 0.73 | 0.92 | 1.00 | 0.96
BC(R-t) | 0.86 | 0.99 | 0.92 | — | — | —
AC(R-a) | 0.79 | 0.87 | 0.83 | 0.89 | 1.00 | 0.94
AC(R-t) | 0.88 | 0.98 | 0.93 | — | — | —
ABC(R-a) | 0.70 | 0.76 | 0.73 | 0.89 | 0.99 | 0.93
ABC(R-t) | 0.88 | 0.99 | 0.93 | — | — | —
Trained with CRFd predictions
A | 0.81 | 0.89 | 0.85 | 0.72 | 0.91 | 0.80
B | 0.74 | 0.74 | 0.74 | 0.81 | 0.99 | 0.89
C | 0.85 | 0.96 | 0.90 | — | — | —
AB(R-a) | 0.73 | 0.71 | 0.71 | 0.81 | 0.96 | 0.88
AB(R-t) | 0.84 | 0.94 | 0.88 | 0.84 | 1.00 | 0.92
BC(R-a) | 0.74 | 0.71 | 0.72 | 0.92 | 1.00 | 0.96
BC(R-t) | 0.88 | 0.98 | 0.93 | — | — | —
AC(R-a) | 0.81 | 0.85 | 0.83 | 0.89 | 1.00 | 0.94
AC(R-t) | 0.88 | 0.98 | 0.93 | — | — | —
ABC(R-a) | 0.72 | 0.72 | 0.72 | 0.89 | 1.00 | 0.94
ABC(R-t) | 0.89 | 0.98 | 0.93 | — | — | —
aRBF: radial basis function.
bSVM: support vector machine.
cNot applicable.
dCRF: conditional random field.
Question 1: hierarchical classification results for the RBFa kernel using the symptom-only vector.
Model | SVMb classifier 1 | | | SVM classifier 2 | |
 | Precision | Recall | F1 score | Precision | Recall | F1 score
Trained with ground truth
A | 0.83 | 0.91 | 0.87 | 0.74 | 0.85 | 0.79
B | 0.71 | 0.81 | 0.76 | 0.81 | 0.98 | 0.89
C | 0.87 | 0.97 | 0.92 | —c | — | —
AB(R-a) | 0.69 | 0.75 | 0.72 | 0.83 | 0.96 | 0.89
AB(R-t) | 0.85 | 0.94 | 0.89 | 0.85 | 1.00 | 0.92
BC(R-a) | 0.71 | 0.79 | 0.75 | 0.92 | 0.99 | 0.95
BC(R-t) | 0.88 | 0.98 | 0.93 | — | — | —
AC(R-a) | 0.80 | 0.86 | 0.83 | 0.89 | 1.00 | 0.94
AC(R-t) | 0.90 | 0.98 | 0.94 | — | — | —
ABC(R-a) | 0.68 | 0.74 | 0.71 | 0.90 | 1.00 | 0.95
ABC(R-t) | 0.90 | 0.98 | 0.94 | — | — | —
Trained with CRFd predictions
A | 0.84 | 0.89 | 0.87 | 0.74 | 0.82 | 0.78
B | 0.74 | 0.79 | 0.77 | 0.82 | 0.98 | 0.89
C | 0.86 | 0.95 | 0.90 | — | — | —
AB(R-a) | 0.72 | 0.76 | 0.73 | 0.83 | 0.92 | 0.87
AB(R-t) | 0.87 | 0.93 | 0.90 | 0.84 | 0.98 | 0.90
BC(R-a) | 0.72 | 0.78 | 0.75 | 0.92 | 0.99 | 0.95
BC(R-t) | 0.87 | 0.97 | 0.92 | — | — | —
AC(R-a) | 0.80 | 0.86 | 0.83 | 0.89 | 1.00 | 0.94
AC(R-t) | 0.89 | 0.95 | 0.92 | — | — | —
ABC(R-a) | 0.71 | 0.76 | 0.73 | 0.89 | 0.99 | 0.93
ABC(R-t) | 0.90 | 0.95 | 0.92 | — | — | —
aRBF: radial basis function.
bSVM: support vector machine.
cNot applicable.
dCRF: conditional random field.
Support ratio of diagnosis classes across models and 3 decision functions for question 2 classification tasks.
Question 2: microaveraged F1 score results for different models and decision functions. Here, A, B, and C are 3 medical doctors (abbreviated as Dr) who took part in the experiment.
Model | Symptom-modifier vector | | | Symptom-only vector | |
 | LE | LT | NEQ | LE | LT | NEQ
Trained with ground truth
A | 0.72 | 0.61 | 0.78 | 0.70 | 0.59 | 0.74
B | 0.78 | 0.61 | 0.76 | 0.78 | 0.62 | 0.77
C | 0.87 | 0.75 | 0.87 | 0.88 | 0.75 | 0.87
AB | 0.72 | 0.66 | 0.74 | 0.74 | 0.65 | 0.75
BC | 0.84 | 0.76 | 0.84 | 0.85 | 0.79 | 0.86
AC | 0.81 | 0.73 | 0.81 | 0.83 | 0.74 | 0.83
ABC | 0.74 | 0.67 | 0.76 | 0.75 | 0.67 | 0.77
Trained with CRFa predictions
A | 0.68 | 0.64 | 0.76 | 0.50 | 0.79 | 0.74
B | 0.76 | 0.64 | 0.77 | 0.78 | 0.57 | 0.74
C | 0.86 | 0.75 | 0.87 | 0.87 | 0.74 | 0.86
AB | 0.70 | 0.65 | 0.73 | 0.71 | 0.66 | 0.74
BC | 0.83 | 0.76 | 0.83 | 0.85 | 0.78 | 0.86
AC | 0.80 | 0.74 | 0.82 | 0.80 | 0.73 | 0.81
ABC | 0.72 | 0.69 | 0.76 | 0.74 | 0.69 | 0.77
aCRF: conditional random field.
This study demonstrates the potential to triage and diagnose COVID-19 patients from their social media posts. We presented a proof-of-concept system to predict a patient’s health state by building machine learning models from their narrative. The models were trained in 2 ways: using (1) ground-truth labels and (2) predictions obtained from the NLP pipeline; trained models were always tested on ground-truth labels. We obtained good performance in both cases, which indicates that an automated NLP pipeline could be used to triage and diagnose patients from their narratives; see the Evaluation Outcomes subsection in the Results section. In general, health professionals and researchers could deploy triage models to determine the severity of COVID-19 cases in the population and diagnostic models to gauge the prevalence of the pandemic.
To quantify the important predictive features in the training set, we experimented with COVID-19 diagnosis using linear-kernel SVR. More specifically, we used the symptom-only vector representation constructed from the ground truth. We summed the feature weights for each Si in <S0, S1, ..., Sn> from the 7 models and the 3 decision functions; see the Methods section. The features were then mapped to the categories found in the Twitter COVID-19 lexicon compiled by Sarker et al [
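With a linear kernel, the learned weight vector can be read off each model directly, which is how the summed feature weights discussed here can be obtained. The sketch below assumes scikit-learn's linear-kernel SVR and synthetic data; the helper and variable names are illustrative only.

```python
import numpy as np
from sklearn.svm import SVR

def summed_feature_weights(models):
    """Sum the linear-kernel weight vectors of several fitted SVR models, per feature."""
    total = None
    for m in models:
        w = np.asarray(m.coef_).ravel()  # weight vector of a linear-kernel SVR
        total = w if total is None else total + w
    return total

# Illustrative: a few linear SVRs trained on random symptom-only vectors.
rng = np.random.default_rng(1)
X, y = rng.random((100, 8)), rng.random(100)
models = [SVR(kernel="linear", C=c).fit(X, y) for c in (0.5, 1.0, 2.0)]

weights = summed_feature_weights(models)
ranking = np.argsort(-np.abs(weights))   # most important features first
print(ranking[:5])
```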
To compare our importance ranking with that of Sarker et al [
The top-right chart in
Next, we compared our most important feature weights with our data set’s frequency ranking using the methods described earlier. From the bottom-left stacked bar chart of
Finally, the bottom-right chart in
Feature comparison between our most important features and Sarker et al’s [
It is worth reiterating that social media posts, which are known to be noisy, are not on a par with the consultation that a patient would have with a doctor. We stress that the aim of this study is to extract useful information at a population level, rather than to provide an actionable decision for an individual via social media posts. Our manually annotated data set has 2 main limitations. First, having only 3 experts limited the quality of our labeling, although we deem this study to be a proof of concept; a larger number of experts, including more senior doctors, would be beneficial in a follow-up study. Second, the robustness of our results could be further improved by both increasing the size of our data set and introducing posts from several alternative sources. Given that the posts come from social media, it is not clear whether the results could be used on their own in a diagnostic system, without combining them with actual consultations. However, it is worth noting that medical social media, such as the posts we used herein, may uncover novel information regarding COVID-19.
The coronavirus pandemic has drawn a spotlight on the need to develop automated processes to provide additional information to researchers, health professionals, and decision makers. Medical social media comprises a rich resource of timely information that could fit this purpose. We have demonstrated that, despite the heterogeneous nature of typical social media posts, an automated triage and diagnosis system aimed at the detection of COVID-19 can augment public health surveillance systems. The outputs from such an approach could be used to indicate the severity and estimate the prevalence of the disease in the population.
BPOC: body part, organ, or organ component
CRF: conditional random field
CT: computed tomography
EHR: electronic health record
GP: general practitioner
GTP: ground-truth probability
LR: logistic regression
NLP: natural language processing
RB: rule based
RBF: radial basis function
SVM: support vector machine
SVR: support vector regression
SYM: symptoms
gradient boosting
All authors were involved in the design of the work. The first author wrote the code. The first 3 authors drafted the paper, and all authors critically revised the article.
None declared.