This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Evidence from peer-reviewed literature is the cornerstone for designing responses to global threats such as COVID-19. In massive and rapidly growing corpora, such as COVID-19 publications, assimilating and synthesizing information is challenging. Leveraging a robust computational pipeline that evaluates multiple aspects, such as network topological features, communities, and their temporal trends, can make this process more efficient.
We aimed to show that new knowledge can be captured and tracked using the temporal change in the underlying unsupervised word embeddings of the literature. Furthermore, imminent themes can be predicted using machine learning on the evolving associations between words.
Frequently occurring medical entities were extracted from the abstracts of more than 150,000 COVID-19 articles published on the World Health Organization database, collected on a monthly interval starting from February 2020. Word embeddings trained on each month’s literature were used to construct networks of entities with cosine similarities as edge weights. Topological features of the subsequent month’s network were forecasted based on prior patterns, and new links were predicted using supervised machine learning. Community detection and alluvial diagrams were used to track biomedical themes that evolved over the months.
We found that thromboembolic complications were detected as an emerging theme as early as August 2020. A shift toward the symptoms of long COVID complications was observed during March 2021, and neurological complications gained significance in June 2021. A prospective validation of the link prediction models achieved an area under the receiver operating characteristic curve of 0.87. Predictive modeling revealed predisposing conditions, symptoms, cross-infection, and neurological complications as dominant research themes in COVID-19 publications based on the patterns observed in previous months.
Machine learning–based prediction of emerging links can contribute toward steering research by capturing themes represented by groups of medical entities, based on patterns of semantic relationships over time.
The COVID-19 pandemic is a global health threat and has proven to be an enigma, with its diverse clinical presentation, controversial evidence for treatment, fast-tracked vaccine development, and unclear systemic implications. Most countries have been affected by COVID-19, with around 187 million confirmed cases over a short span and more than 4 million deaths recorded until July 13, 2021 [
Abstracts of articles hold a substantial amount of information in the literature. Named entities within abstracts play a crucial role in deducing valuable information from large amounts of text and influencing literature trends [
Predicting links between “medical terms” is highly significant for understanding the underlying themes within the literature and the phenomenon. Link prediction is the task of predicting the existence of links between 2 nodes in a complex network based on a set of topological features. The problem of link prediction in real-world temporal networks has been explored extensively in recent years [
We focused primarily on the rapidly emerging COVID-19 literature to train and validate our architecture. We forecasted the semantic and topological proximity features of named entity pairs from their temporal trends in prior months. We then used these forecasted features to predict links between clinical entities extracted from textual data over the forecasted time interval using machine learning algorithms. These links were in turn used to create a network weighted by forecasted cosine similarity for detecting communities of entities that tend to reflect the themes of the articles published in that month. To assess the efficacy of our predictive modeling, we validated the proximity features of entity pairs forecasted by the autoregressive integrated moving average (ARIMA) model using mean squared error (MSE). We also evaluated the performance of the machine learning algorithms in predicting the links over a time span of 3 months.
The schematic representation of workflow has been demonstrated (
Graphical representation of the proposed framework explaining the complete workflow. The pipeline takes abstracts as inputs from which entities are extracted using named entity recognition. Embeddings are generated, which are used as features for longitudinal networks. These networks are used for visualizing the trends using alluvial diagrams, link prediction, and predicting top k influential modules for theme prediction. ARIMA: autoregressive integrated moving average.
The data set was created from abstracts of approximately 150,000 COVID-19 articles published in the publicly available WHO Database [
(A) Graph showing the number of articles published each month. The curve shows a steep increase in the number of articles each month since February 2020. (B) Latent space of word embeddings of diseases visualized around the keyword “post-covid syndrome,” displaying 100 isolated points nearest to it. (C) Bar plot showing the frequency of the top diseases in the corpus of abstracts extracted using named entity recognition (NER). (D) Bar plot showing the frequency of the top chemicals in the corpus of abstracts extracted using NER. HCQ: hydroxychloroquine; IL: interleukin.
Named entity recognition (NER) was used to extract 2 types of entities (diseases and chemicals) from the original abstracts of vetted research articles using a model pretrained on the BC5CDR corpus by SciSpacy, an open-source project for biomedical natural language processing [
Word embeddings were trained upon the abstracts obtained from the WHO database updated with new publications and preprints as these become available every month. A low-dimensional representation (d=100) for the words present in the corpus of abstracts was learned using the Word2Vec model with the skip-gram algorithm and a fixed window size of 5, implemented in Gensim [
High cosine similarity represents strong relationships between words. We used diachronic word embeddings to capture the evolving contextual similarities between various diseases and studied the evolution over time. Weighted networks were constructed using the similarity between word vectors of extracted entities as edge weights. From each month’s corpus of abstracts, top N (=100) most frequently occurring diseases were extracted, and pairs having greater than the 90th percentile of cosine similarity based on the corresponding month’s word embeddings were used to create a union set of entities across months, preserved as nodes in the temporal networks. Therefore, every month’s network had a fixed set of nodes with varying links, labeled as 0 or 1 based on the threshold of cosine similarity, and varying weights, calculated based on the evolving semantic closeness. The mentioned threshold has been chosen empirically based on experimentation; a high threshold has been selected to depict contextual similarity between 2 words present in the same latent space. For training and evaluation, a fixed set of entity pairs was created from the diseases identified in the abstracts of the papers published from February 2020 to February 2021, using the mentioned procedure. For the subsequent months, the word embedding models were trained on the respective corpora of abstracts, and the links between the fixed set of node pairs were assigned if they appeared in the vocabulary and were weighted by the cosine similarity between their word vectors. Community detection was performed over the monthly networks using the Infomap algorithm [
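The network construction described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the linear-interpolation percentile, and the input format (a dictionary mapping each entity to its word vector for one month) are assumptions made for the example.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    low = math.floor(rank)
    high = math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def build_monthly_network(entity_vectors, q=90.0):
    """Return (weights, links) for one month's embeddings.

    weights: cosine similarity of every entity pair (the edge weight)
    links:   1 if the pair's similarity exceeds the q-th percentile, else 0
    """
    weights = {
        (a, b): cosine_similarity(entity_vectors[a], entity_vectors[b])
        for a, b in combinations(sorted(entity_vectors), 2)
    }
    if not weights:  # fewer than 2 entities: nothing to link
        return {}, {}
    threshold = percentile(list(weights.values()), q)
    links = {pair: int(sim > threshold) for pair, sim in weights.items()}
    return weights, links
```

Calling `build_monthly_network` once per month on the same fixed entity set yields networks with identical nodes but month-specific links and weights, which is the structure the temporal analysis relies on.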
In order to predict the existence of links between nodes in the networks of subsequent months, we computed 5 neighborhood proximity scores for the network of each month. Jaccard similarity, common neighbors, preferential attachment [
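As an illustration, the three neighborhood scores named above can be computed directly from an adjacency structure. This is a sketch under stated assumptions: the helper names and the edge-list input are hypothetical, and only Jaccard similarity, common neighbors, and preferential attachment are shown.

```python
def build_adjacency(edges):
    """Turn an undirected edge list into a node -> set-of-neighbors mapping."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def proximity_scores(adjacency, u, v):
    """Neighborhood-based proximity scores for the node pair (u, v)."""
    neighbors_u = adjacency.get(u, set())
    neighbors_v = adjacency.get(v, set())
    common = neighbors_u & neighbors_v
    union = neighbors_u | neighbors_v
    return {
        # Number of neighbors shared by both nodes.
        "common_neighbors": len(common),
        # Shared neighbors as a fraction of all neighbors of either node.
        "jaccard": len(common) / len(union) if union else 0.0,
        # Product of the two node degrees.
        "preferential_attachment": len(neighbors_u) * len(neighbors_v),
    }
```

Computing these scores for every node pair in each monthly network gives one value per pair per month, that is, one time series per pair and score.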
Every proximity score was modeled as a time series for each node pair, and the value was predicted for the subsequent month using the ARIMA model [
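To make the forecasting step concrete, the sketch below produces a one-step-ahead forecast for a single node pair's score series. It is a deliberately simplified stand-in, not the configuration used in the study: it assumes an ARIMA(1,1,0) model without a constant term, fitted by conditional least squares, and the function names are hypothetical. A full ARIMA implementation from a statistics library would normally be used instead.

```python
def forecast_next(series):
    """One-step-ahead ARIMA(1,1,0) forecast (no constant) for a score series.

    The series is differenced once, an AR(1) coefficient is fitted to the
    differences by least squares, and the predicted next difference is added
    back to the last observed value.
    """
    if len(series) < 3:
        return float(series[-1])  # too short to fit; repeat the last value
    diffs = [b - a for a, b in zip(series, series[1:])]
    numerator = sum(curr * prev for prev, curr in zip(diffs, diffs[1:]))
    denominator = sum(prev * prev for prev in diffs[:-1])
    phi = numerator / denominator if denominator else 0.0
    return series[-1] + phi * diffs[-1]


def mean_squared_error(actual, predicted):
    """Average squared difference between observed and forecasted scores."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
```

Applying `forecast_next` to each pair's history and comparing the forecasts with the scores observed in the following month via `mean_squared_error` mirrors the validation described in the text.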
The proximity scores predicted using the ARIMA model were further used to identify the occurrence of a link between entities in network G𝜏+1 based on the proximity scores and links in all previous networks (G1, G2, G3, …, G𝜏), using supervised machine learning. We experimented with the proposed link prediction approach using logistic regression [
The links between node pairs predicted by the best performing model were used to create networks weighted by cosine similarity scores predicted by the ARIMA model. The Infomap algorithm was applied on the predicted and original test networks to cluster the nodes into 10 modules. The modules were compared using intersection over union (IOU) with the following formula:

$$\mathrm{IOU}(A_i, B_j) = \frac{|A_i \cap B_j|}{|A_i \cup B_j|}$$

where A_i represents the set of nodes in the predicted ith module, i ∈ {1, 2, …, 10}, and B_j represents the set of nodes in the original jth module, j ∈ {1, 2, …, 10}.
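The module comparison can be illustrated with a short sketch. The IOU function follows the definition in the text; pairing each predicted module with the original module of highest IOU is an assumption made for this example, and the function names are hypothetical.

```python
def intersection_over_union(a, b):
    """IOU of two node sets: |A ∩ B| / |A ∪ B| (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_matching_modules(predicted, original):
    """Map each predicted module ID to (best original module ID, IOU).

    predicted, original: dicts of module ID -> collection of node names.
    Assumption: modules are paired by maximum IOU.
    """
    matches = {}
    for i, nodes_i in predicted.items():
        j_best = max(
            original,
            key=lambda j: intersection_over_union(nodes_i, original[j]),
        )
        matches[i] = (j_best, intersection_over_union(nodes_i, original[j_best]))
    return matches
```

An IOU of 1 means the predicted and original modules contain exactly the same entities, and 0 means they share none.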
Overall, 46,885 distinct diseases and 53,375 unique chemicals were identified. The top entities are shown in
We examined the alluvial diagrams across different months in detail to graphically explore the temporal trends in the literature based on dynamic and homogeneous networks of prevalent medical entities and their associated cosine similarities.
We further advanced the analysis of trends to predicting links between entity pairs for the upcoming months. Our proposed framework for temporal link prediction effectively forecasted 5 proximity scores, including semantic and topological measures, between node pairs by modeling the time series using the ARIMA model. The MSE in the prediction of each proximity score for April 2021, May 2021, and June 2021 is shown in
The intersection of nodes between the predicted and original modules was analyzed to prospectively validate the effectiveness of the proposed prediction framework.
Analysis of networks constructed upon chemical entities revealed the evolution of various drugs studied in the COVID-19 literature. During February 2020, the major module contained entities such as paracetamol, tofacitinib, thalidomide, vitamins, zinc, and other linked chemicals. Another relevant module included central entities, such as doxycycline, ruxolitinib, heparin, and ivermectin, which were discussed in the scientific research on the treatment and prevention of COVID-19. In contrast, our recently updated models showed the emergence of evidence for various immunosuppressive drugs, such as tacrolimus, and anti-inflammatory drugs, such as glucocorticoids and colchicine, during November 2021 (
(A) Alluvial diagram for tracking the trends in 2020, from the networks of March, August, and December. (B) Alluvial diagram for monitoring the trends in 2021, from the networks of January, March, and June. The alluvial diagrams make it easier to trace the temporal dynamics of the literature across different time intervals.
(A) Evaluation of the mean squared error (MSE) between the original and predicted proximity scores for the network of April 2021, May 2021, and June 2021. (B) Confusion matrix with normalized values of the results from the AdaBoost classifier across the months of April 2021, May 2021, and June 2021. AdaBoost has been the best performing model across all 3 months. (C) Results of link prediction between disease entities from March 2021 to June 2021, with a margin of error for 95% CIs. The mean value of metrics has been recorded by testing the models on a resampled test set. AUROC: area under the receiver operating characteristic curve; RF: random forest; SVM: support vector machine.
Clusters or modules of diseases from the predicted network of January 2021 and June 2021.
Module ID | January 2021: top nodes^a | January 2021: IOU^b | June 2021: top nodes | June 2021: IOU |
1 | Acute kidney injury, ARDS^c, coagulopathy, myocardial injury, pulmonary embolism | 0.45 | Headache, lymphopenia, dyspnea, confusion, encephalitis, nausea | 0.71 |
2 | Cardiovascular disease, diabetes mellitus, COPD^d, hypertension | 0.66 | Fibrosis, coagulopathy, thrombotic, hypoxia, inflammation, delirium | 0.70 |
3 | Respiratory infection, MERS^e, respiratory diseases | 0.55 | Comorbidity, asthma, COPD, hypertension, dementia, diabetes | 0.64 |
4 | Depression, insomnia, anxiety, loneliness | 0.71 | Traumatic, anxiety, depression, loneliness, burnout, insomnia | 0.81 |
5 | Myalgia, lymphopenia, headache, anosmia, dyspnea | 0.43 | Immunocompromised, chronic diseases like tuberculosis | 0.33 |
^a A subset of top intersecting nodes in each cluster is mentioned, which collectively signify themes.
^b The given intersection over union (IOU) was computed between clusters of predicted and original networks of the respective months.
^c ARDS: acute respiratory distress syndrome.
^d COPD: chronic obstructive pulmonary disease.
^e MERS: Middle East respiratory syndrome.
In this paper, we demonstrate a computational approach, EvidenceFlow, in which a user interacts with the rapidly expanding COVID-19 literature to derive and predict emerging themes. The proposed framework tracks patterns of changing semantic and topological proximity between entity pairs across months. Further, it predicts links and network communities that may emerge in future months. Hence, users can follow the papers that contribute to emerging communities of themes, for example, literature around thromboembolic complications captured as early as August 2020 and mental health factors toward the end of 2020. Exploring the clusters in the interactive interface of EvidenceFlow revealed that symptoms of long COVID, such as fatigue, headache, myalgia, cough, and anosmia, formed a central cluster during March 2021. This early signal for accumulating evidence was later validated in large prospective and retrospective cohorts of COVID-19 patients [
Prediction of the themes represented by rising centrality of entities can assist in the formation of promising research hypotheses. The dynamics of the literature reveal the emergence of central themes as a combination of pre-existing themes in recent times [
We conducted an analysis on the trends of the PageRank centrality of selected chemical and disease entities. Statins, a class of lipid-lowering medications, were found to be gaining centrality in late 2021 as compared to earlier values (
To explore the potential of unsupervised word embeddings and changing cosine similarity among words, we analyzed the trends of terms having maximum similarity with selected keywords. For example, we analyzed the temporal shift in the context of “vaccine” over the months by finding the top 10 terms most similar to
Temporal evolution of the context of the term “vaccine” across alternate months. The top 10 most similar words based on cosine similarity using monthly Word2Vec embeddings are plotted. Origin and evolution of drug repurposing in the early months, hesitancy, and vaccine candidates in the later months are highlighted.
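The neighbor analysis behind this figure can be sketched as follows. This is an illustrative outline in plain Python rather than the authors' code: the function names and the input format (a dictionary mapping each month to that month's term vectors) are assumptions. Because each month's Word2Vec model is trained separately, the vector spaces are not aligned across months, so the sketch compares ranked neighbor lists rather than raw vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_similar(term, vectors, k=10):
    """Return the k terms whose vectors are most similar to the given term."""
    target = vectors[term]
    scored = [
        (other, cosine_similarity(target, vector))
        for other, vector in vectors.items()
        if other != term
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [other for other, _ in scored[:k]]


def context_shift(term, monthly_vectors, k=10):
    """Map month -> top-k neighbors of the term, skipping months in which
    the term is absent from the vocabulary."""
    return {
        month: top_k_similar(term, vectors, k)
        for month, vectors in monthly_vectors.items()
        if term in vectors
    }
```

Terms that enter or leave the neighbor list from one month to the next indicate how the context of the keyword is shifting.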
Our study has some limitations. First, although the WHO database has been built using a detailed search strategy for COVID-19 literature, it does not explicitly report the exact purpose or accuracy of the search and decision process. The documentation [
Further, we currently use abstracts of research articles to extract named entities and may be missing details contained in the full text of the articles while training word embeddings. Therefore, future work may build upon the framework to include the full text of articles wherever available. The NER model used in our study has been reported to have achieved an F1 score of 84.49% on a benchmark data set [
Consortia across the globe were formed for the advancement of research related to COVID-19. The global attention has led to a widespread increase in the scientific literature to study and prevent the disease from spreading, resulting in an understanding of the disease from multiple perspectives. We introduced a framework built upon COVID-19–specific literature vetted by the WHO and deployed as a dashboard called EvidenceFlow [
Supplementary text.
Frequency of articles belonging to specific categories in the COVID-19 literature.
List of software and packages used for our study with their sources and identifiers for the reproducibility of this study.
Distribution of errors in the prediction of proximity scores between node pairs (used as features in model training) for the month of June 2021.
Models and respective parameters used for training.
Latent space of word embeddings of diseases and chemicals visualized around the keyword “mental disorders,” displaying 100 isolated points nearest to it.
The top 10 similar entities (diseases, conditions, or chemicals) with selected keywords (“vaccine,” “comorbidity,” “adverse effects,” “social,” and “psychological”) in descending order of cosine similarity calculated using the word embeddings generated from the Word2Vec model trained on the entire corpus.
Evaluation of the mean squared error between original and predicted proximity scores for the network of April 2021, May 2021, and June 2021.
Results of temporal link prediction between entities for the months of April 2021, May 2021, and June 2021, with a margin of error for 95% confidence intervals.
Welch t test results of the performance of algorithms for the test set of June 2021.
Community detection results from the predicted and actual networks for June 2021.
Results of community detection from the predicted subsequent network based on training data till June 2021.
Percentage of abstracts of articles published in June 2021 mentioning diseases belonging to each module in the actual (A) and predicted (B) networks.
Alluvial diagram for tracking the trends of chemical entities from the networks of February 2020 to November 2021.
Temporal trends of the PageRank centrality of (A) “statins,” (B) “glucocorticoids,” (C) “depressive,” and (D) “thromboembolic”.
ARIMA: autoregressive integrated moving average
AUROC: area under the receiver operating characteristic curve
IOU: intersection over union
MSE: mean squared error
NER: named entity recognition
WHO: World Health Organization
We acknowledge support from the Center of Excellence in Healthcare and the Center of Excellence in Artificial Intelligence at Indraprastha Institute of Information Technology-Delhi.
RP and HC designed and implemented the computational framework, interpreted the results, and wrote the paper. HB contributed to writing and created the associated dashboard. RA and AN interpreted the results and provided feedback on statistical methods. TS designed the study, analyzed the results, and contributed to writing. All authors read and approved the final paper.
None declared.