A Comprehensive Overview of the COVID-19 Literature: Machine Learning–Based Bibliometric Analysis

Background: Shortly after the emergence of COVID-19, researchers rapidly mobilized to study numerous aspects of the disease such as its evolution, clinical manifestations, effects, treatments, and vaccinations. This led to a rapid increase in the number of COVID-19–related publications. Identifying trends and areas of interest using traditional review methods (eg, scoping and systematic reviews) for such a large domain area is challenging. Objective: We aimed to conduct an extensive bibliometric analysis to provide a comprehensive overview of the COVID-19 literature. Methods: We used the COVID-19 Open Research Dataset (CORD-19) that consists of a large number of research articles related to all coronaviruses. We used a machine learning–based method to analyze the most relevant COVID-19–related articles and extracted the most prominent topics. Specifically, we used a clustering algorithm to group published articles based on the similarity of their abstracts to identify research hotspots and current research directions. We have made our software accessible to the community via GitHub. Results: Of the 196,630 publications retrieved from the database, we included 28,904 in our analysis. The mean number of weekly publications was 990 (SD 789.3). The country that published the highest number of COVID-19–related articles was China


Background
In December 2019, Wuhan city in China registered several cases of an unknown disease characterized by pneumonia, dry cough, fatigue, and fever [1]. The investigations revealed that a novel coronavirus (2019-nCoV) was the causative agent of the disease, which was subsequently named COVID-19 [1]. Since then, COVID-19 has spread around the globe, leading the World Health Organization to classify it as a pandemic [2]. This highly contagious pathogen has affected almost every aspect of our daily lives, such as education, traveling, business, transportation, sports, and health care [3]. Most importantly, the COVID-19 pandemic has claimed more than 775,000 lives as of August 19, 2020 [4]. To curb the impact of COVID-19, authorities need to implement effective public health measures related to COVID-19 surveillance, diagnostics, vaccines, treatments, and research [5].
Given the novelty of the disease and, consequently, the lack of knowledge about it, research can play a crucial role in the fight against the COVID-19 pandemic. Scientists have rapidly mobilized to manage and slow down the growth of the pandemic. The scientific literature in this domain area has increased exponentially [6,7]. By the end of May 2020, Aristovnik et al [6] and Doanvo et al [8] retrieved 10,344 and 18,412 COVID-19-related publications written in the English language, respectively, from the Scopus database and the COVID-19 Open Research Dataset (CORD-19). In addition, as of July 13, 2020, more than 1711 clinical trials were registered in different clinical trial registries (eg, NCT, EUCTR, and ISRCTN) [9].
It is very important to have a comprehensive overview of the current state of the literature on COVID-19 for several reasons, namely: (1) to organize and coordinate the literature; (2) to explore research topics addressed; (3) to prioritize research needs or gaps; (4) to understand the evolution of the literature; (5) to recognize the leading researchers, institutes, and countries in this area; and (6) to explore connections between research topics and areas.

Research Problem and Aim
Manually conducting a comprehensive review of the thousands of COVID-19-related publications is a daunting and time-consuming task. Artificial intelligence (AI) methods can play a pivotal role in rapidly surveying the enormous number of publications and extracting critical insights from them. Therefore, in March 2020, the White House strongly urged researchers to exploit AI methods in COVID-19 research [8].

Study Data Collection
For this study, we used CORD-19, generated by the Allen Institute for AI [20]. The dataset is updated daily to include the latest published articles on COVID-19. We used the update corresponding to the timestamp of July 21, 2020, which contained 196,630 scholarly articles related to COVID-19 and the coronavirus family of viruses. The Allen Institute for AI used the following search terms to retrieve studies on all coronaviruses: "COVID-19" OR "Coronavirus" OR "Corona virus" OR "2019-nCoV" OR "SARS-CoV" OR "MERS-CoV" OR "Severe Acute Respiratory Syndrome" OR "Middle East Respiratory Syndrome". The search was conducted on PubMed, PubMed Central, and the bioRxiv and medRxiv preprint servers. The dataset included a CSV (comma-separated values) file with metadata of all the articles in the dataset, such as article ID, title, abstract, names of authors, and publication date. Each article in the dataset was represented by a single JSON (JavaScript Object Notation) file that consisted of the article ID, title, abstract, body text, and relevant metadata. The metadata of the dataset was analyzed using Python in a Jupyter Notebook environment. We have made our software accessible to the community via GitHub [21]. The CSV metadata file was loaded into a data frame provided by Python's pandas library. We removed records with empty and non-English abstracts. We also removed duplicate articles and any articles that were published before January 1, 2020. We then used the search terms "novel coronavirus," "coronavirus 2019," "2019-nCov," "COVID-19," "COVID 2019," "severe acute respiratory syndrome coronavirus 2," and "SARS-COV-2" to select only COVID-19-related articles. Thus, we were able to identify a total of 28,904 abstracts of scholarly articles published after
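The selection steps described above can be sketched in a few lines of Python. This is a dependency-free illustration, not the authors' pandas code from their GitHub repository: the record keys (title, abstract, publish_date), the function name select_covid_records, and the is_english placeholder (standing in for a real language detector such as langdetect) are all assumptions made for the example. Case-insensitive matching of the search terms is also an assumption.

```python
from datetime import date

# Search terms quoted in the text, lowercased for case-insensitive matching.
COVID_TERMS = [
    "novel coronavirus", "coronavirus 2019", "2019-ncov", "covid-19",
    "covid 2019", "severe acute respiratory syndrome coronavirus 2",
    "sars-cov-2",
]
CUTOFF = date(2020, 1, 1)


def select_covid_records(records, is_english=lambda text: True):
    """Apply the exclusion steps to a list of metadata dicts.

    `records` uses hypothetical keys: title, abstract, publish_date.
    `is_english` is a placeholder for a real language detector.
    """
    seen = set()
    kept = []
    for rec in records:
        abstract = (rec.get("abstract") or "").strip()
        if not abstract or not is_english(abstract):
            continue  # empty or non-English abstract
        if rec["publish_date"] < CUTOFF:
            continue  # published before January 1, 2020
        key = (rec["title"].strip().lower(), abstract.lower())
        if key in seen:
            continue  # duplicate entry
        seen.add(key)
        text = f"{rec['title']} {abstract}".lower()
        if any(term in text for term in COVID_TERMS):
            kept.append(rec)  # title or abstract mentions COVID-19
    return kept
```

Keeping each exclusion as its own early `continue` makes it easy to count how many records each criterion removes, which is the kind of breakdown a flow diagram needs.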

Data Preprocessing
The 28,904 selected abstracts were cleaned by removing punctuation and alphanumeric characters. Singular and plural uppercase abstract sectioning keywords such as "BACKGROUND," "OBJECTIVE," "METHOD," "RESULT," and "CONCLUSION" were also removed. The data cleaning was performed using the Python programming language in a Jupyter Notebook environment. The Python libraries used to clean the data include pandas, NumPy, langdetect, re, string, and TextBlob. The abstracts were then converted to lowercase text. After that, we used the Python Natural Language Toolkit library to tokenize the abstracts and remove the stop words. We then applied the SnowballStemmer model to convert words to their stems. The clean text of the abstracts derived after applying the abovementioned preprocessing steps was used for clustering.
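A minimal sketch of this cleaning sequence, using only the standard library, is given below. Several details are assumptions rather than the authors' implementation: the stop word set is a tiny illustrative stand-in for the Natural Language Toolkit list, digits are dropped as one reading of the character removal step, and stemming is left out because it requires the Natural Language Toolkit. The names clean_abstract, SECTION_LABELS, and STOP_WORDS are hypothetical.

```python
import re
import string

# Section labels named in the text, singular or plural, uppercase only.
SECTION_LABELS = re.compile(
    r"\b(?:BACKGROUND|OBJECTIVE|METHOD|RESULT|CONCLUSION)S?\b"
)
# Tiny illustrative stop word set; a real list is much longer.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "was", "were", "for", "with"}
# Map every punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_abstract(text):
    """Return the cleaned, lowercased tokens of one abstract."""
    text = SECTION_LABELS.sub(" ", text)  # must run before lowercasing
    text = text.translate(PUNCT_TO_SPACE)  # strip punctuation
    text = re.sub(r"\d+", " ", text)       # drop digits (assumed reading)
    tokens = text.lower().split()          # lowercase and tokenize
    return [t for t in tokens if t not in STOP_WORDS]
```

The order matters: the section labels are matched case-sensitively, so they have to be removed before lowercasing, otherwise an ordinary word such as "background" in running text would be deleted as well.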

Document Clustering
For document clustering, we first converted each document (ie, abstract) to a feature vector, where features were defined by term (ie, word) frequency-inverse document frequency (TF-IDF) weights. TF-IDF represents the importance of a word relative to a document in a corpus. This importance increases proportionally to the number of times the word appears in the document but is offset by the frequency of that word in the corpus. This ensures that TF-IDF-based similarity measures between documents are influenced mainly by discriminative words with relatively low frequencies in the corpus [22]. For the TF-IDF representation of the abstracts, we used the TfidfVectorizer class of the Python scikit-learn library.
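To make the weighting concrete, the following sketch computes TF-IDF vectors for tokenized documents in plain Python. It uses raw term counts, the smoothed inverse document frequency ln((1 + N) / (1 + df)) + 1, and L2 normalization, which are the defaults of scikit-learn's TfidfVectorizer; whether the authors kept those defaults is not stated, so this is an assumption. The function name tfidf is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        weights = {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in counts.items()
        }
        # L2-normalize so that document length does not dominate similarity.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append(
            {t: w / norm for t, w in weights.items()} if norm else weights
        )
    return vectors


docs = [["virus", "spread", "model"], ["virus", "vaccine"], ["virus", "protein", "protein"]]
vectors = tfidf(docs)
# "virus" occurs in every document, so within the second document
# the rarer term "vaccine" receives the larger weight.
```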
The TfidfVectorizer has two important threshold parameters that cut off low and high word frequencies. The minimum document frequency parameter (min_df) was set to 10 to ignore sporadic terms occurring in fewer than 10 documents (absolute count). The maximum document frequency parameter (max_df) was set to 0.9 to ignore terms that appear in more than 90% (26,014/28,904) of the documents (relative count). The reason is that we wished to exclude terms that are either too rare to be useful in finding document clusters or too common to discriminate between documents. Based on these parameters (min_df and max_df), the TfidfVectorizer extracted a vocabulary of 42,061 unique terms, so that each of the 28,904 abstracts was represented by a vector of 42,061 TF-IDF scores. This generated a feature matrix of size 28,904×42,061, which was subsequently fed into a clustering algorithm. We used the k-means clustering algorithm from Python's scikit-learn library to categorize the abstracts into internally coherent but well-separated clusters. We used the elbow method to determine the number of clusters in the corpus [23] and found 26 to be the optimal number of clusters.
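The elbow method can be illustrated with a small, self-contained k-means implementation. The sketch below is not the scikit-learn routine the authors used: it runs plain Lloyd iterations with a deterministic farthest-first initialization (an assumption chosen so the example is reproducible) and reports the within-cluster sum of squares for each candidate k, the quantity that scikit-learn's KMeans exposes as inertia_. The function names and the toy points are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_farthest_first(points, k):
    """Deterministic seeding: repeatedly pick the point farthest from all seeds."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans_inertia(points, k, n_iter=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    centroids = init_farthest_first(points, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments no longer change
        centroids = new_centroids
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def elbow_curve(points, k_values):
    """Inertia for each candidate number of clusters."""
    return {k: kmeans_inertia(points, k) for k in k_values}


# Three well-separated toy groups: the curve drops sharply up to k=3.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
curve = elbow_curve(points, range(1, 6))
```

Plotting the curve against k and looking for the point where the decrease flattens gives the "elbow". On a real corpus the bend is usually much less sharp than in this toy example, so the choice of k involves some judgment.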

Search Results
By July 21, 2020, the CORD-19 dataset comprised 196,630 articles (Figure 1). Of those, we excluded 167,726 articles for the following reasons: (1) abstracts were unavailable (n=56,300); (2) the articles were published before January 1, 2020 (n=99,665); (3) the articles were written in a language other than English (n=587); (4) the articles were not related to COVID-19, as their titles and abstracts did not contain our search terms (n=10,364); and (5) they had duplicate entries (n=810). Consequently, we included 28,904 articles in the analysis in this study.
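The exclusion counts reported above are internally consistent, which a short check confirms. The variable names are illustrative only.

```python
retrieved = 196_630
excluded = {
    "abstract unavailable": 56_300,
    "published before January 1, 2020": 99_665,
    "not written in English": 587,
    "no COVID-19 search term in title or abstract": 10_364,
    "duplicate entry": 810,
}

total_excluded = sum(excluded.values())
included = retrieved - total_excluded
print(total_excluded, included)  # 167726 28904
```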

Characteristics of Publications
The first paper was published on January 2, 2020. As shown in Figure 2, the number of publications in each week increased considerably since then, until a peak was reached in week 22 (2276 publications). Thereafter, the number of research papers published began to decrease. The mean number of publications for each week was 990 (SD 789.3). The country of publication was identified for 17,270 publications, which were conducted across 221 countries and territories. The country that published the highest number of articles was China (2950/17,270, 17.08%), followed by the United States (1357/17,270, 7.86%), Italy (1157/17,270, 6.70%), Saudi Arabia (978/17,270, 5.66%), and India (854/17,270, 4.94%) (Table 1).
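Weekly statistics of this kind can be derived from the publication dates alone. The sketch below shows one way to do so under stated assumptions: weeks are counted in 7-day blocks starting on January 1, 2020, and the SD is the sample standard deviation. The text does not say which week convention or which SD variant the authors used, and the function names are hypothetical.

```python
from collections import Counter
from datetime import date
from statistics import mean, stdev

START = date(2020, 1, 1)


def week_number(d):
    """1-based week index counted from January 1, 2020 (assumed convention)."""
    return (d - START).days // 7 + 1


def weekly_summary(dates):
    """Return per-week counts plus their mean and sample SD."""
    counts = Counter(week_number(d) for d in dates)
    # Include weeks without any publication as zeros.
    per_week = [counts.get(w, 0) for w in range(1, max(counts) + 1)]
    return per_week, mean(per_week), stdev(per_week)


# Share of publications with a known country that came from China.
china_share = round(100 * 2950 / 17_270, 2)  # 17.08
```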
The selected articles were published in about 2500 journals. The highest number of articles were published in bioRxiv (n=1374), the most prominent preprint server for biology. The top 10 sites for publishing COVID-19-related articles (journals and preprint servers) are shown in Table 2. The publications included in this analysis were authored by 150,600 authors. Among those authors, Lei Liu published the highest number of articles (n=46; see Table 3). Based on titles and abstracts alone, we were able to identify 1515 surveys, 733 systematic reviews, 512 cohort studies, 480 meta-analyses, 362 randomized controlled trials, 199 case studies, 79 scoping reviews, and 62 case-control studies (Table 4). Note that these numbers include only the top 8 study methods for those publications that mention the study method in either the abstract or the title.

Overview
The analysis generated 26 clusters from the included publications. We were able to identify the topic of 21 clusters, whereas the remaining 5 clusters were not labeled as they contained publications with very diverse topics that belonged to other clusters. Therefore, publications in these 5 clusters were moved to the most appropriate cluster among the 21 clusters. Four of the 21 clusters contained publications addressing only two different topics; thus, we further merged the 4 clusters to form 2 different clusters. Overall, we identified 19 different topics addressed in the included publications (Table 5).

Topic 1: Public Health Response
This topic was addressed by 18.66% (5393/28,904) of the publications. The publications in this cluster mainly discussed how public health authorities in various countries responded to the COVID-19 pandemic (eg, [24][25][26][27][28]). The top 5 authors in terms of the highest number of publications related to this topic were Claudine McCarthy (n=8), Valerie A Canady (n=6), Alison Knopf (n=6), Alimuddin Zumla (n=6), and Nima Rezaei (n=6). The top 5 journals and preprint servers publishing the highest number of articles related to this topic were the International Journal of Environmental Research and Public Health (n=82), Science of the Total Environment (n=80), New Scientist (n=56), Journal of Medical Virology (n=53), and bioRxiv (n=50). The first paper related to this topic was published on January 10, 2020. The number of publications in each week increased significantly until it reached a peak in week 23 (n=434); it then decreased noticeably (Multimedia Appendix 1). The mean number of weekly publications in this cluster was 183.6 (SD 151.5).

Topic 2: Clinical Care Practices During the COVID-19 Pandemic
A total of 17.71% (5118/28,904) of all included publications were mainly about clinical care practices for non-COVID-19 patients during the COVID-19 pandemic (eg, [29][30][31][32][33]). The following authors published the highest number of publications related to this topic: Karthik Rajasekaran (n=14), Francesco Esperto (n=12), Raju Vaishya (n=9), Namrata Sharma (n=8), and Santosh G Honavar (n=8). The top 5 journals publishing articles related to this topic were Otolaryngology-Head and Neck Surgery (n=115), the Journal of the European Academy of Dermatology and Venereology (n=45), Cureus Journal of Medical Science (n=41), Anaesthesia (n=40), and World Neurosurgery (n=35). In this cluster, the first article was published on January 3, 2020. There was a considerable rise in the number of weekly publications from week 12 until it reached a peak in week 23 (n=479); this was followed by a sharp decrease (Multimedia Appendix 1). The mean number of weekly publications in this cluster was 175.2 (SD 159.6).

Topic 4: Epidemic Models for COVID-19 Spread
A total of 10.25% (2964/28,904) of the included publications were related to this topic (eg, [39][40][41][42][43]). The 5 most prominent authors in this cluster were Gerardo Chowell (n=22), Benjamin J Cowling (n=18), Kenji Mizumoto (n=14), Shi Zhao (n=13), and Rosalind M Eggo (n=13). The most common journals and preprint servers where the articles related to this topic were published included Chaos, Solitons & Fractals (n=73), medRxiv (n=66), the International Journal of Infectious Diseases (n=36), Zhonghua liuxingbingxue zazhi (n=30), and bioRxiv (n=26). The first paper related to this topic was published on January 19, 2020. Although there was a sharp increase in the number of weekly publications between weeks 12 and 15, the trend was almost stable from week 15 to week 22. Then, a rapid decline in the number of weekly publications was noticed (Multimedia Appendix 1). The mean number of weekly publications in this cluster was 101.6 (SD 68.2).

Topic 5: Therapies and Vaccines for COVID-19
In all, 6.38% (1845/28,904) of the publications were about the development and repurposing of therapies and vaccines for COVID-19 (eg, [44][45][46][47]). The following authors published the highest number of articles related to this topic: Wei Zhang (n=13), Xiuna Yang (n=9), Haitao Yang (n=9), Zihe Rao (n=9), and Yao Zhao (n=8). The journals and preprint servers publishing the highest number of studies in this cluster were bioRxiv (n=174), the Journal of Biomolecular Structure and Dynamics (n=74), Trials (n=49), the Journal of Medical Virology (n=20), and Clinical Pharmacology & Therapeutics (n=14). In this cluster, the first article was published on January 6, 2020. The number of weekly publications increased dramatically from week 14 until a peak was reached in week 22 (n=144); thereafter, it decreased noticeably (Multimedia Appendix 1). The mean number of weekly publications in this cluster was 62.9 (SD 52.3).

Topic 6: Host Immune Response to 2019-nCoV
This topic was discussed in about 6.36% (1837/28,904) of the publications (eg, [48][49][50][51][52]). Authors who had the highest number of publications related to this topic were Alessandro Sette (n=7), Stanley Perlman (n=6), Nima Rezaei (n=6), Irfan Rahman (n=5), and Akiko Iwasaki (n=6). The top 5 journals and preprint servers in terms of publishing articles related to this topic were bioRxiv (n=199), Medical Hypotheses (n=50), the Journal of Medical Virology (n=49), Frontiers in Immunology (n=22), and the British Journal of Haematology (n=19). The earliest article related to this topic was published on January 2, 2020. From that date until week 14, there was a slight increase in the number of weekly publications before it increased markedly, peaking in week 25 (n=155) (Multimedia Appendix 1). The mean number of weekly publications in this cluster was 62.8 (SD 54.6).

Topic 8: Mental Health and Disorders During the COVID-19 Pandemic
This topic is about COVID-19-related mental health and disorders, which was explored by 3.17% (915/28,904) of the publications (eg, [58][59][60][61][62]). The top 5 authors in terms of number of publications related to this topic were Valerie A Canady (n=15), Mark D Griffiths (n=8), Stephen X Zhang (n=6), Zhilei Shang (n=5), and Modesto Leite Rolim Neto (n=5). The top 5 journals publishing studies related to this topic were Psychological Trauma: Theory, Research, Practice, and Policy (n=101); Psychiatry Research (n=48); the International Journal of Environmental Research and Public Health (n=37); the Journal of Affective Disorders (n=23); and Mental Health Weekly (n=23). In this cluster, the first article was published at the beginning of week 8. There was a considerable rise in the number of weekly publications from week 14 until a peak was reached in week 23 (n=94); this was followed by a steep decline (Multimedia Appendix 1). The mean number of weekly publications in this cluster was 31.1 (SD 30.9).

Topic 10: Social Distancing Measures
A total of 3% (868/28,904) of the articles discussed the topic of social distancing measures used to fight against the COVID-19 pandemic (eg, [68][69][70][71][72]). Authors who had the highest number of publications related to this topic were Lei Zhang (n=7), Adam J Kucharski (n=6), Amy Gimma (n=5), Gerardo Chowell (n=5), and Petra Klepac (n=4). The top 5 journals and preprint servers in terms of publishing articles related to this topic were medRxiv (n=28); Chaos, Solitons & Fractals (n=6); Morbidity and Mortality Weekly Report (n=5); Science (n=5); and Disaster Medicine and Public Health Preparedness (n=5). The earliest article related to this topic was published in week 7. There was a dramatic rise in the number of weekly publications between week 12 and week 19; thereafter, the trend was unstable from week 20 to week 29 (Multimedia Appendix 1). The mean number of weekly publications in this cluster was 29.8 (SD 26.8).

Topic 12: Protein Structures of 2019-nCoV
About 2.44% (706/28,904) of the included publications focused on structures and functions of 2019-nCoV proteins (eg, [78][79][80][81][82]). The top 5 authors in terms of the number of publications related to this topic were Ralph S Baric (n=14), Jason S McLellan (n=12), Shibo Jiang (n=10), James Brett Case (n=10), and Daniel Wrapp (n=10). The journals and preprint servers publishing the highest number of studies in this cluster were bioRxiv (n=333), the Journal of Virology (n=15), the Journal of Biomolecular Structure & Dynamics (n=15), Science (n=13), and the Journal of Medical Virology (n=11). The earliest study related to this topic was published on January 3, 2020. The mean number of weekly publications in this cluster was 24.2 (SD 18.6). The highest number of weekly publications was 68 in week 25 (Multimedia Appendix 1).

Topic 13: Host Cell Entry
Host cell entry for 2019-nCoV (via angiotensin-converting enzyme 2) was a key topic discussed in 2.02% (584/28,904) of the reviewed publications (eg, [83][84][85][86][87]). The 5 most prominent authors in this cluster were Serpil Erzurum (n=4), Giuseppe Lippi (n=4), Daniel Batlle (n=4), Hong Gao (n=4), and Claudio Cavallini (n=3). The most common journals and preprint servers in this cluster were bioRxiv (n=117), the Journal of Medical Virology (n=11), Medical Hypotheses (n=10), European Respiratory Journal (n=8), and medRxiv (n=7). The first article related to this topic was published in the middle of week 4. The number of weekly publications was almost stable between weeks 4 and 13. Thereafter, a sharp increase was noticed between weeks 14 and 16, but the trend was unstable from then until week 29 (Multimedia Appendix 1). The number of publications in week 20 was the highest (n=50). The mean number of weekly publications in this cluster was 19.9 (SD 16.6).

Topic 15: Detection of 2019-nCoV Antibodies
Detection of antibodies against 2019-nCoV using serological assays was a topic discussed in 1.42% (411/28,904) of all publications (eg, [92][93][94][95][96]). The top 5 authors writing about this topic were Florian Krammer (n=6), Jing Wang (n=6), Yong Zhang (n=5), Juan Chen (n=5), and Viviana Simon (n=5). The top 5 journals and preprint servers that published the highest number of studies in this cluster were bioRxiv (n=30), medRxiv (n=20), the Journal of Medical Virology (n=18), the Journal of Clinical Virology (n=14), and the Journal of Clinical Microbiology (n=7). Only 1 study in this cluster was published in the first 6 weeks. There was a dramatic increase in the number of weekly publications between weeks 16 and 21. Although the number of weekly publications slightly decreased from week 22 until week 26, it increased rapidly until reaching the peak in weeks 28 and 29 (n=36) (Multimedia Appendix 1). The mean number of weekly publications in this cluster was 14.1 (SD 12.8).

Topic 16: Personal Protective Equipment
Around 1.21% (350/28,904) of the publications focused on personal protective equipment in the COVID-19 era (eg, [97][98][99][100]). Authors who published the most in this cluster were Holly Seale (n=3), Keith K Wannomae (n=3), Lei Liao (n=3), Wang Xiao (n=3), and Steven Chu (n=3). The highest numbers of studies were published in the following journals and preprint servers: the American Journal of Infection Control (n=8), the Journal of Hospital Infection (n=7), Anaesthesia (n=6), the Journal of the European Academy of Dermatology and Venereology (n=6), ACS Nano (n=5), and medRxiv (n=5). No articles in this cluster were published before week 9. The mean number of weekly publications in this cluster was 11.9 (SD 11.6). The highest number of weekly publications was 34 in week 23 (Multimedia Appendix 1).

Topic 17: Diabetes Mellitus and COVID-19
Health care management, clinical characteristics, and risk factors for mortality of COVID-19 patients with diabetes were discussed in 1.16% (336/28,904) of the included articles (eg, [101][102][103][104]). The 5 most prominent authors in this cluster were Hui Wang (n=5), Sam Foster (n=4), Anoop Misra (n=4), Béatrice Bouhanick (n=3), and Kamlesh Khunti (n=3). The most common journals in this cluster were Diabetes Research and Clinical Practice (n=20), Diabetology & Metabolic Syndrome (n=17), the British Journal of Nursing (n=10), the Journal of the American Medical Directors Association (n=7), and Diabetes Technology & Therapeutics (n=6). Only 4 articles related to this topic were published between weeks 1 and 12. However, there was a substantial increase in the number of weekly publications from week 17 until the peak was reached in week 20 (n=35); this was followed by a slight decrease (Multimedia Appendix 1). The mean number of weekly publications in this cluster was 11.4 (SD 11.6).

Topic 18: Pregnancy and Childbirth During the COVID-19 Pandemic
About 1.08% (312/28,904) of the publications focused on numerous aspects of pregnancy and childbirth during the COVID-19 pandemic (eg, [105][106][107][108][109]). The most common authors writing about this topic were Ling Feng (n=7), Jiafu Li (n=6), Olivier Picone (n=5), Dunjin Chen (n=5), and Guoqiang Sun (n=5). The top 5 journals in terms of publishing articles in this cluster were the International Journal of Gynaecology and Obstetrics (n=18), The Journal of Maternal-Fetal & Neonatal Medicine (n=16), the American Journal of Obstetrics and Gynecology (n=20), Obstetrics and Gynecology (n=10), and the American Journal of Perinatology (n=9). The earliest article in this cluster was published on February 10, 2020. The mean number of weekly publications related to this topic was 10.8 (SD 8.9), and the highest number of weekly publications was 27 in week 21 (Multimedia Appendix 1).

Topic 19: Organ Transplantation During the COVID-19 Pandemic
Organ transplantation in the era of COVID-19 was a key topic in 0.76% (219/28,904) of the included articles (eg, [110][111][112][113]). The top 5 authors in terms of number of publications related to this topic were Paolo Cravedi (n=4), Zhishui Chen (n=4), Luciano De Carlis (n=4), Lai Wei (n=4), and Ashley Fan (n=3). The top 5 journals in terms of publishing articles related to this topic were the American Journal of Transplantation (n=69), Transplant Infectious Disease (n=35), Transplant International (n=11), Transplantation Proceedings (n=10), and Liver Transplantation (n=5). Only one study in this cluster was published before week 12. The mean number of weekly publications related to this topic was 7.5 (SD 8.3), and the highest number of weekly publications was 26 in week 24 (Multimedia Appendix 1).

Principal Findings
We found that 5.92% (1714/28,904) of the included published articles were hosted on preprint servers (bioRxiv or medRxiv). Although these servers are not the only preprint servers available in the academic publishing landscape (many journals publish articles online before they go into print, and we have also observed a rise of purely online journals), they are indicative of the pace with which new knowledge is made available by the international research community. Since articles on such preprint servers do not undergo formal peer review and are, thus, not regarded as publications in the traditional academic sense, many researchers are using these servers to make findings available and to solicit feedback from the international community before undergoing formal peer review by journals, a process that takes at least 2 months to get the submitted paper published.
Among the peer-reviewed journals, the Journal of Medical Virology has published the highest number of COVID-19-related articles (n=468). Aristovnik et al and Hossain also listed the Journal of Medical Virology in the top 5 journals publishing COVID-19-related articles [6,14]. The Journal of Medical Virology clearly stands out, as it has published more than twice the number of papers compared to the second-ranked journal, the International Journal of Environmental Research and Public Health (n=223). Aristovnik et al [6] listed the International Journal of Environmental Research and Public Health among the 10 top-ranked journals based on COVID-19-related research articles. The source normalized impact per paper (SNIP), in the year 2019, was 0.780 for the Journal of Medical Virology [114] and 1.248 for the International Journal of Environmental Research and Public Health [115], and the average time from the submission to the first decision was about 6 weeks [116] and 3 weeks [117], respectively. We believe the speed of the reviewing process of these journals may have motivated the authors to submit their work to these journals.
Considering the study methods, we found that the highest number of studies (n=1515) were surveys, followed by reviews (systematic review, scoping review, or meta-analyses), as shown in Table 4. As the number of research studies on COVID-19 is rapidly increasing, review articles are of utmost importance to summarize the ongoing effort and progress to combat COVID-19. We found case-control studies to be the lowest represented study design (n=62 only). We speculate that the lack of available data was the main reason for the scarcity of this type of research study. Interestingly, 362 randomized controlled trials in 7 months indicate the enormous effort made by the scientific community to combat this pandemic. Furthermore, we grouped the 19 topics addressed in the included studies into six thematic areas (summarized in Table 6). The dominant thematic clusters were "Clinical aspects" (29.17%) and "Epidemiology" (28.91%). The "Clinical aspects" theme covers multiple aspects of the clinical practices for patient care and risk factors related to COVID-19. It consists of two topics (ie, "clinical care practices for patients during the COVID-19 pandemic" and "clinical characteristics and risk factors of COVID-19"). Interestingly, the "Epidemiology" theme also comprises only two topics (ie, "Public health response" and "Epidemic models for COVID-19 spread"), further underscoring the dominance of these topics. The third most prominent theme "Therapeutics" (21.03%) comprises five topics, making it the most diverse theme; the topics in this theme range from "host cell entry" to drug discovery-related terms such as "Protein structures of 2019-nCoV" and "Virus genomics," as well as "Therapies and vaccines for COVID-19." This theme highlights the initiatives of the scientific community to discover drugs and vaccines and understand the underlying virus-host mechanism to pave the way for effective therapeutic solutions for COVID-19. Considering the severity of COVID-19, we believe there still may be a lack of publications in this theme, despite comprising slightly more than 20% of all articles. We believe that, as clinical practices and public health responses mature, this theme will receive more research articles in the near future.
Almost 10% of articles form the "Diagnostics" theme. This theme focuses on the diagnosis of COVID-19 based on PCR, radiological images, or antibodies. PCR is among the most accurate technologies to diagnose COVID-19 [114], which explains the numerous relevant publications. Due to the advancement of deep learning techniques, radiological image-based diagnosis is becoming more effective and has the potential to save time in clinical environments [115]. As a result, we observed a large number of publications on radiological image-based analyses, which is captured under the topic "Diagnosis of COVID-19 based on chest imaging." Antibodies, developed in hosts combatting the novel coronavirus, can be considered a detection mechanism that may play an important role complementary to PCR testing [116]. It can be very effective for the diagnosis of patients with asymptomatic COVID-19 or negative RT-PCR results [117]. We also noticed many publications on antibody responses against COVID-19, which are covered by the topic "Detection of 2019-nCoV antibodies".
The interplay between COVID-19 and related medical conditions is captured by the "Related conditions" theme. This theme not only comprises articles discussing related conditions caused by COVID-19 but also other conditions that may elevate the COVID-19 risk for the patients with those conditions. About 8% of all articles fall into this theme, covering topics such as mental disorder, diabetes, cancer, pregnancy, childbirth complications, and organ transplantation.
Only slightly more than 4% of all articles fall into the "Prevention" theme. This may be surprising, since prevention is of utmost importance while vaccines and treatments are still under development. However, we believe that this theme is not covered by more studies due to the recent wide acceptance and effectiveness of social distancing and personal protective equipment. Consequently, we expect the percentage of articles grouped in this theme to further reduce in the future. Further insights into the research landscape and the shift in themes over time are summarized in Multimedia Appendix 2.
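The theme percentages quoted in this discussion can be traced back to the per-topic counts in the Results by summing the topics that belong to each theme. The sketch below does this for the two themes whose topics are all reported in this section. It assumes that "Prevention" consists of exactly the social distancing and personal protective equipment topics, which the text implies but Table 6 would confirm. The variable names are illustrative.

```python
TOTAL = 28_904  # included publications

# Publication counts per topic, as reported in the Results.
topic_counts = {
    "Public health response": 5393,
    "Epidemic models for COVID-19 spread": 2964,
    "Social distancing measures": 868,
    "Personal protective equipment": 350,
}

# Assumed topic-to-theme grouping (see the note above).
themes = {
    "Epidemiology": ["Public health response", "Epidemic models for COVID-19 spread"],
    "Prevention": ["Social distancing measures", "Personal protective equipment"],
}

theme_share = {
    theme: round(100 * sum(topic_counts[t] for t in topics) / TOTAL, 2)
    for theme, topics in themes.items()
}
print(theme_share)  # {'Epidemiology': 28.91, 'Prevention': 4.21}
```

The Epidemiology figure reproduces the 28.91% stated above, and the Prevention figure matches the description of "slightly more than 4%".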
We noticed that biomedical informatics had a crucial role in several topics. For instance, clinical decision support systems were used in many studies to diagnose COVID-19 based on chest imaging. Telemedicine was also used in multiple studies to provide the required health care support for the patients during the COVID-19 pandemic. Further, mobile applications, including contact tracing apps, were one of the main social distancing measures described in these studies. AI-based models were used in multiple studies to predict protein structures of 2019-nCoV to understand the underlying mechanism of drug-target interaction. Many studies proposed novel AI-based models to discover COVID-19 drugs and vaccines and repurpose existing drugs approved by the Food and Drug Administration as a part of the treatment plan for COVID-19.

Strengths
To the best of our knowledge, our study covers the largest collection of COVID-19-related articles (N=28,904, after considering the inclusion and exclusion criteria) published over a period of 7 months (January to mid-July 2020). The main strength of this study is that it demonstrates the feasibility of mostly automated, AI-based data mining at scale. We believe this is of utmost importance because articles on COVID-19 are published faster than nonautomatic surveys can organize, analyze, and present them to the scientific community.

Limitations
All the publications were collected from only one specific database (CORD-19), so we may have missed some studies or preprints that were not included in this database. Given the substantial number of publications included in our analysis, we are confident that a large part of the COVID-19 literature was covered. Further, we did not conduct a detailed manual analysis of studies published in journals such as the Journal of Medical Internet Research and the International Journal of Medical Informatics to evaluate the use of eHealth technologies for COVID-19, which we believe is beyond the scope of our work and requires other study methodologies such as systematic or scoping reviews. However, readers are referred to several studies and reviews that have been conducted to explore eHealth technologies used in the fight against the COVID-19 pandemic [118][119][120][121].
We only considered articles published in the English language, which may introduce some bias in our analysis. Additionally, articles published after mid-July 2020 were not considered in this study. Moreover, due to the inherent limitation of bibliographic analysis, which enables only high-level profiling of the text from the corpus of literature, we cannot provide any evidence-based solution for the diagnosis and treatment of COVID-19. The articles that fall under each topic warrant further, more careful analysis.
For this study, we only analyzed article titles, abstracts, and author data. As a result, we could not identify the country of publication for 11,634 articles (around 40%). Although we removed duplicates, academic publications undergo subtle changes in form, such as revisions, preprints, and follow-up studies. Given that only the abstract and title were screened, we cannot rule out that some publications may have substantial overlaps. Finally, there exist ambiguities with respect to author and journal names because authors with the same name can only be resolved uniquely by affiliation or, in some cases, other identifiers such as ORCID. If such disambiguating identifiers are missing, automatic disambiguation is not possible. Likewise, journals may be referred to by a plethora of acronyms. Therefore, our study may have merged multiple authors into one person and split journals into multiple entities.
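The author ambiguity described above can be illustrated with a minimal sketch. It is not the method used in this study; the record fields (`name`, `orcid`, `affiliation`), the helper names, and the sample records are all hypothetical and serve only to show why records lacking identifiers collapse into one entity.

```python
from collections import defaultdict


def author_key(record):
    """Return a grouping key: the ORCID when present, else normalized name plus affiliation."""
    orcid = (record.get("orcid") or "").strip()
    if orcid:
        return ("orcid", orcid)
    # Fallback: lowercase and collapse whitespace (an assumed normalization).
    name = " ".join(record.get("name", "").lower().split())
    affiliation = " ".join((record.get("affiliation") or "").lower().split())
    return ("name", name, affiliation)


def group_authors(records):
    """Group author records that share the same key."""
    groups = defaultdict(list)
    for record in records:
        groups[author_key(record)].append(record)
    return dict(groups)


# Hypothetical records; the names and identifiers are invented for illustration.
records = [
    {"name": "Wei Zhang", "orcid": "0000-0000-0000-0001"},
    {"name": "Wei Zhang", "orcid": "0000-0000-0000-0002"},
    {"name": "Wei  Zhang"},  # no identifier and no affiliation
    {"name": "wei zhang"},   # indistinguishable from the record above
]
groups = group_authors(records)
print(len(groups))
```

In this sketch, the two records with distinct ORCIDs remain separate, whereas the two records without any identifier are merged, regardless of whether they refer to the same person.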

Practical Implications
This study demonstrates the feasibility of AI-based, largely automatic data mining of large corpora of academic publications. Given the pace and dynamics at which COVID-19 research is being conducted at this time, attempts to manually survey the literature will almost certainly fall behind the state of the art, unless specific and confined subtopics are under scrutiny. Automatic data mining, on the other hand, is hindered by the inherent noise in the data (eg, lack of ORCIDs and inability to track genesis and evolution of articles properly). This means that results accurate to every single paper and author are impossible to extract, unless publishers (eg, using blockchain technology) develop a way to trace genesis and evolution of individual publications and unique authors. It is therefore important to stress that such automatic textual analysis is performed at scale. At scale, we would argue that exact numbers may not matter as much as they once may have: being off by as many as 30 individual publications while screening almost 29,000 publications would still constitute a negligible fraction (about 0.1%) of the overall corpus. We, therefore, believe the numbers we have presented in this report to be aggregates that are representative of a vibrant and rapidly evolving research landscape and that they highlight trends and shifting interests in topics. Whereas automated data mining excels at providing up-to-date, broad overviews of the field as a whole, manual surveys excel at providing detailed overviews of specific topics. We, therefore, see this study as complementing previous manual reviews.

Research Implications
This study highlighted the effectiveness of AI methods in the analysis of a large corpus of literature, which researchers can use to perform machine learning-based bibliometric analysis of eHealth-related literature to explore the use of eHealth technologies for COVID-19.
Among the research themes summarized in Table 6, we see the most pressing need for more research in the "Therapeutics" theme, as clinical aspects and epidemiological aspects are better understood and best practices continue to be more commonly implemented. Consequently, we see the proportion of articles shifting toward this theme at the expense of the "Clinical Aspects" and "Epidemiology" themes, as well as the "Prevention" theme. The reason is that the impact of the related topics "Social distancing" and "Personal preventive equipment" should be well understood and implemented by now.
In the context of performing bibliometric reviews based on automatically extracted topics, the most important research challenge is to develop methods that are more robust against noise and can process not only abstracts, titles, and author lists, but also the full text of the publications. Although computationally extremely demanding, this would allow assessing any overlaps between publications, which is a stronger measure than the binary decision of whether two publications are duplicates.
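One way to picture a graded overlap measure, as opposed to a binary duplicate decision, is the Jaccard similarity over word shingles. The following is a minimal sketch under that assumption; the shingle length `k=3`, the function names, and whitespace tokenization are illustrative choices and are not taken from this study.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (consecutive word tuples) in a text."""
    words = text.lower().split()
    if len(words) < k:
        # Texts shorter than k words yield a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_overlap(text_a, text_b, k=3):
    """Return the Jaccard similarity of the two shingle sets, a value in [0, 1]."""
    set_a, set_b = shingles(text_a, k), shingles(text_b, k)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)
```

A score of 1.0 indicates identical shingle sets and 0.0 indicates no shared shingles; intermediate values could flag revisions or preprints of the same work. Applying such a pairwise comparison to full texts across tens of thousands of articles would be costly, consistent with the computational demands noted above.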

Conclusions
This study provides a comprehensive overview of the COVID-19 literature. Specifically, we identified the main COVID-19-related topics addressed in the existing literature; weekly trends of publications; and top countries, authors, and publishers. This study will help the research community to understand the evolution of the COVID-19-related literature; prioritize research needs; and recognize the leading researchers, institutes, countries, and publishers for each topic. AI-based bibliometric analysis has the potential to rapidly explore large corpora of academic publications. Publishers should reduce noise in the data by developing a way to trace the evolution of individual publications and unique authors.

Table 6.
Topics grouped by thematic cluster, including the percentage of articles by topic and cluster.