This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Natural language processing (NLP) is an important traditional field in computer science, but its application in medical research has faced many challenges. With the extensive digitalization of medical information globally and increasing importance of understanding and mining big data in the medical field, NLP is becoming more crucial.
The goal of this research was to systematically review the use of NLP in medical research in order to understand global progress in terms of research outcomes, content, methods, and the study groups involved.
A systematic review was conducted using the PubMed database as a search platform. All published studies on the application of NLP in medicine (except biomedicine) during the 20 years between 1999 and 2018 were retrieved. The data obtained from these published studies were cleaned and structured. Excel (Microsoft Corp) and VOSviewer (Nees Jan van Eck and Ludo Waltman) were used to perform bibliometric analysis of publication trends, author orders, countries, institutions, collaboration relationships, research hot spots, diseases studied, and research methods.
A total of 3498 articles were obtained during initial screening, and 2336 articles were found to meet the study criteria after manual screening. The number of publications increased over the study period, with marked growth after 2012 (annual counts from 2012 onward ranged from 148 to a maximum of 302). The United States has held the leading position since the inception of the field, contributing 63.01% (1472/2336) of all publications, followed by France (5.44%, 127/2336) and the United Kingdom (3.51%, 82/2336). The author with the largest number of articles published was Hongfang Liu (70), while Stéphane Meystre (17) and Hua Xu (33) published the largest number of articles as first author and corresponding author, respectively. Among first-author institutions, Columbia University published the largest number of articles, accounting for 4.54% (106/2336) of the total. Approximately one-fifth (17.68%, 413/2336) of the articles involved research on specific diseases, and the subject areas primarily focused on mental illness (16.46%, 68/413), breast cancer (5.81%, 24/413), and pneumonia (4.12%, 17/413).
NLP is in a period of robust development in the medical field, with an average of approximately 100 publications annually. Electronic medical records were the most frequently used research material, but social media such as Twitter have become an important source of research material since 2015. Cancer (24.94%, 103/413) was the most common subject area in NLP-assisted medical research on diseases, with breast cancers (23.30%, 24/103) and lung cancers (14.56%, 15/103) accounting for the highest proportions of studies. Columbia University and the researchers trained there were the most active and prolific contributors to NLP research in the medical field.
Natural language processing (NLP) refers to the ability of machines to understand and explain the way humans write and talk. It involves studying various theories and methods that can realize effective communication between humans and computers in natural language and is an important direction in the field of artificial intelligence [
In modern medical care, electronic health record (EHR) and electronic medical record (EMR) systems are undergoing rapid and large-scale development [
With the rapid development of NLP in the medical field, there is a constant increase in the number of NLP-related articles, which has led to the accumulation of a substantial amount of research findings. Analyzing these articles can indirectly reflect the dynamic progress of NLP development in the medical field. Moreover, the results of the analysis can provide various benefits to academia, especially to scholars who are interested in pursuing careers in specific areas. Regarding the analysis and research, the studies by Cobo et al [
Previous studies have analyzed and summarized the applications of NLP in the medical field. For example, Chen et al [
Other previously published studies [
PubMed is an important search engine. The source of the PubMed database is MEDLINE, and the core topic is medicine. The objective of this study was to collect academic articles on the application of NLP in medicine. Therefore, PubMed was selected as the search engine in this study. On the PubMed platform, the search strategy was (“natural language processing” [all fields] OR NLP [all fields]) AND (medical [all fields] OR health [all fields] OR clinical [all fields]), automatically translated by PubMed to: ((“natural language processing” [MeSH terms] OR (“natural” [all fields] AND “language” [all fields] AND “processing” [all fields]) OR “natural language processing” [all fields]) OR NLP [all fields]) AND (medical [all fields] OR (“health” [MeSH terms] OR “health” [all fields]) OR clinical [all fields]), and the time period spanned from 1999 to 2018.
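The search expression above can also be written down programmatically, which makes the strategy easier to reproduce. The following is a minimal sketch, not the authors' procedure: the study describes a search on the PubMed platform, and nothing in the text says an API was used. The sketch assumes the NCBI E-utilities ESearch endpoint and its `db`, `term`, `datetype`, `mindate`, `maxdate`, `retmax`, and `retmode` parameters; the function names are illustrative. It only builds the request URL and does not contact the server.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (assumption: the study itself used the PubMed web platform).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query() -> str:
    """Return the search expression as entered, before PubMed's automatic translation."""
    topic = '("natural language processing"[all fields] OR NLP[all fields])'
    domain = "(medical[all fields] OR health[all fields] OR clinical[all fields])"
    return f"{topic} AND {domain}"


def build_esearch_url(start_year: int = 1999, end_year: int = 2018, retmax: int = 100) -> str:
    """Build an ESearch URL restricted to publication dates in the study period."""
    params = {
        "db": "pubmed",
        "term": build_query(),
        "datetype": "pdat",          # filter on publication date
        "mindate": str(start_year),
        "maxdate": str(end_year),
        "retmax": retmax,            # number of PMIDs returned per request
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"
```

Calling `build_esearch_url()` yields a URL whose response would list matching PubMed identifiers; the count returned today would not necessarily equal the 3498 articles reported here, because the database changes over time.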
All published studies on the application of NLP in medicine (except biomedicine) during the 20 years between 1999 and 2018 were retrieved. A total of 3498 articles were retrieved. The articles were screened according to the following exclusion criteria:
Articles with indeterminate content were excluded: PubMed articles without abstracts, and articles whose abstracts did not contain the term NLP and whose full text could not be found.
Review and comment articles were excluded.
Articles with content unrelated to NLP were excluded: for example, articles in which NLP stood not for natural language processing but for terms such as neurolinguistic programming, no light perception, or ninein-like protein, and articles that mentioned NLP only as previous or future work while the main content was unrelated to NLP.
As the subject of this study was the application of NLP in medicine and diseases, articles on molecular biomedicine, such as studies on protein-protein interactions in biomedical studies [
The first three steps of the screening process were mainly completed by JW, and the last step of screening was jointly completed by JW and HD. In cases of discordance during the screening process on whether the article belonged to the molecular biomedical category, the two authors would review the full text and come to an agreement through discussion. We followed Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [
Preferred Reporting Items for Systematic Reviews and Meta-Analyses flow diagram depicting the screening procedure for articles on natural language processing (NLP) in the medical field.
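The first three exclusion criteria can be approximated with simple rules, which may help when prescreening a large export before manual review. The sketch below is an illustration only: the screening in this study was performed manually, and the `Record` fields, the string matching, and the reason labels are all hypothetical. The fourth criterion (molecular biomedicine) relied on reviewer judgment and is deliberately left out.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Other expansions of "NLP" named in the exclusion criteria.
OTHER_EXPANSIONS = ("neurolinguistic programming", "no light perception", "ninein-like protein")


@dataclass
class Record:
    abstract: Optional[str]
    publication_type: str = "article"   # hypothetical field: "article", "review", or "comment"
    full_text_available: bool = True    # hypothetical field


def exclusion_reason(rec: Record) -> Optional[str]:
    """Return the first matching exclusion criterion, or None to pass the record on to manual review."""
    # Criterion 1: indeterminate content.
    if not rec.abstract:
        return "indeterminate content"
    text = rec.abstract.lower()
    mentions_nlp = "natural language processing" in text or re.search(r"\bnlp\b", text) is not None
    if not mentions_nlp and not rec.full_text_available:
        return "indeterminate content"
    # Criterion 2: review and comment articles.
    if rec.publication_type in {"review", "comment"}:
        return "review or comment"
    # Criterion 3: NLP used with a different meaning.
    if any(term in text for term in OTHER_EXPANSIONS) and "natural language processing" not in text:
        return "NLP has another meaning"
    return None
```

A rule set like this can only flag candidates; deciding whether NLP is merely mentioned as previous or future work still requires reading the article.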
The following information was extracted from eligible articles: year of publication, journal name in which the article was first published, all authors, first author, corresponding author, first author’s affiliation institution (and department), first author’s country, research tasks of NLP in the article, and disease type discussed in the article. The obtained data were input into Excel 2016 (Microsoft Corp) for data analysis and processing. Excel and VOSviewer were used in this study for the qualitative and quantitative analyses of author co-occurrences, keywords, and disease types, which helped compile and summarize the characteristics of the development of the medical NLP field in detail. The cutoff date for data collection was December 31, 2018.
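The extracted fields map naturally onto one flat record per article. The following sketch shows one possible layout, assuming a CSV file as the interchange format for a spreadsheet; the class name, field names, and separator convention are illustrative and are not taken from the study.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import List


@dataclass
class ArticleRecord:
    year: int
    journal: str
    authors: str                    # all authors, separated by "; " (assumed convention)
    first_author: str
    corresponding_author: str
    first_author_institution: str
    first_author_department: str
    first_author_country: str
    nlp_task: str
    disease: str                    # empty when no specific disease was studied


def to_csv(records: List[ArticleRecord]) -> str:
    """Serialize records to CSV text with one header row, ready to open in a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(ArticleRecord)])
    writer.writeheader()
    for rec in records:
        writer.writerow(asdict(rec))
    return buffer.getvalue()
```

Keeping one row per article makes the later tallies (by year, journal, author, country, institution, department, task, and disease) simple grouping operations over a single table.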
Of the 2336 articles that met the study criteria, the time period spanned from 1999 to 2018. The overall trend (
Graph showing the number of articles published over time.
A total of 2336 articles were published in 412 journals.
Medical natural language processing journal rankings (n=2336).
Rank | Journal or proceedings | Publications, n (%) |
1 | Studies in Health Technology and Informatics | 408 (17.47) |
2 | AMIA Annual Symposium Proceedings | 386 (16.53) |
3 | Journal of the American Medical Informatics Association | 256 (10.96) |
4 | Journal of Biomedical Informatics | 223 (9.55) |
5 | International Journal of Medical Informatics | 54 (2.31) |
6 | BMC Medical Informatics and Decision Making | 50 (2.14) |
7 | BMC Bioinformatics | 43 (1.84) |
8 | AMIA Joint Summits on Translational Science Proceedings | 31 (1.33) |
9 | PLOS ONE | 31 (1.33) |
10 | Journal of Digital Imaging | 30 (1.28) |
This study screened for the first author, corresponding author, and contributing authors of each article. The top 10 authors in each category are presented in
Rank of top authors by number of articles published and the most articles published as the first plus corresponding author.
Rank (all roles) | Author | Publications (first + corresponding + coauthor) | Publications (first + corresponding) | Rank (first + corresponding) |
1 | Hongfang Liu | 70 | 21 (7+14) | 6 |
2 | Hua Xu | 66 | 48 (15+33) | 1 |
3 | Joshua C Denny | 64 | 26 (12+14) | 4 |
4 | Carol Friedman | 60 | 20 (6+14) | 7 |
5 | Wendy W Chapman | 55 | 25 (11+14) | 5 |
6 | Guergana Savova | 45 | — | — |
6 | Christopher G Chute | 45 | — | — |
8 | Serguei Pakhomov | 43 | — | — |
9 | Özlem Uzuner | 37 | — | — |
9 | George Hripcsak | 37 | — | — |
9 | Thomas C Rindflesch | 37 | — | — |
— | Stéphane Meystre | — | 32 (17+15) | 2 |
— | Özlem Uzuner | — | 30 (16+14) | 3 |
Top first authors and corresponding authors.
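Author designation | Author | Rank | Publications |
First author | Stéphane Meystre | 1 | 17 |
First author | Özlem Uzuner | 2 | 16 |
First author | Hua Xu | 3 | 15 |
First author | Louise Deleger | 4 | 13 |
First author | Joshua C Denny | 5 | 12 |
First author | Serguei Pakhomov | 5 | 12 |
First author | Wendy W Chapman | 7 | 11 |
First author | Sunghwan Sohn | 8 | 10 |
First author | Li Zhou | 9 | 9 |
First author | Guergana Savova | 9 | 9 |
Corresponding author | Hua Xu | 1 | 33 |
Corresponding author | Stéphane Meystre | 2 | 15 |
Corresponding author | Özlem Uzuner | 3 | 14 |
Corresponding author | Carol Friedman | 3 | 14 |
Corresponding author | Hongfang Liu | 3 | 14 |
Corresponding author | Wendy W Chapman | 3 | 14 |
Corresponding author | Joshua C Denny | 3 | 14 |
Corresponding author | Imre Solti | 8 | 11 |
Corresponding author | Genevieve B Melton | 9 | 10 |
Corresponding author | Hong Yu | 9 | 10 |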
This study first analyzed the countries in which the first authors’ institutions were located. The top 10 countries and the articles published are listed in
Ranking of the first author’s countries (top 10, n=2336).
Rank | Country | Publications, n (%) |
1 | United States | 1472 (63.01) |
2 | France | 127 (5.44) |
3 | United Kingdom | 82 (3.51) |
4 | China | 71 (3.04) |
5 | Germany | 57 (2.44) |
6 | Australia | 56 (2.40) |
7 | Japan | 52 (2.23) |
8 | Switzerland | 44 (1.88) |
9 | Canada | 33 (1.41) |
10 | Spain | 28 (1.20) |
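The percentages in this table are each country's count divided by the 2336 eligible articles. As a worked check, the short sketch below recomputes them from the counts listed above; the function and variable names are illustrative.

```python
TOTAL_ARTICLES = 2336  # eligible articles after screening

# First-author country counts copied from the table above.
COUNTRY_COUNTS = {
    "United States": 1472,
    "France": 127,
    "United Kingdom": 82,
    "China": 71,
    "Germany": 57,
    "Australia": 56,
    "Japan": 52,
    "Switzerland": 44,
    "Canada": 33,
    "Spain": 28,
}


def share(count: int, total: int = TOTAL_ARTICLES) -> str:
    """Format count/total as a percentage with two decimals, as in the tables."""
    return f"{100 * count / total:.2f}"


top10_total = sum(COUNTRY_COUNTS.values())

for country, n in COUNTRY_COUNTS.items():
    print(f"{country}: {n} ({share(n)})")
print(f"Top 10 combined: {top10_total} ({share(top10_total)})")
```

The same helper can be pointed at any of the other ranking tables by swapping in their counts.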
Trend in the number of articles published over 20 years in the top five countries with the most articles published.
This study analyzed the relevant data on the institutions from which the articles were published. Specifically, the primary institutions to which the first authors belonged were analyzed (
Ranking of institutions to which the first authors belonged (n=2336).
Rank | Institution name | Publications, n (%) |
1 | Columbia University | 106 (4.54) |
2 | University of Utah | 97 (4.15) |
3 | Mayo Clinic | 90 (3.85) |
4 | Vanderbilt University | 59 (2.53) |
5 | National Library of Medicine | 57 (2.31) |
6 | Brigham and Women’s Hospital | 52 (2.24) |
7 | University of California | 47 (2.01) |
8 | University of Pittsburgh | 38 (1.63) |
9 | Massachusetts General Hospital | 37 (1.58) |
10 | University of Minnesota | 32 (1.37) |
This study evaluated the professional background of the first authors and analyzed the departments to which the first authors belonged, with the aim of observing the overall development of NLP in the medical field across the broad range of the discipline. As the statistical analysis of institutions in this study focused on the primary institutions to which the authors belonged, the analysis of departments also focused on departments of the primary institutions. If an author was affiliated with multiple departments, all departments were included in the statistical analysis.
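The counting rule for multiple affiliations can be sketched as follows. This is an assumption-laden illustration rather than the authors' spreadsheet procedure: each article is represented by the list of departments of its first author at the primary institution, and the sample data are invented.

```python
from collections import Counter
from typing import Iterable, List


def count_departments(first_author_departments: Iterable[List[str]]) -> Counter:
    """Tally departments, counting every department of a multiply affiliated first author."""
    counts: Counter = Counter()
    for departments in first_author_departments:
        # set() guards against the same department being listed twice for one article.
        counts.update(set(departments))
    return counts


# Hypothetical example: three articles, one with a first author in two departments.
sample = [
    ["Department of biomedical informatics"],
    ["Department of biomedical informatics", "Department of computer science"],
    ["Department of radiology"],
]
print(count_departments(sample).most_common())
```

Under this rule the department counts can add up to more than the number of articles, since one article may contribute to several departments.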
Distribution of departments to which the first authors belonged (n=2336).
Rank | Name of department | Publications, n (%) |
1 | Department of biomedical informatics | 334 (14.30) |
2 | Department of computer science | 141 (6.04) |
3 | Department of radiology | 75 (3.21) |
4 | Department of medical informatics | 55 (2.35) |
5 | Department of psychiatry | 37 (1.58) |
6 | Department of neuroscience | 35 (1.50) |
7 | Department of nursing | 30 (1.28) |
8 | Department of health sciences | 28 (1.20) |
9 | Department of medicine | 22 (0.94) |
10 | Department of health informatics | 19 (0.81) |
VOSviewer is a bibliometric analysis software for constructing and visualizing bibliometric maps. It was codeveloped by Nees Jan van Eck and Ludo Waltman of Leiden University in the Netherlands [
Analysis of keywords can indirectly reveal hotspots and changing trends in research topics, critical for understanding the development of this field [
(A) Network visualization of author co-occurrences analyzed using VOSviewer. A circle represents an author, the size of the circle represents the importance, and the thickness of the link connecting the circles represents the relatedness of the connections. Circles with the same color belong to the same cluster. (B) Overlay visualization generated in VOSviewer (Centre for Science and Technology Studies, Leiden University). A color closer to blue represents an earlier time and closer to red represents a time closer to 2018 (note: refer to
(A) Distribution of keywords. A circle represents an identified keyword, the size of the circle represents the importance, and the thickness of the link connecting the circles represents the relatedness of the connections among the keywords. Circles with the same color belong to the same cluster. (B) Changes in keywords over time. A color closer to blue represents an earlier time and closer to red represents a time closer to 2018 (note: refer to
This study found that 413 articles mentioned specific diseases studied using NLP, accounting for about one-fifth of the total number of articles. We conducted a comprehensive analysis of these articles to understand the type of disease information mined by NLP and how it was performed. This could provide a reference tool for the use of NLP when studying disease cases in the future.
Of the 413 articles, the categories of diseases studied using NLP are shown in
Ranking of disease categories based on studies that used natural language processing for the investigation of disease cases.
The temporal distribution of NLP research used to study diseases was analyzed in this study. As shown in
Temporal distribution of studies that used natural language processing for the investigation of disease cases (note: this figure shows the names of the top three diseases in studies that used natural language processing to investigate disease cases each year. Fewer than three disease types indicates that only one or two diseases were studied in the year. The term cancer in the figure indicates the article only mentioned the term cancer, without specifying the type of cancer).
Of the 413 articles that studied disease cases using NLP, the four countries with the most first authors were the United States (68.3%, 282/413), China (4.8%, 20/413), the United Kingdom (3.6%, 15/413), and Australia (3.1%, 13/413). This ranking was consistent with the total number of articles published by country. The use of NLP to study disease cases in these four countries was further investigated. As shown in
Distribution of diseases in studies that used natural language processing for the investigation of disease cases in the United States, China, United Kingdom, and Australia.
The abstracts of 2336 articles were analyzed in this study to explore the research tasks of NLP involved in each article. If the abstract did not mention the specific task of NLP, the full text was reviewed. If the task could not be clearly identified from the full text, the article would be excluded from the analysis. NLP tasks involved were undetermined in 73 articles.
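The order of review described above (abstract first, full text second, otherwise undetermined) amounts to a simple fallback. The sketch below illustrates that order only; in the study the judgment was made by human readers, so the `detect` callable and the toy keyword matcher are hypothetical stand-ins.

```python
from typing import Callable, Optional

UNDETERMINED = "undetermined"


def assign_task(
    abstract: Optional[str],
    full_text: Optional[str],
    detect: Callable[[str], Optional[str]],
) -> str:
    """Look for the NLP task in the abstract, then in the full text, else mark as undetermined."""
    for source in (abstract, full_text):
        if source:
            task = detect(source)
            if task is not None:
                return task
    return UNDETERMINED


def keyword_detect(text: str) -> Optional[str]:
    """Toy detector that matches a few task names literally (stand-in for reviewer judgment)."""
    lowered = text.lower()
    for task in ("information extraction", "text classification",
                 "information retrieval", "machine translation"):
        if task in lowered:
            return task
    return None
```

Articles that end up as undetermined correspond to the 73 articles reported above.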
The authors of this study referenced the content on NLP described in chapter 4 of
Top five ranks of the research tasks of natural language processing (NLP) in the medical field.
NLP research in the past 20 years could be divided into 3 phases: the lag period (1999-2004), with a yearly average of 30 (range 22 to 42) articles published; the slow growth period (2005-2011), with a yearly average of 89 (range 66 to 124) articles published; and the fast growth period (2012-2018), with a yearly average of 219 (range 148 to 302) articles published and a peak of 302 attained in 2015. Analysis by country showed that the United States has been the leader since the beginning of NLP development. Prior to 2008, only the United States, France, and Germany, with few exceptions, had conducted investigations in the field. Of the five countries shown in
This study identified the prominent authors who had made significant contributions to the NLP field, and we noted the following salient feature: the two authors with the highest number of publications, Hongfang Liu and Hua Xu, as well as Carol Friedman (ranked fourth) and George Hripcsak (ranked ninth), were all associated with Columbia University. Carol Friedman ranked fourth rather than first because quite a few of her articles concern methodology and biology, which were outside the scope of this study; she is nonetheless recognized as a leading pioneer in this field. Carol Friedman and George Hripcsak are currently at Columbia University, whereas Hongfang Liu and Hua Xu are both students of Carol Friedman. Among the most prolific authors by first plus corresponding authorship, Hua Xu (ranked first), Hongfang Liu (ranked sixth), and Carol Friedman (ranked seventh) were all associated with Columbia University. In addition, analysis of the first authors' institutions showed that Columbia University (106) was ahead of the University of Utah (97) in second place and the Mayo Clinic (90) in third place. These findings indicated that Columbia University and its students were the most active in the field of medical NLP research.
Notably, as shown in
Analysis by department showed that the top four departments were biomedical informatics, computer science, radiology, and medical informatics. These four disciplines mainly involve computer-based processing of highly integrated data and draw on interdisciplinary expertise, such as medical information. Researchers with professional backgrounds in these fields had evidently contributed significantly to the development of NLP. The study of NLP should be a key learning direction for future students majoring in these subjects.
Analysis of this study showed that the top disease type in disease research involving NLP was mental illness. The World Health Organization predicts that mental illness may become the third most common human disease in the world in the future, after heart disease and cancer [
The journal Lancet Oncology published global cancer statistics for young people aged 20 to 39 years in 2017: one million young people in the world are diagnosed with cancer each year, and breast cancer is the most commonly diagnosed cancer (20%) [
From 1999 to 2005, NLP was often used to study pneumonia cases. Our analysis showed that the main role of NLP in studies on pneumonia cases was the identification of pneumonia-related concepts from chest radiograph reports, or the use of NLP to complete automatic coding of pneumonia-related concepts. In addition, Jones et al [
Among disease research involving NLP, China ranked second regarding the number of articles published (20 articles).
According to the results of this study, and as shown in
Information extraction accounted for the highest proportion of all medical NLP tasks, at almost one-third, indicating its importance in NLP. Information extraction mainly refers to the use of computers to automatically extract a specific type of information (such as entities, relationships, and events) from a vast number of unstructured or semistructured texts and to form structured data [
Text classification, which is a process of automated text classification based on text content and the use of computers to automatically classify texts under a given classification system and classification criteria [
Syntactic analysis, also known as parsing in natural language, uses syntax and other relevant knowledge of natural languages to determine the functions of each component that constitutes an input sentence. This technology is used to establish a data structure and acquire the meaning of the input sentence [
Information retrieval refers to the query methods and processes for searching related documents required by users from an enormous number of documents using computer systems [
Machine translation refers to the automated translation of text or speech from one natural language to another using computer programs. Put simply, machine translation converts words in one natural language into words in another. More complex translations can be automated using corpora [
In this study, we conducted a bibliometric analysis and presented the development of NLP in the medical field over the past 20 years. While the United States continues to be the leader in the field, other countries such as China and the United Kingdom are also advancing rapidly. In recent years, NLP has increasingly been used to process information obtained from social media platforms; for example, studies have obtained information related to diseases and patient care from Twitter. Cancer has always been one of the greatest threats to human health, and the use of NLP to assist cancer research, for example in breast cancer and prostate cancer, has become a recent trend. Information extraction and syntactic parsing have consistently been popular tasks in the medical NLP field. Future studies will focus on how to better integrate these tasks into medical NLP research.
Network diagrams and analysis of keywords and collaboration among authors.
EHR: electronic health record
EMR: electronic medical record
NLP: natural language processing
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
This study was sponsored by the National Natural Science Foundation of China (grants #81771937 and #81871455).
JL developed the conceptual framework and research protocol for the study. JW and HD conducted the publications review, data collection, and analysis. BL, AH, TW, XZ, and JL interpreted the data, and LF ensured that the diseases were classified correctly. JW drafted the manuscript, and JL made major revisions. All authors approved the final version of the manuscript.
None declared.