Published in Vol 22, No 11 (2020): November

Leveraging Internet Search Data to Improve the Prediction and Prevention of Noncommunicable Diseases: Retrospective Observational Study


Original Paper

1School of Public Health, Tianjin Medical University, Tianjin, China

2Department of Epidemiology and Health Statistics, School of Public Health, Zhejiang University School of Medicine, Hangzhou, China

3School of Public Health, Yale University, New Haven, CT, United States

4Health Management Center, Tianjin Medical University General Hospital, Tianjin, China

5School of Nursing, Tianjin Medical University, Tianjin, China

6Department of Land Surveying and Geo-Informatics, The Hong Kong Polytechnic University, Hong Kong, China

7International Institute of Spatial Lifecourse Epidemiology, Hong Kong, China

*these authors contributed equally

Corresponding Author:

Yaogang Wang, MD, PhD

School of Public Health

Tianjin Medical University

No. 22, Qixiangtai Road

Heping District

Tianjin, 300070


Phone: 86 13820046130


Background: As human society enters an era of vast and easily accessible social media, a growing number of people are using the internet to search for and exchange medical information. Because internet search data can reflect population interest in particular health topics, they offer a new way to understand public health concerns about noncommunicable diseases (NCDs) and the role such data can play in NCD prevention.

Objective: We aimed to explore the association of internet search data for NCDs with published disease incidence and mortality rates in the United States and to understand public health concerns about NCDs.

Methods: We tracked NCDs by examining the correlations among the incidence rates, mortality rates, and internet searches in the United States from 2004 to 2017, and we established forecast models based on the relationship between the disease rates and internet searches.

Results: Incidence and mortality rates of 29 diseases in the United States were statistically significantly correlated with the relative search volumes (RSVs) of their search terms (P<.05). The goodness of fit of the multiple regression prediction models was closest to 1 for diabetes mellitus, stroke, atrial fibrillation and flutter, Hodgkin lymphoma, and testicular cancer; the coefficients of determination of their linear regression models for predicting incidence were 80%, 88%, 96%, 80%, and 78%, respectively. The coefficients of determination of their linear regression models for predicting mortality were 82%, 62%, 94%, 78%, and 62%, respectively.

Conclusions: An advanced understanding of search behaviors could augment traditional epidemiologic surveillance and could be used as a reference to aid in disease prediction and prevention.

J Med Internet Res 2020;22(11):e18998



One of the Sustainable Development Goals (SDGs) set by the United Nations General Assembly in 2015 was to reduce premature mortality from noncommunicable diseases (NCDs) by one-third by 2030 [1]. According to the World Health Statistics 2019 [2], NCDs have collectively caused 41 million deaths worldwide, equivalent to 71% of all global deaths. The majority of those deaths were caused by the following NCDs: cardiovascular disease (CVD), cancer, and diabetes. According to the statistics of the Global Burden of Disease (GBD) in 2017 [3], diabetes was the most common NCD in the United States and ischemic heart disease ranked second. In 2020, 1,806,590 new cancer cases and 606,520 cancer deaths are projected to occur in the United States [4].

In the era of social media, there is a current trend in the population for individuals to search the internet for information before they consult with specialists for recommendations [5-7]. Various social media and online communities that enable connectivity have unprecedented influence. They are expanding their reach into the health care domain [8-12]. Researchers have shown that patients with cancer, diabetes, and other chronic conditions search online before and after diagnosis [13]. According to a report by the American Community Survey [14], the percentage of households with a computer has increased almost tenfold since 1984. Like computer use, the percentage of households using the internet has also increased over time [14]. Google accounts for the vast majority of the US search engine market, reaching more than 80% of the population [15].

The application of search engine data to disease surveillance is a classic case of big data application. When Google Flu Trends (GFT) was first launched [16], it attracted widespread attention and was followed by many scholars. Previous studies have mainly used Google data to study influenza outbreaks [17-19]. Other studies have mined social network data (such as Facebook and Twitter) through text mining to identify patients’ concerns about drugs, examine their feelings, and understand public opinion [20]. Because real-time monitoring data for NCDs are lacking, few studies have used internet data to predict NCD trends [21,22]. Our study is the first to explore the correlation between online searches for many different types of NCDs and disease prevalence in the United States. We hypothesized that internet search behaviors could reflect people’s awareness of NCDs, and thus that the disease characteristics of NCDs (such as incidence and mortality rates) would correlate strongly with internet search frequency. Such data also provide a new information channel for grasping public health concerns and promoting NCD prevention.

Disease Data

We obtained the national incidence and mortality rates for NCDs in the United States from the GBD database for 14 years (from 2004 to 2017) [3]. In this study, we initially selected 31 types of NCDs, namely diabetes mellitus, ischemic heart disease, stroke, atrial fibrillation and flutter, prostate cancer, breast cancer, lung cancer, colon and rectal cancer, malignant skin melanoma, non-Hodgkin lymphoma, uterine cancer, cardiomyopathy and myocarditis, kidney cancer, pancreatic cancer, bladder cancer, leukemia, liver cancer, stomach cancer, lip and oral cavity cancer, brain and nervous system cancer, thyroid cancer, multiple myeloma, ovarian cancer, cervical cancer, esophageal cancer, larynx cancer, gallbladder and biliary tract cancer, Hodgkin lymphoma, testicular cancer, mesothelioma, and hypertensive heart disease.

Internet Search Data

Because internet search data are updated in real time, this study mainly considered monthly internet searches from the Google Trends website from 2004 to 2018 at the national level [23]. We downloaded monthly relative search volumes for each search term of each disease. The search data were downloaded from Google Trends in December 2019.

Selection of Search Terms

Different NCDs have different search terms, and each disease has a core search term. We determined the core search term based on the disease name in the GBD database. Google Trends has a “related searches” function [24]: after the core search term is entered, other search terms related to it appear in the “related searches” section at the bottom of the page, which contains up to 49 related search terms and sentences. This method determines the approximate scope of search term selection based on the object to be studied, which avoids the subjectivity of search term selection to a certain extent and minimizes the omission of core terms. After the primary selection of terms, the next step was to filter the search terms. Three types of terms were filtered out in this study. The first type was terms with meanings unrelated to the research object. After the primary selection, some terms were still irrelevant to the research object even if the phrase included the object to be studied, likely because some words have multiple meanings. The second type was terms with a small search volume; some of the included terms had zero search volume within the specified time frame. Because this study focused on differences over time, search terms were required to have a high search frequency throughout the entire period. The third type, terms without a statistically significant correlation with the disease rates, was removed during the statistical analysis.

Specific search terms are listed in Multimedia Appendix 1. The selected terms were not searched in quotes. Each data point represents the relative search volume (RSV) of a specific query term on a normalized scale of 0 to 100. To compare the relative popularity of query terms, each data point is divided by the total searches of the geographic location and time range it represents. For example, a region in which a specific query term accounts for a larger share of all searches has an RSV closer to 100. The internet search data used in this study are publicly available and anonymous and cannot be traced back to identifiable individuals.
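The normalization described above can be illustrated with a minimal sketch. Google Trends does not expose raw counts, so the monthly counts below are hypothetical, and the function name `relative_search_volume` is ours rather than part of any Google interface. The sketch assumes that each data point is the term's share of all searches, rescaled so that the largest share equals 100.

```python
def relative_search_volume(term_counts, total_counts):
    """Scale a term's share of all searches so that the peak period equals 100."""
    if len(term_counts) != len(total_counts):
        raise ValueError("both series need one value per period")
    # Share of all searches made in each period (hypothetical raw counts).
    shares = [term / total for term, total in zip(term_counts, total_counts)]
    peak = max(shares)
    if peak == 0:
        return [0 for _ in shares]
    # Rescale to 0-100 relative to the period with the highest share.
    return [round(100 * share / peak) for share in shares]


# Hypothetical monthly counts: the raw count peaks in month 2,
# but the share of all searches peaks in month 3.
rsv = relative_search_volume([20, 50, 30], [1000, 2000, 1000])
```

Because the scale is relative to the peak share, an RSV series describes how interest changes over time rather than how many searches were made.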

Statistical Analysis

We identified search terms for each NCD based on the above criteria. We then performed Pearson correlation analysis to evaluate the relationship between the known incidence and mortality rates of the NCDs and the RSVs, which allowed us to filter the search terms further. Terms that had no statistically significant correlation with the incidence and mortality rates were excluded from the subsequent analysis.
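A minimal sketch of this screening step follows, assuming one RSV value per observed rate. The study itself used SPSS and Stata; the Python functions, the term names, and the data here are illustrative only. Significance is judged with the statistic t = r·sqrt((n − 2)/(1 − r²)), so the critical value passed in must match the sample size (4.303 is the two-sided 5% critical value for 2 degrees of freedom, that is, n = 4).

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r = 0 with n - 2 degrees of freedom."""
    residual = 1.0 - r * r
    if residual <= 0.0:
        # A perfect correlation has an unbounded t statistic.
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / residual)


def screen_terms(rsv_by_term, rates, t_critical):
    """Keep the terms whose RSVs correlate significantly with the rates."""
    kept = []
    for term, rsv in rsv_by_term.items():
        r = pearson_r(rsv, rates)
        if abs(t_statistic(r, len(rates))) > t_critical:
            kept.append(term)
    return kept


# Hypothetical data: four observations per series, so df = 2.
rates = [1.1, 1.9, 3.2, 3.8]
rsv_by_term = {"term A": [1, 2, 3, 4], "term B": [1, 3, 2, 4]}
kept_terms = screen_terms(rsv_by_term, rates, t_critical=4.303)
```

In this toy example only "term A" passes the screen; "term B" is positively correlated with the rates but not significantly so at this sample size.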

We also considered dealing with the multicollinearity of the RSVs of the search terms. Multicollinearity refers to the distortion or inaccuracy of model estimation due to the high correlation between explanatory variables in a linear regression model. In this study, we calculated the correlation coefficients between the RSVs of the search terms of each disease.
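One simple way to inspect multicollinearity is to list the pairs of search terms whose RSVs are highly correlated with each other. The sketch below is an illustration under our own assumptions: the 0.8 threshold and the series are hypothetical, and the paper does not state which cutoff, if any, was applied.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def collinear_pairs(rsv_by_term, threshold=0.8):
    """Return (term, term, r) for every pair with |r| at or above the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(rsv_by_term.items(), 2):
        r = pearson_r(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Hypothetical RSV series for three search terms of one disease.
series = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [3, 1, 4, 1, 5],
}
flagged = collinear_pairs(series)
```

Flagged pairs carry nearly the same information, so keeping both in a regression model can make the individual coefficient estimates unstable.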

Based on the above steps, we established multiple linear regression models for each disease, and the RSVs of multiple search terms were used for prediction and analysis.

The general form of the multiple linear regression model can be expressed as follows:

y = β0 + β1x1 + β2x2 + ... + βkxk + ε,

where y is the incidence or mortality rate, x1, x2, ..., xk are the RSVs of the k search terms, β0, β1, β2, ..., βk are the parameters of the model, and ε is the error term. The error term reflects the influence of random factors on y, that is, the variability in y that cannot be explained by the linear relationship between x1, x2, ..., xk and y. In this study, we established two regression models for each NCD, one based on the correlation between RSVs and the incidence rate and the other based on the correlation between RSVs and the mortality rate.
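As an illustration of how such a model can be estimated, the sketch below fits the parameters by ordinary least squares, solving the normal equations with Gaussian elimination. It is not the SPSS or Stata procedure used in the study; the function names `fit_ols` and `predict` and the data are hypothetical, with each row holding the RSVs of two search terms and y standing in for a disease rate.

```python
def fit_ols(rows, y):
    """Estimate [b0, b1, ..., bk] for y = b0 + b1*x1 + ... + bk*xk."""
    X = [[1.0, *row] for row in rows]  # prepend the intercept column
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    M = [
        [sum(r[i] * r[j] for r in X) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(X, y))]
        for i in range(p)
    ]
    # Forward elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, p):
            factor = M[r][col] / M[col][col]
            for k in range(col, p + 1):
                M[r][k] -= factor * M[col][k]
    # Back substitution.
    beta = [0.0] * p
    for i in reversed(range(p)):
        known = sum(M[i][j] * beta[j] for j in range(i + 1, p))
        beta[i] = (M[i][p] - known) / M[i][i]
    return beta


def predict(beta, row):
    """Predicted rate for one set of RSVs."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 with no noise.
rsv_rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]]
rates = [9, 8, 19, 18, 29]
beta = fit_ols(rsv_rows, rates)
```

Once the parameters are estimated, `predict` can be applied to RSVs from a period without published rates, which mirrors how a fitted model yields a forecast.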

Statistical analysis was conducted using IBM SPSS software (version 22.0) and Stata (version 15; StataCorp LLC). Statistical significance was set at P<.05 (two-sided test).

Correlation Analysis

Table 1 and Multimedia Appendix 2 display the correlation coefficients between the incidence rates and the RSVs of all of the selected search terms for the NCDs. They also display the correlation coefficients between the mortality rates and the RSVs. We found statistically significant correlations between the rates and the RSVs, especially for five diseases with high incidence rates: diabetes mellitus, stroke, atrial fibrillation and flutter, Hodgkin lymphoma, and testicular cancer. For Hodgkin lymphoma, the RSV of each search term was negatively correlated with the incidence rate. For some diseases, we did not find statistically significant correlations between the RSVs and the mortality and incidence rates; these search terms were excluded from the subsequent analysis. If the correlations between the rates and RSVs of all search terms of a disease were found to be not statistically significant, the disease would be excluded from the analysis. For example, prostate cancer and hypertensive heart disease were excluded from further analysis because the RSVs of all of their search terms did not correlate with incidence and mortality rates at the same time.

The RSVs of the specific search terms were correlated with the incidence rates for all of the NCDs, with P values less than .05 (Multimedia Appendix 3). In addition, the RSVs of the specific search terms were correlated with the mortality rates for all of the NCDs. Multimedia Appendix 4 displays the cross correlation analysis results among search terms.

Table 1. Correlation coefficients between the incidence and mortality rates of diabetes mellitus, stroke, atrial fibrillation and flutter, Hodgkin lymphoma, and testicular cancer and their relative search volumes.

| Search term | Incidence rate: R | Incidence rate: P value | Mortality rate: R | Mortality rate: P value |
| --- | --- | --- | --- | --- |
| Search terms for diabetes mellitus | | | | |
| What is diabetes mellitus type 2 | 0.648 | <.001 | –0.636 | <.001 |
| What is type 2 diabetes | 0.746 | <.001 | –0.789 | <.001 |
| Causes of diabetes mellitus | –0.295 | <.001 | 0.250 | .001 |
| Signs of diabetes | 0.866 | <.001 | –0.900 | .001 |
| What is type 1 diabetes | 0.729 | <.001 | –0.756 | <.001 |
| Search terms for stroke | | | | |
| Signs of stroke in women | 0.893 | <.001 | –0.495 | <.001 |
| Symptoms of stroke in women | 0.609 | <.001 | –0.663 | <.001 |
| Signs of a stroke in women | 0.906 | <.001 | –0.447 | <.001 |
| Stroke symptoms in men | 0.805 | <.001 | –0.655 | <.001 |
| Minor stroke | 0.245 | .001 | 0.164 | .03 |
| Symptoms of mini stroke | 0.558 | <.001 | –0.223 | .004 |
| Signs of stroke in men | 0.887 | <.001 | –0.392 | <.001 |
| Signs of mini stroke | 0.733 | <.001 | –0.277 | .003 |
| What are the signs of a stroke | 0.745 | <.001 | –0.381 | <.001 |
| Search terms for atrial fibrillation and flutter | | | | |
| Atrial fibrillation | 0. | | | |
| Atrial fibrillation with rvrᵇ | 0.218 | .004 | 0.209 | .007 |
| Atrial fibrillation and stroke | –0.350 | <.001 | –0.359 | <.001 |
| Ablation of atrial fibrillation | –0.224 | .003 | –0.233 | .002 |
| Heart flutter | 0.880 | <.001 | 0.878 | <.001 |
| Atrial fibrillation vs flutter | 0.363 | <.001 | 0.376 | <.001 |
| Signs of atrial fibrillation | 0.202 | .009 | 0.205 | .008 |
| Atrial flutter vs atrial fibrillation | 0.230 | .003 | 0.230 | .002 |
| Atrial flutter | 0.389 | <.001 | 0.378 | <.001 |
| Atrial flutter ecgᶜ | 0. | | | |
| Atrial flutter ekgᵈ | –0.163 | .03 | –0.165 | .03 |
| Atrial flutter vs fibrillation | 0.391 | <.001 | 0.403 | <.001 |
| Atrial fibrillation ecg | 0.435 | <.001 | 0.430 | <.001 |
| What is atrial fibrillation | 0.685 | <.001 | 0.681 | <.001 |
| a fibᵉ | 0.965 | <.001 | 0.957 | <.001 |
| Treatment of atrial flutter | –0.187 | .015 | –0.185 | .02 |
| What causes atrial flutter | 0.328 | <.001 | 0.331 | <.001 |
| Search terms for Hodgkin lymphoma | | | | |
| Hodgkin lymphoma | –0.518 | <.001 | –0.404 | <.001 |
| Hodgkin lymphoma cancer | –0.407 | <.001 | –0.316 | <.001 |
| Hodgkin lymphoma symptoms | –0.496 | <.001 | –0.413 | <.001 |
| Lymphoma symptoms | –0.365 | <.001 | –0.524 | <.001 |
| What is lymphoma | –0.773 | <.001 | –0.688 | <.001 |
| What is Hodgkin lymphoma | –0.635 | <.001 | –0.561 | <.001 |
| Symptoms of Hodgkin lymphoma | –0.416 | <.001 | –0.346 | <.001 |
| Symptoms of lymphoma | –0.534 | <.001 | –0.590 | <.001 |
| Hodgkin lymphoma survival rate | –0.425 | <.001 | –0.357 | <.001 |
| Hodgkin lymphoma vs non Hodgkin lymphoma | –0.513 | <.001 | –0.432 | <.001 |
| Hodgkin vs non Hodgkin | –0.547 | <.001 | –0.452 | <.001 |
| Hodgkin lymphoma prognosis | –0.227 | .003 | –0.201 | .009 |
| B cell lymphoma | –0.303 | <.001 | –0.233 | .002 |
| B cell non Hodgkin lymphoma | –0.538 | <.001 | –0.493 | <.001 |
| Hodgkin lymphoma causes | –0.433 | <.001 | –0.363 | <.001 |
| Stage 4 lymphoma | –0.416 | <.001 | –0.366 | <.001 |
| Classical Hodgkin lymphoma | –0.495 | <.001 | –0.490 | <.001 |
| Hodgkin lymphoma stage 4 | –0.415 | <.001 | –0.344 | <.001 |
| Search terms for testicular cancer | | | | |
| Testicular cancer | –0.821 | <.001 | 0.742 | <.001 |
| Causes of testicular cancer | –0.292 | <.001 | 0.280 | <.001 |
| Symptoms for testicular cancer | – | | | |
| Testicular cancer ribbon | 0.342 | <.001 | –0.275 | <.001 |
| Testicular cancer prognosis | –0.366 | <.001 | 0.368 | <.001 |
| Testicular cancer risk factors | – | | | |
| What does testicular cancer look like | 0.317 | <.001 | –0.254 | .001 |
| How do you get testicular cancer | 0.320 | <.001 | –0.214 | .005 |
| Does testicular cancer spread | 0.297 | <.001 | –0.217 | .005 |
| What are signs of testicular cancer | 0.557 | <.001 | –0.476 | <.001 |

ᵃafib: atrial fibrillation.

ᵇrvr: rapid ventricular response.

ᶜecg: electrocardiogram.

ᵈekg: electrocardiogram.

ᵉa fib: atrial fibrillation.

Trends in Internet Searches, Incidence Rates, and Mortality Rates

Figure 1 shows a time series of the RSVs, incidence rates, and mortality rates for atrial fibrillation and flutter from 2004 to 2018. Trends of other diseases are displayed in Multimedia Appendix 5. Based on the correlation analysis, we predicted the incidence and mortality rates in 2018. As can be seen in the figure, the incidence rates of most diseases and RSVs fit well. A similar pattern was observed between RSVs and mortality rates. The predicted incidence and mortality rates varied with the fluctuations in the RSVs for most NCDs.

Figure 1. Trends of atrial fibrillation and flutter from 2004 to 2018. (A) Trends of incidence rate and relative search volume (RSV) of atrial fibrillation and flutter from 2004 to 2018. (B) Trends of mortality rate and RSV of atrial fibrillation and flutter from 2004 to 2018. The blue line represents the RSV of each noncommunicable disease from 2004 to 2018, the green line represents the incidence rates, the purple line represents the mortality rates, and the dotted line represents the forecast for morbidity and mortality in 2018.

Multiple Linear Regression Models

Based on the correlations, two prediction models were established for each disease, namely the incidence prediction model and the mortality prediction model. Figure 2 shows the relationship between the independent variable (RSV for each search term) and the dependent variable (incidence and mortality rates of atrial fibrillation and flutter) in the model. The relationship between the independent variable and the dependent variable of all of the NCDs can be seen in Multimedia Appendix 6. The prediction models are shown in Multimedia Appendix 7. Table 2 shows the degree of fit of multiple linear regression models. The results of all of the NCDs are displayed in Multimedia Appendix 8. For diabetes mellitus, stroke, atrial fibrillation and flutter, Hodgkin lymphoma, and testicular cancer, the coefficient of determination of the linear regression models for predicting incidence was 80%, 88%, 96%, 80%, and 78%, respectively.

Meanwhile, the coefficient of determination of the linear regression models for predicting mortality was 82%, 62%, 94%, 78%, and 62%, respectively. From the perspective of the goodness of fit of the multiple regression prediction models of other NCDs, most of the results were close to 1.

Figure 2. (A) Scatter plot of relative search volumes and incidence rate of atrial fibrillation and flutter in the United States. (B) Scatter plot of relative search volumes and mortality rate of atrial fibrillation and flutter in the United States.
Table 2. Evaluation results of prediction models.

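| Disease | Incidence rate: R² | Incidence rate: Adjusted R² | Incidence rate: RMSEᵃ | Mortality rate: R² | Mortality rate: Adjusted R² | Mortality rate: RMSE |
| --- | --- | --- | --- | --- | --- | --- |
| Diabetes mellitus | 80% | 80% | 20.54 | 82% | 81% | 0.59 |
| Atrial fibrillation and flutter | 96% | 95% | 1.91 | 94% | 94% | 0.17 |
| Hodgkin lymphoma | 80% | 77% | 0.10 | 78% | 75% | 0.01 |
| Testicular cancer | 78% | 77% | 0.01 | 62% | 60% | 0.00 |

ᵃRMSE: root-mean-square error.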
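The three measures reported in Table 2 can be computed from observed and predicted rates as in the sketch below. The data are hypothetical, and two conventions are assumed because the paper does not spell them out: the adjusted coefficient of determination uses the usual correction 1 − (1 − R²)(n − 1)/(n − k − 1) for k predictors, and the root-mean-square error divides the squared error by n.

```python
import math


def fit_metrics(observed, predicted, n_predictors):
    """Return (R squared, adjusted R squared, RMSE) for a fitted model."""
    n = len(observed)
    mean_observed = sum(observed) / n
    # Residual and total sums of squares.
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_observed) ** 2 for o in observed)
    r_squared = 1 - sse / sst
    # Penalize the fit for the number of predictors used.
    adjusted = 1 - (1 - r_squared) * (n - 1) / (n - n_predictors - 1)
    rmse = math.sqrt(sse / n)  # assumption: divide by n, not n - k - 1
    return r_squared, adjusted, rmse


# Hypothetical observed and predicted rates from a one-predictor model.
r_squared, adjusted, rmse = fit_metrics(
    [1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8], n_predictors=1
)
```

Unlike the coefficient of determination, the root-mean-square error is expressed in the units of the rate being predicted, so it is not directly comparable across diseases with very different rates.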

Principal Findings

In recent years, using internet search data to detect influenza has been a research hotspot. Most of the studies utilized data sources from Google Trends or Google Flu Trends [25-27]. Although the existing literature has conducted empirical research on the correlation between search data and influenza, there is generally a lack of systematic preprocessing methods for NCDs. This study mainly focused on three types of diseases: diabetes, cancer, and CVDs. We found that the frequency of searches correlated strongly with previously reported disease epidemiology.

The choice of search terms was a key feature of this study. Based on the disease names used by the GBD, we used the “related searches” function of the Google search engine to supplement the search term and thus obtain a more comprehensive primary selection at a low cost. In terms of the search term selection, compared with Pearson correlation analysis conducted in the previous research, the use of cross-correlation analysis can also determine the time relationship between search terms and disease occurrence and determine the leading search terms in time so as to build a predictive model. The above preprocessing methods provided a better data foundation for the establishment of early warning models.
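The idea of a leading search term can be sketched by correlating the disease rate at time t with the RSV at time t − lag for several lags and keeping the lag with the strongest correlation. This is an illustration under our own assumptions, not the procedure behind Multimedia Appendix 4; the function names and the series are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lagged_correlations(rsv, rates, max_lag):
    """Correlate rates at time t with RSVs at time t - lag, for lag 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        leading = rsv[: len(rsv) - lag]  # RSVs observed lag periods earlier
        lagged = rates[lag:]             # rates observed lag periods later
        result[lag] = pearson_r(leading, lagged)
    return result


def best_lag(rsv, rates, max_lag):
    """Lag at which the absolute correlation is largest."""
    correlations = lagged_correlations(rsv, rates, max_lag)
    return max(correlations, key=lambda lag: abs(correlations[lag]))


# Hypothetical series in which the rates repeat the RSVs two periods later.
rsv = [1, 3, 2, 5, 4, 6, 8, 7, 9, 12]
rates = [0, 0, 1, 3, 2, 5, 4, 6, 8, 7]
lead = best_lag(rsv, rates, max_lag=3)
```

A term whose strongest correlation occurs at a positive lag moves ahead of the disease rate, which is the property an early warning model relies on.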

This study demonstrated the feasibility of combining historical information with search information for the early warning of NCDs, laying the foundation for future model optimization. Because the most recent data publicly available from the Centers for Disease Control and Prevention are at least 2 years old whereas Google Trends data are available nearly instantaneously, this resource could potentially provide a more timely and cost-effective data source for public health researchers [28]. Thus, real-time internet searches could be particularly useful. Studies have shown that tracking and monitoring search behaviors, as well as text mining on social media, can provide new ways to study public concerns about NCDs and information-seeking behaviors [29-32]. Researchers have realized that people can search, understand, and evaluate health information on internet resources and that they harness the information they receive to address health problems [33]. At present, it is necessary to enhance the efficiency of prevention and auxiliary diagnosis for patients with NCDs or the general population through online information transmission. The collection of real-time relevant search data from search engines provides a new way to prevent and control NCDs.

Most types of NCDs that we examined showed statistically significant correlations, although prostate cancer and hypertensive heart disease did not. This pattern can be attributable to different reasons. First, NCDs are highly prevalent. As people are paying more attention to these diseases, the self-management consciousness of patients with certain NCDs has been improving. Second, the monthly rates of diagnosis of NCDs are missing. Third, this pattern could be partly influenced by active public health campaigns, which may broadly increase search volume regardless of disease metrics. Between 2007 and 2014, the incidence of cancer in men in the United States declined rapidly, and then remained stable until 2016, which was related to changes in screening strategies for colon, rectal, and prostate cancer [4]. Doctors and scientists in the United States have noticed that previous prostate-specific antigen (PSA)-based screenings may lead to an overdiagnosis for prostate cancer and have reduced the use of PSA for prostate cancer screening. A previous study showed that fewer older adults in the United States are having heart attacks, and among those who do have them, more are surviving from them [34]. The country has made continuous efforts to prevent heart attacks and improve patient care. Health insurance agencies, the American Heart Association, and other research organizations—as well as a large number of researchers, clinicians, and public health experts—are committed to reducing risk by promoting a healthy lifestyle. At the same time, the recommendations for the secondary prevention of heart disease are more common and standardized. The extensive development of angiography and other related technologies, and the deepening of people’s understanding of CVDs, will affect the frequency of retrieval of related diseases on the internet. This trend can also be seen in our prediction model. 
The incidence of lung cancer in the United States has also continued to decline due to the continuous implementation of antismoking activities and effective control of the smoking rate. Fourth, we extracted internet search data from as early as 2004. In the past, individuals with the highest risk of NCDs often had limited access to the internet. To date, the internet has ushered in great changes. The number of internet users is currently growing at a rate of more than 11 new users per second, bringing the total number of new users per day to an astonishing 1 million. More and more people tend to search for health-related information online. Our results also indicated that the coefficient of determination and adjusted coefficient of determination of the regression models for diabetes, stroke, atrial fibrillation and flutter, Hodgkin lymphoma, and testicular cancer were higher than for the other diseases studied, indicating that the regression models are better fit to the incidence and mortality rates of these NCDs. This might mean that online search behaviors and volumes can help health professionals to conduct near real-time monitoring of NCDs.

Although the method of using internet search data to predict influenza has made great progress in real time, it still lacks in accuracy. Since 2011, GFT has been overestimating the number of influenza-like illnesses (ILIs), especially the forecast of the peak season of influenza. In 2013, the forecast deviation was even as high as 140%, which may indicate that there is a certain gap in the forecast based on internet data alone. In the traditional methods of influenza prediction, although the manually collected ILI monitoring samples are lagging behind, they are more accurate because of rigorous scientific experiments. Therefore, internet data cannot completely replace the traditional data collection methods but should be used as a complement to the latter.


Our study has some unavoidable limitations. First, because the data from internet searches are public and anonymous, we could not determine who conducted search activities in this study, and the limited number of search terms could not fully represent the search preferences of all people. Second, the use of Google Trends cannot be fully representative of the overall population, since only individuals with access to the internet can be accounted for. Third, we could only obtain the annual incidence and mortality data for NCDs, but not the data with smaller granularity, which might have affected the accuracy of the prediction model. Fourth, as the search algorithm of Google Trends is dynamic, we could not retrieve the original RSVs, and the RSVs of the same search term obtained in different time periods were different. Fifth, the spatiotemporal data of search engines have many limitations, such as high noise and uncertainty. We hope to find ways to identify and reduce bias in search engine data before we utilize web-based data to provide useful information.

Public Health Implications

With the widespread use of internet searches, our study found a correlation between the RSVs and the incidence and mortality rates of the NCDs. This indicates that the search engine data can be used in the early warning and prevention of NCDs, such as diabetes, cancer, and CVDs. We should make good use of such data, especially when the traditional registry data are insufficient or unavailable.


YW was supported by grants from the National Natural Science Foundation of China (grants No. 91746205 and 71673199) and YG was supported by a grant from the Youth Program of National Natural Science Foundation of China (grant No. 71804124).

Authors' Contributions

YW directed the study. YH downloaded all the original Google Trends data in the United States. CX and ZC processed and analyzed the data, and then developed the first manuscript draft. All authors critically revised the manuscript and contributed to and approved the final version.

Conflicts of Interest

None declared.

Multimedia Appendix 1

Search terms of all the noncommunicable diseases.

XLSX File (Microsoft Excel File), 35 KB

Multimedia Appendix 2

Correlation analysis among the relative search volumes, incidence rates, and mortality rates.

XLSX File (Microsoft Excel File), 34 KB

Multimedia Appendix 3

Correlation coefficients among internet searches, incidence rates, and mortality rates in the United States (contains search terms that were ultimately included in the study).

DOC File, 584 KB

Multimedia Appendix 4

Cross-correlation analysis results among search terms.

XLSX File (Microsoft Excel File), 38 KB

Multimedia Appendix 5

Trends of all the noncommunicable diseases.

DOC File, 1062 KB

Multimedia Appendix 6

Relationship between the independent variable and the dependent variable of all the noncommunicable diseases.

PDF File (Adobe PDF File), 30670 KB

Multimedia Appendix 7

Prediction models.

DOC File, 40 KB

Multimedia Appendix 8

Evaluation results of prediction models for all the chronic diseases.

DOC File, 81 KB

  1. Nugent R, Bertram MY, Jan S, Niessen LW, Sassi F, Jamison DT, et al. Investing in non-communicable disease prevention and management to advance the Sustainable Development Goals. The Lancet 2018 May;391(10134):2029-2035.
  2. WHO. World health statistics 2019. URL: [accessed 2020-01-17]
  3. Global Burden of Disease. GBD results tool. URL: [accessed 2020-01-17]
  4. Siegel RL, Miller KD, Jemal A. Cancer statistics, 2020. CA A Cancer J Clin 2020 Jan 08;70(1):7-30.
  5. Ploeg J, Markle-Reid M, Valaitis R, McAiney C, Duggleby W, Bartholomew A, et al. Web-Based Interventions to Improve Mental Health, General Caregiving Outcomes, and General Health for Informal Caregivers of Adults With Chronic Conditions Living in the Community: Rapid Evidence Review. J Med Internet Res 2017 Jul 28;19(7):e263.
  6. Weber W, Reinhardt A, Rossmann C. Lifestyle Segmentation to Explain the Online Health Information–Seeking Behavior of Older Adults: Representative Telephone Survey. J Med Internet Res 2020 Jun 12;22(6):e15099.
  7. Gedefaw A, Yilma TM, Endehabtu BF. Information Seeking Behavior About Cancer and Associated Factors Among University Students, Ethiopia: A Cross-Sectional Study. Cancer Manag Res 2020;12:4829-4839.
  8. Alioshkin Cheneguin A, Salvat Salvat I, Romay Barrero H, Torres Lacomba M. How good is online information on fibromyalgia? An analysis of quality and readability of websites on fibromyalgia in Spanish. BMJ Open 2020 Jul 05;10(7):e037065.
  9. Heynsbergh N, Botti M, Heckel L, Livingston PM. Caring for the person with cancer: Information and support needs and the role of technology. Psycho-Oncology 2018 Apr 20;27(6):1650-1655.
  10. Tonsaker T, Bartlett G, Trpkov C. Health information on the Internet: gold mine or minefield? Can Fam Physician 2014 May;60(5):407-408.
  11. Teh J, Op't Hoog S, Nzenza T, Duncan C, Wang J, Radojcic M, et al. Penile cancer information on the internet: a needle in a haystack. BJU Int 2018 Oct 29;122:22-26.
  12. Printz C. Patients with cancer seeking treatment information on the internet face challenges. Cancer 2017 May 19;123(11):1886-1887.
  13. Heynsbergh N, Botti M, Heckel L, Livingston PM. Caring for the person with cancer: Information and support needs and the role of technology. Psycho-Oncology 2018 Apr 20;27(6):1650-1655.
  14. Camille R. Computer and Internet use in the United States: 2016. URL: [accessed 2020-09-17]
  15. StatCounter. Search engine market share. URL: [accessed 2020-01-01]
  16. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L. Detecting influenza epidemics using search engine query data. Nature 2009 Feb 19;457(7232):1012-1014.
  17. Lu FS, Hattab MW, Clemente CL, Biggerstaff M, Santillana M. Improved state-level influenza nowcasting in the United States leveraging Internet-based data and network approaches. Nat Commun 2019 Jan 11;10(1):147.
  18. Milinovich GJ, Williams GM, Clements ACA, Hu W. Internet-based surveillance systems for monitoring emerging infectious diseases. The Lancet Infectious Diseases 2014 Feb;14(2):160-168.
  19. Zhang Y, Bambrick H, Mengersen K, Tong S, Hu W. Using Google Trends and ambient temperature to predict seasonal influenza outbreaks. Environment International 2018 Aug;117:284-291.
  20. Deiner MS, Lietman TM, McLeod SD, Chodosh J, Porco TC. Surveillance Tools Emerging From Search Engines and Social Media Data for Determining Eye Disease Patterns. JAMA Ophthalmol 2016 Sep 01;134(9):1024.
  21. Senecal C, Widmer RJ, Lerman LO, Lerman A. Association of Search Engine Queries for Chest Pain With Coronary Heart Disease Epidemiology. JAMA Cardiol 2018 Dec 01;3(12):1218-1221.
  22. Toosi B, Kalia S. Seasonal and Geographic Patterns in Tanning Using Real-Time Data From Google Trends. JAMA Dermatol 2016 Feb 01;152(2):215-217.
  23. Google. Google Trends. URL: [accessed 2019-10-01]
  24. Trends Help. Google Trends. URL: [accessed 2020-11-08]
  25. Kandula S, Shaman J. Reappraising the utility of Google Flu Trends. PLoS Comput Biol 2019 Aug 2;15(8):e1007258.
  26. Pollett S, Boscardin WJ, Azziz-Baumgartner E, Tinoco YO, Soto G, Romero C, et al. Evaluating Google Flu Trends in Latin America: Important Lessons for the Next Phase of Digital Disease Detection. Clin Infect Dis 2016 Sep 26;64(1):34-41.
  27. Yang S, Santillana M, Kou SC. Accurate estimation of influenza epidemics using Google search data via ARGO. Proc Natl Acad Sci USA 2015 Nov 09;112(47):14473-14478.
  28. Xu C, Wang Y, Yang H, Hou J, Sun L, Zhang X, et al. Association Between Cancer Incidence and Mortality in Web-Based Data in China: Infodemiology Study. J Med Internet Res 2019 Jan 29;21(1):e10677.
  29. Deiner MS, McLeod SD, Wong J, Chodosh J, Lietman TM, Porco TC. Google Searches and Detection of Conjunctivitis Epidemics Worldwide. Ophthalmology 2019 Sep;126(9):1219-1229.
  30. Phillips CA, Barz Leahy A, Li Y, Schapira MM, Bailey LC, Merchant RM. Relationship Between State-Level Google Online Search Volume and Cancer Incidence in the United States: Retrospective Study. J Med Internet Res 2018 Jan 08;20(1):e6.
  31. Huang X, Baade P, Youlden DR, Youl PH, Hu W, Kimlin MG. Google as a cancer control tool in Queensland. BMC Cancer 2017 Dec 4;17(1):816.
  32. DeJohn AD, Schulz EE, Pearson AL, Lachmar EM, Wittenborn AK. Identifying and Understanding Communities Using Twitter to Connect About Depression: Cross-Sectional Study. JMIR Ment Health 2018 Nov 05;5(4):e61.
  33. Norman C. eHealth Literacy 2.0: Problems and Opportunities With an Evolving Concept. J Med Internet Res 2011 Dec 23;13(4):e125.
  34. Krumholz HM, Normand ST, Wang Y. Twenty-Year Trends in Outcomes for Older Adults With Acute Myocardial Infarction in the United States. JAMA Netw Open 2019 Mar 15;2(3):e191938.

CVD: cardiovascular disease
GBD: Global Burden of Disease
GFT: Google Flu Trends
ILIs: influenza-like illnesses
NCDs: noncommunicable diseases
PSA: prostate-specific antigen
RSV: relative search volume
SDGs: Sustainable Development Goals

Edited by G Eysenbach; submitted 31.03.20; peer-reviewed by K Bogie, R Lan; comments to author 29.06.20; revised version received 10.07.20; accepted 26.10.20; published 12.11.20


©Chenjie Xu, Zhi Cao, Hongxi Yang, Ying Gao, Li Sun, Yabing Hou, Xinxi Cao, Peng Jia, Yaogang Wang. Originally published in the Journal of Medical Internet Research, 12.11.2020.

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication, as well as this copyright and license information must be included.