This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Cancer poses a serious threat to the health of Chinese people, resulting in a major challenge for public health work. Today, people can obtain relevant information from not only medical workers in hospitals, but also the internet in any place in real-time. Search behaviors can reflect a population’s awareness of cancer from a completely new perspective, which could be driven by the underlying cancer epidemiology. However, such Web-retrieved data are not yet well validated or understood.
This study aimed to explore whether a correlation exists between the incidence and mortality of cancers and normalized internet search volumes on the big data platform, Baidu. We also assessed whether the distribution of people who searched for specific types of cancer differed by gender. Finally, we determined whether there were regional disparities among people who searched the Web for cancer-related information.
Standard Boolean operators were used to choose search terms for each type of cancer. Spearman’s correlation analysis was used to explore correlations among monthly search index values for each cancer type and their monthly incidence and mortality rates. We conducted cointegration analysis between search index data and incidence rates to examine whether a stable equilibrium existed between them. We also conducted cointegration analysis between search index data and mortality data.
The monthly Baidu index was significantly correlated with cancer incidence rates for 26 of 28 cancers in China (lung cancer:
Search behaviors indeed reflect public awareness of cancer from a different angle. Research on internet search behaviors could present an innovative and timely way to monitor and estimate cancer incidence and mortality rates, especially for cancers not included in national registries.
Cancer affects people of all socioeconomic levels all over the world [
Today, social media and medical forums are rapidly spreading, and internet users are increasingly exchanging health-related information. When people feel ill or have early symptoms, they may tend to first look for relevant health information on the internet for self-assessment. Some studies have improved the surveillance of epidemics and examined public interest in multiple health topics by monitoring the search behaviors of millions of users and conducting data mining through Google [
In this study, we tracked and monitored the Baidu index [
Flow diagram of the cancer registration system.
National-level incidence and mortality rates of cancers in China were obtained for the period 2011-2016 from the Global Burden of Disease database, which is publicly available [
This study mainly considered cancer search index values from Chinese search engines. The Baidu index was used as the entry point to launch the corresponding research [
The Baidu index derives from search frequencies on the Baidu search engine; it is calculated and displayed based on the search volumes of specific keywords entered by users [
In this study, cancer awareness was examined on the basis of the general population’s ability to seek information on or pay attention to the disease. Because Baidu is a Chinese search engine, the search terms are all expressed in Chinese characters. Given the diverse meanings of Chinese characters, in addition to their formal Chinese names, some cancers have various synonyms. All their formal Chinese names were referenced to the International Classification of Diseases for Oncology. Therefore, we selected both the formal Chinese names and common terms for various cancers while searching. Standard Boolean operators were used to combine terms. The search index value for each cancer could be incorporated into five keywords, and the selected terms were not searched in quotes. For most cancers, we used two or more search terms in Chinese to cover as many synonyms as possible.
The Baidu index offers a keyword analysis function, which determines keywords based on the way searchers initiate a search request. For the study period (January 2011 to December 2016), the Baidu index system automatically analyzed the flow and trend of keywords entered in the Baidu search engine. We first entered the formal Chinese names of various cancers as keywords. The keyword analysis function automatically generated a number of related words, along with the search demand for each related word. These words could be used as search terms to reflect people’s retrieval needs. The keyword analysis function thus helped us screen the search terms preliminarily. To make keyword selection more rigorous, we also applied three retrieval methods to the keywords and related words: comparative retrieval, cumulative retrieval, and combined retrieval. In comparative retrieval, different keywords are separated by commas, which allows their data to be compared side by side. In cumulative retrieval, different keywords are connected by plus signs, which adds their data together; the aggregated data are presented as a combination of keywords. Combined retrieval is a combination of “comparative retrieval” and “cumulative retrieval.” The search terms for each cancer were then determined.
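The three retrieval methods can be illustrated with a minimal Python sketch that only assembles query strings. The function names are hypothetical, the use of half-width `,` and `+` separators is an assumption about how the Baidu index accepts input, and the example terms (formal Chinese names for non-Hodgkin lymphoma, lymphoma, and lung cancer) are illustrative rather than the study's actual term list.

```python
def comparative_query(keywords):
    """Comparative retrieval: separate keywords with commas so that
    each keyword is reported as its own series."""
    return ",".join(keywords)


def cumulative_query(keywords):
    """Cumulative retrieval: connect keywords with plus signs so that
    their search volumes are added together."""
    return "+".join(keywords)


def combined_query(groups):
    """Combined retrieval: sum the keywords inside each group, then
    compare the groups with one another."""
    return comparative_query([cumulative_query(group) for group in groups])


# Illustrative terms only (assumption): two synonyms are summed for
# non-Hodgkin lymphoma and then compared with a single lung cancer term.
example = combined_query([["非霍奇金淋巴瘤", "淋巴瘤"], ["肺癌"]])
print(example)
```

Grouping synonyms with cumulative retrieval before comparing cancers keeps each cancer's index from depending on which single name searchers happen to use.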
For non-Hodgkin lymphoma, we also added the more common term “lymphoma” because that search term is twice as common as “non-Hodgkin lymphoma,” and approximately 90% of lymphomas are non-Hodgkin lymphoma [
First, we performed the Spearman correlation analysis to evaluate the relationship between the known cancer incidence and mortality rates for all cancer types and the Baidu index for the period 2011-2016. The Spearman correlation analysis is a nonparametric statistical method that makes no assumption about the distribution of the original variables, so its scope of application is wider. Statistical significance was set at .05 (two-sided test).
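As a rough illustration of why no distributional assumption is needed, the Spearman coefficient can be computed as a Pearson correlation on ranks. The sketch below uses only the Python standard library; the function names and the sample values are hypothetical and are not study data. It returns the coefficient only, whereas statistical packages (for example, `scipy.stats.spearmanr`) also report the P value.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical monthly values (not study data): the relationship is
# monotonic but nonlinear, which ranks handle without transformation.
search_index = [1, 2, 3, 4, 5, 6]
incidence = [1, 4, 9, 16, 25, 36]
print(spearman(search_index, incidence))
```

Because only the ordering of the observations enters the calculation, skewed or heavy-tailed search volumes do not need to be normalized first.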
Second, we used the Engle-Granger test to determine whether there was cointegration, or a long-term association, among the three indicators. We defined the search index values and the incidence and mortality rates for each type of cancer over the 6-year study period (2011-2016) as time-series data. To reduce heteroscedasticity in the time series, in the first step, we log-transformed the Baidu index and the incidence and mortality rates [
Statistical analysis was conducted using IBM SPSS (version 22.0, IBM Corporation, Armonk, NY), EViews (version 8, IHS Global Inc, London, United Kingdom), and R project (version 3.4, R Development Core Team, Vienna, Austria). We used Tableau (version 2018.3, Tableau Software, Seattle, WA) to conduct statistical analysis and create figures.
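The two-step cointegration procedure can be sketched in plain Python as follows. This is a simplified illustration, not the EViews procedure used in the study: it log-transforms both series, regresses one on the other, and computes a Dickey-Fuller t statistic on the residuals without lagged differences (the study's unit root tests selected lag length by the Schwarz criterion). All function names and the synthetic series are hypothetical. The statistic must be compared with Engle-Granger (MacKinnon) critical values rather than standard Dickey-Fuller tables; roughly -3.34 at the 5% level for two variables is an asymptotic value quoted here as an assumption to check against published tables.

```python
from math import exp, log, sqrt


def ols(x, y):
    """Fit y = a + b*x by ordinary least squares; return (a, b, residuals)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    return a, b, residuals


def df_t_statistic(series):
    """t statistic of rho in  delta_e[t] = rho * e[t-1] + u[t]
    (no constant, no lagged differences)."""
    lagged = series[:-1]
    diffs = [series[t] - series[t - 1] for t in range(1, len(series))]
    sxx = sum(e * e for e in lagged)
    rho = sum(e * d for e, d in zip(lagged, diffs)) / sxx
    errors = [d - rho * e for e, d in zip(lagged, diffs)]
    sigma2 = sum(u * u for u in errors) / (len(diffs) - 1)
    return rho / sqrt(sigma2 / sxx)


def engle_granger_statistic(search_index, rate):
    """Step 1: regress log(rate) on log(search_index).
    Step 2: test the regression residuals for a unit root."""
    log_x = [log(v) for v in search_index]
    log_y = [log(v) for v in rate]
    _, _, residuals = ols(log_x, log_y)
    return df_t_statistic(residuals)


# Synthetic series (not study data): the rate follows the search index
# on the log scale up to a small, mean-reverting deviation.
deviation = [0.01, -0.01, 0.005, -0.005] * 6
log_index = [0.1 * t for t in range(1, 25)]
synthetic_index = [exp(v) for v in log_index]
synthetic_rate = [exp(1 + 2 * v + d) for v, d in zip(log_index, deviation)]
stat = engle_granger_statistic(synthetic_index, synthetic_rate)
print(stat)
```

A strongly negative statistic indicates that the residuals revert to zero, which is the sense in which a stable long-run equilibrium exists between the two series. For real analyses, a tested implementation such as `statsmodels.tsa.stattools.coint` handles lag selection and critical values.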
We obtained the cancer incidence and mortality rates from 2011 to 2016 using data from the Global Burden of Disease database [
Incidence and mortality rates of cancers in China, 2016. Data were obtained from the Global Burden of Disease database. The blue signs represent incidence rates and the red signs represent mortality rates.
Incidence and mortality rates of cancers in China among men in 2016. The blue signs represent incidence rates and the red signs represent mortality rates.
Incidence and mortality rates of cancers in China among women in 2016. The blue signs represent incidence rates and the red signs represent mortality rates.
Time series of search index values and incidence and mortality rates (top five most commonly occurring cancers).
For Hodgkin lymphoma, gender percentage data were missing before May 2015. The incidence and mortality rates of brain, nervous system, thyroid, gallbladder, and biliary tract cancers were found to be higher for women than for men. Incidence and mortality rates of female-specific cancers (breast, cervical, and uterine cancers) were also higher than those of male-specific cancers (prostate and testicular). Other cancers had higher incidence and mortality rates among men than among women. Incidence and mortality rates have increased annually among both men and women for lung cancer, liver cancer, colorectal cancer, pancreatic cancer, brain and nervous system cancer, non-Hodgkin lymphoma, prostate cancer, bladder cancer, gallbladder and biliary tract cancer, lip and oral cavity cancer, ovarian cancer, kidney cancer, multiple myeloma, and malignant skin melanoma. Relatively, men paid more attention to search terms related to these cancers than women. In terms of the whole population, the number of women who searched for cancer-related information has slowly risen since 2015, while the number of men has shown a downward trend. This trend is even more obvious for female-specific cancers such as breast, cervical, ovarian, and uterine cancers. Initially, more men searched for terms related to breast cancer, but over time, an increasing number of women searched for such terms. More men paid attention to prostate and testicular cancers than women (
Incidence, mortality, and search distribution of cancers divided by gender (top five most commonly occurring cancers). The percentile chart represents the change from September 2013 to September 2016.
Age distribution of the searchers from 2013 to 2016 (top five most commonly occurring cancers).
Ranking of regional distribution of the online searchers from 2013 to 2016 (top five most commonly occurring cancers) in mainland China. AH: Anhui, BJ: Beijing, FJ: Fujian, GS: Gansu, GD: Guangdong, GX: Guangxi, GZ: Guizhou, HI: Hainan, HE: Hebei, HA: Henan, HL: Heilongjiang, HB: Hubei, HN: Hunan, JL: Jilin, JS: Jiangsu, JX: Jiangxi, LN: Liaoning, NM: Inner Mongolia, NX: Ningxia, QH: Qinghai, SD: Shandong, SX: Shanxi, SN: Shaanxi, SH: Shanghai, SC: Sichuan, TJ: Tianjin, XZ: Tibet, XJ: Xinjiang, YN: Yunnan, ZJ: Zhejiang, CQ: Chongqing.
Correlation coefficients between search index values, incidence rate of cancers, and mortality rate of cancers.
Cancer | Search index values vs incidence rate: correlation coefficient | P value | Search index values vs mortality rate: correlation coefficient | P value |
Lung cancer | 0.80 | <.001 | 0.80 | <.001 |
Liver cancer | 0.28 | .02 | 0.28 | .02 |
Stomach cancer | 0.50 | <.001 | 0.02 | .88 |
Esophageal cancer | 0.50 | <.001 | 0.21 | .08 |
Colon and rectal cancer | 0.81 | <.001 | 0.81 | <.001 |
Pancreatic cancer | 0.86 | <.001 | 0.86 | <.001 |
Breast cancer | 0.56 | <.001 | 0.76 | <.001 |
Leukemia | 0.75 | <.001 | -0.70 | <.001 |
Brain and nervous system cancer | 0.63 | <.001 | 0.63 | <.001 |
Cervical cancer | 0.64 | <.001 | 0.65 | <.001 |
Non-Hodgkin lymphoma | 0.88 | <.001 | 0.88 | <.001 |
Prostate cancer | 0.67 | <.001 | 0.67 | <.001 |
Nasopharynx cancer | 0.08 | .51 | 0.44 | <.001 |
Bladder cancer | 0.62 | <.001 | 0.62 | <.001 |
Gallbladder and biliary tract cancer | 0.88 | <.001 | 0.88 | <.001 |
Lip and oral cavity cancer | 0.88 | <.001 | 0.88 | <.001 |
Ovarian cancer | 0.58 | <.001 | 0.58 | <.001 |
Larynx cancer | 0.82 | <.001 | 0.74 | <.001 |
Kidney cancer | 0.73 | <.001 | 0.73 | <.001 |
Squamous cell carcinoma | 0.94 | <.001 | 0.87 | <.001 |
Uterine cancer | 0.04 | .73 | -0.42 | <.001 |
Multiple myeloma | 0.84 | <.001 | 0.84 | <.001 |
Thyroid cancer | 0.77 | <.001 | 0.77 | <.001 |
Malignant skin melanoma | 0.55 | <.001 | 0.55 | <.001 |
Hodgkin lymphoma | 0.91 | <.001 | -0.91 | <.001 |
Mesothelioma | 0.79 | <.001 | 0.79 | <.001 |
Testicular cancer | 0.57 | <.001 | -0.08 | .48 |
Basal cell carcinoma | 0.83 | <.001 | Not available | Not available |
The augmented Dickey-Fuller unit root test was used to examine the stationarity of the time series. The Schwarz information criterion was used to determine lag length automatically. We first log-transformed the three indicators. After transformation, the series were all stationary at the first difference (
For most cancers, the Baidu index was positively correlated with cancer incidence rates. For several cancers including lung cancer, liver cancer, stomach cancer, colon and rectal cancer, breast cancer, prostate cancer, brain and nervous system cancer, cervical cancer, pancreatic cancer, non-Hodgkin lymphoma, bladder cancer, nasopharynx cancer, lip and oral cavity cancer, kidney cancer, thyroid cancer, squamous cell carcinoma, larynx cancer, ovarian cancer, gallbladder and biliary tract cancer, multiple myeloma, malignant skin melanoma and mesothelioma, the Baidu index was positively correlated with cancer mortality rates. The results suggest that the search engine data can reflect actual prevalence to some extent. Such data sources might be particularly useful when real-time information is required or missing (eg, mortality rate of basal cell carcinoma is lacking), considering that there is often a lag of several years in the publication of cancer registration data. The results of this study suggest that we should study and make use of Web-based data and publicly available information regarding people’s interest in health topics to estimate cancer trends. Although most cancers examined in this study showed statistically significant correlations of the Baidu index with incidence and mortality rates, nasopharynx, uterine, stomach, esophageal, and testicular cancers did not show such correlations. This is probably attributable to various public health-related phenomena that may increase search volumes independent of disease metrics, such as the National Cancer Prevention Week held by the China Anti-Cancer Association (April of each year) or appearance of reports of cancer among public figures. After launch of a public health campaign for a disease, the information-search behavior associated with the disease will also increase [
In terms of the gender distribution of the search population, there were initially more men than women in our study. This could be attributable to gender differences in the disease burden pattern. For example, men are more susceptible than women to various deadly diseases, including cancer [
Given the nonstandard treatments and other related issues, cancer diagnoses in China are generally made late, and the survival rates are not high [
Norman and Skinner defined eHealth literacy as “The ability to search, understand, and evaluate health information on electronic resources, and harnessing the information they receive to address and solve health problems”. As the content of health literacy continued to expand, Norman and Skinner proposed that eHealth literacy is a combination of different abilities, which can be divided into two types. The first type, analytic skills, comprises traditional literacy and numeracy, media literacy, and information literacy. The second type, context-specific skills, comprises computer literacy, scientific literacy, and health literacy, which refer to the ability to deal with problems in specific areas [
Research on the attention paid to health information on the Web, together with the characteristics of the population that searches for it, can help estimate some disease indicators in the population, which in turn can improve the allocation of health resources and the implementation of effective public health measures. This could also help medical providers, who face challenges that include understanding the characteristics of patients who use the internet, their reasons for using it, and the effectiveness and security of the websites that currently provide health-related information to patients.
Previous studies have mainly used Google Trends [
This study has some limitations. The use of Baidu search data to estimate disease metrics might not be completely generalizable, since the data are restricted to those with access to the internet. We were also unable to determine the types of internet users or which stakeholders were responsible for search activities. Given China’s vast size and large population, the registry of cancer statistics is usually lagging and not comprehensive. We could not obtain timely data on the monthly incidence and mortality rates of all cancers in China. Use of search index values from a popular internet search engine can only account for a small portion of changes in incidence and mortality rates of cancers, which are also greatly affected by public health activities. Studying search engine data is inevitably restricted by these random factors; this is an unavoidable limitation in such research. We hope to find ways to identify and reduce bias in search engine data before we utilize Web-based data to provide useful information for cancer surveillance, evaluation of public cancer awareness, and education programs.
Owing to the widespread proliferation of internet technology, all kinds of people make use of the internet. In the medical field, the internet is often used to prompt informed conversations with clinical professionals and to suggest potentially helpful resources to patients or other people. Indeed, this study found a correlation between search index values and the incidence and mortality rates for most types of cancers. To some extent, search behaviors and volumes can reflect public awareness of cancer. Therefore, a better understanding of search behaviors could augment traditional epidemiologic surveillance and help achieve the goal of cancer prevention and control. Internet search data deserve attention, especially when registry data are insufficient or lagging.
Search terms of 28 cancers in Chinese.
Relevant figures of the remaining types of cancers.
Results of unit root tests for the time series of monthly Baidu index, incidence rate, and mortality rate for each cancer type.
Results of the cointegration test of the two time series of monthly Baidu indexes and incidence rates of cancers.
Results of the cointegration test of the two time series of monthly Baidu indexes and mortality rates of cancers.
This research was funded by the National Natural Science Foundation of China (#91746205; #71673199).
YW developed the original research idea for the study and directed the study. All authors conducted the analysis and interpreted the results. CX, YW, and HY developed the first manuscript draft. All authors critically revised the manuscript. All authors critically reviewed and contributed to the final version and approved it. All authors had full access to the study data. YW had the final responsibility of the study.
None declared.