This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Internet search queries have become an important data source in syndromic surveillance systems. However, there is currently no syndromic surveillance system using Internet search query data in South Korea.
The objective of this study was to examine correlations between our cumulative query method and national influenza surveillance data.
Our study was based on the local search engine, Daum (approximately 25% market share), and influenza-like illness (ILI) data from the Korea Centers for Disease Control and Prevention. A quota sampling survey was conducted with 200 participants to obtain popular queries. We divided the study period into two sets: Set 1 (the 2009/10 epidemiological year for development set 1 and 2010/11 for validation set 1) and Set 2 (2010/11 for development set 2 and 2011/12 for validation set 2). Pearson’s correlation coefficients were calculated between the Daum data and the ILI data for the development set. We selected the combined queries for which the correlation coefficients were .7 or higher and listed them in descending order. Then, we created a cumulative query method
In validation set 1, 13 cumulative query methods were applied, and 8 had higher correlation coefficients (min=.916, max=.943) than that of the highest single combined query. Further, 11 of 13 cumulative query methods had an
The cumulative query method showed relatively higher correlation with national influenza surveillance data than combined queries in the development and validation sets.
Syndromic surveillance may alert public health care providers in the early phases of an outbreak, allowing them to decrease morbidity and mortality resulting from the outbreak [
Because conventional syndromic surveillance of indicators such as influenza-like illness (ILI) relies on case reporting to track disease activity, delays in reporting and case confirmation can interfere with the early detection of outbreaks or of increases in influenza cases in the community. Researchers have therefore been investigating alternative data sources for outbreak detection. For example, over-the-counter medication sales and school absenteeism data have been used for earlier detection of outbreaks [
Internet search queries have become an important data source in recent years [
In South Korea, influenza is generally seasonal, with most activity occurring during winter. The 2009/10 epidemiological year, called the Influenza A (H1N1) pandemic period, was an exceptional situation (see
The study period was September 6, 2009 (week 36), through September 1, 2012 (week 34)—156 weeks of data for 3 consecutive epidemiological years. We divided the study period into two sets: Set 1 (the 2009/10 epidemiological year for development set 1 and 2010/11 for validation set 1) and Set 2 (2010/11 for development set 2 and 2011/12 for validation set 2).
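The stated length of the study period can be checked with simple date arithmetic. The sketch below assumes that both boundary dates are included in the period; the variable names are illustrative.

```python
from datetime import date

start = date(2009, 9, 6)   # start of the study period (week 36 of 2009)
end = date(2012, 9, 1)     # end of the study period (week 34 of 2012)

# Count both boundary days, then split the total into whole weeks.
days_inclusive = (end - start).days + 1
weeks, leftover_days = divmod(days_inclusive, 7)
print(weeks, leftover_days)  # 156 0
```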
We collected the ILI data from the Korea Centers for Disease Control and Prevention (KCDC) as a gold standard. KCDC ILI data were available from the KCDC website; we downloaded the ILI data for the study period from this site [
To obtain population search queries related to influenza, we conducted a quota sampling survey in September 2012. The quotas were based on the address in the resident registry, age, and sex. There were five quota groups by age: 20-29 years, 30-39 years, 40-49 years, 50-59 years, and 60 years or older. Half of each quota group was female. We randomly selected addresses from the resident registry in Seoul; if interviewees living at a selected address met the criteria, we included the oldest interviewee. We then conducted face-to-face interviews. The survey asked about participants’ history of searching for influenza and the queries they had typed. The survey was performed anonymously. The KCDC definition of ILI was a person with a fever (발열 in Korean) of 38°C with a cough (기침) and/or a sore throat (인후통). These three queries from the definition of ILI were included in the queries for the following operations, regardless of the survey result. For queries originally submitted in English only, we translated them into Korean and added the translations as new queries.
We believe that people typically search for things of interest on the Internet using one or more queries at a time. To reflect people’s searching behavior and include as many queries as possible, we used a combination of queries. Queries from the survey results and the definition of KCDC ILI were divided into groups as follows: query group 1 consisted of queries specific to influenza (eg, “H1N1”, “Influenza”), and query group 2 contained queries not specific to influenza (eg, “Treatment”, “Symptom”). Then, we combined query groups 1 and 2. Combined queries consisted of query group 1 alone and a combination of query groups 1 and 2 (eg, “H1N1”, “H1N1 Treatment”, “H1N1 Symptom”, “Influenza”, “Influenza Treatment”, “Influenza Symptom”).
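The construction of combined queries described above can be sketched in a few lines. This is a minimal illustration using only the English example terms from the text; the function name `build_combined_queries` is hypothetical, and joining terms with a space is an assumption about how a combined query would be typed.

```python
# Small illustrative subsets of the two query groups (English translations).
group1 = ["H1N1", "Influenza"]      # queries specific to influenza
group2 = ["Treatment", "Symptom"]   # queries not specific to influenza

def build_combined_queries(specific, nonspecific):
    """Return each specific query alone, followed by its pairings with every nonspecific query."""
    combined = []
    for q1 in specific:
        combined.append(q1)  # a group 1 query is searchable by itself
        combined.extend(f"{q1} {q2}" for q2 in nonspecific)
    return combined

combined_queries = build_combined_queries(group1, group2)
print(combined_queries)
# ['H1N1', 'H1N1 Treatment', 'H1N1 Symptom', 'Influenza', 'Influenza Treatment', 'Influenza Symptom']
```

Queries from group 2 never appear alone in this sketch, which mirrors the description that combined queries consist of group 1 alone or of group 1 paired with group 2.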
We sent the combined queries and the queries that belonged to query group 1 (because these queries were searchable by themselves) to Daum and received proportional data in weekly form. Proportional data for these combined queries were extracted from the Daum search engine during development sets 1 and 2. Proportional data from the Daum search engine were calculated by dividing the number of each combined query by the total number of search queries for 1 week.
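The proportional data described above amount to a per-week ratio. The following sketch shows that calculation under stated assumptions: the counts are made-up numbers for illustration (not Daum data), and the function name `weekly_proportions` is hypothetical.

```python
def weekly_proportions(query_counts, total_counts):
    """Divide each week's count for one combined query by that week's total search volume."""
    if len(query_counts) != len(total_counts):
        raise ValueError("both series must cover the same weeks")
    return [q / t for q, t in zip(query_counts, total_counts)]

# Hypothetical weekly counts, for illustration only.
query_counts = [120, 300, 450]                    # searches for one combined query
total_counts = [1_000_000, 1_200_000, 1_500_000]  # all searches in the same weeks

proportions = weekly_proportions(query_counts, total_counts)
print(proportions)
```

Normalizing by total weekly volume keeps the series comparable across weeks even when overall search traffic changes.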
Pearson’s correlation coefficients were calculated between the Daum data for the combined queries and the KCDC ILI data in development sets 1 and 2. We selected the combined queries for which the correlation coefficients were .7 or higher and listed them in descending order. To see the change of correlation coefficients over time, we also calculated correlation coefficients of the combined queries in subsequent epidemiological years. We then created a cumulative query method
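The selection step can be illustrated with a short sketch. The data below are hypothetical, and all function names are illustrative. The sketch also assumes that cumulative query method n is the week-by-week sum of the proportional data for the n highest-ranked combined queries; this reading is consistent with cumulative query method 1 matching the top-ranked combined query in the results tables, but it is an assumption rather than a definition taken from the text.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's correlation coefficient; returns None when either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # correlation cannot be computed for a constant series
    return sxy / sqrt(sxx * syy)

def select_queries(series_by_query, ili, threshold=0.7):
    """Keep queries with r >= threshold, sorted by r in descending order."""
    scored = []
    for name, series in series_by_query.items():
        r = pearson_r(series, ili)
        if r is not None and r >= threshold:
            scored.append((name, r))
    return sorted(scored, key=lambda item: item[1], reverse=True)

def cumulative_series(series_by_query, ranked_names, n):
    """Assumed definition: week-by-week sum of the n highest-ranked queries."""
    top = [series_by_query[name] for name in ranked_names[:n]]
    return [sum(week) for week in zip(*top)]

# Hypothetical weekly data, for illustration only.
ili = [1.0, 2.0, 4.0, 3.0, 1.0]
series_by_query = {
    "query A": [2.0, 4.0, 8.0, 6.0, 2.0],  # proportional to ILI
    "query B": [1.0, 3.0, 3.0, 3.0, 1.0],  # positively correlated
    "query C": [5.0, 5.0, 5.0, 5.0, 5.0],  # constant, so r is undefined
    "query D": [4.0, 3.0, 1.0, 2.0, 4.0],  # negatively correlated
}

ranked = select_queries(series_by_query, ili)
ranked_names = [name for name, _ in ranked]
method_2 = cumulative_series(series_by_query, ranked_names, 2)
r_method_2 = pearson_r(method_2, ili)
print(ranked_names, method_2)
```

In this toy example only "query A" and "query B" pass the .7 threshold, so the second cumulative series sums those two queries week by week.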
This study was approved by the Institutional Review Board of Asan Medical Center (Seoul, Korea).
We contacted 322 people and included 200 participants aged 20 years or older who lived in Seoul, Korea. Over a quarter (56/200, 28%) answered “Yes” when asked whether they had searched for influenza and provided search queries (
Results of the survey.
Raw data | English translation | Frequency (%) |
신종 | New | 1 (1.8) |
신종플루 | New flua | 23 (41.1) |
신종플루 증상 | New flu symptom | 1 (1.8) |
신종플루 증세 | New flu sign | 1 (1.8) |
신종플루, 독감 | New flu, bad cold | 2 (3.6) |
신종플루, 목아픔 | New flu, neck pain | 1 (1.8) |
신종플루, 백신, Tamiflu | New flu, vaccine, Tamiflu (English)b | 1 (1.8) |
신종플루, 신플 증상 | New flu, new flu (abbr.)c symptom | 1 (1.8) |
신종플루, 인플루엔자, H1N1, PCR | New flu, influenza, H1N1 (English)b, PCRd (English)b | 1 (1.8) |
신종플루, 조류독감 | New flu, bird flu | 1 (1.8) |
신종플루의 치료, 합병증 | New flu, treatment, complication | 1 (1.8) |
신종플루증상 | New flu symptom | 1 (1.8) |
신종플루증세, 예방, 마스크 | New flu sign, prevention, mask | 1 (1.8) |
신플증상 | New flu (abbr.)c symptom | 1 (1.8) |
열, 기침 | Fever, cough | 1 (1.8) |
유행성독감, influenza | Epidemic bad cold, influenza (English)b | 1 (1.8) |
인플루엔자 | Influenza | 7 (12.5) |
인플루엔자, 신종독감, 신종 플루 | Influenza, new bad cold, new flu | 1 (1.8) |
인플루엔자, 조류독감 | Influenza, bird flu | 1 (1.8) |
인플루엔자, 조류독감, 돼지독감, 신종플루 | Influenza, bird flu, swine flu, new flu | 1 (1.8) |
조류독감 | Bird flu | 5 (8.9) |
조류독감, 사망 | Bird flu, decease | 1 (1.8) |
증상, 목통증 | Symptom, throat pain | 1 (1.8) |
Total | | 200 (100.0) |
aSince the Influenza A (H1N1) pandemic period, the media in Korea have used “New flu (신종플루)” to distinguish H1N1 influenza from previous influenzas. In 2010, KCDC announced that the official term was “Influenza (인플루엔자)”, but “New flu (신종플루)” and “Bad cold (독감)” remain more popular terms than “Flu (플루)” or “Influenza (인플루엔자)” in Korea. “Bad cold (독감)” in Korean has two meanings: one is influenza and the other is a severe common cold.
bThe query was originally submitted in English.
cAbbreviation: “New flu (abbr.) (신플)” is the abbreviation of “New flu (신종플루)” in Korean.
dPCR: polymerase chain reaction.
Query group 1 contained 14 queries that were specific to influenza, and query group 2 had 14 queries that were not specific to influenza (
Query groups 1 and 2 from the survey results and the KCDC definition of ILIa.
Query group 1 | Query group 2 |
Flu | Vaccine |
New flu | Prevention |
New flu (abbr.)b | Mask |
Influenza | Symptom |
Influenza (English)c | Sign |
New influenza | Cough |
Bad coldd | Fever |
New bad cold | Neck pain |
Epidemic bad cold | Sore throat |
H1N1 (English)c | Throat pain |
Bird flu | PCR (English)c,e |
Swine flu | Treatment |
Tamiflu | Complication |
Tamiflu (English)c | Decease |
aQuery group 1 consisted of queries specific to or related to influenza. Query group 2 contained queries not specific to influenza.
bAbbreviation.
cThe query was originally submitted in English.
d“Bad cold (독감)” in Korean has two meanings: one is influenza and the other, a severe common cold. “Flu” in query group 1 is “플루” which is the English pronunciation written in Korean. In Korea, “Bad cold (독감)” is a more popular term than “Flu (플루)” or “Influenza (인플루엔자)”.
ePCR: polymerase chain reaction.
Correlation analysis was performed between the Daum data for combined queries and the KCDC ILI data in development sets 1 and 2 (
Correlation analysis between the Daum data for combined queries and the KCDC ILI data in development sets 1 and 2.
Order | Combined query (development set 1) | Development set 1 (2009/10) | Validation set 1 (2010/11) | Validation set 2 (2011/12) | Combined query (development set 2) | Development set 2 (2010/11) | Validation set 2 (2011/12) |
1 | New flu (abbr.)a | .894b | .622b | c | Bad cold + Symptom | .969b | .981b |
2 | Flu + Vaccine | .871b | -.062d | -.157e | New flu + Treatment | .951b | .616b |
3 | New flu + Cough | .849b | .930b | .291b | New flu + Cough | .930b | .291b |
4 | New flu + Fever | .814b | .591b | .460b | New flu + Sign | .919b | .684b |
5 | Tamiflu + Vaccine | .805b | -.062c | c | Tamiflu | .904b | .981b |
6 | Tamiflu + Symptom | .800b | c | c | New influenza + Symptom | .896b | .650b |
7 | Flu + Symptom | .799b | .815b | .416b | Bad cold + Treatment | .887b | .814b |
8 | H1N1 + Symptom | .791b | c | c | Swine flu + Symptom | .877b | .005e |
9 | New flu + Sore throat | .738b | .504b | c | New flu + Symptom | .836b | .936b |
10 | New flu (abbr.)a + Vaccine | .713b | c | c | Flu + Symptom | .815b | .416b |
11 | New flu + Symptom | .709b | .836b | .936b | Influenza + Symptom | .813b | .782b |
12 | Tamiflu | .703b | .904b | .981b | Influenza (English)g | .762b | .751b |
13 | Tamiflu (English)g | .700b,h | .523b | .286b | New influenza | .748b | .503b |
14 | | | | | Bird flu + Symptom | .747b | .005f |
15 | | | | | Bird flu | .709b,h | .136i |
aabbr.: abbreviation
b
cCorrelation cannot be computed because it has a constant value in that period (see
d
e
f
gThe query was originally submitted in English.
hWe selected the combined queries for which the correlation coefficients were ≥.7 and listed them in descending order.
i
Plot of combined queries that show significant correlation coefficients (P<.05) in consecutive years (only “Tamiflu” and “New flu + Symptom” showed r values greater than .7 for 3 consecutive years).
A total of 13 cumulative query methods were created in development set 1 (see
In each development set, cumulative query methods had a higher correlation coefficient than combined queries (see
Correlation coefficients of cumulative query method
Cumulative query method | Correlation coefficient in validation set 1 | Correlation coefficient in validation set 2 from development set 1 | Correlation coefficient in validation set 2 from development set 2 |
1 | .622b | c | .981b,d |
2 | .183e | -.157f | .975b,d |
3 | .916b,d | .092g | .975b,d |
4 | .933b,d | .467b | .975b,d |
5 | .933b,d | .467b | .987b,d |
6 | .933b,d | .467b | .986b,d |
7 | .943b,d | .486b | .986b,d |
8 | .943b,d | .486b | .986b,d |
9 | .943b,d | .486b | .968b |
10 | .943b,d | .486b | .968b |
11 | .838b | .935b,d | .965b |
12 | .841b | .953b,d | .965b |
13 | .841b | .953b,d | .964b |
14 | Not applicable | Not applicable | .964b |
15 | Not applicable | Not applicable | .780b |
aWe selected the combined queries for which the correlation coefficients were ≥.7 and listed them in descending order. We then created a cumulative query method
b
cCorrelation of cumulative query method 1 in validation set 2 from development set 1 cannot be computed because it has a constant value in that period (see
dA useful cumulative query method in the validation set was defined as one having a higher correlation coefficient than the highest correlation coefficient of a single combined query in the same development set.
e
f
g
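The definition of a useful cumulative query method can be applied directly to the tabulated values. The sketch below uses the validation set 1 coefficients from the table above and the highest single combined query coefficient in development set 1 (.894, for “New flu (abbr.)”); the function name `useful_methods` is illustrative.

```python
# Validation set 1 correlation coefficients for cumulative query methods 1-13.
validation_set_1 = [.622, .183, .916, .933, .933, .933, .943,
                    .943, .943, .943, .838, .841, .841]

# Highest correlation coefficient of a single combined query in development set 1.
best_single_dev_1 = .894

def useful_methods(validation_r, best_single_r):
    """Return the 1-based indices of cumulative methods whose r exceeds the best single query."""
    return [i for i, r in enumerate(validation_r, start=1) if r > best_single_r]

useful = useful_methods(validation_set_1, best_single_dev_1)
useful_r = [validation_set_1[i - 1] for i in useful]
print(useful)                       # [3, 4, 5, 6, 7, 8, 9, 10]
print(min(useful_r), max(useful_r)) # 0.916 0.943
```

This reproduces the 8 useful cumulative query methods in validation set 1, with coefficients ranging from .916 to .943.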
Scatter plot between the KCDC ILI and cumulative query method 5 in validation set 2.
Correlation coefficients of combined queries for which the correlation coefficients were ≥.7 and cumulative query methods in set 1.
Cumulative query method from development set 1 (2009/10) | Correlation coefficient, 2009/10 | Correlation coefficient, 2010/11 | Combined query from development set 1 (2009/10) | Correlation coefficient, 2009/10 | Correlation coefficient, 2010/11 |
1 | .894 | .622 | New flu (abbr.)a | .894 | .622 |
2 | .887 | .183 | Flu + Vaccine | .871 | -.062 |
3 | .883 | .916 | New flu + Cough | .849 | .930 |
4 | .861 | .933 | New flu + Fever | .814 | .591 |
5 | .860 | .933 | Tamiflu + Vaccine | .805 | -.062 |
6 | .859 | .933 | Tamiflu + Symptom | .800 | .b |
7 | .849 | .943 | Flu + Symptom | .799 | .815 |
8 | .849 | .943 | H1N1 + Symptom | .791 | .b |
9 | .851 | .943 | New flu + Sore throat | .738 | .504 |
10 | .853 | .943 | New flu (abbr.)a + Vaccine | .713 | .b |
11 | .712 | .838 | New flu + Symptom | .709 | .836 |
12 | .728 | .841 | Tamiflu | .703 | .904 |
13 | .728 | .841 | Tamiflu (English)c | .700 | .523 |
aabbr.: abbreviation
bCorrelation cannot be computed because it has a constant value in that period (see
cThe query was originally submitted in English.
Correlation coefficients of combined queries for which the correlation coefficients were ≥.7 and cumulative query methods in set 2.
Cumulative query method from development set 2 (2010/11) | Correlation coefficient, 2010/11 | Correlation coefficient, 2011/12 | Combined query from development set 2 (2010/11) | Correlation coefficient, 2010/11 | Correlation coefficient, 2011/12 |
1 | .969 | .981 | Bad cold + Symptom | .969 | .981 |
2 | .977 | .975 | New flu + Treatment | .951 | .616 |
3 | .978 | .975 | New flu + Cough | .930 | .291 |
4 | .982 | .975 | New flu + Sign | .919 | .684 |
5 | .970 | .987 | Tamiflu | .904 | .981 |
6 | .968 | .986 | New influenza + Symptom | .896 | .650 |
7 | .969 | .986 | Bad cold + Treatment | .887 | .814 |
8 | .967 | .986 | Swine flu + Symptom | .877 | .005 |
9 | .853 | .968 | New flu + Symptom | .836 | .936 |
10 | .853 | .968 | Flu + Symptom | .815 | .416 |
11 | .854 | .965 | Influenza + Symptom | .813 | .782 |
12 | .854 | .965 | Influenza (English)a | .762 | .751 |
13 | .857 | .964 | New influenza | .748 | .503 |
14 | .857 | .964 | Bird flu + Symptom | .747 | .005 |
15 | .860 | .780 | Bird flu | .709 | .136 |
aThe query was originally submitted in English.
Plot of combined queries for which the correlation coefficients were .7 or higher and cumulative query methods of set 1.
Plot of combined queries for which the correlation coefficients were .7 or higher and cumulative query methods of set 2.
In this study, the cumulative query method showed relatively higher correlation with national influenza surveillance data than combined queries in the development and validation sets.
Many people use Internet searches for health information before visiting a doctor [
Search queries may vary from country to country. In Korean, “Bad cold (독감)” has two meanings: one is influenza and the other is a severe common cold. Since the 2009/10 epidemiological year, the Influenza A (H1N1) pandemic period, the media have used “New flu (신종플루)” to distinguish H1N1 influenza from previous influenzas. In 2010, KCDC announced that the official term was “Influenza (인플루엔자)” [
For the 2009/10 epidemiological year (development set 1), 13 combined queries had correlation coefficient
It is difficult to predict how search queries will change in the future. To reduce the effects of such changes, we constructed our method from combinations of queries and the accumulation of combined queries. The method is meaningful only when a cumulative query method has a higher correlation coefficient than the highest single combined query. In each validation set, 8 useful cumulative query methods were identified. The useful cumulative query methods in each validation set had a high correlation coefficient (
We used proportional data from Daum, a non-dominant local search engine (approximately 25% of the market share) in South Korea [
There are several limitations to this study. The survey sample was not representative. Respondents were asked to provide typed queries without any mention of the influenza pandemic of 2009/10, and the survey was conducted recently, so recent search queries were more likely to have been included. This might affect the performance of the cumulative query method. Further, the data from the influenza pandemic of 2009/10 might affect the outcome of this study. We did not combine queries from the same query group. Although important, the performance of using the symptoms in the KCDC definition of ILI was not tested. The learning effect from the influenza pandemic of 2009/10, news reports, outbreak briefs, health information from the Internet, and changes in search behavior stemming from the diffusion of smartphones might have affected the outcome of this study, and we did not determine the extent to which these factors affected searching behavior. Data from subsequent years are required to determine how long a cumulative query method remains useful.
We presented a cumulative query method using search engine data. We conducted a survey to obtain population search queries. To reduce the effects of changes in search queries, we used combinations of queries and the accumulation of combined queries. Our method showed high correlation with national influenza surveillance data in South Korea. However, additional research is needed to refine the method.
Seasonality of influenza in South Korea.
Full study data (proportional data from Daum are multiplied by 10^12).
Scaled proportional data based on the two best cumulative query methods.
Scatter plots between the KCDC ILI and other useful cumulative query methods.
Cumulative query method for influenza virologic data.
GFT: Google Flu Trends
ILI: influenza-like illness
KCDC: Korea Centers for Disease Control and Prevention
PCR: polymerase chain reaction
This study was supported by grant 2012-0580 from the Asan Institute for Life Sciences, Seoul, Korea. The content is solely the responsibility of the authors and does not necessarily represent the official views of the Asan Institute for Life Sciences, Seoul, Korea. The technical consultation was supported by Daum Communications. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Our study was based on the search engine Daum. This study was partly supported by Daum Communications, the employer of author Maengsoo Yu.