This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Influenza epidemics result in a public health and economic burden worldwide. Traditional surveillance techniques, which rely on doctor visits, provide data with a delay of 1 to 2 weeks. A means of obtaining real-time data and forecasting future outbreaks is desirable to provide more timely responses to influenza epidemics.
This study aimed to present the first implementation of a novel dataset by demonstrating its ability to supplement traditional disease surveillance at multiple spatial resolutions.
We used internet traffic data from the Centers for Disease Control and Prevention (CDC) website to determine the potential usability of this data source. We tested the traffic generated by 10 influenza-related pages in 8 states and 9 census divisions within the United States and compared it against clinical surveillance data.
Our results yielded an
These results demonstrate that internet data may be able to complement traditional influenza surveillance in some cases but not in others. Specifically, our results show that the CDC website traffic may inform national- and division-level models but not models for each individual state. In addition, our results show better agreement when the data were broken up by seasons instead of aggregated over several years. We anticipate that this work will lead to more complex nowcasting and forecasting models using this data stream.
Every year, an estimated 5% to 20% of people in the United States become infected with influenza [
One common surveillance measure is the fraction of patients presenting with influenza-like illness (ILI), consisting of a fever of at least 100°F (37.8°C) and a cough or sore throat with no other known cause [
In the United States, 87% [
There are two main types of health-related internet activity. The first is health sharing, in which internet users post about health-related topics (eg, a tweet about being sick). The second is health seeking, in which users use the internet to obtain information about health-related topics [
Internet data have been used to forecast disease incidence in other models. Polgreen et al [
Using a Poisson distribution and Lasso regression, McIver and Brownstein [
Integrating both Wikipedia data and Google Flu Trends, Bardak et al [
As part of the Centers for Disease Control and Prevention (CDC)’s 2013 to 2014 Predict the Influenza Season Challenge, 9 teams used digital data sources to create forecasting models. The digital sources these teams used were Wikipedia, Twitter, Google Flu Trends, and HealthMap. The teams used either mechanistic or statistical models to create their forecasts, with the most successful team using multiple data sources, which may have reduced biases usually associated with internet data streams [
Internet data streams have also been used to supplement traditional surveillance techniques with nowcasting models. Paul et al [
In contrast, we considered page view data for the CDC website rather than search data from sites not solely devoted to public health. We chose this dataset because we expect it to be inherently less noisy given its focus on public health issues. We used ordinary least squares (OLS) regression to nowcast influenza nationally, across the 9 US census divisions, and across 8 states using access data from 10 influenza-related CDC pages. Our nowcasting models cover influenza seasons from 2013 to 2016, with the 2012 to 2013 season partially included because the CDC page view dataset begins on January 1, 2013. Including an incomplete influenza season indicates whether this dataset can be used given a more restrictive time frame. We report both positive and negative results to advance our knowledge of when internet data may or may not work. The negative results are crucial to advancing internet-based disease surveillance, as they demonstrate when these data sources yield unreliable surveillance. We focus on answering the following two research questions: (1) Can CDC page visits be used as an additional data source for monitoring disease incidence? and (2) What is the appropriate time shift of the page view data needed to obtain the best data fit?
We used page view data provided by the CDC. Each data point contains the page name, date and time of access, and the geographic location from where the page was viewed. These data are available at geographic resolutions of national and state levels and include some metropolitan areas (eg, New York City). The data are available at a number of temporal resolutions beginning on January 1, 2013. For these models, we used weekly page view data to coincide with the ILI data temporal resolution. The data are available as raw page view counts and page view counts normalized with respect to all CDC page views, and we considered the latter for this work. We selected pages associated with general influenza information, treatment, and diagnosis. Pages were sometimes renamed, but we were able to follow the evolution of each selected page by using keywords in the page titles as well as the date ranges for available data.
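The normalization described above can be sketched as follows. This is a minimal illustration, not the study's pipeline; the page names and counts are hypothetical placeholders.

```python
# A minimal sketch of the normalization described above: each influenza
# page's weekly view count expressed as a fraction of all CDC page views
# that week. Page names and counts here are hypothetical.
weekly_flu_views = {"Symptoms": 1200, "Treatment": 800}
total_cdc_views = 100_000  # all CDC page views in the same week

normalized = {page: count / total_cdc_views
              for page, count in weekly_flu_views.items()}
print(normalized)  # fractions of total CDC traffic
```

Normalizing by total CDC traffic controls for overall fluctuations in site visits, so changes in a page's share reflect interest in that topic rather than general traffic volume.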
As the majority of health-related internet searches concern specific conditions, treatments, and procedures [
We selected states based on the severity of flu (determined from FluView) during the available seasons and on the availability of ILI data at the time of the study, which is not standardized and depends on each state’s reporting mechanism. ILI data for each state include the week ending or starting date and the percentage of ILI visits for the specified week. Some states also report additional data, such as school closures and hospitalizations, but these are not available from every state; ILI reporting and accessibility vary across all states. The states we selected were (1) California, (2) Maine, (3) Missouri, (4) New Jersey, (5) New Mexico, (6) North Carolina, (7) Texas, and (8) Wisconsin. With the exception of Texas, these states did not release ILI data outside of the typical flu season. Because the purpose of this study was to demonstrate the viability of nowcasting, we considered only the ILI data available during the study period; although some states have since made their ILI data more accessible, excluding data that were unavailable at the time preserves the premise of nowcasting. Likewise, our state ILI data often came from each state’s individual weekly reports during the seasons used in the study. A complete list of the data sources for the state ILI can be found in
Percentage of ILI visits per state compared with the typical national baseline of 2%. Maine (dark blue) and Texas (teal) exhibit outlier behavior, with Texas having a greater ILI percentage and Maine having a lesser ILI percentage. The remaining states follow the national ILI trend, shown in black. ILI: influenza-like illness.
Normalized CDC web traffic as a heat map. Darker areas indicate more page views and appear to correlate with increases in influenza-like illness. The page views also appear to be more prevalent during the typical influenza season, October to May. CDC: Centers for Disease Control and Prevention.
In addition to selected states, we also considered the 9 US census divisions: New England, Middle Atlantic, East North Central, West North Central, South Atlantic, East South Central, West South Central, Mountain, and Pacific.
We used statsmodels version 0.9.0 [
Mathematical formula of the linear ILI models created in this study. The model M represents the fraction of ILI visits, M = β0 + Σi βi xi, where the βi are the regression coefficients and the xi are the normalized page view counts for the selected pages.
We analyzed the data at the national, division, and state levels and computed
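The error metrics reported in the tables below can be computed as in the sketch that follows. Note that the normalized RMSE has several conventions; normalizing by the range of the observed values is an assumption here.

```python
import numpy as np

def rmse(observed, modeled):
    """Root mean square error between observed and modeled ILI."""
    observed = np.asarray(observed, dtype=float)
    modeled = np.asarray(modeled, dtype=float)
    return float(np.sqrt(np.mean((observed - modeled) ** 2)))

def nrmse(observed, modeled):
    """RMSE normalized by the range of the observed values, so scores
    are comparable across regions with different ILI magnitudes.
    (Range normalization is one common convention, assumed here.)"""
    observed = np.asarray(observed, dtype=float)
    return rmse(observed, modeled) / float(observed.max() - observed.min())
```

Because states differ widely in baseline ILI percentages, the normalized form allows error comparisons across geographic resolutions.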
We selected pages that corresponded to the topics most often searched during web-based health-seeking activities. Aggregating all 10 pages in a single model, we were able to achieve an
Pages and shifts for the most successful models for each influenza season at the national level.
Pages used in model | Season | Shift | R² | Root mean square error | Normalized root mean square error
FluView, Symptoms, and Treatment | 2012-2013 | None | 0.912 | 0.423 | 0.070 |
Symptoms | 2015-2016 | None | 0.892 | 0.213 | 0.060 |
FluView | 2013-2014 | None | 0.802 | 0.510 | 0.111 |
Antivirals and Prevention | 2014-2015 | None | 0.778 | 0.615 | 0.103 |
These plots show national models and the associated pages and influenza seasons. (A) FluView, Symptoms, and Treatment, 2012 to 2013. (B) Symptoms, 2015 to 2016. (C) FluView, 2013 to 2014. (D) Antivirals and Prevention, 2014 to 2015. CDC: Centers for Disease Control and Prevention; ILI: influenza-like illness.
Using the data for each of the 9 census divisions, we were able to achieve an
Census division model successes using the FluView, Symptoms, and Treatment pages for the 2012 to 2013 influenza season. (A) West North Central, 2012 to 2013. (B) Mountain, 2012 to 2013. (C) East North Central, 1-week shift, 2012 to 2013. (D) Pacific, 2012 to 2013. (E) West South Central, 2012 to 2013. These plots represent the census division models that had the highest
The 9 census divisions and the season and shift for which the division’s model had the highest
Division | Season | Shift | R² | Root mean square error | Normalized root mean square error
West North Central | 2012-2013 | None | 0.955 | 0.367 | 0.057 |
Mountain | 2012-2013 | None | 0.921 | 0.336 | 0.077 |
New England | 2015-2016 | None | 0.920 | 0.096 | 0.096 |
East North Central | 2012-2013 | 1 week | 0.899 | 0.331 | 0.076 |
South Atlantic | 2015-2016 | None | 0.893 | 0.218 | 0.065 |
Middle Atlantic | 2015-2016 | None | 0.861 | 0.302 | 0.073 |
Pacific | 2012-2013 | None | 0.849 | 0.503 | 0.094 |
West South Central | 2012-2013 | None | 0.828 | 0.986 | 0.105 |
East South Central | 2015-2016 | 1 week | 0.793 | 0.365 | 0.082 |
We found
The most successful results for each state considered in this study.
State | Page(s) | Season | Shift | R² | Root mean square error | Normalized root mean square error
Texas | Alla | 2012-2013 | 1 week | 0.930 | 0.667 | 0.067 |
Wisconsin | FVSTb | 2012-2013 | None | 0.833 | 0.533 | 0.127 |
New Jersey | All | 2012-2013 | 1 week | 0.832 | 0.767 | 0.117 |
Missouri | FVST | 2012-2013 | 1 week | 0.823 | 0.801 | 0.127 |
North Carolina | FVST | 2015-2016 | 1 week | 0.781 | 0.455 | 0.106 |
New Mexico | All | 2015-2016 | 1 week | 0.771 | 1.184 | 0.197 |
California | FVST | 2012-2013 | 1 week | 0.758 | 0.777 | 0.125 |
Maine | Antivirals | 2012-2013 | None | 0.662 | 0.445 | 0.171 |
aAll: all 10 influenza-related pages.
bFVST: FluView, Symptoms, and Treatment.
Different states during different seasons. (A) Texas, 1-week shift, 2012 to 2013. (B) Wisconsin, 2013 to 2014. (C) Missouri, 2014 to 2015. (D) North Carolina, 2015 to 2016. (E) Wisconsin, 2012 to 2013. The
We were not surprised that Texas had the best fit. Texas was the only state we included that provided ILI data not only for the typical influenza season but also for the off-season, and these additional data likely contributed to the success of the Texas models. In keeping with our nowcasting scenario, we included only data available during the study period; although other states have since released off-season ILI data, none did so during the study. The lack of success we encountered in modeling Maine was also expected because of Maine’s outlier ILI behavior, with values considerably lower than, and out of pattern with, other states. The models in
We then shifted the ILI data forward by 1 week. The regression analysis yielded 7 state/season combinations with
States with models that had an
State | Season | R² | Root mean square error | Normalized root mean square error
Texas | 2012-2013 | 0.930 | 0.667 | 0.067 |
New Jersey | 2012-2013 | 0.832 | 0.767 | 0.117 |
New Mexico | 2015-2016 | 0.771 | 1.184 | 0.197 |
California | 2012-2013 | 0.746 | 0.797 | 0.129 |
Wisconsin | 2012-2013 | 0.727 | 0.626 | 0.153 |
North Carolina | 2015-2016 | 0.708 | 1.028 | 0.204 |
Missouri | 2012-2013 | 0.702 | 1.039 | 0.165 |
Adding only the FluView, Symptoms, and Treatment pages, we obtained an
The purpose of this study was to demonstrate the viability of near real-time nowcasting during the influenza seasons from 2013 to 2016. To maintain the premise of nowcasting, we chose states with publicly available data, or data available on request, during the period of the study. During the study period, state ILI data were not readily available on the CDC website. Instead, we had to rely on data available through state health-related organizations for each state. In addition, throughout the course of influenza seasons, ILI numbers are often updated as delayed data are reported and made available. However, because we are focusing our study on a nowcasting scenario, we do not consider the ILI data from those seasons as they are reported today but rather as they were reported during the study period.
We generally found the models to be successful when considering pages most closely related to typical health-seeking behavior and when considering each flu season individually. When trying to model multiple influenza seasons together, we had a number of unsuccessful models. Considering all pages and national ILI data, the model combining the 2012 to 2013 and 2013 to 2014 influenza seasons had an
We speculate that a number of factors could contribute to these negative results. Although influenza is a seasonal disease, similar strains can span multiple years, affecting the susceptible populations in subsequent years. Our data stream may be biased toward individuals with more awareness of the CDC. Furthermore, individuals who search for influenza information in one season may not search for that information the next year. Finally, with the exception of Texas, we only have ILI data for the influenza season itself. Thus, although we do have internet data for off-season influenza page views, we do not have corresponding ILI data.
Internet surveillance data have proven beneficial in predicting ILI incidence during flu seasons. However, our results show that the benefit of internet data streams for informing disease surveillance is inconclusive; that is, this study shows that CDC website traffic can be informative in some cases (eg, at the national level) but not in others (eg, at the state level). To determine the extent of this benefit, we return to our original research questions.
Given the successes of some of our models, we can conclude that CDC page view data can be used as an additional data source for monitoring disease incidence in some cases (eg, at the national level). The degree to which these data can be used appears to rely on the page selection and time frame. The results of the best models varied across geographic and temporal resolutions, but some trends were consistent in most cases. We obtained successful nowcasts when selecting pages related to topics most commonly used for web-based health queries (specific diseases and treatments) during the time span of a typical influenza season. Longer time spans and pages less associated with specific diseases and treatments led to less successful models. Outlier behavior, such as the ILI data in Maine, affected our models and resulted in less successful models than states with ILI curves exhibiting expected behavior. These results can assist others in selecting appropriate supplemental datasets for disease surveillance as well as appropriate spatial and temporal resolutions.
We obtained our most successful results using a 1-week shift. Moreover, 2-week shifts were successful in some cases but were overall less correlated than 1-week shifts (
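The 1-week shift can be implemented by lagging one series relative to the other, for example with pandas. The values below are made up, and pairing this week's page views with the following week's ILI is one plausible reading of the shift direction.

```python
import pandas as pd

# Hypothetical weekly values for a fragment of one season
df = pd.DataFrame({
    "page_views": [0.002, 0.005, 0.009, 0.006],  # normalized CDC views
    "ili_pct":    [1.0,   1.8,   3.1,   2.4],    # % ILI visits
})

# Pair week t's page views with week t+1's ILI (a 1-week shift)
df["ili_next_week"] = df["ili_pct"].shift(-1)
shifted = df.dropna(subset=["ili_next_week"])
print(shifted[["page_views", "ili_next_week"]])
```

The final week drops out because it has no following ILI observation; a 2-week shift would use `shift(-2)` and drop two weeks.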
We conclude that more studies on internet data streams are needed to understand when and why internet data work. Our methods are consistent with other feasibility studies and provide insight into the conditions under which internet data streams may inform influenza models. Future work should include rigorously testing the predictive power of the models by separating data into training and testing sets [
More studies on geographic resolution could provide a better insight into why some models outperform others at various spatial resolutions. National models across single influenza seasons performed well, with each season included in the study having at least one model with an
More studies on temporal resolution could provide better insight into how best to model seasonal diseases over multiple seasons. Models across multiple seasons were not successful, which we attribute in part to off-season ILI data being unavailable during the study period. As influenza is a seasonal disease, modeling multiple seasons with 1 model may not be the correct approach, and our multiseason models support this idea. However, more exhaustive studies are needed to draw definitive conclusions on the appropriate temporal resolution for modeling influenza.
Names and date ranges of web pages used in this study.
Sources for state influenza-like illness data used in this study.
Influenza-like illness data for the nine census divisions.
State influenza-like illness data.
The nine US census divisions, listing all states in each division.
Comprehensive list of all models in this study not included in the main text. The list includes model successes and model failures.
Centers for Disease Control and Prevention
influenza-like illness
normalized root mean square error
ordinary least squares
root mean square error
This work was supported by the Los Alamos National Laboratory Information Science and Technology Institute and the National Institutes of Health, the National Institute of General Medical Sciences, and the Modeling for Infectious Disease Agent Study program under grant U01-GM097658-071. Matthew Biggerstaff of the CDC provided integral insight and support. Curt Canada and the Data Science at Scale Summer School were integral to this project. Stephen Wirkus, Abigail Hunter, and Catherine S Plesko provided recommendations and clarifications. Jonathan Woodring, David H Rogers, and Francesca Samsel provided technical and visual assistance. Los Alamos National Laboratory is operated by Triad National Security, Limited Liability Company, for the National Nuclear Security Administration for the US Department of Energy (contract number: 89233218NCA000001).
WC, GF, and SD conceptualized the project and performed data analysis. GF performed data curation. WC wrote the manuscript. GF and SD edited the manuscript. WC created the visualizations.
None declared.