Nowcasting Influenza Incidence with CDC Web Traffic Data: A Demonstration Using a Novel Data Set

Influenza epidemics result in a public health and economic burden around the globe. Traditional surveillance techniques, which rely on doctor visits, provide data with a delay of 1-2 weeks. A means of obtaining real-time data and forecasting future outbreaks is desirable to provide more timely responses to influenza epidemics. In this work, we present the first implementation of a novel data set by demonstrating its ability to supplement traditional disease surveillance at multiple spatial resolutions. We use Internet traffic data from the Centers for Disease Control and Prevention (CDC) website to determine the potential usability of this data source. We test the traffic generated by ten influenza-related pages in eight states and nine census divisions within the United States and compare it against clinical surveillance data. Our results yield $r^2$ = 0.955 in the most successful case, promising results for some cases, and unsuccessful results for other cases. These results demonstrate that Internet data may be able to complement traditional influenza surveillance in some cases but not in others. Specifically, our results show that the CDC website traffic may inform national and division-level models but not models for each individual state. In addition, our results show better agreement when the data were broken up by seasons instead of aggregated over several years. In the interest of scientific transparency to further the understanding of when Internet data streams are an appropriate supplemental data source, we also include negative results (i.e., unsuccessful models). We anticipate that this work will lead to more complex nowcasting and forecasting models using this data stream.


Introduction
Every year, an estimated 5% to 20% of people in the United States become infected with influenza [1]. The typical influenza season begins in October and ends in May, with the peak occurring in the winter months. Annually, 3,000-50,000 people die from the flu, with another 200,000 requiring hospitalization [15]. The yearly flu burden is estimated to cost around $87 billion in lost productivity [15]. Timely surveillance of influenza can help reduce this burden by allowing health care facilities to prepare more adequately for the influx of patients when flu levels are high [13].
One common surveillance measure is the fraction of patients presenting with influenza-like illness (ILI), defined as a fever of at least 100°F (37.8°C) and a cough or sore throat with no other known cause [16]. ILI data are collected from about 2,900 volunteer health care providers throughout the United States, although each week only about 1,800 report their data. These data are then aggregated and made public after a time lag of about 1-2 weeks [1-6]. Because the ILI data are collected from volunteer providers, the data set is incomplete. If policies were enacted to provide incentives for reporting, or to make reporting compulsory, the result would be a more complete data set. Other surveillance systems include virological data from the World Health Organization, emergency department visits, electronic health records, crowdsourced ILI reports, Widely Internet Sourced Distributed Monitoring, Influenzanet, and Flu Near You [17,20].

Internet data streams
In the United States, 87% of adults use the Internet [7]. Of those Internet users, 72% have used the Internet to search for health information within the last year [7]. The most common health-related searches are for information regarding a specific disease or condition (66%) and information about a specific treatment or procedure (56%) [7,8].
There are two main types of health-related Internet activity. The first is health sharing, in which Internet users post about health-related topics (e.g., a tweet about being sick). The second is health seeking, in which users utilize the Internet to obtain information about health-related topics [2]. In this paper, we focus on health-seeking behavior. Previous studies have shown that analyzing online health-seeking behavior can improve early detection of disease incidence by detecting changes in disease activity [5,9,21,23,25,28]. Similarly, other studies have shown that Internet data emerging from search queries can aid detection of outbreaks in areas with large populations of Internet users [10], because online health-related search queries and epidemics are often strongly correlated [10,11].
Internet data have been used to forecast disease incidence in other models. Polgreen et al. developed linear influenza forecasting models with lags of 1 to 10 weeks for each of the 9 U.S. census regions using search queries from Yahoo [5]. The best performing models had lags of 1-3 weeks and an average $r^2$ of 0.38 (with a high of 0.57 in the East-South-Central region) [5]. These low $r^2$ values demonstrate potential problems in relying on search information alone. Ginsberg et al. were able to predict influenza epidemics two weeks in advance using Google search queries to fit linear models using log-odds of ILI visits and related searches [9].
Using a Poisson distribution and LASSO regression, McIver and Brownstein obtained an $r^2$ value of 0.946 using Wikipedia data [4], although some data were excluded from analyses due to increased media attention and higher than normal influenza activity. Generous et al. used Wikipedia data to train a statistical model with linear regression, which demonstrated its potential for forecasting disease incidence around the globe, including influenza in the United States, which had an $r^2$ of 0.89 [3]. Hickmann et al. conducted a similar study of linear regression models, which showed that using Wikipedia to forecast influenza in the United States for the 2013-2014 season resulted in an $r^2$ value greater than 0.9 in some instances [1].
Integrating both Wikipedia data and Google Flu Trends, Bardak et al. obtained $r^2$ values of 0.94 and 0.91 using ordinary least squares (OLS) and ridge regression, respectively, for forecasting influenza outbreaks [12]. For OLS nowcasting, the $r^2$ value was 0.98 in the best case. For the best fit, the weekly data were offset by one week [12].
As part of the CDC's 2013-2014 Predict the Influenza Season Challenge, 9 teams used digital data sources to create forecasting models. The digital sources these teams utilized were Wikipedia, Twitter, Google Flu Trends, and HealthMap. The teams used either mechanistic or statistical models to create their forecasts, with the most successful team using multiple data sources, which may have reduced biases usually associated with Internet data streams [18]. Broniatowski et al. used Twitter data to detect increasing and decreasing influenza prevalence with 85% accuracy [19]. Zhang et al. used Twitter data to inform stochastic, spatially structured mechanistic models of influenza in the United States, Italy, and Spain [26].
Internet data streams have also been used to supplement traditional surveillance techniques with nowcasting models. Paul et al. used Twitter along with ILI data from the CDC to produce nowcasting influenza models as well as nowcasting models using solely ILI data. They conclude that the addition of Twitter data led to more accurate nowcasting models [22]. Santillana et al. combined Google Trends data and CDC-reported ILI data to create models for nowcasting and forecasting influenza [24]. Lampos et al. used search query data to explore both linear and nonlinear nowcasting models [27]. Yang et al. used Google search data to create an influenza tracking model with autoregression [29].
In contrast, we consider data on page views of the CDC website rather than search data from sites not solely devoted to public health. We use this data set because we expect it to be inherently less noisy because of its focus on public health issues. We use ordinary least squares to nowcast influenza nationally, across the 9 U.S. census divisions, and across 8 states using access data from 10 influenza-related CDC pages. Our nowcasting models cover influenza seasons from 2013 to 2016, with the 2012-2013 season being partially included because our data set begins Jan. 1, 2013. The inclusion of an incomplete influenza season serves to inform whether this data set can be used given a more restrictive time frame. We include both positive and negative results to advance our knowledge regarding when Internet data may or may not work. The negative results are crucial to advancing the field of disease surveillance using Internet data, as they demonstrate when these data sources contribute to unreliable surveillance. We focus on answering the following two research questions:
Q1: Can CDC page visits be used as an additional data source for monitoring disease incidence?
Q2: What is the appropriate shift needed to obtain the best data fit?

Data Sources
We used page view data provided by the Centers for Disease Control and Prevention (CDC). Each data point contains the page name, date and time of access, and the geographic location from where the page was viewed. These data are available at geographic resolutions of national and state levels and include some metropolitan areas (e.g., New York City). The data are available at a number of temporal resolutions beginning on January 1, 2013. For these models, we use weekly page view data to coincide with the ILI data temporal resolution. The data are available as raw page view counts and normalized page view counts, and we consider the latter for this work. We selected pages associated with general influenza information, treatment, and diagnosis. Pages were sometimes renamed, but we were able to follow the evolution of each selected page by utilizing key words in the page titles as well as the date ranges for available data.
Because the majority of health-related Internet searches concern specific conditions, treatments, and procedures [8], we selected pages related to those topics. These pages also align with Johnson et al., who used pages in the categories of Diagnosis/Treatment and Prevention/Vaccination for influenza surveillance [14]. Specifically, we used the following pages: antivirals, flu basics, FluView, high risk complications, key facts, prevention, symptoms, treating influenza, treatment, and vaccine.
We then aggregate the page views of interest for each of our models. A complete list of pages can be found in Appendix A.
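As an illustration of this aggregation step, the following sketch filters raw page-view records to the pages of interest and sums them into the weekly bins used by the ILI data. The column names, page labels, and record values are hypothetical, not the actual CDC export schema:

```python
import pandas as pd

# Toy page-view records; column names and values are illustrative only.
raw = pd.DataFrame({
    "page": ["symptoms", "treatment", "norovirus", "symptoms", "fluview"],
    "date": pd.to_datetime(
        ["2013-01-02", "2013-01-03", "2013-01-03", "2013-01-09", "2013-01-10"]
    ),
    "views": [120, 80, 999, 150, 60],
})

# Pages selected for a given model (a subset of the 10 influenza-related pages).
pages_of_interest = {"fluview", "symptoms", "treatment"}

# Keep only the selected pages, then sum views into weeks ending Saturday,
# matching the weekly resolution of the ILI data.
weekly = (
    raw[raw["page"].isin(pages_of_interest)]
    .set_index("date")
    .resample("W-SAT")["views"]
    .sum()
)
print(weekly)
```

Here the unrelated "norovirus" row is dropped by the filter, and the remaining views collapse into one total per week.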
The states we selected were based on severity of flu (determined from FluView) during the available seasons and on the availability of ILI data, which is not standardized and depends on each state's reporting mechanism. ILI data for each state include the week ending or starting date as well as the percentage of influenza-like illness for the specified week. While some states also report additional data, such as school closures and hospitalizations, these data are not made available by every state. Note that ILI reporting and accessibility vary across the states. The states we selected were 1) California, 2) Maine, 3) Missouri, 4) New Jersey, 5) New Mexico, 6) North Carolina, 7) Texas, and 8) Wisconsin. With the exception of Texas, these states did not release ILI data outside of the typical flu season. A complete list of the data sources for the state ILI can be found in Appendix B, and the clinical data are available in Appendix E. Fig 1 shows the percentage of ILI visits for each state considered in this study as well as the national percentage of ILI visits. Distinct spikes indicate the peaks of the flu seasons. Maine behaves as an outlier at times, with ILI values considerably lower than, and out of pattern with, those of other states. Texas also exhibits outlier behavior, with ILI percentages consistently higher than the typical national baseline of 2%, which is used to determine when the flu has reached epidemic status. These two outliers are shown in teal (Texas) and dark blue (Maine). The national ILI is shown in black. The remaining states exhibit behavior consistent with the national ILI trend. Fig 2 shows the CDC page view data as a heat map: weeks with more page views are shown darker than weeks with fewer page views.

Linear Regression
We used statsmodels version 0.9.0, a statistical analysis module for Python, to perform linear regression on our data sets using OLS. This creates a linear model $M$ of the form

$M(X) = \alpha \cdot X = \alpha_0 + \alpha_1 X_1 + \alpha_2 X_2 + \cdots + \alpha_n X_n$,

where the $\alpha_i$ are the regression coefficients and $X = (1, X_1, X_2, \ldots, X_n)$ is the vector of CDC page view data, with $n$ representing the number of CDC pages used for the model, ranging from 1 to 10. We correlate ILI and CDC page views for the same week or with a one-week shift. In the shifted cases, we shift the ILI data forward by one week, so that the model associates the current week's page views with the following week's ILI data. This shifting is performed to account for the incubation period of influenza and the time between the onset of symptoms and the first doctor visit. Statsmodels uses the CDC page view and ILI data to determine the appropriate regression coefficients, fits the parameters with OLS, and computes the goodness of fit, $r^2$, also referred to as the coefficient of determination. The $r^2$ value measures how well two time series correlate: $r^2 = 1$ indicates a perfect fit, while $r^2 = 0$ indicates no correlation. Although $r^2$ is not necessarily the best metric for judging goodness of fit [2], it is nonetheless the most common metric used and still provides a decent overall sense of fit quality. Additionally, we examined the root mean squared error (RMSE) and the normalized root mean squared error (NRMSE) using Python's sklearn libraries.

Results
We analyzed the data at the national, division, and state levels and computed the $r^2$ for each geographic resolution. In this section, we discuss the results of our experiments, both successes and failures. We include figures of models at the national, census division, and state levels. Because of the varying scales between page views and ILI percent, we normalize the data and our models in order to plot them on the same axes. We use raw data to create the models, and then we normalize each model with respect to its maximum. We also normalize the ILI data and CDC.gov web traffic data with respect to their maximums for the given time period so that all three curves may appear in the same plot.
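The normalization described above can be sketched as follows (a minimal illustration with toy values; the function and variable names are ours):

```python
import numpy as np

def normalize_max(series):
    """Scale a series to [0, 1] by dividing by its own maximum, so that
    page views, ILI percent, and model output can share one set of axes."""
    arr = np.asarray(series, dtype=float)
    return arr / arr.max()

ili_percent = np.array([1.2, 2.8, 5.6, 3.1])  # toy weekly ILI percentages
print(normalize_max(ili_percent))
```

Each curve is scaled independently, so the comparison preserves each series' shape and peak timing but not its absolute magnitude.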

National Results
We selected pages that corresponded to the topics most often searched during online health-seeking activities. When we combined all ten pages, we were able to achieve an $r^2$ value of 0.889 for the national 2012-2013 influenza season after implementing a one-week shift. We also had success modeling the national 2015-2016 influenza season with no shift, achieving an $r^2$ value of 0.834. We obtained better results when limiting the pages to FluView, Symptoms, and Treatment, which we attribute to the information on these pages aligning with topics most commonly used for Internet health seeking. For these pages, the most successful models did not have a shift. For the 2012-2013 influenza season, we achieved an $r^2$ of 0.906. The model for the 2015-2016 season had an $r^2$ value of 0.891. Table 1 shows the most successful model for each influenza season included in this study. Fig. 3 shows these models.

Census Division Results
Using the data for each of the nine census divisions, we were able to achieve $r^2$ > 0.7 in at least one case for each division. We considered all seasons together and separately, with the better results coming from modeling each individual season. We considered all pages together as well as only the pages most closely associated with topics commonly searched by health-seeking individuals. In the most successful case, the model was able to closely match the 2015-2016 influenza season for the West North Central division with an $r^2$ of 0.955 using the FluView, Symptoms, and Treatment pages. Although we had successes using all 10 pages, the most successful model for each division involved only these three pages. Fig. 4 shows some of these models, and Table 2 highlights these successes.

State Results
We found $r^2$ for each of the states considered in this study, using a variety of pages and page combinations. Table 3 lists the most successful model for each state, the season, the data shift, and the $r^2$ value.
Table 3: This table shows the most successful results for each state considered in this study. "All" refers to an aggregation of all 10 pages, and "FVST" refers to an aggregation of the FluView, Symptoms, and Treatment pages.
For the 2012-2013 influenza season, the highest $r^2$ values were for Texas (see Fig 5a) and Wisconsin (see Fig 5e). For the 2013-2014 season, the highest $r^2$ value was 0.187 for Wisconsin (see Fig 5b). For the 2014-2015 season, the highest $r^2$ value was 0.322 for Missouri (see Fig 5c). For the 2015-2016 season, the highest $r^2$ value was 0.647 for North Carolina (see Fig 5d). We were not surprised that Texas had the best fit: Texas was the only state we included that provided ILI data not only for the typical influenza season but also for the off-season, and this additional data likely contributed to the success of the Texas models. The lack of success we encountered in modeling Maine was also expected because of Maine's outlier behavior in ILI, with values considerably lower than and out of pattern with those of other states. The models in Fig 5 included all 10 pages aggregated together. However, as indicated by the individual state results, this does not always lead to the best fit. Successful models often included a combination of select pages (such as FluView, Symptoms, and Treatment) but not an aggregation of all 10. Furthermore, aside from Texas, we did not have ILI data for the states outside of the typical flu season. Without these additional data, we are unable to determine how strongly the lower page views in the off-season correlate with off-season ILI. We then shifted the ILI data forward by one week. The regression analysis yielded 7 state/season combinations with $r^2$ values greater than 0.7 (see Table 4).
The table also includes both the regular and normalized root mean squared errors (RMSE and NRMSE). Aggregating only the FluView, Symptoms, and Treatment pages, we obtained $r^2$ ≥ 0.7 for 6 state/season combinations. For the 2013-2014 season, the highest $r^2$ values were 0.612 for California and 0.568 for Wisconsin. While this is still less than desired, it is a vast improvement upon the $r^2$ values found from aggregating all 10 pages. For the 2014-2015 season, the highest $r^2$ was 0.575 for Missouri. Again, although the correlation appears to be weak, it is stronger than that obtained by taking all 10 pages together. Using these same three pages and implementing a one-week shift, we obtained $r^2$ ≥ 0.7 for 10 state/season combinations. For the 2014-2015 season, the highest $r^2$ value was 0.548 for Missouri.
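The error metrics reported alongside the fits can be computed as in the following sketch (toy values; since the text does not specify the NRMSE normalization, we assume the range-based convention here):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

observed = np.array([1.0, 2.5, 4.0, 2.0])   # toy weekly ILI percentages
predicted = np.array([1.2, 2.3, 3.6, 2.1])  # toy model nowcasts

rmse = np.sqrt(mean_squared_error(observed, predicted))
# One common NRMSE convention divides by the range of the observed series,
# making errors comparable across states with different ILI scales.
nrmse = rmse / (observed.max() - observed.min())
print(f"RMSE={rmse:.3f}, NRMSE={nrmse:.3f}")
```

Unlike $r^2$, which is scale-free, RMSE is in the units of ILI percent, which is why a normalized variant is useful when comparing across geographies.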

Model Failures
We generally found the models to be successful when considering pages most closely related to typical health-seeking behavior and when considering each flu season individually. When trying to model multiple influenza seasons together, we had a number of unsuccessful models; considering all pages and national ILI data, the models combining multiple seasons performed poorly. We speculate that a number of factors could contribute to these negative results. While influenza is a seasonal disease, similar strains can span multiple years, affecting the susceptible populations in subsequent years. Our data stream may be biased toward individuals with more awareness of the CDC. Furthermore, individuals who search for influenza information in one season may not search for that information the next year. Finally, with the exception of Texas, we only have ILI data for the influenza season itself. Thus, while we do have Internet data for off-season influenza page views, we do not have corresponding ILI data.

Conclusions
Internet surveillance data have proven beneficial in predicting ILI incidence during flu seasons. However, our results show that the benefit of Internet data streams for informing disease surveillance is inconclusive.
That is, our work shows that CDC website traffic can be informative in some cases (e.g., at the national level) but not in others (e.g., at the state level). To determine the extent, we must return to our original research questions.
Q1: Given the successes of some of our models, we can conclude that CDC page view data can be used as an additional data source for monitoring disease incidence in some cases (for example, at the national level). The degree to which these data can be used appears to depend on the page selection and time frame. We obtained successful nowcasts when selecting pages related to topics most commonly used for online health queries (specific diseases and treatments) during the time span of a typical influenza season. Longer time spans and pages less associated with specific diseases and treatments led to less successful models. These results can assist others in selecting appropriate supplemental data sets for disease surveillance.
Q2: We obtained our most successful results using a one-week shift. Two-week shifts were successful in some cases but were overall less correlated than one-week shifts. Using no shift at all proved successful in some cases but not in others. We surmise that the shift required for the best fit depends upon the incubation period for the disease in question as well as the time period of reporting. The CDC Internet data are available daily; however, ILI data are available weekly, so we are limited in the types of shifts we can apply to the data sets.
We conclude that more studies on Internet data streams are needed to understand when and why Internet data works. Our methods are consistent with other feasibility studies and provide insight into conditions under which Internet data streams may inform influenza models. Future work should include rigorously testing the predictive power of the models by separating data into training and testing sets [2].