This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.

Influenza is a viral respiratory disease capable of causing epidemics that represent a threat to communities worldwide. The rapidly growing availability of electronic “big data” from diagnostic and prediagnostic sources in health care and public health settings permits the development of a new generation of methods for local detection and prediction of winter influenza seasons and influenza pandemics.

The aim of this study was to present a method for integrated detection and prediction of influenza virus activity in local settings using electronically available surveillance data and to evaluate its performance by retrospective application on authentic data from a Swedish county.

An integrated detection and prediction method was formally defined based on a design rationale for influenza detection and prediction methods adapted for local surveillance. The novel method was retrospectively applied on data from the winter influenza season 2008-09 in a Swedish county (population 445,000). Outcome data represented individuals who met a clinical case definition for influenza (based on International Classification of Diseases version 10 [ICD-10] codes) from an electronic health data repository. Information from calls to a telenursing service in the county was used as syndromic data source.

The novel integrated detection and prediction method is based on nonmechanistic statistical models and is designed for integration in local health information systems. The method is divided into separate modules for detection and prediction of local influenza virus activity. The function of the detection module is to alert for an upcoming period of increased load of influenza cases on local health care (using influenza-diagnosis data), whereas the function of the prediction module is to predict the timing of the activity peak (using syndromic data) and its intensity (using influenza-diagnosis data). For detection modeling, exponential regression was used based on the assumption that the number of infected individuals grows exponentially at the beginning of a winter influenza season. For prediction modeling, linear regression was applied on 7-day periods at a time in order to find the peak timing, whereas a function derived from a normal distribution density function was used to find the peak intensity. We found that the integrated detection and prediction method detected the 2008-09 winter influenza season on its starting day (optimal timeliness 0 days), whereas the predicted peak was estimated to occur 7 days ahead of the actual peak and the predicted peak intensity was estimated to be 26% lower than the actual intensity (6.3 compared with 8.5 influenza-diagnosis cases/100,000).

Our detection and prediction method is one of the first integrated methods specifically designed for local application on influenza data electronically available for surveillance. The performance of the method in a retrospective study indicates that further prospective evaluations of the method are justified.

In light of the rapidly growing availability of “big data” from both diagnostic and prediagnostic (syndromic) sources in health care and public health settings, a new generation of epidemiological and statistical methods is needed for reliable analyses and modeling [

Several weaknesses of infectious disease surveillance and prediction systems described in previous decades [

(1) Compartmental models: These are based on mechanistic assumptions about how the influenza virus is transmitted and use these assumptions to estimate the number of individuals in various states related to a disease [

(2) Agent-based models: These are more complex types of mechanistic models that typically use synthetic populations based on census data and build complex schemes of social interaction and disease progress in simulated individuals and communities [

(3) Nonmechanistic statistical models: These are phenomenological approaches, that is, they aim to model patterns and trends in the data without necessarily considering the underlying mechanisms. A typical approach of this type is linear autoregression, which estimates influenza activity using a linear function of recorded past activity. More complex methods in this category include generalized linear models, Box-Jenkins analysis [

More alternative surveillance methods include, for instance, prediction markets [

The aim of this study was to present a novel method for integrated detection and prediction of influenza activity using data electronically available for real-time surveillance in local settings in the Western hemisphere and to evaluate its performance by retrospective application on authentic data from a Swedish county. Local settings in the Western hemisphere here denote communities with specified populations in Europe and North America. Winter influenza seasons and pandemics can be expected to spread to these settings, but the circulating types and strains of influenza virus are unlikely to originate there. In the presentation of the integrated detection and prediction method, the term epidemic is used as a summary label for both winter influenza seasons and pandemics.

The novel method was formally defined based on a design rationale for integrated detection and prediction methods specifically adapted for application in local influenza surveillance. An overview of the method design is given in the Results section, followed by detailed descriptions of the detection and prediction modules. The results of a retrospective performance evaluation based on authentic data from a Swedish county are also presented. The study design was approved by the regional research ethics board in Linköping (dnr. 2012/104-31).

The rationale for design of a novel integrated detection and prediction method is that the aim of local influenza surveillance is early detection and prediction of infected individuals requiring clinical attention, with the purpose of timely allocation of scarce health care resources. Precious time is lost before laboratory data are available for algorithmic processing and test samples are not taken from all patients. Syndromic data are used for peak timing prediction because it is challenging to only use unidimensional gold standard data to predict the peak timing.

Both the detection and prediction functions are to comply with requisite quality and accuracy criteria for technologies to be used in health care and public health practice [

Influenza detection is defined as indicating the initiation of an epidemic in the community, that is, a prolonged period of elevated incidence rates (exceeding a given limit) of influenza cases, as defined by the rate of individuals clinically diagnosed with influenza in a population under surveillance. Influenza prediction denotes foretelling the peak timing and the peak intensity of an epidemic in the community. For detection, weekday effects and optimal alerting thresholds with reference to influenza-diagnosis data are retrospectively established in the method calibration. For prediction, both the weekday effects and the grouping of variables in the syndromic data with the largest correlation strength and longest lead time to influenza-diagnosis data are established.

The influenza case-rate level when a local influenza epidemic factually takes off was set to 6.3 influenza-diagnosis cases/100,000 during a floating 7-day period. This limit was determined by inspecting the epidemic curves of previous local influenza epidemics in the learning dataset. A similar definition (6.4 influenza-diagnosis cases/week/100,000) was determined for the winter influenza season in 2008-09 in a recent comparison of influenza intensity levels in Europe [
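The start-of-epidemic definition can be illustrated with a minimal Python sketch. It assumes a list of daily influenza-diagnosis counts, reads the floating 7-day period as the trailing 7 days ending on the evaluated day, and treats reaching the limit as sufficient. The function name and these conventions are illustrative assumptions, not part of the published method.

```python
def epidemic_start_index(daily_cases, population, limit_per_100k=6.3, window=7):
    """Return the index of the first day on which the trailing `window`-day
    case rate per 100,000 reaches `limit_per_100k`, or None if it never does."""
    scale = 100_000 / population
    for end in range(window - 1, len(daily_cases)):
        window_cases = sum(daily_cases[end - window + 1:end + 1])
        if window_cases * scale >= limit_per_100k:
            return end
    return None


# Toy series (not study data) for a population of 445,000.
toy_cases = [2] * 10 + [5] * 10
start = epidemic_start_index(toy_cases, population=445_000)
```

For a population of 445,000, the limit of 6.3 cases/100,000 corresponds to roughly 28 cases within 7 days.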

For a retrospective performance evaluation of the integrated detection and prediction method, outcome cases were represented by individuals clinically diagnosed with influenza during the 2008-09 winter influenza season in a Swedish county (population 445,000). The thresholds used in epidemic detection were determined using data from a learning dataset containing the 2008-09 winter influenza season. The metrics used to evaluate the detection of influenza epidemics were timeliness, sensitivity, and specificity. Timeliness was defined as the time difference (in days) between the actual start of the epidemic and the start indicated by the model. Specificity was calculated from the day the detection algorithm was started (ie, when the previous epidemic had come to an end) until the beginning of the current epidemic per the standard definition (6.3 influenza-diagnosis cases/100,000 during a floating 7-day period). This means that the period length for specificity calculations varies with the interepidemic period. Sensitivity was calculated from the beginning of the current epidemic (according to the same definition) until 45 days into the epidemic. The optimal alerting threshold was chosen by calculating sensitivity and specificity and studying them on a receiver operating characteristic (ROC) curve, giving specificity priority over sensitivity because a high level of false alarms is undesirable in public health practice.
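As a hedged illustration of these metrics, the sketch below computes sensitivity, specificity, and timeliness from a daily alarm series and sweeps candidate thresholds to obtain points for an ROC curve. It assumes day-level proportions (share of interepidemic days without an alarm, share of the first 45 epidemic days with an alarm); the function names and this day-level reading are assumptions made for illustration.

```python
def detection_metrics(alarms, epidemic_start, sensitivity_days=45):
    """Return (sensitivity, specificity, timeliness) for a daily alarm series.

    alarms: list of booleans, one per day since the algorithm was started.
    epidemic_start: index of the day on which the epidemic actually started.
    """
    interepidemic = alarms[:epidemic_start]
    early_epidemic = alarms[epidemic_start:epidemic_start + sensitivity_days]
    specificity = (sum(1 for a in interepidemic if not a) / len(interepidemic)
                   if interepidemic else None)
    sensitivity = (sum(1 for a in early_epidemic if a) / len(early_epidemic)
                   if early_epidemic else None)
    first_alarm = next((day for day, a in enumerate(alarms) if a), None)
    timeliness = None if first_alarm is None else first_alarm - epidemic_start
    return sensitivity, specificity, timeliness


def roc_points(daily_values, thresholds, epidemic_start):
    """Evaluate each candidate threshold; an alarm is raised when the daily
    value (for example a lower confidence limit) exceeds the threshold."""
    points = []
    for threshold in thresholds:
        alarms = [value > threshold for value in daily_values]
        sensitivity, specificity, _ = detection_metrics(alarms, epidemic_start)
        points.append((threshold, sensitivity, specificity))
    return points
```

Among thresholds with acceptable sensitivity, the one with the highest specificity would be preferred, in line with the priority stated above.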

To evaluate the prediction of the peak timing, timeliness, defined as the time between the predicted day of the influenza-diagnosis peak (highest number of daily cases) and the day of the peak in the observed smoothed series (using a moving average of influenza-diagnosis data), was used as the metric. To evaluate the prediction of the peak intensity, the absolute and relative differences between the predicted peak intensity, expressed as the number of influenza-diagnosis cases at the predicted day of the peak, and the observed peak intensity were used as metrics. The reason for not comparing the predicted peak intensity with the actual peak intensity (ie, without smoothing the data first) was to reduce the impact of possible outliers.
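The prediction metrics can be sketched as follows. The moving-average window is not stated in this section, so the centered 7-day window used here is an assumption, as are the function names and the sign convention (negative timeliness means the predicted peak falls before the observed one).

```python
def moving_average(series, window=7):
    """Centered moving average; the window is truncated at the series edges."""
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - half):i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def peak_metrics(observed, predicted_peak_day, predicted_peak_intensity, window=7):
    """Return (timeliness, absolute difference, relative difference) of a
    predicted peak against the peak of the smoothed observed series."""
    smoothed = moving_average(observed, window)
    observed_peak_day = max(range(len(smoothed)), key=smoothed.__getitem__)
    observed_peak_intensity = smoothed[observed_peak_day]
    timeliness = predicted_peak_day - observed_peak_day
    abs_diff = predicted_peak_intensity - observed_peak_intensity
    rel_diff = abs_diff / observed_peak_intensity
    return timeliness, abs_diff, rel_diff
```

With the intensities reported in this article (6.3 predicted versus 8.5 observed influenza-diagnosis cases/100,000), the relative difference is (6.3 − 8.5)/8.5 ≈ −0.26, that is, 26% lower.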

Influenza cases were identified using the International Classification of Diseases version 10 (ICD-10) codes for influenza (J10.0, J10.1, J10.8, J11.0, J11.1, J11.8) [

The integrated detection and prediction method is based on nonmechanistic statistical models, that is, patterns and trends in the data are modeled without necessarily considering underlying mechanisms. It is designed for integration in local health information systems. Accordingly, the underpinning structure is defined at four levels, ranging from data sources to performance validation (

An overview of the main statistical assumptions and equations for each component is displayed in

Structure of the integrated detection and prediction method displayed design patterns.

An overview of the main mathematical equations or functions used for each component.

Exponential regression (1) is used for detection modeling, based on the assumption that the number of infected individuals grows exponentially at the beginning of an influenza epidemic:

(1) $I_t = e^{a_0 + b_1 t}$

with $a_0$ representing the level and $b_1$ representing the trend. The expected number of visits at local health care services, $\mu_t$, is obtained by multiplying the number of infected individuals by the probability $p$ of visiting the local health care service:

(2) $\mu_t = p \, e^{a_0 + b_1 t} = e^{a_0 + \ln(p) + b_1 t} = e^{b_0 + b_1 t}$

where $b_0 = a_0 + \ln(p)$ now combines the current level of the number of infected individuals and the probability of visiting the local health care service, without any possibility of separating them. As daily data are used in the analysis, weekday effects, $A_w$, are also calculated and used as an offset variable in the exponential regression analysis. The weekday effects are calculated as follows: let $n_{Monday}$ be the average number of events on Mondays during previous epidemics, and denote the values for the other weekdays by $n_{Tuesday}$, $n_{Wednesday}$, and so on. Let $n_{Total} = n_{Monday} + \dots + n_{Sunday}$; the weekday effect for Monday is then $n_{Monday} / n_{Total}$, and so on. The weekday effects are included in the model:

(3) $\mu_t = e^{b_0 + b_1 t + \ln(A_w)}$

If

(4) _{t}^{b₀ + b₁t + ln(Aw)}

Furthermore, the time is shifted, that is, the most recent day is considered as $t = 0$. This gives

(5) $\mu_t = e^{b_0}$

as an estimate of the current level of visits, smoothed for random variation and adjusted for weekday effects. This is repeated for each day by moving the time axis one day at a time so that the most recent point in time of the series is always considered as $t = 0$.

Detection starts when the previous epidemic has ended (the interepidemic period level for the community where the detection component is applied), and runs during the inter-epidemic period until an increase in diagnosed influenza cases is detected. When the increase is confirmed, the algorithm is paused and restarted when the epidemic has ended.
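The detection model can be illustrated with a minimal Python sketch. It fits the exponential model with a weekday offset by Poisson maximum likelihood (Newton iterations) and returns a lower confidence limit for the current level. The choice of a Poisson likelihood, the 95% level, the length of the fitting window, and all function names are assumptions made for illustration; the text above specifies only exponential regression with weekday effects as offset.

```python
import math


def weekday_effects(mean_events_by_weekday):
    """Share of events per weekday: n_weekday / n_Total."""
    total = sum(mean_events_by_weekday)
    return [n / total for n in mean_events_by_weekday]


def fit_current_level(counts, offsets, max_iter=50, tol=1e-10):
    """Fit E[y_t] = exp(b0 + b1*t + ln(A_w)) by Poisson maximum likelihood.

    counts: daily cases in the fitting window, oldest first.
    offsets: weekday effect A_w of each of those days.
    Time is shifted so that the most recent day has t = 0.
    Returns (b0, b1, se_b0).
    """
    n = len(counts)
    times = [i - (n - 1) for i in range(n)]  # ..., -2, -1, 0
    if sum(counts) <= 0:
        raise ValueError("the fitting window contains no cases")
    b0, b1 = math.log(sum(counts) / sum(offsets)), 0.0

    def information(level, trend):
        mu = [a * math.exp(level + trend * t) for a, t in zip(offsets, times)]
        i00 = sum(mu)
        i01 = sum(m * t for m, t in zip(mu, times))
        i11 = sum(m * t * t for m, t in zip(mu, times))
        return mu, i00, i01, i11

    for _ in range(max_iter):
        mu, i00, i01, i11 = information(b0, b1)
        u0 = sum(y - m for y, m in zip(counts, mu))  # score for b0
        u1 = sum((y - m) * t for y, m, t in zip(counts, mu, times))  # score for b1
        det = i00 * i11 - i01 * i01
        step0 = (i11 * u0 - i01 * u1) / det
        step1 = (i00 * u1 - i01 * u0) / det
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) + abs(step1) < tol:
            break
    _, i00, i01, i11 = information(b0, b1)
    se_b0 = math.sqrt(i11 / (i00 * i11 - i01 * i01))
    return b0, b1, se_b0


def lower_limit_of_level(b0, se_b0, z=1.96):
    """Lower confidence limit of the current level exp(b0)."""
    return math.exp(b0 - z * se_b0)
```

An alarm would be raised on a day when this lower limit, converted to the same unit as the alerting threshold, exceeds the threshold. Note that the scale of exp(b0) depends on how the weekday effects are normalized.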

The detection algorithm is adjusted in exceptional situations, that is, if an epidemic “simmers” before it begins. The risk of simmering is considerable for a pandemic or an exceptionally mild winter influenza season. In the first case, if there is a fear of a pandemic outbreak among the population, individuals are more likely to contact medical services for influenza symptoms, leading to an increased baseline, which increases the risk of false alarms. In the second case, if a winter influenza season is exceptionally mild, individuals contacting medical services for influenza-like symptoms in the winter will sporadically be misdiagnosed with influenza before the influenza virus actually circulates, again leading to an increased baseline and thus an increased risk of false alarms. The alerting threshold determined in the learning set is therefore doubled in these particular cases. A strong indication of preepidemic simmering was considered to be an extended period between the date when the influenza incidence increases above a baseline level and the date when the epidemic starts (according to the standard definition, 6.3 influenza-diagnosis cases/100,000 during a floating 7-day period). The influenza incidence is defined as having increased above the baseline level when it reaches 3.2 influenza-diagnosis cases/100,000 during a floating 7-day period (ie, half of the start-of-epidemic definition). An epidemic is then defined to simmer if the period separating these 2 dates is longer than three times the average length of the corresponding period during previous local influenza epidemics. In other words, the alerting threshold is doubled due to simmering only if the incidence has increased above the baseline level but has not exceeded the start-of-epidemic level during this observation period.
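The simmering rule can be expressed compactly. The sketch below assumes that the number of days since the baseline crossing and the average rise time from previous epidemics have already been computed; the function and parameter names are hypothetical.

```python
def simmering_adjusted_threshold(base_threshold, days_above_baseline,
                                 mean_rise_days, epidemic_started):
    """Double the alerting threshold while the epidemic is judged to simmer.

    days_above_baseline: days since the 7-day rate first exceeded the
        baseline level (3.2 cases/100,000), or None if it has not done so.
    mean_rise_days: average number of days between the baseline crossing
        and the epidemic start in previous local epidemics.
    epidemic_started: True once the 7-day rate has reached the
        start-of-epidemic level (6.3 cases/100,000).
    """
    simmering = (
        days_above_baseline is not None
        and not epidemic_started
        and days_above_baseline > 3 * mean_rise_days
    )
    return 2 * base_threshold if simmering else base_threshold
```

For example, with a base threshold of 0.21 and an average rise time of 10 days, the threshold becomes 0.42 once the incidence has stayed between the two levels for more than 30 days.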

The prediction process is divided into two components. In the first component, syndromic data are used to predict the peak timing, and in the second component, influenza-diagnosis data are used to estimate the peak intensity.

In the first component, the aim is to predict the peak timing using linear regression. Including weekday effects $A_w$ and smoothed for random variation, the model for the number of cases in the syndromic data, $y_t$, is expressed as

(6) $y_t = (\beta_0 + \beta_1 t) A_w$

with $\beta_0$ representing the level and $\beta_1$ representing the trend. Since the weekday effects $A_w$ are known, a model smoothed for weekday effects and random variation can be expressed as:

(7) $y_t / A_w = \beta_0 + \beta_1 t$

For each 7-day period, a linear regression (7) is run and the parameter estimates $\beta_0$ and $\beta_1$ are fitted. The idea is to estimate the trend in syndromic data for every 7-day period (the first period being days 1-7 and the second being days 2-8), from the beginning of an epidemic until the peak is found. Although it is unlikely that an epidemic curve increases and decreases linearly, the assumption can be made that the trend during a short period of 7 days is approximately linear.
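The rolling 7-day regression can be sketched in Python as follows. The sketch divides the syndromic counts by their weekday effects, fits an ordinary least-squares line to each 7-day window, and classifies the slope with a two-sided t test. The 5% significance level and the function names are assumptions; the hard-coded critical value applies only to windows of 7 days (5 degrees of freedom).

```python
import math

T_CRIT_DF5 = 2.571  # two-sided 5% critical value of Student's t, 5 degrees of freedom


def trend_sign(values, t_crit=T_CRIT_DF5):
    """Return +1 or -1 for a significantly positive or negative slope, else 0."""
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values)) / sxx
    intercept = y_mean - slope * t_mean
    rss = sum((y - intercept - slope * t) ** 2 for t, y in enumerate(values))
    se = math.sqrt(rss / (n - 2) / sxx)
    if se == 0:  # perfect linear fit: the sign of the slope decides
        return (slope > 0) - (slope < 0)
    t_stat = slope / se
    if t_stat > t_crit:
        return 1
    if t_stat < -t_crit:
        return -1
    return 0


def rolling_trends(syndromic_counts, weekday_effects, window=7):
    """Trend sign for every `window`-day period (days 1-7, days 2-8, ...)."""
    adjusted = [y / a for y, a in zip(syndromic_counts, weekday_effects)]
    return [trend_sign(adjusted[k:k + window])
            for k in range(len(adjusted) - window + 1)]
```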

The search for the peak starts when the detection algorithm signals that an epidemic has taken off and continues until the peak is detected. To identify the peak timing, two conditions are set. The first condition ensures that the epidemic has a sufficiently sharp upward trend. The trend is defined as sufficiently sharp when significantly positive trends (slope $\beta_1 > 0$) have occurred either during two consecutive 7-day periods or during three different 7-day periods. When one of these events has occurred, the second condition is applied. According to this condition, when the first significantly negative trend (slope $\beta_1 < 0$) during a 7-day period has occurred, it is assumed that the peak has been reached on the first day of this period. However, there is a possibility that this 7-day period “overlaps” with a previous 7-day period that includes a significantly positive trend. In that case, the first 7-day period with a significantly negative trend is ignored and the peak is instead assumed to appear during the second 7-day period with a significantly negative trend. The search is aborted if the peak has not been found by the time the epidemic has already descended in the local setting where the algorithm is applied.
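The two conditions can be written as a small decision rule over the sequence of trend signs, one per 7-day period (+1 for a significantly positive trend, -1 for a significantly negative trend, 0 otherwise). This is one possible reading of the rule: two periods are taken to overlap when their start days are fewer than 7 days apart, and only the first overlapping negative period is skipped. The function name and these details are assumptions.

```python
def find_syndromic_peak(trends, window=7):
    """Return the start index of the 7-day period whose first day is taken
    as the syndromic peak, or None if no peak is found."""
    positives = []  # start indices of periods with a significantly positive trend
    upward_confirmed = False
    skipped_first_negative = False
    for k, sign in enumerate(trends):
        if sign > 0:
            positives.append(k)
            consecutive = len(positives) >= 2 and positives[-1] - positives[-2] == 1
            if consecutive or len(positives) >= 3:
                upward_confirmed = True  # first condition met
        elif sign < 0 and upward_confirmed:
            overlaps = k - positives[-1] < window
            if overlaps and not skipped_first_negative:
                skipped_first_negative = True  # ignore this period, wait for the next
                continue
            return k  # second condition met: peak on the first day of period k
    return None
```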

When the peak is found in the syndromic data, the 14 days preceding influenza-diagnosis data [

Depending on the day of the week on which the peak in the syndromic data is expected to take place, the prediction of the influenza-diagnosis peak is made between 6 and 11 days before it is expected to occur, because the syndromic peak can be determined only after 6 days of the syndromic data series have passed.

In the second component of the prediction module, the aim is to predict only the peak intensity. Based on empirical assessments of previous epidemics, an epidemic adjusted for weekday effects is assumed to show a bell-shaped form from beginning to end, and can therefore be expressed using a function derived from a normal distribution density function. The intensity function must also include weekday effects and the total number of events during the whole epidemic. Use of bell-shaped functions was systematically introduced in epidemiology by Brownlee in the early 20th century [

Assume that day number _{i}; the observed number of influenza-diagnosis cases is _{1}, _{2}, _{3},..., _{i}, and that

(8)

where

It is important that the start of the series is chosen appropriately, because the second prediction component assumes that the level is zero or at an interepidemic level at the start, and single or occasional spikes at the beginning of the series are not optimal. For that reason, the series should start a couple of weeks before an epidemic is detected.
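To illustrate the idea of a bell-shaped intensity function, the following sketch fits daily cases as the product of a total number of events, a weekday factor, and a normal density centered on the predicted peak day. This functional form, the least-squares fit over a grid of standard deviations, the weekday factors being scaled to average 1, and all names are assumptions made for illustration; the sketch is not a reconstruction of the intensity equation used in the method.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of the normal distribution."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def fit_bell_curve(cases, weekday_factors, peak_day, sd_grid):
    """Least-squares fit of cases[i] ~ total * factor[i] * pdf(i; peak_day, sd).

    For each candidate sd the best `total` has a closed form; the pair with
    the smallest squared error is returned as (total, sd).
    """
    best = None
    for sd in sd_grid:
        shape = [f * normal_pdf(i, peak_day, sd)
                 for i, f in enumerate(weekday_factors)]
        denom = sum(g * g for g in shape)
        if denom == 0:
            continue
        total = sum(y * g for y, g in zip(cases, shape)) / denom
        sse = sum((y - total * g) ** 2 for y, g in zip(cases, shape))
        if best is None or sse < best[0]:
            best = (sse, total, sd)
    return best[1], best[2]


def predicted_peak_intensity(total, sd):
    """Weekday-adjusted height of the fitted curve at its peak."""
    return total / (sd * math.sqrt(2 * math.pi))
```

In this sketch the peak day is supplied by the first prediction component, so an error in the predicted peak timing carries over to the predicted intensity.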

The optimal threshold for the lower confidence limit of the expected number of influenza-diagnosis cases was computed as 0.21/day/100,000 for the detection algorithm. The detection sensitivity and specificity (calculated over an interepidemic period of 211 days) were both 1.00 and the timeliness was 0 days (

The detection algorithm applied on winter influenza season 2008-09 (A[H3N2]). The blue line represents the number of influenza-diagnosis cases/day/100,000, the gray bar marks the start of the winter influenza season according to the definition (6.3 influenza-diagnosis cases/100,000 during a floating 7-day period), and the orange line denotes the lower limit estimated using the detection algorithm.

The prediction module performance was satisfactory with regard to both the peak timing and the peak intensity. The peak timing was predicted 8 days in advance, and the predicted peak fell 7 days before the actual peak. The predicted peak intensity at the predicted day of the peak was estimated at 6.3 influenza-diagnosis cases/100,000 (

The prediction method applied on winter influenza season 2008-09 (A[H3N2]). The blue line represents the number of known actual influenza-diagnosis cases/day/100,000 at the time when the prediction is performed, the orange line represents the number of “unknown” actual influenza-diagnosis cases/day/100,000 at the time when the prediction is performed from the first unknown day and until the peak has passed, the gray bar marks the end of the known and the beginning of the “unknown” actual influenza-diagnosis cases/day/100,000, and the black line denotes the predicted values (using the peak intensity prediction) from the first “unknown” day and until the predicted peak occurs.

The aim of this study was to present an integrated influenza detection and prediction method that uses data electronically available in local public health information systems for real-time surveillance. In the performance evaluation based on retrospective data, the method detected the winter influenza season of 2008-09 on the day it actually started, whereas the prediction module showed satisfactory performance with regard to both the timing of the activity peak and its intensity.

Many important policy decisions in the response to increased influenza activity are made at the local level, for example, planning of resources at intensive care units and deciding social distancing measures such as school closures. The design of the presented integrated detection and prediction method can be compared with current state-of-the-art big data approaches to influenza forecasting [

Our approach also differs from the framework with the addition of a detection function. Combined detection and prediction methods are common in weather forecasting but not in infectious disease epidemiology. It is somewhat surprising that this is the case, as there are several studies that have focused on developing either influenza detection or influenza prediction algorithms, but seldom a combination of these [

The performance evaluation of the integrated detection and prediction method based on retrospective data showed promising results. The rationale for developing our influenza detection and prediction method was to inform the planning of local response measures and adjustments of health care capacity. During emerging epidemics of infectious diseases, it is vital to have up-to-date information on epidemic trends because hospitals and intensive care units have limited excess capacity [

The method presented in this paper has both strengths and weaknesses that need to be taken into account. An important strength is that the design rationale is documented in detail, allowing researchers to consider the arguments for different design decisions when building the next generation of integrated detection and prediction methods. Another key strength is that the analysis of an epidemic is divided into three separate components (beginning of epidemic, peak timing of epidemic, and peak intensity of epidemic), where the statistical and mathematical assumptions for each component are made independently of each other. Also, different data sources are applied in each component. Concretely, to detect the beginning of an epidemic, exponential regression is applied on influenza-diagnosis data; to predict the peak timing, simple linear regression is applied on syndromic telenursing data; and to predict the peak intensity, the epidemic is assumed to follow a bell-shaped function of time around the peak, and therefore a function derived from the normal distribution density function is applied on influenza-diagnosis data. An approach similar to this has rarely been reported in the field of influenza surveillance. One possible limitation of the method design is that the series of actual influenza-diagnosis data is smoothed and the peak of the smoothed series is used as the actual peak. However, as mentioned in the Methods section, the reason for this design choice was to reduce the risk of misleading influence from outliers.

One potential limitation concerns the use of sensitivity and specificity in the method. These metrics are, however, restricted to assess the accuracy of the alerting threshold. We have previously contended that it is important to determine the appropriate period in time which calculations of sensitivity and specificity are to be based upon [

Another possible limitation concerns the first prediction component, where we chose to apply linear regression on 7-day periods to search for positive and negative trends in order to find the peak timing in the syndromic data. The length of the period could have been extended by 1-2 days to get more reliable estimates of the trend. However, this alternative was weighed against the risk of predicting the influenza-diagnosis peak fewer days in advance, and the advantage of an earlier prediction was preferred. Another limitation is that the prediction of the peak intensity is affected by the peak timing prediction, since a precise prediction of the peak timing increases the chance of an accurate prediction of the peak intensity. Concretely, if the timeliness for the prediction of the peak timing had been 0 days instead of 7 days in our retrospective evaluation of the 2008-09 winter influenza season, the predicted peak intensity would have been estimated at 7.7 instead of 6.3 influenza-diagnosis cases/100,000, compared with the actual 8.5 influenza-diagnosis cases/100,000. In other words, the relative difference between the predicted and the actual incidence would have been 10% instead of 26%. Finally, in the second prediction component, we assumed that an influenza epidemic takes a bell-shaped form from beginning to end, and therefore we employed a function derived from a normal distribution density function to find the peak intensity. The same assumption was used by Bregman and Langmuir [

During the recent decade, a multitude of algorithms for influenza detection or prediction have been reported [

AIDS: acquired immune deficiency syndrome

CDC: Centers for Disease Control and Prevention

EPR: electronic patient records

ICD-10: International Classification of Diseases version 10

ROC: receiver operating characteristic

SIR: Susceptible-Infectious-Recovered

This study was supported by grants from the Swedish Civil Contingencies Agency (2010-2788) and the Swedish Research Council (2008-5252). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

None declared.