Applying Machine Learning Models with An Ensemble Approach for Accurate Real-Time Influenza Forecasting in Taiwan: Development and Validation Study

Background: Variable seasonal influenza activity in subtropical areas such as Taiwan poses problems for epidemic preparedness. The Taiwan Centers for Disease Control has maintained real-time national influenza surveillance systems since 2004. In addition to timely monitoring, epidemic forecasting using the national influenza surveillance data can provide pivotal information for public health response. Objective: We aimed to develop predictive models using machine learning to provide real-time influenza-like illness forecasts. Methods: Using surveillance data of influenza-like illness visits from emergency departments (from the Real-Time Outbreak and Disease Surveillance System), outpatient departments (from the National Health Insurance database), and the records of patients with severe influenza with complications (from the National Notifiable Disease Surveillance System), we developed 4 machine learning models (autoregressive integrated moving average, random forest, support vector regression, and extreme gradient boosting) to produce weekly influenza-like illness predictions for a given week and 3 subsequent weeks. We established a framework of the machine learning models and used an ensemble approach called stacking to integrate these predictions. We trained the models


Introduction
Seasonal influenza is one of the most prevalent infectious diseases in Taiwan, accounting for millions of cases, tens of thousands of patient hospitalizations, and hundreds of deaths annually [1][2][3][4]. In Taiwan, the seasonal influenza epidemic typically begins in winter, toward the end of the year, and continues until the spring of the next year [2]. However, the variable influenza activity in subtropical areas like Taiwan sometimes causes problems in epidemic preparedness. For instance, in Taiwan, the 2015-2016 influenza epidemic, with H1N1 as the main circulating strain, was the biggest since the 2009 novel H1N1 outbreak. Nevertheless, H1N1 influenza activity was unexpectedly low in the following 2016-2017 influenza season, whereas H3N2 influenza activity peaked unpredictably in the summer season in 2017 and caused a severe epidemic [2,3].
Since 2004, to monitor changes in influenza activity, the Taiwan Centers for Disease Control (Taiwan CDC) has established real-time national influenza surveillance systems for influenza-like illness visits to hospitals and clinics [5,6]. The surveillance systems have minimal time lag in data collection; therefore, public health professionals can adjust their response almost in real time. The decision-making process, however, remains based only on past data (despite the short time lag). Influenza epidemic forecasting for upcoming weeks or months can provide more information for policymaking and is relevant for preparedness [7]. The ability to provide a short-term forecast in terms of epidemic magnitude is particularly vital for emergency departments during a long weekend or the Lunar New Year in Taiwan (eg, the Lunar New Year comprised 9 vacation days in 2019), during which time influenza-like illness visits at emergency departments considerably increase (since outpatient services are closed), and sometimes patients crowd the emergency departments. In this situation, reliable forecasts are required to determine the surge capacity.
Many research teams have worked on influenza forecasting for a long time. Among the models used by researchers, the autoregressive integrated moving average (ARIMA) model is a methodology that is often chosen for seasonal influenza forecasts because of its advantage in dealing with time-series data [8,9], its satisfactory performance using data that are time dependent for short-term projection, and its widespread use in other health-related forecasting tasks [9][10][11][12][13][14]. Decision tree-based machine learning algorithms such as random forest and extreme gradient boosting also have their strengths in predictive analysis and forecasting, which have been shown in data science competitions such as Kaggle [15], influenza outbreaks [14], and foodborne disease trends [16]. A study in Canada [17] showed that random forest models performed better in predicting influenza A virus frequency than ARIMA and generalized linear autoregressive integrated moving average models. Unlike ARIMA, random forest and extreme gradient boosting, as ensembles of weak prediction models, perform better with high-dimensional data [18], while support vector regression's strength is finding an optimal hyperplane with a nonlinear boundary [19,20]. Previous research has also demonstrated a successful combination of linear regression with nonlinear predictor, random forest, support vector regression, and extreme gradient boosting to predict dengue fever outbreak in the United States [21].
Instead of traditional surveillance data, researchers have also attempted to use nontraditional data sources, such as Google Flu Trends and Flu Near You, to improve their forecasts since 2008 [22,23]. These data served as surrogate indicators or supplementary data for influenza-like illness activity. Lasso regression, random forest, extreme gradient boosting, and support vector regression have been widely implemented to aggregate these data from Google search, Google Trends, Wikipedia, and social media (such as Twitter and Baidu) in influenza forecasting [24][25][26][27]. The performance of elastic net and support vector regression was considered to be comparable in a study [26] which used the Baidu index as a predictor and predicted the number of influenza cases in China by support vector regression, and in a study [28] in France which used electronic health record data with historical epidemiology information for influenza-like illness incidence rate predictions.
On the other hand, researchers began to explore the possibility of simultaneously using multiple models or data sources to find an ensemble approach to produce more robust forecasts by combining the results of different forecasting models [11][29][30][31][32]. For seasonal influenza forecasting in the US, the empirical Bayes method has been used to integrate the forecasts from linear models using multiple data sources as predictors [29]. Kandula et al [32] also evaluated the performance of the susceptible-exposed-infectious-recovered-susceptible model, Bayesian weighted outbreaks, k-nearest neighbor, and a superensemble method when combining distinct forecast methods to predict influenza outbreaks in the United States. A meta-ensemble of statistical and mechanistic methods has shown better accuracy than individual methods [31,32].
Compared to internet data, which might easily be influenced by search engine marketing, the surveillance database in Taiwan can provide much more comprehensive data with a small time lag. These data sources are also easier to maintain and more reliable for a long-term decision-making system. Therefore, using the surveillance data, we aimed to develop a practical framework consisting of an ensemble model with machine learning models to combine the advantages of different forecasting models for real-time influenza-like illness predictions and to facilitate influenza preparedness.

Data Source
The data we used to train and validate the machine learning algorithm included weekly data from the Real-Time Outbreak and Disease Surveillance System, the National Health Insurance Database, and the National Notifiable Disease Surveillance System [5,6]. The details and characteristics of the data sets are described in Multimedia Appendix 1. Other data used include the number of national holidays in each week, regular weekends, and long weekends. We used surveillance data from 2008-2017 to establish the framework of the forecasting models.

Forecasting Targets and the Renewal of the Surveillance Data and Models
The forecasting targets in our study were short-term forecasts: the weekly number of influenza-like illness visits for the 4-week period after the most recent surveillance data. The Real-Time Outbreak and Disease Surveillance System, National Health Insurance, and National Notifiable Disease Surveillance System databases were updated daily. Because of the potential delay in data entry by hospitals, all the models were automatically retrained and updated every Tuesday night using data up until the end of the previous week. The updated models would then produce the predictions of influenza-like illness visits for 4 weeks from that time point (the end of the previous week). Therefore, the initial forecast actually predicts the number of influenza-like illness visits in the then-current week (nowcast), whereas the 1-week, 2-week, and 3-week forecasts represent the weekly predictions for each of the subsequent 3 weeks (Figure 1).

Machine Learning Algorithms
Four machine learning algorithms (ARIMA, random forest, support vector regression, and extreme gradient boosting) were used to produce weekly influenza-like illness predictions for a 4-week period. We chose these algorithms, each with different characteristics and strengths, so that the forecasting task could benefit from the diversity of the machine learning algorithms.
To summarize the forecasts of the 4 different machine learning algorithms, we adopted the ensemble method called stacking [31,33]. An ensemble model was trained using another support vector regression algorithm with a linear kernel, which fit the regression between the observed number of influenza-like illness visits and the 4-week forecasts. A previous study [32] adopted a Bayesian model, which requires the prior distribution estimation to produce the ensemble forecast. We chose a support vector regression model with a linear kernel because it can produce a weighted-average forecast from 4 individual models without considering data distribution. By using the stacking method, the forecasts of different algorithms are automatically weighted and combined to produce the ensemble forecasts. The linear kernel was chosen because of the forecasting performance and computing efficiency it showed in the training process. The hyperparameter tuning mechanism, described in the section that follows, was used to evaluate the performance of the ensemble model from the first week of 2015 to the 40th week of 2017.
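The idea of stacking, in which a linear meta-model learns how to weight the base forecasts, can be illustrated with a minimal Python sketch. This is not the R implementation used in the study: ordinary least squares stands in for the linear-kernel support vector regression so that the example needs no external library, and the function names and data layout are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so that row operations apply to both sides.
    m = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for c in range(col, n + 1):
                m[row][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_stacking_weights(base_forecasts, observed):
    """Fit an intercept and one weight per base model by least squares.

    base_forecasts: one row per week, one column per base model
                    (eg, ARIMA, random forest, SVR, gradient boosting).
    observed:       observed visits for the same weeks.
    Note: least squares is an assumption of this sketch; the study
    used support vector regression with a linear kernel instead.
    """
    rows = [[1.0] + [float(f) for f in r] for r in base_forecasts]
    k = len(rows[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, observed)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def ensemble_forecast(weights, forecasts):
    """Combine one week's base forecasts using the fitted weights."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], forecasts))
```

In use, the weights would be refit whenever the base models are retrained, so that a base model that has recently forecast well contributes more to the combined forecast.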

Feature Selection, Engineering, and Model Tuning
The initial features were selected after discussions with experts at the Taiwan CDC. The number of past influenza-like illness visits in the 8 previous weeks (from the Real-Time Outbreak and Disease Surveillance System and National Health Insurance database) and the length of national holidays in a week were the basic features. We also included essential holidays, such as the Lunar New Year, in the feature set because they were believed to have a significant influence on influenza-like illness visits, especially in emergency departments. Our feature engineering work included moving average, moving difference with varying time lags, and the proportion of influenza-like illness visits to total medical visits (Multimedia Appendix 2).
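The feature types named above can be sketched in Python as follows. This is an illustrative sketch only: the window length of the moving average, the lag of the moving difference, and all function and key names are assumptions made for the example, and the settings actually used are those listed in Multimedia Appendix 2.

```python
def build_features(ili_visits, total_visits, holiday_days,
                   n_lags=8, ma_window=4, diff_lag=1):
    """Return one feature dict per target week that has a full lag history.

    ili_visits, total_visits, holiday_days: weekly series of equal length.
    ma_window and diff_lag are hypothetical defaults chosen for illustration.
    """
    features = []
    start = max(n_lags, ma_window, diff_lag + 1)
    for t in range(start, len(ili_visits)):
        row = {}
        # Visits in the n_lags weeks before week t (lag_1 is the most recent).
        for lag in range(1, n_lags + 1):
            row[f"lag_{lag}"] = ili_visits[t - lag]
        # Moving average over the ma_window most recent weeks.
        recent = ili_visits[t - ma_window:t]
        row["moving_avg"] = sum(recent) / ma_window
        # Moving difference: latest week minus the week diff_lag weeks earlier.
        row["moving_diff"] = ili_visits[t - 1] - ili_visits[t - 1 - diff_lag]
        # Proportion of all medical visits that were ILI visits, latest week.
        row["ili_proportion"] = ili_visits[t - 1] / total_visits[t - 1]
        # Holiday length in the target week, known in advance from the calendar.
        row["holiday_days"] = holiday_days[t]
        features.append(row)
    return features
```

Every feature except the holiday length is computed from weeks strictly before the target week, so no value from the week being forecast enters its own feature row.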
We chose naïve (heuristic) mechanisms, instead of the conventional methods, for feature selection. We used surveillance data from 2008-2017 for feature selection and model tuning. We evaluated the overall forecasting performance of the algorithms during the first week of 2015 to the 40th week of 2017 by comparing the forecasts to observed historical data in the same period. Using this framework, we dynamically retrained each model from zero every week to incorporate newly collected data (Figure 2):
4. The forecasted number was compared to the observed number of influenza-like illness visits in week T + h using the evaluation metrics.
5. For each week from the first week of 2015 to the 40th week of 2017, we repeated Step 1 to Step 4 and calculated the evaluation metrics for specific feature sets and hyperparameters. Then we selected the feature set and hyperparameters that performed best in evaluation metrics.
If we had only used k-fold cross-validation during model training, look-ahead bias might have occurred because the time-series data are potentially autocorrelated. This framework avoided look-ahead bias and made use of all historical data before week T to train the models that forecast, at week T, the weekly influenza-like illness visits of week T + h.
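The weekly retraining scheme can be sketched as a rolling-origin loop. The following Python sketch assumes only what the text states, namely that the model for week T is trained on data up to week T and its forecast is compared with the observation in week T + h; the function names and the placeholder model are hypothetical.

```python
def walk_forward_forecasts(series, fit_and_predict, first_origin, horizon):
    """Rolling-origin evaluation: for each week T, train on weeks <= T only.

    fit_and_predict(history, horizon) is a hypothetical callable that
    retrains a model from scratch on `history` and returns its forecast
    for week T + horizon.
    Returns (target_week_index, forecast, observed) triples.
    """
    results = []
    for t in range(first_origin, len(series) - horizon):
        history = series[: t + 1]        # data up to and including week T
        forecast = fit_and_predict(history, horizon)
        observed = series[t + horizon]   # never passed to the model
        results.append((t + horizon, forecast, observed))
    return results


def last_value_model(history, horizon):
    """Placeholder model: carry the most recent observation forward."""
    return history[-1]
```

Because each forecast is produced from a slice that ends at week T, the observation being predicted can never leak into training, which is the property that shuffled k-fold splits do not guarantee for time-series data.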

Evaluation Metrics
The metrics we used to evaluate the model performance included Pearson correlation (ρ), root mean squared error (RMSE), mean absolute percentage error (MAPE), and hit rate (Multimedia Appendix 3). Lower MAPE, lower RMSE, higher hit rate, and higher correlation indicated better forecasting performance.
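For orientation, the 4 metrics can be sketched in Python as follows. The formal definitions used in the study are given in Multimedia Appendix 3; in particular, the hit rate below is an assumed definition (the share of weeks in which the forecast points in the same direction of change as the observed series, relative to the previous observed week), and the function names are hypothetical.

```python
import math


def pearson_correlation(forecast, observed):
    """Pearson correlation between forecast and observed series."""
    n = len(observed)
    mean_f = sum(forecast) / n
    mean_o = sum(observed) / n
    cov = sum((f - mean_f) * (o - mean_o) for f, o in zip(forecast, observed))
    var_f = sum((f - mean_f) ** 2 for f in forecast)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_f * var_o)


def rmse(forecast, observed):
    """Root mean squared error."""
    n = len(observed)
    return math.sqrt(sum((f - o) ** 2 for f, o in zip(forecast, observed)) / n)


def mape(forecast, observed):
    """Mean absolute percentage error, in percent (observed must be nonzero)."""
    n = len(observed)
    return 100.0 * sum(abs(f - o) / abs(o) for f, o in zip(forecast, observed)) / n


def _sign(x):
    return (x > 0) - (x < 0)


def hit_rate(forecast, observed):
    """Assumed definition: fraction of weeks with a correctly predicted direction.

    The predicted direction for week t is the sign of forecast[t] minus the
    previous observed value; the actual direction is the sign of the
    observed change over the same week.
    """
    hits = 0
    for t in range(1, len(observed)):
        predicted = _sign(forecast[t] - observed[t - 1])
        actual = _sign(observed[t] - observed[t - 1])
        hits += predicted == actual
    return hits / (len(observed) - 1)
```

Under this assumed definition, MAPE and RMSE summarize the error in magnitude, whereas the hit rate summarizes whether the trend was forecast correctly.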

Software and Visualization
We used R (version 3.4.4) on Ubuntu (version 14.0.4) with the following packages: dplyr for data munging and feature engineering, forecast for the ARIMA time-series model, randomForest for the random forest model, e1071 for support vector regression, and xgboost for the extreme gradient boosting model. The functions and hyperparameters that were used are listed in Multimedia Appendix 4. A visualization dashboard website was designed to display and compare the predictions of the 5 models (using D3.js and several JavaScript frameworks) [34].

Real-Time Estimates (Nowcast)
The visualized comparison of the estimated epidemic curve in nowcasts and the observed number of influenza-like illness visits showed that all the models, especially the ensemble model, could predict the time and magnitude of the peaks of the influenza epidemic throughout the influenza season from 2015-2017, such as the peaks of the Lunar New Year vacation for each year and the peak of the summer flu in 2017 (Figure 3). All models could appropriately fit the epidemic curve of the outpatient (ρ=0.891-0.962) and emergency (ρ=0.802-0.967) departments (Table 1).

Forecasts for the Following 3 Weeks
The forecasts for the following 3 weeks using our ensemble model exhibited satisfactory performance for predicting the epidemic trend and successfully captured the epidemic peaks. Still, there were some time lags in peak prediction in the 1-, 2-, and 3-week forecasts (Figure 4). The accuracy slightly decreased with an increase in the forecast time horizons as well (MAPE: 8.3%-18.9%; hit rate: 0.643-0.786 in the 1-week, 2-week, and 3-week forecasts) (Table 1). Although the ARIMA model had the highest accuracy and hit rate in nowcasts, the random forest and support vector regression models performed better in the forecasts of the subsequent 2 and 3 weeks, particularly in terms of outpatient influenza-like illness visits.

Real-World Application in 2018
We started using the framework with the 5 models in Taiwan CDC in early 2018. Since 2018, the nowcasts of our models have exhibited good accuracy (outpatient MAPE: 5.3%-5.8%; emergency MAPE: 5.7%-8.0%). Moreover, the 3-week forecasts maintained comparable accuracy to one another (outpatient MAPE: 8.8%-13.5%; emergency MAPE: 8.8%-13.5%; Table 2 and Multimedia Appendix 5). Hit rates of the nowcasts were 0.600-0.727 in outpatient departments and 0.582-0.782 in emergency departments and remained at a high level in the 3-week forecasts (0.787-0.908 and 0.596-0.788 in outpatient and emergency departments, respectively). All the models could approximately detect the declining trend when the magnitude of the epidemic had already reached a peak (Figure 5). The random forest and extreme gradient boosting models better identified the increasing trend during the early stage of the epidemic.

Visualization Dashboard of Forecasts
To easily compare the predictions of the 5 models, we created a visualization dashboard website to display the projections concurrently (Multimedia Appendix 6). We also provided the MAPEs and hit rates of all the models that were calculated using the recent 4-, 8-, and 52-week data. In this manner, policy makers could also consider accuracy when evaluating predictions.

Principal Results
By using the influenza surveillance data from Taiwan CDC, we established a forecasting model framework that comprises 4 machine learning models and one ensemble model. Our ensemble approach and the framework of model training could provide highly precise forecasts of weekly influenza-like illness visits for a 4-week period. Then-current week forecasts (nowcasts) were the most accurate with MAPE as low as approximately 5% and hit rates of approximately 0.75. Because of the satisfactory hit rate, the change in the influenza-like illness visits in the forecasts for the 4-week period could be regarded as the estimated temporal trend forecasts of a future epidemic as well.
A comparison with models developed in other countries or areas revealed that our models could provide better accuracy with a very low MAPE, which was less than 10% in nowcasts and remained below 20% in the 4-week period forecasts. These results outperform previously reported models with MAPE mostly greater than 15%-20% [11,25,35,36], suggesting that our models can provide promising predictivity of short-term forecasts on the epidemic magnitude; however, it is difficult to directly compare the performance among different algorithms developed in varying clinical settings and with varying data quality. As for short-term forecasting of the epidemic magnitude in Taiwan, an MAPE less than 10%, especially during the peak time, would be helpful when policymakers need to evaluate the required surge capacity. Because the weekly change of influenza-like illness visits is usually less than 10%-15% in Taiwan, an MAPE greater than 15%-20% might not reliably catch the shift in the epidemic.
The high accuracy of our models might be attributed to the comprehensive data set that we used [5]. The high coverage and good representativeness of the Real-Time Outbreak and Disease Surveillance System and National Health Insurance database allowed the forecasts to more accurately reflect the trends and magnitude of influenza-like illness without being affected by the bias that would have been caused by incompletely sampled data. Conversely, previous models in the US and other countries mostly relied on sentinel surveillance systems such as the influenza-like illness net in the US, which was mainly composed of volunteered sentinel clinics and had problems pertaining to completeness and representativeness [37][38][39][40]; the predictivity of the models might have been significantly impaired when they were trained using imprecise historical data. Researchers usually develop an algorithm using a specific period of historical data and then use the trained algorithm with newly collected data to forecast; therefore, forecasting performance can progressively worsen over time and require periodic adjustment.
In contrast, with our method, models can be retrained every week using updated data. In this way, the algorithms learn from the updated data and maintain satisfactory performance even after being used by the Taiwan CDC for more than one year. In addition, our ensemble model was adapted from the stacking method and could summarize the forecasting outputs from the 4 basic machine learning algorithms with appropriate weighting [31][32][33]. The aim of our ensemble model was not to build the most accurate forecasting model for any given time. The 4 models select features independently from our data sources and had different forecasting performance in real-world applications. For example, ARIMA was usually a lagging indicator of the peak in the influenza season, whereas random forest and extreme gradient boosting predicted the peak better but tended to underestimate the magnitude at the peak. Therefore, by combining the forecasts from the ARIMA and extreme gradient boosting models, the ensemble approach could overcome the disadvantages of each individual model and generate the most robust forecasts with stable performance.
In addition to completeness and representativeness, the Real-Time Outbreak and Disease Surveillance System and National Health Insurance database provided excellent timeliness for our forecasting models. Thanks to the nearly real-time data exchange of the Real-Time Outbreak and Disease Surveillance System and National Health Insurance database with a time lag of, at the most, 1-2 days [5], we could use influenza-like illness data from the previous week at the beginning of the week and generate forecasts for a 4-week period that started every Tuesday, for any given week. Because of the delay in the collection of surveillance data, models developed in other countries usually acquire data with at least 1-2 weeks of delay. Thus, their 1-week forecast generated using historical data up until 2 weeks prior is actually the prediction for the previous week [11][25][40][41][42]. Compared to those models [43], our forecast model, which benefits from the aforementioned short time delay and can generate the forecast of a given week (nowcast), is a truly real-time forecasting model. This information can be of great help to the authorities for decision making concerning epidemic preparedness and interventions.
In order to resolve the timeliness problem of the influenza-like illness surveillance data, researchers have attempted to explore the use of social media data (such as Twitter and Facebook) or internet search data (such as Google search and Google Flu Trends) to develop forecasting models because these data can be collected in almost real time [11,22,26,36,41,44]. However, the method of data collection, quality of social media data, and accuracy of the models still posed problems [22,26,36,41]. In our framework, we did not include social media data for the following reasons. First, ideal sources of social media information have not yet been established in Taiwan. The largest social media website in Taiwan is Facebook. Still, it is rarely used in social media surveillance because of the hindrances in collecting personal posts from individual profiles (personal walls). Microblogs such as Twitter are less popular among internet users in Taiwan. Second, as for web search data, the Taiwan CDC conducts a regular weekly press release, which usually causes a higher number of searches for the related terms on the day of the press release. For example, the searches for influenza significantly increase when an influenza-related news article is released. Therefore, it is difficult to determine whether the increase in the number of web searches, epidemic-related news, or social media discussions is due to an increase in influenza-like illness visits or the effect on the media of the official press release. Conversely, access to medical service in Taiwan is easy, and our surveillance systems, such as the Real-Time Outbreak and Disease Surveillance System and the National Health Insurance database, have already collected highly comprehensive data. Thus, we do not need to rely upon the use of social media as a supplementary data source for disease forecasting, especially for influenza-like illness.
For models based on traditional frequentist statistics, such as ARIMA, it is relatively easy to produce a 95% confidence interval, but it is not similarly easy for random forest, support vector regression, and extreme gradient boosting models. Although some literature has discussed how to generate prediction intervals for machine learning models like random forest, it is not practical to display 5 intervals on one chart simultaneously. Too much information only confuses the user and makes it difficult to interpret the trend from the 5 forecasts. As we introduced 4 forecasting models based on different algorithms, they already provide and demonstrate the variations in forecasts. When we combine these forecasts with the most robust forecasts from the ensemble model, decision makers can easily get an impression of the forecast without overlooking a potential outlier at any given time. Thus, this framework can provide similar information to that provided by confidence intervals.
The models used were machine learning models, which differ from traditional mechanistic models such as the susceptible-infectious-recovered model [45,46]. The susceptible-infectious-recovered model considers the dynamics of infectious disease and other biological components. For example, a researcher could create a compartment to simulate the interaction dynamics between infected and immunized people to estimate the effects of vaccination. However, these models are usually built on the basis of historical data and are useful in evaluating the relationship between the different compartments. This characteristic makes such a model better for assessing the effectiveness of vaccination or other interventions on disease transmission but poor at making future predictions, since it is difficult to extrapolate the results when some inputs are unknown at the time of forecasting [43]. For example, when building a susceptible-infectious-recovered-V model, including the compartment V as vaccinated, we need to enter the possible number of vaccinated people in the near future if we want to use this model for forecasting.

Limitations
There are some limitations to our forecasting models. First, the predictivity of our models decreased with longer time horizons, and the best hit rate was only approximately 0.75, suggesting that our models are better at predicting the epidemic magnitude than the trend. However, we could calibrate the forecasts by learning from the experience of the real-world application. For example, compared with traditional time-series models, such as the ARIMA model, we found that the random forest and support vector regression models may better predict the epidemic dynamics when the models were applied in 2018. By combining the forecasts and human judgment, the decision-making process for future epidemics can be further improved. Second, using other new deep learning algorithms, especially those with promising performance in time-series forecasting tasks, such as recurrent neural networks and long short-term memory networks, may help to improve the forecasting accuracy. Unlike sequential learning, we retrained the models from zero with the most recently updated data every week to manage the time factor better. Our model is designed only for short-term forecasts, not for long-term epidemic change. A deep learning algorithm may be able to deal with this type of forecasting task. Further studies using other algorithms on different forecasting targets, such as the start of a seasonal influenza outbreak and its peak time, are still required to provide more information.

Conclusions
Our project demonstrated real-time short-term forecasting models on weekly influenza-like illness visits using comprehensive influenza surveillance data. By using an ensemble approach to aggregate the forecasts of 4 machine learning models, we could provide accurate predictions for a 4-week period (nowcast and forecasts for the subsequent 3 weeks) to enhance epidemic preparedness.