This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Advance prediction of the daily incidence of COVID-19 can aid policy making on the prevention of disease spread, which can profoundly affect people's livelihood. In previous studies, predictions were investigated for single or several countries and territories.
We aimed to develop models that can be applied for real-time prediction of COVID-19 activity in all individual countries and territories worldwide.
Data on the previous daily incidence and infoveillance data (search volume data via Google Trends) from 215 individual countries and territories were collected. A random forest regression algorithm was used to train models to predict the daily new confirmed cases 7 days ahead. Several methods were used to optimize the models, including clustering the countries and territories, selecting features according to the importance scores, performing multiple-step forecasting, and updating the models at regular intervals. The performance of the models was assessed using the mean absolute error (MAE), root mean square error (RMSE), Pearson correlation coefficient, and Spearman correlation coefficient.
Our models can accurately predict the daily new confirmed cases of COVID-19 in most countries and territories. Of the 215 countries and territories under study, 198 (92.1%) had MAEs <10 and 187 (87.0%) had Pearson correlation coefficients >0.8. For the 215 countries and territories, the mean MAE was 5.42 (range 0.26-15.32), the mean RMSE was 9.27 (range 1.81-24.40), the mean Pearson correlation coefficient was 0.89 (range 0.08-0.99), and the mean Spearman correlation coefficient was 0.84 (range 0.21-1.00).
By integrating previous incidence and Google Trends data, our machine learning algorithm was able to predict the incidence of COVID-19 in most individual countries and territories accurately 7 days ahead.
COVID-19, a highly infectious disease with serious clinical manifestations, was first reported in China in late 2019 and spread to other countries within weeks [
Prediction of the incidence of COVID-19 in individual countries and territories is extremely important to provide reference for governments, health care providers, and the general public to prepare management measures to combat the disease. In the early months of 2020, prediction was useful to inform the countries and territories at risk of outbreak to take action for prevention. Currently, although the COVID-19 outbreak has already occurred in almost all regions globally, prediction has merits in monitoring the severity of spreading and recovery and assessing the likelihood of a secondary or tertiary epidemic. COVID-19 may become a seasonal or persistent epidemic in future years, like influenza. Thus, predicting and monitoring COVID-19 activity is needed at present and in the future. As a typical example, Google search data can predict the incidence of influenza effectively [
There are reported studies on the prediction of the incidence of COVID-19. A susceptible-exposed-infectious-recovered metapopulation model was used to simulate the epidemic in major cities in China [
In this study, we aim to develop an efficient and novel methodology for real-time prediction of COVID-19 activity based on the previous daily incidence of COVID-19 and infoveillance data (search volume data via Google Trends) in all individual countries and territories worldwide.
Two sets of data were collected. The first data set was the search volume data obtained from the Google Trends service. We collected the Google search volumes of 28 candidate features related to COVID-19 from January 1 to July 26, 2020, in 215 countries and territories. We used 14 terms, including
The second data set was the daily number of new COVID-19 cases from January 10 to August 16, 2020, in 215 countries and territories, obtained from the WHO website [
The Google Trends search volumes for the topics
We conducted two types of correlation analyses in each country or territory to find the right combination of input features. The first analysis was the Spearman correlation between the Google volume data of each feature and daily new confirmed cases with n days of lag (n=7, 14, 21, 28). We obtained the average and maximum Spearman correlation coefficients for different lag days between each Google volume data feature and daily newly confirmed cases in the 215 countries and territories (
We calculated the Spearman correlation among 28 independent Google Trends volume data features in the 215 countries and territories to assess the independence of the features. We obtained the average results of the Spearman correlation coefficients between the top 10 cross-correlation Google volume data features in the 215 countries and territories. The highest correlation coefficient, between
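The lagged correlation analysis described above can be illustrated with a minimal sketch. It assumes that a lag of n days means pairing the search volume on day t with the new cases on day t + n; the function names and the synthetic series are hypothetical and are not taken from the study.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def lagged_spearman(search_volume, new_cases, lag):
    """Correlate search volume on day t with new cases on day t + lag."""
    if not 0 < lag < len(search_volume):
        raise ValueError("lag must be between 1 and len(series) - 1")
    return spearman(search_volume[:-lag], new_cases[lag:])


# Synthetic example (not study data): cases follow search volume by 7 days.
search_volume = [(7 * day) % 40 for day in range(40)]
new_cases = [0] * 7 + [v ** 2 for v in search_volume[:-7]]

by_lag = {lag: lagged_spearman(search_volume, new_cases, lag)
          for lag in (7, 14, 21, 28)}
best_lag = max(by_lag, key=by_lag.get)
```

In this toy series the 7-day lag gives the strongest rank correlation because the cases were constructed as a monotone function of the search volume 7 days earlier; with real data, the average and maximum over countries would be taken as described in the text.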
The internet search patterns and the management of COVID-19 vary among countries and territories. We used a hierarchical clustering technique to cluster the 215 countries and territories into several groups, and we built a model for each group. Two types of data were used for clustering. One type was the related metrics of the importance scores of 29 input features resulting from the random forest regression algorithm in 215 countries and territories. The other type was the correlation of the development trend of daily new confirmed cases in the 215 countries and territories. Finally, the 215 countries and territories were clustered into 8 groups (
In addition, to explore the internal relationships between the countries and territories in each cluster, two types of Spearman correlation coefficients were calculated. The first was the Spearman correlation between the daily new confirmed cases among the countries in the first five groups, which contain more than one country or territory. The second was the Spearman correlation between the search volume of the term
Average Spearman correlation coefficients in each group of countries/territories for the 2 data sets.
Data set  Group 1 (n=150)  Group 2 (n=38)  Group 3 (n=11)  Group 4 (n=7)  Group 5 (n=6)  Group 6 (n=1)  Group 7 (n=1)  Group 8 (n=1)
Daily new cases  0.46  0.18  0.35  0.26  0.38  0.23  0.04  0.15
Search volume  0.76  0.68  0.71  0.74  0.69  0.52  0.00  0.43
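As an illustration of the clustering step, the following sketch groups countries by their feature-importance profiles with agglomerative hierarchical clustering. The linkage method (Ward), the Euclidean distance, and the synthetic 29-feature profiles are assumptions made for the example; the study does not state which linkage or distance was used.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)

# Hypothetical data: 30 "countries", each described by 29 importance scores,
# drawn around 3 latent profiles (10 countries per profile).
centers = rng.random((3, 29))
profiles = np.vstack(
    [center + 0.01 * rng.standard_normal((10, 29)) for center in centers]
)

# Build the hierarchy (assumed Ward linkage on Euclidean distances).
tree = linkage(profiles, method="ward")

# Cut the tree so that at most 3 flat clusters remain.
labels = fcluster(tree, t=3, criterion="maxclust")
```

Cutting the tree with `criterion="maxclust"` fixes the number of groups in advance; cutting at a distance threshold instead would let the data determine how many groups emerge, including single-country groups such as groups 6 to 8 in the study.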
We built a model for each group of countries/territories separately according to the clustering results and produced a 7-day-ahead incidence prediction of COVID-19 for the 215 countries and territories. A random forest regression (RFR) model combines the predictions of many decision trees; therefore, it has good robustness and a strong ability to resist overfitting. In addition, the RFR algorithm provides an importance score for each feature, which was helpful for feature selection. Therefore, the RFR algorithm was used to forecast the daily new confirmed cases of COVID-19 over the time series data set, which consisted of Google Trends data and the incidence of COVID-19 with a 21-day lag provided by the WHO. Python 3.7.6 was used for the modeling and evaluation.
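A minimal sketch of this kind of model is shown below. It assumes scikit-learn's `RandomForestRegressor` (the study names Python but not the library), uses only two illustrative features, and runs on synthetic data; the feature names, hyperparameters, and the 80/20 chronological split are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

HORIZON = 7  # predict daily new cases 7 days ahead
rng = np.random.default_rng(0)

# Synthetic series (not study data): cases follow search interest by 7 days.
days = np.arange(220)
trend = 50 + 30 * np.sin(days / 10) + rng.normal(0, 2, days.size)
cases = np.empty(days.size)
cases[:HORIZON] = 20
cases[HORIZON:] = 0.4 * trend[:-HORIZON] + rng.normal(0, 1, days.size - HORIZON)

# Features observed on day t; target is the case count on day t + HORIZON.
X = np.column_stack([trend[:-HORIZON], cases[:-HORIZON]])
y = cases[HORIZON:]

# Chronological split: train on the earlier days, test on the later days.
split = int(0.8 * len(y))
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:split], y[:split])
pred = model.predict(X[split:])

# Importance scores, which can then be used to rank and select features.
importances = dict(zip(["google_trend", "previous_cases"],
                       model.feature_importances_))
```

The split is chronological rather than random so that the model is never evaluated on days earlier than those it was trained on, which mirrors how a real-time forecast would be used.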
To quantitatively evaluate the performance of the models, we calculated four metrics: the mean absolute error (MAE), root mean square error (RMSE), Pearson correlation coefficient, and Spearman correlation coefficient.
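In their standard forms, these metrics are defined as

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

$$r = \frac{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)\left(\hat{y}_i - \bar{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(\hat{y}_i - \bar{\hat{y}}\right)^2}}$$

where $n$ is the number of predicted days, $y_i$ and $\hat{y}_i$ are the actual and predicted daily new confirmed cases on day $i$, and $\bar{y}$ and $\bar{\hat{y}}$ are their means. The Spearman correlation coefficient is the Pearson correlation coefficient $r$ computed on the ranks of $y_i$ and $\hat{y}_i$.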
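The four evaluation metrics can be computed as in the following sketch, which assumes NumPy and SciPy are available; the helper name `evaluate` and the example numbers are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr


def evaluate(y_true, y_pred):
    """Return MAE, RMSE, and Pearson/Spearman coefficients with P values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    pearson_r, pearson_p = pearsonr(y_true, y_pred)
    spearman_rho, spearman_p = spearmanr(y_true, y_pred)
    return {
        "MAE": float(np.mean(np.abs(errors))),
        "RMSE": float(np.sqrt(np.mean(errors ** 2))),
        "Pearson": float(pearson_r),
        "Pearson P": float(pearson_p),
        "Spearman": float(spearman_rho),
        "Spearman P": float(spearman_p),
    }


# Illustrative values only (not study data).
actual = [10, 12, 15, 20, 18]
predicted = [11, 12, 13, 22, 17]
scores = evaluate(actual, predicted)
```

MAE and RMSE measure the size of the errors in cases per day, whereas the two correlation coefficients measure how well the predicted curve follows the shape of the actual curve, so a model can score well on one family and poorly on the other.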
Several experiments were conducted to optimize and validate the results. First, to evaluate the role of each input feature in the prediction, we calculated the importance score of each input feature. To validate the features included in this study, a series of ablation studies, in which the top n features (n=1, 2, ..., 29) were used as inputs in turn, were conducted on different groups of countries/territories according to the clustering results. The Pearson correlation coefficient was averaged for the 8 groups of countries/territories based on the number of input features (
Second, to reduce the influence of random noise, we used multiple-step forecasting (5-step, 6-step, 7-step, 8-step, 9-step, and 10-step), which used the data over the past n days (n=5, 6, 7, 8, 9, 10) to predict the daily new confirmed cases of COVID-19. The average quantitative results of the different steps of the second period in the 215 countries and territories are shown in
Third, the proposed RFR algorithm was compared with two other mainstream algorithms. One was decision tree regression (DTR), a traditional machine learning algorithm. The other was long short-term memory (LSTM), a deep learning algorithm. Specifically, the LSTM, an artificial recurrent neural network, was implemented as a 3-layer model in our study. TensorFlow and Keras were used as frameworks for training the LSTM models.
Fourth, we trained the models with as much data as possible. Therefore, we updated the models and repeated the experiments at regular intervals. The effectiveness of the models in 3 time periods was compared.
Fifth, we predicted the incidence of COVID-19 2 days and 7 days ahead.
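The multiple-step inputs and the two prediction horizons can be illustrated with a sliding-window sketch. It assumes that an n-step input is the window of the n most recent daily values and that the target is the value a fixed number of days after the last day in the window; the function name `make_windows` is hypothetical.

```python
def make_windows(series, n_steps, horizon):
    """Pair each window of n_steps past values with a future target.

    The target is the value `horizon` days after the last day of the window.
    """
    inputs, targets = [], []
    for end in range(n_steps, len(series) - horizon + 1):
        inputs.append(series[end - n_steps:end])      # days end-n_steps .. end-1
        targets.append(series[end - 1 + horizon])     # day (end-1) + horizon
    return inputs, targets


# Illustrative series: day index used as the value so the pairing is visible.
daily_cases = list(range(20))

# 7-step input, 7 days ahead.
X7, y7 = make_windows(daily_cases, n_steps=7, horizon=7)

# 5-step input, 2 days ahead.
X2, y2 = make_windows(daily_cases, n_steps=5, horizon=2)
```

Longer windows give the model more history to average over, which can dampen day-to-day noise, but they also leave fewer training samples from the same series; this trade-off is one possible reason an intermediate window length can perform best.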
Average Pearson correlation coefficients as a function of the number of top-scoring features included in the prediction of the incidence of COVID-19 in different groups of countries. The red, blue, green, orange, purple, yellow, black, and brown lines represent groups 1 to 8, respectively.
Average results without and with Google Trends data in 215 countries.
Method  MAE^{a} Avg^{d}  MAE Max^{e}  MAE Min^{f}  RMSE^{b} Avg  RMSE Max  RMSE Min  Pearson correlation Avg  Pearson correlation Max  Pearson correlation Min  Spearman correlation Avg  Spearman correlation Max  Spearman correlation Min  P value^{c} Avg  P value Max  P value Min
Without Google Trends  5.67  17.20  0.77  10.27  26.00  2.86  0.82  0.99  –0.02  0.76  0.99  –0.02  .018  <.999  <.001
With Google Trends  5.42  15.32  0.26  9.27  24.40  1.81  0.89  0.99  0.08  0.84  1.00  0.21  .001  .24  <.001
^{a}MAE: mean absolute error.
^{b}RMSE: root mean square error.
^{c}
^{d}Avg: average.
^{e}Max: maximum.
^{f}Min: minimum.
The results of multiple-step forecasting in 215 countries and territories in the second period.
Method  MAE^{a}  RMSE^{b}  Pearson correlation  Spearman correlation  P value^{c}
Step_5  5.48  9.35  0.88  0.84  .006 
Step_6  5.46  9.21  0.88  0.84  .006 
Step_7  5.44  9.12  0.88  0.84  .006 
Step_8  5.66  9.49  0.88  0.84  .006 
Step_9  5.66  9.46  0.88  0.85  .006 
Step_10  5.70  9.50  0.88  0.84  .006 
^{a}MAE: mean absolute error.
^{b}RMSE: root mean square error.
^{c}
Average performance of different methods in 215 countries and territories.
Method  MAE^{a}  RMSE^{b}  Pearson correlation  Spearman correlation  P value^{c}
Decision tree regression  6.78  11.57  0.81  0.79  .011
Long short-term memory  9.13  14.29  0.76  0.78  .025
Random forest regression  5.42  9.27  0.89  0.84  .006 
^{a}MAE: mean absolute error.
^{b}RMSE: root mean square error.
^{c}
The average results of different time windows of the second period in 215 countries and territories.
Time window  MAE^{a}  RMSE^{b}  Pearson correlation  Spearman correlation  P value^{c}
2 days ahead  4.09  7.40  0.94  0.87  .003 
7 days ahead  5.42  9.27  0.89  0.84  .006 
^{a}MAE: mean absolute error.
^{b}RMSE: root mean square error.
^{c}
We produced 7-day-ahead and real-time COVID-19 forecasts for 215 countries and territories. Our final models performed well in predicting the daily new confirmed cases of COVID-19 in most of the countries and territories examined. Of the 215 countries and territories, 198 (92.1%) had MAEs <10 (
The performance of the models in 3 different time periods is shown in
Heat maps of the (a) mean absolute error and (b) Pearson correlation coefficients of the predicted and actual daily new confirmed case numbers in different countries worldwide. The deeper the color, the lower the mean absolute error (a) and the higher the correlation coefficient (b).
Performance of the final models in the different groups of countries/territories.
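Cluster  Countries/territories, n  MAE^{a} Avg^{d}  MAE Max^{e}  MAE Min^{f}  RMSE^{b} Avg  RMSE Max  RMSE Min  Pearson correlation Avg  Pearson correlation Max  Pearson correlation Min  Spearman correlation Avg  Spearman correlation Max  Spearman correlation Min  P value^{c} Avg  P value Max  P value Min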
1  150  5.52  15.00  0.26  8.96  23.45  3.32  0.94  0.99  0.72  0.92  1.00  0.74  <.001  <.001  <.001  
2  38  3.92  10.18  0.36  7.17  14.73  1.81  0.90  0.98  0.65  0.67  0.90  0.25  <.001  <.001  <.001  
3  11  7.38  13.26  1.71  13.77  20.68  5.97  0.77  0.89  0.70  0.81  0.92  0.63  <.001  <.001  <.001  
4  7  6.41  8.82  1.32  14.01  16.86  8.83  0.52  0.63  0.39  0.60  0.76  0.42  <.001  <.001  <.001  
5  6  6.12  15.32  1.76  13.01  23.19  4.99  0.66  0.86  0.50  0.71  0.79  0.58  <.001  <.001  <.001  
6  1  5.72  5.72  5.72  12.75  12.75  12.75  0.27  0.27  0.27  0.58  0.58  0.58  <.001  <.001  <.001  
7  1  2.39  2.39  2.39  12.17  12.17  12.17  0.30  0.30  0.30  0.46  0.46  0.46  <.001  <.001  <.001  
8  1  14.60  14.60  14.60  24.40  24.40  24.40  0.08  0.08  0.08  0.21  0.21  0.21  .24  .24  .24  
Total  215  5.42  15.32  0.26  9.27  24.40  1.81  0.89  0.99  0.08  0.84  1.00  0.21  .001  .24  <.001 
^{a}MAE: mean absolute error.
^{b}RMSE: root mean square error.
^{c}
^{d}Avg: average.
^{e}Max: maximum.
^{f}Min: minimum.
Performance of the models in different time periods.
Time period (2020)  MAE^{a} Avg^{d}  MAE Max^{e}  MAE Min^{f}  RMSE^{b} Avg  RMSE Max  RMSE Min  Pearson correlation Avg  Pearson correlation Max  Pearson correlation Min  Spearman correlation Avg  Spearman correlation Max  Spearman correlation Min  P value^{c} Avg  P value Max  P value Min
Feb 04-July 17  10.11  20.88  0.31  15.76  33.15  3.94  0.79  0.99  0.03  0.75  1.00  0.04  .009  <.999  <.001
Feb 04-July 31  5.44  15.98  0.28  9.12  25.02  2.71  0.88  0.99  0.05  0.84  1.00  0.13  .006  .83  <.001
Feb 04-Aug 16  5.42  15.32  0.26  9.27  24.40  1.81  0.89  0.99  0.08  0.84  1.00  0.21  .001  .24  <.001
^{a}MAE: mean absolute error.
^{b}RMSE: root mean square error.
^{c}
^{d}Avg: average.
^{e}Max: maximum.
^{f}Min: minimum.
The prediction curves in China (a-d), India (e-h), Italy (i-l), and the United States (m-p). The blue and red lines indicate the predicted and real daily new cases, respectively. The first panels in each country (a, e, i, m) show the prediction for July 3 to July 17. The second panels (b, f, j, n) show the actual number of daily new cases for July 3 to July 17 and the prediction for July 18 to July 31. The third panels (c, g, k, o) show the actual number of daily new cases for July 18 to July 31 and the prediction for August 1 to August 16. The fourth panels (d, h, l, p) show the actual number of daily new cases from August 1 to August 16.
In this study, we have established a method to obtain a 7-day-ahead prediction of COVID-19 activity by combining the previous incidence of COVID-19 and Google search volume data at the country or territory level in the real world. The models performed well in most countries. In a total of 215 countries and territories, the mean MAE, RMSE, Pearson correlation, and Spearman correlation values were 5.42, 9.27, 0.89, and 0.84, respectively. The
Our study has several advantages compared to other reported studies on similar topics. First, we investigated 215 individual countries and territories. Other studies only investigated a single country [
The high accuracy of our models enables real-time forecasting of the short-term trends of COVID-19, not only during the outbreak but also during the recovery period and subsequent second or third epidemic in individual countries. The methods can also be adapted to predict the incidence of subregions. Real-time digital surveillance of COVID-19 is provided, which would save time and resources in data collection. As the COVID-19 pandemic is changing rapidly, digital health systems should provide an effective solution to address the challenges to public health and consequential socioeconomic complications with high efficiency. The epidemic of COVID-19 is likely to persist in the future, and it may even become a seasonal infectious disease like influenza. Therefore, accurate surveillance and prediction of its activity would help governments, health care providers, and the general population to take appropriate actions to combat the disruptive effects of COVID-19 [
We recognize some limitations of the current study. First, the models did not predict well in some countries, which were clustered into independent groups and had limited periods of data for training. Second, the occurrence of COVID-19 in some countries would have increased the awareness of other countries, especially those with a close relationship, consequently resulting in a large Google search volume for terms related to COVID-19. Third, there is a limitation in applying our models in countries and territories where Google is not the mainstream search engine. Thus, predictions of COVID-19 incidence in regions with limited internet access or prohibited Google access may not be accurate. Fourth, COVID-19 and influenza share some common symptoms and even prevention methods. Therefore, the Google Trends data of these terms may not be able to differentiate COVID-19 and influenza. Further studies are needed to develop an algorithm to differentiate these two infectious diseases using Google Trends.
In this study, we integrated the Google Trends data and previous daily incidence of 215 individual countries and territories using the techniques of feature engineering, country clustering, and machine learning. We were able to predict the daily incidence of COVID-19 7 days ahead in real time in most countries and territories.
Spearman correlation coefficients of 28 Google volume data features with the incidence of COVID-19 at n (n=7, 14, 21, 28) days of lag in 215 countries.
List of countries and territories with different clustering results.
Performance of the predicting model in 215 individual countries and territories.
DTR: decision tree regression
LSTM: long short-term memory
MAE: mean absolute error
RFR: random forest regression
RMSE: root mean square error
SIRD: susceptible-infected-recovered-dead
WHO: World Health Organization
This study was supported in part by the National Key R&D Program of China (2018YFA0701700), the Department of Education of Guangdong Province (2020KZDZX1086), the National Natural Science Foundation of China (61971298), and the Grant for Key Disciplinary Project of Clinical Medicine under the Guangdong High-level University Development Program.
YP and YR analyzed the data. YP drafted the manuscript. CL collected the data. CL, CPP, HC, and XC interpreted the data. HC and XC designed the study. CP and HC revised the manuscript. All authors agreed on the final version for submission to the journal.
None declared.