This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
COVID-19 was the most widely discussed topic worldwide in 2020. At the beginning of the Italian epidemic, scientists tried to understand the diffusion of the virus and the epidemic curve of positive cases, with controversial findings and numbers.
In this paper, a data analytics study on the diffusion of COVID-19 in Italy and the Lombardy Region is developed to define a predictive model tailored to forecast the evolution of the diffusion over time.
Starting with all available official data collected worldwide about the diffusion of COVID-19, we defined a predictive model for Italy at the beginning of March 2020.
This paper aims to show how this predictive model was able to forecast the behavior of the COVID-19 diffusion and how it predicted the total number of positive cases in Italy over time. For Italy, the predictive model forecasted the end of the first COVID-19 wave by the beginning of June.
This paper shows that big data and data analytics can help medical experts and epidemiologists in promptly designing accurate and generalized models to predict the different COVID-19 evolutionary phases in other countries and regions, and for second and third possible epidemic waves.
The unexpected pandemic diffusion of COVID-19 [
In this paper, we aim to develop an accurate predictive model tailored to forecast the evolution of the diffusion over time, exploiting big data and data analytics. Generally, epidemics follow an exponential curve in the spread of positive cases. This is not the case of the curve observed in Wuhan, China [
To design the predictive model, we exploited all the official data sets available so far. We analyzed the Wuhan official data set, available at [
Exploiting the similarity between the behavior of the COVID-19 contagion in Wuhan and the starting Italian scenario, we designed our predictive curve adapting well-known mathematical methods to this particular context: the Pearson correlation index to formally evaluate this similarity in the contagion, the logistic curve (sigmoid) to design the cumulative number of COVID-19 positive cases, regression models to evaluate the best correlation fit for our predictions, and the power law to model the initial ascent of the pandemic.
In the first step of our method, we computed the Pearson correlation coefficient applied to a sample between the two data sets (Italy vs Wuhan) with reference to the daily new positive cases as:
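In its standard sample form, the Pearson coefficient can be written as below. The notation is an assumption introduced here for illustration: x_i and y_i denote the daily new positive cases in Italy and in Wuhan on day i of the compared window, n is the number of days in the sample, and the barred symbols are the sample means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Values of r close to 1 indicate that the two series of daily new cases rise and fall together.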
Starting from the data set available from March 2, 2020, it was possible to observe a strong Pearson correlation between the data set related to Italy and the data set related to Wuhan with a
Graphical comparison between the Italian curve and Wuhan curve (total cumulative positive cases).
Starting from this assumption, it was then possible to use the Wuhan data set (with a strong statistical significance) to try to predict the logistic curve of total positive cases to COVID-19 for Italy and the Lombardy Region. We started designing our model from the basic assumption that every pandemic phenomenon follows a logistic (sigmoid) curve. We then searched for a curve that best fit the initial growing part of the logistic, and we found that the best fit is a power law curve in the form of y = m * x^b. To compute the coefficients m and b of the power curve, we started our elaborations with two other assumptions in mind:
We based our initial estimation on the number of swab tests analyzed in the initial days of the pandemic period, where an average of 8000 swab tests were performed daily (in May, the number of swab tests was increased significantly to an average of 50,000 tests/day).
We assumed that the Italian Government would act promptly with restrictions and lockdowns on the Italian population and that the Italian citizens would follow these restrictions with a sense of responsibility.
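As a minimal sketch of how the coefficients m and b of a power law y = m * x^b can be estimated, one common approach is ordinary least squares on log-transformed data, since log y = log m + b log x is linear. This is not necessarily the procedure used in this study; the function name `fit_power_law` is hypothetical and the data below are synthetic, not the official case counts.

```python
import numpy as np

def fit_power_law(days, cases):
    """Estimate m and b in y = m * x**b by linear least squares in log-log space."""
    log_x = np.log(np.asarray(days, dtype=float))
    log_y = np.log(np.asarray(cases, dtype=float))
    # polyfit with degree 1 returns [slope, intercept]; slope is b, intercept is log(m)
    b, log_m = np.polyfit(log_x, log_y, 1)
    return float(np.exp(log_m)), float(b)

# Synthetic, illustrative data only (assumed values, not official counts)
days = np.arange(1, 11)          # day index starts at 1 so that log(x) is defined
cases = 50.0 * days ** 2.5       # exact power law with m = 50, b = 2.5

m, b = fit_power_law(days, cases)
print(f"m = {m:.2f}, b = {b:.2f}")
```

On real data the fit would be restricted to the initial ascent of the curve, because the power law is only meant to approximate the growing part of the logistic.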
Moreover, we used the stabilized data set for Wuhan City, and we adopted additional official statistics (published by the WHO [
The goodness of fit for the detected model is high, with a high coefficient of determination (
The predicted curve of total positive cases in Italy (as of March 2, 2020).
Graphical comparison between the real vs predicted number of Italian cumulative cases. Tot.: total.
Our model forecasted (18 days in advance) the peak in the number of daily new cases on March 21, 2020, with a total of 42,000 positive cases against the official datum of 53,500 total positive cases, with a confidence level of 95%. The model significantly outperformed other predictions based on exponential models that forecasted more than 180,000 positive cases.
As highlighted in
March 21, 2020, is registered as the date with the absolute peak in the number of daily new cases; hence the error, which is cumulative, increases over time.
The same consideration applies to the Lombardy Region (where the peak was forecasted and actually observed on March 17). The Lombardy Region accounts for 50% of the overall national value.
The predictive model is strongly related to the curve of Wuhan. Although the increase of daily cases is similar between the Italian curve and the Wuhan curve, the decrease is a bit different. The Italian one is less steep than the Wuhan decrease since the restrictions implemented by the Italian government are less stringent than the Wuhan restrictions, and the Italian population reacted to the restrictions with less determination.
Moreover, it is important to highlight that (as depicted in
Graphical comparison between the Lombardy Region curve and the Wuhan curve (total cumulative positive cases).
The model elaborated from the beginning of March 2020 (
It is important to observe that reaching the plateau does not indicate that the COVID-19 epidemic is over; it means that the cumulative number of COVID-19 positive cases is slowing down in its ascent. The WHO guidelines suggest waiting for the contagion to be +0 (ie, the new daily cases are reduced to a few dozen positive cases per day) and then maintaining restrictions for two additional cycles of COVID-19 incubation (mean incubation period 5-6 days, range 1-14 days). Hence, considering the national plateau on March 31, 2020, we estimated the actual containment of the COVID-19 epidemic at the beginning of June for the following reasons:
One additional month, after the beginning of the plateau, to reach a small and contained number of new daily cases (end of April)
One additional month in waiting for the two COVID-19 incubation cycles (end of May)
Gradual return to normal life starting from the beginning of June
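The timeline above can be checked with simple date arithmetic. The sketch below rests on two assumptions that are not stated in the text: "one month" is taken as 30 days, and each incubation cycle is taken at the upper bound of the quoted range (14 days).

```python
from datetime import date, timedelta

plateau = date(2020, 3, 31)            # national plateau, as stated in the text

ONE_MONTH = timedelta(days=30)         # assumption: one month approximated as 30 days
MAX_INCUBATION = timedelta(days=14)    # assumption: upper bound of the 1-14 day range

# Step 1: one month for daily new cases to fall to a small, contained number
low_cases = plateau + ONE_MONTH

# Step 2: two further incubation cycles with restrictions still in place
end_of_waiting = low_cases + 2 * MAX_INCUBATION

print(low_cases, end_of_waiting)
```

Under these assumptions the waiting period ends in the last days of May, which is consistent with a gradual return to normal life from the beginning of June.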
In
In
Graphical plot of daily new cases in the Lombardy Region.
Graphical plot of daily new cases in Wuhan.
This is a good indicator (the peak matches the inflection point in the logistic curve) that we had started the descent toward the plateau in the Lombardy Region (the last inflection point in the logistic curve).
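The link between the peak of daily new cases and the inflection point of the cumulative curve can be illustrated numerically. The sketch below uses a generic three-parameter logistic with hypothetical values for the plateau K, the growth rate r, and the inflection day t0; none of these are the fitted parameters of this study.

```python
import numpy as np

def logistic(t, K, r, t0):
    """Cumulative cases: plateau K, growth rate r, inflection point at t0."""
    return K / (1.0 + np.exp(-r * (t - t0)))

# Hypothetical parameters, chosen for illustration only
K, r, t0 = 100_000.0, 0.2, 30.0

t = np.arange(0, 61)                   # days 0..60
cumulative = logistic(t, K, r, t0)

# Daily new cases are the day-to-day increments of the cumulative curve;
# daily_new[i] is the increase recorded on day t[i + 1]
daily_new = np.diff(cumulative)
peak_day = int(t[1:][np.argmax(daily_new)])

print(f"peak of daily new cases on day {peak_day}, inflection point at day {t0:.0f}")
```

With daily sampling the peak falls within one day of t0, after which daily new cases decline while the cumulative curve flattens toward K.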
Additionally, in that case, it is important to observe that the curve of fatalities and recoveries (as depicted in
Graphical plot of COVID-19 fatalities and recoveries in Italy.
This study was conducted in the early days of the pandemic in Italy to promptly define a model able to predict the curve of total positive cases in Italy and the Lombardy Region.
The model predicted the real data published daily by the Department of the Italian Civil Protection, estimating in a precise manner and several months in advance the plateau of the logistic curves for both the Lombardy Region and Italy, and the end of this first COVID-19 pandemic wave. This suggests the possibility of generalizing the model to other countries that follow the restrictions imposed by the Italian government, to obtain a clear picture of the evolution of the number of new cases and to act promptly with policies and restrictions that can maximize care and treatments offered to patients with COVID-19. Moreover, this paper shows that big data and data analytics can help medical experts and epidemiologists in promptly designing accurate models to predict the different COVID-19 evolutionary phases in other countries and regions, and for second and third possible epidemic waves.
WHO: World Health Organization
This research is partially funded by the European Research Council Advanced Grant project 693174 GeCo (Data-Driven Genomic Computing), 2016-2021.
None declared.