This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
In Brazil, a substantial number of coronavirus disease (COVID-19) cases and deaths have been reported. As of June 9, 2020, it was the second most affected country worldwide. Official Brazilian government sources present contradictory data on the impact of the disease; thus, the actual number of infected individuals and deaths in Brazil may be far larger than officially reported, and the actual spread of the disease has very likely been underestimated.
This study investigates the underreporting of cases and deaths related to COVID-19 in the most affected cities in Brazil, based on public data available from official Brazilian government internet portals, to identify the actual impact of the pandemic.
We used data from historical deaths due to respiratory problems and other natural causes from two public portals: DATASUS (Department of Informatics of the Unified Healthcare System) (2010-2018) and the Brazilian Transparency Portal of Civil Registry (2019-2020). These data were used to build time-series models (modular regressions) to predict the expected mortality patterns for 2020. The forecasts were used to estimate the possible number of deaths that were incorrectly registered during the pandemic and posted on government internet portals in the most affected cities in the country.
Our model found a significant difference between the real and expected values. The number of deaths due to severe acute respiratory syndrome (SARS) was considerably higher in all cities, with increases between 493% and 5820%. This sudden increase may be associated with errors in reporting. An average underreporting of 40.68% (range 25.9%-62.7%) is estimated for COVID-19–related deaths.
The significant rates of underreporting of deaths analyzed in our study demonstrate that officially released numbers are much lower than actual numbers, making it impossible for the authorities to implement a more effective pandemic response. Based on analyses carried out using different fatality rates, it can be inferred that Brazil’s epidemic is worsening, and the actual number of infectees could already be between 1 and 5.4 million.
On December 31, 2019, the World Health Organization (WHO) received a report from China about cases of pneumonia of unknown etiology in Wuhan, Hubei Province. By January 7, 2020, Chinese scientists isolated the virus, identifying it as a novel coronavirus and initially referred to it as 2019-nCoV (later named severe acute respiratory syndrome coronavirus 2 [SARS-CoV-2]) [
The global impact of the virus has been of great concern and has overburdened public health systems worldwide. It can be considered the first true global epidemic of this magnitude in the digital era [
As the disease propagates, the burden to health care systems increases, despite a large number of asymptomatic cases. Studies in China show that 62% of COVID-19 transmissions occur as a result of asymptomatic and presymptomatic individuals [
Outside the Asian continent, the disease was initially concentrated in Western Europe and North America. In a short period of time, however, it expanded to other parts of the world like Africa and Latin America [
The country’s difficult situation is magnified due to social inequalities. According to the Brazilian Institute of Geography and Statistics (IBGE) [
Brazil has 27 states divided territorially into five major regions: North, Northeast, Midwest, Southeast, and South, with specific climatic, social, and economic characteristics. According to the IBGE [
A proper estimation of underreported or wrongly reported cases is necessary for a better understanding of the actual epidemic scenario; this will allow for necessary and effective measures to be undertaken by the authorities. In Brazil, underreporting is due to the low rate of testing per 1 million inhabitants. Additionally, there is significant delay in the reporting of test results [
Different grades of testing and reporting are observed in other countries [
This undersampling leads to a high degree of underreported cases, which affects estimates of the actual fatality rate of the disease [
It has been suggested that the reproduction number (R) must be less than 1 in order to reduce the number of infected cases [
With the increasing spread of SARS-CoV-2 in Brazil, there has been a considerable growth in the population's interest for information about the disease. According to Google Trends [
To manage this increase in interest, several official internet portals were created by the Brazilian municipal, state, and federal bodies for dissemination, monitoring, and guidance. However, the data presented by these public internet portals are contradictory and inaccurate. Some of the released data substantially underreport the true number of cases, leading to false perceptions that the contagion is under control. The population must trust the data provided to them in order to accept proposed recommendations [
We believe that by aggregating officially available information into a single internet portal, removing contradictions, and using reliable sources, we can gather support from the Brazilian populace to follow WHO-recommended guidelines, thus reducing the contagion rate in Brazil. This portal is under development as part of the work presented in this paper and will enable policy and decision makers to base their assessments on scientific evidence and guide citizens in adopting recommended measures and behaviors (eg, social distancing, frequent hand sanitizing, and more attention to hygiene issues).
The work described in this paper conducts an investigation into underreported deaths with respect to COVID-19 based on historical mortality data due to respiratory problems and other natural causes. These data are publicly available on the internet through the two main portals of the Brazilian government: the Mortality Information System (SIM) of DATASUS (Department of Informatics of the Unified Healthcare System) [
In this study, we used as case studies the capital cities of three regions that were most affected by the pandemic: North (Belém and Manaus), Northeast (Fortaleza and Recife), and Southeast (São Paulo and Rio de Janeiro). The resulting mortality underreporting scenario will be considered for the entire country as these cities represent around 47% of the total deaths in Brazil as of June 9, 2020 [
We followed the Knowledge Discovery in Databases workflow to extract new and relevant data to enable decision making (
Methodology diagram adapted from Fayyad et al [
Data were collected from two government sources accessible for public use. The registers present in both databases follow the international standards set by the WHO.
Part of the data collected for this research was extracted from DATASUS (SIM) [
Another source was the Brazilian Transparency Portal of Civil Registry [
The Brazilian civil registry portal presents the data duly notarized by the civil registry offices and follows a series of legal timelines established by the Brazilian Constitution—a family has 24 hours after the death of a member to notify the registry office, and in turn the registry office has up to 5 days to duly register the death; within 8 days the Information Center of Civil Registry receives the report, which is published by the civil registry portal. Therefore, there may be a delay of 14-15 days for the portal to publish a record.
In addition to the large delay in the Transparency Portal of Civil Registry death reports, it is important to highlight that the update frequency might be different for each city. For certain regions, the delays are even longer. In general, the data for capital cities are updated more frequently. For this reason, although the data were collected on June 1st, the analysis will be conducted using data made available up to May 21st. By adopting this procedure, we can mitigate the effect of late notifications in the analysis.
Data were preprocessed by removing missing and duplicated information to improve quality, so that more significant results can be presented. This removal of data was not substantial, and the entire data set was stored in a single database.
The time series of deaths due to the previously mentioned diseases were from DATASUS (SIM) and were duly processed to be concatenated with those from the Transparency Portal of Civil Registry. Following the conditions used by the civil registry portal, each occurrence of death was classified according to the International Statistical Classification of Diseases and Related Health Problems (ICD) [
In order to classify each record of data from DATASUS (SIM) based on the listed conditions, it was necessary to identify the ICDs [
In order to merge the databases, data referring only to death records for capital cities were extracted from DATASUS (SIM). These records were then aggregated on a daily basis. Therefore, both the databases are now compatible with respect to their indices and columns, making it possible to concatenate the data and merge into a single data set, which was then used to conduct this study.
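The aggregation and concatenation described above can be sketched with the Python standard library. The record layout (date, city, cause) and the sample rows are hypothetical; the actual SIM and civil registry schemas are not described in this text.

```python
from collections import Counter
from datetime import date


def aggregate_daily(records):
    """Count deaths per (day, city, cause) from individual death records."""
    return Counter((r["date"], r["city"], r["cause"]) for r in records)


def concatenate(historical, recent):
    """Merge two daily-count tables that share the same (day, city, cause) index."""
    merged = Counter(historical)
    merged.update(recent)  # counts are added; the two periods are assumed not to overlap
    return merged


# Hypothetical individual records from the historical source (one row per death).
sim_records = [
    {"date": date(2018, 5, 1), "city": "Manaus", "cause": "pneumonia"},
    {"date": date(2018, 5, 1), "city": "Manaus", "cause": "pneumonia"},
    {"date": date(2018, 5, 2), "city": "Manaus", "cause": "other"},
]
# Hypothetical daily counts as they might be published by the registry portal.
registry_daily = Counter({(date(2020, 5, 1), "Manaus", "pneumonia"): 7})

dataset = concatenate(aggregate_daily(sim_records), registry_daily)
```

Keying both sources by the same tuple is what makes them "compatible with respect to their indices" in this sketch; any equivalent tabular tool would serve the same purpose.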
Conditions established by the Transparency Portal of Civil Registry to classify deaths.
Order | Condition |
1 | If there is any mention of COVID-19a in the death certificate, suspected or confirmed, it was considered a death attributed to COVID-19. |
2 | If there is any mention of severe acute respiratory syndrome (SARS), it was considered the cause of death. |
3 | If there is any mention of pneumonia, it was considered the cause of death. |
4 | If respiratory failure is listed as the only cause, it was considered the cause of death. |
5 | If the certificate does not mention any of the above conditions, the cause of death was considered as “other”. |
aCOVID-19: coronavirus disease.
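The ordered conditions in the table above amount to a first-match rule. A minimal sketch follows, assuming each certificate has already been reduced to a set of lowercase condition labels; the label names and the function are illustrative, not the portal's implementation.

```python
def classify_death(mentions):
    """Return the cause category for the set of conditions listed on a certificate.

    `mentions` is a set of lowercase labels (hypothetical encoding).
    Rules are checked in the order given by the portal, and the first match wins.
    """
    if "covid-19" in mentions:               # rule 1: suspected or confirmed
        return "COVID-19"
    if "sars" in mentions:                   # rule 2
        return "SARS"
    if "pneumonia" in mentions:              # rule 3
        return "pneumonia"
    if mentions == {"respiratory failure"}:  # rule 4: listed as the only cause
        return "respiratory failure"
    return "other"                           # rule 5


example = classify_death({"pneumonia", "sars"})
```

Because the rules are evaluated in order, a certificate mentioning both pneumonia and SARS is counted once, as SARS.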
International Statistical Classification of Diseases and Related Health Problems–10th Revision (ICD-10) classification adopted by the Transparency Portal of Civil Registry.
Disease | ICD-10 classification |
Severe acute respiratory syndrome (SARS) | I260, U04, J22, J100, J110 |
Pneumonia | J12, J13, J14, J15, J16, J180, J181, J182, J188, J189, B953, B960, B961 |
Respiratory failure | J96 |
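The ICD-10 table above can be applied to individual records as a lookup. The sketch below assumes that codes are matched by prefix after removing the dot (so that J96 covers J96.0, J96.1, and so on); the text does not state how partial codes were matched, so this is an assumption.

```python
# Code groups copied from the classification table; matching by prefix is an assumption.
ICD10_GROUPS = {
    "SARS": ("I260", "U04", "J22", "J100", "J110"),
    "pneumonia": (
        "J12", "J13", "J14", "J15", "J16",
        "J180", "J181", "J182", "J188", "J189",
        "B953", "B960", "B961",
    ),
    "respiratory failure": ("J96",),
}


def icd10_group(code):
    """Map an ICD-10 code such as 'J18.9' to one of the study's disease groups."""
    normalized = code.replace(".", "").upper()
    for group, prefixes in ICD10_GROUPS.items():
        if normalized.startswith(prefixes):  # str.startswith accepts a tuple
            return group
    return "other"
```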
The models used for time-series prediction were adjusted to predict the expected number of deaths for 2020 based on a historical series from 2010 to 2018 for six capital cities. In order to conduct the experiment, training based on the modular regression model FbProphet [
where, according to the model by Harvey and Peters [
The main component of equation 1,
where k is the growth rate, δ is a vector containing adjustments to the growth rate, m is used as an offset parameter, and γ is used as an adjustment vector for the parameter m. The vector
As previously mentioned, component
In order to fit the model to the data, the time-series forecasting is treated as a curve-fitting problem, taking the data seasonalities and holiday effects into consideration [
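For reference, the decomposable model used by the Prophet library is commonly written as follows. The equations in this section were not recovered from the source, so the forms below follow the library's published formulation and are not a transcription of the authors' equations; the symbols $a(t)$, $N$, $P$, $a_n$, and $b_n$ come from that formulation.

```latex
% Additive decomposition: trend, seasonality, holiday effects, and error term
y(t) = g(t) + s(t) + h(t) + \varepsilon_t

% Piecewise linear trend with growth rate k, rate adjustments \delta,
% offset m, and offset adjustments \gamma; a(t) indicates active changepoints
g(t) = \left(k + a(t)^{\top}\delta\right) t + \left(m + a(t)^{\top}\gamma\right)

% Seasonality as a Fourier series with period P and N harmonics
s(t) = \sum_{n=1}^{N} \left( a_n \cos\frac{2\pi n t}{P} + b_n \sin\frac{2\pi n t}{P} \right)
```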
For this analysis, we used data on COVID-19–related deaths of the six capital cities with the highest number of deaths recorded by the civil registry website: Belém (capital of Pará), Fortaleza (capital of Ceará), Manaus (capital of Amazonas), Recife (capital of Pernambuco), Rio de Janeiro (capital of Rio de Janeiro), and São Paulo (capital of São Paulo).
Once the processing workflow and data cleaning are completed, it is possible to devise a system to predict trends in deaths caused by respiratory issues, as well as to predict the expected behavior of diseases for 2020. Based on the number of deaths per year for each disease for the capital cities under consideration, an estimate of deaths was calculated for normal conditions (ie, no pandemic). Thus, the difference between the number of expected cases for 2020 and recorded cases for 2020 was determined. Next, this extrapolation was added to the deaths reported for COVID-19, allowing us to estimate the actual number of deaths due to the pandemic. With this analysis, the actual cause of sudden increase in deaths, not only due to respiratory issues but also other deaths, could be estimated.
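The comparison between expected and recorded deaths reduces to two quantities: the absolute excess and the percentage increase over the forecast. A minimal sketch, using the São Paulo figures for death records that did not reference COVID-19 (7238 predicted, 9004 registered) reported in the results; the function names are illustrative.

```python
def excess_deaths(observed, expected):
    """Deaths registered beyond the forecast for a pandemic-free year."""
    return observed - expected


def percent_increase(observed, expected):
    """Relative increase of registered deaths over the forecast, in percent."""
    return 100.0 * (observed - expected) / expected


# São Paulo, death records not referencing COVID-19 (values taken from the results).
sao_paulo_excess = excess_deaths(9004, 7238)
sao_paulo_increase = percent_increase(9004, 7238)
```

The same two functions apply per disease and per city; the excess summed over all causes is the quantity that is later added to the official COVID-19 death count.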
We conducted an exploratory analysis of the data to evaluate patterns in the number of deaths during the pandemic. Subsequently, we employed a time-series model to estimate the number of incorrectly reported figures.
The historical series of deaths for 2010-2018 (extracted from SIM [
Recife, Belém, Fortaleza, São Paulo, and Rio de Janeiro also presented a significant increase in the number of deaths in 2020.
As previously mentioned, we observed a major discrepancy for SARS-related deaths for all cities. A sudden increase of 6991% (from 9.8 to 685) for SARS in Recife, for example, might be associated with errors in reporting. SARS, first detected in China in November 2002, is caused by a type of coronavirus called severe acute respiratory syndrome coronavirus (SARS-CoV), with symptoms similar to COVID-19, causing a severe respiratory viral infection [
Increases in the number of deaths due to respiratory failure and severe acute respiratory syndrome (SARS).
Mean (SD) for the historical series and percent increase/decrease of deaths caused by respiratory failure, pneumonia, severe acute respiratory syndrome (SARS), and other causes.
City | Measure | Respiratory failure | Pneumonia | SARS | Other causes |
Belém | Mean (SD) | 72.7 (15.62) | 180.6 (42.09) | 3.6 (2.54) | 491.1 (72.5) |
Belém | Increase/decrease (%) | +78 | +37 | +1900 | –1 |
Fortaleza | Mean (SD) | 144.3 (29.09) | 442.2 (147.90) | 17.9 (10.12) | 1474.6 (161.92) |
Fortaleza | Increase/decrease (%) | +35 | –1 | +553 | –11 |
Manaus | Mean (SD) | 66.8 (9.01) | 259.4 (41.80) | 9.0 (2.16) | 1162.6 (122.70) |
Manaus | Increase/decrease (%) | +283 | +192 | +5188 | +69 |
Recife | Mean (SD) | 207.5 (32.2) | 307.1 (39.3) | 9.8 (6.98) | 1963.2 (305.40) |
Recife | Increase/decrease (%) | –24 | –43 | +6991 | –25 |
Rio de Janeiro | Mean (SD) | 611.3 (108.85) | 1501.1 (166.94) | 22.7 (7.64) | 6065.1 (495.47) |
Rio de Janeiro | Increase/decrease (%) | +15 | +16 | +1701 | –5 |
São Paulo | Mean (SD) | 861.8 (59.12) | 2933.3 (247.11) | 70.4 (30.82) | 8418.1 (571.08) |
São Paulo | Increase/decrease (%) | +70 | –2 | +192 | +6 |
The exploratory analysis identified values that were much higher than the average of the historical series for registered deaths during the pandemic period. For this reason, in this section we further analyze the results obtained from the time-series models developed to compare the expected trend (predicted) and the actual trend.
We trained the time-series models with data from January 2010 to May 2019. The model was adjusted to individually predict the behavior of each of the three diseases and of deaths from other causes in each city.
To compute the error metrics, each model was initially trained using 7 years of data. A cross-validation process was then conducted for the remaining data for every 90-day cutoff at a 470-day horizon.
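The two error metrics reported for the cross-validation can be sketched as follows. The sample values are hypothetical, and skipping days with zero registered deaths in the percentage error is an assumption of this sketch; the text does not state how zero counts were handled.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between registered and predicted deaths."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """Average absolute error relative to the registered value, in percent.

    Days with zero registered deaths are skipped because the ratio is undefined
    (an assumption of this sketch).
    """
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


# Hypothetical daily death counts and forecasts.
actual = [10, 12, 8, 10]
predicted = [9, 12, 10, 11]
mae = mean_absolute_error(actual, predicted)
mape = mean_absolute_percentage_error(actual, predicted)
```

Because the percentage error divides by the registered count, series with very small daily counts tend to show a high MAPE even when the MAE is small.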
The models were then used to predict data up to May 21, 2020, to be compared with the actual data presenting the observed anomalies.
Mean absolute error (MAE) and mean absolute percentage error (MAPE).
City | Metric | Respiratory failure | Pneumonia | SARSa | Other causes |
Belém | MAE | 1.61 | 2.64 | 0.40 | 5.15 |
Belém | MAPE | 9.6 | 11.6 | 33.5 | 8.3 |
Fortaleza | MAE | 1.81 | 2.34 | 0.57 | 6.59 |
Fortaleza | MAPE | 11.4 | 2.6 | 37.0 | 10.1 |
Manaus | MAE | 0.75 | 1.88 | 0.34 | 4.47 |
Manaus | MAPE | 14.0 | 10.0 | 28.4 | 8.3 |
Recife | MAE | 1.91 | 2.34 | 0.59 | 7.45 |
Recife | MAPE | 12.3 | 7.8 | 40.0 | 6.0 |
Rio de Janeiro | MAE | 2.77 | 4.99 | 0.50 | 13.38 |
Rio de Janeiro | MAPE | 6.7 | 6.8 | 25.8 | 5.2 |
São Paulo | MAE | 3.34 | 7.78 | 0.96 | 12.27 |
São Paulo | MAPE | 2.4 | 3.1 | 36.4 | 2.3 |
aSARS: severe acute respiratory syndrome.
Predicted and actual deaths per epidemiological week related to respiratory diseases. COVID-19: coronavirus disease.
Taking into account the peak periods for each city, predicted figures are smaller than the actual values in terms of the days with a high number of deaths due to respiratory and other causes. The estimates of errors in death reports for each disease, per city, are shown in
Each city, with its own particularities (
The predicted values show different increases for the investigated cities. For São Paulo, where the first COVID-19 death confirmed by the Brazilian government occurred in the 11th week, the increase was 24.4% (from 7238 to 9004). For the other cities, the following increases were observed: 144.7% (from 1274 to 3117) for Manaus, 128.9% (from 575 to 1317) for Recife, 99.6% (from 485 to 968) for Belém, 41.2% (from 1279 to 1806) for Fortaleza, and 39.9% (from 3475 to 4863) for Rio de Janeiro. These percentages refer to the increase in death records that did not reference COVID-19. Thus, during the epidemic period there was a significant increase in deaths attributed to other causes, deviating from the expected pattern.
The discrepancy is clearly very large, in terms of percentage values, with respect to the reports on deaths due to diseases considered in this research and other causes, especially SARS, which reported an increase of around 5820% (from 8.04 to 476) in Manaus and 2880% (from 23.32 to 695) in Recife.
Estimated number of deaths wrongfully attributed to respiratory system diseases for the considered periods. SARS: severe acute respiratory syndrome.
Difference (∆) between real and predicted values.
Cities (epidemiological weeks) | Measure | ∆ Deaths: respiratory failure | ∆ Deaths: pneumonia | ∆ Deaths: SARSa | ∆ Deaths: other causes | ∆ Total deaths |
Belém | Difference | 88 | 90 | 178 | 125 | 481 |
Belém | Increase (%) | 127.24 | 60 | 1715.56 | 49.41 | 99.61 |
Fortaleza | Difference | 69 | 52 | 157 | 248 | 526 |
Fortaleza | Increase (%) | 52.69 | 23.47 | 815.20 | 27.33 | 41.17 |
Manaus | Difference | 220 | 477 | 467 | 678 | 1842 |
Manaus | Increase (%) | 611.11 | 196.12 | 5820.40 | 68.75 | 144.74 |
Recife | Difference | 18 | 11 | 671 | 41 | 741 |
Recife | Increase (%) | 25 | 35.48 | 2880.27 | 9.13 | 128.92 |
Rio de Janeiro | Difference | 319 | 483 | 391 | 194 | 1387 |
Rio de Janeiro | Increase (%) | 96.28 | 42.49 | 2284.8 | 9.7 | 39.96 |
São Paulo | Difference | 851 | 301 | 274 | 339 | 1765 |
São Paulo | Increase (%) | 91.15 | 24.31 | 493.53 | 6.77 | 24.40 |
It is reasonable to assume that the values presented in
Therefore, the extrapolated (period not covered in the historical series) values of the number of deaths were attributed to the underreporting of the pandemic.
For the cities of this case study, an average underreporting of 40.7% is estimated for deaths related to COVID-19. The values vary between 25.9% and 62.7%: Manaus had the highest share of underreported deaths (62.7%), followed by Recife, with almost 50%. Fortaleza had the lowest share, at 25.9%, although its absolute count is still substantial.
Underreported deaths due to coronavirus disease (COVID-19).
City | Population, N (PNADa) | Extrapolated number of predicted deaths | Official number of deathsb | Total number of estimated deaths | Number of deaths per 1 million inhabitants | Underreported deaths (%) |
Belém | 1,492,745 | 481 | 952 | 1433 | 959.98 | 33.57 |
Fortaleza | 2,669,342 | 526 | 1503 | 2029 | 760.11 | 25.92 |
Manaus | 2,182,763 | 1842 | 1094 | 2936 | 1345.08 | 62.74 |
Recife | 1,645,727 | 739 | 747 | 1486 | 902.94 | 49.73 |
Rio de Janeiro | 6,718,903 | 1387 | 2376 | 3763 | 560.06 | 36.86 |
São Paulo | 12,252,023 | 1765 | 3238 | 5003 | 408.34 | 35.28 |
aPNAD: Pesquisa Nacional por Amostra de Domicílios (National Household Sample Survey).
bAs of May 21, 2020,
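The derived columns of the table above can be reproduced from its first three numeric columns. The sketch below assumes that the underreported share is the extrapolated deaths divided by the total estimated deaths, which matches the tabulated percentages; the variable and function names are illustrative.

```python
# city: (population, extrapolated deaths, official COVID-19 deaths), copied from the table
CITIES = {
    "Belém": (1_492_745, 481, 952),
    "Fortaleza": (2_669_342, 526, 1503),
    "Manaus": (2_182_763, 1842, 1094),
    "Recife": (1_645_727, 739, 747),
    "Rio de Janeiro": (6_718_903, 1387, 2376),
    "São Paulo": (12_252_023, 1765, 3238),
}


def summarize(population, extrapolated, official):
    """Return (total estimated deaths, deaths per million, underreported %)."""
    total = extrapolated + official
    per_million = 1e6 * total / population
    underreported_pct = 100.0 * extrapolated / total
    return total, per_million, underreported_pct


rows = {city: summarize(*values) for city, values in CITIES.items()}
average_underreporting = sum(r[2] for r in rows.values()) / len(rows)
```

The unweighted mean of the six city percentages is about 40.68%, the average reported in the text; a death-weighted average would give a different figure.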
The National Household Sample Survey (Pesquisa Nacional por Amostra de Domicílios, PNAD) of the IBGE compiles data based on the socioeconomic characteristics of the Brazilian population [
São Paulo, for example, ended up with the lowest number of deaths relative to its population and the smallest percentage difference in total deaths for the period not covered in the historical series (
In a recent study, EPICOVID19-BR, carried out by the Federal University of Pelotas (UFPel) [
In the context of EPICOVID19-BR, fatality rates were estimated using the total deaths predicted, along with the official figures of infections and the number of infections estimated by UFPel [
Another relevant study, from Imperial College [
From the several fatality rates investigated (up to the time this study was conducted), and considering the main countries affected by the pandemic and number of predicted deaths in our research, it is possible to estimate the number of infected cases and consequently estimate the percentage of underreporting of infected cases.
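The estimate described above divides the total estimated deaths by an assumed fatality rate. A minimal sketch using the four country-level rates that appear with explicit values in this study and Belém's 1433 total estimated deaths; the underreporting formula (estimated minus official, relative to official) is an assumption, and the official case count in the usage line is hypothetical.

```python
# Fatality rates quoted with explicit values in the study.
FATALITY_RATES = {
    "China": 0.0138,
    "Brazil": 0.066,
    "United States": 0.06,
    "Global": 0.065,
}


def estimated_infections(deaths, fatality_rate):
    """Number of infections implied by a death count and a fatality rate."""
    return round(deaths / fatality_rate)


def underreported_cases_pct(estimated, official):
    """Excess of estimated over official cases, in percent of official cases (assumed definition)."""
    return 100.0 * (estimated - official) / official


# Belém: 1433 total estimated deaths.
belem = {name: estimated_infections(1433, rate) for name, rate in FATALITY_RATES.items()}

# Hypothetical official count, only to show the call.
example_pct = underreported_cases_pct(estimated=300, official=100)
```

A lower assumed fatality rate implies more infections for the same number of deaths, which is why the estimates span roughly a factor of five across rates.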
Depending on how high or low the fatality ratio is, there is variation in the number of infected cases. For example, as seen in
Based on these differing fatality rates, underreported infection numbers may be monumental. For example, underreporting of infected cases in Manaus (using the fatality ratio from the Imperial College study [
There were 739,503 confirmed cases and 38,406 official deaths, as of June 9, 2020 [
Estimated number of infection cases and percentage of cases underreported considering differing estimations in fatality rate.
Cities | Measure | Official counta | Fatality rate | | | | | |
 | | | UFPelb [ | Imperial College [ | China (1.38%) [ | Brazil (6.6%) | United States (6%) | Global (6.5%) |
Belém | Infections, n | | 225,404 | 159,222 | 103,841 | 21,712 | 23,883 | 22,046 |
Belém | Underreported (%) | | 2837 | 1975 | 1253 | 183 | 211 | 187 |
Fortaleza | Infections, n | | 232,233 | 184,455 | 147,029 | 30,742 | 33,817 | 31,215 |
Fortaleza | Underreported (%) | | 1146 | 889 | 689 | 65 | 81 | 67 |
Manaus | Infections, n | | 272,845 | 367,000 | 212,754 | 44,484 | 48,933 | 45,170 |
Manaus | Underreported (%) | | 2115 | 2880 | 1627 | 261 | 297 | 267 |
Recife | Infections, n | | 52,663 | 135,091 | 107,681 | 22,515 | 24,767 | 22,861 |
Recife | Underreported (%) | | 355 | 1066 | 830 | 94 | 114 | 97 |
Rio de Janeiro | Infections, n | | 147,816 | 470,375 | 272,681 | 57,015 | 62,717 | 57,892 |
Rio de Janeiro | Underreported (%) | | 689 | 2410 | 1355 | 204 | 235 | 209 |
São Paulo | Infections, n | | 379,813 | 714,714 | 362,536 | 75,803 | 83,383 | 76,969 |
São Paulo | Underreported (%) | | 816 | 1624 | 775 | 83 | 101 | 86 |
aAs of May 21, 2020.
bUFPel: Federal University of Pelotas.
Regarding the number of people infected, based on the total number of deaths previously estimated (40.7% underreporting, 64,746 deaths), it can be inferred that Brazil’s infection count ranges between 981,013 and 5,395,571 (considering the highest and lowest lethality rates, 6.6% and 1.2%, respectively [
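The national range can be approximated from figures given in the text. The sketch below assumes that the total number of deaths is obtained by dividing the 38,406 official deaths by one minus the average underreported share (40.68%); this reconstruction is an assumption, and it yields values close to, but not exactly equal to, the reported 64,746 deaths and the reported range, presumably because of rounding.

```python
def total_deaths_from_underreporting(official_deaths, underreported_fraction):
    """Total deaths if `underreported_fraction` of all deaths are missing from official counts."""
    return official_deaths / (1.0 - underreported_fraction)


# Official deaths as of June 9, 2020, and the average underreported share from the case study.
estimated_deaths = total_deaths_from_underreporting(38_406, 0.4068)

# Infection range implied by the highest (6.6%) and lowest (1.2%) lethality rates.
infections_low = estimated_deaths / 0.066
infections_high = estimated_deaths / 0.012
```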
When comparing both countries, the United States currently performs more tests for the disease than any other country in the world [
It is also worth considering the tendency to flatten the evolution curve of COVID-19, which represents the reduction in the number of daily new cases. We compared the evolution of weekly confirmed cases in the United States and Brazil up to June 9th. The reduction in the number of occurrences in the United States indicates that the curve is flattening. In contrast, the number of weekly confirmed cases in Brazil is still increasing. This ascending curve indicates that the pandemic is still growing and that, based on official numbers, Brazil tends to surpass the number of infected Americans in the near future. If we consider the highest lethality rates presented in this work, the actual number of infected Brazilian citizens would have already surpassed that of the United States.
The significant rates of underreporting of deaths presented in our research indicate that the counts released by the official Brazilian internet portals are much lower than the actual numbers, making it impossible for the authorities to take more effective action. This is also confusing to citizens, many of whom have failed to comply with social isolation measures. Therefore, a public access portal is being developed to disseminate more realistic and reliable data on the pandemic, resolve the contradictions in official data, guide the population, support policy formulation, and estimate the R factor more efficiently.
Our results suggest a growing pandemic and reveal a wide heterogeneity in the outbreak of the epidemic in the cities considered in this case study, suggesting greater underreporting of deaths and infected cases in some cities. This indicates that the outbreak is at different stages, more advanced in some cities than in others. However, in no city do the results indicate that herd immunity is close to being achieved. In addition, the underreporting of deaths is not stationary over time and may increase during the pandemic period.
The number of deaths due to SARS was considerably higher than the expected number for all six cities, indicating that a large number of deaths related to COVID-19 were possibly mistakenly recorded as SARS. It is assumed that this is due to lack of confirmation and delays in testing or confusion in diagnosis, since COVID-19 is a new disease. Furthermore, delays in disclosing test results also impact the effect and reach of the pandemic. Therefore, it is of paramount importance to increase testing in order to reduce underreporting and encourage rapid dissemination of test results to allow for a closer view of the real COVID-19 situation in Brazil.
COVID-19: coronavirus disease
DATASUS: Department of Informatics of the Unified Healthcare System
GDP: gross domestic product
HDI: Human Development Index
IBGE: Brazilian Institute of Geography and Statistics
ICD-10: International Statistical Classification of Diseases and Related Health Problems–10th Revision
PNAD: Pesquisa Nacional por Amostra de Domicílios (National Household Sample Survey)
R: reproduction number
RT-PCR: reverse transcription polymerase chain reaction
SARS: severe acute respiratory syndrome
SARS-CoV: severe acute respiratory syndrome coronavirus
SARS-CoV-2: severe acute respiratory syndrome coronavirus 2
SIM: Mortality Information System
SUS: Sistema Único de Saúde (Unified Healthcare System)
UFPel: Federal University of Pelotas
WHO: World Health Organization
The authors would like to thank CEPID CeMEAI and FAPESP (process 2013/07375-0) for supporting this work.
None declared.