This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
SARS-CoV-2, the novel coronavirus that causes COVID-19, has caused a global pandemic with higher mortality and morbidity than any other virus in the last 100 years. Without public health surveillance, policy makers cannot know where and how the disease is accelerating, decelerating, and shifting. Unfortunately, existing models of COVID-19 contagion rely on parameters such as the basic reproduction number and use static statistical methods that do not capture all the relevant dynamics needed for surveillance. Existing surveillance methods use data that are subject to significant measurement error and other contaminants.
The aim of this study is to provide a proof of concept of the creation of surveillance metrics that correct for measurement error and data contamination to determine when it is safe to ease pandemic restrictions. We applied state-of-the-art statistical modeling to existing internet data to derive the best available estimates of the state-level dynamics of COVID-19 infection in the United States.
Dynamic panel data (DPD) models were estimated with the Arellano-Bond estimator using the generalized method of moments. This statistical technique enables control of various deficiencies in a data set. The validity of the model and statistical technique was tested.
A Wald chi-square test of the explanatory power of the statistical approach indicated that it is valid (χ^{2}_{10}=1489.84,
DPD models successfully correct for measurement error and data contamination and are useful to derive surveillance metrics. The opening of America involves two certainties: the country will be COVID-19–free only when there is an effective vaccine, and the “social” end of the pandemic will occur before the “medical” end. Therefore, improved surveillance metrics are needed to inform leaders of how to open sections of the United States more safely. DPD models can inform this reopening in combination with the extraction of COVID-19 data from existing websites.
The SARS-CoV-2 pandemic is unprecedented [
Public health surveillance is defined as the “ongoing systematic collection, analyses, and interpretation of outcome-specific data for use in the planning, implementation and evaluation of public health practice [
The conventional approach to modeling the spread of diseases such as COVID-19 is to posit an underlying contagion model [
In contrast to previous studies, we used an empirical approach that focuses on statistical modeling of widely available empirical data, such as the number of confirmed cases or the number of tests, which can inform estimates of the current values of critical parameters such as the infection rate or reproduction rate. We explicitly recognized that the data-generating process for the reported data contains an underlying contagion component; a politico-economic component, such as availability of accurate test kits; a social component, such as how strongly people adhere to social distancing measures, mask requirements, and shelter-in-place policies; and a sometimes inaccurate data reporting process that may obscure the underlying contagion process. Therefore, we sought to develop a statistical approach that can provide meaningful information despite the complex and sometimes obfuscating data generation process. Our approach is consistent with the principles of evidence-based medicine, including controlling for complex pathways that may include socioeconomic factors such as mediating variables and policy recommendations, and “based on the best available knowledge, derived from diverse sources and methods [
There are two primary advantages to this empirical approach. First, we can apply the empirical model relatively quickly to a short data set. This advantage stems from the panel nature of the model. We used US states as the cross-sectional variable; therefore, one week of data from 52 states and territories (including Puerto Rico and the District of Columbia) provides a reasonable sample size. In addition to enabling parameter estimation early in a pandemic, using this property, we tested to see if a shift had occurred in the infection or reproduction rates of the contagion process in the past week (ie, whether there is statistical evidence that reopening is associated with an acceleration in the number of cases).
The second advantage of our approach is that it directly measures and informs policy-relevant variables. For example, the White House issued guidance on reopening the US economy that depends on a decrease in the documented number of cases and in the proportion of positive test results over a 14-day period, among other criteria and considerations [
Herein, we proceed with a brief discussion of the contagion models that informed our selection of an empirical model. We describe the basic dynamic panel data (DPD) approach and its advantages for analyzing the current pandemic. We obtained results that validate the model specification, which is a necessary and important step in the development of a surveillance system [
Transmission models are typically population-based differential equations of the form
where
The availability of state-level data suggests that Equation 1 can be rewritten in panel regression form as
The additional index
We apply the dynamic panel data approach to the number of positive test results per day as reported on internet sites. To avoid imposing too much specificity, we allowed for some flexibility in the functional form by including the number of tests both linearly and quadratically and as a proportion of the population:
where
Case and test data, including the total number of tests administered and the number of positive results, were taken from the COVID Tracking Project [
There are three problems with the specification of Equation 3 for estimation purposes. First, the inclusion of lagged dependent variables on the right side means that the errors are autocorrelated and that the usual exogeneity restrictions are violated; therefore, least squares estimates are inappropriate. Second, some variables are omitted, such as all the variables represented in extensions of the SIR model, and other variables that represent socioeconomic factors influencing the contagion, testing, and reporting processes may also have been omitted. Third, our data set has a relatively short time duration, and the asymptotic properties of fixed-effects or random-effects panel data estimators such as statistical efficiency or normality apply as
Fortunately, DPD methods can be used to specifically resolve these statistical problems [
First, the GMM approach is asymptotically efficient; however, it also has good small-sample properties, including samples with a large cross-section and a small number of time periods [
Second, this approach is robust to omitted variables because of its reliance on identifying restrictions and instrumental variables. This is important because we estimate a relatively sparse model that does not include direct controls for mediating factors, data collection issues, or reporting idiosyncrasies.
Third, the approach includes statistical testing of the overidentifying restrictions (ie, whether the empirical model and estimation technique are statistically valid). For this test, we used the Sargan chi-square test.
Fourth, this approach corrects for autocorrelation.
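The logic behind instrumenting a differenced dynamic panel can be illustrated on simulated data. The sketch below is not the Arellano-Bond GMM estimator used in this study; it is the simpler Anderson-Hsiao instrumental variable estimator, which relies on the same idea: first-differencing removes the unobserved unit effect, and the second lag of the level serves as an instrument for the lagged difference. The function names, the simulated data, and all parameter values (5000 units, 8 periods, a true persistence of 0.5) are assumptions chosen purely for illustration.

```python
import numpy as np

def simulate_panel(n_units=5000, n_periods=8, rho=0.5, burn_in=50, seed=0):
    """Simulate y_it = alpha_i + rho * y_i,t-1 + e_it; returns an (n_units, n_periods) array."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(0.0, 0.5, size=n_units)   # unobserved unit-specific effects
    y = alpha / (1.0 - rho)                      # start each unit at its long-run mean
    columns = []
    for t in range(burn_in + n_periods):
        y = alpha + rho * y + rng.normal(0.0, 1.0, size=n_units)
        if t >= burn_in:                         # discard the burn-in periods
            columns.append(y)
    return np.column_stack(columns)

def difference_estimators(y):
    """Return (OLS, IV) estimates of rho from the first-differenced equation."""
    dy = y[:, 2:] - y[:, 1:-1]        # current difference, delta y_it
    dy_lag = y[:, 1:-1] - y[:, :-2]   # lagged difference, delta y_i,t-1
    z = y[:, :-2]                     # instrument: the level y_i,t-2
    # OLS on differences is biased because dy_lag is correlated with the differenced error.
    ols = np.sum(dy_lag * dy) / np.sum(dy_lag ** 2)
    # The instrument y_i,t-2 is uncorrelated with the differenced error, so IV is consistent.
    iv = np.sum(z * dy) / np.sum(z * dy_lag)
    return ols, iv

ols_rho, iv_rho = difference_estimators(simulate_panel())
print(f"true rho = 0.50, differenced OLS = {ols_rho:.2f}, IV = {iv_rho:.2f}")
```

In this simulated setting, least squares on the differenced data is pulled well below the true persistence, whereas the instrumented estimate should land near it. The GMM approach extends this idea by using many lags as instruments at once, which is what makes a test of overidentifying restrictions possible.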
A significant drawback to DPD methods is that they are computationally complex and become very time- and resource-intensive as the number of observations grows.
We used the Arellano-Bond estimation technique developed specifically for DPD applications. We implemented the Arellano-Bond technique using the
To validate the significance of the regression, we used a Wald chi-square statistic to test the null hypothesis that the independent variables did not explain the dependent variable (standard goodness-of-fit measures such as
We report the point estimates and the
We translated the estimation results into a surveillance reporting context. The dynamic component (Equation 3) is presented in terms of the persistence rate per 10,000 cases, defined as the number of new COVID-19 cases in every 10,000 cases that remained constant, and this component was applied to the reported infection numbers to determine its effect on the number of cases per state per day. The contemporaneous component was applied to the reported infection numbers to determine its effect on the number of cases per state per day. The two effects were added to obtain a modeled total number of cases per state per day, and this number was multiplied by 52 to obtain a national figure (including the District of Columbia and Puerto Rico but excluding other territories).
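The translation described above can be written compactly. The following is a sketch only: the symbols are assumptions introduced here, inferred from the variables listed in the estimation results rather than taken from the authors' equations, and assigning the constant to the contemporaneous component is likewise an assumption. Here y_{it} is the number of positive cases in state i on day t, T_{it} is the number of tests, N_i is the state population, and w indexes the week to which a shift coefficient applies.

```latex
\hat{y}_{it} =
  \underbrace{(\beta_{1} + \delta_{1w})\, y_{i,t-1} + (\beta_{7} + \delta_{7w})\, y_{i,t-7}}_{\text{dynamic component}}
  + \underbrace{\alpha + \gamma_{1} T_{it} + \gamma_{2} T_{it}^{2} + \gamma_{3}\, \frac{T_{it}}{N_{i}}}_{\text{contemporaneous component}}

\text{daily persistence rate per 10{,}000 cases} = 10^{4}\,(\beta_{1} + \delta_{1w}),
\qquad
\text{national estimate} = 52 \times \overline{\hat{y}}_{t}
```

Under this reading, a persistence rate is a lag coefficient plus the relevant weekly shift, rescaled to a per-10,000 basis, and the national figure is the state average scaled by the 52 panels.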
The internet data mining effort resulted in a panel (longitudinal data set) with 52 “panels” (50 states, the District of Columbia, and Puerto Rico) using observations from June 13 through July 10, 2020. Before the analysis, outlying and negative values were cross-checked with other reputable COVID-19 data tracking websites, including USA Facts [
We present the estimation results in
Arellano-Bond dynamic panel data modeling of the number of daily infections by state from March 20 to July 10, 2020.
| Estimation | Coefficient | P value |
| --- | --- | --- |
| Lagged daily positive cases | 0.0630 | .31 |
| Lagged daily positive shift, June 27-July 3 | 0.0977 | .14 |
| Lagged daily positive shift, July 4-10 | –0.1727 | .009 |
| Seven-day lagged daily positive cases | 0.5188 | <.001 |
| Seven-day lagged daily positive shift, June 27-July 3 | 0.0118 | .90 |
| Seven-day lagged daily positive shift, July 4-10 | 0.2691 | .002 |
| Constant | 17.7791 | .68 |
| Daily tests | 0.0520 | <.001 |
| Daily tests squared | –1.54 × 10^{-7} | .002 |
| Daily tests / population | –86,527 | <.001 |
| Wald test of regression significance (χ^{2}_{10}) | 1489.84 | <.001 |
| Sargan test of overidentifying restrictions (χ^{2}_{946}) | 935.52 | .59 |
| Test of lagged daily positive cases + shift July 4-10 = 0 (χ^{2}_{1}) | –9.92 | .002 |
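The P values reported for the Wald and Sargan statistics can be checked from the chi-square statistics and their degrees of freedom. The sketch below uses the Wilson-Hilferty normal approximation to the chi-square distribution so that it needs only the Python standard library; the approximation and the function name are choices made for this illustration and are not the procedure used by the estimation software.

```python
import math

def chi2_sf_wilson_hilferty(stat, df):
    """Approximate P(X >= stat) for X ~ chi-square(df) with the Wilson-Hilferty transform."""
    # The cube root of (X / df) is approximately normal with
    # mean 1 - 2/(9 df) and variance 2/(9 df).
    z = ((stat / df) ** (1.0 / 3.0) - (1.0 - 2.0 / (9.0 * df))) / math.sqrt(2.0 / (9.0 * df))
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

wald_p = chi2_sf_wilson_hilferty(1489.84, 10)     # regression significance
sargan_p = chi2_sf_wilson_hilferty(935.52, 946)   # overidentifying restrictions
print(f"Wald P = {wald_p:.3g}, Sargan P = {sargan_p:.2f}")
```

The two tests are read in opposite directions. A very small Wald P value rejects the hypothesis that the regressors have no explanatory power, whereas a large Sargan P value means that the overidentifying restrictions are not rejected, which supports the validity of the instruments.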
To examine the model fit, we applied a Wald chi-square test of the null hypothesis that there is no explanatory power in the explanatory variables. The model was statistically significant (χ^{2}_{10}=1489.84,
The coefficient on the lagged dependent variable of the number of daily cases that tested positive on the previous day was positive and statistically significant (0.0630,
The coefficient on the 7-day lagged dependent variable, the number of daily cases that tested positive 7 days earlier, was positive and statistically significant (0.5188,
The coefficient on the linear term in the number of daily tests administered was positive and statistically significant (0.0520,
Dynamic panel data estimation results for the United States from June 27 to July 10, 2020.
| Variable | June 27-July 3, state average | June 27-July 3, national average | July 4-10, state average | July 4-10, national average |
| --- | --- | --- | --- | --- |
| Daily average number of cases for the week | 909 | 47,278 | 1048 | 54,491 |
| Daily average number of tests for the week | 12,281 | 638,619 | 12,630 | 656,741 |
| Estimated daily persistence rate (per 10,000 cases) | 1607 | 1607 | –1106 | –1106 |
| Estimated 7-day persistence rate (per 10,000 cases) | 5306 | 5306 | 7816 | 7816 |
| Estimated dynamic component (number of cases per day) | 499 | 25,968 | 595 | 30,923 |
| Estimated contemporaneous component (number of cases per day) | 466 | 24,254 | 490 | 25,464 |
| Total number of estimated cases per day | 966 | 50,221 | 1084 | 56,387 |
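As an arithmetic illustration, the persistence rates for the week of June 27-July 3 can be recovered from the reported coefficients by adding the base lag coefficient to that week's shift coefficient and rescaling to a per-10,000 basis. The sketch below assumes that this is how the rates were derived; the variable and function names are hypothetical.

```python
# Coefficients as reported in the estimation results.
LAG1, LAG1_SHIFT_JUN27 = 0.0630, 0.0977   # 1-day lag and its June 27-July 3 shift
LAG7, LAG7_SHIFT_JUN27 = 0.5188, 0.0118   # 7-day lag and its June 27-July 3 shift
N_PANELS = 52                             # 50 states, the District of Columbia, and Puerto Rico

def persistence_per_10k(base, shift):
    """Persistence rate per 10,000 cases: base coefficient plus the weekly shift."""
    return round((base + shift) * 10_000)

daily_rate = persistence_per_10k(LAG1, LAG1_SHIFT_JUN27)
weekly_rate = persistence_per_10k(LAG7, LAG7_SHIFT_JUN27)

# Scale the state-average modeled total (966 cases per day) to a national figure.
national_cases = 966 * N_PANELS

print(daily_rate, weekly_rate, national_cases)
```

The two rates agree with the tabulated values of 1607 and 5306. The scaled national total differs slightly from the tabulated 50,221, as would be expected when a rounded state average is multiplied by 52. Applying the same arithmetic to the July 4-10 shift coefficients yields values close to, but not identical with, the tabulated rates for that week.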
Our primary findings are that the 7-day persistence rate is statistically significant and important in magnitude and that the 7-day persistence rate increased by almost 50% from the week of June 27-July 3 to the week of July 4-10 (
The coefficients on the daily lagged dependent variable are small in magnitude and do not indicate strong day-to-day persistence. The negative estimated daily persistence rate for the week of July 4 is indicative of a daily “snaggletooth” pattern in the number of daily cases at the state level. This simply indicates that a low number of cases on one day is offset by a high number of cases the next day, probably due to reporting delays and differential testing periods; this pattern appears slightly in the US aggregate data and is strongly evident in the California data. Other states exhibited different snaggletooth patterns, including high-incidence states such as Florida, Texas, and Georgia.
The contemporaneous component of the model contributed positively to the number of new daily cases but did not change significantly over the sample period.
While DPD is useful in deriving dynamic estimates of the rate of transmission of COVID-19, static numbers using traditional surveillance tools must also be included to obtain a complete understanding of the pandemic.
The DPD model is a statistically validated analysis of reported COVID-19 data and an important addition to the epidemiological toolkit for understanding the progression of the pandemic. It is important to recognize that this is a supplementary tool that does not replace detailed contagion modeling with detailed and specific data for accurate representation of contagion model parameters. However, there are four salient advantages of the DPD approach. First, this approach enables statistically efficient extraction of information from existing data sets, including statistical validation of results; therefore, it is applicable to the most commonly tracked and reported data in the current pandemic. Second, the tool could be applied relatively quickly after the pandemic started because of its ability to model reported data rather than detailed contact tracing data, which is largely unavailable to date. That is, changes in the evolution of the pandemic can be confirmed much more quickly using panel data than using aggregate data. Third, this approach informs real-time policy decisions, including decisions based on commonly reported data, such as reopening state economies. Fourth, the model results can help inform the parameterization of more traditional contagion models.
This model is consistent in that it shows a higher reproduction rate during the most recent 7 days; this confirms that in general, normal operation should not be resumed in the United States. Rather, empirically validated public health guidelines such as wearing masks, social distancing, social isolation, hand washing, and avoidance of social gatherings should be immediately adopted to reduce the contagion. In fact, White House guidelines recommend 14 sustained days of reduced COVID-19–related deaths, new infection cases, and proportions of positive test results prior to reopening. That threshold has not been met. While these findings reflect the national average, it is possible that some areas within the United States meet the White House guidelines, even though reopening is contraindicated in general.
The opening of America involves two certainties. First, the United States will be COVID-19–free only when there is an effective vaccine. While scientists are working at unprecedented speed worldwide to develop a SARS-CoV-2 vaccine [
DPD: dynamic panel data
GMM: generalized method of moments
R0: basic reproduction number
SIR: susceptible-infected-recovered
The opinions expressed herein are those of the author(s) and do not necessarily reflect the views of the US Agency for International Development.
None declared.