This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Missing data is a common nuisance in eHealth research: it is hard to prevent and may invalidate research findings.
In this paper several statistical approaches to data “missingness” are discussed and tested in a simulation study. Basic approaches (complete case analysis, mean imputation, and last observation carried forward) and advanced methods (expectation maximization, regression imputation, and multiple imputation) are included in this analysis, and strengths and weaknesses are discussed.
The dataset used for the simulation was obtained from a prospective cohort study following participants in an online self-help program for problem drinkers. It contained 124 nonnormally distributed endpoints, that is, daily alcohol consumption counts of the study respondents. Missingness at random (MAR) was induced in a selected variable for 50% of the cases. Validity, reliability, and coverage of the estimates obtained using the different imputation methods were calculated by performing a bootstrapping simulation study.
In the performed simulation study, the use of multiple imputation techniques led to accurate results. Differences were found between the 4 tested multiple imputation programs: NORM, MICE, Amelia II, and SPSS MI. Among the tested approaches, Amelia II outperformed the others, led to the smallest deviation from the reference value (Cohen’s
The use of multiple imputation improves the validity of the results when analyzing datasets with missing observations. Some of the often-used approaches (LOCF, complete case analysis) did not perform well, and we therefore recommend against using them. Recent versions of some widely used statistical software programs increasingly support the analysis of multiply imputed datasets, making multiple imputation more readily available to less mathematically inclined researchers.
Missing data is a common nuisance in eHealth research [
As dropout rates in eHealth studies tend to be relatively high and are even considered typical by some, addressing data missingness and dropout is of great importance. The observation that in any eHealth trial a substantial proportion of users drop out before completion has been called the “Law of Attrition” [
The primary concern when facing substantive missingness is that a study with high attrition rates may yield biased estimates (of the mean, for example) caused by a biased sample. Patients that leave studies prematurely have been shown to be more likely to be involved in drug use or deviant behavior [
In short, four key reasons for the use of missing data approaches should be recognized: (1) Missing data may compromise randomization integrity in randomized clinical trials, as drop-out rates may differ over the trial arms. (2) In all longitudinal study designs, missing data may introduce selection bias, as is made clear in the previous section. (3) An intention to treat analysis—as is requested in the consolidated standards of reporting trials (CONSORT) statement and in most other guidelines for the analysis of randomized (controlled) clinical trials (RCTs)—is a necessary step when clinical endpoints are missing for some of the participants [
Remarkably, the problems encountered and the solutions implemented while solving missing data problems are rarely mentioned outside the statistical literature [
The aim of this paper is to provide a straightforward primer for eHealth researchers who seek solutions for missingness in datasets. To provide researchers with tools for working with data missingness, this paper reviews the strengths and weaknesses of the most common missing data approaches and tests the approaches in a simulation study. Theory on missingness patterns and the most widely used methods of handling missing data are comprehensively presented. The validity, reliability, and coverage of 9 different methods for dealing with incomplete datasets are presented. Some of these methods are relatively straightforward and basic, while others are more advanced and use computationally demanding algorithms to estimate missing values. Although the technical and mathematical details of the presented methods are outside the scope of this paper, those interested can consult with any of a number of references [
In general, 4 forms of missingness can occur in longitudinal studies: (1) In the case of initial nonresponse, no baseline data is collected for the participant, although follow-up measures may have been completed. (2) Loss to follow-up is the other way around: baseline data is collected, but (at a certain time point) the researchers fail to collect follow-up data. (3) Wave nonresponse is closely related to loss to follow-up in that data is not collected during one or more of the “waves,” but data are collected during earlier and later measurement waves. Missing data has to be interpolated if this form of missingness occurs. (4) The fourth form of missingness stems from item nonresponse. This occurs when a participant fails to respond to certain measures or questions, such as when some of the items from a questionnaire are skipped. For example, when the missing items are part of a highly correlated construct measurement (eg, one of the 16 items in a quality of life scale is missing), imputation is possible based on the other 15 collected item scores. In short, the selection of a missing data approach will in part depend on the form of missingness encountered. Although some of the presented methods may be efficacious at handling data problems, the most important determinant for preventing missing data values is to retain subjects in the study [
In general, 3 mechanisms of missingness are discerned: missing at random (MAR), missing completely at random (MCAR), and missing not at random (MNAR) [
Commonly, the probability that an observation is missing depends at least in part on information that is present: missingness is dependent on observed characteristics. This type of missing data generally is referred to as missingness at random or MAR [
Missing completely at random (MCAR) is a special case of MAR [
If the probability that an observation is missing depends on a factor that is unmeasured, or that is itself partly missing and therefore unavailable, or if the value of the observation predicts its own probability of missingness, the missing data pattern is called missing not at random or MNAR [
In general, there is no way to test whether MAR or MNAR holds in a dataset [
For MAR and MNAR, it should be recognized that patterns of missingness and the consequences for derived estimators are not solely a characteristic of the data, but a combination of the available data and the planned analysis. For example, if an unobserved or unmeasured variable (say, left or right handedness) is predictive of missingness but is not correlated with the endpoint of the study, then the resulting MNAR pattern does not lead to biased estimators (only to a loss of power). Another example is pointed out by Graham [
Over the last couple of decades, several methods for handling missingness have been developed. In this section, a number of these missing data approaches are presented. The approaches that are most useful and applied most often are described below [
Missing data approaches in this study
Approach | Description | Missingness Pattern | Type |
Complete cases | Only cases without missing observations in analysis | MCAR (a) | Basic, single |
Mean imputation | Imputes missing observations with listwise mean for each variable | MCAR (b) | Basic, single |
LOCF | Imputes the last available observation in the current data collection wave | - | Basic, single |
Regression imputation | Imputes missing observations by prediction based on other variables in a regression model | MAR, MCAR | Advanced, single |
EM imputation | Imputes missing observations using expectation maximization algorithm | MAR, MCAR | Advanced, single |
NORM | Multiple imputes missing observations under a normal model | MAR, MCAR | Advanced, multiple |
MICE | Multiple imputes missing observations using chained equations | MAR, MCAR | Advanced, multiple |
SPSS MI | Multiple imputes missing observations under a normal model in SPSS | MAR, MCAR | Advanced, multiple |
Amelia II | Multiple imputes missing observations using a bootstrapping-based algorithm | MAR, MCAR | Advanced, multiple |
a This approach will lead to unbiased point estimators (eg, means) under MCAR, but will result in lowered power and sample size.
b This approach will lead to unbiased point estimators (eg, means) under MCAR, but will result in biased, smaller confidence intervals.
The most popular missing data handling method is complete case analysis (casewise deletion). In complete case analysis, all cases with missing values are removed from the dataset before analysis. The method is straightforward to apply, but it assumes MCAR and will lead to biased results under other patterns of missingness. Even when the MCAR assumption holds, this method is not preferable, because the reduced number of cases used for the analysis leads to a loss of statistical power [
Listwise mean imputation, in which missing values of each variable are imputed with the arithmetic mean of the available observations for the variable, attempts to overcome the loss of power of complete case analysis. Like complete case analysis, listwise mean imputation assumes the MCAR missingness pattern, which is uncommon in empirical datasets with missing observations. If the data missingness pattern is not MCAR, imputing missing values with the listwise mean will result in a biased estimation of the mean. Under all missing data patterns (including MCAR), listwise mean imputation will reduce the variance of the variable, because imputed values equal to the mean do not contribute to the total variance. This leads to decreased standard errors and artificially small confidence intervals. Because listwise mean imputation fails to preserve the variance of the imputed variable, this method is considered by some to be one of the worst missing data approaches [
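The variance reduction described above can be made concrete with a minimal sketch. The drink counts below are hypothetical, and the helper names `complete_cases` and `mean_impute` are illustrative only; they are not taken from any of the software packages discussed in this paper.

```python
import statistics

def complete_cases(values):
    """Drop missing entries (None) and keep the observed ones."""
    return [v for v in values if v is not None]

def mean_impute(values):
    """Replace each missing entry (None) with the mean of the observed ones."""
    observed = complete_cases(values)
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical daily drink counts; None marks a missing observation.
drinks = [0, 2, None, 5, None, 1, 8, None, 3, 0]

observed = complete_cases(drinks)
imputed = mean_impute(drinks)

# The mean is unchanged by construction, but the spread shrinks because
# every imputed value sits exactly on the mean.
print(statistics.mean(observed), statistics.mean(imputed))
print(statistics.stdev(observed), statistics.stdev(imputed))
```

The sum of squared deviations is the same before and after imputation, while the number of cases grows, so the standard deviation (and with it the standard error) is necessarily smaller after mean imputation.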
The third most-often used method is last observation carried forward (LOCF). This approach is regularly used in epidemiological research, especially in clinical trials [
Regression imputation is the first of two “advanced” single imputation methods discussed in this paper. By adding randomly sampled “noise” from a normal distribution to a prediction model based on linear regression, the regression method imputes missing values based on the relations between variables in the dataset while preserving the variables’ variance. There is some discussion about the number of predictors that should be included in the model. In general, the use of more predictor variables in the regression equation is not necessarily better. A more parsimonious model, where only statistically significant predictors are retained, is usually a better model. However, it is important to keep in mind that two types of predictor variables should be retained in the model: those predicting the variable(s) with missing observations and those that predict missingness. The latter group of predictors helps to correct for the bias that differential dropout induces in the estimators. In theory, regression imputation is applicable under both MCAR and MAR missingness patterns.
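As an illustration of the idea, the following is a minimal sketch of regression imputation with a single predictor and added normal noise. It is not the implementation used in any statistical package; the function name `regression_impute`, the choice of one predictor, and the example data are assumptions made for this sketch.

```python
import random
import statistics

def regression_impute(x, y, seed=0):
    """Impute missing y (None) from x with a fitted line plus normal noise."""
    rng = random.Random(seed)
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    # Ordinary least squares slope and intercept on the complete pairs.
    sxx = sum((xi - mx) ** 2 for xi in xs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in pairs)
    slope = sxy / sxx
    intercept = my - slope * mx
    # The residual standard deviation sets the scale of the added noise.
    resid = [yi - (intercept + slope * xi) for xi, yi in pairs]
    sigma = (sum(r ** 2 for r in resid) / (len(pairs) - 2)) ** 0.5
    return [
        yi if yi is not None else intercept + slope * xi + rng.gauss(0, sigma)
        for xi, yi in zip(x, y)
    ]

# Hypothetical baseline (x) and follow-up (y) drink counts.
baseline = [1, 3, 4, 6, 8, 9, 2, 7]
followup = [0, 2, None, 5, 6, None, 1, None]
completed = regression_impute(baseline, followup)
```

Because the noise is drawn from a normal distribution, imputed values can be fractional or negative, which is one reason such a model fits count data only approximately.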
The other advanced single imputation method discussed here is based on expectation maximization (EM). The EM approach is a procedure that estimates unmeasured data and is based on iterating through two alternating steps [
In recent years, multiple imputation (MI) has emerged as a methodology for handling missing data. Originally, it was viewed as being most appropriate for complex surveys, although in the 1990s it was shown to be valuable in other settings as well [
Missing values that are replaced with more than one possible estimator will produce more than one completed dataset: each of the 3 to 10 imputations leads to a new dataset containing the original “complete” available observations and the new “generated” imputed ones. Each of the 3 to 10 datasets is first analyzed as if it were a complete dataset with no missing values. The separate results can then be combined into one final result according to specific rules. Rubin [
From a researcher’s perspective, the biggest advantage of MI is flexibility. It applies to a wide range of missing data situations and is simple enough to be used by nonstatisticians. Theoretically, this approach is superior to other models because it often produces the most robust effects. In this paper, four multiple imputation programs are compared. The first, called NORM [
The dataset in this simulation study was obtained from an online, self-help prospective study for problematic alcohol consumers. The online self-help program was developed by a substance abuse treatment center in Amsterdam, the Netherlands. Each new participant was invited for a measurement of alcohol consumption, quality of life, self-efficacy, and demographics. Data were collected in two waves: at baseline and 3 months after baseline. All the cases with missing values were removed from the original dataset, resulting in a dataset with 124 cases and 0% missing data. The dataset contains self-reported daily alcohol consumption quantities measured in standard drinking units containing 10 grams of ethanol. These consumption quantities were available at baseline and at the 3-month follow-up. For the purposes of this paper, we used only the subscale measuring alcohol consumption for the last 7 days, measured using Timeline Follow-Back methodology [
This complete (0% missing) dataset was used as a reference value for comparison of each approach. Next, one of the weekdays from the follow-up measurement was selected and an MAR missingness pattern was induced, leading to 50% MAR missingness in this variable. The operationalization of MAR applied by the execution of this macro is according to the method suggested by Scheffer [
After MAR induction, the missing data approaches were performed on the dataset with missing observations. For LOCF, data collected at baseline were carried forward to the missing follow-up measurement for the variable upon which missingness was induced. All “advanced” missing data approaches came with default software settings. It is possible to adjust these settings to change the number of iteration steps, convergence criteria, and the distribution of random error. For the presented analysis, the default software settings were used. To test sensitivity of the results to changes in these default software settings, the study was replicated using stricter, more calculation-intensive settings, that is, a larger number of iterations or stricter convergence criteria. The results obtained with these stricter settings did not differ systematically from the results obtained using the default settings.
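The two steps described above, inducing MAR missingness and then applying LOCF from baseline, might be sketched as follows. This is not the macro used in the study: the mechanism (a higher probability of missingness for participants above the median baseline value), the probabilities `p_high` and `p_low`, and the simulated data are all assumptions chosen for illustration.

```python
import random
import statistics

def induce_mar(baseline, followup, p_high=0.8, p_low=0.2, seed=1):
    """Set follow-up values to None with a probability that depends only on
    the observed baseline value (an assumed MAR mechanism)."""
    rng = random.Random(seed)
    cutoff = statistics.median(baseline)
    result = []
    for b, f in zip(baseline, followup):
        p_missing = p_high if b > cutoff else p_low
        result.append(None if rng.random() < p_missing else f)
    return result

def locf(baseline, followup):
    """Carry the baseline value forward wherever follow-up is missing."""
    return [b if f is None else f for b, f in zip(baseline, followup)]

# Simulated data for 124 participants (hypothetical, not the study data).
rng = random.Random(42)
baseline = [rng.randint(0, 12) for _ in range(124)]
followup = [max(0, b - rng.randint(0, 4)) for b in baseline]

with_missing = induce_mar(baseline, followup)
carried = locf(baseline, with_missing)
```

In this simulated example consumption never increases between waves, so carrying baseline values forward can only raise the follow-up mean, which shows how LOCF may overestimate an endpoint that declines over time.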
To investigate reliability and coverage of the results obtained through these approaches, a resampling approach was performed. A total of 75 samples of n = 124 were drawn with replacement from the MAR imposed dataset, and these resampled datasets had, on average, 50% MAR missingness on the selected variable. Next, missing values from each dataset were imputed using the different approaches.
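A minimal sketch of this resampling step is given below. The dataset is hypothetical, and the helper names `bootstrap_samples` and `missing_fraction` are illustrative; only the number of samples (75) and the sample size (124) are taken from the text.

```python
import random
import statistics

def bootstrap_samples(rows, n_samples=75, seed=0):
    """Draw n_samples resamples of len(rows) rows each, with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    return [[rows[rng.randrange(n)] for _ in range(n)] for _ in range(n_samples)]

def missing_fraction(sample):
    """Share of rows whose follow-up value (second field) is missing."""
    return sum(1 for _, f in sample if f is None) / len(sample)

# Hypothetical dataset of (baseline, follow-up) pairs in which every
# other follow-up value is missing, giving 50% missingness overall.
rows = [(i % 10, None if i % 2 else i % 7) for i in range(124)]

samples = bootstrap_samples(rows)
fractions = [missing_fraction(s) for s in samples]
print(statistics.mean(fractions))
```

Each resampled dataset has its own proportion of missing values, but the proportions average out close to that of the dataset they were drawn from. Each sample would then be passed to every imputation approach in turn.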
Superior performance of the MI approaches over the other advanced approaches (and of the advanced approaches over the basic approaches) was expected, based on previous studies [
For successful application in a variety of missing data situations, it is important to test for reliability in addition to validity. For example, will the use of the presented methods lead to comparable results with repeated application? Coverage can be regarded as a combined indicator for validity and reliability. It is expected that coverage of the advanced approaches will outperform the basic methods.
Validity was operationalized as the extent to which the estimate obtained by a missing data approach approximated the reference value. Validity (ie, test validity) was assessed by calculating
The complete (reference) dataset and the datasets that resulted after application of the missing data approaches are plotted in
Strip chart for 9 missing data approaches and the reference value
A number of participants reported zero postintervention drinks per day; consequently, their data points are plotted very close to each other. As is often the case for count data, observations are nonnegative integers only and the distribution of the observations is nonnormal.
Independent samples t tests for missing data approaches against reference value
Method | Mean | SD | t | Degrees of Freedom | P | Cohen’s d |
Reference | 2.62 | 5.22 | 0 | 246 | 1 | 0 |
Complete cases | 1.39 | 2.63 | -2.09 | 176 | 0.04 | -0.31 |
Mean imputation | 1.39 | 1.73 | -2.50 | 246 | 0.01 | -0.35 |
LOCF | 4.85 | 5.43 | 3.29 | 246 | 0.001 | 0.42 |
Regression imputation | 1.39 | 2.37 | -2.38 | 246 | 0.01 | -0.32 |
EM imputation | 3.09 | 3.85 | 0.809 | 246 | 0.42 | 0.10 |
NORM | 3.14 | 9.55 | 0.534 | 246 | 0.53 | 0.07 |
MICE | 3.06 | 4.30 | 0.730 | 246 | 0.47 | 0.09 |
SPSS 17 MI | 1.49 | 2.03 | -2.26 | 246 | 0.03 | -0.31 |
Amelia II | 2.88 | 3.33 | 0.468 | 246 | 0.64 | 0.06 |
a Independent samples
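For readers who want to see how such test statistics follow from the summary values in the table, the sketch below computes an equal-variance independent samples t and Cohen's d from means and standard deviations. The pooled standard deviation formula and the group size of 124 per dataset (consistent with 246 degrees of freedom) are assumptions; the exact computation behind the table is not specified here, and some rows may have been calculated differently.

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

def t_statistic(mean1, sd1, n1, mean2, sd2, n2):
    """Equal-variance independent samples t with n1 + n2 - 2 degrees of freedom."""
    d = cohens_d(mean1, sd1, n1, mean2, sd2, n2)
    return d / math.sqrt(1 / n1 + 1 / n2)

# LOCF row versus the reference row of the table, assuming n = 124 in both.
d = cohens_d(4.85, 5.43, 124, 2.62, 5.22, 124)
t = t_statistic(4.85, 5.43, 124, 2.62, 5.22, 124)
print(round(t, 2), round(d, 2))
```

Under these assumptions the LOCF row is reproduced to within rounding.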
To supplement the visual analysis with statistics,
Repeated application of nine missing data approaches
The nine approaches differed remarkably in the robustness and, therefore, in the reliability of their results. The largest difference between the simulated datasets was produced by the NORM software package, with some of the highest mean values being eight times larger than the smallest. The lowest variance was seen in the complete cases, mean imputation, regression imputation and SPSS MI. LOCF, EM imputation, MICE, and Amelia II showed an average amount of inter-dataset variance.
Coverage of the reference confidence interval for imputed means
Missing Data Approach | Coverage Proportion | Variance of Bootstrapped Sample |
Complete cases | 0.15 | 0.088 |
Mean imputation | 0.15 | 0.088 |
LOCF | 0 | 0.381 |
EM imputation | 0.83 | 0.206 |
Regression imputation | 0.17 | 0.105 |
NORM | 0.43 | 3.027 |
MICE | 0.71 | 0.622 |
SPSS MI | 0.23 | 0.093 |
Amelia II | 0.96 | 0.205 |
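One way to compute a coverage proportion of this kind is sketched below: a 95% confidence interval is built around the reference mean, and the share of bootstrapped means falling inside it is counted. The normal approximation, the helper names, and the list of bootstrapped means are assumptions for illustration; only the reference mean (2.62), SD (5.22), and n (124) come from the paper.

```python
import math

def reference_interval(mean, sd, n, z=1.96):
    """Approximate 95% confidence interval for a mean (normal approximation)."""
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

def coverage(estimates, interval):
    """Proportion of estimates that fall inside the interval."""
    low, high = interval
    inside = sum(1 for e in estimates if low <= e <= high)
    return inside / len(estimates)

# Reference mean, SD, and n as reported in the paper.
ci = reference_interval(2.62, 5.22, 124)

# Hypothetical bootstrapped means from one imputation approach.
boot_means = [2.1, 2.9, 3.4, 1.5, 2.7, 3.0, 3.8, 2.4]
print(coverage(boot_means, ci))  # 0.75
```

A method can therefore have low variance between bootstrapped means and still show poor coverage if those means cluster away from the reference value.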
In this paper, the application of nine approaches for handling missing data is presented and compared. The most valid result was obtained using multiple imputations from the Amelia II algorithm, closely followed by MICE, NORM, and EM imputation. However, due to the large standard errors resulting from the NORM algorithm, the power of the analysis based on this dataset was much lower than the power of an analysis using MICE or Amelia II would have been. The results obtained using the other tested approaches differed significantly from the reference value and can therefore be considered as less valid.
Although complete cases, mean imputation, regression imputation, and SPSS multiple imputation led to reliable results in the sense of small variance between the bootstrapped means (
To mimic the real-life missing data problems more closely in this study, missingness was imposed on a variable containing count data (alcohol consumption counts). However, it should be noted that none of the presented approaches were specifically designed for the imputation of nonnormally distributed count data: specific missing data approaches for this type of data are currently lacking. From the Schafer suite, in addition to NORM, one could select CAT or MIX packages as an alternative, as these are intended for categorical or mixed datasets; however, these programs are also limited with regard to the imputation of missing count data. On the other hand, according to [
To evaluate the selected methods under more ideal conditions as well, the methods were retested using a normally distributed variable with missingness imposed under the same 50% MAR pattern (data not presented here). Differences between the methods became smaller; the less-than-optimal methods led to better results under these conditions. Multiple imputation still led to optimal results, and among the multiple imputation methods, the best results were reached using Amelia II.
Both EM imputation and Amelia II performed reasonably well in this study. EM imputation produces maximum likelihood estimates for the missing values, thus approaching true sample means and variances for an incomplete variable. However, being a single imputation method, the accuracy or inaccuracy of this estimation process is not accounted for in the variances of the resulting estimators. This leads to smaller variances, smaller confidence intervals, and therefore a greater risk of finding significant differences between variables when there are no actual differences (type I error, false positive). This shortcoming of EM imputation and other single imputation approaches marks the biggest advantage of multiple imputation. The latter captures uncertainty due to missingness of data in the variance between the generated datasets, making the estimators from multiple imputed datasets less prone to this type I error.
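How multiple imputation carries this uncertainty into the final estimate can be shown with Rubin's combining rules: the pooled estimate is the average over the imputed datasets, and its total variance is the average within-imputation variance plus the between-imputation variance inflated by a factor of (1 + 1/m). The sketch below applies these rules to hypothetical values; the function name `pool_rubin` and the numbers are illustrative.

```python
import statistics

def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors."""
    m = len(estimates)
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)        # average sampling variance
    between = statistics.variance(estimates)   # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical means and squared standard errors from m = 5 imputed datasets.
means = [2.9, 3.1, 2.7, 3.3, 3.0]
squared_ses = [0.10, 0.12, 0.09, 0.11, 0.10]

pooled_mean, total_variance = pool_rubin(means, squared_ses)
print(pooled_mean, total_variance)
```

A single imputation method effectively reports only the within-imputation term, which is why its confidence intervals tend to be too narrow.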
The main reason why MI is not used more often is probably the perceived complexity of its application. Working with more than one instance of the dataset may seem discouraging to researchers without extensive statistical knowledge or interest. A second reason is that widely used statistical packages did not natively support multiple imputation until recently, so it is understandable that most researchers using these programs do not readily choose this technique when data are missing. In that sense, the introduction of multiple imputation in recent releases of statistical software (ie, the “mi” command in Stata 11 and the multiple imputation module in SPSS 17) may mark a leap forward. Positive experiences with the new “mi” command in Stata have been reported. However, under the conditions of the present study, the results obtained with SPSS 17 multiple imputation were less than optimal.
To conclude, this paper introduced both the implications and the practical use of missing data techniques to a wide, nonstatistical audience. Using the software packages tested and described in this paper, multiple imputation is feasible for any researcher in the eHealth field or related disciplines. The use of these approaches may considerably improve the validity of results obtained from datasets with missing values.
This study is supported by a grant from the ZonMW Addiction II program, grant number 31160006. The substance abuse treatment center referred to in this text is Jellinek, part of Arkin, the Amsterdam-based mental health and addiction treatment center. The authors would like to thank Michelle Miller for text editing, the AIAR JOO and Bouke Sterk for helpful comments on early drafts, and the reviewers for their valuable comments. Their efforts improved the quality of the final manuscript.
None declared
All authors have contributed substantially to this protocol. Matthijs Blankers constructed the design of the study and drafted the manuscript. Maarten Koeter led the overall methodological development and revised the manuscript. Gerard M Schippers is principal investigator and supervised the production of the study and manuscript. All authors have read and approved the final manuscript.
AIAR: Amsterdam Institute for Addiction Research
CONSORT: consolidated standards of reporting trials
EM: expectation maximization
LOCB: last observation carried backward
LOCF: last observation carried forward
MAR: missing at random
MCAR: missing completely at random
MI: multiple imputation
MNAR: missing not at random
RCT: randomized controlled trial