Electronic surveys are convenient, cost-effective, and increasingly popular tools for collecting information. Although the online platform allows researchers to recruit and enroll more participants, it also carries an increased risk of participant dropout in Web-based research. Often, these dropout trends are simply reported, adjusted for, or ignored altogether.
To propose a conceptual framework for analyzing respondent attrition and to demonstrate the utility of these methods with existing survey data.
First, we suggest visualizing attrition trends using bar charts and survival curves. Next, we propose a generalized linear mixed model (GLMM) to detect or confirm significant attrition points. Finally, we suggest applications of existing statistical methods to investigate the effect of internal survey characteristics and patient characteristics on dropout. To apply this framework, we conducted a case study of a 17-item Informed Decision-Making (IDM) module addressing how and why patients make decisions about cancer screening.
Using the framework, we identified significant attrition points at Questions 4, 6, 7, and 9, as well as participant responses and characteristics associated with dropout at these points and overall.
Applying these methods to survey data revealed significant attrition trends, both visually and empirically. These findings can inspire researchers to investigate the factors associated with survey dropout, address whether survey completion is associated with health outcomes, and compare attrition patterns between groups. The framework can be used to extract information beyond simple responses, can be useful during survey development, and can help determine the external validity of survey results.
Web-based surveys are convenient and cost-effective means for collecting research information. Researchers can reach a large number of participants quickly through electronic media, such as email and websites, when compared with conventional paper-based surveys. Applications like REDCap (REDCap Consortium) and SurveyMonkey (SurveyMonkey, Inc) automate the data collection and storage process as well as provide the capability to capture survey paradata or metadata. Web-based paradata allow researchers to capture respondent actions in addition to responses and to track the time participants spend on particular questions [
This technology’s relative ease in soliciting survey participants is coupled with an increased risk of survey attrition—participants dropping out. Potential respondents may ignore solicitations, whereas others may skip questions or exit the survey before answering all the questions. Proper testing before administration, such as completion of a principal component analysis or factor analysis of survey items [
Attrition can occur through different mechanisms and produce different types of bias. Nonusage or nonresponse attrition occurs when participants are solicited but choose not to participate in a survey [
Respondent fatigue is another factor that leads to dropout attrition, especially when questions seem inappropriate or inapplicable [
This paper discusses novel ways to measure and investigate “dropout attrition” [
Our proposed approach for evaluating dropout attrition includes 3 steps: (1) visualization, (2) confirmation, and (3) factor identification. These steps are arranged in order of increasing thoroughness, with each step providing a more nuanced and detailed picture of attrition. Investigators can work through these steps as far as their needs require.
Graphic representation of participant dropout can help reveal attrition trends or patterns. We proposed 2 visualization types—bar charts and survival-type curves—each with several variations to highlight different attrition trends.
Bar charts that described the amount (proportion, percentage, or number) of respondents or dropouts for each survey item provided multiple perspectives for exploring dropout patterns. They allowed identification of differences between sequential questions, isolation of questions of specific interest, and discovery of overall trends. Plotting the percentage or proportion of respondents or dropouts was useful for identifying potentially significant attrition trends. Whether one plots respondents or dropouts depends on personal interest, although the two might not be exact inverses if the survey allows respondents to skip items. Plotting the raw number of dropouts was useful for finding points of attrition that were not obvious when plotting proportions. Although not confirmed statistically, these trends indicated when respondents left the survey, information that could be useful while testing a new survey instrument. Further, the stacked bar chart, which added the percentage of skips for each question, helped to better visualize attrition for surveys with skip patterns. Grouped bar charts were useful for comparing attrition visually between groups. Because these final 2 chart types may not apply to every survey, we suggest, at minimum, plotting the percentage of respondents or dropouts along with the raw number of dropouts to visualize attrition patterns.
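As a minimal sketch of these 2 suggested charts in base R (the `answered` matrix, its simulated values, and all names here are hypothetical stand-ins for real survey data):

```r
## Minimal sketch (base R) of the suggested bar charts. The `answered`
## matrix (participants x questions, 1 = item answered) is simulated;
## all names and values are hypothetical.
set.seed(1)
p <- seq(0.95, 0.50, length.out = 10)              # declining response rates
answered <- matrix(rbinom(200 * 10, 1, p), nrow = 200, byrow = TRUE)
colnames(answered) <- paste0("Q", 1:10)

pct_resp <- 100 * colMeans(answered)               # percent responding per item
## Raw drop between consecutive items; negative values can occur when
## respondents skip an item and return at the next one.
n_drop <- -diff(c(nrow(answered), colSums(answered)))

op <- par(mfrow = c(1, 2))
barplot(pct_resp, ylab = "Respondents (%)", ylim = c(0, 100))
barplot(n_drop, names.arg = colnames(answered), ylab = "Dropouts (n)")
par(op)
```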
Survival-type curves (or step functions) provided another way to visualize attrition. Unlike traditional survival curves, which stipulate decreasing patterns, these plots could incorporate situations in which the number of responses increased (eg, when a large number of respondents skip a particular item). These plots provided visual comparison of several groups with more clarity than the grouped bar chart, especially when comparing more than 3 groups. This visualization type was also useful for identifying what Eysenbach describes as the sigmoidal attrition curve, a pattern that includes a “curiosity plateau” at the beginning of the survey when response rates are high, an attrition phase when response rates decrease, and a stable participation phase when response rates are relatively constant for the remainder of the survey [
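A survival-type curve can be drawn as a simple step function; this sketch reuses the hypothetical `answered` matrix from the previous sketch:

```r
## Survival-type attrition curve as a step function, reusing the
## hypothetical `answered` matrix from the previous sketch.
pct <- 100 * colMeans(answered)
plot(seq_along(pct), pct, type = "s", ylim = c(0, 100),
     xlab = "Question", ylab = "Responding (%)")
## Additional cohorts could be overlaid for visual comparison, eg:
## lines(seq_along(pct2), pct2, type = "s", lty = 2)
```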
The second step was to determine whether any visually identified attrition patterns were statistically significant. A statistical model could determine the attrition changes from question to question. For example, a generalized linear mixed model (GLMM)—a broad set of models that includes logistic and Poisson regression—could incorporate both fixed and random effects to test if the proportion of patient responses decreases between subsequent questions [
We applied a GLMM to test the hypothesis that the proportion of respondents is equal between 2 sequential questions. In our model, the outcome was binary: whether or not a person answered the survey question (yes or no). An indicator identifying the previous or subsequent question was included as a fixed effect, and a subject-level random effect was included to account for within-subject dependence between response rates. The GLIMMIX procedure in the SAS software (SAS Institute) can be used to fit the GLMM to each pair of sequential questions. To transform the results into a difference in proportions, the IML procedure can then apply the multivariate delta method, yielding a point estimate of the difference in response rates along with its standard error and 95% CI for each comparison. The NLMIXED procedure could also be used to obtain these estimates directly, but it does not allow for the covariance structure necessary to model more than 2 questions at a time.
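For readers working in R rather than SAS, a hedged analogue of the GLIMMIX and IML steps (not the code used in our analysis) can be built with lme4 and a hand-coded delta method; the data reuse the hypothetical simulated matrix from the visualization sketch:

```r
## Hedged R analogue of the GLIMMIX/IML steps (not the procedures used in
## our analysis). Compares response rates between 2 sequential questions
## via a logistic GLMM with a subject-level random intercept; all names
## are hypothetical and the data come from the earlier simulated matrix.
library(lme4)

long <- data.frame(
  id       = factor(rep(seq_len(nrow(answered)), 2)),
  item     = rep(c(0, 1), each = nrow(answered)),   # 0 = Q4, 1 = Q6
  answered = c(answered[, "Q4"], answered[, "Q6"])
)
fit <- glmer(answered ~ item + (1 | id), data = long, family = binomial)

## Multivariate delta method: translate the logit-scale estimates into a
## difference in response proportions with a standard error and 95% CI.
b  <- fixef(fit)
p1 <- plogis(b[1]); p2 <- plogis(b[1] + b[2])
g  <- c(dlogis(b[1]) - dlogis(b[1] + b[2]),         # d(p1 - p2)/db0
        -dlogis(b[1] + b[2]))                       # d(p1 - p2)/db1
se <- drop(sqrt(t(g) %*% as.matrix(vcov(fit)) %*% g))
c(diff = p1 - p2, se = se,
  lower = (p1 - p2) - 1.96 * se, upper = (p1 - p2) + 1.96 * se)
```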
The final step was to examine different factors that may be associated with attrition, such as patient characteristics (eg, age, gender), health outcomes (eg, cancer screening), survey responses, and survey metadata. Knowing that significant attrition trends exist in the dataset, we investigated factors associated with the observed dropout; high attrition rates could be attributable to any number of factors, including the survey itself. Results could also be stratified by population subgroups, such as gender, race, and ethnicity. In addition to looking at attrition question by question, we could also consider the overall attrition as a binary variable (ie, survey completers vs noncompleters).
We proposed 3 general methods for examining factors suspected to be associated with attrition: chi-square analyses (or Fisher’s exact test), the log-rank test, and Cox proportional hazards regression. Whereas previous research has used chi-square analyses to compare completers and noncompleters by demographics and lifestyle characteristics [
We adopted Eysenbach’s suggestion for survival analysis [
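As a hedged sketch of how these 3 analyses might be run in R (the data frame `d` and every column in it are simulated, hypothetical stand-ins for real study data):

```r
## Hedged sketch of the 3 factor-identification analyses in R; the data
## frame `d` and all of its columns are simulated, hypothetical stand-ins.
library(survival)
set.seed(2)

n <- 300
d <- data.frame(
  last_q    = sample(1:17, n, replace = TRUE),  # last question reached
  completed = rbinom(n, 1, 0.45),               # 1 = finished the survey
  gender    = sample(c("F", "M"), n, replace = TRUE),
  q2_answer = sample(c("not thought", "thinking", "close", "decided"),
                     n, replace = TRUE)
)

## 1) Chi-square test: response to one question vs completion status.
chisq.test(table(d$q2_answer, d$completed))

## 2) Log-rank test: does the attrition pattern differ by gender?
## "Time" is the last question reached; dropout (1 - completed) is the event.
survdiff(Surv(last_q, 1 - completed) ~ gender, data = d)

## 3) Cox proportional hazards regression, which permits covariate adjustment.
coxph(Surv(last_q, 1 - completed) ~ gender, data = d)
```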
The survey—entitled the Informed Decision-Making (IDM) module—was designed by our research team to explore how people approach potentially difficult decisions about breast, colorectal, and prostate cancer screenings. It was developed in 2013 through intensive stakeholder engagement, including working with patients to ensure questions were in an understandable format that was easy to answer [
The study was conducted between January and August 2014 at 12 primary care practices in northern Virginia that used the interactive online patient portal MyPreventiveCare.
Most questions in the IDM module had several subquestions. The system did not force respondents to answer all questions and allowed patients to skip questions. Five questions were directed to a subset of patients based on their answer to a previous question. Although these questions were imperative to our original study goals, we excluded them from this attrition analysis. The study was funded by the Patient Centered Outcomes Research Institute in 2012 and approved by the Virginia Commonwealth University Institutional Review Board [
All statistical analyses were conducted using SAS version 9.4 (SAS Institute), whereas all graphs were created using R version 3.1.1 (R Foundation for Statistical Computing) with the
During the study period, 2355 patients started the IDM module: 638 from the breast cancer cohort, 1249 from the colorectal cancer cohort, and 468 from the prostate cancer cohort. A bar chart displayed the percentage of respondents for each succeeding question in the module (
The bar chart reveals an increase in the percentage of patients who answered Question 8, which occurred because patients were able to skip questions. A stacked bar chart demonstrates that some participants skipped Questions 4, 6, 7, 9, and 12 (
The right panel of
We used grouped bar plots as per Ekman [
The top panel of
Bar charts for percent of answers for all cancer types without skips (left) and with skips (right).
Bar charts for number of dropouts (left) and percent of dropouts (right).
Grouped bar charts for the number of dropouts (left) and percent of dropouts (right).
Step function comparing all cohorts (top) and attrition curve of all cancer types (bottom).
As observed through visualization, the GLMM results suggest that the attrition that occurred between Questions 2 and 4, 4 and 6, 6 and 7, and 8 and 9 was statistically significant (
Generalized linear mixed model (GLMM) results.
Analysis | p1 | p2 | p1−p2 | Standard error | 95% CI | P value |
Q1 to Q2 | 1.00 | 0.97 | 0.03 | 0.660 | −1.264 to 1.321 | .99 |
Q2 to Q4 | 0.97 | 0.76 | 0.21 | 0.001 | 0.206 to 0.208 | <.001 |
Q4 to Q6 | 0.76 | 0.56 | 0.20 | 0.001 | 0.203 to 0.204 | <.001 |
Q6 to Q7 | 0.56 | 0.52 | 0.04 | 0.001 | 0.039 to 0.041 | .006 |
Q7 to Q8 | 0.52 | 0.54 | −0.02 | 0.001 | −0.023 to −0.022 | .12 |
Q8 to Q9 | 0.54 | 0.49 | 0.05 | 0.001 | 0.048 to 0.049 | <.001 |
Q9 to Q10 | 0.49 | 0.47 | 0.02 | 0.001 | 0.023 to 0.024 | .10 |
Q10 to Q12 | 0.47 | 0.45 | 0.02 | 0.001 | 0.022 to 0.023 | .12 |
Q12 to Q13 | 0.45 | 0.44 | 0.01 | 0.001 | 0.005 to 0.006 | .70 |
Q13 to Q16 | 0.44 | 0.43 | 0.01 | 0.001 | 0.012 to 0.013 | .38 |
Q16 to Q17 | 0.43 | 0.42 | 0.01 | 0.001 | 0.011 to 0.012 | .41 |
We used the chi-square test to determine if a respondent’s answer to a particular question was associated with dropout in the next question and found that patients in the middle of the decision-making process—having indicated on Question 2 that they were either thinking about or close to making a decision (
Determining if a patient’s response to Question 2 (“How far along are you with making a decision about cancer screening?”) was associated with answering the next question.a
Response to Question 2 | Answered Question 4: Yes (%) | Answered Question 4: No (%)
I have not yet thought about the choice. | 79.47 | 20.53 |
I am thinking about the choice. | 85.54 | 14.46 |
I am close to making a choice. | 84.72 | 15.28 |
I have already made a choice. | 75.24 | 24.76 |
aOverall chi-square test:
We applied the log-rank test to determine if the overall attrition pattern differed by gender within the colorectal cancer cohort (the only cohort that included both men and women) and found that the dropout pattern differed significantly (
We performed a Cox proportional hazards regression to examine whether the relationship between gender and dropout was confounded by demographic and other patient characteristics. Bivariate analyses of ethnicity, race, preferred language, recruitment phase, insurance type, and age, when compared with time to dropout, suggested that recruitment phase was the only covariate associated with survey completion (
Kaplan-Meier survival curves by gender within colorectal cancer cohort.
Using our test case, visualization allowed us to identify the 2 most obvious points of attrition, Questions 4 and 6, with the overall attrition rate converging to approximately 60%. The GLMM helped confirm these as points of significant attrition, and chi-square analyses suggested that participant responses to prior questions were associated with dropping out at these points. Overall, survival analyses suggested that IDM module dropout was significantly associated with gender, raising the possibility that the survey content favored men, although this association did not persist after accounting for recruitment phase. Furthermore, survey completion was positively associated with getting the cancer screening test. Despite the yearlong effort to create the IDM module, including focus groups, question testing, and several revisions [
The proposed framework suggests that we plot overall attrition to identify patterns, analyze these patterns for significance, and then investigate potential reasons for dropout throughout the module. As the first step in evaluating attrition, visualization provides a broad view of dropout patterns throughout a survey, such as visual approximations of Eysenbach’s curiosity, attrition, and stable use phases [
Prior work in this area has encountered challenges. For example, Ekman plotted the number of dropouts per question on 2 surveys in a bar chart, revealing that most of the dropout occurred within the first 8 questions [
The second stage of our dropout attrition framework is designed to confirm whether observed drops in response rates are statistically significant. These formal analyses can not only confirm trends observed in the visualizations but also locate differences that were not visually apparent.
The last stage proposes an examination of possible causes of participant dropout. Collecting and adjusting for demographic characteristics (especially those previously suggested as predictive of survey completion including gender, age, education, and ethnicity) [
As noted in the Introduction, the methods proposed here are meant only as a starting point. These methods could additionally be considered as a part of the survey testing process in helping to refine the instrument and retain the maximum number of participants. This paper does not discuss other forms of attrition that apply to online surveys, such as nonresponse attrition, attrition in longitudinal surveys, or methods to minimize attrition or correct for potential bias introduced by high attrition rates.
Although not exemplified in this paper, discrete-time survival analysis would be a more appropriate, though more complex, method for identifying this type of survival pattern, as patients can drop out only at discrete time points (ie, after each question). We applied the GLMM pairwise in our case study, but it is also possible to fit a single model to the entire survey; this more complex modeling would require more sophisticated parameterization (eg, dependence structures) that may affect estimator accuracy and convergence. The indicator used in our GLMM distinguished whether a patient answered a survey question, but it could instead have indicated whether the respondent dropped out at a particular question; the results will not be exact inverses when respondents are allowed to skip questions.
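A hedged sketch of the discrete-time approach, expanding the hypothetical data frame `d` from the earlier sketch into a person-period data set:

```r
## Hedged sketch of discrete-time survival analysis: expand the
## hypothetical data frame `d` (from the earlier sketch) into a
## person-period data set, one row per question each person reached,
## then model the per-question hazard of dropout with logistic regression.
pp <- do.call(rbind, lapply(seq_len(nrow(d)), function(i) {
  tq <- d$last_q[i]
  data.frame(id       = i,
             question = seq_len(tq),
             ## Event indicator: dropout occurs at the last question
             ## reached, unless the respondent completed the survey.
             dropout  = c(rep(0, tq - 1), 1 - d$completed[i]),
             gender   = d$gender[i])
}))
dt_fit <- glm(dropout ~ factor(question) + gender, data = pp,
              family = binomial)
summary(dt_fit)
```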
These analyses can be enhanced by linking responses to subject characteristics or metadata. Online surveys provide additional information not previously available in paper-based surveys, perhaps most notably metadata. The amount of time a patient spends on each question, the time of day a survey is taken, and Internet browser version compatibility are all examples of metadata that could also affect attrition patterns.
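For instance, a minimal sketch of relating one such piece of paradata to completion (all values simulated and hypothetical):

```r
## Minimal sketch: testing whether hypothetical paradata (seconds spent
## on the first question) are associated with survey completion.
set.seed(3)
n <- 300
para <- data.frame(
  secs_q1   = rexp(n, rate = 1 / 30),  # simulated time on Question 1
  completed = rbinom(n, 1, 0.45)       # simulated completion indicator
)
summary(glm(completed ~ secs_q1, data = para, family = binomial))
```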
Survey characteristics associated with overall completion, such as survey relevance, could also be examined question by question [
We contend that simply reporting attrition rates is not enough; we must dig deeper to examine where and why attrition occurs. Our contribution here is to advocate advances in the science of attrition. The framework outlined in this manuscript is especially important when fielding new surveys that have not been previously tested or validated. This framework is best applied as both part of the survey development process and as a tool for interpreting survey results. We encourage researchers to engage with these steps throughout the research process as we work as a community to establish a “law of attrition.”
MyQuestions Informed Decision Making Module.
EHR: electronic health records
GLMM: generalized linear mixed model
IDM: Informed Decision-Making
MyPreventiveCare
RCT: randomized controlled trial
This project was funded by the Patient Centered Outcomes Research Institute (PCORI Grant Number IP2PI000516-01) and the National Center for Advancing Translational Sciences (CTSA Grant Number ULTR00058). The opinions expressed in this paper are those of the authors and do not necessarily reflect those of the funders.
The authors thank the research teams and Dr Robert Perera, Paulette Kashiri, and Eric Peele for their valuable efforts.
Furthermore, the authors thank their practice partners at Privia Medical Group and the Fairfax Family Practice Centers, including Broadlands Family Practice, Family Medicine of Clifton and Centreville, Fairfax Family Practice, Herndon Family Medicine, Lorton Stations Family Medicine, Prince William Family Medicine, South Riding Family Medicine, Town Center Family Medicine, and Vienna Family Medicine for their insights and hard work.
None declared.