Attrition from Web-Based Cognitive Testing: A Repeated Measures Comparison of Gamification Techniques

Background: The prospect of assessing cognition longitudinally and remotely is attractive to researchers, health practitioners, and pharmaceutical companies alike. However, such repeated testing regimes place a considerable burden on participants, and because cognitive tasks are typically regarded as effortful and unengaging, these studies may experience high levels of participant attrition. One potential solution is to gamify these tasks to make them more engaging, increasing participants' willingness to take part and reducing attrition. However, such an approach must balance task validity against the introduction of entertaining gamelike elements.

Objective: This study aims to investigate the effects of gamelike features on participant attrition using a between-subjects, longitudinal Web-based testing study.

Methods: We used three variants of a common cognitive task, the Stop Signal Task (SST), each with a single gamelike feature: one variant in which points were awarded for performing optimally; another in which the task was given a graphical theme; and a third variant, a standard SST, which served as a control condition. Participants completed four compulsory test sessions over 4 consecutive days before entering a 6-day voluntary testing period in which they faced a daily decision to either drop out or continue taking part. Participants were paid for each session they completed.

Results: A total of 482 participants signed up to take part in the study, with 265 completing the requisite four consecutive test sessions. No evidence of an effect of gamification on attrition was observed. A log-rank test showed no evidence of a difference in dropout rates between task variants (χ²(2)=3.0, P=.22), and a one-way analysis of variance of the mean number of sessions completed per participant in each variant also showed no evidence of a difference (F(2,262)=1.534, P=.21, partial η²=0.012).

Conclusions: Our findings raise doubts about the ability of gamification to reduce attrition from longitudinal cognitive testing studies.


Procedure
The study elements participants completed each session depended on the day of the study they were on. From the main menu, clicking the start button displayed a series of instruction screens, followed by a ~10-minute delivery of the SST and that day's questionnaires, as shown in Supplementary Table 1.

Supplementary Table 1: Study elements delivered each session. Optional sessions are shown in light grey.

Instructions: every session (1-10)
Stop Signal Task: every session (1-10)
Demographic Questionnaire: 1 session
Full Engagement Questionnaire: 4 sessions
Short Engagement Questionnaire: 6 sessions
Perseverance Questionnaire: 1 session
Free Text Questionnaire: 2 sessions
History Screen: every session (1-10)

Stop Signal Task: Staircase and Block Details
Further to the information provided in the main manuscript: Stop Signal Delay (SSD) was varied according to a four-staircase convergence algorithm designed to sample evenly across the SSD/inhibition-probability space. Staircases 2 and 3 converged to a 50% failed inhibition rate, while staircases 1 and 4 sampled the limits of a participant's inhibition; see Supplementary Table 2. On a step up or step down, a staircase was adjusted by +50 ms or −50 ms respectively, and the step size changed to ±25 ms after two reversals of direction. The shortest possible SSD was 25 ms and the longest was 750 ms. The task consisted of 5 blocks of 48 trials each. Each block contained 3 sub-blocks of 16 trials, of which 12 were go trials and 4 were stop trials. The first sub-block of each session consisted entirely of go trials, so in total each session contained 240 trials, of which 56 were stop trials. After 48 trials the block ended and the participant had to wait for 10 seconds before automatically continuing to the next block. To maintain response speed and discourage strategic slowing, the participant was prompted to go faster during this break. A dynamic speed prompt was also displayed if the participant's responses in one sub-block were on average 50 ms slower than those in the previous sub-block. Once five blocks had been completed, the task ended; this typically took ~10 minutes.
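The staircase rules above can be sketched as follows. This is an illustrative reconstruction, not the study's actual code; the class and method names are hypothetical.

```python
class Staircase:
    """One SSD staircase: ±50 ms steps, shrinking to ±25 ms after two
    reversals of direction, clamped to the 25-750 ms range."""

    def __init__(self, ssd_ms=250):
        self.ssd = ssd_ms
        self.step = 50           # initial step size in ms
        self.last_direction = 0  # +1 = up, -1 = down, 0 = no step yet
        self.reversals = 0

    def update(self, inhibited):
        """Step SSD up after a successful stop, down after a failed stop."""
        direction = 1 if inhibited else -1
        if self.last_direction and direction != self.last_direction:
            self.reversals += 1
            if self.reversals >= 2:
                self.step = 25   # finer steps after two reversals
        self.last_direction = direction
        self.ssd = min(750, max(25, self.ssd + direction * self.step))
        return self.ssd
```

Stepping SSD up after successful inhibition makes the next stop harder, which is how staircases 2 and 3 converge toward the 50% failed inhibition rate.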

Stop Signal Reaction Time Calculation
Estimated SSRTs were calculated automatically at the end of each session using the integration method as detailed in Band, van der Molen, and Logan (2003) and Logan (1994), and were presented to participants on the history screen.
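A minimal sketch of the integration method: the go RT distribution is integrated until its mass equals the overall probability of responding on stop trials, and the mean SSD is subtracted from that quantile. The function name and inputs are illustrative, and published pipelines add refinements (e.g. replacing omitted go responses) not shown here.

```python
def ssrt_integration(go_rts, p_respond_given_signal, mean_ssd):
    """Integration-method SSRT estimate: the nth fastest go RT, where
    n = P(respond | stop signal) * number of go trials, minus mean SSD."""
    rts = sorted(go_rts)
    n = max(1, round(p_respond_given_signal * len(rts)))
    nth_rt = rts[min(n, len(rts)) - 1]
    return nth_rt - mean_ssd
```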

Free Text Questionnaire
After the third session, participants were presented with a short questionnaire to which they could respond using free text of up to 500 characters. The following questions were presented in a random order: (1) Have you noticed any bugs or errors in the experiment so far? (2) Are you enjoying the experiment so far? Is there anything you would change? (3) What has motivated you to take part in the experiment so far?

Perseverance Questionnaire
After the second session participants completed a visual-analogue-scale version of the perseverance subscale of the Urgency, Premeditation, Perseverance and Sensation Seeking (UPPS) Impulsive Behaviour Scale (Whiteside & Lynam, 2001), presented in the same format as the Enjoyment and Engagement questionnaire. The main aim of this questionnaire was to test whether individual differences in perseverance might confound attrition rates on the task variants. A total perseverance score was calculated as the mean of all items, with items 2 and 10 reverse-scored. The following items were presented in a random order: (1) I generally like to see things through to the end. (2) I tend to give up easily. (3) Unfinished tasks really bother me. (4) Once I get going on something I hate to stop. (5) I concentrate easily. (6) I finish what I start. (7) I'm pretty good about pacing myself so as to get things done on time. (8) I am a productive person who always gets the job done. (9) Once I start a project, I almost always finish it. (10) There are so many little jobs that need to be done that I sometimes just ignore them all.
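The scoring rule above (mean of all items, items 2 and 10 reverse-scored) can be sketched as follows; a 0-100 VAS range is assumed here, as the scale's endpoints are not restated in this supplement.

```python
def perseverance_score(responses, scale_max=100):
    """Mean of the 10 perseverance items, reverse-scoring items 2 and 10.
    `responses` maps item number (1-10) to its VAS value; the 0-100
    range is an assumption, not taken from the study."""
    reverse_scored = {2, 10}
    scored = [scale_max - value if item in reverse_scored else value
              for item, value in responses.items()]
    return sum(scored) / len(scored)
```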

Supplementary Analyses: Planned Analyses
The following analyses were planned as described in our preregistered study protocol: https://osf.io/ysaqe/

Cognitive Data
Go RTs and FailedStop RTs were summarised at a participant level using medians. We also calculated the gradient of the inhibition function at the point of P(Respond|Signal) = 0.5 using numerical differentiation.
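As a sketch of that gradient step (with illustrative values, not the study's data), the modelled inhibition function can be differentiated numerically and the slope read off at the point nearest P(Respond|Signal) = 0.5:

```python
import numpy as np

# Hypothetical modelled inhibition-function samples: SSD (ms) against
# the probability of responding on a stop trial.
ssd = np.array([100, 200, 300, 400, 500], dtype=float)
p_respond = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

slope = np.gradient(p_respond, ssd)       # dP/dSSD at each sample point
idx = np.argmin(np.abs(p_respond - 0.5))  # sample nearest P = 0.5
slope_at_half = slope[idx]                # gradient used in the analysis
```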
To assess whether the introduction of game mechanics would affect the cognitive data collected by each task variant, we used mean Go RT, FailedStop RT, Go Accuracy and Stop Accuracy data from the four compulsory sessions and performed a series of univariate ANOVAs with task variant (Non-Game, Points, Theme) as a factor; see Supplementary Table 3. We found effects of task variant on all measures except Go Accuracy, likely because Go Accuracy scores were high and participants were operating at ceiling. The effects of task variant were quite small, yet they still indicate an impact of game mechanics on the comparability of the data collected by the task. Supplementary Figure 2 shows boxplots of these variables for each task variant, made up of participants' median responses over the four compulsory sessions. Cognitive measures appear broadly comparable between task variants, but the effects detected by the ANOVAs are apparent on closer inspection. We used t-tests to explore differences of interest, and Bayesian t-tests to assess similar distributions for equality.

Supplementary Figure 2: Box and whisker plots of mean Go Reaction Time, FailedStop Reaction Time, Go Accuracy and Stop Accuracy. Data combined per participant over the first four sessions and shown separately by task variant.
Given the lack of effect of task variant on Go Accuracy, we used Bayesian t-tests to assess the variants for equality. These tests were inconclusive for all comparisons (BF = 0.31 and 0.38) except Points (M = 94%, SD = 5%) compared with Theme (M = 94%, SD = 6%), where we found substantial evidence of equality (BF = 0.17). With respect to Stop Accuracy, we saw differences between the Non-Game (M = 52%, SD = 9%) and Theme (M = 54%, SD = 7%) variants (mean difference 2.5%, 95% CI 0.2 to 4.8, t(169.9) = 2.153, p = .03, d = 0.32) and between the Non-Game and Points (M = 55%, SD = 8%) variants (mean difference 3.3%, 95% CI 0.9 to 5.7, t(175.2) = 2.722, p = .01, d = 0.40). There was no evidence of a difference between Points and Theme (p = .50), but a Bayesian t-test could not provide evidence for equality either (BF = 0.50). We also calculated the slopes of the modelled inhibition curves using numerical differentiation and assessed the gradients for differences between task variants. A one-way ANOVA did not show evidence of an effect of task variant on inhibition slope (F(2,255) = 1.437, p = .24, partial η² = 0.011), and Bayesian t-tests showed moderate evidence that the Non-Game and Theme variants' slopes were equivalent (BF = 0.24), and that the Points and Theme variants' slopes were also equivalent (BF = 0.22). However, there was insufficient evidence to suggest that the Non-Game and Points variants had equivalent slopes (BF = 0.62); see Supplementary Table 4 and Supplementary Figure 3.

Reliability of Cognitive Measures over Time
We found the test-retest reliability of SSRTs from the first four sessions to be very good, with an overall Cronbach's alpha of 0.85. When assessed by task variant, the Points (α = 0.86) and Theme (α = 0.86) variants showed the most consistent results, with Non-Game (α = 0.75) showing lesser, yet still good, reliability. We used cocron (Diedenhofen & Musch, 2016) to investigate differences between these alphas but saw no evidence for an effect of task variant (χ²(2, N = 258) = 5.140, p = .08).
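For reference, Cronbach's alpha over a participants × sessions matrix of SSRTs can be computed as below. This is a generic sketch of the standard formula, not the study's analysis script.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a participants x sessions score matrix:
    alpha = k/(k-1) * (1 - sum of per-session variances / variance of totals)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of sessions (items)
    session_vars = scores.var(axis=0, ddof=1)   # variance of each session
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of row totals
    return k / (k - 1) * (1 - session_vars.sum() / total_var)
```

Perfectly consistent scores across sessions yield an alpha of 1; uncorrelated sessions drive it toward 0.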
We also wanted to investigate whether time or practice effects impacted the cognitive data collected by the task variants, so we ran a series of repeated-measures ANOVAs with Go RT, FailedStop RT and SSRT as the dependent variables and session number (1-4) as the time factor in each; see Supplementary Table 5. Where the assumption of sphericity was violated, we used Greenhouse-Geisser-corrected p values. We saw small effects of session number on all cognitive measures, but no clear evidence of interactions between task variant and session number on any of the measures (ps > .07). Supplementary Table 6 shows the mean RTs from each session, combined across task variants.

Individual Engagement Questionnaire Data
Supplementary Figure 4 shows the scores of individual questions on the enjoyment and engagement test, calculated by averaging the long-form questionnaires from Sessions 1 and 4.

Supplementary Figure 4: Individual question scores from the subjective enjoyment and engagement questionnaire. Mean responses of VAS scores from questionnaires delivered on sessions 1 and 4, shown separately by task variant. Error bars represent 95% confidence intervals.

All Participant Attrition
Supplementary Figure 5 shows the number of participants remaining in the study at each timepoint, including all participants who signed up. We used the Kaplan-Meier method to calculate the estimated survival times, and a log-rank test showed no evidence of a difference between the distributions (χ²(2, N = 482) = 0.816, p = .67). The mean number of sessions completed in each task variant was very similar, as shown in Supplementary Table 7.
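As an illustration of the survival estimate (a generic sketch, not the study's code), a Kaplan-Meier curve for attrition data can be computed by treating each participant's last completed session as their dropout time and censoring participants who complete every session:

```python
def kaplan_meier(sessions_completed, max_sessions=10):
    """Kaplan-Meier survival estimate: each participant 'survives' until
    the last session they completed; those who finish all sessions are
    censored at the end of the study rather than counted as dropouts."""
    survival, s = [], 1.0
    for t in range(1, max_sessions + 1):
        at_risk = sum(1 for last in sessions_completed if last >= t)
        dropped = sum(1 for last in sessions_completed
                      if last == t and last < max_sessions)
        if at_risk:
            s *= 1 - dropped / at_risk
        survival.append(s)
    return survival
```

A log-rank test then compares these survival distributions between task variants.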

Supplementary Figure 5: Number of participants that took part each day over the ten-day period
We were also interested in whether gamification would affect the number of participants who decided to stay with the study after trying one initial session. Supplementary Figure 6 shows the percentage of participants that completed a session on the first four days, divided by task variant.

Loosely Conforming Participant Attrition
We performed an additional attrition analysis including 32 participants who only managed to complete their final compulsory test session on the fifth day of the study, rather than the fourth. As such, the maximum number of optional days these participants could complete was 5. Collectively, we referred to the 297 participants who completed the 4 compulsory sessions within 5 days as loosely conforming, and the analyses below cap the maximum number of sessions completed at 9, as this is the maximum number of sessions that every participant had a chance to complete.

Supplementary Table 7: Mean number of sessions completed per participant in each task variant (95% CI). Headings of the first two columns are inferred from the analyses reported in the text.

Task Variant | All participants (95% CI) | Conforming participants (95% CI) | Loosely conforming participants (95% CI)
Non-Game | 4.9 (4.4 to 5.5) | 7.0 (6.6 to 7.5) | 6.9 (6.5 to 7.4)
Points | 5.1 (4.5 to 5.6) | 7.1 (6.7 to 7.6) | 7.0 (6.6 to 7.4)
Theme | 5.3 (4.7 to 5.9) | 7.6 (7.1 to 8.0) | 7.3 (6.8 to 7.7)

Supplementary Table 7 and Supplementary Figure 7 show the attrition of these loosely conforming participants. Again, we used the Kaplan-Meier method to calculate estimated survival times. A log-rank test showed no evidence of a difference between the distributions (χ²(2, N = 297) = 1.418, p = .49) and a one-way ANOVA of the number of sessions completed also found no evidence of a difference between task variants (F(2,296) = 0.648, p = .52, partial η² = 0.004).
We then used Bayesian t-tests to assess the mean number of sessions completed in each variant for equality. We found substantial evidence that the number of sessions completed in all variants was equal, with Points equal to Non-Game (BF = 0.16), Points equal to Theme (BF = 0.23) and Non-Game equal to Theme (BF = 0.25).

Supplementary Figure 7: Percentage of loosely conforming participants that completed a session on each day of the nine-day period.

Individual differences and Attrition
To ensure that individual differences in participant perseverance between groups were not masking an effect of task variant on attrition, we used a one-way ANCOVA of mean number of sessions completed with task variant (Non-Game, Points, Theme) as the between-subjects factor and score on the perseverance questionnaire as the covariate. Again, we saw no clear evidence of an effect of task variant on the mean number of sessions completed (F(2,259) = 1.168, p = .31, partial η² = 0.009) and only weak evidence for an effect of perseverance (F(1,259) = 3.562, p = .06, partial η² = 0.013).
Previous literature has suggested that a participant's age, sex, or amount of video game experience can affect their enjoyment of a video game, so we also ran a one-way ANCOVA of mean score with task variant (Non-Game, Points, Theme) as the between-subjects factor and age, sex and hours spent playing video games as covariates. We found no evidence for effects of the three covariates (ps > .28) and still saw evidence of an effect of task variant on overall score (F(2,259) = 4.030, p = .02, partial η² = 0.030).