This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Smartphones and their built-in sensors allow for measuring functions in disease-related domains through mobile tests. This could improve disease characterization and monitoring, and could potentially support treatment decisions for multiple sclerosis (MS), a multifaceted chronic neurological disease with highly variable clinical manifestations. Practice effects can complicate interpretation in two ways: they can exaggerate apparent treatment effects when scores improve over time, and they can mask deterioration when scores appear stable.
The aim of this study is to identify short-term learning and long-term practice effects in 6 active tests for cognition, dexterity, and mobility in user-scheduled, high-frequency smartphone-based testing.
We analyzed data from 264 people with self-declared MS with a minimum of 5 weeks of follow-up and at least 5 repetitions per test in the Floodlight Open study, a self-enrollment study accessible by smartphone owners from 16 countries. The collected data are openly available to scientists. Using regression and bounded growth mixed models, we characterized practice effects for the following tests:
Strong practice effects were found for
Smartphone-based tests are promising for monitoring the disease trajectories of MS and other chronic neurological diseases. Our findings suggest that strong long-term practice effects in cognitive and dexterity functions have to be accounted for to identify disease-related changes in these domains, especially in the context of personalized health and in studies without a comparator arm. In contrast, changes in mobility may be more easily interpreted because of the absence of long-term practice effects, even though short-term learning effects might have to be considered.
Multiple sclerosis (MS) is a multifaceted and variable chronic autoimmune neurological disease affecting approximately 2.3 million people worldwide [
MS progresses in different phases with highly variable speed and severity. To optimize treatment strategies, timely and precise monitoring of patients’ disease status is essential. As MS affects multiple functional domains, a range of validated clinical tests are used: for cognition, the Symbol Digit Modalities Test (SDMT) measures mental processing speed and is highly established as a screening tool for cognitive impairment in MS [
Wearable technologies, such as smartphones and smartwatches, are expected to capture more representative data at a higher resolution not only in the patients’ natural environments in MS but also in other neurological diseases such as Parkinson disease and Huntington disease [
Acknowledged difficulties in interpreting the results of repeated tests are learning and practice effects, especially in neuropsychology [
The aim of this analysis is to examine short-term learning and long-term practice effects in high-frequency smartphone-based tests representative of the assessment of 3 domains often affected by MS: cognition, dexterity, and mobility.
We used publicly available data from the
We included data up to and including July 31, 2021, and focused our analyses on the following 6 tests [
The
Participants were selected for each test separately if at least 5 repetitions per test and at least 5 weeks between their first and last repetitions were available. This yielded slightly different but largely overlapping subsets of participants for each test.
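The per-test selection rule above can be sketched as follows. This is an illustrative Python sketch, not the authors' code (they report using R); the function names, the record layout, and the reading of "5 weeks" as at least 35 days between the first and last repetition are assumptions made for the example.

```python
from datetime import date, timedelta

MIN_REPETITIONS = 5  # at least 5 repetitions per test
MIN_WEEKS = 5        # at least 5 weeks between first and last repetition


def is_eligible(test_dates, min_repetitions=MIN_REPETITIONS, min_weeks=MIN_WEEKS):
    """Check both criteria for one participant and one test.

    Assumption: "5 weeks" is interpreted as a span of >= 35 days.
    """
    if len(test_dates) < min_repetitions:
        return False
    span_days = (max(test_dates) - min(test_dates)).days
    return span_days >= min_weeks * 7


def select_participants(records):
    """records maps (participant_id, test_name) -> list of test dates.

    Selection is done per test, so a participant can qualify for one
    test and not for another.
    """
    return {key for key, dates in records.items() if is_eligible(dates)}


start = date(2021, 1, 1)
example = {
    # 6 weekly repetitions: span of 35 days, eligible
    ("p1", "test_a"): [start + timedelta(days=7 * i) for i in range(6)],
    # 10 daily repetitions: enough repetitions, span too short
    ("p2", "test_a"): [start + timedelta(days=i) for i in range(10)],
    # 4 repetitions over 6 weeks: long enough, too few repetitions
    ("p3", "test_a"): [start + timedelta(days=14 * i) for i in range(4)],
}
selected = select_participants(example)
```

Because eligibility is evaluated per (participant, test) pair, the selected subsets differ slightly between tests, as described above.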
First, summary analyses were performed to investigate the mean scores of the first, fifth, and last trials of each test. We assumed that improvements up to the fifth score were more likely due to short-term learning effects, where participants learned to execute a test, and improvements from the fifth trial onward were more likely because of long-term practice effects. Naturally, these effects are intertwined, but using the fifth trial as the baseline was supported by Solari et al [
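As a minimal sketch of this decomposition, the Python function below splits one participant's overall change into a part up to the fifth trial and a part from the fifth trial onward. The function name is hypothetical, and expressing all three changes as a percentage of the first score is an assumption made here for illustration; it is chosen because the two parts then add up to the total.

```python
def practice_summary(scores):
    """Split the change from the first to the last score at the fifth trial.

    Assumption: every change is expressed as a percentage of the first
    score, so that the short-term and long-term parts sum to the total.
    """
    if len(scores) < 5:
        raise ValueError("at least 5 repetitions are required")
    first, fifth, last = scores[0], scores[4], scores[-1]
    if first == 0:
        raise ValueError("first score must be nonzero to compute percentages")
    return {
        "short_term_pct": 100.0 * (fifth - first) / first,  # first -> fifth
        "long_term_pct": 100.0 * (last - fifth) / first,    # fifth -> last
        "total_pct": 100.0 * (last - first) / first,        # first -> last
    }


# Hypothetical score series for one participant (not study data)
summary = practice_summary([40, 44, 46, 47, 48, 49, 50])
```

For this made-up series, the change up to the fifth trial is 8 points and the change afterward is 2 points, both relative to a first score of 40.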
To examine group differences in baseline performances and potential short-term learning effects in low and high performers, linear quantile regression was performed on each test for the first 5 trials for the 5th, 25th, 50th, 75th, and 95th percentiles. Quantile regression
Long-term practice effects were assumed for tests with a significant mean difference from the fifth to the last score. A positive association of this difference with the number of repetitions (log-transformed to account for the strong right-skewness), adjusted for the potential confounders age, first score, and fifth score, was considered an additional indicator of long-term practice effects.
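To illustrate the role of the log transformation, the sketch below regresses the fifth-to-last difference on the natural logarithm of the number of repetitions using ordinary least squares. It is deliberately simplified: it is unadjusted, whereas the analysis described above also adjusts for age, first score, and fifth score. The helper name and the data are hypothetical.

```python
import math


def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: repetitions are strongly right-skewed, and the
# improvement grows by 1 point each time the repetitions double.
repetitions = [5, 10, 20, 40, 80]
fifth_to_last = [0.0, 1.0, 2.0, 3.0, 4.0]

log_repetitions = [math.log(r) for r in repetitions]
slope, intercept = ols_slope_intercept(log_repetitions, fifth_to_last)
```

On the log scale the made-up relationship is exactly linear, so a positive slope summarizes "more repetitions, more improvement" without the few participants with very many repetitions dominating the fit.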
For tests suggestive of long-term practice effects that met the 2 abovementioned criteria, learning curve analysis was performed with 1 nonparametric and 3 parametric mixed effect models of increasing complexity, each modeling performance as a function of repetition, grouping by patient for cognition and mobility and by hand for dexterity. The performance of the 4 models was compared using both root mean squared error (RMSE) and the number of (effective)
For the nonparametric model, smoothing splines calculated by generalized additive models were fitted to examine the unbiased shape of the potential learning curves, exhibiting different effective
For the parametric models, simple linear (
We treated boundary and baseline (y_{0}) as random effects, while we considered the growth constant
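The model equations are not reproduced in this text, so the following Python sketch assumes a common exponential form of bounded growth, y(x) = boundary - (boundary - y0) * exp(-k * x), with x counted as repetitions since baseline. The function names, the parameter values, and this parametrization are assumptions for illustration, not the authors' fitted model.

```python
import math


def bounded_growth(repetition, baseline, boundary, rate):
    """Assumed bounded growth curve.

    Starts at `baseline` for repetition 0 and approaches `boundary`
    asymptotically; `rate` is the growth constant k (> 0).
    """
    return boundary - (boundary - baseline) * math.exp(-rate * repetition)


def repetitions_to_fraction(fraction, rate):
    """Repetitions needed to realize a given fraction of the total gain.

    Solves 1 - exp(-k * x) = fraction for x, for example fraction=0.5
    for a 50% practice point and fraction=0.9 for a 90% practice point.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must lie strictly between 0 and 1")
    if rate <= 0.0:
        raise ValueError("rate must be positive")
    return -math.log(1.0 - fraction) / rate


# Hypothetical parameters (not estimates from the study)
baseline, boundary, rate = 40.0, 56.0, 0.1
x50 = repetitions_to_fraction(0.5, rate)  # ln(2) / k
x90 = repetitions_to_fraction(0.9, rate)  # ln(10) / k
y50 = bounded_growth(x50, baseline, boundary, rate)
y90 = bounded_growth(x90, baseline, boundary, rate)
```

Under this assumed form, the practice points depend only on the growth constant, while baseline and boundary set the vertical position and the maximum attainable gain of the curve.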
In addition to our main analysis on practice effects as a function of repetition with the selection criteria of a minimum of 5 weeks and 5 repetitions, we performed 3 additional sensitivity analyses: sensitivity analyses 1 and 3 were modeling practice effects as a function of weeks since the first test instead of the number of repetitions, and sensitivity analyses 2 and 3 were performed using stricter selection criteria of a minimum of 10 weeks and 10 repetitions (
Comparison of the main analysis with the 3 sensitivity analyses performed.
Criteria  Minimum of 5 weeks and 5 repetitions  Minimum of 10 weeks and 10 repetitions 
Practice effects as a function of number of repetitions  Main analysis  Sensitivity analysis 2 
Practice effects as a function of weeks since first test  Sensitivity analysis 1  Sensitivity analysis 3 
All statistical analyses were performed using R 4.0.3 (R Foundation for Statistical Computing). Point estimates are accompanied by 95% CI in brackets, unless otherwise stated.
Of the 1147 patients who performed at least one cognitive
Characteristics of included patients with multiple sclerosis.
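Domain  Cognition  Dexterity  Dexterity  Mobility  Mobility  Mobility  
Test  electronic Symbol Digit Modalities Test  Finger Pinching  Draw a Shape  Two Minute Walk  U-Turn  Static Balance  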
Total, N  1147  1109^{a}  1079^{b}  575  901  1056  

Selected, n (%)  262 (22.8)  264 (23.8)  259 (24)  171 (29.7)  217 (24.1)  257 (24.3)  



Total, N  6240  22,550  21,816  15,512  16,841  19,000  

Selected, n (%)  4824 (77.3)  19,650 (87.1)  19,019 (87.2)  14,393 (92.8)  15,051 (89.4)  16,797 (88.4)  



Total, N  262  499  484  171  217  257  

Female, n (%)  184 (70.2)  353 (70.7)  345 (71.3)  123 (71.9)  155 (71.4)  181 (70.4)  
Age (years), median (IQR; range)  50.2 (42.0-58.0; 20.0-79.0)  50.0 (41.8-58.0; 20.0-79.0)  49.6 (41.5-58.0; 20.0-79.0)  50.0 (41.6-58.1; 20.0-74.5)  49.6 (41.5-57.0; 20.0-79.0)  48.7 (41.1-57.0; 20.0-79.0)  
Number of repetitions, median (IQR; range)  11 (7-18; 5-119)  17 (9-41.5; 5-416)  17 (9-41; 5-414)  30 (15-83.5; 5-827)  24 (11-69; 5-829)  24 (12-67; 5-828)  
Median of intertest intervals (days), median (IQR; range)  7.9 (7.1-10.2; 6.7-87.1)  3.1 (2.1-5.4; 1.9-42.8)  3.3 (2.2-5.6; 1.9-77.6)  1.4 (1.0-3.0; 1.0-24.9)  1.8 (1.1-3.3; 1.0-25.4)  1.7 (1.1-3.1; 0.7-28.4)  
Median of IQR of intertest intervals (days), median (IQR; range)  3.6 (1.0-9.9; 0.0-133.8)  3.3 (1.1-8.0; 0.0-198.1)  3.4 (1.1-8.8; 0.0-251.0)  2.0 (0.9-5.1; 0.1-39.2)  2.6 (0.9-6.9; 0.0-81.5)  2.3 (0.8-7.7; 0.0-93.0)  
Number of weeks from the first to the last test, median (IQR; range)  18.3 (11.5-54.1; 5.0-164.7)  17.4 (10.9-49.4; 5.0-164.7)  17.9 (11.0-49.9; 5.0-164.7)  17.2 (11.1-48.2; 5.0-146.3)  16.3 (10.3-47.0; 5.0-146.3)  16.7 (10.3-47.0; 5.0-152.1) 
^{a}With 26.19% (499/1905) of hands selected.
^{b}With 26.23% (484/1845) of hands selected.
A summary analysis of the 262 selected patients yielded a mean difference from the first to last score of 9.8 correct responses, representing an average observed improvement of 25.4% (95% CI 23.1% to 27.8%) from the first score. Although the majority of this improvement (19.7%, 95% CI 17.5% to 21.9%) occurred up to the fifth score and can thus be considered a short-term learning effect, there was still a significant improvement from the fifth score onward of, on average, 5.7% (95% CI 4.1% to 7.4%), suggesting a long-term practice effect. A multivariate regression model of this difference yielded a significant association with the total number of repetitions, further supporting the long-term practice effects (
Patient-level summary analysis for the
When comparing performances by 5th, 25th, 50th, 75th, and 95th percentile groups up to the fifth trial with quantile regression, baseline performances were normally distributed with intercept estimates of 22.0 (19.1-24.9), 34.0 (32.7-35.3), 40.0 (38.8-41.2), 46.3 (45.1-47.6), and 55.0 (53.4-56.6), respectively. The ANOVA-type test for all 5 slopes (β=1.5-2.0) did not suggest that short-term learning rates for these groups differed significantly (
Linear quantile regression for the
The long-term learning curve analysis showed that the bounded growth model fit the data best with an RMSE of 3.3 correct responses, followed by 3.6 for the smoothing spline, 3.8 for the quadratic, and 4.0 for the linear model (
Learning curve analysis for the
For
A summary analysis of the 499 selected hands yielded a mean difference from the first to last score of 14.3 successful pinches, representing an average observed improvement of 54.2% (95% CI 49.3% to 59.1%) over the first score. Similar to the findings on the
Hand-level summary analysis for
Baseline performances were normally distributed with intercept estimates of 6.0 (95% CI 4.5 to 7.5) for the fifth percentile, 19.0 (95% CI 17.8 to 20.2) for the 25th, 27.0 (95% CI 25.8 to 28.2) for median performers, 37.0 (95% CI 35.6 to 38.4) for the 75th, and 51.7 (95% CI 49.6 to 53.8) for the 95th percentile with quantile regression. The β coefficients for short-term learning up to the fifth trial were the highest for the 75th percentile and median performers with 2.50 (95% CI 1.96 to 3.04) and 2.00 (95% CI 1.45 to 2.55) additional successful pinches per repetition, lower for the 25th percentile (1.50, 95% CI 1.00 to 2.00) and the lowest for the 5th and 95th percentiles (1.25, 95% CI 0.57 to 1.93, and 1.33, 95% CI 0.56 to 2.11, respectively). These differences in slopes between performance levels were significant (ANOVA-type
Linear quantile regression for
Long-term learning curve analysis again showed that the bounded growth model fit the data best with an RMSE of 6.8 successful pinches, followed by 7.5 for the smoothing spline, 7.9 for the quadratic, and 8.1 for the linear model (
Learning curve analysis for
A summary analysis of the 484 selected hands yielded a mean improvement in the number of shapes drawn correctly from the first to last score of 23.9% (95% CI 18.3% to 29.5%), from the first to fifth score of 15.1% (95% CI 9.8% to 20.3%), and from the fifth to last score of 8.8% (95% CI 3.8% to 13.8%). This difference was significantly associated with the total number of repetitions, suggesting a long-term practice effect (
Hand-level summary analysis for
Intercept estimates for baseline performances were 1 shape drawn correctly for the 5th percentile, 2 for the 25th percentile, 3 for the median performers, 5 for the 75th percentile, and 6 for the 95th percentile with quantile regression. In this analysis, only median performers showed a significant short-term learning rate up to the fifth trial (
The long-term learning curve analysis again showed that bounded growth models fit the data best with an RMSE of 1.02 shapes drawn correctly, followed by 1.06 for the smoothing spline, 1.07 for the quadratic, and 1.08 for the linear model (
Learning curve analysis for
A summary analysis of the 171 selected patients yielded no significant difference between the first, fifth, and last scores with a mean difference from the fifth to last score of 1.4 (95% CI −5.2 to 7.9) steps. This difference was also not associated with the total number of repetitions performed (
Patient-level summary analysis for
The distribution of baseline performance was left-skewed with performers in the 5th percentile achieving, on average, 87.0 (95% CI 52.2 to 121.8) steps; in the 25th percentile, 181.0 (95% CI 170.7 to 191.3) steps; on median, 219.0 (95% CI 212.6 to 225.4) steps; in the 75th percentile, 236.0 (95% CI 230.6 to 241.4) steps; and in the 95th percentile, 260.0 (95% CI 255.2 to 264.8) steps. No significant slopes up to the fifth trial could be observed (
A summary analysis of the 217 selected patients yielded a significant improvement from the first to last score with a mean difference in turn speed average of 0.13 rad/s, representing an average observed difference of 11.0% (95% CI 5.7% to 16.2%) over the first score. However, the majority of this difference occurred up to the fifth score (9%, 95% CI 3.7% to 14.3%), and the remaining difference from the fifth to last score (1.9%, 95% CI −2.3% to 6.1%) was neither significant nor associated with the total number of repetitions performed (
Patient-level summary analysis for
Baseline performances estimated with quantile regression were normally distributed with 0.5 rad/s (95% CI 0.5 to 0.6) for the 5th percentile, 0.9 rad/s (95% CI 0.9 to 1.0) for the 25th percentile, 1.3 rad/s (95% CI 1.2 to 1.3) for median performers, 1.5 rad/s (95% CI 1.5 to 1.6) for the 75th percentile, and 2.0 rad/s (95% CI 1.9 to 2.1) for the 95th percentile groups. Only the slope of the 25th percentile group was significant in this analysis up to the fifth trial (β=.04; 95% CI 0.02 to 0.06), and the difference in slopes was not significant in the ANOVA-type test (
A summary analysis of the 257 selected patients yielded a significant difference from the first to last score, with a mean difference in sway path of −16.9 m/s². This is the only test in which lower values are better. Thus, the average observed improvement was −28.6% (95% CI −48.6% to −8.5%) over the first score. However, the majority of this improvement occurred up to the fifth score (−21.1%, 95% CI −45% to −2.8%), and the remaining difference from the fifth to last score (−7.5%, 95% CI −24.1% to 9.2%) was neither significant nor associated with the total number of repetitions performed (
Patient-level summary analysis for
Baseline performance estimates were strongly right-skewed with 5.7 m/s² (95% CI 4.3 to 7.0) for the 5th percentile, 11.7 m/s² (95% CI 9.4 to 14.0) for the 25th percentile, 23.8 m/s² (95% CI 20.7 to 26.8) for the median performers, 68.0 m/s² (95% CI 55.7 to 80.3) for the 75th percentile, and 260.2 m/s² (95% CI 200.4 to 320.0) for the 95th percentile. In this test, the 5th and 25th percentiles are the top-performing groups, and their quantile regression slopes up to the fifth trial are not significant. However, the significant negative slopes of median (β=−2.32; 95% CI −3.47 to −1.17), 75th percentile (β=−8.46; 95% CI −12.21 to −4.72), and 95th percentile (β=−27.25; 95% CI −46.30 to −8.20) performers were increasingly steep, and the overall ANOVA-type difference test yielded
The results of the sensitivity analyses were in line with the results of the main analysis. Sensitivity analysis 2, which used stricter inclusion criteria with a minimum of 10 weeks and 10 repetitions, was overall very similar with expected further increases in mean improvement from the fifth to last score (mean improvement 5.7% for main analysis vs 9.1% for sensitivity analysis 2 for
Strong long-term practice effects were found for
These practice effects likely include both short-term learning effects, where patients become acquainted with the tests, and long-term practice effects. We believe these effects have not only different origins, time scales, and magnitudes but also different implications for the use of digital assessments in clinical studies and clinical practice. Short-term learning effects can be addressed by ensuring that participants have sufficient training before the observational period; long-term practice effects constitute a significant challenge for all applications beyond trials with a comparator arm. Although these effects are impossible to untangle in an unsupervised setting like this, we considered improvements up to the fifth trial to be more likely due to short-term learning and improvements afterward more likely because of long-term practice effects, based on the recommendation to use the fifth trial of the 9HPT as baseline [
For
For
The 3 sensitivity analyses confirmed our main findings. However, for sensitivity analyses 1 and 3, which modeled practice effects as a function of weeks since the first test instead of the number of repetitions, the effect sizes were smaller. We believe this is caused by the irregular nature of these time-series data, as the intertest intervals differed widely, highlighting a complication in user-scheduled testing (
Only a few studies have examined practice effects in smartphone-based tests for patients with MS. Bove et al [
In addition, Liao et al [
Practice effects are well known for SDMT in both healthy controls and patients with MS, although the effect sizes reported were highly variable. Morrow et al [
In contrast, Benedict et al [
Roar et al [
Indeed, all of the traditional paper-and-pencil SDMT versions have the limitation of a fixed key, which is why Benedict et al [
With our result of an average boundary improvement over baseline of 40.8%, we can show that with weekly testing, practice effects for SDMT are likely to be stronger than with monthly testing, as performed by the abovementioned studies, and at least partly independent of the key.
As a limitation, the smartphone-based test in this study was not oral but based on touching a number pad, potentially biasing the results by dexterity problems (and dexterous practice effects). However, our analysis of
Practice effects often become apparent in the examination of test-retest, intrarater, and interrater reliability. In this way, Cohen et al [
Solari et al [
The smartphone-based
No practice effects were found for T25FW, which was examined alongside 9HPT in the abovementioned studies [
MS diagnoses of study participants were self-declared, and there was no confirmation or assessment by health professionals. In addition, no clinical information was available for the participants to compare with their performance in digital tests. Differences caused by disease duration, severity, or treatment could not be analyzed.
In addition, we observed a high variability of results, which is most likely partly due to biomedical day-to-day fluctuations and partly due to circumstantial and technical noise, for example, caused by interrupted test performance or sensor error. However, it is impossible to determine these effects using the present data set.
Finally, these time-series data are highly irregular and have strong right-skewness. Our models expect data missing at random. We found no evidence that baseline performance influenced adherence and the number of repetitions, but age was found to be a confounder for all domains. Interestingly, older people tended to perform more repetitions than younger people (
In summary, we analyzed the practice effects in 6 active smartphone-based tests for cognition, dexterity, and mobility performed at high frequencies. Smartphone-based tests promise to help monitor MS disease trajectories, and there are currently multiple initiatives in development [
Pairwise associations for the electronic Symbol Digit Modalities Test of the mean age, first score, fifth score, log-transformed number of repetitions performed, last score, and the difference from the fifth to the last score and their respective histograms (n=262 patients).
Model selection for electronic Symbol Digit Modalities Test: comparison of 4 mixed models of increasing complexity (parametric linear, quadratic and bounded growth models, and the nonparametric smoothing spline model) by root mean squared error (RMSE) and (effective) degrees of freedom (eDF) used.
Comparison of correct responses of the electronic Symbol Digit Modalities Test (x-axis) and of the electronic Symbol Digit Modalities Test divided by baseline (y-axis) to correct for dexterity and reaction speed (n=6190 tests).
Patient-level summary analysis for the electronic Symbol Digit Modalities Test corrected for dexterity and reaction speed: comparison of the first, fifth, and last score. Multivariate association of the difference from the fifth to the last score with age, first and fifth score, and the log-transformed number of repetitions (n=262 patients).
Linear quantile regression for the electronic Symbol Digit Modalities Test (corrected for dexterity and reaction speed) of short-term learning effects up to the fifth repetition. Comparison of baseline performance and linear slope of low (5th and 25th percentiles), median, and high performers (75th and 95th percentiles). Quantile regression P values are Bonferroni-adjusted (n=1310 tests).
Learning curve analysis for the electronic Symbol Digit Modalities Test corrected for dexterity and reaction speed: bounded growth mixed model of practice effects with 95% CI band and baseline, 50%, and 90% practice points marked (m=slope of tangent; n=4801 tests).
Pairwise associations for Finger Pinching of the mean age, first score, fifth score, log-transformed number of repetitions performed, last score, and the difference from the fifth to the last score and their respective histograms (n=499 hands).
Model selection for Finger Pinching: comparison of 4 mixed models of increasing complexity (parametric linear, quadratic and bounded growth models, and the nonparametric smoothing spline model) by root mean squared error and (effective) degrees of freedom used.
Pairwise associations for Draw a Shape of the mean age, first score, fifth score, log-transformed number of repetitions performed, last score, and the difference from the fifth to the last score and their respective histograms (n=484 hands).
Linear quantile regression for Draw a Shape of short-term learning effects up to the fifth repetition. Comparison of baseline performance and linear slope of low (5th and 25th percentiles), median, and high performers (75th and 95th percentiles). Quantile regression P values are Bonferroni-adjusted (n=2420 tests).
Model selection for Draw a Shape: comparison of 4 mixed models of increasing complexity (parametric linear, quadratic and bounded growth models, and the nonparametric smoothing spline model) by root mean squared error and (effective) degrees of freedom used.
Pairwise associations for Two Minute Walk of the mean age, first score, fifth score, log-transformed number of repetitions performed, last score, and the difference from the fifth to the last score and their respective histograms (n=171 patients).
Linear quantile regression for Two Minute Walk up to the fifth repetition. Comparison of baseline performance and linear slope of low (5th and 25th percentiles), median, and high performers (75th and 95th percentiles). Quantile regression P values are Bonferroni-adjusted (n=855 tests).
Pairwise associations for U-Turn of the mean age, first score, fifth score, log-transformed number of repetitions performed, last score, and the difference from the fifth to the last score and their respective histograms (n=217 patients).
Linear quantile regression for U-Turn of short-term learning effects up to the fifth repetition. Comparison of baseline performance and linear slope of low (5th and 25th percentiles), median, and high performers (75th and 95th percentiles). Quantile regression P values are Bonferroni-adjusted (n=1085 tests).
Pairwise associations for Static Balance of the mean age, first score, fifth score, log-transformed number of repetitions performed, last score, and the difference from the fifth to the last score and their respective histograms (n=257 patients).
Linear quantile regression for Static Balance of short-term learning effects up to the fifth repetition (a smaller sway path is better). Comparison of baseline performance and linear slope of low (5th and 25th percentiles), median, and high performers (75th and 95th percentiles). Quantile regression P values are Bonferroni-adjusted (n=1285 tests).
Key results for the main analysis versus sensitivity analyses 1-3 for cognition, dexterity, and mobility.
9HPT: 9-hole peg test
ANOVA: analysis of variance
MS: multiple sclerosis
RMSE: root mean squared error
SDMT: Symbol Digit Modalities Test
T25FW: timed 25-foot walk
The authors thank all the participants of the Floodlight Open study for their participation and Roche for making the data set publicly available.
TW designed the study, performed the analysis, interpreted the data, and drafted and revised the manuscript. JL designed the study, interpreted the data, and revised the manuscript. TW and JL had full access to the available source data and guaranteed the integrity of the analysis. SP, AW, LK, and YN helped interpret the data and revised the manuscript for important intellectual content.
The research activities of the Research Center for Neuroimmunology and Neuroscience Basel, the affiliation of TW, SP, LK, YN, and JL, are supported by the University Hospital Basel, the University of Basel, and by grants from Novartis and Roche. One of the main projects of Research Center for Neuroimmunology and Neuroscience Basel is the development of a new comprehensive MS digital solution. TW and SP report no further conflicts of interest. AW reports no conflicts of interest. The University Hospital Basel, as the employer of LK, has received and dedicated to research support fees for board membership, consultancy or speaking, or grants in the past 3 years from Actelion, Bayer, Biogen Idec, CSL Behring, Eli Lilly EU, Genmab, GeNeuro SA, Janssen, Merck Serono, Novartis, Roche, Santhera, Sanofi-Aventis, Teva, European Union, Innosuisse, Roche Research Foundation, Swiss MS Society, and Swiss National Research Foundation. The University Hospital Basel, as the employer of YN, has received financial support for lectures from Teva and Celgene and grant support from Innosuisse (Swiss Innovation Agency). JL received research support from Innosuisse, Biogen, and Novartis; he received speaker honoraria and/or served on advisory boards for Biogen, Novartis, Roche, and Teva.