Time to Change for Mental Health and Well-being via Virtual Professional Coaching: Longitudinal Observational Study

Background
Optimal mental health yields many benefits and reduced costs to employees and organizations; however, the workplace introduces challenges to building and maintaining mental health that affect well-being. Although many organizations have introduced programming to aid employee mental health and well-being, the uptake and effectiveness of these efforts vary. One barrier to developing more effective interventions is a lack of understanding about how to improve well-being over time. This study examined not only whether employer-provided coaching is an effective strategy to improve mental health and well-being in employees but also how this intervention changes well-being in stages over time.

Objective
The goal of this study was to determine whether BetterUp, a longitudinal one-on-one virtual coaching intervention, improves components of mental health and psychological well-being, and whether the magnitude of change varies in stages over time. This is the first research study to evaluate the effectiveness of professional coaching through three repeated assessments, moving beyond a pre-post intervention design. The outcomes of this study will enable coaches and employers to design more targeted interventions by outlining when to expect maximal growth in specific outcomes throughout the coaching engagement.

Methods
Three identical assessments were completed by 391 users of BetterUp: prior to the start of coaching, after approximately 3-4 months of coaching, and again after 6-7 months of coaching. Three scales were used to evaluate psychological and behavioral dimensions that support management of mental health: stress management, resilience, and life satisfaction. Six additional scales were used to assess psychological well-being: emotional regulation, prospection ability, finding purpose and meaning, self-awareness, self-efficacy, and social connection.

Results
Using mixed-effects modeling, varying rates of change were observed in several dimensions of mental health and psychological well-being. Initial rapid improvements in the first half of the intervention, followed by slower growth in the second half of the intervention, were found for prospection ability, self-awareness, self-efficacy, social connection, emotional regulation, and a reduction in stress (range of unstandardized β values for each assessment: .10-.19). Life satisfaction improved continuously throughout the full intervention period (β=.13). Finding purpose and meaning at work and building resilience both grew continuously throughout the coaching intervention, but larger gains were experienced in the second half of the intervention (β=.08-.18), requiring the full length of the intervention to realize maximal growth.

Conclusions
The results demonstrate the effectiveness of BetterUp virtual one-on-one coaching to improve psychological well-being, while mitigating threats to mental health such as excessive and prolonged stress, low resilience, and poor satisfaction with life. The improvements across the collection of outcomes were time-dependent, and provide important insights to users and practitioners about how and when to expect maximal improvements in a range of interrelated personal and professional outcomes.


Vision for BetterUp Assessments
BetterUp's mission is to help every professional lead their life with greater clarity, purpose, and passion. Self-exploration and self-discovery are powerful components of the coaching process. Coaches often employ a range of tools to support client self-exploration. These tools vary in the degree of structure and facilitation required. Common tools include self-observation exercises, journaling, behavioral practices, topic-specific exercises, mindfulness exercises, and topic-specific educational tools (e.g., readings, videos).
Another broad category of coaching tools is psychometric assessments. Assessments are a common tool used to facilitate self-discovery. They share many of the features of less structured/formal tools mentioned above; however, they typically have the added benefit of being grounded in theory and having measurable properties of quality such as reliability and validity.
If the end goal is facilitating self-discovery and awareness, and ultimately transforming behavior, assessments need to be accurate to be valuable tools in helping our coachees (referred to as BetterUp members) take action. Sound psychometric assessments can provide valuable information to coaches to inform their approach. Thus, the reliability and validity of these assessments are critical. In addition to facilitating the coaching process, accurate and reliable assessments help organizations (referred to as BetterUp partners) evaluate the impact of coaching at both the individual and aggregate levels by measuring growth throughout the engagement.
Psychometric assessments can be used to measure a broad variety of individual differences. Examples of common assessment types include personality, interests, preferences, skills, performance, knowledge, and abilities. By capitalizing upon an assessment's ability to provide different lenses through which clients can view themselves, in combination with the support of a coach, we can build a personalized self-discovery experience for our coachees. The suite of assessments described in this technical report will create a psychometrically sound assessment library that can be leveraged by organizations, coaches, and coachees to both facilitate and track growth.
While the assessment library will be built with flexibility and personalization in mind, we want to do so in a way that does not impede obtaining a holistic perspective or limit BetterUp's ability to accumulate knowledge. To balance these priorities, coachees will receive a common set of assessments at certain points in their coaching experience (e.g., at onboarding and at predetermined "reflection points"), referred to as "core" assessments. Additional assessment content from the library can then be delivered to personalize the coachees' experience through push (e.g., assigned by the coach, automatically suggested based on where they are in their development journey, or assigned by their organization) or pull (coachee initiated/selected) mechanisms. Ultimately, the ability to consistently provide new personal insights is expected to (a) enhance coachee engagement, (b) serve as a coaching tool to help tailor coaching, and (c) provide additional data to feed ongoing development through the use of diverse, personalized, and engaging assessments. In sum, these changes should create a more integrated experience across coaching and other platform components, such as assessments, extended network (i.e., specialist coaching), and smart learning.

Review of the Literature
We developed the guiding framework of WPM 2.0 based on a thorough review of research published in top-tier academic journals. During the literature review, we focused on recent, cutting-edge advancements. Given BetterUp's mission to help professionals lead their lives with greater clarity, purpose, and passion, we specifically focused on research on goal-setting, well-being, and leadership. These literatures informed the outcomes we selected to measure in the model.
The core outcomes we identified are outlined below:

Self-awareness. Self-awareness is the extent to which we direct our consciousness to focus on the self [1]. Self-awareness relates to the goal component of Motivational Systems Theory. We must be aware of our behaviors in order to set strategic personal goals and work towards achieving them [2]. BetterUp's data suggest that self-awareness is one of the most foundational topics discussed in coaching sessions, and research suggests that self-awareness coupled with goal-setting is predictive of high performance [3].

Self-efficacy. Self-efficacy refers to individuals' beliefs about their capabilities to achieve their goals [4]. In Motivational Systems Theory, self-efficacy is a personal agency belief that helps predict and explain individuals' behaviors. Research suggests that self-efficacy is associated with stronger effort, persistence, and goal attainment [5].

Social Connection. The extent to which we remain close and engaged with important, supportive people in our lives. By maintaining our close personal relationships, we feel a heightened sense of belonging, and we receive social support that enhances our well-being [6].

Prospection. The extent to which we pragmatically think through ways to achieve our future goals. By envisioning desired future states and figuring out how to turn those desires into reality, we are able to enhance personal meaning and motivation [7].

Emotional Regulation. The extent to which we regulate our emotions to remain calm and collected. The more emotionally stable we are, the better equipped we are to remain resilient and excel when challenges arise [8].

Resilience. The extent to which individuals positively adapt in the context of negative or stressful experiences [9]. Resilience in the workplace is becoming increasingly important as novel demands such as new technologies and changing governmental policies become more prevalent [10]. In addition to its link with performance, resilience is associated with higher well-being [11].

Stress. The extent to which individuals experience tension as a result of work and non-work demands. Work stress imposes an estimated $187 billion financial burden in the United States [12]. In addition to its detrimental toll on employee well-being, employee stress is associated with organizational health care costs and productivity losses [13].

Purpose and Meaning. The extent to which individuals experience a sense of personal meaning in what they do at work. As mentioned earlier in the literature review, BetterUp research demonstrates that employees highly value meaningful work; on average, employees would take a substantial pay cut in exchange for more meaningful work [14]. Employers may also indirectly benefit from employee meaning and purpose through reduced turnover and increased performance [15].

Life Satisfaction. The extent to which individuals make positive global assessments of the quality of their lives [16]. Life satisfaction is a cognitive, judgmental process that centers on individuals' own evaluations. Because life satisfaction is a subjective experience, it can be influenced by one's mindsets and behaviors without changes in the external environment [17].

Sample
The validation effort recruited 1030 qualified MTurk workers for participation. MTurk is a platform that provides relatively quick online access to large samples of workers [18]. Research suggests that MTurk workers are generally representative of the U.S. population. MTurk samples tend to be older, more ethnically diverse, and have more work experience than traditional college student samples [19]. Research has found that MTurk samples produce test-retest reliabilities and coefficient alphas similar to those of other research samples [20]. We adopted the following methodology to ensure that the MTurk sample was representative of the average BetterUp member. First, we administered a short demographic survey and restricted invitations to participate in the study to individuals who work full-time on a team of three or more individuals, for whom English is a primary language, and who have MTurk worker approval scores above 95% [21]. Additionally, all participants eligible for inclusion in this validation study had previously completed other studies with BetterUp. This gave us access to additional assessments beyond what was initially included in the WPM 2.0 validation efforts. Overall, our sample was 57% male, averaged 39 years of age, had an average job tenure of 6 years, and worked an average of 42 hours per week. For a full demographic and industry breakdown of sample participants, refer to Tables 1-3.

Scale Development
To develop the scales, members of BetterUp Labs followed best practices as outlined in The Standards for Educational and Psychological Testing [22], which describe four sources of validity evidence: (a) test content, (b) response processes, (c) internal structure, and (d) relations to other measures.
The first step in designing the scales was to identify constructs to include in the model. Using the knowledge gained from the validation of earlier scales, we conducted an updated review of the literature, held discussions with key stakeholders [23], and developed a hypothesized model. A team of nine industrial-organizational psychologists, including BetterUp Labs team members, contractors, and university faculty, contributed to this effort. BetterUp psychologists then wrote operational definitions for all constructs identified. These definitions were developed through an extensive review of the literature. Items were then written to measure the relevant outcomes [24]. All items were compiled, reviewed, and revised by two separate team members. A post-review discussion was then held to resolve issues and remove items as needed. Because most of these scales were developed for this project, we identified existing scales previously validated in the literature to evaluate discriminant and convergent validity.
All items included in this validation effort used one of four different response scales. We chose to employ a variety of rating scales when creating our initial item pool to increase our ability to assess constructs from a variety of perspectives. Using a variety of response options also helped keep the assessment short by minimizing item redundancy and increased the likelihood that items would assess distinct portions of their intended construct.

Analytic Approach Overview
The primary goals in the development were to ensure strong psychometric properties of the scales, to make the collection of scales both comprehensive and flexible, and to create a clear narrative for members and coaches.
Several phases were conducted to achieve these goals. Phase 1 reduced the number of items per subscale through item analysis. Phase 2 examined the correlations among the resulting dimensions. Phase 3 assessed whether the measures demonstrated the expected relationships with established, previously validated measures of similar constructs. Phase 4 assessed whether there were differences in scores across demographic groups. Phase 5 assessed whether scores on the assessments remained consistent 1 and 3 months after the initial survey. Finally, during Phase 6 we sought feedback from the prominent scholars on BetterUp's Science Board to provide a final review of our validation efforts.

Model Development and Validation
Phase 1: Item Analysis
The primary goal of this phase of the validation was to reduce the number of items per subscale by removing poorly performing items based on various classical test theory (CTT) item statistics. Item performance was evaluated using means, standard deviations, skewness, kurtosis, inter-item correlations, item-total correlations, and Cronbach's alpha. Items were considered poorly performing if their means were too high, standard deviations were too low, skewness and kurtosis values were too high, or inter-item / item-total correlations were too low (in both relative and absolute terms). Overall, the purpose of evaluating these statistics was to test whether the items within each scale correlated highly enough that a total score was appropriate for each item set. Additionally, these analyses allowed us to reduce the number of items administered. The final operational subscale definitions and their reliabilities can be found in Appendix B.
Through CTT we were able to trim each scale down from 3-10 items to 2-4 items. All multi-item scales retained an acceptable internal consistency reliability.
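
As an illustration of two of the CTT statistics named above, the following minimal sketch computes Cronbach's alpha and corrected item-total correlations for a single scale. The item scores and the function names (cronbach_alpha, corrected_item_total) are hypothetical and are not taken from the validation data or from BetterUp's analysis code; they only show how the statistics are defined.

```python
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cronbach_alpha(items):
    """items: list of item columns, each holding one score per respondent."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]
    item_variance_sum = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def corrected_item_total(items):
    """Correlate each item with the sum of the remaining items in the scale."""
    correlations = []
    for i, col in enumerate(items):
        rest = [sum(row) - row[i] for row in zip(*items)]
        correlations.append(pearson_r(col, rest))
    return correlations


# Hypothetical responses: 3 items answered by 6 respondents on a 1-5 scale.
items = [
    [4, 5, 3, 2, 4, 5],
    [4, 4, 3, 2, 5, 5],
    [3, 5, 2, 2, 4, 4],
]

alpha = cronbach_alpha(items)
item_total = corrected_item_total(items)
print(round(alpha, 4))
print([round(r, 2) for r in item_total])
```

An item whose corrected item-total correlation is low relative to the others in its scale would be a candidate for removal under the criteria described in this phase.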

Phase 2: Final Dimension Correlation
We next evaluated the correlations between each of our newly developed measures (Table 4). As expected, the vast majority of our dimensions were related.

Phase 3: Dimension Discriminant and Convergent Validity
In addition to examining the patterns of correlations between dimensions, we also examined the relationships between the selected dimensions and measures previously validated in the well-being literature (i.e., marker scales). This included the Authentic Leadership Questionnaire (ALQ) [25], Psychological Capital (PsyCap) [26], and Seligman's PERMA [27]. Table 5 presents the correlations between each dimension and the marker variables included in this validation study.
Overall, these analyses provided further evidence that the new assessments accurately and comprehensively assess the well-being construct space.

Phase 4: Demographic Comparisons
Several multivariate analyses of variance (MANOVAs) were conducted to detect differences across gender and ethnicity on the different dimensions. MANOVAs were chosen as a first step due to the moderate multicollinearity among the dimensions and as an attempt to control the family-wise error rate. Using a MANOVA provides an indication of the magnitude of differences across the components between these demographic groups. In the following analyses, we interpret the multivariate test statistic, Wilks' lambda (Λ), and the effect size indicator, partial eta squared (ηp²). Established cutoffs suggest that ηp² values of .01, .06, and .14 can be considered small, medium, and large effect sizes, respectively [28]. All variables demonstrated linearity and acceptable univariate normality. However, all Box's M tests were significant, which suggests that the covariance matrices are unequal across groups. When given sufficiently large sample sizes, MANOVAs are robust to this violation; however, in the ethnicity analyses there is a large degree of unevenness in the sample sizes across groups. In these cases, this assumption violation can result in significance testing that is either too liberal or too conservative. As a follow-up to a significant MANOVA, we computed t-tests for demographic variables consisting of two groups (i.e., gender), and performed an appropriate post hoc test for the ethnicity variable, which included four groups.

Gender
The multivariate analysis found significant differences attributable to gender, Λ = .86, F(25, 1003) = 6.62, p < .001, ηp² = .14. This meets the criterion for a large effect size. Results suggested that approximately 14% of the variance of the dimensions can be explained by gender.
Univariate t-tests were performed to explore differences between men and women across the dimensions (Table 6). Women rated themselves significantly higher across several beneficial WPM 2.0 components and lower on several harmful components. The largest differences were in Emotional Regulation, Stress, and Resilience.
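
The follow-up comparison for a single dimension can be sketched as below. The report does not state which t-test variant or effect size formula was used, so Welch's unequal-variance t statistic and a pooled-SD Cohen's d are assumptions here, and the scores are hypothetical rather than values from Table 6.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and its approximate degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_sd = sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled_sd


# Hypothetical scores on one dimension for two groups.
group_a = [3.8, 4.2, 4.0, 3.6, 4.4]
group_b = [3.4, 3.9, 3.5, 3.7, 3.0]

t_stat, df = welch_t(group_a, group_b)
d = cohens_d(group_a, group_b)
print(round(t_stat, 2), round(df, 1), round(d, 2))
```

Reporting an effect size alongside each t statistic helps separate differences that are merely detectable in a large sample from those that are large enough to matter in practice.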

Ethnicity
The multivariate analyses found statistically significant differences attributable to ethnicity, Λ = .90, F(75, 2903) = 1.31, p < .05, ηp² = .03. Although significant, these results suggested that only 3% of the variance was explained by participants' ethnicity. This is the smallest amount of variance explained by the demographic variables assessed. Means and standard deviations for each group across all dimensions can be found in Table 7.
Given the unequal variances and sample sizes discussed previously, a Games-Howell post hoc test was performed to investigate group differences. The largest of the statistically significant differences were that participants identifying as Black reported higher levels of Prospection, and participants identifying as Asian reported the highest levels of Self-Efficacy. Overall, all of the differences were small in magnitude.
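
The effect sizes reported for gender and ethnicity are consistent with a commonly used conversion from Wilks' lambda to multivariate partial eta squared, ηp² = 1 − Λ^(1/s), where s is the smaller of the number of dependent variables and the number of groups minus one. Whether the original analysis used exactly this formula is an assumption, as is the count of 25 dependent variables, which is inferred from the numerator degrees of freedom of the reported F tests. A minimal sketch:

```python
def partial_eta_squared(wilks_lambda, n_dvs, n_groups):
    """Multivariate partial eta squared from Wilks' lambda.

    s = min(number of dependent variables, number of groups - 1).
    """
    s = min(n_dvs, n_groups - 1)
    return 1 - wilks_lambda ** (1 / s)


# Lambda values as reported in the text; n_dvs = 25 is inferred, not stated.
gender_eta = partial_eta_squared(0.86, n_dvs=25, n_groups=2)
ethnicity_eta = partial_eta_squared(0.90, n_dvs=25, n_groups=4)

print(round(gender_eta, 2))     # 0.14
print(round(ethnicity_eta, 2))  # 0.03
```

With two groups s equals 1, so ηp² is simply 1 − Λ; with four groups the cube root of Λ is used, which is why a similar Λ corresponds to a much smaller effect size for ethnicity.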

Phase 5: Test-Retest
The next step in the validation process was to complete two separate test-retest reliability studies. The first was completed one month and the second was completed three months after the original assessment. Collecting test-retest data allowed us to measure changes in scores in an uncoached sample and evaluate the stability of these metrics. In the absence of coaching, we expect our assessments to demonstrate minimal change over time.

One-Month Retest
We first examined test-retest correlations of each of the dimensions. The test-retest correlations were moderate to high (rs = .64-.85), as expected. We then examined individuals' change scores on each dimension across time. Paired sample t-tests demonstrated that the average change scores were not significantly different from zero. The high degree of stability in these scales over time and the minimal change in the absence of coaching is a critical finding for evaluating change scores among BetterUp members going through coaching, where we would expect to see changes in scores over time.

Three-Month Retest
The test-retest correlations from Time 1 (original assessment) to Time 3 (three months later) were moderate to high (rs = .63-.90), as expected. Just as before, we then examined individuals' change scores on each dimension across time. Paired sample t-tests demonstrated that the average change score was significantly different from zero only for prospection, and the effect size of this difference was small (d = .11). Together, the findings support our hypothesis that our measures are reliable over a three-month period. It is not surprising that the measures were not as highly correlated across a three-month span as in the one-month comparison, yet they still demonstrated a high degree of stability.
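
The two checks used in this phase, a test-retest correlation and a paired-sample t-test on change scores, can be sketched as follows. The scores are hypothetical, and the effect size shown (mean change divided by the standard deviation of the change scores) is one common definition of d for paired data; the report does not state which definition was used.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def paired_t(time1, time2):
    """Paired-sample t statistic, degrees of freedom, and effect size d."""
    diffs = [b - a for a, b in zip(time1, time2)]
    n = len(diffs)
    sd = stdev(diffs)
    t = mean(diffs) / (sd / sqrt(n))
    d = mean(diffs) / sd
    return t, n - 1, d


# Hypothetical scores for six uncoached participants on one dimension.
time1 = [3.2, 4.1, 2.8, 3.9, 4.5, 3.0]
time2 = [3.3, 4.0, 2.9, 3.8, 4.6, 3.1]

retest_r = pearson_r(time1, time2)
t_stat, df, d = paired_t(time1, time2)

# 2.571 is the two-tailed critical t value at alpha = .05 for df = 5.
stable = abs(t_stat) < 2.571
print(round(retest_r, 2), round(t_stat, 2), df, stable)
```

A high retest correlation shows that participants keep their rank order over time, while a non-significant paired t statistic shows that the group mean did not shift; both are needed to argue that later changes among coached members reflect the intervention.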

Phase 6: Science Board Review
After completing the previous phases of the validation efforts, BetterUp's Science Board conducted a final review of our validation efforts. BetterUp's Science Board includes luminaries from across scientific disciplines including psychology, business, and management. The Science Board was created to ensure that BetterUp benefits from broad insight into core scientific, business, and technological trends, and to help shape our short- and long-term business strategies. As another layer of expert input, the Science Board was asked to review the technical approach and the assessments. This was intended to provide an additional independent review by highly respected scientists who were not involved in the validation work.
Overall, the assessments were received positively by our Science Board members. Feedback mainly centered on the nomenclature of the individual psychological constructs of the assessment, clarity of specific items, and considerations for future content and research. Feedback was reviewed by the validation team and implemented where possible or noted for future research. The Science Board review process helped ensure that our validation efforts reached the highest standards for psychometric testing and scientific excellence.

Summary of Key Findings
Our validation efforts reduced model redundancy and length, validated our newly developed measures against previously established assessments, and found few unexpected differences across demographic groups.
We assessed the relationships among all of the components and previously validated measures assessing similar psychological constructs. This included established leadership (ALQ), well-being (PERMA & PsyCap), and life satisfaction metrics. Overall, these analyses provide evidence that the collection of assessments is measuring what it was intended to measure.
Minor differences between demographic groups were found on the developed components. The majority of the differences were expected. The largest group differences were for gender. In our analyses of gender, we found that women self-rated higher on Emotional Regulation and Resilience, and lower on Stress.
The test-retest analyses demonstrated that the measures are generally reliable over time. Test-retest correlations after one month ranged from .64 to .85. Test-retest correlations after three months ranged from .63 to .90. Paired sample t-tests suggested that the average change scores on the measures were generally not significantly different from zero across both one and three months, which is an important finding given that these participants did not receive coaching through BetterUp's platform.
Overall, the validation of these dimensions, subsequently supported through analyses of the benchmarking datasets, resulted in a comprehensive assessment comprising mindsets and behaviors that promote well-being, leadership behaviors, and the key outcomes. This will allow BetterUp to accurately provide feedback to members, coaches, and partners, and ultimately work to direct coaching in a way that is the most meaningful and developmentally impactful across all domains of life.

Appendix B: Subscale Definitions and Reliabilities

Emotional Regulation
The extent to which we regulate our emotions to remain calm and collected. The more emotionally stable we are, the better equipped we are to remain resilient and excel when challenges arise.

Life Satisfaction α = .84
The extent to which we are fulfilled in our lives.

Prospection α = .87
The extent to which we pragmatically think through ways to achieve our future goals. By envisioning desired future states and figuring out how to turn those desires into reality, we are able to enhance personal meaning and motivation.

Purpose & Meaning α = .92
The extent to which we experience a sense of personal meaning associated with what we do at work.

Resilience α = .88
The extent to which we can recover quickly from stressful experiences.

Self-Awareness α = .70
The extent to which we direct our consciousness to focus on the self. We must be aware of our behaviors in order to set strategic personal goals and work towards achieving them.

Self-Efficacy α = .84
The extent to which we believe that we are capable of achieving our goals.

Social Connection α = .80
The extent to which we remain close and engaged with important, supportive people in our lives. By maintaining our close personal relationships, we feel a heightened sense of belonging, and we receive social support that enhances our well-being.

Stress α = .89
The extent to which we experience tension as a result of personal and work circumstances.