Recently there has been a proliferation of interactive tailored patient assessment (ITPA) tools. However, evidence of the reliability and validity of these instruments is often missing, which makes their value in research studies questionable. Because several of the common methods to evaluate instrument reliability and validity are not applicable to interactive tailored patient assessments, informatics researchers may benefit from some guidance on which methods of reliability and validity assessment they can appropriately use. This paper describes the main differences between interactive tailored patient assessments and assessment instruments based on psychometric, or classical test, theory; it summarizes the measurement techniques normally used to ascertain the validity and reliability of assessment instruments based on psychometric theory; it discusses which methods are appropriate for interactive tailored patient assessments and which are not; and finally, it illustrates the application of some of the feasible techniques with a case study that describes how the reliability and validity of the tailored symptom assessment instrument called Choice were evaluated.
Recent years have seen a proliferation of interactive health communication tools, together with a growing trend toward empowering patients to take a more active role in their own health care. A prerequisite to effectively helping patients in need of care is to elicit their symptoms and health problems from their perspective. Interactive tailored patient assessments (ITPAs) have become increasingly important as a means of eliciting patients’ illness experiences and tailoring patient care or self-care recommendations to each patient’s individual needs. The ease of deployment of Web-based surveys has made the use of interactive tailored questionnaires more common, and software that allows researchers to rapidly develop custom-tailored questionnaires has started to emerge.
Interactive tailored patient assessments have a number of advantages compared with standardized assessments, in which respondents are required to complete all questions. In interactive tailored patient assessments, the questions can be tailored to each patient individually based on his or her initial responses. Superfluous questions are eliminated, and the questions that remain are more relevant to the patient. For example, the Dialogix system developed at Columbia University implements structured interviews on a series of Web pages. It supports complex branching and conditional tailoring so that questions and summary reports can be tailored to the subject’s responses [
One might argue that there do exist traditional assessments that behave somewhat like interactive tailored patient assessments; for example, “if you answered ‘yes’ to question 4, skip questions 5 through 8,” and so on. However, in assessments of this type,
The credibility of interactive tailored patient assessments depends on their ability to adequately capture patients’ experienced symptoms and problems. Validity and reliability are, therefore, crucial issues. Despite an increasing number of studies that use interactive tailored patient assessments as research tools, even in randomized controlled trials, information about reliability and validity is often missing. Consequently, those wishing to implement a specific interactive tailored patient assessment in practice have little assurance about the instrument’s reliability and validity. Also, without such evidence, it is difficult to disseminate study results outside the informatics community and into the clinical literature where a minimum standard for reporting reliability and validity is required for publication. A minimal standard for research instruments should at least include test results of one type of reliability for the group being tested, one type of content validity, and at least one type of criterion-related or construct validity [
Psychometric theory offers a number of techniques to examine the reliability and validity of research instruments. However, many of these techniques only apply to instances in which individuals respond to the same set of items, in contrast to interactive tailored patient assessments, in which each informant responds to a different subset of individually selected items. Thus, informatics researchers who are interested in developing an interactive tailored patient assessment are left with the question of which methods they can appropriately use to establish its reliability and validity.
The purpose of this paper is to provide some guidance on evaluating reliability and validity of interactive tailored patient assessments. In it, we (1) describe the main differences between interactive tailored patient assessments and assessment instruments based on classical test theory, using a tailored symptom assessment instrument called Choice as an example, (2) summarize the psychometric techniques normally used to ascertain the validity and reliability of instruments for self-reported assessments, (3) discuss which methods are appropriate for interactive tailored patient assessments and which are not, and finally, (4) illustrate the application of some of the feasible techniques with a case study that describes measurement of the reliability and validity of the Choice instrument. This may serve as a model for other researchers for evaluating reliability and validity of interactive tailored patient assessments.
Choice is the name of a suite of tailored symptom assessment tools designed to help patients report their experienced symptoms and health problems so that their care providers can tailor patient care to each patient’s individual symptoms, problems, and needs. The Choice application used here as an example targets patients with chronic and serious long-term illnesses such as cancer. However, interactive tailored patient assessments are also applicable to other patient populations.
The application is contained and administered via a tablet computer with a touch-sensitive screen or is administered via an Internet application. It supports complex branching, so only relevant questions are asked, and conditional tailoring, so questions are tailored to a subject’s previous responses. For example, in the Choice cancer module, patients first identify among 19 problem categories those that apply to themselves. This triggers a subset of related symptoms from which patients again only select those that apply. For example, if patients initially select the “Problems with eating and drinking” category, they are presented with a more detailed list that helps them specify their eating and drinking problems (eg, taste changes, lack of appetite). The patients then rate the degree of bother and their priorities for care for the selected symptoms. When they are done, the system creates an assessment summary that displays patients’ selected symptoms ranked by their priorities for care. This summary can be used by patients and clinicians for subsequent shared care planning. The Choice instrument has consistently been demonstrated to significantly increase congruence between patients’ reported symptoms and patient care in both rehabilitation and cancer patients [
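The branching and ranking steps described above can be sketched in a few lines. This is a minimal illustration only, not the Choice implementation: the item bank structure, the second category ("Problems with sleep") and its symptoms, the `Rating` fields, and the numeric scales are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical two-level item bank: problem categories map to detailed symptoms.
# The first category and its symptoms echo the example in the text;
# the second category is invented for illustration.
ITEM_BANK = {
    "Problems with eating and drinking": ["taste changes", "lack of appetite"],
    "Problems with sleep": ["trouble falling asleep", "waking up early"],
}


@dataclass
class Rating:
    symptom: str
    bother: int    # assumed scale: higher means more bothersome
    priority: int  # assumed scale: higher means more important to the patient


def symptoms_to_present(selected_categories):
    """Branching: only symptoms under the selected categories are shown."""
    return [s for c in selected_categories for s in ITEM_BANK[c]]


def assessment_summary(ratings):
    """Rank the patient's selected symptoms by priority for care, highest first."""
    return sorted(ratings, key=lambda r: r.priority, reverse=True)


# The patient selects one category, so only its two symptoms are presented.
shown = symptoms_to_present(["Problems with eating and drinking"])

# The patient rates the symptoms that apply; the summary ranks them by priority.
ratings = [
    Rating("taste changes", bother=2, priority=3),
    Rating("lack of appetite", bother=3, priority=8),
]
summary = assessment_summary(ratings)
```

Because each patient sees only the symptoms under the categories he or she selected, two patients will generally complete different item sets, which is the property that complicates the reliability methods discussed later.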
Interactive tailored patient assessments such as the Choice instrument are different in several respects from other standardized measurement approaches that rely on patient self-report. The primary goal of traditional instruments is to support research, that is, to describe, contrast, or compare populations and to arrive at more generalizable conclusions based on specific observations [
In the application of either scales or indexes, all respondents complete a given set or subset of items [
Interactive tailored patient assessments are primarily designed for clinical application. Thus, the main focus of interest is to elicit characteristics that are unique to a particular person. The purpose is to provide the person with individually tailored care, information, or behavioral change strategies [
Another difference from traditional assessment instruments is that an interactive tailored patient assessment may be purposely designed to capture each patient’s personal experience. For example, in the Choice instrument, the goal is to help patients find descriptions of their symptoms and health problems that reflect their personal experiences as closely as possible. Thus, patients may choose between relatively similar symptoms that are expressed with synonymous terms, selecting those that they feel are closest to their experience. Such comprehensiveness of symptom descriptions would be difficult in traditional measurement instruments with a parsimonious set of items and would be considered redundant.
There may also be differences in how questions in the instrument are organized and structured. For example, scales combine items into internally consistent scales, or subscales, which tap the same underlying concept. An example is the Center for Epidemiological Studies Depression Scale (CES-D), described later in the case study, which consists of four subscales for which indicators of depression include “problems concentrating” and “sleeping problems” [
Differences Between Traditional Measurement Instruments and Interactive Tailored Patient Assessments
Traditional measurement instruments | Interactive tailored patient assessments
Understanding characteristics of populations; generalizability | Understanding characteristics of individuals
Research | Clinical practice; to tailor patient care or advice to each individual
Each subscale measures one latent concept at a time. Different concepts are contained in internally consistent subscales. | May capture patients’ symptom and problem experiences on different dimensions
Every respondent completes more or less the same set of questions. | Every respondent completes a different set of questions, based on initial item selection.
Parsimony: to explain the greatest amount of variance in the concept measured with the fewest number of items. | Comprehensiveness: to help patients find a close match between the item description and their actual experience.
Measurement is the process of linking abstract concepts to empirical indicators. This can happen in two ways. The first is by focusing on the crucial relationship between the observable response and the underlying unobservable theoretical concept. This is the case with concepts such as “intelligence,” which we cannot observe directly, but implications of it, such as people’s vocabulary, mathematical ability, and knowledge about the world, stem from this quality. Instruments constructed to capture such concepts have come to be called scales [
Reliability and validity are the two basic properties of empirical measurements. Reliability concerns the extent to which an experiment, test, or any measuring procedure yields the same results on repeated trials. Validity is the degree to which an instrument measures what it purports to measure. Reliability is a necessary but not a sufficient condition for validity [
Psychometric concepts, definitions, and methods
Concept | Definition | Methods | Applicability to interactive tailored patient assessments
Reliability |||
Internal consistency | Average intercorrelation among items | Cronbach alpha, split-half | Inappropriate due to highly variable number of assessment items among respondents
Test-retest | Association between measurements on the same respondents at multiple points in time using the same version of the measurement instrument; coefficient of stability | Correlation between two measurements | Inappropriate if concept being measured changes over time; otherwise appropriate. Even small changes over time might fundamentally change the patient’s response to the interactive tailored patient assessment.
Alternate forms | Association between measurements on the same respondents at multiple points in time using two forms of the “same” measurement instrument; coefficient of equivalence | Correlation between two measurements | Inappropriate if concept being measured changes over time; otherwise appropriate. Due to the nature of the interactive tailored patient assessment, with possibly detailed items, coming up with an alternate form might be difficult.
Validity |||
Content | Extent to which a specific measure depicts a domain of content | Literature review, expert review | Appropriate
Criterion-related | Extent of correlation between the test and the criterion | Concurrent validity (test and criterion at same point in time); predictive (test and criterion at a future point in time) | Appropriate. Be aware that it might be difficult to find a sensible criterion when many issues are addressed simultaneously, as often is the case.
Construct | Extent to which a particular measure performs in accordance with theoretically derived hypotheses concerning the concepts (or constructs) being measured | Factor analysis, convergent validation, discriminant validation, known group differences, multitrait-multimethod matrix | Factor analysis is often inappropriate due to variable number of assessment items among respondents, or the large sample size that otherwise would be required. Other methods are usually appropriate.
Common approaches to examine reliability include test-retest, alternate forms, split-half, and tests of internal consistency [
In the test-retest method, the same test is given to the same people after a period of time [
The alternate form method requires two testing situations with the same people, but an alternate form of the same test is administered [
In the split-half technique, items of the scale are split in two. To obtain a measure of reliability, the scores of the halves are correlated. This follows the same logic as in the test-retest technique, where the correlation between two parallel measures equals the reliability coefficient. The issue of how to split the items in half, however, is not clear cut.
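As a rough sketch of the split-half idea, the following example uses an odd-even split, which is only one of many possible ways to halve a scale. The response data are synthetic, and the Spearman-Brown step-up at the end is a standard textbook adjustment for full-length reliability that is not discussed in this paper. The function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def split_half(responses):
    """Odd-even split-half reliability.

    responses: one list of item scores per respondent; every respondent
    must have answered the same items, in the same order.
    """
    half_a = [sum(r[0::2]) for r in responses]  # items 1, 3, ...
    half_b = [sum(r[1::2]) for r in responses]  # items 2, 4, ...
    r_half = pearson(half_a, half_b)
    # Spearman-Brown step-up (assumption: added here, not from the paper).
    r_full = 2 * r_half / (1 + r_half)
    return r_half, r_full


# Synthetic data: 5 respondents, 4 items each, all items answered.
responses = [
    [4, 5, 4, 5],
    [2, 1, 2, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 1],
]
r_half, r_full = split_half(responses)
```

Note that the calculation presupposes a complete respondent-by-item matrix, so it cannot be applied directly when each respondent answers a different subset of items.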
By far the most popular approach is the internal consistency reliability coefficient Cronbach alpha [
A problem with all of the above measures is that they depend, indirectly, on all respondents completing more or less the same set of items, which makes them difficult to apply to interactive tailored patient assessments. A scale’s reliability is mainly addressed by looking at correlations, which are mathematical expressions of association. The calculations pair the data and compare whether the variable values behave in a similar manner: if the value of one variable goes up and the value of another tends to go up as well, the two variables are more strongly correlated than they would be otherwise. Problems arise, however, when data are missing (ie, there is no value of one variable to pair with the value of another). In most data sets, missing values are not a major problem when calculating correlations. For example, if 100 patients are to be measured on weight and shoe size, and two of them miss the weighing because they were in the gym, 98 people still remain for calculating the correlation between weight and shoe size in that group. Generally, the amount of missing data encountered when evaluating scales is negligible. Some patients will most likely have skipped one item or another, but the number of pairs left for the correlation calculations is rarely reduced to such an extent that the calculations suffer severely.
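The weight and shoe size example can be made concrete with a pairwise-complete correlation, in which a pair is dropped whenever either value is missing. The numbers below are synthetic and were chosen only so that exactly two weights are missing; `pairwise_pearson` is a hypothetical helper name.

```python
from math import sqrt


def pairwise_pearson(x, y):
    """Pearson correlation over pairs in which both values are present.

    Missing values are represented by None. Returns (r, number_of_pairs).
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / sqrt(sxx * syy), n


# Synthetic measurements for 100 patients.
weight = [55 + 0.4 * i for i in range(100)]   # kilograms
shoe_size = [36 + i // 10 for i in range(100)]

# Two patients missed the weighing.
weight[10] = None
weight[57] = None

r, n_pairs = pairwise_pearson(weight, shoe_size)
```

Losing 2 of 100 pairs barely affects the estimate. The situation changes when most values are missing for most respondents, as is the case in an interactive tailored patient assessment.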
In interactive tailored patient assessments, however, the amount of missing data could be devastatingly high, effectively making well-known techniques useless. Take the Choice instrument. It has a total of 141 symptoms that the patients can choose from. In the testing of the system, the average number of symptoms the patients reported was 10 [
This lack of a fixed set of items on which to perform calculations constitutes a major statistical challenge when verifying the reliability of an interactive tailored patient assessment. All correlation calculations suffer from it, and every correlation is estimated less precisely, because an unanswered question contributes a “missing” value, which erases that piece of information entirely, rather than a zero or similar value, as in more traditional assessments. For example, a patient who answers items 1 through 5 in one administration of an interactive tailored patient assessment and items 2 through 10 in another administration of the same assessment would, in a test-retest, have only four items in common across the two administrations, even though five items were answered the first time and nine the second time, for a total of 10 different items.
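The item counts in this example can be checked with simple set operations. The sketch below treats each administration as the set of item numbers the patient answered; the variable names are illustrative only.

```python
# Items answered in each administration, as in the example in the text.
first_administration = set(range(1, 6))    # items 1 through 5
second_administration = set(range(2, 11))  # items 2 through 10

# Only items answered on both occasions can be paired for a test-retest
# correlation; every other item contributes a missing value.
usable_items = first_administration & second_administration
all_items = first_administration | second_administration

n_first = len(first_administration)
n_second = len(second_administration)
n_usable = len(usable_items)
n_total = len(all_items)

# Share of the answered items that can actually be paired.
usable_share = n_usable / n_total
```

Here only 4 of the 10 distinct items (40%) enter the calculation, and this share shrinks further as the item pool grows relative to the number of items each patient selects.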
The calculation of Cronbach alpha [
Factor analysis is closely linked to reliability measures, but makes less stringent assumptions than alpha-type methods. Such methods are, however, also deemed to be unreliable in the setting described above. Factor analysis does nothing more than redefine and simplify the correlation matrix, a matrix that may be calculated on the basis of a huge amount of missing data and very sparse real information. The number of assessments needed in order to have a trustworthy correlation matrix would then have to be extremely high. There are several guidelines for sample size. Among others, Tinsley and Tinsley [
The main methods to assess the validity of a test for a group of people under certain circumstances are content validity, criterion-related validity, and construct validity. Fundamentally, content validity depends on the extent to which an empirical measurement reflects a specific domain of content and whether the items reflect the meaning associated with each dimension or subdimension [
Criterion-related validity refers to the correlation of a measure with a criterion variable that is external to the measuring instrument itself [
In contrast to content validity and criterion-related validity, construct validity has a more generalized applicability and lends itself more readily to empirical investigation. Constructs concern domains of variables [
A number of techniques for examining construct validity are applicable to interactive tailored patient assessments. For example, convergent and discriminant approaches, including known group differences, are based on hypothesized relationships between the measurement of concern and another variable. Convergent validity is demonstrated when two independent methods that measure the same variable or attribute are highly correlated. Discriminant (or divergent) validity is demonstrated when measures of different attributes do not correlate highly.
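The logic of convergent and discriminant validation can be illustrated with three score vectors: two hypothetical measures of the same attribute and one measure of a different attribute. All scores below are synthetic and were constructed so that the contrast is easy to see; they do not come from any instrument discussed in this paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Synthetic scores for 8 respondents.
attribute_a_method_1 = [1, 2, 3, 4, 5, 6, 7, 8]
attribute_a_method_2 = [2, 1, 4, 3, 6, 5, 8, 7]  # same attribute, other method
attribute_b = [4, 6, 3, 5, 5, 3, 6, 4]           # a different attribute

# Convergent evidence: two measures of the same attribute correlate highly.
r_convergent = pearson(attribute_a_method_1, attribute_a_method_2)

# Discriminant evidence: measures of different attributes correlate weakly.
r_discriminant = pearson(attribute_a_method_1, attribute_b)
```

Both correlations are computed on summary scores that exist for every respondent, which is why these techniques remain usable even when the underlying item sets differ between respondents.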
In their seminal paper on construct validation, Campbell and Fiske [
Other techniques to establish construct validity that examine the internal structure of a measurement instrument, such as factor analysis, are, however, often inappropriate for interactive tailored patient assessments because they depend on a reliable correlation matrix. The sheer size of the sample needed to verify the instrument, given both the possible three-digit number of items and the possible close-to-100% missing data, could be far beyond practical reach.
When testing the reliability of Choice, it was evident that we needed a way of being able to pair observations on the different items without encountering an overwhelming amount of missing data. Because questions in the Choice instrument are tailored to each respondent based on initial response, reliability measures that are built on internal consistency could not be appropriately used for the evaluation of reliability.
Our first thought was to perform a test-retest. It is natural to assume that individuals would correlate more highly with themselves (ie, report the same bothersome symptoms and the same priorities for treatment) if the time between the tests was sufficiently short, which would reduce the amount of missing data in the correlation pairing. A complete test-retest using the Choice instrument seemed inappropriate, however, because of the risk that patients’ symptom reports could change to such an extent that the discrepancy between items chosen in the test and the retest would make the correlation calculations unreliable. This concern was strengthened by the fact that several of the items address issues that change fairly quickly with time.
The alternate form approach seemed a logical second option, but as the Choice instrument contains 141 symptoms with several nuances in the wording to capture the specific disease pattern of the particular patient, as described earlier, an alternate form could run the risk of being different in such a way that patients would choose other symptoms merely due to the wording of the items. It seemed difficult to come up with an acceptable, completely alternative form of the instrument. There did, however, exist a somewhat alternative format of the Choice instrument that would at the same time minimize the amount of missing data: the full list of the 141 symptoms. We used this to assess the reliability of the Choice instrument.
To collect the reliability data, we conducted a separate study independent from our clinical trial. Because reliability is sample-specific, patients in this new study were recruited from the same population and setting and had to meet the same inclusion criteria as patients in the clinical trial. After Institutional Review Board approval was obtained, 100 patients undergoing cancer treatment were recruited. First, patients were asked to complete the tailored Choice assessment similar to patients in the clinical trial. Immediately after and in the same data collection session, they were asked to complete a questionnaire, the alternate form that included the full set of 141 symptom descriptions contained in Choice. The correlation between Choice and questionnaire data was 0.74 for all symptoms, and 0.85 for moderately or very bothersome symptoms [
It may at first be surprising that the correlation coefficients between the two formats were not higher. The main reason was that the Choice instrument makes it possible to choose different terms to express almost the same symptom. For example, a patient who chose “lack of energy” in the interactive tailored patient assessment version instead chose “fatigued” in the paper-based form. While the patient may not have been aware of this distinction, it weakened the correlations between the two forms, making them somewhat lower than one might expect.
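The effect of near-synonymous items on agreement between the two formats can be illustrated with a toy example. The sketch assumes that each format is coded as a 0/1 indicator per symptom and that agreement is summarized with a Pearson correlation on those indicators; how the reported coefficients were actually computed is not specified here, so this is an illustration of the mechanism rather than a reproduction of the analysis. The six-symptom list and the synonym mapping are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def indicators(selected, items):
    """Code a set of selected symptoms as 0/1 over a fixed item list."""
    return [1 if item in selected else 0 for item in items]


# A small, hypothetical excerpt of the symptom list.
ITEMS = ["lack of energy", "fatigued", "nausea",
         "mouth sores", "taste changes", "lack of appetite"]

# One patient's selections in the tailored version and on the full list.
tailored = {"lack of energy", "nausea", "taste changes"}
full_list = {"fatigued", "nausea", "taste changes"}

r_raw = pearson(indicators(tailored, ITEMS), indicators(full_list, ITEMS))

# Illustration only: treat the two near-synonyms as one symptom.
SYNONYMS = {"fatigued": "lack of energy"}
merged_items = [i for i in ITEMS if i not in SYNONYMS]
merge = lambda chosen: {SYNONYMS.get(s, s) for s in chosen}

r_merged = pearson(indicators(merge(tailored), merged_items),
                   indicators(merge(full_list), merged_items))
```

In this toy case a single synonym substitution lowers the correlation from 1 to about 0.33, even though the patient arguably reported the same experience in both formats.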
As mentioned above, content validity depends greatly on the adequacy with which a specific domain of content is sampled [
The goal when constructing the tailored Choice instrument was to assist patients in communicating their illness experience along physical, psychosocial, and functional dimensions as close as possible to their actual experiences. It was, therefore, important to include a comprehensive set of items that reflected all dimensions of patients’ illness experiences in sufficient level of detail and that were expressed in lay language to support patient recognition and communication.
To identify items to be included, we conducted a thorough review of the scientific literature to identify problems, specific symptoms, and functional limitations encountered by cancer patients. This search and review included the health care bibliographic databases as well as the World Wide Web and resulted in a preliminary list of symptoms and functional problems for potential inclusion. Expert groups of specialists in cancer care (physicians, nurses, social workers) then critically reviewed this list for relevance, comprehensibility, completeness, and level of detail and supplemented it with expert opinion [
To evaluate construct validity of the Choice instrument, we used known group differences techniques as well as assessments of convergent and discriminant validity. We performed three evaluations of known group differences based on data collected in a clinical trial of 148 patients who received active cancer treatment for leukemia and lymphoma.
The first test was based on the hypothesis that patients undergoing a stem cell transplant would report more symptoms with the Choice instrument than patients treated with chemotherapy only. This hypothesis is consistent with empirical evidence on treatment side effects and was supported by the data. Patients undergoing a stem cell transplant reported significantly more symptoms than patients in the chemotherapy group (14.6 vs 9.2,
In the second test, we examined gender differences in self-reported symptoms. Because the literature has provided some evidence that women report more symptoms than men [
Finally, we examined whether the most reported symptoms during patients’ illness trajectories were consistent with expected symptom patterns during different phases of treatment and rehabilitation. This was again supported. The most frequently selected symptoms 1 to 2 months into treatment were side effects related to chemotherapy and stem cell transplant, including nausea, vomiting, and mouth sores. During the third and fourth months of treatment, long-term side effects such as neurological problems, memory problems, and weight loss started to occur more frequently. During rehabilitation, the number of physical symptoms decreased and the focus of self-reported symptoms shifted to issues regarding resuming a normal life and worries about the future. Thus, all three known group difference tests performed as expected and provided support for the validity of the Choice instrument.
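A known group difference test of this kind compares the number of reported symptoms between groups that are expected to differ. The sketch below computes a Welch t statistic (which does not assume equal variances) on synthetic symptom counts. The statistical test actually used in the case study is not stated in this section, and the counts below are invented and are not the trial data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic and approximate degrees of freedom."""
    va, vb = variance(group_a), variance(group_b)  # sample variances
    na, nb = len(group_a), len(group_b)
    se_sq = va / na + vb / nb
    t = (mean(group_a) - mean(group_b)) / sqrt(se_sq)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se_sq ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Synthetic numbers of symptoms reported per patient (not study data).
transplant_group = [12, 15, 18, 14, 16, 13, 17]
chemotherapy_group = [8, 10, 9, 11, 7, 9, 9]

mean_difference = mean(transplant_group) - mean(chemotherapy_group)
t_stat, df = welch_t(transplant_group, chemotherapy_group)
```

The comparison needs only one summary value per patient (here, the number of symptoms selected), so it is unaffected by the fact that patients complete different item sets. A P value can be obtained by referring `t_stat` to a t distribution with `df` degrees of freedom.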
To measure convergent and discriminant validity, we compared the performance of the Choice instrument in our clinical trial data set with two other measures taken at the same time point: the CES-D [
To assess discriminant validity, we performed correlations between Choice subscales and CES-D and SF-36 subscales that measured different attributes, hypothesizing that they would not correlate to a very high degree. This was supported by our data. The physical symptom subscales of the Choice instrument correlated only weakly with the CES-D depression subscale (
In this paper, we strongly advocate evaluating and reporting reliability and validity of interactive tailored patient assessments, which is crucial for the credibility of interactive tailored patient assessments as research instruments. However, several of the common measurement techniques available to assess these psychometric properties are not applicable to interactive tailored patient assessments. The advantage of computerized tailored assessments is that patients can skip unimportant items and home in on problems that matter to them and that reflect their actual experience. However, this advantage makes reliability and validity assessments of interactive tailored patient assessments a challenge for informatics researchers. To assist in this task, we have discussed which techniques might be feasible for establishing reliability and validity of interactive tailored patient assessments and demonstrated their application in a case study of the Choice instrument.
Although assessment of reliability of an interactive tailored patient assessment may require collection of a separate data set in addition to the clinical trial data, this is well worth the effort. A basic core of evidence of reliability and validity is needed for any instrument. Reliability is a prerequisite for validity, and an unreliable instrument cannot be valid. Unreliable and invalid instruments are not worth further investigation [
This work was supported by the Norwegian Research Council, grant #154739/320. The authors would like to thank members of the Choice research team for their assistance in data collection, data entry, and scoring of the instruments: Dr. Glenys Hamilton, Jørn Kristiansen, and Heidi Sandbæk.
Conflicts of interest: None declared.
CES-D: Center for Epidemiological Studies Depression Scale
ITPA: interactive tailored patient assessment
SF-36: Medical Outcomes Study 36-Item Short Form Health Survey