A Good Death

The Institute of Medicine defines a good death as “one that is free from avoidable death and suffering for patients, families and caregivers in general accordance with the patients’ and families’ wishes.” The current system creates barriers to reducing the stress and suffering that accompany a patient’s end of life. Data and eHealth technology, if they were more accessible, could help patients, families, and caregivers to cope with end-of-life issues. (J Med Internet Res 2007;9(1):e6) doi:10.2196/jmir.9.1.e6


Introduction
Mom had Alzheimer's. She was barely hanging on to her apartment in an assisted living facility and would soon have to move to an Alzheimer's unit. Then breathing troubles started. In three days she was diagnosed with pneumonia, developed congestive heart failure, had a heart attack (one day after being admitted to the hospital), and died. In those three days more money was spent "caring for her" than had been during her entire life.
My dad died when I (the oldest) was 11. Mom raised three kids on a secretary's salary, put three kids through college, and did so with jokes, smiles, and songs. Her view of life was: make lemonade out of lemons; get on with it; don't complain.
She took the same approach to death. She got sick on Saturday, spent Sunday and Monday in an Intensive Care Unit, and died Tuesday noon. Even 30 minutes before she died she was trying to make it easy on us by joking. Then she waited to die until we went to lunch because (I think) she did not want us to see her die. Like always, she did her job and moved on. She was 87. I loved her very much.

A good death?
The Institute of Medicine defined a good death as "one that is free from avoidable suffering for patients, families and caregivers in general accordance with the patients' and families' wishes" [1]. I wish I could say my mother got the care she needed and deserved. She did not; neither did we. Those whom I have told about our experiences suggest they are all too common. Hence, I would like to share these experiences here, and discuss the needs, but also the responsibilities, of patients and family members at the end of life, and the implications for the health field.
She was lying in that bed. When she first arrived at the hospital, she was still able to walk. I know that even at my age, I can't go as long without physical exercise as I used to. When I take time off and get back to it, my muscles are sore. How long could she stay in that bed before she would no longer have the muscles to walk out? Would she not be able to return to her assisted living apartment? What about her heart attack? How much would that limit her? Even in her debilitated mental state, she was probably the most rational person of us all. Mom knew what she wanted: to have the restraints and tubes removed and to go home. We were told that we could have none of that. The pneumonia had to be treated. The restraints had to stay on. The tubes had to stay in. Could her antibiotics be given by intramuscular injection or by pill? What if we took out the tubes, stood at the door and told the nurses to leave our mother alone? What would they have done? What rights do family members have and how could we exercise them?
The thing that bothered me the most was those restraints. I wonder if my mother would have had the heart attack if she had not been placed in restraints. Did the hospital "kill" her? Did the heart attack kill her? Or did she die of a broken heart? Let me explain. I was told that when Mom got to the hospital, she kept pleading to go home. Over and over and over again. The Alzheimer's had made it difficult for her to deal with change and when the nurses added the IVs, she became combative. A team of nurses had to use restraints to subdue a frail 87-year-old woman! Straps held down her hands so she would not pull out the IVs. Mom hated lying on her back, but the restraints made it impossible for her to lie on her side or even to scratch her nose. She continued to fight the restraints for several hours, all the time asking why she was in the hospital and why she could not go home. After a while, she was just exhausted, and shortly after that, she had a heart attack. If I were 87, begging to go home, and doing everything I could do to fight restraints, how long would my heart hold out? Did this have to happen? What rights did the family have? How could we have intervened? Looking back, I wonder what would happen if everyone working in an ICU were required to lie in restraints for just one hour.
I found that not all "do not resuscitate" (DNR) orders are the same. A nurse in the ICU told us that Mom's orders were signed so long ago (seven years ago) that they were not valid. But now she had Alzheimer's and, because of her limited cognitive functioning, had given up power of attorney for health care. Was the nurse right? How were we supposed to know that? What can we do now?
The ICU nurses said that my mother's assisted living facility could not handle a person as sick as she. Again, we had so many unanswered questions. Is it illegal for an assisted living facility to provide such care, even on a temporary basis? Could our family have hired 24-hour nursing care and kept her in the assisted living facility? Where could we find good caregivers even if we were allowed to? How would we know a good home care nurse from a bad one? How would we monitor the care to ensure it was of high quality? What could we have done if poor care was provided?
The nurses told us we would have to raise our questions with the doctors. Yet they seemed to be avoiding us. We called the lead physician's office many times and received no response. We even arrived very early (before 7:00 a.m.) to catch that physician on his rounds; he had "just left." The few communications we did have with doctors were inconsistent and conflicting. The pulmonologist said Mom had 24-48 hours left. The nurses said she was looking better. An internist said she would be home in two days and a psychiatrist just appeared regularly to yell (literally): "Olive, do you know where you are? Can you spell your name?" I interrogated the cardiology nurse. She would not give a prognosis, so I asked "Was there tissue damage?" She said "Yes." I said "Was it a minor heart attack?" She said "No." I said "Was the damage pretty significant?" She said "Yes." I looked at Mom and thought that she did not look good at all. While she might have been more awake on the second day, she was fighting for air, her chest and abdomen going up and down. I could not see how she was getting better. How could I get a straight answer [2]? How could I take definitive action when I received conflicting advice and my own common sense told me that I was not getting a straight story? How could I get the doctors and nurses to talk to me? Who could I call if they didn't? Who was in charge?
It was difficult for us to know whether our mother's condition was due to her disease or her treatment. Mom was clearly out of it. But she was on Haldol, which she had waited several hours to get while the staff attended to another emergency and while she was fighting her restraints. The nursing staff reported that the Haldol helped Mom quiet down. But when I first saw her, her speech was slurred and she looked awful. How could I know whether it was the Haldol or the heart attack or the lack of oxygen to the brain caused by pneumonia? No one would talk with us! I kept asking myself: "What does Mom want?" Does she want to die? Mom was awake. I could have asked her. But how? Do I say: "Mom, do you want to die?" Or "Have you had enough?" I wanted to approach this the right way, but how do I open the discussion? AND, suppose she had said "Yes!" Then what? What could we have done to help her along?
Was it time to talk to hospice? The community offered several hospice options. Which one should we talk to? How could we find the best one? Was she eligible to go? What would they do for her? If hospice was available, it might be another difficult transition for Mom. I just wanted to ease her misery.

Training medical professionals on the dying process
The death and dying process, including the needs of family members, should have a significant place in the training of physicians and nurses. But when I asked the nurse in charge of the ICU whether the hospital had a palliative care program, she replied: "What is that?" After we complained several times that we could not reach the physician, a "case manager" appeared, asking: "What do you want?" But we did not know what we wanted! She gave a list with telephone numbers of nursing homes and hospices but would not identify the good ones. When we asked whether the good ones were full and all the bad ones empty, she avoided the question. Again, there were so many unanswered questions. How could we get Mom into a nursing home or hospice that would be nice to her? How would we know if they were giving good care? How could we intervene if they weren't? Time was getting short. We needed answers.
I wish that I could be optimistic that things will be better. I am not. The health field has recommended an array of end-of-life policies and best practices [3][4][5][6]. The Institute of Medicine and major provider associations have issued reports, studies, and calls for change. The Robert Wood Johnson Foundation invested millions of dollars to promote a good death. But little has changed. The health care field is full of studies showing that patients' wishes are not respected [7], that communication in the ICU is sorely lacking [8], that evidence-based practices are not implemented [9], and that fundamental concepts of palliative care (and even decency) are absent in many health care organizations [10]. Part of the problem is the limits of medical knowledge. In many cases, health care providers just don't know how bad things are or the best course of action for a patient. So it is reasonable for a doctor to express uncertainty about what to do or whether the situation is dire enough to call the family together. But many providers act as if death is a failure, when it is part of life [11]. All patients and families deserve honest and consistent information, even if the information is an expression of uncertainty.
Healthcare providers are good people. I have seen my sister, a critical care nurse, come home from her job at a leading teaching hospital so tired she can barely move. I have seen doctors so frustrated with the health system that they can barely see straight. Providers are trained to cure at all costs, but the incentives ensure that they spend as little time as possible with individual patients. I have seen hospital administrators after another long night of worrying about how they will make budget. If I were in their shoes with the same pressures and incentives, I would probably do the same thing. It should be different but it won't be; not for a long time. The problem is the system, not the people.
In the final analysis it is up to us (the families) to take more responsibility for the dying person and for ourselves [12]. We are the ones who care the most about what happens. We are the ones that will make the time. We need to know what to look for and what to expect. We need to know how to care for a patient at the end of life. We need to know our rights and how to exercise them; our options and how to choose between them. We need to know how to assess quality and how to act on those assessments. And we need to understand that death is part of life and that uncertainty is part of the dying process [13].
Ideally, families should be prepared to deal with death and dying well before the event. But it doesn't work that way. I have done health services research for over 30 years. I currently have two research grants on death and dying. I knew my mother was approaching the end of her life. I should have been prepared for her death. I wasn't. I was powerless when it came to making things happen in the ICU. I knew the principles but not the specifics of how to interact with a dying patient, and I needed the specifics. Our family (those in the ICU and those far away) needed ready access to information on Mom's status as well as easy-to-find, easy-to-apply, just-in-time training on death and dying; training that was accessible while we sat in the intensive care unit and included the specific signs to look for and specific words to use with patients and providers.

Recommendations
I have three suggestions: 1) automate the processes for helping patients and families deal with dying and death, 2) when automation is out of the question, make it hard for ICUs to do the wrong thing, and 3) improve transparency of how healthcare deals with death and dying.

Automate
Given the pressures that health care providers operate under, it is unrealistic to expect training and exhortation to change anything. Technologies are needed that equip patients and families to deal with death and dying. Things would have been so much better if I could have opened my smart phone and pulled up a list of the 20 things to watch for in the ICU: things like restraints, conflicting information, care contrary to patient wishes, goals for end of life, family-physician communication. If I could have selected one of those things on my smart phone and seen an overview of the issues, my rights, specific steps I could take. If I could have clicked on a topic and received more detail through decision aids, scripts, assessments, and training on how to exercise my rights, all presented in a way that I could apply on the spot. If I could have had immediate Web access to databases on ICU, hospice and nursing home quality and relevant literature. If I could have sent a text message to an expert in death and dying or seen video clips of an effective encounter with a patient, a family member, a nurse, doctor or administrator. If I could have asked a question of other families who had gone through something similar or read stories of their experiences. The reality is that I could have; the technology, the knowledge, and the data exist to deal with most of these issues.
My wife and I just completed our Health Care Power of Attorney documents and sat down with the kids to review them. I wish that I could have handed the kids a memory stick containing that information, reviewed the basic structure with them, and asked them to carry it with them, because some day (5, 10, 30 years from now, or maybe tomorrow) they will need it, and so will I. Research and development could make this kind of tool a reality. But we can't stop with development. We need to have a system for dissemination. Partnerships with the legal profession could ensure that when a will is prepared or updated, the participants and their families are given access to these tools. Hospitals could make this technology available to families whenever a patient is admitted to intensive care.
Secondly, electronic medical records (EMRs) could improve end of life. Is it out of the question to give families access to the medical record during this difficult time? Could EMRs place the patient's goals for end-of-life care in an easy-to-access location that would be hard for any provider to miss? Could EMRs have reminder systems that would alert providers when it is time to encourage patients to update their end-of-life wishes? Whenever a patient is considered to have a potentially life-threatening condition, could the EMR require documentation from ICU clinicians indicating that they have read and accept the patient's goals for care at end of life?

Make it hard to do the wrong thing
Systems change can be difficult. However, there are principles that can increase the likelihood of success. One is to remove the status quo. For instance, one cannot use mechanical restraints if they are not available. Administration could remove the mechanical restraints from the ICU and lock them in a cabinet for use only with permission of senior leaders. Medicare could treat inappropriate care at the end of life in the same way they treat other medical errors. Medicare policy could just flat out prohibit the use of mechanical restraints without permission from a senior leader of the hospital. Then, immediate convening of a rapid response team could be required to determine steps needed to remove the restraints and make sure they are never needed again. These steps have already been taken by mental health hospitals; we could do the same for dying patients.
Technology could help to ensure that these policies are implemented.

Transparency
Public reporting on the quality of a death should be required. Organizations like Medicare and NCQA could collect and publicly report data on use of restraints and other measurable dimensions of quality in death and dying. But public reporting is just a start. Systems must be in place to ensure that people will actually act on this information. Given our reticence to address death and dying before it happens, it is unrealistic to expect families to study quality of death data until the time comes. Hence, it will be important to find ways to make these data and resources easily available, easily understandable and easy to act on, on a just-in-time basis.
We (the patients and families) need to take responsibility for our own dying and death. It is the centerpiece for us being able to do that. Mom needed it, I needed it and so will you.

Introduction
In recent years, many tailored health promotion programs have become available through the Internet. As with any other health-promoting intervention, these Web-based health promotion programs are not expected to lead to either short-term or sustained behavior change unless the intervention reaches the intended target audience. Unfortunately, it has been postulated that health-promoting lifestyle interventions tend to reach those who need them the least [1]. For ethical and practical reasons, there is often no information available on people who decide not to participate (see, for example, the study by Sirard and colleagues [2]). Dutta-Bergman [3] showed that people who look for health information on the Internet are more health oriented than people who don't look for such information. They are also more likely to hold stronger health-oriented beliefs and to engage in healthy activities. Similarly, Verheijden et al [4] conducted a nonresponse survey and showed that participants in a Web-based tailored nutrition counseling program were a relatively well-educated and healthy subsample of the target audience. It is thus evident that efforts need to be made to minimize selective enrollment in Web-based health promotion programs.
Furthermore, it is known that there is a strong dose-response relationship between the number and intensity of counseling sessions and behavior change outcomes [5][6][7]. It is therefore unlikely that a difficult process such as health behavior change can be achieved in a single counseling session; repeated participation is likely necessary to achieve sustainable changes. It seems reasonable to assume that this is also true for Web-based counseling, which means that repeated exposure to the counseling programs is necessary to achieve sustainable changes.
In addition to selective enrollment in Web-based programs, selective attrition during follow-up may thus be a concern too. This concern was also expressed by Eysenbach [8] and by Danaher and colleagues [9], who argued that attrition, uptake, exposure, and diffusion measures need to be addressed in addition to the effectiveness of eHealth programs. Indeed, little is known about attrition and its determinants in Web-based behavior change programs. The current study therefore addresses the rates and health and lifestyle determinants of repeated participation in a Web-based health behavior change program. The current focus on user characteristics is in line with the relevance of these characteristics that was made explicit by Christenson and Mackinnon [10].

The Web-Based Health Promotion Program
The Web-based health promotion program on which the current paper is based was a Web-based version of the Dutch National Health Test, designed by the Netherlands Organisation for Applied Scientific Research TNO in cooperation with the Dutch Foundation Pur Sang. The program was developed with funding from the Dutch Ministry of Health, Welfare and Sport and aimed to increase people's awareness of their own lifestyle, to promote physical activity, and to prevent overweight and obesity. The program was available on the Internet from July 2004 onward at no cost. The launch of the site was brought to the public's attention with a press release presenting the State Secretary as the very first participant in the program. No further media and marketing strategies were used to keep the program in the public's view. However, various organizations were interested in the program, which generated free publicity. For example, articles on the Dutch National Health Test were published in national and local newspapers, in free magazines published by several supermarket chains, in women's magazines, and on websites of general practices and municipal health services.
The registration procedure for the website included selection of a personal username and password and a series of questions on sociodemographic characteristics of the participants. The health promotion program contained modules on anthropometrics (height, weight, waist circumference), physical activity, dietary habits, alcohol intake, smoking, work, cardiorespiratory fitness, and muscle strength. The modules on lifestyle consisted of a series of questions to assess the current behavior. The modules on anthropometrics, cardiorespiratory fitness, and muscle strength contained instructions for the appropriate self-tests. Upon completion of each individual module, participants received feedback that was tailored to the responses they had given in that module. Because the physical activity module was the core of the program, additional questions on physical activity were included. Figure 1 presents a screenshot of the Dutch National Health Test (in Dutch). Additional screenshots are available in the Multimedia Appendix (in Dutch). The Dutch National Health Test is no longer available to the public. More information can be obtained from the authors.
The tailored feedback messages for the individual modules were also integrated in an overall report. Participants could print this report and keep it for their own reference. The data were stored in a database and used as a basis for longitudinal feedback in follow-up participation. The data were also kept for research purposes. Participants' consent for this was obtained. Participants were allowed to complete the modules over a two-week period. They were sent an email reminder to complete the modules in a timely manner. During the initial participation, people were made aware of the availability of follow-up modules. They were encouraged to participate in these follow-up modules to monitor their progress and to receive more tailored feedback. Invitations for follow-up participation were sent out by email three months after the completion date of the last entry. The program was made available online in July 2004, which means that the first reminders for follow-up could have been sent out in October 2004; however, due to a technical error in the automated reminders, the first reminders were not sent out until March 2005.
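The invitation rule described above (an email three months after the completion date of the last entry) can be sketched as a small date calculation. This is only an illustration: the function names are hypothetical, and it assumes that "three months" means calendar months, with the day clamped when the target month is shorter. The program's actual scheduling logic is not described in the text.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`.

    If the target month has fewer days, the day is clamped to the last
    day of that month (for example, 30 November + 3 months falls on the
    last day of February).
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def follow_up_invitation_date(last_entry_completed: date) -> date:
    """Hypothetical helper: the invitation is due three months after the last entry."""
    return add_months(last_entry_completed, 3)


# A participant who finished the modules in mid-July 2004 would be due
# an invitation in mid-October 2004.
print(follow_up_invitation_date(date(2004, 7, 15)))
```

Under this assumption, completions in July 2004 yield invitation dates in October 2004, which matches the earliest possible reminder date mentioned above.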

The Dependent Variable: Participation
Although each individual module led to immediate tailored feedback, people were not considered to be participants unless they had completed the modules on anthropometrics and physical activity. Single participation was therefore defined as having only one record in the Dutch National Health Test database in which the modules on anthropometrics and physical activity were completed. Repeat participation was defined as having two or more records in the Dutch National Health Test in which the modules on anthropometrics and physical activity were completed.
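The definitions of single and repeat participation can be expressed as a short classification routine. The sketch below is illustrative only: the record layout (pairs of a user identifier and the set of completed modules) and the module names are assumptions, since the structure of the Dutch National Health Test database is not described here.

```python
from collections import Counter

# Modules that must both be completed for a record to count as participation.
REQUIRED_MODULES = frozenset({"anthropometrics", "physical_activity"})


def classify_participation(records):
    """Classify each user as a 'single' or 'repeat' participant.

    `records` is an iterable of (user_id, completed_modules) pairs, one
    pair per database record (hypothetical layout). Only records in which
    both required modules were completed are counted. Users without any
    qualifying record are not participants and are omitted from the result.
    """
    qualifying = Counter()
    for user_id, completed_modules in records:
        if REQUIRED_MODULES <= set(completed_modules):
            qualifying[user_id] += 1
    return {
        user_id: "repeat" if count >= 2 else "single"
        for user_id, count in qualifying.items()
    }


example_records = [
    ("u1", {"anthropometrics", "physical_activity", "smoking"}),
    ("u1", {"anthropometrics", "physical_activity"}),
    ("u2", {"anthropometrics", "physical_activity"}),
    ("u2", {"dietary_habits"}),   # does not count: required modules missing
    ("u3", {"anthropometrics"}),  # not a participant at all
]
print(classify_participation(example_records))
```

In this example, u1 is a repeat participant, u2 a single participant, and u3 is excluded because no record contains both required modules.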

Predictors for Repeat Participation
Data on gender, age (15-20, 21-30, 31-40, 41-50, 51-60, 61 and older), and education level (very low, low, intermediate, high, very high) were obtained in the registration process of the Web-based health promotion program. The body mass index was calculated using data on self-reported height and weight. It was subsequently categorized (≤ 25 kg/m², 25.01-30 kg/m², > 30 kg/m²). People were encouraged to measure and report their waist circumference (in centimeters). Detailed instructions on measuring waist circumference, which included some clear pictures, were provided. Smoking status was defined as currently smoking, formerly smoking, or never having smoked. Physical activity was categorized based on the criteria for sufficient physical activity of (1) moderate intensity (moderately intense physical activity at least 5 days per week for at least 30 minutes per day) and (2) high intensity (intense physical activity at least 3 days per week for at least 20 consecutive minutes). Current Dutch guidelines define minimum intake levels for sufficient fruit and vegetables. For fruit, the guideline is a minimum of two pieces per day; for vegetables, the guideline is a minimum of 200 g per day. Participants were categorized as either meeting or not meeting the guidelines for fruit and vegetable consumption. Alcohol consumption was defined based on current Dutch gender-specific guidelines, which define the maximum number of alcoholic drinks per week (ie, a maximum of 15 alcoholic drinks per week for women and 21 for men). People exceeding these numbers were defined as excessive drinkers.
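The categorization rules above translate directly into code. The following sketch uses the cut-offs stated in the text; the function names, argument names, and category labels are illustrative assumptions, and values falling between 25 and 25.01 kg/m² are treated here as overweight.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Body mass index in kg/m^2 from self-reported weight and height."""
    return weight_kg / height_m ** 2


def bmi_category(bmi: float) -> str:
    """Three-level categorization: <= 25, above 25 up to 30, and > 30 kg/m^2."""
    if bmi <= 25:
        return "normal"
    if bmi <= 30:
        return "overweight"
    return "obese"


def meets_moderate_activity(days_per_week: int, minutes_per_day: float) -> bool:
    """Moderate intensity: at least 5 days per week, at least 30 minutes per day."""
    return days_per_week >= 5 and minutes_per_day >= 30


def meets_high_activity(days_per_week: int, consecutive_minutes: float) -> bool:
    """High intensity: at least 3 days per week, at least 20 consecutive minutes."""
    return days_per_week >= 3 and consecutive_minutes >= 20


def meets_fruit_guideline(pieces_per_day: float) -> bool:
    """Fruit: at least two pieces per day."""
    return pieces_per_day >= 2


def meets_vegetable_guideline(grams_per_day: float) -> bool:
    """Vegetables: at least 200 g per day."""
    return grams_per_day >= 200


# Weekly maximum number of alcoholic drinks, as stated in the text.
MAX_DRINKS_PER_WEEK = {"female": 15, "male": 21}


def is_excessive_drinker(drinks_per_week: float, gender: str) -> bool:
    """Excessive drinking: exceeding the gender-specific weekly maximum."""
    return drinks_per_week > MAX_DRINKS_PER_WEEK[gender]


bmi = body_mass_index(weight_kg=70, height_m=1.75)
print(round(bmi, 1), bmi_category(bmi))
```

Keeping each guideline in its own small function makes the thresholds easy to audit against the published criteria.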

Analysis
Descriptive statistics were used to present data on the baseline characteristics of the participants of the Web-based health promotion program. No data on waist circumference will be presented because inspection of the data revealed that people likely gave invalid answers. For example, the reported waist circumferences varied from as little as 20 cm to as much as 2 km. Furthermore, values such as 50 cm, 60 cm, 70 cm, and 80 cm were reported much more frequently than values such as 53 cm, 69 cm, and 77 cm. Descriptive statistics were also used to present data on single and repeated use of the Web-based health promotion program. These analyses are based on data from the 9774 people who met the participation requirements of this program (ie, people who completed questions on anthropometrics and physical activity).
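The two data-quality problems described for waist circumference (implausible values and a preference for round numbers) can be screened for with a few lines of code. This is a sketch of one possible check, not the inspection the authors performed: the plausibility bounds of 40 cm and 200 cm are arbitrary assumptions, and the sample values are invented for illustration.

```python
def plausible_waists(values_cm, low_cm=40.0, high_cm=200.0):
    """Keep only values inside an assumed plausible range (bounds are hypothetical)."""
    return [v for v in values_cm if low_cm <= v <= high_cm]


def share_multiples_of_ten(values_cm):
    """Fraction of values that, rounded to whole centimeters, end in 0.

    If people reported measured values, roughly one in ten whole-centimeter
    values would be expected to end in 0; a much larger share suggests
    guessing or rounding rather than measuring.
    """
    whole = [round(v) for v in values_cm]
    if not whole:
        return 0.0
    return sum(1 for v in whole if v % 10 == 0) / len(whole)


# Invented sample: 200000 cm corresponds to a reported waist of 2 km.
reported = [20, 50, 60, 70, 80, 53, 69, 77, 200000]
kept = plausible_waists(reported)
print(kept, share_multiples_of_ten(kept))
```

In this invented sample, four of the seven plausible values are multiples of ten, far above the share expected from measured data.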
Univariate and multivariate logistic regression analyses were conducted to identify the crude and adjusted effects of factors associated with repeat participation in a Web-based health behavior change program. A total of 9774 people enrolled in the Web-based health promotion program, but because of missing data, all regression analyses were based on 6272 participants. The exclusion of 3502 people was the result of missing values in the variables on smoking (missing for 3010 people) and on the consumption of fruit (missing for 2751 people), vegetables (missing for 2751 people), and alcohol (missing for 3399 people). Analyses comparing the characteristics of people with and without missing values revealed no clinically relevant differences. People with missing values, for example, were significantly older than people without missing values (P < 0.01), but the difference was less than one year.
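For a single binary predictor, the odds ratio from a univariate logistic regression equals the cross-product ratio of the corresponding 2×2 table, so the crude effects can be illustrated without a statistics package. The sketch below computes that ratio with a Woolf (log-based) 95% confidence interval. The counts are invented for illustration and are not the study data; the function name is hypothetical. The adjusted (multivariate) estimates would require fitting a full regression model, which is not shown.

```python
import math


def crude_odds_ratio(exposed_repeat, exposed_single,
                     reference_repeat, reference_single):
    """Crude odds ratio for repeat participation with a Woolf 95% CI.

    Arguments are the four cell counts of a 2x2 table: repeat and single
    participants in the group of interest and in the reference group.
    All four counts must be greater than zero.
    """
    odds_ratio = (exposed_repeat * reference_single) / (
        exposed_single * reference_repeat
    )
    # Standard error of the log odds ratio (Woolf's method).
    se_log_or = math.sqrt(
        1 / exposed_repeat + 1 / exposed_single
        + 1 / reference_repeat + 1 / reference_single
    )
    lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
    upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
    return odds_ratio, lower, upper


# Invented counts: 60 of 600 people in the group of interest and
# 40 of 600 people in the reference group participated repeatedly.
or_, lower, upper = crude_odds_ratio(60, 540, 40, 560)
print(f"OR = {or_:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

Because the analyses were restricted to participants without missing values, the cell counts in a real application would be taken from the 6272 complete cases rather than from all 9774 enrollees.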

Participants and Number of Visits to the Web-Based Health Promotion Program
Approximately two thirds of the 9774 participants were female. The mean age was 36 years (SD = 13). The vast majority of the participants (90.1%) had an intermediate or (very) high education level. The mean body mass index was 24.5 kg/m² (SD = 5.7). At the time of enrollment, 22% of the participants were smokers. The guidelines for physical activity of moderate intensity and high intensity were met by 51% and 46% of participants, respectively. Few people met the guidelines for fruit and vegetable consumption (22% and 14%, respectively), and 7% consumed more alcoholic drinks per week than recommended by current Dutch gender-specific guidelines.
Of the 9774 people who enrolled in the Web-based health promotion program, almost 10% participated more than once: 7.6% participated twice, and 1.9% participated three times. The completion of four visits was very infrequent (< 1%).

Determinants of Repeat Participation
People aged 41 years and older repeatedly participated in the intervention more than those aged 15-20 years (Table 1). In the univariate analyses, a healthy lifestyle was related to repeat participation. For example, repeat participation was more frequent among former smokers (OR = 1.73) and people who never smoked (OR = 1.47) than among current smokers. It was also more frequent among people with sufficient physical activity than among people with insufficient physical activity (OR = 1.31 for moderate intensity activity, OR = 1.23 for high intensity activity). Finally, repeated participation was more frequent among people meeting the guidelines for fruit consumption (OR = 1.26) and vegetable consumption (OR = 1.39) than among those failing to meet the guidelines. In contrast to the pattern that a healthy lifestyle was related to repeat participation, people who were overweight (OR = 1.20) or obese (OR = 1.54) more frequently participated repeatedly than people of normal body weight.
In the multivariate analyses, repeat participation was more frequent among people aged 41 years and older (OR = 1.40-1.68), among obese people (OR = 1.41), among former smokers and individuals who never smoked (OR = 1.49 and 1.44, respectively), among people with sufficient physical activity of moderate intensity (OR = 1.23), and among people with a sufficient vegetable consumption (OR = 1.26) than among the relevant reference groups.
Notes to Table 1: * The regression model contained repeat participation (yes/no) as the dependent variable and gender, age, education level, body mass index, smoking status, physical activity (moderate and high intensity), fruit consumption, vegetable consumption, and alcohol consumption as independent variables. † Current guidelines in The Netherlands recommend a total of at least 30 minutes of physical activity of moderate intensity at least 5 days per week. ‡ Current guidelines in The Netherlands recommend at least 20 consecutive minutes of physical activity of high intensity at least 3 days per week. § Current guidelines in The Netherlands recommend at least 200 g of vegetables per day.

Discussion
This study showed that people who repeatedly participated in a Web-based health promotion program generally had healthier lifestyles than people who participated only once. In contrast to this and to our expectations, people who were overweight or obese participated more frequently than people of normal body weight. Repeated use was relatively infrequent; approximately 10% of the people used the program more than once.
The initial concern in reaching the appropriate target audience with Web-based health promotion programs is to prevent selective enrollment in the program. An extensive comparison of the baseline characteristics of the participants in the current study with the Dutch population in general is beyond the scope of this article. A bird's-eye view of the baseline characteristics, however, indicates that the participants in the Web-based program had a lower prevalence of overweight, obesity, and smoking, and a higher prevalence of compliance with the guidelines for physical activity and fruit and vegetable consumption than the Dutch population in general.
As was discussed recently by Eysenbach [8], eHealth applications face the difficulty that a (sometimes substantial) proportion of people will not be using the application or will be using it sparingly. The latter was also true for the Dutch National Health Test. Our study confirms our hypothesis that selective retention in Web-based health behavior change programs is a concern in addition to selective enrollment. The group of participants who used the program repeatedly were a relatively healthy subsample of all people who enrolled in the program. The only exception to this was for body weight, as people who were overweight or obese used the program more frequently than people of normal body weight. One explanation for this is that a higher risk for disease is associated with a higher probability of participating in counseling [11]. It is known that overweight and obese people perceiving weight as a health risk are more likely to have prepared and/or initiated activities to lose weight [12]. Furthermore, people with chronic conditions are more likely to search for health information on the Internet than those without [13]. Another possible explanation for the fact that overweight and obese people used the program relatively frequently is that Web-based counseling may be particularly appealing for people with stigmatizing diseases. Despite the increasing prevalence of overweight and obesity in most groups of the population [14,15], excessive body weight or the failure to lose weight may continue to be stigmatizing [16][17][18]. This may also help explain the unexpected effects that were observed for body weight in the current study.
Given the known dose-response relationship between the frequency and intensity of counseling and the achieved behavior change outcomes, it is disappointing that only 10% of the participants used the program more than once. Previous research on Web-based health behavior change programs has shown that people are much less interested in programs that encourage lifestyle improvement than they are in programs that simply compare their behavior to relevant guidelines [19]. This comparison to relevant guidelines can be achieved with a single participation. On the other hand, when people are looking for solid counseling on possible lifestyle improvements, multiple counseling sessions are necessary. People's lack of interest in behavior change counseling may thus help to explain the limited repeated use of the current program.
It is unclear how people's motivation to participate in behavior change counseling may be increased, but it is evident that this change needs to be brought about before Web-based health promotion programs have the potential to lead to sustained behavior change. Upon first use of the program, an effort should be made to explain that behavior change does not occur overnight and that the program people are working with includes follow-up modules that make long-term support possible. Work presented by Spittaels and De Bourdeaudhuij [20] suggests some other approaches that may contribute to increased use of Web-based behavior change programs. A key issue may be to have face-to-face contact before people are referred to the Web-based program. When 100 flyers with information on a Web-based program to promote physical activity were handed out to participants in person, it led to 41 people receiving tailored advice. When the same number of flyers were placed in strategic positions throughout a hospital, it led to only 8 people receiving tailored advice. Another factor that was emphasized in the work of Spittaels and De Bourdeaudhuij [20] is the use of frequent reminder emails. These reminder emails were appreciated by the study participants, but no effect in terms of self-reported physical activity was observed. Rewarding people with something may also help to increase repeated participation [19]. Verheijden and colleagues reported that 76% of the people who were not intrinsically motivated to participate in follow-up programs said they would be interested in participating when given a bonus or reward.
In conclusion, our findings support the debate on the current proliferation of Web-based health behavior change programs and they stress the need to find new approaches to reach the primary target groups via the Web. This study supports earlier findings that Web-based health behavior change programs may largely fail to reach those for whom health behavior change is most necessary. By interesting contrast, overweight and obese people were more frequently repeat users than people of normal body weight. This effect may be due in part to the non-stigmatizing nature of Web-based interventions as opposed to face-to-face interventions. These findings suggest that Web-based health behavior change programs may be more successful in the area of weight management than in many other health-related areas. They also stress the importance of adequate coverage of weight management in Web-based health promotion programs, as a driver of continued participation for overweight and obese people.

Introduction
Nowadays, online assessment is becoming necessary as clinical psychology is considering the Internet as a medium through which therapy and counseling can be offered [1]. It has already been shown how easy it is to create a website containing tools to assess psychological problems or constructs [2]. Moreover, the advantages over the traditional way of gathering data, such as easy and immediate scoring and missing data handling, have been made evident [3]. At this point, the reliability and validity of online questionnaires have become current and relevant research topics.
So far, the question "Will the mode of administration affect the respondent's score?" has barely been formulated, and research on this topic has been undertaken by only a few studies. Concepts related to social desirability [4], self-disclosure [5], or computer anxiety [6] are suggested as modulating variables that could modify the attitude toward computerized tests. Despite literature on these subjects, the research is still scarce and inconclusive and points to the need for further research to compare data from paper-and-pencil and online versions.
In that sense, a growing number of computerized or online questionnaires related to areas such as panic/agoraphobia [3], youth independence living [7], aggression and impulsivity [8], quality of life in diabetes [9], and a battery of 16 other health-related questionnaires [10] have already been studied. All but one of the computer/online versions (the Aggression Questionnaire by Buss and Perry [11]) were declared equivalent to their respective paper-and-pencil tests. Along with this, randomized studies on psychological distress tests have shown the same equivalence between the online and paper-and-pencil versions [12]. Nevertheless, in spite of the positive results supporting online assessment, the study of psychometric properties of online tests has frequent methodological problems (lack of random assignment or differing demographic characteristics to ensure sample equivalence), which make the adequate reliability or equivalence analysis difficult [13].
Taking the current state of the research into account, the present work aimed to obtain reliability and validity data for the online versions of two of the most frequently used psychopathology screening questionnaires in mental health: the General Health Questionnaire-28 (GHQ-28) [14] and the Symptoms Check-List-90-Revised (SCL-90-R) [15]. This paper is part of more extensive research aiming to develop a psychological treatment website following previous analysis of online clinical psychology websites in Spain [16]. Both questionnaires are used in a counseling website, the preliminary phase of the psychological treatment site. This choice was based on the wide research on and the historical use of these two questionnaires in psychopathology [17], as well as on their simple self-report structure, which makes it easy to incorporate them into a website.
A test-retest situation was chosen to obtain the reliability and validity data. Reliability was calculated as internal consistency, and test-retest correlation served as an equivalence index of the two test administration methods (paper-and-pencil and online). Inner structure exploration by factorial analysis was used to evaluate the construct validity of online versions. Although both questionnaires have a general score, they are divided into scales proposed as psychological disorders markers. The four scales of GHQ-28 (A: somatic symptoms, B: anxiety/insomnia, C: social dysfunction, and D: depression) have been found as a four-factor structure in previous studies [18][19][20]. For SCL-90-R, its nine scales (somatization, obsessive-compulsive symptoms, interpersonal sensitivity, depression, anxiety, hostility, phobic anxiety, paranoid ideation, psychoticism) were originally proposed as representing a nine-factor structure [21], but most of the research to date has failed to replicate this and has instead found either a primary global distress factor [22][23][24][25][26] or a four-, five-, or six-factor solution [26].
In short, with this work we try to contribute some of the needed empirical supporting data in order to ensure that online questionnaires have at least the psychometric characteristics attributed to their corresponding paper-and-pencil versions.

Sample
Participants were 185 psychology students recruited from two universities in Madrid, Spain. All of them had Internet access at home. This was a requirement to participate in the study in order to informally control how familiar participants were with the required technology. Although Internet familiarity is not a representative feature of the general population in Spain, this work is framed into a project in which the final point will be the development of a treatment website for mood disorders, so the sample resembles the target population in Internet familiarity.

GHQ-28
The General Health Questionnaire (GHQ) is used to detect psychiatric disorder in the general population and within community or non-psychiatric clinical settings such as primary care or general medical outpatients. In the GHQ-28 the respondent is asked to compare his recent psychological state with his usual state. It is therefore sensitive to short-term psychiatric disorders but not to long-standing attributes of the respondent. All items have a 4 point scoring system using Likert scoring (0-1-2-3). The GHQ-28 contains 28 items that, through factor analysis, have been divided into four subscales, as mentioned above.
The Spanish-language version of the General Health Questionnaire by Lobo and Muñoz [28] was used. In the online version, one could scroll through the whole test. A pull-down menu in which the possible answers appeared followed the text of each item.

SCL-90-R
The Symptom Checklist-90-R (SCL-90-R) instrument has been designed to evaluate a broad range of psychological problems and symptoms of psychopathology. The instrument is also useful in measuring patient progress or treatment outcomes.
The SCL-90-R has 9 subscales, as mentioned above and in Table 4. The sum of all 9 subscales is the Global Severity Index (GSI), which can be used as a summary of the test, reflecting overall psychological distress.
We used the Spanish-language version of the Symptoms Check-List-90-Revised by González de Rivera et al [29]. The same online display method was used as for the GHQ-28.

Procedure
A classic test-retest design was carried out: the paper-and-pencil version of the instrument was used for the test and the online version for the retest. After verbally agreeing to participate in the study, participants received a booklet containing instructions, sociodemographic questions, and both screening questionnaires in paper-and-pencil format. At the end of the instructions page there was a box with the address of the website containing the online questionnaires and the dates the site would be available. Identification of participants' online questionnaires was achieved by a nickname chosen and written by each subject in the questionnaire booklet.
To ensure that participants completed the online tasks, email addresses were requested in order to provide reminder messages (22 participants refused). Individual messages were sent 14 days after the paper-and-pencil task. A second reminder was sent if the online questionnaires were not received within a week after the first message.

Statistical Analysis
Statistical analysis was carried out with SPSS version 12.0. Reliability as internal consistency measured by Cronbach alpha was tested for both formats of the questionnaires and their subscales. Pearson correlation was used to assess the equivalence between paper-and-pencil forms and the online versions. A t test served to evaluate whether there were statistically significant differences between the mean scores of the formats. We also applied η² after a repeated measures ANOVA. η² is a measure of effect size in ANOVA: the degree of association between an effect (e.g., a main effect, an interaction, or a linear contrast) and the dependent variable. We used this statistic in trying to decide whether mean score differences have clinical relevance. Different benchmarks have been used to interpret η², but as for the P < 0.05 rule in hypothesis testing, there is only a rough guide to be used when no literature is available to compare effect size values, and the best way to interpret it must consider what outcome is being studied [30]. As this "rough guide," we will use η² = .01-.09 for a small effect, η² = .10-.24 for medium effects, and η² ≥ .25 for large effects [31].
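The statistics named above can be illustrated with a minimal sketch. It is not the SPSS procedure used in the study: the function names are hypothetical, the effect size is computed as partial η² for a two-level repeated-measures factor (t² / (t² + df), assumed here to correspond to what a repeated measures ANOVA reports for two formats), and the "negligible" label for values below .01 is an assumption added for completeness.

```python
import math
from statistics import mean, stdev, variance


def cronbach_alpha(rows):
    """Internal consistency; rows = respondents, columns = item scores."""
    k = len(rows[0])
    item_var_sum = sum(variance(col) for col in zip(*rows))
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_var_sum / total_var)


def pearson_r(x, y):
    """Pearson correlation between two administrations of the same scale."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(ssx * ssy)


def paired_eta_squared(first, second):
    """Partial eta squared for two repeated measurements: t^2 / (t^2 + df)."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t * t / (t * t + (n - 1))


def label_effect(eta2):
    """Benchmarks quoted in the text; values under .01 are labeled by assumption."""
    if eta2 >= 0.25:
        return "large"
    if eta2 >= 0.10:
        return "medium"
    if eta2 >= 0.01:
        return "small"
    return "negligible"


# Effect sizes reported later in the article:
print(label_effect(0.232))  # medium (SCL-90-R GSI)
print(label_effect(0.057))  # small (GHQ-28 scale B)
```

Values that fall between the quoted bands (for example .095) are treated as small in this sketch, which is a simplification of the published ranges.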
As stated earlier, construct validity was evaluated by means of principal components factorial analysis. Factorial structures similar to the ones shown in previous investigations following varimax rotation were expected, that is, four factors in GHQ-28 and nine in SCL-90-R. We also analyzed the unrotated solution and the sampling adequacy, using the Kaiser-Meyer-Olkin (KMO) test.
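For readers unfamiliar with the KMO measure, the following sketch shows one common way to compute the overall index from an item correlation matrix: the sum of squared off-diagonal correlations divided by that sum plus the sum of squared partial correlations, with the partial correlations taken from the inverse of the correlation matrix. The function names are hypothetical, the matrix inversion is a plain Gauss-Jordan routine written for self-containment, and the code is an illustration rather than the SPSS implementation used in the study.

```python
import math


def invert(matrix):
    """Gauss-Jordan inverse of a small square matrix (list of lists)."""
    n = len(matrix)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def kmo(corr):
    """Overall Kaiser-Meyer-Olkin sampling adequacy for a correlation matrix."""
    inv = invert(corr)
    n = len(corr)
    sum_r2 = 0.0  # squared zero-order correlations
    sum_p2 = 0.0  # squared partial correlations
    for i in range(n):
        for j in range(n):
            if i != j:
                partial = -inv[i][j] / math.sqrt(inv[i][i] * inv[j][j])
                sum_r2 += corr[i][j] ** 2
                sum_p2 += partial ** 2
    return sum_r2 / (sum_r2 + sum_p2)


# Three items that all correlate .5 with one another (made-up data).
example = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
print(round(kmo(example), 3))  # 0.692
```

Values close to 1 indicate that partial correlations are small relative to the raw correlations, which is the situation in which a factor analysis is considered appropriate.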

Participant Demographics
From the initial sample of 185 participants, 104 completed both online questionnaires. This represents 56% of the total sample and 63% of those who received reminder messages. Although missing data was not possible online, four participants were rejected because of paper-and-pencil missing data, so 100 questionnaires were actually analyzed (Table 1). The majority of retests were received around the 14th day after the test (median = 17 days; min = 14, max = 38), and 90% had been received within 28 days.
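The participation figures can be retraced with a short calculation. It assumes that the participants who received reminder messages were the 185 enrolled minus the 22 who declined to give an email address, which the text implies but does not state outright; the variable names are illustrative.

```python
enrolled = 185
declined_email = 22        # participants who refused to give an email address
completed_online = 104
excluded_missing = 4       # dropped for missing paper-and-pencil data

reminded = enrolled - declined_email                  # assumed reminder group
pct_of_total = 100 * completed_online / enrolled
pct_of_reminded = 100 * completed_online / reminded
analyzed = completed_online - excluded_missing

print(reminded)                    # 163
print(round(pct_of_total, 1))      # 56.2
print(round(pct_of_reminded, 1))   # 63.8
print(analyzed)                    # 100
```

Under this assumption the second percentage comes out at 63.8%, which the text gives as 63%.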

GHQ-28
Reliability results for the GHQ-28 are shown in Table 2. Cronbach alpha was .90 for the whole test in both the paper-and-pencil and online formats, and it ranged from .71 to .85 among the scales, with scale C (social dysfunction) showing lower values in both formats. Test-retest data showed significant correlations, ranging from .30 for scale C to .72 for scale B. Total score test-retest correlation was .69. We did a t test to see whether scores from the two formats differed. A significant difference appeared only in scale B, where paper-and-pencil scores were higher than online scores. We then used η² to check how big this difference was if taken as an effect size: its value was small (.057), being in the same range as for those scales in which mean differences were not statistically significant (see Table 2).
The factorial analysis of GHQ-28 reproduces fairly well the presupposed four-factor solution in both the online and paper-and-pencil administrations. Table 3 represents item factorial loads among factors. Taking .30 or larger loads to assign each item to a factor, in both the online and paper-and-pencil analysis factor 1 includes all depression items (D), with the exception of online item D5, and factor 4 includes all social dysfunction (C) items, except paper-and-pencil item C2. Factor 2 grouped B (anxiety) items online and A (somatization) paper-and-pencil items. Factor 3 does the opposite, corresponding to A items online and B items in paper-and-pencil, except for B5. So, it could be said that each factor is close to its clinical interpretation. Nevertheless, a few items have bigger loads than expected in other factors. Scales A and B share large loads, a fact quite understandable given that somatization and anxiety often appear together. Item D5 did not load at all in factor 1 in the online version, but did in factors 2 and 3. This could be explained by the meaning of the word "nerves" (included in the text of this item) being closer to anxiety than to depression. Item D5's large load on paper-and-pencil factor 2 supports this interpretation. Lastly, paper-and-pencil scale C has smaller loads than expected in factor 4 in three of its seven items. We will interpret this alongside scale C's test-retest correlation later.
The predominantly positive values in the original correlation matrixes suggest paying attention to a general unrotated factor that could explain some of the item sharing among scales. This general factor explained 28.44% (paper-and-pencil) and 29.48% (online) of the variance, and 27 and 26 (paper-and-pencil and online, respectively) out of 28 items had loads of .30 or greater.

SCL-90-R
Table 4 shows reliability data for the SCL-90-R. The Cronbach alpha of the global severity index (GSI) was .96 and .97 for the paper-and-pencil and online versions, respectively. Scales showed .72 or higher except for phobic anxiety in the paper-and-pencil questionnaire, which was .62. Test-retest correlation ranged from .63 for hostility to .86 for psychoticism. The correlation for the GSI was .83. Paper-and-pencil means were higher than online means in every score. A t test for repeated measures showed that those differences were statistically significant except for phobic anxiety and psychoticism. Squared eta (η²) analysis showed values from small to medium. It is important to note that η² for the GSI was .232, which means that more than 23% of the variance was due to method administration (see Table 4). That proportion could have clinical implications that we will discuss later.
The factorial analysis showed difficulty confirming the expected nine-factor solution for both the online and paper-and-pencil administration. All the items were scattered through the forced nine factors without the presupposed order. As an example, we could mention that the first online factor grouped items (.30 or bigger loads) from seven theoretical scales (anxiety, hostility, depression, interpersonal sensitivity, obsessive-compulsive symptoms, phobic anxiety, and psychoticism). Another fact led us to reject a factorial analysis for this questionnaire: the KMO test (online = .394; paper-and-pencil = .414) was under the recommended .6 value to accept such an analysis [32]. As a comparison, the GHQ-28 KMO values were .788 for the online version and .781 for the paper-and-pencil version. As a result we do not recommend the use of the SCL-90-R scales as the way to discriminate among different clinical problems.
However, it should be noted that the first unrotated component of the analysis explained more than 25% of the variance in both online and paper-and-pencil questionnaires, and 94% of the online items (85 out of 90) and 92% of the paper-and-pencil ones (83 out of 90) presented loads of .30 or higher for this general factor. This, together with reliability data, led us to accept this test as a general screening tool.

Discussion
The aim of this work was to find out whether the psychometric characteristics of two well-known, self-report questionnaires remain consistent when administered via the Internet. Our analysis of the online versions matches the results of the paper-and-pencil versions in several aspects, but some identified differences between the two formats should be explained.

GHQ-28
Regarding the GHQ-28, internal consistency was high in both formats (Cronbach alpha for all scales and total score was over .70). Nevertheless, test-retest reliability ranged from a too modest .30 to .72, while other studies have presented coefficients over .70, some of them using Spanish translations of the questionnaire [33]. On one hand, it could be said that the GHQ-28 keeps its reliability as internal consistency when delivered via the Internet, but, on the other hand, equivalence data are lower than expected, especially in scale C. The small test-retest correlation in this scale (.30) as well as its factorial instability in the paper-and-pencil version could be due to the experimental situation. C scale accounts for "social dysfunction," and the paper-and-pencil situation was "social" (all the students and the investigator were together in the same classroom), whereas the online task was completed at home. Perhaps this caused participants to interpret the C items differently and to vary their answers.
Mean differences between formats were small enough to be negligible if we take into account η² results. Even in scale B, where these differences were significant, the variance accounted for by administration method was only 5.7%, a proportion of little importance for a rough general screening test.
Validity analysis of GHQ-28 showed that the previously reported factor structure was replicated fairly well. As a whole, both online and paper-and-pencil results of this study match former works in which scales C (social dysfunction) and D (depression) were more consistent than A (somatic symptoms) and B (anxiety) [33,34]. This situation is clinically understandable given that somatic symptoms are frequent in anxiety disorders. A tentative explanation for the relative instability of the online C factor based on the experimental situation has already been pointed out.

SCL-90-R
The SCL-90-R maintained its internal consistency when delivered over the Internet; in fact, it was higher than in the paper-and-pencil version, and test-retest correlations were as high as in previous studies [26]. This leads us to propose equivalence of the online and paper-and-pencil formats. Our results match the literature on reliability as internal consistency in nonclinical samples [26] as well as the equivalence data using an SCL-90-R computerized version [17]. Nevertheless, all paper-and-pencil scores were higher than online ones. Here it is important to mention the η² values. Three scale differences and that for the GSI could be labeled as medium effects. As we mentioned above, in the case of GSI, this means that 23.2% of the variance could be explained by test administration method. This proportion is large enough to recommend caution when mixing online and traditional versions of this test, because scores could differ enough to mask (if online is first) or to mimic (if paper is first) the effect of a treatment. The presence of the experimenter and the participants during the paper-and-pencil session, plus the fact that all participants had Internet connections at home, leads us to believe that the online tasks were less aversive. This could be a tentative explanation of higher paper-and-pencil scores.
We have already mentioned the problems that most authors have faced when replicating the nine-factor structure of the SCL-90-R. In our case, the more parsimonious interpretation matches the conclusions of several articles: even when the proposed solution has more than one factor [22], the high variance percentage explained by the first factor should lead to consideration of the total score as a general dimension of psychopathology [26]. Perhaps, as stated by Cyr et al [23], "interpreting nine dimensions for clinical purposes is highly questionable" no matter if we are talking about online or classic assessment. As only one strong factor appears, a psychopathology discrimination function cannot be assigned to this tool. However, it does not lose its usefulness as a general psychopathological screening tool.

Conclusion
The results of this research are encouraging for the online use of the two questionnaires. In the GHQ-28, although two of its four scales had relatively small equivalence values, those of the other two as well as that of the general score were adequate, and the internal consistency values were high. Further research should be carried out to confirm these data, but our work supports the online use of this assessment tool.
The same could be said about the SCL-90-R: its online version could be taken as being equivalent to its classic paper-and-pencil version, and its internal consistency is high. However, paper-and-pencil scores were higher than online ones. Even when an online test has shown acceptable reliability and validity values, the use of normative data from paper-and-pencil questionnaires may not be appropriate [2], suggesting that as online testing spreads, research to obtain a bank of normative data from larger Internet samples should be an important goal.
Factorial analysis results for both online questionnaires showed factor structures similar to paper-and-pencil versions. SCL-90-R showed a similar factorial structure in its online and paper-and-pencil applications, but the results do not replicate the nine-factor structure proposed by Derogatis [21]. Other researchers have also had difficulty replicating the nine factors [21,23,26]. As a consequence, we recommend use of the questionnaire as a general index of psychopathology, using the summary score (GSI) only, not the subscales.
The use of standardized tools administered through the Internet needs further investigation, and as for paper-and-pencil versions, they are not enough to properly assess a clinical case. The results obtained by these screening tools should be taken only as part of the assessment and should never be used as the only basis to support any intervention.
Lastly, we should mention two limitations of this work that future research should try to address. First, as the most probable Internet users, the university community will be one of the target populations for any Internet-related research. We must stress that this technology is spreading fast, so samples outside the university community must be analyzed. Second, our experimental design did not allow us to separate the effects of the test-retest situation from those of the format effect. Therefore, the next step should be to compare four groups (Internet and Internet; Internet and paper-and-pencil; paper-and-pencil and paper-and-pencil; paper-and-pencil and Internet) to discriminate both effects.

Introduction
Asthma is one of the most common chronic conditions in the United States, yet it is estimated that approximately three fourths of patients with asthma do not have adequate control [1]. New interventions are needed to improve the care of patients with this condition [2][3][4]. In 1997 and 2002, the National Heart, Lung, and Blood Institute released guidelines for asthma care [5]. Despite the existence of these guidelines, studies show that health practitioners are not following the recommendations and that there is low compliance and inconsistency in asthma management nationwide [2,6,7]. Noncompliance with guidelines can lead to overconsumption of health care resources, increased cost, and increased morbidity [8]. Though patient adherence to medications (eg, corticosteroid inhalers) is partly to blame, lack of asthma control also reflects "clinical inertia," or the tendency of providers to make no treatment changes even though a patient has not achieved a treatment target [9][10][11][12]. However, research evidence strongly suggests that patients who ask their health care providers for tests and treatments are more likely to receive them [13,14], though the effect of this strategy on chronic disease management has not been well studied [15].
To test the impact of patients asking their health care providers about tests and treatments they could receive, we developed an interactive website (myexpertdoctor.com) to inform patients about asthma and to provide tailored feedback. The website is designed to be used before a physician visit to help patients know what questions to ask during the visit, which in turn may increase the chance that they receive tests and treatments suggested by evidence-based guidelines (see Figure 1).
Computer applications have been used to improve asthma control by improving patient education [16,17] and disease monitoring [18,19], and by prompting physicians to practise guideline-concordant care [20][21][22]. However, we are not aware of any interventions designed to prompt patients to ask questions during provider visits in order to improve the quality of their care. We conducted a qualitative study to understand the effects of a Web-based intervention on the physician-patient relationship and on asthma care [23,24]. Although previous studies have shown that the intervention did, by pointing out deficiencies in the quality of their care, cause users to believe they received worse care [5,26], data are needed to understand more fully the potential effects of such an intervention, in particular, how the intervention can impact doctor-patient communication. That was the goal of the current study.

Intervention Development
The overall design of the Web-based intervention, which included modules related to various medical conditions, including migraine and osteoarthritis, has been published elsewhere [5,26]. To review, four steps were used in developing the intervention. First, evidence-based decision rules were identified by reviewing clinical guidelines [5]. Second, a self-report survey was created to measure the adherence to each of the guidelines. Next, tailored feedback and suggested questions were created and prioritized for each guideline (see Table 1 and the Multimedia Appendix). Finally, the questions and feedback were programmed into a secure and reliable website, resulting in a three-step process for study patients accessing the site prior to a physician visit. First, patients were prompted to answer 10-20 questions relating to their asthma and its care. Next, patients received immediate personalized feedback and information about their condition, including a list of suggested questions to ask their physician. And finally, patients were encouraged to use this information and the questions during their upcoming physician visit.
You may benefit from using a second inhaler to prevent asthma symptoms. From what you told us, your asthma is not well controlled. Also, you're not using a second type of medicine to prevent asthma symptoms. You may want to ask your doctor about this. Medicines that prevent asthma symptoms come in two types. The first are steroids, such as Beclovent. The second type opens the airways, such as Serevent. Both of these medicines should be used each day, whether you have symptoms or not.
Would I benefit from using a peak flow meter to monitor my asthma at home?
You may benefit from using a peak flow meter at home. It's not always easy to know when your asthma is getting out of control. From what you've told us, you've been to the emergency room at least a couple of times over the past year. Because of that, you might benefit from using a peak flow meter every day. You may want to ask your doctor about this. A peak flow meter is a small plastic tube that you blow in to see how your asthma is doing. That way, if you feel okay, but your peak flow is low, you can make a change before you feel worse. It's also important for you to know what to do depending on your peak flow number. This is also something to discuss with your doctor. This is where an "asthma action plan" comes in handy. This is discussed below.
Would I benefit from seeing an asthma specialist at this time?
You may benefit from seeing an asthma specialist. From what you've told us, you've been to the emergency room at least a couple of times over the past year. Because of that, it's important for you to see a specialist once in a while. You haven't seen an asthma specialist in the past year. You may want to ask your doctor about this. A specialist can help you figure out if you need different tests or treatments for your asthma. These doctors include "pulmonologists" and "allergists."
How frequently should I be using my controller medication(s)?
You could do better to prevent asthma attacks. From what you told us, you're using a medicine to prevent asthma attacks. These are called "controller" medications, such as the Azmacort that you are taking. You are not using your controller medicine every day. You may want to talk to your doctor about this. Controller medications help to prevent you from feeling short of breath. They need to be used every day, even if you feel fine.
The website feedback consisted of three elements: (1) a list of suggested questions for the patient to ask his or her physician, (2) a lay explanation of why the patient should ask the physician these questions (one message for those whose care was in keeping with the guideline [eg, moderate persistent asthma whose medication list included a corticosteroid inhaler] and one for those whose care was not), and (3) links to other websites for further reading and explanations of the suggested topics (sites selected and reviewed by a panel of experts). Because the feedback pointed out areas for potential improvements in care ("quality gaps"), we were concerned that some patients would believe that their provider was not giving them needed care. We made an effort to keep the feedback as neutral as possible with regard to this issue. For example, rather than indicating "you need…," the feedback typically suggested "you may benefit from…." The rationale for the feedback was based on the Chronic Care Model, a theoretical model of chronic illness care in which one of the overarching goals is to foster productive interactions between patients, who actively participate in their care, and providers, who can draw on the expertise of guideline-based reminders [27,28].
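The general idea of mapping survey answers to suggested questions can be sketched as a small set of decision rules. Everything in this sketch is hypothetical: the field names, the threshold of two emergency room visits (standing in for "at least a couple of times"), and the rule logic are illustrative assumptions inferred from the example feedback messages, not the decision rules actually programmed into the website.

```python
from dataclasses import dataclass


@dataclass
class SurveyAnswers:
    # Hypothetical fields; the real survey items are not reproduced here.
    er_visits_past_year: int
    uses_peak_flow_meter: bool
    saw_specialist_past_year: bool
    has_controller_medication: bool
    uses_controller_daily: bool


def suggested_questions(answers: SurveyAnswers) -> list:
    """Return questions a patient might bring to the next visit (illustrative rules)."""
    questions = []
    frequent_er_use = answers.er_visits_past_year >= 2  # assumed threshold
    if frequent_er_use and not answers.uses_peak_flow_meter:
        questions.append(
            "Would I benefit from using a peak flow meter to monitor my asthma at home?"
        )
    if frequent_er_use and not answers.saw_specialist_past_year:
        questions.append(
            "Would I benefit from seeing an asthma specialist at this time?"
        )
    if answers.has_controller_medication and not answers.uses_controller_daily:
        questions.append(
            "How frequently should I be using my controller medication(s)?"
        )
    return questions


patient = SurveyAnswers(
    er_visits_past_year=2,
    uses_peak_flow_meter=False,
    saw_specialist_past_year=False,
    has_controller_medication=True,
    uses_controller_daily=False,
)
for question in suggested_questions(patient):
    print(question)
```

Keeping each rule as a separate condition mirrors the one-guideline, one-question structure described above, and it leaves room to attach the lay explanation and further-reading links to each question.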

Study Design
The study was designed to document the experiences of patients who had used the asthma quality improvement website prior to a visit with their asthma care provider. Because qualitative methods are useful when conducting exploratory research [23,24], and little is known regarding the impact of these types of interventions on quality of care or on doctor-patient communication, this study employed a semi-structured interview methodology to investigate the range of patient experiences with the website, before and during their doctor visit [29]. The interview guide was created by the research team, including two doctoral-level anthropologists and a board-certified internist (CS). Interview questions focused on patients' impressions of the utility of the website, including their experiences using it and specific website navigation issues, as well as the effect of the tailored feedback from the website on doctor-patient communication and perceived quality of care during patient visits. In addition, physicians were recruited to view the website and subsequently participate in a survey regarding their impressions of the site. Ethical approval for the study was obtained from the Institutional Review Board at The Abacus Group, LLC, Cranston, RI, USA.

Sampling and Recruitment
In response to an advertisement on Google and letters mailed to asthma patients of a large health insurance company in Rhode Island, USA, potential subjects were encouraged to contact research staff and were screened over the phone for inclusion and exclusion criteria. The following inclusion criteria were used: (1) age greater than 21 years, (2) self-reported history of asthma, (3) a planned visit with an asthma care provider in the next 2 months, (4) Internet access at home or work, and (5) asthma care from a primary care provider rather than a specialist (eg, pulmonologist). During the screening phone call we asked patients the date of their next asthma provider visit and set a date for a follow-up phone interview after the visit. The final sample comprised 37 patients, who were mailed an informed consent form. The content of the form was explained over the phone, and subjects were encouraged to call the research team if they had any questions. All 37 subjects returned the consent form by mail before the date of their physician visit. Subjects were reimbursed US $100 for participation in the study.
For the physician sample, a nationally representative database of primary care providers was purchased from a marketing organization. A recruitment letter was mailed to a national random sample of 250 physicians; 26 physicians agreed to participate in the study and completed the anonymous survey on the study website. Physicians were reimbursed US $100 for participation in the study.

Data Collection
A research assistant monitored patients' use of the website to be certain they used it before their physician visit. Emails were sent to every participant 7, 4, and 3 days before the date of the visit, reminding them to use the website. Two of the 37 subjects had not used the website within 72 hours of their planned asthma care provider visit and were called by the research assistant to remind them to use the website before their visit. Prior to the asthma care visit, all 37 subjects used the website, answering questions and receiving personalized feedback. The website and the research assistants encouraged patients to print the individualized feedback and take it with them to the physician visit. After visiting their physician, all patients were contacted and participated in a semi-structured telephone interview that lasted approximately 25 minutes. The interviews were conducted by one of two trained research assistants (DB, HL) using an interview guide created by the research team. Open-ended questions, through the use of follow-up questions and probes, allowed for in-depth exploration of topics such as those related to the individualized feedback provided by the website, how this feedback shaped (if at all) the participants' interactions with their physicians, and whether and how the website and feedback were useful in helping them communicate with their physician [29]. Close-ended questions were also asked and addressed (1) use of the website before the visit, (2) use of the website information during the physician visit, and (3)

Data Analysis
Patient interviews were audiorecorded, transcribed, and entered into QSR NVivo qualitative software (version 2.0; QSR International; Melbourne, Australia) in order to facilitate data management and analysis. The transcripts were coded by a doctoral-level anthropologist based on the grounded theory technique, in which codes are drawn from the text and coding involves frequent comparative analysis of the data [30,31]. In order to establish the coding scheme, a random 25% of the interviews were coded by an additional doctoral-level anthropologist and discrepancies in coding were resolved via consensus [24,31,32]. Overall, 108 separate codes were identified. As this was a pilot study, a majority of the interview questions, and therefore the codes, related to patients' experiences using the Internet for medical searches and their impressions and use of specific pages of the website. For purposes of the thematic analysis presented here, we were interested in understanding the experiences of patients using the website and how use of the website impacted their physician visit, with particular attention on the doctor-patient interaction and relationship. For that reason, in the thematic portion of the manuscript, we concentrate on presenting the codes and themes that were germane to this issue. Patients' suggestions for improving the website are presented as well. Physician data were collected online and transferred to an Excel database. All descriptive statistics were calculated using SAS (version 9.1).

Quantitative
The large majority of the patients in this study were female (

Qualitative
Overall, patients in this study found that having information from the website positively impacted their interactions with their physicians, and, importantly, while some patients reported some dissatisfaction with the website overall, no patient reported a negative encounter with the physician attributable to use of the website. Physicians also reported positive feelings about the website content, while at the same time offering suggestions for improvement. A number of other comments fell into no specific category and ranged from encouragement to keep the questions and feedback simple, to providing lists of asthma specialists in the area, to allowing for questions from patients and including an appointment schedule.

Patient Suggestions for Improvement of Website
A total of 37 patients suggested improvements for the website. Their feedback pertained to its not providing enough feedback, not providing new information, not giving feedback that was specific to the user, and not having enough scientific information. For example, some participants (n = 13) found that they wanted more detail in both the questions and the feedback, particularly some of the participants whose asthma was rated by the website as being "under control." A majority of the users who expressed some reservations about the website nevertheless found some aspect of it to be helpful, either the feedback, the links to other sites, or simply the encouragement to approach their physician with questions.

Patient Themes
Patients' answers to interview questions regarding the individualized feedback and the impact of the website itself centered around two main themes. First, one common result of visiting the website and subsequently visiting a physician was a positive shift in patient attitudes regarding interactions with their physicians. Patients reported that they had more self-confidence, they talked more during the visit, and they had more confidence in the care they were receiving. A second, related theme focused on patients becoming more actively involved in their own asthma care. They gained a better understanding of their treatment options and of their role in managing the condition.

Theme 1: Positive Shift in Patient Attitudes Toward Interactions with Physician
A majority of patients (20/37, 55.6%) answered "yes" when asked a close-ended question about whether the website had influenced their physician visit in some way. In addition, many of those answering "no" or "unsure" to the same question nevertheless revealed in answers to subsequent open-ended questions that their visit had been positively affected. Common to many patients' statements was an increase in self-confidence with regard to communicating with the physician because of the information and questions from the website. Finally, because the website was able to give them or guide them to specific information about their condition, patients repeatedly mentioned feeling more confident in their understanding of issues related to their asthma and its treatment.
This led to what they perceived as improvements in the physician-patient relationship and greater confidence in the care they were receiving. In some cases, the physician's authority was validated by the patient's own research, rather than the research being validated by the physician's authority. This was in part due to the fact that the website and the links it provided gave more information than these participants felt they could glean from their short interactions with their physicians. As can be seen from the quotations above, use of the website promoted affirmative changes in patient attitudes toward their interactions with their physician. Not only did visit experiences improve from the patients' perspectives, but some patients also indicated that their questions prompted changes in their treatment. In other words, asking specific questions was reported to overcome clinical inertia and result in positive modifications in patients' care.

Theme 2: Patients Becoming More Actively Involved in Their Asthma Care
Participants varied in how they spoke of their knowledge about asthma and their understanding of their role in managing their asthma. A number of patients were generally more knowledgeable about their condition than others, and while some found the information on the website too limited in scope, others thought the website encouraged them to become more active in seeking additional information, either online, in print sources, or from other individuals. Some patients had traditionally relied in part or, to a lesser extent, exclusively on their physicians for information about their treatment. For many of these patients, the information provided in the feedback and from the links to other sites led to an increased understanding of their condition and of how they could become more involved in their own care.
Patients' responses illustrate increased involvement in their own care as a result of information and feedback attained either directly or indirectly through use of the website. In addition, as was evidenced in the first theme, having patients pursue new treatment alternatives and become more actively involved in their own care led to perceived positive changes in physician behaviors, including medication and monitoring adjustments.

Discussion
Inconsistencies in the implementation of and noncompliance with asthma treatment guidelines contribute to a reduction in health quality for individuals with the condition [1,34]. This study of a novel Web-based intervention investigated the impact of giving patients individualized information that was designed to prompt physician-patient discussions around issues raised by the evidence-based asthma guidelines [35] and improve the quality of care. To date, to our knowledge, no research has been performed on this type of strategy. Analysis of the data collected showed that knowledge gained from the website and its feedback form positively influenced patients' interactions with their physicians, their knowledge about asthma, and their feelings of responsibility for managing their condition. In addition, patients accepted and enjoyed using the technology to assist with their asthma care, and physicians had positive impressions of the site and its potential to improve care.
Our qualitative results validate the findings of many quantitative studies that have examined similar issues using survey data.
Our previous research has shown that patients have a need for the information that is provided on the website. For example, in 2001, we asked 300 primary care patients if they had ever used or had any interest in using the Internet to "find out what questions you should ask your doctors when you see them" [36]. Nearly 60% of the patients were interested in using the Internet for this purpose, and fewer than 30% of the patients had ever found this information on the Internet. It is therefore not surprising that such a high percentage of patients in this study were satisfied with the asthma website. In a similar survey-based study, in which patients were observed searching the Internet for health information before a doctor visit, most (90%) reported feeling more satisfied with their visit than with previous visits because of the Internet use [37]. In that study, patients were trained to use a specific website with links to specifically chosen patient education websites. Those positive findings, coupled with our current findings, suggest that a guided Internet search experience can be quite acceptable to patients and can improve satisfaction with subsequent doctor visits [37].
We hoped that by providing patients with a small number of evidence-based questions, they would have an enhanced patient-provider experience. A previous review of the effects of the Internet on doctor-patient communication identified some concern that physicians may be annoyed by patients bringing in information from the Internet, and that this may possibly harm the doctor-patient relationship [38]. For example, only 15% of 168 physicians surveyed believed that the Internet would improve their relationship with patients, while 49% felt it may harm it [39]. Our results generally point toward the intervention helping patients to communicate better with their physician in order to become more of a partner in their own care. No patient mentioned having a negative interaction with the physician as a result of having used the website. Additionally, the physicians themselves rated the website positively and indicated that they thought the website would be useful in helping to improve patient health care.
We were pleased that subjects viewed the website as reinforcing the care they were receiving from their physician, rather than causing them to question it and potentially undermining their belief in their physician. Our findings are consistent with two other published studies of the website, which observed that the website feedback had no adverse impact on patient perceptions of overall quality of care from a physician [25] or care during a physician visit [27]. This is also consistent with the findings of Kivits [40], who observed that patients viewed Internet health information as complementing, rather than opposing, information from their doctor.

Limitations
Data in this exploratory study were collected from a convenience sample of individuals; the majority of the patient participants were white, female, and over age 35. Because of the sampling method, the results may not be generalizable to patients with asthma, the general patient population, or the physician population. In addition, patients were prompted several times to use the website before their provider visit, including by phone calls, which would not occur if the website were implemented in a nonstudy setting. However, although research in the area of using websites to impact physician-patient interactions is new, previous research does support patients' positive reactions to prompts designed to further physician-patient communication [41,42].
Because some of the patients' physicians may have seen the printout from the website and some patients may have told their providers that they were participating in a study, a Hawthorne effect cannot be ruled out. Nevertheless, as the intention of the study was to have patients use the information from the website to prompt discussions with their physicians through the use of the feedback form, we judged the bias attributable to this effect to be limited. Although some providers may have known that their patients were study participants and therefore may have paid more attention to the patients' questions, none of the data collected from the patients themselves indicated that this was the case.
Finally, although patients reported changes in the interactions with their physicians, these may not have translated into changes in patient care. Further research needs to be conducted on whether the website is successful in improving patient outcomes.

Implications for Practice
The Internet has great capacity to positively influence health care, and there is good evidence that it is already an entrenched part of the medical landscape. In the United States, 60% of adults have Internet access, and over 80% of patients with Internet access have searched for health information online [33]. Similarly, we have observed in a survey of 330 primary care patients that most (62.1%) patients felt that their doctor should "recommend specific websites where I can learn more about my health and health care" [43]. However, once physicians encourage patients to use the Internet, patients are likely to hold them accountable for discussions of the information they find. In a separate study, we observed that patients whose physicians did not discuss information gleaned from a tailored-message computer program, much like the website we have designed, were significantly less satisfied with their visit [44]. This suggests that there are likely to be bumps along the road in making suggestions from websites and discussions of Internet searches a part of routine care.
Nevertheless, having patients ask the most pertinent questions relating to their care may help them get the most out of the normally brief office visit. Having patients access this type of information while in a doctor's office has been shown to be too challenging [45]. And while lists of questions to ask a physician about specific conditions are available online (from the American Heart Association, for example), the website in this study took these types of questions one step further, tailoring the feedback based on individuals' answers and explaining why each feedback question was important. In addition, the feedback page placed questions about controlling the condition at the top because results from a prior study indicated that patients were more likely to ask the questions at the top of the feedback page [26]. It is possible that interventions such as the one in this study may increase the efficiency of brief office visits by allowing patients to access pertinent information at home and come to the visit prepared with a list of individualized, prioritized questions and a greater understanding of why asking these questions is important.
While the website has clear implications for practice, it was not designed to be made available to patients in a physician's office. Rather, it should be viewed as a prototype for possible distribution through a number of channels: (1) managed care organizations, to improve the quality of care they provide and satisfy accreditation requirements from the National Committee for Quality Assurance [46]; (2) employers, to decrease work limitations from chronic diseases such as asthma; and (3) advocacy groups, such as the American Lung Association, to improve the quality of care for their constituents. Being a narrowly focused website, it is a relatively inexpensive intervention that requires minimal maintenance, as guidelines are not published that often and major changes to standards of care occur infrequently. Given the steady increase in Internet access [33], we believe that future versions of myexpertdoctor.com and similar offerings could have a significant and positive impact on asthma care and quality of life among patients with asthma and other chronic conditions.

Conclusions
The present study has given us confidence that the current intervention has the potential to improve the way patients communicate with their provider and that the suggested questions can overcome the clinical inertia of providers. Both physician and patient users of the site provided useful feedback on changes that could be implemented in future versions of the website to make it more effective. The main findings (that use of the website and its feedback form positively influenced patients' interactions with their physicians, their knowledge about asthma, and their feelings of responsibility for managing their condition) all point in the expected direction and suggest that the website can improve the quality of care patients receive. We believe that it is essential to give the Internet functionality beyond being a passive, albeit massive, repository of health information. A national study of 4764 adults who used the Internet for health information noted that only one in six believed that the Internet had influenced treatments that they used for a health condition [47]. Although patients may have the potential to learn a great deal, much of the information is beyond their comprehension as it is written at a high reading level and many patients are relatively health illiterate [48,49].
However, empowering patients with specific questions to ask appears to put health information into patients' hands in a way that activates them to be involved in their care. Despite the great number of medically oriented websites, we are not aware of another that provides patients with evidence-based questions to ask their doctor. Most interactive health websites focus instead on providing tailored risk assessment, such as the Heart to Heart Tool [50], Heart Profilers on the American Heart Association website, RealAge, and WebMD. Ongoing studies are evaluating the effect of the website we have designed on patient health outcomes. Given the steady increase in Internet access, we believe that if the current intervention proves to be effective, it may have a significant impact on the control of asthma, as well as other chronic medical conditions.

Introduction
Two important steps in vocabulary development are (1) the identification of candidate strings (ie, words or phrases) in a domain and (2) the determination of which of these should be included in a vocabulary as "valid" terms, also called "termhood determination." Health vocabulary development, which has a long history, requires significant effort for collecting candidate terms and determining termhood [1]. While vocabularies such as SNOMED (Systematized Nomenclature of Medicine) and ICD-9 (International Classification of Diseases, Ninth Revision) include many health terms, there is no consensus on termhood criteria (ie, what constitutes a "term") [2]. The decision to include terms in a vocabulary is made for a particular domain for certain tasks (eg, indexing or billing). Thus, the review criteria and procedures used by vocabulary developers, which are often not published, inevitably differ. Terms included in health vocabularies also vary significantly. For instance, in the Unified Medical Language System (UMLS), the same concept is often represented in various source vocabularies by different terms. The terms "head ache" and "cranial pain" are both synonyms of the UMLS concept "headache." The source vocabulary for "head ache" is DXplain, and the source vocabulary for "cranial pain" is MeSH (medical subject heading).
Research and development of controlled consumer health vocabularies (CHVs) is a relatively new endeavor in the health vocabulary field [3]. In the general biomedical literature, research on consumer understanding of medical words and concepts has focused primarily on relatively short lists of discrete terms in various specialties. In the informatics domain, a few companies (eg, Apelon and WellMed) offer proprietary CHV products, though these products have not been publicly evaluated.
The general goal of our CHV research is to help overcome the vocabulary gap between consumers and health information provided by informatics applications. The specific aim of this paper is to elucidate term identification methods for CHVs. CHV research has largely been driven by the proliferation of health-related materials on the Web, the emergence of electronic personal health records, as well as the growing availability of various consumer health applications (eg, decision support tools). Over the past five years, researchers have found that consumer terms are not well covered by the existing health vocabularies, which mostly represent the language of health professionals [4-9]. Indeed, expressions used by consumers to describe health-related concepts and relationships among such concepts frequently differ on multiple levels (ie, syntactic, conceptual, and explanatory) from those of professionals. Thus, consumer health informatics research and application development will benefit from the development of CHVs.
Developing and validating a comprehensive CHV is challenging because "consumers" constitute a plethora of highly diverse groups. Further, individuals uniquely acquire health-related terms and concepts from formal and informal sources (eg, media exposure) and from personal experiences. Nevertheless, there is strong evidence of the stability of lay health language among particular populations, for specific tasks [3].
We have been working on an open access and collaborative (OAC) CHV project. The first step in creating the OAC CHV was to identify consumer terms since surface forms, represented as strings in written text, are more tractable than concepts (ie, underlying meanings) or semantic relations, both of which require in-depth understanding of term usage, rhetorical intent, and explanatory models. Because consumer terms are heterogeneous and even less well defined than professional terms [10], the termhood determination task proved to be particularly challenging. Our term identification effort has been guided by two principles:
1. CHVs consist of actual terms commonly used by consumers (in any particular discourse group).
2. CHV terms must allow for computer processing of consumer language.
Since many professional health vocabulary terms are already used by consumers, though in some cases with different or broader semantics (eg, "diabetes" for diabetes mellitus, types 1 and 2), we focused on consumer terms not yet represented in existing vocabularies (eg, "broken finger" for any type of fracture in the "distal," "middle," or "proximal phalanges").
Because the number of candidate strings is often very large in any domain, researchers have explored the use of corpus-based automated term recognition (ATR) methods for extracting the most promising strings for human review from domain-specific documents [11,12]. ATR methods vary from statistical or information theory-based approaches (eg, t test) [13] to syntax-based methods (eg, noun phrase extraction and context analysis) [14] and hybrid mechanisms (eg, C-value formula) [15,16]. Both the t test and the C-value formula have been used successfully in termhood determination. Such studies reinforce the general notion that strings typically considered as terms share some common characteristics: words in a term tend to occur more frequently together, terms are often noun phrases, and terms may be part of several longer strings.
In the biomedical domain, ATR methods have been applied to Medline literature [17] and clinical reports [15]. While most ATR methods outside the biomedical domain were designed to be general purpose, biomedical ATR methods tend to be more narrowly focused [18]. The types of terms targeted by ATR vary and include gene and protein names in a number of recent studies [18-21].
In this study, we first identified CHV terms through collaborative review of strings derived from query logs of a consumer health site [22]. Because of the considerable variability in lay health expressions, standardized review criteria and procedures to ensure consistency in selecting CHV terms were developed. After obtaining the human-reviewed n-grams (ie, n word strings), we experimented with two ATR methods: logistic regression and the C-value formula. The initial features used in the regression model were informed by existing ATR methods, in particular, the C-value model [16] and the termhood formula proposed by Wermter and Hahn [12]. We also evaluated the popular C-value method.
Our use of ATRs in this study differs from that in prior studies in the biomedical domain in two aspects: (1) short phrases from query logs were used as the text corpus rather than entire sentences from full-text sources, and (2) "new" CHV terms, not yet part of existing vocabularies, were identified rather than "pre-existing" terms such as UMLS terms.

Methods
The term identification study had three components:
1. Candidate string extraction from a query log data set of terms that could not be mapped to UMLS
2. Collaborative manual review of a subset of the candidate strings and identification of CHV terms
3. Application of ATR methods (the C-value formula and logistic regression models) to human-reviewed CHV terms

Candidate String Extraction
We obtained a set of query log files [22] from the MedlinePlus site covering the period from October 2002 to October 2003, courtesy of the National Library of Medicine (NLM). The log data were preprocessed to filter out all queries that were not in English, that appeared to be machine generated (eg, very large numbers of queries from the same IP address within a minute), or that were redundant (ie, from the same host at time intervals of less than 5 minutes).
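The preprocessing filters described above can be illustrated with a minimal Python sketch. It is not the authors' code. The record layout `(host, timestamp, query)`, the function name `filter_queries`, the burst threshold `max_per_minute`, and the reading of "redundant" as the same query text repeated by the same host within 5 minutes are all assumptions made for illustration. The English-language filter is omitted because the source does not say how it was done.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def filter_queries(records, max_per_minute=30,
                   redundancy_window=timedelta(minutes=5)):
    """Drop likely machine-generated and redundant queries.

    records: iterable of (host, timestamp, query) tuples (hypothetical layout).
    max_per_minute: assumed cutoff for "very large numbers of queries".
    """
    records = sorted(records, key=lambda r: r[1])

    # Pass 1: flag hosts that exceed the cutoff within any calendar minute
    # (a simple approximation of "within a minute").
    per_minute = defaultdict(int)
    for host, ts, _ in records:
        per_minute[(host, ts.replace(second=0, microsecond=0))] += 1
    suspected_bots = {host for (host, _), n in per_minute.items()
                      if n > max_per_minute}

    # Pass 2: drop a query if the same host issued the same text
    # less than `redundancy_window` earlier.
    last_seen = {}
    kept = []
    for host, ts, query in records:
        if host in suspected_bots:
            continue
        key = (host, query.strip().lower())
        previous = last_seen.get(key)
        last_seen[key] = ts
        if previous is not None and ts - previous < redundancy_window:
            continue
        kept.append((host, ts, query))
    return kept
```

In this sketch the redundancy check compares each query with the immediately preceding identical query from the same host, so a long chain of repeats spaced under 5 minutes apart is reduced to its first entry.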
The preprocessed queries were then mapped to the 2004AA version of the UMLS Metathesaurus using lexical methods (ie, removing non-alphanumeric symbols, stemming, normalization, and truncation). Queries that did not map to the UMLS Metathesaurus were broken into n-grams. N-grams that matched terms in the Metathesaurus were removed, and the remaining n-grams were collected into sets by frequency and number of words.
We used n-gram analysis to find candidate terms from unmapped query strings. The n-gram analysis uses the frequencies of n-grams (ie, text fragments of n words) in a text sample to estimate the likelihood that a string is a potential term. In general, the more frequently an n-gram appears in text documents, the greater the likelihood that the n-gram is a "useful" term.
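As a rough illustration of this step, the sketch below breaks unmapped queries into word n-grams, removes n-grams that match a set of known terms, and keeps those above a frequency threshold. The function names, the whitespace tokenization, and the `known_terms` set (standing in for a Metathesaurus lookup) are assumptions; the defaults of 1 to 4 words and a frequency above 50 mirror the subset later selected for manual review.

```python
from collections import Counter


def ngrams(query, max_n=4):
    """Yield every contiguous word n-gram (n = 1..max_n) of a query."""
    words = query.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def candidate_strings(unmapped_queries, known_terms, threshold=50, max_n=4):
    """Count n-grams over all queries and keep unknown, frequent ones.

    known_terms: hypothetical stand-in for terms already in the vocabulary.
    threshold: keep n-grams whose frequency is strictly greater than this.
    """
    counts = Counter(g for q in unmapped_queries for g in ngrams(q, max_n))
    return {g: c for g, c in counts.items()
            if g not in known_terms and c > threshold}
```

For example, `candidate_strings(["broken finger pain", "broken finger"], {"pain", "finger"}, threshold=1)` keeps only "broken" and "broken finger", each with a count of 2.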

Collaborative Manual Review
Six researchers (first six of the authors) reviewed candidate strings (n-grams) collaboratively. First, each reviewer independently reviewed a subset of the n-grams (n = 1 to 4 and frequency > 50) and voted on whether they should be considered CHV terms. Unanimous votes for n-grams that were reviewed by at least three people were entered as "master" votes. Otherwise, termhood was discussed by the entire group until consensus was reached and a master vote was cast. To support reviewers from geographically distributed locations and to calculate votes, a specially designed Web-based application [23] was utilized (Figure 1). Through several iterations of votes and discussion, we established the following review criteria:
1. CHV terms should be syntactic constituents or phrases such as a noun phrase or adjective phrase (eg, "bypass surgery" is a phrase, but "fever in" is not). Special attention should be given to noun phrases.
2. CHV terms should have independent semantics and should not only occur as a part of longer valid terms or as a part of wild card searches (eg, [chicken-, small-] "pox vaccine" is not considered a CHV term).
3. CHV terms should be specific to the medical domain (eg, "Google" and "Yahoo" are general words, not CHV terms).
4. CHV terms should function as semantic components in addition to functioning as syntactic components (eg, stop words "the" and "a" as well as empty verbs "make" and "take" are not considered CHV terms).
5. N-grams representing existing UMLS medical concepts are considered to be CHV terms, but CHV terms may represent non-UMLS concepts.
6. Eponymous forms of CHV terms are considered to be CHV terms (eg, "Parkinson's").
7. CHV terms may include spelling errors (eg, "Chron's disease"). These misspelled terms are given the label "disparaged."
8. Terms with distinct clinical semantics (eg, "result") are considered to be CHV terms, regardless of ambiguity and/or vagueness in other domains.
We singled out several types of terms for future investigation and assigned special labels to them:
• meta: A term that is usually used to indicate the category/type of information sought or presented (eg, "picture," "guideline," and "tutorial").
• modifier: A term not typically used by itself, but for limiting or qualifying other terms (eg, "sexually" as in "sexually active").
• relation: A term not typically used by itself, but used to describe relations among concepts (eg, "caused by" and "results in"). We also include the unary relation "not" in this set.
Currently, we consider terms classified as meta and modifier to be CHV terms, but relations are not considered CHV terms.
Once these review criteria were established, researchers double-checked the previously cast master votes for compliance. A second round of discussion resulted in some adjustments to the votes.

Application of Automated Term Recognition (ATR)
We explored the use of two ATR methods to facilitate candidate selection for human review: (1) the C-value method (C loosely stands for "candidate collection") and (2) logistic regression.
We applied the C-value method to the strings that had already been reviewed. First, the strings were parsed to filter out single-word strings and strings that were not noun phrases. The C-value was calculated using the formula [16] given in Textbox 1.
To create the logistic regression model that predicts the termhood of a candidate string a, we explored the following as variables: syntactic category; frequency of occurrence; string length; word count; and the number, frequency, and termhood status of a's nesting and nested strings. The master vote was used as the outcome.
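For readers without access to Textbox 1, the C-value step mentioned above can be sketched as follows. The sketch assumes the form of the C-value commonly given in the term-recognition literature: log2 of the word count times the frequency for a string that is never nested, and log2 of the word count times the frequency minus the mean frequency of its nesting strings otherwise. It may differ in detail from the formula actually used in the study. The function name, argument layout, and frequencies are hypothetical.

```python
import math


def c_value(candidate, freq, nesting):
    """C-value of a multiword candidate string a (assumed standard form).

    candidate: the candidate string a
    freq:      dict mapping each string to its frequency f(.)
    nesting:   list of longer candidate strings b that contain a
    """
    length = len(candidate.split())  # |a|, counted in words
    if not nesting:                  # a never appears inside a longer candidate
        return math.log2(length) * freq[candidate]
    mean_nesting_freq = sum(freq[b] for b in nesting) / len(nesting)
    return math.log2(length) * (freq[candidate] - mean_nesting_freq)


# Hypothetical frequencies, for illustration only.
freq = {
    "bypass surgery": 120,
    "heart bypass surgery": 40,
    "bypass surgery recovery": 20,
}
score = c_value(
    "bypass surgery", freq, ["heart bypass surgery", "bypass surgery recovery"]
)
```

Because log2(1) = 0, single-word strings always score zero under this form, which is consistent with filtering them out before the calculation.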
Human-reviewed strings were used as the training and testing data sets. The initial feature variables were as follows:
1. part-of-speech (POS) tag (eg, noun or adjective) of the first word
2. POS tag of the last word
3. noun phrase status (ie, yes/no)
4. word count (ie, number of words in a)
5. number of distinct a's nesting string b
6. number of repeated b
7. percentage of distinct b that are known valid (UMLS) terms
8. percentage of repeated b that are known valid (UMLS) terms
9. number of distinct a's nested string c
10. number of repeated c
11. percentage of distinct c that are known valid (UMLS) terms
12. percentage of repeated c that are known valid (UMLS) terms
13. frequency of a
14. number of distinct host h that a originated from
15. average number of distinct queries containing a per host
The frequency distribution of the POS tags (variables 1 and 2) required them to be collapsed into fewer categories for modeling. The original tags came from a Brill-style, rule-based POS tagger developed by Mark Hepple [24]. We first transformed them into a smaller set of tags used by the UMLS SPECIALIST Lexicon of the National Library of Medicine (NLM) [25]. (Details of the transformation rules can be found in [26].) Several tags appeared with low frequency and were then merged: the tags AUXILARY and MODAL were merged with VERB, and the tags CONJUNCTION, DETERMINER, NUMBER, SYM, UNKNOWN, PRONOUN, and PREP were merged into a new category, OTHER.
The continuous variables (variables 4 to 15) were dichotomized based on the median value. The dichotomized variables were used in the logistic regression to predict or explain the probability of having a term voted "yes" for termhood. The logistic regression model building was carried out by a stepwise procedure. After calculating the odds ratio estimates, most of the variables were dropped. The remaining variables 1, 2, 3, 6, 10, and 15 were represented in the regression formula as FirstPOS, LastPOS, np_value, repeat_sup_gt_median, repeat_sub_gt_median, and distinct_perhost_gt_median.
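The median split described above can be illustrated with a short sketch. It assumes that values strictly greater than the median are coded 1 and all others 0, which is what the "_gt_median" suffix in the variable names suggests; the text does not say how ties at the median were handled. The sample values are hypothetical.

```python
from statistics import median


def dichotomize(values):
    """Code each value as 1 if it is above the column median, else 0.

    Assumption: values equal to the median are coded 0.
    """
    cut = median(values)
    return [1 if v > cut else 0 for v in values]


# Hypothetical counts of repeated nesting strings for five candidates.
repeat_sup = [0, 2, 5, 9, 14]
repeat_sup_gt_median = dichotomize(repeat_sup)
```

The resulting 0/1 columns would then enter the logistic regression alongside the categorical POS and noun phrase variables.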
For both the C-value formula and the regression model, we calculated the sensitivity and specificity at different thresholds to create the receiver operating characteristic (ROC) curves. To estimate the area under the ROC curve for the logistic regression, we used the c-statistic [27] (note that this is not the same as C-value). It has the following meaning. From the final multivariable logistic regression model, the predicted probability of the termhood voted "yes" can be computed for each term. For any two terms, one voted "yes" and one voted "no," the pair is concordant if the predicted probability for the "yes" term is higher than that for the "no" term, and discordant if the predicted probability for the "no" term is higher. If the pair is neither concordant nor discordant, then it is tied. Let T be the total number of all possible yes-no pairs of all terms. Let C be the number of concordant pairs, and D the number of discordant pairs. The c-statistic is calculated as c = (C + 0.5(T − C − D)) / T.
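The pair-counting definition above translates directly into code. The following is a minimal sketch under stated assumptions: the function name and the sample probabilities and votes are hypothetical, and the brute-force comparison of all yes-no pairs is chosen for clarity rather than speed.

```python
def c_statistic(probs, votes):
    """c = (C + 0.5 * (T - C - D)) / T over all yes-no pairs.

    probs: predicted probability of a "yes" vote for each term
    votes: True for a "yes" master vote, False for "no"
    """
    yes = [p for p, v in zip(probs, votes) if v]
    no = [p for p, v in zip(probs, votes) if not v]
    total = len(yes) * len(no)  # T: all possible yes-no pairs
    if total == 0:
        raise ValueError("need at least one 'yes' term and one 'no' term")
    concordant = sum(1 for y in yes for n in no if y > n)  # C
    discordant = sum(1 for y in yes for n in no if y < n)  # D
    return (concordant + 0.5 * (total - concordant - discordant)) / total


# Hypothetical predictions for five terms, for illustration only.
probs = [0.9, 0.8, 0.4, 0.4, 0.2]
votes = [True, True, True, False, False]
c = c_statistic(probs, votes)
```

Here T = 6, C = 5, D = 0, and one pair is tied, so c = 5.5 / 6. Tied pairs count for half, which is why a model that assigns every term the same probability scores exactly 0.5.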

Results
We identified 18454 candidate n-grams (n = 1 to 5); 7967 were reviewed by at least one reviewer, and 1893 distinct n-grams received master votes (Table 1). Among the n-grams with master votes, 23 were meta, 39 were modifier, and 5 were relation.
In the logistic regression model (Figure 2), syntactic information (first 9 variables) and nesting pattern (last 3 variables) determine the termhood. The importance of syntactic information has long been recognized by models like the C-value. Conspicuously, word count and frequency are missing from our model, though longer and more frequent strings are more likely to be considered terms. To a large extent, length and frequency are reflected by the nesting patterns: very short strings are likely to be part of many nesting strings, and less frequent strings are likely to be coincidental combinations of more common words, meaning that they would have more nested strings.
The ROC curves for C-value and the regression model are shown in Figure 3. The area under the ROC curve (AUC) is 70.9% for the C-value method and 95.5% for the regression model. A higher AUC signifies greater distinguishing power: 100% = perfect discriminative ability, 50% = no ability, < 50% = predictions were made in the wrong direction. Thus, the AUC results suggest the regression model to be very effective and better than the C-value for identifying CHV terms.

Discussion
This paper reports on several term identification methods for the OAC CHV project. We established a set of criteria and procedures to conduct a manual review, resulting in multiple reviewers reaching consensus on 1893 n-grams, including identification of 753 new terms for inclusion in the OAC CHV that were not in the 2004AC version of UMLS.
The OAC termhood criteria were established collaboratively, reflecting the reviewers' backgrounds in several different fields: controlled vocabulary, health informatics, linguistics, cognitive science, and computer science. While the OAC termhood criteria could be further refined and termhood criteria for health vocabularies are often not published, we believe publishing such criteria could benefit vocabulary research. For instance, many articles evaluate vocabularies and study methods of mapping one vocabulary to another [28][29][30][31]. These evaluations and mapping methods could be better guided by the termhood criteria of target vocabularies.
In CHV research, the termhood issue is of particular importance because there has been limited discussion and little consensus on what should be considered a consumer term. Is "sun poisoning" an acceptable term? How about "skin conditions?" As was pointed out in the Introduction, health professional vocabularies do not always agree on the termhood of a phrase. Consumer expressions, however, require more scrutiny because it is harder to determine their semantics and contexts of usage.
We tested two ATR methods (C-value and logistic regression) on the human-reviewed n-grams. The C-value was useful for determining termhood, though it did not have high distinguishing power (AUC = 70.9%). The AUC for the logistic regression model was 95.5%, which is fairly satisfactory.
These results suggest that a specially fitted logistic regression model is better suited than the generic C-value method for the task of identifying CHV terms according to our criteria. The C-value method's performance problem was partially caused by issues unique to this data set, among them the inclusion of infrequent misspellings and the high frequency of most candidates, which made frequency a less reliable predictor. The imperfection in noun-phrase parsing is not unexpected, though the relatively short query strings posed a greater challenge for parsing. Like many vocabularies, OAC includes strings that are single words and are not noun phrases, while C-value is typically calculated for multiword noun phrases. The logistic regression model demonstrated excellent suitability for OAC termhood determination. It may have to be altered to be used with other corpora or for other types of vocabularies due to the particularities of query-based corpus attributes such as the short length of the documents. Nonetheless, training of predictive models for a particular corpus and vocabulary is a generalizable strategy. Although general principles exist, the determination of which strings are to be considered legitimate vocabulary terms often depends on the domain and the vocabulary developers' criteria (eg, including verb phrases [15] or not).
The regression model utilizes syntactic and nesting pattern features; both types of features are well-recognized termhood indicators. A concern often raised about CHV research is that the syntax and semantics of consumer phrases are too unruly to be represented in a computable vocabulary. The fact that many consumer phrases have common term characteristics suggests that they are tractable terms.
Our study has several limitations. Because consumer utterances are not as readily available as corpora of medical literature or clinical records, we used query logs that contained relatively few complete sentences. Consequently, many POS and noun phrase analysis errors occurred. In addition, because of budget and logistic constraints, only researchers, not lay consumers, reviewed the candidate terms. However, the analysis was based on utterances from queries submitted by tens of thousands of consumers.
Based on the results of this study, we plan to apply the logistic regression model to the candidate n-grams and select those predicted to be terms for human review. We also plan to add the identified CHV terms to OAC. The authors associated with NLM are interested in investigating similar techniques to aid in identifying candidate terms for inclusion into the SPECIALIST Lexicon of the NLM, and for quality control.

Introduction
Improving the readability of online consumer health materials is an important area of eHealth research. Studies indicate that health information on the Web is beyond the reading ability of average consumers [1,2]. Research on general literacy suggests that readability decreases as the number of "difficult" words, those unfamiliar to the average reader, increases. Since familiarity correlates with education and literacy levels, "easy" terms are those that are familiar to many individuals who have lower reading skills. For example, the Dale-Chall readability formula incorporates a list of 3000 words and phrases (expressions) familiar to 80% of fourth-grade students in the United States [3]. However, because obtaining a comprehensive, empirically derived list of familiar words is difficult, many other existing readability formulas use average number of syllables per word as a surrogate for word difficulty.
Many researchers point to the need to reduce the gap between health literacy of the readers and the readability of consumer health materials [4]. As guidelines call for using simple common words, adhering to them requires predicting consumer familiarity with various health-related words. Currently, the only available methods are general purpose readability formulas developed by K-12 researchers. However, using such readability formulas to predict readers' ability to comprehend health texts has been criticized by the health literacy community. As McCray observes, "counting words and syllables and consulting a grade-level word list are most likely not sufficient to determine how readable a text is" [5]. Reliance on word length is particularly ill suited for the health domain, where short technical terms are likely to be unfamiliar to consumers (eg, apnea). The logic of graded word lists simplifies the phenomenon of word knowledge by implying that it is binary in nature and suggests that a reader is either unfamiliar or familiar with a particular word, with the switch between not knowing and knowing occurring at a single point in time. However, consumer health term familiarity is a more nuanced phenomenon involving partial knowledge [6], and increased exposure likely results in increased familiarity.
Recognizing the limitations of these previous approaches, we set out to explore alternative measures that account for "average" familiarity with health terms among members of a convenience sample of consumers. The ability to recognize terms is important because readers need to associate health terms with their corresponding concepts in order to extract useful information from text. Thus, we decompose health vocabulary knowledge into two parts: (1) surface-level term familiarity, or recognition of the lexical form, and (2) concept-level term familiarity, or understanding of the underlying concept. In cognitive science, a concept can be viewed as a set of slots that can be filled with characteristics describing a class of objects or events [7]. For instance, a "disease" concept may be characterized by attributes such as cause, severity, duration, and pathophysiology (among others). The completeness and accuracy of conceptual knowledge exist on a continuum, dictated by context. Thus, a healthy individual with a family history of diabetes and a diabetic patient may each benefit from explanations focusing on different aspects of diabetes (eg, prevention versus treatment).
Yet, historically, readability studies do not distinguish between surface-level lexical forms (commonly referred to as "terms") and concepts and, therefore, do not separately assess familiarity at each "level." We had previously developed a support vector machine regression model for predicting "familiarity likelihood scores" of consumer health vocabulary (CHV) terms using the empirical data from user studies evaluating "consumer-friendly display" names for medical concepts [8] as training data and the term frequency counts from health text corpora as features [9]. The model evaluated by this current study was an improved version of the initial model published in 2005 [9]: actual familiarity data were collected from 41 subjects for training, and term and word frequencies in three different corpora were used as features, including (1) Reuters news reports (health and non-health articles), (2) queries to a health search engine (MedlinePlus), and (3) queries to a general search engine (MetaCrawler). This algorithm assigns each consumer health term a predictive score ranging from 0 to 1.0, representing the likelihood that the term is familiar to the average consumer. Terms are classified into three familiarity categories based on their scores: "likely" (> 0.8), "somewhat likely" (0.5-0.8), and "not likely" to be familiar (< 0.5).
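The three-way classification of predicted scores can be written as a small function. This is only an illustrative sketch: the function name is hypothetical, and treating scores of exactly 0.5 and 0.8 as "somewhat likely" is an assumption based on the ranges stated above.

```python
def familiarity_category(score):
    """Map a predicted familiarity likelihood score (0 to 1.0) to a category."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1.0")
    if score > 0.8:
        return "likely"
    if score >= 0.5:  # assumed: both boundary values fall in the middle band
        return "somewhat likely"
    return "not likely"
```

For example, a term scored 0.95 would fall in the "likely" category and a term scored 0.3 in the "not likely" category.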
The primary goal of the research reported in this paper was to develop and apply a simple methodology for validating the CHV familiarity predictive model against actual empirically derived familiarity with various health terms among health consumers. The validation is distinct and independent from the empirical data used in deriving the model. Both surface-level (ie, recognition) and concept-level familiarity (ie, understanding of the underlying meaning) data were collected from participants. Surface-level familiarity was investigated because it corresponds with existing conventional approaches to assessing health vocabulary knowledge. The goal of concept-level familiarity assessment was to explore the potential of this novel approach and to characterize the relationship between the two familiarity levels. Finally, we sought to describe the effect of demographic factors (including health literacy and education level) on actual consumers' scores. The following three hypotheses addressed the goals of the study:

Incorporating the REALM procedure, SAHLSA requires the examinee both to correctly pronounce the target term and to select the key term. However, since our goal was to measure familiarity with written health expressions and concepts explicitly using a self-administered tool (eg, via the Web), the SAHLSA requirement for examinees to pronounce each target expression was dropped. The final test included surface-level familiarity items for all three health topics (questions 1-45) and concept-level familiarity items for GERD terms only (questions 46-60). The entire instrument is available in the Multimedia Appendix.

Administration, Scoring, and Analysis
Participants first completed the demographics survey, followed by the S-TOFHLA and CHV familiarity survey (surface-level items followed by concept-level familiarity items). For scoring, each correct answer was awarded one point. Surface-level and concept-level familiarity scores were calculated separately. Regression analysis tests on the data were performed at the .05 level of significance. Since the study is exploratory in nature, P values between .05 and .1 are reported for descriptive purposes, as indicating trends for further investigation.

Mean Familiarity Scores
Three types of means were computed for each predicted familiarity likelihood level ("likely," "somewhat likely," and "not likely" to be familiar): total surface-level familiarity, GERD surface-level familiarity, and GERD concept-level familiarity (Table 2). Total surface-level familiarity reflects surface-level familiarity with terms on all three topics. Since the test included five terms per topic per level, 15 is the maximum possible surface-level familiarity score for each level. GERD surface-level familiarity indicates surface-level familiarity with GERD terms only, with five the maximum possible score (based on five GERD terms at each level). GERD concept-level familiarity reflects answers to GERD concept questions, with five the maximum possible score for each level. Total surface-level familiarity and GERD concept-level familiarity were the dependent variables of hypotheses 1 and 2. GERD surface-level familiarity was used in computing the gap between GERD surface-level and concept-level familiarity, the dependent variable for hypothesis 3.
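The per-level scoring and the surface-concept gap described above can be sketched as follows. The data structure, the function names, and the single participant's answers are hypothetical and serve only to show the arithmetic: one point per correct answer, summed within each predicted familiarity level.

```python
LEVELS = ("likely", "somewhat likely", "not likely")


def level_scores(answers):
    """Sum correct answers (one point each) per predicted familiarity level.

    answers: list of (level, is_correct) pairs, one per test item
    """
    scores = {level: 0 for level in LEVELS}
    for level, correct in answers:
        if correct:
            scores[level] += 1
    return scores


def familiarity_gap(surface, concept):
    """Gap between surface-level and concept-level scores at each level."""
    return {level: surface[level] - concept[level] for level in LEVELS}


# Hypothetical GERD answers for one participant (five items per level).
gerd_surface = (
    [("likely", True)] * 5
    + [("somewhat likely", True)] * 4 + [("somewhat likely", False)]
    + [("not likely", True)] * 2 + [("not likely", False)] * 3
)
gerd_concept = (
    [("likely", True)] * 3 + [("likely", False)] * 2
    + [("somewhat likely", True)] * 3 + [("somewhat likely", False)] * 2
    + [("not likely", True)] * 2 + [("not likely", False)] * 3
)
gap = familiarity_gap(level_scores(gerd_surface), level_scores(gerd_concept))
```

Averaging such per-participant scores within each level would give the kind of means reported per predicted familiarity level.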

Predictors of Total Surface-Level Term Familiarity
Seven independent variables (predicted familiarity likelihood level, gender, English proficiency, highest education level, age, race, and health literacy level [S-TOFHLA scores]) were regressed onto the dependent variable, total surface-level term familiarity score. Linear regression found a statistically significant effect (P < .001) of predicted familiarity likelihood level on surface-level term familiarity. Health literacy was another statistically significant predictor of surface-level familiarity (P < .001). English proficiency was significant (P = .05); education level was not (P = .15).

Predictors of GERD Concept-Level Familiarity
All seven independent variables from the previous regression analysis plus GERD surface-level familiarity were regressed onto GERD concept-level familiarity score. Linear regression found statistically significant effects of predicted familiarity likelihood level (P = .009) and GERD surface-level familiarity score (P < .001) on GERD concept-level familiarity scores. The effect of health literacy level on GERD concept-level familiarity merits further investigation (P = .06).

Relating GERD Surface-Level and Concept-Level Familiarity Scores
While the previous regression analysis indicated that GERD surface-level familiarity score was a significant predictor of GERD concept-level familiarity, concept-level familiarity consistently lagged behind surface-level familiarity at all three levels (see Table 2). Linear regression analysis of the effect of predicted familiarity likelihood level on the gap between surface-level and concept-level familiarity was performed. For the overall model, the gap was statistically significantly different from zero (P = .001). In addition, the gap was statistically significantly greater for terms predicted as "likely" than for those predicted as "not likely" to be familiar (P = .006). The gap for terms predicted as "somewhat likely" versus those predicted "not likely" to be familiar merits further investigation (P = .07).

Implications for the Validity and Usefulness of the CHV Familiarity Model
Although preliminary in nature, this study presents an initial evaluation of the first model for estimating consumer familiarity with health-specific terms. The findings confirmed hypotheses 1 and 3 and partially confirmed hypothesis 2. Confirmation of hypothesis 1 provided initial validity evidence for the CHV familiarity likelihood model [8] by demonstrating a relationship between predicted familiarity and two types of empirically derived consumer familiarity scores. The brief "proof of concept" survey used in this study requires additional research to evaluate the underlying model's robustness with various target audiences of online consumer health materials: seniors, low-literacy individuals, chronic patients, etc. The approach used in the study provides a methodological framework for such follow-up validation studies. The present study, however, contributes to the field as it suggests that a health corpora frequency-based algorithm presents a feasible and more flexible alternative to general word lists or word length algorithms for estimating the difficulty of consumer health materials. For example, our existing model for predicting term difficulty can be used as a quick screening tool for determining "difficult" terms in consumer health texts and suggesting more consumer-friendly synonyms. Incorporating the model into a formula that produces a single text readability score would potentially automate the complex task of matching consumer health materials to readers (assuming that relevant reader information is available).

Insights for Improving the Power of CHV Familiarity Prediction
Partial confirmation of hypothesis 2 and confirmation of hypothesis 3 both point to limitations of the model with respect to its ability to identify "consumer-unfriendly" words. Part of the variance in readers' performance is likely to be related to demographic characteristics not accounted for in the model. With further research, it is perhaps possible to adjust predicted familiarity likelihood categories for some target populations on the basis of known effects of demographic variables. However, identifying the full range of meaningful demographic variables is not realistic. Moreover, most sites are developed for a broad range of health consumers who represent a diverse range of competencies and experiences. This limitation is not unique to our approach but is true for all attempts to evaluate the difficulty of terms or a text. While individualized prediction of text difficulty on the basis of a model is desirable, it is also much more error prone than population-wide predictions because most predictive models are based on population statistics or empirical expert knowledge. Any prediction is necessarily an approximation, but a high-quality approximation is of considerable value. Presently, our predictive model framework also does not make a theoretical distinction between surface-level familiarity and conceptual understanding and does not make provision for the possible uneven gap between the two. If the uneven gap phenomenon is confirmed, then the "easiness" of terms predicted as highly likely to be familiar may be deceptive. Answering this question requires a strong operational definition of sufficient concept knowledge and a way of assessing it. The present instrument is an exploratory step in the direction of concept knowledge measurement. A satisfactory instrument should reconcile the goals of assessing a complex and multifaceted construct while being relatively quick and easy to administer.

Limitations of the Study
While most of the study results corresponded to our research hypotheses, the lack of significant effects of most demographic variables, particularly educational level, is surprising and may be due to sampling bias. It is possible that uneven representation obscured any education effects: 41 out of 52 participants had at least some college education. Note that education is a proxy for general literacy, which is only one component of health literacy [10]. Other components, such as health care experience and motivation, may have a much stronger effect on health term familiarity and need to be explored in further research.

Follow-Up Work
Follow-up work includes validating and possibly adjusting the algorithm for specific populations, evaluating the role of potentially influential demographic variables in designs where these variables are represented across a broad range of values, and developing a formula that would assign a single-value text difficulty on the basis of the present algorithm. The calibration of such formulae in order to estimate the desired scores for various populations would require a set of extensive psychometric studies that are beyond the scope of most informatics research programs. However, developing the algorithm and testing its effectiveness against existing readability formulas are well within the capabilities of consumer health informatics research. It is also essential to develop methods to explore consumer understanding of health concepts in depth, as the current study only scratches the surface of this important topic.