Online Doctor Reviews: Do They Track Surgeon Volume, a Proxy for Quality of Care?

Background: Increasingly, consumers are accessing the Internet seeking health information. Consumers are also using online doctor review websites to help select their physician. Such websites tally numerical ratings and comments from past patients. To our knowledge, no study has previously analyzed whether doctors with positive online reputations on doctor review websites actually deliver higher quality of care typically associated with better clinical outcomes and better safety records.


Introduction
Every day a patient somewhere will ask: "Is Dr. X a good doctor?" By itself, such a question is meaningless. The patient is really asking if Dr. X is a good doctor for a particular end. For example, is Dr. X a good doctor to address a particular symptom or to perform a defined treatment?
As an analogy, the question is as unspecific as "Is this a good car?" Better questions are: "Is this a good car for the gas mileage?" or "Is this a good car for value?" or "Is this a good car for accelerating quickly?" Each question delivers a different answer.
Patients access the Internet seeking an answer to the question "Is Dr. X a good doctor?" but they are really asking if Dr. X is a good doctor for a particular end. Is Dr. X a good diagnostician? Or is he compassionate with excellent listening skills? Or is she a doctor who has treated over 1000 patients with Chiari malformation? A typical doctor review website rarely makes that type of distinction with sufficient clarity.
Our hypothesis is that isolated doctor review websites may not be good proxies for what patients truly care about, namely clinical outcomes and safety. Doctor review websites measure whether patients like their doctor. These websites also measure subjective responses. Does the doctor communicate well? Does the doctor listen? How did the patient experience a procedure? These measures are important, as clinical outcomes depend upon the collaborative role a patient plays in terms of decision making and compliance. Such measures could be complemented by more objective communication measures such as a doctor's ability to consistently transmit information about risks, benefits, and options (eg, of various treatments) to patients with a broad range of medical literacy. Other complementary objective metrics include clinical outcomes and safety. To the extent clear online metrics of an individual doctor's outcomes or safety record exist [1], they are not currently collated by the popular doctor review websites.
The medical literature supports the idea that for some surgical procedures, surgeon volume correlates with clinical outcomes [2][3][4][5][6][7][8][9][10][11]. In other words, for specific procedures, high-volume (HV) surgeons have better results than low-volume (LV) surgeons. It is unclear why this is the case: perhaps practice makes perfect, or perhaps the more successful doctors get more referrals. But online information about a surgeon's volume is also hard to find, if available at all. The question we posed was whether posts on online doctor review websites, in aggregate, correlate with surgeon volume, as a proxy for quality, for three distinct procedures. We targeted surgical procedures where this correlation has been previously suggested: lumbar surgery [12], total knee replacement [13][14], and bariatric surgery [15][16][17][18]. In other words, are high-volume surgeons, in aggregate, more likely to have positive posts (and fewer negative posts) than low-volume surgeons, in aggregate? In doing so, we hope to better understand whether high-volume doctors (who have better clinical track records overall) collectively have better online reputations.

Physicians
Surgeons who perform lumbar surgery, total knee replacement, and bariatric surgery were selected for study because there are data supporting a correlation between surgeon volume and clinical outcome/patient safety for each of these procedures. Further, these procedures are more likely to be considered "elective" and affect a younger demographic than vascular or oncologic procedures (for which there are also data correlating surgeon volume and clinical outcome/patient safety). We believed that "younger" patients considering an "elective" procedure would be more likely to access an online review website to help guide their decision on surgeon selection.
Current Procedural Terminology (CPT) codes for bariatric surgery, lumbar surgery, and total knee replacement were identified and selected (Table 1). Although there are other codes used to label these three surgeries, the codes presented in Table 1 identify the vast majority of the patients who have had bariatric surgery, lumbar surgery, or total knee replacement. Our sample consisted of 600 physicians with practices in bariatric surgery (n = 200), lumbar surgery (n = 200), and total knee replacement (n = 200). For each target procedure, 100 "high-volume" physicians were randomly selected from the quartile of physicians who submitted the most CPT/ICD9-coded claims for reimbursement in 2009-2010, and 100 "low-volume" physicians were randomly selected from the quartile who submitted the fewest. Low-volume surgeons submitted at least one CPT/ICD9 procedure code for the relevant procedure. The median numbers of relevant surgeries for each of the three categories performed by high- and low-volume surgeons in 2009-2010 submitting bills to an NHI carrier are reported in Table 3. The underlying supposition was that patients intending to have bariatric surgery, lumbar surgery, or total knee replacement would search the Internet for information about physicians who have the experience to perform such procedures (and submit a bill for reimbursement to an insurance company).
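The quartile-based selection described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming claim counts are available as a mapping from a physician identifier to the number of claims submitted for one target procedure; the function name `sample_by_volume`, the fixed seed, and the way ties at quartile boundaries are handled are illustrative assumptions, not details of the study's actual procedure.

```python
import random


def sample_by_volume(claim_counts, n_per_group=100, seed=0):
    """Draw random samples from the top and bottom volume quartiles.

    claim_counts: dict mapping physician id -> number of claims (>= 1)
    for a single target procedure. Returns (high_volume, low_volume).
    """
    if len(claim_counts) < 4:
        raise ValueError("need at least 4 physicians to form quartiles")
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    # Rank physicians from fewest to most claims.
    ranked = sorted(claim_counts, key=claim_counts.get)
    quartile_size = len(ranked) // 4
    lowest_quartile = ranked[:quartile_size]
    highest_quartile = ranked[-quartile_size:]
    # Sample without replacement; cap at quartile size for small inputs.
    low = rng.sample(lowest_quartile, min(n_per_group, quartile_size))
    high = rng.sample(highest_quartile, min(n_per_group, quartile_size))
    return high, low
```

Run once per procedure (bariatric, lumbar, knee), a routine like this would yield the 100 + 100 physicians per category, provided each quartile contains at least 100 physicians.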

Data Collection
The authors were blinded as to which doctors were high-volume surgeons and which were low-volume surgeons.
Physician evaluations in the form of numerical ratings and comments were collected from 9 different heavily trafficked websites: 1 review website limits its focus to doctors and lawyers (Avvo); 3 websites limit their focus to doctors (HealthGrades, RateMDs, and Vitals); and 5 websites review a broad array of businesses and services including doctors (Citysearch, InsiderPages, Yahoo! Local, Google Maps, and Yelp). Ranking of traffic in the United States by Alexa (www.alexa.com) for the websites is presented in Table 4. Alexa is a leading provider of global web metrics, such as traffic. A rating is a numerical metric defined by the patient's subjective impression. For example, on a scale of 1-5, how does the patient rate the doctor's overall quality, timeliness, ability to communicate, etc? Each website had different measures, but most asked at least one general question similar to: "Overall, how would you rate the doctor?" We searched each website using the name and location of each physician in our sample. We recorded the number of ratings and the "overall" rating reported for each physician. On websites that allowed ratings on multiple dimensions (eg, communication, trust, punctuality, and time spent with patient), the averages of all numerical ratings were also recorded.
A comment is a free text description of the patient's subjective experience. For example, "Dr. X was very compassionate and listened to each and every one of my concerns." We recorded the number of comments posted about each physician. One of three independent judges, also blinded to the volume of a physician's practice, reviewed each post, categorized it as containing glowing praise or scathing criticism, and recorded whether the glowing praise or scathing criticism addressed quality of care/safety or customer service. A single post could include comments about both quality of care and customer service. If so, it was included in both counts. Comments that were neither glowing nor scathing were recorded in the total number of posts, but not in the glowing/scathing tallies. A prototypical example of a glowing quality of care/safety comment is "Dr. X gave me back my life." In comparison, a scathing quality of care/safety comment is "Dr. X was a butcher." A prototypical example of a glowing customer service comment is "Dr. X returned my call late at night and gave me all the time I needed." In comparison, a scathing customer service comment is "Dr. X was dismissive, arrogant, and never listened." One of the websites, HealthGrades, does not allow posting of comments.
Since many consumers may not do an exhaustive search for physician information, we recorded whether a link to any of the study websites was among the first 20 retrieved in a Google search for each physician in the lumbar and total knee replacement samples. A Google search was performed on each doctor in each of three formats:
1. "Dr. First_Name Last_Name" + "City, State"
2. "First_Name Last_Name, D.O." + "City, State"
3. "Dr. First_Name Last_Name, M.D." + "City, State"
Separate analyses were performed using only data retrieved in this abbreviated search. The first 20 links correspond to the first 2 webpages retrieved in a typical search, as the default setting for a Google search is 10 results per page [20].
Once the data were captured from the online review websites, the spreadsheet was sent to the Lewin Group. They added a field indicating whether a doctor was high volume or low volume. All other physician-identifying information was subsequently stripped and the rows were shuffled. The database was then returned to the authors for analysis.
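As an illustration of the abbreviated search described above, the following Python sketch builds the three query formats for one physician and checks which study websites appear among the first 20 result links. It assumes the result URLs have already been collected by some other means; the function names and the domain names passed in are hypothetical placeholders, and nothing here reflects how the searches were actually run in the study.

```python
from urllib.parse import urlparse


def build_queries(first_name, last_name, city, state):
    """Return the three search strings, one per name format."""
    location = f'"{city}, {state}"'
    name_variants = [
        f"Dr. {first_name} {last_name}",
        f"{first_name} {last_name}, D.O.",
        f"Dr. {first_name} {last_name}, M.D.",
    ]
    return [f'"{name}" + {location}' for name in name_variants]


def study_sites_in_top_links(result_urls, study_domains, limit=20):
    """Return the study domains found among the first `limit` links."""
    found = set()
    for url in result_urls[:limit]:
        host = urlparse(url).hostname or ""
        for domain in study_domains:
            # Match the bare domain or any subdomain such as "www.".
            if host == domain or host.endswith("." + domain):
                found.add(domain)
    return found
```

For example, `build_queries("Jane", "Doe", "Austin", "TX")` returns three strings, the first being `"Dr. Jane Doe" + "Austin, TX"`.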

Analytic Approach
Do ratings and comments posted on physician review websites provide valid information regarding surgical volume, a proxy for clinical outcomes/safety? We answered this by comparing the information available on high- and low-volume physicians, controlling for surgical practice in a 2 × 3 analysis of variance. Our analysis also considered whether the differences between high- and low-volume physicians were consistent across bariatric, lumbar, and total knee replacement surgical practices.
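For readers less familiar with the design, a 2 × 3 analysis of variance of this kind is conventionally written as the following model. The notation is a standard textbook form supplied for illustration; the symbols are not taken from the study.

```latex
% i indexes volume (high, low); j indexes practice (bariatric, lumbar, knee);
% k indexes physicians within a cell.
Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
\qquad i \in \{1, 2\},\; j \in \{1, 2, 3\}
```

Here $\alpha_i$ is the main effect of volume (high versus low), $\beta_j$ is the main effect of surgical practice, and $(\alpha\beta)_{ij}$ is the interaction, which tests whether the high- versus low-volume difference is consistent across the three practices.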
Analyses were performed using the mean number of ratings per website (on which each physician was rated at least once). Additional analyses were performed for each physician's overall rating, averaged across websites. Analyses using physicians' overall ratings tracked averages that included ratings of specific physician characteristics (average of multidimensional numerical ratings) very closely (all r > .85), so only analyses using the overall rating are presented. The Vitals website uses a different rating scale (1-4) than the other websites (1-5); therefore, ratings from each website were standardized using a z transformation (converting each physician's score into a value expressed as the number of standard deviations from the mean on each website). The z score, or standard score, allowed for averaging ratings across websites.
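The standardization step can be illustrated with a short Python sketch. It assumes ratings are held as a nested mapping from website to physician to overall rating, and it uses the population standard deviation; whether the study used the population or sample standard deviation is not stated, so that choice, like the function name, is an assumption.

```python
from statistics import mean, pstdev


def standardize_by_site(ratings):
    """Average each physician's per-website z scores.

    ratings: dict mapping website -> {physician id -> overall rating}.
    Returns dict mapping physician id -> mean z score across websites.
    """
    z_scores = {}
    for site, by_physician in ratings.items():
        values = list(by_physician.values())
        site_mean = mean(values)
        site_sd = pstdev(values)  # assumption: population SD
        if site_sd == 0:
            continue  # all ratings identical; z score is undefined
        for physician, rating in by_physician.items():
            z = (rating - site_mean) / site_sd
            z_scores.setdefault(physician, []).append(z)
    return {physician: mean(zs) for physician, zs in z_scores.items()}
```

Because each website is standardized against its own mean and spread, a top score on a 1-4 scale and a top score on a 1-5 scale become directly comparable before averaging.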
Analyses were performed using the average number of comments per physician on websites with at least one posted comment. Additional analyses were performed identifying the proportions of comments that were glowing and scathing, broken down by whether they concerned the physicians' quality of care or customer service.
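The tallying rules for comments (a post may count toward both quality of care and customer service, and posts that are neither glowing nor scathing count only toward the total) can be sketched as follows. The `Post` structure and the choice of total posts as the denominator are illustrative assumptions; the study does not spell out the exact denominator.

```python
from dataclasses import dataclass


@dataclass
class Post:
    """One judged comment; all flags default to False (neutral post)."""
    glowing_quality: bool = False
    scathing_quality: bool = False
    glowing_service: bool = False
    scathing_service: bool = False


CATEGORIES = (
    "glowing_quality",
    "scathing_quality",
    "glowing_service",
    "scathing_service",
)


def comment_proportions(posts):
    """Return the share of all posts falling in each category.

    A post flagged for both quality and service counts in both tallies;
    neutral posts count only in the denominator. Returns None if empty.
    """
    total = len(posts)
    if total == 0:
        return None
    return {
        category: sum(getattr(post, category) for post in posts) / total
        for category in CATEGORIES
    }
```

Because one post can be counted in more than one category, the four proportions need not sum to 1.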

Results
First, we report the results of these analyses using all available data for each physician. Second, we report analyses restricted to data available in the first 20 links of a Google search for each physician in the lumbar surgery and total knee replacement samples. Finally, we present the results of an analysis that explores the incremental validity of using data from both ratings and posted comments to distinguish high- and low-volume physicians.
Table 5 presents the numbers of physicians in our sample with ratings and comments posted on each of the study websites.

All Available Data
Numerical ratings were found for the majority (547/600, 91.2%) of the physicians in our sample; comments were found for 385 (64.2%) of the physicians. The average physician had ratings on 3 of the 9 websites (range: 1-7) and comments on 1 website (range: 1-5). Preliminary analysis noted that the correlation between rank orders of physicians' total number of ratings aggregated across all websites and total number of ratings per website was r = .86 (P < .001). Additional preliminary analyses revealed that high-volume physicians had more total ratings across all websites and ratings on more websites than did low-volume physicians. Our analyses focus on the average number of ratings per website on which a physician is rated, based on an assumption that a typical consumer may not do an exhaustive review of all available ratings on many websites but may be satisfied upon finding one website with information on his or her physician.
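A correlation between rank orders is commonly computed as a Spearman coefficient, that is, a Pearson correlation applied to ranks. The sketch below shows that computation in plain Python, with tied values given the average of their ranks; that the study used exactly this coefficient and tie rule is an assumption, and the function names are illustrative.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

In the setting above, `x` would hold each physician's total number of ratings across all websites and `y` the number of ratings per website.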
Table 6 presents results of analyses of all available physician data. High-volume physicians had significantly more ratings per website compared to low-volume physicians for every type of practice (P < .001) and there was no evidence that this effect differed among physician groups (P = .15). However, the standardized numerical ratings assigned to high-volume physicians were not significantly different from those assigned to low-volume physicians (P = .27), nor was this null finding different across physician groups (P = .48). Table 6 also shows that high-volume physicians had more comments per website than did low-volume physicians for each type of practice (P = .05). Again, there was no evidence this differed among physician groups (P = .74).
Table 7 shows that only comments related to quality of care seem to distinguish high- and low-volume physicians; high-volume physicians had a significantly greater proportion of glowing comments (P = .002) and a significantly lower proportion of scathing comments regarding quality of care than low-volume physicians (P = .005). Again, we observe these patterns for each surgical practice, and our analyses offer no basis for inferring that they hold more strongly for one group than another (P = .70 for glowing; P = .41 for scathing). We also observed that there were far more glowing than scathing comments overall, even for low-volume physicians. In general, we observed that high-volume physicians tend to have almost 64% glowing comments (versus 51% for low-volume physicians) regarding quality of care. Proportion of glowing/scathing comments related to customer service did not differentiate between high- versus low-volume physicians overall (P = .52 for glowing; P = .48 for scathing), nor was there evidence that this null finding differed across physician groups (P = .92 for glowing; P = .20 for scathing).

First 20 Links
We conducted a reanalysis of the physician data restricted to review websites within the first 20 links returned by a Google search of a physician's name (Table 8). These searches returned links to some or all of our sample doctor review websites, enabling access to the majority (896/1134, 79%) of webpages where doctors had at least one rating and of the webpages where doctors had at least one comment (347/456, 76%). This analysis was restricted to lumbar and total knee replacement samples. We excluded bariatric surgery from this subanalysis because the number of reviews and comments accessible via the first 20 links for that category was inadequate to draw meaningful conclusions. The analyses in Table 9 parallel those reported in Table 7 using the full available data.
Again, we find that high-volume physicians had greater numbers of ratings and comments per linked website than did low-volume physicians. The numerical ratings given to high- and low-volume physicians did not differ. And high-volume physicians had greater proportions of glowing (and lower proportions of scathing) comments about quality of care than did low-volume physicians. There were no differences in proportions of comments concerning customer service.

Additional Analyses
The preceding analyses suggest that high- and low-volume surgeons could be identified based on the (1) number of ratings; (2) number of comments; (3) proportion of glowing comments about quality of care; and (4) proportion of scathing comments about quality of care. Next, we attempted to establish the practical usefulness of these various pieces of information for distinguishing high- and low-volume physicians. Discriminant analysis develops a function that maximally distinguishes study groups from each other. Function coefficients (see Table 10) are the weights that support this discrimination; higher absolute weights indicate greater contribution of the variable to differentiating groups from each other. As illustrated in the table, discriminant analysis suggests that ratings per website and proportion of glowing comments about quality of care are the two most differentiating pieces of information (highest absolute weights), followed by proportion of scathing comments about quality of care. The number of comments per website, while providing some information when examined alone, provides little additive information (beyond the other measures).
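In general form, a two-group linear discriminant function over these four measures can be written as below. The symbols are generic notation introduced here for illustration; the fitted weights themselves are those reported in Table 10 and are not reproduced.

```latex
% x_1: ratings per website            x_2: comments per website
% x_3: proportion of glowing quality-of-care comments
% x_4: proportion of scathing quality-of-care comments
D = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4
```

A physician is assigned to one group or the other according to which side of a cutoff the score $D$ falls on, and a larger $|w_m|$ on standardized measures indicates that measure $x_m$ contributes more to separating the groups.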
As a follow-up, we also performed a classification analysis wherein physicians' surgical volume (high or low) was "predicted" by the number of ratings and comments they received as well as the proportion of glowing and scathing comments about quality of care (using the discriminant function). The results revealed that one could accurately identify a physician's surgical volume 61.6% of the time. An examination of the resulting discriminant function revealed that the number of ratings per website and proportion of glowing postings seemed most central to the discrimination, followed by proportion of scathing comments. Number of comments was largely redundant to these other measures.

Discussion
Our study found evidence that online doctor review websites can be used to identify high-volume surgeons performing targeted procedures, a proxy that correlates with higher quality care. Patients naturally want to identify, and be treated by, the best practitioners. And they seek such information online. The importance of the Internet in determining patients' health care choices in the United States should not be underestimated. A recent study by The Pew Internet and American Life Project noted that 59% of adults have looked online for information on 15 health topics such as a specific disease or treatment [21]. And they are looking for information about health care providers too; 12% of adults have consulted online rankings or reviews of doctors or other providers.
Online review websites track patient sentiment. Recent advances even allow for automating the classification of patient comments by sentiment. Xia et al [22] described a multistep sentiment classifier for patient opinion mining that, in principle, could analyze large collections of data, online or otherwise, to assign sentiment scores to patient reviews. While patient sentiment is helpful, to our knowledge, our study is the first to tackle the connection between patient reviews, patient sentiment, and a proxy for clinical outcomes.
Defining quality in healthcare is difficult. From a patient's perspective, soft measures (eg, communication skills and ability to listen) are important for issues such as decision making and compliance, issues that affect outcomes. More objectively, quality often distils to patient safety and clinical outcomes. Such metrics include morbidity and mortality rates, length of stay in hospital, blood loss, time to return to work, and the like. This detailed information tracking of individual practitioners is not readily available online for patients to analyze.
The medical literature suggests that, for a number of surgical procedures, the volume of cases performed annually by an individual surgeon correlates with patient safety and clinical outcome metrics. In other words, for specific procedures, high-volume surgeons have better results than low-volume surgeons do.
We targeted three surgical procedures where this correlation has been shown previously: lumbar surgery [12], total knee replacement [13][14], and bariatric surgery [15][16][17][18]. To our knowledge, our analysis is the first to tackle the question of whether online reviews can identify the more successful surgeons using a proxy for clinical outcomes and safety. We posed the following hypothetical question: Do quantity and character of posts on online doctor review websites, in aggregate, correlate with surgeon volume, as a proxy for quality, for these three distinct procedures?
Our findings provide evidence that the following data aggregated from 9 doctor review websites can distinguish high-volume from low-volume surgeons: total number of numerical reviews; total number of text comments; proportion of glowing positive comments; and proportion of scathing negative comments. Analysis of the actual numerical ratings did not distinguish between high- and low-volume surgeons. The same conclusions were noted when limited to doctor review websites from the first 20 links of a Google search for the doctor's name.
While our analysis provides evidence that data from doctor review websites can help consumers identify higher quality doctors, the effect size is weak. From the patient's perspective, a far better way to determine whether a surgeon performs a high volume of procedures is to ask the doctor. Or the doctor could preemptively provide such information on the various review websites.
One surprising result was that while the total number of reviews correlated with surgeon volume, the actual rating value did not. Also, it is unclear why the total numbers of reviews and comments are associated with surgeon volume. Perhaps high-volume surgeons are more comfortable with their skills/results and are more likely to ask their patients for feedback, internally or on the Internet. In any event, such observations deserve further study.
Our analyses also supported a finding previously reported by others [23]; namely, on online review websites, the single metric (overall rating) correlated highly with more granular, multidimensional numerical ratings. In our analyses, this correlation was between overall rating and the average of all multidimensional ratings (all r > .85). Accordingly, analyzing patient responses to the question "Overall, how would you rate this doctor?" predicts positive and negative sentiment from more detailed questions.
Even with these findings, it is still an open question whether consumers should rely heavily on the websites, partly because the websites have limited data. Among the 600 doctors, on websites where the doctor was rated, the average doctor had between 4 and 6 ratings and between 2 and 3 comments. As the websites accumulate more data, our conclusions may change.
Our study identified at least one rating for 91% of doctors in our sample. This contrasts with the study by Lagu et al [24], where 70% of their physician sample did not have a single review on any of the 33 websites they looked at. That study captured data limited to Boston generalists and undefined subspecialists in the spring of 2009. Our study captured data for specific categories of surgeons across the country in the summer of 2011. The experience a patient has with a surgeon is arguably different from the experience one has with a generalist or many types of subspecialists. The surgical experience is typically a "once-off." The experience with a generalist and many types of subspecialists is typically longer term. Patients may be more inclined to post ratings and comments based on a single (more emotionally charged) experience with a surgeon compared with a routine long-term experience with a generalist. But the threshold of a doctor converting from no reviews on any website to at least one review on a website is low. The average doctor sees over 1000 patients per year. If just one patient takes the effort to post a review, that threshold is crossed. As our data were gathered two years after those of Lagu et al, this suggests that although the number of online reviews per doctor is still limited, the trend is for more reviews for more doctors.
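The low-threshold argument can be made concrete with a simple probability sketch. It assumes, purely for illustration, that each of a doctor's $n$ patients in a year posts a review independently with the same small probability $p$; neither the independence assumption nor any value of $p$ comes from the study.

```latex
P(\text{at least one review}) = 1 - (1 - p)^{n}
% Hypothetical example: n = 1000 patients, p = 0.001 (1 in 1000 posts)
% 1 - (0.999)^{1000} \approx 1 - e^{-1} \approx 0.63
```

Under these hypothetical values, even a posting rate of one patient in a thousand gives roughly a 63% chance of crossing the threshold within a single year, which is consistent with the share of reviewed doctors rising over time.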
Our study was limited to a sample of targeted surgical procedures. Within that dataset, there may be high-volume surgeons who have poor clinical outcomes/patient safety records. And there may be low-volume surgeons with excellent clinical outcomes/patient safety records. Our study only attempted to track a proxy for clinical quality (surgical volume) and not clinical quality itself. Also, our study draws no conclusions about surgeons who perform procedures other than those analyzed, nor about non-surgical practitioners.
Another limitation is that the NHI database used to identify low- and high-volume surgeons, while extensive, only covered CPT/ICD9 procedure codes submitted to private insurance carriers. The NHI database does not reflect data submitted to Medicare. In surveying the literature correlating surgeon volume with quality of care, we intentionally selected three surgical procedures that were more likely than others to be performed on a younger demographic, hoping to minimize whatever effect the absence of Medicare data might have on our analysis.
One further limitation is that our classification of comments into the categories of quality of care and customer service as glowing praise or scathing criticism required human judgment, making it susceptible to potential inter-reviewer variance. While it is unlikely different reviewers would classify words such as "butcher" and "life saver" differently, new technologies [22] may help automate the review process for greater consistency.
Online doctor review websites provide a growing collection of data for consumers to use. These websites provide fertile ground for future studies on whether their data can help patients reliably differentiate doctors who provide better clinical outcomes and patient safety.
In summary, online review websites provide a rich source of data that may be able to track quality of care, though the effect size is weak and not consistent for all review website metrics.

Table 1.
Procedure codes and selection criteria for bariatric surgery, lumbar surgery, and total knee replacement.

Table 2.
Number of unique physicians submitting a bill at least once to a Normative Health Insurance (NHI) carrier for relevant CPT/ICD9 procedure codes in 2009-2010 a.

Table 3.
Median number of surgical procedures performed by high- and low-volume surgeons a.

Table 4.
Alexa traffic rank in the United States for selected review websites [19].
a Reviews lawyers also

Table 5.
Numbers of surgeons with ratings and comments posted on a study website.

Table 6.
Analysis of ratings and comments for high- and low-volume surgeons.
a Comparing high- versus low-volume surgeons
b Comparing bariatric, lumbar, and knee surgeons
c Comparing high- versus low-volume surgeons across surgeon categories
d Only includes individual websites on which doctor had at least one rating/comment
e z score

Table 7.
Analysis of scathing and glowing comments for high- and low-volume surgeons.
a Comparing high- versus low-volume surgeons
b Comparing bariatric, lumbar, and knee surgeons
c Comparing high- versus low-volume surgeons across surgeon categories

Table 8.
Analysis of ratings and comments for high- and low-volume surgeons on websites within the first 20 search links (excluding bariatric surgeons).
a Comparing high- versus low-volume surgeons
b Comparing lumbar and knee surgeons
c Comparing high- versus low-volume surgeons across surgeon categories
d Only includes individual websites on which doctor had at least one rating/comment
e z score

Table 9.
Analysis of scathing and glowing comments for high- and low-volume surgeons on websites within the first 20 search links (excluding bariatric surgeons).
a Comparing high- versus low-volume surgeons
b Comparing lumbar and knee surgeons
c Comparing high- versus low-volume surgeons across surgeon categories

Table 10.
Discriminant function analysis results.