This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Large amounts of unstructured, free-text information about the quality of health care are available on the Internet in blogs, social networks, and physician rating websites, but this information is not captured in any systematic way. New analytical techniques, such as sentiment analysis, may allow us to understand and use this information more effectively to improve the quality of health care.
We attempted to use machine learning to understand patients’ unstructured comments about their care. We used sentiment analysis techniques to categorize online free-text comments by patients as either positive or negative descriptions of their health care. From each free-text description, we tried to automatically predict whether a patient would recommend a hospital, whether the hospital was clean, and whether the patient was treated with dignity, and we compared these predictions with the patient’s own quantitative rating of their care.
We applied machine learning techniques to all 6412 online comments about hospitals on the English National Health Service website in 2010 using Weka data-mining software. We also compared the results obtained from sentiment analysis with the paper-based national inpatient survey results at the hospital level using Spearman rank correlation for all 161 acute adult hospital trusts in England.
There was 81%, 84%, and 89% agreement between quantitative ratings of care and those derived from free-text comments using sentiment analysis for cleanliness, being treated with dignity, and overall recommendation of the hospital, respectively (kappa scores: .40–.74).
The prediction accuracy achieved with this machine learning process suggests that free-text comments can yield a reasonably accurate assessment of patients’ opinions about different aspects of a hospital’s performance, and that these machine learning predictions are associated with the results of more conventional surveys.
Understanding patients’ experience of health care is central to the process of providing care and is a fundamental pillar of health care quality [
Outside health care, natural language processing of large datasets, including sentiment analysis and opinion mining, has been critical to understanding consumer attributes and behaviors, for example, in election forecasting [
The Information Strategy for the National Health Service (NHS) in England states that sentiment analysis of data could be a novel source of information [
We applied data processing techniques to all the online free-text comments about hospitals on the NHS Choices website in 2010. Our purpose was to test whether we could automatically predict patients’ views on a number of topics from their free-text responses. A machine learning classification approach was chosen in which an algorithm “learns” to classify comments into categories from a given set of examples, using open-source Weka data mining software. This software has been extensively used in previous research and provides accurate classification results, including in health care [
The machine learning approach had two components: (1) pre-processing, in which data from patient comments are split into manageable units to build a representation of the data [
In the “bag-of-words” approach, the total body of text analyzed (known as the corpus) is represented as a simplified, unordered collection of words [
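This representation can be sketched as follows, using scikit-learn’s CountVectorizer as a stand-in for the Weka tooling used in the study; the example comments are hypothetical:

```python
# Minimal sketch of a bag-of-words representation of free-text comments.
# scikit-learn is used here for illustration; the study itself used Weka.
from sklearn.feature_extraction.text import CountVectorizer

comments = [
    "the staff were friendly and the ward was very clean",
    "the floor was filthy and the staff were rude",
]  # hypothetical examples

# Word order is discarded; each comment becomes a vector of term counts.
vectorizer = CountVectorizer(lowercase=True)
X = vectorizer.fit_transform(comments)
print(vectorizer.get_feature_names_out())
print(X.toarray())  # rows = comments, columns = term counts
```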
A technique called “information gain” was used to reduce the size of the bag-of-words by identifying the words with the lowest certainty of belonging to a given class and then removing them; this is an approach to feature selection [
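A minimal sketch of this kind of feature selection, using scikit-learn’s mutual information scorer as an approximation of Weka’s information gain ranking (the comments and labels are hypothetical):

```python
# Sketch: information-gain-style feature selection over a document-term
# matrix. Mutual information on discrete word counts approximates the
# information gain ranking described in the text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

comments = ["the ward was very clean", "the floor was filthy",
            "clean room and friendly staff", "dirty bed and rude staff"]
labels = [1, 0, 1, 0]  # hypothetical: 1 = "clean", 0 = "dirty"

vectorizer = CountVectorizer(ngram_range=(1, 2))  # one- and two-word phrases
X = vectorizer.fit_transform(comments)

# Rank terms by how much information they carry about the class and keep
# the top k; uninformative words are dropped from the bag-of-words.
selector = SelectKBest(mutual_info_classif, k=10)
X_reduced = selector.fit_transform(X, labels)
kept = vectorizer.get_feature_names_out()[selector.get_support()]
print(kept)
```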
A number of different technical approaches can be taken to classification in machine learning. We applied four different methods, to see which gave the quickest and most accurate results: (1) naïve Bayes multinomials (NBM), (2) support vector machines (SVM), (3) bagging, and (4) decision trees [
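The comparison might be sketched as follows, using scikit-learn analogues of the four Weka algorithms; the training data are hypothetical and purely illustrative:

```python
# Sketch: fitting the four classifier families on the same bag-of-words
# features. These are scikit-learn stand-ins for the Weka implementations
# used in the study.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

comments = ["excellent care, friendly staff", "rude staff and a filthy ward",
            "very clean, would recommend", "left waiting for hours"]
labels = [1, 0, 1, 0]  # hypothetical: 1 = positive, 0 = negative

X = CountVectorizer().fit_transform(comments)

models = {
    "NBM": MultinomialNB(),          # probabilistic model over word counts
    "SVM": LinearSVC(),              # linear support vector machine
    "Bagging": BaggingClassifier(),  # bagged decision trees (default base estimator)
    "Decision tree": DecisionTreeClassifier(),
}
for name, model in models.items():
    model.fit(X, labels)
    print(name, model.predict(X))
```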
To obtain a score against which to test the sentiment predictions, patient ratings left on the NHS Choices website on a Likert scale were converted into simple positive or negative categories for cleanliness and dignity, to simplify the prediction task. The website presents patients with five options to rate the cleanliness of a hospital: “exceptionally clean”, “very clean”, “clean”, “not very clean”, and “dirty”, plus a “does not apply” option. In this analysis, the first three options were grouped into a “clean” class and “not very clean” and “dirty” into a “dirty” class. The website also asks patients to rate whether they were treated with dignity and respect by the hospital staff, with the options being “all of the time”, “most of the time”, “some of the time”, “rarely”, and “not at all”. Again, the first three options were grouped, in this case into a “more dignity” class, and “rarely” and “not at all” into a “less dignity” class. Finally, the NHS Choices website asks all patients whether they would recommend the hospital or not.
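To illustrate, here is a minimal sketch of this grouping as a simple dictionary-based mapping; the function name and structure are hypothetical, not taken from the study:

```python
# Sketch: collapsing the NHS Choices Likert options into binary classes,
# following the grouping described above.
CLEAN_MAP = {
    "exceptionally clean": "clean",
    "very clean": "clean",
    "clean": "clean",
    "not very clean": "dirty",
    "dirty": "dirty",
    # "does not apply" responses are left unmapped rather than grouped
}
DIGNITY_MAP = {
    "all of the time": "more dignity",
    "most of the time": "more dignity",
    "some of the time": "more dignity",
    "rarely": "less dignity",
    "not at all": "less dignity",
}

def to_binary(rating, mapping):
    """Return the binary class for a Likert rating, or None if unmapped."""
    return mapping.get(rating.strip().lower())

print(to_binary("Very clean", CLEAN_MAP))  # -> "clean"
```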
Having calculated the accuracy of our prediction algorithm, we then compared the results of the sentiment analysis with the national inpatient survey results for 2010. This is an annual national survey of randomly selected patients admitted to NHS hospitals in England, similar to the HCAHPS survey in the United States. The 2010 survey covered all 161 acute hospitals with adult services in England and involved 60,000 respondents nationally (response rate 50%). Patients were contacted via post between September 2010 and January 2011 if they had received overnight care in hospital in 2010 [
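The hospital-level comparison can be illustrated with a minimal sketch using scipy’s spearmanr; the per-trust values below are hypothetical and stand in for the real survey and prediction data:

```python
# Sketch: hospital-level comparison of predicted sentiment with survey
# scores via Spearman rank correlation.
from scipy.stats import spearmanr

# Fraction of comments predicted positive per trust (hypothetical values)
predicted_positive = [0.82, 0.64, 0.91, 0.55, 0.73]
# Mean survey score for the matching question per trust (hypothetical values)
survey_score = [8.1, 6.9, 8.8, 6.2, 7.5]

rho, p_value = spearmanr(predicted_positive, survey_score)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```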
Patients’ own quantitative ratings of whether they would recommend their hospital agreed with our sentiment analysis predictions between 80.8% and 88.6% of the time, depending on the classification method used (expressed as accuracy; see the table below).
NBM, bagging, and decision tree approaches to classification all produced similar measures of association, but the NBM algorithm, a first-order probabilistic model that uses word frequency information, performed the calculation fastest (less than 0.2 seconds, compared with hundreds of seconds for the other approaches analyzed). Of note, all algorithms tended to be worse at predicting cleanliness, SVM in particular. This may reflect the more limited language used around cleanliness compared with the other opinions examined, or the more skewed distribution of ratings, with a higher number of negative ratings.
On this basis, we chose to use the NBM results for further comparison with the patient survey data. The relationship between the predictions of the NBM approach and the real ratings was reflected in kappa statistics for interrater reliability of between .40 and .74.
Accuracy of different approaches to machine learning.
Question | Overall rating | Cleanliness | Dignity and respect
Naïve Bayes multinomial (NBM) | | |
ROC | 0.94 | 0.88 | 0.91
F-measure | 0.89 | 0.84 | 0.85
Accuracy (%) | 88.6 | 81.2 | 83.7
Time (s) | 0.11 | 0.05 | 0.06
Decision tree | | |
ROC | 0.84 | 0.76 | 0.79
F-measure | 0.81 | 0.86 | 0.8
Accuracy (%) | 80.8 | 88.4 | 83
Time (s) | 552 | 206 | 332
Bagging | | |
ROC | 0.89 | 0.83 | 0.87
F-measure | 0.82 | 0.87 | 0.85
Accuracy (%) | 82.5 | 89.2 | 84.5
Time (s) | 4871 | 2018 | 3164
Support vector machine (SVM) | | |
ROC | 0.79 | 0.53 | 0.6
F-measure | 0.84 | 0.84 | 0.8
Accuracy (%) | 84.6 | 88.5 | 84.1
Time (s) | 612 | 305 | 520
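As an illustration of how the accuracy and kappa figures reported here can be computed, the following minimal sketch uses scikit-learn’s metrics on hypothetical binary labels (the study itself used Weka):

```python
# Sketch: agreement between patients' own binary ratings and classifier
# predictions, expressed as accuracy and Cohen's kappa.
from sklearn.metrics import accuracy_score, cohen_kappa_score

actual    = [1, 1, 0, 1, 0, 1, 0, 1, 1, 0]  # patients' own ratings (hypothetical)
predicted = [1, 1, 0, 1, 1, 1, 0, 1, 0, 0]  # NBM predictions (hypothetical)

print("accuracy:", accuracy_score(actual, predicted))
print("kappa:", cohen_kappa_score(actual, predicted))
```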
The 10 one- or two-word phrases with the highest predictive accuracy for each topic.
Overall | Cleanliness | Dignity |
told | dirty | rude |
thank you | floor | told |
left | left | left |
rude | the floor | thank you |
excellent | thank you | friendly |
the staff | filthy | excellent |
hours | bed | rude and |
asked | patients | asked |
was told | friendly | the staff |
friendly | hours | staff |
Comparison of patient survey responses and machine learning prediction of comments at hospital trust level.
Patient survey question | Machine learning prediction | Spearman correlation coefficient | Probability
In your opinion, how clean was the hospital room or ward that you were in? | Machine learning prediction of comments about standard of cleanliness | 0.37 |
Overall, did you feel you were treated with respect and dignity while you were in the hospital? | Machine learning prediction of comments about whether the patient was treated with dignity and respect | 0.51 |
Overall, how would you rate the care you received? | Machine learning prediction of comments about whether the patient would recommend the hospital | 0.46 |
Comparison of the proportion recommending a hospital using sentiment analysis and traditional paper-based survey measures.
Our results reinforce earlier findings that sentiment analysis of patients’ online comments is possible with a reasonable degree of accuracy [
Sentiment analysis via a machine learning approach is only as good as the learning set used to inform it. By taking advantage of a complete national rating system over several years, we were able to use many more ratings in this learning set than in other studies. Indeed, our learning set was more than 10 times larger than those used in earlier work [
Online comments left without solicitation on a website are likely to have a natural selection bias towards examples of both good and bad care. It is likely that these online reviews are contributed more by those in particular demographic groups including younger and more affluent people [
Further research is needed to improve the performance of sentiment analysis tools, extending the process to other forms of free-text information on the Internet and exploring the relationships between views expressed by patients online and clinical health care quality. For example, several technical components might be added to improve the process, including the consideration of higher-order n-grams (longer phrases) and the refinement of contextual polarity (understanding what a word or phrase means given its context in a sentence). It would also be useful to compare the relatively simple techniques used in this analysis against other platforms and tools used for sentiment analysis and opinion mining, for example, WordNet Affect [
Large amounts of data about the use of services are collected in digital form. An important strand of this is consumer opinion and experience. Today, many people express their views and share their experience of goods and services via the Internet and social media. Such data, converted to information, are essential in improving services, facilitating consumer choice and, in some sectors, exploring public accountability and value in the use of taxpayers’ money.
By its nature, the information is highly personalized, idiosyncratic, and idiomatic. However, if it is to be useful, it must be analyzed in ways that are not solely reliant on someone reading individual contributions (although this is valuable to consumers), nor on the pre-structured responses necessary to allow aggregation. A solution to the challenge of “big data” is to find automated methods for analyzing unstructured narrative commentary, which is a potentially rich source of learning. In this respect, health care is no different from many other industries, although it has perhaps been slower than other sectors to recognize the importance of such data.
As our confidence in techniques of data mining and sentiment analysis grows, information of this sort could be routinely collected, processed, and interpreted by health care providers and regulators to monitor performance. Moreover, information could be taken from a number of different text sources online, such as blogs and social media. If this information could be harvested from these locations and then processed into timely and relevant data, it could be a valuable tool for quality improvement. We have previously suggested that as the usage of rating websites, social networks, and microblogs increases [
This work demonstrates that sentiment analysis of patients’ comments about their experience of health care is possible and that this novel approach is associated with patient experience measured by traditional methods such as surveys. This work adds to a growing body of literature opening up a new understanding of the patients’ point of view of care from their postings online—on social networks, blogs, and rating websites. Although at an early and experimental stage, it presents future possibilities to understand health care system performance in close to real time. Bates and colleagues have described the confluence of patient-centered care and social media as a “perfect storm” that is likely to be of major value to the public and to health care organizations [
HCAHPS: Hospital Consumer Assessment of Healthcare Providers and Systems
NBM: Naïve Bayes Multinomial
NHS: National Health Service
NHS Choices
ROC: Receiver Operating Characteristic
SVM: Support Vector Machine
We would like to thank the team at NHS Choices, and John Robinson, Paul Nuki, and Bob Gann in particular, for providing access to their data. We thank Jane Lucas for reviewing the sentiment of words.
Dr Greaves was supported for this research by The Commonwealth Fund. The views presented here are those of the authors and should not be attributed to The Commonwealth Fund or its directors, officers, or staff. Dr Millett is funded by the Higher Education Funding Council for England and the National Institute for Health Research. The Department of Primary Care & Public Health at Imperial College is grateful for support from the National Institute for Health Research Biomedical Research Centre Funding scheme, the National Institute for Health Research Collaboration for Leadership in Applied Health Research and Care scheme, and the Imperial Centre for Patient Safety and Service Quality. The funding sources had no role in the design and conduct of the study; collection, management, analysis, or interpretation of the data; or preparation, review, or approval of the manuscript.
Professor Donaldson was Chief Medical Officer, England from 1997 to 2010. Professor Darzi was Parliamentary Under-Secretary of State (Lords) in the United Kingdom Department of Health from 2007 to 2009. The other authors declare no conflicts of interest.