This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Machine learning techniques may be an effective and efficient way to classify open-text reports on doctor’s activity for the purposes of quality assurance, safety, and continuing professional development.
The objective of the study was to evaluate the accuracy of machine learning algorithms trained to classify open-text reports of doctor performance and to assess the potential for classifications to identify significant differences in doctors’ professional performance in the United Kingdom.
We used 1636 open-text comments (34,283 words) relating to the performance of 548 doctors collected from a survey of clinicians’ colleagues using the General Medical Council Colleague Questionnaire (GMC-CQ). We coded 77.75% (1272/1636) of the comments into 5 global themes (innovation, interpersonal skills, popularity, professionalism, and respect) using a qualitative framework. We trained 8 machine learning algorithms to classify comments and assessed their performance using several training samples. We evaluated doctor performance using the GMC-CQ and compared scores between doctors with different classifications using
Individual algorithm performance was high (range
Machine learning algorithms can classify open-text feedback of doctor performance into multiple themes derived by human raters with high performance. Colleague open-text comments that signal respect, professionalism, and being interpersonal may be key indicators of doctor’s performance.
Multisource “360-degree” feedback is increasingly used across business and health sectors to give workers insights into their performance and to identify areas in which improvements may be made. Such feedback often includes different reporting modalities that most commonly take the form of validated questionnaires or open-text comments. In the United Kingdom, large-scale national surveys include open-text feedback, such as the Friends and Family Test, the Inpatient Survey, and the Cancer Patient Experience Survey.
The complexity of open-text information means that, unlike the scores from validated patient-reported experiences and outcome measures, the words cannot simply be “added up” to create insight and meaning. As such, the task of making sense of such data has historically been completed manually by skilled qualitative analysts.
As the volume of text increases, qualitative data can quickly become difficult to manage and draw insights from. Coding and interpreting large bodies of qualitative information received from open-text comments collected is labor-intensive and is at risk of bias if multiple raters use subtly different coding heuristics. Where human raters systematically analyze qualitative data, there remain issues with both time and financial constraints of doing so, as well as potential challenges in ensuring intercoder consistency [
The term
Although machine learning appears to be eminently suitable for the task of classifying open-text data from national surveys, its potential is largely untested in the context of comments made by medical professionals about doctors’ performance. Classification algorithms have been previously applied to patient comments about the experience of living beyond cancer [
While algorithms have demonstrated excellent performance in diverse tasks, there is no evidence specifically relating to their ability to classify comments about doctors made by their colleagues as part of a formal evaluation. Although doctors’ performance might be best assessed by fellow professionals who know them very well, positive reporting bias in open-text reports may occlude differences in performance [
The objective of this study was to train and evaluate an ensemble of machine learning algorithms to accurately classify open-text reports of doctors, which are known to be positively biased, and to assess the potential for theory-based classifications in open text to signal differences in doctors’ professional performance in the United Kingdom.
We collected data from all non–training-grade doctors from 11 sites in England and Wales between March 2008 and January 2011. We recruited doctors from 4 acute hospital trusts, an anesthetics department, 1 mental health trust, 4 primary care organizations, and 1 independent (non–National Health Service) health care organization. We provided all doctors with detailed information regarding the study before they consented to take part in it; they were told they could withdraw at any point without justification. Detailed description of this sample is reported elsewhere [
Doctors were asked to suggest up to 20 colleagues (half of whom were to be medically qualified) who could provide multisource “360-degree” feedback regarding their professional performance.
Multisource feedback was elicited using the General Medical Council Colleague Questionnaire (GMC-CQ), a reliable measure of doctor performance that is validated for use in the United Kingdom [
Qualitative analysts inductively coded the open-text feedback from the GMC-CQ into 5 themes relating to (1) innovation and openness to change (59/1636 comments, 3.6%); (2) interpersonal skills and caring (432/1636 comments, 26.4%); (3) popularity (131/1636 comments, 8%); (4) professionalism (701/1636 comments, 42.8%); and (5) respect or esteem in which the doctor was held (346/1636 comments, 21.1%) [
The number of comments in each category, the distribution of words, and statistical comparison of the word length are provided in
Number of comments, distribution of words, and statistical comparison for each of the 5 categories.
Categories | Reports in category | Length of report, mean (SD) | ANOVAa |
Innovator | 59 | 41.99 (30.84) | <.001 |
Interpersonal | 432 | 23.87 (16.39) | .99 |
Popular | 131 | 25.49 (16.74) | .97 |
Professional | 701 | 24.46 (17.34) | .91 |
Respected | 346 | 20.69 (19.13) | .03 |
More than 1 category | 1189 | 21.63 (16.76) | .56 |
No categories | 425 | 19.54 (13.62) | <.001 |
aANOVA: analysis of variance; conducted with post hoc Tukey tests.
The qualitative researchers followed Holsti’s approach [
Comments were generally, though not always, positive. In our sample, 91.5% (1497/1636) of all comments were positive, 5.93% (97/1636) of the comments were mixed, containing both a positive and a negative statement about the doctor, and the remaining 2.57% (42/1636) of comments were either neutral or negative (see
The process of training, validating, and deploying the algorithms is illustrated in
Example quotes from each category
Theme | Comment |
Innovator | “It is clear from the advice he gives that he is aware of [the] current good practice, is highly motivated, very practical and very much a team player. His advice, when working with consultant colleagues was respected, and he recognized where practice/primary care limitations were and yet looked for opportunities for change and improvement.” |
“She has an admirable level of commitment and enthusiasm for her patients and her work. She has been instrumental in promoting change and improvement in her department. She is a great asset to the department and the hospital.” | |
Interpersonal | “She is a very good, committed colleague always keen to improve, very liked by her patients and highly valued by all who work with her.” |
“Very approachable and professional.” | |
Popular | “Excellent well liked and easy working colleague.” |
“Very popular doctor. Works to high standards.” | |
Professional | “I find this doctor to be very efficient, caring, honest and very professional.” |
“I find that he very easy and helpful to work with, he always has time for patients and staff.” | |
Respected | “A first class colleague.” |
“Pleasant and valued colleague.” | |
Not coded by qualitative rater (given label of 0) | “Supportive colleague, excellent time management skills.” |
“I think I have a good working relationship with this doctor. I have been impressed with his openness to Psychological work with his patients and his support for my work. In my opinion he gives thorough consideration to his diagnosis.” |
Flow diagram of the stages “training,” “validation,” and “application to new data.”.
The first step within each stage is the identification of features within the comments. The features used in this study were identified and stored using a term-document matrix that describes the frequency of terms that appear in each of the comments. The term-document matrix uses a bag-of-words structure that counts the number of terms in each comment and does not consider the order in which the words appear. Term-document matrices are a simple way to represent text data that are computationally straightforward. The matrix comprised unweighted words and was cleaned by stemming, removing numbers, and removing sparse terms (where a certain word was only used in fewer than 0.02% of cases) [
The term frequencies for each comment were therefore used as features that the algorithms used to classify the text. An example of a term-document matrix is provided in
An example term-document matrix for 3 texts.
Texts | Terms | |||||||||||
a | and | colleague | doctor | great | is | patients | respected | this | troublesome | well | with | |
Text 1a | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Text 2b | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
Text 3c | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
aText 1: “A great colleague.”
bText 2: “A troublesome colleague.”
cText 3: “This doctor is well respected, and great with patients.”
Once the features have been extracted, they are used to “train” the algorithms to describe the relationship between the features and the classification.
In the validation stage the classifications given by the ensemble to new data are compared with the classifications made by the human qualitative analysts. If the results of the validation stage are acceptable, the algorithms can be exported and used to classify new data independently of the dataset that was used to train and validate the models.
The steps of these stages are given in greater detail in
“RTextTools” brings together other packages that contain different machine learning algorithms and provides a system by which the performance of each algorithm can be assessed both individually and as a collected ensemble of different methods that are combined to maximize performance in the training dataset. We included all of the available algorithms within RTextTools apart from the neural network, which did not converge in pilot assessments. The algorithms were support vector machine (SVM) using the radial basis function kernel with the penalty parameter of error term set to 1 and a gamma parameter set to 1/number of features [
A review and summary of supervised machine learning algorithms can be found elsewhere [
Algorithms were trained with a corpus of 1000 randomly selected precoded comments (see
We assessed algorithm performance using statistics of (1) recall (analogous to sensitivity)—what proportion of cases in a class are correctly assigned to the class; (2) precision (analogous to specificity)—how often a case that is predicted to belong to a class does belong to that class; and (3)
When assessed as an ensemble of multiple algorithms working together, recall is evaluated alongside coverage (the proportion of cases within the dataset to which the recall value applies) [
In addition to the standard assessment of algorithm performance using a validation dataset (636 comments), stability of algorithm performance across different data was also tested using an n-fold cross-validation. In the current analysis, a 10-fold validation was used in which 10 randomly selected samples of 1000 comments were selected from the dataset and validated using the remaining documents.
This analysis will indicate the robustness of the algorithms and their suitability for application to novel data and is preferable to split-half validation or bootstrapping [
As well as the precision of the algorithms it is important to assess their training efficiency (the relationship between performance and size of the training dataset) so that we might best understand how to apply these techniques in practice. We compared training efficiency performance using randomly selected training sets of 1000, 750, 500, 250, 100, and 50 to accurately classify randomly selected comments from a fixed-size validation set (636 cases).
To assess the ability of these categories to highlight differences in global performance, we investigated the differences in GMC-CQ scores for doctors whose comments were classified into at least 1 of 5 categories and those who were not placed in any category. We hypothesized that doctors who were placed into one or more categories would perform better than those doctors who were not classified into any of the categories.
We conducted this analysis using both the machine-coded dataset (the entire dataset was recoded by the algorithm blinded to the codes given by the human rater and the original human-rated dataset.
Questionnaire data were scored using the graded response model [
Computational text classification and statistical analysis were conducted within the R statistical programming environment (R Foundation) [
The study was originally considered by the Devon and Torbay NHS Research Ethics Committee but judged not to require a formal ethics submission. No subsequent ethical approval was sought for the secondary analyses on the anonymized datasets presented here.
Summary of algorithm and ensemble performance in the main analysis.
Modela | Metric | Innovator | Interpersonal | Popular | Professional | Respected | Average |
Support vector machine | .73 | .69 | .84 | .79 | .73 | .76 | |
Scaled linear discriminant analysis | .77 | .65 | .88 | .73 | .77 | .76 | |
Boosting | .75 | .77 | .81 | .76 | .75 | .77 | |
Bootstrap boosting | .87 | .85 | .83 | .80 | .82 | .83 | |
Random forests | .67 | .59 | .87 | .78 | .74 | .75 | |
Decision tree | .80 | .75 | .88 | .78 | .80 | .80 | |
Generalized linear model | .89 | .82 | .88 | .81 | .89 | .85 | |
Maximum entropy | .70 | .62 | .73 | .65 | .70 | .68 | |
Final ensemble (3+ models with |
Recall |
.98 | .80 | .97 | .82 | .87 | .89 |
10-Fold validation mean (range) | .97 (.96-.98) | .80 (.74-.86) | .97 (.96-.98) | .79 (.75-.83) | .86 (.84-.89) | .88 |
aTraining set size=1000; validation=636.
The GLM/LASSO algorithm was the single highest-performing algorithm for correctly classifying the open-text comments into the “innovator” category. The ensemble of 3 algorithms (GLM/LASSO, bootstrapped boosting, and regression tree) has a 98% recall agreement with the human coder. The 10-fold validation indicated robust accuracy scores between .96 and .98 (mean .97). In the whole dataset, 48 comments (3.5%) were classified as an innovator by the algorithms.
The bootstrapped boosting algorithm was the best-performing algorithm for categorizing open-text comments into the “interpersonal skills” category. The ensemble of 3 algorithms (boosting, GLM/LASSO, and bootstrapped boosting) demonstrated an 80% recall agreement with the human-coded dataset, the lowest performance for any of the classes. The 10-fold validation indicated similar performance in each fold and agreement values between .74 and .86 (mean .80). The algorithms classified 435 comments (28.4%) as “interpersonal.”
All algorithms performed exceptionally well in classifying open-text comments into the “popular” category, with
Similar performance was evident for many algorithms including SVM, random forests, and GLM/LASSO . Overall ensemble performance (GLM/LASSO, bootstrap boosting, and SVM) had an interrater recall of .82. The 10-fold validation suggested good agreement between the algorithm and the human analyst (mean .79, range .75-.83). The algorithms classified almost half of the comments in the whole dataset into the “professional” category (643 comments, 42.7%).
Once again, the GLM/LASSO algorithm showed the strongest single performance in the classification task for the “respected” category (
The GLM/LASSO algorithm was the strongest performing individual algorithm and the maximum entropy the worst. The overall average performance was nevertheless high (
Results for the 10-fold cross-validation were very similar for the final recall values for the individual samples. The n-fold result displayed a tight distribution over the 10 samples (
As the training sample size was reduced, the algorithms continued to perform well but fell sharply when the training dataset was reduced to fewer than 250 comments.
Algorithm performance with differing training sample sizes. Performance decreases as expected with smaller training corpora.
The
Comparison of means between doctors classified into a category and those who were unclassified.
Categories | Panel Aa | Panel Bb | |||||||||
Mean score |
Reports in |
Mean score |
Reports in |
||||||||
dfc | df | ||||||||||
Innovator | 0.00 | 48 | 0.00 | 55.74 | .99 | 0.01 | 59 | 1.14 | 35.69 | .26 | |
Interpersonal | 1.97 | 435 | 1.98 | 857.97 | .04 | 0.07 | 432 | 2.97 | 346.63 | <.01 | |
Popular | −0.05 | 107 | −0.88 | 176.42 | .38 | 0.13 | 131 | 1.32 | 149.05 | .19 | |
Professional | −0.03 | 643 | 2.51 | 901.34 | .01 | 0.1 | 701 | 3.47 | 286.99 | <.001 | |
Respected | 0.15 | 243 | 3.75 | 629.17 | <.001 | 0.44 | 346 | 5.58 | 300.13 | <.001 | |
More than 1 category | 0.04 | 1081 | 0.77 | 173.81 | .001 | 0.12 | 1189 | 3.81 | 239.8 | <.001 | |
No categories | −0.09 | 413 | N/Ad | N/A | N/A | −0.4 | 425 | N/A | N/A | N/A |
aPanel A: analysis using machine ensemble classifications on entire corpus.
bPanel B: analysis using human rater classifications on entire corpus.
cdf: degrees of freedom.
dN/A: not applicable.
Comparison of General Medical Council Colleague Questionnaire (GMC-CQ) scores between doctors who were placed in 1 of the 5 categories versus those who were not (positive comments only). Significance (P) values for the t tests are shown to indicate the relationship between the 2 groups.
This study demonstrates the ability of machine learning algorithms to categorize qualitative data with high performance. The integration of such algorithms into data analysis tool kits for nationwide surveys may allow rich qualitative data to be analyzed without the resource burden associated with expert human ratings across an entire corpus [
We also demonstrate the ability of categories to highlight differences in overall doctor performance that were statistically significant. We hypothesized that doctors who were classified into 1 of the 5 categories would have higher scores on the GMC-CQ than those who were unclassified. We found partial support for this hypothesis: doctors who were classified as “respected,” “professional,” and “interpersonal” tended to outperform unclassified doctors, whereas no significant difference in performance was evident between doctors who were classified as “popular,” “innovative,” and those who were not classified into any of the 5 categories. However, the number of doctors who received a classification of “innovative” was low, which resulted in tests with low power and potential for type II error. Doctors with multiple ratings performed better than those without any ratings. Doctors who were classified as “respected” had the highest performance of each group in both the human-rated and machine-rated datasets.
These techniques have clear potential for developing actionable insights in diverse specialties: they have also been used to classify patient-derived open-text comments in national cancer surveys [
We trained algorithms using precoded data and validated their performance on uncoded data, an example of “supervised” machine learning. We demonstrate strong performance using a relatively simple sparse “term-document matrix” method of identifying features in open text. The term-document matrix simply counts the instances of a word’s use within a comment and does not consider the order in which the words are presented. This approach has been used in similar studies published in the medical literature [
It is possible to extract features using different, more complex, methods. Feature extraction using
Although performance of these techniques is demonstrably high, further research may be warranted to explore the extent to which they can improve the accuracy of classification algorithms in this context and at what cost (eg, computational burden or interpretability).
Natural language processing algorithms and their related software, driven by market forces in the technology industry, are improving at a remarkable rate and, paradoxically given the economic motivation for their development, increasingly being distributed at no cost under open-source licenses. As the field develops, we may expect these sorts of algorithms to successfully classify more complex corpora and perhaps even identify important elements within open-text comments without task-specific training data, which is known as “unsupervised” machine learning.
This study has some limitations. The average performance of our algorithms is likely to be high in some instances given the low incidence of category. For example, algorithm performance was exceptionally high (recall=.97) for the “innovator” category, where only 4.2% of doctors were rated as innovative. In this instance, the low number of classifications in the population would have meant that a “dumb guess” that simply rated each doctor as “not innovative” would have demonstrated a 95.8% agreement rate. However, while algorithm performance was somewhat lower for categories with balanced distributions (eg, 46% of doctors were rated as professional), it was still acceptably high (recall=.82, 10-fold accuracy=.87).
Because of a somewhat small dataset, the trained ensemble was used to reclassify the whole corpora on which the algorithms were originally trained. It is likely that the performance of the algorithms will be higher when reclassifying the data they were originally trained on. The rationale for this decision was to maximize the number of classified categories and therefore maintain statistical power in the analyses. The results of these assessments were broadly the same as the results from the human-classified dataset, albeit the effect sizes were consistently smaller in the analyses using the machine-classified codes. This may be especially important as it appeared that the machine-labeled dataset was less sensitive to differences in doctor’s performance than the human-labeled dataset.
A further possible limitation of the dataset was the necessity to have a small training-set to validation-set ratio (3:2) to keep sufficient number of comments in the validation sample. While this advantaged the statistical analysis of difference in performance signaling for different categories, it may have hampered the performance of certain algorithms by not providing sufficient training data.
The significant positive skew in doctors’ ratings and the scarcity of comments that were outright negative meant that we were unable to conduct a sentiment analysis. We expect this to be the case in a population where most subjects are performing well, and this is probably representative of most datasets that collect open-text information on doctor performance. The content of uncategorized comments reveals a trend of doctors saying
Although algorithm performance was generally high for each of the individual machine learning techniques, it is apparent that the generalized linear model with lasso regularization had the highest performance for each of the classes. The precise reason for this improved performance is somewhat opaque, but the lasso regularization technique is especially suitable for classification problems using sparse matrices [
Similarly, it is not immediately clear as to why certain codes could be classified with greater accuracy than others. The differences in performance between classes may be explained by differences in the conceptual basis of each class; both humans and algorithms may find it easier to classify comments that reflect easily defined concepts such as being “popular” (the class for which algorithm performance was highest), rather than less well-defined concepts such as being “interpersonal” (the class for which algorithm performance was lowest) [
There may be an opportunity for similar techniques to be applied to patient experience data to build algorithms that can correctly classify and perhaps, using sentiment analysis, quantify open text in national-scale patient experience surveys and provide feedback that is more meaningful to both patients and practitioners. Computational analysis of open-text comments may be of greater usefulness when it is used to identify issues that were not previously envisaged.
This study demonstrates excellent performance for an ensemble of machine learning algorithms tasked to classify open-text comments of doctors’ performance. These algorithms perform well, even where limited time and resources are available to code training datasets. We demonstrate that machine identification of qualitatively derived, theory-based open-text classifications can signpost significant differences in a doctor’s performance, even when comments are exclusively positive. These findings may inform future predictive models of performance and support real-time evaluation to improve quality and safety.
analysis of variance
General Medical Council Colleague Questionnaire
National Institute for Health Research
scaled linear discriminant analysis
support vector machine
We thank Karen Alexander, the National Institute for Health Research (NIHR) Adaptive Tests for Long-Term Conditions (ATLanTiC) patient and public involvement partner, for providing critical insight, comments, and editing the manuscript. Data collection and qualitative coding were funded by the UK General Medical Council as an unrestricted research award. Support for the novel work presented in this paper was given by a postdoctoral fellowship award for CG (NIHR-PDF-2014-07-028).
JC has been an advisor to the UK General Medical Council and has received only direct costs associated with the presentation of this work.