Automatic Classification of Online Doctor Reviews: Evaluation of Text Classifier Algorithms

Background: An increasing number of doctor reviews are being generated by patients on the internet. These reviews address a diverse set of topics (features), including wait time, office staff, doctor’s skills, and bedside manners. Most previous work on automatic analysis of Web-based customer reviews assumes that (1) product features are described unambiguously by a small number of keywords, for example, battery for phones and (2) the opinion for each feature has a positive or negative sentiment. However, in the domain of doctor reviews, this setting is too restrictive: a feature such as visit duration for doctor reviews may be expressed in many ways and does not necessarily have a positive or negative sentiment. Objective: This study aimed to adapt existing and propose novel text classification methods on the domain of doctor reviews. These methods are evaluated on their accuracy to classify a diverse set of doctor review features. Methods: We first manually examined a large number of reviews to extract a set of features that are frequently mentioned in the reviews. Then we proposed a new algorithm that goes beyond bag-of-words or deep learning classification techniques by leveraging natural language processing (NLP) tools. Specifically, our algorithm automatically extracts dependency tree patterns and uses them to classify review sentences. Results: We evaluated several state-of-the-art text classification algorithms as well as our dependency tree–based classifier algorithm on a real-world doctor review dataset. We showed that methods using deep learning or NLP techniques tend to outperform traditional bag-of-words methods. In our experiments, the 2 best methods used NLP techniques; on average, our proposed classifier performed 2.19% better than an existing NLP-based method, but many of its predictions of specific opinions were incorrect. Conclusions: We conclude that it is feasible to classify doctor reviews. 
Automatically classifying these reviews would allow patients to easily search for doctors based on their personal preference criteria. (J Med Internet Res 2018;20(11):e11141)


Background
The problem of automatic reviews analysis and classification has attracted much attention because of its importance in ecommerce applications [1][2][3]. Recently, there has been an increase in the number of sites where users rate doctors. Several works have analyzed the content and scores of such reviews, mostly by examining a subset of them through qualitative and quantitative analysis [4][5][6][7][8][9] or by applying text-mining techniques to characterize trends [10][11][12]. However, not much work has studied how to automatically classify doctor reviews.
In this study, our objective was to automatically summarize the content of a textual doctor review by extracting the features it mentions and the opinion of the reviewer for each of these features; for example, to estimate if the reviewer believes that the wait time or the visit time is long or if the doctor is in favor of complementary medicine methods. We explore the feasibility of reaching this objective by defining a broader definition of the review classification problem that addresses challenges in the domain of doctor reviews and examining the performance of several machine learning algorithms in classifying doctor review sentences.
Previous work on customer review analysis focused on automated extraction of features and the polarity (also referred to as opinion or sentiment) of statements about those features [2,13,14]. Specifically, these works tackle the problem in 2 steps: first they extract the features using rules, and then, for each feature, they estimate the polarity using hand-crafted rules or supervised machine learning methods. This works well if (1) the features are basic, such as the battery of a phone, which are generally described by a single keyword, for example, the battery of the camera is poor, and (2) the opinion is objectively positive or negative. It does not support more subjective features such as visit time, where a longer visit is positive for some patients and negative for others. In other words, statements about features in product reviews tend to be more straightforward and unambiguously positive or negative, whereas reviews on service, such as doctor reviews, are often less so, as there may be many ways to express an opinion on some aspect of the service.
In our study, the features may be more complex, for example, the visit time feature can be expressed by different phrases such as "spends time with me," "takes his time," "not rushed," and so on. As another example, "appointment scheduling" can be expressed in many different ways, for example, "I was able to schedule a visit within days" or "The earliest appointment I could make is in a month." Other complex classes include staff or medical skills.
Furthermore, in our study, what is positive for one user may be negative for another. For example, consider the sentence "Dr. Chan is very fast so there is practically no wait time and you are in and out within 20 minutes." The sentiment in this sentence is positive, but a short visit implied by in and out within 20 minutes may be negative for some patients. Instead, what we want to measure is long visit time versus short visit time. This is different from work on detecting transition of sentiment [15] because it is not enough to detect the true sentiment, but we must also associate it with a class (long visit time vs short visit time).
To address this variation of the review classification problem, we created a labeled dataset consisting of 5885 sentences from 1017 Web-based doctor reviews. We identified several classes of doctor review opinions and labeled each sentence according to the presence and polarity of these opinion classes. Note that our definition of polarity is broader than in previous work as it is not strictly positive and negative but rather takes the subjectivity of patient opinions into account (eg, complementary medicine is considered good by some and bad by others).
We adapt existing and propose new classifiers to classify doctor reviews. In particular, we consider 3 diverse types of classifiers:
1. Bag-of-words classifiers such as Support Vector Machine (SVM) [16,17] and Random Forests [18] that leverage the statistical properties of the review text, such as the frequency of each word.
2. Deep learning methods such as Convolutional Neural Network (CNN) [19], which also consider the proximity of the words.
3. Natural Language Processing (NLP)-based classifiers, which leverage the dependency tree of a review sentence [20]. Specifically, we consider an existing NLP-based classifier [21] and propose a new one, the Dependency Tree-Based Classifier (DTC).
DTC generates the dependency tree for each sentence in a review and applies a set of rules to extract dependency tree-matching patterns. These patterns are then ranked by their accuracy on the training set. Finally, the sentences of a new review are classified based on the highest-ranking matching pattern. This is in contrast to the work by Matsumoto et al [21], which treats dependency tree patterns as features in an SVM classifier.
The results of our study show that classifying doctor reviews to identify patient opinions is feasible. The results also show that DTC generally outperforms all other implemented text classification techniques.
Here is a summary of our contributions:
1. We propose a broader definition for the review classification problem in the domain of doctor reviews, where the features can be complex entities and the polarity is not strictly positive or negative.
2. We evaluated a diverse set of 5 state-of-the-art classification techniques on a labeled dataset of doctor reviews containing a set of commonly used and useful features.
3. We propose a novel dependency tree-based classifier and show that it outperforms the other methods; we have published the code on the Web [22].

Literature Review
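In this section, we review research in fields related to this study, which we organize into 5 categories:
• Quantitative and qualitative analysis of doctor review ratings and content
• Text mining of doctor reviews
• Customer review feature and polarity extraction
• Sentiment analysis with dependency trees
• Text classification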

Doctor Review Analysis
Several previous works have analyzed Web-based doctor reviews. Gao et al described trends in doctor reviews over time to identify which characteristics influence Web-based ratings [4]. They found that obstetricians or gynecologists and long-time graduates were more likely to be reviewed than other physicians; recent graduates, board-certified physicians, highly rated medical school graduates, and doctors without malpractice claims received higher ratings; and reviews were generally positive. Segal et al compared doctor review statistics with surgeon volume [5]. They found that high-volume surgeons could be differentiated from low-volume surgeons by analyzing the number of numerical ratings, the number of text reviews, the proportion of positive reviews, and the proportion of critical reviews. López et al performed a qualitative content analysis of doctor reviews [6]. They found that most reviews were positive and identified 3 overarching domains in the reviews they analyzed: interpersonal manner, technical competence, and system issues. Hao analyzed Good Doctor Online, an online health community in China, and found that gynecology-obstetrics-pediatrics doctors were the most likely to be reviewed, internal medicine doctors were less likely to be reviewed, and most reviews were positive [7]. Smith and Lipoff conducted a qualitative analysis of dermatology practice reviews from Yelp and ZocDoc [8]. They found that both the average review scores and the proportion of reviews with 5 out of 5 stars from ZocDoc were higher than those from Yelp. They also found that high-scoring reviews and low-scoring reviews had similar content (eg, physician competency, staff temperament, and scheduling) but opposite valence.
Daskivich et al analyzed health care provider ratings across several specialties and found that allied health providers (eg, providers who are neither doctors nor nurses) had higher patient satisfaction scores than physicians, but these scores were also the most skewed [9]. They also concluded that specialty-specific percentile ranks might be necessary for meaningful interpretation of provider ratings by consumers.

Text Mining of Doctor Reviews
Other previous papers have employed text-mining techniques to characterize trends in doctor reviews. Wallace et al designed a probabilistic generative model to capture latent sentiment across aspects of care [10]. They showed that including their model's output in regression models improves correlations with state-level quality measures. Hao and Zhang used topic modeling to extract common topics among 4 specialties in doctor reviews collected from Good Doctor Online [11]. They identified 4 popular topics across the 4 specialties: the experience of finding doctors, technical skills or bedside manner, patient appreciation, and description of symptoms. Similarly, Hao et al used topic modeling to compare reviews between Good Doctor Online and the US doctor review website RateMDs [12]. Although they found similar topics between the 2 sites, they also found differences that reflect differences between the 2 countries' health care systems. These works differ from ours in that they use text-mining techniques to analyze doctor reviews in aggregate, while our goal is to identify specific topics in individual reviews.

Customer Review Feature and Polarity Extraction
As discussed earlier in the Introduction, these works operate on a more limited problem setting where the features are usually expressed by a single keyword, and the sentiment is strictly positive or negative. Hu and Liu extracted opinions of features in customer reviews with a 4-step algorithm [2]. This algorithm consists of applying association rule mining to identify features, pruning uninteresting and redundant features, identifying infrequent features, and finally determining semantic orientation of each opinion sentence. Popescu and Etzioni created an unsupervised system for feature and opinion extraction from product reviews [3]. After finding an explicit feature in a sentence, they applied manually crafted extraction rules to the sentence and extracted the heads of potential opinion phrases. This method only works when features are explicit.

Sentiment Analysis With Dependency Trees
There are a number of existing works that use dependency trees or patterns for sentiment analysis. A key difference is that our method does not always capture sentiment but rather the class label (eg, short or long) for each class (eg, visit time). Hence, we cannot rely on external sentiment training data or on hard-coded sentiment rules, but we must use our own training data.
Agarwal et al used several hand-crafted rules to extract dependency tree patterns from sentences [23]. They combined this information with the semantic information present in the Massachusetts Institute of Technology Media Lab ConceptNet ontology and employed the extracted concepts to train a machine learning model to learn concept patterns in the text, which were then used to classify documents into positive and negative categories. An important difference from our method is that their dependency patterns generally consist of only 2 words in certain direct relations, while our patterns can contain several more in both direct and indirect relations.
Wawer induced dependency patterns by using target-sentiment (T-S) pairs and recording the dependency paths between T and S words in the dependency tree of sentences in their corpus [24]. These patterns were supplemented with conditional random fields to identify targets of opinion words. In contrast to our patterns, which can represent a subtree of 2 or more words, the patterns in this work are generated from the shortest path between the T and S words.
Matsumoto et al's work [21] is the closest work to our proposed method, which we experimentally compare in the Results section. They extract frequent word subsequences and dependency subtrees from the training data and use them as features in an SVM sentiment classifier. Their patterns involve frequent words and only include direct relations, whereas our patterns involve high-information gain words and consider indirect relations. Pak and Paroubek follow a similar strategy of extracting dependency tree patterns based on predefined rules and using them as features for an SVM classifier [25]. Matsumoto et al perform better on the common datasets they considered.

Text Classification
Machine learning algorithms are commonly used for text classification. Kennedy et al used a random forest classifier to identify harassment in posts from Twitter, Reddit, and The Guardian [26]. Posts were represented through several features such as term frequency-inverse document frequency (TF-IDF) of unigrams, bigrams, and short character sequences; URL and hashtag token counts; source (whether the post was from Twitter); and sentiment polarity. Gambäck and Sikdar used a CNN to classify hate speech in Twitter posts [27]. The CNN model was tested with multiple feature embeddings, including random values and word vectors generated with Word2Vec [28]. Lix et al used an SVM classifier to determine patients' alcohol use from text in electronic medical records [29]. Unigrams and bigrams in these records were represented using a bag-of-words model.

Problem Definition
Given a text dataset with a set of classes $c_1, c_2, \ldots, c_m$ that represent features previously identified by a domain expert, each class $c_i$ can take 3 values (polarity):
• $c_i^0$: The sentence is not relevant to the class.
• $c_i^x$, $c_i^y$: Yes or no. Note that to avoid confusion, we do not say positive or negative, as for some classes such as visit time in doctor reviews, some patients prefer a long visit and others prefer a short one. In this example, "Yes" could arbitrarily be mapped to long and "No" to short.
… and "I'll call to reschedule everything." A sentence may take labels from more than one class.
In this study, given a training set T of review sentences with class labels from classes $c_1, c_2, \ldots, c_m$, we build a classifier for each class $c_i$ to classify new sentences to one of the possible values of $c_i$. Specifically, we build m training sets $T_i$, one for each class. Each sentence in $T_i$ is assigned a class label $c_i^x$, $c_i^y$, or $c_i^0$.
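The construction of the per-class training sets can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the class names, and the string encoding of the labels ("x", "y", "0") are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical label encoding: "x" (Yes), "y" (No), "0" (not relevant).
LabeledSentence = Tuple[str, Dict[str, str]]  # (text, {class name: label})


def build_training_sets(
    sentences: List[LabeledSentence], classes: List[str]
) -> Dict[str, List[Tuple[str, str]]]:
    """Return one training set per class; classes a sentence does not mention get "0"."""
    training_sets: Dict[str, List[Tuple[str, str]]] = {c: [] for c in classes}
    for text, labels in sentences:
        for c in classes:
            training_sets[c].append((text, labels.get(c, "0")))
    return training_sets


# Toy sentences; a sentence may carry labels from more than one class.
sentences = [
    ("The wait was over an hour.", {"wait_time": "y"}),
    ("He takes his time and the staff is friendly.", {"visit_time": "x", "staff": "x"}),
]
sets = build_training_sets(sentences, ["wait_time", "visit_time", "staff"])
```

Every sentence appears in every training set, so each per-class classifier also sees the sentences that are irrelevant to its class.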

Doctor Reviews Dataset
We crawled Vitals [30], a popular doctor review website, to collect 1,749,870 reviews. Each author read approximately 200 reviews and constructed a list of features. Afterward, through discussions, we merged these lists into a single list of 13 features, which we represent by classes as described in the problem definition (Table 1).
To further filter these classes, we selected 600 random reviews to label. We labeled these reviews using WebAnno, a Web-based annotation tool [31] (Figure 1). Specifically, each sentence was tagged (labeled) with 0 or more classes from Table 1 by 2 of the authors. The union of these labels was used as the set of ground-truth class labels of each sentence; that is, if at least one of the labelers labeled a sentence as $c_i^x$, that sentence is labeled $c_i^x$.
We found that some of these classes were underrepresented. For each underrepresented class, we used relevant keywords to find and label more reviews from the collected set of reviews, for example, wait for wait time and listen for information sharing, which resulted in a total of 1017 reviews (417 in addition to the original 600). These 1017 reviews form the labeled dataset used in our experiments.
Following this, we found that some classes such as complementary medicine and joint decision making were still underrepresented, which we define as having less than 2% of the dataset's sentences labeled $c_i^x$ or $c_i^y$, so we omitted them from the dataset. The final dataset consists of 5885 sentences and 8 opinion classes. These classes and the frequency of each of their labels are shown in Table 2.
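The two labeling rules described here, taking the union of both annotators' labels and dropping classes below the 2% threshold, can be sketched as follows. The "class:label" string encoding, the function names, and the toy data are assumptions for illustration only.

```python
from typing import Dict, List, Set


def merge_annotations(a: List[Set[str]], b: List[Set[str]]) -> List[Set[str]]:
    """Ground truth per sentence is the union of both annotators' labels."""
    return [labels_a | labels_b for labels_a, labels_b in zip(a, b)]


def frequent_classes(labels: List[Set[str]], min_share: float = 0.02) -> Set[str]:
    """Keep classes whose Yes/No labels cover at least min_share of all sentences."""
    counts: Dict[str, int] = {}
    for sentence_labels in labels:
        # Count each class once per sentence, whatever its polarity.
        for cls in {label.split(":")[0] for label in sentence_labels}:
            counts[cls] = counts.get(cls, 0) + 1
    return {cls for cls, k in counts.items() if k / len(labels) >= min_share}


# Toy data: 100 sentences, most of them without any opinion label.
annotator_a = [{"wait_time:y"}, set(), {"staff:x"}, {"wait_time:x"}] + [set()] * 96
annotator_b = [{"wait_time:y"}, {"wait_time:x"}, set(), set()] + [set()] * 96
merged = merge_annotations(annotator_a, annotator_b)
kept = frequent_classes(merged)
```

In this toy data, wait time is labeled in 3 of 100 sentences and is kept, while staff is labeled in only 1 and is dropped.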

Background on Dependency Trees
In this section, we describe dependency trees and the semgrex tool that we used for defining matching patterns. Dependency trees capture the grammatical relations between words in a sentence and are produced using a dependency parser and a dependency language. In a dependency tree, each word in a sentence corresponds to a node in the tree and is in one or more syntactic relations with exactly one other word or node. A dependency tree is a triple $T = \langle N, E, R \rangle$, where
• $N$ is the set of nodes in $T$, where each node $n \in N$ is a tuple containing one or more string attributes describing a word in the sentence $T$ was built from, such as word, lemma, or POS (part of speech)
• $E$ is the set of edges in $T$, where each edge $e \in E$ is a triple $e = \langle n_g, r, n_d \rangle$, where
  • $n_g \in N$ is the governor or parent in relation $r$
  • $r$ is a syntactic relation between the words represented by $n_g$ and $n_d$
  • $n_d \in N$ is the dependent or child in relation $r$
• $R \in N$ is the root node of $T$
Figure 2 shows a sample dependency tree for the sentence "there are never long wait times." The string representation of this tree, including the parts of speech for its words, is as follows:
To match patterns against dependency trees, we used the Stanford semgrex utility [32]. In the following, we explain some of the basics of semgrex patterns that help the reader understand the patterns presented in this study, using descriptions and examples from the Chambers et al study [32]. Semgrex patterns are composed of nodes and relations between them. Nodes are represented as {attr1:value1;attr2:value2;…}, where attributes (attr) are regular strings such as word, lemma, and pos, and values can be strings or regular expressions marked by "/"s. For example, {lemma:run;pos:/VB.*/} means any verb form of the word run. Similar to "." in regular expressions, {} means any node in the graph. Relations in a semgrex pattern have 2 parts: the relation symbol, which can be either < or >, and optionally the relation type (eg, nsubj and dobj).
In general, A<reln B means A is the dependent of a relation (reln) with B, whereas A>reln B means A is the governor of a reln with B. Indirect relations can be specified by the symbols >> and <<. For example, A<<reln B means there is some node in a dep->gov chain from A that is the dependent of a reln with B. Relations can be strung together with or without using the symbol &. All relations are relative to the first node in the string. For example, A>nsubj B>dobj D means A is a node that is the governor of both an nsubj relation with B and a dobj relation with D. Nodes can be grouped with parentheses. For example, A>nsubj (B>dobj D) means A is the governor of an nsubj relation with B, whereas B is the governor of a dobj relation with D. A sample pattern that matches the tree in Figure 2 can be:

{} >neg {} >> ({word:wait} > {word:long})
Using the Stanford CoreNLP Java library [33], our proposed classifier builds a dependency tree from a given sentence and determines whether any of a list of semgrex patterns matches any part of the tree.
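To make the tree definition and the direct (>) versus indirect (>>) relations concrete, the following is a small Python sketch. It is not semgrex and does not use CoreNLP: the DepTree class, the governs helper, and the hand-written parse of the sample sentence are all assumptions for illustration, and the parse may differ from the tree in Figure 2.

```python
from typing import Dict, List, Optional, Tuple

Node = Dict[str, str]        # attributes such as word and pos
Edge = Tuple[int, str, int]  # (governor id, relation, dependent id)


class DepTree:
    """A dependency tree stored as the triple (N, E, R)."""

    def __init__(self, nodes: Dict[int, Node], edges: List[Edge], root: int):
        self.nodes, self.edges, self.root = nodes, edges, root

    def children(self, gov: int, rel: Optional[str] = None) -> List[int]:
        """Direct dependents of gov, optionally restricted to one relation type."""
        return [d for g, r, d in self.edges if g == gov and rel in (None, r)]

    def descendants(self, gov: int) -> List[int]:
        """All nodes reachable from gov through governor-to-dependent edges."""
        found, stack = [], self.children(gov)
        while stack:
            node = stack.pop()
            found.append(node)
            stack.extend(self.children(node))
        return found


def governs(tree: DepTree, gov_word: str, dep_word: str, indirect: bool = False) -> bool:
    """Rough analogue of {word:gov} > {word:dep}, or >> when indirect is True."""
    for i, node in tree.nodes.items():
        if node["word"] != gov_word:
            continue
        reach = tree.descendants(i) if indirect else tree.children(i)
        if any(tree.nodes[j]["word"] == dep_word for j in reach):
            return True
    return False


# Hand-written (assumed) parse of "there are never long wait times".
words = ["there", "are", "never", "long", "wait", "times"]
tags = ["EX", "VBP", "RB", "JJ", "NN", "NNS"]
nodes = {i + 1: {"word": w, "pos": t} for i, (w, t) in enumerate(zip(words, tags))}
edges = [(2, "expl", 1), (2, "neg", 3), (2, "nsubj", 6), (6, "amod", 4), (6, "compound", 5)]
tree = DepTree(nodes, edges, root=2)
```

With this parse, governs(tree, "are", "long") is false because long is not a direct dependent of are, while the indirect variant is true because long is reachable through times.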

Proposed Dependency Tree-Based Classifier
Our DTC algorithm is trained on a labeled dataset of sentences as described in the Problem Definition section. On a high level, given a sentence in training dataset T, the classifier generates a dependency tree using the Stanford Neural Network Dependency Parser [34] and extracts semgrex patterns from the dependency tree. These patterns are assigned the same class as the training sentence. When classifying a new sentence, the classifier generates the sentence's dependency tree and assigns a class label to the sentence based on which patterns from the training set match the dependency tree.
In more detail, the classifier's training algorithm generates a sorted list of semgrex patterns, each with an associated class label, from a training dataset T and integer parameters $n_i^x$, $n_i^y$, and m. Parameters $n_i^x$ and $n_i^y$ are the maximum number of terms (words or phrases) that will be used to generate patterns of classes $c_i^x$ and $c_i^y$, respectively. In this study, we only use words, as dependency trees capture relations between words rather than phrases.
The pattern extraction algorithm described in the Pattern Extraction section below receives as input 2 sets $W_x$ and $W_y$ of high-information gain words, for the "Yes" ($c_i^x$) and "No" ($c_i^y$) class labels, respectively, from which we pick nodes for the generated patterns. The intuition is that high-information gain words are more likely to allow a pattern to differentiate between the class labels. Considering all words would be too computationally expensive, and it did not offer any significant advantage in our experiments. The information gain for $W_x$ is determined by a logical copy of training dataset T in which class labels other than $c_i^x$ are given a new class label $c_i^{x'}$, as the words in $W_x$ will be used to identify sentences of class $c_i^x$. This process is repeated for $W_y$. Parameter m is the maximum number of these selected words that can be in a single pattern.
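One way to select such words is sketched below: each class label is relabeled as "target versus rest," and words are ranked by the information gain of their presence in a sentence. The whitespace tokenization, the function names, and the alphabetical tie-breaking are assumptions; the source does not specify how information gain was computed.

```python
import math
from typing import List, Set, Tuple


def entropy(pos: int, total: int) -> float:
    """Binary entropy (in bits) of a set with pos positive items out of total."""
    if total == 0 or pos in (0, total):
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(docs: List[Tuple[Set[str], bool]], word: str) -> float:
    """Gain from splitting docs on whether they contain word."""
    n = len(docs)
    with_word = [y for words, y in docs if word in words]
    without_word = [y for words, y in docs if word not in words]
    conditional = sum(
        len(part) / n * entropy(sum(part), len(part))
        for part in (with_word, without_word)
    )
    return entropy(sum(y for _, y in docs), n) - conditional


def top_words(sentences: List[Tuple[str, str]], target: str, n_words: int) -> List[str]:
    """Top n_words words for one class label, with all other labels merged."""
    docs = [(set(text.lower().split()), label == target) for text, label in sentences]
    vocab = set().union(*(words for words, _ in docs))
    ranked = sorted(vocab, key=lambda w: (-information_gain(docs, w), w))
    return ranked[:n_words]


toy = [
    ("long wait today", "y"),
    ("waited an hour", "y"),
    ("no wait at all", "x"),
    ("friendly staff", "0"),
    ("great doctor", "0"),
]
w_x = top_words(toy, "x", 3)
```

In the toy data, the words that appear only in the single "x" sentence rank highest, whereas wait ranks lower because it also occurs in a "y" sentence.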
The final list of (semgrex pattern p, class label c') pairs is sorted by the weighted accuracy of the pair on the training data, which we define as
$$\text{WeightedAccuracy}(p, T) = \frac{1}{|c_i|} \sum_{c \in c_i} \text{Accuracy}_c(p, T)$$
We define $\text{Accuracy}_c(p, T)$ as the ratio of training instances in T with class label c that were correctly handled by pattern p. Pattern p, which was paired with class label c', is correct if it matches an instance with class label c' or it does not match an instance without class label c', but it is incorrect if it matches an instance without class label c' or it does not match an instance with class label c'. $|c_i|$ is the number of class labels in class $c_i$, which is 3 for all of the classes in this study. Intuitively, weighted accuracy treats all class labels with equal importance regardless of their frequency, so patterns that perform well on sentences of the often low-frequency class labels $c_i^x$ and $c_i^y$ are assigned a higher rank than they would otherwise. The training algorithm is shown in Textbox 1.
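The following sketch computes this score for a single pattern, assuming weighted accuracy is the unweighted mean of the per-label accuracies, as the surrounding definitions suggest. The matcher is a plain Python predicate on strings standing in for a semgrex pattern on trees, and counting a label with no instances as 0 is an assumption.

```python
from typing import Callable, List, Sequence, Tuple


def pattern_weighted_accuracy(
    matches: Callable[[str], bool],
    pattern_label: str,
    instances: List[Tuple[str, str]],
    labels: Sequence[str] = ("x", "y", "0"),
) -> float:
    """Mean over class labels c of the share of label-c instances handled correctly."""
    total = 0.0
    for c in labels:
        subset = [item for item, label in instances if label == c]
        if not subset:
            continue  # assumption: a label with no instances contributes 0
        if c == pattern_label:
            correct = sum(1 for item in subset if matches(item))      # should match
        else:
            correct = sum(1 for item in subset if not matches(item))  # should not match
        total += correct / len(subset)
    return total / len(labels)


# Toy training data; the hypothetical pattern "contains the word hour" is paired with "y".
training = [
    ("waited an hour", "y"),
    ("wait was short", "x"),
    ("over an hour late", "y"),
    ("nice staff", "0"),
    ("an hour of his time", "x"),
]
score = pattern_weighted_accuracy(lambda s: "hour" in s.split(), "y", training)
```

Here the pattern is right on both "y" sentences and on the "0" sentence but wrongly matches one of the two "x" sentences, giving (1 + 0.5 + 1) / 3.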
Given a to-be-classified sentence, we compute its dependency tree t and find the highest ranked (pattern p and class label c) pair where p matches t. Then the sentence is classified as c. If no pattern matches the sentence, we provide 2 possibilities: the sentence can be classified as the most common class label in T or it can be classified by a backup classifier trained on T.
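The classification step can be sketched as a first-match scan over the ranked pairs with a fallback. Plain predicates on strings stand in for semgrex patterns on dependency trees, and the function and variable names are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Tuple

RankedPattern = Tuple[Callable[[str], bool], str]  # (matcher, class label)


def classify(sentence: str, ranked: List[RankedPattern], fallback: Callable[[str], str]) -> str:
    """Return the label of the highest-ranked matching pattern, else ask the fallback."""
    for matches, label in ranked:  # ranked is sorted best-first
        if matches(sentence):
            return label
    return fallback(sentence)


# Fallback option 1: the most common class label in the training data.
training_labels = ["0", "0", "y", "x"]
most_common = Counter(training_labels).most_common(1)[0][0]

# Toy ranked list; a backup classifier could be passed as the fallback instead.
ranked = [
    (lambda s: "hour" in s, "y"),
    (lambda s: "no wait" in s, "x"),
]
predicted = [
    classify(s, ranked, lambda _: most_common)
    for s in ("waited an hour", "no wait at all", "friendly staff")
]
```

Passing the fallback as a function keeps both options from the text interchangeable: a constant label or a trained backup classifier.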

Parameters Setting
In all experiments, we use $n_i^x = n_i^y = 30$, as intuitively it is unlikely that there are more than 30 words for a class that can participate in a discriminative semgrex pattern. We set m to 4 for all experiments, because for m > 4, it becomes too computationally expensive to compute all patterns.
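The growth in cost with m can be illustrated by counting the word combinations that the extraction algorithm starts from when 30 words are available. This counts only the initial combinations for one tree, not the subpatterns added later, so it is a rough lower bound on the work involved; the function name is hypothetical.

```python
import math
from typing import Dict


def combination_counts(n_words: int, max_m: int) -> Dict[int, int]:
    """Number of ways to choose m words out of n_words, for m = 2..max_m."""
    return {m: math.comb(n_words, m) for m in range(2, max_m + 1)}


counts = combination_counts(30, 6)
for m, k in counts.items():
    print(f"m={m}: {k} combinations")
```

Moving from m=4 to m=5 raises the count from 27,405 to 142,506 combinations.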

Pattern Extraction in the Dependency Tree Classifier Algorithm
Overview
Given a dependency tree, we now describe how to extract patterns. Note that we repeat the pattern extraction for the "Yes" and "No" class labels, using $W_x$ and $W_y$, respectively (W in this section refers to $W_x$ or $W_y$). We extract semgrex patterns from a dependency tree t with class label c using a set of high-information gain words W and a maximum number of words m. The algorithm returns a set of patterns extracted from t made from up to m words in W.
The rationale for only working with high-information gain words is that we want to generate high-information gain patterns. We also want to preserve negations, as they have a great impact on the accuracy of the patterns. If a low-information gain word is negated, we replace it with a wildcard (*), which we found to be a good balance between these 2 goals. Each pattern p is associated with c such that a new sentence that matches p is classified as c. Textbox 2 describes the pattern extraction algorithm.

Textbox 2. Pattern extraction algorithm.
extract_patterns(t, c, W, m):
    P = set of patterns, initially empty
    S = empty stack
    for each combination C of words in W with |C| == min(|W|, m):
        S.push((t, C))
    while S is not empty:
        (t', C) = S.pop()
        t' = prune(t', C)
        n = root of t'
        while n == * and n has exactly 1 child:
            n = child of n
        t' = subtree of t' with root n
        remove each "*" node n' in t' with exactly 1 child c', and make the parent of n' the parent of c' with an indirect relation
        if (pattern(t'), c) is not in P:
            add (pattern(t'), c) to P
            for each combination C' of 2 or more non-* words in t':
                S.push((t', C'))
    return P

prune(t, W):
    t' = copy of t
    recursively prune from t' leaves that do not start with any word in W and are not in a negation relation
    for each node n in t':
        if n does not start with any word in W:
            n = *
    return t'

Details
The algorithm first creates a copy t' of t for each combination C of m words in W and pushes each (t', C) pair onto a stack. For each (t', C) popped from the stack, we execute the following steps:
1. Create initial subtree: Prune t' to keep only words in C, negations, and intermediate "*" nodes connecting them.
2. Remove unimportant nodes: Eliminate "*" nodes from t' starting with the root if it is a "*" node and has exactly 1 child (the child becomes the new root of t' and this repeats until the root no longer meets these criteria). Subsequently, remove each "*" node n' in t' with exactly 1 child and add an indirect relation edge from the parent of n' to the child of n'.
3. Add subpatterns: If (pattern(t'), c) is not already in P, add (pattern(t'), c) to the set of patterns P, and then push (t', C') onto the stack for each combination C' of 2 or more non-* words in t'.
The algorithm then moves on to the next item on the stack. Once the stack is empty, we return the resulting set of patterns and their associated class labels.
The prune(t, W) procedure recursively removes leaf nodes that do not start with any word in W and are not in a negation relation with their parents. Intermediate nodes that connect the remaining nodes and do not start with any word in W are replaced by *. The pattern(t) procedure converts a dependency tree t to its semgrex format representation. Each "*" node is represented by an empty node {}, and most relations are represented by the generic > or >> relations (for direct and indirect relations, respectively), which match any type of relation. An exception to this is the negation relation, which is preserved in the semgrex pattern as the >neg token.
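The prune and pattern procedures can be sketched on a toy tree as follows. This is a simplified illustration under several assumptions: words are compared by exact match rather than by prefix, the step that collapses single-child "*" nodes into indirect (>>) relations is omitted, and the parse of the sample sentence is hand-written rather than parser output.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    word: str
    rel: str = "root"  # relation between this node and its parent
    children: List["Node"] = field(default_factory=list)


def prune(node: Node, keep: Set[str]) -> Optional[Node]:
    """Drop leaves that are neither kept words nor negations; wildcard other words."""
    kids = [p for p in (prune(child, keep) for child in node.children) if p is not None]
    important = node.word in keep or node.rel == "neg"
    if not kids and not important:
        return None  # an unimportant leaf is removed
    word = node.word if node.word in keep else "*"
    return Node(word, node.rel, kids)


def pattern(node: Node) -> str:
    """Serialize a pruned tree in semgrex-like syntax, preserving only >neg relations."""
    text = "{}" if node.word == "*" else "{word:%s}" % node.word
    for child in node.children:
        op = ">neg" if child.rel == "neg" else ">"
        sub = pattern(child)
        if child.children:
            sub = "(" + sub + ")"  # group a child that has its own relations
        text += " %s %s" % (op, sub)
    return text


# Assumed parse of "there are never long wait times".
tree = Node("are", "root", [
    Node("there", "expl"),
    Node("never", "neg"),
    Node("times", "nsubj", [Node("long", "amod"), Node("wait", "compound")]),
])
result = pattern(prune(tree, {"long", "wait"}))
print(result)
```

The leaf there is removed, never survives as a wildcard because of its negation relation, and are and times become wildcards that connect the kept words.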

Example
Consider a sentence from the doctor review dataset class $c_8$ (wait time), "I arrived to my appointment on time and waited in his waiting room for over an hour," which has class label $c_8^y$ (long wait). The dependency tree generated from this sentence is shown in Figure 3.
Pattern 1 means that some node has a direct descendant time and an indirect descendant hour. Pattern 2 means that time is a direct descendant of arrived. Pattern 3 means that some node has 2 direct descendants; 1 is time and the other is some other node that has direct descendants room and hour. Finally, pattern 4 means that hour is an indirect descendant of arrived.

Classifiers Employed
We consider 3 types of classifiers:
1. Statistical bag-of-words classifiers, which view the documents as bags of keywords:
• Random Forests (RF): RF, as implemented in Scikit-learn by Pedregosa et al [35]. Documents are represented with TF-IDF using n-grams of 1 to 3 words, a minimum document frequency of 3%, up to 1000 features, stemming, and omission of stop words. The classifier uses 2000 trees. All other parameters are given their default values from [35].
• SVM: C-support vector classifier as implemented in Scikit-learn by Pedregosa et al [35], which is based on the implementation from the study by Chang and Lin [36]. Documents are represented with TF-IDF using the same parameters as with random forest. The parameters for the classifier are given their default values from Scikit-learn by Pedregosa et al [35].

2. Deep learning classifiers:
• CNN or CNN-W (CNN with Word2Vec): We use 2 variants of the CNN implementation by Britz [37]. Both use the default parameters. The first variant is initialized with a random uniform distribution, as in the CNN implementation by Britz [37]. The second is initialized with values from the Word2Vec model implementation from Gensim by Rehurek and Sojka [38].
• D2V-NN (Doc2Vec Nearest Neighbor): A nearest neighbor classifier that uses the Doc2Vec model [39] implementation from Gensim by Rehurek and Sojka [38]. Documents are converted to paragraph vectors and classified according to the nearest neighbor using cosine similarity as the distance function.
For CNN-W and D2V-NN, the Word2Vec and Doc2Vec models, respectively, are trained on an unlabeled set of 8,977,322 sentences from the collected doctor reviews that were not used to create the labeled dataset.
3. NLP classifiers, which exploit the dependency trees of a review's sentences:
• Matsumoto: We implemented the method described in the study by Matsumoto et al [21] using the best-performing combination of features from their experiment using the Internet Movie Database dataset from the study of Pang and Lee [40], that is, unigrams, bigrams, frequent subsequences, and lemmatized frequent subtrees. For POS tagging before the step in frequent subsequence generation that splits sentences into clauses, our implementation uses the Stanford parser [41]. We use the dependency parser by Chen and Manning [34] to generate dependency trees for frequent subtree generation. For the SVM, we use the implementation from Pedregosa et al's Scikit-learn with a linear kernel and all other parameters given their default values from [35]. All parameters related to frequent subsequence and subtree generation are the same as described in the study by Matsumoto et al [21].
• DTC: As described in the Methods section.
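As an illustration of the nearest neighbor step used by D2V-NN, the sketch below classifies a vector by the label of its most similar labeled vector under cosine similarity. The 3-dimensional vectors are made up for the example and are not Doc2Vec output; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbor_label(
    query: Sequence[float], labeled: List[Tuple[Sequence[float], str]]
) -> str:
    """Label of the training vector most similar to query."""
    best_vector, best_label = max(labeled, key=lambda item: cosine_similarity(query, item[0]))
    return best_label


# Made-up paragraph vectors standing in for Doc2Vec embeddings of training sentences.
labeled_vectors = [([1.0, 0.0, 0.0], "y"), ([0.0, 1.0, 0.0], "x"), ([0.0, 0.0, 1.0], "0")]
label = nearest_neighbor_label([0.9, 0.1, 0.0], labeled_vectors)
```

Cosine similarity ignores vector length, so only the direction of the paragraph vectors affects which neighbor is chosen.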

Variants of Dependency Tree Classifier
We consider the following variants of our DTC text classifier:
• DTC: As described above, with sentences not matching any pattern classified as the most common class label in the training data.
• DTC RF: Sentences not matching any pattern are classified by a random forest classifier trained on the training data for each class.
• DTC CNN-W: Sentences not matching any pattern are classified by a CNN-W text classifier (as defined above) trained on the training data for each class.
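The three variants differ only in how they handle sentences that match no pattern. The following minimal sketch shows that control flow. It assumes that the highest-ranked matching pattern determines the label; the string predicates are hypothetical stand-ins for the dependency tree patterns, and all function names are illustrative.

```python
from collections import Counter


def majority_label_fallback(train_labels):
    """Fallback of the plain DTC variant: always predict the most common label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda sentence: most_common


def classify_with_fallback(sentence, ranked_patterns, fallback):
    """Label a sentence with the first matching pattern, else use the fallback.

    ranked_patterns: list of (predicate, label) pairs, best-ranked first.
    fallback: callable mapping a sentence to a label, for example the
        prediction function of a trained RF or CNN-W classifier.
    """
    for matches, label in ranked_patterns:
        if matches(sentence):
            return label
    return fallback(sentence)


# Hypothetical keyword predicates standing in for dependency tree patterns.
ranked_patterns = [
    (lambda s: "never" in s and "wait" in s, "x"),
    (lambda s: "long wait" in s, "y"),
]
fallback = majority_label_fallback(["0", "0", "x", "0", "y"])

label_a = classify_with_fallback("never have to wait long", ranked_patterns, fallback)
label_b = classify_with_fallback("there was a long wait", ranked_patterns, fallback)
label_c = classify_with_fallback("the office is nice", ranked_patterns, fallback)
```

Swapping `fallback` for another classifier's prediction function gives the analogue of DTC RF or DTC CNN-W without changing the pattern-matching step.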

Experiments
We performed experiments with the classifiers on each class of the doctor review dataset using 10-fold cross-validation. To evaluate their performance, we use weighted accuracy. For a trained classifier C and dataset D of class c i, we define this as shown below.
Accuracy c (C, D) is the ratio of sentences in D with class label c that were classified correctly by C. As before, |c i | is 3, the number of class labels in class c i . We use weighted accuracy in our experiments as it places more importance on less frequent class labels, whereas regular accuracy is often above 90% because of the high number of instances labeled c i 0 for each c i .
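To make the difference between the 2 measures concrete, the following minimal sketch assumes that weighted accuracy is the unweighted mean of the per-label accuracies Accuracy c (C, D) over the |c i | = 3 class labels. This reading is an assumption based on the definitions in the text; the label strings and the toy data are hypothetical.

```python
def per_label_accuracy(gold, predicted, label):
    """Fraction of sentences whose gold label is `label` that were predicted as `label`."""
    indices = [i for i, g in enumerate(gold) if g == label]
    if not indices:
        return None  # label absent from this dataset
    correct = sum(1 for i in indices if predicted[i] == label)
    return correct / len(indices)


def weighted_accuracy(gold, predicted, labels=("0", "x", "y")):
    """Assumed definition: mean of per-label accuracies over the class labels."""
    scores = [per_label_accuracy(gold, predicted, label) for label in labels]
    # Assumption: labels that never occur in the gold data are skipped.
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores)


def regular_accuracy(gold, predicted):
    """Fraction of all sentences classified correctly."""
    return sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)


# Toy data: 8 of 10 sentences carry the "no mention" label "0".
gold = ["0"] * 8 + ["x", "y"]
always_zero = ["0"] * 10  # a classifier that ignores the rare labels

acc = regular_accuracy(gold, always_zero)
wacc = weighted_accuracy(gold, always_zero)
```

In this toy example, a classifier that always predicts the frequent label scores 0.8 on regular accuracy but only one-third on weighted accuracy, because it is correct on 1 of the 3 labels.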
The results of our experiments are shown below. In Table 3, we see that DTC CNN-W has better weighted accuracy than at least 4 baselines in each class. On average, it performs 2.19% better than the second-best method, the Matsumoto classifier ([57.05%-55.83%]/55.83%=2.19%). We also observe that both the deep learning classifiers (CNN, CNN-W, and D2V-NN) and NLP classifiers (Matsumoto and DTC variants) tend to perform better than the bag-of-words classifiers (RF and SVM). This is expected as the deep learning and NLP classifiers take advantage of information in sentences such as word order and syntactic structure that cannot be expressed by a bag-of-words vector.
Next, we further examine the performance of the top 3 classifiers: CNN-W, Matsumoto, and DTC CNN-W. Table 5 shows the ratio of review sentences classified as c i x or c i y (ie, a classifier predicted their class labels as c i x or c i y) that were classified correctly, that is, the precision of these predictions. By this measure, DTC CNN-W performs poorly compared with CNN-W and Matsumoto. Although the DTC algorithm's semgrex patterns classify more sentences as c i x or c i y, many of these classifications are incorrect. In the next section, we discuss reasons for some of these misclassifications.
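The measure reported in Table 5 can be sketched as follows. The code pools the 2 specific labels into a single ratio, which is one reading of the description above; the label strings and both toy prediction lists are hypothetical.

```python
def specific_label_precision(gold, predicted, specific=("x", "y")):
    """Among sentences predicted as a specific label, the fraction predicted correctly."""
    flagged = [(g, p) for g, p in zip(gold, predicted) if p in specific]
    if not flagged:
        return None  # the classifier never predicted a specific label
    return sum(1 for g, p in flagged if g == p) / len(flagged)


gold = ["0", "x", "y", "0", "x"]
cautious = ["0", "x", "0", "0", "0"]  # predicts a specific label once
eager = ["x", "x", "y", "y", "x"]     # predicts a specific label every time

precision_cautious = specific_label_precision(gold, cautious)
precision_eager = specific_label_precision(gold, eager)
```

The eager toy classifier recovers every x and y sentence but also mislabels both "0" sentences, so its ratio is lower. This is the same trade-off described for the semgrex patterns: more sentences receive a specific label, and more of those labels are wrong.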

Anecdotal Examples
In this section, we show some specific patterns generated by our algorithm along with some actual review sentences that match these patterns. One such pattern, generated from the doctor review dataset, consists of a node that has 2 descendants: another generic node in a direct negation relation and the word wait in an indirect relation. The word wait has 1 direct descendant, the word long. The following is an example of a correctly matched sentence: "You are known by name and never have to wait long." This is an incorrectly matched one: "As a patient, I was not permitted to complain to the doctor about the long wait, placed on hold and never coming back to answer call." This sentence contains the words long and wait, as well as a negation (the word never); however, the negation is not semantically related to the long wait the author mentioned. Providing additional training data to the classifier may prevent such misclassifications by finding a pattern (or improving the rank of an existing pattern) that more appropriately makes such distinctions.
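The structure of this pattern can be sketched as a small tree matcher. The sketch below is not the semgrex engine used by our algorithm: the Node class, the relation names (neg, conj, xcomp, advmod, nsubj, cop), and the hand-built trees are assumptions made for illustration and are not actual parser output.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    word: str
    children: list = field(default_factory=list)  # list of (relation, Node) pairs

    def add(self, relation, child):
        """Attach a child under the given dependency relation and return it."""
        self.children.append((relation, child))
        return child


def descendants(node):
    """Yield every node below `node`, at any depth."""
    for _, child in node.children:
        yield child
        yield from descendants(child)


def matches_negated_long_wait(root):
    """True if some node has a direct 'neg' child and a descendant 'wait'
    that itself has the direct child 'long'."""
    for node in [root, *descendants(root)]:
        if not any(rel == "neg" for rel, _ in node.children):
            continue
        for d in descendants(node):
            if d.word == "wait" and any(c.word == "long" for _, c in d.children):
                return True
    return False


# Hand-built partial tree for "... and never have to wait long" (hypothetical parse).
known = Node("known")
have = known.add("conj", Node("have"))
have.add("neg", Node("never"))
wait = have.add("xcomp", Node("wait"))
wait.add("advmod", Node("long"))

# Hand-built tree for "The wait was long" (hypothetical parse, no negation).
long_root = Node("long")
long_root.add("nsubj", Node("wait"))
long_root.add("cop", Node("was"))

is_match = matches_negated_long_wait(known)
is_other_match = matches_negated_long_wait(long_root)
```

The matcher checks only tree shape and word identity, so any negation attached to an ancestor of wait satisfies it. This is how a sentence whose negation is unrelated to the wait can still match.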

Limitations
In addition to the incorrect handling of negation described above, another limitation of our algorithm is that some sentences of a particular class can be sufficiently similar to sentences from another class, which may lead to misclassifications. Some examples of this can be seen in class c 6 (staff). Specifically, some sentences referring to a doctor (rather than staff members) were incorrectly classified as c 6 x (good staff) or c 6 y (bad staff).
Examples include "Dr. Fang provides the very best medical care available anywhere in the profession" and "Dr. Overlock treated me with the utmost respect," which clearly refer to doctors rather than staff and should have been classified as c 6 0 (no mention of staff). The DTC algorithm generated some patterns for c 6 x that focus on positive statements about a person but miss the requirement that this person be a staff member. In the case of the above sentences, they were matched by

Conclusions
In this paper, we study the doctor review classification problem. We evaluate several existing classifiers and 1 new classifier. A key challenge of the problem is that features may be complex entities, for which polarity is not necessarily compatible with traditional positive or negative sentiment. Our proposed classifier, DTC, uses dependency trees generated from review sentences and automatically generates patterns that are then used to classify new reviews. In our experiments on a real-world doctor review dataset, we found that DTC combined with a CNN-W fallback (DTC CNN-W) achieved the best average weighted accuracy, 2.19% better than the second-best method (Matsumoto), although many of its predictions of specific opinions were incorrect. Future work may build upon the DTC classifier by also incorporating other NLP structures, such as discourse trees [42], to better capture the semantics of the reviews.