This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
There is no publicly available resource that provides the relative severity of adverse drug reactions (ADRs). Such a resource would be useful for several applications, including assessment of the risks and benefits of drugs and improvement of patient-centered care. It could also be used to triage predictions of drug adverse events.
The intent of the study was to rank ADRs according to severity.
We used Internet-based crowdsourcing to rank ADRs according to severity. We assigned 126,512 pairwise comparisons of ADRs to 2589 Amazon Mechanical Turk workers and used these comparisons to rank order 2929 ADRs.
There is good correlation (rho=.53) between the mortality rates associated with ADRs and their rank. Our ranking highlights severe drug-ADR predictions, such as cardiovascular ADRs for raloxifene and celecoxib. It also triages genes associated with severe ADRs, such as epidermal growth factor receptor (EGFR), associated with glioblastoma multiforme, and SCN1A, associated with epilepsy.
ADR ranking lays a first stepping stone in personalized drug risk assessment. Ranking of ADRs using crowdsourcing may have useful clinical and financial implications, and should be further investigated in the context of health care decision making.
Pharmacovigilance plays a crucial role in the continuing evaluation of drug safety. Adverse drug reactions (ADRs) contribute to excess length of hospitalization time, extra medical costs, and attributable mortality [
Ranking large sets of ADRs is challenging; theoretical analyses have provided a framework for such evaluations [
Our goal was to rank the ADRs by severity from a population (non-expert, non-clinician) perspective. We ranked a list of 2929 ADRs by assigning 126,512 ADR pairwise comparisons to 2589 individuals and processing the comparisons with an optimization algorithm to rank the ADR severities.
ADRs are reported in drug labels following clinical trials. Additional drug-ADR associations can be inferred either empirically, through reporting systems such as the US Food and Drug Administration (FDA) Adverse Events Reporting System (AERS), or based on computational predictions (using drug similarity [
ADRs were retrieved from the “SIDER2” side effect database (October 2012 version, listing a total of 4192 ADRs) [
The AERS data files were downloaded from the FDA website [
Semantic similarity between ADRs was computed using the Human Phenotype Ontology (HPO) [
MTurk is a platform for task creation, labor recruitment, and compensation. “Requesters” create and publish “human intelligence tasks” (HITs) and “workers” complete these tasks. Tasks can be completed on a computer, typically take a short time, and carry a correspondingly small compensation. Prior to posting a task, the requester sets the compensation amount (Amazon charges an additional 10% commission). Workers can browse and choose from available tasks and are paid upon successful completion of each task. Requesters can also reject subpar work. In this case, the rejected workers do not receive payment, and the rejection negatively affects their record, as requesters may limit their tasks to workers with low rejection rates.
We retrieved a set of 2929 common ADRs (expressed in the MedDRA terminology) from drug labels, as represented in the SIDER2 database [
The workers were required to possess satisfactory task completion records (rejected in fewer than 5% of past tasks, ie, an approval rate of at least 95%) and to be located in the United States, as a proxy for English proficiency. In order to identify reliable workers, each worker task of 10 pairs included three pre-defined quality control pairs with expected answers and seven randomly chosen pairs. These quality control pairs were constructed by pairing all combinations from a manually selected set of severe ADRs and a set of mild ADRs.
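The task layout described above can be sketched in a few lines of Python. This is a minimal illustration rather than the study's actual code: the function names, the example ADR sets, and the pass criterion (every quality control pair answered as expected) are assumptions made for the sketch.

```python
import itertools
import random


def build_quality_control_pairs(severe_adrs, mild_adrs):
    """Pair every severe ADR with every mild ADR as (expected_worse, expected_milder)."""
    return list(itertools.product(severe_adrs, mild_adrs))


def build_task(random_pairs, qc_pairs, rng, n_random=7, n_qc=3):
    """Assemble one 10-pair task: 7 random pairs plus 3 quality control pairs."""
    task = rng.sample(random_pairs, n_random) + rng.sample(qc_pairs, n_qc)
    rng.shuffle(task)  # hide which pairs are quality control pairs
    return task


def passes_quality_control(answers, qc_pairs):
    """answers maps a pair to the ADR that the worker judged worse.

    Assumed criterion: every quality control pair the worker saw must be
    answered with the expected (severe) ADR.
    """
    return all(answers[pair] == pair[0] for pair in qc_pairs if pair in answers)


# Hypothetical example sets; the manually selected sets are not listed in the text.
severe = ["Cardiac arrest", "Coma", "Multi-organ failure"]
mild = ["Dry mouth", "Chapped lips"]
qc_pairs = build_quality_control_pairs(severe, mild)
candidate_pairs = [(f"ADR {i}", f"ADR {i + 1}") for i in range(0, 40, 2)]
task = build_task(candidate_pairs, qc_pairs, random.Random(0))
```

Shuffling the task keeps the quality control pairs indistinguishable from the random ones, which is what makes them usable as a reliability check.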
Using the pre-defined set of quality control comparisons, we removed inconsistent workers who did not answer these appropriately, resulting in 124,513 usable pairwise comparisons (57,901 unique comparisons, multiple comparisons were made for consistency evaluation, see
In construction of the pairwise comparisons, we took the following measures in order to maximize the tested pairs and reduce as much as possible potential biases: (1) the tasks were distributed on different weekdays over a period of 1 month, and (2) using an initial crude ranking computed from the first batch of comparisons, we randomly selected the ADR pairs that were not too easy (comparing a severe and a mild ADR) or equivalent (ADRs with very close ranks), as equivalent ADRs are harder to compare and have the potential to frustrate the MTurk workers in being forced to choose.
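The method used for the initial crude ranking is not specified, so the following sketch substitutes a simple stand-in: each ADR is scored by the fraction of its comparisons in which it was judged worse, and candidate pairs are kept only when their score gap is neither very small (near-equivalent) nor very large (too easy). The thresholds `min_gap` and `max_gap` are hypothetical values chosen for illustration.

```python
from collections import defaultdict


def crude_scores(comparisons):
    """comparisons: iterable of (worse, milder) judgments.

    Returns, for each ADR, the fraction of its comparisons in which it was
    judged the worse of the two (a simple win-rate score).
    """
    judged_worse = defaultdict(int)
    total = defaultdict(int)
    for worse, milder in comparisons:
        judged_worse[worse] += 1
        total[worse] += 1
        total[milder] += 1
    return {adr: judged_worse[adr] / total[adr] for adr in total}


def informative_pairs(scores, min_gap=0.1, max_gap=0.6):
    """Keep pairs whose score gap is neither near zero nor very large."""
    adrs = sorted(scores)
    return [
        (a, b)
        for i, a in enumerate(adrs)
        for b in adrs[i + 1:]
        if min_gap <= abs(scores[a] - scores[b]) <= max_gap
    ]


first_batch = [("Coma", "Blister"), ("Coma", "Dry mouth"), ("Blister", "Dry mouth")]
scores = crude_scores(first_batch)
candidates = informative_pairs(scores)
```

The pairs actually sent to workers would then be drawn at random from such a candidate list.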
A quality control batch of pairwise comparisons (14,645 pairs) was repeated three times to assess reproducibility. It was also constructed to maximize the number of pairs that can be tested for triangular inequality (ie, for ADRs A, B, and C, test A vs B, B vs C, and A vs C).
Each task, consisting of 10 pairwise comparisons, took 5 minutes to complete on average and paid the worker US $0.45 (half a dollar including Amazon’s fee). The entire ranking took 146 person-days at a cost of US $6,300. A more detailed description and worker statistics are found in
We formulated a linear programming scheme to compute a ranked list of the ADRs from the pairwise comparisons (illustrated in
The linear programming was implemented in MATLAB using IBM CPLEX package version 12.6 [
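For orientation only, one generic way to pose ranking from pairwise comparisons as a linear program is shown below. It is a textbook-style sketch and should not be read as the formulation used in this study. All symbols are assumptions introduced here: $s_i$ is the severity score of ADR $i$, $P$ is the set of ordered pairs $(i,j)$ in which $i$ was judged more severe than $j$, $w_{ij}$ is the number of workers giving that judgment, $\xi_{ij}$ is a slack variable penalizing a violated comparison, and $M$ is an arbitrary upper bound on the scores.

```latex
\begin{aligned}
\min_{s,\,\xi}\quad & \sum_{(i,j)\in P} w_{ij}\,\xi_{ij} \\
\text{subject to}\quad & s_i - s_j + \xi_{ij} \ge 1, && (i,j)\in P, \\
& \xi_{ij} \ge 0, && (i,j)\in P, \\
& 0 \le s_i \le M, && i = 1,\dots,n.
\end{aligned}
```

Sorting the ADRs by the optimal $s_i$ then yields a ranking that violates as few weighted comparisons as the margin constraints allow.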
MTurk task construction (A) and ranking process (B). (A) A random list of pairwise comparisons and a list of predefined quality control pairs are constructed (1). Each worker receives a unique set of 7 random ADR pairs to compare and 3 quality control pairs for performance evaluation (2). Results are collected and merged (3). (B) Ranked pairs are sampled (1) and sent to a linear programming task (2), and the ranking of each sample is merged into a global ranking (3).
We estimated the consistency of pairwise comparisons using a batch of comparisons that was constructed for quality control purposes. It was repeated three times and included multiple ADR triplets that were tested for triangular relationships. Specifically, for each ADR (A), we included 10 comparisons that formed 10 testable triangular relations (ie, for ADRs B and C, we included the three comparisons A vs B, A vs C, and B vs C).
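The triangular test can be illustrated with a short sketch. It assumes that each compared pair has already been reduced to a single direction (for example by majority vote, which is an assumption here) and counts a triplet as a violation when its three judgments form a cycle. The function name and input layout are hypothetical.

```python
from itertools import combinations


def count_triangle_violations(worse_than):
    """worse_than: set of (x, y) meaning x was judged worse than y.

    Returns (violations, testable), where testable counts the triplets for
    which all three pairs were compared, and violations counts the cyclic
    ones (A worse than B, B worse than C, C worse than A).
    """
    adrs = sorted({adr for pair in worse_than for adr in pair})

    def compared(x, y):
        return (x, y) in worse_than or (y, x) in worse_than

    violations = testable = 0
    for a, b, c in combinations(adrs, 3):
        if not (compared(a, b) and compared(b, c) and compared(a, c)):
            continue
        testable += 1
        cycle_one = {(a, b), (b, c), (c, a)} <= worse_than
        cycle_two = {(b, a), (c, b), (a, c)} <= worse_than
        if cycle_one or cycle_two:
            violations += 1
    return violations, testable


judgments = {
    ("A", "B"), ("B", "C"), ("A", "C"),  # consistent triplet
    ("D", "E"), ("E", "F"), ("F", "D"),  # cyclic triplet
}
result = count_triangle_violations(judgments)
```

The loop over all triplets is cubic in the number of ADRs, so a real analysis would enumerate only the triplets that were deliberately included in the batch.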
We tested the reproducibility of the ranking across the three repeated batches. Only 16% of the workers participated in more than one of the repeated batches (13% in two batches, and 3% in all three batches).
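Reproducibility across batches is summarized in the Results with the Spearman correlation coefficient. A dependency-free sketch of that statistic is given below, computed as the Pearson correlation of average ranks over the ADRs shared by two batches. The dictionaries of per-batch scores are hypothetical inputs, not the study data.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(scores_a, scores_b):
    """Spearman correlation over the ADRs present in both score dictionaries."""
    shared = sorted(set(scores_a) & set(scores_b))
    ranks_a = average_ranks([scores_a[adr] for adr in shared])
    ranks_b = average_ranks([scores_b[adr] for adr in shared])
    return pearson(ranks_a, ranks_b)


batch_1 = {"Coma": 0.9, "Blister": 0.2, "Dry mouth": 0.1}
batch_2 = {"Coma": 0.8, "Blister": 0.3, "Dry mouth": 0.2}
rho = spearman(batch_1, batch_2)
```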
We counted the number of reports associated with an ADR in the AERS and the number of reports specifying one of the six outcomes (death, disability, life-threatening, required intervention to prevent permanent impairment/damage, hospitalization, and congenital anomaly). The rate of each outcome per ADR is the number of reports with that outcome divided by the total number of reports for that ADR, including reports with a non-specific outcome tagged as “other serious” (25%) and reports with no outcome specified (20%).
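The rate definition above amounts to a simple ratio, sketched below. The input layout is a simplifying assumption: each record is a single (ADR, outcome) pair, whereas real AERS reports can list several ADRs and outcomes. The key point the sketch preserves is that non-specific and missing outcomes still count toward the denominator.

```python
from collections import Counter, defaultdict

OUTCOMES = (
    "death",
    "disability",
    "life-threatening",
    "required intervention",
    "hospitalization",
    "congenital anomaly",
)


def outcome_rates(reports):
    """reports: iterable of (adr, outcome).

    outcome is one of OUTCOMES, "other serious", or None when unspecified.
    Returns {adr: {outcome: rate}} with all reports in the denominator.
    """
    totals = Counter()
    by_outcome = defaultdict(Counter)
    for adr, outcome in reports:
        totals[adr] += 1  # includes "other serious" and unspecified reports
        if outcome in OUTCOMES:
            by_outcome[adr][outcome] += 1
    return {
        adr: {outcome: by_outcome[adr][outcome] / totals[adr] for outcome in OUTCOMES}
        for adr in totals
    }


example_reports = [
    ("Coma", "death"),
    ("Coma", "hospitalization"),
    ("Coma", "other serious"),
    ("Coma", None),
]
rates = outcome_rates(example_reports)
```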
In order to extract the major outcomes associated with the severity ranking, we used the lasso regression method [
We ranked a set of 2929 common ADRs from the SIDER2 database [
Top- and bottom-ranked ADRs.
Rank | Top-ranked severe ADRs | Rank | Bottom-ranked mild ADRs |
1 | Cardiac arrest | 2910 | Growth of eyelashes |
2 | Bone cancer metastatic | 2911 | Eye rolling |
3 | Left ventricular failure | 2912 | Night sweats |
4 | HIV infection^a | 2913 | Chapped lips |
5 | Anal cancer | 2914 | Nasal congestion |
6 | Lung cancer metastatic | 2915 | Agitation |
7 | Hemorrhage intracranial | 2916 | Excitability |
8 | Chronic myeloid leukemia | 2917 | Breath odor |
9 | Coma | 2918 | Hair growth abnormal |
10 | Breast cancer | 2919 | Hot flush |
11 | Multi-organ failure | 2920 | Sleep talking |
12 | Cardiopulmonary failure | 2921 | Blister |
13 | Cardiac death | 2922 | Tongue dry |
14 | Chronic leukemia | 2923 | Moaning |
15 | Cardio-respiratory arrest | 2924 | Discomfort |
16 | Pulmonary embolism | 2925 | Decreased appetite |
17 | Completed suicide | 2926 | Dry mouth |
18 | Metastatic renal cell carcinoma | 2927 | Early morning awakening |
19 | Hepatic angiosarcoma | 2928 | Euphoric mood |
20 | Anaplastic thyroid cancer | 2929 | Elevated mood |
^a HIV: human immunodeficiency virus. HIV infection, while not caused by a drug, is associated with several drugs in SIDER.
We estimated the consistency of pairwise comparisons by repeating a quality control batch of pairwise comparisons three times. The batch included multiple ADR triplets that were tested for triangular relationships. Only 10% (SD 0.3%) of these ADR triplets violated the triangular inequalities (total of 23,071-26,245 triplets in each batch repeat, variation is due to exclusion of workers judged inconsistent on pre-defined quality control pairwise comparisons).
We next tested the reproducibility of the ranking across the three repeated batches (see Methods). Among pairs compared by at least three different workers from these three duplicate batches, 58% had full agreement. Although full agreement was limited to 58% of these pairs, the Spearman correlation coefficient between the rankings independently computed from the three duplicate batches was .71 (SD .009,
Finally, ADRs sharing high semantic similarity exhibited smaller difference in their severity ranks (Pearson correlation ρ=−.94,
Correspondence between duplicate quality control batches. Ranking correlation between duplicate batches 1-3 (A-C) and a box-plot of the standard deviation in rank scores across the 3 batches as a function of the score (D).
AERS contains reports on adverse events submitted to the FDA. Some of the reports include a specific outcome of the ADR (55% of the reports including ADRs in our set). These specific outcomes are death, disability, life-threatening, required intervention to prevent permanent impairment/damage, hospitalization, and congenital anomaly. We found a significant correlation between the relative death rate in AERS reports (ie, the relative number of deaths out of all ADR reports) and our severity rank for the ADR (ρ=.53,
Correlation between ADR rank and outcomes. Severe ADRs tend to have a significantly higher death rate (A). ADR rank shows moderate correlation with the life-threatening (B) and hospitalization (C) outcomes, and negligible correlation with congenital anomaly (D), required intervention to prevent permanent impairment/damage (E), and disability (F).
Term clouds for ADRs ranked above the 95th percentile (A) and below the 5th percentile (B). Term size is proportional to the relative number of reports in the FDA AERS.
Risk assessment for a drug depends on the severity of its associated ADRs and on their frequency in the population. In order to evaluate the reliability of ADR frequencies, we surveyed drug labels for 65 severe and frequent drug-ADR associations, where we define severe ADRs as those ranked above the 95th percentile and frequent drug-ADR associations as those reported with a frequency greater than 1% in the SIDER database. The frequency information in those labels was largely insufficient to estimate the marginal frequency above a control (ie, a placebo). Only two associations (3%) were compared to a control group that underwent a procedure (orchiectomy) instead of receiving a different drug. The reported frequency was significantly higher than in that control (5% occurrence for congestive cardiac failure and chronic obstructive pulmonary disease after administration of Zoladex, vs 1% for the control,
We associated ADRs with a set of therapeutic drug classes by aggregating the drug-ADR associations according to therapeutic class, as defined by the second level of the drug Anatomical Therapeutic Chemical (ATC) Classification System. We counted the number of different severe ADRs per drug as mapped in SIDER. Aggregated across the ATC classes, we identified classes with high variability among drugs in terms of the number of associated severe ADRs (
Severity of ATC classes. Box plot of ATC class severity, measured by the number of severe ADRs in each class (severe defined as ranked above the 95th percentile), and the percentage of drugs with a black box warning in that class. Only classes that include more than 2 drugs with ADR information and have more than 3 severe ADRs are displayed.
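The aggregation behind this figure can be sketched as follows, under stated assumptions. ADRs above the 95th percentile of the ranking are treated as severe, each drug is assigned to its ATC second-level class (the first three characters of its ATC code), and the number of distinct severe ADRs is counted per drug. The drug and ADR names are placeholders, and the input dictionaries are a hypothetical layout rather than the SIDER format.

```python
from collections import defaultdict


def severe_adr_set(adr_percentile, cutoff=95):
    """Return the ADRs whose rank percentile lies above the cutoff."""
    return {adr for adr, percentile in adr_percentile.items() if percentile > cutoff}


def severe_counts_by_atc_class(drug_adrs, drug_atc, severe):
    """Return {ATC level-2 code: [number of distinct severe ADRs per drug]}."""
    by_class = defaultdict(list)
    for drug, adrs in drug_adrs.items():
        atc_class = drug_atc[drug][:3]  # ATC second level, eg "C10"
        by_class[atc_class].append(len(set(adrs) & severe))
    return dict(by_class)


# Placeholder data for illustration only.
adr_percentile = {"adr_1": 99, "adr_2": 96, "adr_3": 50}
drug_adrs = {
    "drug_a": ["adr_1", "adr_3"],
    "drug_b": ["adr_1", "adr_2"],
    "drug_c": ["adr_3"],
}
drug_atc = {"drug_a": "C10AA01", "drug_b": "C10AA05", "drug_c": "N02BE01"}

severe = severe_adr_set(adr_percentile)
counts = severe_counts_by_atc_class(drug_adrs, drug_atc, severe)
```

The per-class lists are what a box plot would summarize, so classes with a wide spread correspond to high variability among their member drugs.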
A recent study predicted drug-ADR associations using a statistical analysis of AERS (438,801 drug-ADR pairs [
During drug development it is useful to identify genes and pathways that are associated with ADRs; it may be even more useful to quantitatively compare these using our severity ranking. Accordingly, we used gene-ADR associations assembled from literature [
Genes reported to be associated with severe adverse drug reactions (ADRs) ranked in the top 10 percentiles.
Gene | ADR (Percentile) | Reference |
EGFR | Glioblastoma multiforme (95) | [ |
SCN1A | Epilepsy (93) | [ |
VDR | Chronic renal failure (91) | [ |
TNF | Multiple sclerosis (91) | [ |
RYR1 | Malignant hyperthermia (90) | [ |
We ranked the severity of 2929 ADRs using a crowdsourcing platform. This ranking helps highlight drug classes based on the severity of their associated ADRs, triage predicted drug-associated ADRs for further investigation, and associate genes with a severity score based on their association with ADRs, with some implications for drug design. Although our ranking is consistent and reproducible, we cannot claim that it is optimal. A broader sampling of the potential ADR space (perhaps including professionals and patients who have experienced these effects) or a more sophisticated ranking method might improve the quality of the ranking. We include the raw pairwise comparison data (
Our ranking is based on a non-expert and inexperienced understanding and interpretation of ADR severity. Our analysis includes both point events and interval events, and these were compared without (1) reference to their different time courses, or (2) variations in severity between different instances of the same ADR—the MTurk workers were simply asked to decide if one ADR was better or worse than another, integrating all considerations. The high performance on the quality control ADR pairs (marked in
As mentioned above, we identified some ADRs with discordance between our estimated severity and their mortality rate in the AERS reports. There are two possible reasons for such discrepancies: (1) a misunderstanding by laymen of the true severity of an ADR (eg, the word “cancer” may get a high ranking, regardless of its survival statistics), and/or (2) a bias in the associated death rates in the AERS system. We are unable to distinguish between these, and it is likely that both contribute, highlighting areas for potential improvement.
There is no correlation between our ADR ranking and the outcome rates of disability, required intervention to prevent permanent impairment/damage, or congenital anomaly. After manual examination of the ADRs with high rates for these three types of outcomes, we found that for the first two, disability and “required intervention”, a lack of context caused ADRs with high rates to be ranked as mild. For example, grimacing and rectal cramps are associated with a disability rate of more than 55%, and may be ADRs that frequently co-occur with disability. Similarly for “required intervention”, light anesthesia (>42% rate) and hyposmia (>25% rate) appear moderate without context. In the case of congenital anomaly, many of the anomalies are not life threatening and thus were ranked low (eg, supernumerary nipple, low set ears, or ear malformation).
Finally, we used the list of ADRs appearing in SIDER and the FDA AERS systems “as-is”. Some of the ADRs in our list may not be directly caused by drugs but are associated with drugs (eg, infections may be more frequent as a side effect of the drug, or may simply co-occur with diseases that the drug treats). We retained these ADRs, as they provide important insight regarding how individuals perceive their relative severity.
We highlight drug therapeutic classes that display large variability between their member drugs in terms of occurrences of severe ADRs, which suggests that vigilance is warranted regarding the effect of drug choice on ADR occurrence in patients. We also highlight genes associated with severe ADRs, which should be the subject of further investigation.
Among the potential applications for a ranked list of ADRs, we suggest that mapping these ADRs to drug-drug interactions could aid in reducing “alert fatigue” stemming from overly frequent alerts, which are often triggered by relatively mild events. This phenomenon may cause physicians to dismiss these alerts and could possibly be attenuated if the alerts focused mostly on major adverse events [
Finally, we focused on the severity of ADRs, but ADR frequency is also crucial for assessment of drug risk. These ADR frequencies require proper control to correct for background frequencies. Carefully constructed clinical trials that allow extracting statistically significant frequencies in a rigorous way should be given high priority.
We believe that our ranking of ADRs may have useful clinical and financial implications, and should be further investigated in the context of health care decision making.
An example of a comparison presented to an MTurk worker.
Table S1. The MTurk workers’ pairwise comparisons used to compute the ranking.
Supplementary methods, figures, and Multimedia Appendix legends.
Table S2. Ranked list of ADRs with their reported frequency.
Correlation between ADR semantic similarity and mean difference in severity scores, computed for 793 ADRs.
Table S3. Top prescribed drugs in 2013 that have novel severe ADRs in the OFFSIDES database.
Table S4. Genes and their most severe associated ADRs.
ADR: adverse drug reaction
AERS: Adverse Events Reporting System
ATC: Anatomical Therapeutic Chemical Classification System
FDA: Food and Drug Administration
HIT: human intelligence task
HIV: human immunodeficiency virus
HPO: Human Phenotype Ontology
MedDRA: Medical Dictionary for Regulatory Activities
MTurk: Amazon Mechanical Turk
We would like to thank Nir Ailon for helpful suggestions for the linear programming, Steve Bagley for supplying the LAERS files, and the thousands of Mechanical Turk workers. Funding for RBA and AG was provided by NIH LM05652, GM102365, and GM61374. MD is supported by NIH U54 HG004028. The study was approved by the Institutional Review Board of Stanford.
None declared.