Published on in Vol 22, No 3 (2020): March

Assessment of the Frequency of Online Searches for Symptoms Before Diagnosis: Analysis of Archival Data

Assessment of the Frequency of Online Searches for Symptoms Before Diagnosis: Analysis of Archival Data

Assessment of the Frequency of Online Searches for Symptoms Before Diagnosis: Analysis of Archival Data

Authors of this article:

Irit Hochberg1, 2 Author Orcid Image ;   Raviv Allon2 Author Orcid Image ;   Elad Yom-Tov3, 4 Author Orcid Image

Original Paper

1Institute of Endocrinology, Diabetes, and Metabolism, Rambam Health Care Campus, Haifa, Israel

2Bruce Rappaport Faculty of Medicine, Technion - Israel Institute of Technology, Haifa, Israel

3Microsoft Research, Herzeliya, Israel

4Faculty of Industrial Engineering and Management, Technion - Israel Institute of Technology, Haifa, Israel

Corresponding Author:

Elad Yom-Tov, PhD

Microsoft Research

13 Shenkar St

Herzeliya, 46733


Phone: 972 747111359


Background: Surveys suggest that a large proportion of people use the internet to search for information on medical symptoms they experience and that around one-third of the people in the United States self-diagnose using online information. However, surveys are known to be biased, and the true rates at which people search for information on their medical symptoms before receiving a formal medical diagnosis are unknown.

Objective: This study aimed to estimate the rate at which people search for information on their medical symptoms before receiving a formal medical diagnosis by a health professional.

Methods: We collected queries made on a general-purpose internet search engine by people in the United States who self-identified their diagnosis from 1 of 20 medical conditions. We focused on conditions that have evident symptoms and are neither screened systematically nor a part of usual medical care. Thus, they are generally diagnosed after the investigation of specific symptoms. We evaluated how many of these people queried for symptoms associated with their medical condition before their formal diagnosis. In addition, we used a survey questionnaire to assess the familiarity of laypeople with the symptoms associated with these conditions.

Results: On average, 15.49% (1792/12,367, SD 8.4%) of people queried about symptoms associated with their medical condition before receiving a medical diagnosis. A longer duration between the first query for a symptom and the corresponding diagnosis was correlated with an increased likelihood of people querying about those symptoms (rho=0.6; P=.005); similarly, unfamiliarity with the association between a condition and its symptom was correlated with an increased likelihood of people querying about those symptoms (rho=−0.47; P=.08). In addition, worrying symptoms were 14% more likely to be queried about.

Conclusions: Our results indicate that there is large variability in the percentage of people who query the internet for their symptoms before a formal medical diagnosis is made. This finding has important implications for systems that attempt to screen for medical conditions.

J Med Internet Res 2020;22(3):e15065



Online self-diagnosis of health conditions is a well-known phenomenon that has grown substantially with ease of access to medical information facilitated by the internet and mobile technologies [1,2]. A large survey found that more than one-third of Americans self-diagnose when they encounter a health problem [3], and another study indicated that about 70% of American adults consult the internet for a variety of medical information [4].

The prevalence of self-diagnosis is leading countries and large epidemiologic centers to use the available information for public health goals [5,6]. Epidemics such as influenza and dengue fever have been tracked by observing the number of people who query internet search engines for the symptoms of these diseases [7,8]. A recent study showed the potential of identifying serious medical conditions, such as cervical and ovarian cancers, from people’s searches on online search engines [9]. These results suggest that search data could be used as a novel screening tool.

Nevertheless, utilizing search engines as an effective screening tool requires an accurate characterization of how people use search engines for self-diagnosis. In addition, conditions need to be independently characterized to understand the type and number of people who are searching for information and the common words used for these searches. It is currently not known how commonly people conduct an online search for their condition before diagnosis by a health professional. Determining this will provide an important indication of the percentage of people for whom online data screening is applicable and the diseases for which internet-based screening is effective. The purpose of this study was to characterize the prevalence and content of searches made by users before a medical diagnosis by a health care professional.

In this work, we analyzed data from search engine users who self-identified their diagnosis of a medical condition and traced back the data to determine how many of these instances could have been predicted by an earlier search for the signs and symptoms of the condition by that same user.

In addition, to better understand our search data results, we analyzed the association between the frequency of internet inquiries and the population’s general knowledge regarding certain conditions and their symptoms. We hypothesized that people inquire more about symptoms that they do not recognize or cannot associate with a certain disease.

Search Data

We extracted all queries made on Microsoft Bing in English by people in the United States between May 1, 2017, and April 30, 2018. For each user, we recorded an anonymized username, the time and date of the query, and the text used in the query. We focused on 20 medical conditions that are known to have evident symptoms, not systematically screened, not usually diagnosed in asymptomatic individuals by usual medical tests, and generally diagnosed after the investigation of specific symptoms. To ensure statistical power and validity, we limited the analysis to conditions for which at least 75 people self-identified their condition. The 20 conditions used in this study are listed in Table 1.

The population of self-identifying users was defined as those people who made a diagnosis ascertainment query (DAQ), indicating that they had been formally diagnosed with 1 of the 20 conditions analyzed in this study (eg, “I was diagnosed with COPD” or “I have COPD”). Queries that indicated the possibility of such a condition (eg, “do I have COPD”) were excluded. Specifically, DAQs were defined as queries that matched the phrases “I have” or “diagnosed with” and the name of the condition and excluded queries that contained any of the phrases “do I have,” “can I have,” “I think I have,” “did I have,” “nurse,” “patient,” “cat,” “dog,” “wife,” “husband,” “son,” or “daughter.”

For each condition, we calculated the fraction of people who queried about a relevant symptom before their first mention of the condition in the DAQ. The list of relevant symptoms was defined by two authors (IH and RA, both medical doctors) and enhanced using the synonym list developed by Yom-Tov and Gabrilovich [10].

In addition, symptoms were mapped to their perceived Medical Severity Rank (MSR), a validated measure of their apparent importance to both medical specialists and laypeople [11]. MSR measures the urgency of a symptom as perceived by people, from a symptom that requires immediate urgent care (MSR=1) to one that can be disregarded (MSR=10).

DAQs do not usually provide an indication of when the diagnosis was made. To estimate whether the DAQs are typically made around the time of diagnosis or throughout a person’s illness, we assessed if the time of the DAQ corresponded closely to the time of the first queries for hospitals, medical centers, or clinics. This analysis was conducted for each user indicating a diagnosis of 1 of the 3 malignant conditions (endometrial cancer, esophageal cancer, and lymphoma), where a hospital visit is usually required soon after the initial diagnosis is made.

Table 1. Conditions analyzed, symptoms associated with them, and the percentage of people who asked about the symptoms before their first query indicating that they have the condition.
ConditionSymptomsPeople with symptomsa

Degenerative disc diseaseBack pain, leg weakness, leg pain, leg numbness, leg tingling, loss of bowel control, and loss of bladder control29/16517.6
Chronic obstructive pulmonary disorderChronic cough, shortness of breath, dyspnea, recurrent pneumonia, wheezing, and dystonia44/5388.4
MenopauseHot flash, night sweat, vaginal dryness, and alopecia34/30811.4
Heart failureShortness of breath, dyspnea, chronic cough, leg edema, leg swelling, rapid weight gain, and fatigue116/89612.9
GoutPain, tenderness, swelling, inflammation, and redness520/205225.34
Ulcerative colitisDiarrhea, abdominal pain, bloody bowel movement, rectal bleeding, tenesmus, lack of appetite, and fatigue70/32421.9
Bladder cancerBlood in urine, hematuria, blood clots in urine, pain or burning sensation during urination, frequent urination, and not able to pass urine33/30211.3
Parkinson diseaseTremor and bradykinesia144/36134.01
Endometrial cancerDischarge and bleeding25/11022.7
Crohn diseaseDiarrhea, blood in stool, fatigue, abdominal pain, cramping, mouth sores, reduced appetite, weight loss, and fistula146/89416.4
Angina pectoris or coronary heart diseaseChest pain, chest pressure, chest tightness, shortness of breath, and heartburn239/77530.8
Grave’s diseaseAnxiety, irritability, heat sensitivity, increased sweating, weight loss, enlargement of thyroid or goiter, frequent bowel movements, diarrhea, bulging eyes, rapid heartbeat, rapid pulse, irregular heartbeat, and atrial fibrillation112/47923.4
Esophageal cancerDifficulty or pain while swallowing solid food, vomiting, choking on food, heartburn, chest pressure, weight loss, coughing, and hoarseness29/11924.4
LymphomaEnlarged lymph nodes, night sweats, weight loss, intermittent fever, and fatigue177/84921.0
Plantar fasciitisFoot pain and heel pain10/2125.2
CellulitisPainful area of skin, leg, foot, hand, or face; skin erythema or redness; skin edema; hot skin; and dropsy4/2781.8
ProstatitisPainful, difficult, or frequent urination, blood in urine, groin pain, rectal pain, abdominal pain, low back pain, malaise, body aches, urethral discharge, and painful ejaculation or sexual dysfunction12/10312.6
MastitisBreast tenderness, pain or burning sensation, breast warmth or redness, breast swelling or thickening, breast lump, breast pain or burning, and malaise10/9012.2
Bell’s palsyFacial paralysis on one side; drooping of the mouth to one side; asymmetrical mouth movement or smile; loss of blinking on one side; decreased or increased tearing; altered sense of taste; slurred speech; drooling; difficulty eating, drinking, or chewing; and pain or numbness behind the ear6/1683.6
MononucleosisSore throat, malaise, headache, loss of appetite, myalgia, muscle pain, chills, and nausea 21/91 22.8

aAverage=15.49% (1792/12,367).

Survey Data

To estimate whether laypeople recognize the investigated medical conditions and their symptoms, we conducted a survey using the crowdsourcing platform CrowdFlower. We randomly selected 104 actual condition and symptom pairs and created another 156 random pairs. The latter were created by randomly matching a condition and a symptom and then verifying that the symptom is not manifested in the selected condition. A total of 10 crowdsourced workers were asked to answer, for each of the 260 pairs, whether they recognized the name of the medical condition and whether they thought that the given symptom could be the sign of the condition.

Laypeople’s knowledge about certain conditions and their symptoms, as determined through the surveys, was compared with the rate of searches for these symptoms in our search data.


Data analysis was conducted using MATLAB version 9.4.0. Spearman correlation was used to evaluate associations. The level of significance was set as 5% (P<.05). The Institutional Review Board of the Technion—Israel Institute of Technology approved this study.

On average, 618 people (mean SE 189) self-identified their diagnosis from 1 of the 20 analyzed conditions by making a DAQ. The users asked about each symptom, on average, 1.7 times. Table 1 lists the 20 conditions analyzed, the symptoms that were designated as being associated with that condition, and the percentage of people who queried about their symptoms before a formal diagnosis was made. On average, 15.49% (1792/12,367, SD 8.4%) of people queried about their symptoms before receiving a formal medical diagnosis.

There was a significant correlation between the percentage of people querying about a condition and the median number of days between the first symptom query and the DAQ (rho=0.60; P=.005; number of conditions=20). Thus, patients with conditions that had a longer duration between the onset of symptoms and the time of diagnosis were more likely to query for symptoms before they received a medical diagnosis.

We labeled each condition according to the lowest (most worrying) MSR for the related symptoms. We focused on the most worrying symptoms because of prior work, which shows that people most often recall the worst experience of pain [12,13], and thus, we hypothesized that people would be driven to search for the most worrying symptoms. A total of 4 conditions were excluded from this calculation because they had no symptoms for which Youngmann and Yom-Tov [11] provided an MSR. We found that more people inquired about the symptoms of a condition before a formal medical diagnosis (429/2328, 19.13%) for conditions that had an MSR≤2 as compared to those that had an MSR>2 (1382/10,107, 16.81%). Thus, people tended to query 14% more for diseases with more worrying symptoms (lower MSR) than those with less distressing symptoms.

A total of 4636 condition-symptom pairs were evaluated by crowdsourced workers in a survey aimed at estimating whether laypeople recognize the 20 investigated medical conditions and their symptoms. The responders reported recognizing the condition in 86% (3987/4636) of the pairs presented to them. The least recognized conditions were plantar fasciitis (40%, 16/40) and chronic obstructive pulmonary disorder (29%, 132/462). As described in the Methods section, our survey included actual disease-symptom pairs and sham disease-symptom pairs. For the real pairs, 86% (3987/4636) of responses correctly identified the pair as being associated. A similar level of success was observed for the sham pair, with 74% of the responses correctly indicating that the symptom was not associated with the disease.

Spearman correlation between the percentage of people querying for the symptom of a condition on Bing and the percentage of people who correctly recognized its symptoms in the survey was −0.47 (P=.08; number of conditions=14). This means that people tend to query more for conditions with less recognizable symptoms.

To estimate whether the DAQs give an accurate indication of the time of initial diagnosis, we compared the timing of the DAQ with the first query about a hospital in the three malignant conditions. These specific conditions were analyzed because a hospital visit is usually required soon after the initial diagnosis is made. Table 2 shows the median time from the first query for a hospital or clinic until the first DAQ, the percentage of queries for hospitals made before the DAQ, and the percentage of people querying for a hospital. Figure 1 shows the distribution of the number of days between a query for a hospital and a query indicating a formal diagnosis, for all three conditions. There was no significant difference in time between queries for hospitals and queries for clinics (P>.05; rank sum test). As shown in Table 2 and Figure 1, there is only a short lag between the queries for a hospital and queries indicating a formal medical diagnosis. Indeed, 67% of queries for hospitals were within a window of 2 weeks before or after the DAQ (47% within a week around the DAQ). Figure 1 shows a clear peak for the first hospital searches occurring on the same day as the DAQ (19% of queries). Thus, queries indicative of diagnosis, specifically DAQs, correspond to the time of the actual diagnosis.

Table 2. The time from the first query for a hospital and the first query indicating the condition, the percentage of times the hospital query occurred before the diagnosis query, and the percentage of people who queried for a hospital.
ConditionMedian time from hospital query to diagnosis query (days)aQueries for hospitals made before queries for diagnosisbPeople who inquired about a hospitalc

Endometrial cancer2.546/726472/11065
Esophageal cancer14.345/666866/11955


bAverage=65% (395/616).

cAverage=59% (616/1078).

Figure 1. The time from the first query about a hospital and the time of the first query indicating the condition. Negative times indicate that the first query to a hospital was made prior to the first query about the disease. The figure shows data for endometrial cancer, esophageal cancer, and lymphoma.
View this figure

Internet interventions for public health have increased dramatically in the past decade, with multiple interventions across a wide range of conditions and populations. Following rapid developments in the last few years, it has been suggested that studies on internet interventions in the current decade will determine both the effectiveness and potential of such online interventions on public health [14].

It is well known that people search for symptoms on the internet before consulting a medical professional [15]. However, there has been no detailed characterization of when or what people inquire about when symptoms appear.

We found that an average of 15.5% of people queried for their symptoms before receiving a medical diagnosis from a health care professional, and we found high variability in this figure among the 20 conditions analyzed in our study. For instance, only 1.8% of patients diagnosed with cellulitis searched for a painful red area of skin before a medical diagnosis was made, compared with 30.8% of patients diagnosed with coronary heart disease who searched for chest pain or shortness of breath before receiving a medical diagnosis.

Our results showed that more people search for symptoms when the time between that first search to a formal diagnosis is longer (Spearman rho=0.6). One possible explanation for this correlation is that a long period of diagnosis gives an opportunity for more people to ask online about the condition. Another possible explanation is that people who search the internet for symptoms tend to do more thinking and consulting before approaching a medical professional, thus delaying their diagnosis. If the latter is true, better information needs to be provided to information seekers when it is likely that they have an acute condition that needs prompt treatment, emphasizing the need to enable internet-based screening for certain conditions.

The finding that most people inquire about hospitals around the time of making queries that suggest a diagnosis supports our assumption that such queries are made around the time of actual diagnosis. This concurs with past studies [16] that found that the number of people querying for different types of cancer were correlated with the incidence of cancers, not prevalence.

The survey results show that the conditions we studied and the match between the condition and symptom are well recognized. However, some conditions were less recognized than others. By comparing our survey results and our search data results, we showed a negative correlation between the rate of inquiry and the knowledge about a condition, suggesting that the less people know about a condition, the more they query for its symptoms. This result was not statistically significant (P=.08), but there is clearly a pattern that could illuminate one of the basic motives for people to first approach the Web when symptoms occur.

Our study limitations are inherent in a search engine data study, including the dependence on the user’s declaration of diagnosis, which assumes the exclusion of healthy people searching for diseases out of general curiosity. The main limitation is that most users do not declare the diagnosis in a search and therefore cannot be identified as patients. Moreover, users who self-declare their diagnosis are known to be a biased sample of patients, comprising relatively more females and younger people [17], a bias caused by a preference for query length. Thus, the prior rate of queries could be more heavily reflective of these population segments than that of the general population.

Although we have strived to find a comprehensive list of symptoms for each condition (including synonyms thereof), some symptoms could have been missed, especially colloquial references to symptoms, and these, if included, could have increased the reported percentage of searches for symptoms before the formal medical diagnosis. In addition, as our observation window is finite (1 year), people might have queried for their symptoms before the beginning of the data period. Such searches might increase the reported fraction of people performing searches for disease by approximately 1/12, meaning that the average percentage of people conducting an online search for a disease would be approximately 16.8%.

Finally, a symptom could be related to multiple underlying conditions. Although we have tried to focus on conditions with clear and distinct symptoms, such cases could have skewed our estimate for the search rates.

Despite these shortcomings, the strength of the study is significant; it is based on a diverse cohort made available because most people in industrialized countries now have access to the internet and use search engines as their primary data source when seeking health-related information [4]. This, together with our new understanding of when and how people seek information on symptoms, may enable future systems for screening of serious medical conditions from internet data, in general, and search queries, in particular, thereby overcoming the barriers of health illiteracy, unfamiliarity with medical conditions, and difficult access to the health system. Our hope is that such systems will enable earlier diagnosis of many serious medical conditions.

Conflicts of Interest

EY is an employee of Microsoft, the owner of the Bing search engine.

  1. Avery N, Ghandi J, Keating J. The 'Dr Google' phenomenon--missed appendicitis. N Z Med J 2012 Dec 14;125(1367):135-137. [Medline]
  2. Robertson N, Polonsky M, McQuilken L. Are my symptoms serious Dr Google? A resource-based typology of value co-destruction in online self-diagnosis. Australas Mark J 2014;22(3):246-256. [CrossRef]
  3. Kuehn BM. More than one-third of US individuals use the internet to self-diagnose. J Am Med Assoc 2013 Feb 27;309(8):756-757. [CrossRef] [Medline]
  4. Fox S, Duggan M. Pew Internet - Pew Research Center. 2013. Health Online 2013   URL: [accessed 2020-01-13]
  5. Paul MJ, Dredze M. Social monitoring for public health. Synthesis Lect Inf Concepts Retr Serv 2017;9(5):1-183. [CrossRef]
  6. Yom-Tov E, Cox IJ, Lampos V. Learning About Health and Medicine from Internet Data. In: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining. 2015 Presented at: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining; February 2 - 6, 2015; Shanghai, China p. 417-418. [CrossRef]
  7. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L. Detecting influenza epidemics using search engine query data. Nature 2009 Feb 19;457(7232):1012-1014. [CrossRef] [Medline]
  8. Chan EH, Sahai V, Conrad C, Brownstein JS. Using web search query data to monitor dengue epidemics: a new model for neglected tropical disease surveillance. PLoS Negl Trop Dis 2011 May;5(5):e1206 [FREE Full text] [CrossRef] [Medline]
  9. Soldaini L, Yom-Tov E. Inferring Individual Attributes from Search Engine Queries and Auxiliary Information. In: Proceedings of the 26th International Conference on World Wide Web. 2017 Presented at: WWW'17; April 3 - 7, 2017; Perth, Australia p. 293-301. [CrossRef]
  10. Yom-Tov E, Gabrilovich E. Postmarket drug surveillance without trial costs: discovery of adverse drug reactions through large-scale analysis of web search queries. J Med Internet Res 2013 Jun 18;15(6):e124 [FREE Full text] [CrossRef] [Medline]
  11. Youngmann B, Yom-Tov E. Anxiety and Information Seeking: Evidence From Large-Scale Mouse Tracking. In: Proceedings of the 2018 World Wide Web Conference. 2018 Presented at: WWW'18; April 23 - 27, 2018; Lyon, France p. 753-762. [CrossRef]
  12. Ariely D, Loewenstein G. When does duration matter in judgment and decision making? J Exp Psychol Gen 2000;129(4):508-523. [CrossRef]
  13. Ariely D, Kahneman D, Loewenstein G. Joint comment on 'When does duration matter in judgment and decision making?' (Ariely & Loewenstein, 2000). J Exp Psychol Gen 2000;129(4):524-529. [CrossRef]
  14. Bennett GG, Glasgow RE. The delivery of public health interventions via the internet: actualizing their potential. Annu Rev Public Health 2009;30:273-292. [CrossRef] [Medline]
  15. Higgins O, Sixsmith J, Barry MM, Domegan C. ARAN - Access to Research at NUI Galway. 2011. A Literature Review on Health Information Seeking Behaviour on the Web: A Health Consumer and Health Professional Perspective   URL: [accessed 2020-01-13]
  16. Ofran Y, Paltiel O, Pelleg D, Rowe JM, Yom-Tov E. Patterns of information-seeking for cancer on the internet: an analysis of real world data. PLoS One 2012;7(9):e45921 [FREE Full text] [CrossRef] [Medline]
  17. Yom-Tov E. Demographic differences in search engine use with implications for cohort selection. Inf Retrieval J 2019;22(6):570-580. [CrossRef]

DAQ: diagnosis ascertainment query
MSR: Medical Severity Rank

Edited by G Eysenbach; submitted 17.06.19; peer-reviewed by R West, C Eickhoff; comments to author 03.10.19; revised version received 07.10.19; accepted 16.12.19; published 06.03.20


©Irit Hochberg, Raviv Allon, Elad Yom-Tov. Originally published in the Journal of Medical Internet Research (, 06.03.2020.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on, as well as this copyright and license information must be included.