This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Twitter’s 140-character microblog posts are increasingly used to access information and facilitate discussions among health care professionals and between patients with chronic conditions and their caregivers. Recently, efforts have emerged to investigate the content of health care-related posts on Twitter. This marks a new area for researchers to investigate and apply content analysis (CA). In current infodemiology, infoveillance and digital disease detection research initiatives, quantitative and qualitative Twitter data are often combined, and there are no clear guidelines for researchers to follow when collecting and evaluating Twitter-driven content.
The aim of this study was to identify studies on health care and social media that used Twitter feeds as a primary data source and CA as an analysis technique. We evaluated the resulting 18 studies based on a narrative review of previous methodological studies and textbooks to determine the criteria and main features of quantitative and qualitative CA. We then used the key features of CA and mixed-methods research designs to propose the combined content-analysis (CCA) model as a solid research framework for designing, conducting, and evaluating investigations of Twitter-driven content.
We conducted a PubMed search to collect studies published between 2010 and 2014 that used CA to analyze health care-related tweets. The PubMed search and reference list checks of selected papers identified 21 papers. We excluded 3 papers and further analyzed 18.
Results suggest that the methods used in these studies were not purely quantitative or qualitative, and the mixed-methods design was not explicitly chosen for data collection and analysis. A solid research framework is needed for researchers who intend to analyze Twitter data through the use of CA.
We propose the CCA model as a useful framework that provides a straightforward approach to guide Twitter-driven studies and that adds rigor to health care social media investigations. We provide suggestions for the use of the CCA model in elder care-related contexts.
In the digital age, social networking sites such as Twitter are increasingly turned to as an information source, as they offer a large amount of digital text and are readily available to multisite apps (eg, personal computers, mobile phones, and tablets). Health discussions, for example, occur regularly on Twitter, with online discussions and content sharing among a variety of populations, including health care professionals, patients with chronic conditions, and their caregivers. Some efforts have emerged to investigate the content of health care-related posts on Twitter, constituting a new area for researchers to investigate using content analysis (CA). These approaches are also known as infodemiology, infoveillance, or digital disease detection research. In many of these research initiatives, quantitative and qualitative Twitter data are combined, but there are few clear guidelines for researchers or reviewers to follow when collecting and evaluating this content. An explanation for this could be that contemporary CA is best described as a juxtaposition of quantitative (eg, frequency analysis to count words in a text and represent them statistically) and qualitative (eg, nonfrequency analysis for in-depth hermeneutic interpretations of a text) methodological dimensions [
Research using social media platforms (eg, Facebook, Twitter, or LinkedIn) is in the early stages, and despite the great potential for the application of CA to Twitter-based health care content, there are few guidelines for the collection, analysis, and evaluation of the various types of Twitter data. Thus, the aim of our study was to use criteria available in the CA literature, specifically literature on the use of CA in health care research, to identify and evaluate published studies that used Twitter as a primary source of data and CA as a method of analysis and interpretation. Based on our analysis, we propose the combined content-analysis (CCA) model as an organizing framework to guide the application of integrated methods (quantitative and qualitative) and modes (manual and computer assisted) of CA, and to address the varied nature of Twitter feed data (eg, textual, numerical, audio, and video material) within single or multiple-phase studies.
In this paper, we first discuss the position of CA in previous research and then illustrate how CA has been used in health care research. Building on common characteristics of CA found in the literature, we evaluate 18 studies published between 2010 and 2014. Finally, we propose the CCA model of CA along with mixed-methods research approaches. We suggest how to apply the CCA model and offer supporting resources drawing on elder care-related examples.
CA is a research methodology or set of methods to analyze content collected from written (eg, open-ended surveys, personal communications, letters, diaries, short stories, newspapers or magazines, and theoretical or methodological trends in journal papers), verbal (eg, interviews, focus groups, radio programs, and folk songs), or visual (eg, films, videos, and TV programs) materials, from printed and electronic resources [
Between the 1930s and 1950s, CA was called “symbol analysis” and was a scientific method of recording the frequency of certain keywords found in newspapers [
As CA spread to other disciplines in the social sciences, such as sociology, psychology, business, and health research, the qualitative approach to CA was developed and was recognized as an approach for data analysis in many research disciplines [
CA researchers such as Holsti [
CA has come into widespread use in health care research in recent years because of its sensitivity and flexibility as a research technique concerned with meanings, intentions, consequences, and context [
Hsieh and Shannon [
In conventional CA, it is assumed that because there is insufficient or fragmented knowledge about a phenomenon [
Deductive or directed CA can be used when the purpose of the study is to test a theory or extend an existing theory or prior research [
The third type of CA used in health care research is the summative approach. Rather than the data being analyzed as a whole, as in the previous two approaches, the text is searched for particular words or content in relation to a particular topic. For example, the summative approach was used to examine content related to end-of-life care in 14 critical care nursing textbooks [
Content analysis (CA) in health care research. Adapted from Hsieh & Shannon (2005, p 1286, Table 4) with permission of SAGE Publications, Inc.
To locate current trends in health care social media studies and studies using CA to analyze data from the most popular social media tool, Twitter, we conducted a PubMed search of the years 2010 to 2014. Keyword sets combined “content analysis” AND one of the following: “healthcare social media,” “social networking websites,” “Twitter-driven content,” “Twitter feeds,” OR “healthcare tweets.” The primary research questions were “How is CA used in health care social media studies?” and “Does it follow the common features of CA literature identified in CA research, in general, and health care-related research, in particular?” Paper selection was based on the title and the abstracts. In case of uncertainty, we read the entire text of a paper. In addition, we manually searched the reference lists of all included studies. From the 21 studies found, we selected 18 for examination (see
Our results show that, in the 18 studies examined (in English), Twitter was used as a public and real-time source for textual health data where users tried to disseminate health information from formal sources (eg, academic journals or news websites) and informal sources (eg, personal opinions or actual experiences). In these studies, researchers analyzed Twitter messages using CA as a sole technique or with other research techniques, such as the infoveillance approach (eg, [
Studies analyzing health-related Twitter posts (2010–2014).
Author(s) | Keywords and hashtags (#) | Sampling and data collection | Data analysis (coding process) | Validation and presentation of results |
Chew & Eysenbach (2010) [ |
“swine flu”, “swineflu”, and “H1N1” | Random sample of 5395 tweets for 9 days (each 4 weeks apart) generated from 2 million archived tweets over 8 months. Tweets were posted between May 1 and December 31, 2009 (n=600 tweets per day were collected for analysis). | Infoveillance approach (statistical classifier) for tracking flu rate (longitudinal text mining and analysis). This approach includes in-depth qualitative manual coding, automated CAa using a triaxial coding scheme, and sentiment analysis. | Pilot coding (1200 tweets), ICRb for a subset of 125 tweets using kappa statistic (κ>.70), Pearson correlations between manual and automated coding, and chi-square to test changes over time, frequency tables, and text matrices with quotes illustrating the categories. |
Scanfeld et al (2010) [ |
“antibiotic” and “antibiotics” | Random sample of 52,153 tweets. Tweets were posted weekly between March 13 and July 31, 2009 (n=1000 tweets were collected for analysis). | Cross-sectional survey approach using Q-methodology and CA (frequencies). | Pilot coding of 100 tweets, ICR for a random sample of 10% of the analyzed tweets using kappa statistic (κ=.73), frequency tables, and text matrices with quotes illustrating the categories. |
Heaivilin et al (2011) [ |
“toothache”, “tooth ache”, “dental pain”, and “tooth pain” | Random sample of 4859 tweets over 7 nonconsecutive days (n=1000 tweets were collected for analysis). | Cross-sectional survey approach and CA (frequencies and descriptive statistics). | Pilot coding of 300 tweets, ICR using kappa statistic (κ=.96), frequency tables, and continuous text with quotes illustrating the categories. |
Signorini et al (2011) [ |
“flu”, “swine”, “influenza”, “vaccine”, “tamiflu”, “oseltamivir”, “zanamivir”, “relenza”, “amantadine”, “rimantadine”, “pneumonia”, “h1n1”, “symptom”, “syndrome”, and “illness” and additional keywords (eg, travel, trip, flight, fly, cruise, and ship) | Two large data sets for tracking flu rate over time and location. The first data set consists of 951,697 tweets selected from the 334,840,972 tweets. Tweets were posted between April 29 and June 1, 2009. The second data set consists of 4,199,166 tweets selected from roughly 8 million tweets. Tweets were posted between October 1, 2009 and December 2009. | Quantitative CA (descriptive and advanced statistics). | Regression analysis and frequency graphs with respect to time. |
McNeil et al (2012) [ |
“seizure”, “seizures”, “seize”, “seizing”, and “seizuring” | Random sample of 10,662 tweets from a period of 7 consecutive days. Tweets were posted between April 15 and April 21, 2011 (n=1504 tweets were collected for analysis). | Prospective qualitative CA. | Pilot coding of a 48-hour preliminary data set and interrater agreement (85.4%), frequency tables, and text matrices with quotes illustrating the categories. |
Sullivan et al (2012) [ |
“concussion”, “concussions”, “concuss”, “concussed”, “#concussion”, “#concussions”, “#concuss,” and “#concussed” | Random sample of 3488 tweets over 7 consecutive days. Tweets were posted between 12:00 GMTc on July 23 and 12:00 GMT on July 30, 2010 (n=1000 tweets were collected for analysis). | Prospective observational study using qualitative CA. | Pilot coding of 100 tweets from a sample collected over a 24-hour period and interrater agreement, frequency tables, and text matrices with quotes illustrating the categories. |
Donelle & Booth (2012) [ |
“#health” and “health” as a single word, part of a word (eg, health care) | Purposeful cross-sectional sample of 36,042 tweets. Tweets were collected over 4 consecutive days, from June 16, 2009 at 19:32 GMT until June 20, 2009 at 12:02 GMT (n=2400 tweets were collected for analysis; the first 100 tweets from the end of each hour of June 19, 2009, starting at 05:00 GMT for a 24-hour period). | Qualitative (directed and deductive) CA [ |
Trustworthiness and validation of findings (interrater agreement, systematic data analysis, analyst triangulation, and verbatim data collection, and basic descriptive statistics). Data were presented through frequency graphs, text matrices, and continuous text with quotes illustrating the categories. |
Robillard et al (2013) [ |
“dementia” and “Alzheimer” | Random sample of 9200 tweets for a period of 24 hours (starting February 15, 2012 at 3:35 pm) (n=920 tweets were collected for analysis in addition to a subsample containing 100 tweets generated by the top users). | Cross-sectional survey using CA [ |
Pilot coding of an initial set of 100 random tweets and frequency graphs and tables. |
Lyles et al (2013) [ |
“pap smear” and “mammogram” | Cross-sectional sample of top tweets during a 5-week period. Tweets were posted between April and early May 2012 (n=474 tweets were collected for analysis). | Exploratory qualitative CA. | Pilot coding of 20% of collected tweets, ICR of 40% of collected tweets, interrater agreement, frequency graphs, text matrices, and continuous text with quotes illustrating the categories. |
Bosley et al (2013) [ |
“cardiac arrest”, “CPR”, “AED”, “resuscitation”, “heart arrest”, “sudden death”, and “defib” | All identified resuscitation-related tweets from the keyword search. Tweets were posted between April 19 and May 26, 2011 (n=15,475 tweets were collected for analysis). | Quantitative CA (descriptive statistics). | Pilot coding of 1% of identified tweets, ICR using kappa statistic (κ=.78), frequency graphs and text matrices with quotes illustrating the categories. |
Hanson et al (2013) [ |
“prescription drugs” | Random set of tweets posted by 25 identified social networks or circles. Tweets were posted between November 29, 2011 and November 14, 2012 (up to 3200 tweets per user were collected for analysis). | Quantitative CA of identified social circles | Pearson correlation coefficient of user interactions. Frequency tables and social network graphs. |
Henzell et al (2013) [ |
“braces”, “orthodontist”, and “orthodontics” | Convenience sample of consecutive tweets posted over a 5-day period. Tweets were posted between September 3 and 7, 2012 (n=131 tweets were collected for analysis). | Qualitative (discourse) CA. | Continuous text with quotes illustrating the categories. |
Myslín et al (2013) [ |
“cig*”, “nicotine”, “smoke*”, “tobacco”, “hookah”, “shisha”, “waterpipe”, “e-juice”, “e-liquid”, “vape”, and “vaping” | Random sample of tweets at 15-day intervals. Tweets were posted between December 5, 2011 and July 17, 2012 (n=7362 tweets were collected for analysis). | Infoveillance methodology [ |
Pearson correlations between manual and automated coding, chi-square to test changes over time, frequency graphs, and text representation diagrams. |
Rui et al (2013) [ |
Not stated | Random sample of tweets posted by 58 health organizations (chosen randomly) within 2 months. Tweets were posted between September and November 2011 (n=1500 tweets were collected for analysis). | Quantitative (deductive) CA guided by the classic categorization of social support. | Descriptive statistics, ICR of 200 random tweets using Krippendorff alpha (.74), frequency tables, and continuous text with quotes illustrating the categories. |
Zhang et al (2013) [ |
113 physical activity keywords generated from lists of published physical activity measures | A random sample of 30,000 tweets selected from a pool of one million tweets. Tweets were posted between January 1 and March 31, 2011 (n=4672 tweets were collected for analysis in addition to 1500 collected from this sample for further coding). | Quantitative CA (descriptive and advanced statistics). | Pilot coding of 100 tweets (separate from the final 1500 tweets) to calculate ICR (ranges from 0.83 to 0.98) using Holsti’s [ |
Park et al (2013) [ |
“health literacy” | Random sample of 1044 tweets. Tweets were posted during the following time periods to construct a composite month: October 25–31, 2009; November 7–14, 2009; December 15–23, 2009; and January 4–10, 2010 (n=571 tweets were collected for analysis). | Quantitative CA based on Web reports on key Twitter features and previous literature in health communication and media studies. | Pilot coding, ICR of a subsample of 111 tweets using Holsti [
Love et al (2013) [ |
“vaccine”, “vaccination”, and “immunization” | Random sample of 6827 English-language tweets. Tweets were posted between January 8 and 14, 2012 (n=2580 tweets were collected for analysis). | Quantitative CA. | Statistical analysis (frequencies and chi-square analyses and tables). |
Jashinsky et al (2013) [ |
Keywords and phrases created from suicide risk factors (12 identified factors) | All tweets (1,659,274 tweets) posted by 1,208,809 unique users over a 3-month period. Tweets were posted between May 15, 2012 and August 13, 2012 (n=37,717 tweets from 28,088 unique users were collected for analysis). | Quantitative CA (descriptive and advanced statistics). | ICR using kappa statistic (κ=.48), Spearman rank correlation coefficient, vital statistics, and text matrices with quotes illustrating the categories. |
aCA: content analysis.
bICR: intercoder reliability.
cGMT: Greenwich mean time.
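Several of the studies in the table above report ICR as a kappa statistic or as simple interrater agreement. The following is a minimal sketch of how these two figures can be computed for two coders. It assumes Cohen's kappa for two coders and nominal categories; the reviewed studies may have used other variants, and the category labels and codes below are hypothetical.

```python
from collections import Counter


def percent_agreement(codes_a, codes_b):
    """Share of units to which both coders assigned the same category."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("Both coders must code the same non-empty set of units.")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)


def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    observed = percent_agreement(codes_a, codes_b)
    n = len(codes_a)
    freq_a = Counter(codes_a)
    freq_b = Counter(codes_b)
    # Chance agreement: product of each coder's marginal proportions, summed over categories.
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    if expected == 1.0:
        raise ValueError("Kappa is undefined when chance agreement equals 1.")
    return (observed - expected) / (1 - expected)


# Hypothetical codes assigned by two coders to the same four tweets.
coder_a = ["info", "info", "opinion", "opinion"]
coder_b = ["info", "opinion", "opinion", "opinion"]

print(percent_agreement(coder_a, coder_b))
print(cohens_kappa(coder_a, coder_b))
```

Kappa is lower than raw agreement here because part of the observed agreement would be expected by chance alone, which is why the two figures in the table are not directly comparable.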
Twitter archive software used in the studies analyzing health-related Twitter posts (2010–2014).
Author(s) | Archive software used |
Chew & Eysenbach (2010) [ |
Infoveillance system and Twitter APIa |
Scanfeld et al (2010) [ |
Twitter search engine |
Heaivilin et al (2011) [ |
Twitter search engine |
Signorini et al (2011) [ |
JavaScript application and Twitter’s API |
McNeil et al (2012) [ |
Twitter search engine |
Sullivan et al (2012) [ |
Twitter search engine |
Donelle & Booth (2012) [ |
The Archivist (MIX Online, 2011) data collection software program |
Robillard et al (2013) [ |
Twitter’s API |
Lyles et al (2013) [ |
Twitter search engine |
Bosley et al (2013) [ |
Twitter search engine |
Hanson et al (2013) [ |
Twitter’s API |
Henzell et al (2013) [ |
Twitter search engine |
Myslín et al (2013) [ |
Twitter’s API |
Rui et al (2013) [ |
ActivePython v2.7.2 |
Zhang et al (2013) [ |
Twitter’s API |
Park et al (2013) [ |
Twitter’s API |
Love et al (2013) [ |
Twitter’s API |
Jashinsky et al (2013) [ |
Twitter’s API |
aAPI: application programming interface.
The qualitative approaches to sampling techniques, such as purposeful and convenience sampling, were used in only 2 studies ([
Among the reviewed studies, all used a form of CA that was neither purely quantitative nor purely qualitative. Despite the fact that these two types of data were combined, no formal approach to mixing methods was described within any of the methods sections. With either approach chosen by the researchers there were mixed modes of analysis. Data were either imported and coded automatically (computer assisted) or imported automatically and coded manually (with human-assisted analysis). While the manual mode of CA can be used to qualify small amounts of coded data, the automatic mode may be used for large samples of either categorical or more quantifiable words or texts. The validation of results in these studies was based mostly on the pilot coding (also called trial coding [
We propose that a blended research methodology that considers quantitative and qualitative perspectives in the study design and coding procedure would be fruitful for the advancement of CA methodologies. Further, an approach that allows for a combination of manual and computer-assisted coding through the most suitable supported software for the methodological approach of the study would be beneficial. A robust approach of this kind was not explained explicitly in these studies; we describe our proposed model for such studies in the Discussion section.
Building on our review of the literature for key concepts, components, and data collection and analysis procedures of CA, and our appraisal of 18 health care social media studies, we propose the CCA model as a solid model for combining methods (quantitative and qualitative), coding procedures (inductive and deductive), and analytic modes (manual and automated) of CA. Our model is designed to address the mixed (quantitative and qualitative) nature of Twitter feed data in single or multiple-phase studies depending on the research aim of the phenomena under investigation. The model enables researchers to integrate methods and blend data in a single study—or a series of studies—using Twitter as a primary data source for analysis; it is a mixed-methods approach to CA research in the age of digital data. The CCA model integrates the major designs of mixed-methods research—the convergent, sequential, embedded, and transformative designs [
Because text is always qualitative to begin with and the quantification of text alone is insufficient for successful understanding of content [
When referring to potential mixed-methods design, in the CCA algorithm we used the most common notations (abbreviations) used in mixed-methods literature [
The combined content-analysis (CCA) algorithm.
The combined content-analysis (CCA) model. CA: content analysis; qual: qualitative supplement; QUAL: qualitative priority; quan: quantitative supplement; QUAN: quantitative priority.
Researchers interested in health care social media-driven data can use Twitter as a rich and useful data source to generate information related to their health topic. This way of collecting data may go beyond traditional data collection methods (eg, observations, interviews, or focus groups), and researchers may have a large amount of textual data that is shared by a diverse group of people in a social and natural platform. Analyzing Twitter-driven content such as tweets can be a productive way not only to analyze text, but also to evaluate discourses surrounding health and disease-related issues [
Twitter features a search function (eg, keyword or hashtag search) to filter status updates that meet particular search criteria. Archive software is also available to search, track, store, and retrieve targeted health topics from collected tweets by date, time, and possible geographic location. Because reading any form of text, even using a technical search, is fundamentally an interpretive process regardless of its numerical outcomes [
Before conducting a study on health tweets, several factors are important for researchers to consider in deciding what CA approach to use. First, it is essential to confirm that data on their topic have been tweeted (preliminary search for data) and to determine the time frames or periods of time when this has occurred. Some Twitter databases may be created in response to a specific event (eg, an Alzheimer awareness day or month); the data cannot be interpreted well if that event (the context of the data) is not taken into account in the analysis. Discussions on specific health topics may not be established yet, and the number of tweets may be insufficient to facilitate analysis. Searching for health-related keywords in Twitter is the first step for any Twitter-driven study using CA. This step is common to traditional summative CA studies and mirrors the first part of the CCA model equation ([ (qual “Keywords search”) + (Aim) ] →), which is usually qualitative in nature because it is done manually. However, the Twitter database itself may be collected directly from Twitter (eg, Twitter’s advanced search), downloaded from chat recaps (eg, Twitter chat transcripts) using particular health care social media websites (eg, the Healthcare Hashtag Project [
Availability of data and clearly worded objectives will help researchers choose the study, data collection, and analysis approaches to use. To make a final decision on study approach, it is important for researchers to consider which CA approach will be helpful in achieving their desired results. For example, researchers might ask the following questions. Should we test hypotheses by counting words (a single word), the co-occurrence of words (word-to-word), or text as a whole in the targeted tweets? Should we explain counted results using descriptive or inferential statistics and then integrate additional qualitative information (eg, QUAN + or → qual)? Should we try to understand the environment surrounding tweets (text and related context) by asking questions and seeking answers within the data and then support the answers using descriptive statistics (eg, QUAL + or → quan)? Are both numbers or hypotheses and words or questions equally important in understanding the big picture (eg, QUAN + or → QUAL)? Are we interested in an interpretive analysis of the content and, if so, what qualitative methods can best inform the design and analysis? By considering all of these factors, researchers can choose an appropriate direction (and potential assisted software) for CA as per the second part of the CCA algorithm [ (QUAN + or → qual) OR (QUAL + or → quan) OR (QUAN + or → QUAL) + (CA) ].
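The first of the questions above distinguishes counting single words from counting the co-occurrence of words. As a minimal sketch of the difference, the following counts both for a handful of tweets. The example tweets, the tokenization rule, and the function names are hypothetical and are not drawn from any of the reviewed studies; co-occurrence is taken here to mean that two distinct words appear in the same tweet.

```python
from collections import Counter
from itertools import combinations
import re

# Hypothetical example tweets; not taken from any study reviewed here.
TWEETS = [
    "Caring for my mother with dementia is exhausting #eldercare",
    "New dementia research gives caregivers hope #eldercare",
    "Caregivers need more respite support",
]


def tokenize(text):
    """Lowercase a tweet and return its word tokens (hashtags keep their '#')."""
    return re.findall(r"#?\w+", text.lower())


def word_counts(tweets):
    """Count how often each single word appears across all tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tokenize(tweet))
    return counts


def cooccurrence_counts(tweets):
    """Count the tweets in which each unordered pair of distinct words appears together."""
    pairs = Counter()
    for tweet in tweets:
        unique_words = sorted(set(tokenize(tweet)))
        pairs.update(combinations(unique_words, 2))
    return pairs


counts = word_counts(TWEETS)
pairs = cooccurrence_counts(TWEETS)

print(counts["dementia"])                  # single-word frequency
print(pairs[("#eldercare", "dementia")])   # word-to-word co-occurrence
```

Pairs are stored in sorted order, so a lookup must use the same ordering. Counts of this kind support the QUAN side of a design; interpreting what the co-occurring words mean in context remains a qualitative task.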
The last part of the equation, “+ (CA)”, includes the key feature of successful CA, which moves from selecting the sample of content, establishing the coding process, and developing or testing category schemes to determining the quality criteria of study results. We provide these steps and explanations of how combined mixed-methods approaches to CA (as shown in the CCA algorithm) can be applied to the analysis of Twitter feed content in this section.
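For orientation, the parts of the algorithm quoted in the text can be written out as a single expression. This is a sketch assembled from the notation used in this section, not a reproduction of the published figure; the outer grouping of the three alternatives is our assumption. In standard mixed-methods notation, "+" denotes concurrent use, "→" denotes sequential use, and uppercase marks the method that has priority.

```latex
\[
\bigl[\,(\text{qual ``Keywords search''}) + (\text{Aim})\,\bigr]
\;\rightarrow\;
\bigl[\,\bigl((\text{QUAN} + \text{or} \rightarrow \text{qual})
\ \text{OR}\ (\text{QUAL} + \text{or} \rightarrow \text{quan})
\ \text{OR}\ (\text{QUAN} + \text{or} \rightarrow \text{QUAL})\bigr) + (\text{CA})\,\bigr]
\]
```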
Although in all potential approaches, that is, “(QUAN + or → qual) + (CA)”, “(QUAL + or → quan) + (CA)”, and “(QUAN + or → QUAL) + (CA)”, researchers sample the text or “universe” [
Despite the gap in the social media methodology literature about sampling, the CA literature follows the general direction of research paradigms [
On the other hand, with the “(QUAL + or → quan) + (CA)” approach, the focus is on the transferability rather than the generalizability of results. As such, researchers can purposefully collect a sample of tweets (hundreds) within the tweets database that is unique to specific users (eg, regular users or chat managers of a specific topic identified by an elder care-related hashtag), events (eg, an elder care-related event), or researchers’ assumptions about such tweets. Nonprobability samples, such as purposeful, convenience, and other types of qualitative samples, allow for the collection of important interpretive data and for the consideration of research questions that acknowledge the contexts, meanings, emphasis, and thematic dimensions of the topic. For example, a researcher might select his or her purposeful sample based on selected tweets of a popular health care community on Twitter (eg, #AlzChatUS). The selection of data may continue throughout the coding phase. Once the researcher establishes a rationale for specific tweets (which are likely to involve purposive, convenience, or other nonrandom sampling methods), the dominant direction of the study will no longer be quantitative, unless the rationale is combined with a random sampling method for the inclusion of tweets in the study. For instance, if researchers choose to analyze the random tweets of top users during an Alzheimer awareness month or day, the “(QUAN + or → QUAL) + (CA)” approach might lead the study, because the tweets, their environment, and specific (top) users are important. Regular tweets about Alzheimer disease from users tweeting on this subject may differ from tweets and users during an Alzheimer awareness month or day.
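The combination described above, a purposeful selection followed by a random draw, can be sketched as two steps. The tweet records, field names, and function names below are hypothetical, and the hashtag filter uses simple substring matching, which is an assumption rather than a recommended matching rule.

```python
import random

# Hypothetical tweet records; the fields and values are illustrative only.
TWEETS = [
    {"id": i, "text": f"tweet {i} #AlzChatUS" if i % 2 == 0 else f"tweet {i}"}
    for i in range(1, 21)
]


def purposeful_sample(tweets, hashtag):
    """Keep only tweets whose text contains the given hashtag (case-insensitive)."""
    tag = hashtag.lower()
    return [t for t in tweets if tag in t["text"].lower()]


def random_sample(tweets, n, seed=0):
    """Draw n tweets without replacement; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(tweets, min(n, len(tweets)))


chat_tweets = purposeful_sample(TWEETS, "#AlzChatUS")   # nonprobability step
analysis_set = random_sample(chat_tweets, 5, seed=42)   # probability step

print(len(chat_tweets), len(analysis_set))
```

Recording the seed and the selection rule alongside the sample makes it possible for other researchers to see how the tweets entered the analysis, whichever approach leads the study.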
If researchers want to choose their sample purposefully (tweets of Alzheimer awareness month or day) but also want to track the changes of tweets over time (eg, in 2010, 2012, and 2013), this also means that the two approaches lead the study equally, because the aim is to track changes over time related to a specific event or Twitter context. It is important, however, to note that there is a potential for rich data within the structure of the social network from which the textual information is derived—information that may best be understood through an application of social network analysis. Such analyses are, however, beyond the scope of this paper. Further information may be found in Gruzd and Haythornthwaite [
Establishing coding categories is one of the most fundamental steps in CA, especially for checking the quality criteria of the study, such as trustworthiness [
It is suggested that CA has the potential to be a valid and reliable tool to summarize extensive content if it is conducted carefully with clear and understandable results and well-described categories. This strength of the research is enhanced when researchers explain how they matched the reported results in their study with the study’s aim, questions, and hypothesis. This matching can be done with the use of quality criteria of CA. When considering the evaluation of CA results, there are two ways to ensure the rigor of a CA study: (1) using classic criteria to determine valid and reliable CA, and (2) using specific criteria to assess quality within the dominant research paradigm used. With the first way, validity and reliability concepts can be used with quantitative CA, and the results of a QUAN-dominant study can be presented through basic and advanced statistics (eg, percentages, probability, or inferences) that allow for objectivity and replication. Credibility, transferability, dependability, confirmability, and other areas for ensuring trustworthiness [
Schreier [
Another way to test ICR is to use reliability checks before conducting the analysis, which often entails pilot coding (trial coding) or pretesting categories several times before the actual coding. Pilot coding involves coding a small portion of the tweets to be analyzed or all tweets generated before selecting the sample (all retrieved sampling units). Such a pretest can enable researchers to determine whether the categories are clearly specified and meet the requirements, whether the coding instructions are adequate, and whether coders are familiar with the data and are suitable for the job. It is recommended that with a QUAN-dominant study, the sample of pilot coding should be different from the sample of actual coding. In contrast, if the QUAL-dominant approach is used, the sample of pilot coding should be a subset of the sample of actual coding [
With an inductive coding procedure, on the other hand, reliability checks between coders may not be helpful when an in-depth (line-by-line) analysis and iterative process is required. According to Elo et al [
Validity with CA may refer to the representation of the intended concept [
Representing the results linked to the quality criteria of CA, particularly showing the connection between the aim of the study and the reported data [
This section summarizes how technology can be used to facilitate different approaches of CA. As mentioned, the main idea behind CA is to break down a large amount of text into small codes, nodes, categories, themes, or concepts by making links between those concepts to support an emergent theory or test an existing theory [
In aiding CA, the software can be classified into two types: (1) computational software packages, such as text mining and statistical software packages [
Selected software to aid content analysis.
Software (source) | Web address |
Analytics for Twitter for Excel (Microsoft) | www.microsoft.com/en-us/download/details.aspx?id=26213 |
twitteR (The Comprehensive R Archive Network) | cran.r-project.org/package=twitteR |
Tweet Archivist (Tweet Archivist) | www.tweetarchivist.com |
Twitter Analytics (Twitter) | analytics.twitter.com/about |
CAQDASa,b Networking Project (University of Surrey) | www.surrey.ac.uk/sociology/research/researchcentres/caqdas/support/choosing/ |
Text Analysis Info (Social Science Consulting) | textanalysis.info/pages/text-analysis-software---classified.php |
aCAQDAS: computer-assisted qualitative data analysis software.
bFor example, ATLAS.ti, NVivo, MAXQDA, Dedoose, HyperRESEARCH.
In addition to the benefits of computerized coding listed above, software can be used to capture multiple types of data, such as multimedia data (eg, sounds and videos). On Twitter, for example, tweets can be coded manually or by data-analysis software depending on the leading approach chosen, length and format of the text (tweets), and the researchers’ aims. It is suggested that with limited qualitative data, manual coding provides a better understanding of the meanings between the lines [
CA is a prevalent methodology used to analyze health care social media-driven content, such as Twitter feeds. With the digital revolution of social networking platforms, Twitter has become a common source for online discussions on health issues; thus, health researchers need to become familiar with a structured model of CA that can respond to the nature of the retrieved digital data and the varied purposes of their studies. This paper reviews the general and health care literature of CA and evaluates how CA was used in Twitter-driven studies between 2010 and 2014. The CCA model is suggested as a new research framework that takes into account the various dimensions of the CA research methodology in a way that allows for mixing methods, procedures, and modes and components of CA. Thus, the CCA model will be useful in designing new studies (as a structured model) and evaluating existing studies (as an outline or checklist) that require or use various types or multiple modes of information within a single coherent model. The model integrates the main features of CA with the most common designs of mixed-methods research to facilitate the application and evaluation of studies that intend to use CA to analyze social media-driven content related to the researched phenomenon.
(A) Twitter overview. (B) Examples of eldercare tweet chats [
CA: content analysis
CAQDAS: computer-assisted/aided qualitative data analysis software
CCA: combined content analysis
ICR: intercoder reliability
QUAL: qualitative priority
qual: qualitative supplement
QUAN: quantitative priority
quan: quantitative supplement
EH is supported by the graduate scholarship program of King Abdulaziz University, Ministry of Higher Education, Saudi Arabia. An earlier version of this study was presented in the electronic poster session of the 2014 Health and Rehabilitation Sciences Graduate Research Forum, London, Ontario, Canada, February 5, 2014.
This manuscript was a part of EH’s doctoral comprehensive exam. EH designed the study, reviewed related literature, and drafted the first version of the manuscript. MS, JH, and AJ contributed to the quantitative perspective of the study. EK contributed to the qualitative perspective of the study. All authors discussed the study design and contributed to the final version of the manuscript.
None declared.