This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
In low- and middle-income countries (LMICs), historically, household surveys have been carried out by face-to-face interviews to collect survey data related to risk factors for noncommunicable diseases. The proliferation of mobile phone ownership and the access it provides in these countries offers a new opportunity to remotely conduct surveys with increased efficiency and reduced cost. However, the near-ubiquitous ownership of phones, high population mobility, and low cost require a re-examination of statistical recommendations for mobile phone surveys (MPS), especially when surveys are automated. As with landline surveys, random digit dialing remains the most appropriate approach to develop an ideal survey-sampling frame. Once the survey is complete, poststratification weights are generally applied to reduce estimate bias and to adjust for selectivity due to mobile ownership. Since weights increase design effects and reduce sampling efficiency, we introduce the concept of automated active strata monitoring to improve representativeness of the sample distribution to that of the source population. Although some statistical challenges remain, MPS represent a promising emerging means for population-level data collection in LMICs.
Since the filing of Alexander Graham Bell’s patent for the telephone in 1876, voice, and eventually, data communications networks have transformed the globe. Hard-wired landline infrastructure was a necessary developmental milestone for communities entering the modern era, rapidly connecting populations across high-income countries and most urban centers of the developing world [
Until the first mobile phone was introduced in the early 1970s, there was no challenge to the role of the landline telephone as a tool for population-level data collection. As the global mobile phone revolution exploded in the early 2000s, a dramatic shift from landline to cellular networks began to occur. According to the CDC, by early 2005, only 7.3% of US households had shifted to mobile as their only phone connection [
This transition to ubiquitous mobile phone access around the globe has had an important effect on population surveys, especially in low- and middle-income countries (LMICs), where landline phones were rarely universally available and surveys were conducted by face-to-face (F2F) interviews. However, F2F surveys are expensive, time consuming, and often difficult to conduct in remote or conflict regions. Mobile phone surveys (MPS) are likely to reduce these challenges. In fact, several global agencies and survey firms have begun to leverage mobile phone coverage rates to collect data at random or from panels of respondents [
In this paper, we identify some of the key statistical considerations and challenges associated with each stage of mobile-only surveys in LMICs. We propose some novel methodological approaches for improving population representativeness and efficiency of MPS (see
Key mobile phone surveys (MPS) considerations by survey phase.
Phase | Key considerations | Mitigation |
Presampling | Differences between phone owners and nonphone owners | Decreases as mobile penetration increases in LMICs; |
Sampling and survey execution | Source of numbers to sample from; | Prescreened “valid numbers only” bank of numbers; |
Postsampling | Residual differences between phone owners and nonphone owners; | Postsampling weighting |
As the level of mobile phone access and ownership reaches saturation (100%, or at least one mobile phone per eligible adult respondent) at a population level, it becomes plausible to consider the entire population of phone owners as elements of a “sampling frame” for an MPS [
As shown in
In the third layer, we see how this subset of phone owners might respond, by picking up or not, and also a new section, representing numbers that are nonexistent—an artifact of the random digit dialing process, discussed in detail below. The fourth layer depicts the possible response behaviors of the subset of those phone owners who do pick up, with the gradient representing the different possible outcomes of the respondent interaction.
Statistical challenges for each phase of MPS. The figure illustrates the challenges in capturing specific denominators in MPS, where the bottom layer is the complete population, which comprises phone owners and non–phone owners. A cascading gradient has been used to depict the uncertainty and variability, by setting, in the proportions of either group at each layer. Various types of loss and attrition, from nonresponse to invalid numbers, reduce the total number of units sampled successfully and completely.
As populations transition from either fixed or no phone access to widespread cellular networks, coverage and household ownership can vary across states or districts. These differentials are likely to mirror socioeconomic and rural-urban gradients and could hinder statistically sound estimates of population characteristics and behavior if not carefully understood and managed [
Apart from addressing challenges in the differences between phone users and nonphone users in a population, the next important hurdle for any phone survey methodology begins with the development of a representative and valid sampling frame, broadly defined as the set of eligible numbers from which the sample of telephone households is selected [
Unlike telecom companies in many high-income countries, in LMICs, providers seldom maintain (or publish) directories of active mobile phone users and their numbers [
Because invalid sequences are essentially unavoidable in RDD, the required sample size should be inflated by dividing it by a factor 1-Y, where Y is the estimated proportion of nonworking numbers, to yield the number of random mobile phone numbers to be generated. Y can be estimated by first creating a smaller “test” pool of numbers and dialing it in a practice RDD round to determine the likely proportion of real numbers. As the numbers are dialed, if a number is identified as a nonhousehold or definitively found to be nonworking on the first call, it should be excluded and the next one dialed. If a call is not answered, it can be redialed a predetermined number of times before it is treated as nonworking and replaced with a new number from the list.
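The inflation step described above can be illustrated with a short sketch. The function names and the pilot data are hypothetical and are introduced only for illustration; the sketch assumes that each pilot call has already been classified as working or nonworking.

```python
import math
from typing import Iterable


def estimate_nonworking_share(pilot_outcomes: Iterable[bool]) -> float:
    """Estimate Y, the share of nonworking numbers, from a pilot RDD round.

    Each pilot outcome is True if the dialed number turned out to be working.
    """
    outcomes = list(pilot_outcomes)
    if not outcomes:
        raise ValueError("pilot round must contain at least one dialed number")
    return 1 - sum(outcomes) / len(outcomes)


def numbers_to_generate(target_sample: int, nonworking_share: float) -> int:
    """Inflate the target sample size by dividing it by (1 - Y)."""
    if not 0 <= nonworking_share < 1:
        raise ValueError("nonworking_share must be in [0, 1)")
    return math.ceil(target_sample / (1 - nonworking_share))


if __name__ == "__main__":
    # Hypothetical pilot: 25 of 100 randomly generated numbers were working.
    pilot = [True] * 25 + [False] * 75
    y = estimate_nonworking_share(pilot)
    print(y, numbers_to_generate(1000, y))
```

With this hypothetical pilot, Y is 0.75, so a target of 1000 working numbers requires generating 4000 random numbers.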
To ensure that the correct sample size is obtained, traditional landline surveys have employed a process called accelerated sequential replacement. This is an iterative approach that selects numbers from a purposefully expanded sample of random numbers and replaces those definitively identified as nonworking at the end of a given operational stage, such as the day or shift. One of three statuses is usually assigned to the outcome of a call: verified household, verified nonhousehold or nonworking, or unresolved (eg, no answer or strange noise, but not verified as nonworking). Before the beginning of the next stage, those numbers that have been verified as nonworking are replaced with an equal number of new random numbers from the randomly generated list. After a predetermined number of calls, unresolved numbers are assumed to be nonworking and are also replaced. Although this was traditionally a manual process, it can now be easily automated so that invalid numbers are replaced immediately rather than in stages, thereby reducing the total time required to conduct the survey.
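A minimal sketch of the automated variant, in which a number is replaced as soon as it is resolved, might look as follows. The `dial` callback, the `Status` names, and the default of three attempts are assumptions made for illustration and do not correspond to any particular survey platform.

```python
from enum import Enum
from typing import Callable, Iterator, List


class Status(Enum):
    HOUSEHOLD = "verified household"
    NONWORKING = "verified nonhousehold or nonworking"
    UNRESOLVED = "unresolved"


def collect_sample(
    numbers: Iterator[str],
    dial: Callable[[str], Status],
    target: int,
    max_attempts: int = 3,
) -> List[str]:
    """Dial numbers from an expanded random list until `target` are verified.

    Numbers verified as nonworking, or still unresolved after `max_attempts`
    calls, are dropped and replaced by the next number from the list.
    """
    reached: List[str] = []
    while len(reached) < target:
        number = next(numbers, None)
        if number is None:
            break  # the expanded list is exhausted before the target is met
        status = Status.UNRESOLVED
        for _ in range(max_attempts):
            status = dial(number)
            if status is not Status.UNRESOLVED:
                break
        if status is Status.HOUSEHOLD:
            reached.append(number)
    return reached
```

Because replacement happens immediately, there is no need to wait for the end of a day or shift before drawing new numbers.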
The relatively low cost and automated nature of most MPS technologies, combined with the vast size of the denominator being selected from (effectively every mobile phone owner in the population with an active subscription, connected to a network), allows us to consider an approach that continues to attempt to fill a particular target demographic stratum of a population until that stratum’s desired sample size is reached. In this quota-driven sampling procedure, an a priori sample size is determined with a known statistical precision level, and the sample is selected from an RDD list through a probabilistic sampling procedure. Differences between the composition of the general population and the population of phone owners, in terms of gender, age, socioeconomic status, urban versus rural residence, and geographic origin, can be mitigated by establishing target quotas based on the relative proportion of individuals with a particular combination of relevant characteristics in the population at large. Recent census data can be used to assess the strata-specific population distribution or, where data from a recent census are not available, information from the most recent demographic and health surveys (DHS) may be used. The DHS are conducted in over 90 LMICs and provide nationally representative data on these strata-specific population distributions.
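As an illustration of how target quotas could be derived from census or DHS counts, the following sketch allocates a fixed total sample across strata in proportion to their population size. The largest-remainder rounding rule and the stratum labels are assumptions chosen for the example; the text does not prescribe a specific rounding rule.

```python
from typing import Dict


def allocate_quotas(census_counts: Dict[str, int], total_sample: int) -> Dict[str, int]:
    """Proportional allocation of `total_sample` across strata.

    Integer arithmetic is used throughout; leftover interviews after rounding
    down are given to the strata with the largest remainders.
    """
    population = sum(census_counts.values())
    if population <= 0:
        raise ValueError("census counts must sum to a positive number")
    quotas: Dict[str, int] = {}
    remainders: Dict[str, int] = {}
    for stratum, count in census_counts.items():
        quotas[stratum], remainders[stratum] = divmod(total_sample * count, population)
    leftover = total_sample - sum(quotas.values())
    for stratum in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        quotas[stratum] += 1
    return quotas


if __name__ == "__main__":
    # Hypothetical stratum counts, for illustration only.
    counts = {"urban_female": 30, "urban_male": 30, "rural_female": 20, "rural_male": 20}
    print(allocate_quotas(counts, 10))
```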
Given the digital nature of MPS, real-time data streams can be monitored, and strata actively “closed” once the required sample size for that subgroup has been met. This process can be automated or monitored by study implementers. The concept, automated active strata monitoring (AASM), also allows many more strata to be chosen to minimize the possible effects of nonprobabilistic sampling from the parent population. In this process, when a participant answers the phone, the first survey questions should establish their demographic, educational, and other sociodemographic characteristics of interest to determine the stratum to which they contribute. If the required number of targeted respondents has already been reached in their stratum, no further questions are asked, and they are excused from completing the survey. Conversely, if more respondents are still required in their stratum, they are led through the survey questions.
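The screening logic described above can be sketched as a small quota tracker. The class and method names are hypothetical, and the sketch assumes that a respondent counts toward a stratum only once the full survey is completed.

```python
from typing import Dict, Hashable


class StrataMonitor:
    """Track completed interviews per stratum and close strata at quota."""

    def __init__(self, quotas: Dict[Hashable, int]) -> None:
        self.quotas = dict(quotas)
        self.completed = {stratum: 0 for stratum in self.quotas}

    def is_open(self, stratum: Hashable) -> bool:
        """True if more respondents are still required in this stratum."""
        return self.completed.get(stratum, 0) < self.quotas.get(stratum, 0)

    def record_complete(self, stratum: Hashable) -> None:
        """Count one completed interview toward the stratum's quota."""
        if not self.is_open(stratum):
            raise ValueError(f"stratum {stratum!r} is closed or unknown")
        self.completed[stratum] += 1

    def all_closed(self) -> bool:
        """True once every stratum has reached its quota."""
        return not any(self.is_open(stratum) for stratum in self.quotas)


if __name__ == "__main__":
    monitor = StrataMonitor({("female", "rural"): 1, ("male", "urban"): 2})
    # Strata as determined by the first screening questions of each call.
    for stratum in [("male", "urban"), ("female", "rural"), ("female", "rural")]:
        if monitor.is_open(stratum):
            monitor.record_complete(stratum)  # respondent proceeds to the survey
        else:
            print("excused:", stratum)  # quota met, no further questions asked
```

In use, the survey platform would call `is_open` immediately after the screening questions and stop dialing once `all_closed` returns true.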
AASM is not plausible in traditional household surveys, as the marginal cost and time required to visit more households, in the effort to complete one or more unfilled strata, becomes prohibitively expensive. With the extremely low cost of MPS, in contrast to traditional F2F methods, and high levels of mobile phone coverage, for the first time in history, the survey denominator is theoretically, in many cases, the entire population (see
For traditional survey methods, an inherent disadvantage of quota-driven sampling is that it is a nonprobabilistic approach: although certain characteristics of interest have been chosen to recreate a sample reflective of the population as a whole, other unmeasured characteristics have not been accounted for because the underlying population distribution is unknown. In a random sample, the distribution of both measured and unmeasured characteristics stands a better chance of reflecting the actual population. With AASM in MPS, however, the population distribution strata are preserved through sampling without replacement and quota restrictions to mitigate the oversampling of certain population groups. This method does require, however, access to reliable and recent statistical information regarding the parent population’s characteristics. These may be available from national surveys and other globally standardized surveys (eg, DHS), although recent information may not be readily available for some populations.
Given the large size of the population being sampled from, sampling without replacement from the entire population of phone owners can usually continue until the desired sample size in every sociodemographic stratum is achieved. In some cases, such as scenario B in
Mobile phone access across two theoretical populations, using age as an illustrative respondent characteristic through which representativeness can be assessed. The figure illustrates a hypothetical population distribution against a distribution of mobile phone ownership, under conditions of low (scenario A) and high (scenario B) mobile penetration. In scenario A, common to populations where mobile phones have recently been introduced, obtaining a representative sample through MPS may not be feasible, even using AASM. As mobile markets mature, the overlap in distributions increases, allowing methods like AASM and random digit dialing to improve the capture of a sample that closely reflects the population-at-large.
It is well recognized that inequity exists in mobile phone access, and phone ownership is not equally distributed across the population in many countries. Young, male, urban populations are more likely to have a mobile phone than older, female, rural populations. Consequently, a major problem of mobile surveys is the selectivity bias due to ownership heterogeneity. A common remedy is to conduct weighted analyses with poststratification adjustments to reduce the selectivity bias and to improve population representativeness.
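A minimal sketch of poststratification weighting is given below. It assumes the common form in which each stratum’s weight is its population share divided by its sample share; the stratum labels and numbers are hypothetical.

```python
from typing import Dict


def poststratification_weights(
    population_shares: Dict[str, float],
    sample_counts: Dict[str, int],
) -> Dict[str, float]:
    """Weight per stratum = population share / sample share."""
    n = sum(sample_counts.values())
    weights: Dict[str, float] = {}
    for stratum, share in population_shares.items():
        count = sample_counts.get(stratum, 0)
        if count == 0:
            # An empty stratum cannot be weighted up; strata must be collapsed.
            raise ValueError(f"no respondents in stratum {stratum!r}")
        weights[stratum] = share / (count / n)
    return weights


if __name__ == "__main__":
    # Hypothetical example: younger respondents are overrepresented among phone owners.
    shares = {"younger": 0.5, "older": 0.5}
    counts = {"younger": 80, "older": 20}
    print(poststratification_weights(shares, counts))
```

In this example, the underrepresented stratum receives a weight four times larger than the overrepresented one, which illustrates how selectivity in phone ownership translates into variable weights.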
Poststratification with weighting, however, is not without problems. Kish has shown that the variance $\sigma^2/n$ of a weighted estimate is inflated by a factor of $1 + CV_{wt}^2$, where $CV_{wt}^2$ is the relative variance of the sampling weights [
where
A multicountry study in Afghanistan, Ethiopia, Mozambique, and Zimbabwe by the World Bank shows that the design effects due to weighting were quite large: 6.3, 11.6, 5.2, and 1.8, respectively [
Trimming extreme weights is often suggested to reduce the coefficient of variation (CV) of the weights, but trimming may bias the results. We propose, for MPS, reducing the variability in weighting by restricting the quota of interviews for each stratum to the original sample allocation size. An AASM approach is expected to substantially reduce the CV of weights and thus the
The design-effect (deffwt) of weighted estimate. Here wi is the weight (inverse of selection/participation probability) in the i-th stratum.
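Kish’s approximation for the design effect due to weighting is commonly written as $deff_{wt} = 1 + CV_{wt}^2 = n \sum_i w_i^2 / (\sum_i w_i)^2$. The sketch below computes this standard form from a list of weights. It is offered as an illustration only, and its notation may differ from that of the equation referred to in the caption above; the function names are hypothetical.

```python
from typing import Sequence


def kish_design_effect(weights: Sequence[float]) -> float:
    """Design effect due to unequal weighting: n * sum(w^2) / (sum(w))^2."""
    n = len(weights)
    total = sum(weights)
    if n == 0 or total <= 0:
        raise ValueError("weights must be non-empty with a positive sum")
    return n * sum(w * w for w in weights) / total ** 2


def effective_sample_size(weights: Sequence[float]) -> float:
    """Nominal sample size deflated by the weighting design effect."""
    return len(weights) / kish_design_effect(weights)


if __name__ == "__main__":
    print(kish_design_effect([1, 1, 1, 1]))  # equal weights
    print(kish_design_effect([1, 3]))        # unequal weights
```

Equal weights give a design effect of 1, so no precision is lost. As the weights become more variable, the design effect rises and the effective sample size falls, which is the motivation for limiting weight variability at the sampling stage.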
Previous research indicates that dropout rates might be higher with MPS as people are more likely to be occupied than when contacted in person or on a landline and be less available to complete the full survey [
To minimize the effect on the data from dropouts that do occur, the order of the survey modules could be randomized so that each module has the opportunity to be placed at the beginning of an interview to ensure a sufficient number of responses to each set of questions. Additionally, to keep mobile phone interviews short, questions asked should be limited to “important” key indicators identified in consultation with country policy makers and stakeholders, as well as the literature. Further, to control for the differences between those who respond to the survey and those who refuse, during the survey the number of those who choose not to respond should be recorded. The results should be adjusted postsurvey by nonresponse weighting, using a factor,
Nonresponse weighting, using a factor $R_i$, where $f_i$ is the nonresponse rate for the i-th group.
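The exact expression for the factor is not reproduced here. As a sketch, the code below assumes the widely used inverse-response-rate form, $R_i = 1/(1 - f_i)$, which may differ from the authors’ formula. The group labels and counts are hypothetical.

```python
from typing import Dict


def nonresponse_factors(
    contacted: Dict[str, int],
    responded: Dict[str, int],
) -> Dict[str, float]:
    """Assumed form: R_i = 1 / (1 - f_i), with f_i the group's nonresponse rate."""
    factors: Dict[str, float] = {}
    for group, n_contacted in contacted.items():
        n_responded = responded.get(group, 0)
        if n_contacted <= 0 or n_responded <= 0:
            raise ValueError(f"group {group!r} has no contacts or no respondents")
        nonresponse_rate = 1 - n_responded / n_contacted  # f_i
        factors[group] = 1 / (1 - nonresponse_rate)       # R_i
    return factors


if __name__ == "__main__":
    print(nonresponse_factors({"urban": 100, "rural": 200}, {"urban": 50, "rural": 160}))
```

Under this assumption, a group in which half of those contacted refused receives a factor of 2, so each respondent also stands in for one nonrespondent from the same group.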
In low socioeconomic populations, mobile phones may be shared among members of a household or among neighbors, which may lead some individuals to be less likely to be surveyed [
A concern commonly voiced when considering populations where cellular phone use is high is that people with more than 1 mobile phone have a statistically higher probability of selection from a population, with the relative probability of being selected proportional to the number of phones owned by that user. Although statistically correct, the practical implications of this relatively rare situation, from a population perspective, are negligible. Using RDD in a large population (eg, the denominator of every possible phone number in a country), the individual probability of selection of an individual is very small, even if the user has more than 1 phone.
The selection probability of an individual person can be calculated as $P = (n \times N_p)/N_T$, where $n$ is the targeted sample size, $N_p$ is the number of phones owned by that person, and $N_T$ is the total number of phones. When $N_p = 1$, individuals with one phone have approximately the same selection probability as under simple random sampling. If $N_p = 2$, the selection probability is doubled, and so on: the selection probability increases in proportion to the number of phones a person has.
Illustration of the extremely low individual probability of inclusion of those with 1 phone and those with 3 phones in a theoretical population of 100 million and how these probabilities change as the proportion with multiple phones increases.
Percentage of 100 million population with one mobile phone (a) | Number of mobile phones in group (b) | Individual probability of inclusion (c) | Percentage of 100 million population with 3 mobile phones (d) | Number of mobile phones in group (e) | Individual probability of inclusion (f) | Relative probability of inclusion (g) |
100% | 100,000,000 | 1.00E-08 | 0% | 0 | 3.00E-08 | 3 |
90% | 90,000,000 | 8.33E-09 | 10% | 30,000,000 | 2.5E-08 | 3 |
80% | 80,000,000 | 7.14E-09 | 20% | 60,000,000 | 2.14E-08 | 3 |
70% | 70,000,000 | 6.25E-09 | 30% | 90,000,000 | 1.88E-08 | 3 |
60% | 60,000,000 | 5.56E-09 | 40% | 120,000,000 | 1.67E-08 | 3 |
50% | 50,000,000 | 5E-09 | 50% | 150,000,000 | 1.5E-08 | 3 |
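The values in the table can be reproduced with a few lines of code. The sketch assumes a single random draw (n = 1), which is inferred from the tabulated probabilities (eg, 1.00E-08 = 1/100,000,000) rather than stated in the text; the function name is hypothetical.

```python
from typing import Tuple


def inclusion_probabilities(
    population: int,
    percent_multi: int,
    phones_multi: int = 3,
    draws: int = 1,
) -> Tuple[float, float, int]:
    """Return (probability with one phone, probability with several, total phones).

    `percent_multi` is the percentage of the population owning `phones_multi`
    phones; everyone else is assumed to own exactly one phone.
    """
    single_owner_phones = population * (100 - percent_multi) // 100
    multi_owner_phones = phones_multi * (population * percent_multi // 100)
    total_phones = single_owner_phones + multi_owner_phones
    p_single = draws / total_phones
    p_multi = draws * phones_multi / total_phones
    return p_single, p_multi, total_phones


if __name__ == "__main__":
    for percent in (0, 10, 20, 30, 40, 50):
        p1, p3, total = inclusion_probabilities(100_000_000, percent)
        print(f"{percent:>3}% | {total:>11,} phones | {p1:.2e} | {p3:.2e} | ratio {p3 / p1:.0f}")
```

The relative probability stays at 3 in every row, while both absolute probabilities shrink as the total number of phones grows.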
In countries where a significant proportion of the population has more than one mobile phone, the selection probability could be adjusted to take this into account. This has been done in landline surveys using an adjustment factor A, expressed formally as:
$A = 1/T_i$
where $T_i$ is the number of phones of the i-th respondent.
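As a minimal sketch, the adjustment can be applied by multiplying each respondent’s base weight by A. The function name is hypothetical, and the sketch assumes that the number of phones is self-reported by the respondent during the interview.

```python
def multiplicity_adjusted_weight(base_weight: float, phones_owned: int) -> float:
    """Scale a respondent's weight by A = 1 / T_i to offset multiple phones."""
    if phones_owned < 1:
        raise ValueError("a sampled respondent must own at least one phone")
    return base_weight / phones_owned


if __name__ == "__main__":
    # A respondent with 3 phones is 3 times as likely to be dialed,
    # so the weight is divided by 3.
    print(multiplicity_adjusted_weight(1.5, 3))
```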
Determining the geographic location of respondents is a challenge unique to mobile surveys that is uncommon for landline or household surveys, where the area code or zip code of a respondent tends to be known. As such, if geographic balance is sought, screening questions may be necessary to associate the mobile phone respondent with a geographic location [
With all RDD surveys, as phone numbers in the sample are randomly generated, it takes time to exclude nonworking numbers and eventually obtain the necessary sample size. This issue is especially pronounced for mobile surveys. In one example, it took almost 30 hours to remove 6872 nonworking mobile numbers compared with a landline sample, which took only 4.5 hours [
The mobile phone revolution has presented an unprecedented opportunity to collect public health-related data directly from populations. Near-universal connectivity has created massive population denominators, which are accessible to researchers interested in supplementing traditional F2F methods with data from MPS. Leveraging large, connected populations for high-quality survey research requires careful consideration of both unique and shared challenges across traditional F2F, landline, mixed population, and mobile-only surveys.
The low cost and automated process of deploying MPS allow innovative approaches such as AASM to be used to create samples that reflect the population-at-large, acknowledging that nonprobabilistic methods may be accompanied by unmeasurable biases. It is important that researchers using MPS methods consider these biases and, where possible, try to collect data that will improve not only the quality of the study but also our understanding of the strengths and limitations of this method.
We face a unique reality in a growing number of countries, where approaches like RDD now allow virtually every member of a population to be reached and surveyed about important public health issues. In most high-income countries, the overuse of mobile networks by telemarketers has reduced our capacity to take advantage of these methods. Robust methods in sampling and design will help maximize the value of MPS data in these countries as a useful approach to population surveillance.
AASM: automated active strata monitoring
CV: coefficient of variation
F2F: face-to-face
LMICs: low- and middle-income countries
MPS: mobile phone surveys
RDD: random digit dialing
MNO: mobile network operator
DHS: demographic and health surveys
The studies described as part of the research agenda are funded by the Bloomberg Philanthropies. This funding agency had no role in the preparation of this manuscript.
None declared.