Using Partially-Observed Facebook Networks to Develop a Peer-Based HIV Prevention Intervention: Case Study

Background
This is a case study from an HIV prevention project among young black men who have sex with men. Individual-level prevention interventions have had limited success among young black men who have sex with men, a population that is disproportionately affected by HIV; peer network–based interventions are a promising alternative. Facebook is an attractive digital platform because it enables broad characterization of social networks. There are, however, several challenges in using Facebook data for peer interventions, including the large size of Facebook networks, difficulty in assessing appropriate methods to identify candidate peer change agents, boundary specification issues, and partial observation of social network data.

Objective
This study aimed to explore methodological challenges in using social Facebook networks to design peer network–based interventions for HIV prevention and present techniques to overcome these challenges.

Methods
Our sample included 298 uConnect study respondents who answered a bio-behavioral survey in person and whose Facebook friend lists were downloaded (2013-2014). The study participants had over 180,000 total Facebook friends who were not involved in the study (nonrespondents). We did not observe friendships between these nonrespondents. Given the large number of nonrespondents whose networks were partially observed, a relational boundary was specified to select nonrespondents who were well connected to the study respondents and who may be more likely to influence the health behaviors of young black men who have sex with men. A stochastic model-based imputation technique, derived from the exponential random graph models, was applied to simulate 100 networks where unobserved friendships between nonrespondents were imputed. To identify peer change agents, the eigenvector centrality and keyplayer positive algorithms were used; both algorithms are suitable for identifying individuals in key network positions for information diffusion. For both algorithms, we assessed the sensitivity of identified peer change agents to the imputation model, the stability of identified peer change agents across the imputed networks, and the effect of the boundary specification on the identification of peer change agents.

Results
All respondents and 78.9% (183/232) of nonrespondents selected as peer change agents by eigenvector on the imputed networks were also selected as peer change agents on the observed networks. For keyplayer, the agreement was much lower; 42.7% (47/110) and 35.3% (110/312) of respondent and nonrespondent peer change agents, respectively, selected on the imputed networks were also selected on the observed network. Eigenvector also produced a stable set of peer change agents across the 100 imputed networks and was much less sensitive to the specified relational boundary.

Conclusions
Although we do not have a gold standard indicating which algorithm produces the most optimal set of peer change agents, the lower sensitivity of eigenvector centrality to key assumptions leads us to conclude that it may be preferable. The methods we employed to address the challenges in using Facebook networks may prove timely, given the rapidly increasing interest in using online social networks to improve population health.


Introduction
In this Appendix, we describe several details that are not in the main body of the paper. We present a more detailed background on peer-based health interventions, describe methodological principles important to the identification of candidate peer change agents (PCAs), and provide mathematical definitions of the two PCA identification algorithms we used. We discuss missing network data imputation techniques in more depth, including methods to identify PCAs when networks are partially observed. We also present details on our efforts to include higher-order triad closure terms in the imputation model described in the main body of the paper.

Background

HIV among young Black men who have sex with men (YBMSM) in Chicago
Even as new HIV infections in the United States have stabilized over the past decade, YBMSM in the U.S. have continued to experience rising HIV incidence over this period [1]. In Chicago, the number of new HIV infections among YBMSM between 13-29 years of age from 2004-2014 was annually at least five times higher than that for their White counterparts [2]. Reducing HIV incidence among YBMSM is an urgent public health priority.
Pre-exposure prophylaxis (PrEP) is a novel biomedical intervention that has been shown to substantially reduce acquisition of HIV infection in multiple high-risk populations [3]. Among MSM adherent to PrEP, an efficacy of over 90% has been estimated [3,4], and the US Centers for Disease Control and Prevention (CDC) have recommended its use [5]. Increasing the use of PrEP to reduce HIV incidence among YBMSM is an official strategy of the Chicago Department of Public Health. Despite recommendations for increasing PrEP uptake, use of PrEP among MSM in Chicago remains low [6,7]. Since individual-level interventions have had limited success in increasing PrEP use among YBMSM [8], we apply a social network approach here.

Peer network interventions for HIV prevention
Network interventions, particularly those that are channeled through peers, have been found to be efficacious for HIV, both in the U.S. and globally [9]. An early example of prevention through diffusion of messaging was the "Stop AIDS" program in San Francisco during the early days of the HIV epidemic in the 1980s [10][11][12]. This program built upon small group communication and diffusion of information theories [13]. The diffusion was launched by training a small number of outreach workers, who conducted small group meetings in the gay neighborhoods of San Francisco. A well-respected HIV-positive individual in the community led these sessions, which were attended by gay and bisexual men. From 1985 to 1987, Stop AIDS reached over 30,000 men, and coincided with a marked decline in the number of new HIV infections [11]. A variant of this model was successfully applied to reduce sexual and drug-use behaviors related to HIV transmission [14], and was found to be effective among male prostitutes [15], in addition to diverse MSM populations in the United States [16][17][18][19], particularly Black MSM [20]. Thus, peer-based HIV interventions have shown promise for diverse high-risk populations. However, other studies have found conflicting results. For instance, the Community Popular Opinion Leader (C-POL) HIV intervention, where POLs were recruited and trained as behavior change endorsers [19], found that this peer-based intervention did not produce greater behavioral change than the control setting [21]. Some methods to improve the efficacy of this intervention have been proposed, including using digital methods to compile more accurate network data, and using formal network analyses to identify peer leaders [22].

Identification of peer change agents (PCAs)
One challenge in the design of PCA interventions is the identification of individuals who would be effective disseminators of the intervention. While traditional PCA selection has used an ensemble of methods, including self-selection, peer nomination, and ethnographic observation [23], recent work has suggested that biobehavioral interventions are most likely to be effective when they account for the network structure of high risk individuals [24]. Such structural network assessments use formal mathematical and computational techniques, and position scores are computed for individuals, or ensembles of individuals [25]. It has also been argued that a PCA identification procedure is most successful when the type of flow process that is of interest is taken into account [26]. Following this argument, we apply two computational algorithms which are well-suited to situations where the underlying flow process involves diffusion of information: eigenvector centrality [27], and keyplayer positive [28]. In a recent study, we selected influential individuals by applying these two measures to the observed Facebook networks of YBMSM in the first two waves of our study. We found that individuals who were unaware of PrEP at baseline but became aware at follow-up had substantially more friendships with the influential nodes we identified than individuals who remained unaware of PrEP at both waves [7].

PCA identification measures of interest: Eigenvector Centrality and Keyplayer Positive
The eigenvector centrality of a network is defined as the principal eigenvector of the adjacency matrix defining the network, as given by $Ax = \lambda x$, where $A$ is the adjacency matrix of the network, $\lambda$ is the constant eigenvalue, and $x$ is the eigenvector [27]. Each node $i$ receives a score that is equal to the $i$th component of the principal eigenvector. High scoring nodes are those that are connected to others that are themselves high scorers [26]. Eigenvector assumes that the flow process of interest moves through the network via unrestricted walks. It describes a mechanism where one node can impact all of its neighbors simultaneously [26], and has therefore been used in public health applications that utilize peer influence [29][30][31]. Eigenvector centrality (henceforth "eigenvector") is thus consistent with our application of PrEP-related information dissemination.
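As a minimal sketch of the definition above, the following Python snippet computes eigenvector scores for a small undirected toy network by taking the eigenvector associated with the largest eigenvalue of a symmetric adjacency matrix. The toy network, the function name, and the choice of NumPy's symmetric eigensolver are illustrative assumptions, not the software used in the study; the sketch also assumes a connected network, so that the principal eigenvector has components of a single sign.

```python
import numpy as np

def eigenvector_centrality(adj):
    """Return (scores, largest eigenvalue) for a symmetric adjacency matrix.

    Assumes an undirected, connected network (illustrative sketch only).
    """
    a = np.asarray(adj, dtype=float)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order
    eigenvalues, eigenvectors = np.linalg.eigh(a)
    principal = np.abs(eigenvectors[:, -1])  # fix the arbitrary overall sign
    return principal / principal.sum(), eigenvalues[-1]

# Hypothetical toy network: node 0 is tied to nodes 1, 2, and 3;
# nodes 1 and 2 are also tied to each other.
A = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
])

scores, leading_eigenvalue = eigenvector_centrality(A)
ranking = np.argsort(-scores)  # nodes ordered from highest to lowest score
```

In this toy network, node 0 ranks first because it is tied to every other node, and node 3 ranks last because its only tie is to node 0. Selecting the top-scoring nodes from such a ranking is one simple way to turn the scores into a candidate PCA set.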
The keyplayer positive algorithm, henceforth referred to as "keyplayer," is a set-based measure, reflecting the idea that the optimal set may not necessarily be composed of the nodes that have the highest individual scores [28]. Keyplayer defines the cohesion $C$ between the members of a set $S$ of nodes in a network with node set $V$ and the remaining nodes $V - S$, where cohesion is defined as $C = \bigcup_{i \in S,\, j \in V - S} a_{ij}$ (where $a_{ij}$ is 1 if nodes $i$ and $j$ are adjacent to each other and 0 otherwise), and $\bigcup$ is a maximum aggregation function equal to the maximum number of nodes in $V - S$ to which members of $S$ are adjacent [28]. To find the optimal keyplayer set $S$, an optimization algorithm is used that starts with a randomly selected set and swaps each node $i \in S$ with a node $j \in V - S$ to compute a new cohesion score $C'$. The nodes $i$ and $j$ are swapped only if $C' > C$, and the process is repeated until an optimal set is found [28]. The keyplayer set consists of individuals who are maximally connected to individuals in the network. Thus, passing information through the keyplayer set minimizes the social distance it has to travel to reach the maximum number of members of a social network. Keyplayer is thus an ideal choice for diffusing PrEP-related information, and it has been used in related public health applications [32,33].
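The swap-based search described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: cohesion is taken to be the number of nodes outside the set that are adjacent to at least one member of the set, and the function names, the first-improvement swap order, and the toy network are hypothetical. It is not the implementation used in the study, and the published keyplayer formulation [28] may weight reach differently.

```python
import random
import numpy as np

def cohesion(adj, members):
    """Count nodes outside `members` adjacent to at least one member."""
    inside = sorted(members)
    outside = [j for j in range(adj.shape[0]) if j not in members]
    if not outside:
        return 0
    # For each outside node, take the maximum tie value over all members
    reached = adj[np.ix_(inside, outside)].max(axis=0)
    return int(reached.sum())

def greedy_keyplayer(adj, k, seed=0):
    """Swap members with non-members while the cohesion score improves."""
    rng = random.Random(seed)
    n = adj.shape[0]
    current = set(rng.sample(range(n), k))  # random starting set
    best = cohesion(adj, current)
    improved = True
    while improved:
        improved = False
        for i in sorted(current):
            for j in sorted(set(range(n)) - current):
                candidate = (current - {i}) | {j}
                score = cohesion(adj, candidate)
                if score > best:  # accept only strict improvements
                    current, best, improved = candidate, score, True
                    break
            if improved:
                break  # restart the scan from the updated set
    return current, best

# Hypothetical toy network: hubs 0 and 4 are tied to each other,
# and each hub has three otherwise unconnected friends.
edges = [(0, 1), (0, 2), (0, 3), (4, 5), (4, 6), (4, 7), (0, 4)]
A = np.zeros((8, 8), dtype=int)
for i, j in edges:
    A[i, j] = A[j, i] = 1

key_set, key_score = greedy_keyplayer(A, k=2)
```

Because a swap is accepted only when the score strictly increases, the search always terminates, but in general it may stop at a local optimum; rerunning from several random starting sets is a common safeguard.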

Adapting Facebook data for prevention research
Facebook communities are now being used to promote health behaviors such as obesity control [34], smoking cessation [35], and HIV/AIDS prevention [36]. To identify PCAs from a set of potential influencers of Chicago YBMSM, we utilize Facebook data from "uConnect", a population-based longitudinal cohort study in Chicago that aims to study the role of social support networks of YBMSM on risk and risk-reduction practices to reduce new HIV infections [6]. Along with information on detailed biobehavioral characteristics assessed through an in-person survey, the Facebook friend lists of consenting uConnect participants were downloaded. We refer to individuals who appeared in our study as friends of one or more respondents, but who were not enrolled in the study themselves, as "nonrespondents." We are able to match nonrespondents who are named by more than one respondent, and because Facebook ties are undirected, we observe ties between nonrespondents and respondents. The Facebook friendships between pairs of nonrespondents, however, were unobserved. Our data on these Facebook networks are therefore incomplete, making it difficult to use these data to identify the structurally positioned individuals who are most suitable for diffusing innovations that affect HIV prevention.
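The observation pattern described above can be made concrete with a short Python sketch. A dyad is treated as observed when at least one of its two members is a respondent, and as unobserved when both are nonrespondents. The function names are hypothetical, and the nonrespondent count of 180,000 is an approximation used only for illustration (the study reports "over 180,000").

```python
from math import comb
import numpy as np

def observation_mask(n_respondents, n_nonrespondents):
    """Boolean matrix that is True where a dyad's tie status is observed."""
    n = n_respondents + n_nonrespondents
    is_respondent = np.arange(n) < n_respondents  # respondents listed first
    # A dyad is observed if either endpoint supplied a friend list
    observed = is_respondent[:, None] | is_respondent[None, :]
    np.fill_diagonal(observed, False)  # no self-ties
    return observed

def unobserved_fraction(n_respondents, n_nonrespondents):
    """Share of all dyads that link two nonrespondents."""
    total = comb(n_respondents + n_nonrespondents, 2)
    return comb(n_nonrespondents, 2) / total

mask = observation_mask(3, 4)               # toy case: 3 respondents, 4 nonrespondents
toy_fraction = unobserved_fraction(3, 4)    # 6 of the 21 dyads are unobserved
study_fraction = unobserved_fraction(298, 180_000)  # approximate study scale
```

At roughly the scale of the study, well over 99 percent of all dyads fall in the unobserved nonrespondent-nonrespondent block, which illustrates why the handling of these dyads matters for PCA identification.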
There are additional complications in using Facebook data for prevention research. First, Facebook networks tend to be large, since individuals tend to have many Facebook friends. Second, although all network research must deal with issues in specifying the boundary that defines the nodes of interest, such issues might be pronounced when Facebook data are used, because Facebook networks combine individuals from many different components of one's life (compared to, say, professional collaborations in an organizational setting). Facebook friendships are also not constrained by geography or by a high level of meaningful interaction. Thus, the ties can represent very heterogeneous relationships. Third, Facebook profiles do not consistently contain all of the attribute information that determines homophily or other social network effects. All of these factors compound the difficulties that partially observed networks already pose for health interventions in network studies based on data sources other than Facebook.

Partially observed network data: theoretical concepts and terminology
It is imperative to examine how the unobserved network can be reconstructed, given that we are using partially observed Facebook networks. Therefore, we must first define the nature of "missingness" in a partially observed dataset. Following the widely used convention developed by Rubin [37], data are called "missing completely at random" (MCAR) when the missingness depends neither on the observed data nor the unobserved data. Data are "missing at random" (MAR) if the probability of missingness does depend on the observed data but not the unobserved data [38,39], and "missing not at random" (MNAR) if the probability of missingness depends on the unobserved data [38]. Huisman (2009) provides an interpretation of these definitions for networks with unobserved edges [40], such as ours. In particular, if the probability of missingness is related to the existence of the edges, the data are MNAR [40].
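The three mechanisms defined above can be written compactly as follows. The notation is an assumption introduced here for illustration: $Y_{\mathrm{obs}}$ and $Y_{\mathrm{mis}}$ denote the observed and unobserved parts of the data, and $R$ is the indicator of which entries are missing.

```latex
% Y_obs, Y_mis: observed and unobserved parts of the data (illustrative notation)
% R: indicator of which entries are missing
\begin{align*}
\text{MCAR:}\quad & P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = P(R) \\
\text{MAR:}\quad  & P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = P(R \mid Y_{\mathrm{obs}}) \\
\text{MNAR:}\quad & P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) \text{ depends on } Y_{\mathrm{mis}}
\end{align*}
```

For network data, $Y$ is the set of tie indicators, so missingness that is related to whether an edge exists falls under the third case.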

PCA identification on partially observed network data
There is a growing body of literature on identifying critical nodes on networks that are incompletely observed [41][42][43][44][45][46][47]. This literature can be divided into two parts: 1) papers that examine scenarios with false positive ("spurious") and false negative ("missing") nodes [43][48][49][50], and 2) those that consider spurious and missing edges [46,47]. The findings suggest a "smooth and consistent decline" in the accuracy of PCA identification as the number of missing nodes increases [51]. The conclusions for missing edges are mixed. Borgatti et al. use Erdős-Rényi graphs to find that missing edges have a sizeable negative effect on the accuracy of identified key nodes, one that grows in magnitude as the number of missing edges increases, and contend that missing edges are more damaging than spurious edges [46]. Wang et al. [47], however, use two empirical networks to suggest that this result may not hold in general.
While these studies are instructive, they are not perfectly comparable to our case. Borgatti and Carley et al. considered a random Erdős-Rényi network with randomly missing nodes and edges [45,46], a pattern that fits the MCAR definition above. Wang et al. considered empirical networks, but with only 5 percent of nodes and edges missing, and those MCAR. Smith and Moody considered the impact of random [51] and nonrandom [52] node removal on network statistics in a number of directed and undirected empirical networks, but not that of missing edges. Kossinets considered a number of mechanisms for missing edge configurations in a bipartite scientific collaboration network, but the network statistics he examined are not applicable to our PCA identification measures [53]. Our data describe an empirical network with a large amount of missing data and an MNAR structure of missingness; more details are in Section 4 below.

Reconstructing unobserved networks
The preferred approach for recovery of missing data depends on the type of missingness. One early technique for working with missing network data is the method of "reconstruction", proposed by Stork and Richards [54], and examined in more recent work [40,55]. This method allows for unobserved ties from nonrespondents to respondents to be reconstructed, based on the report provided by the respondent on this tie. However, as the authors themselves argue, this technique cannot be used directly to impute unobserved ties between nonrespondents.
Huisman (2009) presented a sparse network of adolescent friendships with multiple mechanisms of missing nodes and edges and at varied proportions of missingness [40]. He then imputed the missing data using four different techniques based on sampling from unconditional distributions, and measured the effect of this imputation on some common statistics of network structure, but not the type of node influence processes that we are interested in.
Our interest is in the reliable identification of PCAs, given an incompletely observed network, where the data are MNAR. A method to impute unobserved edge data using conditional distributions was proposed by Robins et al. [44]. This method utilized exponential random graph models (ERGMs) and considered two cases: one where the goal was to model the full network structure but nonrespondents were MAR, and a second where the principal focus was to understand the respondent-only structure, albeit while using the observed information regarding ties to nonrespondents. The networks discussed by Robins et al. were directed, while the Facebook networks that are of interest to us are undirected. Moreover, Robins et al. applied the method to impute unobserved data between respondents and nonrespondents, not between pairs of nonrespondents, as is needed to characterize the YBMSM network of interest.

Reconstructing nonrespondent-nonrespondent data
To impute the unobserved friendships, we use a method developed by Handcock and Gile [56], where imputation is based on multiple imputations of the full network conditional upon the observed network. In contrast, Huisman's method used unconditional distributions [40]. Thus, we apply a likelihood-based imputation technique where all the observed and unobserved data are modeled simultaneously [44]. We then apply two different algorithms (eigenvector and keyplayer) to identify PCA sets on these imputed networks. We compare PCAs selected by these algorithms in the main body of the manuscript.
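The overall workflow described above (complete the network many times while holding the observed ties fixed, identify PCAs on each completed network, and summarize how stable the selections are) can be sketched as follows. One substitution should be stated plainly: for brevity, this sketch fills each unobserved dyad with an independent Bernoulli draw at a fixed, hypothetical probability `tie_prob`, which is a stand-in for, and much cruder than, the ERGM-based conditional imputation used in the study. The toy network and all function names are likewise illustrative assumptions.

```python
import numpy as np

def top_k_eigenvector(adj, k):
    """Indices of the k nodes with the largest eigenvector scores."""
    _, vectors = np.linalg.eigh(adj.astype(float))
    scores = np.abs(vectors[:, -1])  # principal eigenvector, sign fixed
    return set(np.argsort(-scores)[:k].tolist())

def impute_networks(observed_adj, unobserved, tie_prob, n_draws, seed=0):
    """Yield completed networks; observed dyads are never altered.

    Stand-in imputation: independent Bernoulli(tie_prob) draws, NOT an ERGM.
    """
    rng = np.random.default_rng(seed)
    n = observed_adj.shape[0]
    fillable = np.triu(unobserved, 1)  # visit each unobserved dyad once
    for _ in range(n_draws):
        new_ties = (rng.random((n, n)) < tie_prob) & fillable
        imputed = observed_adj.copy()
        imputed[new_ties | new_ties.T] = 1  # keep the network undirected
        yield imputed

# Toy data: nodes 0-2 are respondents, nodes 3-6 are nonrespondents
n_resp, n_non = 3, 4
n = n_resp + n_non
is_resp = np.arange(n) < n_resp
unobserved = ~(is_resp[:, None] | is_resp[None, :])
np.fill_diagonal(unobserved, False)

observed_adj = np.zeros((n, n), dtype=int)
for i, j in [(0, 1), (1, 2), (0, 3), (0, 4), (1, 4), (1, 5), (2, 5), (2, 6)]:
    observed_adj[i, j] = observed_adj[j, i] = 1

k, n_draws = 2, 100
counts = np.zeros(n)
for net in impute_networks(observed_adj, unobserved, tie_prob=0.3, n_draws=n_draws):
    for node in top_k_eigenvector(net, k):
        counts[node] += 1
selection_freq = counts / n_draws  # share of imputed networks selecting each node
```

Nodes with a selection frequency close to 1 are chosen on nearly every imputed network, while frequencies spread thinly across many nodes indicate that the selected set is sensitive to how the unobserved ties are filled in.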

Imputation model: Convergence of Triad Closure Terms
Triad closure terms have been found to be important in predicting missing network data [57]. However, none of the models in which we attempted to incorporate terms for triad closure converged, despite extensive efforts. We began with a parameter for the count of triangles in the network, consistent with previous work [58,59]; this model was found to be "degenerate," as defined in earlier work [59,60]. We then explored a variety of models including geometrically-weighted edgewise shared partner (GWESP) terms, alone and in combination with geometrically-weighted dyadwise shared partner (GWDSP) terms [61]. These included models in which decay parameters were fixed and models in which they were estimated. All models were degenerate. We next attempted to keep the model from diverging from the observed network density by including a variety of additional degree terms, which specify the expected number of nodes with a specific degree, in conjunction with GWESP and/or GWDSP terms, with no improvement in model convergence. We take two lessons from this exercise. First, the terms commonly used within ERGMs for capturing triadic effects may not be as appropriate for high-degree networks like Facebook as they are for the lower-degree networks to which they have traditionally been applied. Second, the presence of missing data may make these terms especially challenging to estimate for networks of the size explored here. Thus, our experience points to potential limits of existing parameters in modeling large, dense networks. Both issues are areas where future methodological research is needed.
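For readers unfamiliar with the triad closure statistics discussed above, the following Python sketch computes the triangle count and a GWESP statistic for a small network. It assumes the commonly used fixed-decay form, in which each edge with $s$ shared partners contributes $e^{\alpha}\{1 - (1 - e^{-\alpha})^{s}\}$ for decay parameter $\alpha$. The function names and toy networks are hypothetical, and the sketch only evaluates the statistics; it does not fit an ERGM or reproduce the models attempted in the study.

```python
import numpy as np

def edgewise_shared_partners(adj):
    """Number of shared partners for each edge of an undirected network."""
    a = np.asarray(adj, dtype=int)
    shared = a @ a  # shared[i, j] = number of common neighbors of i and j
    upper = np.triu_indices_from(a, k=1)
    is_edge = a[upper] == 1
    return shared[upper][is_edge]

def triangle_count(adj):
    """Each triangle adds one shared partner to each of its three edges."""
    return int(edgewise_shared_partners(adj).sum() // 3)

def gwesp(adj, decay):
    """Geometrically weighted edgewise shared partner statistic (fixed decay)."""
    sp = edgewise_shared_partners(adj)
    weights = 1.0 - (1.0 - np.exp(-decay)) ** sp
    return float(np.exp(decay) * weights.sum())

# Hypothetical toy networks: a complete graph on 4 nodes, and a triangle
# (nodes 0, 1, 2) with one pendant node (3) attached to node 0.
K4 = np.ones((4, 4), dtype=int) - np.eye(4, dtype=int)
T = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
])

k4_triangles = triangle_count(K4)
t_triangles = triangle_count(T)
t_gwesp = gwesp(T, decay=0.5)
```

With a decay of 0 the statistic reduces to the number of edges that have at least one shared partner, and as the decay grows it approaches three times the triangle count; the geometric down-weighting of additional shared partners is what distinguishes GWESP from the raw triangle term.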