The Twitter Social Mobility Index: Measuring Social Distancing Practices With Geolocated Tweets

Background: Social distancing is an important component of the response to the COVID-19 pandemic. Minimizing social interactions and travel reduces the rate at which the infection spreads and “flattens the curve” so that the medical system is better equipped to treat infected individuals. However, it remains unclear how the public will respond to these policies as the pandemic continues.
Objective: The aim of this study is to present the Twitter Social Mobility Index, a measure of social distancing and travel derived from Twitter data. We used public geolocated Twitter data to measure how much users travel in a given week.
Methods: We collected 469,669,925 tweets geotagged in the United States from January 1, 2019, to April 27, 2020. We analyzed the aggregated mobility variance of a total of 3,768,959 Twitter users at the city and state level from the start of the COVID-19 pandemic.
Results: We found a large reduction (61.83%) in travel in the United States after the implementation of social distancing policies. However, the variance by state was high, ranging from 38.54% to 76.80%. The eight states that had not issued statewide social distancing orders as of the start of April ranked poorly in terms of travel reduction: Arkansas (45), Iowa (37), Nebraska (35), North Dakota (22), South Carolina (38), South Dakota (46), Oklahoma (50), Utah (14), and Wyoming (53). We are presenting our findings on the internet and will continue to update our analysis during the pandemic.
Conclusions: We observed larger travel reductions in states that were early adopters of social distancing policies and smaller changes in states without such policies. The results were also consistent with those based on other mobility data to a certain extent. Therefore, geolocated tweets are an effective way to track social distancing practices using a public resource, and this tracking may be useful as part of ongoing pandemic response planning.


Introduction
The outbreak of SARS-CoV-2, a coronavirus that causes the disease COVID-19, has caused a pandemic on a scale unseen in a generation. Without an available vaccine to reduce transmission of the virus, public health organizations and elected officials have called on the public to practice social distancing. Social distancing is a set of practices in which individuals maintain a physical distance to reduce the number of physical contacts they encounter [1,2]. These practices include maintaining a distance of at least six feet from other people and avoiding large gatherings [3]. At the time of this writing, in the United States, nearly every state had implemented statewide "stay-at-home" orders to enforce social distancing practices [4].
Social distancing is an important tool in the fight against COVID-19; however, its implementation by the general public can vary widely. Although a state governor may issue an order for the practice, individuals in different states may respond to this order in different ways. Courtemanche et al [5] showed that social distancing policies in the United States reduced the daily growth rate of COVID-19 cases. However, if we only consider the social distancing policy duration and daily confirmed cases, it is difficult to rule out potential confounders, including additional policies for wearing masks and improving hygiene as well as other social norms. Therefore, understanding actual reductions in travel and social contacts is critical to measuring the effectiveness of such policies. Using mobile phone data, Badr et al [6] found that mobility patterns were strongly correlated with decreased rates of COVID-19 case growth for the 25 most affected counties in the United States. These social distancing policies may remain in effect for an extended period of time. Thus, the public may begin to relax their practices, making additional policies necessary. Researchers showed the effectiveness of strict social distancing followed by testing and contact tracing by modeling mobility data from Cuebiq Inc in the Boston metropolitan area [7]. Additionally, epidemiologists have already modeled the impact of social distancing policies on the course of disease outbreaks [8][9][10]. These models may be more effective when incorporating actual measures of social distancing rather than assuming that official policies are implemented in practice.
It can be challenging to obtain data on the efficacy of social distancing practices, especially during an ongoing pandemic. In a recent Gallup poll that surveyed Americans, it was found that many adults are taking precautions to maintain distance from others [11]. However, while polling can provide insights, it cannot provide a solution. Polling is relatively expensive; thus, it is a poor choice for ongoing population surveillance practices and providing data on specific geographic locales (ie, US states and major cities) [12]. Additionally, polling around public health issues suffers from response bias, as individuals may overstate their compliance with established public health recommendations [13].
Over the past decade, analyses of social media and web data have been widely adopted to support public health objectives [14]. In this vein, several efforts have emerged over the past few months to track social distancing practices using these data sources. Google has released COVID-19 Community Mobility Reports [15] that use Google data to "chart movement trends over time by geography, across different categories of places such as retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential." The Unacast Social Distancing Scoreboard uses data collected from 127 million monthly active users to measure the implementation of social distancing practices [16]. Researchers at the Institute for Disease Modeling have used data from Facebook's Data for Good program to model the decline in mobility in the greater Seattle area and its effect on the spread of COVID-19 [17]. Using mobile phone data, the New York Times completed an analysis that showed that stay-at-home orders dramatically reduced travel; however, it was found that in states where such orders were not quickly enacted, residents continued to travel widely [18].
Identifying and isolating individuals who have potentially been exposed to a virus can blunt the spread of a pandemic. Contact tracing involves finding people who have had contact with an infected individual during the time the individual was contagious. In the current pandemic, several efforts have been made to develop digital contact tracing tools. Google and Apple announced a joint effort to build a Bluetooth-based contact tracing platform, which enhances the interoperability between Android and iOS devices using apps from public health authorities [19]. Singapore [20] and Australia [21] released similar apps that use Bluetooth to exchange "digital handshakes" to establish contacts. Many countries have developed their own contact tracing responses [22]. Li and Guo [23] presented a review of the development of contact-tracing apps for COVID-19. These efforts provide new and important opportunities to study social distancing and contact tracing in real time.
We present the Twitter Social Mobility Index, a measure of social distancing and travel patterns derived from public Twitter data. We used public geolocated Twitter data to measure how much a user travels in a given week. We computed a metric based on the standard deviation of a user's geolocated tweets each week, and we aggregated these data over an entire population to produce a metric for the United States as a whole, for individual states, and for some US cities. We found that in the United States as a whole, there was a dramatic drop in travel in the later weeks of the study period, with travel between March 16 and April 27, 2020, showing the lowest amount since January 1, 2019, the start of our data set. Additionally, we found that travel reductions were not uniform across the United States but varied from state to state. However, there was no clear correlation between social mobility and confirmed COVID-19 cases at the state level. A key advantage of our approach is that unlike the other travel and social distancing analyses referenced above, we rely on entirely public data, which enables others to replicate our findings and explore different aspects of these data. Additionally, because Twitter contains user-generated content in addition to location information, future analyses can correlate users' attitudes, beliefs, and behaviors with changes in social mobility.
One concern regarding the mining of social media data is user privacy. Unlike the data used by the companies described above, all the data we used is publicly available. Users choose to post their location data to Twitter publicly; therefore, these data are accessible to all users. However, while the location data are public, the potential remains for violating user privacy and producing unintended consequences for users, such as highlighting users who are failing to social distance. To ensure privacy in our index, we aggregated all mobility metrics to produce population-level analyses. None of our work considers the identity of individual users, and we removed identifiable user information from the distributed data aggregations. Furthermore, we caution others who pursue work similar to ours to consider privacy ramifications for users when collecting new data and conducting similar analyses.
There is widespread recognition that real-time tweets from millions of users can yield insights into a variety of population-level trends. Our study follows a tradition of using this insight to develop population-level indices and measures from Twitter data. Previous work includes tracking population-level sentiment as an economic indicator that can track stock price [24], political indices that reflect the popular opinion on major socioeconomic issues [25] or opinions about political candidates [26,27], and measures of pop culture such as reception of entertainment programs [28]. The Twitter Social Mobility Index is a measure of this kind, aggregating Twitter data from millions of people to produce real-time measurements of social distancing.
There is a long line of work on geolocation prediction for Twitter, which requires inferring a location for a specific tweet or user [29][30][31][32]. This includes work on patterns and trends in geotagged Twitter data [33]. Although most of these works focus on inferences of users' current locations and thus are not suitable for tracking user movements, there may be opportunities to combine these methods with our approach.
Many studies have analyzed Twitter geolocation data to study population movements. Hawelka et al [34] demonstrated a method for computing global travel patterns from Twitter, and Dredze et al [35] adapted this method to support efforts in combating the Zika virus epidemic. Several studies have used human mobility patterns from Twitter data [36][37][38][39]. These studies include analyses of urban mobility patterns [40][41][42]. Finally, some of these analyses considered mobility patterns around mass events [43].
Our findings are presented on a website [44], and we will continue to update our analysis during the COVID-19 pandemic.

Data Source
Twitter offers several ways in which a user can indicate their location. If a user is tweeting from a GPS-enabled device, they can attach their exact coordinates to that tweet. Twitter can then display the specific place that corresponds to these coordinates to the user and also provide it in their application programming interface (API). Alternatively, a user can explicitly select a location, which can be a point of interest (eg, a coffee shop), neighborhood, city, state, or country. If the tweet is public, this geolocation information is supplied with the tweet.
We used the Twitter streaming API [45] to download tweets based on location. We used a bounding box that covered the entire United States, including US territories. We used data from this collection starting on January 1, 2019, and ending on April 27, 2020. In total, the data set included 3,768,959 Twitter users and 469,669,925 tweets posted in the United States.

Location Data
We processed the two types of geolocation information described in the previous section.

Coordinates
We processed the exact coordinates (latitude and longitude) provided by the user (the "coordinates" field in the Twitter JavaScript Object Notation [JSON] object). Approximately 8% of our data included coordinates.

Place
The "place" field in the Twitter JSON object indicates a known location in which the tweet was authored. A place can be a point of interest (eg, a specific hotel), a neighborhood (eg, downtown Jacksonville), a city (eg, Kokomo, IN), a state (eg, Arizona), or a country (eg, the United States). The place object contains a unique ID, a bounding box, a country, and a name. More information about the location is available from the Twitter Geo API. A place is provided with a tweet in either of two conditions. First, Twitter can identify the coordinates provided by the user as occurring in a known place. Second, the user can manually select a place when authoring the tweet.
Because coordinates give a more precise location, we used them instead of place when available. If only a place was available, we assumed that the user was in the center of the place, as given by the place's bounding box.
For points of interest and neighborhoods, Twitter only provides the country in the associated metadata. Although in some cases, the city can be parsed from the name and the state inferred, we opted to exclude these places from our analysis for states. The full location details can be obtained from querying the Twitter API; however, due to the magnitude of the data in our analysis, this task would have been too time-consuming. This limitation excluded approximately 1.8% of our data.
We performed analyses for the 50 most populous US cities. For these analyses, we included points of interest that contained the city name in their names, such as "New York City Center." Specifically for New York City, we included places that corresponded to each of the five New York City boroughs (Brooklyn, Manhattan, Queens, Staten Island, and the Bronx).
In summary, for each geolocated tweet, we obtained an associated latitude and longitude.
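The location rule described above (prefer exact coordinates, otherwise fall back to the center of the place's bounding box) can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `tweet_lat_lon` is hypothetical, and the sketch assumes the standard Twitter JSON layout in which "coordinates" is a GeoJSON point in longitude-latitude order and "place" carries a "bounding_box" polygon.

```python
from typing import Optional, Tuple


def tweet_lat_lon(tweet: dict) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) for a tweet, or None if it has no geolocation.

    Hypothetical helper: exact coordinates win; otherwise the center of the
    place bounding box is used, as described in the text.
    """
    coords = tweet.get("coordinates")
    if coords and coords.get("coordinates"):
        lon, lat = coords["coordinates"]  # GeoJSON order is [longitude, latitude]
        return (lat, lon)

    place = tweet.get("place")
    if place and place.get("bounding_box"):
        # The bounding box is a polygon: one ring of [longitude, latitude] corners.
        corners = place["bounding_box"]["coordinates"][0]
        lons = [corner[0] for corner in corners]
        lats = [corner[1] for corner in corners]
        return ((min(lats) + max(lats)) / 2, (min(lons) + max(lons)) / 2)

    return None
```

The exclusion of points of interest and neighborhoods from the state-level analysis is not modeled here; it would require an additional check on the place type.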

Computing Mobility
We defined the Twitter Social Mobility Index as follows. For each user, we collected all locations (coordinates) in a 1-week period, where a week starts on Monday and ends the following Sunday. We denoted the coordinate sequence as $\mathbf{C}_i = (C_1, C_2, \ldots, C_n)$, where $C_j$ is the coordinate at time $j$ in week $i$ and $n$ is the number of coordinates in that week. We computed the centroid of all of the coordinates and considered this the "home" location for the user. We then measured the distance between each location and the centroid for that week. To determine distance, we measured the geodesic distance in kilometers between two adjacent records, $C_j$ and $C_{j+1}$, using geopy [46], resulting in a distance sequence of $\mathbf{D}_i = (d_1, d_2, \ldots, d_{n-1})$. After collecting the distances, we measured the standard deviations of these distances. Formally, we defined Twitter Social Mobility Index $M$ for each user as

$$M = \frac{1}{N} \sum_{i=1}^{N} \sigma(\mathbf{D}_i),$$

where $\sigma(\cdot)$ is the standard deviation operator and $N$ is the number of weeks considered for the measure. We measured mobility in kilometers.
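As a rough illustration of this definition, the sketch below computes the per-user index from weekly coordinate sequences. It rests on several assumptions that are not taken from the paper: it follows the formal definition (distances between adjacent records), it replaces geopy's geodesic distance with a dependency-free haversine approximation, it uses the population standard deviation, and it averages the weekly values over the N weeks. All function names are hypothetical.

```python
import math
from statistics import pstdev

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; haversine stands in for geopy's geodesic


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def weekly_spread(coords):
    """Standard deviation of the distances between adjacent records in one week."""
    dists = [haversine_km(p, q) for p, q in zip(coords, coords[1:])]
    # Assumption: population standard deviation; a week with no distances scores 0.
    return pstdev(dists) if dists else 0.0


def mobility_index(weeks):
    """Average the weekly spreads over the N weeks available for one user."""
    if not weeks:
        return 0.0
    return sum(weekly_spread(week) for week in weeks) / len(weeks)
```

A user who tweets from a single location all week scores 0, while a user whose adjacent records alternate between long and short hops scores high, which matches the behavior described in the text.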
In summary, this measure reflects the area and regularity of travel for a user rather than the raw distance traveled. Therefore, a user who takes a long trip with a small number of check-ins would have a larger social mobility measure than a user with many check-ins who traveled in a small area. Because the measure is sensitive to the number of check-ins, it reflects when people have fewer check-ins during the pandemic.
We aggregated the results by week by taking the mean measure of all users in a given geographic area. We also present results for a 7-day moving average aggregation as a measure of daily movement. We recorded the variance of these measures to study the travel variance in the population, which indicates if travel is reduced overall but not for some users.
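The aggregation step can be sketched as below. The input is assumed to be a mapping from user ID to that user's mobility score for one geographic area and one week; the population variance and the trailing (rather than centered) 7-day window are assumptions made for illustration, since the text does not specify either.

```python
from statistics import mean, pvariance


def aggregate_week(user_scores):
    """Mean and variance of per-user mobility for one area and one week."""
    values = list(user_scores.values())
    return mean(values), pvariance(values)


def moving_average(daily, window=7):
    """Trailing moving average; the first window - 1 days yield no value."""
    return [
        sum(daily[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(daily))
    ]
```

A high variance next to a low mean is the pattern the text points to: travel is reduced overall, but a subset of users still moves widely.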
We produced aggregate scores by geographic area for the United States as a whole, for each US state and territory, and for the 50 most populous cities in the United States. We determined the geographic area of a user based on their centroid location for all times in our collection.
We computed the social mobility index for each day and week between January 1, 2019, and April 27, 2020. We selected the date of March 16, 2020, as the start of social distancing on the national level, although individual states implemented practices at different times. Therefore, we divided the data into two time periods: before social distancing (January 1, 2019, to March 15, 2020) and after social distancing (March 16, 2020, to April 27, 2020). We then computed the group-level reduction in social mobility by considering the average values as follows:

$$\text{reduction} = 1 - \frac{\bar{M}_{\text{after}}}{\bar{M}_{\text{before}}},$$

where $\bar{M}_{\text{before}}$ and $\bar{M}_{\text{after}}$ are the average mobility values before and after social distancing, respectively. We also computed the reduction for each user and then tracked the median value, number of users active in both periods, and proportion of active users who completely reduced their mobility. We conducted a similar analysis for seasonal effects by comparing mobility after social distancing with mobility during the same period in 2019.
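The following sketch shows one way to compute these quantities, taking the reduction to be one minus the ratio of mobility after social distancing to mobility before it. The function names are hypothetical, and skipping users whose mobility before social distancing was zero (for whom the ratio is undefined) is an assumption, not something stated in the text.

```python
from statistics import mean, median


def group_reduction(before, after):
    """Group-level reduction: 1 - (average mobility after / average mobility before)."""
    return 1 - mean(after) / mean(before)


def user_reductions(before_by_user, after_by_user):
    """Per-user reduction for users active in both periods."""
    return {
        user: 1 - after_by_user[user] / before
        for user, before in before_by_user.items()
        # Assumption: users with zero mobility before are skipped (ratio undefined).
        if user in after_by_user and before > 0
    }


def summarize(reductions):
    """Median reduction, number of active users, and share with a 100% reduction."""
    values = list(reductions.values())
    return {
        "median": median(values),
        "active_users": len(values),
        "fully_reduced": sum(v == 1 for v in values) / len(values),
    }
```

Under this definition, a user whose mobility falls to zero after March 16, 2020, has a reduction of exactly 100%, which is the "completely reduced" case counted in the results.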
To address sparse data issues in our data set, we excluded users with fewer than 3 geolocated tweets overall and excluded the weekly record for a user if they had fewer than 3 geolocated tweets in that week. Additionally, due to data loss in our data collection process, we removed two weeks that contained far less data than the other time periods by taking a 99.75% confidence limit on the number of users and records.
Table 1 shows the Twitter Social Mobility Index measured in kilometers for every state and territory in the United States and the United States as a whole. City results are shown in Table 2. We also included the rank of each location by the group-level reduction.
First, we observed that the overall drop in mobility across the United States was large (61.83%). Figure 1 shows the weekly social mobility index for the United States for the entire time period of our data set. The figure reflects a massive drop in mobility starting in March, and the four most recent weeks showed the lowest mobility on record in our data set. Second, every US state and territory saw a drop in mobility, with reductions ranging from 38.54% to 76.80% relative to travel before March 16, 2020. However, the variance by state was high. States that were early adopters of social distancing practices ranked highly on the reduction in travel, such as Washington (3) and Maryland (9). In contrast, the eight states that had not implemented statewide orders as of the start of April [4] ranked poorly, namely Arkansas (45), Iowa (37), Nebraska (35), North Dakota (22), South Carolina (38), South Dakota (46), Oklahoma (50), Utah (14), and Wyoming (53). We observed similar trends in the city analysis; however, the median users in cities had a larger mobility reduction than the users in states.
In addition to the group-level travel reduction, we examined the distribution of user-level travel reduction. For this analysis, we only considered the subgroup of users who had at least two check-ins in both periods. The median values for the reduction distribution were close to 100% for most states.
The median values for seasonal reduction were all smaller but still suggested that people substantially reduced their mobility during the pandemic. Moreover, in the United States, 40% of the 818,213 active users completely reduced their mobility (ie, the mobility reduction was 100%). In contrast, during the same period in 2019, a 31% reduction was seen among 286,217 active users.

Social Mobility Index
The White House announced "Slow the Spread" guidelines for persons to take action to reduce the spread of COVID-19 on March 16, 2020. We computed a moving average of daily mobility data and used an offline change point detection method [48] on this trend. In 2020, 62.26% of the change points occurred after the national announcement date but before the dates on which individual state policies were enacted. This suggests that the national announcement had a larger effect compared to state policies, which is a similar finding to that of a mobile phone-based mobility analysis of four large cities [49]. We also observed that among the 40 states that announced stay-at-home policies, 37 (92.5%) had a more stationary daily mobility time series before the policy announcement date compared to the mobility time series over the entire time period, suggesting a rapid mobility change during the pandemic.
Finally, Figure 2 shows a box plot of the mobility variance across all users in a given time period. The distribution is long-tailed with numerous zeros; therefore, we took the log of 1 plus each mobility index. Although mobility was reduced in general, some users still showed a lot of movement, which suggests that social distancing is not being uniformly practiced. These results clearly demonstrate that our metric can track drops in travel, suggesting that it can be used as part of ongoing pandemic response planning.

Correlations
To investigate the factors that explain our Twitter Social Mobility Index and how well the index tracks COVID-19 cases compared to other relevant factors, we performed a correlation analysis on our data. We computed the daily infection rate by dividing the number of new confirmed COVID-19 cases in each US state [50] by the population of the state. We compared the daily infection rate with the social mobility index and the trends in the state characteristics category from [51]. We first ran a correlation analysis for the following trends: state size in square miles, population density per square mile, unemployment rate (2018), percentage of the population living under the federal poverty line (2018), number of homeless individuals (2019), percentage of the population at risk for serious illness due to COVID-19, and number of all-cause deaths (2016). We selected these measures to track the size of the state, economic activity, and composition of the population, which were studied in a similar correlation analysis of other countries [52]. These measures may change how far people typically travel in a given state.
In Figure 3 and Figure 4, we show the characteristics that have high correlation with either the number of confirmed cases or the mobility index. These characteristics were the size of the state in square miles, the number of homeless individuals (2019), the unemployment rate (2018), and the percentage of the population at risk for serious illness due to COVID-19.
For each day, we computed the correlations between the daily infection rate and the above data by state. Figure 3 shows the correlations by day. We adopted the infection rate because the raw number of confirmed cases is less informative; it correlates most strongly with population size. However, the most significant factors in the early stage were still population-related factors (eg, the number of homeless people). We did not see significant correlations with other factors, including the social mobility index. Starting from mid-March, we observed trends of increasing correlation with the unemployment rate, size of the state, and social mobility index; however, these correlations were not significant (absolute correlation values <.5). A fluctuation occurred in the middle of the period, when states started to report confirmed cases of COVID-19.
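The per-day computation can be illustrated as follows. The text does not name the correlation coefficient, so the Pearson coefficient used here is an assumption, and the function names and the mapping-based inputs (state name to value) are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else float("nan")


def daily_correlation(new_cases, population, factor):
    """Correlate one day's infection rate with a state-level factor across states.

    All three arguments map state name -> value for the same set of states.
    """
    states = sorted(new_cases)
    # Daily infection rate: new confirmed cases divided by state population.
    rates = [new_cases[s] / population[s] for s in states]
    return pearson(rates, [factor[s] for s in states])
```

Repeating `daily_correlation` for each date, with the factor set to the social mobility index or to one of the state characteristics, yields one correlation series per factor of the kind plotted by day.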
We conducted a similar correlation analysis between each data source and the social mobility index, as shown in Figure 4. As expected, geographical state size showed the highest positive correlation. We also observed that the number of people at risk for serious illness due to COVID-19 had a negative correlation at the early stage of the pandemic.
Table 3 demonstrates the effects of various restriction policies on confirmed cases by running a similar correlation analysis on the cumulative confirmed cases for each state on May 10, 2020. The policy types follow the data from [51]. We used the time difference (in days) between May 10, 2020, and the policy release date as the input for the analysis, and we assigned a negative value (-1000) to states that had not announced a policy. The factor with the highest correlation with the social mobility index is the declaration of a state of emergency, which is the broadest type of policy.
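The policy input described above reduces to simple date arithmetic. In this sketch, the function and constant names are hypothetical, and a missing policy is represented by `None`.

```python
from datetime import date

REFERENCE_DATE = date(2020, 5, 10)  # date of the cumulative case counts
NO_POLICY = -1000  # sentinel for states that had not announced the policy


def days_since_policy(policy_date):
    """Days between the policy release date and May 10, 2020."""
    if policy_date is None:
        return NO_POLICY
    return (REFERENCE_DATE - policy_date).days
```

For example, a policy released on March 16, 2020, maps to 55 days. Because the sentinel is far outside the range of real values, states without a policy act as strong leverage points in any correlation computed on this input.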

Discussion
We present the Twitter Social Mobility Index, a measure of social mobility based on public geolocated tweets. Our analysis shows that there was a large drop in mobility overall in the United States. However, the drop was inconsistent and varied significantly by state. It appears that states that were early adopters of social distancing practices experienced more significant drops than states that had not yet implemented these practices.
Several limitations of using geo-tagged tweets as the subject of our study must be kept in mind. First, users on Twitter and other social media platforms are not representative of the general population. Their demographics, such as age, race, ethnicity, education level, income, and political affiliation, do not perfectly mirror the larger population. In the United States, Twitter users are younger, more educated, have higher incomes, and are more likely to identify as Democrats than the general public [53,54]. Therefore, while our sample of users is large, it is highly biased.
Second, not all users are equally likely to use geotagging features on Twitter, and they may use the features in different ways. For example, in a previous study [32], demographic differences were found in the groups of people who used the two different types of geolocation information (ie, coordinates and place). GPS-tagged tweets are posted more often by young people and by women compared to tweets with self-reported locations.
Third, while we obtained access to millions of geotagged tweets, this is still a relatively small proportion of the total number of tweets on the platform, most of which are not geotagged, and it is also small compared to private measures of social mobility computed by companies such as Google and Apple.
Fourth, a small proportion of geotagged tweets report fake geolocation information. However, we believe that this is a negligible problem, as previous work found the rate of fake geolocation to be around 0.22% on social media in general [55] and even lower on Twitter. In our preliminary analysis, we considered mobility data based on GPS from mobile devices alone while excluding place information, as this method has greater precision. However, our results with these limited data were similar to our results with the full data set, except that they were less stable. Therefore, we decided to include all location data.
Despite these limitations, our results produced metrics that align with expected trends given national social distancing guidelines and related statewide policies. This suggests that there is sufficient information in our data to overcome these limitations. Additionally, the public nature of Twitter data has advantages over proprietary and private data sources. More work is needed to compare our mobility trends with those of other data sources.
Our work on this data is ongoing, and there are several directions that warrant further study. First, as states begin to reopen and some states maintain restrictions, tracking changes in population behaviors will be helpful in making policy decisions. Second, we focused on the United States; however, Twitter data provides sufficient coverage to replicate our analysis for many countries. Third, tweet content exists for each user in the data set; this content can reflect the user's attitudes, beliefs, and behaviors. Studying these factors together with users' mobility reduction could yield further insights. Our findings are presented on a website [44], and we will continue to update our analysis during the pandemic.