Collection, Management, and Analysis of Twitter Data


As a highly relevant platform for political and social online interactions, researchers increasingly analyze Twitter data. As of 01/2021, Twitter renewed its API, which now includes access to the full history of tweets for academic usage. In this Methods Bites Tutorial, Andreas Küpfer (Technical University of Darmstadt & MZES) presents a walkthrough of the collection, management, and analysis of Twitter data.

After reading this blog post and engaging with the applied exercises, readers will be able to:

  • complete the academic research track application process for the Twitter API.
  • crawl tweets using customized queries based on the R package academictwitteR (Barrie and Ho 2021).
  • apply a selection of pre-processing steps to these tweets.
  • make decisions that minimize reproducibility issues with Twitter data and comply with Twitter's policies.

Note: This blog post provides a summary of Andreas’ workshop in the MZES Social Science Data Lab. The original workshop materials, including slides and scripts, are available from our GitHub. A live recording of the workshop is available on our YouTube channel.

Introduction to social media and Twitter API v2

Social media posts are full of potential for data mining and analysis. Despite the platform's ongoing problems with fake accounts and bots, Twitter can be a very fruitful source for addressing research questions across a wide range of disciplines, including the social sciences (e.g., Barberá 2015; Nguyen et al. 2021; Valle-Cruz et al. 2022; Sältzer 2022). Recognizing this potential also for commercial usage, platform providers increasingly restrict free access to such data.

Twitter in particular is an important data source because of its richness of social and political interactions. For a long time, however, Twitter did not offer a free-of-charge option for a full-archive search of all tweets and users. The free version of API v1.1 was very limited, returning at most 3,200 tweets per user or only tweets from the past seven days. In addition, the range of available metadata1 as well as the implemented query2 options were rather small. These limitations were lifted with the introduction of the redeveloped Twitter API v2 in January 2021.3 For academic purposes, Twitter opened up access to all available tweets and other objects posted on the platform without any monetary costs for the researcher.

While this blog post focuses on the retrieval of textual data, Twitter content certainly offers more. Looking at social network interactions (e.g., followers, likes, …) is just one of the opportunities beyond text to reveal valuable information. This can be, for example, the usage of follower networks to estimate ideological positions (e.g., Barberá 2015) or measuring the importance of a user in a social network based on social interaction data.

Academic research track application process

As Application Programming Interfaces (APIs) are powerful tools which allow access to vast databases full of information, companies offering them are increasingly careful about who is allowed to use them. While the previous version of the Twitter API provided access without a dedicated application (for a detailed description, see this Methods Bites tutorial), the new version requires you to go through an application process in which you share several details with Twitter: information about yourself as well as about the research project in which you intend to work with Twitter data.

Prerequisites

Before getting access, you have to fulfill several formal prerequisites to be eligible for application:

  • You are either a master’s student, a doctoral candidate, a post-doc, a faculty member, or a research-focused employee at an academic institution or university.
  • You have a clearly defined research objective, and you have specific plans for how you intend to use, analyze, and share Twitter data from your research.
  • You will use this access for non-commercial purposes.4

Furthermore, you need a Twitter account, which is also used to log in to the Twitter Developer Platform after a successful application. This portal lets you configure your API projects, keep an eye on your monthly tweet cap5, and more. A more detailed explanation of the prerequisites can be found on the Twitter API academic research track website.

Application

The whole process is initiated by clicking Apply on the official Twitter API academic research track website. You’ll be asked to log in with your personal Twitter account.


Figure 1: Twitter Application Steps for academic research track API access.

The figure above visualizes the steps you have to complete before your application can finally be submitted for Twitter’s internal review:

  1. Basic Info: such as phone number verification and country selection
  2. Academic Profile: such as link to an official profile (department website or similar) and academic role
  3. Project Details: such as information about findings, description of the project itself, and how the API should be used there (e.g. methodologies and how the outcomes will be shared)
  4. Review: provides an overview of the previous steps
  5. Terms: developer agreement and policy

Before starting, it is recommended to carefully read which career levels, project types, and data practices are not allowed to use the API and thus have a high chance of being refused. To give an example, if you plan to share the content of tweets publicly, you most probably won’t get access to the API, as this would violate the Twitter rules. Again, more detailed information can be found in the Twitter API academic research track and Developer Terms information guides.

Step one requests generic information about your Twitter account while in step two you have to provide information about your academic profile. This includes a link to a publicly available record on an official department website and information regarding the academic institution you are working in. The third step is the most sophisticated one: your research project. It asks for short paragraphs about the project in general, what and how Twitter data is used there, and how the outcome of your work is shared with the public. The last two steps, review and terms, do not require any user-specific input but provide an overview of all filled-in information as well as the chance to read the developer agreement and policy.

After submitting your application

After submitting your application, you receive a decision via the e-mail address connected with your Twitter account (usually) within a few days. However, according to Twitter, this process can take up to two weeks.

Your application may be rejected for two common reasons: first, the information you provided indicates a violation of the policy at some point, or second, you do not meet the requirements (as described above). Further explanations of possible next steps after a rejection can be found in the Developer Account Support FAQ.

As of writing this blog post (May 2022), submitting a reapplication for access using the same account is not possible.

Using the API

After your successful application, the Twitter Developer Portal lets you manage projects and environments (which belong to a project), generate API keys (“credentials” for API access), get an overview of your monthly tweet cap usage in real time, check the available API endpoints and their specifics, and more.

After the creation of a project, an environment can be added and API keys generated.


Figure 2: API keys of an environment

The following keys are generated automatically and used depending on the API interface (e.g. the R package) at hand:

  • API key \(\approx\) username (also called consumer key)
  • API key secret \(\approx\) password (also called consumer secret)
  • Bearer token \(\approx\) special access token (also called an authentication token)

It is crucial to keep these keys private and not push them to GitHub or similar platforms! Otherwise, someone else could gain access to your API account. Instead, store them locally or directly within an environment variable. The package we discuss in detail later in this blog post guides you safely through this process.

However, in case you’re plan to use them in other applications, you can store your keys in different ways. The most common way in R is to add them to the .Renviron file. To do this with comfort, install the R package usethis and call its method usethis::edit_r_environ() which lets you edit the .Renviron in the home directory of your computer. In the following you can add tokens (or anything else you want to keep stored locally) using this format:

Key1=value1
Key2=value2
# ...

After saving the file you can access values by calling Sys.getenv("Key1") within your R application. More best practices on managing your secrets can be found on the website APIs for Social Scientists.
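For example, a minimal sketch of this workflow (reusing the TWITTER_BEARER variable name that academictwitteR expects later in this post) could look like this:

## Open the .Renviron file in your home directory for editing
usethis::edit_r_environ()

## Add a line such as TWITTER_BEARER=YOURTOKENHERE, save the file, and restart R.
## Afterwards, read the value back into your session:
bearer <- Sys.getenv("TWITTER_BEARER")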

Postman-as-a-playground

Postman is an easily accessible application to try out different queries, tokens, and more. Without any programming knowledge, you get the API results immediately. Here you can find an official tutorial to use Postman with the Twitter API.

However, there are several reasons why Postman cannot replace a package and programming code.

  • Building flexible queries (e.g., a list of users to retrieve tweets from)
  • Handling large responses which come split up during pagination
  • Handling rate limit restrictions
  • Transforming responses into manageable data structures (e.g., dataframe and comma-separated values)

All of these tasks can be handled by a suitable package in your favorite programming language.

Which package should I choose?

It has to be noted that there are dozens of packages out there, but only some of them have already integrated the academic research track of the Twitter API. A selection of packages is listed below:

  • academictwitteR (R): The package offers customizable functions for all common v2 API endpoints. Additionally, it smoothly guides the developer through all critical steps (e.g. authentication or data processing) of the API interaction.
  • RTwitterV2 (R): Although RTwitterV2 currently covers fewer API endpoints than academictwitteR, it is still a valuable alternative that covers all basic functionalities.
  • rtweet (R): rtweet does not support the academic research track yet; however, it offers much of the basic functionality via the previous API version. A dedicated Methods Bites blog post introducing rtweet in detail can be found here.
  • searchtweets-v2 (Python): This is the official package developed and maintained by Twitter, available for Python. The library offers flexible functions which handle even very specialized requests, but one has to dive deeper into the technical aspects of the API.
  • tweepy (Python): tweepy is the most common package for Python and is backed by a large developer community. As a bonus, it includes many examples of how to use the various features offered by the package.

Which package you pick should depend on your preferred programming language as well as whether the feature list of a package fits your research purpose.

academictwitteR: a code walkthrough using R

In this blog post, academictwitteR (Barrie and Ho 2021) (available for R) is used to demonstrate a simple scenario: retrieving tweets from German members of parliament (MPs). The name academictwitteR is derived from the Twitter API academic research track, for which the package was developed.

We will start by first loading all the needed R packages for the walkthrough:

Code: R packages used in this tutorial

## Save package names as a vector of strings
pkgs <- c("dplyr", "academictwitteR", "quanteda", "purrr")

## Install uninstalled packages
lapply(pkgs[!(pkgs %in% installed.packages())], install.packages)

## Load all packages to library and adjust options
lapply(pkgs, library, character.only = TRUE)


After loading the packages, we need to share our API Bearer Token with academictwitteR. The following code will guide you through the process to store the key in an R-specific environment file (.Renviron) which we introduced earlier in this blog post:

academictwitteR::set_bearer()
## Instructions:
## ℹ 1. Add line: TWITTER_BEARER=YOURTOKENHERE to .Renviron 
##      on new line, replacing YOURTOKENHERE with  actual bearer token
## ℹ 2. Restart R

After restarting R, everything is initialized and we can load a table of Twitter user IDs from German MPs into R:

german_mps <- read.csv("data/MP_de_twitter_uid.csv",
                       colClasses=c("user_id"="character"))
head(german_mps)
##              user_id                   name party
## 1           44608858       Marc Henrichmann   CDU
## 2 819914159915667456      Stephan Pilsinger   CSU
## 3         1391875208   Markus Alexander Uhl   CDU
## 4          569832889 Sigmar Hartmut Gabriel   SPD
## ...

To prevent replication issues with your work, it is recommended to use the Twitter user ID (e.g. 819914159915667456) instead of the user handle (e.g. @StephPilsinger), as the handle can be changed by the user over time, which would make it impossible to recrawl the tweets of these users later. In case you only have access to the handle, there is a v2 API endpoint to retrieve a user object from a handle: /2/users/by/username/:username
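As a minimal sketch (assuming the get_user_id() helper of academictwitteR, which wraps this endpoint, is available in your package version), such a lookup could look like this:

## Look up the stable user ID for a handle
## (sketch; assumes get_user_id() is available in your academictwitteR version)
academictwitteR::get_user_id("StephPilsinger")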

Databases and lists of Twitter users can be retrieved, for example, from the Comparative Legislators Database (Göbel and Munzert 2021), the Twitter Parliamentarian Database (Vliet, Törnberg, and Uitermark 2020), or publicly shared Twitter lists.6

Afterward, we are ready to crawl our first tweets using a simple wrapper function (get_tweets_from_user()) that takes a single user_id. get_all_tweets(), which is called inside this function, is the heart of our code. It manages the generation of queries for the API, handles rate limits, and stores the data in JSON files (which can be transformed later).

In case you look for specific content, tweet types, or even topics, you can add another parameter to the package function: query. It allows you to narrow down your search by using specific strings. To give an example, one could look for English-language tweets (excluding retweets) that contain the keyword putin or selenskyj and have a geo-location attached. This can be achieved by simply assigning the following string to the query parameter:

(putin OR selenskyj) -is:retweet lang:en has:geo
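For illustration, such a query string could be passed directly to get_all_tweets() (the dates, data path, and tweet limit below are illustrative values, not part of this walkthrough):

# Sketch: a keyword-based search instead of a user-based one
# (dates, data_path, and n are illustrative values)
academictwitteR::get_all_tweets(
  query = "(putin OR selenskyj) -is:retweet lang:en has:geo",
  start_tweets = "2022-02-24T00:00:00Z",
  end_tweets = "2022-03-01T00:00:00Z",
  data_path = "data/raw_query/",
  n = 500)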

Beyond that, there exist many more parameters to customize the crawl. All of them are documented in the official academictwitteR CRAN documentation. However, in this tutorial, I restrict the search to a Twitter user ID as well as a start and end date for the tweets we are interested in:

# function to retrieve tweets in a specific time period of a single user
# (list of user IDs would be possible but one should keep
# the max. query string of 1024 characters in mind)

get_tweets_from_user <- function(user_id) {
  # Another option is to add "query" parameter
  academictwitteR::get_all_tweets(
    users = user_id,
    start_tweets = "2021-01-01T00:00:00Z",
    end_tweets = "2021-09-30T00:00:00Z",
    data_path = "data/raw/",
    n = 100)
}

The function is then called for each user_id in the dataframe by using walk() from the purrr package (the purrr package allows you to work with functions and vectors):

purrr::walk(german_mps[["user_id"]], get_tweets_from_user)

To import the tweets into a workable format, call bind_tweets() from academictwitteR. It consolidates all available files in the given data_path and organizes them into the requested format (in our case tidy). In addition, only a relevant fraction of columns is selected in the code below by using select() from the dplyr-package.

# concatenate all retrieved tweets into one dataframe and select which columns
# should be kept
# Another option: set parameter "user" to TRUE to retrieve user information
tweets_df <- academictwitteR::bind_tweets(data_path = "data/raw/",
                                          output_format = "tidy") %>%
  dplyr::select(
    tweet_id,
    text,
    author_id,
    user_username,
    created_at,
    sourcetweet_type,
    sourcetweet_text,
    lang
  )

Finally, I store the tweets in a single .csv-file:

write.csv(tweets_df, "data/raw/tweets_german_mp.csv", row.names = FALSE)

Congratulations! You successfully applied to the academic research track, got admitted, and crawled a selection of tweets using the R package academictwitteR.

Preparing for methods: working with textual data

You are now ready to move on! The usual pre-processing steps applied to textual data (lowercasing, stopword removal, stemming, …) can be used on your tweets, depending on the method at hand. Additional fine-tuning of these steps could involve the removal of, e.g., party IDs, URLs, user mentions, or similar. Such steps can easily be done using regular expressions (regex), which extract patterns in texts that can then be used for further analysis, removal, or replacement. Many tutorials available on the web (e.g. the RegexOne interactive tutorial) make it straightforward to learn how to bring such expressions into action within your domain.
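As a minimal illustration (the text_clean column and the patterns below are just examples and may need refinement for your data), URLs and user mentions could be stripped like this:

# Sketch: regex-based cleaning of tweet texts before tokenization
# (column name and patterns are illustrative)
tweets_df$text_clean <- tweets_df$text %>%
  gsub("https?://\\S+", "", .) %>%  # remove URLs
  gsub("@\\w+", "", .) %>%          # remove user mentions
  gsub("\\s+", " ", .)              # collapse repeated whitespace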

The following code provides a first starting point for applying pre-processing steps. The code relies on the package quanteda which is an R package that is often used when working with text data in R. If you want to dive deeper into text mining and text analysis, Methods Bites has more blog posts on these topics.

tweet_corpus <- quanteda::corpus(tweets_df[["text"]],
                                 docnames = tweets_df[["tweet_id"]])

The code first transforms the dataframe of tweets into another data format, a corpus, keeping the tweet_id as an identifier attached to each tweet text. Having the tweets in the corpus format makes it easy to apply pre-processing steps after tokenizing. The following list shows a selection of common methods; a code example applying them follows the list. However, it is important to note that the decision on which methods to apply heavily depends on the subsequent text processing approach:

  • remove_punct: removes all punctuation
  • remove_numbers: removes all numbers
  • dfm_tolower(): applies lowercasing
  • dfm_remove(stopwords("german")): removes German stopwords which occur very frequently
  • dfm_wordstem(language = "german"): applies German stemming (e.g., wurden \(\rightarrow\) wurd)
# "2020 wurden in Berlin ca. 18.800 Miet-
# in Eigentumswohnungen umgewandelt. #Umwandlungsverbot"
dfm <-
  quanteda::dfm(tweet_corpus %>%
                  quanteda::tokens(
                    remove_punct = TRUE,
                    remove_numbers = TRUE)) %>%
  quanteda::dfm_tolower() %>% # removes capitalization
  quanteda::dfm_remove(
    stopwords("german")) %>% # removes German stopwords
  quanteda::dfm_wordstem(
    language = "german") # transforms words to their German wordstems
# "wurd berlin ca miet- eigentumswohn umgewandelt #umwandlungsverbot"

The function dfm() (called above) returns a sparse document-feature matrix, which can be a fruitful starting point for a first word-frequency analysis:

head(dfm)

## Document-feature matrix of: 6 documents, 87 features (79.77% sparse)
## and 0 docvars.
##                       features
## docs                  leb plotzlich mehr schablon gut bos pass 😉 #esk #miet
##   44608858            1   1         1    1        1   1   1    1  1    1
##   819914159915667456  0   0         0    0        0   0   0    0  0    0
##   1391875208          0   0         0    0        0   0   0    0  0    0
##   569832889           0   0         1    0        0   0   0    0  0    1
## ...
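
For instance, a quick look at the most frequent features across all tweets (a minimal illustration; the actual counts depend on your crawled data):

# Ten most frequent features across all documents
quanteda::topfeatures(dfm, 10)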

You are finally at the step of applying further methods to tackle your research question and getting deeper insights into your crawled tweets. There is much more to explore: You can find further text-as-data tutorials on our blog.

Reproducibility of research based on Twitter data

As reproducible results are one of the major requirements of research projects, it is worth discussing how this affects work with Twitter data. The Twitter developer agreement includes a clear statement of what researchers are allowed to publish along with their work:

Academic researchers are permitted to distribute an unlimited number of Tweet IDs and/or User IDs if they are doing so on behalf of an academic institution and for the sole purpose of non-commercial research. For example, you are permitted to share an unlimited number of Tweet IDs for the purpose of enabling peer review or validation of your research.7

This means that the content of tweets must not be shared publicly. As tweets can be deleted or accounts can be suspended, this certainly poses a problem for subsequent researchers attempting to replicate the findings, as they won’t be able to recrawl such tweets via the API. However, there are also platforms like polititweet.org, which track public figures and on that basis justify the publication even of deleted tweets:


Figure 3: polititweet.org section of the landing page

To conclude, this does not make the decision of how to share which kind of data any easier. Still, one has to choose the best available option for sharing data without violating the Twitter rules, which, as of today, means at least sharing tweet IDs with the community.
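In practice, this can be as simple as exporting only the identifiers from the crawled data alongside your replication materials (a minimal sketch; the file path is illustrative):

# Share only the tweet IDs, not the tweet content
tweets_df %>%
  dplyr::select(tweet_id) %>%
  write.csv("data/shared/tweet_ids.csv", row.names = FALSE)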

Conclusion

This blog post provides a first glimpse into the academic research track of the Twitter API and the information richness of Twitter data. As there will certainly be further updates and changes to the API in the future, it is reassuring that plenty of easy-to-use packages build on active user communities that keep them up to date with the current API version. While there exist a lot of powerful packages for the data gathering step, researchers still need to think carefully about how to further process the crawled information depending on their research question and method, as well as how to make their research accessible to the community in an open science approach.

About the author

Andreas Küpfer is a graduate of the Mannheim Master in Data Science and a doctoral researcher at the Technical University of Darmstadt. His interdisciplinary research interests include text-as-data, applying machine learning technologies, and substantial inference in the fields of political communication and political competition.

References

Barberá, Pablo. 2015. “Birds of the Same Feather Tweet Together: Bayesian Ideal Point Estimation Using Twitter Data.” Political Analysis 23 (1): 76–91. https://doi.org/10.1093/pan/mpu011.

Barrie, Christopher, and Justin Chun-ting Ho. 2021. “academictwitteR: An R Package to Access the Twitter Academic Research Product Track v2 API Endpoint.” Journal of Open Source Software 6 (62): 3272. https://doi.org/10.21105/joss.03272.

Göbel, Sascha, and Simon Munzert. 2021. “The Comparative Legislators Database.” British Journal of Political Science, 1–11. https://doi.org/10.1017/S0007123420000897.

Nguyen, Thu T., Shaniece Criss, Eli K. Michaels, Rebekah I. Cross, Jackson S. Michaels, Pallavi Dwivedi, Dina Huang, et al. 2021. “Progress and Push-Back: How the Killings of Ahmaud Arbery, Breonna Taylor, and George Floyd Impacted Public Discourse on Race and Racism on Twitter.” SSM - Population Health 15: 100922. https://doi.org/10.1016/j.ssmph.2021.100922.

Sältzer, Marius. 2022. “Finding the Bird’s Wings: Dimensions of Factional Conflict on Twitter.” Party Politics 28 (1): 61–70. https://doi.org/10.1177/1354068820957960.

Valle-Cruz, David, Vanessa Fernandez, Asdrubal Lopez-Chau, and Rodrigo Sandoval Almazan. 2022. “Does Twitter Affect Stock Market Decisions? Financial Sentiment Analysis During Pandemics: A Comparative Study of the H1N1 and the COVID-19 Periods.” Cognitive Computation 14 (January). https://doi.org/10.1007/s12559-021-09819-8.

Vliet, Livia van, Petter Törnberg, and Justus Uitermark. 2020. “The Twitter Parliamentarian Database: Analyzing Twitter Politics Across 26 Countries.” PLoS ONE 15.


  1. Metadata is explanatory information, such as topical indicators or the language of a tweet, which enriches the actual tweet, image, or main object retrieved from the API.

  2. Queries are filter operators used to narrow down the set of tweets to be retrieved.

  3. API stands for Application Programming Interface and allows, simply speaking, the communication between software.

  4. Twitter Developer Platform product page

  5. There is a maximum number of tweets that can be retrieved via the API, which is reset once a month.

  6. Use such lists with caution, as they may not come from verified sources.

  7. You can find a detailed description of the content redistribution of Twitter data in the official developer policies.
