Site icon R-bloggers

Visualizing Twitter history with streamgraphs in R

[This article was first published on Juuso's blog on Open Data Science and R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I was exploring ways to visualize my Twitter history, and ended up creating this interactive streamgraph of my 20 most used hashtags in Twitter:

The graph shows how my Twitter activity has varied a lot. The top three hashtags are #datascience, #rstats and #opendata (no surprises there). There are also event-related hashtags that show up only once, such as #tomorrow2015 and #iccss2015, and annually repeating ones, such as #apps4finland.

How this was made?

Twitter has quite a strict policy for obtaining data, but they do allow one to download the full personal Twitter history, i.e. all tweets as a convenient csv file (instructions here), so that’s what I did.

The visualization was created with the streamgraph R package that uses the great htmlwidgets framework for easy creation of javascript visualizations from R. The plots are designed for daily data, but this ended up being too messy, so I aggregated the data on monthly level instead.

Embedding the streamgraph htmlwidget into this Jekyll blog required a bit of hazzle. As pointed out in the comments here, the widget must be first created as a standalone html file and then embedded as an iframe. Hopefully there will be a more straightforward way to include htmlwidgets to Jekyll blogs in the future!

Some problems:

The script for producing the streamgraph from the Twitter data is here. It is also printed below, with the help of read_chunk(). See more details from the rmarkdown source for this post.

# Script for producing a streamgraph of tweet hashtags

# Load packages
library("readr")
library("dplyr")
library("lubridate")
library("streamgraph")
library("htmlwidgets")

# Read my tweets
tweets_df <- read_csv("files/R/tweets.csv") %>%
  select(timestamp, text) %>%
  mutate(text = tolower(text))

# Pick hashtags with regexp
hashtags_list <- regmatches(tweets_df$text, gregexpr("#[[:alnum:]]+", tweets_df$text))

# Create a new data_frame with (timestamp, hashtag) -pairs
hashtags_df <- data_frame()
for (i in which(sapply(hashtags_list, length) > 0)) {
  hashtags_df <- bind_rows(hashtags_df, data_frame(timestamp = tweets_df$timestamp[i],
                                                   hashtag = hashtags_list[[i]]))
}

# Process data for plotting
hashtags_df <- hashtags_df %>%
  # Pick top 20 hashtags
  filter(hashtag %in% names(sort(table(hashtag), decreasing=TRUE))[1:20]) %>%
  # Group by year-month (daily is too messy)
  # Need to add '-01' to make it a valid date for streamgraph
  mutate(yearmonth = paste0(format(as.Date(timestamp), format="%Y-%m"), "-01")) %>%
  group_by(yearmonth, hashtag) %>%
  summarise(value = n())

# Create streamgraph
sg <- streamgraph(data = hashtags_df, key = "hashtag", value = "value", date = "yearmonth",
                 offset = "silhouette", interpolate = "cardinal",
                 width = "700", height = "400") %>%
  sg_legend(TRUE, "hashtag: ") %>%
  sg_axis_x(tick_interval = 1, tick_units = "year", tick_format = "%Y")

# Save it for viewing in the blog post
# For some reason I can not save it to files/R/ direclty so need to use file.rename()
saveWidget(sg, file="twitter_streamgraph.html", selfcontained = TRUE)
file.rename("twitter_streamgraph.html", "files/R/twitter_streamgraph.html")

To leave a comment for the author, please follow the link and comment on their blog: Juuso's blog on Open Data Science and R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.