Working with Your Facebook Data in R

Posted on June 14, 2018 by in R bloggers | 0 Comments

[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

How to Read in and Clean Your Facebook Data – I recently learned that you can download all of your Facebook data, so I decided to check it out and bring it into R. To access your data, go to Facebook, and click on the white down arrow in the upper-right corner. From there, select Settings, then, from the column on the left, “Your Facebook Information.” When you get the Facebook Information screen, select “View” next to “Download Your Information.” On this screen, you’ll be able to select the kind of data you want, a date range, and format. I only wanted my posts, so under “Your Information,” I deselected everything but the first item on the list, “Posts.” (Note that this will still download all photos and videos you posted, so it will be a large file.) To make it easy to bring into R, I selected JSON under Format (the other option is HTML).

After you click “Create File,” it will take a while to compile – you’ll get an email when it’s ready. You’ll need to reenter your password when you go to download the file.

The result is a Zip file, which contains folders for Posts, Photos, and Videos. Posts includes your own posts (on your and others’ timelines) as well as posts from others on your timeline. And, of course, the file needed a bit of cleaning. Here’s what I did.

Since the post data is a JSON file, I need the jsonlite package to read it.

setwd("C:/Users/slocatelli/Downloads/facebook-saralocatelli35/posts")
library(jsonlite)

FBposts <- fromJSON("your_posts.json")

This creates a large list object, with my data in a data frame. So as I did with the Taylor Swift albums, I can pull out that data frame.

myposts <- FBposts$status_updates

The resulting data frame has 5 columns: timestamp, which is in UNIX format; attachments, any photos, videos, URLs, or Facebook events attached to the post; title, which always starts with the author of the post (you or your friend who posted on your timeline) followed by the type of post; data, the text of the post; and tags, the people you tagged in the post.

First, I converted the timestamp to datetime, using the anytime package.

library(anytime)

myposts$timestamp <- anytime(myposts$timestamp)

Next, I wanted to pull out post author, so that I could easily filter the data frame to only use my own posts.

library(tidyverse)

myposts$author <- word(string = myposts$title, start = 1, end = 2, sep = fixed(" "))

Finally, I was interested in extracting URLs I shared (mostly from YouTube or my own blog) and the text of my posts, which I did with some regular expression functions and some help from Stack Overflow (here and here).

url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

myposts$links <- str_extract(myposts$attachments, url_pattern)

library(qdapRegex)

myposts$posttext <- myposts$data %>%
  rm_between('"','"',extract = TRUE)

There's more cleaning I could do, but this gets me a data frame I could use for some text analysis. Let's look at my most frequent words.

myposts$posttext <- as.character(myposts$posttext)
library(tidytext)
mypost_text <- myposts %>%
  unnest_tokens(word, posttext) %>%
  anti_join(stop_words)

## Joining, by = "word"

counts <- mypost_text %>%
  filter(author == "Sara Locatelli") %>%
  drop_na(word) %>%
  count(word, sort = TRUE)

counts

## # A tibble: 9,753 x 2
##    word         n
##    <chr>    <int>
##  1 happy     4702
##  2 birthday  4643
##  3 today's    666
##  4 song       648
##  5 head       636
##  6 day        337
##  7 post       321
##  8 009f       287
##  9 ð          287
## 10 008e       266
## # ... with 9,743 more rows

These data include all my posts, including writing "Happy birthday" on other's timelines. I also frequently post the song in my head when I wake up in the morning (over 600 times, it seems). If I wanted to remove those, and only include times I said happy or song outside of those posts, I'd need to apply the filter in a previous step. There are also some strange characters that I want to clean from the data before I do anything else with them. I can easily remove these characters and numbers with string detect, but cells that contain numbers and letters, such as "008e" won't be cut out with that function. So I'll just filter them out separately.

drop_nums <- c("008a","008e","009a","009c","009f")

counts <- counts %>%
  filter(str_detect(word, "[a-z]+"),
         !word %in% str_detect(word, "[0-9]"),
         !word %in% drop_nums)

Now I could, for instance, create a word cloud.

library(wordcloud)

counts %>%
  with(wordcloud(word, n, max.words = 50))

In addition to posting for birthdays and head songs, I talk a lot about statistics, data, analysis, and my blog. I also post about beer, concerts, friends, books, and Chicago. Let's see what happens if I mix in some sentiment analysis to my word cloud.

library(reshape2)

## 
## Attaching package: 'reshape2'

counts %>%
  inner_join(get_sentiments("bing")) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red","blue"), max.words = 100)

## Joining, by = "word"