
#AskNASA: What’s the Optimal Time for Aliens to Invade Earth?

[This article was first published on The Exactness of Mind, and kindly contributed to R-bloggers.]

This post was originally published on SmartCat, 22 Feb 2017.

My inaugural blog post as a Data Science Consultant for SmartCat. The code that accompanies the analyses presented here is available in the respective GitHub repository. On how to use R to estimate the optimal time of day for aliens to invade Earth, and a few more interesting things.

A few days ago, NASA announced a press conference citing a “… Discovery Beyond Our Solar System”, and I always tend to get excited about such news. I learned about the #askNASA hashtag, used by the public and the media to approach NASA with their questions on Twitter. And when I hear the word “hashtag”, social media analytics is what first comes to mind. Then I thought: what could I find out by studying tweets on #askNASA? Nothing much, unfortunately, because the number of such tweets doesn’t exactly skyrocket (many questions are posted there, of course, but the volume of tweets is far below what one needs for a serious social media analytics study). Ok, then: I will study the whole twitterverse of NASA and NASA-related accounts, in order to discover the relative position of #AskNASA in the semantic space of what’s being tweeted from NASA. I used {TwitteR} to access the Twitter Search API for retrospective tweets from almost all of the NASA Twitter accounts listed on https://www.nasa.gov/socialmedia/ (I haven’t included the personal accounts of NASA astronauts; only the main account, and then everything found under the following categories: NASA Centers & Facilities, Organizations & Programs, Missions & Topics, plus NASAAstronauts and NASAPeople; 141 Twitter accounts in total). I also accessed the Twitter Search API to collect all recent tweets with the #askNASA hashtag. In total, I produced a collection of 1,575 tweets with the #askNASA hashtag; these were cleaned of all tweets posted on behalf of NASA and NASA-related accounts. From the NASA accounts alone, I was able to get 255,241 tweets – following a several-hours-long exercise with the userTimeline() function from {TwitteR}.
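For reference, a minimal sketch of that collection step follows; the API credentials are placeholders, the account vector is truncated to three names for brevity, and the complete scraping code is in the GitHub repo.

library(twitteR)

# - authenticate (the keys are placeholders; register a Twitter app first):
setup_twitter_oauth(consumer_key = "...",
                    consumer_secret = "...",
                    access_token = "...",
                    access_secret = "...")
# - recent tweets w. the #askNASA hashtag:
dT <- twListToDF(searchTwitter("#askNASA", n = 2000))
# - retrospective tweets from the NASA accounts (141 in the full list):
nasaAccounts <- c("NASA", "NASAJPL", "NASAKennedy")
tweetsDF <- do.call(rbind, lapply(nasaAccounts, function(acc) {
  # - 3200 is the most that userTimeline() can return per account:
  twListToDF(userTimeline(acc, n = 3200, includeRts = TRUE))
}))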

While {TwitteR} was scraping away, I was planning the analysis, and my thoughts started wandering around all the cool work that people at NASA do… What will they announce today? A new exoplanet, I can bet. People are crazy about exoplanets, and the aliens, and the SETI program, and Astrobiology, and all that stuff. A Twin Earth! However, nobody realizes, I thought, that the potential discovery of this Earth by some technologically advanced alien civilization could pose a real existential threat for us humans: a true global catastrophic risk. And with all their antennas, golden records with pictographs and Bach on their interstellar probes… nobody seems to worry about Sir Stephen Hawking’s well-reasoned warning on how intelligent aliens could destroy humanity (not even he himself does, cf. “Stephen Hawking: Intelligent Aliens Could Destroy Humanity, But Let’s Search Anyway”)! Anyways, it was probably around the third or fourth glass of wine in the Belgrade cafe from where I had left {TwitteR} to do the job from my netbook when I realized what I want to do this time with R: I will estimate the optimal time during the day for aliens to invade our planet by analyzing the daily oscillation in the sentiment and volume of tweets from NASA accounts. Assumption: if aliens somehow figure out where we live, it will be because of these guys with the big radio antennas. Next: whatever alien civilization decides to invade Earth will certainly be technologically advanced enough to immediately discover the very source of our quest for them. Finally, given their technological supremacy, they will be able to analyze all the information necessary to ensure the success of their mission: including our (precious!) tweets.

And here it is, with a little help from {tm.plugin.sentiment}, {dplyr}, {tidyr}, and {ggplot2}:

library(dplyr)
library(tidyr)
library(ggplot2)

# - per-hour tweet counts and proportions of positive/neutral/negative tweets:
emoHours <- tweetsDF %>%
  group_by(Hour) %>%
  summarise(tweets = n(),
            positive = length(which(Polarity > 0)),
            neutral = length(which(Polarity == 0)),
            negative = length(which(Polarity < 0)))
emoHours$positive <- emoHours$positive/emoHours$tweets
emoHours$neutral <- emoHours$neutral/emoHours$tweets
emoHours$negative <- emoHours$negative/emoHours$tweets
emoHours$Hour <- as.numeric(emoHours$Hour)
# - rescale the hourly counts to a 0-1 range by the maximum hourly count:
emoHours$Volume <- emoHours$tweets/max(emoHours$tweets)
# - long format for {ggplot2}:
emoHours <- emoHours %>%
  gather(key = Measure,
         value = Value,
         positive:Volume)
ggplot(emoHours, aes(x = Hour, y = Value, color = Measure)) +
  geom_path(size = .25) +
  geom_point(size = 1.5) +
  geom_point(size = 1, color = "white") +
  ggtitle("Optimal Time to Invade Earth") +
  scale_x_continuous(breaks = 0:23, labels = as.character(0:23)) +
  theme_bw() +
  theme(plot.title = element_text(size = 12),
        axis.text.x = element_text(size = 8, angle = 90))

Figure 1. Optimal time to invade Earth with {tm.plugin.sentiment}, {dplyr}, and {ggplot2}
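One assumption hidden in the chunk above is the Hour column of tweetsDF, which is not derived there. A minimal sketch of obtaining it, assuming the created timestamps that {TwitteR} returns (POSIXct, in GMT):

# - hour of the day (GMT) in which each tweet occurred:
tweetsDF$Hour <- format(tweetsDF$created, format = "%H", tz = "GMT")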

The tweetsDF data frame becomes available after running the previous chunks of code that you will find in the GitHub repo for this blog post. The Polarity column comes from the application of {tm.plugin.sentiment} functions over a {tm} pre-processed corpus of all 255,241 tweets that were collected from NASA’s accounts:

### --- Sentiment Analysis
library(tm)
library(tm.plugin.sentiment)
# - as {tm} VCorpus:
nasaCorpus <- VCorpus(VectorSource(tweetsDF$text))
# - Term-Document Matrix w. standard {tm} pre-processing:
nasaTDM <- TermDocumentMatrix(nasaCorpus,
                              control = list(tolower = TRUE,
                                             removePunctuation = TRUE,
                                             removeNumbers = TRUE,
                                             stopwords = stopwords("english"),
                                             stemming = TRUE,
                                             wordLengths = c(3, Inf),
                                             weighting = weightTf))

# - {tm.plugin.sentiment} polarity score
# - NOTE: that would be (p - n)/(p + n), w. p and n as the counts of
# - positive and negative words, respectively
nasaPolarity <- polarity(nasaTDM)
# - how many tweets are non-neutral:
sum(nasaPolarity != 0)
tweetsDF$Polarity <- nasaPolarity

The optimal time for an alien invasion is obviously somewhere between 7:00 and 9:00 in the morning (NOTE for the aliens: all times are GMT). All tweets were categorized as neutral, positive, or negative with respect to their polarity, which is given as (p - n)/(p + n), p being the count of positive and n of negative words in the respective tweet. Then, instead of going for a time series analysis, I simply grouped all tweets per hour of the day in which they occurred, recalculating the counts of positive, negative, and neutral tweets into proportions of the total tweets per hour. Thus, the vertical axis presents the proportion of tweets per hour. The Volume variable is simply a rescaling of the plain counts of tweets per hour by the maximum hourly count found in the data set, so that it can be conveniently presented on a 0 to 1 scale in the chart. And what we see is that between 7:00 and 9:00, approximately, an anomaly in the hourly distribution of tweets from NASA’s accounts takes place: a sudden increase in the proportion of neutral and negative tweets, accompanied by a drop in the volume of tweets. So that’s when we’re moody and not relaxed, and probably tweeting less given the pressure of the daily work routine before lunch: the ideal time for an alien civilization to invade.

Of course, technologically advanced aliens, who know their statistics very well, might as well ask whether the described phenomenon is simply a by-product of the increased measurement error related to the quite obvious drop in sample sizes for the respective, invasion-critical hours…
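For anyone who wants to eyeball that concern, here is a quick sketch that attaches standard errors to the hourly proportion of negative tweets, assuming a simple binomial model for the proportions:

# - binomial standard errors for the hourly proportion of negative tweets:
errHours <- tweetsDF %>%
  group_by(Hour) %>%
  summarise(tweets = n(),
            pNegative = length(which(Polarity < 0))/n())
errHours$se <- sqrt(errHours$pNegative * (1 - errHours$pNegative)/errHours$tweets)
# - the invasion-critical morning hours have the fewest tweets,
# - hence the widest error margins:
errHours[order(errHours$tweets), ]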

Putting aside the question of alien invasion, I am really very interested to learn from NASA today what it is that was discovered beyond the limits of the Solar System. To illustrate how popular the discoveries of potentially habitable exoplanets are, the following analysis was conducted. In the first step, we simply concatenate all tweets originating from the same account, while treating the tweets with the #askNASA hashtag as a separate group (i.e. as if it were a Twitter account in itself). Given that I was interested in the account level of analysis here, and provided that individual tweets offer too little information for typical BoW approaches in text mining, this step is really justified. Then, I produced a typical Term-Document Matrix from all available tweets, preserving all terms beginning with “@” or “#”, and then cleaned the matrix of everything else. Finally, the term counts were turned into binary (present/absent) information in order to compute the Jaccard similarity coefficients across the accounts:

tweetTexts <- tweetsDF %>%
  group_by(screenName) %>%
  summarise(text = paste(text, collapse = " "))
# - accNames is to be used later:
accNames <- tweetTexts$screenName
accNames <- append(accNames, "askNASA")
tweetTexts <- tweetTexts$text
# - dT is the data frame of #askNASA tweets collected earlier:
askNasaText <- paste(dT$text, collapse = " ")
tweetTexts <- append(tweetTexts, askNasaText)
tweetTexts <- enc2utf8(tweetTexts)
tweetTexts <- VCorpus(VectorSource(tweetTexts))

# - Term-Doc Matrix for this:
# - protect "#" and "@" before stripping the remaining punctuation:
removePunctuationSpecial <- function(x) {
  x <- gsub("#", "HASHCHAR", x)
  x <- gsub("@", "MONKEYCHAR", x)
  x <- gsub("[[:punct:]]+", "", x)
  x <- gsub("HASHCHAR", "#", x)
  x <- gsub("MONKEYCHAR", "@", x)
  return(x)
}

tweetTexts <- tm_map(tweetTexts,
            content_transformer(removePunctuationSpecial),
            lazy = TRUE)

tweetsTDM <- TermDocumentMatrix(tweetTexts,
                                control = list(tolower = FALSE,
                                               removePunctuation = FALSE,
                                               removeNumbers = TRUE,
                                               stopwords = stopwords("english"),
                                               stemming = FALSE,
                                               wordLengths = c(3, Inf),
                                               weighting = weightTf))
# - store TDM object:
saveRDS(tweetsTDM, "tweetsTDM.Rds")

# - keep only the mention and hashtag features:
tweetsTDM <- t(as.matrix(tweetsTDM))
w <- which(grepl("^@|^#", colnames(tweetsTDM)))
tweetsTDM <- tweetsTDM[, w]
# - keep only mention and hashtag features w. Freq > 10
wK <- which(colSums(tweetsTDM) > 10)
tweetsTDM <- tweetsTDM[, wK]
# - transform to binary for Jaccard distance
wPos <- which(tweetsTDM > 0, arr.ind = T)
tweetsTDM[wPos] <- 1
# - Jaccard distances for accounts and #asknasa
# - ({proxy} supplies the Jaccard method for dist()):
library(proxy)
simAccounts <- dist(tweetsTDM, method = "Jaccard", by_rows = T)
simAccounts <- as.matrix(simAccounts)

The following {igraph} network works this way: each account (where #askNASA, be reminded, is not really an account but represents the information on all tweets with the respective hashtag) points to the account most similar to it with respect to the Jaccard distance computed from the presence and absence of the mentions and hashtags used. So this is more of a proxy of a “social distance” between accounts than a true distributional semantics measure. It can be readily observed that #askNASA points to @PlanetQuest as its nearest neighbor in this analysis. The Jaccard distance was used since I am not really into using typical Term-Document Count Matrices in analyzing tweets; the information they convey is simply too sparse for a typical approach to make any sense.
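The construction of the network itself is not shown here; a minimal sketch with {igraph}, assuming the simAccounts matrix and the accNames vector from the previous chunks (and that the row order of simAccounts follows accNames):

library(igraph)

# - each account points to its nearest neighbour w. respect to the
# - Jaccard distance; exclude the zero self-distances on the diagonal first:
diag(simAccounts) <- Inf
nearest <- apply(simAccounts, 1, which.min)
edgeList <- cbind(accNames, accNames[nearest])
nasaGraph <- graph_from_edgelist(edgeList, directed = TRUE)
plot(nasaGraph,
     vertex.size = 3,
     vertex.label.cex = .5,
     edge.arrow.size = .25)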

Figure 2. Social network of NASA Twitter accounts + #askNASA, {igraph}


Goran S. Milovanović, PhD
Data Science Consultant, SmartCat
