
Once again, a Twitter trend sent me to my R prompt… Here is a bit of context. My summary: Taylor Swift apparently plays the bad girl in her new album and a fan of hers asked a question…

The tweet was then quoted by many people mentioning badass women, and I decided to have a look at these heroes!

I was a bit lazy and asked Mike Kearney, rtweet maintainer, how to find tweets quoting a tweet, to which Bob Rudis answered. Now that I even had the code, it was no trouble at all getting the data. I added the filtering steps myself, see, I’m not that lazy. I also removed the link to the quoted tweet that was at the end of each tweet.
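To see what that link removal does on a single string, here is a minimal sketch on an invented tweet. The text and the t.co URL are made up for illustration; only the regular expression is the one used in this post.

```r
library("stringr")

# Invented example tweet ending with a (made-up) t.co link
toy_tweet <- "Rosa Parks https://t.co/abc123"

# Drop everything from the t.co link to the end of the string,
# then trim the whitespace left behind
cleaned <- str_replace(toy_tweet, "https://t\\.co/.*$", "")
cleaned <- trimws(cleaned)
cleaned
```

The `.*$` part means that anything following the link is dropped too, which is fine here since the quoted-tweet link sits at the very end.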

question_tweet <- "928857792982781952"
badass <-  rtweet::search_tweets(question_tweet, n = 18000, include_rts = FALSE)
badass <- dplyr::mutate(badass, text = stringr::str_replace(text, "https://t\\.co/.*$", ""))
badass <- dplyr::mutate(badass, text = trimws(text))
readr::write_csv(badass, path = "data/2017-12-03-badderb_badass.csv")

I obtained 15653 tweets. Not bad!

library("magrittr")
set.seed(20171015)
indices <- sample.int(n = nrow(badass), size = 7)
badass$text[indices]

## [1] "Carmina Barrios"
## [2] "Anyone"
## [3] "Shirley Temple"
## [4] "So a lot of people have shared some bad ass women in repsonse to the tweet below. I hope someone is compiling those responses for one kick ass book."
## [5] "Mary Bowers was a slave with a photographic memory who pretended to be \"slow-witted\" in order to spy on confederate soldiers in the house worked, and pass intel she gathered to Union forces during the Civil War."
## [6] "Ramona Quimby, age 8"
## [7] "Snow White. Sleeping Beauty. Cinderella, Elsa, Ariel bec the category example is obvs mediocre Disney caricatures allowed by the patriarchy, right?"


Out of 15653 tweets, 570 contained the word “mother”. I haven’t looked for the word “mum”, nor have I checked whether the woman mentioned is actually from the tweet author’s family. Here are a few of the personal stories (or not) identified with this quick and dirty method.
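The filtering step itself is not shown here, so the following is only a sketch of how such a keyword filter could look, on an invented toy data frame rather than the real tweets. The names `toy`, `mothers_toy` and `mothers_any_case` are hypothetical, and whether the original filter was case-sensitive is an assumption.

```r
library("dplyr")
library("stringr")

# Toy stand-in for the tweets data frame (invented texts)
toy <- data.frame(
  text = c("My mother", "Rosa Parks", "My grandmother.", "Mother Teresa"),
  stringsAsFactors = FALSE
)

# Case-sensitive match: "grandmother" matches too, "Mother" does not
mothers_toy <- filter(toy, str_detect(text, "mother"))

# Case-insensitive variant
mothers_any_case <- filter(
  toy,
  str_detect(text, regex("mother", ignore_case = TRUE))
)

nrow(mothers_toy)
nrow(mothers_any_case)
```

Note that a plain substring match also catches “grandmother”, which is consistent with the sample of tweets shown in this post.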

set.seed(20171015)
indices <- sample.int(n = 15)
mothers$text[indices]

## [1] "My grandmother struggled through poverty her entire life, in a family prone to depression, addiction, and suicide. She had so many Caesarian sections that when she reached menopause she didnt have a belly button anymore."
## [2] "Mom married my dad. He had been married 2X before with a total of 10 kids. His first wife left him and the kid. 2nd wife died after their divorce. He had custody of all 10, judge asked my mother if she would take them in, she did. Raised his 10 and had 3 with dad. 13 in all"
## [3] "Each of my grandmothers raised 11 kids "
## [4] "My mother"
## [5] "My grandmother."
## [6] "My mother in law, who isnt a bitch in any way, shape, or form, and raised three girls on her own (and who are all bad asses in their own way) without the help of a deadbeat ex."
## [7] "[email protected] for starters. And my great grandmother who drove a car when she was 12. My dog, too. She is definitely a bad bitch."
## [8] "My grandmother escaped from the Tsar with nothing but the clothes on her back."
## [9] "My mother"
## [10] "Almost any woman alive today,@xnulz \n\nAnd - Heres another one, for sure (PLUS shes a good mother):\n\nTonya Harding; @ITonyaMovie \n\n"
## [11] "all of the single mothers doing the most they can for their children"
## [12] "My grandmother was married off at the age of 14 to an older man who had already been married once, dealt with an abusive marriage but stuck around for the kids and went back to school after having 5 children and worked for 20 years to support her family"
## [13] "my grandmother was an army nurse in WW2.\ntaught me how to tourniquet a leg and bandage it using only gauze"
## [14] "My wife birthed a goddamn child. My mother and grandmother both birthed multiple. This is too easy."
## [15] "My grandmother worked for the OSS in London during WW2 as a code breaker."

Can we talk about that belly button thing?! I’m also happy to see a diversity of things they were recognized for.

Names of the badder b…..s

Quite a few of the tweets from this trend contained the name of someone. In order to extract these names, I resorted to a language processing method called entity extraction, the entity here being a person. For that, I could have used an extractor module of the Monkeylearn platform via my own monkeylearn package. Instead, I chose to illustrate a different method: using the cleanNLP package that I know from the excellent R Journal paper presenting it. Among other things, it serves as an interface between R and the Python library spaCy and also as an interface between R and the coreNLP Java library. Installing these tools is the painful part of the setup, but 1) you only need to install one of them 2) there are detailed instructions here 3) once your tool is installed, using the package is a breeze (and well independent of any rate limit contrary to monkeylearn use). I am at that breeze stage, you can be jealous.

There were a few tweets with infuriating encoding issues, BOM or something like that, and I decided to just ignore them by using purrr::possibly. I obviously did this to illustrate the use of this purrr function, not out of laziness.

library("cleanNLP")
init_spaCy()

# we need to remove characters like "\u0098"
badass <- dplyr::mutate(badass, text = enc2native(text))

get_entities_with_text <- function(x){
  obj <- run_annotators(x, as_strings = TRUE)
  entities <- get_entity(obj)
  entities$text <- x
  entities
}

possibly_get_entities <- purrr::possibly(get_entities_with_text,
otherwise = NULL)
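
As an aside, what purrr::possibly does is easier to see on a toy function. The sketch below uses a hypothetical `count_chars()` helper that fails on non-character input, as a stand-in for an annotator choking on a badly encoded tweet; it is not the cleanNLP call itself.

```r
library("purrr")

# Hypothetical stand-in for a function that fails on some inputs
count_chars <- function(x) {
  if (!is.character(x)) stop("not a string")
  data.frame(text = x, n_char = nchar(x), stringsAsFactors = FALSE)
}

# Wrapped version: returns NULL instead of throwing an error
possibly_count_chars <- possibly(count_chars, otherwise = NULL)

inputs <- list("Rosa Parks", 42, "Marie Curie")
results <- map(inputs, possibly_count_chars)

# Drop the NULL left by the failing input before binding the rows
combined <- do.call(rbind, compact(results))
combined
```

The failing input simply disappears from the combined result, which is the behaviour relied on here to skip the problematic tweets.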

entities <- purrr::map_df(badass$text, possibly_get_entities)
readr::write_csv(entities, path = "data/2017-12-03-badderb_entities.csv")

I got at least one entity for 7504 out of 15653 tweets, and at least one person for 4664. I am very satisfied with this.

So, who are you, badder b…..s?

We get these kinds of entities: NORP, CARDINAL, LANGUAGE, GPE, DATE, ORG, PERSON, TIME, LOC, MONEY, WORK_OF_ART, EVENT, FAC, QUANTITY, LAW, PRODUCT, ORDINAL, PERCENT. I’m more interested in PERSON and no, I’m not shouting. I chose to look at the top 12 in order to get a top 10 excluding Taylor Swift herself.

entities %>%
  dplyr::filter(entity_type == "PERSON") %>%
  dplyr::group_by(entity) %>%
  dplyr::summarise(n = dplyr::n()) %>%
  dplyr::arrange(- n) %>%
  head(n = 12) %>%
  knitr::kable()

| entity | n |
|:---|---:|
| Taylor Swift | 213 |
| Taylor | 145 |
| Rosa Parks | 140 |
| Harriet Tubman | 109 |
| Dora | 90 |
| Rose West | 85 |
| Lyudmila Pavlichenko | 77 |
| Joan | 71 |
| Marie Curie | 57 |
| Myra Hindley | 50 |
| Nancy Wake | 45 |
| Hillary Clinton | 41 |

At that point I did feel like bursting out laughing though. Dora! And I checked, we’re talking about Dora the Explorer! Joan is Joan of Arc. Interestingly, in that top 10 we’re mixing really bad persons, e.g. Myra Hindley was a serial killer, and really badass persons, like Rosa Parks. My husband will be happy to see Marie Curie in this list, since he’s a big fan of hers, having even guided a few tours about her life in Paris.

Looking only at the most frequently mentioned women obviously makes us lose, well, wrongly written names, and most importantly personal stories of badass mothers and the like, and of native women for instance, although I have the impression of having read about a few, probably because I follow Auriel Fournier.

Writing history?

I saw someone say they’d use the tweets as a basis for history lessons. In order to get a view of a person, one could concatenate the tweets about them. Take Marie Curie for instance.

entities %>% dplyr::filter(entity_type == "PERSON", entity == "Marie Curie") %>% dplyr::summarise(text = toString(text)) %>% .$text
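
On a toy data frame, the same pipeline shows what toString() produces. This is a minimal sketch: the tweets below are invented, and only the column names (entity_type, entity, text) follow the entities table of this post.

```r
library("dplyr")

# Invented stand-in for the entities table: one row per (tweet, entity)
toy_entities <- data.frame(
  entity_type = c("PERSON", "PERSON", "PERSON", "PERSON"),
  entity = c("Marie Curie", "Marie Curie", "Rosa Parks", "Harriet Tubman"),
  text = c("Marie Curie, two Nobel Prizes",
           "Marie Curie and Rosa Parks",
           "Marie Curie and Rosa Parks",
           "Harriet Tubman"),
  stringsAsFactors = FALSE
)

# Keep the tweets where Marie Curie was detected and
# collapse them into a single comma-separated string
curie_text <- toy_entities %>%
  filter(entity_type == "PERSON", entity == "Marie Curie") %>%
  summarise(text = toString(text)) %>%
  pull(text)

curie_text
```

Since a tweet naming several women appears once per detected entity, the concatenated text for one woman can mention others as well.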



Doing this, one also gets the names of many other women. Moreover, if writing history lessons, one should have several sources, right? What about Wikidata, like in this other blog post of mine? It should have data for at least the most famous badass women.

# add a function for getting a silent answer
quietly_query <- purrr::quietly(WikidataQueryServiceR::query_wikidata)
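
For readers who haven’t met purrr::quietly before, here is a small sketch on a hypothetical noisy function (not the Wikidata query). It shows that the wrapped function returns a list, which is why the actual value has to be pulled out of its result element afterwards.

```r
library("purrr")

# Hypothetical noisy function: prints a message, then returns a value
noisy_sqrt <- function(x) {
  message("computing the square root")
  sqrt(x)
}

# The wrapped version captures output, messages and warnings
# instead of printing them
quiet_sqrt <- quietly(noisy_sqrt)
res <- quiet_sqrt(4)

names(res)
res$result
```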

# function for getting someone's data
get_wikidata <- function(name, pb = NULL){
  if (!is.null(pb)) pb$tick()$print()
  Sys.sleep(1)
  item <- WikidataR::find_item(name, language = "en")
  # sometimes people have no Wikidata entry so I need this condition
  if(length(item) > 0){
    entity_code <- item[[1]]$id
    query <- paste0("PREFIX entity: <http://www.wikidata.org/entity/>
    #partial results
    SELECT ?propUrl ?propLabel ?valUrl ?valLabel ?picture
    WHERE
    {
    hint:Query hint:optimizer 'None' .
    { BIND(entity:", entity_code, " AS ?valUrl) .
    BIND(\"N/A\" AS ?propUrl ) .
    BIND(\"identity\"@en AS ?propLabel ) .
    }
    UNION
    { entity:", entity_code, " ?propUrl ?valUrl .
    ?property ?ref ?propUrl .
    ?property rdf:type wikibase:Property .
    ?property rdfs:label ?propLabel
    }
    ?valUrl rdfs:label ?valLabel
    FILTER (LANG(?valLabel) = 'en') .
    OPTIONAL{ ?valUrl wdt:P18 ?picture .}
    FILTER (lang(?propLabel) = 'en' )
    }
    ORDER BY ?propUrl ?valUrl
    LIMIT 200")
    results <- quietly_query(query)
    results <- results$result
    results$name <- name
    results
  }else{
    NULL
  }
}

Yes, I just had to replace all occurrences of “sv” with “en” to get a function for this post. I’d like to try to write an automatic text about badass women.

get_a_string <- function(prop, prep, wikidata){
  answer <- dplyr::filter(wikidata, propLabel == prop) %>%
    .$valLabel %>%
    unique() %>%
    toString
  # no value for this property: return an empty string to be dropped later
  if(answer == ""){
    return("")
  }else{
    return(paste(prep, answer))
  }
}

tell_me_about <- function(name){
  wikidata <- get_wikidata(name)
  # the last two property labels are inferred from the matching
  # prepositions in words below
  questions <- c("occupation", "country of citizenship",
                 "field of work", "award received")

  words <- c("a", "from",
             "known from her work in", "and who was awarded")

  strings <- purrr::map2_chr(questions, words,
                             get_a_string,
                             wikidata = wikidata)

  strings <- strings[strings != ""]

  sentence <- paste(name, "was", toString(strings))
  sentence <- paste0(sentence, ".")
  return(sentence)
}
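
To see how the sentence gets assembled without querying Wikidata at all, here is a self-contained sketch on a hand-made data frame. The `describe_prop()` helper and the `toy_wikidata` table are hypothetical; only the propLabel and valLabel column names follow the query results, and “award received” is assumed to be the label of the awards property.

```r
# Hand-made stand-in for a Wikidata result (two properties only)
toy_wikidata <- data.frame(
  propLabel = c("occupation", "country of citizenship"),
  valLabel = c("criminal", "United Kingdom"),
  stringsAsFactors = FALSE
)

# Collapse the values of one property and prefix them with a
# preposition, or return "" when the property is absent
describe_prop <- function(prop, prep, wikidata) {
  values <- unique(wikidata$valLabel[wikidata$propLabel == prop])
  answer <- toString(values)
  if (answer == "") "" else paste(prep, answer)
}

questions <- c("occupation", "country of citizenship", "award received")
words <- c("a", "from", "and who was awarded")

strings <- unname(mapply(describe_prop, questions, words,
                         MoreArgs = list(wikidata = toy_wikidata)))

# Properties without any value are dropped from the sentence
strings <- strings[strings != ""]
sentence <- paste0(paste("Myra Hindley", "was", toString(strings)), ".")
sentence
```

Dropping the empty strings is what keeps the sentence from ending with a dangling “and who was awarded” when someone has no recorded award.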


Ok, let’s try our automatic history writing function. It won’t work for Dora and Joan, sadly.

tell_me_about("Lyudmila Pavlichenko")

## [1] "Lyudmila Pavlichenko was a historian, sniper, military personnel, from Soviet Union, and who was awarded Medal \"For the Defence of Sevastopol\", Medal \"For the Defence of Odessa\", Hero of the Soviet Union, Order of Lenin, Medal \"For Battle Merit\", Gold Star, Medal \"For the Victory over Germany in the Great Patriotic War 1941–1945\"."

tell_me_about("Myra Hindley")

## [1] "Myra Hindley was a criminal, from United Kingdom."

tell_me_about("Harriet Tubman")

## [1] "Harriet Tubman was a writer, from United States of America, and who was awarded New Jersey Hall of Fame, National Women's Hall of Fame, Maryland Women's Hall of Fame."


Not many details, clearly, but not too bad for a quickly written history, hum, bot, if I can call it that.

So, happy, Nutella?

This was my contribution to the meme following Nutella’s viral tweet. I am thankful for the badass women I did end up discovering thanks to the tweets, and am waiting for someone to replace the lyrics of all Taylor Swift’s songs with gems from this Twitter trend.