Word Vectors with tidy data principles

Posted on October 29, 2017 by Rstats on Julia Silge in R bloggers | 0 Comments

[This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Last week I saw Chris Moody’s post on the Stitch Fix blog about calculating word vectors from a corpus of text using word counts and matrix factorization, and I was so excited! This blog post illustrates how to implement that approach to find word vector representations in R using tidy data principles and sparse matrices.

Word vectors, or word embeddings, are typically calculated using neural networks; that is what word2vec is. (GloVe embeddings are trained a little differently than word2vec.) By contrast, the approach from Chris’s post that I’m implementing here uses only counting and some linear algebra. Deep learning is great, but I am super excited about this approach because it allows practitioners to find word vectors for their own collections of text (no need to rely on pre-trained vectors) using familiar techniques that are not difficult to understand. And it doesn’t take too long computationally!

Getting some data

Let’s download half a million observations from… the Hacker News corpus.

via GIPHY

I know, right? But it’s the dataset that Chris uses in his blog post and it gives me an opportunity to use the bigrquery package for the first time.

library(bigrquery)
library(tidyverse)

project <- "my-first-project-184003"

sql <- "#legacySQL
SELECT
  stories.title AS title,
  stories.text AS text
FROM
  [bigquery-public-data:hacker_news.full] AS stories
WHERE
  stories.deleted IS NULL
LIMIT
  500000"

hacker_news_raw <- query_exec(sql, project = project, max_pages = Inf)

Next, let’s clean this text up to take care of some of the messy ways it has gotten encoded.

library(stringr)

hacker_news_text <- hacker_news_raw %>%
    as_tibble() %>%
    mutate(title = na_if(title, ""),
           text = coalesce(title, text)) %>%
    select(-title) %>%
    mutate(text = str_replace_all(text, "'|"|/", "'"),   ## weird encoding
           text = str_replace_all(text, "<a(.*?)>", " "),               ## links 
           text = str_replace_all(text, ">|<", " "),              ## html yuck
           text = str_replace_all(text, "<[^>]*>", " "),                ## mmmmm, more html yuck
           postID = row_number())

Unigram probabilities

First, let’s calculate the unigram probabilities, how often we see each word in this corpus. This is straightforward using unnest_tokens() from the tidytext package and then just count() and mutate() from dplyr.

library(tidytext)

unigram_probs <- hacker_news_text %>%
    unnest_tokens(word, text) %>%
    count(word, sort = TRUE) %>%
    mutate(p = n / sum(n))

unigram_probs
## # A tibble: 318,503 x 3
##     word       n          p
##    <chr>   <int>      <dbl>
##  1   the 1101045 0.04190098
##  2    to  761658 0.02898539
##  3     a  669067 0.02546178
##  4    of  541937 0.02062376
##  5   and  532592 0.02026813
##  6    is  431841 0.01643399
##  7  that  406506 0.01546985
##  8     i  401545 0.01528105
##  9    in  368045 0.01400619
## 10    it  333441 0.01268931
## # ... with 318,493 more rows

Skipgram probabilities

Next, we need to calculate the skipgram probabilities, how often we find each word near each other word. We do this by defining a fixed-size moving window that centers around each word. Do we see word1 and word2 together within this window? I take the approach here of using unnest_tokens() once with token = "ngrams" to find all the windows I need, then using unnest_tokens() again to tidy these n-grams. After that, I can use pairwise_count() from the widyr package to count up cooccuring pairs within each n-gram/sliding window.

I’m not sure what the ideal value for window size is here for the skipgrams. This value determines the sliding window that we move through the text, counting up bigrams that we find within the window. When this window is bigger, the process of counting skipgrams takes longer, obviously. I experimented a bit and windows of 8 words seem to work pretty well. Probably more work needed here! I’d be happy to be pointed to more resources on this topic.

Finding all the skipgrams is a computationally expensive part of this process. Not something that just runs instantly!

library(widyr)

tidy_skipgrams <- hacker_news_text %>%
    unnest_tokens(ngram, text, token = "ngrams", n = 8) %>%
    mutate(ngramID = row_number()) %>% 
    unite(skipgramID, postID, ngramID) %>%
    unnest_tokens(word, ngram)

tidy_skipgrams
## # A tibble: 183,434,296 x 2
##    skipgramID   word
##         <chr>  <chr>
##  1        1_1      i
##  2        1_1    bet
##  3        1_1 taking
##  4        1_1      a
##  5        1_1    few
##  6        1_1 months
##  7        1_1    off
##  8        1_1   from
##  9        1_2    bet
## 10        1_2 taking
## # ... with 183,434,286 more rows
skipgram_probs <- tidy_skipgrams %>%
    pairwise_count(word, skipgramID, diag = TRUE, sort = TRUE) %>%
    mutate(p = n / sum(n))

Normalized skipgram probability

We now know how often words occur on their own, and how often words occur together with other words. We can calculate which words occurred together more often than expected based on how often they occurred on their own. When this number is high (greater than 1), the two words are associated with each other, likely to occur together. When this number is low (less than 1), the two words are not associated with each other, unlikely to occur together.

normalized_prob <- skipgram_probs %>%
    filter(n > 20) %>%
    rename(word1 = item1, word2 = item2) %>%
    left_join(unigram_probs %>%
                  select(word1 = word, p1 = p),
              by = "word1") %>%
    left_join(unigram_probs %>%
                  select(word2 = word, p2 = p),
              by = "word2") %>%
    mutate(p_together = p / p1 / p2)

What are the words most associated with Facebook on Hacker News?

normalized_prob %>% 
    filter(word1 == "facebook") %>%
    arrange(-p_together)
## # A tibble: 1,746 x 7
##       word1        word2     n            p           p1           p2 p_together
##       <chr>        <chr> <dbl>        <dbl>        <dbl>        <dbl>      <dbl>
##  1 facebook     facebook 52883 3.757919e-05 0.0003340144 3.340144e-04  336.83488
##  2 facebook    messenger   348 2.472923e-07 0.0003340144 1.122642e-05   65.94840
##  3 facebook   zuckerburg    23 1.634403e-08 0.0003340144 8.752799e-07   55.90454
##  4 facebook     statuses    40 2.842440e-08 0.0003340144 1.522226e-06   55.90454
##  5 facebook      myspace   330 2.345013e-07 0.0003340144 1.331948e-05   52.70999
##  6 facebook     newsfeed    32 2.273952e-08 0.0003340144 1.370003e-06   49.69292
##  7 facebook       hiphop    29 2.060769e-08 0.0003340144 1.484170e-06   41.57004
##  8 facebook mashable.com    24 1.705464e-08 0.0003340144 1.293892e-06   39.46203
##  9 facebook   friendster    30 2.131830e-08 0.0003340144 1.636393e-06   39.00317
## 10 facebook        gtalk    22 1.563342e-08 0.0003340144 1.217781e-06   38.43437
## # ... with 1,736 more rows

What about the programming language Scala?

normalized_prob %>% 
    filter(word1 == "scala") %>%
    arrange(-p_together)
## # A tibble: 445 x 7
##    word1        word2     n            p           p1           p2 p_together
##    <chr>        <chr> <dbl>        <dbl>        <dbl>        <dbl>      <dbl>
##  1 scala        scala  9159 6.508478e-06 5.449568e-05 5.449568e-05 2191.56919
##  2 scala      odersky    47 3.339867e-08 5.449568e-05 1.103614e-06  555.32856
##  3 scala         akka    74 5.258514e-08 5.449568e-05 3.006396e-06  320.96285
##  4 scala          sbt    31 2.202891e-08 5.449568e-05 1.446115e-06  279.52988
##  5 scala       groovy    88 6.253369e-08 5.449568e-05 5.594180e-06  205.12353
##  6 scala           mu    23 1.634403e-08 5.449568e-05 1.826671e-06  164.18624
##  7 scala       kotlin    92 6.537613e-08 5.449568e-05 7.382795e-06  162.49359
##  8 scala      clojure   472 3.354080e-07 5.449568e-05 5.719764e-05  107.60518
##  9 scala constructors    23 1.634403e-08 5.449568e-05 2.816118e-06  106.49918
## 10 scala    idiomatic    54 3.837294e-08 5.449568e-05 7.573073e-06   92.98028
## # ... with 435 more rows

Looks good!

Cast to a sparse matrix

We want to do matrix factorization, so we should probably make a matrix. We can use cast_sparse() from the tidytext package to transform our tidy data frame to a matrix.

pmi_matrix <- normalized_prob %>%
    mutate(pmi = log10(p_together)) %>%
    cast_sparse(word1, word2, pmi)

What is the type of this object?

class(pmi_matrix)
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"

The dgCMatrix class is a class of sparse numeric matrices in R. Text data like this represented in matrix form usually has lots and lots of zeroes, so we want to make use of sparse data structures to save us time and memory and all that.

Reduce the matrix dimensionality

We want to get information out of this giant matrix in a more useful form, so it’s time for singular value decomposition. Since we have a sparse matrix, we don’t want to use base R’s svd function, which casts the input to a plain old matrix (not sparse) first thing. Instead we will use the fast SVD algorithm for sparse matrices in the irlba package.

library(irlba)

pmi_svd <- irlba(pmi_matrix, 256, maxit = 1e3)

The number 256 here means that we are finding 256-dimensional vectors for the words. This is another thing that I am not sure exactly what the best number is, but it will be easy to experiment with. Doing the matrix factorization is another part of this process that is a bit time intensive, but certainly not slow compared to training word2vec on a big corpus. In my experimenting here, it takes less time than counting up the skipgrams.

Once we have the singular value decomposition, we can get out the word vectors! Let’s set some row names, using our input, so we can find out what is what.

word_vectors <- pmi_svd$u
rownames(word_vectors) <- rownames(pmi_matrix)

Now we can search our matrix of word vectors to find synonyms. I want to get back to a tidy data structure at this point, so I’ll write a new little function for tidying.

library(broom)

search_synonyms <- function(word_vectors, selected_vector) {
    
    similarities <- word_vectors %*% selected_vector %>%
        tidy() %>%
        as_tibble() %>%
        rename(token = .rownames,
               similarity = unrowname.x.)
    
    similarities %>%
        arrange(-similarity)    
}

facebook <- search_synonyms(word_vectors, word_vectors["facebook",])
facebook
## # A tibble: 69,411 x 2
##        token similarity
##        <chr>      <dbl>
##  1  facebook 0.07732632
##  2   twitter 0.05418014
##  3    google 0.05011162
##  4    social 0.04259758
##  5        fb 0.03549460
##  6   account 0.03347402
##  7 instagram 0.02990053
##  8    photos 0.02680315
##  9     users 0.02601248
## 10   friends 0.02547189
## # ... with 69,401 more rows
haskell <- search_synonyms(word_vectors, word_vectors["haskell",])
haskell
## # A tibble: 69,411 x 2
##          token similarity
##          <chr>      <dbl>
##  1     haskell 0.03949794
##  2        lisp 0.03634116
##  3   languages 0.03591368
##  4    language 0.03128933
##  5  functional 0.02969273
##  6 programming 0.02685119
##  7     clojure 0.02609575
##  8       scala 0.02604054
##  9      python 0.02541462
## 10      erlang 0.02382765
## # ... with 69,401 more rows

That’s… pretty darn amazing. Let’s visualize the most similar words vector to Facebook and Haskell from this dataset of Hacker News posts.

facebook %>%
    mutate(selected = "facebook") %>%
    bind_rows(haskell %>%
                  mutate(selected = "haskell")) %>%
    group_by(selected) %>%
    top_n(15, similarity) %>%
    ungroup %>%
    mutate(token = reorder(token, similarity)) %>%
    ggplot(aes(token, similarity, fill = selected)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~selected, scales = "free") +
    coord_flip() +
    theme(strip.text=element_text(hjust=0, family="Roboto-Bold", size=12)) +
    scale_y_continuous(expand = c(0,0)) +
    labs(x = NULL, title = "What word vectors are most similar to Facebook or Haskell?",
         subtitle = "Based on the Hacker News corpus, calculated using counts and matrix factorization")

We can also do the familiar WORD MATH that is so fun with the output of word2vec; you have probably seen examples such as King - Man + Woman = Queen and such. We can just add and subtract our word vectors, and then search the matrix we built!

If the iPhone is an important product associated with Apple, as discussed on Hacker News, what is an important product associated with Microsoft?

mystery_product <- word_vectors["iphone",] - word_vectors["apple",] + word_vectors["microsoft",]
search_synonyms(word_vectors, mystery_product)
## # A tibble: 69,411 x 2
##        token similarity
##        <chr>      <dbl>
##  1   windows 0.03847298
##  2     phone 0.02785154
##  3    iphone 0.02642861
##  4    mobile 0.02564263
##  5   android 0.02475795
##  6       ios 0.02233122
##  7 microsoft 0.02095453
##  8         7 0.02093067
##  9         t 0.02085600
## 10         8 0.02061549
## # ... with 69,401 more rows

We even see some mobile phone and Android terms in this list, below Windows.

What about an important product associated with Google?

mystery_product <- word_vectors["iphone",] - word_vectors["apple",] + word_vectors["google",]
search_synonyms(word_vectors, mystery_product)
## # A tibble: 69,411 x 2
##       token similarity
##       <chr>      <dbl>
##  1   google 0.10097056
##  2   search 0.06054602
##  3      app 0.04070347
##  4    phone 0.03854913
##  5   engine 0.03560017
##  6 facebook 0.03447097
##  7  android 0.03273318
##  8   chrome 0.03020063
##  9   iphone 0.02991398
## 10    gmail 0.02895055
## # ... with 69,401 more rows

Google itself is at the top of the list, which is something that often happens to me when I try this word vector arithmetic no matter how I train them (usually one of the positive vectors in the “equation”). Does anyone know what that means? Anyway, “search”, “app”, “phone” and “engine” are next on the list.

mystery_product <- word_vectors["iphone",] - word_vectors["apple",] + word_vectors["amazon",]
search_synonyms(word_vectors, mystery_product)
## # A tibble: 69,411 x 2
##      token similarity
##      <chr>      <dbl>
##  1  amazon 0.04648808
##  2     aws 0.04216195
##  3      s3 0.03398973
##  4   cloud 0.03201470
##  5     ec2 0.03157737
##  6    book 0.02970601
##  7  iphone 0.02736457
##  8   books 0.02597030
##  9 storage 0.02410764
## 10     app 0.02395586
## # ... with 69,401 more rows

For Amazon, we get AWS, S3, cloud, and EC2. Nice!

The End

I am so excited about this approach! Like Chris said in his blog post, for all the applications in the kind of work I do (non-academic, industry NLP) these type of word vectors will work great. No need for neural networks! This approach is still not lightning fast (I have to sit and wait for parts of it to run) but I can easily implement it with the tools I am familiar with. I would imagine there are vast swaths of data science practitioners for whom this is also true. I am considering the idea of bundling some of these types of functions up into an R package, and Dave has just built a pairwise_pmi() function in the development version of widyr that simplifies this approach even more. Tidy word vectors, perhaps? Maybe I’ll also look into the higher rank extension of this technique to get at word and document vectors!

Let me know if you have feedback or questions.

To leave a comment for the author, please follow the link and comment on their blog: Rstats on Julia Silge.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Word Vectors with tidy data principles

Getting some data

Unigram probabilities

Skipgram probabilities

Normalized skipgram probability

Cast to a sparse matrix

Reduce the matrix dimensionality

The End

Related

Getting some data

Unigram probabilities

Skipgram probabilities

Normalized skipgram probability

Cast to a sparse matrix

Reduce the matrix dimensionality

The End

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)