Text Mining Analysis: some theory and practice in R

Pablo C.

6 years ago

[This article was first published on R - Data Science Heroes Blog, and kindly contributed to R-bloggers].


Introduction

Big Data helps us analyze unstructured data (a.k.a. “text”) with many techniques. This post presents one of them: cosine similarity.

It also covers the work of other analysts, who scraped data from Twitter to spot complaints from airline passengers.

Similarity between two documents

Cosine similarity is a technique for measuring how similar two documents are, based on the words they contain.

This link explains the concept very well, with an example that is replicated in R later in this post.

Quick summary: imagine a document as a vector; you can build it just by counting how many times each word appears. Any two such vectors form an angle.

If the documents have almost the same words, the cosine of the angle between those vectors will be close to 1. If they share few words, this score will be close to 0.
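To make the two extremes concrete, here is a minimal sketch. The helper name `cosine_similarity` and the three toy vectors are assumptions made for illustration; they are not part of the original example.

```r
# Cosine of the angle between two numeric vectors of equal length.
# `cosine_similarity` is a hypothetical helper name used only in this sketch.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy word-count vectors over an assumed three-word vocabulary.
doc_a <- c(1, 2, 0)
doc_b <- c(2, 4, 0)   # same words as doc_a, in the same proportions
doc_c <- c(0, 0, 3)   # shares no words with doc_a

# Evaluates to 1 (up to floating-point rounding): the vectors point the same way.
same_direction <- cosine_similarity(doc_a, doc_b)

# Evaluates to 0: the documents have no words in common.
no_overlap <- cosine_similarity(doc_a, doc_c)
```

Note that `doc_b` is twice as long as `doc_a`, yet the score is still 1: cosine similarity looks at the proportions of words, not at document length.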

I replicated the example in R:

1) Julie loves me more than Linda loves me
2) Jane likes me more than Julie loves me

Word counts per sentence, using the word order me, Julie, likes, loves, Jane, Linda, than, more:

sentence_1 <- c(2, 1, 0, 2, 0, 1, 1, 1)

sentence_2 <- c(2, 1, 1, 1, 1, 0, 1, 1)

crossprod(sentence_1, sentence_2) / sqrt(crossprod(sentence_1) * crossprod(sentence_2))

And the result is… 0.8215838!
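The count vectors above were typed in by hand. As a rough sketch, they could also be built from the raw sentences in base R. The whitespace-only tokenization and the helper name `count_words` are assumptions of this sketch, and the vocabulary ends up in alphabetical order rather than the order used above, which does not affect the cosine.

```r
sentences <- c(
  "Julie loves me more than Linda loves me",
  "Jane likes me more than Julie loves me"
)

# Split each sentence into lower-case words (whitespace tokenization only).
tokens <- strsplit(tolower(sentences), "\\s+")

# Shared vocabulary: every distinct word across both sentences, sorted.
vocabulary <- sort(unique(unlist(tokens)))

# Count each vocabulary word in one sentence; the factor levels keep the zeros.
# `count_words` is a hypothetical helper name.
count_words <- function(words) {
  as.vector(table(factor(words, levels = vocabulary)))
}
counts_1 <- count_words(tokens[[1]])
counts_2 <- count_words(tokens[[2]])

# Same crossprod formula as in the post; drop() turns the 1x1 matrix into a number.
similarity <- drop(crossprod(counts_1, counts_2) /
                     sqrt(crossprod(counts_1) * crossprod(counts_2)))
print(similarity)
```

This should reproduce the 0.8215838 obtained from the hand-built vectors.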

Now imagine we delete the word Julie from sentence 1. The new vector for sentence 1 is:
sentence_1 <- c(2, 0, 0, 2, 0, 1, 1, 1)  # 2nd element (Julie) is now 0

And the new result is…
0.7627701

Conclusion: Deleting the word Julie causes the sentences to be less similar.

Techniques of this kind allow us to order (rank) the data and make decisions quickly.
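As an illustration of ordering by similarity, the sketch below ranks the two versions of sentence 1 against sentence 2, reusing the vectors from the example. The helper name `cosine` and the row labels are assumptions of this sketch.

```r
# Cosine similarity between two count vectors, returned as a plain number.
# `cosine` is a hypothetical helper name.
cosine <- function(a, b) {
  drop(crossprod(a, b) / sqrt(crossprod(a) * crossprod(b)))
}

reference <- c(2, 1, 1, 1, 1, 0, 1, 1)   # sentence 2 from the example

# One row per candidate document; the row names are illustrative labels.
candidates <- rbind(
  sentence_1_without_julie = c(2, 0, 0, 2, 0, 1, 1, 1),
  sentence_1_original      = c(2, 1, 0, 2, 0, 1, 1, 1)
)

# Score every row against the reference, then sort from most to least similar.
scores <- apply(candidates, 1, cosine, b = reference)
ranking <- sort(scores, decreasing = TRUE)
print(ranking)
```

With more documents, the same pattern applies: one row per document, one score per row, and the sorted scores give the order in which to read them.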

Mining Twitter

Airline passengers often have complaints about airlines, and they express their dissatisfaction on Twitter.

In this real case, Jeffrey Breen scrapes data from Twitter and then applies several text and sentiment mining techniques.

Here is the post.
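For readers who have not seen sentiment mining before, one simple technique of this family is to count positive and negative words in each message. The sketch below is not taken from the linked post: the word lists, the messages, and the helper name `score_sentiment` are all made up for illustration, and a real analysis would rely on a much larger published lexicon.

```r
# Hypothetical mini-lexicons, far smaller than any real opinion word list.
positive_words <- c("good", "great", "love", "thanks")
negative_words <- c("bad", "delayed", "lost", "rude")

# Made-up example messages, not real tweets.
messages <- c(
  "great flight and great crew thanks",
  "flight delayed again and my bag is lost",
  "boarding now"
)

# Score = number of positive words minus number of negative words.
# `score_sentiment` is a hypothetical helper name.
score_sentiment <- function(text) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  sum(words %in% positive_words) - sum(words %in% negative_words)
}

scores <- vapply(messages, score_sentiment, integer(1), USE.NAMES = FALSE)
print(scores)
```

A positive score suggests a satisfied message, a negative score a complaint, and zero means no lexicon word was found.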

Do you want to start your own project? Just follow this great tutorial by Yanchang Zhao. I’m aware this is not new, but someone new to the topic may benefit from it.

One step ahead: Analyzing expressions

The previous links showed how to analyze text one word at a time, but what about phrases?

For example, take the sentence: “I don’t like to wait in the airport”.
Analyzing the correlation between the words:

“don’t”,
“like”,
“wait”

is not the same as analyzing the correlation between:

“don’t like”
“wait”

In the first case, the algorithm may show you a correlation between:

“don’t” and “like”
“don’t” and “wait”
“like” and “wait” (really? 😉)

In the second case, the result may be something like:

“don’t like” and “wait”

Much clearer, isn’t it?

If you want to treat groups of words as phrases (the second case), take a look at this answer from stackoverflow.com.
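The linked answer is not reproduced here. As a minimal base R sketch of the idea, two-word phrases (bigrams) can be built by pasting each word to the one that follows it. Whitespace tokenization and a straight apostrophe in the input string are assumptions of this sketch.

```r
sentence <- "I don't like to wait in the airport"

# Single words (unigrams), lower-cased and split on whitespace.
words <- strsplit(tolower(sentence), "\\s+")[[1]]

# Bigrams: each word pasted to the word that follows it.
bigrams <- paste(head(words, -1), tail(words, -1))
print(bigrams)
```

The phrase "don't like" now appears as a single term, so it can be counted or correlated with "wait" as one unit instead of being split into two words with opposite meanings.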

You can follow DSH on Twitter

Thanks for reading 🙂
