
Natural language processing tutorial

[This article was first published on Category: R | Vik's Blog, and kindly contributed to R-bloggers.]

Introduction

This will serve as an introduction to natural language processing. I adapted it from slides for a recent talk at Boston Python.

We will go from tokenization to feature extraction to creating a model using a machine learning algorithm. The goal is to provide a reasonable baseline on top of which more complex natural language processing can be done, and provide a good introduction to the material.

The examples in this post are written in R, but are easily translatable to other languages. You can get the source of the post from GitHub.


Training set example

Let’s say that I wanted to give a survey today and ask the following question:

Why do you want to learn about machine learning?

The responses might look like this:

```
1 I like solving interesting problems.
2 What is machine learning?
3 I'm not sure.
4 Machien lerning predicts eveyrthing.
```

Let’s say that the survey also asks people to rate their interest on a scale of 0 to 2.

We would now have text and associated scores:
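The table pairing responses with scores did not survive the formatting; as a minimal sketch, here is how that data could be assembled in R (the score values below are assumptions for illustration, not the actual survey results):

```r
# Survey responses; the misspellings in the last one are intentional
responses <- c(
  "I like solving interesting problems.",
  "What is machine learning?",
  "I'm not sure.",
  "Machien lerning predicts eveyrthing."
)

# Hypothetical interest scores on the 0-2 scale (assumed values)
scores <- c(2, 0, 1, 0)

survey <- data.frame(text = responses, score = scores,
                     stringsAsFactors = FALSE)
```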

First steps

What is the algorithm doing?

Tokenization

Let’s tokenize the first survey response:

```
[1] "I" "like" "solving" "interesting" "problems"
```

In this very simple case, we have just made each word a token (similar to `string.split(' ')` in Python).
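A whitespace split plus punctuation stripping reproduces that token list in base R:

```r
# Split on spaces, then strip punctuation from each token
response <- "I like solving interesting problems."
tokens <- gsub("[[:punct:]]", "", strsplit(response, " ")[[1]])
tokens
# yields "I" "like" "solving" "interesting" "problems"
```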

It is also often useful to extract n-grams during tokenization. N-grams are sequences of adjacent words; a 2-gram (bigram) is two consecutive words. N-grams give the bag of words model some information about word ordering.
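A simple n-gram extractor can be sketched in a few lines of base R:

```r
# Extract n-grams: sequences of n adjacent tokens joined with spaces
ngrams <- function(tokens, n) {
  sapply(seq_len(length(tokens) - n + 1), function(i) {
    paste(tokens[i:(i + n - 1)], collapse = " ")
  })
}

ngrams(c("I", "like", "solving", "interesting", "problems"), 2)
# "I like" "like solving" "solving interesting" "interesting problems"
```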

Bag of words model

Bag of words overview
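The core idea is to represent each document as a vector of word counts over a shared vocabulary. A minimal base-R sketch of a document-term matrix, assuming the responses have already been tokenized, lowercased, and stripped of punctuation:

```r
# Tokenized survey responses (lowercased, punctuation stripped)
docs <- list(
  c("i", "like", "solving", "interesting", "problems"),
  c("what", "is", "machine", "learning"),
  c("im", "not", "sure"),
  c("machien", "lerning", "predicts", "eveyrthing")
)

# Shared vocabulary across all documents
vocab <- sort(unique(unlist(docs)))

# One row per document, one column per word, entries are counts
dtm <- t(sapply(docs, function(tokens) {
  as.integer(table(factor(tokens, levels = vocab)))
}))
colnames(dtm) <- vocab
```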

Minimizing distances between vectors

Preserving information

Old features:

New features with lowercasing and spell correction:
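The corrected feature list was not preserved above; as a sketch, lowercasing plus a lookup-table spell corrector can merge the misspelled variants with their correct forms. The table below is an assumption covering just this sample data's misspellings; a real system would use an edit-distance spell corrector:

```r
# Known misspellings in the sample data (illustrative only)
corrections <- c(machien = "machine", lerning = "learning",
                 eveyrthing = "everything")

normalize <- function(tokens) {
  tokens <- tolower(tokens)
  fixed <- corrections[tokens]  # NA where no correction applies
  unname(ifelse(is.na(fixed), tokens, fixed))
}

normalize(c("Machien", "lerning", "predicts", "eveyrthing"))
# "machine" "learning" "predicts" "everything"
```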

Orthogonality

Cosine similarities:

```
[1] 1.0000 0.6667 1.0000 0.2500
```

Mean similarity:

```
[1] 0.7292
```
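The exact vectors behind these numbers were not preserved, but cosine similarity itself is simple to compute; a sketch:

```r
# Cosine similarity: dot product divided by the product of the norms
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

cosine_sim(c(1, 1, 0), c(2, 2, 0))  # same direction -> 1
cosine_sim(c(1, 0), c(0, 1))        # orthogonal -> 0
```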

Meta-features
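Meta-features describe the text as a whole rather than its vocabulary; length and word count are typical examples (the specific features chosen here are assumptions):

```r
# Two simple meta-features: character count and word count
meta_features <- function(text) {
  c(n_chars = nchar(text),
    n_words = length(strsplit(text, "\\s+")[[1]]))
}

meta_features("I like solving interesting problems.")
# n_chars = 36, n_words = 5
```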

Relevance of Information

Which features are the right features?

Finally, some machine learning!

Linear regression

Coefficients:

```
(Intercept)  eveyrthing interesting    learning
          1          -1           1          -1
```

Words that are not shown do not have a coefficient (i.e., they did not carry any useful information for scoring).
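A fit that reproduces the coefficients above can be sketched with `lm()`; the feature rows and scores below are assumptions chosen to match, not the original data:

```r
# One row per response; columns are counts of the model's words
train <- data.frame(
  eveyrthing  = c(0, 0, 0, 1),
  interesting = c(1, 0, 0, 0),
  learning    = c(0, 1, 0, 0),
  score       = c(2, 0, 1, 0)
)

model <- lm(score ~ eveyrthing + interesting + learning, data = train)
round(coef(model))
# (Intercept) 1, eveyrthing -1, interesting 1, learning -1
```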

Predicting scores

Let’s use this as our “test” text that we will predict a score for:

```
1 I want to learn to solve interesting problems.
```

Doing the prediction
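Using the same hypothetical model as above (the training data here is an assumption), the test sentence contains "interesting" but neither "eveyrthing" nor "learning", so its feature row is easy to build by hand:

```r
train <- data.frame(
  eveyrthing  = c(0, 0, 0, 1),
  interesting = c(1, 0, 0, 0),
  learning    = c(0, 1, 0, 0),
  score       = c(2, 0, 1, 0)
)
model <- lm(score ~ eveyrthing + interesting + learning, data = train)

# Feature row for "I want to learn to solve interesting problems."
# ("learn" does not match the "learning" feature without stemming)
test_row <- data.frame(eveyrthing = 0, interesting = 1, learning = 0)
predict(model, newdata = test_row)
# intercept (1) + interesting coefficient (1) = 2
```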

Evaluating model accuracy

First fold:

Second fold:

Predictions:
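The fold contents above were lost; as a sketch, two-fold cross-validation trains on each half of the data in turn and predicts the held-out half (the data frame here is assumed):

```r
# Assumed data: one numeric feature and a score per response
data <- data.frame(x = 1:8, score = c(2, 0, 1, 0, 2, 1, 0, 1))

# Assign alternating rows to fold 1 and fold 2
folds <- rep(1:2, length.out = nrow(data))
predictions <- numeric(nrow(data))

for (k in 1:2) {
  # Train on the rows outside fold k, predict the rows inside it
  model <- lm(score ~ x, data = data[folds != k, ])
  predictions[folds == k] <-
    predict(model, newdata = data[folds == k, ])
}
```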

Quantify error
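Root-mean-squared error is a common way to quantify it (the actual and predicted values below are assumed):

```r
# RMSE between actual and predicted scores
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

rmse(c(2, 0, 1, 0), c(2, 1, 1, 0))
# [1] 0.5
```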

More advanced features

More advanced algorithms
