Text Prediction Shiny App pt 1

[This article was first published on R on The Data Sandbox, and kindly contributed to R-bloggers.]

This Shiny App was first written in May of 2021.

Description

The goal of this project was to create an N-gram-based model that predicts the word following the user’s input. It was built as the Capstone project for the Johns Hopkins University Data Science program on Coursera. The data for this project was provided by SwiftKey.
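
To make the term concrete, here is a minimal sketch of how a sentence breaks into N-grams, using only base R. The helper name `make_ngrams` and the sample sentence are illustrative assumptions and are not part of the original project.

```r
# Hypothetical helper: split a sentence into overlapping n-grams (base R only)
make_ngrams <- function(sentence, n = 2) {
  words <- strsplit(tolower(sentence), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Made-up sample sentence
bigrams <- make_ngrams("the quick brown fox", n = 2)
print(bigrams)
```

A bigram model could then suggest the next word by looking up the most frequent bigram that starts with the user's last word.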

This project is broken down into multiple parts, as the entire project is quite large. This first part deals with the creation of the corpus. The corpus requires additional filtering to remove profanity and words that are not in an English dictionary; common contractions are kept by adding them to the allowed vocabulary.

Initialization

The initial step loads the required libraries and downloads the data sets if they are not already on disk.

library(tidyverse)
library(tidytext)
library(pryr)
#downloads the corpus files, profanity filter and English dictionary
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
url2 <- "https://www.freewebheaders.com/download/files/facebook-bad-words-list_comma-separated-text-file_2021_01_18.zip"
url3 <- "https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt"
url4 <- "https://raw.githubusercontent.com/mark-edney/Capestone/1c143b40dd71f0564c3248df2a8638d08af10440/data/contractions.txt"
# For testing: if every file is already on disk, nothing is downloaded again
data_dir <- path.expand("~/R/Capestone/data")
if (!dir.exists(data_dir)) {
  # recursive = TRUE also creates ~/R/Capestone when it does not exist yet
  dir.create(data_dir, recursive = TRUE)
}
files <- file.path(data_dir, c("data.zip", "prof.zip", "diction.txt", "contractions.txt"))
if (!all(file.exists(files))) {
  download.file(url,  destfile = files[1])
  download.file(url2, destfile = files[2])
  download.file(url3, destfile = files[3])
  download.file(url4, destfile = files[4])
  # exdir extracts into the data folder without changing the working directory
  unzip(files[2], exdir = data_dir)
  unzip(files[1], exdir = data_dir)
}

Creating a Corpus

The project requires a corpus, a large body of text, to build the models. At this stage, the files are read and joined. The full corpus is so large and requires so much RAM that only a 10% sample is kept.

blog <- read_lines("~/R/Capestone/data/final/en_US/en_US.blogs.txt")
news <- read_lines("~/R/Capestone/data/final/en_US/en_US.news.txt")
twitter <- read_lines("~/R/Capestone/data/final/en_US/en_US.twitter.txt")
blog <- tibble(text = blog)
news <- tibble(text = news)
twitter <- tibble(text = twitter)
set.seed(90210)
corpus <- bind_rows(blog, twitter, news) %>%
  slice_sample(prop = 0.10) %>%
  mutate(line = row_number())
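
The sampling step can be checked on a small stand-in. The sketch below is not the author's code: the tibble `toy` and its contents are made up, and it only illustrates that setting the same seed before `slice_sample()` gives the same 10% sample, and that the sample takes less memory than the full table.

```r
library(dplyr)

# Toy stand-in for the corpus; the real one has millions of lines
toy <- tibble(text = paste("line", 1:1000))

set.seed(90210)
sample_a <- toy %>% slice_sample(prop = 0.10)
set.seed(90210)
sample_b <- toy %>% slice_sample(prop = 0.10)

# Same seed, same rows in the same order
same_sample <- identical(sample_a, sample_b)
print(nrow(sample_a))
print(same_sample)

# Rough memory footprint before and after sampling
print(object.size(toy), units = "Kb")
print(object.size(sample_a), units = "Kb")
```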

Corpus Filtering

Here, the corpus filter is created to remove profanity and any word that is not in the English dictionary.

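# The code assumes that line 15 of the downloaded file holds the comma-separated word list
prof <- read_lines("~/R/Capestone/data/facebook-bad-words-list_comma-separated-text-file_2021_01_18.txt")[15]
# str_split() returns a list of length one; unlist() turns it into a character vector
prof <- prof %>%
  str_split(", ") %>%
  unlist()
prof <- tibble(word = prof)
english <- read_lines("~/R/Capestone/data/diction.txt")
# drop empty lines from the dictionary
english <- tibble(word = english[english != ""])
contract <- read_lines("~/R/Capestone/data/contractions.txt")
# unnest_tokens() lowercases the contractions and puts one token on each row
contract <- tibble(word = contract) %>% unnest_tokens(word, word)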
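
As a small illustration of what these steps produce, the sketch below runs the same kind of splitting and tokenizing on made-up input. The strings and the names `raw_line`, `toy_prof` and `toy_contract` are assumptions for the example only.

```r
library(dplyr)
library(tidytext)

# Made-up stand-in for the comma-separated line in the word-list file
raw_line <- "darn, heck, gosh"
toy_prof <- tibble(word = unlist(strsplit(raw_line, ", ", fixed = TRUE)))

# Made-up contractions; unnest_tokens() lowercases them by default
toy_contract <- tibble(word = c("Don't", "we're")) %>%
  unnest_tokens(word, word)

print(toy_prof)
print(toy_contract)
```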

Vocabulary

A vocabulary of words is created from the unique words with the applied filters.

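# free up RAM
rm(blog, news, twitter)
voc <- bind_rows(english, contract) %>% anti_join(prof, by = "word")
# count() adds the column n that the exploration step sorts on
unigram <- corpus %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 1) %>%
  semi_join(voc, by = c("ngram" = "word")) %>%
  count(ngram)
# reduce the vocabulary to the words that actually occur in the corpus
voc <- tibble(word = unique(unigram$ngram))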
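
The effect of the two joins is easier to see on a toy example. In the sketch below every word list is made up, and the final `count()` step is an assumption about how the `n` column seen in the later output is produced: `anti_join()` removes blocked words from the dictionary, and `semi_join()` keeps only tokens that remain in the vocabulary.

```r
library(dplyr)
library(tidytext)

# Toy inputs; every word below is made up for illustration
toy_corpus <- tibble(text = c("The cat sat on the darn mat", "Zxqv cat"))
toy_dict   <- tibble(word = c("the", "cat", "sat", "on", "mat", "darn"))
toy_prof   <- tibble(word = "darn")

# Allowed vocabulary: dictionary words minus the blocked words
toy_voc <- toy_dict %>% anti_join(toy_prof, by = "word")

# Tokenize into single words, keep only allowed words, then count them
toy_unigram <- toy_corpus %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 1) %>%
  semi_join(toy_voc, by = c("ngram" = "word")) %>%
  count(ngram, sort = TRUE)

print(toy_unigram)
```

Neither the blocked word nor the non-dictionary token survives the filter, while repeated words are collapsed into a single row with their count.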

Corpus Exploration

Now that the corpus is created, we can explore the text. Some lines look odd, but on the whole the text makes sense.

corpus %>%
  head()
## # A tibble: 6 x 2
##   text                                                                      line
##   <chr>                                                                    <int>
## 1 no we don't just kidding yes we do d                                         1
## 2 it sounds like a man walking on snow but it's my heartbeat cara on how ~     2
## 3 they thought about it but you're too short                                   3
## 4 indeed dear to my heart want to go see                                       4
## 5 i know its a tough world out there the secret to succeeding isn't stepp~     5
## 6 we're wishing for the cure too good luck                                     6

Vocabulary Exploration

By using the arrange function, we can sort the unigrams by their counts. This provides some insight into which words come up most frequently. Not surprisingly, the most common word is “the”. These frequencies will play an important role in the text prediction, so it is important to consider them. It is very common to filter out “stop words”, as they likely add little value to predictions.

unigram %>%
  arrange(desc(n)) %>%
  head()
## # A tibble: 6 x 2
##   ngram      n
##   <chr>  <int>
## 1 the   476750
## 2 to    277081
## 3 and   242032
## 4 a     238301
## 5 of    201539
## 6 in    165645
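
For reference, stop-word filtering could be sketched as follows. This is not part of the original project: the tibble `toy_counts` is a made-up stand-in (its first three rows echo the table above, the "heartbeat" count is invented), and the filter uses the `stop_words` data set that ships with tidytext.

```r
library(dplyr)
library(tidytext)

# Toy unigram counts; the "heartbeat" row is invented for illustration
toy_counts <- tibble(
  ngram = c("the", "to", "and", "heartbeat"),
  n     = c(476750L, 277081L, 242032L, 1200L)
)

# stop_words ships with tidytext and has the columns `word` and `lexicon`
no_stops <- toy_counts %>%
  anti_join(stop_words, by = c("ngram" = "word"))

print(no_stops)
```

Whether to apply such a filter to a next-word predictor is a design choice; the sketch only shows the mechanics.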

