High frequency words in TOEFL

Posted on December 27, 2013 by chenangen in R bloggers | 0 Comments

[This article was first published on Chen-ang Statistics » R, and kindly contributed to R-bloggers].


In general, the TOEFL (Test of English as a Foreign Language) is not an easy test for Chinese students, me included. Relatively speaking, the reading section is a little easier than the other sections (listening, speaking, writing). Interestingly, while preparing for my TOEFL test, I found that some important words appeared frequently in the mock examinations. So tonight I ran a simple experiment out of curiosity.

First, I collected some relevant materials from the Internet (found via Google). Then I applied some basic transformations: converting the files to plain text documents, eliminating extra whitespace, converting to lower case, removing stopwords, and so on. All of this can be done easily in R with the tm package, which is an excellent package for text manipulation. After this step, the wordcloud package lets us plot a word cloud with little effort. The result is as follows:

[Figure: word cloud of the high-frequency words]

The main code is shown below:

library(tm)
library(wordcloud)

# Directory holding the plain-text TOEFL materials
txt <- "E:\\TOEFL"
b <- Corpus(DirSource(txt), readerControl = list(language = "eng"))

# Basic clean-up: lower case, punctuation, stopwords, extra whitespace
b <- tm_map(b, content_transformer(tolower))  # wrapper needed in tm >= 0.6
b <- tm_map(b, removePunctuation)
b <- tm_map(b, removeWords, stopwords("english"))
b <- tm_map(b, removeWords, c("and", "the", "may", "can", "also", "often", "one"))
b <- tm_map(b, stripWhitespace)               # last, so gaps left by removed words are collapsed

# Count how often each term occurs across all documents
tdm <- TermDocumentMatrix(b)
m1 <- as.matrix(tdm)
v1 <- sort(rowSums(m1), decreasing = TRUE)
d1 <- data.frame(word = names(v1), freq = v1)

# Draw the word cloud
par(bg = "lightyellow")
set.seed(10)
wordcloud(d1$word, d1$freq, scale = c(4, 0.8),
          min.freq = 6, max.words = 100,
          colors = rainbow(length(d1$freq)), font = 2)
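
To make the counting step less of a black box, here is a minimal base R sketch of roughly what the clean-up and the term-frequency calculation do. It is only an illustration: the sample sentence and the tiny stopword list `stops` are made up for this example, and it does not reproduce tm's tokenizer exactly.

```r
# Made-up sample text (not taken from the TOEFL materials)
text <- "The glacier moved slowly. The glacier carved the valley, and the valley remained."

# Hypothetical mini stopword list, for this illustration only
stops <- c("the", "and")

words <- tolower(text)                          # convert to lower case
words <- gsub("[[:punct:]]", "", words)         # remove punctuation
words <- strsplit(words, "\\s+")[[1]]           # split on runs of whitespace
words <- words[!words %in% stops]               # drop stopwords

# Count each remaining word and rank by frequency
freq <- sort(table(words), decreasing = TRUE)
print(freq)
```

The resulting `freq` table plays the same role as `v1` in the code above: a named vector of counts, sorted so the most frequent words come first.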

By the way, this article is just for fun. Please do not rely on it when you prepare for your test. The result is also not fully satisfactory, because I did not do any more advanced processing, such as handling verb tenses or singular and plural forms. Finally, I hope all the students who are eager to study abroad get a satisfying TOEFL score.
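
As a rough idea of what the missing singular/plural handling could look like, the sketch below merges a plural into its singular only when the singular also occurs in the word list. This is a crude suffix rule written for illustration, not a real stemmer and not something done in the original experiment; the function name `merge_plurals` and the counts are hypothetical. It does not handle verb tenses at all. In practice a proper stemmer (for example tm's `stemDocument`) would be the more usual route.

```r
# Made-up counts (hypothetical, not the results of the experiment above)
freq <- c(species = 12, theory = 9, theories = 4, plant = 7, plants = 6)

# Merge plural forms into the singular when the singular is also present.
# Crude rule for illustration: "-ies" -> "-y", otherwise drop a final "s".
merge_plurals <- function(freq) {
  words <- names(freq)
  base  <- words

  ies <- sub("ies$", "y", words)   # e.g. "theories" -> "theory"
  s   <- sub("s$", "", words)      # e.g. "plants"   -> "plant"

  use_ies <- grepl("ies$", words) & ies %in% words
  use_s   <- !use_ies & grepl("s$", words) & s %in% words

  base[use_ies] <- ies[use_ies]
  base[use_s]   <- s[use_s]

  # Sum the counts of all forms that share the same base word
  sort(tapply(freq, base, sum), decreasing = TRUE)
}

merged <- merge_plurals(freq)
print(merged)
```

Requiring the singular to be present keeps words such as "species" from being cut down to a non-word, at the cost of missing plurals whose singular never appears in the texts.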
