High frequency words in TOEFL

December 27, 2013
(This article was first published on Chen-ang Statistics » R, and kindly contributed to R-bloggers)

In general, the TOEFL (Test of English as a Foreign Language) is not an easy test for Chinese students, myself included. Relatively speaking, the reading section is a little easier than the other sections (listening, speaking, writing). Interestingly, while preparing for my TOEFL test, I noticed that some important words appeared frequently in the mock examinations. So tonight I ran a simple experiment out of curiosity. First I collected some relevant materials from the Internet (found with Google). Then I applied some basic transformations: converting to plain text documents, eliminating extra whitespace, converting to lower case, removing stopwords and so on. All of this is easy to do in R with the tm package, which is an excellent package for text manipulation. After this step, the wordcloud package lets us plot a word cloud with little effort. The result is as follows:

[Figure: word cloud of high-frequency TOEFL words]

The main code is shown below:

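library(tm)
library(wordcloud)

# Directory holding the plain-text TOEFL materials
txt <- "E:\\TOEFL"
b <- Corpus(DirSource(txt), readerControl = list(language = "eng"))

# Basic clean-up: lower case, strip punctuation, drop common words
# Recent tm versions need content_transformer() around base functions such as tolower
b <- tm_map(b, content_transformer(tolower))
b <- tm_map(b, removePunctuation)
b <- tm_map(b, removeWords, c("and", "the", "may", "can", "also", "often", "one"))
b <- tm_map(b, removeWords, stopwords("english"))
# Strip whitespace last, since removing words leaves gaps behind
b <- tm_map(b, stripWhitespace)

# Term frequencies summed over all documents, most frequent first
tdm <- TermDocumentMatrix(b)
m1 <- as.matrix(tdm)
v1 <- sort(rowSums(m1), decreasing = TRUE)
d1 <- data.frame(word = names(v1), freq = v1)

# Draw the word cloud; set.seed() makes the layout reproducible
par(bg = "lightyellow")
set.seed(10)
wordcloud(d1$word, d1$freq, scale = c(4, 0.8),
          min.freq = 6, max.words = 100,
          colors = rainbow(length(d1$freq)), font = 2)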
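
To see what the frequency step is computing without any TOEFL files at hand, here is a minimal base-R sketch of the same idea. The two toy sentences and the short stop_list are made up for illustration; they are not the materials or the stopword list used above, and the tokenization is much cruder than what tm does.

```r
# Toy documents standing in for the TOEFL texts (made up for illustration)
docs <- c("The study of plants and the study of animals.",
          "Plants may adapt; animals often migrate.")

# Lower-case, strip punctuation, then split on whitespace
clean <- gsub("[[:punct:]]", "", tolower(docs))
words <- unlist(strsplit(clean, "\\s+"))

# Drop a small hand-picked stopword list (an assumption; far shorter
# than tm's stopwords("english"))
stop_list <- c("the", "of", "and", "may", "often")
words <- words[!(words %in% stop_list) & nzchar(words)]

# Count and sort, giving the same kind of result as v1 above
freq <- sort(table(words), decreasing = TRUE)
print(head(freq, 5))
```

A table like this is also a handy sanity check on the cloud itself, since head(d1, 20) on the real data shows the exact counts behind the largest words.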

By the way, this article is just for fun. Please do not rely on it when you prepare for your test. The result is also not fully satisfactory, because I skipped some more advanced processing, such as handling tense and singular/plural forms. Finally, I hope all the students who are eager to study abroad get a satisfactory TOEFL score.
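
One possible way to handle tense and singular/plural forms is stemming. The sketch below is not part of the original experiment: it assumes the SnowballC package is installed (tm's stemDocument relies on it) and uses a made-up toy corpus in place of the TOEFL texts. Stemming trims words to a common root, so the roots are not always dictionary words.

```r
library(tm)  # stemDocument() needs the SnowballC package to be installed

# Toy corpus, made up for illustration
toy <- VCorpus(VectorSource(c("students tested words",
                              "student tests word")))

# Reduce each word to its stem so inflected forms are counted together
toy <- tm_map(toy, stemDocument)

# Inflected forms should now share a single row in the matrix
tdm_stem <- TermDocumentMatrix(toy)
print(rowSums(as.matrix(tdm_stem)))
```

In the main script, the equivalent step would be a line such as b <- tm_map(b, stemDocument) placed just before the call to TermDocumentMatrix.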
