This is the last part of RGolf before summer. As R excels at visualization, today's task is to generate a plot.

We will work with the NGSL data: a list of 2801 important vocabulary words for students of English as a second language. I have prepared the list as the file NGSL101.txt for download.

(5) assume that you have the NGSL101.txt file in your R working directory.

Warning! This time the task takes a bit more time to compute, so it is worth developing and testing the solution on a subset of the NGSL word list.
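One way to follow this advice is to draw a reproducible random sample of the words before running the slow part. The sketch below is only an illustration: the helper name take.subset, the default sample size and the stand-in word vector are my own assumptions, not part of the task.

```r
# Hypothetical helper: keep a reproducible random sample of at most n words.
take.subset <- function(words, n = 200, seed = 1) {
  set.seed(seed)  # same sample on every run, which helps when debugging
  words[sample.int(length(words), min(n, length(words)))]
}

# Stand-in for the real word list so the snippet runs without the file.
words <- c("a", "at", "tea", "eat", "rate", "water", "tear", "we")
small <- take.subset(words, n = 5)
print(small)
```

With the real data the same helper could be applied to the vector read from NGSL101.txt, and dropped once the solution works.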

d=scan("NGSL101.txt","",skip=1)

a=(s=sapply)(s(strsplit(d,""),sort),paste,collapse=".*")

y=log(s(a,function(z)sum(s(a,function(i)grepl(i,z)))))

plot(by(y,nchar(d),mean))

And the output is the following:

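As we can see, the number of subwords grows, on average, approximately exponentially with the number of letters in a word (the plot shows the mean of the log counts, so roughly linear growth there corresponds to exponential growth in the counts themselves).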

And here is a verbose version of the solution with comments (warning again – it is slower than the solution given above):

d <- readLines("NGSL101.txt")

d <- d[-1] # remove first line from the dataset as it is a comment

is.subword <- function(test, ref) {
    # we check if test is a subword of ref by applying a regular
    # expression to the sorted letters contained in both words
    test <- paste(sort(strsplit(test, "")[[1]]), collapse = ".*")
    ref <- paste(sort(strsplit(ref, "")[[1]]), collapse = "")
    # grepl returns TRUE if a match is found
    grepl(test, ref)
}

# traverse all words in d and count number of matches

count.subwords <- function(ref) {
    sum(sapply(d, is.subword, ref = ref))
}

x2 <- nchar(d)

y2 <- log(sapply(d, count.subwords))

y2.means <- tapply(y2, x2, mean)

plot(y2.means)
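To see the sorted-letter trick at work on something small, here is a sketch on a made-up six-word list (not NGSL data). The helper name sort.letters is my own invention. The sketch also counts matches with one vectorized grepl call per pattern instead of one call per pair of words; treat that as one possible variation, not as the solution above.

```r
# Hypothetical helper: sort the letters of each word, e.g. "water" -> "aertw".
sort.letters <- function(words, sep = "") {
  vapply(strsplit(words, ""),
         function(ch) paste(sort(ch), collapse = sep),
         character(1))
}

# Toy word list (an assumption for illustration, not taken from NGSL101.txt).
words <- c("a", "at", "tea", "eat", "rate", "water")

patterns <- sort.letters(words, sep = ".*")  # "tea"   -> "a.*e.*t"
refs     <- sort.letters(words)              # "water" -> "aertw"

# Column j holds the matches of pattern j against every reference word;
# grepl is vectorized over x, so one call per pattern is enough.
matches <- vapply(patterns, grepl, logical(length(refs)), x = refs)

# Row sums give, for each reference word, the number of its subwords
# (each word counts as a subword of itself).
counts <- rowSums(matches)
names(counts) <- words
print(counts)
#     a    at   tea   eat  rate water
#     1     2     4     4     5     6

# Mean log count per word length, as in the plot above.
print(tapply(log(counts), nchar(words), mean))
```

Because both strings are sorted, the pattern matches exactly when every letter of the test word, counted with repetitions, also occurs in the reference word. This is why "tea" and "eat" share one pattern and are subwords of each other.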

