Corpus Linguistics with R, Day 2

July 28, 2009
(This article was first published on Cornelius Puschmann's Blog » R, and kindly contributed to R-bloggers)

R Lesson 2


text<-c("This is a first example sentence.", "And this is a second example sentence.")

# gsub replaces stuff in strings; the argument order is SEARCH, REPLACE, SUBJECT

> gsub ("second", "third", text)
[1] "This is a first example sentence."
[2] "And this is a third example sentence."
> gsub ("n", "X", text)
[1] "This is a first example seXteXce."
[2] "AXd this is a secoXd example seXteXce."
> gsub ("is", "was", text)
[1] "Thwas was a first example sentence."
[2] "And thwas was a second example sentence."
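The last call also changes the "is" inside "This". A minimal sketch of one way to replace only the whole word, assuming Perl-style word boundaries (\\b) are what is wanted; the variable name whole_word is made up for this example:

```r
text <- c("This is a first example sentence.",
          "And this is a second example sentence.")

# \\b marks a word boundary, so the "is" inside "This" is left alone
whole_word <- gsub("\\bis\\b", "was", text, perl = TRUE)
print(whole_word)
# [1] "This was a first example sentence."
# [2] "And this was a second example sentence."
```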

---

Perl-style regex

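^ beginning of string, e.g. "^x"; ***OR*** negation when it is the first character inside [], e.g. "[^x]"
$ end of string, e.g. "x$"
. any single character
\ escape character - TWO ("\\") needed inside an R string
[] character classes, e.g. [aeiou] vowels, [a-h] is the same as [abcdefgh]
{MIN,MAX} number of repetitions of the immediately preceding unit (character or group)
+ one or more of the immediately preceding unit, same as {1,}

examples
lo+l matches "lol", "lool", "loool", ...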
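The operators listed above can be tried out one at a time with grepl. This is only a sketch; the vector words and the result names are hypothetical test data, not part of the lesson material:

```r
# hypothetical test strings, not part of the lesson data
words <- c("lol", "lool", "ll", "x.y", "xzy")

one_or_more  <- grepl("lo+l", words, perl = TRUE)      # "+": one or more "o"
starts_x     <- grepl("^x", words, perl = TRUE)        # "^": anchored at the start
ends_l       <- grepl("l$", words, perl = TRUE)        # "$": anchored at the end
not_x_first  <- grepl("^[^x]", words, perl = TRUE)     # "^" inside []: anything but "x"
literal_dot  <- grepl("x\\.y", words, perl = TRUE)     # "\\.": a literal dot
two_to_three <- grepl("lo{2,3}l", words, perl = TRUE)  # {MIN,MAX}: 2 to 3 times "o"

print(words[one_or_more])   # "lol" "lool"
print(words[starts_x])      # "x.y" "xzy"
print(words[ends_l])        # "lol" "lool" "ll"
print(words[not_x_first])   # "lol" "lool" "ll"
print(words[literal_dot])   # "x.y"
print(words[two_to_three])  # "lool"
```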

> grep("analy[sz]e", c("analyze", "analyse", "moo"), perl=T, value=T)
[1] "analyze" "analyse"

> grep("(first|second)", text, perl=T, value=T)
[1] "This is a first example sentence."
[2] "And this is a second example sentence."
> grep("(first|lalala)", text, perl=T, value=T)
[1] "This is a first example sentence."
>

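# z is not defined in these notes; z<-c("aabbccdd", "ababcdcd") is consistent with the output below
> grep("ab{2}", z, perl=T, value=T)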
[1] "aabbccdd"
> grep("(ab){2}", z, perl=T, value=T)
[1] "ababcdcd"
> gsub("a (first|second)", "another", text, perl=T)
[1] "This is another example sentence."
[2] "And this is another example sentence."
> gsub("[abcdefgh]", "X", text, perl=T)
[1] "TXis is X Xirst XxXmplX sXntXnXX."
[2] "AnX tXis is X sXXonX XxXmplX sXntXnXX."

> grep("forg[eo]t(s|ting|ten)?_v", a.corpus.file, perl=T, value=T)
# finds all forms of "forget" (forget, forgot, forgets, forgetting, forgotten) followed by _v (presumably a verb tag)
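Since a.corpus.file is not shown, here is a small sketch of the same pattern on a made-up vector. The tokens in tagged are hypothetical, and the word_tag format is only an assumption about what the corpus file looks like:

```r
# hypothetical tagged tokens; a.corpus.file itself is not shown in the notes
tagged <- c("forget_v", "forgot_v", "forgets_v", "forgetting_v",
            "forgotten_v", "forgetful_a", "forge_v")

# [eo] covers forget/forgot, the optional group covers the endings
forget_forms <- grep("forg[eo]t(s|ting|ten)?_v", tagged, perl = TRUE, value = TRUE)
print(forget_forms)
# [1] "forget_v"     "forgot_v"     "forgets_v"    "forgetting_v" "forgotten_v"
```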

*? lazy matching: take as few characters as possible, e.g.

> gregexpr("s.*?s", text[1], perl=T)
[[1]]
[1] 4 14
attr(,"match.length")
[1] 4 12

# note: things that are matched are consumed and cannot be found again in the same pass
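To see the matched strings rather than their positions, regmatches can be combined with gregexpr. The sketch below compares the lazy pattern with its greedy counterpart on the first example sentence; the variable names are made up for illustration:

```r
sentence <- "This is a first example sentence."

# lazy: stop at the first "s" that completes a match
lazy <- regmatches(sentence, gregexpr("s.*?s", sentence, perl = TRUE))[[1]]
# greedy: run on to the last "s" in the string
greedy <- regmatches(sentence, gregexpr("s.*s", sentence, perl = TRUE))[[1]]

print(lazy)    # "s is"  "st example s"
print(greedy)  # "s is a first example s"
```

The lazy result also shows the consumption effect: the "s" that ends the first match cannot start another one, so "s a firs" is never reported.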

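# text now holds two different sentences containing years (the assignment is not shown in these notes)
> gsub("(19|20)[0-9]{2}", "YEAR", text)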
[1] "They killed 250 people in YEAR." "No, it was in YEAR."
> #replaces only 19xx and 20xx
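One caveat: without anchors the pattern also fires inside longer digit strings. A sketch with word boundaries, using a hypothetical vector years rather than the lesson's own data:

```r
# hypothetical sentences, not the ones used in the lesson
years <- c("Founded in 1987.", "Order 120045 shipped in 2009.")

loose  <- gsub("(19|20)[0-9]{2}", "YEAR", years, perl = TRUE)
strict <- gsub("\\b(19|20)[0-9]{2}\\b", "YEAR", years, perl = TRUE)

print(loose)   # "Founded in YEAR."  "Order 1YEAR5 shipped in YEAR."
print(strict)  # "Founded in YEAR."  "Order 120045 shipped in YEAR."
```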

---

> textfile<-scan(file.choose(), what="char", sep="\n")
Enter file name: corp_gpl_short.txt
Read 9 items
> textfile<-tolower(textfile)
> textfile
[1] "the licenses for most software are designed to take away your"
[2] "freedom to share and change it. by contrast, the gnu general public"
[3] "license is intended to guarantee your freedom to share and change free"
[4] "software--to make sure the software is free for all its users. this"
[5] "general public license applies to most of the free software"
[6] "foundation's software and to any other program whose authors commit to"
[7] "using it. (some other free software foundation software is covered by"
[8] "the gnu library general public license instead.) you can apply it to"
[9] "your programs, too."
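# "//W" is a typo for "\\W": the forward slashes match nothing here, so the lines come back unsplit
> unlist(strsplit(textfile, "//W"))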
[1] "the licenses for most software are designed to take away your"
[2] "freedom to share and change it. by contrast, the gnu general public"
[3] "license is intended to guarantee your freedom to share and change free"
[4] "software--to make sure the software is free for all its users. this"
[5] "general public license applies to most of the free software"
[6] "foundation's software and to any other program whose authors commit to"
[7] "using it. (some other free software foundation software is covered by"
[8] "the gnu library general public license instead.) you can apply it to"
[9] "your programs, too."
# correct pattern: "\\W" (escaped backslash) splits at every non-word character
> text_split<-unlist(strsplit(textfile, "\\W"))
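The difference between the patterns is easiest to see on a single short string. A minimal sketch; the string line is a fragment of the GPL text, chosen here only for illustration:

```r
line <- "software--to make sure"

typo     <- strsplit(line, "//W")[[1]]    # no "//W" in the string: nothing is split
per_char <- strsplit(line, "\\W")[[1]]    # splits at each non-word character
per_run  <- strsplit(line, "\\W+")[[1]]   # splits at each run of non-word characters

print(typo)      # "software--to make sure"
print(per_char)  # "software" "" "to" "make" "sure"
print(per_run)   # "software" "to" "make" "sure"
```

The empty string in per_char sits between the two hyphens; "\\W+" treats "--" as one separator and so leaves no empty token.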
> sort(table(text_split), decreasing=T)
text_split
                   to   software        the       free        and    general
         9          9          7          5          4          3          3
        is         it    license     public       your         by     change
         3          3          3          3          3          2          2
       for foundation    freedom        gnu       most      other      share
         2          2          2          2          2          2          2
       all        any    applies      apply        are    authors       away
         1          1          1          1          1          1          1
       can     commit   contrast    covered   designed  guarantee    instead
         1          1          1          1          1          1          1
  intended        its    library   licenses       make         of    program
         1          1          1          1          1          1          1
  programs          s       some       sure       take       this        too
         1          1          1          1          1          1          1
     users      using      whose        you
         1          1          1          1
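# the leading count of 9 above belongs to the empty string "", left over between adjacent non-word characters;
# splitting on "\\W+" (one or more non-word characters) avoids it
> text_split<-unlist(strsplit(textfile, "\\W+"))
> text_freqs<-sort(table(text_split), decreasing=T)
> text_freqs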
text_split
        to   software        the       free        and    general         is
         9          7          5          4          3          3          3
        it    license     public       your         by     change        for
         3          3          3          3          2          2          2
foundation    freedom        gnu       most      other      share        all
         2          2          2          2          2          2          1
       any    applies      apply        are    authors       away        can
         1          1          1          1          1          1          1
    commit   contrast    covered   designed  guarantee    instead   intended
         1          1          1          1          1          1          1
       its    library   licenses       make         of    program   programs
         1          1          1          1          1          1          1
         s       some       sure       take       this        too      users
         1          1          1          1          1          1          1
     using      whose        you
         1          1          1
> text_freqs[text_freqs>1]
text_split
        to   software        the       free        and    general         is
         9          7          5          4          3          3          3
        it    license     public       your         by     change        for
         3          3          3          3          2          2          2
foundation    freedom        gnu       most      other      share
         2          2          2          2          2          2
>
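The counting steps above (table, sort, then filter by count) can be sketched on a tiny vector. The names tokens, freqs and repeated are hypothetical and stand in for text_split and text_freqs:

```r
# hypothetical mini corpus, already split into words
tokens <- c("to", "free", "to", "software", "free", "to")

# count each word type, most frequent first
freqs <- sort(table(tokens), decreasing = TRUE)

# keep only the words that occur more than once
repeated <- freqs[freqs > 1]
print(repeated)

# the most frequent word and its count
print(names(freqs)[1])       # "to"
print(as.vector(freqs)[1])   # 3
```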

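# stop_list is not defined in these notes; stop_list<-c("the", "and", "of") is consistent with the output below
> !(text_split %in% stop_list)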
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[13] TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
[25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
[37] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[49] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
[61] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[73] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
[85] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> text_stopremoved<-text_split[!(text_split %in% stop_list)]
> text_stopremoved
[1] "licenses" "for" "most" "software" "are"
[6] "designed" "to" "take" "away" "your"
[11] "freedom" "to" "share" "change" "it"
[16] "by" "contrast" "gnu" "general" "public"
[21] "license" "is" "intended" "to" "guarantee"
[26] "your" "freedom" "to" "share" "change"
[31] "free" "software" "to" "make" "sure"
[36] "software" "is" "free" "for" "all"
[41] "its" "users" "this" "general" "public"
[46] "license" "applies" "to" "most" "free"
[51] "software" "foundation" "s" "software" "to"
[56] "any" "other" "program" "whose" "authors"
[61] "commit" "to" "using" "it" "some"
[66] "other" "free" "software" "foundation" "software"
[71] "is" "covered" "by" "gnu" "library"
[76] "general" "public" "license" "instead" "you"
[81] "can" "apply" "it" "to" "your"
[86] "programs" "too"
>
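The stop word filter works in two steps: %in% gives a logical vector, and that vector is then used as an index. A self-contained sketch; both tokens and the three-word stop_list are assumptions made for this example:

```r
# hypothetical tokens and an assumed stop list
tokens <- c("the", "gnu", "general", "public", "license",
            "and", "the", "freedom", "of", "users")
stop_list <- c("the", "and", "of")

# TRUE for every token that is NOT in the stop list
keep <- !(tokens %in% stop_list)

# logical indexing keeps only the TRUE positions
content_words <- tokens[keep]
print(content_words)
# [1] "gnu"     "general" "public"  "license" "freedom" "users"
```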

# LOAD an R file
source("something.r")
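A quick way to try source without an existing script is to write a throwaway file first. This is only a sketch; the temporary file and the variable lesson_value are made up for the example:

```r
# write a one-line script to a temporary file (hypothetical content)
script <- tempfile(fileext = ".r")
writeLines("lesson_value <- 2 + 2", script)

# source() runs the file; its variables become available afterwards
source(script)
print(lesson_value)  # 4

# clean up the temporary file
unlink(script)
```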
