Preprocessing the Norwegian Web as Corpus (NoWaC) in R

[This article was first published on R on Pablo Bernabeu, and kindly contributed to R-bloggers].

The present script can be used to pre-process data from a frequency list of the Norwegian Web as Corpus (NoWaC).

Before using the script, the frequency list should be downloaded. The list is described as ‘frequency list sorted primary alphabetic and secondary by frequency within each character’. The download requires signing in to an institutional network. Finally, the downloaded file should be unzipped.
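As a rough illustration of the kind of pre-processing a frequency list like this might need, here is a minimal sketch in base R. It is not the script referred to above. It assumes, without having confirmed it against the real file, that each line holds a frequency count followed by a word form, separated by whitespace. The sample lines and the file path in the comment are hypothetical placeholders.

```r
# Hypothetical sample standing in for the unzipped frequency list.
# ASSUMPTION: each line is "<frequency> <word>", separated by whitespace.
# Inspect the real file before relying on this layout.
sample_lines <- c(
  "  120 Hus",
  "   85 hus",
  "    3 hus,",
  "   40 bil",
  "    7 123"
)

# With the real file, the lines would be read from disk instead, e.g.
# (hypothetical path):
# sample_lines <- readLines("path/to/unzipped_frequency_list", encoding = "UTF-8")

# Split each line into a frequency and a word form
parts <- strsplit(trimws(sample_lines), "\\s+")
nowac <- data.frame(
  frequency = as.numeric(vapply(parts, `[`, character(1), 1)),
  word = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)

# Lowercase the words and keep only purely alphabetic entries
# (what counts as alphabetic depends on the session locale)
nowac$word <- tolower(nowac$word)
nowac <- nowac[grepl("^[[:alpha:]]+$", nowac$word), ]

# Sum the frequencies of entries that became duplicates after lowercasing
nowac <- aggregate(frequency ~ word, data = nowac, FUN = sum)

# Sort from most to least frequent
nowac <- nowac[order(-nowac$frequency), ]
rownames(nowac) <- NULL
```

Whether lowercasing, dropping non-alphabetic entries, or merging duplicates is appropriate depends on the research question, so each step should be treated as optional.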

Reference of the corpus

Guevara, E. R. (2010). NoWaC: A large web-based corpus for Norwegian. In Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop (pp. 1-7).

