Preprocessing the Norwegian Web as Corpus (NoWaC) in R

This article was first published on R on Pablo Bernabeu, and kindly contributed to R-bloggers.

This script preprocesses data from a frequency list of the Norwegian Web as Corpus (NoWaC).

Before using the script, download the frequency list from https://www.hf.uio.no/iln/english/about/organization/text-laboratory/services/nowac-frequency.html. The list is described there as ‘frequency list sorted primary alphabetic and secondary by frequency within each character’, and the direct URL is https://www.tekstlab.uio.no/nowac/download/nowac-1.1.lemma.frek.sort_alf_frek.txt.gz. The download requires signing in to an institutional network. Finally, the downloaded file, which is gzip-compressed (.gz), should be decompressed.
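As a minimal sketch of the loading step, the example below reads a gzip-compressed list into a data frame using base R only. It does not reproduce the original script. The file layout is an assumption (one entry per line, a frequency count followed by a lemma, separated by whitespace), and the mock entries and the names `mock_lines`, `gz_path` and `nowac` are hypothetical. The real file may be laid out differently and may need an explicit encoding.

```r
# Hypothetical stand-in for the NoWaC list; the real file's layout may differ.
# Assumption: each line holds a frequency count followed by a lemma.
mock_lines <- c("  120 bil", "   45 bok", " 3010 og")

# Write the mock entries to a gzip-compressed temporary file
gz_path <- tempfile(fileext = ".txt.gz")
con <- gzfile(gz_path, open = "wt")
writeLines(mock_lines, con)
close(con)

# Read the compressed file; gzfile() decompresses while reading,
# so a separate decompression step is not strictly needed in R
con <- gzfile(gz_path, open = "rt")
raw_lines <- readLines(con)
close(con)

# Split each line into its frequency and its lemma
trimmed <- trimws(raw_lines)
nowac <- data.frame(
  frequency = as.integer(sub("[[:space:]].*$", "", trimmed)),
  lemma = sub("^[0-9]+[[:space:]]+", "", trimmed),
  stringsAsFactors = FALSE
)

# Order the entries from most to least frequent
nowac <- nowac[order(-nowac$frequency), ]
```

To try this on the real list, `gz_path` would point to the downloaded file instead of a temporary one, and the parsing step would need to be checked against the first few lines of that file.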

Corpus reference

Guevara, E. R. (2010). NoWaC: A large web-based corpus for Norwegian. In Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop (pp. 1-7). https://aclanthology.org/W10-1501

