Site icon R-bloggers

Natural Language Processing for non-English languages with udpipe

[This article was first published on bnosac :: open analytical helpers, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

BNOSAC is happy to announce the release of the udpipe R package (https://bnosac.github.io/udpipe/en) which is a Natural Language Processing toolkit that provides language-agnostic ‘tokenization’, ‘parts of speech tagging’, ‘lemmatization’, ‘morphological feature tagging’ and ‘dependency parsing’ of raw text. Next to text parsing, the package also allows you to train annotation models based on data of ‘treebanks’ in ‘CoNLL-U’ format as provided at http://universaldependencies.org/format.html.

Language models

The package provides direct access to language models trained on more than 50 languages. The following languages are directly available:

afrikaans, ancient_greek-proiel, ancient_greek, arabic, basque, belarusian, bulgarian, catalan, chinese, coptic, croatian, czech-cac, czech-cltt, czech, danish, dutch-lassysmall, dutch, english-lines, english-partut, english, estonian, finnish-ftb, finnish, french-partut, french-sequoia, french, galician-treegal, galician, german, gothic, greek, hebrew, hindi, hungarian, indonesian, irish, italian, japanese, kazakh, korean, latin-ittb, latin-proiel, latin, latvian, lithuanian, norwegian-bokmaal, norwegian-nynorsk, old_church_slavonic, persian, polish, portuguese-br, portuguese, romanian, russian-syntagrus, russian, sanskrit, serbian, slovak, slovenian-sst, slovenian, spanish-ancora, spanish, swedish-lines, swedish, tamil, turkish, ukrainian, urdu, uyghur, vietnamese

We hope that the package will allow other R users to build natural language applications on top of the resulting parts of speech tags, tokens, morphological features and dependency parsing output. And we hope in particular that applications will arise which are not limited to English only (like the textrank R package or the cleanNLP package to name a few)

Easy installation, great docs

Training on Text Mining with R

Are you interested in text mining. Feel free to register for the upcoming course on text mining

Example

Want to get started with it right away? Example annotating Polish text in UTF-8 encoding, but you can pick any language of choice listed above. Enjoy.

library(udpipe)
model <- udpipe_download_model(language = "polish")
model <- udpipe_load_model(file = model$file_model)
x <- udpipe_annotate(model, x = "Budynek otrzymany od parafii wymaga remontu, a placówka nie otrzymała jeszcze żadnej dotacji.")
x <- as.data.frame(x)
x

To leave a comment for the author, please follow the link and comment on their blog: bnosac :: open analytical helpers.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.