- parallel NLP annotation across your CPU cores
- default models are now trained on Universal Dependencies 2.4, allowing annotation in 64 languages based on 94 treebanks from Universal Dependencies. We now have models built on afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, buryat-bdt, catalan-ancora, chinese-gsd, classical_chinese-kyoto, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, estonian-ewt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, italian-vit, japanese-gsd, kazakh-ktb, korean-gsd, korean-kaist, kurmanji-mg, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-alksnis, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, old_russian-torot, persian-seraji, polish-lfg, polish-pdb, polish-sz, portuguese-bosque, portuguese-br, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, sanskrit-ufal, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, upper_sorbian-ufal, urdu-udtb, uyghur-udt, vietnamese-vtb, wolof-wtb
- some fixes as indicated in the NEWS file
What does parallel NLP annotation look like right now? Let’s do some annotation in French.
library(udpipe)
data("brussels_reviews", package = "udpipe")
x <- subset(brussels_reviews, language %in% "fr")
x <- data.frame(doc_id = x$id, text = x$feedback, stringsAsFactors = FALSE)
anno <- udpipe(x, "french-gsd", parallel.cores = 1, trace = 100)
anno <- udpipe(x, "french-gsd", parallel.cores = 4) ## this will be 4 times as fast if you have 4 CPU cores
View(anno)
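If you want to check the speed-up on your own machine, the comparison can be sketched roughly as follows. This is only a minimal sketch: the helper names pick_cores and time_annotation are made up for this illustration and are not part of udpipe. It assumes the udpipe package is installed and that the model can be downloaded on first use.

```r
# Hypothetical helper (not part of udpipe): pick a safe number of worker cores.
# Leaves `reserve` cores free and never goes below 1.
# parallel::detectCores() can return NA on some platforms, hence the fallback.
pick_cores <- function(reserve = 1L) {
  n <- parallel::detectCores()
  if (is.na(n)) return(1L)
  max(1L, as.integer(n) - as.integer(reserve))
}

# Hypothetical helper (not part of udpipe): time one annotation run.
# Returns the number of cores used, elapsed seconds and the number of tokens
# (the annotation has one row per token).
time_annotation <- function(x, model = "french-gsd", cores = 1L) {
  elapsed <- system.time(
    anno <- udpipe::udpipe(x, model, parallel.cores = cores)
  )[["elapsed"]]
  data.frame(cores = cores, seconds = elapsed, tokens = nrow(anno))
}

# Usage (downloads the model on first use, so it is not run automatically):
# data("brussels_reviews", package = "udpipe")
# x <- subset(brussels_reviews, language %in% "fr")
# x <- data.frame(doc_id = x$id, text = x$feedback, stringsAsFactors = FALSE)
# rbind(time_annotation(x, cores = 1L),
#       time_annotation(x, cores = pick_cores()))
```

The actual gain depends on the number of documents and on your hardware, so treat the timings as indicative only.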
Note that udpipe works particularly well in combination with the following R packages:
- crfsuite for entity recognition (more docs here)
- textrank for text summarisation (more docs here)
- BTM for topic modelling on short texts (more docs here)
- ruimtehol for text classification, text recommendation and finding similarities between articles, sentences, words, bigrams, labels, tags, persons, websites, entities and entity relations (more docs here and here)
And nothing stops you from using R packages tm / tidytext / quanteda or text2vec alongside it!
Upcoming training schedule
If you want to know more, come and attend the course on text mining with R or text mining with Python. Here is the list of upcoming public courses that BNOSAC provides each year at KU Leuven in Belgium.
- 2019-10-17&18: Statistical Machine Learning with R: Subscribe here
- 2019-11-14&15: Text Mining with R: Subscribe here
- 2019-12-17&18: Applied Spatial Modelling with R: Subscribe here
- 2020-02-19&20: Advanced R programming: Subscribe here
- 2020-03-12&13: Computer Vision with R and Python: Subscribe here
- 2020-03-16&17: Deep Learning/Image recognition: Subscribe here
- 2020-04-22&23: Text Mining with R: Subscribe here
- 2020-05-05&06: Text Mining with Python: Subscribe here