
A comparison between spaCy and UDPipe for Natural Language Processing for R users


In the last few years, Natural Language Processing (NLP) has increasingly become an open, multilingual effort instead of being held back by language, country and legal boundaries. With the advent of openly available data for natural language processing tasks, as provided at http://universaldependencies.org, one can now relatively easily compare different toolkits which perform natural language processing. In this post we compare the udpipe R package to the spacyr R package.

UDPipe – spaCy comparison

A traditional natural language processing flow consists of a number of building blocks on top of which you can structure your natural language application, namely:

1. tokenisation
2. parts of speech tagging
3. lemmatisation
4. morphological feature tagging
5. syntactic dependency parsing
6. entity recognition
7. extracting word & sentence meaning

Both of these R packages allow you to do this. The main difference is how they get there: the udpipe package contains the UDPipe annotation toolkit natively in R and has no external software dependencies, while the spacyr package wraps the spaCy Python module and hence requires a Python installation with spaCy and its language models set up.
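As a minimal sketch of what such an annotation looks like (the Dutch example sentence is made up and the model names simply follow the calls used further down in this post; spacyr additionally needs a working Python installation with spaCy and its Dutch model), both packages return a data.frame with one row per token:

library(udpipe)
library(spacyr)

txt <- "De gebouwen in Brussel zijn prachtig."

## udpipe: download and load a model for Dutch, then annotate
ud_model <- udpipe_download_model(language = "dutch")
ud_model <- udpipe_load_model(ud_model$file)
x_ud <- as.data.frame(udpipe_annotate(ud_model, x = txt))
head(x_ud[, c("token", "lemma", "upos", "feats", "dep_rel")])

## spacyr: start up the spaCy backend, then annotate
spacy_initialize(model = "nl")
x_sp <- spacy_parse(txt, pos = TRUE, tag = TRUE, lemma = TRUE, dependency = TRUE)
head(x_sp)
spacy_finalize()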

Comparison

In the comparison, we will provide general feedback on the following elements:

Annotation languages

Ease of use

Annotation accuracy of the models

As the spaCy and UDPipe models for Spanish, Portuguese, French, Italian and Dutch have been built on data from the same Universal Dependencies treebanks (version 2.0), one can compare the accuracies of the different NLP processing steps (tokenisation, POS tagging, morphological feature tagging, lemmatisation, dependency parsing).
Evaluation is traditionally done by leaving some sentences out of the training data and checking how well the model performs on these hold-out sentences. As these sentences were annotated by humans, they are referred to as 'gold' data.
Accuracy statistics for the different NLP tasks were obtained by running the CoNLL 2017 shared task evaluation script on the hold-out test sets.

Exact reproducible details on the evaluation can be found at https://github.com/jwijffels/udpipe-spacy-comparison. Feel free to provide comments there.
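If you just want a rough sanity check in R instead of the official evaluation script, comparing the annotation columns directly can already be telling. This is only a sketch: the file names below are hypothetical and it assumes the predicted and the gold CoNLL-U files contain exactly the same tokens in the same order.

library(udpipe)
## Hypothetical file names: a gold hold-out file and the corresponding model output,
## both in CoNLL-U format and with identical tokenisation
gold <- udpipe_read_conllu("nl-ud-test.conllu")
pred <- udpipe_read_conllu("nl-ud-test-predicted.conllu")
## share of tokens with a correct universal POS tag and a correct lemma
mean(pred$upos == gold$upos)
mean(pred$lemma == gold$lemma)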

Annotation possibilities

Annotation speed

library(udpipe)
library(spacyr)
library(microbenchmark)

## benchmark both packages on the Dutch reviews of the brussels_reviews dataset
## which is shipped with the udpipe package
data(brussels_reviews, package = "udpipe")

f_udpipe <- function(x, model){
  ## annotate with udpipe and put the result in a data.frame
  x_anno <- udpipe_annotate(model, x = x)
  x_anno <- as.data.frame(x_anno)
  invisible()
}
f_spacy <- function(x){
  ## annotate with spaCy: tagging, lemmatisation and dependency parsing, no entity recognition
  x_anno <- spacy_parse(x, pos = TRUE, tag = TRUE, lemma = TRUE, entity = FALSE, dependency = TRUE)
  invisible()
}

## Dutch
x <- subset(brussels_reviews, language == "nl")
x <- x$feedback

## get the udpipe model for Dutch and start up the spaCy backend
ud_model <- udpipe_download_model(language = "dutch")
ud_model <- udpipe_load_model(ud_model$file)
spacy_initialize(model = "nl", python_executable = "C:/Users/Jan/Anaconda3/python.exe")

## time both annotation runs
microbenchmark(
  f_udpipe(x, model = ud_model),
  f_spacy(x),
  times = 2)
spacy_finalize()

Enjoy

Hopefully this provides you with some guidance when you are thinking about extending your NLP workflow with deeper natural language processing than mere sentiment analysis.
