Today I’m pleased to announce a preview of the new version of text2vec. It lives in the 0.3 development branch, but very soon (probably in about a week) it will be merged into master.
To reproduce the examples below, please install text2vec from the 0.3 branch on GitHub.
I’m also looking for feedback from text2vec users, so please spend a few minutes on these questions:
- What APIs are not clear / not intuitive?
- What functionality is missing?
- Do you have any problems with speed / RAM usage?
In two words: text2vec became faster and more user-friendly. While working on this version I barely touched the underlying core C++ code and focused on high-level features and usability. First I will briefly describe the main improvements, and then I will provide a full-featured example.
In this post I would like to highlight the following improvements:
- important bugfix
- dtm keeps document ids as rownames
- several API breaks: some functions were removed, some were renamed and some have different default arguments
- performance improvements: all core functions have a parallel mode
The full list of features and changes is available on GitHub, marked with the 0.3 tag.
There was one significant bug: when the last document had no terms (at least none from the vocabulary), i.e. the last row of dtm was all zeros, the get_dtm() function omitted this last row. As a result, dtm had fewer rows than the number of documents in corpus. This is now fixed.
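This class of bug is easy to reproduce outside text2vec. The sketch below is not the actual text2vec code; it only uses Matrix::sparseMatrix with hypothetical toy data to show how a trailing all-zero row disappears when the matrix dimensions are inferred from the non-zero entries instead of being passed explicitly.

```r
library(Matrix)

# Hypothetical toy corpus: 3 documents, 2 vocabulary terms.
# Document 3 contains no vocabulary terms, so it contributes no entries.
doc_index  <- c(1L, 1L, 2L)  # row (document) of each non-zero count
term_index <- c(1L, 2L, 1L)  # column (term) of each non-zero count
counts     <- c(2, 1, 3)
n_docs  <- 3L
n_terms <- 2L

# Dimensions inferred from the largest index: the empty last document is lost.
dtm_inferred <- sparseMatrix(i = doc_index, j = term_index, x = counts)

# Dimensions passed explicitly: the all-zero last row is kept.
dtm_explicit <- sparseMatrix(i = doc_index, j = term_index, x = counts,
                             dims = c(n_docs, n_terms))

nrow(dtm_inferred)
nrow(dtm_explicit)
```

With explicit dimensions the number of rows always matches the number of documents, whatever the last document contains.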
Preserving document ids in dtm
I’m not only the developer of text2vec, but also probably its most active user. Since the first public release I have felt that some rough edges needed improvement. One of the most obviously missing things was a mechanism for keeping document ids during dtm construction. Now it is straightforward: if the input of the itoken function has names, these names will be used as document ids (the rownames of dtm).
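To illustrate the idea of carrying names through to rownames, here is a minimal base R sketch. It does not call text2vec at all: the data frame, its review texts and the dense count matrix are hypothetical stand-ins, and only the two ids are borrowed from the output shown later in this post.

```r
# Hypothetical input: an id column and a text column.
reviews <- data.frame(
  id = c("5814_8", "2381_9"),
  review = c("great movie", "boring movie"),
  stringsAsFactors = FALSE
)

# Attach ids as names; strsplit() keeps the names of its input.
docs <- setNames(reviews$review, reviews$id)
tokens <- strsplit(docs, " ", fixed = TRUE)

vocab <- sort(unique(unlist(tokens, use.names = FALSE)))

# One row of term counts per document; vapply() propagates the names,
# so after transposing they end up as rownames.
dtm <- t(vapply(
  tokens,
  function(tk) as.numeric(table(factor(tk, levels = vocab))),
  numeric(length(vocab))
))
colnames(dtm) <- vocab

rownames(dtm)
```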
New high-level API
Previously, corpus was the central object. We can think of it as a container with reference semantics, which allows us to perform vectorization and collect term co-occurrence statistics simultaneously. After the corpus is created, only two functions are useful in 99% of cases: get_dtm and get_tcm. After that, users usually work with matrices. This means that corpus is actually an intermediate object and should mainly be used internally. In real life users usually need a Document-Term matrix (dtm) or a Term-Cooccurrence matrix (tcm), so creating these directly simplifies the transition from raw text to a vector space.
In 0.3 I introduce a new higher-level API for direct dtm and tcm creation: the create_dtm() and create_tcm() functions. This simplification also allows me to implement efficient concurrent growing of these matrices. create_dtm() and create_tcm() use create_corpus() internally, but hide all the gory details and take care of parallel execution. Experienced users who need to vectorize a corpus and collect co-occurrence statistics simultaneously can still use create_corpus() and the corresponding get_dtm() and get_tcm() functions.
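For readers new to the term, the sketch below shows what a tcm contains. It is a plain base R illustration under simplifying assumptions, not how text2vec builds it: the function name count_cooccurrences is hypothetical, counts are unweighted, and only pairs where the second term follows the first within the window are counted.

```r
# Hypothetical helper: count how often term j appears within `window`
# positions to the right of term i.
count_cooccurrences <- function(tokens, vocab, window = 1L) {
  tcm <- matrix(0, nrow = length(vocab), ncol = length(vocab),
                dimnames = list(vocab, vocab))
  idx <- match(tokens, vocab)  # NA for out-of-vocabulary tokens
  n <- length(idx)
  for (i in seq_len(max(n - 1L, 0L))) {
    for (j in seq(i + 1L, min(i + window, n))) {
      if (!is.na(idx[i]) && !is.na(idx[j])) {
        tcm[idx[i], idx[j]] <- tcm[idx[i], idx[j]] + 1
      }
    }
  }
  tcm
}

tokens <- c("text", "mining", "in", "text")
vocab  <- c("in", "mining", "text")
tcm <- count_cooccurrences(tokens, vocab, window = 1L)
tcm
```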
Another refinement is the introduction of vectorizers. A vectorizer is a function which performs the mapping from raw text space to vector space. There are 2 kinds of vectorizers:
- vocab_vectorizer, which uses a vocabulary to perform bag-of-ngrams vectorization;
- hash_vectorizer, which uses feature hashing (the hashing trick);
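The difference between the two approaches comes down to how a token is turned into a column index. The base R sketch below is only an illustration: hash_token is a hypothetical toy polynomial hash, not the hash function text2vec uses, and the tokens, vocabulary and hash_size are made up.

```r
# Hypothetical toy hash: maps any string to a column index in 1..hash_size.
hash_token <- function(token, hash_size) {
  h <- 0
  for (code in utf8ToInt(token)) {
    h <- (h * 31 + code) %% hash_size
  }
  h + 1
}

tokens <- c("text", "mining", "text", "unseen")

# Vocabulary vectorization: the column is the position in a known vocabulary.
# Tokens outside the vocabulary get NA and are dropped.
vocab <- c("mining", "text")
vocab_cols <- match(tokens, vocab)

# Hash vectorization: no vocabulary needed, every token gets a column,
# at the cost of possible collisions between different tokens.
hash_size <- 16
hash_cols <- vapply(tokens, hash_token, numeric(1),
                    hash_size = hash_size, USE.NAMES = FALSE)

vocab_cols
hash_cols
```

The vocabulary approach needs an extra pass over the data to collect the terms first, while hashing works in a single pass.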
As was pointed out here, in the case of vocabulary vectorization we perform 2 passes over the input source. This means we read, preprocess and tokenize twice. While I/O is usually not an issue (if you use an efficient reader like data.table::fread or the functions from the readr package), preprocessing can take a significant amount of time. For this reason I created an itoken S3 method which works with a list of character vectors, i.e. a list of tokens. Now the user can tokenize the input once and then reuse the list of tokens in vocabulary, dtm or tcm construction. See the examples below.
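The "tokenize once, reuse twice" idea can be sketched without text2vec. The example below is an assumption-laden illustration using base R and the Matrix package: the documents are hypothetical, and both passes read the same precomputed token list instead of re-running preprocessing.

```r
library(Matrix)

# Hypothetical documents.
docs <- c(d1 = "Text mining in R", d2 = "R is fast", d3 = "Mining text")

# Preprocess and tokenize exactly once.
tokens <- strsplit(tolower(docs), " ", fixed = TRUE)

# Pass 1 over the tokens: collect the vocabulary.
vocab <- sort(unique(unlist(tokens, use.names = FALSE)))

# Pass 2 over the same tokens: build the document-term matrix.
doc_index  <- rep(seq_along(tokens), lengths(tokens))
term_index <- match(unlist(tokens, use.names = FALSE), vocab)
dtm <- sparseMatrix(i = doc_index, j = term_index,
                    x = rep(1, length(doc_index)),
                    dims = c(length(tokens), length(vocab)),
                    dimnames = list(names(docs), vocab))
dtm
```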
There were several improvements to vocabulary construction:
- stopwords filtering during vocabulary construction (especially useful for ngrams with n > 1);
- vocabulary can be built in parallel using all your CPU cores;
- prune_vocabulary() became slightly more efficient: it performs fewer unnecessary computations;
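As a rough picture of what stopword filtering and pruning do to a vocabulary, here is a base R sketch. It does not use the text2vec functions: the token list, the stopword list and the count threshold are hypothetical, and only the column names mirror the vocabulary structure printed later in this post.

```r
# Hypothetical tokenized documents and stopwords.
tokens <- list(
  c("i", "like", "text", "mining"),
  c("i", "like", "r"),
  c("text", "mining", "in", "r", "rocks")
)
stopwords <- c("i", "in")

# Total occurrences of each term, and number of documents containing it.
term_counts <- table(unlist(tokens))
doc_counts  <- table(unlist(lapply(tokens, unique)))

vocab <- data.frame(
  terms        = names(term_counts),
  terms_counts = as.integer(term_counts),
  doc_counts   = as.integer(doc_counts[names(term_counts)]),
  stringsAsFactors = FALSE
)

# Stopwords are removed while the vocabulary is built.
vocab <- vocab[!vocab$terms %in% stopwords, ]

# Pruning: keep only terms seen at least twice (hypothetical threshold).
pruned <- vocab[vocab$terms_counts >= 2L, ]
pruned
```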
All transformers were renamed and now all start with transformer_* (this makes working with autocompletion more convenient):
- transformer_filter_commons is still useful, even though it partly overlaps with other functionality
The following example demonstrates the new pipeline with many text2vec features (note how flexible text2vec can be, thanks to its functional style):
Loading required package: methods
List of 4
 $ vocab         :Classes 'data.table' and 'data.frame': 9595 obs. of 3 variables:
  ..$ terms       : chr [1:9595] "fiorentino" "bfg" "tadashi" "kabei" ...
  ..$ terms_counts: int [1:9595] 5 8 5 5 11 5 6 10 6 8 ...
  ..$ doc_counts  : int [1:9595] 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, ".internal.selfref")=<externalptr>
 $ ngram         : Named int [1:2] 1 1
  ..- attr(*, "names")= chr [1:2] "ngram_min" "ngram_max"
 $ document_count: int 5000
 $ stopwords     : chr [1:11] "i" "me" "my" "myself" ...
 - attr(*, "class")= chr "text2vec_vocabulary"
One important note: in the current R implementation, iterators are mutable. So at this point our iterator is empty:
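The effect is easy to see with a hand-rolled iterator. The sketch below is a hypothetical closure-based iterator in base R, not the itoken implementation; it only shows why a second pass over an already consumed iterator yields nothing.

```r
# Hypothetical minimal iterator: the position `i` is mutable state
# shared by the two closures.
make_iterator <- function(x) {
  i <- 0L
  list(
    has_next   = function() i < length(x),
    next_value = function() {
      i <<- i + 1L
      x[[i]]
    }
  )
}

docs <- list("first document", "second document")
it <- make_iterator(docs)

# First pass consumes the iterator.
n_first <- 0L
while (it$has_next()) {
  it$next_value()
  n_first <- n_first + 1L
}

# Second pass over the same object sees no elements.
n_second <- 0L
while (it$has_next()) {
  it$next_value()
  n_second <- n_second + 1L
}

# A fresh iterator is needed for another pass.
it <- make_iterator(docs)
```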
Before dtm or tcm construction we need to reinitialise it. Here we create the dtm, whose rownames are the document ids:
 "5814_8" "2381_9" "7759_3" "3630_4" "9495_8" "8196_8"
Old-style simultaneous vectorization and collection of co-occurrence statistics:
Another option is to use hash_vectorizer. The procedure is the same:
create_dtm(), create_tcm() and vocabulary construction take advantage of multicore machines and do so in a transparent manner. In contrast to GloVe fitting, which uses low-level thread parallelism via RcppParallel, the other functions use standard R high-level parallelism on top of the foreach package. They are flexible and can use different parallel backends, for example doParallel or doRedis. But the user should remember that such high-level parallelism can involve significant overhead.
Only two things must be done manually to take advantage of a multicore machine:
- prepare splits of the input data in the form of a list of itoken iterators
- register a parallel backend
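The two manual steps above follow the usual foreach pattern, which can be sketched as follows. This is a generic illustration assuming the foreach and doParallel packages are installed: plain character vectors stand in for the itoken iterators, and count_terms is a hypothetical worker function, not part of text2vec.

```r
library(foreach)
library(doParallel)

# Hypothetical documents and a fixed vocabulary.
docs <- c("text mining in r", "r is fast", "mining text", "fast text mining")
vocab <- sort(unique(unlist(strsplit(docs, " ", fixed = TRUE))))

# Hypothetical worker: term counts for one chunk of documents.
count_terms <- function(chunk, vocab) {
  tokens <- unlist(strsplit(chunk, " ", fixed = TRUE))
  as.numeric(table(factor(tokens, levels = vocab)))
}

# Step 1: prepare splits of the input data.
n_splits <- 2L
splits <- split(docs, rep_len(seq_len(n_splits), length(docs)))

# Step 2: register a parallel backend.
registerDoParallel(cores = n_splits)

# Each split is processed by a worker; partial counts are summed.
parallel_counts <- foreach(chunk = splits, .combine = "+") %dopar% {
  count_terms(chunk, vocab)
}
stopImplicitCluster()

# The sequential result, for comparison.
sequential_counts <- count_terms(docs, vocab)
```

On inputs this small the parallel version will most likely be slower than the sequential one, because starting workers and moving data between processes costs more than the work itself.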
Here is a simple example with timings:
Loading required package: foreach Loading required package: iterators Loading required package: parallel
user system elapsed 0.363 0.000 0.364
user system elapsed 0.020 0.019 0.260
user system elapsed 0.435 0.043 0.957
user system elapsed 1.288 0.301 0.693
user system elapsed 0.488 0.157 0.930
user system elapsed 0.764 0.183 0.542
user system elapsed 0.787 0.285 3.053
user system elapsed 2.871 0.202 1.829
As you can see, the speedup is not perfect. This happens because R’s high-level parallelism has significant overhead on small tasks. On larger tasks you can expect an almost linear speedup!
Bonus: how fast is fast?
On a 16-core machine I was able to vectorize (unigrams) the English Wikipedia (13 GB of text, 4M documents) in 2.5 minutes using the hash vectorizer and in 6 minutes using the vocabulary vectorizer. Timings include the time spent reading from disk! The resulting dtm was about 13 GB, and at peak the R processes consumed about 30 GB of RAM. (Try to do that with any other R package or Python module.)
Here is the code: