Text Processing using the textTinyR package

[This article was first published on mlampros, and kindly contributed to R-bloggers].

This blog post is about my recently released package on CRAN, textTinyR. The following notes and examples are based mainly on the package Vignette.

The advantage of the textTinyR package lies in its ability to process big text data files efficiently in batches. For this purpose, it offers functions for splitting, parsing, tokenizing and creating a vocabulary. Moreover, it includes functions for building either a document-term matrix or a term-document matrix and for extracting information from those (term-associations, most frequent terms). Lastly, it includes functions for calculating token statistics (collocations, look-up tables, string dissimilarities) and functions to work with sparse matrices. The source code is written mainly in C++11 and exposed to R through the Rcpp, RcppArmadillo and BH packages.

The following classes (based on the R6 package) and functions are part of the package:

classes

| big_tokenize_transform | sparse_term_matrix | token_stats |
|---|---|---|
| big_text_splitter() | Term_Matrix() | path_2vector() |
| big_text_parser() | Term_Matrix_Adjust() | freq_distribution() |
| big_text_tokenizer() | term_associations() | print_frequency() |
| vocabulary_accumulator() | most_frequent_terms() | count_character() |
| | | print_count_character() |
| | | collocation_words() |
| | | print_collocations() |
| | | string_dissimilarity_matrix() |
| | | look_up_table() |
| | | print_words_lookup_tbl() |

functions

| sparse_matrices | tokenization | utilities |
|---|---|---|
| dense_2sparse() | tokenize_transform_text() | bytes_converter() |
| load_sparse_binary() | tokenize_transform_vec_docs() | cosine_distance() |
| matrix_sparsity() | | dice_distance() |
| save_sparse_binary() | | levenshtein_distance() |
| sparse_Means() | | read_characters() |
| sparse_Sums() | | read_rows() |
| | | text_file_parser() |
| | | utf_locale() |
| | | vocabulary_parser() |

big_tokenize_transform class

The big_tokenize_transform class can be used to process big data files, and I’ll illustrate this using the English Wikipedia pages and articles (to download the data use the following web address: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2). The size of the file (after downloading and extracting locally) is approximately 59.4 GB and it’s of type .xml (to reproduce the results one needs approx. 200 GB of free hard drive space).

XML files have a tree structure and one should use queries to acquire specific information. First, I’ll inspect the structure of the .xml file by using the utility function read_rows(). The read_rows() function takes a file as input and, given the rows argument, returns a subset of the file. It doesn’t load the entire file in memory; it just opens the file and reads the specified number of rows:

```r
library(textTinyR)

PATH = 'enwiki-latest-pages-articles.xml'

subset = read_rows(input_file = PATH, read_delimiter = "\n",
                   rows = 100,
                   write_2file = "/subs_output.txt")
```

```
# data subset : subs_output.txt

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>https://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.28.0-wmf.23</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="-1" case="first-letter">Special</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Talk</namespace>
      <namespace key="2" case="first-letter">User</namespace>
      <namespace key="3" case="first-letter">User talk</namespace>
      <namespace key="4" case="first-letter">Wikipedia</namespace>
      <namespace key="5" case="first-letter">Wikipedia talk</namespace>
      <namespace key="6" case="first-letter">File</namespace>
      <namespace key="7" case="first-letter">File talk</namespace>
      <namespace key="8" case="first-letter">MediaWiki</namespace>
.
.
.
    </namespaces>
  </siteinfo>
  <page>
    <title>AccessibleComputing</title>
    <ns>0</ns>
    <id>10</id>
    <redirect title="Computer accessibility" />
    <revision>
      <id>631144794</id>
      <parentid>381202555</parentid>
      <timestamp>2014-10-26T04:50:23Z</timestamp>
      <contributor>
        <username>Paine Ellsworth</username>
        <id>9092818</id>
      </contributor>
      <comment>add [[WP:RCAT|rcat]]s</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">#REDIRECT [[Computer accessibility]]

</text>
      <sha1>4ro7vvppa5kmm0o1egfjztzcwd0vabw</sha1>
    </revision>
  </page>
  <page>
    <title>Anarchism</title>
    <ns>0</ns>
    <id>12</id>
    <revision>
      <id>746687538</id>
      <parentid>744318951</parentid>
      <timestamp>2016-10-28T22:43:19Z</timestamp>
      <contributor>
        <username>Eduen</username>
        <id>7527773</id>
      </contributor>
      <minor />
      <comment>/* Free love */</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">
```
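To make the idea of reading only the first rows concrete, here is a minimal base-R sketch. It is not how read_rows() is implemented (the package does this in C++); the helper name peek_rows() and the small stand-in file are assumptions made purely for illustration. It relies on the fact that readLines() with the n argument stops after n lines instead of reading the whole file.

```r
# Hypothetical helper, not part of textTinyR: return the first `rows` lines
# of a file without reading the rest of it.
peek_rows <- function(input_file, rows = 100) {
  con <- file(input_file, open = "r")     # open a read-only connection
  on.exit(close(con))                     # close it even if an error occurs
  readLines(con, n = rows, warn = FALSE)  # reads at most `rows` lines
}

# Small stand-in file, so the sketch runs without the 59.4 GB dump
tmp <- tempfile(fileext = ".xml")
writeLines(c("<mediawiki>",
             "  <siteinfo>",
             "    <sitename>Wikipedia</sitename>",
             "  </siteinfo>",
             "</mediawiki>"), tmp)

head_lines <- peek_rows(tmp, rows = 3)
print(head_lines)
```

For the real dump, the same call with PATH as input would return the first rows in the same way, although read_rows() additionally offers the write_2file option used above.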

In that way one gets a picture of the .xml tree structure and can continue by performing queries. The initial data file is too big to fit in the memory of a PC, thus it has to be split into smaller files, pre-processed and then returned as a single file. The main aim of the big_text_splitter() method is to split the data into smaller files of (approx.) equal size, either by using the batches parameter alone or, if the file has a structure, by adding the end_query parameter too. Here I’ll take advantage of both the batches and the end_query parameters, because I’ll use queries to extract the text tree-elements and I don’t want the file to be split arbitrarily. Each text element in the file begins and ends with the same key-word, i.e. text:

```r
btt = big_tokenize_transform$new(verbose = TRUE)

btt$big_text_splitter(input_path_file = PATH,                    # path to the enwiki data file
                      output_path_folder = "/enwiki_spl_data/",  # folder to save the files
                      end_query = '</text>',                     # splits the file taking into account the key-word
                      batches = 40,                              # split file in 40 batches (files)
                      trimmed_line = FALSE)                      # the lines will be trimmed
```
```
approx. 10 % of data pre-processed
approx. 20 % of data pre-processed
approx. 30 % of data pre-processed
approx. 40 % of data pre-processed
approx. 50 % of data pre-processed
approx. 60 % of data pre-processed
approx. 70 % of data pre-processed
approx. 80 % of data pre-processed
approx. 90 % of data pre-processed
approx. 100 % of data pre-processed

It took 42.7098 minutes to complete the splitting
```
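The reason for combining batches with end_query can be illustrated with a small base-R sketch. This is not the package's implementation: the helper split_on_query() is hypothetical, and it works on an in-memory character vector, whereas the real file has to be processed from disk. The point it shows is that a batch is closed only on a line that contains the end_query, so no text element is cut in half and the batch sizes are only approximately equal.

```r
# Hypothetical helper, not the package's implementation: assign each line to
# a batch, closing a batch only on a line that contains `end_query`.
split_on_query <- function(lines, batches, end_query) {
  target   <- ceiling(length(lines) / batches)  # approx. lines per batch
  batch_id <- integer(length(lines))
  current  <- 1L
  count    <- 0L
  for (i in seq_along(lines)) {
    batch_id[i] <- current
    count <- count + 1L
    # close the batch once it is large enough AND the element is complete
    if (count >= target && current < batches &&
        grepl(end_query, lines[i], fixed = TRUE)) {
      current <- current + 1L
      count   <- 0L
    }
  }
  split(lines, batch_id)
}

# Toy stand-in for the dump: three pages with text elements of varying length
toy <- c("<page>", "<text>a", "b</text>", "</page>",
         "<page>", "<text>c", "d", "e</text>", "</page>",
         "<page>", "<text>f</text>", "</page>")

parts <- split_on_query(toy, batches = 2, end_query = "</text>")
print(lengths(parts))
```

Because the first batch has to wait for a closing `</text>`, it ends up larger than the nominal half of the lines, which is the trade-off between equal sizes and intact elements.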

After the data is split and saved in the output_path_folder (“/enwiki_spl_data/”), the next step is to extract the text tree-elements from the batches by using the big_text_parser() method. The latter takes as arguments the previously created input_path_folder, an output_path_folder to save the resulting text files, a start_query, an end_query, the min_lines (only subsets of text with at least this many lines will be kept) and the trimmed_line (specifying if each line is already trimmed on both sides):

```r
btt$big_text_parser(input_path_folder = "/enwiki_spl_data/",        # the previously created folder
                    output_path_folder = "/enwiki_parse/",          # folder to save the parsed files
                    start_query = "<text xml:space=\"preserve\">",  # starts to extract text
                    end_query = "</text>",                          # stop to extract once here
                    min_lines = 1,
                    trimmed_line = TRUE)
```
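What the start_query, end_query and min_lines arguments express can be sketched in a few lines of base R. The helper extract_between() below is hypothetical and is not the package's implementation; it also drops empty lines inside an element, which is an assumption of this sketch rather than documented package behaviour.

```r
# Hypothetical helper, not the package's implementation: collect the lines of
# every element delimited by `start_query` and `end_query`.
extract_between <- function(lines, start_query, end_query, min_lines = 1) {
  out <- list()
  buf <- NULL                                  # NULL means "outside an element"
  for (ln in lines) {
    if (is.null(buf)) {
      pos <- regexpr(start_query, ln, fixed = TRUE)
      if (pos < 0) next                        # still outside, skip the line
      buf <- character(0)
      ln  <- substring(ln, pos + nchar(start_query))  # text after the opening query
    }
    pos <- regexpr(end_query, ln, fixed = TRUE)
    if (pos < 0) {
      buf <- c(buf, ln)                        # element continues on the next line
    } else {
      buf <- c(buf, substring(ln, 1, pos - 1)) # text before the closing query
      buf <- buf[nzchar(buf)]                  # drop empty lines (sketch assumption)
      if (length(buf) >= min_lines) out[[length(out) + 1]] <- buf
      buf <- NULL
    }
  }
  out
}

# Toy batch with already trimmed lines, mimicking the structure shown earlier
toy <- c('<page>',
         '<text xml:space="preserve">#REDIRECT [[Computer accessibility]]',
         '',
         '</text>',
         '</page>',
         '<page>',
         '<text xml:space="preserve">first line',
         'second line</text>',
         '</page>')

texts <- extract_between(toy, '<text xml:space="preserve">', '</text>', min_lines = 1)
print(texts)
```

Raising min_lines to 2 in this sketch would discard the one-line redirect and keep only the second element. With verbose = TRUE, the big_text_parser() call itself reports its progress for each of the 40 batches: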
```
====================
batch 1 begins ...
====================

approx. 10 % of data pre-processed
approx. 20 % of data pre-processed
approx. 30 % of data pre-processed
approx. 40 % of data pre-processed
approx. 50 % of data pre-processed
approx. 60 % of data pre-processed
approx. 70 % of data pre-processed
approx. 80 % of data pre-processed
approx. 90 % of data pre-processed
approx. 100 % of data pre-processed

It took 0.296151 minutes to complete the preprocessing

It took 0.0525948 minutes to save the pre-processed data

.
.
.
.

====================
batch 40 begins ...
====================

approx. 10 % of data pre-processed
approx. 20 % of data pre-processed
approx. 30 % of data pre-processed
approx. 40 % of data pre-processed
approx. 50 % of data pre-processed
approx. 60 % of data pre-processed
approx. 70 % of data pre-processed
approx. 80 % of data pre-processed
approx. 90 % of data pre-processed
approx. 100 % of data pre-processed

It took 1.04127 minutes to complete the preprocessing

It took 0.0448579 minutes to save the pre-processed data
```
<span class="w">

</span><span class="n">It</span><span class="w"> </span><span class="n">took</span><span class="w"> </span><span class="m">40.9034</span><span class="w"> </span><span class="n">minutes</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">complete</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">parsing</span><span class="w">

</span>

It’s worth mentioning here that big_text_parser() is more efficient when it extracts big chunks of text rather than one-liners. For one-line text queries it has to check the whole file line by line, which is inefficient, especially for files as large as the enwiki file.

Extracting the text chunks reduces the size of the .xml data to approximately 48.9 GB. One can now continue with the big_text_tokenizer() method to tokenize and transform the data. This method takes the following parameters:

  • batches (each file can be further split in batches during tokenization)
  • to_lower (convert to lower case)
  • to_upper (convert to upper case)
  • utf_locale (change the utf locale depending on the language)
  • remove_char (remove specific characters from the text)
  • remove_punctuation_string (remove punctuation before the data is split)
  • remove_punctuation_vector (remove punctuation after the data is split)
  • remove_numbers (remove numbers from the data)
  • trim_token (trim the tokens on both sides)
  • split_string (split the string)
  • split_separator (token split separator, where multiple delimiters can be used)
  • remove_stopwords (remove stopwords using one of the available languages or a user-defined vector of words)
  • language (the language to use)
  • min_num_char (the minimum number of characters to keep)
  • max_num_char (the maximum number of characters to keep)
  • stemmer (stemming of the words using either the porter2_stemmer or n-gram stemming; both methods will be explained in the tokenization function)
  • min_n_gram (minimum n-grams)
  • max_n_gram (maximum n-grams)
  • skip_n_gram (skip n-gram)
  • skip_distance (skip distance for n-grams)
  • n_gram_delimiter (n-gram delimiter)
  • concat_delimiter (delimiter used to concatenate the data in case one wants to save the file)
  • path_2folder (folder in which to save the data)
  • stemmer_ngram (the n-grams parameter, in case of n-gram stemming)
  • stemmer_gamma (the gamma parameter, in case of n-gram stemming)
  • stemmer_truncate (the truncation parameter, in case of n-gram stemming)
  • stemmer_batches (the batches parameter, in case of n-gram stemming)
  • threads (the number of cores to use in parallel)
  • save_2single_file (should the output data be saved in a single file)
  • increment_batch_nr (the enumeration of the output files will start from this number)
  • vocabulary_path_folder (folder in which the vocabulary is saved in separate files).

More information about those parameters can be found in the package documentation.
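
To make the n-gram parameters a bit more concrete, here is a minimal base-R sketch of how word n-grams between a minimum and a maximum length could be built and joined with a delimiter. The helper make_ngrams() and its argument names are hypothetical and only loosely mirror min_n_gram, max_n_gram and n_gram_delimiter; it is not the package's C++ implementation, and the ordering of the package's output may differ.

```r
# Illustrative base-R n-gram builder (hypothetical helper, not part of textTinyR).
make_ngrams <- function(tokens, min_n = 1, max_n = 2, delimiter = " ") {
  out <- character(0)
  for (n in seq(min_n, max_n)) {
    if (length(tokens) < n) next                      # not enough tokens for this n
    starts <- seq_len(length(tokens) - n + 1)         # start index of every n-gram
    grams <- vapply(starts,
                    function(i) paste(tokens[i:(i + n - 1)], collapse = delimiter),
                    character(1))
    out <- c(out, grams)
  }
  out
}

tokens <- c("text", "processing", "in", "batches")
ngrams <- make_ngrams(tokens, min_n = 1, max_n = 2, delimiter = "_")
print(ngrams)
```

With min_n = 1 and max_n = 2 the result contains the four single tokens followed by the three bigrams joined with an underscore.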

In this vignette I’ll continue using the following transformations:

  • conversion to lowercase
  • trim each line
  • split each line using multiple delimiters
  • remove the punctuation (once splitting has taken place)
  • remove the numbers from the tokens
  • limit the output words to a specific number of characters
  • remove the English stopwords
  • and save both the data (to a single file) and the vocabulary files (to a folder).
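
As a rough illustration of what the transformations listed above do to a single line of text, the following base-R sketch applies them in the listed order. The function simple_transform() and the tiny stopword vector are assumptions made for this example only; the package performs these steps in C++ and with its own stopword lists, so the details will differ.

```r
# Hypothetical helper that mimics the listed steps with base R only.
simple_transform <- function(line, stopwords, min_chars = 1, max_chars = 100) {
  line <- tolower(line)                                  # conversion to lowercase
  line <- trimws(line)                                   # trim the line
  tokens <- strsplit(line, "[ \r\n\t.,;:()?!/]+")[[1]]   # split on multiple delimiters
  tokens <- gsub("[[:punct:]]", "", tokens)              # remove remaining punctuation
  tokens <- gsub("[0-9]", "", tokens)                    # remove numbers from the tokens
  n_chars <- nchar(tokens)
  tokens <- tokens[n_chars >= min_chars & n_chars <= max_chars]  # keep tokens within the length limits
  tokens[!tokens %in% stopwords]                         # remove stopwords
}

stopwords <- c("the", "of", "and", "in", "a", "is")      # tiny illustrative list
result <- simple_transform("  The History of R (since 1993), in brief!  ", stopwords)
print(result)
# [1] "history" "r"       "since"   "brief"
```

The token "1993" becomes an empty string once the digits are removed and is then dropped by the minimum-length filter.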

Each initial file will be split into additional batches to limit the memory usage during the tokenization and transformation phase:

btt$big_text_tokenizer(input_path_folder = "/enwiki_parse/",   # the previously parsed data
                       batches = 4,     # each single file will be split further in 4 batches
                       to_lower = TRUE, trim_token = TRUE,
                       split_string = TRUE, max_num_char = 100,
                       split_separator = " \r\n\t.,;:()?!//",
                       remove_punctuation_vector = TRUE,
                       remove_numbers = TRUE,
                       remove_stopwords = TRUE,
                       threads = 4,
                       save_2single_file = TRUE,      # save to a single file
                       vocabulary_path_folder = "/enwiki_vocab/",  # path to vocabulary folder
                       path_2folder = "/enwiki_token/")   # folder to save the transformed data


====================================
transformation of file 1 starts ...
====================================

-------------------
batch 1 begins ...
-------------------

input of the data starts ...
conversion to lower case starts ...
removal of numeric values starts ...
the string-trim starts ...
the split of the character string and simultaneously the removal of the punctuation in the vector starts ...
stop words of the english language will be used
the removal of stop-words starts ...
character strings with more than or equal to 1 and less than 100 characters will be kept ...
the vocabulary counts will be saved in: /enwiki_vocab/batch1.txt
the pre-processed data will be saved in a single file in: /enwiki_token/

-------------------
batch 2 begins ...
-------------------

input of the data starts ...
conversion to lower case starts ...
removal of numeric values starts ...
the string-trim starts ...
the split of the character string and simultaneously the removal of the punctuation in the vector starts ...
stop words of the english language will be used
the removal of stop-words starts ...
character strings with more than or equal to 1 and less than 100 characters will be kept ...
the vocabulary counts will be saved in: /enwiki_vocab/batch1.txt
the pre-processed data will be saved in a single file in: /enwiki_token/

.
.
.
.

====================================
transformation of file 40 starts ...
====================================

-------------------
batch 1 begins ...
-------------------

input of the data starts ...
conversion to lower case starts ...
removal of numeric values starts ...
the string-trim starts ...
the split of the character string and simultaneously the removal of the punctuation in the vector starts ...
stop words of the english language will be used
the removal of stop-words starts ...
character strings with more than or equal to 1 and less than 100 characters will be kept ...
the vocabulary counts will be saved in: /enwiki_vocab/batch40.txt
...
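
The idea behind processing each file in batches, so that only a part of it is held in memory at any time, can be sketched in base R with a file connection. The helper process_in_batches() below is hypothetical and reads a fixed number of lines per batch, whereas the batches parameter of big_text_tokenizer() gives the number of batches per file; it is meant only to illustrate the principle, not how the package is implemented.

```r
# Hypothetical helper: apply `fun` to a text file in batches of `batch_size` lines,
# so that only one batch of lines is read into memory at a time.
process_in_batches <- function(path, batch_size, fun) {
  con <- file(path, open = "r")
  on.exit(close(con))                          # always release the connection
  results <- list()
  repeat {
    lines <- readLines(con, n = batch_size)    # continues where the last read stopped
    if (length(lines) == 0) break              # end of file reached
    results[[length(results) + 1]] <- fun(lines)
  }
  results
}

# Small self-contained demonstration with a temporary 10-line file
tmp <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:10), tmp)
counts <- process_in_batches(tmp, batch_size = 4, fun = length)
print(unlist(counts))
# [1] 4 4 2
unlink(tmp)
```

In practice fun would be a tokenization and transformation step whose output is written to disk per batch, rather than collected in a list.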

