Fuzzy string matching using fuzzywuzzyR and the reticulate package in R

[This article was first published on mlampros, and kindly contributed to R-bloggers].

I recently released another R package on CRAN, fuzzywuzzyR, which ports the fuzzywuzzy Python library to R. “fuzzywuzzy does fuzzy string matching by using the Levenshtein Distance to calculate the differences between sequences (of character strings).”
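To make the idea of Levenshtein distance concrete, here is a minimal sketch in base R using utils::adist, which needs no Python at all. The similarity() helper and its 0 to 100 normalisation are assumptions made for illustration; they are not the formula that fuzzywuzzy applies.

```r
# Levenshtein distance with base R (utils::adist), no Python required
a <- "new york jets"
b <- "new york giants"

d <- drop(utils::adist(a, b))          # minimum number of single-character edits

# Hypothetical normalisation to a 0-100 score (an assumption for illustration,
# not the formula used by fuzzywuzzy)
similarity <- function(x, y) {
  dist <- drop(utils::adist(x, y))
  round(100 * (1 - dist / max(nchar(x), nchar(y), 1)))
}

utils::adist("kitten", "sitting")      # classic example: 3 edits
similarity(a, a)                       # identical strings score 100
```

The fewer edits needed to turn one string into the other, the higher the score; the scorers shown later build on the same intuition.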

There is no big news here, as similar packages such as stringdist already exist in R. Why create the package, then? Well, I intend to participate in a recently launched Kaggle competition, and one popular method for building features (predictors) is fuzzy string matching, as explained in this blog post. My second aim was to use the reticulate package (newly released by RStudio), which “provides an R interface to Python modules, classes, and functions” and makes porting Python code to R far less cumbersome.

First, I’ll explain the functionality of the fuzzywuzzyR package, and then I’ll give some examples of how to take advantage of the reticulate package in R.

fuzzywuzzyR

The fuzzywuzzyR package includes R6 classes and functions for string matching:

classes

FuzzExtract: Extract(), ExtractBests(), ExtractWithoutOrder(), ExtractOne()

FuzzMatcher: Partial_token_set_ratio(), Partial_token_sort_ratio(), Ratio(), QRATIO(), WRATIO(), UWRATIO(), UQRATIO(), Token_sort_ratio(), Partial_ratio(), Token_set_ratio()

FuzzUtils: Full_process(), Make_type_consistent(), Asciidammit(), Asciionly(), Validate_string()

SequenceMatcher: ratio(), quick_ratio(), real_quick_ratio(), get_matching_blocks(), get_opcodes()

functions

GetCloseMatches()

The following code chunks / examples are part of the package documentation and give an idea of what can be done with the fuzzywuzzyR package.

FuzzExtract

Each method in the FuzzExtract class takes a character string and a sequence of character strings as input (except for the Dedupe method, which takes only a sequence of strings) and, given a processor and a scorer, returns one or more string matches with their corresponding scores (in the range 0 to 100). Information about the additional parameters (limit, score_cutoff and threshold) can be found in the package documentation.

<span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">fuzzywuzzyR</span><span class="p">)</span><span class="w">

</span><span class="n">word</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"new york jets"</span><span class="w">

</span><span class="n">choices</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Atlanta Falcons"</span><span class="p">,</span><span class="w"> </span><span class="s2">"New York Jets"</span><span class="p">,</span><span class="w"> </span><span class="s2">"New York Giants"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Dallas Cowboys"</span><span class="p">)</span><span class="w">


</span><span class="c1">#------------
# processor :
#------------
</span><span class="w">
</span><span class="n">init_proc</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">FuzzUtils</span><span class="o">$</span><span class="n">new</span><span class="p">()</span><span class="w">      </span><span class="c1"># initialization of FuzzUtils class to choose a processor
</span><span class="w">
</span><span class="n">PROC</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">init_proc</span><span class="o">$</span><span class="n">Full_process</span><span class="w">    </span><span class="c1"># processor-method
</span><span class="w">
</span><span class="n">PROC1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tolower</span><span class="w">                  </span><span class="c1"># base R function ( as an example for a processor )
</span><span class="w">
</span><span class="c1">#---------
# scorer :
#---------
</span><span class="w">
</span><span class="n">init_scor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">FuzzMatcher</span><span class="o">$</span><span class="n">new</span><span class="p">()</span><span class="w">    </span><span class="c1"># initialization of the scorer class
</span><span class="w">
</span><span class="n">SCOR</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">init_scor</span><span class="o">$</span><span class="n">WRATIO</span><span class="w">          </span><span class="c1"># chosen scorer function
</span><span class="w">

</span><span class="n">init</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">FuzzExtract</span><span class="o">$</span><span class="n">new</span><span class="p">()</span><span class="w">        </span><span class="c1"># Initialization of the FuzzExtract class
</span><span class="w">
</span><span class="n">init</span><span class="o">$</span><span class="n">Extract</span><span class="p">(</span><span class="n">string</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">sequence_strings</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">choices</span><span class="p">,</span><span class="w"> </span><span class="n">processor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">PROC</span><span class="p">,</span><span class="w"> </span><span class="n">scorer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SCOR</span><span class="p">)</span><span class="w">
  
</span>
<span class="w">
</span><span class="c1"># example output
</span><span class="w">  
  </span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="p">[[</span><span class="m">1</span><span class="p">]][[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="s2">"New York Jets"</span><span class="w">

</span><span class="p">[[</span><span class="m">1</span><span class="p">]][[</span><span class="m">2</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">100</span><span class="w">


</span><span class="p">[[</span><span class="m">2</span><span class="p">]]</span><span class="w">
</span><span class="p">[[</span><span class="m">2</span><span class="p">]][[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="s2">"New York Giants"</span><span class="w">

</span><span class="p">[[</span><span class="m">2</span><span class="p">]][[</span><span class="m">2</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">79</span><span class="w">


</span><span class="p">[[</span><span class="m">3</span><span class="p">]]</span><span class="w">
</span><span class="p">[[</span><span class="m">3</span><span class="p">]][[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="s2">"Atlanta Falcons"</span><span class="w">

</span><span class="p">[[</span><span class="m">3</span><span class="p">]][[</span><span class="m">2</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">29</span><span class="w">


</span><span class="p">[[</span><span class="m">4</span><span class="p">]]</span><span class="w">
</span><span class="p">[[</span><span class="m">4</span><span class="p">]][[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="s2">"Dallas Cowboys"</span><span class="w">

</span><span class="p">[[</span><span class="m">4</span><span class="p">]][[</span><span class="m">2</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">22</span><span class="w">
  
</span>
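The nested list printed above is awkward to filter or sort. A minimal base R sketch for flattening it into a data frame follows; the res object is typed in by hand from that output so the snippet runs without Python, and the column names match and score are my own choice rather than anything defined by the package.

```r
# Nested list shaped like the Extract() output printed above
res <- list(
  list("New York Jets", 100L),
  list("New York Giants", 79L),
  list("Atlanta Falcons", 29L),
  list("Dallas Cowboys", 22L)
)

# Flatten into a data frame: one row per candidate
matches <- data.frame(
  match = vapply(res, function(x) x[[1]], character(1)),
  score = vapply(res, function(x) as.numeric(x[[2]]), numeric(1)),
  stringsAsFactors = FALSE
)

matches[matches$score >= 50, ]   # keep only reasonably close candidates
```

In practice the hand-typed res would be replaced by the value returned from init$Extract().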
<span class="w">
</span><span class="c1"># extracts best matches (limited to 2 matches)
</span><span class="w">
</span><span class="n">init</span><span class="o">$</span><span class="n">ExtractBests</span><span class="p">(</span><span class="n">string</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">sequence_strings</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">choices</span><span class="p">,</span><span class="w"> </span><span class="n">processor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">PROC1</span><span class="p">,</span><span class="w">

                  </span><span class="n">scorer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SCOR</span><span class="p">,</span><span class="w"> </span><span class="n">score_cutoff</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0L</span><span class="p">,</span><span class="w"> </span><span class="n">limit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2L</span><span class="p">)</span><span class="w">
                  
</span>
<span class="w">
</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="p">[[</span><span class="m">1</span><span class="p">]][[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="s2">"New York Jets"</span><span class="w">

</span><span class="p">[[</span><span class="m">1</span><span class="p">]][[</span><span class="m">2</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">100</span><span class="w">


</span><span class="p">[[</span><span class="m">2</span><span class="p">]]</span><span class="w">
</span><span class="p">[[</span><span class="m">2</span><span class="p">]][[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="s2">"New York Giants"</span><span class="w">

</span><span class="p">[[</span><span class="m">2</span><span class="p">]][[</span><span class="m">2</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">79</span><span class="w">

</span>
<span class="w">
</span><span class="c1"># extracts matches without keeping the output order
</span><span class="w">
</span><span class="n">init</span><span class="o">$</span><span class="n">ExtractWithoutOrder</span><span class="p">(</span><span class="n">string</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">sequence_strings</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">choices</span><span class="p">,</span><span class="w"> </span><span class="n">processor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">PROC</span><span class="p">,</span><span class="w">

                         </span><span class="n">scorer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SCOR</span><span class="p">,</span><span class="w"> </span><span class="n">score_cutoff</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0L</span><span class="p">)</span><span class="w">

</span>
<span class="w">
</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="p">[[</span><span class="m">1</span><span class="p">]][[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="s2">"Atlanta Falcons"</span><span class="w">

</span><span class="p">[[</span><span class="m">1</span><span class="p">]][[</span><span class="m">2</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">29</span><span class="w">


</span><span class="p">[[</span><span class="m">2</span><span class="p">]]</span><span class="w">
</span><span class="p">[[</span><span class="m">2</span><span class="p">]][[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="s2">"New York Jets"</span><span class="w">

</span><span class="p">[[</span><span class="m">2</span><span class="p">]][[</span><span class="m">2</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">100</span><span class="w">


</span><span class="p">[[</span><span class="m">3</span><span class="p">]]</span><span class="w">
</span><span class="p">[[</span><span class="m">3</span><span class="p">]][[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="s2">"New York Giants"</span><span class="w">

</span><span class="p">[[</span><span class="m">3</span><span class="p">]][[</span><span class="m">2</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">79</span><span class="w">


</span><span class="p">[[</span><span class="m">4</span><span class="p">]]</span><span class="w">
</span><span class="p">[[</span><span class="m">4</span><span class="p">]][[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="s2">"Dallas Cowboys"</span><span class="w">

</span><span class="p">[[</span><span class="m">4</span><span class="p">]][[</span><span class="m">2</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">22</span><span class="w">

</span>
<span class="w">
</span><span class="c1"># extracts first result 
</span><span class="w">
</span><span class="n">init</span><span class="o">$</span><span class="n">ExtractOne</span><span class="p">(</span><span class="n">string</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">sequence_strings</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">choices</span><span class="p">,</span><span class="w"> </span><span class="n">processor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">PROC</span><span class="p">,</span><span class="w">

                </span><span class="n">scorer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SCOR</span><span class="p">,</span><span class="w"> </span><span class="n">score_cutoff</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0L</span><span class="p">)</span><span class="w">

</span>
<span class="w">
</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="s2">"New York Jets"</span><span class="w">

</span><span class="p">[[</span><span class="m">2</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">100</span><span class="w">

</span>
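The roles of the processor and the scorer may be easier to see in a stripped-down version of the same flow: clean the strings, score every candidate, then rank. The sketch below is a hypothetical base R helper (extract_sketch is not part of fuzzywuzzyR), and its Levenshtein-based score is an assumption that will not reproduce the WRATIO values shown above.

```r
# Hypothetical helper, not part of fuzzywuzzyR:
# mimics the processor -> scorer -> rank flow with base R only
extract_sketch <- function(string, sequence_strings,
                           processor = tolower, limit = NULL) {
  query <- processor(string)               # clean the query
  cands <- processor(sequence_strings)     # clean every candidate the same way

  # scorer: Levenshtein distance rescaled to 0-100 (illustrative assumption)
  dist   <- drop(utils::adist(query, cands))
  scores <- round(100 * (1 - dist / pmax(nchar(query), nchar(cands), 1)))

  # rank candidates from best to worst and optionally keep the top 'limit'
  ord <- order(scores, decreasing = TRUE)
  if (!is.null(limit)) ord <- head(ord, limit)

  data.frame(match = sequence_strings[ord], score = scores[ord],
             stringsAsFactors = FALSE)
}

choices_demo <- c("Atlanta Falcons", "New York Jets",
                  "New York Giants", "Dallas Cowboys")

top2 <- extract_sketch("new york jets", choices_demo, limit = 2L)
top2
```

Swapping the processor (for example, identity instead of tolower) changes which candidates count as exact matches, which is the reason the package lets you pass it in.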

The Dedupe method removes duplicates from a sequence of character strings using fuzzy string matching:

<span class="w">
</span><span class="n">duplicat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'Frodo Baggins'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Tom Sawyer'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Bilbo Baggin'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Samuel L. Jackson'</span><span class="p">,</span><span class="w">

             </span><span class="s1">'F. Baggins'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Frody Baggins'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Bilbo Baggins'</span><span class="p">)</span><span class="w">


</span><span class="n">init</span><span class="o">$</span><span class="n">Dedupe</span><span class="p">(</span><span class="n">contains_dupes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">duplicat</span><span class="p">,</span><span class="w"> </span><span class="n">threshold</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">70L</span><span class="p">,</span><span class="w"> </span><span class="n">scorer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SCOR</span><span class="p">)</span><span class="w">

</span>
<span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="s2">"Frodo Baggins"</span><span class="w">     </span><span class="s2">"Samuel L. Jackson"</span><span class="w"> </span><span class="s2">"Bilbo Baggins"</span><span class="w">     </span><span class="s2">"Tom Sawyer"</span><span class="w">

</span>
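The general idea behind this kind of deduplication can be sketched in base R: group every string with the strings that score above the threshold against it, then keep one representative per group. The helper below is hypothetical (dedupe_sketch is not part of fuzzywuzzyR). It assumes a plain Levenshtein-based score and keeps the longest string in each group, breaking ties alphabetically, so its result can differ from the WRATIO-based output above.

```r
# Hypothetical helper, not part of fuzzywuzzyR
dedupe_sketch <- function(contains_dupes, threshold = 70) {
  lowered <- tolower(contains_dupes)

  # pairwise 0-100 similarity matrix from Levenshtein distances
  dist   <- utils::adist(lowered)
  lens   <- nchar(lowered)
  maxlen <- pmax(outer(lens, lens, pmax), 1)
  sim    <- 100 * (1 - dist / maxlen)

  # for each string, pick the representative of its group of near-duplicates:
  # the longest string, ties broken alphabetically
  canonical <- vapply(seq_along(contains_dupes), function(i) {
    group <- contains_dupes[sim[i, ] > threshold]
    group[order(-nchar(group), group)][1]
  }, character(1))

  unique(canonical)
}

dupes_demo <- c("Frodo Baggins", "Tom Sawyer", "Bilbo Baggin", "Samuel L. Jackson",
                "F. Baggins", "Frody Baggins", "Bilbo Baggins")

deduped <- dedupe_sketch(dupes_demo, threshold = 70)
deduped
```

Lowering the threshold merges more aggressively, while raising it keeps more near-duplicates apart.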

FuzzMatcher

Each method in the FuzzMatcher class takes two character strings (string1, string2) as input and returns a score (in the range 0 to 100). Information about the additional parameters (force_ascii, full_process and threshold) can be found in the package documentation.

<span class="w">
</span><span class="n">s</span><span class="m">1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Atlanta Falcons"</span><span class="w">

</span><span class="n">s</span><span class="m">2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"New York Jets"</span><span class="w">

</span><span class="n">init</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">FuzzMatcher</span><span class="o">$</span><span class="n">new</span><span class="p">()</span><span class="w">          </span><span class="c1"># initialization of FuzzMatcher class</span><span class="w">

</span><span class="n">init</span><span class="o">$</span><span class="n">Partial_token_set_ratio</span><span class="p">(</span><span class="n">string1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">string2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">force_ascii</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">full_process</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">

</span><span class="c1"># example output
</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">31</span><span class="w">

</span>
<span class="w">
</span><span class="n">init</span><span class="o">$</span><span class="n">Partial_token_sort_ratio</span><span class="p">(</span><span class="n">string1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">string2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">force_ascii</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">full_process</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">


</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">31</span><span class="w">

</span>
<span class="w">
</span><span class="n">init</span><span class="o">$</span><span class="n">Ratio</span><span class="p">(</span><span class="n">string1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">string2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="m">2</span><span class="p">)</span><span class="w">

</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">21</span><span class="w">

</span>
<span class="w">
</span><span class="n">init</span><span class="o">$</span><span class="n">QRATIO</span><span class="p">(</span><span class="n">string1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">string2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">force_ascii</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">

</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">29</span><span class="w">

</span>
<span class="w">
</span><span class="n">init</span><span class="o">$</span><span class="n">WRATIO</span><span class="p">(</span><span class="n">string1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">string2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">force_ascii</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">

</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">29</span><span class="w">

</span>
<span class="w">
</span><span class="n">init</span><span class="o">$</span><span class="n">UWRATIO</span><span class="p">(</span><span class="n">string1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">string2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="m">2</span><span class="p">)</span><span class="w">

</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">29</span><span class="w">

</span>
<span class="w">
</span><span class="n">init</span><span class="o">$</span><span class="n">UQRATIO</span><span class="p">(</span><span class="n">string1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">string2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="m">2</span><span class="p">)</span><span class="w">

</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">29</span><span class="w">

</span>
<span class="w">
</span><span class="n">init</span><span class="o">$</span><span class="n">Token_sort_ratio</span><span class="p">(</span><span class="n">string1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">string2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">force_ascii</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">full_process</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">

</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">29</span><span class="w">

</span>
<span class="w">

</span><span class="n">init</span><span class="o">$</span><span class="n">Partial_ratio</span><span class="p">(</span><span class="n">string1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">string2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="m">2</span><span class="p">)</span><span class="w">

</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">23</span><span class="w">

</span>
<span class="w">
</span><span class="n">init</span><span class="o">$</span><span class="n">Token_set_ratio</span><span class="p">(</span><span class="n">string1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">string2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">force_ascii</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">full_process</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">

</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">29</span><span class="w">

</span>
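The scorers differ mainly in how they prepare the two strings before comparing them. As a rough illustration of the token-sort idea (split into words, sort the words, then compare), here is a base R sketch. Both helpers are hypothetical and use a simple Levenshtein-based score, so the numbers are not those that the Python library would return.

```r
# Hypothetical helpers for illustration; fuzzywuzzyR's own scorers are computed in Python
plain_score <- function(x, y) {
  dist <- drop(utils::adist(x, y))
  round(100 * (1 - dist / max(nchar(x), nchar(y), 1)))
}

# split on whitespace, sort the words, and glue them back together
sort_tokens <- function(x) {
  tokens <- strsplit(tolower(trimws(x)), "\\s+")[[1]]
  paste(sort(tokens), collapse = " ")
}

token_sort_score <- function(x, y) plain_score(sort_tokens(x), sort_tokens(y))

plain_score("new york jets", "jets new york")        # penalised by word order
token_sort_score("new york jets", "jets new york")   # 100: same tokens once sorted
```

A word-order-insensitive scorer of this kind is useful when the same name can appear as "first last" or "last first".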

FuzzUtils

The FuzzUtils class includes a number of utility methods, of which Full_process is the most important: besides its main functionality, it can also be used as a secondary function (for instance, as a processor) in some of the other fuzzy matching classes.

<span class="w">
</span><span class="n">s</span><span class="m">1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Frodo Baggins'</span><span class="w">

</span><span class="n">init</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">FuzzUtils</span><span class="o">$</span><span class="n">new</span><span class="p">()</span><span class="w">

</span><span class="n">init</span><span class="o">$</span><span class="n">Full_process</span><span class="p">(</span><span class="n">string</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">force_ascii</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">

</span>
<span class="w">
</span><span class="c1"># example output
</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="s2">"frodo baggins"</span><span class="w">

</span>
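For readers who want to see what this kind of clean-up involves, the sketch below approximates it in base R: optionally drop non-ASCII characters, turn non-word characters into spaces, lower-case, and trim. full_process_sketch is a hypothetical helper written for illustration, and edge cases may behave differently from the Python function that Full_process calls.

```r
# Hypothetical re-implementation for illustration; Full_process itself calls Python
full_process_sketch <- function(string, force_ascii = TRUE) {
  if (force_ascii) {
    # drop characters that cannot be represented in ASCII
    string <- iconv(string, from = "UTF-8", to = "ASCII", sub = "")
  }
  string <- gsub("\\W", " ", string, perl = TRUE)   # non-word characters become spaces
  trimws(tolower(string))                           # lower-case and strip the ends
}

full_process_sketch("Frodo Baggins")        # "frodo baggins", as in the output above
full_process_sketch("  Frodo-Baggins!! ")   # "frodo baggins"
```

Running both inputs through the same clean-up before scoring is what lets differently punctuated or capitalised strings match exactly.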

GetCloseMatches

The GetCloseMatches function returns a list of the best “good enough” matches. The string parameter is the sequence for which close matches are desired (typically a character string), and sequence_strings is a list of sequences against which to match it (typically a list of strings). In the example, n caps the number of matches returned and cutoff is the minimum similarity (between 0 and 1) that a candidate must reach.

<span class="w">
</span><span class="n">vec</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'Frodo Baggins'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Tom Sawyer'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Bilbo Baggin'</span><span class="p">)</span><span class="w">

</span><span class="n">str1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Fra Bagg'</span><span class="w">

</span><span class="n">GetCloseMatches</span><span class="p">(</span><span class="n">string</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">str1</span><span class="p">,</span><span class="w"> </span><span class="n">sequence_strings</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">vec</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2L</span><span class="p">,</span><span class="w"> </span><span class="n">cutoff</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.6</span><span class="p">)</span><span class="w">


</span>
<span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="s2">"Frodo Baggins"</span><span class="w">

</span>
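
Because the description of GetCloseMatches follows Python's difflib, the same query can be reproduced directly on the Python side. The following is a minimal sketch, assuming that the R arguments string, sequence_strings, n and cutoff map one-to-one onto difflib.get_close_matches (the variable names are only illustrative):

```python
import difflib

candidates = ["Frodo Baggins", "Tom Sawyer", "Bilbo Baggin"]
query = "Fra Bagg"

# n caps the number of returned matches; cutoff is the minimum
# similarity (between 0 and 1) a candidate needs in order to be kept
matches = difflib.get_close_matches(query, candidates, n=2, cutoff=0.6)
print(matches)  # ['Frodo Baggins']
```

Only 'Frodo Baggins' clears the cutoff here: it shares 7 characters with the query out of 21 in total (score 2 * 7 / 21 ≈ 0.67), whereas 'Bilbo Baggin' shares 5 out of 20 (score 0.5).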

SequenceMatcher

The SequenceMatcher class is based on difflib, which is part of the Python standard library, and includes the following fuzzy string matching methods,

<span class="w">
</span><span class="n">s</span><span class="m">1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">' It was a dark and stormy night. I was all alone sitting on a red chair.'</span><span class="w">

</span><span class="n">s</span><span class="m">2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">' It was a murky and stormy night. I was all alone sitting on a crimson chair.'</span><span class="w">

</span><span class="n">init</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SequenceMatcher</span><span class="o">$</span><span class="n">new</span><span class="p">(</span><span class="n">string1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">string2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="m">2</span><span class="p">)</span><span class="w">

</span><span class="n">init</span><span class="o">$</span><span class="n">ratio</span><span class="p">()</span><span class="w">

</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">0.9127517</span><span class="w">

</span>
<span class="w">
</span><span class="n">init</span><span class="o">$</span><span class="n">quick_ratio</span><span class="p">()</span><span class="w">

</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">0.9127517</span><span class="w">

</span>
<span class="w">
</span><span class="n">init</span><span class="o">$</span><span class="n">real_quick_ratio</span><span class="p">()</span><span class="w">

</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">0.966443</span><span class="w"> 

</span>
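
The three scores differ in cost and in how tight they are: ratio() is the exact similarity 2 * M / T (M matched characters, T total characters in both strings), while quick_ratio() and real_quick_ratio() are cheaper upper bounds on it. A small sketch using Python's difflib directly, with toy strings chosen only for illustration:

```python
from difflib import SequenceMatcher

sm = SequenceMatcher(None, "abcd", "bcde")

# ratio(): 2 * M / T, with M = 3 matched characters ("bcd")
# and T = 8 characters in total
exact = sm.ratio()                  # 0.75

# quick_ratio(): upper bound that only compares character counts
quick = sm.quick_ratio()            # 0.75

# real_quick_ratio(): upper bound that only compares the two lengths
very_quick = sm.real_quick_ratio()  # 1.0

print(exact, quick, very_quick)
```

The same relation explains the output above: the two sentences have 72 and 77 characters, so real_quick_ratio() is 2 * 72 / 149 ≈ 0.966, while ratio() additionally accounts for the words that actually differ.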

The get_matching_blocks and get_opcodes methods return triples and 5-tuples, respectively, describing matching subsequences. More information can be found in the documentation of Python’s difflib module and in the fuzzywuzzyR package documentation.
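
To make the shape of these return values concrete, here is a short sketch using Python's difflib directly (the toy strings are an assumption made for illustration; the R methods are expected to return the same information in R data structures):

```python
from difflib import SequenceMatcher

first, second = "abcd", "bcde"
sm = SequenceMatcher(None, first, second)

# Each triple (a, b, size) means first[a:a+size] == second[b:b+size];
# the final triple is a zero-length sentinel
blocks = [tuple(m) for m in sm.get_matching_blocks()]
print(blocks)   # [(1, 0, 3), (4, 4, 0)]

# Each 5-tuple (tag, i1, i2, j1, j2) says how first[i1:i2]
# has to be edited to obtain second[j1:j2]
opcodes = sm.get_opcodes()
print(opcodes)  # [('delete', 0, 1, 0, 0), ('equal', 1, 4, 0, 3), ('insert', 4, 4, 3, 4)]
```

Read in order, the opcodes spell out the edit script: delete "a", keep "bcd", insert "e".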

A last thing to note here is that the fuzzy string matching classes mentioned above can be parallelized using the base R parallel package. For instance, the following MCLAPPLY_RATIOS function takes two vectors of character strings (QUERY1, QUERY2) of equal length and returns the pairwise scores for a chosen method (method_fuzz) of the FuzzMatcher class (parallel::mclapply relies on forking, so more than one thread is not supported on Windows),

<span class="w">
</span><span class="n">MCLAPPLY_RATIOS</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">QUERY1</span><span class="p">,</span><span class="w"> </span><span class="n">QUERY2</span><span class="p">,</span><span class="w"> </span><span class="n">class_fuzz</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'FuzzMatcher'</span><span class="p">,</span><span class="w"> </span><span class="n">method_fuzz</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'QRATIO'</span><span class="p">,</span><span class="w"> </span><span class="n">threads</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">

  </span><span class="n">init</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">eval</span><span class="p">(</span><span class="n">parse</span><span class="p">(</span><span class="n">text</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="n">class_fuzz</span><span class="p">,</span><span class="w"> </span><span class="s1">'$new()'</span><span class="p">)))</span><span class="w">

  </span><span class="n">METHOD</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="s1">'init$'</span><span class="p">,</span><span class="w"> </span><span class="n">method_fuzz</span><span class="p">)</span><span class="w">

  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">threads</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">

    </span><span class="n">res_qrat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lapply</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="nf">length</span><span class="p">(</span><span class="n">QUERY1</span><span class="p">),</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">do.call</span><span class="p">(</span><span class="n">eval</span><span class="p">(</span><span class="n">parse</span><span class="p">(</span><span class="n">text</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">METHOD</span><span class="p">)),</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">QUERY1</span><span class="p">[[</span><span class="n">x</span><span class="p">]],</span><span class="w"> </span><span class="n">QUERY2</span><span class="p">[[</span><span class="n">x</span><span class="p">]],</span><span class="w"> </span><span class="n">...</span><span class="p">)))}</span><span class="w">

  </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w">

    </span><span class="n">res_qrat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">parallel</span><span class="o">::</span><span class="n">mclapply</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="nf">length</span><span class="p">(</span><span class="n">QUERY1</span><span class="p">),</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">do.call</span><span class="p">(</span><span class="n">eval</span><span class="p">(</span><span class="n">parse</span><span class="p">(</span><span class="n">text</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">METHOD</span><span class="p">)),</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">QUERY1</span><span class="p">[[</span><span class="n">x</span><span class="p">]],</span><span class="w"> </span><span class="n">QUERY2</span><span class="p">[[</span><span class="n">x</span><span class="p">]],</span><span class="w"> </span><span class="n">...</span><span class="p">)),</span><span class="w"> 
    
                                  </span><span class="n">mc.cores</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">threads</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w">

  </span><span class="nf">return</span><span class="p">(</span><span class="n">res_qrat</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span>

<span class="w">
</span><span class="n">query1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'word1'</span><span class="p">,</span><span class="w"> </span><span class="s1">'word2'</span><span class="p">,</span><span class="w"> </span><span class="s1">'word3'</span><span class="p">)</span><span class="w">

</span><span class="n">query2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'similarword1'</span><span class="p">,</span><span class="w"> </span><span class="s1">'similar_word2'</span><span class="p">,</span><span class="w"> </span><span class="s1">'similarwor'</span><span class="p">)</span><span class="w">

</span><span class="n">quer_res</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">MCLAPPLY_RATIOS</span><span class="p">(</span><span class="n">query1</span><span class="p">,</span><span class="w"> </span><span class="n">query2</span><span class="p">,</span><span class="w"> </span><span class="n">class_fuzz</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'FuzzMatcher'</span><span class="p">,</span><span class="w"> </span><span class="n">method_fuzz</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'QRATIO'</span><span class="p">,</span><span class="w"> </span><span class="n">threads</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">

</span><span class="n">unlist</span><span class="p">(</span><span class="n">quer_res</span><span class="p">)</span><span class="w">

</span>
<span class="w">
</span><span class="c1"># example output
</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">59</span><span class="w"> </span><span class="m">56</span><span class="w"> </span><span class="m">40</span><span class="w">

</span>
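
The pattern behind MCLAPPLY_RATIOS (score the i-th element of one vector against the i-th element of the other, optionally spreading the work over several workers) can also be sketched on the Python side. This is only an illustration under stated assumptions: pair_score and pairwise_scores are hypothetical names, difflib's ratio() stands in for the FuzzMatcher methods (so none of the preprocessing that QRATIO applies is done here), and a thread pool is used purely to keep the sketch short.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher


def pair_score(first, second):
    """Similarity of one pair, scaled to 0-100 and rounded (hypothetical helper)."""
    return round(100 * SequenceMatcher(None, first, second).ratio())


def pairwise_scores(query1, query2, scorer=pair_score, workers=1):
    """Score query1[i] against query2[i] for every i."""
    if len(query1) != len(query2):
        raise ValueError("both inputs must have the same length")
    if workers == 1:
        return [scorer(a, b) for a, b in zip(query1, query2)]
    # map() preserves input order, so results line up with the inputs
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scorer, query1, query2))


query1 = ["word1", "word2", "word3"]
query2 = ["similarword1", "similar_word2", "similarwor"]
print(pairwise_scores(query1, query2, workers=2))  # [59, 56, 40]
```

For these inputs the plain difflib ratio happens to give the same scores as the example output above. For a CPU-bound scorer written in pure Python, threads will not give a real speed-up; a process pool would be the closer counterpart of mclapply.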

reticulate package

My personal opinion is that the newly released reticulate package is good news (for all R-users with minimal knowledge of python) and bad news (for package maintainers whose packages do not cover the full spectrum of a subject in comparison to an existing python library) at the same time. I’ll explain this in the following two examples.

As an R user, I had always wanted a truncated SVD function similar to the one in the sklearn Python library. Now, using the reticulate package and the MNIST data set, one can do the following in R,

<span class="w">
</span><span class="n">reticulate</span><span class="o">::</span><span class="n">py_module_available</span><span class="p">(</span><span class="s1">'sklearn'</span><span class="p">)</span><span class="w">       </span><span class="c1"># check that 'sklearn' is available in your OS
</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="kc">TRUE</span><span class="w">

</span>
<span class="w">
</span><span class="nf">dim</span><span class="p">(</span><span class="n">mnist</span><span class="p">)</span><span class="w">                </span><span class="c1"># after downloading and opening the data from the previous link
</span><span class="w">
</span><span class="m">70000</span><span class="w">   </span><span class="m">785</span><span class="w">

</span>
<span class="w">
</span><span class="n">mnist</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">mnist</span><span class="p">)</span><span class="w">                                  </span><span class="c1"># convert to matrix
</span><span class="w">
</span><span class="n">trunc_SVD</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">reticulate</span><span class="o">::</span><span class="n">import</span><span class="p">(</span><span class="s1">'sklearn.decomposition'</span><span class="p">)</span><span class="w">

</span><span class="n">res_svd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">trunc_SVD</span><span class="o">$</span><span class="n">TruncatedSVD</span><span class="p">(</span><span class="n">n_components</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100L</span><span class="p">,</span><span class="w"> </span><span class="n">n_iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5L</span><span class="p">,</span><span class="w"> </span><span class="n">random_state</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1L</span><span class="p">)</span><span class="w">

</span><span class="n">res_svd</span><span class="o">$</span><span class="n">fit</span><span class="p">(</span><span class="n">mnist</span><span class="p">)</span><span class="w">

</span><span class="c1"># TruncatedSVD(algorithm='randomized', n_components=100, n_iter=5,
#       random_state=1, tol=0.0)
</span><span class="w">       
</span>
<span class="w">
</span><span class="n">out_svd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">res_svd</span><span class="o">$</span><span class="n">transform</span><span class="p">(</span><span class="n">mnist</span><span class="p">)</span><span class="w">

</span><span class="n">str</span><span class="p">(</span><span class="n">out_svd</span><span class="p">)</span><span class="w">

</span><span class="c1"># num [1:70000, 1:100] 1752 1908 2289 2237 2236 ...
</span><span class="w">
</span>
<span class="w">
</span><span class="nf">class</span><span class="p">(</span><span class="n">out_svd</span><span class="p">)</span><span class="w">

</span><span class="c1"># [1] "matrix"
</span><span class="w">
</span>

This yields the desired output: a matrix with 70000 rows and 100 columns (components).

As a package maintainer, I receive e-mails from users of my packages from time to time. In one of them, a user asked me whether the hog function of the OpenImageR package is capable of plotting the HOG features. It is not, but an R user can now, for instance, use the scikit-image Python library to plot the HOG features with the following code chunk (recent scikit-image releases spell the visualise argument as visualize),

<span class="w">
</span><span class="n">reticulate</span><span class="o">::</span><span class="n">py_module_available</span><span class="p">(</span><span class="s2">"skimage"</span><span class="p">)</span><span class="w">             </span><span class="c1"># check that 'skimage' is available in your OS
</span><span class="w">
</span><span class="c1"># [1] TRUE
</span><span class="w">
</span>
<span class="w">
</span><span class="n">feat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">reticulate</span><span class="o">::</span><span class="n">import</span><span class="p">(</span><span class="s2">"skimage.feature"</span><span class="p">)</span><span class="w">        </span><span class="c1"># import module
</span><span class="w">
</span><span class="n">data_sk</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">reticulate</span><span class="o">::</span><span class="n">import</span><span class="p">(</span><span class="s2">"skimage.data"</span><span class="p">)</span><span class="w">        </span><span class="c1"># import data
</span><span class="w">
</span><span class="n">color</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">reticulate</span><span class="o">::</span><span class="n">import</span><span class="p">(</span><span class="s2">"skimage.color"</span><span class="p">)</span><span class="w">         </span><span class="c1"># import module to plot    
</span><span class="w">
</span><span class="n">tmp_im</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data_sk</span><span class="o">$</span><span class="n">astronaut</span><span class="p">()</span><span class="w">                         </span><span class="c1"># import specific image data ('astronaut')
</span><span class="w">
</span><span class="nf">dim</span><span class="p">(</span><span class="n">tmp_im</span><span class="p">)</span><span class="w">

</span><span class="c1"># [1] 512 512   3
</span><span class="w">
</span>
<span class="w">
</span><span class="n">image</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">color</span><span class="o">$</span><span class="n">rgb2gray</span><span class="p">(</span><span class="n">tmp_im</span><span class="p">)</span><span class="w">                       </span><span class="c1"># convert to gray
</span><span class="w">
</span><span class="nf">dim</span><span class="p">(</span><span class="n">image</span><span class="p">)</span><span class="w">

</span><span class="c1"># [1] 512 512
</span><span class="w">
</span>
<span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">feat</span><span class="o">$</span><span class="n">hog</span><span class="p">(</span><span class="n">image</span><span class="p">,</span><span class="w"> </span><span class="n">orientations</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8L</span><span class="p">,</span><span class="w"> </span><span class="n">pixels_per_cell</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">16L</span><span class="p">,</span><span class="w"> </span><span class="m">16L</span><span class="p">),</span><span class="w"> </span><span class="n">cells_per_block</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1L</span><span class="p">,</span><span class="w"> </span><span class="m">1L</span><span class="p">),</span><span class="w"> </span><span class="n">visualise</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">

</span><span class="n">str</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w">

</span><span class="c1"># List of 2
#  $ : num [1:8192(1d)] 1.34e-04 1.53e-04 6.68e-05 9.19e-05 7.93e-05 ...
#  $ : num [1:512, 1:512] 0 0 0 0 0 0 0 0 0 0 ...
</span><span class="w">
</span>
<span class="w">
</span><span class="n">OpenImageR</span><span class="o">::</span><span class="n">imageShow</span><span class="p">(</span><span class="n">res</span><span class="p">[[</span><span class="m">2</span><span class="p">]])</span><span class="w">       </span><span class="c1"># using the OpenImageR to plot the data
</span><span class="w">

</span>


As a final word, I think that the reticulate package, although not that popular yet, will make a difference in the R community.

The README.md file of the fuzzywuzzyR package includes the SystemRequirements and detailed installation instructions for each OS.

An updated version of the fuzzywuzzyR package can be found in my GitHub repository. To report bugs or issues, please use the following link: https://github.com/mlampros/fuzzywuzzyR/issues.

