This is a quick example of how you might use Rcpp to pass R ‘strings’ (character vectors) to C++ and back to R. We’ll demonstrate this with a few operations.
Sort a String with R
Note that we can do this in R in a fairly fast way:
my_strings <- c("apples", "and", "cranberries")
R_str_sort <- function(strings) {
  sapply( strings, USE.NAMES=FALSE, function(x) {
    intToUtf8( sort( utf8ToInt( x ) ) )
  })
}
R_str_sort( my_strings )
[1] "aelpps" "adn" "abceeinrrrs"
Sort a String with C++/Rcpp
Let’s see if we can re-create the output with Rcpp.
#include <Rcpp.h>
#include <algorithm>  // std::sort
using namespace Rcpp;
// [[Rcpp::export]]
std::vector< std::string > cpp_str_sort( std::vector< std::string > strings ) {
  int len = strings.size();
  for( int i=0; i < len; i++ ) {
    // sort the characters (bytes) of each string in place
    std::sort( strings[i].begin(), strings[i].end() );
  }
  return strings;
}
Note the main things we do here:
- Rcpp’s attributes handle any as-ing and wrap-ing of vectors; we even just specify our return type as std::vector< std::string >.
- We then call std::sort (a function returning void), which sorts a string in place,
- … and we return that vector of strings.
Now, let’s test it, and let’s benchmark it as well.
cpp_str_sort( my_strings )
[1] "aelpps" "adn" "abceeinrrrs"
long_strings <- rep( paste( collapse="", sample( letters, 1E5, replace=TRUE ) ),
                     times=100 )
rbenchmark::benchmark( cpp_str_sort(long_strings),
                       R_str_sort(long_strings),
                       replications=3
)
                        test replications elapsed relative user.self sys.self user.child sys.child
1 cpp_str_sort(long_strings)            3   0.898    1.000     0.883    0.014          0         0
2   R_str_sort(long_strings)            3   2.356    2.624     2.350    0.007          0         0
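Note that the C++ implementation is about 2.6 times faster in this benchmark (on my machine). However, std::sort on a std::string orders individual bytes, so it will not correctly handle strings that contain multi-byte UTF-8 characters.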
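To make the UTF-8 caveat concrete, here is a minimal standalone C++ sketch (plain standard library, no Rcpp). The helper names sort_bytes and is_ascii are made up for illustration and are not part of Rcpp or of the code above. sort_bytes does what the std::sort call in cpp_str_sort does to each element, and is_ascii is one possible guard you might check before trusting a byte-wise sort.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: sort the bytes of a string, mirroring the
// std::sort call applied to each element in cpp_str_sort.
std::string sort_bytes(std::string s) {
    std::sort(s.begin(), s.end());
    return s;
}

// Hypothetical helper: true if every byte is plain ASCII (< 0x80).
// In that case bytes and characters coincide, so a byte-wise sort is safe.
bool is_ascii(const std::string& s) {
    return std::all_of(s.begin(), s.end(), [](unsigned char c) {
        return c < 0x80;
    });
}
```

For ASCII input such as "apples" this gives "aelpps", as before. For a string like "é" (the two bytes 0xC3 0xA9 in UTF-8), the bytes are swapped and the result is no longer valid UTF-8. The R version avoids this because utf8ToInt works on code points rather than bytes.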
Now, let’s do something crazy – let’s see if we can use Rcpp to perform an operation that takes a vector of strings, and returns a list of vectors of strings. (Or, in R parlance, a list of vectors of type character).
We’ll do a simple ‘split’, such that each string is split every n indices.
Split a string at consecutive indices n
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
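List cpp_str_split( std::vector< std::string > strings, int n ) {
  if( n <= 0 ) stop( "'n' must be a positive integer" );  // avoid dividing by zero below
  int num_strings = strings.size();
  List out(num_strings);
  for( int i=0; i < num_strings; i++ ) {
    // integer division: only complete chunks of length n are kept
    int num_substr = strings[i].length() / n;
    std::vector< std::string > tmp;
    for( int j=0; j < num_substr; j++ ) {
      tmp.push_back( strings[i].substr( j*n, n ) );
    }
    out[i] = tmp;  // implicitly wrapped into an R character vector
  }
  return out;
}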
Main things to notice:
- We declare the output to be a List,
- We form a List container of size num_strings,
- We construct the split strings one by one, then place them back into our output container (note how with out[i] = tmp, we can assign our vector of strings directly as an element of the list),
- We return the list we constructed.
cpp_str_split( c("abcd", "efgh", "ijkl"), 2 )
[[1]]
[1] "ab" "cd"

[[2]]
[1] "ef" "gh"

[[3]]
[1] "ij" "kl"
cpp_str_split( c("abc", "de"), 2 )
[[1]]
[1] "ab"

[[2]]
[1] "de"
My solution is perhaps a bit deficient (bug or feature?) in that it silently drops any trailing characters that do not fill a complete chunk of length n (note how "abc" loses its final "c" above); ideally, we’d either improve the C++ code or form an appropriate wrapper to the function in R (and warn the user if truncation might occur).
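One way the C++ side might be improved is sketched below, in plain standard C++ without Rcpp. The function name split_keep_tail is hypothetical, and keeping a shorter final piece is just one possible design choice, not necessarily the behaviour you want. The idea is to step through the string by position instead of precomputing the number of complete chunks.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: split s into pieces of length n, keeping a
// shorter final piece instead of dropping it.
std::vector<std::string> split_keep_tail(const std::string& s, std::size_t n) {
    if (n == 0) {
        throw std::invalid_argument("n must be positive");
    }
    std::vector<std::string> pieces;
    for (std::size_t pos = 0; pos < s.size(); pos += n) {
        // substr clamps the requested length at the end of the string,
        // so the last piece may be shorter than n
        pieces.push_back(s.substr(pos, n));
    }
    return pieces;
}
```

With this sketch, splitting "abc" every 2 characters yields "ab" and "c" rather than just "ab". Presumably the same loop could replace the inner loop of cpp_str_split, with tmp taking the place of pieces.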
Hopefully this gives you a better idea how you might use Rcpp to perform more extensive string manipulation with R character vectors.