Quick Hit: Above the Fold; Hard wrapping text at ‘n’ characters

[This article was first published on R – rud.is, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Despite being on holiday I’m getting in a bit of non-work R coding since the fam has a greater ability to sleep late than I do. Apart from other things I’ve been working on a PR into {lutz}, a package by @andyteucher that turns lat/lng pairs into timezone strings.

The package is super neat and has two modes: “fast” (originally based on a {V8}-backed version of @darkskyapp’s tzlookup javascript module) and “accurate” using R’s amazing spatial ops.

I ported the javascript algorithm to C++/Rcpp and have been tweaking the bit of package helper code that fetches this:

and extracts the embedded string tree and corresponding timezones array and turns both into something C++ can use.

Originally I just made a header file with the same long lines:

but that’s icky and fairly bad form, especially given that C++ will combine adjacent string literals for you.

The stringi::stri_wrap() function can easily take care of wrapping the time zone array elements for us:

but, I also needed the ability to hard-wrap the encoded string tree at a fixed width. There are lots of ways to do that, here are three of them:

library(Rcpp)
library(stringi)
library(tidyverse)
library(microbenchmark)

sourceCpp(code = "
#include <Rcpp.h>

// [[Rcpp::export]]
std::vector< std::string > fold_cpp(const std::string& input, int width) {

  int sz = input.length() / width;

  std::vector< std::string > out;
  out.reserve(sz); // shld make this more efficient

  for (unsigned long idx=0; idx<sz; idx++) {
    out.push_back(
      input.substr(idx*width, width)
    );
  }

  if (input.length() % width != 0) out.push_back(input.substr(width*sz));

  return(out);
}
") 

fold_base <- function(input, width) {

  vapply(
    seq(1, nchar(input), width), 
    function(idx) substr(input, idx, idx + width - 1), 
    FUN.VALUE = character(1)
  )

}

fold_tidy <- function(input, width) {

  map_chr(
    seq(1, nchar(input), width),
    ~stri_sub(input, .x, length = width)
  ) 

}

(If you know of a package that has this type of function def leave a note in the comments).

Each one does the same thing: move n sequences of width characters into a new slot in a character vector. Let’s see what they do with this toy long string example:

(src <- paste0(c(rep("a", 30), rep("b", 30), rep("c", 4)), collapse = ""))
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbcccc"

for (n in c(1, 7, 30, 40)) {

  print(fold_base(src, n))
  print(fold_tidy(src, n))
  print(fold_cpp(src, n))
  cat("\n")

}
##  [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
## [18] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "b" "b" "b" "b"
## [35] "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b"
## [52] "b" "b" "b" "b" "b" "b" "b" "b" "b" "c" "c" "c" "c"
##  [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
## [18] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "b" "b" "b" "b"
## [35] "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b"
## [52] "b" "b" "b" "b" "b" "b" "b" "b" "b" "c" "c" "c" "c"
##  [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
## [18] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "b" "b" "b" "b"
## [35] "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b"
## [52] "b" "b" "b" "b" "b" "b" "b" "b" "b" "c" "c" "c" "c"
## 
##  [1] "aaaaaaa" "aaaaaaa" "aaaaaaa" "aaaaaaa" "aabbbbb" "bbbbbbb"
##  [7] "bbbbbbb" "bbbbbbb" "bbbbccc" "c"      
##  [1] "aaaaaaa" "aaaaaaa" "aaaaaaa" "aaaaaaa" "aabbbbb" "bbbbbbb"
##  [7] "bbbbbbb" "bbbbbbb" "bbbbccc" "c"      
##  [1] "aaaaaaa" "aaaaaaa" "aaaaaaa" "aaaaaaa" "aabbbbb" "bbbbbbb"
##  [7] "bbbbbbb" "bbbbbbb" "bbbbccc" "c"      
## 
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
## [3] "cccc"                          
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
## [3] "cccc"                          
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
## [3] "cccc"                          
## 
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbb"
## [2] "bbbbbbbbbbbbbbbbbbbbcccc"                
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbb"
## [2] "bbbbbbbbbbbbbbbbbbbbcccc"                
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbb"
## [2] "bbbbbbbbbbbbbbbbbbbbcccc"   

So, we know they all work, which means we can take a look at which one is faster. Let’s compare folding at various widths:

map_df(c(1, 3, 5, 7, 10, 20, 30, 40, 70), ~{
  microbenchmark(
    base = fold_base(src, .x),
    tidy = fold_tidy(src, .x),
    cpp = fold_cpp(src, .x)
  ) %>% 
    mutate(width = .x) %>% 
    as_tibble()
}) %>% 
  mutate(
    width = factor(width, 
                   levels = sort(unique(width)), 
                   ordered = TRUE)
  ) -> bench_df

ggplot(bench_df, aes(expr, time)) +
  ggbeeswarm::geom_quasirandom(
    aes(group = width, fill = width),
    groupOnX = TRUE, shape = 21, color = "white", size = 3, stroke = 0.125, alpha = 1/4
  ) +
  scale_y_comma(trans = "log10", position = "right") +
  coord_flip() +
  guides(
    fill = guide_legend(override.aes = list(alpha = 1))
  ) +
  labs(
    x = NULL, y = "Time (nanoseconds)",
    fill = "Split width:", 
    title = "Performance comparison between 'fold' implementations"
  ) +
  theme_ft_rc(grid="X") +
  theme(legend.position = "top")

ggplot(bench_df, aes(width, time)) +
  ggbeeswarm::geom_quasirandom(
    aes(group = expr, fill = expr),
    groupOnX = TRUE, shape = 21, color = "white", size = 3, stroke = 0.125, alpha = 1/4
  ) +
  scale_x_discrete(
    labels = c(1, 3, 5, 7, 10, 20, 30, 40, "Split/fold width: 70")
  ) +
  scale_y_comma(trans = "log10", position = "right") +
  scale_fill_ft() +
  coord_flip() +
  guides(
    fill = guide_legend(override.aes = list(alpha = 1))
  ) +
  labs(
    x = NULL, y = "Time (nanoseconds)",
    fill = NULL,
    title = "Performance comparison between 'fold' implementations"
  ) +
  theme_ft_rc(grid="X") +
  theme(legend.position = "top")

The Rcpp version is both faster and more consistent than the other two implementations (though they get faster as the number of string subsetting operations decrease); but, they’re all pretty fast. For an infrequently run process, it might be better to use the base R version purely for simplicity. Despite that fact, I used the Rcpp version to turn the string tree long line into:

FIN

If you have need to “fold” like this how do you currently implement your solution? Found a bug or better way after looking at the code? Drop a note in the comments so you can help others find an optimal solution to their own ‘fold’ing problems.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)