R-caching (and scheduling)

This article was first published on R – Artificial thoughts, and kindly contributed to R-bloggers.

This post is in preparation for running a custom Shiny server, which we want to speed up by using caching. Here we take a look at a candidate caching package.

In this post we’ll explore the package DataCache. It is a very useful package; however, I found that for some reason the provided weather data example was not working. So I wanted to simulate a data feed using a scheduler, preferably within R. The R package tcltk2 provides such a scheduler. It worked for me from the R command line; when running it in RStudio or with Rscript, however, there is a small complication, which we will cover further below.

The data function here outputs the system time, using Sys.time(). While the cache is still valid, the cached value is a time stored by an earlier call, so it is never later than the current Sys.time(). In general:

Current time >= Cached time

Let’s look at the output first:

No cached data found. Loading intial data...
[1] "Current:2017-12-22 14:58:09.2|Cached:2017-12-22 14:58:09.2"
[1] "Current:2017-12-22 14:58:09.5|Cached:2017-12-22 14:58:09.2"
[1] "Current:2017-12-22 14:58:09.7|Cached:2017-12-22 14:58:09.2"
[1] "Current:2017-12-22 14:58:09.9|Cached:2017-12-22 14:58:09.2"
[1] "Current:2017-12-22 14:58:10.1|Cached:2017-12-22 14:58:09.2"
Loading more recent data, returning lastest available.
[1] "Current:2017-12-22 14:58:10.3|Cached:2017-12-22 14:58:09.2"
[1] "Current:2017-12-22 14:58:10.5|Cached:2017-12-22 14:58:10.3"
[1] "Current:2017-12-22 14:58:10.7|Cached:2017-12-22 14:58:10.3"
[1] "Current:2017-12-22 14:58:10.9|Cached:2017-12-22 14:58:10.3"
[1] "Current:2017-12-22 14:58:11.1|Cached:2017-12-22 14:58:10.3"
Loading more recent data, returning lastest available.
[1] "Current:2017-12-22 14:58:11.3|Cached:2017-12-22 14:58:10.3"
[1] "Current:2017-12-22 14:58:11.5|Cached:2017-12-22 14:58:11.3"
[1] "Current:2017-12-22 14:58:11.7|Cached:2017-12-22 14:58:11.3"
[1] "Current:2017-12-22 14:58:11.9|Cached:2017-12-22 14:58:11.3"
[1] "Current:2017-12-22 14:58:12.1|Cached:2017-12-22 14:58:11.3"
Loading more recent data, returning lastest available.
[1] "Current:2017-12-22 14:58:12.4|Cached:2017-12-22 14:58:11.3"
[1] "Current:2017-12-22 14:58:12.6|Cached:2017-12-22 14:58:12.4"
[1] "Current:2017-12-22 14:58:12.8|Cached:2017-12-22 14:58:12.4"
[1] "Current:2017-12-22 14:58:13.0|Cached:2017-12-22 14:58:12.4"
[1] "Current:2017-12-22 14:58:13.2|Cached:2017-12-22 14:58:12.4"

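So we can see that it works. The scheduler runs a cycle roughly every 200 ms, whereas the cached time is only refreshed once it is more than one second old, so a refresh happens about every 5 = 1000 ms / 200 ms cycles. Note that on the cycle that triggers the refresh the old cached value is still printed; the new value shows up one cycle later.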
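The cycle arithmetic can be checked with a small base-R simulation that needs neither DataCache nor tcltk2. This is only a sketch: the function name simulate_refresh_ticks is made up for illustration, and the 210 ms period is an assumption (200 ms wait plus a little overhead) chosen because the printed timestamps drift by slightly more than 200 ms per cycle.

```r
# Simulate scheduler ticks and report on which ticks the cache would be refreshed.
# All times are whole milliseconds to avoid floating-point comparisons.
simulate_refresh_ticks <- function(period_ms, max_age_ms, n_ticks) {
  cached_at <- 0L              # tick 0 fills the empty cache
  refresh_ticks <- integer(0)
  for (tick in seq_len(n_ticks - 1L)) {
    now <- tick * period_ms
    # same "older than" rule as the custom frequency function in this post
    if (now - cached_at > max_age_ms) {
      refresh_ticks <- c(refresh_ticks, tick)
      cached_at <- now
    }
  }
  refresh_ticks
}

# assumed effective period of 210 ms, cache lifetime of 1 second, 20 ticks
refreshes <- simulate_refresh_ticks(period_ms = 210L, max_age_ms = 1000L, n_ticks = 20L)
print(refreshes)
```

Counting the first printed line as tick 0, this gives refreshes on ticks 5, 10 and 15, which is where the "Loading more recent data" messages appear in the output above. With a period of exactly 200 ms the strict "greater than" comparison would push each refresh one tick later.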

Let’s discuss the code. We have three parts:

  1. Preparations: Loading packages, setting options:
    #!/usr/bin/env Rscript
    
    # load packages
    library(DataCache) # for the caching
    library(tcltk2) # for the scheduler
    
    # set the resolution of printed time values
    #  so instead of 2017-12-22 14:58:12 we now have 2017-12-22 14:58:12.4
    op <- options(digits.secs = 1)
  2. Define the functions for caching: the data feed and a custom frequency function:
    # define getTime function: 
    datafeed_getTime = function(varName) {
      
      timeValue = Sys.time()
      
      out = list(timeValue)
      names(out) = paste0('Mycached.' , varName)
      
      return (out)
    }
    
    # define custom frequency for cache updates
    # nMinutes already exists in the package DataCache, but we want faster updates for this test
    customFrequency_nSeconds <- function(seconds) {
      fun <- function(timestamp) {
        return(difftime(Sys.time(), timestamp, units='secs') > seconds)
      }
      return(fun)
    }
    
    
    varName1 = 'test1' # remark : the cached variable for this varName is Mycached.test1
  3. Define the scheduler:

tclTaskDelete(NULL) # delete all running tasks

# wait 200 ms between runs; redo = 20 repeats the task 20 times
tclTaskSchedule(200, {
  cache.timedata1 = data.cache(function() datafeed_getTime(varName1),
                               cache.name = varName1,
                               frequency = customFrequency_nSeconds(1))

  print(paste0('Current:', Sys.time(), '|Cached:', Mycached.test1))
}, id = "ticktock_test1", redo = 20)
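To see what the custom frequency function does on its own, here is a minimal sketch in base R. It assumes, as the code above implies, that the cache layer calls the returned function with the timestamp of the last cached value and refreshes when it gets TRUE; the names is_stale, fresh and old are only for illustration.

```r
# Same closure as in the post: returns a function that tells whether
# a timestamp is more than `seconds` seconds in the past.
customFrequency_nSeconds <- function(seconds) {
  function(timestamp) {
    difftime(Sys.time(), timestamp, units = 'secs') > seconds
  }
}

is_stale <- customFrequency_nSeconds(1)

fresh <- is_stale(Sys.time())      # timestamp from just now
old   <- is_stale(Sys.time() - 5)  # timestamp from five seconds ago

print(fresh)
print(old)
```

A timestamp taken just now is not stale, whereas one from five seconds ago is, so only the second case would trigger a refresh.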

The final part is only necessary when the code is not run from the R command line, i.e. when using RStudio or Rscript. In those cases it is needed for the scheduler to work. There are other ways to define schedulers which are more robust but less readable than tclTaskSchedule, so for simplicity’s sake I chose tclTaskSchedule for this post.

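# Start : special
  # This part is only necessary for the scheduler to run with Rscript or RStudio. In the R command line it is not necessary
  # runFor keeps R busy for totalRunningTime seconds so that the scheduled tasks can run
  runFor = function(totalRunningTime)
  {
    startTime <- Sys.time()
    repeat {
      # units = 'secs' keeps the comparison in seconds, also for run times of a minute or more
      if (difftime(Sys.time(), startTime, units = 'secs') > totalRunningTime) {
        break
      }
    }
  }


  runFor(totalRunningTime = 7) # totalRunningTime is in seconds
# End : Special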

options(op)
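One detail worth knowing when timing a loop like runFor: subtracting two POSIXct values lets R pick the units automatically, so a difference of a minute or more is reported in minutes rather than seconds. For the 7 seconds used here this makes no difference, but for longer run times a comparison against a number of seconds would go wrong. A small sketch with fixed timestamps (the variable names are illustrative):

```r
t0 <- as.POSIXct("2017-12-22 14:58:09", tz = "UTC")
t1 <- t0 + 120                        # two minutes later

auto_diff  <- t1 - t0                 # units chosen automatically
fixed_diff <- difftime(t1, t0, units = "secs")

print(units(auto_diff))               # "mins"
print(as.numeric(auto_diff) > 90)     # FALSE: compares 2 (minutes) with 90
print(as.numeric(fixed_diff) > 90)    # TRUE: compares 120 (seconds) with 90
```

Passing units = "secs" to difftime makes the intent explicit and keeps the comparison correct for any duration.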

 

Final comment: the main hurdles to understanding the way DataCache works are these two points:

  • data.cache expects a function. If we want more than one cache, we can distinguish them by a variable name such as varName1 and wrap the datafeed_getTime(varName1) call in an anonymous function:
    cache.timedata1 = data.cache(function() datafeed_getTime(varName1) , cache.name = varName1, frequency = customFrequency_nSeconds(1))

  • That variable name is then used in datafeed_getTime to define the name under which the value is saved. This is done here:
    names(out) = paste0('Mycached.' , varName)

    Because we define varName1 = 'test1', the cached variable for this varName is Mycached.test1.
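The naming step can be tried out without DataCache. In the sketch below, list2env is only a stand-in for the cache layer, under the assumption that the cache makes each named list element available as a variable (which is what the post relies on when it reads Mycached.test1); it is not DataCache's actual code, and feed and demo_env are illustrative names.

```r
# Data feed as in the post: a one-element list whose name encodes varName
datafeed_getTime <- function(varName) {
  out <- list(Sys.time())
  names(out) <- paste0('Mycached.', varName)
  out
}

feed <- datafeed_getTime('test1')
print(names(feed))

# Stand-in for the cache layer (an assumption, not DataCache's own code):
# copy every named list element into an environment as a variable.
demo_env <- new.env()
list2env(feed, envir = demo_env)
print(exists('Mycached.test1', envir = demo_env, inherits = FALSE))
```

Calling datafeed_getTime with a different varName, for example 'test2', would produce a variable named Mycached.test2 in the same way, which is how several caches can live side by side.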

So here is the entire code (for easy copy and pasting):

#!/usr/bin/env Rscript

library(DataCache) # for the caching
library(tcltk2) # for the scheduler


# set the resolution of printed time values
#  so instead of 2017-12-22 14:58:12 we now have 2017-12-22 14:58:12.4
op <- options(digits.secs = 1)


# define getTime function: 
datafeed_getTime = function(varName) {
  
  timeValue = Sys.time()
  
  out = list(timeValue)
  names(out) = paste0('Mycached.' , varName)
  
  return (out)
}

# define custom frequency for cache updates
# nMinutes already exists in the package DataCache, but we want faster updates for this test
customFrequency_nSeconds <- function(seconds) {
  fun <- function(timestamp) {
    return(difftime(Sys.time(), timestamp, units='secs') > seconds)
  }
  return(fun)
}


varName1 = 'test1' # remark : the cached variable for this varName is Mycached.test1


tclTaskDelete(NULL) # delete all running tasks

# wait 200 ms between runs; redo = 20 repeats the task 20 times
tclTaskSchedule(200, {
  cache.timedata1 = data.cache(function() datafeed_getTime(varName1),
                               cache.name = varName1,
                               frequency = customFrequency_nSeconds(1))

  print(paste0('Current:', Sys.time(), '|Cached:', Mycached.test1))
}, id = "ticktock_test1", redo = 20)


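# Start : special
  # This part is only necessary for the scheduler to run with Rscript or RStudio. In the R command line it is not necessary
  # runFor keeps R busy for totalRunningTime seconds so that the scheduled tasks can run
  runFor = function(totalRunningTime)
  {
    startTime <- Sys.time()
    repeat {
      # units = 'secs' keeps the comparison in seconds, also for run times of a minute or more
      if (difftime(Sys.time(), startTime, units = 'secs') > totalRunningTime) {
        break
      }
    }
  }


  runFor(totalRunningTime = 7) # totalRunningTime is in seconds
# End : Special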

options(op)

 
