Size of XDF files using RevoScaleR package

[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers.]

It came to my attention that the size of an XDF (external data frame) file can change drastically depending on the compute context and the environment. While testing the output of a dataset I was working on in SQL Server Management Studio, and simultaneously testing the R code in RTVS or RStudio, I noticed the file growing.

The following stored procedure will run a simple test:

CREATE PROCEDURE rxImport_Test ( @rowsPerRead INT )
AS
BEGIN
    DECLARE @RStat NVARCHAR(4000)
    SET @RStat = 'library(RevoScaleR)
                  #rxSetComputeContext("RxLocalSeq")
                  ptm <- proc.time()
                  # source CSV in the RevoScaleR sample data folder
                  inFile <- file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.csv")
                  filename <- "AirlineDemoSmall_'+CAST(@rowsPerRead AS VARCHAR(100))+'_TSQL_NC.xdf"
                  # convert the CSV to XDF using the chunk size passed in as @rowsPerRead
                  rxTextToXdf(inFile = inFile, outFile = filename,  stringsAsFactors = T, rowsPerRead = '+CAST(@rowsPerRead AS VARCHAR(100))+', overwrite=TRUE)
                  outFile <- file.path(rxGetOption("sampleDataDir"), filename)
                  # import the same CSV into an XDF in the sample data folder as well
                  rxImport(inData = inFile, outFile = outFile, overwrite=TRUE)
                  d <- proc.time() - ptm
                  # collect file size, elapsed time and run metadata for the result set
                  filesize <- data.frame(file.size(filename))
                  time     <- data.frame(d[3])
                  RowsPerRead <- data.frame('+CAST(@rowsPerRead AS VARCHAR(100))+')
                  filename_xdf <- data.frame(filename)
                  ran <- data.frame(Sys.time())
                  OutputDataSet <- cbind(as.character(filesize), time, RowsPerRead, filename_xdf, ran)';
    EXECUTE sp_execute_external_script
          @language = N'R'
         ,@script = @RStat
    WITH RESULT SETS (( 
                         Filesize NVARCHAR(100)
                        ,Time_df NVARCHAR(100)
                        ,RowsPerRead NVARCHAR(100)
                        ,filename_xdf NVARCHAR(100)
                        ,DateExecute NVARCHAR(100)
                        ))
END

But let’s first understand and test the compute context and the path to the data.

-- Make sure your path location is pointing to the RevoScaleR library folder!
EXECUTE sp_execute_external_script
      @language = N'R'
     ,@script = N'library(RevoScaleR) 
                OutputDataSet <- data.frame(rxGetOption("sampleDataDir"))'
WITH RESULT SETS (( 
                    path_folder NVARCHAR(1000)
                    ))

-- check for ComputeContext
DECLARE @RStat NVARCHAR(4000)
SET @RStat = 'library(RevoScaleR)
             cc <- rxGetOption("computeContext")
             OutputDataSet <- data.frame(cc@description)';
EXECUTE sp_execute_external_script
      @language = N'R'
     ,@script = @RStat
WITH RESULT SETS ((compute_context NVARCHAR(100)))

On my computer, this looks like the following:

[Screenshot: results of the two queries above, showing the sample data directory path and the compute context]
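The procedure above has rxSetComputeContext("RxLocalSeq") commented out; if you want to force the local sequential compute context before running the conversion, the same pattern as the check above can be used. A small sketch (not part of the original post):

-- optionally force the local sequential compute context before the conversion
DECLARE @RStat NVARCHAR(4000)
SET @RStat = 'library(RevoScaleR)
             rxSetComputeContext("RxLocalSeq")
             cc <- rxGetOption("computeContext")
             OutputDataSet <- data.frame(cc@description)';
EXECUTE sp_execute_external_script
      @language = N'R'
     ,@script = @RStat
WITH RESULT SETS ((compute_context NVARCHAR(100)))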

Now we will run the rxImport_Test procedure with different chunk sizes (this is what I will test) and observe the execution times.
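The runs below insert the procedure’s output into a table rxImport_results, which is not shown in the post; a minimal sketch of a matching table, assuming NVARCHAR columns that mirror the RESULT SETS definition above:

CREATE TABLE rxImport_results
(
     Filesize      NVARCHAR(100)
    ,Time_df       NVARCHAR(100)
    ,RowsPerRead   NVARCHAR(100)
    ,filename_xdf  NVARCHAR(100)
    ,DateExecute   NVARCHAR(100)
);
GO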

INSERT INTO rxImport_results
EXEC rxImport_Test @rowsPerRead = 2;
GO

INSERT INTO rxImport_results
EXEC rxImport_Test @rowsPerRead = 20;
GO

INSERT INTO rxImport_results
EXEC rxImport_Test @rowsPerRead = 200;
GO

INSERT INTO rxImport_results
EXEC rxImport_Test @rowsPerRead = 2000;
GO

Running the procedure with different chunk sizes yields interesting results:

[Screenshot: results of the rxImport_Test runs with different rowsPerRead values]

Now, let’s look at the summary information for these files.

[Screenshot: Excel summary of the results for the generated XDF files]

Considering that the original file holds 600,000 rows, the number of blocks in each of the files changes with the chunk size, and the size of the files grows accordingly.
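To make the relationship explicit: the number of blocks is simply the row count divided by rowsPerRead (rounded up), so rowsPerRead = 2 splits the 600,000 rows into 300,000 blocks, while rowsPerRead = 200,000 would need only 3. A quick sanity check for the tested chunk sizes, run through the same external script mechanism (the values are the ones used above):

-- expected number of blocks = ceiling(600,000 rows / rowsPerRead)
EXECUTE sp_execute_external_script
      @language = N'R'
     ,@script = N'rowsPerRead <- c(2, 20, 200, 2000)
                  OutputDataSet <- data.frame(RowsPerRead    = as.integer(rowsPerRead)
                                             ,ExpectedBlocks = as.integer(ceiling(600000 / rowsPerRead)))'
WITH RESULT SETS ((
                     RowsPerRead    INT
                    ,ExpectedBlocks INT
                  ))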

Retrieving information on the number of blocks:

DECLARE @RStat NVARCHAR(4000)
SET @RStat = 'library(RevoScaleR)    
              info <- rxGetInfoXdf(data="AirlineDemoSmall_20000000_TSQL_NC.xdf", getVarInfo = TRUE)    
              OutputDataSet <- data.frame(info$numBlocks)';

EXECUTE sp_execute_external_script
      @language = N'R'
     ,@script = @RStat
WITH RESULT SETS (( 
                    nof_blocks NVARCHAR(100)))

From this, one can see the difference between the files and where the optimal block size lies. In my test, 3 to 30 blocks at most gave the maximum performance when creating the XDF file, which means that from 2,000 up to 200,000 rows per block would yield the best performance results. Beyond that, I haven’t found a golden rule for the block size, so take caution, especially when dealing with larger files.
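If a file was created with an unfortunate chunk size, the simplest remedy is to re-create it from the source CSV with a larger rowsPerRead, exactly as the procedure above does; a minimal sketch (the output file name and the value 200,000 are only illustrative):

DECLARE @RStat NVARCHAR(4000)
SET @RStat = 'library(RevoScaleR)
              # re-create the XDF from the source CSV with a larger chunk size
              inFile  <- file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.csv")
              outFile <- "AirlineDemoSmall_200000_reblocked.xdf"
              rxTextToXdf(inFile = inFile, outFile = outFile, stringsAsFactors = T, rowsPerRead = 200000, overwrite = TRUE)
              # verify the new block count and file size
              info <- rxGetInfoXdf(data = outFile)
              OutputDataSet <- data.frame(info$numBlocks, file.size(outFile))';
EXECUTE sp_execute_external_script
      @language = N'R'
     ,@script = @RStat
WITH RESULT SETS ((
                     nof_blocks NVARCHAR(100)
                    ,filesize_bytes NVARCHAR(100)))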

I ran the test a couple of times to check the consistency of the results, and they hold water. As for the file size: it is a property of the internal representation of the *.xdf file; the data itself (as an external structure) does not differ as the block size changes, but the file size and the performance do!

The code is available on GitHub.

Happy R-sqling!

