Size of XDF files using RevoScaleR package

September 22, 2016

(This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

It came to my attention that the size of an XDF (external data frame) file can change drastically depending on the compute context and environment. While testing the output of a dataset in SQL Server Management Studio, I was simultaneously testing R code in RTVS or RStudio, and I noticed that the file had grown.

The following stored procedure runs a simple test:

CREATE PROCEDURE rxImport_Test ( @rowsPerRead INT )
AS
BEGIN
    -- R script is built dynamically so that the chunk size can be injected
    DECLARE @RStat NVARCHAR(4000)
    SET @RStat = 'library(RevoScaleR)
                  ptm <- proc.time()
                  inFile <- file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.csv")
                  filename <- "AirlineDemoSmall_'+CAST(@rowsPerRead AS VARCHAR(100))+'_TSQL_NC.xdf"
                  rxTextToXdf(inFile = inFile, outFile = filename, stringsAsFactors = TRUE, rowsPerRead = '+CAST(@rowsPerRead AS VARCHAR(100))+', overwrite = TRUE)
                  outFile <- file.path(rxGetOption("sampleDataDir"), filename)
                  rxImport(inData = inFile, outFile = outFile, overwrite = TRUE)
                  d <- proc.time() - ptm
                  filesize <- data.frame(file.size(filename))
                  time     <- data.frame(d[3])
                  RowsPerRead <- data.frame('+CAST(@rowsPerRead AS VARCHAR(100))+')
                  filename_xdf <- data.frame(filename)
                  ran <- data.frame(Sys.time())
                  OutputDataSet <- cbind(as.character(filesize), time, RowsPerRead, filename_xdf, ran)';
    -- Run the script and return file size, elapsed time and run metadata
    EXECUTE sp_execute_external_script
          @language = N'R'
         ,@script = @RStat
    WITH RESULT SETS ((
                         Filesize NVARCHAR(100)
                        ,Time_df NVARCHAR(100)
                        ,RowsPerRead NVARCHAR(100)
                        ,filename_xdf NVARCHAR(100)
                        ,DateExecute NVARCHAR(100)
                      ))
END

But first, let’s check the compute context and the path to the data.

-- Make sure your path location is pointing to the RevoScaleR library folder!
EXECUTE sp_execute_external_script
      @language = N'R'
     ,@script = N'library(RevoScaleR) 
                OutputDataSet <- data.frame(rxGetOption("sampleDataDir"))'
WITH RESULT SETS ((
                    path_folder NVARCHAR(1000)
                 ))

-- check for ComputeContext
DECLARE @RStat NVARCHAR(4000)
SET @RStat = 'library(RevoScaleR)
             cc <- rxGetOption("computeContext")
             OutputDataSet <- data.frame(cc@description)';
EXECUTE sp_execute_external_script
      @language = N'R'
     ,@script = @RStat
WITH RESULT SETS ((compute_context NVARCHAR(100)))

On my computer, the result looks like this:


Now we will run the procedure with different chunk sizes (this is what I will test) and observe the execution times.

INSERT INTO rxImport_results
EXEC rxImport_Test @rowsPerRead = 2;

INSERT INTO rxImport_results
EXEC rxImport_Test @rowsPerRead = 20;

INSERT INTO rxImport_results
EXEC rxImport_Test @rowsPerRead = 200;

INSERT INTO rxImport_results
EXEC rxImport_Test @rowsPerRead = 2000;

Running the procedure with different chunk sizes yields interesting results:

[Screenshot: results of the rxImport_Test runs for each chunk size]

Now, let’s see the summary information on this file / dataset.



Considering that the original file holds 600,000 rows, the number of blocks in each file changes with the chunk size, and therefore the size of the files grows.
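As a rough illustration of why the block count changes, the short R sketch below computes the expected number of blocks for the chunk sizes used in the runs above. It assumes that every read of up to rowsPerRead rows is written out as one block; the variable names are mine and only serve the example.

```r
# Rows in AirlineDemoSmall.csv, as stated in the text
total_rows <- 600000

# rowsPerRead values passed to rxImport_Test in the runs above
rows_per_read <- c(2, 20, 200, 2000)

# Assumption: one block is written per read, so the block count is the
# row count divided by the chunk size, rounded up
expected_blocks <- ceiling(total_rows / rows_per_read)

block_table <- data.frame(rows_per_read, expected_blocks)
print(block_table)
```

The smaller the chunk, the more blocks the file has to carry, which is the pattern the measured results show.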

Retrieving information on block size

DECLARE @RStat NVARCHAR(4000)
SET @RStat = 'library(RevoScaleR)
              info <- rxGetInfoXdf(data="AirlineDemoSmall_20000000_TSQL_NC.xdf", getVarInfo = TRUE)
              OutputDataSet <- data.frame(info$numBlocks)';

EXECUTE sp_execute_external_script
      @language = N'R'
     ,@script = @RStat
WITH RESULT SETS ((
                    nof_blocks NVARCHAR(100)))

One can see the change between the files and where the optimal block size lies. In my test, 3 to 30 blocks at most gave the best performance when creating the XDF file. This means that from 2000 up to 200,000 rows per block would yield the best performance. Otherwise I haven’t found a golden rule for the block size, but take caution, especially when dealing with larger files.

I ran the test a couple of times to check the consistency of the results, and they hold water. As for the file size: this is the representation of the internal file; the size of the *.xdf file (as an external structure) should not differ as the block size changes, but performance does!

Code is available at GitHub.
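The same measurement can also be repeated from an R client such as RTVS or RStudio. The sketch below is only an illustration, assuming RevoScaleR is installed locally; `measure_import`, the chosen chunk sizes and the output file names are hypothetical and not part of the original code, and the XDF files are written to a temporary folder rather than the sample data folder.

```r
# Hypothetical helper: import the sample CSV with a given chunk size and
# report elapsed time, file size on disk and number of blocks
measure_import <- function(rows_per_read) {
  in_file  <- file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.csv")
  out_file <- file.path(tempdir(),
                        paste0("AirlineDemoSmall_", rows_per_read, "_R.xdf"))

  # Time only the conversion from CSV to XDF
  elapsed <- system.time(
    rxTextToXdf(inFile = in_file, outFile = out_file,
                stringsAsFactors = TRUE,
                rowsPerRead = rows_per_read, overwrite = TRUE)
  )[["elapsed"]]

  # Same call as in the T-SQL example to read the block count
  info <- rxGetInfoXdf(data = out_file, getVarInfo = TRUE)

  data.frame(rows_per_read   = rows_per_read,
             seconds         = elapsed,
             file_size_bytes = file.size(out_file),
             num_blocks      = info$numBlocks)
}

# Run only where RevoScaleR is available
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  library(RevoScaleR)
  results <- do.call(rbind, lapply(c(2000, 20000, 200000), measure_import))
  print(results)
}
```

Running it a few times and comparing the `seconds` column gives a quick check of whether the timings are stable on your own machine.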

Happy R-sqling!
