Is it possible to use RevoScaleR package in Power BI?


I was invited to deliver a session for the Belgium User Group on SQL Server and R integration. After the session – which we did online using web-based Citrix – I got an interesting question: “Is it possible to use RevoScaleR performance computational functions within Power BI?”. My first answer was a sceptical yes, but I added that I had not used it in this manner yet and that there might be some limitations.

The idea of having a scalable environment and a parallel computational package, with all its predictive analytical functions, inside Power BI is absolutely great. But something tells me it will not be that straightforward.

So let’s start by taking a large (500 MB) txt file and creating an XDF file:

library(RevoScaleR)
file.name <- "YearPredictionMSD.txt"
rxOptions(sampleDataDir = "C:\\Files")
sampleDataDir <- rxGetOption("sampleDataDir")
# reconstructed (the original lines were cut off): import the txt into an XDF file
outputFile <- file.path(sampleDataDir, "YearPredictionMSD.xdf")
rxTextToXdf(inFile = file.path(sampleDataDir, file.name),
            outFile = outputFile, overwrite = TRUE)

The file is available online at this address as a zip archive.
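If you prefer to script the download, here is a minimal sketch (the URL below is only a placeholder for the actual address linked above):

# placeholder URL; substitute the real download link for the dataset
zip.url  <- "http://example.com/YearPredictionMSD.txt.zip"
zip.file <- "C:\\Files\\YearPredictionMSD.txt.zip"
download.file(zip.url, zip.file, mode = "wb")
unzip(zip.file, exdir = "C:\\Files")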

Getting data with R script

Open Power BI and choose Get Data -> R Script, then copy/paste the following slightly changed code:

library(RevoScaleR)
file.name <- "YearPredictionMSD.txt";
rxOptions(sampleDataDir = "C:\\Files");
sampleDataDir <- rxGetOption("sampleDataDir");
# the rest was truncated in the original post; a plausible completion that
# returns a data frame, which is what Power BI expects from an R script:
outputFile <- file.path(sampleDataDir, "YearPredictionMSD.xdf");
df_MSD <- rxDataStep(inData = outputFile);

After pasting the code and clicking OK:

[Screenshot: Untitled - Power BI Desktop]

You will have to wait for the data to be read into memory and the data models to be created. After monitoring the memory consumption and patiently waiting, you will notice that this particular dataset (500 MB txt, or 160 MB as XDF) consumes a minimum of 3 GB of RAM, and you will end up with the preview:

[Screenshot: dataset preview]

By now you will also notice that, after saving, this Power BI document takes up to 700 MB of disk space, and all the data visualizations consume additional RAM and time. After you close the Power BI document, you will notice a lot of RAM being released.

Using R Script in the visuals

In a new Power BI document, I will create a new dataset by using Enter Data, with three “dummy” variables.

[Screenshot: dummy dataset created with Enter Data]
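For reference, the dummy dataset is roughly equivalent to this small data frame (names and values are illustrative, since the table is created through the Enter Data dialog):

# illustrative only: a three-variable dummy table like the one entered manually
dummy <- data.frame(Year = c("2000", "2001", "2002"),
                    Val1 = c(1, 2, 3),
                    Val2 = c(1, 2, 3))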

With these three variables I will try to inject the data returned from the XDF file and have it represented in Power BI.

After selecting a new visual and choosing the R visual, I inserted the following code:

library(RevoScaleR)
file.name <- "YearPredictionMSD.txt";
rxOptions(sampleDataDir = "C:\\Files");
sampleDataDir <- rxGetOption("sampleDataDir");
# truncated in the original post; a plausible completion that plots
# directly from the XDF file (V1 is the year column in this dataset):
rxHistogram(~V1, data = file.path(sampleDataDir, "YearPredictionMSD.xdf"))

And this time the result is fascinating: R plots the histogram in a split second, which means it takes advantage of the XDF file and injects the plot straight into Power BI.

[Screenshot: R visual histogram - Power BI Desktop]

This is still an outer file or dataset that Power BI does not have a clue about, meaning no slicers are available for dynamically changing the user selection.

Let’s try to insert the data into those three dummy variables, where the third one will be a factor that I have to pre-prepare. Since in this case the factor is Year, it is relatively easy to do:

library(RevoScaleR)
library(gridExtra)
library(dplyr)
# the middle lines were garbled in the original; this reconstruction assumes
# the slicer value reaches the R visual as dataset$Year:
Year <- as.character(dataset$Year)
df_f <- rxDataStep(inData = "C:\\Files\\YearPredictionMSD.xdf",
                   rowSelection = V1 %in% c(2000, 2001, 2002),
                   transforms = list(year = as.character(V1)))
grid.table(df_f %>% filter(year == Year))

Once I have this inserted in a new R visual, I just need to add a dummy slicer.

[Screenshot: RevoScale_and_PowerBI - Power BI Desktop]

Now I can easily change the years for my cross-tabulation (using the rxCrossTabs function). Since the calculation runs in the background on the whole dataset, with the dplyr package used only to omit or filter the results, it is also possible to use rxDataStep:

# create a new XDF file with a derived LateYears flag (years after 1999)
rxDataStep(inData = outputFile, outFile = "C:\\Files\\YearPredictMSD_Year.xdf",
           overwrite = TRUE, transforms = list(LateYears = V1 > 1999))
# cross-tabulate V2 against the new factor
rxCrossTabs(V2 ~ F(LateYears), data = "C:\\Files\\YearPredictMSD_Year.xdf")

In this way you will be creating a new XDF file, with the transformation, through Power BI. Bear in mind that this step might take some extra seconds to create a new variable or to make a subset, if you need one. Again, this is up to you to decide, based on the file size.

Using SQL Server procedure with R Script

This approach is not that uncommon, because it has been proven that using stored procedures with T-SQL and R code is a useful and powerful way to use SQL Server and R integration, for example within SSRS. Changing the computational context is surely another workaround.
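For illustration, switching the compute context to SQL Server from the R side would look roughly like this (a sketch; the connection string is a placeholder):

library(RevoScaleR)
# placeholder connection string; adjust server, database and credentials
sqlConnString <- "Driver=SQL Server;Server=MYSERVER;Database=MSD;Trusted_Connection=True"
sqlCompute <- RxInSqlServer(connectionString = sqlConnString)
rxSetComputeContext(sqlCompute)   # rx* functions now execute inside SQL Server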

Creating the stored procedure:

CREATE PROCEDURE [dbo].[SP_YearMSD_CrossTab]
AS
BEGIN
    DECLARE @RScript nvarchar(max)
    SET @RScript = N'
            library(RevoScaleR)
            sampleDataDir <- rxGetOption("sampleDataDir")
            # body truncated in the original post; a plausible completion:
            crosstab <- rxCrossTabs(V2 ~ F(LateYears), data = "C:\\Files\\YearPredictMSD_Year.xdf")
            OutputDataSet <- data.frame(crosstab$sums)'
    -- reconstructed call to the external script runtime:
    EXECUTE sp_execute_external_script @language = N'R', @script = @RScript;
END;

By copying this T-SQL code into the SQL Server data source (or simply running EXEC [dbo].[SP_YearMSD_CrossTab]), the result is the same.

[Screenshot: RevoScale_and_PowerBI - Power BI Desktop]

In both cases, you should have a cross-tabulated representation of the XDF dataset within Power BI. And now you can really use all the advantages of Power BI visuals and slicers, as well as any additional R predictions.

[Screenshot: RevoScale_and_PowerBI - Power BI Desktop]

There is a slight minus to this (if not to all) of these approaches: you need to generate many stored procedures or queries like this. rxCube will also help you to some extent – a minimal sketch follows below – but repetitive work cannot be avoided.
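A minimal rxCube sketch, reusing the cross-tab formula from above (returnDataFrame gives a long-format data frame that Power BI can consume directly):

# same aggregation as the cross-tab, but returned in long format
cube_df <- rxCube(V2 ~ F(LateYears),
                  data = "C:\\Files\\YearPredictMSD_Year.xdf",
                  returnDataFrame = TRUE)
head(cube_df)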

Using HDInsight or Hadoop?

Using XDF data files stored in HDInsight or in Hadoop would generally mean using the same dataset and steps as for the SQL Server procedure. Prior to executing the T-SQL script, you would also need to change the computational context:

# HD Insight - Spark - Azure
# (assignments were garbled in the original; reconstructed with the RxSpark
#  and RxHadoopMR constructors; USNM, HSTNM and SWTCH are placeholders for
#  your SSH user name, host name and switches)
HDInsight <- RxSpark(sshUsername = USNM, sshHostname = HSTNM,
                     sshSwitches = SWTCH)
rxSetComputeContext(HDInsight)
## Hadoop
Hadoop <- RxHadoopMR(sshUsername = USNM, sshHostname = HSTNM,
                     sshSwitches = SWTCH)
rxSetComputeContext(Hadoop)

Verdict

I have explored a couple of ways to use Power BI visuals and the Power BI environment with RevoScaleR XDF (eXternal Data Frame) data files. I have to admit I was surprised that there is a way to do it relatively easily, but from a data scientist's perspective it is still some additional load and work before you can start with the actual data analysis. The last two approaches (R script in visuals and SQL Server procedures) are by far the fastest and also take advantage of the parallel and distributed computations that the RevoScaleR package brings.

I would very strongly advise Microsoft and the Power BI development team to add an XDF plug-in to Power BI. The plug-in would work with a metadata presentation of the data; each time computations are needed, the metadata would push the code against R Server and have the results returned. This would surely be a great way to bring the Big Data concept to Power BI Desktop.

As always, code and samples are available on GitHub.

Happy coding!

