Running multiple correlations with R and T-SQL

June 26, 2016

(This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

Getting to know the data is always an interesting part of data science. With R integration into SQL Server, the exploration part is still part of the game.

The usual way to get some statistics out of a dataset is to run frequencies, descriptive statistics and, last but not least, correlations.

Running correlations against a set of variables in plain T-SQL can be a bit of a drag, so using R code with sp_execute_external_script is a convenient alternative:

USE WideWorldImporters;
GO

 DECLARE @sql NVARCHAR(MAX)
 SET @sql = 'SELECT 
                      SupplierID
                    , UnitPackageID
                    , OuterPackageID
                    , LeadTimeDays
                    , QuantityPerOuter
                    , TaxRate
                    , UnitPrice
                    , RecommendedRetailPrice
                    , TypicalWeightPerUnit
                FROM [Warehouse].[StockItems]'

DECLARE @Rscript NVARCHAR(MAX)
SET @Rscript = N'df <- data.frame(cor(Stock, use="complete.obs", method="pearson"))
                OutputDataSet<-df'

EXECUTE sp_execute_external_script    
       @language = N'R'    
      ,@script = @Rscript
      ,@input_data_1 = @sql
      ,@input_data_1_name = N'Stock'
WITH RESULT SETS (( 
                     SupplierID NVARCHAR(100)
                    ,UnitPackageID NVARCHAR(100)
                    ,OuterPackageID NVARCHAR(100)
                    ,LeadTimeDays NVARCHAR(100)
                    ,QuantityPerOuter NVARCHAR(100)
                    ,TaxRate NVARCHAR(100)
                    ,UnitPrice NVARCHAR(100)
                    ,RecommendedRetailPrice NVARCHAR(100)
                    ,TypicalWeightPerUnit NVARCHAR(100)
                    ));

I am using WideWorldImporters (available on GitHub or CodePlex), the new demo database from Microsoft that was released earlier this month, at the beginning of June 2016.

Running this query returns the correlations from R as a data frame that T-SQL can interpret, and SSMS displays the results in the following format. Very cool.

[Screenshot: SSMS results grid showing the correlation matrix for WideWorldImporters]

The output looks very similar to what you would get in, for example, SPSS:

[Screenshot: correlation output in the IBM SPSS Statistics Viewer]

The numbers match (!) and the layout is much the same; very clear and easily readable. One thing is missing: SPSS delivers statistical significance (p-values), whereas the cor function in R only delivers the value of the Pearson correlation coefficient. To get the p-values we need to run an additional T-SQL / R procedure.

DECLARE @sql NVARCHAR(MAX)
SET @sql = 'SELECT 
                     SupplierID
                    ,UnitPackageID
                    ,OuterPackageID
                    ,LeadTimeDays
                    ,QuantityPerOuter
                    ,TaxRate
                    ,UnitPrice
                    ,RecommendedRetailPrice
                    ,TypicalWeightPerUnit
                FROM [Warehouse].[StockItems]'

DECLARE @Rscript NVARCHAR(MAX)
SET @Rscript = N'
                library(Hmisc) 
                df <- data.frame(rcorr(as.matrix(Stock), type="pearson")$P)
                OutputDataSet<-df
                '

EXECUTE sp_execute_external_script    
       @language = N'R'    
      ,@script = @Rscript
      ,@input_data_1 = @sql
      ,@input_data_1_name = N'Stock'
WITH RESULT SETS (( 
                     SupplierID DECIMAL(10,5)
                    ,UnitPackageID DECIMAL(10,5)
                    ,OuterPackageID DECIMAL(10,5)
                    ,LeadTimeDays DECIMAL(10,5)
                    ,QuantityPerOuter DECIMAL(10,5)
                    ,TaxRate DECIMAL(10,5)
                    ,UnitPrice DECIMAL(10,5)
                    ,RecommendedRetailPrice DECIMAL(10,5)
                    ,TypicalWeightPerUnit DECIMAL(10,5)
                    ));

So now we have the statistical significance for our correlation matrix. I used the rcorr function from the Hmisc library.

[Screenshot: SSMS results grid showing the matrix of p-values]

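The rcorr function has very few options to set, so results may differ from those of other functions run with their defaults; for example, rcorr deletes missing values pairwise, whereas the cor call above used complete.obs. You can also use the cor.test function: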

# cor.test() has no `use` argument; it drops incomplete pairs on its own
data.frame(p_value = cor.test(df$my_var1, df$my_var2, method = "pearson")$p.value,
           var1 = "my_var1", var2 = "my_var2")

but since the function cannot deal with a matrix or data frame, you need a loop that goes through every combination of variables and stores the results, together with the variable names, in a data frame. The rcorr function will do the trick, for now.
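For those who still want the cor.test route, the loop described above could be sketched roughly as follows. This is only an illustration, not the script from my repository: the helper name pairwise_cor_test is made up for this example, and the built-in mtcars data stands in for the Stock input data set.

```r
# Hypothetical helper: run cor.test() on every pair of columns
# and collect the results, with variable names, in one data frame.
pairwise_cor_test <- function(df, method = "pearson") {
  # All unordered pairs of column names, as a list of length-2 vectors
  pairs <- combn(names(df), 2, simplify = FALSE)
  rows <- lapply(pairs, function(v) {
    ct <- cor.test(df[[v[1]]], df[[v[2]]], method = method)
    data.frame(var1 = v[1],
               var2 = v[2],
               coefficient = unname(ct$estimate),
               p_value = ct$p.value,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Stand-in data (assumption): three numeric columns from mtcars
Stock <- mtcars[, c("mpg", "hp", "wt")]
result <- pairwise_cor_test(Stock)
print(result)
```

The result is in long format, one row per pair of variables, so inside sp_execute_external_script the RESULT SETS definition would need two text columns and two numeric columns instead of one column per variable.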

The final step would be (hint) to combine both sp_execute_external_script calls into one stored procedure, store both results from R, combine the coefficients with their significance levels and export a single table with all the information needed. This is already prepared as part of my R scripts.
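On the R side, the combining part of that hint might look something like the sketch below. It is a minimal illustration under assumptions, not the prepared script itself: the function name combine_cor_p is hypothetical, mtcars again stands in for the Stock data, and the p-values are computed with base R so the example runs without extra packages.

```r
# Hypothetical helper: flatten a coefficient matrix and a p-value matrix
# into one long table. Both are assumed square with the same names.
combine_cor_p <- function(r, p) {
  stopifnot(identical(rownames(r), rownames(p)),
            identical(colnames(r), colnames(p)))
  # Upper triangle only: each pair once, diagonal skipped
  idx <- which(upper.tri(r), arr.ind = TRUE)
  data.frame(var1 = rownames(r)[idx[, "row"]],
             var2 = colnames(r)[idx[, "col"]],
             coefficient = r[idx],
             p_value = p[idx],
             stringsAsFactors = FALSE)
}

# Stand-in data (assumption): four numeric columns from mtcars
Stock <- mtcars[, c("mpg", "hp", "wt", "qsec")]
r <- cor(Stock, use = "pairwise.complete.obs", method = "pearson")
# Matrix of p-values, one cor.test() call per cell
p <- sapply(Stock, function(x) {
  sapply(Stock, function(y) cor.test(x, y, method = "pearson")$p.value)
})
OutputDataSet <- combine_cor_p(r, p)
```

With Hmisc loaded, the r and P elements of the rcorr result could be passed to the same function instead of the two matrices built here; the NA values that rcorr puts on the diagonal of P do not matter, because only the upper triangle is read.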

Happy R-SQLing!
