Merging Data — SAS, R, and Python

June 24, 2013

(This article was first published on Adventures in Statistical Computing, and kindly contributed to R-bloggers)

On AnalyticBridge, someone asked about moving an inner join out of Excel (where it was taking many minutes via VLOOKUP()) and into some other package.  The question asked what kind of performance can be expected from other systems.  Of the systems listed, I have varying degrees of experience with SAS, R, and Python.

In the question, the user has about 80,000 records that match between 2 tables and need to be merged by a character key.

To start, let's create 2 tables in SAS.  Each will have 150,000 records and there will be 80,000 overlapping records.  We will randomize the tables (obviously the merge is trivial when the tables are sorted with the matching records on top) to be more "real world."

data A;
format address $32. a $8.;
a = "1";
do i=1 to 150000;
       rand = ranuni(123);
       if i <= 80000 then do;
              address = put(i,z32.);
       end;
       else do;
              address = "AAA" || put(i,z29.);
       end;
       output;
end;
run;

data B;
format address $32. b $8.;
b = "1";
do i=1 to 150000;
       rand = ranuni(234);
       if i <= 80000 then do;
              address = put(i,z32.);
       end;
       else do;
              address = "BBB" || put(i,z29.);
       end;
       output;
end;
run;

proc sort data=a;
by rand;
run;

proc sort data=b;
by rand;
run;
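
For readers who do not have SAS, here is a rough Python sketch of the same table construction.  It is not the code used for the timings in this post; the function name make_table and its parameters are my own inventions for illustration, and Python's random module will not reproduce the row order that ranuni() gives in SAS.

```python
import random

def make_table(prefix, flag_column, seed, n_rows=150000, n_shared=80000):
    """Build one table as a list of dicts, shuffled into random order.

    Hypothetical helper: mirrors the SAS data steps above, not the
    author's actual code.
    """
    rows = []
    for i in range(1, n_rows + 1):
        if i <= n_shared:
            # Shared keys: the row number zero-padded to 32 characters.
            address = str(i).zfill(32)
        else:
            # Unshared keys: a table-specific prefix plus a 29-digit number.
            address = prefix + str(i).zfill(29)
        rows.append({"address": address, flag_column: "1"})
    # Shuffle with a fixed seed; plays the role of sorting by rand in SAS.
    random.Random(seed).shuffle(rows)
    return rows

table_a = make_table("AAA", "a", seed=123)
table_b = make_table("BBB", "b", seed=234)
```

Each table ends up with 150,000 rows, and the first 80,000 keys are common to both, so an inner join on address should return 80,000 rows.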

By default, SAS pages tables to the hard drive, which is both good and bad.  Python and R will be starting with the tables in memory, so we use SASFILE to load the SAS tables into main memory as well.  Also note that SAS writes the result table to the hard drive; we will do the same in Python and R.

594  sasfile a load;
NOTE: The file WORK.A.DATA has been loaded into memory by the SASFILE statement.
595  sasfile b load;
NOTE: The file WORK.B.DATA has been loaded into memory by the SASFILE statement.
596
597  proc sql noprint;
598  create table temp.tableA_B as
599  select a.address,
600         a.a,
601         b.b
602      from a inner join
603           b
604        on a.address = b.address;
NOTE: Table TEMP.TABLEA_B created, with 80000 rows and 3 columns.

605  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.09 seconds
      cpu time            0.09 seconds

So SAS took 0.09 seconds, much faster than the many minutes in Excel.

In R this is trivial.  I wrote the tables to CSV files (not shown), so we will read them in, do the merge, and then save the result as an .rda file.

> system.time({
+   a = read.delim('c://temp//a.csv',header=T,sep=',')
+   b = read.delim('c://temp//b.csv',header=T,sep=',')
+ })
   user  system elapsed
   9.03    0.06    9.09

> system.time({
+   m = merge(a[c("address","a")],b[c("address","b")])
+ })
   user  system elapsed
   2.15    0.02    2.17

> system.time({
+   save(m, file="c://temp//tableA_B.rda")
+ })
   user  system elapsed
   0.21    0.00    0.20 

R took 2.17 seconds for the merge and 0.20 seconds to write, for a total of 2.37 seconds.

I am least familiar with the optimal way to do this in Python.  I have a question out to my Python guru about the best way to do the merge.  For the time being, my attempt is here.  The basic idea is to read each file into a dictionary with the address string as the key and a basic object as the value.  Then iterate over the keys in one table and check whether each is in the second table.  If so, add the merged data to a new dictionary.

Writing the merged data to a list instead of a dictionary might be faster.  A dictionary builds a hash index on the result, which neither SAS nor R is doing.
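
The dictionary approach described above can be sketched roughly as follows.  This is a minimal illustration and not the merge.py that produced the timings below; the function name inner_join, the tiny sample tables, and the use of plain dicts as the "basic object" values are all assumptions made for the example.

```python
def inner_join(left, right):
    """Inner-join two {address: record} dictionaries on their keys.

    Hypothetical helper for illustration; records are plain dicts here.
    """
    merged = {}
    for address, rec_left in left.items():
        # Dictionary lookup is a hash probe, so each check is O(1) on average.
        rec_right = right.get(address)
        if rec_right is not None:
            merged[address] = {"a": rec_left["a"], "b": rec_right["b"]}
    return merged

# Tiny made-up tables: two shared keys and one unshared key in each.
small_a = {"0001": {"a": "1"}, "0002": {"a": "1"}, "AAA3": {"a": "1"}}
small_b = {"0001": {"b": "1"}, "0002": {"b": "1"}, "BBB3": {"b": "1"}}

merged = inner_join(small_a, small_b)
print(sorted(merged))  # ['0001', '0002']
```

Only the keys present in both tables survive, which is the inner-join behavior.  Appending the merged records to a list instead of storing them in a new dictionary would skip building a second hash index on the output.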

C:\Temp>c:\Python27\python.exe merge.py
2.4200000763 Starting Merge
Took 0.2660000324 seconds to merge, 1.1210000515 total with pickle

Python took 0.27 seconds for the merge and a total of 1.12 seconds for the merge and write.

A final note: here is a screen grab of the resulting file sizes.  R wins hands down -- I assume there is some compression going on there.

This example is fairly trivial.  Hopefully, someone will find it useful while trying to learn one of these languages.
