Turning Data Into Awesome With sqldf and pandasql

April 29, 2015
By

(This article was first published on Mathew Analytics » R, and kindly contributed to R-bloggers)

Both R and Python possess libraries for using SQL statements to interact with data frames. While both languages have native facilities for manipulating data, the sqldf and pandasql provide a simple and elegant interface for conducting tasks using an intuitive framework that’s widely used by analysts.

Screenshot from 2015-04-29 11:23:02

Screenshot from 2015-04-29 11:23:44

 

 

 

 

R and sqldf

sqldf("SELECT COUNT(*) FROM df2 WHERE state = 'CA'")
 
  COUNT(*)
1        4
 
sqldf("SELECT 
            df2.firstname, 
            df2.lastname, 
            df1.var1, 
            df2.state 
      FROM df1
      INNER JOIN df2 ON df1.personid = df2.id 
      WHERE df2.state = 'TX'")
 
  firstname lastname  var1 state
1     David    Spade -2.09    TX
2       Joe  Montana  1.16    TX
 
sqldf("SELECT 
            df2.state, 
            COUNT(df1.var1) 
      FROM df1
      INNER JOIN df2 ON df1.personid = df2.id WHERE df1.var1 > 0
      GROUP BY df2.state")
 
   state COUNT(df1.var1)
1     AZ               1
2     CA               1
3     GA               1
4     IL               1
5     NC               1
6     NY               1
7     OK               1
8     SC               1
9     TX               1
10    VT               1

Python and pandasql

import pandasql as ps
 
 
q1 = """SELECT COUNT(*) FROM df2 WHERE state = 'CA'"""
 
print ps.sqldf(q1, locals())
 
   COUNT(*)
0         4
 
q2 = """
    SELECT 
        df2.firstname, 
        df2.lastname, 
        df1.var1, 
        df2.state 
    FROM df1 INNER JOIN df2 ON df1.personid = df2.id 
    WHERE df2.state = "TX";
    """
 
print ps.sqldf(q2, locals())
 
  firstname lastname  var1 state
0     David    Spade -2.09    TX
1       Joe  Montana  1.16    TX
 
q3 = """SELECT 
            df2.state, 
            COUNT(df1.var1) 
      FROM df1
      INNER JOIN df2 ON df1.personid = df2.id WHERE df1.var1 > 0
      GROUP BY df2.state"""
 
print ps.sqldf(q3, locals())
 
  state  COUNT(df1.var1)
0    AZ                1
1    CA                1
2    GA                1
3    IL                1
4    NC                1
5    NY                1
6    OK                1
7    SC                1
8    TX                1
9    VT                1

 

To leave a comment for the author, please follow the link and comment on their blog: Mathew Analytics » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)