PythonMusings #6: dplyr in Python? First impressions of the siuba (小巴) module
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Whats great about Blogging and social media is that you get to learn so much if you use it right. After sharing my last blog on LinkedIn, Casper Crause told me about the siuba module created by Michael Chow from which allows Python users experience to using R’s powerful dplyr package (developed by Hadley Wickham) for data wrangling. What impressed me the most was seeing how similar the code was to dplyr syntax and functionality. There is even a pipe operator that you can use with >> used in lieu of the magrittr pipe %>%.
In this blog post, I am going to give the siuba module a spin by looking at the wine ratings dataset from the TidyTuesday project. While I hope to post an entire analysis in Python in the future, my focus is primarily on the features that the siuba has to offer. The questions, however trivial, are aimed at trying to highlight the features that are available.
The data
To stay in the spirit of doing things entirely in Python, I’m going to load the raw data from the github with pandas’ read.csv() function.
import pandas as pd
dt = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-05-28/winemag-data-130k-v2.csv")
Now that the data is loaded, lets begin with playing with the data! The following questions we’re going to ask are the following:
- What are the 10 most frequent countries in the data set?
- Who were the tasters who reviewed Cabernet Sauvignon wines?
- What was is the 5 number summary for cost of Merlot wines?
- What were the top wine varieties with the best scores?
While these questions are basic, it will give us the opportunity to show what the siuba module can do. I’m sure there are more sophisticated questions that we can ask, but this should be good for now to show off what can be done with siuba.
So without further ado, lets get into it!
1. The 10 most frequent countries in the data set.
This seems like something pretty standard that can be done in R with dplyr. The following is a comparison between dplyr in R and siuba in Python:
With dplyr (in R)
dt2 %>% select(country) %>% group_by(country) %>% summarize(n=n()) %>% arrange(desc(n)) %>% top_n(10) ## Error: `n()` must only be used inside dplyr verbs.
With siuba (in Python)
from siuba import *
from siuba.dply.vector import *
(
dt >>
select(_.country)>>
group_by(_.country)>>
summarize(n= n( _.country)) >>
# using desc
arrange(desc(_.n)) >>
top_n(10)
)
country n
40 US 54504
37 Spain 6645
31 Portugal 5691
22 Italy 19540
17 Germany 2165
15 France 22093
8 Chile 4472
3 Austria 3345
2 Australia 2329
0 Argentina 3800
There are a few differences that are immediately noticeable. With siuba we have to import all the the functions available and also import the siuba.dply.vector to get the relevant dplyr functions. Additionally, the desc(_.n) argument in the arrange() doesn’t give the top 10 most frequented countries in descending order.
After a bit of trial and error and some Googling, to remedy the siuba code, instead of writing arrange(desc(_.n)), we can write arrange(-_.n).
(
dt >>
select(_.country)>>
group_by(_.country)>>
summarize(n= n( _.country)) >>
# desc doesn't work
arrange(-_.n) >>
top_n(10)
)
country n
40 US 54504
15 France 22093
22 Italy 19540
37 Spain 6645
31 Portugal 5691
8 Chile 4472
0 Argentina 3800
3 Austria 3345
2 Australia 2329
17 Germany 2165
Much better!
2.The tasters who reviewed Cabernet Sauvignon wines
The difference between siuba and dplyr here seems to be more rooted in the difference between R and Python (but I could be wrong). R accounts for NA values while Python does not.
With dplyr (in R)
dt2 %>% filter(variety =="Cabernet Sauvignon") %>% select(taster_name) %>% group_by(taster_name) %>% summarize(n=n()) %>% arrange(desc(n)) ## Error: `n()` must only be used inside dplyr verbs.
With siuba (in Python)
(
dt >>
filter(_.variety=="Cabernet Sauvignon") >>
select(_.taster_name)>>
group_by(_.taster_name)>>
summarize(n= n( _.taster_name)) >>
# using desc
arrange(-_.n)
)
taster_name n
17 Virginie Boone 1829
11 Michael Schachner 1358
15 Sean P. Sullivan 785
13 Paul Gregutt 677
10 Matt Kettmann 454
6 Jim Gordon 399
7 Joe Czerwinski 263
9 Lauren Buzzeo 140
1 Anna Lee C. Iijima 94
16 Susan Kostrzewa 67
12 Mike DeSimone 66
14 Roger Voss 49
5 Jeff Jenssen 35
8 Kerin O'Keefe 32
0 Alexander Peartree 8
3 Carrie Dykes 7
4 Fiona Adams 6
2 Anne Krebiehl MW 2
What I like about R is that it accounts for missing data which I can see and account for, but with the way I wrote the Python code I am unable to see this.
3. 5 number summary for price of Merlot wines
With dplyr (in R)
dt2 %>% filter(variety =="Merlot") %>% select(price) %>% summary() ## price ## Min. : 4.00 ## 1st Qu.: 15.00 ## Median : 24.00 ## Mean : 29.54 ## 3rd Qu.: 35.00 ## Max. :625.00 ## NA's :40
With siuba (in Python)
(
dt >>
filter(_.variety =="Merlot")>>
select(_.price)
).describe()
price
count 3062.000000
mean 29.543436
std 33.340882
min 4.000000
25% 15.000000
50% 24.000000
75% 35.000000
max 625.000000
With dplyr I am able to pipe my filtered data directly into summary() to get a five number summary. With siuba, since the object output is still a pandas object (for good reason), I am only able to get the 5 number summary by using the .describe() method.
4. Top wine varieties with the best scores
With dplyr (in R)
dt2 %>% distinct(variety,points) %>% filter(points==max(points)) ## # A tibble: 13 x 2 ## points variety ## <dbl> <chr> ## 1 100 Muscat ## 2 100 Prugnolo Gentile ## 3 100 Champagne Blend ## 4 100 Merlot ## 5 100 Portuguese Red ## 6 100 Sangiovese ## 7 100 Cabernet Sauvignon ## 8 100 Bordeaux-style Red Blend ## 9 100 Chardonnay ## 10 100 Sangiovese Grosso ## 11 100 Syrah ## 12 100 Port ## 13 100 Bordeaux-style White Blend
With siuba (in Python)
import numpy as np
(
dt >>
distinct(_.variety,_.points) >>
filter(_.points==np.max(_.points))
)
variety points
186 Muscat 100
1188 Prugnolo Gentile 100
2477 Champagne Blend 100
2539 Merlot 100
2629 Portuguese Red 100
2744 Sangiovese 100
2747 Cabernet Sauvignon 100
3047 Bordeaux-style Red Blend 100
3567 Chardonnay 100
3840 Sangiovese Grosso 100
3875 Syrah 100
3887 Port 100
3977 Bordeaux-style White Blend 100
Aside for numpy being imported to get the max() function the code with siuba and dplyr is the same.
Conclusion
With everything I checked out. Aside for a hiccup with the desc() function, using the siuba module is great for making use of dplyr verbs with dataframes in Python. If properly maintained and stable, I can see tons of applications of siuba being used beyond jupyter notebooks in back end programs which deal with data.
This is only my small review, so I recommend everyone else give it a try themselves by installing it (pip install siuba) and seeing what you think! You won’t be disappointed!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.