Dpylthon…dplyr for Python!

September 5, 2018
By

(This article was first published on Open Source Automation, and kindly contributed to R-bloggers)

dplyr for python dplython

If you’re an avid R user, you probably use the famous dplyr package. Python has a package meant to be similar to dplyr, called dplython. This article will give an introduction for how to use dplython.

For the examples below, we’ll use a sample dataset that comes with R giving attributes about the US states, including population, area, and income levels. You can see the dataset by clicking here.

Initial setup

dplython can be installed using pip:.

pip install dplython

Once the package is installed, let’s load a few methods from it, and read in our dataset.

# load packages
from dplython import select, DplyFrame, X, arrange, count, sift, head, summarize, group_by, tail, mutate
import pandas as pd

# read in data
state_df = pd.read_csv("state_info.txt")

After we’ve read in our data, we need to convert it to an object called a DplyFrame, which we do using a method from dplython. This DplyFrame object will allow us to perform “dplyr-like” operations.

state_info = DplyFrame(state_df)

The Basics – How to select columns and get top rows

With dplython, selecting columns is done like below:

state_info >> select(X.State_Name, X.State_Region, X.Population)

Notice how we append X to each field name, which is a little different from the select command in dplyr. We also use >> instead of %>% as a pipe operator. If you’re not as familiar with dplyr from R, the pipe operator basically allows to you apply a function to the output of a function or object to the left the operator (the >> in dplython). In the case above, the select function is just being applied to state_info.

To select only the top 10 or bottom 10 rows, we can chain the head or tail methods, respectively, to our line of code above. Here the head and tail functions are being applied to the result to the left of each respective pipe operator.

# get top ten rows with selected columns
state_info >> select(X.State_Name, X.State_Region, X.Population) >> head(10)

# get bottom ten rows with selected columns
state_info >> select(X.State_Name, X.State_Region, X.Population) >> tail(10)


How to filter rows

Filtering can be done using the sift method. For example, if we want to get only the records where a state has an area less than 100,000 we would do this:

# filter initial data frame to states with areas less than 100,000 square miles
state_info >> sift(X.Area < 100000)

# filter rows and columns in one step
state_info >> sift(X.Area < 100000) >> 
              select(X.State_Name, X.State_Region, X.Population, X.Area)

Sorting datasets with dplython

To sort our DplyFrame by a column, we can use the arrange method, like in dplyr:

# sort DplyFrame in ascending order by Area
state_info >> arrange(X.Area)

# sort first by region, then by area
state_info >> arrange(X.State_Region, X.Area)

How to create new columns with dplython

Also similar to dplyr, we can use the mutate method to add fields to state_info:

with_extra_field = state_info >> mutate(Name_Region = X.State_Name + '-' + X.State_Region)

# create multiple fields in one step
more_fields = state_info >> mutate(Name_Region = X.State_Name + '-' + X.State_Region,
                                   Area_Per_Thousand_Sq_Mile = X.Area / 1000)

with_extra_field is now a DplyFrame containing the same fields as state_info, plus a field called “Name_Region”. more_fields contains two extra fields — one concatenating the state name and region, and one dividing the state area field by 1000.

Summarizing data

Aggregating data can be done using the summarize and group_by methods:

# get average area by region
state_info >> group_by(X.State_Region) >> summarize(average_region_area = X.Area.mean())

# or do multiple aggregations in one step
state_info >> group_by(X.State_Region) >> summarize(average_region_area = X.Area.mean(), \
                                                 >> max_region_area = X.Area.max), \
                                                 >> max_region_income = X.Income.max())

The first line of code above will show the average (mean) state area by region. The second line calculates multiple aggregations in one step, including mean and max state areas by region, as well as max state income by region.

Transposing a dataset

dplython can also transpose datasets using X.__T. (period followed by double underscore and second period).

# transpose state_info
state_info >> X._.T()

How to join tables together

dplython, like dplyr, has several “SQL-like” functions for joining datasets together:

from dplython import inner_join, left_join


right_table = more_fields >> select(X.State_Name, X.Name_Region)

# inner join
joined_result = inner_join(state_info, right_table, on = "State_Name")

# left_join
left_join_result = left_join(state_info, right_table, on = "State_Name")


In each type of join above, the first and second parameters are the left and right tables being joined. The “on” parameter specifies what column or list of columns should be used as the key to join the tables together.

Counting values in a column

To get the count of the values in a specific column, use the count function:

# get a count of each state region
state_info >> count(X.State_Region)

How to randomly sample rows with dplython

The sample_n function can be used to randomly sample rows from our dataset. You just need to pass whatever number of rows you want to randomly select (e.g. 10 in the below example).

from dplython import sample_n

# randomly select ten rows from state_info
state_info >> sample_n(10)

dplython can also sample a random proportion of a dataset. For example, if we want to randomly sample 70% of the rows, we can use the sample_frac function:

from dplython import sample_frac

# randomly sample 70% of the rows from state_info
state_info >> sample_frac(0.7)

The first line of code above will generate a DplyFrame with two columns: one showing the State Region, and the other showing the average area across each region. The second line contains this information, plus the max state area for each region, and the max state income for each region.

That’s it for this post! Please check out other Python posts of mine by clicking here.

The post Dpylthon…dplyr for Python! appeared first on Open Source Automation.

To leave a comment for the author, please follow the link and comment on their blog: Open Source Automation.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)