How To: 20 Minute Guide to Get Started with PivotalR
According to job-posting trends, R's growth over the past few months has been significant: roughly an order of magnitude ahead of two close competitors, SAS and SPSS, both of which are in decline.
Not only do internet giants like Google, Twitter, Facebook, and Bing rely on R for daily analysis of big data, but data-savvy companies of all sizes and focuses also turn to R for insight. Companies from the car-sharing startup Uber, to established names such as Nordstrom and Lloyd's, to automotive stalwarts like Ford, to government agencies like the FDA, and banking visionaries like Bank of America all use R for statistical analysis.
Part of R's power lies in the fact that it has been developed by experts, is very comprehensive, and sports serious graphics capabilities. Thanks to its open-source nature, a broad group of users has contributed over 4800 packages covering topics like econometrics, spatial analysis, bioinformatics, and more. Its adoption is also helped by the fact that it works with virtually every data store you might be interested in.
However, as discussed in a previous blog post, one of the shortcomings of R is that it is not natively equipped for big data. As part of the larger big data ecosystem, Pivotal has created its own package for R that eases its use with Pivotal products such as Greenplum Database, Pivotal HD and HAWQ.
Below, we will provide some background on PivotalR, help you install the system, load data, and explain important data manipulations.
Why PivotalR?
PivotalR lets R users interact with the Pivotal Greenplum Database, Pivotal HD / HAWQ, and the open-source database PostgreSQL. It does so by providing an interface to operations on tables and views in the database. These operations are almost the same as those on data.frame, R's standard structure for storing data tables or collections of variables, similar to matrices or lists. Importantly, only a minimal amount of data is transferred between R and the database system, which keeps the interface efficient. R users also do not need to learn SQL to operate on objects in the database, because PivotalR handles the translation.
PivotalR also allows users to run the functions of the open-source big-data machine learning package MADlib directly from R.
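To give a flavor of the workflow before walking through it step by step, here is a minimal sketch. It assumes a local database with MADlib installed that already contains a table named "abalone"; the whole analysis runs in the database, and no SQL is written by hand.
library(PivotalR)
cid <- db.connect(port = 5432, dbname = "madlib")        # connect to the database (port is an assumption)
dat <- db.data.frame("abalone", conn.id = cid)           # a table wrapper; no data is pulled into R
fit <- madlib.lm(rings ~ length + diameter, data = dat)  # linear regression runs in-database via MADlib
fit                                                      # print the fitted coefficients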
Getting Started
PivotalR is the interface to the database and MADlib. To use it, we need to install a database, the MADlib package, and PivotalR.
1. Installing the database
Right now, PivotalR supports connections to PostgreSQL, Greenplum Database (GPDB), and Pivotal HD's HAWQ. Of the supported platforms, GPDB and HAWQ are massively parallel processing (MPP) database systems. Although we cannot run parallel computations on PostgreSQL, it is still very useful: deploying MADlib and PivotalR on PostgreSQL gives the user access to big data that cannot be fully loaded into memory, although it will not get the same speedup as a distributed environment like GPDB or HAWQ.
2. Installing MADlib
One can get the MADlib installation package and install it on the database.
3. Installing PivotalR
One can either install PivotalR from CRAN or GitHub. The current version on CRAN is 0.1.16.12.
install.packages("PivotalR")
or install it from GitHub
install.packages("devtools")
devtools::install_github("gopivotal/PivotalR")
The latest PivotalR on GitHub contains an introductory paper, which has been accepted by R Journal and will appear in the next issue. To view this paper, run
vignette("pivotalr")
or view it on CRAN.
4. Start
Load the library and connect to a database.
library(PivotalR)
db.connect(port = 5333, dbname = "madlib")
## Created a connection to database with ID 2
## [1] 2
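Multiple connections can be open at the same time. As a quick sketch (the connection ID below is simply the one returned above), you can list the active connections and close one when you are done:
db.list()          # list all active database connections
db.disconnect(2)   # close the connection with ID 2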
Loading Data for PivotalR
Before we can begin manipulating data, we need to load a data set. We will use the data set named “abalone” that is included in the package.
delete("abalone")
dat <- as.db.data.frame(abalone, "abalone")
## Counting and caching the data table dimension ... 0.022 sec ... done.
## An R object pointing to "abalone" in connection 1 is created !
names(dat) # column names
## [1] "id" "sex" "length" "diameter" "height" "whole"
## [7] "shucked" "viscera" "shell" "rings"
dim(dat) # dimension of the table
## [1] 60 10
dat.mem <- lk(dat, -1)   # load the whole table into memory as a data.frame
Here, the second argument of the lk function specifies how many rows are loaded, and -1 means loading the whole result.
Data Manipulations in PivotalR
Understanding the Data Structure
PivotalR defines a db.data.frame class. Objects of this class are wrappers around data tables in the database. Almost all data.frame operations can be applied to such an object, which looks and behaves like a data.frame; this is why the class is called db.data.frame. However, operations on a db.data.frame are not evaluated in the database immediately. Instead, a db.Rquery object is generated, which can be used in the same way as a db.data.frame object. One can choose to evaluate a db.Rquery object x at any time by calling lk(x) to look at its results. Both db.data.frame and db.Rquery objects can be used directly with the MADlib machine learning function wrappers, making a very powerful combination.
Although a db.data.frame object behaves like a data.frame object, there is a major difference. The data table in the database is essentially a set, which has no order on its rows, whereas R's own data.frame has an intrinsic row order. For example, in the code above, there is no guarantee that you get the same order of rows if you run the command dat.mem <- lk(dat, -1) multiple times. If you want to fix the order of rows loaded into R memory, you can use
dat.mem <- lk(sort(dat, FALSE, dat$id), -1)
where the sort function produces a db.Rquery object that contains the SQL query for sorting the data, and the lk function evaluates it in the database.
1. Extraction & Replacement
PivotalR supports the extraction and replacement syntax of data.frame. For example, the following adds a new column named grp, which is the remainder of id divided by 3:
dat$grp <- dat$id %% 3
At this point, dat becomes a db.Rquery object. The new column grp is derived from the column id, and its value is not evaluated yet.
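The same bracket syntax also extracts columns and filters rows. The following is a small sketch (the column names come from the abalone table above); both expressions produce db.Rquery objects, so nothing is computed until lk is called.
two.cols <- dat[, c("length", "rings")]   # select two columns
old <- dat[dat$rings > 10, ]              # keep only the rows with more than 10 rings
lk(old, 5)                                # evaluate and look at 5 rows of the result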
2. Joins
The data.frame function that corresponds to SQL's join is merge. PivotalR supports the same function on the data table wrappers.
merge.self <- merge(dat, dat, by = "id")
names(merge.self)
## [1] "id" "sex_x" "length_x" "diameter_x" "height_x"
## [6] "whole_x" "shucked_x" "viscera_x" "shell_x" "rings_x"
## [11] "grp_x" "sex_y" "length_y" "diameter_y" "height_y"
## [16] "whole_y" "shucked_y" "viscera_y" "shell_y" "rings_y"
## [21] "grp_y"
The merge function joins two tables. If the two tables have columns with the same names, the name conflict is resolved automatically by appending suffixes (here _x and _y).
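The same call joins two different tables as well. The sketch below is hypothetical: it assumes a second table named "abalone_extra" exists in the database and shares an id column with dat.
dat2 <- db.data.frame("abalone_extra")   # hypothetical second table wrapper
joined <- merge(dat, dat2, by = "id")    # join on the shared id column
names(joined)                            # shared column names get suffixes, as above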
3. Split-Apply-Combine
x <- by(dat, dat$sex, mean)
lk(x)
## sex id_avg sex_avg length_avg diameter_avg height_avg whole_avg
## 1 F 1996 0.6164 0.4827 0.1675 1.2255
## 2 M 2079 0.5593 0.4370 0.1568 1.0058
## 3 I 2162 0.4378 0.3336 0.1086 0.4559
## shucked_avg viscera_avg shell_avg rings_avg grp_avg
## 1 0.5082 0.27609 0.3735 11.14 0.9545
## 2 0.4370 0.21925 0.2803 10.65 0.7500
## 3 0.2005 0.09906 0.1372 7.50 0.6667
The by function groups the data using the values of dat$sex and applies the function mean to each group. Again, x is not evaluated until we apply the lk function to it to look at the actual values.
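The same pattern works with other aggregates. For example, here is a sketch that replaces mean with sum (assuming sum is available as an in-database aggregate in your PivotalR/MADlib version):
y <- by(dat, dat$sex, sum)   # per-group sums instead of means; still not evaluated
lk(y)                        # evaluate and look at the result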
4. Categorical variables
PivotalR also supports basic operations on categorical variables. The factors can be used in the MADlib function wrappers.
dat$grp <- as.factor(dat$grp)   # treat grp as a categorical variable
fit <- madlib.lm(rings ~ length + diameter + shell + sex + grp, data = dat)
fit
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.6312 2.3921 2.77 0.0077 **
## length -8.5634 15.5409 -0.55 0.5840
## diameter 13.3174 20.3185 0.66 0.5151
## shell 9.0381 4.6892 1.93 0.0594 .
## sex.I -1.0243 1.0017 -1.02 0.3112
## sex.M 0.5216 0.8055 0.65 0.5201
## grp.1 -0.1991 0.8155 -0.24 0.8081
## grp.2 0.0144 0.7912 0.02 0.9855
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## R-squared: 0.4505
## Condition Number: 110.5
Here both sex and grp are categorical variables, and madlib.lm is the wrapper function for MADlib's linear regression. The function automatically takes care of the categorical variables.
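Once a model has been fit, it can be applied back to data that lives in the database. The following is a sketch using the predict method, assuming the fitted model was stored in an object called fit as above; the predictions stay in the database until lk pulls them into R.
pred <- predict(fit, newdata = dat)   # a db.Rquery with in-database predictions
lk(pred, 5)                           # evaluate and look at 5 predicted values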
5. Columns that have array values
## Create a column with array values
dat$arr <- dat[, 3:10]
lk(dat, 10)
## id sex length diameter height whole shucked viscera shell rings grp
## 1 3151 F 0.655 0.505 0.165 1.3670 0.5835 0.3515 0.3960 10 1
## 2 2026 F 0.550 0.470 0.150 0.9205 0.3810 0.2435 0.2675 10 1
## 3 3751 I 0.435 0.375 0.110 0.4155 0.1700 0.0760 0.1450 8 1
## 4 720 I 0.150 0.100 0.025 0.0150 0.0045 0.0040 0.0050 2 0
## 5 1635 F 0.575 0.470 0.155 1.1160 0.5090 0.2380 0.3400 10 0
## 6 2648 I 0.500 0.390 0.125 0.5830 0.2940 0.1320 0.1605 8 2
## 7 1796 F 0.580 0.430 0.170 1.4800 0.6535 0.3240 0.4155 10 2
## 8 209 F 0.525 0.415 0.170 0.8325 0.2755 0.1685 0.3100 13 2
## 9 1451 I 0.455 0.335 0.135 0.5010 0.2740 0.0995 0.1065 7 2
## 10 1108 I 0.510 0.380 0.115 0.5155 0.2150 0.1135 0.1660 8 1
## arr.1 arr.2 arr.3 arr.4 arr.5 arr.6 arr.7 arr.8
## 1 0.6550 0.5050 0.1650 1.3670 0.5835 0.3515 0.3960 10.0000
## 2 0.5500 0.4700 0.1500 0.9205 0.3810 0.2435 0.2675 10.0000
## 3 0.4350 0.3750 0.1100 0.4155 0.1700 0.0760 0.1450 8.0000
## 4 0.1500 0.1000 0.0250 0.0150 0.0045 0.0040 0.0050 2.0000
## 5 0.5750 0.4700 0.1550 1.1160 0.5090 0.2380 0.3400 10.0000
## 6 0.5000 0.3900 0.1250 0.5830 0.2940 0.1320 0.1605 8.0000
## 7 0.5800 0.4300 0.1700 1.4800 0.6535 0.3240 0.4155 10.0000
## 8 0.5250 0.4150 0.1700 0.8325 0.2755 0.1685 0.3100 13.0000
## 9 0.4550 0.3350 0.1350 0.5010 0.2740 0.0995 0.1065 7.0000
## 10 0.5100 0.3800 0.1150 0.5155 0.2150 0.1135 0.1660 8.0000
Sometimes it is very useful to construct a column whose values are arrays. Here we create a new column arr, whose values are arrays of numbers, and load 10 rows of the data into memory as a data.frame. This data.frame has a column arr, which is an array, and its elements are displayed as arr.1, arr.2, …, arr.8. The equivalent SQL query for constructing arr is SELECT ARRAY[length, diameter, height, whole, shucked, viscera, shell, rings] FROM abalone. The syntax 3:10 makes selecting multiple columns very easy; if you had 1000 columns and wanted to construct an array from them, this R syntax would be very handy.
Computations that involve dates, times, and time intervals in the database are also supported by PivotalR, as are regular expressions.
Get Started with PivotalR
Now that you know how to install PivotalR with MADlib and perform some basic functions with machine learning or basic data table manipulations, it’s time to get started and learn more.
- Attend the author's upcoming talk at the useR! 2014 conference, on July 1st from 16:00 to 17:30, in the business track of Session 3.
- Install PostgreSQL, MADlib, and PivotalR
- Read other articles from this blog about PivotalR, MADlib, Pivotal Greenplum DB, Pivotal HD, or data science
- Check out more product information, documentation, or download Pivotal Greenplum DB and Pivotal HD/HAWQ