Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

When dealing with large datasets that potentially exceed the memory of your machine it is nice to have another possibility such as your own server with an SQL/PostgreSQL database on it, where you can query the data in smaller digestible chunks. For example, recently I was facing a financial dataset of 5 GB. Although 5 GB fit into my RAM the data uses a lot of resources. One solution is to use an SQL-based database, where I can query data in smaller chunks, leaving resources for the computation.

While MySQL is the more widely used, PostgreSQL has the advantage of being open source and free for all usages. However, we still need to get a server. One possible way to do it is to rent Amazon server, however, as I don’t have a budget for my projects and because I only need the data on my own machine I wanted to set up a server on my Windows 8.1 machine. This is how I did it.

# Installing software, Starting the Server and Setting up new Users

First, we need to install the necessary software. Besides R and RStudio we need to install PostgreSQL, that we find here. When installing we are asked to provide a password, just remember it as we need it later. Say for this example we set the password to: “DataScienceRocks”.

Now we can already access and use the database, for example we can start the interface (pgAdmin III) that was automatically installed with PostgreSQL. To connect to the database double click on the server in pgAdmin III and type in your password. The server seems to run after the installation as well. If this is not the case (i.e. you get the error “Server doesn’t listen” when trying to connect to the server with pgAdmin III), you can start the server with the following command in the command line:

 pg_ctl -D "C:Program FilesPostgreSQL9.4data" start

As we can see, we only have one user (“postgres“). It is good practice to use the database with another user that has no createrole (think of it as a non-admin user).

To set up a new user I follow this explanation. Start the command line (go to the start menu and type cmd”) and move to the folder where you installed PostgreSQL (more precisely, the bin-folder). In my case I navigated to the folder by typing:

 cd C:/Program Files/PostgreSQL/9.4/bin

Now we need to create a new user (openpg), which we can do by executing the following command:

 createuser.exe --createdb --username postgres --no-createrole --pwprompt openpg

We have to enter the password for the new user twice (note that there is no feedback from the commandline), for this example I set it to “new_user_password“, lastly we are asked to give our password for the main user (“postgres“) which is in this case “DataScienceRocks“, as specified during the installation.

We can check if we have two users by going into pgAdmin III, which should look like this:

## Creating a Table in pgAdmin III

An easy way to create a table (database) is to use pgAdmin III. Right click on the “Tables” and choose “New Table”.

For this example we create a table called cartable that we will later populate with the dataset of mtcars. For the dataset we need to specify the columes and their types as shown in the next picture.

Lastly, we need to specificy a primary key in constraints. In this case I use the carname column as key.

# Connection with R

Now it is time to connect to the database with R. This approach uses the RPostgreSQL package and this approach.

To connect, we need to enter the following commands in R:

# install.packages("RPostgreSQL")
require("RPostgreSQL")

# create a connection
# save the password that we can "hide" it as best as we can by collapsing it
pw <- {
}

drv <- dbDriver("PostgreSQL")
# creates a connection to the postgres database
# note that "con" will be used later in each connection to the database
con <- dbConnect(drv, dbname = "postgres",
host = "localhost", port = 5432,
user = "openpg", password = pw)

# check for the cartable
dbExistsTable(con, "cartable")
# TRUE


If we don’t get an error, that means we are connected to the database.

# Write and Load Data with RPostgreSQL

The following code show how we can write and read data to the database:

# creates df, a data.frame with the necessary columns
data(mtcars)
df <- data.frame(carname = rownames(mtcars),
mtcars,
row.names = NULL)
df$carname <- as.character(df$carname)
rm(mtcars)

# writes df to the PostgreSQL database "postgres", table "cartable"
dbWriteTable(con, "cartable",
value = df, append = TRUE, row.names = FALSE)

# query the data from postgreSQL
df_postgres <- dbGetQuery(con, "SELECT * from cartable")

# compares the two data.frames
identical(df, df_postgres)
# TRUE

# Basic Graph of the Data
require(ggplot2)
ggplot(df_postgres, aes(x = as.factor(cyl), y = mpg, fill = as.factor(cyl))) +
geom_boxplot() + theme_bw()


Lastly, if we are finished, we have to disconnect from the server:

# close the connection
dbDisconnect(con)


# Outro

If you have any questions about the code, PostgreSQL or pgAdmin III or if you have remarks or have found a way to do it better/faster feel free to leave a comment or write me an email.

Get the PostgreSQL software here:

PostgreSQL commandline commands: http://www.postgresql.org/docs/9.4/static/app-pg-ctl.html

Create a new User: https://doc.odoo.com/install/windows/postgres/

For a short introduction to postgreSQL queries have a look at this: http://www.postgresql.org/docs/8.4/static/tutorial-select.html

# Appendix

If you want to create a table in R instead of pgAdmin III you can do that of course. The following creates the same table as the we did earlier in pgAdmin III:

# specifies the details of the table
sql_command <- "CREATE TABLE cartable
(
carname character varying NOT NULL,
mpg numeric(3,1),
cyl numeric(1,0),
disp numeric(4,1),
hp numeric(3,0),
drat numeric(3,2),
wt numeric(4,3),
qsec numeric(4,2),
vs numeric(1,0),
am numeric(1,0),
gear numeric(1,0),
carb numeric(1,0),
CONSTRAINT cartable_pkey PRIMARY KEY (carname)
)
WITH (
OIDS=FALSE
);
ALTER TABLE cartable
OWNER TO openpg;
COMMENT ON COLUMN cartable.disp IS '
';"
# sends the command and creates the table
dbGetQuery(con, sql_command)