Build Predictive Model on Big data: Using R and MySQL Part-3

September 21, 2014

(This article was first published on Pingax » R, and kindly contributed to R-bloggers)

Welcome to last part of the series post again! In previous part I discussed about the solutions to the questions mentioned in first part. In this part, we will implement whole scenario using R and MySQL together and see how we can process bigdata(computationally :) )

Let us recall those questions and summarize their answers to get started with implementation part.

Question-1 How to optimize our predictive analysis computation for big data when we are having limited computational resource?

Answer – Make changes into gradient descent algorithm and process data into chunks into convergence process.

Question-2 What could be done in order to process large data with limited memory?

Answer – Use MySQL to process data into chunks from database and provide it to optimization algorithm (gradient descent) to build predictive model on top of it.

For learning purpose, I am here just giving the idea of implementation. This implementation would differ in different scenarios. Assume that we have dataset stored in database and having following variables for building predictive model

y : Response variable

x1 : Predictor-1

x2 : Predictor-2

We want to fit linear regression model on this data. There are more than 10,000,000,000 rows in the database and we can process only 10,000,000 at a time with given memory.

Now let us see the R code for connecting the database in R

#Connecting Database
#Load Library
library(RJDBC) #install RJDBC package if not installed

#Set Driver 
#Make sure you have jar for mysaql and java connector and placed in working directory
drv <- JDBC("com.mysql.jdbc.Driver","mysql-connector-java-5.1.18.jar",identifier.quote="`")

#Establish connection(database name is test)
conn <- dbConnect(drv, "jdbc:mysql://localhost/test", "root", "root")

Obtain total number of rows from the table “data” where actually data is stored.

#Total rows
  totalRows <- dbGetQuery(conn, "select count(*) as total from data")

Set up chunk size and calculate total no of loops required to process whole data

#set chunk size
  chunkSize <- as.bigz(10000000)

  #Total number of loops
  totalLoops <- floor(totalRows$total/as.numeric(chunkSize))

Now set parameters for gradient descent algorithm

m <- totalRows$total
  vars <- dbGetQuery(conn, "SELECT COLUMN_NAME FROM   information_schema.columns WHERE  table_name = 'data' ORDER  BY ordinal_position ")
  theta <- numeric(nrow(vars))
  alpha <- 0.01

Now it’s time to implement loop structure defined in part two.

gradient chunkWe will require implementing two loops. Lets Implement loops for gradient descent. We consider in 50 iteration algorithm will reach near to convergence. We can set it up by defining convergence criteria or step size. But here we will iterate for 50 times. It is our outer loop. In inner loop will process the whole dataset into chunks. Whole implementation is shown below ,

for(j in 1:50)
   grad_n <- numeric(length(theta)) 
    for(i in 0:totalLoops)
      index <- i*chunkSize+1
      query <- paste("select * from data limit ",index,",",chunkSize,sep="")
      data_compute <- dbGetQuery(conn,query)
      x_mat <- cbind(rep(1,nrow(data_compute)),data_compute$x1,data_compute$x2)
      y_mat <- data_compute$y
      grad_n <- grad_n + t(x_mat) %*% ((x_mat%*%theta)-y_mat)
   theta <- theta - alpha*(1/m)*grad_n

Once this whole process will be finished, we will be having best fitting theta parameters and it is the outcome of our predictive model building using linear regression. This process will take some time but it will process whole dataset and perform required calculation.

So using this implementation structure we will be able to process large dataset and also fit the regression model on it. Here I want you to think upon how we can make this structure with parallel execution if we have multiple computational units. Share your thoughts on this in comment section

I hope you liked reading the post and understood the whole concept.

Finally thanks for reading the post. If you liked the post, don’t forget to share it (social sharing panel is shown left side :) ). I will keep posting new article on this and discuss further on new implementation structure.

Till that time stay tuned and enjoy reading the other posts on R!

Powered by Google+ Comments

The post Build Predictive Model on Big data: Using R and MySQL Part-3 appeared first on Pingax.

To leave a comment for the author, please follow the link and comment on their blog: Pingax » R. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)