Welcome to the second part of this blog series. In the first part, we looked at the challenges of fitting a predictive model to a large dataset. In this post, I will discuss a solution approach to those challenges. Let's get rolling.
Since a machine learning technique requires access to the whole dataset to fit a model on it, doing so becomes computationally hard. The common solution in both scenarios (discussed at the end of the first part) is to process the data in chunks and fit the model incrementally. This requires some changes to the algorithm, so that when we process the data in chunks, we obtain the same result as fitting the model on the whole dataset.
As we assumed in the first part that we are trying to fit a linear regression model to the data, let us discuss what changes we need to make to the linear regression algorithm to fulfill our requirement of processing the data in chunks while fitting the model. We will go a little deeper into the mathematical formulas (if you want to know more about linear regression in detail, refer to this post).
In linear regression, the modelling objective is to find the best fitting parameters, and these fitting parameters are obtained by minimizing the cost function using gradient descent.
The cost function is defined as below:

J(θ) = (1/2m) Σᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

And gradient descent is defined as below (repeated until convergence, with all θⱼ updated simultaneously):

θⱼ := θⱼ − (α/m) Σᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾

Here hθ(x) = θᵀx is the hypothesis, m is the number of training examples, and α is the learning rate.
Here the thetas are our fitting parameters. Gradient descent is really the crux of linear regression, so let us have a close look at the equation. It is an iterative process: we are trying to identify the best values of the thetas so that the cost is minimized. You can read more about gradient descent and its implementation here.
As you can see in the equation, the thetas are updated simultaneously and the gradient is recalculated in every iteration. After some number of iterations the algorithm reaches convergence and we stop there. Now, in each iteration we need to calculate the gradient over the whole dataset, and with a big dataset this again becomes a memory problem. This puts us in a situation where we must process the data in chunks.
In each iteration, we process the data in chunks and accumulate the gradient for the thetas; since the gradient is a sum over data points, the partial sums computed over the chunks add up to exactly the full gradient. This requires more processing time to complete the gradient calculation for the whole dataset, but no extra memory. The structure of gradient descent now changes as below.
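The chunked gradient computation can be sketched as follows. This is a minimal illustration, not the code from this series (the series uses R; a Python sketch is shown here for compactness), and the function names are my own. The key point it demonstrates is that summing per-chunk partial gradients reproduces exactly the gradient over the whole dataset.

```python
import numpy as np

def chunked_gradient(chunks, theta):
    """Accumulate the linear-regression gradient over data chunks.

    Because the gradient is a sum over data points, summing
    per-chunk partial sums gives exactly the full-data gradient.
    """
    grad = np.zeros_like(theta)
    m = 0
    for X, y in chunks:          # each chunk: (features incl. intercept column, targets)
        errors = X @ theta - y   # h_theta(x) - y for this chunk
        grad += X.T @ errors     # partial sum of the gradient
        m += len(y)
    return grad / m              # average over ALL rows seen, not per chunk

def gradient_descent(get_chunks, theta, alpha=0.1, iterations=1000):
    """Run gradient descent, re-reading the data chunk by chunk each iteration."""
    for _ in range(iterations):
        theta = theta - alpha * chunked_gradient(get_chunks(), theta)
    return theta
```

Note that `get_chunks` is called afresh in every iteration: each pass over the data re-extracts the chunks, so only one chunk ever needs to be in memory at a time.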
One question might arise here: how can we feed the data in chunks to gradient descent without loading the whole dataset into memory? MySQL is a good option for this. This is the answer to the second question discussed in the first part:
Question 2 – What could be done in order to process large data with limited memory?
Let us discuss in detail how we can process the data in chunks and feed it to the gradient descent algorithm. Suppose the whole dataset is stored in a MySQL database; we can query a limited number of data points from the database and pass them to the algorithm. We perform this sequential extraction of data from the database until the whole dataset has been processed.
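The sequential extraction amounts to paging through the table with a LIMIT/OFFSET query. Below is a minimal self-contained sketch; the post uses MySQL, but the standard-library sqlite3 module stands in here so the example runs anywhere, and the table and function names are illustrative. The same query pattern applies to MySQL.

```python
import sqlite3

def fetch_chunks(conn, table, chunk_size):
    """Yield the table's rows in successive chunks of at most chunk_size rows."""
    offset = 0
    while True:
        rows = conn.execute(
            f"SELECT * FROM {table} LIMIT ? OFFSET ?",
            (chunk_size, offset),
        ).fetchall()
        if not rows:     # no rows left: the whole table has been processed
            break
        yield rows
        offset += chunk_size

# Build a tiny example table and read it back in chunks of 2 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany("INSERT INTO points VALUES (?, ?)", [(i, 2 * i) for i in range(5)])
chunks = list(fetch_chunks(conn, "points", 2))
```

Each chunk returned by `fetch_chunks` would be converted to a feature matrix and fed to the gradient accumulation step; only `chunk_size` rows are in memory at any moment.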
The next question might arise: how many times do we have to query the database in order to process the whole dataset? The answer depends on the number of rows in the dataset and the memory of the computing unit.
Let us consider a scenario. If we have 10,000,000,000 rows in the dataset and our computing unit is able to process 1,000,000 records at a time based on the available memory, then we need to query the database 10,000 times to process the whole dataset. This makes the process more time consuming, but we are still able to process computationally big data.
Since R and MySQL can be used together, we can achieve this sequential extraction easily.
In the next part, we will see how we can put all these things together using R and MySQL.
The post Build Predictive Model on Big data: Using R and MySQL Part-2 appeared first on Pingax.