
Welcome to this series of blog posts. I have been writing about machine learning with R for a long time. Today I am going to discuss the big data problem that arises when fitting machine learning models, and a solution that uses MySQL and R.

Before we jump to the solution, let us talk a little about big data. (If you want to start with the R code directly, you can skip this section and jump to the last part.)

Nowadays, huge amounts of data are generated from various sources, and we often need analytics on top of them. Predictive analytics is one of the hottest topics of recent times: everyone wants predictive analysis on big data, and it certainly helps businesses and affects the bottom line.

The problem with predictive analysis is that it requires a lot of mathematical computation on the data and is a very memory-intensive process. So whenever we deal with data that is big in computational terms, performing those calculations becomes more difficult.

This leaves us with two questions:

1. How can we optimize predictive analysis computation for big data when we have limited computational resources?
2. What can be done to process large data with limited memory?
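To get a feel for the second question, here is a rough back-of-the-envelope sketch in R. It assumes every column is stored as a double (8 bytes per value); the function name `estimate_gb` and the dataset dimensions are made up for illustration and do not come from any real dataset.

```r
# Rough estimate of the RAM needed just to hold a numeric table in R.
# Assumption: every value is a double, which takes 8 bytes.
estimate_gb <- function(n_rows, n_cols, bytes_per_value = 8) {
  n_rows * n_cols * bytes_per_value / 1024^3
}

# Hypothetical dataset: 50 million rows and 20 numeric columns.
# This works out to roughly 7.5 GB before any computation starts.
estimate_gb(5e7, 20)

# Sanity check of the arithmetic against a small real object.
x <- matrix(0, nrow = 1e5, ncol = 20)
print(object.size(x), units = "MB")
```

On a machine with 8 GB of RAM, a table like the hypothetical one above would leave almost nothing for the operating system, let alone for model fitting.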

Now let us discuss solutions to these challenges. One of the best known is the Hadoop ecosystem with its parallel computation power. In recent years, Hadoop has become one of the most popular open-source solutions for big data operations.

Hadoop is built on parallel computation across a cluster and on the Hadoop Distributed File System (HDFS). Running a machine learning algorithm on a Hadoop cluster requires knowledge of MapReduce programming, which makes the learning curve steeper if you are not comfortable with programming.

But when we have limited computational resources (only a single PC), Hadoop won't help us perform computation on a larger dataset. This forces us to look for another solution. One alternative is to use R and MySQL together.

Let us start with question 1: how do we optimize predictive analysis computation?

Here, predictive analysis computation means building a machine learning model on a dataset, and a machine learning model consists of mathematical formulas. Let us dive into a predictive model and try to understand why it becomes computationally hard to work with larger data.

Basic predictive models are built using linear and logistic regression techniques (you can refer to our post on linear regression for more detail). Let's say we are building a linear regression model. When the data is large, we face one of two challenges:

1. The data is so large that we cannot load it into memory to use it in R.
2. The data fits in memory, but the remaining memory is not enough to perform the mathematical computation.
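The second challenge can be illustrated with a minimal sketch on simulated data. The variable names and coefficients are invented for this example. It compares the size of a data frame with the size of the `lm` object fitted on it. Note that `object.size()` measures each object separately and ignores any memory the two may share, so treat the numbers as a rough indication rather than an exact measurement.

```r
set.seed(42)  # reproducible simulated data

# Simulated dataset: 100,000 rows, three predictors and a response.
n <- 1e5
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- 1 + 2 * df$x1 - df$x2 + 0.5 * df$x3 + rnorm(n)

# Fit an ordinary linear regression.
fit <- lm(y ~ x1 + x2 + x3, data = df)

# The fitted object keeps the model frame, the QR decomposition,
# residuals and fitted values, so it is larger than the data itself.
data_mb <- as.numeric(object.size(df)) / 1024^2
fit_mb  <- as.numeric(object.size(fit)) / 1024^2
cat(sprintf("data: %.1f MB, fitted model: %.1f MB\n", data_mb, fit_mb))
```

Even for plain linear regression, fitting needs working memory on top of the data, which is why a dataset that barely fits in RAM can still fail at the modelling step.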

Both scenarios call for a solution that lets us process large data in R and still perform the calculations. I will stop here for this post; in the next post we will look at solutions to these challenges.