Blog Archives

Sold! How do home features add up to its price tag?

September 5, 2016

I begin with a new project. It is from the Kaggle playground, where the objective is to build a regression model (as the response variable, also called the outcome or dependent variable, is continuous in nature) from a given set of predictors or independent variables. My motivations for working on this project are the following: Help me…

Predict Blood Donation -warmup

August 28, 2016

Continuing from my previous post, in this post I will discuss inferential and predictive analysis. About the dataset and the problem to solve, in brief: The dataset is derived from the UCI Machine Learning Repository, and the task is to predict whether a donor donated blood in March 2007 (1 stands for donating blood; 0…

Learning from data science competitions- baby steps

August 23, 2016

Of late, a considerable number of competition-winning machine learning enthusiasts have used XGBoost as their predictive analytics solution. This algorithm has taken precedence over traditional algorithms such as Random Forests and Neural Networks. The name XGBoost stands for eXtreme Gradient Boosting. The creators of this algorithm presented its implementation by winning the Kaggle Otto…

Basic assumptions to be taken care of when building a predictive model

August 9, 2016

Before starting to build a predictive model in R, the following assumptions should be taken care of. Assumption 1: The linear regression model must be linear in its parameters, and its predictors must be numeric. If the predictors are non-numeric (for example, categorical), use one-hot encoding (Python) or dummy encoding (R) to convert them to numeric. Assumption…

Data Transformations

August 8, 2016

A predictive model can fail for a number of reasons, such as: inadequate data pre-processing; inadequate model validation; unjustified extrapolation; over-fitting (Kuhn, 2013). Before we dive into data preprocessing, let me quickly define a few terms that I will be using often. Predictor/Independent/Attributes/Descriptors are the different terms that are used as…

Data Splitting

August 7, 2016

A few common steps in model building are: pre-processing the predictor data (predictors are the independent variables); estimating the model parameters; selecting the predictors for the model; evaluating the model performance; fine-tuning the class prediction rules. “One of the first decisions to make when modeling is to decide which samples will be used to…

Batch Geo-coding in R

July 4, 2015

To read multiple files from a directory and save to a data frame

June 23, 2015

There are various solutions to questions like these, but I will attempt to answer the problems that I encountered, with the working solutions that I either found or created on my own. Question 1: My initial problem was how to read multiple .CSV files and store them in a single data frame. Solution: Use…

Gini index to compute inequality or impurity in the data

May 18, 2015

"Gini index measures the extent to which the distribution of income or consumption expenditure among individuals or households within an economy deviates from a perfectly equal distribution. Thus a Gini index of 0 represents perfect equality, while an index of 100 implies perfect inequality."
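
To make the 0 to 100 scale concrete, here is a minimal Python sketch that computes the index for a small list of incomes. The function name `gini_index` and the sample incomes are hypothetical and not taken from the post. It uses the common rank-based formula for a finite sample, which is only one of several equivalent ways to compute the index.

```python
def gini_index(values):
    """Return the Gini index of non-negative values on a 0-100 scale.

    Illustrative helper (hypothetical name), using the rank-based formula
    G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    with x sorted in ascending order and ranks i = 1..n.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total <= 0:
        raise ValueError("need at least one value and a positive total")
    if xs[0] < 0:
        raise ValueError("values must be non-negative")
    # Weight each value by its rank in the sorted order.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 100.0 * (2.0 * weighted / (n * total) - (n + 1.0) / n)


equal_incomes = [5, 5, 5, 5]      # everyone earns the same
one_holds_all = [0, 0, 0, 10]     # a single household holds all income

print(gini_index(equal_incomes))
print(gini_index(one_holds_all))
```

With equal incomes the index is 0. With four households where one holds everything, this finite-sample formula gives 75 rather than 100, because its maximum is 100 × (n − 1) / n; the value of 100 is approached only as the population grows.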

Assessing Clustering Tendency in R

May 13, 2015

In clustering, a researcher or analyst faces two major questions. First, does the given dataset have any clustering tendency? Second, how does one determine the optimal number of clusters in a dataset and validate the clustering results? In this post, I attempt to answer these questions using R.
