Is data mining more about fitting data well? – Exercise Results

[This article was first published on Coastal Econometrician Views, and kindly contributed to R-bloggers].
Today I am going to share the results of an exercise that I recently carried out for a start-up. The intention of the study was to identify the major attributes that drive inexperienced, less experienced, or re-skilled data miners towards a given objective, and to understand where they fall back. The twist is that the majority of them gave the same conclusion or explanation for the given objective. The results highlight the important aspects of the practice that most of them failed to recognize for the sake of a quick answer or solution.

Sample Observed:
All members of the sample had experience with both R and data mining solutions, either through course projects (free, paid, or part of a curriculum) or through industry experience. The industry experience in the sample ranged from a minimum of 1 year to a maximum of 3 years, in any domain. Details of the sample are as below:
  1. 17  – Freshers from various engineering backgrounds (both Graduates and Post-Graduates)
  2. 12  – Freshers from various quantitative backgrounds (Maths, Stats, MBAs, Econometrics, etc.)
  3. 18  – Experienced members from different industry backgrounds (data management related, programming, consulting, etc.)
  4. All members of the sample belong to two major cities of India.
About Test Data:
Bank data for customers of a particular city branch, with around 17,000 observations covering a period of one month. It has information about each customer’s age, a few demographics, the number of transactions they made in that month, whether they visited the branch in that month, and so on, for a total of 12 variables.

Infrastructure Provided:
A computing machine with 8GB RAM and an Intel Core i7 processor, with the latest R (3.1.1) and RStudio pre-installed.

Objective:
“Comment on the relationship between the variables ‘visiting branch’ and ‘age’.”

Time Limit:
A time limit of 20 minutes was given, which was almost two and a half times the average time that experienced people took to give their comments.

Highlights from the Exercise:
What was wrong in the data?
When this data was originally received, I observed that, due to a machine or man-made mistake, the column ‘age of the customer’ was recorded in an additive way. For instance, if a customer visited the branch twice in the month and his original age was 25, it appeared as 50. Hence the data showed a positive relationship between visiting the branch and age; however, this was not the case after the noise was removed.
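A minimal R sketch of how such a defect can manufacture the relationship. It runs on simulated data, not the bank data, and everything in it is an assumption made for illustration: the age range, the Poisson visit counts, the variable names, the choice to leave the age of non-visiting customers untouched, and the availability of a visit count to undo the defect. True age and visits are generated independently, so any relationship in the recorded column comes from the defect alone.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

n        <- 17000                               # roughly the size of the test data
true_age <- sample(18:60, n, replace = TRUE)    # hypothetical true ages
visits   <- rpois(n, lambda = 0.8)              # hypothetical visit counts for the month

# Reproduce the defect: age scales with the visit count (25 becomes 50
# after two visits). Customers with zero visits are assumed unaffected.
recorded_age <- true_age * pmax(visits, 1)

visited <- as.integer(visits > 0)               # the 'visiting branch' flag

cor_recorded <- cor(recorded_age, visited)      # positive, created by the defect
cor_true     <- cor(true_age, visited)          # near zero by construction

# Undo the defect, assuming the visit count is known for every customer.
corrected_age <- recorded_age / pmax(visits, 1)

print(c(recorded = cor_recorded, true = cor_true))
```

A model fitted to `recorded_age` would fit this data well and still report a relationship that the customers themselves do not have.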

Summary:
Data mining is a process of many stages, as depicted in CRISP-DM [1], and data understanding is key among them. I always suggest processing your data incrementally if you want an efficient analytical solution. Ignoring this, and simply employing whatever fits the data well, may not work in all situations.
[1] http://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining
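One possible shape for such an incremental look at the data, sketched in R on a five-row toy data frame. The column names, the values, and the plausible-age bounds of 18 and 100 are all hypothetical; the point is only the order of the steps, with inspection coming before any model is fitted.

```r
# Toy data: the last two rows carry the additive age defect (25 * 2, 40 * 3).
customers <- data.frame(
  age    = c(25, 40, 33, 50, 120),
  visits = c(0, 1, 1, 2, 3)
)

# Step 1: look at the raw distribution before fitting anything.
age_summary <- summary(customers$age)

# Step 2: flag values outside a plausible range (the bounds are assumptions).
implausible   <- customers$age < 18 | customers$age > 100
n_implausible <- sum(implausible)

# Step 3: compare the variable against related ones. A mean age that climbs
# steadily with the visit count is a warning sign to investigate, not a finding.
mean_age_by_visits <- tapply(customers$age, customers$visits, mean)

print(age_summary)
print(n_implausible)
print(mean_age_by_visits)
```

Note that the range check alone catches only the 120; the doubled age of 50 still looks plausible, which is why the cross-variable comparison in the third step matters.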

The author thanks the management of the start-up for allowing him to publish the exercise highlights. He has undertaken several programs towards analytical talent development, and the views expressed here are from his industry experience. He can be reached at mavuluri.pradeep@gmail for more details.