hi everyone, please share this: if you are an experienced user of a publicly-available survey data set from any country or international organization, let's work together on some user-friendly code and a short blog post for http://asdfree.com. ...

As we’ve mentioned on previous occasions, one of the defining characteristics of data science is the emphasis on the availability of “large” data sets, which we define as “enough data that statistical efficiency is not a concern” (note that a “large” data set need not be “big data,” however you choose to define it).

by Joseph Rickert
One of the most difficult things about R, a problem that is particularly vexing to beginners, is finding things. This is an unintended consequence of R's spectacular, but mostly uncoordinated, organic growth. The R core team does a superb job of maintaining the stability and growth of the R language itself, but the innovation engine for...

What are joint models for longitudinal and survival data? In this post we will introduce, in layman's terms, the framework of joint models for longitudinal and time-to-event data. These models are applied in settings where the sample units are followed up over time; for example, we may be interested in patients suffering...

Last week, a student asked me about multiple tests. More precisely, she ran an experiment over – say – 20 weeks, with the same cohort of – say – 100 patients. And we observe some
    size=100
    nb=20
    set.seed(1)
    X=matrix(rnorm(size*nb),size,nb)
(here, I just generate some fake data). I can visualize some trajectories, over the 20 weeks,
    library(RColorBrewer)
    cl1=brewer.pal(12,"Set3")
    cl2=brewer.pal(8,"Set2")
    cl=c(cl1,cl2)...

In January 2014, Stanford University professors Trevor Hastie and Rob Tibshirani (authors of the legendary Elements of Statistical Learning textbook) taught an online course based on their newest textbook, An Introduction to Statistical Learning with Applications in R (ISLR). I found it to be an excellent course in statistical learning (also known as "machine learning"), largely...

The Random Forest algorithm is a well-known tool for data analysis. It yields robust predictive models for vastly different data sets, serving as a sort of Swiss Army Knife in the data scientist's toolkit. Given the need to accommodate ever-larger data sets, scalable...

Welcome to the last part of this series of posts! In the previous part I discussed solutions to the questions raised in the first part. In this part, we will implement the whole scenario using R and MySQL together and see how we can process big data (computationally). Let us recall those questions and summarize their answers to...

Welcome to the second part of this series of blog posts. In the first part we tried to understand the challenges of fitting a predictive model to a large dataset. In this post I will discuss a solution approach to those challenges. Let’s start rolling. As machine learning techniques require access to the whole dataset for fitting a model on...

