For many problems, classification and regression trees can be a simple and elegant solution, assuming you know their well-documented strengths and weaknesses. I first explored their use several years ago with JMP, which is easy to use. If y...

Thomas Yokota asked a very straightforward question about encodings for categorical predictors: "Is it bad to feed it non-numerical data such as factors?" As usual, I will try to make my answer as complex as possible. (I've heard the old wives' tale that Eskimos have 180 different words in their language for snow. I'm starting to think that statisticians have...

My new R package, scholar, has just been posted on CRAN. The scholar package provides functions to extract citation data from Google Scholar. In addition to retrieving basic information about a single scholar, the package also allows you to compare multiple scholars and predict future h-index values. There’s a full guide on GitHub (along

Last week in the non-life insurance course, we saw the theory of Generalized Linear Models, emphasizing the two important components: the link function (which is actually the key component in predictive modeling) and the distribution, or the variance function. Just to illustrate, consider my favorite dataset: lin.mod = lm(dist~speed,data=cars). A linear model means here dist_i = β0 + β1 speed_i + ε_i, where the residuals are assumed to be...
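As a minimal sketch of the two components named above (not code taken from the post itself), the same regression on the built-in cars data can be fitted with lm() and with glm() using a Gaussian distribution and an identity link. The object names glm.mod and log.mod are illustrative assumptions; only lin.mod appears in the excerpt.

```r
# Ordinary linear model, as in the excerpt.
lin.mod <- lm(dist ~ speed, data = cars)

# The same model written as a GLM: Gaussian distribution, identity link.
# (glm.mod is a hypothetical name chosen for this illustration.)
glm.mod <- glm(dist ~ speed, data = cars,
               family = gaussian(link = "identity"))

# Both fits give the same intercept and slope.
print(coef(lin.mod))
print(coef(glm.mod))

# Keeping the Gaussian distribution but switching to a log link changes
# how the mean depends on speed: E[dist] = exp(b0 + b1 * speed).
log.mod <- glm(dist ~ speed, data = cars,
               family = gaussian(link = "log"))
print(coef(log.mod))
```

Comparing coef(glm.mod) with coef(log.mod) shows that the link function alone changes the fitted relationship, even though the assumed distribution is the same.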

If your econometrics is a bit rusty and you're also looking to learn the R language, you can kill two birds with one stone with Introductory Econometrics using Quandl and R. The first three parts of this seven-part tutorial introduce the basics of regression analysis, while the remaining sections provide R code you can try yourself to reproduce econometric...

The chapter (Chap. 3) on Bayesian updating or learning (a most appropriate term) for discrete data is well done in Machine Learning: A Probabilistic Perspective, if a bit stretched (which is easy with 1000 pages left!). I like the remark (Section 3.5.3) about the log-sum-exp trick. While lengthy, the chapter (Chap. 4) on Gaussian models has
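For readers who have not met the log-sum-exp trick mentioned above, here is a minimal sketch in R. It is the generic form of the trick, not code from the book, and the function name log_sum_exp and the example vector log_p are assumptions made for this illustration. The idea is to subtract the maximum before exponentiating so that very negative log-probabilities do not all underflow to zero.

```r
# Compute log(sum(exp(x))) stably by factoring out the largest term.
# Assumes x contains at least one finite value.
log_sum_exp <- function(x) {
  m <- max(x)
  m + log(sum(exp(x - m)))
}

# Hypothetical log-probabilities that are far too small to exponentiate.
log_p <- c(-1000, -1001, -1002)

# Naive version: every exp() underflows to 0, so the result is -Inf.
print(log(sum(exp(log_p))))

# Stable version: returns a finite value close to -1000.
print(log_sum_exp(log_p))
```

The same function can normalize a vector of log-weights via exp(log_p - log_sum_exp(log_p)), which sums to one without ever forming the underflowing terms.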

It often irritates me that when a statistical method, such as linear regression, is explained, it is characterized by how it can be calculated rather than by what model is assumed and fitted. A typical example of this is that linear regression is often described as a method that uses ordinary least squares to calculate the best...

Bioinformatics is becoming more and more a Data Mining field. Every passing day, Genomics and Proteomics yield bucketloads of multivariate data (genes, proteins, DNA, identified peptides, structures), and every one of these biological data units is described by a number of features: length, physicochemical properties, scores, etc. Careful consideration of which features to select when trying...

(This article was first published on David Chudzicki's Blog, and kindly contributed to R-bloggers) This post will describe a way I came up with to fit a function that’s constrained to be increasing, using Stan. If you want practical help, standard statistical approaches, or expert research, this isn’t the place for you (look up “isotonic regression” or “Bayesian isotonic...