Poisson regression fitted by glm(), maximum likelihood, and MCMC

October 29, 2013
The goal of this post is to demonstrate how a simple statistical model (Poisson log-linear regression) can be fitted using three different approaches. I want to demonstrate that both frequentists and Bayesians use the same models, and that it is the fitting procedure and the inference that differ. This is …
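As a rough illustration of the first two approaches named in the title, here is a minimal sketch that fits the same Poisson log-linear model with glm() and by direct maximisation of the log-likelihood. The simulated data, the helper name neg_loglik, and the use of optim() are assumptions made for this sketch, not taken from the post itself; the MCMC fit is not shown.

```r
# Simulated data (an assumption for illustration, not the post's data)
set.seed(1)
n <- 200
x <- runif(n, 0, 2)
y <- rpois(n, lambda = exp(0.5 + 0.8 * x))

# Approach 1: glm() with a Poisson family and log link
fit_glm <- glm(y ~ x, family = poisson(link = "log"))

# Approach 2: minimise the negative Poisson log-likelihood directly
# (neg_loglik is a hypothetical helper name)
neg_loglik <- function(beta) {
  eta <- beta[1] + beta[2] * x            # linear predictor
  -sum(dpois(y, lambda = exp(eta), log = TRUE))
}
start <- c(log(mean(y)), 0)               # intercept-only starting values
fit_ml <- optim(start, neg_loglik, method = "BFGS")

coef_glm <- unname(coef(fit_glm))
coef_ml <- fit_ml$par
print(rbind(glm = coef_glm, optim = coef_ml))
```

Both fits target the same likelihood, so the two rows of estimates should agree up to the optimiser's numerical tolerance.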

Call them what you will

October 28, 2013
I’ve been playing around with the R package texreg for creating combined regression tables for multiple models. It’s not the only package to do that – see here for a review – but it’s often handy to be able to generate ASCII art, LaTeX, and HTML versions of the same table using almost identical

The Basics of Encoding Categorical Data for Predictive Models

October 23, 2013
Thomas Yokota asked a very straightforward question about encodings for categorical predictors: "Is it bad to feed it non-numerical data such as factors?" As usual, I will try to make my answer as complex as possible. (I've heard the old wives' tale that Eskimos have 180 different words in their language for snow. I'm starting to think that statisticians have...
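To make the question concrete, here is a small sketch of two common encodings of a factor, built with base R's model.matrix(). The toy data frame and its colour column are hypothetical and are not from the post.

```r
# Hypothetical toy data with one categorical predictor
d <- data.frame(colour = factor(c("red", "green", "blue", "green")))

# Default treatment contrasts: an intercept plus k - 1 dummy columns,
# with the first level ("blue") acting as the reference
X_treatment <- model.matrix(~ colour, data = d)

# Removing the intercept gives one indicator column per level
X_full <- model.matrix(~ colour - 1, data = d)

print(X_treatment)
print(X_full)
```

Most R modelling functions that take a formula perform the first encoding internally, which is why passing a factor directly usually works.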

New R package: scholar

October 23, 2013
My new R package, scholar, has just been posted on CRAN. The scholar package provides functions to extract citation data from Google Scholar. In addition to retrieving basic information about a single scholar, the package also allows you to compare multiple scholars and predict future h-index values. There’s a full guide on Github (along

GLM, non-linearity and heteroscedasticity

October 22, 2013
Last week, in the non-life insurance course, we saw the theory of Generalized Linear Models, emphasizing the two important components: the link function (which is actually the key component in predictive modeling), and the distribution, or the variance function. Just to illustrate, consider my favorite dataset: lin.mod = lm(dist~speed,data=cars). A linear model means here $Y_i=\beta_0+\beta_1 X_i +\varepsilon_i$, where the residuals are assumed to be...
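The two components can be varied separately on the cars data. The sketch below keeps the excerpt's lin.mod and adds two variants; the names log.mod and pois.mod and the particular families chosen are assumptions for illustration, not necessarily the ones used in the post.

```r
# Linear model: identity link, Gaussian errors with constant variance
lin.mod <- lm(dist ~ speed, data = cars)

# Change only the link: still Gaussian, but E[Y] = exp(b0 + b1 * speed)
log.mod <- glm(dist ~ speed, data = cars, family = gaussian(link = "log"))

# Change the variance function too: Poisson, so Var(Y) = E[Y]
pois.mod <- glm(dist ~ speed, data = cars, family = poisson(link = "log"))

# Compare the fitted means of the three models at a few speeds
new_speeds <- data.frame(speed = c(5, 15, 25))
print(cbind(
  new_speeds,
  linear       = predict(lin.mod, new_speeds),
  gaussian_log = predict(log.mod, new_speeds, type = "response"),
  poisson_log  = predict(pois.mod, new_speeds, type = "response")
))
```

The link changes the shape of the fitted mean curve, while the family changes how observations are weighted when the curve is estimated.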

An introduction to Econometrics, using R

October 22, 2013
If your econometrics is a bit rusty and you're also looking to learn the R language, you can kill two birds with one stone with Introductory Econometrics using Quandl and R. The first three parts of this seven-part tutorial introduce the basics of regression analysis, while the remaining sections provide R code you can try yourself to reproduce econometric...

machine learning [book review, part 2]

October 21, 2013
The chapter (Chap. 3) on Bayesian updating or learning (a most appropriate term) for discrete data is well done in Machine Learning, a probabilistic perspective, if a bit stretched (which is easy with 1000 pages left!). I like the remark (Section 3.5.3) about the log-sum-exp trick. While lengthy, the chapter (Chap. 4) on Gaussian models has
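For readers who have not met the log-sum-exp trick mentioned above, here is a minimal sketch of the standard version: subtract the maximum before exponentiating, then add it back on the log scale. The function name log_sum_exp and the example values are assumptions for illustration and are not taken from the book.

```r
# Numerically stable log(sum(exp(a))); log_sum_exp is a hypothetical name
log_sum_exp <- function(a) {
  m <- max(a)
  m + log(sum(exp(a - m)))   # the largest term becomes exp(0) = 1
}

a <- c(-1000, -1001, -1002)  # e.g. log-probabilities of very unlikely events

naive <- log(sum(exp(a)))    # every exp() underflows to 0
stable <- log_sum_exp(a)

print(c(naive = naive, stable = stable))
```

The naive computation loses all information to underflow, whereas the shifted version only ever exponentiates values that are at most zero.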

How Do You Write Your Model Definitions?

October 20, 2013
I’m often irritated that when a statistical method, such as linear regression, is explained, it is characterized by how it can be calculated rather than by what model is assumed and fitted. A typical example of this is that linear regression is often described as a method that uses ordinary least squares to calculate the best...

Introduction to Feature selection for bioinformaticians using R, correlation matrix filters, PCA & backward selection

October 17, 2013
Bioinformatics is becoming more and more a Data Mining field. Every passing day, Genomics and Proteomics yield bucketloads of multivariate data (genes, proteins, DNA, identified peptides, structures), and every one of these biological data units is described by a number of features: length, physicochemical properties, scores, etc. Careful consideration of which features to select when trying...
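As a rough idea of what a correlation matrix filter does, the sketch below drops one feature from every highly correlated pair using base R only. The simulated features f1 to f4, the 0.9 cutoff, and the variable names are assumptions for illustration and are not taken from the post.

```r
# Simulated feature table (hypothetical): f2 is a near-duplicate of f1
set.seed(42)
n <- 100
f1 <- rnorm(n)
f2 <- f1 + rnorm(n, sd = 0.05)
f3 <- rnorm(n)
f4 <- rnorm(n)
X <- data.frame(f1, f2, f3, f4)

# Absolute pairwise correlations, keeping each pair once (lower triangle)
cor_mat <- abs(cor(X))
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- 0

# Flag the later feature of every pair whose |correlation| exceeds the cutoff
cutoff <- 0.9
to_drop <- rownames(cor_mat)[apply(cor_mat, 1, function(r) any(r > cutoff))]
X_filtered <- X[, setdiff(names(X), to_drop), drop = FALSE]

print(to_drop)
print(names(X_filtered))
```

Which member of a correlated pair to keep is a design choice; this sketch simply keeps the one that appears first in the table.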