I had the opportunity today to check the performance of a calibration (moisture in intact sunflower seed in reflectance). This is always an exciting moment: Is the performance of the calibration for the new validation set as expected duri...

Last time around I used R to plot the average runs per game for the American League, starting in 1901. Now I’ll do the same for the National League. I'll save a comparison of the two leagues for my next post. A fundamental principle of programming is that code can be repurposed for different sets of data. So...

John Myles White, self-described "statistics hacker" and co-author of "Machine Learning for Hackers", was interviewed recently by The Setup. In the interview, he describes some of his go-to R packages for data science: Most of my work involves programming, so programming languages and their libraries are the bulk of the software I use. I primarily program in R,...

Just stumbled across a course on Coursera titled “Computing for Data Analysis”, taught by Roger D. Peng of the Johns Hopkins Bloomberg School of Public Health. Here is the description of the course: In this course you will learn how to program in R and how to use R for effective data analysis. You will learn …

Introduction: In the third installment of my series of criticisms of NHST, I focused on the notion that a p-value is nothing more than a one-dimensional representation of a two-dimensional space in which (1) the measured size of an effect and (2) the precision of this measurement have been combined in such a way that
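
The point about a p-value folding two dimensions into one can be illustrated with a small sketch. For a z-test, the p-value depends only on the ratio of the estimated effect to its standard error, so very different (effect, precision) pairs can yield the same p-value. The helper `p_from_effect` and the numbers below are made up for illustration and are not taken from the post.

```r
# Hypothetical helper: two-sided p-value of a z-test,
# given an estimated effect and its standard error.
p_from_effect <- function(effect, se) {
  z <- effect / se
  2 * pnorm(-abs(z))
}

# Made-up inputs: a large but imprecise effect, and a small but precise one.
p_large_noisy   <- p_from_effect(effect = 10,  se = 5)
p_small_precise <- p_from_effect(effect = 0.1, se = 0.05)

# Both ratios equal 2, so the two p-values coincide.
print(c(large_noisy = p_large_noisy, small_precise = p_small_precise))
```

Reporting the effect and its standard error separately keeps the information that the single p-value discards.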

Today I want to show how to use Factor Attribution to boost performance of the 1-Month Reversal Strategy. The paper Short-Term Residual Reversal by D. Blitz, J. Huij, S. Lansdorp, M. Verbeek (2011) presents the idea and discusses the results as applied to the US stock market since 1929. To improve 1-Month Reversal Strategy performance authors

In preparation for the “Haxogreen” hackers' summer camp, which takes place in Luxembourg, I was exploring the world of network security. My motivation was to find out how data mining is applicable to network security and intrusion detection. The Flame virus, Stuxnet, and Duqu proved that static, signature-based security systems are not able to detect very advanced, government sponsored

In April, Hans Rosling examined the influence of religion on fertility. I used R to replicate a graphic of his talk:
> library(datamart)
> gm <- gapminder()
> #queries(gm)
> #
> # babies per woman
> tmp <- query(gm, "TotalFertilityRate")
> babies <- as.vector(tmp)
> names(babies) <- names(tmp)
> babies <- babies
> countries <- names(babies)
> #
> # income per capita, PPP adjusted
> tmp <- query(gm, "IncomePerCapita")
> ...

Linear Programming is a mathematical technique used to find the values of some variables (within the bounds of some defined constraints) that maximize or minimize a quantity. For example, consider this problem from the FishyOperations blog: A trading company is looking for a way to maximize profit per transportation of their goods. The company has a train...
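
As a minimal sketch of the idea (not the truncated train problem, whose details are cut off here), the toy problem below maximizes 2x + 3y subject to x + y <= 4 and x + 3y <= 6 with x, y >= 0. It assumes the `lpSolve` package is installed; the excerpt does not say which solver the original post used, and all numbers are invented.

```r
library(lpSolve)  # assumption: lpSolve is installed; not named in the excerpt

# Objective coefficients for the made-up problem: maximize 2x + 3y.
objective <- c(2, 3)

# One row per constraint: x + y <= 4 and x + 3y <= 6.
constraints <- matrix(c(1, 1,
                        1, 3), nrow = 2, byrow = TRUE)
directions <- c("<=", "<=")
rhs <- c(4, 6)

# lp() treats all variables as non-negative by default.
result <- lp("max", objective, constraints, directions, rhs)

print(result$solution)  # optimal values of x and y
print(result$objval)    # value of the objective at the optimum
```

Checking the corner points by hand, (0, 0), (4, 0), (3, 1) and (0, 2), gives objective values 0, 8, 9 and 6, so the optimum sits at x = 3, y = 1.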

“Dynamite plot” is a somewhat pejorative term for a graphical display where the height of a bar indicates the mean, and the vertical line on top of it represents the standard deviation (or standard error). These displays are commonly found in many scientific disciplines, as a way of communicating group differences in means. Many...
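
To make the description concrete, here is a minimal base R sketch of such a display. It uses the built-in `PlantGrowth` data purely as a stand-in; the dataset and the choice of standard deviation (rather than standard error) are assumptions for illustration.

```r
# Group means and standard deviations from a built-in example dataset.
means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
sds   <- tapply(PlantGrowth$weight, PlantGrowth$group, sd)

# barplot() returns the x positions of the bar midpoints.
mids <- barplot(means, ylim = c(0, max(means + sds) * 1.1),
                ylab = "Mean weight")

# The "fuse": a vertical line from the bar top up to mean + sd, with a flat cap.
arrows(mids, means, mids, means + sds, angle = 90, length = 0.1)
```

The plot shows only two numbers per group, which is the usual complaint: the raw observations and the shape of their distribution are hidden.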

R has great support for Holt-Winters filtering and forecasting. I sometimes use this functionality, HoltWinters & predict.HoltWinters, to forecast demand figures based on historical data. Using the HoltWinters functions in R is pretty straightforward. Let's say our dataset looks as follows:
demand <- ts(BJsales, start = c(2000, 1), frequency = 12)
plot(demand)
Now I pass the time series object to HoltWinters and...
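
The excerpt breaks off before the fitting step. One plausible continuation, given here only as a sketch, fits the model with `HoltWinters()` from base R's `stats` package and forecasts with `predict()`. The 12-month horizon and the use of prediction intervals are assumptions, not details from the post.

```r
# Same series as in the excerpt: BJsales treated as monthly data from Jan 2000.
demand <- ts(BJsales, start = c(2000, 1), frequency = 12)

# Fit a Holt-Winters model; smoothing parameters are estimated from the data.
fit <- HoltWinters(demand)

# Forecast 12 periods ahead (assumed horizon), with prediction intervals.
forecast <- predict(fit, n.ahead = 12, prediction.interval = TRUE)

# Plot the fitted model together with the forecast and its interval.
plot(fit, forecast)
```

The `forecast` object is a multivariate time series with columns `fit`, `upr` and `lwr`.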

I list and discuss the three books on Bayesian analysis that I recommend to social scientists.

Portfolio diversity is a balancing act. Previously: the post “Portfolio diversity” talked about the role of the correlation between assets and the portfolio. The current post fills a hole in that post. The 2 dimensions: asset-portfolio correlation. Each asset in the universe has a correlation with the portfolio. If there are any assets that have …
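
A small sketch of what an asset-portfolio correlation is, using simulated returns. The four assets, the equal weights and the return distribution are all invented for illustration; the post's own data and weighting are not shown in the excerpt.

```r
set.seed(1)

# Simulated daily returns for four hypothetical assets (250 days).
returns <- matrix(rnorm(250 * 4, sd = 0.01), ncol = 4,
                  dimnames = list(NULL, c("A", "B", "C", "D")))

# Equal-weight portfolio (an assumption) and its return series.
weights <- rep(1 / 4, 4)
portfolio <- as.vector(returns %*% weights)

# Correlation of each asset's returns with the portfolio's returns.
asset_port_cor <- apply(returns, 2, cor, y = portfolio)
print(round(asset_port_cor, 2))
```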

In the earlier post we generated maps from GBIF biodiversity records using the maps and ggplot2 packages. We used a world map with country borders for that. Now we will generate maps with Google Maps as the base layer using the dismo package. As before, we download data for Danaus chrysippus from GBIF using the occurrencelist function into a data

It’s all very well publishing a research paper that describes the method for, and results of, analysing a dataset in a particular way, or a news story that contains a visualisation of an open dataset, but how can you do so transparently and reproducibly? Wouldn’t it be handy if you could “View Source” on the

Approximate Bayesian Computing and similar techniques, which calculate approximate likelihood values from samples of a stochastic simulation model, have attracted a lot of attention in recent years, owing to their promise to provide a general statistical technique for stochastic processes of any complexity, without the limitations that apply to “traditional”…
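
The simplest member of this family is rejection ABC, sketched below for a toy problem: inferring the mean of a normal distribution from simulated data. The model, the uniform prior, the summary statistic and the tolerance are all assumptions chosen for illustration, and the post may well discuss a different variant.

```r
set.seed(42)

# "Observed" data, simulated here with a true mean of 2.
observed <- rnorm(50, mean = 2, sd = 1)
obs_summary <- mean(observed)

# Draw candidate parameters from a uniform prior on [-5, 5].
n_draws <- 20000
prior_draws <- runif(n_draws, min = -5, max = 5)

# Run the stochastic simulator once per candidate and summarise its output.
sim_summary <- vapply(prior_draws,
                      function(mu) mean(rnorm(50, mean = mu, sd = 1)),
                      numeric(1))

# Keep candidates whose simulated summary lands close to the observed one.
tolerance <- 0.1
accepted <- prior_draws[abs(sim_summary - obs_summary) < tolerance]

# The accepted draws approximate the posterior of the mean.
print(length(accepted))
print(mean(accepted))
```

No likelihood is ever evaluated: the simulator stands in for it, which is why the approach extends to models whose likelihood is intractable.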

I am working on my R bootcamp materials, and I thought it would be handy to get the bootcamp computers set up by sourcing an R script that installs all necessary non-core packages. The problem? How to deploy this script efficiently. A quick method w...
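
A script of this kind might look like the sketch below. The helper names `missing_packages` and `install_missing`, the CRAN mirror and the package list in the usage comment are hypothetical; the excerpt does not show the author's script or deployment method.

```r
# Hypothetical helper: which of the requested packages are not installed yet?
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Hypothetical helper: install only what is missing, from an assumed CRAN mirror.
install_missing <- function(pkgs, repos = "https://cloud.r-project.org") {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install, repos = repos)
  }
  invisible(to_install)
}

# On each bootcamp machine, after source()-ing this file, one would call e.g.:
# install_missing(c("ggplot2", "plyr"))
```

Installing only the missing packages keeps repeated runs fast, which matters when the same script is sourced on many machines.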

I have started to explore the functionality of R, the statistical and graphics programming language. And what better data to play with than that of Major League Baseball? There have already been some good examples of using R to analyze baseball data. The most comprehensive is the on-going series at The Prince of Slides (Brian Mills, aka...

This is a Twitter retweet network. When people tweet, they may get retweeted by other people, repeating the message for their followers to view. Each retweet is a one-way flow of information that links the first person to each person who retweeted them (forwarded the original tweet into their own network). So, in this visualization
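
The one-way links described above can be represented as a directed edge list. The sketch below uses invented user names and retweets, and only base R; it is not the data or the tooling behind the visualization.

```r
# Made-up retweets: each row is one retweet, directed from the
# original author to the person who retweeted them.
retweets <- data.frame(
  original  = c("alice", "alice", "bob", "alice", "carol"),
  retweeter = c("bob", "carol", "dave", "dave", "dave"),
  stringsAsFactors = FALSE
)

# Out-degree: how many times each person was retweeted.
times_retweeted <- sort(table(retweets$original), decreasing = TRUE)
print(times_retweeted)

# Directed adjacency matrix: rows are original authors, columns are retweeters.
users <- sort(unique(c(retweets$original, retweets$retweeter)))
adjacency <- table(factor(retweets$original, levels = users),
                   factor(retweets$retweeter, levels = users))
print(adjacency)
```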