Sampling and the Analysis of Big Data

April 8, 2012

After my last post, I came across a few articles supporting the opinion that if you have a good reason to take random samples from a “big” dataset, you’re not committing some kind of sin: Big Data Blasphemy: Why Sample? …

The lm() function with categorical predictors

April 8, 2012

What's with those estimates?
By Ben Ogorek

In R, categorical variables can be added to a regression using the lm() function without a hint of extra work. But have you ever looked at the resulting estimates and wondered exactly what they were? First, let's define a data set.

    set.seed(12255)
    n = 30
    sigma = 2.0
    AOV.df <- data.frame(category = c(rep("category1", n) ...
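As a rough illustration of what those estimates are, here is a minimal sketch on made-up data. The data frame `toy.df`, its three category means, and the seed are assumptions for this example and are not taken from the post. Under R's default treatment contrasts, the intercept is the mean of the reference level and each remaining coefficient is that level's mean minus the reference mean.

```r
# Hypothetical data (not the post's AOV.df): three categories, 30 points each.
set.seed(1)
n <- 30
toy.df <- data.frame(
  category = rep(c("category1", "category2", "category3"), each = n),
  y = c(rnorm(n, mean = 10, sd = 2),
        rnorm(n, mean = 12, sd = 2),
        rnorm(n, mean = 15, sd = 2))
)

# lm() turns the character column into a factor; the first level
# in alphabetical order ("category1") becomes the reference level.
fit <- lm(y ~ category, data = toy.df)

# Plain group means, for comparison with the coefficients.
group.means <- tapply(toy.df$y, toy.df$category, mean)

# Intercept = mean of the reference level.
intercept.matches <- isTRUE(all.equal(
  unname(coef(fit)["(Intercept)"]),
  unname(group.means["category1"])
))

# Other coefficients = that level's mean minus the reference mean.
diff2.matches <- isTRUE(all.equal(
  unname(coef(fit)["categorycategory2"]),
  unname(group.means["category2"] - group.means["category1"])
))
diff3.matches <- isTRUE(all.equal(
  unname(coef(fit)["categorycategory3"]),
  unname(group.means["category3"] - group.means["category1"])
))

print(coef(fit))
print(group.means)
```

Changing the reference level (for example with relevel()) changes which mean the intercept reports, but not the fitted values.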

Using bigmemory for a distance matrix

April 7, 2012

The process of working on metadata and temperature series gives rise to several situations where I need to calculate the distance from every station to every other station. With a small number of stations this can be done easily on the fly with the result stored in a matrix. The matrix has rows and columns
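For a small number of stations, the on-the-fly calculation might look like the sketch below. The station coordinates, the `haversine_km` helper, and the choice of great-circle distance on a sphere of radius 6371 km are all assumptions for illustration; the post's own distance function is not shown in this excerpt.

```r
# Hypothetical station coordinates in decimal degrees (not from the post).
stations <- data.frame(
  id  = c("A", "B", "C"),
  lat = c(53.35, 51.90, 54.60),
  lon = c(-6.26, -8.47, -5.93)
)

# Great-circle (haversine) distance in km; vectorised over its arguments.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# outer() evaluates the function on every (i, j) pair of station indices.
idx <- seq_len(nrow(stations))
dist_km <- outer(idx, idx, function(i, j) {
  haversine_km(stations$lat[i], stations$lon[i],
               stations$lat[j], stations$lon[j])
})
dimnames(dist_km) <- list(stations$id, stations$id)

print(round(dist_km, 1))
```

A dense matrix of doubles needs 8 * n^2 bytes, so 40,000 stations would take roughly 12.8 GB. That growth is presumably what makes an in-memory matrix impractical for the full station list.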

What are the distributions on the positive k-dimensional quadrant with parametrizable covariance matrix? (solved)

April 7, 2012

Paulo (from the Instituto de Matemática e Estatística, Universidade de São Paulo, Brazil) has posted an answer to my earlier question both as a comment on the ‘Og and as a solution on StackOverflow (with a much more readable LaTeX output). His solution is based on the observation that the multidimensional log-normal distribution still allows
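To make the log-normal idea concrete, here is a small sketch based on the standard moment formulas for Y = exp(X) with X ~ N(mu, Sigma): E[Y_i] = exp(mu_i + Sigma_ii / 2) and Cov(Y_i, Y_j) = E[Y_i] E[Y_j] (exp(Sigma_ij) - 1). The function names and the target mean and covariance are assumptions for illustration, and this is not necessarily the construction Paulo used. The inversion only gives a valid distribution when the resulting Sigma is positive definite.

```r
# Map a target mean vector m (> 0) and covariance C to log-normal parameters.
# Hypothetical helper; requires 1 + C_ij / (m_i * m_j) > 0 for every pair.
lognormal_params <- function(m, C) {
  Sigma <- log(1 + C / outer(m, m))
  mu <- log(m) - diag(Sigma) / 2
  list(mu = mu, Sigma = Sigma)
}

# Forward direction: moments of exp(X) for X ~ N(mu, Sigma).
lognormal_moments <- function(mu, Sigma) {
  m <- exp(mu + diag(Sigma) / 2)
  C <- outer(m, m) * (exp(Sigma) - 1)
  list(mean = m, cov = C)
}

# Made-up target moments on the positive quadrant, with negative covariance.
m.target <- c(1, 2)
C.target <- matrix(c(0.5, -0.2,
                     -0.2, 1.0), nrow = 2, byrow = TRUE)

pars <- lognormal_params(m.target, C.target)

# Sigma must be positive definite for the normal distribution to exist.
sigma.is.pd <- all(eigen(pars$Sigma, symmetric = TRUE)$values > 0)

# Round trip: the implied moments should reproduce the targets.
back <- lognormal_moments(pars$mu, pars$Sigma)
print(back$mean)
print(back$cov)
```

Not every covariance matrix passes the positive-definiteness check, so the sketch shows how a target is matched when it can be, not that every target is reachable.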

Temperature Change in Ireland

April 7, 2012

Has Ireland gotten any warmer? Ask any punter on the street and they will happily inform you of wild swings, trends and dips. “Back when I was a child”, “when I was younger”, or “years ago” are the usual refrains. What’s the evidence? To answer this, I will use the temperature data from my previous

Gaussian process regression with R

April 5, 2012

I’m currently working my way through Rasmussen and Williams’s book on Gaussian processes. It’s another one of those topics that seems to crop up a lot these days, particularly around control strategies for energy systems, and I thought I should be able to at...

Obama administration unveiled a Big Data Research and Development Initiative with $200 million

April 4, 2012

Yanchang Zhao, RDataMining.com. The Obama administration unveiled a Big Data Research and Development Initiative with $200 million on March 29, 2012, to improve the ability to extract knowledge and insights from large and complex collections of digital data. Six Federal departments …

What are the distributions on the positive k-dimensional quadrant with parametrizable covariance matrix? (bis)

April 3, 2012

Wondering about the question I posted on Friday (on StackExchange, no satisfactory answer so far!), I looked further at the special case of the gamma distribution I suggested at the end. Starting from the moment conditions, and the solution is (hopefully) given by the system The resolution of this system obviously imposes conditions on those

Transaction Cost and Execution Price functionality in the Backtesting library in the Systematic Investor Toolbox

April 2, 2012

I want to introduce the Transaction Cost and Execution Price functionality in the Backtesting library in the Systematic Investor Toolbox. The Transaction Cost is implemented by a commission parameter in the bt.run() function. You may specify the commissions in $ per share for “share” type backtest and as a percentage of total trade for “weight”

Web-Scraping in R

April 2, 2012

Web-scraping, or web-crawling, sounds like a seedy activity worthy of an Interpol investigative department. The reality, however, is far less nefarious. Web-scraping is any procedure by which someone extracts data from the internet. Given that it’s possible to get the internet on computers these days, web-scraping opens an array of interesting possibilities to social-science researchers