Blog Archives

Reasonable Inheritance of Cluster Identities in Repetitive Clustering

August 15, 2014

… or Inferring Identity from Observations

Let’s assume the following application: A conservation organisation starts a project to geographically catalogue the remaining representatives of an endangered plant species. For that purpose hikers are encouraged to communicate the location of the plant …

Talking to Twitter’s REST API v1.1 with R

June 10, 2014

In this text I describe a straightforward way to use Twitter’s REST API v1.1. I put some code together for that purpose, so that requesting data just needs the API URL, the API …

FIR Filter Design and Digital Signal Processing in R

May 15, 2014

This article illustrates that signal processing with R is possible, thanks to the signal package, and keeps a reference of some of the things I learned in my last edX course. Anyway, I …

Relation of Word Order and Compression Ratio and Degree of Structure

May 7, 2014

Having a habit of compulsively wondering approximately every 34.765th day about how zip compression (bzip2 in this case) might be used to measure information contained in data – this time the question popped up in my head of whether or …

MapReduce with R on Hadoop and Amazon EMR

April 25, 2014

You all know why MapReduce is fancy – so let’s just jump right in. I like researching data and I like to see results fast – does that mean I enjoy the process of setting up a Hadoop cluster? No, …

Testing for Linear Separability with Linear Programming in R

April 19, 2014

For the previous article I needed a quick way to figure out whether two sets of points are linearly separable. But for crying out loud, I could not find a simple and efficient implementation for this task. Except for the perceptron and …

Impact of Dimensionality on Data in Pictures

April 16, 2014

I am excited to announce that this is supposed to be my first article also published on r-bloggers.com :) The processing of data needs to take dimensionality into account, as the usual metrics change their behaviour in subtle ways, which impacts the …

Titanic challenge on Kaggle with decision trees (party) and SVMs (kernlab)

March 28, 2014

The Titanic challenge on Kaggle is about inferring from a number of personal details whether or not a passenger survived the disaster. I gave two algorithms a try: decision trees using the R package party and SVMs using …

The tf-idf-Statistic For Keyword Extraction

February 27, 2014

The tf-idf statistic (“term frequency – inverse document frequency”) is a common tool for extracting keywords from a document by considering not just that single document but all documents in the corpus. In terms of tf-idf a word …

“Digit Recognizer” Challenge on Kaggle using SVM Classification

February 14, 2014

This article is about the “Digit Recognizer” challenge on Kaggle. You are provided with two data sets: one for training, consisting of 42,000 labeled pixel vectors, and one for the final benchmark, consisting of 28,000 vectors, while labels are not …
