Blog Archives

Ordinary Least Squares is dead to me

November 28, 2013
By

Most books that discuss regression modeling start out and often finish with Ordinary Least Squares (OLS) as the technique to use; Generalized Least Squares (GLS) sometimes get a mention near the back. This is all well and good if the readers’ data has the characteristics required for OLS to be an applicable technique. A lot

Read more »

R now has its own shelf in Dillons

November 24, 2013
By
R now has its own shelf in Dillons

I was in Dillons, the one opposite University College London, at the start of the week and what did I spy there? There is now a bookshelf devoted to R (right, second from top) in the programming languages section. The shelf would be a lot fuller if O’Reilly did not have a complete section devoted

Read more »

I made a mistake, please don’t shoot me

July 31, 2013
By

The major difference between commercial/academic written software is the handling of user mistakes, or to be more exact what is considered to be a user mistake. In the commercial world the emphasis is on keeping the customer happy, which translates into trying hard to gracefully handle any ‘mistake’ the user makes. Academic software is generally

Read more »

Amount of end-user usage of code in Firefox

July 25, 2013
By
Amount of end-user usage of code in Firefox

How much end-user usage does the code in Firefox receive over time? Short answer: The available data is very sparse and lots of hand waving is needed to concoct something. The longer answer is below as another draft section from my book Empirical software engineering with R. As always comments and pointers to more data

Read more »

Unique bytes in a sliding window as a file content signature

July 21, 2013
By
Unique bytes in a sliding window as a file content signature

I was at a workshop a few months ago where a speaker pointed out a useful technique for spotting whether a file contains compressed data, e.g., a virus hidden in a script by compressing it to look like a jumble of numbers. Compressed data contains a uniform distribution of byte values (after all, compression is

Read more »

Preferential attachment applied to frequency of accessing a variable

May 17, 2013
By
Preferential attachment applied to frequency of accessing a variable

If, when writing code for a function, up to the current point in the code distinct local variables have been accessed for reading times (), will the next read access be from a previously unread local variable and if not what is the likelihood of choosing each of the distinct variables (global variables are ignored

Read more »

Prioritizing project stakeholders using social network metrics

April 20, 2013
By
Prioritizing project stakeholders using social network metrics

Identifying project stakeholders and their requirements is a very important factor in the success of any project. Existing techniques tend to be very ad-hoc. In her PhD thesis Soo Ling Lim came up with a very interesting solution using social network analysis and what is more made her raw data available for download I have

Read more »

Never too experienced to make a basic mistake

April 15, 2013
By

I was one of the 170 or so people at the Data Science hackathon in London over the weekend. As always this was well run by Carlos and his team who kept us fed, watered and connected to the Internet. One of the three challenges involved a dataset containing pairs of Twitter users, A and

Read more »

Push hard on a problem here and it might just pop up over there

April 2, 2013
By

One thing I have noticed when reading other peoples’ R code is that their functions are often a lot longer than mine. Writing overly long functions is a common novice programmer mistake, but the code I am reading does not look like it is written by novices (based on the wide variety of base functions

Read more »

R needs some bureaucracy

March 12, 2013
By

Writing a program in R is almost bureaucracy free: variables don’t need to be declared, the language does a reasonable job of guessing the type a value might need to be automatically be converted to, there is no need to create a function having a special name that gets called at program startup, the commonly

Read more »