Score with scoring rules

July 21, 2009

(This article was first published on Decision Science News » R, and kindly contributed to R-bloggers)


We have all been there. You are running an experiment in which you would like participants to tell you what they believe. In particular, you’d like them to tell you what they believe to be the probability that an event will occur.

Normally, you would ask them. But come on, this is 2009. Are you going to leave yourself exposed to the slings and arrows of experimental economists? You need to give your participants an incentive to tell you what they really believe, right?

Enter the scoring rule. You pay off the subjects based on the accuracy of the probabilities they state. You do this by observing some outcome (let’s say “rain”): you pay a lot of money to the people who assigned a high probability to rain, and you pay a little money to (or even impose a fine upon) those who assigned a low probability to it. A so-called “proper” scoring rule is one under which people do the best for themselves, in expectation, by stating what they truly believe to be the case.
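To see what “proper” means in practice, here is a minimal sketch for the rain example. It borrows the Quadratic rule defined later in this post and assumes a participant whose true belief in rain is 0.7; the function name, the belief, and the grid of candidate reports are illustrative choices, not part of the original experiment.

```python
def expected_quadratic(p, q):
    """Expected quadratic payout for reporting q when the true belief in rain is p."""
    sum_sq = q * q + (1 - q) * (1 - q)      # r*r for the report (q, 1 - q)
    payout_if_rain = 2 * q - sum_sq         # payout when it rains
    payout_if_dry = 2 * (1 - q) - sum_sq    # payout when it stays dry
    return p * payout_if_rain + (1 - p) * payout_if_dry

true_belief = 0.7                            # assumed belief, for illustration only
reports = [k / 100 for k in range(101)]      # candidate reports 0.00, 0.01, ..., 1.00
best_report = max(reports, key=lambda q: expected_quadratic(true_belief, q))
print(best_report)  # 0.7
```

The report that maximizes the expected payout is the true belief itself, which is exactly the property that makes the rule proper.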

Three popular proper scoring rules are the Spherical, Quadratic, and Logarithmic. Let’s see how they work.

Suppose in your experimental task you give people the title of a movie, and they have to guess what year the movie was released.  You tell them at the outset that the movie was released between 1980 and 1999: that’s 20 years. So you have these 20 categories (years) and you want people to assign a probability to each year. Afterwards, you will pay them out based on the actual year the movie was released and the probability they assigned to that year.

Let r be the vector of 20 probabilities: r_1 is the probability they assign to 1980 being the year of release, r_2 the probability that it was 1981, and so on through r_20, the probability for 1999. Naturally, all the r’s add up to one, as probabilities like to do. Now, let r_i be the probability they assign to the year which turns out to be correct.

Under the Spherical scoring rule, their payout would be r_i / (r*r)^.5, where r*r is the dot product of r with itself (the sum of the squared probabilities)

Under the Quadratic scoring rule, the payout would be 2*r_i - r*r

Under the Logarithmic scoring rule, the payout would be ln(r_i)
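The three formulas above can be sketched in a few lines of Python. The function names and the uniform example report are my own illustrative choices; r is a list of probabilities and i is the (zero-based) index of the year that turned out to be correct.

```python
import math

def spherical_score(r, i):
    """Spherical rule: r_i divided by the square root of r*r."""
    return r[i] / math.sqrt(sum(p * p for p in r))

def quadratic_score(r, i):
    """Quadratic rule: 2*r_i minus r*r (the sum of squared probabilities)."""
    return 2 * r[i] - sum(p * p for p in r)

def logarithmic_score(r, i):
    """Logarithmic rule: ln(r_i), taken as negative infinity when r_i is 0."""
    return math.log(r[i]) if r[i] > 0 else float("-inf")

# Hypothetical report: no idea at all, so 1/20 on each year from 1980 to 1999.
uniform = [1 / 20] * 20
i = 5  # index of the correct year (1985 in this made-up case)

print(round(spherical_score(uniform, i), 3))    # 0.224
print(round(quadratic_score(uniform, i), 3))    # 0.05
print(round(logarithmic_score(uniform, i), 3))  # -2.996
```

With a uniform report the payout is the same whichever year is correct, which gives a handy baseline for comparing more opinionated reports.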

In the animation that accompanies this post, the top row shows various sets of probabilities someone might assign to the 20 years. (Imagine the categories along the x-axis are the years 1980 to 1999.) Each bar in the graphs in the bottom three rows shows the person’s payout if that year turns out to be correct, based on the probabilities assigned to each year in the top row.

As you can see, when they assign a high probability to a category and it turns out to be correct, their payout is high. When they assign a low probability to a category and it turns out to be correct, their payout is low.

You’ll notice that the Logarithmic scoring rule goes right off the bottom of the page. This is because the logs of small probabilities are negative numbers far beneath zero, and the log of 0 is negative infinity!
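A quick sketch shows how fast the log score falls as the stated probability shrinks. The helper name and the list of probabilities are illustrative. Note that Python's math.log raises a ValueError at 0 rather than returning negative infinity, so the zero case is handled explicitly here.

```python
import math

def log_score(p):
    """Natural log of p, with probability 0 mapped to negative infinity.

    math.log(0) raises ValueError in Python, so zero gets its own branch.
    """
    return math.log(p) if p > 0 else float("-inf")

# Illustrative probabilities assigned to the category that turned out correct.
for p in [0.5, 0.1, 0.01, 0.001, 0.0]:
    print(f"p = {p:<6} score = {log_score(p):.3f}")
```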

While I was at Stanford I heard that decision scientist extraordinaire Ron Howard (no relation) used to make students assign probabilities to the alternatives (A, B, C or D) on the multiple choice items on the final exam. The score for each question was the log of the probability they assigned to the correct answer. This means, of course, that if you assign a probability of 0 to alternative “B” and alternative “B” turns out to be correct, your score on that question is negative infinity. I always wondered if you got a negative infinity on one question if it meant you got negative infinity on the exam, or if there was some mercy clause.

But the main reason I am writing this post is that I wonder what experimental economists and psychologists are supposed to do when implementing log scoring rules in the lab. Naturally, you can endow the participant with cash at the beginning of the experiment and have them draw down with each question, but what do you do if they score a negative infinity? Take their life savings?

Winkler (1971) decided that he would treat probabilities less than .001 as .001 when it came time to impose the penalty. Does anyone know of other methods?
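As a rough sketch of that floor, the code below treats any probability under .001 as .001 before taking the log, so the worst possible score on a single question is ln(.001), about -6.9, instead of negative infinity. The function name and the four-question exam data are hypothetical; applying the floor to an exam total is my illustration, not something taken from Winkler's paper.

```python
import math

FLOOR = 0.001  # the cutoff attributed to Winkler (1971) in this post

def floored_log_score(p, floor=FLOOR):
    """Log score with any probability below the floor treated as the floor."""
    return math.log(max(p, floor))

# Hypothetical exam: the probability one student put on the correct answer
# for each of four questions, including one flat-out zero.
probs_on_correct = [0.9, 0.6, 0.0, 0.25]

total = sum(floored_log_score(p) for p in probs_on_correct)
print(round(total, 3))  # finite, even though one answer got probability 0
```

The zero still hurts a great deal, but it no longer wipes out everything else the student got right.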


Robert L. Winkler (1971). Probabilistic Prediction: Some Experimental Results. Journal of the American Statistical Association, Vol. 66, No. 336, pp. 675-685.


To make this simulation, I’ve drawn on the top row various beta distributions of differing modes between two fixed endpoints. This is akin to having a min and a max guess for the year of release, then entertaining various years between those two endpoints as most likely.
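One way to build such a set of probabilities is sketched below. This is not the code behind the original simulation: the parameterization by mode and "concentration" (the sum of the two beta shape parameters), the use of bin midpoints, and the mapping of the interval [0, 1] onto all 20 years are assumptions made for illustration.

```python
def beta_category_probs(mode, concentration=10.0, n_categories=20):
    """Discretize a beta density on [0, 1] into n_categories probabilities.

    mode and concentration (a + b) are illustrative knobs, not the author's
    settings. Needs 0 < mode < 1 and concentration > 2 for a single peak.
    """
    a = mode * (concentration - 2) + 1
    b = (1 - mode) * (concentration - 2) + 1
    # Evaluate the unnormalized density at the midpoint of each category.
    midpoints = [(j + 0.5) / n_categories for j in range(n_categories)]
    weights = [x ** (a - 1) * (1 - x) ** (b - 1) for x in midpoints]
    total = sum(weights)
    return [w / total for w in weights]  # normalized so the r's add up to one

probs = beta_category_probs(mode=0.525)
years = range(1980, 2000)
most_likely_year = max(zip(probs, years))[1]
print(most_likely_year)
```

Sweeping the mode from near 0 to near 1 while holding the endpoints fixed gives a family of reports like the ones described above.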
