A few weeks ago, we had our first meetup of the Basel Data Scientists (BDS) here in Basel, Switzerland. As it was our first meeting, and I wanted people to get to know each other and have some fun, I decided to have the members play a data science trivia game. I split the group into two teams of five and had each team answer 20 data science trivia questions that I gathered from a mixture of classic brain teasers from statistics and psychology, some statistical history (thank you Wikipedia!), and a few basic probability calculations. I had no idea whether people would be into the game, but I was happy to see that after a few questions (and beers), people were engaged in some (at times heated!) debates over questions like the definition of a p-value and how best to protect an airplane from enemy fire.
As I thought other people might have fun with the game, I am posting the questions here for others to enjoy. As you’ll see, the 20 questions are broken down into four categories: “Fun”, “Statistics”, “History”, and “Terminology”. Once you’ve given the questions a shot, you can find (my) answers at http://ndphillips.github.io/DataScienceTrivia_Answers.html. If you find errors, or have suggestions for better questions, don’t hesitate to write me at [email protected]. Have fun!
Data Science Trivia
Fun

Abraham is tasked with reviewing damaged planes coming back from sorties over Germany in the Second World War. He has to review the damage to the planes to see which areas should receive additional protection. Abraham finds that the fuel systems of returned planes are much more likely to be damaged by bullets than the engines. Which part of the plane should he recommend for additional protection, the fuel systems or the engines?

Paul the ___ was an animal that became famous in 2010 for accurately predicting the outcomes of matches in the 2010 World Cup. What species was Paul?

Amy and Bob have two children, one of whom is female. What is the probability that their other child is female?

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Assuming that you do not want a goat, should you stick with door No. 1 or should you switch to door No. 2?

Imagine the following coin flipping game. The pot starts at $2. I then repeatedly flip a coin, and each time Heads appears, the pot doubles. The first time Tails appears, the game ends and you win whatever is in the pot. Thus, if Tails comes on the first flip, the game is over and you get $2. If the first Tails comes on the second flip, you get $4. Formally, you win \(2^k\) dollars, where k is the number of flips. If you played this game infinitely many times, how much money would you expect to earn on average? How much would you pay me for the opportunity to play this game?
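If you would rather experiment with the coin flipping game than reason about it on paper, here is a minimal simulation sketch in Python. The function names `play_once` and `average_winnings` are made up for this illustration, and the only assumptions are a fair coin and the rules stated in the question. Try running it with different numbers of games and watch how the average behaves before you settle on an answer.

```python
import random


def play_once(rng):
    """Play one round: the pot starts at $2 and doubles on every Heads.

    The first Tails ends the round and the current pot is paid out.
    """
    pot = 2
    while rng.random() < 0.5:  # Heads with probability 0.5 (fair coin)
        pot *= 2
    return pot


def average_winnings(n_games, seed=0):
    """Average payout over n_games independent rounds."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    total = sum(play_once(rng) for _ in range(n_games))
    return total / n_games


if __name__ == "__main__":
    # Compare the average payout for increasingly long runs of the game.
    for n_games in (100, 10_000, 100_000):
        print(n_games, average_winnings(n_games))
```

Changing the `seed` argument gives a different sequence of games, which is worth doing a few times before drawing conclusions from a single run.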
Statistics

How many people do you need in a room for the probability to be greater than .50 that at least two people in the room have the same birthday?

If you flip a fair coin 4 times, what is the probability that it will have at least one head?
Imagine you are a physician presented with the following problem. Betty, a 50-year-old woman with no symptoms, participates in routine mammogram screening. She tests positive and wants to know how likely it is that she actually has breast cancer given her positive test result. You know that about 1% of 50-year-old women have breast cancer. If a woman does have breast cancer, the probability that she tests positive is 90%. If she does not have breast cancer, the probability that she nevertheless tests positive is 9%. Based on this information, how likely is it that Betty actually has breast cancer given her positive test result?
What is the definition of a p-value?
Imagine that I flipped a fair coin 5 times: which of the following two sequences is more likely to occur? A) “H, H, T, H, T”, B) “T, T, T, T, T”
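To check your own answers to the calculation questions in this section, a few small Python helpers along these lines may be useful. This is only a sketch: the function names are invented for this illustration, and the birthday calculation assumes 365 equally likely birthdays (no leap years, no twins). No results are printed here, so nothing gets spoiled unless you run it.

```python
def prob_shared_birthday(n_people, days=365):
    """Probability that at least two of n_people share a birthday.

    Assumes every day of the year is equally likely.
    """
    p_all_distinct = 1.0
    for i in range(n_people):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


def prob_at_least_one_head(n_flips, p_head=0.5):
    """Probability of seeing Heads at least once in n_flips independent flips."""
    return 1.0 - (1.0 - p_head) ** n_flips


def posterior_given_positive(base_rate, hit_rate, false_positive_rate):
    """Bayes' rule: probability of having the condition given a positive test."""
    true_positives = base_rate * hit_rate
    false_positives = (1.0 - base_rate) * false_positive_rate
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Try different room sizes until the probability passes .50.
    print(prob_shared_birthday(20))
    # Four flips of a fair coin.
    print(prob_at_least_one_head(4))
    # Numbers taken from the mammogram question.
    print(posterior_given_positive(0.01, 0.90, 0.09))
```

The sequence question is deliberately left out: any specific sequence of independent flips can be handled by multiplying the per-flip probabilities by hand.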
History
The ___ ___ theorem, one of the most famous in all of statistics, states that, given enough data, the probability distribution of the sample mean will be approximately Normal, regardless of the probability distribution of the raw data (provided that distribution has finite variance).

The mathematician ___ developed the method of least squares in 1809.

In 1907, Francis Galton submitted a paper to Nature in which he reported that when 787 people guessed the weight of an ox at a county fair, the median estimate of the group was off by only 10 pounds. This is one of the most famous examples of the ___ __ ___.

The .05 significance threshold was introduced by ___ in 1925.

Python is a programming language that was created by Guido van Rossum and first released in 1991. Where did the name for Python come from?
Terminology

A machine learning model that is so complex that no one, at times not even its programmers, knows exactly why it works the way it does is called a ___ ___ model.

When a model has very high accuracy in fitting a training dataset, but poor accuracy in predicting a new dataset, then the model has ___ the training data.

In order to computationally estimate probability distributions, especially in Bayesian statistics, MCMC methods are often used. MCMC stands for ___ ___ ___ ___.

What does SPSS stand for?

Regression, decision trees, and random forests are known as ___ learning algorithms, while algorithms such as nearest neighbor and principal component analysis are known as ___ learning algorithms.