Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this post, we will solve a simple problem (called “FizzBuzz“) that is asked by some employers in data scientist job interviews. The question seeks to ascertain the applicant’s familiarity with basic programming concepts. We will see 2 different ways to solve the problem in 2 different statistical programming languages: R and Python.

The FizzBuzz Question

I came across the FizzBuzz question in this excellent blog post on conducting data scientist interviews and have seen it referenced elsewhere on the web. The intent of the question is to probe the job applicant’s knowledge of basic programming concepts.

The prompt of the question is as follows:

In pseudo-code or whatever language you would like: write a program that prints the numbers from 1 to 100. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”.

The Solution in R

We will first solve the problem in R. We will make use of control structures: statements “which result in a choice being made as to which of two or more paths to follow.” This is necessary for us to A) iterate over a number sequence and B) print the correct output for each number in the sequence. We will solve this problem in two slightly different ways.

Solution 1: Using a “for loop”

The classical way of solving this problem is via a “for loop“: we use the loop to explicitly iterate through a sequence of numbers. For each number, we use if-else statements to evaluate which output to print to the console.

The code* looks like this:

for (i in 1:100){

if(i%%3 == 0 & i%%5 == 0) {
print('FizzBuzz')
}
else if(i%%3 == 0) {
print('Fizz')
}
else if (i%%5 == 0){
print('Buzz')
}
else {
print(i)
}

}


Our code first explicitly defines the number sequence from 1 to 100. For each number in this sequence, we evaluate a series of if-else statements, and based on the results of these evaluations, we print the appropriate output (as defined in the question prompt).

The double parentheses is a modulo operator, which allows us to determine the remainder following a division. If the remainder is zero (which we test via the double equals signs), the first number is divisible by the second number.

We start with the double evaluation – any other ordering of our if-else statements would cause the program to fail (because numbers which pass the double evaluation by definition also pass the single evaluations). If a given number in the sequence from 1 to 100 is divisible both by 3 and by 5, we print “FizzBuzz” to the console.

We then continue our testing, using “else if” statements. The program checks each one of these statements in turn. If the number fails the double evaluation, we test to see if it is divisible by 3. If this is the case, we print “Fizz” to the console. If the number fails this evaluation, we check to see if it is divisible by 5 (in which case we print “Buzz” to the console). If the number fails each of these evaluations (meaning it is not divisible by 3 or 5), we simply print the number to the console.

Solution 2: Using a function

In general, for loops are discouraged in R. In earlier days, they were much slower than functions, although this seems to be no longer true. The wonderful R programmer Hadley Wickham advocates for a functional approach based on the fact that functions are generally clearer about what they’re trying to accomplish. As he writes: “A for loop conveys that it’s iterating over something, but doesn’t clearly convey a high level goal… [Functions are] tailored for a specific task, so when you recognise the functional you know immediately why it’s being used.” (There are many interesting and strong opinions about using loops vs. functions in R; I won’t get into the details here, but it’s informative to see the various opinions out there.)

The FizzBuzz solution using a function looks like this:

# define the function
fizz_buzz <- function(number_sequence_f){
if(number_sequence_f%%3 == 0 & number_sequence_f%%5 == 0) {
print('FizzBuzz')
}
else if(number_sequence_f%%3 == 0) {
print('Fizz')
}
else if (number_sequence_f%%5 == 0){
print('Buzz')
}
else {
print(number_sequence_f)
}

}

# apply it to the numbers 1 to 100
sapply(seq(from = 1, to = 100, by = 1), fizz_buzz)


The code above defines a function called "fizz_buzz". This function takes an input (which I have defined in the function as number_sequence_f), and then passes it through the logical sequence we defined using the "if" statements above.

We then use sapply to apply the FizzBuzz function to a sequence of numbers, which we define directly in the apply statement itself. (The seq command allows us to define a sequence of numbers, and we explicitly state that we want to start at 1 and end at 100, advancing in increments of 1).

This approach is arguably more elegant. We define a function, which is quite clear as to what its goal is. And then we use a single-line command to apply that function to the sequence of numbers we generate with the seq function.

However, note that both solutions produce the output that was required by the question prompt!

The Solution in Python

We will now see how to write these two types of programs in Python. The logic will be the same as that used above, so I will only comment on the slight differences between how the languages implement the central ideas.

Solution 1: Using a "for loop"

The code to solve the FizzBuzz problem looks like this:

for i in range(1,101):
if (i % 3 == 0) & (i % 5 == 0):
print('FizzBuzz')
elif (i % 3 == 0):
print('Fizz')
elif (i % 5 == 0):
print('Buzz')
else:
print(i)



Note that the structure of definition of the "for loop" is different. Whereas R requires {} symbols around all the code in the loop, and for the code inside of each if-else statement, Python uses the colon and indentations to indicate the different parts of the control structure.

I personally find the Python implementation cleaner and easier to read. The indentations are simpler and make the code less cluttered. I get the sense that Python is a true programming language, whereas R's origins in the statistical community have left it with comparatively more complicated ways to accomplish the same programming goal.

Note also that, if we want to stop at 100, we have to specify that the range in our for loop stops at 101. Python has a different way of defining ranges. The first element in the range indicates where to start, but the last number in the range should be 1 greater than the number you actually want to stop at.

Solution 2: Using a function

Note that we can also use the functional approach to solve this problem in Python. As we did above, we wrap our control flow in a function (called fizz_buzz), and apply the function to the sequence of numbers from 1 to 100.

There are a number of different ways to do this in Python. Below I use pandas, a tremendously useful Python library for data manipulation and munging, to apply our function to the numbers from 1 to 100. Because the pandas apply function cannot be used on arrays (the element returned by the numpy np.arange), I convert the array to a Series and call the apply function on that.

# import the pandas library
import pandas as pd

# define the function
def fizz_buzz(number_sequence_f):
if (number_sequence_f % 3 == 0) & (number_sequence_f % 5 == 0):
return('FizzBuzz')
elif (number_sequence_f % 3 == 0):
return('Fizz')
elif (number_sequence_f % 5 == 0):
return('Buzz')
else:
return(number_sequence_f)

# apply it to the number series
pd.Series(np.arange(1,101)).apply(fizz_buzz)



One final bit of programming details. The for loop defined above calls range, which is actually an iterator, not a series of numbers. Long story short, this is appropriate for the for loop, but as we need to apply our function to an actual series of numbers, we can't use the range command here. Instead, in order to apply our function to a series of numbers, we use the numpy arange command (which returns an array of numbers). As described above, we convert the array to a pandas Series, which is then passed to our function.

Does this make sense as an interview question?

As part of a larger interview, I suppose, this question makes sense. A large part of applied analytical work involves manipulating data programmatically, and so getting some sense of how the candidate can use control structures could be informative.

However, this doesn't really address the biggest challenges I face as a data scientist, which involve case definition and scoping: e.g. how can a data scientist partner with stakeholders to develop an analytical solution that answers a business problem while avoiding unnecessary complexity? The key issues involve A) analytical maturity to know what numerical solutions are likely to give a reasonable solution with the available data, B) business sense to understand the stakeholder problem and translate it into a form that allows a data science solution, and C) a combination of A and B (analytical wisdom and business sense) to identify the use cases for which analytics will have the largest business impact.**

So data science is a complicated affair, and I'm not sure the FizzBuzz question gives an interviewer much sense of the things I see as being important in a job candidate.*** For another interesting opinion on what's important in applied analytics, I can recommend this excellent blog post which details a number of other important but under-discussed aspects of the data science profession.

Conclusion

In this post, we saw how to solve a data science interview question called "FizzBuzz." The question requires the applicant to think through (and potentially code) a solution which involves cycling through a series of numbers, and based on certain conditions, printing a specific outcome.

We solved this problem in two different ways in both R and Python. The solutions allowed us to see different programming approaches (for loops and functions) for implementing control flow logic. The for loops are perhaps the most intuitive approach, but potentially more difficult for others to interpret. The functional approach benefits from clarity of purpose, but requires a slightly higher level of abstract thinking to code. The examples here serve as a good illustration of the differences between R and Python code. R has its origins in the statistical community, and its implementation of control structures seems less straightforward than the comparable implementations in Python.

Finally, we examined the value of the FizzBuzz interview question when assessing qualities in a prospective data science candidate. While the FizzBuzz question definitely provides some information, it misses a lot of what (in my view) comprises the critical skill-set that is needed to deliver data science projects in a business setting.

Coming Up Next

In the next post, we will see how to use R to download data from Fitbit devices, employing a combination of existing R packages and custom API calls.

Stay tuned!

---

* My go-to html code formatter for R has disappeared! If anyone knows of such a resource, it would be hugely helpful in making the R code on this blog more readable. Please let me know in the comments! (The Python code is converted with tohtml).

** And then, of course, once a proper solution has been developed, it has to be deployed. Can you partner with data engineers and/or other IT stakeholders to integrate your results into business processes? (Notice the importance of collaboration - a critical but seldom-discussed aspect of the data science enterprise).

*** In defense of the above-referenced blog post, the original author is quite nuanced and does not only rely on the FizzBuzz question when interviewing candidates!