How to Produce Fake Data Analysis in R: 3 Easy Steps

April 2, 2010

(This article was first published on Zero Intelligence Agents » R, and kindly contributed to R-bloggers)

Did you really think that a team of researchers spent their weekends counting the number of shirtless adolescent men and exposed penises they could find on chatroulette.com? Perhaps you should not answer that, as it may be a better measure of your opinion of sociologists than of your gullibility. It is true, sociologists do say the darndest things, but c’mon, some of my best friends are sociologists!

The truth is, yesterday’s post was an April Fools joke, and one that I thought was fairly obvious (who’s that guy in the bottom panel of the chat roulette window?). However, given the level of traffic, comments, and chatter on Twitter (even by some prolific Tweeters), it seems that many people were seduced by what seemed to be legitimate data analysis. In fact, the analysis was real, albeit rather light on detail; what was fake were the data. In a world where data manipulation in scientific endeavors can rise to the level of international scandal, and data analytics are more frequently being used as a means to promote various political agendas, it is important to understand just how easy the process of generating fake data is.

Below I describe this process in three easy steps, using the process of generating fake time-series data from chat roulette as an example.

First, a disclaimer: I do not endorse actually producing fake data analysis. This is to be used either for your own April Fools proclivities, or perhaps as a way to help you recognize real scientific shenanigans.

1. Pick an appropriate generator for your data

While it is a bit of an existential quandary, when producing fake data analysis you need to generate “good” random values for your data. Whatever phenomenon you are alleging to analyze, people will not be convinced if the values do not match their preconceived biases about that process. In the case of survival times to seeing various events on chat roulette, my assumption (after toying around on the service a bit) was that seeing lonely men and penises was highly probable; therefore, I needed to generate random time values with relatively low means.

Fortunately, R provides random number generators for nearly every distribution, thus making it trivial to generate data from any number of functional forms. For the purposes of creating random time values with low means I chose the chi-square distribution. Through rounding, the continuous random values of the chi-square can be converted into discrete times, and by adjusting the k parameter (the degrees of freedom, which is also the mean of the distribution) we can get mean values that seem to reasonably approximate my assumptions. To test, simply generate a large sample of random values and plot:

[Figure: cs_test.png, plot of the chi-square test sample]
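
The code behind that test plot is not reproduced here, so the following is only a minimal sketch of what such a check might look like. The sample size, the seed, and the choice to round up with ceiling() (so that no time is zero) are assumptions for illustration, not details from the original analysis.

```r
# Sketch: draw a large chi-square sample, convert it to discrete times, and plot.
set.seed(42)                     # arbitrary seed, only for reproducibility
n <- 10000                       # assumed size of the test sample
k <- 2                           # degrees of freedom; the chi-square mean equals k

raw_times <- rchisq(n, df = k)   # continuous random values
times <- ceiling(raw_times)      # assumed rounding rule: round up to whole time units

mean(raw_times)                  # sample mean of the continuous draws, near k

# One histogram bar per whole time unit
hist(times,
     breaks = seq(0.5, max(times) + 0.5, by = 1),
     main = "Rounded chi-square draws (k = 2)",
     xlab = "Time", ylab = "Frequency")
```

Changing k and rerunning shows how the bulk of the distribution shifts, which is the whole point of the test: keep adjusting until the values look like what readers expect.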

For this analysis, I used k=1 for the time to seeing a lonely man, and k=2 for the time to seeing a penis. On the other hand, I assumed that the time to seeing drunk people and a woman would be uniformly distributed over different intervals. I believed it was reasonable to see a group of drunk people sometime before your first 20 minutes on chat roulette, while women were much rarer; you would be very lucky to see one even after your first hour.

2. Create the data

This is trivial. We have assumed functional forms, so now all we have to do is turn them into an R data frame. As I was faking a survival analysis, I had to create additional data specific to this type of analysis, which simply involved creating a column of 1s to go with the times as observation indicators, along with identification values.
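
As a rough illustration of this step, the sketch below draws times from the four assumed distributions and stacks them into one data frame. The values k=1, k=2, and the 20-minute window for drunk people come from the text above; the number of rows, the column names, the rounding rule, and the interval used for women are hypothetical choices made only for this example.

```r
# Sketch: turn the assumed functional forms into a survival-style data frame.
set.seed(1)   # arbitrary seed, only for reproducibility
n <- 100      # hypothetical number of fake observations per event type

fake_times <- list(
  lonely_man = ceiling(rchisq(n, df = 1)),               # k = 1
  penis      = ceiling(rchisq(n, df = 2)),               # k = 2
  drunk      = ceiling(runif(n, min = 0, max = 20)),     # within the first 20 minutes
  woman      = ceiling(runif(n, min = 20, max = 120))    # assumed interval, not from the post
)

fake_data <- data.frame(
  id       = seq_len(n * length(fake_times)),            # identification values
  event    = rep(names(fake_times), each = n),           # which event the time refers to
  time     = unlist(fake_times, use.names = FALSE),      # discrete survival times
  observed = 1                                           # every event is marked as observed
)

head(fake_data)
```

Because every row has observed equal to 1, nothing is censored, which keeps the fake survival analysis about as simple as it can be.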

3. Create a convincing visualization of your analysis and provide the data

Quality visualization is critical to real analysis, so it follows that it is equally important in fake analysis. People read titles and axis labels, so be sure to make them very descriptive.

Finally, it is crucial that you also provide the data with the analysis. While most people will not actually bother to download the data, the fact that it is available makes the whole thing seem more legitimate. Output a CSV file, upload it, and you are all set.
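
The plotting code from the original post is not shown here, so what follows is a base R sketch rather than the author's method. It relies on the fact that when every event is observed (no censoring), the survival curve is simply one minus the empirical CDF. The data frame, labels, and file name are hypothetical.

```r
# Sketch: plot labeled survival curves and write the data to a CSV file.
set.seed(7)   # arbitrary seed, only for reproducibility
n <- 200      # hypothetical number of fake observations per event type
fake_data <- data.frame(
  event    = rep(c("lonely_man", "penis"), each = n),
  time     = c(ceiling(rchisq(n, df = 1)), ceiling(rchisq(n, df = 2))),
  observed = 1
)

# Proportion of observations still waiting for the event after t time units
surv_curve <- function(times) {
  t_grid <- 0:max(times)
  data.frame(t = t_grid, surviving = 1 - ecdf(times)(t_grid))
}
curves <- lapply(split(fake_data$time, fake_data$event), surv_curve)

# Empty canvas with descriptive title and axis labels, then one step line per event
plot(NULL, xlim = c(0, max(fake_data$time)), ylim = c(0, 1),
     main = "Time until first sighting, by event type",
     xlab = "Minutes since session start",
     ylab = "Proportion of sessions still waiting")
for (i in seq_along(curves)) {
  lines(curves[[i]]$t, curves[[i]]$surviving, type = "s", col = i, lwd = 2)
}
legend("topright", legend = names(curves), col = seq_along(curves), lwd = 2)

# Write the data out so it can be shared alongside the plot
out_file <- file.path(tempdir(), "fake_data.csv")   # hypothetical file name
write.csv(fake_data, out_file, row.names = FALSE)
```

The file is written to a temporary directory here only so the example leaves nothing behind; any path would do.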

Congratulations, you have now produced fake data analysis in three easy steps. Now, do not ever actually do this, but recognize how easy it is.
