by James P. Peruvankal
Kaggle just announced a competition to predict which shoppers will become repeat buyers. To aid with algorithmic development, they have provided complete, basket-level, pre-offer shopping history for a large set of shoppers who were targeted for an acquisition campaign. Files containing the incentives offered to each shopper as well as their post-incentive behavior are also provided.
This challenge provides almost 350 million rows of completely anonymised transactional data from over 300,000 shoppers. It is one of the largest problems run on Kaggle to date. Once unzipped, the data take up 22GB, more than fits into the memory of a typical laptop.
If you like this sort of thing, a first look at the data ought to capture your interest. The following plot shows the number of repeat trips to the store on the y-axis against the offer value in dollars on the x-axis. The data are shaded by market, a geographical area.
To get your own first look at the data, and maybe try out a few of the fast Parallel External Memory Algorithms included in Revolution R Enterprise, you might find it helpful to take advantage of Revolution Analytics' offer to try out Revolution R Enterprise in the AWS cloud. (If you spin up a Linux box in AWS, you can go up to 64GB of RAM.)
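For readers who have not met the external-memory idea before, here is a minimal sketch of it in plain Python. It is not how Revolution R Enterprise implements its algorithms; it only illustrates the general principle of streaming through a file one row at a time and keeping a small running summary instead of loading everything into RAM. The column names `id` and `purchaseamount`, the function name, and the sample rows are all assumptions made up for illustration.

```python
import csv
import io
from collections import Counter


def count_rows_per_shopper(lines, id_column="id"):
    """Stream CSV rows and count how many belong to each shopper.

    Only the current row and the running counts are held in memory,
    so memory use grows with the number of distinct shoppers rather
    than with the number of rows. The column name is an assumption.
    """
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row[id_column]] += 1
    return counts


# Tiny made-up sample standing in for the real transaction file.
sample = io.StringIO(
    "id,purchaseamount\n"
    "1001,2.50\n"
    "1002,7.25\n"
    "1001,4.00\n"
)
counts = count_rows_per_shopper(sample)
print(counts.most_common(1))
```

The same function accepts an open file handle in place of the in-memory sample, so a multi-gigabyte file can be summarised with a few hundred thousand counters rather than hundreds of millions of rows in memory.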
This contest is representative of the challenge of coping with the exponential growth of data in real-world projects. I am sure we will see more problems of this kind.
In addition to trying Revolution R Enterprise in the cloud, active Kaggle competitors can download the full-featured Revolution R Enterprise software and use it for free to create their own submissions.
Some of us Revolutionaries are jumping into the fray. See you at the competition!