Big Data ETL and Big Data Analysis


I was at Strata New York 2012 last month. Great conference! Thanks to O’Reilly Media for assembling the industry leaders and running it well.

I understand it was too crowded for some of my out-of-town friends. Stepping out to the streets of mid-town Manhattan for a breath of fresh air and calmness wasn’t an option either. Maybe O’Reilly can get a bigger space next year?

My primary interest was structured Big Data analysis, i.e. crunching, munging (ETL), and analyzing large datasets organized in columns and rows.

My team regularly deals with 1-2 terabytes (~1 billion rows) of structured data (e.g. sales transaction data) for marketing/retail/healthcare analytics. Like others, we’re spending a lot of time on Big Data ETL processes and less on Big Data analysis. Someone at Strata New York captured this well:
"80% of a Big Data development effort goes into data integration, and only 20% of our effort/time is spent on analysis, i.e. the interesting things we actually want to work on."
I want to flip this equation. I want to be able to spend 20% of our time/effort on Big Data ETL/integration efforts and 80% on Big Data analysis.

At Strata, I wanted to check if vendors and open source communities had simplified the Hadoop stack for my use. There have been many improvements in the past year, and the list of products I want to evaluate is quite long. We have more players in the Big Data space, and the solution space is muddy. It is great to have more vendors, experts, and communities focusing on Big Data, but the product space is crowded, fragmented, and CONFUSING (just listing all the Apache products discussed at Strata would fill a full page).

I created a list of products to try out. I wish these products were easier to evaluate: the legal paperwork, the infrastructure footprint, and the effort to set them up and run my use cases all add friction.

As far as ease of use and a powerful data science workbench are concerned, I want to use something like R or even Excel for these steps (Big Data ETL and Big Data analysis), but both are memory-constrained. So I need other options.

Did someone say Hadoop? Yup, on it. I tried it a few years ago, and we’re exploring it now. Hadoop/MapReduce is THE infrastructure to power Big Data ETL and Big Data analysis.

I also believe that MapReduce and databases are complementary technologies, and the experts agree! See "MapReduce and Parallel DBMSs: Friends or Foes?"

Here’s how I’ve framed the problem and how I’m thinking about the solution space.



Big Data ETL process

Take a big structured dataset (multiple CSV files totaling 100M-1B rows) and create DDL, clean, transform, split, sample, and separate error rows, all in minutes.

Solution options:
  1. Unix scripts (shell, awk, perl). Do all of the above in one pass (i.e. read each row only once) quickly. Start with Unix parallel processes, and scale to multiple machines (MapReduce-style) only if needed.
  2. Big Data ETL tools like Kafka?
  3. Open source ETL tools (e.g. Talend)
  4. Can commercial ETL tools do this in a few minutes/hours?
  5. Others?
Given any structured data from a client (CSV), our Big Data ETL workbench takes the data and processes it super fast: detect data types, clean, transform (e.g. change date formats to our internal standards), separate error rows, create a sample, and split into multiple clean files (a minimal one-pass sketch is below).

Raw data files have different schemas, which we auto-detect during processing (only string vs. numeric types to begin with).
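Here is a minimal sketch of that one-pass Unix approach (option 1 above). It assumes a comma-separated file named raw_data.csv with a header row, no quoted commas inside fields, and a date in column 3; the file name, column positions, and output names are all hypothetical. In a single pass it tracks which columns are numeric vs. string, normalizes the assumed date column, routes malformed rows to an error file, writes a ~1% sample, and emits an inferred schema for DDL generation.

```sh
#!/bin/sh
# One-pass ETL sketch (hypothetical file/column names): detect string vs. numeric
# columns, normalize dates, separate error rows, and sample, reading each row once.
awk -F',' '
BEGIN { OFS = ","; srand() }
NR == 1 { ncols = NF; header = $0; next }        # remember header and expected width
NF != ncols { print > "errors.csv"; next }       # separate malformed rows
{
  for (i = 1; i <= NF; i++)                      # a column stays numeric only if every value is numeric
    if ($i !~ /^-?[0-9]+(\.[0-9]+)?$/) is_string[i] = 1
  # assumed date column 3: MM/DD/YYYY -> YYYY-MM-DD
  if ($3 ~ /^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9]$/)
    $3 = substr($3,7,4) "-" substr($3,1,2) "-" substr($3,4,2)
  print > "clean.csv"                            # full clean file
  if (rand() < 0.01) print > "sample.csv"        # ~1% random sample
}
END {                                            # inferred column types for DDL generation
  n = split(header, cols, ",")
  for (i = 1; i <= n; i++)
    printf "%s\t%s\n", cols[i], ((i in is_string) ? "string" : "numeric") > "schema.tsv"
}' raw_data.csv
```

With GNU parallel or plain background jobs, the same script can run over many CSV files at once; only if that stops being enough would we move the same logic to a MapReduce job.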

Big Data Analysis

Then we load this data for analysis:
  1. RDBMS for well-defined arithmetic/set-based analysis
  2. NoSQL database (Lucene/Solr with a Blacklight front-end for discovery). Blacklight is an open source discovery app built on Lucene/Solr; I'm thinking of it as a discovery app for Big Data analysis: facets on top of structured data, slicing and dicing large structured datasets. We can add visualizations (e.g. summaries) later. Check out this Strata session on Lucene-powered Big Data analysis, which confirmed this design hypothesis. (A loading sketch follows this list.)
  3. The clean split files can also be used in any stats tool for statistical analysis, e.g. SAS for larger datasets and R for smaller ones (often the clean split files are small enough for R)
  4. New Big Data tools like Impala
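
As a sketch of the loading step for options 1 and 2 (the database, table name, Solr URL, and schema are assumptions, not our final design), a clean split file can be bulk-loaded into PostgreSQL with its native CSV COPY and posted to Solr through its CSV update handler; Blacklight then points at that Solr index for faceted discovery.

```sh
#!/bin/sh
# Sketch of loading a clean split file from the ETL step.
# Database, table, and Solr URL are hypothetical; the Solr schema (or schemaless
# mode) must already define the CSV columns.

# Option 1 (RDBMS): bulk-load into PostgreSQL using its native CSV COPY
psql -d analytics \
     -c "\copy sales_transactions FROM 'clean.csv' WITH (FORMAT csv, HEADER true)"

# Option 2 (Lucene/Solr): post the same CSV to Solr's CSV update handler;
# Blacklight provides the faceted discovery UI on top of this index
curl 'http://localhost:8983/solr/update/csv?commit=true' \
     --data-binary @clean.csv \
     -H 'Content-Type: text/csv; charset=utf-8'
```

In a multicore Solr setup, the update URL would include the core name (e.g. /solr/<corename>/update/csv).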


We’re building proofs-of-concept for Big Data ETL (Unix scripts) and a Blacklight discovery app on top of Lucene/Solr. I will share them when they’re ready. Stay tuned.
