Five Hard-Won Lessons Using Hive

[This article was first published on randyzwitch.com » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’ve been spending a ton of time lately on the data engineering side of ‘data science’, so I’ve been writing a lot of Hive queries. Hive is a great tool for querying large amounts of data, without having to know very much about the underpinnings of Hadoop. Unfortunately, there are a lot of things about Hive (version 0.12 and before) that aren’t quite the same as SQL and have caused me a bunch of frustration; here they are, in no particular order.

1. Set Hive Temp directory To Same As Final Output Directory

When doing a “Create Table As” (CTAS) statement in Hive, Hive allocates temp space for the Map and Reduce portions of the job. If you’re not lucky, the temp space for the job will be somewhere different than where your table actually ends up being saved, resulting in TWO I/O operations instead of just one. This can lead to a painful delay in when your Hive job says it is finished vs. when the table becomes available (one time, I saw a 30 hour delay writing 5TB of data).

If your Hive jobs seem to hang after the Job Tracker says they are complete, try this setting at the beginning of your session:

set hive.optimize.insert.dest.volume=true;

2. Column Aliasing In Group By/Order By

Not sure why this isn’t a default, but if you want to be able to reference your column names by position (i.e. group by 1,2) instead of by name (i.e. group by name, age), then run this at the beginning of your session:

set hive.groupby.orderby.position.alias=true;

3. Be Aware Of Predicate Push-Down Rules

In Hive, you can get great performance gains if you A) partition your table by commonly used columns/business concepts (i.e. Day, State, Market, etc.) and B) you use the partitions in a WHERE clause. These are known as partition-based queries. Otherwise, if you don’t use a partition in your WHERE clause, you will get a full table scan.

Unfortunately, when doing an OUTER JOIN, Hive will sometimes ignore the fact that your WHERE clause is on a partition and do a full table scan anyway. In order to get Hive to push your predicate down and avoid a full table scan, put your predicate on the JOIN instead of the WHERE clause:If you don’t want to think about the different rules, you can generally put your limiting clauses inside your JOIN clause instead of on your WHERE clause. It should just be a matter of preference (until your query performance indicates it isn’t!)

4. Calculate And Append Percentiles Using CROSS JOIN

Suppose you want to calculate the top 10% of your customers by sales. If you try to do the following, Hive will complain about needing a GROUP BY, because percentile_approx() is a summary function:

To get around the the need for a GROUP BY, we can use a CROSS JOIN. A CROSS JOIN is another name for a Cartesian Join, meaning all of the rows from the first table will be joined to ALL of the rows of the second table. Because the subquery only returns one row, the CROSS JOIN provides the desired affect of joining the percentile values back to the original table while keeping the same number of rows from the original table.Generally, you don’t want to do a CROSS JOIN (because relational data generally is joined on a key), but this is a good use case.

5.  Calculating a Histogram

Creating a histogram using Hive should be as simple as calling the histogram_numeric() function. However, the syntax and results of this function are just plain weird. To create a histogram, you can run the following:

The results of this query comes back as a list, which is very un-SQL like! To get the data as a table, we can use LATERAL VIEW and EXPLODE:

However, now that we have a table of data, it’s still not clear how to create a histogram, as the center of variable-width bins is what is returned by Hive. The Hive documentation for histogram_numeric() references Gnuplot, Excel, Mathematica and MATLAB, which I can only assume can deal with plotting the centers?  Eventually I’ll figure out how to deal with this using R or Python, but for now, I just use the table as a quick gauge of what the data looks like.

 

Five Hard-Won Lessons Using Hive is an article from randyzwitch.com, a blog dedicated to helping newcomers to Digital Analytics & Data Science

If you liked this post, please visit randyzwitch.com to read more. Or better yet, tell a friend…the best compliment is to share with others!

To leave a comment for the author, please follow the link and comment on their blog: randyzwitch.com » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)