Apache Hive 2.1 was released about a month ago and it’s a great opportunity to review how Hive 2 is drastically changing the landscape for SQL on Hadoop.
There is so much new in Hive it’s hard to pick highlights, but here are a few:
We’ll explore these other topics in later posts. For now we’ll focus on the most anticipated feature of Hive 2 and the massive performance gains it unlocks.
The biggest story in Hive 2 is also its most anticipated feature: LLAP which stands for ‘Live Long and Process’. To summarize it, LLAP combines persistent query servers and optimized in-memory caching that allows Hive to launch queries instantly and avoids unnecessary disk I/O. Put another way, LLAP is a second-generation big data system: LLAP brings compute to memory (rather than compute to disk), it caches memory intelligently and it shares this data among all clients, while retaining the ability to scale elastically within a cluster.
To measure the improvement LLAP brings we ran 15 queries that were taken from the TPC-DS benchmark, similar to what we have done in the past. The entire process was run using the hive-testbench repository and data generation tools. The queries there are adapted to Hive SQL but are otherwise not modified from the standard TPC-DS queries using any of the tricks that some big data vendors routinely use to show better performance for their tools. This blog only covers 15 queries but a more comprehensive performance test is underway.
The full test environment is explored below but at a high level the tests run using 10 powerful VMs with a 1TB dataset that is intended to show performance at data scales commonly used with BI tools. The same VMs and the same data are used both for Hive 1 and for Hive 2. All reported times represent the average across 3 runs in the respective Hive version.
As you can see, LLAP delivers a dramatic performance gain. Minimum query runtime with Hive LLAP is a mere 1.3 seconds, compared to 9.58 seconds in Hive 1.
Let’s discuss some of the main reasons for these performance gains.
Reason 1: Smarter Map Joins
Hive on Tez is a shared-nothing architecture: each processing unit works independently with its own memory and disk resources. LLAP is a multi-threaded process that allows memory sharing between workers. A map-side join requires a hash table to be distributed 1:1 into each map task. If you have 24 containers on a node you need to make 24 copies of the hash table and distribute it out. With LLAP you build the hash table once per node and cache it in-memory for all workers. This is especially important for low-latency SQL.
A great example of this is Query 55. In TPC-DS, Query 55 touches the smallest amount of data among any query that queries a fact table, just 1 month. To run this query in Hive on Tez the small date_dim and item tables must first be distributed to all Tez tasks. With LLAP this happens once per node, a large part of the reason LLAP’s average execution time is 1.3s, compared to Hive on Tez’s 24.72s.
Reason 2: Better MapJoin vectorization for joins
Many MapJoin optimizations have made their way into Hive 2. As one example, joins against small dimension tables now run as fast as explicitly expanded lists.
A great example of where this helps is Query 43, which has a 37% selectivity in the store dimension join. Better MapJoin vectorization, which takes advantage of repeating sequences in the fact table, helps take Query 43 from 195.2s down to 4.2s.
Reason 3: A Fully Vectorized Pipeline
Hive 2 introduces Map Join vectorization in the reduce side with dynamically partitioned hash joins, essentially a reduce-side version of the MapJoin optimization. With this optimization, reducer inputs are unsorted and streamed through a hash table held on the reduce side. The optimization divides a large dimension table into many small disjoint dimension tables, allowing for the previous dimension table optimizations to scale upwards in size.
A great example here is Query 13 which touches a combination of several very large and several very small dimension tables, so it needs to run as a shuffle join for safety but gets high selectivity from the other dimension filters. This optimization helps take Query 13 from 90.2s down to 4.8s.
Reason 4: A Smarter CBO
The integration with Apache Calcite for sophisticated cost-based optimization continues to deepen and is reaping big rewards. For a few examples, Hive’s CBO can now factor join keys out of deeply nested predicates (avoiding cross joins), infer transitive predicates across joins, and apply basic transformations even when tables have no stats (a big win for ETL jobs).
|Hive 1||Hive 2|
All this software is deployed using Apache Ambari using HDP software that is currently in technical preview. In addition to the default settings from Ambari, some new optimizations are made for Hive 2. These optimizations will be set for default new installs at GA.
For the most part, OS defaults were used with 1 exception:
This screenshot gives a sense of how LLAP was configured to take advantage of the hardware within Ambari. Please note, LLAP configuration in Ambari is evolving at the time of this blog and current installation may look slightly different from this image.
As always, Apache Hive is 100% open source and can be used on the Hadoop distribution of your choice, and that goes for the performance improvements as well as all the other features discussed in the blog.
If you’re looking to try Hive 2.1 for yourself a few options include: