The 100% open source, community-driven innovation of Apache Hive 2.0 and LLAP (Live Long and Process) truly brings agile analytics to the next level. It lets customers perform sub-second interactive queries without the need for additional SQL-based analytical tools, enabling rapid analytical iterations and shortening time-to-value.
TRY HIVE LLAP TODAY
2. If you’re looking for a quick test on a single node, try the Hortonworks Sandbox 2.5. Download the Sandbox, and this LLAP tutorial will have you up and running in minutes. Note: you’ll need a system with at least 16 GB of RAM for this approach.
Last week we discussed Apache Hive’s shift to a memory-centric architecture and showed how this new architecture delivers dramatic performance improvements, especially for interactive SQL workloads. Today we’ll compare these results with Apache Impala (Incubating), another SQL-on-Hadoop engine, using the same hardware and data scale.
Before we get to the numbers, an overview of the test environment, query set and data is in order. The Impala and Hive numbers were produced on the same cluster of 10 d2.8xlarge EC2 VMs. To prepare the Impala environment, the nodes were re-imaged and re-installed with Cloudera’s CDH version 5.8 using Cloudera Manager. The defaults from Cloudera Manager were used to set up and configure Impala 2.6.0. It is worth pointing out that Impala’s Runtime Filtering feature was enabled for all queries in this test.
Data: While Hive works best with ORCFile, Impala works best with Parquet, so Impala testing was done with all data in Parquet format, compressed with Snappy compression. Data was partitioned the same way for both systems, along the date_sk columns. This was done to benefit from Impala’s Runtime Filtering and from Hive’s Dynamic Partition Pruning.
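To illustrate why partitioning along date_sk matters, here is a minimal Python sketch of the idea shared by dynamic partition pruning and runtime filtering: the join keys that survive a filter on the date dimension are collected at run time and used to skip fact-table partitions. The names `date_dim`, `fact_partitions` and `prune_partitions`, and all values, are hypothetical. This is a conceptual model only, not how either engine is implemented internally.

```python
# Conceptual sketch only: hypothetical data, not Hive or Impala internals.

# A tiny date dimension: date_sk -> calendar year.
date_dim = {1: 2001, 2: 2001, 3: 2002, 4: 2002}

# A fact table stored as one "partition" (list of rows) per date_sk value.
fact_partitions = {
    1: [{"item": "a", "amount": 10}],
    2: [{"item": "b", "amount": 20}],
    3: [{"item": "c", "amount": 30}],
    4: [{"item": "d", "amount": 40}],
}


def prune_partitions(dim, partitions, predicate):
    """Return only the partitions whose key survives the dimension filter."""
    # Step 1: evaluate the filter on the small dimension table.
    surviving_keys = {key for key, value in dim.items() if predicate(value)}
    # Step 2: keep only the fact partitions with a surviving key.
    return {key: rows for key, rows in partitions.items() if key in surviving_keys}


# Filter on year = 2002, then aggregate over the partitions that remain.
scanned = prune_partitions(date_dim, fact_partitions, lambda year: year == 2002)
total = sum(row["amount"] for rows in scanned.values() for row in rows)
print(sorted(scanned), total)
```

In this toy case only two of the four partitions are read, which is the effect both engines aim for when a query filters on the date dimension.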
Queries: After this setup and data load, we attempted to run the same query set used in our previous blog (the full queries are linked in the Queries section below). Impala was designed to be highly compatible with Hive, but since perfect SQL parity is never possible, 5 queries did not run in Impala due to syntax errors. For example, one query failed to compile due to missing rollup support within Impala. It may have been possible to find Impala-specific workarounds for these gaps, but no attempt was made to do so, since the results could not then be directly compared. Here we will only draw comparisons between the queries that ran on both engines with identical syntax.
Timings: For both systems, all timings were measured from query submission to receipt of the last row on the client side.
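As a rough illustration of this measurement, the sketch below times a query from submission until the client has drained the last row. It is not the harness used for these tests: `time_to_last_row`, `fake_submit` and `fake_fetch` are hypothetical names, and the fake functions stand in for a real client driver so the example runs without a cluster.

```python
import time


def time_to_last_row(submit, fetch_rows):
    """Return (elapsed_seconds, row_count) from submission to the last row."""
    start = time.perf_counter()
    handle = submit()                 # query submission
    row_count = 0
    for _ in fetch_rows(handle):      # drain every row on the client side
        row_count += 1
    elapsed = time.perf_counter() - start
    return elapsed, row_count


# Stand-in "engine" (an assumption for illustration): returns 1000 rows.
def fake_submit():
    return range(1000)


def fake_fetch(handle):
    yield from handle


elapsed, rows = time_to_last_row(fake_submit, fake_fetch)
print(f"{rows} rows in {elapsed:.6f} s")
```

Stopping the clock only after the final row arrives means result transfer time is included, not just the time to the first row.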
This bar chart shows the runtime comparison between the two engines:
One thing that quickly stands out is that some Impala queries ran to timeout (30 minutes), including 4 queries that required less than 1 minute with Hive. This makes a direct comparison a bit challenging.
A more helpful way of comparing the engines is to examine how many of the queries complete within given time bands. The chart below shows the cumulative number of queries that complete within the given time. The x axis in this chart moves in discrete 30 second intervals.
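The cumulative view can be computed along these lines. This is a minimal sketch, assuming 30-second bands and the 30-minute timeout mentioned above; the function name `cumulative_completions` and the runtimes in `example` are made up for illustration and are not benchmark results.

```python
import math


def cumulative_completions(runtimes_s, band_s=30, timeout_s=1800):
    """Count queries finished by the end of each band, up to the timeout."""
    n_bands = timeout_s // band_s
    counts = [0] * n_bands
    for runtime in runtimes_s:
        if runtime is None or runtime >= timeout_s:
            continue  # timed out: never counted as complete
        # A query finishing at exactly 30 s falls in the first band.
        band = max(math.ceil(runtime / band_s) - 1, 0)
        counts[band] += 1
    # A running total turns per-band counts into a cumulative curve.
    total = 0
    cumulative = []
    for count in counts:
        total += count
        cumulative.append(total)
    return cumulative


# Hypothetical runtimes in seconds (None = timed out); not benchmark data.
example = [4.2, 12.0, 30.0, 31.5, 58.9, 95.0, None]
curve = cumulative_completions(example)
print(curve[:4])  # completions by 30 s, 60 s, 90 s and 120 s
```

Because timed-out queries never enter the count, an engine's curve flattens below the total number of queries, which makes timeouts visible without distorting the scale.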
The first thing we see is that Impala has an advantage on queries that run in less than 30 seconds: 22 queries completed in Impala within 30 seconds, compared to 20 for Hive. The positions change as query times get a bit longer. By the time we reach one minute, Hive has completed 32 queries compared to Impala’s 26, and the relative position does not switch again. This shows that Impala performs well with less complex queries but struggles as query complexity increases. Hive, on the other hand, with the introduction of LLAP, gets good performance at the low end while retaining its ability to perform well at mid to high query complexity.
Since some of the runtimes can be hard to see, a full table of runtimes is included toward the end.
As more Hadoop workloads become interactive and user-facing, teams face the unpleasant prospect of running one SQL engine just for interactive queries while using Hive for everything else. This adds a lot of cost and complexity to Hadoop because it means separate, specialized teams to tune, troubleshoot and operate two very different SQL systems.
Hive LLAP fundamentally changes this landscape by bringing Hive’s interactive performance in line with SQL engines that are custom-built to only solve interactive SQL. With Hive LLAP you can solve SQL at Speed and at Scale from the same engine, greatly simplifying your Hadoop analytics architecture.
|Hive (HDP 2.5)|Impala (CDH 5.8)|
|---|---|
|ORCFile format with zlib compression|Parquet format with snappy compression|
|All queries run through LLAP|Runtime Filtering Optimization Enabled|
For the most part, OS defaults were used with 1 exception:
|Hive (HDP 2.5, ORCFile / Zlib, 10 TB)|Apache Impala (Incubating) (CDH 5.8, Parquet/Snappy, 10 TB)|
Trying Hive LLAP is simple in the cloud or on your laptop.
It’s easy to take a test drive, so we encourage you to start today and share your experiences with us on the Hortonworks Community Connection.