In April of this year, Hortonworks, along with the broad Hadoop community, delivered the final phase of the Stinger Initiative on schedule, completing the work to bring interactive SQL queries to Apache Hive. The original directive of Stinger was to advance SQL capabilities at petabyte scale in pure open source. Over 13 months, 145 developers from 44 companies delivered exactly that, contributing over 390,000 lines of code to the Hive project alone.
While this community collaboration has had a tremendously positive impact for data workers, business analysts and the many data center tools around Hadoop that rely on Hive for SQL in Hadoop, it was just the beginning.
The Stinger Initiative enabled Hive to support an even broader range of use cases at truly Big Data scale, bringing it beyond its batch roots to support interactive queries, all with a common SQL access layer.
Stinger.next is a continuation of this initiative, focused on further enhancing the speed, scale and breadth of SQL support to enable truly real-time access in Hive while also bringing support for transactional capabilities. Just as with the original Stinger initiative, this will be addressed through a familiar three-phase delivery schedule and developed completely in the open Apache Hive community.
Hive has always been the de facto standard for SQL in Hadoop, and these advances will surely accelerate the production deployment of Hive across a much wider array of scenarios. Explicitly, some of the key deliverables that will enable these new business applications of Hive include:
Hive has been used as a write-once, read-often system, where users add partitions of data and query this data often. ACID is a major shift in this paradigm, adding SQL transactions that allow users to insert, update and delete existing data. This enables a much wider set of use cases that require periodic modifications to existing data. ACID will include BEGIN, COMMIT and ROLLBACK for multi-statement transactions in upcoming releases.
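To make the transactional semantics described above concrete, here is a minimal sketch of insert, update and delete wrapped in BEGIN, COMMIT and ROLLBACK. It is only an illustration: SQLite (through Python's standard `sqlite3` module) stands in for Hive so the example can run anywhere, and the `page_views` table and its columns are made up for this purpose. Hive's own syntax and table requirements for transactions may well differ.

```python
import sqlite3

# Assumption: SQLite is a stand-in here; this is not Hive syntax or a Hive API.
# isolation_level=None puts the connection in autocommit mode, so the
# BEGIN / COMMIT / ROLLBACK statements below are the only transaction control.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE page_views (user_id INTEGER, views INTEGER)")  # hypothetical table

# Transaction 1: insert, update and delete become visible together on COMMIT.
conn.execute("BEGIN")
conn.execute("INSERT INTO page_views VALUES (1, 10), (2, 20), (3, 30)")
conn.execute("UPDATE page_views SET views = views + 5 WHERE user_id = 1")
conn.execute("DELETE FROM page_views WHERE user_id = 3")
conn.execute("COMMIT")

# Transaction 2: a mistaken bulk delete is undone by ROLLBACK.
conn.execute("BEGIN")
conn.execute("DELETE FROM page_views")
conn.execute("ROLLBACK")

rows = conn.execute(
    "SELECT user_id, views FROM page_views ORDER BY user_id"
).fetchall()
print(rows)  # [(1, 15), (2, 20)]
```

The second transaction shows why multi-statement support matters: the rolled-back delete leaves the committed data untouched.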
Sub-second queries require fast query execution and low setup cost. The challenge for Hive is to achieve this without giving up on the scale and flexibility that users depend on. This requires a new approach using a hybrid engine that leverages Tez and something new called LLAP (Live Long and Process, #llap online).
LLAP is an optional daemon process running on multiple nodes that provides the following:
YARN will provide workload management in LLAP by using delegation. Queries will bring information from YARN to LLAP about their authorized resource allocation. LLAP processes will then allocate additional resources to serve the query as instructed by YARN.
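The delegation idea in this paragraph can be pictured with a toy model. The sketch below is purely conceptual and does not reflect any real YARN or LLAP interface: `YarnGrant`, `LlapDaemon` and their fields are hypothetical names invented for illustration. It shows a query carrying its authorized allocation to a long-lived process, which then hands out resources only up to that authorization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class YarnGrant:
    """Hypothetical record of what YARN authorized for one query."""
    query_id: str
    memory_mb: int


class LlapDaemon:
    """Hypothetical long-lived process with a fixed memory pool."""

    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.allocated = {}  # query_id -> MB currently held by that query

    def free_mb(self):
        return self.capacity_mb - sum(self.allocated.values())

    def allocate(self, grant, requested_mb):
        """Give a query memory, capped by its grant and by free capacity."""
        already = self.allocated.get(grant.query_id, 0)
        allowed = max(0, grant.memory_mb - already)
        given = min(requested_mb, allowed, self.free_mb())
        self.allocated[grant.query_id] = already + given
        return given


daemon = LlapDaemon(capacity_mb=4096)
grant = YarnGrant(query_id="q1", memory_mb=1024)  # the query brings this along

first = daemon.allocate(grant, 768)   # fits within the grant
second = daemon.allocate(grant, 768)  # trimmed to what the grant still allows
print(first, second)  # 768 256
```

In this model the daemon never decides how much a query deserves; it only enforces the limit the query arrived with, which is the sense in which workload management stays with YARN.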
The hybrid engine approach provides fast response times through efficient in-memory data caching and low-latency processing, both provided by node-resident processes. However, by limiting LLAP use to the initial phases of query processing, Hive sidesteps the limitations around coordination, workload management and failure isolation that are introduced by running the entire query within this process, as other databases do.
The SQL:2011 Analytics subset will be supported by Hive, with new features being added over multiple iterations, driven by customer demand. Hive is already much further along than other SQL options for Hadoop, with strong SQL support including:
Stinger.next will extend this lead to cover most of the frequently used SQL constructs:
Hive-Spark Machine Learning Integration will also allow Hive users to run machine learning models via Hive. Users want to run predictive analytics and descriptive analytics in Hive, both on the same dataset.
Hive on Spark?
There is a lot of talk about Spark as a powerful engine running on YARN, and we at Hortonworks share that excitement and are working actively to make it enterprise-ready for Spark users. In fact, in order to integrate with Spark, the broad Hive community is making use of several of the infrastructure components already added to Hive as part of the Tez integration, which was delivered in Hive 0.13.
In addition to these primary use cases, some additional enhancements include:
Stinger.next will be delivered at a rapid pace over the next 18 months. Transactions will release in late 2014. Sub-second queries are coming in the first half of 2015, with a preview in the next few months. An initial outline of the delivery is below. We expect this work to be completed as the initial work was, in scope and on schedule.
It is not just Hortonworks that is enthusiastic about this next phase in the delivery of Enterprise SQL at Hadoop Scale. Some of our key partners have weighed in on their excitement as well. Watch this space over the next few days as Microsoft, Informatica, Microstrategy and Tableau all weigh in on this important initiative.
And as always, we are excited to continue our work within the Hive community to extend Hive, the leading SQL on Hadoop solution, further in terms of speed, scale, and SQL semantics.
Hive delivers a message of simplicity. It already provides a single tool for all SQL across batch and interactive workloads, and with Stinger.next it extends to near real time. We’re enthusiastic about the upcoming Stinger.next journey as Hive adds exciting new features toward this goal. Watch this blog for future posts from Apache Hive committers and contributors from around the world, as they share enhancement ideas with the community.