Get fresh updates from Hortonworks by email

Once a month, receive latest insights, trends, analytics information and knowledge of Big Data.

行動喚起

始める

クラウド

スタートのご用意はできましたか?

Sandbox をダウンロード

ご質問はありませんか?

クローズクローズボタン
Apache プロジェクト
Apache Hive

Apache Hive

メニュー

概要

The de facto standard for SQL queries in Hadoop

Since its incubation in 2008, Apache Hive is considered the defacto standard for interactive SQL queries over petabytes of data in Hadoop.

With the completion of the Stinger Initiative, and the next phase of Stinger.next, the Apache community has greatly improved Hive’s speed, scale and SQL semantics. Hive easily integrates with other critical data center technologies using a familiar JDBC interface.

What Hive Does

Hadoop was built to organize and store massive amounts of data of all shapes, sizes and formats. Because of Hadoop’s “schema on read” architecture, a Hadoop cluster is a perfect reservoir of heterogeneous data—structured and unstructured—from a multitude of sources.

Data analysts use Hive to query, summarize, explore and analyze that data, then turn it into actionable business insight.

Advantages of using Hive for enterprise SQL in Hadoop

Feature Description
Familiar
    Query data with a SQL-based language
Fast
    Interactive response times, even over huge datasets
Scalable and Extensible
    As data variety and volume grows, more commodity machines can be added, without a corresponding reduction in performance
Compatible
    Works with traditional data integration and data analytics tools.

How Hive Works

Hive on LLAP (Live Long and Process) makes use of persistent query servers with intelligent in-memory caching to avoid Hadoop’s batch-oriented latency and provide as fast as sub-second query response times against smaller data volumes, while Hive on Tez continues to provide excellent batch query performance against petabyte-scale data sets.

The tables in Hive are similar to tables in a relational database, and data units are organized in a taxonomy from larger to more granular units. Databases are comprised of tables, which are made up of partitions. Data can be accessed via a simple query language and Hive supports overwriting or appending data.

Within a particular database, data in the tables is serialized and each table has a corresponding Hadoop Distributed File System (HDFS) directory. Each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory. Data within partitions can be further broken down into buckets.

Hive supports all the common primitive data formats such as BIGINT, BINARY, BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP, and TINYINT. In addition, analysts can combine primitive data types to form complex data types, such as structs, maps and arrays.

Innovation & Focus

The Stinger and Stinger.next initiatives successfully delivered a fundamentally new Apache Hive, which evolved Hive’s traditional architecture and made it much faster, with richer SQL standards compliant semantics and petabyte scalability. We continue to work within the Hive community to advance these four key facets of hive:

Speed
Deliver sub-second query response times
Scale
The only SQL interface to Hadoop designed for queries that scale from Terabytes to Petabytes
SQL
Enable transactions and SQL:2011 Analytics for Hive
OLAP
Provide native interfaces for MDX compliant analytics tools including Excel

 

The focus for the next major Hive release will be:

Focus Planned Enhancements
MDX interface
    Provide native support for the MDX (multi-dimensional expressions) query language, the de-facto query language used by online analytical process (OLAP) tools, include Microsoft Excel.
Complete SQL:2011 Compliance
    Close the remaining gaps to achieve complete SQL:2011 standard SQL compliance, enabling maximum compatibility with SQL-based BI, visualization and machine learning tools.

Recent Hive Releases

HIVE_release_graphic

Apache Hive Version Prior Enhancements
2.1
  • Hive LLAP: Persistent query servers with intelligent in-memory caching.
  • ACID GA: Hardened and proven at scale.
  • Expanded SQL Compliance: More capable integration with BI tools.
  • Performance: Interactive query, 2x faster ETL.
  • Security: Row / Column security extending to views, Column level security for Spark.
  • Operations: LLAP integration in Ambari, new Grafana dashboards.
2.0
  • Speed: HBase to store Hive Metadata
  • Workflow: HPL/SQL – Implementing Procedural SQL in Hive
  • Scale: first version of LLAP, and Hive on Spark
  • SQL: Hive-on-Spark Self Union/Join
1.2
  • Speed: Vectorized Map Join brings up to 5x faster map joins
  • Scale: Hybrid-Hybrid Grace Hash Join allows analytical queries at large scale without complex tuning
  • Scale: Bloom Filter support added to ORCFile
  • SQL: Added support for UNION DISTINCT and Interval Types
0.14
  • Speed: Cost-based optimizer for star and bushy join queries
  • Scale: Temporary tables
  • Scale: Transactions with ACID semantics
0.13
  • Speed: Hive on Tez, vectorized query engine & cost-based optimizer
  • Scale: dynamic partition loads and smaller hash tables
  • SQL: CHAR & DECIMAL datatypes, subqueries for IN / NOT IN
0.12
  • Speed: Vectorized query engine & ORCFile predicate pushdown
  • SQL: Support for VARCHAR and DATE semantics, GROUP BY on structs and unions

What’s new in HDP 2.6

Hive LLAP is now GA

Hive LLAP delivers MPP performance at Hadoop scale, provides ACID transactions and is 100% compatible with existing Hive applications. Apache Ambari delivers a one-click experience for enabling LLAP, making Hive LLAP the easiest SQL on Hadoop solution to get started with.

ACID MERGE

Hive now support ANSI SQL compliant MERGE to complement existing INSERT/UPDATE/DELETE. Managing data in Hadoop is now easier and faster than ever.

More Powerful SQL

Hive is now able to run all 99 TPC-DS queries out-of-the-box with only trivial rewrites required, thanks to the addition of new set operators and multi subquery support. Migrating workloads to Hive has never been easier.

New Hive View

Hive View 2.0 introduces a database manager to let you create, modify and manage Hive tables. In addition, a new visual explain makes it easy to pinpoint and optimize expensive query operations.

More Powerful Tez UI

The new Total Timeline View shows you exactly where a query spends its time, from planning to execution to cleanup, making it easy to pinpoint and fix query bottlenecks.

SQL and HiveQL

Apache Hive is data warehouse infrastructure built on top of Apache Hadoop for providing data summarization, ad-hoc query, and analysis of large datasets. It provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL).

Naturally, there are a bunch of differences between SQL and HiveQL, but on the other hand there are a lot of similarities too, and recent releases of Hive bring that SQL-92 compatibility closer still.

Below is some basic information to get you started. Of course, if you really want to get to grips with Hive, then take a look at the full language manual.

Retrieving Information

Function MySQL Hive
Retrieving Information (General) SELECT from_columns FROM table WHERE conditions; SELECT from_columns FROM table WHERE conditions;
Retrieving All Values SELECT * FROM table; SELECT * FROM table;
Retrieving Some Values SELECT * FROM table WHERE rec_name = "value"; SELECT * FROM table WHERE rec_name = "value";
Retrieving With Multiple Criteria SELECT * FROM TABLE WHERE rec1 = "value1" AND rec2 = "value2"; SELECT * FROM TABLE WHERE rec1 = "value1" AND rec2 = "value2";
Retrieving Specific Columns SELECT column_name FROM table; SELECT column_name FROM table;
Retrieving Unique Output SELECT DISTINCT column_name FROM table; SELECT DISTINCT column_name FROM table;
Sorting SELECT col1, col2 FROM table ORDER BY col2; SELECT col1, col2 FROM table ORDER BY col2;
Sorting Reverse SELECT col1, col2 FROM table ORDER BY col2 DESC; SELECT col1, col2 FROM table ORDER BY col2 DESC;
Counting Rows SELECT COUNT(*) FROM table; SELECT COUNT(*) FROM table;
Grouping With Counting SELECT owner, COUNT(*) FROM table GROUP BY owner; SELECT owner, COUNT(*) FROM table GROUP BY owner;
Maximum Value SELECT MAX(col_name) AS label FROM table; SELECT MAX(col_name) AS label FROM table;
Selecting from multiple tables (Join same table using alias w/”AS”) SELECT pet.name, comment FROM pet, event WHERE pet.name = event.name; SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name)

Metadata

Function MySQL Hive
Selecting a database USE database; USE database;
Listing databases SHOW DATABASES; SHOW DATABASES;
Listing tables in a database SHOW TABLES; SHOW TABLES;
Describing the format of a table DESCRIBE table; DESCRIBE (FORMATTED|EXTENDED) table;
Creating a database CREATE DATABASE db_name; CREATE DATABASE db_name;
Dropping a database DROP DATABASE db_name; DROP DATABASE db_name (CASCADE);

フォーラム

Hive Tutorials

Hive in the Press

Webinars & Presentations