Apache HAWQ

Overview

Apache HAWQ (incubating) provides native SQL on Apache Hadoop, based on an advanced, elastic MPP query engine. HAWQ represents a new generation of high-performance advanced analytics that turns Hadoop into an enterprise analytic database. You can move and analyze entire workloads while simplifying management and broadening data access and analytics, all natively in Hadoop.

What HAWQ Does

HAWQ is an elastic SQL query engine that combines exceptional MPP-based analytics performance and robust ANSI SQL compliance – enabling you to run fast ad hoc queries. Hortonworks HDB powered by Apache HAWQ includes integrated Apache MADlib (incubating) machine learning – enabling SQL-based predictive analytics.

HAWQ and MADlib advantages include:

[Figure: hawq-diagram-1]

Evolved from over a decade’s worth of intellectual property from Pivotal Greenplum™ and open source PostgreSQL, HAWQ operates natively in Hadoop, which simplifies overall system management of cluster resources.

How HAWQ Works

Flow

The flow for setting up, loading, managing and using HAWQ and MADlib is listed below:

[Figure: hawq-diagram-2]

Technical Architecture

The high-level architecture of Apache HAWQ is shown below. In a typical deployment, each slave node hosts a physical HAWQ segment, an HDFS DataNode and a YARN NodeManager. The masters for HAWQ, HDFS and YARN run on separate nodes.

HAWQ is tightly integrated with YARN for query resource management. HAWQ caches containers obtained from YARN in a resource pool and then manages those resources locally, using its own finer-grained resource management for users and groups.

To execute a query, HAWQ allocates a set of virtual segments according to the cost of the query, the resource queue definitions, data locality and the current resource usage in the system. The query is then dispatched to the corresponding physical hosts, which may be only a subset of the nodes in the cluster. The HAWQ resource enforcer on each node monitors and controls the resources the query uses in real time to avoid resource usage violations.
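The allocation step described above can be illustrated with a toy model. The following Python sketch is not HAWQ's actual algorithm: the `Host` fields, the `cost_per_segment` parameter and the round-robin placement are all assumptions made for illustration. It only shows how query cost, a resource queue limit, data locality and current free capacity could combine into a per-host plan.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Host:
    name: str
    free_slots: int    # virtual segments this host can still run (hypothetical unit)
    local_blocks: int  # HDFS blocks of the queried table stored on this host


def plan_virtual_segments(hosts, query_cost, cost_per_segment, queue_limit):
    """Return {host_name: virtual_segment_count} for one query (toy model)."""
    # 1. Size the request from the query cost, capped by the resource queue limit.
    wanted = max(1, ceil(query_cost / cost_per_segment))
    wanted = min(wanted, queue_limit)
    # 2. Never ask for more than the cluster currently has free.
    wanted = min(wanted, sum(h.free_slots for h in hosts))
    # 3. Place segments round-robin, visiting hosts with the most local data first.
    remaining = {h.name: h.free_slots for h in hosts}
    order = sorted(hosts, key=lambda h: h.local_blocks, reverse=True)
    plan = {}
    while wanted > 0:
        for h in order:
            if wanted == 0:
                break
            if remaining[h.name] > 0:
                plan[h.name] = plan.get(h.name, 0) + 1
                remaining[h.name] -= 1
                wanted -= 1
    return plan


hosts = [
    Host("node1", free_slots=2, local_blocks=10),
    Host("node2", free_slots=4, local_blocks=3),
    Host("node3", free_slots=0, local_blocks=8),  # busy: receives no segments
]
print(plan_virtual_segments(hosts, query_cost=500, cost_per_segment=100, queue_limit=6))
```

In this example the busy host is left out of the plan entirely, which mirrors the point that a query may run on only a subset of the cluster.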

[Figure: hawq-diagram-3]

Nodes can be added dynamically without data redistribution, and expansion takes only seconds. When a new node is added, it automatically contacts the HAWQ master, which immediately makes the node's resources available to future queries.
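As a rough illustration of why expansion needs no data movement, the sketch below models a master that only tracks compute capacity, while block placement stays with HDFS. The `ToyMaster` class and its methods are hypothetical names for this example and do not correspond to HAWQ internals.

```python
class ToyMaster:
    """Toy master: tracks free compute slots per host, never moves data."""

    def __init__(self, block_locations):
        # HDFS block -> host mapping; nothing in this class rewrites it.
        self.block_locations = dict(block_locations)
        self.free_slots = {}

    def register_segment(self, host, slots):
        """Called when a new segment host contacts the master."""
        self.free_slots[host] = slots

    def allocate(self, n):
        """Grant up to n virtual segments, each from the host with the most free slots."""
        granted = []
        for _ in range(n):
            host = max(self.free_slots, key=self.free_slots.get, default=None)
            if host is None or self.free_slots[host] == 0:
                break
            self.free_slots[host] -= 1
            granted.append(host)
        return granted


master = ToyMaster({"block-1": "node1", "block-2": "node2"})
master.register_segment("node1", 2)
master.register_segment("node2", 2)
first = master.allocate(6)            # only 4 slots exist, so only 4 are granted

master.register_segment("node3", 4)   # new node joins: a registration, not a data copy
second = master.allocate(2)           # the very next request can use node3
print(len(first), second, master.block_locations)
```

The second allocation lands on the new host right away, and the block map is the same before and after the node joins.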

Hortonworks HDB powered by Apache HAWQ

How HDB Complements Apache Hive

The Hortonworks HDB support subscription combines Apache HAWQ and Apache MADlib, fully supported by Hortonworks and running on the Hortonworks Data Platform (HDP). Apache Hive is the de facto standard for SQL queries over petabytes of data in Hadoop.

Hortonworks HDB complements Hive by adding the following capabilities:

Capability Details
Interactive query performance
  • Query performance in seconds
  • Compatible with any ANSI SQL compliant BI Tool
  • Larger number of concurrent users
MADlib big data Machine Learning in SQL
  • Classification e.g. predict loan default
  • Regression e.g. predict value of a sale
  • Clustering e.g. marketing campaign segmentation, and more.
Data federation using HAWQ Extension Framework
  • SQL queries against other data sources such as JSON files in HDFS
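To make the "machine learning in SQL" capability more concrete, here is a minimal Python sketch that builds a MADlib logistic regression training statement for the loan default example. `madlib.logregr_train` is a MADlib function taking a source table, an output table, a dependent variable and an independent-variable expression; the table and column names (`loans`, `loan_default_model`, `defaulted`, `income`, `debt_ratio`) are hypothetical, and the helper assumes trusted, hard-coded identifiers rather than user input. Running the statement would require a database connection, which is not shown.

```python
def logregr_train_sql(source_table, out_table, dependent, independents):
    """Build a MADlib logistic regression training call as a SQL string."""
    # A leading constant 1 gives the model an intercept term.
    features = "ARRAY[1, " + ", ".join(independents) + "]"
    return (
        "SELECT madlib.logregr_train("
        f"'{source_table}', '{out_table}', '{dependent}', '{features}');"
    )


# Hypothetical table and columns for the "predict loan default" case.
sql = logregr_train_sql(
    "loans", "loan_default_model", "defaulted", ["income", "debt_ratio"]
)
print(sql)
```

The point of the sketch is that training is expressed as an ordinary SQL call, so it runs where the data lives instead of exporting data to a separate tool.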

When to use HDB vs. Hive

Choose the right SQL engine based on your application’s needs:

Component Best Fit
Apache Hive
  • Multiple subject areas
  • Holds very detailed information
  • Scale – Multiple Petabytes
  • Integrates all data sources
  • ETL, Reporting & BI
  • Low-Mid Query Latency
Hortonworks HDB powered by Apache HAWQ
  • Single Subject Mart
  • Summarized information
  • Scale – 100s TB
  • Ad-hoc Analytics & Visualization
  • Machine Learning
  • Low Query Latency
