ニュースレター

Hortonworks から最新情報をメールで受け取る

月に一度、ビッグデータに関する最新のインサイト、トレンド、分析情報、ナレッジをお届けします。

AVAILABLE NEWSLETTERS:

Sign up for the Developers Newsletter

月に一度、ビッグデータに関する最新のインサイト、トレンド、分析情報、ナレッジをお届けします。

行動喚起

始める

クラウド

スタートのご用意はできましたか?

Sandbox をダウンロード

ご質問はありませんか?

*いつでも登録を解除できることを理解しています。Hortonworks プライバシーポリシーのその他の情報も確認しています。
クローズクローズボタン
HDP > Hadoop を使用した開発 > 入門編の基本

Interactive Query for Hadoop with Apache Hive on Apache Tez

クラウド スタートのご用意はできましたか?

SANDBOX をダウンロード

はじめに

In this tutorial, we’ll focus on taking advantage of the improvements to Apache Hive and Apache Tez through the work completed by the community as part of the Stinger initiative, some of the features which helped make Hive be over one hundred times faster are:

  • Performance improvements of Hive on Tez
  • Performance improvements of Vectorized Query
  • Cost-based Optimization Plans
  • Multi-tenancy with HiveServer2

前提条件

概要

Download Data

Download the data from our repository:

wget https://github.com/hortonworks/data-tutorials/raw/master/tutorials/hdp/interactive-query-for-hadoop-with-apache-hive-on-apache-tez/assets/driver_data.zip
unzip driver_data.zip

Alternatively, click here to download.

We will be uploading two csv files – drivers.csv and timesheet.csv on to DAS to create tables from them.

Create Hive Tables from CSV files on DAS

DAS can be accessed by selecting the service from Sandbox Splash Page

das-ui-splash

DAS is also accessible by navigating to sandbox-hdp.hortonworks.com:30800

You will find the Data Analytics Studio UI:

welcome-to-das

Next, we will create tables based on the csv files we downloaded earlier.

1. Click on Database

2. Select the + button to add a new table

3. Click on UPLOAD TABLE

add-new-table

4. Select the Is first row header? checkbox

5. Select Upload from Local

6. Drag and drop drivers.csv and timesheet.csv onto the browser or select the files from your local directory

upload-files

7. Review the data and the DDL, once you are satisfied select create

review-data-ddl

Here is a table of the DDL for your reference:

Driver Table

Item Data Type
driverId int
name string
ssn big int
location string
certified string
wageplan string

Timesheet Table

driverId int
week int
hoursLogged int
miles_logged int

Speed Improvements

To experience the speed improvements of Hive on Tez, we will run some sample queries.

By default, the Hive view runs with Tez as it’s execution engine. That’s because Tez has great speed improvements over the original MapReduce execution engine. But by how much exactly are these improvements? Well let’s find out!

Configure MapReduce as Execution Engine in Hive view Settings Tab

Great, now that our tables are created and loaded with data we can begin experimenting with settings:

Navigate back to Ambari and sign in as Username/Password: raj_ops/raj_ops

Next select Hive and then CONFIGS

settings_page

Finally, use the filter to find and modify these specific configurations:

Configuration 新たな価値
hive.execution.engine mr
hive.auto.convert.join false (unselect box)

hive-exec-eng

uncheck-auto-convert

Save the changes and restart all services required.

Test Query on MapReduce Engine

We are now going to test a query using MapReduce as our execution engine. Head back to DAS then execute the following query and wait for the results.

select d.*, t.hoursLogged, t.milesLogged
from drivers d join timesheet t
on d.driverId = t.driverId;

first_join_mr

This query was run using the MapReduce framework.

Configure Tez as Execution Engine in Hive Settings Tab

Now we can enable Hive on Tez execution and take advantage of Directed Acyclic Graph (DAG) execution representing the query instead of multiple stages of MapReduce program which involved a lot of synchronization, barriers and IO overheads. This is improved in Tez, by writing intermediate data set into memory instead of hard disk.

Great, now that our tables are created and loaded with data we can begin experimenting with settings:

Navigate back to Ambari and sign in as Username/Password: raj_ops/raj_ops

Next select Hive and then CONFIGS

settings_page

Finally, use the filter to find and modify these specific configurations:

Configuration 新たな価値
hive.execution.engine tez
hive.auto.convert.join true (select box)

Save the changes and restart all services required.

Test Query on Tez Engine

Run the same query as we had run earlier to see the speed improvements with Tez.

select d.*, t.hours_logged, t.miles_logged
from drivers d join timesheet t
on d.driverId = t.driverId;

first_join_mr1

Take a look at the Visual Explain to visually see the execution plan.

first_join_tez_logs

Notice that the results will have appeared much quicker while having the execution engine set to Tez. This is currently the default for all Hive queries.

Congratulations! You have successfully run your Hive on Tez Job.

Execute Query as MapReduce Then Tez Engine

Now let’s try a new query to work with

SELECT d.driverId, d.name, t.total_hours, t.total_miles from drivers d
JOIN (SELECT driverId, sum(hoursLogged)total_hours, sum(milesLogged)total_miles FROM timesheet GROUP BY driverId ) t
ON (d.driverId = t.driverId);

Try executing the query first on MapReduce execution engine, then on Tez. You should notice a considerable gap in execution time.
Here is the result.

second_join_mr

To experience this further, you could use your own dataset, upload to your HDP Sandbox using steps above and execute with and without Tez to compare the difference.

Track Hive on Tez Jobs

You can track your Hive on Tez jobs in HDP Sandbox Web UI as well by navigating to http://sandbox-hdp.hortonworks.com:8088/ui2, here you can observe the running state of the queries we just experimented with, as well as the engine used in the background.

all_applications

You can click on your job and see further details.

Summary

You learned how to create Hive tables from .csv files using DAS. We also experimented with MapReduce and Tez to observe the speed improvements of Hive on Tez.

参考文献

ユーザーの評価

ユーザーの評価
0 No Reviews
5 Star 0%
4 Star 0%
3 Star 0%
2 Star 0%
1 Star 0%
チュートリアル名
Interactive Query for Hadoop with Apache Hive on Apache Tez

質問する回答を探す場合は、Hortonworks Community Connectionをご参照ください。

No Reviews
評価する

登録

登録して評価をご記入ください

ご自身の体験を共有してください

例: 最高のチュートリアル

この欄に最低50文字で記入してください。

成功

ご意見を共有していただきありがとうございます!