
Hadoop Tutorial – Getting Started with HDP

Pig – Risk Factor

Introduction

In this tutorial, you will be introduced to Apache Pig. In the earlier section of this lab, you learned how to load data into HDFS and then manipulate it using Hive. We are using the truck sensor data to better understand the risk associated with each driver. This section will teach you how to compute risk using Apache Pig.

Prerequisites

This tutorial is part of a series of hands-on tutorials to get you started on HDP using the Hortonworks Sandbox. Please ensure you complete the prerequisites before proceeding with this tutorial.

Outline

Pig Basics

Pig is a high-level scripting language used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java. Pig’s simple SQL-like scripting language is called Pig Latin, and appeals to developers already familiar with scripting languages and SQL.

Pig is complete, so you can do all required data manipulations in Apache Hadoop with Pig. Through the User Defined Functions (UDF) facility in Pig, Pig can invoke code in many languages like JRuby, Jython and Java. You can also embed Pig scripts in other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems.
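For example, a user defined function written in Python can be registered through Jython and called from Pig Latin. Here is a minimal sketch, where my_udfs.py is a hypothetical file defining a reverse_string() function (the file name, the function, and the drivers.csv input are made up for illustration):

-- Register a hypothetical Jython UDF file and give it the namespace myfuncs
REGISTER 'my_udfs.py' USING jython AS myfuncs;
-- Load a hypothetical comma-delimited file and apply the UDF to one field
drivers = LOAD 'drivers.csv' USING PigStorage(',') AS (driverid:chararray, name:chararray);
flipped = FOREACH drivers GENERATE myfuncs.reverse_string(name);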

Pig works with data from many sources, including structured and unstructured data, and stores the results in the Hadoop Distributed File System (HDFS).

Pig scripts are translated into a series of MapReduce jobs that are run on the Apache Hadoop cluster.
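If you are curious about how a script is translated, Pig's EXPLAIN operator prints the logical, physical, and execution plans for a relation. Here is a minimal sketch, assuming the geolocation Hive table from the previous lab and the -useHCatalog argument described later in this tutorial:

-- Load a Hive table via HCatalog and filter it, then show how Pig plans the work
a = LOAD 'geolocation' USING org.apache.hive.hcatalog.pig.HCatLoader();
b = FILTER a BY event != 'normal';
EXPLAIN b;  -- prints the plans Pig will use to compute relation b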

Create Table riskfactor from Existing trucks_mileage Data

Next, you will use Pig to compute the risk factor of each driver. Before we can run the Pig code, the table must already exist in Hive to satisfy one of the requirements for the HCatStorer() class. The Pig code expects the following structure for a table named riskfactor. Execute the following DDL in the Hive View 2.0 query editor:

CREATE TABLE riskfactor (driverid string, events bigint, totmiles bigint, riskfactor float)
STORED AS ORC;

riskfactor_lab3

Verify Table riskfactor was Created Successfully

Verify the riskfactor table was created successfully. It will be empty now, but you will populate it from a Pig script. You are now ready to compute the risk factor using Pig. Let’s take a look at Pig and how to execute Pig scripts from within Ambari.

Create Pig Script

In this phase of the tutorial, we create and run a Pig script. We will use the Ambari Pig View. Let’s get started…

Log in to Ambari Pig User Views

To get to the Ambari Pig View, click on the Ambari Views icon at top right and select Pig:

ambari_pig_view_concepts

This will bring up the Ambari Pig User View interface. Your Pig View does not have any scripts to display, so it will look like the following:

Lab3_4

On the left is a list of your scripts, and on the right is a composition box for writing scripts. A special interface feature is the Pig helper located below the name of your script file. The Pig helper provides us with templates for the statements, functions, I/O statements, HCatLoader() and Python user defined functions. At the very bottom are status areas that will show the results of our script and log files.

The following screenshot shows and describes the various components and features of the Pig View:

  1. Quick link to view existing scripts, UDFs, or History of prior runs
  2. View your current script or prior History
  3. Helper functions to help write your scripts
  4. Arguments needed for script execution
  5. Execute button to run your script

pig_user_view_components_hello_hdp

Create a New Script

Let’s enter a Pig script. Click the New Script button in the upper-right corner of the view:

new_script_hello_hdp_lab3

Name the script riskfactor.pig, then click the Create button:

Lab3_7

Load Data into Pig using HCatalog

We will use HCatalog to load data into Pig. HCatalog allows us to share schemas across tools and users within our Hadoop environment. It also allows us to factor schema and location information out of our queries and scripts and centralize them in a common repository. Since the data is registered in HCatalog, we can use the HCatLoader() function. Pig lets us give the table a name or alias without having to worry about allocating space and defining the structure; we only have to worry about how we are processing the table.

  • We can use the Pig helper located below the name of your script file to give us a template for the line. Click on the Pig helper -> HCatalog -> LOAD template
  • The entry %TABLE% is highlighted in red for us. Type the name of the table, which is geolocation.
  • Remember to add the a = before the template. This saves the results into a. Note the ‘=’ has to have a space before and after it.
  • Our completed line of code will look like:
a = LOAD 'geolocation' USING org.apache.hive.hcatalog.pig.HCatLoader();

The script above loads data, in our case from the Hive table named geolocation, using the HCatLoader() function. Copy-and-paste the above Pig code into the riskfactor.pig window.
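Because HCatLoader() pulls the table's schema from the Hive metastore, the script does not declare any columns. If you want to confirm what Pig sees, an optional check is:

DESCRIBE a;  -- prints the geolocation table's column names and types as Pig sees them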

Note: Refer to Pig Latin Basics – load to learn more about the load operator.

Filter your data set

The next step is to select a subset of the records, so that we keep the records of drivers for which the event is not normal. To do this in Pig we use the filter operator. We instruct Pig to filter our table, keep all records where event != 'normal', and store the result in b. With this one simple statement, Pig will look at each record in the table and filter out all the ones that do not meet our criteria.

  • We can use Pig Help again by clicking on the Pig helper-> Relational Operators -> FILTER template
  • We can replace %VAR% with “a” (hint: tab jumps you to the next field)
  • Our %COND% is “event !=’normal’; ” (note: single quotes are needed around normal and don’t forget the trailing semi-colon)
  • Complete line of code will look like:
b = filter a by event != 'normal';

Copy-and-paste the above Pig code into the riskfactor.pig window.

Note: Refer to Pig Latin Basics – filter to learn more about the filter operator.
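While developing, it can help to preview a few of the filtered rows before moving on. Here is an optional sketch (running DUMP launches a job, so use it only while debugging):

-- Preview five filtered records; remove these lines from the final script
b_preview = LIMIT b 5;
DUMP b_preview;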

Iterate your data set

Now that we have the right set of records, let's iterate through them. We use the foreach operator on the filtered data to iterate through all the records. We would also like to know the number of non-normal events associated with each driver, so to achieve this we add a column with the value 1 to every row in the data set.

  • Use Pig Help again by clicking on the Pig helper -> Relational Operators -> FOREACH template
  • Our %DATA% is b and the second %NEW_DATA% is “driverid, event, (int) ‘1’ as occurance;”
  • Complete line of code will look like:
c = foreach b generate driverid, event, (int) '1' as occurance;

Copy-and-paste the above Pig code into the riskfactor.pig window:

Note: Refer to Pig Latin Basics – foreach to learn more about the foreach operator.

Calculate the total non normal events for each driver

The group statement is important because it groups the records by one or more fields. In our case, we want to group by driver id and then iterate over each group to sum the non-normal events.

  • Use the template Pig helper -> Relational Operators -> GROUP %VAR% BY %VAR%
  • First %VAR% takes “c” and second %VAR% takes “driverid;”
  • Complete line of code will look like:
d = group c by driverid;

Copy-and-paste the above Pig code into the riskfactor.pig window.

  • Next, use the foreach statement again to sum the occurrences for each driver.
e = foreach d generate group as driverid, SUM(c.occurance) as t_occ;

Note: Refer to Pig Latin Basics – group to learn more about the group operator.
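After the group statement, each tuple of d holds the grouping key (called group) plus a bag containing every matching tuple of c, which is why the foreach above can call SUM(c.occurance). Here is a minimal sketch of what this looks like, with made-up driver ids and events:

-- Illustrative only: one possible tuple of d and the corresponding tuple of e
--   d:  (A1, {(A1, overspeed, 1), (A1, lane departure, 1)})
--   e:  (A1, 2)
DESCRIBE d;  -- shows the nested schema: the group key plus a bag of c tuples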

Load drivermileage Table and Perform a Join Operation

In this section, we will load the drivermileage table into Pig using HCatalog and perform a join operation on driverid. The resulting data set will give us the total miles and total non-normal events for each driver.

  • Load drivermileage using HCatLoader()
g = LOAD 'drivermileage' using org.apache.hive.hcatalog.pig.HCatLoader();
  • Use the template Pig helper ->Relational Operators->JOIN %VAR% BY
  • Replace %VAR% by ‘e’ and after BY put ‘driverid, g by driverid;
  • Complete line of code will look like:
h = join e by driverid, g by driverid;

Copy-and-paste the above two lines of Pig code into the riskfactor.pig window.

Note: Refer to Pig Latin Basics – join to learn more about the join operator.
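The next section refers to the joined relation h by field position ($0 through $3). After the join, e's fields come first and then g's, so $0 = e::driverid, $1 = e::t_occ, $2 = g::driverid and $3 = g::totmiles. An optional way to confirm this ordering:

DESCRIBE h;  -- lists the joined fields in positional order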

Compute Driver Risk factor

In this section, we will associate a driver risk factor with every driver. To calculate a driver's risk factor, divide the total miles travelled by the number of non-normal event occurrences.

  • We will use the foreach statement again to compute the driver risk factor for each driver.
  • Use the following code and paste it into your Pig script.
final_data = foreach h generate $0 as driverid, $1 as events, $3 as totmiles, (float) $3/$1 as riskfactor;
  • As a final step, store the data into a table using HCatalog.
store final_data into 'riskfactor' using org.apache.hive.hcatalog.pig.HCatStorer();
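As a quick sanity check on the arithmetic, here is the same expression evaluated with made-up numbers:

-- Illustrative only: totmiles = 5000 and t_occ (events) = 10 for one driver
-- riskfactor = (float) 5000 / 10 = 500.0
-- the (float) cast prevents the division from being truncated if both fields are integers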

Here is the final code and what it will look like once you paste it into the editor.

Note: Refer to Pig Latin Basics – store to learn more about the store operator.

Add Pig argument

Add the Pig argument -useHCatalog (case sensitive).

pig_script_argument

The final Pig script should look like this:

a = LOAD 'geolocation' using org.apache.hive.hcatalog.pig.HCatLoader();
b = filter a by event != 'normal';
c = foreach b generate driverid, event, (int) '1' as occurance;
d = group c by driverid;
e = foreach d generate group as driverid, SUM(c.occurance) as t_occ;
g = LOAD 'drivermileage' using org.apache.hive.hcatalog.pig.HCatLoader();
h = join e by driverid, g by driverid;
final_data = foreach h generate $0 as driverid, $1 as events, $3 as totmiles, (float) $3/$1 as riskfactor;
store final_data into 'riskfactor' using org.apache.hive.hcatalog.pig.HCatStorer();

riskfactor_computation_script_lab3

Save the file riskfactor.pig by clicking the Save button in the left-hand column.

Quick Recap

Before we execute the code, let’s review the code again:

  • The line a = loads the geolocation table from HCatalog.
  • The line b = filters out all the rows where the event is ‘normal’, keeping only the non-normal events.
  • Then we add a column called occurance and assign it a value of 1.
  • We then group the records by driverid and sum up the occurrences for each driver.
  • At this point we need the miles driven by each driver, so we load the table we created using Hive.
  • To get our final result, we join by the driverid the count of events in e with the mileage data in g.
  • Now it is simple to calculate the risk factor by dividing the miles driven by the number of events.

You need to configure the Pig Editor to use HCatalog so that the Pig script can load the proper libraries. In the Pig arguments text box, enter -useHCatalog and click the Add button:

Note this argument is case sensitive. It should be typed exactly -useHCatalog.
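If you later run the same script from a shell on the Sandbox instead of the Pig View, the equivalent is to pass the flag on the command line, for example (assuming riskfactor.pig is in your working directory):

pig -useHCatalog riskfactor.pig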

Lab3_9

The Arguments section of the Pig View should now look like the following:
Lab3_10

Execute Pig Script on Tez

Execute Pig Script

Check the Execute on Tez checkbox and then hit the blue Execute button to submit the job. The Pig job will be submitted to the cluster. This will open a new tab showing the status of the running Pig job, and at the top you will find a progress bar that shows the job status.

execute_pig_script_compute_riskfactor_hello_hdp_lab3

View Results Section

Wait for the job to complete. The output of the job is displayed in the Results section. Notice your script does not output any result – it stores the result into a Hive table – so your Results section will be empty.
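If you want the Results section to show something while you are developing, you can temporarily DUMP a relation in addition to storing it, for example (for debugging only, remove it afterwards):

DUMP final_data;  -- prints the computed rows to the Results section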

running_script_riskfactor_hello_hdp_lab3

completed_riskfactor_script_hello_hdp_lab3

Click on the Logs dropdown menu to see what happened when your script ran. Errors will appear here.

View Logs section (Debugging Practice)

Why are Logs important?

The logs section is helpful when debugging code whose expected output does not appear. For instance, say in the next section we browse the sample data from our riskfactor table and nothing appears. The logs will tell us why the job failed. A common issue is that Pig does not successfully read data from the geolocation or drivermileage table; with the logs we can pinpoint and address the issue.

Let's verify that Pig read from these tables successfully and stored the data into our riskfactor table. You should see output similar to the following:

debug_through_logs_lab3

What results do our logs show us about our Pig Script?

  • Read 8000 records from our geolocation table
  • Read 100 records from our drivermileage table
  • Stored 99 records into our riskfactor table

Verify Pig Script Successfully Populated Hive Table

Go back to the Ambari Hive View 2.0 and browse the data in the riskfactor table to verify that your Pig job successfully populated this table. Here is what it should look like:
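If you prefer to check from Pig instead of the Hive View, here is a minimal sketch (it again requires the -useHCatalog argument):

-- Load the populated table back through HCatalog and print a few rows
r = LOAD 'riskfactor' USING org.apache.hive.hcatalog.pig.HCatLoader();
r_top = LIMIT r 10;
DUMP r_top;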

pig_populated_riskfactor_table_hello_hdp_lab3

At this point we now have our truck average miles per gallon table (avg_mileage) and our risk factor table (riskfactor).

Summary

Congratulations! Let's summarize the Pig commands we learned in this tutorial to perform risk factor analysis on the geolocation and truck data. We learned to use Pig to access data from Hive with the LOAD {hive_table} …HCatLoader() statement. We were then able to use the filter, foreach, group, join, and store {hive_table} …HCatStorer() operators to manipulate, transform and process this data. To review these Pig Latin operators, view the Pig Latin Basics, which contains documentation on each operator.

Further Reading

Strengthen your foundation of Pig Latin and reinforce why this scripting platform is beneficial for processing and analyzing massive data sets with these resources:
