Apache Pig Tutorial

Apache Pig - Running Scripts

In this chapter, we will see how to run Apache Pig scripts in batch mode.

Comments in Pig Script

While writing a script in a file, we can include comments in it as shown below.

Multi-line comments

We will begin multi-line comments with '/*' and end them with '*/'.

/* These are the multi-line comments
   in the Pig script */

Single-line comments

We will begin the single-line comments with '--'.

--we can write single line comments like this.

Executing a Pig Script in Batch Mode

While executing Apache Pig statements in batch mode, follow the steps given below.

Step 1

Write all the required Pig Latin statements and commands in a single file and save it as a .pig file.
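
For example, a minimal script file could look like the sketch below. The file name, input path, and schema here are assumptions for illustration only, not part of the later example.

-- my_script.pig : a hypothetical minimal batch script
-- Load a comma-separated input file (path and schema are assumed for illustration)
student_data = LOAD '/pig_data/input.txt' USING PigStorage(',') AS (id:int, name:chararray);

-- Print the relation to the console
DUMP student_data;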

Step 2

Execute the Apache Pig script. You can execute the Pig script from the shell (Linux) as shown below.

Local mode

$ pig -x local Sample_script.pig

MapReduce mode

$ pig -x mapreduce Sample_script.pig

You can execute it from the Grunt shell as well using the exec command as shown below.

grunt> exec /sample_script.pig
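
As a sketch, assuming the script resides at /sample_script.pig, you would first start the Grunt shell and then call exec from within it:

$ pig -x mapreduce
grunt> exec /sample_script.pig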

Executing a Pig Script from HDFS

We can also execute a Pig script that resides in the HDFS. Suppose there is a Pig script with the name Sample_script.pig in the HDFS directory named /pig_data/. We can execute it as shown below.

$ pig -x mapreduce hdfs://localhost:9000/pig_data/Sample_script.pig

Example

Assume we have a file student_details.txt in HDFS with the following content.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

We also have a sample script named sample_script.pig in the same HDFS directory. This file contains statements that perform operations and transformations on the student relation, as shown below.

student = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

student_order = ORDER student BY age DESC;

student_limit = LIMIT student_order 4;

Dump student_limit;

  1. The first statement of the script will load the data from the file named student_details.txt as a relation named student.

  2. The second statement of the script will arrange the tuples of the relation in descending order, based on age, and store it as student_order.

  3. The third statement of the script will store the first 4 tuples of student_order as student_limit.

  4. Finally, the fourth statement will dump the content of the relation student_limit.

Let us now execute sample_script.pig as shown below.

$ ./pig -x mapreduce hdfs://localhost:9000/pig_data/sample_script.pig

Apache Pig executes the script and produces output with the following content.

(7,Komal,Nayak,24,9848022334,trivendram)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,23,9848022335,Chennai)
2015-10-19 10:31:27,446 [main] INFO  org.apache.pig.Main - Pig script completed in 12
minutes, 32 seconds and 751 milliseconds (752751 ms)