Apache Pig 简明教程
Apache Pig - Installation
本章说明了如何下载、安装和在您的系统中设置 Apache Pig 。
Prerequisites
在使用 Apache Pig 之前,您必须在自己的系统上安装 Hadoop 和 Java 至关重要。因此,在安装 Apache Pig 之前,请按照以下链接中给出的步骤安装 Hadoop 和 Java −
Download Apache Pig
首先,从以下网站下载 Apache Pig 的最新版本 − https://pig.apache.org/
Install Apache Pig
下载 Apache Pig 软件后,请按照以下给定的步骤在 Linux 环境中安装它。
Step 1
在安装 Hadoop, Java, 和其他软件的安装目录所在的同一目录中创建一个名为 Pig 的目录。(在我们的教程中,我们在名为 Hadoop 的用户中创建了 Pig 目录)。
$ mkdir Pig
Configure Apache Pig
安装 Apache Pig 后,我们必须配置它。要配置,我们需要编辑两个文件 − bashrc and pig.properties 。
.bashrc file
在 .bashrc 文件中,设置以下变量 −
-
PIG_HOME 文件夹为 Apache Pig 的安装文件夹,
-
PATH 环境变量为 bin 文件夹,
-
PIG_CLASSPATH 环境变量为 Hadoop 安装的 etc(配置)文件夹(包含 core-site.xml、hdfs-site.xml 和 mapred-site.xml 文件的目录)。
export PIG_HOME = /home/Hadoop/Pig
export PATH = $PATH:/home/Hadoop/pig/bin
export PIG_CLASSPATH = $HADOOP_HOME/conf
pig.properties file
在 Pig 的 conf 文件夹中,有一个名为 pig.properties 的文件。在 pig.properties 文件中,您可以按照下面给出的方式,设置各种参数。
pig -h properties
支持以下属性 −
Logging: verbose = true|false; default is false. This property is the same as -v
switch brief=true|false; default is false. This property is the same
as -b switch debug=OFF|ERROR|WARN|INFO|DEBUG; default is INFO.
This property is the same as -d switch aggregate.warning = true|false; default is true.
If true, prints count of warnings of each type rather than logging each warning.
Performance tuning: pig.cachedbag.memusage=<mem fraction>; default is 0.2 (20% of all memory).
Note that this memory is shared across all large bags used by the application.
pig.skewedjoin.reduce.memusagea=<mem fraction>; default is 0.3 (30% of all memory).
Specifies the fraction of heap available for the reducer to perform the join.
pig.exec.nocombiner = true|false; default is false.
Only disable combiner as a temporary workaround for problems.
opt.multiquery = true|false; multiquery is on by default.
Only disable multiquery as a temporary workaround for problems.
opt.fetch=true|false; fetch is on by default.
Scripts containing Filter, Foreach, Limit, Stream, and Union can be dumped without MR jobs.
pig.tmpfilecompression = true|false; compression is off by default.
Determines whether output of intermediate jobs is compressed.
pig.tmpfilecompression.codec = lzo|gzip; default is gzip.
Used in conjunction with pig.tmpfilecompression. Defines compression type.
pig.noSplitCombination = true|false. Split combination is on by default.
Determines if multiple small files are combined into a single map.
pig.exec.mapPartAgg = true|false. Default is false.
Determines if partial aggregation is done within map phase, before records are sent to combiner.
pig.exec.mapPartAgg.minReduction=<min aggregation factor>. Default is 10.
If the in-map partial aggregation does not reduce the output num records by this factor, it gets disabled.
Miscellaneous: exectype = mapreduce|tez|local; default is mapreduce. This property is the same as -x switch
pig.additional.jars.uris=<comma seperated list of jars>. Used in place of register command.
udf.import.list=<comma seperated list of imports>. Used to avoid package names in UDF.
stop.on.failure = true|false; default is false. Set to true to terminate on the first error.
pig.datetime.default.tz=<UTC time offset>. e.g. +08:00. Default is the default timezone of the host.
Determines the timezone used to handle datetime datatype and UDFs.
Additionally, any Hadoop property can be specified.