HCatalog Tutorial
HCatalog - Loader & Storer
The HCatLoader and HCatStorer APIs are used with Pig scripts to read and write data in HCatalog-managed tables. No HCatalog-specific setup is required for these interfaces.
Some familiarity with Apache Pig scripts will help you follow this chapter. For further reference, please go through our Apache Pig tutorial.
HCatLoader
HCatLoader is used with Pig scripts to read data from HCatalog-managed tables. Use the following syntax to load data into HDFS using HCatLoader.
A = LOAD 'tablename' USING org.apache.hcatalog.pig.HCatLoader();
You must specify the table name in single quotes: LOAD 'tablename'. If you are using a non-default database, then you must specify your input as 'dbname.tablename'.
The Hive metastore lets you create tables without specifying a database. If you created tables this way, then the database name is 'default' and is not required when specifying the table for HCatLoader.
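As an illustrative sketch (the table names student and mydb.student below are assumptions, not tables created earlier in this tutorial), loading from the default database and from a named database looks like this. Note that HCatLoader obtains the table schema from the metastore, so no AS clause is needed:

```pig
-- Load from a table in the default database; the schema comes from the metastore.
A = LOAD 'student' USING org.apache.hcatalog.pig.HCatLoader();

-- Load from a table in a non-default database using 'dbname.tablename'.
B = LOAD 'mydb.student' USING org.apache.hcatalog.pig.HCatLoader();
```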
The following table lists the important methods of the HCatLoader class along with their descriptions.
1. public InputFormat<?,?> getInputFormat() throws IOException
   Returns the input format of the data being loaded.

2. public String relativeToAbsolutePath(String location, Path curDir) throws IOException
   Returns the String form of the absolute path.

3. public void setLocation(String location, Job job) throws IOException
   Sets the location where the job is to be executed.

4. public Tuple getNext() throws IOException
   Returns the current tuple (key and value) from the input.
HCatStorer
HCatStorer is used with Pig scripts to write data to HCatalog-managed tables. Use the following syntax for the store operation.
A = LOAD ...
B = FOREACH A ...
...
...
my_processed_data = ...
STORE my_processed_data INTO 'tablename' USING org.apache.hcatalog.pig.HCatStorer();
You must specify the table name in single quotes: STORE ... INTO 'tablename'. Both the database and the table must be created prior to running your Pig script. If you are using a non-default database, then you must specify your output as 'dbname.tablename'.
The Hive metastore lets you create tables without specifying a database. If you created tables this way, then the database name is 'default' and you do not need to specify the database name in the store statement.
For the USING clause, you can have a string argument that represents key/value pairs for partitions. This argument is mandatory when you are writing to a partitioned table and the partition column is not among the output columns. The values for partition keys should NOT be quoted.
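For example, a write to a partitioned table might pass the partition key/value pair as the string argument to HCatStorer. The table web_logs and partition column datestamp below are hypothetical; note that the value 20110924 is not quoted inside the pair:

```pig
-- 'datestamp=20110924' supplies the partition key/value pair for the write.
-- It is required here because datestamp is not among the output columns.
STORE processed_logs INTO 'web_logs'
   USING org.apache.hcatalog.pig.HCatStorer('datestamp=20110924');
```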
The following table lists the important methods of the HCatStorer class along with their descriptions.
1. public OutputFormat getOutputFormat() throws IOException
   Returns the output format of the stored data.

2. public void setStoreLocation(String location, Job job) throws IOException
   Sets the location where this store operation is to be executed.

3. public void storeSchema(ResourceSchema schema, String arg1, Job job) throws IOException
   Stores the schema.

4. public void prepareToWrite(RecordWriter writer) throws IOException
   Prepares to write data to a particular file using RecordWriter.

5. public void putNext(Tuple tuple) throws IOException
   Writes the tuple data into the file.
Running Pig with HCatalog
Pig does not automatically pick up HCatalog jars. To bring in the necessary jars, you can either use a flag in the Pig command or set the environment variables PIG_CLASSPATH and PIG_OPTS as described below.
To bring in the appropriate jars for working with HCatalog, simply include the following flag −
pig -useHCatalog <Sample pig scripts file>
Setting the CLASSPATH for Execution
Use the following CLASSPATH settings to synchronize HCatalog with Apache Pig.
export HADOOP_HOME=<path_to_hadoop_install>
export HIVE_HOME=<path_to_hive_install>
export HCAT_HOME=<path_to_hcat_install>

export PIG_CLASSPATH=$HCAT_HOME/share/hcatalog/hcatalog-core*.jar:\
$HCAT_HOME/share/hcatalog/hcatalog-pig-adapter*.jar:\
$HIVE_HOME/lib/hive-metastore-*.jar:$HIVE_HOME/lib/libthrift-*.jar:\
$HIVE_HOME/lib/hive-exec-*.jar:$HIVE_HOME/lib/libfb303-*.jar:\
$HIVE_HOME/lib/jdo2-api-*-ec.jar:$HIVE_HOME/conf:$HADOOP_HOME/conf:\
$HIVE_HOME/lib/slf4j-api-*.jar
Example
Assume we have a file student_details.txt in HDFS with the following content.
student_details.txt
001, Rajiv, Reddy, 21, 9848022337, Hyderabad
002, siddarth, Battacharya, 22, 9848022338, Kolkata
003, Rajesh, Khanna, 22, 9848022339, Delhi
004, Preethi, Agarwal, 21, 9848022330, Pune
005, Trupthi, Mohanthy, 23, 9848022336, Bhuwaneshwar
006, Archana, Mishra, 23, 9848022335, Chennai
007, Komal, Nayak, 24, 9848022334, trivendram
008, Bharathi, Nambiayar, 24, 9848022333, Chennai
We also have a sample script named sample_script.pig in the same HDFS directory. This file contains statements that perform operations and transformations on the student relation, as shown below.
student = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING
   PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray,
   age:int, phone:chararray, city:chararray);
student_order = ORDER student BY age DESC;
STORE student_order INTO 'student_order_table' USING org.apache.hcatalog.pig.HCatStorer();
student_limit = LIMIT student_order 4;
DUMP student_limit;
- The first statement of the script loads the data in the file named student_details.txt as a relation named student.
- The second statement arranges the tuples of the relation in descending order, based on age, and stores the result as student_order.
- The third statement stores the processed data of student_order in a separate table named student_order_table.
- The fourth statement stores the first four tuples of student_order as student_limit.
- Finally, the fifth statement dumps the content of the relation student_limit.
Let us now execute the sample_script.pig as shown below.
$ ./pig -useHCatalog hdfs://localhost:9000/pig_data/sample_script.pig
Now, check your output directory (hdfs: user/tmp/hive) for the output files (part_0000, part_0001).