HCatalog 简明教程
HCatalog - Introduction
What is HCatalog?
HCatalog 是 Hadoop 的表存储管理工具。它向其他 Hadoop 应用程序公开 Hive 元存储的表格数据。它使用户能够使用不同的数据处理工具(Pig、MapReduce)轻松地将数据写到网格中。它确保用户不必担心数据存储在何处或以何种格式存储。
HCatalog is a table storage management tool for Hadoop. It exposes the tabular data of Hive metastore to other Hadoop applications. It enables users with different data processing tools (Pig, MapReduce) to easily write data onto a grid. It ensures that users don’t have to worry about where or in what format their data is stored.
HCatalog 作为一个 Hive 的关键组件,使用户可以存储任何格式和任何结构的数据。
HCatalog works as a key component of Hive, enabling users to store their data in any format and with any structure.
Why HCatalog?
Enabling right tool for right Job
Hadoop 生态系统包含用于数据处理的不同工具,例如 Hive、Pig 和 MapReduce。虽然这些工具不需要元数据,但是当元数据存在时,它们仍然可以从中受益。共享元数据存储还使跨工具的用户能够更轻松地共享数据。一种非常常见的工作流程是使用 MapReduce 或 Pig 加载和规范化数据,然后通过 Hive 进行分析。如果所有这些工具共享一个元存储,那么每个工具的用户都可以立即访问使用另一个工具创建的数据。无需加载或传输步骤。
Hadoop ecosystem contains different tools for data processing such as Hive, Pig, and MapReduce. Although these tools do not require metadata, they can still benefit from it when it is present. Sharing a metadata store also enables users across tools to share data more easily. A workflow where data is loaded and normalized using MapReduce or Pig and then analyzed via Hive is very common. If all these tools share one metastore, then the users of each tool have immediate access to data created with another tool. No loading or transfer steps are required.
Capture processing states to enable sharing
HCatalog 可以发布您的分析结果。因此,其他程序员可以通过“REST”访问您的分析平台。您发布的模式对其他数据科学家也有用。其他数据科学家使用您的发现作为后续发现的输入。
HCatalog can publish your analytics results, so other programmers can access your analytics platform via REST. The schemas you publish are also useful to other data scientists, who can use your discoveries as inputs into subsequent discoveries.
Integrate Hadoop with everything
Hadoop 作为处理和存储环境为企业提供了许多机会;但是,为了促进采用,它必须与现有工具一起工作并对其进行扩展。Hadoop 应作为您分析平台的输入或与您的运营数据存储和 Web 应用程序集成。企业应该享受 Hadoop 的价值,而不必学习全新的工具集。REST 服务通过熟悉的 API 和类似 SQL 的语言向企业开放平台。企业数据管理系统使用 HCatalog 更深入地与 Hadoop 平台集成。
Hadoop as a processing and storage environment opens up a lot of opportunity for the enterprise; however, to fuel adoption, it must work with and augment existing tools. Hadoop should serve as input into your analytics platform or integrate with your operational data stores and web applications. The organization should enjoy the value of Hadoop without having to learn an entirely new toolset. REST services open up the platform to the enterprise with a familiar API and SQL-like language. Enterprise data management systems use HCatalog to more deeply integrate with the Hadoop platform.
HCatalog Architecture
下图显示了 HCatalog 的整体架构。
The following illustration shows the overall architecture of HCatalog.

HCatalog 支持为可以使用 SerDe (序列化器-反序列化器)编写的任何格式读写文件。默认情况下,HCatalog 支持 RCFile、CSV、JSON、SequenceFile 和 ORC 文件格式。要使用自定义格式,您必须提供 InputFormat、OutputFormat 和 SerDe。
HCatalog supports reading and writing files in any format for which a SerDe (serializer-deserializer) can be written. By default, HCatalog supports RCFile, CSV, JSON, SequenceFile, and ORC file formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.
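For illustration, a table backed by a custom format could be declared through the HCatalog CLI (introduced in the following chapters) roughly as sketched below; the class names com.example.MySerDe, com.example.MyInputFormat, and com.example.MyOutputFormat are placeholders, not real classes −
./hcat -e "CREATE TABLE custom_tbl (id int, payload string)
ROW FORMAT SERDE 'com.example.MySerDe'
STORED AS INPUTFORMAT 'com.example.MyInputFormat'
OUTPUTFORMAT 'com.example.MyOutputFormat';"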
HCatalog 建立在 Hive 元存储之上,并结合了 Hive 的 DDL。HCatalog 为 Pig 和 MapReduce 提供了读写接口,并使用 Hive 的命令行界面来发布数据定义和元数据探索命令。
HCatalog is built on top of the Hive metastore and incorporates Hive’s DDL. HCatalog provides read and write interfaces for Pig and MapReduce and uses Hive’s command line interface for issuing data definition and metadata exploration commands.
HCatalog - Installation
所有 Hadoop 子项目(如 Hive、Pig 和 HBase)都支持 Linux 操作系统。因此,您需要在系统上安装 Linux 版本。HCatalog 已在 2013 年 3 月 26 日与 Hive 安装合并。从 Hive-0.11.0 版本开始,HCatalog 随 Hive 安装提供。因此,请按照以下步骤安装 Hive,进而自动在系统上安装 HCatalog。
All Hadoop sub-projects such as Hive, Pig, and HBase support the Linux operating system. Therefore, you need to install a Linux flavor on your system. HCatalog was merged into Hive on March 26, 2013. From Hive version 0.11.0 onwards, HCatalog ships with the Hive installation. Therefore, follow the steps given below to install Hive, which in turn will automatically install HCatalog on your system.
Step 1: Verifying JAVA Installation
在安装 Hive 之前,必须在系统上安装 Java。您可以使用以下命令来检查系统上是否已安装 Java -
Java must be installed on your system before installing Hive. You can use the following command to check whether you have Java already installed on your system −
$ java -version
如果系统上已安装 Java,您将看到以下响应 -
If Java is already installed on your system, you get to see the following response −
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
如果系统上未安装 Java,则需要按照以下步骤操作。
If you don’t have Java installed on your system, then you need to follow the steps given below.
Step 2: Installing Java
访问以下链接 http://www.oracle.com/ 下载 Java(JDK <latest version> - X64.tar.gz)
Download Java (JDK <latest version> - X64.tar.gz) by visiting the following link http://www.oracle.com/
然后 jdk-7u71-linux-x64.tar.gz 将下载到您的系统。
Then jdk-7u71-linux-x64.tar.gz will be downloaded onto your system.
通常,您会在 Downloads 文件夹中找到下载的 Java 文件。使用以下命令验证并解压 jdk-7u71-linux-x64.tar.gz 文件。
Generally you will find the downloaded Java file in the Downloads folder. Verify it and extract the jdk-7u71-linux-x64.tar.gz file using the following commands.
$ cd Downloads/
$ ls
jdk-7u71-linux-x64.tar.gz
$ tar zxf jdk-7u71-linux-x64.tar.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.tar.gz
为了使得所有用户都能使用 Java,您需要将 Java 移动到“/usr/local/”位置。打开 root,然后键入以下命令。
To make Java available to all the users, you have to move it to the location “/usr/local/”. Open root, and type the following commands.
$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit
为设置 PATH 和 JAVA_HOME 变量,将以下命令添加到 ~/.bashrc 文件。
For setting up PATH and JAVA_HOME variables, add the following commands to ~/.bashrc file.
export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin
现在,按照上文说明,在终端使用 java -version 命令验证安装。
Now verify the installation using the command java -version from the terminal as explained above.
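After reloading the file, a quick check confirms that the variables point to the JDK installed above −
$ source ~/.bashrc
$ echo $JAVA_HOME
/usr/local/jdk1.7.0_71
$ java -version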
Step 3: Verifying Hadoop Installation
在安装 Hive 之前,必须在您的系统上安装 Hadoop。让我们使用以下命令验证 Hadoop 安装:
Hadoop must be installed on your system before installing Hive. Let us verify the Hadoop installation using the following command −
$ hadoop version
如果已在您的系统上安装 Hadoop,则会收到以下回复:
If Hadoop is already installed on your system, then you will get the following response −
Hadoop 2.4.1
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
如果您的系统上未安装 Hadoop,请执行以下步骤:
If Hadoop is not installed on your system, then proceed with the following steps −
Step 4: Downloading Hadoop
使用以下命令从 Apache 软件基金会下载并解压缩 Hadoop 2.4.1。
Download and extract Hadoop 2.4.1 from Apache Software Foundation using the following commands.
$ su
password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mv hadoop-2.4.1 hadoop
# exit
Step 5: Installing Hadoop in Pseudo Distributed Mode
以下步骤用于在伪分布式模式下安装 Hadoop 2.4.1 。
The following steps are used to install Hadoop 2.4.1 in pseudo distributed mode.
Setting up Hadoop
您可以通过将以下命令追加到 ~/.bashrc 文件来设置 Hadoop 环境变量。
You can set Hadoop environment variables by appending the following commands to ~/.bashrc file.
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
现在将所有更改应用到当前正在运行的系统中。
Now apply all the changes into the current running system.
$ source ~/.bashrc
Hadoop Configuration
您可以在位置 “$HADOOP_HOME/etc/hadoop” 中找到所有 Hadoop 配置文件。根据您的 Hadoop 基础架构,您需要在这些配置文件中进行适当的更改。
You can find all the Hadoop configuration files in the location “$HADOOP_HOME/etc/hadoop”. You need to make suitable changes in those configuration files according to your Hadoop infrastructure.
$ cd $HADOOP_HOME/etc/hadoop
为了使用 Java 开发 Hadoop 程序,您必须通过将 JAVA_HOME 值替换为系统中 Java 的位置,来重置 hadoop-env.sh 文件中的 Java 环境变量。
In order to develop Hadoop programs using Java, you have to reset the Java environment variables in hadoop-env.sh file by replacing JAVA_HOME value with the location of Java in your system.
export JAVA_HOME=/usr/local/jdk1.7.0_71
下面列出了您必须编辑以配置 Hadoop 的文件列表。
Given below are the list of files that you have to edit to configure Hadoop.
core-site.xml
core-site.xml 文件包含信息,例如用于 Hadoop 实例的端口号、分配给文件系统内存、用于存储数据的内存限制以及读/写缓冲区大小。
The core-site.xml file contains information such as the port number used for Hadoop instance, memory allocated for the file system, memory limit for storing the data, and the size of Read/Write buffers.
打开 core-site.xml,并在 <configuration> 和 </configuration> 标记之间添加以下属性。
Open the core-site.xml and add the following properties in between the <configuration> and </configuration> tags.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
hdfs-site.xml 文件包含信息,例如复制数据的值、本地名节点的路径以及本地文件系统的数据节点路径。这意味着您要存储 Hadoop 基础架构的位置。
The hdfs-site.xml file contains information such as the value of replication data, the namenode path, and the datanode path of your local file systems. It means the place where you want to store the Hadoop infrastructure.
让我们假设以下数据。
Let us assume the following data.
dfs.replication (data replication value) = 1
(In the following path /hadoop/ is the user name.
hadoopinfra/hdfs/namenode is the directory created by hdfs file system.)
namenode path = //home/hadoop/hadoopinfra/hdfs/namenode
(hadoopinfra/hdfs/datanode is the directory created by hdfs file system.)
datanode path = //home/hadoop/hadoopinfra/hdfs/datanode
打开此文件,并在此文件中在 <configuration>、</configuration> 标记之间添加以下属性。
Open this file and add the following properties in between the <configuration>, </configuration> tags in this file.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
</property>
</configuration>
Note − 在上述文件中,所有属性值都是用户定义的,您可以根据 Hadoop 基础架构进行更改。
Note − In the above file, all the property values are user-defined and you can make changes according to your Hadoop infrastructure.
yarn-site.xml
此文件用于将 Yarn 配置到 Hadoop 中。打开 yarn-site.xml 文件并在该文件中的 <configuration>、</configuration> 标记之间添加以下属性。
This file is used to configure YARN in Hadoop. Open the yarn-site.xml file and add the following properties in between the <configuration>, </configuration> tags in this file.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
mapred-site.xml
此文件用于指定我们使用哪种 MapReduce 框架。默认情况下,Hadoop 包含一个 mapred-site.xml 模板。首先,您需要使用以下命令将 mapred-site.xml.template 文件复制为 mapred-site.xml 文件。
This file is used to specify which MapReduce framework we are using. By default, Hadoop contains a template of mapred-site.xml. First of all, you need to copy the file mapred-site.xml.template to mapred-site.xml using the following command.
$ cp mapred-site.xml.template mapred-site.xml
打开 mapred-site.xml 文件,并在此文件中在 <configuration>、</configuration> 标记之间添加以下属性。
Open mapred-site.xml file and add the following properties in between the <configuration>, </configuration> tags in this file.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
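As an optional sanity check (available in Hadoop 2.x), you can ask Hadoop to echo back a configured value and confirm it matches core-site.xml −
$ hdfs getconf -confKey fs.default.name
hdfs://localhost:9000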
Step 6: Verifying Hadoop Installation
以下步骤用于验证 Hadoop 安装。
The following steps are used to verify the Hadoop installation.
Namenode Setup
使用 “hdfs namenode -format” 命令设置 name 节点,如下所示:
Set up the namenode using the command “hdfs namenode -format” as follows −
$ cd ~
$ hdfs namenode -format
预期结果如下:
The expected result is as follows −
10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.4.1
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory
/home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to retain 1
images with txid >= 0 10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
************************************************************/
Verifying Hadoop DFS
以下命令用于启动 DFS。执行此命令将启动您的 Hadoop 文件系统。
The following command is used to start the DFS. Executing this command will start your Hadoop file system.
$ start-dfs.sh
预期输出如下所示 −
The expected output is as follows −
10/24/14 21:37:56 Starting namenodes on [localhost]
localhost: starting namenode, logging to
/home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-namenode-localhost.out localhost:
starting datanode, logging to
/home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]
Verifying Yarn Script
以下命令用于启动 Yarn 脚本。执行此命令将启动您的 Yarn 守护程序。
The following command is used to start the Yarn script. Executing this command will start your Yarn daemons.
$ start-yarn.sh
预期输出如下所示 −
The expected output is as follows −
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.4.1/logs/
yarn-hadoop-resourcemanager-localhost.out
localhost: starting nodemanager, logging to
/home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-nodemanager-localhost.out
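In addition, the jps tool that ships with the JDK gives a quick view of the running daemons; the process IDs below are illustrative and will differ on your system −
$ jps
21422 NameNode
21554 DataNode
21759 SecondaryNameNode
21906 ResourceManager
22019 NodeManager
22327 Jps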
Accessing Hadoop on Browser
访问 Hadoop 的默认端口号是 50070。使用以下 URL 在您的浏览器上获取 Hadoop 服务。
The default port number to access Hadoop is 50070. Use the following URL to get Hadoop services on your browser.
http://localhost:50070/

Verify all applications for cluster
访问集群所有应用程序的默认端口号为 8088。使用以下网址访问此服务。
The default port number to access all applications of cluster is 8088. Use the following url to visit this service.
http://localhost:8088/

Hadoop 安装完成后,执行下一步并在系统上安装 Hive。
Once you are done with the installation of Hadoop, proceed to the next step and install Hive on your system.
Step 7: Downloading Hive
在本教程中,我们使用 hive-0.14.0。你可以访问以下链接下载: http://apache.petsads.us/hive/hive-0.14.0/ 。我们假设教程下载到了 /Downloads 目录。在这里,我们为此教程下载了名为“ apache-hive-0.14.0-bin.tar.gz ”的 Hive 归档文件。使用以下命令验证下载情况−
We use hive-0.14.0 in this tutorial. You can download it by visiting the following link http://apache.petsads.us/hive/hive-0.14.0/. Let us assume it gets downloaded onto the /Downloads directory. Here, we download Hive archive named “apache-hive-0.14.0-bin.tar.gz” for this tutorial. The following command is used to verify the download −
$ cd Downloads
$ ls
下载成功后,你会看到以下响应−
On successful download, you get to see the following response −
apache-hive-0.14.0-bin.tar.gz
Step 8: Installing Hive
若要在系统上安装 Hive,需要执行以下步骤。我们假设 Hive 归档文件下载到了 /Downloads 目录。
The following steps are required for installing Hive on your system. Let us assume the Hive archive is downloaded onto the /Downloads directory.
Extracting and Verifying Hive Archive
使用以下命令验证下载情况并提取 Hive 归档文件−
The following command is used to verify the download and extract the Hive archive −
$ tar zxvf apache-hive-0.14.0-bin.tar.gz
$ ls
下载成功后,你会看到以下响应−
On successful download, you get to see the following response −
apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz
Copying files to /usr/local/hive directory
我们需要以超级用户(“su -”)身份复制文件。以下命令用于将文件从解压目录复制到 /usr/local/hive 目录。
We need to copy the files as the superuser ("su -"). The following commands are used to copy the files from the extracted directory to the /usr/local/hive directory.
$ su -
passwd:
# cd /home/user/Downloads
# mv apache-hive-0.14.0-bin /usr/local/hive
# exit
Setting up the environment for Hive
可以通过向 ~/.bashrc 文件追加以下行来设置 Hive 环境−
You can set up the Hive environment by appending the following lines to ~/.bashrc file −
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.
使用以下命令执行 ~/.bashrc 文件。
The following command is used to execute ~/.bashrc file.
$ source ~/.bashrc
Step 9: Configuring Hive
要将 Hive 与 Hadoop 配置在一起,需要编辑位于 $HIVE_HOME/conf 目录的 hive-env.sh 文件。以下命令重定向到 Hive config 文件夹并复制模板文件 −
To configure Hive with Hadoop, you need to edit the hive-env.sh file, which is placed in the $HIVE_HOME/conf directory. The following commands redirect to Hive config folder and copy the template file −
$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh
通过追加以下行编辑 hive-env.sh 文件 −
Edit the hive-env.sh file by appending the following line −
export HADOOP_HOME=/usr/local/hadoop
这样,Hive 安装就完成了。现在你需要一个外部数据库服务器来配置元存储。我们使用 Apache Derby 数据库。
With this, the Hive installation is complete. Now you require an external database server to configure Metastore. We use Apache Derby database.
Step 10: Downloading and Installing Apache Derby
按照以下步骤下载并安装 Apache Derby −
Follow the steps given below to download and install Apache Derby −
Downloading Apache Derby
使用以下命令下载 Apache Derby。下载需要一些时间。
The following command is used to download Apache Derby. It takes some time to download.
$ cd ~
$ wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz
使用以下命令验证下载情况−
The following command is used to verify the download −
$ ls
下载成功后,你会看到以下响应−
On successful download, you get to see the following response −
db-derby-10.4.2.0-bin.tar.gz
Extracting and Verifying Derby Archive
以下命令用于提取和验证 Derby 归档文件 −
The following commands are used for extracting and verifying the Derby archive −
$ tar zxvf db-derby-10.4.2.0-bin.tar.gz
$ ls
下载成功后,你会看到以下响应−
On successful download, you get to see the following response −
db-derby-10.4.2.0-bin db-derby-10.4.2.0-bin.tar.gz
Copying Files to /usr/local/derby Directory
我们需要以超级用户(“su -”)身份复制文件。以下命令用于将文件从解压目录复制到 /usr/local/derby 目录 −
We need to copy the files as the superuser ("su -"). The following commands are used to copy the files from the extracted directory to the /usr/local/derby directory −
$ su -
passwd:
# cd /home/user
# mv db-derby-10.4.2.0-bin /usr/local/derby
# exit
Setting up the Environment for Derby
可以通过向 ~/.bashrc 文件追加以下行来设置 Derby 环境 −
You can set up the Derby environment by appending the following lines to ~/.bashrc file −
export DERBY_HOME=/usr/local/derby
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar
使用以下命令执行 ~/.bashrc file −
The following command is used to execute ~/.bashrc file −
$ source ~/.bashrc
Step 11: Configuring the Hive Metastore
配置元数据存储库表示指定数据库存储在 Hive 中的位置。可以通过编辑 hive-site.xml 文件(位于 $HIVE_HOME/conf 目录中)来完成此操作。首先,使用以下命令复制模板文件 −
Configuring Metastore means specifying to Hive where the database is stored. You can do this by editing the hive-site.xml file, which is in the $HIVE_HOME/conf directory. First of all, copy the template file using the following command −
$ cd $HIVE_HOME/conf
$ cp hive-default.xml.template hive-site.xml
编辑 hive-site.xml ,并将以下行追加到 <configuration> 和 </configuration> 标记之间 −
Edit hive-site.xml and append the following lines between the <configuration> and </configuration> tags −
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
创建名为 jpox.properties 的文件,并向其中添加以下行 −
Create a file named jpox.properties and add the following lines into it −
javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create=true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine
Step 12: Verifying Hive Installation
在运行 Hive 之前,您需要在 HDFS 中创建 /tmp 文件夹和一个单独的 Hive 文件夹。在此处,我们使用 /user/hive/warehouse 文件夹。您需要为这些新创建的文件夹设置写入权限,如下所示 −
Before running Hive, you need to create the /tmp folder and a separate Hive folder in HDFS. Here, we use the /user/hive/warehouse folder. You need to set write permission for these newly created folders as shown below −
chmod g+w
现在,在验证 Hive 之前,请在 HDFS 中设置它们。使用以下命令 −
Now set them in HDFS before verifying Hive. Use the following commands −
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
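Optionally, you can list the directories to confirm that they exist and carry the group write permission −
$ $HADOOP_HOME/bin/hadoop fs -ls /
$ $HADOOP_HOME/bin/hadoop fs -ls /user/hive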
以下命令用于验证 Hive 安装 −
The following commands are used to verify Hive installation −
$ cd $HIVE_HOME
$ bin/hive
在成功安装 Hive 后,您将看到以下响应 −
On successful installation of Hive, you get to see the following response −
Logging initialized using configuration in
jar:file:/home/hadoop/hive-0.9.0/lib/hive-common-0.9.0.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201312121621_1494929084.txt
………………….
hive>
您可以执行以下示例命令来显示所有表 −
You can execute the following sample command to display all the tables −
hive> show tables;
OK
Time taken: 2.798 seconds
hive>
Step 13: Verify HCatalog Installation
使用以下命令为 HCatalog 主目录设置系统变量 HCAT_HOME 。
Use the following command to set a system variable HCAT_HOME for HCatalog Home.
export HCAT_HOME=$HIVE_HOME/hcatalog
使用以下命令验证 HCatalog 安装。
Use the following command to verify the HCatalog installation.
cd $HCAT_HOME/bin
./hcat
如果安装成功,您将看到以下输出 −
If the installation is successful, you will get to see the following output −
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
usage: hcat { -e "<query>" | -f "<filepath>" }
[ -g "<group>" ] [ -p "<perms>" ]
[ -D"<name> = <value>" ]
-D <property = value> use hadoop value for given property
-e <exec> hcat command given from command line
-f <file> hcat commands in file
-g <group> group for the db/table specified in CREATE statement
-h,--help Print help information
-p <perms> permissions for the db/table specified in CREATE statement
HCatalog - CLI
HCatalog 命令行界面 (CLI) 可以通过命令 $HIVE_HOME/hcatalog/bin/hcat 调用,其中 $HIVE_HOME 是 Hive 的主目录。hcat 是用于初始化 HCatalog 服务器的命令。
The HCatalog Command Line Interface (CLI) can be invoked from the command $HIVE_HOME/hcatalog/bin/hcat, where $HIVE_HOME is the home directory of Hive. hcat is the command used to initialize the HCatalog server.
使用以下命令初始化 HCatalog 命令行。
Use the following command to initialize HCatalog command line.
cd $HCAT_HOME/bin
./hcat
如果安装已正确完成,则您将获得以下输出 −
If the installation has been done correctly, then you will get the following output −
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
usage: hcat { -e "<query>" | -f "<filepath>" }
[ -g "<group>" ] [ -p "<perms>" ]
[ -D"<name> = <value>" ]
-D <property = value> use hadoop value for given property
-e <exec> hcat command given from command line
-f <file> hcat commands in file
-g <group> group for the db/table specified in CREATE statement
-h,--help Print help information
-p <perms> permissions for the db/table specified in CREATE statement
HCatalog CLI 支持这些命令行选项 −
The HCatalog CLI supports these command line options −
Sr.No | Option | Example & Description
1 | -g | hcat -g mygroup … The table to be created must have the group "mygroup".
2 | -p | hcat -p rwxr-xr-x … The table to be created must have read, write, and execute permissions.
3 | -f | hcat -f myscript.HCatalog … myscript.HCatalog is a script file containing DDL commands to execute.
4 | -e | hcat -e 'create table mytable(a int);' … Treat the following string as a DDL command and execute it.
5 | -D | hcat -Dkey=value … Passes the key-value pair to HCatalog as a Java system property.
6 | - | hcat … Prints a usage message.
Note −
- The -g and -p options are not mandatory.
- At one time, either the -e or the -f option can be provided, not both.
- The order of options is immaterial; you can specify the options in any order.
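For example, the options above can be combined in a single invocation; the group name, permissions, and table below are illustrative −
./hcat -g mygroup -p rwxr-xr-x -e "CREATE TABLE IF NOT EXISTS groupedtbl (a INT);"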
Sr.No | DDL Command & Description
1 | CREATE TABLE − Create a table using HCatalog. If you create a table with a CLUSTERED BY clause, you will not be able to write to it with Pig or MapReduce.
2 | ALTER TABLE − Supported except for the REBUILD and CONCATENATE options. Its behavior remains the same as in Hive.
3 | DROP TABLE − Supported. Behavior the same as Hive (drop the complete table and structure).
4 | CREATE/ALTER/DROP VIEW − Supported. Behavior same as Hive. Note − Pig and MapReduce cannot read from or write to views.
5 | SHOW TABLES − Display a list of tables.
6 | SHOW PARTITIONS − Display a list of partitions.
7 | CREATE/DROP FUNCTION − CREATE and DROP FUNCTION operations are supported, but the created functions must still be registered in Pig and placed in CLASSPATH for MapReduce.
8 | DESCRIBE − Supported. Behavior same as Hive. Describe the structure.
上表中的一些命令在后续章节中进行了说明。
Some of the commands from the above table are explained in subsequent chapters.
HCatalog - Create Table
本章介绍了如何创建表以及如何向其中插入数据。在 HCatalog 中创建表的约定与使用 Hive 创建表非常相似。
This chapter explains how to create a table and how to insert data into it. The conventions for creating a table in HCatalog are quite similar to those for creating a table using Hive.
Create Table Statement
Create Table 是一个用于使用 HCatalog 在 Hive Metastore 中创建表的语句。它的语法和示例如下 −
Create Table is a statement used to create a table in Hive metastore using HCatalog. Its syntax and example are as follows −
Syntax
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]
Example
让我们假设您需要使用 CREATE TABLE 语句创建名为 employee 的表。下表列出了 employee 表中的字段及其数据类型 −
Let us assume you need to create a table named employee using CREATE TABLE statement. The following table lists the fields and their data types in the employee table −
Sr.No | Field Name | Data Type
1 | Eid | int
2 | Name | String
3 | Salary | Float
4 | Designation | String
以下数据定义了 Comment 等受支持字段、行格式字段例如 Field terminator 、 Lines terminator 和 Stored File type 。
The following data defines the supported fields such as the table Comment, the row format fields such as the Field terminator and Lines terminator, and the Stored File type.
COMMENT 'Employee details'
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
以下查询使用上述数据创建名为 employee 的表。
The following query creates a table named employee using the above data.
./hcat –e "CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
salary String, destination String) \
COMMENT 'Employee details' \
ROW FORMAT DELIMITED \
FIELDS TERMINATED BY ‘\t’ \
LINES TERMINATED BY ‘\n’ \
STORED AS TEXTFILE;"
如果添加选项 IF NOT EXISTS ,则在表已存在的情况下,HCatalog 忽略该声明。
If you add the option IF NOT EXISTS, HCatalog ignores the statement in case the table already exists.
当表创建成功时,您可以看到以下响应:
On successful creation of table, you get to see the following response −
OK
Time taken: 5.905 seconds
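To confirm the table definition, you can use the DESCRIBE command listed in the CLI chapter; it prints each column of employee together with its data type −
./hcat -e "DESCRIBE employee;"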
Load Data Statement
总体上,在 SQL 中创建一个表之后,我们可以使用 Insert 声明插入数据。但在 HCatalog 中,我们使用 LOAD DATA 声明插入数据。
Generally, after creating a table in SQL, we can insert data using the Insert statement. But in HCatalog, we insert data using the LOAD DATA statement.
向 HCatalog 插入数据时,最好使用 LOAD DATA 来存储批量记录。有两种方法可用于加载数据:一种是从 local file system ,另一种是从 Hadoop file system 。
While inserting data into HCatalog, it is better to use LOAD DATA to store bulk records. There are two ways to load data: one is from local file system and second is from Hadoop file system.
Syntax
LOAD DATA 的语法如下:
The syntax for LOAD DATA is as follows −
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]
-
LOCAL is the identifier to specify the local path. It is optional.
-
OVERWRITE is optional to overwrite the data in the table.
-
PARTITION is optional.
Example
我们将向表中插入以下数据。它是一个文本文件,名为 sample.txt ,位于 /home/user 目录中。
We will insert the following data into the table. It is a text file named sample.txt in /home/user directory.
1201 Gopal 45000 Technical manager
1202 Manisha 45000 Proof reader
1203 Masthanvali 40000 Technical writer
1204 Kiran 40000 Hr Admin
1205 Kranthi 30000 Op Admin
以下查询将给定的文本加载到表中。
The following query loads the given text into the table.
./hcat –e "LOAD DATA LOCAL INPATH '/home/user/sample.txt'
OVERWRITE INTO TABLE employee;"
数据加载成功后,你会看到以下响应 −
On successful loading of the data, you get to see the following response −
OK
Time taken: 15.905 seconds
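To load the file from HDFS instead of the local file system, omit the LOCAL keyword; the HDFS path below is only illustrative −
./hcat -e "LOAD DATA INPATH '/user/hadoop/sample.txt' INTO TABLE employee;"
Without the OVERWRITE keyword, the new records are appended to the data already in the table.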
HCatalog - Alter Table
本章介绍如何修改表的属性,例如更改表名、更改列名、添加列以及删除或替换列。
This chapter explains how to alter the attributes of a table such as changing its table name, changing column names, adding columns, and deleting or replacing columns.
Alter Table Statement
您可以使用 ALTER TABLE 语句来修改 Hive 中的表。
You can use the ALTER TABLE statement to alter a table in Hive.
Syntax
该语句根据我们希望在表中修改哪些属性采用以下任一语法。
The statement takes any of the following syntaxes based on what attributes we wish to modify in a table.
ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
下面介绍一些场景。
Some of the scenarios are explained below.
Rename To… Statement
以下查询将表 employee 重命名为 emp 。
The following query renames a table from employee to emp.
./hcat –e "ALTER TABLE employee RENAME TO emp;"
Change Statement
下表包含 employee 表的字段,并显示需要更改的字段(name 和 salary)。
The following table contains the fields of the employee table and shows the fields to be changed (name and salary).
Field Name | Convert from Data Type | Change Field Name | Convert to Data Type
eid | int | eid | int
name | String | ename | String
salary | Float | salary | Double
designation | String | designation | String
以下查询使用上述数据重命名列名和列数据类型 −
The following queries rename the column name and column data type using the above data −
./hcat –e "ALTER TABLE employee CHANGE name ename String;"
./hcat –e "ALTER TABLE employee CHANGE salary salary Double;"
Add Columns Statement
以下查询向 employee 表中添加一个名为 dept 的列。
The following query adds a column named dept to the employee table.
./hcat –e "ALTER TABLE employee ADD COLUMNS (dept STRING COMMENT 'Department name');"
Replace Statement
以下查询从 employee 表中删除所有列,并用 empid 和 name 列替换它们 −
The following query deletes all the columns from the employee table and replaces them with the empid and name columns −
./hcat -e "ALTER TABLE employee REPLACE COLUMNS (empid Int, name String);"
Drop Table Statement
本章介绍如何在 HCatalog 中删除表。当您从元存储中删除表时,它会删除表/列数据及其元数据。它可以是普通表(存储在元存储中)或外部表(存储在本地文件系统中);无论类型如何,HCatalog 都以相同的方式处理它们。
This chapter describes how to drop a table in HCatalog. When you drop a table from the metastore, it removes the table/column data and their metadata. It can be a normal table (stored in metastore) or an external table (stored in local file system); HCatalog treats both in the same manner, irrespective of their types.
其语法如下:
The syntax is as follows −
DROP TABLE [IF EXISTS] table_name;
以下查询删除名为 employee 的表格−
The following query drops a table named employee −
./hcat –e "DROP TABLE IF EXISTS employee;"
成功执行查询后,你会看到以下响应−
On successful execution of the query, you get to see the following response −
OK
Time taken: 5.3 seconds
HCatalog - View
本章介绍如何在HCatalog中创建并管理 view 。数据库视图使用 CREATE VIEW 语句创建。视图可以从单个表格、多个表格或另一个视图创建。
This chapter describes how to create and manage a view in HCatalog. Database views are created using the CREATE VIEW statement. Views can be created from a single table, multiple tables, or another view.
要创建视图,用户必须根据具体实现具有适当的系统权限。
To create a view, a user must have appropriate system privileges according to the specific implementation.
Create View Statement
CREATE VIEW 创建一个具有给定名称的视图。如果已经存在具有相同名称的表或视图,则会引发错误。可以使用 IF NOT EXISTS 来跳过该错误。
CREATE VIEW creates a view with the given name. An error is thrown if a table or view with the same name already exists. You can use IF NOT EXISTS to skip the error.
如果没有提供列名,则视图的列名将自动从 defining SELECT expression 派生。
If no column names are supplied, the names of the view’s columns will be derived automatically from the defining SELECT expression.
Note − 如果 SELECT 包含未别名的标量表达式,例如 x+y,则生成的视图列名将采用 _C0、_C1 等形式。
Note − If the SELECT contains un-aliased scalar expressions such as x+y, the resulting view column names will be generated in the form _C0, _C1, etc.
重命名列时,还可以提供列注释。注释不会自动从基础列继承。
When renaming columns, column comments can also be supplied. Comments are not automatically inherited from the underlying columns.
如果视图的 defining SELECT expression 无效,则 CREATE VIEW 语句将失败。
A CREATE VIEW statement will fail if the view’s defining SELECT expression is invalid.
Syntax
CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT column_comment], ...) ]
[COMMENT view_comment]
[TBLPROPERTIES (property_name = property_value, ...)]
AS SELECT ...;
Example
以下是员工表数据。现在让我们看看如何创建一个名为 Emp_Deg_View 的视图,其中包含薪水高于 35000 的员工的 ID、姓名、职位和薪水字段。
The following is the employee table data. Now let us see how to create a view named Emp_Deg_View containing the fields id, name, Designation, and salary of an employee having a salary greater than 35,000.
+------+-------------+--------+-------------------+-------+
| ID | Name | Salary | Designation | Dept |
+------+-------------+--------+-------------------+-------+
| 1201 | Gopal | 45000 | Technical manager | TP |
| 1202 | Manisha | 45000 | Proofreader | PR |
| 1203 | Masthanvali | 30000 | Technical writer | TP |
| 1204 | Kiran | 40000 | Hr Admin | HR |
| 1205 | Kranthi | 30000 | Op Admin | Admin |
+------+-------------+--------+-------------------+-------+
以下是基于以上给定数据创建视图的命令。
The following is the command to create a view based on the above given data.
./hcat –e "CREATE VIEW Emp_Deg_View (salary COMMENT ' salary more than 35,000')
AS SELECT id, name, salary, designation FROM employee WHERE salary ≥ 35000;"
Drop View Statement
DROP VIEW 删除指定视图的元数据。删除其他视图引用的视图时,不会给出警告(依赖视图会一直处于无效状态,并且必须由用户删除或重新创建)。
DROP VIEW removes metadata for the specified view. When dropping a view referenced by other views, no warning is given (the dependent views are left dangling as invalid and must be dropped or recreated by the user).
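Its syntax is DROP VIEW [IF EXISTS] view_name. For example, the following command removes the view created above −
./hcat -e "DROP VIEW IF EXISTS Emp_Deg_View;"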
HCatalog - Show Tables
您通常希望列出数据库中的所有表或列出表中的所有列。显然,每个数据库都有自己列出表和列的语法。
You often want to list all the tables in a database or list all the columns in a table. Obviously, every database has its own syntax to list the tables and columns.
Show Tables 语句显示所有表的名称。默认情况下,它会列出当前数据库中的表,或者在 IN 子句中,在指定数据库中列出表。
Show Tables statement displays the names of all tables. By default, it lists tables from the current database, or with the IN clause, in a specified database.
本节描述如何列出 HCatalog 中当前数据库中的所有表。
This chapter describes how to list out all tables from the current database in HCatalog.
Show Tables Statement
SHOW TABLES 的语法如下 −
The syntax of SHOW TABLES is as follows −
SHOW TABLES [IN database_name] ['identifier_with_wildcards'];
以下查询显示表列表 −
The following query displays a list of tables −
./hcat –e "Show tables;"
成功执行查询后,你会看到以下响应−
On successful execution of the query, you get to see the following response −
OK
emp
employee
Time taken: 5.3 seconds
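You can also filter the result using the wildcard pattern shown in the syntax above. For example, the following command lists only the tables whose names start with emp −
./hcat -e "SHOW TABLES 'emp*';"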
HCatalog - Show Partitions
分区是用于创建单独表格或视图的表格数据条件。SHOW PARTITIONS列出给定基础表的所有现有分区。分区按字母顺序列出。在Hive 0.6之后,还可以指定分区规范的部分以筛选结果列表。
A partition is a condition for tabular data which is used for creating a separate table or view. SHOW PARTITIONS lists all the existing partitions for a given base table. Partitions are listed in alphabetical order. After Hive 0.6, it is also possible to specify parts of a partition specification to filter the resulting list.
可以使用SHOW PARTITIONS命令查看特定表格中存在的分区。本章介绍如何列出HCatalog中特定表格的分区。
You can use the SHOW PARTITIONS command to see the partitions that exist in a particular table. This chapter describes how to list out the partitions of a particular table in HCatalog.
Show Partitions Statement
其语法如下:
The syntax is as follows −
SHOW PARTITIONS table_name;
以下查询列出名为 employee 的表的分区 −
The following query lists the partitions of the table named employee −
./hcat -e "Show partitions employee;"
成功执行查询后,你会看到以下响应−
On successful execution of the query, you get to see the following response −
OK
Designation = IT
Time taken: 5.3 seconds
Dynamic Partition
HCatalog将表格组织成分区。这是一种基于分区列,如日期、城市和部门的值,将表格划分为相关部分的方式。使用分区,很容易查询部分数据。
HCatalog organizes tables into partitions. It is a way of dividing a table into related parts based on the values of partitioned columns such as date, city, and department. Using partitions, it is easy to query a portion of the data.
例如,名为 Tab1 的表格包含员工数据,如id、姓名、部门和yoj(即参加年份)。假设你需要检索2012年加入的所有员工的详细信息。查询搜索整个表格以获取所需信息。然而,如果您使用年份对员工数据进行分区并将其存储在单独的文件中,则会减少查询处理时间。以下示例显示如何对文件及其数据进行分区 −
For example, a table named Tab1 contains employee data such as id, name, dept, and yoj (i.e., year of joining). Suppose you need to retrieve the details of all employees who joined in 2012. A query searches the whole table for the required information. However, if you partition the employee data with the year and store it in a separate file, it reduces the query processing time. The following example shows how to partition a file and its data −
以下文件包含 employeedata 表格。
The following file contains employeedata table.
Adding a Partition
我们可以通过更改表格来向表格添加分区。让我们假设我们有一个名为 employee 的表格,其中包含诸如Id、姓名、工资、职务、部门和yoj等字段。
We can add partitions to a table by altering the table. Let us assume we have a table called employee with fields such as Id, Name, Salary, Designation, Dept, and yoj.
Syntax
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec
[LOCATION 'location1'] partition_spec [LOCATION 'location2'] ...;
partition_spec:
: (p_column = p_col_value, p_column = p_col_value, ...)
以下查询用于向 employee 表格添加分区。
The following query is used to add a partition to the employee table.
./hcat –e "ALTER TABLE employee ADD PARTITION (year = '2013') location '/2012/part2012';"
Renaming a Partition
可以使用RENAME-TO命令重命名分区。它的语法如下−
You can use the RENAME-TO command to rename a partition. Its syntax is as follows −
./hcat -e "ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec;"
以下查询用于重命名分区 −
The following query is used to rename a partition −
./hcat –e "ALTER TABLE employee PARTITION (year=’1203’) RENAME TO PARTITION (Yoj='1203');"
Dropping a Partition
用于删除分区的命令的语法如下−
The syntax of the command that is used to drop a partition is as follows −
./hcat –e "ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec,.
PARTITION partition_spec,...;"
以下查询用于删除分区−
The following query is used to drop a partition −
./hcat –e "ALTER TABLE employee DROP [IF EXISTS] PARTITION (year=’1203’);"
HCatalog - Indexes
Creating an Index
索引实际上只是表某个特定列上的指针。创建索引是指在表的特定列上创建一个指针。其语法如下:
An Index is nothing but a pointer on a particular column of a table. Creating an index means creating a pointer on a particular column of a table. Its syntax is as follows −
CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name = property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)][
[ ROW FORMAT ...] STORED AS ...
| STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
Example
让我们举一个例子来理解索引的概念。使用我们先前使用过的、包含字段 Id、Name、Salary、Designation 和 Dept 的相同 employee 表。在 employee 表的 salary 列上创建一个名为 index_salary 的索引。
Let us take an example to understand the concept of index. Use the same employee table that we have used earlier with the fields Id, Name, Salary, Designation, and Dept. Create an index named index_salary on the salary column of the employee table.
以下查询创建了一个索引:
The following query creates an index −
./hcat –e "CREATE INDEX inedx_salary ON TABLE employee(salary)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';"
它是 salary 列的指针。如果该列被修改,则更改将使用索引值进行存储。
It is a pointer to the salary column. If the column is modified, the changes are stored using an index value.
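To inspect or remove an index, you can use the standard Hive DDL commands through the HCatalog CLI −
./hcat -e "SHOW FORMATTED INDEX ON employee;"
./hcat -e "DROP INDEX IF EXISTS index_salary ON employee;"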
HCatalog - Reader Writer
HCatalog 包含一个数据传输 API,用于在不使用 MapReduce 的情况下实现并行输入和输出。此 API 使用表的存储和行的基本抽象从 Hadoop 集群读取数据,并向其中写入数据。
HCatalog contains a data transfer API for parallel input and output without using MapReduce. This API uses a basic storage abstraction of tables and rows to read data from Hadoop cluster and write data into it.
数据传输 API 主要包含三个类:
The Data Transfer API contains mainly three classes; those are −
-
HCatReader − Reads data from a Hadoop cluster.
-
HCatWriter − Writes data into a Hadoop cluster.
-
DataTransferFactory − Generates reader and writer instances.
此 API 适合主从节点设置。让我们进一步讨论 HCatReader 和 HCatWriter 。
This API is suitable for master-slave node setup. Let us discuss more on HCatReader and HCatWriter.
HCatReader
HCatReader 是 HCatalog 的一个内部抽象类,它抽象了从其检索记录的底层系统中的复杂性。
HCatReader is an abstract class internal to HCatalog and abstracts away the complexities of the underlying system from where the records are to be retrieved.
S. No. | Method Name & Description
1 | public abstract ReaderContext prepareRead() throws HCatException − This should be called at the master node to obtain a ReaderContext, which should then be serialized and sent to the slave nodes.
2 | public abstract Iterator<HCatRecord> read() throws HCatException − This should be called at the slave nodes to read HCatRecords.
3 | public Configuration getConf() − It will return the configuration class object.
HCatReader 类用于读取 HDFS 中的数据。阅读是一个两步过程,其中第一步发生在外部系统的 master 节点上。第二步在多个 slave 节点上并行执行。
The HCatReader class is used to read the data from HDFS. Reading is a two-step process in which the first step occurs on the master node of an external system. The second step is carried out in parallel on multiple slave nodes.
读取在 ReadEntity 上完成。在开始读取之前,你需要定义一个 ReadEntity 用于读取。可以通过 ReadEntity.Builder 完成。你可以指定一个数据库名称、表名称、分区和过滤字符串。例如 −
Reads are done on a ReadEntity. Before you start to read, you need to define a ReadEntity from which to read. This can be done through ReadEntity.Builder. You can specify a database name, table name, partition, and filter string. For example −
ReadEntity.Builder builder = new ReadEntity.Builder();
ReadEntity entity = builder.withDatabase("mydb").withTable("mytbl").build();
上面的代码片段定义了一个 ReadEntity 对象(“entity”),它包含一个名为 mytbl 的表和一个名为 mydb 的数据库,可用于读取该表的所有行。请注意,此表必须在该操作开始之前存在于 HCatalog 中。
The above code snippet defines a ReadEntity object (“entity”), comprising a table named mytbl in a database named mydb, which can be used to read all the rows of this table. Note that this table must exist in HCatalog prior to the start of this operation.
在定义 ReadEntity 之后,你可以使用 ReadEntity 和集群配置获取 HCatReader 的实例 −
After defining a ReadEntity, you obtain an instance of HCatReader using the ReadEntity and cluster configuration −
HCatReader reader = DataTransferFactory.getHCatReader(entity, config);
下一步是从 reader 获取一个 ReaderContext,如下所示:
The next step is to obtain a ReaderContext from reader as follows −
ReaderContext cntxt = reader.prepareRead();
HCatWriter
该抽象是 HCatalog 内部实现。这便于从外部系统写入 HCatalog。请勿尝试直接实例化它。相反,请使用 DataTransferFactory。
This abstraction is internal to HCatalog. This is to facilitate writing to HCatalog from external systems. Don’t try to instantiate this directly. Instead, use DataTransferFactory.
Sr.No. | Method Name & Description
1 | public abstract WriterContext prepareWrite() throws HCatException − The external system should invoke this method exactly once from a master node. It returns a WriterContext, which should be serialized and sent to the slave nodes to construct an HCatWriter there.
2 | public abstract void write(Iterator<HCatRecord> recordItr) throws HCatException − This method should be used at slave nodes to perform writes. The recordItr is an iterator object that contains the collection of records to be written into HCatalog.
3 | public abstract void abort(WriterContext cntxt) throws HCatException − This method should be called at the master node. The primary purpose of this method is to do cleanups in case of failures.
4 | public abstract void commit(WriterContext cntxt) throws HCatException − This method should be called at the master node. The purpose of this method is to do a metadata commit.
与阅读类似,写入也是一个两步过程,其中第一步发生在 master 节点上。随后,第二步在 slave 节点上并行执行。
Similar to reading, writing is also a two-step process in which the first step occurs on the master node. Subsequently, the second step occurs in parallel on slave nodes.
在 WriteEntity 上执行写操作,可以按照与读取类似的方式构建 −
Writes are done on a WriteEntity which can be constructed in a fashion similar to reads −
WriteEntity.Builder builder = new WriteEntity.Builder();
WriteEntity entity = builder.withDatabase("mydb").withTable("mytbl").build();
上面的代码创建了一个 WriteEntity 对象 entity,可以用来写到数据库 mydb 中名为 mytbl 的表。
The above code creates a WriteEntity object entity which can be used to write into a table named mytbl in the database mydb.
在创建 WriteEntity 之后,下一步是获取一个 WriterContext −
After creating a WriteEntity, the next step is to obtain a WriterContext −
HCatWriter writer = DataTransferFactory.getHCatWriter(entity, config);
WriterContext info = writer.prepareWrite();
上述所有步骤都在 master 节点上发生。然后,master 节点将 WriterContext 对象序列化,并使所有 slave 可以使用它。
All of the above steps occur on the master node. The master node then serializes the WriterContext object and makes it available to all the slaves.
在 slave 节点上,你需要使用 WriterContext 获取一个 HCatWriter,如下所示:
On slave nodes, you need to obtain an HCatWriter using WriterContext as follows −
HCatWriter writer = DataTransferFactory.getHCatWriter(context);
然后, writer 将迭代器作为写方法的参数 −
Then, the writer takes an iterator as the argument for the write method −
writer.write(hCatRecordItr);
然后, writer 在循环中对该迭代器调用 getNext() ,并写出附加到迭代器上的所有记录。
The writer then calls getNext() on this iterator in a loop and writes out all the records attached to the iterator.
TestReaderWriter.java 文件用于测试 HCatreader 和 HCatWriter 类。以下程序演示如何使用 HCatReader 和 HCatWriter API 从源文件读取数据,并随后将其写入目标文件。
The TestReaderWriter.java file is used to test the HCatReader and HCatWriter classes. The following program demonstrates how to use the HCatReader and HCatWriter APIs to read data from a source file and subsequently write it onto a destination file.
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.metastore.api.MetaException;
import org.apache.hadoop.hive.ql.CommandNeedRetryException;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hive.hcatalog.common.HCatException;
import org.apache.hive.hcatalog.data.DefaultHCatRecord;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.data.transfer.DataTransferFactory;
import org.apache.hive.hcatalog.data.transfer.HCatReader;
import org.apache.hive.hcatalog.data.transfer.HCatWriter;
import org.apache.hive.hcatalog.data.transfer.ReadEntity;
import org.apache.hive.hcatalog.data.transfer.ReaderContext;
import org.apache.hive.hcatalog.data.transfer.WriteEntity;
import org.apache.hive.hcatalog.data.transfer.WriterContext;
import org.apache.hive.hcatalog.mapreduce.HCatBaseTest;
import org.junit.Assert;
import org.junit.Test;
public class TestReaderWriter extends HCatBaseTest {
@Test
public void test() throws MetaException, CommandNeedRetryException,
IOException, ClassNotFoundException {
driver.run("drop table mytbl");
driver.run("create table mytbl (a string, b int)");
Iterator<Entry<String, String>> itr = hiveConf.iterator();
Map<String, String> map = new HashMap<String, String>();
while (itr.hasNext()) {
Entry<String, String> kv = itr.next();
map.put(kv.getKey(), kv.getValue());
}
WriterContext cntxt = runsInMaster(map);
File writeCntxtFile = File.createTempFile("hcat-write", "temp");
writeCntxtFile.deleteOnExit();
// Serialize context.
ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(writeCntxtFile));
oos.writeObject(cntxt);
oos.flush();
oos.close();
// Now, deserialize it.
ObjectInputStream ois = new ObjectInputStream(new FileInputStream(writeCntxtFile));
cntxt = (WriterContext) ois.readObject();
ois.close();
runsInSlave(cntxt);
commit(map, true, cntxt);
ReaderContext readCntxt = runsInMaster(map, false);
File readCntxtFile = File.createTempFile("hcat-read", "temp");
readCntxtFile.deleteOnExit();
oos = new ObjectOutputStream(new FileOutputStream(readCntxtFile));
oos.writeObject(readCntxt);
oos.flush();
oos.close();
ois = new ObjectInputStream(new FileInputStream(readCntxtFile));
readCntxt = (ReaderContext) ois.readObject();
ois.close();
for (int i = 0; i < readCntxt.numSplits(); i++) {
runsInSlave(readCntxt, i);
}
}
private WriterContext runsInMaster(Map<String, String> config) throws HCatException {
WriteEntity.Builder builder = new WriteEntity.Builder();
WriteEntity entity = builder.withTable("mytbl").build();
HCatWriter writer = DataTransferFactory.getHCatWriter(entity, config);
WriterContext info = writer.prepareWrite();
return info;
}
private ReaderContext runsInMaster(Map<String, String> config,
boolean bogus) throws HCatException {
ReadEntity entity = new ReadEntity.Builder().withTable("mytbl").build();
HCatReader reader = DataTransferFactory.getHCatReader(entity, config);
ReaderContext cntxt = reader.prepareRead();
return cntxt;
}
private void runsInSlave(ReaderContext cntxt, int slaveNum) throws HCatException {
HCatReader reader = DataTransferFactory.getHCatReader(cntxt, slaveNum);
Iterator<HCatRecord> itr = reader.read();
int i = 1;
while (itr.hasNext()) {
HCatRecord read = itr.next();
HCatRecord written = getRecord(i++);
// Argh, HCatRecord doesnt implement equals()
Assert.assertTrue("Read: " + read.get(0) + "Written: " + written.get(0),
written.get(0).equals(read.get(0)));
Assert.assertTrue("Read: " + read.get(1) + "Written: " + written.get(1),
written.get(1).equals(read.get(1)));
Assert.assertEquals(2, read.size());
}
//Assert.assertFalse(itr.hasNext());
}
private void runsInSlave(WriterContext context) throws HCatException {
HCatWriter writer = DataTransferFactory.getHCatWriter(context);
writer.write(new HCatRecordItr());
}
private void commit(Map<String, String> config, boolean status,
WriterContext context) throws IOException {
WriteEntity.Builder builder = new WriteEntity.Builder();
WriteEntity entity = builder.withTable("mytbl").build();
HCatWriter writer = DataTransferFactory.getHCatWriter(entity, config);
if (status) {
writer.commit(context);
} else {
writer.abort(context);
}
}
private static HCatRecord getRecord(int i) {
List<Object> list = new ArrayList<Object>(2);
list.add("Row #: " + i);
list.add(i);
return new DefaultHCatRecord(list);
}
private static class HCatRecordItr implements Iterator<HCatRecord> {
int i = 0;
@Override
public boolean hasNext() {
return i++ < 100 ? true : false;
}
@Override
public HCatRecord next() {
return getRecord(i);
}
@Override
public void remove() {
throw new RuntimeException();
}
}
}
以上程序以记录的形式从 HDFS 读取数据,并将记录数据写入 mytbl 表。
The above program reads the data from HDFS in the form of records and writes the record data into the table mytbl.
HCatalog - Input Output Format
HCatInputFormat 和 HCatOutputFormat 用于读取 HDFS 的数据,并在处理后使用 MapReduce 作业将结果数据写入 HDFS。我们来详细介绍输入和输出格式接口。
The HCatInputFormat and HCatOutputFormat interfaces are used to read data from HDFS and, after processing, write the resultant data into HDFS using a MapReduce job. Let us elaborate on the input and output format interfaces.
HCatInputFormat
HCatInputFormat 用于与 MapReduce 作业一起,从 HCatalog 管理的表读取数据。HCatInputFormat 暴露出 Hadoop 0.20 MapReduce API,用于读取数据,就像已发表到表中一样。
The HCatInputFormat is used with MapReduce jobs to read data from HCatalog-managed tables. HCatInputFormat exposes a Hadoop 0.20 MapReduce API for reading data as if it had been published to a table.
Sr.No. | Method Name & Description
1 | public static HCatInputFormat setInput(Job job, String dbName, String tableName) throws IOException − Set inputs to use for the job. It queries the metastore with the given input specification and serializes matching partitions into the job configuration for MapReduce tasks.
2 | public static HCatInputFormat setInput(Configuration conf, String dbName, String tableName) throws IOException − Set inputs to use for the job. It queries the metastore with the given input specification and serializes matching partitions into the job configuration for MapReduce tasks.
3 | public HCatInputFormat setFilter(String filter) throws IOException − Set a filter on the input table.
4 | public HCatInputFormat setProperties(Properties properties) throws IOException − Set properties for the input format.
HCatInputFormat API 包括以下方法 -
The HCatInputFormat API includes the following methods −
- setInput
- setOutputSchema
- getTableSchema
要使用 HCatInputFormat 读取数据,请首先使用正在读取的表的必要信息实例化一个 InputJobInfo ,然后使用 InputJobInfo 调用 setInput 。
To use HCatInputFormat to read data, first instantiate an InputJobInfo with the necessary information from the table being read and then call setInput with the InputJobInfo.
可以使用 setOutputSchema 方法包含 projection schema ,以指定输出字段。如果没有指定架构,则会返回表中的所有列。可以使用 getTableSchema 方法来确定指定输入表表的架构。
You can use the setOutputSchema method to include a projection schema, to specify the output fields. If a schema is not specified, all the columns in the table will be returned. You can use the getTableSchema method to determine the table schema for a specified input table.
HCatOutputFormat
HCatOutputFormat 用于向 HCatalog 管理的表写入数据的 MapReduce 作业。HCatOutputFormat 暴露出 Hadoop 0.20 MapReduce API,用于将数据写入表。当 MapReduce 作业使用 HCatOutputFormat 写入输出时,将使用为表配置的默认 OutputFormat,并在作业完成后将新分区发布到表中。
HCatOutputFormat is used with MapReduce jobs to write data to HCatalog-managed tables. HCatOutputFormat exposes a Hadoop 0.20 MapReduce API for writing data to a table. When a MapReduce job uses HCatOutputFormat to write output, the default OutputFormat configured for the table is used and the new partition is published to the table after the job completes.
Sr.No. | Method Name & Description
1 | public static void setOutput(Configuration conf, Credentials credentials, OutputJobInfo outputJobInfo) throws IOException − Set the information about the output to write for the job. It queries the metadata server to find the StorageHandler to use for the table. It throws an error if the partition is already published.
2 | public static void setSchema(Configuration conf, HCatSchema schema) throws IOException − Set the schema for the data being written out to the partition. The table schema is used by default for the partition if this is not called.
3 | public RecordWriter<WritableComparable<?>, HCatRecord> getRecordWriter(TaskAttemptContext context) throws IOException, InterruptedException − Get the record writer for the job. It uses the StorageHandler's default OutputFormat to get the record writer.
4 | public OutputCommitter getOutputCommitter(TaskAttemptContext context) throws IOException, InterruptedException − Get the output committer for this output format. It ensures that the output is committed correctly.
HCatOutputFormat API 包括以下方法 -
The HCatOutputFormat API includes the following methods −
- setOutput
- setSchema
- getTableSchema
HCatOutputFormat 中的第一个调用必须是 setOutput ;其他任何调用都会引发一个异常,表明输出格式未初始化。
The first call on the HCatOutputFormat must be setOutput; any other call will throw an exception saying the output format is not initialized.
通过 setSchema 方法指定待写入数据的架构。必须调用此方法,提供正在写入数据的架构。如果您的数据与表架构具有相同的架构,则可以使用 HCatOutputFormat.getTableSchema() 获取表架构,然后将其传递给 setSchema() 。
The schema for the data being written out is specified by the setSchema method. You must call this method, providing the schema of data you are writing. If your data has the same schema as the table schema, you can use HCatOutputFormat.getTableSchema() to get the table schema and then pass that along to setSchema().
Example
以下 MapReduce 程序从一个表中读取数据,它假定第二列(“列 1”)中是整数,并统计它找到的每个不同值出现的次数。也就是说,它执行的操作等价于 “select col1, count(*) from $table group by col1;”。
The following MapReduce program reads data from one table, which it assumes to have an integer in the second column ("column 1"), and counts how many instances of each distinct value it finds. That is, it does the equivalent of "select col1, count(*) from $table group by col1;".
例如,如果第二列中的值为 {1, 1, 1, 3, 3, 5},那么程序将生成以下值和次数输出 −
For example, if the values in the second column are {1, 1, 1, 3, 3, 5}, then the program will produce the following output of values and counts −
1, 3
3, 2
5, 1
我们现在来看一下程序代码 −
Let us now take a look at the program code −
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hcatalog.common.HCatConstants;
import org.apache.hcatalog.data.DefaultHCatRecord;
import org.apache.hcatalog.data.HCatRecord;
import org.apache.hcatalog.data.schema.HCatSchema;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.HCatOutputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;
import org.apache.hcatalog.mapreduce.OutputJobInfo;

public class GroupByAge extends Configured implements Tool {

   // Mapper: emits (value of column 1, 1) for every input record.
   public static class Map extends Mapper<WritableComparable,
      HCatRecord, IntWritable, IntWritable> {
      int age;

      @Override
      protected void map(
         WritableComparable key, HCatRecord value,
         org.apache.hadoop.mapreduce.Mapper<WritableComparable,
         HCatRecord, IntWritable, IntWritable>.Context context
      ) throws IOException, InterruptedException {
         age = (Integer) value.get(1);
         context.write(new IntWritable(age), new IntWritable(1));
      }
   }

   // Reducer: counts the occurrences of each distinct value and writes an
   // HCatRecord of (value, count) to the output table.
   public static class Reduce extends Reducer<IntWritable, IntWritable,
      WritableComparable, HCatRecord> {

      @Override
      protected void reduce(
         IntWritable key, java.lang.Iterable<IntWritable> values,
         org.apache.hadoop.mapreduce.Reducer<IntWritable, IntWritable,
         WritableComparable, HCatRecord>.Context context
      ) throws IOException, InterruptedException {
         int sum = 0;
         Iterator<IntWritable> iter = values.iterator();

         while (iter.hasNext()) {
            sum++;
            iter.next();
         }

         HCatRecord record = new DefaultHCatRecord(2);
         record.set(0, key.get());
         record.set(1, sum);
         context.write(null, record);
      }
   }

   public int run(String[] args) throws Exception {
      Configuration conf = getConf();
      args = new GenericOptionsParser(conf, args).getRemainingArgs();

      String serverUri = args[0];
      String inputTableName = args[1];
      String outputTableName = args[2];
      String dbName = null;

      String principalID = System
         .getProperty(HCatConstants.HCAT_METASTORE_PRINCIPAL);
      if (principalID != null)
         conf.set(HCatConstants.HCAT_METASTORE_PRINCIPAL, principalID);

      Job job = new Job(conf, "GroupByAge");

      // Initialize HCatInputFormat: read every partition of the input table.
      HCatInputFormat.setInput(job, InputJobInfo.create(dbName, inputTableName, null));
      job.setInputFormatClass(HCatInputFormat.class);

      job.setJarByClass(GroupByAge.class);
      job.setMapperClass(Map.class);
      job.setReducerClass(Reduce.class);
      job.setMapOutputKeyClass(IntWritable.class);
      job.setMapOutputValueClass(IntWritable.class);
      job.setOutputKeyClass(WritableComparable.class);
      job.setOutputValueClass(DefaultHCatRecord.class);

      // Initialize HCatOutputFormat: write into the output table using its own schema.
      HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbName, outputTableName, null));
      HCatSchema s = HCatOutputFormat.getTableSchema(job);
      System.err.println("INFO: output schema explicitly set for writing:" + s);
      HCatOutputFormat.setSchema(job, s);
      job.setOutputFormatClass(HCatOutputFormat.class);

      return (job.waitForCompletion(true) ? 0 : 1);
   }

   public static void main(String[] args) throws Exception {
      int exitCode = ToolRunner.run(new GroupByAge(), args);
      System.exit(exitCode);
   }
}
在编译上述程序之前,您必须先下载一些 jar 并将它们添加到此应用程序的 classpath 中。您需要下载所有 Hive jar 和 HCatalog jar(hcatalog-core-0.5.0.jar、hive-metastore-0.10.0.jar、libthrift-0.7.0.jar、hive-exec-0.10.0.jar、libfb303-0.7.0.jar、jdo2-api-2.3-ec.jar、slf4j-api-1.6.1.jar)。
Before compiling the above program, you have to download some jars and add them to the classpath of this application. You need to download all the Hive jars and HCatalog jars (hcatalog-core-0.5.0.jar, hive-metastore-0.10.0.jar, libthrift-0.7.0.jar, hive-exec-0.10.0.jar, libfb303-0.7.0.jar, jdo2-api-2.3-ec.jar, slf4j-api-1.6.1.jar).
使用以下命令将 jar 文件从 local 复制到 HDFS 并将它们添加到 classpath 。
Use the following commands to copy those jar files from local to HDFS and add those to the classpath.
bin/hadoop fs -copyFromLocal $HCAT_HOME/share/hcatalog/hcatalog-core-0.5.0.jar /tmp
bin/hadoop fs -copyFromLocal $HIVE_HOME/lib/hive-metastore-0.10.0.jar /tmp
bin/hadoop fs -copyFromLocal $HIVE_HOME/lib/libthrift-0.7.0.jar /tmp
bin/hadoop fs -copyFromLocal $HIVE_HOME/lib/hive-exec-0.10.0.jar /tmp
bin/hadoop fs -copyFromLocal $HIVE_HOME/lib/libfb303-0.7.0.jar /tmp
bin/hadoop fs -copyFromLocal $HIVE_HOME/lib/jdo2-api-2.3-ec.jar /tmp
bin/hadoop fs -copyFromLocal $HIVE_HOME/lib/slf4j-api-1.6.1.jar /tmp
export LIB_JARS=hdfs:///tmp/hcatalog-core-0.5.0.jar,\
hdfs:///tmp/hive-metastore-0.10.0.jar,\
hdfs:///tmp/libthrift-0.7.0.jar,\
hdfs:///tmp/hive-exec-0.10.0.jar,\
hdfs:///tmp/libfb303-0.7.0.jar,\
hdfs:///tmp/jdo2-api-2.3-ec.jar,\
hdfs:///tmp/slf4j-api-1.6.1.jar
将程序编译并打包成一个 jar(例如 groupbyage.jar,此名称仅作示例),然后使用以下命令运行它,依次传入元存储服务器 URI、输入表名和输出表名作为参数。
Compile and package the program into a jar (for example, groupbyage.jar; the name here is only illustrative), then run it with the following command, passing the metastore server URI, the input table name, and the output table name as arguments.
$HADOOP_HOME/bin/hadoop jar groupbyage.jar GroupByAge -libjars $LIB_JARS <serverUri> <input_table> <output_table>
现在,检查输出目录 (hdfs://user/tmp/hive) 以查看输出 (part_0000, part_0001)。
Now, check your output directory (hdfs://user/tmp/hive) for the output (part_0000, part_0001).
HCatalog - Loader & Storer
HCatLoader 和 HCatStorer API 与 Pig 脚本一起用于读写 HCatalog 管理的表中的数据。这些接口不需要 HCatalog 特定的设置。
The HCatLoader and HCatStorer APIs are used with Pig scripts to read and write data in HCatalog-managed tables. No HCatalog-specific setup is required for these interfaces.
最好了解一些 Apache Pig 脚本的知识,才能更好地理解本章。更多参考信息,请参阅我们的 Apache Pig 教程。
It is better to have some knowledge on Apache Pig scripts to understand this chapter better. For further reference, please go through our Apache Pig tutorial.
HCatLoader
HCatLoader 与 Pig 脚本一起使用,用于从 HCatalog 管理的表中读取数据。使用以下语法通过 HCatLoader 加载数据。
HCatLoader is used with Pig scripts to read data from HCatalog-managed tables. Use the following syntax to load data using HCatLoader.
A = LOAD 'tablename' USING org.apache.hcatalog.pig.HCatLoader();
您必须用单引号指定表名: LOAD 'tablename' 。如果您正在使用非默认数据库,那么您必须将您的输入指定为 ' dbname.tablename' 。
You must specify the table name in single quotes: LOAD 'tablename'. If you are using a non-default database, then you must specify your input as 'dbname.tablename'.
Hive 元存储允许您在不指定数据库的情况下创建表。如果您以这种方式创建表,那么数据库名称为 'default' ,并且在为 HCatLoader 指定表时不需要该名称。
The Hive metastore lets you create tables without specifying a database. If you created tables this way, then the database name is 'default' and is not required when specifying the table for HCatLoader.
下表包含 HCatLoader 类的重要方法及其描述。
The following table contains the important methods and descriptions of the HCatLoader class.
Sr.No. | Method Name & Description
1 | public InputFormat<?,?> getInputFormat() throws IOException
    Read the input format of the data being loaded using the HCatLoader class.
2 | public String relativeToAbsolutePath(String location, Path curDir) throws IOException
    Returns the String form of the absolute path.
3 | public void setLocation(String location, Job job) throws IOException
    Sets the location where the job can be executed.
4 | public Tuple getNext() throws IOException
    Returns the next tuple (key and value) to be processed.
HCatStorer
HCatStorer 与 Pig 脚本一起用于将数据写入 HCatalog 管理的表中。对于存储操作,使用以下语法。
HCatStorer is used with Pig scripts to write data to HCatalog-managed tables. Use the following syntax for the store operation.
A = LOAD ...
B = FOREACH A ...
...
...
my_processed_data = ...
STORE my_processed_data INTO 'tablename' USING org.apache.hcatalog.pig.HCatStorer();
您必须用单引号指定表名: STORE ... INTO 'tablename' 。在运行您的 Pig 脚本之前,必须先创建好数据库和表。如果您使用的是非默认数据库,则必须将输出指定为 'dbname.tablename' 。
You must specify the table name in single quotes: STORE ... INTO 'tablename'. Both the database and the table must be created prior to running your Pig script. If you are using a non-default database, then you must specify your output as 'dbname.tablename'.
Hive 元存储允许您在不指定数据库的情况下创建表。如果您以这种方式创建表,那么数据库名称为 'default' ,您不需要在 store 语句中指定数据库名称。
The Hive metastore lets you create tables without specifying a database. If you created tables this way, then the database name is 'default' and you do not need to specify the database name in the store statement.
对于 USING 语句,您可以有一个表示分区键值对的字符串参数。当您写入分区表而分区列不在输出列中时,这是一个强制参数。分区键的值不应加引号。
For the USING clause, you can have a string argument that represents key/value pairs for partitions. This is a mandatory argument when you are writing to a partitioned table and the partition column is not in the output column. The values for partition keys should NOT be quoted.
下表包含 HCatStorer 类的重要方法及其说明。
The following table contains the important methods and descriptions of the HCatStorer class.
Sr.No. | Method Name & Description
1 | public OutputFormat getOutputFormat() throws IOException
    Read the output format of the stored data using the HCatStorer class.
2 | public void setStoreLocation(String location, Job job) throws IOException
    Sets the location where this store operation is to be executed.
3 | public void storeSchema(ResourceSchema schema, String arg1, Job job) throws IOException
    Store the schema.
4 | public void prepareToWrite(RecordWriter writer) throws IOException
    It helps to write data into a particular file using RecordWriter.
5 | public void putNext(Tuple tuple) throws IOException
    Writes the tuple data into the file.
Running Pig with HCatalog
Pig 不会自动获取 HCatalog jar。若要引入必要的 jar,可使用 Pig 命令中的标志或设置 PIG_CLASSPATH 和 PIG_OPTS 环境变量,如下所示。
Pig does not automatically pick up HCatalog jars. To bring in the necessary jars, you can either use a flag in the Pig command or set the environment variables PIG_CLASSPATH and PIG_OPTS as described below.
若要引入用于处理 HCatalog 的适当 jar,只需包含以下标志 −
To bring in the appropriate jars for working with HCatalog, simply include the following flag −
pig -useHCatalog <sample pig script file>
Setting the CLASSPATH for Execution
使用以下 CLASSPATH 设置,使 Apache Pig 能够找到 HCatalog 所需的 jar。
Use the following CLASSPATH setting to make the HCatalog jars available to Apache Pig.
export HADOOP_HOME=<path_to_hadoop_install>
export HIVE_HOME=<path_to_hive_install>
export HCAT_HOME=<path_to_hcat_install>

export PIG_CLASSPATH=$HCAT_HOME/share/hcatalog/hcatalog-core*.jar:\
$HCAT_HOME/share/hcatalog/hcatalog-pig-adapter*.jar:\
$HIVE_HOME/lib/hive-metastore-*.jar:$HIVE_HOME/lib/libthrift-*.jar:\
$HIVE_HOME/lib/hive-exec-*.jar:$HIVE_HOME/lib/libfb303-*.jar:\
$HIVE_HOME/lib/jdo2-api-*-ec.jar:$HIVE_HOME/conf:$HADOOP_HOME/conf:\
$HIVE_HOME/lib/slf4j-api-*.jar

# Point Pig at the Hive metastore (replace host and port with your metastore's).
export PIG_OPTS=-Dhive.metastore.uris=thrift://<metastore_host>:<port>
Example
假设我们在 HDFS 中有一个文件 student_details.txt ,内容如下。
Assume we have a file student_details.txt in HDFS with the following content.
student_details.txt
student_details.txt
001, Rajiv, Reddy, 21, 9848022337, Hyderabad
002, siddarth, Battacharya, 22, 9848022338, Kolkata
003, Rajesh, Khanna, 22, 9848022339, Delhi
004, Preethi, Agarwal, 21, 9848022330, Pune
005, Trupthi, Mohanthy, 23, 9848022336, Bhuwaneshwar
006, Archana, Mishra, 23, 9848022335, Chennai
007, Komal, Nayak, 24, 9848022334, trivendram
008, Bharathi, Nambiayar, 24, 9848022333, Chennai
我们还有同一个 HDFS 目录中的一个样例脚本,名为 sample_script.pig 。该文件包含对 student 关系执行操作和转换的语句,如下所示。
We also have a sample script with the name sample_script.pig, in the same HDFS directory. This file contains statements performing operations and transformations on the student relation, as shown below.
student = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING
   PigStorage(',') as (id:int, firstname:chararray, lastname:chararray,
   age:int, phone:chararray, city:chararray);
student_order = ORDER student BY age DESC;
STORE student_order INTO 'student_order_table' USING org.apache.hcatalog.pig.HCatStorer();
student_limit = LIMIT student_order 4;
Dump student_limit;
- The first statement of the script will load the data in the file named student_details.txt as a relation named student.
- The second statement of the script will arrange the tuples of the relation in descending order, based on age, and store it as student_order.
- The third statement stores the processed data from student_order in a separate HCatalog-managed table named student_order_table.
- The fourth statement of the script will store the first four tuples of student_order as student_limit.
- Finally, the fifth statement will dump the content of the relation student_limit.
现在让我们按照如下所示执行 sample_script.pig 。
Let us now execute the sample_script.pig as shown below.
$ ./pig -useHCatalog hdfs://localhost:9000/pig_data/sample_script.pig
现在,检查输出目录 (hdfs://user/tmp/hive) 以查看输出 (part_0000, part_0001)。
Now, check your output directory (hdfs://user/tmp/hive) for the output (part_0000, part_0001).