HCatalog Concise Tutorial

HCatalog - Installation

All Hadoop sub-projects, such as Hive, Pig, and HBase, support the Linux operating system, so you need a Linux distribution installed on your system. HCatalog was merged into Hive on March 26, 2013, and ships with the Hive installation from version Hive-0.11.0 onwards. Therefore, follow the steps below to install Hive, which in turn installs HCatalog on your system.

Step 1: Verifying JAVA Installation

Java must be installed on your system before you install Hive. Use the following command to check whether Java is already installed:

$ java -version

If Java is already installed on your system, you will see a response similar to the following:

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
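The same check can be scripted, which is convenient when several prerequisites have to be verified. The following is a minimal sketch rather than part of the original procedure: the helper name check_cmd is hypothetical, and it only reports whether a command is on the PATH, not which version it is.

```shell
# check_cmd NAME: report whether NAME resolves to an executable on the PATH.
# (check_cmd is an illustrative helper, not a standard tool.)
check_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
  fi
}

check_cmd java
```

The same helper can later be reused for other commands in this tutorial, for example check_cmd hadoop.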

If Java is not installed on your system, follow the steps given below.

Step 2: Installing Java

Download Java (JDK <latest version> - X64.tar.gz) from http://www.oracle.com/.

The file jdk-7u71-linux-x64.tar.gz will then be downloaded onto your system.

Generally, you will find the downloaded Java file in the Downloads folder. Verify it and extract the jdk-7u71-linux-x64.tar.gz file using the following commands.

$ cd Downloads/
$ ls
jdk-7u71-linux-x64.tar.gz

$ tar zxf jdk-7u71-linux-x64.tar.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.tar.gz

To make Java available to all users, move it to the location “/usr/local/”. Switch to the root user and type the following commands.

$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit

To set up the PATH and JAVA_HOME variables, add the following commands to the ~/.bashrc file.

export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin
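Because ~/.bashrc is read by every new interactive shell, and may be sourced again by hand, an unconditional append can add the same directory to PATH repeatedly. The following is a minimal sketch of a guard against that, assuming a POSIX-compatible shell; the function name add_to_path is hypothetical and not part of the original tutorial.

```shell
# add_to_path DIR: append DIR to PATH only if it is not already listed.
# (add_to_path is an illustrative helper name.)
add_to_path() {
  case ":$PATH:" in
    *":$1:"*) ;;                 # already present, nothing to do
    *) PATH="$PATH:$1" ;;        # otherwise append it
  esac
}

# Default to the JDK location used in this tutorial if JAVA_HOME is unset.
JAVA_HOME="${JAVA_HOME:-/usr/local/jdk1.7.0_71}"
add_to_path "$JAVA_HOME/bin"
export JAVA_HOME PATH
```

Wrapping PATH in colons before matching lets one pattern cover entries at the start, middle, and end of the list.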

Now verify the installation by running the java -version command from the terminal, as explained above.

Step 3: Verifying Hadoop Installation

Hadoop must be installed on your system before you install Hive. Verify the Hadoop installation using the following command:

$ hadoop version

If Hadoop is already installed on your system, you will get a response similar to the following:

Hadoop 2.4.1
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4

If Hadoop is not installed on your system, proceed with the following steps.

Step 4: Downloading Hadoop

Download and extract Hadoop 2.4.1 from the Apache Software Foundation using the following commands.

$ su
password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mv hadoop-2.4.1 hadoop
# exit

Step 5: Installing Hadoop in Pseudo Distributed Mode

The following steps install Hadoop 2.4.1 in pseudo-distributed mode.

Setting up Hadoop

You can set the Hadoop environment variables by appending the following commands to the ~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Now apply the changes to the current shell session.

$ source ~/.bashrc
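After sourcing the file, it can be useful to confirm that every variable actually took effect before moving on. The following is a minimal sketch under that assumption; the helper name require_vars is hypothetical, and it only checks that each variable is non-empty, not that the directory it points to exists.

```shell
# require_vars NAME...: print "unset: NAME" for every variable that is
# empty or unset. (require_vars is an illustrative helper name.)
require_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"          # indirect lookup of the variable $name
    [ -n "$value" ] || echo "unset: $name"
  done
}

require_vars HADOOP_HOME HADOOP_MAPRED_HOME HADOOP_COMMON_HOME \
             HADOOP_HDFS_HOME YARN_HOME HADOOP_COMMON_LIB_NATIVE_DIR
```

No output means all of the listed variables are set in the current shell.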

Hadoop Configuration

You can find all the Hadoop configuration files in the location “$HADOOP_HOME/etc/hadoop”. Change these configuration files as needed to match your Hadoop infrastructure.

$ cd $HADOOP_HOME/etc/hadoop

To develop Hadoop programs in Java, you have to reset the Java environment variable in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on your system.

export JAVA_HOME=/usr/local/jdk1.7.0_71

Given below is the list of files that you have to edit to configure Hadoop.

core-site.xml

The core-site.xml file contains information such as the port number used for the Hadoop instance, the memory allocated for the file system, the memory limit for storing data, and the size of the read/write buffers.

Open core-site.xml and add the following properties between the <configuration> and </configuration> tags.

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>

hdfs-site.xml

The hdfs-site.xml file contains information such as the data replication value, the namenode path, and the datanode path on your local file system. These paths are where you want to store the Hadoop infrastructure.

Let us assume the following data.

dfs.replication (data replication value) = 1

namenode path = /home/hadoop/hadoopinfra/hdfs/namenode
datanode path = /home/hadoop/hadoopinfra/hdfs/datanode

(In these paths, hadoop is the user name. hadoopinfra/hdfs/namenode and hadoopinfra/hdfs/datanode are the directories created by the HDFS file system.)

Open this file and add the following properties between the <configuration> and </configuration> tags.

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>

   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>

   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
   </property>
</configuration>

Note − In the above file, all the property values are user-defined, and you can change them to match your Hadoop infrastructure.
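Hadoop typically creates the namenode and datanode directories itself when the namenode is formatted and the datanode starts, so the following step is optional. Creating them beforehand can still be useful because it surfaces ownership or permission problems early. This is a minimal sketch under that assumption; the function name make_hdfs_dirs is hypothetical.

```shell
# make_hdfs_dirs BASE: create BASE/namenode and BASE/datanode if missing.
# (make_hdfs_dirs is an illustrative helper name.)
make_hdfs_dirs() {
  mkdir -p "$1/namenode" "$1/datanode"
}

# Example, matching the paths assumed in hdfs-site.xml (run as the hadoop user):
#   make_hdfs_dirs /home/hadoop/hadoopinfra/hdfs
```

mkdir -p succeeds silently when the directories already exist, so the helper is safe to run more than once.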

yarn-site.xml

This file is used to configure YARN in Hadoop. Open the yarn-site.xml file and add the following properties between the <configuration> and </configuration> tags.

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>

mapred-site.xml

This file specifies which MapReduce framework is in use. By default, Hadoop ships only a template of this file, named mapred-site.xml.template. First, copy mapred-site.xml.template to mapred-site.xml using the following command.

$ cp mapred-site.xml.template mapred-site.xml

Open the mapred-site.xml file and add the following properties between the <configuration> and </configuration> tags.

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>

Step 6: Verifying Hadoop Installation

The following steps are used to verify the Hadoop installation.

Namenode Setup

Set up the namenode using the command “hdfs namenode -format” as follows:

$ cd ~
$ hdfs namenode -format

The expected result is as follows:

10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.4.1
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory /home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
************************************************************/

Verifying Hadoop DFS

The following command starts the DFS. Executing it starts your Hadoop file system.

$ start-dfs.sh

The expected output is as follows:

10/24/14 21:37:56 Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]

Verifying Yarn Script

The following command runs the YARN start script. Executing it starts your YARN daemons.

$ start-yarn.sh

The expected output is as follows:

starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-resourcemanager-localhost.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-nodemanager-localhost.out
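The start scripts only report that the daemons were launched, not that they stayed up. The JDK's jps tool lists running Java processes, so it offers a quick cross-check. The following is a minimal sketch: the function name missing_daemons is hypothetical, and the daemon names passed to it are the usual Hadoop 2.x process names for a pseudo-distributed setup, which is an assumption about your installation.

```shell
# missing_daemons JPS_OUTPUT NAME...: print each NAME that does not appear
# as a process name (second column) in the given jps output.
# (missing_daemons is an illustrative helper name.)
missing_daemons() {
  jps_output=$1
  shift
  for daemon in "$@"; do
    printf '%s\n' "$jps_output" | awk '{print $2}' | grep -qx "$daemon" \
      || echo "$daemon"
  done
}

# Only run the check when jps is available on the PATH.
if command -v jps >/dev/null 2>&1; then
  missing_daemons "$(jps)" NameNode DataNode SecondaryNameNode \
                  ResourceManager NodeManager
fi
```

Matching whole lines with grep -x keeps NameNode from being satisfied by a SecondaryNameNode entry. No output means none of the listed daemons is missing.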

Accessing Hadoop on Browser

The default port number to access Hadoop is 50070. Use the following URL to reach the Hadoop services in your browser.

http://localhost:50070/

Verify all applications for cluster

The default port number to access all applications of the cluster is 8088. Use the following URL to visit this service.

http://localhost:8088/

Once you are done with the installation of Hadoop, proceed to the next step and install Hive on your system.

Step 7: Downloading Hive

We use hive-0.14.0 in this tutorial. You can download it from http://apache.petsads.us/hive/hive-0.14.0/. Let us assume it is downloaded into the Downloads directory. The Hive archive used in this tutorial is named “apache-hive-0.14.0-bin.tar.gz”. Use the following commands to verify the download:

$ cd Downloads
$ ls

If the download succeeded, you will see the following response:

apache-hive-0.14.0-bin.tar.gz

Step 8: Installing Hive

The following steps are required to install Hive on your system. Let us assume the Hive archive was downloaded into the Downloads directory.

Extracting and Verifying Hive Archive

Use the following commands to extract the Hive archive and verify the result:

$ tar zxvf apache-hive-0.14.0-bin.tar.gz
$ ls

If the extraction succeeded, you will see the following response:

apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz

Copying files to /usr/local/hive directory

The files must be moved as the superuser (“su -”). The following commands move the extracted directory to the /usr/local/hive directory.

$ su -
passwd:
# cd /home/user/Downloads
# mv apache-hive-0.14.0-bin /usr/local/hive
# exit

Setting up the environment for Hive

You can set up the Hive environment by appending the following lines to the ~/.bashrc file:

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.

Use the following command to apply the ~/.bashrc file to the current shell.

$ source ~/.bashrc

Step 9: Configuring Hive

To configure Hive with Hadoop, you need to edit the hive-env.sh file, which is located in the $HIVE_HOME/conf directory. The following commands change to the Hive config folder and copy the template file:

$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh

Edit the hive-env.sh file by appending the following line:

export HADOOP_HOME=/usr/local/hadoop

With this, the Hive installation is complete. You now need an external database server to configure the Metastore. We use the Apache Derby database.

Step 10: Downloading and Installing Apache Derby

Follow the steps given below to download and install Apache Derby.

Downloading Apache Derby

Use the following commands to download Apache Derby. The download takes some time.

$ cd ~
$ wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz

Use the following command to verify the download:

$ ls

If the download succeeded, you will see the following response:

db-derby-10.4.2.0-bin.tar.gz

Extracting and Verifying Derby Archive

Use the following commands to extract the Derby archive and verify the result:

$ tar zxvf db-derby-10.4.2.0-bin.tar.gz
$ ls

If the extraction succeeded, you will see the following response:

db-derby-10.4.2.0-bin db-derby-10.4.2.0-bin.tar.gz

Copying Files to /usr/local/derby Directory

The files must be moved as the superuser (“su -”). The following commands move the extracted directory to the /usr/local/derby directory:

$ su -
passwd:
# cd /home/user
# mv db-derby-10.4.2.0-bin /usr/local/derby
# exit

Setting up the Environment for Derby

You can set up the Derby environment by appending the following lines to the ~/.bashrc file:

export DERBY_HOME=/usr/local/derby
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar

Use the following command to apply the ~/.bashrc file to the current shell:

$ source ~/.bashrc

Create a Directory for Metastore

Create a directory named data in the $DERBY_HOME directory to store the Metastore data.

$ mkdir $DERBY_HOME/data

The Derby installation and environment setup are now complete.

Step 11: Configuring the Hive Metastore

Configuring the Metastore means telling Hive where the database is stored. You can do this by editing the hive-site.xml file, which is in the $HIVE_HOME/conf directory. First, copy the template file using the following commands:

$ cd $HIVE_HOME/conf
$ cp hive-default.xml.template hive-site.xml

Edit hive-site.xml and add the following lines between the <configuration> and </configuration> tags:

<property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
   <description>JDBC connect string for a JDBC metastore</description>
</property>
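A connection URL of the form jdbc:derby://host:port/... uses Derby's network client, so a Derby Network Server has to be listening on that port before Hive can connect. The Derby distribution ships a startNetworkServer script in its bin directory. The following is a minimal sketch of starting it, assuming Derby is installed under $DERBY_HOME as described earlier; the function name start_derby_server is hypothetical, and the host and port simply mirror the URL in this tutorial.

```shell
# start_derby_server: launch the Derby Network Server in the background
# if the startup script is present. (Illustrative helper name.)
start_derby_server() {
  server="${DERBY_HOME:-/usr/local/derby}/bin/startNetworkServer"
  if [ -x "$server" ]; then
    # Host and port match the JDBC URL configured for the Metastore.
    nohup "$server" -h localhost -p 1527 >/dev/null 2>&1 &
    echo "started: $server"
  else
    echo "not found: $server"
  fi
}

start_derby_server
```

The message "started" only means the script was launched; whether the server is accepting connections has to be checked separately.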

Create a file named jpox.properties and add the following lines to it:

javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl

org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false

org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed

javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://localhost:1527/metastore_db;create=true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine
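In the listing above, org.jpox.autoCreateSchema is assigned twice, first false and then true. If the file is loaded with java.util.Properties, the later assignment replaces the earlier one, so the effective value would be true. Duplicates like this are easy to miss, and the following minimal sketch lists them. The function name duplicate_keys is hypothetical, and the sketch assumes the simple "key = value" layout used here, without line continuations or ":" separators.

```shell
# duplicate_keys FILE: print every property key that occurs more than once.
# (duplicate_keys is an illustrative helper name.)
duplicate_keys() {
  grep -Ev '^[[:space:]]*(#|$)' "$1" \
    | sed 's/[[:space:]]*=.*$//' \
    | sort | uniq -d
}

# Example:
#   duplicate_keys jpox.properties
```

The grep stage drops comments and blank lines, sed strips everything from the "=" onward, and uniq -d keeps only the keys that repeat.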

Step 12: Verifying Hive Installation

Before running Hive, you need to create the /tmp folder and a separate Hive folder in HDFS. Here, we use the /user/hive/warehouse folder. Both folders need group write permission (chmod g+w). Use the following commands to create them and set the permission before verifying Hive. The -p option creates the missing parent directories of /user/hive/warehouse:

$ $HADOOP_HOME/bin/hadoop fs -mkdir -p /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir -p /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

Use the following commands to verify the Hive installation:

$ cd $HIVE_HOME
$ bin/hive

If Hive was installed successfully, you will see the following response:

Logging initialized using configuration in jar:file:/home/hadoop/hive-0.9.0/lib/hive-common-0.9.0.jar!/hive-log4j.properties
Hive history =/tmp/hadoop/hive_job_log_hadoop_201312121621_1494929084.txt
………………….
hive>

You can execute the following sample command to display all the tables:

hive> show tables;
OK
Time taken: 2.798 seconds
hive>

Step 13: Verify HCatalog Installation

Use the following command to set the HCAT_HOME environment variable to the HCatalog home directory.

export HCAT_HOME=$HIVE_HOME/hcatalog

Use the following commands to verify the HCatalog installation.

cd $HCAT_HOME/bin
./hcat

If the installation is successful, you will see the following output:

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
usage: hcat { -e "<query>" | -f "<filepath>" }
   [ -g "<group>" ] [ -p "<perms>" ]
   [ -D"<name> = <value>" ]

-D <property = value>    use hadoop value for given property
-e <exec>                hcat command given from command line
-f <file>                hcat commands in file
-g <group>               group for the db/table specified in CREATE statement
-h,--help                Print help information
-p <perms>               permissions for the db/table specified in CREATE statement
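Beyond printing the usage text, the -e option listed above runs a command given on the command line, which makes for a slightly stronger check of the installation. The following is a minimal sketch: the function name run_hcat is hypothetical, the fallback path /usr/local/hive/hcatalog is an assumption based on the Hive location used in this tutorial, and the query "show tables;" is only an example.

```shell
# run_hcat QUERY: run QUERY through the hcat CLI if it is installed.
# (run_hcat is an illustrative helper name.)
run_hcat() {
  hcat_bin="${HCAT_HOME:-/usr/local/hive/hcatalog}/bin/hcat"
  if [ -x "$hcat_bin" ]; then
    "$hcat_bin" -e "$1"
  else
    echo "hcat not found: $hcat_bin"
  fi
}

run_hcat "show tables;"
```

A working setup also needs the Hadoop daemons and the Metastore database to be running, since hcat talks to both.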