Hive 简明教程

Hive - Quick Guide

Hive - Introduction

“大数据”一词指的是海量数据集的集合,这类数据集具有数据量巨大、增长高速、种类繁多且与日俱增的特点。使用传统的数据管理系统很难处理大数据。因此,Apache 软件基金会引入了一个称为 Hadoop 的框架来解决大数据管理和处理难题。

The term ‘Big Data’ is used for collections of large datasets that include huge volume, high velocity, and a variety of data that is increasing day by day. Using traditional data management systems, it is difficult to process Big Data. Therefore, the Apache Software Foundation introduced a framework called Hadoop to solve Big Data management and processing challenges.

Hadoop

Hadoop 是一个开源框架,用于在分布式环境中存储和处理大数据。它包含两个模块,一个是 MapReduce,另一个是 Hadoop 分布式文件系统 (HDFS)。

Hadoop is an open-source framework to store and process Big Data in a distributed environment. It contains two modules, one is MapReduce and another is Hadoop Distributed File System (HDFS).

  1. MapReduce: It is a parallel programming model for processing large amounts of structured, semi-structured, and unstructured data on large clusters of commodity hardware.

  2. HDFS: Hadoop Distributed File System is a part of the Hadoop framework, used to store the datasets. It provides a fault-tolerant file system that runs on commodity hardware.

Hadoop 生态系统包含不同的子项目(工具),例如 Sqoop、Pig 和 Hive,可用于对 Hadoop 模块提供帮助。

The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and Hive that are used to help Hadoop modules.

  1. Sqoop: It is used to import and export data between HDFS and RDBMS.

  2. Pig: It is a procedural language platform used to develop a script for MapReduce operations.

  3. Hive: It is a platform used to develop SQL-type scripts to perform MapReduce operations.

Note: 有多种执行 MapReduce 操作的方式:

Note: There are various ways to execute MapReduce operations:

  1. The traditional approach using Java MapReduce program for structured, semi-structured, and unstructured data.

  2. The scripting approach for MapReduce to process structured and semi-structured data using Pig.

  3. The Hive Query Language (HiveQL or HQL) for MapReduce to process structured data using Hive.
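作为第三种方式的简单示意,下面是一个极简的 HiveQL 查询草图(其中的表名和列名仅为假设的示例),Hive 会在内部将这类查询编译为 MapReduce 作业,而无需手工编写 Java 程序:

As a quick, hedged sketch of the third approach (the table and column names here are only illustrative assumptions), the following HiveQL query is the kind of statement that Hive compiles into a MapReduce job internally, instead of a hand-written Java program:

hive> SELECT dept, COUNT(*) FROM employee GROUP BY dept;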

What is Hive

Hive 是用于处理 Hadoop 中结构化数据的数据仓库基础架构工具。它立足于 Hadoop 之上,用于对大数据进行汇总,并使查询和分析变得容易。

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.

Hive 最初由 Facebook 开发,后由 Apache 软件基金会接手并以开源形式进一步开发,名为 Apache Hive。它被不同公司所使用。例如,Amazon 在 Amazon Elastic MapReduce 中使用它。

Initially Hive was developed by Facebook, later the Apache Software Foundation took it up and developed it further as an open source under the name Apache Hive. It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Hive is not

  1. A relational database

  2. A design for OnLine Transaction Processing (OLTP)

  3. A language for real-time queries and row-level updates

Features of Hive

  1. It stores schema in a database and processed data into HDFS.

  2. It is designed for OLAP.

  3. It provides SQL type language for querying called HiveQL or HQL.

  4. It is familiar, fast, scalable, and extensible.

Architecture of Hive

以下组件图解显示了 Hive 的架构:

The following component diagram depicts the architecture of Hive:

hive architecture

该组件图解包含不同的单元。下面对每个单元进行说明:

This component diagram contains different units. Each unit is described below:

User Interface: Hive is a data warehouse infrastructure software that can create interaction between the user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD Insight (in Windows Server).

Meta Store: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.

HiveQL Process Engine: HiveQL is an SQL-like language for querying the schema information in the Metastore. It is one of the replacements for the traditional MapReduce approach: instead of writing a MapReduce program in Java, we can write a HiveQL query and have it processed as a MapReduce job.

Execution Engine: The conjunction of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates the same results as MapReduce would, using the MapReduce model of execution.

HDFS or HBASE: Hadoop Distributed File System or HBASE are the data storage techniques used to store the data into the file system.

Working of Hive

下图描述了 Hive 和 Hadoop 之间的工作流程。

The following diagram depicts the workflow between Hive and Hadoop.

how hive works

以下步骤定义了 Hive 如何与 Hadoop 框架交互:

The following steps define how Hive interacts with the Hadoop framework:

Step 1. Execute Query: The Hive interface such as Command Line or Web UI sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.

Step 2. Get Plan: The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan, i.e., the requirement of the query.

Step 3. Get Metadata: The compiler sends a metadata request to the Metastore (any database).

Step 4. Send Metadata: The Metastore sends the metadata as a response to the compiler.

Step 5. Send Plan: The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.

Step 6. Execute Plan: The driver sends the execute plan to the execution engine.

Step 7. Execute Job: Internally, the process of executing the job is a MapReduce job. The execution engine sends the job to the JobTracker, which is in the Name node, and it assigns this job to the TaskTracker, which is in the Data node. Here, the query executes the MapReduce job.

Step 7.1. Metadata Ops: Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.

Step 8. Fetch Result: The execution engine receives the results from the Data nodes.

Step 9. Send Results: The execution engine sends those resultant values to the driver.

Step 10. Send Results: The driver sends the results to the Hive interfaces.

Hive - Installation

Hive、Pig 和 HBase 等所有 Hadoop 子项目都支持 Linux 操作系统。因此,你需要安装任何 Linux 操作系统。对于 Hive 安装,执行以下简单步骤:

All Hadoop sub-projects such as Hive, Pig, and HBase support Linux operating system. Therefore, you need to install any Linux flavored OS. The following simple steps are executed for Hive installation:

Step 1: Verifying JAVA Installation

在安装 Hive 之前,必须在你的系统上安装 Java。让我们使用以下命令验证 Java 安装:

Java must be installed on your system before installing Hive. Let us verify java installation using the following command:

$ java -version

如果系统中已经安装了 Java,您将看到以下响应:

If Java is already installed on your system, you get to see the following response:

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

如果您的系统中未安装 java,请按照以下步骤安装 java。

If java is not installed in your system, then follow the steps given below for installing java.

Installing Java

Step I:

从 Oracle 的 Java 下载页面下载 JDK 归档文件,随后 jdk-7u71-linux-x64.tar.gz 将被下载到您的系统中。

Download the JDK archive from the Oracle Java downloads page. Then jdk-7u71-linux-x64.tar.gz will be downloaded onto your system.

Step II:

通常,您将在下载文件夹中找到下载的 java 文件。使用以下命令对文件进行验证并提取 jdk-7u71-linux-x64.gz 文件。

Generally you will find the downloaded java file in the Downloads folder. Verify it and extract the jdk-7u71-linux-x64.gz file using the following commands.

$ cd Downloads/
$ ls
jdk-7u71-linux-x64.gz
$ tar zxf jdk-7u71-linux-x64.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.gz

Step III:

为了让所有用户都可以使用 java,您必须将其移动到 “/usr/local/” 位置。打开 root,并键入以下命令。

To make java available to all the users, you have to move it to the location “/usr/local/”. Open root, and type the following commands.

$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit

Step IV:

为设置 PATH 和 JAVA_HOME 变量,将以下命令添加到 ~/.bashrc 文件中。

For setting up PATH and JAVA_HOME variables, add the following commands to ~/.bashrc file.

export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin

现在,使用上述说明,通过终端运行 java -version 命令来验证安装。

Now verify the installation using the command java -version from the terminal as explained above.

Step 2: Verifying Hadoop Installation

在安装 Hive 之前,必须在系统上安装 Hadoop。让我们使用以下命令验证 Hadoop 安装:

Hadoop must be installed on your system before installing Hive. Let us verify the Hadoop installation using the following command:

$ hadoop version

如果系统中已经安装了 Hadoop,那么您将得到以下响应:

If Hadoop is already installed on your system, then you will get the following response:

Hadoop 2.4.1
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4

如果系统中没有安装 Hadoop,那么请执行以下步骤:

If Hadoop is not installed on your system, then proceed with the following steps:

Downloading Hadoop

使用以下命令从 Apache 软件基金会下载并解压缩 Hadoop 2.4.1。

Download and extract Hadoop 2.4.1 from Apache Software Foundation using the following commands.

$ su
password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mkdir hadoop
# mv hadoop-2.4.1/* hadoop/
# exit

Installing Hadoop in Pseudo Distributed Mode

以下步骤用于在伪分布模式下安装 Hadoop 2.4.1。

The following steps are used to install Hadoop 2.4.1 in pseudo distributed mode.

Step I: Setting up Hadoop

您可以通过将以下命令追加到 ~/.bashrc 文件来设置 Hadoop 环境变量。

You can set Hadoop environment variables by appending the following commands to ~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

现在将所有更改应用到当前正在运行的系统中。

Now apply all the changes into the current running system.

$ source ~/.bashrc

Step II: Hadoop Configuration

您可以在位置 “$HADOOP_HOME/etc/hadoop” 中找到所有 Hadoop 配置文件。根据您的 Hadoop 基础架构,您需要在这些配置文件中进行适当的更改。

You can find all the Hadoop configuration files in the location “$HADOOP_HOME/etc/hadoop”. You need to make suitable changes in those configuration files according to your Hadoop infrastructure.

$ cd $HADOOP_HOME/etc/hadoop

为了使用 Java 开发 Hadoop 程序,您必须通过将 JAVA_HOME 值替换为您系统中 Java 的位置来在 hadoop-env.sh 文件中重置 Java 环境变量。

In order to develop Hadoop programs using java, you have to reset the java environment variables in hadoop-env.sh file by replacing JAVA_HOME value with the location of java in your system.

export JAVA_HOME=/usr/local/jdk1.7.0_71

下面列出了您必须编辑以配置 Hadoop 的文件列表。

Given below are the list of files that you have to edit to configure Hadoop.

core-site.xml

core-site.xml

core-site.xml 文件包含信息,例如用于 Hadoop 实例的端口号、分配给文件系统内存、用于存储数据的内存限制以及读/写缓冲区大小。

The core-site.xml file contains information such as the port number used for Hadoop instance, memory allocated for the file system, memory limit for storing the data, and the size of Read/Write buffers.

打开 core-site.xml,并在 <configuration> 和 </configuration> 标记之间添加以下属性。

Open the core-site.xml and add the following properties in between the <configuration> and </configuration> tags.

<configuration>

   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
   </property>

</configuration>

hdfs-site.xml

hdfs-site.xml

hdfs-site.xml 文件包含信息,例如,本地文件系统的复制数据的值、名称节点路径和数据节点路径。这意味着您要存储 Hadoop 基础设施的位置。

The hdfs-site.xml file contains information such as the value of replication data, the namenode path, and the datanode path of your local file systems. It means the place where you want to store the Hadoop infra.

让我们假设以下数据。

Let us assume the following data.

dfs.replication (data replication value) = 1

(In the following path /hadoop/ is the user name.
hadoopinfra/hdfs/namenode is the directory created by hdfs file system.)

namenode path = //home/hadoop/hadoopinfra/hdfs/namenode

(hadoopinfra/hdfs/datanode is the directory created by hdfs file system.)
datanode path = //home/hadoop/hadoopinfra/hdfs/datanode

打开此文件,并在此文件中在 <configuration>、</configuration> 标记之间添加以下属性。

Open this file and add the following properties in between the <configuration>, </configuration> tags in this file.

<configuration>

   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>
   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
   </property>

</configuration>

Note: 在上述文件中,所有属性值都是用户定义的,可以根据 Hadoop 基础设施进行修改。

Note: In the above file, all the property values are user-defined and you can make changes according to your Hadoop infrastructure.

yarn-site.xml

yarn-site.xml

此文件用于将 Yarn 配置到 Hadoop 中。打开 yarn-site.xml 文件并在该文件中的 <configuration>、</configuration> 标记之间添加以下属性。

This file is used to configure yarn into Hadoop. Open the yarn-site.xml file and add the following properties in between the <configuration>, </configuration> tags in this file.

<configuration>

   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>

</configuration>

mapred-site.xml

mapred-site.xml

该文件用于指定我们使用的 MapReduce 框架。默认情况下,Hadoop 包含 mapred-site.xml 的模板。首先,需要使用以下命令将 mapred-site.xml.template 文件复制为 mapred-site.xml 文件。

This file is used to specify which MapReduce framework we are using. By default, Hadoop contains a template of mapred-site.xml. First of all, you need to copy the file from mapred-site.xml.template to mapred-site.xml using the following command.

$ cp mapred-site.xml.template mapred-site.xml

打开 mapred-site.xml 文件,并在此文件中的 <configuration>、</configuration> 标记之间添加以下属性。

Open mapred-site.xml file and add the following properties in between the <configuration>, </configuration> tags in this file.

<configuration>

   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>

</configuration>

Verifying Hadoop Installation

以下步骤用于验证 Hadoop 安装。

The following steps are used to verify the Hadoop installation.

Step I: Name Node Setup

使用命令 “hdfs namenode -format” 设置名称节点,如下所示。

Set up the namenode using the command “hdfs namenode -format” as follows.

$ cd ~
$ hdfs namenode -format

预期结果如下所示。

The expected result is as follows.

10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.4.1
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory
/home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to
retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
 ************************************************************/

Step II: Verifying Hadoop dfs

以下命令用于启动 DFS。执行此命令将启动您的 Hadoop 文件系统。

The following command is used to start dfs. Executing this command will start your Hadoop file system.

$ start-dfs.sh

预期的输出如下:

The expected output is as follows:

10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]

Step III: Verifying Yarn Script

以下命令用于启动 Yarn 脚本。执行此命令将启动您的 Yarn 守护程序。

The following command is used to start the yarn script. Executing this command will start your yarn daemons.

$ start-yarn.sh

预期的输出如下:

The expected output is as follows:

starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-resourcemanager-localhost.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-nodemanager-localhost.out

Step IV: Accessing Hadoop on Browser

访问 Hadoop 的默认端口号为 50070。使用以下 URL 在浏览器上获取 Hadoop 服务。

The default port number to access Hadoop is 50070. Use the following url to get Hadoop services on your browser.

http://localhost:50070/
hadoop browser

Step V: Verify all applications for cluster

访问集群所有应用程序的默认端口号为 8088。使用以下网址访问此服务。

The default port number to access all applications of cluster is 8088. Use the following url to visit this service.

http://localhost:8088/
all applications

Step 3: Downloading Hive

在本教程中,我们使用 hive-0.14.0。您可以通过访问以下链接下载它 http://apache.petsads.us/hive/hive-0.14.0/. 假设它下载到了 /Downloads 目录。在本教程中,我们下载了名为“apache-hive-0.14.0-bin.tar.gz”的 Hive 归档。使用以下命令来验证下载:

We use hive-0.14.0 in this tutorial. You can download it by visiting the following link http://apache.petsads.us/hive/hive-0.14.0/. Let us assume it gets downloaded onto the /Downloads directory. Here, we download Hive archive named “apache-hive-0.14.0-bin.tar.gz” for this tutorial. The following command is used to verify the download:

$ cd Downloads
$ ls

在成功下载后,你可以看到以下响应:

On successful download, you get to see the following response:

apache-hive-0.14.0-bin.tar.gz

Step 4: Installing Hive

需要执行以下步骤来在系统上安装 Hive。假设 Hive 归档下载到了 /Downloads 目录。

The following steps are required for installing Hive on your system. Let us assume the Hive archive is downloaded onto the /Downloads directory.

Extracting and verifying Hive Archive

以下命令用于验证下载并提取 hive 归档:

The following command is used to verify the download and extract the hive archive:

$ tar zxvf apache-hive-0.14.0-bin.tar.gz
$ ls

在成功下载后,你可以看到以下响应:

On successful download, you get to see the following response:

apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz

Copying files to /usr/local/hive directory

我们需要以超级用户身份("su -")复制文件。以下命令用于将文件从解压目录复制到 /usr/local/hive 目录。

We need to copy the files as the super user ("su -"). The following commands are used to copy the files from the extracted directory to the /usr/local/hive directory.

$ su -
passwd:

# cd /home/user/Downloads
# mv apache-hive-0.14.0-bin /usr/local/hive
# exit

Setting up environment for Hive

您可以添加以下行到 ~/.bashrc 文件以设置 Hive 环境:

You can set up the Hive environment by appending the following lines to ~/.bashrc file:

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.

使用以下命令执行 ~/.bashrc 文件。

The following command is used to execute ~/.bashrc file.

$ source ~/.bashrc

Step 5: Configuring Hive

要使用 Hadoop 配置 Hive,您需要编辑 hive-env.sh 文件,它保存在 $HIVE_HOME/conf 目录中。以下命令重定向到 Hive config 文件夹并复制模板文件:

To configure Hive with Hadoop, you need to edit the hive-env.sh file, which is placed in the $HIVE_HOME/conf directory. The following commands redirect to Hive config folder and copy the template file:

$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh

通过添加以下行编辑 hive-env.sh 文件:

Edit the hive-env.sh file by appending the following line:

export HADOOP_HOME=/usr/local/hadoop

完成 Hive 安装。现在,您需要一个外部数据库服务器来配置 Metastore。我们使用 Apache Derby 数据库。

Hive installation is completed successfully. Now you require an external database server to configure Metastore. We use Apache Derby database.

Step 6: Downloading and Installing Apache Derby

执行以下步骤下载并安装 Apache Derby:

Follow the steps given below to download and install Apache Derby:

Downloading Apache Derby

使用以下命令下载 Apache Derby。下载需要一些时间。

The following command is used to download Apache Derby. It takes some time to download.

$ cd ~
$ wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz

使用以下命令验证下载:

The following command is used to verify the download:

$ ls

在成功下载后,你可以看到以下响应:

On successful download, you get to see the following response:

db-derby-10.4.2.0-bin.tar.gz

Extracting and verifying Derby archive

使用以下命令解压并验证 Derby 存档:

The following commands are used for extracting and verifying the Derby archive:

$ tar zxvf db-derby-10.4.2.0-bin.tar.gz
$ ls

在成功下载后,你可以看到以下响应:

On successful download, you get to see the following response:

db-derby-10.4.2.0-bin db-derby-10.4.2.0-bin.tar.gz

Copying files to /usr/local/derby directory

我们需要从超级用户“su -”进行复制。使用以下命令从解压后的目录复制文件到 /usr/local/derby 目录:

We need to copy from the super user “su -”. The following commands are used to copy the files from the extracted directory to the /usr/local/derby directory:

$ su -
passwd:
# cd /home/user
# mv db-derby-10.4.2.0-bin /usr/local/derby
# exit

Setting up environment for Derby

您可以添加以下行到 ~/.bashrc 文件以设置 Derby 环境:

You can set up the Derby environment by appending the following lines to ~/.bashrc file:

export DERBY_HOME=/usr/local/derby
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar

使用以下命令执行 ~/.bashrc 文件:

The following command is used to execute ~/.bashrc file:

$ source ~/.bashrc

Create a directory to store Metastore

在 $DERBY_HOME 目录中创建一个名为 data 的目录,用于存储 Metastore 数据。

Create a directory named data in $DERBY_HOME directory to store Metastore data.

$ mkdir $DERBY_HOME/data

Derby 安装和环境设置已完成。

Derby installation and environmental setup is now complete.

Step 7: Configuring Metastore of Hive

配置 Metastore 意味着向 Hive 指定数据库的存储位置。通过编辑 hive-site.xml 文件进行操作,该文件位于 $HIVE_HOME/conf 目录中。首先,使用以下命令复制模板文件:

Configuring Metastore means specifying to Hive where the database is stored. You can do this by editing the hive-site.xml file, which is in the $HIVE_HOME/conf directory. First of all, copy the template file using the following command:

$ cd $HIVE_HOME/conf
$ cp hive-default.xml.template hive-site.xml

编辑 hive-site.xml ,并在 <configuration> 和 </configuration> 标记之间添加以下行:

Edit hive-site.xml and append the following lines between the <configuration> and </configuration> tags:

<property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
   <description>JDBC connect string for a JDBC metastore </description>
</property>

创建一个名为 jpox.properties 的文件,并添加以下行:

Create a file named jpox.properties and add the following lines into it:

javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create=true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine

Step 8: Verifying Hive Installation

在运行 Hive 之前,您需要在 HDFS 中创建 /tmp 文件夹和一个单独的 Hive 文件夹(此处我们使用 /user/hive/warehouse 文件夹),并为这些新建文件夹设置写入权限(chmod g+w)。

Before running Hive, you need to create the /tmp folder and a separate Hive folder in HDFS. Here, we use the /user/hive/warehouse folder. You need to set write permission (chmod g+w) for these newly created folders.

现在,在验证 Hive 之前,使用以下命令在 HDFS 中完成上述设置:

Now set them up in HDFS before verifying Hive. Use the following commands:

$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

使用以下命令验证 Hive 安装:

The following commands are used to verify Hive installation:

$ cd $HIVE_HOME
$ bin/hive

在成功安装 Hive 后,您将看到以下响应:

On successful installation of Hive, you get to see the following response:

Logging initialized using configuration in jar:file:/home/hadoop/hive-0.9.0/lib/hive-common-0.9.0.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201312121621_1494929084.txt
………………….
hive>

执行以下示例命令以显示所有表格:

The following sample command is executed to display all the tables:

hive> show tables;
OK
Time taken: 2.798 seconds
hive>

Hive - Data Types

本章将介绍 Hive 中的各种数据类型,这些数据类型都参与表创建。Hive 中的所有数据类型都分为以下四类:

This chapter takes you through the different data types in Hive, which are involved in the table creation. All the data types in Hive are classified into four types, given as follows:

  1. Column Types

  2. Literals

  3. Null Values

  4. Complex Types

Column Types

列类型用作 Hive 的列数据类型。如下所示:

Column types are used as the column data types of Hive. They are as follows:

Integral Types

整型数据类型可以使用整数数据类型 INT 指定。当数据范围超过INT的范围时,您需要使用BIGINT,如果数据范围小于INT,则可以使用SMALLINT。TINYINT小于SMALLINT。

Integer type data can be specified using the integral data type INT. When the data range exceeds the range of INT, you need to use BIGINT, and if the data range is smaller than INT, you can use SMALLINT. TINYINT is smaller than SMALLINT.

下表描绘了各种 INT 数据类型:

The following table depicts various INT data types:

| Type     | Postfix | Example |
|----------|---------|---------|
| TINYINT  | Y       | 10Y     |
| SMALLINT | S       | 10S     |
| INT      | -       | 10      |
| BIGINT   | L       | 10L     |

String Types

字符串类型数据类型可以使用单引号(')或双引号(")指定。它包含两种数据类型:VARCHAR 和 CHAR。Hive 遵循 C 类型转义字符。

String type data can be specified using single quotes (' ') or double quotes (" "). It contains two data types: VARCHAR and CHAR. Hive follows C-style escape characters.

下表列出了各种 CHAR 数据类型:

The following table depicts various CHAR data types:

| Data Type | Length     |
|-----------|------------|
| VARCHAR   | 1 to 65535 |
| CHAR      | 255        |

Timestamp

它支持传统的 UNIX 时间戳,并具有可选的纳秒精度。它支持 java.sql.Timestamp 格式“YYYY-MM-DD HH:MM:SS.fffffffff”和格式“yyyy-mm-dd hh:mm:ss.ffffffffff”。

It supports traditional UNIX timestamp with optional nanosecond precision. It supports java.sql.Timestamp format “YYYY-MM-DD HH:MM:SS.fffffffff” and format “yyyy-mm-dd hh:mm:ss.ffffffffff”.

Dates

DATE 值以年/月/日格式描述,形式为 {{YYYY-MM-DD}}。

DATE values are described in year/month/day format in the form {{YYYY-MM-DD}}.

Decimals

Hive 中的 DECIMAL 类型与 Java 的 Big Decimal 格式相同。它用于表示不可变的任意精度。语法和示例如下:

The DECIMAL type in Hive is the same as the Big Decimal format of Java. It is used for representing immutable arbitrary-precision numbers. The syntax and example are as follows:

DECIMAL(precision, scale)
decimal(10,0)
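作为上述各列类型的简要示意,下面是一个假设性的建表语句草图(表名与列名均为演示所做的假设,并非本教程原文内容),它同时使用了整数、字符串、时间戳、日期和 DECIMAL 类型:

As a brief, hedged sketch of how the column types above can appear together, the following hypothetical table declaration (the table and column names are assumptions made for illustration, not part of the original tutorial) uses the integral, string, timestamp, date, and decimal types:

hive> CREATE TABLE IF NOT EXISTS sales (
   > sale_id BIGINT,
   > quantity SMALLINT,
   > product_code VARCHAR(20),
   > region CHAR(2),
   > sale_ts TIMESTAMP,
   > sale_date DATE,
   > amount DECIMAL(10,2));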

Union Types

Union 是异构数据类型的集合。您可以使用 create union 创建一个实例。语法和示例如下:

Union is a collection of heterogeneous data types. You can create an instance using create union. The syntax and example is as follows:

UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>

{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}

Literals

Hive 中使用了以下文字:

The following literals are used in Hive:

Floating Point Types

浮点类型只不过是带有小数点的数字。通常,此类数据由 DOUBLE 数据类型组成。

Floating point types are nothing but numbers with decimal points. Generally, this type of data is composed of DOUBLE data type.

Decimal Type

十进制(Decimal)类型数据只不过是比 DOUBLE 数据类型范围更大的浮点值。

Decimal type data is nothing but a floating point value with a higher range than the DOUBLE data type.

Null Value

缺失值由特殊值 NULL 表示。

Missing values are represented by the special value NULL.

Complex Types

Hive 复杂数据类型如下:

The Hive complex data types are as follows:

Arrays

Hive 中的数组与它们在 Java 中的使用方式相同。

Arrays in Hive are used the same way they are used in Java.

语法:ARRAY<data_type>

Syntax: ARRAY<data_type>

Maps

Hive 中的映射与 Java 映射类似。

Maps in Hive are similar to Java Maps.

语法:MAP<primitive_type, data_type>

Syntax: MAP<primitive_type, data_type>

Structs

Hive 中的结构类似于使用带有注释的复杂数据。

Structs in Hive are similar to using complex data with comments.

语法:STRUCT<col_name : data_type [COMMENT col_comment], ...>

Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
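为了让复杂类型更直观,下面给出一个假设性的示例草图(表名、列名和分隔符均为演示所做的假设):它声明了 ARRAY、MAP 和 STRUCT 列,并演示如何访问其中的元素:

To make the complex types concrete, here is a small, hedged sketch (the table, columns, and delimiters are assumptions made for illustration) that declares ARRAY, MAP, and STRUCT columns and then accesses their elements:

hive> CREATE TABLE IF NOT EXISTS employee_details (
   > name STRING,
   > phones ARRAY<STRING>,
   > skills MAP<STRING, INT>,
   > address STRUCT<city:STRING, zip:INT>)
   > ROW FORMAT DELIMITED
   > FIELDS TERMINATED BY '\t'
   > COLLECTION ITEMS TERMINATED BY ','
   > MAP KEYS TERMINATED BY ':';

hive> SELECT phones[0], skills['hive'], address.city FROM employee_details;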

Hive - Create Database

Hive 是一种数据库技术,用于定义可分析结构化数据的数据库和数据表。结构化数据分析的主题是将数据以表格形式进行存储,并传递查询以进行分析。本章解释了如何创建 Hive 数据库。Hive 包含一个名为 default 的默认数据库。

Hive is a database technology that can define databases and tables to analyze structured data. The theme for structured data analysis is to store the data in a tabular manner, and pass queries to analyze it. This chapter explains how to create Hive database. Hive contains a default database named default.

Create Database Statement

Create Database 是一条用于在 Hive 中创建数据库的语句。Hive 中的数据库是一个命名空间或一系列数据表的集合。此语句的语法如下:

Create Database is a statement used to create a database in Hive. A database in Hive is a namespace or a collection of tables. The syntax for this statement is as follows:

CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>

此处,IF NOT EXISTS 是一个可选项,用于通知用户是否已存在同名数据库。我们可以在此命令中使用 SCHEMA 代替 DATABASE。执行以下查询可创建名为 userdb 的数据库:

Here, IF NOT EXISTS is an optional clause, which notifies the user that a database with the same name already exists. We can use SCHEMA in place of DATABASE in this command. The following query is executed to create a database named userdb:

hive> CREATE DATABASE IF NOT EXISTS userdb;

or

or

hive> CREATE SCHEMA userdb;

使用以下查询可验证数据库列表:

The following query is used to verify a databases list:

hive> SHOW DATABASES;
default
userdb
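创建数据库后,可以使用 USE 语句切换到该数据库,使后续创建的表都位于其中。例如:

After creating the database, you can switch to it with the USE statement so that the tables you create afterwards belong to it. For example:

hive> USE userdb;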

JDBC Program

以下为创建数据库的 JDBC 程序。

The JDBC program to create a database is given below.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveCreateDb {
   private static String driverName =
   "org.apache.hadoop.hive.jdbc.HiveDriver";
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);
      // get connection
      Connection con = DriverManager.
      getConnection("jdbc:hive://localhost:10000/default", "", "");
      Statement stmt = con.createStatement();
      stmt.executeQuery("CREATE DATABASE userdb");
      System.out.println("Database userdb created successfully.");
      con.close();
   }
}

将程序保存到名为 HiveCreateDb.java 的文件中。使用以下命令编译并执行此程序。

Save the program in a file named HiveCreateDb.java. The following commands are used to compile and execute this program.

$ javac HiveCreateDb.java
$ java HiveCreateDb

Output:

Database userdb created successfully.

Hive - Drop Database

本章介绍如何在 Hive 中删除数据库。SCHEMA 和 DATABASE 的用法相同。

This chapter describes how to drop a database in Hive. The usage of SCHEMA and DATABASE are same.

Drop Database Statement

Drop Database 是一条删除数据库中所有数据表并删除数据库本身的语句。其语法如下:

Drop Database is a statement that drops all the tables and deletes the database. Its syntax is as follows:

DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];

使用以下查询可删除数据库。我们假设数据库名称为 userdb

The following queries are used to drop a database. Let us assume that the database name is userdb.

hive> DROP DATABASE IF EXISTS userdb;

以下查询使用 CASCADE 删除数据库。这意味着在删除数据库之前先删除各自的数据表。

The following query drops the database using CASCADE. It means dropping respective tables before dropping the database.

hive> DROP DATABASE IF EXISTS userdb CASCADE;

以下查询使用 SCHEMA 删除数据库。

The following query drops the database using SCHEMA.

hive> DROP SCHEMA userdb;

此条款已添加到 Hive 0.6 中。

This clause was added in Hive 0.6.

JDBC Program

以下为删除数据库的 JDBC 程序。

The JDBC program to drop a database is given below.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveDropDb {
   private static String driverName =
   "org.apache.hadoop.hive.jdbc.HiveDriver";
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);
      // get connection
      Connection con = DriverManager.
      getConnection("jdbc:hive://localhost:10000/default", "", "");
      Statement stmt = con.createStatement();
      stmt.executeQuery("DROP DATABASE userdb");
      System.out.println("Drop userdb database successful.");
      con.close();
   }
}

在名为 HiveDropDb.java 的文件中保存程序。以下为编译和执行此程序的命令。

Save the program in a file named HiveDropDb.java. Given below are the commands to compile and execute this program.

$ javac HiveDropDb.java
$ java HiveDropDb

Output:

Drop userdb database successful.

Hive - Create Table

本章说明了如何创建表以及如何向其中插入数据。在 HIVE 中创建表的约定与使用 SQL 创建表非常相似。

This chapter explains how to create a table and how to insert data into it. The conventions of creating a table in Hive are quite similar to creating a table using SQL.

Create Table Statement

Create Table 是一款用于在 Hive 中创建表的语句。语法和示例如下:

Create Table is a statement used to create a table in Hive. The syntax and example are as follows:

Syntax

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name

[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]

Example

让我们假设您需要使用 CREATE TABLE 语句创建一个名为 employee 的表。下表列出了员工表中的字段及其数据类型:

Let us assume you need to create a table named employee using CREATE TABLE statement. The following table lists the fields and their data types in employee table:

| Sr.No | Field Name  | Data Type |
|-------|-------------|-----------|
| 1     | Eid         | int       |
| 2     | Name        | String    |
| 3     | Salary      | Float     |
| 4     | Designation | String    |

以下数据定义了注释(COMMENT)以及行格式字段,例如字段终止符、行终止符和存储文件类型。

The following data defines a Comment and the row format fields such as the field terminator, lines terminator, and stored file type.

COMMENT 'Employee details'
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE

以下查询使用上述数据创建名为 employee 的表。

The following query creates a table named employee using the above data.

hive> CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
> salary Float, designation String)
> COMMENT 'Employee details'
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> LINES TERMINATED BY '\n'
> STORED AS TEXTFILE;

如果你添加了 IF NOT EXISTS 选项,如果表已经存在,Hive 会忽略该声明。

If you add the option IF NOT EXISTS, Hive ignores the statement in case the table already exists.

在成功创建表后,你可以看到以下响应:

On successful creation of table, you get to see the following response:

OK
Time taken: 5.905 seconds
hive>
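上面的语法还允许使用 EXTERNAL 关键字。下面是一个带有假设 HDFS 路径的外部表示例草图(该路径仅为演示假设):外部表的数据保留在指定位置,Hive 只管理其元数据:

The syntax above also allows the EXTERNAL keyword. As a hedged sketch (the HDFS location below is an assumption made for illustration), an external table keeps its data at the given location and Hive manages only the metadata:

hive> CREATE EXTERNAL TABLE IF NOT EXISTS employee_ext ( eid int, name String,
> salary Float, designation String)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> LINES TERMINATED BY '\n'
> STORED AS TEXTFILE
> LOCATION '/user/hive/external/employee';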

JDBC Program

下面给出了创建表的 JDBC 程序示例。

The JDBC program to create a table is given below.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveCreateTable {
   private static String driverName =
   "org.apache.hadoop.hive.jdbc.HiveDriver";
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);
      // get connection
      Connection con = DriverManager.
      getConnection("jdbc:hive://localhost:10000/userdb", "", "");
      // create statement
      Statement stmt = con.createStatement();
      // execute statement
      stmt.executeQuery("CREATE TABLE IF NOT EXISTS "
      +" employee ( eid int, name String, "
      +" salary String, destignation String)"
      +" COMMENT ‘Employee details’"
      +" ROW FORMAT DELIMITED"
      +" FIELDS TERMINATED BY ‘\t’"
      +" LINES TERMINATED BY ‘\n’"
      +" STORED AS TEXTFILE;");
      System.out.println(“ Table employee created.”);
      con.close();
   }
}

将程序保存到名为 HiveCreateTable.java 的文件中。使用以下命令编译并执行此程序。

Save the program in a file named HiveCreateTable.java. The following commands are used to compile and execute this program.

$ javac HiveCreateTable.java
$ java HiveCreateTable

Output

Table employee created.

Load Data Statement

通常,在 SQL 中创建表之后,我们可以使用 Insert 语句插入数据。但在 Hive 中,我们可以使用 LOAD DATA 语句插入数据。

Generally, after creating a table in SQL, we can insert data using the Insert statement. But in Hive, we can insert data using the LOAD DATA statement.

在将数据插入 Hive 时,最好使用 LOAD DATA 存储大量记录。有两种方法可以加载数据:一种是来自本地文件系统,另一种是来自 Hadoop 文件系统。

While inserting data into Hive, it is better to use LOAD DATA to store bulk records. There are two ways to load data: one is from the local file system and the other is from the Hadoop file system.

Syntax

加载数据的语法如下:

The syntax for load data is as follows:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]

  1. LOCAL is an identifier to specify the local path. It is optional.

  2. OVERWRITE is optional to overwrite the data in the table.

  3. PARTITION is optional.

Example

我们将向表中插入以下数据。它是一个文本文件,名为 sample.txt ,位于 /home/user 目录中。

We will insert the following data into the table. It is a text file named sample.txt in /home/user directory.

1201  Gopal       45000    Technical manager
1202  Manisha     45000    Proof reader
1203  Masthanvali 40000    Technical writer
1204  Krian       40000    Hr Admin
1205  Kranthi     30000    Op Admin

以下查询将给定的文本加载到表中。

The following query loads the given text into the table.

hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt'
> OVERWRITE INTO TABLE employee;

在成功加载后,您可以看到以下响应:

On successful load, you get to see the following response:

OK
Time taken: 15.905 seconds
hive>
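上面的示例从本地文件系统加载数据。若要加载已经位于 HDFS 中的文件,只需省略 LOCAL 关键字;下面的路径仅为演示假设。注意,从 HDFS 加载时,文件会被移动到表所在的仓库目录,而不是复制。

The example above loads from the local file system. To load a file that already resides in HDFS, simply omit the LOCAL keyword; the path below is only an assumed example location. Note that when loading from HDFS, the file is moved into the table's warehouse directory rather than copied.

hive> LOAD DATA INPATH '/user/hive/input/sample.txt'
> OVERWRITE INTO TABLE employee;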

JDBC Program

下面给出了将给定数据加载到表中的 JDBC 程序。

Given below is the JDBC program to load given data into the table.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveLoadData {
   private static String driverName =
   "org.apache.hadoop.hive.jdbc.HiveDriver";
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);
      // get connection
      Connection con = DriverManager.
      getConnection("jdbc:hive://localhost:10000/userdb", "", "");
      // create statement
      Statement stmt = con.createStatement();
      // execute statement
      stmt.executeQuery("LOAD DATA LOCAL INPATH '/home/user/sample.txt'"
      +"OVERWRITE INTO TABLE employee;");
      System.out.println("Load Data into employee successful");
      con.close();
   }
}

将程序保存到名为 HiveLoadData.java 的文件中。使用以下命令编译并执行此程序。

Save the program in a file named HiveLoadData.java. Use the following commands to compile and execute this program.

$ javac HiveLoadData.java
$ java HiveLoadData

Output:

Load Data into employee successful

Hive - Alter Table

本章介绍如何修改表的属性,例如更改表名、更改列名、添加列以及删除或替换列。

This chapter explains how to alter the attributes of a table such as changing its table name, changing column names, adding columns, and deleting or replacing columns.

Alter Table Statement

用于在 Hive 中更改表格。

It is used to alter a table in Hive.

Syntax

该语句根据我们希望在表中修改哪些属性采用以下任一语法。

The statement takes any of the following syntaxes based on what attributes we wish to modify in a table.

ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])

Rename To… Statement

以下查询将表从 employee 重命名为 emp

The following query renames the table from employee to emp.

hive> ALTER TABLE employee RENAME TO emp;

JDBC Program

用于重命名表的 JDBC 程序如下。

The JDBC program to rename a table is as follows.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveAlterRenameTo {
   private static String driverName =
   "org.apache.hadoop.hive.jdbc.HiveDriver";
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);
      // get connection
      Connection con = DriverManager.
      getConnection("jdbc:hive://localhost:10000/userdb", "", "");
      // create statement
      Statement stmt = con.createStatement();
      // execute statement
      stmt.executeQuery("ALTER TABLE employee RENAME TO emp;");
      System.out.println("Table Renamed Successfully");
      con.close();
   }
}

将程序保存在名为 HiveAlterRenameTo.java 的文件中。使用以下命令来编译和执行此程序。

Save the program in a file named HiveAlterRenameTo.java. Use the following commands to compile and execute this program.

$ javac HiveAlterRenameTo.java
$ java HiveAlterRenameTo

Output:

Table renamed successfully.

Change Statement

下表包含 employee 表的字段,并显示要更改的字段(以粗体显示)。

The following table contains the fields of employee table and it shows the fields to be changed (in bold).

| Field Name  | Convert from Data Type | Change Field Name | Convert to Data Type |
|-------------|------------------------|-------------------|----------------------|
| eid         | int                    | eid               | int                  |
| name        | String                 | ename             | String               |
| salary      | Float                  | salary            | Double               |
| designation | String                 | designation       | String               |

以下查询使用上述数据重命名列名称和列数据类型:

The following queries rename the column name and column data type using the above data:

hive> ALTER TABLE employee CHANGE name ename String;
hive> ALTER TABLE employee CHANGE salary salary Double;

JDBC Program

以下是用于更改列的 JDBC 程序。

Given below is the JDBC program to change a column.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveAlterChangeColumn {
   private static String driverName =
   "org.apache.hadoop.hive.jdbc.HiveDriver";
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);
      // get connection
      Connection con = DriverManager.
      getConnection("jdbc:hive://localhost:10000/userdb", "", "");
      // create statement
      Statement stmt = con.createStatement();
      // execute statement
      stmt.executeQuery("ALTER TABLE employee CHANGE name ename String;");
      stmt.executeQuery("ALTER TABLE employee CHANGE salary salary Double;");
      System.out.println("Change column successful.");
      con.close();
   }
}

将程序保存在名为 HiveAlterChangeColumn.java 的文件中。使用以下命令来编译和执行此程序。

Save the program in a file named HiveAlterChangeColumn.java. Use the following commands to compile and execute this program.

$ javac HiveAlterChangeColumn.java
$ java HiveAlterChangeColumn

Output:

Change column successful.

Add Columns Statement

以下查询向 employee 表中添加一个名为 dept 的列。

The following query adds a column named dept to the employee table.

hive> ALTER TABLE employee ADD COLUMNS (
   > dept STRING COMMENT 'Department name');

JDBC Program

用于向表中添加列的 JDBC 程序如下。

The JDBC program to add a column to a table is given below.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveAlterAddColumn {
   private static String driverName =
   "org.apache.hadoop.hive.jdbc.HiveDriver";
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);
      // get connection
      Connection con = DriverManager.
      getConnection("jdbc:hive://localhost:10000/userdb", "", "");
     // create statement
     Statement stmt = con.createStatement();
     // execute statement
     stmt.executeQuery("ALTER TABLE employee ADD COLUMNS "
     +" (dept STRING COMMENT 'Department name');");
     System.out.println("Add column successful.");
     con.close();
   }
}

将程序保存在名为 HiveAlterAddColumn.java 的文件中。使用以下命令来编译和执行此程序。

Save the program in a file named HiveAlterAddColumn.java. Use the following commands to compile and execute this program.

$ javac HiveAlterAddColumn.java
$ java HiveAlterAddColumn

Output:

Add column successful.

Replace Statement

以下查询从 employee 表中删除现有的所有列,并用 empid 和 ename 列替换:

The following query deletes all the existing columns from the employee table and replaces them with the empid and ename columns:

hive> ALTER TABLE employee REPLACE COLUMNS (
   > empid Int,
   > ename String);

JDBC Program

以下是用于将 eid 列替换为 empid、将 name 列替换为 ename 的 JDBC 程序。

Given below is the JDBC program to replace the eid column with empid and the name column with ename.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveAlterReplaceColumn {
   private static String driverName =
   "org.apache.hadoop.hive.jdbc.HiveDriver";
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);
      // get connection
      Connection con = DriverManager.
      getConnection("jdbc:hive://localhost:10000/userdb", "", "");
      // create statement
      Statement stmt = con.createStatement();
      // execute statement
      stmt.executeQuery("ALTER TABLE employee REPLACE COLUMNS "
      +" (eid INT empid Int,"
      +" ename STRING name String);");
      System.out.println(" Replace column successful");
      con.close();
   }
}

将程序保存在名为 HiveAlterReplaceColumn.java 的文件中。使用以下命令来编译和执行此程序。

Save the program in a file named HiveAlterReplaceColumn.java. Use the following commands to compile and execute this program.

$ javac HiveAlterReplaceColumn.java
$ java HiveAlterReplaceColumn

Output:

Replace column successful.

Hive - Drop Table

本章介绍如何在 Hive 中删除数据表。在从 Hive Metastore 中删除数据表时,它将删除该数据表/列数据及其元数据。它可以是常规数据表(存储在 Metastore 中)或外部数据表(存储在本地文件系统中);与它们的类型无关,Hive 会将两者以相同的方式进行处理。

This chapter describes how to drop a table in Hive. When you drop a table from Hive Metastore, it removes the table/column data and their metadata. It can be a normal table (stored in Metastore) or an external table (stored in local file system); Hive treats both in the same manner, irrespective of their types.

Drop Table Statement

语法如下:

The syntax is as follows:

DROP TABLE [IF EXISTS] table_name;

以下查询删除名为 employee 的数据表:

The following query drops a table named employee:

hive> DROP TABLE IF EXISTS employee;

在成功执行查询后,您可以看到以下响应:

On successful execution of the query, you get to see the following response:

OK
Time taken: 5.3 seconds
hive>

JDBC Program

以下 JDBC 程序删除 employee 数据表。

The following JDBC program drops the employee table.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveDropTable {
   private static String driverName =
   "org.apache.hadoop.hive.jdbc.HiveDriver";
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);
      // get connection
      Connection con = DriverManager.
      getConnection("jdbc:hive://localhost:10000/userdb", "", "");
      // create statement
      Statement stmt = con.createStatement();
      // execute statement
      stmt.executeQuery("DROP TABLE IF EXISTS employee;");
     System.out.println("Drop table successful.");
     con.close();
   }
}

在名为 HiveDropTable.java 的文件中保存程序。使用以下命令编译和执行此程序。

Save the program in a file named HiveDropTable.java. Use the following commands to compile and execute this program.

$ javac HiveDropTable.java
$ java HiveDropTable

Output:

Drop table successful

以下查询用于验证表列表:

The following query is used to verify the list of tables:

hive> SHOW TABLES;
emp
ok
Time taken: 2.1 seconds
hive>

Hive - Partitioning

Hive 会将表组织成分区。这是一种基于分区列(如日期、城市和部门)的值将表分成相关部分的方法。使用分区可以轻松地查询部分数据。

Hive organizes tables into partitions. It is a way of dividing a table into related parts based on the values of partitioned columns such as date, city, and department. Using partition, it is easy to query a portion of the data.

表或分区还可以进一步细分为桶(buckets),以便为数据提供额外的结构,从而实现更高效的查询。分桶基于表中某一列的哈希函数值进行。

Tables or partitions are sub-divided into buckets, to provide extra structure to the data that may be used for more efficient querying. Bucketing works based on the value of hash function of some column of a table.
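作为分桶的简要示意,下面是一个假设性的示例草图(表名和列名仅为演示假设):使用 CLUSTERED BY 声明的表会按照某一列的哈希值将行分配到固定数量的桶中:

As a brief, hedged sketch of bucketing (the table and column names are assumptions made for illustration), a table declared with CLUSTERED BY hashes rows on a column into a fixed number of buckets:

hive> CREATE TABLE IF NOT EXISTS emp_bucketed (id int, name String, dept String)
> CLUSTERED BY (id) INTO 4 BUCKETS
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t';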

例如,名为 Tab1 的表包含诸如 id、名称、部门和 yoj(即入职年)之类的员工数据。假设您需要检索 2012 年加入的所有员工的详细信息。查询将搜索整个表以获取所需信息。然而,如果您按年份对员工数据进行分区并将其存储在单独的文件中,这将减少查询处理时间。以下示例展示如何对文件及其数据进行分区:

For example, a table named Tab1 contains employee data such as id, name, dept, and yoj (i.e., year of joining). Suppose you need to retrieve the details of all employees who joined in 2012. A query searches the whole table for the required information. However, if you partition the employee data with the year and store it in a separate file, it reduces the query processing time. The following example shows how to partition a file and its data:

以下文件包含 employeedata 表。

The following file contains employeedata table.

/tab1/employeedata/file1

id, name, dept, yoj
1, gopal, TP, 2012
2, kiran, HR, 2012
3, kaleel, SC, 2013
4, Prasanth, SC, 2013

上面的数据按年份分成两个文件。

The above data is partitioned into two files using year.

/tab1/employeedata/2012/file2

1, gopal, TP, 2012
2, kiran, HR, 2012

/tab1/employeedata/2013/file3

3, kaleel, SC, 2013
4, Prasanth, SC, 2013
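在添加或删除分区之前,表本身必须声明为分区表。下面是一个极简的示例草图(列结构和本地文件路径均根据上面的示例数据假设而来),创建按 year 分区的 employee 表并加载其中一个分区:

Before partitions can be added or dropped, the table itself must be declared as partitioned. The following is a minimal, hedged sketch (the column layout and the local file path are assumed from the example data above) that creates the employee table partitioned by year and loads one partition:

hive> CREATE TABLE IF NOT EXISTS employee (id int, name String, dept String)
> PARTITIONED BY (year String)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ',';

hive> LOAD DATA LOCAL INPATH '/home/user/file2'
> INTO TABLE employee
> PARTITION (year='2012');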

Adding a Partition

我们可以通过更改表格来向表格添加分区。让我们假设我们有一个名为 employee 的表格,其中包含诸如Id、姓名、工资、职务、部门和yoj等字段。

We can add partitions to a table by altering the table. Let us assume we have a table called employee with fields such as Id, Name, Salary, Designation, Dept, and yoj.

Syntax:

ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec
[LOCATION 'location1'] partition_spec [LOCATION 'location2'] ...;

partition_spec:
: (p_column = p_col_value, p_column = p_col_value, ...)

以下查询用于向 employee 表中添加分区。

The following query is used to add a partition to the employee table.

hive> ALTER TABLE employee
> ADD PARTITION (year='2012')
> location '/2012/part2012';

Renaming a Partition

此命令的语法如下。

The syntax of this command is as follows.

ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec;

以下查询用于重命名分区:

The following query is used to rename a partition:

hive> ALTER TABLE employee PARTITION (year='1203')
   > RENAME TO PARTITION (Yoj='1203');

Dropping a Partition

以下语法用于删除分区:

The following syntax is used to drop a partition:

ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec, PARTITION partition_spec,...;

以下查询用于删除分区:

The following query is used to drop a partition:

hive> ALTER TABLE employee DROP IF EXISTS
   > PARTITION (year='1203');
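分区添加完成后,可以列出表的分区,并在查询中通过分区列裁剪需要扫描的数据。下面是一个简短的示例草图(假设 year 分区已按前文方式添加):

Once partitions have been added, you can list them and let the partition column prune the data scanned by a query. A short, hedged sketch (assuming the year partitions were added as shown above):

hive> SHOW PARTITIONS employee;

hive> SELECT * FROM employee WHERE year='2012';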

Hive - Built-in Operators

本章解释了 Hive 的内置运算符。Hive 中有四种类型的运算符:

This chapter explains the built-in operators of Hive. There are four types of operators in Hive:

  1. Relational Operators

  2. Arithmetic Operators

  3. Logical Operators

  4. Complex Operators

Relational Operators

这些运算符用于比较两个操作数。下表描述了 Hive 中可用的关系运算符:

These operators are used to compare two operands. The following table describes the relational operators available in Hive:

| Operator      | Operand             | Description                                                                                                  |
|---------------|---------------------|--------------------------------------------------------------------------------------------------------------|
| A = B         | all primitive types | TRUE if expression A is equivalent to expression B, otherwise FALSE.                                          |
| A != B        | all primitive types | TRUE if expression A is not equivalent to expression B, otherwise FALSE.                                      |
| A < B         | all primitive types | TRUE if expression A is less than expression B, otherwise FALSE.                                              |
| A <= B        | all primitive types | TRUE if expression A is less than or equal to expression B, otherwise FALSE.                                  |
| A > B         | all primitive types | TRUE if expression A is greater than expression B, otherwise FALSE.                                           |
| A >= B        | all primitive types | TRUE if expression A is greater than or equal to expression B, otherwise FALSE.                               |
| A IS NULL     | all types           | TRUE if expression A evaluates to NULL, otherwise FALSE.                                                      |
| A IS NOT NULL | all types           | FALSE if expression A evaluates to NULL, otherwise TRUE.                                                      |
| A LIKE B      | Strings             | TRUE if string pattern A matches to B, otherwise FALSE.                                                       |
| A RLIKE B     | Strings             | NULL if A or B is NULL, TRUE if any substring of A matches the Java regular expression B, otherwise FALSE.    |
| A REGEXP B    | Strings             | Same as RLIKE.                                                                                                 |

Example

让我们假设 employee 表由名为 Id、Name、Salary、Designation 和 Dept 的字段组成,如下所示。生成一个查询来检索 Id 为 1205 的员工详细信息。

Let us assume the employee table is composed of fields named Id, Name, Salary, Designation, and Dept as shown below. Generate a query to retrieve the employee details whose Id is 1205.

+------+--------------+--------+-------------------+-------+
| Id   | Name         | Salary | Designation       | Dept  |
+------+--------------+--------+-------------------+-------+
| 1201 | Gopal        | 45000  | Technical manager | TP    |
| 1202 | Manisha      | 45000  | Proofreader       | PR    |
| 1203 | Masthanvali  | 40000  | Technical writer  | TP    |
| 1204 | Krian        | 40000  | Hr Admin          | HR    |
| 1205 | Kranthi      | 30000  | Op Admin          | Admin |
+------+--------------+--------+-------------------+-------+

执行以下查询以使用上述表格检索员工详细信息:

The following query is executed to retrieve the employee details using the above table:

hive> SELECT * FROM employee WHERE Id=1205;

在成功执行查询后,您可以看到以下响应:

On successful execution of query, you get to see the following response:

+------+---------+--------+-------------+-------+
| Id   | Name    | Salary | Designation | Dept  |
+------+---------+--------+-------------+-------+
| 1205 | Kranthi | 30000  | Op Admin    | Admin |
+------+---------+--------+-------------+-------+

执行以下查询以检索工资大于或等于 40000 卢比的员工详细信息。

The following query is executed to retrieve the employee details whose salary is more than or equal to Rs 40000.

hive> SELECT * FROM employee WHERE Salary>=40000;

在成功执行查询后,您可以看到以下响应:

On successful execution of query, you get to see the following response:

+-----+------------+--------+----------------------------+------+
| ID  | Name       | Salary | Designation                | Dept |
+-----+------------+--------+----------------------------+------+
|1201 | Gopal      | 45000  | Technical manager          | TP   |
|1202 | Manisha    | 45000  | Proofreader                | PR   |
|1203 | Masthanvali| 40000  | Technical writer           | TP   |
|1204 | Krian      | 40000  | Hr Admin                   | HR   |
+-----+------------+--------+----------------------------+------+

Arithmetic Operators

这些运算符支持操作数上的各种常见算术运算。它们都返回数字类型。下表描述了 Hive 中可用的算术运算符:

These operators support various common arithmetic operations on the operands. All of them return number types. The following table describes the arithmetic operators available in Hive:

| Operators | Operand          | Description                                          |
|-----------|------------------|------------------------------------------------------|
| A + B     | all number types | Gives the result of adding A and B.                  |
| A - B     | all number types | Gives the result of subtracting B from A.            |
| A * B     | all number types | Gives the result of multiplying A and B.             |
| A / B     | all number types | Gives the result of dividing A by B.                 |
| A % B     | all number types | Gives the remainder resulting from dividing A by B.  |
| A & B     | all number types | Gives the result of bitwise AND of A and B.          |
| A \| B    | all number types | Gives the result of bitwise OR of A and B.           |
| A ^ B     | all number types | Gives the result of bitwise XOR of A and B.          |
| ~A        | all number types | Gives the result of bitwise NOT of A.                |

Example

以下查询对两个数字相加,20 和 30。

The following query adds two numbers, 20 and 30.

hive> SELECT 20+30 ADD FROM temp;

在成功执行查询后,您可以看到以下响应:

On successful execution of the query, you get to see the following response:

+--------+
|   ADD  |
+--------+
|   50   |
+--------+

Logical Operators

运算符是逻辑表达式。它们都返回 TRUE 或 FALSE。

The operators are logical expressions. All of them return either TRUE or FALSE.

| Operators | Operands | Description                                               |
|-----------|----------|-----------------------------------------------------------|
| A AND B   | boolean  | TRUE if both A and B are TRUE, otherwise FALSE.           |
| A && B    | boolean  | Same as A AND B.                                          |
| A OR B    | boolean  | TRUE if either A or B or both are TRUE, otherwise FALSE.  |
| A \|\| B  | boolean  | Same as A OR B.                                           |
| NOT A     | boolean  | TRUE if A is FALSE, otherwise FALSE.                      |
| !A        | boolean  | Same as NOT A.                                            |

Example

以下查询用于检索部门为 TP 且薪水大于 40000 卢比的员工详细信息。

The following query is used to retrieve employee details whose Department is TP and Salary is more than Rs 40000.

hive> SELECT * FROM employee WHERE Salary>40000 && Dept='TP';

在成功执行查询后,您可以看到以下响应:

On successful execution of the query, you get to see the following response:

+------+--------------+-------------+-------------------+--------+
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+
|1201  | Gopal        | 45000       | Technical manager | TP     |
+------+--------------+-------------+-------------------+--------+

Complex Operators

这些运算符提供一个表达式来访问复杂类型元素。

These operators provide an expression to access the elements of Complex Types.

| Operator | Operand                             | Description                                                                |
|----------|-------------------------------------|----------------------------------------------------------------------------|
| A[n]     | A is an Array and n is an int       | It returns the nth element in the array A. The first element has index 0.  |
| M[key]   | M is a Map<K, V> and key has type K | It returns the value corresponding to the key in the map.                  |
| S.x      | S is a struct                       | It returns the x field of S.                                                |
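作为这些复杂运算符的示意,下面的查询沿用了数据类型一章中假设的 employee_details 示例表,分别访问数组元素、映射值和结构体字段:

As a hedged illustration of these complex operators, the following query reuses the hypothetical employee_details table sketched in the Data Types chapter and accesses an array element, a map value, and a struct field:

hive> SELECT phones[0], skills['hive'], address.city
   > FROM employee_details;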

HiveQL Select...Where

Hive 查询语言 (HiveQL) 是 Hive 处理和分析 Metastore 中结构化数据的查询语言。本章解释了如何将 SELECT 语句与 WHERE 子句结合使用。

The Hive Query Language (HiveQL) is a query language for Hive to process and analyze structured data in a Metastore. This chapter explains how to use the SELECT statement with WHERE clause.

SELECT 语句用于从表中检索数据。WHERE 子句类似于一个条件。它使用条件过滤数据,并为你提供有限的结果。内置运算符和函数生成一个表达式,该表达式满足条件。

SELECT statement is used to retrieve the data from a table. WHERE clause works similar to a condition. It filters the data using the condition and gives you a finite result. The built-in operators and functions generate an expression, which fulfils the condition.

Syntax

下面给出了 SELECT 查询的语法:

Given below is the syntax of the SELECT query:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];

Example

我们以一个 SELECT…​WHERE 子句的示例为例。假设我们有如下的雇员表,其中字段名为 Id、Name、Salary、Designation 和 Dept。生成一个查询来检索工资超过 30000 卢比的员工详细信息。

Let us take an example of the SELECT…WHERE clause. Assume we have the employee table given below, with fields named Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the details of the employees who earn a salary of more than Rs 30000.

+------+--------------+-------------+-------------------+--------+
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+
|1201  | Gopal        | 45000       | Technical manager | TP     |
|1202  | Manisha      | 45000       | Proofreader       | PR     |
|1203  | Masthanvali  | 40000       | Technical writer  | TP     |
|1204  | Krian        | 40000       | Hr Admin          | HR     |
|1205  | Kranthi      | 30000       | Op Admin          | Admin  |
+------+--------------+-------------+-------------------+--------+

以下查询使用上述场景检索员工详细信息:

The following query retrieves the employee details using the above scenario:

hive> SELECT * FROM employee WHERE salary>30000;

在成功执行查询后,您可以看到以下响应:

On successful execution of the query, you get to see the following response:

+------+--------------+-------------+-------------------+--------+
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+
|1201  | Gopal        | 45000       | Technical manager | TP     |
|1202  | Manisha      | 45000       | Proofreader       | PR     |
|1203  | Masthanvali  | 40000       | Technical writer  | TP     |
|1204  | Krian        | 40000       | Hr Admin          | HR     |
+------+--------------+-------------+-------------------+--------+

JDBC Program

以下 JDBC 程序将 where 子句应用于给定示例。

The JDBC program to apply where clause for the given example is as follows.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveQLWhere {
   private static String driverName =
   "org.apache.hadoop.hive.jdbc.HiveDriver";
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);
      // get connection
      Connection con = DriverManager.
      getConnection("jdbc:hive://localhost:10000/userdb", "", "");
      // create statement
      Statement stmt = con.createStatement();
      // execute statement
      ResultSet res = stmt.executeQuery("SELECT * FROM employee WHERE salary > 30000");
      System.out.println("Result:");
      System.out.println(" ID \t Name \t Salary \t Designation \t Dept ");
      while (res.next()) {
         System.out.println(res.getInt(1)+" "+ res.getString(2)+" "+
         res.getDouble(3)+" "+ res.getString(4)+" "+ res.getString(5));
      }
      con.close();
   }
}

将程序保存到名为 HiveQLWhere.java 的文件中。使用以下命令编译并执行此程序。

Save the program in a file named HiveQLWhere.java. Use the following commands to compile and execute this program.

$ javac HiveQLWhere.java
$ java HiveQLWhere

Output:

ID       Name           Salary      Designation          Dept
1201     Gopal          45000       Technical manager    TP
1202     Manisha        45000       Proofreader          PR
1203     Masthanvali    40000       Technical writer     TP
1204     Krian          40000       Hr Admin             HR

HiveQL Select…Order By

本章解释了如何在 SELECT 语句中使用 ORDER BY 从句。ORDER BY 从句用于基于一列检索详细信息,并按升序或降序对结果集进行排序。

This chapter explains how to use the ORDER BY clause in a SELECT statement. The ORDER BY clause is used to retrieve the details based on one column and to sort the result set in ascending or descending order.

Syntax

下面是 ORDER BY 从句的语法:

Given below is the syntax of the ORDER BY clause:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];

Example

我们来看一个 SELECT…​ORDER BY 从句的示例。假设员工表如下所示,其中有 Id、Name、Salary、Designation 和 Dept 字段。生成一个查询,以根据部门名称按顺序检索员工详细信息。

Let us take an example of the SELECT…ORDER BY clause. Assume the employee table given below, with the fields named Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the employee details ordered by department name.

+------+--------------+-------------+-------------------+--------+
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+
|1201  | Gopal        | 45000       | Technical manager | TP     |
|1202  | Manisha      | 45000       | Proofreader       | PR     |
|1203  | Masthanvali  | 40000       | Technical writer  | TP     |
|1204  | Krian        | 40000       | Hr Admin          | HR     |
|1205  | Kranthi      | 30000       | Op Admin          | Admin  |
+------+--------------+-------------+-------------------+--------+

以下查询使用上述场景检索员工详细信息:

The following query retrieves the employee details using the above scenario:

hive> SELECT * FROM employee ORDER BY Dept;

在成功执行查询后,您可以看到以下响应:

On successful execution of the query, you get to see the following response:

+------+--------------+-------------+-------------------+--------+
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+
|1205  | Kranthi      | 30000       | Op Admin          | Admin  |
|1204  | Krian        | 40000       | Hr Admin          | HR     |
|1202  | Manisha      | 45000       | Proofreader       | PR     |
|1201  | Gopal        | 45000       | Technical manager | TP     |
|1203  | Masthanvali  | 40000       | Technical writer  | TP     |
+------+--------------+-------------+-------------------+--------+
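
ORDER BY also accepts an explicit sort direction (ASC or DESC). As a sketch, the following query lists the three highest-paid employees from the same table:

hive> SELECT Id, Name, Salary FROM employee ORDER BY Salary DESC LIMIT 3;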

JDBC Program

以下是用于为给定示例应用 Order By 子句的 JDBC 程序。

Here is the JDBC program to apply Order By clause for the given example.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveQLOrderBy {
   private static String driverName =
   "org.apache.hadoop.hive.jdbc.HiveDriver";
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);
      // get connection
      Connection con = DriverManager.
      getConnection("jdbc:hive://localhost:10000/userdb", "", "");
      // create statement
      Statement stmt = con.createStatement();
      // execute statement
      ResultSet res = stmt.executeQuery("SELECT * FROM employee ORDER BY Dept");
      System.out.println(" ID \t Name \t Salary \t Designation \t Dept ");
      while (res.next()) {
         System.out.println(res.getInt(1)+" "+ res.getString(2)+" "+
         res.getDouble(3)+" "+ res.getString(4)+" "+ res.getString(5));
      }
      con.close();
   }
}

将程序保存在名为 HiveQLOrderBy.java 的文件中。使用以下命令编译并执行此程序。

Save the program in a file named HiveQLOrderBy.java. Use the following commands to compile and execute this program.

$ javac HiveQLOrderBy.java
$ java HiveQLOrderBy

Output:

ID       Name           Salary      Designation          Dept
1205     Kranthi        30000       Op Admin             Admin
1204     Krian          40000       Hr Admin             HR
1202     Manisha        45000       Proofreader          PR
1201     Gopal          45000       Technical manager    TP
1203     Masthanvali    40000       Technical writer     TP

HiveQL Group By

本章解释了 SELECT 语句中 GROUP BY 从句的详细信息。GROUP BY 从句用于使用特定收集列对结果集汇总所有记录。它用于查询一组记录。

This chapter explains the details of the GROUP BY clause in a SELECT statement. The GROUP BY clause is used to group all the records in a result set by a particular column. It is typically used together with aggregate functions to query a group of records.

Syntax

GROUP BY 从句的语法如下:

The syntax of GROUP BY clause is as follows:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];

Example

我们来看一个 SELECT…GROUP BY 从句的示例。假设员工表如下所示,其中有 Id、Name、Salary、Designation 和 Dept 字段。生成一个查询来获取每个部门的员工数量。

Let us take an example of the SELECT…GROUP BY clause. Assume the employee table given below, with Id, Name, Salary, Designation, and Dept fields. Generate a query to retrieve the number of employees in each department.

+------+--------------+-------------+-------------------+--------+
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+
|1201  | Gopal        | 45000       | Technical manager | TP     |
|1202  | Manisha      | 45000       | Proofreader       | PR     |
|1203  | Masthanvali  | 40000       | Technical writer  | TP     |
|1204  | Krian        | 45000       | Proofreader       | PR     |
|1205  | Kranthi      | 30000       | Op Admin          | Admin  |
+------+--------------+-------------+-------------------+--------+

以下查询使用以上方案来检索员工详细信息。

The following query retrieves the employee details using the above scenario.

hive> SELECT Dept,count(*) FROM employee GROUP BY DEPT;

在成功执行查询后,您可以看到以下响应:

On successful execution of the query, you get to see the following response:

+------+--------------+
| Dept | Count(*)     |
+------+--------------+
|Admin |    1         |
|PR    |    2         |
|TP    |    2         |
+------+--------------+
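
The HAVING clause shown in the syntax above can be used to filter the grouped results. As a sketch, the following query keeps only the departments that have more than one employee, which for the sample data would be PR and TP:

hive> SELECT Dept, count(*) FROM employee GROUP BY Dept HAVING count(*) > 1;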

JDBC Program

下面是将 Group By 从句应用于给定示例的 JDBC 程序。

Given below is the JDBC program to apply the Group By clause for the given example.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveQLGroupBy {
   private static String driverName =
   "org.apache.hadoop.hive.jdbc.HiveDriver";
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);
      // get connection
      Connection con = DriverManager.
      getConnection("jdbc:hive://localhost:10000/userdb", "", "");
      // create statement
      Statement stmt = con.createStatement();
      // execute statement
      ResultSet res = stmt.executeQuery("SELECT Dept, count(*) "
      + "FROM employee GROUP BY Dept");
      System.out.println(" Dept \t count(*)");
      while (res.next()) {
         System.out.println(res.getString(1)+" "+ res.getInt(2));
      }
      con.close();
   }
}

将程序保存在名为 HiveQLGroupBy.java 的文件中。使用以下命令编译并执行此程序。

Save the program in a file named HiveQLGroupBy.java. Use the following commands to compile and execute this program.

$ javac HiveQLGroupBy.java
$ java HiveQLGroupBy

Output:

 Dept     Count(*)
 Admin       1
 PR          2
 TP          2

HiveQL Joins

JOINS 是一个子句,用于通过使用每张表中重复的值组合来自两张表的特定字段。它用于组合来自数据库中两张或更多表的记录。

JOIN is a clause that is used for combining specific fields from two tables by using values common to each. It is used to combine records from two or more tables in a database.

Syntax

join_table:

   table_reference JOIN table_factor [join_condition]
   | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference
   join_condition
   | table_reference LEFT SEMI JOIN table_reference join_condition
   | table_reference CROSS JOIN table_reference [join_condition]

Example

我们在本章中将使用以下两张表。考虑名为 CUSTOMERS 的表。

We will use the following two tables in this chapter. Consider the following table named CUSTOMERS.

+----+----------+-----+-----------+----------+
| ID | NAME     | AGE | ADDRESS   | SALARY   |
+----+----------+-----+-----------+----------+
| 1  | Ramesh   | 32  | Ahmedabad | 2000.00  |
| 2  | Khilan   | 25  | Delhi     | 1500.00  |
| 3  | kaushik  | 23  | Kota      | 2000.00  |
| 4  | Chaitali | 25  | Mumbai    | 6500.00  |
| 5  | Hardik   | 27  | Bhopal    | 8500.00  |
| 6  | Komal    | 22  | MP        | 4500.00  |
| 7  | Muffy    | 24  | Indore    | 10000.00 |
+----+----------+-----+-----------+----------+

考虑另一个表 ORDERS 如下:

Consider another table ORDERS as follows:

+-----+---------------------+-------------+--------+
|OID  | DATE                | CUSTOMER_ID | AMOUNT |
+-----+---------------------+-------------+--------+
| 102 | 2009-10-08 00:00:00 |           3 | 3000   |
| 100 | 2009-10-08 00:00:00 |           3 | 1500   |
| 101 | 2009-11-20 00:00:00 |           2 | 1560   |
| 103 | 2008-05-20 00:00:00 |           4 | 2060   |
+-----+---------------------+-------------+--------+

给出如下几种类型的联接:

There are different types of joins given as follows:

  1. JOIN

  2. LEFT OUTER JOIN

  3. RIGHT OUTER JOIN

  4. FULL OUTER JOIN

JOIN

JOIN 子句用于组合和检索来自多张表的记录。JOIN 与 SQL 中的 OUTER JOIN 相同。JOIN 条件应使用表的键和外键创建。

The JOIN clause is used to combine and retrieve records from multiple tables. JOIN is the same as INNER JOIN in SQL. A JOIN condition is specified using the primary keys and foreign keys of the tables.

以下查询对 CUSTOMER 和 ORDER 表执行 JOIN 并检索记录:

The following query executes JOIN on the CUSTOMERS and ORDERS tables, and retrieves the records:

hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
   > FROM CUSTOMERS c JOIN ORDERS o
   > ON (c.ID = o.CUSTOMER_ID);

在成功执行查询后,您可以看到以下响应:

On successful execution of the query, you get to see the following response:

+----+----------+-----+--------+
| ID | NAME     | AGE | AMOUNT |
+----+----------+-----+--------+
| 3  | kaushik  | 23  | 3000   |
| 3  | kaushik  | 23  | 1500   |
| 2  | Khilan   | 25  | 1560   |
| 4  | Chaitali | 25  | 2060   |
+----+----------+-----+--------+

LEFT OUTER JOIN

HiveQL LEFT OUTER JOIN 返回左表的全部行,即使在右表中没有匹配行也是如此。这意味着,如果 ON 子句在右表中匹配 0(零)个记录,则 JOIN 仍然会返回结果中的行,但在右表的每一列中都会返回 NULL。

The HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if there are no matches in the right table. This means, if the ON clause matches 0 (zero) records in the right table, the JOIN still returns a row in the result, but with NULL in each column from the right table.

LEFT JOIN 返回左表中的全部值以及右表中的匹配值,如果匹配 JOIN 谓词不存在,则返回 NULL。

A LEFT JOIN returns all the values from the left table, plus the matched values from the right table, or NULL in case of no matching JOIN predicate.

以下查询演示了 CUSTOMER 和 ORDER 表之间的 LEFT OUTER JOIN:

The following query demonstrates LEFT OUTER JOIN between the CUSTOMERS and ORDERS tables:

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
   > FROM CUSTOMERS c
   > LEFT OUTER JOIN ORDERS o
   > ON (c.ID = o.CUSTOMER_ID);

在成功执行查询后,您可以看到以下响应:

On successful execution of the query, you get to see the following response:

+----+----------+--------+---------------------+
| ID | NAME     | AMOUNT | DATE                |
+----+----------+--------+---------------------+
| 1  | Ramesh   | NULL   | NULL                |
| 2  | Khilan   | 1560   | 2009-11-20 00:00:00 |
| 3  | kaushik  | 3000   | 2009-10-08 00:00:00 |
| 3  | kaushik  | 1500   | 2009-10-08 00:00:00 |
| 4  | Chaitali | 2060   | 2008-05-20 00:00:00 |
| 5  | Hardik   | NULL   | NULL                |
| 6  | Komal    | NULL   | NULL                |
| 7  | Muffy    | NULL   | NULL                |
+----+----------+--------+---------------------+

RIGHT OUTER JOIN

HiveQL RIGHT OUTER JOIN 返回右表的全部行,即使在左表中没有匹配行也是如此。如果 ON 子句在左表中匹配 0(零)个记录,则 JOIN 仍然会返回结果中的行,但在左表的每一列中都会返回 NULL。

The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if there are no matches in the left table. If the ON clause matches 0 (zero) records in the left table, the JOIN still returns a row in the result, but with NULL in each column from the left table.

RIGHT JOIN 返回右表中的全部值以及左表中的匹配值,如果匹配 JOIN 谓词不存在,则返回 NULL。

A RIGHT JOIN returns all the values from the right table, plus the matched values from the left table, or NULL in case of no matching join predicate.

以下查询演示了 CUSTOMER 和 ORDER 表之间的 RIGHT OUTER JOIN。

The following query demonstrates RIGHT OUTER JOIN between the CUSTOMERS and ORDERS tables.

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
   > FROM CUSTOMERS c
   > RIGHT OUTER JOIN ORDERS o
   > ON (c.ID = o.CUSTOMER_ID);

在成功执行查询后,您可以看到以下响应:

On successful execution of the query, you get to see the following response:

+------+----------+--------+---------------------+
| ID   | NAME     | AMOUNT | DATE                |
+------+----------+--------+---------------------+
| 3    | kaushik  | 3000   | 2009-10-08 00:00:00 |
| 3    | kaushik  | 1500   | 2009-10-08 00:00:00 |
| 2    | Khilan   | 1560   | 2009-11-20 00:00:00 |
| 4    | Chaitali | 2060   | 2008-05-20 00:00:00 |
+------+----------+--------+---------------------+

FULL OUTER JOIN

HiveQL FULL OUTER JOIN 组合满足 JOIN 条件的左外表和右外表的记录。连接表包含来自两张表的所有记录,或为两边缺失的匹配填入 NULL 值。

The HiveQL FULL OUTER JOIN combines the records of both the left and the right outer tables that fulfil the JOIN condition. The joined table contains all the records from both tables, filling in NULL values for missing matches on either side.

以下查询演示了 CUSTOMER 表与 ORDER 表之间的 FULL OUTER JOIN:

The following query demonstrates FULL OUTER JOIN between the CUSTOMERS and ORDERS tables:

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
   > FROM CUSTOMERS c
   > FULL OUTER JOIN ORDERS o
   > ON (c.ID = o.CUSTOMER_ID);

在成功执行查询后,您可以看到以下响应:

On successful execution of the query, you get to see the following response:

+------+----------+--------+---------------------+
| ID   | NAME     | AMOUNT | DATE                |
+------+----------+--------+---------------------+
| 1    | Ramesh   | NULL   | NULL                |
| 2    | Khilan   | 1560   | 2009-11-20 00:00:00 |
| 3    | kaushik  | 3000   | 2009-10-08 00:00:00 |
| 3    | kaushik  | 1500   | 2009-10-08 00:00:00 |
| 4    | Chaitali | 2060   | 2008-05-20 00:00:00 |
| 5    | Hardik   | NULL   | NULL                |
| 6    | Komal    | NULL   | NULL                |
| 7    | Muffy    | NULL   | NULL                |
+------+----------+--------+---------------------+
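
The join syntax above also lists LEFT SEMI JOIN and CROSS JOIN. A LEFT SEMI JOIN returns each row of the left table that has at least one match in the right table, without repeating it for multiple matches, and only columns of the left table may be selected. As a sketch on the same tables, the following query lists the customers who have placed at least one order:

hive> SELECT c.ID, c.NAME
   > FROM CUSTOMERS c
   > LEFT SEMI JOIN ORDERS o
   > ON (c.ID = o.CUSTOMER_ID);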

Hive - Built-in Functions

本章解释了 Hive 中可用的内建函数。这些函数除了用法外,都与 SQL 函数十分相似。

This chapter explains the built-in functions available in Hive. These functions are quite similar to their SQL counterparts, apart from minor differences in usage.

Built-In Functions

Hive 支持以下内建函数:

Hive supports the following built-in functions:

Return Type

Signature

Description

BIGINT

round(double a)

It returns the rounded BIGINT value of the double.

BIGINT

floor(double a)

It returns the maximum BIGINT value that is equal or less than the double.

BIGINT

ceil(double a)

It returns the minimum BIGINT value that is equal or greater than the double.

double

rand(), rand(int seed)

It returns a random number that changes from row to row.

string

concat(string A, string B,…​)

It returns the string resulting from concatenating B after A.

string

substr(string A, int start)

It returns the substring of A starting from start position till the end of string A.

string

substr(string A, int start, int length)

It returns the substring of A starting from start position with the given length.

string

upper(string A)

It returns the string resulting from converting all characters of A to upper case.

string

ucase(string A)

Same as above.

string

lower(string A)

It returns the string resulting from converting all characters of A to lower case.

string

lcase(string A)

Same as above.

string

trim(string A)

It returns the string resulting from trimming spaces from both ends of A.

string

ltrim(string A)

It returns the string resulting from trimming spaces from the beginning (left hand side) of A.

string

rtrim(string A)

It returns the string resulting from trimming spaces from the end (right-hand side) of A.

string

regexp_replace(string A, string B, string C)

It returns the string resulting from replacing all substrings in A that match the Java regular expression B with C.

int

size(Map<K.V>)

It returns the number of elements in the map type.

int

size(Array<T>)

It returns the number of elements in the array type.

value of <type>

cast(<expr> as <type>)

It converts the results of the expression expr to <type>, e.g. cast('1' as BIGINT) converts the string '1' to its integral representation. A NULL is returned if the conversion does not succeed.

string

from_unixtime(int unixtime)

It converts the number of seconds since the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format "1970-01-01 00:00:00".

string

to_date(string timestamp)

It returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01"

int

year(string date)

It returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970

int

month(string date)

It returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11

int

day(string date)

It returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1

string

get_json_object(string json_string, string path)

It extracts json object from a json string based on json path specified, and returns json string of the extracted json object. It returns NULL if the input json string is invalid.

Example

以下查询演示了一些内置函数:

The following queries demonstrate some built-in functions:

round() function

hive> SELECT round(2.6) from temp;

在成功执行查询后,您可以看到以下响应:

On successful execution of the query, you get to see the following response:

3.0

floor() function

hive> SELECT floor(2.6) from temp;

在成功执行查询后,您可以看到以下响应:

On successful execution of the query, you get to see the following response:

2.0

ceil() function

hive> SELECT ceil(2.6) from temp;

在成功执行查询后,您可以看到以下响应:

On successful execution of the query, you get to see the following response:

3.0
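
The string and date functions in the table are used the same way. As a minimal sketch (again assuming the temp table), the following query should return 'Had', 'HIVE', and '1970-01-01':

hive> SELECT substr('Hadoop', 1, 3), upper('hive'), to_date('1970-01-01 00:00:00') FROM temp;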

Aggregate Functions

Hive 支持以下内置 aggregate functions 。这些函数的使用与 SQL 聚合函数相同。

Hive supports the following built-in aggregate functions. The usage of these functions is the same as that of the SQL aggregate functions.

Return Type

Signature

Description

BIGINT

count(*), count(expr)

count(*) returns the total number of retrieved rows; count(expr) returns the number of rows for which the supplied expression is non-NULL.

DOUBLE

sum(col), sum(DISTINCT col)

It returns the sum of the elements in the group or the sum of the distinct values of the column in the group.

DOUBLE

avg(col), avg(DISTINCT col)

It returns the average of the elements in the group or the average of the distinct values of the column in the group.

DOUBLE

min(col)

It returns the minimum value of the column in the group.

DOUBLE

max(col)

It returns the maximum value of the column in the group.
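
As a sketch against the five-row employee table used throughout this guide, the following query applies all of these aggregate functions at once; with that sample data it should return 5, 200000, 40000.0, 30000, and 45000:

hive> SELECT count(*), sum(Salary), avg(Salary), min(Salary), max(Salary) FROM employee;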

Hive - Views and Indexes

本章介绍如何创建和管理视图。视图是根据用户需求生成的。你可以将任何结果集数据另存为视图。视图在 Hive 中的使用与视图在 SQL 中的使用相同。它是一种标准的 RDBMS 概念。我们可以在视图上执行所有 DML 操作。

This chapter describes how to create and manage views. Views are generated based on user requirements. You can save any result-set query as a view. The usage of a view in Hive is the same as that of a view in SQL; it is a standard RDBMS concept. Note, however, that Hive views are read-only: you can run SELECT queries against them, but not row-level DML.

Creating a View

执行 SELECT 语句时,可以创建一个视图。语法如下:

You create a view from a SELECT statement. The syntax is as follows:

CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...) ]
[COMMENT table_comment]
AS SELECT ...

Example

我们以视图为例。假设员工表如下所示,其中有 Id、Name、Salary、Designation 和 Dept 字段。生成一个查询,以检索获得超过 INR 30000 薪水的所有员工的详细信息。我们将结果存储在名为 emp_30000. 的视图中

Let us take an example of a view. Assume the employee table given below, with the fields Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the details of the employees who earn a salary of more than Rs 30000, and store the result in a view named emp_30000.

+------+--------------+-------------+-------------------+--------+
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+
|1201  | Gopal        | 45000       | Technical manager | TP     |
|1202  | Manisha      | 45000       | Proofreader       | PR     |
|1203  | Masthanvali  | 40000       | Technical writer  | TP     |
|1204  | Krian        | 40000       | Hr Admin          | HR     |
|1205  | Kranthi      | 30000       | Op Admin          | Admin  |
+------+--------------+-------------+-------------------+--------+

以下查询使用上述场景检索员工详细信息:

The following query retrieves the employee details using the above scenario:

hive> CREATE VIEW emp_30000 AS
   > SELECT * FROM employee
   > WHERE salary>30000;
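
Once created, the view can be queried like any table. For example:

hive> SELECT Id, Name FROM emp_30000;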

Dropping a View

使用以下语法删除视图:

Use the following syntax to drop a view:

DROP VIEW view_name

以下查询删除名为 emp_30000 的视图:

The following query drops a view named as emp_30000:

hive> DROP VIEW emp_30000;

Creating an Index

索引只是指向表中的特定列的一个指针。创建索引意味着在表中的特定列上创建指针。其语法如下:

An index is nothing but a pointer to a particular column of a table. Creating an index means creating a pointer to that column. The syntax is as follows:

CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
   [ ROW FORMAT ...] STORED AS ...
   | STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]

Example

我们以索引为例。使用我们之前用于 Id、Name、Salary、Designation 和 Dept 字段的同一员工表。在员工表的薪水列上创建一个名为 index_salary 的索引。

Let us take an example of an index. Use the same employee table that we have used earlier, with the fields Id, Name, Salary, Designation, and Dept. Create an index named index_salary on the salary column of the employee table.

以下查询创建索引:

The following query creates an index:

hive> CREATE INDEX index_salary ON TABLE employee(salary)
   > AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';

它是指向薪水列的指针。如果修改列,则使用索引值存储更改。

The index acts as a pointer to the salary column. If the column is modified, the index must be rebuilt so that it reflects the changes.
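
If an index is created WITH DEFERRED REBUILD, it holds no data until it is rebuilt, and an existing index also has to be rebuilt after the base table changes. The rebuild is triggered explicitly, for example:

hive> ALTER INDEX index_salary ON employee REBUILD;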

Dropping an Index

以下语法用于删除索引:

The following syntax is used to drop an index:

DROP INDEX <index_name> ON <table_name>

以下查询删除名为 index_salary 的索引:

The following query drops an index named index_salary:

hive> DROP INDEX index_salary ON employee;