Sqoop: A Concise Tutorial
Sqoop - Introduction
Traditional application management systems, in which applications interact with relational databases through an RDBMS, are one of the sources of Big Data. The data generated this way is stored on relational database servers, in relational database structures.
When the Big Data storage and analysis tools of the Hadoop ecosystem (such as MapReduce, Hive, HBase, Cassandra, and Pig) came into the picture, they needed a tool to interact with relational database servers for importing and exporting the Big Data residing in them. Sqoop fills this place in the Hadoop ecosystem by providing feasible interaction between relational database servers and Hadoop’s HDFS.
Sqoop − “SQL to Hadoop and Hadoop to SQL”
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system to relational databases. It is provided by the Apache Software Foundation.
Sqoop Import
The import tool imports individual tables from an RDBMS into HDFS. Each row in a table is treated as a record in HDFS. All records are stored either as text data in text files or as binary data in Avro and Sequence files.
Sqoop Export
The export tool exports a set of files from HDFS back to an RDBMS. The files given to Sqoop as input contain records, which correspond to rows in the table. These files are read and parsed into a set of records, with fields separated by a user-specified delimiter.
Sqoop - Installation
As Sqoop is a sub-project of Hadoop, it runs only on Linux operating systems. Follow the steps given below to install Sqoop on your system.
Step 1: Verifying JAVA Installation
You need to have Java installed on your system before installing Sqoop. Verify the Java installation using the following command:
$ java -version
If Java is already installed on your system, you will see a response like the following:
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
If Java is not installed on your system, follow the steps given below.
Installing Java
Follow the simple steps given below to install Java on your system.
Step 1
Download Java (JDK <latest version> - X64.tar.gz) by visiting the following link.
The file jdk-7u71-linux-x64.tar.gz will then be downloaded onto your system.
Step 2
Generally, you can find the downloaded Java file in the Downloads folder. Verify it and extract the jdk-7u71-linux-x64.tar.gz file using the following commands.
$ cd Downloads/
$ ls
jdk-7u71-linux-x64.tar.gz
$ tar zxf jdk-7u71-linux-x64.tar.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.tar.gz
Step 3
To make Java available to all users, move it to the location “/usr/local/”. Switch to the root user and type the following commands.
$ su
password:
# mv jdk1.7.0_71 /usr/local/java
# exit
Step 4
To set up the PATH and JAVA_HOME variables, add the following commands to the ~/.bashrc file.
export JAVA_HOME=/usr/local/java
export PATH=$PATH:$JAVA_HOME/bin
Now apply the changes to the current shell session.
$ source ~/.bashrc
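Several steps in this chapter append export lines to ~/.bashrc, and repeating a step would duplicate those lines. The following is a minimal sketch of one way to avoid that. The helper name append_once is hypothetical (it is not part of Java, Hadoop, or Sqoop), and the demonstration writes to a temporary file rather than to the real ~/.bashrc.

```shell
# append_once LINE FILE: append LINE to FILE unless an identical line is already present.
# (append_once is a hypothetical helper name used only for this illustration.)
append_once() {
  line=$1
  file=$2
  # -x matches whole lines, -F treats the text literally, -q suppresses output.
  if ! grep -qxF -e "$line" "$file" 2>/dev/null; then
    printf '%s\n' "$line" >> "$file"
  fi
}

# Demonstration on a temporary file; use ~/.bashrc in place of "$rc_file" for real use.
rc_file=$(mktemp)
append_once 'export JAVA_HOME=/usr/local/java' "$rc_file"
append_once 'export PATH=$PATH:$JAVA_HOME/bin' "$rc_file"
append_once 'export JAVA_HOME=/usr/local/java' "$rc_file"   # already present, so skipped
cat "$rc_file"
```

The same helper could be reused for the Hadoop and Sqoop variables set later in this chapter.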
Step 5
Use the following commands to configure Java alternatives:
# alternatives --install /usr/bin/java java /usr/local/java/bin/java 2
# alternatives --install /usr/bin/javac javac /usr/local/java/bin/javac 2
# alternatives --install /usr/bin/jar jar /usr/local/java/bin/jar 2
# alternatives --set java /usr/local/java/bin/java
# alternatives --set javac /usr/local/java/bin/javac
# alternatives --set jar /usr/local/java/bin/jar
Now verify the installation by running the java -version command from the terminal, as explained above.
Step 2: Verifying Hadoop Installation
Hadoop must be installed on your system before installing Sqoop. Verify the Hadoop installation using the following command:
$ hadoop version
If Hadoop is already installed on your system, you will get a response like the following:
Hadoop 2.4.1
--
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
If Hadoop is not installed on your system, proceed with the following steps:
Downloading Hadoop
Download and extract Hadoop 2.4.1 from the Apache Software Foundation using the following commands.
$ su
password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mkdir hadoop
# mv hadoop-2.4.1/* hadoop/
# exit
Installing Hadoop in Pseudo Distributed Mode
Follow the steps given below to install Hadoop 2.4.1 in pseudo-distributed mode.
Step 1: Setting up Hadoop
You can set the Hadoop environment variables by appending the following commands to the ~/.bashrc file.
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Now apply the changes to the current shell session.
$ source ~/.bashrc
Step 2: Hadoop Configuration
You can find all the Hadoop configuration files in the location “$HADOOP_HOME/etc/hadoop”. You need to make suitable changes to those configuration files according to your Hadoop infrastructure.
$ cd $HADOOP_HOME/etc/hadoop
In order to develop Hadoop programs in Java, you have to reset the Java environment variable in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on your system.
export JAVA_HOME=/usr/local/java
Given below is the list of files that you need to edit to configure Hadoop.
core-site.xml
The core-site.xml file contains information such as the port number used for the Hadoop instance, the memory allocated for the file system, the memory limit for storing data, and the size of the read/write buffers.
Open core-site.xml and add the following properties between the <configuration> and </configuration> tags.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
The hdfs-site.xml file contains information such as the replication factor, the namenode path, and the datanode path on your local file system, that is, the place where you want to store the Hadoop infrastructure.
Let us assume the following data.
dfs.replication (data replication value) = 1
(In the paths below, hadoop is the user name, and hadoopinfra/hdfs/namenode is the directory created by the HDFS file system.)
namenode path = /home/hadoop/hadoopinfra/hdfs/namenode
(hadoopinfra/hdfs/datanode is the directory created by the HDFS file system.)
datanode path = /home/hadoop/hadoopinfra/hdfs/datanode
Open this file and add the following properties between the <configuration> and </configuration> tags.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
</property>
</configuration>
Note − In the above file, all the property values are user-defined, and you can change them according to your Hadoop infrastructure.
yarn-site.xml
This file is used to configure YARN for Hadoop. Open the yarn-site.xml file and add the following properties between the <configuration> and </configuration> tags.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
mapred-site.xml
This file is used to specify which MapReduce framework we are using. By default, Hadoop ships only a template of this file, named mapred-site.xml.template. First, copy mapred-site.xml.template to mapred-site.xml using the following command.
$ cp mapred-site.xml.template mapred-site.xml
Open the mapred-site.xml file and add the following properties between the <configuration> and </configuration> tags.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Verifying Hadoop Installation
The following steps are used to verify the Hadoop installation.
Step 1: Name Node Setup
Set up the namenode using the command “hdfs namenode -format” as follows.
$ cd ~
$ hdfs namenode -format
The expected result is as follows.
10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.4.1
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory
/home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to
retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
************************************************************/
Step 2: Verifying Hadoop dfs
The following command is used to start DFS. Executing this command starts your Hadoop file system.
$ start-dfs.sh
The expected output is as follows:
10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-
2.4.1/logs/hadoop-hadoop-namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop-
2.4.1/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]
Step 3: Verifying Yarn Script
The following command is used to run the YARN start script. Executing this command starts your YARN daemons.
$ start-yarn.sh
The expected output is as follows:
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-
2.4.1/logs/yarn-hadoop-resourcemanager-localhost.out
localhost: starting node manager, logging to /home/hadoop/hadoop-
2.4.1/logs/yarn-hadoop-nodemanager-localhost.out
Step 3: Downloading Sqoop
We can download the latest version of Sqoop from the following link. For this tutorial, we are using version 1.4.5, that is, sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz.
Step 4: Installing Sqoop
The following commands are used to extract the Sqoop tarball and move it to the “/usr/lib/sqoop” directory.
$ tar -xvf sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz
$ su
password:
# mv sqoop-1.4.5.bin__hadoop-2.0.4-alpha /usr/lib/sqoop
# exit
Step 5: Configuring bashrc
You have to set up the Sqoop environment by appending the following lines to the ~/.bashrc file:
#Sqoop
export SQOOP_HOME=/usr/lib/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
The following command is used to apply the changes in the ~/.bashrc file to the current shell session.
$ source ~/.bashrc
Step 6: Configuring Sqoop
To configure Sqoop with Hadoop, you need to edit the sqoop-env.sh file, which is located in the $SQOOP_HOME/conf directory. First, change to the Sqoop configuration directory and rename the template file using the following commands:
$ cd $SQOOP_HOME/conf
$ mv sqoop-env-template.sh sqoop-env.sh
Open sqoop-env.sh and edit the following lines:
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
Step 7: Download and Configure mysql-connector-java
We can download the mysql-connector-java-5.1.30.tar.gz file from the following link.
The following commands are used to extract the mysql-connector-java tarball and move mysql-connector-java-5.1.30-bin.jar to the /usr/lib/sqoop/lib directory.
$ tar -zxf mysql-connector-java-5.1.30.tar.gz
$ su
password:
# cd mysql-connector-java-5.1.30
# mv mysql-connector-java-5.1.30-bin.jar /usr/lib/sqoop/lib
Step 8: Verifying Sqoop
The following commands are used to verify the Sqoop version.
$ cd $SQOOP_HOME/bin
$ sqoop-version
Expected output:
14/12/17 14:52:32 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5
Sqoop 1.4.5 git commit id 5b34accaca7de251fc91161733f906af2eddbe83
Compiled by abe on Fri Aug 1 11:19:26 PDT 2014
Sqoop installation is complete.
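As a quick sanity check after the steps above, it can help to confirm that the three command-line tools are reachable from the current PATH. The following is a minimal sketch; check_tool is a hypothetical helper name introduced here, and the check only looks for the executables without validating their versions.

```shell
# check_tool NAME: report whether NAME can be found on the current PATH.
# (check_tool is a hypothetical helper name used only for this illustration.)
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    printf '%s: found\n' "$1"
  else
    printf '%s: missing\n' "$1"
  fi
}

# The three tools this chapter installs or verifies.
for tool in java hadoop sqoop; do
  check_tool "$tool"
done
```

If a tool is reported as missing, re-check the corresponding export lines in ~/.bashrc and run source ~/.bashrc again.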
Sqoop - Import
This chapter describes how to import data from a MySQL database into Hadoop HDFS. The import tool imports individual tables from an RDBMS into HDFS. Each row in a table is treated as a record in HDFS. All records are stored either as text data in text files or as binary data in Avro and Sequence files.
Syntax
The following syntax is used to import data into HDFS.
$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)
Example
Let us take an example of three tables named emp, emp_add, and emp_contact, which are in a database called userdb on a MySQL database server.
The tables and their data are as follows.
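emp:
+------+----------+--------------+--------+------+
| id   | name     | deg          | salary | dept |
+------+----------+--------------+--------+------+
| 1201 | gopal    | manager      | 50,000 | TP   |
| 1202 | manisha  | Proof reader | 50,000 | TP   |
| 1203 | khalil   | php dev      | 30,000 | AC   |
| 1204 | prasanth | php dev      | 30,000 | AC   |
| 1205 | kranthi  | admin        | 20,000 | TP   |
+------+----------+--------------+--------+------+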
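emp_add:
+------+------+----------+---------+
| id   | hno  | street   | city    |
+------+------+----------+---------+
| 1201 | 288A | vgiri    | jublee  |
| 1202 | 108I | aoc      | sec-bad |
| 1203 | 144Z | pgutta   | hyd     |
| 1204 | 78B  | old city | sec-bad |
| 1205 | 720X | hitec    | sec-bad |
+------+------+----------+---------+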
Importing a Table
The Sqoop ‘import’ tool is used to import table data into the Hadoop file system as a text file or a binary file.
The following command is used to import the emp table from the MySQL database server into HDFS.
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
-m 1
If it executes successfully, you get the following output.
14/12/22 15:24:54 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5
14/12/22 15:24:56 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
14/12/22 15:24:56 INFO tool.CodeGenTool: Beginning code generation
14/12/22 15:24:58 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM `emp` AS t LIMIT 1
14/12/22 15:24:58 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM `emp` AS t LIMIT 1
14/12/22 15:24:58 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/local/hadoop
14/12/22 15:25:11 INFO orm.CompilationManager: Writing jar file:
/tmp/sqoop-hadoop/compile/cebe706d23ebb1fd99c1f063ad51ebd7/emp.jar
-----------------------------------------------------
-----------------------------------------------------
14/12/22 15:25:40 INFO mapreduce.Job: The url to track the job:
http://localhost:8088/proxy/application_1419242001831_0001/
14/12/22 15:26:45 INFO mapreduce.Job: Job job_1419242001831_0001 running in uber mode :
false
14/12/22 15:26:45 INFO mapreduce.Job: map 0% reduce 0%
14/12/22 15:28:08 INFO mapreduce.Job: map 100% reduce 0%
14/12/22 15:28:16 INFO mapreduce.Job: Job job_1419242001831_0001 completed successfully
-----------------------------------------------------
-----------------------------------------------------
14/12/22 15:28:17 INFO mapreduce.ImportJobBase: Transferred 145 bytes in 177.5849 seconds
(0.8165 bytes/sec)
14/12/22 15:28:17 INFO mapreduce.ImportJobBase: Retrieved 5 records.
To verify the imported data in HDFS, use the following command.
$ $HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*
It shows the emp table data, with fields separated by commas (,).
1201, gopal, manager, 50000, TP
1202, manisha, preader, 50000, TP
1203, kalil, php dev, 30000, AC
1204, prasanth, php dev, 30000, AC
1205, kranthi, admin, 20000, TP
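The import above stores records as text, while the chapter introduction also mentions Avro and Sequence files. The following is a minimal sketch of how the storage format could be selected using Sqoop's --as-textfile, --as-avrodatafile, and --as-sequencefile import options. The helper name import_format_flag and the target directory /emp_avro are assumptions made for this illustration, and the script only prints the command (a dry run) rather than running it.

```shell
# import_format_flag FORMAT: print the Sqoop import option for a storage format.
# (import_format_flag is a hypothetical helper name used only for this illustration.)
import_format_flag() {
  case "$1" in
    text)     printf '%s\n' '--as-textfile' ;;
    avro)     printf '%s\n' '--as-avrodatafile' ;;
    sequence) printf '%s\n' '--as-sequencefile' ;;
    *)        printf 'unknown format: %s\n' "$1" >&2; return 1 ;;
  esac
}

# Dry run: print the command instead of executing it.
# /emp_avro is an assumed target directory chosen for this example.
echo "sqoop import --connect jdbc:mysql://localhost/userdb --username root" \
     "--table emp -m 1 $(import_format_flag avro) --target-dir /emp_avro"
```

Binary output cannot be inspected meaningfully with hadoop fs -cat, so the text format remains the convenient choice for the verification steps in this chapter.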
Importing into Target Directory
We can specify the target directory while importing table data into HDFS using the Sqoop import tool.
Following is the syntax to specify the target directory as an option to the Sqoop import command.
--target-dir <directory in HDFS>
The following command is used to import the emp_add table data into the ‘/queryresult’ directory.
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
-m 1 \
--target-dir /queryresult
The following command is used to verify the data imported from the emp_add table into the /queryresult directory.
$ $HADOOP_HOME/bin/hadoop fs -cat /queryresult/part-m-*
It shows the emp_add table data with comma-separated fields.
1201, 288A, vgiri, jublee
1202, 108I, aoc, sec-bad
1203, 144Z, pgutta, hyd
1204, 78B, oldcity, sec-bad
1205, 720C, hitech, sec-bad
Import Subset of Table Data
We can import a subset of a table using the ‘where’ clause in the Sqoop import tool. It executes the corresponding SQL query on the database server and stores the result in a target directory in HDFS.
The syntax for the where clause is as follows.
--where <condition>
The following command is used to import a subset of the emp_add table data. The subset query retrieves the id and address of the employees who live in Secunderabad city.
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
-m 1 \
--where "city = 'sec-bad'" \
--target-dir /wherequery
The following command is used to verify the data imported from the emp_add table into the /wherequery directory.
$ $HADOOP_HOME/bin/hadoop fs -cat /wherequery/part-m-*
It shows the emp_add table data with comma-separated fields.
1202, 108I, aoc, sec-bad
1204, 78B, oldcity, sec-bad
1205, 720C, hitech, sec-bad
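The where condition has to reach Sqoop as a single argument with plain ASCII quotes, which is easy to get wrong when the value comes from a shell variable. The following is a minimal sketch of building the condition in the shell and combining it with Sqoop's --columns option to restrict the imported columns. The variable names and the /wherequery_cols target directory are assumptions made for this illustration, and the script only prints the resulting command (a dry run).

```shell
# Value to filter on; kept in a variable so that it can be changed in one place.
city='sec-bad'

# Double quotes let the shell expand $city; the inner single quotes are the SQL string quotes.
where_clause="city = '${city}'"

# Dry run: print the command instead of executing it.
# /wherequery_cols is an assumed target directory chosen for this example.
printf '%s --where "%s" %s\n' \
  'sqoop import --connect jdbc:mysql://localhost/userdb --username root --table emp_add --columns "id,city" -m 1' \
  "$where_clause" \
  '--target-dir /wherequery_cols'
```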
Incremental Import
Incremental import is a technique that imports only the newly added rows of a table. The ‘incremental’, ‘check-column’, and ‘last-value’ options are required to perform an incremental import.
The following syntax is used for the incremental options in the Sqoop import command.
--incremental <mode>
--check-column <column name>
--last-value <last check column value>
Let us assume the following row has been newly added to the emp table:
1206, satish p, grp des, 20000, GR
The following command is used to perform an incremental import on the emp table.
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
-m 1 \
--incremental append \
--check-column id \
--last-value 1205
The following command is used to verify the data imported from the emp table into the HDFS emp/ directory.
$ $HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*
It shows the emp table data with comma-separated fields.
1201, gopal, manager, 50000, TP
1202, manisha, preader, 50000, TP
1203, kalil, php dev, 30000, AC
1204, prasanth, php dev, 30000, AC
1205, kranthi, admin, 20000, TP
1206, satish p, grp des, 20000, GR
The following command is used to see the modified or newly added rows from the emp table.
$ $HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*1
It shows the rows newly added to the emp table, with comma-separated fields.
1206, satish p, grp des, 20000, GR
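The value passed to --last-value in the example (1205) is the largest id already imported. The following is a minimal sketch of deriving that value from previously imported records instead of typing it by hand. The helper name max_check_column is hypothetical, the records are fed in inline for the demonstration, and the check column is assumed to be the first comma-separated field.

```shell
# max_check_column: read comma-separated records on stdin and print the
# largest numeric value found in the first field.
# (max_check_column is a hypothetical helper name used only for this illustration.)
max_check_column() {
  awk -F',' 'NR == 1 || $1 + 0 > max { max = $1 + 0 } END { if (NR > 0) print max }'
}

# Demonstration with the records shown earlier in this chapter. In real use the
# input could come from: hadoop fs -cat /emp/part-m-* | max_check_column
last_value=$(printf '%s\n' \
  '1201, gopal, manager, 50000, TP' \
  '1202, manisha, preader, 50000, TP' \
  '1203, kalil, php dev, 30000, AC' \
  '1204, prasanth, php dev, 30000, AC' \
  '1205, kranthi, admin, 20000, TP' | max_check_column)

# Print the incremental options that would be passed to sqoop import.
echo "--incremental append --check-column id --last-value $last_value"
```

With the sample records this prints the same options used in the incremental import command above.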
Sqoop - Import All Tables
This chapter describes how to import all the tables from an RDBMS database server into HDFS. The data of each table is stored in a separate directory, and the directory name is the same as the table name.
Syntax
The following syntax is used to import all tables.
$ sqoop import-all-tables (generic-args) (import-args)
$ sqoop-import-all-tables (generic-args) (import-args)
Example
Let us take an example of importing all tables from the userdb database. The tables that the userdb database contains are as follows.
+--------------------+
| Tables |
+--------------------+
| emp |
| emp_add |
| emp_contact |
+--------------------+
The following command is used to import all the tables from the userdb database.
$ sqoop import-all-tables \
--connect jdbc:mysql://localhost/userdb \
--username root
Note − If you use import-all-tables, every table in the database must have a primary key field.
The following command is used to verify that all the tables of the userdb database have been imported into HDFS.
$ $HADOOP_HOME/bin/hadoop fs -ls
It shows the table names of the userdb database as a list of directories.
Sqoop - Export
This chapter describes how to export data from HDFS back to an RDBMS database. The target table must exist in the target database. The files given to Sqoop as input contain records, which correspond to rows in the table. These files are read and parsed into a set of records, with fields separated by a user-specified delimiter.
The default operation is to insert all the records from the input files into the database table using INSERT statements. In update mode, Sqoop generates UPDATE statements that replace existing records in the database.
Syntax
The following is the syntax for the export command.
$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)
Example
Let us take an example of employee data stored in a file in HDFS. The employee data is available in the emp_data file in the ‘emp/’ directory in HDFS. The emp_data file is as follows.
1201, gopal, manager, 50000, TP
1202, manisha, preader, 50000, TP
1203, kalil, php dev, 30000, AC
1204, prasanth, php dev, 30000, AC
1205, kranthi, admin, 20000, TP
1206, satish p, grp des, 20000, GR
The table that receives the exported data must be created manually and must exist in the target database before the export is run.
The following query is used to create the table ‘employee’ from the mysql command line.
$ mysql
mysql> USE db;
mysql> CREATE TABLE employee (
id INT NOT NULL PRIMARY KEY,
name VARCHAR(20),
deg VARCHAR(20),
salary INT,
dept VARCHAR(10));
The following command is used to export the table data (which is in the emp_data file on HDFS) to the employee table in the db database of the MySQL database server.
$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee \
--export-dir /emp/emp_data
The following command is used to verify the table from the mysql command line.
mysql> SELECT * FROM employee;
If the data was stored successfully, you will find the following table of employee data.
+------+--------------+-------------+-------------------+--------+
| Id | Name | Designation | Salary | Dept |
+------+--------------+-------------+-------------------+--------+
| 1201 | gopal | manager | 50000 | TP |
| 1202 | manisha | preader | 50000 | TP |
| 1203 | kalil | php dev | 30000 | AC |
| 1204 | prasanth | php dev | 30000 | AC |
| 1205 | kranthi | admin | 20000 | TP |
| 1206 | satish p | grp des | 20000 | GR |
+------+--------------+-------------+-------------------+--------+
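The emp_data listing in this chapter shows a space after each comma. Sqoop export splits fields on the delimiter (a comma by default), so, assuming the file really contains those spaces, they would become part of the field values and could interfere with parsing numeric columns such as salary. The following is a minimal sketch of stripping them before the file is placed in HDFS; normalize_delimiters is a hypothetical helper name introduced for this illustration.

```shell
# normalize_delimiters: remove any spaces that directly follow a comma,
# leaving spaces inside field values (such as "php dev") untouched.
# (normalize_delimiters is a hypothetical helper name used only for this illustration.)
normalize_delimiters() {
  sed 's/, */,/g'
}

# Demonstration with two of the sample records from this chapter.
printf '%s\n' \
  '1201, gopal, manager, 50000, TP' \
  '1206, satish p, grp des, 20000, GR' | normalize_delimiters
```

If the data uses a delimiter other than a comma, the Sqoop export option --input-fields-terminated-by can be used to declare it instead of rewriting the file.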
Sqoop - Job
This chapter describes how to create and maintain Sqoop jobs. A Sqoop job creates and saves import and export commands, and specifies parameters to identify and recall the saved job. Re-executing a saved job is useful for incremental imports, which import updated rows from an RDBMS table into HDFS.
Syntax
The following is the syntax for creating a Sqoop job.
$ sqoop job (generic-args) (job-args)
[-- [subtool-name] (subtool-args)]
$ sqoop-job (generic-args) (job-args)
[-- [subtool-name] (subtool-args)]
Create Job (--create)
Here we create a job named myjob, which imports table data from an RDBMS table into HDFS. The following command creates a job that imports data from the employee table in the db database into HDFS.
$ sqoop job --create myjob \
-- import \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee \
-m 1
Verify Job (--list)
The ‘--list’ argument is used to verify the saved jobs. The following command lists the saved Sqoop jobs.
$ sqoop job --list
It shows the list of saved jobs.
Available jobs:
myjob
Inspect Job (--show)
The ‘--show’ argument is used to inspect a particular job and its details. The following command inspects a job called myjob.
$ sqoop job --show myjob
It shows the tool used in myjob and its options.
Job: myjob
Tool: import
Options:
----------------------------
direct.import = true
codegen.input.delimiters.record = 0
hdfs.append.dir = false
db.table = employee
...
incremental.last.value = 1206
...
Execute Job (--exec)
The ‘--exec’ option is used to execute a saved job. The following command executes a saved job called myjob.
$ sqoop job --exec myjob
It shows the following output.
10/08/19 13:08:45 INFO tool.CodeGenTool: Beginning code generation
...
Sqoop - Codegen
This chapter describes the importance of the ‘codegen’ tool. From the viewpoint of an object-oriented application, every database table has one DAO class that contains ‘getter’ and ‘setter’ methods to initialize objects. The codegen tool generates this DAO class automatically.
It generates the DAO class in Java, based on the table schema structure. The Java definition is instantiated as part of the import process. The main use of this tool is to regenerate the Java code when the generated Java source has been lost. In that case, it creates a new version of the Java source with the default delimiter between fields.
Syntax
The following is the syntax for the Sqoop codegen command.
$ sqoop codegen (generic-args) (codegen-args)
$ sqoop-codegen (generic-args) (codegen-args)
Example
Let us take an example that generates Java code for the emp table in the userdb database.
The following command is used to execute the given example.
$ sqoop codegen \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp
If the command executes successfully, it produces the following output on the terminal.
14/12/23 02:34:40 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5
14/12/23 02:34:41 INFO tool.CodeGenTool: Beginning code generation
……………….
14/12/23 02:34:42 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/local/hadoop
Note: /tmp/sqoop-hadoop/compile/9a300a1f94899df4a9b10f9935ed9f91/emp.java uses or
overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
14/12/23 02:34:47 INFO orm.CompilationManager: Writing jar file:
/tmp/sqoop-hadoop/compile/9a300a1f94899df4a9b10f9935ed9f91/emp.jar
Verification
Let us take a look at the output. The directory shown in the “Writing jar file” line is the location where the Java code of the emp table is generated and stored. Let us verify the files in that location using the following commands.
$ cd /tmp/sqoop-hadoop/compile/9a300a1f94899df4a9b10f9935ed9f91/
$ ls
emp.class
emp.jar
emp.java
If you want to verify in depth, compare the emp table in the userdb database with emp.java in the /tmp/sqoop-hadoop/compile/9a300a1f94899df4a9b10f9935ed9f91/ directory.
Sqoop - Eval
This chapter describes how to use the Sqoop ‘eval’ tool. It allows users to execute user-defined queries against a database server and preview the result in the console. This lets the user see in advance what table data an import would produce. Using eval, we can evaluate any type of SQL query, whether it is a DDL or a DML statement.
Syntax
The following syntax is used for the Sqoop eval command.
$ sqoop eval (generic-args) (eval-args)
$ sqoop-eval (generic-args) (eval-args)
Select Query Evaluation
Using the eval tool, we can evaluate any type of SQL query. Let us take an example that selects a limited number of rows from the employee table of the db database. The following command evaluates this example using a SQL query.
$ sqoop eval \
--connect jdbc:mysql://localhost/db \
--username root \
--query "SELECT * FROM employee LIMIT 3"
If the command executes successfully, it produces the following output on the terminal.
+------+--------------+-------------+-------------------+--------+
| Id | Name | Designation | Salary | Dept |
+------+--------------+-------------+-------------------+--------+
| 1201 | gopal | manager | 50000 | TP |
| 1202 | manisha | preader | 50000 | TP |
| 1203 | khalil | php dev | 30000 | AC |
+------+--------------+-------------+-------------------+--------+
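The preview arrives as an ASCII table, which is awkward to reuse in scripts. The sketch below converts a table of that shape to comma-separated lines with awk. It assumes the output format is exactly as shown above and that no cell contains a pipe or a comma; table_to_csv is a hypothetical helper name.

```shell
# Hypothetical helper: read an ASCII table on stdin and print one CSV line
# per row. Border lines (starting with "+") are skipped.
table_to_csv() {
  awk -F'|' '
    /^\|/ {
      line = ""
      # Fields 2..NF-1 hold the cells; the first and last fields are empty.
      for (i = 2; i < NF; i++) {
        cell = $i
        gsub(/^[ \t]+|[ \t]+$/, "", cell)   # trim surrounding blanks
        line = line (i > 2 ? "," : "") cell
      }
      print line
    }'
}

# Sample input copied from the output above; in real use the input would
# come from piping the sqoop eval command into table_to_csv.
table_to_csv <<'EOF'
+------+--------------+-------------+-------------------+--------+
| Id | Name | Designation | Salary | Dept |
+------+--------------+-------------+-------------------+--------+
| 1201 | gopal | manager | 50000 | TP |
| 1202 | manisha | preader | 50000 | TP |
| 1203 | khalil | php dev | 30000 | AC |
+------+--------------+-------------+-------------------+--------+
EOF
```

The first line printed is the header (Id,Name,Designation,Salary,Dept), followed by one line per data row.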
Insert Query Evaluation
The Sqoop eval tool applies to both data-manipulation (DML) and data-definition (DDL) statements. That means we can use eval for insert statements too. The following command inserts a new row into the employee table of the db database.
$ sqoop eval \
--connect jdbc:mysql://localhost/db \
--username root \
-e "INSERT INTO employee VALUES(1207,'Raju','UI dev',15000,'TP')"
If the command executes successfully, it displays the status of the updated rows on the console.
Alternatively, you can verify the employee table on the MySQL console. The following commands verify the rows of the employee table of the db database using a ‘select’ query.
mysql>
mysql> use db;
mysql> SELECT * FROM employee;
+------+--------------+-------------+-------------------+--------+
| Id | Name | Designation | Salary | Dept |
+------+--------------+-------------+-------------------+--------+
| 1201 | gopal | manager | 50000 | TP |
| 1202 | manisha | preader | 50000 | TP |
| 1203 | khalil | php dev | 30000 | AC |
| 1204 | prasanth | php dev | 30000 | AC |
| 1205 | kranthi | admin | 20000 | TP |
| 1206 | satish p | grp des | 20000 | GR |
| 1207 | Raju | UI dev | 15000 | TP |
+------+--------------+-------------+-------------------+--------+
Sqoop - List Databases
This chapter describes how to list the databases using Sqoop. The Sqoop list-databases tool parses and executes the ‘SHOW DATABASES’ query against the database server. It then lists the databases present on the server.
Syntax
The following syntax is used for the Sqoop list-databases command.
$ sqoop list-databases (generic-args) (list-databases-args)
$ sqoop-list-databases (generic-args) (list-databases-args)
Sample Query
The following command lists all the databases in the MySQL database server.
$ sqoop list-databases \
--connect jdbc:mysql://localhost/ \
--username root
If the command executes successfully, it displays the list of databases in your MySQL database server as follows.
...
13/05/31 16:45:58 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
mysql
test
userdb
db
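A script often needs to know only whether one database is in the list. The sketch below checks for an exact line match, so log lines with a timestamp prefix never count as names. The has_name helper is hypothetical, and the sample output from this chapter stands in for the live command.

```shell
# Hypothetical helper: read tool output on stdin and succeed if some line
# is exactly the given name (-F fixed string, -x whole line, -q quiet).
has_name() {
  grep -Fxq -- "$1"
}

# Stand-in for the live command. In real use, pipe the sqoop list-databases
# command shown above into has_name instead.
sample_output() {
  cat <<'EOF'
13/05/31 16:45:58 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
mysql
test
userdb
db
EOF
}

if sample_output | has_name userdb; then
  echo "userdb is present"
else
  echo "userdb is missing"
fi
```

Matching whole lines avoids false positives such as "db" matching inside "userdb".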
Sqoop - List Tables
This chapter describes how to list the tables of a particular database in a MySQL database server using Sqoop. The Sqoop list-tables tool parses and executes the ‘SHOW TABLES’ query against a particular database. It then lists the tables present in that database.
Syntax
The following syntax is used for the Sqoop list-tables command.
$ sqoop list-tables (generic-args) (list-tables-args)
$ sqoop-list-tables (generic-args) (list-tables-args)
Sample Query
The following command lists all the tables in the userdb database of the MySQL database server.
$ sqoop list-tables \
--connect jdbc:mysql://localhost/userdb \
--username root
If the command executes successfully, it displays the list of tables in the userdb database as follows.
...
13/05/31 16:45:58 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
emp
emp_add
emp_contact
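The table list can drive other Sqoop tools, for example by producing one codegen command per table. The sketch below is a dry run that only prints the commands. It assumes table names contain no spaces, which is how it tells them apart from log lines; print_codegen_cmds is a hypothetical helper.

```shell
# Hypothetical helper: read table names on stdin, one per line, and print
# one codegen command for each. Empty lines, lines containing a space
# (log lines) and the "..." placeholder are skipped.
print_codegen_cmds() {
  conn=$1
  user=$2
  while IFS= read -r name; do
    case $name in
      ''|*' '*|...) continue ;;
    esac
    echo "sqoop codegen --connect $conn --username $user --table $name"
  done
}

# Table names taken from the sample output above.
printf '%s\n' emp emp_add emp_contact |
  print_codegen_cmds jdbc:mysql://localhost/userdb root
```

Reviewing the printed commands before executing them keeps the loop safe to try on a shared server.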