Sqoop Concise Tutorial
Sqoop - Installation
Because Sqoop is a sub-project of Hadoop, it runs only on Linux. Follow the steps below to install Sqoop on your system.
Step 1: Verifying JAVA Installation
Java must be installed on your system before you install Sqoop. Verify the Java installation with the following command:
$ java -version
If Java is already installed on your system, you will see a response similar to the following:
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
If Java is not installed on your system, follow the steps below.
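The check above can be scripted so that later steps run only when Java is present. The following is a minimal sketch, assuming a POSIX shell; the helper name have_command is illustrative and not part of Java, Hadoop, or Sqoop.

```shell
#!/bin/sh
# Report whether a command is available on PATH.
# have_command is a hypothetical helper name used only for this sketch.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

if have_command java; then
    # java prints its version banner to stderr, so merge it into stdout.
    java -version 2>&1
else
    echo "java not found on PATH; install Java before continuing"
fi
```

The same helper can be reused for other tools, for example have_command hadoop.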
Installing Java
Follow the steps below to install Java on your system.
Step 1
Download Java (JDK <latest version> - X64.tar.gz) by visiting the following link.
The file jdk-7u71-linux-x64.tar.gz will then be downloaded to your system.
Step 2
You will generally find the downloaded Java file in the Downloads folder. Verify it and extract the jdk-7u71-linux-x64.tar.gz file using the following commands.
$ cd Downloads/
$ ls
jdk-7u71-linux-x64.tar.gz
$ tar zxf jdk-7u71-linux-x64.tar.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.tar.gz
Step 3
To make Java available to all users, move it to /usr/local/. Switch to the root user and type the following commands.
$ su
password:
# mv jdk1.7.0_71 /usr/local/java
# exit
Step 4
To set the PATH and JAVA_HOME variables, add the following lines to the ~/.bashrc file.
export JAVA_HOME=/usr/local/java
export PATH=$PATH:$JAVA_HOME/bin
Now apply the changes to the current shell session.
$ source ~/.bashrc
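To confirm that the new settings took effect, you can check that JAVA_HOME is set and that its bin directory is on PATH. The following is a minimal sketch for a POSIX shell; path_contains is a hypothetical helper written for this example.

```shell
#!/bin/sh
# Succeed when directory $1 is an entry in the colon-separated list $2.
# path_contains is a hypothetical helper name used only for this sketch.
path_contains() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Assumes JAVA_HOME was exported in ~/.bashrc; treated as empty if unset.
java_home="${JAVA_HOME:-}"
if [ -n "$java_home" ] && path_contains "$java_home/bin" "$PATH"; then
    echo "JAVA_HOME/bin is on PATH"
else
    echo "JAVA_HOME is unset or its bin directory is missing from PATH"
fi
```

Wrapping the list in colons lets the pattern match whole entries only, so /usr/local/java does not accidentally match /usr/local/java/bin.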
Step 5
Use the following commands to configure Java alternatives:
# alternatives --install /usr/bin/java java /usr/local/java/bin/java 2
# alternatives --install /usr/bin/javac javac /usr/local/java/bin/javac 2
# alternatives --install /usr/bin/jar jar /usr/local/java/bin/jar 2
# alternatives --set java /usr/local/java/bin/java
# alternatives --set javac /usr/local/java/bin/javac
# alternatives --set jar /usr/local/java/bin/jar
Now verify the installation by running java -version from the terminal, as explained above.
Step 2: Verifying Hadoop Installation
Hadoop must be installed on your system before you install Sqoop. Verify the Hadoop installation with the following command:
$ hadoop version
If Hadoop is already installed on your system, you will get a response similar to the following:
Hadoop 2.4.1
--
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
If Hadoop is not installed on your system, proceed with the following steps:
Downloading Hadoop
Download and extract Hadoop 2.4.1 from the Apache Software Foundation using the following commands.
$ su
password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mv hadoop-2.4.1 hadoop
# exit
Installing Hadoop in Pseudo Distributed Mode
Follow the steps below to install Hadoop 2.4.1 in pseudo-distributed mode.
Step 1: Setting up Hadoop
Set the Hadoop environment variables by appending the following lines to the ~/.bashrc file.
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Now apply the changes to the current shell session.
$ source ~/.bashrc
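Before moving on, it can help to confirm that every variable listed above is actually set in the current shell. The following is a minimal sketch for a POSIX shell; missing_vars is a hypothetical helper written for this example.

```shell
#!/bin/sh
# Print the names of any listed variables that are unset or empty.
# missing_vars is a hypothetical helper name used only for this sketch.
missing_vars() {
    for name in "$@"; do
        # Indirect lookup via eval keeps the sketch portable to plain sh.
        eval "value=\${$name:-}"
        [ -n "$value" ] || echo "$name"
    done
}

missing=$(missing_vars HADOOP_HOME HADOOP_MAPRED_HOME HADOOP_COMMON_HOME \
    HADOOP_HDFS_HOME YARN_HOME HADOOP_COMMON_LIB_NATIVE_DIR)
if [ -n "$missing" ]; then
    echo "Not set:" $missing
else
    echo "All Hadoop variables are set"
fi
```

If any name is reported, recheck the export lines in ~/.bashrc and run source ~/.bashrc again.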
Step 2: Hadoop Configuration
All Hadoop configuration files are located in $HADOOP_HOME/etc/hadoop. Edit these files to match your Hadoop infrastructure.
$ cd $HADOOP_HOME/etc/hadoop
To develop Hadoop programs in Java, set the JAVA_HOME variable in the hadoop-env.sh file to the location of Java on your system.
export JAVA_HOME=/usr/local/java
The files that you need to edit to configure Hadoop are listed below.
core-site.xml
The core-site.xml file contains information such as the port number used by the Hadoop instance, the memory allocated for the file system, the memory limit for storing data, and the size of the read/write buffers.
Open core-site.xml and add the following properties between the <configuration> and </configuration> tags.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
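After editing a configuration file, you may want to read a property back to confirm the value Hadoop will see. The following is a minimal sketch using awk; get_property is a hypothetical helper, and it assumes each <name> and <value> tag sits on its own line, as in the snippets in this tutorial. It is not a general XML parser.

```shell
#!/bin/sh
# Print the <value> of the property whose <name> equals $2 in file $1.
# get_property is a hypothetical helper name used only for this sketch.
get_property() {
    awk -v want="$2" '
        /<name>/ {
            name = $0
            sub(/.*<name>[ \t]*/, "", name)
            sub(/[ \t]*<\/name>.*/, "", name)
        }
        /<value>/ && name == want {
            value = $0
            sub(/.*<value>[ \t]*/, "", value)
            sub(/[ \t]*<\/value>.*/, "", value)
            print value
            exit
        }
    ' "$1"
}

# The default path below is an assumption based on the HADOOP_HOME used in this tutorial.
conf="${HADOOP_HOME:-/usr/local/hadoop}/etc/hadoop/core-site.xml"
if [ -f "$conf" ]; then
    get_property "$conf" fs.default.name
fi
```

The helper also trims spaces around the value, which makes stray whitespace inside a <value> tag easy to spot when comparing against the file.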
hdfs-site.xml
The hdfs-site.xml file contains information such as the data replication value, the namenode path, and the datanode path on your local file system, that is, the locations where you want to store the Hadoop infrastructure.
Let us assume the following data.
dfs.replication (data replication value) = 1
(In the following paths, hadoop is the user name, and hadoopinfra/hdfs/namenode and hadoopinfra/hdfs/datanode are the directories created by the HDFS file system.)
namenode path = /home/hadoop/hadoopinfra/hdfs/namenode
datanode path = /home/hadoop/hadoopinfra/hdfs/datanode
Open this file and add the following properties between the <configuration> and </configuration> tags.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
</property>
</configuration>
Note: In the above file, all the property values are user-defined, and you can change them to match your Hadoop infrastructure.
yarn-site.xml
This file is used to configure YARN for Hadoop. Open the yarn-site.xml file and add the following properties between the <configuration> and </configuration> tags.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
mapred-site.xml
This file specifies which MapReduce framework is in use. By default, Hadoop ships only a template of this file, named mapred-site.xml.template. First, copy mapred-site.xml.template to mapred-site.xml using the following command.
$ cp mapred-site.xml.template mapred-site.xml
Open the mapred-site.xml file and add the following properties between the <configuration> and </configuration> tags.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Verifying Hadoop Installation
Use the following steps to verify the Hadoop installation.
Step 1: Name Node Setup
Set up the namenode using the command "hdfs namenode -format" as follows.
$ cd ~
$ hdfs namenode -format
The expected result is as follows.
10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.4.1
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory /home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
************************************************************/
Step 2: Verifying Hadoop dfs
The following command starts DFS. Running it starts your Hadoop file system.
$ start-dfs.sh
The expected output is as follows:
10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]
Step 3: Verifying Yarn Script
The following command runs the YARN start script. Running it starts your YARN daemons.
$ start-yarn.sh
The expected output is as follows:
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-resourcemanager-localhost.out
localhost: starting node manager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-nodemanager-localhost.out
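Once both start scripts have run, the jps tool that ships with the JDK lists the running Java processes, which gives a quick way to see whether every daemon came up. The following is a minimal sketch; missing_daemons is a hypothetical helper, and the list of daemon names assumes the pseudo-distributed setup described above.

```shell
#!/bin/sh
# Read jps-style lines ("<pid> <ClassName>") on stdin and print every
# expected daemon name (given as arguments) that does not appear.
# missing_daemons is a hypothetical helper name used only for this sketch.
missing_daemons() {
    listing=$(cat)
    for daemon in "$@"; do
        printf '%s\n' "$listing" \
            | awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' \
            || echo "$daemon"
    done
}

if command -v jps >/dev/null 2>&1; then
    jps | missing_daemons NameNode DataNode SecondaryNameNode ResourceManager NodeManager
else
    echo "jps not found; it ships with the JDK"
fi
```

No output from the helper means every expected daemon was found; any name it prints points to a log file worth checking under the Hadoop logs directory.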
Step 3: Downloading Sqoop
We can download the latest version of Sqoop from the following link. For this tutorial, we are using version 1.4.5, that is, sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz.
Step 4: Installing Sqoop
The following commands extract the Sqoop tarball and move it to the /usr/lib/sqoop directory.
$ tar -xvf sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz
$ su
password:
# mv sqoop-1.4.5.bin__hadoop-2.0.4-alpha /usr/lib/sqoop
# exit
Step 5: Configuring bashrc
Set up the Sqoop environment by appending the following lines to the ~/.bashrc file:
#Sqoop
export SQOOP_HOME=/usr/lib/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
The following command reloads the ~/.bashrc file in the current shell.
$ source ~/.bashrc
Step 6: Configuring Sqoop
To configure Sqoop with Hadoop, edit the sqoop-env.sh file in the $SQOOP_HOME/conf directory. First, change to the Sqoop configuration directory and create sqoop-env.sh from the template file using the following commands:
$ cd $SQOOP_HOME/conf
$ mv sqoop-env-template.sh sqoop-env.sh
Open sqoop-env.sh and edit the following lines:
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
Step 7: Download and Configure mysql-connector-java
We can download the mysql-connector-java-5.1.30.tar.gz file from the following link.
The following commands extract the mysql-connector-java tarball and move mysql-connector-java-5.1.30-bin.jar to the /usr/lib/sqoop/lib directory.
$ tar -zxf mysql-connector-java-5.1.30.tar.gz
$ su
password:
# cd mysql-connector-java-5.1.30
# mv mysql-connector-java-5.1.30-bin.jar /usr/lib/sqoop/lib
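A quick way to confirm the move is to look for the connector jar in the Sqoop lib directory. The following is a minimal sketch for a POSIX shell; has_mysql_connector is a hypothetical helper, and the default directory assumes the /usr/lib/sqoop location used in this tutorial.

```shell
#!/bin/sh
# Succeed when directory $1 contains at least one mysql-connector-java jar.
# has_mysql_connector is a hypothetical helper name used only for this sketch.
has_mysql_connector() {
    for jar in "$1"/mysql-connector-java-*.jar; do
        # If the glob matched nothing, the pattern stays unexpanded and -f fails.
        [ -f "$jar" ] && return 0
    done
    return 1
}

lib_dir="${SQOOP_HOME:-/usr/lib/sqoop}/lib"
if has_mysql_connector "$lib_dir"; then
    echo "MySQL connector found in $lib_dir"
else
    echo "MySQL connector missing from $lib_dir"
fi
```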
Step 8: Verifying Sqoop
The following commands verify the Sqoop version.
$ cd $SQOOP_HOME/bin
$ sqoop-version
Expected output:
14/12/17 14:52:32 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5
Sqoop 1.4.5 git commit id 5b34accaca7de251fc91161733f906af2eddbe83
Compiled by abe on Fri Aug 1 11:19:26 PDT 2014
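If you want to check the installed version from a script, the version number can be pulled out of output shaped like the lines above. The following is a minimal sketch; parse_sqoop_version is a hypothetical helper, and it assumes the version appears on a line that begins with "Sqoop" followed by the number, as in the expected output shown here.

```shell
#!/bin/sh
# Read sqoop-version output on stdin and print the first version number
# found on a line of the form "Sqoop <version> ...".
# parse_sqoop_version is a hypothetical helper name used only for this sketch.
parse_sqoop_version() {
    sed -n 's/^Sqoop \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

if command -v sqoop-version >/dev/null 2>&1; then
    sqoop-version 2>&1 | parse_sqoop_version
else
    echo "sqoop-version not found on PATH"
fi
```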
Sqoop installation is complete.