Hadoop - Environment Setup
Hadoop is supported by the GNU/Linux platform and its flavors. Therefore, we have to install a Linux operating system for setting up the Hadoop environment. In case you have an OS other than Linux, you can install VirtualBox software on it and run Linux inside VirtualBox.
Pre-installation Setup
Before installing Hadoop into the Linux environment, we need to set up Linux using ssh (Secure Shell). Follow the steps given below for setting up the Linux environment.
Creating a User
At the beginning, it is recommended to create a separate user for Hadoop to isolate Hadoop file system from Unix file system. Follow the steps given below to create a user −
- Open the root account using the command “su”.
- Create a user from the root account using the command “useradd username”.
- Now you can open an existing user account using the command “su username”.
Open the Linux terminal and type the following commands to create a user.
$ su
password:
# useradd hadoop
# passwd hadoop
New passwd:
Retype new passwd
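Before moving on to the SSH setup, switch to the newly created hadoop user (an extra step not shown above) so that the SSH keys in the next section are generated in that user’s home directory −
# su hadoop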
SSH Setup and Key Generation
SSH setup is required to perform different operations on a cluster, such as starting, stopping, and distributed daemon shell operations. To authenticate different users of Hadoop, it is required to provide a public/private key pair for a Hadoop user and share it with different users.
The following commands are used for generating a key pair using SSH. Copy the public key from id_rsa.pub to authorized_keys, and give the owner read and write permissions on the authorized_keys file.
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
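As an optional sanity check (not part of the original steps), confirm that key-based login now works; the first connection may ask you to confirm the host key, but it should not prompt for a password −
$ ssh localhost
$ exit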
Installing Java
Java is the main prerequisite for Hadoop. First of all, you should verify the existence of java in your system using the command “java -version”. The syntax of java version command is given below.
$ java -version
If everything is in order, it will give you the following output.
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
If java is not installed in your system, then follow the steps given below for installing java.
Step 1
Download java (JDK <latest version> - X64.tar.gz) by visiting the following link www.oracle.com
Then jdk-7u71-linux-x64.tar.gz will be downloaded into your system.
Step 2
Generally, you will find the downloaded java file in the Downloads folder. Verify it and extract the jdk-7u71-linux-x64.tar.gz file using the following commands.
$ cd Downloads/
$ ls
jdk-7u71-linux-x64.tar.gz
$ tar zxf jdk-7u71-linux-x64.tar.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.tar.gz
Step 3
To make java available to all the users, you have to move it to the location “/usr/local/”. Open the root account, and type the following commands.
$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit
Step 4
For setting up PATH and JAVA_HOME variables, add the following commands to ~/.bashrc file.
export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin
Now apply all the changes into the current running system.
$ source ~/.bashrc
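Optionally, you can confirm that the variable is set in the current shell; with the setup above it should print the JDK path −
$ echo $JAVA_HOME
/usr/local/jdk1.7.0_71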
Step 5
Use the following commands to configure java alternatives −
# alternatives --install /usr/bin/java java /usr/local/jdk1.7.0_71/bin/java 2
# alternatives --install /usr/bin/javac javac /usr/local/jdk1.7.0_71/bin/javac 2
# alternatives --install /usr/bin/jar jar /usr/local/jdk1.7.0_71/bin/jar 2
# alternatives --set java /usr/local/jdk1.7.0_71/bin/java
# alternatives --set javac /usr/local/jdk1.7.0_71/bin/javac
# alternatives --set jar /usr/local/jdk1.7.0_71/bin/jar
Now verify the java -version command from the terminal as explained above.
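For convenience, the check is repeated below; with the steps above it should report the JDK installed under /usr/local/jdk1.7.0_71 −
$ java -version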
Downloading Hadoop
Download and extract Hadoop 2.4.1 from the Apache Software Foundation using the following commands.
$ su
password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mkdir hadoop
# mv hadoop-2.4.1/* hadoop/
# exit
Hadoop Operation Modes
Once you have downloaded Hadoop, you can operate your Hadoop cluster in one of the three supported modes −
- Local/Standalone Mode − After downloading Hadoop in your system, by default, it is configured in a standalone mode and can be run as a single java process.
- Pseudo Distributed Mode − It is a distributed simulation on a single machine. Each Hadoop daemon such as hdfs, yarn, MapReduce etc., will run as a separate java process. This mode is useful for development.
- Fully Distributed Mode − This mode is fully distributed with a minimum of two or more machines as a cluster. We will come across this mode in detail in the coming chapters.
Installing Hadoop in Standalone Mode
Here we will discuss the installation of Hadoop 2.4.1 in standalone mode.
There are no daemons running and everything runs in a single JVM. Standalone mode is suitable for running MapReduce programs during development, since it is easy to test and debug them.
Setting Up Hadoop
You can set Hadoop environment variables by appending the following commands to ~/.bashrc file.
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
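After editing ~/.bashrc, reload it so that the new variables take effect in the current shell (the same step used in the Java setup above) −
$ source ~/.bashrc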
Before proceeding further, you need to make sure that Hadoop is working fine. Just issue the following command −
$ hadoop version
If everything is fine with your setup, then you should see the following result −
Hadoop 2.4.1
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
It means your Hadoop’s standalone mode setup is working fine. By default, Hadoop is configured to run in a non-distributed mode on a single machine.
Example
Let’s check a simple example of Hadoop. The Hadoop installation delivers the following example MapReduce jar file, which provides basic functionality of MapReduce and can be used for calculations such as the value of Pi, word counts in a given list of files, and so on.
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar
Let’s have an input directory where we will push a few files, and our requirement is to count the total number of words in those files. To calculate the total number of words, we do not need to write our own MapReduce program, since the .jar file contains an implementation of word count. You can try other examples using the same .jar file; just issue the following command to check the MapReduce programs supported by the hadoop-mapreduce-examples-2.4.1.jar file.
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar
Step 1
Create temporary content files in the input directory. You can create this input directory anywhere you would like to work.
$ mkdir input
$ cp $HADOOP_HOME/*.txt input
$ ls -l input
It will give the following files in your input directory −
total 24
-rw-r--r-- 1 root root 15164 Feb 21 10:14 LICENSE.txt
-rw-r--r-- 1 root root 101 Feb 21 10:14 NOTICE.txt
-rw-r--r-- 1 root root 1366 Feb 21 10:14 README.txt
These files have been copied from the Hadoop installation home directory. For your experiment, you can have different and larger sets of files.
Step 2
Let’s start the Hadoop process to count the total number of words in all the files available in the input directory, as follows −
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar wordcount input output
Step 3
Step 2 will do the required processing and save the output in the output/part-r-00000 file, which you can check by using −
$ cat output/*
It will list all the words, along with their total counts, found in all the files available in the input directory.
"AS 4
"Contribution" 1
"Contributor" 1
"Derivative 1
"Legal 1
"License" 1
"License"); 1
"Licensor" 1
"NOTICE” 1
"Not 1
"Object" 1
"Source” 1
"Work” 1
"You" 1
"Your") 1
"[]" 1
"control" 1
"printed 1
"submitted" 1
(50%) 1
(BIS), 1
(C) 1
(Don't) 1
(ECCN) 1
(INCLUDING 2
(INCLUDING, 2
.............
Installing Hadoop in Pseudo Distributed Mode
Follow the steps given below to install Hadoop 2.4.1 in pseudo distributed mode.
Step 1 − Setting Up Hadoop
You can set Hadoop environment variables by appending the following commands to ~/.bashrc file.
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME
Now apply all the changes into the current running system.
$ source ~/.bashrc
Step 2 − Hadoop Configuration
You can find all the Hadoop configuration files in the location “$HADOOP_HOME/etc/hadoop”. It is required to make changes in those configuration files according to your Hadoop infrastructure.
$ cd $HADOOP_HOME/etc/hadoop
In order to develop Hadoop programs in java, you have to reset the java environment variable in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of java in your system.
export JAVA_HOME=/usr/local/jdk1.7.0_71
The following is the list of files that you have to edit to configure Hadoop.
core-site.xml
The core-site.xml file contains information such as the port number used for the Hadoop instance, the memory allocated for the file system, the memory limit for storing the data, and the size of the Read/Write buffers.
Open core-site.xml and add the following properties in between the <configuration>, </configuration> tags.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
The hdfs-site.xml file contains information such as the value of replication data, the namenode path, and the datanode paths of your local file systems, that is, the place where you want to store the Hadoop infrastructure.
Let us assume the following data.
dfs.replication (data replication value) = 1
(In the path given below, /hadoop/ is the user name.
hadoopinfra/hdfs/namenode is the directory created by the hdfs file system.)
namenode path = /home/hadoop/hadoopinfra/hdfs/namenode
(hadoopinfra/hdfs/datanode is the directory created by the hdfs file system.)
datanode path = /home/hadoop/hadoopinfra/hdfs/datanode
Open this file and add the following properties in between the <configuration> </configuration> tags in this file.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
</property>
</configuration>
Note − In the above file, all the property values are user-defined and you can make changes according to your Hadoop infrastructure.
yarn-site.xml
This file is used to configure yarn into Hadoop. Open the yarn-site.xml file and add the following properties in between the <configuration>, </configuration> tags in this file.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
mapred-site.xml
This file is used to specify which MapReduce framework we are using. By default, Hadoop contains a template of this file named mapred-site.xml.template. First of all, it is required to copy the file from mapred-site.xml.template to mapred-site.xml using the following command.
$ cp mapred-site.xml.template mapred-site.xml
Open the mapred-site.xml file and add the following properties in between the <configuration>, </configuration> tags in this file.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Verifying Hadoop Installation
The following steps are used to verify the Hadoop installation.
Step 1 − Name Node Setup
Set up the namenode using the command “hdfs namenode -format” as follows.
$ cd ~
$ hdfs namenode -format
The expected result is as follows.
10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.4.1
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory
/home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to
retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
************************************************************/
Step 2 − Verifying Hadoop dfs
The following command is used to start dfs. Executing this command will start your Hadoop file system.
$ start-dfs.sh
The expected output is as follows −
10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]
Step 3 − Verifying Yarn Script
The following command is used to start the yarn script. Executing this command will start your yarn daemons.
$ start-yarn.sh
The expected output is as follows −
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-resourcemanager-localhost.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-nodemanager-localhost.out
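As an optional check beyond the steps above, you can list the running Hadoop daemons with the JDK’s jps tool; on a working pseudo-distributed setup you would typically expect to see NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager −
$ jps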