Hadoop Tutorial

Hadoop - Quick Guide

Hadoop - Big Data Overview

Due to the advent of new technologies, devices, and communication means like social networking sites, the amount of data produced by mankind is growing rapidly every year. The amount of data produced by us from the beginning of time till 2003 was 5 billion gigabytes. If you pile up this data in the form of disks, it may fill an entire football field. The same amount of data was created every two days in 2011, and every ten minutes in 2013. This rate is still growing enormously. Though all this information produced is meaningful and can be useful when processed, it is being neglected.

What is Big Data?

Big data is a collection of large datasets that cannot be processed using traditional computing techniques. It is not a single technique or a tool; rather, it has become a complete subject, which involves various tools, techniques and frameworks.

What Comes Under Big Data?

Big data involves the data produced by different devices and applications. Given below are some of the fields that come under the umbrella of Big Data.

  1. Black Box Data − It is a component of helicopters, airplanes, and jets, etc. It captures the voices of the flight crew, recordings of microphones and earphones, and the performance information of the aircraft.

  2. Social Media Data − Social media such as Facebook and Twitter hold information and the views posted by millions of people across the globe.

  3. Stock Exchange Data − The stock exchange data holds information about the ‘buy’ and ‘sell’ decisions made by customers on shares of different companies.

  4. Power Grid Data − The power grid data holds information consumed by a particular node with respect to a base station.

  5. Transport Data − Transport data includes model, capacity, distance and availability of a vehicle.

  6. Search Engine Data − Search engines retrieve lots of data from different databases.

Thus Big Data includes huge volume, high velocity, and an extensible variety of data. The data in it will be of three types.

  1. Structured data − Relational data.

  2. Semi Structured data − XML data.

  3. Unstructured data − Word, PDF, Text, Media Logs.

Benefits of Big Data

  1. Using the information kept in social networks like Facebook, marketing agencies are learning about the response to their campaigns, promotions, and other advertising media.

  2. Using the information in social media, such as the preferences and product perception of their consumers, product companies and retail organizations are planning their production.

  3. Using the data regarding the previous medical history of patients, hospitals are providing better and quicker service.

Big Data Technologies

Big data technologies are important in providing more accurate analysis, which may lead to more concrete decision-making resulting in greater operational efficiencies, cost reductions, and reduced risks for the business.

To harness the power of big data, you would require an infrastructure that can manage and process huge volumes of structured and unstructured data in realtime and can protect data privacy and security.

There are various technologies in the market from different vendors including Amazon, IBM, Microsoft, etc., to handle big data. While looking into the technologies that handle big data, we examine the following two classes of technology −

Operational Big Data

This includes systems like MongoDB that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored.

NoSQL Big Data systems are designed to take advantage of new cloud computing architectures that have emerged over the past decade to allow massive computations to be run inexpensively and efficiently. This makes operational big data workloads much easier to manage, cheaper, and faster to implement.

Some NoSQL systems can provide insights into patterns and trends based on real-time data with minimal coding and without the need for data scientists and additional infrastructure.

Analytical Big Data

These include systems like Massively Parallel Processing (MPP) database systems and MapReduce that provide analytical capabilities for retrospective and complex analysis that may touch most or all of the data.

MapReduce provides a new method of analyzing data that is complementary to the capabilities provided by SQL, and a system based on MapReduce can be scaled up from single servers to thousands of high-end and low-end machines.

These two classes of technology are complementary and frequently deployed together.

Operational vs. Analytical Systems

                     Operational              Analytical
Latency              1 ms - 100 ms            1 min - 100 min
Concurrency          1000 - 100,000           1 - 10
Access Pattern       Writes and Reads         Reads
Queries              Selective                Unselective
Data Scope           Operational              Retrospective
End User             Customer                 Data Scientist
Technology           NoSQL                    MapReduce, MPP Database

Big Data Challenges

The major challenges associated with big data are as follows −

  1. Capturing data

  2. Curation

  3. Storage

  4. Searching

  5. Sharing

  6. Transfer

  7. Analysis

  8. Presentation

To meet the above challenges, organizations normally take the help of enterprise servers.

Hadoop - Big Data Solutions

Traditional Approach

In this approach, an enterprise will have a computer to store and process big data. For storage purpose, the programmers will take the help of their choice of database vendors such as Oracle, IBM, etc. In this approach, the user interacts with the application, which in turn handles the part of data storage and analysis.

Limitation

This approach works fine with those applications that process less voluminous data that can be accommodated by standard database servers, or up to the limit of the processor that is processing the data. But when it comes to dealing with huge amounts of scalable data, it is a hectic task to process such data through a single database bottleneck.

Google’s Solution

Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small parts and assigns them to many computers, and collects the results from them which when integrated, form the result dataset.

Hadoop

Using the solution provided by Google, Doug Cutting and his team developed an Open Source Project called HADOOP.

Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel with others. In short, Hadoop is used to develop applications that could perform complete statistical analysis on huge amounts of data.

Hadoop - Introduction

Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

Hadoop Architecture

At its core, Hadoop has two major layers namely −

  1. Processing/Computation layer (MapReduce), and

  2. Storage layer (Hadoop Distributed File System).

MapReduce

MapReduce is a parallel programming model for writing distributed applications devised at Google for efficient processing of large amounts of data (multi-terabyte data-sets), on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. The MapReduce program runs on Hadoop which is an Apache open-source framework.

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system that is designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. It is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications having large datasets.

Apart from the above-mentioned two core components, Hadoop framework also includes the following two modules −

  1. Hadoop Common − These are Java libraries and utilities required by other Hadoop modules.

  2. Hadoop YARN − This is a framework for job scheduling and cluster resource management.

How Does Hadoop Work?

It is quite expensive to build bigger servers with heavy configurations that handle large scale processing, but as an alternative, you can tie together many commodity computers, each with a single CPU, as a single functional distributed system; practically, the clustered machines can read the dataset in parallel and provide a much higher throughput. Moreover, it is cheaper than one high-end server. So this is the first motivational factor behind using Hadoop: it runs across clustered and low-cost machines.

Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs −

  1. Data is initially divided into directories and files. Files are divided into uniform sized blocks of 128M and 64M (preferably 128M).

  2. These files are then distributed across various cluster nodes for further processing.

  3. HDFS, being on top of the local file system, supervises the processing.

  4. Blocks are replicated for handling hardware failure.

  5. Checking that the code was executed successfully.

  6. Performing the sort that takes place between the map and reduce stages.

  7. Sending the sorted data to a certain computer.

  8. Writing the debugging logs for each job.

Advantages of Hadoop

  1. Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines and, in turn, utilizes the underlying parallelism of the CPU cores.

  2. Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA), rather Hadoop library itself has been designed to detect and handle failures at the application layer.

  3. Servers can be added or removed from the cluster dynamically and Hadoop continues to operate without interruption.

  4. Another big advantage of Hadoop is that apart from being open source, it is compatible with all the platforms since it is Java based.

Hadoop - Environment Setup

Hadoop is supported by the GNU/Linux platform and its flavors. Therefore, we have to install a Linux operating system for setting up the Hadoop environment. In case you have an OS other than Linux, you can install Virtualbox software on it and have Linux inside the Virtualbox.

Pre-installation Setup

Before installing Hadoop into the Linux environment, we need to set up Linux using ssh (Secure Shell). Follow the steps given below for setting up the Linux environment.

Creating a User

At the beginning, it is recommended to create a separate user for Hadoop to isolate Hadoop file system from Unix file system. Follow the steps given below to create a user −

  1. Open the root using the command “su”.

  2. Create a user from the root account using the command “useradd username”.

  3. Now you can open an existing user account using the command “su username”.

Open the Linux terminal and type the following commands to create a user.

$ su
   password:
# useradd hadoop
# passwd hadoop
   New passwd:
   Retype new passwd

SSH Setup and Key Generation

SSH setup is required to do different operations on a cluster such as starting, stopping, distributed daemon shell operations. To authenticate different users of Hadoop, it is required to provide public/private key pair for a Hadoop user and share it with different users.

The following commands are used for generating a key-value pair using SSH. Copy the public keys from id_rsa.pub to authorized_keys, and provide the owner with read and write permissions to the authorized_keys file respectively.

$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
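
A quick way to confirm the setup, before the Hadoop daemons are started later, is to try a passwordless login to the local machine; this is only a sanity check and assumes an SSH server (sshd) is already running locally −

# Note: assumes an SSH server (sshd) is running on this machine
$ ssh localhost
$ exit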

Installing Java

Java is the main prerequisite for Hadoop. First of all, you should verify the existence of java in your system using the command “java -version”. The syntax of java version command is given below.

$ java -version

If everything is in order, it will give you the following output.

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

If java is not installed in your system, then follow the steps given below for installing java.

Step 1

Download java (JDK <latest version> - X64.tar.gz) by visiting the following link www.oracle.com

Then jdk-7u71-linux-x64.tar.gz will be downloaded into your system.

Step 2

Generally you will find the downloaded java file in Downloads folder. Verify it and extract the jdk-7u71-linux-x64.gz file using the following commands.

$ cd Downloads/
$ ls
jdk-7u71-linux-x64.gz

$ tar zxf jdk-7u71-linux-x64.gz
$ ls
jdk1.7.0_71   jdk-7u71-linux-x64.gz

Step 3

To make java available to all the users, you have to move it to the location “/usr/local/”. Open root, and type the following commands.

$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit

Step 4

For setting up the PATH and JAVA_HOME variables, add the following commands to the ~/.bashrc file.

export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin

Now apply all the changes into the current running system.

$ source ~/.bashrc

Step 5

Use the following commands to configure java alternatives −

# alternatives --install /usr/bin/java java /usr/local/jdk1.7.0_71/bin/java 2
# alternatives --install /usr/bin/javac javac /usr/local/jdk1.7.0_71/bin/javac 2
# alternatives --install /usr/bin/jar jar /usr/local/jdk1.7.0_71/bin/jar 2

# alternatives --set java /usr/local/jdk1.7.0_71/bin/java
# alternatives --set javac /usr/local/jdk1.7.0_71/bin/javac
# alternatives --set jar /usr/local/jdk1.7.0_71/bin/jar

Now verify the java -version command from the terminal as explained above.

Downloading Hadoop

Download and extract Hadoop 2.4.1 from Apache software foundation using the following commands.

$ su
password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mv hadoop-2.4.1 hadoop
# exit

Hadoop Operation Modes

Once you have downloaded Hadoop, you can operate your Hadoop cluster in one of the three supported modes −

  1. Local/Standalone Mode − After downloading Hadoop in your system, by default, it is configured in a standalone mode and can be run as a single java process.

  2. Pseudo Distributed Mode − It is a distributed simulation on a single machine. Each Hadoop daemon such as hdfs, yarn, MapReduce etc., will run as a separate Java process. This mode is useful for development.

  3. Fully Distributed Mode − This mode is fully distributed with minimum two or more machines as a cluster. We will come across this mode in detail in the coming chapters.

Installing Hadoop in Standalone Mode

Here we will discuss the installation of Hadoop 2.4.1 in standalone mode.

There are no daemons running and everything runs in a single JVM. Standalone mode is suitable for running MapReduce programs during development, since it is easy to test and debug them.

Setting Up Hadoop

You can set Hadoop environment variables by appending the following commands to ~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop
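
If the hadoop command used in the next step is not found, you may also need to append Hadoop's bin and sbin directories to PATH in ~/.bashrc (the same line is used in the pseudo-distributed setup later in this chapter) and then run source ~/.bashrc −

export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin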

Before proceeding further, you need to make sure that Hadoop is working fine. Just issue the following command −

$ hadoop version

If everything is fine with your setup, then you should see the following result −

Hadoop 2.4.1
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4

It means your Hadoop’s standalone mode setup is working fine. By default, Hadoop is configured to run in a non-distributed mode on a single machine.

Example

Let’s check a simple example of Hadoop. The Hadoop installation delivers the following example MapReduce jar file, which provides basic functionality of MapReduce and can be used for calculations, like the Pi value, word counts in a given list of files, etc.

$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar

Let’s have an input directory where we will push a few files, and our requirement is to count the total number of words in those files. To calculate the total number of words, we do not need to write our own MapReduce program, provided the .jar file contains the implementation for word count. You can try other examples using the same .jar file; just issue the following command to check the MapReduce functional programs supported by the hadoop-mapreduce-examples-2.2.0.jar file.

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar

Step 1

Create temporary content files in the input directory. You can create this input directory anywhere you would like to work.

$ mkdir input
$ cp $HADOOP_HOME/*.txt input
$ ls -l input

It will give the following files in your input directory −

total 24
-rw-r--r-- 1 root root 15164 Feb 21 10:14 LICENSE.txt
-rw-r--r-- 1 root root   101 Feb 21 10:14 NOTICE.txt
-rw-r--r-- 1 root root  1366 Feb 21 10:14 README.txt

These files have been copied from the Hadoop installation home directory. For your experiment, you can have different and large sets of files.

Step 2

Let’s start the Hadoop process to count the total number of words in all the files available in the input directory, as follows −

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount input output

Step 3

Step 2 will do the required processing and save the output in the output/part-r-00000 file, which you can check by using −

$ cat output/*

It will list all the words along with their total counts from all the files available in the input directory.

"AS      4
"Contribution" 1
"Contributor" 1
"Derivative 1
"Legal 1
"License"      1
"License");     1
"Licensor"      1
"NOTICE”        1
"Not      1
"Object"        1
"Source”        1
"Work”    1
"You"     1
"Your")   1
"[]"      1
"control"       1
"printed        1
"submitted"     1
(50%)     1
(BIS),    1
(C)       1
(Don't)   1
(ECCN)    1
(INCLUDING      2
(INCLUDING,     2
.............

Installing Hadoop in Pseudo Distributed Mode

Follow the steps given below to install Hadoop 2.4.1 in pseudo distributed mode.

Step 1 − Setting Up Hadoop

You can set Hadoop environment variables by appending the following commands to ~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME

export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME

Now apply all the changes into the current running system.

$ source ~/.bashrc

Step 2 − Hadoop Configuration

You can find all the Hadoop configuration files in the location “$HADOOP_HOME/etc/hadoop”. It is required to make changes in those configuration files according to your Hadoop infrastructure.

$ cd $HADOOP_HOME/etc/hadoop

In order to develop Hadoop programs in java, you have to reset the java environment variables in hadoop-env.sh file by replacing JAVA_HOME value with the location of java in your system.

export JAVA_HOME=/usr/local/jdk1.7.0_71

The following is the list of files that you have to edit to configure Hadoop.

core-site.xml

The core-site.xml file contains information such as the port number used for Hadoop instance, memory allocated for the file system, memory limit for storing the data, and size of Read/Write buffers.

Open the core-site.xml and add the following properties in between <configuration>, </configuration> tags.

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>

hdfs-site.xml

The hdfs-site.xml file contains information such as the value of replication data, namenode path, and datanode paths of your local file systems. It means the place where you want to store the Hadoop infrastructure.

Let us assume the following data.

dfs.replication (data replication value) = 1

(In the below given path /hadoop/ is the user name.
hadoopinfra/hdfs/namenode is the directory created by hdfs file system.)
namenode path = //home/hadoop/hadoopinfra/hdfs/namenode

(hadoopinfra/hdfs/datanode is the directory created by hdfs file system.)
datanode path = //home/hadoop/hadoopinfra/hdfs/datanode

Open this file and add the following properties in between the <configuration> </configuration> tags in this file.

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>

   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>

   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
   </property>
</configuration>

Note − In the above file, all the property values are user-defined and you can make changes according to your Hadoop infrastructure.

yarn-site.xml

This file is used to configure yarn into Hadoop. Open the yarn-site.xml file and add the following properties in between the <configuration>, </configuration> tags in this file.

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>

mapred-site.xml

This file is used to specify which MapReduce framework we are using. By default, Hadoop contains a template of mapred-site.xml. First of all, it is required to copy the file from mapred-site.xml.template to mapred-site.xml using the following command.

$ cp mapred-site.xml.template mapred-site.xml

Open the mapred-site.xml file and add the following properties in between the <configuration>, </configuration> tags in this file.

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>

Verifying Hadoop Installation

The following steps are used to verify the Hadoop installation.

Step 1 − Name Node Setup

Set up the namenode using the command “hdfs namenode -format” as follows.

$ cd ~
$ hdfs namenode -format

The expected result is as follows.

10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = localhost/192.168.1.11
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.4.1
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory
/home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to
retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
************************************************************/

Step 2 − Verifying Hadoop dfs

The following command is used to start dfs. Executing this command will start your Hadoop file system.

$ start-dfs.sh

The expected output is as follows −

10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop
2.4.1/logs/hadoop-hadoop-namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop
2.4.1/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]

Step 3 − Verifying Yarn Script

The following command is used to start the yarn script. Executing this command will start your yarn daemons.

$ start-yarn.sh

The expected output is as follows −

starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop
2.4.1/logs/yarn-hadoop-resourcemanager-localhost.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop
2.4.1/logs/yarn-hadoop-nodemanager-localhost.out

Step 4 − Accessing Hadoop on Browser

The default port number to access Hadoop is 50070. Use the following url to get Hadoop services on browser.

http://localhost:50070/

Step 5 − Verify All Applications for Cluster

The default port number to access all applications of cluster is 8088. Use the following url to visit this service.

http://localhost:8088/

Hadoop - HDFS Overview

Hadoop File System was developed using a distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed using low-cost hardware.

HDFS holds a very large amount of data and provides easier access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to rescue the system from possible data losses in case of failure. HDFS also makes applications available for parallel processing.

Features of HDFS

  1. It is suitable for distributed storage and processing.

  2. Hadoop provides a command interface to interact with HDFS.

  3. The built-in servers of namenode and datanode help users to easily check the status of the cluster.

  4. Streaming access to file system data.

  5. HDFS provides file permissions and authentication.

HDFS Architecture

Given below is the architecture of a Hadoop File System.

HDFS follows the master-slave architecture and it has the following elements.

Namenode

The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software. It is software that can be run on commodity hardware. The system having the namenode acts as the master server and it does the following tasks −

  1. Manages the file system namespace.

  2. Regulates client’s access to files.

  3. It also executes file system operations such as renaming, closing, and opening files and directories.

Datanode

The datanode is commodity hardware having the GNU/Linux operating system and datanode software. For every node (commodity hardware/system) in a cluster, there will be a datanode. These nodes manage the data storage of their system.

  1. Datanodes perform read-write operations on the file systems, as per client request.

  2. They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.

Block

Generally the user data is stored in the files of HDFS. The file in a file system will be divided into one or more segments and/or stored in individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a Block. The default block size is 64MB, but it can be increased as per the need by changing the HDFS configuration.
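
For instance, on Hadoop 2.x the block size is controlled by the dfs.blocksize property; the property name and default differ across versions, so treat the following as an illustrative sketch only. It can be set cluster-wide in hdfs-site.xml or overridden per command, for example to use 128 MB blocks while uploading a single file −

# Note: illustrative sketch; assumes the Hadoop 2.x property name dfs.blocksize (134217728 bytes = 128 MB)
$ $HADOOP_HOME/bin/hadoop fs -D dfs.blocksize=134217728 -put /home/file.txt /user/input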

Goals of HDFS

Fault detection and recovery − Since HDFS includes a large number of commodity hardware, failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.

Huge datasets − HDFS should have hundreds of nodes per cluster to manage the applications having huge datasets.

Hardware at data − A requested task can be done efficiently, when the computation takes place near the data. Especially where huge datasets are involved, it reduces the network traffic and increases the throughput.

Hadoop - HDFS Operations

Starting HDFS

Initially you have to format the configured HDFS file system, open namenode (HDFS server), and execute the following command.

$ hadoop namenode -format

After formatting the HDFS, start the distributed file system. The following command will start the namenode as well as the data nodes as cluster.

$ start-dfs.sh

Listing Files in HDFS

After loading the information in the server, we can find the list of files in a directory and the status of a file using ‘ls’. Given below is the syntax of ls, to which you can pass a directory or a filename as an argument.

$ $HADOOP_HOME/bin/hadoop fs -ls <args>

Inserting Data into HDFS

Assume we have data in a file called file.txt in the local system which ought to be saved in the HDFS file system. Follow the steps given below to insert the required file into the Hadoop file system.

Step 1

You have to create an input directory.

$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input

Step 2

Transfer and store a data file from local systems to the Hadoop file system using the put command.

$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input

Step 3

You can verify the file using ls command.

$ $HADOOP_HOME/bin/hadoop fs -ls /user/input

Retrieving Data from HDFS

Assume we have a file in HDFS called outfile. Given below is a simple demonstration for retrieving the required file from the Hadoop file system.

Step 1

Initially, view the data from HDFS using cat command.

$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile

Step 2

Get the file from HDFS to the local file system using get command.

$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/

Shutting Down the HDFS

You can shut down the HDFS by using the following command.

$ stop-dfs.sh

Hadoop - Command Reference

There are many more commands in "$HADOOP_HOME/bin/hadoop fs" than are demonstrated here, although these basic operations will get you started. Running ./bin/hadoop dfs with no additional arguments will list all the commands that can be run with the FsShell system. Furthermore, $HADOOP_HOME/bin/hadoop fs -help commandName will display a short usage summary for the operation in question, if you are stuck.

A list of all the operations is given below. The following conventions are used for parameters −

"<path>" means any file or directory name.
"<path>..." means one or more file or directory names.
"<file>" means any filename.
"<src>" and "<dest>" are path names in a directed operation.
"<localSrc>" and "<localDest>" are paths as above, but on the local file system.

All other files and path names refer to the objects inside HDFS.
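
For example, to print the usage summary of the ls operation listed below −

$ $HADOOP_HOME/bin/hadoop fs -help ls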

  1. -ls <path> − Lists the contents of the directory specified by path, showing the names, permissions, owner, size and modification date for each entry.

  2. -lsr <path> − Behaves like -ls, but recursively displays entries in all subdirectories of path.

  3. -du <path> − Shows disk usage, in bytes, for all the files which match path; filenames are reported with the full HDFS protocol prefix.

  4. -dus <path> − Like -du, but prints a summary of disk usage of all files/directories in the path.

  5. -mv <src> <dest> − Moves the file or directory indicated by src to dest, within HDFS.

  6. -cp <src> <dest> − Copies the file or directory identified by src to dest, within HDFS.

  7. -rm <path> − Removes the file or empty directory identified by path.

  8. -rmr <path> − Removes the file or directory identified by path. Recursively deletes any child entries (i.e., files or subdirectories of path).

  9. -put <localSrc> <dest> − Copies the file or directory from the local file system identified by localSrc to dest within the DFS.

  10. -copyFromLocal <localSrc> <dest> − Identical to -put.

  11. -moveFromLocal <localSrc> <dest> − Copies the file or directory from the local file system identified by localSrc to dest within HDFS, and then deletes the local copy on success.

  12. -get [-crc] <src> <localDest> − Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.

  13. -getmerge <src> <localDest> − Retrieves all files that match the path src in HDFS, and copies them to a single, merged file in the local file system identified by localDest.

  14. -cat <filename> − Displays the contents of filename on stdout.

  15. -copyToLocal <src> <localDest> − Identical to -get.

  16. -moveToLocal <src> <localDest> − Works like -get, but deletes the HDFS copy on success.

  17. -mkdir <path> − Creates a directory named path in HDFS. Creates any parent directories in path that are missing (e.g., mkdir -p in Linux).

  18. -setrep [-R] [-w] rep <path> − Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time.)

  19. -touchz <path> − Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path, unless the file is already size 0.

  20. -test -[ezd] <path> − Returns 1 if path exists, has zero length, or is a directory; or 0 otherwise.

  21. -stat [format] <path> − Prints information about path. Format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).

  22. -tail [-f] <filename> − Shows the last 1KB of the file on stdout.

  23. -chmod [-R] mode,mode,… <path>… − Changes the file permissions associated with one or more objects identified by path…. Performs changes recursively with -R. mode is a 3-digit octal mode, or {augo}+/-{rwxX}. Assumes a (all) if no scope is specified and does not apply an umask.

  24. -chown [-R] [owner][:[group]] <path>… − Sets the owning user and/or group for files or directories identified by path…. Sets owner recursively if -R is specified.

  25. -chgrp [-R] group <path>… − Sets the owning group for files or directories identified by path…. Sets group recursively if -R is specified.

  26. -help <cmd-name> − Returns usage information for one of the commands listed above. You must omit the leading '-' character in cmd.

Hadoop - MapReduce

MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.

What is MapReduce?

MapReduce is a processing technique and a program model for distributed computing based on java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). Secondly, reduce task, which takes the output from a map as an input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.

The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to use the MapReduce model.

The Algorithm

  1. Generally, the MapReduce paradigm is based on sending the computation to where the data resides!

  2. MapReduce program executes in three stages, namely map stage, shuffle stage, and reduce stage. Map stage − The map or mapper’s job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data. Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer’s job is to process the data that comes from the mapper. After processing, it produces a new set of output, which will be stored in the HDFS.

  3. During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.

  4. The framework manages all the details of data-passing such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.

  5. Most of the computing takes place on nodes with data on local disks that reduces the network traffic.

  6. After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.

Inputs and Outputs (Java Perspective)

The MapReduce framework operates on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.

The key and the value classes should be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the Writable-Comparable interface to facilitate sorting by the framework. The input and output types of a MapReduce job are − (Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3> (Output).

             Input                   Output
Map          <k1, v1>                list (<k2, v2>)
Reduce       <k2, list(v2)>          list (<k3, v3>)

Terminology

  1. PayLoad − Applications implement the Map and the Reduce functions, and form the core of the job.

  2. Mapper − Mapper maps the input key/value pairs to a set of intermediate key/value pair.

  3. NamedNode − Node that manages the Hadoop Distributed File System (HDFS).

  4. DataNode − Node where data is presented in advance before any processing takes place.

  5. MasterNode − Node where JobTracker runs and which accepts job requests from clients.

  6. SlaveNode − Node where Map and Reduce program runs.

  7. JobTracker − Schedules jobs and tracks the assigned jobs to the Task Tracker.

  8. Task Tracker − Tracks the task and reports status to JobTracker.

  9. Job − A program is an execution of a Mapper and Reducer across a dataset.

  10. Task − An execution of a Mapper or a Reducer on a slice of data.

  11. Task Attempt − A particular instance of an attempt to execute a task on a SlaveNode.

Example Scenario

Given below is the data regarding the electrical consumption of an organization. It contains the monthly electrical consumption and the annual average for various years.

        Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec   Avg
1979    23    23    2     43    24    25    26    26    26    26    25    26    25
1980    26    27    28    28    28    30    31    31    31    30    30    30    29
1981    31    32    32    32    33    34    35    36    36    34    34    34    34
1984    39    38    39    39    39    41    42    43    40    39    38    38    40
1985    38    39    39    39    39    41    41    41    00    40    39    39    45

If the above data is given as input, we have to write applications to process it and produce results such as finding the year of maximum usage, the year of minimum usage, and so on. This is a walkover for programmers with a finite number of records. They will simply write the logic to produce the required output, and pass the data to the application written.

But, think of the data representing the electrical consumption of all the large-scale industries of a particular state since its formation.

When we write applications to process such bulk data,

  1. They will take a lot of time to execute.

  2. There will be a heavy network traffic when we move data from source to network server and so on.

To solve these problems, we have the MapReduce framework.

Input Data

The above data is saved as sample.txt and given as input. The input file looks as shown below.

1979   23   23   2   43   24   25   26   26   26   26   25   26  25
1980   26   27   28  28   28   30   31   31   31   30   30   30  29
1981   31   32   32  32   33   34   35   36   36   34   34   34  34
1984   39   38   39  39   39   41   42   43   40   39   38   38  40
1985   38   39   39  39   39   41   41   41   00   40   39   39  45

Example Program

Given below is the program to process the sample data using the MapReduce framework.

package hadoop;

import java.util.*;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class ProcessUnits {
   //Mapper class
   public static class E_EMapper extends MapReduceBase implements
   Mapper<LongWritable ,/*Input key Type */
   Text,                /*Input value Type*/
   Text,                /*Output key Type*/
   IntWritable>        /*Output value Type*/
   {
      //Map function
      public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output,

      Reporter reporter) throws IOException {
         String line = value.toString();
         String lasttoken = null;
         StringTokenizer s = new StringTokenizer(line,"\t");
         String year = s.nextToken();

         while(s.hasMoreTokens()) {
            lasttoken = s.nextToken();
         }
         int avgprice = Integer.parseInt(lasttoken);
         output.collect(new Text(year), new IntWritable(avgprice));
      }
   }

   //Reducer class
   public static class E_EReduce extends MapReduceBase implements Reducer< Text, IntWritable, Text, IntWritable > {

      //Reduce function
      public void reduce( Text key, Iterator <IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
         int maxavg = 30;
         int val = Integer.MIN_VALUE;

         while (values.hasNext()) {
            if((val = values.next().get())>maxavg) {
               output.collect(key, new IntWritable(val));
            }
         }
      }
   }

   //Main function
   public static void main(String args[])throws Exception {
      JobConf conf = new JobConf(ProcessUnits.class);

      conf.setJobName("max_eletricityunits");
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      conf.setMapperClass(E_EMapper.class);
      conf.setCombinerClass(E_EReduce.class);
      conf.setReducerClass(E_EReduce.class);
      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);

      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      JobClient.runJob(conf);
   }
}

Save the above program as ProcessUnits.java. The compilation and execution of the program is explained below.

Compilation and Execution of Process Units Program

Let us assume we are in the home directory of a Hadoop user (e.g. /home/hadoop).

Follow the steps given below to compile and execute the above program.

Step 1

The following command is to create a directory to store the compiled java classes.

$ mkdir units

Step 2

Download Hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce program. Visit the following link mvnrepository.com to download the jar. Let us assume the downloaded folder is /home/hadoop/.

Step 3

The following commands are used for compiling the ProcessUnits.java program and creating a jar for the program.

$ javac -classpath hadoop-core-1.2.1.jar -d units ProcessUnits.java
$ jar -cvf units.jar -C units/ .

Step 4

The following command is used to create an input directory in HDFS.

$HADOOP_HOME/bin/hadoop fs -mkdir input_dir

Step 5

The following command is used to copy the input file named sample.txt into the input directory of HDFS.

$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir

Step 6

The following command is used to verify the files in the input directory.

$HADOOP_HOME/bin/hadoop fs -ls input_dir/

Step 7

The following command is used to run the Eleunit_max application by taking the input files from the input directory.

$HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits input_dir output_dir

Wait for a while until the file is executed. After execution, as shown below, the output will contain the number of input splits, the number of Map tasks, the number of reducer tasks, etc.

INFO mapreduce.Job: Job job_1414748220717_0002
completed successfully
14/10/31 06:02:52
INFO mapreduce.Job: Counters: 49
   File System Counters

FILE: Number of bytes read = 61
FILE: Number of bytes written = 279400
FILE: Number of read operations = 0
FILE: Number of large read operations = 0
FILE: Number of write operations = 0
HDFS: Number of bytes read = 546
HDFS: Number of bytes written = 40
HDFS: Number of read operations = 9
HDFS: Number of large read operations = 0
HDFS: Number of write operations = 2 Job Counters


   Launched map tasks = 2
   Launched reduce tasks = 1
   Data-local map tasks = 2
   Total time spent by all maps in occupied slots (ms) = 146137
   Total time spent by all reduces in occupied slots (ms) = 441
   Total time spent by all map tasks (ms) = 14613
   Total time spent by all reduce tasks (ms) = 44120
   Total vcore-seconds taken by all map tasks = 146137
   Total vcore-seconds taken by all reduce tasks = 44120
   Total megabyte-seconds taken by all map tasks = 149644288
   Total megabyte-seconds taken by all reduce tasks = 45178880

Map-Reduce Framework

   Map input records = 5
   Map output records = 5
   Map output bytes = 45
   Map output materialized bytes = 67
   Input split bytes = 208
   Combine input records = 5
   Combine output records = 5
   Reduce input groups = 5
   Reduce shuffle bytes = 6
   Reduce input records = 5
   Reduce output records = 5
   Spilled Records = 10
   Shuffled Maps  = 2
   Failed Shuffles = 0
   Merged Map outputs = 2
   GC time elapsed (ms) = 948
   CPU time spent (ms) = 5160
   Physical memory (bytes) snapshot = 47749120
   Virtual memory (bytes) snapshot = 2899349504
   Total committed heap usage (bytes) = 277684224

File Output Format Counters

   Bytes Written = 40

Step 8

The following command is used to verify the resultant files in the output folder.

$HADOOP_HOME/bin/hadoop fs -ls output_dir/

Step 9

The following command is used to see the output in the Part-00000 file. This file is generated by HDFS.

$HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00000

Below is the output generated by the MapReduce program.

1981    34
1984    40
1985    45

Step 10

The following command is used to copy the output folder from HDFS to the local file system for analyzing.

$HADOOP_HOME/bin/hadoop fs -get output_dir /home/hadoop

Important Commands

All Hadoop commands are invoked by the $HADOOP_HOME/bin/hadoop command. Running the Hadoop script without any arguments prints the description for all commands.

Usage − hadoop [--config confdir] COMMAND

The following list shows the available options and their descriptions.

  1. namenode -format − Formats the DFS filesystem.

  2. secondarynamenode − Runs the DFS secondary namenode.

  3. namenode − Runs the DFS namenode.

  4. datanode − Runs a DFS datanode.

  5. dfsadmin − Runs a DFS admin client.

  6. mradmin − Runs a Map-Reduce admin client.

  7. fsck − Runs a DFS filesystem checking utility.

  8. fs − Runs a generic filesystem user client.

  9. balancer − Runs a cluster balancing utility.

  10. oiv − Applies the offline fsimage viewer to an fsimage.

  11. fetchdt − Fetches a delegation token from the NameNode.

  12. jobtracker − Runs the MapReduce job Tracker node.

  13. pipes − Runs a Pipes job.

  14. tasktracker − Runs a MapReduce task Tracker node.

  15. historyserver − Runs job history servers as a standalone daemon.

  16. job − Manipulates the MapReduce jobs.

  17. queue − Gets information regarding JobQueues.

  18. version − Prints the version.

  19. jar <jar> − Runs a jar file.

  20. distcp <srcurl> <desturl> − Copies file or directories recursively.

  21. distcp2 <srcurl> <desturl> − DistCp version 2.

  22. archive -archiveName NAME -p <parent path> <src>* <dest> − Creates a hadoop archive.

  23. classpath − Prints the class path needed to get the Hadoop jar and the required libraries.

  24. daemonlog − Get/Set the log level for each daemon.

How to Interact with MapReduce Jobs

Usage − hadoop job [GENERIC_OPTIONS]

The following are the Generic Options available in a Hadoop job.

  1. -submit <job-file> − Submits the job.

  2. -status <job-id> − Prints the map and reduce completion percentage and all job counters.

  3. -counter <job-id> <group-name> <countername> − Prints the counter value.

  4. -kill <job-id> − Kills the job.

  5. -events <job-id> <from-event-#> <#-of-events> − Prints the events' details received by jobtracker for the given range.

  6. -history [all] <jobOutputDir> − Prints job details, failed and killed tip details. More details about the job such as successful tasks and task attempts made for each task can be viewed by specifying the [all] option.

  7. -list [all] − Displays all jobs. -list displays only jobs which are yet to complete.

  8. -kill-task <task-id> − Kills the task. Killed tasks are NOT counted against failed attempts.

  9. -fail-task <task-id> − Fails the task. Failed tasks are counted against failed attempts.

  10. -set-priority <job-id> <priority> − Changes the priority of the job. Allowed priority values are VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW.

To see the status of a job

$ $HADOOP_HOME/bin/hadoop job -status <JOB-ID>
e.g.
$ $HADOOP_HOME/bin/hadoop job -status job_201310191043_0004

To see the history of a job's output-dir

$ $HADOOP_HOME/bin/hadoop job -history <DIR-NAME>
e.g.
$ $HADOOP_HOME/bin/hadoop job -history /user/expert/output

To kill a job

$ $HADOOP_HOME/bin/hadoop job -kill <JOB-ID>
e.g.
$ $HADOOP_HOME/bin/hadoop job -kill job_201310191043_0004
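
The remaining generic options follow the same pattern. For example (reusing the sample job ID shown above), to list all jobs or change a job's priority −

$ $HADOOP_HOME/bin/hadoop job -list all
$ $HADOOP_HOME/bin/hadoop job -set-priority job_201310191043_0004 HIGH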

Hadoop - Streaming

Hadoop 流式处理是 Hadoop 发行版中附带的一个实用工具。此实用工具允许你使用任何可执行文件或脚本作为映射器和/或规约器来创建和运行 Map/Reduce 作业。

Hadoop streaming is a utility that comes with the Hadoop distribution. This utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

Example Using Python

对于 Hadoop 流式处理,我们考虑单词计数问题。Hadoop 中的任何作业都必须具有两个阶段:映射器和规约器。我们已用 Python 脚本为映射器和规约器编写了代码,以便在 Hadoop 下运行它。你也可以用 Perl 和 Ruby 编写相同的代码。

For Hadoop streaming, we consider the word-count problem. Any job in Hadoop must have two phases: mapper and reducer. We have written the mapper and reducer code as Python scripts to run under Hadoop. One can also write the same in Perl or Ruby.

Mapper Phase Code

#!/usr/bin/python

import sys

# Input takes from standard input
for myline in sys.stdin:
   # Remove whitespace either side
   myline = myline.strip()

   # Break the line into words
   words = myline.split()

   # Iterate the words list
   for myword in words:
      # Write the results to standard output
      print '%s\t%s' % (myword, 1)

确保此文件具有执行权限 (chmod +x /home/expert/hadoop-1.2.1/mapper.py)。

Make sure this file has execution permission (chmod +x /home/expert/hadoop-1.2.1/mapper.py).

Reducer Phase Code

#!/usr/bin/python

from operator import itemgetter
import sys

current_word = ""
current_count = 0
word = ""

# Input takes from standard input
for myline in sys.stdin:
   # Remove whitespace either side
   myline = myline.strip()

   # Split the input we got from mapper.py
   word, count = myline.split('\t', 1)

   # Convert count variable to integer
   try:
      count = int(count)
   except ValueError:
      # Count was not a number, so silently ignore this line
      continue

   if current_word == word:
      current_count += count
   else:
      if current_word:
         # Write result to standard output
         print '%s\t%s' % (current_word, current_count)
      current_count = count
      current_word = word

# Do not forget to output the last word if needed!
if current_word == word:
   print '%s\t%s' % (current_word, current_count)

将映射器和规约器代码保存在 Hadoop 主目录中的 mapper.py 和 reducer.py 中。确保这些文件具有执行权限 (chmod +x mapper.py 和 chmod +x reducer.py)。由于 Python 对缩进很敏感,因此可以通过下面的链接下载相同的代码。

Save the mapper and reducer code in mapper.py and reducer.py in the Hadoop home directory. Make sure these files have execution permission (chmod +x mapper.py and chmod +x reducer.py). Since Python is sensitive to indentation, the same code can also be downloaded from the link below.
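
As an optional sanity check before submitting the job, the mapper and reducer can be simulated locally with a Unix pipeline from the directory where the scripts were saved, with sort standing in for Hadoop's shuffle phase (the sample input below is arbitrary):

$ echo "deer bear river car car river deer car bear" | ./mapper.py | sort -k1,1 | ./reducer.py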

Execution of WordCount Program

$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
   -input input_dirs \
   -output output_dir \
   -mapper <path>/mapper.py \
   -reducer <path>/reducer.py

使用“\”换行,以便于清晰阅读。

Where "\" is used for line continuation for clear readability.

For Example,

./bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -input myinput -output myoutput -mapper /home/expert/hadoop-1.2.1/mapper.py -reducer /home/expert/hadoop-1.2.1/reducer.py

How Streaming Works

在上述示例中,映射器和规约器都是从标准输入读取输入并向标准输出发送输出的 Python 脚本。此实用工具将创建一个 Map/Reduce 作业,将作业提交到适当的集群,并在作业完成之前监控作业进度。

In the above example, both the mapper and the reducer are python scripts that read the input from standard input and emit the output to standard output. The utility will create a Map/Reduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes.

当为映射器指定脚本时,每个映射器任务将在映射器初始化时作为单独的进程启动该脚本。随着映射器任务的运行,它会将其输入转换为行,并将这些行馈送给进程的标准输入 (STDIN)。同时,映射器从进程的标准输出 (STDOUT) 收集面向行的输出,并将每行转换成一个键值对,该键值对将作为映射器的输出被收集。默认情况下,一行开头到第一个制表符字符的前缀是键,该行的其余部分(不包括制表符字符)将是值。如果该行中没有制表符字符,则整行被视为键,而值为空。但是,可以根据需要进行自定义。

When a script is specified for mappers, each mapper task will launch the script as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds the lines to the standard input (STDIN) of the process. In the meantime, the mapper collects the line-oriented outputs from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. If there is no tab character in the line, then the entire line is considered as the key and the value is null. However, this can be customized as per one's needs.

当为规约器指定脚本时,每个规约器任务都将作为单独的进程启动该脚本,然后初始化规约器。随着规约器任务的运行,它会将其输入键值对转换为行,并将这些行馈送给进程的标准输入 (STDIN)。同时,规约器从进程的标准输出 (STDOUT) 收集面向行的输出,将每行转换成一个键值对,该键值对将作为规约器的输出被收集。默认情况下,一行开头到第一个制表符字符的前缀是键,该行的其余部分(不包括制表符字符)是值。但是,可以根据具体要求进行自定义。

When a script is specified for reducers, each reducer task will launch the script as a separate process when the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the standard input (STDIN) of the process. In the meantime, the reducer collects the line-oriented outputs from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized as per specific requirements.
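
If the default tab-based splitting does not suit the data, the separator and the number of fields that form the key can be changed through the standard streaming configuration properties, as in the sketch below (the separator and field count are only illustrative values):

$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
   -D stream.map.output.field.separator=. \
   -D stream.num.map.output.key.fields=2 \
   -input input_dirs \
   -output output_dir \
   -mapper <path>/mapper.py \
   -reducer <path>/reducer.py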

Important Commands

Parameter   Required/Optional   Description

-input directory/file-name − Required − Input location for mapper.
-output directory-name − Required − Output location for reducer.
-mapper executable or script or JavaClassName − Required − Mapper executable.
-reducer executable or script or JavaClassName − Required − Reducer executable.
-file file-name − Optional − Makes the mapper, reducer, or combiner executable available locally on the compute nodes.
-inputformat JavaClassName − Optional − Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default.
-outputformat JavaClassName − Optional − Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default.
-partitioner JavaClassName − Optional − Class that determines which reduce a key is sent to.
-combiner streamingCommand or JavaClassName − Optional − Combiner executable for map output.
-cmdenv name=value − Optional − Passes the environment variable to streaming commands.
-inputreader − Optional − For backwards-compatibility: specifies a record reader class (instead of an input format class).
-verbose − Optional − Verbose output.
-lazyOutput − Optional − Creates output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write).
-numReduceTasks − Optional − Specifies the number of reducers.
-mapdebug − Optional − Script to call when a map task fails.
-reducedebug − Optional − Script to call when a reduce task fails.
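
As an illustration of how the optional parameters combine, the word-count job from the earlier example could be submitted with the scripts shipped via -file, two reducers, and an extra environment variable (the option values here are arbitrary):

$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
   -input myinput \
   -output myoutput \
   -mapper mapper.py \
   -reducer reducer.py \
   -file /home/expert/hadoop-1.2.1/mapper.py \
   -file /home/expert/hadoop-1.2.1/reducer.py \
   -numReduceTasks 2 \
   -cmdenv MY_ENV=example \
   -verbose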

Hadoop - Multi-Node Cluster

本章介绍了在分布式环境中设置 Hadoop 多节点群集。

This chapter explains the setup of the Hadoop Multi-Node cluster on a distributed environment.

由于无法演示整个集群,我们使用三个系统(一个主设备和两个从设备)解释 Hadoop 集群环境;以下是它们的 IP 地址。

As the whole cluster cannot be demonstrated, we are explaining the Hadoop cluster environment using three systems (one master and two slaves); given below are their IP addresses.

  1. Hadoop Master: 192.168.1.15 (hadoop-master)

  2. Hadoop Slave: 192.168.1.16 (hadoop-slave-1)

  3. Hadoop Slave: 192.168.1.17 (hadoop-slave-2)

按照以下步骤设置 Hadoop 多节点集群。

Follow the steps given below to have Hadoop Multi-Node cluster setup.

Installing Java

Java 是 Hadoop 的主要先决条件。首先,您应该使用“java -version”验证您的系统中是否存在 java。java 版本命令的语法如下。

Java is the main prerequisite for Hadoop. First of all, you should verify the existence of java in your system using “java -version”. The syntax of java version command is given below.

$ java -version

如果一切正常,它将为您提供以下输出。

If everything works fine it will give you the following output.

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

如果您的系统中未安装 java,请按照以下步骤安装 java。

If java is not installed in your system, then follow the given steps for installing java.

Step 1

访问以下链接下载 java (JDK <最新版本> - X64.tar.gz): www.oracle.com

Download Java (JDK <latest version> - X64.tar.gz) by visiting the following link: www.oracle.com

然后 jdk-7u71-linux-x64.tar.gz 将下载到您的系统中。

Then jdk-7u71-linux-x64.tar.gz will be downloaded into your system.

Step 2

通常,您将在“下载”文件夹中找到下载的 java 文件。使用以下命令对其进行验证并解压缩 jdk-7u71-linux-x64.gz 文件。

Generally you will find the downloaded java file in Downloads folder. Verify it and extract the jdk-7u71-linux-x64.gz file using the following commands.

$ cd Downloads/
$ ls
jdk-7u71-Linux-x64.gz

$ tar zxf jdk-7u71-Linux-x64.gz
$ ls
jdk1.7.0_71 jdk-7u71-Linux-x64.gz

Step 3

要使所有用户都能使用 java,您必须将其移至 “/usr/local/”的位置。打开 root 并键入以下命令。

To make Java available to all the users, you have to move it to the location “/usr/local/”. Switch to the root user and type the following commands.

$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit

Step 4

为设置 PATH 和 JAVA_HOME 变量,将以下命令添加到 ~/.bashrc 文件。

For setting up PATH and JAVA_HOME variables, add the following commands to ~/.bashrc file.

export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin

现在,如上所述,从终端验证 java -version 命令。按照上述过程在所有集群节点中安装 java。

Now verify the java -version command from the terminal as explained above. Follow the above process and install java in all your cluster nodes.

Creating User Account

在主设备和从设备上创建一个系统用户帐户以使用 Hadoop 安装。

Create a system user account on both master and slave systems to use the Hadoop installation.

# useradd hadoop
# passwd hadoop

Mapping the nodes

您必须在所有节点上的 /etc/ 文件夹中编辑 hosts 文件,指定每个系统的 IP 地址以及它们的主机名。

You have to edit the hosts file in the /etc/ folder on all nodes and specify the IP address of each system followed by its host name.

# vi /etc/hosts
enter the following lines in the /etc/hosts file.

192.168.1.15 hadoop-master
192.168.1.16 hadoop-slave-1
192.168.1.17 hadoop-slave-2
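
Optionally, confirm from each machine that the host names resolve correctly (a quick sanity check; run it on every node):

# ping -c 1 hadoop-master
# ping -c 1 hadoop-slave-1
# ping -c 1 hadoop-slave-2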

Configuring Key Based Login

在每个节点中设置 ssh,以便它们无需任何密码提示即可相互通信。

Set up ssh on every node so that the nodes can communicate with one another without being prompted for a password.

# su hadoop
$ ssh-keygen -t rsa
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-1
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-2
$ chmod 0600 ~/.ssh/authorized_keys
$ exit
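
You can optionally verify the key based login from the master as the hadoop user; each command should print the remote host name without prompting for a password:

$ ssh hadoop@hadoop-slave-1 hostname
$ ssh hadoop@hadoop-slave-2 hostname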

Installing Hadoop

在主设备服务器中,使用以下命令下载并安装 Hadoop。

In the Master server, download and install Hadoop using the following commands.

# mkdir /opt/hadoop
# cd /opt/hadoop/
# wget http://apache.mesi.com.ar/hadoop/common/hadoop-1.2.0/hadoop-1.2.0.tar.gz
# tar -xzf hadoop-1.2.0.tar.gz
# mv hadoop-1.2.0 hadoop
# chown -R hadoop /opt/hadoop
# cd /opt/hadoop/hadoop/

Configuring Hadoop

您必须通过进行如下所示的更改将 Hadoop 服务器进行配置。

You have to configure the Hadoop server by making the changes given below.

core-site.xml

打开 core-site.xml 文件,并按如下所示进行编辑。

Open the core-site.xml file and edit it as shown below.

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://hadoop-master:9000/</value>
   </property>
   <property>
      <name>dfs.permissions</name>
      <value>false</value>
   </property>
</configuration>

hdfs-site.xml

打开 hdfs-site.xml 文件,并按如下所示进行编辑。

Open the hdfs-site.xml file and edit it as shown below.

<configuration>
   <property>
      <name>dfs.data.dir</name>
      <value>/opt/hadoop/hadoop/dfs/name/data</value>
      <final>true</final>
   </property>

   <property>
      <name>dfs.name.dir</name>
      <value>/opt/hadoop/hadoop/dfs/name</value>
      <final>true</final>
   </property>

   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
</configuration>

mapred-site.xml

打开 mapred-site.xml 文件,并按如下所示进行编辑。

Open the mapred-site.xml file and edit it as shown below.

<configuration>
   <property>
      <name>mapred.job.tracker</name>
      <value>hadoop-master:9001</value>
   </property>
</configuration>

hadoop-env.sh

打开 hadoop-env.sh 文件,并按如下所示编辑 JAVA_HOME、HADOOP_CONF_DIR 和 HADOOP_OPTS。

Open the hadoop-env.sh file and edit JAVA_HOME, HADOOP_CONF_DIR, and HADOOP_OPTS as shown below.

Note − 根据您的系统配置设置 JAVA_HOME。

Note − Set the JAVA_HOME as per your system configuration.

export JAVA_HOME=/usr/local/jdk1.7.0_71
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
export HADOOP_CONF_DIR=/opt/hadoop/hadoop/conf

Installing Hadoop on Slave Servers

按照提供的命令,在所有从属服务器上安装Hadoop。

Install Hadoop on all the slave servers by following the given commands.

# su hadoop
$ cd /opt/hadoop
$ scp -r hadoop hadoop-slave-1:/opt/hadoop
$ scp -r hadoop hadoop-slave-2:/opt/hadoop
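
Optionally, confirm that the copy reached each slave (the path is the one used above):

$ ssh hadoop-slave-1 ls /opt/hadoop
$ ssh hadoop-slave-2 ls /opt/hadoop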

Configuring Hadoop on Master Server

打开主服务器并按照给定的命令配置它。

Open the master server and configure it by following the given commands.

# su hadoop
$ cd /opt/hadoop/hadoop

Configuring Master Node

$ vi conf/masters

hadoop-master

Configuring Slave Node

$ vi conf/slaves

hadoop-slave-1
hadoop-slave-2

Format Name Node on Hadoop Master

# su hadoop
$ cd /opt/hadoop/hadoop
$ bin/hadoop namenode -format
11/10/14 10:58:07 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = hadoop-master/192.168.1.109
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.2.0
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1479473;
compiled by 'hortonfo' on Mon May 6 06:59:37 UTC 2013
STARTUP_MSG: java = 1.7.0_71

************************************************************/
11/10/14 10:58:08 INFO util.GSet: Computing capacity for map BlocksMap
editlog=/opt/hadoop/hadoop/dfs/name/current/edits
………………………………………………….
………………………………………………….
………………………………………………….
11/10/14 10:58:08 INFO common.Storage: Storage directory
/opt/hadoop/hadoop/dfs/name has been successfully formatted.
11/10/14 10:58:08 INFO namenode.NameNode:
SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop-master/192.168.1.15
************************************************************/

Starting Hadoop Services

以下命令用于在Hadoop主服务器上启动所有Hadoop服务。

The following commands are used to start all the Hadoop services on the Hadoop-Master.

$ cd $HADOOP_HOME/bin
$ ./start-all.sh
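
Once the services are up, the jps command can be used on every machine to confirm which daemons are running. On a typical Hadoop 1.x setup such as this one, the master is expected to show NameNode, SecondaryNameNode and JobTracker, while each slave shows DataNode and TaskTracker (process IDs will differ):

$ jps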

Adding a New DataNode in the Hadoop Cluster

以下是为Hadoop集群添加新节点的步骤。

Given below are the steps to be followed for adding new nodes to a Hadoop cluster.

Networking

为现有Hadoop集群添加新节点,并使用一些适当的网络配置。假设以下网络配置。

Add new nodes to an existing Hadoop cluster with some appropriate network configuration. Assume the following network configuration.

对于新节点配置−

For New node Configuration −

IP address : 192.168.1.103
netmask : 255.255.255.0
hostname : slave3.in

Adding User and SSH Access

Add a User

在新节点上,添加“hadoop”用户并使用以下命令将Hadoop用户的密码设置为“hadoop123”或您想要的任何密码。

On the new node, add a "hadoop" user and set the password of the hadoop user to "hadoop123" or anything you want, using the following commands.

useradd hadoop
passwd hadoop

建立从主节点到新从属节点的非密码连接。

Set up password-less connectivity from the master to the new slave.

Execute the following on the master

mkdir -p $HOME/.ssh
chmod 700 $HOME/.ssh
ssh-keygen -t rsa -P '' -f $HOME/.ssh/id_rsa
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
chmod 644 $HOME/.ssh/authorized_keys
Copy the public key to the new slave node in the hadoop user's $HOME directory
scp $HOME/.ssh/id_rsa.pub hadoop@192.168.1.103:/home/hadoop/

Execute the following on the slaves

登录到Hadoop。如果没有,请登录到Hadoop用户。

Log in to the new node as the hadoop user; if not, switch to the hadoop user.

su hadoop or ssh -X hadoop@192.168.1.103

将公钥的内容复制到文件 "$HOME/.ssh/authorized_keys" 中,然后通过执行以下命令更改相同文件的权限。

Copy the content of public key into file "$HOME/.ssh/authorized_keys" and then change the permission for the same by executing the following commands.

cd $HOME
mkdir -p $HOME/.ssh
chmod 700 $HOME/.ssh
cat id_rsa.pub >>$HOME/.ssh/authorized_keys
chmod 644 $HOME/.ssh/authorized_keys

从主计算机检查ssh登录。现在,检查是否可以从主节点通过ssh访问新节点,而无需密码。

Check ssh login from the master machine. Now check if you can ssh to the new node without a password from the master.

ssh hadoop@192.168.1.103 or hadoop@slave3

Set Hostname of New Node

您可以在文件 /etc/sysconfig/network 中设置主机名。

You can set the hostname in the file /etc/sysconfig/network.

On new slave3 machine

NETWORKING=yes
HOSTNAME=slave3.in

为了使更改生效,请重新启动计算机或使用各自的主机名对新计算机运行主机名命令(重新启动是一个好选择)。

To make the changes effective, either restart the machine or run the hostname command on the new machine with the respective hostname (a restart is a good option).

在slave3节点计算机上−

On slave3 node machine −

hostname slave3.in

使用以下行更新群集所有计算机上的 /etc/hosts

Update /etc/hosts on all machines of the cluster with the following lines −

192.168.1.103 slave3.in slave3

现在尝试使用主机名ping计算机,以检查是否解析为IP。

Now try to ping the machine with hostnames to check whether it is resolving to IP or not.

在新节点计算机上−

On new node machine −

ping hadoop-master

Start the DataNode on New Node

使用 $HADOOP_HOME/bin/hadoop-daemon.sh script 手动启动dataNode守护程序。它将自动联系主(NameNode)并加入群集。我们还应该将新节点添加到主服务器中的conf/slaves文件中。基于脚本的命令将识别新的节点。

Start the datanode daemon manually using $HADOOP_HOME/bin/hadoop-daemon.sh script. It will automatically contact the master (NameNode) and join the cluster. We should also add the new node to the conf/slaves file in the master server. The script-based commands will recognize the new node.

Login to new node

su hadoop or ssh -X hadoop@192.168.1.103

Start HDFS on a newly added slave node by using the following command

./bin/hadoop-daemon.sh start datanode

Check the output of jps command on a new node. It looks as follows.

$ jps
7141 DataNode
10312 Jps

Removing a DataNode from the Hadoop Cluster

在运行时,我们可以在不丢失任何数据的情况下从集群中删除一个节点。HDFS 提供了一个解除配置的功能,它确保以安全的方式删除节点。要使用该功能,请按照以下步骤操作:

We can remove a node from a cluster on the fly, while it is running, without any data loss. HDFS provides a decommissioning feature, which ensures that removing a node is performed safely. To use it, follow the steps as given below −

Step 1 − Login to master

登录到已安装 Hadoop 的主计算机用户。

Log in to the master machine as the user under which Hadoop is installed.

$ su hadoop

Step 2 − Change cluster configuration

必须在启动集群之前配置一个排除文件。向我们的 ` $HADOOP_HOME/etc/hadoop/hdfs-site.xml ` 文件中添加一个名为 dfs.hosts.exclude 的键。与此键关联的值提供了一个文件在 NameNode 本地文件系统上的完整路径,此文件包含不被允许连接到 HDFS 的计算机列表。

An exclude file must be configured before starting the cluster. Add a key named dfs.hosts.exclude to our $HADOOP_HOME/etc/hadoop/hdfs-site.xml file. The value associated with this key provides the full path to a file on the NameNode’s local file system which contains a list of machines which are not permitted to connect to HDFS.

例如,将这些行添加到 ` etc/hadoop/hdfs-site.xml ` 文件。

For example, add these lines to etc/hadoop/hdfs-site.xml file.

<property>
   <name>dfs.hosts.exclude</name>
   <value>/home/hadoop/hadoop-1.2.1/hdfs_exclude.txt</value>
   <description>DFS exclude</description>
</property>

Step 3 − Determine hosts to decommission

每个需要解除配置的计算机都应添加到 hdfs_exclude.txt 标识的文件中,每行一个域名。这会阻止它们连接到 NameNode。如果你想删除 DataNode2,` "/home/hadoop/hadoop-1.2.1/hdfs_exclude.txt" ` 文件的内容如下所示:

Each machine to be decommissioned should be added to the file identified by the hdfs_exclude.txt, one domain name per line. This will prevent them from connecting to the NameNode. Content of the "/home/hadoop/hadoop-1.2.1/hdfs_exclude.txt" file is shown below, if you want to remove DataNode2.

slave2.in

Step 4 − Force configuration reload

在不加引号的情况下运行命令 ` "$HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes" `。

Run the command "$HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes" without the quotes.

$ $HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes

这将强制 NameNode 重新读取其配置,包括新更新的“排除”文件。它会在一段时间内解除配置节点,从而为每个节点的块复制到计划保持活动状态的计算机上提供时间。

This will force the NameNode to re-read its configuration, including the newly updated ‘excludes’ file. It will decommission the nodes over a period of time, allowing time for each node’s blocks to be replicated onto machines which are scheduled to remain active.

在 ` slave2.in ` 中,检查 jps 命令输出。一段时间后,你将看到 DataNode 进程自动关闭。

On slave2.in, check the jps command output. After some time, you will see the DataNode process is shutdown automatically.

Step 5 − Shutdown nodes

解除配置过程完成后,可以安全地关闭已解除配置的硬件以进行维护。运行 dfsadmin 报告命令以检查解除配置的状态。以下命令将描述解除配置节点的状态和连接到该集群的节点。

After the decommission process has been completed, the decommissioned hardware can be safely shut down for maintenance. Run the dfsadmin -report command to check the status of the decommission. The following command will describe the status of the decommissioned nodes and of the nodes connected to the cluster.

$ $HADOOP_HOME/bin/hadoop dfsadmin -report

Step 6 − Edit excludes file again

机器解除配置后,可以从“排除”文件中删除它们。再次运行 ` "$HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes" ` 将把排除文件读回到 NameNode 中;允许 DataNode 在维护完成后重新加入该集群,或者在该集群再次需要额外容量时重新加入该集群,等等。

Once the machines have been decommissioned, they can be removed from the ‘excludes’ file. Running "$HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes" again will read the excludes file back into the NameNode; allowing the DataNodes to rejoin the cluster after the maintenance has been completed, or additional capacity is needed in the cluster again, etc.

` Special Note ` - 如果遵循以上过程,并且 tasktracker 进程仍在该节点上运行,则需要将其关闭。一种方法是像在上述步骤中所做的那样断开计算机连接。主计算机将自动识别该进程并宣布该进程已死亡。无需遵循相同过程来删除 tasktracker,因为它不如 DataNode 那样重要。DataNode 包含你要在不丢失任何数据的情况下安全删除的数据。

Special Note − If the above process is followed and the tasktracker process is still running on the node, it needs to be shut down. One way is to disconnect the machine as we did in the above steps. The master will recognize the process automatically and declare it dead. There is no need to follow the same process for removing the tasktracker, because it is not as crucial as the DataNode. The DataNode contains the data that you want to remove safely, without any loss.

可以通过在任何时间点使用以下命令动态运行/关闭 tasktracker。

The tasktracker can be run/shutdown on the fly by the following command at any point of time.

$ $HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker
$ $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker