Map Reduce 简明教程

MapReduce - Quick Guide

MapReduce - Introduction

MapReduce 是一种编程模型，用于编写可以在多个节点上并行处理大数据的应用程序。MapReduce 为分析大量复杂数据提供了分析功能。

MapReduce is a programming model for writing applications that can process Big Data in parallel on multiple nodes. MapReduce provides analytical capabilities for analyzing huge volumes of complex data.

What is Big Data?

大数据是大量数据集的集合，无法使用传统计算技术进行处理。例如，Facebook 或 YouTube 每天需要收集和管理的数据量，属于大数据的范畴。然而，大数据不仅仅是规模和容量，还涉及以下一个或多个方面：速度、多样性、容量和复杂性。

Big Data is a collection of large datasets that cannot be processed using traditional computing techniques. For example, the volume of data Facebook or Youtube need require it to collect and manage on a daily basis, can fall under the category of Big Data. However, Big Data is not only about scale and volume, it also involves one or more of the following aspects − Velocity, Variety, Volume, and Complexity.

Why MapReduce?

传统企业系统通常有一个集中式服务器来存储和处理数据。以下插图描绘了传统企业系统的示意图。传统模型显然不适合处理大量的可扩展数据，也无法容纳标准的数据库服务器。此外，集中式系统在同时处理多个文件时会造成太多瓶颈。

Traditional Enterprise Systems normally have a centralized server to store and process data. The following illustration depicts a schematic view of a traditional enterprise system. Traditional model is certainly not suitable to process huge volumes of scalable data and cannot be accommodated by standard database servers. Moreover, the centralized system creates too much of a bottleneck while processing multiple files simultaneously.

谷歌使用一种名为 MapReduce 的算法解决了这个瓶颈问题。MapReduce 将一个任务分成小部分并将其分配给多台计算机。稍后，结果收集到一处并集成形成结果数据集。

Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a task into small parts and assigns them to many computers. Later, the results are collected at one place and integrated to form the result dataset.

How MapReduce Works?

MapReduce 算法包含两项重要任务，即 Map 和 Reduce。

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs).
The Reduce task takes the output from the Map as an input and combines those data tuples (key-value pairs) into a smaller set of tuples.

reduce 任务始终在 map 作业之后执行。

The reduce task is always performed after the map job.

现在让我们仔细研究每个阶段并尝试理解它们的意义。

Let us now take a close look at each of the phases and try to understand their significance.

Input Phase − Here we have a Record Reader that translates each record in an input file and sends the parsed data to the mapper in the form of key-value pairs.
Map − Map is a user-defined function, which takes a series of key-value pairs and processes each one of them to generate zero or more key-value pairs.
Intermediate Keys − They key-value pairs generated by the mapper are known as intermediate keys.
Combiner − A combiner is a type of local Reducer that groups similar data from the map phase into identifiable sets. It takes the intermediate keys from the mapper as input and applies a user-defined code to aggregate the values in a small scope of one mapper. It is not a part of the main MapReduce algorithm; it is optional.
Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads the grouped key-value pairs onto the local machine, where the Reducer is running. The individual key-value pairs are sorted by key into a larger data list. The data list groups the equivalent keys together so that their values can be iterated easily in the Reducer task.
Reducer − The Reducer takes the grouped key-value paired data as input and runs a Reducer function on each one of them. Here, the data can be aggregated, filtered, and combined in a number of ways, and it requires a wide range of processing. Once the execution is over, it gives zero or more key-value pairs to the final step.
Output Phase − In the output phase, we have an output formatter that translates the final key-value pairs from the Reducer function and writes them onto a file using a record writer.

让我们尝试借助一小部分示意图理解 Map 和 Reduce 这两个任务 −

Let us try to understand the two tasks Map &f Reduce with the help of a small diagram −

MapReduce-Example

让我们举一个实际示例来说明 MapReduce 的功能。推特每天收到大约 5 亿条推文，每秒接近 3000 条推文。以下插图展示了推特如何借助 MapReduce 管理其推文。

Let us take a real-world example to comprehend the power of MapReduce. Twitter receives around 500 million tweets per day, which is nearly 3000 tweets per second. The following illustration shows how Tweeter manages its tweets with the help of MapReduce.

如图所示，MapReduce算法执行以下操作：

As shown in the illustration, the MapReduce algorithm performs the following actions −

Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.
Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as key-value pairs.
Count − Generates a token counter per word.
Aggregate Counters − Prepares an aggregate of similar counter values into small manageable units.

MapReduce - Algorithm

MapReduce 算法包含两项重要任务，即 Map 和 Reduce。

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

The map task is done by means of Mapper Class
The reduce task is done by means of Reducer Class.

映射程序类接受输入，对输入进行标记化、映射和排序。映射程序类的输出作为还原程序类的输入，还原程序类反过来搜索匹配对并减少它们。

Mapper class takes the input, tokenizes it, maps and sorts it. The output of Mapper class is used as input by Reducer class, which in turn searches matching pairs and reduces them.

MapReduce 实施各种数学算法，将任务划分为较小的部分并将其分配给多个系统。从技术角度而言，MapReduce 算法有助于将 Map 和 Reduce 任务发送到集群中的适当服务器。

MapReduce implements various mathematical algorithms to divide a task into small parts and assign them to multiple systems. In technical terms, MapReduce algorithm helps in sending the Map & Reduce tasks to appropriate servers in a cluster.

这些数学算法可能包括以下内容：

These mathematical algorithms may include the following −

Sorting
Searching
Indexing
TF-IDF

Sorting

排序是处理和分析数据的基本 MapReduce 算法之一。MapReduce 实施排序算法，通过键自动对映射程序的输出键值对进行排序。

Sorting is one of the basic MapReduce algorithms to process and analyze data. MapReduce implements sorting algorithm to automatically sort the output key-value pairs from the mapper by their keys.

Sorting methods are implemented in the mapper class itself.
In the Shuffle and Sort phase, after tokenizing the values in the mapper class, the Context class (user-defined class) collects the matching valued keys as a collection.
To collect similar key-value pairs (intermediate keys), the Mapper class takes the help of RawComparator class to sort the key-value pairs.
The set of intermediate key-value pairs for a given Reducer is automatically sorted by Hadoop to form key-values (K2, {V2, V2, …}) before they are presented to the Reducer.

Searching

搜索在 MapReduce 算法中扮演着重要的角色。它有助于合并器阶段（可选）和 Reducer 阶段。让我们尝试借助一个示例了解搜索的运作方式。

Searching plays an important role in MapReduce algorithm. It helps in the combiner phase (optional) and in the Reducer phase. Let us try to understand how Searching works with the help of an example.

Example

以下示例展示了 MapReduce 如何采用搜索算法找出给定员工数据集中薪水最高员工的详细信息。

The following example shows how MapReduce employs Searching algorithm to find out the details of the employee who draws the highest salary in a given employee dataset.

Let us assume we have employee data in four different files − A, B, C, and D. Let us also assume there are duplicate employee records in all four files because of importing the employee data from all database tables repeatedly. See the following illustration.

The Map phase processes each input file and provides the employee data in key-value pairs (<k, v> : <emp name, salary>). See the following illustration.

The combiner phase (searching technique) will accept the input from the Map phase as a key-value pair with employee name and salary. Using searching technique, the combiner will check all the employee salary to find the highest salaried employee in each file. See the following snippet.

<k: employee name, v: salary>
Max= the salary of an first employee. Treated as max salary

if(v(second employee).salary > Max){
   Max = v(salary);
}

else{
   Continue checking;
}

预期结果如下：

The expected result is as follows −

<satish, 26000> <gopal, 50000> <kiran, 45000> <manisha, 45000>

<satish, 26000>

<gopal, 50000>

<kiran, 45000>

<manisha, 45000>

Reducer phase − Form each file, you will find the highest salaried employee. To avoid redundancy, check all the <k, v> pairs and eliminate duplicate entries, if any. The same algorithm is used in between the four <k, v> pairs, which are coming from four input files. The final output should be as follows −

<gopal, 50000>

Indexing

通常，索引用于指向特定数据及其地址。它对特定 Mapper 的输入文件执行批处理索引。

Normally indexing is used to point to a particular data and its address. It performs batch indexing on the input files for a particular Mapper.

MapReduce 中通常使用的索引技术称为 inverted index. Google 和 Bing 等搜索引擎使用反向索引技术。让我们尝试借助一个简单示例了解索引的运作方式。

The indexing technique that is normally used in MapReduce is known as inverted index. Search engines like Google and Bing use inverted indexing technique. Let us try to understand how Indexing works with the help of a simple example.

Example

以下文本是反向索引的输入。其中 T[0]、T[1] 和 t[2] 是文件名，其内容置于双引号中。

The following text is the input for inverted indexing. Here T[0], T[1], and t[2] are the file names and their content are in double quotes.

T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"

应用索引算法后，我们得到以下输出 −

After applying the Indexing algorithm, we get the following output −

"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}

此处 "a": {2} 表示术语 "a" 出现在 T[2] 文件中。类似地，"is": {0, 1, 2} 表示术语 "is" 出现在文件 T[0]、T[1] 和 T[2] 中。

Here "a": {2} implies the term "a" appears in the T[2] file. Similarly, "is": {0, 1, 2} implies the term "is" appears in the files T[0], T[1], and T[2].

TF-IDF

TF-IDF 是一种文本处理算法，是术语频率 - 逆向文件频率的缩写。它是常见的 Web 分析算法之一。在此，术语“频率”是指术语在文档中出现的次数。

TF-IDF is a text processing algorithm which is short for Term Frequency − Inverse Document Frequency. It is one of the common web analysis algorithms. Here, the term 'frequency' refers to the number of times a term appears in a document.

Term Frequency (TF)

它测量特定术语在文档中出现的频率。它通过将单词在文档中出现的次数除以该文档中单词总数来计算。

It measures how frequently a particular term occurs in a document. It is calculated by the number of times a word appears in a document divided by the total number of words in that document.

TF(the) = (Number of times term the ‘the’ appears in a document) / (Total number of terms in the document)

Inverse Document Frequency (IDF)

它测量术语的重要性。它通过文本数据库中文件的数目除以特定术语出现的文件的数目来计算。

It measures the importance of a term. It is calculated by the number of documents in the text database divided by the number of documents where a specific term appears.

在计算 TF 时，所有术语都被认为同样重要。也就是说，TF 计算“is”、“a”、“what”等普通单词的术语频率。因此，我们需要在扩大罕见术语的比例时了解常见术语，通过计算以下内容：

While computing TF, all the terms are considered equally important. That means, TF counts the term frequency for normal words like “is”, “a”, “what”, etc. Thus we need to know the frequent terms while scaling up the rare ones, by computing the following −

IDF(the) = log_e(Total number of documents / Number of documents with term ‘the’ in it).

算法在以下一个小示例中得到解释。

The algorithm is explained below with the help of a small example.

Example

考虑一个包含 1000 个单词的文档，其中单词 hive 出现 50 次。则 hive 的 TF 为 (50/1000) = 0.05。

Consider a document containing 1000 words, wherein the word hive appears 50 times. The TF for hive is then (50 / 1000) = 0.05.

现在，假设我们有 1000 万份文档，单词 hive 出现在其中的 1000 份文档中。然后，IDF 计算如下：log(10,000,000 / 1,000) = 4。

Now, assume we have 10 million documents and the word hive appears in 1000 of these. Then, the IDF is calculated as log(10,000,000 / 1,000) = 4.

TF-IDF 权重是这些数量的乘积 − 0.05 × 4 = 0.20。

The TF-IDF weight is the product of these quantities − 0.05 × 4 = 0.20.

MapReduce - Installation

MapReduce仅在类Linux的操作系统上运行，并内置于Hadoop框架中。我们需要执行以下步骤来安装Hadoop框架。

MapReduce works only on Linux flavored operating systems and it comes inbuilt with a Hadoop Framework. We need to perform the following steps in order to install Hadoop framework.

Verifying JAVA Installation

在安装Hadoop之前，必须在系统中安装Java。使用以下命令检查您的系统是否已安装Java。

Java must be installed on your system before installing Hadoop. Use the following command to check whether you have Java installed on your system.

$ java –version

如果系统上已安装 Java，您将看到以下响应 -

If Java is already installed on your system, you get to see the following response −

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

如果您尚未在系统中安装Java，请按照以下步骤操作。

In case you don’t have Java installed on your system, then follow the steps given below.

Installing Java

Step 1

从以下链接下载Java最新版本： this link 。

Download the latest version of Java from the following link − this link.

下载后，您可以在“下载”文件夹中找到文件 jdk-7u71-linux-x64.tar.gz 。

After downloading, you can locate the file jdk-7u71-linux-x64.tar.gz in your Downloads folder.

Step 2

使用以下命令来解压jdk-7u71-linux-x64.gz的内容。

Use the following commands to extract the contents of jdk-7u71-linux-x64.gz.

$ cd Downloads/
$ ls
jdk-7u71-linux-x64.gz
$ tar zxf jdk-7u71-linux-x64.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.gz

Step 3

要向所有用户提供Java，您必须将其移动到“/usr/local/”位置。转到root并键入以下命令：

To make Java available to all the users, you have to move it to the location “/usr/local/”. Go to root and type the following commands −

$ su
password:
# mv jdk1.7.0_71 /usr/local/java
# exit

Step 4

为设置 PATH 和 JAVA_HOME 变量，将以下命令添加到 ~/.bashrc 文件中。

For setting up PATH and JAVA_HOME variables, add the following commands to ~/.bashrc file.

export JAVA_HOME=/usr/local/java
export PATH=$PATH:$JAVA_HOME/bin

将所有更改应用到当前运行的系统。

Apply all the changes to the current running system.

$ source ~/.bashrc

Step 5

使用以下命令配置Java备用：

Use the following commands to configure Java alternatives −

# alternatives --install /usr/bin/java java usr/local/java/bin/java 2

# alternatives --install /usr/bin/javac javac usr/local/java/bin/javac 2

# alternatives --install /usr/bin/jar jar usr/local/java/bin/jar 2

# alternatives --set java usr/local/java/bin/java

# alternatives --set javac usr/local/java/bin/javac

# alternatives --set jar usr/local/java/bin/jar

现在使用命令 java -version 从终端验证安装。

Now verify the installation using the command java -version from the terminal.

Verifying Hadoop Installation

在安装MapReduce之前，必须在您的系统上安装Hadoop。让我们使用以下命令验证Hadoop安装：

Hadoop must be installed on your system before installing MapReduce. Let us verify the Hadoop installation using the following command −

$ hadoop version

如果已在您的系统上安装 Hadoop，则会收到以下回复：

If Hadoop is already installed on your system, then you will get the following response −

Hadoop 2.4.1
--
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4

如果您的系统上尚未安装Hadoop，请继续执行以下步骤。

If Hadoop is not installed on your system, then proceed with the following steps.

Downloading Hadoop

从Apache软件基金会下载Hadoop 2.4.1，并使用以下命令解压其内容。

Download Hadoop 2.4.1 from Apache Software Foundation and extract its contents using the following commands.

$ su
password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/
hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mv hadoop-2.4.1/* to hadoop/
# exit

Installing Hadoop in Pseudo Distributed mode

以下步骤用于在伪分布模式下安装 Hadoop 2.4.1。

The following steps are used to install Hadoop 2.4.1 in pseudo distributed mode.

Step 1 − Setting up Hadoop

您可以通过将以下命令附加到~/.bashrc文件来设置Hadoop环境变量。

You can set Hadoop environment variables by appending the following commands to ~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

将所有更改应用到当前运行的系统。

Apply all the changes to the current running system.

$ source ~/.bashrc

Step 2 − Hadoop Configuration

您可以在位置 “$HADOOP_HOME/etc/hadoop” 中找到所有 Hadoop 配置文件。根据您的 Hadoop 基础架构，您需要在这些配置文件中进行适当的更改。

You can find all the Hadoop configuration files in the location “$HADOOP_HOME/etc/hadoop”. You need to make suitable changes in those configuration files according to your Hadoop infrastructure.

$ cd $HADOOP_HOME/etc/hadoop

为了使用 Java 开发 Hadoop 程序，您必须替换 hadoop-env.sh 文件中的 JAVA_HOME 值为系统中 Java 的位置，以重置 Java 环境变量。

In order to develop Hadoop programs using Java, you have to reset the Java environment variables in hadoop-env.sh file by replacing JAVA_HOME value with the location of Java in your system.

export JAVA_HOME=/usr/local/java

您必须编辑以下文件以配置 Hadoop -

You have to edit the following files to configure Hadoop −

core-site.xml
hdfs-site.xml
yarn-site.xml
mapred-site.xml

core-site.xml

core-site.xml 包含以下信息 -

core-site.xml contains the following information−

Port number used for Hadoop instance
Memory allocated for the file system
Memory limit for storing the data
Size of Read/Write buffers

打开 core-site.xml，并在 <configuration> 和 </configuration> 标记之间添加以下属性。

Open the core-site.xml and add the following properties in between the <configuration> and </configuration> tags.

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000 </value>
   </property>
</configuration>

hdfs-site.xml

hdfs-site.xml 包含以下信息 -

hdfs-site.xml contains the following information −

Value of replication data
The namenode path
The datanode path of your local file systems (the place where you want to store the Hadoop infra)

让我们假设以下数据。

Let us assume the following data.

dfs.replication (data replication value) = 1

(In the following path /hadoop/ is the user name.
hadoopinfra/hdfs/namenode is the directory created by hdfs file system.)
namenode path = //home/hadoop/hadoopinfra/hdfs/namenode

(hadoopinfra/hdfs/datanode is the directory created by hdfs file system.)
datanode path = //home/hadoop/hadoopinfra/hdfs/datanode

打开该文件，并在 <configuration>、</configuration> 标记之间添加以下属性。

Open this file and add the following properties in between the <configuration>, </configuration> tags.

<configuration>

   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>

   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>

   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode </value>
   </property>

</configuration>

Note − 在上述文件中，所有属性值都是用户定义的，您可以根据 Hadoop 基础架构进行更改。

Note − In the above file, all the property values are user-defined and you can make changes according to your Hadoop infrastructure.

yarn-site.xml

此文件用于将纱线配置到 Hadoop 中。打开 yarn-site.xml 文件并在 <configuration>，</configuration> 标记之间添加以下属性。

This file is used to configure yarn into Hadoop. Open the yarn-site.xml file and add the following properties in between the <configuration>, </configuration> tags.

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>

mapred-site.xml

此文件用于指定我们正在使用的 MapReduce 框架。默认情况下，Hadoop 包含 yarn-site.xml 的模板。首先，您需要使用以下命令将文件从 mapred-site.xml.template 复制到 mapred-site.xml 文件。

This file is used to specify the MapReduce framework we are using. By default, Hadoop contains a template of yarn-site.xml. First of all, you need to copy the file from mapred-site.xml.template to mapred-site.xml file using the following command.

$ cp mapred-site.xml.template mapred-site.xml

打开 mapred-site.xml 文件并在 <configuration>，</configuration> 标记中添加以下属性。

Open mapred-site.xml file and add the following properties in between the <configuration>, </configuration> tags.

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>

Verifying Hadoop Installation

以下步骤用于验证 Hadoop 安装。

The following steps are used to verify the Hadoop installation.

Step 1 − Name Node Setup

使用 “hdfs namenode -format” 命令设置 name 节点，如下所示：

Set up the namenode using the command “hdfs namenode -format” as follows −

$ cd ~
$ hdfs namenode -format

预期结果如下：

The expected result is as follows −

10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.4.1
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory
/home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to
retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:

/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
************************************************************/

Step 2 − Verifying Hadoop dfs

执行以下命令以启动您的 Hadoop 文件系统。

Execute the following command to start your Hadoop file system.

$ start-dfs.sh

预期输出如下所示 −

The expected output is as follows −

10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-
2.4.1/logs/hadoop-hadoop-namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop-
2.4.1/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]

Step 3 − Verifying Yarn Script

以下命令用于启动 Yarn 脚本。执行此命令将启动您的 Yarn 守护程序。

The following command is used to start the yarn script. Executing this command will start your yarn daemons.

$ start-yarn.sh

预期输出如下所示 −

The expected output is as follows −

starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-
2.4.1/logs/yarn-hadoop-resourcemanager-localhost.out
localhost: starting node manager, logging to /home/hadoop/hadoop-
2.4.1/logs/yarn-hadoop-nodemanager-localhost.out

Step 4 − Accessing Hadoop on Browser

访问 Hadoop 的默认端口号是 50070。使用以下 URL 在您的浏览器上获取 Hadoop 服务。

The default port number to access Hadoop is 50070. Use the following URL to get Hadoop services on your browser.

http://localhost:50070/

以下屏幕截图显示了 Hadoop 浏览器。

The following screenshot shows the Hadoop browser.

Step 5 − Verify all Applications of a Cluster

访问群集中所有应用程序的默认端口号为 8088。使用以下 URL 使用此服务。

The default port number to access all the applications of a cluster is 8088. Use the following URL to use this service.

http://localhost:8088/

以下屏幕截图显示了 Hadoop 群集浏览器。

The following screenshot shows a Hadoop cluster browser.

MapReduce - API

在本课程中，我们将仔细研究涉及 MapReduce 编程操作的类及其方法。我们将主要关注以下内容：

In this chapter, we will take a close look at the classes and their methods that are involved in the operations of MapReduce programming. We will primarily keep our focus on the following −

JobContext Interface
Job Class
Mapper Class
Reducer Class

JobContext Interface

JobContext 接口是所有类的超级接口，定义 MapReduce 中的不同作业。它为您提供了在任务运行时提供给任务的作业的可读视图。

The JobContext interface is the super interface for all the classes, which defines different jobs in MapReduce. It gives you a read-only view of the job that is provided to the tasks while they are running.

以下是 JobContext 接口的子接口。

The following are the sub-interfaces of JobContext interface.

S.No.

Subinterface Description

*MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT>*Defines the context that is given to the Mapper.

*ReduceContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT>*Defines the context that is passed to the Reducer.

Job 类是实现 JobContext 接口的主类。

Job class is the main class that implements the JobContext interface.

Job Class

Job 类是 MapReduce API 中最重要类。它允许用户配置、提交和控制作业执行，以及查询状态。set 方法只在作业提交前有效，随后它们会抛出 IllegalStateException。

The Job class is the most important class in the MapReduce API. It allows the user to configure the job, submit it, control its execution, and query the state. The set methods only work until the job is submitted, afterwards they will throw an IllegalStateException.

通常，用户会创建应用程序，描述作业的各个方面，然后提交作业并监控其进度。

Normally, the user creates the application, describes the various facets of the job, and then submits the job and monitors its progress.

以下是提交作业的示例：

Here is an example of how to submit a job −

// Create a new Job
Job job = new Job(new Configuration());
job.setJarByClass(MyJob.class);

// Specify various job-specific parameters
job.setJobName("myjob");
job.setInputPath(new Path("in"));
job.setOutputPath(new Path("out"));

job.setMapperClass(MyJob.MyMapper.class);
job.setReducerClass(MyJob.MyReducer.class);

// Submit the job, then poll for progress until the job is complete
job.waitForCompletion(true);

Constructors

以下是 Job 类的构造函数摘要。

Following are the constructor summary of Job class.

S.No

Constructor Summary

Job()

Job(Configuration conf)

Job(Configuration conf, String jobName)

Methods

以下是 Job 类中一些重要的函数：

Some of the important methods of Job class are as follows −

S.No	Method Description
1	getJobName()User-specified job name.
2	getJobState()Returns the current state of the Job.
3	isComplete()Checks if the job is finished or not.
4	setInputFormatClass()Sets the InputFormat for the job.
5	setJobName(String name)Sets the user-specified job name.
6	setOutputFormatClass()Sets the Output Format for the job.
7	setMapperClass(Class)Sets the Mapper for the job.
8	setReducerClass(Class)Sets the Reducer for the job.
9	setPartitionerClass(Class)Sets the Partitioner for the job.
10	setCombinerClass(Class)Sets the Combiner for the job.

Mapper Class

Mapper 类定义映射作业。将输入键值对映射到一组中间键值对。映射是将输入记录转换为中间记录的单独任务。转换后的中间记录不必与输入记录的类型相同。给定的输入对可以映射到 0 个或任意多个输出对。

The Mapper class defines the Map job. Maps input key-value pairs to a set of intermediate key-value pairs. Maps are the individual tasks that transform the input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.

Method

map 是 Mapper 类中最突出的方法。语法定义如下：

map is the most prominent method of the Mapper class. The syntax is defined below −

map(KEYIN key, VALUEIN value, org.apache.hadoop.mapreduce.Mapper.Context context)

对于输入分片中的每个键值对，此方法会被调用一次。

This method is called once for each key-value pair in the input split.

Reducer Class

Reducer 类在 MapReduce 中定义了 Reduce 作业。它将一组共享键的中间值减少到更小的一组值。Reducer 实现可以通过 JobContext.getConfiguration() 方法访问作业的配置。Reducer 有三个主要阶段−Shuffle、Sort 和 Reduce。

The Reducer class defines the Reduce job in MapReduce. It reduces a set of intermediate values that share a key to a smaller set of values. Reducer implementations can access the Configuration for a job via the JobContext.getConfiguration() method. A Reducer has three primary phases − Shuffle, Sort, and Reduce.

Shuffle − The Reducer copies the sorted output from each Mapper using HTTP across the network.
Sort − The framework merge-sorts the Reducer inputs by keys (since different Mappers may have output the same key). The shuffle and sort phases occur simultaneously, i.e., while outputs are being fetched, they are merged.
Reduce − In this phase the reduce (Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs.

Method

reduce 是 Reducer 类的最主要方法。其语法如下所示−

reduce is the most prominent method of the Reducer class. The syntax is defined below −

reduce(KEYIN key, Iterable<VALUEIN> values, org.apache.hadoop.mapreduce.Reducer.Context context)

此方法在键值对集合上的每个键上调用一次。

This method is called once for each key on the collection of key-value pairs.

MapReduce - Hadoop Implementation

MapReduce 是一个框架，用于编写应用程序，以便在大量商用硬件集群上可靠地处理海量数据。本章将带您了解在 Hadoop 框架中使用 Java 进行 MapReduce 操作。

MapReduce is a framework that is used for writing applications to process huge volumes of data on large clusters of commodity hardware in a reliable manner. This chapter takes you through the operation of MapReduce in Hadoop framework using Java.

MapReduce Algorithm

MapReduce 范式通常基于将 map-reduce 程序发送到实际数据所在的计算机。

Generally MapReduce paradigm is based on sending map-reduce programs to computers where the actual data resides.

During a MapReduce job, Hadoop sends Map and Reduce tasks to appropriate servers in the cluster.
The framework manages all the details of data-passing like issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
Most of the computing takes place on the nodes with data on local disks that reduces the network traffic.
After completing a given task, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.

Inputs and Outputs (Java Perspective)

MapReduce 框架对键值对进行操作，即该框架将作业的输入视为一组键值对，并将一组键值对作为作业的输出生成，想象中不同类型。

The MapReduce framework operates on key-value pairs, that is, the framework views the input to the job as a set of key-value pairs and produces a set of key-value pair as the output of the job, conceivably of different types.

键和值类必须是可序列化框架，因此需要实现 Writable 接口。此外，键类必须实现 WritableComparable 接口，以利于框架进行排序。

The key and value classes have to be serializable by the framework and hence, it is required to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.

MapReduce 作业的输入和输出格式都是键值对形式−

Both the input and output format of a MapReduce job are in the form of key-value pairs −

(Input) <k1, v1> → map → <k2, v2>→ reduce → <k3, v3> (Output)。

(Input) <k1, v1> → map → <k2, v2>→ reduce → <k3, v3> (Output).

Input

Output

Map

<k1, v1>

list (<k2, v2>)

Reduce

<k2, list(v2)>

list (<k3, v3>)

MapReduce Implementation

下表显示了有关组织用电量的资料。该表包括连续五年的每月用电量和年平均值。

The following table shows the data regarding the electrical consumption of an organization. The table includes the monthly electrical consumption and the annual average for five consecutive years.

	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec	Avg
1979	23	23	2	43	24	25	26	26	26	26	25	26	25
1980	26	27	28	28	28	30	31	31	31	30	30	30	29
1981	31	32	32	32	33	34	35	36	36	34	34	34	34
1984	39	38	39	39	39	41	42	43	40	39	38	38	40
1985	38	39	39	39	39	41	41	41	00	40	39	39	45

我们需要编写应用程序，处理给定表中的输入数据，找到用电量最大的年份、用电量最小的年份，等等。对于记录数量有限的程序员来说，这项任务很简单，因为他们只需编写逻辑来生成所需的输出，并将数据传递到编写的应用程序。

We need to write applications to process the input data in the given table to find the year of maximum usage, the year of minimum usage, and so on. This task is easy for programmers with finite amount of records, as they will simply write the logic to produce the required output, and pass the data to the written application.

现在我们来增加输入数据的规模。假设我们必须分析一个特定州的所有大型产业的用电量。当我们编写应用程序来处理这种大量数据时，

Let us now raise the scale of the input data. Assume we have to analyze the electrical consumption of all the large-scale industries of a particular state. When we write applications to process such bulk data,

They will take a lot of time to execute.
There will be heavy network traffic when we move data from the source to the network server.

为了解决这些问题，我们有MapReduce框架。

To solve these problems, we have the MapReduce framework.

Input Data

以上数据存储在 sample.txt 中，并作为输入提供。输入文件如下所示。

The above data is saved as sample.txt and given as input. The input file looks as shown below.

S.No	Method Description
1	getJobName()User-specified job name.
2	getJobState()Returns the current state of the Job.
3	isComplete()Checks if the job is finished or not.
4	setInputFormatClass()Sets the InputFormat for the job.
5	setJobName(String name)Sets the user-specified job name.
6	setOutputFormatClass()Sets the Output Format for the job.
7	setMapperClass(Class)Sets the Mapper for the job.
8	setReducerClass(Class)Sets the Reducer for the job.
9	setPartitionerClass(Class)Sets the Partitioner for the job.
10	setCombinerClass(Class)Sets the Combiner for the job.