MapReduce - Hadoop Implementation
MapReduce is a framework that is used for writing applications to process huge volumes of data on large clusters of commodity hardware in a reliable manner. This chapter takes you through the operation of MapReduce in the Hadoop framework using Java.
MapReduce Algorithm
Generally, the MapReduce paradigm is based on sending map-reduce programs to the computers where the actual data resides.
- During a MapReduce job, Hadoop sends Map and Reduce tasks to the appropriate servers in the cluster.
- The framework manages all the details of data passing, such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
- Most of the computing takes place on nodes with the data on local disks, which reduces network traffic.
- After a given task is completed, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.

Inputs and Outputs (Java Perspective)
The MapReduce framework operates on key-value pairs; that is, the framework views the input to the job as a set of key-value pairs and produces a set of key-value pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
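Hadoop's Writable contract boils down to two methods, write(DataOutput) and readFields(DataInput), and WritableComparable additionally extends Comparable so the framework can sort keys. The round trip can be sketched in plain Java without the Hadoop dependency; YearKey is a hypothetical key class used only for illustration, not part of the Hadoop API:

```java
import java.io.*;

// Sketch of the Writable contract: a key serializes itself to a DataOutput
// and restores itself from a DataInput. Hadoop's Writable interface declares
// exactly these two methods; WritableComparable additionally requires
// compareTo so the framework can sort keys during the shuffle.
class YearKey implements Comparable<YearKey> {
    int year;

    YearKey() {}                       // Hadoop instantiates keys reflectively,
    YearKey(int year) { this.year = year; } // so a no-arg constructor is needed

    // Corresponds to Writable.write(DataOutput)
    void write(DataOutput out) throws IOException {
        out.writeInt(year);
    }

    // Corresponds to Writable.readFields(DataInput)
    void readFields(DataInput in) throws IOException {
        year = in.readInt();
    }

    // Required by WritableComparable: defines the sort order of keys
    @Override
    public int compareTo(YearKey other) {
        return Integer.compare(year, other.year);
    }
}

public class WritableSketch {
    public static void main(String[] args) throws IOException {
        // Serialize a key, then deserialize into a fresh instance
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new YearKey(1984).write(new DataOutputStream(buf));

        YearKey restored = new YearKey();
        restored.readFields(new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray())));

        System.out.println(restored.year);                            // 1984
        System.out.println(new YearKey(1979).compareTo(restored) < 0); // true
    }
}
```

In real Hadoop code the class would declare `implements WritableComparable<YearKey>` from `org.apache.hadoop.io`; the method bodies stay the same.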
Both the input and output format of a MapReduce job are in the form of key-value pairs −

(Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3> (Output)
|        | Input          | Output          |
| ------ | -------------- | --------------- |
| Map    | <k1, v1>       | list (<k2, v2>) |
| Reduce | <k2, list(v2)> | list (<k3, v3>) |
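The data flow in the table above can be made concrete with a small, framework-free simulation: plain Java that applies a map step (each <k1, v1> yields <k2, v2> pairs), groups values by key the way the shuffle does, and reduces each <k2, list(v2)> group to one <k3, v3>. This only illustrates the flow; it is not Hadoop API code:

```java
import java.util.*;

public class KeyValueFlow {
    // Word count as (Input) <k1,v1> -> map -> <k2,v2> -> reduce -> <k3,v3> (Output)
    static Map<String, Integer> wordCount(List<String> lines) {
        // map: each <k1 (line offset), v1 (line text)> produces <k2 (word), v2 (1)> pairs
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split(" "))
                mapped.add(Map.entry(word, 1));

        // shuffle: group values by key, giving <k2, list(v2)>; a TreeMap keeps
        // keys sorted, mirroring the sorting the framework does via WritableComparable
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapped)
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());

        // reduce: each <k2, list(v2)> collapses into one <k3, v3>
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            out.put(e.getKey(), sum);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("hadoop map reduce", "map reduce map")));
        // prints {hadoop=1, map=3, reduce=2}
    }
}
```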
MapReduce Implementation
The following table shows the data regarding the electrical consumption of an organization. The table includes the monthly electrical consumption and the annual average for five consecutive years.
| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Avg |
| ---- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1979 | 23  | 23  | 2   | 43  | 24  | 25  | 26  | 26  | 26  | 26  | 25  | 26  | 25  |
| 1980 | 26  | 27  | 28  | 28  | 28  | 30  | 31  | 31  | 31  | 30  | 30  | 30  | 29  |
| 1981 | 31  | 32  | 32  | 32  | 33  | 34  | 35  | 36  | 36  | 34  | 34  | 34  | 34  |
| 1984 | 39  | 38  | 39  | 39  | 39  | 41  | 42  | 43  | 40  | 39  | 38  | 38  | 40  |
| 1985 | 38  | 39  | 39  | 39  | 39  | 41  | 41  | 41  | 00  | 40  | 39  | 39  | 45  |
We need to write applications to process the input data in the given table to find the year of maximum usage, the year of minimum usage, and so on. This task is easy for programmers with a finite number of records, as they can simply write the logic to produce the required output and pass the data to the application.
Let us now raise the scale of the input data. Assume we have to analyze the electrical consumption of all the large-scale industries of a particular state. When we write applications to process such bulk data,
- They will take a lot of time to execute.
- There will be heavy network traffic when we move the data from the source to the network server.
To solve these problems, we have the MapReduce framework.
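Before turning to the full Hadoop job, the per-record logic can be sketched in plain Java: a map step that emits <year, peak monthly consumption> for each table row, and a reduce step that keeps only the years whose peak exceeds a threshold. The method names and the threshold of 35 units here are illustrative choices, not part of the Hadoop API:

```java
import java.util.*;

public class MaxUsageSketch {
    // Map step: from one table row, emit <year, highest monthly reading>.
    // The first field is the year and the last is the annual average, so
    // only the twelve monthly values in between are scanned.
    static Map.Entry<String, Integer> map(String line) {
        String[] parts = line.trim().split("\\s+");
        int max = Integer.MIN_VALUE;
        for (int i = 1; i < parts.length - 1; i++)
            max = Math.max(max, Integer.parseInt(parts[i]));
        return Map.entry(parts[0], max);
    }

    // Reduce step: keep only the years whose peak consumption exceeds the threshold.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs, int threshold) {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            if (p.getValue() > threshold) out.put(p.getKey(), p.getValue());
        return out;
    }

    public static void main(String[] args) {
        // Three rows from the consumption table above
        List<String> rows = List.of(
            "1979 23 23 2 43 24 25 26 26 26 26 25 26 25",
            "1980 26 27 28 28 28 30 31 31 31 30 30 30 29",
            "1981 31 32 32 32 33 34 35 36 36 34 34 34 34");

        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String row : rows) mapped.add(map(row));

        System.out.println(reduce(mapped, 35)); // prints {1979=43, 1981=36}
    }
}
```

In a real Hadoop job the same two steps would live in a Mapper and a Reducer subclass, with the framework handling the splitting, shuffling, and collection in between.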