MapReduce Tutorial

MapReduce - Partitioner

A partitioner works like a condition in processing an input dataset. The partition phase takes place after the Map phase and before the Reduce phase.

The number of partitions is equal to the number of reducers. That means the partitioner divides the map output into as many partitions as there are reducers, and the data in a single partition is processed by a single reducer.

Partitioner

A partitioner partitions the key-value pairs of the intermediate Map outputs. It partitions the data using a user-defined condition, which works like a hash function. The total number of partitions is the same as the number of Reducer tasks for the job. Let us take an example to understand how the partitioner works.
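To make the hash-function analogy concrete, here is a minimal sketch in plain Python (not the Hadoop Java API) of hash-style partitioning, in the spirit of Hadoop's default hash partitioner. The function name and the sample key-value pairs are illustrative only.

```python
# Sketch of hash-style partitioning: partition = hash(key) % numReduceTasks.
# All pairs that share a key land in the same partition, so a single
# reducer sees every value for that key.

def get_partition(key, num_reduce_tasks):
    """Route a key to one of num_reduce_tasks partitions."""
    # Mask to keep the value non-negative before taking the modulo,
    # analogous to (key.hashCode() & Integer.MAX_VALUE) % n in Java.
    return (hash(key) & 0x7FFFFFFF) % num_reduce_tasks

# Hypothetical intermediate Map output: (gender, salary) pairs.
pairs = [("Male", 50000), ("Female", 50000), ("Male", 30000)]
num_reducers = 2

partitions = {i: [] for i in range(num_reducers)}
for key, value in pairs:
    partitions[get_partition(key, num_reducers)].append((key, value))
```

Because the partition index depends only on the key, both "Male" pairs above end up in the same partition and are handed to the same reducer.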

MapReduce Partitioner Implementation

For the sake of convenience, let us assume we have a small table called Employee with the following data. We will use this sample data as our input dataset to demonstrate how the partitioner works.

Id     Name       Age   Gender   Salary
1201   gopal      45    Male     50,000
1202   manisha    40    Female   50,000
1203   khalil     34    Male     30,000
1204   prasanth   30    Male     30,000
1205   kiran      20    Male     40,000
1206   laxmi      25    Female   35,000
1207   bhavya     20    Female   15,000
1208   reshma     19    Female   15,000
1209   kranthi    22    Male     22,000
1210   Satish     24    Male     25,000
1211   Krishna    25    Male     25,000
1212   Arshad     28    Male     20,000
1213   lavanya    18    Female   8,000

We have to write an application to process the input dataset and find the highest-salaried employee by gender in different age groups (for example, 20 and below, between 21 and 30, above 30).
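The flow of this job can be sketched in plain Python (standing in for the Hadoop Java classes): a map step that emits gender as the key, a user-defined partition condition on age, and a reduce step that keeps the highest salary in each partition. The age boundaries follow the text; the salaries are taken from the table above as plain integers.

```python
# (id, name, age, gender, salary) rows from the Employee table.
employees = [
    (1201, "gopal", 45, "Male", 50000),
    (1202, "manisha", 40, "Female", 50000),
    (1203, "khalil", 34, "Male", 30000),
    (1204, "prasanth", 30, "Male", 30000),
    (1205, "kiran", 20, "Male", 40000),
    (1206, "laxmi", 25, "Female", 35000),
    (1207, "bhavya", 20, "Female", 15000),
    (1208, "reshma", 19, "Female", 15000),
    (1209, "kranthi", 22, "Male", 22000),
    (1210, "Satish", 24, "Male", 25000),
    (1211, "Krishna", 25, "Male", 25000),
    (1212, "Arshad", 28, "Male", 20000),
    (1213, "lavanya", 18, "Female", 8000),
]

def get_partition(age):
    """The user-defined condition: route each record by age group."""
    if age <= 20:
        return 0      # partition 0 -> reducer 0: age 20 and below
    if age <= 30:
        return 1      # partition 1 -> reducer 1: age 21 to 30
    return 2          # partition 2 -> reducer 2: age above 30

# "Map" phase: emit (gender, salary), partitioned by the age condition.
partitions = {0: [], 1: [], 2: []}
for _id, _name, age, gender, salary in employees:
    partitions[get_partition(age)].append((gender, salary))

# "Reduce" phase: each reducer finds the top salary per gender
# within its own partition.
result = {}
for p, pairs in partitions.items():
    for gender, salary in pairs:
        result[(p, gender)] = max(result.get((p, gender), 0), salary)
```

With three reducers configured, each age group is handled independently; for instance, reducer 2 (age above 30) reports 50,000 for both genders (gopal and manisha).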

Input Data

The above data is saved as input.txt in the “/home/hadoop/hadoopPartitioner” directory and given as input.
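A small sketch of how the map step might split one line of input.txt into fields. It assumes tab-separated values in the Id/Name/Age/Gender/Salary order of the table above; adjust the delimiter if your file uses a different one.

```python
def parse_line(line):
    """Split one tab-separated input line into typed fields (assumed format)."""
    emp_id, name, age, gender, salary = line.rstrip("\n").split("\t")
    return int(emp_id), name, int(age), gender, int(salary)

# Hypothetical first line of input.txt.
record = parse_line("1201\tgopal\t45\tMale\t50000")
```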