MapReduce Tutorial

MapReduce - Partitioner

A partitioner works like a condition in processing an input dataset. The partition phase takes place after the Map phase and before the Reduce phase.

The number of partitions is equal to the number of reducers. That means the partitioner divides the map output into as many partitions as there are reducers, and the data in a single partition is processed by a single reducer.

Partitioner

A partitioner partitions the key-value pairs of the intermediate Map outputs. It partitions the data using a user-defined condition, which works like a hash function. The total number of partitions is the same as the number of Reducer tasks for the job. Let us take an example to understand how the partitioner works.
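To make the hash-function analogy concrete, here is a minimal sketch in plain Python (not the Hadoop Java API) of hash-style partitioning, in the spirit of Hadoop's default hash partitioner. The function name and the sample key-value pairs are illustrative only.

```python
# Sketch of hash-style partitioning: partition = hash(key) % numReduceTasks.
# All pairs that share a key land in the same partition, so a single
# reducer sees every value for that key.

def get_partition(key, num_reduce_tasks):
    """Route a key to one of num_reduce_tasks partitions."""
    # Mask to keep the value non-negative before taking the modulo,
    # analogous to (key.hashCode() & Integer.MAX_VALUE) % n in Java.
    return (hash(key) & 0x7FFFFFFF) % num_reduce_tasks

# Hypothetical intermediate Map output: (gender, salary) pairs.
pairs = [("Male", 50000), ("Female", 50000), ("Male", 30000)]
num_reducers = 2

partitions = {i: [] for i in range(num_reducers)}
for key, value in pairs:
    partitions[get_partition(key, num_reducers)].append((key, value))
```

Because the partition index depends only on the key, both "Male" pairs above end up in the same partition and are handed to the same reducer.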

MapReduce Partitioner Implementation

For the sake of convenience, let us assume we have a small table called Employee with the following data. We will use this sample data as our input dataset to demonstrate how the partitioner works.

Id     Name       Age   Gender   Salary
1201   gopal      45    Male     50,000
1202   manisha    40    Female   50,000
1203   khalil     34    Male     30,000
1204   prasanth   30    Male     30,000
1205   kiran      20    Male     40,000
1206   laxmi      25    Female   35,000
1207   bhavya     20    Female   15,000
1208   reshma     19    Female   15,000
1209   kranthi    22    Male     22,000
1210   Satish     24    Male     25,000
1211   Krishna    25    Male     25,000
1212   Arshad     28    Male     20,000
1213   lavanya    18    Female   8,000

We have to write an application to process the input dataset and find the highest-salaried employee by gender in different age groups (for example, 20 and below, between 21 and 30, above 30).
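The flow of this job can be sketched in plain Python (standing in for the Hadoop Java classes): a map step that emits gender as the key, a user-defined partition condition on age, and a reduce step that keeps the highest salary in each partition. The age boundaries follow the text; the salaries are taken from the table above as plain integers.

```python
# (id, name, age, gender, salary) rows from the Employee table.
employees = [
    (1201, "gopal", 45, "Male", 50000),
    (1202, "manisha", 40, "Female", 50000),
    (1203, "khalil", 34, "Male", 30000),
    (1204, "prasanth", 30, "Male", 30000),
    (1205, "kiran", 20, "Male", 40000),
    (1206, "laxmi", 25, "Female", 35000),
    (1207, "bhavya", 20, "Female", 15000),
    (1208, "reshma", 19, "Female", 15000),
    (1209, "kranthi", 22, "Male", 22000),
    (1210, "Satish", 24, "Male", 25000),
    (1211, "Krishna", 25, "Male", 25000),
    (1212, "Arshad", 28, "Male", 20000),
    (1213, "lavanya", 18, "Female", 8000),
]

def get_partition(age):
    """The user-defined condition: route each record by age group."""
    if age <= 20:
        return 0      # partition 0 -> reducer 0: age 20 and below
    if age <= 30:
        return 1      # partition 1 -> reducer 1: age 21 to 30
    return 2          # partition 2 -> reducer 2: age above 30

# "Map" phase: emit (gender, salary), partitioned by the age condition.
partitions = {0: [], 1: [], 2: []}
for _id, _name, age, gender, salary in employees:
    partitions[get_partition(age)].append((gender, salary))

# "Reduce" phase: each reducer finds the top salary per gender
# within its own partition.
result = {}
for p, pairs in partitions.items():
    for gender, salary in pairs:
        result[(p, gender)] = max(result.get((p, gender), 0), salary)
```

With three reducers configured, each age group is handled independently; for instance, reducer 2 (age above 30) reports 50,000 for both genders (gopal and manisha).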

Input Data

The above data is saved as input.txt in the “/home/hadoop/hadoopPartitioner” directory and given as input.
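A small sketch of how the map step might split one line of input.txt into fields. It assumes tab-separated values in the Id/Name/Age/Gender/Salary order of the table above; adjust the delimiter if your file uses a different one.

```python
def parse_line(line):
    """Split one tab-separated input line into typed fields (assumed format)."""
    emp_id, name, age, gender, salary = line.rstrip("\n").split("\t")
    return int(emp_id), name, int(age), gender, int(salary)

# Hypothetical first line of input.txt.
record = parse_line("1201\tgopal\t45\tMale\t50000")
```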