MapReduce Tutorial

MapReduce - API

In this chapter, we will take a close look at the classes and their methods that are involved in the operations of MapReduce programming. We will primarily keep our focus on the following −

  1. JobContext Interface

  2. Job Class

  3. Mapper Class

  4. Reducer Class

JobContext Interface

The JobContext interface is the super-interface for all the classes that define different jobs in MapReduce. It gives you a read-only view of the job that is provided to the tasks while they are running.

The following are the sub-interfaces of the JobContext interface.

  1. MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> − Defines the context that is given to the Mapper.

  2. ReduceContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> − Defines the context that is passed to the Reducer.

The Job class is the main class that implements the JobContext interface.

Job Class

The Job class is the most important class in the MapReduce API. It allows the user to configure the job, submit it, control its execution, and query its state. The set methods work only until the job is submitted; afterwards they throw an IllegalStateException.

Normally, the user creates the application, describes the various facets of the job, and then submits the job and monitors its progress.

Here is an example of how to submit a job −

// Create a new Job
Job job = new Job(new Configuration());
job.setJarByClass(MyJob.class);

// Specify various job-specific parameters
job.setJobName("myjob");
// Input and output paths are set through the format classes,
// not on the Job object itself
FileInputFormat.setInputPaths(job, new Path("in"));
FileOutputFormat.setOutputPath(job, new Path("out"));

job.setMapperClass(MyJob.MyMapper.class);
job.setReducerClass(MyJob.MyReducer.class);

// Submit the job, then poll for progress until the job is complete
job.waitForCompletion(true);

Constructors

The following is a summary of the constructors of the Job class.

  1. Job()

  2. Job(Configuration conf)

  3. Job(Configuration conf, String jobName)

Methods

Some of the important methods of Job class are as follows −

  1. getJobName() − Returns the user-specified job name.

  2. getJobState() − Returns the current state of the Job.

  3. isComplete() − Checks if the job is finished or not.

  4. setInputFormatClass() − Sets the InputFormat for the job.

  5. setJobName(String name) − Sets the user-specified job name.

  6. setOutputFormatClass() − Sets the OutputFormat for the job.

  7. setMapperClass(Class) − Sets the Mapper for the job.

  8. setReducerClass(Class) − Sets the Reducer for the job.

  9. setPartitionerClass(Class) − Sets the Partitioner for the job.

  10. setCombinerClass(Class) − Sets the Combiner for the job.

Mapper Class

The Mapper class defines the Map job. It maps input key-value pairs to a set of intermediate key-value pairs. Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.

Method

map is the most prominent method of the Mapper class. The syntax is defined below −

map(KEYIN key, VALUEIN value, org.apache.hadoop.mapreduce.Mapper.Context context)

This method is called once for each key-value pair in the input split.
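To see what a map() implementation typically does, here is a hedged, cluster-free sketch of the classic word-count map logic in plain Java. The class name WordCountMapSketch is hypothetical and not part of the Hadoop API; where a real Mapper would emit each pair via context.write(key, value), this sketch collects the pairs in a list so the logic can run without Hadoop.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the map() logic of a word-count Mapper.
// In Hadoop, context.write(word, one) would emit each pair; here the
// pairs are collected in a list instead.
public class WordCountMapSketch {
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<>(token, 1)); // emit (word, 1)
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map("the quick brown the"));
    }
}
```

As in Hadoop, a single input record (one line) maps to zero or many output pairs, and the intermediate type (String, Integer) differs from the input type (a line of text).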

Reducer Class

The Reducer class defines the Reduce job in MapReduce. It reduces a set of intermediate values that share a key to a smaller set of values. Reducer implementations can access the Configuration for a job via the JobContext.getConfiguration() method. A Reducer has three primary phases − Shuffle, Sort, and Reduce.

  1. Shuffle − The Reducer copies the sorted output from each Mapper using HTTP across the network.

  2. Sort − The framework merge-sorts the Reducer inputs by keys (since different Mappers may have output the same key). The shuffle and sort phases occur simultaneously, i.e., while outputs are being fetched, they are merged.

  3. Reduce − In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs.
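The shuffle-and-sort step can be simulated outside Hadoop with a hedged plain-Java sketch. The class name ShuffleSortSketch is hypothetical; it merges the (key, value) outputs of several mappers and groups the values by sorted key, producing the <key, (collection of values)> inputs that the reduce phase consumes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical simulation of shuffle and sort: merge the outputs of
// several mappers and group values by key in sorted key order.
public class ShuffleSortSketch {
    public static SortedMap<String, List<Integer>> shuffleSort(
            List<List<Map.Entry<String, Integer>>> mapperOutputs) {
        // A TreeMap keeps keys sorted, mirroring the framework's
        // merge-sort of the Reducer inputs by key
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapperOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        return grouped;
    }
}
```

Note that when two mappers emit the same key, their values end up in one group, which is exactly why the framework must merge-sort across mapper outputs.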

Method

reduce is the most prominent method of the Reducer class. The syntax is defined below −

reduce(KEYIN key, Iterable<VALUEIN> values, org.apache.hadoop.mapreduce.Reducer.Context context)

This method is called once for each key in the grouped collection of key-value pairs.
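To complement the Mapper sketch, here is a hedged plain-Java sketch of the word-count reduce logic. The class name WordCountReduceSketch is hypothetical and not part of the Hadoop API; where a real Reducer would write the result via context.write(key, result), this sketch simply returns the sum.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the reduce() logic of a word-count Reducer:
// for one key and its grouped values, sum the counts.
public class WordCountReduceSketch {
    public static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v; // accumulate the counts emitted for this key
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 1, 1);
        System.out.println("the\t" + reduce("the", values));
    }
}
```

The framework would invoke this once per key in the sorted inputs, so a smaller set of values (one total per word) comes out of the larger set of intermediate pairs.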