MapReduce Quick Guide

MapReduce - Algorithm

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

  1. The map task is done by means of the Mapper class.

  2. The reduce task is done by means of the Reducer class.

The Mapper class takes the input, tokenizes it, maps it, and sorts it. The output of the Mapper class is used as input by the Reducer class, which in turn searches for matching pairs and reduces them.

[Figure: Mapper and Reducer class workflow]
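This Map/Reduce flow can be sketched in plain Python. The `mapper` and `reducer` functions below are illustrative stand-ins for Hadoop's Mapper and Reducer classes, not the Hadoop API, applied to a simple word count:

```python
from itertools import groupby
from operator import itemgetter

def mapper(text):
    # Tokenize the input and emit (key, value) pairs, one per word.
    return [(word, 1) for word in text.split()]

def reducer(pairs):
    # Sort the pairs by key, then group matching keys and reduce them.
    pairs = sorted(pairs, key=itemgetter(0))
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

result = reducer(mapper("it is what it is"))
# result == {'is': 2, 'it': 2, 'what': 1}
```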

MapReduce implements various mathematical algorithms to divide a task into small parts and assign them to multiple systems. In technical terms, the MapReduce algorithm helps in sending the Map and Reduce tasks to appropriate servers in a cluster.

These mathematical algorithms may include the following −

  1. Sorting

  2. Searching

  3. Indexing

  4. TF-IDF

Sorting

Sorting is one of the basic MapReduce algorithms used to process and analyze data. MapReduce implements a sorting algorithm to automatically sort the output key-value pairs from the mapper by their keys.

  1. Sorting methods are implemented in the mapper class itself.

  2. In the Shuffle and Sort phase, after tokenizing the values in the mapper class, the Context class (a framework-provided class) collects the values with matching keys as a collection.

  3. To collect similar key-value pairs (intermediate keys), the Mapper class takes the help of RawComparator class to sort the key-value pairs.

  4. The set of intermediate key-value pairs for a given Reducer is automatically sorted by Hadoop to form key-values (K2, {V2, V2, …}) before they are presented to the Reducer.
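The grouping described in the last step can be sketched as follows. This is a plain-Python approximation of Hadoop's shuffle-and-sort behavior, not Hadoop code:

```python
from collections import defaultdict

def shuffle_and_sort(intermediate_pairs):
    # Group the intermediate (key, value) pairs by key ...
    grouped = defaultdict(list)
    for key, value in intermediate_pairs:
        grouped[key].append(value)
    # ... and present (K2, [V2, V2, ...]) groups to the Reducer
    # in sorted key order.
    return [(key, grouped[key]) for key in sorted(grouped)]

pairs = [("b", 2), ("a", 1), ("b", 3), ("a", 4)]
# shuffle_and_sort(pairs) == [("a", [1, 4]), ("b", [2, 3])]
```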

Searching

Searching plays an important role in the MapReduce algorithm. It helps in the combiner phase (optional) and in the Reducer phase. Let us try to understand how searching works with the help of an example.

Example

The following example shows how MapReduce employs the searching algorithm to find out the details of the employee who draws the highest salary in a given employee dataset.

  1. Let us assume we have employee data in four different files − A, B, C, and D. Let us also assume there are duplicate employee records in all four files because of importing the employee data from all database tables repeatedly. See the following illustration.

[Figure: duplicate employee records across files A, B, C, and D]

  2. The Map phase processes each input file and provides the employee data in key-value pairs (<k, v> : <emp name, salary>). See the following illustration.

[Figure: key-value pairs emitted by the Map phase]

  3. The combiner phase (searching technique) will accept the input from the Map phase as key-value pairs with employee name and salary. Using the searching technique, the combiner will check all the employee salaries to find the highest-salaried employee in each file. See the following snippet.

<k: employee name, v: salary>
Max = the salary of the first employee   // treated as the max salary so far

if (v(next employee).salary > Max) {
   Max = v(next employee).salary;
}
else {
   continue checking;
}

The expected result is as follows −

<satish, 26000>

<gopal, 50000>

<kiran, 45000>

<manisha, 45000>

  4. Reducer phase − From each file, you will find the highest-salaried employee. To avoid redundancy, check all the <k, v> pairs and eliminate duplicate entries, if any. The same algorithm is used between the four <k, v> pairs, which come from the four input files. The final output should be as follows −

<gopal, 50000>
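A plain-Python sketch of this search, with hypothetical file contents standing in for the four input files (the names and salaries are the sample data from the illustration, not real records):

```python
def combiner(records):
    # records: (employee_name, salary) pairs from one file;
    # search for the highest-salaried employee in that file.
    return max(records, key=lambda kv: kv[1])

def reducer(per_file_maxima):
    # Eliminate duplicate entries, then search for the overall maximum.
    unique = set(per_file_maxima)
    return max(unique, key=lambda kv: kv[1])

files = [
    [("satish", 26000), ("gopal", 50000)],    # file A
    [("gopal", 50000), ("kiran", 45000)],     # file B
    [("kiran", 45000), ("manisha", 45000)],   # file C
    [("manisha", 45000), ("satish", 26000)],  # file D
]
best = reducer([combiner(f) for f in files])
# best == ("gopal", 50000)
```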

Indexing

Normally, indexing is used to point to particular data and its address. It performs batch indexing on the input files for a particular Mapper.

The indexing technique that is normally used in MapReduce is known as an inverted index. Search engines like Google and Bing use the inverted indexing technique. Let us try to understand how indexing works with the help of a simple example.

Example

The following text is the input for inverted indexing. Here T[0], T[1], and T[2] are the file names and their contents are in double quotes.

T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"

After applying the Indexing algorithm, we get the following output −

"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}

此处 "a": {2} 表示术语 "a" 出现在 T[2] 文件中。类似地,"is": {0, 1, 2} 表示术语 "is" 出现在文件 T[0]、T[1] 和 T[2] 中。

Here "a": {2} implies the term "a" appears in the T[2] file. Similarly, "is": {0, 1, 2} implies the term "is" appears in the files T[0], T[1], and T[2].

TF-IDF

TF-IDF is a text processing algorithm which is short for Term Frequency − Inverse Document Frequency. It is one of the common web analysis algorithms. Here, the term 'frequency' refers to the number of times a term appears in a document.

Term Frequency (TF)

It measures how frequently a particular term occurs in a document. It is calculated by the number of times a word appears in a document divided by the total number of words in that document.

TF(the) = (Number of times the term ‘the’ appears in a document) / (Total number of terms in the document)

Inverse Document Frequency (IDF)

It measures the importance of a term. It is calculated by the number of documents in the text database divided by the number of documents where a specific term appears.

While computing TF, all the terms are considered equally important. That means TF counts the term frequency even for common words like “is”, “a”, “what”, etc. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following −

IDF(the) = log(Total number of documents / Number of documents with the term ‘the’ in it).

The algorithm is explained below with the help of a small example.

Example

Consider a document containing 1000 words, wherein the word hive appears 50 times. The TF for hive is then (50 / 1000) = 0.05.

Now, assume we have 10 million documents and the word hive appears in 1000 of these. Then, the IDF is calculated as log(10,000,000 / 1,000) = 4.

The TF-IDF weight is the product of these quantities − 0.05 × 4 = 0.20.
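The worked example can be checked in a few lines of Python. Note that the logarithm here is taken base 10, matching the arithmetic in the example:

```python
import math

def tf(term_count, total_terms):
    # Term frequency: occurrences of the term / total terms in the document.
    return term_count / total_terms

def idf(total_docs, docs_with_term):
    # Inverse document frequency, using a base-10 logarithm as in the
    # example: log10(10,000,000 / 1,000) = 4.
    return math.log10(total_docs / docs_with_term)

tf_hive = tf(50, 1000)            # 0.05
idf_hive = idf(10_000_000, 1000)  # 4.0
weight = tf_hive * idf_hive       # 0.20
```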