A Concise Hadoop Tutorial

Hadoop - Introduction

Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

Hadoop Architecture

At its core, Hadoop has two major layers namely −

  1. Processing/Computation layer (MapReduce), and

  2. Storage layer (Hadoop Distributed File System).

(Figure: Hadoop architecture − the MapReduce processing layer on top of the HDFS storage layer)

MapReduce

MapReduce is a parallel programming model for writing distributed applications, devised at Google for efficient processing of large amounts of data (multi-terabyte datasets) on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. MapReduce programs run on Hadoop, an Apache open-source framework.
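To make the two phases concrete, here is a minimal word-count sketch in plain Java. It does not use the real Hadoop API (`org.apache.hadoop.mapreduce.Mapper`/`Reducer`); it only illustrates the map and reduce steps the model is named after, with a `TreeMap` standing in for the framework's group-by-key shuffle.

```java
import java.util.*;

// A minimal, self-contained sketch of the map/reduce idea in plain Java.
// This is NOT the Hadoop API; it only illustrates the two phases.
public class WordCountSketch {

    // "Map" phase: emit a (word, 1) pair for every word in an input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // "Reduce" phase: sum the counts for each key. In real Hadoop the
    // framework groups pairs by key (the shuffle/sort step) before calling
    // the reducer; here merging into a sorted TreeMap stands in for that.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("hadoop stores data", "hadoop processes data");
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) mapped.addAll(map(line)); // map each input split
        System.out.println(reduce(mapped));
        // prints {data=2, hadoop=2, processes=1, stores=1}
    }
}
```

In a real job, each mapper runs on a different node against its local block of the input file, and the framework moves the intermediate pairs between machines; the sequential loop above compresses all of that onto one JVM.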

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system that is designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. It is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications having large datasets.
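A quick back-of-the-envelope sketch shows what HDFS-style block storage means in practice. The 128 MB block size and replication factor 3 used below are the usual HDFS defaults (the `dfs.blocksize` and `dfs.replication` settings), but both are configurable per cluster and per file.

```java
// Back-of-the-envelope sketch: how block splitting and replication
// translate into block counts and raw storage. 128 MB blocks and
// 3x replication are the common HDFS defaults; both are configurable.
public class HdfsBlockMath {

    static long blockCount(long fileBytes, long blockBytes) {
        // Ceiling division: a final partial block still counts as a block
        // (though on disk it only occupies its actual length).
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;      // 128 MB default block size
        int replication = 3;                      // default replication factor
        long fileSize = 1024L * 1024 * 1024;      // a 1 GB file

        long blocks = blockCount(fileSize, blockSize);
        System.out.println(blocks);               // prints 8 (blocks)
        System.out.println(blocks * replication); // prints 24 (replicas cluster-wide)
    }
}
```

Note the large block size relative to ordinary file systems: it keeps per-block metadata on the NameNode small and favors the long sequential reads that high-throughput batch processing needs.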

Apart from the above-mentioned two core components, Hadoop framework also includes the following two modules −

  1. Hadoop Common − These are Java libraries and utilities required by other Hadoop modules.

  2. Hadoop YARN − This is a framework for job scheduling and cluster resource management.

How Does Hadoop Work?

It is quite expensive to build bigger servers with heavy configurations to handle large-scale processing. As an alternative, you can tie together many single-CPU commodity computers as a single functional distributed system; in practice, the clustered machines can read the dataset in parallel and provide much higher throughput. Moreover, it is cheaper than one high-end server. This is the first motivating factor behind Hadoop: it runs across clustered, low-cost machines.

Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs −

  1. Data is initially divided into directories and files. Files are divided into uniformly sized blocks, typically 128 MB (the default since Hadoop 2.x) or 64 MB (the older Hadoop 1.x default).

  2. These files are then distributed across various cluster nodes for further processing.

  3. HDFS, being on top of the local file system, supervises the processing.

  4. Blocks are replicated for handling hardware failure.

  5. Checking that the code was executed successfully.

  6. Performing the sort that takes place between the map and reduce stages.

  7. Sending the sorted data to a certain computer.

  8. Writing the debugging logs for each job.
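Steps 1 through 4 above can be sketched as a toy simulation: split a file into blocks, then place each block's replicas on distinct nodes. Real HDFS placement is rack-aware and far more sophisticated; the node names and the simple round-robin policy here are illustrative assumptions only.

```java
import java.util.*;

// Toy sketch of steps 1-4: split a file into blocks and place each block's
// replicas on distinct nodes, round-robin. Real HDFS placement is
// rack-aware; this policy is an illustrative assumption only.
public class BlockPlacementSketch {

    // Returns, for each block index, the list of nodes holding a replica.
    static List<List<String>> place(int numBlocks, List<String> nodes, int replication) {
        List<List<String>> placement = new ArrayList<>();
        for (int b = 0; b < numBlocks; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                // Distinct nodes per block, provided replication <= nodes.size()
                replicas.add(nodes.get((b + r) % nodes.size()));
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node1", "node2", "node3", "node4");
        System.out.println(place(3, nodes, 3));
        // prints [[node1, node2, node3], [node2, node3, node4], [node3, node4, node1]]
    }
}
```

Because every block lives on several distinct machines, losing any one node (step 4's "hardware failure") never loses data, and the scheduler can run a map task on whichever replica's node is least busy.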

Advantages of Hadoop

  1. The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines, in turn utilizing the underlying parallelism of the CPU cores.

  2. Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA), rather Hadoop library itself has been designed to detect and handle failures at the application layer.

  3. Servers can be added or removed from the cluster dynamically and Hadoop continues to operate without interruption.

  4. Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms, since it is Java based.