Apache Flume Tutorial
Apache Flume - Data Transfer In Hadoop
Big Data, as we know, is a collection of large datasets that cannot be processed using traditional computing techniques. When analyzed, Big Data yields valuable results. Hadoop is an open-source framework that allows you to store and process Big Data in a distributed environment across clusters of computers using simple programming models.
Streaming / Log Data
Generally, most of the data to be analyzed is produced by various data sources such as application servers, social networking sites, cloud servers, and enterprise servers. This data comes in the form of log files and events.
Log file − In general, a log file is a file that lists the events/actions that occur in an operating system. For example, web servers record every request made to the server in their log files.
By collecting such log data, we can −
- analyze the application performance and locate various software and hardware failures.
- understand user behavior and derive better business insights.
The traditional method of transferring data into the HDFS system is to use the put command. Let us see how the put command works.
HDFS put Command
The main challenge in handling log data lies in moving the logs produced by multiple servers into the Hadoop environment.
The Hadoop File System Shell provides commands to insert data into Hadoop and read it back. You can insert data into Hadoop using the put command as shown below.
$ hadoop fs -put <path of the required file> <path in HDFS where to save the file>
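For example, a web-server access log could be copied into an HDFS directory as follows (the local file path and the HDFS directory below are hypothetical):

$ hadoop fs -mkdir -p /user/hadoop/logs
$ hadoop fs -put /var/log/httpd/access_log /user/hadoop/logs/
$ hadoop fs -ls /user/hadoop/logs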
Problem with put Command
We can use the put command of Hadoop to transfer data from these sources to HDFS. But it suffers from the following drawbacks −
- Using the put command, we can transfer only one file at a time, while the data generators produce data at a much higher rate. Since analysis of older data is less accurate, we need a solution that transfers data in real time.
- If we use the put command, the data needs to be packaged and ready for upload. Since web servers generate data continuously, this is a very difficult task.
What we need here is a solution that can overcome the drawbacks of the put command and transfer the "streaming data" from data generators to centralized stores (especially HDFS) with lower latency.
Problem with HDFS
In HDFS, a file exists as a directory entry and its length is considered to be zero until it is closed. For example, if a source is writing data into HDFS and the network is interrupted in the middle of the operation (without closing the file), the data written to the file will be lost.
Therefore, we need a reliable, configurable, and maintainable system to transfer the log data into HDFS.
Note − In a POSIX file system, whenever we access a file (say, to perform a write operation), other programs can still read this file (at least the saved portion of it). This is because the file exists on disk before it is closed.
Available Solutions
To send streaming data (log files, events, etc.) from various sources to HDFS, we have the following tools at our disposal −
Facebook’s Scribe
Scribe is an immensely popular tool used to aggregate and stream log data. It is designed to scale to a very large number of nodes and to be robust to network and node failures.
Apache Kafka
Kafka was developed by the Apache Software Foundation. It is an open-source message broker. Using Kafka, we can handle feeds with high throughput and low latency.
Apache Flume
Apache Flume is a tool/service/data ingestion mechanism for collecting, aggregating, and transporting large amounts of streaming data, such as log data and events, from various web servers to a centralized data store.
It is a highly reliable, distributed, and configurable tool that is principally designed to transfer streaming data from various sources into HDFS.
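To give an early sense of how this looks in practice, here is a minimal sketch of a single Flume agent that tails a web-server log and writes the events into HDFS. The agent name (a1), the log file path, and the HDFS URL are illustrative assumptions; the configuration syntax is explained in detail in the later chapters.

# Name the components of the agent (the agent name "a1" is an assumption).
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Exec source: tail a (hypothetical) web-server log file.
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access_log
a1.sources.r1.channels = c1

# In-memory channel that buffers events between the source and the sink.
a1.channels.c1.type = memory

# HDFS sink: write the buffered events to a (hypothetical) HDFS path.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://localhost:9000/user/hadoop/flume/logs
a1.sinks.k1.channel = c1

Such an agent would then be started with the flume-ng command, for example −

$ flume-ng agent --conf conf --conf-file example.conf --name a1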
In this tutorial, we will discuss in detail how to use Flume, with a few examples.