Apache Flume: A Concise Tutorial

Apache Flume - Architecture

The following illustration depicts the basic architecture of Flume. Data generators (such as Facebook and Twitter) produce data, which is collected by individual Flume agents running on them. A data collector (which is also an agent) then gathers the data from these agents, aggregates it, and pushes it into a centralized store such as HDFS or HBase.

Figure: Flume architecture

Flume Event

An event is the basic unit of data transported inside Flume. It contains a byte-array payload that is carried from the source to the destination, accompanied by optional headers. A typical Flume event has the following structure −

Figure: Flume event
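The event structure can be modelled roughly as follows. This is a toy Python sketch, not Flume's own API (Flume is written in Java); the class name `Event` and the header keys `host` and `timestamp` are assumptions chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    """Toy model of a Flume event: a byte payload plus optional string headers."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


# Hypothetical header keys and values, used only to show the shape of an event.
event = Event(body=b"user logged in",
              headers={"host": "web-01", "timestamp": "1700000000000"})

# Headers are optional: an event may carry a body alone.
bare_event = Event(body=b"raw bytes only")
print(event.headers, bare_event.headers)
```

Keeping the body as raw bytes reflects the point made above: the payload is an opaque byte array, and any metadata travels in the headers.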

Flume Agent

In Flume, an agent is an independent daemon process (JVM). It receives data (events) from clients or other agents and forwards it to its next destination (a sink or another agent). A Flume deployment may have more than one agent. The following diagram represents a Flume agent −

Figure: Flume agent

As shown in the diagram, a Flume agent contains three main components: source, channel, and sink.
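To make the division of labour concrete, here is a minimal Python sketch of the three roles. It is an illustrative model only and does not use Flume's API; `ToyChannel`, `run_source`, and `run_sink` are hypothetical names, and a plain list stands in for the centralized store.

```python
from collections import deque


class ToyChannel:
    """Buffers events between a source and a sink (in-memory, FIFO)."""

    def __init__(self):
        self._queue = deque()

    def put(self, event):
        self._queue.append(event)

    def take(self):
        # Returns None when the channel is empty.
        return self._queue.popleft() if self._queue else None


def run_source(lines, channel):
    """Source role: wrap each raw line as an event and put it on the channel."""
    for line in lines:
        channel.put({"headers": {}, "body": line.encode("utf-8")})


def run_sink(channel, store):
    """Sink role: drain the channel and deliver each event body to the store."""
    event = channel.take()
    while event is not None:
        store.append(event["body"])
        event = channel.take()


channel = ToyChannel()
store = []  # stands in for a centralized store such as HDFS
run_source(["first line", "second line"], channel)
run_sink(channel, store)
print(store)
```

The source and the sink never call each other directly; the channel is the only thing they share, which is what lets them run at different speeds.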

Source

A source is the component of an agent that receives data from the data generators and transfers it, in the form of Flume events, to one or more channels.

Apache Flume supports several types of sources, and each source receives events from a specified data generator.

Example − Avro source, Thrift source, Twitter 1% source, etc.

Channel

A channel is a transient store that receives events from the source and buffers them until they are consumed by sinks. It acts as a bridge between the sources and the sinks.

These channels are fully transactional, and they can work with any number of sources and sinks.

Example − JDBC channel, File channel, Memory channel, etc.
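The transactional behaviour of a channel can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not Flume's implementation: `ToyTransactionalChannel`, `put_batch`, `take_batch`, and `pending` are hypothetical names, and a "transaction" here simply means that a batch is removed from the channel only if delivery succeeds.

```python
from collections import deque


class ToyTransactionalChannel:
    """In-memory buffer whose takes are rolled back when delivery fails."""

    def __init__(self):
        self._events = deque()

    def put_batch(self, events):
        self._events.extend(events)

    def pending(self):
        return list(self._events)

    def take_batch(self, deliver, batch_size=10):
        """Take up to batch_size events and hand them to deliver().

        Returns True on commit; on failure the events are restored
        to the front of the channel in their original order.
        """
        batch = []
        while self._events and len(batch) < batch_size:
            batch.append(self._events.popleft())
        try:
            deliver(batch)
        except Exception:
            # Rollback: extendleft reverses its input, so reverse it first.
            self._events.extendleft(reversed(batch))
            return False
        return True


def failing_delivery(batch):
    raise IOError("destination unreachable")


channel = ToyTransactionalChannel()
channel.put_batch(["e1", "e2", "e3"])

rolled_back = channel.take_batch(failing_delivery)  # nothing is lost
delivered = []
committed = channel.take_batch(delivered.extend)    # events leave the channel
print(rolled_back, committed, delivered)
```

The design point is that an event stays buffered until some sink has successfully taken responsibility for it.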

Sink

A sink consumes the data (events) from the channels and delivers it to its destination, which might be another agent or a centralized store such as HBase or HDFS.

Example − HDFS sink

Note − A Flume agent can have multiple sources, sinks, and channels. All the supported sources, sinks, and channels are listed in the Flume configuration chapter of this tutorial.

Additional Components of Flume Agent

The components discussed above are the primitive components of an agent. In addition to these, a few more components play a vital role in transferring events from the data generator to the centralized stores.

Interceptors

Interceptors are used to alter or inspect Flume events as they are transferred between a source and a channel.
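As a rough illustration, an interceptor chain can be thought of as a list of functions applied to each event before it reaches a channel. The sketch below is plain Python rather than Flume's Java interceptor API; the function names are hypothetical, and events are modelled as dictionaries with `headers` and `body` keys.

```python
import time


def timestamp_interceptor(event):
    """Alter: add a 'timestamp' header (milliseconds) if none is present."""
    event["headers"].setdefault("timestamp", str(int(time.time() * 1000)))
    return event


def make_counting_interceptor(counter):
    """Inspect: tally events and payload bytes without changing the event."""
    def intercept(event):
        counter["events"] = counter.get("events", 0) + 1
        counter["bytes"] = counter.get("bytes", 0) + len(event["body"])
        return event
    return intercept


def apply_interceptors(event, interceptors):
    """Run the event through each interceptor in order."""
    for intercept in interceptors:
        event = intercept(event)
    return event


stats = {}
chain = [timestamp_interceptor, make_counting_interceptor(stats)]
event = apply_interceptors({"headers": {}, "body": b"hello"}, chain)
print(sorted(event["headers"]), stats)
```

The first function shows the "alter" case (it adds a header), and the second shows the "inspect" case (it only reads the event).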

Channel Selectors

When a source has multiple channels, channel selectors determine which channel an event is sent to. There are two types of channel selectors −

  1. Default channel selectors − These are also known as replicating channel selectors; they replicate every event to each channel.

  2. Multiplexing channel selectors − These decide which channel to send an event to based on an attribute in the header of that event.
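The difference between the two selector types can be sketched in a few lines of Python. This is an illustrative model, not Flume's API: the function names, the `state` header, and the values `US` and `CZ` are assumptions, and each channel is modelled as a plain list.

```python
def replicating_selector(event, channels):
    """Default behaviour: every channel receives a copy of the event."""
    return list(channels)


def multiplexing_selector(event, header, mapping, default):
    """Pick channel names by looking up one header value in a mapping."""
    return mapping.get(event["headers"].get(header), default)


channels = {"c1": [], "c2": [], "c3": []}
mapping = {"US": ["c1"], "CZ": ["c2", "c3"]}  # hypothetical header values
event = {"headers": {"state": "CZ"}, "body": b"payload"}

# Multiplexing: only the channels mapped to the header value get the event.
for name in multiplexing_selector(event, "state", mapping, default=["c1"]):
    channels[name].append(event)

print({name: len(events) for name, events in channels.items()})
```

An event whose header value has no entry in the mapping falls back to the default channel list, so no event is left without a destination.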

Sink Processors

Sink processors are used to invoke a particular sink from a selected group of sinks. They let you create failover paths for your sinks or load-balance events from a channel across multiple sinks.
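Both strategies can be sketched in Python as follows. This is a simplified model, not Flume's implementation: the function names and the sink names `k1` and `k2` are hypothetical, each sink is modelled as a callable that raises on failure, and the load balancing shown is plain round-robin.

```python
from itertools import cycle


def failover_deliver(event, sinks):
    """Try sinks in priority order; return the name of the first that succeeds."""
    for name, sink in sinks:
        try:
            sink(event)
            return name
        except Exception:
            continue  # fall through to the next sink in the group
    raise RuntimeError("every sink in the group failed")


def round_robin_deliver(events, sinks):
    """Spread events across the sinks of a group in turn."""
    for event, (name, sink) in zip(events, cycle(sinks)):
        sink(event)


def broken_sink(event):
    raise IOError("destination unreachable")


primary, backup = [], []

# Failover: k1 is down, so the event lands in k2.
used = failover_deliver(b"e1", [("k1", broken_sink), ("k2", backup.append)])

# Load balancing: events alternate between k1 and k2.
round_robin_deliver([b"e2", b"e3", b"e4"],
                    [("k1", primary.append), ("k2", backup.append)])
print(used, primary, backup)
```

Failover favours a preferred sink and uses the others only when it fails, while load balancing treats the sinks in the group as equals.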