Apache Flume - Sequence Generator Source

In the previous chapter, we saw how to fetch data from a Twitter source into HDFS. This chapter explains how to fetch data from the Sequence Generator source.

Prerequisites

To run the example provided in this chapter, you need HDFS installed along with Flume. Therefore, verify your Hadoop installation and start HDFS before proceeding further. (Refer to the previous chapter to learn how to start HDFS.)
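
As a quick reminder, on a standard Hadoop installation (with HADOOP_HOME set) you can usually start HDFS and confirm that its daemons are running as follows; exact paths may differ on your setup.

$ $HADOOP_HOME/sbin/start-dfs.sh
$ jps   # NameNode and DataNode should appear in the output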

Configuring Flume

We have to configure the source, the channel, and the sink using the configuration file in the conf folder. The example given in this chapter uses a sequence generator source, a memory channel, and an HDFS sink.

Sequence Generator Source

This source generates events continuously. It maintains a counter that starts from 0 and increments by 1, and it is used for testing purposes. While configuring this source, you must provide values for the following properties (a minimal configuration snippet follows the list) −

  1. Channels

  2. type − seq
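
A minimal sketch of those mandatory settings in the properties file, using the agent and component names from the full example later in this chapter; the totalEvents line is an optional extra that caps how many events the source emits.

SeqGenAgent.sources.SeqSource.type = seq
SeqGenAgent.sources.SeqSource.channels = MemChannel
# Optional: stop after a fixed number of events instead of generating forever
SeqGenAgent.sources.SeqSource.totalEvents = 1000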

Channel

We are using the memory channel. To configure the memory channel, you must provide a value for the type of the channel. Given below is the list of properties that you need to supply while configuring the memory channel (the corresponding configuration lines follow the list) −

  1. type − It holds the type of the channel. In our example, the type is memory; MemChannel is merely the name we give to the channel instance.

  2. capacity − It is the maximum number of events stored in the channel. Its default value is 100. (Optional)

  3. transactionCapacity − It is the maximum number of events the channel accepts or sends per transaction. Its default value is 100. (Optional)
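
Written out in the properties file, the channel section looks like this; note that capacity must be at least as large as transactionCapacity.

SeqGenAgent.channels.MemChannel.type = memory
# Maximum number of events the channel can buffer
SeqGenAgent.channels.MemChannel.capacity = 1000
# Maximum number of events per transaction; must not exceed capacity
SeqGenAgent.channels.MemChannel.transactionCapacity = 100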

HDFS Sink

This sink writes data into HDFS. To configure this sink, you must provide the following details.

  1. Channel

  2. type − hdfs

  3. hdfs.path − The path of the directory in HDFS where the data is to be stored.

We can also provide some optional values based on the scenario. Given below are the optional properties of the HDFS sink that we are configuring in our application (a short illustration of the roll settings follows the list).

  1. fileType − This is the required file format of our HDFS file. SequenceFile, DataStream, and CompressedStream are the three types available with this sink. In our example, we are using DataStream.

  2. writeFormat − Could be either Text or Writable.

  3. batchSize − It is the number of events written to a file before it is flushed into HDFS. Its default value is 100.

  4. rollSize − It is the file size (in bytes) that triggers a roll. Its default value is 1024.

  5. rollCount − It is the number of events written into the file before it is rolled. Its default value is 10.
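
Since setting any of the roll properties to 0 disables that trigger, a sink that rolls purely by file size could, for instance, be configured as follows; the 1 MB threshold is an arbitrary illustrative value.

# Roll to a new file once the current one reaches ~1 MB
SeqGenAgent.sinks.HDFS.hdfs.rollSize = 1048576
# Disable time-based and count-based rolling
SeqGenAgent.sinks.HDFS.hdfs.rollInterval = 0
SeqGenAgent.sinks.HDFS.hdfs.rollCount = 0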

Example – Configuration File

Given below is an example of the configuration file. Copy this content and save it as seq_gen.conf in the conf folder of Flume.

# Naming the components on the current agent
SeqGenAgent.sources = SeqSource
SeqGenAgent.channels = MemChannel
SeqGenAgent.sinks = HDFS

# Describing/Configuring the source
SeqGenAgent.sources.SeqSource.type = seq

# Describing/Configuring the sink
SeqGenAgent.sinks.HDFS.type = hdfs
SeqGenAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/Hadoop/seqgen_data/
SeqGenAgent.sinks.HDFS.hdfs.filePrefix = log
SeqGenAgent.sinks.HDFS.hdfs.rollInterval = 0
SeqGenAgent.sinks.HDFS.hdfs.rollCount = 10000
SeqGenAgent.sinks.HDFS.hdfs.fileType = DataStream

# Describing/Configuring the channel
SeqGenAgent.channels.MemChannel.type = memory
SeqGenAgent.channels.MemChannel.capacity = 1000
SeqGenAgent.channels.MemChannel.transactionCapacity = 100

# Binding the source and sink to the channel
SeqGenAgent.sources.SeqSource.channels = MemChannel
SeqGenAgent.sinks.HDFS.channel = MemChannel

Execution

Browse through the Flume home directory and execute the application as shown below.

$ cd $FLUME_HOME
$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/seq_gen.conf \
   --name SeqGenAgent
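
To watch the agent's activity directly in the terminal, you can append Flume's standard logger override to the same command; this is a common debugging convenience.

$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/seq_gen.conf \
   --name SeqGenAgent -Dflume.root.logger=INFO,console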

If everything goes fine, the source starts generating sequence numbers, which will be pushed into HDFS in the form of log files.

Given below is a snapshot of the command prompt window fetching the data generated by the sequence generator into HDFS.

[Screenshot: the Flume agent console while data generated by the sequence source is written to HDFS]

Verifying the HDFS

You can access the Hadoop Administration Web UI using the following URL −

http://localhost:50070/

Click on the dropdown named Utilities on the right-hand side of the page. You can see two options, as shown in the screenshot given below.

[Screenshot: Hadoop web UI with the Utilities dropdown expanded]

Click on Browse the file system and enter the path of the HDFS directory where you have stored the data generated by the sequence generator.

In our example, the path will be /user/Hadoop/seqgen_data/. Then you can see the list of log files generated by the sequence generator, stored in HDFS, as given below.
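
Alternatively, assuming the Hadoop binaries are on your PATH, you can list the same directory from the command line:

$ hdfs dfs -ls /user/Hadoop/seqgen_data/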

[Screenshot: HDFS file browser listing the log files under /user/Hadoop/seqgen_data/]

Verifying the Contents of the File

All these log files contain numbers in sequential format. You can verify the contents of these files in the file system using the cat command as shown below.
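
A sketch of that check from the command line; the exact file name will differ on your system, since Flume appends a timestamp to the configured log prefix.

$ hdfs dfs -cat /user/Hadoop/seqgen_data/log.1407381524007
0
1
2
3
…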

[Screenshot: contents of one of the generated log files]