Apache Storm: A Concise Tutorial
Apache Storm - Core Concepts
Apache Storm reads a raw stream of real-time data at one end, passes it through a sequence of small processing units, and outputs the processed, useful information at the other end.
The following diagram depicts the core concept of Apache Storm.

Let us now take a closer look at the components of Apache Storm −
Components and Description

Tuple − The tuple is the main data structure in Storm. It is a list of ordered elements. By default, a tuple supports all data types. Generally, it is modelled as a set of comma-separated values and passed to a Storm cluster.

Stream − A stream is an unordered sequence of tuples.

Spouts − A spout is the source of a stream. Generally, Storm accepts input data from raw data sources such as the Twitter Streaming API, an Apache Kafka queue, a Kestrel queue, etc. Otherwise, you can write spouts to read data from custom data sources. "ISpout" is the core interface for implementing spouts. Some of the specific interfaces are IRichSpout, BaseRichSpout, KafkaSpout, etc.

Bolts − Bolts are logical processing units. Spouts pass data to bolts, and bolts process it and produce a new output stream. Bolts can perform filtering, aggregation, joining, and interaction with data sources and databases. A bolt receives data and emits it to one or more bolts. "IBolt" is the core interface for implementing bolts. Some of the common interfaces are IRichBolt, IBasicBolt, etc.
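The tuple and stream concepts above can be sketched in plain Java. This is a minimal simulation, not Storm's actual `Tuple` class: a tuple is modelled as an ordered list of values, and a stream as a sequence of such tuples.

```java
import java.util.Arrays;
import java.util.List;

// Minimal plain-Java sketch of Storm's data model (illustrative only):
// a tuple is an ordered list of elements that may hold any data type.
public class TupleSketch {
    // Build a two-field tuple: (username, tweet).
    public static List<Object> makeTuple(String username, String tweet) {
        return Arrays.asList(username, tweet);
    }

    public static void main(String[] args) {
        List<Object> tuple = makeTuple("storm_user", "Hello from Storm!");

        // A stream is a sequence of such tuples.
        List<List<Object>> stream = Arrays.asList(
            tuple,
            makeTuple("another_user", "Tuples carry the data")
        );

        // Fields are accessed by position within the tuple.
        System.out.println(stream.get(0).get(0));
    }
}
```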
Let us take a real-time example of "Twitter Analysis" and see how it can be modelled in Apache Storm. The following diagram depicts the structure.

The input for "Twitter Analysis" comes from the Twitter Streaming API. A spout reads the tweets posted by users via the Twitter Streaming API and emits them as a stream of tuples. A single tuple from the spout contains a Twitter username and a single tweet as comma-separated values. This stream of tuples is then forwarded to a bolt, which splits each tweet into individual words, calculates the word count, and persists the information to a configured datasource. Now, we can easily get the result by querying the datasource.
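The bolt logic described above can be sketched in plain Java. The class and method names are illustrative, not Storm's bolt API: the sketch splits a tweet into words and keeps a running word count.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative word-count bolt logic (plain Java, not Storm's IBolt API):
// each call processes one tweet and updates the running counts.
public class WordCountBolt {
    private final Map<String, Integer> counts = new HashMap<>();

    // Split the tweet into words and increment each word's count.
    public Map<String, Integer> execute(String tweet) {
        for (String word : tweet.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        WordCountBolt bolt = new WordCountBolt();
        bolt.execute("storm is fast");
        System.out.println(bolt.execute("storm is distributed"));
    }
}
```

In real Storm, this logic would live in a bolt's `execute` method, and the counts would be persisted to the configured datasource rather than held in memory.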
Topology
Spouts and bolts are connected together, and together they form a topology. Real-time application logic is specified inside a Storm topology. In simple words, a topology is a directed graph where the vertices are computations and the edges are streams of data.
A simple topology starts with spouts. A spout emits data to one or more bolts. A bolt represents a node in the topology having the smallest processing logic, and the output of a bolt can be emitted into another bolt as input.
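The spout-to-bolt data flow can be simulated in plain Java. This is not Storm's `TopologyBuilder` API (in real Storm the wiring is done with `setSpout` and `setBolt`); it is a sketch showing how one bolt's output feeds the next.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Plain-Java simulation of a topology's data flow (illustrative only):
// a spout emits tuples, and each bolt transforms its input stream.
public class TopologySketch {
    // "Spout": emit a fixed batch of tweets; a real spout would poll a source.
    public static List<String> spout() {
        List<String> tweets = new ArrayList<>();
        tweets.add("storm runs topologies");
        tweets.add("topologies never stop");
        return tweets;
    }

    // "Bolt": apply a per-tuple transformation to an input stream.
    public static List<String> bolt(List<String> in, Function<String, String> f) {
        List<String> out = new ArrayList<>();
        for (String tuple : in) {
            out.add(f.apply(tuple));
        }
        return out;
    }

    public static void main(String[] args) {
        // Directed graph: spout -> uppercase bolt -> tagging bolt.
        List<String> s1 = bolt(spout(), String::toUpperCase);
        List<String> s2 = bolt(s1, t -> t + " [" + t.length() + "]");
        System.out.println(s2);
    }
}
```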
Storm keeps the topology running until you kill it. Apache Storm's main job is to run topologies, and it can run any number of topologies at a given time.
Tasks
Now you have a basic idea of spouts and bolts. They are the smallest logical units of a topology, and a topology is built using a single spout and an array of bolts. They should be executed properly in a particular order for the topology to run successfully. The execution of each spout and bolt by Storm is called a "task". In simple words, a task is the execution of either a spout or a bolt. At a given time, each spout and bolt can have multiple instances running in multiple separate threads.
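The idea of multiple task instances running in separate threads can be sketched with a thread pool. This is illustrative, not Storm's internal scheduler: several instances of the same "bolt" drain one shared queue of tuples concurrently.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of "tasks": several instances of the same bolt, each in its own
// thread, processing tuples from a shared queue (not Storm's scheduler).
public class TaskSketch {
    public static int process(int tupleCount, int taskCount) {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < tupleCount; i++) {
            queue.add(i);
        }

        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(taskCount);
        for (int t = 0; t < taskCount; t++) {
            pool.submit(() -> {
                // Each task instance drains tuples until the queue is empty.
                while (queue.poll() != null) {
                    processed.incrementAndGet(); // the bolt's "work"
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }

    public static void main(String[] args) {
        System.out.println(process(100, 4));
    }
}
```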
Workers
A topology runs in a distributed manner on multiple worker nodes. Storm spreads the tasks evenly across all the worker nodes. A worker node's role is to listen for jobs and to start or stop processes whenever a new job arrives.
Stream Grouping
A stream of data flows from spouts to bolts or from one bolt to another. Stream grouping controls how tuples are routed in the topology and helps us understand the flow of tuples through the topology. There are four built-in groupings, as explained below.
Shuffle Grouping
In shuffle grouping, an equal number of tuples is distributed randomly across all of the workers executing the bolt. The following diagram depicts the structure.
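The even split can be sketched in plain Java. Storm's shuffle grouping picks targets randomly, which evens out over many tuples; this sketch uses round-robin instead so the equal distribution is deterministic and easy to verify.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of shuffle grouping: tuples are spread evenly over the bolt's
// tasks. Round-robin stands in for Storm's randomized target selection.
public class ShuffleGroupingSketch {
    public static List<List<String>> distribute(List<String> tuples, int tasks) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            buckets.add(new ArrayList<>());
        }
        for (int i = 0; i < tuples.size(); i++) {
            buckets.get(i % tasks).add(tuples.get(i)); // next task in turn
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> tuples = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            tuples.add("tuple-" + i);
        }
        System.out.println(distribute(tuples, 4)); // 2 tuples per task
    }
}
```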

Fields Grouping
In fields grouping, tuples are partitioned by the values of one or more fields: tuples with the same field values are sent to the same worker executing the bolts, while the remaining tuples go to other workers. For example, if the stream is grouped by the field "word", then all tuples with the same string "Hello" will move to the same worker. The following diagram shows how fields grouping works.
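The routing rule can be sketched by hashing the grouping field's value, so equal values always land on the same task. This is an illustrative stand-in for Storm's internal partitioning, not its actual implementation.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of fields grouping: the target task is chosen by hashing the
// value of the grouping field, so equal values map to the same task.
public class FieldsGroupingSketch {
    public static int taskFor(String fieldValue, int taskCount) {
        // Math.abs guards against negative hash codes.
        return Math.abs(fieldValue.hashCode() % taskCount);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Hello", "world", "Hello");
        for (String word : words) {
            System.out.println(word + " -> task " + taskFor(word, 4));
        }
        // Both "Hello" tuples print the same task number.
    }
}
```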
