Apache Storm: A Concise Tutorial

Apache Storm - Introduction

What is Apache Storm?

Apache Storm is a distributed, real-time big data processing system. It is designed to process vast amounts of data in a fault-tolerant and horizontally scalable manner. It is a streaming data framework capable of very high ingestion rates. Although Storm itself is stateless, it manages the distributed environment and cluster state through Apache ZooKeeper. It is simple to use, and it lets you run all kinds of manipulations on real-time data in parallel.

Apache Storm remains a leader in real-time data analytics. Storm is easy to set up and operate, and it guarantees that every message will be processed through the topology at least once.
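The at-least-once guarantee can be illustrated with a small simulation. The sketch below is plain Python, not the Storm API: `process_at_least_once`, `make_lossy_handler`, and the retry scheme are hypothetical names chosen for illustration. It assumes unique messages and a source that keeps each message pending until the handler acknowledges it, replaying the message whenever the acknowledgement does not arrive.

```python
from collections import deque


def process_at_least_once(messages, handler, max_attempts=5):
    """Deliver every message to handler until it succeeds (is "acked").

    Illustrative stand-in for a replaying source; not Storm's implementation.
    Returns a dict mapping each message to its number of delivery attempts.
    """
    pending = deque((msg, 1) for msg in messages)
    attempts = {}
    while pending:
        msg, attempt = pending.popleft()
        attempts[msg] = attempt
        try:
            handler(msg)  # returning normally counts as an ack
        except Exception:
            if attempt >= max_attempts:
                raise  # give up instead of looping forever
            pending.append((msg, attempt + 1))  # no ack: replay later
    return attempts


def make_lossy_handler(lose_ack_for):
    """Build a handler whose first ack is lost for the given messages."""
    seen = set()
    processed = []

    def handler(msg):
        processed.append(msg)  # the work happens first ...
        if msg in lose_ack_for and msg not in seen:
            seen.add(msg)
            raise RuntimeError(f"ack lost for {msg!r}")  # ... then the ack is lost

    return handler, processed


handler, processed = make_lossy_handler(lose_ack_for={"b"})
attempts = process_at_least_once(["a", "b", "c"], handler)
```

In this run the message `b` is delivered twice and appears twice in `processed`. No message is dropped, but downstream logic has to tolerate duplicates, for example by being idempotent.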

Apache Storm vs Hadoop

Both Hadoop and Storm are frameworks for analyzing big data. They complement each other but differ in some respects. Apache Storm handles every operation except persistence, while Hadoop is good at everything but lags in real-time computation. The following table compares the attributes of Storm and Hadoop.

| Storm | Hadoop |
| --- | --- |
| Real-time stream processing | Batch processing |
| Stateless | Stateful |
| Master/slave architecture with ZooKeeper-based coordination. The master node is called Nimbus and the slaves are supervisors. | Master/slave architecture with or without ZooKeeper-based coordination. The master node is the JobTracker and the slave nodes are TaskTrackers. |
| A Storm streaming process can handle tens of thousands of messages per second on a cluster. | Hadoop stores data in the Hadoop Distributed File System (HDFS) and uses the MapReduce framework to process vast amounts of data, which takes minutes or hours. |
| A Storm topology runs until it is shut down by the user or hits an unexpected, unrecoverable failure. | MapReduce jobs are executed in sequential order and eventually complete. |

Both are distributed and fault-tolerant. In Storm, if Nimbus or a supervisor dies, restarting it makes it continue from where it stopped, so nothing is affected.
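The contrast between real-time stream processing and batch processing can be pictured with a toy word count. This is a minimal sketch in plain Python that uses neither the Storm nor the Hadoop API; `batch_word_count` and `streaming_word_count` are hypothetical names. The batch function returns one result after reading all of its finite input, while the streaming generator emits an updated count as each word arrives and could keep running on an endless input.

```python
from collections import Counter


def batch_word_count(lines):
    """Batch style: read the whole finite input, then return one final result."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)


def streaming_word_count(lines):
    """Stream style: yield an updated (word, count) pair as each word arrives.

    The input may be an endless iterator; results are available immediately.
    """
    counts = Counter()
    for line in lines:
        for word in line.split():
            counts[word] += 1
            yield word, counts[word]


data = ["storm storm hadoop", "storm"]
final = batch_word_count(data)
updates = list(streaming_word_count(data))
```

Both approaches arrive at the same totals; the difference is when results become visible. The streaming version exposes every intermediate count, which is what makes low-latency reactions possible.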

Use-Cases of Apache Storm

Apache Storm is well known for real-time big data stream processing. For this reason, many companies use Storm as an integral part of their systems. Some notable examples are as follows −

Twitter − Twitter uses Apache Storm for its range of "Publisher Analytics" products, which process every tweet and click on the Twitter platform. Apache Storm is deeply integrated with Twitter's infrastructure.

NaviSite − NaviSite uses Storm for its event log monitoring and auditing system. Every log generated in the system passes through Storm. Storm checks each message against a configured set of regular expressions, and if there is a match, that message is saved to the database.

Wego − Wego is a travel metasearch engine based in Singapore. Travel-related data arrives from many sources all over the world, at different times. Storm helps Wego search real-time data, resolve concurrency issues, and find the best match for the end user.
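The log-auditing use case described for NaviSite boils down to a filter step: test each message against a set of regular expressions and keep the matches. The sketch below shows that step in plain Python rather than as a Storm bolt. The patterns, the sample log lines, and the `matching_messages` function are all hypothetical, and a list stands in for the database.

```python
import re

# Hypothetical patterns; a real deployment would load these from configuration.
PATTERNS = [re.compile(p) for p in (r"\bERROR\b", r"login failed", r"\bdenied\b")]


def matching_messages(log_stream, patterns=PATTERNS):
    """Yield only the messages that match at least one configured pattern."""
    for message in log_stream:
        if any(p.search(message) for p in patterns):
            yield message


database = []  # stand-in for a real database table

logs = [
    "INFO service started",
    "ERROR disk full on node-3",
    "WARN login failed for user alice",
    "INFO heartbeat ok",
]
for message in matching_messages(logs):
    database.append(message)  # "save" the matching message
```

Because `matching_messages` is a generator, it handles one message at a time and never needs the whole log in memory, which mirrors how a stream processor sees its input.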

Apache Storm Benefits

Here is a list of the benefits that Apache Storm offers −

  1. Storm is open source, robust, and user-friendly. It can be used in small companies as well as large corporations.

  2. Storm is fault tolerant, flexible, reliable, and supports any programming language.

  3. Allows real-time stream processing.

  4. Storm is extremely fast, with very high data-processing throughput.

  5. Storm can maintain its performance under increasing load when resources are added linearly. It is highly scalable.

  6. Storm performs data refresh and end-to-end delivery in seconds or minutes, depending on the problem. It has very low latency.

  7. Storm has operational intelligence.

  8. Storm provides guaranteed data processing even if any of the connected nodes in the cluster die or messages are lost.