Apache Kafka: A Concise Tutorial
Apache Kafka - Introduction
Big Data work involves enormous volumes of data, which raises two main challenges. The first is how to collect such a large volume of data, and the second is how to analyze the data once it has been collected. A messaging system helps address both challenges.
Kafka is designed for distributed, high-throughput systems and tends to work very well as a replacement for a more traditional message broker. Compared with other messaging systems, Kafka offers better throughput, built-in partitioning, replication, and inherent fault tolerance, which makes it a good fit for large-scale message processing applications.
What is a Messaging System?
A messaging system is responsible for transferring data from one application to another, so that applications can focus on the data itself rather than on how to share it. Distributed messaging is based on the concept of reliable message queuing: messages are queued asynchronously between client applications and the messaging system. Two messaging patterns are available: point-to-point and publish-subscribe (pub-sub). Most messaging systems follow the pub-sub pattern.
Point to Point Messaging System
In a point-to-point system, messages are persisted in a queue. One or more consumers can consume the messages in the queue, but any particular message can be consumed by at most one consumer. Once a consumer reads a message, it disappears from the queue. A typical example is an order processing system, where each order is processed by exactly one order processor, but multiple order processors can work at the same time. The following diagram depicts the structure.
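The point-to-point behaviour described above can be sketched with a small in-memory queue. This is only an illustration, not how any real broker is implemented: the `PointToPointQueue` class, the processor names, and the order IDs are all hypothetical.

```python
from collections import deque


class PointToPointQueue:
    """Toy in-memory queue: each message is delivered to at most one consumer."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Removing the message on read is what makes this point-to-point:
        # once one consumer has it, no other consumer can see it.
        return self._messages.popleft() if self._messages else None


orders = PointToPointQueue()
for order_id in ("order-1", "order-2", "order-3"):
    orders.send(order_id)

# Two hypothetical order processors take turns pulling from the same queue.
workers = ["processor-a", "processor-b"]
processed = {name: [] for name in workers}
turn = 0
while (order := orders.receive()) is not None:
    processed[workers[turn % len(workers)]].append(order)
    turn += 1

print(processed)
```

Each order ends up with exactly one processor, and the queue is empty once every order has been read.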
Publish-Subscribe Messaging System
In a publish-subscribe system, messages are persisted in a topic. Unlike in a point-to-point system, consumers can subscribe to one or more topics and consume all the messages in those topics. Message producers are called publishers and message consumers are called subscribers. A real-life example is Dish TV, which publishes different channels such as sports, movies, and music; anyone can subscribe to their own set of channels and receive them whenever those channels are available.
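By contrast, a pub-sub broker delivers every message on a topic to every subscriber of that topic. The sketch below assumes a hypothetical in-memory `PubSubBroker` class with plain lists as subscriber inboxes. It mirrors the channel analogy above and is not a real messaging API.

```python
from collections import defaultdict


class PubSubBroker:
    """Toy in-memory broker: every subscriber of a topic gets every message."""

    def __init__(self):
        # topic name -> list of subscriber inboxes (plain lists)
        self._subscribers = defaultdict(list)

    def subscribe(self, topic):
        inbox = []
        self._subscribers[topic].append(inbox)
        return inbox

    def publish(self, topic, message):
        # Unlike a point-to-point queue, the message is copied to all inboxes.
        for inbox in self._subscribers.get(topic, ()):
            inbox.append(message)


broker = PubSubBroker()
alice_sports = broker.subscribe("sports")
bob_sports = broker.subscribe("sports")
bob_movies = broker.subscribe("movies")

broker.publish("sports", "match highlights")
broker.publish("movies", "new release")

print(alice_sports, bob_sports, bob_movies)
```

Both sports subscribers receive the same message, while only the movies subscriber receives the movie announcement.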
What is Kafka?
Apache Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle a high volume of data and lets you pass messages from one endpoint to another. Kafka is suitable for both offline and online message consumption. Kafka messages are persisted on disk and replicated within the cluster to prevent data loss. Kafka has traditionally been built on top of the ZooKeeper synchronization service, although recent releases can also run without ZooKeeper in KRaft mode. It integrates very well with Apache Storm and Spark for real-time streaming data analysis.
Benefits
Following are a few benefits of Kafka −
-
Reliability − Kafka is distributed, partitioned, replicated, and fault tolerant.
-
Scalability − The Kafka messaging system scales easily without downtime.
-
Durability − Kafka uses a distributed commit log, which means messages are persisted to disk as quickly as possible and are therefore durable.
-
Performance − Kafka has high throughput for both publishing and subscribing to messages. It maintains stable performance even when many terabytes of messages are stored.
Kafka is very fast and, when configured with adequate replication, is designed to operate without downtime or data loss.
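The commit log mentioned under Durability can be pictured as an append-only sequence in which every record gets a position (an offset) and reading never removes anything. The following is a minimal single-process sketch under that assumption. The `CommitLog` class and its method names are hypothetical, and it keeps records in memory rather than on disk or across a cluster.

```python
class CommitLog:
    """Toy append-only log: records are never removed when they are read."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # The offset is simply the record's position in the log.
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        # Readers choose where to start, so a slow (offline) reader and a
        # fast (online) reader can consume the same log independently.
        return self._records[offset:offset + max_records]


log = CommitLog()
offsets = [log.append(message) for message in ("a", "b", "c")]

from_start = log.read(0)      # a reader replaying the whole log
from_latest = log.read(2)     # a reader that only wants the newest record

print(offsets, from_start, from_latest)
```

Because reads do not consume records, the same data stays available to any number of readers, which is what lets one log serve both offline and online consumption.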
Use Cases
Kafka can be used in many use cases. Some of them are listed below −
-
Metrics − Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.
-
Log Aggregation Solution − Kafka can be used across an organization to collect logs from multiple services and make them available in a standard format to multiple consumers.
-
Stream Processing − Popular frameworks such as Storm and Spark Streaming read data from a topic, process it, and write the processed data to a new topic, where it becomes available to users and applications. Kafka's strong durability is also very useful in the context of stream processing.
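The read, process, write pattern from the Stream Processing item can be sketched without any framework. In this illustration, topics are plain Python lists held in a dictionary, and the topic names and the `count_views` function are hypothetical. A real job would use a stream processing framework and a Kafka client instead.

```python
from collections import Counter

# Hypothetical topics, modelled as in-memory lists of records.
topics = {
    "page-views": ["/home", "/pricing", "/home"],
    "page-view-counts": [],
}


def count_views(source_records):
    """Emit a running view count for each page as records arrive."""
    counts = Counter()
    for page in source_records:
        counts[page] += 1
        yield (page, counts[page])


# Read from the source topic, process, and write to a new topic.
topics["page-view-counts"].extend(count_views(topics["page-views"]))

print(topics["page-view-counts"])
```

The source topic is left untouched, and the derived topic holds the processed records for other applications to read.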
Need for Kafka
Kafka is a unified platform for handling all real-time data feeds. It supports low-latency message delivery and is designed to stay fault tolerant in the presence of machine failures. It can handle a large number of diverse consumers. Kafka is very fast, capable of about 2 million writes per second. Kafka persists all data to disk, which in practice means that writes go first to the operating system's page cache (RAM). This makes transferring data from the page cache to a network socket very efficient.