ZooKeeper Concise Tutorial

ZooKeeper - Applications

ZooKeeper provides a flexible coordination infrastructure for distributed environments. The ZooKeeper framework supports many of today's best-known industrial applications. This chapter discusses some of the most notable applications of ZooKeeper.

Yahoo!

The ZooKeeper framework was originally built at Yahoo!. A well-designed distributed application needs to meet requirements such as data transparency, better performance, robustness, centralized configuration, and coordination, so Yahoo! designed the ZooKeeper framework to meet these requirements.

Apache Hadoop

Apache Hadoop is a driving force behind the growth of the Big Data industry. Hadoop relies on ZooKeeper for configuration management and coordination. Consider a scenario that illustrates the role ZooKeeper plays in Hadoop.

Assume that a Hadoop cluster spans 100 or more commodity servers. Such a cluster needs coordination and naming services. Because the computation involves a large number of nodes, each node needs to synchronize with the others, know where to access services, and know how it should be configured. In other words, a Hadoop cluster requires cross-node services. ZooKeeper provides the facilities for cross-node synchronization and ensures that tasks across Hadoop projects are serialized and synchronized.
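The idea of every node learning "how it should be configured" from one shared place can be illustrated with a small sketch. This is a toy in-memory model, not the ZooKeeper client API: the ConfigStore and Worker classes and the path /config/replication are hypothetical names chosen for illustration. It assumes ZooKeeper's default behavior that a watch fires once and must be registered again after each notification.

```python
class ConfigStore:
    """Toy in-memory stand-in for a coordination service (not the real client API)."""

    def __init__(self):
        self._data = {}      # path -> value
        self._watches = {}   # path -> list of one-shot callbacks

    def set(self, path, value):
        self._data[path] = value
        # Watches fire once and are then discarded (assumed default watch behavior).
        for callback in self._watches.pop(path, []):
            callback(path, value)

    def get(self, path, watch=None):
        if watch is not None:
            self._watches.setdefault(path, []).append(watch)
        return self._data.get(path)


class Worker:
    """A cluster node that keeps a local copy of one shared setting."""

    def __init__(self, store, path):
        self.store = store
        self.path = path
        self.value = store.get(path, watch=self._on_change)

    def _on_change(self, path, value):
        # Re-read and re-register, because the watch was consumed by this event.
        self.value = self.store.get(path, watch=self._on_change)


store = ConfigStore()
store.set("/config/replication", "3")      # hypothetical path and value
workers = [Worker(store, "/config/replication") for _ in range(3)]
store.set("/config/replication", "2")      # one write reaches every worker
print([w.value for w in workers])          # ['2', '2', '2']
```

In this sketch a single write updates all three workers without any of them polling, which is the property that makes a central configuration store attractive for a large cluster.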

Multiple ZooKeeper servers support large Hadoop clusters. Each client machine communicates with one of the ZooKeeper servers to retrieve and update its synchronization information. Some real-world examples are −

  1. Human Genome Project − The Human Genome Project contains terabytes of data. The Hadoop MapReduce framework can be used to analyze the dataset and find interesting facts for human development.

  2. Healthcare − Hospitals can store, retrieve, and analyze huge sets of patient medical records, which are normally in terabytes.

Apache HBase

Apache HBase is an open source, distributed NoSQL database used for real-time read/write access to large datasets; it runs on top of HDFS. HBase follows a master-slave architecture in which the HBase Master governs all the slaves. The slaves are referred to as Region servers.

Installing HBase as a distributed application depends on a running ZooKeeper cluster. Apache HBase uses ZooKeeper to track the status of distributed data throughout the master and region servers, with the help of centralized configuration management and distributed mutex mechanisms. Here are some of the use-cases of HBase −

  1. Telecom − The telecom industry stores billions of mobile call records (around 30 TB per month), and accessing these call records in real time becomes a huge task. HBase can be used to process all the records in real time, easily and efficiently.

  2. Social network − Similar to the telecom industry, sites like Twitter, LinkedIn, and Facebook receive huge volumes of data through the posts created by users. HBase can be used to find recent trends and other interesting facts.
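The distributed mutex mechanism mentioned above is commonly built from sequentially numbered entries: each client adds an entry, and the client holding the lowest number owns the lock. The sketch below models only that ordering rule in memory. It is not HBase's implementation and does not use the ZooKeeper client API; the ToyLock class and the client names are hypothetical.

```python
class ToyLock:
    """Toy model of a lock built from sequentially numbered waiters."""

    def __init__(self):
        self._next_seq = 0
        self._queue = {}   # sequence number -> callback to run when granted

    def acquire(self, on_granted):
        """Join the queue; the callback runs once this waiter holds the lock."""
        seq = self._next_seq
        self._next_seq += 1
        self._queue[seq] = on_granted
        if seq == min(self._queue):
            on_granted()            # lowest number: lock granted immediately
        return seq

    def release(self, seq):
        """Leave the queue; if the holder leaves, the next waiter is granted."""
        if seq not in self._queue:
            return
        was_holder = seq == min(self._queue)
        del self._queue[seq]
        if was_holder and self._queue:
            self._queue[min(self._queue)]()


log = []
lock = ToyLock()
a = lock.acquire(lambda: log.append("client-a holds the lock"))
b = lock.acquire(lambda: log.append("client-b holds the lock"))
print(log)        # ['client-a holds the lock']
lock.release(a)
print(log)        # ['client-a holds the lock', 'client-b holds the lock']
```

Because the lock passes strictly in arrival order, at most one client holds it at any moment, and a waiter that gives up early simply leaves the queue without affecting the holder.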

Apache Solr

Apache Solr is a fast, open source search platform written in Java. It is a blazing-fast, fault-tolerant distributed search engine. Built on top of Lucene, it is a high-performance, full-featured text search engine.

Solr makes extensive use of ZooKeeper features such as configuration management, leader election, node management, and locking and synchronization of data.
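Leader election is often described in terms of sequentially numbered candidates: the candidate with the lowest number leads, and every other candidate watches only the one directly ahead of it. The following sketch models that rule in memory. It is an illustration under those assumptions, not Solr's code and not the ZooKeeper client API; the Election class and the node names are hypothetical.

```python
class Election:
    """Toy model of an election among sequentially numbered candidates."""

    def __init__(self):
        self._next_seq = 0
        self._candidates = {}   # sequence number -> node name

    def join(self, node):
        """Add a candidate and return its sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._candidates[seq] = node
        return seq

    def leave(self, seq):
        """Remove a candidate, as when its session ends."""
        self._candidates.pop(seq, None)

    def leader(self):
        """The candidate with the lowest sequence number leads."""
        if not self._candidates:
            return None
        return self._candidates[min(self._candidates)]

    def watched_by(self, seq):
        """Name of the candidate just ahead of `seq`, the only one it watches."""
        lower = [s for s in self._candidates if s < seq]
        return self._candidates[max(lower)] if lower else None


election = Election()
n1 = election.join("solr-node-1")
n2 = election.join("solr-node-2")
n3 = election.join("solr-node-3")
print(election.leader())         # solr-node-1
print(election.watched_by(n3))   # solr-node-2
election.leave(n1)               # the leader's session ends
print(election.leader())         # solr-node-2
```

Watching only the immediate predecessor means that a leader failure wakes a single candidate rather than all of them.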

Solr has two distinct parts: indexing and searching. Indexing is the process of storing data in a suitable format so that it can be searched later. Solr uses ZooKeeper both for indexing data across multiple nodes and for searching across multiple nodes. ZooKeeper contributes the following features −

  1. Add / remove nodes as and when needed

  2. Replication of data between nodes and subsequently minimizing data loss

  3. Sharing of data between multiple nodes and subsequently searching from multiple nodes for faster search results
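The first feature in the list, adding and removing nodes as needed, relies on entries that exist only while their owner's session is alive. A minimal in-memory sketch of that idea follows. The Membership class, the session ids, and the node names are hypothetical, and the code stands in for the coordination service rather than calling the ZooKeeper client API.

```python
class Membership:
    """Toy registry where each entry lives only as long as its owner's session."""

    def __init__(self):
        self._members = {}   # node name -> session id

    def register(self, session_id, node):
        """A node announces itself under its own session."""
        self._members[node] = session_id

    def expire_session(self, session_id):
        """Drop every entry owned by the session, with no explicit deregistration."""
        self._members = {
            node: sid for node, sid in self._members.items() if sid != session_id
        }

    def live_nodes(self):
        return sorted(self._members)


cluster = Membership()
cluster.register(101, "search-node-a")
cluster.register(102, "search-node-b")
cluster.register(103, "search-node-c")
print(cluster.live_nodes())   # ['search-node-a', 'search-node-b', 'search-node-c']
cluster.expire_session(102)   # node b crashes or disconnects
print(cluster.live_nodes())   # ['search-node-a', 'search-node-c']
```

A crashed node never has to announce its departure: its entry disappears with its session, so the remaining nodes always see an up-to-date list.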

Some of the use-cases of Apache Solr include e-commerce, job search, etc.