Avro Tutorial
AVRO - Overview
To transfer data over a network or to store it persistently, you need to serialize the data. In addition to the serialization APIs provided by Java and Hadoop, there is a special utility called Avro, a schema-based serialization technique.
This tutorial teaches you how to serialize and deserialize data using Avro. Avro provides libraries for various programming languages; in this tutorial, we demonstrate the examples using the Java library.
What is Avro?
Apache Avro is a language-neutral data serialization system. It was developed by Doug Cutting, the father of Hadoop. Since Hadoop's Writable classes lack language portability, Avro is quite helpful: it deals with data formats that can be processed by multiple languages. Avro is a preferred tool for serializing data in Hadoop.
Avro has a schema-based system. A language-independent schema is associated with its read and write operations. Avro serializes data together with its schema into a compact binary format, which can be deserialized by any application.
Avro uses the JSON format to declare data structures. Presently, it supports languages such as Java, C, C++, C#, Python, and Ruby.
Avro Schemas
Avro depends heavily on its schema. Data can be written without prior code generation, serialization is fast, and the resulting serialized data is compact. The schema is stored along with the Avro data in a file, so the data can be processed later by any application.
In RPC, the client and the server exchange schemas during the connection. This exchange helps resolve differences such as fields with the same name, missing fields, and extra fields.
Avro schemas are defined in JSON, which simplifies their implementation in languages that already have JSON libraries.
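For illustration, such a schema might look like the following. This is a sketch of a hypothetical `Employee` record; the record name, namespace, and fields are examples, not taken from the source.

```json
{
  "type": "record",
  "name": "Employee",
  "namespace": "example.avro",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "id",   "type": "int"},
    {"name": "dept", "type": ["null", "string"], "default": null}
  ]
}
```

The union type `["null", "string"]` with a default makes the `dept` field optional, a common Avro idiom.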
Besides Avro, there are other serialization mechanisms in Hadoop, such as Sequence Files, Protocol Buffers, and Thrift.
Comparison with Thrift and Protocol Buffers
Thrift and Protocol Buffers are the libraries that compete most closely with Avro. Avro differs from these frameworks in the following ways −
- Avro supports both dynamic and static types, as per the requirement. Protocol Buffers and Thrift use Interface Definition Languages (IDLs) to specify schemas and their types; these IDLs are used to generate code for serialization and deserialization.
- Avro is built into the Hadoop ecosystem, whereas Thrift and Protocol Buffers are not.
Unlike Thrift and Protocol Buffers, Avro's schema definition is in JSON and not in any proprietary IDL.
| Property | Avro | Thrift & Protocol Buffers |
| --- | --- | --- |
| Dynamic schema | Yes | No |
| Built into Hadoop | Yes | No |
| Schema in JSON | Yes | No |
| No need to compile | Yes | No |
| No need to declare IDs | Yes | No |
| Bleeding edge | Yes | No |
Features of Avro
Listed below are some of the prominent features of Avro −
- Avro is a language-neutral data serialization system.
- It can be processed by many languages (currently C, C++, C#, Java, Python, and Ruby).
- Avro creates a binary structured format that is both compressible and splittable. Hence it can be efficiently used as the input to Hadoop MapReduce jobs.
- Avro provides rich data structures. For example, you can create a record that contains an array, an enumerated type, and a sub-record. These datatypes can be created in any language, processed in Hadoop, and the results fed to a third language.
- Avro schemas, being defined in JSON, facilitate implementation in languages that already have JSON libraries.
- Avro creates a self-describing file called an Avro Data File, in which it stores the data along with its schema in the metadata section.
- Avro is also used in Remote Procedure Calls (RPCs). During RPC, the client and server exchange schemas in the connection handshake.
General Working of Avro
To use Avro, you need to follow the given workflow −
- Step 1 − Create schemas. Here you need to design an Avro schema according to your data.
- Step 2 − Read the schemas into your program. This is done in two ways:
  - By generating a class corresponding to the schema − Compile the schema using Avro. This generates a class file corresponding to the schema.
  - By using the parsers library − You can directly read the schema using the parsers library.
- Step 3 − Serialize the data using the serialization API provided for Avro, which is found in the package org.apache.avro.specific.
- Step 4 − Deserialize the data using the deserialization API provided for Avro, which is found in the package org.apache.avro.specific.
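The workflow above can be sketched end to end with the generic (parser-based) API. This is a minimal sketch, assuming the Avro Java library (`org.apache.avro`) is on the classpath; the `Employee` schema, field values, and file name are illustrative, not from the source.

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroRoundTrip {
    public static void main(String[] args) throws Exception {
        // Step 2: read the schema directly with the parsers library.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Employee\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"id\",\"type\":\"int\"}]}");

        // Step 3: serialize a record into an Avro data file.
        GenericRecord employee = new GenericData.Record(schema);
        employee.put("name", "Alice");
        employee.put("id", 1);

        File file = new File("employees.avro");
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
            writer.create(schema, file); // the schema is stored in the file's metadata
            writer.append(employee);
        }

        // Step 4: deserialize; the reader recovers the schema from the file itself.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord record : reader) {
                System.out.println(record.get("name") + " " + record.get("id"));
            }
        }
    }
}
```

Note that this sketch uses the `org.apache.avro.generic` API, which needs no code generation; the `org.apache.avro.specific` package mentioned in Steps 3 and 4 offers the same round trip through classes generated from the schema.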