AVRO - Quick Guide

AVRO - Overview

To transfer data over a network or for its persistent storage, you need to serialize the data. Apart from the serialization APIs provided by Java and Hadoop, we have a special utility, called Avro, a schema-based serialization technique.

This tutorial teaches you how to serialize and deserialize data using Avro. Avro provides libraries for various programming languages. In this tutorial, we demonstrate the examples using the Java library.

What is Avro?

Apache Avro is a language-neutral data serialization system. It was developed by Doug Cutting, the father of Hadoop. Since Hadoop writable classes lack language portability, Avro becomes quite helpful, as it deals with data formats that can be processed by multiple languages. Avro is a preferred tool to serialize data in Hadoop.

Avro has a schema-based system. A language-independent schema is associated with its read and write operations. Avro serializes data together with its built-in schema into a compact binary format, which can be deserialized by any application.

Avro uses JSON format to declare the data structures. Presently, it supports languages such as Java, C, C++, C#, Python, and Ruby.

Avro Schemas

Avro depends heavily on its schema. It allows all data to be written with no prior knowledge of the schema. Serialization is fast, and the resulting serialized data is small in size. The schema is stored along with the Avro data in a file for any further processing.

In RPC, the client and the server exchange schemas during the connection. This exchange helps in resolving same-named fields, missing fields, extra fields, and so on.

Avro schemas are defined in JSON, which simplifies implementation in languages that already have JSON libraries.

Like Avro, there are other serialization mechanisms in Hadoop such as Sequence Files, Protocol Buffers, and Thrift.

Comparison with Thrift and Protocol Buffers

Thrift and Protocol Buffers are the libraries most comparable to Avro. Avro differs from these frameworks in the following ways −

  1. Avro supports both dynamic and static types as per the requirement. Protocol Buffers and Thrift use Interface Definition Languages (IDLs) to specify schemas and their types. These IDLs are used to generate code for serialization and deserialization.

  2. Avro is built into the Hadoop ecosystem, while Thrift and Protocol Buffers are not.

Unlike Thrift and Protocol Buffers, Avro’s schema definition is in JSON and not in any proprietary IDL.

Property                 Avro   Thrift & Protocol Buffer

Dynamic schema           Yes    No
Built into Hadoop        Yes    No
Schema in JSON           Yes    No
No need to compile       Yes    No
No need to declare IDs   Yes    No
Bleeding edge            Yes    No

Features of Avro

Listed below are some of the prominent features of Avro −

  1. Avro is a language-neutral data serialization system.

  2. It can be processed by many languages (currently C, C++, C#, Java, Python, and Ruby).

  3. Avro creates a binary structured format that is both compressible and splittable. Hence it can be efficiently used as the input to Hadoop MapReduce jobs.

  4. Avro provides rich data structures. For example, you can create a record that contains an array, an enumerated type, and a sub record. These datatypes can be created in any language, can be processed in Hadoop, and the results can be fed to a third language.

  5. Avro schemas, being defined in JSON, facilitate implementation in the languages that already have JSON libraries.

  6. Avro creates a self-describing file called the Avro Data File, in which it stores data along with its schema in the metadata section.

  7. Avro is also used in Remote Procedure Calls (RPCs). During RPC, client and server exchange schemas in the connection handshake.

General Working of Avro

To use Avro, you need to follow the given workflow −

  1. Step 1 − Create schemas. Here you need to design Avro schema according to your data.

  2. Step 2 − Read the schemas into your program. This is done in two ways −
      By Generating a Class Corresponding to the Schema − Compile the schema using Avro. This generates a class file corresponding to the schema.
      By Using the Parsers Library − You can directly read the schema using the parsers library.

  3. Step 3 − Serialize the data using the serialization API provided for Avro, which is found in the package org.apache.avro.specific.

  4. Step 4 − Deserialize the data using deserialization API provided for Avro, which is found in the package org.apache.avro.specific.

AVRO - Serialization

Data is serialized for two objectives −

  1. For persistent storage

  2. To transport the data over network

What is Serialization?

Serialization is the process of translating data structures or objects’ state into binary or textual form, either to transport the data over the network or to store it on some persistent storage. Once the data is transported over the network or retrieved from the persistent storage, it needs to be deserialized again. Serialization is termed marshalling and deserialization is termed unmarshalling.

Serialization in Java

Java provides a mechanism, called object serialization, where an object can be represented as a sequence of bytes that includes the object’s data as well as information about the object’s type and the types of data stored in the object.

After a serialized object is written into a file, it can be read from the file and deserialized. That is, the type information and bytes that represent the object and its data can be used to recreate the object in memory.

The ObjectOutputStream and ObjectInputStream classes are used to serialize and deserialize an object, respectively, in Java.
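
To make this concrete, here is a minimal, self-contained sketch of Java’s built-in serialization; the Employee class and the file name employee.ser are illustrative only, not part of Avro or Hadoop.

import java.io.*;

class Employee implements Serializable {
   String name = "omar";
   int age = 21;
}

public class JavaSerializationDemo {
   public static void main(String[] args) throws IOException, ClassNotFoundException {

      //Serializing: writing the object's state and type information to a file
      try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("employee.ser"))) {
         out.writeObject(new Employee());
      }

      //Deserializing: recreating the object in memory from the stored bytes
      try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("employee.ser"))) {
         Employee e = (Employee) in.readObject();
         System.out.println(e.name + " " + e.age);
      }
   }
}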

Serialization in Hadoop

Generally in distributed systems like Hadoop, the concept of serialization is used for Interprocess Communication and Persistent Storage.

Interprocess Communication

  1. To establish interprocess communication between the nodes connected in a network, the RPC technique is used.

  2. RPC used internal serialization to convert the message into binary format before sending it to the remote node via network. At the other end the remote system deserializes the binary stream into the original message.

  3. The RPC serialization format is required to be as follows −
      Compact − To make the best use of network bandwidth, which is the most scarce resource in a data center.
      Fast − Since the communication between the nodes is crucial in distributed systems, the serialization and deserialization process should be quick, producing less overhead.
      Extensible − Protocols change over time to meet new requirements, so it should be straightforward to evolve the protocol in a controlled manner for clients and servers.
      Interoperable − The message format should support the nodes that are written in different languages.

Persistent Storage

Persistent storage is a digital storage facility that does not lose its data when the power supply is lost. Files, folders, and databases are examples of persistent storage.

Writable Interface

This is the interface in Hadoop which provides methods for serialization and deserialization. The following table describes the methods −

S.No.   Methods and Description

1       void readFields(DataInput in) − This method is used to deserialize the fields of the given object.

2       void write(DataOutput out) − This method is used to serialize the fields of the given object.

WritableComparable Interface

It is the combination of Writable and Comparable interfaces. This interface inherits Writable interface of Hadoop as well as Comparable interface of Java. Therefore it provides methods for data serialization, deserialization, and comparison.

S.No.   Methods and Description

1       int compareTo(class obj) − This method compares the current object with the given object obj.

In addition to these interfaces, Hadoop supports a number of wrapper classes that implement the WritableComparable interface. Each class wraps a Java primitive type. The class hierarchy of Hadoop serialization is given below −

[Figure: Class hierarchy of Hadoop serialization]

These classes are useful to serialize various types of data in Hadoop. For instance, let us consider the IntWritable class and see how it is used to serialize and deserialize data in Hadoop.

IntWritable Class

This class implements the Writable, Comparable, and WritableComparable interfaces. It wraps an integer data type and provides methods used to serialize and deserialize integer data.

Constructors

S.No.   Summary

1       IntWritable()

2       IntWritable(int value)

Methods

S.No.   Summary

1       int get() − Using this method you can get the integer value present in the current object.

2       void readFields(DataInput in) − This method is used to deserialize the data in the given DataInput object.

3       void set(int value) − This method is used to set the value of the current IntWritable object.

4       void write(DataOutput out) − This method is used to serialize the data in the current object to the given DataOutput object.

Serializing the Data in Hadoop

The procedure to serialize the integer type of data is discussed below.

  1. Instantiate IntWritable class by wrapping an integer value in it.

  2. Instantiate ByteArrayOutputStream class.

  3. Instantiate DataOutputStream class and pass the object of ByteArrayOutputStream class to it.

  4. Serialize the integer value in IntWritable object using write() method. This method needs an object of DataOutputStream class.

  5. The serialized data will be stored in the byte array object which was passed as a parameter to the DataOutputStream class at the time of instantiation. Convert the stream’s contents to a byte array using its toByteArray() method.

Example

The following example shows how to serialize data of integer type in Hadoop −

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

public class Serialization {
   public byte[] serialize() throws IOException{

      //Instantiating the IntWritable object
      IntWritable intwritable = new IntWritable(12);

      //Instantiating ByteArrayOutputStream object
      ByteArrayOutputStream byteoutputStream = new ByteArrayOutputStream();

      //Instantiating DataOutputStream object
      DataOutputStream dataOutputStream = new DataOutputStream(byteoutputStream);

      //Serializing the data
      intwritable.write(dataOutputStream);

      //storing the serialized object in bytearray
      byte[] byteArray = byteoutputStream.toByteArray();

      //Closing the OutputStream
      dataOutputStream.close();
      return(byteArray);
   }

   public static void main(String args[]) throws IOException{
      Serialization serialization = new Serialization();
      byte[] serialized = serialization.serialize();
      System.out.println("The serialized data is " + serialized.length + " bytes long");
   }
}

Deserializing the Data in Hadoop

The procedure to deserialize the integer type of data is discussed below −

  1. Instantiate the IntWritable class using its default constructor (it will receive the deserialized value).

  2. Instantiate the ByteArrayInputStream class by passing the serialized byte array to it.

  3. Instantiate the DataInputStream class and pass the object of the ByteArrayInputStream class to it.

  4. Deserialize the data in the object of DataInputStream using readFields() method of IntWritable class.

  5. The deserialized data will be stored in the object of IntWritable class. You can retrieve this data using get() method of this class.

Example

The following example shows how to deserialize the data of integer type in Hadoop −

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;

import org.apache.hadoop.io.IntWritable;

public class Deserialization {

   public void deserialize(byte[] byteArray) throws Exception{

      //Instantiating the IntWritable class
      IntWritable intwritable = new IntWritable();

      //Instantiating ByteArrayInputStream object
      ByteArrayInputStream inputStream = new ByteArrayInputStream(byteArray);

      //Instantiating DataInputStream object
      DataInputStream datainputstream = new DataInputStream(inputStream);

      //deserializing the data in DataInputStream
      intwritable.readFields(datainputstream);

      //printing the deserialized data
      System.out.println(intwritable.get());
   }

   public static void main(String args[]) throws Exception {
      Deserialization dese = new Deserialization();
      dese.deserialize(new Serialization().serialize());
   }
}

Advantage of Hadoop over Java Serialization

Hadoop’s Writable-based serialization is capable of reducing the object-creation overhead by reusing the Writable objects, which is not possible with Java’s native serialization framework.
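
As a rough illustration of this reuse, the following sketch (our own example, not from the Hadoop documentation) serializes a few integers back to back and then deserializes all of them into a single reusable IntWritable instance −

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

public class ReuseDemo {
   public static void main(String args[]) throws IOException {

      //Serializing a few integers, reusing one IntWritable on the writer side
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      IntWritable writable = new IntWritable();
      for (int i = 1; i <= 3; i++) {
         writable.set(i);         //reuse: only the value is reset
         writable.write(out);
      }
      out.close();

      //Deserializing all of them into one reusable object on the reader side
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
      IntWritable reusable = new IntWritable();
      while (in.available() > 0) {
         reusable.readFields(in);   //refills the same instance each time
         System.out.println(reusable.get());
      }
      in.close();
   }
}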

Disadvantages of Hadoop Serialization

To serialize Hadoop data, there are two ways −

  1. You can use the Writable classes, provided by Hadoop’s native library.

  2. You can also use Sequence Files which store the data in binary format.

The main drawback of these two mechanisms is that Writables and SequenceFiles have only a Java API and they cannot be written or read in any other language.

Therefore, any files created in Hadoop with the above two mechanisms cannot be read by any other third-party language, which makes Hadoop a limited box. To address this drawback, Doug Cutting created Avro, which is a language independent data structure.

AVRO - Environment Setup

The Apache Software Foundation provides various releases of Avro. You can download the required release from the Apache mirrors. Let us see how to set up the environment to work with Avro −

Downloading Avro

To download Apache Avro, proceed with the following −

  1. Open the web page Apache.org. You will see the homepage of Apache Avro as shown below −

[Screenshot: Apache Avro homepage]

  2. Click on project → releases. You will get a list of releases.

  3. Select the latest release, which leads you to a download link.

  4. mirror.nexcess is one of the links where you can find the list of all the libraries of different languages that Avro supports, as shown below −

[Screenshot: Avro language library downloads]

You can select and download the library for any of the languages provided. In this tutorial, we use Java. Hence download the jar files avro-1.7.7.jar and avro-tools-1.7.7.jar.

Avro with Eclipse

To use Avro in Eclipse environment, you need to follow the steps given below −

  1. Step 1 − Open Eclipse.

  2. Step 2 − Create a project.

  3. Step 3 − Right-click on the project name. You will get a shortcut menu.

  4. Step 4 − Click on Build Path. It leads you to another shortcut menu.

  5. Step 5 − Click on Configure Build Path… You can see the Properties window of your project as shown below −

[Screenshot: Properties window of the project]

  6. Step 6 − Under the Libraries tab, click on the Add External JARs… button.

  7. Step 7 − Select the jar file avro-1.7.7.jar you have downloaded.

  8. Step 8 − Click on OK.

Avro with Maven

You can also get the Avro library into your project using Maven. Given below is the pom.xml file for Avro.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

   <modelVersion>4.0.0</modelVersion>
   <groupId>Test</groupId>
   <artifactId>Test</artifactId>
   <version>0.0.1-SNAPSHOT</version>

   <build>
      <sourceDirectory>src</sourceDirectory>
      <plugins>
         <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.1</version>

            <configuration>
               <source>1.7</source>
               <target>1.7</target>
            </configuration>

         </plugin>
      </plugins>
   </build>

   <dependencies>
      <dependency>
         <groupId>org.apache.avro</groupId>
         <artifactId>avro</artifactId>
         <version>1.7.7</version>
      </dependency>

      <dependency>
         <groupId>org.apache.avro</groupId>
         <artifactId>avro-tools</artifactId>
         <version>1.7.7</version>
      </dependency>

      <dependency>
         <groupId>org.apache.logging.log4j</groupId>
         <artifactId>log4j-api</artifactId>
         <version>2.0-beta9</version>
      </dependency>

      <dependency>
         <groupId>org.apache.logging.log4j</groupId>
         <artifactId>log4j-core</artifactId>
         <version>2.0-beta9</version>
      </dependency>

   </dependencies>

</project>

Setting Classpath

To work with Avro in Linux environment, download the following jar files −

  1. avro-1.7.7.jar

  2. avro-tools-1.7.7.jar

  3. log4j-api-2.0-beta9.jar

  4. log4j-core-2.0-beta9.jar

Copy these files into a folder and set the classpath to that folder in the ~/.bashrc file as shown below.

#class path for Avro
export CLASSPATH=$CLASSPATH:/home/Hadoop/Avro_Work/jars/*

AVRO - Schemas

Avro, being a schema-based serialization utility, accepts schemas as input. Although various schema formats are available, Avro follows its own standards of defining schemas. These schemas describe the following details −

  1. type of file (record by default)

  2. location of record

  3. name of the record

  4. fields in the record with their corresponding data types

Using these schemas, you can store serialized values in binary format using less space. These values are stored without any metadata.

Creating Avro Schemas

The Avro schema is created in JavaScript Object Notation (JSON) document format, which is a lightweight text-based data interchange format. It is created in one of the following ways −

  1. A JSON string

  2. A JSON object

  3. A JSON array

Example − The following example shows a schema that defines a document, under the namespace Tutorialspoint, with the name Employee, having the fields name and age.

{
   "type" : "record",
   "namespace" : "Tutorialspoint",
   "name" : "Employee",
   "fields" : [
      { "name" : "Name" , "type" : "string" },
      { "name" : "Age" , "type" : "int" }
   ]
}

In this example, you can observe that there are four fields for each record −

  1. type − This field comes under the document as well as under the field named fields. In case of the document, it shows the type of the document, generally a record, because there are multiple fields. When it is a field, the type describes the data type.

  2. namespace − This field describes the name of the namespace in which the object resides.

  3. name − This field comes under the document as well as under the field named fields. In case of the document, it describes the schema name. This schema name, together with the namespace, uniquely identifies the schema within the store (Namespace.schema name). In the above example, the full name of the schema will be Tutorialspoint.Employee. In case of fields, it describes the name of the field.

  4. fields − This field holds a JSON array listing all of the fields in the schema, each having name and type attributes.

Primitive Data Types of Avro

An Avro schema has primitive data types as well as complex data types. The following table describes the primitive data types of Avro −

Data type   Description

null        Null is a type having no value.
int         32-bit signed integer.
long        64-bit signed integer.
float       Single-precision (32-bit) IEEE 754 floating-point number.
double      Double-precision (64-bit) IEEE 754 floating-point number.
bytes       Sequence of 8-bit unsigned bytes.
string      Unicode character sequence.

Complex Data Types of Avro

Along with primitive data types, Avro provides six complex data types namely Records, Enums, Arrays, Maps, Unions, and Fixed.

Record

A record data type in Avro is a collection of multiple attributes. It supports the following attributes −

  1. name − The value of this field holds the name of the record.

  2. namespace − The value of this field holds the name of the namespace where the object is stored.

  3. type − The value of this attribute holds either the type of the document (record) or the datatype of the field in the schema.

  4. fields − This field holds a JSON array, which holds the list of all of the fields in the schema, each having name and type attributes.

Example

Given below is the example of a record.

{
   "type" : "record",
   "namespace" : "Tutorialspoint",
   "name" : "Employee",
   "fields" : [
      { "name" : "Name", "type" : "string" },
      { "name" : "age", "type" : "int" }
   ]
}

Enum

An enumeration is a list of items in a collection. Avro enumeration supports the following attributes −

  1. name − The value of this field holds the name of the enumeration.

  2. namespace − The value of this field contains the string that qualifies the name of the Enumeration.

  3. symbols − The value of this field holds the enum’s symbols as an array of names.

Example

Given below is the example of an enumeration.

{
   "type" : "enum",
   "name" : "Numbers",
   "namespace": "data",
   "symbols" : [ "ONE", "TWO", "THREE", "FOUR" ]
}

Arrays

This data type defines an array field having a single attribute items. This items attribute specifies the type of items in the array.

Example

{ " type " : " array ", " items " : " int " }

Maps

The map data type is an array of key-value pairs; it organizes data as key-value pairs. The key for an Avro map must be a string. The values attribute holds the data type of the content of the map.

Example

{"type" : "map", "values" : "int"}

Unions

A union datatype is used whenever the field has one or more datatypes. They are represented as JSON arrays. For example, if a field could be either an int or null, then the union is represented as ["int", "null"].

Example

Given below is an example document using unions −

{
   "type" : "record",
   "namespace" : "tutorialspoint",
   "name" : "empdetails",
   "fields" :
   [
      { "name" : "experience", "type": ["int", "null"] },
      { "name" : "age", "type": "int" }
   ]
}

Fixed

This data type is used to declare a fixed-sized field that can be used for storing binary data. It has name and size as attributes: name holds the name of the field, and size holds the size of the field in bytes.

Example

{ "type" : "fixed" , "name" : "bdata", "size" : 1048576}

AVRO - Reference API

In the previous chapter, we described the input type of Avro, i.e., Avro schemas. In this chapter, we will explain the classes and methods used in the serialization and deserialization of Avro schemas.

SpecificDatumWriter Class

This class belongs to the package org.apache.avro.specific. It implements the DatumWriter interface which converts Java objects into an in-memory serialized format.

Constructor

S.No.   Description

1       SpecificDatumWriter(Schema schema)

Method

S.No.   Description

1       SpecificData getSpecificData() − Returns the SpecificData implementation used by this writer.

SpecificDatumReader Class

This class belongs to the package org.apache.avro.specific. It implements the DatumReader interface, which reads the data of a schema and determines the in-memory data representation. SpecificDatumReader is the class that supports generated Java classes.

Constructor

S.No.   Description

1       SpecificDatumReader(Schema schema) − Constructs a reader where the writer’s and reader’s schemas are the same.

Methods

S.No.   Description

1       SpecificData getSpecificData() − Returns the contained SpecificData.

2       void setSchema(Schema actual) − This method is used to set the writer’s schema.

DataFileWriter

This class is instantiated for the emp class in the examples that follow. It writes a sequence of serialized records of data conforming to a schema, along with the schema itself, in a file.

Constructor

S.No.   Description

1       DataFileWriter(DatumWriter<D> dout)

Methods

S.No.   Description

1       void append(D datum) − Appends a datum to a file.

2       DataFileWriter<D> appendTo(File file) − This method is used to open a writer appending to an existing file.

DataFileReader

This class provides random access to files written with DataFileWriter. It inherits the class DataFileStream.

Constructor

S.No.   Description

1       DataFileReader(File file, DatumReader<D> reader)

Methods

S.No.   Description

1       next() − Reads the next datum in the file.

2       boolean hasNext() − Returns true if more entries remain in this file.

Class Schema.Parser

This class is a parser for JSON-format schemas. It contains methods to parse the schema. It belongs to the org.apache.avro package.

Constructor

S.No.   Description

1       Schema.Parser()

Methods

S.No.   Description

1       parse(File file) − Parses the schema provided in the given file.

2       parse(InputStream in) − Parses the schema provided in the given InputStream.

3       parse(String s) − Parses the schema provided in the given String.

Interface GenericRecord

This interface provides methods to access the fields by name as well as by index.

Methods

S.No.   Description

1       Object get(String key) − Returns the value of a field given its name.

2       void put(String key, Object v) − Sets the value of a field given its name.

Class GenericData.Record

Constructor

S.No.   Description

1       GenericData.Record(Schema schema)

Methods

S.No.   Description

1       Object get(String key) − Returns the value of a field of the given name.

2       Schema getSchema() − Returns the schema of this instance.

3       void put(int i, Object v) − Sets the value of a field given its position in the schema.

4       void put(String key, Object value) − Sets the value of a field given its name.
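
As a small sketch of how these APIs fit together (assuming an emp.avsc schema file like the one defined in the next chapter), you can parse a schema, build a GenericData.Record, and read a field back by name −

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class RecordDemo {
   public static void main(String args[]) throws Exception {

      //Parsing the schema from a file
      Schema schema = new Schema.Parser().parse(new File("emp.avsc"));

      //Building a record and setting fields by name
      GenericRecord record = new GenericData.Record(schema);
      record.put("name", "omar");
      record.put("age", 21);

      //Reading a field back by name; the schema travels with the record
      System.out.println(record.get("name"));
      System.out.println(record.getSchema());
   }
}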

AVRO - Serialization By Generating Class

One can read an Avro schema into a program either by generating a class corresponding to the schema or by using the parsers library. This chapter describes how to read the schema by generating a class and serialize the data using Avro.

[Diagram: Avro serialization with code generation]

Serialization by Generating a Class

To serialize the data using Avro, follow the steps as given below −

  1. Write an Avro schema.

  2. Compile the schema using Avro utility. You get the Java code corresponding to that schema.

  3. Populate the schema with the data.

  4. Serialize it using Avro library.

Defining a Schema

Suppose you want a schema with the following details −

Field   Name     id    age   salary   address

Type    String   int   int   int      string

Create an Avro schema as shown below.

Save it as emp.avsc.

{
   "namespace": "tutorialspoint.com",
   "type": "record",
   "name": "emp",
   "fields": [
      {"name": "name", "type": "string"},
      {"name": "id", "type": "int"},
      {"name": "salary", "type": "int"},
      {"name": "age", "type": "int"},
      {"name": "address", "type": "string"}
   ]
}

Compiling the Schema

After creating an Avro schema, you need to compile the created schema using Avro tools. avro-tools-1.7.7.jar is the jar containing the tools.

Syntax to Compile an Avro Schema

java -jar <path/to/avro-tools-1.7.7.jar> compile schema <path/to/schema-file> <destination-folder>

Open the terminal in the home folder.

Create a new directory to work with Avro as shown below −

$ mkdir Avro_Work

In the newly created directory, create three sub-directories −

  1. First named schema, to place the schema.

  2. Second named with_code_gen, to place the generated code.

  3. Third named jars, to place the jar files.

$ mkdir schema
$ mkdir with_code_gen
$ mkdir jars

The following screenshot shows how your Avro_work folder should look after creating all the directories.

[Screenshot: Avro_work folder structure]
  1. Now /home/Hadoop/Avro_work/jars/avro-tools-1.7.7.jar is the path for the directory where you have downloaded avro-tools-1.7.7.jar file.

  2. /home/Hadoop/Avro_work/schema/ is the path for the directory where your schema file emp.avsc is stored.

  3. /home/Hadoop/Avro_work/with_code_gen is the directory where you want the generated class files to be stored.

Now compile the schema as shown below −

$ java -jar /home/Hadoop/Avro_work/jars/avro-tools-1.7.7.jar compile schema /home/Hadoop/Avro_work/schema/emp.avsc /home/Hadoop/Avro_work/with_code_gen

After compiling, a package according to the namespace of the schema is created in the destination directory. Within this package, the Java source code with the schema name is created. This generated source code is the Java code of the given schema and can be used in applications directly.

For example, in this instance a package/folder named tutorialspoint is created, which contains another folder named com (since the namespace is tutorialspoint.com), and within it you can observe the generated file emp.java. The following snapshot shows emp.java −

[Screenshot: the generated emp.java file]

This class is useful to create data according to the schema.

The generated class contains the following (a simplified sketch is shown after this list) −

  1. Default constructor, and parameterized constructor which accept all the variables of the schema.

  2. The setter and getter methods for all variables in the schema.

  3. The getSchema() method, which returns the schema.

  4. Builder methods.
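
To give an idea of its shape, below is a heavily simplified, hypothetical sketch of such a generated class (only two fields shown; the real emp.java is produced by avro-tools and contains far more plumbing) −

import org.apache.avro.AvroRuntimeException;
import org.apache.avro.Schema;
import org.apache.avro.specific.SpecificRecordBase;

//Hypothetical, simplified sketch − not the actual generated file
public class emp extends SpecificRecordBase {

   //The schema embedded in the generated class
   public static final Schema SCHEMA$ = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"emp\",\"namespace\":\"tutorialspoint.com\","
      + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"id\",\"type\":\"int\"}]}");

   private CharSequence name;
   private int id;

   public emp() { }   //default constructor

   @Override public Schema getSchema() { return SCHEMA$; }

   //Positional access used by Avro's serialization machinery
   @Override public Object get(int field) {
      switch (field) {
         case 0: return name;
         case 1: return id;
         default: throw new AvroRuntimeException("Bad index");
      }
   }

   @Override public void put(int field, Object value) {
      switch (field) {
         case 0: name = (CharSequence) value; break;
         case 1: id = (Integer) value; break;
         default: throw new AvroRuntimeException("Bad index");
      }
   }

   //Getters and setters for each schema field
   public CharSequence getName() { return name; }
   public void setName(CharSequence value) { this.name = value; }
   public int getId() { return id; }
   public void setId(int value) { this.id = value; }
}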

Creating and Serializing the Data

First of all, copy the generated java file used in this project into the current directory or import it from where it is located.

Now we can write a new Java file and instantiate the class in the generated file (emp) to add employee data to the schema.

Let us see the procedure to create data according to the schema using Apache Avro.

Step 1

Instantiate the generated emp class.

emp e1 = new emp();

Step 2

Using setter methods, insert the data of the first employee. For example, here we create the details of an employee named Omar.

e1.setName("omar");
e1.setAge(21);
e1.setSalary(30000);
e1.setAddress("Hyderabad");
e1.setId(001);

Similarly, fill in all employee details using setter methods.

Step 3

Create an object of DatumWriter interface using the SpecificDatumWriter class. This converts Java objects into in-memory serialized format. The following example instantiates SpecificDatumWriter class object for emp class.

DatumWriter<emp> empDatumWriter = new SpecificDatumWriter<emp>(emp.class);

Step 4

emp 类实例化 DataFileWriter 。该类会连同模式本身将符合模式的数据序列序列化记录写入文件。该类需要 DatumWriter 对象作为构造函数的参数。

Instantiate DataFileWriter for emp class. This class writes a sequence serialized records of data conforming to a schema, along with the schema itself, in a file. This class requires the DatumWriter object, as a parameter to the constructor.

DataFileWriter<emp> empFileWriter = new DataFileWriter<emp>(empDatumWriter);

Step 5

Open a new file to store the data matching the given schema using the create() method. This method requires the schema, and the path of the file where the data is to be stored, as parameters.

In the following example, the schema is passed using the getSchema() method, and the data file is stored in the path /home/Hadoop/Avro/serialized_file/emp.avro.

empFileWriter.create(e1.getSchema(),new File("/home/Hadoop/Avro/serialized_file/emp.avro"));

Step 6

Add all the created records to the file using the append() method as shown below −

empFileWriter.append(e1);
empFileWriter.append(e2);
empFileWriter.append(e3);

Example – Serialization by Generating a Class

The following complete program shows how to serialize data into a file using Apache Avro −

import java.io.File;
import java.io.IOException;

import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumWriter;

public class Serialize {
   public static void main(String args[]) throws IOException{

      //Instantiating generated emp class
      emp e1=new emp();

      //Creating values according the schema
      e1.setName("omar");
      e1.setAge(21);
      e1.setSalary(30000);
      e1.setAddress("Hyderabad");
      e1.setId(001);

      emp e2=new emp();

      e2.setName("ram");
      e2.setAge(30);
      e2.setSalary(40000);
      e2.setAddress("Hyderabad");
      e2.setId(002);

      emp e3=new emp();

      e3.setName("robbin");
      e3.setAge(25);
      e3.setSalary(35000);
      e3.setAddress("Hyderabad");
      e3.setId(003);

      //Instantiate DatumWriter class
      DatumWriter<emp> empDatumWriter = new SpecificDatumWriter<emp>(emp.class);
      DataFileWriter<emp> empFileWriter = new DataFileWriter<emp>(empDatumWriter);

      empFileWriter.create(e1.getSchema(), new File("/home/Hadoop/Avro_Work/with_code_gen/emp.avro"));

      empFileWriter.append(e1);
      empFileWriter.append(e2);
      empFileWriter.append(e3);

      empFileWriter.close();

      System.out.println("data successfully serialized");
   }
}

Browse through the directory where the generated code is placed. In this case, at home/Hadoop/Avro_work/with_code_gen.

In Terminal −

$ cd home/Hadoop/Avro_work/with_code_gen/

In GUI −

[Screenshot: generated code directory in the GUI]

Now copy and save the above program in a file named Serialize.java.

Compile and execute it as shown below −

$ javac Serialize.java
$ java Serialize

Output

data successfully serialized

If you verify the path given in the program, you can find the generated serialized file as shown below.

[Screenshot: the generated serialized file emp.avro]

AVRO - Deserialization By Generating Class

As described earlier, one can read an Avro schema into a program either by generating a class corresponding to the schema or by using the parsers library. This chapter describes how to read the schema by generating a class and Deserialize the data using Avro.

Deserialization by Generating a Class

The serialized data is stored in the file emp.avro. You can deserialize and read it using Avro.

[Screenshot: the stored serialized data file emp.avro]

Follow the procedure given below to deserialize the serialized data from a file.

Step 1

Create an object of the DatumReader interface using the SpecificDatumReader class.

DatumReader<emp> empDatumReader = new SpecificDatumReader<emp>(emp.class);

Step 2

emp 类实例化 DataFileReader 。此类从文件中读取序列化数据。它将 Dataumeader 对象和序列化数据所在的文件夹路径作为构造函数的参数。

Instantiate DataFileReader for emp class. This class reads serialized data from a file. It requires the Dataumeader object, and path of the file where the serialized data is existing, as a parameters to the constructor.

DataFileReader<emp> dataFileReader = new DataFileReader(new File("/path/to/emp.avro"), empDatumReader);

Step 3

Print the deserialized data, using the methods of DataFileReader.

  1. The hasNext() method returns true if there are more elements in the reader.

  2. The next() method of DataFileReader returns the data in the Reader.

while(dataFileReader.hasNext()){

   em=dataFileReader.next(em);
   System.out.println(em);
}

Example – Deserialization by Generating a Class

The following complete program shows how to deserialize the data in a file using Avro.

import java.io.File;
import java.io.IOException;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.io.DatumReader;
import org.apache.avro.specific.SpecificDatumReader;

public class Deserialize {
   public static void main(String args[]) throws IOException{

      //DeSerializing the objects
      DatumReader<emp> empDatumReader = new SpecificDatumReader<emp>(emp.class);

      //Instantiating DataFileReader
      DataFileReader<emp> dataFileReader = new DataFileReader<emp>(new
         File("/home/Hadoop/Avro_Work/with_code_gen/emp.avro"), empDatumReader);
      emp em=null;

      while(dataFileReader.hasNext()){

         em=dataFileReader.next(em);
         System.out.println(em);
      }
   }
}

Browse into the directory where the generated code is placed. In this case, at home/Hadoop/Avro_work/with_code_gen.

$ cd home/Hadoop/Avro_work/with_code_gen/

Now, copy and save the above program in the file named DeSerialize.java. Compile and execute it as shown below −

$ javac Deserialize.java
$ java Deserialize

Output

{"name": "omar", "id": 1, "salary": 30000, "age": 21, "address": "Hyderabad"}
{"name": "ram", "id": 2, "salary": 40000, "age": 30, "address": "Hyderabad"}
{"name": "robbin", "id": 3, "salary": 35000, "age": 25, "address": "Hyderabad"}

AVRO - Serialization Using Parsers

One can read an Avro schema into a program either by generating a class corresponding to a schema or by using the parsers library. In Avro, data is always stored with its corresponding schema. Therefore, we can always read a schema without code generation.

This chapter describes how to read the schema by using the parsers library and to serialize the data using Avro.

[Diagram: Avro serialization without code generation]

Serialization Using Parsers Library

To serialize the data, we need to read the schema, create data according to the schema, and serialize the data using the Avro API. The following procedure serializes the data without generating any code −

Step 1

First of all, read the schema from the file. To do so, use the Schema.Parser class. This class provides methods to parse the schema in different formats.

Instantiate the Schema.Parser class by passing the file path where the schema is stored.

Schema schema = new Schema.Parser().parse(new File("/path/to/emp.avsc"));

Step 2

Create the object of the GenericRecord interface by instantiating the GenericData.Record class as shown below. Pass the above created schema object to its constructor.

GenericRecord e1 = new GenericData.Record(schema);

Step 3

Insert the values into the record using the put() method of the GenericData.Record class.

e1.put("name", "ramu");
e1.put("id", 001);
e1.put("salary",30000);
e1.put("age", 25);
e1.put("address", "chennai");

Step 4

Create an object of the DatumWriter interface using the GenericDatumWriter class. It converts Java objects into an in-memory serialized format. The following example instantiates a GenericDatumWriter object for GenericRecord −

DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);

Step 5

emp 类实例化 DataFileWriter 。此类将符合模式的数据的序列化记录与模式本身一起写入文件。此类需要 DatumWriter 对象作为构造函数的参数。

Instantiate DataFileWriter for emp class. This class writes serialized records of data conforming to a schema, along with the schema itself, in a file. This class requires the DatumWriter object, as a parameter to the constructor.

DataFileWriter<emp> dataFileWriter = new DataFileWriter<emp>(empDatumWriter);

Step 6

Open a new file to store the data matching the given schema using the create() method. This method requires the schema, and the path of the file where the data is to be stored, as parameters.

In the example given below, the schema is passed using the getSchema() method, and the data file is stored in the path /home/Hadoop/Avro_work/without_code_gen/mydata.txt.

dataFileWriter.create(e1.getSchema(), new File("/home/Hadoop/Avro_work/without_code_gen/mydata.txt"));

Step 7

Add all the created records to the file using the append() method as shown below.

dataFileWriter.append(e1);
dataFileWriter.append(e2);

Example – Serialization Using Parsers

The following complete program shows how to serialize the data using parsers −

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;

import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import org.apache.avro.io.DatumWriter;

public class Seriali {
   public static void main(String args[]) throws IOException{

      //Instantiating the Schema.Parser class.
      Schema schema = new Schema.Parser().parse(new File("/home/Hadoop/Avro/schema/emp.avsc"));

      //Instantiating the GenericRecord class.
      GenericRecord e1 = new GenericData.Record(schema);

      //Insert data according to schema
      e1.put("name", "ramu");
      e1.put("id", 001);
      e1.put("salary",30000);
      e1.put("age", 25);
      e1.put("address", "chenni");

      GenericRecord e2 = new GenericData.Record(schema);

      e2.put("name", "rahman");
      e2.put("id", 002);
      e2.put("salary", 35000);
      e2.put("age", 30);
      e2.put("address", "Delhi");

      DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);

      DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
      dataFileWriter.create(schema, new File("/home/Hadoop/Avro_work/without_code_gen/mydata.txt"));

      dataFileWriter.append(e1);
      dataFileWriter.append(e2);
      dataFileWriter.close();

      System.out.println("data successfully serialized");
   }
}

Browse into the directory where the generated code is placed. In this case, at home/Hadoop/Avro_work/without_code_gen.

$ cd home/Hadoop/Avro_work/without_code_gen/
[Screenshot: without_code_gen directory]

Now copy and save the above program in the file named Serialize.java. Compile and execute it as shown below −

$ javac Serialize.java
$ java Serialize

Output

data successfully serialized

If you verify the path given in the program, you can find the generated serialized file as shown below.

[Screenshot: the generated serialized file mydata.txt]

AVRO - Deserialization Using Parsers

As mentioned earlier, one can read an Avro schema into a program either by generating a class corresponding to a schema or by using the parsers library. In Avro, data is always stored with its corresponding schema. Therefore, we can always read a serialized item without code generation.

This chapter describes how to read the schema using the parsers library and deserialize the data using Avro.

Deserialization Using Parsers Library

The serialized data is stored in the file mydata.txt. You can deserialize and read it using Avro.

[Diagram: Avro deserialization without code generation]

Follow the procedure given below to deserialize the serialized data from a file.

Step 1

First of all, read the schema from the file. To do so, use the Schema.Parser class. This class provides methods to parse the schema in different formats.

Instantiate the Schema.Parser class by passing the file path where the schema is stored.

Schema schema = new Schema.Parser().parse(new File("/path/to/emp.avsc"));

Step 2

Create an object of the DatumReader interface using the GenericDatumReader class.

DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema);

Step 3

Instantiate the DataFileReader class. This class reads serialized data from a file. It requires the DatumReader object, and the path of the file where the serialized data exists, as parameters to the constructor.

DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(new File("/path/to/mydata.txt"), datumReader);

Step 4

Print the deserialized data, using the methods of DataFileReader.

  1. The hasNext() method returns true if there are more elements in the reader.

  2. The next() method of DataFileReader returns the data in the Reader.

while(dataFileReader.hasNext()){

   em=dataFileReader.next(em);
   System.out.println(em);
}

Example – Deserialization Using Parsers Library

The following complete program shows how to deserialize the serialized data using the parsers library −

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;

public class Deserialize {
   public static void main(String args[]) throws Exception{

      //Instantiating the Schema.Parser class.
      Schema schema = new Schema.Parser().parse(new File("/home/Hadoop/Avro/schema/emp.avsc"));
      DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema);
      DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(new File("/home/Hadoop/Avro_Work/without_code_gen/mydata.txt"), datumReader);
      GenericRecord emp = null;

      while (dataFileReader.hasNext()) {
         emp = dataFileReader.next(emp);
         System.out.println(emp);
      }
   }
}

Browse into the directory where the generated code is placed. In this case, it is at home/Hadoop/Avro_work/without_code_gen.

$ cd home/Hadoop/Avro_work/without_code_gen/

Now copy and save the above program in the file named DeSerialize.java. Compile and execute it as shown below −

$ javac Deserialize.java
$ java Deserialize

Output

{"name": "ramu", "id": 1, "salary": 30000, "age": 25, "address": "chennai"}
{"name": "rahman", "id": 2, "salary": 35000, "age": 30, "address": "Delhi"}