
AVRO - Serialization

Data is serialized for two objectives −

  1. For persistent storage

  2. To transport the data over a network

What is Serialization?

Serialization is the process of translating data structures or the state of objects into a binary or textual form, either to transport the data over a network or to store it in some persistent storage. Once the data has been transported over the network or retrieved from persistent storage, it needs to be deserialized again. Serialization is also termed marshalling, and deserialization is termed unmarshalling.

Serialization in Java

Java provides a mechanism, called object serialization, where an object can be represented as a sequence of bytes that includes the object's data as well as information about the object's type and the types of data stored in the object.

After a serialized object is written into a file, it can be read from the file and deserialized. That is, the type information and bytes that represent the object and its data can be used to recreate the object in memory.

The ObjectOutputStream and ObjectInputStream classes are used to serialize and deserialize an object, respectively, in Java.
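The following is a minimal sketch of this mechanism; it serializes a String (which already implements Serializable) into a byte array and reads it back −

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class JavaSerializationExample {
   public static void main(String[] args) throws IOException, ClassNotFoundException {

      //Serializing a String into a byte array
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      ObjectOutputStream out = new ObjectOutputStream(bytes);
      out.writeObject("hello");
      out.close();

      //Deserializing the byte array back into an object
      ObjectInputStream in = new ObjectInputStream(
         new ByteArrayInputStream(bytes.toByteArray()));
      String restored = (String) in.readObject();
      in.close();

      System.out.println(restored);   //prints hello
   }
}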

Serialization in Hadoop

Generally in distributed systems like Hadoop, the concept of serialization is used for Interprocess Communication and Persistent Storage.

Interprocess Communication

  1. To establish interprocess communication between the nodes connected in a network, the RPC (Remote Procedure Call) technique is used.

  2. RPC uses internal serialization to convert the message into binary format before sending it to the remote node over the network. At the other end, the remote system deserializes the binary stream into the original message.

  3. The RPC serialization format is required to be as follows −

     Compact − To make the best use of network bandwidth, which is the most scarce resource in a data center.

     Fast − Since the communication between the nodes is crucial in distributed systems, the serialization and deserialization process should be quick, producing little overhead.

     Extensible − Protocols change over time to meet new requirements, so it should be straightforward to evolve the protocol in a controlled manner for clients and servers.

     Interoperable − The message format should support the nodes that are written in different languages.

Persistent Storage

Persistent storage is digital storage that does not lose its data when the power supply is lost. Files, folders, and databases are examples of persistent storage.

Writable Interface

This is the interface in Hadoop which provides methods for serialization and deserialization. The methods are described below −

  1. void readFields(DataInput in) − This method is used to deserialize the fields of the given object.

  2. void write(DataOutput out) − This method is used to serialize the fields of the given object.
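As a rough sketch, a custom type can implement this interface by writing and reading its fields in the same fixed order. PointWritable below is a hypothetical class, not part of Hadoop −

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

//Hypothetical custom type implementing Hadoop's Writable interface
public class PointWritable implements Writable {
   private int x;
   private int y;

   public void set(int x, int y) { this.x = x; this.y = y; }

   @Override
   public void write(DataOutput out) throws IOException {
      //Serializing the fields in a fixed order
      out.writeInt(x);
      out.writeInt(y);
   }

   @Override
   public void readFields(DataInput in) throws IOException {
      //Deserializing the fields in the same order they were written
      x = in.readInt();
      y = in.readInt();
   }
}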

WritableComparable Interface

It is the combination of the Writable and Comparable interfaces. This interface inherits the Writable interface of Hadoop as well as the Comparable interface of Java. Therefore it provides methods for data serialization, deserialization, and comparison. The method it adds is described below −

  1. int compareTo(T obj) − This method compares the current object with the given object obj.
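For instance, Hadoop's IntWritable class (described below) implements this contract, so two wrapped integers can be compared directly −

import org.apache.hadoop.io.IntWritable;

public class CompareExample {
   public static void main(String[] args) {
      IntWritable a = new IntWritable(10);
      IntWritable b = new IntWritable(20);

      //compareTo() returns a negative value, zero, or a positive value
      //when a is less than, equal to, or greater than b
      System.out.println(a.compareTo(b) < 0);   //prints true
   }
}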

In addition to these interfaces, Hadoop provides a number of wrapper classes that implement the WritableComparable interface. Each class wraps a Java primitive type. The class hierarchy of Hadoop serialization is given below −

(Figure: Hadoop serialization class hierarchy)

These classes are useful to serialize various types of data in Hadoop. For instance, let us consider the IntWritable class and see how it is used to serialize and deserialize data in Hadoop.

IntWritable Class

This class implements the Writable, Comparable, and WritableComparable interfaces. It wraps an integer data type and provides methods used to serialize and deserialize integer data.

Constructors

  1. IntWritable()

  2. IntWritable(int value)

Methods

  1. int get() − Using this method you can get the integer value present in the current object.

  2. void readFields(DataInput in) − This method is used to deserialize the data in the given DataInput object.

  3. void set(int value) − This method is used to set the value of the current IntWritable object.

  4. void write(DataOutput out) − This method is used to serialize the data in the current object to the given DataOutput object.
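A small sketch of the get() and set() methods in isolation −

import org.apache.hadoop.io.IntWritable;

public class IntWritableExample {
   public static void main(String[] args) {

      //Instantiating IntWritable with the no-argument constructor
      IntWritable intwritable = new IntWritable();

      //Setting and then getting the wrapped integer value
      intwritable.set(42);
      System.out.println(intwritable.get());   //prints 42
   }
}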

Serializing the Data in Hadoop

The procedure to serialize the integer type of data is discussed below.

  1. Instantiate the IntWritable class by wrapping an integer value in it.

  2. Instantiate the ByteArrayOutputStream class.

  3. Instantiate the DataOutputStream class and pass the object of the ByteArrayOutputStream class to it.

  4. Serialize the integer value in the IntWritable object using the write() method. This method needs an object of the DataOutputStream class.

  5. The serialized data will be stored in the byte array object which is passed as a parameter to the DataOutputStream class at the time of instantiation. Convert the data in the object to a byte array.

Example

The following example shows how to serialize data of integer type in Hadoop −

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;

public class Serialization {
   public byte[] serialize() throws IOException {

      //Instantiating the IntWritable object
      IntWritable intwritable = new IntWritable(12);

      //Instantiating ByteArrayOutputStream object
      ByteArrayOutputStream byteoutputStream = new ByteArrayOutputStream();

      //Instantiating DataOutputStream object
      DataOutputStream dataOutputStream = new DataOutputStream(byteoutputStream);

      //Serializing the data
      intwritable.write(dataOutputStream);

      //Storing the serialized object in byteArray
      byte[] byteArray = byteoutputStream.toByteArray();

      //Closing the OutputStream
      dataOutputStream.close();
      return byteArray;
   }

   public static void main(String[] args) throws IOException {
      Serialization serialization = new Serialization();
      byte[] serializedData = serialization.serialize();

      //Printing the serialized bytes
      System.out.println(Arrays.toString(serializedData));
   }
}

Deserializing the Data in Hadoop

The procedure to deserialize the integer type of data is discussed below −

  1. Instantiate the IntWritable class (this time, without wrapping any value in it).

  2. Instantiate the ByteArrayInputStream class by passing the serialized byte array to it.

  3. Instantiate the DataInputStream class and pass the object of the ByteArrayInputStream class to it.

  4. Deserialize the data in the DataInputStream object using the readFields() method of the IntWritable class.

  5. The deserialized data will be stored in the object of the IntWritable class. You can retrieve this data using the get() method of this class.

Example

The following example shows how to deserialize data of integer type in Hadoop −

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;

import org.apache.hadoop.io.IntWritable;

public class Deserialization {

   public void deserialize(byte[] byteArray) throws Exception {

      //Instantiating the IntWritable class
      IntWritable intwritable = new IntWritable();

      //Instantiating ByteArrayInputStream object
      ByteArrayInputStream inputStream = new ByteArrayInputStream(byteArray);

      //Instantiating DataInputStream object
      DataInputStream datainputstream = new DataInputStream(inputStream);

      //Deserializing the data in the DataInputStream
      intwritable.readFields(datainputstream);

      //Printing the deserialized data
      System.out.println(intwritable.get());
   }

   public static void main(String args[]) throws Exception {
      Deserialization dese = new Deserialization();
      dese.deserialize(new Serialization().serialize());
   }
}

Advantage of Hadoop over Java Serialization

Hadoop's Writable-based serialization is capable of reducing the object-creation overhead by reusing the Writable objects, which is not possible with Java's native serialization framework.
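The following is a minimal sketch of this reuse pattern: a single IntWritable instance serializes many values without any per-value allocation −

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

public class ReuseExample {
   public static void main(String[] args) throws IOException {
      ByteArrayOutputStream byteoutputStream = new ByteArrayOutputStream();
      DataOutputStream dataOutputStream = new DataOutputStream(byteoutputStream);

      //One IntWritable instance is reused for every value,
      //instead of allocating a new object per record
      IntWritable intwritable = new IntWritable();
      for (int i = 0; i < 1000; i++) {
         intwritable.set(i);
         intwritable.write(dataOutputStream);
      }
      dataOutputStream.close();

      System.out.println(byteoutputStream.size());   //prints 4000 (4 bytes per int)
   }
}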

Disadvantages of Hadoop Serialization

To serialize Hadoop data, there are two ways −

  1. You can use the Writable classes provided by Hadoop's native library.

  2. You can also use Sequence Files, which store the data in binary format (see the sketch below).
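As a minimal sketch of the second approach, assuming the option-based SequenceFile.Writer API of Hadoop 2.x and later (numbers.seq is a hypothetical output path) −

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileExample {
   public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Path path = new Path("numbers.seq");   //hypothetical output path

      //Writing key-value pairs in the binary SequenceFile format
      SequenceFile.Writer writer = SequenceFile.createWriter(conf,
         SequenceFile.Writer.file(path),
         SequenceFile.Writer.keyClass(IntWritable.class),
         SequenceFile.Writer.valueClass(Text.class));

      writer.append(new IntWritable(1), new Text("one"));
      writer.append(new IntWritable(2), new Text("two"));
      writer.close();
   }
}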

The major drawback of these two mechanisms is that Writables and SequenceFiles have only a Java API; they cannot be written or read in any other language.

Therefore any of the files created in Hadoop with the above two mechanisms cannot be read by any other (third) language, which makes Hadoop a limited box. To solve this drawback, Doug Cutting created Avro, which is a language-independent data structure.