PySpark Tutorial
PySpark - Serializers
Serialization is used for performance tuning in Apache Spark. All data that is sent over the network, written to disk, or persisted in memory must be serialized, so serialization plays an important role in costly operations.
PySpark supports custom serializers for performance tuning. The following two serializers are supported by PySpark −
MarshalSerializer
Serializes objects using Python's marshal module. This serializer is faster than PickleSerializer, but supports fewer datatypes.
class pyspark.MarshalSerializer
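MarshalSerializer is built on Python's standard marshal module, which is fast but limited to core built-in types. A minimal stdlib sketch of that trade-off (using `marshal` directly, outside of Spark; the `Point` class is a hypothetical example):

```python
import marshal

# marshal round-trips core built-in types (numbers, strings, lists, dicts)
data = {"counts": [1, 2, 3], "label": "demo"}
blob = marshal.dumps(data)
assert marshal.loads(blob) == data

# but it cannot serialize instances of user-defined classes
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

try:
    marshal.dumps(Point(1, 2))
except ValueError:
    print("marshal cannot serialize Point instances")
```

This is why MarshalSerializer is a good fit for RDDs of plain numbers or strings, but not for jobs that shuffle arbitrary Python objects.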
PickleSerializer
Serializes objects using Python's pickle module. This serializer supports nearly any Python object, but may not be as fast as more specialized serializers.
class pyspark.PickleSerializer
Let us look at an example of PySpark serialization. Here, we serialize the data using MarshalSerializer.
--------------------------------------serializing.py-------------------------------------
from pyspark.context import SparkContext
from pyspark.serializers import MarshalSerializer

# Use MarshalSerializer instead of the default PickleSerializer
sc = SparkContext("local", "serialization app", serializer=MarshalSerializer())

# Double the first 1000 integers and collect the first 10 results
print(sc.parallelize(list(range(1000))).map(lambda x: 2 * x).take(10))
sc.stop()
--------------------------------------serializing.py-------------------------------------
Command − The command is as follows −
$SPARK_HOME/bin/spark-submit serializing.py
Output − The output of the above command is −
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]