PySpark Tutorial

PySpark - StorageLevel

StorageLevel decides how an RDD should be stored. In Apache Spark, StorageLevel determines whether an RDD should be stored in memory, on disk, or both. It also decides whether to serialize the RDD and whether to replicate its partitions.

The following code block has the class definition of a StorageLevel −

class pyspark.StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication = 1)
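
The five constructor arguments map one-to-one onto the flags used by the predefined levels listed below. As a minimal sketch (the variable name custom_level is illustrative, and the printed description assumes a PySpark version where StorageLevel's string form spells out its flags), an equivalent of MEMORY_AND_DISK can be built by hand:

from pyspark import StorageLevel

# useDisk=True, useMemory=True, useOffHeap=False, deserialized=False, replication=1
custom_level = StorageLevel(True, True, False, False, 1)
print(custom_level)   # prints something like: Disk Memory Serialized 1x Replicated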

Now, to decide the storage of an RDD, the following storage levels are available (a short sketch after the list shows how to inspect one of them) −

  1. DISK_ONLY = StorageLevel(True, False, False, False, 1)

  2. DISK_ONLY_2 = StorageLevel(True, False, False, False, 2)

  3. MEMORY_AND_DISK = StorageLevel(True, True, False, False, 1)

  4. MEMORY_AND_DISK_2 = StorageLevel(True, True, False, False, 2)

  5. MEMORY_AND_DISK_SER = StorageLevel(True, True, False, False, 1)

  6. MEMORY_AND_DISK_SER_2 = StorageLevel(True, True, False, False, 2)

  7. MEMORY_ONLY = StorageLevel(False, True, False, False, 1)

  8. MEMORY_ONLY_2 = StorageLevel(False, True, False, False, 2)

  9. MEMORY_ONLY_SER = StorageLevel(False, True, False, False, 1)

  10. MEMORY_ONLY_SER_2 = StorageLevel(False, True, False, False, 2)

  11. OFF_HEAP = StorageLevel(True, True, True, False, 1)
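
Each of these names is simply a predefined StorageLevel instance, so its flags can be read back as attributes. A minimal sketch, assuming pyspark is importable (for example inside a pyspark shell or a submitted script):

from pyspark import StorageLevel

level = StorageLevel.MEMORY_AND_DISK_2
# The constructor arguments are exposed as attributes on the instance.
print(level.useDisk, level.useMemory, level.useOffHeap, level.deserialized, level.replication)
# Expected: True True False False 2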

Let us consider the following example of StorageLevel, where we use the storage level MEMORY_AND_DISK_2, which means that the RDD partitions will be stored in memory and on disk and replicated twice.

------------------------------------storagelevel.py-------------------------------------
from pyspark import SparkContext
import pyspark

sc = SparkContext(
   "local",
   "storagelevel app"
)
rdd1 = sc.parallelize([1, 2])

# Persist the RDD in memory and on disk, with each partition replicated on two nodes.
rdd1.persist(pyspark.StorageLevel.MEMORY_AND_DISK_2)

# getStorageLevel() returns the StorageLevel currently set for the RDD.
print(rdd1.getStorageLevel())
------------------------------------storagelevel.py-------------------------------------

Command − The command is as follows −

$SPARK_HOME/bin/spark-submit storagelevel.py

Output − The output for the above command is given below. The printed string describes the flags of MEMORY_AND_DISK_2: the RDD may be stored on disk and in memory, the data is kept in serialized form, and each partition is replicated twice −

Disk Memory Serialized 2x Replicated