Apache Spark Tutorial
Apache Spark - RDD
Resilient Distributed Datasets
Resilient Distributed Datasets (RDD) is the fundamental data structure of Spark. It is an immutable, distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations on data in stable storage or on other RDDs. An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
There are two ways to create RDDs − parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.
Spark makes use of the concept of RDD to achieve faster and more efficient MapReduce operations. Let us first discuss how MapReduce operations take place and why they are not so efficient.
Data Sharing is Slow in MapReduce
MapReduce is widely adopted for processing and generating large datasets with a parallel, distributed algorithm on a cluster. It allows users to write parallel computations using a set of high-level operators, without having to worry about work distribution and fault tolerance.
Unfortunately, in most current frameworks, the only way to reuse data between computations (for example, between two MapReduce jobs) is to write it to an external stable storage system such as HDFS. Although these frameworks provide numerous abstractions for accessing a cluster's computational resources, users still want more.
Both iterative and interactive applications require faster data sharing across parallel jobs. Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. With regard to the storage system, most Hadoop applications spend more than 90% of their time doing HDFS read-write operations.
Iterative Operations on MapReduce
Multi-stage applications reuse intermediate results across multiple computations. The following illustration shows how the current framework works while performing iterative operations on MapReduce. Writing each intermediate result to stable storage incurs substantial overhead from data replication, disk I/O, and serialization, which makes the system slow.
Interactive Operations on MapReduce
A user runs ad-hoc queries on the same subset of data. Each query performs disk I/O against stable storage, which can dominate application execution time.
The following illustration shows how the current framework works while performing interactive queries on MapReduce.
Data Sharing using Spark RDD
Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. Most Hadoop applications spend more than 90% of their time doing HDFS read-write operations.
Recognizing this problem, researchers developed a specialized framework called Apache Spark. The key idea of Spark is *R*esilient *D*istributed *D*atasets (RDD), which support in-memory processing. This means the state of a job is stored in memory as an object, and that object can be shared across jobs. Data sharing in memory is 10 to 100 times faster than over the network or from disk.
Let us now look at how iterative and interactive operations take place in Spark RDD.
Iterative Operations on Spark RDD
The illustration given below shows iterative operations on Spark RDD. Spark stores intermediate results in distributed memory instead of stable storage (disk), which makes the system faster.
Note − If the distributed memory (RAM) is not sufficient to store intermediate results (the state of the job), Spark will store those results on disk.
Interactive Operations on Spark RDD
This illustration shows interactive operations on Spark RDD. If different queries are run repeatedly on the same set of data, that data can be kept in memory for better execution times.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes.