Pyspark 简明教程

PySpark - Introduction

在本章中，我们将了解什么是 Apache Spark 以及 PySpark 是如何开发的。

In this chapter, we will get ourselves acquainted with what Apache Spark is and how was PySpark developed.

Spark – Overview

Apache Spark 是一个闪电般快速的实时处理框架。它执行内存计算以实时分析数据。它作为一个角色出现，因为 Apache Hadoop MapReduce 仅执行批处理并且缺乏实时处理功能。因此，引入了 Apache Spark，因为它可以实时执行流处理，并且还可以处理批处理。

Apache Spark is a lightning fast real-time processing framework. It does in-memory computations to analyze data in real-time. It came into picture as Apache Hadoop MapReduce was performing batch processing only and lacked a real-time processing feature. Hence, Apache Spark was introduced as it can perform stream processing in real-time and can also take care of batch processing.

除了实时和批处理之外，Apache Spark 还支持交互式查询和迭代算法。Apache Spark 有自己的集群管理器，它可以在其中托管其应用程序。它利用 Apache Hadoop 进行存储和处理。它使用 HDFS （Hadoop 分布式文件系统）进行存储，并且它也可以在 YARN 上运行 Spark 应用程序。

Apart from real-time and batch processing, Apache Spark supports interactive queries and iterative algorithms also. Apache Spark has its own cluster manager, where it can host its application. It leverages Apache Hadoop for both storage and processing. It uses HDFS (Hadoop Distributed File system) for storage and it can run Spark applications on YARN as well.

PySpark – Overview

Apache Spark 是用 Scala programming language 编写的。为了在 Spark 中支持 Python，Apache Spark 社区发布了一个工具 PySpark。使用 PySpark，您还可以使用 Python 编程语言来处理 RDDs 。这是因为一个名为 Py4j 的库，它可以实现此功能。

Apache Spark is written in Scala programming language. To support Python with Spark, Apache Spark Community released a tool, PySpark. Using PySpark, you can work with RDDs in Python programming language also. It is because of a library called Py4j that they are able to achieve this.

PySpark 提供 PySpark Shell ，它将 Python API 链接到 Spark 核心并初始化 Spark 上下文。当今大多数数据科学家和分析专家由于其丰富的库集而使用 Python。将 Python 与 Spark 集成对他们来说是一个福音。

PySpark offers PySpark Shell which links the Python API to the spark core and initializes the Spark context. Majority of data scientists and analytics experts today use Python because of its rich library set. Integrating Python with Spark is a boon to them.