PySpark: A Concise Tutorial

PySpark - Environment Setup

In this chapter, we will set up the environment for PySpark.

Note − This assumes that you already have Java and Scala installed on your computer.

Let us now download and set up PySpark with the following steps.

Step 1 − Go to the official Apache Spark download page and download the latest version of Apache Spark available there. In this tutorial, we are using spark-2.1.0-bin-hadoop2.7. If you download a different version, adjust the file and directory names in the commands that follow.

Step 2 − Now, extract the downloaded Spark tar file. By default, it is saved in the Downloads directory.

# tar -xvf Downloads/spark-2.1.0-bin-hadoop2.7.tgz
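
The same extraction can also be scripted with Python's standard tarfile module. The following is only a sketch, assuming a Python 3 interpreter and an archive you trust (it comes from the official download page); the helper name extract_spark is made up for this illustration.

```python
import tarfile
from pathlib import Path


def extract_spark(archive, dest="."):
    """Extract a Spark .tgz archive into dest; return the top-level paths created."""
    dest = Path(dest)
    with tarfile.open(archive, "r:gz") as tar:
        # Collect the first path component of every member,
        # e.g. spark-2.1.0-bin-hadoop2.7
        top_level = set()
        for member in tar.getmembers():
            parts = Path(member.name).parts
            if parts:
                top_level.add(parts[0])
        tar.extractall(dest)
    return [dest / name for name in sorted(top_level)]


if __name__ == "__main__":
    # Path taken from the tutorial; adjust it for your own download.
    archive = Path("Downloads/spark-2.1.0-bin-hadoop2.7.tgz")
    if archive.exists():
        print(extract_spark(archive))
    else:
        print("{} not found".format(archive))
```

The returned list should contain a single directory, which is the one to use as SPARK_HOME in the next step.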

This creates a directory named spark-2.1.0-bin-hadoop2.7. Before starting PySpark, set the following environment variables so that the Spark path and the Py4j path can be found.

export SPARK_HOME=/home/hadoop/spark-2.1.0-bin-hadoop2.7
export PATH=$PATH:/home/hadoop/spark-2.1.0-bin-hadoop2.7/bin
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
export PATH=$SPARK_HOME/python:$PATH

Alternatively, to make these settings persist across shell sessions, put them in your .bashrc file. Then run the following command to apply them to the current shell.

# source ~/.bashrc
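
To confirm that the variables took effect, a small check along these lines may help. It is a sketch, assuming a Python 3 interpreter; the function name check_spark_env and the wording of its messages are illustrative and not part of Spark.

```python
import os
from pathlib import Path


def check_spark_env(environ=None):
    """Return a list of problems found in the Spark-related environment variables."""
    environ = os.environ if environ is None else environ

    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        return ["SPARK_HOME is not set"]

    problems = []
    if not Path(spark_home, "bin").is_dir():
        problems.append("{}/bin does not exist".format(spark_home))

    # PYTHONPATH should list $SPARK_HOME/python and a py4j source zip.
    entries = [e for e in environ.get("PYTHONPATH", "").split(os.pathsep) if e]
    if Path(spark_home, "python") not in [Path(e) for e in entries]:
        problems.append("PYTHONPATH is missing $SPARK_HOME/python")
    if not any(Path(e).name.startswith("py4j") and e.endswith(".zip") for e in entries):
        problems.append("PYTHONPATH is missing the py4j source zip")
    return problems


if __name__ == "__main__":
    issues = check_spark_env()
    print("Environment looks fine" if not issues else "\n".join(issues))
```

An empty list means the three settings are consistent with each other. It does not prove that Spark itself starts.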

Now that the environment is set, let us go to the Spark directory and invoke the PySpark shell by running the following command −

# ./bin/pyspark

This will start your PySpark shell.

Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
Using Python version 2.7.12 (default, Nov 19 2016 06:48:10)
SparkSession available as 'spark'.
>>>
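
Outside the shell, the setup can be exercised from a standalone script. The sketch below assumes a Python 3 interpreter; the helper names pyspark_importable and smoke_test and the application name "setup-check" are invented for this example. In the shell the session already exists as spark, whereas a script has to build one itself.

```python
import importlib.util


def pyspark_importable():
    """Return True if the pyspark package can be located on the Python path."""
    return importlib.util.find_spec("pyspark") is not None


def smoke_test():
    """Start a local SparkSession, count a five-row range, and stop the session."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[1]")        # run Spark locally with a single worker thread
        .appName("setup-check")    # hypothetical application name
        .getOrCreate()
    )
    try:
        return spark.range(5).count()  # rows with id 0 through 4
    finally:
        spark.stop()
```

If pyspark_importable() returns False, recheck SPARK_HOME and PYTHONPATH before calling smoke_test(). On a working installation, smoke_test() should return 5.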