PySpark Tutorial

PySpark - SparkFiles

In Apache Spark, you can upload your files using sc.addFile (sc is your default SparkContext) and get the path on a worker using SparkFiles.get. Thus, SparkFiles resolves the paths to files added through SparkContext.addFile().

SparkFiles contains the following class methods −

  1. get(filename)

  2. getRootDirectory()

Let us understand them in detail.

get(filename)

It specifies the path of the file that is added through SparkContext.addFile().

getRootDirectory()

It specifies the path to the root directory, which contains the files added through SparkContext.addFile().
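
As a quick sketch of getRootDirectory() (which the full program below does not call), the following snippet prints the root directory alongside a resolved file path. The script name rootdir.py and the app name "RootDir App" are illustrative placeholders; the file path is the same one used in the full example.

----------------------------------------rootdir.py------------------------------------
from pyspark import SparkContext
from pyspark import SparkFiles

# "RootDir App" is an illustrative app name, not part of the original example
sc = SparkContext("local", "RootDir App")
sc.addFile("/home/hadoop/examples_pyspark/finddistance.R")

# Every file shipped via sc.addFile() lands in this root directory,
# and SparkFiles.get() resolves to a path underneath it
print("Root Directory -> %s" % SparkFiles.getRootDirectory())
print("Absolute Path -> %s" % SparkFiles.get("finddistance.R"))
----------------------------------------rootdir.py------------------------------------

As the sample output at the end of this page suggests, the root directory is a temporary per-application directory (for example, under /tmp/spark-.../userFiles-...).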

----------------------------------------sparkfile.py------------------------------------
from pyspark import SparkContext
from pyspark import SparkFiles

# Local path of the file to distribute, and its bare file name
finddistance = "/home/hadoop/examples_pyspark/finddistance.R"
finddistancename = "finddistance.R"

sc = SparkContext("local", "SparkFile App")
sc.addFile(finddistance)  # ship the file to every worker

# Resolve the file's absolute path on this node
print("Absolute Path -> %s" % SparkFiles.get(finddistancename))
----------------------------------------sparkfile.py------------------------------------

Command − The command is as follows −

$SPARK_HOME/bin/spark-submit sparkfile.py

Output − The output for the above command is −

Absolute Path ->
   /tmp/spark-f1170149-af01-4620-9805-f61c85fecee4/userFiles-641dfd0f-240b-4264-a650-4e06e7a57839/finddistance.R