Spark SQL Tutorial
Spark SQL - Data Sources
The DataFrame interface allows different data sources to work with Spark SQL. A DataFrame can be operated on like a normal RDD, and registering it as a temporary table lets you run SQL queries over its data.
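For concreteness, here is a minimal Scala sketch of that workflow against a Spark 1.x SQLContext; the file name employee.json and the column names used in the query are illustrative assumptions, not part of this chapter.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TempTableExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TempTableExample"))
    val sqlContext = new SQLContext(sc)

    // Load a JSON file into a DataFrame (the file name is an assumption).
    val df = sqlContext.read.json("employee.json")

    // Register the DataFrame as a temporary table so SQL can reference it.
    df.registerTempTable("employee")

    // Run a SQL query over the registered table.
    val adults = sqlContext.sql("SELECT name, age FROM employee WHERE age >= 18")
    adults.show()
  }
}
```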
In this chapter, we describe the general methods for loading and saving data using the different Spark data sources, and then discuss in detail the specific options available for the built-in data sources.
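As a preview of those general methods, the sketch below uses the generic read/write API with an explicit format; the file names and the choice of JSON input and Parquet output are illustrative assumptions.

```scala
import org.apache.spark.sql.{SQLContext, SaveMode}

// Assumes an existing SQLContext, as created in the previous sketch.
def convertJsonToParquet(sqlContext: SQLContext): Unit = {
  // Generic load: name the format explicitly (Parquet is the default when omitted).
  val df = sqlContext.read.format("json").load("employee.json")

  // Generic save: write the same data out as Parquet, overwriting any previous output.
  df.write.format("parquet").mode(SaveMode.Overwrite).save("employee.parquet")
}
```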
There are different types of data sources available in Spark SQL, some of which are listed below; a brief code sketch covering each of them follows the table.
| Sr. No | Data Source | Description |
| --- | --- | --- |
| 1 | JSON Datasets | Spark SQL can automatically capture the schema of a JSON dataset and load it as a DataFrame. |
| 2 | Hive Tables | Hive comes bundled with the Spark library as HiveContext, which inherits from SQLContext. |
| 3 | Parquet Files | Parquet is a columnar format, supported by many data processing systems. |
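To make the table concrete, here is a short sketch touching each built-in source once; the file and table names (employee.json, employee_hive, employee.parquet) are placeholders, and a working Hive installation is assumed for the HiveContext part. Each source is covered in detail in its own chapter.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

object DataSourcesOverview {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DataSourcesOverview"))
    val sqlContext = new SQLContext(sc)

    // 1. JSON dataset: the schema is inferred automatically from the JSON records.
    val jsonDF = sqlContext.read.json("employee.json")
    jsonDF.printSchema()

    // 2. Hive table: HiveContext inherits from SQLContext and understands HiveQL.
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("SELECT * FROM employee_hive").show()

    // 3. Parquet file: a columnar format that stores the schema with the data.
    val parquetDF = sqlContext.read.parquet("employee.parquet")
    parquetDF.show()
  }
}
```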