pandas Chinese Reference Guide

PyArrow Functionality

pandas can utilize PyArrow to extend functionality and improve the performance of various APIs. This includes:

  1. More extensive data types compared to NumPy

  2. Missing data support (NA) for all data types

  3. Performant IO reader integration

  4. Interoperability with other dataframe libraries based on the Apache Arrow specification (e.g. polars, cuDF)

To use this functionality, please ensure you have installed the minimum supported PyArrow version.

Data Structure Integration

A Series, Index, or the columns of a DataFrame can be directly backed by a pyarrow.ChunkedArray, which is similar to a NumPy array. To construct these from the main pandas data structures, pass a string of the type followed by [pyarrow], e.g. "int64[pyarrow]", into the dtype parameter.

In [1]: ser = pd.Series([-1.5, 0.2, None], dtype="float32[pyarrow]")

In [2]: ser
Out[2]:
0    -1.5
1     0.2
2    <NA>
dtype: float[pyarrow]

In [3]: idx = pd.Index([True, None], dtype="bool[pyarrow]")

In [4]: idx
Out[4]: Index([True, <NA>], dtype='bool[pyarrow]')

In [5]: df = pd.DataFrame([[1, 2], [3, 4]], dtype="uint64[pyarrow]")

In [6]: df
Out[6]:
   0  1
0  1  2
1  3  4

The string alias "string[pyarrow]" maps to pd.StringDtype("pyarrow"), which is not equivalent to specifying dtype=pd.ArrowDtype(pa.string()). Operations on the data generally behave the same, except that pd.StringDtype("pyarrow") can return NumPy-backed nullable types while pd.ArrowDtype(pa.string()) returns ArrowDtype.

In [7]: import pyarrow as pa

In [8]: data = list("abc")

In [9]: ser_sd = pd.Series(data, dtype="string[pyarrow]")

In [10]: ser_ad = pd.Series(data, dtype=pd.ArrowDtype(pa.string()))

In [11]: ser_ad.dtype == ser_sd.dtype
Out[11]: False

In [12]: ser_sd.str.contains("a")
Out[12]:
0     True
1    False
2    False
dtype: boolean

In [13]: ser_ad.str.contains("a")
Out[13]:
0     True
1    False
2    False
dtype: bool[pyarrow]

For PyArrow types that accept parameters, you can pass a PyArrow type with those parameters into ArrowDtype and use the result in the dtype parameter.

In [14]: import pyarrow as pa

In [15]: list_str_type = pa.list_(pa.string())

In [16]: ser = pd.Series([["hello"], ["there"]], dtype=pd.ArrowDtype(list_str_type))

In [17]: ser
Out[17]:
0    ['hello']
1    ['there']
dtype: list<item: string>[pyarrow]

In [18]: from datetime import time

In [19]: idx = pd.Index([time(12, 30), None], dtype=pd.ArrowDtype(pa.time64("us")))

In [20]: idx
Out[20]: Index([12:30:00, <NA>], dtype='time64[us][pyarrow]')

In [21]: from decimal import Decimal

In [22]: decimal_type = pd.ArrowDtype(pa.decimal128(3, scale=2))

In [23]: data = [[Decimal("3.19"), None], [None, Decimal("-1.23")]]

In [24]: df = pd.DataFrame(data, dtype=decimal_type)

In [25]: df
Out[25]:
      0      1
0  3.19   <NA>
1  <NA>  -1.23

If you already have a pyarrow.Array or pyarrow.ChunkedArray, you can pass it into arrays.ArrowExtensionArray to construct the associated Series, Index or DataFrame object.

In [26]: pa_array = pa.array(
   ....:     [{"1": "2"}, {"10": "20"}, None],
   ....:     type=pa.map_(pa.string(), pa.string()),
   ....: )
   ....:

In [27]: ser = pd.Series(pd.arrays.ArrowExtensionArray(pa_array))

In [28]: ser
Out[28]:
0      [('1', '2')]
1    [('10', '20')]
2              <NA>
dtype: map<string, string>[pyarrow]

To retrieve a pyarrow.ChunkedArray from a Series or Index, you can call the pyarrow array constructor on the Series or Index.

In [29]: ser = pd.Series([1, 2, None], dtype="uint8[pyarrow]")

In [30]: pa.array(ser)
Out[30]:
<pyarrow.lib.UInt8Array object at 0x7ff2a2968400>
[
  1,
  2,
  null
]

In [31]: idx = pd.Index(ser)

In [32]: pa.array(idx)
Out[32]:
<pyarrow.lib.UInt8Array object at 0x7ff2a2968460>
[
  1,
  2,
  null
]

To convert a pyarrow.Table to a DataFrame, you can call the pyarrow.Table.to_pandas() method with types_mapper=pd.ArrowDtype.

In [33]: table = pa.table([pa.array([1, 2, 3], type=pa.int64())], names=["a"])

In [34]: df = table.to_pandas(types_mapper=pd.ArrowDtype)

In [35]: df
Out[35]:
   a
0  1
1  2
2  3

In [36]: df.dtypes
Out[36]:
a    int64[pyarrow]
dtype: object

Operations

PyArrow data structure integration is implemented through pandas' ExtensionArray interface; therefore, supported functionality exists wherever this interface is integrated within the pandas API. Additionally, this functionality is accelerated with PyArrow compute functions where available. This includes:

  1. Numeric aggregations

  2. Numeric arithmetic

  3. Numeric rounding

  4. Logical and comparison functions

  5. String functionality

  6. Datetime functionality

The following are just some examples of operations that are accelerated by native PyArrow compute functions.

In [37]: import pyarrow as pa

In [38]: ser = pd.Series([-1.545, 0.211, None], dtype="float32[pyarrow]")

In [39]: ser.mean()
Out[39]: -0.6669999808073044

In [40]: ser + ser
Out[40]:
0    -3.09
1    0.422
2     <NA>
dtype: float[pyarrow]

In [41]: ser > (ser + 1)
Out[41]:
0    False
1    False
2     <NA>
dtype: bool[pyarrow]

In [42]: ser.dropna()
Out[42]:
0   -1.545
1    0.211
dtype: float[pyarrow]

In [43]: ser.isna()
Out[43]:
0    False
1    False
2     True
dtype: bool

In [44]: ser.fillna(0)
Out[44]:
0   -1.545
1    0.211
2      0.0
dtype: float[pyarrow]

In [45]: ser_str = pd.Series(["a", "b", None], dtype=pd.ArrowDtype(pa.string()))

In [46]: ser_str.str.startswith("a")
Out[46]:
0     True
1    False
2     <NA>
dtype: bool[pyarrow]

In [47]: from datetime import datetime

In [48]: pa_type = pd.ArrowDtype(pa.timestamp("ns"))

In [49]: ser_dt = pd.Series([datetime(2022, 1, 1), None], dtype=pa_type)

In [50]: ser_dt.dt.strftime("%Y-%m")
Out[50]:
0    2022-01
1       <NA>
dtype: string[pyarrow]

I/O Reading

PyArrow also provides IO reading functionality that has been integrated into several pandas IO readers. The following functions provide an engine keyword that can dispatch to PyArrow to accelerate reading from an IO source.

In [51]: import io

In [52]: data = io.StringIO("""a,b,c
   ....:    1,2.5,True
   ....:    3,4.5,False
   ....: """)
   ....:

In [53]: df = pd.read_csv(data, engine="pyarrow")

In [54]: df
Out[54]:
   a    b      c
0  1  2.5   True
1  3  4.5  False

By default, these functions and all other IO reader functions return NumPy-backed data. These readers can return PyArrow-backed data by specifying the parameter dtype_backend="pyarrow". A reader does not need to set engine="pyarrow" in order to return PyArrow-backed data.

In [55]: import io

In [56]: data = io.StringIO("""a,b,c,d,e,f,g,h,i
   ....:     1,2.5,True,a,,,,,
   ....:     3,4.5,False,b,6,7.5,True,a,
   ....: """)
   ....:

In [57]: df_pyarrow = pd.read_csv(data, dtype_backend="pyarrow")

In [58]: df_pyarrow.dtypes
Out[58]:
a     int64[pyarrow]
b    double[pyarrow]
c      bool[pyarrow]
d    string[pyarrow]
e     int64[pyarrow]
f    double[pyarrow]
g      bool[pyarrow]
h    string[pyarrow]
i      null[pyarrow]
dtype: object

Several non-IO reader functions can also use the dtype_backend argument to return PyArrow-backed data, including: