Pandas 中文参考指南
Sparse data structures
pandas provides data structures for efficiently storing sparse data. These are not necessarily sparse in the typical “mostly 0” sense. Rather, you can view these objects as “compressed”: any data matching a specific value (NaN / missing by default, though any value can be chosen, including 0) is omitted. The compressed values are not actually stored in the array.
In [1]: arr = np.random.randn(10)
In [2]: arr[2:-2] = np.nan
In [3]: ts = pd.Series(pd.arrays.SparseArray(arr))
In [4]: ts
Out[4]:
0 0.469112
1 -0.282863
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 -0.861849
9 -2.104569
dtype: Sparse[float64, nan]
Notice the dtype, Sparse[float64, nan]. The nan means that elements in the array that are nan aren’t actually stored; only the non-nan elements are. Those non-nan elements have a float64 dtype.
The sparse objects exist for memory efficiency reasons. Suppose you had a large, mostly NA DataFrame:
In [5]: df = pd.DataFrame(np.random.randn(10000, 4))
In [6]: df.iloc[:9998] = np.nan
In [7]: sdf = df.astype(pd.SparseDtype("float", np.nan))
In [8]: sdf.head()
Out[8]:
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
In [9]: sdf.dtypes
Out[9]:
0 Sparse[float64, nan]
1 Sparse[float64, nan]
2 Sparse[float64, nan]
3 Sparse[float64, nan]
dtype: object
In [10]: sdf.sparse.density
Out[10]: 0.0002
As you can see, the density (the fraction of values that have not been “compressed”) is extremely low. This sparse object takes up much less memory on disk (pickled) and in the Python interpreter.
In [11]: 'dense : {:0.2f} kB'.format(df.memory_usage().sum() / 1e3)
Out[11]: 'dense : 320.13 kB'
In [12]: 'sparse: {:0.2f} kB'.format(sdf.memory_usage().sum() / 1e3)
Out[12]: 'sparse: 0.22 kB'
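As a rough cross-check of the numbers above, the density can be recomputed by hand as the share of non-NaN cells. The sketch below rebuilds a frame of the same shape; the seeded generator and the names `manual_density`, `dense_bytes` and `sparse_bytes` are illustrative choices, not part of the original example, and exact byte counts may vary between pandas versions.

```python
import numpy as np
import pandas as pd

# Same shape as the example above: 10,000 rows, only the last two populated.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10_000, 4)))
df.iloc[:9998] = np.nan
sdf = df.astype(pd.SparseDtype("float", np.nan))

# Density computed by hand: stored (non-NaN) cells over all cells.
manual_density = df.notna().to_numpy().sum() / df.size

dense_bytes = df.memory_usage().sum()
sparse_bytes = sdf.memory_usage().sum()

print(f"density: {sdf.sparse.density:.4f} (manual: {manual_density:.4f})")
print(f"dense : {dense_bytes / 1e3:0.2f} kB")
print(f"sparse: {sparse_bytes / 1e3:0.2f} kB")
```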
Functionally, their behavior should be nearly identical to their dense counterparts.
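A small sketch of what "nearly identical" can look like in practice, assuming a float Series with a fill value of 0.0 (the data and variable names here are made up for illustration): reductions agree with the dense version, and element-wise arithmetic returns a sparse result.

```python
import numpy as np
import pandas as pd

dense = pd.Series([0.0, 0.0, 1.5, 0.0, 2.5])
sparse = dense.astype(pd.SparseDtype("float64", fill_value=0.0))

# Reductions give the same answer on both representations.
same_sum = np.isclose(dense.sum(), sparse.sum())
same_mean = np.isclose(dense.mean(), sparse.mean())

# Element-wise arithmetic keeps the result sparse.
doubled = sparse * 2

print(same_sum, same_mean, doubled.dtype)
```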
SparseArray
arrays.SparseArray is an ExtensionArray for storing an array of sparse values (see dtypes for more on extension arrays). It is a 1-dimensional ndarray-like object storing only values distinct from the fill_value:
In [13]: arr = np.random.randn(10)
In [14]: arr[2:5] = np.nan
In [15]: arr[7:8] = np.nan
In [16]: sparr = pd.arrays.SparseArray(arr)
In [17]: sparr
Out[17]:
[-1.9556635297215477, -1.6588664275960427, nan, nan, nan, 1.1589328886422277, 0.14529711373305043, nan, 0.6060271905134522, 1.3342113401317768]
Fill: nan
IntIndex
Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)
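The pieces shown in the repr above can also be read off the array directly. The following is a minimal sketch on a small hand-written array (not the random one from the example), using the `sp_values`, `sp_index`, `fill_value` and `npoints` attributes of `SparseArray`.

```python
import numpy as np
import pandas as pd

sparr = pd.arrays.SparseArray([0.5, np.nan, np.nan, -1.25, np.nan])

print(sparr.sp_values)         # only the stored (non-fill) values
print(sparr.sp_index.indices)  # positions of those stored values
print(sparr.fill_value)        # the value that is left out of storage
print(len(sparr), sparr.npoints)
```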
A sparse array can be converted to a regular (dense) ndarray with numpy.asarray():
In [18]: np.asarray(sparr)
Out[18]:
array([-1.9557, -1.6589, nan, nan, nan, 1.1589, 0.1453,
nan, 0.606 , 1.3342])
SparseDtype
The SparseArray.dtype property stores two pieces of information:
- The dtype of the non-sparse values
- The scalar fill value
In [19]: sparr.dtype
Out[19]: Sparse[float64, nan]
A SparseDtype may be constructed by passing only a dtype:
In [20]: pd.SparseDtype(np.dtype('datetime64[ns]'))
Out[20]: Sparse[datetime64[ns], numpy.datetime64('NaT')]
In that case a default fill value is used (for NumPy dtypes this is often the “missing” value for that dtype). To override this default, pass an explicit fill value instead:
In [21]: pd.SparseDtype(np.dtype('datetime64[ns]'),
....: fill_value=pd.Timestamp('2017-01-01'))
....:
Out[21]: Sparse[datetime64[ns], Timestamp('2017-01-01 00:00:00')]
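To make the two pieces of information concrete, the sketch below inspects `subtype` and `fill_value` for a few common dtypes. The specific dtypes chosen are only illustrative, and the defaults printed are what current pandas releases are understood to use; treat them as something to verify on your own version.

```python
import numpy as np
import pandas as pd

float_dtype = pd.SparseDtype("float64")
int_dtype = pd.SparseDtype("int64")
bool_dtype = pd.SparseDtype("bool")

# subtype is the dtype of the stored values; fill_value is the omitted scalar.
for dt in (float_dtype, int_dtype, bool_dtype):
    print(dt, dt.subtype, repr(dt.fill_value))

# An explicit fill value overrides the default.
custom = pd.SparseDtype("int64", fill_value=-1)
print(custom)
```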
Finally, the string alias 'Sparse[dtype]' may be used to specify a sparse dtype in many places:
In [22]: pd.array([1, 0, 0, 2], dtype='Sparse[int]')
Out[22]:
[1, 0, 0, 2]
Fill: 0
IntIndex
Indices: array([0, 3], dtype=int32)
Sparse accessor
pandas provides a .sparse accessor, similar to .str for string data, .cat for categorical data, and .dt for datetime-like data. This namespace provides attributes and methods that are specific to sparse data.
In [23]: s = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")
In [24]: s.sparse.density
Out[24]: 0.5
In [25]: s.sparse.fill_value
Out[25]: 0
This accessor is available only on data with SparseDtype, and on the Series class itself for creating a Series with sparse data from a SciPy COO matrix.
A .sparse accessor has been added for DataFrame as well. See Sparse accessor for more.
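As a short illustration of the accessor's scope, the sketch below reads two more attributes through `.sparse` and then shows that a dense Series does not expose the accessor. The `accessor_available` flag is a name introduced only for this example.

```python
import pandas as pd

s = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")
print(s.sparse.npoints)    # number of stored (non-fill) values
print(s.sparse.sp_values)  # the stored values as an ndarray

# A dense Series holds no sparse data, so the accessor is not available.
dense = pd.Series([0, 0, 1, 2])
try:
    dense.sparse
    accessor_available = True
except AttributeError:
    accessor_available = False
print(accessor_available)
```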
Sparse calculation
You can apply NumPy ufuncs to arrays.SparseArray and get an arrays.SparseArray as a result.
In [26]: arr = pd.arrays.SparseArray([1., np.nan, np.nan, -2., np.nan])
In [27]: np.abs(arr)
Out[27]:
[1.0, nan, nan, 2.0, nan]
Fill: nan
IntIndex
Indices: array([0, 3], dtype=int32)
The ufunc is also applied to fill_value. This is needed to get the correct dense result.
In [28]: arr = pd.arrays.SparseArray([1., -1, -1, -2., -1], fill_value=-1)
In [29]: np.abs(arr)
Out[29]:
[1, 1, 1, 2.0, 1]
Fill: 1
IntIndex
Indices: array([3], dtype=int32)
In [30]: np.abs(arr).to_dense()
Out[30]: array([1., 1., 1., 2., 1.])
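One way to see why the fill value must be transformed too is to compare two routes to the same dense result. This is a minimal sketch; `via_sparse` and `via_dense` are names chosen for the illustration.

```python
import numpy as np
import pandas as pd

arr = pd.arrays.SparseArray([1.0, -1.0, -1.0, -2.0, -1.0], fill_value=-1.0)
result = np.abs(arr)

# Route 1: apply the ufunc to the sparse array, then densify.
via_sparse = np.asarray(result)
# Route 2: densify first, then apply the ufunc.
via_dense = np.abs(np.asarray(arr))

print(result.fill_value)
print(np.array_equal(via_sparse, via_dense))
```

If the fill value had stayed at -1, the first route would report -1 at every omitted position and the two results would disagree.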
Conversion
To convert data from sparse to dense, use the .sparse accessor:
In [31]: sdf.sparse.to_dense()
Out[31]:
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
... ... ... ... ...
9995 NaN NaN NaN NaN
9996 NaN NaN NaN NaN
9997 NaN NaN NaN NaN
9998 0.509184 -0.774928 -1.369894 -0.382141
9999 0.280249 -1.648493 1.490865 -0.890819
[10000 rows x 4 columns]
To convert from dense to sparse, use DataFrame.astype() with a SparseDtype.
In [32]: dense = pd.DataFrame({"A": [1, 0, 0, 1]})
In [33]: dtype = pd.SparseDtype(int, fill_value=0)
In [34]: dense.astype(dtype)
Out[34]:
A
0 1
1 0
2 0
3 1
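The two conversions can be chained into a round trip. The sketch below pins the dtype to int64 explicitly (an assumption made here so the comparison does not depend on the platform's default integer width) and checks that the data survives unchanged.

```python
import pandas as pd

dense = pd.DataFrame({"A": [1, 0, 0, 1]}, dtype="int64")
sparse = dense.astype(pd.SparseDtype("int64", fill_value=0))

# Back to dense through the DataFrame-level accessor.
round_trip = sparse.sparse.to_dense()

print(sparse.dtypes)
print(round_trip.equals(dense))
```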
Interaction with scipy.sparse
Use DataFrame.sparse.from_spmatrix() to create a DataFrame with sparse values from a sparse matrix.
In [35]: from scipy.sparse import csr_matrix
In [36]: arr = np.random.random(size=(1000, 5))
In [37]: arr[arr < .9] = 0
In [38]: sp_arr = csr_matrix(arr)
In [39]: sp_arr
Out[39]:
<1000x5 sparse matrix of type '<class 'numpy.float64'>'
with 517 stored elements in Compressed Sparse Row format>
In [40]: sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)
In [41]: sdf.head()
Out[41]:
0 1 2 3 4
0 0.95638 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0.999552 0 0 0.956153 0
In [42]: sdf.dtypes
Out[42]:
0 Sparse[float64, 0]
1 Sparse[float64, 0]
2 Sparse[float64, 0]
3 Sparse[float64, 0]
4 Sparse[float64, 0]
dtype: object
All sparse formats are supported, but matrices that are not in COOrdinate format will be converted, copying data as needed. To convert back to a sparse SciPy matrix in COO format, use the DataFrame.sparse.to_coo() method:
In [43]: sdf.sparse.to_coo()
Out[43]:
<1000x5 sparse matrix of type '<class 'numpy.float64'>'
with 517 stored elements in COOrdinate format>
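For a case small enough to check by eye, the sketch below sends a 3x3 CSR matrix through from_spmatrix() and back through to_coo(). The matrix values and the column labels "a", "b", "c" are made up for the illustration.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

dense = np.array([[0.0, 1.5, 0.0],
                  [0.0, 0.0, 0.0],
                  [2.5, 0.0, 3.5]])
mat = csr_matrix(dense)

# CSR input is accepted; the result comes back in COO format.
sdf = pd.DataFrame.sparse.from_spmatrix(mat, columns=["a", "b", "c"])
coo = sdf.sparse.to_coo()

print(coo.format, coo.nnz)
print(np.array_equal(coo.toarray(), dense))
```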
Series.sparse.to_coo() transforms a Series with sparse values indexed by a MultiIndex into a scipy.sparse.coo_matrix.
The method requires a MultiIndex with two or more levels.
In [44]: s = pd.Series([3.0, np.nan, 1.0, 3.0, np.nan, np.nan])
In [45]: s.index = pd.MultiIndex.from_tuples(
....: [
....: (1, 2, "a", 0),
....: (1, 2, "a", 1),
....: (1, 1, "b", 0),
....: (1, 1, "b", 1),
....: (2, 1, "b", 0),
....: (2, 1, "b", 1),
....: ],
....: names=["A", "B", "C", "D"],
....: )
....:
In [46]: ss = s.astype('Sparse')
In [47]: ss
Out[47]:
A  B  C  D
1  2  a  0    3.0
         1    NaN
   1  b  0    1.0
         1    3.0
2  1  b  0    NaN
         1    NaN
dtype: Sparse[float64, nan]
In the example below, we transform the Series to a sparse representation of a 2-d array by specifying that the first and second MultiIndex levels define labels for the rows and the third and fourth levels define labels for the columns. We also specify that the column and row labels should be sorted in the final sparse representation.
In [48]: A, rows, columns = ss.sparse.to_coo(
....: row_levels=["A", "B"], column_levels=["C", "D"], sort_labels=True
....: )
....:
In [49]: A
Out[49]:
<3x4 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in COOrdinate format>
In [50]: A.todense()
Out[50]:
matrix([[0., 0., 1., 3.],
[3., 0., 0., 0.],
[0., 0., 0., 0.]])
In [51]: rows
Out[51]: [(1, 1), (1, 2), (2, 1)]
In [52]: columns
Out[52]: [('a', 0), ('a', 1), ('b', 0), ('b', 1)]
Specifying different row and column labels (and not sorting them) yields a different sparse matrix:
In [53]: A, rows, columns = ss.sparse.to_coo(
....: row_levels=["A", "B", "C"], column_levels=["D"], sort_labels=False
....: )
....:
In [54]: A
Out[54]:
<3x2 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in COOrdinate format>
In [55]: A.todense()
Out[55]:
matrix([[3., 0.],
[1., 3.],
[0., 0.]])
In [56]: rows
Out[56]: [(1, 2, 'a'), (1, 1, 'b'), (2, 1, 'b')]
In [57]: columns
Out[57]: [(0,), (1,)]
A convenience method, Series.sparse.from_coo(), is implemented for creating a Series with sparse values from a scipy.sparse.coo_matrix.
In [58]: from scipy import sparse
In [59]: A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])), shape=(3, 4))
In [60]: A
Out[60]:
<3x4 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in COOrdinate format>
In [61]: A.todense()
Out[61]:
matrix([[0., 0., 1., 2.],
[3., 0., 0., 0.],
[0., 0., 0., 0.]])
The default behaviour (with dense_index=False) simply returns a Series containing only the non-null entries.
In [62]: ss = pd.Series.sparse.from_coo(A)
In [63]: ss
Out[63]:
0  2    1.0
   3    2.0
1  0    3.0
dtype: Sparse[float64, nan]
Specifying dense_index=True will result in an index that is the Cartesian product of the row and column coordinates of the matrix. Note that this will consume a significant amount of memory (relative to dense_index=False) if the sparse matrix is large (and sparse) enough.
In [64]: ss_dense = pd.Series.sparse.from_coo(A, dense_index=True)
In [65]: ss_dense
Out[65]:
1  0    3.0
   2    NaN
   3    NaN
0  0    NaN
   2    1.0
   3    2.0
   0    NaN
   2    1.0
   3    2.0
dtype: Sparse[float64, nan]