Nullable integer data type
IntegerArray is currently experimental. Its API or implementation may change without warning. It uses pandas.NA as the missing value.
In Working with missing data, we saw that pandas primarily uses NaN to represent missing data. Because NaN is a float, an array of integers with any missing values is forced to become floating point. In some cases this may not matter much. But if your integer column is, say, an identifier, casting to float can be problematic: some integers cannot even be represented exactly as floating point numbers.
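To make the last point concrete, here is a minimal sketch (the variable names are illustrative and not part of the original guide). It assumes 64-bit floats and shows that the integer 2**53 + 1 does not survive a round trip through float, while a Series with the nullable Int64 dtype keeps it exact next to a missing value.

```python
import pandas as pd

# 2**53 + 1 is the smallest positive integer that a 64-bit float cannot represent.
big_id = 2**53 + 1

# Round-tripping through float silently changes the value.
round_tripped = int(float(big_id))
print(round_tripped == big_id)  # False

# A nullable integer Series stores the identifier exactly, even with a missing value.
ids = pd.Series([big_id, None], dtype="Int64")
print(ids)
print(int(ids[0]) == big_id)
```

This is why identifier-like columns are a typical case where the nullable dtype is worth the extra step.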
Construction
pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension type implemented within pandas.
In [1]: arr = pd.array([1, 2, None], dtype=pd.Int64Dtype())
In [2]: arr
Out[2]:
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64
Or use the string alias "Int64" (note the capital "I"), which differentiates it from NumPy's 'int64' dtype:
In [3]: pd.array([1, 2, np.nan], dtype="Int64")
Out[3]:
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64
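The same capitalization convention applies to the other integer widths. The sketch below is not from the original guide; it assumes the "Int8" and "UInt16" aliases (pandas also provides the remaining signed and unsigned widths) and checks that the nullable dtype is a distinct object from NumPy's lowercase one.

```python
import pandas as pd

# Capitalized aliases select pandas' nullable extension dtypes.
small = pd.array([1, 2, None], dtype="Int8")
unsigned = pd.array([1, 2, None], dtype="UInt16")
print(small.dtype, unsigned.dtype)

# The nullable dtype is not the same dtype as NumPy's lowercase 'int64'.
nullable = pd.Series([1, 2], dtype="Int64")
plain = pd.Series([1, 2], dtype="int64")
same_dtype = nullable.dtype == plain.dtype
print(same_dtype)

# Nullable dtypes are extension dtypes; NumPy dtypes are not.
print(pd.api.types.is_extension_array_dtype(nullable.dtype))
print(pd.api.types.is_extension_array_dtype(plain.dtype))
```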
All NA-like values are replaced with pandas.NA.
In [4]: pd.array([1, 2, np.nan, None, pd.NA], dtype="Int64")
Out[4]:
<IntegerArray>
[1, 2, <NA>, <NA>, <NA>]
Length: 5, dtype: Int64
This array can be stored in a DataFrame or Series like any NumPy array.
In [5]: pd.Series(arr)
Out[5]:
0 1
1 2
2 <NA>
dtype: Int64
You can also pass the list-like object to the Series constructor with the dtype.
Warning
Currently, pandas.array() and pandas.Series() use different rules for dtype inference. pandas.array() will infer a nullable-integer dtype:
In [6]: pd.array([1, None])
Out[6]:
<IntegerArray>
[1, <NA>]
Length: 2, dtype: Int64
In [7]: pd.array([1, 2])
Out[7]:
<IntegerArray>
[1, 2]
Length: 2, dtype: Int64
For backwards compatibility, Series infers these as either integer or float dtype:
In [8]: pd.Series([1, None])
Out[8]:
0 1.0
1 NaN
dtype: float64
In [9]: pd.Series([1, 2])
Out[9]:
0 1
1 2
dtype: int64
We recommend explicitly providing the dtype to avoid confusion:
In [10]: pd.array([1, None], dtype="Int64")
Out[10]:
<IntegerArray>
[1, <NA>]
Length: 2, dtype: Int64
In [11]: pd.Series([1, None], dtype="Int64")
Out[11]:
0 1
1 <NA>
dtype: Int64
In the future, we may provide an option for Series to infer a nullable-integer dtype.
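If a Series has already been built with the default inference, one possible way to move it to a nullable dtype afterwards is Series.convert_dtypes() or an explicit astype("Int64"). The sketch below is not part of the original guide and assumes pandas 1.0 or later, where convert_dtypes() is available.

```python
import pandas as pd

# Default inference: the missing value forces float64.
inferred = pd.Series([1, None])
print(inferred.dtype)

# convert_dtypes() picks a nullable dtype; whole-number floats become Int64.
converted = inferred.convert_dtypes()
print(converted.dtype)

# An explicit cast gives the same result and states the intent directly.
explicit = inferred.astype("Int64")
print(explicit)
```

Passing the dtype at construction time, as recommended above, avoids the intermediate float column altogether.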
Operations
Operations involving an integer array behave similarly to those on NumPy arrays. Missing values are propagated, and the data is coerced to another dtype if needed.
In [12]: s = pd.Series([1, 2, None], dtype="Int64")
# arithmetic
In [13]: s + 1
Out[13]:
0 2
1 3
2 <NA>
dtype: Int64
# comparison
In [14]: s == 1
Out[14]:
0 True
1 False
2 <NA>
dtype: boolean
# slicing operation
In [15]: s.iloc[1:3]
Out[15]:
1 2
2 <NA>
dtype: Int64
# operate with other dtypes
In [16]: s + s.iloc[1:3].astype("Int8")
Out[16]:
0 <NA>
1 4
2 <NA>
dtype: Int64
# coerce when needed
In [17]: s + 0.01
Out[17]:
0 1.01
1 2.01
2 <NA>
dtype: Float64
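Because missing values propagate through these operations, it is often useful to detect or replace them explicitly. The following sketch is not from the original guide; it uses the standard isna(), fillna(), and dropna() methods, and the fill value 0 is an arbitrary choice for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

# Locate, replace, or drop the missing entries.
mask = s.isna()
filled = s.fillna(0)      # 0 is an arbitrary placeholder
dropped = s.dropna()
print(mask.tolist(), filled.tolist(), dropped.tolist())

# Comparisons return the nullable boolean dtype, so decide how NA
# should be treated before using the result for boolean indexing.
is_one = (s == 1).fillna(False)
selected = s[is_one]
print(selected)
```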
These dtypes can operate as part of a DataFrame.
In [18]: df = pd.DataFrame({"A": s, "B": [1, 1, 3], "C": list("aab")})
In [19]: df
Out[19]:
A B C
0 1 1 a
1 2 1 a
2 <NA> 3 b
In [20]: df.dtypes
Out[20]:
A Int64
B int64
C object
dtype: object
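A common way for such a column to enter a DataFrame is a file with blank fields. The sketch below is not from the original guide: the CSV text and column names are hypothetical, and it assumes read_csv accepts the "Int64" alias in its dtype mapping, as recent pandas versions do.

```python
import io

import pandas as pd

# Hypothetical CSV data with a blank identifier in the second row.
csv_text = "id,name\n1,a\n,b\n3,c\n"

# Requesting Int64 keeps the column integer-typed despite the gap.
df_csv = pd.read_csv(io.StringIO(csv_text), dtype={"id": "Int64"})
print(df_csv)
print(df_csv.dtypes)

# An existing NumPy-backed column can be converted with astype.
df_cast = pd.DataFrame({"B": [1, 1, 3]}).astype({"B": "Int64"})
print(df_cast.dtypes)
```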
These dtypes can be merged, reshaped, and cast.
In [21]: pd.concat([df[["A"]], df[["B", "C"]]], axis=1).dtypes
Out[21]:
A Int64
B int64
C object
dtype: object
In [22]: df["A"].astype(float)
Out[22]:
0 1.0
1 2.0
2 NaN
Name: A, dtype: float64
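Casting to float works because NaN can stand in for the missing value. Casting to a NumPy integer dtype has no such stand-in, so the missing values need to be handled first. The sketch below is not from the original guide; the sentinel -1 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

# A direct cast to NumPy int64 fails while a missing value is present.
try:
    s.astype("int64")
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True
print(cast_failed)

# Option 1: replace missing values with a sentinel, then cast.
as_int = s.fillna(-1).astype("int64")
print(as_int)

# Option 2: export to a float NumPy array with NaN as the missing marker.
as_float = s.to_numpy(dtype="float64", na_value=np.nan)
print(as_float)
```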
Reduction and groupby operations such as sum() work as well.
In [23]: df.sum(numeric_only=True)
Out[23]:
A 3
B 5
dtype: Int64
In [24]: df.sum()
Out[24]:
A 3
B 5
C aab
dtype: object
In [25]: df.groupby("B").A.sum()
Out[25]:
B
1 3
3 0
Name: A, dtype: Int64
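Note that the group containing only a missing value sums to 0, because reductions skip missing values by default. The sketch below is not from the original guide; it uses the standard skipna and min_count parameters of Series.sum() to show how that default can be changed.

```python
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

# Default: the missing value is skipped.
total = s.sum()
print(total)

# skipna=False lets the missing value propagate into the result.
strict = s.sum(skipna=False)
print(strict)

# min_count requires at least that many valid values, else the result is NA.
guarded = s.sum(min_count=3)
print(guarded)

# Other reductions follow the same default of skipping missing values.
average = s.mean()
print(average)
```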
Scalar NA Value
arrays.IntegerArray uses pandas.NA as its scalar missing value. Indexing a single element that is missing returns pandas.NA:
In [26]: a = pd.array([1, None], dtype="Int64")
In [27]: a[1]
Out[27]: <NA>
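The returned scalar behaves differently from NaN in a few ways that are worth knowing. The sketch below is not from the original guide; it shows that pandas.NA is a singleton, that it propagates through arithmetic and comparisons, and that it cannot be used directly in a boolean context.

```python
import pandas as pd

a = pd.array([1, None], dtype="Int64")
missing = a[1]

# pandas.NA is a singleton, so an identity check works.
is_singleton = missing is pd.NA
print(is_singleton)

# Arithmetic and comparisons with NA return NA rather than a number or bool.
propagated = missing + 1
compared = missing == 1
print(propagated, compared)

# The truth value of NA is ambiguous, so bool() raises TypeError.
try:
    bool(missing)
    ambiguous = False
except TypeError:
    ambiguous = True
print(ambiguous)
```

Use pd.isna() to test for a missing value instead of comparing with ==.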