pandas Chinese Reference Guide

Working with missing data

Values considered “missing”

pandas uses different sentinel values to represent a missing value (also referred to as NA), depending on the data type.

numpy.nan for NumPy data types. The disadvantage of using NumPy data types is that the original data type will be coerced to np.float64 or object.

In [1]: pd.Series([1, 2], dtype=np.int64).reindex([0, 1, 2])
Out[1]:
0    1.0
1    2.0
2    NaN
dtype: float64

In [2]: pd.Series([True, False], dtype=np.bool_).reindex([0, 1, 2])
Out[2]:
0     True
1    False
2      NaN
dtype: object

NaT for NumPy np.datetime64, np.timedelta64, and PeriodDtype. For typing applications, use api.types.NaTType.

In [3]: pd.Series([1, 2], dtype=np.dtype("timedelta64[ns]")).reindex([0, 1, 2])
Out[3]:
0   0 days 00:00:00.000000001
1   0 days 00:00:00.000000002
2                         NaT
dtype: timedelta64[ns]

In [4]: pd.Series([1, 2], dtype=np.dtype("datetime64[ns]")).reindex([0, 1, 2])
Out[4]:
0   1970-01-01 00:00:00.000000001
1   1970-01-01 00:00:00.000000002
2                             NaT
dtype: datetime64[ns]

In [5]: pd.Series(["2020", "2020"], dtype=pd.PeriodDtype("D")).reindex([0, 1, 2])
Out[5]:
0    2020-01-01
1    2020-01-01
2           NaT
dtype: period[D]

NA for StringDtype, Int64Dtype (and other bit widths), Float64Dtype (and other bit widths), BooleanDtype and ArrowDtype. These types will maintain the original data type of the data. For typing applications, use api.types.NAType.

In [6]: pd.Series([1, 2], dtype="Int64").reindex([0, 1, 2])
Out[6]:
0       1
1       2
2    <NA>
dtype: Int64

In [7]: pd.Series([True, False], dtype="boolean[pyarrow]").reindex([0, 1, 2])
Out[7]:
0     True
1    False
2     <NA>
dtype: bool[pyarrow]

To detect these missing values, use the isna() or notna() methods.

In [8]: ser = pd.Series([pd.Timestamp("2020-01-01"), pd.NaT])

In [9]: ser
Out[9]:
0   2020-01-01
1          NaT
dtype: datetime64[ns]

In [10]: pd.isna(ser)
Out[10]:
0    False
1     True
dtype: bool

isna() and notna() will also consider None a missing value.

In [11]: ser = pd.Series([1, None], dtype=object)

In [12]: ser
Out[12]:
0       1
1    None
dtype: object

In [13]: pd.isna(ser)
Out[13]:
0    False
1     True
dtype: bool

Warning

Equality comparisons between np.nan, NaT, and NA do not act like None:

In [14]: None == None  # noqa: E711
Out[14]: True

In [15]: np.nan == np.nan
Out[15]: False

In [16]: pd.NaT == pd.NaT
Out[16]: False

In [17]: pd.NA == pd.NA
Out[17]: <NA>

Therefore, an equality comparison between a DataFrame or Series and one of these missing values does not provide the same information as isna() or notna().

In [18]: ser = pd.Series([True, None], dtype="boolean[pyarrow]")

In [19]: ser == pd.NA
Out[19]:
0    <NA>
1    <NA>
dtype: bool[pyarrow]

In [20]: pd.isna(ser)
Out[20]:
0    False
1     True
dtype: bool
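The same pitfall applies when building a filter mask. The following is a minimal sketch, assuming a small float Series made up for illustration, that contrasts an equality comparison against np.nan with isna():

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the guide): one missing entry at position 1.
ser = pd.Series([1.0, np.nan, 3.0])

# Comparing with == never matches, because NaN is not equal to itself.
eq_mask = ser == np.nan

# isna() is the reliable way to locate missing entries.
na_mask = ser.isna()

print(eq_mask.tolist())             # [False, False, False]
print(na_mask.tolist())             # [False, True, False]
print(ser[na_mask].index.tolist())  # [1]
```

Only the isna() mask selects the row that actually holds a missing value.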

NA semantics

Warning

Experimental: the behaviour of NA can still change without warning.

Starting from pandas 1.0, an experimental NA value (singleton) is available to represent scalar missing values. The goal of NA is to provide a “missing” indicator that can be used consistently across data types (instead of np.nan, None or pd.NaT depending on the data type).

For example, a Series with the nullable integer dtype uses NA for its missing values:

In [21]: s = pd.Series([1, 2, None], dtype="Int64")

In [22]: s
Out[22]:
0       1
1       2
2    <NA>
dtype: Int64

In [23]: s[2]
Out[23]: <NA>

In [24]: s[2] is pd.NA
Out[24]: True

Currently, pandas does not yet use the NA-based data types by default in a DataFrame or Series, so you need to specify the dtype explicitly. An easy way to convert to those dtypes is explained in the conversion section.

Propagation in arithmetic and comparison operations

In general, missing values propagate in operations involving NA. When one of the operands is unknown, the outcome of the operation is also unknown.

For example, NA propagates in arithmetic operations, similarly to np.nan:

In [25]: pd.NA + 1
Out[25]: <NA>

In [26]: "a" * pd.NA
Out[26]: <NA>

There are a few special cases where the result is known even when one of the operands is NA.

In [27]: pd.NA ** 0
Out[27]: 1

In [28]: 1 ** pd.NA
Out[28]: 1

In equality and comparison operations, NA also propagates. This deviates from the behaviour of np.nan, where comparisons with np.nan always return False.

In [29]: pd.NA == 1
Out[29]: <NA>

In [30]: pd.NA == pd.NA
Out[30]: <NA>

In [31]: pd.NA < 2.5
Out[31]: <NA>

To check if a value is equal to NA, use isna():

In [32]: pd.isna(pd.NA)
Out[32]: True

An exception to this basic propagation rule is reductions (such as the mean or the minimum), where pandas defaults to skipping missing values. See the calculation section for more.

Logical operations

For logical operations, NA follows the rules of three-valued logic (or Kleene logic, similar to R, SQL and Julia). Under this logic, missing values propagate only when it is logically required.

For example, for the logical “or” operation (|), if one of the operands is True, we already know the result will be True, regardless of the other value (so regardless of whether the missing value would be True or False). In this case, NA does not propagate:

In [33]: True | False
Out[33]: True

In [34]: True | pd.NA
Out[34]: True

In [35]: pd.NA | True
Out[35]: True

On the other hand, if one of the operands is False, the result depends on the value of the other operand. Therefore, in this case NA propagates:

In [36]: False | True
Out[36]: True

In [37]: False | False
Out[37]: False

In [38]: False | pd.NA
Out[38]: <NA>

The behaviour of the logical “and” operation (&) can be derived using similar logic (where now NA will not propagate if one of the operands is already False):

In [39]: False & True
Out[39]: False

In [40]: False & False
Out[40]: False

In [41]: False & pd.NA
Out[41]: False

In [42]: True & True
Out[42]: True

In [43]: True & False
Out[43]: False

In [44]: True & pd.NA
Out[44]: <NA>
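The rules for | and & can be collected into full truth tables. The sketch below is illustrative only: the helper name show and the dictionary layout are assumptions made for this example, not part of pandas.

```python
import pandas as pd

values = [True, False, pd.NA]


def show(v):
    # Hypothetical helper: render pd.NA as "NA" so results are easy to compare.
    return "NA" if v is pd.NA else str(v)


# Evaluate every pair of operands under Kleene logic.
or_table = {(show(a), show(b)): show(a | b) for a in values for b in values}
and_table = {(show(a), show(b)): show(a & b) for a in values for b in values}

for (a, b), result in or_table.items():
    print(f"{a:>5} | {b:<5} -> {result}")
for (a, b), result in and_table.items():
    print(f"{a:>5} & {b:<5} -> {result}")
```

NA survives only in the rows where the other operand does not already decide the outcome: False | NA, NA | NA, True & NA and NA & NA (and their mirrored forms).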

NA in a boolean context

Since the actual value of an NA is unknown, it is ambiguous to convert NA to a boolean value.

In [45]: bool(pd.NA)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[45], line 1
----> 1 bool(pd.NA)

File missing.pyx:392, in pandas._libs.missing.NAType.__bool__()

TypeError: boolean value of NA is ambiguous

This also means that NA cannot be used in a context where it is evaluated to a boolean, such as if condition: ... where condition can potentially be NA. In such cases, isna() can be used to check for NA, or condition being NA can be avoided, for example by filling missing values beforehand.

A similar situation occurs when using Series or DataFrame objects in if statements; see Using if/truth statements with pandas.
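Both workarounds mentioned above can be sketched briefly. The function name describe and the sample data are assumptions made for illustration:

```python
import pandas as pd


def describe(flag):
    # Hypothetical helper: flag may be True, False, or pd.NA.
    # Check for NA first so that `if flag` is never evaluated on NA.
    if pd.isna(flag):
        return "unknown"
    return "yes" if flag else "no"


ser = pd.Series([True, False, None], dtype="boolean")

# Option 1: guard each scalar with isna() before the truth test.
labels = [describe(v) for v in ser]
print(labels)  # ['yes', 'no', 'unknown']

# Option 2: fill missing values beforehand so no NA reaches the truth test.
filled = ser.fillna(False)
print(filled.tolist())  # [True, False, False]
```

Which option fits depends on whether "unknown" should be treated as its own case or folded into a default.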

NumPy ufuncs

pandas.NA implements NumPy’s __array_ufunc__ protocol. Most ufuncs work with NA, and generally return NA:

In [46]: np.log(pd.NA)
Out[46]: <NA>

In [47]: np.add(pd.NA, 1)
Out[47]: <NA>

Warning

Currently, ufuncs involving an ndarray and NA will return an object-dtype array filled with NA values.

In [48]: a = np.array([1, 2, 3])

In [49]: np.greater(a, pd.NA)
Out[49]: array([<NA>, <NA>, <NA>], dtype=object)

The return type here may change to return a different array type in the future.

See DataFrame interoperability with NumPy functions for more on ufuncs.

Conversion

If you have a DataFrame or Series using np.nan, Series.convert_dtypes() and DataFrame.convert_dtypes() can convert the data to use the data types that use NA, such as Int64Dtype or ArrowDtype. This is especially helpful after reading in data sets from IO methods where data types were inferred.

In this example, the dtypes of both columns are changed.

In [50]: import io

In [51]: data = io.StringIO("a,b\n,True\n2,")

In [52]: df = pd.read_csv(data)

In [53]: df.dtypes
Out[53]:
a    float64
b     object
dtype: object

In [54]: df_conv = df.convert_dtypes()

In [55]: df_conv
Out[55]:
      a     b
0  <NA>  True
1     2  <NA>

In [56]: df_conv.dtypes
Out[56]:
a      Int64
b    boolean
dtype: object
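The same conversion works on a single Series. A minimal sketch, assuming two small float Series made up for illustration:

```python
import numpy as np
import pandas as pd

# Whole-number floats with a NaN: convert_dtypes() can move these to Int64.
whole = pd.Series([1.0, np.nan, 3.0])
whole_conv = whole.convert_dtypes()
print(whole_conv.dtype)            # Int64
print(whole_conv.iloc[1] is pd.NA)  # True

# Fractional floats stay floating point, but use the nullable Float64 dtype.
frac = pd.Series([1.5, np.nan])
frac_conv = frac.convert_dtypes()
print(frac_conv.dtype)             # Float64
```

In both cases the np.nan marker is replaced by NA, so the NA semantics described earlier apply to the converted data.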

Inserting missing data

You can insert missing values by simply assigning to a Series or DataFrame. The missing value sentinel used will be chosen based on the dtype.

In [57]: ser = pd.Series([1., 2., 3.])

In [58]: ser.loc[0] = None

In [59]: ser
Out[59]:
0    NaN
1    2.0
2    3.0
dtype: float64

In [60]: ser = pd.Series([pd.Timestamp("2021"), pd.Timestamp("2021")])

In [61]: ser.iloc[0] = np.nan

In [62]: ser
Out[62]:
0          NaT
1   2021-01-01
dtype: datetime64[ns]

In [63]: ser = pd.Series([True, False], dtype="boolean[pyarrow]")

In [64]: ser.iloc[0] = None

In [65]: ser
Out[65]:
0     <NA>
1    False
dtype: bool[pyarrow]

For object types, pandas will use the value given:

In [66]: s = pd.Series(["a", "b", "c"], dtype=object)

In [67]: s.loc[0] = None

In [68]: s.loc[1] = np.nan

In [69]: s
Out[69]:
0    None
1     NaN
2       c
dtype: object

Calculations with missing data

Missing values propagate through arithmetic operations between pandas objects.

In [70]: ser1 = pd.Series([np.nan, np.nan, 2, 3])

In [71]: ser2 = pd.Series([np.nan, 1, np.nan, 4])

In [72]: ser1
Out[72]:
0    NaN
1    NaN
2    2.0
3    3.0
dtype: float64

In [73]: ser2
Out[73]:
0    NaN
1    1.0
2    NaN
3    4.0
dtype: float64

In [74]: ser1 + ser2
Out[74]:
0    NaN
1    NaN
2    NaN
3    7.0
dtype: float64

The descriptive statistics and computational methods discussed in the data structure overview (and listed here and here) all account for missing data.

When summing data, NA values or empty data will be treated as zero.

In [75]: pd.Series([np.nan]).sum()
Out[75]: 0.0

In [76]: pd.Series([], dtype="float64").sum()
Out[76]: 0.0

When taking the product, NA values or empty data will be treated as 1.

In [77]: pd.Series([np.nan]).prod()
Out[77]: 1.0

In [78]: pd.Series([], dtype="float64").prod()
Out[78]: 1.0
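If a sum of 0.0 or a product of 1.0 for all-missing data is not what you want, the min_count parameter of sum() and prod() may help. It is not discussed in the text above; the sketch below assumes it behaves as in current pandas releases, returning NaN when fewer than min_count non-NA values are present.

```python
import numpy as np
import pandas as pd

all_na = pd.Series([np.nan, np.nan])

# Default behaviour: missing values are skipped, leaving an empty sum.
default_sum = all_na.sum()
print(default_sum)  # 0.0

# Require at least one valid value; otherwise the result is NaN.
strict_sum = all_na.sum(min_count=1)
strict_prod = all_na.prod(min_count=1)
print(strict_sum, strict_prod)  # nan nan
```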

Cumulative methods like cumsum() and cumprod() ignore NA values by default, but preserve them in the resulting arrays. To override this behaviour and include NA values, use skipna=False.

In [79]: ser = pd.Series([1, np.nan, 3, np.nan])

In [80]: ser
Out[80]:
0    1.0
1    NaN
2    3.0
3    NaN
dtype: float64

In [81]: ser.cumsum()
Out[81]:
0    1.0
1    NaN
2    4.0
3    NaN
dtype: float64

In [82]: ser.cumsum(skipna=False)
Out[82]:
0    1.0
1    NaN
2    NaN
3    NaN
dtype: float64

Dropping missing data

dropna() drops rows or columns with missing data.

In [83]: df = pd.DataFrame([[np.nan, 1, 2], [1, 2, np.nan], [1, 2, 3]])

In [84]: df
Out[84]:
     0  1    2
0  NaN  1  2.0
1  1.0  2  NaN
2  1.0  2  3.0

In [85]: df.dropna()
Out[85]:
     0  1    2
2  1.0  2  3.0

In [86]: df.dropna(axis=1)
Out[86]:
   1
0  1
1  2
2  2

In [87]: ser = pd.Series([1, pd.NA], dtype="int64[pyarrow]")

In [88]: ser.dropna()
Out[88]:
0    1
dtype: int64[pyarrow]
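dropna() also takes parameters that control which rows count as "missing". The sketch below uses the how, subset and thresh parameters of DataFrame.dropna() on a small frame made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan, np.nan],
        "b": [1.0, np.nan, np.nan, 2.0],
        "c": [1.0, 2.0, np.nan, 3.0],
    }
)

# Default (how="any"): drop rows containing at least one missing value.
any_dropped = df.dropna()
print(any_dropped.index.tolist())  # [0]

# how="all": drop only rows where every value is missing.
all_dropped = df.dropna(how="all")
print(all_dropped.index.tolist())  # [0, 1, 3]

# subset: only look at column "b" when deciding what to drop.
subset_dropped = df.dropna(subset=["b"])
print(subset_dropped.index.tolist())  # [0, 3]

# thresh: keep rows with at least 2 non-missing values.
thresh_kept = df.dropna(thresh=2)
print(thresh_kept.index.tolist())  # [0, 3]
```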

Filling missing data

Filling by value

fillna() replaces NA values with non-NA data.

Replace NA with a scalar value:

In [89]: data = {"np": [1.0, np.nan, np.nan, 2], "arrow": pd.array([1.0, pd.NA, pd.NA, 2], dtype="float64[pyarrow]")}

In [90]: df = pd.DataFrame(data)

In [91]: df
Out[91]:
    np  arrow
0  1.0    1.0
1  NaN   <NA>
2  NaN   <NA>
3  2.0    2.0

In [92]: df.fillna(0)
Out[92]:
    np  arrow
0  1.0    1.0
1  0.0    0.0
2  0.0    0.0
3  2.0    2.0

Fill gaps forward or backward:

In [93]: df.ffill()
Out[93]:
    np  arrow
0  1.0    1.0
1  1.0    1.0
2  1.0    1.0
3  2.0    2.0

In [94]: df.bfill()
Out[94]:
    np  arrow
0  1.0    1.0
1  2.0    2.0
2  2.0    2.0
3  2.0    2.0

Limit the number of NA values filled:

In [95]: df.ffill(limit=1)
Out[95]:
    np  arrow
0  1.0    1.0
1  1.0    1.0
2  NaN   <NA>
3  2.0    2.0

NA values can be replaced with the corresponding value from a Series or DataFrame where the index and columns align between the original object and the filled object.

In [96]: dff = pd.DataFrame(np.arange(30, dtype=np.float64).reshape(10, 3), columns=list("ABC"))

In [97]: dff.iloc[3:5, 0] = np.nan

In [98]: dff.iloc[4:6, 1] = np.nan

In [99]: dff.iloc[5:8, 2] = np.nan

In [100]: dff
Out[100]:
      A     B     C
0   0.0   1.0   2.0
1   3.0   4.0   5.0
2   6.0   7.0   8.0
3   NaN  10.0  11.0
4   NaN   NaN  14.0
5  15.0   NaN   NaN
6  18.0  19.0   NaN
7  21.0  22.0   NaN
8  24.0  25.0  26.0
9  27.0  28.0  29.0

In [101]: dff.fillna(dff.mean())
Out[101]:
       A     B          C
0   0.00   1.0   2.000000
1   3.00   4.0   5.000000
2   6.00   7.0   8.000000
3  14.25  10.0  11.000000
4  14.25  14.5  14.000000
5  15.00  14.5  13.571429
6  18.00  19.0  13.571429
7  21.00  22.0  13.571429
8  24.00  25.0  26.000000
9  27.00  28.0  29.000000

DataFrame.where() can also be used to fill NA values. The result is the same as above.

In [102]: dff.where(pd.notna(dff), dff.mean(), axis="columns")
Out[102]:
       A     B          C
0   0.00   1.0   2.000000
1   3.00   4.0   5.000000
2   6.00   7.0   8.000000
3  14.25  10.0  11.000000
4  14.25  14.5  14.000000
5  15.00  14.5  13.571429
6  18.00  19.0  13.571429
7  21.00  22.0  13.571429
8  24.00  25.0  26.000000
9  27.00  28.0  29.000000
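Column alignment also works with a plain dict, which is convenient when each column needs its own fill value. A minimal sketch, with column names and fill values chosen only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan], "B": [np.nan, 4.0]})

# Keys are matched against column labels; each column gets its own value.
filled = df.fillna({"A": 0.0, "B": -1.0})

print(filled["A"].tolist())  # [1.0, 0.0]
print(filled["B"].tolist())  # [-1.0, 4.0]
```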

Interpolation

DataFrame.interpolate() and Series.interpolate() fill NA values using various interpolation methods.

In [103]: df = pd.DataFrame(
   .....:     {
   .....:         "A": [1, 2.1, np.nan, 4.7, 5.6, 6.8],
   .....:         "B": [0.25, np.nan, np.nan, 4, 12.2, 14.4],
   .....:     }
   .....: )
   .....:

In [104]: df
Out[104]:
     A      B
0  1.0   0.25
1  2.1    NaN
2  NaN    NaN
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

In [105]: df.interpolate()
Out[105]:
     A      B
0  1.0   0.25
1  2.1   1.50
2  3.4   2.75
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

In [106]: idx = pd.date_range("2020-01-01", periods=10, freq="D")

In [107]: data = np.random.default_rng(2).integers(0, 10, 10).astype(np.float64)

In [108]: ts = pd.Series(data, index=idx)

In [109]: ts.iloc[[1, 2, 5, 6, 9]] = np.nan

In [110]: ts
Out[110]:
2020-01-01    8.0
2020-01-02    NaN
2020-01-03    NaN
2020-01-04    2.0
2020-01-05    4.0
2020-01-06    NaN
2020-01-07    NaN
2020-01-08    0.0
2020-01-09    3.0
2020-01-10    NaN
Freq: D, dtype: float64

In [111]: ts.plot()
Out[111]: <Axes: >

In [112]: ts.interpolate()
Out[112]:
2020-01-01    8.000000
2020-01-02    6.000000
2020-01-03    4.000000
2020-01-04    2.000000
2020-01-05    4.000000
2020-01-06    2.666667
2020-01-07    1.333333
2020-01-08    0.000000
2020-01-09    3.000000
2020-01-10    3.000000
Freq: D, dtype: float64

In [113]: ts.interpolate().plot()
Out[113]: <Axes: >

Interpolation relative to a Timestamp in the DatetimeIndex is available by setting method="time":

In [114]: ts2 = ts.iloc[[0, 1, 3, 7, 9]]

In [115]: ts2
Out[115]:
2020-01-01    8.0
2020-01-02    NaN
2020-01-04    2.0
2020-01-08    0.0
2020-01-10    NaN
dtype: float64

In [116]: ts2.interpolate()
Out[116]:
2020-01-01    8.0
2020-01-02    5.0
2020-01-04    2.0
2020-01-08    0.0
2020-01-10    0.0
dtype: float64

In [117]: ts2.interpolate(method="time")
Out[117]:
2020-01-01    8.0
2020-01-02    6.0
2020-01-04    2.0
2020-01-08    0.0
2020-01-10    0.0
dtype: float64

For a floating-point index, use method='values':

In [118]: idx = [0.0, 1.0, 10.0]

In [119]: ser = pd.Series([0.0, np.nan, 10.0], idx)

In [120]: ser
Out[120]:
0.0      0.0
1.0      NaN
10.0    10.0
dtype: float64

In [121]: ser.interpolate()
Out[121]:
0.0      0.0
1.0      5.0
10.0    10.0
dtype: float64

In [122]: ser.interpolate(method="values")
Out[122]:
0.0      0.0
1.0      1.0
10.0    10.0
dtype: float64

If you have scipy installed, you can pass the name of a 1-d interpolation routine to method, as specified in the scipy interpolation documentation and reference guide. The appropriate interpolation method will depend on the data type.

Tip

If you are dealing with a time series that is growing at an increasing rate, use method='barycentric'.

If you have values approximating a cumulative distribution function, use method='pchip'.

To fill missing values with the goal of smooth plotting, use method='akima'.

In [123]: df = pd.DataFrame(
   .....:    {
   .....:       "A": [1, 2.1, np.nan, 4.7, 5.6, 6.8],
   .....:       "B": [0.25, np.nan, np.nan, 4, 12.2, 14.4],
   .....:    }
   .....: )
   .....:

In [124]: df
Out[124]:
     A      B
0  1.0   0.25
1  2.1    NaN
2  NaN    NaN
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

In [125]: df.interpolate(method="barycentric")
Out[125]:
      A       B
0  1.00   0.250
1  2.10  -7.660
2  3.53  -4.515
3  4.70   4.000
4  5.60  12.200
5  6.80  14.400

In [126]: df.interpolate(method="pchip")
Out[126]:
         A          B
0  1.00000   0.250000
1  2.10000   0.672808
2  3.43454   1.928950
3  4.70000   4.000000
4  5.60000  12.200000
5  6.80000  14.400000

In [127]: df.interpolate(method="akima")
Out[127]:
          A          B
0  1.000000   0.250000
1  2.100000  -0.873316
2  3.406667   0.320034
3  4.700000   4.000000
4  5.600000  12.200000
5  6.800000  14.400000

When interpolating via a polynomial or spline approximation, you must also specify the degree or order of the approximation:

In [128]: df.interpolate(method="spline", order=2)
Out[128]:
          A          B
0  1.000000   0.250000
1  2.100000  -0.428598
2  3.404545   1.206900
3  4.700000   4.000000
4  5.600000  12.200000
5  6.800000  14.400000

In [129]: df.interpolate(method="polynomial", order=2)
Out[129]:
          A          B
0  1.000000   0.250000
1  2.100000  -2.703846
2  3.451351  -1.453846
3  4.700000   4.000000
4  5.600000  12.200000
5  6.800000  14.400000

Comparing several methods:

In [130]: np.random.seed(2)

In [131]: ser = pd.Series(np.arange(1, 10.1, 0.25) ** 2 + np.random.randn(37))

In [132]: missing = np.array([4, 13, 14, 15, 16, 17, 18, 20, 29])

In [133]: ser.iloc[missing] = np.nan

In [134]: methods = ["linear", "quadratic", "cubic"]

In [135]: df = pd.DataFrame({m: ser.interpolate(method=m) for m in methods})

In [136]: df.plot()
Out[136]: <Axes: >

Interpolating new observations from expanding data with Series.reindex():

In [137]: ser = pd.Series(np.sort(np.random.uniform(size=100)))

# interpolate at new_index
In [138]: new_index = ser.index.union(pd.Index([49.25, 49.5, 49.75, 50.25, 50.5, 50.75]))

In [139]: interp_s = ser.reindex(new_index).interpolate(method="pchip")

In [140]: interp_s.loc[49:51]
Out[140]:
49.00    0.471410
49.25    0.476841
49.50    0.481780
49.75    0.485998
50.00    0.489266
50.25    0.491814
50.50    0.493995
50.75    0.495763
51.00    0.497074
dtype: float64

interpolate() accepts a limit keyword argument to limit the number of consecutive NaN values filled since the last valid observation:

In [141]: ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13, np.nan, np.nan])

In [142]: ser
Out[142]:
0     NaN
1     NaN
2     5.0
3     NaN
4     NaN
5     NaN
6    13.0
7     NaN
8     NaN
dtype: float64

In [143]: ser.interpolate()
Out[143]:
0     NaN
1     NaN
2     5.0
3     7.0
4     9.0
5    11.0
6    13.0
7    13.0
8    13.0
dtype: float64

In [144]: ser.interpolate(limit=1)
Out[144]:
0     NaN
1     NaN
2     5.0
3     7.0
4     NaN
5     NaN
6    13.0
7    13.0
8     NaN
dtype: float64

By default, NaN values are filled in a forward direction. Use the limit_direction parameter to fill backward or from both directions.

In [145]: ser.interpolate(limit=1, limit_direction="backward")
Out[145]:
0     NaN
1     5.0
2     5.0
3     NaN
4     NaN
5    11.0
6    13.0
7     NaN
8     NaN
dtype: float64

In [146]: ser.interpolate(limit=1, limit_direction="both")
Out[146]:
0     NaN
1     5.0
2     5.0
3     7.0
4     NaN
5    11.0
6    13.0
7    13.0
8     NaN
dtype: float64

In [147]: ser.interpolate(limit_direction="both")
Out[147]:
0     5.0
1     5.0
2     5.0
3     7.0
4     9.0
5    11.0
6    13.0
7    13.0
8    13.0
dtype: float64

By default, NaN values are filled whether they are surrounded by existing valid values or outside existing valid values. The limit_area parameter restricts filling to either inside or outside values.

# fill one consecutive inside value in both directions
In [148]: ser.interpolate(limit_direction="both", limit_area="inside", limit=1)
Out[148]:
0     NaN
1     NaN
2     5.0
3     7.0
4     NaN
5    11.0
6    13.0
7     NaN
8     NaN
dtype: float64

# fill all consecutive outside values backward
In [149]: ser.interpolate(limit_direction="backward", limit_area="outside")
Out[149]:
0     5.0
1     5.0
2     5.0
3     NaN
4     NaN
5     NaN
6    13.0
7     NaN
8     NaN
dtype: float64

# fill all consecutive outside values in both directions
In [150]: ser.interpolate(limit_direction="both", limit_area="outside")
Out[150]:
0     5.0
1     5.0
2     5.0
3     NaN
4     NaN
5     NaN
6    13.0
7    13.0
8    13.0
dtype: float64

Replacing values

Series.replace() and DataFrame.replace() can be used similarly to Series.fillna() and DataFrame.fillna() to replace or insert missing values.

In [151]: df = pd.DataFrame(np.eye(3))

In [152]: df
Out[152]:
     0    1    2
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0

In [153]: df_missing = df.replace(0, np.nan)

In [154]: df_missing
Out[154]:
     0    1    2
0  1.0  NaN  NaN
1  NaN  1.0  NaN
2  NaN  NaN  1.0

In [155]: df_filled = df_missing.replace(np.nan, 2)

In [156]: df_filled
Out[156]:
     0    1    2
0  1.0  2.0  2.0
1  2.0  1.0  2.0
2  2.0  2.0  1.0

Replacing more than one value is possible by passing a list.

In [157]: df_filled.replace([1, 44], [2, 28])
Out[157]:
     0    1    2
0  2.0  2.0  2.0
1  2.0  2.0  2.0
2  2.0  2.0  2.0

Replacing using a mapping dict.

In [158]: df_filled.replace({1: 44, 2: 28})
Out[158]:
      0     1     2
0  44.0  28.0  28.0
1  28.0  44.0  28.0
2  28.0  28.0  44.0

Python strings prefixed with the r character, such as r'hello world', are “raw” strings. They have different semantics regarding backslashes than strings without this prefix. A backslash in a raw string is kept as a literal backslash rather than starting an escape sequence, e.g., r'\n' == '\\n'.

Replace the ‘.’ with NaN:

In [159]: d = {"a": list(range(4)), "b": list("ab.."), "c": ["a", "b", np.nan, "d"]}

In [160]: df = pd.DataFrame(d)

In [161]: df.replace(".", np.nan)
Out[161]:
   a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d

Replace the ‘.’ with NaN using a regular expression that also matches surrounding whitespace:

In [162]: df.replace(r"\s*\.\s*", np.nan, regex=True)
Out[162]:
   a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d

Replace with a list of regexes.

In [163]: df.replace([r"\.", r"(a)"], ["dot", r"\1stuff"], regex=True)
Out[163]:
   a       b       c
0  0  astuff  astuff
1  1       b       b
2  2     dot     NaN
3  3     dot       d

Replace with a regex in a mapping dict.

In [164]: df.replace({"b": r"\s*\.\s*"}, {"b": np.nan}, regex=True)
Out[164]:
   a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d

Pass nested dictionaries of regular expressions that use the regex keyword.

In [165]: df.replace({"b": {"b": r""}}, regex=True)
Out[165]:
   a  b    c
0  0  a    a
1  1       b
2  2  .  NaN
3  3  .    d

In [166]: df.replace(regex={"b": {r"\s*\.\s*": np.nan}})
Out[166]:
   a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d

In [167]: df.replace({"b": r"\s*(\.)\s*"}, {"b": r"\1ty"}, regex=True)
Out[167]:
   a    b    c
0  0    a    a
1  1    b    b
2  2  .ty  NaN
3  3  .ty    d

Pass a list of regular expressions that will replace matches with a scalar.

In [168]: df.replace([r"\s*\.\s*", r"a|b"], "placeholder", regex=True)
Out[168]:
   a            b            c
0  0  placeholder  placeholder
1  1  placeholder  placeholder
2  2  placeholder          NaN
3  3  placeholder            d

All of the regular expression examples can also be passed with the to_replace argument as the regex argument. In this case the value argument must be passed explicitly by name or regex must be a nested dictionary.

In [169]: df.replace(regex=[r"\s*\.\s*", r"a|b"], value="placeholder")
Out[169]:
   a            b            c
0  0  placeholder  placeholder
1  1  placeholder  placeholder
2  2  placeholder          NaN
3  3  placeholder            d

A regular expression object from re.compile is a valid input as well.
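A minimal sketch of that usage, assuming an object-dtype column made up for illustration and reusing the whitespace-tolerant pattern from the earlier examples:

```python
import re

import numpy as np
import pandas as pd

# Illustrative data: two ordinary strings and two "." placeholders,
# one of them padded with whitespace.
df = pd.DataFrame({"b": pd.Series(["a", "b", " . ", "."], dtype=object)})

# Compile the pattern once and pass the compiled object as to_replace.
pattern = re.compile(r"\s*\.\s*")
result = df.replace(pattern, np.nan, regex=True)

print(result["b"].isna().tolist())  # [False, False, True, True]
```

Compiling the pattern up front is mainly useful when the same expression is reused across several calls or needs flags such as re.IGNORECASE.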