Windowing operations

pandas provides a compact set of APIs for windowing operations: operations that perform an aggregation over a sliding partition of values. The API works much like the groupby API, in that a Series or DataFrame calls the windowing method with the necessary parameters and then calls the aggregation function.

In [1]: s = pd.Series(range(5))

In [2]: s.rolling(window=2).sum()
Out[2]:
0    NaN
1    1.0
2    3.0
3    5.0
4    7.0
dtype: float64

Each window is formed by looking back the length of the window from the current observation. The result above can be derived by summing the following windowed partitions of data. The first partition holds a single value, fewer than the default min_periods (equal to window for fixed windows), which is why the first result above is NaN:

In [3]: for window in s.rolling(window=2):
   ...:     print(window)
   ...:
0    0
dtype: int64
0    0
1    1
dtype: int64
1    1
2    2
dtype: int64
2    2
3    3
dtype: int64
3    3
4    4
dtype: int64

Overview

pandas supports 4 types of windowing operations:

  1. Rolling window: Generic fixed or variable sliding window over the values.

  2. Weighted window: Weighted, non-rectangular window supplied by the scipy.signal library.

  3. Expanding window: Accumulating window over the values.

  4. Exponentially Weighted window: Accumulating and exponentially weighted window over the values.

Concept                        Method     Returned Object                            Supports time-based windows  Supports chained groupby  Supports table method    Supports online operations
Rolling window                 rolling    pandas.api.typing.Rolling                  Yes                          Yes                       Yes (as of version 1.3)  No
Weighted window                rolling    pandas.api.typing.Window                   No                           No                        No                       No
Expanding window               expanding  pandas.api.typing.Expanding                No                           Yes                       Yes (as of version 1.3)  No
Exponentially Weighted window  ewm        pandas.api.typing.ExponentialMovingWindow  No                           Yes (as of version 1.2)   No                       Yes (as of version 1.3)

As noted above, some operations support specifying a window based on a time offset:

In [4]: s = pd.Series(range(5), index=pd.date_range('2020-01-01', periods=5, freq='1D'))

In [5]: s.rolling(window='2D').sum()
Out[5]:
2020-01-01    0.0
2020-01-02    1.0
2020-01-03    3.0
2020-01-04    5.0
2020-01-05    7.0
Freq: D, dtype: float64

Additionally, some methods support chaining a groupby operation with a windowing operation. The data is first grouped by the specified keys, and the windowing operation is then performed per group.

In [6]: df = pd.DataFrame({'A': ['a', 'b', 'a', 'b', 'a'], 'B': range(5)})

In [7]: df.groupby('A').expanding().sum()
Out[7]:
       B
A
a 0  0.0
  2  2.0
  4  6.0
b 1  1.0
  3  4.0

Windowing operations currently only support numeric data (integer and float) and will always return float64 values.

Warning

Some windowing aggregation methods, namely mean, sum, var and std, may suffer from numerical imprecision because the underlying windowing algorithms accumulate sums. When values differ in magnitude by a factor of \(1/np.finfo(np.double).eps\), this results in truncation. Note that large values may affect windows that do not include them. Kahan summation is used to compute the rolling sums to preserve accuracy as much as possible.
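To make the compensated summation mentioned in the warning concrete, here is a minimal pure-Python sketch of Kahan summation. It is not pandas' internal implementation; the function name kahan_sum and the sample data are assumptions chosen for illustration.

```python
import math


def kahan_sum(values):
    """Sum floats while carrying a compensation term for lost low-order bits."""
    total = 0.0
    compensation = 0.0  # rounding error accumulated by earlier additions
    for value in values:
        adjusted = value - compensation
        new_total = total + adjusted
        # (new_total - total) is what was actually added; its difference
        # from `adjusted` is the rounding error of this step.
        compensation = (new_total - total) - adjusted
        total = new_total
    return total


# One large value followed by many values too small to register one at a time.
data = [1.0] + [1e-16] * 100_000

# Plain left-to-right accumulation, written as a loop on purpose because the
# built-in sum() may itself use compensated summation in recent Python versions.
naive = 0.0
for value in data:
    naive += value

compensated = kahan_sum(data)
reference = math.fsum(data)  # correctly rounded sum from the standard library

print(naive, compensated, reference)
```

In this sketch the plain loop never moves away from 1.0, while the compensated sum stays close to the reference value.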

New in version 1.3.0.

Some windowing operations also support the method='table' option in the constructor, which performs the windowing operation over an entire DataFrame instead of a single column or row at a time. This can provide a useful performance benefit for a DataFrame with many columns or rows (with the corresponding axis argument), and it makes other columns available during the windowing operation. The method='table' option can only be used if engine='numba' is specified in the corresponding method call.

For example, a weighted mean can be calculated with apply() by specifying a separate column of weights.

In [8]: def weighted_mean(x):
   ...:     arr = np.ones((1, x.shape[1]))
   ...:     arr[:, :2] = (x[:, :2] * x[:, 2]).sum(axis=0) / x[:, 2].sum()
   ...:     return arr
   ...:

In [9]: df = pd.DataFrame([[1, 2, 0.6], [2, 3, 0.4], [3, 4, 0.2], [4, 5, 0.7]])

In [10]: df.rolling(2, method="table", min_periods=0).apply(weighted_mean, raw=True, engine="numba")  # noqa: E501
Out[10]:
          0         1    2
0  1.000000  2.000000  1.0
1  1.800000  2.000000  1.0
2  3.333333  2.333333  1.0
3  1.555556  7.000000  1.0

New in version 1.3.

Some windowing operations also support an online method after constructing a windowing object. It returns a new object that accepts new DataFrame or Series objects and continues the windowing calculation with the new values (i.e. online calculations).

The aggregation method on this new windowing object must be called once first to “prime” the initial state of the online calculation. After that, new DataFrame or Series objects can be passed in the update argument to continue the windowing calculation.

In [11]: df = pd.DataFrame([[1, 2, 0.6], [2, 3, 0.4], [3, 4, 0.2], [4, 5, 0.7]])

In [12]: df.ewm(0.5).mean()
Out[12]:
          0         1         2
0  1.000000  2.000000  0.600000
1  1.750000  2.750000  0.450000
2  2.615385  3.615385  0.276923
3  3.550000  4.550000  0.562500

In [13]: online_ewm = df.head(2).ewm(0.5).online()

In [14]: online_ewm.mean()
Out[14]:
      0     1     2
0  1.00  2.00  0.60
1  1.75  2.75  0.45

In [15]: online_ewm.mean(update=df.tail(1))
Out[15]:
          0         1         2
3  3.307692  4.307692  0.623077

All windowing operations support a min_periods argument that dictates the minimum number of non-np.nan values a window must have; otherwise, the resulting value is np.nan. min_periods defaults to 1 for time-based windows and to window for fixed windows.

In [16]: s = pd.Series([np.nan, 1, 2, np.nan, np.nan, 3])

In [17]: s.rolling(window=3, min_periods=1).sum()
Out[17]:
0    NaN
1    1.0
2    3.0
3    3.0
4    2.0
5    3.0
dtype: float64

In [18]: s.rolling(window=3, min_periods=2).sum()
Out[18]:
0    NaN
1    NaN
2    3.0
3    3.0
4    NaN
5    NaN
dtype: float64

# Equivalent to min_periods=3
In [19]: s.rolling(window=3, min_periods=None).sum()
Out[19]:
0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
dtype: float64
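The min_periods rule shown above can be restated as a small pure-Python sketch. This is an illustration of the semantics only, not how pandas computes the result; the helper name rolling_sum is an assumption.

```python
import math


def rolling_sum(values, window, min_periods=None):
    """Trailing fixed-size rolling sum that skips NaN and honours min_periods."""
    if min_periods is None:
        min_periods = window  # default for fixed windows
    results = []
    for i in range(len(values)):
        # Look back `window` observations from position i (inclusive).
        current = values[max(0, i - window + 1): i + 1]
        valid = [v for v in current if not math.isnan(v)]
        if len(valid) >= min_periods:
            results.append(float(sum(valid)))
        else:
            results.append(math.nan)
    return results


nan = math.nan
data = [nan, 1, 2, nan, nan, 3]

print(rolling_sum(data, 3, min_periods=1))
print(rolling_sum(data, 3, min_periods=2))
print(rolling_sum(data, 3))  # min_periods falls back to the window size
```

The three printed lists correspond to the outputs of In [17], In [18] and In [19].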

Additionally, all windowing operations support the aggregate method, which returns the result of multiple aggregations applied to a window.

In [20]: df = pd.DataFrame({"A": range(5), "B": range(10, 15)})

In [21]: df.expanding().agg(["sum", "mean", "std"])
Out[21]:
      A                    B
    sum mean       std   sum  mean       std
0   0.0  0.0       NaN  10.0  10.0       NaN
1   1.0  0.5  0.707107  21.0  10.5  0.707107
2   3.0  1.0  1.000000  33.0  11.0  1.000000
3   6.0  1.5  1.290994  46.0  11.5  1.290994
4  10.0  2.0  1.581139  60.0  12.0  1.581139

Rolling window

Generic rolling windows support specifying windows as a fixed number of observations or as a variable number of observations based on an offset. If a time-based offset is provided, the corresponding time-based index must be monotonic.

In [22]: times = ['2020-01-01', '2020-01-03', '2020-01-04', '2020-01-05', '2020-01-29']

In [23]: s = pd.Series(range(5), index=pd.DatetimeIndex(times))

In [24]: s
Out[24]:
2020-01-01    0
2020-01-03    1
2020-01-04    2
2020-01-05    3
2020-01-29    4
dtype: int64

# Window with 2 observations
In [25]: s.rolling(window=2).sum()
Out[25]:
2020-01-01    NaN
2020-01-03    1.0
2020-01-04    3.0
2020-01-05    5.0
2020-01-29    7.0
dtype: float64

# Window with 2 days worth of observations
In [26]: s.rolling(window='2D').sum()
Out[26]:
2020-01-01    0.0
2020-01-03    1.0
2020-01-04    3.0
2020-01-05    5.0
2020-01-29    4.0
dtype: float64

For all supported aggregation functions, see Rolling window functions.

Centering windows

By default the labels are set to the right edge of the window, but a center keyword is available so the labels can be set at the center.

In [27]: s = pd.Series(range(10))

In [28]: s.rolling(window=5).mean()
Out[28]:
0    NaN
1    NaN
2    NaN
3    NaN
4    2.0
5    3.0
6    4.0
7    5.0
8    6.0
9    7.0
dtype: float64

In [29]: s.rolling(window=5, center=True).mean()
Out[29]:
0    NaN
1    NaN
2    2.0
3    3.0
4    4.0
5    5.0
6    6.0
7    7.0
8    NaN
9    NaN
dtype: float64

This can also be applied to datetime-like indices.

New in version 1.3.0.

In [30]: df = pd.DataFrame(
   ....:     {"A": [0, 1, 2, 3, 4]}, index=pd.date_range("2020", periods=5, freq="1D")
   ....: )
   ....:

In [31]: df
Out[31]:
            A
2020-01-01  0
2020-01-02  1
2020-01-03  2
2020-01-04  3
2020-01-05  4

In [32]: df.rolling("2D", center=False).mean()
Out[32]:
              A
2020-01-01  0.0
2020-01-02  0.5
2020-01-03  1.5
2020-01-04  2.5
2020-01-05  3.5

In [33]: df.rolling("2D", center=True).mean()
Out[33]:
              A
2020-01-01  0.5
2020-01-02  1.5
2020-01-03  2.5
2020-01-04  3.5
2020-01-05  4.0

Rolling window endpoints

The inclusion of the interval endpoints in rolling window calculations can be specified with the closed parameter:

Value      Behavior
'right'    close right endpoint
'left'     close left endpoint
'both'     close both endpoints
'neither'  open endpoints

For example, having the right endpoint open is useful in many problems that require that present information does not contaminate past information. This allows the rolling window to compute statistics “up to that point in time”, but not including that point in time.

In [34]: df = pd.DataFrame(
   ....:     {"x": 1},
   ....:     index=[
   ....:         pd.Timestamp("20130101 09:00:01"),
   ....:         pd.Timestamp("20130101 09:00:02"),
   ....:         pd.Timestamp("20130101 09:00:03"),
   ....:         pd.Timestamp("20130101 09:00:04"),
   ....:         pd.Timestamp("20130101 09:00:06"),
   ....:     ],
   ....: )
   ....:

In [35]: df["right"] = df.rolling("2s", closed="right").x.sum()  # default

In [36]: df["both"] = df.rolling("2s", closed="both").x.sum()

In [37]: df["left"] = df.rolling("2s", closed="left").x.sum()

In [38]: df["neither"] = df.rolling("2s", closed="neither").x.sum()

In [39]: df
Out[39]:
                     x  right  both  left  neither
2013-01-01 09:00:01  1    1.0   1.0   NaN      NaN
2013-01-01 09:00:02  1    2.0   2.0   1.0      1.0
2013-01-01 09:00:03  1    2.0   3.0   2.0      1.0
2013-01-01 09:00:04  1    2.0   3.0   2.0      1.0
2013-01-01 09:00:06  1    1.0   2.0   1.0      NaN
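The four columns above follow directly from which interval endpoints are included. The pure-Python sketch below restates that rule for a time-based window; it illustrates the semantics only, and the function name time_rolling_sum is an assumption.

```python
import math
from datetime import datetime, timedelta


def time_rolling_sum(times, values, offset, closed="right"):
    """Sum the values whose timestamps fall inside the window ending at each row."""
    results = []
    for current in times:
        start = current - offset
        selected = []
        for stamp, value in zip(times, values):
            # 'left' and 'both' include the left endpoint of the interval.
            left_ok = stamp >= start if closed in ("left", "both") else stamp > start
            # 'right' and 'both' include the right endpoint (the current row).
            right_ok = stamp <= current if closed in ("right", "both") else stamp < current
            if left_ok and right_ok:
                selected.append(value)
        # An empty window has no observations and is reported as NaN.
        results.append(float(sum(selected)) if selected else math.nan)
    return results


base = datetime(2013, 1, 1, 9, 0, 0)
times = [base + timedelta(seconds=s) for s in (1, 2, 3, 4, 6)]
values = [1, 1, 1, 1, 1]

for closed in ("right", "both", "left", "neither"):
    print(closed, time_rolling_sum(times, values, timedelta(seconds=2), closed))
```

Each printed list matches the corresponding column of Out [39].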

Custom window rolling

In addition to accepting an integer or offset as a window argument, rolling also accepts a BaseIndexer subclass that allows a user to define a custom method for calculating window bounds. The BaseIndexer subclass needs to define a get_window_bounds method that returns a tuple of two arrays, the first being the starting indices of the windows and the second being the ending indices of the windows. Additionally, num_values, min_periods, center, closed and step will automatically be passed to get_window_bounds, and the defined method must always accept these arguments.

For example, if we have the following DataFrame

In [40]: use_expanding = [True, False, True, False, True]

In [41]: use_expanding
Out[41]: [True, False, True, False, True]

In [42]: df = pd.DataFrame({"values": range(5)})

In [43]: df
Out[43]:
   values
0       0
1       1
2       2
3       3
4       4

and we want to use an expanding window where use_expanding is True and a window of size 1 otherwise, we can create the following BaseIndexer subclass:

In [44]: from pandas.api.indexers import BaseIndexer

In [45]: class CustomIndexer(BaseIndexer):
   ....:      def get_window_bounds(self, num_values, min_periods, center, closed, step):
   ....:          start = np.empty(num_values, dtype=np.int64)
   ....:          end = np.empty(num_values, dtype=np.int64)
   ....:          for i in range(num_values):
   ....:              if self.use_expanding[i]:
   ....:                  start[i] = 0
   ....:                  end[i] = i + 1
   ....:              else:
   ....:                  start[i] = i
   ....:                  end[i] = i + self.window_size
   ....:          return start, end
   ....:

In [46]: indexer = CustomIndexer(window_size=1, use_expanding=use_expanding)

In [47]: df.rolling(indexer).sum()
Out[47]:
   values
0     0.0
1     1.0
2     3.0
3     3.0
4    10.0
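To see what the two arrays returned by get_window_bounds mean, the sketch below rebuilds the same start and end positions in plain Python and sums each slice. It assumes the end positions are exclusive, as in Python slicing; the helper names custom_bounds and sum_over_bounds are hypothetical and not part of pandas.

```python
def custom_bounds(num_values, use_expanding, window_size=1):
    """Mirror the bounds logic of the CustomIndexer example above."""
    start, end = [], []
    for i in range(num_values):
        if use_expanding[i]:
            # Expanding window: everything from the first row up to row i.
            start.append(0)
            end.append(i + 1)
        else:
            # Fixed window of `window_size` rows starting at row i.
            start.append(i)
            end.append(i + window_size)
    return start, end


def sum_over_bounds(values, start, end):
    """Aggregate values[start[i]:end[i]] for every row i."""
    return [float(sum(values[s:e])) for s, e in zip(start, end)]


values = [0, 1, 2, 3, 4]
use_expanding = [True, False, True, False, True]

start, end = custom_bounds(len(values), use_expanding)
print(start)
print(end)
print(sum_over_bounds(values, start, end))
```

The last printed list matches the values column of Out [47].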

Other BaseIndexer subclasses are available in pandas.api.indexers.

One subclass of note is VariableOffsetWindowIndexer, which allows rolling operations over a non-fixed offset such as BusinessDay.

In [48]: from pandas.api.indexers import VariableOffsetWindowIndexer

In [49]: df = pd.DataFrame(range(10), index=pd.date_range("2020", periods=10))

In [50]: offset = pd.offsets.BDay(1)

In [51]: indexer = VariableOffsetWindowIndexer(index=df.index, offset=offset)

In [52]: df
Out[52]:
            0
2020-01-01  0
2020-01-02  1
2020-01-03  2
2020-01-04  3
2020-01-05  4
2020-01-06  5
2020-01-07  6
2020-01-08  7
2020-01-09  8
2020-01-10  9

In [53]: df.rolling(indexer).sum()
Out[53]:
               0
2020-01-01   0.0
2020-01-02   1.0
2020-01-03   2.0
2020-01-04   3.0
2020-01-05   7.0
2020-01-06  12.0
2020-01-07   6.0
2020-01-08   7.0
2020-01-09   8.0
2020-01-10   9.0

For some problems, knowledge of the future is available for analysis. For example, this occurs when each data point is a full time series read from an experiment, and the task is to extract underlying conditions. In these cases it can be useful to perform forward-looking rolling window computations. The FixedForwardWindowIndexer class is available for this purpose. This BaseIndexer subclass implements a closed fixed-width forward-looking rolling window, and we can use it as follows:

In [54]: from pandas.api.indexers import FixedForwardWindowIndexer

In [55]: indexer = FixedForwardWindowIndexer(window_size=2)

In [56]: df.rolling(indexer, min_periods=1).sum()
Out[56]:
               0
2020-01-01   1.0
2020-01-02   3.0
2020-01-03   5.0
2020-01-04   7.0
2020-01-05   9.0
2020-01-06  11.0
2020-01-07  13.0
2020-01-08  15.0
2020-01-09  17.0
2020-01-10   9.0

We can also achieve this by reversing the data with slicing, applying a rolling aggregation, and then flipping the result back, as shown in the example below:

In [57]: df = pd.DataFrame(
   ....:     data=[
   ....:         [pd.Timestamp("2018-01-01 00:00:00"), 100],
   ....:         [pd.Timestamp("2018-01-01 00:00:01"), 101],
   ....:         [pd.Timestamp("2018-01-01 00:00:03"), 103],
   ....:         [pd.Timestamp("2018-01-01 00:00:04"), 111],
   ....:     ],
   ....:     columns=["time", "value"],
   ....: ).set_index("time")
   ....:

In [58]: df
Out[58]:
                     value
time
2018-01-01 00:00:00    100
2018-01-01 00:00:01    101
2018-01-01 00:00:03    103
2018-01-01 00:00:04    111

In [59]: reversed_df = df[::-1].rolling("2s").sum()[::-1]

In [60]: reversed_df
Out[60]:
                     value
time
2018-01-01 00:00:00  201.0
2018-01-01 00:00:01  101.0
2018-01-01 00:00:03  214.0
2018-01-01 00:00:04  111.0

Rolling apply

The apply() function takes an extra func argument and performs generic rolling computations. The func argument should be a single function that produces a single value from an ndarray input. raw specifies whether the windows are cast as Series objects (raw=False) or ndarray objects (raw=True).

In [61]: def mad(x):
   ....:     return np.fabs(x - x.mean()).mean()
   ....:

In [62]: s = pd.Series(range(10))

In [63]: s.rolling(window=4).apply(mad, raw=True)
Out[63]:
0    NaN
1    NaN
2    NaN
3    1.0
4    1.0
5    1.0
6    1.0
7    1.0
8    1.0
9    1.0
dtype: float64

Numba engine

Additionally, apply() can leverage Numba if it is installed as an optional dependency. The apply aggregation can be executed using Numba by specifying the engine='numba' and engine_kwargs arguments (raw must also be set to True). See enhancing performance with Numba for general usage of the arguments and performance considerations.

Numba will be applied in potentially two routines:

  1. If func is a standard Python function, the engine will JIT the passed function. func can also be a JITed function in which case the engine will not JIT the function again.

  2. The engine will JIT the for loop where the apply function is applied to each window.

The engine_kwargs argument is a dictionary of keyword arguments that will be passed into the numba.jit decorator. These keyword arguments will be applied to both the passed function (if it is a standard Python function) and the apply for loop over each window.

New in version 1.3.0.

mean, median, max, min, and sum also support the engine and engine_kwargs arguments.

Binary window functions

cov() and corr() can compute moving window statistics about two Series or any combination of DataFrame/Series or DataFrame/DataFrame. Here is the behavior in each case:

  1. two Series: compute the statistic for the pairing.

  2. DataFrame/Series: compute the statistics for each column of the DataFrame with the passed Series, thus returning a DataFrame.

  3. DataFrame/DataFrame: by default compute the statistic for matching column names, returning a DataFrame. If the keyword argument pairwise=True is passed then computes the statistic for each pair of columns, returning a DataFrame with a MultiIndex whose values are the dates in question (see the next section).

For example:

In [64]: df = pd.DataFrame(
   ....:     np.random.randn(10, 4),
   ....:     index=pd.date_range("2020-01-01", periods=10),
   ....:     columns=["A", "B", "C", "D"],
   ....: )
   ....:

In [65]: df = df.cumsum()

In [66]: df2 = df[:4]

In [67]: df2.rolling(window=2).corr(df2["B"])
Out[67]:
              A    B    C    D
2020-01-01  NaN  NaN  NaN  NaN
2020-01-02 -1.0  1.0 -1.0  1.0
2020-01-03  1.0  1.0  1.0 -1.0
2020-01-04 -1.0  1.0  1.0 -1.0
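Every correlation in the output above is exactly 1.0 or -1.0 because a window of two observations always lies on a straight line. The pure-Python sketch below shows this with a hand-written rolling covariance and correlation for two sequences. It assumes the sample statistic with ddof=1; the helper names and data are illustrative and not part of pandas.

```python
import math


def rolling_cov(x, y, window):
    """Sample covariance (ddof=1) of x and y over trailing fixed-size windows."""
    results = []
    for i in range(len(x)):
        if i + 1 < window:
            results.append(math.nan)  # not enough observations yet
            continue
        xs = x[i - window + 1: i + 1]
        ys = y[i - window + 1: i + 1]
        mean_x = sum(xs) / window
        mean_y = sum(ys) / window
        cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
        results.append(cross / (window - 1))
    return results


def rolling_corr(x, y, window):
    """Rolling Pearson correlation built from the rolling covariances."""
    cov_xy = rolling_cov(x, y, window)
    var_x = rolling_cov(x, x, window)
    var_y = rolling_cov(y, y, window)
    return [c / math.sqrt(vx * vy) for c, vx, vy in zip(cov_xy, var_x, var_y)]


x = [1.0, 2.0, 4.0, 3.0]
y = [5.0, 3.0, 6.0, 8.0]

print(rolling_cov(x, y, window=2))
print(rolling_corr(x, y, window=2))
```

With window=2 the correlation only records whether the two series moved in the same direction between consecutive rows; larger windows give values strictly between -1 and 1 in general.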

Computing rolling pairwise covariances and correlations

In financial data analysis and other fields it is common to compute covariance and correlation matrices for a collection of time series. Often one is also interested in moving-window covariance and correlation matrices. This can be done by passing the pairwise keyword argument, which in the case of DataFrame inputs will yield a MultiIndexed DataFrame whose index contains the dates in question. In the case of a single DataFrame argument the pairwise argument can even be omitted.

Missing values are ignored and each entry is computed using the pairwise complete observations.

Assuming the missing data are missing at random, this results in an unbiased estimate of the covariance matrix. However, for many applications this estimate may not be acceptable because the estimated covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimated correlations having absolute values greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices for more details.

In [68]: covs = (
   ....:     df[["B", "C", "D"]]
   ....:     .rolling(window=4)
   ....:     .cov(df[["A", "B", "C"]], pairwise=True)
   ....: )
   ....:

In [69]: covs
Out[69]:
                     B         C         D
2020-01-01 A       NaN       NaN       NaN
           B       NaN       NaN       NaN
           C       NaN       NaN       NaN
2020-01-02 A       NaN       NaN       NaN
           B       NaN       NaN       NaN
...                ...       ...       ...
2020-01-09 B  0.342006  0.230190  0.052849
           C  0.230190  1.575251  0.082901
2020-01-10 A -0.333945  0.006871 -0.655514
           B  0.649711  0.430860  0.469271
           C  0.430860  0.829721  0.055300

[30 rows x 3 columns]

Weighted window

The win_type argument in .rolling generates weighted windows that are commonly used in filtering and spectral estimation. win_type must be a string that corresponds to a scipy.signal window function. Scipy must be installed in order to use these windows, and supplementary arguments that the Scipy window methods take must be specified in the aggregation function.

In [70]: s = pd.Series(range(10))

In [71]: s.rolling(window=5).mean()
Out[71]:
0    NaN
1    NaN
2    NaN
3    NaN
4    2.0
5    3.0
6    4.0
7    5.0
8    6.0
9    7.0
dtype: float64

In [72]: s.rolling(window=5, win_type="triang").mean()
Out[72]:
0    NaN
1    NaN
2    NaN
3    NaN
4    2.0
5    3.0
6    4.0
7    5.0
8    6.0
9    7.0
dtype: float64

# Supplementary Scipy arguments passed in the aggregation function
In [73]: s.rolling(window=5, win_type="gaussian").mean(std=0.1)
Out[73]:
0    NaN
1    NaN
2    NaN
3    NaN
4    2.0
5    3.0
6    4.0
7    5.0
8    6.0
9    7.0
dtype: float64
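The weighted results above equal the unweighted rolling mean because the input is a straight line and the window weights are symmetric, so the weighted mean of each window is its centre value. The pure-Python sketch below shows this and then uses non-linear data, where the two means differ. It assumes the weighted mean is sum(w * x) / sum(w) and that the 5-point triangular window has weights 1/3, 2/3, 1, 2/3, 1/3; the helper name is hypothetical.

```python
def weighted_rolling_mean(values, weights):
    """Trailing rolling mean with one fixed weight per window position."""
    window = len(weights)
    total_weight = sum(weights)
    results = []
    for i in range(len(values)):
        if i + 1 < window:
            results.append(float("nan"))  # window not yet full
            continue
        current = values[i - window + 1: i + 1]
        weighted = sum(w * v for w, v in zip(weights, current))
        results.append(weighted / total_weight)
    return results


triang5 = [1 / 3, 2 / 3, 1.0, 2 / 3, 1 / 3]  # assumed triangular weights
boxcar5 = [1.0] * 5                          # equal weights, i.e. a plain mean

linear = list(range(10))
squares = [v * v for v in range(10)]

# On linear data both windows return the centre of each window.
print(weighted_rolling_mean(linear, triang5)[4:])
print(weighted_rolling_mean(linear, boxcar5)[4:])

# On curved data the triangular window weights the centre more heavily.
print(weighted_rolling_mean(squares, triang5)[4:])
print(weighted_rolling_mean(squares, boxcar5)[4:])
```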

For all supported aggregation functions, see Weighted window functions.

Expanding window

An expanding window yields the value of an aggregation statistic with all the data available up to that point in time. Since these calculations are a special case of rolling statistics, they are implemented in pandas such that the following two calls are equivalent:

In [74]: df = pd.DataFrame(range(5))

In [75]: df.rolling(window=len(df), min_periods=1).mean()
Out[75]:
     0
0  0.0
1  0.5
2  1.0
3  1.5
4  2.0

In [76]: df.expanding(min_periods=1).mean()
Out[76]:
     0
0  0.0
1  0.5
2  1.0
3  1.5
4  2.0

For all supported aggregation functions, see Expanding window functions.

Exponentially weighted window

An exponentially weighted window is similar to an expanding window, but each prior point is exponentially weighted down relative to the current point.

In general, a weighted moving average is calculated as

\[y_t = \frac{\sum_{i=0}^t w_i x_{t-i}}{\sum_{i=0}^t w_i},\]

where \(x_t\) is the input, \(y_t\) is the result and the \(w_i\) are the weights.

For all supported aggregation functions, see Exponentially-weighted window functions.

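The EW functions support two variants of exponential weights. The default, adjust=True, uses the weights \(w_i = (1 - \alpha)^i\), which gives

\[y_t = \frac{x_t + (1 - \alpha)x_{t-1} + (1 - \alpha)^2 x_{t-2} + \dots + (1 - \alpha)^t x_0}{1 + (1 - \alpha) + (1 - \alpha)^2 + \dots + (1 - \alpha)^t}\]

When adjust=False is specified, moving averages are calculated as

\[\begin{aligned}
y_0 &= x_0 \\
y_t &= (1 - \alpha) y_{t-1} + \alpha x_t,
\end{aligned}\]

which is equivalent to using weights

\[w_i = \begin{cases}
\alpha (1 - \alpha)^i & \text{if } i < t \\
(1 - \alpha)^i & \text{if } i = t.
\end{cases}\]

These equations are sometimes written in terms of \(\alpha' = 1 - \alpha\), e.g.

\[y_t = \alpha' y_{t-1} + (1 - \alpha') x_t.\]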

The difference between the two variants arises because we are dealing with series that have finite history. Consider a series of infinite history, with adjust=True:

\[y_t = \frac{x_t + (1 - \alpha)x_{t-1} + (1 - \alpha)^2 x_{t-2} + \dots}{1 + (1 - \alpha) + (1 - \alpha)^2 + \dots}\]

Noting that the denominator is a geometric series with initial term equal to 1 and a ratio of \(1 - \alpha\), we have

\[\begin{aligned}
y_t &= \frac{x_t + (1 - \alpha)x_{t-1} + (1 - \alpha)^2 x_{t-2} + \dots}{\frac{1}{1 - (1 - \alpha)}} \\
&= [x_t + (1 - \alpha)x_{t-1} + (1 - \alpha)^2 x_{t-2} + \dots] \alpha \\
&= \alpha x_t + [(1 - \alpha)x_{t-1} + (1 - \alpha)^2 x_{t-2} + \dots] \alpha \\
&= \alpha x_t + (1 - \alpha)[x_{t-1} + (1 - \alpha) x_{t-2} + \dots] \alpha \\
&= \alpha x_t + (1 - \alpha) y_{t-1}
\end{aligned}\]

which is the same expression as adjust=False above and therefore shows the equivalence of the two variants for infinite series. When adjust=False, we have \(y_0 = x_0\) and \(y_t = \alpha x_t + (1 - \alpha) y_{t-1}\). Therefore, there is an assumption that \(x_0\) is not an ordinary value but rather an exponentially weighted moment of the infinite series up to that point.
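The two variants can be compared numerically with a short pure-Python sketch of the exponentially weighted mean. It follows the adjust=True weights \(w_i = (1 - \alpha)^i\) and the adjust=False recursion described in this section, and it ignores missing values entirely; the function name ewm_mean is an assumption.

```python
def ewm_mean(values, alpha, adjust=True):
    """Exponentially weighted mean for a sequence without missing values."""
    results = []
    if adjust:
        numerator = 0.0
        denominator = 0.0
        for x in values:
            # Age every existing weight by (1 - alpha); the new point gets weight 1.
            numerator = x + (1 - alpha) * numerator
            denominator = 1 + (1 - alpha) * denominator
            results.append(numerator / denominator)
    else:
        y = None
        for x in values:
            # y_0 = x_0, then y_t = (1 - alpha) * y_{t-1} + alpha * x_t.
            y = float(x) if y is None else (1 - alpha) * y + alpha * x
            results.append(y)
    return results


alpha = 2 / 3  # the smoothing factor that corresponds to com=0.5

print(ewm_mean([1, 2, 3, 4], alpha, adjust=True))
print(ewm_mean([1, 2, 3, 4], alpha, adjust=False))

# With a long history the two variants converge to the same value.
long_series = [float(i % 7) for i in range(200)]
print(ewm_mean(long_series, alpha, adjust=True)[-1])
print(ewm_mean(long_series, alpha, adjust=False)[-1])
```

The adjust=True result for 1, 2, 3, 4 agrees with the first column of df.ewm(0.5).mean() in Out [12].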

One must have \(0 < \alpha \leq 1\), and while it is possible to pass \(\alpha\) directly, it is often easier to think about either the span, center of mass (com) or half-life of an EW moment:

\[\alpha = \begin{cases}
\frac{2}{s + 1}, & \text{for span } s \geq 1 \\
\frac{1}{1 + c}, & \text{for center of mass } c \geq 0 \\
1 - \exp\left(\frac{\log 0.5}{h}\right), & \text{for half-life } h > 0
\end{cases}\]

One must specify precisely one of span, center of mass, half-life and alpha to the EW functions:

  1. Span corresponds to what is commonly called an “N-day EW moving average”.

  2. Center of mass has a more physical interpretation and can be thought of in terms of span: \(c = (s - 1) / 2\).

  3. Half-life is the period of time for the exponential weight to reduce to one half.

  4. Alpha specifies the smoothing factor directly.
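The four parameterisations listed above all reduce to a single smoothing factor. The sketch below converts each of them to \(\alpha\) using the standard relations \(\alpha = 2/(s+1)\), \(\alpha = 1/(1+c)\) and \(\alpha = 1 - \exp(\log 0.5 / h)\). The helper name alpha_from is hypothetical and is not a pandas function.

```python
import math


def alpha_from(span=None, com=None, halflife=None, alpha=None):
    """Convert exactly one of span, com, halflife or alpha to a smoothing factor."""
    given = [v for v in (span, com, halflife, alpha) if v is not None]
    if len(given) != 1:
        raise ValueError("specify exactly one of span, com, halflife, alpha")
    if span is not None:
        if span < 1:
            raise ValueError("span must satisfy span >= 1")
        return 2.0 / (span + 1.0)
    if com is not None:
        if com < 0:
            raise ValueError("com must satisfy com >= 0")
        return 1.0 / (1.0 + com)
    if halflife is not None:
        if halflife <= 0:
            raise ValueError("halflife must satisfy halflife > 0")
        return 1.0 - math.exp(math.log(0.5) / halflife)
    if not 0 < alpha <= 1:
        raise ValueError("alpha must satisfy 0 < alpha <= 1")
    return float(alpha)


print(alpha_from(span=3))      # span s = 3
print(alpha_from(com=1))       # c = (s - 1) / 2 = 1 gives the same alpha
print(alpha_from(halflife=1))  # weight halves after one period
print(alpha_from(com=0.5))     # the value used by df.ewm(0.5)
```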

You can also specify halflife as a timedelta-convertible unit, giving the amount of time it takes for an observation to decay to half its value, when also specifying a sequence of times.

In [77]: df = pd.DataFrame({"B": [0, 1, 2, np.nan, 4]})

In [78]: df
Out[78]:
     B
0  0.0
1  1.0
2  2.0
3  NaN
4  4.0

In [79]: times = ["2020-01-01", "2020-01-03", "2020-01-10", "2020-01-15", "2020-01-17"]

In [80]: df.ewm(halflife="4 days", times=pd.DatetimeIndex(times)).mean()
Out[80]:
          B
0  0.000000
1  0.585786
2  1.523889
3  1.523889
4  3.233686

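The following formula is used to compute the exponentially weighted mean with an input vector of times:

\[y_t = \frac{\sum_{i=0}^t 0.5^{\frac{t_t - t_{t-i}}{\lambda}} x_{t-i}}{\sum_{i=0}^t 0.5^{\frac{t_t - t_{t-i}}{\lambda}}},\]

where \(t_j\) is the time of observation \(x_j\) and \(\lambda\) is the half-life.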
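As a worked check of the time-based half-life, the sketch below recomputes the exponentially weighted mean for the values and times of In [77] to In [80] in plain Python. Each observation is weighted by 0.5 raised to its age divided by the half-life. Skipping NaN inputs and carrying the previous result forward is an assumption chosen to match the output shown above; the function name is hypothetical.

```python
import math
from datetime import date


def ewm_mean_times(values, times, halflife_days):
    """Exponentially weighted mean where the weights decay with elapsed days."""
    results = []
    previous = math.nan
    for t in range(len(values)):
        if math.isnan(values[t]):
            # Assumption: a missing input repeats the last computed result.
            results.append(previous)
            continue
        numerator = 0.0
        denominator = 0.0
        for j in range(t + 1):
            if math.isnan(values[j]):
                continue
            age_days = (times[t] - times[j]).days
            weight = 0.5 ** (age_days / halflife_days)
            numerator += weight * values[j]
            denominator += weight
        previous = numerator / denominator
        results.append(previous)
    return results


values = [0.0, 1.0, 2.0, math.nan, 4.0]
times = [
    date(2020, 1, 1),
    date(2020, 1, 3),
    date(2020, 1, 10),
    date(2020, 1, 15),
    date(2020, 1, 17),
]

print(ewm_mean_times(values, times, halflife_days=4))
```

Rounded to six decimals, the printed values agree with Out [80].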

ExponentialMovingWindow also has an ignore_na argument, which determines how intermediate null values affect the calculation of the weights. When ignore_na=False (the default), weights are calculated based on absolute positions, so that intermediate null values affect the result. When ignore_na=True, weights are calculated by ignoring intermediate null values. For example, assuming adjust=True, if ignore_na=False, the weighted average of 3, NaN, 5 would be calculated as

\[\frac{(1 - \alpha)^2 \cdot 3 + 1 \cdot 5}{(1 - \alpha)^2 + 1}.\]

Whereas if ignore_na=True, the weighted average would be calculated as

\[\frac{(1 - \alpha) \cdot 3 + 1 \cdot 5}{(1 - \alpha) + 1}.\]
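A short numeric illustration of the two weightings for the sequence 3, NaN, 5 follows. It assumes adjust=True and picks \(\alpha = 0.5\) purely as an example value.

```python
alpha = 0.5  # example smoothing factor, chosen for illustration

# ignore_na=False: weights follow absolute positions, so 3 sits two steps
# behind 5 and receives the weight (1 - alpha) ** 2.
by_position = ((1 - alpha) ** 2 * 3 + 1 * 5) / ((1 - alpha) ** 2 + 1)

# ignore_na=True: the NaN is skipped, so 3 is treated as one step behind 5
# and receives the weight (1 - alpha).
skipping_nan = ((1 - alpha) * 3 + 1 * 5) / ((1 - alpha) + 1)

print(by_position)   # about 4.6
print(skipping_nan)  # about 4.333
```

With ignore_na=True the older value keeps more weight, so the average is pulled further towards 3.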

The var(), std(), and cov() functions have a bias argument, specifying whether the result should contain biased or unbiased statistics. For example, if bias=True, ewmvar(x) is calculated as ewmvar(x) = ewma(x**2) - ewma(x)**2; whereas if bias=False (the default), the biased variance statistics are scaled by the debiasing factor

\[\frac{\left(\sum_{i=0}^t w_i\right)^2}{\left(\sum_{i=0}^t w_i\right)^2 - \sum_{i=0}^t w_i^2}.\]

(For \(w_i = 1\), this reduces to the usual \(N / (N - 1)\) factor, with \(N = t + 1\).) See Weighted Sample Variance on Wikipedia for further details.
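The debiasing step can be checked numerically. The sketch below computes the biased weighted variance as the weighted mean of squares minus the squared weighted mean, then scales it by the factor \((\sum w_i)^2 / ((\sum w_i)^2 - \sum w_i^2)\). With equal weights this recovers the ordinary sample variance. The helper names are assumptions, not pandas functions.

```python
def weighted_mean(values, weights):
    """Weighted average of values, with one weight per value."""
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def biased_weighted_var(values, weights):
    """Weighted mean of squares minus the square of the weighted mean."""
    mean_of_squares = weighted_mean([v * v for v in values], weights)
    return mean_of_squares - weighted_mean(values, weights) ** 2


def debias_factor(weights):
    """(sum w)^2 / ((sum w)^2 - sum w^2)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / (total * total - total_sq)


values = [1.0, 2.0, 3.0, 4.0]

# Equal weights: the factor reduces to N / (N - 1) with N = 4.
equal = [1.0] * 4
print(debias_factor(equal))
print(biased_weighted_var(values, equal) * debias_factor(equal))

# Exponential weights (1 - alpha) ** i, the newest observation has i = 0.
alpha = 0.5
exp_weights = [(1 - alpha) ** i for i in range(len(values))][::-1]
print(biased_weighted_var(values, exp_weights) * debias_factor(exp_weights))
```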