Copy-on-Write (CoW)
Copy-on-Write will become the default in pandas 3.0. We recommend turning it on now to benefit from all improvements.
Copy-on-Write was first introduced in version 1.5.0. Starting from version 2.0, most of the optimizations that become possible through CoW are implemented and supported. All possible optimizations are supported starting from pandas 2.1.
CoW will be enabled by default in version 3.0.
CoW will lead to more predictable behavior, since it is not possible to update more than one object with one statement; for example, indexing operations or methods won't have side effects. Additionally, by delaying copies as long as possible, average performance and memory usage will improve.
Previous behavior
pandas indexing behavior is tricky to understand. Some operations return views while others return copies. Depending on the result of the operation, mutating one object might accidentally mutate another:
In [1]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
In [2]: subset = df["foo"]
In [3]: subset.iloc[0] = 100
In [4]: df
Out[4]:
   foo  bar
0  100    4
1    2    5
2    3    6
Mutating subset, e.g. updating its values, also updates df. The exact behavior is hard to predict. Copy-on-Write solves the problem of accidentally modifying more than one object by explicitly disallowing it. With CoW enabled, df is unchanged:
In [5]: pd.options.mode.copy_on_write = True
In [6]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
In [7]: subset = df["foo"]
In [8]: subset.iloc[0] = 100
In [9]: df
Out[9]:
   foo  bar
0    1    4
1    2    5
2    3    6
The following sections explain what this means and how it impacts existing applications.
Migrating to Copy-on-Write
Copy-on-Write will be the default and only mode in pandas 3.0. This means that users need to migrate their code to be compliant with CoW rules.
In its default mode, pandas raises warnings for certain cases where CoW will actively change behavior and thus change what the user's code was intended to do.
We added another mode, e.g.
pd.options.mode.copy_on_write = "warn"
that will warn for every operation that will change behavior with CoW. We expect this mode to be very noisy, since it also emits warnings for many cases that we don't expect to influence users. We recommend checking this mode and analyzing the warnings, but it is not necessary to address all of these warnings. The first two items of the following list are the only cases that need to be addressed to make existing code work with CoW.
The following items describe the user-visible changes:
Chained assignment will never work
loc should be used as an alternative. Check the chained assignment section for more details.
Accessing the underlying array of a pandas object will return a read-only view
In [10]: ser = pd.Series([1, 2, 3])
In [11]: ser.to_numpy()
Out[11]: array([1, 2, 3])
This example returns a NumPy array that is a view of the Series object. Modifying this view would also modify the pandas object, which is not compliant with CoW rules. The returned array is therefore set to non-writeable to protect against this behavior. Creating a copy of this array allows modification. You can also make the array writeable again if you don't care about the pandas object anymore.
See the section about read-only NumPy arrays for more details.
Only one pandas object is updated at once
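The following code snippet updates both df and subset without CoW. The output shown here comes from a session in which CoW was already enabled, so df appears unchanged; without CoW, the first value of foo in df would be 100.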
In [12]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
In [13]: subset = df["foo"]
In [14]: subset.iloc[0] = 100
In [15]: df
Out[15]:
   foo  bar
0    1    4
1    2    5
2    3    6
This won't be possible anymore with CoW, since the CoW rules explicitly forbid it. This includes updating a single column as a Series and relying on the change propagating back to the parent DataFrame. If this behavior is necessary, the update can be rewritten as a single statement with loc or iloc. DataFrame.where() is another suitable alternative for this case.
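As a rough sketch of the rewrites mentioned above, the update can target the parent DataFrame directly with loc or iloc, or a new object can be built with DataFrame.where(). The variable name `replaced` and the concrete values are illustrative assumptions, not taken from the original text.

```python
import pandas as pd

pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

# Instead of `subset = df["foo"]; subset.iloc[0] = 100`,
# update the parent DataFrame in a single statement.
df.loc[0, "foo"] = 100                        # label-based
df.iloc[1, df.columns.get_loc("foo")] = 200   # position-based

# DataFrame.where keeps values where the condition is True and
# replaces the others; it returns a new object and leaves df alone.
replaced = df.where(df != 3, 300)

print(df)
print(replaced)
```

Here loc and iloc write into df itself, while where() leaves df untouched and returns a modified copy, which fits CoW's rule that one statement updates one object.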
Updating a column selected from a DataFrame with an inplace method will also not work anymore.
In [16]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
In [17]: df["foo"].replace(1, 5, inplace=True)
In [18]: df
Out[18]:
   foo  bar
0    1    4
1    2    5
2    3    6
This is another form of chained assignment. It can generally be rewritten in two different forms:
In [19]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
In [20]: df.replace({"foo": {1: 5}}, inplace=True)
In [21]: df
Out[21]:
   foo  bar
0    5    4
1    2    5
2    3    6
A different alternative would be to not use inplace:
In [22]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
In [23]: df["foo"] = df["foo"].replace(1, 5)
In [24]: df
Out[24]:
   foo  bar
0    5    4
1    2    5
2    3    6
Constructors now copy NumPy arrays by default
The Series and DataFrame constructors will now copy NumPy arrays by default when not otherwise specified. This was changed to avoid mutating a pandas object when the NumPy array is changed inplace outside of pandas. You can set copy=False to avoid this copy.
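The difference between the default and copy=False can be sketched as follows. This assumes a pandas version in which the constructors already copy by default under CoW; the names `ser_copied` and `ser_shared` are illustrative.

```python
import numpy as np
import pandas as pd

pd.set_option("mode.copy_on_write", True)

arr = np.array([1, 2, 3])

ser_copied = pd.Series(arr)              # default: the data is copied
ser_shared = pd.Series(arr, copy=False)  # opt out of the copy

# Mutate the NumPy array in place, outside of pandas.
arr[0] = 100

print(ser_copied.iloc[0])   # value held by the copied Series
print(ser_shared.iloc[0])   # value held by the Series that shares memory
print(np.shares_memory(arr, ser_shared.to_numpy()))
```

With copy=False the Series keeps following changes made to the array, so it is only a reasonable choice when nothing else will modify that array afterwards.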
Description
CoW means that any DataFrame or Series derived from another in any way always behaves as a copy. As a consequence, we can only change the values of an object by modifying the object itself. CoW disallows updating a DataFrame or a Series that shares data with another DataFrame or Series object inplace.
This avoids side effects when modifying values; hence, most methods can avoid actually copying the data and only trigger a copy when necessary.
The following example will operate inplace with CoW:
In [25]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
In [26]: df.iloc[0, 0] = 100
In [27]: df
Out[27]:
   foo  bar
0  100    4
1    2    5
2    3    6
The object df does not share any data with any other object and hence no copy is triggered when updating the values. In contrast, the following operation triggers a copy of the data under CoW:
In [28]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
In [29]: df2 = df.reset_index(drop=True)
In [30]: df2.iloc[0, 0] = 100
In [31]: df
Out[31]:
   foo  bar
0    1    4
1    2    5
2    3    6
In [32]: df2
Out[32]:
   foo  bar
0  100    4
1    2    5
2    3    6
reset_index returns a lazy copy with CoW, while it copies the data without CoW. Since both objects, df and df2, share the same data, a copy is triggered when modifying df2. The object df still has the same values as initially while df2 was modified.
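One way to observe the lazy copy is to check whether the two objects still use the same memory before and after the write. The helper `shares_foo` below is hypothetical and written only for this sketch; it relies on np.shares_memory and on to_numpy() returning a view of the column.

```python
import numpy as np
import pandas as pd

pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
df2 = df.reset_index(drop=True)


def shares_foo(left, right):
    """Hypothetical helper: do the "foo" columns use the same buffer?"""
    return np.shares_memory(left["foo"].to_numpy(), right["foo"].to_numpy())


shared_before = shares_foo(df, df2)  # lazy copy: nothing copied yet
df2.iloc[0, 0] = 100                 # the first write triggers the copy
shared_after = shares_foo(df, df2)

print(shared_before, shared_after)
```

The copy is paid for only at the moment df2 is written to; if df2 were never modified, no data would be duplicated at all.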
If the object df isn't needed anymore after performing the reset_index operation, you can emulate an inplace-like operation by assigning the output of reset_index to the same variable:
In [33]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
In [34]: df = df.reset_index(drop=True)
In [35]: df.iloc[0, 0] = 100
In [36]: df
Out[36]:
   foo  bar
0  100    4
1    2    5
2    3    6
The initial object goes out of scope as soon as the result of reset_index is reassigned, and hence df does not share data with any other object. No copy is necessary when modifying the object. This is generally true for all methods listed in Copy-on-Write optimizations.
Previously, when operating on views, both the view and the parent object were modified:
In [37]: with pd.option_context("mode.copy_on_write", False):
   ....:     df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
   ....:     view = df[:]
   ....:     df.iloc[0, 0] = 100
   ....:
In [38]: df
Out[38]:
   foo  bar
0  100    4
1    2    5
2    3    6
In [39]: view
Out[39]:
   foo  bar
0  100    4
1    2    5
2    3    6
CoW triggers a copy when df is changed to avoid mutating view as well:
In [40]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
In [41]: view = df[:]
In [42]: df.iloc[0, 0] = 100
In [43]: df
Out[43]:
   foo  bar
0  100    4
1    2    5
2    3    6
In [44]: view
Out[44]:
   foo  bar
0    1    4
1    2    5
2    3    6
Chained Assignment
Chained assignment refers to a technique where an object is updated through two subsequent indexing operations, e.g.
In [45]: with pd.option_context("mode.copy_on_write", False):
   ....:     df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
   ....:     df["foo"][df["bar"] > 5] = 100
   ....:     df
   ....:
The column foo is updated where the column bar is greater than 5. This violates the CoW principles though, because it would have to modify the view df["foo"] and df in one step. Hence, with CoW enabled, chained assignment never works and raises a ChainedAssignmentError warning:
In [46]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
In [47]: df["foo"][df["bar"] > 5] = 100
With Copy-on-Write this can be done by using loc.
In [48]: df.loc[df["bar"] > 5, "foo"] = 100
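The two forms can be compared side by side in a small sketch. It assumes CoW is enabled and silences warnings around the chained statement so that only the resulting values are compared; the names `after_chained` and `after_loc` are illustrative.

```python
import warnings

import pandas as pd

pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

# Chained assignment: the write lands on a temporary object, not on df.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide the chained-assignment warning
    df["foo"][df["bar"] > 5] = 100
after_chained = df["foo"].tolist()

# Single-step assignment with loc updates df itself.
df.loc[df["bar"] > 5, "foo"] = 100
after_loc = df["foo"].tolist()

print(after_chained)
print(after_loc)
```

Only the loc form changes df, because the row selection and the column selection happen in a single indexing operation on the parent object.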
Read-only NumPy arrays
Accessing the underlying NumPy array of a DataFrame will return a read-only array if the array shares data with the initial DataFrame:
The array is a copy if the initial DataFrame consists of more than one array:
In [49]: df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
In [50]: df.to_numpy()
Out[50]:
array([[1. , 1.5],
       [2. , 2.5]])
The array shares data with the DataFrame if the DataFrame consists of only one NumPy array:
In [51]: df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
In [52]: df.to_numpy()
Out[52]:
array([[1, 3],
       [2, 4]])
This array is read-only, which means that it can't be modified inplace:
In [53]: arr = df.to_numpy()
In [54]: arr[0, 0] = 100
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[54], line 1
----> 1 arr[0, 0] = 100
ValueError: assignment destination is read-only
The same holds true for a Series, since a Series always consists of a single array.
There are two potential solutions to this:
- Trigger a copy manually if you want to avoid updating DataFrames that share memory with your array.
- Make the array writeable. This is a more performant solution but circumvents Copy-on-Write rules, so it should be used with caution.
In [55]: arr = df.to_numpy()
In [56]: arr.flags.writeable = True
In [57]: arr[0, 0] = 100
In [58]: arr
Out[58]:
array([[100,   3],
       [  2,   4]])
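The first option, triggering a copy manually, is not shown above. A minimal sketch, assuming a plain ndarray.copy() is acceptable for the data size at hand:

```python
import pandas as pd

pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# An explicit copy owns its data, so it is writeable and
# changes made to it never reach df.
arr = df.to_numpy().copy()
arr[0, 0] = 100

print(arr.flags.writeable)
print(df.iloc[0, 0])
```

This costs one extra copy of the data but keeps the CoW guarantee intact, unlike flipping the writeable flag on the shared array.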
Patterns to avoid
No defensive copy will be performed if two objects share the same data while you are modifying one object inplace.
In [59]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
In [60]: df2 = df.reset_index(drop=True)
In [61]: df2.iloc[0, 0] = 100
This creates two objects that share data, and thus the setitem operation will trigger a copy. This is not necessary if the initial object df isn't needed anymore. Simply reassigning to the same variable will invalidate the reference that is held by the object.
In [62]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
In [63]: df = df.reset_index(drop=True)
In [64]: df.iloc[0, 0] = 100
No copy is necessary in this example. Creating multiple references keeps unnecessary references alive and thus will hurt performance with Copy-on-Write.
Copy-on-Write optimizations
A new lazy copy mechanism defers the copy until the object in question is modified, and copies only if this object shares data with another object. This mechanism was added to methods that don't require a copy of the underlying data. Popular examples are DataFrame.drop() for axis=1 and DataFrame.rename().
These methods return views when Copy-on-Write is enabled, which provides a significant performance improvement compared to the regular execution.
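The view behavior of the two methods named above can be inspected with np.shares_memory. This is a sketch under the assumption that to_numpy() exposes the underlying column buffer; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

renamed = df.rename(columns={"foo": "baz"})
dropped = df.drop("bar", axis=1)

# Check whether the results still point at df's data.
rename_shares = np.shares_memory(df["foo"].to_numpy(), renamed["baz"].to_numpy())
drop_shares = np.shares_memory(df["foo"].to_numpy(), dropped["foo"].to_numpy())
print(rename_shares, drop_shares)

# Writing to the result triggers the deferred copy; df is not affected.
renamed.iloc[0, 0] = 100
print(df.iloc[0, 0])
```

The results behave like independent copies from the caller's point of view, even though no data is duplicated until one of the objects is modified.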
How to enable CoW
Copy-on-Write can be enabled through the configuration option copy_on_write. The option can be turned on globally through either of the following:
In [65]: pd.set_option("mode.copy_on_write", True)
In [66]: pd.options.mode.copy_on_write = True
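For trying CoW on a single block of code rather than globally, pd.option_context (used earlier in this guide to switch CoW off) can presumably be used with True as well. A minimal sketch, with everything created inside the block so that no object is shared across modes:

```python
import pandas as pd

with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
    subset = df["foo"]
    subset.iloc[0] = 100  # under CoW this copies instead of touching df
    parent_values = df["foo"].tolist()
    subset_values = subset.tolist()

print(parent_values)
print(subset_values)
```

The option is restored to its previous value when the block exits, which can help when migrating a code base one piece at a time.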