pandas Chinese Reference Guide
Indexing and selecting data
The axis labeling information in pandas objects serves many purposes:
- Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display.
- Enables automatic and explicit data alignment.
- Allows intuitive getting and setting of subsets of the data set.
In this section, we will focus on the final point: namely, how to slice, dice, and generally get and set subsets of pandas objects. The primary focus will be on Series and DataFrame as they have received more development attention in this area.
The Python and NumPy indexing operators [] and the attribute operator . provide quick and easy access to pandas data structures across a wide range of use cases. This makes interactive work intuitive, as there's little new to learn if you already know how to deal with Python dictionaries and NumPy arrays. However, since the type of the data to be accessed isn't known in advance, directly using standard operators has some optimization limits. For production code, we recommend that you take advantage of the optimized pandas data access methods exposed in this chapter.
Warning
Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a View versus Copy.
See the MultiIndex / Advanced Indexing section for MultiIndex and more advanced indexing documentation.
See the cookbook for some advanced strategies.
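The chained-assignment warning above can be made concrete with a small sketch. The frame `df_demo` and its values are made up for illustration; the point is only that a single `.loc` call selects rows and column in one step, so there is no intermediate object whose copy-or-view status matters.

```python
import pandas as pd

# Hypothetical demo data, not taken from the guide.
df_demo = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Chained form (avoid): df_demo[df_demo["A"] > 1]["B"] = 0
# The first [] may hand back a copy, so the assignment can be lost.

# Single-step form: one .loc call addresses rows and column together.
df_demo.loc[df_demo["A"] > 1, "B"] = 0
print(df_demo)
```

Whether the chained form happens to work depends on the pandas version and on the data, which is the reason to prefer the single-step form.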
Different choices for indexing
Object selection has had a number of user-requested additions in order to support more explicit location based indexing. pandas now supports three types of multi-axis indexing.
- .loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found. Allowed inputs are:
  - A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index; this use is not an integer position along the index).
  - A list or array of labels ['a', 'b', 'c'].
  - A slice object with labels 'a':'f' (note that, contrary to usual Python slices, both the start and the stop are included when present in the index; see Slicing with labels and Endpoints are inclusive).
  - A boolean array (any NA values will be treated as False).
  - A callable function with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above).
  - A tuple of row (and column) indices whose elements are one of the above inputs.
  See more at Selection by Label.
- .iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers, which allow out-of-bounds indexing (this conforms with Python/NumPy slice semantics). Allowed inputs are:
  - An integer, e.g. 5.
  - A list or array of integers [4, 3, 0].
  - A slice object with ints 1:7.
  - A boolean array (any NA values will be treated as False).
  - A callable function with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above).
  - A tuple of row (and column) indices whose elements are one of the above inputs.
  See more at Selection by Position, Advanced Indexing and Advanced Hierarchical.
- .loc, .iloc, and also [] indexing can accept a callable as indexer. See more at Selection By Callable.
Destructuring tuple keys into row (and column) indexes occurs before callables are applied, so you cannot return a tuple from a callable to index both rows and columns.
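As a rough illustration of the difference between label-based and position-based access described above, the sketch below uses a made-up Series `s_demo` whose integer labels deliberately differ from the positions. All names here are illustrative.

```python
import pandas as pd

# Hypothetical Series: integer labels 10, 20, 30 sit at positions 0, 1, 2.
s_demo = pd.Series(["a", "b", "c"], index=[10, 20, 30])

by_label = s_demo.loc[10]          # label 10
by_position = s_demo.iloc[0]       # position 0
label_slice = s_demo.loc[10:20]    # label slice: both endpoints included
position_slice = s_demo.iloc[0:1]  # positional slice: stop excluded

# A missing label raises KeyError with .loc.
loc_error = None
try:
    s_demo.loc[0]
except KeyError:
    loc_error = "KeyError"

# An out-of-bounds scalar position raises IndexError with .iloc.
iloc_error = None
try:
    s_demo.iloc[5]
except IndexError:
    iloc_error = "IndexError"

print(by_label, by_position, len(label_slice), len(position_slice))
print(loc_error, iloc_error)
```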
Getting values from an object with multi-axes selection uses the following notation (using .loc as an example, but the following applies to .iloc as well). Any of the axes accessors may be the null slice :. Axes left out of the specification are assumed to be :, e.g. p.loc['a'] is equivalent to p.loc['a', :].
Object Type    Indexers
Series         s.loc[indexer]
DataFrame      df.loc[row_indexer,column_indexer]
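The equivalence between leaving an axis out and passing the null slice can be checked directly. The sketch below assumes a small, made-up DataFrame named `p`, mirroring the name used in the text.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the object called p above.
p = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["a", "b"])

implicit = p.loc["a"]     # column axis left out, assumed to be :
explicit = p.loc["a", :]  # column axis spelled out as the null slice

same_loc = implicit.equals(explicit)
same_iloc = p.iloc[0].equals(p.iloc[0, :])  # the same holds for .iloc
print(same_loc, same_iloc)
```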
Basics
As mentioned when introducing the data structures in the last section, the primary function of indexing with [] (a.k.a. __getitem__ for those familiar with implementing class behavior in Python) is selecting out lower-dimensional slices. The following table shows return type values when indexing pandas objects with []:
Object Type    Selection         Return Value Type
Series         series[label]     scalar value
DataFrame      frame[colname]    Series corresponding to colname
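The return types in the table can be confirmed with a short check. The objects `series` and `frame` below are made-up stand-ins that reuse the names from the table.

```python
import pandas as pd

# Hypothetical objects named after the table entries.
series = pd.Series([1.5, 2.5], index=["a", "b"])
frame = pd.DataFrame({"col": [1, 2], "other": [3, 4]})

scalar = series["a"]   # one dimension lower than a Series: a scalar
column = frame["col"]  # one dimension lower than a DataFrame: a Series

print(type(scalar).__name__, type(column).__name__)
```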
Here we construct a simple time series data set to use for illustrating the indexing functionality:
In [1]: dates = pd.date_range('1/1/2000', periods=8)
In [2]: df = pd.DataFrame(np.random.randn(8, 4),
...: index=dates, columns=['A', 'B', 'C', 'D'])
...:
In [3]: df
Out[3]:
A B C D
2000-01-01 0.469112 -0.282863 -1.509059 -1.135632
2000-01-02 1.212112 -0.173215 0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929 1.071804
2000-01-04 0.721555 -0.706771 -1.039575 0.271860
2000-01-05 -0.424972 0.567020 0.276232 -1.087401
2000-01-06 -0.673690 0.113648 -1.478427 0.524988
2000-01-07 0.404705 0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312 0.844885
None of the indexing functionality is time series specific unless specifically stated.
Thus, as per above, we have the most basic indexing using []:
In [4]: s = df['A']
In [5]: s[dates[5]]
Out[5]: -0.6736897080883706
You can pass a list of columns to [] to select columns in that order. If a column is not contained in the DataFrame, an exception will be raised. Multiple columns can also be set in this manner:
In [6]: df
Out[6]:
A B C D
2000-01-01 0.469112 -0.282863 -1.509059 -1.135632
2000-01-02 1.212112 -0.173215 0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929 1.071804
2000-01-04 0.721555 -0.706771 -1.039575 0.271860
2000-01-05 -0.424972 0.567020 0.276232 -1.087401
2000-01-06 -0.673690 0.113648 -1.478427 0.524988
2000-01-07 0.404705 0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312 0.844885
In [7]: df[['B', 'A']] = df[['A', 'B']]
In [8]: df
Out[8]:
A B C D
2000-01-01 -0.282863 0.469112 -1.509059 -1.135632
2000-01-02 -0.173215 1.212112 0.119209 -1.044236
2000-01-03 -2.104569 -0.861849 -0.494929 1.071804
2000-01-04 -0.706771 0.721555 -1.039575 0.271860
2000-01-05 0.567020 -0.424972 0.276232 -1.087401
2000-01-06 0.113648 -0.673690 -1.478427 0.524988
2000-01-07 0.577046 0.404705 -1.715002 -1.039268
2000-01-08 -1.157892 -0.370647 -1.344312 0.844885
You may find this useful for applying a transform (in-place) to a subset of the columns.
Warning
pandas aligns all AXES when setting Series and DataFrame from .loc.
The following will not modify df, because the columns are aligned before the values are assigned.
In [9]: df[['A', 'B']]
Out[9]:
A B
2000-01-01 -0.282863 0.469112
2000-01-02 -0.173215 1.212112
2000-01-03 -2.104569 -0.861849
2000-01-04 -0.706771 0.721555
2000-01-05 0.567020 -0.424972
2000-01-06 0.113648 -0.673690
2000-01-07 0.577046 0.404705
2000-01-08 -1.157892 -0.370647
In [10]: df.loc[:, ['B', 'A']] = df[['A', 'B']]
In [11]: df[['A', 'B']]
Out[11]:
A B
2000-01-01 -0.282863 0.469112
2000-01-02 -0.173215 1.212112
2000-01-03 -2.104569 -0.861849
2000-01-04 -0.706771 0.721555
2000-01-05 0.567020 -0.424972
2000-01-06 0.113648 -0.673690
2000-01-07 0.577046 0.404705
2000-01-08 -1.157892 -0.370647
The correct way to swap column values is by using raw values:
In [12]: df.loc[:, ['B', 'A']] = df[['A', 'B']].to_numpy()
In [13]: df[['A', 'B']]
Out[13]:
A B
2000-01-01 0.469112 -0.282863
2000-01-02 1.212112 -0.173215
2000-01-03 -0.861849 -2.104569
2000-01-04 0.721555 -0.706771
2000-01-05 -0.424972 0.567020
2000-01-06 -0.673690 0.113648
2000-01-07 0.404705 0.577046
2000-01-08 -0.370647 -1.157892
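The alignment behaviour shown in the session above can be reproduced on a tiny frame. This is a minimal sketch with made-up data (`df_swap`); it only restates the two `.loc` assignments from the text in self-contained form.

```python
import pandas as pd

# Hypothetical two-column frame, small enough to inspect by eye.
df_swap = pd.DataFrame({"A": [1, 2], "B": [10, 20]})

# The right-hand side is a DataFrame, so .loc aligns it on column labels
# first: A is written to A and B to B, and nothing changes.
df_swap.loc[:, ["B", "A"]] = df_swap[["A", "B"]]
after_aligned = df_swap.copy()

# The right-hand side is a raw array with no labels, so values are
# assigned by position and the two columns really are swapped.
df_swap.loc[:, ["B", "A"]] = df_swap[["A", "B"]].to_numpy()

print(after_aligned)
print(df_swap)
```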
However, pandas does not align AXES when setting Series and DataFrame from .iloc because .iloc operates by position.
The following will modify df, because the columns are not aligned before the values are assigned.
In [14]: df[['A', 'B']]
Out[14]:
A B
2000-01-01 0.469112 -0.282863
2000-01-02 1.212112 -0.173215
2000-01-03 -0.861849 -2.104569
2000-01-04 0.721555 -0.706771
2000-01-05 -0.424972 0.567020
2000-01-06 -0.673690 0.113648
2000-01-07 0.404705 0.577046
2000-01-08 -0.370647 -1.157892
In [15]: df.iloc[:, [1, 0]] = df[['A', 'B']]
In [16]: df[['A','B']]
Out[16]:
A B
2000-01-01 -0.282863 0.469112
2000-01-02 -0.173215 1.212112
2000-01-03 -2.104569 -0.861849
2000-01-04 -0.706771 0.721555
2000-01-05 0.567020 -0.424972
2000-01-06 0.113648 -0.673690
2000-01-07 0.577046 0.404705
2000-01-08 -1.157892 -0.370647
Attribute access
You may access an index on a Series or column on a DataFrame directly as an attribute:
In [17]: sa = pd.Series([1, 2, 3], index=list('abc'))
In [18]: dfa = df.copy()
In [19]: sa.b
Out[19]: 2
In [20]: dfa.A
Out[20]:
2000-01-01 -0.282863
2000-01-02 -0.173215
2000-01-03 -2.104569
2000-01-04 -0.706771
2000-01-05 0.567020
2000-01-06 0.113648
2000-01-07 0.577046
2000-01-08 -1.157892
Freq: D, Name: A, dtype: float64
In [21]: sa.a = 5
In [22]: sa
Out[22]:
a 5
b 2
c 3
dtype: int64
In [23]: dfa.A = list(range(len(dfa.index))) # ok if A already exists
In [24]: dfa
Out[24]:
A B C D
2000-01-01 0 0.469112 -1.509059 -1.135632
2000-01-02 1 1.212112 0.119209 -1.044236
2000-01-03 2 -0.861849 -0.494929 1.071804
2000-01-04 3 0.721555 -1.039575 0.271860
2000-01-05 4 -0.424972 0.276232 -1.087401
2000-01-06 5 -0.673690 -1.478427 0.524988
2000-01-07 6 0.404705 -1.715002 -1.039268
2000-01-08 7 -0.370647 -1.344312 0.844885
In [25]: dfa['A'] = list(range(len(dfa.index))) # use this form to create a new column
In [26]: dfa
Out[26]:
A B C D
2000-01-01 0 0.469112 -1.509059 -1.135632
2000-01-02 1 1.212112 0.119209 -1.044236
2000-01-03 2 -0.861849 -0.494929 1.071804
2000-01-04 3 0.721555 -1.039575 0.271860
2000-01-05 4 -0.424972 0.276232 -1.087401
2000-01-06 5 -0.673690 -1.478427 0.524988
2000-01-07 6 0.404705 -1.715002 -1.039268
2000-01-08 7 -0.370647 -1.344312 0.844885
Warning
- You can use this access only if the index element is a valid Python identifier, e.g. s.1 is not allowed. See here for an explanation of valid identifiers.
- The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed, but s['min'] is possible.
- Similarly, the attribute will not be available if it conflicts with any of the following list: index, major_axis, minor_axis, items.
- In any of these cases, standard indexing will still work, e.g. s['1'], s['min'], and s['index'] will access the corresponding element or column.
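The name-conflict cases listed in the warning can be seen on a small Series. The labels below are chosen on purpose to collide with an existing method (`min`) and an existing attribute (`index`); the Series `s_attr` is made up for illustration.

```python
import pandas as pd

# Hypothetical Series whose labels clash with pandas attribute names.
s_attr = pd.Series([1, 2, 3], index=["min", "index", "ok"])

method = s_attr.min          # the Series.min method, not the element
axis = s_attr.index          # the Index object, not the element
value_min = s_attr["min"]    # standard indexing reaches the element
value_index = s_attr["index"]
value_ok = s_attr.ok         # no conflict, so attribute access works

print(callable(method), value_min, value_index, value_ok)
```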
If you are using the IPython environment, you may also use tab-completion to see these accessible attributes.
You can also assign a dict to a row of a DataFrame:
In [27]: x = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]})
In [28]: x.iloc[1] = {'x': 9, 'y': 99}
In [29]: x
Out[29]:
x y
0 1 3
1 9 99
2 3 5
You can use attribute access to modify an existing element of a Series or column of a DataFrame, but be careful; if you try to use attribute access to create a new column, it creates a new attribute rather than a new column and will raise a UserWarning:
In [30]: df_new = pd.DataFrame({'one': [1., 2., 3.]})
In [31]: df_new.two = [4, 5, 6]
In [32]: df_new
Out[32]:
one
0 1.0
1 2.0
2 3.0
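A self-contained way to see that attribute assignment does not create a column is to inspect `columns` afterwards. The sketch below uses a made-up frame `df_attr` and silences the UserWarning so the output stays readable; that suppression is a choice made for this example only.

```python
import warnings

import pandas as pd

df_attr = pd.DataFrame({"one": [1.0, 2.0, 3.0]})

# Attribute assignment with a new name only sets a plain Python attribute.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    df_attr.two = [4, 5, 6]
has_column_after_attr = "two" in df_attr.columns

# Item assignment is the form that actually creates the column.
df_attr["two"] = [4, 5, 6]
has_column_after_item = "two" in df_attr.columns

print(has_column_after_attr, has_column_after_item)
```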
Slicing ranges
The most robust and consistent way of slicing ranges along arbitrary axes is described in the Selection by Position section detailing the .iloc method. For now, we explain the semantics of slicing using the [] operator.
With Series, the syntax works exactly as with an ndarray, returning a slice of the values and the corresponding labels:
In [33]: s[:5]
Out[33]:
2000-01-01 0.469112
2000-01-02 1.212112
2000-01-03 -0.861849
2000-01-04 0.721555
2000-01-05 -0.424972
Freq: D, Name: A, dtype: float64
In [34]: s[::2]
Out[34]:
2000-01-01 0.469112
2000-01-03 -0.861849
2000-01-05 -0.424972
2000-01-07 0.404705
Freq: 2D, Name: A, dtype: float64
In [35]: s[::-1]
Out[35]:
2000-01-08 -0.370647
2000-01-07 0.404705
2000-01-06 -0.673690
2000-01-05 -0.424972
2000-01-04 0.721555
2000-01-03 -0.861849
2000-01-02 1.212112
2000-01-01 0.469112
Freq: -1D, Name: A, dtype: float64
Note that setting works as well:
In [36]: s2 = s.copy()
In [37]: s2[:5] = 0
In [38]: s2
Out[38]:
2000-01-01 0.000000
2000-01-02 0.000000
2000-01-03 0.000000
2000-01-04 0.000000
2000-01-05 0.000000
2000-01-06 -0.673690
2000-01-07 0.404705
2000-01-08 -0.370647
Freq: D, Name: A, dtype: float64
With DataFrame, slicing inside of [] slices the rows. This is provided largely as a convenience since it is such a common operation.
In [39]: df[:3]
Out[39]:
A B C D
2000-01-01 -0.282863 0.469112 -1.509059 -1.135632
2000-01-02 -0.173215 1.212112 0.119209 -1.044236
2000-01-03 -2.104569 -0.861849 -0.494929 1.071804
In [40]: df[::-1]
Out[40]:
A B C D
2000-01-08 -1.157892 -0.370647 -1.344312 0.844885
2000-01-07 0.577046 0.404705 -1.715002 -1.039268
2000-01-06 0.113648 -0.673690 -1.478427 0.524988
2000-01-05 0.567020 -0.424972 0.276232 -1.087401
2000-01-04 -0.706771 0.721555 -1.039575 0.271860
2000-01-03 -2.104569 -0.861849 -0.494929 1.071804
2000-01-02 -0.173215 1.212112 0.119209 -1.044236
2000-01-01 -0.282863 0.469112 -1.509059 -1.135632
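To underline that a slice inside [] acts on rows while a single label inside [] selects a column, here is a minimal sketch on a made-up frame `df_rows` with string row labels.

```python
import pandas as pd

# Hypothetical frame with string row labels so rows and columns are
# easy to tell apart.
df_rows = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]}, index=list("wxyz")
)

first_two = df_rows[:2]        # slice: the first two rows
reversed_rows = df_rows[::-1]  # slice with a step: rows in reverse order
column_a = df_rows["A"]        # single label: a column, not a row

print(first_two)
print(reversed_rows.index.tolist())
```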
Selection by label
Warning
Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a View versus Copy.
Warning
.loc is strict when you present slicers that are not compatible (or convertible) with the index type, for example using integers in a DatetimeIndex. These will raise a TypeError.
In [41]: dfl = pd.DataFrame(np.random.randn(5, 4),
....: columns=list('ABCD'),
....: index=pd.date_range('20130101', periods=5))
....:
In [42]: dfl
Out[42]:
A B C D
2013-01-01 1.075770 -0.109050 1.643563 -1.469388
2013-01-02 0.357021 -0.674600 -1.776904 -0.968914
2013-01-03 -1.294524 0.413738 0.276662 -0.472035
2013-01-04 -0.013960 -0.362543 -0.006154 -0.923061
2013-01-05 0.895717 0.805244 -1.206412 2.565646
In [43]: dfl.loc[2:3]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[43], line 1
----> 1 dfl.loc[2:3]
File ~/work/pandas/pandas/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
1189 maybe_callable = com.apply_if_callable(key, self.obj)
1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
1409 if isinstance(key, slice):
1410 self._validate_key(key, axis)
-> 1411 return self._get_slice_axis(key, axis=axis)
1412 elif com.is_bool_indexer(key):
1413 return self._getbool_axis(key, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
1440 return obj.copy(deep=False)
1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
1445 if isinstance(indexer, slice):
1446 return self.obj._slice(indexer, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexes/datetimes.py:682, in DatetimeIndex.slice_indexer(self, start, end, step)
674 # GH#33146 if start and end are combinations of str and None and Index is not
675 # monotonic, we can not use Index.slice_indexer because it does not honor the
676 # actual elements, is only searching for start and end
677 if (
678 check_str_or_none(start)
679 or check_str_or_none(end)
680 or self.is_monotonic_increasing
681 ):
--> 682 return Index.slice_indexer(self, start, end, step)
684 mask = np.array(True)
685 in_index = True
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
6618 def slice_indexer(
6619 self,
6620 start: Hashable | None = None,
6621 end: Hashable | None = None,
6622 step: int | None = None,
6623 ) -> slice:
6624 """
6625 Compute the slice indexer for input labels and step.
6626
(...)
6660 slice(1, 3, None)
6661 """
-> 6662 start_slice, end_slice = self.slice_locs(start, end, step=step)
6664 # return a slice
6665 if not is_scalar(start_slice):
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6879, in Index.slice_locs(self, start, end, step)
6877 start_slice = None
6878 if start is not None:
-> 6879 start_slice = self.get_slice_bound(start, "left")
6880 if start_slice is None:
6881 start_slice = 0
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6794, in Index.get_slice_bound(self, label, side)
6790 original_label = label
6792 # For datetime indices label may be a string that has to be converted
6793 # to datetime boundary according to its resolution.
-> 6794 label = self._maybe_cast_slice_bound(label, side)
6796 # we need to look up the label
6797 try:
File ~/work/pandas/pandas/pandas/core/indexes/datetimes.py:642, in DatetimeIndex._maybe_cast_slice_bound(self, label, side)
637 if isinstance(label, dt.date) and not isinstance(label, dt.datetime):
638 # Pandas supports slicing with dates, treated as datetimes at midnight.
639 # https://github.com/pandas-dev/pandas/issues/31501
640 label = Timestamp(label).to_pydatetime()
--> 642 label = super()._maybe_cast_slice_bound(label, side)
643 self._data._assert_tzawareness_compat(label)
644 return Timestamp(label)
File ~/work/pandas/pandas/pandas/core/indexes/datetimelike.py:378, in DatetimeIndexOpsMixin._maybe_cast_slice_bound(self, label, side)
376 return lower if side == "left" else upper
377 elif not isinstance(label, self._data._recognized_scalars):
--> 378 self._raise_invalid_indexer("slice", label)
380 return label
File ~/work/pandas/pandas/pandas/core/indexes/base.py:4301, in Index._raise_invalid_indexer(self, form, key, reraise)
4299 if reraise is not lib.no_default:
4300 raise TypeError(msg) from reraise
-> 4301 raise TypeError(msg)
TypeError: cannot do slice indexing on DatetimeIndex with these indexers [2] of type int
String-like values in a slice can be converted to the type of the index, which leads to natural slicing.
In [44]: dfl.loc['20130102':'20130104']
Out[44]:
A B C D
2013-01-02 0.357021 -0.674600 -1.776904 -0.968914
2013-01-03 -1.294524 0.413738 0.276662 -0.472035
2013-01-04 -0.013960 -0.362543 -0.006154 -0.923061
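The two behaviours above (integer slicers rejected, string slicers converted) can be checked without reading a traceback. This sketch uses a made-up frame `dfl_demo` with deterministic values in place of random ones.

```python
import numpy as np
import pandas as pd

# Hypothetical frame on a DatetimeIndex, filled with 0..19 for readability.
dfl_demo = pd.DataFrame(
    np.arange(20).reshape(5, 4),
    columns=list("ABCD"),
    index=pd.date_range("20130101", periods=5),
)

# Integer slicers are not compatible with a DatetimeIndex.
int_slice_error = None
try:
    dfl_demo.loc[2:3]
except TypeError as exc:
    int_slice_error = type(exc).__name__

# Date-like strings are converted to timestamps; both ends are included.
by_string = dfl_demo.loc["20130102":"20130104"]

print(int_slice_error, len(by_string))
```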
pandas provides a suite of methods in order to have purely label based indexing. This is a strict inclusion based protocol. Every label asked for must be in the index, or a KeyError will be raised. When slicing, both the start bound AND the stop bound are included, if present in the index. Integers are valid labels, but they refer to the label and not the position.
The .loc attribute is the primary access method. The following are valid inputs:
- A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index; this use is not an integer position along the index).
- A list or array of labels ['a', 'b', 'c'].
- A slice object with labels 'a':'f' (note that, contrary to usual Python slices, both the start and the stop are included when present in the index; see Slicing with labels).
- A boolean array.
- A callable, see Selection By Callable.
In [45]: s1 = pd.Series(np.random.randn(6), index=list('abcdef'))
In [46]: s1
Out[46]:
a 1.431256
b 1.340309
c -1.170299
d -0.226169
e 0.410835
f 0.813850
dtype: float64
In [47]: s1.loc['c':]
Out[47]:
c -1.170299
d -0.226169
e 0.410835
f 0.813850
dtype: float64
In [48]: s1.loc['b']
Out[48]: 1.3403088497993827
Note that setting works as well:
In [49]: s1.loc['c':] = 0
In [50]: s1
Out[50]:
a 1.431256
b 1.340309
c 0.000000
d 0.000000
e 0.000000
f 0.000000
dtype: float64
With a DataFrame:
In [51]: df1 = pd.DataFrame(np.random.randn(6, 4),
....: index=list('abcdef'),
....: columns=list('ABCD'))
....:
In [52]: df1
Out[52]:
A B C D
a 0.132003 -0.827317 -0.076467 -1.187678
b 1.130127 -1.436737 -1.413681 1.607920
c 1.024180 0.569605 0.875906 -2.211372
d 0.974466 -2.006747 -0.410001 -0.078638
e 0.545952 -1.219217 -1.226825 0.769804
f -1.281247 -0.727707 -0.121306 -0.097883
In [53]: df1.loc[['a', 'b', 'd'], :]
Out[53]:
A B C D
a 0.132003 -0.827317 -0.076467 -1.187678
b 1.130127 -1.436737 -1.413681 1.607920
d 0.974466 -2.006747 -0.410001 -0.078638
Accessing via label slices:
In [54]: df1.loc['d':, 'A':'C']
Out[54]:
A B C
d 0.974466 -2.006747 -0.410001
e 0.545952 -1.219217 -1.226825
f -1.281247 -0.727707 -0.121306
For getting a cross section using a label (equivalent to df.xs('a')):
In [55]: df1.loc['a']
Out[55]:
A 0.132003
B -0.827317
C -0.076467
D -1.187678
Name: a, dtype: float64
For getting values with a boolean array:
In [56]: df1.loc['a'] > 0
Out[56]:
A True
B False
C False
D False
Name: a, dtype: bool
In [57]: df1.loc[:, df1.loc['a'] > 0]
Out[57]:
A
a 0.132003
b 1.130127
c 1.024180
d 0.974466
e 0.545952
f -1.281247
NA values in a boolean array are treated as False:
In [58]: mask = pd.array([True, False, True, False, pd.NA, False], dtype="boolean")
In [59]: mask
Out[59]:
<BooleanArray>
[True, False, True, False, <NA>, False]
Length: 6, dtype: boolean
In [60]: df1[mask]
Out[60]:
A B C D
a 0.132003 -0.827317 -0.076467 -1.187678
c 1.024180 0.569605 0.875906 -2.211372
For getting a value explicitly:
# this is also equivalent to ``df1.at['a','A']``
In [61]: df1.loc['a', 'A']
Out[61]: 0.13200317033032932
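The equivalence with `.at` mentioned in the comment above can be verified on a small frame. The frame `df_cell` is made up for this sketch.

```python
import pandas as pd

# Hypothetical frame with fixed values instead of random ones.
df_cell = pd.DataFrame({"A": [0.5, 1.5], "B": [2.5, 3.5]}, index=["a", "b"])

via_loc = df_cell.loc["a", "A"]  # general label-based access
via_at = df_cell.at["a", "A"]    # scalar-only label-based access

print(via_loc, via_at)
```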
Slicing with labels
When using .loc with slices, if both the start and the stop labels are present in the index, then elements located between the two (including them) are returned:
In [62]: s = pd.Series(list('abcde'), index=[0, 3, 2, 5, 4])
In [63]: s.loc[3:5]
Out[63]:
3 b
2 c
5 d
dtype: object
If at least one of the two is absent, but the index is sorted and can be compared against the start and stop labels, then slicing will still work as expected, by selecting labels which rank between the two:
In [64]: s.sort_index()
Out[64]:
0 a
2 c
3 b
4 e
5 d
dtype: object
In [65]: s.sort_index().loc[1:6]
Out[65]:
2 c
3 b
4 e
5 d
dtype: object
However, if at least one of the two is absent and the index is not sorted, an error will be raised (since doing otherwise would be computationally expensive, as well as potentially ambiguous for mixed type indexes). For instance, in the above example, s.loc[1:6] would raise a KeyError.
For the rationale behind this behavior, see Endpoints are inclusive.
In [66]: s = pd.Series(list('abcdef'), index=[0, 3, 2, 5, 4, 2])
In [67]: s.loc[3:5]
Out[67]:
3 b
2 c
5 d
dtype: object
Also, if the index has duplicate labels and either the start or the stop label is duplicated, an error will be raised. For instance, in the above example, s.loc[2:5] would raise a KeyError.
For more information about duplicate labels, see Duplicate Labels.
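The label-slicing cases described in this section can be collected into one runnable sketch. The Series reuse the data shown above under new, illustrative names (`s_unsorted`, `s_dupes`), and the errors are caught so the script runs to completion.

```python
import pandas as pd

s_unsorted = pd.Series(list("abcde"), index=[0, 3, 2, 5, 4])

# Both labels present: everything stored between them is returned.
present = s_unsorted.loc[3:5]

# Labels absent but index sorted: labels ranking between 1 and 6.
sorted_ok = s_unsorted.sort_index().loc[1:6]

# Labels absent and index not sorted: KeyError.
missing_error = None
try:
    s_unsorted.loc[1:6]
except KeyError:
    missing_error = "KeyError"

# Duplicated start label (2 appears twice): KeyError.
s_dupes = pd.Series(list("abcdef"), index=[0, 3, 2, 5, 4, 2])
duplicate_error = None
try:
    s_dupes.loc[2:5]
except KeyError:
    duplicate_error = "KeyError"

print(present.index.tolist(), sorted_ok.index.tolist())
print(missing_error, duplicate_error)
```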
Selection by position
Warning
Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a View versus Copy.
pandas provides a suite of methods in order to get purely integer based indexing. The semantics closely follow Python and NumPy slicing. Indexing is 0-based. When slicing, the start bound is included, while the upper bound is excluded. Trying to use a non-integer, even a valid label, will raise an IndexError.
The .iloc attribute is the primary access method. The following are valid inputs:
- An integer, e.g. 5.
- A list or array of integers [4, 3, 0].
- A slice object with ints 1:7.
- A boolean array.
- A callable, see Selection By Callable.
- A tuple of row (and column) indexes, whose elements are one of the above types.
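Each input kind in the list above can be tried on a small object. The following is a minimal sketch with made-up data (`s_pos`, `df_pos`), not an exhaustive tour of `.iloc`.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with string labels, so every access below is
# unambiguously positional.
s_pos = pd.Series([10, 20, 30, 40], index=list("abcd"))

one = s_pos.iloc[1]                                        # an integer
picked = s_pos.iloc[[3, 0]]                                # a list of integers
sliced = s_pos.iloc[1:3]                                   # a slice, stop excluded
masked = s_pos.iloc[np.array([True, False, True, False])]  # a boolean array
via_callable = s_pos.iloc[lambda x: [0, -1]]               # a callable

# A tuple of row and column indexers on a DataFrame.
df_pos = pd.DataFrame(np.arange(12).reshape(3, 4))
cell_block = df_pos.iloc[[0, 2], 1:3]

print(one, picked.tolist(), sliced.tolist())
print(masked.tolist(), via_callable.tolist())
print(cell_block)
```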
In [68]: s1 = pd.Series(np.random.randn(5), index=list(range(0, 10, 2)))
In [69]: s1
Out[69]:
0 0.695775
2 0.341734
4 0.959726
6 -1.110336
8 -0.619976
dtype: float64
In [70]: s1.iloc[:3]
Out[70]:
0 0.695775
2 0.341734
4 0.959726
dtype: float64
In [71]: s1.iloc[3]
Out[71]: -1.110336102891167
Note that setting works as well:
In [72]: s1.iloc[:3] = 0
In [73]: s1
Out[73]:
0 0.000000
2 0.000000
4 0.000000
6 -1.110336
8 -0.619976
dtype: float64
With a DataFrame:
In [74]: df1 = pd.DataFrame(np.random.randn(6, 4),
....: index=list(range(0, 12, 2)),
....: columns=list(range(0, 8, 2)))
....:
In [75]: df1
Out[75]:
0 2 4 6
0 0.149748 -0.732339 0.687738 0.176444
2 0.403310 -0.154951 0.301624 -2.179861
4 -1.369849 -0.954208 1.462696 -1.743161
6 -0.826591 -0.345352 1.314232 0.690579
8 0.995761 2.396780 0.014871 3.357427
10 -0.317441 -1.236269 0.896171 -0.487602
Select via integer slicing:
In [76]: df1.iloc[:3]
Out[76]:
0 2 4 6
0 0.149748 -0.732339 0.687738 0.176444
2 0.403310 -0.154951 0.301624 -2.179861
4 -1.369849 -0.954208 1.462696 -1.743161
In [77]: df1.iloc[1:5, 2:4]
Out[77]:
4 6
2 0.301624 -2.179861
4 1.462696 -1.743161
6 1.314232 0.690579
8 0.014871 3.357427
Select via integer list:
In [78]: df1.iloc[[1, 3, 5], [1, 3]]
Out[78]:
2 6
2 -0.154951 -2.179861
6 -0.345352 0.690579
10 -1.236269 -0.487602
In [79]: df1.iloc[1:3, :]
Out[79]:
0 2 4 6
2 0.403310 -0.154951 0.301624 -2.179861
4 -1.369849 -0.954208 1.462696 -1.743161
In [80]: df1.iloc[:, 1:3]
Out[80]:
2 4
0 -0.732339 0.687738
2 -0.154951 0.301624
4 -0.954208 1.462696
6 -0.345352 1.314232
8 2.396780 0.014871
10 -1.236269 0.896171
# this is also equivalent to ``df1.iat[1,1]``
In [81]: df1.iloc[1, 1]
Out[81]: -0.1549507744249032
For getting a cross section using an integer position (equivalent to df.xs(1) when the row labels are the default 0, 1, 2, ...):
In [82]: df1.iloc[1]
Out[82]:
0 0.403310
2 -0.154951
4 0.301624
6 -2.179861
Name: 2, dtype: float64
Out of range slice indexes are handled gracefully just as in Python/NumPy.
# these are allowed in Python/NumPy.
In [83]: x = list('abcdef')
In [84]: x
Out[84]: ['a', 'b', 'c', 'd', 'e', 'f']
In [85]: x[4:10]
Out[85]: ['e', 'f']
In [86]: x[8:10]
Out[86]: []
In [87]: s = pd.Series(x)
In [88]: s
Out[88]:
0 a
1 b
2 c
3 d
4 e
5 f
dtype: object
In [89]: s.iloc[4:10]
Out[89]:
4 e
5 f
dtype: object
In [90]: s.iloc[8:10]
Out[90]: Series([], dtype: object)
Note that using slices that go out of bounds can result in an empty axis (e.g. an empty DataFrame being returned).
In [91]: dfl = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))
In [92]: dfl
Out[92]:
A B
0 -0.082240 -2.182937
1 0.380396 0.084844
2 0.432390 1.519970
3 -0.493662 0.600178
4 0.274230 0.132885
In [93]: dfl.iloc[:, 2:3]
Out[93]:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]
In [94]: dfl.iloc[:, 1:3]
Out[94]:
B
0 -2.182937
1 0.084844
2 1.519970
3 0.600178
4 0.132885
In [95]: dfl.iloc[4:6]
Out[95]:
A B
4 0.27423 0.132885
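The out-of-bounds slice behaviour shown above can be summarised in a few assertions-friendly lines. The objects `s_oob` and `df_oob` are made up for this sketch.

```python
import pandas as pd

s_oob = pd.Series(list("abcdef"))

partial = s_oob.iloc[4:10]  # stop past the end: returns what exists
empty = s_oob.iloc[8:10]    # start past the end: empty result, no error

# Hypothetical two-column frame: a column slice entirely out of range
# leaves the rows in place but yields no columns.
df_oob = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
no_columns = df_oob.iloc[:, 2:3]

print(partial.tolist(), len(empty), no_columns.shape)
```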
A single indexer that is out of bounds will raise an IndexError. A list of indexers where any element is out of bounds will raise an IndexError.
In [96]: dfl.iloc[[4, 5, 6]]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexing.py:1714, in _iLocIndexer._get_list_axis(self, key, axis)
1713 try:
-> 1714 return self.obj._take_with_is_copy(key, axis=axis)
1715 except IndexError as err:
1716 # re-raise with different error message, e.g. test_getitem_ndarray_3d
File ~/work/pandas/pandas/pandas/core/generic.py:4153, in NDFrame._take_with_is_copy(self, indices, axis)
4144 """
4145 Internal version of the `take` method that sets the `_is_copy`
4146 attribute to keep track of the parent dataframe (using in indexing
(...)
4151 See the docstring of `take` for full explanation of the parameters.
4152 """
-> 4153 result = self.take(indices=indices, axis=axis)
4154 # Maybe set copy if we didn't actually change the index.
File ~/work/pandas/pandas/pandas/core/generic.py:4133, in NDFrame.take(self, indices, axis, **kwargs)
4129 indices = np.arange(
4130 indices.start, indices.stop, indices.step, dtype=np.intp
4131 )
-> 4133 new_data = self._mgr.take(
4134 indices,
4135 axis=self._get_block_manager_axis(axis),
4136 verify=True,
4137 )
4138 return self._constructor_from_mgr(new_data, axes=new_data.axes).__finalize__(
4139 self, method="take"
4140 )
File ~/work/pandas/pandas/pandas/core/internals/managers.py:891, in BaseBlockManager.take(self, indexer, axis, verify)
890 n = self.shape[axis]
--> 891 indexer = maybe_convert_indices(indexer, n, verify=verify)
893 new_labels = self.axes[axis].take(indexer)
File ~/work/pandas/pandas/pandas/core/indexers/utils.py:282, in maybe_convert_indices(indices, n, verify)
281 if mask.any():
--> 282 raise IndexError("indices are out-of-bounds")
283 return indices
IndexError: indices are out-of-bounds
The above exception was the direct cause of the following exception:
IndexError Traceback (most recent call last)
Cell In[96], line 1
----> 1 dfl.iloc[[4, 5, 6]]
File ~/work/pandas/pandas/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
1189 maybe_callable = com.apply_if_callable(key, self.obj)
1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexing.py:1743, in _iLocIndexer._getitem_axis(self, key, axis)
1741 # a list of integers
1742 elif is_list_like_indexer(key):
-> 1743 return self._get_list_axis(key, axis=axis)
1745 # a single integer
1746 else:
1747 key = item_from_zerodim(key)
File ~/work/pandas/pandas/pandas/core/indexing.py:1717, in _iLocIndexer._get_list_axis(self, key, axis)
1714 return self.obj._take_with_is_copy(key, axis=axis)
1715 except IndexError as err:
1716 # re-raise with different error message, e.g. test_getitem_ndarray_3d
-> 1717 raise IndexError("positional indexers are out-of-bounds") from err
IndexError: positional indexers are out-of-bounds
In [97]: dfl.iloc[:, 4]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In[97], line 1
----> 1 dfl.iloc[:, 4]
File ~/work/pandas/pandas/pandas/core/indexing.py:1184, in _LocationIndexer.__getitem__(self, key)
1182 if self._is_scalar_access(key):
1183 return self.obj._get_value(*key, takeable=self._takeable)
-> 1184 return self._getitem_tuple(key)
1185 else:
1186 # we by definition only have the 0th axis
1187 axis = self.axis or 0
File ~/work/pandas/pandas/pandas/core/indexing.py:1690, in _iLocIndexer._getitem_tuple(self, tup)
1689 def _getitem_tuple(self, tup: tuple):
-> 1690 tup = self._validate_tuple_indexer(tup)
1691 with suppress(IndexingError):
1692 return self._getitem_lowerdim(tup)
File ~/work/pandas/pandas/pandas/core/indexing.py:966, in _LocationIndexer._validate_tuple_indexer(self, key)
964 for i, k in enumerate(key):
965 try:
--> 966 self._validate_key(k, i)
967 except ValueError as err:
968 raise ValueError(
969 "Location based indexing can only have "
970 f"[{self._valid_types}] types"
971 ) from err
File ~/work/pandas/pandas/pandas/core/indexing.py:1592, in _iLocIndexer._validate_key(self, key, axis)
1590 return
1591 elif is_integer(key):
-> 1592 self._validate_integer(key, axis)
1593 elif isinstance(key, tuple):
1594 # a tuple should already have been caught by this point
1595 # so don't treat a tuple as a valid indexer
1596 raise IndexingError("Too many indexers")
File ~/work/pandas/pandas/pandas/core/indexing.py:1685, in _iLocIndexer._validate_integer(self, key, axis)
1683 len_axis = len(self.obj._get_axis(axis))
1684 if key >= len_axis or key < -len_axis:
-> 1685 raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds
Selection by callable
.loc, .iloc, and [] indexing can all accept a callable as the indexer. The callable must be a function that takes one argument (the calling Series or DataFrame) and returns valid output for indexing.
For .iloc indexing, returning a tuple from the callable is not supported, because the tuple is split into row and column indexers before callables are applied.
In [98]: df1 = pd.DataFrame(np.random.randn(6, 4),
....: index=list('abcdef'),
....: columns=list('ABCD'))
....:
In [99]: df1
Out[99]:
A B C D
a -0.023688 2.410179 1.450520 0.206053
b -0.251905 -2.213588 1.063327 1.266143
c 0.299368 -0.863838 0.408204 -1.048089
d -0.025747 -0.988387 0.094055 1.262731
e 1.289997 0.082423 -0.055758 0.536580
f -0.489682 0.369374 -0.034571 -2.484478
In [100]: df1.loc[lambda df: df['A'] > 0, :]
Out[100]:
A B C D
c 0.299368 -0.863838 0.408204 -1.048089
e 1.289997 0.082423 -0.055758 0.536580
In [101]: df1.loc[:, lambda df: ['A', 'B']]
Out[101]:
A B
a -0.023688 2.410179
b -0.251905 -2.213588
c 0.299368 -0.863838
d -0.025747 -0.988387
e 1.289997 0.082423
f -0.489682 0.369374
In [102]: df1.iloc[:, lambda df: [0, 1]]
Out[102]:
A B
a -0.023688 2.410179
b -0.251905 -2.213588
c 0.299368 -0.863838
d -0.025747 -0.988387
e 1.289997 0.082423
f -0.489682 0.369374
In [103]: df1[lambda df: df.columns[0]]
Out[103]:
a -0.023688
b -0.251905
c 0.299368
d -0.025747
e 1.289997
f -0.489682
Name: A, dtype: float64
You can also use callable indexing on a Series.
In [104]: df1['A'].loc[lambda s: s > 0]
Out[104]:
c 0.299368
e 1.289997
Name: A, dtype: float64
Using these methods and indexers, you can chain data selection operations without using a temporary variable.
In [105]: bb = pd.read_csv('data/baseball.csv', index_col='id')
In [106]: (bb.groupby(['year', 'team']).sum(numeric_only=True)
.....: .loc[lambda df: df['r'] > 100])
.....:
Out[106]:
stint g ab r h X2b ... so ibb hbp sh sf gidp
year team ...
2007 CIN 6 379 745 101 203 35 ... 127.0 14.0 1.0 1.0 15.0 18.0
DET 5 301 1062 162 283 54 ... 176.0 3.0 10.0 4.0 8.0 28.0
HOU 4 311 926 109 218 47 ... 212.0 3.0 9.0 16.0 6.0 17.0
LAN 11 413 1021 153 293 61 ... 141.0 8.0 9.0 3.0 8.0 29.0
NYN 13 622 1854 240 509 101 ... 310.0 24.0 23.0 18.0 15.0 48.0
SFN 5 482 1305 198 337 67 ... 188.0 51.0 8.0 16.0 6.0 41.0
TEX 2 198 729 115 200 40 ... 140.0 4.0 5.0 2.0 8.0 16.0
TOR 4 459 1408 187 378 96 ... 265.0 16.0 12.0 4.0 16.0 38.0
[8 rows x 18 columns]
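The chained selection above depends on a CSV file that may not be available locally. The following is a minimal sketch of the same pattern on a small, made-up frame; the `stats` data and its values are hypothetical stand-ins, not the baseball data set.

```python
import pandas as pd

# Hypothetical stand-in for the baseball data: one row per player stint.
stats = pd.DataFrame(
    {
        "year": [2007, 2007, 2007, 2007],
        "team": ["CIN", "CIN", "DET", "HOU"],
        "r": [60, 50, 162, 40],
    }
)

# The callable receives the aggregated frame, so no temporary variable
# is needed between the groupby step and the filter step.
high_scoring = (
    stats.groupby(["year", "team"])
    .sum(numeric_only=True)
    .loc[lambda df: df["r"] > 100]
)
print(high_scoring)
```

Because the lambda is evaluated against the result of the previous step, the filter always refers to the aggregated totals rather than to the original `stats` frame.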
Combining positional and label-based indexing
If you wish to get the 0th and the 2nd elements from the index in the ‘A’ column, you can do:
In [107]: dfd = pd.DataFrame({'A': [1, 2, 3],
.....: 'B': [4, 5, 6]},
.....: index=list('abc'))
.....:
In [108]: dfd
Out[108]:
A B
a 1 4
b 2 5
c 3 6
In [109]: dfd.loc[dfd.index[[0, 2]], 'A']
Out[109]:
a 1
c 3
Name: A, dtype: int64
This can also be expressed using .iloc, by explicitly getting the integer locations from the indexes and then selecting by position.
In [110]: dfd.iloc[[0, 2], dfd.columns.get_loc('A')]
Out[110]:
a 1
c 3
Name: A, dtype: int64
To get the locations of multiple labels, use .get_indexer:
In [111]: dfd.iloc[[0, 2], dfd.columns.get_indexer(['A', 'B'])]
Out[111]:
A B
a 1 4
c 3 6
Reindexing
The idiomatic way to select elements that may not be present is via .reindex(). See also the section on reindexing.
In [112]: s = pd.Series([1, 2, 3])
In [113]: s.reindex([1, 2, 3])
Out[113]:
1 2.0
2 3.0
3 NaN
dtype: float64
Alternatively, if you want to select only the valid keys, the following is idiomatic and efficient; it is guaranteed to preserve the dtype of the selection.
In [114]: labels = [1, 2, 3]
In [115]: s.loc[s.index.intersection(labels)]
Out[115]:
1 2
2 3
dtype: int64
Calling .reindex() on an object whose index contains duplicates raises a ValueError:
In [116]: s = pd.Series(np.arange(4), index=['a', 'a', 'b', 'c'])
In [117]: labels = ['c', 'd']
In [118]: s.reindex(labels)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[118], line 1
----> 1 s.reindex(labels)
File ~/work/pandas/pandas/pandas/core/series.py:5153, in Series.reindex(self, index, axis, method, copy, level, fill_value, limit, tolerance)
5136 @doc(
5137 NDFrame.reindex, # type: ignore[has-type]
5138 klass=_shared_doc_kwargs["klass"],
(...)
5151 tolerance=None,
5152 ) -> Series:
-> 5153 return super().reindex(
5154 index=index,
5155 method=method,
5156 copy=copy,
5157 level=level,
5158 fill_value=fill_value,
5159 limit=limit,
5160 tolerance=tolerance,
5161 )
File ~/work/pandas/pandas/pandas/core/generic.py:5610, in NDFrame.reindex(self, labels, index, columns, axis, method, copy, level, fill_value, limit, tolerance)
5607 return self._reindex_multi(axes, copy, fill_value)
5609 # perform the reindex on the axes
-> 5610 return self._reindex_axes(
5611 axes, level, limit, tolerance, method, fill_value, copy
5612 ).__finalize__(self, method="reindex")
File ~/work/pandas/pandas/pandas/core/generic.py:5633, in NDFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
5630 continue
5632 ax = self._get_axis(a)
-> 5633 new_index, indexer = ax.reindex(
5634 labels, level=level, limit=limit, tolerance=tolerance, method=method
5635 )
5637 axis = self._get_axis_number(a)
5638 obj = obj._reindex_with_indexers(
5639 {axis: [new_index, indexer]},
5640 fill_value=fill_value,
5641 copy=copy,
5642 allow_dups=False,
5643 )
File ~/work/pandas/pandas/pandas/core/indexes/base.py:4429, in Index.reindex(self, target, method, level, limit, tolerance)
4426 raise ValueError("cannot handle a non-unique multi-index!")
4427 elif not self.is_unique:
4428 # GH#42568
-> 4429 raise ValueError("cannot reindex on an axis with duplicate labels")
4430 else:
4431 indexer, _ = self.get_indexer_non_unique(target)
ValueError: cannot reindex on an axis with duplicate labels
Generally, you can intersect the desired labels with the current axis, and then reindex.
In [119]: s.loc[s.index.intersection(labels)].reindex(labels)
Out[119]:
c 3.0
d NaN
dtype: float64
However, this still raises if the resulting index contains duplicates.
In [120]: labels = ['a', 'd']
In [121]: s.loc[s.index.intersection(labels)].reindex(labels)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[121], line 1
----> 1 s.loc[s.index.intersection(labels)].reindex(labels)
File ~/work/pandas/pandas/pandas/core/series.py:5153, in Series.reindex(self, index, axis, method, copy, level, fill_value, limit, tolerance)
5136 @doc(
5137 NDFrame.reindex, # type: ignore[has-type]
5138 klass=_shared_doc_kwargs["klass"],
(...)
5151 tolerance=None,
5152 ) -> Series:
-> 5153 return super().reindex(
5154 index=index,
5155 method=method,
5156 copy=copy,
5157 level=level,
5158 fill_value=fill_value,
5159 limit=limit,
5160 tolerance=tolerance,
5161 )
File ~/work/pandas/pandas/pandas/core/generic.py:5610, in NDFrame.reindex(self, labels, index, columns, axis, method, copy, level, fill_value, limit, tolerance)
5607 return self._reindex_multi(axes, copy, fill_value)
5609 # perform the reindex on the axes
-> 5610 return self._reindex_axes(
5611 axes, level, limit, tolerance, method, fill_value, copy
5612 ).__finalize__(self, method="reindex")
File ~/work/pandas/pandas/pandas/core/generic.py:5633, in NDFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
5630 continue
5632 ax = self._get_axis(a)
-> 5633 new_index, indexer = ax.reindex(
5634 labels, level=level, limit=limit, tolerance=tolerance, method=method
5635 )
5637 axis = self._get_axis_number(a)
5638 obj = obj._reindex_with_indexers(
5639 {axis: [new_index, indexer]},
5640 fill_value=fill_value,
5641 copy=copy,
5642 allow_dups=False,
5643 )
File ~/work/pandas/pandas/pandas/core/indexes/base.py:4429, in Index.reindex(self, target, method, level, limit, tolerance)
4426 raise ValueError("cannot handle a non-unique multi-index!")
4427 elif not self.is_unique:
4428 # GH#42568
-> 4429 raise ValueError("cannot reindex on an axis with duplicate labels")
4430 else:
4431 indexer, _ = self.get_indexer_non_unique(target)
ValueError: cannot reindex on an axis with duplicate labels
Selecting random samples
Use the sample() method to randomly select rows or columns from a Series or DataFrame. The method samples rows by default, and accepts either a specific number of rows/columns to return or a fraction of rows.
In [122]: s = pd.Series([0, 1, 2, 3, 4, 5])
# When no arguments are passed, returns 1 row.
In [123]: s.sample()
Out[123]:
4 4
dtype: int64
# One may specify either a number of rows:
In [124]: s.sample(n=3)
Out[124]:
0 0
4 4
1 1
dtype: int64
# Or a fraction of the rows:
In [125]: s.sample(frac=0.5)
Out[125]:
5 5
3 3
1 1
dtype: int64
By default, sample returns each row at most once, but you can also sample with replacement using the replace option:
In [126]: s = pd.Series([0, 1, 2, 3, 4, 5])
# Without replacement (default):
In [127]: s.sample(n=6, replace=False)
Out[127]:
0 0
1 1
5 5
3 3
2 2
4 4
dtype: int64
# With replacement:
In [128]: s.sample(n=6, replace=True)
Out[128]:
0 0
4 4
3 3
2 2
4 4
4 4
dtype: int64
By default, each row has an equal probability of being selected. To give rows different probabilities, pass sampling weights to sample through the weights argument. The weights can be a list, a NumPy array, or a Series, but they must be the same length as the object you are sampling. Missing values are treated as a weight of zero, and inf values are not allowed. If the weights do not sum to 1, they are re-normalized by dividing each weight by the sum of the weights. For example:
In [129]: s = pd.Series([0, 1, 2, 3, 4, 5])
In [130]: example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]
In [131]: s.sample(n=3, weights=example_weights)
Out[131]:
5 5
4 4
3 3
dtype: int64
# Weights will be re-normalized automatically
In [132]: example_weights2 = [0.5, 0, 0, 0, 0, 0]
In [133]: s.sample(n=1, weights=example_weights2)
Out[133]:
0 0
dtype: int64
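The rules for missing and zero weights can be checked with a short sketch. The weights below are made up for illustration: only positions 2 and 3 carry a positive weight, so they are the only rows that can be drawn.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5])

# NaN is treated as a weight of zero; the remaining weights (1 and 3)
# do not sum to 1, so they are re-normalized to 0.25 and 0.75.
weights = [0, np.nan, 1, 3, 0, 0]

picked = s.sample(n=2, weights=weights, random_state=0)
print(sorted(picked.index))
```

Since sampling is without replacement here and exactly two rows have a non-zero weight, both of them are always selected, whatever the seed.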
When applied to a DataFrame, you can use one of its columns as the sampling weights (provided you are sampling rows and not columns) by passing the name of the column as a string.
In [134]: df2 = pd.DataFrame({'col1': [9, 8, 7, 6],
.....: 'weight_column': [0.5, 0.4, 0.1, 0]})
.....:
In [135]: df2.sample(n=3, weights='weight_column')
Out[135]:
col1 weight_column
1 8 0.4
0 9 0.5
2 7 0.1
sample also lets you sample columns instead of rows using the axis argument.
In [136]: df3 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
In [137]: df3.sample(n=1, axis=1)
Out[137]:
col1
0 1
1 2
2 3
Finally, you can seed sample’s random number generator using the random_state argument, which accepts either an integer (as a seed) or a NumPy RandomState object.
In [138]: df4 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
# With a given seed, the sample will always draw the same rows.
In [139]: df4.sample(n=2, random_state=2)
Out[139]:
col1 col2
2 3 4
1 2 3
In [140]: df4.sample(n=2, random_state=2)
Out[140]:
col1 col2
2 3 4
1 2 3
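The example above passes an integer seed. As a small sketch of the other documented option, the snippet below passes a NumPy RandomState object built from the same seed and compares the two draws; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df4 = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})

# Draw once with an integer seed and once with an explicit RandomState
# object created from that same seed.
from_seed = df4.sample(n=2, random_state=2)
from_rng = df4.sample(n=2, random_state=np.random.RandomState(2))

print(from_seed.equals(from_rng))
```

Passing a RandomState object is useful when several sampling calls should share one generator, because each call then advances the same random stream.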
Setting with enlargement
The .loc/[] operations can perform enlargement when setting a key that does not yet exist on that axis.
In the Series case this is effectively an appending operation.
In [141]: se = pd.Series([1, 2, 3])
In [142]: se
Out[142]:
0 1
1 2
2 3
dtype: int64
In [143]: se[5] = 5.
In [144]: se
Out[144]:
0 1.0
1 2.0
2 3.0
5 5.0
dtype: float64
A DataFrame can be enlarged on either axis via .loc.
In [145]: dfi = pd.DataFrame(np.arange(6).reshape(3, 2),
.....: columns=['A', 'B'])
.....:
In [146]: dfi
Out[146]:
A B
0 0 1
1 2 3
2 4 5
In [147]: dfi.loc[:, 'C'] = dfi.loc[:, 'A']
In [148]: dfi
Out[148]:
A B C
0 0 1 0
1 2 3 2
2 4 5 4
Setting a new row label is like an append operation on the DataFrame.
In [149]: dfi.loc[3] = 5
In [150]: dfi
Out[150]:
A B C
0 0 1 0
1 2 3 2
2 4 5 4
3 5 5 5
Fast scalar value getting and setting
Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you’re asking for. If you only want to access a scalar value, the fastest way is to use the at and iat accessors, which are implemented on all of the data structures.
Similarly to loc, at provides label-based scalar lookups, while iat provides integer-based lookups analogously to iloc.
In [151]: s.iat[5]
Out[151]: 5
In [152]: df.at[dates[5], 'A']
Out[152]: 0.1136484096888855
In [153]: df.iat[3, 0]
Out[153]: -0.7067711336300845
You can also set values using these same indexers.
In [154]: df.at[dates[5], 'E'] = 7
In [155]: df.iat[3, 0] = 7
As with .loc above, at may enlarge the object in place if the label is missing.
In [156]: df.at[dates[-1] + pd.Timedelta('1 day'), 0] = 7
In [157]: df
Out[157]:
A B C D E 0
2000-01-01 -0.282863 0.469112 -1.509059 -1.135632 NaN NaN
2000-01-02 -0.173215 1.212112 0.119209 -1.044236 NaN NaN
2000-01-03 -2.104569 -0.861849 -0.494929 1.071804 NaN NaN
2000-01-04 7.000000 0.721555 -1.039575 0.271860 NaN NaN
2000-01-05 0.567020 -0.424972 0.276232 -1.087401 NaN NaN
2000-01-06 0.113648 -0.673690 -1.478427 0.524988 7.0 NaN
2000-01-07 0.577046 0.404705 -1.715002 -1.039268 NaN NaN
2000-01-08 -1.157892 -0.370647 -1.344312 0.844885 NaN NaN
2000-01-09 NaN NaN NaN NaN NaN 7.0
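The at and iat snippets above rely on objects defined earlier in the guide. The following self-contained sketch repeats the same three operations (scalar get, scalar set, enlargement through at) on a small frame whose values are made up for illustration.

```python
import pandas as pd

dates = pd.date_range("2000-01-01", periods=3)
df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]}, index=dates)

# Scalar lookups: by label with .at, by integer position with .iat.
label_value = df.at[dates[1], "A"]
position_value = df.iat[1, 0]

# Scalar assignment with the same accessors.
df.at[dates[0], "B"] = 40.0
df.iat[2, 0] = 30.0

# Assigning to a row label that does not exist yet enlarges the frame;
# the cells that were not set are filled with NaN.
df.at[dates[-1] + pd.Timedelta("1 day"), "A"] = 7.0
print(df)
```

Only at can enlarge the object this way; iat works purely by position, so it has no label to create.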
Boolean indexing
Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. Each comparison must be wrapped in parentheses, because & and | bind more tightly than the comparison operators: by default Python evaluates an expression such as df['A'] > 2 & df['B'] < 3 as df['A'] > (2 & df['B']) < 3, while the desired evaluation order is (df['A'] > 2) & (df['B'] < 3).
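The effect of the missing parentheses can be seen in a short sketch. The frame below is made up for illustration; the unparenthesized expression becomes a chained comparison, which asks for the truth value of a whole Series and therefore fails.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 3, 5], "B": [1, 2, 4]})

# Parenthesized comparisons combine element-wise as intended.
mask = (df["A"] > 2) & (df["B"] < 3)
selected = df[mask]

# Without parentheses, Python parses this as
# df["A"] > (2 & df["B"]) < 3, a chained comparison that needs the
# truth value of an intermediate Series.
try:
    df["A"] > 2 & df["B"] < 3
    outcome = "no error"
except ValueError as exc:
    outcome = type(exc).__name__

print(selected)
print(outcome)
```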
Using a boolean vector to index a Series works exactly as in a NumPy ndarray:
In [158]: s = pd.Series(range(-3, 4))
In [159]: s
Out[159]:
0 -3
1 -2
2 -1
3 0
4 1
5 2
6 3
dtype: int64
In [160]: s[s > 0]
Out[160]:
4 1
5 2
6 3
dtype: int64
In [161]: s[(s < -1) | (s > 0.5)]
Out[161]:
0 -3
1 -2
4 1
5 2
6 3
dtype: int64
In [162]: s[~(s < 0)]
Out[162]:
3 0
4 1
5 2
6 3
dtype: int64
You may select rows from a DataFrame using a boolean vector the same length as the DataFrame’s index (for example, something derived from one of the columns of the DataFrame):
In [163]: df[df['A'] > 0]
Out[163]:
A B C D E 0
2000-01-04 7.000000 0.721555 -1.039575 0.271860 NaN NaN
2000-01-05 0.567020 -0.424972 0.276232 -1.087401 NaN NaN
2000-01-06 0.113648 -0.673690 -1.478427 0.524988 7.0 NaN
2000-01-07 0.577046 0.404705 -1.715002 -1.039268 NaN NaN
List comprehensions and the map method of Series can also be used to produce more complex criteria:
In [164]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
.....: 'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
.....: 'c': np.random.randn(7)})
.....:
# only want 'two' or 'three'
In [165]: criterion = df2['a'].map(lambda x: x.startswith('t'))
In [166]: df2[criterion]
Out[166]:
a b c
2 two y 0.041290
3 three x 0.361719
4 two y -0.238075
# equivalent but slower
In [167]: df2[[x.startswith('t') for x in df2['a']]]
Out[167]:
a b c
2 two y 0.041290
3 three x 0.361719
4 two y -0.238075
# Multiple criteria
In [168]: df2[criterion & (df2['b'] == 'x')]
Out[168]:
a b c
3 three x 0.361719
With the selection methods described in Selection by Label, Selection by Position, and Advanced Indexing, you may select along more than one axis using boolean vectors combined with other indexing expressions.
In [169]: df2.loc[criterion & (df2['b'] == 'x'), 'b':'c']
Out[169]:
b c
3 x 0.361719
Warning
iloc supports boolean indexing with a boolean array, but not with a boolean Series. In the following example, df.iloc[s.values, 1] works because the boolean indexer is an array, whereas df.iloc[s, 1] would raise a ValueError.
In [170]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
.....: index=list('abc'),
.....: columns=['A', 'B'])
.....:
In [171]: s = (df['A'] > 2)
In [172]: s
Out[172]:
a False
b True
c True
Name: A, dtype: bool
In [173]: df.loc[s, 'B']
Out[173]:
b 4
c 6
Name: B, dtype: int64
In [174]: df.iloc[s.values, 1]
Out[174]:
b 4
c 6
Name: B, dtype: int64
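The failing case from the warning is not shown above. A minimal sketch of both cases follows; it catches NotImplementedError alongside ValueError as a precaution, on the assumption that the exact exception type can depend on the index type and the pandas version.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], index=list("abc"), columns=["A", "B"])
mask = df["A"] > 2

# A plain boolean array is accepted by .iloc.
from_array = df.iloc[mask.to_numpy(), 1]

# A boolean Series carries an index, which .iloc refuses to use as a mask.
try:
    df.iloc[mask, 1]
    raised = None
except (ValueError, NotImplementedError) as exc:
    raised = type(exc).__name__

print(from_array.tolist())
print(raised)
```

When the mask is a Series, .loc is usually the more natural choice, because it aligns the mask on labels.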
Indexing with isin
Consider the isin() method of Series, which returns a boolean vector that is True wherever the Series elements exist in the passed list. This allows you to select rows where one or more columns have the values you want:
In [175]: s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')
In [176]: s
Out[176]:
4 0
3 1
2 2
1 3
0 4
dtype: int64
In [177]: s.isin([2, 4, 6])
Out[177]:
4 False
3 False
2 True
1 False
0 True
dtype: bool
In [178]: s[s.isin([2, 4, 6])]
Out[178]:
2 2
0 4
dtype: int64
The same method is available for Index objects and is useful when you don’t know which of the sought labels are in fact present:
In [179]: s[s.index.isin([2, 4, 6])]
Out[179]:
4 0
2 2
dtype: int64
# compare it to the following
In [180]: s.reindex([2, 4, 6])
Out[180]:
2 2.0
4 0.0
6 NaN
dtype: float64
In addition, MultiIndex allows selecting a separate level to use in the membership check:
In [181]: s_mi = pd.Series(np.arange(6),
.....: index=pd.MultiIndex.from_product([[0, 1], ['a', 'b', 'c']]))
.....:
In [182]: s_mi
Out[182]:
0 a 0
b 1
c 2
1 a 3
b 4
c 5
dtype: int64
In [183]: s_mi.iloc[s_mi.index.isin([(1, 'a'), (2, 'b'), (0, 'c')])]
Out[183]:
0 c 2
1 a 3
dtype: int64
In [184]: s_mi.iloc[s_mi.index.isin(['a', 'c', 'e'], level=1)]
Out[184]:
0 a 0
c 2
1 a 3
c 5
dtype: int64
DataFrame also has an isin() method. When calling isin, pass a set of values as either an array or a dict. If values is an array, isin returns a DataFrame of booleans that is the same shape as the original DataFrame, with True wherever the element is in the sequence of values.
In [185]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],
.....: 'ids2': ['a', 'n', 'c', 'n']})
.....:
In [186]: values = ['a', 'b', 1, 3]
In [187]: df.isin(values)
Out[187]:
vals ids ids2
0 True True True
1 False True False
2 True False False
3 False False False
Often you’ll want to match certain values against certain columns. To do so, make values a dict where each key is a column name and each value is a list of the items to check for in that column.
In [188]: values = {'ids': ['a', 'b'], 'vals': [1, 3]}
In [189]: df.isin(values)
Out[189]:
vals ids ids2
0 True True False
1 False True False
2 True False False
3 False False False
To get a boolean DataFrame marking the elements that are not among the given values, negate the result with the ~ operator:
In [190]: values = {'ids': ['a', 'b'], 'vals': [1, 3]}
In [191]: ~df.isin(values)
Out[191]:
vals ids ids2
0 False False True
1 True False True
2 False True True
3 True True True
Combine DataFrame’s isin with the any() and all() methods to quickly select subsets of your data that meet a given criterion. To select the rows where each column meets its own criterion:
In [192]: values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}
In [193]: row_mask = df.isin(values).all(axis=1)
In [194]: df[row_mask]
Out[194]:
vals ids ids2
0 1 a a
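The example above uses all() only. As a sketch of the difference between the two reductions, the snippet below rebuilds the same frame and compares all(axis=1), where every column must match, with any(axis=1), where at least one column must match.

```python
import pandas as pd

df = pd.DataFrame(
    {"vals": [1, 2, 3, 4], "ids": ["a", "b", "f", "n"], "ids2": ["a", "n", "c", "n"]}
)
values = {"ids": ["a", "b"], "ids2": ["a", "c"], "vals": [1, 3]}

matches = df.isin(values)

# Rows where every column satisfies its own membership test.
all_match = df[matches.all(axis=1)]

# Rows where at least one column satisfies its membership test.
any_match = df[matches.any(axis=1)]

print(all_match.index.tolist())
print(any_match.index.tolist())
```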
The where() Method and Masking
Selecting values from a Series with a boolean vector generally returns a subset of the data. To guarantee that the selection output has the same shape as the original data, you can use the where method of Series and DataFrame.
To return only the selected rows:
In [195]: s[s > 0]
Out[195]:
3 1
2 2
1 3
0 4
dtype: int64
To return a Series of the same shape as the original:
In [196]: s.where(s > 0)
Out[196]:
4 NaN
3 1.0
2 2.0
1 3.0
0 4.0
dtype: float64
Selecting values from a DataFrame with a boolean criterion also preserves the shape of the input data; where is used under the hood as the implementation. The code below is equivalent to df.where(df < 0).
In [197]: dates = pd.date_range('1/1/2000', periods=8)
In [198]: df = pd.DataFrame(np.random.randn(8, 4),
.....: index=dates, columns=['A', 'B', 'C', 'D'])
.....:
In [199]: df[df < 0]
Out[199]:
A B C D
2000-01-01 -2.104139 -1.309525 NaN NaN
2000-01-02 -0.352480 NaN -1.192319 NaN
2000-01-03 -0.864883 NaN -0.227870 NaN
2000-01-04 NaN -1.222082 NaN -1.233203
2000-01-05 NaN -0.605656 -1.169184 NaN
2000-01-06 NaN -0.948458 NaN -0.684718
2000-01-07 -2.670153 -0.114722 NaN -0.048048
2000-01-08 NaN NaN -0.048788 -0.808838
In addition, where takes an optional other argument that supplies the replacement for values where the condition is False in the returned copy.
In [200]: df.where(df < 0, -df)
Out[200]:
A B C D
2000-01-01 -2.104139 -1.309525 -0.485855 -0.245166
2000-01-02 -0.352480 -0.390389 -1.192319 -1.655824
2000-01-03 -0.864883 -0.299674 -0.227870 -0.281059
2000-01-04 -0.846958 -1.222082 -0.600705 -1.233203
2000-01-05 -0.669692 -0.605656 -1.169184 -0.342416
2000-01-06 -0.868584 -0.948458 -2.297780 -0.684718
2000-01-07 -2.670153 -0.114722 -0.168904 -0.048048
2000-01-08 -0.801196 -1.392071 -0.048788 -0.808838
You may wish to set values based on some boolean criteria. This can be done intuitively like so:
In [201]: s2 = s.copy()
In [202]: s2[s2 < 0] = 0
In [203]: s2
Out[203]:
4 0
3 1
2 2
1 3
0 4
dtype: int64
In [204]: df2 = df.copy()
In [205]: df2[df2 < 0] = 0
In [206]: df2
Out[206]:
A B C D
2000-01-01 0.000000 0.000000 0.485855 0.245166
2000-01-02 0.000000 0.390389 0.000000 1.655824
2000-01-03 0.000000 0.299674 0.000000 0.281059
2000-01-04 0.846958 0.000000 0.600705 0.000000
2000-01-05 0.669692 0.000000 0.000000 0.342416
2000-01-06 0.868584 0.000000 2.297780 0.000000
2000-01-07 0.000000 0.000000 0.168904 0.000000
2000-01-08 0.801196 1.392071 0.000000 0.000000
where returns a modified copy of the data.
The signature for DataFrame.where() differs from numpy.where(). Roughly, df1.where(m, df2) is equivalent to np.where(m, df1, df2).
In [207]: df.where(df < 0, -df) == np.where(df < 0, df, -df)
Out[207]:
A B C D
2000-01-01 True True True True
2000-01-02 True True True True
2000-01-03 True True True True
2000-01-04 True True True True
2000-01-05 True True True True
2000-01-06 True True True True
2000-01-07 True True True True
2000-01-08 True True True True
Alignment
Furthermore, where aligns the input boolean condition (ndarray or DataFrame), so that partial selection with setting is possible. This is analogous to partial setting via .loc (but on the contents rather than the axis labels).
In [208]: df2 = df.copy()
In [209]: df2[df2[1:4] > 0] = 3
In [210]: df2
Out[210]:
A B C D
2000-01-01 -2.104139 -1.309525 0.485855 0.245166
2000-01-02 -0.352480 3.000000 -1.192319 3.000000
2000-01-03 -0.864883 3.000000 -0.227870 3.000000
2000-01-04 3.000000 -1.222082 3.000000 -1.233203
2000-01-05 0.669692 -0.605656 -1.169184 0.342416
2000-01-06 0.868584 -0.948458 2.297780 -0.684718
2000-01-07 -2.670153 -0.114722 0.168904 -0.048048
2000-01-08 0.801196 1.392071 -0.048788 -0.808838
where can also accept axis and level parameters to align the input when performing the where.
In [211]: df2 = df.copy()
In [212]: df2.where(df2 > 0, df2['A'], axis='index')
Out[212]:
A B C D
2000-01-01 -2.104139 -2.104139 0.485855 0.245166
2000-01-02 -0.352480 0.390389 -0.352480 1.655824
2000-01-03 -0.864883 0.299674 -0.864883 0.281059
2000-01-04 0.846958 0.846958 0.600705 0.846958
2000-01-05 0.669692 0.669692 0.669692 0.342416
2000-01-06 0.868584 0.868584 2.297780 0.868584
2000-01-07 -2.670153 -2.670153 0.168904 -2.670153
2000-01-08 0.801196 1.392071 0.801196 0.801196
This is equivalent to (but faster than) the following.
In [213]: df2 = df.copy()
In [214]: df2.apply(lambda x, y: x.where(x > 0, y), y=df['A'])
Out[214]:
A B C D
2000-01-01 -2.104139 -2.104139 0.485855 0.245166
2000-01-02 -0.352480 0.390389 -0.352480 1.655824
2000-01-03 -0.864883 0.299674 -0.864883 0.281059
2000-01-04 0.846958 0.846958 0.600705 0.846958
2000-01-05 0.669692 0.669692 0.669692 0.342416
2000-01-06 0.868584 0.868584 2.297780 0.868584
2000-01-07 -2.670153 -2.670153 0.168904 -2.670153
2000-01-08 0.801196 1.392071 0.801196 0.801196
where can accept a callable for both the condition and the other argument. Each callable must take one argument (the calling Series or DataFrame) and return valid output for the condition or the other argument, respectively.
In [215]: df3 = pd.DataFrame({'A': [1, 2, 3],
.....: 'B': [4, 5, 6],
.....: 'C': [7, 8, 9]})
.....:
In [216]: df3.where(lambda x: x > 4, lambda x: x + 10)
Out[216]:
A B C
0 11 14 7
1 12 5 8
2 13 6 9
Mask
mask() is the inverse boolean operation of where: it replaces values where the condition is True.
In [217]: s.mask(s >= 0)
Out[217]:
4 NaN
3 NaN
2 NaN
1 NaN
0 NaN
dtype: float64
In [218]: df.mask(df >= 0)
Out[218]:
A B C D
2000-01-01 -2.104139 -1.309525 NaN NaN
2000-01-02 -0.352480 NaN -1.192319 NaN
2000-01-03 -0.864883 NaN -0.227870 NaN
2000-01-04 NaN -1.222082 NaN -1.233203
2000-01-05 NaN -0.605656 -1.169184 NaN
2000-01-06 NaN -0.948458 NaN -0.684718
2000-01-07 -2.670153 -0.114722 NaN -0.048048
2000-01-08 NaN NaN -0.048788 -0.808838
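The relationship between mask and where can be checked directly. The sketch below uses a small made-up Series and shows that masking with a condition gives the same result as calling where with the negated condition; it also shows the optional replacement value.

```python
import pandas as pd

s = pd.Series([-2, -1, 0, 1, 2])

# mask replaces values where the condition is True ...
masked = s.mask(s >= 0)

# ... which matches where applied to the negated condition.
kept = s.where(~(s >= 0))

# Like where, mask accepts a replacement value instead of NaN.
replaced = s.mask(s >= 0, 0)

print(masked.equals(kept))
print(replaced.tolist())
```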
Setting with enlargement conditionally using NumPy
An alternative to where() is to use numpy.where(). Combined with setting a new column, you can use it to enlarge a DataFrame where the values are determined conditionally.
Suppose you have two choices to pick from in the following DataFrame, and you want to set a new column color to ‘green’ when the second column has ‘Z’ and to ‘red’ otherwise. You can do the following:
In [219]: df = pd.DataFrame({'col1': list('ABBC'), 'col2': list('ZZXY')})
In [220]: df['color'] = np.where(df['col2'] == 'Z', 'green', 'red')
In [221]: df
Out[221]:
col1 col2 color
0 A Z green
1 B Z green
2 B X red
3 C Y red
If you have multiple conditions, you can use numpy.select() instead. Say there are three conditions with three corresponding color choices, and a fourth color as a fallback; you can do the following.
In [222]: conditions = [
.....: (df['col2'] == 'Z') & (df['col1'] == 'A'),
.....: (df['col2'] == 'Z') & (df['col1'] == 'B'),
.....: (df['col1'] == 'B')
.....: ]
.....:
In [223]: choices = ['yellow', 'blue', 'purple']
In [224]: df['color'] = np.select(conditions, choices, default='black')
In [225]: df
Out[225]:
col1 col2 color
0 A Z yellow
1 B Z blue
2 B X purple
3 C Y black
The query() Method
You can select the rows of a frame where column b has values between the values of columns a and c. For example:
In [226]: n = 10
In [227]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
In [228]: df
Out[228]:
a b c
0 0.438921 0.118680 0.863670
1 0.138138 0.577363 0.686602
2 0.595307 0.564592 0.520630
3 0.913052 0.926075 0.616184
4 0.078718 0.854477 0.898725
5 0.076404 0.523211 0.591538
6 0.792342 0.216974 0.564056
7 0.397890 0.454131 0.915716
8 0.074315 0.437913 0.019794
9 0.559209 0.502065 0.026437
# pure python
In [229]: df[(df['a'] < df['b']) & (df['b'] < df['c'])]
Out[229]:
a b c
1 0.138138 0.577363 0.686602
4 0.078718 0.854477 0.898725
5 0.076404 0.523211 0.591538
7 0.397890 0.454131 0.915716
# query
In [230]: df.query('(a < b) & (b < c)')
Out[230]:
a b c
1 0.138138 0.577363 0.686602
4 0.078718 0.854477 0.898725
5 0.076404 0.523211 0.591538
7 0.397890 0.454131 0.915716
Do the same thing but fall back on a named index if there is no column with the name a.
In [231]: df = pd.DataFrame(np.random.randint(n / 2, size=(n, 2)), columns=list('bc'))
In [232]: df.index.name = 'a'
In [233]: df
Out[233]:
b c
a
0 0 4
1 0 1
2 3 4
3 4 3
4 1 4
5 0 3
6 0 1
7 3 4
8 2 3
9 1 1
In [234]: df.query('a < b and b < c')
Out[234]:
b c
a
2 3 4
If instead you don't want to or cannot name your index, you can use the name index in your query expression:
In [235]: df = pd.DataFrame(np.random.randint(n, size=(n, 2)), columns=list('bc'))
In [236]: df
Out[236]:
b c
0 3 1
1 3 0
2 5 6
3 5 2
4 7 4
5 0 1
6 2 5
7 0 1
8 6 0
9 7 9
In [237]: df.query('index < b < c')
Out[237]:
b c
2 5 6
If the name of your index overlaps with a column name, the column name is given precedence. For example:
In [238]: df = pd.DataFrame({'a': np.random.randint(5, size=5)})
In [239]: df.index.name = 'a'
In [240]: df.query('a > 2') # uses the column 'a', not the index
Out[240]:
a
a
1 3
3 3
You can still use the index in a query expression by using the special identifier 'index':
In [241]: df.query('index > 2')
Out[241]:
a
a
3 3
4 2
If for some reason you have a column named index, you can refer to the index as ilevel_0 instead, but at this point you should consider renaming your columns to something less ambiguous.
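As a small sketch of the ilevel_0 fallback, the frame below has an unnamed index and a column that happens to be called index. The data and the names by_column and by_index_level are made up for illustration.

```python
import pandas as pd

# A frame with an unnamed default index and a column literally named 'index'.
df = pd.DataFrame({'index': [30, 20, 10], 'x': [1, 2, 3]})

# 'index' resolves to the column, because columns take precedence.
by_column = df.query('index > 15')

# 'ilevel_0' refers to level 0 of the (unnamed) index.
by_index_level = df.query('ilevel_0 > 0')

print(by_column)
print(by_index_level)
```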
MultiIndex query() Syntax
You can also use the levels of a DataFrame with a MultiIndex as if they were columns in the frame:
In [242]: n = 10
In [243]: colors = np.random.choice(['red', 'green'], size=n)
In [244]: foods = np.random.choice(['eggs', 'ham'], size=n)
In [245]: colors
Out[245]:
array(['red', 'red', 'red', 'green', 'green', 'green', 'green', 'green',
'green', 'green'], dtype='<U5')
In [246]: foods
Out[246]:
array(['ham', 'ham', 'eggs', 'eggs', 'eggs', 'ham', 'ham', 'eggs', 'eggs',
'eggs'], dtype='<U4')
In [247]: index = pd.MultiIndex.from_arrays([colors, foods], names=['color', 'food'])
In [248]: df = pd.DataFrame(np.random.randn(n, 2), index=index)
In [249]: df
Out[249]:
0 1
color food
red ham 0.194889 -0.381994
ham 0.318587 2.089075
eggs -0.728293 -0.090255
green eggs -0.748199 1.318931
eggs -2.029766 0.792652
ham 0.461007 -0.542749
ham -0.305384 -0.479195
eggs 0.095031 -0.270099
eggs -0.707140 -0.773882
eggs 0.229453 0.304418
In [250]: df.query('color == "red"')
Out[250]:
0 1
color food
red ham 0.194889 -0.381994
ham 0.318587 2.089075
eggs -0.728293 -0.090255
If the levels of the MultiIndex are unnamed, you can refer to them using special names:
In [251]: df.index.names = [None, None]
In [252]: df
Out[252]:
0 1
red ham 0.194889 -0.381994
ham 0.318587 2.089075
eggs -0.728293 -0.090255
green eggs -0.748199 1.318931
eggs -2.029766 0.792652
ham 0.461007 -0.542749
ham -0.305384 -0.479195
eggs 0.095031 -0.270099
eggs -0.707140 -0.773882
eggs 0.229453 0.304418
In [253]: df.query('ilevel_0 == "red"')
Out[253]:
0 1
red ham 0.194889 -0.381994
ham 0.318587 2.089075
eggs -0.728293 -0.090255
The convention is ilevel_0, which means "index level 0" for the 0th level of the index.
query() Use Cases
One use case for query() is when you have a collection of DataFrame objects that have a subset of column names (or index levels/names) in common. You can pass the same query to every frame without having to specify which frame you're interested in querying:
In [254]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
In [255]: df
Out[255]:
a b c
0 0.224283 0.736107 0.139168
1 0.302827 0.657803 0.713897
2 0.611185 0.136624 0.984960
3 0.195246 0.123436 0.627712
4 0.618673 0.371660 0.047902
5 0.480088 0.062993 0.185760
6 0.568018 0.483467 0.445289
7 0.309040 0.274580 0.587101
8 0.258993 0.477769 0.370255
9 0.550459 0.840870 0.304611
In [256]: df2 = pd.DataFrame(np.random.rand(n + 2, 3), columns=df.columns)
In [257]: df2
Out[257]:
a b c
0 0.357579 0.229800 0.596001
1 0.309059 0.957923 0.965663
2 0.123102 0.336914 0.318616
3 0.526506 0.323321 0.860813
4 0.518736 0.486514 0.384724
5 0.190804 0.505723 0.614533
6 0.891939 0.623977 0.676639
7 0.480559 0.378528 0.460858
8 0.420223 0.136404 0.141295
9 0.732206 0.419540 0.604675
10 0.604466 0.848974 0.896165
11 0.589168 0.920046 0.732716
In [258]: expr = '0.0 <= a <= c <= 0.5'
In [259]: map(lambda frame: frame.query(expr), [df, df2])
Out[259]: <map at 0x7ff2e57db2e0>
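In Python 3, map() returns a lazy iterator, which is why the output above shows a map object rather than the filtered frames. A minimal sketch of one way to materialize the results is a list comprehension; the small frames and the name results below are illustrative assumptions, not the random data shown above.

```python
import pandas as pd

df = pd.DataFrame({'a': [0.1, 0.4, 0.6],
                   'b': [0.5, 0.5, 0.5],
                   'c': [0.3, 0.2, 0.9]})
df2 = pd.DataFrame({'a': [0.6, 0.2],
                    'b': [0.1, 0.1],
                    'c': [0.7, 0.4]})

expr = '0.0 <= a <= c <= 0.5'

# Evaluate the same expression against every frame and keep the results.
results = [frame.query(expr) for frame in (df, df2)]

for result in results:
    print(result)
```

Wrapping the original call as list(map(...)) would have the same effect.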
query() Python versus pandas Syntax Comparison
Full numpy-like syntax:
In [260]: df = pd.DataFrame(np.random.randint(n, size=(n, 3)), columns=list('abc'))
In [261]: df
Out[261]:
a b c
0 7 8 9
1 1 0 7
2 2 7 2
3 6 2 2
4 2 6 3
5 3 8 2
6 1 7 2
7 5 1 5
8 9 8 0
9 1 5 0
In [262]: df.query('(a < b) & (b < c)')
Out[262]:
a b c
0 7 8 9
In [263]: df[(df['a'] < df['b']) & (df['b'] < df['c'])]
Out[263]:
a b c
0 7 8 9
Slightly nicer by removing the parentheses (inside a query expression, comparison operators bind tighter than & and |):
In [264]: df.query('a < b & b < c')
Out[264]:
a b c
0 7 8 9
Use English instead of symbols:
In [265]: df.query('a < b and b < c')
Out[265]:
a b c
0 7 8 9
Pretty close to how you might write it on paper:
In [266]: df.query('a < b < c')
Out[266]:
a b c
0 7 8 9
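The four spellings above are meant to be interchangeable. The sketch below checks that on a small fixed frame; the data and the names reference and variants are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 5, 2], 'b': [2, 4, 3], 'c': [3, 6, 1]})

# Plain boolean indexing as the reference result.
reference = df[(df['a'] < df['b']) & (df['b'] < df['c'])]

# The query() spellings discussed above.
variants = [
    df.query('(a < b) & (b < c)'),
    df.query('a < b & b < c'),
    df.query('a < b and b < c'),
    df.query('a < b < c'),
]

for variant in variants:
    print(variant.equals(reference))
```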
The in and not in operators
query() also supports special use of Python's in and not in comparison operators, providing a succinct syntax for calling the isin method of a Series or DataFrame.
# get all rows where columns "a" and "b" have overlapping values
In [267]: df = pd.DataFrame({'a': list('aabbccddeeff'), 'b': list('aaaabbbbcccc'),
.....: 'c': np.random.randint(5, size=12),
.....: 'd': np.random.randint(9, size=12)})
.....:
In [268]: df
Out[268]:
a b c d
0 a a 2 6
1 a a 4 7
2 b a 1 6
3 b a 2 1
4 c b 3 6
5 c b 0 2
6 d b 3 3
7 d b 2 1
8 e c 4 3
9 e c 2 0
10 f c 0 6
11 f c 1 2
In [269]: df.query('a in b')
Out[269]:
a b c d
0 a a 2 6
1 a a 4 7
2 b a 1 6
3 b a 2 1
4 c b 3 6
5 c b 0 2
# How you'd do it in pure Python
In [270]: df[df['a'].isin(df['b'])]
Out[270]:
a b c d
0 a a 2 6
1 a a 4 7
2 b a 1 6
3 b a 2 1
4 c b 3 6
5 c b 0 2
In [271]: df.query('a not in b')
Out[271]:
a b c d
6 d b 3 3
7 d b 2 1
8 e c 4 3
9 e c 2 0
10 f c 0 6
11 f c 1 2
# pure Python
In [272]: df[~df['a'].isin(df['b'])]
Out[272]:
a b c d
6 d b 3 3
7 d b 2 1
8 e c 4 3
9 e c 2 0
10 f c 0 6
11 f c 1 2
You can combine this with other expressions for very succinct queries:
# rows where cols a and b have overlapping values
# and col c's values are less than col d's
In [273]: df.query('a in b and c < d')
Out[273]:
a b c d
0 a a 2 6
1 a a 4 7
2 b a 1 6
4 c b 3 6
5 c b 0 2
# pure Python
In [274]: df[df['a'].isin(df['b']) & (df['c'] < df['d'])]
Out[274]:
a b c d
0 a a 2 6
1 a a 4 7
2 b a 1 6
4 c b 3 6
5 c b 0 2
Note that in and not in are evaluated in Python, since numexpr has no equivalent of this operation. However, only the in/not in expression itself is evaluated in vanilla Python. For example, in the expression
df.query('a in b + c + d')
(b + c + d) is evaluated by numexpr and then the in operation is evaluated in plain Python. In general, any operations that can be evaluated using numexpr will be.
Special use of the == operator with list objects
Comparing a list of values to a column using ==/!= works similarly to in/not in.
In [275]: df.query('b == ["a", "b", "c"]')
Out[275]:
a b c d
0 a a 2 6
1 a a 4 7
2 b a 1 6
3 b a 2 1
4 c b 3 6
5 c b 0 2
6 d b 3 3
7 d b 2 1
8 e c 4 3
9 e c 2 0
10 f c 0 6
11 f c 1 2
# pure Python
In [276]: df[df['b'].isin(["a", "b", "c"])]
Out[276]:
a b c d
0 a a 2 6
1 a a 4 7
2 b a 1 6
3 b a 2 1
4 c b 3 6
5 c b 0 2
6 d b 3 3
7 d b 2 1
8 e c 4 3
9 e c 2 0
10 f c 0 6
11 f c 1 2
In [277]: df.query('c == [1, 2]')
Out[277]:
a b c d
0 a a 2 6
2 b a 1 6
3 b a 2 1
7 d b 2 1
9 e c 2 0
11 f c 1 2
In [278]: df.query('c != [1, 2]')
Out[278]:
a b c d
1 a a 4 7
4 c b 3 6
5 c b 0 2
6 d b 3 3
8 e c 4 3
10 f c 0 6
# using in/not in
In [279]: df.query('[1, 2] in c')
Out[279]:
a b c d
0 a a 2 6
2 b a 1 6
3 b a 2 1
7 d b 2 1
9 e c 2 0
11 f c 1 2
In [280]: df.query('[1, 2] not in c')
Out[280]:
a b c d
1 a a 4 7
4 c b 3 6
5 c b 0 2
6 d b 3 3
8 e c 4 3
10 f c 0 6
# pure Python
In [281]: df[df['c'].isin([1, 2])]
Out[281]:
a b c d
0 a a 2 6
2 b a 1 6
3 b a 2 1
7 d b 2 1
9 e c 2 0
11 f c 1 2
Boolean operators
You can negate boolean expressions with the word not or the ~ operator.
In [282]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
In [283]: df['bools'] = np.random.rand(len(df)) > 0.5
In [284]: df.query('~bools')
Out[284]:
a b c bools
2 0.697753 0.212799 0.329209 False
7 0.275396 0.691034 0.826619 False
8 0.190649 0.558748 0.262467 False
In [285]: df.query('not bools')
Out[285]:
a b c bools
2 0.697753 0.212799 0.329209 False
7 0.275396 0.691034 0.826619 False
8 0.190649 0.558748 0.262467 False
In [286]: df.query('not bools') == df[~df['bools']]
Out[286]:
a b c bools
2 True True True True
7 True True True True
8 True True True True
Of course, expressions can be arbitrarily complex too:
# short query syntax
In [287]: shorter = df.query('a < b < c and (not bools) or bools > 2')
# equivalent in pure Python
In [288]: longer = df[(df['a'] < df['b'])
.....: & (df['b'] < df['c'])
.....: & (~df['bools'])
.....: | (df['bools'] > 2)]
.....:
In [289]: shorter
Out[289]:
a b c bools
7 0.275396 0.691034 0.826619 False
In [290]: longer
Out[290]:
a b c bools
7 0.275396 0.691034 0.826619 False
In [291]: shorter == longer
Out[291]:
a b c bools
7 True True True True
Performance of query()
DataFrame.query() using numexpr is slightly faster than Python for large frames.
You will only see the performance benefits of using the numexpr engine with DataFrame.query() if your frame has more than approximately 100,000 rows.
The benchmark plot that accompanies this claim in the pandas documentation was created using a DataFrame with 3 columns, each containing floating point values generated using numpy.random.randn().
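If you want to compare engines on your own data, one option is to pass the engine keyword explicitly. The sketch below only checks that the results agree; it makes no timing claim, and the frame size, seed, and variable names are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 3)), columns=list('abc'))

# Default engine: numexpr when it is installed, otherwise plain Python.
default_result = df.query('a < b < c')

# Force evaluation in plain Python.
python_result = df.query('a < b < c', engine='python')

# Ordinary boolean indexing for reference.
boolean_result = df[(df['a'] < df['b']) & (df['b'] < df['c'])]

print(default_result.equals(python_result))
print(default_result.equals(boolean_result))
```

Each call could be wrapped in timeit.timeit() to see where, if anywhere, the crossover falls on your machine.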
In [292]: df = pd.DataFrame(np.random.randn(8, 4),
.....: index=dates, columns=['A', 'B', 'C', 'D'])
.....:
In [293]: df2 = df.copy()
Duplicate data
If you want to identify and remove duplicate rows in a DataFrame, there are two methods that will help: duplicated and drop_duplicates. Each takes as an argument the columns to use to identify duplicated rows.
- duplicated returns a boolean vector whose length is the number of rows, and which indicates whether a row is duplicated.
- drop_duplicates removes duplicate rows.
By default, the first observed row of a duplicate set is considered unique, but each method has a keep parameter to specify which rows are kept.
- keep='first' (default): mark / drop duplicates except for the first occurrence.
- keep='last': mark / drop duplicates except for the last occurrence.
- keep=False: mark / drop all duplicates.
In [294]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
.....: 'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
.....: 'c': np.random.randn(7)})
.....:
In [295]: df2
Out[295]:
a b c
0 one x -1.067137
1 one y 0.309500
2 two x -0.211056
3 two y -1.842023
4 two x -0.390820
5 three x -1.964475
6 four x 1.298329
In [296]: df2.duplicated('a')
Out[296]:
0 False
1 True
2 False
3 True
4 True
5 False
6 False
dtype: bool
In [297]: df2.duplicated('a', keep='last')
Out[297]:
0 True
1 False
2 True
3 True
4 False
5 False
6 False
dtype: bool
In [298]: df2.duplicated('a', keep=False)
Out[298]:
0 True
1 True
2 True
3 True
4 True
5 False
6 False
dtype: bool
In [299]: df2.drop_duplicates('a')
Out[299]:
a b c
0 one x -1.067137
2 two x -0.211056
5 three x -1.964475
6 four x 1.298329
In [300]: df2.drop_duplicates('a', keep='last')
Out[300]:
a b c
1 one y 0.309500
4 two x -0.390820
5 three x -1.964475
6 four x 1.298329
In [301]: df2.drop_duplicates('a', keep=False)
Out[301]:
a b c
5 three x -1.964475
6 four x 1.298329
Also, you can pass a list of columns to identify duplications.
In [302]: df2.duplicated(['a', 'b'])
Out[302]:
0 False
1 False
2 False
3 False
4 True
5 False
6 False
dtype: bool
In [303]: df2.drop_duplicates(['a', 'b'])
Out[303]:
a b c
0 one x -1.067137
1 one y 0.309500
2 two x -0.211056
3 two y -1.842023
5 three x -1.964475
6 four x 1.298329
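The two methods are closely related: negating the mask from duplicated and indexing with it should give the same rows as drop_duplicates called with the same arguments. A minimal sketch, using made-up names via_mask and via_method:

```python
import pandas as pd

df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
                    'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x']})

# Boolean mask that is True for every row considered a duplicate.
mask = df2.duplicated('a', keep='last')

# Keeping the rows that are not flagged ...
via_mask = df2[~mask]

# ... matches what drop_duplicates returns for the same arguments.
via_method = df2.drop_duplicates('a', keep='last')

print(via_mask.equals(via_method))
```

The mask form is handy when you want to inspect or count the duplicates before discarding them.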
To drop duplicates by index value, use Index.duplicated and then slice with the result. The same set of options is available for the keep parameter.
In [304]: df3 = pd.DataFrame({'a': np.arange(6),
.....: 'b': np.random.randn(6)},
.....: index=['a', 'a', 'b', 'c', 'b', 'a'])
.....:
In [305]: df3
Out[305]:
a b
a 0 1.440455
a 1 2.456086
b 2 1.038402
c 3 -0.894409
b 4 0.683536
a 5 3.082764
In [306]: df3.index.duplicated()
Out[306]: array([False, True, False, False, True, True])
In [307]: df3[~df3.index.duplicated()]
Out[307]:
a b
a 0 1.440455
b 2 1.038402
c 3 -0.894409
In [308]: df3[~df3.index.duplicated(keep='last')]
Out[308]:
a b
c 3 -0.894409
b 4 0.683536
a 5 3.082764
In [309]: df3[~df3.index.duplicated(keep=False)]
Out[309]:
a b
c 3 -0.894409
Dictionary-like get() method
Series and DataFrame each have a get method, which can return a default value when the key is missing.
In [310]: s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
In [311]: s.get('a') # equivalent to s['a']
Out[311]: 1
In [312]: s.get('x', default=-1)
Out[312]: -1
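The same method works on a DataFrame, where the key is a column label. The sketch below uses an illustrative frame and shows that a missing key yields None unless a default is supplied.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Without an explicit default, a missing key gives None instead of raising.
missing_from_series = s.get('x')

# On a DataFrame, get() looks up a column.
column_a = df.get('a')
missing_column = df.get('z', default='no such column')

print(missing_from_series)
print(column_a.tolist())
print(missing_column)
```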
Looking up values by index/column labels
Sometimes you want to extract a set of values given a sequence of row labels and column labels. This can be achieved with pandas.factorize and NumPy indexing. For instance:
In [313]: df = pd.DataFrame({'col': ["A", "A", "B", "B"],
.....: 'A': [80, 23, np.nan, 22],
.....: 'B': [80, 55, 76, 67]})
.....:
In [314]: df
Out[314]:
col A B
0 A 80.0 80
1 A 23.0 55
2 B NaN 76
3 B 22.0 67
In [315]: idx, cols = pd.factorize(df['col'])
In [316]: df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
Out[316]: array([80., 23., 76., 67.])
Formerly this could be achieved with the dedicated DataFrame.lookup method, which was deprecated in version 1.2.0 and removed in version 2.0.0.
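If you need this pattern in several places, the factorize approach can be wrapped in a small helper. The function name lookup_by_label below is hypothetical and not part of pandas; it is only a sketch of the technique shown above.

```python
import numpy as np
import pandas as pd


def lookup_by_label(frame, label_column):
    """For each row, pick the value from the column named in label_column.

    Hypothetical helper, not a pandas API.
    """
    # codes[i] is the position of row i's label within uniques.
    codes, uniques = pd.factorize(frame[label_column])
    # Reorder the columns so that position k holds the column uniques[k].
    values = frame.reindex(uniques, axis=1).to_numpy()
    # Pair every row number with the column position of its label.
    return values[np.arange(len(frame)), codes]


df = pd.DataFrame({'col': ['A', 'A', 'B', 'B'],
                   'A': [80, 23, np.nan, 22],
                   'B': [80, 55, 76, 67]})

picked = lookup_by_label(df, 'col')
print(picked)
```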
Index objects
The pandas Index class and its subclasses can be viewed as implementing an ordered multiset. Duplicates are allowed.
Index also provides the infrastructure necessary for lookups, data alignment, and reindexing. The easiest way to create an Index directly is to pass a list or other sequence to Index:
In [317]: index = pd.Index(['e', 'd', 'a', 'b'])
In [318]: index
Out[318]: Index(['e', 'd', 'a', 'b'], dtype='object')
In [319]: 'd' in index
Out[319]: True
or using numbers:
In [320]: index = pd.Index([1, 5, 12])
In [321]: index
Out[321]: Index([1, 5, 12], dtype='int64')
In [322]: 5 in index
Out[322]: True
If no dtype is given, Index tries to infer the dtype from the data. It is also possible to give an explicit dtype when instantiating an Index:
In [323]: index = pd.Index(['e', 'd', 'a', 'b'], dtype="string")
In [324]: index
Out[324]: Index(['e', 'd', 'a', 'b'], dtype='string')
In [325]: index = pd.Index([1, 5, 12], dtype="int8")
In [326]: index
Out[326]: Index([1, 5, 12], dtype='int8')
In [327]: index = pd.Index([1, 5, 12], dtype="float32")
In [328]: index
Out[328]: Index([1.0, 5.0, 12.0], dtype='float32')
You can also pass a name to be stored in the index:
In [329]: index = pd.Index(['e', 'd', 'a', 'b'], name='something')
In [330]: index.name
Out[330]: 'something'
The name, if set, will be shown in the console display:
In [331]: index = pd.Index(list(range(5)), name='rows')
In [332]: columns = pd.Index(['A', 'B', 'C'], name='cols')
In [333]: df = pd.DataFrame(np.random.randn(5, 3), index=index, columns=columns)
In [334]: df
Out[334]:
cols A B C
rows
0 1.295989 -1.051694 1.340429
1 -2.366110 0.428241 0.387275
2 0.433306 0.929548 0.278094
3 2.154730 -0.315628 0.264223
4 1.126818 1.132290 -0.353310
In [335]: df['A']
Out[335]:
rows
0 1.295989
1 -2.366110
2 0.433306
3 2.154730
4 1.126818
Name: A, dtype: float64
Setting metadata
Indexes are "mostly immutable", but it is possible to set and change their name attribute. You can use rename or set_names to set this attribute; both return a copy by default.
See Advanced Indexing for usage of MultiIndexes.
In [336]: ind = pd.Index([1, 2, 3])
In [337]: ind.rename("apple")
Out[337]: Index([1, 2, 3], dtype='int64', name='apple')
In [338]: ind
Out[338]: Index([1, 2, 3], dtype='int64')
In [339]: ind = ind.set_names(["apple"])
In [340]: ind.name = "bob"
In [341]: ind
Out[341]: Index([1, 2, 3], dtype='int64', name='bob')
set_names, set_levels, and set_codes also take an optional level argument:
In [342]: index = pd.MultiIndex.from_product([range(3), ['one', 'two']], names=['first', 'second'])
In [343]: index
Out[343]:
MultiIndex([(0, 'one'),
(0, 'two'),
(1, 'one'),
(1, 'two'),
(2, 'one'),
(2, 'two')],
names=['first', 'second'])
In [344]: index.levels[1]
Out[344]: Index(['one', 'two'], dtype='object', name='second')
In [345]: index.set_levels(["a", "b"], level=1)
Out[345]:
MultiIndex([(0, 'a'),
(0, 'b'),
(1, 'a'),
(1, 'b'),
(2, 'a'),
(2, 'b')],
names=['first', 'second'])
Set operations on Index objects
The two main operations are union and intersection. Difference is provided via the .difference() method.
In [346]: a = pd.Index(['c', 'b', 'a'])
In [347]: b = pd.Index(['c', 'e', 'd'])
In [348]: a.difference(b)
Out[348]: Index(['a', 'b'], dtype='object')
Also available is the symmetric_difference operation, which returns elements that appear in either idx1 or idx2, but not in both. This is equivalent to the Index created by idx1.difference(idx2).union(idx2.difference(idx1)), with duplicates dropped.
In [349]: idx1 = pd.Index([1, 2, 3, 4])
In [350]: idx2 = pd.Index([2, 3, 4, 5])
In [351]: idx1.symmetric_difference(idx2)
Out[351]: Index([1, 5], dtype='int64')
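The stated equivalence can be checked directly. This is a small sketch on the same two indexes; the names sym and manual are illustrative.

```python
import pandas as pd

idx1 = pd.Index([1, 2, 3, 4])
idx2 = pd.Index([2, 3, 4, 5])

# Built-in symmetric difference.
sym = idx1.symmetric_difference(idx2)

# The same result spelled out with difference and union.
manual = idx1.difference(idx2).union(idx2.difference(idx1))

print(sym.equals(manual))
```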
The resulting index from a set operation will be sorted in ascending order.
When performing Index.union() between indexes with different dtypes, the indexes must be cast to a common dtype. Typically, though not always, this is object dtype. The exception is when performing a union between integer and float data. In this case, the integer values are converted to float:
In [352]: idx1 = pd.Index([0, 1, 2])
In [353]: idx2 = pd.Index([0.5, 1.5])
In [354]: idx1.union(idx2)
Out[354]: Index([0.0, 0.5, 1.0, 1.5, 2.0], dtype='float64')
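For contrast, the sketch below sets the integer/float case next to a union of integers and strings, which falls back to object dtype. Passing sort=False is a choice made here because integers and strings cannot be ordered against each other; the variable names are illustrative.

```python
import pandas as pd

ints = pd.Index([0, 1, 2])
floats = pd.Index([0.5, 1.5])
labels = pd.Index(['a', 'b'])

# Integer and float: the integers are converted to float.
int_float = ints.union(floats)

# Integer and string: the common dtype is object.
int_str = ints.union(labels, sort=False)

print(int_float.dtype)
print(int_str.dtype)
```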
Missing values
Important
Even though Index can hold missing values (NaN), this should be avoided if you do not want any unexpected results. For example, some operations exclude missing values implicitly.
Index.fillna fills missing values with a specified scalar value.
In [355]: idx1 = pd.Index([1, np.nan, 3, 4])
In [356]: idx1
Out[356]: Index([1.0, nan, 3.0, 4.0], dtype='float64')
In [357]: idx1.fillna(2)
Out[357]: Index([1.0, 2.0, 3.0, 4.0], dtype='float64')
In [358]: idx2 = pd.DatetimeIndex([pd.Timestamp('2011-01-01'),
.....: pd.NaT,
.....: pd.Timestamp('2011-01-03')])
.....:
In [359]: idx2
Out[359]: DatetimeIndex(['2011-01-01', 'NaT', '2011-01-03'], dtype='datetime64[ns]', freq=None)
In [360]: idx2.fillna(pd.Timestamp('2011-01-02'))
Out[360]: DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq=None)
Set / reset index
Occasionally you will load or create a data set into a DataFrame and want to add an index after you've already done so. There are a couple of different ways.
Set an index
DataFrame has a set_index() method which takes a column name (for a regular Index) or a list of column names (for a MultiIndex). To create a new, re-indexed DataFrame:
In [361]: data = pd.DataFrame({'a': ['bar', 'bar', 'foo', 'foo'],
.....: 'b': ['one', 'two', 'one', 'two'],
.....: 'c': ['z', 'y', 'x', 'w'],
.....: 'd': [1., 2., 3, 4]})
.....:
In [362]: data
Out[362]:
a b c d
0 bar one z 1.0
1 bar two y 2.0
2 foo one x 3.0
3 foo two w 4.0
In [363]: indexed1 = data.set_index('c')
In [364]: indexed1
Out[364]:
a b d
c
z bar one 1.0
y bar two 2.0
x foo one 3.0
w foo two 4.0
In [365]: indexed2 = data.set_index(['a', 'b'])
In [366]: indexed2
Out[366]:
c d
a b
bar one z 1.0
two y 2.0
foo one x 3.0
two w 4.0
The append keyword option allows you to keep the existing index and append the given columns to a MultiIndex:
In [367]: frame = data.set_index('c', drop=False)
In [368]: frame = frame.set_index(['a', 'b'], append=True)
In [369]: frame
Out[369]:
c d
c a b
z bar one z 1.0
y bar two y 2.0
x foo one x 3.0
w foo two w 4.0
Other options in set_index allow you to keep the index columns in the frame instead of dropping them.
In [370]: data.set_index('c', drop=False)
Out[370]:
a b c d
c
z bar one z 1.0
y bar two y 2.0
x foo one x 3.0
w foo two w 4.0
Reset the index
As a convenience, DataFrame has a reset_index() method which transfers the index values into the DataFrame's columns and sets a simple integer index. This is the inverse operation of set_index().
In [371]: data
Out[371]:
a b c d
0 bar one z 1.0
1 bar two y 2.0
2 foo one x 3.0
3 foo two w 4.0
In [372]: data.reset_index()
Out[372]:
index a b c d
0 0 bar one z 1.0
1 1 bar two y 2.0
2 2 foo one x 3.0
3 3 foo two w 4.0
The output is more similar to a SQL table or a record array. The names for the columns derived from the index are the ones stored in the names attribute.
You can use the level keyword to remove only a portion of the index:
In [373]: frame
Out[373]:
c d
c a b
z bar one z 1.0
y bar two y 2.0
x foo one x 3.0
w foo two w 4.0
In [374]: frame.reset_index(level=1)
Out[374]:
a c d
c b
z one bar z 1.0
y two bar y 2.0
x one foo x 3.0
w two foo w 4.0
reset_index takes an optional parameter drop which, if true, simply discards the index instead of putting the index values in the DataFrame's columns.
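A short sketch of the round trip and of the drop option, on a cut-down version of the frame used above (the names indexed, restored and dropped are illustrative):

```python
import pandas as pd

data = pd.DataFrame({'a': ['bar', 'bar', 'foo', 'foo'],
                     'c': ['z', 'y', 'x', 'w'],
                     'd': [1.0, 2.0, 3.0, 4.0]})

indexed = data.set_index('c')

# The index comes back as the first column.
restored = indexed.reset_index()

# With drop=True the old index is thrown away entirely.
dropped = indexed.reset_index(drop=True)

print(list(restored.columns))
print(list(dropped.columns))
```

Note that restored holds the same data as the original frame, but with the former index column moved to the front.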
Returning a view versus a copy
Warning
Copy-on-Write will become the new default in pandas 3.0. This means that chained indexing will never work. As a consequence, the SettingWithCopyWarning won't be necessary anymore. See this section for more context. We recommend turning Copy-on-Write on to leverage the improvements with
` pd.options.mode.copy_on_write = True `
even before pandas 3.0 is available.
When setting values in a pandas object, care must be taken to avoid what is called chained indexing. Here is an example.
In [378]: dfmi = pd.DataFrame([list('abcd'),
.....: list('efgh'),
.....: list('ijkl'),
.....: list('mnop')],
.....: columns=pd.MultiIndex.from_product([['one', 'two'],
.....: ['first', 'second']]))
.....:
In [379]: dfmi
Out[379]:
one two
first second first second
0 a b c d
1 e f g h
2 i j k l
3 m n o p
Compare these two access methods:
In [380]: dfmi['one']['second']
Out[380]:
0 b
1 f
2 j
3 n
Name: second, dtype: object
In [381]: dfmi.loc[:, ('one', 'second')]
Out[381]:
0 b
1 f
2 j
3 n
Name: (one, second), dtype: object
These both yield the same results, so which should you use? It is instructive to understand the order of operations on these and why method 2 (.loc) is much preferred over method 1 (chained []).
dfmi['one'] selects the first level of the columns and returns a DataFrame that is singly-indexed. Then another Python operation, dfmi_with_one['second'], selects the series indexed by 'second'. The variable dfmi_with_one stands for that intermediate result, because pandas sees these operations as separate events (separate calls to __getitem__), so it has to treat them as linear operations that happen one after another.
Contrast this with dfmi.loc[:, ('one', 'second')], which passes a nested tuple of (slice(None), ('one', 'second')) to a single call to __getitem__. This allows pandas to deal with this as a single entity. Furthermore, this order of operations can be significantly faster, and allows one to index both axes if so desired.
Why does assignment fail when using chained indexing?
The problem in the previous section is just a performance issue. What's up with the SettingWithCopy warning? We don't usually throw warnings around when you do something that might cost a few extra milliseconds!
But it turns out that assigning to the product of chained indexing has inherently unpredictable results. To see this, think about how the Python interpreter executes this code:
dfmi.loc[:, ('one', 'second')] = value
# becomes
dfmi.loc.__setitem__((slice(None), ('one', 'second')), value)
But this code is handled differently:
dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)
See that __getitem__ in there? Outside of simple cases, it's very hard to predict whether it will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees), and therefore whether the __setitem__ will modify dfmi or a temporary object that gets thrown out immediately afterward. That's what SettingWithCopy is warning you about!
You may be wondering whether we should be concerned about the loc property in the first example. But dfmi.loc is guaranteed to be dfmi itself with modified indexing behavior, so dfmi.loc.__getitem__ / dfmi.loc.__setitem__ operate on dfmi directly. Of course, dfmi.loc.__getitem__(idx) may be a view or a copy of dfmi.
Sometimes a SettingWithCopy warning will arise at times when there's no obvious chained indexing going on. These are the bugs that SettingWithCopy is designed to catch! pandas is probably trying to warn you that you've done this:
def do_something(df):
    foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
    # ... many lines here ...
    # We don't know whether this will modify df or not!
    foo['quux'] = value
    return foo
Yikes!
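One common way to remove the ambiguity is to take an explicit copy of the selection before modifying it. The function name select_and_extend and the sample frame below are hypothetical; this is a sketch of the idea rather than a prescribed fix.

```python
import pandas as pd


def select_and_extend(df, value):
    # An explicit copy makes it clear that foo is independent of df.
    foo = df[['bar', 'baz']].copy()
    # This assignment can only ever affect foo.
    foo['quux'] = value
    return foo


df = pd.DataFrame({'bar': [1, 2], 'baz': [3, 4], 'other': [5, 6]})
result = select_and_extend(df, 0)

print(list(result.columns))
print(list(df.columns))
```

If the intent is instead to modify the original frame, assign through df.loc on df itself.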
Evaluation order matters
When you use chained indexing, the order and type of the indexing operation partially determine whether the result is a slice into the original object, or a copy of the slice.
pandas has the SettingWithCopyWarning because assigning to a copy of a slice is frequently not intentional, but a mistake caused by chained indexing returning a copy where a slice was expected.
If you would like pandas to be more or less trusting about assignment to a chained indexing expression, you can set the option mode.chained_assignment to one of these values:
- 'warn', the default, means a SettingWithCopyWarning is printed.
- 'raise' means pandas will raise a SettingWithCopyError that you have to deal with.
- None will suppress the warnings entirely.
In [382]: dfb = pd.DataFrame({'a': ['one', 'one', 'two',
.....: 'three', 'two', 'one', 'six'],
.....: 'c': np.arange(7)})
.....:
# This will show the SettingWithCopyWarning
# but the frame values will be set
In [383]: dfb['c'][dfb['a'].str.startswith('o')] = 42
This, however, is operating on a copy and will not work.
In [384]: with pd.option_context('mode.chained_assignment','warn'):
.....: dfb[dfb['a'].str.startswith('o')]['c'] = 42
.....:
A chained assignment can also crop up in setting in a mixed dtype frame.
These setting rules apply to all of .loc/.iloc.
The following is the recommended access method using .loc for multiple items (using mask) and a single item using a fixed index:
In [385]: dfc = pd.DataFrame({'a': ['one', 'one', 'two',
.....: 'three', 'two', 'one', 'six'],
.....: 'c': np.arange(7)})
.....:
In [386]: dfd = dfc.copy()
# Setting multiple items using a mask
In [387]: mask = dfd['a'].str.startswith('o')
In [388]: dfd.loc[mask, 'c'] = 42
In [389]: dfd
Out[389]:
a c
0 one 42
1 one 42
2 two 2
3 three 3
4 two 4
5 one 42
6 six 6
# Setting a single item
In [390]: dfd = dfc.copy()
In [391]: dfd.loc[2, 'a'] = 11
In [392]: dfd
Out[392]:
a c
0 one 0
1 one 1
2 11 2
3 three 3
4 two 4
5 one 5
6 six 6
The following can work at times, but it is not guaranteed to, and therefore should be avoided:
In [393]: dfd = dfc.copy()
In [394]: dfd['a'][2] = 111
In [395]: dfd
Out[395]:
a c
0 one 0
1 one 1
2 111 2
3 three 3
4 two 4
5 one 5
6 six 6
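A one-step alternative to the chained form, sketched on a small hypothetical frame: pass the row and column labels in a single indexer, so that no intermediate object is created whose view-or-copy status could matter. The frame and values here are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"a": ["one", "two", "three"], "c": [0, 1, 2]})

# Row label and column label go into one .loc indexer.
df.loc[2, "c"] = 111

# .at is the scalar-only counterpart of .loc.
df.at[0, "c"] = 7

print(df["c"].tolist())
```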
Last, the following example will not work at all, and so should be avoided:
In [396]: with pd.option_context('mode.chained_assignment', 'raise'):
   .....:     dfd.loc[0]['a'] = 1111
   .....:
---------------------------------------------------------------------------
SettingWithCopyError Traceback (most recent call last)
<ipython-input-396-32ce785aaa5b> in ?()
1 with pd.option_context('mode.chained_assignment','raise'):
----> 2 dfd.loc[0]['a'] = 1111
~/work/pandas/pandas/pandas/core/series.py in ?(self, key, value)
1284 )
1285
1286 check_dict_or_set_indexers(key)
1287 key = com.apply_if_callable(key, self)
-> 1288 cacher_needs_updating = self._check_is_chained_assignment_possible()
1289
1290 if key is Ellipsis:
1291 key = slice(None)
~/work/pandas/pandas/pandas/core/series.py in ?(self)
1489 ref = self._get_cacher()
1490 if ref is not None and ref._is_mixed_type:
1491 self._check_setitem_copy(t="referent", force=True)
1492 return True
-> 1493 return super()._check_is_chained_assignment_possible()
~/work/pandas/pandas/pandas/core/generic.py in ?(self)
4395 single-dtype meaning that the cacher should be updated following
4396 setting.
4397 """
4398 if self._is_copy:
-> 4399 self._check_setitem_copy(t="referent")
4400 return False
~/work/pandas/pandas/pandas/core/generic.py in ?(self, t, force)
4469 "indexing.html#returning-a-view-versus-a-copy"
4470 )
4471
4472 if value == "raise":
-> 4473 raise SettingWithCopyError(t)
4474 if value == "warn":
4475 warnings.warn(t, SettingWithCopyWarning, stacklevel=find_stack_level())
SettingWithCopyError:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Warning
The chained assignment warnings and exceptions aim to inform the user of a possibly invalid assignment. There may be false positives: situations where a chained assignment is inadvertently reported.
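One common situation, offered as an illustration rather than something this guide states, is a filtered subset that is deliberately modified on its own. Taking an explicit .copy() of the subset makes the intent unambiguous, and the original frame stays as it was. The frame below is a hypothetical example.

```python
import pandas as pd

df = pd.DataFrame({"a": ["one", "two", "three"], "c": [0, 1, 2]})

# Take an explicit copy of the subset so later writes clearly target it.
subset = df[df["c"] > 0].copy()
subset["c"] = 0

print(df["c"].tolist(), subset["c"].tolist())
```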