Pandas Reference Guide
Duplicate Labels
Index objects are not required to be unique; you can have duplicate row or column labels. This may be a bit confusing at first. If you’re familiar with SQL, you know that row labels are similar to a primary key on a table, and you would never want duplicates in a SQL table. But one of pandas’ roles is to clean messy, real-world data before it goes to some downstream system. And real-world data has duplicates, even in fields that are supposed to be unique.
This section describes how duplicate labels change the behavior of certain operations, how to prevent duplicates from arising during operations, and how to detect them if they do.
In [1]: import pandas as pd
In [2]: import numpy as np
Consequences of Duplicate Labels
Some pandas methods (Series.reindex() for example) just don’t work with duplicates present. The output can’t be determined, and so pandas raises an error.
In [3]: s1 = pd.Series([0, 1, 2], index=["a", "b", "b"])
In [4]: s1.reindex(["a", "b", "c"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[4], line 1
----> 1 s1.reindex(["a", "b", "c"])
File ~/work/pandas/pandas/pandas/core/series.py:5153, in Series.reindex(self, index, axis, method, copy, level, fill_value, limit, tolerance)
5136 @doc(
5137 NDFrame.reindex, # type: ignore[has-type]
5138 klass=_shared_doc_kwargs["klass"],
(...)
5151 tolerance=None,
5152 ) -> Series:
-> 5153 return super().reindex(
5154 index=index,
5155 method=method,
5156 copy=copy,
5157 level=level,
5158 fill_value=fill_value,
5159 limit=limit,
5160 tolerance=tolerance,
5161 )
File ~/work/pandas/pandas/pandas/core/generic.py:5610, in NDFrame.reindex(self, labels, index, columns, axis, method, copy, level, fill_value, limit, tolerance)
5607 return self._reindex_multi(axes, copy, fill_value)
5609 # perform the reindex on the axes
-> 5610 return self._reindex_axes(
5611 axes, level, limit, tolerance, method, fill_value, copy
5612 ).__finalize__(self, method="reindex")
File ~/work/pandas/pandas/pandas/core/generic.py:5633, in NDFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
5630 continue
5632 ax = self._get_axis(a)
-> 5633 new_index, indexer = ax.reindex(
5634 labels, level=level, limit=limit, tolerance=tolerance, method=method
5635 )
5637 axis = self._get_axis_number(a)
5638 obj = obj._reindex_with_indexers(
5639 {axis: [new_index, indexer]},
5640 fill_value=fill_value,
5641 copy=copy,
5642 allow_dups=False,
5643 )
File ~/work/pandas/pandas/pandas/core/indexes/base.py:4429, in Index.reindex(self, target, method, level, limit, tolerance)
4426 raise ValueError("cannot handle a non-unique multi-index!")
4427 elif not self.is_unique:
4428 # GH#42568
-> 4429 raise ValueError("cannot reindex on an axis with duplicate labels")
4430 else:
4431 indexer, _ = self.get_indexer_non_unique(target)
ValueError: cannot reindex on an axis with duplicate labels
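One possible workaround, sketched below under the assumption that keeping only the first occurrence of each label is acceptable, is to drop the repeated labels before reindexing:

>>> s1[~s1.index.duplicated()].reindex(["a", "b", "c"])  # keep first occurrence, then reindex
a    0.0
b    1.0
c    NaN
dtype: float64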
Other methods, like indexing, can give very surprising results. Typically, indexing with a scalar will reduce dimensionality. Slicing a DataFrame with a scalar will return a Series. Slicing a Series with a scalar will return a scalar. But with duplicates, this isn’t the case.
In [5]: df1 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["A", "A", "B"])
In [6]: df1
Out[6]:
A A B
0 0 1 2
1 3 4 5
We have duplicates in the columns. If we slice 'B', we get back a Series:
In [7]: df1["B"] # a series
Out[7]:
0 2
1 5
Name: B, dtype: int64
But slicing 'A' returns a DataFrame:
In [8]: df1["A"] # a DataFrame
Out[8]:
A A
0 0 1
1 3 4
This applies to row labels as well:
In [9]: df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])
In [10]: df2
Out[10]:
A
a 0
a 1
b 2
In [11]: df2.loc["b", "A"] # a scalar
Out[11]: 2
In [12]: df2.loc["a", "A"] # a Series
Out[12]:
a 0
a 1
Name: A, dtype: int64
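If you want the shape of the result to be predictable regardless of duplicates, one option (a sketch, not the only approach) is to select with a list instead of a scalar, which always returns a Series here:

>>> df2.loc[["b"], "A"]  # list selection keeps the Series shape even for a unique label
b    2
Name: A, dtype: int64
>>> df2.loc[["a"], "A"]
a    0
a    1
Name: A, dtype: int64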
Duplicate Label Detection
You can check whether an Index (storing the row or column labels) is unique with Index.is_unique:
In [13]: df2
Out[13]:
A
a 0
a 1
b 2
In [14]: df2.index.is_unique
Out[14]: False
In [15]: df2.columns.is_unique
Out[15]: True
Note: Checking whether an index is unique is somewhat expensive for large datasets. pandas does cache this result, so re-checking on the same index is very fast.
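For example, the first check on a large index does the full scan, and subsequent checks on the same object return the cached answer (a sketch; the index size is arbitrary):

>>> big = pd.Index(np.arange(5_000_000))  # a large, unique integer index
>>> big.is_unique  # first call scans the labels and caches the result
True
>>> big.is_unique  # later calls on the same Index reuse the cached value
True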
Index.duplicated() will return a boolean ndarray indicating whether a label is repeated.
In [16]: df2.index.duplicated()
Out[16]: array([False, True, False])
This can be used as a boolean filter to drop duplicate rows:
In [17]: df2.loc[~df2.index.duplicated(), :]
Out[17]:
A
a 0
b 2
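Index.duplicated() also accepts a keep argument ('first', 'last', or False). Passing keep=False marks every occurrence of a repeated label, so you can drop all rows whose label is not unique (a sketch):

>>> df2.loc[~df2.index.duplicated(keep=False), :]  # keep only labels that occur exactly once
   A
b  2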
If you need additional logic to handle duplicate labels, rather than just dropping the repeats, using groupby() on the index is a common trick. For example, we’ll resolve duplicates by taking the average of all rows with the same label.
In [18]: df2.groupby(level=0).mean()
Out[18]:
A
a 0.5
b 2.0
Disallowing Duplicate Labels
New in version 1.2.0.
As noted above, handling duplicates is an important feature when reading in raw data. That said, you may want to avoid introducing duplicates as part of a data processing pipeline (from methods like pandas.concat(), rename(), etc.). Both Series and DataFrame disallow duplicate labels by calling .set_flags(allows_duplicate_labels=False) (the default is to allow them). If there are duplicate labels, an exception will be raised.
In [19]: pd.Series([0, 1, 2], index=["a", "b", "b"]).set_flags(allows_duplicate_labels=False)
---------------------------------------------------------------------------
DuplicateLabelError Traceback (most recent call last)
Cell In[19], line 1
----> 1 pd.Series([0, 1, 2], index=["a", "b", "b"]).set_flags(allows_duplicate_labels=False)
File ~/work/pandas/pandas/pandas/core/generic.py:508, in NDFrame.set_flags(self, copy, allows_duplicate_labels)
506 df = self.copy(deep=copy and not using_copy_on_write())
507 if allows_duplicate_labels is not None:
--> 508 df.flags["allows_duplicate_labels"] = allows_duplicate_labels
509 return df
File ~/work/pandas/pandas/pandas/core/flags.py:109, in Flags.__setitem__(self, key, value)
107 if key not in self._keys:
108 raise ValueError(f"Unknown flag {key}. Must be one of {self._keys}")
--> 109 setattr(self, key, value)
File ~/work/pandas/pandas/pandas/core/flags.py:96, in Flags.allows_duplicate_labels(self, value)
94 if not value:
95 for ax in obj.axes:
---> 96 ax._maybe_check_unique()
98 self._allows_duplicate_labels = value
File ~/work/pandas/pandas/pandas/core/indexes/base.py:715, in Index._maybe_check_unique(self)
712 duplicates = self._format_duplicate_message()
713 msg += f"\n{duplicates}"
--> 715 raise DuplicateLabelError(msg)
DuplicateLabelError: Index has duplicates.
positions
label
b [1, 2]
This applies to both row and column labels for a DataFrame:
In [20]: pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["A", "B", "C"],).set_flags(
....: allows_duplicate_labels=False
....: )
....:
Out[20]:
A B C
0 0 1 2
1 3 4 5
This attribute can be checked or set with allows_duplicate_labels, which indicates whether that object can have duplicate labels.
In [21]: df = pd.DataFrame({"A": [0, 1, 2, 3]}, index=["x", "y", "X", "Y"]).set_flags(
....: allows_duplicate_labels=False
....: )
....:
In [22]: df
Out[22]:
A
x 0
y 1
X 2
Y 3
In [23]: df.flags.allows_duplicate_labels
Out[23]: False
DataFrame.set_flags() can be used to return a new DataFrame with attributes like allows_duplicate_labels set to some value:
In [24]: df2 = df.set_flags(allows_duplicate_labels=True)
In [25]: df2.flags.allows_duplicate_labels
Out[25]: True
The new DataFrame returned is a view on the same data as the old DataFrame. Alternatively, the property can be set directly on the same object:
In [26]: df2.flags.allows_duplicate_labels = False
In [27]: df2.flags.allows_duplicate_labels
Out[27]: False
When processing raw, messy data you might initially read in the messy data (which potentially has duplicate labels), deduplicate, and then disallow duplicates going forward, to ensure that your data pipeline doesn’t introduce duplicates.
>>> raw = pd.read_csv("...")
>>> deduplicated = raw.groupby(level=0).first() # remove duplicates
>>> deduplicated.flags.allows_duplicate_labels = False # disallow going forward
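Here is a self-contained sketch of the same pattern; the inline CSV text and the label column name are illustrative stand-ins for a real file:

>>> import io
>>> raw = pd.read_csv(io.StringIO("label,value\na,1\na,2\nb,3\n"), index_col="label")
>>> deduplicated = raw.groupby(level=0).first()  # keep the first row for each label
>>> deduplicated.flags.allows_duplicate_labels = False  # disallow going forward
>>> deduplicated.flags.allows_duplicate_labels
False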
Setting allows_duplicate_labels=False on a Series or DataFrame with duplicate labels, or performing an operation that introduces duplicate labels on a Series or DataFrame that disallows duplicates, will raise an errors.DuplicateLabelError.
In [28]: df.rename(str.upper)
---------------------------------------------------------------------------
DuplicateLabelError Traceback (most recent call last)
Cell In[28], line 1
----> 1 df.rename(str.upper)
File ~/work/pandas/pandas/pandas/core/frame.py:5767, in DataFrame.rename(self, mapper, index, columns, axis, copy, inplace, level, errors)
5636 def rename(
5637 self,
5638 mapper: Renamer | None = None,
(...)
5646 errors: IgnoreRaise = "ignore",
5647 ) -> DataFrame | None:
5648 """
5649 Rename columns or index labels.
5650
(...)
5765 4 3 6
5766 """
-> 5767 return super()._rename(
5768 mapper=mapper,
5769 index=index,
5770 columns=columns,
5771 axis=axis,
5772 copy=copy,
5773 inplace=inplace,
5774 level=level,
5775 errors=errors,
5776 )
File ~/work/pandas/pandas/pandas/core/generic.py:1140, in NDFrame._rename(self, mapper, index, columns, axis, copy, inplace, level, errors)
1138 return None
1139 else:
-> 1140 return result.__finalize__(self, method="rename")
File ~/work/pandas/pandas/pandas/core/generic.py:6262, in NDFrame.__finalize__(self, other, method, **kwargs)
6255 if other.attrs:
6256 # We want attrs propagation to have minimal performance
6257 # impact if attrs are not used; i.e. attrs is an empty dict.
6258 # One could make the deepcopy unconditionally, but a deepcopy
6259 # of an empty dict is 50x more expensive than the empty check.
6260 self.attrs = deepcopy(other.attrs)
-> 6262 self.flags.allows_duplicate_labels = other.flags.allows_duplicate_labels
6263 # For subclasses using _metadata.
6264 for name in set(self._metadata) & set(other._metadata):
File ~/work/pandas/pandas/pandas/core/flags.py:96, in Flags.allows_duplicate_labels(self, value)
94 if not value:
95 for ax in obj.axes:
---> 96 ax._maybe_check_unique()
98 self._allows_duplicate_labels = value
File ~/work/pandas/pandas/pandas/core/indexes/base.py:715, in Index._maybe_check_unique(self)
712 duplicates = self._format_duplicate_message()
713 msg += f"\n{duplicates}"
--> 715 raise DuplicateLabelError(msg)
DuplicateLabelError: Index has duplicates.
positions
label
X [0, 2]
Y [1, 3]
This error message contains the labels that are duplicated, and the numeric positions of all the duplicates (including the “original”) in the Series or DataFrame.
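If your pipeline should handle this condition rather than stop, the exception can be caught like any other; the printed message below is just illustrative (a sketch):

>>> from pandas.errors import DuplicateLabelError
>>> try:
...     df.rename(str.upper)
... except DuplicateLabelError:
...     print("rename would introduce duplicate labels")
...
rename would introduce duplicate labels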
Duplicate Label Propagation
In general, disallowing duplicates is “sticky”. It’s preserved through operations.
In [29]: s1 = pd.Series(0, index=["a", "b"]).set_flags(allows_duplicate_labels=False)
In [30]: s1
Out[30]:
a 0
b 0
dtype: int64
In [31]: s1.head().rename({"a": "b"})
---------------------------------------------------------------------------
DuplicateLabelError Traceback (most recent call last)
Cell In[31], line 1
----> 1 s1.head().rename({"a": "b"})
File ~/work/pandas/pandas/pandas/core/series.py:5090, in Series.rename(self, index, axis, copy, inplace, level, errors)
5083 axis = self._get_axis_number(axis)
5085 if callable(index) or is_dict_like(index):
5086 # error: Argument 1 to "_rename" of "NDFrame" has incompatible
5087 # type "Union[Union[Mapping[Any, Hashable], Callable[[Any],
5088 # Hashable]], Hashable, None]"; expected "Union[Mapping[Any,
5089 # Hashable], Callable[[Any], Hashable], None]"
-> 5090 return super()._rename(
5091 index, # type: ignore[arg-type]
5092 copy=copy,
5093 inplace=inplace,
5094 level=level,
5095 errors=errors,
5096 )
5097 else:
5098 return self._set_name(index, inplace=inplace, deep=copy)
File ~/work/pandas/pandas/pandas/core/generic.py:1140, in NDFrame._rename(self, mapper, index, columns, axis, copy, inplace, level, errors)
1138 return None
1139 else:
-> 1140 return result.__finalize__(self, method="rename")
File ~/work/pandas/pandas/pandas/core/generic.py:6262, in NDFrame.__finalize__(self, other, method, **kwargs)
6255 if other.attrs:
6256 # We want attrs propagation to have minimal performance
6257 # impact if attrs are not used; i.e. attrs is an empty dict.
6258 # One could make the deepcopy unconditionally, but a deepcopy
6259 # of an empty dict is 50x more expensive than the empty check.
6260 self.attrs = deepcopy(other.attrs)
-> 6262 self.flags.allows_duplicate_labels = other.flags.allows_duplicate_labels
6263 # For subclasses using _metadata.
6264 for name in set(self._metadata) & set(other._metadata):
File ~/work/pandas/pandas/pandas/core/flags.py:96, in Flags.allows_duplicate_labels(self, value)
94 if not value:
95 for ax in obj.axes:
---> 96 ax._maybe_check_unique()
98 self._allows_duplicate_labels = value
File ~/work/pandas/pandas/pandas/core/indexes/base.py:715, in Index._maybe_check_unique(self)
712 duplicates = self._format_duplicate_message()
713 msg += f"\n{duplicates}"
--> 715 raise DuplicateLabelError(msg)
DuplicateLabelError: Index has duplicates.
positions
label
b [0, 1]
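If you genuinely need to perform an operation that introduces duplicates, one option (a sketch) is to re-enable them on the input first; the flag on the result is then True as well:

>>> s1.set_flags(allows_duplicate_labels=True).rename({"a": "b"})
b    0
b    0
dtype: int64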
Warning
This is an experimental feature. Currently, many methods fail to propagate the allows_duplicate_labels value. In future versions it is expected that every method taking or returning one or more DataFrame or Series objects will propagate allows_duplicate_labels.
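If you rely on propagation for a particular method, you can verify it by inspecting the flag on that method’s result; for instance, head() in the example above does carry the flag through:

>>> s1.head().flags.allows_duplicate_labels
False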