Pandas 中文参考指南

Duplicate Labels

Index objects are not required to be unique; you can have duplicate row or column labels. This may be a bit confusing at first. If you’re familiar with SQL, you know that row labels are similar to a primary key on a table, and you would never want duplicates in a SQL table. But one of pandas’ roles is to clean messy, real-world data before it goes to some downstream system. And real-world data has duplicates, even in fields that are supposed to be unique.

This section describes how duplicate labels change the behavior of certain operations, how to prevent duplicates from arising during operations, and how to detect them if they do.

In [1]: import pandas as pd

In [2]: import numpy as np

Consequences of Duplicate Labels

Some pandas methods (Series.reindex(), for example) do not work when duplicates are present. The output cannot be determined, so pandas raises an error.

In [3]: s1 = pd.Series([0, 1, 2], index=["a", "b", "b"])

In [4]: s1.reindex(["a", "b", "c"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[4], line 1
----> 1 s1.reindex(["a", "b", "c"])

File ~/work/pandas/pandas/pandas/core/series.py:5153, in Series.reindex(self, index, axis, method, copy, level, fill_value, limit, tolerance)
   5136 @doc(
   5137     NDFrame.reindex,  # type: ignore[has-type]
   5138     klass=_shared_doc_kwargs["klass"],
   (...)
   5151     tolerance=None,
   5152 ) -> Series:
-> 5153     return super().reindex(
   5154         index=index,
   5155         method=method,
   5156         copy=copy,
   5157         level=level,
   5158         fill_value=fill_value,
   5159         limit=limit,
   5160         tolerance=tolerance,
   5161     )

File ~/work/pandas/pandas/pandas/core/generic.py:5610, in NDFrame.reindex(self, labels, index, columns, axis, method, copy, level, fill_value, limit, tolerance)
   5607     return self._reindex_multi(axes, copy, fill_value)
   5609 # perform the reindex on the axes
-> 5610 return self._reindex_axes(
   5611     axes, level, limit, tolerance, method, fill_value, copy
   5612 ).__finalize__(self, method="reindex")

File ~/work/pandas/pandas/pandas/core/generic.py:5633, in NDFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
   5630     continue
   5632 ax = self._get_axis(a)
-> 5633 new_index, indexer = ax.reindex(
   5634     labels, level=level, limit=limit, tolerance=tolerance, method=method
   5635 )
   5637 axis = self._get_axis_number(a)
   5638 obj = obj._reindex_with_indexers(
   5639     {axis: [new_index, indexer]},
   5640     fill_value=fill_value,
   5641     copy=copy,
   5642     allow_dups=False,
   5643 )

File ~/work/pandas/pandas/pandas/core/indexes/base.py:4429, in Index.reindex(self, target, method, level, limit, tolerance)
   4426     raise ValueError("cannot handle a non-unique multi-index!")
   4427 elif not self.is_unique:
   4428     # GH#42568
-> 4429     raise ValueError("cannot reindex on an axis with duplicate labels")
   4430 else:
   4431     indexer, _ = self.get_indexer_non_unique(target)

ValueError: cannot reindex on an axis with duplicate labels
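One way around the error above is to make the index unique before reindexing. The following is a minimal sketch, assuming that keeping the first occurrence of each label is acceptable for your data; the names s1_unique and reindexed are illustrative only.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2], index=["a", "b", "b"])

# Keep only the first occurrence of each label so the index is unique.
s1_unique = s1[~s1.index.duplicated(keep="first")]

# With a unique index the result is well defined; the new label "c" is
# filled with NaN.
reindexed = s1_unique.reindex(["a", "b", "c"])
print(reindexed)
```

Which occurrence to keep (or whether to aggregate instead) is a choice about the data, not something pandas can decide for you.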

Other methods, like indexing, can give very surprising results. Typically, indexing with a scalar reduces dimensionality: slicing a DataFrame with a scalar returns a Series, and slicing a Series with a scalar returns a scalar. With duplicates, this is no longer the case.

In [5]: df1 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["A", "A", "B"])

In [6]: df1
Out[6]:
   A  A  B
0  0  1  2
1  3  4  5

We have duplicates in the columns. If we slice 'B', we get back a Series:

In [7]: df1["B"]  # a series
Out[7]:
0    2
1    5
Name: B, dtype: int64

But slicing 'A' returns a DataFrame:

In [8]: df1["A"]  # a DataFrame
Out[8]:
   A  A
0  0  1
1  3  4

This applies to row labels as well:

In [9]: df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

In [10]: df2
Out[10]:
   A
a  0
a  1
b  2

In [11]: df2.loc["b", "A"]  # a scalar
Out[11]: 2

In [12]: df2.loc["a", "A"]  # a Series
Out[12]:
a    0
a    1
Name: A, dtype: int64
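If downstream code needs a predictable return type, one option is to select with a list of labels instead of a scalar, which keeps the dimension whether or not the label is repeated. This is a small sketch of that idea; rows_a and rows_b are illustrative names.

```python
import pandas as pd

df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

# A list selector keeps the row dimension, whether or not the label repeats.
rows_b = df2.loc[["b"], "A"]  # Series with one element
rows_a = df2.loc[["a"], "A"]  # Series with two elements

print(type(rows_b).__name__, len(rows_b))
print(type(rows_a).__name__, len(rows_a))
```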

Duplicate Label Detection

You can check whether an Index (storing the row or column labels) is unique with Index.is_unique:

In [13]: df2
Out[13]:
   A
a  0
a  1
b  2

In [14]: df2.index.is_unique
Out[14]: False

In [15]: df2.columns.is_unique
Out[15]: True

Checking whether an index is unique is somewhat expensive for large datasets. pandas caches this result, so re-checking the same index is very fast.

Index.duplicated() returns a boolean ndarray indicating whether each label is a repeat. By default, the first occurrence of a label is not marked as a duplicate.

In [16]: df2.index.duplicated()
Out[16]: array([False,  True, False])

This array can be used as a boolean filter to drop duplicate rows:

In [17]: df2.loc[~df2.index.duplicated(), :]
Out[17]:
   A
a  0
b  2
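The filter above keeps the first occurrence of each label. As a sketch of the alternatives, the keep parameter of Index.duplicated() can instead retain the last occurrence or flag every repeated label; keep_last and drop_all are illustrative names.

```python
import pandas as pd

df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

# keep="last" marks all but the last occurrence of each label as a duplicate.
keep_last = df2.loc[~df2.index.duplicated(keep="last"), :]

# keep=False marks every occurrence of a repeated label, so only labels
# that appear exactly once survive the filter.
drop_all = df2.loc[~df2.index.duplicated(keep=False), :]

print(keep_last)
print(drop_all)
```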

If you need additional logic to handle duplicate labels, rather than just dropping the repeats, using groupby() on the index is a common trick. For example, we’ll resolve duplicates by taking the average of all rows with the same label.

In [18]: df2.groupby(level=0).mean()
Out[18]:
     A
a  0.5
b  2.0
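The mean is only one possible resolution rule. As a hedged illustration with made-up values, several candidate rules can be compared side by side by aggregating the grouped column with a list of function names.

```python
import pandas as pd

# Illustrative data: label "a" appears twice with different values.
df = pd.DataFrame({"A": [10, 30, 2]}, index=["a", "a", "b"])

# One output column per rule; each row is a unique label.
summary = df.groupby(level=0)["A"].agg(["first", "last", "sum"])
print(summary)
```

The result has a unique index, so whichever column matches the intended rule can be used going forward.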

Disallowing Duplicate Labels

New in version 1.2.0.

As noted above, handling duplicates is an important feature when reading in raw data. That said, you may want to avoid introducing duplicates as part of a data processing pipeline (from methods like pandas.concat(), rename(), etc.). Both Series and DataFrame can disallow duplicate labels by calling .set_flags(allows_duplicate_labels=False) (the default is to allow them). If there are duplicate labels, an exception will be raised.

In [19]: pd.Series([0, 1, 2], index=["a", "b", "b"]).set_flags(allows_duplicate_labels=False)
---------------------------------------------------------------------------
DuplicateLabelError                       Traceback (most recent call last)
Cell In[19], line 1
----> 1 pd.Series([0, 1, 2], index=["a", "b", "b"]).set_flags(allows_duplicate_labels=False)

File ~/work/pandas/pandas/pandas/core/generic.py:508, in NDFrame.set_flags(self, copy, allows_duplicate_labels)
    506 df = self.copy(deep=copy and not using_copy_on_write())
    507 if allows_duplicate_labels is not None:
--> 508     df.flags["allows_duplicate_labels"] = allows_duplicate_labels
    509 return df

File ~/work/pandas/pandas/pandas/core/flags.py:109, in Flags.__setitem__(self, key, value)
    107 if key not in self._keys:
    108     raise ValueError(f"Unknown flag {key}. Must be one of {self._keys}")
--> 109 setattr(self, key, value)

File ~/work/pandas/pandas/pandas/core/flags.py:96, in Flags.allows_duplicate_labels(self, value)
     94 if not value:
     95     for ax in obj.axes:
---> 96         ax._maybe_check_unique()
     98 self._allows_duplicate_labels = value

File ~/work/pandas/pandas/pandas/core/indexes/base.py:715, in Index._maybe_check_unique(self)
    712 duplicates = self._format_duplicate_message()
    713 msg += f"\n{duplicates}"
--> 715 raise DuplicateLabelError(msg)

DuplicateLabelError: Index has duplicates.
      positions
label
b        [1, 2]

This applies to both row and column labels for a DataFrame:

In [20]: pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["A", "B", "C"],).set_flags(
   ....:     allows_duplicate_labels=False
   ....: )
   ....:
Out[20]:
   A  B  C
0  0  1  2
1  3  4  5

This attribute can be checked or set through flags.allows_duplicate_labels, which indicates whether that object can have duplicate labels.

In [21]: df = pd.DataFrame({"A": [0, 1, 2, 3]}, index=["x", "y", "X", "Y"]).set_flags(
   ....:     allows_duplicate_labels=False
   ....: )
   ....:

In [22]: df
Out[22]:
   A
x  0
y  1
X  2
Y  3

In [23]: df.flags.allows_duplicate_labels
Out[23]: False

DataFrame.set_flags() can be used to return a new DataFrame with attributes like allows_duplicate_labels set to some value:

In [24]: df2 = df.set_flags(allows_duplicate_labels=True)

In [25]: df2.flags.allows_duplicate_labels
Out[25]: True

The new DataFrame returned is a view on the same data as the old DataFrame. Alternatively, the property can be set directly on the same object:

In [26]: df2.flags.allows_duplicate_labels = False

In [27]: df2.flags.allows_duplicate_labels
Out[27]: False
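To make the difference between the two approaches concrete, the sketch below checks that set_flags() leaves the flag on the original object untouched, while assigning to flags.allows_duplicate_labels changes the object in place. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1]}, index=["x", "y"]).set_flags(
    allows_duplicate_labels=False
)

# set_flags returns a new object; the flag on df itself is not changed.
relaxed = df.set_flags(allows_duplicate_labels=True)
original_flag = df.flags.allows_duplicate_labels
relaxed_flag = relaxed.flags.allows_duplicate_labels

# Assigning to the attribute modifies the flag on the existing object.
relaxed.flags.allows_duplicate_labels = False
updated_flag = relaxed.flags.allows_duplicate_labels

print(original_flag, relaxed_flag, updated_flag)
```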

When processing raw, messy data, you might first read in the messy data (which potentially has duplicate labels), deduplicate it, and then disallow duplicates going forward to ensure that your data pipeline doesn’t introduce duplicates.

>>> raw = pd.read_csv("...")
>>> deduplicated = raw.groupby(level=0).first()  # remove duplicates
>>> deduplicated.flags.allows_duplicate_labels = False  # disallow going forward
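The same pattern can be tried end to end without a file on disk. This is a minimal sketch that assumes a small in-memory CSV with a hypothetical key column used as the index; the data and column names are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical raw data in which the label "a" appears twice.
csv_text = "key,value\na,1\na,2\nb,3\n"
raw = pd.read_csv(io.StringIO(csv_text), index_col="key")

# Resolve duplicates by keeping the first row for each label.
deduplicated = raw.groupby(level=0).first()

# Disallow duplicate labels from here on.
deduplicated.flags.allows_duplicate_labels = False

# A later step that would reintroduce a duplicate label now fails loudly.
try:
    deduplicated.rename(index={"a": "b"})
    rename_failed = False
except pd.errors.DuplicateLabelError:
    rename_failed = True

print(deduplicated)
print("rename raised:", rename_failed)
```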

Setting allows_duplicate_labels=False on a Series or DataFrame with duplicate labels, or performing an operation that introduces duplicate labels on a Series or DataFrame that disallows duplicates, will raise an errors.DuplicateLabelError.

In [28]: df.rename(str.upper)
---------------------------------------------------------------------------
DuplicateLabelError                       Traceback (most recent call last)
Cell In[28], line 1
----> 1 df.rename(str.upper)

File ~/work/pandas/pandas/pandas/core/frame.py:5767, in DataFrame.rename(self, mapper, index, columns, axis, copy, inplace, level, errors)
   5636 def rename(
   5637     self,
   5638     mapper: Renamer | None = None,
   (...)
   5646     errors: IgnoreRaise = "ignore",
   5647 ) -> DataFrame | None:
   5648     """
   5649     Rename columns or index labels.
   5650
   (...)
   5765     4  3  6
   5766     """
-> 5767     return super()._rename(
   5768         mapper=mapper,
   5769         index=index,
   5770         columns=columns,
   5771         axis=axis,
   5772         copy=copy,
   5773         inplace=inplace,
   5774         level=level,
   5775         errors=errors,
   5776     )

File ~/work/pandas/pandas/pandas/core/generic.py:1140, in NDFrame._rename(self, mapper, index, columns, axis, copy, inplace, level, errors)
   1138     return None
   1139 else:
-> 1140     return result.__finalize__(self, method="rename")

File ~/work/pandas/pandas/pandas/core/generic.py:6262, in NDFrame.__finalize__(self, other, method, **kwargs)
   6255 if other.attrs:
   6256     # We want attrs propagation to have minimal performance
   6257     # impact if attrs are not used; i.e. attrs is an empty dict.
   6258     # One could make the deepcopy unconditionally, but a deepcopy
   6259     # of an empty dict is 50x more expensive than the empty check.
   6260     self.attrs = deepcopy(other.attrs)
-> 6262 self.flags.allows_duplicate_labels = other.flags.allows_duplicate_labels
   6263 # For subclasses using _metadata.
   6264 for name in set(self._metadata) & set(other._metadata):

File ~/work/pandas/pandas/pandas/core/flags.py:96, in Flags.allows_duplicate_labels(self, value)
     94 if not value:
     95     for ax in obj.axes:
---> 96         ax._maybe_check_unique()
     98 self._allows_duplicate_labels = value

File ~/work/pandas/pandas/pandas/core/indexes/base.py:715, in Index._maybe_check_unique(self)
    712 duplicates = self._format_duplicate_message()
    713 msg += f"\n{duplicates}"
--> 715 raise DuplicateLabelError(msg)

DuplicateLabelError: Index has duplicates.
      positions
label
X        [0, 2]
Y        [1, 3]

This error message contains the labels that are duplicated and the numeric positions of all the duplicates (including the “original”) in the Series or DataFrame.
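In a pipeline you may prefer to handle this exception rather than let it propagate. The sketch below catches pd.errors.DuplicateLabelError and falls back to dropping repeated labels; that fallback policy is an assumption for illustration, not a recommendation.

```python
import pandas as pd

s = pd.Series([0, 1, 2], index=["a", "b", "b"])

try:
    checked = s.set_flags(allows_duplicate_labels=False)
except pd.errors.DuplicateLabelError as exc:
    # The message lists the repeated labels and their positions.
    message = str(exc)
    # Assumed fallback: keep the first occurrence of each label, then
    # disallow duplicates on the cleaned result.
    checked = s[~s.index.duplicated()].set_flags(allows_duplicate_labels=False)

print(message)
print(checked)
```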

Duplicate Label Propagation

In general, disallowing duplicates is “sticky”: the setting is preserved through operations.

In [29]: s1 = pd.Series(0, index=["a", "b"]).set_flags(allows_duplicate_labels=False)

In [30]: s1
Out[30]:
a    0
b    0
dtype: int64

In [31]: s1.head().rename({"a": "b"})
---------------------------------------------------------------------------
DuplicateLabelError                       Traceback (most recent call last)
Cell In[31], line 1
----> 1 s1.head().rename({"a": "b"})

File ~/work/pandas/pandas/pandas/core/series.py:5090, in Series.rename(self, index, axis, copy, inplace, level, errors)
   5083     axis = self._get_axis_number(axis)
   5085 if callable(index) or is_dict_like(index):
   5086     # error: Argument 1 to "_rename" of "NDFrame" has incompatible
   5087     # type "Union[Union[Mapping[Any, Hashable], Callable[[Any],
   5088     # Hashable]], Hashable, None]"; expected "Union[Mapping[Any,
   5089     # Hashable], Callable[[Any], Hashable], None]"
-> 5090     return super()._rename(
   5091         index,  # type: ignore[arg-type]
   5092         copy=copy,
   5093         inplace=inplace,
   5094         level=level,
   5095         errors=errors,
   5096     )
   5097 else:
   5098     return self._set_name(index, inplace=inplace, deep=copy)

File ~/work/pandas/pandas/pandas/core/generic.py:1140, in NDFrame._rename(self, mapper, index, columns, axis, copy, inplace, level, errors)
   1138     return None
   1139 else:
-> 1140     return result.__finalize__(self, method="rename")

File ~/work/pandas/pandas/pandas/core/generic.py:6262, in NDFrame.__finalize__(self, other, method, **kwargs)
   6255 if other.attrs:
   6256     # We want attrs propagation to have minimal performance
   6257     # impact if attrs are not used; i.e. attrs is an empty dict.
   6258     # One could make the deepcopy unconditionally, but a deepcopy
   6259     # of an empty dict is 50x more expensive than the empty check.
   6260     self.attrs = deepcopy(other.attrs)
-> 6262 self.flags.allows_duplicate_labels = other.flags.allows_duplicate_labels
   6263 # For subclasses using _metadata.
   6264 for name in set(self._metadata) & set(other._metadata):

File ~/work/pandas/pandas/pandas/core/flags.py:96, in Flags.allows_duplicate_labels(self, value)
     94 if not value:
     95     for ax in obj.axes:
---> 96         ax._maybe_check_unique()
     98 self._allows_duplicate_labels = value

File ~/work/pandas/pandas/pandas/core/indexes/base.py:715, in Index._maybe_check_unique(self)
    712 duplicates = self._format_duplicate_message()
    713 msg += f"\n{duplicates}"
--> 715 raise DuplicateLabelError(msg)

DuplicateLabelError: Index has duplicates.
      positions
label
b        [0, 1]

Warning

This is an experimental feature. Currently, many methods fail to propagate the allows_duplicate_labels value. In future versions, it is expected that every method taking or returning one or more DataFrame or Series objects will propagate allows_duplicate_labels.
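Because propagation is not guaranteed for every method, an explicit uniqueness check at key points in a pipeline can serve as a safeguard that does not rely on the flag. The helper below, assert_unique_labels, is hypothetical (it is not part of pandas) and is only a sketch of such a check.

```python
import pandas as pd


def assert_unique_labels(obj):
    """Hypothetical helper: raise if any axis of obj has repeated labels."""
    for axis in obj.axes:
        if not axis.is_unique:
            repeated = axis[axis.duplicated()].unique().tolist()
            raise ValueError(f"duplicate labels found: {repeated}")
    return obj


left = pd.Series([0, 1], index=["a", "b"])
right = pd.Series([2, 3], index=["b", "c"])

# Concatenating the two introduces a repeated label "b".
combined = pd.concat([left, right])

try:
    assert_unique_labels(combined)
    check_failed = False
except ValueError as exc:
    check_failed = True
    failure_message = str(exc)

print(check_failed, failure_message)
```

Since the helper returns its argument unchanged when the check passes, it can be dropped into a chain of calls, for example through the pipe() method.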