Pandas Chinese Reference Guide

MultiIndex / advanced indexing

See Indexing and Selecting Data for general indexing documentation.



Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a View versus Copy.

See the cookbook for some advanced strategies.

Hierarchical indexing (MultiIndex)

Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).

In this section, we will show what exactly we mean by “hierarchical” indexing and how it integrates with all of the pandas indexing functionality described above and in prior sections. Later, when discussing group by and pivoting and reshaping data, we’ll show non-trivial applications to illustrate how it aids in structuring data for analysis.

See the cookbook for some advanced strategies.

Creating a MultiIndex (hierarchical index) object

The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of MultiIndex as an array of tuples where each tuple is unique. A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays()), an array of tuples (using MultiIndex.from_tuples()), a crossed set of iterables (using MultiIndex.from_product()), or a DataFrame (using MultiIndex.from_frame()). The Index constructor will attempt to return a MultiIndex when it is passed a list of tuples. The following examples demonstrate different ways to initialize MultiIndexes.

In [1]: arrays = [
   ...:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ...:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ...: ]

In [2]: tuples = list(zip(*arrays))

In [3]: tuples
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [4]: index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

In [5]: index
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

In [6]: s = pd.Series(np.random.randn(8), index=index)

In [7]: s
first  second
bar    one       0.469112
       two      -0.282863
baz    one      -1.509059
       two      -1.135632
foo    one       1.212112
       two      -0.173215
qux    one       0.119209
       two      -1.044236
dtype: float64

When you want every pairing of the elements in two iterables, it can be easier to use the MultiIndex.from_product() method:

In [8]: iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]

In [9]: pd.MultiIndex.from_product(iterables, names=["first", "second"])
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])
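As a quick cross-check of the constructors described above, the following sketch builds the same small index three ways and compares the results. The labels are illustrative and shortened from the example above; the variable names are our own.

```python
import pandas as pd

# Illustrative labels; any equal-length sequences would do.
arrays = [
    ["bar", "bar", "baz", "baz"],
    ["one", "two", "one", "two"],
]
names = ["first", "second"]

from_arrays = pd.MultiIndex.from_arrays(arrays, names=names)
from_tuples = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=names)
from_product = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=names
)

# All three constructors describe the same four (first, second) pairs.
all_equal = from_arrays.equals(from_tuples) and from_arrays.equals(from_product)
print(all_equal)
```

from_product() is only a drop-in replacement when every combination of labels is wanted; for an irregular set of pairs, from_arrays() or from_tuples() is the better fit.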

You can also construct a MultiIndex from a DataFrame directly, using the method MultiIndex.from_frame(). This is a complementary method to MultiIndex.to_frame().

In [10]: df = pd.DataFrame(
   ....:     [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]],
   ....:     columns=["first", "second"],
   ....: )

In [11]: pd.MultiIndex.from_frame(df)
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('foo', 'one'),
            ('foo', 'two')],
           names=['first', 'second'])
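To illustrate how the two methods complement each other, here is a minimal round-trip sketch. It assumes index=False is passed to to_frame() so that the levels come back as ordinary columns; the variable names are hypothetical.

```python
import pandas as pd

mi = pd.MultiIndex.from_product(
    [["bar", "foo"], ["one", "two"]], names=["first", "second"]
)

# Each level becomes a column; index=False avoids reusing mi as the row index.
frame = mi.to_frame(index=False)

# The column names become the level names again.
rebuilt = pd.MultiIndex.from_frame(frame)
print(rebuilt.equals(mi))
```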

As a convenience, you can pass a list of arrays directly into Series or DataFrame to construct a MultiIndex automatically:

In [12]: arrays = [
   ....:     np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
   ....:     np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
   ....: ]

In [13]: s = pd.Series(np.random.randn(8), index=arrays)

In [14]: s
bar  one   -0.861849
     two   -2.104569
baz  one   -0.494929
     two    1.071804
foo  one    0.721555
     two   -0.706771
qux  one   -1.039575
     two    0.271860
dtype: float64

In [15]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)

In [16]: df
                0         1         2         3
bar one -0.424972  0.567020  0.276232 -1.087401
    two -0.673690  0.113648 -1.478427  0.524988
baz one  0.404705  0.577046 -1.715002 -1.039268
    two -0.370647 -1.157892 -1.344312  0.844885
foo one  1.075770 -0.109050  1.643563 -1.469388
    two  0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524  0.413738  0.276662 -0.472035
    two -0.013960 -0.362543 -0.006154 -0.923061

All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. If no names are provided, None will be assigned:

In [17]: df.index.names
Out[17]: FrozenList([None, None])
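For comparison, a small sketch of the same constructor with and without names. The level names "first" and "second" are simply reused from the earlier examples.

```python
import pandas as pd

arrays = [["bar", "bar", "baz"], ["one", "two", "one"]]

unnamed = pd.MultiIndex.from_arrays(arrays)
named = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

print(list(unnamed.names))
print(list(named.names))

# Named levels can be addressed by name as well as by position.
second = named.get_level_values("second")
```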

This index can back any axis of a pandas object, and the number of levels of the index is up to you:

In [18]: df = pd.DataFrame(np.random.randn(3, 8), index=["A", "B", "C"], columns=index)

In [19]: df
first        bar                 baz  ...       foo       qux
second       one       two       one  ...       two       one       two
A       0.895717  0.805244 -1.206412  ...  1.340309 -1.170299 -0.226169
B       0.410835  0.813850  0.132003  ... -1.187678  1.130127 -1.436737
C      -1.413681  1.607920  1.024180  ... -2.211372  0.974466 -2.006747

[3 rows x 8 columns]

In [20]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
first              bar                 baz                 foo
second             one       two       one       two       one       two
first second
bar   one    -0.410001 -0.078638  0.545952 -1.219217 -1.226825  0.769804
      two    -1.281247 -0.727707 -0.121306 -0.097883  0.695775  0.341734
baz   one     0.959726 -1.110336 -0.619976  0.149748 -0.732339  0.687738
      two     0.176444  0.403310 -0.154951  0.301624 -2.179861 -1.369849
foo   one    -0.954208  1.462696 -1.743161 -0.826591 -0.345352  1.314232
      two     0.690579  0.995761  2.396780  0.014871  3.357427 -0.317441

We’ve “sparsified” the higher levels of the indexes to make the console output a bit easier on the eyes. Note that how the index is displayed can be controlled with the display.multi_sparse option, either globally through pandas.set_option() or temporarily through pandas.option_context():

In [21]: with pd.option_context("display.multi_sparse", False):
   ....:     df
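Note that a bare expression inside a with block is not echoed in a regular session, so the effect is easier to see by rendering the object explicitly. The sketch below uses a small hypothetical Series and to_string(), on the assumption that the string rendering follows the same display option.

```python
import pandas as pd

index = pd.MultiIndex.from_product([["bar", "baz"], ["one", "two"]])
s = pd.Series(range(4), index=index)

# Default rendering: repeated outer labels are blanked out.
sparse_text = s.to_string()

# With the option switched off, every row shows its full key.
with pd.option_context("display.multi_sparse", False):
    dense_text = s.to_string()

print(sparse_text)
print(dense_text)
```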


It’s worth keeping in mind that there’s nothing preventing you from using tuples as atomic labels on an axis:

In [22]: pd.Series(np.random.randn(8), index=tuples)
(bar, one)   -1.236269
(bar, two)    0.896171
(baz, one)   -0.487602
(baz, two)   -0.082240
(foo, one)   -2.182937
(foo, two)    0.380396
(qux, one)    0.084844
(qux, two)    0.432390
dtype: float64
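The difference between tuples as atomic labels and a true MultiIndex can be checked directly. This sketch relies on the tupleize_cols argument of the Index constructor; the tuples are shortened from the example above.

```python
import pandas as pd

tuples = [("bar", "one"), ("bar", "two"), ("baz", "one")]

# tupleize_cols=False keeps each tuple as a single, opaque label.
flat = pd.Index(tuples, tupleize_cols=False)

# By default the Index constructor tries to build a MultiIndex instead.
hier = pd.Index(tuples)

print(type(flat).__name__, flat.nlevels)
print(type(hier).__name__, hier.nlevels)
```

Only the hierarchical version supports partial selection by level; with atomic tuple labels, each key has to be given in full.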

The reason that the MultiIndex matters is that it can allow you to do grouping, selection, and reshaping operations as we will describe below and in subsequent areas of the documentation. As you will see in later sections, you can find yourself working with hierarchically-indexed data without creating a MultiIndex explicitly yourself. However, when loading data from a file, you may wish to generate your own MultiIndex when preparing the data set.

Reconstructing the level labels

The method get_level_values() will return a vector of the labels for each location at a particular level:

In [23]: index.get_level_values(0)
Out[23]: Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [24]: index.get_level_values("second")
Out[24]: Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')
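It can help to contrast get_level_values() with the levels attribute: the former returns one label per row, the latter only the distinct labels. A minimal sketch on a hypothetical four-row index:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)

# One label per position, same length as the index.
per_row = index.get_level_values("first")

# Only the distinct labels that define the level.
unique_labels = index.levels[0]

print(list(per_row))
print(list(unique_labels))
```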

Basic indexing on axis with MultiIndex

One of the important features of hierarchical indexing is that you can select data by a “partial” label identifying a subgroup in the data. Partial selection “drops” levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame:

In [25]: df["bar"]
second       one       two
A       0.895717  0.805244
B       0.410835  0.813850
C      -1.413681  1.607920

In [26]: df["bar", "one"]
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64

In [27]: df["bar"]["one"]
A    0.895717
B    0.410835
C   -1.413681
Name: one, dtype: float64

In [28]: s["qux"]
one   -1.039575
two    0.271860
dtype: float64
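The same selections can be reproduced on deterministic data, which makes the dropped level easier to verify. The frame below is a hypothetical stand-in for df, filled with consecutive integers.

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("ABC"), columns=cols)

# Partial key: the "first" level is dropped from the result's columns.
sub = df["bar"]

# Full key: a single column comes back as a Series named by the tuple.
col = df["bar", "one"]

# Chained selection reaches the same data but names the Series "one".
chained = df["bar"]["one"]
```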

See Cross-section with hierarchical index for how to select on a deeper level.

Defined levels

The MultiIndex keeps all the defined levels of an index, even if they are not actually used. When slicing an index, you may notice this. For example:

In [29]: df.columns.levels  # original MultiIndex
Out[29]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

In [30]: df[["foo","qux"]].columns.levels  # sliced
Out[30]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see only the used levels, you can use the get_level_values() method.

In [31]: df[["foo", "qux"]].columns.to_numpy()
array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
      dtype=object)

# for a specific level
In [32]: df[["foo", "qux"]].columns.get_level_values(0)
Out[32]: Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

To reconstruct the MultiIndex with only the used levels, the remove_unused_levels() method may be used.

In [33]: new_mi = df[["foo", "qux"]].columns.remove_unused_levels()

In [34]: new_mi.levels
Out[34]: FrozenList([['foo', 'qux'], ['one', 'two']])
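The behaviour is easy to reproduce on a standalone index. The sketch below slices a hypothetical six-row MultiIndex and compares the levels before and after remove_unused_levels().

```python
import pandas as pd

mi = pd.MultiIndex.from_product([["bar", "baz", "foo"], ["one", "two"]])

# Drop the two "bar" rows; the level definition is left untouched.
sliced = mi[2:]
print(list(sliced.levels[0]))

# Rebuild the levels from the labels that are actually present.
trimmed = sliced.remove_unused_levels()
print(list(trimmed.levels[0]))
```

The two indexes still compare equal, since only the internal level definitions differ and not the keys themselves.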

Data alignment and using reindex

Operations between differently-indexed objects having MultiIndex on the axes will work as you expect; data alignment will work the same as an Index of tuples:

In [35]: s + s[:-2]
bar  one   -1.723698
     two   -4.209138
baz  one   -0.989859
     two    2.143608
foo  one    1.443110
     two   -1.413542
qux  one         NaN
     two         NaN
dtype: float64

In [36]: s + s[::2]
bar  one   -1.723698
     two         NaN
baz  one   -0.989859
     two         NaN
foo  one    1.443110
     two         NaN
qux  one   -2.079150
     two         NaN
dtype: float64

The reindex() method of Series/DataFrames can be called with another MultiIndex, or even a list or array of tuples:

In [37]: s.reindex(index[:3])
first  second
bar    one      -0.861849
       two      -2.104569
baz    one      -0.494929
dtype: float64

In [38]: s.reindex([("foo", "two"), ("bar", "one"), ("qux", "one"), ("baz", "one")])
foo  two   -0.706771
bar  one   -0.861849
qux  one   -1.039575
baz  one   -0.494929
dtype: float64
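With fixed values in place of random ones, the alignment and reindexing behaviour can be checked by hand. The Series below is a hypothetical, smaller version of s.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([["bar", "baz"], ["one", "two"]])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=mi)

# Keys missing from one operand yield NaN in the sum.
total = s + s.iloc[:-1]

# reindex accepts a plain list of tuples and follows its order.
picked = s.reindex([("baz", "two"), ("bar", "one")])

print(total)
print(picked)
```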

Advanced indexing with hierarchical index

Syntactically integrating MultiIndex in advanced indexing with .loc is a bit challenging, but we’ve made every effort to do so. In general, MultiIndex keys take the form of tuples. For example, the following works as you would expect:

In [39]: df = df.T

In [40]: df
                     A         B         C
first second
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747

In [41]: df.loc[("bar", "two")]
A    0.805244
B    0.813850
C    1.607920
Name: (bar, two), dtype: float64

Note that df.loc['bar', 'two'] would also work in this example, but this shorthand notation can lead to ambiguity in general.

If you also want to index a specific column with .loc, you must use a tuple like this:

In [42]: df.loc[("bar", "two"), "A"]
Out[42]: 0.8052440253863785

You don’t have to specify all levels of the MultiIndex: it is enough to pass only the first elements of the tuple. For example, you can use “partial” indexing to get all elements with bar in the first level as follows:

In [43]: df.loc["bar"]
               A         B         C
one     0.895717  0.410835 -1.413681
two     0.805244  0.813850  1.607920

This is a shortcut for the slightly more verbose notation df.loc[('bar',),] (equivalent to df.loc['bar',] in this example).
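The three forms of .loc access discussed so far (full key, full key plus column, and partial key) can be compared on a small deterministic frame. The data below is hypothetical and only mirrors the shape of df.

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame(np.arange(12).reshape(4, 3), index=rows, columns=list("ABC"))

row = df.loc[("bar", "two")]        # full key: one row as a Series
cell = df.loc[("bar", "two"), "A"]  # full key plus column: a scalar
block = df.loc["bar"]               # partial key: "first" level is dropped

print(row.tolist(), cell, list(block.index))
```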


“Partial” slicing also works quite nicely.

In [44]: df.loc["baz":"foo"]
                     A         B         C
first second
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372


You can slice with a ‘range’ of values, by providing a slice of tuples.

In [45]: df.loc[("baz", "two"):("qux", "one")]
                     A         B         C
first second
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466

In [46]: df.loc[("baz", "two"):"foo"]
                     A         B         C
first second
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372


Passing a list of labels or tuples works similarly to reindexing:

In [47]: df.loc[[("bar", "two"), ("qux", "one")]]
                     A         B         C
first second
bar   two     0.805244  0.813850  1.607920
qux   one    -1.170299  1.130127  0.974466

It is important to note that tuples and lists are not treated identically in pandas when it comes to indexing. Whereas a tuple is interpreted as one multi-level key, a list is used to specify several keys. Or in other words, tuples go horizontally (traversing levels), lists go vertically (scanning levels).

Importantly, a list of tuples indexes several complete MultiIndex keys, whereas a tuple of lists refers to several values within a level:

In [48]: s = pd.Series(
   ....:     [1, 2, 3, 4, 5, 6],
   ....:     index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]),
   ....: )

In [49]: s.loc[[("A", "c"), ("B", "d")]]  # list of tuples
A  c    1
B  d    5
dtype: int64

In [50]: s.loc[(["A", "B"], ["c", "d"])]  # tuple of lists
A  c    1
   d    2
B  c    4
   d    5
dtype: int64

Using slicers

You can slice a MultiIndex by providing multiple indexers.

You can provide any of the selectors as if you are indexing by label, see Selection by Label, including slices, lists of labels, labels, and boolean indexers.

You can use slice(None) to select all the contents of that level. You do not need to specify all the deeper levels; they will be implied as slice(None).


As usual, both sides of the slicers are included as this is label indexing.



You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be misinterpreted as indexing both axes, rather than into say the MultiIndex for the rows.


You should do this:

df.loc[(slice("A1", "A3"), ...), :]  # noqa: E999


You should not do this:

df.loc[(slice("A1", "A3"), ...)]  # noqa: E999

In [51]: def mklbl(prefix, n):
   ....:     return ["%s%s" % (prefix, i) for i in range(n)]

In [52]: miindex = pd.MultiIndex.from_product(
   ....:     [mklbl("A", 4), mklbl("B", 2), mklbl("C", 4), mklbl("D", 2)]
   ....: )

In [53]: micolumns = pd.MultiIndex.from_tuples(
   ....:     [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")], names=["lvl0", "lvl1"]
   ....: )

In [54]: dfmi = (
   ....:     pd.DataFrame(
   ....:         np.arange(len(miindex) * len(micolumns)).reshape(
   ....:             (len(miindex), len(micolumns))
   ....:         ),
   ....:         index=miindex,
   ....:         columns=micolumns,
   ....:     )
   ....:     .sort_index()
   ....:     .sort_index(axis=1)
   ....: )

In [55]: dfmi
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  237  236  239  238
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  249  248  251  250
         D1  253  252  255  254

[64 rows x 4 columns]

Basic MultiIndex slicing using slices, lists, and labels.

In [56]: dfmi.loc[(slice("A1", "A3"), slice(None), ["C1", "C3"]), :]
lvl0           a         b
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[24 rows x 4 columns]

You can use pandas.IndexSlice to facilitate a more natural syntax using :, rather than using slice(None).

In [57]: idx = pd.IndexSlice

In [58]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]
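To confirm that pandas.IndexSlice is only a more readable spelling of the same selection, the sketch below runs both forms against a small hypothetical frame (8 rows, 4 columns) and compares the results.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice

rows = pd.MultiIndex.from_product([["A0", "A1"], ["B0", "B1"], ["C0", "C1"]])
cols = pd.MultiIndex.from_product(
    [["a", "b"], ["bar", "foo"]], names=["lvl0", "lvl1"]
)
dfmi = pd.DataFrame(np.arange(32).reshape(8, 4), index=rows, columns=cols)

# Explicit slice objects ...
with_slice = dfmi.loc[(slice(None), slice(None), ["C1"]), (slice(None), "foo")]

# ... and the equivalent IndexSlice spelling.
with_idx = dfmi.loc[idx[:, :, ["C1"]], idx[:, "foo"]]

print(with_idx.equals(with_slice))
```

Both axes are passed explicitly here, in line with the advice above about avoiding ambiguous indexers.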


It is possible to perform quite complicated selections using this method on multiple axes at the same time.

In [59]: dfmi.loc["A1", (slice(None), "foo")]
lvl0        a    b
lvl1      foo  foo
B0 C0 D0   64   66
      D1   68   70
   C1 D0   72   74
      D1   76   78
   C2 D0   80   82
...       ...  ...
B1 C1 D1  108  110
   C2 D0  112  114
      D1  116  118
   C3 D0  120  122
      D1  124  126

[16 rows x 2 columns]

In [60]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]


Using a boolean indexer you can provide selection related to the values.

In [61]: mask = dfmi[("a", "foo")] > 200

In [62]: dfmi.loc[idx[mask, :, ["C1", "C3"]], idx[:, "foo"]]
lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254
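The combined effect of the boolean mask and the label list is easier to follow on a frame small enough to check by hand. This is a sketch on hypothetical data; the threshold 10 is arbitrary.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice

rows = pd.MultiIndex.from_product([["A0", "A1"], ["B0", "B1"], ["C0", "C1"]])
cols = pd.MultiIndex.from_product(
    [["a", "b"], ["bar", "foo"]], names=["lvl0", "lvl1"]
)
dfmi = pd.DataFrame(np.arange(32).reshape(8, 4), index=rows, columns=cols)

# Boolean Series aligned with the rows: True where ("a", "foo") exceeds 10.
mask = dfmi[("a", "foo")] > 10

# A row is kept only if the mask is True AND its third level is "C1".
picked = dfmi.loc[idx[mask, :, ["C1"]], idx[:, "foo"]]
print(picked)
```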

You can also specify the axis argument to .loc to interpret the passed slicers on a single axis.

In [63]: dfmi.loc(axis=0)[:, :, ["C1", "C3"]]
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C1 D0    9    8   11   10
         D1   13   12   15   14
      C3 D0   25   24   27   26
         D1   29   28   31   30
   B1 C1 D0   41   40   43   42
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[32 rows x 4 columns]


Furthermore, you can set the values using the following methods.

In [64]: df2 = dfmi.copy()

In [65]: df2.loc(axis=0)[:, :, ["C1", "C3"]] = -10

In [66]: df2
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
      C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  -10  -10  -10  -10
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10

[64 rows x 4 columns]
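A compact way to convince yourself that only the selected rows change is to compare cross-sections before and after the assignment. The frame below is a hypothetical, smaller version of dfmi.

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([["A0", "A1"], ["B0", "B1"], ["C0", "C1"]])
cols = pd.MultiIndex.from_product([["a", "b"], ["bar", "foo"]])
dfmi = pd.DataFrame(np.arange(32).reshape(8, 4), index=rows, columns=cols)

df2 = dfmi.copy()

# Overwrite every row whose third level is "C1"; other rows are untouched.
df2.loc(axis=0)[:, :, ["C1"]] = -10

changed = df2.xs("C1", level=2)
unchanged = df2.xs("C0", level=2)
print(changed)
```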


You can use a right-hand-side of an alignable object as well.

In [67]: df2 = dfmi.copy()

In [68]: df2.loc[idx[:, :, ["C1", "C3"]], :] = df2 * 1000

In [69]: df2
lvl0              a               b
lvl1            bar     foo     bah     foo
A0 B0 C0 D0       1       0       3       2
         D1       5       4       7       6
      C1 D0    9000    8000   11000   10000
         D1   13000   12000   15000   14000
      C2 D0      17      16      19      18
...             ...     ...     ...     ...
A3 B1 C1 D1  237000  236000  239000  238000
      C2 D0     241     240     243     242
         D1     245     244     247     246
      C3 D0  249000  248000  251000  250000
         D1  253000  252000  255000  254000

[64 rows x 4 columns]


Cross-section

The xs() method of DataFrame additionally takes a level argument to make selecting data at a particular level of a MultiIndex easier.

In [70]: df
                     A         B         C
first second
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747

In [71]: df.xs("one", level="second")
              A         B         C
bar    0.895717  0.410835 -1.413681
baz   -1.206412  0.132003  1.024180
foo    1.431256 -0.076467  0.875906
qux   -1.170299  1.130127  0.974466

# using the slicers
In [72]: df.loc[(slice(None), "one"), :]
                     A         B         C
first second
bar   one     0.895717  0.410835 -1.413681
baz   one    -1.206412  0.132003  1.024180
foo   one     1.431256 -0.076467  0.875906
qux   one    -1.170299  1.130127  0.974466

You can also select on the columns with xs, by providing the axis argument.

In [73]: df = df.T

In [74]: df.xs("one", level="second", axis=1)
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466

# using the slicers
In [75]: df.loc[:, (slice(None), "one")]
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466

xs also allows selection with multiple keys.

In [76]: df.xs(("one", "bar"), level=("second", "first"), axis=1)
first        bar
second       one
A       0.895717
B       0.410835
C      -1.413681

# using the slicers
In [77]: df.loc[:, ("bar", "one")]
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64

You can pass drop_level=False to xs to retain the level that was selected.

In [78]: df.xs("one", level="second", axis=1, drop_level=False)
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466

Compare the above with the result using drop_level=True (the default value).

In [79]: df.xs("one", level="second", axis=1, drop_level=True)
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466
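The relationship between xs(), drop_level and the equivalent slicer can be summarised on a deterministic frame. The data is hypothetical; only the index layout follows the examples above.

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame(np.arange(12).reshape(4, 3), index=rows, columns=list("ABC"))

# Default: the selected level is dropped from the result.
xs_result = df.xs("one", level="second")

# drop_level=False keeps the full two-level index ...
kept = df.xs("one", level="second", drop_level=False)

# ... which matches what the slicer form returns.
loc_result = df.loc[(slice(None), "one"), :]

print(list(xs_result.index), kept.index.nlevels)
```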

Advanced reindexing and alignment

Using the parameter level in the reindex() and align() methods of pandas objects is useful to broadcast values across a level. For instance:

In [80]: midx = pd.MultiIndex(
   ....:     levels=[["zero", "one"], ["x", "y"]], codes=[[1, 1, 0, 0], [1, 0, 1, 0]]
   ....: )

In [81]: df = pd.DataFrame(np.random.randn(4, 2), index=midx)

In [82]: df
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [83]: df2 = df.groupby(level=0).mean()

In [84]: df2
             0         1
one   1.060074 -0.109716
zero  1.271532  0.713416

In [85]: df2.reindex(df.index, level=0)
               0         1
one  y  1.060074 -0.109716
     x  1.060074 -0.109716
zero y  1.271532  0.713416
     x  1.271532  0.713416

# aligning
In [86]: df_aligned, df2_aligned = df.align(df2, level=0)

In [87]: df_aligned
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [88]: df2_aligned
               0         1
one  y  1.060074 -0.109716
     x  1.060074 -0.109716
zero y  1.271532  0.713416
     x  1.271532  0.713416
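With round numbers in place of random ones, the broadcast is easy to verify. The sketch below uses a hypothetical single-column frame whose group means are 2.0 and 15.0.

```python
import pandas as pd

midx = pd.MultiIndex.from_arrays(
    [["one", "one", "zero", "zero"], ["y", "x", "y", "x"]]
)
df = pd.DataFrame({"v": [1.0, 3.0, 10.0, 20.0]}, index=midx)

# One row per outer label: one -> 2.0, zero -> 15.0.
means = df.groupby(level=0).mean()

# Repeat each group mean for every row of the matching outer label.
broadcast = means.reindex(df.index, level=0)
print(broadcast)
```

A typical use of this pattern is subtracting group means from the original rows, since the broadcast frame shares the original index.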

Swapping levels with swaplevel

The swaplevel() method can switch the order of two levels:

In [89]: df[:5]
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [90]: df[:5].swaplevel(0, 1, axis=0)
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520

Reordering levels with reorder_levels

The reorder_levels() method generalizes the swaplevel method, allowing you to permute the hierarchical index levels in one step:

In [91]: df[:5].reorder_levels([1, 0], axis=0)
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520
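With only two levels the two methods give the same result, so the difference shows more clearly on three. A minimal sketch, using hypothetical level names a, b and c:

```python
import pandas as pd

mi = pd.MultiIndex.from_arrays(
    [["one", "one", "zero"], ["y", "x", "y"], [1, 2, 3]],
    names=["a", "b", "c"],
)
s = pd.Series([10, 20, 30], index=mi)

# swaplevel exchanges exactly two levels.
swapped = s.swaplevel(0, 1)

# reorder_levels applies an arbitrary permutation in one step.
reordered = s.reorder_levels(["c", "a", "b"])

print(list(swapped.index.names))
print(list(reordered.index.names))
```

Neither method sorts the data; the rows stay in their original order.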

Renaming names of an Index or MultiIndex

The rename() method is used to rename the labels of a MultiIndex, and is typically used to rename the columns of a DataFrame. The columns argument of rename allows a dictionary to be specified that includes only the columns you wish to rename.

In [92]: df.rename(columns={0: "col0", 1: "col1"})
            col0      col1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

This method can also be used to rename specific labels of the main index of the DataFrame.

In [93]: df.rename(index={"one": "two", "y": "z"})
               0         1
two  z  1.519970 -0.493662
     x  0.600178  0.274230
zero z  0.132885 -0.023688
     x  2.410179  1.450520

The rename_axis() method is used to rename the name of an Index or MultiIndex. In particular, the names of the levels of a MultiIndex can be specified, which is useful if reset_index() is later used to move the values from the MultiIndex to a column.

In [94]: df.rename_axis(index=["abc", "def"])
                 0         1
abc  def
one  y    1.519970 -0.493662
     x    0.600178  0.274230
zero y    0.132885 -0.023688
     x    2.410179  1.450520

Note that the columns of a DataFrame are an index, so that using rename_axis with the columns argument will change the name of that index.

In [95]: df.rename_axis(columns="Cols").columns
Out[95]: RangeIndex(start=0, stop=2, step=1, name='Cols')
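The distinction between renaming labels and renaming level names can be seen side by side in the sketch below. The frame and the names abc and def are illustrative.

```python
import pandas as pd

midx = pd.MultiIndex.from_arrays(
    [["one", "one", "zero", "zero"], ["y", "x", "y", "x"]]
)
df = pd.DataFrame({"v": [1, 2, 3, 4]}, index=midx)

# rename changes labels, wherever they occur in the index.
relabelled = df.rename(index={"one": "two", "y": "z"})

# rename_axis changes the names of the levels, not the labels.
named = df.rename_axis(index=["abc", "def"])

# The level names then become column names after reset_index().
flat = named.reset_index()
print(list(flat.columns))
```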

Both rename and rename_axis support specifying a dictionary, Series or a mapping function to map labels/names to new values.

When working with an Index object directly, rather than via a DataFrame, Index.set_names() can be used to change the names.

In [96]: mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])

In [97]: mi.names
Out[97]: FrozenList(['x', 'y'])

In [98]: mi2 = mi.rename("new name", level=0)

In [99]: mi2
MultiIndex([(1, 'a'),
            (1, 'b'),
            (2, 'a'),
            (2, 'b')],
           names=['new name', 'y'])

You cannot set the names of the MultiIndex via a level.

In [100]: mi.levels[0].name = "name via level"
RuntimeError                              Traceback (most recent call last)
Cell In[100], line 1
----> 1 mi.levels[0].name = "name via level"

File ~/work/pandas/pandas/pandas/core/indexes/, in, value)
   1686 @name.setter
   1687 def name(self, value: Hashable) -> None:
   1688     if self._no_setting_name:
   1689         # Used in MultiIndex.levels to avoid silently ignoring name updates.
-> 1690         raise RuntimeError(
   1691             "Cannot set name on a level of a MultiIndex. Use "
   1692             "'MultiIndex.set_names' instead."
   1693         )
   1694     maybe_extract_name(value, None, type(self))
   1695     self._name = value

RuntimeError: Cannot set name on a level of a MultiIndex. Use 'MultiIndex.set_names' instead.

Use Index.set_names() instead.
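As a minimal sketch of the supported route, set_names() can rename a single level or all levels at once, and returns a new index rather than modifying the original. The new names below are arbitrary.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])

# Rename one level, addressed by position.
one_level = mi.set_names("name via set_names", level=0)

# Rename every level in one call.
all_levels = mi.set_names(["p", "q"])

print(list(one_level.names))
print(list(all_levels.names))
print(list(mi.names))  # the original index is unchanged
```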

Sorting a MultiIndex

For MultiIndex-ed objects to be indexed and sliced effectively, they need to be sorted. As with any index, you can use sort_index().

In [101]: import random

In [102]: random.shuffle(tuples)

In [103]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))

In [104]: s
qux  two    0.206053
bar  one   -0.251905
foo  one   -2.213588
qux  one    1.063327
foo  two    1.266143
baz  two    0.299368
bar  two   -0.863838
baz  one    0.408204
dtype: float64

In [105]: s.sort_index()
bar  one   -0.251905
     two   -0.863838
baz  one    0.408204
     two    0.299368
foo  one   -2.213588
     two    1.266143
qux  one    1.063327
     two    0.206053
dtype: float64

In [106]: s.sort_index(level=0)
bar  one   -0.251905
     two   -0.863838
baz  one    0.408204
     two    0.299368
foo  one   -2.213588
     two    1.266143
qux  one    1.063327
     two    0.206053
dtype: float64

In [107]: s.sort_index(level=1)
bar  one   -0.251905
baz  one    0.408204
foo  one   -2.213588
qux  one    1.063327
bar  two   -0.863838
baz  two    0.299368
foo  two    1.266143
qux  two    0.206053
dtype: float64

You may also pass a level name to sort_index if the MultiIndex levels are named.

In [108]: s.index = s.index.set_names(["L1", "L2"])

In [109]: s.sort_index(level="L1")
L1   L2
bar  one   -0.251905
     two   -0.863838
baz  one    0.408204
     two    0.299368
foo  one   -2.213588
     two    1.266143
qux  one    1.063327
     two    0.206053
dtype: float64

In [110]: s.sort_index(level="L2")
L1   L2
bar  one   -0.251905
baz  one    0.408204
foo  one   -2.213588
qux  one    1.063327
bar  two   -0.863838
baz  two    0.299368
foo  two    1.266143
qux  two    0.206053
dtype: float64

On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex:

In [111]: df.T.sort_index(level=1, axis=1)
        one      zero       one      zero
          x         x         y         y
0  0.600178  2.410179  1.519970  0.132885
1  0.274230  1.450520 -0.493662 -0.023688

Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning). It will also return a copy of the data rather than a view:

In [112]: dfm = pd.DataFrame(
   .....:     {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": np.random.rand(4)}
   .....: )

In [113]: dfm = dfm.set_index(["jim", "joe"])

In [114]: dfm
            jolie
jim joe
0   x    0.490671
    x    0.120248
1   z    0.537020
    y    0.110968

In [115]: dfm.loc[(1, 'z')]
           jolie
jim joe
1   z    0.53702


Furthermore, if you try to index something that is not fully lexsorted, this can raise:

In [116]: dfm.loc[(0, 'y'):(1, 'z')]
UnsortedIndexError                        Traceback (most recent call last)
Cell In[116], line 1
----> 1 dfm.loc[(0, 'y'):(1, 'z')]

File ~/work/pandas/pandas/pandas/core/, in _LocationIndexer.__getitem__(self, key)
   1189 maybe_callable = com.apply_if_callable(key, self.obj)
   1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File ~/work/pandas/pandas/pandas/core/, in _LocIndexer._getitem_axis(self, key, axis)
   1409 if isinstance(key, slice):
   1410     self._validate_key(key, axis)
-> 1411     return self._get_slice_axis(key, axis=axis)
   1412 elif com.is_bool_indexer(key):
   1413     return self._getbool_axis(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
   1440     return obj.copy(deep=False)
   1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
   1445 if isinstance(indexer, slice):
   1446     return self.obj._slice(indexer, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexes/, in Index.slice_indexer(self, start, end, step)
   6618 def slice_indexer(
   6619     self,
   6620     start: Hashable | None = None,
   6621     end: Hashable | None = None,
   6622     step: int | None = None,
   6623 ) -> slice:
   6624     """
   6625     Compute the slice indexer for input labels and step.
   6660     slice(1, 3, None)
   6661     """
-> 6662     start_slice, end_slice = self.slice_locs(start, end, step=step)
   6664     # return a slice
   6665     if not is_scalar(start_slice):

File ~/work/pandas/pandas/pandas/core/indexes/, in MultiIndex.slice_locs(self, start, end, step)
   2852 """
   2853 For an ordered MultiIndex, compute the slice locations for input
   2854 labels.
   2900                       sequence of such.
   2901 """
   2902 # This function adds nothing to its parent implementation (the magic
   2903 # happens in get_slice_bound method), but it adds meaningful doc.
-> 2904 return super().slice_locs(start, end, step)

File ~/work/pandas/pandas/pandas/core/indexes/, in Index.slice_locs(self, start, end, step)
   6877 start_slice = None
   6878 if start is not None:
-> 6879     start_slice = self.get_slice_bound(start, "left")
   6880 if start_slice is None:
   6881     start_slice = 0

File ~/work/pandas/pandas/pandas/core/indexes/, in MultiIndex.get_slice_bound(self, label, side)
   2846 if not isinstance(label, tuple):
   2847     label = (label,)
-> 2848 return self._partial_tup_index(label, side=side)

File ~/work/pandas/pandas/pandas/core/indexes/, in MultiIndex._partial_tup_index(self, tup, side)
   2906 def _partial_tup_index(self, tup: tuple, side: Literal["left", "right"] = "left"):
   2907     if len(tup) > self._lexsort_depth:
-> 2908         raise UnsortedIndexError(
   2909             f"Key length ({len(tup)}) was greater than MultiIndex lexsort depth "
   2910             f"({self._lexsort_depth})"
   2911         )
   2913     n = len(tup)
   2914     start, end = 0, len(self)

UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'

The is_monotonic_increasing attribute on a MultiIndex shows if the index is sorted:

In [117]: dfm.index.is_monotonic_increasing
Out[117]: False

In [118]: dfm = dfm.sort_index()

In [119]: dfm
            jolie
jim joe
0   x    0.490671
    x    0.120248
1   y    0.110968
    z    0.537020

In [120]: dfm.index.is_monotonic_increasing
Out[120]: True


And now selection works as expected.

In [121]: dfm.loc[(0, "y"):(1, "z")]
            jolie
jim joe
1   y    0.110968
    z    0.537020
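The check-then-sort pattern can be wrapped in a small helper. The sketch below is only an illustration: ensure_sorted is a hypothetical name, not a pandas function, and the jolie values are fixed so the result is reproducible.

```python
import pandas as pd

# Same shape as dfm above, with fixed values instead of random ones.
dfm = pd.DataFrame(
    {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": [0.1, 0.2, 0.3, 0.4]}
).set_index(["jim", "joe"])


def ensure_sorted(obj):
    """Return obj unchanged if its index is already sorted, else a sorted copy."""
    if obj.index.is_monotonic_increasing:
        return obj
    return obj.sort_index()


# Slicing the unsorted frame past its lexsort depth raises.
try:
    dfm.loc[(0, "y"):(1, "z")]
    raised = False
except pd.errors.UnsortedIndexError:
    raised = True

# After sorting, the same slice selects the rows (1, "y") and (1, "z").
selected = ensure_sorted(dfm).loc[(0, "y"):(1, "z")]
```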

Take methods

Similar to NumPy ndarrays, pandas Index, Series, and DataFrame also provide the take() method, which retrieves elements along a given axis at the given indices. The given indices must be either a list or an ndarray of integer index positions. take will also accept negative integers as positions relative to the end of the object.

In [122]: index = pd.Index(np.random.randint(0, 1000, 10))

In [123]: index
Out[123]: Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64')

In [124]: positions = [0, 9, 3]

In [125]: index[positions]
Out[125]: Index([214, 329, 567], dtype='int64')

In [126]: index.take(positions)
Out[126]: Index([214, 329, 567], dtype='int64')

In [127]: ser = pd.Series(np.random.randn(10))

In [128]: ser.iloc[positions]
0   -0.179666
9    1.824375
3    0.392149
dtype: float64

In [129]: ser.take(positions)
0   -0.179666
9    1.824375
3    0.392149
dtype: float64

For DataFrames, the given indices should be a 1d list or ndarray that specifies row or column positions.

In [130]: frm = pd.DataFrame(np.random.randn(5, 3))

In [131]: frm.take([1, 4, 3])
          0         1         2
1 -1.237881  0.106854 -1.276829
4  0.629675 -1.425966  1.857704
3  0.979542 -1.633678  0.615855

In [132]: frm.take([0, 2], axis=1)
          0         2
0  0.595974  0.601544
1 -1.237881 -1.276829
2 -0.767101  1.499591
3  0.979542  0.615855
4  0.629675  1.857704
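The negative positions mentioned earlier are not exercised by the sessions above. A small sketch with made-up data:

```python
import pandas as pd

ser = pd.Series([10, 20, 30, 40], index=list("abcd"))

# Negative positions count back from the end, as with Python lists.
last_two = ser.take([-1, -2])

frm = pd.DataFrame({"p": [1, 2, 3], "q": [4, 5, 6]})

# axis=1 selects columns by position; -1 is the last column.
last_col = frm.take([-1], axis=1)
```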

It is important to note that the take method on pandas objects is not intended to work on boolean indices and may return unexpected results.

In [133]: arr = np.random.randn(10)

In [134]: arr.take([False, False, True, True])
Out[134]: array([-1.1935, -1.1935,  0.6775,  0.6775])

In [135]: arr[[0, 1]]
Out[135]: array([-1.1935,  0.6775])

In [136]: ser = pd.Series(np.random.randn(10))

In [137]: ser.take([False, False, True, True])
0    0.233141
0    0.233141
1   -0.223540
1   -0.223540
dtype: float64

In [138]: ser.iloc[[0, 1]]
0    0.233141
1   -0.223540
dtype: float64
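If the selection starts out as a boolean mask, one option is to index with the mask directly, or to convert it to integer positions before calling take. A sketch using np.flatnonzero for the conversion (the data are illustrative):

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, 2.0, 3.0, 4.0])
mask = np.array([False, False, True, True])

# Boolean selection: keeps the rows where the mask is True.
by_mask = ser[mask]

# take needs integer positions, so convert the mask first.
by_take = ser.take(np.flatnonzero(mask))
```

Both routes select the last two rows here, whereas passing the mask straight to take would treat False and True as positions 0 and 1.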

Finally, as a small note on performance, because the take method handles a narrower range of inputs, it can offer performance that is a good deal faster than fancy indexing.

In [139]: arr = np.random.randn(10000, 5)

In [140]: indexer = np.arange(10000)

In [141]: random.shuffle(indexer)

In [142]: %timeit arr[indexer]
   .....: %timeit arr.take(indexer, axis=0)
257 us +- 4.44 us per loop (mean +- std. dev. of 7 runs, 1,000 loops each)
79.7 us +- 1.15 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)

In [143]: ser = pd.Series(arr[:, 0])

In [144]: %timeit ser.iloc[indexer]
   .....: %timeit ser.take(indexer)
144 us +- 3.69 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)
129 us +- 2 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)

Index types

We have discussed MultiIndex in the previous sections pretty extensively. Documentation about DatetimeIndex and PeriodIndex is shown here, and documentation about TimedeltaIndex is found here.


In the following sub-sections we will highlight some other index types.


CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. This is a container around a Categorical and allows efficient indexing and storage of an index with a large number of duplicated elements.

In [145]: from pandas.api.types import CategoricalDtype

In [146]: df = pd.DataFrame({"A": np.arange(6), "B": list("aabbca")})

In [147]: df["B"] = df["B"].astype(CategoricalDtype(list("cab")))

In [148]: df
   A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a

In [149]: df.dtypes
A       int64
B    category
dtype: object

In [150]: df["B"].cat.categories
Out[150]: Index(['c', 'a', 'b'], dtype='object')

Setting the index will create a CategoricalIndex.

In [151]: df2 = df.set_index("B")

In [152]: df2.index
Out[152]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')

Indexing with __getitem__/.iloc/.loc works similarly to an Index with duplicates. The indexers must be in the category or the operation will raise a KeyError.

In [153]: df2.loc["a"]
   A
B
a  0
a  1
a  5
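The KeyError case is not shown above. A minimal sketch, rebuilding df2 with the same categories and looking up a label ("e", chosen arbitrarily) that is not among them:

```python
import pandas as pd

df2 = pd.DataFrame(
    {"A": range(6), "B": pd.Categorical(list("aabbca"), categories=list("cab"))}
).set_index("B")

hits = df2.loc["a"]  # every row labelled "a"

try:
    df2.loc["e"]  # "e" is not one of the categories
    missing = False
except KeyError:
    missing = True
```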

The CategoricalIndex is preserved after indexing:

In [154]: df2.loc["a"].index
Out[154]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')

Sorting the index will sort by the order of the categories (recall that we created the index with CategoricalDtype(list('cab')), so the sorted order is cab).

In [155]: df2.sort_index()
   A
B
c  4
a  0
a  1
a  5
b  2
b  3


Groupby operations on the index will preserve the index nature as well.

In [156]: df2.groupby(level=0, observed=True).sum()
   A
B
c  4
a  6
b  5

In [157]: df2.groupby(level=0, observed=True).sum().index
Out[157]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')

Reindexing operations will return a resulting index based on the type of the passed indexer. Passing a list will return a plain-old Index; indexing with a Categorical will return a CategoricalIndex, indexed according to the categories of the passed Categorical dtype. This allows one to arbitrarily index these even with values not in the categories, similarly to how you can reindex any pandas index.

In [158]: df3 = pd.DataFrame(
   .....:     {"A": np.arange(3), "B": pd.Series(list("abc")).astype("category")}
   .....: )

In [159]: df3 = df3.set_index("B")

In [160]: df3
   A
B
a  0
b  1
c  2

In [161]: df3.reindex(["a", "e"])
     A
B
a  0.0
e  NaN

In [162]: df3.reindex(["a", "e"]).index
Out[162]: Index(['a', 'e'], dtype='object', name='B')

In [163]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe")))
     A
B
a  0.0
e  NaN

In [164]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe"))).index
Out[164]: CategoricalIndex(['a', 'e'], categories=['a', 'b', 'e'], ordered=False, dtype='category', name='B')



Reshaping and comparison operations on CategoricalIndex objects require the same categories, or a TypeError will be raised.

In [165]: df4 = pd.DataFrame({"A": np.arange(2), "B": list("ba")})

In [166]: df4["B"] = df4["B"].astype(CategoricalDtype(list("ab")))

In [167]: df4 = df4.set_index("B")

In [168]: df4.index
Out[168]: CategoricalIndex(['b', 'a'], categories=['a', 'b'], ordered=False, dtype='category', name='B')

In [169]: df5 = pd.DataFrame({"A": np.arange(2), "B": list("bc")})

In [170]: df5["B"] = df5["B"].astype(CategoricalDtype(list("bc")))

In [171]: df5 = df5.set_index("B")

In [172]: df5.index
Out[172]: CategoricalIndex(['b', 'c'], categories=['b', 'c'], ordered=False, dtype='category', name='B')

In [173]: pd.concat([df4, df5])
   A
B
b  0
a  1
b  0
c  1
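In the session above, concat returned a result rather than raising. The comparison case can be sketched as follows, assuming a pandas version in which comparing indexes with mismatched categories raises TypeError; the shared category list is an illustrative choice:

```python
import pandas as pd

left = pd.CategoricalIndex(["b", "a"], categories=["a", "b"])
right = pd.CategoricalIndex(["b", "c"], categories=["b", "c"])

# Different categories on each side: the comparison is rejected.
try:
    left == right
    raised = False
except TypeError:
    raised = True

# Giving both sides the same categories makes them comparable.
shared = ["a", "b", "c"]
equal_mask = left.set_categories(shared) == right.set_categories(shared)
```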


RangeIndex is a sub-class of Index that provides the default index for all DataFrame and Series objects. RangeIndex is an optimized version of Index that can represent a monotonic ordered set. These are analogous to Python range types. A RangeIndex will always have an int64 dtype.

In [174]: idx = pd.RangeIndex(5)

In [175]: idx
Out[175]: RangeIndex(start=0, stop=5, step=1)

RangeIndex is the default index for all DataFrame and Series objects:

In [176]: ser = pd.Series([1, 2, 3])

In [177]: ser.index
Out[177]: RangeIndex(start=0, stop=3, step=1)

In [178]: df = pd.DataFrame([[1, 2], [3, 4]])

In [179]: df.index
Out[179]: RangeIndex(start=0, stop=2, step=1)

In [180]: df.columns
Out[180]: RangeIndex(start=0, stop=2, step=1)

A RangeIndex behaves similarly to an Index with an int64 dtype. Operations on a RangeIndex whose result cannot be represented by a RangeIndex, but should have an integer dtype, return an Index with int64 dtype. For example:

In [181]: idx[[0, 2]]
Out[181]: Index([0, 2], dtype='int64')
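A slightly longer sketch of the same behaviour, with a step of 2 chosen for illustration. The positions in the last step are picked so that the selected values are not evenly spaced and therefore cannot be expressed as a range:

```python
import pandas as pd

idx = pd.RangeIndex(start=0, stop=10, step=2)

# Same members as the equivalent Python range.
same_as_range = list(idx) == list(range(0, 10, 2))

# A slice is still an arithmetic progression, so it stays a RangeIndex.
sliced = idx[1:4]

# Positions 0, 1, 3 give the values 0, 2, 6, which are not evenly spaced.
picked = idx[[0, 1, 3]]
```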


IntervalIndex, together with its own dtype, IntervalDtype, and the Interval scalar type, provides first-class support in pandas for interval notation.

The IntervalIndex allows some unique indexing and is also used as a return type for the categories in cut() and qcut().

An IntervalIndex can be used in Series and in DataFrame as the index.

In [182]: df = pd.DataFrame(
   .....:     {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])
   .....: )

In [183]: df
        A
(0, 1]  1
(1, 2]  2
(2, 3]  3
(3, 4]  4

Label based indexing via .loc along the edges of an interval works as you would expect, selecting that particular interval.

In [184]: df.loc[2]
A    2
Name: (1, 2], dtype: int64

In [185]: df.loc[[2, 3]]
        A
(1, 2]  2
(2, 3]  3
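Which interval an edge label selects depends on the closed side of the index. The sketch below compares the default right-closed index with a left-closed one built from the same breaks (the Series values are illustrative):

```python
import pandas as pd

breaks = [0, 1, 2, 3, 4]
right = pd.Series([1, 2, 3, 4], index=pd.IntervalIndex.from_breaks(breaks))
left = pd.Series(
    [1, 2, 3, 4], index=pd.IntervalIndex.from_breaks(breaks, closed="left")
)

on_right_closed = right.loc[2]  # 2 belongs to (1, 2]
on_left_closed = left.loc[2]    # 2 belongs to [2, 3)
```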


If you select a label contained within an interval, this will also select the interval.

In [186]: df.loc[2.5]
A    3
Name: (2, 3], dtype: int64

In [187]: df.loc[[2.5, 3.5]]
        A
(2, 3]  3
(3, 4]  4

Selecting using an Interval will only return exact matches.

In [188]: df.loc[pd.Interval(1, 2)]
A    2
Name: (1, 2], dtype: int64

Trying to select an Interval that is not exactly contained in the IntervalIndex will raise a KeyError.

In [189]: df.loc[pd.Interval(0.5, 2.5)]
KeyError                                  Traceback (most recent call last)
Cell In[189], line 1
----> 1 df.loc[pd.Interval(0.5, 2.5)]

File ~/work/pandas/pandas/pandas/core/, in _LocationIndexer.__getitem__(self, key)
   1189 maybe_callable = com.apply_if_callable(key, self.obj)
   1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File ~/work/pandas/pandas/pandas/core/, in _LocIndexer._getitem_axis(self, key, axis)
   1429 # fall thru to straight lookup
   1430 self._validate_key(key, axis)
-> 1431 return self._get_label(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/, in _LocIndexer._get_label(self, label, axis)
   1379 def _get_label(self, label, axis: AxisInt):
   1380     # GH#5567 this will fail if the label is not present in the axis.
-> 1381     return self.obj.xs(label, axis=axis)

File ~/work/pandas/pandas/pandas/core/, in NDFrame.xs(self, key, axis, level, drop_level)
   4299             new_index = index[loc]
   4300 else:
-> 4301     loc = index.get_loc(key)
   4303     if isinstance(loc, np.ndarray):
   4304         if loc.dtype == np.bool_:

File ~/work/pandas/pandas/pandas/core/indexes/, in IntervalIndex.get_loc(self, key)
    676 matches = mask.sum()
    677 if matches == 0:
--> 678     raise KeyError(key)
    679 if matches == 1:
    680     return mask.argmax()

KeyError: Interval(0.5, 2.5, closed='right')

Selecting all Intervals that overlap a given Interval can be performed using the overlaps() method to create a boolean indexer.

In [190]: idxr = df.index.overlaps(pd.Interval(0.5, 2.5))

In [191]: idxr
Out[191]: array([ True,  True,  True, False])

In [192]: df[idxr]
        A
(0, 1]  1
(1, 2]  2
(2, 3]  3

cut() and qcut() both return a Categorical object, and the bins they create are stored as an IntervalIndex in its .categories attribute.

In [193]: c = pd.cut(range(4), bins=2)

In [194]: c
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]

In [195]: c.categories
Out[195]: IntervalIndex([(-0.003, 1.5], (1.5, 3.0]], dtype='interval[float64, right]')

cut() also accepts an IntervalIndex for its bins argument, which enables a useful pandas idiom. First, we call cut() with some data and bins set to a fixed number, to generate the bins. Then, we pass the values of .categories as the bins argument in subsequent calls to cut(), supplying new data which will be binned into the same bins.

In [196]: pd.cut([0, 3, 5, 1], bins=c.categories)
[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]

Any value which falls outside all bins will be assigned a NaN value.
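One way to act on those NaN values is to flag them with isna(). A sketch of the two-step idiom with the same numbers as above (the variable names are illustrative):

```python
import pandas as pd

# Step 1: let cut derive two bins from the data.
binned = pd.cut([0, 1, 2, 3], bins=2)

# Step 2: reuse those bins for new data.
rebinned = pd.cut([0, 3, 5, 1], bins=binned.categories)

# Values outside every bin become NaN, so isna() flags them.
outside = rebinned.isna()
```

Here only the value 5 lies beyond the last bin edge, so it is the only entry flagged.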

If we need intervals on a regular frequency, we can use the interval_range() function to create an IntervalIndex using various combinations of start, end, and periods. The default frequency for interval_range is 1 for numeric intervals, and calendar day for datetime-like intervals:

In [197]: pd.interval_range(start=0, end=5)
Out[197]: IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]], dtype='interval[int64, right]')

In [198]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4)
IntervalIndex([(2017-01-01 00:00:00, 2017-01-02 00:00:00],
               (2017-01-02 00:00:00, 2017-01-03 00:00:00],
               (2017-01-03 00:00:00, 2017-01-04 00:00:00],
               (2017-01-04 00:00:00, 2017-01-05 00:00:00]],
              dtype='interval[datetime64[ns], right]')

In [199]: pd.interval_range(end=pd.Timedelta("3 days"), periods=3)
IntervalIndex([(0 days 00:00:00, 1 days 00:00:00],
               (1 days 00:00:00, 2 days 00:00:00],
               (2 days 00:00:00, 3 days 00:00:00]],
              dtype='interval[timedelta64[ns], right]')

The freq parameter can be used to specify non-default frequencies, and can utilize a variety of frequency aliases with datetime-like intervals:

In [200]: pd.interval_range(start=0, periods=5, freq=1.5)
Out[200]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]], dtype='interval[float64, right]')

In [201]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4, freq="W")
IntervalIndex([(2017-01-01 00:00:00, 2017-01-08 00:00:00],
               (2017-01-08 00:00:00, 2017-01-15 00:00:00],
               (2017-01-15 00:00:00, 2017-01-22 00:00:00],
               (2017-01-22 00:00:00, 2017-01-29 00:00:00]],
              dtype='interval[datetime64[ns], right]')

In [202]: pd.interval_range(start=pd.Timedelta("0 days"), periods=3, freq="9h")
IntervalIndex([(0 days 00:00:00, 0 days 09:00:00],
               (0 days 09:00:00, 0 days 18:00:00],
               (0 days 18:00:00, 1 days 03:00:00]],
              dtype='interval[timedelta64[ns], right]')

Additionally, the closed parameter can be used to specify which side(s) the intervals are closed on. Intervals are closed on the right side by default.

In [203]: pd.interval_range(start=0, end=4, closed="both")
Out[203]: IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]], dtype='interval[int64, both]')

In [204]: pd.interval_range(start=0, end=4, closed="neither")
Out[204]: IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)], dtype='interval[int64, neither]')

Specifying start, end, and periods will generate a range of evenly spaced intervals from start to end inclusively, with periods number of elements in the resulting IntervalIndex:

In [205]: pd.interval_range(start=0, end=6, periods=4)
Out[205]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]], dtype='interval[float64, right]')

In [206]: pd.interval_range(pd.Timestamp("2018-01-01"), pd.Timestamp("2018-02-28"), periods=3)
IntervalIndex([(2018-01-01 00:00:00, 2018-01-20 08:00:00],
               (2018-01-20 08:00:00, 2018-02-08 16:00:00],
               (2018-02-08 16:00:00, 2018-02-28 00:00:00]],
              dtype='interval[datetime64[ns], right]')

Miscellaneous indexing FAQ

Integer indexing

Label-based indexing with integer axis labels is a thorny topic. It has been discussed heavily on mailing lists and among various members of the scientific Python community. In pandas, our general viewpoint is that labels matter more than integer locations. Therefore, with an integer axis index, only label-based indexing is possible with the standard tools like .loc. The following code will generate exceptions:

In [207]: s = pd.Series(range(5))

In [208]: s[-1]
ValueError                                Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexes/, in RangeIndex.get_loc(self, key)
    412 try:
--> 413     return self._range.index(new_key)
    414 except ValueError as err:

ValueError: -1 is not in range

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[208], line 1
----> 1 s[-1]

File ~/work/pandas/pandas/pandas/core/, in Series.__getitem__(self, key)
   1118     return self._values[key]
   1120 elif key_is_scalar:
-> 1121     return self._get_value(key)
   1123 # Convert generator to list before going through hashable part
   1124 # (We will iterate through the generator there to check for slices)
   1125 if is_iterator(key):

File ~/work/pandas/pandas/pandas/core/, in Series._get_value(self, label, takeable)
   1234     return self._values[label]
   1236 # Similar to Index.get_value, but we do not fall back to positional
-> 1237 loc = self.index.get_loc(label)
   1239 if is_integer(loc):
   1240     return self._values[loc]

File ~/work/pandas/pandas/pandas/core/indexes/, in RangeIndex.get_loc(self, key)
    413         return self._range.index(new_key)
    414     except ValueError as err:
--> 415         raise KeyError(key) from err
    416 if isinstance(key, Hashable):
    417     raise KeyError(key)

KeyError: -1

In [209]: df = pd.DataFrame(np.random.randn(5, 4))

In [210]: df
          0         1         2         3
0 -0.435772 -1.188928 -0.808286 -0.284634
1 -1.815703  1.347213 -0.243487  0.514704
2  1.162969 -0.287725 -0.179734  0.993962
3 -0.212673  0.909872 -0.733333 -0.349893
4  0.456434 -0.306735  0.553396  0.166221

In [211]: df.loc[-2:]
          0         1         2         3
0 -0.435772 -1.188928 -0.808286 -0.284634
1 -1.815703  1.347213 -0.243487  0.514704
2  1.162969 -0.287725 -0.179734  0.993962
3 -0.212673  0.909872 -0.733333 -0.349893
4  0.456434 -0.306735  0.553396  0.166221

This deliberate decision was made to prevent ambiguities and subtle bugs (many users reported finding bugs when the API change was made to stop “falling back” on position-based indexing).
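When a position is what you actually mean, .iloc states that explicitly. A minimal sketch with an integer index whose labels (10 to 50, chosen for illustration) differ from the positions:

```python
import pandas as pd

s = pd.Series(range(5), index=[10, 20, 30, 40, 50])

by_label = s.loc[50]      # label lookup
by_position = s.iloc[-1]  # positional lookup, counted from the end

try:
    s[-1]  # -1 is treated as a label, and no such label exists
    raised = False
except KeyError:
    raised = True
```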

Non-monotonic indexes require exact matches

If the index of a Series or DataFrame is monotonically increasing or decreasing, then the bounds of a label-based slice can be outside the range of the index, much like slice indexing a normal Python list. Monotonicity of an index can be tested with the is_monotonic_increasing and is_monotonic_decreasing attributes.

In [212]: df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=["data"], data=list(range(5)))

In [213]: df.index.is_monotonic_increasing
Out[213]: True

# no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4:
In [214]: df.loc[0:4, :]
   data
2     0
3     1
3     2
4     3

# slice bounds are outside the index, so an empty DataFrame is returned
In [215]: df.loc[13:15, :]
Empty DataFrame
Columns: [data]
Index: []


On the other hand, if the index is not monotonic, then both slice bounds must be unique members of the index.

In [216]: df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5], columns=["data"], data=list(range(6)))

In [217]: df.index.is_monotonic_increasing
Out[217]: False

# OK because 2 and 4 are in the index
In [218]: df.loc[2:4, :]
   data
2     0
3     1
1     2
4     3

# 0 is not in the index
In [219]: df.loc[0:4, :]
KeyError                                  Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexes/, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:191, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:234, in pandas._libs.index.IndexEngine._get_loc_duplicates()

File index.pyx:242, in pandas._libs.index.IndexEngine._maybe_get_bool_indexer()

File index.pyx:134, in pandas._libs.index._unpack_bool_indexer()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[219], line 1
----> 1 df.loc[0:4, :]

File ~/work/pandas/pandas/pandas/core/, in _LocationIndexer.__getitem__(self, key)
   1182     if self._is_scalar_access(key):
   1183         return self.obj._get_value(*key, takeable=self._takeable)
-> 1184     return self._getitem_tuple(key)
   1185 else:
   1186     # we by definition only have the 0th axis
   1187     axis = self.axis or 0

File ~/work/pandas/pandas/pandas/core/, in _LocIndexer._getitem_tuple(self, tup)
   1374 if self._multi_take_opportunity(tup):
   1375     return self._multi_take(tup)
-> 1377 return self._getitem_tuple_same_dim(tup)

File ~/work/pandas/pandas/pandas/core/, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
   1017 if com.is_null_slice(key):
   1018     continue
-> 1020 retval = getattr(retval,, axis=i)
   1021 # We should never have retval.ndim < self.ndim, as that should
   1022 #  be handled by the _getitem_lowerdim call above.
   1023 assert retval.ndim == self.ndim

File ~/work/pandas/pandas/pandas/core/, in _LocIndexer._getitem_axis(self, key, axis)
   1409 if isinstance(key, slice):
   1410     self._validate_key(key, axis)
-> 1411     return self._get_slice_axis(key, axis=axis)
   1412 elif com.is_bool_indexer(key):
   1413     return self._getbool_axis(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
   1440     return obj.copy(deep=False)
   1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
   1445 if isinstance(indexer, slice):
   1446     return self.obj._slice(indexer, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexes/, in Index.slice_indexer(self, start, end, step)
   6618 def slice_indexer(
   6619     self,
   6620     start: Hashable | None = None,
   6621     end: Hashable | None = None,
   6622     step: int | None = None,
   6623 ) -> slice:
   6624     """
   6625     Compute the slice indexer for input labels and step.
   6660     slice(1, 3, None)
   6661     """
-> 6662     start_slice, end_slice = self.slice_locs(start, end, step=step)
   6664     # return a slice
   6665     if not is_scalar(start_slice):

File ~/work/pandas/pandas/pandas/core/indexes/, in Index.slice_locs(self, start, end, step)
   6877 start_slice = None
   6878 if start is not None:
-> 6879     start_slice = self.get_slice_bound(start, "left")
   6880 if start_slice is None:
   6881     start_slice = 0

File ~/work/pandas/pandas/pandas/core/indexes/, in Index.get_slice_bound(self, label, side)
   6801         return self._searchsorted_monotonic(label, side)
   6802     except ValueError:
   6803         # raise the original KeyError
-> 6804         raise err
   6806 if isinstance(slc, np.ndarray):
   6807     # get_loc may return a boolean array, which
   6808     # is OK as long as they are representable by a slice.
   6809     assert is_bool_dtype(slc.dtype)

File ~/work/pandas/pandas/pandas/core/indexes/, in Index.get_slice_bound(self, label, side)
   6796 # we need to look up the label
   6797 try:
-> 6798     slc = self.get_loc(label)
   6799 except KeyError as err:
   6800     try:

File ~/work/pandas/pandas/pandas/core/indexes/, in Index.get_loc(self, key)
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
   3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.
   3817     self._check_indexing_error(key)

KeyError: 0

 # 3 is not a unique label
In [220]: df.loc[2:3, :]
KeyError                                  Traceback (most recent call last)
Cell In[220], line 1
----> 1 df.loc[2:3, :]

File ~/work/pandas/pandas/pandas/core/, in _LocationIndexer.__getitem__(self, key)
   1182     if self._is_scalar_access(key):
   1183         return self.obj._get_value(*key, takeable=self._takeable)
-> 1184     return self._getitem_tuple(key)
   1185 else:
   1186     # we by definition only have the 0th axis
   1187     axis = self.axis or 0

File ~/work/pandas/pandas/pandas/core/, in _LocIndexer._getitem_tuple(self, tup)
   1374 if self._multi_take_opportunity(tup):
   1375     return self._multi_take(tup)
-> 1377 return self._getitem_tuple_same_dim(tup)

File ~/work/pandas/pandas/pandas/core/, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
   1017 if com.is_null_slice(key):
   1018     continue
-> 1020 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
   1021 # We should never have retval.ndim < self.ndim, as that should
   1022 #  be handled by the _getitem_lowerdim call above.
   1023 assert retval.ndim == self.ndim

File ~/work/pandas/pandas/pandas/core/, in _LocIndexer._getitem_axis(self, key, axis)
   1409 if isinstance(key, slice):
   1410     self._validate_key(key, axis)
-> 1411     return self._get_slice_axis(key, axis=axis)
   1412 elif com.is_bool_indexer(key):
   1413     return self._getbool_axis(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
   1440     return obj.copy(deep=False)
   1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
   1445 if isinstance(indexer, slice):
   1446     return self.obj._slice(indexer, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexes/, in Index.slice_indexer(self, start, end, step)
   6618 def slice_indexer(
   6619     self,
   6620     start: Hashable | None = None,
   6621     end: Hashable | None = None,
   6622     step: int | None = None,
   6623 ) -> slice:
   6624     """
   6625     Compute the slice indexer for input labels and step.
   6660     slice(1, 3, None)
   6661     """
-> 6662     start_slice, end_slice = self.slice_locs(start, end, step=step)
   6664     # return a slice
   6665     if not is_scalar(start_slice):

File ~/work/pandas/pandas/pandas/core/indexes/, in Index.slice_locs(self, start, end, step)
   6883 end_slice = None
   6884 if end is not None:
-> 6885     end_slice = self.get_slice_bound(end, "right")
   6886 if end_slice is None:
   6887     end_slice = len(self)

File ~/work/pandas/pandas/pandas/core/indexes/, in Index.get_slice_bound(self, label, side)
   6810     slc = lib.maybe_booleans_to_slice(slc.view("u1"))
   6811     if isinstance(slc, np.ndarray):
-> 6812         raise KeyError(
   6813             f"Cannot get {side} slice bound for non-unique "
   6814             f"label: {repr(original_label)}"
   6815         )
   6817 if isinstance(slc, slice):
   6818     if side == "left":

KeyError: 'Cannot get right slice bound for non-unique label: 3'
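The failure above can be reproduced, and avoided, with a small sketch. The DataFrame below is an illustrative stand-in (its index values and the `data` column are assumptions, not necessarily the `df` used in the session above). It shows that a slice bound on a duplicated label raises a KeyError while the index is unsorted, and that the same slice succeeds once the index is sorted into monotonic order.

```python
import pandas as pd

# Illustrative frame (assumed): non-monotonic index in which label 3 appears twice.
df = pd.DataFrame({"data": range(6)}, index=[2, 3, 1, 4, 3, 5])

try:
    df.loc[2:3, :]
except KeyError as exc:
    # The duplicated, non-contiguous label 3 cannot be turned into a slice bound.
    print("KeyError:", exc)

# After sorting, the index is monotonic, so both bounds are well defined
# even though label 3 is still duplicated.
sorted_df = df.sort_index()
print(sorted_df.loc[2:3, :])
```

Sorting is only one way around the error; selecting with a boolean mask such as `df[(df.index >= 2) & (df.index <= 3)]` would also work without reordering the rows.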

Index.is_monotonic_increasing and Index.is_monotonic_decreasing only check that an index is weakly monotonic. To check for strict monotonicity, combine either one with the is_unique attribute.

In [221]: weakly_monotonic = pd.Index(["a", "b", "c", "c"])

In [222]: weakly_monotonic
Out[222]: Index(['a', 'b', 'c', 'c'], dtype='object')

In [223]: weakly_monotonic.is_monotonic_increasing
Out[223]: True

In [224]: weakly_monotonic.is_monotonic_increasing & weakly_monotonic.is_unique
Out[224]: False
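If the strictness check is needed in several places, it can be wrapped in a small function. This is a minimal sketch; `is_strictly_increasing` is a hypothetical helper name, not part of the pandas API.

```python
import pandas as pd


def is_strictly_increasing(index: pd.Index) -> bool:
    """Hypothetical helper: True if the index is increasing and has no duplicates."""
    return bool(index.is_monotonic_increasing and index.is_unique)


weak = pd.Index(["a", "b", "c", "c"])    # weakly increasing: "c" repeats
strict = pd.Index(["a", "b", "c", "d"])  # strictly increasing

print(is_strictly_increasing(weak))
print(is_strictly_increasing(strict))
```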

Endpoints are inclusive

Unlike standard Python sequence slicing, where the slice endpoint is excluded, label-based slicing in pandas includes both endpoints. The primary reason is that it is often not possible to easily determine the “successor”, or next element, after a particular label in an index. For example, consider the following Series:

In [225]: s = pd.Series(np.random.randn(6), index=list("abcdef"))

In [226]: s
Out[226]: 
a   -0.101684
b   -0.734907
c   -0.130121
d   -0.476046
e    0.759104
f    0.213379
dtype: float64

Suppose we wished to slice from c to e. Using integer positions, this would be accomplished as follows:

In [227]: s[2:5]
Out[227]: 
c   -0.130121
d   -0.476046
e    0.759104
dtype: float64

However, if you only had the labels c and e, determining the next element in the index can be somewhat complicated. For example, the following does not work:

In [228]: s.loc['c':'e' + 1]
TypeError                                 Traceback (most recent call last)
Cell In[228], line 1
----> 1 s.loc['c':'e' + 1]

TypeError: can only concatenate str (not "int") to str


A very common use case is to limit a time series to start and end at two specific dates. To enable this, we made the design choice to make label-based slicing include both endpoints:

In [229]: s.loc["c":"e"]
Out[229]: 
c   -0.130121
d   -0.476046
e    0.759104
dtype: float64
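The time series use case mentioned above can be sketched as follows. The dates and values are made up for illustration. The label-based slice keeps both the start and the end date, whereas the equivalent positional slice has to name the position one past the end.

```python
import numpy as np
import pandas as pd

# Illustrative daily series (assumed data): ten consecutive days.
dates = pd.date_range("2024-01-01", periods=10, freq="D")
ts = pd.Series(np.arange(10), index=dates)

# Label-based slice: both 2024-01-03 and 2024-01-06 are included.
window = ts.loc["2024-01-03":"2024-01-06"]
print(window)

# Positional slice: the stop position 6 is excluded, so this selects the same rows.
same_rows = ts.iloc[2:6]
print(window.equals(same_rows))
```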

This is most definitely a “practicality beats purity” sort of thing, but it is something to watch out for if you expect label-based slicing to behave exactly like standard Python integer slicing.

Indexing potentially changes underlying Series dtype

Different indexing operations can potentially change the dtype of a Series.

In [230]: series1 = pd.Series([1, 2, 3])

In [231]: series1.dtype
Out[231]: dtype('int64')

In [232]: res = series1.reindex([0, 4])

In [233]: res.dtype
Out[233]: dtype('float64')

In [234]: res
Out[234]: 
0    1.0
4    NaN
dtype: float64

In [235]: series2 = pd.Series([True])

In [236]: series2.dtype
Out[236]: dtype('bool')

In [237]: res = series2.reindex_like(series1)

In [238]: res.dtype
Out[238]: dtype('O')

In [239]: res
Out[239]: 
0    True
1     NaN
2     NaN
dtype: object

This is because the (re)indexing operations above silently insert NaNs and the dtype changes accordingly. This can cause some issues when using numpy ufuncs such as numpy.logical_and.

See GH 2388 for a more detailed discussion.
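One way to sidestep the dtype change is to tell reindex what to insert for missing labels, or to use a nullable dtype that can represent missing values directly. The sketch below assumes that 0 and False are acceptable placeholders for the missing entries, which will not be true for every dataset.

```python
import numpy as np
import pandas as pd

series1 = pd.Series([1, 2, 3])
series2 = pd.Series([True])

# Default behaviour: NaN is inserted for the missing label, so int64 becomes float64.
print(series1.reindex([0, 4]).dtype)

# Supplying fill_value avoids NaN, so the original dtypes survive.
# The placeholders 0 and False are assumptions chosen for this illustration.
filled_int = series1.reindex([0, 4], fill_value=0)
filled_bool = series2.reindex(series1.index, fill_value=False)
print(filled_int.dtype, filled_bool.dtype)

# With a real bool dtype, numpy ufuncs behave as expected.
mask = np.logical_and(filled_bool, series1 > 0)
print(mask)

# Alternative: the nullable "boolean" dtype keeps missing entries as <NA>.
nullable = series2.astype("boolean").reindex(series1.index)
print(nullable.dtype)
```

Which option fits depends on whether a missing entry should count as a definite False or zero, or should stay explicitly missing.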