Pandas 中文参考指南

Reshaping and pivot tables

pandas 提供了用于操作 SeriesDataFrame 的方法,以更改数据的表示形式,以便进一步进行数据处理或数据汇总。

pandas provides methods for manipulating a Series and DataFrame to alter the representation of the data for further data processing or data summarization.

  1. pivot() and pivot_table(): Group unique values within one or more discrete categories.

  2. stack() and unstack(): Pivot a column or row level to the opposite axis respectively.

  3. melt() and wide_to_long(): Unpivot a wide DataFrame to a long format.

  4. get_dummies() and from_dummies(): Conversions with indicator variables.

  5. explode(): Convert a column of list-like values to individual rows.

  6. crosstab(): Calculate a cross-tabulation of multiple 1 dimensional factor arrays.

  7. cut(): Transform continuous variables to discrete, categorical values

  8. factorize(): Encode 1 dimensional variables into integer labels.

pivot() and pivot_table()

pivot()

数据通常存储在所谓的“堆叠”或“记录”格式中。在“记录”或“宽”格式中,通常每位对象有一行。在“堆叠”或“长”格式中,每位对象有多行(若适用)。

Data is often stored in so-called “stacked” or “record” format. In a “record” or “wide” format, typically there is one row for each subject. In the “stacked” or “long” format there are multiple rows for each subject where applicable.

In [1]: data = {
   ...:    "value": range(12),
   ...:    "variable": ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3,
   ...:    "date": pd.to_datetime(["2020-01-03", "2020-01-04", "2020-01-05"] * 4)
   ...: }
   ...:

In [2]: df = pd.DataFrame(data)

为了使用每个唯一变量执行时间序列操作,更好的表示方式是 columns 是唯一变量,并且日期的 index 标识单个观察。为了将数据重新塑造成此形式,我们使用 DataFrame.pivot() 方法(也作为顶级函数 pivot() 实现):

To perform time series operations with each unique variable, a better representation would be where the columns are the unique variables and an index of dates identifies individual observations. To reshape the data into this form, we use the DataFrame.pivot() method (also implemented as a top level function pivot()):

In [3]: pivoted = df.pivot(index="date", columns="variable", values="value")

In [4]: pivoted
Out[4]:
variable    A  B  C   D
date
2020-01-03  0  3  6   9
2020-01-04  1  4  7  10
2020-01-05  2  5  8  11

如果省略 values 参数,并且输入 DataFrame 具有多个值列(未使用作 pivot() 的列或索引输入),则生成的“透视” DataFrame 将具有 hierarchical columns,其最顶层表示各自的值列:

If the values argument is omitted, and the input DataFrame has more than one column of values which are not used as column or index inputs to pivot(), then the resulting “pivoted” DataFrame will have hierarchical columns whose topmost level indicates the respective value column:

In [5]: df["value2"] = df["value"] * 2

In [6]: pivoted = df.pivot(index="date", columns="variable")

In [7]: pivoted
Out[7]:
           value           value2
variable       A  B  C   D      A   B   C   D
date
2020-01-03     0  3  6   9      0   6  12  18
2020-01-04     1  4  7  10      2   8  14  20
2020-01-05     2  5  8  11      4  10  16  22

然后,您可以从透视 DataFrame 中选择子集:

You can then select subsets from the pivoted DataFrame:

In [8]: pivoted["value2"]
Out[8]:
variable    A   B   C   D
date
2020-01-03  0   6  12  18
2020-01-04  2   8  14  20
2020-01-05  4  10  16  22

请注意,在数据为同质类型的情况下,这将返回一个基础数据视图。

Note that this returns a view on the underlying data in the case where the data are homogeneously-typed.

pivot() 只能处理由 indexcolumns 指定的唯一行。如果您的数据包含重复项,请使用 pivot_table()

pivot() can only handle unique rows specified by index and columns. If you data contains duplicates, use pivot_table().

pivot_table()

尽管 pivot() 提供了具有各种数据类型的通用透视,但 pandas 还提供了 pivot_table()pivot_table() 用于对数字数据进行聚合透视。

While pivot() provides general purpose pivoting with various data types, pandas also provides pivot_table() or pivot_table() for pivoting with aggregation of numeric data.

函数 pivot_table() 可用于创建电子表格样式的透视表。有关一些高级策略,请参见 cookbook

The function pivot_table() can be used to create spreadsheet-style pivot tables. See the cookbook for some advanced strategies.

In [9]: import datetime

In [10]: df = pd.DataFrame(
   ....:     {
   ....:         "A": ["one", "one", "two", "three"] * 6,
   ....:         "B": ["A", "B", "C"] * 8,
   ....:         "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 4,
   ....:         "D": np.random.randn(24),
   ....:         "E": np.random.randn(24),
   ....:         "F": [datetime.datetime(2013, i, 1) for i in range(1, 13)]
   ....:         + [datetime.datetime(2013, i, 15) for i in range(1, 13)],
   ....:     }
   ....: )
   ....:

In [11]: df
Out[11]:
        A  B    C         D         E          F
0     one  A  foo  0.469112  0.404705 2013-01-01
1     one  B  foo -0.282863  0.577046 2013-02-01
2     two  C  foo -1.509059 -1.715002 2013-03-01
3   three  A  bar -1.135632 -1.039268 2013-04-01
4     one  B  bar  1.212112 -0.370647 2013-05-01
..    ... ..  ...       ...       ...        ...
19  three  B  foo -1.087401 -0.472035 2013-08-15
20    one  C  foo -0.673690 -0.013960 2013-09-15
21    one  A  bar  0.113648 -0.362543 2013-10-15
22    two  B  bar -1.478427 -0.006154 2013-11-15
23  three  C  bar  0.524988 -0.923061 2013-12-15

[24 rows x 6 columns]

In [12]: pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])
Out[12]:
C             bar       foo
A     B
one   A -0.995460  0.595334
      B  0.393570 -0.494817
      C  0.196903 -0.767769
three A -0.431886       NaN
      B       NaN -1.065818
      C  0.798396       NaN
two   A       NaN  0.197720
      B -0.986678       NaN
      C       NaN -1.274317

In [13]: pd.pivot_table(
   ....:     df, values=["D", "E"],
   ....:     index=["B"],
   ....:     columns=["A", "C"],
   ....:     aggfunc="sum",
   ....: )
   ....:
Out[13]:
          D                      ...         E
A       one               three  ...     three      two
C       bar       foo       bar  ...       foo      bar       foo
B                                ...
A -1.990921  1.190667 -0.863772  ...       NaN      NaN -1.067650
B  0.787140 -0.989634       NaN  ...  0.372851  1.63741       NaN
C  0.393806 -1.535539  1.596791  ...       NaN      NaN -3.491906

[3 rows x 12 columns]

In [14]: pd.pivot_table(
   ....:     df, values="E",
   ....:     index=["B", "C"],
   ....:     columns=["A"],
   ....:     aggfunc=["sum", "mean"],
   ....: )
   ....:
Out[14]:
            sum                          mean
A           one     three       two       one     three       two
B C
A bar -0.471593 -2.008182       NaN -0.235796 -1.004091       NaN
  foo  0.761726       NaN -1.067650  0.380863       NaN -0.533825
B bar -1.665170       NaN  1.637410 -0.832585       NaN  0.818705
  foo -0.097554  0.372851       NaN -0.048777  0.186425       NaN
C bar -0.744154 -2.392449       NaN -0.372077 -1.196224       NaN
  foo  1.061810       NaN -3.491906  0.530905       NaN -1.745953

结果是 DataFrame 可能在索引或列上进行 MultiIndex。如果未给定 values 列名称,则数据透视表将在列中其他层次结构级别中包含所有数据:

The result is a DataFrame potentially having a MultiIndex on the index or column. If the values column name is not given, the pivot table will include all of the data in an additional level of hierarchy in the columns:

In [15]: pd.pivot_table(df[["A", "B", "C", "D", "E"]], index=["A", "B"], columns=["C"])
Out[15]:
                D                   E
C             bar       foo       bar       foo
A     B
one   A -0.995460  0.595334 -0.235796  0.380863
      B  0.393570 -0.494817 -0.832585 -0.048777
      C  0.196903 -0.767769 -0.372077  0.530905
three A -0.431886       NaN -1.004091       NaN
      B       NaN -1.065818       NaN  0.186425
      C  0.798396       NaN -1.196224       NaN
two   A       NaN  0.197720       NaN -0.533825
      B -0.986678       NaN  0.818705       NaN
      C       NaN -1.274317       NaN -1.745953

此外,你可以将 Grouper 用于 indexcolumns 关键字。有关 Grouper 的详细信息,请参见 Grouping with a Grouper specification

Also, you can use Grouper for index and columns keywords. For detail of Grouper, see Grouping with a Grouper specification.

In [16]: pd.pivot_table(df, values="D", index=pd.Grouper(freq="ME", key="F"), columns="C")
Out[16]:
C                bar       foo
F
2013-01-31       NaN  0.595334
2013-02-28       NaN -0.494817
2013-03-31       NaN -1.274317
2013-04-30 -0.431886       NaN
2013-05-31  0.393570       NaN
2013-06-30  0.196903       NaN
2013-07-31       NaN  0.197720
2013-08-31       NaN -1.065818
2013-09-30       NaN -0.767769
2013-10-31 -0.995460       NaN
2013-11-30 -0.986678       NaN
2013-12-31  0.798396       NaN

margins=True 传递给 pivot_table() 将添加一行和一列,其中包含 All 标记,其中包含行和列上的部分组聚合:

Passing margins=True to pivot_table() will add a row and column with an All label with partial group aggregates across the categories on the rows and columns:

In [17]: table = df.pivot_table(
   ....:     index=["A", "B"],
   ....:     columns="C",
   ....:     values=["D", "E"],
   ....:     margins=True,
   ....:     aggfunc="std"
   ....: )
   ....:

In [18]: table
Out[18]:
                D                             E
C             bar       foo       All       bar       foo       All
A     B
one   A  1.568517  0.178504  1.293926  0.179247  0.033718  0.371275
      B  1.157593  0.299748  0.860059  0.653280  0.885047  0.779837
      C  0.523425  0.133049  0.638297  1.111310  0.770555  0.938819
three A  0.995247       NaN  0.995247  0.049748       NaN  0.049748
      B       NaN  0.030522  0.030522       NaN  0.931203  0.931203
      C  0.386657       NaN  0.386657  0.386312       NaN  0.386312
two   A       NaN  0.111032  0.111032       NaN  1.146201  1.146201
      B  0.695438       NaN  0.695438  1.166526       NaN  1.166526
      C       NaN  0.331975  0.331975       NaN  0.043771  0.043771
All      1.014073  0.713941  0.871016  0.881376  0.984017  0.923568

此外,你可以调用 DataFrame.stack(),以将数据透视 DataFrame 显示为具有多级索引:

Additionally, you can call DataFrame.stack() to display a pivoted DataFrame as having a multi-level index:

In [19]: table.stack(future_stack=True)
Out[19]:
                  D         E
A   B C
one A bar  1.568517  0.179247
      foo  0.178504  0.033718
      All  1.293926  0.371275
    B bar  1.157593  0.653280
      foo  0.299748  0.885047
...             ...       ...
two C foo  0.331975  0.043771
      All  0.331975  0.043771
All   bar  1.014073  0.881376
      foo  0.713941  0.984017
      All  0.871016  0.923568

[30 rows x 2 columns]

stack() and unstack()

pivot() 方法密切相关的是 SeriesDataFrame 上提供的 related stack()unstack() 方法。这些方法旨在与 MultiIndex 对象配合使用(请参阅有关 hierarchical indexing 的部分)。

Closely related to the pivot() method are the related stack() and unstack() methods available on Series and DataFrame. These methods are designed to work together with MultiIndex objects (see the section on hierarchical indexing).

  1. stack(): “pivot” a level of the (possibly hierarchical) column labels, returning a DataFrame with an index with a new inner-most level of row labels.

  2. unstack(): (inverse operation of stack()) “pivot” a level of the (possibly hierarchical) row index to the column axis, producing a reshaped DataFrame with a new inner-most level of column labels.

In [20]: tuples = [
   ....:    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ....:    ["one", "two", "one", "two", "one", "two", "one", "two"],
   ....: ]
   ....:

In [21]: index = pd.MultiIndex.from_arrays(tuples, names=["first", "second"])

In [22]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])

In [23]: df2 = df[:4]

In [24]: df2
Out[24]:
                     A         B
first second
bar   one     0.895717  0.805244
      two    -1.206412  2.565646
baz   one     1.431256  1.340309
      two    -1.170299 -0.226169

stack() 函数“压缩” DataFrame 列中的一个级别,以生成以下内容:

The stack() function “compresses” a level in the DataFrame columns to produce either:

  1. A Series, in the case of a Index in the columns.

  2. A DataFrame, in the case of a MultiIndex in the columns.

如果列具有 MultiIndex,你可以选择要堆叠哪个级别。堆叠的级别成为列上 MultiIndex 的新最低级别:

If the columns have a MultiIndex, you can choose which level to stack. The stacked level becomes the new lowest level in a MultiIndex on the columns:

In [25]: stacked = df2.stack(future_stack=True)

In [26]: stacked
Out[26]:
first  second
bar    one     A    0.895717
               B    0.805244
       two     A   -1.206412
               B    2.565646
baz    one     A    1.431256
               B    1.340309
       two     A   -1.170299
               B   -0.226169
dtype: float64

对于“堆叠” DataFrameSeries(其中 MultiIndex 作为 index), stack() 的逆运算为 unstack(),它默认取消堆叠最后一层:

With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:

In [27]: stacked.unstack()
Out[27]:
                     A         B
first second
bar   one     0.895717  0.805244
      two    -1.206412  2.565646
baz   one     1.431256  1.340309
      two    -1.170299 -0.226169

In [28]: stacked.unstack(1)
Out[28]:
second        one       two
first
bar   A  0.895717 -1.206412
      B  0.805244  2.565646
baz   A  1.431256 -1.170299
      B  1.340309 -0.226169

In [29]: stacked.unstack(0)
Out[29]:
first          bar       baz
second
one    A  0.895717  1.431256
       B  0.805244  1.340309
two    A -1.206412 -1.170299
       B  2.565646 -0.226169

如果索引有名称,则可以使用级别名称,而不必指定级别编号:

If the indexes have names, you can use the level names instead of specifying the level numbers:

In [30]: stacked.unstack("second")
Out[30]:
second        one       two
first
bar   A  0.895717 -1.206412
      B  0.805244  2.565646
baz   A  1.431256 -1.170299
      B  1.340309 -0.226169

请注意, stack()unstack() 方法会隐式对所涉及的索引级别进行排序。因此,调用 stack(),然后调用 unstack(),或 vice versa,将生成原始 DataFrameSeries 的已排序副本:

Notice that the stack() and unstack() methods implicitly sort the index levels involved. Hence a call to stack() and then unstack(), or vice versa, will result in a sorted copy of the original DataFrame or Series:

In [31]: index = pd.MultiIndex.from_product([[2, 1], ["a", "b"]])

In [32]: df = pd.DataFrame(np.random.randn(4), index=index, columns=["A"])

In [33]: df
Out[33]:
            A
2 a -1.413681
  b  1.607920
1 a  1.024180
  b  0.569605

In [34]: all(df.unstack().stack(future_stack=True) == df.sort_index())
Out[34]: True

Multiple levels

你还可以一次堆叠或取消堆叠多个级别,方法是传递一个级别列表,在这种情况下,最终结果就如同逐个处理列表中的每个级别一样。

You may also stack or unstack more than one level at a time by passing a list of levels, in which case the end result is as if each level in the list were processed individually.

In [35]: columns = pd.MultiIndex.from_tuples(
   ....:     [
   ....:         ("A", "cat", "long"),
   ....:         ("B", "cat", "long"),
   ....:         ("A", "dog", "short"),
   ....:         ("B", "dog", "short"),
   ....:     ],
   ....:     names=["exp", "animal", "hair_length"],
   ....: )
   ....:

In [36]: df = pd.DataFrame(np.random.randn(4, 4), columns=columns)

In [37]: df
Out[37]:
exp                 A         B         A         B
animal            cat       cat       dog       dog
hair_length      long      long     short     short
0            0.875906 -2.211372  0.974466 -2.006747
1           -0.410001 -0.078638  0.545952 -1.219217
2           -1.226825  0.769804 -1.281247 -0.727707
3           -0.121306 -0.097883  0.695775  0.341734

In [38]: df.stack(level=["animal", "hair_length"], future_stack=True)
Out[38]:
exp                          A         B
  animal hair_length
0 cat    long         0.875906 -2.211372
  dog    short        0.974466 -2.006747
1 cat    long        -0.410001 -0.078638
  dog    short        0.545952 -1.219217
2 cat    long        -1.226825  0.769804
  dog    short       -1.281247 -0.727707
3 cat    long        -0.121306 -0.097883
  dog    short        0.695775  0.341734

级别列表可以包含级别名称或级别编号,但不能同时包含这两种类型。

The list of levels can contain either level names or level numbers but not a mixture of the two.

# df.stack(level=['animal', 'hair_length'], future_stack=True)
# from above is equivalent to:
In [39]: df.stack(level=[1, 2], future_stack=True)
Out[39]:
exp                          A         B
  animal hair_length
0 cat    long         0.875906 -2.211372
  dog    short        0.974466 -2.006747
1 cat    long        -0.410001 -0.078638
  dog    short        0.545952 -1.219217
2 cat    long        -1.226825  0.769804
  dog    short       -1.281247 -0.727707
3 cat    long        -0.121306 -0.097883
  dog    short        0.695775  0.341734

Missing data

如果子组不具有相同的标签集,则取消堆叠可能会导致缺失值。默认情况下,将用该数据类型的默认填充值替换缺失值。

Unstacking can result in missing values if subgroups do not have the same set of labels. By default, missing values will be replaced with the default fill value for that data type.

In [40]: columns = pd.MultiIndex.from_tuples(
   ....:     [
   ....:         ("A", "cat"),
   ....:         ("B", "dog"),
   ....:         ("B", "cat"),
   ....:         ("A", "dog"),
   ....:     ],
   ....:     names=["exp", "animal"],
   ....: )
   ....:

In [41]: index = pd.MultiIndex.from_product(
   ....:     [("bar", "baz", "foo", "qux"), ("one", "two")], names=["first", "second"]
   ....: )
   ....:

In [42]: df = pd.DataFrame(np.random.randn(8, 4), index=index, columns=columns)

In [43]: df3 = df.iloc[[0, 1, 4, 7], [1, 2]]

In [44]: df3
Out[44]:
exp                  B
animal             dog       cat
first second
bar   one    -1.110336 -0.619976
      two     0.687738  0.176444
foo   one     1.314232  0.690579
qux   two     0.380396  0.084844

In [45]: df3.unstack()
Out[45]:
exp            B
animal       dog                 cat
second       one       two       one       two
first
bar    -1.110336  0.687738 -0.619976  0.176444
foo     1.314232       NaN  0.690579       NaN
qux          NaN  0.380396       NaN  0.084844

可以使用 fill_value 参数用特定值填充缺失值。

The missing value can be filled with a specific value with the fill_value argument.

In [46]: df3.unstack(fill_value=-1e9)
Out[46]:
exp                B
animal           dog                         cat
second           one           two           one           two
first
bar    -1.110336e+00  6.877384e-01 -6.199759e-01  1.764443e-01
foo     1.314232e+00 -1.000000e+09  6.905793e-01 -1.000000e+09
qux    -1.000000e+09  3.803956e-01 -1.000000e+09  8.484421e-02

melt() and wide_to_long()

顶级 melt() 函数和相应的 DataFrame.melt() 可用于将 DataFrame 整理成一种格式,其中一列或多列是标识符变量,而所有其他列(视为度量变量)被“取消透视”到行轴,仅留下两个非标识符列,“变量”和“值”。可以通过提供 var_namevalue_name 参数自定义这些列的名称。

The top-level melt() function and the corresponding DataFrame.melt() are useful to massage a DataFrame into a format where one or more columns are identifier variables, while all other columns, considered measured variables, are “unpivoted” to the row axis, leaving just two non-identifier columns, “variable” and “value”. The names of those columns can be customized by supplying the var_name and value_name parameters.

In [47]: cheese = pd.DataFrame(
   ....:     {
   ....:         "first": ["John", "Mary"],
   ....:         "last": ["Doe", "Bo"],
   ....:         "height": [5.5, 6.0],
   ....:         "weight": [130, 150],
   ....:     }
   ....: )
   ....:

In [48]: cheese
Out[48]:
  first last  height  weight
0  John  Doe     5.5     130
1  Mary   Bo     6.0     150

In [49]: cheese.melt(id_vars=["first", "last"])
Out[49]:
  first last variable  value
0  John  Doe   height    5.5
1  Mary   Bo   height    6.0
2  John  Doe   weight  130.0
3  Mary   Bo   weight  150.0

In [50]: cheese.melt(id_vars=["first", "last"], var_name="quantity")
Out[50]:
  first last quantity  value
0  John  Doe   height    5.5
1  Mary   Bo   height    6.0
2  John  Doe   weight  130.0
3  Mary   Bo   weight  150.0

使用 melt() 转换数据框时,将忽略索引。通过将 ignore_index=False 参数设定为 False,可保留原始索引值(默认为 True)。但 ignore_index=False 将重复索引值。

When transforming a DataFrame using melt(), the index will be ignored. The original index values can be kept by setting the ignore_index=False parameter to False (default is True). ignore_index=False will however duplicate index values.

In [51]: index = pd.MultiIndex.from_tuples([("person", "A"), ("person", "B")])

In [52]: cheese = pd.DataFrame(
   ....:     {
   ....:         "first": ["John", "Mary"],
   ....:         "last": ["Doe", "Bo"],
   ....:         "height": [5.5, 6.0],
   ....:         "weight": [130, 150],
   ....:     },
   ....:     index=index,
   ....: )
   ....:

In [53]: cheese
Out[53]:
         first last  height  weight
person A  John  Doe     5.5     130
       B  Mary   Bo     6.0     150

In [54]: cheese.melt(id_vars=["first", "last"])
Out[54]:
  first last variable  value
0  John  Doe   height    5.5
1  Mary   Bo   height    6.0
2  John  Doe   weight  130.0
3  Mary   Bo   weight  150.0

In [55]: cheese.melt(id_vars=["first", "last"], ignore_index=False)
Out[55]:
         first last variable  value
person A  John  Doe   height    5.5
       B  Mary   Bo   height    6.0
       A  John  Doe   weight  130.0
       B  Mary   Bo   weight  150.0

wide_to_long() 类似于 melt(),具有更多定制化的列匹配功能。

wide_to_long() is similar to melt() with more customization for column matching.

In [56]: dft = pd.DataFrame(
   ....:     {
   ....:         "A1970": {0: "a", 1: "b", 2: "c"},
   ....:         "A1980": {0: "d", 1: "e", 2: "f"},
   ....:         "B1970": {0: 2.5, 1: 1.2, 2: 0.7},
   ....:         "B1980": {0: 3.2, 1: 1.3, 2: 0.1},
   ....:         "X": dict(zip(range(3), np.random.randn(3))),
   ....:     }
   ....: )
   ....:

In [57]: dft["id"] = dft.index

In [58]: dft
Out[58]:
  A1970 A1980  B1970  B1980         X  id
0     a     d    2.5    3.2  1.519970   0
1     b     e    1.2    1.3 -0.493662   1
2     c     f    0.7    0.1  0.600178   2

In [59]: pd.wide_to_long(dft, ["A", "B"], i="id", j="year")
Out[59]:
                X  A    B
id year
0  1970  1.519970  a  2.5
1  1970 -0.493662  b  1.2
2  1970  0.600178  c  0.7
0  1980  1.519970  d  3.2
1  1980 -0.493662  e  1.3
2  1980  0.600178  f  0.1

get_dummies() and from_dummies()

若想将 Series 的类别变量转换为“虚拟”或“指示”, get_dummies() 会创建一个新的 DataFrame,其中包含唯一变量的列,以及指示每行存在的那些变量的值。

To convert categorical variables of a Series into a “dummy” or “indicator”, get_dummies() creates a new DataFrame with columns of the unique variables and the values representing the presence of those variables per row.

In [60]: df = pd.DataFrame({"key": list("bbacab"), "data1": range(6)})

In [61]: pd.get_dummies(df["key"])
Out[61]:
       a      b      c
0  False   True  False
1  False   True  False
2   True  False  False
3  False  False   True
4   True  False  False
5  False   True  False

In [62]: df["key"].str.get_dummies()
Out[62]:
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0

prefix 向列名添加前缀,对于将结果与原始 DataFrame 合并很有用:

prefix adds a prefix to the the column names which is useful for merging the result with the original DataFrame:

In [63]: dummies = pd.get_dummies(df["key"], prefix="key")

In [64]: dummies
Out[64]:
   key_a  key_b  key_c
0  False   True  False
1  False   True  False
2   True  False  False
3  False  False   True
4   True  False  False
5  False   True  False

In [65]: df[["data1"]].join(dummies)
Out[65]:
   data1  key_a  key_b  key_c
0      0  False   True  False
1      1  False   True  False
2      2   True  False  False
3      3  False  False   True
4      4   True  False  False
5      5  False   True  False

此函数通常与 cut() 这样的离散化函数一起使用:

This function is often used along with discretization functions like cut():

In [66]: values = np.random.randn(10)

In [67]: values
Out[67]:
array([ 0.2742,  0.1329, -0.0237,  2.4102,  1.4505,  0.2061, -0.2519,
       -2.2136,  1.0633,  1.2661])

In [68]: bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

In [69]: pd.get_dummies(pd.cut(values, bins))
Out[69]:
   (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]
0       False        True       False       False       False
1        True       False       False       False       False
2       False       False       False       False       False
3       False       False       False       False       False
4       False       False       False       False       False
5       False        True       False       False       False
6       False       False       False       False       False
7       False       False       False       False       False
8       False       False       False       False       False
9       False       False       False       False       False

get_dummies() 还接受 DataFrame。默认情况下,objectstringcategorical 类型的列会被编码为虚拟变量,而其他列则保持不变。

get_dummies() also accepts a DataFrame. By default, object, string, or categorical type columns are encoded as dummy variables with other columns unaltered.

In [70]: df = pd.DataFrame({"A": ["a", "b", "a"], "B": ["c", "c", "b"], "C": [1, 2, 3]})

In [71]: pd.get_dummies(df)
Out[71]:
   C    A_a    A_b    B_b    B_c
0  1   True  False  False   True
1  2  False   True  False   True
2  3   True  False   True  False

指定 columns 关键字将对任何类型的列进行编码。

Specifying the columns keyword will encode a column of any type.

In [72]: pd.get_dummies(df, columns=["A"])
Out[72]:
   B  C    A_a    A_b
0  c  1   True  False
1  c  2  False   True
2  b  3   True  False

Series 版本一样,您可以为 prefixprefix_sep 传递值。默认情况下,列名用作前缀,_ 用作前缀分隔符。您有 3 种方式可以指定 prefixprefix_sep

As with the Series version, you can pass values for the prefix and prefix_sep. By default the column name is used as the prefix and _ as the prefix separator. You can specify prefix and prefix_sep in 3 ways:

  1. string: Use the same value for prefix or prefix_sep for each column to be encoded.

  2. list: Must be the same length as the number of columns being encoded.

  3. dict: Mapping column name to prefix.

In [73]: simple = pd.get_dummies(df, prefix="new_prefix")

In [74]: simple
Out[74]:
   C  new_prefix_a  new_prefix_b  new_prefix_b  new_prefix_c
0  1          True         False         False          True
1  2         False          True         False          True
2  3          True         False          True         False

In [75]: from_list = pd.get_dummies(df, prefix=["from_A", "from_B"])

In [76]: from_list
Out[76]:
   C  from_A_a  from_A_b  from_B_b  from_B_c
0  1      True     False     False      True
1  2     False      True     False      True
2  3      True     False      True     False

In [77]: from_dict = pd.get_dummies(df, prefix={"B": "from_B", "A": "from_A"})

In [78]: from_dict
Out[78]:
   C  from_A_a  from_A_b  from_B_b  from_B_c
0  1      True     False     False      True
1  2     False      True     False      True
2  3      True     False      True     False

若想在将结果提供给统计模型时避免共线性,请指定 drop_first=True

To avoid collinearity when feeding the result to statistical models, specify drop_first=True.

In [79]: s = pd.Series(list("abcaa"))

In [80]: pd.get_dummies(s)
Out[80]:
       a      b      c
0   True  False  False
1  False   True  False
2  False  False   True
3   True  False  False
4   True  False  False

In [81]: pd.get_dummies(s, drop_first=True)
Out[81]:
       b      c
0  False  False
1   True  False
2  False   True
3  False  False
4  False  False

当列仅包含一个级别时,它将从结果中省略。

When a column contains only one level, it will be omitted in the result.

In [82]: df = pd.DataFrame({"A": list("aaaaa"), "B": list("ababc")})

In [83]: pd.get_dummies(df)
Out[83]:
    A_a    B_a    B_b    B_c
0  True   True  False  False
1  True  False   True  False
2  True   True  False  False
3  True  False   True  False
4  True  False  False   True

In [84]: pd.get_dummies(df, drop_first=True)
Out[84]:
     B_b    B_c
0  False  False
1   True  False
2  False  False
3   True  False
4  False   True

可以使用 dtype 参数将值转换为不同类型。

The values can be cast to a different type using the dtype argument.

In [85]: df = pd.DataFrame({"A": list("abc"), "B": [1.1, 2.2, 3.3]})

In [86]: pd.get_dummies(df, dtype=np.float32).dtypes
Out[86]:
B      float64
A_a    float32
A_b    float32
A_c    float32
dtype: object

1.5.0 版中的新增功能。

New in version 1.5.0.

from_dummies()get_dummies() 的输出转换回由指示值生成的类别值 Series

from_dummies() converts the output of get_dummies() back into a Series of categorical values from indicator values.

In [87]: df = pd.DataFrame({"prefix_a": [0, 1, 0], "prefix_b": [1, 0, 1]})

In [88]: df
Out[88]:
   prefix_a  prefix_b
0         0         1
1         1         0
2         0         1

In [89]: pd.from_dummies(df, sep="_")
Out[89]:
  prefix
0      b
1      a
2      b

虚拟编码数据仅需要包括 k - 1 类别,在这种情况下,最后一个类别是默认类别。可以使用 default_category 更改默认类别。

Dummy coded data only requires k - 1 categories to be included, in this case the last category is the default category. The default category can be modified with default_category.

In [90]: df = pd.DataFrame({"prefix_a": [0, 1, 0]})

In [91]: df
Out[91]:
   prefix_a
0         0
1         1
2         0

In [92]: pd.from_dummies(df, sep="_", default_category="b")
Out[92]:
  prefix
0      b
1      a
2      b

explode()

对于具有嵌套列表值 DataFrame 的列, explode() 会将每个列表值转换为单独的一行。生成的 Index 将根据原始行的索引标签进行复制:

For a DataFrame column with nested, list-like values, explode() will transform each list-like value to a separate row. The resulting Index will be duplicated corresponding to the index label from the original row:

In [93]: keys = ["panda1", "panda2", "panda3"]

In [94]: values = [["eats", "shoots"], ["shoots", "leaves"], ["eats", "leaves"]]

In [95]: df = pd.DataFrame({"keys": keys, "values": values})

In [96]: df
Out[96]:
     keys            values
0  panda1    [eats, shoots]
1  panda2  [shoots, leaves]
2  panda3    [eats, leaves]

In [97]: df["values"].explode()
Out[97]:
0      eats
0    shoots
1    shoots
1    leaves
2      eats
2    leaves
Name: values, dtype: object

DataFrame.explode 还可以炸开 DataFrame 中的列。

DataFrame.explode can also explode the column in the DataFrame.

In [98]: df.explode("values")
Out[98]:
     keys  values
0  panda1    eats
0  panda1  shoots
1  panda2  shoots
1  panda2  leaves
2  panda3    eats
2  panda3  leaves

Series.explode() 将使用缺失值指示符替换空列表并保留标量条目。

Series.explode() will replace empty lists with a missing value indicator and preserve scalar entries.

In [99]: s = pd.Series([[1, 2, 3], "foo", [], ["a", "b"]])

In [100]: s
Out[100]:
0    [1, 2, 3]
1          foo
2           []
3       [a, b]
dtype: object

In [101]: s.explode()
Out[101]:
0      1
0      2
0      3
1    foo
2    NaN
3      a
3      b
dtype: object

以逗号分隔的字符串值可以拆分为列表中的单个值,然后拆分为新行。

A comma-separated string value can be split into individual values in a list and then exploded to a new row.

In [102]: df = pd.DataFrame([{"var1": "a,b,c", "var2": 1}, {"var1": "d,e,f", "var2": 2}])

In [103]: df.assign(var1=df.var1.str.split(",")).explode("var1")
Out[103]:
  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2

crosstab()

使用 crosstab() 计算两个(或更多)因子的联合表。默认情况下, crosstab() 会计算因子的频率表,除非传入一个值数组和一个聚合函数。

Use crosstab() to compute a cross-tabulation of two (or more) factors. By default crosstab() computes a frequency table of the factors unless an array of values and an aggregation function are passed.

任何传入的 Series 都将使用其名称属性,除非指定了联合表的行名或列名。

Any Series passed will have their name attributes used unless row or column names for the cross-tabulation are specified

In [104]: a = np.array(["foo", "foo", "bar", "bar", "foo", "foo"], dtype=object)

In [105]: b = np.array(["one", "one", "two", "one", "two", "one"], dtype=object)

In [106]: c = np.array(["dull", "dull", "shiny", "dull", "dull", "shiny"], dtype=object)

In [107]: pd.crosstab(a, [b, c], rownames=["a"], colnames=["b", "c"])
Out[107]:
b    one        two
c   dull shiny dull shiny
a
bar    1     0    0     1
foo    2     1    1     0

如果 crosstab() 仅接收两个 Series,它将提供一个频率表。

If crosstab() receives only two Series, it will provide a frequency table.

In [108]: df = pd.DataFrame(
   .....:     {"A": [1, 2, 2, 2, 2], "B": [3, 3, 4, 4, 4], "C": [1, 1, np.nan, 1, 1]}
   .....: )
   .....:

In [109]: df
Out[109]:
   A  B    C
0  1  3  1.0
1  2  3  1.0
2  2  4  NaN
3  2  4  1.0
4  2  4  1.0

In [110]: pd.crosstab(df["A"], df["B"])
Out[110]:
B  3  4
A
1  1  0
2  1  3

crosstab() 还可以总结为 Categorical 数据。

crosstab() can also summarize to Categorical data.

In [111]: foo = pd.Categorical(["a", "b"], categories=["a", "b", "c"])

In [112]: bar = pd.Categorical(["d", "e"], categories=["d", "e", "f"])

In [113]: pd.crosstab(foo, bar)
Out[113]:
col_0  d  e
row_0
a      1  0
b      0  1

对于 Categorical 数据,如果实际数据不包含任何特定类别的实例,则使用 dropna=False 包含所有数据类别。

For Categorical data, to include all of data categories even if the actual data does not contain any instances of a particular category, use dropna=False.

In [114]: pd.crosstab(foo, bar, dropna=False)
Out[114]:
col_0  d  e  f
row_0
a      1  0  0
b      0  1  0
c      0  0  0

Normalization

频率表还可以使用 normalize 参数标准化,以显示百分比而不是计数:

Frequency tables can also be normalized to show percentages rather than counts using the normalize argument:

In [115]: pd.crosstab(df["A"], df["B"], normalize=True)
Out[115]:
B    3    4
A
1  0.2  0.0
2  0.2  0.6

normalize 还可以按每行或每列标准化值:

normalize can also normalize values within each row or within each column:

In [116]: pd.crosstab(df["A"], df["B"], normalize="columns")
Out[116]:
B    3    4
A
1  0.5  0.0
2  0.5  1.0

crosstab() 还可以接受第三个 Series 和一个聚合函数(aggfunc),用于在前两个 Series 定义的每个组内,应用于第三个 Series 的值:

crosstab() can also accept a third Series and an aggregation function (aggfunc) that will be applied to the values of the third Series within each group defined by the first two Series:

In [117]: pd.crosstab(df["A"], df["B"], values=df["C"], aggfunc="sum")
Out[117]:
B    3    4
A
1  1.0  NaN
2  1.0  2.0

Adding margins

margins=True 将添加一行和一列带有 All 标签,其中包含跨行和列上的类别部分组汇总:

margins=True will add a row and column with an All label with partial group aggregates across the categories on the rows and columns:

In [118]: pd.crosstab(
   .....:     df["A"], df["B"], values=df["C"], aggfunc="sum", normalize=True, margins=True
   .....: )
   .....:
Out[118]:
B       3    4   All
A
1    0.25  0.0  0.25
2    0.25  0.5  0.75
All  0.50  0.5  1.00

cut()

cut() 函数计算输入数组值的分组,通常用于将连续变量转换为离散变量或分类变量:

The cut() function computes groupings for the values of the input array and is often used to transform continuous variables to discrete or categorical variables:

一个整数 bins 将形成等宽箱。

An integer bins will form equal-width bins.

In [119]: ages = np.array([10, 15, 13, 12, 23, 25, 28, 59, 60])

In [120]: pd.cut(ages, bins=3)
Out[120]:
[(9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (26.667, 43.333], (43.333, 60.0], (43.333, 60.0]]
Categories (3, interval[float64, right]): [(9.95, 26.667] < (26.667, 43.333] < (43.333, 60.0]]

一个有序的容器边缘列表将为每个变量分配一个区间。

A list of ordered bin edges will assign an interval for each variable.

In [121]: pd.cut(ages, bins=[0, 18, 35, 70])
Out[121]:
[(0, 18], (0, 18], (0, 18], (0, 18], (18, 35], (18, 35], (18, 35], (35, 70], (35, 70]]
Categories (3, interval[int64, right]): [(0, 18] < (18, 35] < (35, 70]]

如果 bins 关键字是一个 IntervalIndex,则这些将用于对传入数据进行分箱。

If the bins keyword is an IntervalIndex, then these will be used to bin the passed data.

In [122]: pd.cut(ages, bins=pd.IntervalIndex.from_breaks([0, 40, 70]))
Out[122]:
[(0, 40], (0, 40], (0, 40], (0, 40], (0, 40], (0, 40], (0, 40], (40, 70], (40, 70]]
Categories (2, interval[int64, right]): [(0, 40] < (40, 70]]

factorize()

factorize() 对一维值进行编码,转换为整数标签。缺失值编码为 -1

factorize() encodes 1 dimensional values into integer labels. Missing values are encoded as -1.

In [123]: x = pd.Series(["A", "A", np.nan, "B", 3.14, np.inf])

In [124]: x
Out[124]:
0       A
1       A
2     NaN
3       B
4    3.14
5     inf
dtype: object

In [125]: labels, uniques = pd.factorize(x)

In [126]: labels
Out[126]: array([ 0,  0, -1,  1,  2,  3])

In [127]: uniques
Out[127]: Index(['A', 'B', 3.14, inf], dtype='object')

Categorical 将对一维值进行类似编码,以便进行进一步的分类操作。

Categorical will similarly encode 1 dimensional values for further categorical operations

In [128]: pd.Categorical(x)
Out[128]:
['A', 'A', NaN, 'B', 3.14, inf]
Categories (4, object): [3.14, inf, 'A', 'B']