Pandas 中文参考指南

Merge, join, concatenate and compare

pandas 提供了多种方法来组合和比较 Series 或 DataFrame。

pandas provides various methods for combining and comparing Series or DataFrame.

concat(): Merge multiple Series or DataFrame objects along a shared index or column
DataFrame.join(): Merge multiple DataFrame objects along the columns
DataFrame.combine_first(): Update missing values with non-missing values in the same location
merge(): Combine two Series or DataFrame objects with SQL-style joining
merge_ordered(): Combine two Series or DataFrame objects along an ordered axis
merge_asof(): Combine two Series or DataFrame objects by near instead of exact matching keys
Series.compare() and DataFrame.compare(): Show differences in values between two Series or DataFrame objects

concat()

concat() 函数在轴上连接任意数量的 Series 或 DataFrame 对象，同时在其他轴上执行可选的集合逻辑（并集或交集）。类似于 numpy.concatenate， concat() 获取同质对象列表或字典并将它们连接起来。

The concat() function concatenates an arbitrary amount of Series or DataFrame objects along an axis while performing optional set logic (union or intersection) of the indexes on the other axes. Like numpy.concatenate, concat() takes a list or dict of homogeneously-typed objects and concatenates them.

In [1]: df1 = pd.DataFrame(
   ...:     {
   ...:         "A": ["A0", "A1", "A2", "A3"],
   ...:         "B": ["B0", "B1", "B2", "B3"],
   ...:         "C": ["C0", "C1", "C2", "C3"],
   ...:         "D": ["D0", "D1", "D2", "D3"],
   ...:     },
   ...:     index=[0, 1, 2, 3],
   ...: )
   ...:

In [2]: df2 = pd.DataFrame(
   ...:     {
   ...:         "A": ["A4", "A5", "A6", "A7"],
   ...:         "B": ["B4", "B5", "B6", "B7"],
   ...:         "C": ["C4", "C5", "C6", "C7"],
   ...:         "D": ["D4", "D5", "D6", "D7"],
   ...:     },
   ...:     index=[4, 5, 6, 7],
   ...: )
   ...:

In [3]: df3 = pd.DataFrame(
   ...:     {
   ...:         "A": ["A8", "A9", "A10", "A11"],
   ...:         "B": ["B8", "B9", "B10", "B11"],
   ...:         "C": ["C8", "C9", "C10", "C11"],
   ...:         "D": ["D8", "D9", "D10", "D11"],
   ...:     },
   ...:     index=[8, 9, 10, 11],
   ...: )
   ...:

In [4]: frames = [df1, df2, df3]

In [5]: result = pd.concat(frames)

In [6]: result
Out[6]:
      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11

concat() 创建数据的完整副本，并且重复使用 concat() 可能创建不必要的副本。在使用 concat() 之前，在列表中收集所有 DataFrame 或 Series 对象。

concat() makes a full copy of the data, and iteratively reusing concat() can create unnecessary copies. Collect all DataFrame or Series objects in a list before using concat().

frames = [process_your_file(f) for f in files]
result = pd.concat(frames)

在连接具有命名轴的 DataFrame 时，pandas 将尽可能地尝试保留这些索引/列名。在所有输入项共享一个公共名称的情况下，将此名称分配给结果。当输入名称不都一致时，结果将不带名称。 MultiIndex 也一样，但会逐级单独应用逻辑。

When concatenating DataFrame with named axes, pandas will attempt to preserve these index/column names whenever possible. In the case where all inputs share a common name, this name will be assigned to the result. When the input names do not all agree, the result will be unnamed. The same is true for MultiIndex, but the logic is applied separately on a level-by-level basis.

Joining logic of the resulting axis

join 关键字指定如何处理第一个 DataFrame 中不存在的轴值。

The join keyword specifies how to handle axis values that don’t exist in the first DataFrame.

join='outer' 采用所有轴值的并集

join='outer' takes the union of all axis values

In [7]: df4 = pd.DataFrame(
   ...:     {
   ...:         "B": ["B2", "B3", "B6", "B7"],
   ...:         "D": ["D2", "D3", "D6", "D7"],
   ...:         "F": ["F2", "F3", "F6", "F7"],
   ...:     },
   ...:     index=[2, 3, 6, 7],
   ...: )
   ...:

In [8]: result = pd.concat([df1, df4], axis=1)

In [9]: result
Out[9]:
     A    B    C    D    B    D    F
0   A0   B0   C0   D0  NaN  NaN  NaN
1   A1   B1   C1   D1  NaN  NaN  NaN
2   A2   B2   C2   D2   B2   D2   F2
3   A3   B3   C3   D3   B3   D3   F3
6  NaN  NaN  NaN  NaN   B6   D6   F6
7  NaN  NaN  NaN  NaN   B7   D7   F7

join='inner' 采用轴值的交集

join='inner' takes the intersection of the axis values

In [10]: result = pd.concat([df1, df4], axis=1, join="inner")

In [11]: result
Out[11]:
    A   B   C   D   B   D   F
2  A2  B2  C2  D2  B2  D2  F2
3  A3  B3  C3  D3  B3  D3  F3

要使用原始 DataFrame 中的确切索引来执行有效的“左”联接，可以对结果重新编制索引。

To perform an effective “left” join using the exact index from the original DataFrame, result can be reindexed.

In [12]: result = pd.concat([df1, df4], axis=1).reindex(df1.index)

In [13]: result
Out[13]:
    A   B   C   D    B    D    F
0  A0  B0  C0  D0  NaN  NaN  NaN
1  A1  B1  C1  D1  NaN  NaN  NaN
2  A2  B2  C2  D2   B2   D2   F2
3  A3  B3  C3  D3   B3   D3   F3

Ignoring indexes on the concatenation axis

对于没有有意义的索引的 DataFrame 对象，ignore_index 会忽略重叠索引。

For DataFrame objects which don’t have a meaningful index, the ignore_index ignores overlapping indexes.

In [14]: result = pd.concat([df1, df4], ignore_index=True, sort=False)

In [15]: result
Out[15]:
     A   B    C   D    F
0   A0  B0   C0  D0  NaN
1   A1  B1   C1  D1  NaN
2   A2  B2   C2  D2  NaN
3   A3  B3   C3  D3  NaN
4  NaN  B2  NaN  D2   F2
5  NaN  B3  NaN  D3   F3
6  NaN  B6  NaN  D6   F6
7  NaN  B7  NaN  D7   F7

Concatenating Series and DataFrame together

您可以连接 Series 和 DataFrame 对象的组合。 Series 将转换为 DataFrame，列名将作为 Series 的名称。

You can concatenate a mix of Series and DataFrame objects. The Series will be transformed to DataFrame with the column name as the name of the Series.

In [16]: s1 = pd.Series(["X0", "X1", "X2", "X3"], name="X")

In [17]: result = pd.concat([df1, s1], axis=1)

In [18]: result
Out[18]:
    A   B   C   D   X
0  A0  B0  C0  D0  X0
1  A1  B1  C1  D1  X1
2  A2  B2  C2  D2  X2
3  A3  B3  C3  D3  X3

未命名的 Series 将连续编号。

Unnamed Series will be numbered consecutively.

In [19]: s2 = pd.Series(["_0", "_1", "_2", "_3"])

In [20]: result = pd.concat([df1, s2, s2, s2], axis=1)

In [21]: result
Out[21]:
    A   B   C   D   0   1   2
0  A0  B0  C0  D0  _0  _0  _0
1  A1  B1  C1  D1  _1  _1  _1
2  A2  B2  C2  D2  _2  _2  _2
3  A3  B3  C3  D3  _3  _3  _3

ignore_index=True 将删除所有名称引用。

ignore_index=True will drop all name references.

In [22]: result = pd.concat([df1, s1], axis=1, ignore_index=True)

In [23]: result
Out[23]:
    0   1   2   3   4
0  A0  B0  C0  D0  X0
1  A1  B1  C1  D1  X1
2  A2  B2  C2  D2  X2
3  A3  B3  C3  D3  X3

Resulting keys

keys 参数为结果索引或列中添加了另一个轴级别（创建了一个 MultiIndex）将特定的键与每个原始 DataFrame 关联。

The keys argument adds another axis level to the resulting index or column (creating a MultiIndex) associate specific keys with each original DataFrame.

In [24]: result = pd.concat(frames, keys=["x", "y", "z"])

In [25]: result
Out[25]:
        A    B    C    D
x 0    A0   B0   C0   D0
  1    A1   B1   C1   D1
  2    A2   B2   C2   D2
  3    A3   B3   C3   D3
y 4    A4   B4   C4   D4
  5    A5   B5   C5   D5
  6    A6   B6   C6   D6
  7    A7   B7   C7   D7
z 8    A8   B8   C8   D8
  9    A9   B9   C9   D9
  10  A10  B10  C10  D10
  11  A11  B11  C11  D11

In [26]: result.loc["y"]
Out[26]:
    A   B   C   D
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7

当根据现有 Series 创建一个新的 DataFrame 时，keys 参数能覆盖列名称。

The keys argument cane override the column names when creating a new DataFrame based on existing Series.

In [27]: s3 = pd.Series([0, 1, 2, 3], name="foo")

In [28]: s4 = pd.Series([0, 1, 2, 3])

In [29]: s5 = pd.Series([0, 1, 4, 5])

In [30]: pd.concat([s3, s4, s5], axis=1)
Out[30]:
   foo  0  1
0    0  0  0
1    1  1  1
2    2  2  4
3    3  3  5

In [31]: pd.concat([s3, s4, s5], axis=1, keys=["red", "blue", "yellow"])
Out[31]:
   red  blue  yellow
0    0     0       0
1    1     1       1
2    2     2       4
3    3     3       5

您也可以将字典传入 concat()，在这种情况下，除非指定了其他 keys 参数，否则会将字典键用于 keys 参数：

You can also pass a dict to concat() in which case the dict keys will be used for the keys argument unless other keys argument is specified:

In [32]: pieces = {"x": df1, "y": df2, "z": df3}

In [33]: result = pd.concat(pieces)

In [34]: result
Out[34]:
        A    B    C    D
x 0    A0   B0   C0   D0
  1    A1   B1   C1   D1
  2    A2   B2   C2   D2
  3    A3   B3   C3   D3
y 4    A4   B4   C4   D4
  5    A5   B5   C5   D5
  6    A6   B6   C6   D6
  7    A7   B7   C7   D7
z 8    A8   B8   C8   D8
  9    A9   B9   C9   D9
  10  A10  B10  C10  D10
  11  A11  B11  C11  D11

In [35]: result = pd.concat(pieces, keys=["z", "y"])

In [36]: result
Out[36]:
        A    B    C    D
z 8    A8   B8   C8   D8
  9    A9   B9   C9   D9
  10  A10  B10  C10  D10
  11  A11  B11  C11  D11
y 4    A4   B4   C4   D4
  5    A5   B5   C5   D5
  6    A6   B6   C6   D6
  7    A7   B7   C7   D7

MultiIndex 创建的包含一些级别的信息，这些级别是从所述键和 DataFrame 部分的索引构建而成的：

The MultiIndex created has levels that are constructed from the passed keys and the index of the DataFrame pieces:

In [37]: result.index.levels
Out[37]: FrozenList([['z', 'y'], [4, 5, 6, 7, 8, 9, 10, 11]])

levels 参数允许指定与 keys 相关联的联接级别

levels argument allows specifying resulting levels associated with the keys

In [38]: result = pd.concat(
   ....:     pieces, keys=["x", "y", "z"], levels=[["z", "y", "x", "w"]], names=["group_key"]
   ....: )
   ....:

In [39]: result
Out[39]:
                A    B    C    D
group_key
x         0    A0   B0   C0   D0
          1    A1   B1   C1   D1
          2    A2   B2   C2   D2
          3    A3   B3   C3   D3
y         4    A4   B4   C4   D4
          5    A5   B5   C5   D5
          6    A6   B6   C6   D6
          7    A7   B7   C7   D7
z         8    A8   B8   C8   D8
          9    A9   B9   C9   D9
          10  A10  B10  C10  D10
          11  A11  B11  C11  D11

In [40]: result.index.levels
Out[40]: FrozenList([['z', 'y', 'x', 'w'], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])

Appending rows to a DataFrame

如果您有一个 Series 并且您想将其追加为单行到一个 DataFrame 中，您可以将该行转换成一个 DataFrame 并使用 concat()

If you have a Series that you want to append as a single row to a DataFrame, you can convert the row into a DataFrame and use concat()

In [41]: s2 = pd.Series(["X0", "X1", "X2", "X3"], index=["A", "B", "C", "D"])

In [42]: result = pd.concat([df1, s2.to_frame().T], ignore_index=True)

In [43]: result
Out[43]:
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
4  X0  X1  X2  X3

merge()

merge() 执行类似于 SQL 等关系型数据库的连接操作。熟悉 SQL 但对 pandas 比较陌生的用户可以参考 comparison with SQL。

merge() performs join operations similar to relational databases like SQL. Users who are familiar with SQL but new to pandas can reference a comparison with SQL.

Merge types

merge() 实现了常见的 SQL 风格连接操作。

merge() implements common SQL style joining operations.

one-to-one: joining two DataFrame objects on their indexes which must contain unique values.
many-to-one: joining a unique index to one or more columns in a different DataFrame.
many-to-many : joining columns on columns.

当在列上连接列（有可能是一对多连接）的时候，所述 DataFrame 对象中的所有索引都会被拒绝。

When joining columns on columns, potentially a many-to-many join, any indexes on the passed DataFrame objects will be discarded.

对于一对多连接，如果键组合在两表中出现不止一次， DataFrame 将具有关联数据的笛卡尔积。

For a many-to-many join, if a key combination appears more than once in both tables, the DataFrame will have the Cartesian product of the associated data.

In [44]: left = pd.DataFrame(
   ....:     {
   ....:         "key": ["K0", "K1", "K2", "K3"],
   ....:         "A": ["A0", "A1", "A2", "A3"],
   ....:         "B": ["B0", "B1", "B2", "B3"],
   ....:     }
   ....: )
   ....:

In [45]: right = pd.DataFrame(
   ....:     {
   ....:         "key": ["K0", "K1", "K2", "K3"],
   ....:         "C": ["C0", "C1", "C2", "C3"],
   ....:         "D": ["D0", "D1", "D2", "D3"],
   ....:     }
   ....: )
   ....:

In [46]: result = pd.merge(left, right, on="key")

In [47]: result
Out[47]:
  key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2
3  K3  A3  B3  C3  D3

merge() 的 how 参数指定了结果表中包含哪些键。如果键组合没有出现在左侧表或右侧表中，那么该连接表的中的值将为 NA。以下是 how 选项及其 SQL 等效名称的摘要：

The how argument to merge() specifies which keys are included in the resulting table. If a key combination does not appear in either the left or right tables, the values in the joined table will be NA. Here is a summary of the how options and their SQL equivalent names:

合并方法

Merge method

SQL 连接名称

SQL Join Name

说明

Description

left

LEFT OUTER JOIN

仅使用来自左侧框架中的键

Use keys from left frame only

right

RIGHT OUTER JOIN

仅使用右侧数据框中的键

Use keys from right frame only

outer

FULL OUTER JOIN

使用两个数据框中键的并集

Use union of keys from both frames

inner

INNER JOIN

使用两个数据框中键的交集

Use intersection of keys from both frames

cross

CROSS JOIN

创建两个数据框中行的笛卡尔积

Create the cartesian product of rows of both frames

In [48]: left = pd.DataFrame(
   ....:    {
   ....:       "key1": ["K0", "K0", "K1", "K2"],
   ....:       "key2": ["K0", "K1", "K0", "K1"],
   ....:       "A": ["A0", "A1", "A2", "A3"],
   ....:       "B": ["B0", "B1", "B2", "B3"],
   ....:    }
   ....: )
   ....:

In [49]: right = pd.DataFrame(
   ....:    {
   ....:       "key1": ["K0", "K1", "K1", "K2"],
   ....:       "key2": ["K0", "K0", "K0", "K0"],
   ....:       "C": ["C0", "C1", "C2", "C3"],
   ....:       "D": ["D0", "D1", "D2", "D3"],
   ....:    }
   ....: )
   ....:

In [50]: result = pd.merge(left, right, how="left", on=["key1", "key2"])

In [51]: result
Out[51]:
  key1 key2   A   B    C    D
0   K0   K0  A0  B0   C0   D0
1   K0   K1  A1  B1  NaN  NaN
2   K1   K0  A2  B2   C1   D1
3   K1   K0  A2  B2   C2   D2
4   K2   K1  A3  B3  NaN  NaN

In [52]: result = pd.merge(left, right, how="right", on=["key1", "key2"])

In [53]: result
Out[53]:
  key1 key2    A    B   C   D
0   K0   K0   A0   B0  C0  D0
1   K1   K0   A2   B2  C1  D1
2   K1   K0   A2   B2  C2  D2
3   K2   K0  NaN  NaN  C3  D3

In [54]: result = pd.merge(left, right, how="outer", on=["key1", "key2"])

In [55]: result
Out[55]:
  key1 key2    A    B    C    D
0   K0   K0   A0   B0   C0   D0
1   K0   K1   A1   B1  NaN  NaN
2   K1   K0   A2   B2   C1   D1
3   K1   K0   A2   B2   C2   D2
4   K2   K0  NaN  NaN   C3   D3
5   K2   K1   A3   B3  NaN  NaN

In [56]: result = pd.merge(left, right, how="inner", on=["key1", "key2"])

In [57]: result
Out[57]:
  key1 key2   A   B   C   D
0   K0   K0  A0  B0  C0  D0
1   K1   K0  A2  B2  C1  D1
2   K1   K0  A2  B2  C2  D2

In [58]: result = pd.merge(left, right, how="cross")

In [59]: result
Out[59]:
   key1_x key2_x   A   B key1_y key2_y   C   D
0      K0     K0  A0  B0     K0     K0  C0  D0
1      K0     K0  A0  B0     K1     K0  C1  D1
2      K0     K0  A0  B0     K1     K0  C2  D2
3      K0     K0  A0  B0     K2     K0  C3  D3
4      K0     K1  A1  B1     K0     K0  C0  D0
..    ...    ...  ..  ..    ...    ...  ..  ..
11     K1     K0  A2  B2     K2     K0  C3  D3
12     K2     K1  A3  B3     K0     K0  C0  D0
13     K2     K1  A3  B3     K1     K0  C1  D1
14     K2     K1  A3  B3     K1     K0  C2  D2
15     K2     K1  A3  B3     K2     K0  C3  D3

[16 rows x 8 columns]

您可以将 Series 和 DataFrame 与 MultiIndex 相搭配，前提是 MultiIndex 的名称与 DataFrame 中的列相对应。在合并之前使用 Series.reset_index() 将 Series 转换为 DataFrame

You can Series and a DataFrame with a MultiIndex if the names of the MultiIndex correspond to the columns from the DataFrame. Transform the Series to a DataFrame using Series.reset_index() before merging

In [60]: df = pd.DataFrame({"Let": ["A", "B", "C"], "Num": [1, 2, 3]})

In [61]: df
Out[61]:
  Let  Num
0   A    1
1   B    2
2   C    3

In [62]: ser = pd.Series(
   ....:     ["a", "b", "c", "d", "e", "f"],
   ....:     index=pd.MultiIndex.from_arrays(
   ....:         [["A", "B", "C"] * 2, [1, 2, 3, 4, 5, 6]], names=["Let", "Num"]
   ....:     ),
   ....: )
   ....:

In [63]: ser
Out[63]:
Let  Num
A    1      a
B    2      b
C    3      c
A    4      d
B    5      e
C    6      f
dtype: object

In [64]: pd.merge(df, ser.reset_index(), on=["Let", "Num"])
Out[64]:
  Let  Num  0
0   A    1  a
1   B    2  b
2   C    3  c

在 DataFrame 中执行具有重复连接键的外连接

Performing an outer join with duplicate join keys in DataFrame

In [65]: left = pd.DataFrame({"A": [1, 2], "B": [2, 2]})

In [66]: right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

In [67]: result = pd.merge(left, right, on="B", how="outer")

In [68]: result
Out[68]:
   A_x  B  A_y
0    1  2    4
1    1  2    5
2    1  2    6
3    2  2    4
4    2  2    5
5    2  2    6

警告

Warning

基于重复键进行合并将显著增加结果的维度，可能导致内存溢出。

Merging on duplicate keys significantly increase the dimensions of the result and can cause a memory overflow.

Merge key uniqueness

validate 参数检查合并键的唯一性。在合并操作之前检查键的唯一性，可以防止内存溢出和意外的键重复。

The validate argument checks whether the uniqueness of merge keys. Key uniqueness is checked before merge operations and can protect against memory overflows and unexpected key duplication.

In [69]: left = pd.DataFrame({"A": [1, 2], "B": [1, 2]})

In [70]: right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

In [71]: result = pd.merge(left, right, on="B", how="outer", validate="one_to_one")
---------------------------------------------------------------------------
MergeError                                Traceback (most recent call last)
Cell In[71], line 1
----> 1 result = pd.merge(left, right, on="B", how="outer", validate="one_to_one")

File ~/work/pandas/pandas/pandas/core/reshape/merge.py:170, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
    155     return _cross_merge(
    156         left_df,
    157         right_df,
   (...)
    167         copy=copy,
    168     )
    169 else:
--> 170     op = _MergeOperation(
    171         left_df,
    172         right_df,
    173         how=how,
    174         on=on,
    175         left_on=left_on,
    176         right_on=right_on,
    177         left_index=left_index,
    178         right_index=right_index,
    179         sort=sort,
    180         suffixes=suffixes,
    181         indicator=indicator,
    182         validate=validate,
    183     )
    184     return op.get_result(copy=copy)

File ~/work/pandas/pandas/pandas/core/reshape/merge.py:813, in _MergeOperation.__init__(self, left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, indicator, validate)
    809 # If argument passed to validate,
    810 # check if columns specified as unique
    811 # are in fact unique.
    812 if validate is not None:
--> 813     self._validate_validate_kwd(validate)

File ~/work/pandas/pandas/pandas/core/reshape/merge.py:1657, in _MergeOperation._validate_validate_kwd(self, validate)
   1653         raise MergeError(
   1654             "Merge keys are not unique in left dataset; not a one-to-one merge"
   1655         )
   1656     if not right_unique:
-> 1657         raise MergeError(
   1658             "Merge keys are not unique in right dataset; not a one-to-one merge"
   1659         )
   1661 elif validate in ["one_to_many", "1:m"]:
   1662     if not left_unique:

MergeError: Merge keys are not unique in right dataset; not a one-to-one merge

如果用户知道右侧 DataFrame 中存在重复项，但希望确保左侧 DataFrame 中没有重复项，则可以使用 validate='one_to_many' 参数，这样就不会引发异常。

If the user is aware of the duplicates in the right DataFrame but wants to ensure there are no duplicates in the left DataFrame, one can use the validate='one_to_many' argument instead, which will not raise an exception.

In [72]: pd.merge(left, right, on="B", how="outer", validate="one_to_many")
Out[72]:
   A_x  B  A_y
0    1  1  NaN
1    2  2  4.0
2    2  2  5.0
3    2  2  6.0

Merge result indicator

merge() 接受 indicator 参数。如果 True，则名为 _merge 的类别型列将添加到输出对象中，该列值为：

merge() accepts the argument indicator. If True, a Categorical-type column called _merge will be added to the output object that takes on values:

观察的起源

Observation Origin

uniquement la valeur _merge

_merge value

Fusionner uniquement la clé dans le cadre 'left'

Merge key only in 'left' frame

left_only

Fusionner uniquement la clé dans le cadre 'right'

Merge key only in 'right' frame

right_only

Fusionner la clé dans les deux cadres

Merge key in both frames

both

In [73]: df1 = pd.DataFrame({"col1": [0, 1], "col_left": ["a", "b"]})

In [74]: df2 = pd.DataFrame({"col1": [1, 2, 2], "col_right": [2, 2, 2]})

In [75]: pd.merge(df1, df2, on="col1", how="outer", indicator=True)
Out[75]:
   col1 col_left  col_right      _merge
0     0        a        NaN   left_only
1     1        b        2.0        both
2     2      NaN        2.0  right_only
3     2      NaN        2.0  right_only

Un argument de chaîne pour indicator utilisera la valeur comme nom pour la colonne d’indicateur.

A string argument to indicator will use the value as the name for the indicator column.

In [76]: pd.merge(df1, df2, on="col1", how="outer", indicator="indicator_column")
Out[76]:
   col1 col_left  col_right indicator_column
0     0        a        NaN        left_only
1     1        b        2.0             both
2     2      NaN        2.0       right_only
3     2      NaN        2.0       right_only

Overlapping value columns

L’argument de fusion suffixes prend un tuple de liste de chaînes à ajouter aux noms de colonnes se chevauchant dans l’entrée DataFrame pour lever l’ambiguïté des colonnes de résultat :

The merge suffixes argument takes a tuple of list of strings to append to overlapping column names in the input DataFrame to disambiguate the result columns:

In [77]: left = pd.DataFrame({"k": ["K0", "K1", "K2"], "v": [1, 2, 3]})

In [78]: right = pd.DataFrame({"k": ["K0", "K0", "K3"], "v": [4, 5, 6]})

In [79]: result = pd.merge(left, right, on="k")

In [80]: result
Out[80]:
    k  v_x  v_y
0  K0    1    4
1  K0    1    5

In [81]: result = pd.merge(left, right, on="k", suffixes=("_l", "_r"))

In [82]: result
Out[82]:
    k  v_l  v_r
0  K0    1    4
1  K0    1    5

DataFrame.join()

DataFrame.join() combine les colonnes de plusieurs DataFrame potentiellement indexés différemment en un seul résultat DataFrame.

DataFrame.join() combines the columns of multiple, potentially differently-indexed DataFrame into a single result DataFrame.

In [83]: left = pd.DataFrame(
   ....:     {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=["K0", "K1", "K2"]
   ....: )
   ....:

In [84]: right = pd.DataFrame(
   ....:     {"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]}, index=["K0", "K2", "K3"]
   ....: )
   ....:

In [85]: result = left.join(right)

In [86]: result
Out[86]:
     A   B    C    D
K0  A0  B0   C0   D0
K1  A1  B1  NaN  NaN
K2  A2  B2   C2   D2

In [87]: result = left.join(right, how="outer")

In [88]: result
Out[88]:
      A    B    C    D
K0   A0   B0   C0   D0
K1   A1   B1  NaN  NaN
K2   A2   B2   C2   D2
K3  NaN  NaN   C3   D3

In [89]: result = left.join(right, how="inner")

In [90]: result
Out[90]:
     A   B   C   D
K0  A0  B0  C0  D0
K2  A2  B2  C2  D2

DataFrame.join() prend un argument on facultatif qui peut être une ou plusieurs colonnes que le DataFrame passé doit aligner.

DataFrame.join() takes an optional on argument which may be a column or multiple column names that the passed DataFrame is to be aligned.

In [91]: left = pd.DataFrame(
   ....:     {
   ....:         "A": ["A0", "A1", "A2", "A3"],
   ....:         "B": ["B0", "B1", "B2", "B3"],
   ....:         "key": ["K0", "K1", "K0", "K1"],
   ....:     }
   ....: )
   ....:

In [92]: right = pd.DataFrame({"C": ["C0", "C1"], "D": ["D0", "D1"]}, index=["K0", "K1"])

In [93]: result = left.join(right, on="key")

In [94]: result
Out[94]:
    A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K0  C0  D0
3  A3  B3  K1  C1  D1

In [95]: result = pd.merge(
   ....:     left, right, left_on="key", right_index=True, how="left", sort=False
   ....: )
   ....:

In [96]: result
Out[96]:
    A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K0  C0  D0
3  A3  B3  K1  C1  D1

Pour joindre plusieurs clés, le DataFrame passé doit avoir un MultiIndex :

To join on multiple keys, the passed DataFrame must have a MultiIndex:

In [97]: left = pd.DataFrame(
   ....:     {
   ....:         "A": ["A0", "A1", "A2", "A3"],
   ....:         "B": ["B0", "B1", "B2", "B3"],
   ....:         "key1": ["K0", "K0", "K1", "K2"],
   ....:         "key2": ["K0", "K1", "K0", "K1"],
   ....:     }
   ....: )
   ....:

In [98]: index = pd.MultiIndex.from_tuples(
   ....:     [("K0", "K0"), ("K1", "K0"), ("K2", "K0"), ("K2", "K1")]
   ....: )
   ....:

In [99]: right = pd.DataFrame(
   ....:     {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]}, index=index
   ....: )
   ....:

In [100]: result = left.join(right, on=["key1", "key2"])

In [101]: result
Out[101]:
    A   B key1 key2    C    D
0  A0  B0   K0   K0   C0   D0
1  A1  B1   K0   K1  NaN  NaN
2  A2  B2   K1   K0   C1   D1
3  A3  B3   K2   K1   C3   D3

La valeur par défaut pour DataFrame.join est d’effectuer une jointure gauche qui utilise uniquement les clés trouvées dans l’appelant DataFrame. D’autres types de jointure peuvent être spécifiés avec how.

The default for DataFrame.join is to perform a left join which uses only the keys found in the calling DataFrame. Other join types can be specified with how.

In [102]: result = left.join(right, on=["key1", "key2"], how="inner")

In [103]: result
Out[103]:
    A   B key1 key2   C   D
0  A0  B0   K0   K0  C0  D0
2  A2  B2   K1   K0  C1  D1
3  A3  B3   K2   K1  C3  D3

Joining a single Index to a MultiIndex

Vous pouvez joindre un DataFrame avec un Index à un DataFrame avec un MultiIndex sur un niveau. Le name du Index correspondra au nom de niveau du MultiIndex.

You can join a DataFrame with a Index to a DataFrame with a MultiIndex on a level. The name of the Index with match the level name of the MultiIndex.

In [104]: left = pd.DataFrame(
   .....:     {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]},
   .....:     index=pd.Index(["K0", "K1", "K2"], name="key"),
   .....: )
   .....:

In [105]: index = pd.MultiIndex.from_tuples(
   .....:     [("K0", "Y0"), ("K1", "Y1"), ("K2", "Y2"), ("K2", "Y3")],
   .....:     names=["key", "Y"],
   .....: )
   .....:

In [106]: right = pd.DataFrame(
   .....:     {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]},
   .....:     index=index,
   .....: )
   .....:

In [107]: result = left.join(right, how="inner")

In [108]: result
Out[108]:
         A   B   C   D
key Y
K0  Y0  A0  B0  C0  D0
K1  Y1  A1  B1  C1  D1
K2  Y2  A2  B2  C2  D2
    Y3  A2  B2  C3  D3

Joining with two MultiIndex

Le MultiIndex de l’argument d’entrée doit être complètement utilisé dans la jointure et est un sous-ensemble des indices dans l’argument de gauche.

The MultiIndex of the input argument must be completely used in the join and is a subset of the indices in the left argument.

In [109]: leftindex = pd.MultiIndex.from_product(
   .....:     [list("abc"), list("xy"), [1, 2]], names=["abc", "xy", "num"]
   .....: )
   .....:

In [110]: left = pd.DataFrame({"v1": range(12)}, index=leftindex)

In [111]: left
Out[111]:
            v1
abc xy num
a   x  1     0
       2     1
    y  1     2
       2     3
b   x  1     4
       2     5
    y  1     6
       2     7
c   x  1     8
       2     9
    y  1    10
       2    11

In [112]: rightindex = pd.MultiIndex.from_product(
   .....:     [list("abc"), list("xy")], names=["abc", "xy"]
   .....: )
   .....:

In [113]: right = pd.DataFrame({"v2": [100 * i for i in range(1, 7)]}, index=rightindex)

In [114]: right
Out[114]:
         v2
abc xy
a   x   100
    y   200
b   x   300
    y   400
c   x   500
    y   600

In [115]: left.join(right, on=["abc", "xy"], how="inner")
Out[115]:
            v1   v2
abc xy num
a   x  1     0  100
       2     1  100
    y  1     2  200
       2     3  200
b   x  1     4  300
       2     5  300
    y  1     6  400
       2     7  400
c   x  1     8  500
       2     9  500
    y  1    10  600
       2    11  600

In [116]: leftindex = pd.MultiIndex.from_tuples(
   .....:     [("K0", "X0"), ("K0", "X1"), ("K1", "X2")], names=["key", "X"]
   .....: )
   .....:

In [117]: left = pd.DataFrame(
   .....:     {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=leftindex
   .....: )
   .....:

In [118]: rightindex = pd.MultiIndex.from_tuples(
   .....:     [("K0", "Y0"), ("K1", "Y1"), ("K2", "Y2"), ("K2", "Y3")], names=["key", "Y"]
   .....: )
   .....:

In [119]: right = pd.DataFrame(
   .....:     {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]}, index=rightindex
   .....: )
   .....:

In [120]: result = pd.merge(
   .....:     left.reset_index(), right.reset_index(), on=["key"], how="inner"
   .....: ).set_index(["key", "X", "Y"])
   .....:

In [121]: result
Out[121]:
            A   B   C   D
key X  Y
K0  X0 Y0  A0  B0  C0  D0
    X1 Y0  A1  B1  C0  D0
K1  X2 Y1  A2  B2  C1  D1

Merging on a combination of columns and index levels

Les chaînes passées comme paramètres on, left_on et right_on peuvent faire référence à des noms de colonnes ou à des noms de niveau d’index. Ceci permet de fusionner les instances DataFrame sur une combinaison de niveaux d’index et de colonnes sans réinitialiser les index.

Strings passed as the on, left_on, and right_on parameters may refer to either column names or index level names. This enables merging DataFrame instances on a combination of index levels and columns without resetting indexes.

In [122]: left_index = pd.Index(["K0", "K0", "K1", "K2"], name="key1")

In [123]: left = pd.DataFrame(
   .....:     {
   .....:         "A": ["A0", "A1", "A2", "A3"],
   .....:         "B": ["B0", "B1", "B2", "B3"],
   .....:         "key2": ["K0", "K1", "K0", "K1"],
   .....:     },
   .....:     index=left_index,
   .....: )
   .....:

In [124]: right_index = pd.Index(["K0", "K1", "K2", "K2"], name="key1")

In [125]: right = pd.DataFrame(
   .....:     {
   .....:         "C": ["C0", "C1", "C2", "C3"],
   .....:         "D": ["D0", "D1", "D2", "D3"],
   .....:         "key2": ["K0", "K0", "K0", "K1"],
   .....:     },
   .....:     index=right_index,
   .....: )
   .....:

In [126]: result = left.merge(right, on=["key1", "key2"])

In [127]: result
Out[127]:
       A   B key2   C   D
key1
K0    A0  B0   K0  C0  D0
K1    A2  B2   K0  C1  D1
K2    A3  B3   K1  C3  D3

Lorsque DataFrame sont joints sur une chaîne qui correspond à un niveau d’index dans les deux arguments, le niveau d’index est conservé comme niveau d’index dans le DataFrame résultant.

When DataFrame are joined on a string that matches an index level in both arguments, the index level is preserved as an index level in the resulting DataFrame.

Lorsque DataFrame sont joints en utilisant uniquement certains niveaux d’un MultiIndex, les niveaux supplémentaires seront supprimés de la jointure résultante. Pour conserver ces niveaux, utilisez DataFrame.reset_index() sur ces noms de niveaux pour déplacer ces niveaux vers des colonnes avant la jointure.

When DataFrame are joined using only some of the levels of a MultiIndex, the extra levels will be dropped from the resulting join. To preserve those levels, use DataFrame.reset_index() on those level names to move those levels to columns prior to the join.

Joining multiple DataFrame

Une liste ou un tuple de :class:`DataFrame` peut également être transmis à join() pour les joindre ensemble sur leurs index.

A list or tuple of :class:`DataFrame` can also be passed to join() to join them together on their indexes.

In [128]: right2 = pd.DataFrame({"v": [7, 8, 9]}, index=["K1", "K1", "K2"])

In [129]: result = left.join([right, right2])

DataFrame.combine_first()

DataFrame.combine_first() 从另一个 DataFrame 中用非缺失值更新一个 DataFrame 的缺失值，在对应的位置中。

DataFrame.combine_first() update missing values from one DataFrame with the non-missing values in another DataFrame in the corresponding location.

In [130]: df1 = pd.DataFrame(
   .....:     [[np.nan, 3.0, 5.0], [-4.6, np.nan, np.nan], [np.nan, 7.0, np.nan]]
   .....: )
   .....:

In [131]: df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5.0, 1.6, 4]], index=[1, 2])

In [132]: result = df1.combine_first(df2)

In [133]: result
Out[133]:
     0    1    2
0  NaN  3.0  5.0
1 -4.6  NaN -8.2
2 -5.0  7.0  4.0

merge_ordered()

merge_ordered() 将订单数据（例如数字或时间序列数据）与 fill_method 的缺失数据可选项填充结合起来。

merge_ordered() combines order data such as numeric or time series data with optional filling of missing data with fill_method.

In [134]: left = pd.DataFrame(
   .....:     {"k": ["K0", "K1", "K1", "K2"], "lv": [1, 2, 3, 4], "s": ["a", "b", "c", "d"]}
   .....: )
   .....:

In [135]: right = pd.DataFrame({"k": ["K1", "K2", "K4"], "rv": [1, 2, 3]})

In [136]: pd.merge_ordered(left, right, fill_method="ffill", left_by="s")
Out[136]:
     k   lv  s   rv
0   K0  1.0  a  NaN
1   K1  1.0  a  1.0
2   K2  1.0  a  2.0
3   K4  1.0  a  3.0
4   K1  2.0  b  1.0
5   K2  2.0  b  2.0
6   K4  2.0  b  3.0
7   K1  3.0  c  1.0
8   K2  3.0  c  2.0
9   K4  3.0  c  3.0
10  K1  NaN  d  1.0
11  K2  4.0  d  2.0
12  K4  4.0  d  3.0

merge_asof()

merge_asof() 类似于按顺序连接（left-join），但匹配基于最接近的键，而不是相等的键。对于 left DataFrame 中的每行，将在 right DataFrame 中选择最后一行，其中 on 键小于左键。两个 DataFrame 都必须按键进行排序。

merge_asof() is similar to an ordered left-join except that mactches are on the nearest key rather than equal keys. For each row in the left DataFrame, the last row in the right DataFrame are selected where the on key is less than the left’s key. Both DataFrame must be sorted by the key.

或者，一个 merge_asof() 可以通过匹配 by 键以及按 on 键最接近的匹配来执行分组合并。

Optionally an merge_asof() can perform a group-wise merge by matching the by key in addition to the nearest match on the on key.

In [137]: trades = pd.DataFrame(
   .....:     {
   .....:         "time": pd.to_datetime(
   .....:             [
   .....:                 "20160525 13:30:00.023",
   .....:                 "20160525 13:30:00.038",
   .....:                 "20160525 13:30:00.048",
   .....:                 "20160525 13:30:00.048",
   .....:                 "20160525 13:30:00.048",
   .....:             ]
   .....:         ),
   .....:         "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
   .....:         "price": [51.95, 51.95, 720.77, 720.92, 98.00],
   .....:         "quantity": [75, 155, 100, 100, 100],
   .....:     },
   .....:     columns=["time", "ticker", "price", "quantity"],
   .....: )
   .....:

In [138]: quotes = pd.DataFrame(
   .....:     {
   .....:         "time": pd.to_datetime(
   .....:             [
   .....:                 "20160525 13:30:00.023",
   .....:                 "20160525 13:30:00.023",
   .....:                 "20160525 13:30:00.030",
   .....:                 "20160525 13:30:00.041",
   .....:                 "20160525 13:30:00.048",
   .....:                 "20160525 13:30:00.049",
   .....:                 "20160525 13:30:00.072",
   .....:                 "20160525 13:30:00.075",
   .....:             ]
   .....:         ),
   .....:         "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", "GOOG", "AAPL", "GOOG", "MSFT"],
   .....:         "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99, 720.50, 52.01],
   .....:         "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01, 720.88, 52.03],
   .....:     },
   .....:     columns=["time", "ticker", "bid", "ask"],
   .....: )
   .....:

In [139]: trades
Out[139]:
                     time ticker   price  quantity
0 2016-05-25 13:30:00.023   MSFT   51.95        75
1 2016-05-25 13:30:00.038   MSFT   51.95       155
2 2016-05-25 13:30:00.048   GOOG  720.77       100
3 2016-05-25 13:30:00.048   GOOG  720.92       100
4 2016-05-25 13:30:00.048   AAPL   98.00       100

In [140]: quotes
Out[140]:
                     time ticker     bid     ask
0 2016-05-25 13:30:00.023   GOOG  720.50  720.93
1 2016-05-25 13:30:00.023   MSFT   51.95   51.96
2 2016-05-25 13:30:00.030   MSFT   51.97   51.98
3 2016-05-25 13:30:00.041   MSFT   51.99   52.00
4 2016-05-25 13:30:00.048   GOOG  720.50  720.93
5 2016-05-25 13:30:00.049   AAPL   97.99   98.01
6 2016-05-25 13:30:00.072   GOOG  720.50  720.88
7 2016-05-25 13:30:00.075   MSFT   52.01   52.03

In [141]: pd.merge_asof(trades, quotes, on="time", by="ticker")
Out[141]:
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.96
1 2016-05-25 13:30:00.038   MSFT   51.95       155   51.97   51.98
2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN

2ms 中的 merge_asof() 在报价时间和交易时间之间。

merge_asof() within 2ms between the quote time and the trade time.

In [142]: pd.merge_asof(trades, quotes, on="time", by="ticker", tolerance=pd.Timedelta("2ms"))
Out[142]:
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.96
1 2016-05-25 13:30:00.038   MSFT   51.95       155     NaN     NaN
2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN

10ms 中的 merge_asof() 在报价时间和交易时间之间，并排除时间中的完全匹配。请注意，尽管我们排除了完全匹配（报价），但之前的报价确实会传播到时间中的那个点。

merge_asof() within 10ms between the quote time and the trade time and exclude exact matches on time. Note that though we exclude the exact matches (of the quotes), prior quotes do propagate to that point in time.

In [143]: pd.merge_asof(
   .....:     trades,
   .....:     quotes,
   .....:     on="time",
   .....:     by="ticker",
   .....:     tolerance=pd.Timedelta("10ms"),
   .....:     allow_exact_matches=False,
   .....: )
   .....:
Out[143]:
                     time ticker   price  quantity    bid    ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75    NaN    NaN
1 2016-05-25 13:30:00.038   MSFT   51.95       155  51.97  51.98
2 2016-05-25 13:30:00.048   GOOG  720.77       100    NaN    NaN
3 2016-05-25 13:30:00.048   GOOG  720.92       100    NaN    NaN
4 2016-05-25 13:30:00.048   AAPL   98.00       100    NaN    NaN

compare()

Series.compare() 和 DataFrame.compare() 方法分别允许你比较两个 DataFrame 或 Series，并总结它们的差异。

The Series.compare() and DataFrame.compare() methods allow you to compare two DataFrame or Series, respectively, and summarize their differences.

In [144]: df = pd.DataFrame(
   .....:     {
   .....:         "col1": ["a", "a", "b", "b", "a"],
   .....:         "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
   .....:         "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
   .....:     },
   .....:     columns=["col1", "col2", "col3"],
   .....: )
   .....:

In [145]: df
Out[145]:
  col1  col2  col3
0    a   1.0   1.0
1    a   2.0   2.0
2    b   3.0   3.0
3    b   NaN   4.0
4    a   5.0   5.0

In [146]: df2 = df.copy()

In [147]: df2.loc[0, "col1"] = "c"

In [148]: df2.loc[2, "col3"] = 4.0

In [149]: df2
Out[149]:
  col1  col2  col3
0    c   1.0   1.0
1    a   2.0   2.0
2    b   3.0   4.0
3    b   NaN   4.0
4    a   5.0   5.0

In [150]: df.compare(df2)
Out[150]:
  col1       col3
  self other self other
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0

默认情况下，如果两个相应的值相等，它们将显示为 NaN。此外，如果一整行/列中的所有值都为 NaN，则该行/列将从结果中省略。剩下的差异将在列上对齐。

By default, if two corresponding values are equal, they will be shown as NaN. Furthermore, if all values in an entire row / column, the row / column will be omitted from the result. The remaining differences will be aligned on columns.

在行上堆叠差异。

Stack the differences on rows.

In [151]: df.compare(df2, align_axis=0)
Out[151]:
        col1  col3
0 self     a   NaN
  other    c   NaN
2 self   NaN   3.0
  other  NaN   4.0

保留带有 keep_shape=True 的所有原始行和列。

Keep all original rows and columns with keep_shape=True

In [152]: df.compare(df2, keep_shape=True)
Out[152]:
  col1       col2       col3
  self other self other self other
0    a     c  NaN   NaN  NaN   NaN
1  NaN   NaN  NaN   NaN  NaN   NaN
2  NaN   NaN  NaN   NaN  3.0   4.0
3  NaN   NaN  NaN   NaN  NaN   NaN
4  NaN   NaN  NaN   NaN  NaN   NaN

即使它们相等，也会保留所有原始值。

Keep all the original values even if they are equal.

In [153]: df.compare(df2, keep_shape=True, keep_equal=True)
Out[153]:
  col1       col2       col3
  self other self other self other
0    a     c  1.0   1.0  1.0   1.0
1    a     a  2.0   2.0  2.0   2.0
2    b     b  3.0   3.0  3.0   4.0
3    b     b  NaN   NaN  4.0   4.0
4    a     a  5.0   5.0  5.0   5.0