Cookbook

This is a repository for short and sweet examples of useful pandas recipes, together with links for further reading. We encourage users to add to this documentation.

Adding interesting links and/or inline examples to this section is a great first Pull Request.

Simplified, condensed, new-user-friendly inline examples have been inserted where possible to augment the Stack Overflow and GitHub links. Many of the links contain expanded information beyond what the inline examples offer.

pandas (pd) and NumPy (np) are the only two abbreviated imported modules. The rest are kept explicitly imported for newer users.
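
Accordingly, all of the examples below assume these two imports:

import numpy as np
import pandas as pd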

Idioms

These are some neat pandas idioms

if-then/if-then-else on one column, and assignment to another one or more columns:

In [1]: df = pd.DataFrame(
   ...:     {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
   ...: )
   ...:

In [2]: df
Out[2]:
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

if-then…

An if-then on one column

In [3]: df.loc[df.AAA >= 5, "BBB"] = -1

In [4]: df
Out[4]:
   AAA  BBB  CCC
0    4   10  100
1    5   -1   50
2    6   -1  -30
3    7   -1  -50

An if-then with assignment to 2 columns:

In [5]: df.loc[df.AAA >= 5, ["BBB", "CCC"]] = 555

In [6]: df
Out[6]:
   AAA  BBB  CCC
0    4   10  100
1    5  555  555
2    6  555  555
3    7  555  555

Add another line with different logic, to do the -else

In [7]: df.loc[df.AAA < 5, ["BBB", "CCC"]] = 2000

In [8]: df
Out[8]:
   AAA   BBB   CCC
0    4  2000  2000
1    5   555   555
2    6   555   555
3    7   555   555

Or use pandas where after you've set up a mask

In [9]: df_mask = pd.DataFrame(
   ...:     {"AAA": [True] * 4, "BBB": [False] * 4, "CCC": [True, False] * 2}
   ...: )
   ...:

In [10]: df.where(df_mask, -1000)
Out[10]:
   AAA   BBB   CCC
0    4 -1000  2000
1    5 -1000 -1000
2    6 -1000   555
3    7 -1000 -1000

if-then-else using NumPy's where()

In [11]: df = pd.DataFrame(
   ....:     {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
   ....: )
   ....:

In [12]: df
Out[12]:
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [13]: df["logic"] = np.where(df["AAA"] > 5, "high", "low")

In [14]: df
Out[14]:
   AAA  BBB  CCC logic
0    4   10  100   low
1    5   20   50   low
2    6   30  -30  high
3    7   40  -50  high

Splitting

Split a frame with a boolean criterion

In [15]: df = pd.DataFrame(
   ....:     {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
   ....: )
   ....:

In [16]: df
Out[16]:
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [17]: df[df.AAA <= 5]
Out[17]:
   AAA  BBB  CCC
0    4   10  100
1    5   20   50

In [18]: df[df.AAA > 5]
Out[18]:
   AAA  BBB  CCC
2    6   30  -30
3    7   40  -50

Building criteria

Select with multi-column criteria

In [19]: df = pd.DataFrame(
   ....:     {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
   ....: )
   ....:

In [20]: df
Out[20]:
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

…and (without assignment returns a Series)

In [21]: df.loc[(df["BBB"] < 25) & (df["CCC"] >= -40), "AAA"]
Out[21]:
0    4
1    5
Name: AAA, dtype: int64

…or (without assignment returns a Series)

In [22]: df.loc[(df["BBB"] > 25) | (df["CCC"] >= -40), "AAA"]
Out[22]:
0    4
1    5
2    6
3    7
Name: AAA, dtype: int64

…or (with assignment modifies the DataFrame.)

In [23]: df.loc[(df["BBB"] > 25) | (df["CCC"] >= 75), "AAA"] = 999

In [24]: df
Out[24]:
   AAA  BBB  CCC
0  999   10  100
1    5   20   50
2  999   30  -30
3  999   40  -50

Select rows with data closest to certain value using argsort

In [25]: df = pd.DataFrame(
   ....:     {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
   ....: )
   ....:

In [26]: df
Out[26]:
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [27]: aValue = 43.0

In [28]: df.loc[(df.CCC - aValue).abs().argsort()]
Out[28]:
   AAA  BBB  CCC
1    5   20   50
0    4   10  100
2    6   30  -30
3    7   40  -50

Dynamically reduce a list of criteria using binary operators

In [29]: df = pd.DataFrame(
   ....:     {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
   ....: )
   ....:

In [30]: df
Out[30]:
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [31]: Crit1 = df.AAA <= 5.5

In [32]: Crit2 = df.BBB == 10.0

In [33]: Crit3 = df.CCC > -40.0

One could hard code:

In [34]: AllCrit = Crit1 & Crit2 & Crit3

…or it can be done with a list of dynamically built criteria

In [35]: import functools

In [36]: CritList = [Crit1, Crit2, Crit3]

In [37]: AllCrit = functools.reduce(lambda x, y: x & y, CritList)

In [38]: df[AllCrit]
Out[38]:
   AAA  BBB  CCC
0    4   10  100

Selection

Dataframes

The indexing docs.

Using both row labels and value conditionals

In [39]: df = pd.DataFrame(
   ....:     {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
   ....: )
   ....:

In [40]: df
Out[40]:
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [41]: df[(df.AAA <= 6) & (df.index.isin([0, 2, 4]))]
Out[41]:
   AAA  BBB  CCC
0    4   10  100
2    6   30  -30

Using loc for label-oriented slicing and iloc for positional slicing GH 2904

In [42]: df = pd.DataFrame(
   ....:     {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]},
   ....:     index=["foo", "bar", "boo", "kar"],
   ....: )
   ....:

There are 2 explicit slicing methods, with a third general case

  1. Positional-oriented (Python slicing style: exclusive of end); see the iloc sketch after the examples below

  2. Label-oriented (non-Python slicing style: inclusive of end)

  3. General (either slicing style: depends on whether the slice contains labels or positions)

In [43]: df.loc["bar":"kar"]  # Label
Out[43]:
     AAA  BBB  CCC
bar    5   20   50
boo    6   30  -30
kar    7   40  -50

# Generic
In [44]: df[0:3]
Out[44]:
     AAA  BBB  CCC
foo    4   10  100
bar    5   20   50
boo    6   30  -30

In [45]: df["bar":"kar"]
Out[45]:
     AAA  BBB  CCC
bar    5   20   50
boo    6   30  -30
kar    7   40  -50
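
For completeness, item 1 above (positional slicing) can also be written with iloc. A minimal sketch against the df defined just above; note the end point is excluded:

df.iloc[0:2]  # Positional
#      AAA  BBB  CCC
# foo    4   10  100
# bar    5   20   50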

Ambiguity arises when an index consists of integers with a non-zero start or a non-unit increment.

In [46]: data = {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}

In [47]: df2 = pd.DataFrame(data=data, index=[1, 2, 3, 4])  # Note index starts at 1.

In [48]: df2.iloc[1:3]  # Position-oriented
Out[48]:
   AAA  BBB  CCC
2    5   20   50
3    6   30  -30

In [49]: df2.loc[1:3]  # Label-oriented
Out[49]:
   AAA  BBB  CCC
1    4   10  100
2    5   20   50
3    6   30  -30

Using inverse operator (~) to take the complement of a mask

In [50]: df = pd.DataFrame(
   ....:     {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
   ....: )
   ....:

In [51]: df
Out[51]:
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [52]: df[~((df.AAA <= 6) & (df.index.isin([0, 2, 4])))]
Out[52]:
   AAA  BBB  CCC
1    5   20   50
3    7   40  -50

New columns

Efficiently and dynamically creating new columns using DataFrame.map (previously named applymap)

In [53]: df = pd.DataFrame({"AAA": [1, 2, 1, 3], "BBB": [1, 1, 2, 2], "CCC": [2, 1, 3, 1]})

In [54]: df
Out[54]:
   AAA  BBB  CCC
0    1    1    2
1    2    1    1
2    1    2    3
3    3    2    1

In [55]: source_cols = df.columns  # Or some subset would work too

In [56]: new_cols = [str(x) + "_cat" for x in source_cols]

In [57]: categories = {1: "Alpha", 2: "Beta", 3: "Charlie"}

In [58]: df[new_cols] = df[source_cols].map(categories.get)

In [59]: df
Out[59]:
   AAA  BBB  CCC  AAA_cat BBB_cat  CCC_cat
0    1    1    2    Alpha   Alpha     Beta
1    2    1    1     Beta   Alpha    Alpha
2    1    2    3    Alpha    Beta  Charlie
3    3    2    1  Charlie    Beta    Alpha

Keep other columns when using min() with groupby

In [60]: df = pd.DataFrame(
   ....:     {"AAA": [1, 1, 1, 2, 2, 2, 3, 3], "BBB": [2, 1, 3, 4, 5, 1, 2, 3]}
   ....: )
   ....:

In [61]: df
Out[61]:
   AAA  BBB
0    1    2
1    1    1
2    1    3
3    2    4
4    2    5
5    2    1
6    3    2
7    3    3

Method 1: idxmin() to get the index of the minimums

In [62]: df.loc[df.groupby("AAA")["BBB"].idxmin()]
Out[62]:
   AAA  BBB
1    1    1
5    2    1
6    3    2

Method 2: sort then take first of each

In [63]: df.sort_values(by="BBB").groupby("AAA", as_index=False).first()
Out[63]:
   AAA  BBB
0    1    1
1    2    1
2    3    2

Notice the same results, with the exception of the index.

Multiindexing

The multiindexing docs.

Creating a MultiIndex from a labeled frame

In [64]: df = pd.DataFrame(
   ....:     {
   ....:         "row": [0, 1, 2],
   ....:         "One_X": [1.1, 1.1, 1.1],
   ....:         "One_Y": [1.2, 1.2, 1.2],
   ....:         "Two_X": [1.11, 1.11, 1.11],
   ....:         "Two_Y": [1.22, 1.22, 1.22],
   ....:     }
   ....: )
   ....:

In [65]: df
Out[65]:
   row  One_X  One_Y  Two_X  Two_Y
0    0    1.1    1.2   1.11   1.22
1    1    1.1    1.2   1.11   1.22
2    2    1.1    1.2   1.11   1.22

# As Labelled Index
In [66]: df = df.set_index("row")

In [67]: df
Out[67]:
     One_X  One_Y  Two_X  Two_Y
row
0      1.1    1.2   1.11   1.22
1      1.1    1.2   1.11   1.22
2      1.1    1.2   1.11   1.22

# With Hierarchical Columns
In [68]: df.columns = pd.MultiIndex.from_tuples([tuple(c.split("_")) for c in df.columns])

In [69]: df
Out[69]:
     One        Two
       X    Y     X     Y
row
0    1.1  1.2  1.11  1.22
1    1.1  1.2  1.11  1.22
2    1.1  1.2  1.11  1.22

# Now stack & Reset
In [70]: df = df.stack(0, future_stack=True).reset_index(1)

In [71]: df
Out[71]:
    level_1     X     Y
row
0       One  1.10  1.20
0       Two  1.11  1.22
1       One  1.10  1.20
1       Two  1.11  1.22
2       One  1.10  1.20
2       Two  1.11  1.22

# And fix the labels (Notice the label 'level_1' got added automatically)
In [72]: df.columns = ["Sample", "All_X", "All_Y"]

In [73]: df
Out[73]:
    Sample  All_X  All_Y
row
0      One   1.10   1.20
0      Two   1.11   1.22
1      One   1.10   1.20
1      Two   1.11   1.22
2      One   1.10   1.20
2      Two   1.11   1.22

Arithmetic

Performing arithmetic with a MultiIndex that needs broadcasting

In [74]: cols = pd.MultiIndex.from_tuples(
   ....:     [(x, y) for x in ["A", "B", "C"] for y in ["O", "I"]]
   ....: )
   ....:

In [75]: df = pd.DataFrame(np.random.randn(2, 6), index=["n", "m"], columns=cols)

In [76]: df
Out[76]:
          A                   B                   C
          O         I         O         I         O         I
n  0.469112 -0.282863 -1.509059 -1.135632  1.212112 -0.173215
m  0.119209 -1.044236 -0.861849 -2.104569 -0.494929  1.071804

In [77]: df = df.div(df["C"], level=1)

In [78]: df
Out[78]:
          A                   B              C
          O         I         O         I    O    I
n  0.387021  1.633022 -1.244983  6.556214  1.0  1.0
m -0.240860 -0.974279  1.741358 -1.963577  1.0  1.0

Slicing

Slicing a MultiIndex with xs

In [79]: coords = [("AA", "one"), ("AA", "six"), ("BB", "one"), ("BB", "two"), ("BB", "six")]

In [80]: index = pd.MultiIndex.from_tuples(coords)

In [81]: df = pd.DataFrame([11, 22, 33, 44, 55], index, ["MyData"])

In [82]: df
Out[82]:
        MyData
AA one      11
   six      22
BB one      33
   two      44
   six      55

To take the cross-section of the 1st level and 1st axis of the index:

# Note : level and axis are optional, and default to zero
In [83]: df.xs("BB", level=0, axis=0)
Out[83]:
     MyData
one      33
two      44
six      55

…and now the 2nd level of the 1st axis.

In [84]: df.xs("six", level=1, axis=0)
Out[84]:
    MyData
AA      22
BB      55

Slicing a MultiIndex with xs, method #2

In [85]: import itertools

In [86]: index = list(itertools.product(["Ada", "Quinn", "Violet"], ["Comp", "Math", "Sci"]))

In [87]: headr = list(itertools.product(["Exams", "Labs"], ["I", "II"]))

In [88]: indx = pd.MultiIndex.from_tuples(index, names=["Student", "Course"])

In [89]: cols = pd.MultiIndex.from_tuples(headr)  # Notice these are un-named

In [90]: data = [[70 + x + y + (x * y) % 3 for x in range(4)] for y in range(9)]

In [91]: df = pd.DataFrame(data, indx, cols)

In [92]: df
Out[92]:
               Exams     Labs
                   I  II    I  II
Student Course
Ada     Comp      70  71   72  73
        Math      71  73   75  74
        Sci       72  75   75  75
Quinn   Comp      73  74   75  76
        Math      74  76   78  77
        Sci       75  78   78  78
Violet  Comp      76  77   78  79
        Math      77  79   81  80
        Sci       78  81   81  81

In [93]: All = slice(None)

In [94]: df.loc["Violet"]
Out[94]:
       Exams     Labs
           I  II    I  II
Course
Comp      76  77   78  79
Math      77  79   81  80
Sci       78  81   81  81

In [95]: df.loc[(All, "Math"), All]
Out[95]:
               Exams     Labs
                   I  II    I  II
Student Course
Ada     Math      71  73   75  74
Quinn   Math      74  76   78  77
Violet  Math      77  79   81  80

In [96]: df.loc[(slice("Ada", "Quinn"), "Math"), All]
Out[96]:
               Exams     Labs
                   I  II    I  II
Student Course
Ada     Math      71  73   75  74
Quinn   Math      74  76   78  77

In [97]: df.loc[(All, "Math"), ("Exams")]
Out[97]:
                 I  II
Student Course
Ada     Math    71  73
Quinn   Math    74  76
Violet  Math    77  79

In [98]: df.loc[(All, "Math"), (All, "II")]
Out[98]:
               Exams Labs
                  II   II
Student Course
Ada     Math      73   74
Quinn   Math      76   77
Violet  Math      79   80

Sorting

Sort by specific column or an ordered list of columns, with a MultiIndex

In [99]: df.sort_values(by=("Labs", "II"), ascending=False)
Out[99]:
               Exams     Labs
                   I  II    I  II
Student Course
Violet  Sci       78  81   81  81
        Math      77  79   81  80
        Comp      76  77   78  79
Quinn   Sci       75  78   78  78
        Math      74  76   78  77
        Comp      73  74   75  76
Ada     Sci       72  75   75  75
        Math      71  73   75  74
        Comp      70  71   72  73

Partial selection, the need for sortedness GH 2995
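
A minimal sketch of the sortedness point, using the exam df above (whose index happens to be sorted already): partial label slicing on a MultiIndex generally requires a lexsorted index, otherwise pandas may raise UnsortedIndexError.

df_sorted = df.sort_index()  # ensure the MultiIndex is lexsorted
df_sorted.loc[("Ada", "Comp"):("Quinn", "Math")]  # partial label slice now works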

Missing data

The missing data docs.

Fill forward a reversed timeseries

In [100]: df = pd.DataFrame(
   .....:     np.random.randn(6, 1),
   .....:     index=pd.date_range("2013-08-01", periods=6, freq="B"),
   .....:     columns=list("A"),
   .....: )
   .....:

In [101]: df.loc[df.index[3], "A"] = np.nan

In [102]: df
Out[102]:
                   A
2013-08-01  0.721555
2013-08-02 -0.706771
2013-08-05 -1.039575
2013-08-06       NaN
2013-08-07 -0.424972
2013-08-08  0.567020

In [103]: df.bfill()
Out[103]:
                   A
2013-08-01  0.721555
2013-08-02 -0.706771
2013-08-05 -1.039575
2013-08-06 -0.424972
2013-08-07 -0.424972
2013-08-08  0.567020

Grouping

The grouping docs.

Unlike agg, apply's callable is passed a sub-DataFrame which gives you access to all the columns

In [104]: df = pd.DataFrame(
   .....:     {
   .....:         "animal": "cat dog cat fish dog cat cat".split(),
   .....:         "size": list("SSMMMLL"),
   .....:         "weight": [8, 10, 11, 1, 20, 12, 12],
   .....:         "adult": [False] * 5 + [True] * 2,
   .....:     }
   .....: )
   .....:

In [105]: df
Out[105]:
  animal size  weight  adult
0    cat    S       8  False
1    dog    S      10  False
2    cat    M      11  False
3   fish    M       1  False
4    dog    M      20  False
5    cat    L      12   True
6    cat    L      12   True

# List the size of the animals with the highest weight.
In [106]: df.groupby("animal").apply(lambda subf: subf["size"][subf["weight"].idxmax()], include_groups=False)
Out[106]:
animal
cat     L
dog     M
fish    M
dtype: object

Using get_group

In [107]: gb = df.groupby("animal")

In [108]: gb.get_group("cat")
Out[108]:
  animal size  weight  adult
0    cat    S       8  False
2    cat    M      11  False
5    cat    L      12   True
6    cat    L      12   True

Apply to different items in a group

In [109]: def GrowUp(x):
   .....:     avg_weight = sum(x[x["size"] == "S"].weight * 1.5)
   .....:     avg_weight += sum(x[x["size"] == "M"].weight * 1.25)
   .....:     avg_weight += sum(x[x["size"] == "L"].weight)
   .....:     avg_weight /= len(x)
   .....:     return pd.Series(["L", avg_weight, True], index=["size", "weight", "adult"])
   .....:

In [110]: expected_df = gb.apply(GrowUp, include_groups=False)

In [111]: expected_df
Out[111]:
       size   weight  adult
animal
cat       L  12.4375   True
dog       L  20.0000   True
fish      L   1.2500   True

Expanding apply

In [112]: S = pd.Series([i / 100.0 for i in range(1, 11)])

In [113]: def cum_ret(x, y):
   .....:     return x * (1 + y)
   .....:

In [114]: def red(x):
   .....:     return functools.reduce(cum_ret, x, 1.0)
   .....:

In [115]: S.expanding().apply(red, raw=True)
Out[115]:
0    1.010000
1    1.030200
2    1.061106
3    1.103550
4    1.158728
5    1.228251
6    1.314229
7    1.419367
8    1.547110
9    1.701821
dtype: float64

Replacing some values with mean of the rest of a group

In [116]: df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, -1, 1, 2]})

In [117]: gb = df.groupby("A")

In [118]: def replace(g):
   .....:     mask = g < 0
   .....:     return g.where(~mask, g[~mask].mean())
   .....:

In [119]: gb.transform(replace)
Out[119]:
   B
0  1
1  1
2  1
3  2

Sort groups by aggregated data

In [120]: df = pd.DataFrame(
   .....:     {
   .....:         "code": ["foo", "bar", "baz"] * 2,
   .....:         "data": [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
   .....:         "flag": [False, True] * 3,
   .....:     }
   .....: )
   .....:

In [121]: code_groups = df.groupby("code")

In [122]: agg_n_sort_order = code_groups[["data"]].transform("sum").sort_values(by="data")

In [123]: sorted_df = df.loc[agg_n_sort_order.index]

In [124]: sorted_df
Out[124]:
  code  data   flag
1  bar -0.21   True
4  bar -0.59  False
0  foo  0.16  False
3  foo  0.45   True
2  baz  0.33  False
5  baz  0.62   True

Create multiple aggregated columns

In [125]: rng = pd.date_range(start="2014-10-07", periods=10, freq="2min")

In [126]: ts = pd.Series(data=list(range(10)), index=rng)

In [127]: def MyCust(x):
   .....:     if len(x) > 2:
   .....:         return x.iloc[1] * 1.234
   .....:     return pd.NaT
   .....:

In [128]: mhc = {"Mean": "mean", "Max": "max", "Custom": MyCust}

In [129]: ts.resample("5min").apply(mhc)
Out[129]:
                     Mean  Max Custom
2014-10-07 00:00:00   1.0    2  1.234
2014-10-07 00:05:00   3.5    4    NaT
2014-10-07 00:10:00   6.0    7  7.404
2014-10-07 00:15:00   8.5    9    NaT

In [130]: ts
Out[130]:
2014-10-07 00:00:00    0
2014-10-07 00:02:00    1
2014-10-07 00:04:00    2
2014-10-07 00:06:00    3
2014-10-07 00:08:00    4
2014-10-07 00:10:00    5
2014-10-07 00:12:00    6
2014-10-07 00:14:00    7
2014-10-07 00:16:00    8
2014-10-07 00:18:00    9
Freq: 2min, dtype: int64

Create a value counts column and reassign back to the DataFrame

In [131]: df = pd.DataFrame(
   .....:     {"Color": "Red Red Red Blue".split(), "Value": [100, 150, 50, 50]}
   .....: )
   .....:

In [132]: df
Out[132]:
  Color  Value
0   Red    100
1   Red    150
2   Red     50
3  Blue     50

In [133]: df["Counts"] = df.groupby(["Color"]).transform(len)

In [134]: df
Out[134]:
  Color  Value  Counts
0   Red    100       3
1   Red    150       3
2   Red     50       3
3  Blue     50       1

Shift groups of the values in a column based on the index

In [135]: df = pd.DataFrame(
   .....:     {"line_race": [10, 10, 8, 10, 10, 8], "beyer": [99, 102, 103, 103, 88, 100]},
   .....:     index=[
   .....:         "Last Gunfighter",
   .....:         "Last Gunfighter",
   .....:         "Last Gunfighter",
   .....:         "Paynter",
   .....:         "Paynter",
   .....:         "Paynter",
   .....:     ],
   .....: )
   .....:

In [136]: df
Out[136]:
                 line_race  beyer
Last Gunfighter         10     99
Last Gunfighter         10    102
Last Gunfighter          8    103
Paynter                 10    103
Paynter                 10     88
Paynter                  8    100

In [137]: df["beyer_shifted"] = df.groupby(level=0)["beyer"].shift(1)

In [138]: df
Out[138]:
                 line_race  beyer  beyer_shifted
Last Gunfighter         10     99            NaN
Last Gunfighter         10    102           99.0
Last Gunfighter          8    103          102.0
Paynter                 10    103            NaN
Paynter                 10     88          103.0
Paynter                  8    100           88.0

Select row with maximum value from each group

In [139]: df = pd.DataFrame(
   .....:     {
   .....:         "host": ["other", "other", "that", "this", "this"],
   .....:         "service": ["mail", "web", "mail", "mail", "web"],
   .....:         "no": [1, 2, 1, 2, 1],
   .....:     }
   .....: ).set_index(["host", "service"])
   .....:

In [140]: mask = df.groupby(level=0).agg("idxmax")

In [141]: df_count = df.loc[mask["no"]].reset_index()

In [142]: df_count
Out[142]:
    host service  no
0  other     web   2
1   that    mail   1
2   this    mail   2

Grouping like Python's itertools.groupby

In [143]: df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 1, 1], columns=["A"])

In [144]: df["A"].groupby((df["A"] != df["A"].shift()).cumsum()).groups
Out[144]: {1: [0], 2: [1], 3: [2], 4: [3, 4, 5], 5: [6], 6: [7, 8]}

In [145]: df["A"].groupby((df["A"] != df["A"].shift()).cumsum()).cumsum()
Out[145]:
0    0
1    1
2    0
3    1
4    2
5    3
6    0
7    1
8    2
Name: A, dtype: int64

Splitting

Create a list of dataframes, split using a delineation based upon logic included in rows.

In [146]: df = pd.DataFrame(
   .....:     data={
   .....:         "Case": ["A", "A", "A", "B", "A", "A", "B", "A", "A"],
   .....:         "Data": np.random.randn(9),
   .....:     }
   .....: )
   .....:

In [147]: dfs = list(
   .....:     zip(
   .....:         *df.groupby(
   .....:             (1 * (df["Case"] == "B"))
   .....:             .cumsum()
   .....:             .rolling(window=3, min_periods=1)
   .....:             .median()
   .....:         )
   .....:     )
   .....: )[-1]
   .....:

In [148]: dfs[0]
Out[148]:
  Case      Data
0    A  0.276232
1    A -1.087401
2    A -0.673690
3    B  0.113648

In [149]: dfs[1]
Out[149]:
  Case      Data
4    A -1.478427
5    A  0.524988
6    B  0.404705

In [150]: dfs[2]
Out[150]:
  Case      Data
7    A  0.577046
8    A -1.715002

Pivot

The Pivot docs.

Partial sums and subtotals

In [151]: df = pd.DataFrame(
   .....:     data={
   .....:         "Province": ["ON", "QC", "BC", "AL", "AL", "MN", "ON"],
   .....:         "City": [
   .....:             "Toronto",
   .....:             "Montreal",
   .....:             "Vancouver",
   .....:             "Calgary",
   .....:             "Edmonton",
   .....:             "Winnipeg",
   .....:             "Windsor",
   .....:         ],
   .....:         "Sales": [13, 6, 16, 8, 4, 3, 1],
   .....:     }
   .....: )
   .....:

In [152]: table = pd.pivot_table(
   .....:     df,
   .....:     values=["Sales"],
   .....:     index=["Province"],
   .....:     columns=["City"],
   .....:     aggfunc="sum",
   .....:     margins=True,
   .....: )
   .....:

In [153]: table.stack("City", future_stack=True)
Out[153]:
                    Sales
Province City
AL       Calgary      8.0
         Edmonton     4.0
         Montreal     NaN
         Toronto      NaN
         Vancouver    NaN
...                   ...
All      Toronto     13.0
         Vancouver   16.0
         Windsor      1.0
         Winnipeg     3.0
         All         51.0

[48 rows x 1 columns]

Frequency table like plyr in R

In [154]: grades = [48, 99, 75, 80, 42, 80, 72, 68, 36, 78]

In [155]: df = pd.DataFrame(
   .....:     {
   .....:         "ID": ["x%d" % r for r in range(10)],
   .....:         "Gender": ["F", "M", "F", "M", "F", "M", "F", "M", "M", "M"],
   .....:         "ExamYear": [
   .....:             "2007",
   .....:             "2007",
   .....:             "2007",
   .....:             "2008",
   .....:             "2008",
   .....:             "2008",
   .....:             "2008",
   .....:             "2009",
   .....:             "2009",
   .....:             "2009",
   .....:         ],
   .....:         "Class": [
   .....:             "algebra",
   .....:             "stats",
   .....:             "bio",
   .....:             "algebra",
   .....:             "algebra",
   .....:             "stats",
   .....:             "stats",
   .....:             "algebra",
   .....:             "bio",
   .....:             "bio",
   .....:         ],
   .....:         "Participated": [
   .....:             "yes",
   .....:             "yes",
   .....:             "yes",
   .....:             "yes",
   .....:             "no",
   .....:             "yes",
   .....:             "yes",
   .....:             "yes",
   .....:             "yes",
   .....:             "yes",
   .....:         ],
   .....:         "Passed": ["yes" if x > 50 else "no" for x in grades],
   .....:         "Employed": [
   .....:             True,
   .....:             True,
   .....:             True,
   .....:             False,
   .....:             False,
   .....:             False,
   .....:             False,
   .....:             True,
   .....:             True,
   .....:             False,
   .....:         ],
   .....:         "Grade": grades,
   .....:     }
   .....: )
   .....:

In [156]: df.groupby("ExamYear").agg(
   .....:     {
   .....:         "Participated": lambda x: x.value_counts()["yes"],
   .....:         "Passed": lambda x: sum(x == "yes"),
   .....:         "Employed": lambda x: sum(x),
   .....:         "Grade": lambda x: sum(x) / len(x),
   .....:     }
   .....: )
   .....:
Out[156]:
          Participated  Passed  Employed      Grade
ExamYear
2007                 3       2         3  74.000000
2008                 3       3         0  68.500000
2009                 3       2         2  60.666667

To create year and month cross tabulation:

In [157]: df = pd.DataFrame(
   .....:     {"value": np.random.randn(36)},
   .....:     index=pd.date_range("2011-01-01", freq="ME", periods=36),
   .....: )
   .....:

In [158]: pd.pivot_table(
   .....:     df, index=df.index.month, columns=df.index.year, values="value", aggfunc="sum"
   .....: )
   .....:
Out[158]:
        2011      2012      2013
1  -1.039268 -0.968914  2.565646
2  -0.370647 -1.294524  1.431256
3  -1.157892  0.413738  1.340309
4  -1.344312  0.276662 -1.170299
5   0.844885 -0.472035 -0.226169
6   1.075770 -0.013960  0.410835
7  -0.109050 -0.362543  0.813850
8   1.643563 -0.006154  0.132003
9  -1.469388 -0.923061 -0.827317
10  0.357021  0.895717 -0.076467
11 -0.674600  0.805244 -1.187678
12 -1.776904 -1.206412  1.130127

Apply

Turning embedded lists into a MultiIndex frame

In [159]: df = pd.DataFrame(
   .....:     data={
   .....:         "A": [[2, 4, 8, 16], [100, 200], [10, 20, 30]],
   .....:         "B": [["a", "b", "c"], ["jj", "kk"], ["ccc"]],
   .....:     },
   .....:     index=["I", "II", "III"],
   .....: )
   .....:

In [160]: def SeriesFromSubList(aList):
   .....:     return pd.Series(aList)
   .....:

In [161]: df_orgz = pd.concat(
   .....:     {ind: row.apply(SeriesFromSubList) for ind, row in df.iterrows()}
   .....: )
   .....:

In [162]: df_orgz
Out[162]:
         0     1     2     3
I   A    2     4     8  16.0
    B    a     b     c   NaN
II  A  100   200   NaN   NaN
    B   jj    kk   NaN   NaN
III A   10  20.0  30.0   NaN
    B  ccc   NaN   NaN   NaN

Rolling apply to multiple columns where the function calculates a Series before a scalar from the Series is returned

In [163]: df = pd.DataFrame(
   .....:     data=np.random.randn(2000, 2) / 10000,
   .....:     index=pd.date_range("2001-01-01", periods=2000),
   .....:     columns=["A", "B"],
   .....: )
   .....:

In [164]: df
Out[164]:
                   A         B
2001-01-01 -0.000144 -0.000141
2001-01-02  0.000161  0.000102
2001-01-03  0.000057  0.000088
2001-01-04 -0.000221  0.000097
2001-01-05 -0.000201 -0.000041
...              ...       ...
2006-06-19  0.000040 -0.000235
2006-06-20 -0.000123 -0.000021
2006-06-21 -0.000113  0.000114
2006-06-22  0.000136  0.000109
2006-06-23  0.000027  0.000030

[2000 rows x 2 columns]

In [165]: def gm(df, const):
   .....:     v = ((((df["A"] + df["B"]) + 1).cumprod()) - 1) * const
   .....:     return v.iloc[-1]
   .....:

In [166]: s = pd.Series(
   .....:     {
   .....:         df.index[i]: gm(df.iloc[i: min(i + 51, len(df) - 1)], 5)
   .....:         for i in range(len(df) - 50)
   .....:     }
   .....: )
   .....:

In [167]: s
Out[167]:
2001-01-01    0.000930
2001-01-02    0.002615
2001-01-03    0.001281
2001-01-04    0.001117
2001-01-05    0.002772
                ...
2006-04-30    0.003296
2006-05-01    0.002629
2006-05-02    0.002081
2006-05-03    0.004247
2006-05-04    0.003928
Length: 1950, dtype: float64

Rolling apply to multiple columns where the function returns a scalar (Volume Weighted Average Price)

In [168]: rng = pd.date_range(start="2014-01-01", periods=100)

In [169]: df = pd.DataFrame(
   .....:     {
   .....:         "Open": np.random.randn(len(rng)),
   .....:         "Close": np.random.randn(len(rng)),
   .....:         "Volume": np.random.randint(100, 2000, len(rng)),
   .....:     },
   .....:     index=rng,
   .....: )
   .....:

In [170]: df
Out[170]:
                Open     Close  Volume
2014-01-01 -1.611353 -0.492885    1219
2014-01-02 -3.000951  0.445794    1054
2014-01-03 -0.138359 -0.076081    1381
2014-01-04  0.301568  1.198259    1253
2014-01-05  0.276381 -0.669831    1728
...              ...       ...     ...
2014-04-06 -0.040338  0.937843    1188
2014-04-07  0.359661 -0.285908    1864
2014-04-08  0.060978  1.714814     941
2014-04-09  1.759055 -0.455942    1065
2014-04-10  0.138185 -1.147008    1453

[100 rows x 3 columns]

In [171]: def vwap(bars):
   .....:     return (bars.Close * bars.Volume).sum() / bars.Volume.sum()
   .....:

In [172]: window = 5

In [173]: s = pd.concat(
   .....:     [
   .....:         (pd.Series(vwap(df.iloc[i: i + window]), index=[df.index[i + window]]))
   .....:         for i in range(len(df) - window)
   .....:     ]
   .....: )
   .....:

In [174]: s.round(2)
Out[174]:
2014-01-06    0.02
2014-01-07    0.11
2014-01-08    0.10
2014-01-09    0.07
2014-01-10   -0.29
              ...
2014-04-06   -0.63
2014-04-07   -0.02
2014-04-08   -0.03
2014-04-09    0.34
2014-04-10    0.29
Length: 95, dtype: float64

Timeseries

Turn a matrix with hours in columns and days in rows into a continuous row sequence in the form of a timeseries. How to rearrange a Python pandas DataFrame?
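
A minimal sketch of one way to do this with stack; the index/column layout below is illustrative:

# days in rows, hours (0-23) in columns
df = pd.DataFrame(
    np.arange(48).reshape(2, 24),
    index=pd.date_range("2014-01-01", periods=2, freq="D"),
    columns=range(24),
)
ts = df.stack()  # Series with a (day, hour) MultiIndex
# rebuild one timestamp per row to get a continuous hourly series
ts.index = ts.index.get_level_values(0) + pd.to_timedelta(
    ts.index.get_level_values(1), unit="h"
)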

Calculate the first day of the month for each entry in a DatetimeIndex

In [175]: dates = pd.date_range("2000-01-01", periods=5)

In [176]: dates.to_period(freq="M").to_timestamp()
Out[176]:
DatetimeIndex(['2000-01-01', '2000-01-01', '2000-01-01', '2000-01-01',
               '2000-01-01'],
              dtype='datetime64[ns]', freq=None)

Merge

The Join docs.

Concatenate two dataframes with overlapping index (emulate R rbind)

In [177]: rng = pd.date_range("2000-01-01", periods=6)

In [178]: df1 = pd.DataFrame(np.random.randn(6, 3), index=rng, columns=["A", "B", "C"])

In [179]: df2 = df1.copy()

Depending on df construction, ignore_index may be needed

In [180]: df = pd.concat([df1, df2], ignore_index=True)

In [181]: df
Out[181]:
           A         B         C
0  -0.870117 -0.479265 -0.790855
1   0.144817  1.726395 -0.464535
2  -0.821906  1.597605  0.187307
3  -0.128342 -1.511638 -0.289858
4   0.399194 -1.430030 -0.639760
5   1.115116 -2.012600  1.810662
6  -0.870117 -0.479265 -0.790855
7   0.144817  1.726395 -0.464535
8  -0.821906  1.597605  0.187307
9  -0.128342 -1.511638 -0.289858
10  0.399194 -1.430030 -0.639760
11  1.115116 -2.012600  1.810662

Self-join of a DataFrame GH 2996

In [182]: df = pd.DataFrame(
   .....:     data={
   .....:         "Area": ["A"] * 5 + ["C"] * 2,
   .....:         "Bins": [110] * 2 + [160] * 3 + [40] * 2,
   .....:         "Test_0": [0, 1, 0, 1, 2, 0, 1],
   .....:         "Data": np.random.randn(7),
   .....:     }
   .....: )
   .....:

In [183]: df
Out[183]:
  Area  Bins  Test_0      Data
0    A   110       0 -0.433937
1    A   110       1 -0.160552
2    A   160       0  0.744434
3    A   160       1  1.754213
4    A   160       2  0.000850
5    C    40       0  0.342243
6    C    40       1  1.070599

In [184]: df["Test_1"] = df["Test_0"] - 1

In [185]: pd.merge(
   .....:     df,
   .....:     df,
   .....:     left_on=["Bins", "Area", "Test_0"],
   .....:     right_on=["Bins", "Area", "Test_1"],
   .....:     suffixes=("_L", "_R"),
   .....: )
   .....:
Out[185]:
  Area  Bins  Test_0_L    Data_L  Test_1_L  Test_0_R    Data_R  Test_1_R
0    A   110         0 -0.433937        -1         1 -0.160552         0
1    A   160         0  0.744434        -1         1  1.754213         0
2    A   160         1  1.754213         0         2  0.000850         1
3    C    40         0  0.342243        -1         1  1.070599         0

Plotting

The Plotting docs.

Boxplot for each quartile of a stratifying variable

In [186]: df = pd.DataFrame(
   .....:     {
   .....:         "stratifying_var": np.random.uniform(0, 100, 20),
   .....:         "price": np.random.normal(100, 5, 20),
   .....:     }
   .....: )
   .....:

In [187]: df["quartiles"] = pd.qcut(
   .....:     df["stratifying_var"], 4, labels=["0-25%", "25-50%", "50-75%", "75-100%"]
   .....: )
   .....:

In [188]: df.boxplot(column="price", by="quartiles")
Out[188]: <Axes: title={'center': 'price'}, xlabel='quartiles'>

Data in/out

CSV

The CSV docs

Reading a file that is compressed but not by gzip/bz2 (the native compressed formats which read_csv understands). This example shows a WinZipped file, but is a general application of opening the file within a context manager and using that handle to read. See here
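
A minimal sketch of that context-manager pattern, assuming a zip archive file.zip containing a single file.csv (both names are illustrative):

import zipfile

with zipfile.ZipFile("file.zip") as zf:
    with zf.open("file.csv") as handle:
        df = pd.read_csv(handle)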

Dealing with bad lines GH 2886
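
One option for malformed rows in recent pandas versions is read_csv's on_bad_lines argument; a minimal sketch:

from io import StringIO

data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10"
pd.read_csv(StringIO(data), on_bad_lines="skip")  # the row with too many fields is dropped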

The best way to combine multiple files into a single DataFrame is to read the individual frames one by one, put all of the individual frames into a list, and then combine the frames in the list using pd.concat():

In [189]: for i in range(3):
   .....:     data = pd.DataFrame(np.random.randn(10, 4))
   .....:     data.to_csv("file_{}.csv".format(i))
   .....:

In [190]: files = ["file_0.csv", "file_1.csv", "file_2.csv"]

In [191]: result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

You can use the same approach to read all files matching a pattern. Here is an example using glob:

In [192]: import glob

In [193]: import os

In [194]: files = glob.glob("file_*.csv")

In [195]: result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

Finally, this strategy will work with the other pd.read_*(...) functions described in the io docs.

Parsing date components in multi-columns is faster with a format

In [196]: i = pd.date_range("20000101", periods=10000)

In [197]: df = pd.DataFrame({"year": i.year, "month": i.month, "day": i.day})

In [198]: df.head()
Out[198]:
   year  month  day
0  2000      1    1
1  2000      1    2
2  2000      1    3
3  2000      1    4
4  2000      1    5

In [199]: %timeit pd.to_datetime(df.year * 10000 + df.month * 100 + df.day, format='%Y%m%d')
   .....: ds = df.apply(lambda x: "%04d%02d%02d" % (x["year"], x["month"], x["day"]), axis=1)
   .....: ds.head()
   .....: %timeit pd.to_datetime(ds)
   .....:
4.01 ms +- 635 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
1.05 ms +- 7.39 us per loop (mean +- std. dev. of 7 runs, 1,000 loops each)

Skip row between header and data

In [200]: data = """;;;;
   .....:  ;;;;
   .....:  ;;;;
   .....:  ;;;;
   .....:  ;;;;
   .....:  ;;;;
   .....: ;;;;
   .....:  ;;;;
   .....:  ;;;;
   .....: ;;;;
   .....: date;Param1;Param2;Param4;Param5
   .....:     ;m²;°C;m²;m
   .....: ;;;;
   .....: 01.01.1990 00:00;1;1;2;3
   .....: 01.01.1990 01:00;5;3;4;5
   .....: 01.01.1990 02:00;9;5;6;7
   .....: 01.01.1990 03:00;13;7;8;9
   .....: 01.01.1990 04:00;17;9;10;11
   .....: 01.01.1990 05:00;21;11;12;13
   .....: """
   .....:

Option 1: pass rows explicitly to skip rows

In [201]: from io import StringIO

In [202]: pd.read_csv(
   .....:     StringIO(data),
   .....:     sep=";",
   .....:     skiprows=[11, 12],
   .....:     index_col=0,
   .....:     parse_dates=True,
   .....:     header=10,
   .....: )
   .....:
Out[202]:
                     Param1  Param2  Param4  Param5
date
1990-01-01 00:00:00       1       1       2       3
1990-01-01 01:00:00       5       3       4       5
1990-01-01 02:00:00       9       5       6       7
1990-01-01 03:00:00      13       7       8       9
1990-01-01 04:00:00      17       9      10      11
1990-01-01 05:00:00      21      11      12      13

Option 2: read column names and then data

In [203]: pd.read_csv(StringIO(data), sep=";", header=10, nrows=10).columns
Out[203]: Index(['date', 'Param1', 'Param2', 'Param4', 'Param5'], dtype='object')

In [204]: columns = pd.read_csv(StringIO(data), sep=";", header=10, nrows=10).columns

In [205]: pd.read_csv(
   .....:     StringIO(data), sep=";", index_col=0, header=12, parse_dates=True, names=columns
   .....: )
   .....:
Out[205]:
                     Param1  Param2  Param4  Param5
date
1990-01-01 00:00:00       1       1       2       3
1990-01-01 01:00:00       5       3       4       5
1990-01-01 02:00:00       9       5       6       7
1990-01-01 03:00:00      13       7       8       9
1990-01-01 04:00:00      17       9      10      11
1990-01-01 05:00:00      21      11      12      13

HDFStore

The HDFStores docs

Managing heterogeneous data using a linked multiple table hierarchy GH 3032

De-duplicating a large store by chunks, essentially a recursive reduction operation. Shows a function for taking in data from a csv file and creating a store by chunks, with date parsing as well. See here
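
A minimal sketch of the chunked csv-to-store idea; the file name, chunk size, and date column below are illustrative:

store = pd.HDFStore("big.h5", mode="w")
for chunk in pd.read_csv("data.csv", chunksize=10_000, parse_dates=["date"]):
    store.append("df", chunk, data_columns=["date"])
store.close()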

Storing Attributes to a group node

In [206]: df = pd.DataFrame(np.random.randn(8, 3))

In [207]: store = pd.HDFStore("test.h5")

In [208]: store.put("df", df)

# you can store an arbitrary Python object via pickle
In [209]: store.get_storer("df").attrs.my_attribute = {"A": 10}

In [210]: store.get_storer("df").attrs.my_attribute
Out[210]: {'A': 10}

You can create or load an HDFStore in-memory by passing the driver parameter to PyTables. Changes are only written to disk when the HDFStore is closed.

In [211]: store = pd.HDFStore("test.h5", "w", driver="H5FD_CORE")

In [212]: df = pd.DataFrame(np.random.randn(8, 3))

In [213]: store["test"] = df

# only after closing the store, data is written to disk:
In [214]: store.close()

Binary files

If you need to read in a binary file consisting of an array of C structs, pandas can readily accept NumPy record arrays. For example, given this C program in a file called main.c compiled with gcc main.c -std=gnu99 on a 64-bit machine,

#include <stdio.h>
#include <stdint.h>

typedef struct _Data
{
    int32_t count;
    double avg;
    float scale;
} Data;

int main(int argc, const char *argv[])
{
    size_t n = 10;
    Data d[n];

    for (int i = 0; i < n; ++i)
    {
        d[i].count = i;
        d[i].avg = i + 1.0;
        d[i].scale = (float) i + 2.0f;
    }

    FILE *file = fopen("binary.dat", "wb");
    fwrite(&d, sizeof(Data), n, file);
    fclose(file);

    return 0;
}

the following Python code will read the binary file 'binary.dat' into a pandas DataFrame, where each element of the struct corresponds to a column in the frame:

names = "count", "avg", "scale"

# note that the offsets are larger than the size of the type because of
# struct padding
offsets = 0, 8, 16
formats = "i4", "f8", "f4"
dt = np.dtype({"names": names, "offsets": offsets, "formats": formats}, align=True)
df = pd.DataFrame(np.fromfile("binary.dat", dt))

The offsets of the structure elements may be different depending on the architecture of the machine on which the file was created. Using a raw binary file format like this for general data storage is not recommended, as it is not cross platform. We recommend either HDF5 or parquet, both of which are supported by pandas' IO facilities.
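
As a hedged illustration of those portable alternatives, a parquet round trip (assuming a parquet engine such as pyarrow is installed):

df.to_parquet("binary.parquet")   # write the frame read above
df2 = pd.read_parquet("binary.parquet")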

Computation

Correlation

It is often useful to obtain the lower (or upper) triangular form of a correlation matrix calculated from DataFrame.corr(). This can be achieved by passing a boolean mask to where:

In [215]: df = pd.DataFrame(np.random.random(size=(100, 5)))

In [216]: corr_mat = df.corr()

In [217]: mask = np.tril(np.ones_like(corr_mat, dtype=np.bool_), k=-1)

In [218]: corr_mat.where(mask)
Out[218]:
          0         1         2        3   4
0       NaN       NaN       NaN      NaN NaN
1 -0.079861       NaN       NaN      NaN NaN
2 -0.236573  0.183801       NaN      NaN NaN
3 -0.013795 -0.051975  0.037235      NaN NaN
4 -0.031974  0.118342 -0.073499 -0.02063 NaN

The method argument within DataFrame.corr can accept a callable in addition to the named correlation types. Here we compute the distance correlation matrix for a DataFrame object.

In [219]: def distcorr(x, y):
   .....:     n = len(x)
   .....:     a = np.zeros(shape=(n, n))
   .....:     b = np.zeros(shape=(n, n))
   .....:     for i in range(n):
   .....:         for j in range(i + 1, n):
   .....:             a[i, j] = abs(x[i] - x[j])
   .....:             b[i, j] = abs(y[i] - y[j])
   .....:     a += a.T
   .....:     b += b.T
   .....:     a_bar = np.vstack([np.nanmean(a, axis=0)] * n)
   .....:     b_bar = np.vstack([np.nanmean(b, axis=0)] * n)
   .....:     A = a - a_bar - a_bar.T + np.full(shape=(n, n), fill_value=a_bar.mean())
   .....:     B = b - b_bar - b_bar.T + np.full(shape=(n, n), fill_value=b_bar.mean())
   .....:     cov_ab = np.sqrt(np.nansum(A * B)) / n
   .....:     std_a = np.sqrt(np.sqrt(np.nansum(A ** 2)) / n)
   .....:     std_b = np.sqrt(np.sqrt(np.nansum(B ** 2)) / n)
   .....:     return cov_ab / std_a / std_b
   .....:

In [220]: df = pd.DataFrame(np.random.normal(size=(100, 3)))

In [221]: df.corr(method=distcorr)
Out[221]:
          0         1         2
0  1.000000  0.197613  0.216328
1  0.197613  1.000000  0.208749
2  0.216328  0.208749  1.000000

Timedeltas

The Timedeltas docs.

Using timedeltas

In [222]: import datetime

In [223]: s = pd.Series(pd.date_range("2012-1-1", periods=3, freq="D"))

In [224]: s - s.max()
Out[224]:
0   -2 days
1   -1 days
2    0 days
dtype: timedelta64[ns]

In [225]: s.max() - s
Out[225]:
0   2 days
1   1 days
2   0 days
dtype: timedelta64[ns]

In [226]: s - datetime.datetime(2011, 1, 1, 3, 5)
Out[226]:
0   364 days 20:55:00
1   365 days 20:55:00
2   366 days 20:55:00
dtype: timedelta64[ns]

In [227]: s + datetime.timedelta(minutes=5)
Out[227]:
0   2012-01-01 00:05:00
1   2012-01-02 00:05:00
2   2012-01-03 00:05:00
dtype: datetime64[ns]

In [228]: datetime.datetime(2011, 1, 1, 3, 5) - s
Out[228]:
0   -365 days +03:05:00
1   -366 days +03:05:00
2   -367 days +03:05:00
dtype: timedelta64[ns]

In [229]: datetime.timedelta(minutes=5) + s
Out[229]:
0   2012-01-01 00:05:00
1   2012-01-02 00:05:00
2   2012-01-03 00:05:00
dtype: datetime64[ns]

Adding and subtracting deltas and dates

In [230]: deltas = pd.Series([datetime.timedelta(days=i) for i in range(3)])

In [231]: df = pd.DataFrame({"A": s, "B": deltas})

In [232]: df
Out[232]:
           A      B
0 2012-01-01 0 days
1 2012-01-02 1 days
2 2012-01-03 2 days

In [233]: df["New Dates"] = df["A"] + df["B"]

In [234]: df["Delta"] = df["A"] - df["New Dates"]

In [235]: df
Out[235]:
           A      B  New Dates   Delta
0 2012-01-01 0 days 2012-01-01  0 days
1 2012-01-02 1 days 2012-01-03 -1 days
2 2012-01-03 2 days 2012-01-05 -2 days

In [236]: df.dtypes
Out[236]:
A             datetime64[ns]
B            timedelta64[ns]
New Dates     datetime64[ns]
Delta        timedelta64[ns]
dtype: object

Values can be set to NaT using np.nan, similar to datetime

In [237]: y = s - s.shift()

In [238]: y
Out[238]:
0      NaT
1   1 days
2   1 days
dtype: timedelta64[ns]

In [239]: y[1] = np.nan

In [240]: y
Out[240]:
0      NaT
1      NaT
2   1 days
dtype: timedelta64[ns]

Creating example data

To create a dataframe from every combination of some given values, like R's expand.grid() function, we can create a dict where the keys are column names and the values are lists of the data values:

In [241]: def expand_grid(data_dict):
   .....:     rows = itertools.product(*data_dict.values())
   .....:     return pd.DataFrame.from_records(rows, columns=data_dict.keys())
   .....:

In [242]: df = expand_grid(
   .....:     {"height": [60, 70], "weight": [100, 140, 180], "sex": ["Male", "Female"]}
   .....: )
   .....:

In [243]: df
Out[243]:
    height  weight     sex
0       60     100    Male
1       60     100  Female
2       60     140    Male
3       60     140  Female
4       60     180    Male
5       60     180  Female
6       70     100    Male
7       70     100  Female
8       70     140    Male
9       70     140  Female
10      70     180    Male
11      70     180  Female

Constant series

To assess whether a series has a constant value, we can check if series.nunique() <= 1. However, a more performant approach, that does not count all unique values first, is:

In [244]: v = s.to_numpy()

In [245]: is_constant = v.shape[0] == 0 or (s[0] == s).all()

This approach assumes that the series does not contain missing values. For the case that we would drop NA values, we can simply remove those values first:

In [246]: v = s.dropna().to_numpy()

In [247]: is_constant = v.shape[0] == 0 or (v[0] == v).all()

If missing values are considered distinct from any other value, then one could use:

In [248]: v = s.to_numpy()

In [249]: is_constant = v.shape[0] == 0 or (s[0] == s).all() or not pd.notna(v).any()

(Note that this example does not disambiguate between np.nan, pd.NA and None.)