Pandas Chinese Reference Guide

Working with text data

Text data types

There are two ways to store text data in pandas:

  1. object-dtype NumPy array.

  2. StringDtype extension type.

We recommend using StringDtype to store text data.

Prior to pandas 1.0, object dtype was the only option. This was unfortunate for many reasons:

  1. You can accidentally store a mixture of strings and non-strings in an object dtype array. It’s better to have a dedicated dtype.

  2. object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There isn’t a clear way to select just text while excluding non-text but still object-dtype columns.

  3. When reading code, the contents of an object dtype array are less clear than 'string'.
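
The first point can be illustrated with a minimal sketch. The variable names below are illustrative and not part of the original guide; the snippet only assumes that pandas is installed.

```python
import pandas as pd

# An object-dtype Series accepts any Python object, so strings and
# non-strings can end up side by side without any error.
mixed = pd.Series(["a", 1, None], dtype="object")
element_types = [type(value).__name__ for value in mixed]
print(element_types)

# Requesting the dedicated dtype makes the intent explicit and
# represents the missing entry as a proper missing value.
text = pd.Series(["a", "b", None], dtype="string")
print(text.dtype)
print(text.isna().tolist())
```

Reading `dtype: string` in the output tells you the column is meant to hold text, which `dtype: object` cannot.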

Currently, the performance of object dtype arrays of strings and arrays.StringArray is about the same. We expect future enhancements to significantly increase the performance and lower the memory overhead of StringArray.

Warning

StringArray is currently considered experimental. The implementation and parts of the API may change without warning.

For backwards compatibility, object dtype remains the default type inferred for a list of strings:

In [1]: pd.Series(["a", "b", "c"])
Out[1]:
0    a
1    b
2    c
dtype: object

To explicitly request string dtype, specify the dtype

In [2]: pd.Series(["a", "b", "c"], dtype="string")
Out[2]:
0    a
1    b
2    c
dtype: string

In [3]: pd.Series(["a", "b", "c"], dtype=pd.StringDtype())
Out[3]:
0    a
1    b
2    c
dtype: string

Or use astype after the Series or DataFrame is created:

In [4]: s = pd.Series(["a", "b", "c"])

In [5]: s
Out[5]:
0    a
1    b
2    c
dtype: object

In [6]: s.astype("string")
Out[6]:
0    a
1    b
2    c
dtype: string

You can also use StringDtype/"string" as the dtype on non-string data and it will be converted to string dtype:

In [7]: s = pd.Series(["a", 2, np.nan], dtype="string")

In [8]: s
Out[8]:
0       a
1       2
2    <NA>
dtype: string

In [9]: type(s[1])
Out[9]: str

Or convert from existing pandas data:

In [10]: s1 = pd.Series([1, 2, np.nan], dtype="Int64")

In [11]: s1
Out[11]:
0       1
1       2
2    <NA>
dtype: Int64

In [12]: s2 = s1.astype("string")

In [13]: s2
Out[13]:
0       1
1       2
2    <NA>
dtype: string

In [14]: type(s2[0])
Out[14]: str

Behavior differences

These are places where the behavior of StringDtype objects differs from object dtype:

  1. For StringDtype, string accessor methods that return numeric output will always return a nullable integer dtype, rather than either int or float dtype, depending on the presence of NA values. Methods returning boolean output will return a nullable boolean dtype.

In [15]: s = pd.Series(["a", None, "b"], dtype="string")

In [16]: s
Out[16]:
0       a
1    <NA>
2       b
dtype: string

In [17]: s.str.count("a")
Out[17]:
0       1
1    <NA>
2       0
dtype: Int64

In [18]: s.dropna().str.count("a")
Out[18]:
0    1
2    0
dtype: Int64

Both outputs are Int64 dtype. Compare that with object-dtype:

In [19]: s2 = pd.Series(["a", None, "b"], dtype="object")

In [20]: s2.str.count("a")
Out[20]:
0    1.0
1    NaN
2    0.0
dtype: float64

In [21]: s2.dropna().str.count("a")
Out[21]:
0    1
2    0
dtype: int64

When NA values are present, the output dtype is float64. Similarly for methods returning boolean values:

In [22]: s.str.isdigit()
Out[22]:
0    False
1     <NA>
2    False
dtype: boolean

In [23]: s.str.match("a")
Out[23]:
0     True
1     <NA>
2    False
dtype: boolean

  2. Some string methods, like Series.str.decode(), are not available on StringArray because StringArray only holds strings, not bytes.

  3. In comparison operations, arrays.StringArray and Series backed by a StringArray will return an object with BooleanDtype, rather than a bool dtype object. Missing values in a StringArray will propagate in comparison operations, rather than always comparing unequal like numpy.nan.
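
The comparison behavior in the last point can be sketched as follows. This is a minimal illustration with made-up variable names, not an excerpt from the original guide.

```python
import pandas as pd

values = ["a", None, "b"]
s_string = pd.Series(values, dtype="string")
s_object = pd.Series(values, dtype="object")

# With StringDtype the missing value propagates and the result uses
# the nullable boolean dtype.
string_result = s_string == "a"

# With object dtype the missing value simply compares unequal and the
# result is a plain NumPy bool Series.
object_result = s_object == "a"

print(string_result.dtype, string_result.isna().tolist())
print(object_result.dtype, object_result.tolist())
```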

Everything else that follows in the rest of this document applies equally to string and object dtype.

String methods

Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. These are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods:

In [24]: s = pd.Series(
   ....:     ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
   ....: )
   ....:

In [25]: s.str.lower()
Out[25]:
0       a
1       b
2       c
3    aaba
4    baca
5    <NA>
6    caba
7     dog
8     cat
dtype: string

In [26]: s.str.upper()
Out[26]:
0       A
1       B
2       C
3    AABA
4    BACA
5    <NA>
6    CABA
7     DOG
8     CAT
dtype: string

In [27]: s.str.len()
Out[27]:
0       1
1       1
2       1
3       4
4       4
5    <NA>
6       4
7       3
8       3
dtype: Int64

In [28]: idx = pd.Index([" jack", "jill ", " jesse ", "frank"])

In [29]: idx.str.strip()
Out[29]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')

In [30]: idx.str.lstrip()
Out[30]: Index(['jack', 'jill ', 'jesse ', 'frank'], dtype='object')

In [31]: idx.str.rstrip()
Out[31]: Index([' jack', 'jill', ' jesse', 'frank'], dtype='object')

The string methods on Index are especially useful for cleaning up or transforming DataFrame columns. For instance, you may have columns with leading or trailing whitespace:

In [32]: df = pd.DataFrame(
   ....:     np.random.randn(3, 2), columns=[" Column A ", " Column B "], index=range(3)
   ....: )
   ....:

In [33]: df
Out[33]:
   Column A   Column B
0   0.469112  -0.282863
1  -1.509059  -1.135632
2   1.212112  -0.173215

Since df.columns is an Index object, we can use the .str accessor

In [34]: df.columns.str.strip()
Out[34]: Index(['Column A', 'Column B'], dtype='object')

In [35]: df.columns.str.lower()
Out[35]: Index([' column a ', ' column b '], dtype='object')

These string methods can then be used to clean up the columns as needed. Here we are removing leading and trailing whitespace, lowercasing all names, and replacing any remaining whitespace with underscores:

In [36]: df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

In [37]: df
Out[37]:
   column_a  column_b
0  0.469112 -0.282863
1 -1.509059 -1.135632
2  1.212112 -0.173215

If you have a Series where lots of elements are repeated (i.e. the number of unique elements in the Series is a lot smaller than the length of the Series), it can be faster to convert the original Series to one of type category and then use .str.<method> or .dt.<property> on that. The performance difference comes from the fact that, for Series of type category, the string operations are done on the .categories and not on each element of the Series.

Please note that a Series of type category with string .categories has some limitations in comparison to Series of type string (e.g. you can’t add strings to each other: s + " " + s won’t work if s is a Series of type category). Also, .str methods which operate on elements of type list are not available on such a Series.
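
A minimal sketch of the two points above, using made-up data. It only checks that the category route gives the same values and that string addition fails; it does not measure speed, which will depend on the data.

```python
import pandas as pd

# Many repeated values: only two unique strings in 4000 rows.
s = pd.Series(["apple", "banana", "apple", "banana"] * 1000)
as_category = s.astype("category")
n_categories = len(as_category.cat.categories)

# The .str accessor works on both; for the categorical Series the
# operation only has to be applied to the two categories.
upper_plain = s.str.upper()
upper_category = as_category.str.upper()
print(upper_category.tolist() == upper_plain.tolist())

# Limitation: strings cannot be added to a categorical Series.
try:
    as_category + " " + as_category
    addition_failed = False
except TypeError:
    addition_failed = True
print(addition_failed)
```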

Warning

The type of the Series is inferred, and the .str accessor is only available for the allowed types (i.e. strings).

Generally speaking, the .str accessor is intended to work only on strings. With very few exceptions, other uses are not supported, and may be disabled at a later point.
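
As a small illustration of this warning, the sketch below (with illustrative variable names) shows what happens when .str is used on numeric data, and one way to work around it by converting to a string dtype first.

```python
import pandas as pd

numbers = pd.Series([1, 2, 3])

# Accessing .str on a non-string Series raises AttributeError.
try:
    numbers.str.upper()
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
print(error_message)

# Converting to string dtype first makes the string methods available.
as_text = numbers.astype("string").str.zfill(3)
print(as_text.tolist())
```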

Splitting and replacing strings

Methods like split return a Series of lists:

In [38]: s2 = pd.Series(["a_b_c", "c_d_e", np.nan, "f_g_h"], dtype="string")

In [39]: s2.str.split("_")
Out[39]:
0    [a, b, c]
1    [c, d, e]
2         <NA>
3    [f, g, h]
dtype: object

Elements in the split lists can be accessed using get or [] notation:

In [40]: s2.str.split("_").str.get(1)
Out[40]:
0       b
1       d
2    <NA>
3       g
dtype: object

In [41]: s2.str.split("_").str[1]
Out[41]:
0       b
1       d
2    <NA>
3       g
dtype: object

It is easy to expand this to return a DataFrame using expand.

In [42]: s2.str.split("_", expand=True)
Out[42]:
      0     1     2
0     a     b     c
1     c     d     e
2  <NA>  <NA>  <NA>
3     f     g     h

When the original Series has StringDtype, the output columns will all be StringDtype as well.

It is also possible to limit the number of splits:

In [43]: s2.str.split("_", expand=True, n=1)
Out[43]:
      0     1
0     a   b_c
1     c   d_e
2  <NA>  <NA>
3     f   g_h

rsplit is similar to split except it works in the reverse direction, i.e., from the end of the string to the beginning of the string:

In [44]: s2.str.rsplit("_", expand=True, n=1)
Out[44]:
      0     1
0   a_b     c
1   c_d     e
2  <NA>  <NA>
3   f_g     h

replace optionally uses regular expressions:

In [45]: s3 = pd.Series(
   ....:     ["A", "B", "C", "Aaba", "Baca", "", np.nan, "CABA", "dog", "cat"],
   ....:     dtype="string",
   ....: )
   ....:

In [46]: s3
Out[46]:
0       A
1       B
2       C
3    Aaba
4    Baca
5
6    <NA>
7    CABA
8     dog
9     cat
dtype: string

In [47]: s3.str.replace("^.a|dog", "XX-XX ", case=False, regex=True)
Out[47]:
0           A
1           B
2           C
3    XX-XX ba
4    XX-XX ca
5
6        <NA>
7    XX-XX BA
8      XX-XX
9     XX-XX t
dtype: string

Changed in version 2.0.

A single-character pattern with regex=True is also treated as a regular expression:

In [48]: s4 = pd.Series(["a.b", ".", "b", np.nan, ""], dtype="string")

In [49]: s4
Out[49]:
0     a.b
1       .
2       b
3    <NA>
4
dtype: string

In [50]: s4.str.replace(".", "a", regex=True)
Out[50]:
0     aaa
1       a
2       a
3    <NA>
4
dtype: string

If you want literal replacement of a string (equivalent to str.replace()), you can set the optional regex parameter to False, rather than escaping each character. In this case both pat and repl must be strings:

In [51]: dollars = pd.Series(["12", "-$10", "$10,000"], dtype="string")

# These lines are equivalent
In [52]: dollars.str.replace(r"-\$", "-", regex=True)
Out[52]:
0         12
1        -10
2    $10,000
dtype: string

In [53]: dollars.str.replace("-$", "-", regex=False)
Out[53]:
0         12
1        -10
2    $10,000
dtype: string

The replace method can also take a callable as replacement. It is called on every match of pat using re.sub(). The callable should expect one positional argument (a regex match object) and return a string.

# Reverse every lowercase alphabetic word
In [54]: pat = r"[a-z]+"

In [55]: def repl(m):
   ....:     return m.group(0)[::-1]
   ....:

In [56]: pd.Series(["foo 123", "bar baz", np.nan], dtype="string").str.replace(
   ....:     pat, repl, regex=True
   ....: )
   ....:
Out[56]:
0    oof 123
1    rab zab
2       <NA>
dtype: string

# Using regex groups
In [57]: pat = r"(?P<one>\w+) (?P<two>\w+) (?P<three>\w+)"

In [58]: def repl(m):
   ....:     return m.group("two").swapcase()
   ....:

In [59]: pd.Series(["Foo Bar Baz", np.nan], dtype="string").str.replace(
   ....:     pat, repl, regex=True
   ....: )
   ....:
Out[59]:
0     bAR
1    <NA>
dtype: string

The replace method also accepts a compiled regular expression object from re.compile() as a pattern. All flags should be included in the compiled regular expression object.

In [60]: import re

In [61]: regex_pat = re.compile(r"^.a|dog", flags=re.IGNORECASE)

In [62]: s3.str.replace(regex_pat, "XX-XX ", regex=True)
Out[62]:
0           A
1           B
2           C
3    XX-XX ba
4    XX-XX ca
5
6        <NA>
7    XX-XX BA
8      XX-XX
9     XX-XX t
dtype: string

Including a flags argument when calling replace with a compiled regular expression object will raise a ValueError.

In [63]: s3.str.replace(regex_pat, 'XX-XX ', flags=re.IGNORECASE)
---------------------------------------------------------------------------
ValueError: case and flags cannot be set when pat is a compiled regex

removeprefix and removesuffix have the same effect as str.removeprefix and str.removesuffix, which were added in Python 3.9 (https://docs.python.org/3/library/stdtypes.html#str.removeprefix):

New in version 1.4.0.

In [64]: s = pd.Series(["str_foo", "str_bar", "no_prefix"])

In [65]: s.str.removeprefix("str_")
Out[65]:
0          foo
1          bar
2    no_prefix
dtype: object

In [66]: s = pd.Series(["foo_str", "bar_str", "no_suffix"])

In [67]: s.str.removesuffix("_str")
Out[67]:
0          foo
1          bar
2    no_suffix
dtype: object

Concatenation

There are several ways to concatenate a Series or Index, either with itself or with others, all based on Series.str.cat() or Index.str.cat(), respectively.

Concatenating a single Series into a string

The content of a Series (or Index) can be concatenated:

In [68]: s = pd.Series(["a", "b", "c", "d"], dtype="string")

In [69]: s.str.cat(sep=",")
Out[69]: 'a,b,c,d'

If not specified, the keyword sep for the separator defaults to the empty string, sep='':

In [70]: s.str.cat()
Out[70]: 'abcd'

By default, missing values are ignored. Using na_rep, they can be given a representation:

In [71]: t = pd.Series(["a", "b", np.nan, "d"], dtype="string")

In [72]: t.str.cat(sep=",")
Out[72]: 'a,b,d'

In [73]: t.str.cat(sep=",", na_rep="-")
Out[73]: 'a,b,-,d'

Concatenating a Series and something list-like into a Series

The first argument to cat() can be a list-like object, provided that it matches the length of the calling Series (or Index).

In [74]: s.str.cat(["A", "B", "C", "D"])
Out[74]:
0    aA
1    bB
2    cC
3    dD
dtype: string

Missing values on either side will result in missing values in the result as well, unless na_rep is specified:

In [75]: s.str.cat(t)
Out[75]:
0      aa
1      bb
2    <NA>
3      dd
dtype: string

In [76]: s.str.cat(t, na_rep="-")
Out[76]:
0    aa
1    bb
2    c-
3    dd
dtype: string

Concatenating a Series and something array-like into a Series

The parameter others can also be two-dimensional. In this case, the number of rows must match the length of the calling Series (or Index).

In [77]: d = pd.concat([t, s], axis=1)

In [78]: s
Out[78]:
0    a
1    b
2    c
3    d
dtype: string

In [79]: d
Out[79]:
      0  1
0     a  a
1     b  b
2  <NA>  c
3     d  d

In [80]: s.str.cat(d, na_rep="-")
Out[80]:
0    aaa
1    bbb
2    c-c
3    ddd
dtype: string

Concatenating a Series and an indexed object into a Series, with alignment

For concatenation with a Series or DataFrame, it is possible to align the indexes before concatenation by setting the join keyword.

In [81]: u = pd.Series(["b", "d", "a", "c"], index=[1, 3, 0, 2], dtype="string")

In [82]: s
Out[82]:
0    a
1    b
2    c
3    d
dtype: string

In [83]: u
Out[83]:
1    b
3    d
0    a
2    c
dtype: string

In [84]: s.str.cat(u)
Out[84]:
0    aa
1    bb
2    cc
3    dd
dtype: string

In [85]: s.str.cat(u, join="left")
Out[85]:
0    aa
1    bb
2    cc
3    dd
dtype: string

The usual options are available for join (one of 'left', 'outer', 'inner', 'right'). In particular, alignment also means that the different lengths do not need to coincide anymore.

In [86]: v = pd.Series(["z", "a", "b", "d", "e"], index=[-1, 0, 1, 3, 4], dtype="string")

In [87]: s
Out[87]:
0    a
1    b
2    c
3    d
dtype: string

In [88]: v
Out[88]:
-1    z
 0    a
 1    b
 3    d
 4    e
dtype: string

In [89]: s.str.cat(v, join="left", na_rep="-")
Out[89]:
0    aa
1    bb
2    c-
3    dd
dtype: string

In [90]: s.str.cat(v, join="outer", na_rep="-")
Out[90]:
-1    -z
 0    aa
 1    bb
 2    c-
 3    dd
 4    -e
dtype: string

The same alignment can be used when others is a DataFrame:

In [91]: f = d.loc[[3, 2, 1, 0], :]

In [92]: s
Out[92]:
0    a
1    b
2    c
3    d
dtype: string

In [93]: f
Out[93]:
      0  1
3     d  d
2  <NA>  c
1     b  b
0     a  a

In [94]: s.str.cat(f, join="left", na_rep="-")
Out[94]:
0    aaa
1    bbb
2    c-c
3    ddd
dtype: string

Concatenating a Series and many objects into a Series

Several array-like items (specifically: Series, Index, and 1-dimensional variants of np.ndarray) can be combined in a list-like container (including iterators, dict-views, etc.).

In [95]: s
Out[95]:
0    a
1    b
2    c
3    d
dtype: string

In [96]: u
Out[96]:
1    b
3    d
0    a
2    c
dtype: string

In [97]: s.str.cat([u, u.to_numpy()], join="left")
Out[97]:
0    aab
1    bbd
2    cca
3    ddc
dtype: string

All elements without an index (e.g. np.ndarray) within the passed list-like must match in length to the calling Series (or Index), but Series and Index may have arbitrary length (as long as alignment is not disabled with join=None):

In [98]: v
Out[98]:
-1    z
 0    a
 1    b
 3    d
 4    e
dtype: string

In [99]: s.str.cat([v, u, u.to_numpy()], join="outer", na_rep="-")
Out[99]:
-1    -z--
 0    aaab
 1    bbbd
 2    c-ca
 3    dddc
 4    -e--
dtype: string

If using join='right' on a list-like of others that contains different indexes, the union of these indexes will be used as the basis for the final concatenation:

In [100]: u.loc[[3]]
Out[100]:
3    d
dtype: string

In [101]: v.loc[[-1, 0]]
Out[101]:
-1    z
 0    a
dtype: string

In [102]: s.str.cat([u.loc[[3]], v.loc[[-1, 0]]], join="right", na_rep="-")
Out[102]:
 3    dd-
-1    --z
 0    a-a
dtype: string

Indexing with .str

You can use [] notation to directly index by position. If you index past the end of the string, the result will be a missing value (NaN, or <NA> for StringDtype).

In [103]: s = pd.Series(
   .....:     ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
   .....: )
   .....:

In [104]: s.str[0]
Out[104]:
0       A
1       B
2       C
3       A
4       B
5    <NA>
6       C
7       d
8       c
dtype: string

In [105]: s.str[1]
Out[105]:
0    <NA>
1    <NA>
2    <NA>
3       a
4       a
5    <NA>
6       A
7       o
8       a
dtype: string

Extracting substrings

Extract first match in each subject (extract)

The extract method accepts a regular expression with at least one capture group.

Extracting a regular expression with more than one group returns a DataFrame with one column per group.

In [106]: pd.Series(
   .....:     ["a1", "b2", "c3"],
   .....:     dtype="string",
   .....: ).str.extract(r"([ab])(\d)", expand=False)
   .....:
Out[106]:
      0     1
0     a     1
1     b     2
2  <NA>  <NA>

Elements that do not match return a row filled with missing values (NaN, or <NA> for StringDtype). Thus, a Series of messy strings can be “converted” into a like-indexed Series or DataFrame of cleaned-up or more useful strings, without necessitating get() to access tuples or re.match objects. For object-dtype input, the dtype of the result is always object, even if no match is found and the result only contains NaN.

Named groups like

In [107]: pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(
   .....:     r"(?P<letter>[ab])(?P<digit>\d)", expand=False
   .....: )
   .....:
Out[107]:
  letter digit
0      a     1
1      b     2
2   <NA>  <NA>

and optional groups like

In [108]: pd.Series(
   .....:     ["a1", "b2", "3"],
   .....:     dtype="string",
   .....: ).str.extract(r"([ab])?(\d)", expand=False)
   .....:
Out[108]:
      0  1
0     a  1
1     b  2
2  <NA>  3

can also be used. Note that any capture group names in the regular expression will be used for column names; otherwise capture group numbers will be used.
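
The column-naming rule can be checked with a short sketch that mixes one named and one unnamed group. The pattern and variable names here are chosen for illustration only.

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"], dtype="string")

# The first group is named, the second is not, so the second column
# falls back to its capture group number.
mixed_groups = s.str.extract(r"(?P<letter>[ab])(\d)", expand=True)
column_labels = list(mixed_groups.columns)
print(column_labels)

# "c3" does not match, so its row holds only missing values.
print(mixed_groups["letter"].isna().tolist())
```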

Extracting a regular expression with one group returns a DataFrame with one column if expand=True.

In [109]: pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(r"[ab](\d)", expand=True)
Out[109]:
      0
0     1
1     2
2  <NA>

It returns a Series if expand=False.

In [110]: pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(r"[ab](\d)", expand=False)
Out[110]:
0       1
1       2
2    <NA>
dtype: string

Calling on an Index with a regex with exactly one capture group returns a DataFrame with one column if expand=True.

In [111]: s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"], dtype="string")

In [112]: s
Out[112]:
A11    a1
B22    b2
C33    c3
dtype: string

In [113]: s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)
Out[113]:
  letter
0      A
1      B
2      C

It returns an Index if expand=False.

In [114]: s.index.str.extract("(?P<letter>[a-zA-Z])", expand=False)
Out[114]: Index(['A', 'B', 'C'], dtype='object', name='letter')

Calling on an Index with a regex with more than one capture group returns a DataFrame if expand=True.

In [115]: s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)
Out[115]:
  letter   1
0      A  11
1      B  22
2      C  33

It raises ValueError if expand=False.

In [116]: s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[116], line 1
----> 1 s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)

File ~/work/pandas/pandas/pandas/core/strings/accessor.py:137, in forbid_nonstring_types.<locals>._forbid_nonstring_types.<locals>.wrapper(self, *args, **kwargs)
    132     msg = (
    133         f"Cannot use .str.{func_name} with values of "
    134         f"inferred dtype '{self._inferred_dtype}'."
    135     )
    136     raise TypeError(msg)
--> 137 return func(self, *args, **kwargs)

File ~/work/pandas/pandas/pandas/core/strings/accessor.py:2743, in StringMethods.extract(self, pat, flags, expand)
   2740     raise ValueError("pattern contains no capture groups")
   2742 if not expand and regex.groups > 1 and isinstance(self._data, ABCIndex):
-> 2743     raise ValueError("only one regex group is supported with Index")
   2745 obj = self._data
   2746 result_dtype = _result_dtype(obj)

ValueError: only one regex group is supported with Index

The table below summarizes the behavior of extract(expand=False) (input subject in first column, number of groups in regex in first row)

          1 group    >1 group
Index     Index      ValueError
Series    Series     DataFrame
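
The four cases in the table can be checked with a minimal sketch. The patterns and variable names are illustrative and not taken from the original guide.

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"], dtype="string")
idx = pd.Index(["A11", "B22", "C33"])

one_group = r"([a-zA-Z])"
two_groups = r"([a-zA-Z])(\d+)"

# Series input: one group gives a Series, several groups a DataFrame.
series_one = s.str.extract(one_group, expand=False)
series_two = s.str.extract(two_groups, expand=False)

# Index input: one group gives an Index, several groups raise.
index_one = idx.str.extract(one_group, expand=False)
try:
    idx.str.extract(two_groups, expand=False)
    index_two_error = None
except ValueError as exc:
    index_two_error = str(exc)

print(type(series_one).__name__, type(series_two).__name__)
print(type(index_one).__name__, index_two_error)
```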

Extract all matches in each subject (extractall)

Unlike extract (which returns only the first match),

In [117]: s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="string")

In [118]: s
Out[118]:
A    a1a2
B      b1
C      c1
dtype: string

In [119]: two_groups = "(?P<letter>[a-z])(?P<digit>[0-9])"

In [120]: s.str.extract(two_groups, expand=True)
Out[120]:
  letter digit
A      a     1
B      b     1
C      c     1

the extractall method returns every match. The result of extractall is always a DataFrame with a MultiIndex on its rows. The last level of the MultiIndex is named match and indicates the order in the subject.

In [121]: s.str.extractall(two_groups)
Out[121]:
        letter digit
  match
A 0          a     1
  1          a     2
B 0          b     1
C 0          c     1

When each subject string in the Series has exactly one match,

In [122]: s = pd.Series(["a3", "b3", "c2"], dtype="string")

In [123]: s
Out[123]:
0    a3
1    b3
2    c2
dtype: string

then extractall(pat).xs(0, level='match') gives the same result as extract(pat).

In [124]: extract_result = s.str.extract(two_groups, expand=True)

In [125]: extract_result
Out[125]:
  letter digit
0      a     3
1      b     3
2      c     2

In [126]: extractall_result = s.str.extractall(two_groups)

In [127]: extractall_result
Out[127]:
        letter digit
  match
0 0          a     3
1 0          b     3
2 0          c     2

In [128]: extractall_result.xs(0, level="match")
Out[128]:
  letter digit
0      a     3
1      b     3
2      c     2

Index also supports .str.extractall. It returns a DataFrame which has the same result as a Series.str.extractall with a default index (starts from 0).

In [129]: pd.Index(["a1a2", "b1", "c1"]).str.extractall(two_groups)
Out[129]:
        letter digit
  match
0 0          a     1
  1          a     2
1 0          b     1
2 0          c     1

In [130]: pd.Series(["a1a2", "b1", "c1"], dtype="string").str.extractall(two_groups)
Out[130]:
        letter digit
  match
0 0          a     1
  1          a     2
1 0          b     1
2 0          c     1

Testing for strings that match or contain a pattern

You can check whether elements contain a pattern:

In [131]: pattern = r"[0-9][a-z]"

In [132]: pd.Series(
   .....:     ["1", "2", "3a", "3b", "03c", "4dx"],
   .....:     dtype="string",
   .....: ).str.contains(pattern)
   .....:
Out[132]:
0    False
1    False
2     True
3     True
4     True
5     True
dtype: boolean

Or whether elements match a pattern:

In [133]: pd.Series(
   .....:     ["1", "2", "3a", "3b", "03c", "4dx"],
   .....:     dtype="string",
   .....: ).str.match(pattern)
   .....:
Out[133]:
0    False
1    False
2     True
3     True
4    False
5     True
dtype: boolean

In [134]: pd.Series(
   .....:     ["1", "2", "3a", "3b", "03c", "4dx"],
   .....:     dtype="string",
   .....: ).str.fullmatch(pattern)
   .....:
Out[134]:
0    False
1    False
2     True
3     True
4    False
5    False
dtype: boolean

The distinction between match, fullmatch, and contains is strictness: fullmatch tests whether the entire string matches the regular expression; match tests whether there is a match of the regular expression that begins at the first character of the string; and contains tests whether there is a match of the regular expression at any position within the string.

The corresponding functions in the re package for these three match modes are re.fullmatch, re.match, and re.search, respectively.
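
The correspondence with the re module can be checked element by element. The sketch below reuses the pattern and data from the examples above; the dictionary names are illustrative.

```python
import re

import pandas as pd

pattern = r"[0-9][a-z]"
values = ["1", "2", "3a", "3b", "03c", "4dx"]
s = pd.Series(values, dtype="string")

# Apply the scalar re functions to each string by hand.
via_re = {
    "fullmatch": [re.fullmatch(pattern, v) is not None for v in values],
    "match": [re.match(pattern, v) is not None for v in values],
    "contains": [re.search(pattern, v) is not None for v in values],
}

# Apply the vectorized pandas methods to the whole Series.
via_pandas = {
    "fullmatch": s.str.fullmatch(pattern).tolist(),
    "match": s.str.match(pattern).tolist(),
    "contains": s.str.contains(pattern).tolist(),
}

print(via_re == via_pandas)
```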

Methods like match, fullmatch, contains, startswith, and endswith take an extra na argument so missing values can be considered True or False:

In [135]: s4 = pd.Series(
   .....:     ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
   .....: )
   .....:

In [136]: s4.str.contains("A", na=False)
Out[136]:
0     True
1    False
2    False
3     True
4    False
5    False
6     True
7    False
8    False
dtype: boolean

Creating indicator variables

You can extract dummy variables from string columns. For example, if they are separated by a '|':

In [137]: s = pd.Series(["a", "a|b", np.nan, "a|c"], dtype="string")

In [138]: s.str.get_dummies(sep="|")
Out[138]:
   a  b  c
0  1  0  0
1  1  1  0
2  0  0  0
3  1  0  1

String Index also supports get_dummies which returns a MultiIndex.

In [139]: idx = pd.Index(["a", "a|b", np.nan, "a|c"])

In [140]: idx.str.get_dummies(sep="|")
Out[140]:
MultiIndex([(1, 0, 0),
            (1, 1, 0),
            (0, 0, 0),
            (1, 0, 1)],
           names=['a', 'b', 'c'])

See also get_dummies().

Method summary

Method            Description
cat()             Concatenate strings
split()           Split strings on delimiter
rsplit()          Split strings on delimiter working from the end of the string
get()             Index into each element (retrieve i-th element)
join()            Join strings in each element of the Series with passed separator
get_dummies()     Split strings on the delimiter returning DataFrame of dummy variables
contains()        Return boolean array if each string contains pattern/regex
replace()         Replace occurrences of pattern/regex/string with some other string or the return value of a callable given the occurrence
removeprefix()    Remove prefix from string, i.e. only remove if string starts with prefix.
removesuffix()    Remove suffix from string, i.e. only remove if string ends with suffix.
repeat()          Duplicate values (s.str.repeat(3) equivalent to x * 3)
pad()             Add whitespace to left, right, or both sides of strings
center()          Equivalent to str.center
ljust()           Equivalent to str.ljust
rjust()           Equivalent to str.rjust
zfill()           Equivalent to str.zfill
wrap()            Split long strings into lines with length less than a given width
slice()           Slice each string in the Series
slice_replace()   Replace slice in each string with passed value
count()           Count occurrences of pattern
startswith()      Equivalent to str.startswith(pat) for each element
endswith()        Equivalent to str.endswith(pat) for each element
findall()         Compute list of all occurrences of pattern/regex for each string
match()           Call re.match on each element, returning a boolean for whether it matches
extract()         Call re.search on each element, returning DataFrame with one row for each element and one column for each regex capture group
extractall()      Call re.findall on each element, returning DataFrame with one row for each match and one column for each regex capture group
len()             Compute string lengths
strip()           Equivalent to str.strip
rstrip()          Equivalent to str.rstrip
lstrip()          Equivalent to str.lstrip
partition()       Equivalent to str.partition
rpartition()      Equivalent to str.rpartition
lower()           Equivalent to str.lower
casefold()        Equivalent to str.casefold
upper()           Equivalent to str.upper
find()            Equivalent to str.find
rfind()           Equivalent to str.rfind
index()           Equivalent to str.index
rindex()          Equivalent to str.rindex
capitalize()      Equivalent to str.capitalize
swapcase()        Equivalent to str.swapcase
normalize()       Return Unicode normal form. Equivalent to unicodedata.normalize
translate()       Equivalent to str.translate
isalnum()         Equivalent to str.isalnum
isalpha()         Equivalent to str.isalpha
isdigit()         Equivalent to str.isdigit
isspace()         Equivalent to str.isspace
islower()         Equivalent to str.islower
isupper()         Equivalent to str.isupper
istitle()         Equivalent to str.istitle
isnumeric()       Equivalent to str.isnumeric
isdecimal()       Equivalent to str.isdecimal
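
A few of the methods summarized above that were not demonstrated earlier can be tried on a small Series. This is a minimal sketch with made-up data and variable names.

```python
import pandas as pd

s = pd.Series(["7", "42", "123"], dtype="string")

# Left-pad with zeros up to a width of 5.
zero_filled = s.str.zfill(5).tolist()

# Left-pad with a custom fill character up to a width of 4.
padded = s.str.pad(4, side="left", fillchar="*").tolist()

# Repeat each string twice.
repeated = s.str.repeat(2).tolist()

# Keep at most the first two characters of each string.
first_two = s.str.slice(0, 2).tolist()

# Element-wise prefix test.
starts_with_one = s.str.startswith("1").tolist()

print(zero_filled, padded, repeated, first_two, starts_with_one)
```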