Pandas Chinese Reference Guide
Working with text data
Text data types
There are two ways to store text data in pandas:
- object-dtype NumPy array.
- StringDtype extension type.
We recommend using StringDtype to store text data.
Prior to pandas 1.0, object dtype was the only option. This was unfortunate for many reasons:
- You can accidentally store a mixture of strings and non-strings in an object dtype array. It's better to have a dedicated dtype.
- object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There isn't a clear way to select just text while excluding non-text but still object-dtype columns.
- When reading code, the contents of an object dtype array are less clear than 'string'.
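The first point can be illustrated with a minimal sketch. The variable names are illustrative only and do not come from the guide. An object-dtype Series accepts any Python object, whereas converting to string dtype leaves only str values (and missing values).

```python
import pandas as pd

# An object-dtype Series silently accepts a mixture of strings and non-strings.
mixed = pd.Series(["a", 1, 2.5], dtype="object")
element_types = {type(value).__name__ for value in mixed}

# Converting to the dedicated dtype turns every element into a str.
as_string = mixed.astype("string")
converted_types = {type(value).__name__ for value in as_string}

print(mixed.dtype, sorted(element_types))
print(as_string.dtype, sorted(converted_types))
```

Inspecting the element types this way is only a diagnostic; the practical benefit of the dedicated dtype is that the check becomes unnecessary.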
Currently, the performance of object dtype arrays of strings and arrays.StringArray is about the same. We expect future enhancements to significantly increase the performance and lower the memory overhead of StringArray.
Warning
StringArray is currently considered experimental. The implementation and parts of the API may change without warning.
For backwards compatibility, object dtype remains the default type inferred for a list of strings:
In [1]: pd.Series(["a", "b", "c"])
Out[1]:
0 a
1 b
2 c
dtype: object
To explicitly request string dtype, specify the dtype:
In [2]: pd.Series(["a", "b", "c"], dtype="string")
Out[2]:
0 a
1 b
2 c
dtype: string
In [3]: pd.Series(["a", "b", "c"], dtype=pd.StringDtype())
Out[3]:
0 a
1 b
2 c
dtype: string
Or use astype after the Series or DataFrame is created:
In [4]: s = pd.Series(["a", "b", "c"])
In [5]: s
Out[5]:
0 a
1 b
2 c
dtype: object
In [6]: s.astype("string")
Out[6]:
0 a
1 b
2 c
dtype: string
You can also use StringDtype/"string" as the dtype on non-string data, and it will be converted to string dtype:
In [7]: s = pd.Series(["a", 2, np.nan], dtype="string")
In [8]: s
Out[8]:
0 a
1 2
2 <NA>
dtype: string
In [9]: type(s[1])
Out[9]: str
Or convert from existing pandas data:
In [10]: s1 = pd.Series([1, 2, np.nan], dtype="Int64")
In [11]: s1
Out[11]:
0 1
1 2
2 <NA>
dtype: Int64
In [12]: s2 = s1.astype("string")
In [13]: s2
Out[13]:
0 1
1 2
2 <NA>
dtype: string
In [14]: type(s2[0])
Out[14]: str
Behavior differences
These are the places where the behavior of StringDtype objects differs from object dtype:
- For StringDtype, string accessor methods that return numeric output will always return a nullable integer dtype, rather than either int or float dtype depending on the presence of NA values. Methods returning boolean output will return a nullable boolean dtype.
In [15]: s = pd.Series(["a", None, "b"], dtype="string")
In [16]: s
Out[16]:
0 a
1 <NA>
2 b
dtype: string
In [17]: s.str.count("a")
Out[17]:
0 1
1 <NA>
2 0
dtype: Int64
In [18]: s.dropna().str.count("a")
Out[18]:
0 1
2 0
dtype: Int64
Both outputs are Int64 dtype. Compare that with object dtype:
In [19]: s2 = pd.Series(["a", None, "b"], dtype="object")
In [20]: s2.str.count("a")
Out[20]:
0 1.0
1 NaN
2 0.0
dtype: float64
In [21]: s2.dropna().str.count("a")
Out[21]:
0 1
2 0
dtype: int64
When NA values are present, the output dtype is float64. The same holds for methods returning boolean values; with StringDtype they return the nullable boolean dtype:
In [22]: s.str.isdigit()
Out[22]:
0 False
1 <NA>
2 False
dtype: boolean
In [23]: s.str.match("a")
Out[23]:
0 True
1 <NA>
2 False
dtype: boolean
- Some string methods, like Series.str.decode(), are not available on StringArray because StringArray only holds strings, not bytes.
- In comparison operations, arrays.StringArray and Series backed by a StringArray will return an object with BooleanDtype, rather than a bool dtype object. Missing values in a StringArray will propagate in comparison operations, rather than always comparing unequal like numpy.nan.
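The comparison behavior in the last point can be sketched as follows. This is a small illustration with made-up variable names, not an excerpt from the guide.

```python
import pandas as pd

values = ["a", None, "b"]
s_string = pd.Series(values, dtype="string")
s_object = pd.Series(values, dtype="object")

# StringDtype: the result uses the nullable boolean dtype and the
# missing value propagates into the result.
eq_string = s_string == "a"

# object dtype: the result is a plain bool Series and the missing
# value simply compares unequal.
eq_object = s_object == "a"

print(eq_string)
print(eq_object)
```

Because the nullable result can contain missing values, it may need an explicit fillna(False) before being used as a boolean mask in code that expects plain booleans.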
Everything else that follows in the rest of this document applies equally to string and object dtype.
String methods
Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. They are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods:
In [24]: s = pd.Series(
....: ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
....: )
....:
In [25]: s.str.lower()
Out[25]:
0 a
1 b
2 c
3 aaba
4 baca
5 <NA>
6 caba
7 dog
8 cat
dtype: string
In [26]: s.str.upper()
Out[26]:
0 A
1 B
2 C
3 AABA
4 BACA
5 <NA>
6 CABA
7 DOG
8 CAT
dtype: string
In [27]: s.str.len()
Out[27]:
0 1
1 1
2 1
3 4
4 4
5 <NA>
6 4
7 3
8 3
dtype: Int64
In [28]: idx = pd.Index([" jack", "jill ", " jesse ", "frank"])
In [29]: idx.str.strip()
Out[29]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')
In [30]: idx.str.lstrip()
Out[30]: Index(['jack', 'jill ', 'jesse ', 'frank'], dtype='object')
In [31]: idx.str.rstrip()
Out[31]: Index([' jack', 'jill', ' jesse', 'frank'], dtype='object')
The string methods on Index are especially useful for cleaning up or transforming DataFrame columns. For instance, you may have columns with leading or trailing whitespace:
In [32]: df = pd.DataFrame(
....: np.random.randn(3, 2), columns=[" Column A ", " Column B "], index=range(3)
....: )
....:
In [33]: df
Out[33]:
Column A Column B
0 0.469112 -0.282863
1 -1.509059 -1.135632
2 1.212112 -0.173215
Since df.columns is an Index object, we can use the .str accessor:
In [34]: df.columns.str.strip()
Out[34]: Index(['Column A', 'Column B'], dtype='object')
In [35]: df.columns.str.lower()
Out[35]: Index([' column a ', ' column b '], dtype='object')
These string methods can then be used to clean up the columns as needed. Here we remove leading and trailing whitespace, lowercase all names, and replace any remaining whitespace with underscores:
In [36]: df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
In [37]: df
Out[37]:
column_a column_b
0 0.469112 -0.282863
1 -1.509059 -1.135632
2 1.212112 -0.173215
If you have a Series where lots of elements are repeated (i.e. the number of unique elements in the Series is a lot smaller than the length of the Series), it can be faster to convert the original Series to one of type category and then use .str.<method> or .dt.<property> on that. The performance difference comes from the fact that, for a Series of type category, the string operations are done on the .categories and not on each element of the Series.
Please note that a Series of type category with string .categories has some limitations in comparison to a Series of type string (e.g. you can't add strings to each other: s + " " + s won't work if s is a Series of type category). Also, .str methods which operate on elements of type list are not available on such a Series.
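A minimal sketch of the category approach follows. The data and variable names are invented for illustration, and no timing is claimed here; whether it is actually faster depends on the data.

```python
import pandas as pd

# Many rows, few distinct values: the case where category can help.
s = pd.Series(["apple", "banana", "apple", "cherry"] * 1000)
s_cat = s.astype("category")

# The string operation runs on the 3 categories and the result is
# then mapped back onto the 4000 rows.
upper_from_cat = s_cat.str.upper()
upper_direct = s.str.upper()

print(len(s_cat.cat.categories))
print(upper_from_cat.head(4).tolist())
```

Both routes give the same values, so the conversion is purely an optimization to measure on your own data.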
Warning
The type of the Series is inferred, and the .str accessor is available only when the inferred type is an allowed one (i.e. strings).
Generally speaking, the .str accessor is intended to work only on strings. With very few exceptions, other uses are not supported and may be disabled at a later point.
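As a small illustration of this restriction (the variable names are made up for the sketch), using .str on numeric data raises an AttributeError, and converting to string dtype first makes the intent explicit.

```python
import pandas as pd

numbers = pd.Series([1, 2, 3])

# The accessor checks the inferred type up front and refuses
# non-string data.
try:
    numbers.str.upper()
    raised = False
except AttributeError as exc:
    raised = True
    print(exc)

# Converting first makes the intent explicit.
lengths = numbers.astype("string").str.len()
print(lengths.tolist())
```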
Splitting and replacing strings
Methods like split return a Series of lists:
In [38]: s2 = pd.Series(["a_b_c", "c_d_e", np.nan, "f_g_h"], dtype="string")
In [39]: s2.str.split("_")
Out[39]:
0 [a, b, c]
1 [c, d, e]
2 <NA>
3 [f, g, h]
dtype: object
Elements in the split lists can be accessed using get or [] notation:
In [40]: s2.str.split("_").str.get(1)
Out[40]:
0 b
1 d
2 <NA>
3 g
dtype: object
In [41]: s2.str.split("_").str[1]
Out[41]:
0 b
1 d
2 <NA>
3 g
dtype: object
It is easy to expand this to return a DataFrame using expand:
In [42]: s2.str.split("_", expand=True)
Out[42]:
0 1 2
0 a b c
1 c d e
2 <NA> <NA> <NA>
3 f g h
When the original Series has StringDtype, the output columns will all be StringDtype as well.
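The dtype behavior can be checked directly. The sketch below rebuilds the same s2 used above; lists and expanded are illustrative names.

```python
import numpy as np
import pandas as pd

s2 = pd.Series(["a_b_c", "c_d_e", np.nan, "f_g_h"], dtype="string")

# Without expand, each non-missing element is a list of pieces.
lists = s2.str.split("_")

# With expand=True, every resulting column keeps the string dtype.
expanded = s2.str.split("_", expand=True)

print(lists.iloc[0])
print(expanded.dtypes)
```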
It is also possible to limit the number of splits:
In [43]: s2.str.split("_", expand=True, n=1)
Out[43]:
0 1
0 a b_c
1 c d_e
2 <NA> <NA>
3 f g_h
rsplit is similar to split except that it works in the reverse direction, i.e. from the end of the string to the beginning of the string:
In [44]: s2.str.rsplit("_", expand=True, n=1)
Out[44]:
0 1
0 a_b c
1 c_d e
2 <NA> <NA>
3 f_g h
replace optionally uses regular expressions:
In [45]: s3 = pd.Series(
....: ["A", "B", "C", "Aaba", "Baca", "", np.nan, "CABA", "dog", "cat"],
....: dtype="string",
....: )
....:
In [46]: s3
Out[46]:
0 A
1 B
2 C
3 Aaba
4 Baca
5
6 <NA>
7 CABA
8 dog
9 cat
dtype: string
In [47]: s3.str.replace("^.a|dog", "XX-XX ", case=False, regex=True)
Out[47]:
0 A
1 B
2 C
3 XX-XX ba
4 XX-XX ca
5
6 <NA>
7 XX-XX BA
8 XX-XX
9 XX-XX t
dtype: string
Changed in version 2.0.
A single-character pattern with regex=True is also treated as a regular expression:
In [48]: s4 = pd.Series(["a.b", ".", "b", np.nan, ""], dtype="string")
In [49]: s4
Out[49]:
0 a.b
1 .
2 b
3 <NA>
4
dtype: string
In [50]: s4.str.replace(".", "a", regex=True)
Out[50]:
0 aaa
1 a
2 a
3 <NA>
4
dtype: string
If you want literal replacement of a string (equivalent to str.replace()), you can set the optional regex parameter to False, rather than escaping each character. In this case both pat and repl must be strings:
In [51]: dollars = pd.Series(["12", "-$10", "$10,000"], dtype="string")
# These lines are equivalent
In [52]: dollars.str.replace(r"-\$", "-", regex=True)
Out[52]:
0 12
1 -10
2 $10,000
dtype: string
In [53]: dollars.str.replace("-$", "-", regex=False)
Out[53]:
0 12
1 -10
2 $10,000
dtype: string
The replace method can also take a callable as replacement. It is called on every match of pat using re.sub(). The callable should expect one positional argument (a regex match object) and return a string.
# Reverse every lowercase alphabetic word
In [54]: pat = r"[a-z]+"
In [55]: def repl(m):
....: return m.group(0)[::-1]
....:
In [56]: pd.Series(["foo 123", "bar baz", np.nan], dtype="string").str.replace(
....: pat, repl, regex=True
....: )
....:
Out[56]:
0 oof 123
1 rab zab
2 <NA>
dtype: string
# Using regex groups
In [57]: pat = r"(?P<one>\w+) (?P<two>\w+) (?P<three>\w+)"
In [58]: def repl(m):
....: return m.group("two").swapcase()
....:
In [59]: pd.Series(["Foo Bar Baz", np.nan], dtype="string").str.replace(
....: pat, repl, regex=True
....: )
....:
Out[59]:
0 bAR
1 <NA>
dtype: string
The replace method also accepts a compiled regular expression object from re.compile() as a pattern. All flags should be included in the compiled regular expression object.
In [60]: import re
In [61]: regex_pat = re.compile(r"^.a|dog", flags=re.IGNORECASE)
In [62]: s3.str.replace(regex_pat, "XX-XX ", regex=True)
Out[62]:
0 A
1 B
2 C
3 XX-XX ba
4 XX-XX ca
5
6 <NA>
7 XX-XX BA
8 XX-XX
9 XX-XX t
dtype: string
Including a flags argument when calling replace with a compiled regular expression object will raise a ValueError.
In [63]: s3.str.replace(regex_pat, 'XX-XX ', flags=re.IGNORECASE)
---------------------------------------------------------------------------
ValueError: case and flags cannot be set when pat is a compiled regex
removeprefix and removesuffix have the same effect as str.removeprefix and str.removesuffix, added in Python 3.9 (https://docs.python.org/3/library/stdtypes.html#str.removeprefix):
New in version 1.4.0.
In [64]: s = pd.Series(["str_foo", "str_bar", "no_prefix"])
In [65]: s.str.removeprefix("str_")
Out[65]:
0 foo
1 bar
2 no_prefix
dtype: object
In [66]: s = pd.Series(["foo_str", "bar_str", "no_suffix"])
In [67]: s.str.removesuffix("_str")
Out[67]:
0 foo
1 bar
2 no_suffix
dtype: object
Concatenation
There are several ways to concatenate a Series or Index, either with itself or with others, all based on Series.str.cat() and Index.str.cat(), respectively.
Concatenating a single Series into a string
The content of a Series (or Index) can be concatenated:
In [68]: s = pd.Series(["a", "b", "c", "d"], dtype="string")
In [69]: s.str.cat(sep=",")
Out[69]: 'a,b,c,d'
If not specified, the keyword sep for the separator defaults to the empty string, sep='':
In [70]: s.str.cat()
Out[70]: 'abcd'
By default, missing values are ignored. Using na_rep, they can be given a representation:
In [71]: t = pd.Series(["a", "b", np.nan, "d"], dtype="string")
In [72]: t.str.cat(sep=",")
Out[72]: 'a,b,d'
In [73]: t.str.cat(sep=",", na_rep="-")
Out[73]: 'a,b,-,d'
Concatenating a Series and something list-like into a Series
The first argument to cat() can be a list-like object, provided that it matches the length of the calling Series (or Index).
In [74]: s.str.cat(["A", "B", "C", "D"])
Out[74]:
0 aA
1 bB
2 cC
3 dD
dtype: string
Missing values on either side will result in missing values in the result as well, unless na_rep is specified:
In [75]: s.str.cat(t)
Out[75]:
0 aa
1 bb
2 <NA>
3 dd
dtype: string
In [76]: s.str.cat(t, na_rep="-")
Out[76]:
0 aa
1 bb
2 c-
3 dd
dtype: string
Concatenating a Series and something array-like into a Series
The parameter others can also be two-dimensional. In this case, the number of rows must match the length of the calling Series (or Index).
In [77]: d = pd.concat([t, s], axis=1)
In [78]: s
Out[78]:
0 a
1 b
2 c
3 d
dtype: string
In [79]: d
Out[79]:
0 1
0 a a
1 b b
2 <NA> c
3 d d
In [80]: s.str.cat(d, na_rep="-")
Out[80]:
0 aaa
1 bbb
2 c-c
3 ddd
dtype: string
Concatenating a Series and an indexed object into a Series, with alignment
For concatenation with a Series or DataFrame, it is possible to align the indexes before concatenation by setting the join keyword.
In [81]: u = pd.Series(["b", "d", "a", "c"], index=[1, 3, 0, 2], dtype="string")
In [82]: s
Out[82]:
0 a
1 b
2 c
3 d
dtype: string
In [83]: u
Out[83]:
1 b
3 d
0 a
2 c
dtype: string
In [84]: s.str.cat(u)
Out[84]:
0 aa
1 bb
2 cc
3 dd
dtype: string
In [85]: s.str.cat(u, join="left")
Out[85]:
0 aa
1 bb
2 cc
3 dd
dtype: string
The usual options are available for join (one of 'left', 'outer', 'inner', 'right'). In particular, alignment also means that the lengths no longer need to match.
In [86]: v = pd.Series(["z", "a", "b", "d", "e"], index=[-1, 0, 1, 3, 4], dtype="string")
In [87]: s
Out[87]:
0 a
1 b
2 c
3 d
dtype: string
In [88]: v
Out[88]:
-1 z
0 a
1 b
3 d
4 e
dtype: string
In [89]: s.str.cat(v, join="left", na_rep="-")
Out[89]:
0 aa
1 bb
2 c-
3 dd
dtype: string
In [90]: s.str.cat(v, join="outer", na_rep="-")
Out[90]:
-1 -z
0 aa
1 bb
2 c-
3 dd
4 -e
dtype: string
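For completeness, the other two join options can be sketched on the same s and v. The names inner and right are illustrative only.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d"], dtype="string")
v = pd.Series(["z", "a", "b", "d", "e"], index=[-1, 0, 1, 3, 4], dtype="string")

# "inner" keeps only the labels present in both indexes.
inner = s.str.cat(v, join="inner", na_rep="-")

# "right" uses the index of v, filling gaps on the calling side.
right = s.str.cat(v, join="right", na_rep="-")

print(inner)
print(right)
```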
The same alignment can be used when others is a DataFrame:
In [91]: f = d.loc[[3, 2, 1, 0], :]
In [92]: s
Out[92]:
0 a
1 b
2 c
3 d
dtype: string
In [93]: f
Out[93]:
0 1
3 d d
2 <NA> c
1 b b
0 a a
In [94]: s.str.cat(f, join="left", na_rep="-")
Out[94]:
0 aaa
1 bbb
2 c-c
3 ddd
dtype: string
Concatenating a Series and many objects into a Series
Several array-like items (specifically: Series, Index, and 1-dimensional variants of np.ndarray) can be combined in a list-like container (including iterators, dict-views, etc.).
In [95]: s
Out[95]:
0 a
1 b
2 c
3 d
dtype: string
In [96]: u
Out[96]:
1 b
3 d
0 a
2 c
dtype: string
In [97]: s.str.cat([u, u.to_numpy()], join="left")
Out[97]:
0 aab
1 bbd
2 cca
3 ddc
dtype: string
All elements without an index (e.g. np.ndarray) within the passed list-like must match the length of the calling Series (or Index), but Series and Index may have arbitrary length (as long as alignment is not disabled with join=None):
In [98]: v
Out[98]:
-1 z
0 a
1 b
3 d
4 e
dtype: string
In [99]: s.str.cat([v, u, u.to_numpy()], join="outer", na_rep="-")
Out[99]:
-1 -z--
0 aaab
1 bbbd
2 c-ca
3 dddc
4 -e--
dtype: string
If using join='right' on a list-like of others that contains different indexes, the union of these indexes will be used as the basis for the final concatenation:
In [100]: u.loc[[3]]
Out[100]:
3 d
dtype: string
In [101]: v.loc[[-1, 0]]
Out[101]:
-1 z
0 a
dtype: string
In [102]: s.str.cat([u.loc[[3]], v.loc[[-1, 0]]], join="right", na_rep="-")
Out[102]:
3 dd-
-1 --z
0 a-a
dtype: string
Indexing with .str
You can use [] notation to directly index by position. If you index past the end of the string, the result will be a missing value (NaN, or <NA> for StringDtype).
In [103]: s = pd.Series(
.....: ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
.....: )
.....:
In [104]: s.str[0]
Out[104]:
0 A
1 B
2 C
3 A
4 B
5 <NA>
6 C
7 d
8 c
dtype: string
In [105]: s.str[1]
Out[105]:
0 <NA>
1 <NA>
2 <NA>
3 a
4 a
5 <NA>
6 A
7 o
8 a
dtype: string
Extracting substrings
Extract first match in each subject (extract)
The extract method accepts a regular expression with at least one capture group.
Extracting a regular expression with more than one group returns a DataFrame with one column per group.
In [106]: pd.Series(
.....: ["a1", "b2", "c3"],
.....: dtype="string",
.....: ).str.extract(r"([ab])(\d)", expand=False)
.....:
Out[106]:
0 1
0 a 1
1 b 2
2 <NA> <NA>
Elements that do not match return a row filled with missing values (NaN, or <NA> for StringDtype). Thus, a Series of messy strings can be "converted" into a like-indexed Series or DataFrame of cleaned-up or more useful strings, without necessitating get() to access tuples or re.match objects. For object-dtype input the dtype of the result is always object, even if no match is found and the result only contains NaN; with StringDtype input, as in the examples here, the result is string dtype.
Named groups like
In [107]: pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(
.....: r"(?P<letter>[ab])(?P<digit>\d)", expand=False
.....: )
.....:
Out[107]:
letter digit
0 a 1
1 b 2
2 <NA> <NA>
and optional groups like
In [108]: pd.Series(
.....: ["a1", "b2", "3"],
.....: dtype="string",
.....: ).str.extract(r"([ab])?(\d)", expand=False)
.....:
Out[108]:
0 1
0 a 1
1 b 2
2 <NA> 3
can also be used. Note that any capture group names in the regular expression will be used for column names; otherwise capture group numbers will be used.
Extracting a regular expression with one group returns a DataFrame with one column if expand=True.
In [109]: pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(r"[ab](\d)", expand=True)
Out[109]:
0
0 1
1 2
2 <NA>
It returns a Series if expand=False.
In [110]: pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(r"[ab](\d)", expand=False)
Out[110]:
0 1
1 2
2 <NA>
dtype: string
Calling extract on an Index with a regex with exactly one capture group returns a DataFrame with one column if expand=True.
In [111]: s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"], dtype="string")
In [112]: s
Out[112]:
A11 a1
B22 b2
C33 c3
dtype: string
In [113]: s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)
Out[113]:
letter
0 A
1 B
2 C
It returns an Index if expand=False.
In [114]: s.index.str.extract("(?P<letter>[a-zA-Z])", expand=False)
Out[114]: Index(['A', 'B', 'C'], dtype='object', name='letter')
Calling extract on an Index with a regex with more than one capture group returns a DataFrame if expand=True.
In [115]: s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)
Out[115]:
letter 1
0 A 11
1 B 22
2 C 33
It raises ValueError if expand=False.
In [116]: s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[116], line 1
----> 1 s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)
File ~/work/pandas/pandas/pandas/core/strings/accessor.py:137, in forbid_nonstring_types.<locals>._forbid_nonstring_types.<locals>.wrapper(self, *args, **kwargs)
132 msg = (
133 f"Cannot use .str.{func_name} with values of "
134 f"inferred dtype '{self._inferred_dtype}'."
135 )
136 raise TypeError(msg)
--> 137 return func(self, *args, **kwargs)
File ~/work/pandas/pandas/pandas/core/strings/accessor.py:2743, in StringMethods.extract(self, pat, flags, expand)
2740 raise ValueError("pattern contains no capture groups")
2742 if not expand and regex.groups > 1 and isinstance(self._data, ABCIndex):
-> 2743 raise ValueError("only one regex group is supported with Index")
2745 obj = self._data
2746 result_dtype = _result_dtype(obj)
ValueError: only one regex group is supported with Index
The table below summarizes the behavior of extract(expand=False) (input subject in first column, number of groups in regex in first row):
          1 group    >1 group
Index     Index      ValueError
Series    Series     DataFrame
Extract all matches in each subject (extractall)
Unlike extract (which returns only the first match),
In [117]: s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="string")
In [118]: s
Out[118]:
A a1a2
B b1
C c1
dtype: string
In [119]: two_groups = "(?P<letter>[a-z])(?P<digit>[0-9])"
In [120]: s.str.extract(two_groups, expand=True)
Out[120]:
letter digit
A a 1
B b 1
C c 1
the extractall method returns every match. The result of extractall is always a DataFrame with a MultiIndex on its rows. The last level of the MultiIndex is named match and indicates the order in the subject.
In [121]: s.str.extractall(two_groups)
Out[121]:
letter digit
match
A 0 a 1
1 a 2
B 0 b 1
C 0 c 1
When each subject string in the Series has exactly one match,
In [122]: s = pd.Series(["a3", "b3", "c2"], dtype="string")
In [123]: s
Out[123]:
0 a3
1 b3
2 c2
dtype: string
then extractall(pat).xs(0, level='match') gives the same result as extract(pat).
In [124]: extract_result = s.str.extract(two_groups, expand=True)
In [125]: extract_result
Out[125]:
letter digit
0 a 3
1 b 3
2 c 2
In [126]: extractall_result = s.str.extractall(two_groups)
In [127]: extractall_result
Out[127]:
letter digit
match
0 0 a 3
1 0 b 3
2 0 c 2
In [128]: extractall_result.xs(0, level="match")
Out[128]:
letter digit
0 a 3
1 b 3
2 c 2
Index also supports .str.extractall. It returns a DataFrame with the same result as Series.str.extractall on a Series with a default index (starting from 0).
In [129]: pd.Index(["a1a2", "b1", "c1"]).str.extractall(two_groups)
Out[129]:
letter digit
match
0 0 a 1
1 a 2
1 0 b 1
2 0 c 1
In [130]: pd.Series(["a1a2", "b1", "c1"], dtype="string").str.extractall(two_groups)
Out[130]:
letter digit
match
0 0 a 1
1 a 2
1 0 b 1
2 0 c 1
Testing for strings that match or contain a pattern
You can check whether elements contain a pattern:
In [131]: pattern = r"[0-9][a-z]"
In [132]: pd.Series(
.....: ["1", "2", "3a", "3b", "03c", "4dx"],
.....: dtype="string",
.....: ).str.contains(pattern)
.....:
Out[132]:
0 False
1 False
2 True
3 True
4 True
5 True
dtype: boolean
Or whether elements match a pattern:
In [133]: pd.Series(
.....: ["1", "2", "3a", "3b", "03c", "4dx"],
.....: dtype="string",
.....: ).str.match(pattern)
.....:
Out[133]:
0 False
1 False
2 True
3 True
4 False
5 True
dtype: boolean
In [134]: pd.Series(
.....: ["1", "2", "3a", "3b", "03c", "4dx"],
.....: dtype="string",
.....: ).str.fullmatch(pattern)
.....:
Out[134]:
0 False
1 False
2 True
3 True
4 False
5 False
dtype: boolean
The distinction between match, fullmatch, and contains is strictness: fullmatch tests whether the entire string matches the regular expression; match tests whether there is a match of the regular expression that begins at the first character of the string; and contains tests whether there is a match of the regular expression at any position within the string.
The corresponding functions in the re package for these three match modes are re.fullmatch, re.match, and re.search, respectively.
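The correspondence can be checked with a short sketch that applies the re functions element by element and compares them with the vectorized methods. The list names by_search, by_match and by_fullmatch are illustrative.

```python
import re

import pandas as pd

pattern = r"[0-9][a-z]"
values = ["1", "2", "3a", "3b", "03c", "4dx"]
s = pd.Series(values, dtype="string")

# Scalar equivalents built directly on the re module.
by_search = [re.search(pattern, v) is not None for v in values]
by_match = [re.match(pattern, v) is not None for v in values]
by_fullmatch = [re.fullmatch(pattern, v) is not None for v in values]

# The vectorized methods agree element by element.
print(s.str.contains(pattern).tolist() == by_search)
print(s.str.match(pattern).tolist() == by_match)
print(s.str.fullmatch(pattern).tolist() == by_fullmatch)
```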
Methods like match, fullmatch, contains, startswith, and endswith take an extra na argument so missing values can be considered True or False:
In [135]: s4 = pd.Series(
.....: ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
.....: )
.....:
In [136]: s4.str.contains("A", na=False)
Out[136]:
0 True
1 False
2 False
3 True
4 False
5 False
6 True
7 False
8 False
dtype: boolean
Creating indicator variables
You can extract dummy variables from string columns. For example, if they are separated by a '|':
In [137]: s = pd.Series(["a", "a|b", np.nan, "a|c"], dtype="string")
In [138]: s.str.get_dummies(sep="|")
Out[138]:
a b c
0 1 0 0
1 1 1 0
2 0 0 0
3 1 0 1
String Index also supports get_dummies, which returns a MultiIndex.
In [139]: idx = pd.Index(["a", "a|b", np.nan, "a|c"])
In [140]: idx.str.get_dummies(sep="|")
Out[140]:
MultiIndex([(1, 0, 0),
(1, 1, 0),
(0, 0, 0),
(1, 0, 1)],
names=['a', 'b', 'c'])
See also get_dummies().
Method summary
Method           Description
cat()            Concatenate strings
split()          Split strings on delimiter
rsplit()         Split strings on delimiter working from the end of the string
get()            Index into each element (retrieve i-th element)
join()           Join strings in each element of the Series with passed separator
get_dummies()    Split strings on the delimiter returning DataFrame of dummy variables
contains()       Return boolean array if each string contains pattern/regex
replace()        Replace occurrences of pattern/regex/string with some other string or the return value of a callable given the occurrence
removeprefix()   Remove prefix from string, i.e. only remove if string starts with prefix
removesuffix()   Remove suffix from string, i.e. only remove if string ends with suffix
repeat()         Duplicate values (s.str.repeat(3) equivalent to x * 3)
pad()            Add whitespace to left, right, or both sides of strings
center()         Equivalent to str.center
ljust()          Equivalent to str.ljust
rjust()          Equivalent to str.rjust
zfill()          Equivalent to str.zfill
wrap()           Split long strings into lines with length less than a given width
slice()          Slice each string in the Series
slice_replace()  Replace slice in each string with passed value
count()          Count occurrences of pattern
startswith()     Equivalent to str.startswith(pat) for each element
endswith()       Equivalent to str.endswith(pat) for each element
findall()        Compute list of all occurrences of pattern/regex for each string
match()          Call re.match on each element, returning a boolean indicating whether it matched
extract()        Call re.search on each element, returning DataFrame with one row for each element and one column for each regex capture group
extractall()     Call re.findall on each element, returning DataFrame with one row for each match and one column for each regex capture group
len()            Compute string lengths
strip()          Equivalent to str.strip
rstrip()         Equivalent to str.rstrip
lstrip()         Equivalent to str.lstrip
partition()      Equivalent to str.partition
rpartition()     Equivalent to str.rpartition
lower()          Equivalent to str.lower
casefold()       Equivalent to str.casefold
upper()          Equivalent to str.upper
find()           Equivalent to str.find
rfind()          Equivalent to str.rfind
index()          Equivalent to str.index
rindex()         Equivalent to str.rindex
capitalize()     Equivalent to str.capitalize
swapcase()       Equivalent to str.swapcase
normalize()      Return Unicode normal form. Equivalent to unicodedata.normalize
translate()      Equivalent to str.translate
isalnum()        Equivalent to str.isalnum
isalpha()        Equivalent to str.isalpha
isdigit()        Equivalent to str.isdigit
isspace()        Equivalent to str.isspace
islower()        Equivalent to str.islower
isupper()        Equivalent to str.isupper
istitle()        Equivalent to str.istitle
isnumeric()      Equivalent to str.isnumeric
isdecimal()      Equivalent to str.isdecimal
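Several of the summarized methods are not demonstrated elsewhere in this guide. The following is a minimal sketch of a few of them on a made-up Series; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(["a_b", "cd", "efg_h"], dtype="string")

zero_filled = s.str.zfill(5)                       # left-pad with "0" to width 5
padded = s.str.pad(5, side="right", fillchar="*")  # pad on the right to width 5
sliced = s.str.slice(0, 2)                         # first two characters
repeated = s.str.repeat(2)                         # each value twice
replaced = s.str.slice_replace(0, 1, "X")          # swap the first character

for result in (zero_filled, padded, sliced, repeated, replaced):
    print(result.tolist())
```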