Python Pandas: A Concise Tutorial

Python Pandas - Working with Text Data

In this chapter, we discuss string operations on a basic Series/Index. Subsequent chapters show how to apply these string functions to a DataFrame.

Pandas provides a set of string functions that make it easy to operate on string data. Most importantly, these functions skip (exclude) missing/NaN values instead of raising an error on them.

Almost all of these methods mirror Python's built-in string methods (see https://docs.python.org/3/library/stdtypes.html#string-methods). They are reached through the str accessor of a Series or Index, for example s.str.lower(), which treats each element as a string and then performs the operation.
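As a minimal sketch of the two points above (the str accessor and the handling of missing values), the following applies upper() to a small Series that contains a NaN. The sample data and variable names are illustrative assumptions, not part of the original tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: two strings and one missing value.
names = pd.Series(["Tom", np.nan, "John"])

# String methods are reached through the .str accessor and applied element-wise.
upper = names.str.upper()

print(upper.tolist())
# The missing element is skipped, not converted and not an error.
print(upper.isna().tolist())
```

The call does not fail on the missing entry; it simply leaves that position missing in the result.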

Let us now look at each operation.

Sr.No  Function & Description

1   lower(): Converts strings in the Series/Index to lower case.
2   upper(): Converts strings in the Series/Index to upper case.
3   len(): Computes the length of each string.
4   strip(): Strips whitespace (including newlines) from both sides of each string in the Series/Index.
5   split(' '): Splits each string on the given pattern.
6   cat(sep=' '): Concatenates the Series/Index elements with the given separator.
7   get_dummies(): Returns a DataFrame of one-hot encoded values.
8   contains(pattern): Returns True for each element that contains the pattern, else False.
9   replace(a,b): Replaces occurrences of a with b in each element.
10  repeat(value): Repeats each element the specified number of times.
11  count(pattern): Returns the number of occurrences of the pattern in each element.
12  startswith(pattern): Returns True for each element in the Series/Index that starts with the pattern.
13  endswith(pattern): Returns True for each element in the Series/Index that ends with the pattern.
14  find(pattern): Returns the position of the first occurrence of the pattern in each element, or -1 if it is absent.
15  findall(pattern): Returns a list of all occurrences of the pattern in each element.
16  swapcase(): Swaps lower and upper case.
17  islower(): Checks whether all cased characters in each string in the Series/Index are lower case. Returns a Boolean.
18  isupper(): Checks whether all cased characters in each string in the Series/Index are upper case. Returns a Boolean.
19  isnumeric(): Checks whether all characters in each string in the Series/Index are numeric. Returns a Boolean.

Let us now create a Series and see how the above functions work.

import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])

print(s)

Its output is as follows:

0             Tom
1    William Rick
2            John
3         Alber@t
4             NaN
5            1234
6      SteveSmith
dtype: object

lower()

import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])

print(s.str.lower())

Its output is as follows:

0             tom
1    william rick
2            john
3         alber@t
4             NaN
5            1234
6      stevesmith
dtype: object

upper()

import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])

print(s.str.upper())

Its output is as follows:

0             TOM
1    WILLIAM RICK
2            JOHN
3         ALBER@T
4             NaN
5            1234
6      STEVESMITH
dtype: object

len()

import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])
print(s.str.len())

Its output is as follows:

0    3.0
1   12.0
2    4.0
3    7.0
4    NaN
5    4.0
6   10.0
dtype: float64
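The lengths above are shown as floats because the result has to hold a NaN for the missing element, which a NumPy integer column cannot store. One possible workaround, sketched here with a shortened version of the sample data, is to drop the missing values first so the lengths come back as whole numbers.

```python
import numpy as np
import pandas as pd

s = pd.Series(['Tom', 'William Rick', np.nan, '1234'])

# Drop missing values first so no NaN needs to be stored in the result.
lengths = s.dropna().str.len()

print(lengths.tolist())
```

Note that dropna() also removes the corresponding index label, so the result is shorter than the original Series.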

strip()

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s)
print("After Stripping:")
print(s.str.strip())

Its output is as follows:

0             Tom 
1     William Rick
2             John
3          Alber@t
dtype: object
After Stripping:
0             Tom
1    William Rick
2            John
3         Alber@t
dtype: object
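Leading and trailing whitespace is hard to see in printed output, so the effect of strip() is easier to confirm by comparing string lengths or by looking at the raw Python lists. The sketch below does both and also shows lstrip() and rstrip(), which strip only one side; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

# Lengths before and after stripping reveal the removed whitespace.
before = s.str.len().tolist()
after = s.str.strip().str.len().tolist()

# lstrip() removes only leading whitespace, rstrip() only trailing whitespace.
left_only = s.str.lstrip().tolist()
right_only = s.str.rstrip().tolist()

print(before, after)
print(left_only)
print(right_only)
```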

split(pattern)

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s)
print("Split Pattern:")
print(s.str.split(' '))

Its output is as follows:

0             Tom 
1     William Rick
2             John
3          Alber@t
dtype: object
Split Pattern:
0              [Tom, ]
1    [, William, Rick]
2               [John]
3            [Alber@t]
dtype: object
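Splitting on a single space keeps an empty string wherever the input has leading or trailing spaces, as the output above shows. As a sketch of two common alternatives: calling split() with no pattern splits on any run of whitespace and drops the empty pieces, and expand=True returns the pieces as DataFrame columns instead of lists.

```python
import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

# With no pattern, split() splits on runs of whitespace and drops empty pieces.
words = s.str.split()
print([list(w) for w in words])

# expand=True spreads the pieces across columns; shorter rows are padded with missing values.
parts = s.str.strip().str.split(' ', expand=True)
print(parts)
```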

cat(sep=pattern)

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print(s.str.cat(sep='_'))

Its output is as follows:

Tom _ William Rick_John_Alber@t
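Besides joining all elements of one Series into a single string, cat() can also concatenate two Series element by element, and its na_rep parameter controls how missing values appear in the joined string. The following sketch uses made-up names and values to illustrate both.

```python
import pandas as pd

# Hypothetical sample data for illustration.
first = pd.Series(['Tom', 'John'])
last = pd.Series(['Smith', 'Doe'])

# Passing another Series concatenates element-wise and returns a Series.
full = first.str.cat(last, sep=' ')
print(full.tolist())

letters = pd.Series(['a', None, 'c'])
# By default missing values are left out of the joined string ...
skipped = letters.str.cat(sep='-')
# ... while na_rep substitutes a placeholder for them.
filled = letters.str.cat(sep='-', na_rep='?')
print(skipped, filled)
```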

get_dummies()

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print(s.str.get_dummies())

Its output is as follows:

   William Rick   Alber@t   John   Tom
0             0         0      0     1
1             1         0      0     0
2             0         0      1     0
3             0         1      0     0
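get_dummies() also accepts a sep argument (the default is '|'), which lets a single element contribute to several indicator columns. The sketch below uses a hypothetical Series of tag strings to show this.

```python
import pandas as pd

# Hypothetical sample data: each element holds one or more '|'-separated tags.
tags = pd.Series(['red|green', 'green', 'blue|red'])

# Each distinct tag becomes a column; a 1 marks the rows that contain it.
dummies = tags.str.get_dummies(sep='|')

print(dummies)
```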

contains(pattern)

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print(s.str.contains(' '))

Its output is as follows:

0   True
1   True
2   False
3   False
dtype: bool
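By default contains() interprets the pattern as a regular expression. As a sketch with illustrative data, the example below matches any digit, uses na=False so that missing values yield False, and passes regex=False to search for a literal dot.

```python
import numpy as np
import pandas as pd

s = pd.Series(['Tom', 'William Rick', np.nan, '1234'])

# r'\d' is a regular expression for any digit; na=False maps missing values to False.
has_digit = s.str.contains(r'\d', na=False)
print(has_digit.tolist())

# regex=False treats the pattern as plain text, so '.' matches only a literal dot.
literal_dot = pd.Series(['a.b', 'ab']).str.contains('.', regex=False)
print(literal_dot.tolist())
```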

replace(a,b)

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s)
print("After replacing @ with $:")
print(s.str.replace('@', '$'))

Its output is as follows:

0   Tom
1   William Rick
2   John
3   Alber@t
dtype: object

After replacing @ with $:
0   Tom
1   William Rick
2   John
3   Alber$t
dtype: object
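replace() can work with either literal text or regular expressions, selected through the regex parameter. Because the default for regex has differed between pandas versions, passing it explicitly is the safer choice. The sketch below, on made-up strings, collapses runs of whitespace with a regular expression and removes literal dots.

```python
import pandas as pd

# Hypothetical sample data for illustration.
s = pd.Series(['Tom   Smith', 'cost: $5', 'a.b.c'])

# regex=True: r'\s+' matches one or more whitespace characters.
collapsed = s.str.replace(r'\s+', ' ', regex=True)
print(collapsed.tolist())

# regex=False: '.' is treated as a literal dot, not as "any character".
no_dots = s.str.replace('.', '', regex=False)
print(no_dots.tolist())
```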

repeat(value)

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print(s.str.repeat(2))

Its output is as follows:

0                      Tom Tom 
1     William Rick William Rick
2                      JohnJohn
3                Alber@tAlber@t
dtype: object

count(pattern)

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print("The number of 'm's in each string:")
print(s.str.count('m'))

Its output is as follows:

The number of 'm's in each string:
0    1
1    1
2    0
3    0
dtype: int64

startswith(pattern)

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print("Strings that start with 'T':")
print(s.str.startswith('T'))

Its output is as follows:

Strings that start with 'T':
0  True
1  False
2  False
3  False
dtype: bool

endswith(pattern)

import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print("Strings that end with 't':")
print(s.str.endswith('t'))

Its output is as follows:

Strings that end with 't':
0  False
1  False
2  False
3  True
dtype: bool
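The Boolean Series returned by startswith() and endswith() can be used as a mask to select the matching elements. The following sketch shows this; stripping before the startswith() test is an illustrative choice so that leading spaces do not hide a match.

```python
import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

# Build Boolean masks, then use them to filter the original Series.
starts_with_t = s.str.strip().str.startswith('T')
ends_with_t = s.str.endswith('t')

print(s[starts_with_t].tolist())
print(s[ends_with_t].tolist())
```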

find(pattern)

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print(s.str.find('e'))

Its output is as follows:

0  -1
1  -1
2  -1
3   3
dtype: int64

"-1" indicates that the pattern does not occur in the element.

findall(pattern)

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print(s.str.findall('e'))

Its output is as follows:

0 []
1 []
2 []
3 [e]
dtype: object

An empty list ([]) indicates that the pattern does not occur in the element.
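Like contains(), findall() treats its pattern as a regular expression, so it can collect more than one kind of match per element. The sketch below gathers all lowercase vowels and then repeats the search ignoring case through the flags parameter; the choice of pattern is purely illustrative.

```python
import re

import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'Alber@t'])

# A character class matches any one of the listed lowercase vowels.
vowels = s.str.findall(r'[aeiou]')
print([list(v) for v in vowels])

# flags=re.IGNORECASE also picks up the uppercase 'A' in 'Alber@t'.
vowels_any_case = s.str.findall(r'[aeiou]', flags=re.IGNORECASE)
print([list(v) for v in vowels_any_case])
```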

swapcase()

import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
print(s.str.swapcase())

Its output is as follows:

0  tOM
1  wILLIAM rICK
2  jOHN
3  aLBER@T
dtype: object

islower()

import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
print(s.str.islower())

Its output is as follows:

0  False
1  False
2  False
3  False
dtype: bool

isupper()

import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])

print(s.str.isupper())

Its output is as follows:

0  False
1  False
2  False
3  False
dtype: bool

isnumeric()

import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])

print(s.str.isnumeric())

Its output is as follows:

0  False
1  False
2  False
3  False
dtype: bool
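The sample Series used for islower(), isupper() and isnumeric() happens to return False everywhere. As a complementary sketch on made-up data, the following shows inputs for which these checks return True.

```python
import pandas as pd

# Hypothetical sample data chosen so that some checks succeed.
numbers = pd.Series(['1234', 'Tom', '12.5'])
cases = pd.Series(['tom', 'TOM', 'Tom'])

# '12.5' is not numeric because the dot is not a numeric character.
numeric = numbers.str.isnumeric()
lower = cases.str.islower()
upper = cases.str.isupper()

print(numeric.tolist())
print(lower.tolist())
print(upper.tolist())
```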