Python Pandas: A Concise Tutorial
Python Pandas - Indexing and Selecting Data
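In this chapter, we discuss how to slice and dice the data and, more generally, how to get a subset of a Pandas object.
The Python and NumPy indexing operator "[ ]" and the attribute operator "." give quick and easy access to Pandas data structures across a wide range of use cases. However, because the type of the data being accessed is not known in advance, using the standard operators directly has some optimization limits. For production code, we recommend that you use the optimized Pandas data access methods described in this chapter.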
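Pandas supports three types of multi-axis indexing, listed in the following table. Note that .ix() was deprecated in pandas 0.20.0 and removed in pandas 1.0.0, so it is available only in older releases.
Sr.No | Indexing & Description
1 | .loc(): label based
2 | .iloc(): integer based
3 | .ix(): both label and integer based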
.loc()
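Pandas provides several ways to do purely label-based indexing. When slicing with .loc(), both the start and the stop label are included. Integers are valid labels, but they refer to the label and not to the position.
.loc() accepts several kinds of selector, such as:
- A single scalar label
- A list of labels
- A slice object
- A Boolean array
Despite the parentheses in its name, .loc() is used with square brackets. It takes two selectors separated by ",", each of which may be a single label, a list, or a range. The first selects rows and the second selects columns.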
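The rules above (labels rather than positions, inclusive slices, rows before columns) can be illustrated with a minimal sketch. The Series `s`, the DataFrame `df`, and their values are made up for this illustration and are not part of the tutorial's data.

```python
import pandas as pd

# Hypothetical data: integer labels that do not match the positions.
s = pd.Series([10, 20, 30, 40], index=[2, 3, 4, 5])

# .loc treats 2 as a label, so this is the first element, not the third.
by_label = s.loc[2]

# Label slices include both the start label and the stop label.
label_slice = s.loc[2:4]

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["a", "b", "c"])

# Row selector first, column selector second.
cell = df.loc["b", "A"]
rows_a_to_b = df.loc["a":"b", ["A", "B"]]

print(by_label, list(label_slice.index), cell, rows_a_to_b.shape)
```

Here `s.loc[2:4]` returns three elements (labels 2, 3, and 4), whereas a positional slice `2:4` would return only two.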
Example 1
# import the pandas library (aliased as pd) and NumPy (aliased as np)
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
                  index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])
# select all rows for a specific column
print(df.loc[:, 'A'])
Its output is as follows (the data is random, so the values differ from run to run):
a 0.391548
b -0.070649
c -0.317212
d -2.162406
e 2.202797
f 0.613709
g 1.050559
h 1.122680
Name: A, dtype: float64
Example 2
# import the pandas library (aliased as pd) and NumPy (aliased as np)
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
                  index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])
# select all rows for multiple columns, given as a list
print(df.loc[:, ['A', 'C']])
Its output is as follows:
A C
a 0.391548 0.745623
b -0.070649 1.620406
c -0.317212 1.448365
d -2.162406 -0.873557
e 2.202797 0.528067
f 0.613709 0.286414
g 1.050559 0.216526
h 1.122680 -1.621420
Example 3
# import the pandas library (aliased as pd) and NumPy (aliased as np)
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
                  index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])
# select a few rows for multiple columns, both given as lists
print(df.loc[['a', 'b', 'f', 'h'], ['A', 'C']])
Its output is as follows:
A C
a 0.391548 0.745623
b -0.070649 1.620406
f 0.613709 0.286414
h 1.122680 -1.621420
Example 4
# import the pandas library (aliased as pd) and NumPy (aliased as np)
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
                  index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])
# select a range of rows for all columns (the stop label 'h' is included)
print(df.loc['a':'h'])
Its output is as follows:
A B C D
a 0.391548 -0.224297 0.745623 0.054301
b -0.070649 -0.880130 1.620406 1.419743
c -0.317212 -1.929698 1.448365 0.616899
d -2.162406 0.614256 -0.873557 1.093958
e 2.202797 -2.315915 0.528067 0.612482
f 0.613709 -0.157674 0.286414 -0.500517
g 1.050559 -2.272099 0.216526 0.928449
h 1.122680 0.324368 -1.621420 -0.741470
Example 5
# import the pandas library (aliased as pd) and NumPy (aliased as np)
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
                  index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])
# compare the values in row 'a' with 0 to get a Boolean array
print(df.loc['a'] > 0)
Its output is as follows:
A False
B True
C False
D False
Name: a, dtype: bool
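A Boolean array like the one above can itself be passed to .loc() as a selector. The following is a minimal sketch of that usage; the seeded generator and the names `mask`, `col_mask`, `positive_a`, and `positive_in_a` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible (an assumption of this sketch).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((8, 4)),
                  index=list("abcdefgh"), columns=list("ABCD"))

# Boolean Series aligned on the row labels.
mask = df["A"] > 0
# Keep only the rows where column A is positive.
positive_a = df.loc[mask]

# Boolean Series aligned on the column labels.
col_mask = df.loc["a"] > 0
# Keep only the columns whose value in row 'a' is positive.
positive_in_a = df.loc[:, col_mask]

print(positive_a)
print(positive_in_a)
```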
.iloc()
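Pandas provides several ways to do purely integer-based indexing. As in Python and NumPy, positions are 0-based, and slices exclude the stop position.
The access methods are as follows:
- An integer
- A list of integers
- A range of values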
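The three access methods listed above can be sketched as follows. The DataFrame and its string row labels are hypothetical, chosen so that positions and labels cannot be confused.

```python
import pandas as pd

# Hypothetical data with non-integer row labels.
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]},
                  index=["w", "x", "y", "z"])

# An integer: the row at position 0, returned as a Series.
first_row = df.iloc[0]

# A list of integers: the rows at positions 0 and 2.
picked = df.iloc[[0, 2]]

# A range: positions 0 and 1; the stop position 2 is excluded.
head2 = df.iloc[0:2]

# Negative positions count from the end, as in plain Python.
last_cell = df.iloc[-1, -1]

print(first_row, picked, head2, last_cell, sep="\n")
```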
Example 1
# import the pandas library (aliased as pd) and NumPy (aliased as np)
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
# select the first four rows by position
print(df.iloc[:4])
Its output is as follows:
A B C D
0 0.699435 0.256239 -1.270702 -0.645195
1 -0.685354 0.890791 -0.813012 0.631615
2 -0.783192 -0.531378 0.025070 0.230806
3 0.539042 -1.284314 0.826977 -0.026251
Example 2
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
# integer slicing: first four rows, then rows 1-4 of the columns at positions 2-3
print(df.iloc[:4])
print(df.iloc[1:5, 2:4])
Its output is as follows:
A B C D
0 0.699435 0.256239 -1.270702 -0.645195
1 -0.685354 0.890791 -0.813012 0.631615
2 -0.783192 -0.531378 0.025070 0.230806
3 0.539042 -1.284314 0.826977 -0.026251
C D
1 -0.813012 0.631615
2 0.025070 0.230806
3 0.826977 -0.026251
4 1.423332 1.130568
Example 3
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
# select by lists of positions, then by ranges of rows and of columns
print(df.iloc[[1, 3, 5], [1, 3]])
print(df.iloc[1:3, :])
print(df.iloc[:, 1:3])
Its output is as follows:
B D
1 0.890791 0.631615
3 -1.284314 -0.026251
5 -0.512888 -0.518930
A B C D
1 -0.685354 0.890791 -0.813012 0.631615
2 -0.783192 -0.531378 0.025070 0.230806
B C
0 0.256239 -1.270702
1 0.890791 -0.813012
2 -0.531378 0.025070
3 -1.284314 0.826977
4 -0.460729 1.423332
5 -0.512888 0.581409
6 -1.204853 0.098060
7 -0.947857 0.641358
.ix()
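Besides the purely label-based and integer-based indexers, Pandas provided a hybrid way to select and subset objects, the .ix() operator. .ix() was deprecated in pandas 0.20.0 and removed in pandas 1.0.0; in current releases, use .loc() for labels and .iloc() for positions.
Example 1
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
# df.ix[:4] does not run on pandas 1.0 and later; df.iloc[:4] selects the first four rows by position
print(df.iloc[:4])
Its output is as follows: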
A B C D
0 0.699435 0.256239 -1.270702 -0.645195
1 -0.685354 0.890791 -0.813012 0.631615
2 -0.783192 -0.531378 0.025070 0.230806
3 0.539042 -1.284314 0.826977 -0.026251
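On pandas 1.0 and later, where .ix() is no longer available, a selection that mixes positions and labels has to be expressed with .loc() or .iloc(). The sketch below shows one possible way to do that by translating one side so that both selectors are of the same kind; the data and the names `via_loc` and `via_iloc` are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical, deterministic data so the result is easy to check.
df = pd.DataFrame(np.arange(32).reshape(8, 4), columns=["A", "B", "C", "D"])

# Goal: the first four rows (by position) of columns 'A' and 'C' (by label).

# Option 1: turn the positions into labels and use .loc.
via_loc = df.loc[df.index[:4], ["A", "C"]]

# Option 2: turn the labels into positions and use .iloc.
via_iloc = df.iloc[:4, df.columns.get_indexer(["A", "C"])]

print(via_loc)
print(via_loc.equals(via_iloc))
```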
Use of Notations
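Getting values from a Pandas object with multi-axis indexing uses the following notation:
Object | Indexers | Return Type
Series | s.loc[indexer] | Scalar value
DataFrame | df.loc[row_index, col_index] | Series object
Panel | p.loc[item_index, major_index, minor_index] |
Note: .iloc() and .ix() take the same indexing options and return the same types. Panel was removed in pandas 0.25.0 and .ix() in pandas 1.0.0, so both apply only to older releases.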
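The return type in practice depends on the kind of selector passed on each axis. The following minimal sketch, using a small made-up Series and DataFrame, shows the common cases for .loc().

```python
import pandas as pd

# Hypothetical data for illustration.
s = pd.Series([1.5, 2.5], index=["x", "y"])
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])

scalar_from_series = s.loc["x"]        # single label on a Series: scalar
row = df.loc["x"]                      # single row label: Series
column = df.loc[:, "A"]                # single column label: Series
scalar_from_frame = df.loc["x", "A"]   # single label on both axes: scalar
sub_frame = df.loc[["x"], ["A"]]       # lists on both axes: DataFrame

for value in (scalar_from_series, row, column, scalar_from_frame, sub_frame):
    print(type(value).__name__)
```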
Now let us see how each of these operations can be performed on a DataFrame object. We start with the basic indexing operator "[ ]":
Example 1
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
# select a single column by its label
print(df['A'])
Its output is as follows:
0 -0.478893
1 0.391931
2 0.336825
3 -1.055102
4 -0.165218
5 -0.328641
6 0.567721
7 -0.759399
Name: A, dtype: float64
Note: we can pass a list of values to [ ] to select those columns.
Example 2
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
# select several columns by passing a list of labels
print(df[['A', 'B']])
Its output is as follows:
A B
0 -0.478893 -0.606311
1 0.391931 -0.949025
2 0.336825 0.093717
3 -1.055102 -0.012944
4 -0.165218 1.550310
5 -0.328641 -0.226363
6 0.567721 -0.312585
7 -0.759399 -0.372696
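The difference between passing a single label and passing a list to "[ ]" shows up in the type of the result. The sketch below uses a small made-up DataFrame and also includes the attribute operator "." mentioned at the start of the chapter.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# A single label returns a Series.
one_col = df["A"]

# A list returns a DataFrame, even when the list has one element.
as_frame = df[["A"]]

# Attribute access returns the same data as df["A"]; it only works when the
# column name is a valid Python identifier that does not clash with a
# DataFrame attribute or method.
via_attr = df.A

print(type(one_col).__name__, type(as_frame).__name__, via_attr.equals(one_col))
```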