Python Pandas: A Concise Tutorial
Python Pandas - DataFrame
A DataFrame is a two-dimensional data structure: data is aligned in a tabular fashion in rows and columns.
Features of DataFrame
- Columns can be of different types
- Size is mutable
- Labeled axes (rows and columns)
- Arithmetic operations can be performed on rows and columns
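The features listed above can be sketched in a few lines. This is a minimal illustration only; the column names, row labels, and values are made up for the example.

```python
import pandas as pd

# Hypothetical sample data: one text column and two numeric columns.
df = pd.DataFrame(
    {"name": ["Ann", "Ben"], "math": [90, 80], "physics": [70.5, 60.5]},
    index=["s1", "s2"],  # labeled rows; the dict keys become column labels
)

# Columns can hold different types.
print(df.dtypes)

# Size is mutable: add a column computed from two existing columns.
df["total"] = df["math"] + df["physics"]

# Arithmetic down the rows: the mean of each numeric column.
means = df[["math", "physics"]].mean()
print(df)
print(means)
```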
Structure
Assume that we are creating a DataFrame from student data.
You can think of it as an SQL table or a spreadsheet representation of the data.
pandas.DataFrame
A pandas DataFrame can be created using the following constructor:
pandas.DataFrame(data, index, columns, dtype, copy)
The parameters of the constructor are as follows:
Sr.No | Parameter | Description
1 | data | Input data. Accepts various forms such as ndarray, Series, map, list, dict, constants, and another DataFrame.
2 | index | Row labels for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.
3 | columns | Column labels for the resulting frame. Optional; defaults to np.arange(n) if no column labels are passed.
4 | dtype | Data type of each column.
5 | copy | Whether to copy data from the input. The default is False (newer pandas versions default to None and decide based on the input type).
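As a rough sketch of how these parameters fit together, the call below passes all five explicitly. The array values and the labels r1, r2, x, and y are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2], [3, 4]])  # hypothetical sample data

df = pd.DataFrame(
    data=values,
    index=["r1", "r2"],    # row labels
    columns=["x", "y"],    # column labels
    dtype=float,           # store every column as float
    copy=True,             # work on a copy of the input array
)
print(df)

# With no index or columns given, pandas falls back to 0..n-1 labels.
default_df = pd.DataFrame(values)
print(default_df.index.tolist(), default_df.columns.tolist())
```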
Create DataFrame
A pandas DataFrame can be created using various inputs, such as:
- Lists
- dict
- Series
- Numpy ndarrays
- Another DataFrame
In the subsequent sections of this chapter, we will see how to create a DataFrame using these inputs.
Create an Empty DataFrame
The most basic DataFrame that can be created is an empty DataFrame.
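A minimal sketch of this case: calling the constructor with no arguments yields a frame with no rows and no columns.

```python
import pandas as pd

# No data, index, or columns: the result is an empty DataFrame.
df = pd.DataFrame()
print(df)

# The empty attribute reports whether the frame holds any data.
print(df.empty)
```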
Create a DataFrame from Lists
A DataFrame can be created using a single list or a list of lists.
Example 1
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)
Its output is as follows:
0
0 1
1 2
2 3
3 4
4 5
Example 2
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)
Its output is as follows:
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
Example 3
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
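df = pd.DataFrame(data,columns=['Name','Age'])
# Convert only the Age column to float
df['Age'] = df['Age'].astype(float)
print(df)
Its output is as follows:
     Name   Age
0    Alex  10.0
1     Bob  12.0
2  Clarke  13.0
Note: The Age column is converted to floating point. Older pandas releases accepted dtype=float in the constructor and converted only the columns that could be converted; recent releases raise an error when a column such as Name cannot be converted, so astype is applied to the Age column instead.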
Create a DataFrame from Dict of ndarrays / Lists
All the ndarrays must be of the same length. If an index is passed, its length must equal the length of the arrays.
If no index is passed, the index defaults to range(n), where n is the array length.
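The length rule can be checked directly. The sketch below uses hypothetical height and weight arrays: the first call succeeds because the arrays and the index all have length 3, and the second is expected to raise a ValueError because the arrays differ in length.

```python
import numpy as np
import pandas as pd

# Arrays and index all have length 3, so construction succeeds.
data = {"height": np.array([1.6, 1.7, 1.8]), "weight": np.array([55, 65, 75])}
df = pd.DataFrame(data, index=["p1", "p2", "p3"])
print(df)

# Arrays of different lengths cannot be aligned into one table.
try:
    pd.DataFrame({"height": np.array([1.6, 1.7, 1.8]),
                  "weight": np.array([55, 65])})
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```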
Example 1
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)
Its output is as follows:
    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42
Note: Observe the values 0, 1, 2, 3. They are the default index labels assigned to the rows using range(n).
Example 2
Let us now create an indexed DataFrame using arrays.
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)
Its output is as follows:
        Name  Age
rank1    Tom   28
rank2   Jack   34
rank3  Steve   29
rank4  Ricky   42
Note: The index parameter assigns a label to each row.
Create a DataFrame from List of Dicts
A list of dictionaries can be passed as input data to create a DataFrame. By default, the dictionary keys are taken as column names.
Example 1
The following example shows how to create a DataFrame by passing a list of dictionaries.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)
Its output is as follows:
a b c
0 1 2 NaN
1 5 10 20.0
Note: NaN (Not a Number) is filled in where values are missing.
Example 2
The following example shows how to create a DataFrame by passing a list of dictionaries and the row indices.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)
Its output is as follows:
a b c
first 1 2 NaN
second 5 10 20.0
Example 3
The following example shows how to create a DataFrame with a list of dictionaries, row indices, and column indices.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
# Two column labels that match the dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])
# Two column labels, one of which ('b1') is not a dictionary key
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(df1)
print(df2)
Its output is as follows:
#df1 output
a b
first 1 2
second 5 10
#df2 output
a b1
first 1 NaN
second 5 NaN
Note: df2 is created with a column label ('b1') that is not a dictionary key, so that column is filled with NaN. df1 is created with column labels that match the dictionary keys, so no NaN values appear.
Create a DataFrame from Dict of Series
A dictionary of Series can be passed to form a DataFrame. The resulting index is the union of the indexes of all the Series passed.
Example
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)
Its output is as follows:
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
Note: Series one has no label 'd', so in the result the d row of column one is filled with NaN.
Let us now understand column selection, addition, and deletion through examples.
Column Selection
We will understand this by selecting a column from the DataFrame.
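A minimal sketch of column selection, reusing the dictionary of Series from the previous section. Indexing with a single label returns a Series, while indexing with a list of labels returns a DataFrame.

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# A single column label returns that column as a Series.
col = df['one']
print(col)

# A list of labels returns a DataFrame with just those columns.
subset = df[['one', 'two']]
print(subset)
```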
Column Addition
We will understand this by adding a new column to an existing DataFrame.
Example
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
# Add a new column by assigning a Series to a new column label
print("Adding a new column by passing as Series:")
df['three'] = pd.Series([10,20,30],index=['a','b','c'])
print(df)
print("Adding a new column using the existing columns in DataFrame:")
df['four'] = df['one']+df['three']
print(df)
Its output is as follows:
Adding a new column by passing as Series:
one two three
a 1.0 1 10.0
b 2.0 2 20.0
c 3.0 3 30.0
d NaN 4 NaN
Adding a new column using the existing columns in DataFrame:
one two three four
a 1.0 1 10.0 11.0
b 2.0 2 20.0 22.0
c 3.0 3 30.0 33.0
d NaN 4 NaN NaN
Column Deletion
Columns can be deleted or popped; let us take an example to understand how.
Example
# Build a DataFrame, then delete columns with del and pop
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'three' : pd.Series([10,20,30], index=['a','b','c'])}
df = pd.DataFrame(d)
print("Our dataframe is:")
print(df)
# Using the del statement
print("Deleting the first column using DEL function:")
del df['one']
print(df)
# Using the pop method
print("Deleting another column using POP function:")
df.pop('two')
print(df)
Its output is as follows:
Our dataframe is:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Deleting the first column using DEL function:
   two  three
a    1   10.0
b    2   20.0
c    3   30.0
d    4    NaN
Deleting another column using POP function:
three
a 10.0
b 20.0
c 30.0
d NaN
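The example above discards the value returned by pop. The sketch below, using a small made-up frame, shows that pop returns the removed column, and that drop with the columns argument is one alternative that leaves the original frame untouched.

```python
import pandas as pd

# Hypothetical three-column frame for illustration.
df = pd.DataFrame({"one": [1, 2], "two": [3, 4], "three": [5, 6]})

# pop removes the column in place and returns it as a Series.
popped = df.pop("two")
print(popped)

# drop returns a new frame without the column; df itself is not modified.
trimmed = df.drop(columns=["three"])
print(trimmed)
print(df)
```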
Row Selection, Addition, and Deletion
We will now understand row selection, addition, and deletion through examples. Let us begin with the concept of selection.
Selection by Label
Rows can be selected by passing a row label to the loc indexer.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.loc['b'])
Its output is as follows:
one 2.0
two 2.0
Name: b, dtype: float64
The result is a Series whose labels are the DataFrame's column names. The name of the Series is the row label used to retrieve it.
Selection by integer location
Rows can be selected by passing an integer position to the iloc indexer.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.iloc[2])
Its output is as follows:
one 3.0
two 3.0
Name: c, dtype: float64
Slice Rows
Multiple rows can be selected using the ':' slice operator.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df[2:4])
Its output is as follows:
one two
c 3.0 3
d NaN 4
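Slices can also be written explicitly with iloc or loc, which makes it clear whether positions or labels are meant. A minimal sketch on the same frame follows; note that a positional slice excludes its end point while a label slice includes it.

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# Positional slice: rows at positions 2 and 3 (the end position is excluded).
by_position = df.iloc[2:4]
print(by_position)

# Label slice: rows 'b' through 'c' (the end label is included).
by_label = df.loc['b':'c']
print(by_label)
```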
Addition of Rows
Add new rows to a DataFrame with pd.concat, which places the new rows at the end. Older tutorials use DataFrame.append for this; that method was deprecated in pandas 1.4 and removed in pandas 2.0.
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
df = pd.concat([df, df2])
print(df)
Its output is as follows:
a b
0 1 2
1 3 4
0 5 6
1 7 8
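The output above repeats the labels 0 and 1 because each input frame keeps its own index. If fresh labels are preferred, one option is to pass ignore_index=True to pd.concat, as in this sketch.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

# ignore_index=True discards the original labels and renumbers the rows 0..n-1.
combined = pd.concat([df, df2], ignore_index=True)
print(combined)
```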
Deletion of Rows
Use an index label to delete or drop rows from a DataFrame. If the label is duplicated, multiple rows will be dropped.
In the above example, the labels are duplicated. Let us drop a label and see how many rows get dropped.
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
df = pd.concat([df, df2])
# Drop rows with label 0
df = df.drop(0)
print(df)
Its output is as follows:
a b
1 3 4
1 7 8
In the above example, two rows were dropped because both carry the label 0.
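If only one of the duplicated rows should go, the labels need to be made unique first. One possible approach, sketched here, is to renumber the rows with reset_index before calling drop.

```python
import pandas as pd

df = pd.concat([
    pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b']),
    pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b']),
])

# Replace the duplicated labels 0, 1, 0, 1 with unique labels 0..3.
unique = df.reset_index(drop=True)

# Now label 0 refers to exactly one row, so only that row is dropped.
result = unique.drop(0)
print(result)
```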