Python Pandas: A Concise Tutorial

Python Pandas - DataFrame

A DataFrame is a two-dimensional data structure: data is aligned in a tabular fashion in rows and columns.

Features of DataFrame

  1. Columns can be of different types

  2. Size is mutable

  3. Labeled axes (rows and columns)

  4. Arithmetic operations can be performed on rows and columns
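
The four features above can be seen in a few lines. The following is a minimal sketch; the sample names, scores, and row labels are made up for illustration only.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame(
    {"Name": ["Alex", "Bob"], "Age": [10, 12]},
    index=["r1", "r2"],  # labeled row axis
)

# 1. Columns can hold different types (text and integers here).
print(df.dtypes)

# 2. Size is mutable: assigning a new column grows the frame.
df["Score"] = [81.5, 77.0]
print(df.shape)  # (2, 3)

# 3. Both axes carry labels.
print(list(df.index), list(df.columns))

# 4. Arithmetic applies to a whole column at once.
df["AgeNextYear"] = df["Age"] + 1
print(df)
```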

Structure

Let us assume that we are creating a DataFrame from student data.

structure table

You can think of it as an SQL table or a spreadsheet data representation.

pandas.DataFrame

A pandas DataFrame can be created using the following constructor −

pandas.DataFrame( data, index, columns, dtype, copy)

The parameters of the constructor are as follows −

  1. data: the input data. It can take various forms, such as an ndarray, Series, map, list, dict, constants, or another DataFrame.

  2. index: the row labels to use for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.

  3. columns: the column labels to use for the resulting frame. Optional; defaults to np.arange(n) if no column labels are passed.

  4. dtype: the data type to apply to each column.

  5. copy: whether to copy the data from the input. The default is given here as False; newer pandas versions choose the default based on the input type.
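
As a rough illustration of how these parameters fit together, the sketch below passes all five explicitly and then builds a second frame with the defaults. The array contents and the labels row1, row2, x, and y are hypothetical.

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2], [3, 4]])  # hypothetical input data

df = pd.DataFrame(
    data=values,
    index=["row1", "row2"],   # row labels
    columns=["x", "y"],       # column labels
    dtype=float,              # cast every column to float
    copy=True,                # work on a copy, leaving `values` untouched
)
df.loc["row1", "x"] = 99.0
print(df)
print(values[0, 0])  # the source array still holds 1

# With no index or columns passed, both axes get 0, 1, ... labels.
default = pd.DataFrame(values)
print(list(default.index), list(default.columns))
```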

Create DataFrame

A pandas DataFrame can be created using various inputs like −

  1. Lists

  2. dict

  3. Series

  4. Numpy ndarrays

  5. Another DataFrame

In the subsequent sections of this chapter, we will see how to create a DataFrame using these inputs.
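
Two of the listed inputs, NumPy ndarrays and another DataFrame, can be sketched briefly as follows. The array values and column names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# A 2-D ndarray: each inner row becomes a DataFrame row.
arr = np.array([[1, 2, 3], [4, 5, 6]])
df_from_array = pd.DataFrame(arr, columns=["a", "b", "c"])
print(df_from_array)

# Another DataFrame: passing columns keeps only the named columns.
df_from_frame = pd.DataFrame(df_from_array, columns=["a", "c"])
print(df_from_frame)
```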

Create an Empty DataFrame

The most basic DataFrame that can be created is an empty DataFrame.

Example

# Import the pandas library under the alias pd
import pandas as pd
df = pd.DataFrame()
print(df)

Its output is as follows −

Empty DataFrame
Columns: []
Index: []

Create a DataFrame from Lists

The DataFrame can be created using a single list or a list of lists.

Example 1

import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)

Its output is as follows −

     0
0    1
1    2
2    3
3    4
4    5

Example 2

import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)

Its output is as follows −

      Name      Age
0     Alex      10
1     Bob       12
2     Clarke    13

Example 3

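import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
# Cast only the numeric column. Recent pandas versions raise an error when
# dtype=float is passed to the constructor here, because 'Name' holds strings.
df = df.astype({'Age': float})
print(df)

Its output is as follows −

      Name     Age
0     Alex     10.0
1     Bob      12.0
2     Clarke   13.0

Note − Observe that the Age column is now floating point. The dtype parameter of the constructor applies to every column, so it suits data that is entirely numeric.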

Create a DataFrame from Dict of ndarrays / Lists

All the ndarrays must be of the same length. If an index is passed, its length must equal the length of the arrays.

If no index is passed, then by default, index will be range(n), where n is the array length.

Example 1

import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)

Its output is as follows −

    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42

Note − Observe the values 0, 1, 2, 3. They are the default index, range(n), assigned to the rows.

Example 2

Let us now create an indexed DataFrame using arrays.

import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)

Its output is as follows −

        Name  Age
rank1    Tom   28
rank2   Jack   34
rank3  Steve   29
rank4  Ricky   42

Note − Observe that the index parameter assigns a label to each row.
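
The length rule stated at the start of this section can be checked with a small sketch. The helper try_build is a hypothetical name introduced here; it simply reports the error message instead of stopping the script.

```python
import pandas as pd

def try_build(data, index=None):
    """Return the DataFrame, or the error text if construction fails."""
    try:
        return pd.DataFrame(data, index=index)
    except ValueError as exc:
        return f"ValueError: {exc}"

# Lists of different lengths are rejected.
print(try_build({'Name': ['Tom', 'Jack'], 'Age': [28, 34, 29]}))

# An index shorter than the lists is rejected as well.
print(try_build({'Name': ['Tom', 'Jack'], 'Age': [28, 34]}, index=['rank1']))

# Matching lengths produce a DataFrame.
print(try_build({'Name': ['Tom', 'Jack'], 'Age': [28, 34]}, index=['rank1', 'rank2']))
```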

Create a DataFrame from List of Dicts

A list of dictionaries can be passed as input data to create a DataFrame. By default, the dictionary keys are taken as column names.

Example 1

The following example shows how to create a DataFrame by passing a list of dictionaries.

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)

Its output is as follows −

    a    b      c
0   1   2     NaN
1   5   10   20.0

Note − Observe that NaN (Not a Number) is filled in where a value is missing.
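
The missing entries can also be located programmatically. The sketch below reuses the data from Example 1 and uses isna to mark them; it also shows why column c prints as 20.0 rather than 20.

```python
import pandas as pd

data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)

# True marks each cell that was filled with NaN.
print(df.isna())

# NaN is a floating-point value, so column 'c' is stored as float.
print(df['c'].dtype)  # float64
```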

Example 2

The following example shows how to create a DataFrame by passing a list of dictionaries and the row indices.

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)

Its output is as follows −

        a   b       c
first   1   2     NaN
second  5   10   20.0

Example 3

The following example shows how to create a DataFrame with a list of dictionaries, row indices, and column indices.

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

# Two column labels that match the dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])

# Two column labels, one of which ('b1') matches no dictionary key
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(df1)
print(df2)

Its output is as follows −

#df1 output
         a  b
first    1  2
second   5  10

#df2 output
         a  b1
first    1  NaN
second   5  NaN

Note − Observe that df2 is created with a column label (b1) that is not a dictionary key, so that column is filled with NaN. df1 is created with column labels that match the dictionary keys, so no NaN appears.

Create a DataFrame from Dict of Series

A dictionary of Series can be passed to form a DataFrame. The resulting index is the union of all the Series indexes passed.

Example

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df)

Its output is as follows −

      one    two
a     1.0    1
b     2.0    2
c     3.0    3
d     NaN    4

Note − Observe that Series one has no label 'd', so in the result the entry at label d in column one is NaN.

Let us now understand column selection, addition, and deletion through examples.

Column Selection

We will understand this by selecting a column from the DataFrame.

Example

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df['one'])

Its output is as follows −

a     1.0
b     2.0
c     3.0
d     NaN
Name: one, dtype: float64

Column Addition

We will understand this by adding a new column to an existing data frame.

Example

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

# Add a new column by assigning a Series under a new column label
print("Adding a new column by passing as Series:")
df['three'] = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(df)

# Add a new column computed from existing columns
print("Adding a new column using the existing columns in DataFrame:")
df['four'] = df['one'] + df['three']
print(df)

Its output is as follows −

Adding a new column by passing as Series:
     one   two   three
a    1.0    1    10.0
b    2.0    2    20.0
c    3.0    3    30.0
d    NaN    4    NaN

Adding a new column using the existing columns in DataFrame:
      one   two   three    four
a     1.0    1    10.0     11.0
b     2.0    2    20.0     22.0
c     3.0    3    30.0     33.0
d     NaN    4     NaN     NaN

Column Deletion

Columns can be deleted or popped; let us take an example to understand how.

Example

# Rebuild the DataFrame from the previous example, then delete columns
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
   'three' : pd.Series([10,20,30], index=['a','b','c'])}

df = pd.DataFrame(d)
print("Our dataframe is:")
print(df)

# The del statement removes a column in place
print("Deleting the first column using DEL function:")
del df['one']
print(df)

# pop removes a column in place and returns it
print("Deleting another column using POP function:")
df.pop('two')
print(df)

Its output is as follows −

Our dataframe is:
      one   two   three
a     1.0    1    10.0
b     2.0    2    20.0
c     3.0    3    30.0
d     NaN    4     NaN

Deleting the first column using DEL function:
      two   three
a      1    10.0
b      2    20.0
c      3    30.0
d      4     NaN

Deleting another column using POP function:
   three
a  10.0
b  20.0
c  30.0
d  NaN
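
Both del and pop change the DataFrame in place. As a hedged sketch of the difference between in-place and copying removal, the example below uses made-up values and compares pop with drop(columns=...), which returns a new DataFrame.

```python
import pandas as pd

# Hypothetical values, used only for illustration.
df = pd.DataFrame({'one': [1, 2], 'two': [3, 4], 'three': [5, 6]})

# pop removes 'two' from df and hands the column back as a Series.
popped = df.pop('two')
print(popped.tolist())        # [3, 4]
print(list(df.columns))       # ['one', 'three']

# drop returns a new DataFrame; df itself keeps its columns.
trimmed = df.drop(columns=['one'])
print(list(trimmed.columns))  # ['three']
print(list(df.columns))       # ['one', 'three']
```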

Row Selection, Addition, and Deletion

We will now understand row selection, addition and deletion through examples. Let us begin with the concept of selection.

Selection by Label

Rows can be selected by passing a row label to the loc indexer.

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df.loc['b'])

Its output is as follows −

one 2.0
two 2.0
Name: b, dtype: float64

The result is a Series whose labels are the column names of the DataFrame. The name of the Series is the label with which it was retrieved.
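
loc also accepts more than a single label. The sketch below, built on the same DataFrame, selects several rows with a list of labels and a single cell with a row label and a column label.

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# A list of labels returns a DataFrame with just those rows.
subset = df.loc[['a', 'c']]
print(subset)

# A row label plus a column label returns a single value.
cell = df.loc['b', 'two']
print(cell)  # 2
```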

Selection by integer location

Rows can be selected by passing an integer location to the iloc indexer.

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df.iloc[2])

Its output is as follows −

one   3.0
two   3.0
Name: c, dtype: float64

Slice Rows

Multiple rows can be selected using the ':' operator.

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df[2:4])

Its output is as follows −

   one  two
c  3.0    3
d  NaN    4
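
The same rows can be sliced more explicitly through iloc (by position) or loc (by label). The sketch below compares the two; note that a position slice excludes its end point, while a label slice includes it.

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

by_position = df.iloc[2:4]    # positions 2 and 3; position 4 is excluded
by_label = df.loc['c':'d']    # labels 'c' through 'd'; 'd' is included

print(by_position)
print(by_position.equals(by_label))  # True
```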

Addition of Rows

Add new rows to a DataFrame using the pd.concat function, which places the new rows at the end. (The older DataFrame.append method was removed in pandas 2.0.)

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

# Stack df2 below df; the original row labels are kept
df = pd.concat([df, df2])
print(df)

Its output is as follows −

   a  b
0  1  2
1  3  4
0  5  6
1  7  8
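
The output above repeats the row labels 0 and 1, because each source frame keeps its own labels. If fresh labels are preferred, one option is pd.concat with ignore_index=True, sketched here with the same two frames.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
combined = pd.concat([df, df2], ignore_index=True)
print(combined)
```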

Deletion of Rows

Use an index label to delete or drop rows from a DataFrame. If the label is duplicated, multiple rows are dropped.

In the above example, the labels are duplicated. Let us drop a label and see how many rows get dropped.

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

df = pd.concat([df, df2])

# Drop rows with label 0
df = df.drop(0)

print(df)

Its output is as follows −

  a b
1 3 4
1 7 8

In the above example, two rows were dropped because both carry the label 0.
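
When only one of the duplicated rows should go, one possible approach is to make the labels unique first. The sketch below assumes that renumbering the rows with reset_index(drop=True) is acceptable for the data at hand.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])
combined = pd.concat([df, df2])          # labels: 0, 1, 0, 1

# Renumber the rows 0..3 so that every label is unique.
unique = combined.reset_index(drop=True)

# Now drop(0) removes exactly one row.
result = unique.drop(0)
print(result)
```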