Python Pandas - Quick Guide

Python Pandas - Introduction

Pandas is an open-source Python library providing high-performance data manipulation and analysis tools through its powerful data structures. The name Pandas is derived from the term Panel Data − an econometrics term for multidimensional data.

In 2008, developer Wes McKinney started developing pandas when he needed a high-performance, flexible tool for the analysis of data.

Prior to Pandas, Python was mostly used for data munging and preparation; it contributed very little to data analysis. Pandas solved this problem. Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of the origin of the data: load, prepare, manipulate, model, and analyze.

Python with Pandas is used in a wide range of fields, both academic and commercial, including finance, economics, statistics, and analytics.

Key Features of Pandas

  1. Fast and efficient DataFrame object with default and customized indexing.

  2. Tools for loading data into in-memory data objects from different file formats.

  3. Data alignment and integrated handling of missing data.

  4. Reshaping and pivoting of data sets.

  5. Label-based slicing, indexing and subsetting of large data sets.

  6. Columns from a data structure can be deleted or inserted.

  7. Group by data for aggregation and transformations.

  8. High performance merging and joining of data.

  9. Time Series functionality.
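As a quick, illustrative taste of two features from this list (group-by aggregation and label-based selection), using made-up data:

```python
import pandas as pd

# Illustrative data: total sales per team.
df = pd.DataFrame({'team': ['A', 'A', 'B'], 'sales': [10, 20, 5]})

# Feature 7: group by data for aggregation.
totals = df.groupby('team')['sales'].sum()

# Feature 5: label-based lookup on the aggregated result.
print(totals['A'])
```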

Python Pandas - Environment Setup

The standard Python distribution doesn't come bundled with the Pandas module. A lightweight alternative is to install Pandas using the popular Python package installer, pip.

pip install pandas

If you install the Anaconda Python package, Pandas will be installed by default. The following distributions bundle it −

Windows

  1. Anaconda (from https://www.continuum.io) is a free Python distribution for SciPy stack. It is also available for Linux and Mac.

  2. Canopy (https://www.enthought.com/products/canopy/) is available as free as well as commercial distribution with full SciPy stack for Windows, Linux and Mac.

  3. Python (x,y) is a free Python distribution with SciPy stack and Spyder IDE for Windows OS. (Downloadable from http://python-xy.github.io/)

Linux

Package managers of respective Linux distributions are used to install one or more packages in SciPy stack.

For Ubuntu Users

sudo apt-get install python-numpy python-scipy python-matplotlib ipython
ipython-notebook python-pandas python-sympy python-nose

For Fedora Users

sudo yum install numpy scipy python-matplotlib ipython python-pandas sympy
python-nose atlas-devel

Introduction to Data Structures

Pandas deals with the following three data structures −

  1. Series

  2. DataFrame

  3. Panel

These data structures are built on top of NumPy arrays, which makes them fast.

Dimension & Description

The best way to think of these data structures is that the higher dimensional data structure is a container of its lower dimensional data structure. For example, DataFrame is a container of Series, Panel is a container of DataFrame.
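A minimal sketch of this container relationship (the column names are illustrative): selecting a column from a DataFrame yields a Series.

```python
import pandas as pd

# A DataFrame is a container of Series: each column is itself a Series.
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})
col = df['x']
print(type(col).__name__)  # Series
print(col.tolist())
```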

  1. Series − 1 dimension. A 1D labeled homogeneous array, size-immutable.

  2. DataFrame − 2 dimensions. A general 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.

  3. Panel − 3 dimensions. A general 3D labeled, size-mutable array.

Building and handling arrays of two or more dimensions is a tedious task, and the burden of considering the orientation of the data set when writing functions falls on the user. With Pandas data structures, the mental effort of the user is reduced.

For example, with tabular data (DataFrame) it is more semantically helpful to think of the index (the rows) and the columns rather than axis 0 and axis 1.
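This naming is also visible in the API: pandas accepts the strings 'index' and 'columns' as readable aliases for axis 0 and axis 1. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# 'index' is an alias for axis=0: sums down each column.
print(df.sum(axis='index'))

# 'columns' is an alias for axis=1: sums across each row.
print(df.sum(axis='columns'))
```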

Mutability

All Pandas data structures are value-mutable (their values can be changed). All of them except Series are also size-mutable; a Series is size-immutable.
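A short sketch of this distinction (values are illustrative): assigning to an element of a Series changes its values in place, while growing it requires building a new object.

```python
import pandas as pd

s = pd.Series([10, 20, 30])

# Value mutable: assignment changes the data in place.
s.iloc[0] = 99
print(s.tolist())

# Size immutable: "adding" an element produces a new Series object.
bigger = pd.concat([s, pd.Series([40])], ignore_index=True)
print(len(s), len(bigger))
```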

Note − DataFrame is widely used and is one of the most important data structures. Panel is used much less, and was deprecated and later removed from pandas.

Series

A Series is a one-dimensional array-like structure with homogeneous data. For example, the following series is a collection of the integers 10, 23, 56, …

10
23
56
17
52
61
73
90
26
72

Key Points

  1. Homogeneous data

  2. Size Immutable

  3. Values of Data Mutable

DataFrame

DataFrame is a two-dimensional array with heterogeneous data. For example,

Name     Age   Gender   Rating
Steve    32    Male     3.45
Lia      28    Female   4.6
Vin      45    Male     3.9
Katie    38    Female   2.78

The table represents the data of a sales team of an organization with their overall performance rating. The data is represented in rows and columns. Each column represents an attribute and each row represents a person.

Data Type of Columns

The data types of the four columns are as follows −

  1. Name − String

  2. Age − Integer

  3. Gender − String

  4. Rating − Float
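The types above can be reproduced by building the sales-team DataFrame and inspecting its dtypes (string columns show up as the object dtype):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Steve', 'Lia', 'Vin', 'Katie'],
    'Age': [32, 28, 45, 38],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Rating': [3.45, 4.6, 3.9, 2.78],
})
print(df.dtypes)  # Name/Gender -> object, Age -> int64, Rating -> float64
```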

Key Points

  1. Heterogeneous data

  2. Size Mutable

  3. Data Mutable

Panel

A Panel is a three-dimensional data structure with heterogeneous data. A panel is hard to represent graphically, but it can be illustrated as a container of DataFrames.

Key Points

  1. Heterogeneous data

  2. Size Mutable

  3. Data Mutable

Python Pandas - Series

A Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floats, Python objects, etc.). The axis labels are collectively called the index.

pandas.Series

A pandas Series can be created using the following constructor −

pandas.Series(data, index, dtype, copy)

The parameters of the constructor are as follows −

  1. data − data takes various forms like ndarray, list, constants.

  2. index − index values must be unique and hashable, and of the same length as data. Defaults to np.arange(n) if no index is passed.

  3. dtype − the data type. If None, the data type will be inferred.

  4. copy − copy data. Default False.

A series can be created using various inputs like −

  1. Array

  2. Dict

  3. Scalar value or constant

Create an Empty Series

The most basic series that can be created is an empty Series.

Example

#import the pandas library and aliasing as pd
import pandas as pd
s = pd.Series(dtype='float64')   # specify a dtype to avoid a warning for an empty Series
print(s)

Its output is as follows −

Series([], dtype: float64)

Create a Series from ndarray

If data is an ndarray, then the index passed must be of the same length. If no index is passed, then by default the index will be range(n), where n is the array length, i.e., [0, 1, ..., len(array)-1].

Example 1

#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)

Its output is as follows −

0   a
1   b
2   c
3   d
dtype: object

We did not pass any index, so by default, it assigned the indexes ranging from 0 to len(data)-1, i.e., 0 to 3.

Example 2

#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print(s)

Its output is as follows −

100  a
101  b
102  c
103  d
dtype: object

We passed the index values here. Now we can see the customized indexed values in the output.

Create a Series from dict

A dict can be passed as input. If no index is specified, the dictionary keys are taken to construct the index (modern pandas keeps them in insertion order; older versions sorted them). If an index is passed, the values in data corresponding to the labels in the index will be pulled out.

Example 1

#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)

Its output is as follows −

a 0.0
b 1.0
c 2.0
dtype: float64

Observe − Dictionary keys are used to construct the index.

Example 2

#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data,index=['b','c','d','a'])
print(s)

Its output is as follows −

b 1.0
c 2.0
d NaN
a 0.0
dtype: float64

Observe − Index order is preserved, and the missing element is filled with NaN (Not a Number).

Create a Series from Scalar

If data is a scalar value, an index must be provided. The value will be repeated to match the length of the index.

#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)

Its output is as follows −

0  5
1  5
2  5
3  5
dtype: int64

Accessing Data from Series with Position

Data in the series can be accessed similarly to data in an ndarray.

Example 1

Retrieve the first element. As we already know, the counting starts from zero for the array, which means the first element is stored at zeroth position and so on.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve the first element (by position)
print(s.iloc[0])

Its output is as follows −

1

Example 2

Retrieve the first three elements in the Series. If a : is inserted in front of an index, all items from that index onwards will be extracted. If two parameters (with : between them) are used, the items between the two indexes (not including the stop index) are extracted.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve the first three elements
print(s[:3])

Its output is as follows −

a  1
b  2
c  3
dtype: int64

Example 3

Retrieve the last three elements.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve the last three elements
print(s[-3:])

Its output is as follows −

c  3
d  4
e  5
dtype: int64

Retrieve Data Using Label (Index)

A Series is like a fixed-size dict in that you can get and set values by index label.

Example 1

Retrieve a single element using index label value.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve a single element
print(s['a'])

Its output is as follows −

1

Example 2

Retrieve multiple elements using a list of index label values.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve multiple elements
print(s[['a','c','d']])

Its output is as follows −

a  1
c  3
d  4
dtype: int64

Example 3

If a label is not contained, an exception is raised.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieving a label that is not contained raises KeyError
print(s['f'])

Its output is as follows −

…
KeyError: 'f'

Python Pandas - DataFrame

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

Features of DataFrame

  1. Potentially columns are of different types

  2. Size – Mutable

  3. Labeled axes (rows and columns)

  4. Can perform arithmetic operations on rows and columns

Structure

Let us assume that we are creating a data frame with student’s data.

[Figure: structure of the student DataFrame]

You can think of it as an SQL table or a spreadsheet data representation.

pandas.DataFrame

A pandas DataFrame can be created using the following constructor −

pandas.DataFrame(data, index, columns, dtype, copy)

The parameters of the constructor are as follows −

  1. data − data takes various forms like ndarray, series, map, lists, dict, constants and also another DataFrame.

  2. index − for the row labels; optional. Defaults to np.arange(n) if no index is passed.

  3. columns − for the column labels; optional. Defaults to np.arange(n) if no column labels are passed.

  4. dtype − the data type of each column.

  5. copy − copy data. Default False.

Create DataFrame

A pandas DataFrame can be created using various inputs like −

  1. Lists

  2. dict

  3. Series

  4. Numpy ndarrays

  5. Another DataFrame

In the subsequent sections of this chapter, we will see how to create a DataFrame using these inputs.

Create an Empty DataFrame

The most basic DataFrame that can be created is an empty DataFrame.

Example

#import the pandas library and aliasing as pd
import pandas as pd
df = pd.DataFrame()
print(df)

Its output is as follows −

Empty DataFrame
Columns: []
Index: []

Create a DataFrame from Lists

The DataFrame can be created using a single list or a list of lists.

Example 1

import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)

Its output is as follows −

     0
0    1
1    2
2    3
3    4
4    5

Example 2

import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)

Its output is as follows −

      Name      Age
0     Alex      10
1     Bob       12
2     Clarke    13

Example 3

import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'],dtype=float)
print(df)

Its output is as follows −

      Name     Age
0     Alex     10.0
1     Bob      12.0
2     Clarke   13.0

Note − Observe, the dtype parameter changes the type of the Age column to floating point.

Create a DataFrame from Dict of ndarrays / Lists

All the ndarrays must be of the same length. If an index is passed, then the length of the index should be equal to the length of the arrays.

If no index is passed, then by default, index will be range(n), where n is the array length.

Example 1

import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)

Its output is as follows −

    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42

Note − Observe the values 0, 1, 2, 3. They are the default index assigned to each row using the function range(n).

Example 2

Let us now create an indexed DataFrame using arrays.

import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)

Its output is as follows −

        Name  Age
rank1    Tom   28
rank2   Jack   34
rank3  Steve   29
rank4  Ricky   42

Note − Observe, the index parameter assigns an index to each row.

Create a DataFrame from List of Dicts

List of Dictionaries can be passed as input data to create a DataFrame. The dictionary keys are by default taken as column names.

Example 1

The following example shows how to create a DataFrame by passing a list of dictionaries.

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)

Its output is as follows −

    a    b      c
0   1   2     NaN
1   5   10   20.0

Note − Observe, NaN (Not a Number) is filled in the missing areas.

Example 2

The following example shows how to create a DataFrame by passing a list of dictionaries and the row indices.

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)

Its output is as follows −

        a   b       c
first   1   2     NaN
second  5   10   20.0

Example 3

The following example shows how to create a DataFrame with a list of dictionaries, row indices, and column indices.

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

#With two column indices, values same as dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])

#With two column indices with one index with other name
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(df1)
print(df2)

Its output is as follows −

#df1 output
         a  b
first    1  2
second   5  10

#df2 output
         a  b1
first    1  NaN
second   5  NaN

Note − Observe, df2 is created with a column index other than the dictionary keys; thus, NaN values are filled in for that column. Whereas df1 is created with column indices that are the same as the dictionary keys, so no NaN values appear.

Create a DataFrame from Dict of Series

Dictionary of Series can be passed to form a DataFrame. The resultant index is the union of all the series indexes passed.

Example

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df)

Its output is as follows −

      one    two
a     1.0    1
b     2.0    2
c     3.0    3
d     NaN    4

Note − Observe, for the series one, there is no label ‘d’ passed, but in the result the d label is filled with NaN.

Let us now understand column selection, addition, and deletion through examples.

Column Selection

We will understand this by selecting a column from the DataFrame.

Example

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df['one'])

Its output is as follows −

a     1.0
b     2.0
c     3.0
d     NaN
Name: one, dtype: float64

Column Addition

We will understand this by adding a new column to an existing data frame.

Example

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

# Adding a new column to an existing DataFrame object with column label by passing new series

print ("Adding a new column by passing as Series:")
df['three']=pd.Series([10,20,30],index=['a','b','c'])
print(df)

print ("Adding a new column using the existing columns in DataFrame:")
df['four']=df['one']+df['three']

print(df)

Its output is as follows −

Adding a new column by passing as Series:
     one   two   three
a    1.0    1    10.0
b    2.0    2    20.0
c    3.0    3    30.0
d    NaN    4    NaN

Adding a new column using the existing columns in DataFrame:
      one   two   three    four
a     1.0    1    10.0     11.0
b     2.0    2    20.0     22.0
c     3.0    3    30.0     33.0
d     NaN    4     NaN     NaN

Column Deletion

Columns can be deleted or popped; let us take an example to understand how.

Example

# Using the previous DataFrame, we will delete a column
# using del function
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
   'three' : pd.Series([10,20,30], index=['a','b','c'])}

df = pd.DataFrame(d)
print ("Our dataframe is:")
print(df)

# using del function
print ("Deleting the first column using DEL function:")
del df['one']
print(df)

# using pop function
print ("Deleting another column using POP function:")
df.pop('two')
print(df)

Its output is as follows −

Our dataframe is:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN

Deleting the first column using DEL function:
   two  three
a    1   10.0
b    2   20.0
c    3   30.0
d    4    NaN

Deleting another column using POP function:
   three
a   10.0
b   20.0
c   30.0
d    NaN

Row Selection, Addition, and Deletion

We will now understand row selection, addition and deletion through examples. Let us begin with the concept of selection.

Selection by Label

Rows can be selected by passing a row label to the loc function.

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df.loc['b'])

Its output is as follows −

one 2.0
two 2.0
Name: b, dtype: float64

The result is a Series whose labels are the column names of the DataFrame, and whose Name is the label with which it was retrieved.

Selection by integer location

Rows can be selected by passing an integer location to the iloc function.

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df.iloc[2])

Its output is as follows −

one   3.0
two   3.0
Name: c, dtype: float64

Slice Rows

Multiple rows can be selected using the ‘:’ operator.

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df[2:4])

Its output is as follows −

   one  two
c  3.0    3
d  NaN    4

Addition of Rows

Add new rows to a DataFrame using the append function, which appends the rows at the end. (In pandas 2.0 and later, append was removed; pd.concat is the equivalent.)

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

# df = df.append(df2) in older pandas; pd.concat is the modern equivalent
df = pd.concat([df, df2])
print(df)

Its output is as follows −

   a  b
0  1  2
1  3  4
0  5  6
1  7  8

Deletion of Rows

Use an index label to delete or drop rows from a DataFrame. If the label is duplicated, multiple rows will be dropped.

If you observe, in the above example the labels are duplicated. Let us drop one label and see how many rows get dropped.

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

# append was removed in pandas 2.0; pd.concat behaves the same here
df = pd.concat([df, df2])

# Drop rows with label 0
df = df.drop(0)

print(df)

Its output is as follows −

  a b
1 3 4
1 7 8

In the above example, two rows were dropped because those two contain the same label 0.

Python Pandas - Panel

A panel is a 3D container of data. The term Panel data is derived from econometrics and is partially responsible for the name pandas − pan(el)-da(ta)-s. Note that Panel was removed in pandas 0.25; the examples in this chapter require an older version.

The names for the 3 axes are intended to give some semantic meaning to describing operations involving panel data. They are −

  1. items − axis 0, each item corresponds to a DataFrame contained inside.

  2. major_axis − axis 1, it is the index (rows) of each of the DataFrames.

  3. minor_axis − axis 2, it is the columns of each of the DataFrames.
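Since Panel was removed from later pandas versions, the same three-axis layout is nowadays usually expressed as a DataFrame with a MultiIndex, where the (item, major_axis) pair stands in for the first two axes. A minimal sketch of that replacement (the names are illustrative):

```python
import pandas as pd
import numpy as np

# 2 items x 4 rows (major_axis) x 3 columns (minor_axis)
arr = np.arange(24).reshape(2, 4, 3)
frames = {f'Item{i + 1}': pd.DataFrame(arr[i]) for i in range(2)}

# Stack the per-item DataFrames under a two-level index.
panel_like = pd.concat(frames, names=['item', 'major_axis'])
print(panel_like.loc['Item1'])  # the DataFrame "inside" item 1
```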

pandas.Panel()

A Panel can be created using the following constructor −

pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)

The parameters of the constructor are as follows −

  1. data − data takes various forms like ndarray, series, map, lists, dict, constants and also another DataFrame.

  2. items − axis=0.

  3. major_axis − axis=1.

  4. minor_axis − axis=2.

  5. dtype − the data type of each column.

  6. copy − copy data. Default False.

Create Panel

A Panel can be created in multiple ways, such as −

  1. From ndarrays

  2. From dict of DataFrames

From 3D ndarray

# creating a panel from a 3D ndarray
import pandas as pd
import numpy as np

data = np.random.rand(2,4,5)
p = pd.Panel(data)
print(p)

Its output is as follows −

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 5 (minor_axis)
Items axis: 0 to 1
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 4

Note − Compare the dimensions of this panel with those of the empty panel created below; all the objects are different.

From dict of DataFrame Objects

# creating a panel from a dict of DataFrame objects
import pandas as pd
import numpy as np

data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
   'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p)

Its output is as follows −

Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2

Create an Empty Panel

An empty panel can be created using the Panel constructor as follows −

# creating an empty panel
import pandas as pd
p = pd.Panel()
print(p)

Its output is as follows −

<class 'pandas.core.panel.Panel'>
Dimensions: 0 (items) x 0 (major_axis) x 0 (minor_axis)
Items axis: None
Major_axis axis: None
Minor_axis axis: None

Selecting the Data from Panel

Select the data from the panel using −

  1. Items

  2. Major_axis

  3. Minor_axis

Using Items

# selecting data using the item label
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
   'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p['Item1'])

Its output is as follows −

            0          1          2
0    0.488224  -0.128637   0.930817
1    0.417497   0.896681   0.576657
2   -2.775266   0.571668   0.290082
3   -0.400538  -0.144234   1.110535

We have two items, and we retrieved item1. The result is a DataFrame with 4 rows and 3 columns, corresponding to the major_axis and minor_axis dimensions.

Using major_axis

Data can be accessed using the method panel.major_xs(index).

# selecting data along major_axis
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
   'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p.major_xs(1))

Its output is as follows −

      Item1       Item2
0   0.417497    0.748412
1   0.896681   -0.557322
2   0.576657       NaN

Using minor_axis

Data can be accessed using the method panel.minor_xs(index).

# selecting data along minor_axis
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
   'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p.minor_xs(1))

Its output is as follows −

       Item1       Item2
0   -0.128637   -1.047032
1    0.896681   -0.557322
2    0.571668    0.431953
3   -0.144234    1.302466

Note − Observe the changes in the dimensions.

Python Pandas - Basic Functionality

By now, we have learnt about the three Pandas data structures and how to create them. We will mainly focus on the DataFrame objects because of their importance in real-time data processing, and will also discuss a few other data structures.

Series Basic Functionality

  1. axes − Returns a list of the row axis labels.

  2. dtype − Returns the dtype of the object.

  3. empty − Returns True if the series is empty.

  4. ndim − Returns the number of dimensions of the underlying data, by definition 1.

  5. size − Returns the number of elements in the underlying data.

  6. values − Returns the Series as an ndarray.

  7. head() − Returns the first n rows.

  8. tail() − Returns the last n rows.

Let us now create a Series and see the operation of all the attributes tabulated above.

Example

import pandas as pd
import numpy as np

#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print(s)

Its output is as follows −

0   0.967853
1  -0.148368
2  -1.395906
3  -1.758394
dtype: float64

axes

Returns the list of the labels of the series.

import pandas as pd
import numpy as np

#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print ("The axes are:")
print(s.axes)

Its output is as follows −

The axes are:
[RangeIndex(start=0, stop=4, step=1)]

The above result is a compact format of the list of index values from 0 to 3, i.e., [0, 1, 2, 3].

empty

Returns the Boolean value saying whether the Object is empty or not. True indicates that the object is empty.

import pandas as pd
import numpy as np

#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print ("Is the Object empty?")
print(s.empty)

Its output is as follows −

Is the Object empty?
False

ndim

Returns the number of dimensions of the object. By definition, a Series is a 1D data structure, so it returns 1.

import pandas as pd
import numpy as np

#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print(s)

print ("The dimensions of the object:")
print(s.ndim)

Its output is as follows −

0   0.175898
1   0.166197
2  -0.609712
3  -1.377000
dtype: float64

The dimensions of the object:
1

size

Returns the size (length) of the series.

import pandas as pd
import numpy as np

#Create a series with 2 random numbers
s = pd.Series(np.random.randn(2))
print(s)
print ("The size of the object:")
print(s.size)

Its output is as follows −

0   3.078058
1  -1.207803
dtype: float64

The size of the object:
2

values

Returns the actual data in the series as an array.

import pandas as pd
import numpy as np

#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print(s)

print("The actual data series is:")
print(s.values)

Its output is as follows −

0   1.787373
1  -0.605159
2   0.180477
3  -0.140922
dtype: float64

The actual data series is:
[ 1.78737302 -0.60515881 0.18047664 -0.1409218 ]
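As a quick recap, all the attributes above can be checked on a small fixed Series; the sketch below uses a plain list instead of random numbers so the results are deterministic:

```python
import pandas as pd

# A deterministic Series to exercise the attributes
s = pd.Series([10, 20, 30])

print(s.axes)    # [RangeIndex(start=0, stop=3, step=1)]
print(s.empty)   # False
print(s.ndim)    # 1
print(s.size)    # 3
print(s.values)  # [10 20 30]
```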

Head & Tail

To view a small sample of a Series or the DataFrame object, use the head() and the tail() methods.

head() returns the first n rows (observe the index values). The default number of elements to display is five, but you may pass a custom number.

import pandas as pd
import numpy as np

#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print("The original series is:")
print(s)

print("The first two rows of the data series:")
print(s.head(2))

Its output is as follows −

The original series is:
0   0.720876
1  -0.765898
2   0.479221
3  -0.139547
dtype: float64

The first two rows of the data series:
0   0.720876
1  -0.765898
dtype: float64

tail() returns the last n rows (observe the index values). The default number of elements to display is five, but you may pass a custom number.

import pandas as pd
import numpy as np

#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print("The original series is:")
print(s)

print("The last two rows of the data series:")
print(s.tail(2))

Its output is as follows −

The original series is:
0 -0.655091
1 -0.881407
2 -0.608592
3 -2.341413
dtype: float64

The last two rows of the data series:
2 -0.608592
3 -2.341413
dtype: float64
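head() and tail() also accept a negative n, which returns all rows except the last (or first) n — a small deterministic sketch:

```python
import pandas as pd

s = pd.Series(range(6))   # values 0..5
h = s.head(-2)            # everything except the last two rows
t = s.tail(-2)            # everything except the first two rows
print(h.tolist())         # [0, 1, 2, 3]
print(t.tolist())         # [2, 3, 4, 5]
```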

DataFrame Basic Functionality

Let us now understand what DataFrame Basic Functionality is. The following table lists the important attributes and methods that help in DataFrame Basic Functionality.

Sr.No.

Attribute or Method & Description

1

T Transposes rows and columns.

2

axes Returns a list with the row axis labels and column axis labels as the only members.

3

dtypes Returns the dtypes in this object.

4

empty True if NDFrame is entirely empty (no items), i.e., if any of the axes are of length 0.

5

ndim Number of axes / array dimensions.

6

shape Returns a tuple representing the dimensionality of the DataFrame.

7

size Number of elements in the NDFrame.

8

values Numpy representation of NDFrame.

9

head() Returns the first n rows.

10

tail() Returns last n rows.

Let us now create a DataFrame and see how all the above mentioned attributes operate.

Example

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print("Our data series is:")
print(df)

Its output is as follows −

Our data series is:
    Age   Name    Rating
0   25    Tom     4.23
1   26    James   3.24
2   25    Ricky   3.98
3   23    Vin     2.56
4   30    Steve   3.20
5   29    Smith   4.60
6   23    Jack    3.80

T (Transpose)

Returns the transpose of the DataFrame. The rows and columns will interchange.

import pandas as pd
import numpy as np

# Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

# Create a DataFrame
df = pd.DataFrame(d)
print("The transpose of the data series is:")
print(df.T)

Its output is as follows −

The transpose of the data series is:
         0     1       2      3      4      5       6
Age      25    26      25     23     30     29      23
Name     Tom   James   Ricky  Vin    Steve  Smith   Jack
Rating   4.23  3.24    3.98   2.56   3.2    4.6     3.8

axes

Returns the list of row axis labels and column axis labels.

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print("Row axis labels and column axis labels are:")
print(df.axes)

Its output is as follows −

Row axis labels and column axis labels are:

[RangeIndex(start=0, stop=7, step=1), Index(['Age', 'Name', 'Rating'],
dtype='object')]

dtypes

Returns the data type of each column.

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print("The data types of each column are:")
print(df.dtypes)

Its output is as follows −

The data types of each column are:
Age     int64
Name    object
Rating  float64
dtype: object

empty

Returns the Boolean value saying whether the Object is empty or not; True indicates that the object is empty.

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print("Is the object empty?")
print(df.empty)

Its output is as follows −

Is the object empty?
False

ndim

Returns the number of dimensions of the object. By definition, DataFrame is a 2D object.

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print("Our object is:")
print(df)
print("The dimension of the object is:")
print(df.ndim)

Its output is as follows −

Our object is:
      Age    Name     Rating
0     25     Tom      4.23
1     26     James    3.24
2     25     Ricky    3.98
3     23     Vin      2.56
4     30     Steve    3.20
5     29     Smith    4.60
6     23     Jack     3.80

The dimension of the object is:
2

shape

Returns a tuple representing the dimensionality of the DataFrame. Tuple (a,b), where a represents the number of rows and b represents the number of columns.

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print("Our object is:")
print(df)
print("The shape of the object is:")
print(df.shape)

Its output is as follows −

Our object is:
   Age   Name    Rating
0  25    Tom     4.23
1  26    James   3.24
2  25    Ricky   3.98
3  23    Vin     2.56
4  30    Steve   3.20
5  29    Smith   4.60
6  23    Jack    3.80

The shape of the object is:
(7, 3)
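shape, size and ndim are consistent with one another: for a 2D DataFrame, size is the product of the two entries in shape. A deterministic sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})
rows, cols = df.shape
print(df.shape)   # (3, 2)
print(df.size)    # 6
print(df.ndim)    # 2
```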

size

Returns the number of elements in the DataFrame.

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print("Our object is:")
print(df)
print("The total number of elements in our object is:")
print(df.size)

Its output is as follows −

Our object is:
    Age   Name    Rating
0   25    Tom     4.23
1   26    James   3.24
2   25    Ricky   3.98
3   23    Vin     2.56
4   30    Steve   3.20
5   29    Smith   4.60
6   23    Jack    3.80

The total number of elements in our object is:
21

values

Returns the actual data in the DataFrame as an NDarray.

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print("Our object is:")
print(df)
print("The actual data in our data frame is:")
print(df.values)

Its output is as follows −

Our object is:
    Age   Name    Rating
0   25    Tom     4.23
1   26    James   3.24
2   25    Ricky   3.98
3   23    Vin     2.56
4   30    Steve   3.20
5   29    Smith   4.60
6   23    Jack    3.80
The actual data in our data frame is:
[[25 'Tom' 4.23]
[26 'James' 3.24]
[25 'Ricky' 3.98]
[23 'Vin' 2.56]
[30 'Steve' 3.2]
[29 'Smith' 4.6]
[23 'Jack' 3.8]]
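Note that in the output above the numbers and strings sit in one array: because the columns have mixed types, values returns a NumPy array of dtype object. A small deterministic sketch:

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 26], 'Name': ['Tom', 'James']})
arr = df.values
print(arr.dtype)   # object -- the widest common type of int64 and object
print(arr.shape)   # (2, 2)
```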

Head & Tail

To view a small sample of a DataFrame object, use the head() and tail() methods. head() returns the first n rows (observe the index values). The default number of elements to display is five, but you may pass a custom number.

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print("Our data frame is:")
print(df)
print("The first two rows of the data frame is:")
print(df.head(2))

Its output is as follows −

Our data frame is:
    Age   Name    Rating
0   25    Tom     4.23
1   26    James   3.24
2   25    Ricky   3.98
3   23    Vin     2.56
4   30    Steve   3.20
5   29    Smith   4.60
6   23    Jack    3.80

The first two rows of the data frame is:
   Age   Name   Rating
0  25    Tom    4.23
1  26    James  3.24

tail() returns the last n rows (observe the index values). The default number of elements to display is five, but you may pass a custom number.

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print("Our data frame is:")
print(df)
print("The last two rows of the data frame is:")
print(df.tail(2))

Its output is as follows −

Our data frame is:
    Age   Name    Rating
0   25    Tom     4.23
1   26    James   3.24
2   25    Ricky   3.98
3   23    Vin     2.56
4   30    Steve   3.20
5   29    Smith   4.60
6   23    Jack    3.80

The last two rows of the data frame is:
    Age   Name    Rating
5   29    Smith    4.6
6   23    Jack     3.8

Python Pandas - Descriptive Statistics

A large number of methods collectively compute descriptive statistics and other related operations on a DataFrame. Most of these are aggregations like sum() and mean(), but some of them, like cumsum(), produce an object of the same size. Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, …}, but the axis can be specified by name or integer −

  1. DataFrame − “index” (axis=0, default), “columns” (axis=1)

Let us create a DataFrame and use this object throughout this chapter for all the operations.

Example

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
print(df)

Its output is as follows −

    Age  Name   Rating
0   25   Tom     4.23
1   26   James   3.24
2   25   Ricky   3.98
3   23   Vin     2.56
4   30   Steve   3.20
5   29   Smith   4.60
6   23   Jack    3.80
7   34   Lee     3.78
8   40   David   2.98
9   30   Gasper  4.80
10  51   Betina  4.10
11  46   Andres  3.65

sum()

Returns the sum of the values for the requested axis. By default, axis is index (axis=0).

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
print(df.sum())

Its output is as follows −

Age                                                    382
Name     TomJamesRickyVinSteveSmithJackLeeDavidGasperBe...
Rating                                               44.92
dtype: object

Each column is added individually (strings are concatenated).

axis=1

This syntax will give the output as shown below.

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
#numeric_only skips the string 'Name' column, which cannot be summed row-wise
print(df.sum(axis=1, numeric_only=True))

Its output is as follows −

0    29.23
1    29.24
2    28.98
3    25.56
4    33.20
5    33.60
6    26.80
7    37.78
8    42.98
9    34.80
10   55.10
11   49.65
dtype: float64
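The difference between the two axes is easiest to see on a small fixed frame (a deterministic sketch):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})
col_sums = df.sum()          # down each column (axis=0, the default)
row_sums = df.sum(axis=1)    # across each row
print(col_sums.tolist())     # [3, 30]
print(row_sums.tolist())     # [11, 22]
```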

mean()

Returns the average of the values.

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
#numeric_only skips the string 'Name' column
print(df.mean(numeric_only=True))

Its output is as follows −

Age       31.833333
Rating     3.743333
dtype: float64

std()

Returns the Bessel standard deviation of the numerical columns.

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
#numeric_only skips the string 'Name' column
print(df.std(numeric_only=True))

Its output is as follows −

Age       9.232682
Rating    0.661628
dtype: float64
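pandas std() applies Bessel's correction (ddof=1) by default, which is why its result differs from NumPy's default population standard deviation — a deterministic sketch:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, 2.0, 3.0, 4.0])
print(s.std())            # sample std, ddof=1 -> ~1.2910
print(np.std(s.values))   # population std, ddof=0 -> ~1.1180
print(s.std(ddof=0))      # matches NumPy's default
```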

Functions & Description

Let us now understand the functions under Descriptive Statistics in Python Pandas. The following table lists the important functions −

Sr.No.

Function

Description

1

count()

Number of non-null observations

2

sum()

Sum of values

3

mean()

Mean of Values

4

median()

Median of Values

5

mode()

Mode of values

6

std()

Standard Deviation of the Values

7

min()

Minimum Value

8

max()

Maximum Value

9

abs()

Absolute Value

10

prod()

Product of Values

11

cumsum()

Cumulative Sum

12

cumprod()

Cumulative Product

Note − Since DataFrame is a heterogeneous data structure, generic operations don’t work with all functions.

  1. Functions like sum(), cumsum() work with both numeric and character (or) string data elements without any error. Though in practice, character aggregations are generally never used, these functions do not throw any exception.

  2. Functions like abs(), cumprod() throw an exception when the DataFrame contains character or string data because such operations cannot be performed.
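For instance, on a purely numeric column all three kinds of functions behave as expected (a deterministic sketch):

```python
import pandas as pd

s = pd.Series([1, -2, 3])
print(s.cumsum().tolist())    # [1, -1, 2]
print(s.cumprod().tolist())   # [1, -2, -6]
print(s.abs().tolist())       # [1, 2, 3]
```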

Summarizing Data

The describe() function computes a summary of statistics pertaining to the DataFrame columns.

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
print(df.describe())

Its output is as follows −

               Age         Rating
count    12.000000      12.000000
mean     31.833333       3.743333
std       9.232682       0.661628
min      23.000000       2.560000
25%      25.000000       3.230000
50%      29.500000       3.790000
75%      35.500000       4.132500
max      51.000000       4.800000

This function gives the mean, std and IQR values. It excludes the character columns and gives a summary about the numeric columns. 'include' is the argument used to pass information regarding which columns need to be considered for summarizing. It takes a list of values; by default, 'number'.

  1. object − Summarizes String columns

  2. number − Summarizes Numeric columns

  3. all − Summarizes all columns together (Should not pass it as a list value)

Now, use the following statement in the program and check the output −

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
print(df.describe(include=['object']))

Its output is as follows −

          Name
count       12
unique      12
top      Ricky
freq         1

Now, use the following statement and check the output −

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
print(df.describe(include='all'))

Its output is as follows −

          Age          Name       Rating
count   12.000000        12    12.000000
unique        NaN        12          NaN
top           NaN     Ricky          NaN
freq          NaN         1          NaN
mean    31.833333       NaN     3.743333
std      9.232682       NaN     0.661628
min     23.000000       NaN     2.560000
25%     25.000000       NaN     3.230000
50%     29.500000       NaN     3.790000
75%     35.500000       NaN     4.132500
max     51.000000       NaN     4.800000

Python Pandas - Function Application

To apply your own or another library’s functions to Pandas objects, you should be aware of three important methods, discussed below. The appropriate method to use depends on whether your function expects to operate on an entire DataFrame, row- or column-wise, or element-wise.

  1. Table wise Function Application: pipe()

  2. Row or Column Wise Function Application: apply()

  3. Element wise Function Application: applymap()

Table-wise Function Application

Custom operations can be performed by passing the function and the appropriate number of parameters as pipe arguments. Thus, the operation is performed on the whole DataFrame.

For example, add a value 2 to all the elements in the DataFrame. Then,

adder function

The adder function takes two numeric values as parameters and returns their sum.

def adder(ele1,ele2):
   return ele1+ele2

We will now use the custom function to perform an operation on the DataFrame.

df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print(df.pipe(adder,2))

Let’s see the full program −

import pandas as pd
import numpy as np

def adder(ele1,ele2):
   return ele1+ele2

df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print(df.pipe(adder,2))

Its output is as follows −

        col1       col2       col3
0   2.176704   2.219691   1.509360
1   2.222378   2.422167   3.953921
2   2.241096   1.135424   2.696432
3   2.355763   0.376672   1.182570
4   2.308743   2.714767   2.130288
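pipe() is most useful when several such operations are chained; each call passes the accumulated result as the first argument to the next function. A deterministic sketch (scale is a hypothetical helper written here for illustration, not part of pandas):

```python
import pandas as pd

def adder(df, n):
    return df + n

def scale(df, factor):   # hypothetical helper for illustration
    return df * factor

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
result = df.pipe(adder, 2).pipe(scale, 10)
print(result['col1'].tolist())  # [30, 40]
print(result['col2'].tolist())  # [50, 60]
```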

Row or Column Wise Function Application

Arbitrary functions can be applied along the axes of a DataFrame or Panel using the apply() method, which, like the descriptive statistics methods, takes an optional axis argument. By default, the operation performs column wise, taking each column as an array-like.

Example 1

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print(df.apply(np.mean))

Its output is as follows −

col1   -0.288022
col2    1.044839
col3   -0.187009
dtype: float64

By passing the axis parameter, operations can be performed row-wise.

Example 2

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print(df.apply(np.mean,axis=1))

Its output is as follows −

0    0.543255
1   -0.136655
2    0.061327
3    0.091026
4    0.182790
dtype: float64

Example 3

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print(df.apply(lambda x: x.max() - x.min()))

Its output is as follows −

col1    2.720522
col2    2.096846
col3    3.033257
dtype: float64

Element Wise Function Application

Not all functions can be vectorized (neither the NumPy arrays which return another array nor any single value). The applymap() method on DataFrame and, analogously, map() on Series accept any Python function that takes a single value and returns a single value.

Example 1

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])

# My custom function applied element-wise to one column
print(df['col1'].map(lambda x:x*100))

Its output is as follows −

0     32.118997
1    -51.053541
2     96.929954
3    -22.246732
4     15.333406
Name: col1, dtype: float64

Example 2

import pandas as pd
import numpy as np

# My custom function applied element-wise to the whole DataFrame
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print(df.applymap(lambda x:x*100))

Its output is as follows −

         col1        col2        col3
0   17.670426   21.969054  -49.064031
1   22.237846   42.216720   95.392108
2   24.109576  -86.457646   69.643171
3   35.576312  -62.332785   18.257029
4   30.874333   71.476717   13.028752
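A deterministic sketch of element-wise application with Series.map() (note that in recent pandas releases, 2.1 and later, DataFrame.applymap() has been renamed DataFrame.map()):

```python
import pandas as pd

s = pd.Series([1, 2, 3])
scaled = s.map(lambda x: x * 100)   # element-wise, one value in, one value out
print(scaled.tolist())              # [100, 200, 300]
```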

Python Pandas - Reindexing

Reindexing changes the row labels and column labels of a DataFrame. To reindex means to conform the data to match a given set of labels along a particular axis.

Multiple operations can be accomplished through reindexing, such as −

  1. Reorder the existing data to match a new set of labels.

  2. Insert missing value (NA) markers in label locations where no data for the label existed.

Example

import pandas as pd
import numpy as np

N=20

df = pd.DataFrame({
   'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
   'x': np.linspace(0,stop=N-1,num=N),
   'y': np.random.rand(N),
   'C': np.random.choice(['Low','Medium','High'],N).tolist(),
   'D': np.random.normal(100, 10, size=(N)).tolist()
})

#reindex the DataFrame
df_reindexed = df.reindex(index=[0,2,5], columns=['A', 'C', 'B'])

print(df_reindexed)

Its output is as follows −

            A    C     B
0  2016-01-01  Low   NaN
2  2016-01-03  High  NaN
5  2016-01-06  Low   NaN
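Instead of inserting NaN, reindex() can fill missing labels with a constant via the fill_value argument — a deterministic sketch:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
r = s.reindex(['a', 'c', 'e'], fill_value=0)
print(r.tolist())   # [1, 3, 0] -- label 'e' did not exist, so it gets 0
```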

Reindex to Align with Other Objects

You may wish to take an object and reindex its axes to be labeled the same as another object. Consider the following example to understand the same.

Example

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(10,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(7,3),columns=['col1','col2','col3'])

df1 = df1.reindex_like(df2)
print(df1)

Its output is as follows −

          col1         col2         col3
0    -2.467652    -1.211687    -0.391761
1    -0.287396     0.522350     0.562512
2    -0.255409    -0.483250     1.866258
3    -1.150467    -0.646493    -0.222462
4     0.152768    -2.056643     1.877233
5    -1.155997     1.528719    -1.343719
6    -1.015606    -1.245936    -0.295275

Note − Here, the df1 DataFrame is altered and reindexed like df2. The column names should match; otherwise, NaN will be added for the entire column label.

Filling while ReIndexing

reindex() takes an optional parameter method which is a filling method with values as follows −

  1. pad/ffill − Fill values forward

  2. bfill/backfill − Fill values backward

  3. nearest − Fill from the nearest index values

Example

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])

# Padding NAN's
print(df2.reindex_like(df1))

# Now Fill the NAN's with preceding Values
print("Data Frame with Forward Fill:")
print(df2.reindex_like(df1,method='ffill'))

Its output is as follows −

         col1        col2       col3
0    1.311620   -0.707176   0.599863
1   -0.423455   -0.700265   1.133371
2         NaN         NaN        NaN
3         NaN         NaN        NaN
4         NaN         NaN        NaN
5         NaN         NaN        NaN

Data Frame with Forward Fill:
         col1        col2        col3
0    1.311620   -0.707176    0.599863
1   -0.423455   -0.700265    1.133371
2   -0.423455   -0.700265    1.133371
3   -0.423455   -0.700265    1.133371
4   -0.423455   -0.700265    1.133371
5   -0.423455   -0.700265    1.133371

Note − The last four rows are padded.

Limits on Filling while Reindexing

The limit argument provides additional control over filling while reindexing. Limit specifies the maximum count of consecutive matches. Let us consider the following example to understand the same −

Example

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])

# Padding NAN's
print(df2.reindex_like(df1))

# Now Fill the NAN's with preceding Values
print("Data Frame with Forward Fill limiting to 1:")
print(df2.reindex_like(df1,method='ffill',limit=1))

Its output is as follows −

         col1        col2        col3
0    0.247784    2.128727    0.702576
1   -0.055713   -0.021732   -0.174577
2         NaN         NaN         NaN
3         NaN         NaN         NaN
4         NaN         NaN         NaN
5         NaN         NaN         NaN

Data Frame with Forward Fill limiting to 1:
         col1        col2        col3
0    0.247784    2.128727    0.702576
1   -0.055713   -0.021732   -0.174577
2   -0.055713   -0.021732   -0.174577
3         NaN         NaN         NaN
4         NaN         NaN         NaN
5         NaN         NaN         NaN

Note − Observe, only row 2 is filled from the preceding row 1, because limit=1 allows a single consecutive fill. The remaining rows are left as they are.

Renaming

The rename() method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary function.

Let us consider the following example to understand this −

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
print(df1)

print("After renaming the rows and columns:")
print(df1.rename(columns={'col1' : 'c1', 'col2' : 'c2'},
index = {0 : 'apple', 1 : 'banana', 2 : 'durian'}))

Its output is as follows −

         col1        col2        col3
0    0.486791    0.105759    1.540122
1   -0.990237    1.007885   -0.217896
2   -0.483855   -1.645027   -1.194113
3   -0.122316    0.566277   -0.366028
4   -0.231524   -0.721172   -0.112007
5    0.438810    0.000225    0.435479

After renaming the rows and columns:
                c1          c2        col3
apple     0.486791    0.105759    1.540122
banana   -0.990237    1.007885   -0.217896
durian   -0.483855   -1.645027   -1.194113
3        -0.122316    0.566277   -0.366028
4        -0.231524   -0.721172   -0.112007
5         0.438810    0.000225    0.435479

The rename() method provides an inplace named parameter, which by default is False and copies the underlying data. Pass inplace=True to rename the data in place.
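As the text notes, the mapper can also be an arbitrary function applied to every label — a deterministic sketch using str.upper:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1], 'col2': [2]})
renamed = df.rename(columns=str.upper)   # the function is applied to each label
print(renamed.columns.tolist())          # ['COL1', 'COL2']
```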

Python Pandas - Iteration

The behavior of basic iteration over Pandas objects depends on the type. When iterating over a Series, it is regarded as array-like, and basic iteration produces the values. Other data structures, like DataFrame and Panel, follow the dict-like convention of iterating over the keys of the objects.

In short, basic iteration (for i in object) produces −

  1. Series − values

  2. DataFrame − column labels

  3. Panel − item labels
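The first two cases can be verified directly (Panel was removed in later pandas releases, so it is omitted here):

```python
import pandas as pd

s = pd.Series([10, 20])
df = pd.DataFrame({'a': [1], 'b': [2]})

series_items = [v for v in s]   # iterating a Series yields its values
frame_items = [c for c in df]   # iterating a DataFrame yields column labels
print(series_items)             # [10, 20]
print(frame_items)              # ['a', 'b']
```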

Iterating a DataFrame

Iterating a DataFrame gives column names. Let us consider the following example to understand the same.

import pandas as pd
import numpy as np

N=20
df = pd.DataFrame({
   'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
   'x': np.linspace(0,stop=N-1,num=N),
   'y': np.random.rand(N),
   'C': np.random.choice(['Low','Medium','High'],N).tolist(),
   'D': np.random.normal(100, 10, size=(N)).tolist()
   })

for col in df:
   print(col)

Its output is as follows −

A
C
D
x
y

To iterate over the rows of the DataFrame, we can use the following functions −

  1. iteritems() − to iterate over the (key,value) pairs

  2. iterrows() − iterate over the rows as (index,series) pairs

  3. itertuples() − iterate over the rows as namedtuples

iteritems()

Iterates over each column as a (key, value) pair, with the column label as the key and the column values as a Series object.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(4,3),columns=['col1','col2','col3'])
for key,value in df.iteritems():
   print(key,value)

它的 output 如下所示 −

Its output is as follows −

col1 0    0.802390
1    0.324060
2    0.256811
3    0.839186
Name: col1, dtype: float64

col2 0    1.624313
1   -1.033582
2    1.796663
3    1.856277
Name: col2, dtype: float64

col3 0   -0.022142
1   -0.230820
2    1.160691
3   -0.830279
Name: col3, dtype: float64

观察,每一列都以 Series 中的键值对形式单独进行迭代。

Observe, each column is iterated separately as a key-value pair in a Series.

iterrows()

iterrows() 返回迭代器,生成每一行索引值以及包含每一行数据的 series。

iterrows() returns the iterator yielding each index value along with a series containing the data in each row.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])
for row_index,row in df.iterrows():
   print row_index,row

它的 output 如下所示 −

Its output is as follows −

0  col1    1.529759
   col2    0.762811
   col3   -0.634691
Name: 0, dtype: float64

1  col1   -0.944087
   col2    1.420919
   col3   -0.507895
Name: 1, dtype: float64

2  col1   -0.077287
   col2   -0.858556
   col3   -0.663385
Name: 2, dtype: float64
3  col1    -1.638578
   col2     0.059866
   col3     0.493482
Name: 3, dtype: float64

Note − 因为 iterrows() 迭代行,所以它不会保留行中的数据类型。0、1、2 是行索引,col1、col2、col3 是列索引。

Note − Because iterrows() iterates over the rows, it doesn’t preserve the data type across the row. 0,1,2 are the row indices and col1,col2,col3 are column indices.
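The loss of data types can be demonstrated with a small, hypothetical frame that mixes an integer column with a float column −

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [0.5, 1.5]})   # int64 column and float64 column

_, first_row = next(df.iterrows())   # each row comes back as a single Series
print(first_row.dtype)               # float64: the int value was upcast to fit one dtype
print(df['a'].dtype)                 # int64: the original column is unchanged
```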

itertuples()

itertuples() 方法将返回一个迭代器,该迭代器会为 DataFrame 中的每一行生成一个命名元组。元组的第一个元素将是行的相应索引值,而其余的值是行值。

itertuples() method will return an iterator yielding a named tuple for each row in the DataFrame. The first element of the tuple will be the row’s corresponding index value, while the remaining values are the row values.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])
for row in df.itertuples():
    print row

它的 output 如下所示 −

Its output is as follows −

Pandas(Index=0, col1=1.5297586201375899, col2=0.76281127433814944, col3=-0.6346908238310438)

Pandas(Index=1, col1=-0.94408735763808649, col2=1.4209186418359423, col3=-0.50789517967096232)

Pandas(Index=2, col1=-0.07728664756791935, col2=-0.85855574139699076, col3=-0.6633852507207626)

Pandas(Index=3, col1=0.65734942534106289, col2=-0.95057710432604969, col3=0.80344487462316527)

Note − 在进行迭代时,不要尝试修改任何对象。迭代用于读取,迭代器返回的是原始对象的副本,因此更改不会反映在原始对象上。

Note − Do not try to modify any object while iterating. Iterating is meant for reading, and the iterator returns a copy of the original object, thus the changes will not reflect on the original object.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])

for index, row in df.iterrows():
   row['col1'] = 10
print df

它的 output 如下所示 −

Its output is as follows −

        col1       col2       col3
0  -1.739815   0.735595  -0.295589
1   0.635485   0.106803   1.527922
2  -0.939064   0.547095   0.038585
3  -1.016509  -0.116580  -0.523158

观察,没有反映出的更改。

Observe, no changes reflected.
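If the intention is to modify the DataFrame while walking its rows, one common approach is to write back through loc instead of mutating the iteration copy. A sketch, assuming a single illustrative column −

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3]})
for index in df.index:
    # df.loc writes into the DataFrame itself, unlike assigning to an iterrows() copy
    df.loc[index, 'col1'] = df.loc[index, 'col1'] * 10
print(df['col1'].tolist())   # [10, 20, 30]
```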

Python Pandas - Sorting

Pandas 提供两种排序方式。它们是-

There are two kinds of sorting available in Pandas. They are −

  1. By label

  2. By Actual Value

我们考虑一个有输出的示例。

Let us consider an example with an output.

import pandas as pd
import numpy as np

unsorted_df=pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns=['col2','col1'])
print unsorted_df

它的 output 如下所示 −

Its output is as follows −

        col2       col1
1  -2.063177   0.537527
4   0.142932  -0.684884
6   0.012667  -0.389340
2  -0.548797   1.848743
3  -1.044160   0.837381
5   0.385605   1.300185
9   1.031425  -1.002967
8  -0.407374  -0.435142
0   2.237453  -1.067139
7  -1.445831  -1.701035

unsorted_df 中, labelsvalues 均未排序。让我们看看如何对它们进行排序。

In unsorted_df, the labels and the values are unsorted. Let us see how these can be sorted.

By Label

使用 sort_index() 方法,通过传递axis参数和排序顺序,可以对DataFrame进行排序。默认情况下,按行标签升序排序。

Using the sort_index() method, by passing the axis arguments and the order of sorting, DataFrame can be sorted. By default, sorting is done on row labels in ascending order.

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns=['col2','col1'])

sorted_df=unsorted_df.sort_index()
print sorted_df

它的 output 如下所示 −

Its output is as follows −

        col2       col1
0   0.208464   0.627037
1   0.641004   0.331352
2  -0.038067  -0.464730
3  -0.638456  -0.021466
4   0.014646  -0.737438
5  -0.290761  -1.669827
6  -0.797303  -0.018737
7   0.525753   1.628921
8  -0.567031   0.775951
9   0.060724  -0.322425

Order of Sorting

通过将布尔值传递给ascending参数,可以控制排序顺序。我们考虑以下示例来理解它。

By passing the Boolean value to ascending parameter, the order of the sorting can be controlled. Let us consider the following example to understand the same.

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns=['col2','col1'])

sorted_df = unsorted_df.sort_index(ascending=False)
print sorted_df

它的 output 如下所示 −

Its output is as follows −

         col2        col1
9    0.825697    0.374463
8   -1.699509    0.510373
7   -0.581378    0.622958
6   -0.202951    0.954300
5   -1.289321   -1.551250
4    1.302561    0.851385
3   -0.157915   -0.388659
2   -1.222295    0.166609
1    0.584890   -0.291048
0    0.668444   -0.061294

Sort the Columns

通过将 axis 参数传递值 0 或 1,可以在列标签上进行排序。默认情况下 axis=0,即按行标签排序。我们考虑以下示例来理解它。

By passing the axis argument with a value 0 or 1, the sorting can be done on the column labels. By default, axis=0, that is, sorting is done by row labels. Let us consider the following example to understand the same.

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns=['col2','col1'])

sorted_df=unsorted_df.sort_index(axis=1)

print sorted_df

它的 output 如下所示 −

Its output is as follows −

         col1        col2
1   -0.291048    0.584890
4    0.851385    1.302561
6    0.954300   -0.202951
2    0.166609   -1.222295
3   -0.388659   -0.157915
5   -1.551250   -1.289321
9    0.374463    0.825697
8    0.510373   -1.699509
0   -0.061294    0.668444
7    0.622958   -0.581378

By Value

与索引排序一样, sort_values() 是按值排序的方法。它接受一个“by”参数,它将使用要对其进行值排序的DataFrame的列名。

Like index sorting, sort_values() is the method for sorting by values. It accepts a 'by' argument which will use the column name of the DataFrame with which the values are to be sorted.

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by='col1')

print sorted_df

它的 output 如下所示 −

Its output is as follows −

   col1  col2
1    1    3
2    1    2
3    1    4
0    2    1

注意,col1值已排序,并且相应的col2值和行索引将与col1一起改变。因此,它们看起来是未排序的。

Observe, col1 values are sorted and the respective col2 value and row index will alter along with col1. Thus, they look unsorted.

'by' 参数采用列值列表。

'by' argument takes a list of column values.

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by=['col1','col2'])

print sorted_df

它的 output 如下所示 −

Its output is as follows −

  col1 col2
2   1   2
1   1   3
3   1   4
0   2   1

Sorting Algorithm

sort_values() 允许通过 kind 参数从 mergesort、heapsort 和 quicksort 中选择排序算法。Mergesort 是其中唯一稳定的算法。

sort_values() allows choosing the sorting algorithm from mergesort, heapsort and quicksort via the kind parameter. Mergesort is the only stable algorithm.

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by='col1' ,kind='mergesort')

print sorted_df

它的 output 如下所示 −

Its output is as follows −

  col1 col2
1    1    3
2    1    2
3    1    4
0    2    1
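Like sort_index(), sort_values() also accepts an ascending argument; a short sketch with the same illustrative frame −

```python
import pandas as pd

unsorted_df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})
sorted_df = unsorted_df.sort_values(by='col2', ascending=False)
print(sorted_df['col2'].tolist())   # [4, 3, 2, 1]
```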

Python Pandas - Working with Text Data

在本节中,我们将讨论使用基本Series/Index进行字符串操作。在后续章节中,我们将学习如何在DataFrame上应用这些字符串函数。

In this chapter, we will discuss the string operations with our basic Series/Index. In the subsequent chapters, we will learn how to apply these string functions on the DataFrame.

Pandas 提供了一组字符串函数,使其易于对字符串数据进行操作。最重要的是,这些函数忽略(或排除)缺失/NaN 值。

Pandas provides a set of string functions which make it easy to operate on string data. Most importantly, these functions ignore (or exclude) missing/NaN values.

几乎所有这些方法都对应于 Python 内置字符串函数(参考: https://docs.python.org/3/library/stdtypes.html#string-methods )。这些函数通过 Series 的 str 属性访问,例如 s.str.lower()。

Almost all of these methods correspond to the built-in Python string functions (refer: https://docs.python.org/3/library/stdtypes.html#string-methods). They are accessed through the str attribute of the Series, for example s.str.lower().

现在让我们看看每个操作如何执行。

Let us now see how each operation performs.

Sr.No

Function & Description

1

lower() Converts strings in the Series/Index to lower case.

2

upper() Converts strings in the Series/Index to upper case.

3

len() Computes the length of each string.

4

strip() Helps strip whitespace(including newline) from each string in the Series/index from both the sides.

5

split(' ') Splits each string with the given pattern.

6

cat(sep=' ') Concatenates the series/index elements with given separator.

7

get_dummies() Returns the DataFrame with One-Hot Encoded values.

8

contains(pattern) Returns a Boolean value True for each element if the pattern is contained in the element, else False.

9

replace(a,b) Replaces the value a with the value b.

10

repeat(value) Repeats each element the specified number of times.

11

count(pattern) Returns count of appearance of pattern in each element.

12

startswith(pattern) Returns true if the element in the Series/Index starts with the pattern.

13

endswith(pattern) Returns true if the element in the Series/Index ends with the pattern.

14

find(pattern) Returns the first position of the first occurrence of the pattern.

15

findall(pattern) Returns a list of all occurrence of the pattern.

16

swapcase Swaps the case lower/upper.

17

islower() Checks whether all characters in each string in the Series/Index are in lower case. Returns Boolean.

18

isupper() Checks whether all characters in each string in the Series/Index are in upper case. Returns Boolean.

19

isnumeric() Checks whether all characters in each string in the Series/Index are numeric. Returns Boolean.

现在,我们创建一个Series,看看上述所有函数如何工作。

Let us now create a Series and see how all the above functions work.

import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','Steve Smith'])

print s

它的 output 如下所示 −

Its output is as follows −

0            Tom
1   William Rick
2           John
3        Alber@t
4            NaN
5           1234
6    Steve Smith
dtype: object

lower()

import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','Steve Smith'])

print s.str.lower()

它的 output 如下所示 −

Its output is as follows −

0            tom
1   william rick
2           john
3        alber@t
4            NaN
5           1234
6    steve smith
dtype: object

upper()

import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','Steve Smith'])

print s.str.upper()

它的 output 如下所示 −

Its output is as follows −

0            TOM
1   WILLIAM RICK
2           JOHN
3        ALBER@T
4            NaN
5           1234
6    STEVE SMITH
dtype: object

len()

import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','Steve Smith'])
print s.str.len()

它的 output 如下所示 −

Its output is as follows −

0    3.0
1   12.0
2    4.0
3    7.0
4    NaN
5    4.0
6   11.0
dtype: float64

strip()

import pandas as pd
import numpy as np
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print s
print ("After Stripping:")
print s.str.strip()

它的 output 如下所示 −

Its output is as follows −

0            Tom
1   William Rick
2           John
3        Alber@t
dtype: object

After Stripping:
0            Tom
1   William Rick
2           John
3        Alber@t
dtype: object

split(pattern)

import pandas as pd
import numpy as np
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print s
print ("Split Pattern:")
print s.str.split(' ')

它的 output 如下所示 −

Its output is as follows −

0            Tom
1   William Rick
2           John
3        Alber@t
dtype: object

Split Pattern:
0              [Tom, ]
1    [, William, Rick]
2               [John]
3            [Alber@t]
dtype: object

cat(sep=pattern)

import pandas as pd
import numpy as np

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print s.str.cat(sep='_')

它的 output 如下所示 −

Its output is as follows −

Tom _ William Rick_John_Alber@t

get_dummies()

import pandas as pd
import numpy as np

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print s.str.get_dummies()

它的 output 如下所示 −

Its output is as follows −

   William Rick   Alber@t   John   Tom
0             0         0      0     1
1             1         0      0     0
2             0         0      1     0
3             0         1      0     0

contains(pattern)

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print s.str.contains(' ')

它的 output 如下所示 −

Its output is as follows −

0   True
1   True
2   False
3   False
dtype: bool

replace(a,b)

import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print s
print ("After replacing @ with $:")
print s.str.replace('@','$')

它的 output 如下所示 −

Its output is as follows −

0   Tom
1   William Rick
2   John
3   Alber@t
dtype: object

After replacing @ with $:
0   Tom
1   William Rick
2   John
3   Alber$t
dtype: object

repeat(value)

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print s.str.repeat(2)

它的 output 如下所示 −

Its output is as follows −

0                      Tom Tom 
1    William Rick William Rick
2                     JohnJohn
3               Alber@tAlber@t
dtype: object

count(pattern)

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print ("The number of 'm's in each string:")
print s.str.count('m')

它的 output 如下所示 −

Its output is as follows −

The number of 'm's in each string:
0    1
1    1
2    0
3    0
dtype: int64

startswith(pattern)

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print ("Strings that start with 'T':")
print s.str.startswith('T')

它的 output 如下所示 −

Its output is as follows −

Strings that start with 'T':
0  True
1  False
2  False
3  False
dtype: bool

endswith(pattern)

import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print ("Strings that end with 't':")
print s.str.endswith('t')

它的 output 如下所示 −

Its output is as follows −

Strings that end with 't':
0  False
1  False
2  False
3  True
dtype: bool

find(pattern)

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print s.str.find('e')

它的 output 如下所示 −

Its output is as follows −

0  -1
1  -1
2  -1
3   3
dtype: int64

“-1”表示元素中没有此类模式。

"-1" indicates that there is no such pattern available in the element.

findall(pattern)

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print s.str.findall('e')

它的 output 如下所示 −

Its output is as follows −

0 []
1 []
2 []
3 [e]
dtype: object

空列表([ ])表示元素中没有此类模式。

Null list([ ]) indicates that there is no such pattern available in the element.

swapcase()

import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
print s.str.swapcase()

它的 output 如下所示 −

Its output is as follows −

0  tOM
1  wILLIAM rICK
2  jOHN
3  aLBER@T
dtype: object

islower()

import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
print s.str.islower()

它的 output 如下所示 −

Its output is as follows −

0  False
1  False
2  False
3  False
dtype: bool

isupper()

import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])

print s.str.isupper()

它的 output 如下所示 −

Its output is as follows −

0  False
1  False
2  False
3  False
dtype: bool

isnumeric()

import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])

print s.str.isnumeric()

它的 output 如下所示 −

Its output is as follows −

0  False
1  False
2  False
3  False
dtype: bool

Python Pandas - Options and Customization

Pandas 提供了 API 来定制其某些行为,其中与显示相关的选项最为常用。

Pandas provides an API to customize some aspects of its behavior; display-related options are the most commonly used.

该API由五个相关函数组成。它们是−

The API is composed of five relevant functions. They are −

  1. get_option()

  2. set_option()

  3. reset_option()

  4. describe_option()

  5. option_context()

现在,我们了解一下函数如何操作。

Let us now understand how the functions operate.

get_option(param)

get_option 接受一个参数,并返回该参数的值,如下面的输出所示 −

get_option takes a single parameter and returns the value as shown in the output below −

display.max_rows

显示默认的最大行数。解释器读取此值,并以其作为可显示行数的上限。

Displays the default maximum number of rows. The interpreter reads this value and uses it as the upper limit of rows to display.

import pandas as pd
print pd.get_option("display.max_rows")

它的 output 如下所示 −

Its output is as follows −

60

display.max_columns

显示默认的最大列数。解释器读取此值,并以其作为可显示列数的上限。

Displays the default maximum number of columns. The interpreter reads this value and uses it as the upper limit of columns to display.

import pandas as pd
print pd.get_option("display.max_columns")

它的 output 如下所示 −

Its output is as follows −

20

此处,60 和 20 是默认配置参数值。

Here, 60 and 20 are the default configuration parameter values.

set_option(param,value)

set_option 接受两个参数并设置参数的值,如下所示:

set_option takes two arguments and sets the value to the parameter as shown below −

display.max_rows

使用 set_option(),我们可以更改默认显示的行数。

Using set_option(), we can change the default number of rows to be displayed.

import pandas as pd

pd.set_option("display.max_rows",80)

print pd.get_option("display.max_rows")

它的 output 如下所示 −

Its output is as follows −

80

display.max_columns

使用 set_option(),我们可以更改默认显示的列数。

Using set_option(), we can change the default number of columns to be displayed.

import pandas as pd

pd.set_option("display.max_columns",30)

print pd.get_option("display.max_columns")

它的 output 如下所示 −

Its output is as follows −

30

reset_option(param)

reset_option 接受一个参数并把该值设置回默认值。

reset_option takes an argument and sets the value back to the default value.

display.max_rows

使用 reset_option(),我们可以将值更改回默认显示的行数。

Using reset_option(), we can change the value back to the default number of rows to be displayed.

import pandas as pd

pd.reset_option("display.max_rows")
print pd.get_option("display.max_rows")

它的 output 如下所示 −

Its output is as follows −

60

describe_option(param)

describe_option 打印参数的描述。

describe_option prints the description of the argument.

display.max_rows

使用 describe_option(),我们可以打印参数的描述及其当前值和默认值。

Using describe_option(), we can print the description of the parameter along with its current and default values.

import pandas as pd
pd.describe_option("display.max_rows")

它的 output 如下所示 −

Its output is as follows −

display.max_rows : int
   If max_rows is exceeded, switch to truncate view. Depending on
   'large_repr', objects are either centrally truncated or printed as
   a summary view. 'None' value means unlimited.

   In case python/IPython is running in a terminal and `large_repr`
   equals 'truncate' this can be set to 0 and pandas will auto-detect
   the height of the terminal and print a truncated object which fits
   the screen height. The IPython notebook, IPython qtconsole, or
   IDLE do not run in a terminal and hence it is not possible to do
   correct auto-detection.
   [default: 60] [currently: 60]

option_context()

option_context 上下文管理器用于在 with 语句中临时设置选项。退出 with 代码块时,选项值会自动恢复。

The option_context context manager is used to set an option temporarily within a with statement. Option values are restored automatically when you exit the with block.

display.max_rows

使用 option_context(),我们可以暂时设置该值。

Using option_context(), we can set the value temporarily.

import pandas as pd
with pd.option_context("display.max_rows",10):
   print(pd.get_option("display.max_rows"))
print(pd.get_option("display.max_rows"))

它的 output 如下所示 −

Its output is as follows −

10
60

请注意第一个和第二个打印语句之间的差异。第一个语句打印的是由 option_context() 设置的值,该值仅在 with 上下文内临时有效。在 with 上下文之后,第二个打印语句打印默认配置值。

Note the difference between the first and the second print statements. The first statement prints the value set by option_context(), which is temporary within the with context itself. After the with context, the second print statement prints the default configured value.

Frequently used Parameters

Sr.No

Parameter & Description

1

display.max_rows Displays maximum number of rows to display

2

display.max_columns Displays maximum number of columns to display

3

display.expand_frame_repr Displays DataFrames to Stretch Pages

4

display.max_colwidth Displays maximum column width

5

display.precision Displays precision for decimal numbers
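The same get/set/reset pattern applies to the other parameters in the table; a short sketch using display.max_colwidth −

```python
import pandas as pd

default = pd.get_option("display.max_colwidth")   # remember the default
pd.set_option("display.max_colwidth", 10)
print(pd.get_option("display.max_colwidth"))      # 10
pd.reset_option("display.max_colwidth")
print(pd.get_option("display.max_colwidth") == default)   # True
```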

Python Pandas - Indexing and Selecting Data

在本章中,我们将讨论如何对数据进行切片和切块,以及通常如何获取 Pandas 对象的子集。

In this chapter, we will discuss how to slice and dice the data and generally get the subset of a pandas object.

Python 和 NumPy 索引运算符 “[ ]” 和属性运算符 “.” 可以在各种用例中快速简便地访问 Pandas 数据结构。但是,由于要访问的数据类型是事先不知道的,因此直接使用标准运算符会带来一些优化限制。对于生产代码,我们建议你利用本章中介绍的经过优化的 Pandas 数据访问方法。

The Python and NumPy indexing operators "[ ]" and attribute operator "." provide quick and easy access to Pandas data structures across a wide range of use cases. However, since the type of the data to be accessed isn’t known in advance, directly using standard operators has some optimization limits. For production code, we recommend that you take advantage of the optimized pandas data access methods explained in this chapter.

Pandas 现在支持三种多轴索引;以下表中提到了这三种类型:

Pandas now supports three types of Multi-axes indexing; the three types are mentioned in the following table −

Sr.No

Indexing & Description

1

.loc() Label based

2

.iloc() Integer based

3

.ix() Both Label and Integer based

.loc()

Pandas 提供了多种方法来实现纯粹的 label based indexing 。切片时,起始边界和结束边界都包含在内。整数是有效的标签,但它们指的是标签而不是位置。

Pandas provide various methods to have purely label based indexing. When slicing, both the start and the stop bounds are included. Integers are valid labels, but they refer to the label and not the position.

.loc() 有多种访问方法,如 −

.loc() has multiple access methods like −

  1. A single scalar label

  2. A list of labels

  3. A slice object

  4. A Boolean array

loc 使用两个由“,”分隔的单一/列表/范围运算符。第一个指示行,而第二个指示列。

loc takes two single/list/range operator separated by ','. The first one indicates the row and the second one indicates columns.

Example 1

#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

#select all rows for a specific column
print df.loc[:,'A']

它的 output 如下所示 −

Its output is as follows −

a   0.391548
b  -0.070649
c  -0.317212
d  -2.162406
e   2.202797
f   0.613709
g   1.050559
h   1.122680
Name: A, dtype: float64

Example 2

# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

# Select all rows for multiple columns, say list[]
print df.loc[:,['A','C']]

它的 output 如下所示 −

Its output is as follows −

            A           C
a    0.391548    0.745623
b   -0.070649    1.620406
c   -0.317212    1.448365
d   -2.162406   -0.873557
e    2.202797    0.528067
f    0.613709    0.286414
g    1.050559    0.216526
h    1.122680   -1.621420

Example 3

# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

# Select few rows for multiple columns, say list[]
print df.loc[['a','b','f','h'],['A','C']]

它的 output 如下所示 −

Its output is as follows −

           A          C
a   0.391548   0.745623
b  -0.070649   1.620406
f   0.613709   0.286414
h   1.122680  -1.621420

Example 4

# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

# Select range of rows for all columns
print df.loc['a':'h']

它的 output 如下所示 −

Its output is as follows −

            A           B          C          D
a    0.391548   -0.224297   0.745623   0.054301
b   -0.070649   -0.880130   1.620406   1.419743
c   -0.317212   -1.929698   1.448365   0.616899
d   -2.162406    0.614256  -0.873557   1.093958
e    2.202797   -2.315915   0.528067   0.612482
f    0.613709   -0.157674   0.286414  -0.500517
g    1.050559   -2.272099   0.216526   0.928449
h    1.122680    0.324368  -1.621420  -0.741470

Example 5

# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

# for getting values with a boolean array
print df.loc['a']>0

它的 output 如下所示 −

Its output is as follows −

A  False
B  True
C  False
D  False
Name: a, dtype: bool
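The Boolean Series itself can be passed back to .loc to keep only the matching rows; a sketch with made-up data −

```python
import pandas as pd

df = pd.DataFrame({'A': [1, -2, 3], 'B': [4, 5, -6]}, index=['a', 'b', 'c'])
positive_a = df.loc[df['A'] > 0]     # keep rows where column A is positive
print(positive_a.index.tolist())     # ['a', 'c']
```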

.iloc()

Pandas 提供多种方法来获取纯基于整数的索引。如同 python 和 numpy,这些是 0-based 索引。

Pandas provide various methods in order to get purely integer based indexing. Like python and numpy, these are 0-based indexing.

各种访问方法如下所示 −

The various access methods are as follows −

  1. An Integer

  2. A list of integers

  3. A range of values

Example 1

# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

# select the first four rows
print df.iloc[:4]

它的 output 如下所示 −

Its output is as follows −

           A          B           C           D
0   0.699435   0.256239   -1.270702   -0.645195
1  -0.685354   0.890791   -0.813012    0.631615
2  -0.783192  -0.531378    0.025070    0.230806
3   0.539042  -1.284314    0.826977   -0.026251

Example 2

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

# Integer slicing
print df.iloc[:4]
print df.iloc[1:5, 2:4]

它的 output 如下所示 −

Its output is as follows −

           A          B           C           D
0   0.699435   0.256239   -1.270702   -0.645195
1  -0.685354   0.890791   -0.813012    0.631615
2  -0.783192  -0.531378    0.025070    0.230806
3   0.539042  -1.284314    0.826977   -0.026251

           C          D
1  -0.813012   0.631615
2   0.025070   0.230806
3   0.826977  -0.026251
4   1.423332   1.130568

Example 3

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

# Slicing through list of values
print df.iloc[[1, 3, 5], [1, 3]]
print df.iloc[1:3, :]
print df.iloc[:,1:3]

它的 output 如下所示 −

Its output is as follows −

           B           D
1   0.890791    0.631615
3  -1.284314   -0.026251
5  -0.512888   -0.518930

           A           B           C           D
1  -0.685354    0.890791   -0.813012    0.631615
2  -0.783192   -0.531378    0.025070    0.230806

           B           C
0   0.256239   -1.270702
1   0.890791   -0.813012
2  -0.531378    0.025070
3  -1.284314    0.826977
4  -0.460729    1.423332
5  -0.512888    0.581409
6  -1.204853    0.098060
7  -0.947857    0.641358

.ix()

除了基于纯标签和基于整数之外,Pandas 提供了一种混合方法来选择和子集化对象,使用 .ix() 运算符。

Besides pure label based and integer based, Pandas provides a hybrid method for selections and subsetting the object using the .ix() operator.

Example 1

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

# Integer slicing
print df.ix[:4]

它的 output 如下所示 −

Its output is as follows −

           A          B           C           D
0   0.699435   0.256239   -1.270702   -0.645195
1  -0.685354   0.890791   -0.813012    0.631615
2  -0.783192  -0.531378    0.025070    0.230806
3   0.539042  -1.284314    0.826977   -0.026251

Example 2

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
# Index slicing
print df.ix[:,'A']

它的 output 如下所示 −

Its output is as follows −

0   0.699435
1  -0.685354
2  -0.783192
3   0.539042
4  -1.044209
5  -1.415411
6   1.062095
7   0.994204
Name: A, dtype: float64
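Note that .ix was deprecated and later removed in recent pandas releases; in current versions the same selections are usually written with .iloc (positions) and .loc (labels). A sketch of the equivalents −

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])

head = df.iloc[:4]        # position-based counterpart of df.ix[:4]
col_a = df.loc[:, 'A']    # label-based counterpart of df.ix[:, 'A']
print(head.shape)         # (4, 4)
print(col_a.shape)        # (8,)
```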

Use of Notations

使用多轴索引从 Pandas 对象获取值使用以下表示法 −

Getting values from the Pandas object with Multi-axes indexing uses the following notation −

Object

Indexers

Return Type

Series

s.loc[indexer]

Scalar value

DataFrame

df.loc[row_index,col_index]

Series object

Panel

p.loc[item_index,major_index, minor_index]

Scalar value

Note − .iloc() 和 .ix() 应用相同的索引选项和返回值约定。

Note − .iloc() & .ix() apply the same indexing options and return values.

现在让我们看看如何在 DataFrame 对象上执行每个操作。我们将使用基本的索引运算符“[ ]” −

Let us now see how each operation can be performed on the DataFrame object. We will use the basic indexing operator '[ ]' −

Example 1

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
print df['A']

它的 output 如下所示 −

Its output is as follows −

0  -0.478893
1   0.391931
2   0.336825
3  -1.055102
4  -0.165218
5  -0.328641
6   0.567721
7  -0.759399
Name: A, dtype: float64

Note − 我们可以将一个值列表传递给 [ ] 来选择这些列。

Note − We can pass a list of values to [ ] to select those columns.

Example 2

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

print df[['A','B']]

它的 output 如下所示 −

Its output is as follows −

           A           B
0  -0.478893   -0.606311
1   0.391931   -0.949025
2   0.336825    0.093717
3  -1.055102   -0.012944
4  -0.165218    1.550310
5  -0.328641   -0.226363
6   0.567721   -0.312585
7  -0.759399   -0.372696

Example 3

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
print df[2:2]

它的 output 如下所示 −

Its output is as follows −

Columns: [A, B, C, D]
Index: []

Attribute Access

可以使用属性运算符“.”来选择列。

Columns can be selected using the attribute operator '.'.

Example

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

print df.A

它的 output 如下所示 −

Its output is as follows −

0   -0.478893
1    0.391931
2    0.336825
3   -1.055102
4   -0.165218
5   -0.328641
6    0.567721
7   -0.759399
Name: A, dtype: float64

Python Pandas - Statistical Functions

统计方法有助于理解和分析数据的行为。我们现在将学习一些可以在 Pandas 对象上应用的统计函数。

Statistical methods help in the understanding and analyzing the behavior of data. We will now learn a few statistical functions, which we can apply on Pandas objects.

Percent_change

Series、DataFrames 和 Panel 都具有函数 pct_change() 。此函数将每个元素与其之前的元素进行比较并计算变化百分比。

Series, DataFrames and Panel, all have the function pct_change(). This function compares every element with its prior element and computes the change percentage.

import pandas as pd
import numpy as np
s = pd.Series([1,2,3,4,5,4])
print s.pct_change()

df = pd.DataFrame(np.random.randn(5, 2))
print df.pct_change()

它的 output 如下所示 −

Its output is as follows −

0        NaN
1   1.000000
2   0.500000
3   0.333333
4   0.250000
5  -0.200000
dtype: float64

            0          1
0         NaN        NaN
1  -15.151902   0.174730
2   -0.746374  -1.449088
3   -3.582229  -3.165836
4   15.601150  -1.860434

默认情况下, pct_change() 在列上运行;如果你想按行应用,则使用 axis=1 参数。

By default, the pct_change() operates on columns; if you want to apply the same row wise, then use the axis=1 argument.
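A small sketch of the row-wise case, with made-up numbers chosen so the percentages are easy to check −

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0], 'b': [2.0, 6.0]})
row_wise = df.pct_change(axis=1)   # compare each column with the previous column
print(row_wise['b'].tolist())      # [1.0, 2.0]: (2-1)/1 and (6-2)/2
```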

Covariance

协方差应用于序列数据。Series 对象有一个方法 cov 来计算序列对象之间的协方差。NA 将自动排除。

Covariance is applied on series data. The Series object has a method cov to compute covariance between series objects. NA will be excluded automatically.

Cov Series

import pandas as pd
import numpy as np
s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(10))
print s1.cov(s2)

Its output is as follows −

-0.12978405324

When applied on a DataFrame, the covariance method computes cov between all the columns.

import pandas as pd
import numpy as np
frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
print frame['a'].cov(frame['b'])
print frame.cov()

Its output is as follows −

-0.58312921152741437

           a           b           c           d            e
a   1.780628   -0.583129   -0.185575    0.003679    -0.136558
b  -0.583129    1.297011    0.136530   -0.523719     0.251064
c  -0.185575    0.136530    0.915227   -0.053881    -0.058926
d   0.003679   -0.523719   -0.053881    1.521426    -0.487694
e  -0.136558    0.251064   -0.058926   -0.487694     0.960761

Note − Observe the cov between columns a and b in the first statement; the same value is returned by cov on the DataFrame.

Correlation

Correlation shows the linear relationship between any two arrays of values (series). There are multiple methods to compute the correlation, such as pearson (default), spearman and kendall.

import pandas as pd
import numpy as np
frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])

print frame['a'].corr(frame['b'])
print frame.corr()

Its output is as follows −

-0.383712785514

           a          b          c          d           e
a   1.000000  -0.383713  -0.145368   0.002235   -0.104405
b  -0.383713   1.000000   0.125311  -0.372821    0.224908
c  -0.145368   0.125311   1.000000  -0.045661   -0.062840
d   0.002235  -0.372821  -0.045661   1.000000   -0.403380
e  -0.104405   0.224908  -0.062840  -0.403380    1.000000

If any non-numeric column is present in the DataFrame, it is excluded automatically.
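To see how the method argument changes the result, here is a sketch with two small hand-made series (values chosen so that the rank-based measures differ from a perfect ordering):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([2, 1, 4, 3, 5])

print(s1.corr(s2))                     # pearson (default): 0.8
print(s1.corr(s2, method='spearman'))  # rank-based: also 0.8, since ranks equal the values
print(s1.corr(s2, method='kendall'))   # concordant/discordant pairs: 0.6
```

Pearson measures a linear relationship, while spearman and kendall work on ranks and are therefore robust to monotonic transformations of the data.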

Data Ranking

Data ranking produces a rank for each element in an array of elements. In case of ties, the mean rank is assigned.

import pandas as pd
import numpy as np

s = pd.Series(np.random.randn(5), index=list('abcde'))
s['d'] = s['b'] # so there's a tie
print s.rank()

Its output is as follows −

a  1.0
b  3.5
c  2.0
d  3.5
e  5.0
dtype: float64

Rank optionally takes a parameter ascending which by default is true; when false, data is reverse-ranked, with larger values assigned a smaller rank.

Rank supports different tie-breaking methods, specified with the method parameter −

  1. average − average rank of tied group

  2. min − lowest rank in the group

  3. max − highest rank in the group

  4. first − ranks assigned in the order they appear in the array
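As a small illustration of the tie-breaking methods (using a hypothetical series with one tied value):

```python
import pandas as pd

s = pd.Series([3, 1, 3, 2])

print(s.rank().tolist())               # average (default): [3.5, 1.0, 3.5, 2.0]
print(s.rank(method='min').tolist())   # lowest rank in the tied group: [3.0, 1.0, 3.0, 2.0]
print(s.rank(method='first').tolist()) # order of appearance: [3.0, 1.0, 4.0, 2.0]
```

The two 3s occupy ranks 3 and 4; each method resolves that tie differently, as the comments show.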

Python Pandas - Window Functions

For working with numerical data, Pandas provides a few variants like rolling, expanding and exponentially weighted moving windows for window statistics. Among these are sum, mean, median, variance, covariance, correlation, etc.

We will now learn how each of these can be applied on DataFrame objects.

.rolling() Function

This function can be applied on a series of data. Specify the window=n argument and apply the appropriate statistical function on top of it.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print df.rolling(window=3).mean()

Its output is as follows −

                    A           B           C           D
2000-01-01        NaN         NaN         NaN         NaN
2000-01-02        NaN         NaN         NaN         NaN
2000-01-03   0.434553   -0.667940   -1.051718   -0.826452
2000-01-04   0.628267   -0.047040   -0.287467   -0.161110
2000-01-05   0.398233    0.003517    0.099126   -0.405565
2000-01-06   0.641798    0.656184   -0.322728    0.428015
2000-01-07   0.188403    0.010913   -0.708645    0.160932
2000-01-08   0.188043   -0.253039   -0.818125   -0.108485
2000-01-09   0.682819   -0.606846   -0.178411   -0.404127
2000-01-10   0.688583    0.127786    0.513832   -1.067156

Note − Since the window size is 3, the first two entries are null; from the third entry onward, each value is the average of the n, n-1 and n-2 elements. Thus we can also apply various functions as mentioned above.
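The same idea with a small deterministic series, so the window arithmetic is easy to verify by hand (this is a sketch; any aggregation such as sum(), max() or std() can take the place of mean()):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# each value is the sum of the current and two preceding elements
print(s.rolling(window=3).sum().tolist())
# [nan, nan, 6.0, 9.0, 12.0]

# min_periods=1 emits a result as soon as a single value is available
print(s.rolling(window=3, min_periods=1).sum().tolist())
# [1.0, 3.0, 6.0, 9.0, 12.0]
```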

.expanding() Function

This function can be applied on a series of data. Specify the min_periods=n argument and apply the appropriate statistical function on top of it.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print df.expanding(min_periods=3).mean()

Its output is as follows −

                   A           B           C           D
2000-01-01        NaN         NaN         NaN         NaN
2000-01-02        NaN         NaN         NaN         NaN
2000-01-03   0.434553   -0.667940   -1.051718   -0.826452
2000-01-04   0.743328   -0.198015   -0.852462   -0.262547
2000-01-05   0.614776   -0.205649   -0.583641   -0.303254
2000-01-06   0.538175   -0.005878   -0.687223   -0.199219
2000-01-07   0.505503   -0.108475   -0.790826   -0.081056
2000-01-08   0.454751   -0.223420   -0.671572   -0.230215
2000-01-09   0.586390   -0.206201   -0.517619   -0.267521
2000-01-10   0.560427   -0.037597   -0.399429   -0.376886

.ewm() Function

ewm is applied on a series of data. Specify any one of the com, span or halflife arguments and apply the appropriate statistical function on top of it. It assigns the weights exponentially.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print df.ewm(com=0.5).mean()

Its output is as follows −

                    A           B           C           D
2000-01-01   1.088512   -0.650942   -2.547450   -0.566858
2000-01-02   0.865131   -0.453626   -1.137961    0.058747
2000-01-03  -0.132245   -0.807671   -0.308308   -1.491002
2000-01-04   1.084036    0.555444   -0.272119    0.480111
2000-01-05   0.425682    0.025511    0.239162   -0.153290
2000-01-06   0.245094    0.671373   -0.725025    0.163310
2000-01-07   0.288030   -0.259337   -1.183515    0.473191
2000-01-08   0.162317   -0.771884   -0.285564   -0.692001
2000-01-09   1.147156   -0.302900    0.380851   -0.607976
2000-01-10   0.600216    0.885614    0.569808   -1.110113

Window functions are mainly used to find trends within the data graphically by smoothing the curve. If there is a lot of variation in the everyday data and many data points are available, then taking samples and plotting is one method, while applying window computations and plotting a graph of the results is another. By these methods, we can smooth the curve or the trend.

Python Pandas - Aggregations

Once the rolling, expanding and ewm objects are created, several methods are available to perform aggregations on data.

Applying Aggregations on DataFrame

Let us create a DataFrame and apply aggregations on it.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])

print df
r = df.rolling(window=3,min_periods=1)
print r

Its output is as follows −

                    A           B           C           D
2000-01-01   1.088512   -0.650942   -2.547450   -0.566858
2000-01-02   0.790670   -0.387854   -0.668132    0.267283
2000-01-03  -0.575523   -0.965025    0.060427   -2.179780
2000-01-04   1.669653    1.211759   -0.254695    1.429166
2000-01-05   0.100568   -0.236184    0.491646   -0.466081
2000-01-06   0.155172    0.992975   -1.205134    0.320958
2000-01-07   0.309468   -0.724053   -1.412446    0.627919
2000-01-08   0.099489   -1.028040    0.163206   -1.274331
2000-01-09   1.639500   -0.068443    0.714008   -0.565969
2000-01-10   0.326761    1.479841    0.664282   -1.361169

Rolling [window=3,min_periods=1,center=False,axis=0]

We can aggregate by passing a function to the entire DataFrame, or select a column via the standard get item method.

Apply Aggregation on a Whole Dataframe

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print df
r = df.rolling(window=3,min_periods=1)
print r.aggregate(np.sum)

Its output is as follows −

                    A           B           C           D
2000-01-01   1.088512   -0.650942   -2.547450   -0.566858
2000-01-02   1.879182   -1.038796   -3.215581   -0.299575
2000-01-03   1.303660   -2.003821   -3.155154   -2.479355
2000-01-04   1.884801   -0.141119   -0.862400   -0.483331
2000-01-05   1.194699    0.010551    0.297378   -1.216695
2000-01-06   1.925393    1.968551   -0.968183    1.284044
2000-01-07   0.565208    0.032738   -2.125934    0.482797
2000-01-08   0.564129   -0.759118   -2.454374   -0.325454
2000-01-09   2.048458   -1.820537   -0.535232   -1.212381
2000-01-10   2.065750    0.383357    1.541496   -3.201469

                    A           B           C           D
2000-01-01   1.088512   -0.650942   -2.547450   -0.566858
2000-01-02   1.879182   -1.038796   -3.215581   -0.299575
2000-01-03   1.303660   -2.003821   -3.155154   -2.479355
2000-01-04   1.884801   -0.141119   -0.862400   -0.483331
2000-01-05   1.194699    0.010551    0.297378   -1.216695
2000-01-06   1.925393    1.968551   -0.968183    1.284044
2000-01-07   0.565208    0.032738   -2.125934    0.482797
2000-01-08   0.564129   -0.759118   -2.454374   -0.325454
2000-01-09   2.048458   -1.820537   -0.535232   -1.212381
2000-01-10   2.065750    0.383357    1.541496   -3.201469

Apply Aggregation on a Single Column of a Dataframe

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print df
r = df.rolling(window=3,min_periods=1)
print r['A'].aggregate(np.sum)

Its output is as follows −

                 A           B           C           D
2000-01-01   1.088512   -0.650942   -2.547450   -0.566858
2000-01-02   1.879182   -1.038796   -3.215581   -0.299575
2000-01-03   1.303660   -2.003821   -3.155154   -2.479355
2000-01-04   1.884801   -0.141119   -0.862400   -0.483331
2000-01-05   1.194699    0.010551    0.297378   -1.216695
2000-01-06   1.925393    1.968551   -0.968183    1.284044
2000-01-07   0.565208    0.032738   -2.125934    0.482797
2000-01-08   0.564129   -0.759118   -2.454374   -0.325454
2000-01-09   2.048458   -1.820537   -0.535232   -1.212381
2000-01-10   2.065750    0.383357    1.541496   -3.201469
2000-01-01   1.088512
2000-01-02   1.879182
2000-01-03   1.303660
2000-01-04   1.884801
2000-01-05   1.194699
2000-01-06   1.925393
2000-01-07   0.565208
2000-01-08   0.564129
2000-01-09   2.048458
2000-01-10   2.065750
Freq: D, Name: A, dtype: float64

Apply Aggregation on Multiple Columns of a DataFrame

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print df
r = df.rolling(window=3,min_periods=1)
print r[['A','B']].aggregate(np.sum)

Its output is as follows −

                 A           B           C           D
2000-01-01   1.088512   -0.650942   -2.547450   -0.566858
2000-01-02   1.879182   -1.038796   -3.215581   -0.299575
2000-01-03   1.303660   -2.003821   -3.155154   -2.479355
2000-01-04   1.884801   -0.141119   -0.862400   -0.483331
2000-01-05   1.194699    0.010551    0.297378   -1.216695
2000-01-06   1.925393    1.968551   -0.968183    1.284044
2000-01-07   0.565208    0.032738   -2.125934    0.482797
2000-01-08   0.564129   -0.759118   -2.454374   -0.325454
2000-01-09   2.048458   -1.820537   -0.535232   -1.212381
2000-01-10   2.065750    0.383357    1.541496   -3.201469
                    A           B
2000-01-01   1.088512   -0.650942
2000-01-02   1.879182   -1.038796
2000-01-03   1.303660   -2.003821
2000-01-04   1.884801   -0.141119
2000-01-05   1.194699    0.010551
2000-01-06   1.925393    1.968551
2000-01-07   0.565208    0.032738
2000-01-08   0.564129   -0.759118
2000-01-09   2.048458   -1.820537
2000-01-10   2.065750    0.383357

Apply Multiple Functions on a Single Column of a DataFrame

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print df
r = df.rolling(window=3,min_periods=1)
print r['A'].aggregate([np.sum,np.mean])

Its output is as follows −

                 A           B           C           D
2000-01-01   1.088512   -0.650942   -2.547450   -0.566858
2000-01-02   1.879182   -1.038796   -3.215581   -0.299575
2000-01-03   1.303660   -2.003821   -3.155154   -2.479355
2000-01-04   1.884801   -0.141119   -0.862400   -0.483331
2000-01-05   1.194699    0.010551    0.297378   -1.216695
2000-01-06   1.925393    1.968551   -0.968183    1.284044
2000-01-07   0.565208    0.032738   -2.125934    0.482797
2000-01-08   0.564129   -0.759118   -2.454374   -0.325454
2000-01-09   2.048458   -1.820537   -0.535232   -1.212381
2000-01-10   2.065750    0.383357    1.541496   -3.201469
                  sum       mean
2000-01-01   1.088512   1.088512
2000-01-02   1.879182   0.939591
2000-01-03   1.303660   0.434553
2000-01-04   1.884801   0.628267
2000-01-05   1.194699   0.398233
2000-01-06   1.925393   0.641798
2000-01-07   0.565208   0.188403
2000-01-08   0.564129   0.188043
2000-01-09   2.048458   0.682819
2000-01-10   2.065750   0.688583

Apply Multiple Functions on Multiple Columns of a DataFrame

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print df
r = df.rolling(window=3,min_periods=1)
print r[['A','B']].aggregate([np.sum,np.mean])

Its output is as follows −

                 A           B           C           D
2000-01-01   1.088512   -0.650942   -2.547450   -0.566858
2000-01-02   1.879182   -1.038796   -3.215581   -0.299575
2000-01-03   1.303660   -2.003821   -3.155154   -2.479355
2000-01-04   1.884801   -0.141119   -0.862400   -0.483331
2000-01-05   1.194699    0.010551    0.297378   -1.216695
2000-01-06   1.925393    1.968551   -0.968183    1.284044
2000-01-07   0.565208    0.032738   -2.125934    0.482797
2000-01-08   0.564129   -0.759118   -2.454374   -0.325454
2000-01-09   2.048458   -1.820537   -0.535232   -1.212381
2000-01-10   2.065750    0.383357    1.541496   -3.201469
                    A                      B
                  sum       mean         sum        mean
2000-01-01   1.088512   1.088512   -0.650942   -0.650942
2000-01-02   1.879182   0.939591   -1.038796   -0.519398
2000-01-03   1.303660   0.434553   -2.003821   -0.667940
2000-01-04   1.884801   0.628267   -0.141119   -0.047040
2000-01-05   1.194699   0.398233    0.010551    0.003517
2000-01-06   1.925393   0.641798    1.968551    0.656184
2000-01-07   0.565208   0.188403    0.032738    0.010913
2000-01-08   0.564129   0.188043   -0.759118   -0.253039
2000-01-09   2.048458   0.682819   -1.820537   -0.606846
2000-01-10   2.065750   0.688583    0.383357    0.127786

Apply Different Functions to Different Columns of a Dataframe

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 4),
   index = pd.date_range('1/1/2000', periods=3),
   columns = ['A', 'B', 'C', 'D'])
print df
r = df.rolling(window=3,min_periods=1)
print r.aggregate({'A' : np.sum,'B' : np.mean})

Its output is as follows −

                    A          B          C         D
2000-01-01  -1.575749  -1.018105   0.317797  0.545081
2000-01-02  -0.164917  -1.361068   0.258240  1.113091
2000-01-03   1.258111   1.037941  -0.047487  0.867371
                    A          B
2000-01-01  -1.575749  -1.018105
2000-01-02  -1.740666  -1.189587
2000-01-03  -0.482555  -0.447078

Python Pandas - Missing Data

Missing data is always a problem in real life scenarios. Areas like machine learning and data mining face severe issues in the accuracy of their model predictions because of poor quality of data caused by missing values. In these areas, missing value treatment is a major point of focus to make their models more accurate and valid.

When and Why Is Data Missed?

Let us consider an online survey for a product. Many times, people do not share all the information related to them. Some share their experience, but not how long they have been using the product; others share how long they have been using the product and their experience, but not their contact information. Thus, some part of the data is almost always missing, and this is very common in real-world scenarios.

Let us now see how we can handle missing values (say NA or NaN) using Pandas.

# import the pandas library
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print df

Its output is as follows −

         one        two      three
a   0.077988   0.476149   0.965836
b        NaN        NaN        NaN
c  -0.390208  -0.551605  -2.301950
d        NaN        NaN        NaN
e  -2.000303  -0.788201   1.510072
f  -0.930230  -0.670473   1.146615
g        NaN        NaN        NaN
h   0.085100   0.532791   0.887415

Using reindexing, we have created a DataFrame with missing values. In the output, NaN means Not a Number.

Check for Missing Values

To make detecting missing values easier (and across different array dtypes), Pandas provides the isnull() and notnull() functions, which are also methods on Series and DataFrame objects −

Example 1

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print df['one'].isnull()

Its output is as follows −

a  False
b  True
c  False
d  True
e  False
f  False
g  True
h  False
Name: one, dtype: bool

Example 2

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print df['one'].notnull()

Its output is as follows −

a  True
b  False
c  True
d  False
e  True
f  True
g  False
h  True
Name: one, dtype: bool

Calculations with Missing Data

  1. When summing data, NA will be treated as Zero

  2. If the data are all NA, then the result will be NA
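A short sketch of both rules. Note that recent pandas releases return 0 by default for an all-NA sum; the NA result shown in the original examples reflects older releases and can be reproduced with the min_count parameter:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 2.0])
print(s.sum())                 # 3.0 -- NA is treated as zero

empty = pd.Series([np.nan, np.nan])
print(empty.sum())             # 0.0 in recent pandas versions
print(empty.sum(min_count=1))  # nan -- require at least one valid value
```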

Example 1

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print df['one'].sum()

Its output is as follows −

2.02357685917

Example 2

import pandas as pd
import numpy as np

df = pd.DataFrame(index=[0,1,2,3,4,5],columns=['one','two'])
print df['one'].sum()

Its output is as follows −

nan

Cleaning / Filling Missing Data

Pandas provides various methods for cleaning the missing values. The fillna function can “fill in” NA values with non-null data in a couple of ways, which we have illustrated in the following sections.

Replace NaN with a Scalar Value

The following program shows how you can replace "NaN" with "0".

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],columns=['one',
'two', 'three'])

df = df.reindex(['a', 'b', 'c'])

print df
print ("NaN replaced with '0':")
print df.fillna(0)

Its output is as follows −

         one        two     three
a  -0.576991  -0.741695  0.553172
b        NaN        NaN       NaN
c   0.744328  -1.735166  1.749580

NaN replaced with '0':
         one        two     three
a  -0.576991  -0.741695  0.553172
b   0.000000   0.000000  0.000000
c   0.744328  -1.735166  1.749580

Here, we are filling with value zero; instead we can also fill with any other value.

Fill NA Forward and Backward

Using the concepts of filling discussed in the Reindexing chapter, we will fill the missing values.

  1. pad/fill − Fill values forward

  2. bfill/backfill − Fill values backward

Example 1

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print df.fillna(method='pad')

Its output is as follows −

         one        two      three
a   0.077988   0.476149   0.965836
b   0.077988   0.476149   0.965836
c  -0.390208  -0.551605  -2.301950
d  -0.390208  -0.551605  -2.301950
e  -2.000303  -0.788201   1.510072
f  -0.930230  -0.670473   1.146615
g  -0.930230  -0.670473   1.146615
h   0.085100   0.532791   0.887415

Example 2

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print df.fillna(method='backfill')

Its output is as follows −

         one        two      three
a   0.077988   0.476149   0.965836
b  -0.390208  -0.551605  -2.301950
c  -0.390208  -0.551605  -2.301950
d  -2.000303  -0.788201   1.510072
e  -2.000303  -0.788201   1.510072
f  -0.930230  -0.670473   1.146615
g   0.085100   0.532791   0.887415
h   0.085100   0.532791   0.887415

Drop Missing Values

If you want to simply exclude the missing values, then use the dropna function along with the axis argument. By default, axis=0, i.e., along row, which means that if any value within a row is NA then the whole row is excluded.

Example 1

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print df.dropna()

Its output is as follows −

         one        two      three
a   0.077988   0.476149   0.965836
c  -0.390208  -0.551605  -2.301950
e  -2.000303  -0.788201   1.510072
f  -0.930230  -0.670473   1.146615
h   0.085100   0.532791   0.887415

Example 2

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print df.dropna(axis=1)

Its output is as follows −

Empty DataFrame
Columns: [ ]
Index: [a, b, c, d, e, f, g, h]

Replace Missing (or) Generic Values

Many times, we have to replace a generic value with some specific value. We can achieve this by applying the replace method.

Replacing NA with a scalar value is equivalent to the behavior of the fillna() function.
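A small sketch of the equivalence, using a hand-made frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'one': [1.0, np.nan, 3.0]})

# replace() with NaN as the key gives the same result as fillna() with a scalar
print(df.replace({np.nan: 0})['one'].tolist())  # [1.0, 0.0, 3.0]
print(df.fillna(0)['one'].tolist())             # [1.0, 0.0, 3.0]
```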

Example 1

import pandas as pd
import numpy as np

df = pd.DataFrame({'one':[10,20,30,40,50,2000], 'two':[1000,0,30,40,50,60]})

print df.replace({1000:10,2000:60})

Its output is as follows −

   one  two
0   10   10
1   20    0
2   30   30
3   40   40
4   50   50
5   60   60

Python Pandas - GroupBy

Any groupby operation involves one of the following operations on the original object. They are −

  1. Splitting the Object

  2. Applying a function

  3. Combining the results

In many situations, we split the data into sets and we apply some functionality on each subset. In the apply functionality, we can perform the following operations −

  1. Aggregation − computing a summary statistic

  2. Transformation − perform some group-specific operation

  3. Filtration − discarding the data with some condition
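The three apply-step operations can be sketched on a tiny hypothetical frame (the column names and lambdas here are illustrative, not part of the dataset used below):

```python
import pandas as pd

df = pd.DataFrame({'Team': ['A', 'A', 'B', 'B', 'B'],
                   'Points': [10, 20, 30, 40, 50]})
grouped = df.groupby('Team')

# Aggregation: one summary value per group
print(grouped['Points'].sum())        # A -> 30, B -> 120

# Transformation: a group-specific operation, same shape as the input
print(grouped['Points'].transform(lambda x: x - x.mean()).tolist())
# [-5.0, 5.0, -10.0, 0.0, 10.0]

# Filtration: keep only the groups satisfying a condition
print(grouped.filter(lambda x: len(x) > 2))   # only Team B's rows remain
```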

Let us now create a DataFrame object and perform all the operations on it −

#import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

print df

Its output is as follows −

    Points   Rank     Team   Year
0      876      1   Riders   2014
1      789      2   Riders   2015
2      863      2   Devils   2014
3      673      3   Devils   2015
4      741      3    Kings   2014
5      812      4    kings   2015
6      756      1    Kings   2016
7      788      1    Kings   2017
8      694      2   Riders   2016
9      701      4   Royals   2014
10     804      1   Royals   2015
11     690      2   Riders   2017

Split Data into Groups

A Pandas object can be split into groups on any of its axes. There are multiple ways to split an object, such as −

  1. obj.groupby('key')

  2. obj.groupby(['key1','key2'])

  3. obj.groupby(key,axis=1)

Let us now see how the grouping objects can be applied to the DataFrame object −

Example

# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

print df.groupby('Team')

Its output is as follows −

<pandas.core.groupby.DataFrameGroupBy object at 0x7fa46a977e50>

View Groups

# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

print df.groupby('Team').groups

Its output is as follows −

{'Kings': Int64Index([4, 6, 7],      dtype='int64'),
'Devils': Int64Index([2, 3],         dtype='int64'),
'Riders': Int64Index([0, 1, 8, 11],  dtype='int64'),
'Royals': Int64Index([9, 10],        dtype='int64'),
'kings' : Int64Index([5],            dtype='int64')}

Example

Group by with multiple columns −

# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

print df.groupby(['Team','Year']).groups

Its output is as follows −

{('Kings', 2014): Int64Index([4], dtype='int64'),
 ('Royals', 2014): Int64Index([9], dtype='int64'),
 ('Riders', 2014): Int64Index([0], dtype='int64'),
 ('Riders', 2015): Int64Index([1], dtype='int64'),
 ('Kings', 2016): Int64Index([6], dtype='int64'),
 ('Riders', 2016): Int64Index([8], dtype='int64'),
 ('Riders', 2017): Int64Index([11], dtype='int64'),
 ('Devils', 2014): Int64Index([2], dtype='int64'),
 ('Devils', 2015): Int64Index([3], dtype='int64'),
 ('kings', 2015): Int64Index([5], dtype='int64'),
 ('Royals', 2015): Int64Index([10], dtype='int64'),
 ('Kings', 2017): Int64Index([7], dtype='int64')}

Iterating through Groups

With the groupby object in hand, we can iterate through it just like any other iterable.

# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Year')

for name,group in grouped:
   print name
   print group

Its output is as follows −

2014
   Points  Rank     Team   Year
0     876     1   Riders   2014
2     863     2   Devils   2014
4     741     3   Kings    2014
9     701     4   Royals   2014

2015
   Points  Rank     Team   Year
1     789     2   Riders   2015
3     673     3   Devils   2015
5     812     4    kings   2015
10    804     1   Royals   2015

2016
   Points  Rank     Team   Year
6     756     1    Kings   2016
8     694     2   Riders   2016

2017
   Points  Rank    Team   Year
7     788     1   Kings   2017
11    690     2  Riders   2017

By default, each group yielded during iteration is labelled with its group-key value.

Select a Group

Using the get_group() method, we can select a single group.

# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Year')
print grouped.get_group(2014)

Its output is as follows −

   Points  Rank     Team    Year
0     876     1   Riders    2014
2     863     2   Devils    2014
4     741     3   Kings     2014
9     701     4   Royals    2014

Aggregations

An aggregation function returns a single aggregated value for each group. Once the group by object is created, several aggregation operations can be performed on the grouped data.

显而易见的是,通过聚合或等效的 agg 方法进行聚合 -

An obvious one is aggregation via the aggregate or equivalent agg method −

# import the pandas library
import pandas as pd
import numpy as np

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Year')
print(grouped['Points'].agg('mean'))

它的 output 如下所示 −

Its output is as follows −

Year
2014   795.25
2015   769.50
2016   725.00
2017   739.00
Name: Points, dtype: float64

查看每个组大小的另一种方法是应用 size() 函数 -

Another way to see the size of each group is by applying the size() function −

import pandas as pd
import numpy as np

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Team')
print(grouped.agg(np.size))

它的 output 如下所示 −

Its output is as follows −

         Points   Rank   Year
Team
Devils        2      2      2
Kings         3      3      3
Riders        4      4      4
Royals        2      2      2
kings         1      1      1

Applying Multiple Aggregation Functions at Once

对于已分组的 Series,您还可以传递 listdict of functions 来聚合,并生成 DataFrame 作为输出 -

With grouped Series, you can also pass a list or dict of functions to do aggregation with, and generate DataFrame as output −

# import the pandas library
import pandas as pd
import numpy as np

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Team')
print(grouped['Points'].agg(['sum', 'mean', 'std']))

它的 output 如下所示 −

Its output is as follows −

Team      sum      mean          std
Devils   1536   768.000000   134.350288
Kings    2285   761.666667    24.006943
Riders   3049   762.250000    88.567771
Royals   1505   752.500000    72.831998
kings     812   812.000000          NaN

Transformations

对组或列进行转换会返回一个与被分组对象具有相同索引(相同大小)的对象。因此,转换函数应返回与组块大小相同的结果。

Transformation on a group or a column returns an object that is indexed the same (i.e., has the same size) as the object being grouped. Thus, the transform function should return a result that is the same size as the group chunk.

# import the pandas library
import pandas as pd
import numpy as np

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Team')
score = lambda x: (x - x.mean()) / x.std()*10
print(grouped.transform(score))

它的 output 如下所示 −

Its output is as follows −

       Points        Rank        Year
0   12.843272  -15.000000  -11.618950
1   3.020286     5.000000   -3.872983
2   7.071068    -7.071068   -7.071068
3  -7.071068     7.071068    7.071068
4  -8.608621    11.547005  -10.910895
5        NaN          NaN         NaN
6  -2.360428    -5.773503    2.182179
7  10.969049    -5.773503    8.728716
8  -7.705963     5.000000    3.872983
9  -7.071068     7.071068   -7.071068
10  7.071068    -7.071068    7.071068
11 -8.157595     5.000000   11.618950
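
Because transform preserves the shape of its input, results align row for row with the original frame. A quick sketch (using a small hypothetical frame, not the IPL data above) showing that group-wise demeaning yields groups that sum to zero:

```python
import pandas as pd

# Hypothetical frame; demean Points within each Team group
df = pd.DataFrame({'Team': ['A', 'A', 'B', 'B'],
                   'Points': [10.0, 20.0, 30.0, 50.0]})
z = df.groupby('Team')['Points'].transform(lambda x: x - x.mean())

# z has the same length as df, and each group's demeaned values sum to 0
print(z.tolist())  # [-5.0, 5.0, -10.0, 10.0]
```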

Filtration

过滤会根据定义好的准则过滤数据,并返回数据的子集。 filter() 函数用于过滤数据。

Filtration filters the data on a defined criteria and returns the subset of data. The filter() function is used to filter the data.

import pandas as pd
import numpy as np

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

print(df.groupby('Team').filter(lambda x: len(x) >= 3))

它的 output 如下所示 −

Its output is as follows −

    Points  Rank     Team   Year
0      876     1   Riders   2014
1      789     2   Riders   2015
4      741     3   Kings    2014
6      756     1   Kings    2016
7      788     1   Kings    2017
8      694     2   Riders   2016
11     690     2   Riders   2017

在上述过滤器条件中,我们要求返回参加过三次或以上 IPL 比赛的球队。

In the above filter condition, we are asking to return the teams which have participated three or more times in IPL.
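
The same mechanism works with any group-level predicate, not just group size. A small sketch (hypothetical data and an arbitrary threshold) filtering on a group aggregate instead:

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Riders', 'Riders', 'Devils', 'Devils'],
                   'Points': [876, 789, 863, 673]})

# Keep only teams whose average Points exceed 780 (hypothetical threshold)
print(df.groupby('Team').filter(lambda x: x['Points'].mean() > 780))
```

Here only the Riders rows survive, since their group mean (832.5) exceeds the threshold while the Devils' (768.0) does not.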

Python Pandas - Merging/Joining

Pandas 具有功能全面、性能卓越的内存中连接操作,与 SQL 等关系数据库极为相似。

Pandas has full-featured, high performance in-memory join operations idiomatically very similar to relational databases like SQL.

Pandas 提供了一个函数 merge ,作为 DataFrame 对象之间所有标准数据库连接操作的入口点 -

Pandas provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects −

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=False)

此处,我们使用了以下参数 -

Here, we have used the following parameters −

  1. left − A DataFrame object.

  2. right − Another DataFrame object.

  3. on − Columns (names) to join on. Must be found in both the left and right DataFrame objects.

  4. left_on − Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.

  5. right_on − Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.

  6. left_index − If True, use the index (row labels) from the left DataFrame as its join key(s). In case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame.

  7. right_index − Same usage as left_index for the right DataFrame.

  8. how − One of 'left', 'right', 'outer', 'inner'. Defaults to inner. Each method has been described below.

  9. sort − Sort the result DataFrame by the join keys in lexicographical order. Defaults to False in current pandas versions (older versions defaulted to True); leaving it as False substantially improves performance in many cases.

现在让我们创建两个不同的 DataFrame 并对它们执行合并操作。

Let us now create two different DataFrames and perform the merging operations on it.

# import the pandas library
import pandas as pd
left = pd.DataFrame({
   'id':[1,2,3,4,5],
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
   {'id':[1,2,3,4,5],
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print(left)
print(right)

它的 output 如下所示 −

Its output is as follows −

    Name  id   subject_id
0   Alex   1         sub1
1    Amy   2         sub2
2  Allen   3         sub4
3  Alice   4         sub6
4  Ayoung  5         sub5

    Name  id   subject_id
0  Billy   1         sub2
1  Brian   2         sub4
2  Bran    3         sub3
3  Bryce   4         sub6
4  Betty   5         sub5

Merge Two DataFrames on a Key

import pandas as pd
left = pd.DataFrame({
   'id':[1,2,3,4,5],
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame({
	'id':[1,2,3,4,5],
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print(pd.merge(left,right,on='id'))

它的 output 如下所示 −

Its output is as follows −

   Name_x   id  subject_id_x   Name_y   subject_id_y
0  Alex      1          sub1    Billy           sub2
1  Amy       2          sub2    Brian           sub4
2  Allen     3          sub4     Bran           sub3
3  Alice     4          sub6    Bryce           sub6
4  Ayoung    5          sub5    Betty           sub5

Merge Two DataFrames on Multiple Keys

import pandas as pd
left = pd.DataFrame({
   'id':[1,2,3,4,5],
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame({
	'id':[1,2,3,4,5],
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print(pd.merge(left,right,on=['id','subject_id']))

它的 output 如下所示 −

Its output is as follows −

    Name_x   id   subject_id   Name_y
0    Alice    4         sub6    Bryce
1   Ayoung    5         sub5    Betty

Merge Using 'how' Argument

merge 的 how 参数指定如何确定要包含在结果表中的键。如果键组合没有出现在左表或右表中,则连接表中的值将为 NA。

The how argument to merge specifies how to determine which keys are to be included in the resulting table. If a key combination does not appear in either the left or the right tables, the values in the joined table will be NA.

以下是 how 选项及其 SQL 等价名称的摘要 −

Here is a summary of the how options and their SQL equivalent names −

Merge Method    SQL Equivalent      Description

left            LEFT OUTER JOIN     Use keys from left object
right           RIGHT OUTER JOIN    Use keys from right object
outer           FULL OUTER JOIN     Use union of keys
inner           INNER JOIN          Use intersection of keys

Left Join

import pandas as pd
left = pd.DataFrame({
   'id':[1,2,3,4,5],
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame({
   'id':[1,2,3,4,5],
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print(pd.merge(left, right, on='subject_id', how='left'))

它的 output 如下所示 −

Its output is as follows −

    Name_x   id_x   subject_id   Name_y   id_y
0     Alex      1         sub1      NaN    NaN
1      Amy      2         sub2    Billy    1.0
2    Allen      3         sub4    Brian    2.0
3    Alice      4         sub6    Bryce    4.0
4   Ayoung      5         sub5    Betty    5.0

Right Join

import pandas as pd
left = pd.DataFrame({
   'id':[1,2,3,4,5],
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame({
   'id':[1,2,3,4,5],
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print(pd.merge(left, right, on='subject_id', how='right'))

它的 output 如下所示 −

Its output is as follows −

    Name_x  id_x   subject_id   Name_y   id_y
0      Amy   2.0         sub2    Billy      1
1    Allen   3.0         sub4    Brian      2
2    Alice   4.0         sub6    Bryce      4
3   Ayoung   5.0         sub5    Betty      5
4      NaN   NaN         sub3     Bran      3

Outer Join

import pandas as pd
left = pd.DataFrame({
   'id':[1,2,3,4,5],
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame({
   'id':[1,2,3,4,5],
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print(pd.merge(left, right, how='outer', on='subject_id'))

它的 output 如下所示 −

Its output is as follows −

    Name_x  id_x   subject_id   Name_y   id_y
0     Alex   1.0         sub1      NaN    NaN
1      Amy   2.0         sub2    Billy    1.0
2    Allen   3.0         sub4    Brian    2.0
3    Alice   4.0         sub6    Bryce    4.0
4   Ayoung   5.0         sub5    Betty    5.0
5      NaN   NaN         sub3     Bran    3.0

Inner Join

内连接只保留同时出现在两个 DataFrame 中的键(键的交集),这也是 merge 的默认行为。相比之下,DataFrame.join 在索引上进行连接,并遵守对其进行调用的对象,因此 a.join(b) 不等于 b.join(a)。

An inner join keeps only the keys present in both DataFrames (the intersection of keys); this is also the default behavior of merge. By contrast, DataFrame.join is performed on the index and honors the object on which it is called, so a.join(b) is not equal to b.join(a).
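
The asymmetry of DataFrame.join noted above can be seen in a tiny sketch (hypothetical frames a and b; join defaults to a left join on the index):

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]}, index=['p', 'q'])
b = pd.DataFrame({'y': [10, 30]}, index=['p', 'r'])

# a.join(b) keeps a's index (p, q); b.join(a) keeps b's index (p, r)
print(a.join(b))
print(b.join(a))
```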

import pandas as pd
left = pd.DataFrame({
   'id':[1,2,3,4,5],
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame({
   'id':[1,2,3,4,5],
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print(pd.merge(left, right, on='subject_id', how='inner'))

它的 output 如下所示 −

Its output is as follows −

    Name_x   id_x   subject_id   Name_y   id_y
0      Amy      2         sub2    Billy      1
1    Allen      3         sub4    Brian      2
2    Alice      4         sub6    Bryce      4
3   Ayoung      5         sub5    Betty      5

Python Pandas - Concatenation

Pandas 提供了多种功能,可轻松组合 Series 和 DataFrame 对象。(早期版本中的 Panel 类型已从新版 pandas 中移除。)

Pandas provides various facilities for easily combining together Series and DataFrame objects. (The Panel type, which appeared in older versions, has been removed from recent pandas releases.)

pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None)

  1. objs − This is a sequence or mapping of Series or DataFrame objects.

  2. axis − {0, 1, …}, default 0. This is the axis to concatenate along.

  3. join − {‘inner’, ‘outer’}, default ‘outer’. How to handle indexes on the other axis(es). Outer for union and inner for intersection.

  4. ignore_index − boolean, default False. If True, do not use the index values on the concatenation axis. The resulting axis will be labeled 0, …, n - 1.

  5. keys − sequence, default None. Construct a hierarchical index using the passed keys as the outermost level. (The older join_axes parameter was removed in pandas 1.0.)
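
The effect of the join parameter is easiest to see along axis=1. A minimal sketch (hypothetical frames with partially overlapping indexes):

```python
import pandas as pd

one = pd.DataFrame({'A': [1, 2, 3]}, index=[1, 2, 3])
two = pd.DataFrame({'B': [4, 5, 6]}, index=[2, 3, 4])

# join='outer' (default) keeps the union of index labels,
# join='inner' keeps only the intersection (labels 2 and 3)
print(pd.concat([one, two], axis=1, join='inner'))
```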

Concatenating Objects

concat 函数执行沿轴执行串联操作的所有繁重工作。让我们创建不同的对象并进行串联。

The concat function does all of the heavy lifting of performing concatenation operations along an axis. Let us create different objects and do concatenation.

import pandas as pd

one = pd.DataFrame({
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5'],
   'Marks_scored':[98,90,87,69,78]},
   index=[1,2,3,4,5])

two = pd.DataFrame({
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5'],
   'Marks_scored':[89,80,79,97,88]},
   index=[1,2,3,4,5])
print(pd.concat([one,two]))

它的 output 如下所示 −

Its output is as follows −

    Marks_scored     Name   subject_id
1             98     Alex         sub1
2             90      Amy         sub2
3             87    Allen         sub4
4             69    Alice         sub6
5             78   Ayoung         sub5
1             89    Billy         sub2
2             80    Brian         sub4
3             79     Bran         sub3
4             97    Bryce         sub6
5             88    Betty         sub5

假设我们希望将特定键与切片的每个 DataFrame 片段相关联。我们可以通过使用 keys 参数来实现此目的−

Suppose we wanted to associate specific keys with each of the pieces of the chopped up DataFrame. We can do this by using the keys argument −

import pandas as pd

one = pd.DataFrame({
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5'],
   'Marks_scored':[98,90,87,69,78]},
   index=[1,2,3,4,5])

two = pd.DataFrame({
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5'],
   'Marks_scored':[89,80,79,97,88]},
   index=[1,2,3,4,5])
print(pd.concat([one,two],keys=['x','y']))

它的 output 如下所示 −

Its output is as follows −

x  1  98    Alex    sub1
   2  90    Amy     sub2
   3  87    Allen   sub4
   4  69    Alice   sub6
   5  78    Ayoung  sub5
y  1  89    Billy   sub2
   2  80    Brian   sub4
   3  79    Bran    sub3
   4  97    Bryce   sub6
   5  88    Betty   sub5

结果索引被复制;每个索引都被重复。

The index of the resultant is duplicated; each index is repeated.

如果结果对象必须遵循其自己的索引,则将 ignore_index 设置为 True

If the resultant object has to follow its own indexing, set ignore_index to True.

import pandas as pd

one = pd.DataFrame({
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5'],
   'Marks_scored':[98,90,87,69,78]},
   index=[1,2,3,4,5])

two = pd.DataFrame({
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5'],
   'Marks_scored':[89,80,79,97,88]},
   index=[1,2,3,4,5])
print(pd.concat([one,two],keys=['x','y'],ignore_index=True))

它的 output 如下所示 −

Its output is as follows −

    Marks_scored     Name    subject_id
0             98     Alex          sub1
1             90      Amy          sub2
2             87    Allen          sub4
3             69    Alice          sub6
4             78   Ayoung          sub5
5             89    Billy          sub2
6             80    Brian          sub4
7             79     Bran          sub3
8             97    Bryce          sub6
9             88    Betty          sub5

请观察,索引完全更改,并且键也被覆盖。

Observe, the index changes completely and the Keys are also overridden.

如果两个对象需要沿 axis=1 添加,那么将追加新列。

If two objects need to be added along axis=1, then the new columns will be appended.

import pandas as pd

one = pd.DataFrame({
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5'],
   'Marks_scored':[98,90,87,69,78]},
   index=[1,2,3,4,5])

two = pd.DataFrame({
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5'],
   'Marks_scored':[89,80,79,97,88]},
   index=[1,2,3,4,5])
print(pd.concat([one,two],axis=1))

它的 output 如下所示 −

Its output is as follows −

    Marks_scored    Name  subject_id   Marks_scored    Name   subject_id
1           98      Alex      sub1         89         Billy         sub2
2           90       Amy      sub2         80         Brian         sub4
3           87     Allen      sub4         79          Bran         sub3
4           69     Alice      sub6         97         Bryce         sub6
5           78    Ayoung      sub5         88         Betty         sub5

Concatenating Using append

一个有用的串联快捷方式是 Series 和 DataFrame 上的 append 实例方法。这些方法实际上早于 concat,它们沿 axis=0(即索引)串联。(注意:append 已在 pandas 1.4 中弃用,并在 pandas 2.0 中移除;在新版本中请改用 pd.concat。)

A useful shortcut to concat was the append instance method on Series and DataFrame. These methods actually predated concat; they concatenate along axis=0, namely the index. (Note: append was deprecated in pandas 1.4 and removed in pandas 2.0; use pd.concat in current versions.)

import pandas as pd

one = pd.DataFrame({
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5'],
   'Marks_scored':[98,90,87,69,78]},
   index=[1,2,3,4,5])

two = pd.DataFrame({
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5'],
   'Marks_scored':[89,80,79,97,88]},
   index=[1,2,3,4,5])
print(one.append(two))

它的 output 如下所示 −

Its output is as follows −

    Marks_scored    Name  subject_id
1           98      Alex      sub1
2           90       Amy      sub2
3           87     Allen      sub4
4           69     Alice      sub6
5           78    Ayoung      sub5
1           89     Billy      sub2
2           80     Brian      sub4
3           79      Bran      sub3
4           97     Bryce      sub6
5           88     Betty      sub5

append 函数也可以接受多个对象−

The append function can take multiple objects as well −

import pandas as pd

one = pd.DataFrame({
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5'],
   'Marks_scored':[98,90,87,69,78]},
   index=[1,2,3,4,5])

two = pd.DataFrame({
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5'],
   'Marks_scored':[89,80,79,97,88]},
   index=[1,2,3,4,5])
print(one.append([two,one,two]))

它的 output 如下所示 −

Its output is as follows −

    Marks_scored   Name    subject_id
1           98     Alex          sub1
2           90      Amy          sub2
3           87    Allen          sub4
4           69    Alice          sub6
5           78   Ayoung          sub5
1           89    Billy          sub2
2           80    Brian          sub4
3           79     Bran          sub3
4           97    Bryce          sub6
5           88    Betty          sub5
1           98     Alex          sub1
2           90      Amy          sub2
3           87    Allen          sub4
4           69    Alice          sub6
5           78   Ayoung          sub5
1           89    Billy          sub2
2           80    Brian          sub4
3           79     Bran          sub3
4           97    Bryce          sub6
5           88    Betty          sub5

Time Series

Pandas 提供了强大的工具来处理时间序列数据,尤其是在金融领域。在使用时间序列数据时,我们经常会遇到以下需求 −

Pandas provides a robust tool for working with time series data, especially in the financial sector. While working with time series data, we frequently come across the following −

  1. Generating sequence of time

  2. Convert the time series to different frequencies

Pandas 提供了一组相对紧凑和独立的工具来执行上述任务。

Pandas provides a relatively compact and self-contained set of tools for performing the above tasks.

Get Current Time

pd.Timestamp.now() 为您提供当前日期和时间。(早期教程中使用的 pd.datetime 别名已从新版 pandas 中移除。)

pd.Timestamp.now() gives you the current date and time. (The pd.datetime alias used in older tutorials has been removed from recent pandas versions.)

import pandas as pd

print(pd.Timestamp.now())

它的 output 如下所示 −

Its output is as follows −

2017-05-11 06:10:13.393147

Create a TimeStamp

时间戳数据是最基本的时间序列数据类型,它将值与时间点相关联。对于 pandas 对象而言,这意味着使用 Timestamp 对象来表示时间点。我们来看一个例子 −

Time-stamped data is the most basic type of time series data that associates values with points in time. For pandas objects, this means using Timestamp objects to represent points in time. Let’s take an example −

import pandas as pd

print(pd.Timestamp('2017-03-01'))

它的 output 如下所示 −

Its output is as follows −

2017-03-01 00:00:00

还可以转换整数或浮点数时间戳。它们的默认单位是纳秒(因为这是 Timestamp 的存储方式)。但是,时间戳经常存储在另一个单位中,该单位可以被指定。我们来看另一个例子

It is also possible to convert integer or float epoch times. The default unit for these is nanoseconds (since that is how Timestamps are stored). However, epochs are often stored in another unit, which can be specified. Let’s take another example

import pandas as pd

print(pd.Timestamp(1587687255,unit='s'))

它的 output 如下所示 −

Its output is as follows −

2020-04-24 00:14:15

Create a Range of Time

import pandas as pd

print(pd.date_range("11:00", "13:30", freq="30min").time)

它的 output 如下所示 −

Its output is as follows −

[datetime.time(11, 0) datetime.time(11, 30) datetime.time(12, 0)
datetime.time(12, 30) datetime.time(13, 0) datetime.time(13, 30)]

Change the Frequency of Time

import pandas as pd

print(pd.date_range("11:00", "13:30", freq="H").time)

它的 output 如下所示 −

Its output is as follows −

[datetime.time(11, 0) datetime.time(12, 0) datetime.time(13, 0)]

Converting to Timestamps

要转换Series 或类似列表的类似日期的对象,例如字符串、时间戳或混合,可以使用 to_datetime 函数。传递时,它返回一个 Series(具有相同的索引),而 list-like 将转换为 DatetimeIndex 。请看以下示例 −

To convert a Series or list-like object of date-like objects, for example strings, epochs, or a mixture, you can use the to_datetime function. When passed, this returns a Series (with the same index), while a list-like is converted to a DatetimeIndex. Take a look at the following example −

import pandas as pd

# format='mixed' is required in pandas >= 2.0 when the strings
# use different date formats
print(pd.to_datetime(pd.Series(['Jul 31, 2009','2010-01-10', None]), format='mixed'))

它的 output 如下所示 −

Its output is as follows −

0  2009-07-31
1  2010-01-10
2         NaT
dtype: datetime64[ns]

NaT 表示 Not a Time (等同于 NaN)

NaT means Not a Time (equivalent to NaN)

我们来看另一个例子。

Let’s take another example.

import pandas as pd

# format='mixed' is required in pandas >= 2.0 for heterogeneous date strings
print(pd.to_datetime(['2005/11/23', '2010.12.31', None], format='mixed'))

它的 output 如下所示 −

Its output is as follows −

DatetimeIndex(['2005-11-23', '2010-12-31', 'NaT'], dtype='datetime64[ns]', freq=None)

Python Pandas - Date Functionality

扩展时间序列,日期功能在金融数据分析中扮演着重要角色。在使用日期数据时,我们经常会遇到以下问题−

Extending the Time series, Date functionalities play major role in financial data analysis. While working with Date data, we will frequently come across the following −

  1. Generating sequence of dates

  2. Convert the date series to different frequencies

Create a Range of Dates

通过在 date_range() 函数中指定周期和频率,我们可以创建日期序列。默认情况下,范围的频率为天。

Using the date_range() function by specifying the periods and the frequency, we can create the date series. By default, the frequency of the range is days.

import pandas as pd

print(pd.date_range('1/1/2011', periods=5))

它的 output 如下所示 −

Its output is as follows −

DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04', '2011-01-05'],
   dtype='datetime64[ns]', freq='D')

Change the Date Frequency

import pandas as pd

print(pd.date_range('1/1/2011', periods=5,freq='M'))

它的 output 如下所示 −

Its output is as follows −

DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-30', '2011-05-31'],
   dtype='datetime64[ns]', freq='M')

bdate_range

bdate_range() 代表业务日期范围。不同于 date_range(),它排除了星期六和星期日。

bdate_range() stands for business date ranges. Unlike date_range(), it excludes Saturday and Sunday.

import pandas as pd

print(pd.bdate_range('1/1/2011', periods=5))

它的 output 如下所示 −

Its output is as follows −

DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05', '2011-01-06', '2011-01-07'],
   dtype='datetime64[ns]', freq='B')

注意,2011 年 1 月 1 日和 2 日(星期六和星期日)被排除在外,日期范围从星期一 1 月 3 日开始。只需查看您日历中的天数即可。

Observe that 1st and 2nd January 2011 (Saturday and Sunday) are excluded, and the range starts from Monday, 3rd January. Just check your calendar for the days.

date_rangebdate_range 这样的便捷函数利用了各种频率别名。date_range 的默认频率是日历日,而 bdate_range 的默认频率是工作日。

Convenience functions like date_range and bdate_range utilize a variety of frequency aliases. The default frequency for date_range is a calendar day while the default for bdate_range is a business day.

import pandas as pd
from datetime import datetime

start = datetime(2011, 1, 1)
end = datetime(2011, 1, 5)

print(pd.date_range(start, end))

它的 output 如下所示 −

Its output is as follows −

DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04', '2011-01-05'],
   dtype='datetime64[ns]', freq='D')

Offset Aliases

将许多字符串别名提供给有用的常见时间序列频率。我们将这些别名称为偏移别名。

A number of string aliases are given to useful common time series frequencies. We will refer to these aliases as offset aliases.

Alias     Description                          Alias     Description

B         business day frequency               BQS       business quarter start frequency
D         calendar day frequency               A         annual (year) end frequency
W         weekly frequency                     BA        business year end frequency
M         month end frequency                  BAS       business year start frequency
SM        semi-month end frequency             BH        business hour frequency
BM        business month end frequency         H         hourly frequency
MS        month start frequency                T, min    minutely frequency
SMS       semi-month start frequency           S         secondly frequency
BMS       business month start frequency       L, ms     milliseconds
Q         quarter end frequency                U, us     microseconds
BQ        business quarter end frequency       N         nanoseconds
QS        quarter start frequency
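
A couple of the aliases above in action ('W' weekly, which by default anchors to Sunday, and 'B' business day), as a quick sketch:

```python
import pandas as pd

# 'W' defaults to weeks ending on Sunday; 'B' skips weekends
print(pd.date_range('2011-01-01', periods=3, freq='W'))
print(pd.date_range('2011-01-01', periods=3, freq='B'))
```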

Python Pandas - Timedelta

时间差是时间差异,表示为差分单位,例如天、小时、分钟、秒。它们既可以是正值,也可以是负值。

Timedeltas are differences in times, expressed in difference units, for example, days, hours, minutes, seconds. They can be both positive and negative.

我们可以使用各种参数来创建 Timedelta 对象,如下所示:

We can create Timedelta objects using various arguments as shown below −

String

通过传递字符串文字,我们可以创建一个 timedelta 对象。

By passing a string literal, we can create a timedelta object.

import pandas as pd

print(pd.Timedelta('2 days 2 hours 15 minutes 30 seconds'))

它的 output 如下所示 −

Its output is as follows −

2 days 02:15:30

Integer

通过传递一个整数值和单位参数,我们可以创建一个 Timedelta 对象。

By passing an integer value together with a unit argument, we can create a Timedelta object.

import pandas as pd

print(pd.Timedelta(6,unit='h'))

它的 output 如下所示 −

Its output is as follows −

0 days 06:00:00

Data Offsets

在构造中还可以使用数据偏移量,例如 - 周、天、小时、分钟、秒、毫秒、微秒、纳秒。

Data offsets such as weeks, days, hours, minutes, seconds, milliseconds, microseconds, and nanoseconds can also be used in construction.

import pandas as pd

print(pd.Timedelta(days=2))

它的 output 如下所示 −

Its output is as follows −

2 days 00:00:00

to_timedelta()

使用顶级 pd.to_timedelta ,你可以将标量、数组、列表或序列从公认 timedelta 格式/值转换为 Timedelta 类型。如果输入是序列,它将构建序列;如果输入是标量,它将构建标量;否则,将输出一个 TimedeltaIndex

Using the top-level pd.to_timedelta, you can convert a scalar, array, list, or series from a recognized timedelta format/ value into a Timedelta type. It will construct Series if the input is a Series, a scalar if the input is scalar-like, otherwise will output a TimedeltaIndex.

import pandas as pd

print(pd.to_timedelta('1 days 06:05:01.00003'))

它的 output 如下所示 −

Its output is as follows −

1 days 06:05:01.000030
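
As noted above, a list-like input to pd.to_timedelta yields a TimedeltaIndex rather than a scalar; a small sketch:

```python
import pandas as pd

# List-like input produces a TimedeltaIndex
idx = pd.to_timedelta(['1 days', '2 days 06:00:00'])
print(idx)
```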

Operations

你可以操作 Series/ DataFrame,并通过对 datetime64[ns] Series 或时间戳执行减法操作来构建 timedelta64[ns] Series。

You can operate on Series/ DataFrames and construct timedelta64[ns] Series through subtraction operations on datetime64[ns] Series, or Timestamps.
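
For instance, subtracting one datetime64[ns] Series from another yields a timedelta64[ns] Series (a minimal sketch with hypothetical dates):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(['2012-01-01', '2012-01-05']))
t = pd.Series(pd.to_datetime(['2012-01-03', '2012-01-03']))

# datetime - datetime -> timedelta64[ns]; differences can be negative
print(t - s)
```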

现在,我们创建一个包含 Timedelta 和日期时间对象的 DataFrame,并对其执行一些算术运算:

Let us now create a DataFrame with Timedelta and datetime objects and perform some arithmetic operations on it −

import pandas as pd

s = pd.Series(pd.date_range('2012-1-1', periods=3, freq='D'))
td = pd.Series([ pd.Timedelta(days=i) for i in range(3) ])
df = pd.DataFrame(dict(A = s, B = td))

print(df)

它的 output 如下所示 −

Its output is as follows −

            A      B
0  2012-01-01 0 days
1  2012-01-02 1 days
2  2012-01-03 2 days

Addition Operations

import pandas as pd

s = pd.Series(pd.date_range('2012-1-1', periods=3, freq='D'))
td = pd.Series([ pd.Timedelta(days=i) for i in range(3) ])
df = pd.DataFrame(dict(A = s, B = td))
df['C']=df['A']+df['B']

print(df)

它的 output 如下所示 −

Its output is as follows −

           A      B          C
0 2012-01-01 0 days 2012-01-01
1 2012-01-02 1 days 2012-01-03
2 2012-01-03 2 days 2012-01-05

Subtraction Operation

import pandas as pd

s = pd.Series(pd.date_range('2012-1-1', periods=3, freq='D'))
td = pd.Series([ pd.Timedelta(days=i) for i in range(3) ])
df = pd.DataFrame(dict(A = s, B = td))
df['C']=df['A']+df['B']
df['D']=df['C']-df['B']

print(df)

它的 output 如下所示 −

Its output is as follows −

           A      B          C          D
0 2012-01-01 0 days 2012-01-01 2012-01-01
1 2012-01-02 1 days 2012-01-03 2012-01-02
2 2012-01-03 2 days 2012-01-05 2012-01-03

Python Pandas - Categorical Data

现实数据中常常包含取值重复的文本列。性别、国家和代码等特征总是重复出现。这些都是分类数据的示例。

Real-world data often includes text columns whose values are repetitive. Features like gender, country, and codes are always repetitive. These are examples of categorical data.

分类变量只能取有限的、通常是固定数量的可能值。分类数据可能存在顺序,但不能对其执行数值运算。Categorical 是一种 pandas 数据类型。

Categorical variables can take on only a limited, and usually fixed, number of possible values. Categorical data might have an order, but numerical operations cannot be performed on it. Categorical is a pandas data type.

分类数据类型在以下情况下有用−

The categorical data type is useful in the following cases −

  1. A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory.

  2. The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.

  3. As a signal to other python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
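
The memory-saving point (case 1 above) can be checked directly; a sketch with a hypothetical low-cardinality column:

```python
import pandas as pd

# A string column with only two distinct values, repeated many times
s = pd.Series(['male', 'female'] * 1000)
c = s.astype('category')

# The categorical version stores each value as a small integer code
print(s.memory_usage(deep=True), c.memory_usage(deep=True))
```

On a typical build the categorical column uses a small fraction of the memory of the object column.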

Object Creation

分类对象可以通过多种方式创建。以下介绍了不同的方式-

Categorical object can be created in multiple ways. The different ways have been described below −

category

在创建 pandas 对象时将 dtype 指定为 "category"。

By specifying the dtype as "category" in pandas object creation.

import pandas as pd

s = pd.Series(["a","b","c","a"], dtype="category")
print(s)

它的 output 如下所示 −

Its output is as follows −

0  a
1  b
2  c
3  a
dtype: category
Categories (3, object): [a, b, c]

传递给 series 对象的元素数量为四个,但类别只有三个。在输出类别中观察相同的内容。

The number of elements passed to the series object is four, but the categories are only three. Observe the same in the output Categories.
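
Under the hood, each value is stored as an integer code into the categories array, which can be inspected via .cat.codes; a quick sketch:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Codes index into the inferred categories ['a', 'b', 'c']
print(list(s.cat.codes))  # [0, 1, 2, 0]
```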

pd.Categorical

使用标准的 pandas 分类构造函数,我们可以创建一个类别对象。

Using the standard pandas Categorical constructor, we can create a category object.

pandas.Categorical(values, categories, ordered)

让我们举个例子-

Let’s take an example −

import pandas as pd

cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])
print(cat)

它的 output 如下所示 −

Its output is as follows −

[a, b, c, a, b, c]
Categories (3, object): [a, b, c]

我们举另一个例子-

Let’s have another example −

import pandas as pd

cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c', 'd'], ['c', 'b', 'a'])
print(cat)

它的 output 如下所示 −

Its output is as follows −

[a, b, c, a, b, c, NaN]
Categories (3, object): [c, b, a]

在此,第二个参数表示类别。因此,在类别中不存在的任何值都将被视为 NaN

Here, the second argument signifies the categories. Thus, any value which is not present in the categories will be treated as NaN.

现在,看下面的例子 -

Now, take a look at the following example −

import pandas as pd

cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c', 'd'], ['c', 'b', 'a'], ordered=True)
print(cat)

它的 output 如下所示 −

Its output is as follows −

[a, b, c, a, b, c, NaN]
Categories (3, object): [c < b < a]

从逻辑上来说,该顺序表示 a 大于 bb 大于 c

Logically, the order means that, a is greater than b and b is greater than c.
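In current pandas versions, the same ordered behavior is usually expressed with CategoricalDtype. A small sketch (the category names are illustrative) showing that min/max and sorting follow the declared logical order rather than the lexical order:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Declare the logical order explicitly
size_type = CategoricalDtype(["small", "medium", "large"], ordered=True)
s = pd.Series(["medium", "large", "small"], dtype=size_type)

# min/max respect the declared order, not alphabetical order
print(s.min(), s.max())
print(s.sort_values().tolist())
```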

Description

对分类数据使用 .describe() 命令,我们得到的输出与 string 类型的 SeriesDataFrame 类似。

Using the .describe() command on the categorical data, we get an output similar to a Series or DataFrame of type string.

import pandas as pd
import numpy as np

cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
df = pd.DataFrame({"cat":cat, "s":["a", "c", "c", np.nan]})

print(df.describe())
print(df["cat"].describe())

它的 output 如下所示 −

Its output is as follows −

       cat s
count    3 3
unique   2 2
top      c c
freq     2 2
count     3
unique    2
top       c
freq      2
Name: cat, dtype: object

Get the Properties of the Category

obj.cat.categories 命令用于获取 categories of the object

obj.cat.categories command is used to get the categories of the object.

import pandas as pd
import numpy as np

s = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
print(s.categories)

它的 output 如下所示 −

Its output is as follows −

Index([u'b', u'a', u'c'], dtype='object')

obj.ordered 命令用于获取对象的顺序。

obj.ordered command is used to get the order of the object.

import pandas as pd
import numpy as np

cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
print(cat.ordered)

它的 output 如下所示 −

Its output is as follows −

False

函数返回 false ,因为我们没有指定任何顺序。

The function returned false because we haven’t specified any order.

Renaming Categories

通过将新值赋给 series.cat.categories 属性来重命名类别。

Renaming categories is done by assigning new values to the series.cat.categories property.

import pandas as pd

s = pd.Series(["a","b","c","a"], dtype="category")
s.cat.categories = ["Group %s" % g for g in s.cat.categories]
print(s.cat.categories)

它的 output 如下所示 −

Its output is as follows −

Index([u'Group a', u'Group b', u'Group c'], dtype='object')

对象的 s.cat.categories 属性更新了初始类别 [a,b,c]

Initial categories [a,b,c] are updated by the s.cat.categories property of the object.

Appending New Categories

使用 cat.add_categories() 方法可以追加新类别。

Using the cat.add_categories() method, new categories can be appended.

import pandas as pd

s = pd.Series(["a","b","c","a"], dtype="category")
s = s.cat.add_categories([4])
print(s.cat.categories)

它的 output 如下所示 −

Its output is as follows −

Index([u'a', u'b', u'c', 4], dtype='object')

Removing Categories

使用 Categorical.remove_categories() 方法可以删除不需要的类别。

Using the Categorical.remove_categories() method, unwanted categories can be removed.

import pandas as pd

s = pd.Series(["a","b","c","a"], dtype="category")
print("Original object:")
print(s)

print("After removal:")
print(s.cat.remove_categories("a"))

它的 output 如下所示 −

Its output is as follows −

Original object:
0  a
1  b
2  c
3  a
dtype: category
Categories (3, object): [a, b, c]

After removal:
0  NaN
1  b
2  c
3  NaN
dtype: category
Categories (2, object): [b, c]
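The appending and removal methods above compose naturally. A small sketch (the category names are arbitrary) that also uses remove_unused_categories(), which drops only categories no value actually uses:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")
s = s.cat.add_categories(["d"])       # "d" is now a valid category
s = s.cat.remove_unused_categories()  # drops "d" again, since no value uses it
print(list(s.cat.categories))
```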

Comparison of Categorical Data

在以下三种情况下,可以将分类数据与其他对象进行比较 −

Comparing categorical data with other objects is possible in three cases −

  1. comparing equality (== and !=) to a list-like object (list, Series, array, …​) of the same length as the categorical data.

  2. all comparisons (==, !=, >, >=, <, and <=) of categorical data to another categorical Series, when ordered==True and the categories are the same.

  3. all comparisons of a categorical data to a scalar.

请看以下示例:

Take a look at the following example −

import pandas as pd
from pandas.api.types import CategoricalDtype

cat = pd.Series([1,2,3]).astype(CategoricalDtype([1, 2, 3], ordered=True))
cat1 = pd.Series([2,2,2]).astype(CategoricalDtype([1, 2, 3], ordered=True))

print(cat > cat1)

它的 output 如下所示 −

Its output is as follows −

0  False
1  False
2  True
dtype: bool

Python Pandas - Visualization

Basic Plotting: plot

Series 和 DataFrame 上的此功能只是对 matplotlib 库 plot() 方法的简单包装。

This functionality on Series and DataFrame is just a simple wrapper around the matplotlib library's plot() method.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10,4),index=pd.date_range('1/1/2000',
   periods=10), columns=list('ABCD'))

df.plot()

它的 output 如下所示 −

Its output is as follows −

basic plotting

如果索引包含日期,它将调用 gcf().autofmt_xdate() 将 x 轴格式化为如上图所示。

If the index consists of dates, it calls gcf().autofmt_xdate() to format the x-axis as shown in the above illustration.

我们可以使用 xy 关键字将一列与另一列作对比。

We can plot one column versus another using the x and y keywords.
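A minimal sketch of plotting one column against another with the x and y keywords; the Agg backend and the column names are assumptions so the script runs headless:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display window needed
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) ** 2})
ax = df.plot(x="x", y="y")  # draw column "y" against column "x"
ax.figure.savefig("xy_plot.png")
```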

绘图方法允许一些绘图样式,这些样式与默认的线图不同。这些方法可以作为 plot() 的 kind 关键字参数提供。它们包括 -

Plotting methods allow a handful of plot styles other than the default line plot. These methods can be provided as the kind keyword argument to plot(). These include −

  1. bar or barh for bar plots

  2. hist for histogram

  3. box for boxplot

  4. area for area plots

  5. scatter for scatter plots

Bar Plot

让我们通过创建一个条形图来看一个条形图是什么。条形图可以通过以下方式创建 -

Let us now see what a Bar Plot is by creating one. A bar plot can be created in the following way −

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10,4),columns=['a','b','c','d'])
df.plot.bar()

它的 output 如下所示 −

Its output is as follows −

bar plot

若要生成堆叠条形图,传入 stacked=True

To produce a stacked bar plot, pass stacked=True

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10,4),columns=['a','b','c','d'])
df.plot.bar(stacked=True)

它的 output 如下所示 −

Its output is as follows −

stacked bar plot

若要获得水平条形图,使用 barh 方法 −

To get horizontal bar plots, use the barh method −

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10,4),columns=['a','b','c','d'])

df.plot.barh(stacked=True)

它的 output 如下所示 −

Its output is as follows −

horizontal bar plot

Histograms

可以使用 plot.hist() 方法绘制直方图。我们可以指定柱的数量。

Histograms can be plotted using the plot.hist() method. We can specify the number of bins.

import pandas as pd
import numpy as np

df = pd.DataFrame({'a':np.random.randn(1000)+1,'b':np.random.randn(1000),'c':
np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])

df.plot.hist(bins=20)

它的 output 如下所示 −

Its output is as follows −

histograms using plot hist

若要针对每一列绘制不同的直方图,使用以下代码 −

To plot different histograms for each column, use the following code −

import pandas as pd
import numpy as np

df=pd.DataFrame({'a':np.random.randn(1000)+1,'b':np.random.randn(1000),'c':
np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])

df.diff().hist(bins=20)

它的 output 如下所示 −

Its output is as follows −

histograms for column

Box Plots

可以通过调用 Series.plot.box()DataFrame.plot.box()DataFrame.boxplot() 绘制箱形图,以可视化每列中的值分布。

Boxplot can be drawn by calling Series.plot.box() and DataFrame.plot.box(), or DataFrame.boxplot() to visualize the distribution of values within each column.

例如,这里是一个箱线图,表示在 [0,1) 上的均匀随机变量的 10 次观测的五次试验。

For instance, here is a boxplot representing five trials of 10 observations of a uniform random variable on [0,1).

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])
df.plot.box()

它的 output 如下所示 −

Its output is as follows −

box plot

Area Plot

可以使用 Series.plot.area()DataFrame.plot.area() 方法创建面积图。

Area plot can be created using the Series.plot.area() or the DataFrame.plot.area() methods.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df.plot.area()

它的 output 如下所示 −

Its output is as follows −

area plot

Scatter Plot

可以使用 DataFrame.plot.scatter() 方法创建散点图。

Scatter plot can be created using the DataFrame.plot.scatter() method.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(50, 4), columns=['a', 'b', 'c', 'd'])
df.plot.scatter(x='a', y='b')

它的 output 如下所示 −

Its output is as follows −

scatter plot

Pie Chart

可以使用 DataFrame.plot.pie() 方法创建饼状图。

Pie chart can be created using the DataFrame.plot.pie() method.

import pandas as pd
import numpy as np

df = pd.DataFrame(3 * np.random.rand(4), index=['a', 'b', 'c', 'd'], columns=['x'])
df.plot.pie(subplots=True)

它的 output 如下所示 −

Its output is as follows −

pie chart

Python Pandas - IO Tools

Pandas I/O API 是一个顶级读取器函数集,其访问方式类似于 pd.read_csv() ,它通常返回一个 Pandas 对象。

The Pandas I/O API is a set of top level reader functions accessed like pd.read_csv() that generally return a Pandas object.

用于读取文本文件(或平面文件)的两个主力函数是 read_csv()read_table() 。它们都使用相同的解析代码,将表格数据智能地转换为一个 DataFrame 对象:

The two workhorse functions for reading text files (or the flat files) are read_csv() and read_table(). They both use the same parsing code to intelligently convert tabular data into a DataFrame object −

pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer',
names=None, index_col=None, usecols=None)
pandas.read_table(filepath_or_buffer, sep='\t', delimiter=None, header='infer',
names=None, index_col=None, usecols=None)

以下是 csv 文件数据的样子 −

Here is how the csv file data looks like −

S.No,Name,Age,City,Salary
1,Tom,28,Toronto,20000
2,Lee,32,HongKong,3000
3,Steven,43,Bay Area,8300
4,Ram,38,Hyderabad,3900

将这些数据另存为 temp.csv 并对其进行操作。

Save this data as temp.csv and conduct operations on it.
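If you prefer not to create a file on disk, read_csv also accepts any file-like object. A sketch using io.StringIO with the same data:

```python
import pandas as pd
from io import StringIO

csv_data = """S.No,Name,Age,City,Salary
1,Tom,28,Toronto,20000
2,Lee,32,HongKong,3000
"""

# StringIO wraps the text so read_csv can parse it exactly like a file
df = pd.read_csv(StringIO(csv_data))
print(df.shape)
```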


read_csv

read_csv 从 csv 文件中读取数据并创建一个 DataFrame 对象。

read_csv reads data from the csv files and creates a DataFrame object.

import pandas as pd

df=pd.read_csv("temp.csv")
print(df)

它的 output 如下所示 −

Its output is as follows −

   S.No     Name   Age       City   Salary
0     1      Tom    28    Toronto    20000
1     2      Lee    32   HongKong     3000
2     3   Steven    43   Bay Area     8300
3     4      Ram    38  Hyderabad     3900

custom index

使用 index_col 指定 csv 文件中用作自定义索引的列。

This specifies a column in the csv file to be used as the custom index, using index_col.

import pandas as pd

df=pd.read_csv("temp.csv",index_col=['S.No'])
print(df)

它的 output 如下所示 −

Its output is as follows −

S.No   Name   Age       City   Salary
1       Tom    28    Toronto    20000
2       Lee    32   HongKong     3000
3    Steven    43   Bay Area     8300
4       Ram    38  Hyderabad     3900

Converters

列的 dtype 可以作为 dict 传递。

dtype of the columns can be passed as a dict.

import pandas as pd
import numpy as np

df = pd.read_csv("temp.csv", dtype={'Salary': np.float64})
print(df.dtypes)

它的 output 如下所示 −

Its output is as follows −

S.No       int64
Name      object
Age        int64
City      object
Salary   float64
dtype: object

默认情况下,Salary 列的 dtypeint ,但结果显示为 float ,因为我们已显式地强制转换了该类型。

By default, the dtype of the Salary column is int, but the result shows it as float because we have explicitly cast the type.

因此,数据看起来像浮点数 −

Thus, the data looks like float −

  S.No   Name   Age      City    Salary
0   1     Tom   28    Toronto   20000.0
1   2     Lee   32   HongKong    3000.0
2   3  Steven   43   Bay Area    8300.0
3   4     Ram   38  Hyderabad    3900.0
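The section heading also suggests the converters argument, which maps a column name to a function applied to every raw value as it is read. A sketch (the scaling function and data are illustrative):

```python
import pandas as pd
from io import StringIO

data = "S.No,Salary\n1,20000\n2,3000\n"

# Each raw Salary string is passed through the function before the column is built
df = pd.read_csv(StringIO(data), converters={"Salary": lambda v: float(v) / 1000})
print(df["Salary"].tolist())
```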

header_names

使用 names 参数指定标头名称。

Specify the names of the header using the names argument.

import pandas as pd

df=pd.read_csv("temp.csv", names=['a', 'b', 'c','d','e'])
print(df)

它的 output 如下所示 −

Its output is as follows −

       a        b    c           d        e
0   S.No     Name   Age       City   Salary
1      1      Tom   28     Toronto    20000
2      2      Lee   32    HongKong     3000
3      3   Steven   43    Bay Area     8300
4      4      Ram   38   Hyderabad     3900

请注意,自定义名称被用作标头,但文件中原有的标头行并未去除,它现在作为一行数据出现。接下来,我们使用 header 参数将其删除。

Observe that the custom names are applied, but the original header row in the file has not been eliminated; it now appears as a row of data. Next, we use the header argument to remove it.

如果标头不在第一行,则将行号传递给 header。这将跳过前几行。

If the header is in a row other than the first, pass the row number to header. This will skip the preceding rows.

import pandas as pd

df=pd.read_csv("temp.csv",names=['a','b','c','d','e'],header=0)
print(df)

它的 output 如下所示 −

Its output is as follows −

      a        b    c           d        e
0  S.No     Name   Age       City   Salary
1     1      Tom   28     Toronto    20000
2     2      Lee   32    HongKong     3000
3     3   Steven   43    Bay Area     8300
4     4      Ram   38   Hyderabad     3900

skiprows

skiprows 跳过指定数量的行。

skiprows skips the number of rows specified.

import pandas as pd

df=pd.read_csv("temp.csv", skiprows=2)
print(df)

它的 output 如下所示 −

Its output is as follows −

    2      Lee   32    HongKong   3000
0   3   Steven   43    Bay Area   8300
1   4      Ram   38   Hyderabad   3900
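skiprows also accepts a list of row numbers, which skips only those rows and keeps the header. A self-contained sketch with illustrative in-memory data:

```python
import pandas as pd
from io import StringIO

data = "S.No,Name,Age\n1,Tom,28\n2,Lee,32\n3,Steven,43\n"

# Row numbers are 0-based, so [2] skips only the "Lee" line, not the header
df = pd.read_csv(StringIO(data), skiprows=[2])
print(df)
```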

Python Pandas - Sparse Data

当省略任何与特定值(NaN/空白值,但可以选择任何值)匹配的数据时,“压缩”稀疏对象。一个特殊的 SparseIndex 对象会追踪数据被“稀疏化”的位置。这将在一个示例中更有意义。所有标准 Pandas 数据结构都会应用 to_sparse 方法 −

Sparse objects are “compressed” when any data matching a specific value (NaN / missing value, though any value can be chosen) is omitted. A special SparseIndex object tracks where data has been “sparsified”. This will make much more sense in an example. All of the standard Pandas data structures apply the to_sparse method −

import pandas as pd
import numpy as np

ts = pd.Series(np.random.randn(10))
ts[2:-2] = np.nan
sts = ts.to_sparse()
print(sts)

它的 output 如下所示 −

Its output is as follows −

0   -0.810497
1   -1.419954
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8    0.439240
9   -1.095910
dtype: float64
BlockIndex
Block locations: array([0, 8], dtype=int32)
Block lengths: array([2, 2], dtype=int32)

稀疏对象的存在是为了提高内存效率。

The sparse objects exist for memory efficiency reasons.

让我们假设你有一个较大的 NA DataFrame 并执行以下代码 −

Let us now assume you had a large NA DataFrame and execute the following code −

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10000, 4))
df.iloc[:9999] = np.nan
sdf = df.to_sparse()

print(sdf.density)

它的 output 如下所示 −

Its output is as follows −

0.0001

可以使用 to_dense 将任何稀疏对象转换回标准的密集形式 −

Any sparse object can be converted back to the standard dense form by calling to_dense

import pandas as pd
import numpy as np
ts = pd.Series(np.random.randn(10))
ts[2:-2] = np.nan
sts = ts.to_sparse()
print(sts.to_dense())

它的 output 如下所示 −

Its output is as follows −

0   -0.810497
1   -1.419954
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8    0.439240
9   -1.095910
dtype: float64
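In recent pandas versions (1.0 and later), to_sparse has been removed in favour of the SparseDtype extension type and the .sparse accessor. A sketch of the equivalent workflow:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 2.0])
ss = s.astype(pd.SparseDtype("float64", np.nan))  # NaN is the fill value

# density is the fraction of values actually stored (the non-fill values)
print(ss.sparse.density)
```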

Sparse Dtypes

稀疏数据应与其密集表示具有相同的 dtype。目前支持 float64int64bool 这些 dtype。根据原始 dtype,fill_value 的默认值会变化 −

Sparse data should have the same dtype as its dense representation. Currently, float64, int64 and bool dtypes are supported. Depending on the original dtype, the fill_value default changes −

  1. float64 − np.nan

  2. int64 − 0

  3. bool − False

让我们执行以下代码来理解这一点 −

Let us execute the following code to understand the same −

import pandas as pd
import numpy as np

s = pd.Series([1, np.nan, np.nan])
print(s)

s.to_sparse()
print(s)

它的 output 如下所示 −

Its output is as follows −

0   1.0
1   NaN
2   NaN
dtype: float64

0   1.0
1   NaN
2   NaN
dtype: float64

Python Pandas - Caveats & Gotchas

Caveat 意为警示,gotcha 意为不可预见的问题。

Caveat means a warning and gotcha means an unseen problem.

Using If/Truth Statement with Pandas

当您尝试将某个对象转换成 bool 时,Pandas 遵循 NumPy 的惯例抛出错误。这发生在 if 语句中,或使用布尔运算 andornot 时。结果应该是什么并不清楚。它应该是 True,因为它不是零长度吗?还是 False,因为存在 False 值?不清楚,因此 Pandas 抛出 ValueError

Pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if statement or when using the Boolean operations and, or, or not. It is not clear what the result should be. Should it be True because it is not zero-length? False because there are False values? It is unclear, so instead, Pandas raises a ValueError

import pandas as pd

if pd.Series([False, True, False]):
   print('I am True')

它的 output 如下所示 −

Its output is as follows −

ValueError: The truth value of a Series is ambiguous.
Use a.empty, a.bool(), a.item(), a.any() or a.all().

if 条件中,不清楚该对其做什么。错误暗示应使用 None 还是 any of those

In if condition, it is unclear what to do with it. The error is suggestive of whether to use a None or any of those.

import pandas as pd

if pd.Series([False, True, False]).any():
   print("I am any")

它的 output 如下所示 −

Its output is as follows −

I am any

若要在布尔上下文中评估单元素 pandas 对象,使用 .bool() 方法 −

To evaluate single-element pandas objects in a Boolean context, use the method .bool()

import pandas as pd

print(pd.Series([True]).bool())

它的 output 如下所示 −

Its output is as follows −

True

Bitwise Boolean

Bitwise 布尔运算符(如 ==!=)将返回一个布尔序列,这几乎总是所需的结果。

Bitwise Boolean operators like == and != will return a Boolean series, which is almost always what is required anyway.

import pandas as pd

s = pd.Series(range(5))
print(s == 4)

它的 output 如下所示 −

Its output is as follows −

0 False
1 False
2 False
3 False
4 True
dtype: bool

isin Operation

这返回一个布尔序列,显示 Series 中的每个元素是否恰好包含在传递的值序列中。

This returns a Boolean series showing whether each element in the Series is exactly contained in the passed sequence of values.

import pandas as pd

s = pd.Series(list('abc'))
s = s.isin(['a', 'c', 'e'])
print(s)

它的 output 如下所示 −

Its output is as follows −

0 True
1 False
2 True
dtype: bool

Reindexing vs ix Gotcha

许多用户会发现自己使用 ix indexing capabilities 作为从 Pandas 对象选择数据的一种简洁方式 −

Many users will find themselves using the ix indexing capabilities as a concise means of selecting data from a Pandas object −

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(6, 4), columns=['one', 'two', 'three',
'four'],index=list('abcdef'))

print(df)
print(df.ix[['b', 'c', 'e']])

它的 output 如下所示 −

Its output is as follows −

          one        two      three       four
a   -1.582025   1.335773   0.961417  -1.272084
b    1.461512   0.111372  -0.072225   0.553058
c   -1.240671   0.762185   1.511936  -0.630920
d   -2.380648  -0.029981   0.196489   0.531714
e    1.846746   0.148149   0.275398  -0.244559
f   -1.842662  -0.933195   2.303949   0.677641

          one        two      three       four
b    1.461512   0.111372  -0.072225   0.553058
c   -1.240671   0.762185   1.511936  -0.630920
e    1.846746   0.148149   0.275398  -0.244559

这当然与使用 reindex 方法在此情况下完全等效 −

This is, of course, completely equivalent in this case to using the reindex method −

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(6, 4), columns=['one', 'two', 'three',
'four'],index=list('abcdef'))

print(df)
print(df.reindex(['b', 'c', 'e']))

它的 output 如下所示 −

Its output is as follows −

          one        two      three       four
a    1.639081   1.369838   0.261287  -1.662003
b   -0.173359   0.242447  -0.494384   0.346882
c   -0.106411   0.623568   0.282401  -0.916361
d   -1.078791  -0.612607  -0.897289  -1.146893
e    0.465215   1.552873  -1.841959   0.329404
f    0.966022  -0.190077   1.324247   0.678064

          one        two      three       four
b   -0.173359   0.242447  -0.494384   0.346882
c   -0.106411   0.623568   0.282401  -0.916361
e    0.465215   1.552873  -1.841959   0.329404

有人可能由此得出 ixreindex 100% 等效的结论。除整型索引的情形外,这确实成立。例如,上述操作也可以表述为 −

Some might conclude that ix and reindex are 100% equivalent based on this. This is true except in the case of integer indexing. For example, the above operation can alternatively be expressed as −

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(6, 4), columns=['one', 'two', 'three',
'four'],index=list('abcdef'))

print(df)
print(df.ix[[1, 2, 4]])
print(df.reindex([1, 2, 4]))

它的 output 如下所示 −

Its output is as follows −

          one        two      three       four
a   -1.015695  -0.553847   1.106235  -0.784460
b   -0.527398  -0.518198  -0.710546  -0.512036
c   -0.842803  -1.050374   0.787146   0.205147
d   -1.238016  -0.749554  -0.547470  -0.029045
e   -0.056788   1.063999  -0.767220   0.212476
f    1.139714   0.036159   0.201912   0.710119

          one        two      three       four
b   -0.527398  -0.518198  -0.710546  -0.512036
c   -0.842803  -1.050374   0.787146   0.205147
e   -0.056788   1.063999  -0.767220   0.212476

    one  two  three  four
1   NaN  NaN    NaN   NaN
2   NaN  NaN    NaN   NaN
4   NaN  NaN    NaN   NaN

记住 reindex is strict label indexing only 非常重要。在病态情形下,当索引同时包含(例如)整数和字符串时,这可能产生一些令人惊讶的结果。

It is important to remember that reindex is strict label indexing only. This can lead to some potentially surprising results in pathological cases where an index contains, say, both integers and strings.
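In modern pandas .ix has been removed; the strict-label behaviour of reindex can still be contrasted with .loc selection, as in this sketch (the frame contents are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(4, 2), columns=["one", "two"],
                  index=list("abcd"))

sel = df.loc[["b", "d"]]          # label selection; all labels must exist
re = df.reindex(["b", "d", "z"])  # "z" is absent, so its row is all NaN
print(sel)
print(re)
```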

Python Pandas - Comparison with SQL

由于许多潜在的 Pandas 用户对 SQL 有些熟悉,本页旨在提供有关如何在 pandas 中执行各种 SQL 操作的一些示例。

Since many potential Pandas users have some familiarity with SQL, this page is meant to provide some examples of how various SQL operations can be performed using pandas.

import pandas as pd

url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'

tips=pd.read_csv(url)
print(tips.head())

它的 output 如下所示 −

Its output is as follows −

    total_bill   tip      sex  smoker  day     time  size
0        16.99  1.01   Female      No  Sun  Dinner      2
1        10.34  1.66     Male      No  Sun  Dinner      3
2        21.01  3.50     Male      No  Sun  Dinner      3
3        23.68  3.31     Male      No  Sun  Dinner      2
4        24.59  3.61   Female      No  Sun  Dinner      4

SELECT

在 SQL 中,选择是使用一个逗号分隔的列表完成的,列出你选择的列(或一个 * 来选择所有列) −

In SQL, selection is done using a comma-separated list of columns that you select (or a * to select all columns) −

SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;

使用 Pandas,列选择是通过将列名列表传递给 DataFrame 完成的 −

With Pandas, column selection is done by passing a list of column names to your DataFrame −

tips[['total_bill', 'tip', 'smoker', 'time']].head(5)

让我们检查完整程序 −

Let’s check the full program −

import pandas as pd

url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'

tips=pd.read_csv(url)
print(tips[['total_bill', 'tip', 'smoker', 'time']].head(5))

它的 output 如下所示 −

Its output is as follows −

   total_bill   tip  smoker     time
0       16.99  1.01      No   Dinner
1       10.34  1.66      No   Dinner
2       21.01  3.50      No   Dinner
3       23.68  3.31      No   Dinner
4       24.59  3.61      No   Dinner

不带列名列表地调用 DataFrame 将显示所有列(类似于 SQL 的 *)。

Calling the DataFrame without the list of column names will display all columns (akin to SQL’s *).

WHERE

在 SQL 中,筛选是通过 WHERE 子句完成的。

Filtering in SQL is done via a WHERE clause.

  SELECT * FROM tips WHERE time = 'Dinner' LIMIT 5;

DataFrame 可以通过多种方式筛选;其中最直观的是使用布尔索引。

DataFrames can be filtered in multiple ways; the most intuitive of which is using Boolean indexing.

  tips[tips['time'] == 'Dinner'].head(5)

让我们检查完整程序 −

Let’s check the full program −

import pandas as pd

url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'

tips=pd.read_csv(url)
print(tips[tips['time'] == 'Dinner'].head(5))

它的 output 如下所示 −

Its output is as follows −

   total_bill   tip      sex  smoker  day    time  size
0       16.99  1.01   Female     No   Sun  Dinner    2
1       10.34  1.66     Male     No   Sun  Dinner    3
2       21.01  3.50     Male     No   Sun  Dinner    3
3       23.68  3.31     Male     No   Sun  Dinner    2
4       24.59  3.61   Female     No   Sun  Dinner    4

以上语句将一个 True/False 对象的 Series 传递给 DataFrame,返回所有带有 True 的行。

The above statement passes a Series of True/False objects to the DataFrame, returning all rows with True.
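SQL WHERE clauses with AND/OR translate to & and | on Boolean Series, with each condition parenthesised. A self-contained sketch with a tiny illustrative frame:

```python
import pandas as pd

tips = pd.DataFrame({
   "total_bill": [16.99, 10.34, 21.01],
   "time": ["Dinner", "Lunch", "Dinner"],
})

# SQL: SELECT * FROM tips WHERE time = 'Dinner' AND total_bill > 20
result = tips[(tips["time"] == "Dinner") & (tips["total_bill"] > 20)]
print(result)
```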

GroupBy

此操作获取数据集中每个组的记录数。例如,一个查询按性别统计小费记录的数量 −

This operation fetches the count of records in each group throughout a dataset. For instance, a query fetching us the number of tips left by sex −

SELECT sex, count(*)
FROM tips
GROUP BY sex;

Pandas 等效项是 −

The Pandas equivalent would be −

tips.groupby('sex').size()

让我们检查完整程序 −

Let’s check the full program −

import pandas as pd

url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'

tips=pd.read_csv(url)
print(tips.groupby('sex').size())

它的 output 如下所示 −

Its output is as follows −

sex
Female   87
Male    157
dtype: int64
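SQL aggregates inside GROUP BY map to agg() on the grouped column. A sketch with a small illustrative frame:

```python
import pandas as pd

tips = pd.DataFrame({
   "sex": ["Female", "Male", "Male", "Female"],
   "tip": [1.0, 2.0, 3.0, 4.0],
})

# SQL: SELECT sex, AVG(tip), COUNT(*) FROM tips GROUP BY sex
out = tips.groupby("sex")["tip"].agg(["mean", "count"])
print(out)
```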

Top N rows

SQL 返回 top n rows ,使用 LIMIT

SQL returns the top n rows using LIMIT

SELECT * FROM tips
LIMIT 5;

Pandas 等效项是 −

The Pandas equivalent would be −

tips.head(5)

我们来检查完整示例 −

Let’s check the full example −

import pandas as pd

url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'

tips=pd.read_csv(url)
tips = tips[['smoker', 'day', 'time']].head(5)
print(tips)

它的 output 如下所示 −

Its output is as follows −

   smoker   day     time
0      No   Sun   Dinner
1      No   Sun   Dinner
2      No   Sun   Dinner
3      No   Sun   Dinner
4      No   Sun   Dinner

以上是我们比较的几个基本操作,用到的都是前面 Pandas 库各章中学到的内容。

These are a few of the basic operations we compared, using what we learnt in the previous chapters of the Pandas library.