Python Data Science: A Concise Tutorial

Python - Data Operations

Python handles data in various formats mainly through two libraries, Pandas and NumPy. The previous chapters covered the important features of both. This chapter walks through a few basic examples from each library to show how to operate on data.

Data Operations in Numpy

The most important object defined in NumPy is an N-dimensional array type called ndarray. It describes a collection of items of the same type, and each item can be accessed using a zero-based index. An instance of the ndarray class can be constructed with the different array creation routines described later in the tutorial. A basic ndarray is created using the array function in NumPy −

numpy.array
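As a minimal sketch of the two properties described above (one shared item type and zero-based indexing), the snippet below builds small arrays with numpy.array. The variable names are chosen only for illustration.

```python
import numpy as np

# Build a one-dimensional ndarray from a Python list.
a = np.array([10, 20, 30])

first = a[0]    # zero-based: index 0 is the first item
last = a[-1]    # negative indices count from the end

# Mixing integers and a float still yields a single dtype for every item.
mixed = np.array([1, 2, 3.5])

print(first, last)   # 10 30
print(mixed.dtype)   # float64
```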

The following examples show basic data handling in NumPy.

Example 1

# more than one dimension
import numpy as np
a = np.array([[1, 2], [3, 4]])
print(a)

The output is as follows −

[[1 2]
 [3 4]]

Example 2

# minimum dimensions
import numpy as np
a = np.array([1, 2, 3, 4, 5], ndmin=2)
print(a)

The output is as follows −

[[1 2 3 4 5]]

Example 3

# dtype parameter
import numpy as np
a = np.array([1, 2, 3], dtype=complex)
print(a)

The output is as follows −

[1.+0.j 2.+0.j 3.+0.j]
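To see what the three examples above actually change, it can help to inspect each array's ndim, shape, and dtype attributes. This is only a sketch; the variable names are illustrative, and the integer dtype printed for the first two arrays depends on the platform.

```python
import numpy as np

two_d = np.array([[1, 2], [3, 4]])               # nested lists give 2 dimensions
padded = np.array([1, 2, 3, 4, 5], ndmin=2)      # ndmin=2 adds a leading axis
as_complex = np.array([1, 2, 3], dtype=complex)  # dtype forces the item type

for name, arr in [("two_d", two_d), ("padded", padded), ("as_complex", as_complex)]:
    # ndim: number of axes, shape: size along each axis, dtype: item type
    print(name, arr.ndim, arr.shape, arr.dtype)
```

Here padded has shape (1, 5) rather than (5,), which is the effect of ndmin = 2.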

Data Operations in Pandas

Pandas handles data through Series, DataFrame, and Panel (Panel is available only in pandas versions before 0.25). We will see an example of each.

Pandas Series

A Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index. A pandas Series can be created using the following constructor −

pandas.Series( data, index, dtype, copy)

Example

Here we create a Series from a NumPy array.

# import the pandas library under the alias pd
import pandas as pd
import numpy as np
data = np.array(['a', 'b', 'c', 'd'])
s = pd.Series(data)
print(s)

Its output is as follows −

0   a
1   b
2   c
3   d
dtype: object
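The index argument of the constructor replaces the default 0, 1, 2, ... labels. The sketch below passes arbitrary labels (100 to 103, chosen only for illustration) and contrasts label-based lookup with position-based lookup.

```python
import numpy as np
import pandas as pd

data = np.array(['a', 'b', 'c', 'd'])

# Supply explicit axis labels instead of the default 0..3.
s = pd.Series(data, index=[100, 101, 102, 103])

by_label = s.loc[101]     # look up by index label
by_position = s.iloc[0]   # look up by zero-based position

print(by_label)      # b
print(by_position)   # a
```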

Pandas DataFrame

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. A pandas DataFrame can be created using the following constructor −

pandas.DataFrame( data, index, columns, dtype, copy)

Let us now create an indexed DataFrame from a dict of lists.

import pandas as pd
data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])
print(df)

Its output is as follows (recent pandas versions keep the dict's key order for the columns) −

        Name  Age
rank1    Tom   28
rank2   Jack   34
rank3  Steve   29
rank4  Ricky   42
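Once the DataFrame has row labels and column names, both can be used to pick out data. The following sketch reuses the same data and shows column selection, row lookup by label, and a simple boolean filter; the variable names are illustrative.

```python
import pandas as pd

data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])

ages = df['Age']                    # one column, returned as a Series
row = df.loc['rank2']               # one row, looked up by index label
jack_age = df.loc['rank2', 'Age']   # a single cell: row label, column name
older = df[df['Age'] > 30]          # keep only rows where Age exceeds 30

print(jack_age)            # 34
print(list(older.index))   # ['rank2', 'rank4']
```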

Pandas Panel

A Panel is a 3D container of data. The term panel data is derived from econometrics and is partially responsible for the name pandas − pan(el)-da(ta)-s. Panel was deprecated in pandas 0.20 and removed in pandas 0.25, so the code in this section runs only on older releases.

A Panel can be created using the following constructor −

pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)

In the example below we create a Panel from a dict of DataFrame objects.

# creating a panel from a dict of DataFrames (requires pandas < 0.25)
import pandas as pd
import numpy as np

data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
        'Item2': pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p)

Its output is as follows −

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2
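On pandas 0.25 and later, where Panel no longer exists, one common substitute is a DataFrame with a two-level index. The sketch below is an assumption about how the same dict could be handled today, not part of the original example: pd.concat stacks the frames and turns the dict keys into the outer index level. The level names 'item' and 'row' are made up for illustration.

```python
import numpy as np
import pandas as pd

data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
        'Item2': pd.DataFrame(np.random.randn(4, 2))}

# Stack the frames vertically; the dict keys become the outer index level.
stacked = pd.concat(data, names=['item', 'row'])

# Selecting an outer label returns that item's rows as an ordinary DataFrame.
item2 = stacked.loc['Item2']

print(stacked.shape)   # (8, 3)
print(item2.shape)     # (4, 3)
```

Because Item2 has only two columns, its third column in the stacked frame is filled with NaN, much as the Panel padded the shorter item along the minor axis.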