Python Data Science 简明教程
Python - Data Wrangling
数据争吵涉及以各种格式处理数据,例如合并、分组、连接等,以便对其进行分析或准备好与另一组数据一起使用。Python 内置了函数可将这些争吵方法应用于各种数据集,以实现分析目标。在本章中,我们将看几个描述这些方法的示例。
Data wrangling involves processing the data in various formats like - merging, grouping, concatenating etc. for the purpose of analysing or getting them ready to be used with another set of data. Python has built-in features to apply these wrangling methods to various data sets to achieve the analytical goal. In this chapter we will look at few examples describing these methods.
Merging Data
python 中的 Pandas 库提供了一个单一函数 merge ,作为 DataFrame 对象之间所有标准数据库联接操作的入口点 −
The Pandas library in python provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects −
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True)
现在让我们创建两个不同的 DataFrame 并对它们执行合并操作。
Let us now create two different DataFrames and perform the merging operations on it.
# import the pandas library
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
{'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print left
print right
它的 output 如下所示 −
Its output is as follows −
Name id subject_id
0 Alex 1 sub1
1 Amy 2 sub2
2 Allen 3 sub4
3 Alice 4 sub6
4 Ayoung 5 sub5
Name id subject_id
0 Billy 1 sub2
1 Brian 2 sub4
2 Bran 3 sub3
3 Bryce 4 sub6
4 Betty 5 sub5
Grouping Data
对数据集进行分组是数据分析中的一个常见需求,在数据分析中,我们需要根据数据集中存在的各种组来获得结果。Panadas 有内置的方法,可以将数据滚动到各个组。
Grouping data sets is a frequent need in data analysis where we need the result in terms of various groups present in the data set. Panadas has in-built methods which can roll the data into various groups.
在下面的示例中,我们按年份对数据进行分组,然后获取特定年份的结果。
In the below example we group the data by year and then get the result for a specific year.
# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Year')
print grouped.get_group(2014)
它的 output 如下所示 −
Its output is as follows −
Points Rank Team Year
0 876 1 Riders 2014
2 863 2 Devils 2014
4 741 3 Kings 2014
9 701 4 Royals 2014
Concatenating Data
Pandas 提供了多种工具,可轻松组合 Series, DataFrame 和 Panel 对象。在下面的示例中, concat 函数沿着某一轴执行连接操作。让我们创建不同的对象并进行连接。
Pandas provides various facilities for easily combining together Series, DataFrame, and Panel objects. In the below example the concat function performs concatenation operations along an axis. Let us create different objects and do concatenation.
import pandas as pd
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
print pd.concat([one,two])
它的 output 如下所示 −
Its output is as follows −
Marks_scored Name subject_id
1 98 Alex sub1
2 90 Amy sub2
3 87 Allen sub4
4 69 Alice sub6
5 78 Ayoung sub5
1 89 Billy sub2
2 80 Brian sub4
3 79 Bran sub3
4 97 Bryce sub6
5 88 Betty sub5