Python Data Science 简明教程
Python - Processing XLS Data
Microsoft Excel 是一款使用非常广泛的电子表格程序。其用户友好性以及吸引人的特性使其成为了数据科学中使用频率非常高的工具。Pandas 库提供了一些功能,我们可以使用这些功能对 Excel 文件进行全面读取,也可以只对选定的数据组的部分进行读取。我们还可以读取其中包含多个工作表的 Excel 文件。我们使用 read_excel 函数从中读取数据。
Microsoft Excel is a very widely used spread sheet program. Its user friendliness and appealing features makes it a very frequently used tool in Data Science. The Panadas library provides features using which we can read the Excel file in full as well as in parts for only a selected group of Data. We can also read an Excel file with multiple sheets in it. We use the read_excel function to read the data from it.
Input as Excel File
我们在 Windows 操作系统下创建了一个包含多个工作表的 Excel 文件。不同工作表中的数据如下所示。
We Create an excel file with multiple sheets in the windows OS. The Data in the different sheets is as shown below.
你可以使用 Windows 操作系统中的 Excel 程序创建此文件。将文件另存为 input.xlsx 。
You can create this file using the Excel Program in windows OS. Save the file as input.xlsx.
# Data in Sheet1
id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Tusar,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Rasmi,578,2013-05-21,IT
7,Pranab,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance
# Data in Sheet2
id name zipcode
1 Rick 301224
2 Dan 341255
3 Tusar 297704
4 Ryan 216650
5 Gary 438700
6 Rasmi 665100
7 Pranab 341211
8 Guru 347480
Reading an Excel File
Pandas 库的 read_excel 函数用于将 Excel 文件的内容作为熊猫数据框读取到 Python 环境中。该函数可以通过使用指向文件正确路径来从 OS 中读取文件。默认情况下,该函数将读取 Sheet1。
The read_excel function of the pandas library is used read the content of an Excel file into the python environment as a pandas DataFrame. The function can read the files from the OS by using proper path to the file. By default, the function will read Sheet1.
import pandas as pd
data = pd.read_excel('path/input.xlsx')
print (data)
当我们执行上面的代码时,它会产生以下结果。请注意该函数如何创建了一个以零开头的附加列作为索引。
When we execute the above code, it produces the following result. Please note how an additional column starting with zero as a index has been created by the function.
id name salary start_date dept
0 1 Rick 623.30 2012-01-01 IT
1 2 Dan 515.20 2013-09-23 Operations
2 3 Tusar 611.00 2014-11-15 IT
3 4 Ryan 729.00 2014-05-11 HR
4 5 Gary 843.25 2015-03-27 Finance
5 6 Rasmi 578.00 2013-05-21 IT
6 7 Pranab 632.80 2013-07-30 Operations
7 8 Guru 722.50 2014-06-17 Finance
Reading Specific Columns and Rows
类似于我们在上一个章节中已经看到的内容,Pandas 库的 read_excel 函数也可用于读取一些特定列和特定行,以读取 CSV 文件。我们为此目的使用了名为 .loc() 的多轴索引方法。我们选择为某些行显示薪水和姓名列。
Similar to what we have already seen in the previous chapter to read the CSV file, the read_excel function of the pandas library can also be used to read some specific columns and specific rows. We use the multi-axes indexing method called .loc() for this purpose. We choose to display the salary and name column for some of the rows.
import pandas as pd
data = pd.read_excel('path/input.xlsx')
# Use the multi-axes indexing funtion
print (data.loc[[1,3,5],['salary','name']])
当我们执行上面的代码时,它会产生以下结果。
When we execute the above code, it produces the following result.
salary name
1 515.2 Dan
3 729.0 Ryan
5 578.0 Rasmi
Reading Multiple Excel Sheets
也可以使用名为 ExcelFile 的包装器类的帮助使用 read_excel 函数读取具有不同数据格式的多个工作表。它只将多个工作表读取到内存中一次。在下面的示例中,我们将 sheet1 和 sheet2 读入两个数据框中,并分别打印出来。
Multiple sheets with different Data formats can also be read by using read_excel function with help of a wrapper class named ExcelFile. It will read the multiple sheets into memory only once. In the below example we read sheet1 and sheet2 into two data frames and print them out individually.
import pandas as pd
with pd.ExcelFile('C:/Users/Rasmi/Documents/pydatasci/input.xlsx') as xls:
df1 = pd.read_excel(xls, 'Sheet1')
df2 = pd.read_excel(xls, 'Sheet2')
print("****Result Sheet 1****")
print (df1[0:5]['salary'])
print("")
print("***Result Sheet 2****")
print (df2[0:5]['zipcode'])
当我们执行上面的代码时,它会产生以下结果。
When we execute the above code, it produces the following result.
****Result Sheet 1****
0 623.30
1 515.20
2 611.00
3 729.00
4 843.25
Name: salary, dtype: float64
***Result Sheet 2****
0 301224
1 341255
2 297704
3 216650
4 438700
Name: zipcode, dtype: int64