Python Data Science 简明教程

Python - Processing CSV Data

从 CSV（逗号分隔值）中读取数据是数据科学中的基本需要。通常，我们会从各个来源获得数据，这些数据可以导出为 CSV 格式，以便其他系统可以使用它们。Panadas 库提供了可以使用哪些功能来完整地读取 CSV 文件，以及仅针对选定的列和行组的部分读取 CSV 文件。

Reading data from CSV(comma separated values) is a fundamental necessity in Data Science. Often, we get data from various sources which can get exported to CSV format so that they can be used by other systems. The Panadas library provides features using which we can read the CSV file in full as well as in parts for only a selected group of columns and rows.

Input as CSV File

csv 文件是一个文本文件，其中各列中的值由逗号分隔。让我们考虑名为 input.csv 的文件中存在的以下数据。

The csv file is a text file in which the values in the columns are separated by a comma. Let’s consider the following data present in the file named input.csv.

您可以通过复制并粘贴此数据来使用 Windows 记事本创建此文件。在记事本中使用另存为所有文件( . ) 选项将文件保存为 input.csv 。

You can create this file using windows notepad by copying and pasting this data. Save the file as input.csv using the save As All files(.) option in notepad.

id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Tusar,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Rasmi,578,2013-05-21,IT
7,Pranab,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance

Reading a CSV File

pandas 库的 read_csv 函数用于将 CSV 文件的内容作为 pandas DataFrame 读取到 python 环境中。该函数可以使用文件的正确路径从 OS 中读取文件。

The read_csv function of the pandas library is used read the content of a CSV file into the python environment as a pandas DataFrame. The function can read the files from the OS by using proper path to the file.

import pandas as pd
data = pd.read_csv('path/input.csv')
print (data)

当我们执行上面的代码时，它会产生以下结果。请注意该函数如何创建了一个以零开头的附加列作为索引。

When we execute the above code, it produces the following result. Please note how an additional column starting with zero as a index has been created by the function.

   id    name  salary  start_date        dept
0   1    Rick  623.30  2012-01-01          IT
1   2     Dan  515.20  2013-09-23  Operations
2   3   Tusar  611.00  2014-11-15          IT
3   4    Ryan  729.00  2014-05-11          HR
4   5    Gary  843.25  2015-03-27     Finance
5   6   Rasmi  578.00  2013-05-21          IT
6   7  Pranab  632.80  2013-07-30  Operations
7   8    Guru  722.50  2014-06-17     Finance

Reading Specific Rows

pandas 库的 read_csv 函数也可用于为给定列读取某些特定行。我们使用以下代码从 read_csv 函数中截取结果，以获得名为 salary 的列的前 5 行。

The read_csv function of the pandas library can also be used to read some specific rows for a given column. We slice the result from the read_csv function using the code shown below for first 5 rows for the column named salary.

import pandas as pd
data = pd.read_csv('path/input.csv')

# Slice the result for first 5 rows
print (data[0:5]['salary'])

当我们执行上面的代码时，它会产生以下结果。

When we execute the above code, it produces the following result.

0    623.30
1    515.20
2    611.00
3    729.00
4    843.25
Name: salary, dtype: float64

Reading Specific Columns

pandas 库的 read_csv 函数也可用于读取某些特定列。为此，我们使用多轴索引方法 .loc() 。我们选择显示所有行的 salary 和 name 列。

The read_csv function of the pandas library can also be used to read some specific columns. We use the multi-axes indexing method called .loc() for this purpose. We choose to display the salary and name column for all the rows.

import pandas as pd
data = pd.read_csv('path/input.csv')

# Use the multi-axes indexing funtion
print (data.loc[:,['salary','name']])

当我们执行上面的代码时，它会产生以下结果。

When we execute the above code, it produces the following result.

   salary    name
0  623.30    Rick
1  515.20     Dan
2  611.00   Tusar
3  729.00    Ryan
4  843.25    Gary
5  578.00   Rasmi
6  632.80  Pranab
7  722.50    Guru

Reading Specific Columns and Rows

read_csv 函数在 pandas 库中也可以用于读取一些特定列和特定行。在此目的下，我们采用了叫做 .loc() 的多轴索引法。我们选择为一些行显示工资和名称列。

The read_csv function of the pandas library can also be used to read some specific columns and specific rows. We use the multi-axes indexing method called .loc() for this purpose. We choose to display the salary and name column for some of the rows.

import pandas as pd
data = pd.read_csv('path/input.csv')

# Use the multi-axes indexing funtion
print (data.loc[[1,3,5],['salary','name']])

当我们执行上面的代码时，它会产生以下结果。

When we execute the above code, it produces the following result.

   salary   name
1   515.2    Dan
3   729.0   Ryan
5   578.0  Rasmi

Reading Specific Columns for a Range of Rows

read_csv 函数在 pandas 库中也可以用于读取一些特定列和一系列行。在此目的下，我们采用了叫做 .loc() 的多轴索引法。我们选择为一些行显示工资和名称列。

The read_csv function of the pandas library can also be used to read some specific columns and a range of rows. We use the multi-axes indexing method called .loc() for this purpose. We choose to display the salary and name column for some of the rows.

import pandas as pd
data = pd.read_csv('path/input.csv')

# Use the multi-axes indexing funtion
print (data.loc[2:6,['salary','name']])

当我们执行上面的代码时，它会产生以下结果。

When we execute the above code, it produces the following result.

   salary    name
2  611.00   Tusar
3  729.00    Ryan
4  843.25    Gary
5  578.00   Rasmi
6  632.80  Pranab