Python Pandas Tutorial

Pandas is an open-source, BSD-licensed Python library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. This tutorial is written for readers who want to learn both the foundations and the advanced features of the Pandas package. Python with Pandas is used in a wide range of academic and commercial fields, including finance, economics, statistics, and analytics. In this tutorial, we will learn the various features of Pandas and how to use them in practice.

What is Pandas?

Pandas is a powerful Python library designed for working with "relational" or "labeled" data. Its aim is to support practical, real-world data analysis in Python. Its flexibility and functionality make it indispensable for a variety of data-related tasks, and it works well for data manipulation, dataset exploration, data analysis, and machine learning-related tasks. To use it, first install it with a pip command such as "pip install pandas" and then import it with "import pandas as pd". After installing and importing it, we can use its functions to work with datasets and data frames. The versatility and ease of use of Pandas make it a go-to tool for working with structured data in Python.
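As a minimal sketch of the install-and-import step described above, the snippet below imports the library under the conventional alias and builds a tiny table to confirm that everything works. The column names and values are made up for illustration.

```python
# Install first from a shell (not from inside Python):
#   pip install pandas
import pandas as pd

# A tiny, made-up table used only to confirm that the import works.
df = pd.DataFrame({"city": ["Oslo", "Lima"], "population_m": [0.7, 10.0]})

print(pd.__version__)  # the installed pandas version
print(df.shape)        # (rows, columns) of the table
```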

Pandas provides two main data structures, Series and DataFrame. A Series is a one-dimensional labeled array that can hold data of any type, such as integers, strings, and objects, while a DataFrame is a two-dimensional data structure that stores and manipulates data in tabular form (using rows and columns).
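The two structures can be sketched as follows. The fruit names, prices, and stock counts are hypothetical sample data chosen only to show the labels at work.

```python
import pandas as pd

# A Series is a one-dimensional labeled array; the index holds the labels.
prices = pd.Series([1.5, 2.0, 0.75], index=["apple", "banana", "cherry"])

# A DataFrame is a two-dimensional table with labeled rows and columns.
inventory = pd.DataFrame(
    {"price": [1.5, 2.0, 0.75], "stock": [10, 0, 25]},
    index=["apple", "banana", "cherry"],
)

banana_price = prices["banana"]        # label-based lookup in a Series
stock_column = inventory["stock"]      # each DataFrame column is a Series
cherry_row = inventory.loc["cherry"]   # one row, selected by its label
```

Selecting a single column or a single row of a DataFrame gives back a Series, which is why the two structures are usually learned together.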

Why Pandas?

The beauty of Pandas is that it simplifies many of the time-consuming, repetitive tasks involved in working with data frames, such as:

  1. Import datasets - available in the form of spreadsheets, comma-separated values (CSV) files, and more.

  2. Data cleansing - dealing with missing values and representing them as NaN, NA, or NaT.

  3. Size mutability - columns can be added and removed from DataFrame and higher-dimensional objects.

  4. Data normalization - normalize the data into a suitable format for analysis.

  5. Data alignment and merging - objects can be explicitly aligned to a set of labels, and datasets can be merged and joined intuitively.

  6. Reshaping and pivoting of datasets - datasets can be reshaped and pivoted as needed.

  7. Efficient manipulation and extraction - manipulation and extraction of specific parts of extensive datasets using intelligent label-based slicing, indexing, and subsetting techniques.

  8. Statistical analysis - perform statistical operations on datasets.

  9. Data visualization - visualize datasets and uncover insights.
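A few of the tasks listed above (handling missing values, adding and removing columns, label-based slicing, and alignment) can be sketched in a handful of lines. The weather readings are invented sample data, and filling gaps with the column mean is just one possible cleaning strategy, not a recommendation.

```python
import numpy as np
import pandas as pd

# Made-up sample data; np.nan marks a missing value.
df = pd.DataFrame(
    {"temp_c": [21.5, np.nan, 19.0, 23.5], "humidity": [40, 55, np.nan, 60]},
    index=["mon", "tue", "wed", "thu"],
)

# Data cleansing: count the gaps, then fill them with each column's mean.
missing_per_column = df.isna().sum()
filled = df.fillna(df.mean())

# Size mutability: add one column and remove another.
filled["temp_f"] = filled["temp_c"] * 9 / 5 + 32
filled = filled.drop(columns=["humidity"])

# Label-based slicing: both end labels are included in the result.
midweek = filled.loc["tue":"wed", ["temp_c"]]

# Data alignment: arithmetic matches on labels, not on positions.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "z"])
total = a + b  # "x" has no partner in b, so the result there is NaN
```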

Applications of Pandas

The most common applications of Pandas are as follows:

  1. Data Cleaning: Pandas provides functionality to clean messy data, deal with incomplete or inconsistent data, handle missing values, remove duplicates, and standardise formats so that the data is ready for analysis.

  2. Data Exploration: Pandas makes it easy to summarise statistics, find trends, and visualise data using built-in plotting functions or its Matplotlib and Seaborn integration.

  3. Data Preparation: Pandas can pivot, melt, convert variables, and merge datasets based on common columns to prepare data for analysis.

  4. Data Analysis: Pandas supports descriptive statistics, time series analysis, group-by operations, and custom functions.

  5. Data Visualisation: Pandas has basic plotting capabilities of its own and works with data visualisation libraries such as Matplotlib, Seaborn, and Plotly.

  6. Time Series Analysis: Pandas supports date/time indexing, resampling, frequency conversion, and rolling statistics for time series data.

  7. Data Aggregation and Grouping: The groupby() method lets you aggregate data, compute group-wise summary statistics, or apply functions to groups.

  8. Data Input/Output: Pandas makes data input and export easy by reading and writing CSV, Excel, JSON, SQL databases, and more.

  9. Machine Learning: Pandas works well with Scikit-learn for data preparation, feature engineering, and model input data.

  10. Web Scraping: Pandas can be used together with BeautifulSoup or Scrapy to parse and analyse structured data extracted from web pages.

  11. Financial Analysis: Pandas is commonly used in finance for stock market data analysis, financial indicator calculation, and portfolio optimisation.

  12. Text Data Analysis: Pandas' string methods and regular expression support help analyse textual data.

  13. Experimental Data Analysis: Pandas makes manipulating and analysing large datasets, performing statistical tests, and visualising results easy.
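Several of the applications above (data cleaning, data preparation, grouping, and time series analysis) can be combined in one short pipeline. This is only a sketch: the sales and store tables are hypothetical, and weekly resampling with "W" is one choice of frequency among many.

```python
import pandas as pd

# Hypothetical sales records, including one exact duplicate row.
sales = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02", "2024-01-08"]
        ),
        "store": ["A", "A", "B", "A", "B"],
        "amount": [100.0, 100.0, 80.0, 50.0, 120.0],
    }
)
# Hypothetical lookup table mapping each store to a region.
stores = pd.DataFrame({"store": ["A", "B"], "region": ["north", "south"]})

# Data cleaning: drop rows that are exact duplicates.
clean = sales.drop_duplicates()

# Data preparation: merge on the shared "store" column.
merged = clean.merge(stores, on="store", how="left")

# Aggregation and grouping: total amount per region.
by_region = merged.groupby("region")["amount"].sum()

# Time series analysis: total amount per calendar week.
weekly = merged.set_index("date")["amount"].resample("W").sum()
```

With this sample data, the duplicate row is counted once, so the regional totals are 150.0 for north and 200.0 for south.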

Audience: Who Should Learn Pandas

This Pandas tutorial has been prepared for those who want to learn about the foundations and advanced features of the Pandas Python package. Pandas is widely used in data science, engineering, research, agricultural science, management, statistics, and other fields where working with datasets means exploring data frames to find the insights needed for sound decisions. After completing this tutorial, you should be comfortable with the Pandas package and ready to build expertise in other Python packages such as Matplotlib, SciPy, scikit-learn, and scikit-image.

The Pandas library builds on NumPy and uses much of its functionality. We suggest that you go through our tutorial on NumPy.
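The relationship with NumPy can be illustrated with a small sketch, assuming NumPy is installed alongside Pandas (it is a required dependency). The values are arbitrary sample numbers.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])

values = s.to_numpy()  # the Series data as a plain NumPy array
roots = np.sqrt(s)     # NumPy functions accept a Series and keep its index
```

Because NumPy functions operate on Pandas objects directly, familiarity with NumPy arrays carries over to most numeric work in Pandas.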

Prerequisites To Learn Pandas

You should have a basic understanding of computer programming. Familiarity with Python or another programming language is a plus. Basic knowledge of statistics and mathematics is helpful for data analysis and interpretation, since Pandas provides functions for descriptive statistics, aggregation, and the computation of summary metrics. With this foundation, you will be well equipped to use Pandas for data manipulation and analysis tasks.
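To give a feel for the statistical background that helps here, the sketch below computes descriptive statistics and simple aggregates. The group labels and scores are invented sample data.

```python
import pandas as pd

# Made-up scores for two hypothetical groups.
scores = pd.DataFrame(
    {"group": ["a", "a", "b", "b"], "score": [70, 90, 60, 80]}
)

# Descriptive statistics: count, mean, std, min, quartiles, and max.
summary = scores["score"].describe()

# Aggregation: several summary metrics in one call.
overall = scores["score"].agg(["mean", "min", "max"])

# Group-wise summary: the mean score within each group.
per_group = scores.groupby("group")["score"].mean()
```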

Pandas Codebase

You can find Pandas example code in the pandas-cookbook repository at https://github.com/jvns/pandas-cookbook. The source code of the Pandas library itself is hosted at https://github.com/pandas-dev/pandas.