Scaling to large datasets

pandas provides data structures for in-memory analytics, which makes using pandas to analyze datasets that are larger than memory somewhat tricky. Even datasets that are a sizable fraction of memory become unwieldy, as some pandas operations need to make intermediate copies.

This document provides a few recommendations for scaling your analysis to larger datasets. It’s a complement to Enhancing performance, which focuses on speeding up analysis for datasets that fit in memory.

Load less data

Suppose our raw dataset on disk has many columns.

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: def make_timeseries(start="2000-01-01", end="2000-12-31", freq="1D", seed=None):
   ...:     index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
   ...:     n = len(index)
   ...:     state = np.random.RandomState(seed)
   ...:     columns = {
   ...:         "name": state.choice(["Alice", "Bob", "Charlie"], size=n),
   ...:         "id": state.poisson(1000, size=n),
   ...:         "x": state.rand(n) * 2 - 1,
   ...:         "y": state.rand(n) * 2 - 1,
   ...:     }
   ...:     df = pd.DataFrame(columns, index=index, columns=sorted(columns))
   ...:     if df.index[-1] == end:
   ...:         df = df.iloc[:-1]
   ...:     return df
   ...:

In [4]: timeseries = [
   ...:     make_timeseries(freq="1min", seed=i).rename(columns=lambda x: f"{x}_{i}")
   ...:     for i in range(10)
   ...: ]
   ...:

In [5]: ts_wide = pd.concat(timeseries, axis=1)

In [6]: ts_wide.head()
Out[6]:
                     id_0 name_0       x_0  ...   name_9       x_9       y_9
timestamp                                   ...
2000-01-01 00:00:00   977  Alice -0.821225  ...  Charlie -0.957208 -0.757508
2000-01-01 00:01:00  1018    Bob -0.219182  ...    Alice -0.414445 -0.100298
2000-01-01 00:02:00   927  Alice  0.660908  ...  Charlie -0.325838  0.581859
2000-01-01 00:03:00   997    Bob -0.852458  ...      Bob  0.992033 -0.686692
2000-01-01 00:04:00   965    Bob  0.717283  ...  Charlie -0.924556 -0.184161

[5 rows x 40 columns]

In [7]: ts_wide.to_parquet("timeseries_wide.parquet")

To load the columns we want, we have two options. Option 1 loads in all the data and then filters to what we need.

In [8]: columns = ["id_0", "name_0", "x_0", "y_0"]

In [9]: pd.read_parquet("timeseries_wide.parquet")[columns]
Out[9]:
                     id_0 name_0       x_0       y_0
timestamp
2000-01-01 00:00:00   977  Alice -0.821225  0.906222
2000-01-01 00:01:00  1018    Bob -0.219182  0.350855
2000-01-01 00:02:00   927  Alice  0.660908 -0.798511
2000-01-01 00:03:00   997    Bob -0.852458  0.735260
2000-01-01 00:04:00   965    Bob  0.717283  0.393391
...                   ...    ...       ...       ...
2000-12-30 23:56:00  1037    Bob -0.814321  0.612836
2000-12-30 23:57:00   980    Bob  0.232195 -0.618828
2000-12-30 23:58:00   965  Alice -0.231131  0.026310
2000-12-30 23:59:00   984  Alice  0.942819  0.853128
2000-12-31 00:00:00  1003  Alice  0.201125 -0.136655

[525601 rows x 4 columns]

Option 2 only loads the columns we request.

In [10]: pd.read_parquet("timeseries_wide.parquet", columns=columns)
Out[10]:
                     id_0 name_0       x_0       y_0
timestamp
2000-01-01 00:00:00   977  Alice -0.821225  0.906222
2000-01-01 00:01:00  1018    Bob -0.219182  0.350855
2000-01-01 00:02:00   927  Alice  0.660908 -0.798511
2000-01-01 00:03:00   997    Bob -0.852458  0.735260
2000-01-01 00:04:00   965    Bob  0.717283  0.393391
...                   ...    ...       ...       ...
2000-12-30 23:56:00  1037    Bob -0.814321  0.612836
2000-12-30 23:57:00   980    Bob  0.232195 -0.618828
2000-12-30 23:58:00   965  Alice -0.231131  0.026310
2000-12-30 23:59:00   984  Alice  0.942819  0.853128
2000-12-31 00:00:00  1003  Alice  0.201125 -0.136655

[525601 rows x 4 columns]

If we were to measure the memory usage of the two calls, we’d see that specifying columns uses about 1/10th the memory in this case.
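
One way to check this yourself is IPython's %memit magic; the sketch below assumes the optional memory_profiler package is installed to provide it.

# assumes the optional memory_profiler package, which provides %memit
%load_ext memory_profiler

%memit pd.read_parquet("timeseries_wide.parquet")[columns]
%memit pd.read_parquet("timeseries_wide.parquet", columns=columns)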

With pandas.read_csv(), you can specify usecols to limit the columns read into memory. Not all file formats that can be read by pandas provide an option to read a subset of columns.
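
For example, if the same data lived in a hypothetical timeseries_wide.csv, a minimal sketch of the equivalent call would be:

pd.read_csv("timeseries_wide.csv", usecols=["id_0", "name_0", "x_0", "y_0"])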

Use efficient datatypes

The default pandas data types are not the most memory efficient. This is especially true for text data columns with relatively few unique values (commonly referred to as “low-cardinality” data). By using more efficient data types, you can store larger datasets in memory.

In [11]: ts = make_timeseries(freq="30s", seed=0)

In [12]: ts.to_parquet("timeseries.parquet")

In [13]: ts = pd.read_parquet("timeseries.parquet")

In [14]: ts
Out[14]:
                       id     name         x         y
timestamp
2000-01-01 00:00:00  1041    Alice  0.889987  0.281011
2000-01-01 00:00:30   988      Bob -0.455299  0.488153
2000-01-01 00:01:00  1018    Alice  0.096061  0.580473
2000-01-01 00:01:30   992      Bob  0.142482  0.041665
2000-01-01 00:02:00   960      Bob -0.036235  0.802159
...                   ...      ...       ...       ...
2000-12-30 23:58:00  1022    Alice  0.266191  0.875579
2000-12-30 23:58:30   974    Alice -0.009826  0.413686
2000-12-30 23:59:00  1028  Charlie  0.307108 -0.656789
2000-12-30 23:59:30  1002    Alice  0.202602  0.541335
2000-12-31 00:00:00   987    Alice  0.200832  0.615972

[1051201 rows x 4 columns]

Now, let’s inspect the data types and memory usage to see where we should focus our attention.

In [15]: ts.dtypes
Out[15]:
id        int64
name     object
x       float64
y       float64
dtype: object

In [16]: ts.memory_usage(deep=True)  # memory usage in bytes
Out[16]:
Index     8409608
id        8409608
name     65176434
x         8409608
y         8409608
dtype: int64

The name column is taking up much more memory than any other. It has just a few unique values, so it’s a good candidate for converting to a pandas.Categorical. With a pandas.Categorical, we store each unique name once and use space-efficient integers to know which specific name is used in each row.

In [17]: ts2 = ts.copy()

In [18]: ts2["name"] = ts2["name"].astype("category")

In [19]: ts2.memory_usage(deep=True)
Out[19]:
Index    8409608
id       8409608
name     1051495
x        8409608
y        8409608
dtype: int64

We can go a bit further and downcast the numeric columns to their smallest types using pandas.to_numeric().

In [20]: ts2["id"] = pd.to_numeric(ts2["id"], downcast="unsigned")

In [21]: ts2[["x", "y"]] = ts2[["x", "y"]].apply(pd.to_numeric, downcast="float")

In [22]: ts2.dtypes
Out[22]:
id        uint16
name    category
x        float32
y        float32
dtype: object

In [23]: ts2.memory_usage(deep=True)
Out[23]:
Index    8409608
id       2102402
name     1051495
x        4204804
y        4204804
dtype: int64

In [24]: reduction = ts2.memory_usage(deep=True).sum() / ts.memory_usage(deep=True).sum()

In [25]: print(f"{reduction:0.2f}")
0.20

In all, we’ve reduced the in-memory footprint of this dataset to 1/5 of its original size.
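
If your raw data comes from CSV, you can often request the efficient types while parsing, so the memory-hungry default representation is never created. A minimal sketch, assuming a hypothetical timeseries.csv with the same columns:

ts_efficient = pd.read_csv(
    "timeseries.csv",
    index_col="timestamp",
    parse_dates=["timestamp"],
    dtype={"name": "category", "id": "uint16", "x": "float32", "y": "float32"},
)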

See Categorical data for more on pandas.Categorical and dtypes for an overview of all of pandas’ dtypes.

Use chunking

Some workloads can be achieved with chunking: splitting a large problem into a bunch of small problems, for example converting each individual CSV file in a directory into a Parquet file, one file at a time. As long as each chunk fits in memory, you can work with datasets that are much larger than memory.
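
A minimal sketch of that conversion, assuming a hypothetical data/csv directory whose individual files each fit in memory:

import pathlib

import pandas as pd

for csv_path in pathlib.Path("data/csv").glob("*.csv"):
    df = pd.read_csv(csv_path)                       # only one file is in memory at a time
    df.to_parquet(csv_path.with_suffix(".parquet"))  # write the converted chunk back out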

Chunking works well when the operation you’re performing requires zero or minimal coordination between chunks. For more complicated workflows, you’re better off using other libraries.

Suppose we have an even larger “logical dataset” on disk that’s a directory of parquet files. Each file in the directory represents a different year of the entire dataset.

In [26]: import pathlib

In [27]: N = 12

In [28]: starts = [f"20{i:>02d}-01-01" for i in range(N)]

In [29]: ends = [f"20{i:>02d}-12-13" for i in range(N)]

In [30]: pathlib.Path("data/timeseries").mkdir(exist_ok=True)

In [31]: for i, (start, end) in enumerate(zip(starts, ends)):
   ....:     ts = make_timeseries(start=start, end=end, freq="1min", seed=i)
   ....:     ts.to_parquet(f"data/timeseries/ts-{i:0>2d}.parquet")
   ....:
data
└── timeseries
    ├── ts-00.parquet
    ├── ts-01.parquet
    ├── ts-02.parquet
    ├── ts-03.parquet
    ├── ts-04.parquet
    ├── ts-05.parquet
    ├── ts-06.parquet
    ├── ts-07.parquet
    ├── ts-08.parquet
    ├── ts-09.parquet
    ├── ts-10.parquet
    └── ts-11.parquet

Now we’ll implement an out-of-core pandas.Series.value_counts(). The peak memory usage of this workflow is the single largest chunk, plus a small series storing the unique value counts up to this point. As long as each individual file fits in memory, this will work for arbitrary-sized datasets.

In [32]: %%time
   ....: files = pathlib.Path("data/timeseries/").glob("ts*.parquet")
   ....: counts = pd.Series(dtype=int)
   ....: for path in files:
   ....:     df = pd.read_parquet(path)
   ....:     counts = counts.add(df["name"].value_counts(), fill_value=0)
   ....: counts.astype(int)
   ....:
CPU times: user 760 ms, sys: 26.1 ms, total: 786 ms
Wall time: 559 ms
Out[32]:
name
Alice      1994645
Bob        1993692
Charlie    1994875
dtype: int64

Some readers, like pandas.read_csv(), offer parameters to control the chunksize when reading a single file.
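
For instance, the value-count workflow above could process a single large, hypothetical timeseries.csv in pieces of one million rows:

counts = pd.Series(dtype=int)
for chunk in pd.read_csv("timeseries.csv", chunksize=1_000_000):
    counts = counts.add(chunk["name"].value_counts(), fill_value=0)
counts = counts.astype(int)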

Manual chunking is an OK option for workflows that don't require overly sophisticated operations. Some operations, like pandas.DataFrame.groupby(), are much harder to do chunkwise. In these cases, you may be better off switching to a different library that implements these out-of-core algorithms for you.
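
As a sketch of why this gets awkward: even a chunkwise groupby mean over the parquet directory above has to be decomposed by hand into pieces that can be combined across chunks (running per-group sums and counts).

sums = pd.Series(dtype=float)
lens = pd.Series(dtype=int)
for path in pathlib.Path("data/timeseries/").glob("ts*.parquet"):
    df = pd.read_parquet(path)
    grouped = df.groupby("name")["x"]
    sums = sums.add(grouped.sum(), fill_value=0)    # running per-group sums
    lens = lens.add(grouped.count(), fill_value=0)  # running per-group row counts
means = sums / lens                                 # only now can the means be formed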

Use other libraries

There are other libraries that provide APIs similar to pandas and work nicely with pandas DataFrames, and they can scale your large-dataset processing and analytics through parallel runtimes, distributed memory, clustering, and so on. You can find more information on the ecosystem page.