Pandas Chinese Reference Guide
Time series / date functionality
pandas contains extensive capabilities and features for working with time series data in all domains. Using the NumPy datetime64 and timedelta64 dtypes, pandas has consolidated a large number of features from other Python libraries such as scikits.timeseries, and has added a great deal of new functionality for manipulating time series data.
For example, pandas supports:
Parsing time series information from various sources and formats
# The examples on this page assume: import numpy as np; import pandas as pd
In [1]: import datetime
In [2]: dti = pd.to_datetime(
...: ["1/1/2018", np.datetime64("2018-01-01"), datetime.datetime(2018, 1, 1)]
...: )
...:
In [3]: dti
Out[3]: DatetimeIndex(['2018-01-01', '2018-01-01', '2018-01-01'], dtype='datetime64[ns]', freq=None)
Generating sequences of fixed-frequency dates and time spans
In [4]: dti = pd.date_range("2018-01-01", periods=3, freq="h")
In [5]: dti
Out[5]:
DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 01:00:00',
'2018-01-01 02:00:00'],
dtype='datetime64[ns]', freq='h')
Manipulating and converting date times with timezone information
In [6]: dti = dti.tz_localize("UTC")
In [7]: dti
Out[7]:
DatetimeIndex(['2018-01-01 00:00:00+00:00', '2018-01-01 01:00:00+00:00',
'2018-01-01 02:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq='h')
In [8]: dti.tz_convert("US/Pacific")
Out[8]:
DatetimeIndex(['2017-12-31 16:00:00-08:00', '2017-12-31 17:00:00-08:00',
'2017-12-31 18:00:00-08:00'],
dtype='datetime64[ns, US/Pacific]', freq='h')
Resampling or converting a time series to a particular frequency
In [9]: idx = pd.date_range("2018-01-01", periods=5, freq="h")
In [10]: ts = pd.Series(range(len(idx)), index=idx)
In [11]: ts
Out[11]:
2018-01-01 00:00:00 0
2018-01-01 01:00:00 1
2018-01-01 02:00:00 2
2018-01-01 03:00:00 3
2018-01-01 04:00:00 4
Freq: h, dtype: int64
In [12]: ts.resample("2h").mean()
Out[12]:
2018-01-01 00:00:00 0.5
2018-01-01 02:00:00 2.5
2018-01-01 04:00:00 4.0
Freq: 2h, dtype: float64
Performing date and time arithmetic with absolute or relative time increments
In [13]: friday = pd.Timestamp("2018-01-05")
In [14]: friday.day_name()
Out[14]: 'Friday'
# Add 1 day
In [15]: saturday = friday + pd.Timedelta("1 day")
In [16]: saturday.day_name()
Out[16]: 'Saturday'
# Add 1 business day (Friday --> Monday)
In [17]: monday = friday + pd.offsets.BDay()
In [18]: monday.day_name()
Out[18]: 'Monday'
pandas provides a relatively compact and self-contained set of tools for performing the above tasks and more.
Overview
pandas captures 4 general time-related concepts:
- Date times: A specific date and time with timezone support. Similar to datetime.datetime from the standard library.
- Time deltas: An absolute time duration. Similar to datetime.timedelta from the standard library.
- Time spans: A span of time defined by a point in time and its associated frequency.
- Date offsets: A relative time duration that respects calendar arithmetic. Similar to dateutil.relativedelta.relativedelta from the dateutil package.
Concept | Scalar Class | Array Class | pandas Data Type | Primary Creation Method
--- | --- | --- | --- | ---
Date times | Timestamp | DatetimeIndex | datetime64[ns] or datetime64[ns, tz] | to_datetime or date_range
Time deltas | Timedelta | TimedeltaIndex | timedelta64[ns] | to_timedelta or timedelta_range
Time spans | Period | PeriodIndex | period[freq] | Period or period_range
Date offsets | DateOffset | None | None | DateOffset
For time series data, it’s conventional to represent the time component in the index of a Series or DataFrame so manipulations can be performed with respect to the time element.
In [19]: pd.Series(range(3), index=pd.date_range("2000", freq="D", periods=3))
Out[19]:
2000-01-01 0
2000-01-02 1
2000-01-03 2
Freq: D, dtype: int64
In [20]: pd.Series(pd.date_range("2000", freq="D", periods=3))
Out[20]:
0 2000-01-01
1 2000-01-02
2 2000-01-03
dtype: datetime64[ns]
Series and DataFrame have extended data type support and functionality for datetime, timedelta and Period data when passed into those constructors. DateOffset data, however, will be stored as object data.
In [21]: pd.Series(pd.period_range("1/1/2011", freq="M", periods=3))
Out[21]:
0 2011-01
1 2011-02
2 2011-03
dtype: period[M]
In [22]: pd.Series([pd.DateOffset(1), pd.DateOffset(2)])
Out[22]:
0 <DateOffset>
1 <2 * DateOffsets>
dtype: object
In [23]: pd.Series(pd.date_range("1/1/2011", freq="ME", periods=3))
Out[23]:
0 2011-01-31
1 2011-02-28
2 2011-03-31
dtype: datetime64[ns]
Lastly, pandas represents null date times, time deltas, and time spans as NaT, which is useful for representing missing or null date-like values and behaves much as np.nan does for float data.
In [24]: pd.Timestamp(pd.NaT)
Out[24]: NaT
In [25]: pd.Timedelta(pd.NaT)
Out[25]: NaT
In [26]: pd.Period(pd.NaT)
Out[26]: NaT
# Equality acts as np.nan would
In [27]: pd.NaT == pd.NaT
Out[27]: False
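Because NaT never compares equal to itself, an equality test cannot be used to locate missing dates. The following is a minimal sketch of the usual alternative, using a small made-up Series (the values are illustrative and not taken from the examples above): isna() flags the missing entries and fillna() replaces them.

```python
import pandas as pd

# Illustrative data: the middle entry is missing and is stored as NaT.
s = pd.Series(pd.to_datetime(["2018-01-01", None, "2018-01-03"]))

# Equality cannot detect NaT, so ask pandas directly which entries are missing.
missing = s.isna()

# Replace the missing entry with an explicit timestamp.
filled = s.fillna(pd.Timestamp("2018-01-02"))

print(missing.tolist())
print(filled.tolist())
```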
Timestamps vs. time spans
Timestamped data is the most basic type of time series data that associates values with points in time. For pandas objects it means using the points in time.
In [28]: import datetime
In [29]: pd.Timestamp(datetime.datetime(2012, 5, 1))
Out[29]: Timestamp('2012-05-01 00:00:00')
In [30]: pd.Timestamp("2012-05-01")
Out[30]: Timestamp('2012-05-01 00:00:00')
In [31]: pd.Timestamp(2012, 5, 1)
Out[31]: Timestamp('2012-05-01 00:00:00')
However, in many cases it is more natural to associate things like change variables with a time span instead. The span represented by Period can be specified explicitly, or inferred from the datetime string format.
For example:
In [32]: pd.Period("2011-01")
Out[32]: Period('2011-01', 'M')
In [33]: pd.Period("2012-05", freq="D")
Out[33]: Period('2012-05-01', 'D')
Timestamp and Period can serve as an index. Lists of Timestamp and Period are automatically coerced to DatetimeIndex and PeriodIndex respectively.
In [34]: dates = [
....: pd.Timestamp("2012-05-01"),
....: pd.Timestamp("2012-05-02"),
....: pd.Timestamp("2012-05-03"),
....: ]
....:
In [35]: ts = pd.Series(np.random.randn(3), dates)
In [36]: type(ts.index)
Out[36]: pandas.core.indexes.datetimes.DatetimeIndex
In [37]: ts.index
Out[37]: DatetimeIndex(['2012-05-01', '2012-05-02', '2012-05-03'], dtype='datetime64[ns]', freq=None)
In [38]: ts
Out[38]:
2012-05-01 0.469112
2012-05-02 -0.282863
2012-05-03 -1.509059
dtype: float64
In [39]: periods = [pd.Period("2012-01"), pd.Period("2012-02"), pd.Period("2012-03")]
In [40]: ts = pd.Series(np.random.randn(3), periods)
In [41]: type(ts.index)
Out[41]: pandas.core.indexes.period.PeriodIndex
In [42]: ts.index
Out[42]: PeriodIndex(['2012-01', '2012-02', '2012-03'], dtype='period[M]')
In [43]: ts
Out[43]:
2012-01 -1.135632
2012-02 1.212112
2012-03 -0.173215
Freq: M, dtype: float64
pandas allows you to capture both representations and convert between them. Under the hood, pandas represents timestamps using instances of Timestamp and sequences of timestamps using instances of DatetimeIndex. For regular time spans, pandas uses Period objects for scalar values and PeriodIndex for sequences of spans. Better support for irregular intervals with arbitrary start and end points is forthcoming in future releases.
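As a minimal sketch of converting between the two representations, the example below uses the to_period and to_timestamp methods on a hand-written index. The dates are illustrative, and the choice of a monthly period is an assumption made for the example.

```python
import pandas as pd

# Three month-end timestamps (points in time).
stamps = pd.DatetimeIndex(["2012-01-31", "2012-02-29", "2012-03-31"])

# Convert each timestamp to the monthly span that contains it.
months = stamps.to_period("M")

# Convert back to timestamps; by default each span maps to its start.
starts = months.to_timestamp()

print(months)
print(starts)
```

Note that the round trip is not lossless: the original day of the month is discarded once a timestamp is widened to a monthly span.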
Converting to timestamps
To convert a Series or list-like object of date-like objects (e.g. strings, epochs, or a mixture), you can use the to_datetime function. When passed a Series, this returns a Series (with the same index), while a list-like is converted to a DatetimeIndex:
In [44]: pd.to_datetime(pd.Series(["Jul 31, 2009", "Jan 10, 2010", None]))
Out[44]:
0 2009-07-31
1 2010-01-10
2 NaT
dtype: datetime64[ns]
In [45]: pd.to_datetime(["2005/11/23", "2010/12/31"])
Out[45]: DatetimeIndex(['2005-11-23', '2010-12-31'], dtype='datetime64[ns]', freq=None)
If you use dates which start with the day first (i.e. European style), you can pass the dayfirst flag:
In [46]: pd.to_datetime(["04-01-2012 10:00"], dayfirst=True)
Out[46]: DatetimeIndex(['2012-01-04 10:00:00'], dtype='datetime64[ns]', freq=None)
In [47]: pd.to_datetime(["04-14-2012 10:00"], dayfirst=True)
Out[47]: DatetimeIndex(['2012-04-14 10:00:00'], dtype='datetime64[ns]', freq=None)
Warning
As the example above shows, dayfirst is not strict. If a date cannot be parsed with the day first, it is parsed as if dayfirst were False, and a warning is also raised.
If you pass a single string to to_datetime, it returns a single Timestamp. Timestamp can also accept string input, but it doesn't accept string parsing options like dayfirst or format, so use to_datetime if these are required.
In [48]: pd.to_datetime("2010/11/12")
Out[48]: Timestamp('2010-11-12 00:00:00')
In [49]: pd.Timestamp("2010/11/12")
Out[49]: Timestamp('2010-11-12 00:00:00')
You can also use the DatetimeIndex constructor directly:
In [50]: pd.DatetimeIndex(["2018-01-01", "2018-01-03", "2018-01-05"])
Out[50]: DatetimeIndex(['2018-01-01', '2018-01-03', '2018-01-05'], dtype='datetime64[ns]', freq=None)
The string 'infer' can be passed as freq in order to set the frequency of the index to the inferred frequency upon creation:
In [51]: pd.DatetimeIndex(["2018-01-01", "2018-01-03", "2018-01-05"], freq="infer")
Out[51]: DatetimeIndex(['2018-01-01', '2018-01-03', '2018-01-05'], dtype='datetime64[ns]', freq='2D')
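The same inference can be run on its own with pd.infer_freq, which is handy for checking whether a set of dates is regular before relying on freq='infer'. This is a small sketch; the second, irregular index is made up for illustration.

```python
import pandas as pd

regular = pd.DatetimeIndex(["2018-01-01", "2018-01-03", "2018-01-05"])
irregular = pd.DatetimeIndex(["2018-01-01", "2018-01-03", "2018-01-06"])

# Evenly spaced dates yield a frequency string.
regular_freq = pd.infer_freq(regular)

# Unevenly spaced dates have no single frequency, so None is returned.
irregular_freq = pd.infer_freq(irregular)

print(regular_freq, irregular_freq)
```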
Providing a format argument
In addition to the required datetime string, a format argument can be passed to ensure specific parsing. This could also potentially speed up the conversion considerably.
In [52]: pd.to_datetime("2010/11/12", format="%Y/%m/%d")
Out[52]: Timestamp('2010-11-12 00:00:00')
In [53]: pd.to_datetime("12-11-2010 00:00", format="%d-%m-%Y %H:%M")
Out[53]: Timestamp('2010-11-12 00:00:00')
For more information on the choices available when specifying the format option, see the Python datetime documentation.
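An explicit format also makes parsing strict, in contrast to the lenient dayfirst flag. The sketch below reuses the two day-first strings from the dayfirst example: the first parses as 4 January, while the second (which would need a 14th month) is rejected with a ValueError instead of being silently reinterpreted.

```python
import pandas as pd

day_first = "%d-%m-%Y %H:%M"

# "04-01-2012" is read as day 4, month 1.
parsed = pd.to_datetime("04-01-2012 10:00", format=day_first)

# "04-14-2012" cannot be day 4, month 14, so parsing fails outright.
try:
    pd.to_datetime("04-14-2012 10:00", format=day_first)
    rejected = False
except ValueError:
    rejected = True

print(parsed, rejected)
```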
Assembling datetime from multiple DataFrame columns
You can also pass a DataFrame of integer or string columns to assemble into a Series of Timestamps.
In [54]: df = pd.DataFrame(
....: {"year": [2015, 2016], "month": [2, 3], "day": [4, 5], "hour": [2, 3]}
....: )
....:
In [55]: pd.to_datetime(df)
Out[55]:
0 2015-02-04 02:00:00
1 2016-03-05 03:00:00
dtype: datetime64[ns]
You can pass only the columns that you need to assemble.
In [56]: pd.to_datetime(df[["year", "month", "day"]])
Out[56]:
0 2015-02-04
1 2016-03-05
dtype: datetime64[ns]
pd.to_datetime looks for standard designations of the datetime component in the column names, including:
- required: year, month, day
- optional: hour, minute, second, millisecond, microsecond, nanosecond
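When the source columns do not already use these names, one option is to rename them before calling to_datetime. A minimal sketch, assuming hypothetical column names yr, mo and dy:

```python
import pandas as pd

# Hypothetical input whose column names are not recognized by to_datetime.
raw = pd.DataFrame({"yr": [2015, 2016], "mo": [2, 3], "dy": [4, 5]})

# Map the custom names onto the required designations.
renamed = raw.rename(columns={"yr": "year", "mo": "month", "dy": "day"})

stamps = pd.to_datetime(renamed)
print(stamps)
```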
Invalid data
The default behavior, errors='raise', is to raise when unparsable:
In [57]: pd.to_datetime(['2009/07/31', 'asd'], errors='raise')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[57], line 1
----> 1 pd.to_datetime(['2009/07/31', 'asd'], errors='raise')
File ~/work/pandas/pandas/pandas/core/tools/datetimes.py:1099, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
1097 result = _convert_and_box_cache(argc, cache_array)
1098 else:
-> 1099 result = convert_listlike(argc, format)
1100 else:
1101 result = convert_listlike(np.array([arg]), format)[0]
File ~/work/pandas/pandas/pandas/core/tools/datetimes.py:433, in _convert_listlike_datetimes(arg, format, name, utc, unit, errors, dayfirst, yearfirst, exact)
431 # `format` could be inferred, or user didn't ask for mixed-format parsing.
432 if format is not None and format != "mixed":
--> 433 return _array_strptime_with_fallback(arg, name, utc, format, exact, errors)
435 result, tz_parsed = objects_to_datetime64(
436 arg,
437 dayfirst=dayfirst,
(...)
441 allow_object=True,
442 )
444 if tz_parsed is not None:
445 # We can take a shortcut since the datetime64 numpy array
446 # is in UTC
File ~/work/pandas/pandas/pandas/core/tools/datetimes.py:467, in _array_strptime_with_fallback(arg, name, utc, fmt, exact, errors)
456 def _array_strptime_with_fallback(
457 arg,
458 name,
(...)
462 errors: str,
463 ) -> Index:
464 """
465 Call array_strptime, with fallback behavior depending on 'errors'.
466 """
--> 467 result, tz_out = array_strptime(arg, fmt, exact=exact, errors=errors, utc=utc)
468 if tz_out is not None:
469 unit = np.datetime_data(result.dtype)[0]
File strptime.pyx:501, in pandas._libs.tslibs.strptime.array_strptime()
File strptime.pyx:451, in pandas._libs.tslibs.strptime.array_strptime()
File strptime.pyx:583, in pandas._libs.tslibs.strptime._parse_with_format()
ValueError: time data "asd" doesn't match format "%Y/%m/%d", at position 1. You might want to try:
- passing `format` if your strings have a consistent format;
- passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
- passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.
Pass errors='coerce' to convert unparsable data to NaT (not a time):
In [58]: pd.to_datetime(["2009/07/31", "asd"], errors="coerce")
Out[58]: DatetimeIndex(['2009-07-31', 'NaT'], dtype='datetime64[ns]', freq=None)
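Coercion pairs naturally with isna() for locating the inputs that failed to parse. The following sketch uses a small made-up Series and passes an explicit format so that every element is parsed the same way.

```python
import pandas as pd

raw = pd.Series(["2009/07/31", "asd", "2010/01/10"])

# Unparsable entries become NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce", format="%Y/%m/%d")

# Select the original strings that could not be parsed.
bad = raw[parsed.isna()]

print(bad)
```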
Epoch timestamps
pandas supports converting integer or float epoch times to Timestamp and DatetimeIndex. The default unit is nanoseconds, since that is how Timestamp objects are stored internally. However, epochs are often stored in another unit, which can be specified. These are computed from the starting point specified by the origin parameter.
In [59]: pd.to_datetime(
....: [1349720105, 1349806505, 1349892905, 1349979305, 1350065705], unit="s"
....: )
....:
Out[59]:
DatetimeIndex(['2012-10-08 18:15:05', '2012-10-09 18:15:05',
'2012-10-10 18:15:05', '2012-10-11 18:15:05',
'2012-10-12 18:15:05'],
dtype='datetime64[ns]', freq=None)
In [60]: pd.to_datetime(
....: [1349720105100, 1349720105200, 1349720105300, 1349720105400, 1349720105500],
....: unit="ms",
....: )
....:
Out[60]:
DatetimeIndex(['2012-10-08 18:15:05.100000', '2012-10-08 18:15:05.200000',
'2012-10-08 18:15:05.300000', '2012-10-08 18:15:05.400000',
'2012-10-08 18:15:05.500000'],
dtype='datetime64[ns]', freq=None)
The unit parameter does not use the same strings as the format parameter that was discussed above. The available units are listed in the documentation for pandas.to_datetime().
Constructing a Timestamp or DatetimeIndex with an epoch timestamp with the tz argument specified will raise a ValueError. If you have epochs in wall time in another timezone, you can read the epochs as timezone-naive timestamps and then localize to the appropriate timezone:
In [61]: pd.Timestamp(1262347200000000000).tz_localize("US/Pacific")
Out[61]: Timestamp('2010-01-01 12:00:00-0800', tz='US/Pacific')
In [62]: pd.DatetimeIndex([1262347200000000000]).tz_localize("US/Pacific")
Out[62]: DatetimeIndex(['2010-01-01 12:00:00-08:00'], dtype='datetime64[ns, US/Pacific]', freq=None)
Epoch times will be rounded to the nearest nanosecond.
Warning
Conversion of float epoch times can lead to inaccurate and unexpected results. Python floats have about 15 digits of decimal precision. Rounding during conversion from float to a high-precision Timestamp is unavoidable. The only way to achieve exact precision is to use a fixed-width type (e.g. an int64).
In [63]: pd.to_datetime([1490195805.433, 1490195805.433502912], unit="s")
Out[63]: DatetimeIndex(['2017-03-22 15:16:45.433000088', '2017-03-22 15:16:45.433502913'], dtype='datetime64[ns]', freq=None)
In [64]: pd.to_datetime(1490195805433502912, unit="ns")
Out[64]: Timestamp('2017-03-22 15:16:45.433502912')
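One way to stay exact, sketched below, is to keep the whole-second and sub-second parts as separate integers and combine them into a single nanosecond count before converting. The variable names are illustrative; the values are those used in the example above.

```python
import pandas as pd

seconds = 1490195805       # whole seconds since the epoch
nanoseconds = 433502912    # sub-second remainder, in nanoseconds

# Pure integer arithmetic, so no floating-point rounding occurs.
total_ns = seconds * 1_000_000_000 + nanoseconds

exact = pd.to_datetime(total_ns, unit="ns")
print(exact)
```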
From timestamps to epoch
To invert the operation from above, namely, to convert from a Timestamp to a 'unix' epoch:
In [65]: stamps = pd.date_range("2012-10-08 18:15:05", periods=4, freq="D")
In [66]: stamps
Out[66]:
DatetimeIndex(['2012-10-08 18:15:05', '2012-10-09 18:15:05',
'2012-10-10 18:15:05', '2012-10-11 18:15:05'],
dtype='datetime64[ns]', freq='D')
We subtract the epoch (midnight on January 1, 1970 UTC) and then floor divide by the "unit" (1 second).
In [67]: (stamps - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")
Out[67]: Index([1349720105, 1349806505, 1349892905, 1349979305], dtype='int64')
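The same pattern works for other units by changing the divisor. As a sketch, the example below produces millisecond epochs and then converts them back with to_datetime to confirm that the round trip returns the original timestamps.

```python
import pandas as pd

stamps = pd.date_range("2012-10-08 18:15:05", periods=4, freq="D")
epoch = pd.Timestamp("1970-01-01")

# Floor divide by one millisecond instead of one second.
millis = (stamps - epoch) // pd.Timedelta("1ms")

# Convert back, telling to_datetime which unit the integers are in.
restored = pd.to_datetime(millis, unit="ms")

print(millis)
print(restored)
```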
Using the origin parameter
Using the origin parameter, one can specify an alternative starting point for creation of a DatetimeIndex. For example, to use 1960-01-01 as the starting date:
In [68]: pd.to_datetime([1, 2, 3], unit="D", origin=pd.Timestamp("1960-01-01"))
Out[68]: DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'], dtype='datetime64[ns]', freq=None)
The default is origin='unix', which corresponds to 1970-01-01 00:00:00, commonly called the 'unix epoch' or POSIX time.
In [69]: pd.to_datetime([1, 2, 3], unit="D")
Out[69]: DatetimeIndex(['1970-01-02', '1970-01-03', '1970-01-04'], dtype='datetime64[ns]', freq=None)
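The origin can be combined with any unit, not only days. The sketch below assumes a hypothetical measurement log that records seconds elapsed since a known start time and turns those offsets into absolute timestamps.

```python
import pandas as pd

# Hypothetical start of a measurement run.
origin = pd.Timestamp("2018-01-01 12:00:00")

# Seconds elapsed since the start of the run.
elapsed_seconds = [0, 90, 3600]

stamps = pd.to_datetime(elapsed_seconds, unit="s", origin=origin)
print(stamps)
```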
Generating ranges of timestamps
To generate an index with timestamps, you can use either the DatetimeIndex or Index constructor and pass in a list of datetime objects:
In [70]: dates = [
....: datetime.datetime(2012, 5, 1),
....: datetime.datetime(2012, 5, 2),
....: datetime.datetime(2012, 5, 3),
....: ]
....:
# Note the frequency information
In [71]: index = pd.DatetimeIndex(dates)
In [72]: index
Out[72]: DatetimeIndex(['2012-05-01', '2012-05-02', '2012-05-03'], dtype='datetime64[ns]', freq=None)
# Automatically converted to DatetimeIndex
In [73]: index = pd.Index(dates)
In [74]: index
Out[74]: DatetimeIndex(['2012-05-01', '2012-05-02', '2012-05-03'], dtype='datetime64[ns]', freq=None)
In practice this becomes very cumbersome because we often need a very long index with a large number of timestamps. If we need timestamps on a regular frequency, we can use the date_range() and bdate_range() functions to create a DatetimeIndex. The default frequency for date_range is a calendar day while the default for bdate_range is a business day:
In [75]: start = datetime.datetime(2011, 1, 1)
In [76]: end = datetime.datetime(2012, 1, 1)
In [77]: index = pd.date_range(start, end)
In [78]: index
Out[78]:
DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
'2011-01-05', '2011-01-06', '2011-01-07', '2011-01-08',
'2011-01-09', '2011-01-10',
...
'2011-12-23', '2011-12-24', '2011-12-25', '2011-12-26',
'2011-12-27', '2011-12-28', '2011-12-29', '2011-12-30',
'2011-12-31', '2012-01-01'],
dtype='datetime64[ns]', length=366, freq='D')
In [79]: index = pd.bdate_range(start, end)
In [80]: index
Out[80]:
DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05', '2011-01-06',
'2011-01-07', '2011-01-10', '2011-01-11', '2011-01-12',
'2011-01-13', '2011-01-14',
...
'2011-12-19', '2011-12-20', '2011-12-21', '2011-12-22',
'2011-12-23', '2011-12-26', '2011-12-27', '2011-12-28',
'2011-12-29', '2011-12-30'],
dtype='datetime64[ns]', length=260, freq='B')
Convenience functions like date_range and bdate_range can utilize a variety of frequency aliases:
In [81]: pd.date_range(start, periods=1000, freq="ME")
Out[81]:
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-30',
'2011-05-31', '2011-06-30', '2011-07-31', '2011-08-31',
'2011-09-30', '2011-10-31',
...
'2093-07-31', '2093-08-31', '2093-09-30', '2093-10-31',
'2093-11-30', '2093-12-31', '2094-01-31', '2094-02-28',
'2094-03-31', '2094-04-30'],
dtype='datetime64[ns]', length=1000, freq='ME')
In [82]: pd.bdate_range(start, periods=250, freq="BQS")
Out[82]:
DatetimeIndex(['2011-01-03', '2011-04-01', '2011-07-01', '2011-10-03',
'2012-01-02', '2012-04-02', '2012-07-02', '2012-10-01',
'2013-01-01', '2013-04-01',
...
'2071-01-01', '2071-04-01', '2071-07-01', '2071-10-01',
'2072-01-01', '2072-04-01', '2072-07-01', '2072-10-03',
'2073-01-02', '2073-04-03'],
dtype='datetime64[ns]', length=250, freq='BQS-JAN')
date_range and bdate_range make it easy to generate a range of dates using various combinations of parameters like start, end, periods, and freq. The start and end dates are strictly inclusive, so dates outside of those specified will not be generated:
In [83]: pd.date_range(start, end, freq="BME")
Out[83]:
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
'2011-05-31', '2011-06-30', '2011-07-29', '2011-08-31',
'2011-09-30', '2011-10-31', '2011-11-30', '2011-12-30'],
dtype='datetime64[ns]', freq='BME')
In [84]: pd.date_range(start, end, freq="W")
Out[84]:
DatetimeIndex(['2011-01-02', '2011-01-09', '2011-01-16', '2011-01-23',
'2011-01-30', '2011-02-06', '2011-02-13', '2011-02-20',
'2011-02-27', '2011-03-06', '2011-03-13', '2011-03-20',
'2011-03-27', '2011-04-03', '2011-04-10', '2011-04-17',
'2011-04-24', '2011-05-01', '2011-05-08', '2011-05-15',
'2011-05-22', '2011-05-29', '2011-06-05', '2011-06-12',
'2011-06-19', '2011-06-26', '2011-07-03', '2011-07-10',
'2011-07-17', '2011-07-24', '2011-07-31', '2011-08-07',
'2011-08-14', '2011-08-21', '2011-08-28', '2011-09-04',
'2011-09-11', '2011-09-18', '2011-09-25', '2011-10-02',
'2011-10-09', '2011-10-16', '2011-10-23', '2011-10-30',
'2011-11-06', '2011-11-13', '2011-11-20', '2011-11-27',
'2011-12-04', '2011-12-11', '2011-12-18', '2011-12-25',
'2012-01-01'],
dtype='datetime64[ns]', freq='W-SUN')
In [85]: pd.bdate_range(end=end, periods=20)
Out[85]:
DatetimeIndex(['2011-12-05', '2011-12-06', '2011-12-07', '2011-12-08',
'2011-12-09', '2011-12-12', '2011-12-13', '2011-12-14',
'2011-12-15', '2011-12-16', '2011-12-19', '2011-12-20',
'2011-12-21', '2011-12-22', '2011-12-23', '2011-12-26',
'2011-12-27', '2011-12-28', '2011-12-29', '2011-12-30'],
dtype='datetime64[ns]', freq='B')
In [86]: pd.bdate_range(start=start, periods=20)
Out[86]:
DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05', '2011-01-06',
'2011-01-07', '2011-01-10', '2011-01-11', '2011-01-12',
'2011-01-13', '2011-01-14', '2011-01-17', '2011-01-18',
'2011-01-19', '2011-01-20', '2011-01-21', '2011-01-24',
'2011-01-25', '2011-01-26', '2011-01-27', '2011-01-28'],
dtype='datetime64[ns]', freq='B')
Specifying start, end, and periods will generate a range of evenly spaced dates from start to end inclusively, with periods number of elements in the resulting DatetimeIndex:
In [87]: pd.date_range("2018-01-01", "2018-01-05", periods=5)
Out[87]:
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05'],
dtype='datetime64[ns]', freq=None)
In [88]: pd.date_range("2018-01-01", "2018-01-05", periods=10)
Out[88]:
DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 10:40:00',
'2018-01-01 21:20:00', '2018-01-02 08:00:00',
'2018-01-02 18:40:00', '2018-01-03 05:20:00',
'2018-01-03 16:00:00', '2018-01-04 02:40:00',
'2018-01-04 13:20:00', '2018-01-05 00:00:00'],
dtype='datetime64[ns]', freq=None)
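The spacing in this case is simply the total span divided by periods - 1. The sketch below recomputes the ten-element example by hand to show where the 10 hour 40 minute step comes from; the manual list is for illustration and is not how pandas builds the index internally.

```python
import pandas as pd

start = pd.Timestamp("2018-01-01")
end = pd.Timestamp("2018-01-05")
periods = 10

# Ten points bound nine equal gaps across the four-day span.
step = (end - start) / (periods - 1)

# Build the same points manually for comparison.
manual = [start + i * step for i in range(periods)]

rng = pd.date_range(start, end, periods=periods)
print(step)
print(list(rng) == manual)
```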
Custom frequency ranges
bdate_range can also generate a range of custom frequency dates by using the weekmask and holidays parameters. These parameters will only be used if a custom frequency string is passed.
In [89]: weekmask = "Mon Wed Fri"
In [90]: holidays = [datetime.datetime(2011, 1, 5), datetime.datetime(2011, 3, 14)]
In [91]: pd.bdate_range(start, end, freq="C", weekmask=weekmask, holidays=holidays)
Out[91]:
DatetimeIndex(['2011-01-03', '2011-01-07', '2011-01-10', '2011-01-12',
'2011-01-14', '2011-01-17', '2011-01-19', '2011-01-21',
'2011-01-24', '2011-01-26',
...
'2011-12-09', '2011-12-12', '2011-12-14', '2011-12-16',
'2011-12-19', '2011-12-21', '2011-12-23', '2011-12-26',
'2011-12-28', '2011-12-30'],
dtype='datetime64[ns]', length=154, freq='C')
In [92]: pd.bdate_range(start, end, freq="CBMS", weekmask=weekmask)
Out[92]:
DatetimeIndex(['2011-01-03', '2011-02-02', '2011-03-02', '2011-04-01',
'2011-05-02', '2011-06-01', '2011-07-01', '2011-08-01',
'2011-09-02', '2011-10-03', '2011-11-02', '2011-12-02'],
dtype='datetime64[ns]', freq='CBMS')
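The same weekmask and holidays can be attached to a pd.offsets.CustomBusinessDay offset and used in ordinary date arithmetic. This is a sketch under the same assumptions as the example above (a Monday/Wednesday/Friday week with two holidays); it steps forward from the first date of that range.

```python
import datetime

import pandas as pd

holidays = [datetime.datetime(2011, 1, 5), datetime.datetime(2011, 3, 14)]
cbd = pd.offsets.CustomBusinessDay(weekmask="Mon Wed Fri", holidays=holidays)

first = pd.Timestamp("2011-01-03")  # a Monday

# Wednesday 2011-01-05 is a holiday, so the next valid day is Friday.
second = first + cbd

# The following valid day is the next Monday.
third = second + cbd

print(second, third)
```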
Timestamp limitations
The limits of timestamp representation depend on the chosen resolution. For nanosecond resolution, the time span that can be represented using a 64-bit integer is limited to approximately 584 years:
In [93]: pd.Timestamp.min
Out[93]: Timestamp('1677-09-21 00:12:43.145224193')
In [94]: pd.Timestamp.max
Out[94]: Timestamp('2262-04-11 23:47:16.854775807')
When choosing second resolution, the available range grows to +/- 2.9e11 years. Different resolutions can be converted to each other through as_unit.
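A brief sketch of both ideas follows. It assumes a pandas version that provides Timestamp.as_unit (2.0 or later). The Period at the end is an addition not discussed above: it shows one way to represent a date far outside the nanosecond range as a span rather than a point in time.

```python
import pandas as pd

ts = pd.Timestamp("2018-01-01 00:00:00")

# Re-express the same instant at second resolution.
ts_s = ts.as_unit("s")

# A daily Period is not limited by the nanosecond Timestamp bounds.
far_future = pd.Period("3000-01-01", freq="D")

print(ts_s.unit, far_future)
```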
Indexing
One of the main uses for DatetimeIndex is as an index for pandas objects. The DatetimeIndex class contains many time series related optimizations:
- A large range of dates for various offsets are pre-computed and cached under the hood in order to make generating subsequent date ranges very fast (just have to grab a slice).
- Fast shifting using the shift method on pandas objects.
- Unioning of overlapping DatetimeIndex objects with the same frequency is very fast (important for fast data alignment).
- Quick access to date fields via properties such as year, month, etc.
- Regularization functions like snap and very fast asof logic.
DatetimeIndex objects have all the basic functionality of regular Index objects, plus a wide assortment of advanced time series specific methods for easy frequency processing.
While pandas does not force you to have a sorted date index, some of these methods may have unexpected or incorrect behavior if the dates are unsorted.
DatetimeIndex can be used like a regular index and offers all of its intelligent functionality like selection, slicing, etc.
In [95]: rng = pd.date_range(start, end, freq="BME")
In [96]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [97]: ts.index
Out[97]:
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
'2011-05-31', '2011-06-30', '2011-07-29', '2011-08-31',
'2011-09-30', '2011-10-31', '2011-11-30', '2011-12-30'],
dtype='datetime64[ns]', freq='BME')
In [98]: ts[:5].index
Out[98]:
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
'2011-05-31'],
dtype='datetime64[ns]', freq='BME')
In [99]: ts[::2].index
Out[99]:
DatetimeIndex(['2011-01-31', '2011-03-31', '2011-05-31', '2011-07-29',
'2011-09-30', '2011-11-30'],
dtype='datetime64[ns]', freq='2BME')
Partial string indexing
Dates and strings that parse to timestamps can be passed as indexing parameters:
In [100]: ts["1/31/2011"]
Out[100]: 0.11920871129693428
In [101]: ts[datetime.datetime(2011, 12, 25):]
Out[101]:
2011-12-30 0.56702
Freq: BME, dtype: float64
In [102]: ts["10/31/2011":"12/31/2011"]
Out[102]:
2011-10-31 0.271860
2011-11-30 -0.424972
2011-12-30 0.567020
Freq: BME, dtype: float64
To provide convenience for accessing longer time series, you can also pass in the year or year and month as strings:
In [103]: ts["2011"]
Out[103]:
2011-01-31 0.119209
2011-02-28 -1.044236
2011-03-31 -0.861849
2011-04-29 -2.104569
2011-05-31 -0.494929
2011-06-30 1.071804
2011-07-29 0.721555
2011-08-31 -0.706771
2011-09-30 -1.039575
2011-10-31 0.271860
2011-11-30 -0.424972
2011-12-30 0.567020
Freq: BME, dtype: float64
In [104]: ts["2011-6"]
Out[104]:
2011-06-30 1.071804
Freq: BME, dtype: float64
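The effect is easier to verify on data with known values. The sketch below builds a daily series for 2011 with made-up integer values and selects by month and by a range of months using .loc; both endpoints of the slice are expanded to cover their whole month.

```python
import pandas as pd

idx = pd.date_range("2011-01-01", "2011-12-31", freq="D")
ts = pd.Series(range(len(idx)), index=idx)

# A year-month string selects every day in that month.
june = ts.loc["2011-06"]

# A slice of partial strings runs from the start of October to the end of December.
q4 = ts.loc["2011-10":"2011-12"]

print(len(june), len(q4))
```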
This type of slicing will work on a DataFrame with a DatetimeIndex as well. Since the partial string selection is a form of label slicing, the endpoints will be included. This would include matching times on an included date:
Warning
Indexing DataFrame rows with a single string with getitem (e.g. frame[dtstring]) is deprecated starting with pandas 1.2.0 (given the ambiguity whether it is indexing the rows or selecting a column) and will be removed in a future version. The equivalent with .loc (e.g. frame.loc[dtstring]) is still supported.
In [105]: dft = pd.DataFrame(
.....: np.random.randn(100000, 1),
.....: columns=["A"],
.....: index=pd.date_range("20130101", periods=100000, freq="min"),
.....: )
.....:
In [106]: dft
Out[106]:
A
2013-01-01 00:00:00 0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00 0.113648
2013-01-01 00:04:00 -1.478427
... ...
2013-03-11 10:35:00 -0.747967
2013-03-11 10:36:00 -0.034523
2013-03-11 10:37:00 -0.201754
2013-03-11 10:38:00 -1.509067
2013-03-11 10:39:00 -1.693043
[100000 rows x 1 columns]
In [107]: dft.loc["2013"]
Out[107]:
A
2013-01-01 00:00:00 0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00 0.113648
2013-01-01 00:04:00 -1.478427
... ...
2013-03-11 10:35:00 -0.747967
2013-03-11 10:36:00 -0.034523
2013-03-11 10:37:00 -0.201754
2013-03-11 10:38:00 -1.509067
2013-03-11 10:39:00 -1.693043
[100000 rows x 1 columns]
This starts on the very first time in the month, and includes the last date and time for the month:
In [108]: dft["2013-1":"2013-2"]
Out[108]:
A
2013-01-01 00:00:00 0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00 0.113648
2013-01-01 00:04:00 -1.478427
... ...
2013-02-28 23:55:00 0.850929
2013-02-28 23:56:00 0.976712
2013-02-28 23:57:00 -2.693884
2013-02-28 23:58:00 -1.575535
2013-02-28 23:59:00 -1.573517
[84960 rows x 1 columns]
此操作指定一个结束时间,该结束时间包括最后一天的所有时间:
This specifies a stop time that includes all of the times on the last day:
In [109]: dft["2013-1":"2013-2-28"]
Out[109]:
A
2013-01-01 00:00:00 0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00 0.113648
2013-01-01 00:04:00 -1.478427
... ...
2013-02-28 23:55:00 0.850929
2013-02-28 23:56:00 0.976712
2013-02-28 23:57:00 -2.693884
2013-02-28 23:58:00 -1.575535
2013-02-28 23:59:00 -1.573517
[84960 rows x 1 columns]
此操作指定一个准确的结束时间(与上述不同):
This specifies an exact stop time (and is not the same as the above):
In [110]: dft["2013-1":"2013-2-28 00:00:00"]
Out[110]:
A
2013-01-01 00:00:00 0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00 0.113648
2013-01-01 00:04:00 -1.478427
... ...
2013-02-27 23:56:00 1.197749
2013-02-27 23:57:00 0.720521
2013-02-27 23:58:00 -0.072718
2013-02-27 23:59:00 -0.681192
2013-02-28 00:00:00 -0.557501
[83521 rows x 1 columns]
我们正在包含的端点处停止,因为它是索引的一部分:
We are stopping on the included end-point as it is part of the index:
In [111]: dft["2013-1-15":"2013-1-15 12:30:00"]
Out[111]:
A
2013-01-15 00:00:00 -0.984810
2013-01-15 00:01:00 0.941451
2013-01-15 00:02:00 1.559365
2013-01-15 00:03:00 1.034374
2013-01-15 00:04:00 -1.480656
... ...
2013-01-15 12:26:00 0.371454
2013-01-15 12:27:00 -0.930806
2013-01-15 12:28:00 -0.069177
2013-01-15 12:29:00 0.066510
2013-01-15 12:30:00 -0.003945
[751 rows x 1 columns]
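The effect of the stop label's resolution can be checked on a smaller frame. The following is a minimal sketch, assuming a three-day minute-frequency index; the names frame, through_day and through_midnight are illustrative and do not appear in the examples above.

```python
import numpy as np
import pandas as pd

# Three days of minute-frequency data (4320 rows).
idx = pd.date_range("2013-01-01", periods=3 * 24 * 60, freq="min")
frame = pd.DataFrame({"A": np.arange(len(idx))}, index=idx)

# Day-resolution stop label: every timestamp on 2013-01-02 is included.
through_day = frame.loc["2013-01-01":"2013-01-02"]

# Second-resolution stop label: the slice ends exactly at midnight.
through_midnight = frame.loc["2013-01-01":"2013-01-02 00:00:00"]

print(len(through_day), through_day.index[-1])
print(len(through_midnight), through_midnight.index[-1])
```

The first slice should cover two full days, while the second stops one row into the second day.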
DatetimeIndex 部分字符串索引同样适用于具有 MultiIndex 的 DataFrame。
DatetimeIndex partial string indexing also works on a DataFrame with a MultiIndex:
In [112]: dft2 = pd.DataFrame(
.....: np.random.randn(20, 1),
.....: columns=["A"],
.....: index=pd.MultiIndex.from_product(
.....: [pd.date_range("20130101", periods=10, freq="12h"), ["a", "b"]]
.....: ),
.....: )
.....:
In [113]: dft2
Out[113]:
A
2013-01-01 00:00:00 a -0.298694
b 0.823553
2013-01-01 12:00:00 a 0.943285
b -1.479399
2013-01-02 00:00:00 a -1.643342
... ...
2013-01-04 12:00:00 b 0.069036
2013-01-05 00:00:00 a 0.122297
b 1.422060
2013-01-05 12:00:00 a 0.370079
b 1.016331
[20 rows x 1 columns]
In [114]: dft2.loc["2013-01-05"]
Out[114]:
A
2013-01-05 00:00:00 a 0.122297
b 1.422060
2013-01-05 12:00:00 a 0.370079
b 1.016331
In [115]: idx = pd.IndexSlice
In [116]: dft2 = dft2.swaplevel(0, 1).sort_index()
In [117]: dft2.loc[idx[:, "2013-01-05"], :]
Out[117]:
A
a 2013-01-05 00:00:00 0.122297
2013-01-05 12:00:00 0.370079
b 2013-01-05 00:00:00 1.422060
2013-01-05 12:00:00 1.016331
使用字符串索引进行切片也遵循 UTC 偏移。
Slicing with string indexing also honors UTC offset.
In [118]: df = pd.DataFrame([0], index=pd.DatetimeIndex(["2019-01-01"], tz="US/Pacific"))
In [119]: df
Out[119]:
0
2019-01-01 00:00:00-08:00 0
In [120]: df["2019-01-01 12:00:00+04:00":"2019-01-01 13:00:00+04:00"]
Out[120]:
0
2019-01-01 00:00:00-08:00 0
Slice vs. exact match
根据索引的分辨率,可以用作索引参数的相同字符串可以被视为切片或精确匹配。如果字符串比索引的准确度低,则它将被视为切片,否则将被视为精确匹配。
The same string used as an indexing parameter can be treated either as a slice or as an exact match depending on the resolution of the index. If the string is less accurate than the index, it will be treated as a slice, otherwise as an exact match.
考虑具有分钟分辨率索引的 Series 对象:
Consider a Series object with a minute resolution index:
In [121]: series_minute = pd.Series(
.....: [1, 2, 3],
.....: pd.DatetimeIndex(
.....: ["2011-12-31 23:59:00", "2012-01-01 00:00:00", "2012-01-01 00:02:00"]
.....: ),
.....: )
.....:
In [122]: series_minute.index.resolution
Out[122]: 'minute'
一个比分钟准确度低的 timestamp 字符串给出了一个 Series 对象。
A timestamp string less accurate than a minute gives a Series object.
In [123]: series_minute["2011-12-31 23"]
Out[123]:
2011-12-31 23:59:00 1
dtype: int64
一个具有分钟分辨率(或更准确)的 timestamp 字符串给出了一个标量,也就是说,它没有被强制转换为切片。
A timestamp string with minute resolution (or more accurate) gives a scalar instead, i.e. it is not cast to a slice.
In [124]: series_minute["2011-12-31 23:59"]
Out[124]: 1
In [125]: series_minute["2011-12-31 23:59:00"]
Out[125]: 1
如果索引分辨率为秒,那么分钟准确的时间戳将给出一个 Series。
If index resolution is second, then the minute-accurate timestamp gives a Series.
In [126]: series_second = pd.Series(
.....: [1, 2, 3],
.....: pd.DatetimeIndex(
.....: ["2011-12-31 23:59:59", "2012-01-01 00:00:00", "2012-01-01 00:00:01"]
.....: ),
.....: )
.....:
In [127]: series_second.index.resolution
Out[127]: 'second'
In [128]: series_second["2011-12-31 23:59"]
Out[128]:
2011-12-31 23:59:59 1
dtype: int64
如果时间戳字符串被视为切片,则可以使用它来用 .loc[] 将 DataFrame 索引。
If the timestamp string is treated as a slice, it can be used to index DataFrame with .loc[] as well.
In [129]: dft_minute = pd.DataFrame(
.....: {"a": [1, 2, 3], "b": [4, 5, 6]}, index=series_minute.index
.....: )
.....:
In [130]: dft_minute.loc["2011-12-31 23"]
Out[130]:
a b
2011-12-31 23:59:00 1 4
警告
Warning
但是,如果字符串被视为完全匹配,则 DataFrame 的 [] 中的选择将按列进行,而不是按行进行,请参见 Indexing Basics。例如,dft_minute['2011-12-31 23:59'] 将引发 KeyError,因为 '2011-12-31 23:59' 与索引具有相同的分辨率,并且没有这样的列名。
However, if the string is treated as an exact match, the selection in DataFrame’s [] will be column-wise and not row-wise, see Indexing Basics. For example, dft_minute['2011-12-31 23:59'] will raise KeyError because '2011-12-31 23:59' has the same resolution as the index and there is no column with that name.
为了始终有明确的选择,无论行被视为切片还是单个选择,请使用 .loc。
To always have unambiguous selection, whether the row is treated as a slice or a single selection, use .loc.
In [131]: dft_minute.loc["2011-12-31 23:59"]
Out[131]:
a 1
b 4
Name: 2011-12-31 23:59:00, dtype: int64
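The slice versus exact-match rule, including the KeyError described in the warning, can be reproduced in one place. This is a minimal sketch that rebuilds the minute-resolution objects from this section under the illustrative names series_minute and frame_minute.

```python
import pandas as pd

index = pd.DatetimeIndex(
    ["2011-12-31 23:59:00", "2012-01-01 00:00:00", "2012-01-01 00:02:00"]
)
series_minute = pd.Series([1, 2, 3], index=index)
frame_minute = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=index)

# Hour-resolution string is coarser than the minute index: treated as a slice.
as_slice = series_minute["2011-12-31 23"]

# Minute-resolution string matches the index resolution: exact match, scalar.
as_scalar = series_minute["2011-12-31 23:59"]

# On a DataFrame, [] with an exact-match string looks for a column and fails.
try:
    frame_minute["2011-12-31 23:59"]
    column_lookup_failed = False
except KeyError:
    column_lookup_failed = True

# .loc always selects rows, so the same string works there.
row = frame_minute.loc["2011-12-31 23:59"]
```

Using .loc throughout avoids having to reason about which of the two interpretations [] will choose.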
还要注意,DatetimeIndex 的分辨率不能低于天。
Note also that DatetimeIndex resolution cannot be less precise than day.
In [132]: series_monthly = pd.Series(
.....: [1, 2, 3], pd.DatetimeIndex(["2011-12", "2012-01", "2012-02"])
.....: )
.....:
In [133]: series_monthly.index.resolution
Out[133]: 'day'
In [134]: series_monthly["2011-12"] # returns Series
Out[134]:
2011-12-01 1
dtype: int64
Exact indexing
如前一节所述,使用部分字符串索引 DatetimeIndex 取决于该时期的“准确性”,或者换句话说,间隔相对于索引分辨率的具体程度。与此相反,使用 Timestamp 或 datetime 对象进行索引是精确的,因为这些对象具有精确的含义。这些还遵循包含两个端点的语义。
As discussed in previous section, indexing a DatetimeIndex with a partial string depends on the “accuracy” of the period, in other words how specific the interval is in relation to the resolution of the index. In contrast, indexing with Timestamp or datetime objects is exact, because the objects have exact meaning. These also follow the semantics of including both endpoints.
这些 Timestamp 和 datetime 对象具有精确的 hours, minutes, 和 seconds,即使它们没有被明确指定(它们是 0)。
These Timestamp and datetime objects have exact hours, minutes, and seconds, even though they were not explicitly specified (they are 0).
In [135]: dft[datetime.datetime(2013, 1, 1): datetime.datetime(2013, 2, 28)]
Out[135]:
A
2013-01-01 00:00:00 0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00 0.113648
2013-01-01 00:04:00 -1.478427
... ...
2013-02-27 23:56:00 1.197749
2013-02-27 23:57:00 0.720521
2013-02-27 23:58:00 -0.072718
2013-02-27 23:59:00 -0.681192
2013-02-28 00:00:00 -0.557501
[83521 rows x 1 columns]
没有默认值。
With no defaults.
In [136]: dft[
.....: datetime.datetime(2013, 1, 1, 10, 12, 0): datetime.datetime(
.....: 2013, 2, 28, 10, 12, 0
.....: )
.....: ]
.....:
Out[136]:
A
2013-01-01 10:12:00 0.565375
2013-01-01 10:13:00 0.068184
2013-01-01 10:14:00 0.788871
2013-01-01 10:15:00 -0.280343
2013-01-01 10:16:00 0.931536
... ...
2013-02-28 10:08:00 0.148098
2013-02-28 10:09:00 -0.388138
2013-02-28 10:10:00 0.139348
2013-02-28 10:11:00 0.085288
2013-02-28 10:12:00 0.950146
[83521 rows x 1 columns]
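To contrast exact indexing with partial string indexing, the same date can be passed once as a Timestamp and once as a string. A minimal sketch, assuming a small minute-frequency frame; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2013-01-01", periods=3 * 24 * 60, freq="min")
frame = pd.DataFrame({"A": np.arange(len(idx))}, index=idx)

# A Timestamp built from a bare date means midnight exactly.
stop = pd.Timestamp("2013-01-02")
exact = frame.loc[pd.Timestamp("2013-01-01"):stop]

# The same date as a string is a day-resolution label covering the whole day.
partial = frame.loc["2013-01-01":"2013-01-02"]

print(exact.index[-1], partial.index[-1])
```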
Truncating & fancy indexing
提供了类似于切片的 truncate() 方便函数。请注意,truncate 为 DatetimeIndex 中任何未指定日期组件假定值为 0,这与返回任何部分匹配日期的分片相反:
A truncate() convenience function is provided that is similar to slicing. Note that truncate assumes a 0 value for any unspecified date component in a DatetimeIndex in contrast to slicing which returns any partially matching dates:
In [137]: rng2 = pd.date_range("2011-01-01", "2012-01-01", freq="W")
In [138]: ts2 = pd.Series(np.random.randn(len(rng2)), index=rng2)
In [139]: ts2.truncate(before="2011-11", after="2011-12")
Out[139]:
2011-11-06 0.437823
2011-11-13 -0.293083
2011-11-20 -0.059881
2011-11-27 1.252450
Freq: W-SUN, dtype: float64
In [140]: ts2["2011-11":"2011-12"]
Out[140]:
2011-11-06 0.437823
2011-11-13 -0.293083
2011-11-20 -0.059881
2011-11-27 1.252450
2011-12-04 0.046611
2011-12-11 0.059478
2011-12-18 -0.286539
2011-12-25 0.841669
Freq: W-SUN, dtype: float64
即使复杂的 fancy 索引破坏了 DatetimeIndex 频率规则,它也会导致 DatetimeIndex,尽管丢失了频率:
Even complicated fancy indexing that breaks the DatetimeIndex frequency regularity will result in a DatetimeIndex, although frequency is lost:
In [141]: ts2.iloc[[0, 2, 6]].index
Out[141]: DatetimeIndex(['2011-01-02', '2011-01-16', '2011-02-13'], dtype='datetime64[ns]', freq=None)
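The difference between truncate() and slicing comes down to how the "2011-12" label is read. A minimal sketch with deterministic values in place of the random data used above; the names weekly, truncated and sliced are illustrative.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2011-01-01", "2012-01-01", freq="W")
weekly = pd.Series(np.arange(len(rng)), index=rng)

# truncate() reads "2011-12" as 2011-12-01 00:00:00, so December is excluded.
truncated = weekly.truncate(before="2011-11", after="2011-12")

# Slicing reads "2011-12" as the whole month, so December is included.
sliced = weekly["2011-11":"2011-12"]

print(truncated.index[-1], sliced.index[-1])
```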
Time/date components
有一些时间/日期属性可以从 Timestamp 或像 DatetimeIndex 这样的时间戳集合中访问。
There are several time/date properties that one can access from Timestamp or a collection of timestamps like a DatetimeIndex.
属性
Property
说明
Description
year
日期时间年
The year of the datetime
month
日期时间月
The month of the datetime
day
日期时间日
The days of the datetime
hour
日期时间时
The hour of the datetime
分
minute
日期和时间的分钟数
The minutes of the datetime
秒
second
日期和时间的秒数
The seconds of the datetime
微秒
microsecond
日期和时间的微秒数
The microseconds of the datetime
纳秒
nanosecond
日期和时间的纳秒数
The nanoseconds of the datetime
日期
date
返回 datetime.date(不包含时区信息)
Returns datetime.date (does not contain timezone information)
时间
time
返回 datetime.time(不包含时区信息)
Returns datetime.time (does not contain timezone information)
时区时间
timetz
以本地时间返回 datetime.time,并包含时区信息
Returns datetime.time as local time with timezone information
年分日
dayofyear
一年的序数日
The ordinal day of year
年分日
day_of_year
一年的序数日
The ordinal day of year
年分周
weekofyear
一年的序数周
The week ordinal of the year
week
一年的序数周
The week ordinal of the year
dayofweek
星期数,星期一为 0,星期日为 6
The number of the day of the week with Monday=0, Sunday=6
day_of_week
星期数,星期一为 0,星期日为 6
The number of the day of the week with Monday=0, Sunday=6
weekday
星期数,星期一为 0,星期日为 6
The number of the day of the week with Monday=0, Sunday=6
quarter
日期的季度:1 月到 3 月为 1,4 月到 6 月为 2,依此类推
Quarter of the date: Jan-Mar = 1, Apr-Jun = 2, etc.
days_in_month
日期所在月份的天数
The number of days in the month of the datetime
is_month_start
逻辑值,表示是否是月份的第一天(由频率定义)
Logical indicating if first day of month (defined by frequency)
is_month_end
逻辑值,表示是否是月份的最后一天(由频率定义)
Logical indicating if last day of month (defined by frequency)
is_quarter_start
逻辑值,表示是否是季度的第一天(由频率定义)
Logical indicating if first day of quarter (defined by frequency)
is_quarter_end
逻辑值,表示是否是季度的最后一天(由频率定义)
Logical indicating if last day of quarter (defined by frequency)
is_year_start
逻辑值,表示是否是年份的第一天(由频率定义)
Logical indicating if first day of year (defined by frequency)
is_year_end
逻辑值,表明是否为年的最后一天(由频率定义)
Logical indicating if last day of year (defined by frequency)
is_leap_year
逻辑值,表明日期是否属于闰年
Logical indicating if the date belongs to a leap year
此外,如果您有包含日期时间值的 Series,则可以通过 .dt 存取器访问这些属性,如 .dt accessors 部分中所述。
Furthermore, if you have a Series with datetimelike values, then you can access these properties via the .dt accessor, as detailed in the section on .dt accessors.
您可以从 ISO 8601 标准中获取 ISO 年份的年份、周和日期组成部分:
You may obtain the year, week and day components of the ISO year from the ISO 8601 standard:
In [142]: idx = pd.date_range(start="2019-12-29", freq="D", periods=4)
In [143]: idx.isocalendar()
Out[143]:
year week day
2019-12-29 2019 52 7
2019-12-30 2020 1 1
2019-12-31 2020 1 2
2020-01-01 2020 1 3
In [144]: idx.to_series().dt.isocalendar()
Out[144]:
year week day
2019-12-29 2019 52 7
2019-12-30 2020 1 1
2019-12-31 2020 1 2
2020-01-01 2020 1 3
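Several of the properties listed in the table above can be collected from a Series through the .dt accessor. A minimal sketch; the two sample dates and the name components are chosen only for illustration.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(["2019-12-29", "2020-02-29"]))

# Each .dt property returns a Series aligned with the original values.
components = pd.DataFrame(
    {
        "year": stamps.dt.year,
        "quarter": stamps.dt.quarter,
        "dayofweek": stamps.dt.dayofweek,  # Monday=0, Sunday=6
        "days_in_month": stamps.dt.days_in_month,
        "is_leap_year": stamps.dt.is_leap_year,
    }
)
print(components)
```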
DateOffset objects
在前面的示例中,使用频率字符串(例如 'D')来指定一个频率,该频率定义了:
In the preceding examples, frequency strings (e.g. 'D') were used to specify a frequency that defined:
- how the date times in DatetimeIndex were spaced when using date_range()
- the frequency of a Period or PeriodIndex
这些频率字符串映射到一个 DateOffset 对象及其子类。DateOffset 类似于表示时间持续的 Timedelta,但遵循特定的日历持续时间规则。例如, Timedelta 天总是将 datetimes 增加 24 小时,而 DateOffset 天将 datetimes 增加到次日的相同时间,无论该天由于夏令时而表示 23、24 或 25 小时。但所有 DateOffset 子类(如果是一小时或更短 (Hour、Minute、Second、Milli、Micro、Nano))的行为与 Timedelta 类似并遵循绝对时间。
These frequency strings map to a DateOffset object and its subclasses. A DateOffset is similar to a Timedelta that represents a duration of time but follows specific calendar duration rules. For example, a Timedelta day will always increment datetimes by 24 hours, while a DateOffset day will increment datetimes to the same time the next day whether a day represents 23, 24 or 25 hours due to daylight savings time. However, all DateOffset subclasses that are an hour or smaller (Hour, Minute, Second, Milli, Micro, Nano) behave like Timedelta and respect absolute time.
基本的 DateOffset 类似于按照指定相应的日历持续时间对日期时间进行转换的 dateutil.relativedelta ( relativedelta documentation)。可以使用算术运算符 (+) 来执行转换。
The basic DateOffset acts similar to dateutil.relativedelta (relativedelta documentation) that shifts a date time by the corresponding calendar duration specified. The arithmetic operator (+) can be used to perform the shift.
# This particular day contains a daylight saving time transition
In [145]: ts = pd.Timestamp("2016-10-30 00:00:00", tz="Europe/Helsinki")
# Respects absolute time
In [146]: ts + pd.Timedelta(days=1)
Out[146]: Timestamp('2016-10-30 23:00:00+0200', tz='Europe/Helsinki')
# Respects calendar time
In [147]: ts + pd.DateOffset(days=1)
Out[147]: Timestamp('2016-10-31 00:00:00+0200', tz='Europe/Helsinki')
In [148]: friday = pd.Timestamp("2018-01-05")
In [149]: friday.day_name()
Out[149]: 'Friday'
# Add 2 business days (Friday --> Tuesday)
In [150]: two_business_days = 2 * pd.offsets.BDay()
In [151]: friday + two_business_days
Out[151]: Timestamp('2018-01-09 00:00:00')
In [152]: (friday + two_business_days).day_name()
Out[152]: 'Tuesday'
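Calendar-aware shifting also matters for months, whose lengths vary. The following minimal sketch compares DateOffset(months=1) with a fixed 30-day Timedelta; the date and variable names are illustrative.

```python
import pandas as pd

end_of_january = pd.Timestamp("2018-01-31")

# Calendar-aware: one month later, clipped to the last valid day of February.
by_offset = end_of_january + pd.DateOffset(months=1)

# Absolute duration: exactly 30 * 24 hours later.
by_timedelta = end_of_january + pd.Timedelta(days=30)

print(by_offset, by_timedelta)
```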
大多数 DateOffsets 都有关联的频率字符串或偏移别名,它们可以传递到 freq 关键字参数中。可用的日期偏移和关联的频率字符串可以在下面找到:
Most DateOffsets have associated frequencies strings, or offset aliases, that can be passed into freq keyword arguments. The available date offsets and associated frequency strings can be found below:
日期偏移
Date Offset
频率字符串
Frequency String
说明
Description
无
None
通用偏移类,默认为绝对 24 小时
Generic offset class, defaults to absolute 24 hours
BDay or BusinessDay
'B'
工作日(平日)
business day (weekday)
'C'
自定义工作日
custom business day
'W'
一周,可选在周中的某一天开始
one week, optionally anchored on a day of the week
'WOM'
每月第 y 周中的第 x 天
the x-th day of the y-th week of each month
'LWOM'
每月最后一周的第 x 天
the x-th day of the last week of each month
'ME'
日历月末
calendar month end
'MS'
日历月初
calendar month begin
'BME'
business month end
'BMS'
商业月开始
business month begin
'CBME'
自定义商业月结束
custom business month end
'CBMS'
自定义商业月开始
custom business month begin
'SME'
15 日(或本月其他日期)及日历月结束
15th (or other day_of_month) and calendar month end
'SMS'
15 日(或本月其他日期)及日历月开始
15th (or other day_of_month) and calendar month begin
'QE'
日历季度末
calendar quarter end
'QS'
日历季度初
calendar quarter begin
'BQE'
商业季度末
business quarter end
'BQS'
商业季度初
business quarter begin
'REQ'
零售(又名 52-53 周)季度
retail (aka 52-53 week) quarter
'YE'
日历年末
calendar year end
'YS' 或 'BYS'
'YS' or 'BYS'
日历年伊始
calendar year begin
'BYE'
营业年终
business year end
'BYS'
营业年伊始
business year begin
'RE'
零售(又称 52-53 周)年
retail (aka 52-53 week) year
无
None
复活节假期
Easter holiday
'bh'
营业时间
business hour
'cbh'
自定义营业时间
custom business hour
'D'
一个绝对的日子
one absolute day
'h'
一小时
one hour
'min'
一分钟
one minute
's'
一秒
one second
'ms'
一毫秒
one millisecond
'us'
一微秒
one microsecond
'ns'
一纳秒
one nanosecond
DateOffsets 另外还有 rollforward() 和 rollback() 方法,它们分别可以将一个日期向前或向后移动到相对于该偏移的有效偏移日期。例如,由于营业偏移在工作日进行,因此它们将落在周末(星期六和星期日)的日期向前滚动到星期一。
DateOffsets additionally have rollforward() and rollback() methods for moving a date forward or backward respectively to a valid offset date relative to the offset. For example, business offsets will roll dates that land on the weekends (Saturday and Sunday) forward to Monday since business offsets operate on the weekdays.
In [153]: ts = pd.Timestamp("2018-01-06 00:00:00")
In [154]: ts.day_name()
Out[154]: 'Saturday'
# BusinessHour's valid offset dates are Monday through Friday
In [155]: offset = pd.offsets.BusinessHour(start="09:00")
# Bring the date to the closest offset date (Monday)
In [156]: offset.rollforward(ts)
Out[156]: Timestamp('2018-01-08 09:00:00')
# Date is brought to the closest offset date first and then the hour is added
In [157]: ts + offset
Out[157]: Timestamp('2018-01-08 10:00:00')
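The same rolling behaviour applies to other anchored offsets. This is a minimal sketch using MonthEnd, assuming the dates below purely for illustration; is_on_offset() is used to check whether a date is already valid for the offset.

```python
import pandas as pd

month_end = pd.offsets.MonthEnd()
mid_month = pd.Timestamp("2018-01-15")

# Move to the nearest valid offset date in each direction.
rolled_forward = month_end.rollforward(mid_month)
rolled_back = month_end.rollback(mid_month)

# A date that is already on the offset is returned unchanged.
on_offset = pd.Timestamp("2018-01-31")
unchanged = month_end.rollforward(on_offset)

print(rolled_forward, rolled_back, month_end.is_on_offset(on_offset))
```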
默认情况下,这些操作会保留时间(小时、分钟等)信息。若要将时间重设为午夜,请在应用操作之前或之后使用 normalize()(取决于您是否需要在操作中包含时间信息)。
These operations preserve time (hour, minute, etc) information by default. To reset time to midnight, use normalize() before or after applying the operation (depending on whether you want the time information included in the operation).
In [158]: ts = pd.Timestamp("2014-01-01 09:00")
In [159]: day = pd.offsets.Day()
In [160]: day + ts
Out[160]: Timestamp('2014-01-02 09:00:00')
In [161]: (day + ts).normalize()
Out[161]: Timestamp('2014-01-02 00:00:00')
In [162]: ts = pd.Timestamp("2014-01-01 22:00")
In [163]: hour = pd.offsets.Hour()
In [164]: hour + ts
Out[164]: Timestamp('2014-01-01 23:00:00')
In [165]: (hour + ts).normalize()
Out[165]: Timestamp('2014-01-01 00:00:00')
In [166]: (hour + pd.Timestamp("2014-01-01 23:30")).normalize()
Out[166]: Timestamp('2014-01-02 00:00:00')
Parametric offsets
在创建某些偏移时可以对其进行“参数化”设置,从而导致不同的行为。例如,Week 偏移用于生成每周数据,它接受 weekday 参数,这会导致生成的日期始终落在一周的某一天:
Some of the offsets can be “parameterized” when created to result in different behaviors. For example, the Week offset for generating weekly data accepts a weekday parameter which results in the generated dates always lying on a particular day of the week:
In [167]: d = datetime.datetime(2008, 8, 18, 9, 0)
In [168]: d
Out[168]: datetime.datetime(2008, 8, 18, 9, 0)
In [169]: d + pd.offsets.Week()
Out[169]: Timestamp('2008-08-25 09:00:00')
In [170]: d + pd.offsets.Week(weekday=4)
Out[170]: Timestamp('2008-08-22 09:00:00')
In [171]: (d + pd.offsets.Week(weekday=4)).weekday()
Out[171]: 4
In [172]: d - pd.offsets.Week()
Out[172]: Timestamp('2008-08-11 09:00:00')
normalize 选项对于加法和减法有效。
The normalize option will be effective for addition and subtraction.
In [173]: d + pd.offsets.Week(normalize=True)
Out[173]: Timestamp('2008-08-25 00:00:00')
In [174]: d - pd.offsets.Week(normalize=True)
Out[174]: Timestamp('2008-08-11 00:00:00')
另一个示例是使用特定结束月份对 YearEnd 进行参数化:
Another example is parameterizing YearEnd with the specific ending month:
In [175]: d + pd.offsets.YearEnd()
Out[175]: Timestamp('2008-12-31 09:00:00')
In [176]: d + pd.offsets.YearEnd(month=6)
Out[176]: Timestamp('2009-06-30 09:00:00')
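A parameterized offset can also be passed directly as the freq argument of date_range(). A minimal sketch, reusing the Monday 2008-08-18 from the example above as the start date; the name fridays is illustrative.

```python
import pandas as pd

# Week anchored on Friday (weekday=4); the Monday start rolls forward to Friday.
friday_week = pd.offsets.Week(weekday=4)
fridays = pd.date_range("2008-08-18", periods=3, freq=friday_week)

print(fridays)
print(friday_week.freqstr)
```

Every generated date falls on the anchored weekday, regardless of the weekday of the start date.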
Using offsets with Series / DatetimeIndex
偏移可以与 Series 或 DatetimeIndex 一起使用,以便对每个元素应用偏移。
Offsets can be used with either a Series or DatetimeIndex to apply the offset to each element.
In [177]: rng = pd.date_range("2012-01-01", "2012-01-03")
In [178]: s = pd.Series(rng)
In [179]: rng
Out[179]: DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03'], dtype='datetime64[ns]', freq='D')
In [180]: rng + pd.DateOffset(months=2)
Out[180]: DatetimeIndex(['2012-03-01', '2012-03-02', '2012-03-03'], dtype='datetime64[ns]', freq=None)
In [181]: s + pd.DateOffset(months=2)
Out[181]:
0 2012-03-01
1 2012-03-02
2 2012-03-03
dtype: datetime64[ns]
In [182]: s - pd.DateOffset(months=2)
Out[182]:
0 2011-11-01
1 2011-11-02
2 2011-11-03
dtype: datetime64[ns]
如果偏移类直接映射到 Timedelta (Day、Hour、Minute、Second、Micro、Milli、Nano),那么它可以像 Timedelta 一样使用 - 请参阅 Timedelta section 以获取更多示例。
If the offset class maps directly to a Timedelta (Day, Hour, Minute, Second, Micro, Milli, Nano) it can be used exactly like a Timedelta - see the Timedelta section for more examples.
In [183]: s - pd.offsets.Day(2)
Out[183]:
0 2011-12-30
1 2011-12-31
2 2012-01-01
dtype: datetime64[ns]
In [184]: td = s - pd.Series(pd.date_range("2011-12-29", "2011-12-31"))
In [185]: td
Out[185]:
0 3 days
1 3 days
2 3 days
dtype: timedelta64[ns]
In [186]: td + pd.offsets.Minute(15)
Out[186]:
0 3 days 00:15:00
1 3 days 00:15:00
2 3 days 00:15:00
dtype: timedelta64[ns]
请注意,某些偏移(例如 BQuarterEnd)没有矢量化实现。它们仍然可以使用,但计算速度可能显著降低,并且会显示 PerformanceWarning。
Note that some offsets (such as BQuarterEnd) do not have a vectorized implementation. They can still be used but may be significantly slower and will show a PerformanceWarning.
In [187]: rng + pd.offsets.BQuarterEnd()
Out[187]: DatetimeIndex(['2012-03-30', '2012-03-30', '2012-03-30'], dtype='datetime64[ns]', freq=None)
Custom business days
CDay 或 CustomBusinessDay 类提供了一个参数化的 BusinessDay 类,该类可用于创建定制的营业日日历,其中考虑了当地节日和当地周末惯例。
The CDay or CustomBusinessDay class provides a parametric BusinessDay class which can be used to create customized business day calendars which account for local holidays and local weekend conventions.
作为一个有趣的例子,让我们来看看在星期五到星期六为周末的埃及。
As an interesting example, let’s look at Egypt where a Friday-Saturday weekend is observed.
In [188]: weekmask_egypt = "Sun Mon Tue Wed Thu"
# They also observe International Workers' Day so let's
# add that for a couple of years
In [189]: holidays = [
.....: "2012-05-01",
.....: datetime.datetime(2013, 5, 1),
.....: np.datetime64("2014-05-01"),
.....: ]
.....:
In [190]: bday_egypt = pd.offsets.CustomBusinessDay(
.....: holidays=holidays,
.....: weekmask=weekmask_egypt,
.....: )
.....:
In [191]: dt = datetime.datetime(2013, 4, 30)
In [192]: dt + 2 * bday_egypt
Out[192]: Timestamp('2013-05-05 00:00:00')
让我们映射到星期几名称:
Let’s map to the weekday names:
In [193]: dts = pd.date_range(dt, periods=5, freq=bday_egypt)
In [194]: pd.Series(dts.weekday, dts).map(pd.Series("Mon Tue Wed Thu Fri Sat Sun".split()))
Out[194]:
2013-04-30 Tue
2013-05-02 Thu
2013-05-05 Sun
2013-05-06 Mon
2013-05-07 Tue
Freq: C, dtype: object
可以使用节日日历来提供节日列表。有关详细信息,请参阅 holiday calendar 部分。
Holiday calendars can be used to provide the list of holidays. See the holiday calendar section for more information.
In [195]: from pandas.tseries.holiday import USFederalHolidayCalendar
In [196]: bday_us = pd.offsets.CustomBusinessDay(calendar=USFederalHolidayCalendar())
# Friday before MLK Day
In [197]: dt = datetime.datetime(2014, 1, 17)
# Tuesday after MLK Day (Monday is skipped because it's a holiday)
In [198]: dt + bday_us
Out[198]: Timestamp('2014-01-21 00:00:00')
可以按通常的方式定义遵循特定节日日历的月度偏移。
Monthly offsets that respect a certain holiday calendar can be defined in the usual way.
In [199]: bmth_us = pd.offsets.CustomBusinessMonthBegin(calendar=USFederalHolidayCalendar())
# Skip new years
In [200]: dt = datetime.datetime(2013, 12, 17)
In [201]: dt + bmth_us
Out[201]: Timestamp('2014-01-02 00:00:00')
# Define date index with custom offset
In [202]: pd.date_range(start="20100101", end="20120101", freq=bmth_us)
Out[202]:
DatetimeIndex(['2010-01-04', '2010-02-01', '2010-03-01', '2010-04-01',
'2010-05-03', '2010-06-01', '2010-07-01', '2010-08-02',
'2010-09-01', '2010-10-01', '2010-11-01', '2010-12-01',
'2011-01-03', '2011-02-01', '2011-03-01', '2011-04-01',
'2011-05-02', '2011-06-01', '2011-07-01', '2011-08-01',
'2011-09-01', '2011-10-03', '2011-11-01', '2011-12-01'],
dtype='datetime64[ns]', freq='CBMS')
频率字符串“C”用于表示使用了 CustomBusinessDay DateOffset。需要注意的是,由于 CustomBusinessDay 是一个参数化类型,因此 CustomBusinessDay 的实例可能会有所不同,而这无法从“C”频率字符串中检测出来。因此,用户需要确保在应用程序中一致地使用“C”频率字符串。
The frequency string 'C' indicates that a CustomBusinessDay DateOffset is used. Note that because CustomBusinessDay is a parameterised type, instances of CustomBusinessDay may differ, and this is not detectable from the 'C' frequency string. The user therefore needs to ensure that the 'C' frequency string is used consistently within the application.
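The caveat about the 'C' frequency string can be made concrete by comparing two differently configured instances. A minimal sketch, assuming the Egyptian weekmask and a single holiday from the example above; the names egypt and default are illustrative.

```python
import pandas as pd

egypt = pd.offsets.CustomBusinessDay(
    weekmask="Sun Mon Tue Wed Thu", holidays=["2013-05-01"]
)
default = pd.offsets.CustomBusinessDay()  # Monday to Friday, no holidays

egypt_days = pd.date_range("2013-04-30", periods=3, freq=egypt)
default_days = pd.date_range("2013-04-30", periods=3, freq=default)

# Both offsets report the same frequency string but generate different dates.
print(egypt.freqstr, default.freqstr)
print(egypt_days)
print(default_days)
```

Passing the offset object itself, rather than the bare string, keeps the calendar configuration attached to the frequency.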
Business hour
BusinessHour 类在 BusinessDay 上提供一个营业时间表示,允许使用特定的开始和结束时间。
The BusinessHour class provides a business hour representation on BusinessDay, allowing specific start and end times to be used.
默认情况下,BusinessHour 使用 9:00 - 17:00 作为营业时间。添加 BusinessHour 会按小时频率增加 Timestamp。如果目标 Timestamp 超出营业时间,则移至下一个营业时间然后增加它。如果结果超过营业时间结束时间,则将剩余小时数添加到下一个营业日。
By default, BusinessHour uses 9:00 - 17:00 as business hours. Adding BusinessHour will increment Timestamp by hourly frequency. If target Timestamp is out of business hours, move to the next business hour then increment it. If the result exceeds the business hours end, the remaining hours are added to the next business day.
In [203]: bh = pd.offsets.BusinessHour()
In [204]: bh
Out[204]: <BusinessHour: bh=09:00-17:00>
# 2014-08-01 is Friday
In [205]: pd.Timestamp("2014-08-01 10:00").weekday()
Out[205]: 4
In [206]: pd.Timestamp("2014-08-01 10:00") + bh
Out[206]: Timestamp('2014-08-01 11:00:00')
# Below example is the same as: pd.Timestamp('2014-08-01 09:00') + bh
In [207]: pd.Timestamp("2014-08-01 08:00") + bh
Out[207]: Timestamp('2014-08-01 10:00:00')
# If the result is on the end time, move to the next business day
In [208]: pd.Timestamp("2014-08-01 16:00") + bh
Out[208]: Timestamp('2014-08-04 09:00:00')
# The remainder is added to the next business day
In [209]: pd.Timestamp("2014-08-01 16:30") + bh
Out[209]: Timestamp('2014-08-04 09:30:00')
# Adding 2 business hours
In [210]: pd.Timestamp("2014-08-01 10:00") + pd.offsets.BusinessHour(2)
Out[210]: Timestamp('2014-08-01 12:00:00')
# Subtracting 3 business hours
In [211]: pd.Timestamp("2014-08-01 10:00") + pd.offsets.BusinessHour(-3)
Out[211]: Timestamp('2014-07-31 15:00:00')
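When the added hours do not fit into the current business day, the remainder carries over past the weekend. A minimal sketch under the default 9:00 - 17:00 hours; the variable names are illustrative.

```python
import pandas as pd

friday_afternoon = pd.Timestamp("2014-08-01 15:00")
four_hours = pd.offsets.BusinessHour(4)

# Two hours fit on Friday (15:00 to 17:00); the other two start Monday at 09:00.
result = friday_afternoon + four_hours

print(result, result.day_name())
```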
您还可以通过关键字指定 start 和 end 时间。参数必须是一个具有 hour:minute 表示形式的 str 或一个 datetime.time 实例。将秒、微秒和纳秒指定为营业时间会导致 ValueError。
You can also specify start and end time by keywords. The argument must be a str with an hour:minute representation or a datetime.time instance. Specifying seconds, microseconds and nanoseconds as business hour results in ValueError.
In [212]: bh = pd.offsets.BusinessHour(start="11:00", end=datetime.time(20, 0))
In [213]: bh
Out[213]: <BusinessHour: bh=11:00-20:00>
In [214]: pd.Timestamp("2014-08-01 13:00") + bh
Out[214]: Timestamp('2014-08-01 14:00:00')
In [215]: pd.Timestamp("2014-08-01 09:00") + bh
Out[215]: Timestamp('2014-08-01 12:00:00')
In [216]: pd.Timestamp("2014-08-01 18:00") + bh
Out[216]: Timestamp('2014-08-01 19:00:00')
将 start 时间传递为晚于 end 时间代表午夜营业时间。在这种情况下,营业时间超过午夜并与第二天重叠。有效的营业时间通过它是否从有效的 BusinessDay 开始来区分。
Passing a start time later than end represents business hours that span midnight. In this case, the business hours run past midnight and overlap into the next day. Valid business hours are distinguished by whether they started on a valid BusinessDay.
In [217]: bh = pd.offsets.BusinessHour(start="17:00", end="09:00")
In [218]: bh
Out[218]: <BusinessHour: bh=17:00-09:00>
In [219]: pd.Timestamp("2014-08-01 17:00") + bh
Out[219]: Timestamp('2014-08-01 18:00:00')
In [220]: pd.Timestamp("2014-08-01 23:00") + bh
Out[220]: Timestamp('2014-08-02 00:00:00')
# Although 2014-08-02 is Saturday,
# it is valid because it starts from 08-01 (Friday).
In [221]: pd.Timestamp("2014-08-02 04:00") + bh
Out[221]: Timestamp('2014-08-02 05:00:00')
# Although 2014-08-04 is Monday,
# it is out of business hours because it starts from 08-03 (Sunday).
In [222]: pd.Timestamp("2014-08-04 04:00") + bh
Out[222]: Timestamp('2014-08-04 18:00:00')
将 BusinessHour.rollforward 和 rollback 应用到非工作时间将导致下一个工作小时的开始或前一天的结束。不同于其他偏移,根据定义,BusinessHour.rollforward 可能从 apply 输出不同的结果。
Applying BusinessHour.rollforward and rollback to a timestamp outside business hours results in the next business hour start or the previous day’s end. Unlike other offsets, BusinessHour.rollforward may, by definition, output a different result from apply.
这是因为一天的工作时间结束等于下一天的工作时间开始。举例来说,在默认的工作时间(9:00 - 17:00)下,2014-08-01 17:00 和 2014-08-04 09:00 之间没有间隔(0 分钟)。
This is because one day’s business hour end is equal to next day’s business hour start. For example, under the default business hours (9:00 - 17:00), there is no gap (0 minutes) between 2014-08-01 17:00 and 2014-08-04 09:00.
# This adjusts a Timestamp to business hour edge
In [223]: pd.offsets.BusinessHour().rollback(pd.Timestamp("2014-08-02 15:00"))
Out[223]: Timestamp('2014-08-01 17:00:00')
In [224]: pd.offsets.BusinessHour().rollforward(pd.Timestamp("2014-08-02 15:00"))
Out[224]: Timestamp('2014-08-04 09:00:00')
# It is the same as BusinessHour() + pd.Timestamp('2014-08-01 17:00').
# And it is the same as BusinessHour() + pd.Timestamp('2014-08-04 09:00')
In [225]: pd.offsets.BusinessHour() + pd.Timestamp("2014-08-02 15:00")
Out[225]: Timestamp('2014-08-04 10:00:00')
# BusinessDay results (for reference)
In [226]: pd.offsets.BusinessDay().rollforward(pd.Timestamp("2014-08-02"))
Out[226]: Timestamp('2014-08-04 00:00:00')
# It is the same as BusinessDay() + pd.Timestamp('2014-08-01')
# The result is the same as rollforward because BusinessDay never overlaps.
In [227]: pd.offsets.BusinessDay() + pd.Timestamp("2014-08-02")
Out[227]: Timestamp('2014-08-04 00:00:00')
BusinessHour 将星期六和星期日视为假日。若要使用任意的假日,您可以使用 CustomBusinessHour 偏移,如在以下小节中所述。
BusinessHour regards Saturday and Sunday as holidays. To use arbitrary holidays, you can use CustomBusinessHour offset, as explained in the following subsection.
Custom business hour
CustomBusinessHour 是 BusinessHour 和 CustomBusinessDay 的组合,它允许您指定任意的假日。CustomBusinessHour 工作方式与 BusinessHour 相同,除了它会跳过指定的自定义假日。
CustomBusinessHour is a mixture of BusinessHour and CustomBusinessDay which allows you to specify arbitrary holidays. CustomBusinessHour works the same as BusinessHour except that it skips the specified custom holidays.
In [228]: from pandas.tseries.holiday import USFederalHolidayCalendar
In [229]: bhour_us = pd.offsets.CustomBusinessHour(calendar=USFederalHolidayCalendar())
# Friday before MLK Day
In [230]: dt = datetime.datetime(2014, 1, 17, 15)
In [231]: dt + bhour_us
Out[231]: Timestamp('2014-01-17 16:00:00')
# Tuesday after MLK Day (Monday is skipped because it's a holiday)
In [232]: dt + bhour_us * 2
Out[232]: Timestamp('2014-01-21 09:00:00')
您可以使用 BusinessHour 和 CustomBusinessDay 支持的关键字参数。
You can use keyword arguments supported by either BusinessHour or CustomBusinessDay.
In [233]: bhour_mon = pd.offsets.CustomBusinessHour(start="10:00", weekmask="Tue Wed Thu Fri")
# Monday is skipped because the weekmask excludes it; business hours start at 10:00
In [234]: dt + bhour_mon * 2
Out[234]: Timestamp('2014-01-21 10:00:00')
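Holidays can also be listed explicitly instead of coming from a calendar object. A minimal sketch, assuming a single hand-picked holiday (2014-01-20) and the hypothetical name office_hours.

```python
import pandas as pd

office_hours = pd.offsets.CustomBusinessHour(
    start="09:00", end="17:00", holidays=["2014-01-20"]
)
friday = pd.Timestamp("2014-01-17 16:00")

# One hour fits on Friday; the second lands on Tuesday because the weekend
# and the Monday holiday are both skipped.
result = friday + 2 * office_hours

print(result, result.day_name())
```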
Offset aliases
许多常用的时间序列频率都有对应的字符串别名。我们将这些别名称为偏移别名。
A number of string aliases are given to useful common time series frequencies. We will refer to these aliases as offset aliases.
别名
Alias
说明
Description
B
工作日频率
business day frequency
C
自定义工作日频率
custom business day frequency
D
日历日频率
calendar day frequency
W
每周频率
weekly frequency
ME
月末频率
month end frequency
SME
半月末频率(15 日和月末)
semi-month end frequency (15th and end of month)
BME
业务月结束频率
business month end frequency
CBME
自定义业务月结束频率
custom business month end frequency
MS
月开始频率
month start frequency
SMS
半月开始频率(1 日和 15 日)
semi-month start frequency (1st and 15th)
BMS
业务月开始频率
business month start frequency
CBMS
自定义业务月开始频率
custom business month start frequency
QE
季度结束频率
quarter end frequency
BQE
业务季度结束频率
business quarter end frequency
QS
季度开始频率
quarter start frequency
BQS
商业季度开始频率
business quarter start frequency
YE
年末频率
year end frequency
BYE
业务年结束频率
business year end frequency
YS
年开始频率
year start frequency
BYS
业务年开始频率
business year start frequency
h
每小时频率
hourly frequency
bh
业务小时频率
business hour frequency
cbh
自定义业务小时频率
custom business hour frequency
min
每分钟频率
minutely frequency
s
每秒频率
secondly frequency
ms
毫秒
milliseconds
us
微秒
microseconds
ns
纳秒
nanoseconds
自版本 2.2.0 弃用:为支持 h、bh、cbh、min、s、ms、us 和 ns 别名,H、BH、CBH、T、S、L、U 和 N 别名已弃用。
Deprecated since version 2.2.0: Aliases H, BH, CBH, T, S, L, U, and N are deprecated in favour of the aliases h, bh, cbh, min, s, ms, us, and ns.
在使用以上偏移别名时,应注意,诸如 date_range()、bdate_range() 等函数只会返回 start_date 和 end_date 定义的间隔内的 timestamp。如果 start_date 与频率不对应,则返回的 timestamp 将从下一个有效 timestamp 开始;同样,如果 end_date 与频率不对应,返回的 timestamp 将在前一个有效的 timestamp 处停止。
When using the offset aliases above, note that functions such as date_range() and bdate_range() only return timestamps that are in the interval defined by start_date and end_date. If start_date does not correspond to the frequency, the returned timestamps start at the next valid timestamp. Likewise, if end_date does not correspond to the frequency, the returned timestamps stop at the previous valid timestamp.
例如,对于偏移 MS,如果 start_date 不是当月第一天,则返回的 timestamp 将从下个月的第一天开始。如果 end_date 不是某个月的第一天,则最后返回的 timestamp 将是相应月份的第一天。
For example, for the offset MS, if the start_date is not the first of the month, the returned timestamps will start with the first day of the next month. If end_date is not the first day of a month, the last returned timestamp will be the first day of the corresponding month.
In [235]: dates_lst_1 = pd.date_range("2020-01-06", "2020-04-03", freq="MS")
In [236]: dates_lst_1
Out[236]: DatetimeIndex(['2020-02-01', '2020-03-01', '2020-04-01'], dtype='datetime64[ns]', freq='MS')
In [237]: dates_lst_2 = pd.date_range("2020-01-01", "2020-04-01", freq="MS")
In [238]: dates_lst_2
Out[238]: DatetimeIndex(['2020-01-01', '2020-02-01', '2020-03-01', '2020-04-01'], dtype='datetime64[ns]', freq='MS')
从上述示例可以看到,date_range() 和 bdate_range() 只会返回 start_date 和 end_date 之间的有效时间戳。如果端点不是给定频率的有效时间戳,start_date 将滚动到下一个有效值,end_date 将滚动到上一个有效值。
As the example above shows, date_range() and bdate_range() only return the valid timestamps between start_date and end_date. If an endpoint is not a valid timestamp for the given frequency, start_date rolls forward to the next valid value and end_date rolls back to the previous one.
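The rolling of the endpoints can be reproduced by hand with the rollforward and rollback methods of the offset behind the alias. This is a minimal sketch, assuming MonthBegin is the offset that "MS" resolves to; the variable names are illustrative only.

```python
import pandas as pd

offset = pd.offsets.MonthBegin()  # the offset behind the "MS" alias

start = pd.Timestamp("2020-01-06")  # not a month start
end = pd.Timestamp("2020-04-03")    # not a month start

# Roll each endpoint onto the nearest valid "MS" timestamp inside the interval.
first_valid = offset.rollforward(start)  # next month start on or after `start`
last_valid = offset.rollback(end)        # previous month start on or before `end`

dates = pd.date_range(start, end, freq="MS")
print(first_valid, last_valid)
print(dates)
```

The first and last entries of `dates` coincide with `first_valid` and `last_valid`.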
Period aliases
大量字符串别名用于有用的常用时间序列频率。我们将这些别名称为周期别名。
A number of string aliases are given to useful common time series frequencies. We will refer to these aliases as period aliases.
别名
Alias
说明
Description
B
工作日频率
business day frequency
D
日历日频率
calendar day frequency
W
每周频率
weekly frequency
M
月度频率
monthly frequency
Q
季度频率
quarterly frequency
Y
年度频率
yearly frequency
h
每小时频率
hourly frequency
min
每分钟频率
minutely frequency
s
每秒频率
secondly frequency
ms
毫秒
milliseconds
us
微秒
microseconds
ns
纳秒
nanoseconds
自 2.2.0 版起弃用:别名 A、H、T、S、L、U 和 N 已弃用,请改用别名 Y、h、min、s、ms、us 和 ns。
Deprecated since version 2.2.0: Aliases A, H, T, S, L, U, and N are deprecated in favour of the aliases Y, h, min, s, ms, us, and ns.
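As a short illustration of period aliases in use, the sketch below builds a monthly PeriodIndex and a single quarterly Period. The dates are arbitrary examples, not taken from this guide.

```python
import pandas as pd

# Three consecutive monthly periods starting in January 2020.
months = pd.period_range("2020-01", periods=3, freq="M")

# A single quarterly period; start_time and end_time give the span it covers.
q1 = pd.Period("2020Q1", freq="Q")

print(months)
print(q1.start_time, q1.end_time)
```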
Combining aliases
正如我们先前所看到的,在大多数函数中,别名和偏移实例是可交换的:
As we have seen previously, the alias and the offset instance are fungible in most functions:
In [239]: pd.date_range(start, periods=5, freq="B")
Out[239]:
DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05', '2011-01-06',
'2011-01-07'],
dtype='datetime64[ns]', freq='B')
In [240]: pd.date_range(start, periods=5, freq=pd.offsets.BDay())
Out[240]:
DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05', '2011-01-06',
'2011-01-07'],
dtype='datetime64[ns]', freq='B')
您可以将每日和日内偏移组合在一起:
You can combine together day and intraday offsets:
In [241]: pd.date_range(start, periods=10, freq="2h20min")
Out[241]:
DatetimeIndex(['2011-01-01 00:00:00', '2011-01-01 02:20:00',
'2011-01-01 04:40:00', '2011-01-01 07:00:00',
'2011-01-01 09:20:00', '2011-01-01 11:40:00',
'2011-01-01 14:00:00', '2011-01-01 16:20:00',
'2011-01-01 18:40:00', '2011-01-01 21:00:00'],
dtype='datetime64[ns]', freq='140min')
In [242]: pd.date_range(start, periods=10, freq="1D10us")
Out[242]:
DatetimeIndex([ '2011-01-01 00:00:00', '2011-01-02 00:00:00.000010',
'2011-01-03 00:00:00.000020', '2011-01-04 00:00:00.000030',
'2011-01-05 00:00:00.000040', '2011-01-06 00:00:00.000050',
'2011-01-07 00:00:00.000060', '2011-01-08 00:00:00.000070',
'2011-01-09 00:00:00.000080', '2011-01-10 00:00:00.000090'],
dtype='datetime64[ns]', freq='86400000010us')
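One way to check what a combined alias means is to compare the spacing of the generated index with an explicit Timedelta. This is an illustrative sketch; the start date is chosen arbitrarily.

```python
import pandas as pd

idx = pd.date_range("2011-01-01", periods=4, freq="2h20min")

# Every step of the combined alias should equal 2 hours plus 20 minutes.
expected_step = pd.Timedelta(hours=2, minutes=20)
steps = idx[1:] - idx[:-1]

print(steps)
```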
Anchored offsets
对于某些频率,您可以指定一个锚定后缀:
For some frequencies you can specify an anchoring suffix:
别名
Alias
说明
Description
W-SUN
每周频率(星期日)。与“W”相同
weekly frequency (Sundays). Same as ‘W’
W-MON
每周频率(星期一)
weekly frequency (Mondays)
W-TUE
每周频率(星期二)
weekly frequency (Tuesdays)
W-WED
每周频率(星期三)
weekly frequency (Wednesdays)
W-THU
每周频率(星期四)
weekly frequency (Thursdays)
W-FRI
每周频率(星期五)
weekly frequency (Fridays)
W-SAT
每周频率(星期六)
weekly frequency (Saturdays)
(B)Q(E)(S)-DEC
每季度频率,年底在 12 月。与“QE”相同
quarterly frequency, year ends in December. Same as ‘QE’
(B)Q(E)(S)-JAN
季度频率,一年1月结束
quarterly frequency, year ends in January
(B)Q(E)(S)-FEB
季度频率,一年2月结束
quarterly frequency, year ends in February
(B)Q(E)(S)-MAR
季度频率,一年3月结束
quarterly frequency, year ends in March
(B)Q(E)(S)-APR
季度频率,一年4月结束
quarterly frequency, year ends in April
(B)Q(E)(S)-MAY
季度频率,一年5月结束
quarterly frequency, year ends in May
(B)Q(E)(S)-JUN
季度频率,一年6月结束
quarterly frequency, year ends in June
(B)Q(E)(S)-JUL
季度频率,一年7月结束
quarterly frequency, year ends in July
(B)Q(E)(S)-AUG
季度频率,一年8月结束
quarterly frequency, year ends in August
(B)Q(E)(S)-SEP
季度频率,一年9月结束
quarterly frequency, year ends in September
(B)Q(E)(S)-OCT
季度频率,一年10月结束
quarterly frequency, year ends in October
(B)Q(E)(S)-NOV
季度频率,年于 11 月结束
quarterly frequency, year ends in November
(B)Y(E)(S)-DEC
年度频率,停靠于 12 月末。与 'YE' 相同
annual frequency, anchored end of December. Same as ‘YE’
(B)Y(E)(S)-JAN
年度频率,停靠于 1 月末
annual frequency, anchored end of January
(B)Y(E)(S)-FEB
年度频率,停靠于 2 月末
annual frequency, anchored end of February
(B)Y(E)(S)-MAR
年度频率,停靠于 3 月末
annual frequency, anchored end of March
(B)Y(E)(S)-APR
年度频率,停靠于 4 月末
annual frequency, anchored end of April
(B)Y(E)(S)-MAY
年度频率,停靠于 5 月末
annual frequency, anchored end of May
(B)Y(E)(S)-JUN
年度频率,停靠于 6 月末
annual frequency, anchored end of June
(B)Y(E)(S)-JUL
年度频率,停靠于 7 月末
annual frequency, anchored end of July
(B)Y(E)(S)-AUG
年度频率,固定于 8 月末
annual frequency, anchored end of August
(B)Y(E)(S)-SEP
年度频率,固定于 9 月末
annual frequency, anchored end of September
(B)Y(E)(S)-OCT
年度频率,固定于 10 月末
annual frequency, anchored end of October
(B)Y(E)(S)-NOV
年度频率,固定于 11 月末
annual frequency, anchored end of November
这些可以用作 date_range,bdate_range,DatetimeIndex 的构造函数,以及 pandas 中的其他各种时间序列相关函数的参数。
These can be used as arguments to date_range, bdate_range, constructors for DatetimeIndex, as well as various other timeseries-related functions in pandas.
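The sketch below passes two anchored aliases to date_range: "W-MON" for weekly dates on Mondays and "QS-FEB" for quarter starts in a year that ends in February. The start date is an arbitrary example.

```python
import pandas as pd

# Weekly frequency anchored on Mondays (2024-01-01 is itself a Monday).
mondays = pd.date_range("2024-01-01", periods=3, freq="W-MON")

# Quarter starts for a year ending in February: Feb, May, Aug, Nov.
quarters = pd.date_range("2024-01-01", periods=4, freq="QS-FEB")

print(mondays)
print(quarters)
```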
Anchored offset semantics
对于固定于特定频率的开始或结束的偏移量 (MonthEnd,MonthBegin,WeekEnd 等),以下规则适用于前滚与后退。
For those offsets that are anchored to the start or end of specific frequency (MonthEnd, MonthBegin, WeekEnd, etc), the following rules apply to rolling forward and backwards.
当 n 不为 0 时,如果给定日期不在锚点上,则先将其对齐到下一个(或上一个)锚点,然后再向前或向后移动 |n|-1 步。
When n is not 0, if the given date is not on an anchor point, it is snapped to the next (or previous) anchor point and then moved |n|-1 additional steps forwards or backwards.
In [243]: pd.Timestamp("2014-01-02") + pd.offsets.MonthBegin(n=1)
Out[243]: Timestamp('2014-02-01 00:00:00')
In [244]: pd.Timestamp("2014-01-02") + pd.offsets.MonthEnd(n=1)
Out[244]: Timestamp('2014-01-31 00:00:00')
In [245]: pd.Timestamp("2014-01-02") - pd.offsets.MonthBegin(n=1)
Out[245]: Timestamp('2014-01-01 00:00:00')
In [246]: pd.Timestamp("2014-01-02") - pd.offsets.MonthEnd(n=1)
Out[246]: Timestamp('2013-12-31 00:00:00')
In [247]: pd.Timestamp("2014-01-02") + pd.offsets.MonthBegin(n=4)
Out[247]: Timestamp('2014-05-01 00:00:00')
In [248]: pd.Timestamp("2014-01-02") - pd.offsets.MonthBegin(n=4)
Out[248]: Timestamp('2013-10-01 00:00:00')
如果给定日期在锚点上,则将其向前或向后退移 |n| 个点。
If the given date is on an anchor point, it is moved |n| points forwards or backwards.
In [249]: pd.Timestamp("2014-01-01") + pd.offsets.MonthBegin(n=1)
Out[249]: Timestamp('2014-02-01 00:00:00')
In [250]: pd.Timestamp("2014-01-31") + pd.offsets.MonthEnd(n=1)
Out[250]: Timestamp('2014-02-28 00:00:00')
In [251]: pd.Timestamp("2014-01-01") - pd.offsets.MonthBegin(n=1)
Out[251]: Timestamp('2013-12-01 00:00:00')
In [252]: pd.Timestamp("2014-01-31") - pd.offsets.MonthEnd(n=1)
Out[252]: Timestamp('2013-12-31 00:00:00')
In [253]: pd.Timestamp("2014-01-01") + pd.offsets.MonthBegin(n=4)
Out[253]: Timestamp('2014-05-01 00:00:00')
In [254]: pd.Timestamp("2014-01-31") - pd.offsets.MonthBegin(n=4)
Out[254]: Timestamp('2013-10-01 00:00:00')
对于 n=0 的情况,如果在锚点上则不移动日期,否则前滚至下一个锚点。
For the case when n=0, the date is not moved if on an anchor point, otherwise it is rolled forward to the next anchor point.
In [255]: pd.Timestamp("2014-01-02") + pd.offsets.MonthBegin(n=0)
Out[255]: Timestamp('2014-02-01 00:00:00')
In [256]: pd.Timestamp("2014-01-02") + pd.offsets.MonthEnd(n=0)
Out[256]: Timestamp('2014-01-31 00:00:00')
In [257]: pd.Timestamp("2014-01-01") + pd.offsets.MonthBegin(n=0)
Out[257]: Timestamp('2014-01-01 00:00:00')
In [258]: pd.Timestamp("2014-01-31") + pd.offsets.MonthEnd(n=0)
Out[258]: Timestamp('2014-01-31 00:00:00')
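The three rules above can be checked against rollforward and rollback, which perform only the snapping step. This is a minimal sketch using the same dates as the examples; the variable names are illustrative.

```python
import pandas as pd

ts = pd.Timestamp("2014-01-02")  # not on a month-end or month-begin anchor

# n > 0: snap forward to the next anchor, then take n - 1 further steps.
forward = ts + pd.offsets.MonthEnd(3)
forward_by_rule = pd.offsets.MonthEnd().rollforward(ts) + pd.offsets.MonthEnd(2)

# n < 0: snap back to the previous anchor, then take |n| - 1 further steps.
backward = ts - pd.offsets.MonthBegin(4)
backward_by_rule = pd.offsets.MonthBegin().rollback(ts) - pd.offsets.MonthBegin(3)

# n = 0: unchanged on an anchor, otherwise rolled forward to the next anchor.
zero = ts + pd.offsets.MonthBegin(0)

print(forward, backward, zero)
```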
Holidays / holiday calendars
假期和日历提供了一种简单的方法,可定义与 CustomBusinessDay 或需要预定义假期集的其他分析配合使用的假期规则。AbstractHolidayCalendar 类提供了返回假期列表的所有必要方法,而且只需在特定的假期日历类中定义 rules 即可。此外,start_date 和 end_date 类属性确定生成假期的日期范围。应在 AbstractHolidayCalendar 类中覆盖这些内容,以便范围适用于所有日历子类。USFederalHolidayCalendar 是唯一存在的日历,主要用作开发其他日历的示例。
Holidays and calendars provide a simple way to define holiday rules to be used with CustomBusinessDay or in other analysis that requires a predefined set of holidays. The AbstractHolidayCalendar class provides all the necessary methods to return a list of holidays and only rules need to be defined in a specific holiday calendar class. Furthermore, the start_date and end_date class attributes determine over what date range holidays are generated. These should be overwritten on the AbstractHolidayCalendar class to have the range apply to all calendar subclasses. USFederalHolidayCalendar is the only calendar that exists and primarily serves as an example for developing other calendars.
对于发生在固定日期(例如,美国阵亡将士纪念日或 7 月 4 日)的假期,观察规则决定了如果该假期落在周末或其他非观察日,则何时观察该假期。定义的观察规则为:
For holidays that occur on fixed dates (e.g., US Memorial Day or July 4th) an observance rule determines when that holiday is observed if it falls on a weekend or some other non-observed day. Defined observance rules are:
规则
Rule
说明
Description
nearest_workday
将星期六移到星期五,将星期天移到星期一
move Saturday to Friday and Sunday to Monday
sunday_to_monday
将星期天移到下星期一
move Sunday to following Monday
next_monday_or_tuesday
将星期六移至星期一,将星期日/星期一移至星期二
move Saturday to Monday and Sunday/Monday to Tuesday
previous_friday
将星期六和星期日移至上一个星期五
move Saturday and Sunday to previous Friday
next_monday
将星期六和星期日移至接下来的星期一
move Saturday and Sunday to following Monday
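The observance rules are plain functions in pandas.tseries.holiday that map a date to the observed date, so they can be tried directly. The sketch below applies a few of them to two July 4th dates that fell on a weekend; the chosen years are illustrative.

```python
import datetime

from pandas.tseries.holiday import (
    nearest_workday,
    next_monday,
    previous_friday,
    sunday_to_monday,
)

saturday = datetime.datetime(2015, 7, 4)  # July 4th, 2015 fell on a Saturday
sunday = datetime.datetime(2021, 7, 4)    # July 4th, 2021 fell on a Sunday

sat_nearest = nearest_workday(saturday)    # Saturday moves back to Friday
sun_nearest = nearest_workday(sunday)      # Sunday moves forward to Monday
sat_unchanged = sunday_to_monday(saturday) # only Sundays are moved by this rule
sat_next = next_monday(saturday)           # Saturday moves to the following Monday
sun_previous = previous_friday(sunday)     # Sunday moves to the previous Friday

print(sat_nearest, sun_nearest, sat_unchanged, sat_next, sun_previous)
```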
以下是节日和节日日历定义举例:
An example of how holidays and holiday calendars are defined:
In [259]: from pandas.tseries.holiday import (
.....: Holiday,
.....: USMemorialDay,
.....: AbstractHolidayCalendar,
.....: nearest_workday,
.....: MO,
.....: )
.....:
In [260]: class ExampleCalendar(AbstractHolidayCalendar):
.....:     rules = [
.....:         USMemorialDay,
.....:         Holiday("July 4th", month=7, day=4, observance=nearest_workday),
.....:         Holiday(
.....:             "Columbus Day",
.....:             month=10,
.....:             day=1,
.....:             offset=pd.DateOffset(weekday=MO(2)),
.....:         ),
.....:     ]
.....:
In [261]: cal = ExampleCalendar()
In [262]: cal.holidays(datetime.datetime(2012, 1, 1), datetime.datetime(2012, 12, 31))
Out[262]: DatetimeIndex(['2012-05-28', '2012-07-04', '2012-10-08'], dtype='datetime64[ns]', freq=None)
Hint: weekday=MO(2) is same as 2 * Week(weekday=2)
Using this calendar, creating an index or doing offset arithmetic skips weekends and holidays (i.e., Memorial Day/July 4th). For example, the below defines a custom business day offset using the ExampleCalendar. Like any other offset, it can be used to create a DatetimeIndex or added to datetime or Timestamp objects.
In [263]: pd.date_range(
.....: start="7/1/2012", end="7/10/2012", freq=pd.offsets.CDay(calendar=cal)
.....: ).to_pydatetime()
.....:
Out[263]:
array([datetime.datetime(2012, 7, 2, 0, 0),
datetime.datetime(2012, 7, 3, 0, 0),
datetime.datetime(2012, 7, 5, 0, 0),
datetime.datetime(2012, 7, 6, 0, 0),
datetime.datetime(2012, 7, 9, 0, 0),
datetime.datetime(2012, 7, 10, 0, 0)], dtype=object)
In [264]: offset = pd.offsets.CustomBusinessDay(calendar=cal)
In [265]: datetime.datetime(2012, 5, 25) + offset
Out[265]: Timestamp('2012-05-29 00:00:00')
In [266]: datetime.datetime(2012, 7, 3) + offset
Out[266]: Timestamp('2012-07-05 00:00:00')
In [267]: datetime.datetime(2012, 7, 3) + 2 * offset
Out[267]: Timestamp('2012-07-06 00:00:00')
In [268]: datetime.datetime(2012, 7, 6) + offset
Out[268]: Timestamp('2012-07-09 00:00:00')
范围由 AbstractHolidayCalendar 的 start_date 和 end_date 类属性定义。以下是显示的默认值。
Ranges are defined by the start_date and end_date class attributes of AbstractHolidayCalendar. The defaults are shown below.
In [269]: AbstractHolidayCalendar.start_date
Out[269]: Timestamp('1970-01-01 00:00:00')
In [270]: AbstractHolidayCalendar.end_date
Out[270]: Timestamp('2200-12-31 00:00:00')
可以通过将属性设置为 datetime/Timestamp/string 来覆盖这些日期。
These dates can be overwritten by setting the attributes as datetime/Timestamp/string.
In [271]: AbstractHolidayCalendar.start_date = datetime.datetime(2012, 1, 1)
In [272]: AbstractHolidayCalendar.end_date = datetime.datetime(2012, 12, 31)
In [273]: cal.holidays()
Out[273]: DatetimeIndex(['2012-05-28', '2012-07-04', '2012-10-08'], dtype='datetime64[ns]', freq=None)
每个日历类都可以使用 get_calendar 函数按名称访问,该函数返回一个假期日历类实例。任何已导入的日历类都会自动可通过该函数获取。此外,HolidayCalendarFactory 提供了一个简单的接口,用于创建由多个日历组合而成的日历,或带有附加规则的日历。
Every calendar class is accessible by name using the get_calendar function which returns a holiday class instance. Any imported calendar class will automatically be available by this function. Also, HolidayCalendarFactory provides an easy interface to create calendars that are combinations of calendars or calendars with additional rules.
In [274]: from pandas.tseries.holiday import get_calendar, HolidayCalendarFactory, USLaborDay
In [275]: cal = get_calendar("ExampleCalendar")
In [276]: cal.rules
Out[276]:
[Holiday: Memorial Day (month=5, day=31, offset=<DateOffset: weekday=MO(-1)>),
Holiday: July 4th (month=7, day=4, observance=<function nearest_workday at 0x7ff27fdb0b80>),
Holiday: Columbus Day (month=10, day=1, offset=<DateOffset: weekday=MO(+2)>)]
In [277]: new_cal = HolidayCalendarFactory("NewExampleCalendar", cal, USLaborDay)
In [278]: new_cal.rules
Out[278]:
[Holiday: Labor Day (month=9, day=1, offset=<DateOffset: weekday=MO(+1)>),
Holiday: Memorial Day (month=5, day=31, offset=<DateOffset: weekday=MO(-1)>),
Holiday: July 4th (month=7, day=4, observance=<function nearest_workday at 0x7ff27fdb0b80>),
Holiday: Columbus Day (month=10, day=1, offset=<DateOffset: weekday=MO(+2)>)]
Time Series-related instance methods
Shifting / lagging
可能需要将时间序列中的值向前或向后移动或滞后。方法是 shift(),该方法适用于所有 pandas 对象。
One may want to shift or lag the values in a time series back and forward in time. The method for this is shift(), which is available on all of the pandas objects.
In [279]: ts = pd.Series(range(len(rng)), index=rng)
In [280]: ts = ts[:5]
In [281]: ts.shift(1)
Out[281]:
2012-01-01 NaN
2012-01-02 0.0
2012-01-03 1.0
Freq: D, dtype: float64
shift 方法接受 freq 参数,该参数可以接受 DateOffset 类或其他 timedelta 类对象或 offset alias。
The shift method accepts a freq argument, which can be a DateOffset object, another timedelta-like object, or an offset alias.
当指定 freq 时,shift 方法会更改索引中的所有日期,而不是更改数据和索引的对齐方式:
When freq is specified, shift method changes all the dates in the index rather than changing the alignment of the data and the index:
In [282]: ts.shift(5, freq="D")
Out[282]:
2012-01-06 0
2012-01-07 1
2012-01-08 2
Freq: D, dtype: int64
In [283]: ts.shift(5, freq=pd.offsets.BDay())
Out[283]:
2012-01-06 0
2012-01-09 1
2012-01-10 2
dtype: int64
In [284]: ts.shift(5, freq="BME")
Out[284]:
2012-05-31 0
2012-05-31 1
2012-05-31 2
dtype: int64
请注意,当指定 freq 时,领先条目不再是 NaN,因为数据没有重新对齐。
Note that when freq is specified, the leading entry is no longer NaN because the data is not being realigned.
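The difference between the two modes is easiest to see side by side on a tiny series. The sketch below uses made-up data: without freq the values slide against a fixed index, while with freq the index moves and the values stay attached to their rows.

```python
import pandas as pd

ts = pd.Series([0, 1, 2], index=pd.date_range("2012-01-01", periods=3, freq="D"))

moved_values = ts.shift(1)           # index unchanged, values slide down, NaN appears
moved_index = ts.shift(1, freq="D")  # values unchanged, every date moves one day on

print(moved_values)
print(moved_index)
```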
Frequency conversion
更改频率的主要函数是 asfreq() 方法。对于 DatetimeIndex,它基本上只是对 reindex() 的一个轻量但方便的封装:生成一个 date_range 并调用 reindex。
The primary function for changing frequencies is the asfreq() method. For a DatetimeIndex, this is basically just a thin, but convenient wrapper around reindex() which generates a date_range and calls reindex.
In [285]: dr = pd.date_range("1/1/2010", periods=3, freq=3 * pd.offsets.BDay())
In [286]: ts = pd.Series(np.random.randn(3), index=dr)
In [287]: ts
Out[287]:
2010-01-01 1.494522
2010-01-06 -0.778425
2010-01-11 -0.253355
Freq: 3B, dtype: float64
In [288]: ts.asfreq(pd.offsets.BDay())
Out[288]:
2010-01-01 1.494522
2010-01-04 NaN
2010-01-05 NaN
2010-01-06 -0.778425
2010-01-07 NaN
2010-01-08 NaN
2010-01-11 -0.253355
Freq: B, dtype: float64
asfreq 进一步提供了便利,因此可以为频率转换后可能出现的任何间隙指定插补方法。
asfreq provides a further convenience so you can specify an interpolation method for any gaps that may appear after the frequency conversion.
In [289]: ts.asfreq(pd.offsets.BDay(), method="pad")
Out[289]:
2010-01-01 1.494522
2010-01-04 1.494522
2010-01-05 1.494522
2010-01-06 -0.778425
2010-01-07 -0.778425
2010-01-08 -0.778425
2010-01-11 -0.253355
Freq: B, dtype: float64
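Since asfreq is described as a thin wrapper around reindex, the same result can be built by hand. This is a sketch under that description, with made-up values; `target` is an illustrative name for the index that asfreq would generate.

```python
import pandas as pd

ts = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.date_range("2010-01-01", periods=3, freq="3D"),
)

via_asfreq = ts.asfreq("D", method="pad")

# Build the target index explicitly and reindex onto it with the same fill method.
target = pd.date_range(ts.index[0], ts.index[-1], freq="D")
via_reindex = ts.reindex(target, method="pad")

print(via_asfreq)
```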
Filling forward / backward
与 asfreq 和 reindex 相关的是 fillna(),它在 missing data section 中有记录。
Related to asfreq and reindex is fillna(), which is documented in the missing data section.
Converting to Python datetimes
DatetimeIndex 可使用 to_pydatetime 方法转换为 Python 原生 datetime.datetime 对象的数组
DatetimeIndex can be converted to an array of Python native datetime.datetime objects using the to_pydatetime method.
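A brief illustration of the conversion, using an arbitrary two-day index:

```python
import datetime

import pandas as pd

idx = pd.date_range("2020-01-01", periods=2, freq="D")

# The result is a NumPy object array holding datetime.datetime instances.
py_dates = idx.to_pydatetime()

print(py_dates)
```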
Resampling
pandas 具有用于在频率转换期间(例如,将每秒数据转换为每 5 分钟数据)执行重新采样操作的简单、强大且高效的功能。这在(但不限于)金融应用程序中非常常见。
pandas has a simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications.
resample() 是基于时间的 groupby,然后是对其每个组执行的还原方法。查看一些 cookbook examples 以了解一些高级策略。
resample() is a time-based groupby, followed by a reduction method on each of its groups. See some cookbook examples for some advanced strategies.
resample() 方法可以直接从 DataFrameGroupBy 对象中使用,请参见 groupby docs。
The resample() method can be used directly from DataFrameGroupBy objects, see the groupby docs.
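To make the "time-based groupby" description concrete, the sketch below compares resample with an explicit groupby on pd.Grouper. The data are made up, and the comparison is an illustration rather than a statement about how resample is implemented.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2012-01-01", periods=600, freq="s")
ts = pd.Series(np.arange(len(rng)), index=rng)

# Ten minutes of secondly data reduced to two 5-minute bins, two ways.
by_resample = ts.resample("5min").sum()
by_grouper = ts.groupby(pd.Grouper(freq="5min")).sum()

print(by_resample)
```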
Basics
In [290]: rng = pd.date_range("1/1/2012", periods=100, freq="s")
In [291]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
In [292]: ts.resample("5Min").sum()
Out[292]:
2012-01-01 25103
Freq: 5min, dtype: int64
resample 函数非常灵活,允许您指定许多不同的参数来控制频率转换和重新采样操作。
The resample function is very flexible and allows you to specify many different parameters to control the frequency conversion and resampling operation.
通过 GroupBy 可用的任何内置方法均可用作返回对象的方法,包括 sum、mean、std、sem、max、min、median、first、last、ohlc:
Any built-in method available via GroupBy is available as a method of the returned object, including sum, mean, std, sem, max, min, median, first, last, ohlc:
In [293]: ts.resample("5Min").mean()
Out[293]:
2012-01-01 251.03
Freq: 5min, dtype: float64
In [294]: ts.resample("5Min").ohlc()
Out[294]:
open high low close
2012-01-01 308 460 9 205
In [295]: ts.resample("5Min").max()
Out[295]:
2012-01-01 460
Freq: 5min, dtype: int64
对于降采样,可以将 closed 设置为 'left' 或 'right',以指定区间的哪一端是闭合的:
For downsampling, closed can be set to ‘left’ or ‘right’ to specify which end of the interval is closed:
In [296]: ts.resample("5Min", closed="right").mean()
Out[296]:
2011-12-31 23:55:00 308.000000
2012-01-01 00:00:00 250.454545
Freq: 5min, dtype: float64
In [297]: ts.resample("5Min", closed="left").mean()
Out[297]:
2012-01-01 251.03
Freq: 5min, dtype: float64
label 等参数用于操作结果标签。label 指定结果是否用间隔的开始或结束标记。
Parameters like label are used to manipulate the resulting labels. label specifies whether the result is labeled with the beginning or the end of the interval.
In [298]: ts.resample("5Min").mean() # by default label='left'
Out[298]:
2012-01-01 251.03
Freq: 5min, dtype: float64
In [299]: ts.resample("5Min", label="left").mean()
Out[299]:
2012-01-01 251.03
Freq: 5min, dtype: float64
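Because the examples above use random data, the effect of label and closed can be hard to follow. The sketch below repeats the idea on a small deterministic series (values 0 to 9 at one-minute spacing, chosen for illustration) so the bin contents can be checked by hand.

```python
import pandas as pd

ts = pd.Series(range(10), index=pd.date_range("2012-01-01", periods=10, freq="min"))

left_labeled = ts.resample("5min").sum()                  # bins labeled by their start
right_labeled = ts.resample("5min", label="right").sum()  # same bins, labeled by their end
right_closed = ts.resample("5min", closed="right").sum()  # bins become (start, end]

print(left_labeled)
print(right_labeled)
print(right_closed)
```

With closed="right" the first observation at 00:00 falls into the bin that starts at 23:55 on the previous day, which is why an extra leading row appears.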
警告
Warning
对于所有频率偏移量,label 和 closed 的默认值均为 'left',但 'ME'、'YE'、'QE'、'BME'、'BYE'、'BQE' 和 'W' 除外,它们的默认值均为 'right'。
The default value of both label and closed is ‘left’ for all frequency offsets except ‘ME’, ‘YE’, ‘QE’, ‘BME’, ‘BYE’, ‘BQE’, and ‘W’, which all default to ‘right’.
这可能会无意中导致前视(look-ahead),即较晚时间的值被拉回到较早的时间,如下面使用 BusinessDay 频率的示例所示:
This can unintentionally introduce look-ahead, where the value for a later time is pulled back to an earlier time, as in the following example with the BusinessDay frequency:
In [300]: s = pd.date_range("2000-01-01", "2000-01-05").to_series()
In [301]: s.iloc[2] = pd.NaT
In [302]: s.dt.day_name()
Out[302]:
2000-01-01 Saturday
2000-01-02 Sunday
2000-01-03 NaN
2000-01-04 Tuesday
2000-01-05 Wednesday
Freq: D, dtype: object
# default: label='left', closed='left'
In [303]: s.resample("B").last().dt.day_name()
Out[303]:
1999-12-31 Sunday
2000-01-03 NaN
2000-01-04 Tuesday
2000-01-05 Wednesday
Freq: B, dtype: object
注意星期天的值如何被拉回到上周五。要获得将星期天的值推至星期一的行为,请改用:
Notice how the value for Sunday got pulled back to the previous Friday. To get the behavior where the value for Sunday is pushed to Monday, use instead:
In [304]: s.resample("B", label="right", closed="right").last().dt.day_name()
Out[304]:
2000-01-03 Sunday
2000-01-04 Tuesday
2000-01-05 Wednesday
2000-01-06 NaN
Freq: B, dtype: object
axis 参数可以设置为 0 或 1,它允许您为 DataFrame 重新采样特定轴。
The axis parameter can be set to 0 or 1 and allows you to resample the specified axis for a DataFrame.
kind 可设置为 'timestamp' 或 'period',以便将结果索引在时间戳表示形式与时间跨度表示形式之间转换。默认情况下,resample 保留输入的表示形式。
kind can be set to ‘timestamp’ or ‘period’ to convert the resulting index to/from timestamp and time span representations. By default resample retains the input representation.
重新采样周期数据时,可以将 convention 设置为 'start' 或 'end'(详见下文)。它指定如何将低频周期转换为高频周期。
convention can be set to ‘start’ or ‘end’ when resampling period data (detail below). It specifies how low frequency periods are converted to higher frequency periods.
Upsampling
对于上采样,您可以指定上采样方式以及 limit 参数以在创建的间隙之上进行插值:
For upsampling, you can specify a way to upsample and the limit parameter to interpolate over the gaps that are created:
# from secondly to every 250 milliseconds
In [305]: ts[:2].resample("250ms").asfreq()
Out[305]:
2012-01-01 00:00:00.000 308.0
2012-01-01 00:00:00.250 NaN
2012-01-01 00:00:00.500 NaN
2012-01-01 00:00:00.750 NaN
2012-01-01 00:00:01.000 204.0
Freq: 250ms, dtype: float64
In [306]: ts[:2].resample("250ms").ffill()
Out[306]:
2012-01-01 00:00:00.000 308
2012-01-01 00:00:00.250 308
2012-01-01 00:00:00.500 308
2012-01-01 00:00:00.750 308
2012-01-01 00:00:01.000 204
Freq: 250ms, dtype: int64
In [307]: ts[:2].resample("250ms").ffill(limit=2)
Out[307]:
2012-01-01 00:00:00.000 308.0
2012-01-01 00:00:00.250 308.0
2012-01-01 00:00:00.500 308.0
2012-01-01 00:00:00.750 NaN
2012-01-01 00:00:01.000 204.0
Freq: 250ms, dtype: float64
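Besides forward filling, the gaps created by upsampling can be filled by interpolation. The sketch below uses the interpolate method of the resampler with its default linear method on two made-up points.

```python
import pandas as pd

ts = pd.Series([0.0, 1.0], index=pd.date_range("2012-01-01", periods=2, freq="s"))

# Upsample from secondly to every 250 milliseconds and fill the gaps linearly.
upsampled = ts.resample("250ms").interpolate()

print(upsampled)
```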
Sparse resampling
稀疏时间序列是指相对于要重新采样的时间跨度而言,数据点少得多的序列。直接对稀疏序列进行上采样可能会生成大量中间值。如果您不想使用某种方法来填充这些值(例如 fill_method 为 None),中间值将用 NaN 填充。
Sparse timeseries are the ones where you have a lot fewer points relative to the amount of time you are looking to resample. Naively upsampling a sparse series can potentially generate lots of intermediate values. When you don’t want to use a method to fill these values, e.g. fill_method is None, then intermediate values will be filled with NaN.
因为 resample 是基于时间的 groupby,所以以下方法可以高效地仅重新采样不是所有 NaN 的组。
Since resample is a time-based groupby, the following is a method to efficiently resample only the groups that are not all NaN.
In [308]: rng = pd.date_range("2014-1-1", periods=100, freq="D") + pd.Timedelta("1s")
In [309]: ts = pd.Series(range(100), index=rng)
如果要重新采样为该序列的完整范围:
If we want to resample to the full range of the series:
In [310]: ts.resample("3min").sum()
Out[310]:
2014-01-01 00:00:00 0
2014-01-01 00:03:00 0
2014-01-01 00:06:00 0
2014-01-01 00:09:00 0
2014-01-01 00:12:00 0
..
2014-04-09 23:48:00 0
2014-04-09 23:51:00 0
2014-04-09 23:54:00 0
2014-04-09 23:57:00 0
2014-04-10 00:00:00 99
Freq: 3min, Length: 47521, dtype: int64
我们可以改为只对包含数据点的组进行重新采样,如下所示:
We can instead only resample those groups where we have points as follows:
In [311]: from functools import partial
In [312]: from pandas.tseries.frequencies import to_offset
In [313]: def round(t, freq):
.....:     freq = to_offset(freq)
.....:     td = pd.Timedelta(freq)
.....:     return pd.Timestamp((t.value // td.value) * td.value)
.....:
In [314]: ts.groupby(partial(round, freq="3min")).sum()
Out[314]:
2014-01-01 0
2014-01-02 1
2014-01-03 2
2014-01-04 3
2014-01-05 4
..
2014-04-06 95
2014-04-07 96
2014-04-08 97
2014-04-09 98
2014-04-10 99
Length: 100, dtype: int64
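As an alternative to the hand-written rounding helper, the same grouping key can be obtained with the floor method of DatetimeIndex. This is a swapped-in technique offered as a sketch, not the approach used in the example above; it relies on floor aligning bins the same way for this data.

```python
import pandas as pd

rng = pd.date_range("2014-01-01", periods=100, freq="D") + pd.Timedelta("1s")
ts = pd.Series(range(100), index=rng)

# Floor each timestamp to its 3-minute bin and group on the result,
# so only bins that actually contain data are produced.
sparse = ts.groupby(ts.index.floor("3min")).sum()

print(sparse)
```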
Aggregation
resample() 方法返回 pandas.api.typing.Resampler 实例。类似于 aggregating API、 groupby API 和 window API,可以有选择性地重新采样 Resampler。
The resample() method returns a pandas.api.typing.Resampler instance. Similar to the aggregating API, groupby API, and the window API, a Resampler can be selectively resampled.
对 DataFrame 进行重新采样时,默认会对所有列应用相同的函数。
When resampling a DataFrame, the default is to apply the same function to all columns.
In [315]: df = pd.DataFrame(
.....: np.random.randn(1000, 3),
.....: index=pd.date_range("1/1/2012", freq="s", periods=1000),
.....: columns=["A", "B", "C"],
.....: )
.....:
In [316]: r = df.resample("3min")
In [317]: r.mean()
Out[317]:
A B C
2012-01-01 00:00:00 -0.033823 -0.121514 -0.081447
2012-01-01 00:03:00 0.056909 0.146731 -0.024320
2012-01-01 00:06:00 -0.058837 0.047046 -0.052021
2012-01-01 00:09:00 0.063123 -0.026158 -0.066533
2012-01-01 00:12:00 0.186340 -0.003144 0.074752
2012-01-01 00:15:00 -0.085954 -0.016287 -0.050046
我们可以使用标准 getitem 选择特定的一列或多列。
We can select a specific column or columns using standard getitem.
In [318]: r["A"].mean()
Out[318]:
2012-01-01 00:00:00 -0.033823
2012-01-01 00:03:00 0.056909
2012-01-01 00:06:00 -0.058837
2012-01-01 00:09:00 0.063123
2012-01-01 00:12:00 0.186340
2012-01-01 00:15:00 -0.085954
Freq: 3min, Name: A, dtype: float64
In [319]: r[["A", "B"]].mean()
Out[319]:
A B
2012-01-01 00:00:00 -0.033823 -0.121514
2012-01-01 00:03:00 0.056909 0.146731
2012-01-01 00:06:00 -0.058837 0.047046
2012-01-01 00:09:00 0.063123 -0.026158
2012-01-01 00:12:00 0.186340 -0.003144
2012-01-01 00:15:00 -0.085954 -0.016287
可以传递函数列表或字典进行聚合,输出 DataFrame:
You can pass a list or dict of functions to do aggregation with, outputting a DataFrame:
In [320]: r["A"].agg(["sum", "mean", "std"])
Out[320]:
sum mean std
2012-01-01 00:00:00 -6.088060 -0.033823 1.043263
2012-01-01 00:03:00 10.243678 0.056909 1.058534
2012-01-01 00:06:00 -10.590584 -0.058837 0.949264
2012-01-01 00:09:00 11.362228 0.063123 1.028096
2012-01-01 00:12:00 33.541257 0.186340 0.884586
2012-01-01 00:15:00 -8.595393 -0.085954 1.035476
在一个重新采样的 DataFrame 上,可以传递函数列表应用于每一列,从而生成带有分层索引的聚合结果:
On a resampled DataFrame, you can pass a list of functions to apply to each column, which produces an aggregated result with a hierarchical index:
In [321]: r.agg(["sum", "mean"])
Out[321]:
A ... C
sum mean ... sum mean
2012-01-01 00:00:00 -6.088060 -0.033823 ... -14.660515 -0.081447
2012-01-01 00:03:00 10.243678 0.056909 ... -4.377642 -0.024320
2012-01-01 00:06:00 -10.590584 -0.058837 ... -9.363825 -0.052021
2012-01-01 00:09:00 11.362228 0.063123 ... -11.975895 -0.066533
2012-01-01 00:12:00 33.541257 0.186340 ... 13.455299 0.074752
2012-01-01 00:15:00 -8.595393 -0.085954 ... -5.004580 -0.050046
[6 rows x 6 columns]
通过将字典传递给 aggregate,可以对 DataFrame 的列应用不同的聚合:
By passing a dict to aggregate you can apply a different aggregation to the columns of a DataFrame:
In [322]: r.agg({"A": "sum", "B": lambda x: np.std(x, ddof=1)})
Out[322]:
A B
2012-01-01 00:00:00 -6.088060 1.001294
2012-01-01 00:03:00 10.243678 1.074597
2012-01-01 00:06:00 -10.590584 0.987309
2012-01-01 00:09:00 11.362228 0.944953
2012-01-01 00:12:00 33.541257 1.095025
2012-01-01 00:15:00 -8.595393 1.035312
函数名称也可以是字符串。为了使字符串有效,它必须在重新采样对象上实现:
The function names can also be strings. In order for a string to be valid it must be implemented on the resampled object:
In [323]: r.agg({"A": "sum", "B": "std"})
Out[323]:
A B
2012-01-01 00:00:00 -6.088060 1.001294
2012-01-01 00:03:00 10.243678 1.074597
2012-01-01 00:06:00 -10.590584 0.987309
2012-01-01 00:09:00 11.362228 0.944953
2012-01-01 00:12:00 33.541257 1.095025
2012-01-01 00:15:00 -8.595393 1.035312
此外,还可以分别为每一列指定多个聚合函数。
Furthermore, you can also specify multiple aggregation functions for each column separately.
In [324]: r.agg({"A": ["sum", "std"], "B": ["mean", "std"]})
Out[324]:
A B
sum std mean std
2012-01-01 00:00:00 -6.088060 1.043263 -0.121514 1.001294
2012-01-01 00:03:00 10.243678 1.058534 0.146731 1.074597
2012-01-01 00:06:00 -10.590584 0.949264 0.047046 0.987309
2012-01-01 00:09:00 11.362228 1.028096 -0.026158 0.944953
2012-01-01 00:12:00 33.541257 0.884586 -0.003144 1.095025
2012-01-01 00:15:00 -8.595393 1.035476 -0.016287 1.035312
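The same per-column aggregation can be verified by hand on a small deterministic frame. The data below are made up for illustration; each column receives its own list of functions, and the result has hierarchical columns.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": range(6), "B": [0, 2, 4, 6, 8, 10]},
    index=pd.date_range("2012-01-01", periods=6, freq="min"),
)

# Two 3-minute bins; "A" is summed, "B" gets both its minimum and maximum.
result = df.resample("3min").agg({"A": ["sum"], "B": ["min", "max"]})

print(result)
```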
如果 DataFrame 没有 datetimelike 索引,而您想要基于数据框中的某个 datetimelike 列进行重新采样,则可以将该列传递给 on 关键字。
If a DataFrame does not have a datetimelike index and you instead want to resample based on a datetimelike column in the frame, that column can be passed to the on keyword.
In [325]: df = pd.DataFrame(
.....: {"date": pd.date_range("2015-01-01", freq="W", periods=5), "a": np.arange(5)},
.....: index=pd.MultiIndex.from_arrays(
.....: [[1, 2, 3, 4, 5], pd.date_range("2015-01-01", freq="W", periods=5)],
.....: names=["v", "d"],
.....: ),
.....: )
.....:
In [326]: df
Out[326]:
date a
v d
1 2015-01-04 2015-01-04 0
2 2015-01-11 2015-01-11 1
3 2015-01-18 2015-01-18 2
4 2015-01-25 2015-01-25 3
5 2015-02-01 2015-02-01 4
In [327]: df.resample("ME", on="date")[["a"]].sum()
Out[327]:
a
date
2015-01-31 6
2015-02-28 4
类似地,如果想要按 MultiIndex 的 datetimelike 级别进行重新采样,则可以将其名称或位置传递给 level 关键字。
Similarly, if you instead want to resample by a datetimelike level of MultiIndex, its name or location can be passed to the level keyword.
In [328]: df.resample("ME", level="d")[["a"]].sum()
Out[328]:
a
d
2015-01-31 6
2015-02-28 4
Iterating through groups
有了 Resampler 对象,遍历分组数据非常自然,并且函数类似于 itertools.groupby():
With the Resampler object in hand, iterating through the grouped data is very natural and functions similarly to itertools.groupby():
In [329]: small = pd.Series(
.....: range(6),
.....: index=pd.to_datetime(
.....: [
.....: "2017-01-01T00:00:00",
.....: "2017-01-01T00:30:00",
.....: "2017-01-01T00:31:00",
.....: "2017-01-01T01:00:00",
.....: "2017-01-01T03:00:00",
.....: "2017-01-01T03:05:00",
.....: ]
.....: ),
.....: )
.....:
In [330]: resampled = small.resample("h")
In [331]: for name, group in resampled:
.....:     print("Group: ", name)
.....:     print("-" * 27)
.....:     print(group, end="\n\n")
.....:
Group: 2017-01-01 00:00:00
---------------------------
2017-01-01 00:00:00 0
2017-01-01 00:30:00 1
2017-01-01 00:31:00 2
dtype: int64
Group: 2017-01-01 01:00:00
---------------------------
2017-01-01 01:00:00 3
dtype: int64
Group: 2017-01-01 02:00:00
---------------------------
Series([], dtype: int64)
Group: 2017-01-01 03:00:00
---------------------------
2017-01-01 03:00:00 4
2017-01-01 03:05:00 5
dtype: int64
请参见 Iterating through groups 或 Resampler.__iter__ 了解更多信息。
See Iterating through groups or Resampler.__iter__ for more.
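Because iteration yields (label, group) pairs, it combines naturally with comprehensions. The sketch below rebuilds the same series and collects the number of rows per hourly bin; `sizes` is an illustrative name.

```python
import pandas as pd

small = pd.Series(
    range(6),
    index=pd.to_datetime(
        [
            "2017-01-01T00:00:00",
            "2017-01-01T00:30:00",
            "2017-01-01T00:31:00",
            "2017-01-01T01:00:00",
            "2017-01-01T03:00:00",
            "2017-01-01T03:05:00",
        ]
    ),
)

# Map each bin label to the number of observations that fall into it.
sizes = {name: len(group) for name, group in small.resample("h")}

print(sizes)
```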
Use origin or offset to adjust the start of the bins
分组的区间根据时间序列起始点的当天开始时间进行调整。这适用于一天的倍数(例如 30D)或均匀划分一天(例如 90s 或 1min)的频率。这会与不符合此条件的一些频率产生不一致性。若要更改此行为,可以用 origin 参数指定一个固定的时间戳。
The bins of the grouping are adjusted based on the beginning of the day of the time series starting point. This works well with frequencies that are multiples of a day (like 30D) or that divide a day evenly (like 90s or 1min). It can create inconsistencies with frequencies that do not meet this criterion. To change this behavior, you can specify a fixed Timestamp with the origin argument.
例如:
For example:
In [332]: start, end = "2000-10-01 23:30:00", "2000-10-02 00:30:00"
In [333]: middle = "2000-10-02 00:00:00"
In [334]: rng = pd.date_range(start, end, freq="7min")
In [335]: ts = pd.Series(np.arange(len(rng)) * 3, index=rng)
In [336]: ts
Out[336]:
2000-10-01 23:30:00 0
2000-10-01 23:37:00 3
2000-10-01 23:44:00 6
2000-10-01 23:51:00 9
2000-10-01 23:58:00 12
2000-10-02 00:05:00 15
2000-10-02 00:12:00 18
2000-10-02 00:19:00 21
2000-10-02 00:26:00 24
Freq: 7min, dtype: int64
Here we can see that, with origin left at its default value ('start_day'), the results after '2000-10-02 00:00:00' differ depending on where the time series starts:
In [337]: ts.resample("17min", origin="start_day").sum()
Out[337]:
2000-10-01 23:14:00 0
2000-10-01 23:31:00 9
2000-10-01 23:48:00 21
2000-10-02 00:05:00 54
2000-10-02 00:22:00 24
Freq: 17min, dtype: int64
In [338]: ts[middle:end].resample("17min", origin="start_day").sum()
Out[338]:
2000-10-02 00:00:00 33
2000-10-02 00:17:00 45
Freq: 17min, dtype: int64
Here we can see that, with origin set to 'epoch', the results after '2000-10-02 00:00:00' are identical regardless of where the time series starts:
In [339]: ts.resample("17min", origin="epoch").sum()
Out[339]:
2000-10-01 23:18:00 0
2000-10-01 23:35:00 18
2000-10-01 23:52:00 27
2000-10-02 00:09:00 39
2000-10-02 00:26:00 24
Freq: 17min, dtype: int64
In [340]: ts[middle:end].resample("17min", origin="epoch").sum()
Out[340]:
2000-10-01 23:52:00 15
2000-10-02 00:09:00 39
2000-10-02 00:26:00 24
Freq: 17min, dtype: int64
If needed, you can use a custom timestamp for origin:
In [341]: ts.resample("17min", origin="2001-01-01").sum()
Out[341]:
2000-10-01 23:30:00 9
2000-10-01 23:47:00 21
2000-10-02 00:04:00 54
2000-10-02 00:21:00 24
Freq: 17min, dtype: int64
In [342]: ts[middle:end].resample("17min", origin=pd.Timestamp("2001-01-01")).sum()
Out[342]:
2000-10-02 00:04:00 54
2000-10-02 00:21:00 24
Freq: 17min, dtype: int64
If needed, you can instead shift the bins with an offset Timedelta, which is added to the default origin. The following two examples are equivalent for this time series:
In [343]: ts.resample("17min", origin="start").sum()
Out[343]:
2000-10-01 23:30:00 9
2000-10-01 23:47:00 21
2000-10-02 00:04:00 54
2000-10-02 00:21:00 24
Freq: 17min, dtype: int64
In [344]: ts.resample("17min", offset="23h30min").sum()
Out[344]:
2000-10-01 23:30:00 9
2000-10-01 23:47:00 21
2000-10-02 00:04:00 54
2000-10-02 00:21:00 24
Freq: 17min, dtype: int64
Note the use of 'start' for origin in the first of these two examples. In that case, origin is set to the first value of the time series.
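The equivalence described above can be checked directly. The following is a minimal sketch that rebuilds the same series and compares the two results. The third variant, which passes the first timestamp explicitly as origin, is an addition made here for illustration, and the variable names are not from the original.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-10-01 23:30:00", "2000-10-02 00:30:00", freq="7min")
ts = pd.Series(np.arange(len(rng)) * 3, index=rng)

# Bins anchored at the first observation (23:30).
by_origin = ts.resample("17min", origin="start").sum()
# Default origin (midnight of the first day) shifted by 23h30min.
by_offset = ts.resample("17min", offset="23h30min").sum()
# The same anchor given as an explicit Timestamp.
by_timestamp = ts.resample("17min", origin=pd.Timestamp("2000-10-01 23:30:00")).sum()

pd.testing.assert_series_equal(by_origin, by_offset)
pd.testing.assert_series_equal(by_origin, by_timestamp)
print(by_origin)
```

The offset form only matches here because the series happens to start at 23:30; for a series starting at another time the offset would need to change, whereas origin="start" adapts automatically.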
Backward resample
New in version 1.3.0.
Instead of adjusting the beginning of the bins, sometimes we need to fix the end of the bins to make a backward resample with a given freq. A backward resample sets closed to 'right' by default, since the last value should be treated as the edge point of the last bin.
We can set origin to 'end'. The value at a given Timestamp label is the resample result over the interval from that Timestamp minus freq up to that Timestamp, closed on the right.
In [345]: ts.resample('17min', origin='end').sum()
Out[345]:
2000-10-01 23:35:00 0
2000-10-01 23:52:00 18
2000-10-02 00:09:00 27
2000-10-02 00:26:00 63
Freq: 17min, dtype: int64
In addition, as a counterpart to the 'start_day' option, 'end_day' is supported. This sets the origin to the ceiling midnight of the largest Timestamp.
In [346]: ts.resample('17min', origin='end_day').sum()
Out[346]:
2000-10-01 23:38:00 3
2000-10-01 23:55:00 15
2000-10-02 00:12:00 45
2000-10-02 00:29:00 45
Freq: 17min, dtype: int64
The above result uses 2000-10-02 00:29:00 as the last bin’s right edge, as the following computation shows.
In [347]: ceil_mid = rng.max().ceil('D')
In [348]: freq = pd.offsets.Minute(17)
In [349]: bin_res = ceil_mid - freq * ((ceil_mid - rng.max()) // freq)
In [350]: bin_res
Out[350]: Timestamp('2000-10-02 00:29:00')
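To make the right-closed, right-labelled bins of a backward resample concrete, the sketch below recomputes each value of the origin='end' result by hand: for every label T it sums the observations in the interval (T - 17min, T]. The names `step` and `manual` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-10-01 23:30:00", "2000-10-02 00:30:00", freq="7min")
ts = pd.Series(np.arange(len(rng)) * 3, index=rng)

res = ts.resample("17min", origin="end").sum()

step = pd.Timedelta("17min")
# For each label, sum the points in the half-open interval (label - step, label].
manual = [
    int(ts[(ts.index > label - step) & (ts.index <= label)].sum())
    for label in res.index
]

print(res.tolist())
print(manual)
```

Both lists agree, which confirms that each label marks the right edge of its bin and that the last observation falls in the last bin.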
Time span representation
Regular intervals of time are represented by Period objects in pandas, while sequences of Period objects are collected in a PeriodIndex, which can be created with the convenience function period_range.
Period
A Period represents a span of time (e.g., a day, a month, a quarter, etc.). You can specify the span via the freq keyword using a frequency alias, as shown below. Because freq represents the span of a Period, it cannot be negative (for example, “-3D”).
In [351]: pd.Period("2012", freq="Y-DEC")
Out[351]: Period('2012', 'Y-DEC')
In [352]: pd.Period("2012-1-1", freq="D")
Out[352]: Period('2012-01-01', 'D')
In [353]: pd.Period("2012-1-1 19:00", freq="h")
Out[353]: Period('2012-01-01 19:00', 'h')
In [354]: pd.Period("2012-1-1 19:00", freq="5h")
Out[354]: Period('2012-01-01 19:00', '5h')
Adding integers to or subtracting integers from a Period shifts it by multiples of its own frequency. Arithmetic is not allowed between Period objects with different freq (span).
In [355]: p = pd.Period("2012", freq="Y-DEC")
In [356]: p + 1
Out[356]: Period('2013', 'Y-DEC')
In [357]: p - 3
Out[357]: Period('2009', 'Y-DEC')
In [358]: p = pd.Period("2012-01", freq="2M")
In [359]: p + 2
Out[359]: Period('2012-05', '2M')
In [360]: p - 1
Out[360]: Period('2011-11', '2M')
In [361]: p == pd.Period("2012-01", freq="3M")
Out[361]: False
If the Period freq is daily or higher (D, h, min, s, ms, us, and ns), offsets and timedelta-like objects can be added as long as the result can have the same freq. Otherwise, an IncompatibleFrequency error is raised, as in the last call below.
In [362]: p = pd.Period("2014-07-01 09:00", freq="h")
In [363]: p + pd.offsets.Hour(2)
Out[363]: Period('2014-07-01 11:00', 'h')
In [364]: p + datetime.timedelta(minutes=120)
Out[364]: Period('2014-07-01 11:00', 'h')
In [365]: p + np.timedelta64(7200, "s")
Out[365]: Period('2014-07-01 11:00', 'h')
In [366]: p + pd.offsets.Minute(5)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File period.pyx:1824, in pandas._libs.tslibs.period._Period._add_timedeltalike_scalar()
File timedeltas.pyx:278, in pandas._libs.tslibs.timedeltas.delta_to_nanoseconds()
File np_datetime.pyx:661, in pandas._libs.tslibs.np_datetime.convert_reso()
ValueError: Cannot losslessly convert units
The above exception was the direct cause of the following exception:
IncompatibleFrequency Traceback (most recent call last)
Cell In[366], line 1
----> 1 p + pd.offsets.Minute(5)
File period.pyx:1845, in pandas._libs.tslibs.period._Period.__add__()
File period.pyx:1826, in pandas._libs.tslibs.period._Period._add_timedeltalike_scalar()
IncompatibleFrequency: Input cannot be converted to Period(freq=h)
If the Period has another frequency, only offsets of the same kind can be added. Otherwise, an IncompatibleFrequency error is raised.
In [367]: p = pd.Period("2014-07", freq="M")
In [368]: p + pd.offsets.MonthEnd(3)
Out[368]: Period('2014-10', 'M')
In [369]: p + pd.offsets.MonthBegin(3)
---------------------------------------------------------------------------
IncompatibleFrequency Traceback (most recent call last)
Cell In[369], line 1
----> 1 p + pd.offsets.MonthBegin(3)
File period.pyx:1847, in pandas._libs.tslibs.period._Period.__add__()
File period.pyx:1837, in pandas._libs.tslibs.period._Period._add_offset()
File period.pyx:1732, in pandas._libs.tslibs.period.PeriodMixin._require_matching_freq()
IncompatibleFrequency: Input has different freq=3M from Period(freq=M)
Taking the difference of Period instances with the same frequency returns the number of frequency units between them:
In [370]: pd.Period("2012", freq="Y-DEC") - pd.Period("2002", freq="Y-DEC")
Out[370]: <10 * YearEnds: month=12>
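The difference is returned as an offset object rather than a plain integer. As a minimal sketch, the integer count can be read from the offset's n attribute and added back to the earlier Period. The monthly periods and variable names below are chosen for illustration.

```python
import pandas as pd

start = pd.Period("2011-01", freq="M")
end = pd.Period("2012-03", freq="M")

delta = end - start      # an offset representing 14 month-ends
months = delta.n         # the plain integer count

print(delta, months)
# Adding the integer count back recovers the later period.
print(start + months)
```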
PeriodIndex and period_range
Regular sequences of Period objects can be collected in a PeriodIndex, which can be constructed using the period_range convenience function:
In [371]: prng = pd.period_range("1/1/2011", "1/1/2012", freq="M")
In [372]: prng
Out[372]:
PeriodIndex(['2011-01', '2011-02', '2011-03', '2011-04', '2011-05', '2011-06',
'2011-07', '2011-08', '2011-09', '2011-10', '2011-11', '2011-12',
'2012-01'],
dtype='period[M]')
The PeriodIndex constructor can also be used directly:
In [373]: pd.PeriodIndex(["2011-1", "2011-2", "2011-3"], freq="M")
Out[373]: PeriodIndex(['2011-01', '2011-02', '2011-03'], dtype='period[M]')
Passing a multiplied frequency outputs a sequence of Period objects with the multiplied span.
In [374]: pd.period_range(start="2014-01", freq="3M", periods=4)
Out[374]: PeriodIndex(['2014-01', '2014-04', '2014-07', '2014-10'], dtype='period[3M]')
If start or end are Period objects, they are used as anchor endpoints for a PeriodIndex whose frequency is the one passed to period_range.
In [375]: pd.period_range(
.....: start=pd.Period("2017Q1", freq="Q"), end=pd.Period("2017Q2", freq="Q"), freq="M"
.....: )
.....:
Out[375]: PeriodIndex(['2017-03', '2017-04', '2017-05', '2017-06'], dtype='period[M]')
Just like DatetimeIndex, a PeriodIndex can also be used to index pandas objects:
In [376]: ps = pd.Series(np.random.randn(len(prng)), prng)
In [377]: ps
Out[377]:
2011-01 -2.916901
2011-02 0.514474
2011-03 1.346470
2011-04 0.816397
2011-05 2.258648
2011-06 0.494789
2011-07 0.301239
2011-08 0.464776
2011-09 -1.393581
2011-10 0.056780
2011-11 0.197035
2011-12 2.261385
2012-01 -0.329583
Freq: M, dtype: float64
PeriodIndex supports addition and subtraction with the same rules as Period.
In [378]: idx = pd.period_range("2014-07-01 09:00", periods=5, freq="h")
In [379]: idx
Out[379]:
PeriodIndex(['2014-07-01 09:00', '2014-07-01 10:00', '2014-07-01 11:00',
'2014-07-01 12:00', '2014-07-01 13:00'],
dtype='period[h]')
In [380]: idx + pd.offsets.Hour(2)
Out[380]:
PeriodIndex(['2014-07-01 11:00', '2014-07-01 12:00', '2014-07-01 13:00',
'2014-07-01 14:00', '2014-07-01 15:00'],
dtype='period[h]')
In [381]: idx = pd.period_range("2014-07", periods=5, freq="M")
In [382]: idx
Out[382]: PeriodIndex(['2014-07', '2014-08', '2014-09', '2014-10', '2014-11'], dtype='period[M]')
In [383]: idx + pd.offsets.MonthEnd(3)
Out[383]: PeriodIndex(['2014-10', '2014-11', '2014-12', '2015-01', '2015-02'], dtype='period[M]')
PeriodIndex has its own dtype named period; refer to Period dtypes.
Period dtypes
PeriodIndex has a custom period dtype. This is a pandas extension dtype similar to the timezone-aware dtype (datetime64[ns, tz]).
The period dtype holds the freq attribute and is represented as period[freq], for example period[D] or period[M], using frequency strings.
In [384]: pi = pd.period_range("2016-01-01", periods=3, freq="M")
In [385]: pi
Out[385]: PeriodIndex(['2016-01', '2016-02', '2016-03'], dtype='period[M]')
In [386]: pi.dtype
Out[386]: period[M]
The period dtype can be used in .astype(...). It lets you change the freq of a PeriodIndex, as .asfreq() does, and convert a DatetimeIndex to a PeriodIndex, as to_period() does:
# change monthly freq to daily freq
In [387]: pi.astype("period[D]")
Out[387]: PeriodIndex(['2016-01-31', '2016-02-29', '2016-03-31'], dtype='period[D]')
# convert to DatetimeIndex
In [388]: pi.astype("datetime64[ns]")
Out[388]: DatetimeIndex(['2016-01-01', '2016-02-01', '2016-03-01'], dtype='datetime64[ns]', freq='MS')
# convert to PeriodIndex
In [389]: dti = pd.date_range("2011-01-01", freq="ME", periods=3)
In [390]: dti
Out[390]: DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31'], dtype='datetime64[ns]', freq='ME')
In [391]: dti.astype("period[M]")
Out[391]: PeriodIndex(['2011-01', '2011-02', '2011-03'], dtype='period[M]')
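The correspondence between .astype(...) and the dedicated methods can be verified with a short sketch. It assumes the default how of asfreq (the end of the period), which matches the month-end dates shown above. The DatetimeIndex is written out explicitly here rather than generated with date_range.

```python
import pandas as pd

pi = pd.period_range("2016-01-01", periods=3, freq="M")
# astype to a daily period dtype matches asfreq("D") with its default how.
same_as_asfreq = pi.astype("period[D]").equals(pi.asfreq("D"))

dti = pd.DatetimeIndex(["2011-01-31", "2011-02-28", "2011-03-31"])
# astype to a monthly period dtype matches to_period("M").
same_as_to_period = dti.astype("period[M]").equals(dti.to_period("M"))

print(same_as_asfreq, same_as_to_period)
```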
PeriodIndex partial string indexing
PeriodIndex now supports partial string slicing with non-monotonic indexes.
You can pass dates and strings to a Series or DataFrame with a PeriodIndex in the same manner as with a DatetimeIndex. For details, refer to DatetimeIndex Partial String Indexing.
In [392]: ps["2011-01"]
Out[392]: -2.9169013294054507
In [393]: ps[datetime.datetime(2011, 12, 25):]
Out[393]:
2011-12 2.261385
2012-01 -0.329583
Freq: M, dtype: float64
In [394]: ps["10/31/2011":"12/31/2011"]
Out[394]:
2011-10 0.056780
2011-11 0.197035
2011-12 2.261385
Freq: M, dtype: float64
Passing a string representing a lower frequency than that of the PeriodIndex returns partial sliced data.
In [395]: ps["2011"]
Out[395]:
2011-01 -2.916901
2011-02 0.514474
2011-03 1.346470
2011-04 0.816397
2011-05 2.258648
2011-06 0.494789
2011-07 0.301239
2011-08 0.464776
2011-09 -1.393581
2011-10 0.056780
2011-11 0.197035
2011-12 2.261385
Freq: M, dtype: float64
In [396]: dfp = pd.DataFrame(
.....: np.random.randn(600, 1),
.....: columns=["A"],
.....: index=pd.period_range("2013-01-01 9:00", periods=600, freq="min"),
.....: )
.....:
In [397]: dfp
Out[397]:
A
2013-01-01 09:00 -0.538468
2013-01-01 09:01 -1.365819
2013-01-01 09:02 -0.969051
2013-01-01 09:03 -0.331152
2013-01-01 09:04 -0.245334
... ...
2013-01-01 18:55 0.522460
2013-01-01 18:56 0.118710
2013-01-01 18:57 0.167517
2013-01-01 18:58 0.922883
2013-01-01 18:59 1.721104
[600 rows x 1 columns]
In [398]: dfp.loc["2013-01-01 10h"]
Out[398]:
A
2013-01-01 10:00 -0.308975
2013-01-01 10:01 0.542520
2013-01-01 10:02 1.061068
2013-01-01 10:03 0.754005
2013-01-01 10:04 0.352933
... ...
2013-01-01 10:55 -0.865621
2013-01-01 10:56 -1.167818
2013-01-01 10:57 -2.081748
2013-01-01 10:58 -0.527146
2013-01-01 10:59 0.802298
[60 rows x 1 columns]
As with DatetimeIndex, the endpoints are included in the result. The example below slices data from 10:00 through 11:59.
In [399]: dfp["2013-01-01 10h":"2013-01-01 11h"]
Out[399]:
A
2013-01-01 10:00 -0.308975
2013-01-01 10:01 0.542520
2013-01-01 10:02 1.061068
2013-01-01 10:03 0.754005
2013-01-01 10:04 0.352933
... ...
2013-01-01 11:55 -0.590204
2013-01-01 11:56 1.539990
2013-01-01 11:57 -1.224826
2013-01-01 11:58 0.578798
2013-01-01 11:59 -0.685496
[120 rows x 1 columns]
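Because the data above is random, the inclusive endpoints are easier to see with deterministic values. The sketch below rebuilds a frame with the same index but fills column A with 0 through 599 (an assumption made here for illustration), then checks the first and last rows of each selection.

```python
import numpy as np
import pandas as pd

dfp = pd.DataFrame(
    {"A": np.arange(600)},
    index=pd.period_range("2013-01-01 9:00", periods=600, freq="min"),
)

# An hour-resolution string selects every minute within that hour.
one_hour = dfp.loc["2013-01-01 10h"]
# A slice includes the whole of the final hour as well.
two_hours = dfp.loc["2013-01-01 10h":"2013-01-01 11h"]

print(len(one_hour), one_hour.index[0], one_hour.index[-1])
print(len(two_hours), two_hours.index[0], two_hours.index[-1])
```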
Frequency conversion and resampling with PeriodIndex
The frequency of Period and PeriodIndex can be converted via the asfreq method. Let’s start with the fiscal year 2011, ending in December:
In [400]: p = pd.Period("2011", freq="Y-DEC")
In [401]: p
Out[401]: Period('2011', 'Y-DEC')
We can convert it to a monthly frequency. Using the how parameter, we can specify whether to return the starting or ending month:
In [402]: p.asfreq("M", how="start")
Out[402]: Period('2011-01', 'M')
In [403]: p.asfreq("M", how="end")
Out[403]: Period('2011-12', 'M')
The shorthands ‘s’ and ‘e’ are provided for convenience:
In [404]: p.asfreq("M", "s")
Out[404]: Period('2011-01', 'M')
In [405]: p.asfreq("M", "e")
Out[405]: Period('2011-12', 'M')
Converting to a “super-period” (e.g., annual frequency is a super-period of quarterly frequency) automatically returns the super-period that includes the input period:
In [406]: p = pd.Period("2011-12", freq="M")
In [407]: p.asfreq("Y-NOV")
Out[407]: Period('2012', 'Y-NOV')
Note that since we converted to an annual frequency that ends the year in November, the monthly period of December 2011 actually falls in the 2012 Y-NOV period.
Period conversions with anchored frequencies are particularly useful for working with the quarterly data common in economics, business, and other fields. Many organizations define quarters relative to the month in which their fiscal year starts and ends. Thus, the first quarter of 2011 could start in 2010 or a few months into 2011. Via anchored frequencies, pandas supports all quarterly frequencies Q-JAN through Q-DEC.
Q-DEC defines regular calendar quarters:
In [408]: p = pd.Period("2012Q1", freq="Q-DEC")
In [409]: p.asfreq("D", "s")
Out[409]: Period('2012-01-01', 'D')
In [410]: p.asfreq("D", "e")
Out[410]: Period('2012-03-31', 'D')
Q-MAR defines a fiscal year ending in March:
In [411]: p = pd.Period("2011Q4", freq="Q-MAR")
In [412]: p.asfreq("D", "s")
Out[412]: Period('2011-01-01', 'D')
In [413]: p.asfreq("D", "e")
Out[413]: Period('2011-03-31', 'D')
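The remark that the first quarter of 2011 can start in 2010 is easy to check with the same Q-MAR frequency. In this minimal sketch, qyear gives the fiscal year the quarter belongs to, while the start date falls in the previous calendar year. The variable names are introduced here for illustration.

```python
import pandas as pd

q1 = pd.Period("2011Q1", freq="Q-MAR")

first_day = q1.asfreq("D", "s")   # first day of fiscal Q1
last_day = q1.asfreq("D", "e")    # last day of fiscal Q1

print(first_day, last_day)
# Fiscal year versus the calendar year in which the quarter starts.
print(q1.qyear, q1.start_time.year)
```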
Converting between representations
Timestamped data can be converted to PeriodIndex-ed data using to_period, and vice versa using to_timestamp:
In [414]: rng = pd.date_range("1/1/2012", periods=5, freq="ME")
In [415]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [416]: ts
Out[416]:
2012-01-31 1.931253
2012-02-29 -0.184594
2012-03-31 0.249656
2012-04-30 -0.978151
2012-05-31 -0.873389
Freq: ME, dtype: float64
In [417]: ps = ts.to_period()
In [418]: ps
Out[418]:
2012-01 1.931253
2012-02 -0.184594
2012-03 0.249656
2012-04 -0.978151
2012-05 -0.873389
Freq: M, dtype: float64
In [419]: ps.to_timestamp()
Out[419]:
2012-01-01 1.931253
2012-02-01 -0.184594
2012-03-01 0.249656
2012-04-01 -0.978151
2012-05-01 -0.873389
Freq: MS, dtype: float64
Remember that ‘s’ and ‘e’ can be used to return the timestamps at the start or end of the period:
In [420]: ps.to_timestamp("D", how="s")
Out[420]:
2012-01-01 1.931253
2012-02-01 -0.184594
2012-03-01 0.249656
2012-04-01 -0.978151
2012-05-01 -0.873389
Freq: MS, dtype: float64
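For comparison with the how="s" call above, the following sketch shows how="e" on a small monthly series with made-up values. The end-of-period timestamps carry a time of day at the very end of the last day, so the example normalizes them to midnight before printing the dates.

```python
import pandas as pd

ps = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.period_range("2012-01", periods=3, freq="M"),
)

starts = ps.to_timestamp(how="s").index
ends = ps.to_timestamp(how="e").index

print(starts)
# Drop the time-of-day component to show only the last calendar day.
print(ends.normalize())
```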
Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we convert a quarterly frequency with year ending in November to 9am on the first day of the month following the quarter end:
In [421]: prng = pd.period_range("1990Q1", "2000Q4", freq="Q-NOV")
In [422]: ts = pd.Series(np.random.randn(len(prng)), prng)
In [423]: ts.index = (prng.asfreq("M", "e") + 1).asfreq("h", "s") + 9
In [424]: ts.head()
Out[424]:
1990-03-01 09:00 -0.109291
1990-06-01 09:00 -0.637235
1990-09-01 09:00 -1.735925
1990-12-01 09:00 2.096946
1991-03-01 09:00 -1.039926
Freq: h, dtype: float64
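The chained expression used for the index can be unpacked one step at a time on a single Period. This is only a walk-through of the same operations; the intermediate names are introduced here for illustration.

```python
import pandas as pd

q = pd.Period("1990Q1", freq="Q-NOV")

last_month = q.asfreq("M", "e")            # last month of the quarter
next_month = last_month + 1                # month following the quarter end
first_hour = next_month.asfreq("h", "s")   # first hour of that month
nine_am = first_hour + 9                   # shift by nine hourly periods

print(last_month, next_month, first_hour, nine_am)
```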
Representing out-of-bounds spans
If you have data that falls outside the Timestamp bounds (see Timestamp limitations), you can use a PeriodIndex and/or a Series of Periods to do computations.
In [425]: span = pd.period_range("1215-01-01", "1381-01-01", freq="D")
In [426]: span
Out[426]:
PeriodIndex(['1215-01-01', '1215-01-02', '1215-01-03', '1215-01-04',
'1215-01-05', '1215-01-06', '1215-01-07', '1215-01-08',
'1215-01-09', '1215-01-10',
...
'1380-12-23', '1380-12-24', '1380-12-25', '1380-12-26',
'1380-12-27', '1380-12-28', '1380-12-29', '1380-12-30',
'1380-12-31', '1381-01-01'],
dtype='period[D]', length=60632)
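Ordinary Period arithmetic works on such out-of-bounds spans. As a minimal sketch, the following computes the number of days between the two endpoints and steps forward by one non-leap year. The names `elapsed` and `one_year_later` are introduced here for illustration.

```python
import pandas as pd

span = pd.period_range("1215-01-01", "1381-01-01", freq="D")

# Difference of two daily periods is an offset; .n is the number of days.
elapsed = span[-1] - span[0]
# 1215 is not a leap year, so 365 days later is the next 1 January.
one_year_later = span[0] + 365

print(elapsed.n, one_year_later)
```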
To convert from an int64-based YYYYMMDD representation:
In [427]: s = pd.Series([20121231, 20141130, 99991231])
In [428]: s
Out[428]:
0 20121231
1 20141130
2 99991231
dtype: int64
In [429]: def conv(x):
.....: return pd.Period(year=x // 10000, month=x // 100 % 100, day=x % 100, freq="D")
.....:
In [430]: s.apply(conv)
Out[430]:
0 2012-12-31
1 2014-11-30
2 9999-12-31
dtype: period[D]
In [431]: s.apply(conv)[2]
Out[431]: Period('9999-12-31', 'D')
These can easily be converted to a PeriodIndex:
In [432]: span = pd.PeriodIndex(s.apply(conv))
In [433]: span
Out[433]: PeriodIndex(['2012-12-31', '2014-11-30', '9999-12-31'], dtype='period[D]')
Time zone handling
pandas provides rich support for working with timestamps in different time zones using the pytz and dateutil libraries or datetime.timezone objects from the standard library.
Working with time zones
By default, pandas objects are time zone unaware:
In [434]: rng = pd.date_range("3/6/2012 00:00", periods=15, freq="D")
In [435]: rng.tz is None
Out[435]: True
To localize these dates to a time zone (assign a particular time zone to a naive date), you can use the tz_localize method or the tz keyword argument in date_range(), Timestamp, or DatetimeIndex. You can pass either pytz or dateutil time zone objects or Olson time zone database strings. Olson time zone strings return pytz time zone objects by default. To get dateutil time zone objects instead, prepend dateutil/ to the string.
- In pytz you can find a list of common (and less common) time zones using from pytz import common_timezones, all_timezones.
- dateutil uses the OS time zones so there isn’t a fixed list available. For common zones, the names are the same as pytz.
In [436]: import dateutil
# pytz
In [437]: rng_pytz = pd.date_range("3/6/2012 00:00", periods=3, freq="D", tz="Europe/London")
In [438]: rng_pytz.tz
Out[438]: <DstTzInfo 'Europe/London' LMT-1 day, 23:59:00 STD>
# dateutil
In [439]: rng_dateutil = pd.date_range("3/6/2012 00:00", periods=3, freq="D")
In [440]: rng_dateutil = rng_dateutil.tz_localize("dateutil/Europe/London")
In [441]: rng_dateutil.tz
Out[441]: tzfile('/usr/share/zoneinfo/Europe/London')
# dateutil - utc special case
In [442]: rng_utc = pd.date_range(
.....: "3/6/2012 00:00",
.....: periods=3,
.....: freq="D",
.....: tz=dateutil.tz.tzutc(),
.....: )
.....:
In [443]: rng_utc.tz
Out[443]: tzutc()
# datetime.timezone
In [444]: rng_utc = pd.date_range(
.....: "3/6/2012 00:00",
.....: periods=3,
.....: freq="D",
.....: tz=datetime.timezone.utc,
.....: )
.....:
In [445]: rng_utc.tz
Out[445]: datetime.timezone.utc
Note that the UTC time zone is a special case in dateutil and should be constructed explicitly as an instance of dateutil.tz.tzutc. You can also construct other time zone objects explicitly first.
In [446]: import pytz
# pytz
In [447]: tz_pytz = pytz.timezone("Europe/London")
In [448]: rng_pytz = pd.date_range("3/6/2012 00:00", periods=3, freq="D")
In [449]: rng_pytz = rng_pytz.tz_localize(tz_pytz)
In [450]: rng_pytz.tz == tz_pytz
Out[450]: True
# dateutil
In [451]: tz_dateutil = dateutil.tz.gettz("Europe/London")
In [452]: rng_dateutil = pd.date_range("3/6/2012 00:00", periods=3, freq="D", tz=tz_dateutil)
In [453]: rng_dateutil.tz == tz_dateutil
Out[453]: True
To convert a time zone aware pandas object from one time zone to another, you can use the tz_convert method.
In [454]: rng_pytz.tz_convert("US/Eastern")
Out[454]:
DatetimeIndex(['2012-03-05 19:00:00-05:00', '2012-03-06 19:00:00-05:00',
'2012-03-07 19:00:00-05:00'],
dtype='datetime64[ns, US/Eastern]', freq=None)
When using pytz time zones, DatetimeIndex will construct a different time zone object than a Timestamp for the same time zone input. A DatetimeIndex can hold a collection of Timestamp objects that may have different UTC offsets and so cannot be succinctly represented by one pytz time zone instance, while one Timestamp represents one point in time with a specific UTC offset.
In [455]: dti = pd.date_range("2019-01-01", periods=3, freq="D", tz="US/Pacific")
In [456]: dti.tz
Out[456]: <DstTzInfo 'US/Pacific' LMT-1 day, 16:07:00 STD>
In [457]: ts = pd.Timestamp("2019-01-01", tz="US/Pacific")
In [458]: ts.tz
Out[458]: <DstTzInfo 'US/Pacific' PST-1 day, 16:00:00 STD>
Warning
Be wary of conversions between libraries. For some time zones, pytz and dateutil have different definitions of the zone. This is more of a problem for unusual time zones than for ‘standard’ zones like US/Eastern.
Warning
Be aware that a time zone definition may not be considered equal across versions of the time zone libraries. This may cause problems when working with stored data that was localized using one version and is operated on with a different version. See here for how to handle such a situation.
Warning
For pytz time zones, it is incorrect to pass a time zone object directly into the datetime.datetime constructor (e.g., datetime.datetime(2011, 1, 1, tzinfo=pytz.timezone('US/Eastern'))). Instead, the datetime needs to be localized using the localize method on the pytz time zone object.
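The difference between the two approaches in the pytz warning above can be seen in the resulting UTC offsets. The sketch below assumes pytz is installed. The names `wrong` and `right` are labels chosen here, and the offset attached by the constructor form depends on the zone's historical data, so only the localized offset is stated with certainty.

```python
import datetime

import pytz

eastern = pytz.timezone("US/Eastern")
naive = datetime.datetime(2011, 1, 1)

# Passing the pytz zone as tzinfo attaches the zone's earliest recorded
# offset (local mean time) instead of the offset valid on that date.
wrong = datetime.datetime(2011, 1, 1, tzinfo=eastern)
# localize() picks the offset that applies on that date (EST, UTC-5).
right = eastern.localize(naive)

print(wrong.utcoffset())
print(right.utcoffset())
```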
Warning
Be aware that for times in the future, correct conversion between time zones (and UTC) cannot be guaranteed by any time zone library, because a time zone’s offset from UTC may be changed by the respective government.
Warning
If you are using dates beyond 2038-01-18, due to current deficiencies in the underlying libraries caused by the year 2038 problem, daylight saving time (DST) adjustments to timezone-aware dates will not be applied. If and when the underlying libraries are fixed, the DST transitions will be applied.
For example, for two dates that are in British Summer Time (and so would normally be GMT+1), both of the following asserts evaluate as true:
In [459]: d_2037 = "2037-03-31T010101"
In [460]: d_2038 = "2038-03-31T010101"
In [461]: DST = "Europe/London"
In [462]: assert pd.Timestamp(d_2037, tz=DST) != pd.Timestamp(d_2037, tz="GMT")
In [463]: assert pd.Timestamp(d_2038, tz=DST) == pd.Timestamp(d_2038, tz="GMT")
Under the hood, all timestamps are stored in UTC. Values from a time zone aware DatetimeIndex or Timestamp will have their fields (day, hour, minute, etc.) localized to the time zone. However, timestamps with the same UTC value are still considered equal even if they are in different time zones:
In [464]: rng_eastern = rng_utc.tz_convert("US/Eastern")
In [465]: rng_berlin = rng_utc.tz_convert("Europe/Berlin")
In [466]: rng_eastern[2]
Out[466]: Timestamp('2012-03-07 19:00:00-0500', tz='US/Eastern')
In [467]: rng_berlin[2]
Out[467]: Timestamp('2012-03-08 01:00:00+0100', tz='Europe/Berlin')
In [468]: rng_eastern[2] == rng_berlin[2]
Out[468]: True
Operations between Series in different time zones will yield UTC Series, aligning the data on the UTC timestamps:
In [469]: ts_utc = pd.Series(range(3), pd.date_range("20130101", periods=3, tz="UTC"))
In [470]: eastern = ts_utc.tz_convert("US/Eastern")
In [471]: berlin = ts_utc.tz_convert("Europe/Berlin")
In [472]: result = eastern + berlin
In [473]: result
Out[473]:
2013-01-01 00:00:00+00:00 0
2013-01-02 00:00:00+00:00 2
2013-01-03 00:00:00+00:00 4
Freq: D, dtype: int64
In [474]: result.index
Out[474]:
DatetimeIndex(['2013-01-01 00:00:00+00:00', '2013-01-02 00:00:00+00:00',
'2013-01-03 00:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq='D')
To remove time zone information, use tz_localize(None) or tz_convert(None). tz_localize(None) removes the time zone and keeps the local wall-clock time. tz_convert(None) converts to UTC first and then removes the time zone.
In [475]: didx = pd.date_range(start="2014-08-01 09:00", freq="h", periods=3, tz="US/Eastern")
In [476]: didx
Out[476]:
DatetimeIndex(['2014-08-01 09:00:00-04:00', '2014-08-01 10:00:00-04:00',
'2014-08-01 11:00:00-04:00'],
dtype='datetime64[ns, US/Eastern]', freq='h')
In [477]: didx.tz_localize(None)
Out[477]:
DatetimeIndex(['2014-08-01 09:00:00', '2014-08-01 10:00:00',
'2014-08-01 11:00:00'],
dtype='datetime64[ns]', freq=None)
In [478]: didx.tz_convert(None)
Out[478]:
DatetimeIndex(['2014-08-01 13:00:00', '2014-08-01 14:00:00',
'2014-08-01 15:00:00'],
dtype='datetime64[ns]', freq='h')
# tz_convert(None) is identical to tz_convert('UTC').tz_localize(None)
In [479]: didx.tz_convert("UTC").tz_localize(None)
Out[479]:
DatetimeIndex(['2014-08-01 13:00:00', '2014-08-01 14:00:00',
'2014-08-01 15:00:00'],
dtype='datetime64[ns]', freq=None)
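The two ways of dropping the time zone differ by exactly the UTC offset of the original values. A minimal sketch using the same index follows. August falls in US Eastern daylight time, so the gap is four hours. The round trip at the end is an addition made here to show that the UTC-based naive values can be restored.

```python
import pandas as pd

didx = pd.date_range(start="2014-08-01 09:00", freq="h", periods=3, tz="US/Eastern")

local_naive = didx.tz_localize(None)   # keeps the wall-clock time
utc_naive = didx.tz_convert(None)      # keeps the UTC instant

# The gap between the two equals the zone's UTC offset on that date.
gap = utc_naive - local_naive
print(gap)

# The UTC-based values can be turned back into the original aware index.
roundtrip = utc_naive.tz_localize("UTC").tz_convert("US/Eastern")
print((roundtrip == didx).all())
```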
Fold
For ambiguous times, pandas supports explicitly specifying the keyword-only fold argument. Due to daylight saving time, one wall clock time can occur twice when shifting from summer to winter time; fold describes whether the datetime-like corresponds to the first (0) or the second (1) time the wall clock hits the ambiguous time. Fold is supported only when constructing from a naive datetime.datetime (see datetime documentation for details), from a Timestamp, or from components (see below). Only dateutil time zones are supported (see dateutil documentation for the dateutil methods that deal with ambiguous datetimes), because pytz time zones do not support fold (see pytz documentation for details on how pytz deals with ambiguous datetimes). To localize an ambiguous datetime with pytz, use Timestamp.tz_localize(). In general, we recommend relying on Timestamp.tz_localize() when localizing ambiguous datetimes if you need direct control over how they are handled.
In [480]: pd.Timestamp(
.....: datetime.datetime(2019, 10, 27, 1, 30, 0, 0),
.....: tz="dateutil/Europe/London",
.....: fold=0,
.....: )
.....:
Out[480]: Timestamp('2019-10-27 01:30:00+0100', tz='dateutil//usr/share/zoneinfo/Europe/London')
In [481]: pd.Timestamp(
.....: year=2019,
.....: month=10,
.....: day=27,
.....: hour=1,
.....: minute=30,
.....: tz="dateutil/Europe/London",
.....: fold=1,
.....: )
.....:
Out[481]: Timestamp('2019-10-27 01:30:00+0000', tz='dateutil//usr/share/zoneinfo/Europe/London')
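The two fold values denote instants one hour apart even though the wall-clock fields are identical. A short sketch comparing them follows; it uses the same dateutil zone string as above, and the names `first` and `second` are introduced here for illustration.

```python
import datetime

import pandas as pd

naive = datetime.datetime(2019, 10, 27, 1, 30)

# First occurrence of 01:30 (still British Summer Time, UTC+1).
first = pd.Timestamp(naive, tz="dateutil/Europe/London", fold=0)
# Second occurrence of 01:30 (after the clocks go back, UTC+0).
second = pd.Timestamp(naive, tz="dateutil/Europe/London", fold=1)

print(first.utcoffset(), second.utcoffset())
print(second - first)
```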
Ambiguous times when localizing
tz_localize may not be able to determine the UTC offset of a timestamp because daylight saving time (DST) in a local time zone causes some times to occur twice within one day (“clocks fall back”). The ambiguous argument accepts the following options:
- 'raise': Raises a pytz.AmbiguousTimeError (the default behavior)
- 'infer': Attempts to determine the correct offset based on the monotonicity of the timestamps
- 'NaT': Replaces ambiguous times with NaT
- bool: True represents a DST time, False represents a non-DST time. An array-like of bool values is supported for a sequence of times.
In [482]: rng_hourly = pd.DatetimeIndex(
.....: ["11/06/2011 00:00", "11/06/2011 01:00", "11/06/2011 01:00", "11/06/2011 02:00"]
.....: )
.....:
This will fail because the index contains an ambiguous time ('11/06/2011 01:00'):
In [483]: rng_hourly.tz_localize('US/Eastern')
---------------------------------------------------------------------------
AmbiguousTimeError Traceback (most recent call last)
Cell In[483], line 1
----> 1 rng_hourly.tz_localize('US/Eastern')
File ~/work/pandas/pandas/pandas/core/indexes/datetimes.py:293, in DatetimeIndex.tz_localize(self, tz, ambiguous, nonexistent)
286 @doc(DatetimeArray.tz_localize)
287 def tz_localize(
288 self,
(...)
291 nonexistent: TimeNonexistent = "raise",
292 ) -> Self:
--> 293 arr = self._data.tz_localize(tz, ambiguous, nonexistent)
294 return type(self)._simple_new(arr, name=self.name)
File ~/work/pandas/pandas/pandas/core/arrays/_mixins.py:81, in ravel_compat.<locals>.method(self, *args, **kwargs)
78 @wraps(meth)
79 def method(self, *args, **kwargs):
80 if self.ndim == 1:
---> 81 return meth(self, *args, **kwargs)
83 flags = self._ndarray.flags
84 flat = self.ravel("K")
File ~/work/pandas/pandas/pandas/core/arrays/datetimes.py:1088, in DatetimeArray.tz_localize(self, tz, ambiguous, nonexistent)
1085 tz = timezones.maybe_get_tz(tz)
1086 # Convert to UTC
-> 1088 new_dates = tzconversion.tz_localize_to_utc(
1089 self.asi8,
1090 tz,
1091 ambiguous=ambiguous,
1092 nonexistent=nonexistent,
1093 creso=self._creso,
1094 )
1095 new_dates_dt64 = new_dates.view(f"M8[{self.unit}]")
1096 dtype = tz_to_dtype(tz, unit=self.unit)
File tzconversion.pyx:371, in pandas._libs.tslibs.tzconversion.tz_localize_to_utc()
AmbiguousTimeError: Cannot infer dst time from 2011-11-06 01:00:00, try using the 'ambiguous' argument
Handle these ambiguous times by passing one of the options listed above as the ambiguous argument.
In [484]: rng_hourly.tz_localize("US/Eastern", ambiguous="infer")
Out[484]:
DatetimeIndex(['2011-11-06 00:00:00-04:00', '2011-11-06 01:00:00-04:00',
'2011-11-06 01:00:00-05:00', '2011-11-06 02:00:00-05:00'],
dtype='datetime64[ns, US/Eastern]', freq=None)
In [485]: rng_hourly.tz_localize("US/Eastern", ambiguous="NaT")
Out[485]:
DatetimeIndex(['2011-11-06 00:00:00-04:00', 'NaT', 'NaT',
'2011-11-06 02:00:00-05:00'],
dtype='datetime64[ns, US/Eastern]', freq=None)
In [486]: rng_hourly.tz_localize("US/Eastern", ambiguous=[True, True, False, False])
Out[486]:
DatetimeIndex(['2011-11-06 00:00:00-04:00', '2011-11-06 01:00:00-04:00',
'2011-11-06 01:00:00-05:00', '2011-11-06 02:00:00-05:00'],
dtype='datetime64[ns, US/Eastern]', freq=None)
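The three strategies above can be compared side by side. The following is a small sketch, not part of the original guide; the names inferred, explicit, same, utc_hours and n_missing are introduced here purely for illustration.

```python
import pandas as pd

rng_hourly = pd.DatetimeIndex(
    ["11/06/2011 00:00", "11/06/2011 01:00", "11/06/2011 01:00", "11/06/2011 02:00"]
)

inferred = rng_hourly.tz_localize("US/Eastern", ambiguous="infer")
explicit = rng_hourly.tz_localize("US/Eastern", ambiguous=[True, True, False, False])

# Both strategies resolve the repeated 01:00 entries to the same instants.
same = bool((inferred == explicit).all())

# Once converted to UTC, the four timestamps are distinct and one hour apart.
utc_hours = inferred.tz_convert("UTC").hour.tolist()

# With 'NaT', the two ambiguous entries are left missing instead.
n_missing = int(rng_hourly.tz_localize("US/Eastern", ambiguous="NaT").isna().sum())

print(same, utc_hours, n_missing)
```

Checking the UTC values is a quick way to confirm that the repeated local hour was assigned to two different instants.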
Nonexistent times when localizing
A DST transition may also shift the local time ahead by 1 hour, creating nonexistent local times (“clocks spring forward”). The behavior of localizing a time series with nonexistent times can be controlled by the nonexistent argument. The following options are available:
- 'raise': Raises a pytz.NonExistentTimeError (the default behavior)
- 'NaT': Replaces nonexistent times with NaT
- 'shift_forward': Shifts nonexistent times forward to the closest real time
- 'shift_backward': Shifts nonexistent times backward to the closest real time
- timedelta object: Shifts nonexistent times by the timedelta duration
In [487]: dti = pd.date_range(start="2015-03-29 02:30:00", periods=3, freq="h")
# 2:30 is a nonexistent time
Localization of nonexistent times will raise an error by default.
In [488]: dti.tz_localize('Europe/Warsaw')
---------------------------------------------------------------------------
NonExistentTimeError Traceback (most recent call last)
Cell In[488], line 1
----> 1 dti.tz_localize('Europe/Warsaw')
File ~/work/pandas/pandas/pandas/core/indexes/datetimes.py:293, in DatetimeIndex.tz_localize(self, tz, ambiguous, nonexistent)
286 @doc(DatetimeArray.tz_localize)
287 def tz_localize(
288 self,
(...)
291 nonexistent: TimeNonexistent = "raise",
292 ) -> Self:
--> 293 arr = self._data.tz_localize(tz, ambiguous, nonexistent)
294 return type(self)._simple_new(arr, name=self.name)
File ~/work/pandas/pandas/pandas/core/arrays/_mixins.py:81, in ravel_compat.<locals>.method(self, *args, **kwargs)
78 @wraps(meth)
79 def method(self, *args, **kwargs):
80 if self.ndim == 1:
---> 81 return meth(self, *args, **kwargs)
83 flags = self._ndarray.flags
84 flat = self.ravel("K")
File ~/work/pandas/pandas/pandas/core/arrays/datetimes.py:1088, in DatetimeArray.tz_localize(self, tz, ambiguous, nonexistent)
1085 tz = timezones.maybe_get_tz(tz)
1086 # Convert to UTC
-> 1088 new_dates = tzconversion.tz_localize_to_utc(
1089 self.asi8,
1090 tz,
1091 ambiguous=ambiguous,
1092 nonexistent=nonexistent,
1093 creso=self._creso,
1094 )
1095 new_dates_dt64 = new_dates.view(f"M8[{self.unit}]")
1096 dtype = tz_to_dtype(tz, unit=self.unit)
File tzconversion.pyx:431, in pandas._libs.tslibs.tzconversion.tz_localize_to_utc()
NonExistentTimeError: 2015-03-29 02:30:00
Transform nonexistent times to NaT or shift the times.
In [489]: dti
Out[489]:
DatetimeIndex(['2015-03-29 02:30:00', '2015-03-29 03:30:00',
'2015-03-29 04:30:00'],
dtype='datetime64[ns]', freq='h')
In [490]: dti.tz_localize("Europe/Warsaw", nonexistent="shift_forward")
Out[490]:
DatetimeIndex(['2015-03-29 03:00:00+02:00', '2015-03-29 03:30:00+02:00',
'2015-03-29 04:30:00+02:00'],
dtype='datetime64[ns, Europe/Warsaw]', freq=None)
In [491]: dti.tz_localize("Europe/Warsaw", nonexistent="shift_backward")
Out[491]:
DatetimeIndex(['2015-03-29 01:59:59.999999999+01:00',
'2015-03-29 03:30:00+02:00',
'2015-03-29 04:30:00+02:00'],
dtype='datetime64[ns, Europe/Warsaw]', freq=None)
In [492]: dti.tz_localize("Europe/Warsaw", nonexistent=pd.Timedelta(1, unit="h"))
Out[492]:
DatetimeIndex(['2015-03-29 03:30:00+02:00', '2015-03-29 03:30:00+02:00',
'2015-03-29 04:30:00+02:00'],
dtype='datetime64[ns, Europe/Warsaw]', freq=None)
In [493]: dti.tz_localize("Europe/Warsaw", nonexistent="NaT")
Out[493]:
DatetimeIndex(['NaT', '2015-03-29 03:30:00+02:00',
'2015-03-29 04:30:00+02:00'],
dtype='datetime64[ns, Europe/Warsaw]', freq=None)
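One possible use of these options, sketched here under the assumption that you want to inspect the affected entries before deciding how to shift them: localizing with nonexistent="NaT" gives a mask of the entries that fall into the DST gap. The names gap_mask and shifted are illustrative only.

```python
import pandas as pd

dti = pd.DatetimeIndex(
    ["2015-03-29 02:30:00", "2015-03-29 03:30:00", "2015-03-29 04:30:00"]
)

# Entries that become NaT are the ones that do not exist in Europe/Warsaw.
gap_mask = dti.tz_localize("Europe/Warsaw", nonexistent="NaT").isna()
print(dti[gap_mask])

# Shifting by a fixed Timedelta can land on a timestamp that is already present,
# so the localized index may no longer be unique.
shifted = dti.tz_localize("Europe/Warsaw", nonexistent=pd.Timedelta(1, unit="h"))
print(shifted.is_unique)
```

If the index is later used for label-based lookups, it may be worth checking is_unique after a timedelta shift.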
Time zone Series operations
A Series with time zone naive values is represented with a dtype of datetime64[ns].
In [494]: s_naive = pd.Series(pd.date_range("20130101", periods=3))
In [495]: s_naive
Out[495]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
dtype: datetime64[ns]
A Series with time zone aware values is represented with a dtype of datetime64[ns, tz], where tz is the time zone.
In [496]: s_aware = pd.Series(pd.date_range("20130101", periods=3, tz="US/Eastern"))
In [497]: s_aware
Out[497]:
0 2013-01-01 00:00:00-05:00
1 2013-01-02 00:00:00-05:00
2 2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]
The time zone information of both of these Series can be manipulated via the .dt accessor; see the dt accessor section.
For example, to localize a naive Series to UTC and then convert it to another time zone:
In [498]: s_naive.dt.tz_localize("UTC").dt.tz_convert("US/Eastern")
Out[498]:
0 2012-12-31 19:00:00-05:00
1 2013-01-01 19:00:00-05:00
2 2013-01-02 19:00:00-05:00
dtype: datetime64[ns, US/Eastern]
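The .dt accessor can also go the other way and drop the time zone. The sketch below is an addition to the original text and assumes the usual behavior of passing None as the time zone; wall and utc_naive are illustrative names.

```python
import pandas as pd

s_aware = pd.Series(pd.date_range("20130101", periods=3, tz="US/Eastern"))

# tz_localize(None) drops the zone but keeps the local wall clock time.
wall = s_aware.dt.tz_localize(None)

# tz_convert(None) converts to UTC first and then drops the zone.
utc_naive = s_aware.dt.tz_convert(None)

print(wall.iloc[0], utc_naive.iloc[0])
```

The two results differ by the UTC offset of the original zone, so the choice depends on whether local or UTC wall times are needed downstream.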
Time zone information can also be manipulated using the astype method. This method can convert between different timezone-aware dtypes.
# convert to a new time zone
In [499]: s_aware.astype("datetime64[ns, CET]")
Out[499]:
0 2013-01-01 06:00:00+01:00
1 2013-01-02 06:00:00+01:00
2 2013-01-03 06:00:00+01:00
dtype: datetime64[ns, CET]
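A short sketch, under the assumption that astype between aware dtypes behaves like a time zone conversion: the displayed wall times change, but the underlying instants do not. The names via_astype and via_accessor are illustrative.

```python
import pandas as pd

s_aware = pd.Series(pd.date_range("20130101", periods=3, tz="US/Eastern"))

via_astype = s_aware.astype("datetime64[ns, CET]")
via_accessor = s_aware.dt.tz_convert("CET")

# Aware values are compared by instant, so a pure zone conversion compares equal.
same_instants = bool((via_astype == s_aware).all())
agree = bool((via_astype == via_accessor).all())

print(same_instants, agree)
```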
Calling Series.to_numpy() on a Series returns a NumPy array of the data. NumPy does not currently support time zones (even though it is printing in the local time zone!), so an object array of Timestamps is returned for time zone aware data:
In [500]: s_naive.to_numpy()
Out[500]:
array(['2013-01-01T00:00:00.000000000', '2013-01-02T00:00:00.000000000',
'2013-01-03T00:00:00.000000000'], dtype='datetime64[ns]')
In [501]: s_aware.to_numpy()
Out[501]:
array([Timestamp('2013-01-01 00:00:00-0500', tz='US/Eastern'),
Timestamp('2013-01-02 00:00:00-0500', tz='US/Eastern'),
Timestamp('2013-01-03 00:00:00-0500', tz='US/Eastern')],
dtype=object)
Converting to an object array of Timestamps preserves the time zone information. For example, when converting back to a Series:
In [502]: pd.Series(s_aware.to_numpy())
Out[502]:
0 2013-01-01 00:00:00-05:00
1 2013-01-02 00:00:00-05:00
2 2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]
However, if you want an actual NumPy datetime64[ns] array (with the values converted to UTC) instead of an array of objects, you can specify the dtype argument:
In [503]: s_aware.to_numpy(dtype="datetime64[ns]")
Out[503]:
array(['2013-01-01T05:00:00.000000000', '2013-01-02T05:00:00.000000000',
'2013-01-03T05:00:00.000000000'], dtype='datetime64[ns]')
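Because the datetime64[ns] array holds UTC values without any zone attached, the zone has to be reapplied by hand when going back to pandas. A minimal sketch of such a round trip, with utc_values and restored as illustrative names:

```python
import pandas as pd

s_aware = pd.Series(pd.date_range("20130101", periods=3, tz="US/Eastern"))

# Plain NumPy datetime64 values, expressed in UTC, with no time zone attached.
utc_values = s_aware.to_numpy(dtype="datetime64[ns]")

# Declare the naive values to be UTC, then convert to the original zone.
restored = pd.Series(utc_values).dt.tz_localize("UTC").dt.tz_convert("US/Eastern")

print(bool((restored == s_aware).all()))
```

The original zone name is not stored in the NumPy array, so it has to be kept separately if this round trip is needed.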