Time Series Concise Tutorial
Time Series - Data Processing and Visualization
A time series is a sequence of observations indexed at equally spaced time intervals. Hence, order and continuity must be maintained in any time series.
The dataset we will be using is a multivariate time series containing hourly air quality data, covering approximately one year, for a significantly polluted Italian city. The dataset can be downloaded from the link given below − https://archive.ics.uci.edu/ml/datasets/air+quality.
It is necessary to make sure that −
- The time series is equally spaced, and
- There are no redundant values or gaps in it.
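Both conditions can be checked directly on a datetime index. The following is a minimal sketch on a small made-up hourly series, not on the tutorial's dataset; the timestamps, values, and variable names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical hourly index with one duplicated stamp and one missing hour.
idx = pd.to_datetime([
    "2004-03-10 18:00", "2004-03-10 19:00", "2004-03-10 19:00",
    "2004-03-10 20:00", "2004-03-10 22:00",
])
s = pd.Series([2.6, 2.0, 2.0, 2.2, 1.6], index=idx)

# Redundant values: timestamps that occur more than once.
n_duplicates = int(s.index.duplicated().sum())

# Gaps: hourly stamps we would expect between the first and last
# observation but that are absent from the index.
expected = pd.date_range(s.index.min(), s.index.max(), freq="h")
missing = expected[~expected.isin(s.index)]

print(n_duplicates)
print(list(missing))
```

A series that passes both checks has no duplicated timestamps and an empty `missing` index.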
If the time series is not continuous, we can upsample or downsample it.
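As a rough illustration of what upsampling and downsampling look like in pandas, the sketch below resamples a small made-up hourly series. The values, the target frequencies, and the choice of linear interpolation and bin means are assumptions, not something the tutorial prescribes.

```python
import pandas as pd

# Hypothetical hourly temperature readings.
idx = pd.date_range("2004-03-10 18:00", periods=4, freq="h")
s = pd.Series([13.6, 13.3, 11.9, 11.0], index=idx)

# Upsample to 30-minute steps; the new slots are filled by
# linear interpolation between neighbouring observations.
upsampled = s.resample("30min").interpolate(method="linear")

# Downsample to 2-hour steps; each bin is summarised by its mean.
downsampled = s.resample("2h").mean()

print(upsampled)
print(downsampled)
```

Upsampling always requires a decision about how to fill the new time slots, while downsampling requires a decision about how to aggregate each bin.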
Showing df.head()
In [122]:
import pandas
In [123]:
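# The file uses ';' as the field separator and ',' as the decimal mark.
df = pandas.read_csv("AirQualityUCI.csv", sep = ";", decimal = ",")
# Keep only the first 14 columns (Date through RH).
df = df.iloc[ : , 0:14]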
In [124]:
len(df)
Out[124]:
9471
In [125]:
df.head()
Out[125]:
When preprocessing a time series, we make sure the dataset contains no NaN (NULL) values; if it does, we can replace them with 0, the average, or the preceding or succeeding value. Replacing is preferred over dropping because it preserves the continuity of the time series. In our dataset, however, the NULL values appear to be confined to the last few rows, so dropping them will not affect continuity.
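The replacement options listed above map onto standard pandas calls. The sketch below applies each of them to a small made-up series; the values and variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two missing observations.
s = pd.Series([10.0, np.nan, 14.0, np.nan, 12.0])

filled_zero = s.fillna(0)         # replace NaN with 0
filled_mean = s.fillna(s.mean())  # replace NaN with the mean of the observed values
filled_prev = s.ffill()           # carry the preceding value forward
filled_next = s.bfill()           # pull the succeeding value backward

print(filled_prev.tolist())
```

Forward and backward filling keep the replaced values close to their neighbours in time, whereas 0 or the global mean can introduce artificial jumps.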
Dropping NaN (Not-a-Number)
In [126]:
df.isna().sum()
Out[126]:
Date 114
Time 114
CO(GT) 114
PT08.S1(CO) 114
NMHC(GT) 114
C6H6(GT) 114
PT08.S2(NMHC) 114
NOx(GT) 114
PT08.S3(NOx) 114
NO2(GT) 114
PT08.S4(NO2) 114
PT08.S5(O3) 114
T 114
RH 114
dtype: int64
In [127]:
df = df[df['Date'].notnull()]
In [128]:
df.isna().sum()
Out[128]:
Date 0
Time 0
CO(GT) 0
PT08.S1(CO) 0
NMHC(GT) 0
C6H6(GT) 0
PT08.S2(NMHC) 0
NOx(GT) 0
PT08.S3(NOx) 0
NO2(GT) 0
PT08.S4(NO2) 0
PT08.S5(O3) 0
T 0
RH 0
dtype: int64
Time series are usually plotted as line graphs against time. To do that, we now combine the Date and Time columns and convert the resulting strings into datetime objects. This can be accomplished using the datetime library.
Converting to datetime object
In [129]:
# Join the Date and Time strings into a single column.
df['DateTime'] = df.Date + ' ' + df.Time
print(type(df.DateTime[0]))
<class 'str'>
In [130]:
import datetime
# Parse each string using the dataset's day/month/year date and dot-separated time layout.
df.DateTime = df.DateTime.apply(lambda x: datetime.datetime.strptime(x, '%d/%m/%Y %H.%M.%S'))
print(type(df.DateTime[0]))
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
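As an alternative to applying `strptime` row by row, the same conversion can be sketched with `pandas.to_datetime`, which parses the whole column in one call. This is not the tutorial's method; the two sample rows below are made up to match the format string used above.

```python
import pandas as pd

# Hypothetical rows in the same day/month/year and dotted-time layout.
sample = pd.DataFrame({
    "Date": ["10/03/2004", "10/03/2004"],
    "Time": ["18.00.00", "19.00.00"],
})

# Combine the two string columns and parse them in a single vectorised call.
sample["DateTime"] = pd.to_datetime(
    sample["Date"] + " " + sample["Time"],
    format="%d/%m/%Y %H.%M.%S",
)

print(sample["DateTime"].dtype)
```

Passing an explicit `format` avoids ambiguity between day-first and month-first dates.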
Let us see how some variables, such as temperature, change over time.
Showing plots
In [131]:
df.index = df.DateTime
In [132]:
import matplotlib.pyplot as plt
plt.plot(df['T'])
Out[132]:
[<matplotlib.lines.Line2D at 0x1eaad67f780>]
In [208]:
plt.plot(df['C6H6(GT)'])
Out[208]:
[<matplotlib.lines.Line2D at 0x1eaaeedff28>]
Box plots are another useful kind of graph that condenses a lot of information about a dataset into a single figure. A box plot shows the median, the 25% and 75% quartiles, and the outliers of one or more variables. When the outliers are few and lie very far from the mean, we can eliminate them by setting them to the mean value or the 75% quartile value.
Showing Boxplots
In [134]:
plt.boxplot(df[['T','C6H6(GT)']].values)
Out[134]:
{'whiskers': [<matplotlib.lines.Line2D at 0x1eaac16de80>,
<matplotlib.lines.Line2D at 0x1eaac16d908>,
<matplotlib.lines.Line2D at 0x1eaac177a58>,
<matplotlib.lines.Line2D at 0x1eaac177cf8>],
'caps': [<matplotlib.lines.Line2D at 0x1eaac16d2b0>,
<matplotlib.lines.Line2D at 0x1eaac16d588>,
<matplotlib.lines.Line2D at 0x1eaac1a69e8>,
<matplotlib.lines.Line2D at 0x1eaac1a64a8>],
'boxes': [<matplotlib.lines.Line2D at 0x1eaac16dc50>,
<matplotlib.lines.Line2D at 0x1eaac1779b0>],
'medians': [<matplotlib.lines.Line2D at 0x1eaac16d4a8>,
<matplotlib.lines.Line2D at 0x1eaac1a6c50>],
'fliers': [<matplotlib.lines.Line2D at 0x1eaac177dd8>,
<matplotlib.lines.Line2D at 0x1eaac1a6c18>],
'means': []}
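The outlier replacement described before the box plot can be sketched as follows. This is a minimal example on made-up data; flagging outliers with the 1.5 × IQR rule (the same threshold matplotlib uses for its default whiskers) and handling only the upper side are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical readings with a single extreme value.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

# Quartiles and interquartile range.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Values above this fence are treated as outliers (assumed 1.5 * IQR rule).
upper_fence = q3 + 1.5 * iqr

# Keep values at or below the fence; set the rest to the 75% quartile.
capped = s.where(s <= upper_fence, q3)

print(capped.tolist())
```

Replacing `q3` with `s.mean()` in the last step gives the mean-based variant instead, though the mean itself is pulled upward by the outlier being replaced.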