Time Series - Quick Guide

Time Series - Introduction

A time series is a sequence of observations recorded over a certain period. A univariate time series consists of the values taken by a single variable at periodic time instances, and a multivariate time series consists of the values taken by multiple variables at the same periodic time instances. The simplest example of a time series that all of us come across on a day-to-day basis is the change in temperature throughout the day, week, month or year.

The analysis of temporal data gives us useful insights into how a variable changes over time, or how it depends on changes in the values of other variables. This dependence of a variable on its previous values and/or on other variables can be analyzed for time series forecasting and has numerous applications in artificial intelligence.

Time Series - Programming Languages

A basic understanding of at least one programming language is essential for working on machine learning problems. A list of preferred programming languages for anyone who wants to work on machine learning is given below −

Python

It is a high-level interpreted programming language that is fast and easy to code in. Python supports both procedural and object-oriented programming paradigms. Its wide variety of libraries makes the implementation of complicated procedures simpler. In this tutorial, we will be coding in Python, and the libraries useful for time series modelling will be discussed in the upcoming chapters.

R

Similar to Python, R is an interpreted multi-paradigm language, which supports statistical computing and graphics. Its variety of packages makes it easier to implement machine learning models in R.

Java

It is an object-oriented programming language that is compiled to bytecode and run on the Java Virtual Machine. It is widely known for its large range of available packages and sophisticated data visualization techniques.

C/C++

These are compiled languages, and two of the oldest programming languages still in wide use. They are often preferred for adding ML capabilities to existing applications, as they allow you to customize the implementation of ML algorithms easily.

MATLAB

MATLAB (MATrix LABoratory) is a multi-paradigm language that provides functionality for working with matrices. It allows mathematical operations on complex problems. It is primarily used for numerical operations, but some packages also allow graphical multi-domain simulation and model-based design.

Other preferred programming languages for machine learning problems include JavaScript, LISP, Prolog, SQL, Scala, Julia, SAS etc.

Time Series - Python Libraries

Python is popular among machine learning practitioners because of its easy-to-write and easy-to-understand code structure as well as its wide variety of open source libraries. A few of the open source libraries that we will be using in the coming chapters are introduced below.

NumPy

NumPy (Numerical Python) is a library used for scientific computing. It works on an N-dimensional array object and provides basic functionality such as size, shape, mean, standard deviation, minimum and maximum, as well as more complex functions such as linear algebra routines and the Fourier transform. You will learn more about these as we move ahead in this tutorial.

Pandas

This library provides highly efficient and easy-to-use data structures such as series and dataframes (and, in older versions, panels). It has extended Python’s functionality from mere data collection and preparation to data analysis. Together, Pandas and NumPy make operations on small to very large datasets very simple.

SciPy

SciPy (Scientific Python) is a library used for scientific and technical computing. It provides functionality for optimization, signal and image processing, integration, interpolation and linear algebra. This library comes in handy while performing machine learning. We will discuss these functionalities as we move ahead in this tutorial.

Scikit Learn

This library is a SciPy toolkit widely used for statistical modelling and machine learning, as it contains various customizable regression, classification and clustering models. It works well with NumPy, Pandas and other libraries, which makes it easier to use.

Statsmodels

Like Scikit Learn, this library is used for statistical data exploration and statistical modelling. It also works well with other Python libraries.

Matplotlib

This library is used for data visualization in various formats such as line plots, bar graphs, heat maps, scatter plots, histograms etc. It contains all the graph-related functionality required, from plotting to labelling. We will discuss these functionalities as we move ahead in this tutorial.

These libraries are essential for getting started with machine learning on any sort of data.

Besides the ones discussed above, another library especially significant for dealing with time series is −

Datetime

Python's standard library provides two modules, datetime and calendar, which together supply the functionality needed for reading, formatting and manipulating dates and times.

We shall be using these libraries in the coming chapters.

Time Series - Data Processing and Visualization

A time series is a sequence of observations indexed at equally spaced time intervals. Hence, the order and continuity should be maintained in any time series.

The dataset we will be using is a multivariate time series with hourly data, covering approximately one year, on air quality in a significantly polluted Italian city. The dataset can be downloaded from the link given below − https://archive.ics.uci.edu/ml/datasets/air+quality

It is necessary to make sure that −

  1. The time series is equally spaced, and

  2. There are no redundant values or gaps in it.

In case the time series is not continuous, we can upsample or downsample it.
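As a minimal sketch of these checks, assuming a small hypothetical hourly series (the timestamps and values below are invented for illustration and are not taken from the air quality dataset), duplicated timestamps can be dropped and the series placed on a regular grid with pandas:

```python
import pandas as pd

# Hypothetical hourly readings: 01:00 is duplicated and 02:00 is missing
stamps = pd.to_datetime([
    "2004-03-10 00:00", "2004-03-10 01:00", "2004-03-10 01:00",
    "2004-03-10 03:00", "2004-03-10 04:00",
])
series = pd.Series([10.0, 11.0, 11.0, 13.0, 14.0], index=stamps)

# Remove redundant timestamps, keeping the first reading for each
series = series[~series.index.duplicated(keep="first")]

# Put the series on an equally spaced hourly grid; the gap becomes NaN
full_index = pd.date_range(series.index.min(), series.index.max(),
                           freq=pd.Timedelta(hours=1))
hourly = series.reindex(full_index)

# Fill the gap by linear interpolation between its neighbours
filled = hourly.interpolate()

# Downsample to one averaged value every two hours
two_hourly = hourly.resample(pd.Timedelta(hours=2)).mean()
```

Here `hourly` exposes the missing hour as NaN, `filled` closes it, and `two_hourly` shows downsampling to a coarser grid.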

Showing df.head()

In [122]:

import pandas

In [123]:

# The file is semicolon-separated and uses a comma as the decimal mark
df = pandas.read_csv("AirQualityUCI.csv", sep = ";", decimal = ",")
# Keep the first 14 columns (Date to RH)
df = df.iloc[ : , 0:14]

In [124]:

len(df)

Out[124]:

9471

In [125]:

df.head()

When preprocessing a time series, we make sure there are no NaN (NULL) values in the dataset; if there are, we can replace them with 0, with the average, or with the preceding or succeeding value. Replacing is preferred over dropping, so that the continuity of the time series is maintained. However, in our dataset only the last few rows are NULL, so dropping them will not affect the continuity.
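The replacement options listed above can be sketched on a small hypothetical series (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings with two missing values
readings = pd.Series([20.0, np.nan, 22.0, np.nan, 26.0])

with_zero = readings.fillna(0)                   # replace with 0
with_mean = readings.fillna(readings.mean())     # replace with the average
with_previous = readings.ffill()                 # replace with the preceding value
with_next = readings.bfill()                     # replace with the succeeding value
```

For a smoothly varying quantity such as temperature, the preceding or succeeding value is usually closer to the truth than 0, which would appear as a sharp artificial dip in the plot.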

Dropping NaN(Not-a-Number)

In [126]:

df.isna().sum()
Out[126]:
Date             114
Time             114
CO(GT)           114
PT08.S1(CO)      114
NMHC(GT)         114
C6H6(GT)         114
PT08.S2(NMHC)    114
NOx(GT)          114
PT08.S3(NOx)     114
NO2(GT)          114
PT08.S4(NO2)     114
PT08.S5(O3)      114
T                114
RH               114
dtype: int64

In [127]:

# Rows without a Date are the empty rows at the end of the file
df = df[df['Date'].notnull()]

In [128]:

df.isna().sum()

Out[128]:

Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
NMHC(GT)         0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
dtype: int64

Time series are usually plotted as line graphs against time. For that, we will now combine the date and time columns and convert the resulting strings into datetime objects. This can be accomplished using the datetime library.

Converting to datetime object

In [129]:

df['DateTime'] = (df.Date) + ' ' + (df.Time)
print (type(df.DateTime[0]))

<class 'str'>

In [130]:

import datetime

# The Time column uses dots as separators (HH.MM.SS)
df.DateTime = df.DateTime.apply(lambda x: datetime.datetime.strptime(x, '%d/%m/%Y %H.%M.%S'))
print (type(df.DateTime[0]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>

Let us see how some variables, such as temperature, change over time.

Showing plots

In [131]:

df.index = df.DateTime

In [132]:

import matplotlib.pyplot as plt
plt.plot(df['T'])

Out[132]:

[<matplotlib.lines.Line2D at 0x1eaad67f780>]

In [208]:

plt.plot(df['C6H6(GT)'])

Out[208]:

[<matplotlib.lines.Line2D at 0x1eaaeedff28>]

Box plots are another useful kind of graph that allow you to condense a lot of information about a dataset into a single graph. A box plot shows the median, the 25% and 75% quartiles, and the outliers of one or multiple variables. When the outliers are few and lie very far from the rest of the data, we can eliminate them by setting them to the mean value or the 75% quartile value.
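A minimal sketch of this outlier treatment is given below. It assumes the common rule that a point is an outlier when it lies more than 1.5 times the inter-quartile range beyond a quartile; that threshold, the choice to set low outliers to the 25% quartile, and the sample values are assumptions made for illustration.

```python
import numpy as np

# Hypothetical readings with one extreme value
values = np.array([12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 95.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Set high outliers to the 75% quartile and low outliers to the 25% quartile
cleaned = np.where(values > upper, q3, values)
cleaned = np.where(cleaned < lower, q1, cleaned)
```

Only the extreme reading is changed; the other values pass through untouched.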

Showing Boxplots

In [134]:

plt.boxplot(df[['T','C6H6(GT)']].values)

Out[134]:

{'whiskers': [<matplotlib.lines.Line2D at 0x1eaac16de80>,
   <matplotlib.lines.Line2D at 0x1eaac16d908>,
   <matplotlib.lines.Line2D at 0x1eaac177a58>,
   <matplotlib.lines.Line2D at 0x1eaac177cf8>],
   'caps': [<matplotlib.lines.Line2D at 0x1eaac16d2b0>,
   <matplotlib.lines.Line2D at 0x1eaac16d588>,
   <matplotlib.lines.Line2D at 0x1eaac1a69e8>,
   <matplotlib.lines.Line2D at 0x1eaac1a64a8>],
   'boxes': [<matplotlib.lines.Line2D at 0x1eaac16dc50>,
   <matplotlib.lines.Line2D at 0x1eaac1779b0>],
   'medians': [<matplotlib.lines.Line2D at 0x1eaac16d4a8>,
   <matplotlib.lines.Line2D at 0x1eaac1a6c50>],
   'fliers': [<matplotlib.lines.Line2D at 0x1eaac177dd8>,
   <matplotlib.lines.Line2D at 0x1eaac1a6c18>],'means': []
}

Time Series - Modeling

Introduction

A time series has 4 components, as given below −

  1. Level − It is the mean value around which the series varies.

  2. Trend − It is the increasing or decreasing behavior of a variable with time.

  3. Seasonality − It is the cyclic behavior of time series.

  4. Noise − It is the error in the observations added due to environmental factors.
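To make the four components concrete, the sketch below builds a synthetic hourly series by adding them together. The additive form, the 24-step season and all the numeric values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(96)                                 # four days of hourly steps

level = 20.0                                      # mean value the series varies around
trend = 0.05 * t                                  # slow increase with time
seasonality = 5.0 * np.sin(2 * np.pi * t / 24)    # cycle repeating every 24 steps
noise = rng.normal(0.0, 0.5, size=t.size)         # random error in the observations

series = level + trend + seasonality + noise
```

Plotting `series` next to each component shows how much of the visible shape each one contributes.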

Time Series Modeling Techniques

To capture these components, there are a number of popular time series modelling techniques. This section gives a brief introduction to each technique; we will discuss them in detail in the upcoming chapters −

Naïve Methods

These are simple estimation techniques, in which the predicted value is set equal to the mean of the preceding values of the time-dependent variable, or to the previous actual value. They are used as a baseline for comparison with sophisticated modelling techniques.

Auto Regression

Auto regression predicts the values of future time periods as a function of the values at previous time periods. Predictions from auto regression may fit the data better than those of naïve methods, but it may not be able to account for seasonality.

ARIMA Model

An auto-regressive integrated moving-average (ARIMA) model represents the value of a variable as a linear function of previous values and of residual errors at previous time steps of a stationary time series. However, real-world data may be non-stationary and have seasonality, so Seasonal-ARIMA and Fractional-ARIMA were developed. ARIMA works on univariate time series; to handle multiple variables, VARIMA was introduced.

Exponential Smoothing

It models the value of a variable as an exponentially weighted linear function of previous values. This statistical model can handle trend and seasonality as well.
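The simplest member of this family can be sketched in a few lines. The function below is an illustrative implementation of simple exponential smoothing (level only, without the trend and seasonality extensions); the function name and the smoothing factor `alpha` are names chosen here, not part of any library.

```python
def simple_exponential_smoothing(values, alpha):
    """Return the smoothed series s, where s[0] = values[0] and
    s[t] = alpha * values[t] + (1 - alpha) * s[t - 1]."""
    smoothed = [values[0]]
    for value in values[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


smoothed = simple_exponential_smoothing([10, 20, 30], alpha=0.5)
```

Unrolling the recursion shows that each older observation is weighted by a further factor of `(1 - alpha)`, which is where the exponential weighting comes from.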

LSTM

The Long Short-Term Memory (LSTM) model is a recurrent neural network that is used on time series to account for long-term dependencies. It can be trained on a large amount of data to capture the trends in multivariate time series.

The modelling techniques above are used for time series regression. In the coming chapters, let us explore them one by one.

Time Series - Parameter Calibration

Introduction

Any statistical or machine learning model has some parameters that greatly influence how the data is modeled. For example, ARIMA has p, d and q values. These parameters are to be chosen such that the error between the actual values and the modeled values is minimal. Parameter calibration is said to be the most crucial and time-consuming task of model fitting. Hence, it is essential for us to choose optimal parameters.

Methods for Calibration of Parameters

There are various ways to calibrate parameters. This section describes some of them.

Hit-and-try

One common way of calibrating models is hand calibration, where you start by visualizing the time series, intuitively try some parameter values, and change them over and over until you achieve a good enough fit. It requires a good understanding of the model being tried. For the ARIMA model, hand calibration is conventionally done with the help of the partial auto-correlation plot for the ‘p’ parameter, the auto-correlation plot for the ‘q’ parameter, and the ADF test to confirm the stationarity of the time series and set the ‘d’ parameter. We will discuss all these in detail in the coming chapters.

Another way of calibrating models is grid search, which essentially means you build a model for every possible combination of parameter values and select the one with the minimum error. This is time-consuming, since it involves multiple nested for loops, and hence is useful only when the number of parameters to be calibrated and the range of values they take are small.
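A grid search can be sketched as follows. The model here is a deliberately simple hypothetical forecaster, y_hat[t] = a * y[t-1] + b * y[t-2], applied to a noise-free toy series; it stands in for whatever model is being calibrated, and `itertools.product` replaces the nested for loops.

```python
from itertools import product
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


# Toy series generated with a = 0.6 and b = 0.3 (no noise)
series = [1.0, 2.0]
for _ in range(20):
    series.append(0.6 * series[-1] + 0.3 * series[-2])

grid = [0.0, 0.3, 0.6, 0.9]              # candidate values for each parameter
best_params, best_error = None, float("inf")

for a, b in product(grid, grid):         # every combination of (a, b)
    predicted = [a * series[t - 1] + b * series[t - 2]
                 for t in range(2, len(series))]
    error = rmse(series[2:], predicted)
    if error < best_error:
        best_params, best_error = (a, b), error
```

With 4 candidate values for each of 2 parameters the loop fits 16 models; adding a third parameter would already require 64, which is why grid search scales poorly.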

Genetic Algorithm

A genetic algorithm works on the biological principle that good solutions will eventually evolve towards the most ‘optimal’ solution. It uses the biological operations of mutation, cross-over and selection to finally reach an optimal solution.
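The three operations can be illustrated with a toy problem: finding the value of x that maximizes -(x - 3)^2. Everything in this sketch (population size, number of generations, averaging as the cross-over, Gaussian mutation) is an assumption chosen for brevity rather than a recommended configuration.

```python
import random


def fitness(x):
    """Higher is better; the optimum is at x = 3."""
    return -(x - 3.0) ** 2


rng = random.Random(0)
population = [rng.uniform(-10, 10) for _ in range(20)]

for generation in range(60):
    # Selection: keep the fitter half of the population as parents
    population.sort(key=fitness, reverse=True)
    parents = population[:10]

    children = []
    for _ in range(10):
        mother, father = rng.sample(parents, 2)
        child = (mother + father) / 2        # cross-over: blend two parents
        child += rng.gauss(0, 0.3)           # mutation: small random change
        children.append(child)

    population = parents + children

best = max(population, key=fitness)
```

For model calibration, each individual would be a set of parameter values such as (p, d, q) and the fitness would be the negative forecast error.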

For further knowledge, you can read about other parameter optimization techniques such as Bayesian optimization and Particle Swarm optimization.

Time Series - Naïve Methods

Introduction

Naïve methods, such as assuming the predicted value at time ‘t’ to be the actual value of the variable at time ‘t-1’ or the rolling mean of the series, are used as a baseline to judge how well statistical and machine learning models perform, and to show why those models are needed.

In this chapter, let us try these methods on one of the features of our time series data.

First, we shall look at the mean of the ‘temperature’ feature of our data and the deviation around it. It is also useful to see the maximum and minimum temperature values. We can use the functionality of the numpy library here.

Showing statistics

In [135]:

import numpy
print (
   'Mean: ', numpy.mean(df['T']),
   '; Standard Deviation: ', numpy.std(df['T']),
   ';\nMaximum Temperature: ', max(df['T']),
   '; Minimum Temperature: ', min(df['T'])
)

We now have statistics for all 9357 observations across an equally spaced timeline, which help us understand the data.

Now we will try the first naïve method, setting the predicted value at the present time equal to the actual value at the previous time, and calculate the root mean squared error (RMSE) to quantify the performance of this method.

Showing 1st naïve method

In [136]:

# New column holding the temperature of the previous time step
df['T_t-1'] = df['T'].shift(1)

In [137]:

# Skip the first row, which has no previous value
df_naive = df[['T','T_t-1']][1:]

In [138]:

from sklearn import metrics
from math import sqrt

true = df_naive['T']
prediction = df_naive['T_t-1']
error = sqrt(metrics.mean_squared_error(true,prediction))
print ('RMSE for Naive Method 1: ', error)

RMSE for Naive Method 1: 12.901140576492974

Let us see the next naïve method, where the predicted value at the present time is set to the mean of the time periods preceding it. We will calculate the RMSE for this method too.

Showing 2nd naïve method

In [139]:

# Mean of the 3 preceding values; shift(1) excludes the current value
df['T_rm'] = df['T'].rolling(3).mean().shift(1)
df_naive = df[['T','T_rm']].dropna()

In [140]:

true = df_naive['T']
prediction = df_naive['T_rm']
error = sqrt(metrics.mean_squared_error(true,prediction))
print ('RMSE for Naive Method 2: ', error)

RMSE for Naive Method 2: 14.957633272839242

Here, you can experiment with the number of previous time periods, also called ‘lags’, that you want to consider; it is kept as 3 here. In this data, it can be seen that the error increases as you increase the number of lags. If the lag is kept as 1, the method becomes the same as the naïve method used earlier.
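The experiment with different lags can be sketched as a small helper. Since the dataframe from the earlier cells is not available in a standalone example, a synthetic series with a 24-step cycle is assumed here; `rolling_mean_rmse` is a name chosen for this illustration.

```python
import numpy as np
import pandas as pd


def rolling_mean_rmse(series, window):
    """RMSE of forecasting each value by the mean of the `window` values before it."""
    forecast = series.rolling(window).mean().shift(1)
    paired = pd.DataFrame({"actual": series, "forecast": forecast}).dropna()
    return float(np.sqrt(((paired["actual"] - paired["forecast"]) ** 2).mean()))


# Synthetic stand-in for an hourly variable with a daily cycle
t = np.arange(240)
series = pd.Series(10 + 5 * np.sin(2 * np.pi * t / 24))

errors = {window: rolling_mean_rmse(series, window) for window in (1, 3, 6)}
```

On this smooth series the error grows with the window, because a longer average lags further behind the current value; a window of 1 reproduces the first naïve method.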

Points to Note

  1. You can write a very simple function for calculating the root mean squared error. Here, we have used the mean squared error function from the package ‘sklearn’ and then taken its square root.

  2. In pandas, df['column_name'] can also be written as df.column_name. However, for this dataset df.T will not work the same as df['T'], because df.T is the attribute that returns the transpose of a dataframe. So use only df['T'], or consider renaming this column before using the other syntax.
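The simple function mentioned in the first point could look like the sketch below; `rmse` is a name chosen here, and numpy is assumed to be available.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error: square root of the mean of squared differences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


example_error = rmse([0, 0, 0, 0], [2, 2, 2, 2])
```

Because the differences are squared before averaging, a few large errors raise the RMSE more than many small ones.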

Time Series - Auto Regression

For a stationary time series, an auto regression model sees the value of a variable at time ‘t’ as a linear function of the values at the ‘p’ time steps preceding it. Mathematically it can be written as −

$y_{t} = C + \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + \dots + \phi_{p}y_{t-p} + \epsilon_{t}$

Where, ‘p’ is the auto-regressive trend parameter,

$C$ is a constant and $\phi_{1}, \dots, \phi_{p}$ are the coefficients of the model,

$\epsilon_{t}$ is white noise, and

$y_{t-1}, y_{t-2}, \dots, y_{t-p}$ denote the value of the variable at previous time periods.
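To show how the equation turns into a fitting problem, the sketch below estimates C and the φ coefficients by ordinary least squares with numpy. This is an illustrative estimator under simple assumptions, not the procedure used by statsmodels; `fit_ar` is a name chosen here.

```python
import numpy as np


def fit_ar(series, p):
    """Least-squares estimate of [C, phi_1, ..., phi_p] for an AR(p) model."""
    y = np.asarray(series, dtype=float)
    # The row for time t holds [1, y[t-1], y[t-2], ..., y[t-p]]
    rows = [np.concatenate(([1.0], y[t - p:t][::-1])) for t in range(p, len(y))]
    design = np.array(rows)
    target = y[p:]
    coefficients, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefficients


# Noise-free toy series generated with C = 1, phi_1 = 0.5, phi_2 = 0.2
toy = [0.0, 1.0]
for _ in range(18):
    toy.append(1.0 + 0.5 * toy[-1] + 0.2 * toy[-2])

coefficients = fit_ar(toy, p=2)
```

On noise-free data the fit recovers the generating coefficients; with real data the white-noise term makes the estimates approximate.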

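The value of ‘p’ can be calibrated using various methods. A common starting point is the auto-correlation plot, which shows how strongly the series depends on its own past. The order ‘p’ itself is conventionally read from the lag after which the partial auto-correlation plot, covered in the next chapter, cuts off.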

Note − Before doing any analysis, we should separate the available data into train and test sets in an 8:2 ratio. The test data is used only to measure the accuracy of our model, and the assumption is that it is not available to us until after the predictions have been made. In a time series the sequence of data points is essential, so the order must not be lost while splitting the data.

An auto-correlation plot, or correlogram, shows the relation of a variable with itself at prior time steps. It makes use of Pearson’s correlation and shades the 95% confidence interval around zero. Let’s see what it looks like for the ‘temperature’ variable of our data.

Showing ACP

In [141]:

split = len(df) - int(0.2*len(df))
train, test = df['T'][0:split], df['T'][split:]

In [142]:

from statsmodels.graphics.tsaplots import plot_acf

plot_acf(train, lags = 100)
plt.show()

All the lag values lying outside the shaded blue region are assumed to have a correlation.
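The quantity behind each bar of the plot can be sketched directly as the Pearson correlation between the series and a lagged copy of itself. This is an illustration on a synthetic series; plotting functions may use a slightly different normalisation, so individual values can differ a little from the plot.

```python
import numpy as np


def autocorrelation(series, lag):
    """Pearson correlation between the series and itself shifted by `lag` steps."""
    y = np.asarray(series, dtype=float)
    return float(np.corrcoef(y[lag:], y[:-lag])[0, 1])


# Synthetic series that repeats every 24 steps
t = np.arange(240)
daily = np.sin(2 * np.pi * t / 24)

one_cycle = autocorrelation(daily, 24)    # one full cycle apart
half_cycle = autocorrelation(daily, 12)   # half a cycle apart
```

A full cycle apart the series lines up with itself and the correlation is close to 1, while half a cycle apart it is close to -1.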

Time Series - Moving Average

For a stationary time series, a moving average model sees the value of a variable at time ‘t’ as a linear function of the residual errors from the ‘q’ time steps preceding it. The residual error is calculated by comparing the value at time ‘t’ to the moving average of the preceding values.

Mathematically it can be written as −

$y_{t} = c + \epsilon_{t} + \theta_{1}\epsilon_{t-1} + \theta_{2}\epsilon_{t-2} + \dots + \theta_{q}\epsilon_{t-q}$

Where ‘q’ is the moving-average trend parameter,

$\epsilon_{t}$ is white noise, and

$\epsilon_{t-1}, \epsilon_{t-2}, \dots, \epsilon_{t-q}$ are the error terms at previous time periods.
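The equation can be evaluated directly once the error terms are known. The sketch below does so for hand-picked errors; `simulate_ma` and the sample numbers are assumptions made for illustration.

```python
def simulate_ma(c, thetas, errors):
    """Values of y[t] = c + e[t] + theta_1*e[t-1] + ... + theta_q*e[t-q]
    for every t that has q earlier error terms available."""
    q = len(thetas)
    values = []
    for t in range(q, len(errors)):
        weighted_past = sum(thetas[i] * errors[t - 1 - i] for i in range(q))
        values.append(c + errors[t] + weighted_past)
    return values


# MA(1) with c = 10 and theta_1 = 0.5, using three hand-picked error terms
values = simulate_ma(10.0, [0.5], [1.0, 2.0, -1.0])
```

The first result is 10 + 2 + 0.5 * 1 = 12.5 and the second is 10 - 1 + 0.5 * 2 = 10.0.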

The value of ‘q’ can be calibrated using various methods. One way is to use correlation plots: conventionally, ‘q’ is the lag after which the auto-correlation plot cuts off, while the partial auto-correlation plot introduced here plays the same role for the auto-regressive order ‘p’.

Unlike an auto-correlation plot, which shows direct as well as indirect correlations, a partial auto-correlation plot shows the relation of a variable with itself at prior time steps with the indirect correlations removed. Let’s see what it looks like for the ‘temperature’ variable of our data.

Showing PACP

In [143]:

from statsmodels.graphics.tsaplots import plot_pacf

plot_pacf(train, lags = 100)
plt.show()

A partial auto-correlation plot is read in the same way as a correlogram.

Time Series - ARIMA

We have already understood that, for a stationary time series, a variable at time ‘t’ is a linear function of prior observations or residual errors. Hence it is time for us to combine the two and have an auto-regressive moving average (ARMA) model.

However, at times the time series is not stationary, i.e. the statistical properties of the series, such as its mean and variance, change over time. The statistical models we have studied so far assume the time series to be stationary; therefore, we can include a pre-processing step of differencing the time series to make it stationary. It is thus important for us to find out whether the time series we are dealing with is stationary or not.
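Differencing replaces each value by its change from the previous value. A minimal sketch on a hypothetical series with a linear trend:

```python
import numpy as np

# Hypothetical series whose mean keeps rising with time
trending = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# First difference: y[t] - y[t-1]
differenced = np.diff(trending)
```

The differenced series is constant, so the trend has been removed; this single pass of differencing is what the ‘d’ parameter of ARIMA counts.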

Various methods can be used to check the stationarity of a time series: looking for seasonality or trend in the plot of the time series, checking the difference in mean and variance across various time periods, the Augmented Dickey-Fuller (ADF) test, the KPSS test, Hurst’s exponent etc.

Let us use the ADF test to see whether the ‘temperature’ variable of our dataset is a stationary time series or not.

In [74]:

from statsmodels.tsa.stattools import adfuller

result = adfuller(train)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
   print('\t%s: %.3f' % (key, value))

ADF Statistic: -10.406056
p-value: 0.000000
Critical Values:
    1%: -3.431
    5%: -2.862
    10%: -2.567

Now that we have run the ADF test, let us interpret the result. First, we compare the ADF statistic with the critical values: an ADF statistic greater (less negative) than the critical values tells us the series is most likely non-stationary. Next, we look at the p-value. A p-value greater than 0.05 also suggests that the time series is non-stationary.

Conversely, a p-value less than or equal to 0.05, or an ADF statistic less than the critical values, suggests that the time series is stationary.
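These reading rules can be written as a small helper. The function below is an illustrative sketch, not part of statsmodels; it assumes the stricter reading that both conditions should hold, and the 5% level is a default chosen here.

```python
def looks_stationary(adf_statistic, p_value, critical_values,
                     alpha=0.05, level="5%"):
    """True when the p-value is at most alpha and the ADF statistic is
    below (more negative than) the critical value at the chosen level."""
    return p_value <= alpha and adf_statistic < critical_values[level]


# Values printed by the ADF test above
critical = {"1%": -3.431, "5%": -2.862, "10%": -2.567}
stationary = looks_stationary(-10.406056, 0.0, critical)
```

For the values printed above, both conditions hold at every listed level.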

Hence, the time series we are dealing with is already stationary. For a stationary time series, we set the ‘d’ parameter to 0.

We can also confirm the stationarity of the time series using the Hurst exponent.

In [75]:

import hurst

H, c,data = hurst.compute_Hc(train)
print("H = {:.4f}, c = {:.4f}".format(H,c))

H = 0.1660, c = 5.0740

A value of H<0.5 shows anti-persistent behavior, and H>0.5 shows persistent behavior or a trending series. H=0.5 shows a random walk/Brownian motion. Here H<0.5, which is consistent with our series being stationary.

For a non-stationary time series, we set the ‘d’ parameter to 1, so that the series is differenced once. Also, the values of the auto-regressive trend parameter ‘p’ and the moving average trend parameter ‘q’ are calculated on the stationary time series, i.e. by plotting the ACP and PACP after differencing the time series.

The ARIMA model, which is characterized by 3 parameters (p,d,q), is now clear to us, so let us model our time series and predict the future values of temperature.

In [156]:

# statsmodels 0.12 and later provide ARIMA in statsmodels.tsa.arima.model
from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(train.values, order=(5, 0, 2))
model_fit = model.fit()

In [157]:

# Forecast one value for every time step of the test period
predictions = model_fit.forecast(steps=len(test))
test_ = pandas.DataFrame(test)
test_['predictions'] = predictions

In [158]:

plt.plot(df['T'])
plt.plot(test_.predictions)
plt.show()

In [167]:

error = sqrt(metrics.mean_squared_error(test.values, predictions))
print ('Test RMSE for ARIMA: ', error)

Time Series - Variations of ARIMA

In the previous chapter, we saw how the ARIMA model works and its limitation that it cannot handle seasonal data or multivariate time series. Hence, new models were introduced to include these features.

A glimpse of these new models is given here −

Vector Auto-Regression (VAR)

It is a generalized version of the auto regression model for multivariate stationary time series. It is characterized by the ‘p’ parameter.

Vector Moving Average (VMA)

It is a generalized version of the moving average model for multivariate stationary time series. It is characterized by the ‘q’ parameter.

Vector Auto Regression Moving Average (VARMA)

It is the combination of VAR and VMA, and a generalized version of the ARMA model for multivariate stationary time series. It is characterized by the ‘p’ and ‘q’ parameters. Just as ARMA can act as an AR model by setting the ‘q’ parameter to 0 and as an MA model by setting the ‘p’ parameter to 0, VARMA can act as a VAR model by setting the ‘q’ parameter to 0 and as a VMA model by setting the ‘p’ parameter to 0.

In [209]:

df_multi = df[['T', 'C6H6(GT)']]
split = len(df) - int(0.2*len(df))
train_multi, test_multi = df_multi[0:split], df_multi[split:]

In [211]:

from statsmodels.tsa.statespace.varmax import VARMAX

model = VARMAX(train_multi, order = (2,1))
model_fit = model.fit()
c:\users\naveksha\appdata\local\programs\python\python37\lib\site-packages\statsmodels\tsa\statespace\varmax.py:152:
   EstimationWarning: Estimation of VARMA(p,q) models is not generically robust,
   due especially to identification issues.
   EstimationWarning)
c:\users\naveksha\appdata\local\programs\python\python37\lib\site-packages\statsmodels\tsa\base\tsa_model.py:171:
   ValueWarning: No frequency information was provided, so inferred frequency H will be used.
  % freq, ValueWarning)
c:\users\naveksha\appdata\local\programs\python\python37\lib\site-packages\statsmodels\base\model.py:508:
   ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)

In [213]:

predictions_multi = model_fit.forecast( steps=len(test_multi))
c:\users\naveksha\appdata\local\programs\python\python37\lib\site-packages\statsmodels\tsa\base\tsa_model.py:320:
   FutureWarning: Creating a DatetimeIndex by passing range endpoints is deprecated.  Use `pandas.date_range` instead.
   freq = base_index.freq)
c:\users\naveksha\appdata\local\programs\python\python37\lib\site-packages\statsmodels\tsa\statespace\varmax.py:152:
   EstimationWarning: Estimation of VARMA(p,q) models is not generically robust, due especially to identification issues.
   EstimationWarning)

In [231]:

plt.plot(train_multi['T'])
plt.plot(test_multi['T'])
plt.plot(predictions_multi.iloc[:,0:1], '--')
plt.show()

plt.plot(train_multi['C6H6(GT)'])
plt.plot(test_multi['C6H6(GT)'])
plt.plot(predictions_multi.iloc[:,1:2], '--')
plt.show()

The above code shows how the VARMA model can be used to model a multivariate time series, although this model may not be the best suited to our data.

VARMA with Exogenous Variables (VARMAX)

It is an extension of the VARMA model in which extra variables, called covariates, are used to help model the primary variables we are interested in.

Seasonal Auto Regressive Integrated Moving Average (SARIMA)

This is the extension of the ARIMA model to deal with seasonal data. It divides the data into seasonal and non-seasonal components and models them in a similar fashion. It is characterized by 7 parameters: (p, d, q) for the non-seasonal part, the same as for the ARIMA model, and (P, D, Q, m) for the seasonal part, where ‘m’ is the number of time steps in one seasonal period and P, D, Q are the seasonal counterparts of the ARIMA parameters. These parameters can be calibrated using grid search or a genetic algorithm.
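The grid search mentioned above starts from a list of candidate parameter combinations. The following is a minimal sketch of how such a list could be built; `sarima_grid` is a hypothetical helper, and `m=24` is an assumption (hourly data with a daily cycle), not a value taken from the tutorial.

```python
from itertools import product

def sarima_grid(p_values, d_values, q_values, P_values, D_values, Q_values, m):
    """Enumerate candidate (order, seasonal_order) pairs for a grid search.

    sarima_grid is a hypothetical helper; m is held fixed because the
    seasonal period is usually known from the data.
    """
    grid = []
    for p, d, q in product(p_values, d_values, q_values):
        for P, D, Q in product(P_values, D_values, Q_values):
            grid.append(((p, d, q), (P, D, Q, m)))
    return grid

# m=24 is an assumed seasonal period (hourly observations, daily cycle)
candidates = sarima_grid([0, 1, 2], [0, 1], [0, 1, 2], [0, 1], [0, 1], [0, 1], m=24)
print(len(candidates))   # 3*2*3 * 2*2*2 = 144 candidate configurations
print(candidates[0])     # ((0, 0, 0), (0, 0, 0, 24))
```

Each pair could then be passed as `order` and `seasonal_order` to a model, and the fits compared on a criterion such as validation RMSE.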

SARIMA with Exogenous Variables (SARIMAX)

This is the extension of the SARIMA model to include exogenous variables, which help us model the variable we are interested in.

It may be useful to run a correlation analysis on variables before using them as exogenous variables.

In [251]:

from scipy.stats import pearsonr
x = train_multi['T'].values
y = train_multi['C6H6(GT)'].values

corr, p = pearsonr(x, y)
print('Correlation Coefficient =', corr, '\nP-Value =', p)
Correlation Coefficient = 0.9701173437269858
P-Value = 0.0

Pearson’s correlation measures the linear relationship between 2 variables. To interpret the results, we first look at the p-value: if it is less than 0.05, the coefficient is statistically significant; otherwise it is not. For a significant p-value, a positive correlation coefficient indicates positive correlation, and a negative value indicates negative correlation.
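To make the coefficient itself less of a black box, here is a minimal pure-Python sketch of how it is computed. `pearson_r` is an illustrative name, the data is made up, and the sketch does not compute the p-value.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # covariance and variances, all left unnormalised because n cancels out
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / sqrt(var_x * var_y)

rising = [1.0, 2.0, 3.0, 4.0, 5.0]
print(pearson_r(rising, [2.0, 4.0, 6.0, 8.0, 10.0]))   # close to +1
print(pearson_r(rising, [10.0, 8.0, 6.0, 4.0, 2.0]))   # close to -1
```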

Hence, for our data, ‘temperature’ and ‘C6H6’ seem to have a highly positive correlation. Therefore, we will use ‘C6H6’ as the exogenous variable when modelling ‘temperature’ with SARIMAX.

In [297]:

from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(x, exog = y, order = (2, 0, 2), seasonal_order = (2, 0, 1, 1), enforce_stationarity=False, enforce_invertibility = False)
model_fit = model.fit(disp = False)
c:\users\naveksha\appdata\local\programs\python\python37\lib\site-packages\statsmodels\base\model.py:508:
   ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
   "Check mle_retvals", ConvergenceWarning)

In [298]:

y_ = test_multi['C6H6(GT)'].values
# forecast out-of-sample, one step per test row, using the test-period exogenous values
predicted = model_fit.forecast(steps=len(y_), exog=y_)
test_multi_ = test_multi[['T']].copy()
test_multi_['predictions'] = predicted

In [299]:

plt.plot(train_multi['T'])
plt.plot(test_multi_['T'])
plt.plot(test_multi_.predictions, '--')

Out[299]:

[<matplotlib.lines.Line2D at 0x1eab0191c18>]

Compared with univariate ARIMA modelling, the predictions here now show larger variations.

Needless to say, SARIMAX can be used as an ARX, MAX, ARMAX or ARIMAX model by setting only the corresponding parameters to non-zero values.

Fractional Auto Regressive Integrated Moving Average (FARIMA)

At times, our series may not be stationary, yet differencing with the ‘d’ parameter set to 1 may over-difference it. In such cases, we difference the time series using a fractional value of ‘d’.
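Fractional differencing can be pictured as applying the operator (1 − B)^d with a non-integer d, which expands into an infinite series of weights on past values. The following is a minimal sketch under the assumption that the series is truncated after a fixed number of weights; the function names are illustrative and not part of any library.

```python
def frac_diff_weights(d, n_weights):
    """Weights of the fractional differencing operator (1 - B)^d.

    Uses the recursion w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k,
    truncated after n_weights terms.
    """
    weights = [1.0]
    for k in range(1, n_weights):
        weights.append(-weights[-1] * (d - k + 1) / k)
    return weights

def frac_diff(series, d, n_weights):
    """Apply the truncated operator; the first n_weights - 1 points are skipped."""
    w = frac_diff_weights(d, n_weights)
    out = []
    for t in range(n_weights - 1, len(series)):
        out.append(sum(w[k] * series[t - k] for k in range(n_weights)))
    return out

print(frac_diff_weights(1.0, 4))   # ordinary first difference: 1, -1, then zeros
print(frac_diff_weights(0.5, 4))   # weights decay slowly: 1, -0.5, -0.125, -0.0625
```

With d = 1 only the two most recent values matter, while a fractional d keeps a slowly fading memory of older values, which is why it can remove non-stationarity without discarding as much information.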

In the world of data science there is no single superior model; the model that works best depends greatly on your dataset. Knowledge of various models allows us to choose one that suits our data and to experiment with it to achieve the best results. Results should be judged from plots as well as error metrics: at times even a small error value can hide a poor fit, so plotting and visualizing the results is essential.

In the next chapter, we will be looking at another statistical model, exponential smoothing.

Time Series - Exponential Smoothing

In this chapter, we will talk about the techniques involved in exponential smoothing of time series.

Simple Exponential Smoothing

Exponential smoothing is a technique for smoothing a univariate time series by assigning exponentially decreasing weights to observations as they get older.

Mathematically, the value of the variable at time ‘t+1’ given the values up to time ‘t’, $y_{t+1|t}$, is defined as −

$y_{t+1|t} = \alpha y_{t} + \alpha(1-\alpha) y_{t-1} + \alpha(1-\alpha)^{2} y_{t-2} + \dots + \alpha(1-\alpha)^{t-1} y_{1}$

where $0\leq\alpha\leq1$ is the smoothing parameter, and

$y_{1}, \dots, y_{t}$ are the previous values of the series at times 1, 2, 3, … , t.

This is a simple method to model a time series with no clear trend or seasonality. But exponential smoothing can also be used for time series with trend and seasonality.
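In practice the weighted sum is usually computed recursively: each new level is a blend of the latest observation and the previous level. The following is a minimal sketch of that recursion; initialising the level with the first observation is an assumption of this sketch (one common convention), so the oldest value is weighted slightly differently than in the closed-form sum.

```python
def simple_exp_smoothing(series, alpha):
    """Return the one-step-ahead forecast y_{t+1|t} for a list of observations.

    The level is initialised with the first observation, which is one common
    convention and an assumption of this sketch.
    """
    level = series[0]
    for y in series[1:]:
        # new level = alpha * latest observation + (1 - alpha) * old level
        level = alpha * y + (1 - alpha) * level
    return level

temps = [10.0, 20.0]
print(simple_exp_smoothing(temps, 0.5))   # 15.0, halfway between the two values
print(simple_exp_smoothing(temps, 1.0))   # 20.0, only the latest value counts
```

A value of α close to 1 makes the forecast follow recent observations, while a value close to 0 gives a smoother forecast that reacts slowly.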

Triple Exponential Smoothing

Triple Exponential Smoothing (TES), or the Holt-Winters method, applies exponential smoothing three times: level smoothing $l_{t}$, trend smoothing $b_{t}$, and seasonal smoothing $S_{t}$, with $\alpha$, $\beta^{*}$ and $\gamma$ as the smoothing parameters and ‘m’ as the frequency of the seasonality, i.e. the number of seasons in a year.

According to the nature of the seasonal component, TES has two categories −

  1. Holt-Winters Additive Method − When the seasonality is additive in nature.

  2. Holt-Winters Multiplicative Method − When the seasonality is multiplicative in nature.

For non-seasonal time series, we only have trend smoothing and level smoothing, which is called Holt’s Linear Trend Method.
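Holt’s linear trend method can be sketched in a few lines: one recursion updates the level, a second one updates the trend, and forecasts extrapolate along the last trend. In this sketch the initial level and trend are taken from the first two observations, which is an assumption made for simplicity; `beta` stands in for the trend smoothing parameter.

```python
def holt_linear(series, alpha, beta, horizon):
    """Holt's linear trend method: smooth a level and a trend, then extrapolate.

    Initial level = first observation and initial trend = first difference;
    both are simple conventions assumed for this sketch.
    """
    level = series[0]
    trend = series[1] - series[0]
    for y in series[1:]:
        previous_level = level
        level = alpha * y + (1 - alpha) * (previous_level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    # h-step-ahead forecast: last level plus h times the last trend
    return [level + h * trend for h in range(1, horizon + 1)]

print(holt_linear([1.0, 2.0, 3.0, 4.0, 5.0], alpha=0.5, beta=0.5, horizon=2))
# a perfectly linear series is extrapolated along the same line: [6.0, 7.0]
```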

Let’s try applying triple exponential smoothing on our data.

In [316]:

from statsmodels.tsa.holtwinters import ExponentialSmoothing

# the trend argument was left blank in the source; 'add' (additive trend) is an assumed choice
model = ExponentialSmoothing(train.values, trend='add')
model_fit = model.fit()

In [322]:

# forecast() extends beyond the training data, one step per test observation
predictions_ = model_fit.forecast(len(test))

In [325]:

plt.plot(test.values)
plt.plot(predictions_)

Out[325]:

[<matplotlib.lines.Line2D at 0x1eab00f1cf8>]

Here, we have trained the model once on the training set and then kept on making predictions. A more realistic approach is to re-train the model after one or more time steps. Once we have the prediction for time ‘t+1’ from training data up to time ‘t’, the next prediction for time ‘t+2’ can be made using training data up to time ‘t+1’, as the actual value at ‘t+1’ will be known by then. This methodology of making predictions for one or more future steps and then re-training the model is called rolling forecast or walk-forward validation.

Time Series - Walk Forward Validation

In time series modelling, predictions become less and less accurate the further ahead they reach, so a more realistic approach is to re-train the model with actual data as it becomes available. Since training statistical models is not time consuming, walk-forward validation is the preferred way to get the most accurate results.

Let us apply one step walk forward validation on our data and compare it with the results we got earlier.

In [333]:

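prediction = []
data = train.values
for t in test.values:
   # re-fit on all data seen so far, then forecast one step ahead
   model = ExponentialSmoothing(data).fit()
   y = model.forecast(1)
   prediction.append(y[0])
   # the actual value becomes part of the training data for the next step
   data = numpy.append(data, t)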

In [335]:

test_ = pandas.DataFrame(test)
test_['predictionswf'] = prediction

In [341]:

plt.plot(test_['T'])
plt.plot(test_.predictionswf, '--')
plt.show()

In [340]:

error = sqrt(metrics.mean_squared_error(test.values,prediction))
print ('Test RMSE for Triple Exponential Smoothing with Walk-Forward Validation: ', error)
Test RMSE for Triple Exponential Smoothing with Walk-Forward Validation:  11.787532205759442

We can see that our model performs significantly better now. In fact, the trend is followed so closely that on the plot predictions are overlapping with the actual values. You can try applying walk-forward validation on ARIMA models too.
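The walk-forward loop does not depend on the model being used. The following is a minimal, library-free sketch of the same idea; `walk_forward` and `naive_last_value` are hypothetical names, and the naive forecaster merely stands in for a real model such as ARIMA or exponential smoothing.

```python
from math import sqrt

def walk_forward(train, test, fit_and_forecast):
    """One-step walk-forward validation.

    fit_and_forecast is any callable that takes the history so far and
    returns a forecast for the next step.
    """
    history = list(train)
    predictions = []
    for actual in test:
        predictions.append(fit_and_forecast(history))
        history.append(actual)   # the true value is known before the next step
    return predictions

def naive_last_value(history):
    """Placeholder model: predict that the next value equals the last one."""
    return history[-1]

train_part, test_part = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
preds = walk_forward(train_part, test_part, naive_last_value)
rmse = sqrt(sum((p - a) ** 2 for p, a in zip(preds, test_part)) / len(test_part))
print(preds, rmse)   # [3.0, 4.0, 5.0] 1.0
```

Replacing `naive_last_value` with a function that fits a model on `history` and returns its one-step forecast gives walk-forward validation for that model.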

Time Series - Prophet Model

In 2017, Facebook open-sourced the Prophet model, which is capable of modelling time series with trend and with multiple strong seasonalities at the day, week and year level. It has intuitive parameters that a not-so-expert data scientist can tune for better forecasts. At its core, it is an additive regression model that can detect change points in the time series.

Prophet decomposes the time series into components of trend $g_{t}$, seasonality $s_{t}$ and holidays $h_{t}$.

$y_{t}=g_{t}+s_{t}+h_{t}+\epsilon_{t}$

where $\epsilon_{t}$ is the error term.

Similar R packages for time series analysis, such as CausalImpact and AnomalyDetection, were introduced by Google and Twitter respectively.

Time Series - LSTM Model

Now that we are familiar with statistical modelling of time series, and since machine learning is all the rage right now, it is essential to be familiar with some machine learning models as well. We shall start with the most popular model in the time series domain − the Long Short-Term Memory (LSTM) model.

LSTM is a class of recurrent neural network. So before we can jump to LSTM, it is essential to understand neural networks and recurrent neural networks.

Neural Networks

An artificial neural network is a layered structure of connected neurons, inspired by biological neural networks. It is not one algorithm but a combination of various algorithms, which allows us to do complex operations on data.

Recurrent Neural Networks

It is a class of neural networks tailored to deal with temporal data. The neurons of an RNN have a cell state/memory, and input is processed according to this internal state, which is achieved with the help of loops within the neural network. RNNs have recurring module(s) of ‘tanh’ layers that allow them to retain information, but not for long, which is why we need LSTM models.
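The loop that carries the internal state forward can be illustrated with a single recurrent unit. This is a toy sketch with hand-picked scalar weights (`w_x`, `w_h` and `b` are assumptions chosen for illustration); real networks learn these values and use vectors and matrices.

```python
from math import tanh

def rnn_step(x, h_prev, w_x, w_h, b):
    """One step of a single-unit recurrent cell: h_t = tanh(w_x*x_t + w_h*h_{t-1} + b)."""
    return tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Feed a sequence through the cell, carrying the hidden state forward."""
    h = 0.0
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# a single pulse followed by zeros: the state decays but does not vanish at once
print(run_rnn([1.0, 0.0, 0.0, 0.0]))
```

The hidden state still remembers the first input a few steps later, but its influence shrinks at every step, which is the short-memory behaviour described above.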

LSTM

It is a special kind of recurrent neural network that is capable of learning long-term dependencies in data. This is achieved because the recurring module of the model has a combination of four layers interacting with each other.

[Figure: the recurring module of an LSTM network]

The picture above depicts four neural network layers in yellow boxes, point-wise operators in green circles, input in yellow circles and cell state in blue circles. An LSTM module has a cell state and three gates, which give it the power to selectively learn, unlearn or retain information from each of the units. The cell state helps information flow through the units without being altered, by allowing only a few linear interactions. Each unit has an input gate, an output gate and a forget gate, which can add information to or remove it from the cell state. The forget gate uses a sigmoid function to decide which information from the previous cell state should be forgotten. The input gate controls the information flow into the current cell state using a point-wise multiplication of a ‘sigmoid’ and a ‘tanh’ layer. Finally, the output gate decides which information should be passed on to the next hidden state.
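The interplay of the gates is easier to follow in code. The following is a minimal sketch of one step of a single-unit LSTM cell using the standard gate equations with scalar weights; the `params` layout is an assumption of this sketch, not the Keras implementation.

```python
from math import exp, tanh

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a single-unit LSTM cell with scalar weights.

    params maps each of 'f', 'i', 'g', 'o' to a (w_x, w_h, b) tuple.
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre('f'))        # forget gate: how much of c_prev to keep
    i = sigmoid(pre('i'))        # input gate: how much new information to admit
    g = tanh(pre('g'))           # candidate values for the cell state
    o = sigmoid(pre('o'))        # output gate: how much of the cell to expose
    c = f * c_prev + i * g       # updated cell state
    h = o * tanh(c)              # updated hidden state
    return h, c

zero = {name: (0.0, 0.0, 0.0) for name in 'figo'}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, params=zero)
print(h, c)   # with all-zero weights every gate is 0.5, so c = 1.0 and h = 0.5 * tanh(1.0)
```

Because the cell state is only scaled by the forget gate and shifted by the gated candidate, information can pass through many steps largely unchanged when the forget gate stays close to 1.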

Now that we have understood the internal working of the LSTM model, let us implement it. To understand the implementation, we will start with a simple example − a straight line. Let us see if an LSTM can learn the relationship of a straight line and predict it.

First let us create the dataset depicting a straight line.

In [402]:

x = numpy.arange (1,500,1)
y = 0.4 * x + 30
plt.plot(x,y)

Out[402]:

[<matplotlib.lines.Line2D at 0x1eab9d3ee10>]

In [403]:

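trainx, testx = x[0:int(0.8*(len(x)))], x[int(0.8*(len(x))):]
trainy, testy = y[0:int(0.8*(len(y)))], y[int(0.8*(len(y))):]
train = numpy.array(list(zip(trainx,trainy)))
# NOTE: test is built from trainx/trainy, not testx/testy, so the predictions below
# are made on data the model has already seen; later cells rely on the resulting 398 rows
test = numpy.array(list(zip(trainx,trainy)))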

Now that the data has been created and split into train and test sets, let’s convert the time series data into supervised learning data according to the value of the look-back period, which is essentially the number of lags used to predict the value at time ‘t’.

So a time series like this −

time variable_x
t1  x1
t2  x2
 :   :
 :   :
T   xT

With a look-back period of 1, it is converted to −

x1   x2
x2   x3
 :    :
 :    :
xT-1 xT

In [404]:

def create_dataset(n_X, look_back):
   dataX, dataY = [], []
   for i in range(len(n_X)-look_back):
      a = n_X[i:(i+look_back), ]
      dataX.append(a)
      dataY.append(n_X[i + look_back, ])
   return numpy.array(dataX), numpy.array(dataY)

In [405]:

look_back = 1
trainx,trainy = create_dataset(train, look_back)
testx,testy = create_dataset(test, look_back)

trainx = numpy.reshape(trainx, (trainx.shape[0], 1, 2))
testx = numpy.reshape(testx, (testx.shape[0], 1, 2))

Now we will train our model.

Small batches of training data are shown to the network; one pass in which the entire training data has been shown to the model in batches and the error calculated is called an epoch. Epochs are run for as long as the error keeps reducing.
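The relation between samples, batches and epochs is simple arithmetic. The following sketch uses the 398 training windows and the batch size of 10 that appear in this chapter’s code; the `batches` helper is hypothetical and only counts batches rather than training anything, and the 3 epochs are chosen for brevity.

```python
from math import ceil

def batches(n_samples, batch_size):
    """Yield (start, stop) index pairs covering the data in order, one per batch."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

n_samples, batch_size, epochs = 398, 10, 3
for epoch in range(epochs):
    # one epoch = every batch shown to the model once
    n_updates = sum(1 for _ in batches(n_samples, batch_size))
    print('epoch', epoch + 1, 'weight updates:', n_updates)

print(ceil(n_samples / batch_size))   # 40 batches per epoch, the last one holding 8 samples
```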

In [ ]:

from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(256, return_sequences = True, input_shape = (trainx.shape[1], 2)))
model.add(LSTM(128,input_shape = (trainx.shape[1], 2)))
model.add(Dense(2))
model.compile(loss = 'mean_squared_error', optimizer = 'adam')
model.fit(trainx, trainy, epochs = 2000, batch_size = 10, verbose = 2, shuffle = False)
model.save_weights('LSTMBasic1.h5')

In [407]:

model.load_weights('LSTMBasic1.h5')
predict = model.predict(testx)

Now let’s see what our predictions look like.

In [408]:

plt.plot(testx.reshape(398,2)[:,0:1], testx.reshape(398,2)[:,1:2])
plt.plot(predict[:,0:1], predict[:,1:2])

Out[408]:

[<matplotlib.lines.Line2D at 0x1eac792f048>]

Now, we should try and model a sine or cosine wave in a similar fashion. You can run the code given below and play with the model parameters to see how the results change.

In [409]:

x = numpy.arange (1,500,1)
y = numpy.sin(x)
plt.plot(x,y)

Out[409]:

[<matplotlib.lines.Line2D at 0x1eac7a0b3c8>]

In [410]:

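trainx, testx = x[0:int(0.8*(len(x)))], x[int(0.8*(len(x))):]
trainy, testy = y[0:int(0.8*(len(y)))], y[int(0.8*(len(y))):]
train = numpy.array(list(zip(trainx,trainy)))
# NOTE: as in the straight-line example, test is built from trainx/trainy,
# so the predictions below are made on data the model has already seen
test = numpy.array(list(zip(trainx,trainy)))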

In [411]:

look_back = 1
trainx,trainy = create_dataset(train, look_back)
testx,testy = create_dataset(test, look_back)
trainx = numpy.reshape(trainx, (trainx.shape[0], 1, 2))
testx = numpy.reshape(testx, (testx.shape[0], 1, 2))

In [ ]:

model = Sequential()
model.add(LSTM(512, return_sequences = True, input_shape = (trainx.shape[1], 2)))
model.add(LSTM(256,input_shape = (trainx.shape[1], 2)))
model.add(Dense(2))
model.compile(loss = 'mean_squared_error', optimizer = 'adam')
model.fit(trainx, trainy, epochs = 2000, batch_size = 10, verbose = 2, shuffle = False)
model.save_weights('LSTMBasic2.h5')

In [413]:

model.load_weights('LSTMBasic2.h5')
predict = model.predict(testx)

In [415]:

plt.plot(trainx.reshape(398,2)[:,0:1], trainx.reshape(398,2)[:,1:2])
plt.plot(predict[:,0:1], predict[:,1:2])

Out [415]:

[<matplotlib.lines.Line2D at 0x1eac7a1f550>]

Now you are ready to move on to any dataset.

Time Series - Error Metrics

It is important to quantify the performance of a model so that the result can be used for feedback and comparison. In this tutorial we have used one of the most popular error metrics, root mean squared error. Various other error metrics are available; this chapter discusses them in brief.

Mean Square Error

It is the average of the squared differences between the predicted values and the true values. Sklearn provides it as a function. Its units are the square of the units of the true and predicted values, and it is always non-negative.

$MSE = \frac{1}{n}\sum\limits_{t=1}^{n}\left(y'_{t}-y_{t}\right)^{2}$

where $y'_{t}$ is the predicted value,

$y_{t}$ is the actual value, and

n is the total number of values in the test set.

It is clear from the equation that MSE penalizes larger errors, or outliers, more heavily.

Root Mean Square Error

It is the square root of the mean square error. It is also always non-negative and has the same units as the data.

$RMSE = \sqrt{\frac{1}{n}\sum\limits_{t=1}^{n}\left(y'_{t}-y_{t}\right)^{2}}$

where $y'_{t}$ is the predicted value,

$y_{t}$ is the actual value, and

n is the total number of values in the test set.

Because it is expressed in the same units as the data rather than their square, it is more interpretable than MSE. RMSE also penalizes larger errors more heavily. We have used the RMSE metric in this tutorial.
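Both metrics follow directly from their definitions. The following is a minimal pure-Python sketch on made-up numbers; the function names are illustrative and are not the sklearn API.

```python
from math import sqrt

def mse(actual, predicted):
    """Mean of the squared differences."""
    return sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Square root of the MSE, in the same units as the data."""
    return sqrt(mse(actual, predicted))

actual = [100.0, 200.0]
predicted = [110.0, 180.0]
print(mse(actual, predicted))    # (10**2 + 20**2) / 2 = 250.0
print(rmse(actual, predicted))   # about 15.81
```

The error of 20 contributes four times as much to the MSE as the error of 10, which is the heavier penalty on large errors mentioned above.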

Mean Absolute Error

It is the average of the absolute differences between the predicted values and the true values. It has the same units as the predicted and true values and is always non-negative.

$MAE = \frac{1}{n}\sum\limits_{t=1}^{n}\left|y'_{t}-y_{t}\right|$

where $y'_{t}$ is the predicted value,

$y_{t}$ is the actual value, and

n is the total number of values in the test set.

Mean Percentage Error

It is the average of the signed differences between the predicted values and the true values, each divided by the true value, expressed as a percentage.

$MPE = \frac{1}{n}\sum\limits_{t=1}^{n}\frac{y'_{t}-y_{t}}{y_{t}}\times100\:\%$

where $y'_{t}$ is the predicted value,

$y_{t}$ is the actual value, and n is the total number of values in the test set.

However, the disadvantage of this metric is that positive and negative errors can offset each other. Hence the mean absolute percentage error is used instead.

Mean Absolute Percentage Error

It is the average of the absolute differences between the predicted values and the true values, each divided by the true value, expressed as a percentage.

$MAPE = \frac{1}{n}\sum\limits_{t=1}^{n}\frac{\left|y'_{t}-y_{t}\right|}{y_{t}}\times100\:\%$

where $y'_{t}$ is the predicted value,

$y_{t}$ is the actual value, and

n is the total number of values in the test set.
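The difference between the signed and the absolute percentage error shows up clearly on a small example. The following is a minimal sketch with made-up values, following the formulas in this chapter; it assumes the actual values are non-zero, and the function names are illustrative.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)

def mpe(actual, predicted):
    """Mean percentage error (signed)."""
    return sum((p - a) / a for a, p in zip(actual, predicted)) / len(actual) * 100

def mape(actual, predicted):
    """Mean absolute percentage error."""
    return sum(abs(p - a) / a for a, p in zip(actual, predicted)) / len(actual) * 100

actual = [100.0, 200.0]
predicted = [110.0, 180.0]       # one over-forecast of 10 %, one under-forecast of 10 %
print(mae(actual, predicted))    # 15.0
print(mpe(actual, predicted))    # about 0: the two percentage errors cancel
print(mape(actual, predicted))   # about 10
```

An MPE near zero here does not mean the forecasts are accurate; it only means that over- and under-forecasts balance out, which is why MAPE is the safer summary.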

Time Series - Applications

We discussed time series analysis in this tutorial, which has given us the understanding that time series models first recognize the trend and seasonality from the existing observations and then forecast a value based on this trend and seasonality. Such analysis is useful in various fields such as −

  1. Financial Analysis − It includes sales forecasting, inventory analysis, stock market analysis, price estimation.

  2. Weather Analysis − It includes temperature estimation, climate change, seasonal shift recognition, weather forecasting.

  3. Network Data Analysis − It includes network usage prediction, anomaly or intrusion detection, predictive maintenance.

  4. Healthcare Analysis − It includes census prediction, insurance benefits prediction, patient monitoring.

Time Series - Further Scope

Machine learning deals with various kinds of problems. In fact, almost every field has scope to be automated or improved with the help of machine learning. A few such problems on which a great deal of work is being done are given below.

Time Series Data

This is data that changes over time, so time plays a crucial role in it; it is what we largely discussed in this tutorial.

Non-Time Series Data

It is data independent of time, and a major share of ML problems involve non-time series data. For simplicity, we shall categorize it further as −

  1. Numerical Data − Computers, unlike humans, only understand numbers, so all kinds of data are ultimately converted to numerical data for machine learning. For example, image data is converted to (r,g,b) values, characters are converted to ASCII codes or words are indexed to numbers, and speech data is converted to MFCC features containing numerical data.

  2. Image Data − Computer vision has revolutionized the world of computers; it has various applications in fields such as medicine and satellite imaging.

  3. Text Data − Natural Language Processing (NLP) is used for text classification, paraphrase detection and text summarization. This is what makes Google and Facebook smart.

  4. Speech Data − Speech processing involves speech recognition and sentiment understanding. It plays a crucial role in giving computers human-like qualities.