Big Data Analytics Tutorial

Big Data Analytics - Time Series Analysis

A time series is a sequence of observations of a categorical or numeric variable indexed by a date or timestamp. A clear example of time series data is the price of a stock over time. The following table shows the basic structure of time series data. In this case, the observations are recorded every hour.

Timestamp             Stock Price
2015-10-11 09:00:00   100
2015-10-11 10:00:00   110
2015-10-11 11:00:00   105
2015-10-11 12:00:00   90
2015-10-11 13:00:00   120

Normally, the first step in time series analysis is to plot the series, usually as a line chart.
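As a minimal sketch of this first step, the five hourly observations from the table can be loaded into a data frame and drawn as a line chart with base R. The data frame name `stock` and the `UTC` time zone are assumptions made for illustration only.

```r
# Hourly observations copied from the table above
stock <- data.frame(
  timestamp = as.POSIXct(
    c("2015-10-11 09:00:00", "2015-10-11 10:00:00", "2015-10-11 11:00:00",
      "2015-10-11 12:00:00", "2015-10-11 13:00:00"),
    tz = "UTC"  # assumed time zone, the table does not state one
  ),
  price = c(100, 110, 105, 90, 120)
)

# Line chart of price against time
plot(stock$timestamp, stock$price, type = "l",
     xlab = "Timestamp", ylab = "Stock price")

# Hour-to-hour changes in price
price_change <- diff(stock$price)
price_change
```

The hour-to-hour changes are often as informative as the raw level, since they show how far each observation moves from the previous one.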

The most common application of time series analysis is forecasting future values of a numeric variable using the temporal structure of the data. In other words, the available observations are used to predict values that have not yet been observed.

Because the observations are ordered in time and are typically correlated with their own past, traditional regression methods, which treat observations as independent, are of limited use here. To build robust forecasts, we need models that take the temporal ordering of the data into account.
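The following sketch illustrates why ordering matters. It uses a simulated random walk (an assumption, not data from this tutorial) and compares the lag-1 autocorrelation of the series in its original order with the same values shuffled at random.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Simulated random walk: each value depends strongly on the previous one
x <- cumsum(rnorm(500))

# Lag-1 autocorrelation of the series in its original order
r_ordered <- acf(x, lag.max = 1, plot = FALSE)$acf[2]

# Same values in random order: the temporal structure is destroyed
r_shuffled <- acf(sample(x), lag.max = 1, plot = FALSE)$acf[2]

c(ordered = r_ordered, shuffled = r_shuffled)
```

The ordered series should show a lag-1 autocorrelation close to 1, while the shuffled one should be close to 0. A method that ignores ordering sees only the shuffled version of the data.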

A widely used model for time series analysis is the Autoregressive Moving Average (ARMA) model. It consists of two parts, an autoregressive (AR) part and a moving average (MA) part. The model is usually referred to as ARMA(p, q), where p is the order of the autoregressive part and q is the order of the moving average part.

Autoregressive Model

AR(p) denotes an autoregressive model of order p. Mathematically, it is written as:

X_t = c + \sum_{i = 1}^{p} \phi_i X_{t - i} + \varepsilon_t

where φ1, …, φp are the parameters to be estimated, c is a constant, and the random variable εt is white noise. The parameter values must satisfy certain constraints for the model to be stationary; for an AR(1) model, for example, this requires |φ1| < 1.
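To make the definition concrete, here is a minimal sketch that simulates an AR(2) process directly from the recursion, checks stationarity through the roots of the AR polynomial, and then estimates the parameters back with base R's `arima`. The parameter values (`c0 = 2`, `phi = c(0.5, 0.3)`) and the sample size are hypothetical choices for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical AR(2) parameters chosen for illustration
c0  <- 2
phi <- c(0.5, 0.3)

# Stationarity check: all roots of 1 - phi_1 z - phi_2 z^2
# must lie outside the unit circle
roots <- polyroot(c(1, -phi))
is_stationary <- all(Mod(roots) > 1)

# Simulate X_t = c + phi_1 X_{t-1} + phi_2 X_{t-2} + eps_t
n   <- 2000
eps <- rnorm(n)
x   <- numeric(n)
x[1:2] <- c0 / (1 - sum(phi))  # start at the process mean
for (k in 3:n) {
  x[k] <- c0 + phi[1] * x[k - 1] + phi[2] * x[k - 2] + eps[k]
}

# Estimate the parameters from the simulated data
fit_ar <- arima(x, order = c(2, 0, 0))
coef(fit_ar)
```

Note that `arima` reports the process mean under the name `intercept`, not the constant c. For a stationary AR(p) the two are related by mean = c / (1 − φ1 − … − φp), which is 10 for the values assumed here.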

Moving Average

The notation MA(q) refers to the moving average model of order q:

X_t = \mu + \varepsilon_t + \sum_{i = 1}^{q} \theta_i \varepsilon_{t - i}

where θ1, …, θq are the parameters of the model, μ is the expectation of Xt, and εt, εt−1, … are white noise error terms.
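As an illustrative sketch, an MA(1) process can be simulated straight from this formula. A characteristic property of an MA(q) process is that its autocorrelation is zero beyond lag q, and for MA(1) the lag-1 autocorrelation equals θ1 / (1 + θ1²). The values `mu = 5` and `theta = 0.6` below are hypothetical.

```r
set.seed(2)  # arbitrary seed, only for reproducibility

# Hypothetical MA(1) parameters chosen for illustration
mu    <- 5
theta <- 0.6

# Simulate X_t = mu + eps_t + theta * eps_{t-1}
n   <- 5000
eps <- rnorm(n + 1)
x   <- mu + eps[-1] + theta * eps[-(n + 1)]

# Sample autocorrelations at lags 1, 2 and 3
r <- acf(x, lag.max = 3, plot = FALSE)$acf[2:4]

# Theoretical lag-1 autocorrelation of an MA(1)
rho1 <- theta / (1 + theta^2)

round(c(theoretical_lag1 = rho1, sample = r), 3)
```

The sample autocorrelation at lag 1 should be close to the theoretical value, while lags 2 and 3 should be close to zero. This cut-off is what is commonly used to guess q from an autocorrelation plot.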

Autoregressive Moving Average

The ARMA(p, q) model combines p autoregressive terms and q moving average terms. Mathematically, the model is expressed as:

X_t = c + \varepsilon_t + \sum_{i = 1}^{p} \phi_i X_{t - i} + \sum_{i = 1}^{q} \theta_i \varepsilon_{t - i}

We can see that the ARMA(p, q) model is a combination of the AR(p) and MA(q) models.

For some intuition, note that the AR part of the equation estimates a weight for each past observation Xt−i in order to predict Xt; it is, in effect, a weighted combination of past values. The MA part takes the same approach but applies it to the errors of previous observations, εt−i. The model's prediction is therefore a weighted combination of past values and past errors.
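This intuition can be checked with a small worked example: a one-step-ahead forecast from an ARMA(1, 1) model computed by hand. All numbers below are hypothetical and are not estimated from any data.

```r
# Hypothetical ARMA(1, 1) parameters (not estimated from any data)
c0     <- 50
phi1   <- 0.5
theta1 <- 0.3

# Hypothetical most recent values
x_prev   <- 105  # X_{t-1}, the last observed value
eps_prev <- 2    # eps_{t-1}, the last one-step forecast error

# The future shock eps_t has expectation zero, so it drops out of the forecast
x_hat <- c0 + phi1 * x_prev + theta1 * eps_prev
x_hat
```

Here the forecast is 50 + 0.5 × 105 + 0.3 × 2 = 103.1. The last observation contributes through φ1 and the last error through θ1, which is exactly the weighting described above.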

The following code snippets demonstrate how to fit an ARMA(p, q) model in R. The first one loads the data as a monthly time series and plots it.

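# install.packages("forecast")
library("forecast")

# Read the raw numeric observations from a local file
data <- scan("fancy.dat")

# Build a monthly time series (frequency = 12) starting in January 1987
ts_data <- ts(data, frequency = 12, start = c(1987, 1))
ts_data

# Plot the series as a line chart
plot.ts(ts_data)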

Plotting the data is normally the first step in finding out whether it has a temporal structure. The plot shows strong spikes at the end of each year.

time series plot

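The following code fits a model to the data with auto.arima, which tries several candidate ARIMA specifications and selects the one with the lowest information criterion (AICc by default). Because the search also covers differencing and seasonal terms, the model selected here is a seasonal ARIMA, a generalization of ARMA.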

# Fit the ARMA model
fit = auto.arima(ts_data)
summary(fit)

# Series: ts_data
# ARIMA(1,1,1)(0,1,1)[12]
#
# Coefficients:
#          ar1      ma1    sma1
#       0.2401  -0.9013  0.7499
# s.e.  0.1427   0.0709  0.1790
#
# sigma^2 estimated as 15464184:  log likelihood = -693.69
# AIC = 1395.38   AICc = 1395.98   BIC = 1404.43

# Training set error measures:
#                 ME        RMSE      MAE        MPE        MAPE      MASE       ACF1
# Training set   328.301  3615.374  2171.002  -2.481166  15.97302  0.4905797 -0.02521172
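
Once a model has been fitted, the usual next step is to forecast future values. The sketch below shows one way to do this with base R's `arima` and `predict`. Since `fancy.dat` is not reproduced here, it uses the built-in `AirPassengers` dataset as a stand-in, and the model order is fixed by hand rather than chosen by `auto.arima`; both are assumptions made only so the example is self-contained.

```r
# AirPassengers ships with base R (datasets package); it stands in for fancy.dat
y <- log(AirPassengers)

# Seasonal ARIMA(0,1,1)(0,1,1)[12], fixed by hand rather than selected automatically
fit_air <- arima(y, order = c(0, 1, 1),
                 seasonal = list(order = c(0, 1, 1), period = 12))

# Forecast the next 12 months; predict() returns point forecasts and standard errors
fc <- predict(fit_air, n.ahead = 12)

# Approximate 95% interval, transformed back to the original scale
forecast_table <- data.frame(
  point = as.numeric(exp(fc$pred)),
  lower = as.numeric(exp(fc$pred - 1.96 * fc$se)),
  upper = as.numeric(exp(fc$pred + 1.96 * fc$se))
)
forecast_table
```

With the `forecast` package loaded, `forecast(fit, h = 12)` applied to the `auto.arima` fit above should give comparable point forecasts and intervals for the original series.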