Artificial Intelligence With Python: A Concise Tutorial

AI with Python – Analyzing Time Series Data

Predicting the next item in a given input sequence is another important concept in machine learning. This chapter explains in detail how to analyze time series data.

Introduction

Time series data is data recorded at a series of particular time intervals. To build sequence prediction models in machine learning, we have to deal with sequential data and with time. Time series data is one form of sequential data, and the ordering of the observations is the defining feature of sequential data.

Basic Concept of Sequence Analysis or Time Series Analysis

Sequence analysis, or time series analysis, predicts the next item in a given input sequence based on what has been observed so far. The prediction can be anything that may come next: a symbol, a number, the next day's weather, the next term in a speech, and so on. Sequence analysis is very useful in applications such as stock market analysis, weather forecasting, and product recommendation.

Example

Consider the following example to understand sequence prediction. Here A, B, C and D are the given values, and you have to predict the value E using a sequence prediction model.

[Figure: sequence prediction model]
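As a minimal sketch of the idea, the following code counts which symbol follows which in a training sequence and predicts the most frequent successor. The helper names fit_transitions and predict_next and the repeating training history are assumptions made for illustration. This is one very simple way to predict a sequence, not the specific model the figure refers to.

```python
from collections import Counter, defaultdict

def fit_transitions(sequence):
    """Count how often each symbol is followed by each other symbol."""
    counts = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    return counts

def predict_next(counts, last_symbol):
    """Return the most frequent successor of last_symbol, or None if unseen."""
    if last_symbol not in counts:
        return None
    return counts[last_symbol].most_common(1)[0][0]

# Hypothetical training history in which the pattern A, B, C, D, E repeats.
history = list("ABCDEABCDEABCDE")
model = fit_transitions(history)

# Having observed A, B, C, D, predict what follows D.
print(predict_next(model, "D"))  # E
```

A predictor like this looks only at the last observed symbol. The Hidden Markov Model discussed later in the chapter makes a similar "only the current state matters" assumption, but applies it to hidden states rather than to the observed symbols.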

Installing Useful Packages

To analyze time series data with Python, we need to install the following packages:

Pandas

Pandas is an open source, BSD-licensed library that provides high-performance, easy-to-use data structures and data analysis tools for Python. You can install Pandas with the following command:

pip install pandas

If you are using Anaconda and want to install through the conda package manager, use the following command:

conda install -c anaconda pandas

hmmlearn

hmmlearn is an open source, BSD-licensed library that provides simple algorithms and models for learning Hidden Markov Models (HMM) in Python. You can install it with the following command:

pip install hmmlearn

If you are using Anaconda and want to install through the conda package manager, use the following command:

conda install -c omnia hmmlearn

PyStruct

PyStruct is a structured learning and prediction library. The learning algorithms it implements go by names such as conditional random fields (CRF), Maximum-Margin Markov Random Networks (M3N) and structural support vector machines. You can install it with the following command:

pip install pystruct

CVXOPT

CVXOPT is a free software package for convex optimization, based on the Python programming language. You can install it with the following command:

pip install cvxopt

If you are using Anaconda and want to install through the conda package manager, use the following command:

conda install -c anaconda cvxopt

Pandas: Handling, Slicing and Extracting Statistics from Time Series Data

Pandas is a very useful tool when you have to work with time series data. With Pandas, you can do the following:

  1. Create a range of dates by using the pd.date_range function

  2. Index data with dates by building a pd.Series object

  3. Perform re-sampling by using the resample method of the series

  4. Change the frequency
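The four operations listed above can be sketched on a small synthetic series. The dates, the placeholder values 0 to 23 and the frequency aliases "MS" (month start) and "QS" (quarter start) are assumptions chosen for this illustration. They are not taken from the example data set used later in this chapter.

```python
import numpy as np
import pandas as pd

# 1. Create a range of dates: 24 month-start timestamps from January 2000.
dates = pd.date_range("2000-01-01", periods=24, freq="MS")

# 2. Index a Series with those dates (the values 0..23 are placeholder data).
ts = pd.Series(np.arange(24, dtype=float), index=dates)

# 3. Re-sample to a coarser frequency: one mean value per quarter.
quarterly = ts.resample("QS").mean()

# 4. Change the frequency without aggregating: keep every third month.
every_third = ts.asfreq("3MS")

print(quarterly.head())
print(every_third.head())
```

Note the difference between steps 3 and 4: resample groups the observations and needs an aggregation such as mean(), whereas asfreq simply picks the values found at the new timestamps.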

Example

The following example shows how to handle and slice time series data using Pandas. Note that we are using the Monthly Arctic Oscillation (AO) data here, which can be downloaded from the link monthly.ao.index.b50.current.ascii and converted to text format for our use.

Handling time series data

To handle time series data, perform the following steps.

The first step is to import the following packages:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Next, define a function that reads the data from the input file, as shown in the code below:

def read_data(input_file, index = 2):
   # Load the whitespace-delimited file; column 2 is assumed to hold the AO value
   # (with columns 0 and 1 holding the year and the month)
   input_data = np.loadtxt(input_file, delimiter = None)

Now convert this data to a time series. To do so, first create the range of dates for the series. The data is monthly, so the frequency is one month, and the file starts in January 1950.

   # 'M' labels each observation with its month end; pandas 2.2 and later prefer 'ME'
   dates = pd.date_range('1950-01', periods = input_data.shape[0], freq = 'M')

Next, build the time series as a Pandas Series indexed by those dates, and return it:

   output = pd.Series(input_data[:, index], index = dates)
   return output

if __name__=='__main__':

Set the path of the input file, as shown here:

   input_file = "/Users/admin/AO.txt"

Now convert the data column to a time series, as shown here:

   timeseries = read_data(input_file)

Finally, plot and visualize the data with the following commands:

   plt.figure()
   timeseries.plot()
   plt.show()

You will see plots like the ones in the following images:

[Figure: time series plots]
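The steps above can be tried end to end without downloading the AO file. The sketch below writes a few made-up rows to a temporary file, assuming a three-column layout of year, month and value, and then reads them back. The sample values, the default index = 2 and the use of month-start labels ("MS") are assumptions made for this illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def read_data(input_file, index=2):
    """Load a whitespace-delimited file and return one column as a monthly Series."""
    input_data = np.loadtxt(input_file)
    # One timestamp per row, starting in January 1950 (month-start labels).
    dates = pd.date_range("1950-01", periods=input_data.shape[0], freq="MS")
    return pd.Series(input_data[:, index], index=dates)

# Made-up rows in an assumed "year month value" layout.
rows = ["1950 1 -0.06", "1950 2 0.63", "1950 3 -0.01", "1950 4 0.56"]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("\n".join(rows))
    path = handle.name

timeseries = read_data(path)
os.remove(path)  # clean up the temporary file

print(timeseries)
```

Calling timeseries.plot() followed by plt.show() on this small series would draw it in the same way as the full data set.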

Slicing time series data

Slicing means retrieving only part of the time series data. In this example, we slice out the data from 1980 to 1990. The following code performs this task:

timeseries['1980':'1990'].plot()
plt.show()

When you run the slicing code, you will see the graph shown in the following image:

[Figure: slicing time series data]
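One detail of date-based slicing is easy to miss: when you slice a date-indexed series with year strings, both end points are included in full. The following sketch illustrates this on a synthetic monthly series. The values are placeholders, and .loc is used here as the explicit label-based form of the same slice.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series covering January 1978 through December 1992.
dates = pd.date_range("1978-01-01", periods=180, freq="MS")
ts = pd.Series(np.arange(180, dtype=float), index=dates)

# Slice by year labels; the whole of 1980 and the whole of 1990 are included.
decade = ts.loc["1980":"1990"]

print(decade.index[0], decade.index[-1])
print(len(decade))  # 132 months, that is 11 full years
```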

Extracting Statistics from Time Series Data

When you need to draw conclusions from the data, you will have to extract some statistics from it. Mean, variance, correlation, maximum value and minimum value are some of these statistics. You can use the following code to extract them from a given time series:

Mean

You can use the mean() function to find the mean, as shown here:

timeseries.mean()

For the example discussed, the output is:

-0.11143128165238671

Maximum

You can use the max() function to find the maximum, as shown here:

timeseries.max()

For the example discussed, the output is:

3.4952999999999999

Minimum

You can use the min() function to find the minimum, as shown here:

timeseries.min()

For the example discussed, the output is:

-4.2656999999999998

Getting everything at once

If you want to calculate the main summary statistics in one step, you can use the describe() function, as shown here:

timeseries.describe()

For the example discussed, the output is:

count   817.000000
mean     -0.111431
std       1.003151
min      -4.265700
25%      -0.649430
50%      -0.042744
75%       0.475720
max       3.495300
dtype: float64
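Variance and correlation are mentioned above but do not appear in the describe() output. As a minimal sketch, they can be computed as follows. The six-value series and the derived series named other are made-up data for illustration only.

```python
import pandas as pd

dates = pd.date_range("2000-01-01", periods=6, freq="MS")
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=dates)

# Sample variance (pandas divides by n - 1 by default).
print(ts.var())  # 3.5

# Correlation with a second series on the same index; here an exact
# linear function of ts, so the correlation is 1 up to rounding.
other = 2.0 * ts + 1.0
print(ts.corr(other))

# Lag-1 autocorrelation: correlation of the series with itself shifted by one step.
print(ts.autocorr(lag=1))
```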

Re-sampling

You can resample the data to a different time frequency. Re-sampling takes two parameters:

  1. Time period

  2. Method

Re-sampling with mean()

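You can use the following code to resample the data and aggregate each period with the mean() method. In current pandas releases the aggregation method must be called explicitly. Here "A" requests annual (year-end) periods; pandas 2.2 and later prefer the alias "YE".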

timeseries_mm = timeseries.resample("A").mean()
timeseries_mm.plot(style = 'g--')
plt.show()

You will then see the following graph as the output of re-sampling with mean():

[Figure: re-sampling with the mean() method]

Re-sampling with median()

You can use the following code to resample the data using the median() method:

timeseries_mm = timeseries.resample("A").median()
timeseries_mm.plot()
plt.show()

You will then see the following graph as the output of re-sampling with median():

[Figure: re-sampling with the median() method]
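To see how the choice of method changes the result, the sketch below resamples a tiny made-up monthly series to quarters with both mean() and median(). The values, including the deliberately large outlier of 60, and the quarter-start alias "QS" are assumptions chosen to keep the arithmetic easy to check by hand.

```python
import pandas as pd

# Six made-up monthly values, January to June 2001.
dates = pd.date_range("2001-01-01", periods=6, freq="MS")
ts = pd.Series([1.0, 2.0, 9.0, 4.0, 5.0, 60.0], index=dates)

quarterly_mean = ts.resample("QS").mean()
quarterly_median = ts.resample("QS").median()

print(quarterly_mean.tolist())    # [4.0, 23.0]
print(quarterly_median.tolist())  # [2.0, 5.0]
```

The outlier in the second quarter pulls the mean up to 23.0, while the median stays at 5.0. This is the usual reason for preferring median() on noisy data.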

Rolling Mean

You can use the following code to calculate the rolling (moving) mean over a 12-month window:

timeseries.rolling(window = 12, center = False).mean().plot(style = '-g')
plt.show()

You will then see the following graph as the output of the rolling (moving) mean:

[Figure: rolling mean]
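The effect of the window is easier to see on a handful of numbers. The following sketch uses a made-up five-value series and a window of 3. The min_periods variant is an optional extra shown for comparison and is not part of the example above.

```python
import pandas as pd

ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Each value is the mean of the current observation and the two before it.
# The first two positions have no full window, so they are NaN.
rolled = ts.rolling(window=3, center=False).mean()
print(rolled.tolist())  # [nan, nan, 2.0, 3.0, 4.0]

# With min_periods=1, partial windows at the start are averaged as well.
partial = ts.rolling(window=3, min_periods=1).mean()
print(partial.tolist())  # [1.0, 1.5, 2.0, 3.0, 4.0]
```

By the same logic, the first 11 points of the 12-month rolling mean are missing from the plot of the AO data.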

Analyzing Sequential Data by Hidden Markov Model (HMM)

HMM is a statistical model that is widely used for data that unfolds as a continuing sequence, in areas such as time series stock market analysis, health checkups and speech recognition. This section deals in detail with analyzing sequential data using the Hidden Markov Model (HMM).

Hidden Markov Model (HMM)

HMM is a stochastic model built upon the concept of a Markov chain. It rests on the assumption that the probability of a future state depends only on the current state of the process, not on any state that preceded it. For example, when tossing a coin, we cannot say that the result of the fifth toss will be a head, however the first four tosses turned out. This is because a coin does not have any memory, and the next result does not depend on the previous results.
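The Markov assumption can be sketched with a small simulation in which each step looks only at the current state. The two-state transition matrix A, the seed and the helper name simulate are hypothetical choices for illustration. The second half of the code checks that the transition frequencies observed in the simulated chain are close to the matrix that generated it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical transition matrix: row i holds the probabilities of
# moving from state i to state 0 and to state 1.
A = np.array([[0.9, 0.1],
              [0.5, 0.5]])

def simulate(A, start, n_steps, rng):
    """Simulate a Markov chain; the next state depends only on the current one."""
    states = [start]
    for _ in range(n_steps - 1):
        current = states[-1]
        states.append(int(rng.choice(len(A), p=A[current])))
    return np.array(states)

chain = simulate(A, start=0, n_steps=5000, rng=rng)

# Estimate the transition probabilities back from the simulated chain.
counts = np.zeros_like(A)
for current, following in zip(chain[:-1], chain[1:]):
    counts[current, following] += 1
estimated = counts / counts.sum(axis=1, keepdims=True)

print(estimated.round(2))
```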

Mathematically, an HMM consists of the following variables:

States (S)

This is the set of hidden, or latent, states present in an HMM. It is denoted by S.

Output symbols (O)

This is the set of possible output symbols present in an HMM. It is denoted by O.

State Transition Probability Matrix (A)

This matrix holds the probability of making a transition from each state to each of the states. It is denoted by A.

Observation Emission Probability Matrix (B)

This matrix holds the probability of emitting, or observing, each symbol in a particular state. It is denoted by B.

Prior Probability Vector (π)

This holds the probability that the system starts in each particular state. It is denoted by π.

Hence, an HMM may be defined as λ = (S, O, A, B, π), where:

  1. S = {s1, s2, …, sN} is a set of N possible states,

  2. O = {o1, o2, …, oM} is a set of M possible observation symbols,

  3. A is an N × N state Transition Probability Matrix (TPM),

  4. B is an N × M observation or Emission Probability Matrix (EPM),

  5. π is an N-dimensional initial state probability distribution vector.
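To make the definition concrete, the sketch below writes down a hypothetical HMM with N = 2 states and M = 2 symbols as NumPy arrays. It then uses the standard forward algorithm to compute the probability of an observation sequence under that model. All numbers and the helper name forward_likelihood are assumptions made for illustration.

```python
import numpy as np

# Hypothetical HMM with N = 2 hidden states and M = 2 output symbols.
A = np.array([[0.7, 0.3],    # N x N transition matrix, each row sums to 1
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],    # N x M emission matrix, each row sums to 1
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])    # N-dimensional initial state distribution

def forward_likelihood(A, B, pi, observations):
    """Probability of an observation sequence under the model (forward algorithm)."""
    # alpha[i] = P(observations so far, current hidden state = i)
    alpha = pi * B[:, observations[0]]
    for symbol in observations[1:]:
        # Move one step through the chain, then weight by the emission probability.
        alpha = (alpha @ A) * B[:, symbol]
    return alpha.sum()

# Observation sequences are given as symbol indices into O.
likelihood = forward_likelihood(A, B, pi, [0, 1, 0])
print(likelihood)
```

Because the forward pass sums over every possible hidden state path, the likelihoods of all observation sequences of a given length add up to 1. This is a convenient sanity check for an implementation.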

Example: Analysis of Stock Market data

In this example, we analyze stock market data step by step, to get an idea of how the HMM works with sequential or time series data. Note that we implement this example in Python.

Import the necessary packages, as shown below:

import datetime
import warnings

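Now use the stock market data from the matplotlib.finance package, as shown here. Note that matplotlib.finance was deprecated in matplotlib 2.0 and removed in matplotlib 2.2, so these imports require an older matplotlib release. The Yahoo download service they rely on may also no longer be available.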

import numpy as np
from matplotlib import cm, pyplot as plt
from matplotlib.dates import YearLocator, MonthLocator
try:
   from matplotlib.finance import quotes_historical_yahoo_ochl
except ImportError:
   # Older matplotlib releases expose the loader under this name
   from matplotlib.finance import (
      quotes_historical_yahoo as quotes_historical_yahoo_ochl)

from hmmlearn.hmm import GaussianHMM

Load the data between two specific dates, a start date and an end date, as shown here:

start_date = datetime.date(1995, 10, 10)
end_date = datetime.date(2015, 4, 25)
quotes = quotes_historical_yahoo_ochl('INTC', start_date, end_date)

In this step, we extract the closing quote for each day. Use the following command:

closing_quotes = np.array([quote[2] for quote in quotes])

Now extract the volume of shares traded each day. The first day is dropped so that the volumes line up with the day-to-day differences computed in the next step. Use the following command:

volumes = np.array([quote[5] for quote in quotes])[1:]

Next, take the percentage difference between consecutive closing prices, using the code shown below:

diff_percentages = 100.0 * np.diff(closing_quotes) / closing_quotes[:-1]
dates = np.array([quote[0] for quote in quotes], dtype = int)[1:]
training_data = np.column_stack([diff_percentages, volumes])
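The percentage-difference step is easy to check on a few made-up numbers. In the sketch below the closing prices and volumes are invented for illustration. The point is that np.diff returns one element fewer than its input, which is why the prices are divided by closing_quotes[:-1] and the first volume is dropped.

```python
import numpy as np

# Made-up closing prices and traded volumes for four consecutive days.
closing_quotes = np.array([100.0, 110.0, 99.0, 99.0])
volumes = np.array([1000.0, 1200.0, 900.0, 1100.0])[1:]  # drop the first day

# Change from each day to the next, as a percentage of the earlier day's price.
diff_percentages = 100.0 * np.diff(closing_quotes) / closing_quotes[:-1]

# One row per day (from the second day on): [percentage change, volume].
training_data = np.column_stack([diff_percentages, volumes])

print(diff_percentages)  # day-to-day changes of +10 %, -10 % and 0 %
print(training_data.shape)  # (3, 2)
```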

In this step, create and train the Gaussian HMM. Use the following code:

hmm = GaussianHMM(n_components = 7, covariance_type = 'diag', n_iter = 1000)
with warnings.catch_warnings():
   warnings.simplefilter('ignore')
   hmm.fit(training_data)

Now generate data from the trained HMM, using the commands shown:

num_samples = 300
samples, _ = hmm.sample(num_samples)

Finally, plot and visualize the generated difference percentages and volumes of shares traded as graphs.

Use the following code to plot and visualize the difference percentages:

plt.figure()
plt.title('Difference percentages')
plt.plot(np.arange(num_samples), samples[:, 0], c = 'black')

Use the following code to plot and visualize the volume of shares traded:

plt.figure()
plt.title('Volume of shares')
plt.plot(np.arange(num_samples), samples[:, 1], c = 'black')
# On matplotlib 3.0 and later, the keyword is bottom: plt.ylim(bottom = 0)
plt.ylim(ymin = 0)
plt.show()