Scikit Learn - Anomaly Detection

Here, we will learn about anomaly detection in Sklearn and how it is used to identify data points.

Anomaly detection is a technique used to identify data points in a dataset that do not fit well with the rest of the data. It has many applications in business, such as fraud detection, intrusion detection, system health monitoring, surveillance, and predictive maintenance. Anomalies, also called outliers, can be divided into the following three categories −

  1. Point anomalies − It occurs when an individual data instance is considered anomalous with respect to the rest of the data.

  2. Contextual anomalies − Such anomalies are context-specific. They occur when a data instance is anomalous in a specific context.

  3. Collective anomalies − It occurs when a collection of related data instances is anomalous with respect to the entire dataset rather than to individual values.

Methods

Anomaly detection can be done with two methods, namely outlier detection and novelty detection. It is necessary to understand the distinction between them.

Outlier detection

The training data contains outliers, i.e., observations that are far from the rest of the data. That is why outlier detection estimators always try to fit the region where the training data is most concentrated, while ignoring the deviant observations. It is also known as unsupervised anomaly detection.

Novelty detection

It is concerned with detecting an unobserved pattern in new observations that is not included in the training data. Here, the training data is not polluted by outliers. It is also known as semi-supervised anomaly detection.
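
As a minimal sketch of this semi-supervised setup (using LocalOutlierFactor with novelty = True, discussed in detail later in this chapter, on made-up one-dimensional data) −

import numpy as np
from sklearn.neighbors import LocalOutlierFactor
# Fit on clean training data, then judge unseen observations.
X_train = np.array([[0.0], [0.1], [0.2], [0.3]])
NOVclf = LocalOutlierFactor(n_neighbors=2, novelty=True).fit(X_train)
print(NOVclf.predict(np.array([[0.15], [5.0]])))   # e.g. [ 1 -1]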

Scikit-learn provides a set of ML tools that can be used for both outlier detection and novelty detection. These tools first learn from the data in an unsupervised fashion by using the fit() method as follows −

estimator.fit(X_train)

Now, the new observations can be sorted as inliers (labeled 1) or outliers (labeled -1) by using the predict() method as follows −

estimator.predict(X_test)

The estimator first computes a raw scoring function, and then the predict() method applies a threshold to that raw scoring function. We can access the raw scoring function with the help of the score_samples() method and control the threshold with the contamination parameter.

We can also use the decision_function() method, which defines outliers as negative values and inliers as non-negative values.

estimator.decision_function(X_test)
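
To see how these pieces fit together, the following is a minimal end-to-end sketch; the choice of IsolationForest as the estimator and the synthetic data are illustrative assumptions −

import numpy as np
from sklearn.ensemble import IsolationForest
rng = np.random.RandomState(42)
X_train = 0.3 * rng.randn(100, 2)             # concentrated inliers around the origin
X_test = np.array([[0.1, 0.2], [4.0, 4.0]])   # one likely inlier, one likely outlier
estimator = IsolationForest(contamination=0.1, random_state=42)
estimator.fit(X_train)                        # unsupervised learning step
print(estimator.predict(X_test))              # e.g. [ 1 -1]
print(estimator.score_samples(X_test))        # raw scores (lower = more abnormal)
print(estimator.decision_function(X_test))    # negative values flag outliers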

Sklearn algorithms for Outlier Detection

Let us begin by understanding what an elliptic envelope is.

Fitting an elliptic envelope

This algorithm assumes that regular data comes from a known distribution such as the Gaussian distribution. For outlier detection, Scikit-learn provides an object named covariance.EllipticEnvelope.

This object fits a robust covariance estimate to the data, and thus, fits an ellipse to the central data points. It ignores the points outside the central mode.

Parameters

The following are the parameters used by the sklearn.covariance.EllipticEnvelope method −

  1. store_precision − Boolean, optional, default = True. It specifies whether the estimated precision is stored.

  2. assume_centered − Boolean, optional, default = False. If set to False, it computes the robust location and covariance directly with the help of the FastMCD algorithm. On the other hand, if set to True, it computes the support of the robust location and covariance.

  3. support_fraction − float in (0., 1.), optional, default = None. This parameter tells the method how much of the proportion of points is to be included in the support of the raw MCD estimate.

  4. contamination − float in (0., 1.), optional, default = 0.1. It provides the proportion of outliers in the data set.

  5. random_state − int, RandomState instance or None, optional, default = None. This parameter represents the seed of the pseudo-random number generator used while shuffling the data. The options are as follows: int − in this case, random_state is the seed used by the random number generator; RandomState instance − in this case, random_state is the random number generator; None − in this case, the random number generator is the RandomState instance used by np.random.
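
As a quick illustration of these parameters, an estimator could be constructed as follows; the specific values are arbitrary choices for demonstration, not recommendations −

from sklearn.covariance import EllipticEnvelope
# Arbitrary illustrative values for the parameters listed above.
ee = EllipticEnvelope(store_precision=True, assume_centered=False,
   support_fraction=0.9, contamination=0.1, random_state=0)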

Attributes

The following are the attributes used by the sklearn.covariance.EllipticEnvelope method −

  1. support_ − array-like, shape (n_samples,). It represents the mask of the observations used to compute robust estimates of location and shape.

  2. location_ − array-like, shape (n_features,). It returns the estimated robust location.

  3. covariance_ − array-like, shape (n_features, n_features). It returns the estimated robust covariance matrix.

  4. precision_ − array-like, shape (n_features, n_features). It returns the estimated pseudo-inverse matrix.

  5. offset_ − float. It is used to define the decision function from the raw scores.

Implementation Example

import numpy as np
from sklearn.covariance import EllipticEnvelope
true_cov = np.array([[.8, .3], [.3, .4]])
X = np.random.RandomState(0).multivariate_normal(mean=[0, 0], cov=true_cov, size=500)
cov = EllipticEnvelope(random_state=0).fit(X)
# Now we can use the predict method. It will return 1 for an inlier and -1 for an outlier.
cov.predict([[0, 0], [2, 2]])

Output

array([ 1, -1])
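
After fitting, the attributes listed above and the raw decision scores can also be inspected; this is a sketch reusing the cov estimator from the example, so the exact numbers depend on the random draw −

# A sketch reusing the fitted estimator; exact numbers depend on the random draw.
print(cov.location_)         # estimated robust location
print(cov.covariance_)       # estimated robust covariance matrix
print(cov.decision_function([[0, 0], [2, 2]]))   # negative values flag outliers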

Isolation Forest

For high-dimensional datasets, one efficient way of performing outlier detection is to use random forests. Scikit-learn provides the ensemble.IsolationForest method, which isolates observations by randomly selecting a feature. Afterwards, it randomly selects a value between the maximum and minimum values of the selected feature.

Here, the number of splits needed to isolate a sample is equivalent to the path length from the root node to the terminating node.

Parameters

The following are the parameters used by the sklearn.ensemble.IsolationForest method −

  1. n_estimators − int, optional, default = 100. It represents the number of base estimators in the ensemble.

  2. max_samples − int or float, optional, default = "auto". It represents the number of samples to be drawn from X to train each base estimator. If int, max_samples samples are drawn. If float, max_samples * X.shape[0] samples are drawn. If "auto", max_samples = min(256, n_samples) samples are drawn.

  3. contamination − "auto" or float, optional, default = "auto". It provides the proportion of outliers in the data set. If set to the default, i.e. "auto", the threshold is determined as in the original paper. If set to a float, it should lie in the range [0, 0.5].

  4. random_state − int, RandomState instance or None, optional, default = None. This parameter represents the seed of the pseudo-random number generator used while shuffling the data. The options are as follows: int − in this case, random_state is the seed used by the random number generator; RandomState instance − in this case, random_state is the random number generator; None − in this case, the random number generator is the RandomState instance used by np.random.

  5. max_features − int or float, optional (default = 1.0). It represents the number of features to be drawn from X to train each base estimator. If int, max_features features are drawn. If float, max_features * X.shape[1] features are drawn.

  6. bootstrap − Boolean, optional (default = False). Its default option is False, which means sampling is performed without replacement. On the other hand, if set to True, individual trees are fit on random subsets of the training data sampled with replacement.

  7. n_jobs − int or None, optional (default = None). It represents the number of jobs to run in parallel for the fit() and predict() methods.

  8. verbose − int, optional (default = 0). This parameter controls the verbosity of the tree-building process.

  9. warm_start − Boolean, optional (default = False). If warm_start = True, we can reuse the solution of the previous call to fit and add more estimators to the ensemble. But if set to False, we need to fit a whole new forest.
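
As a quick illustration of some of these parameters, a forest could be constructed as follows; the specific values are arbitrary choices for demonstration, not recommendations −

from sklearn.ensemble import IsolationForest
# Arbitrary illustrative values for a few of the parameters listed above.
IFclf = IsolationForest(n_estimators=50, max_samples='auto',
   contamination=0.05, max_features=1.0, bootstrap=False,
   n_jobs=-1, random_state=0, warm_start=False)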

Attributes

The following are the attributes used by the sklearn.ensemble.IsolationForest method −

  1. estimators_ − list of ExtraTreeRegressor. It provides the collection of all fitted sub-estimators.

  2. max_samples_ − integer. It provides the actual number of samples used.

  3. offset_ − float. It is used to define the decision function from the raw scores.

Implementation Example

The following Python script will use the sklearn.ensemble.IsolationForest method to fit 10 trees on the given data −

from sklearn.ensemble import IsolationForest
import numpy as np
X = np.array([[-1, -2], [-3, -3], [-3, -4], [0, 0], [-50, 60]])
OUTDclf = IsolationForest(n_estimators=10)
OUTDclf.fit(X)

Output

IsolationForest(
   behaviour = 'old', bootstrap = False, contamination='legacy',
   max_features = 1.0, max_samples = 'auto', n_estimators = 10, n_jobs=None,
   random_state = None, verbose = 0
)
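
With the forest fitted, predict() can label the same points; this is a sketch, and the exact labels may vary because no random_state was fixed −

# A sketch reusing the forest fitted above; labels may vary across runs.
print(OUTDclf.predict(X))   # e.g. [ 1  1  1  1 -1]; [-50, 60] is isolated quickly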

Local Outlier Factor

The Local Outlier Factor (LOF) algorithm is another efficient algorithm for performing outlier detection on high-dimensional data. Scikit-learn provides the neighbors.LocalOutlierFactor method, which computes a score, called the local outlier factor, reflecting the degree of abnormality of the observations. The main logic of this algorithm is to detect samples that have a substantially lower density than their neighbors. That is why it measures the local density deviation of given data points with respect to their neighbors.

Parameters

The following are the parameters used by the sklearn.neighbors.LocalOutlierFactor method −

  1. n_neighbors − int, optional, default = 20. It represents the number of neighbors used by default for kneighbors queries. If it is larger than the number of samples provided, all samples are used.

  2. algorithm − optional. The algorithm used for computing the nearest neighbors. If you choose ball_tree, it will use the BallTree algorithm. If you choose kd_tree, it will use the KDTree algorithm. If you choose brute, it will use a brute-force search. If you choose auto, it will decide the most appropriate algorithm on the basis of the values passed to the fit() method.

  3. leaf_size − int, optional, default = 30. The value of this parameter can affect the speed of construction and query. It also affects the memory required to store the tree. This parameter is passed to the BallTree or KDTree algorithm.

  4. contamination − "auto" or float, optional, default = "auto". It provides the proportion of outliers in the data set. If set to the default, i.e. "auto", the threshold is determined as in the original paper. If set to a float, it should lie in the range [0, 0.5].

  5. metric − string or callable, default = "minkowski". It represents the metric used for distance computation.

  6. p − int, optional (default = 2). It is the parameter for the Minkowski metric. p = 1 is equivalent to using the Manhattan distance, i.e. L1, whereas p = 2 is equivalent to using the Euclidean distance, i.e. L2.

  7. novelty − Boolean (default = False). By default, the LOF algorithm is used for outlier detection, but it can be used for novelty detection if we set novelty = True.

  8. n_jobs − int or None, optional (default = None). It represents the number of jobs to run in parallel for the fit() and predict() methods.
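
As a quick illustration of these parameters, a LOF estimator could be constructed as follows; the specific values are arbitrary choices for demonstration, not recommendations −

from sklearn.neighbors import LocalOutlierFactor
# Arbitrary illustrative values for the parameters listed above.
LOFclf = LocalOutlierFactor(n_neighbors=20, algorithm='auto', leaf_size=30,
   metric='minkowski', p=2, contamination='auto', novelty=False, n_jobs=None)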

Attributes

The following are the attributes used by the sklearn.neighbors.LocalOutlierFactor method −

  1. negative_outlier_factor_ − numpy array, shape (n_samples,). It provides the opposite LOF of the training samples.

  2. n_neighbors_ − int. It provides the actual number of neighbors used for neighbor queries.

  3. offset_ − float. It is used to define the binary labels from the raw scores.

Implementation Example

The Python script given below uses the sklearn.neighbors.NearestNeighbors class to build a neighbors model from an array corresponding to our dataset −

from sklearn.neighbors import NearestNeighbors
samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
LOFneigh = NearestNeighbors(n_neighbors=1, algorithm="ball_tree", p=1)
LOFneigh.fit(samples)

Output

NearestNeighbors(
   algorithm = 'ball_tree', leaf_size = 30, metric='minkowski',
   metric_params = None, n_jobs = None, n_neighbors = 1, p = 1, radius = 1.0
)

Example

Now, we can ask this constructed model for the closest point to [0.5, 1., 1.5] by using the following Python script −

print(LOFneigh.kneighbors([[.5, 1., 1.5]]))

Output

(array([[1.5]]), array([[2]], dtype = int64))
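
Since the script above exercises only the underlying neighbor search, here is a minimal sketch of LocalOutlierFactor itself; the one-dimensional data is made up so that the outlier is obvious −

import numpy as np
from sklearn.neighbors import LocalOutlierFactor
# Made-up one-dimensional data in which 101.1 is the obvious outlier.
X = np.array([[-1.1], [0.2], [101.1], [0.3]])
LOFclf = LocalOutlierFactor(n_neighbors=2)
print(LOFclf.fit_predict(X))              # returns [ 1  1 -1  1]
print(LOFclf.negative_outlier_factor_)    # opposite LOF of the training samples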

One-Class SVM

The One-Class SVM, introduced by Schölkopf et al., is an unsupervised anomaly detection method. It is also very efficient with high-dimensional data and estimates the support of a high-dimensional distribution. It is implemented in the Support Vector Machines module, in the Sklearn.svm.OneClassSVM object. For defining a frontier, it requires a kernel (RBF is the most used one) and a scalar parameter.

For a better understanding, let us fit our data with the svm.OneClassSVM object −

Example

from sklearn.svm import OneClassSVM
X = [[0], [0.89], [0.90], [0.91], [1]]
OSVMclf = OneClassSVM(gamma='scale').fit(X)

Now, we can get the score_samples for the input data as follows −

OSVMclf.score_samples(X)

Output

array([1.12218594, 1.58645126, 1.58673086, 1.58645127, 1.55713767])
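
As a brief follow-up sketch reusing the fitted OSVMclf, the predict() method labels each point as an inlier (1) or an outlier (-1) according to the learned frontier −

# A sketch reusing the model fitted above; labels follow the learned frontier.
OSVMclf.predict(X)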