Scikit Learn - Anomaly Detection

Here, we will learn what anomaly detection in Sklearn is and how it is used to identify data points.

Anomaly detection is a technique used to identify data points in a dataset that do not fit well with the rest of the data. It has many applications in business, such as fraud detection, intrusion detection, system health monitoring, surveillance, and predictive maintenance. Anomalies, which are also called outliers, can be divided into the following three categories −

  1. Point anomalies − It occurs when an individual data instance is considered anomalous with respect to the rest of the data.

  2. Contextual anomalies − Such a kind of anomaly is context-specific. It occurs when a data instance is anomalous in a specific context.

  3. Collective anomalies − It occurs when a collection of related data instances is anomalous with respect to the entire dataset rather than individual values.

Methods

Two methods, namely outlier detection and novelty detection, can be used for anomaly detection. It is necessary to understand the distinction between them.

Outlier detection

The training data contains outliers, which are defined as observations that lie far from the rest of the data. That is why outlier detection estimators always try to fit the region containing the most concentrated training data while ignoring the deviant observations. It is also known as unsupervised anomaly detection.

Novelty detection

It is concerned with detecting an unobserved pattern in new observations that is not included in the training data. Here, the training data is not polluted by outliers. It is also known as semi-supervised anomaly detection.

scikit-learn provides a set of ML tools that can be used for both outlier detection and novelty detection. These tools first fit a model to the data in an unsupervised way using the fit() method as follows −

estimator.fit(X_train)

Now, new observations can be sorted as inliers (labeled 1) or outliers (labeled -1) using the predict() method as follows −

estimator.predict(X_test)

The estimator first computes a raw scoring function, and the predict method then applies a threshold to that raw scoring function. We can access the raw scoring function with the help of the score_samples method and can control the threshold with the contamination parameter.

We can also use the decision_function method, which defines outliers as negative values and inliers as non-negative values.

estimator.decision_function(X_test)
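
To make this workflow concrete, here is a minimal sketch of the fit/predict/score_samples/decision_function cycle described above. It uses IsolationForest (covered later in this chapter) as the estimator, and the toy arrays X_train and X_test are invented purely for illustration −

import numpy as np
from sklearn.ensemble import IsolationForest

# Toy data: inliers clustered near the origin plus one far-away training point
X_train = np.array([[0.1, 0.2], [0.2, 0.1], [0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
X_test = np.array([[0.1, 0.1], [4.0, 4.0]])

estimator = IsolationForest(contamination = 0.2, random_state = 0)
estimator.fit(X_train)                       # unsupervised learning step
print(estimator.predict(X_test))             # 1 for inliers, -1 for outliers
print(estimator.score_samples(X_test))       # raw scores; the lower, the more abnormal
print(estimator.decision_function(X_test))   # score_samples shifted by the threshold offset_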

Sklearn algorithms for Outlier Detection

Let us begin by understanding what an elliptic envelope is.

Fitting an elliptic envelope

This algorithm assumes that regular data comes from a known distribution such as the Gaussian distribution. For outlier detection, Scikit-learn provides an object named covariance.EllipticEnvelope.

This object fits a robust covariance estimate to the data and thus fits an ellipse to the central data points. It ignores the points outside the central mode.

Parameters

The following are the parameters used by the sklearn.covariance.EllipticEnvelope method −

  1. store_precision − Boolean, optional, default = True. It specifies whether the estimated precision is stored.

  2. assume_centered − Boolean, optional, default = False. If we set it to False, it will compute the robust location and covariance directly with the help of the FastMCD algorithm. On the other hand, if set to True, it will compute the support of the robust location and covariance.

  3. support_fraction − float in (0., 1.), optional, default = None. This parameter tells the method what proportion of points to include in the support of the raw MCD estimates.

  4. contamination − float in (0., 1.), optional, default = 0.1. It provides the proportion of the outliers in the data set.

  5. random_state − int, RandomState instance or None, optional, default = None. This parameter represents the seed of the pseudo-random number generator used while shuffling the data. The options are −
     int − In this case, random_state is the seed used by the random number generator.
     RandomState instance − In this case, random_state is the random number generator.
     None − In this case, the random number generator is the RandomState instance used by np.random.

Attributes

The following are the attributes used by the sklearn.covariance.EllipticEnvelope method −

  1. support_ − array-like, shape (n_samples,). It represents the mask of the observations used to compute robust estimates of location and shape.

  2. location_ − array-like, shape (n_features,). It returns the estimated robust location.

  3. covariance_ − array-like, shape (n_features, n_features). It returns the estimated robust covariance matrix.

  4. precision_ − array-like, shape (n_features, n_features). It returns the estimated pseudo-inverse matrix.

  5. offset_ − float. It is used to define the decision function from the raw scores: decision_function = score_samples - offset_.

Implementation Example

import numpy as np
from sklearn.covariance import EllipticEnvelope

# A symmetric positive-definite covariance matrix for the simulated data
true_cov = np.array([[.8, .3], [.3, .4]])
X = np.random.RandomState(0).multivariate_normal(mean = [0, 0], cov = true_cov, size = 500)
cov = EllipticEnvelope(random_state = 0).fit(X)
# Now we can use the predict method. It will return 1 for an inlier and -1 for an outlier.
cov.predict([[0, 0], [2, 2]])

Output

array([ 1, -1])
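
Continuing with the cov estimator fitted above, the following short sketch shows how the attributes from the tables can be inspected; the printed values depend on the random sample and are therefore not shown −

# Robust estimates computed during fit
print(cov.location_)        # estimated robust location, close to [0, 0]
print(cov.covariance_)      # estimated robust covariance matrix
print(cov.support_.sum())   # number of observations used for the robust estimates
# decision_function equals score_samples shifted by offset_
raw = cov.score_samples([[0, 0]])
print(cov.decision_function([[0, 0]]) - (raw - cov.offset_))   # approximately [0.]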

Isolation Forest

In the case of a high-dimensional dataset, one efficient way to perform outlier detection is to use random forests. scikit-learn provides the ensemble.IsolationForest method, which isolates the observations by randomly selecting a feature. Afterwards, it randomly selects a split value between the maximum and minimum values of the selected feature.

Here, the number of splits needed to isolate a sample is equivalent to the path length from the root node to the terminating node.

Parameters

The following are the parameters used by the sklearn.ensemble.IsolationForest method −

  1. n_estimators − int, optional, default = 100. It represents the number of base estimators in the ensemble.

  2. max_samples − int or float, optional, default = "auto". It represents the number of samples to be drawn from X to train each base estimator. If we choose int as its value, it will draw max_samples samples. If we choose float as its value, it will draw max_samples * X.shape[0] samples. And, if we choose auto as its value, it will draw max_samples = min(256, n_samples).

  3. contamination − 'auto' or float, optional, default = 'auto'. It provides the proportion of the outliers in the data set. If we keep the default, i.e. 'auto', it will determine the threshold as in the original paper. If set to float, the contamination must lie in the range [0, 0.5].

  4. random_state − int, RandomState instance or None, optional, default = None. This parameter represents the seed of the pseudo-random number generator used while shuffling the data. The options are −
     int − In this case, random_state is the seed used by the random number generator.
     RandomState instance − In this case, random_state is the random number generator.
     None − In this case, the random number generator is the RandomState instance used by np.random.

  5. max_features − int or float, optional, default = 1.0. It represents the number of features to be drawn from X to train each base estimator. If we choose int as its value, it will draw max_features features. If we choose float as its value, it will draw max_features * X.shape[1] features.

  6. bootstrap − Boolean, optional, default = False. Its default option is False, which means the sampling will be performed without replacement. On the other hand, if set to True, individual trees are fit on a random subset of the training data sampled with replacement.

  7. n_jobs − int or None, optional, default = None. It represents the number of jobs to be run in parallel for both the fit() and predict() methods.

  8. verbose − int, optional, default = 0. This parameter controls the verbosity of the tree building process.

  9. warm_start − Bool, optional, default = False. If warm_start = True, we can reuse the previous call's solution to fit and can add more estimators to the ensemble. But if it is set to False, we need to fit a whole new forest.

Attributes

The following are the attributes used by the sklearn.ensemble.IsolationForest method −

  1. estimators_ − list of ExtraTreeRegressor instances. It provides the collection of all fitted sub-estimators.

  2. max_samples_ − integer. It provides the actual number of samples used.

  3. offset_ − float. It is used to define the decision function from the raw scores: decision_function = score_samples - offset_.

Implementation Example

The Python script below will use the sklearn.ensemble.IsolationForest method to fit 10 trees on the given data −

from sklearn.ensemble import IsolationForest
import numpy as np

# Five 2-D points; the last one lies far from the rest
X = np.array([[-1, -2], [-3, -3], [-3, -4], [0, 0], [-50, 60]])
OUTDclf = IsolationForest(n_estimators = 10)
OUTDclf.fit(X)

Output

IsolationForest(
   behaviour = 'old', bootstrap = False, contamination='legacy',
   max_features = 1.0, max_samples = 'auto', n_estimators = 10, n_jobs=None,
   random_state = None, verbose = 0
)
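
As a follow-up, the fitted forest can score new observations. The sketch below continues from OUTDclf above; since no random_state was fixed, the exact scores (and in borderline cases the labels) can vary between runs −

# Score an ordinary point near the training cluster and an extreme point
print(OUTDclf.predict([[-2, -3], [70, -80]]))            # likely [ 1 -1]
print(OUTDclf.score_samples([[-2, -3], [70, -80]]))      # the lower, the more abnormal
print(OUTDclf.decision_function([[-2, -3], [70, -80]]))  # negative values flag outliers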

Local Outlier Factor

The Local Outlier Factor (LOF) algorithm is another efficient algorithm for performing outlier detection on high-dimensional data. scikit-learn provides the neighbors.LocalOutlierFactor method, which computes a score, called the local outlier factor, reflecting the degree of abnormality of the observations. The main logic of this algorithm is to detect the samples that have a substantially lower density than their neighbors. That is why it measures the local density deviation of given data points with respect to their neighbors.

Parameters

The following are the parameters used by the sklearn.neighbors.LocalOutlierFactor method −

  1. n_neighbors − int, optional, default = 20. It represents the number of neighbors used by default for the kneighbors query. All samples will be used if n_neighbors is larger than the number of samples provided.

  2. algorithm − optional. The algorithm to be used for computing the nearest neighbors. If you choose ball_tree, it will use the BallTree algorithm. If you choose kd_tree, it will use the KDTree algorithm. If you choose brute, it will use a brute-force search algorithm. If you choose auto, it will decide the most appropriate algorithm on the basis of the values passed to the fit() method.

  3. leaf_size − int, optional, default = 30. The value of this parameter can affect the speed of construction and query. It also affects the memory required to store the tree. This parameter is passed to the BallTree or KDTree algorithms.

  4. contamination − 'auto' or float, optional, default = 'auto'. It provides the proportion of the outliers in the data set. If we keep the default, i.e. 'auto', it will determine the threshold as in the original paper. If set to float, the contamination must lie in the range [0, 0.5].

  5. metric − string or callable, default = 'minkowski'. It represents the metric used for distance computation.

  6. p − int, optional, default = 2. It is the parameter for the Minkowski metric. p = 1 is equivalent to using manhattan_distance, i.e. L1, whereas p = 2 is equivalent to using euclidean_distance, i.e. L2.

  7. novelty − Boolean, default = False. By default, the LOF algorithm is used for outlier detection, but it can be used for novelty detection if we set novelty = True, as shown in the sketch below.

  8. n_jobs − int or None, optional, default = None. It represents the number of jobs to be run in parallel for both the fit() and predict() methods.
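
Since novelty = True switches LOF into the semi-supervised novelty-detection mode described earlier, a minimal sketch of that usage is given below; the toy arrays are invented for illustration −

from sklearn.neighbors import LocalOutlierFactor

# Clean training data (no outliers), as novelty detection assumes
X_train = [[0., 0.], [0., 1.], [1., 0.], [1., 1.]]
lof = LocalOutlierFactor(n_neighbors = 2, novelty = True)
lof.fit(X_train)
# With novelty = True, predict() may be called on unseen observations
print(lof.predict([[0.5, 0.5], [10., 10.]]))   # likely [ 1 -1]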

Attributes

The following are the attributes used by the sklearn.neighbors.LocalOutlierFactor method −

  1. negative_outlier_factor_ − numpy array, shape (n_samples,). It provides the opposite LOF of the training samples.

  2. n_neighbors_ − integer. It provides the actual number of neighbors used for neighbors queries.

  3. offset_ − float. It is used to define the binary labels from the raw scores.

Implementation Example

The Python script given below uses the sklearn.neighbors.NearestNeighbors class (the neighbor-search machinery underlying LOF) to build a neighbors model from an array representing our data set −

from sklearn.neighbors import NearestNeighbors

samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
# Manhattan distance (p = 1) with a BallTree index
LOFneigh = NearestNeighbors(n_neighbors = 1, algorithm = "ball_tree", p = 1)
LOFneigh.fit(samples)

Output

NearestNeighbors(
   algorithm = 'ball_tree', leaf_size = 30, metric='minkowski',
   metric_params = None, n_jobs = None, n_neighbors = 1, p = 1, radius = 1.0
)

Example

Now, we can ask this constructed model for the closest point to [0.5, 1., 1.5] using the following Python script −

print(LOFneigh.kneighbors([[.5, 1., 1.5]]))

Output

(array([[1.5]]), array([[2]], dtype = int64))
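
Note that the example above exercises only the underlying neighbor search. A minimal sketch of the LocalOutlierFactor estimator itself on the same samples might look as follows; the factor values are data-dependent and not shown −

from sklearn.neighbors import LocalOutlierFactor

samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
clf = LocalOutlierFactor(n_neighbors = 2, contamination = 0.4)
# fit_predict labels each training sample: 1 for inliers, -1 for outliers
print(clf.fit_predict(samples))
# Opposite of the LOF score; the lower, the more abnormal
print(clf.negative_outlier_factor_)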

One-Class SVM

The One-Class SVM, introduced by Schölkopf et al., is an unsupervised outlier detection method. It is also very efficient with high-dimensional data and estimates the support of a high-dimensional distribution. It is implemented in the Support Vector Machines module, in the Sklearn.svm.OneClassSVM object. For defining a frontier, it requires a kernel (RBF is the most commonly used) and a scalar parameter.

For better understanding, let us fit our data with the svm.OneClassSVM object −

Example

from sklearn.svm import OneClassSVM
X = [[0], [0.89], [0.90], [0.91], [1]]
OSVMclf = OneClassSVM(gamma = 'scale').fit(X)

Now, we can get the score_samples for the input data as follows −

OSVMclf.score_samples(X)

Output

array([1.12218594, 1.58645126, 1.58673086, 1.58645127, 1.55713767])
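
As with the other estimators, predict and decision_function can then separate inliers from outliers. The short sketch below continues from the fitted OSVMclf above; the exact labels depend on the learned frontier −

# 1 marks inliers and -1 marks outliers with respect to the learned frontier
print(OSVMclf.predict(X))
# Negative values correspond to outliers, non-negative values to inliers
print(OSVMclf.decision_function(X))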