Machine Learning - Unsupervised
What is Unsupervised Learning?
In unsupervised machine learning algorithms, we do not have a supervisor to provide any sort of guidance. Unsupervised learning algorithms are handy in scenarios where we do not have the luxury, as in supervised learning, of pre-labeled training data, and we want to extract useful patterns from the input data.
Examples of unsupervised machine learning algorithms include K-means clustering, association rule mining, and so on.
In regression, we train the machine to predict a future value. In classification, we train the machine to classify an unknown object into one of the categories we have defined. In short, we have been training machines so that they can predict Y for our data X. Given a huge data set with no predefined categories, it would be difficult for us to train the machine using supervised learning. What if the machine could look through and analyze big data, running into several gigabytes and terabytes, and tell us how many distinct categories the data contains?
As an example, consider voter data. By taking some inputs from each voter (called features in AI terminology), let the machine predict that so many voters would vote for political party X, so many would vote for party Y, and so on. Thus, in general, given a huge set of data points X, we are asking the machine, “What can you tell me about X?” Or the question may be, “What are the five best groups we can make out of X?” Or even, “Which three features occur together most frequently in X?”
This is exactly what unsupervised learning is all about.
Algorithms for Unsupervised Learning
Let us now discuss one of the most widely used clustering algorithms in unsupervised machine learning.
K-Means Clustering
The 2000 and 2004 Presidential elections in the United States were close — very close. The largest percentage of the popular vote that any candidate received was 50.7% and the lowest was 47.9%. If a percentage of the voters were to have switched sides, the outcome of the election would have been different. There are small groups of voters who, when properly appealed to, will switch sides. These groups may not be huge, but with such close races, they may be big enough to change the outcome of the election. How do you find these groups of people? How do you appeal to them with a limited budget? The answer is clustering.
Let us understand how it is done.
- First, you collect information on people either with or without their consent: any sort of information that might give some clue about what is important to them and what will influence how they vote.
- Then you put this information into some sort of clustering algorithm.
- Next, for each cluster (it would be smart to choose the largest one first) you craft a message that will appeal to these voters.
- Finally, you deliver the campaign and measure to see if it’s working.
Clustering is a type of unsupervised learning that automatically forms clusters of similar things. It is like automatic classification. You can cluster almost anything, and the more similar the items are in the cluster, the better the clusters are. In this chapter, we are going to study one type of clustering algorithm called k-means. It is called k-means because it finds ‘k’ unique clusters, and the center of each cluster is the mean of the values in that cluster.
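To make the mechanics concrete, here is a minimal sketch of k-means in plain NumPy (the data and parameter choices are purely illustrative): each iteration assigns every point to its nearest center and then moves each center to the mean of the points assigned to it, which is exactly where the name comes from.

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch: assign points to the nearest center,
    then move each center to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    # Start with k randomly chosen data points as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Distance of every point to every center, shape (n_samples, k).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)  # index of the nearest center
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # converged: centers stopped moving
            break
        centers = new_centers
    return centers, labels

# Toy 2-D data: two well-separated blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = k_means(X, k=2)
print(centers)
```

Library implementations such as scikit-learn's KMeans add smarter initialization (k-means++) and multiple restarts, but the core assign-then-average loop is the same.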
Cluster Identification
Cluster identification tells an algorithm, “Here’s some data. Now group similar things together and tell me about those groups.” The key difference from classification is that in classification you know what you are looking for, whereas in clustering you do not.
Clustering is sometimes called unsupervised classification because it produces the same result as classification does but without having predefined classes.
Based on the ML tasks they perform, unsupervised learning algorithms can be divided into the following broad classes: Clustering, Association, Dimensionality Reduction, and Anomaly Detection.
Clustering
Clustering methods are among the most useful unsupervised ML methods. These algorithms are used to find similarity and relationship patterns among data samples and then cluster those samples into groups that are similar based on their features. A real-world example of clustering is grouping customers by their purchasing behavior.
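For instance, segmenting customers by purchasing behavior might look like the sketch below; this assumes scikit-learn and uses made-up spending figures purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [annual_spend, visits_per_month]
customers = np.array([
    [200,  2], [250,  3], [220,  2],   # low-spend, infrequent shoppers
    [900, 10], [950, 12], [880, 11],   # high-spend, frequent shoppers
    [500,  6], [520,  5], [480,  7],   # mid-range shoppers
])

# Ask for 3 clusters; the labels group customers with similar behavior.
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(customers)
print(model.labels_)           # cluster assignment for each customer
print(model.cluster_centers_)  # average behavior of each group
```

In practice the features would usually be standardized first (for example with scikit-learn's StandardScaler), so that a large-valued feature such as annual spend does not dominate the distance calculation.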
Association
Another useful unsupervised ML method is Association, which is used to analyze large datasets to find patterns that represent interesting relationships between various items. It is also termed Association Rule Mining or Market Basket Analysis and is mainly used to analyze customer shopping patterns.
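As a rough sketch of the idea behind market basket analysis (pure Python, with made-up transactions), we can count how often pairs of items appear in the same basket and keep the pairs whose support clears a threshold:

```python
from itertools import combinations
from collections import Counter

# Hypothetical shopping baskets (transactions).
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
]

min_support = 0.4  # a pair must appear in at least 40% of baskets

# Count how often each unordered pair of items occurs together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Keep the frequent pairs; these seed association rules such as bread -> butter.
frequent = {pair: count / len(baskets)
            for pair, count in pair_counts.items()
            if count / len(baskets) >= min_support}
print(frequent)  # e.g. {('bread', 'milk'): 0.6, ('bread', 'butter'): 0.6, ...}
```

Full association-rule miners such as the Apriori algorithm generalize this to larger itemsets and then derive rules with confidence and lift scores from the frequent itemsets.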
Dimensionality Reduction
As the name suggests, this unsupervised ML method is used to reduce the number of feature variables for each data sample by selecting a set of principal or representative features.
A question that arises here is: why do we need to reduce dimensionality? The reason is the problem of feature space complexity, which arises when we start analyzing and extracting millions of features from data samples. This problem is generally referred to as the “curse of dimensionality”. PCA (Principal Component Analysis) and linear discriminant analysis are among the popular algorithms used for this purpose.
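The short PCA sketch below (assuming scikit-learn and randomly generated toy data) shows the typical workflow: fit the components on the data, project the samples onto them, and inspect how much variance each component retains.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 100 samples with 50 correlated features built from 3 hidden factors.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))
X = base @ rng.normal(size=(3, 50)) + 0.1 * rng.normal(size=(100, 50))

# Project onto the 3 directions of greatest variance.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 3)
print(pca.explained_variance_ratio_)  # variance retained by each component
```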
Anomaly Detection
This unsupervised ML method is used to find occurrences of rare events or observations that generally do not occur. Using the learned knowledge, anomaly detection methods are able to differentiate between anomalous and normal data points.
Some unsupervised techniques, such as clustering and KNN-based distance methods, can detect anomalies based on the data and its features.
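A minimal sketch of the KNN-distance idea (assuming scikit-learn and synthetic data): a point whose distance to its k-th nearest neighbor is far larger than is typical for the dataset gets flagged as an anomaly.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Mostly normal points around the origin, plus one obvious outlier.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(100, 2)), [[8.0, 8.0]]])

# Distance of each point to its 5 nearest neighbors (excluding itself).
nn = NearestNeighbors(n_neighbors=6).fit(X)  # 6 = the point itself + 5 neighbors
distances, _ = nn.kneighbors(X)
score = distances[:, -1]                     # distance to the 5th true neighbor

# Flag points whose neighbor distance is far above the typical value.
threshold = score.mean() + 3 * score.std()
print(np.where(score > threshold)[0])        # index 100 -> the injected outlier
```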