Machine Learning Tutorial
Machine Learning - Models
There are various machine learning algorithms, techniques, and methods that can be used to build models for solving real-life problems using data. In this chapter, we are going to discuss these different kinds of methods.
There are four main types of machine learning methods, classified based on the level of human supervision −
- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning
- Reinforcement Learning
In the next four chapters, we will discuss each of these machine learning models in detail. Here, let's have a brief overview of these methods −
Supervised Learning
Supervised learning algorithms or methods are the most commonly used ML algorithms. During the training process, these methods take data samples (the training data) along with their associated outputs (a label or response for each data sample).
The main objective of supervised learning algorithms is to learn the association between input data samples and their corresponding outputs by processing multiple training data instances.
For example, suppose we have −

x : Input variables, and

Y : Output variable
Now, apply an algorithm to learn the mapping function from the input to the output as follows −
Y = f(x)
The main objective is to approximate the mapping function so well that even when we get new input data (x), we can easily predict the output variable (Y) for that new input data.
It is called supervised learning because the whole learning process can be thought of as being supervised by a teacher or supervisor. Examples of supervised machine learning algorithms include Decision Tree, Random Forest, KNN, and Logistic Regression.
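The tutorial does not prescribe any particular library, so the following is only a minimal sketch, assuming scikit-learn and its built-in Iris dataset, of how a supervised model learns the mapping Y = f(x) from labeled samples and then predicts Y for new inputs.

```python
# Minimal supervised-learning sketch: learn Y = f(x) from labeled samples.
# scikit-learn is assumed to be installed; any comparable library would do.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# x: input variables, Y: known output labels (the "supervision")
x, Y = load_iris(return_X_y=True)
x_train, x_test, Y_train, Y_test = train_test_split(x, Y, test_size=0.3, random_state=42)

# Fit an approximation of the mapping function f on the training data
model = DecisionTreeClassifier(random_state=42)
model.fit(x_train, Y_train)

# Predict the output variable Y for new, unseen input data x
Y_pred = model.predict(x_test)
print("Accuracy:", accuracy_score(Y_test, Y_pred))
```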
Based on the ML task, supervised learning algorithms can be divided into the following two broad classes −
- Classification
- Regression
Classification
The key objective of classification-based tasks is to predict categorical output labels or responses for the given input data, based on what the model has learned during the training phase. Categorical output responses are unordered, discrete values, so each output response belongs to a specific class or category. We will discuss classification and the associated algorithms in detail in the upcoming chapters.
Following are some common classification models −
- Logistic Regression
- K-Nearest Neighbors (KNN)
- Decision Trees
- Naive Bayes
- Linear Discriminant Analysis
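As a concrete illustration of a classification task, the sketch below fits one of the models above, Linear Discriminant Analysis, to predict discrete class labels. The use of scikit-learn and the synthetic dataset are assumptions made purely for demonstration.

```python
# Classification sketch: predict categorical (discrete, unordered) labels.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic data with 3 classes
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LinearDiscriminantAnalysis()
clf.fit(X_train, y_train)

# Each prediction is one of the discrete classes 0, 1, or 2
print("Predicted classes:", clf.predict(X_test[:5]))
print("Test accuracy:", clf.score(X_test, y_test))
```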
Regression
The key objective of regression-based tasks is to predict output labels or responses that are continuous numeric values for the given input data, based on what the model has learned in its training phase. Basically, regression models use the input data features (independent variables) and their corresponding continuous numeric output values (dependent or outcome variables) to learn the specific association between the inputs and the corresponding outputs. We will discuss regression and the associated algorithms in detail in further chapters.
Following are some common regression models −
- Linear Regression
- Polynomial Regression
- Ridge and Lasso Regression
- Support Vector Regression (SVR)
- Decision Tree Regression
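To make the contrast with classification concrete, here is a minimal regression sketch that predicts a continuous numeric value. The linear model, the synthetic data, and scikit-learn are all illustrative assumptions.

```python
# Regression sketch: predict continuous numeric output values.
import numpy as np
from sklearn.linear_model import LinearRegression

# Independent variable (feature) and dependent variable (continuous outcome)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.5 * X.ravel() + 2.0 + rng.normal(0, 1.0, size=100)  # y is roughly 3.5x + 2 plus noise

reg = LinearRegression()
reg.fit(X, y)

print("Learned coefficient:", reg.coef_[0])   # should be close to 3.5
print("Learned intercept:", reg.intercept_)   # should be close to 2.0
print("Prediction for x=4:", reg.predict([[4.0]])[0])
```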
Unsupervised Learning
As the name suggests, unsupervised learning is the opposite of supervised ML methods or algorithms: there is no supervisor to provide any sort of guidance. Unsupervised learning algorithms come in handy when we do not have the luxury of pre-labeled training data, as we do in supervised learning, and we want to extract useful patterns from the input data.
For example, it can be understood as follows −
Suppose we have −

x : Input variables. Then there would be no corresponding output variable, and the algorithm needs to discover the interesting patterns in the data on its own in order to learn.
Examples of unsupervised machine learning algorithms include K-Means clustering, hierarchical clustering, and the Apriori algorithm.
Based on the ML task, unsupervised learning algorithms can be divided into the following broad classes −
- Clustering
- Association
- Dimensionality Reduction
- Anomaly Detection
Clustering
Clustering methods are among the most useful unsupervised ML methods. These algorithms are used to find similarity and relationship patterns among data samples, and then to cluster those samples into groups whose members are similar in terms of their features. A real-world example of clustering is grouping customers by their purchasing behavior.
Following are some common clustering models −
- K-Means Clustering
- Hierarchical Clustering
- DBSCAN
- Gaussian Mixture Models (GMM)
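The sketch below illustrates clustering with K-Means, grouping unlabeled points purely by feature similarity. The synthetic "blobs" dataset and scikit-learn are assumptions made for demonstration.

```python
# Clustering sketch: group unlabeled samples by feature similarity.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Unlabeled data: only input features, no output variable
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster assignments for first 10 samples:", labels[:10])
print("Cluster centers:\n", kmeans.cluster_centers_)
```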
Association
Another useful unsupervised ML method is association, which is used to analyze large datasets to find patterns that represent interesting relationships between various items. It is also termed Association Rule Mining or Market Basket Analysis, and it is mainly used to analyze customer shopping patterns.
Following are some common association models −
- Eclat algorithm
- FP-growth algorithm
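As a simple market-basket illustration, the sketch below counts the support of item pairs across a handful of hypothetical transactions in plain Python. For real datasets, a dedicated implementation of Apriori, Eclat, or FP-growth would be used instead.

```python
# Market-basket sketch: find item pairs that frequently occur together.
from itertools import combinations
from collections import Counter

# Hypothetical customer transactions (each is a set of purchased items)
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
    {"bread", "butter", "milk", "beer"},
]

# Support of a pair = fraction of transactions containing both items
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

min_support = 0.4  # keep pairs present in at least 40% of transactions
for pair, count in pair_counts.most_common():
    support = count / len(transactions)
    if support >= min_support:
        print(f"{pair}: support = {support:.2f}")
```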
Dimensionality Reduction
This unsupervised ML method is used to reduce the number of feature variables for each data sample by selecting a set of principal or representative features. A question that arises here is: why do we need to reduce the dimensionality? The reason is the problem of feature space complexity, which arises when we start analyzing and extracting millions of features from data samples. This problem is generally referred to as the "curse of dimensionality". PCA (Principal Component Analysis), Linear Discriminant Analysis (LDA), and Singular Value Decomposition (SVD) are some of the popular algorithms used for this purpose.
Following are some common dimensionality reduction models −
- Autoencoders
- Singular Value Decomposition (SVD)
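A minimal dimensionality-reduction sketch with PCA is shown below, projecting a 64-feature dataset onto two principal components. The digits dataset and scikit-learn are used purely for illustration.

```python
# Dimensionality-reduction sketch: compress 64 features down to 2 with PCA.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)        # 1797 samples x 64 features
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("Original shape:", X.shape)          # (1797, 64)
print("Reduced shape:", X_reduced.shape)   # (1797, 2)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```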
Anomaly Detection
This unsupervised ML method is used to find occurrences of rare events or observations that generally do not occur. Using what it has learned, an anomaly detection method is able to differentiate between anomalous and normal data points. Some unsupervised algorithms, such as clustering and KNN-based methods, can detect anomalies based on the data and its features.
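The sketch below illustrates anomaly detection with an Isolation Forest, which is one of several possible choices (the text above mentions clustering- and KNN-based approaches); the synthetic data, the contamination rate, and scikit-learn are all illustrative assumptions.

```python
# Anomaly-detection sketch: flag rare points that deviate from normal data.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # typical observations
outliers = rng.uniform(low=6.0, high=8.0, size=(5, 2))   # rare, unusual points
X = np.vstack([normal, outliers])

detector = IsolationForest(contamination=0.03, random_state=0)
labels = detector.fit_predict(X)   # +1 = normal, -1 = anomaly

print("Number of detected anomalies:", int((labels == -1).sum()))
```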
Semi-supervised Learning
Semi-supervised learning algorithms or methods are neither fully supervised nor fully unsupervised; they fall between the two. These kinds of algorithms generally train on a small supervised learning component, i.e., a small amount of pre-labeled, annotated data, and a large unsupervised learning component, i.e., lots of unlabeled data. We can follow either of the following approaches to implement semi-supervised learning methods −
- The first and simplest approach is to build a supervised model on the small amount of labeled, annotated data, then apply it to the large amount of unlabeled data to obtain more labeled samples, train the model on them, and repeat the process (a sketch of this approach follows the list).
- The second approach needs some extra effort. In this approach, we first use unsupervised methods to cluster similar data samples, annotate these groups, and then use a combination of this information to train the model.
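Here is a rough sketch of the first approach, often called pseudo-labeling or self-training. The base classifier, the 0.9 confidence threshold, and the use of scikit-learn are all illustrative assumptions rather than a prescribed implementation.

```python
# Self-training sketch: a small labeled set plus a large unlabeled pool.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y_true = make_classification(n_samples=1000, n_features=10, random_state=0)

# Pretend only the first 50 labels are known; the rest are "unlabeled" (-1).
y = np.full(len(y_true), -1)
y[:50] = y_true[:50]

model = LogisticRegression(max_iter=1000)
for _ in range(5):                       # label-and-retrain loop
    labeled = y != -1
    model.fit(X[labeled], y[labeled])

    # Pseudo-label only the unlabeled samples the model is confident about.
    proba = model.predict_proba(X[~labeled])
    confident = proba.max(axis=1) >= 0.9
    if not confident.any():
        break
    idx = np.where(~labeled)[0][confident]
    y[idx] = model.predict(X[idx])

print("Labeled samples after self-training:", int((y != -1).sum()))
```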
Reinforcement Learning
Reinforcement learning methods are different from the methods discussed so far and are used less frequently. In this kind of learning algorithm, there is an agent that we want to train over a period of time so that it can interact with a specific environment. The agent follows a set of strategies for interacting with the environment; after observing the environment, it takes actions with respect to the environment's current state. The following are the main steps of reinforcement learning methods −
- Step 1 − First, we need to prepare an agent with some initial set of strategies.
- Step 2 − Then, observe the environment and its current state.
- Step 3 − Next, select the optimal policy with respect to the current state of the environment and perform an appropriate action.
- Step 4 − Now, the agent receives a corresponding reward or penalty according to the action it took in the previous step.
- Step 5 − Now, we can update the strategies if required.
- Step 6 − Finally, repeat steps 2-5 until the agent learns and adopts the optimal policy (these steps are illustrated in the sketch after this list).
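The steps above map directly onto tabular Q-learning. The tiny environment below (a 5-state corridor where the agent moves left or right to reach a goal) and all hyperparameters are illustrative assumptions, not part of the tutorial itself.

```python
# Tabular Q-learning sketch on a tiny 1-D corridor (states 0..4, goal at 4).
import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))   # Step 1: start with an empty strategy
alpha, gamma, epsilon = 0.1, 0.9, 0.2
rng = np.random.default_rng(0)

for episode in range(500):
    state = 0
    for _ in range(100):              # cap the episode length
        # Steps 2-3: observe the current state, pick an action (epsilon-greedy,
        # breaking ties randomly so the untrained agent explores both directions)
        if rng.random() < epsilon or np.allclose(Q[state], Q[state].max()):
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(Q[state]))

        next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
        reward = 1.0 if next_state == n_states - 1 else 0.0   # Step 4: reward

        # Step 5: update the strategy (Q-value) using the observed outcome
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])

        state = next_state            # Step 6: repeat until the goal is reached
        if state == n_states - 1:
            break

# Greedy policy for the non-terminal states should point toward the goal
print("Greedy policy per state (0=left, 1=right):", np.argmax(Q[:-1], axis=1))
```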
Following are some common reinforcement learning algorithms −
- Q-learning
- Markov Decision Process (MDP)
- SARSA
- DQN
- DDPG
We will discuss each of the above machine learning models in detail in the upcoming chapters.