Machine Learning Tutorial
Machine Learning - Supervised
Supervised learning algorithms are the most commonly used ML algorithms. During training, such an algorithm takes data samples (the training data) together with the associated output (labels or responses) for each sample. The main objective of a supervised learning algorithm is to learn the association between input data samples and their corresponding outputs after processing many training instances.
For example, we have −
- x − Input variables and
- Y − Output variable
Now, apply an algorithm to learn the mapping function from the input to the output as follows −
Y = f(x)
Now, the main objective is to approximate the mapping function so well that, even when we have new input data (x), we can easily predict the output variable (Y) for that new input.
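As a minimal sketch of this idea, the snippet below "learns" a mapping f from a handful of labeled pairs. The training data and the threshold rule are purely hypothetical, chosen so that inputs below 5 map to 0 and the rest to 1; a real algorithm would learn a far richer function.

```python
# A minimal sketch of approximating the mapping Y = f(x) from labeled
# training pairs. The data and the threshold rule are hypothetical.

def learn_threshold(samples):
    """Pick the midpoint between the largest 0-labeled input and the
    smallest 1-labeled input as the decision threshold."""
    zeros = [x for x, y in samples if y == 0]
    ones = [x for x, y in samples if y == 1]
    return (max(zeros) + min(ones)) / 2

# Training data: input samples x with their associated outputs Y.
training_data = [(1, 0), (2, 0), (4, 0), (6, 1), (7, 1), (9, 1)]
t = learn_threshold(training_data)          # t = 5.0 for this data

f = lambda x: 0 if x < t else 1             # the learned mapping function

print(f(3), f(8))                           # predictions for new inputs: 0 1
```

Once f is learned, predicting the output for a new input is just a function call.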
It is called supervised because the whole learning process can be thought of as being supervised by a teacher or supervisor. Examples of supervised machine learning algorithms include Decision Tree, Random Forest, KNN and Logistic Regression.
Based on the ML task, supervised learning algorithms can be divided into two broad classes − Classification and Regression.
Classification
The key objective of classification-based tasks is to predict categorical output labels or responses for the given input data. The output is based on what the model has learned in its training phase.
Since categorical output responses are unordered, discrete values, each output response belongs to a specific class or category. We will discuss classification and the associated algorithms in detail in later chapters.
Regression
The key objective of regression-based tasks is to predict output labels or responses that are continuous numeric values, for the given input data. The output is based on what the model has learned in its training phase.
Basically, regression models use the input data features (independent variables) and their corresponding continuous numeric output values (dependent or outcome variables) to learn the specific association between inputs and outputs. We will discuss regression and the associated algorithms in detail in later chapters.
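A small illustration of a regression model: fitting a straight line y = a*x + b to training pairs by ordinary least squares and then predicting a continuous value for a new input. The training data below is made up, chosen so the true relationship is exactly Y = 2x + 1.

```python
# Ordinary least-squares fit of a line to hypothetical training data,
# as a minimal example of a regression model.

def fit_line(xs, ys):
    """Return slope a and intercept b of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

# Input features (independent variable) and continuous outputs.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]                # exactly Y = 2x + 1

a, b = fit_line(xs, ys)
predict = lambda x: a * x + b       # the learned regression function

print(predict(5))                   # continuous prediction for x = 5: 11.0
```

Note that the output is a continuous number, not a class label − the defining trait of a regression task.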
Algorithms for Supervised Learning
Supervised learning is one of the most important models for training machines. This chapter discusses it in detail.
There are several algorithms available for supervised learning. Some of the widely used ones are listed below −
- k-Nearest Neighbours
- Decision Trees
- Naive Bayes
- Logistic Regression
- Support Vector Machines
Let us discuss each of these algorithms in detail as we move through this chapter.
k-Nearest Neighbours
The k-Nearest Neighbours algorithm, called kNN for short, is a statistical technique that can be used for solving classification and regression problems. Let us discuss the case of classifying an unknown object using kNN. Consider the distribution of objects shown in the image below −
The diagram shows three types of objects, marked in red, blue and green colors. When you run the kNN classifier on the above dataset, the boundaries for each type of object will be marked as shown below −
Now, consider a new unknown object that you want to classify as red, green or blue. This is depicted in the figure below.
As you can see visually, the unknown data point belongs to the class of blue objects. Mathematically, this can be concluded by measuring the distance from this unknown point to every other point in the dataset. When you do so, you will find that most of its neighbours are blue: the average distance to the red and green objects is definitely greater than the average distance to the blue objects. Thus, this unknown object can be classified as belonging to the blue class.
The kNN algorithm can also be used for regression problems, and it is available ready to use in most ML libraries.
Decision Trees
A simple decision tree in a flowchart format is shown below −
You could write code to classify your input data based on this flowchart. The flowchart is self-explanatory and trivial: in this scenario, you are trying to classify an incoming email to decide when to read it.
In reality, decision trees can be large and complex. There are several algorithms available for creating and traversing these trees, and as a machine learning enthusiast you need to understand and master these techniques.
Naive Bayes
Naive Bayes is used for creating classifiers. Suppose you want to sort out (classify) fruits of different kinds from a fruit basket. You might use features such as the color, size and shape of a fruit; for example, any fruit that is red in color, round in shape and about 10 cm in diameter may be considered an apple. To train the model, you would use these features and test the probability that a given feature matches the desired constraints. The probabilities of the different features are then combined to arrive at the probability that a given fruit is an apple. Naive Bayes generally requires only a small amount of training data for classification.
Logistic Regression
Look at the following diagram. It shows the distribution of data points in the XY plane.
From the diagram, we can visually inspect the separation of the red dots from the green dots, and you could draw a boundary line to separate them. Now, to classify a new data point, you just need to determine on which side of the line the point lies.
Support Vector Machines
Look at the following distribution of data. Here the three classes of data cannot be separated linearly; the boundary curves are non-linear, and in such a case finding the equation of the curve becomes a complex job.
Source: http://uc-r.github.io/svm
Support Vector Machines (SVM) come in handy for determining the separation boundaries in such situations.