Machine Learning With Python: A Concise Tutorial

Machine Learning - Basic Concepts

Machine learning, as we know, is a subset of artificial intelligence that involves training computer algorithms to automatically learn patterns and relationships in data. Here are some basic concepts of machine learning −

Data

Data is the foundation of machine learning. Without data, there would be nothing for the algorithm to learn from. Data can come in many forms, including structured data (such as spreadsheets and databases) and unstructured data (such as text and images). The quality and quantity of the data used to train the machine learning algorithm are crucial factors that can significantly impact its performance.

Feature

In machine learning, features are the variables or attributes used to describe the input data. The goal is to select the most relevant and informative features that will allow the algorithm to make accurate predictions or decisions. Feature selection is a crucial step in the machine learning process because the performance of the algorithm is heavily dependent on the quality and relevance of the features used.
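One simple way to judge how relevant a feature is, offered here as an illustrative sketch rather than a method prescribed by this tutorial, is to rank each feature by the absolute value of its Pearson correlation with the target. The toy housing data, the feature names and the `rank_features` helper are all assumptions made up for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / sqrt(var_x * var_y)


def rank_features(rows, target, names):
    """Sort feature names by |correlation with the target|, strongest first."""
    columns = list(zip(*rows))  # turn rows into one tuple per feature
    scores = {name: abs(pearson(col, target)) for name, col in zip(names, columns)}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: each row is (size_m2, n_rooms, door_color_code).
rows = [
    (50, 2, 3),
    (70, 3, 1),
    (90, 3, 2),
    (110, 4, 3),
    (130, 5, 1),
]
prices = [150, 210, 260, 330, 390]  # made-up target values
feature_names = ["size_m2", "n_rooms", "door_color_code"]

ranking = rank_features(rows, prices, feature_names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Correlation only captures linear, one-feature-at-a-time relationships, so a ranking like this is a first filter and not a substitute for evaluating the trained model.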

Model

A machine learning model is a mathematical representation of the relationship between the input data (features) and the output (predictions or decisions). The model is created using a training dataset and then evaluated using a separate validation dataset. The goal is to create a model that can accurately generalize to new, unseen data.
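As a minimal sketch of this idea, the model below is a straight line `y = slope * x + intercept` fitted by ordinary least squares on a training set and then scored on a separate validation set. The data values and the helper names `fit_line` and `mean_squared_error` are assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared gap between the line's predictions and the targets."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data that roughly follows y = 2x + 1.
train_x = [1, 2, 3, 4, 5, 6]
train_y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]
val_x = [7, 8]            # held out: never used by fit_line
val_y = [15.2, 16.8]

slope, intercept = fit_line(train_x, train_y)
train_mse = mean_squared_error(train_x, train_y, slope, intercept)
val_mse = mean_squared_error(val_x, val_y, slope, intercept)

print(f"model: y = {slope:.3f} * x + {intercept:.3f}")
print(f"training MSE:   {train_mse:.4f}")
print(f"validation MSE: {val_mse:.4f}")
```

A validation error close to the training error suggests that the fitted relationship generalizes beyond the points it was fitted on.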

Training

Training is the process of teaching the machine learning algorithm to make accurate predictions or decisions. This is done by providing the algorithm with a large dataset and allowing it to learn from the patterns and relationships in the data. During training, the algorithm adjusts its internal parameters to minimize the difference between its predicted output and the actual output.
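The parameter adjustment described above can be sketched with gradient descent on a mean-squared-error loss. This is one common training procedure, used here as an assumption because the text does not name a specific one. The data, the learning rate and the epoch count are likewise illustrative choices.

```python
def train(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0  # internal parameters, starting from an arbitrary guess
    n = len(xs)
    history = []     # loss before each update, to show that it shrinks
    for _ in range(epochs):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        history.append(sum(e * e for e in errors) / n)
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Step against the gradient to reduce the loss.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, history


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

w, b, history = train(xs, ys)
print(f"learned w = {w:.4f}, b = {b:.4f}")
print(f"loss at start = {history[0]:.4f}, loss at end = {history[-1]:.2e}")
```

The learning rate matters: a value that is too large makes the updates overshoot and diverge, while a value that is too small makes training needlessly slow.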

Testing

Testing is the process of evaluating the performance of the machine learning algorithm on a separate dataset that it has not seen before. The goal is to determine how well the algorithm generalizes to new, unseen data. If the algorithm performs well on the testing dataset, it is considered to be a successful model.
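A hedged sketch of this workflow follows: the data is shuffled and split, a model is fitted on the training part only, and accuracy is measured on the held-out part. The nearest-centroid classifier, the 25% test fraction, the seed and the toy data are assumptions picked to keep the example small.

```python
import random


def train_test_split(samples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the samples and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_centroids(train):
    """Learn one centroid (mean value) per class label."""
    totals, counts = {}, {}
    for value, label in train:
        totals[label] = totals.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: totals[label] / counts[label] for label in totals}


def predict(centroids, value):
    """Assign the label whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))


def accuracy(centroids, samples):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(predict(centroids, value) == label for value, label in samples)
    return correct / len(samples)


# Hypothetical one-feature dataset with two well-separated classes.
samples = [(1.0 + 0.2 * i, 0) for i in range(10)] + [(7.0 + 0.2 * i, 1) for i in range(10)]

train_set, test_set = train_test_split(samples)
centroids = fit_centroids(train_set)  # the test set is not touched here
test_accuracy = accuracy(centroids, test_set)
print(f"train size = {len(train_set)}, test size = {len(test_set)}")
print(f"test accuracy = {test_accuracy:.2f}")
```

Because the two classes in this toy data are far apart, the test score is high. Real data is rarely this clean, which is exactly why the score must come from data the model has not seen.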

Overfitting

Overfitting occurs when a machine learning model is too complex and fits the training data too closely. This can lead to poor performance on new, unseen data because the model is too specialized to the training dataset. To prevent overfitting, it is important to use a validation dataset to evaluate the model’s performance and to use regularization techniques to simplify the model.
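To make the symptom concrete, the sketch below compares two models on the same noisy data: a "memorizer" that returns the target of the nearest training point, standing in for an overly complex model, and a simple fitted line. Everything here is an assumption for illustration. In particular, a fixed alternating offset of ±1.5 stands in for random noise so that the run is reproducible.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(predict, xs, ys):
    """Mean squared error of a prediction function on a dataset."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# The underlying trend is y = 2x; the +/-1.5 offsets play the role of noise.
train_x = list(range(10))
train_y = [2 * x + (1.5 if x % 2 == 0 else -1.5) for x in train_x]
val_x = [k + 0.25 for k in range(9)]
val_y = [2 * x + (-1.5 if k % 2 == 0 else 1.5) for k, x in enumerate(val_x)]


def memorizer(x):
    """Overly flexible model: repeat the target of the closest training point."""
    _, nearest_y = min(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x))
    return nearest_y


slope, intercept = fit_line(train_x, train_y)


def line(x):
    """Simple model: a single straight line."""
    return slope * x + intercept


results = {
    "memorizer": (mse(memorizer, train_x, train_y), mse(memorizer, val_x, val_y)),
    "line": (mse(line, train_x, train_y), mse(line, val_x, val_y)),
}
for name, (train_error, val_error) in results.items():
    print(f"{name}: training MSE = {train_error:.2f}, validation MSE = {val_error:.2f}")
```

The memorizer reaches zero training error because it reproduces the noise, but its validation error is far higher than the line's. A large gap between training and validation error is the signal to look for.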

Underfitting

Underfitting occurs when a machine learning model is too simple to capture the patterns and relationships in the data. This can lead to poor performance on both the training and testing datasets. To prevent underfitting, we can use several techniques, such as increasing model complexity, collecting more data, reducing regularization, and applying feature engineering.

It is important to note that preventing underfitting is a balancing act between model complexity and the amount of data available. Increasing model complexity can help prevent underfitting, but if there is not enough data to support the increased complexity, overfitting may occur instead. Therefore, it is important to monitor the model’s performance and adjust the complexity as necessary.
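As an illustrative sketch, the example below fits a straight line to data that actually follows a curve, which underfits on both the training and the test set, and then repeats the fit on an engineered feature `x ** 2`. The quadratic target and all names are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def line_mse(xs, ys, slope, intercept):
    """Mean squared error of the line on a dataset."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data generated from y = x ** 2.
train_x = [-3, -2, -1, 0, 1, 2, 3]
test_x = [-2.5, -1.5, 1.5, 2.5]
train_y = [x * x for x in train_x]
test_y = [x * x for x in test_x]

# Model A: a straight line on the raw feature x (too simple for a curve).
slope_a, intercept_a = fit_line(train_x, train_y)
underfit_train = line_mse(train_x, train_y, slope_a, intercept_a)
underfit_test = line_mse(test_x, test_y, slope_a, intercept_a)

# Model B: the same algorithm on the engineered feature z = x ** 2.
train_z = [x * x for x in train_x]
test_z = [x * x for x in test_x]
slope_b, intercept_b = fit_line(train_z, train_y)
improved_train = line_mse(train_z, train_y, slope_b, intercept_b)
improved_test = line_mse(test_z, test_y, slope_b, intercept_b)

print(f"raw x:      training MSE = {underfit_train:.4f}, test MSE = {underfit_test:.4f}")
print(f"x squared:  training MSE = {improved_train:.4f}, test MSE = {improved_test:.4f}")
```

The telltale sign of underfitting is that the error is high on the training data itself, not only on the test data.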

Why & When to Make Machines Learn?

We have already discussed the need for machine learning, but another question arises: in what scenarios should we make machines learn? There are several circumstances in which we need machines to make data-driven decisions efficiently and at a huge scale. The following are some circumstances in which making machines learn is more effective −

Lack of human expertise

The first scenario in which we want a machine to learn and make data-driven decisions is a domain where human expertise is lacking. Examples include navigation in unknown territories or on other planets.

Dynamic scenarios

Some scenarios are dynamic in nature, i.e. they keep changing over time. For such scenarios and behaviors, we want a machine to learn and make data-driven decisions. Examples include network connectivity and the availability of infrastructure in an organization.

Difficulty in translating expertise into computational tasks

There are various domains in which humans have expertise but are unable to translate that expertise into computational tasks. In such circumstances, we want machines to learn. Examples include speech recognition and cognitive tasks.

Machine Learning Model

Before discussing the machine learning model, we need to understand the following formal definition of ML given by Professor Mitchell −

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

The above definition focuses on three parameters, which are also the main components of any learning algorithm, namely task (T), performance (P) and experience (E). In this context, we can simplify the definition as −

ML is a field of AI consisting of learning algorithms that −

Improve their performance (P)
At executing some task (T)
Over time with experience (E)
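The three components can be sketched in a few lines. In this hypothetical setup, the task T is predicting `x ** 2` for unseen inputs, the experience E is a growing set of labeled examples, and the performance P is the mean absolute error on fixed test points (lower is better). The nearest-neighbor learner and all values are assumptions chosen to keep the illustration small.

```python
def target(x):
    """Task T (hypothetical): predict this function's value for a new x."""
    return x * x


def learn(experience):
    """Build a predictor from experience E, a list of (x, y) examples."""
    def predict(x):
        # Answer with the target of the closest example seen so far.
        _, nearest_y = min(experience, key=lambda pair: abs(pair[0] - x))
        return nearest_y
    return predict


def performance(predict, test_points):
    """Performance P: mean absolute error on the test points (lower is better)."""
    return sum(abs(predict(x) - y) for x, y in test_points) / len(test_points)


test_points = [(x, target(x)) for x in (0.4, 1.4, 2.4, 3.4)]

# Progressively larger experience: 2, 3, 5, then 9 labeled examples.
example_inputs = [
    [0, 4],
    [0, 2, 4],
    [0, 1, 2, 3, 4],
    [0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4],
]

errors = []
for inputs in example_inputs:
    experience = [(x, target(x)) for x in inputs]
    errors.append(performance(learn(experience), test_points))
    print(f"examples seen = {len(inputs)}, mean absolute error = {errors[-1]:.2f}")
```

The error shrinks as the number of examples grows, which is the sense in which the program "learns" under this definition.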

Based on the above, the following diagram represents a Machine Learning Model −

Let us now discuss each of them in more detail −

Task (T)

From the perspective of the problem, we may define the task T as the real-world problem to be solved. The problem can be anything, such as finding the best house price in a specific location or finding the best marketing strategy. In machine learning, however, the definition of a task is different, because ML-based tasks are difficult to solve with a conventional programming approach.

A task T is said to be an ML-based task when it is defined by the process that the system must follow to operate on data points. Examples of ML-based tasks are classification, regression, structured annotation, clustering, transcription etc.

Experience (E)

As the name suggests, experience is the knowledge gained from the data points provided to the algorithm or model. Once provided with the dataset, the model runs iteratively and learns some inherent patterns. The learning thus acquired is called experience (E). By analogy with human learning, we can think of this as a human being learning or gaining experience from various attributes such as situations, relationships etc. Supervised, unsupervised and reinforcement learning are some of the ways to learn or gain experience. The experience gained by our ML model or algorithm is then used to solve the task T.

Performance (P)

An ML algorithm is supposed to perform a task and gain experience over time. The measure that tells whether the ML algorithm is performing as expected is its performance (P). P is basically a quantitative metric that tells how well a model performs the task T using its experience E. Many metrics help to assess ML performance, such as accuracy score, F1 score, confusion matrix, precision, recall, sensitivity etc.
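Several of these metrics can be computed directly from the four cells of a binary confusion matrix, as the sketch below shows. The label lists and the helper names `confusion_counts` and `summarize_metrics` are assumptions for illustration. Note that sensitivity is another name for recall.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def summarize_metrics(y_true, y_pred):
    """Accuracy, precision, recall (sensitivity) and F1 for binary labels."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical true labels and model predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
metrics = summarize_metrics(y_true, y_pred)
print("TP, FP, FN, TN =", counts)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Which metric to prioritize depends on the task: precision matters more when false positives are costly, recall when false negatives are.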