Artificial Intelligence With Python: A Concise Tutorial

AI with Python – Machine Learning

Learning means acquiring knowledge or skills through study or experience. Based on this, we can define machine learning (ML) as follows −

Machine learning is a field of computer science, and more specifically an application of artificial intelligence, that gives computer systems the ability to learn from data and improve with experience without being explicitly programmed.

The main focus of machine learning is to let computers learn automatically, without human intervention. How does such learning start and proceed? It starts with observations of data, which can be examples, instructions, or direct experience. The machine then looks for patterns in this input and uses them to make better decisions.

Types of Machine Learning (ML)

Machine learning algorithms help computer systems learn without being explicitly programmed. They are commonly categorized as supervised, unsupervised, or reinforcement learning. Let us now look at each category −

Supervised machine learning algorithms

This is the most commonly used kind of machine learning algorithm. It is called supervised because the process of learning from the training dataset can be thought of as a teacher supervising the learning process. In this kind of ML algorithm, the possible outcomes are already known and the training data is labeled with the correct answers. It can be understood as follows −

Suppose we have input variables x and an output variable Y, and we apply an algorithm to learn the mapping function from the input to the output, such as −

Y = f(x)

The main goal is to approximate the mapping function so well that, when we have new input data (x), we can predict the output variable (Y) for that data.

Supervised learning problems can be divided into the following two kinds −

  1. Classification − A problem is called a classification problem when the output is a category, such as “black”, “teaching”, or “non-teaching”.

  2. Regression − A problem is called a regression problem when the output is a real value, such as a distance or a weight in kilograms.

Decision trees, random forests, KNN, and logistic regression are examples of supervised machine learning algorithms.
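
As a minimal sketch of learning a mapping Y = f(x) from labeled data, the following pure-Python example learns a single threshold on one input variable. The data, the labels, and the helper name `fit_threshold` are hypothetical and chosen only for illustration; real supervised algorithms are considerably more elaborate.

```python
# Labeled training data: each input x comes with its correct answer y.
# The values are invented for illustration (hours of study -> pass/fail).
train_x = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
train_y = ["fail", "fail", "fail", "pass", "pass", "pass"]


def fit_threshold(xs, ys):
    """Learn f(x): predict "pass" when x is above a learned threshold.

    Every midpoint between neighbouring inputs is tried, and the one that
    classifies the most training examples correctly is kept.
    """
    points = sorted(xs)
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]

    def correct_count(threshold):
        predictions = ["pass" if x > threshold else "fail" for x in xs]
        return sum(p == y for p, y in zip(predictions, ys))

    return max(candidates, key=correct_count)


threshold = fit_threshold(train_x, train_y)


def f(x):
    """The learned mapping from input to output."""
    return "pass" if x > threshold else "fail"


print(threshold)        # learned decision boundary
print(f(2.5), f(6.5))   # predictions for new, unseen inputs
```

Because the output here is a category, this is a classification problem; predicting a number instead would make it a regression problem.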

Unsupervised machine learning algorithms

As the name suggests, these machine learning algorithms have no supervisor to provide guidance. That is why unsupervised machine learning algorithms are closely aligned with what some call true artificial intelligence. It can be understood as follows −

Suppose we have input variables x; there are no corresponding output variables, as there are in supervised learning algorithms.

In simple words, unsupervised learning has no correct answers and no teacher to provide guidance. The algorithms help discover interesting patterns in the data.

Unsupervised learning problems can be divided into the following two kinds −

  1. Clustering − In clustering problems, we need to discover the inherent groupings in the data. For example, grouping customers by their purchasing behavior.

  2. Association − In association problems, we need to discover rules that describe large portions of the data. For example, finding that customers who buy x also tend to buy y.

K-means for clustering and the Apriori algorithm for association are examples of unsupervised machine learning algorithms.

Reinforcement machine learning algorithms

These machine learning algorithms are used less often than the other two kinds. They train a system to make specific decisions. The machine is exposed to an environment in which it continually trains itself by trial and error. These algorithms learn from past experience and try to capture the best possible knowledge in order to make accurate decisions. The Markov Decision Process is a common framework for formulating reinforcement learning problems.
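
The trial-and-error idea can be illustrated with a very small sketch. The example below uses an epsilon-greedy strategy on a two-action problem (a "multi-armed bandit"), which is a simplification chosen for brevity and is not a full Markov Decision Process. The reward probabilities, the seed, and all variable names are assumptions made for illustration.

```python
import random

rng = random.Random(0)            # fixed seed so the run is repeatable
true_reward_prob = [0.2, 0.8]     # hidden from the learner (hypothetical environment)
estimates = [0.0, 0.0]            # learner's current estimate of each action's value
counts = [0, 0]                   # how often each action has been tried
epsilon = 0.1                     # fraction of steps spent exploring

for step in range(2000):
    if rng.random() < epsilon:
        action = rng.randrange(2)                   # explore: try a random action
    else:
        action = estimates.index(max(estimates))    # exploit: best action so far
    # The environment answers with a reward of 1 or 0.
    reward = 1.0 if rng.random() < true_reward_prob[action] else 0.0
    counts[action] += 1
    # Incremental average: move the estimate toward the observed reward.
    estimates[action] += (reward - estimates[action]) / counts[action]

best_action = estimates.index(max(estimates))
print(best_action, [round(e, 2) for e in estimates])
```

The learner is never told which action is better; it finds out from the rewards it collects, which is the sense in which it learns from past experience.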

Most Common Machine Learning Algorithms

In this section, we will learn about the most common machine learning algorithms, described below −

Linear Regression

It is one of the best-known algorithms in statistics and machine learning.

Basic concept − Linear regression is a linear model that assumes a linear relationship between the input variables, say x, and a single output variable, say y. In other words, y can be calculated from a linear combination of the input variables x. The relationship between the variables is established by fitting a best-fit line.

Types of Linear Regression

Linear regression is of the following two types −

  1. Simple linear regression − A linear regression model is called simple linear regression if it has only one independent variable.

  2. Multiple linear regression − A linear regression model is called multiple linear regression if it has more than one independent variable.

Linear regression is mainly used to estimate real values from one or more continuous variables. For example, the total sales of a shop in a day can be estimated from real-valued inputs using linear regression.
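
Fitting a best-fit line for simple linear regression can be sketched with the ordinary least-squares formulas, where the slope is the covariance of x and y divided by the variance of x. The function name and the shop data below are hypothetical and serve only as an illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: number of customers in a day -> total sales that day.
customers = [10, 20, 30, 40, 50]
sales = [25.0, 45.0, 65.0, 85.0, 105.0]   # constructed as 2 * customers + 5

slope, intercept = fit_simple_linear_regression(customers, sales)
predicted = slope * 35 + intercept         # estimate for a new input
print(slope, intercept, predicted)
```

With more than one independent variable the same idea applies, but the coefficients are usually found with matrix methods or a library rather than these two formulas.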

Logistic Regression

Logistic regression, also known as logit regression, is a classification algorithm. It is used to estimate discrete values such as 0 or 1, true or false, or yes or no, based on a given set of independent variables. It predicts a probability, so its output lies between 0 and 1; the probability is typically compared with a threshold such as 0.5 to obtain the discrete class.
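
How a probability between 0 and 1 turns into a discrete prediction can be sketched as follows. The logistic (sigmoid) function is standard; the weights, the bias, and the function names are hand-picked assumptions for illustration, whereas in practice the coefficients are learned from training data.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Probability of class 1 for one example under a logistic model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_class(features, weights, bias, threshold=0.5):
    """Turn the probability into a discrete 0/1 prediction."""
    return 1 if predict_probability(features, weights, bias) >= threshold else 0


# Hypothetical coefficients, not fitted to any real data.
weights = [1.5, -2.0]
bias = 0.25

print(predict_probability([2.0, 0.5], weights, bias))
print(predict_class([2.0, 0.5], weights, bias))
print(predict_class([0.0, 2.0], weights, bias))
```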

Decision Tree

A decision tree is a supervised learning algorithm that is mostly used for classification problems.

It is a classifier expressed as a recursive partition of the data based on the independent variables. A decision tree consists of nodes that form a rooted tree, that is, a directed tree with a node called the “root”. The root has no incoming edges, and every other node has exactly one incoming edge. Nodes with outgoing edges are internal (test) nodes; all remaining nodes are called leaves, also known as terminal or decision nodes. For example, consider the following decision tree, which determines whether a person is fit or not.

[Figure: decision tree for deciding whether a person is fit]
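
The rooted-tree structure described above can be sketched with nested dictionaries. The questions and outcomes in this tree are hypothetical and are not taken from the original figure; the tree is written by hand here, whereas a decision tree algorithm would learn it from labeled data.

```python
# A hypothetical decision tree for "is this person fit?".
# Internal (test) nodes ask a yes/no question; leaves hold the final decision.
tree = {
    "question": "age_under_30",          # root: the only node with no incoming edge
    "yes": {
        "question": "eats_lots_of_pizza",
        "yes": {"decision": "unfit"},
        "no": {"decision": "fit"},
    },
    "no": {
        "question": "exercises_in_morning",
        "yes": {"decision": "fit"},
        "no": {"decision": "unfit"},
    },
}


def classify(node, person):
    """Walk from the root to a leaf, following the answer at each test node."""
    while "decision" not in node:
        answer = person[node["question"]]
        node = node["yes"] if answer else node["no"]
    return node["decision"]


person = {"age_under_30": False, "exercises_in_morning": True}
print(classify(tree, person))
```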

Support Vector Machine (SVM)

It is used for both classification and regression problems, but mainly for classification. The main concept of SVM is to plot each data item as a point in n-dimensional space, with the value of each feature being the value of a particular coordinate. Here n is the number of features. Following is a simple graphical representation of the concept of SVM −

[Figure: support vector machine, two groups of points separated by a line]

In the above diagram, we have two features, so we first plot the data in two-dimensional space, where each point has two coordinates. The line splits the data into two classified groups and acts as the classifier. The data points that lie closest to this line are called support vectors.
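
A rough sketch of how a separating line classifies points in a two-feature space is given below. It only evaluates a line that was chosen by hand; it does not train an SVM. The line coefficients `w` and `b`, the points, and the function names are all assumptions made for illustration.

```python
import math

# Hypothetical separating line w . x + b = 0 in a two-feature space.
w = (1.0, 1.0)
b = -5.0

points = [(1.0, 1.0), (2.0, 2.0), (1.0, 3.0), (4.0, 4.0), (3.0, 3.0), (5.0, 2.0)]


def decision_value(x):
    """Signed score: positive on one side of the line, negative on the other."""
    return w[0] * x[0] + w[1] * x[1] + b


def classify(x):
    """Assign the point to group +1 or -1 depending on its side of the line."""
    return 1 if decision_value(x) >= 0 else -1


def distance_to_line(x):
    """Perpendicular distance from a point to the separating line."""
    return abs(decision_value(x)) / math.hypot(w[0], w[1])


labels = [classify(p) for p in points]
# The points closest to the line play the role of support vectors.
closest = min(distance_to_line(p) for p in points)
support_vectors = [p for p in points if math.isclose(distance_to_line(p), closest)]
print(labels)
print(support_vectors)
```

Training an SVM amounts to searching for the line whose distance to these closest points is as large as possible.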

Naïve Bayes

It is also a classification technique. The logic behind it is to use Bayes' theorem to build classifiers, under the assumption that the predictors are independent. In simple words, it assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Below is the equation for Bayes' theorem −

P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

The Naïve Bayes model is easy to build and particularly useful for large data sets.
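
A small worked example may help make Bayes' theorem and the independence assumption concrete. The spam-filter scenario, all probabilities, and the function names below are invented for illustration only.

```python
def posterior(prior_a, likelihood_b_given_a, evidence_b):
    """Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)."""
    return likelihood_b_given_a * prior_a / evidence_b


# Hypothetical numbers for a spam filter.
p_spam = 0.2                 # P(A): an e-mail is spam
p_word_given_spam = 0.6      # P(B | A): the word "offer" appears in a spam e-mail
p_word_given_ham = 0.05      # P(B | not A): the word appears in a non-spam e-mail
# P(B) by the law of total probability.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word)
print(round(p_spam_given_word, 3))


def naive_bayes_score(prior, feature_likelihoods):
    """Naive assumption: features are independent given the class,
    so their likelihoods are simply multiplied together."""
    score = prior
    for likelihood in feature_likelihoods:
        score *= likelihood
    return score


# Two observed words, with hypothetical per-class likelihoods.
spam_score = naive_bayes_score(0.2, [0.6, 0.3])
ham_score = naive_bayes_score(0.8, [0.05, 0.1])
predicted = "spam" if spam_score > ham_score else "ham"
print(predicted)
```

The two scores share the same denominator P(B), so comparing them directly is enough to pick the more probable class.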

K-Nearest Neighbors (KNN)

It is used for both classification and regression problems, and is most widely used for classification. The main concept of this algorithm is that it stores all the available cases and classifies a new case by a majority vote of its k neighbors. The new case is assigned to the class that is most common among its k nearest neighbors, as measured by a distance function. The distance function can be the Euclidean, Minkowski, or Hamming distance. Consider the following when using KNN −

  1. KNN is computationally more expensive than many other algorithms used for classification problems.

  2. The variables should be normalized; otherwise, variables with a larger range can bias the result.

  3. KNN benefits from work at the pre-processing stage, such as noise removal.
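
The majority-vote idea can be sketched in a few lines of pure Python using the Euclidean distance. The function name `knn_classify` and the sample points are hypothetical; the two features are deliberately on the same scale, so no normalization step is shown.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored case.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the k neighbours wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical stored cases with two features each.
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(train_points, train_labels, (2, 2)))
print(knn_classify(train_points, train_labels, (6, 5)))
```

Every prediction scans all stored cases, which is where the computational cost mentioned in the list above comes from.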

K-Means Clustering

As the name suggests, it is used to solve clustering problems. It is a type of unsupervised learning. The main logic of the K-means clustering algorithm is to partition the data set into a number of clusters. Follow these steps to form clusters with K-means −

  1. K-means picks k points, known as centroids, one for each cluster.

  2. Each data point is assigned to the cluster of its closest centroid, which gives k clusters.

  3. The centroid of each cluster is recomputed from its current members.

  4. Steps 2 and 3 are repeated until convergence, that is, until the centroids no longer change.
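
The steps above can be sketched in pure Python as follows. In this sketch the starting centroids are passed in by the caller (practical implementations usually choose them at random), and the function name `k_means`, the iteration cap, and the sample points are assumptions made for illustration.

```python
import math


def k_means(points, centroids, max_iterations=100):
    """Run k-means from the given starting centroids until they stop moving."""
    centroids = list(centroids)          # step 1: the k initial centroids
    for _ in range(max_iterations):
        # Step 2: assign each point to the cluster of its closest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Step 3: recompute each centroid as the mean of its cluster members.
        new_centroids = []
        for members, old in zip(clusters, centroids):
            if members:
                new_centroids.append(
                    tuple(sum(coords) / len(members) for coords in zip(*members)))
            else:
                new_centroids.append(old)    # keep an empty cluster's centroid in place
        # Step 4: stop once the centroids no longer change (convergence).
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
final_centroids, clusters = k_means(points, [points[0], points[3]])
print(final_centroids)
print(clusters)
```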

Random Forest

It is a supervised learning algorithm that can be used for both classification and regression problems. It is a collection of decision trees (hence “forest”), or in other words an ensemble of decision trees. The basic concept of random forest is that each tree gives a classification and the forest chooses the classification with the most votes. The following are the advantages of the random forest algorithm −

  1. A random forest can be used for both classification and regression tasks.

  2. It can handle missing values.

  3. Adding more trees to the forest does not, in general, make the model overfit.
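
The voting step of a random forest can be sketched as below. The three "trees" are hand-written rules with hypothetical feature names, standing in for decision trees that a real random forest would each learn from a random sample of the training data; only the voting is illustrated.

```python
from collections import Counter


# Three hypothetical "trees", each a simple rule on a feature dictionary.
def tree_1(sample):
    return "fit" if sample["exercises"] else "unfit"


def tree_2(sample):
    return "fit" if sample["age"] < 30 else "unfit"


def tree_3(sample):
    return "unfit" if sample["smokes"] else "fit"


forest = [tree_1, tree_2, tree_3]


def forest_predict(sample):
    """Each tree votes; the forest returns the most common classification."""
    votes = [tree(sample) for tree in forest]
    return Counter(votes).most_common(1)[0][0]


sample = {"exercises": True, "age": 45, "smokes": False}
print(forest_predict(sample))
```

For regression tasks the same structure applies, with the trees' numeric outputs averaged instead of counted.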