Machine Learning With Python - Quick Guide

Machine Learning with Python - Basics

We are living in the ‘age of data’, enriched with better computational power and larger storage resources. This data or information is increasing day by day, but the real challenge is to make sense of all of it. Businesses and organizations are trying to deal with it by building intelligent systems using concepts and methodologies from data science, data mining and machine learning. Among them, machine learning is the most exciting field of computer science. It would not be wrong to call machine learning the application and science of algorithms that provide sense to data.

What is Machine Learning?

Machine Learning (ML) is that field of computer science with the help of which computer systems can make sense of data in much the same way as human beings do.

In simple words, ML is a type of artificial intelligence that extracts patterns out of raw data by using an algorithm or method. The main focus of ML is to allow computer systems to learn from experience without being explicitly programmed or requiring human intervention.

Need for Machine Learning

Human beings, at this moment, are the most intelligent and advanced species on earth because they can think, evaluate and solve complex problems. On the other side, AI is still in its initial stage and has not surpassed human intelligence in many aspects. The question, then, is why we should make machines learn. The most suitable reason for doing this is “to make decisions, based on data, with efficiency and scale”.

Lately, organizations have been investing heavily in newer technologies like Artificial Intelligence, Machine Learning and Deep Learning to get key information from data, perform several real-world tasks and solve problems. We can call these data-driven decisions taken by machines, particularly to automate the process. Such data-driven decisions can be used, instead of programming logic, in problems that cannot be programmed inherently. The fact is that we cannot do without human intelligence, but the other aspect is that we all need to solve real-world problems with efficiency at a huge scale. That is why the need for machine learning arises.

Why & When to Make Machines Learn?

We have already discussed the need for machine learning, but another question arises: in what scenarios must we make a machine learn? There can be several circumstances where we need machines to take data-driven decisions with efficiency and at a huge scale. The following are some of the circumstances where making machines learn would be more effective −

Lack of human expertise

The very first scenario in which we want a machine to learn and take data-driven decisions is a domain where there is a lack of human expertise. Examples can be navigation in unknown territories or spatial planets.

Dynamic scenarios

There are some scenarios which are dynamic in nature, i.e. they keep changing over time. For these scenarios and behaviors, we want a machine to learn and take data-driven decisions. Some examples are network connectivity and availability of infrastructure in an organization.

Difficulty in translating expertise into computational tasks

There can be various domains in which humans have expertise; however, they are unable to translate this expertise into computational tasks. In such circumstances we want machine learning. Examples are the domains of speech recognition, cognitive tasks etc.

Machine Learning Model

Before discussing the machine learning model, we must understand the following formal definition of ML given by Professor Mitchell −

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

The above definition basically focuses on three parameters, which are also the main components of any learning algorithm, namely Task (T), Performance (P) and Experience (E). In this context, we can simplify this definition as −

ML is a field of AI consisting of learning algorithms that −

  1. Improve their performance (P)

  2. At executing some task (T)

  3. Over time with experience (E)

Based on the above, the following diagram represents a Machine Learning Model −

[Diagram: a Machine Learning Model built from Task (T), Experience (E) and Performance (P)]

Let us now discuss them in more detail −

Task (T)

From the perspective of the problem, we may define the task T as the real-world problem to be solved. The problem can be anything like finding the best house price in a specific location or finding the best marketing strategy etc. On the other hand, if we talk about machine learning, the definition of task is different because it is difficult to solve ML-based tasks by a conventional programming approach.

A task T is said to be an ML-based task when it is based on the process the system must follow for operating on data points. Examples of ML-based tasks are classification, regression, structured annotation, clustering, transcription etc.

Experience (E)

As the name suggests, it is the knowledge gained from the data points provided to the algorithm or model. Once provided with the dataset, the model will run iteratively and learn some inherent pattern. The learning thus acquired is called experience (E). Making an analogy with human learning, we can think of this as the situation in which a human being is learning or gaining some experience from various attributes like situations, relationships etc. Supervised, unsupervised and reinforcement learning are some ways to learn or gain experience. The experience gained by our ML model or algorithm will be used to solve the task T.

Performance (P)

An ML algorithm is supposed to perform the task and gain experience with the passage of time. The measure which tells whether the ML algorithm is performing as per expectation or not is its performance (P). P is basically a quantitative metric that tells how well a model is performing the task, T, using its experience, E. There are many metrics that help to understand ML performance, such as accuracy score, F1 score, confusion matrix, precision, recall, sensitivity etc.
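
As a small illustration (not part of the original scripts of this tutorial), the following sketch computes two of the metrics mentioned above, accuracy score and confusion matrix, with scikit-learn. The label lists y_true and y_pred are made-up values used only to show the calls −

# A minimal sketch: computing the accuracy score and the confusion matrix
# for a set of made-up actual and predicted labels.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels (illustrative values)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # labels predicted by some model

print(accuracy_score(y_true, y_pred))     # fraction of correct predictions
print(confusion_matrix(y_true, y_pred))   # rows: actual class, columns: predicted class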

Challenges in Machine Learning

While machine learning is rapidly evolving, making significant strides in cybersecurity and autonomous cars, this segment of AI as a whole still has a long way to go. The reason behind this is that ML has not been able to overcome a number of challenges. The challenges that ML is facing currently are −

Quality of data − Having good-quality data for ML algorithms is one of the biggest challenges. Use of low-quality data leads to problems related to data preprocessing and feature extraction.

Time-consuming task − Another challenge faced by ML models is the consumption of time, especially for data acquisition, feature extraction and retrieval.

Lack of specialist persons − As ML technology is still in its infancy stage, availability of expert resources is a tough job.

No clear objective for formulating business problems − Having no clear objective and well-defined goal for business problems is another key challenge for ML because this technology is not that mature yet.

Issue of overfitting & underfitting − If the model is overfitting or underfitting, it cannot represent the problem well.

Curse of dimensionality − Another challenge an ML model faces is having too many features in the data points. This can be a real hindrance.

Difficulty in deployment − Complexity of the ML model makes it quite difficult to deploy in real life.

Applications of Machine Learning

Machine learning is the most rapidly growing technology and, according to researchers, we are in the golden age of AI and ML. It is used to solve many complex real-world problems which cannot be solved with a traditional approach. Following are some real-world applications of ML −

  1. Emotion analysis

  2. Sentiment analysis

  3. Error detection and prevention

  4. Weather forecasting and prediction

  5. Stock market analysis and forecasting

  6. Speech synthesis

  7. Speech recognition

  8. Customer segmentation

  9. Object recognition

  10. Fraud detection

  11. Fraud prevention

  12. Recommendation of products to customers in online shopping.

Machine Learning with Python - Ecosystem

An Introduction to Python

Python is a popular object-oriented programming language having the capabilities of a high-level programming language. Its easy-to-learn syntax and portability make it popular these days. The following facts give us an introduction to Python −

  1. Python was developed by Guido van Rossum at Stichting Mathematisch Centrum in the Netherlands.

  2. It was written as the successor of programming language named ‘ABC’.

  3. Its first version was released in 1991.

  4. The name Python was picked by Guido van Rossum from a TV show named Monty Python’s Flying Circus.

  5. It is an open source programming language which means that we can freely download it and use it to develop programs. It can be downloaded from www.python.org.

  6. The Python programming language has the features of both Java and C. It has the elegant code of ‘C’ and, on the other hand, it has classes and objects like Java for object-oriented programming.

  7. It is an interpreted language, which means the source code of Python program would be first converted into bytecode and then executed by Python virtual machine.

Strengths and Weaknesses of Python

Every programming language has some strengths as well as weaknesses, and so does Python.

Strengths

According to studies and surveys, Python is the fifth most important language as well as the most popular language for machine learning and data science. It is because of the following strengths that Python has −

Easy to learn and understand − The syntax of Python is simpler; hence it is relatively easy, even for beginners, to learn and understand the language.

Multi-purpose language − Python is a multi-purpose programming language because it supports structured programming, object-oriented programming as well as functional programming.

Huge number of modules − Python has a huge number of modules covering every aspect of programming. These modules are easily available for use, hence making Python an extensible language.

Support of open source community − Being an open source programming language, Python is supported by a very large developer community. Due to this, bugs are easily fixed by the Python community. This characteristic makes Python very robust and adaptive.

Scalability − Python is a scalable programming language because it provides an improved structure for supporting large programs compared to shell scripts.

Weakness

Although Python is a popular and powerful programming language, it has its own weakness of slow execution speed.

The execution speed of Python is slow compared to compiled languages because Python is an interpreted language. This can be the major area of improvement for the Python community.

Installing Python

For working in Python, we must first install it. You can perform the installation of Python in either of the following two ways −

  1. Installing Python individually

  2. Using Pre-packaged Python distribution − Anaconda

Let us discuss each of these in detail.

Installing Python Individually

If you want to install Python on your computer, then you need to download only the binary code applicable for your platform. Python distributions are available for Windows, Linux and Mac platforms.

The following is a quick overview of installing Python on the above-mentioned platforms −

On Unix and Linux platform

With the help of the following steps, we can install Python on Unix and Linux platforms −

  1. First, go to https://www.python.org/downloads/.

  2. Next, click on the link to download zipped source code available for Unix/Linux.

  3. Now, download and extract the files.

  4. Next, we can edit the Modules/Setup file if we want to customize some options.

  5. Finally, build and install Python by running the following commands −

./configure
make
make install

On Windows platform

With the help of the following steps, we can install Python on the Windows platform −

  1. First, go to https://www.python.org/downloads/.

  2. Next, click on the link for Windows installer python-XYZ.msi file. Here XYZ is the version we wish to install.

  3. Now, we must run the file that is downloaded. It will take us to the Python install wizard, which is easy to use. Now, accept the default settings and wait until the install is finished.

On Macintosh platform

For Mac OS X, Homebrew, a great and easy-to-use package installer, is recommended to install Python 3. In case you don’t have Homebrew, you can install it with the help of the following command −

$ ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

It can be updated with the command below −

$ brew update

Now, to install Python 3 on your system, we need to run the following command −

$ brew install python3

Using Pre-packaged Python Distribution: Anaconda

Anaconda is a packaged compilation of Python which has all the libraries widely used in data science. We can follow the steps below to set up a Python environment using Anaconda −

Step1 − First, we need to download the required installation package from the Anaconda distribution. The links for the same are https://www.anaconda.com/products/individual and https://www.anaconda.com/distribution/. You can choose from Windows, Mac and Linux OS as per your requirement.

Step2 − Next, select the Python version you want to install on your machine. The latest Python version is 3.7. There you will get options for both 64-bit and 32-bit graphical installers.

Step3 − After selecting the OS and Python version, it will download the Anaconda installer on your computer. Now, double click the file and the installer will install the Anaconda package.

Step4 − For checking whether it is installed or not, open a command prompt and type python as follows −

C:\>python

You can also check this by watching the detailed video lecture at https://www.tutorialspoint.com/python_essentials_online_training/getting_started_with_anaconda.asp.

Why Python for Data Science?

Python is the fifth most important language as well as the most popular language for machine learning and data science. The following are the features of Python that make it the preferred choice of language for data science −

Extensive set of packages

Python has an extensive and powerful set of packages which are ready to be used in various domains. It also has packages like numpy, scipy, pandas, scikit-learn etc. which are required for machine learning and data science.

Easy prototyping

Another important feature of Python that makes it the choice of language for data science is easy and fast prototyping. This feature is useful for developing new algorithms.

Collaboration feature

The field of data science basically needs good collaboration, and Python provides many useful tools that make collaboration extremely easy.

One language for many domains

A typical data science project includes various domains like data extraction, data manipulation, data analysis, feature extraction, modelling, evaluation, deployment and updating the solution. As Python is a multi-purpose language, it allows the data scientist to address all these domains from a common platform.

Components of Python ML Ecosystem

In this section, let us discuss some core data science libraries that form the components of the Python machine learning ecosystem. These useful components make Python an important language for data science. Though there are many such components, let us discuss some of the important components of the Python ecosystem here −

Jupyter Notebook

Jupyter notebooks basically provide an interactive computational environment for developing Python-based data science applications. They were formerly known as IPython notebooks. The following are some of the features of Jupyter notebooks that make it one of the best components of the Python ML ecosystem −

  1. Jupyter notebooks can illustrate the analysis process step by step by arranging code, images, text, output etc. in an orderly manner.

  2. It helps a data scientist to document the thought process while developing the analysis process.

  3. One can also capture the result as the part of the notebook.

  4. With the help of jupyter notebooks, we can share our work with a peer also.

Installation and Execution

If you are using the Anaconda distribution, then you need not install Jupyter notebook separately as it is already installed with it. You just need to go to the Anaconda Prompt and type the following command −

C:\>jupyter notebook

After pressing Enter, it will start a notebook server at localhost:8888 on your computer.

Now, after clicking the New tab, you will get a list of options. Select Python 3 and it will take you to a new notebook to start working in it.

On the other hand, if you are using a standard Python distribution, then Jupyter notebook can be installed using the popular Python package installer, pip.

pip install jupyter

Types of Cells in Jupyter Notebook

The following are the three types of cells in a Jupyter notebook −

Code cells − As the name suggests, we can use these cells to write code. After writing the code/content, it will be sent to the kernel that is associated with the notebook.

Markdown cells − We can use these cells for annotating the computation process. They can contain things like text, images, LaTeX equations, HTML tags etc.

Raw cells − The text written in them is displayed as it is. These cells are basically used to add text that we do not wish to be converted by the automatic conversion mechanism of the Jupyter notebook.

For a more detailed study of Jupyter notebook, you can go to the link https://www.tutorialspoint.com/jupyter/index.htm.

NumPy

It is another useful component that makes Python one of the favorite languages for data science. It basically stands for Numerical Python and consists of multidimensional array objects. By using NumPy, we can perform the following important operations −

  1. Mathematical and logical operations on arrays.

  2. Fourier transformation

  3. Operations associated with linear algebra.

We can also see NumPy as a replacement of MATLAB because NumPy is mostly used along with SciPy (Scientific Python) and Matplotlib (plotting library).

Installation and Execution

If you are using the Anaconda distribution, then there is no need to install NumPy separately as it is already installed with it. You just need to import the package into your Python script with the help of the following −

import numpy as np

On the other hand, if you are using a standard Python distribution, then NumPy can be installed using the popular Python package installer, pip.

pip install NumPy
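
As a small, hedged sketch (not part of the original tutorial scripts), the following lines illustrate the three kinds of operations listed above, array mathematics and logic, linear algebra and a Fourier transformation, on tiny made-up arrays −

import numpy as np

# Mathematical and logical operations on arrays
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
print(a + b)          # element-wise addition
print(a > 2)          # element-wise logical comparison

# Operation associated with linear algebra: matrix product of a 2x2 array with itself
m = np.array([[1, 2], [3, 4]])
print(np.dot(m, m))

# Fourier transformation of a small made-up signal
print(np.fft.fft([0.0, 1.0, 0.0, -1.0]))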

For a more detailed study of NumPy, you can go to the link https://www.tutorialspoint.com/numpy/index.htm.

Pandas

It is another useful Python library that makes Python one of the favorite languages for Data Science. Pandas is basically used for data manipulation, wrangling and analysis. It was developed by Wes McKinney in 2008. With the help of Pandas, in data processing we can accomplish the following five steps −

  1. Load

  2. Prepare

  3. Manipulate

  4. Model

  5. Analyze

Data representation in Pandas

The entire representation of data in Pandas is done with the help of the following three data structures −

Series − It is basically a one-dimensional ndarray with an axis label, which means it is like a simple array with homogeneous data. For example, the following series is a collection of integers 1, 5, 10, 15, 24, 25…

1, 5, 10, 15, 24, 25, 28, 36, 40, 89

Data frame − It is the most useful data structure and is used for almost all kinds of data representation and manipulation in Pandas. It is basically a two-dimensional data structure which can contain heterogeneous data. Generally, tabular data is represented by using data frames. For example, the following table shows the data of students having their names and roll numbers, age and gender −

Name      Roll number   Age   Gender
Aarav     1             15    Male
Harshit   2             14    Male
Kanika    3             16    Female
Mayank    4             15    Male

Panel − It is a 3-dimensional data structure containing heterogeneous data. It is very difficult to represent the panel in graphical representation, but it can be illustrated as a container of DataFrames.

The following table gives us the dimension and description of the above-mentioned data structures used in Pandas −

Data Structure   Dimension   Description
Series           1-D         Size immutable, 1-D homogeneous data
DataFrames       2-D         Size mutable, heterogeneous data in tabular form
Panel            3-D         Size-mutable array, container of DataFrame

We can understand these data structures in the sense that the higher dimensional data structure is the container of the lower dimensional data structure.

Installation and Execution

If you are using the Anaconda distribution, then there is no need to install Pandas separately as it is already installed with it. You just need to import the package into your Python script with the help of the following −

import pandas as pd

On the other hand, if you are using a standard Python distribution, then Pandas can be installed using the popular Python package installer, pip.

pip install Pandas

After installing Pandas, you can import it into your Python script as done above.

Example

The following is an example of creating a Series from an ndarray by using Pandas −

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: data = np.array(['g','a','u','r','a','v'])

In [4]: s = pd.Series(data)

In [5]: print (s)

0 g
1 a
2 u
3 r
4 a
5 v

dtype: object
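
In the same spirit, the student table shown earlier can be built as a DataFrame. The following is a small sketch (the values are the ones from the table above) −

import pandas as pd

# Building the student table shown earlier as a Pandas DataFrame
students = pd.DataFrame({
   'Name': ['Aarav', 'Harshit', 'Kanika', 'Mayank'],
   'Roll number': [1, 2, 3, 4],
   'Age': [15, 14, 16, 15],
   'Gender': ['Male', 'Male', 'Female', 'Male']
})
print(students)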

For a more detailed study of Pandas, you can go to the link https://www.tutorialspoint.com/python_pandas/index.htm.

Scikit-learn

Another useful and most important Python library for data science and machine learning in Python is Scikit-learn. The following are some features of Scikit-learn that make it so useful −

  1. It is built on NumPy, SciPy, and Matplotlib.

  2. It is open source and can be reused under the BSD license.

  3. It is accessible to everybody and can be reused in various contexts.

  4. Wide range of machine learning algorithms covering major areas of ML like classification, clustering, regression, dimensionality reduction, model selection etc. can be implemented with the help of it.

Installation and Execution

If you are using the Anaconda distribution, then there is no need to install Scikit-learn separately as it is already installed with it. You just need to use the package in your Python script. For example, with the following line of script we are importing the dataset of breast cancer patients from Scikit-learn −

from sklearn.datasets import load_breast_cancer

On the other hand, if you are using a standard Python distribution and have NumPy and SciPy, then Scikit-learn can be installed using the popular Python package installer, pip.

pip install -U scikit-learn

After installing Scikit-learn, you can use it in your Python script as you have done above.
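
The tutorial only imports the breast cancer dataset above. As a hedged end-to-end illustration of how the package can be used, the following sketch loads that dataset, splits it, trains a simple logistic regression classifier and prints its accuracy; the split ratio and model choice are assumptions made only for this example −

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the breast cancer dataset mentioned above
X, y = load_breast_cancer(return_X_y=True)

# Split into training and test sets (70/30 split chosen for illustration)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Train a simple classifier and evaluate its accuracy on the held-out data
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))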

Machine Learning with Python - Methods

There are various ML algorithms, techniques and methods that can be used to build models for solving real-life problems by using data. In this chapter, we are going to discuss such different kinds of methods.

Different Types of Methods

The following are various ML methods based on some broad categories −

Based on human supervision

In the learning process, some of the methods that are based on human supervision are as follows −

Supervised Learning

Supervised learning algorithms or methods are the most commonly used ML algorithms. This method or learning algorithm takes the data samples, i.e. the training data, and their associated outputs, i.e. labels or responses, for each data sample during the training process.

The main objective of supervised learning algorithms is to learn an association between input data samples and corresponding outputs after performing multiple training data instances.

For example, we have

x: Input variables and

Y: Output variable

Now, apply an algorithm to learn the mapping function from the input to output as follows −

Y=f(x)

Now, the main objective would be to approximate the mapping function so well that even when we have new input data (x), we can easily predict the output variable (Y) for that new input data.

It is called supervised because the whole process of learning can be thought of as being supervised by a teacher or supervisor. Examples of supervised machine learning algorithms include Decision Tree, Random Forest, KNN, Logistic Regression etc.
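
As a tiny, hedged sketch of the idea above (learning an approximation of Y = f(x) and predicting Y for new x), the following example fits a decision tree, one of the algorithms just listed, to made-up labeled data −

from sklearn.tree import DecisionTreeClassifier

# Tiny made-up training data: x is the input, Y is the known output label
X_train = [[1], [2], [3], [10], [11], [12]]
Y_train = ['small', 'small', 'small', 'large', 'large', 'large']

# Learn an approximation of the mapping function Y = f(x)
model = DecisionTreeClassifier()
model.fit(X_train, Y_train)

# Predict the output variable for new, unseen input data
print(model.predict([[2.5], [11.5]]))    # expected: ['small' 'large']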

Based on the ML tasks, supervised learning algorithms can be divided into the following two broad classes −

  1. Classification

  2. Regression

Classification

The key objective of classification-based tasks is to predict categorical output labels or responses for the given input data. The output will be based on what the model has learned in the training phase. As we know, categorical output responses are unordered and discrete values, hence each output response will belong to a specific class or category. We will discuss classification and associated algorithms in detail in the upcoming chapters.

Regression

The key objective of regression-based tasks is to predict output labels or responses which are continuous numeric values, for the given input data. The output will be based on what the model has learned in its training phase. Basically, regression models use the input data features (independent variables) and their corresponding continuous numeric output values (dependent or outcome variables) to learn a specific association between inputs and corresponding outputs. We will discuss regression and associated algorithms in detail in further chapters.

Unsupervised Learning

As the name suggests, it is the opposite of supervised ML methods or algorithms, which means that in unsupervised machine learning algorithms we do not have any supervisor to provide any sort of guidance. Unsupervised learning algorithms are handy in scenarios in which we do not have the liberty, as in supervised learning algorithms, of having pre-labeled training data, and we want to extract useful patterns from input data.

For example, it can be understood as follows −

Suppose we have −

x: Input variables; then there would be no corresponding output variable and the algorithms need to discover the interesting pattern in data for learning.

Examples of unsupervised machine learning algorithms include K-means clustering, K-nearest neighbors etc.

Based on the ML tasks, unsupervised learning algorithms can be divided into the following broad classes −

  1. Clustering

  2. Association

  3. Dimensionality Reduction

Clustering

Clustering methods are among the most useful unsupervised ML methods. These algorithms are used to find similarity as well as relationship patterns among data samples and then cluster those samples into groups having similarity based on features. A real-world example of clustering is to group customers by their purchasing behavior.
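
As a hedged sketch of the customer-grouping example above, the following lines cluster a few made-up customers, described by two invented features, into two groups with K-means −

from sklearn.cluster import KMeans
import numpy as np

# Made-up purchasing behaviour: [number of purchases, average spend]
customers = np.array([[2, 15], [3, 18], [2, 14],
                      [25, 300], [30, 320], [28, 310]])

# Group the customers into two clusters based on feature similarity
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)
print(labels)                   # cluster label assigned to each customer
print(kmeans.cluster_centers_)  # centre of each discovered group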

Association

Another useful unsupervised ML method is Association, which is used to analyze large datasets to find patterns which further represent the interesting relationships between various items. It is also termed Association Rule Mining or Market Basket Analysis, which is mainly used to analyze customer shopping patterns.

Dimensionality Reduction

This unsupervised ML method is used to reduce the number of feature variables for each data sample by selecting a set of principal or representative features. A question arises here: why do we need to reduce the dimensionality? The reason behind it is the problem of feature space complexity which arises when we start analyzing and extracting millions of features from data samples. This problem is generally referred to as the “curse of dimensionality”. PCA (Principal Component Analysis), K-nearest neighbors and discriminant analysis are some of the popular algorithms for this purpose.
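
As a small, hedged sketch of PCA, mentioned above, the following lines reduce the four feature variables of the iris dataset (used later in this tutorial) to two principal components; the choice of two components is an assumption made only for illustration −

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Reduce the 4 iris feature variables to 2 principal components
X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape)                        # (150, 4) - original feature space
print(X_reduced.shape)                # (150, 2) - reduced feature space
print(pca.explained_variance_ratio_)  # variance kept by each component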

Anomaly Detection

This unsupervised ML method is used to find out the occurrences of rare events or observations that generally do not occur. By using the learned knowledge, anomaly detection methods are able to differentiate between anomalous and normal data points. Some of the unsupervised algorithms, like clustering and KNN, can detect anomalies based on the data and its features.

Semi-supervised Learning

Such kinds of algorithms or methods are neither fully supervised nor fully unsupervised. They basically fall between the two, i.e. supervised and unsupervised learning methods. These kinds of algorithms generally use a small supervised learning component, i.e. a small amount of pre-labeled annotated data, and a large unsupervised learning component, i.e. lots of unlabeled data, for training. We can follow any of the following approaches for implementing semi-supervised learning methods −

  1. The first and simple approach is to build the supervised model based on small amount of labeled and annotated data and then build the unsupervised model by applying the same to the large amounts of unlabeled data to get more labeled samples. Now, train the model on them and repeat the process.

  2. The second approach needs some extra effort. In this approach, we can first use the unsupervised methods to cluster similar data samples, annotate these groups and then use a combination of this information to train the model.

Reinforcement Learning

These methods are different from previously studied methods and are very rarely used. In this kind of learning algorithm, there is an agent that we want to train over a period of time so that it can interact with a specific environment. The agent will follow a set of strategies for interacting with the environment and, after observing the environment, it will take actions regarding the current state of the environment. The following are the main steps of reinforcement learning methods −

  1. Step1 − First, we need to prepare an agent with some initial set of strategies.

  2. Step2 − Then observe the environment and its current state.

  3. Step3 − Next, select the optimal policy regarding the current state of the environment and perform the important action.

  4. Step4 − Now, the agent can get a corresponding reward or penalty in accordance with the action taken by it in the previous step.

  5. Step5 − Now, we can update the strategies if required.

  6. Step6 − At last, repeat steps 2-5 until the agent learns and adopts the optimal policies.

Tasks Suited for Machine Learning

The following diagram shows what type of task is appropriate for various ML problems −

[Diagram: tasks suited for various ML problems]

Based on learning ability

In the learning process, the following are some methods that are based on learning ability −

Batch Learning

In many cases, we have end-to-end machine learning systems in which we need to train the model in one go by using the whole available training data. Such a kind of learning method or algorithm is called Batch or Offline learning. It is called Batch or Offline learning because it is a one-time procedure and the model will be trained with data in one single batch. The following are the main steps of Batch learning methods −

Step1 − First, we need to collect all the training data to start training the model.

Step2 − Now, start the training of the model by providing the whole training data in one go.

Step3 − Next, stop the learning/training process once you get satisfactory results/performance.

Step4 − Finally, deploy this trained model into production. Here, it will predict the output for new data samples.

Online Learning

It is completely opposite to the batch or offline learning methods. In these learning methods, the training data is supplied in multiple incremental batches, called mini-batches, to the algorithm. The following are the main steps of online learning methods −

Step1 − First, we need to collect all the training data for starting the training of the model.

Step2 − Now, start the training of the model by providing a mini-batch of training data to the algorithm.

Step3 − Next, we need to provide the mini-batches of training data in multiple increments to the algorithm.

Step4 − As it will not stop like batch learning, after providing the whole training data in mini-batches, provide new data samples to it as well.

Step5 − Finally, it will keep learning over a period of time based on the new data samples.
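
As a hedged sketch of the steps above, the following example trains scikit-learn's SGDClassifier incrementally with partial_fit on made-up mini-batches; the helper make_mini_batch and its labeling rule are invented purely for illustration −

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)

def make_mini_batch(size=20):
    # Made-up mini-batch: two random features and a simple rule for the label
    X = rng.rand(size, 2)
    y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
    return X, y

model = SGDClassifier()

# First mini-batch: the set of classes must be declared on the first partial_fit call
X_batch, y_batch = make_mini_batch()
model.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))

# Further mini-batches arrive incrementally (Steps 2-3 above)
for _ in range(9):
    X_batch, y_batch = make_mini_batch()
    model.partial_fit(X_batch, y_batch)

# New samples keep arriving and the model keeps learning (Steps 4-5 above)
X_new, y_new = make_mini_batch()
model.partial_fit(X_new, y_new)
print(model.score(X_new, y_new))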

Based on Generalization Approach

In the learning process, the following are some methods that are based on generalization approaches −

Instance based Learning

The instance-based learning method is one of the useful methods that build ML models by doing generalization based on the input data. It is opposite to the previously studied learning methods in that this kind of learning involves ML systems and methods that use the raw data points themselves to draw outcomes for newer data samples, without building an explicit model on training data.

In simple words, instance-based learning basically starts working by looking at the input data points and then, using a similarity metric, it generalizes and predicts the new data points.
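
As a brief, hedged sketch of this idea, a k-nearest neighbors classifier keeps the raw training points and predicts new points from their most similar stored instances; the data below is made up for illustration −

from sklearn.neighbors import KNeighborsClassifier

# Tiny made-up data set: input points and their labels
X_train = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y_train = ['A', 'A', 'A', 'B', 'B', 'B']

# The classifier stores the raw training points; prediction is done by
# looking at the most similar (nearest) stored instances.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.predict([[1.5, 1.5], [8.5, 8.5]]))   # expected: ['A' 'B']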

Model based Learning

In model-based learning methods, an iterative process takes place on ML models that are built based on various model parameters, called hyperparameters, and in which input data is used to extract the features. In this learning, hyperparameters are optimized based on various model validation techniques. That is why we can say that model-based learning methods use a more traditional ML approach towards generalization.

Data Loading for ML Projects

Suppose you want to start an ML project; then what is the first and most important thing you would require? It is the data that we need to load for starting any ML project. With respect to data, the most common format of data for ML projects is CSV (comma-separated values).

Basically, CSV is a simple file format which is used to store tabular data (numbers and text) such as a spreadsheet in plain text. In Python, we can load CSV data in different ways, but before loading CSV data we must take care of some considerations.

Considerations While Loading CSV Data

CSV data format is the most common format for ML data, but we need to take care of the following major considerations while loading the same into our ML projects −

File Header

In CSV data files, the header contains the information for each field. We must use the same delimiter for the header file and for the data file because it is the header file that specifies how the data fields should be interpreted.

The following are the two cases related to CSV file headers which must be considered −

  1. Case-I: When the data file has a file header − The names will automatically be assigned to each column of data.

  2. Case-II: When the data file does not have a file header − We need to assign the names to each column of data manually.

In both the cases, we must explicitly specify whether our CSV file contains a header or not.

Comments

Comments in any data file have their significance. In CSV data files, comments are indicated by a hash (#) at the start of the line. We need to consider comments while loading CSV data into ML projects because, if we have comments in the file, we may need to indicate, depending upon the method we choose for loading, whether to expect those comments or not.

Delimiter

In CSV data files, the comma (,) character is the standard delimiter. The role of the delimiter is to separate the values in the fields. It is important to consider the role of the delimiter while uploading the CSV file into ML projects because we can also use a different delimiter such as a tab or white space. But in the case of using a delimiter different from the standard one, we must specify it explicitly.

Quotes

In CSV data files, the double quotation (“ ”) mark is the default quote character. It is important to consider the role of quotes while uploading the CSV file into ML projects because we can also use a quote character other than the double quotation mark. But in the case of using a quote character different from the standard one, we must specify it explicitly.
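
As a hedged sketch of how these considerations translate into code, the following lines show the pandas.read_csv() options (covered in detail below) for a non-standard file. The file name 'data.tsv' and the column names are hypothetical and used only to illustrate the parameters −

from pandas import read_csv

# Hypothetical file and column names, used only to illustrate the options:
# - header/names: tell pandas the file has no header row and supply the names
# - sep:          use a tab instead of the standard comma delimiter
# - comment:      skip anything after a hash (#)
# - quotechar:    use a single quote instead of the default double quote
data = read_csv(
   'data.tsv',
   header=None,
   names=['col1', 'col2', 'col3'],
   sep='\t',
   comment='#',
   quotechar="'"
)
print(data.shape)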

Methods to Load CSV Data File

While working with ML projects, the most crucial task is to load the data properly into them. The most common data format for ML projects is CSV, and it comes in various flavors and varying difficulties to parse. In this section, we are going to discuss three common approaches in Python to load a CSV data file −

Load CSV with Python Standard Library

The first and most used approach to load a CSV data file is the use of the Python standard library, which provides us a variety of built-in modules, namely the csv module and the reader() function. The following is an example of loading a CSV data file with the help of it −

Example

In this example, we are using the iris flower data set which can be downloaded into our local directory. After loading the data file, we can convert it into a NumPy array and use it for ML projects. Following is the Python script for loading the CSV data file −

First, we need to import the csv module provided by the Python standard library as follows −

import csv

Next, we need to import the NumPy module for converting the loaded data into a NumPy array.

import numpy as np

Now, provide the full path of the file, stored in our local directory, having the CSV data file −

path = r"c:\iris.csv"

Next, use the csv.reader() function to read data from the CSV file −

with open(path,'r') as f:
   reader = csv.reader(f,delimiter = ',')
   headers = next(reader)
   data = list(reader)
   data = np.array(data).astype(float)

We can print the names of the headers with the following line of script −

print(headers)

The following line of script will print the shape of the data, i.e. the number of rows & columns in the file −

print(data.shape)

The next script line will give the first three lines of the data file −

print(data[:3])

Output

['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
(150, 4)
[  [5.1 3.5 1.4 0.2]
   [4.9 3. 1.4 0.2]
   [4.7 3.2 1.3 0.2]]

Load CSV with NumPy

Another approach to load a CSV data file is the NumPy and numpy.loadtxt() function. The following is an example of loading a CSV data file with the help of it −

Example

In this example, we are using the Pima Indians Dataset having the data of diabetic patients. This dataset is a numeric dataset with no header. It can also be downloaded into our local directory. After loading the data file, we can convert it into a NumPy array and use it for ML projects. The following is the Python script for loading the CSV data file −

from numpy import loadtxt
path = r"C:\pima-indians-diabetes.csv"
datapath= open(path, 'r')
data = loadtxt(datapath, delimiter=",")
print(data.shape)
print(data[:3])

Output

(768, 9)
[  [ 6. 148. 72. 35. 0. 33.6 0.627 50. 1.]
   [ 1. 85. 66. 29. 0. 26.6 0.351 31. 0.]
   [ 8. 183. 64. 0. 0. 23.3 0.672 32. 1.]]

Load CSV with Pandas

Another approach to load a CSV data file is by Pandas and the pandas.read_csv() function. This is a very flexible function that returns a pandas.DataFrame which can be used immediately for plotting. The following is an example of loading a CSV data file with the help of it −

Example

Here, we will be implementing two Python scripts, the first with the Iris data set having headers and the other by using the Pima Indians Dataset, which is a numeric dataset with no header. Both the datasets can be downloaded into the local directory.

Script-1

The following is the Python script for loading a CSV data file using Pandas on the Iris data set −

from pandas import read_csv
path = r"C:\iris.csv"
data = read_csv(path)
print(data.shape)
print(data[:3])

Output

(150, 4)
   sepal_length   sepal_width  petal_length   petal_width
0         5.1     3.5          1.4            0.2
1         4.9     3.0          1.4            0.2
2         4.7     3.2          1.3            0.2

Script-2

The following is the Python script for loading a CSV data file, along with providing the header names too, using Pandas on the Pima Indians Diabetes dataset −

from pandas import read_csv
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
print(data.shape)
print(data[:3])

Output

(768, 9)
   preg  plas  pres   skin  test   mass    pedi    age   class
0   6    148    72      35    0     33.6   0.627    50      1
1   1    85     66      29    0     26.6   0.351    31      0
2   8    183    64      0     0     23.3   0.672    32      1

The difference between the three approaches used above for loading a CSV data file can easily be understood with the help of the given examples.

ML - Understanding Data with Statistics

Introduction

While working with machine learning projects, we usually ignore the two most important parts, called mathematics and data. This is because we know that ML is a data-driven approach and our ML model will produce only as good or as bad results as the data we provide to it.

In the previous chapter, we discussed how we can upload CSV data into our ML project, but it would be good to understand the data before uploading it. We can understand the data in two ways, with statistics and with visualization.

In this chapter, with the help of the following Python recipes, we are going to understand ML data with statistics.

Looking at Raw Data

The very first recipe is for looking at your raw data. It is important to look at raw data because the insight we get after looking at raw data will boost our chances of better pre-processing as well as handling of data for ML projects.

Following is a Python script implemented by using the head() function of Pandas DataFrame on the Pima Indians diabetes dataset to look at the first 50 rows and get a better understanding of it −

Example

from pandas import read_csv
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
print(data.head(50))

Output

preg   plas  pres    skin  test  mass   pedi    age      class
0      6      148     72     35   0     33.6    0.627    50    1
1      1       85     66     29   0     26.6    0.351    31    0
2      8      183     64      0   0     23.3    0.672    32    1
3      1       89     66     23  94     28.1    0.167    21    0
4      0      137     40     35  168    43.1    2.288    33    1
5      5      116     74      0   0     25.6    0.201    30    0
6      3       78     50     32   88    31.0    0.248    26    1
7     10      115      0      0   0     35.3    0.134    29    0
8      2      197     70     45  543    30.5    0.158    53    1
9      8      125     96      0   0     0.0     0.232    54    1
10     4      110     92      0   0     37.6    0.191    30    0
11    10      168     74      0   0     38.0    0.537    34    1
12    10      139     80      0   0     27.1    1.441    57    0
13     1      189     60     23  846    30.1    0.398    59    1
14     5      166     72     19  175    25.8    0.587    51    1
15     7      100      0      0   0     30.0    0.484    32    1
16     0      118     84     47  230    45.8    0.551    31    1
17     7      107     74      0   0     29.6    0.254    31    1
18     1      103     30     38  83     43.3    0.183    33    0
19     1      115     70     30  96     34.6    0.529    32    1
20     3      126     88     41  235    39.3    0.704    27    0
21     8       99     84      0   0     35.4    0.388    50    0
22     7      196     90      0   0     39.8    0.451    41    1
23     9      119     80     35   0     29.0    0.263    29    1
24    11      143     94     33  146    36.6    0.254    51    1
25    10      125     70     26  115    31.1    0.205    41    1
26     7      147     76      0   0     39.4    0.257    43    1
27     1       97     66     15  140    23.2    0.487    22    0
28    13      145     82     19  110    22.2    0.245    57    0
29     5      117     92      0   0     34.1    0.337    38    0
30     5      109     75     26   0     36.0    0.546    60    0
31     3      158     76     36  245    31.6    0.851    28    1
32     3       88     58     11   54    24.8    0.267    22    0
33     6       92     92      0   0     19.9    0.188    28    0
34    10      122     78     31   0     27.6    0.512    45    0
35     4      103     60     33  192    24.0    0.966    33    0
36    11      138     76      0   0     33.2    0.420    35    0
37     9      102     76     37   0     32.9    0.665    46    1
38     2       90     68     42   0     38.2    0.503    27    1
39     4      111     72     47  207    37.1    1.390    56    1
40     3      180     64     25   70    34.0    0.271    26    0
41     7      133     84      0   0     40.2    0.696    37    0
42     7      106     92     18   0     22.7    0.235    48    0
43     9      171    110     24  240    45.4    0.721    54    1
44     7      159     64      0   0     27.4    0.294    40    0
45     0      180     66     39   0     42.0    1.893    25    1
46     1      146     56      0   0     29.7    0.564    29    0
47     2       71     70     27   0     28.0    0.586    22    0
48     7      103     66     32   0     39.1    0.344    31    1
49     7      105      0      0   0     0.0     0.305    24    0

从上面的输出中,我们可以看到第一列给出了行号,这对于引用特定观察结果非常有用。

We can observe from the above output that the first column gives the row number, which can be very useful for referencing a specific observation.

Checking Dimensions of Data

了解我们为机器学习项目准备的行列数据量始终是一个好习惯。背后的原因是 −

It is always a good practice to know how much data, in terms of rows and columns, we are having for our ML project. The reasons behind are −

  1. If we have too many rows and columns, it would take a long time to run the algorithm and train the model.

  2. If we have too few rows and columns, we would not have enough data to train the model well.

以下是通过在 Pandas 数据框架中打印 shape 属性来实现的 Python 脚本。我们将对 iris 数据集进行实现以获取其中的行数和列数。

Following is a Python script implemented by printing the shape property on Pandas Data Frame. We are going to implement it on iris data set for getting the total number of rows and columns in it.

Example

from pandas import read_csv
path = r"C:\iris.csv"
data = read_csv(path)
print(data.shape)

Output

(150, 4)

我们可以从输出中轻松观察到,我们将要使用的 iris 数据集共有 150 行和 4 列。

We can easily observe from the output that iris data set, we are going to use, is having 150 rows and 4 columns.

Getting Each Attribute’s Data Type

了解每个属性的数据类型是另一个好习惯。背后的原因是,根据要求,有时我们可能需要将一种数据类型转换为另一种数据类型。例如,我们可能需要将字符串转换为浮点数或整数来表示分类或序数。我们可以通过查看原始数据来了解属性的数据类型,但另一种方法是使用 Pandas DataFrame 的 dtypes 属性。在 dtypes 属性的帮助下,我们可以对每个属性的数据类型进行分类。借助以下 Python 脚本可以理解 −

It is another good practice to know the data type of each attribute. The reason is that, as per the requirement, sometimes we may need to convert one data type to another. For example, we may need to convert strings into floating point or integer values to represent categorical or ordinal quantities. We can get an idea about an attribute’s data type by looking at the raw data, but another way is to use the dtypes property of Pandas DataFrame. With the help of the dtypes property we can categorize each attribute’s data type. It can be understood with the help of the following Python script −

Example

from pandas import read_csv
path = r"C:\iris.csv"
data = read_csv(path)
print(data.dtypes)

Output

sepal_length float64
sepal_width float64
petal_length float64
petal_width float64
dtype: object

从上面的输出中,我们可以轻松获得每个属性的数据类型。

From the above output, we can easily get the datatypes of each attribute.
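As mentioned above, we sometimes need to convert an attribute from one data type to another. The following is a minimal sketch using Pandas astype(), assuming the iris CSV loaded as in the script above with a 'sepal_length' column −

from pandas import read_csv

path = r"C:\iris.csv"
data = read_csv(path)

# convert a float64 column to integer (this truncates the decimal part)
data['sepal_length_int'] = data['sepal_length'].astype(int)

# check the resulting data types
print(data.dtypes)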

Statistical Summary of Data

我们讨论了 Python 配方以获取数据的形状,即行数和列数,但很多时候我们需要查看该数据形状的摘要。这可以通过 Pandas DataFrame 的 describe() 函数来完成,该函数进一步提供每个数据属性的以下 8 个统计属性 −

We have discussed the Python recipe to get the shape, i.e. the number of rows and columns, of data, but many times we also need to review statistical summaries of that data. This can be done with the help of the describe() function of Pandas DataFrame, which provides the following 8 statistical properties for each and every data attribute −

  1. Count

  2. Mean

  3. Standard Deviation

  4. Minimum Value

  5. Maximum value

  6. 25%

  7. Median i.e. 50%

  8. 75%

Example

from pandas import read_csv
from pandas import set_option
path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)
set_option('display.width', 100)
set_option('display.precision', 2)
print(data.shape)
print(data.describe())

Output

(768, 9)
         preg      plas       pres      skin      test        mass       pedi      age      class
count 768.00      768.00    768.00     768.00    768.00     768.00     768.00    768.00    768.00
mean    3.85      120.89     69.11      20.54     79.80      31.99       0.47     33.24      0.35
std     3.37       31.97     19.36      15.95    115.24       7.88       0.33     11.76      0.48
min     0.00        0.00      0.00       0.00      0.00       0.00       0.08     21.00      0.00
25%     1.00       99.00     62.00       0.00      0.00      27.30       0.24     24.00      0.00
50%     3.00      117.00     72.00      23.00     30.50      32.00       0.37     29.00      0.00
75%     6.00      140.25     80.00      32.00    127.25      36.60       0.63     41.00      1.00
max    17.00      199.00    122.00      99.00    846.00      67.10       2.42     81.00      1.00

从上面的输出中,我们可以观察到 Pima Indian Diabetes 数据集的数据统计摘要以及数据形状。

From the above output, we can observe the statistical summary of the data of Pima Indian Diabetes dataset along with shape of data.

Reviewing Class Distribution

类分布统计在分类问题中很有用,在这些问题中我们需要了解类值之间的平衡。了解类值分布非常重要,因为如果我们的类分布极不平衡,即一个类的观察值远多于另一个类,那么在机器学习项目的 data preparation 阶段可能需要特殊处理。我们可以借助 Pandas DataFrame 轻松地在 Python 中获取类分布。

Class distribution statistics are useful in classification problems where we need to know the balance of class values. It is important to know the class value distribution because, if the class distribution is highly imbalanced, i.e. one class has many more observations than the other class, it may need special handling at the data preparation stage of our ML project. We can easily get the class distribution in Python with the help of Pandas DataFrame.

Example

from pandas import read_csv
path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)
count_class = data.groupby('class').size()
print(count_class)

Output

class
0 500
1 268
dtype: int64

从上面的输出中可以清楚地看出,类 0 的观测值数量几乎是类 1 的观测值数量的两倍。

From the above output, it can be clearly seen that the number of observations with class 0 is almost double the number of observations with class 1.

Reviewing Correlation between Attributes

两个变量之间的关系称为相关性。在统计学中,计算相关性的最常用方法是皮尔逊相关系数。它可以具有以下三个值 −

The relationship between two variables is called correlation. In statistics, the most common method for calculating correlation is Pearson’s Correlation Coefficient, whose value ranges from −1 to +1. Its key reference values are as follows −

  1. Coefficient value = 1 − It represents full positive correlation between variables.

  2. Coefficient value = -1 − It represents full negative correlation between variables.

  3. Coefficient value = 0 − It represents no correlation at all between variables.

在我们将其用于 ML 项目之前,始终审阅我们数据集中的属性对相关性非常有益,因为如果我们具有高度相关的属性,某些机器学习算法(如线性回归和逻辑回归)的性能会很差。在 Python 中,我们可以借助 Pandas DataFrame 上的 corr() 函数轻松计算数据集属性的相关性矩阵。

It is always good for us to review the pairwise correlations of the attributes in our dataset before using it in an ML project, because some machine learning algorithms such as linear regression and logistic regression perform poorly if we have highly correlated attributes. In Python, we can easily calculate a correlation matrix of dataset attributes with the help of the corr() function on Pandas DataFrame.

Example

from pandas import read_csv
from pandas import set_option
path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)
set_option('display.width', 100)
set_option('display.precision', 2)
correlations = data.corr(method='pearson')
print(correlations)

Output

preg     plas     pres     skin     test      mass     pedi       age      class
preg     1.00     0.13     0.14     -0.08     -0.07   0.02     -0.03       0.54   0.22
plas     0.13     1.00     0.15     0.06       0.33   0.22      0.14       0.26   0.47
pres     0.14     0.15     1.00     0.21       0.09   0.28      0.04       0.24   0.07
skin    -0.08     0.06     0.21     1.00       0.44   0.39      0.18      -0.11   0.07
test    -0.07     0.33     0.09     0.44       1.00   0.20      0.19      -0.04   0.13
mass     0.02     0.22     0.28     0.39       0.20   1.00      0.14       0.04   0.29
pedi    -0.03     0.14     0.04     0.18       0.19   0.14      1.00       0.03   0.17
age      0.54     0.26     0.24     -0.11     -0.04   0.04      0.03       1.00   0.24
class    0.22     0.47     0.07     0.07       0.13   0.29      0.17       0.24   1.00

上方输出中的矩阵提供了数据集中的所有成对属性之间的相关性。

The matrix in the above output gives the correlations between all pairs of attributes in the dataset.
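If we are interested in a single pair of attributes rather than the whole matrix, we can look up one coefficient directly (a small sketch, assuming data and correlations from the script above) −

# Pearson correlation between two individual columns
print(data['preg'].corr(data['age']))

# the same value can be read from the matrix computed above
print(correlations.loc['preg', 'age'])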

Reviewing Skew of Attribute Distribution

偏度可以定义为一个假定为高斯分布但看起来已向另一个方向扭曲或偏移,或朝左或朝右的分布。审阅属性的偏度至关重要,原因如下所述−

Skewness refers to a distribution that is assumed to be Gaussian but appears distorted or shifted in one direction, either to the left or to the right. Reviewing the skewness of attributes is one of the important tasks due to the following reasons −

  1. Presence of skewness in data requires correction at the data preparation stage so that we can get more accuracy from our model.

  2. Most of the ML algorithms assume that data has a Gaussian distribution, i.e. normal or bell-shaped data.

在 Python 中,我们可以通过对 Pandas DataFrame 使用 skew() 函数轻松计算每个属性的偏差。

In Python, we can easily calculate the skew of each attribute by using skew() function on Pandas DataFrame.

Example

from pandas import read_csv
path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)
print(data.skew())

Output

preg   0.90
plas   0.17
pres  -1.84
skin   0.11
test   2.27
mass  -0.43
pedi   1.92
age    1.13
class  0.64
dtype: float64

从以上输出中,可以观察到正偏差或负偏差。如果数值接近于 0,则表示偏差较小。

From the above output, positive or negative skew can be observed. If the value is closer to zero, then it shows less skew.
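As noted in the list above, strongly skewed attributes may need correction at the data preparation stage. The following is a minimal sketch of one common option, a log transform, assuming data as loaded in the script above (the transform and the choice of the test attribute are illustrative only) −

from numpy import log1p

# log1p, i.e. log(1 + x), compresses large values and is safe for zero entries
data['test_log'] = log1p(data['test'])

# compare skewness before and after the transform
print(data[['test', 'test_log']].skew())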

ML - Understanding Data with Visualization

Introduction

在上一章中,我们讨论了机器学习算法对数据的重要性以及使用 Python 配方了解数据统计信息。还有一种称为可视化的方法来理解数据。

In the previous chapter, we have discussed the importance of data for Machine Learning algorithms along with some Python recipes to understand the data with statistics. There is another way called Visualization, to understand the data.

借助数据可视化，我们可以看到数据的表现形式以及数据的属性持有哪种相关性。这是查看特征是否对应输出的最快方法。借助以下 Python 配方，我们可以通过可视化来理解 ML 数据。

With the help of data visualization, we can see what the data looks like and what kind of correlation is held by the attributes of the data. It is the fastest way to see if the features correspond to the output. With the help of the following Python recipes, we can understand ML data with visualization.


Univariate Plots: Understanding Attributes Independently

最简单的可视化类型是单变量或“单变量”可视化。借助单变量可视化,我们可以独立了解数据集的每个属性。以下是 Python 中实现单变量可视化的某些技术 -

The simplest type of visualization is single-variable or “univariate” visualization. With the help of univariate visualization, we can understand each attribute of our dataset independently. The following are some techniques in Python to implement univariate visualization −

Histograms

直方图将数据分组到区间中,并且是获取数据集中的每个属性的分布想法的最快方法。以下是直方图的一些特性 -

Histograms group the data into bins and are the fastest way to get an idea about the distribution of each attribute in the dataset. The following are some of the characteristics of histograms −

  1. It provides us a count of the number of observations in each bin created for visualization.

  2. From the shape of the bins, we can easily observe the distribution, i.e. whether it is Gaussian, skewed or exponential.

  3. Histograms also help us to see possible outliers.

Example

以下显示的代码是创建 Pima Indian Diabetes 数据集的属性直方图的 Python 脚本示例。在这里,我们将使用 Pandas DataFrame 上的 hist() 函数生成直方图和 matplotlib 用于绘制它们。

The code shown below is an example of a Python script creating the histograms of the attributes of the Pima Indian Diabetes dataset. Here, we will be using the hist() function on Pandas DataFrame to generate histograms and matplotlib for plotting them.

from matplotlib import pyplot
from pandas import read_csv
path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)
data.hist()
pyplot.show()

Output

(Figure: Histograms of each attribute of the Pima Indian Diabetes dataset)

上面的输出显示它为数据集中每个属性创建了直方图。从中,我们可以观察到也许 age、pedi 和 test 属性可能是指数分布,而 mass 和 plas 是高斯分布。

The above output shows that it created a histogram for each attribute in the dataset. From this, we can observe that the age, pedi and test attributes may have an exponential distribution, while mass and plas appear to be Gaussian.

Density Plots

获得每个属性分布的另一种快速简单的技术是密度图。它也类似于直方图,但在每个区间的顶部绘制一条平滑曲线。我们可以将它们称为抽象直方图。

Another quick and easy technique for getting the distribution of each attribute is density plots. They are similar to histograms but with a smooth curve drawn through the top of each bin. We can think of them as abstracted histograms.

Example

在以下示例中,Python 脚本将生成 Pima Indian Diabetes 数据集的属性分布的密度图。

In the following example, Python script will generate Density Plots for the distribution of attributes of Pima Indian Diabetes dataset.

from matplotlib import pyplot
from pandas import read_csv
path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)
data.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
pyplot.show()

Output

(Figure: Density plots of each attribute of the Pima Indian Diabetes dataset)

从上面的输出中,可以很容易地理解密度图和直方图之间的差异。

From the above output, the difference between Density plots and Histograms can be easily understood.

Box and Whisker Plots

箱线图，简称箱形图，是审查每个属性分布的另一种有用技术。以下是此技术的一些特征 −

Box and Whisker plots, also called boxplots for short, are another useful technique for reviewing the distribution of each attribute. The following are the characteristics of this technique −

  1. It is univariate in nature and summarizes the distribution of each attribute.

  2. It draws a line for the middle value i.e. for median.

  3. It draws a box around the 25th and 75th percentiles, i.e. the interquartile range (IQR).

  4. It also draws whiskers which will give us an idea about the spread of the data.

  5. The dots outside the whiskers signify outlier values; a point counts as an outlier when it lies more than 1.5 times the interquartile range (the spread of the middle 50% of the data) beyond the box, as the sketch after this list shows.
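The following small sketch shows how that 1.5 × IQR rule translates into concrete outlier bounds; the dataset is the same one used in the example below, and the mass attribute is picked purely for illustration −

from pandas import read_csv

path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)

# interquartile range of a single attribute
q1 = data['mass'].quantile(0.25)
q3 = data['mass'].quantile(0.75)
iqr = q3 - q1

# points beyond 1.5 * IQR from the quartiles appear as dots on the boxplot
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
print("Outlier bounds for mass:", lower, upper)
print("Number of outliers:", ((data['mass'] < lower) | (data['mass'] > upper)).sum())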

Example

在以下示例中，Python 脚本将生成 Pima Indian Diabetes 数据集属性分布的箱线图。

In the following example, the Python script will generate box and whisker plots for the distribution of attributes of the Pima Indian Diabetes dataset.

from matplotlib import pyplot
from pandas import read_csv
path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)
data.plot(kind='box', subplots=True, layout=(3,3), sharex=False,sharey=False)
pyplot.show()

Output

(Figure: Box and whisker plots of each attribute of the Pima Indian Diabetes dataset)

从上面绘制的属性分布图可以观察到,年龄、测试和皮肤对较小的值呈偏态。

From the above plot of attribute’s distribution, it can be observed that age, test and skin appear skewed towards smaller values.

Multivariate Plots: Interaction Among Multiple Variables

另一种可视化类型是多变量或“多变量”可视化。借助多变量可视化,我们可以了解数据集中多个属性之间的交互。以下是 Python 中用于实现多变量可视化的某些技术 -

Another type of visualization is multi-variable or “multivariate” visualization. With the help of multivariate visualization, we can understand interaction between multiple attributes of our dataset. The following are some techniques in Python to implement multivariate visualization −

Correlation Matrix Plot

相关性是关于两个变量之间变化的指示。在之前的章节中,我们讨论了皮尔森相关系数和相关性的重要性。我们可以绘制相关矩阵以显示哪些变量相对于另一个变量具有高或低相关性。

Correlation is an indication of how two variables change together. In our previous chapters, we have discussed Pearson’s Correlation Coefficient and the importance of correlation. We can plot a correlation matrix to show which variables have a high or low correlation with respect to one another.

Example

在以下示例中,Python 脚本将为皮马印第安人糖尿病数据集生成并绘制相关矩阵。它可以在 Pandas DataFrame 上使用 corr() 函数生成,并在 pyplot 的帮助下绘制。

In the following example, Python script will generate and plot correlation matrix for the Pima Indian Diabetes dataset. It can be generated with the help of corr() function on Pandas DataFrame and plotted with the help of pyplot.

from matplotlib import pyplot
from pandas import read_csv
import numpy
Path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(Path, names=names)
correlations = data.corr()
fig = pyplot.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = numpy.arange(0,9,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
pyplot.show()

Output

(Figure: Correlation matrix plot of the Pima Indian Diabetes dataset)

从相关矩阵的上述输出中，我们可以看到它是对称的，即左下角与右上角相同。还可以观察到大多数变量彼此呈正相关。

From the above output of the correlation matrix, we can see that it is symmetrical, i.e. the bottom left is the same as the top right. It is also observed that most of the variables are positively correlated with each other.

Scatter Matrix Plot

散点图展示了在二维平面上,一个变量受另一个变量影响的程度或它们之间的关系。散点图非常像折线图,因为它们使用水平和垂直轴绘制数据点。

Scatter plots show how much one variable is affected by another, or the relationship between them, with the help of dots in two dimensions. Scatter plots are very much like line graphs in the sense that they use horizontal and vertical axes to plot data points.

Example

在以下示例中,Python 脚本将为皮马印第安人糖尿病数据集生成并绘制散点矩阵。它可以在 Pandas DataFrame 上使用 scatter_matrix() 函数生成,并在 pyplot 的帮助下绘制。

In the following example, Python script will generate and plot Scatter matrix for the Pima Indian Diabetes dataset. It can be generated with the help of scatter_matrix() function on Pandas DataFrame and plotted with the help of pyplot.

from matplotlib import pyplot
from pandas import read_csv
from pandas.plotting import scatter_matrix
path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)
scatter_matrix(data)
pyplot.show()

Output

(Figure: Scatter matrix plot of the Pima Indian Diabetes dataset)

Machine Learning with Python - Preparing Data

Introduction

机器学习算法完全依赖于数据,因为数据是使模型训练成为可能的至关重要的方面。另一方面,如果我们在将数据输入 ML 算法之前无法理解数据,机器将毫无用处。简而言之,对于我们希望机器解决的问题,我们总是需要输入正确的数据,即按正确比例、格式且包含有意义特征的数据。

Machine Learning algorithms are completely dependent on data because it is the most crucial aspect that makes model training possible. On the other hand, if we are not able to make sense of that data before feeding it to ML algorithms, a machine will be useless. In simple words, we always need to feed the right data, i.e. data on the correct scale, in the correct format and containing meaningful features, for the problem we want the machine to solve.

这使得数据准备成为 ML 处理中最重要的步骤。数据准备可以定义为使我们的数据集更适用于 ML 处理的过程。

This makes data preparation the most important step in ML process. Data preparation may be defined as the procedure that makes our dataset more appropriate for ML process.

Why Data Pre-processing?

在选择 ML 训练的原始数据后,最重要的任务是数据预处理。广义上讲,数据预处理将选定的数据转换为我们可以处理或可以输入 ML 算法的形式。我们总是需要预处理我们的数据,以便它可以符合机器学习算法的期望。

After selecting the raw data for ML training, the most important task is data pre-processing. In a broad sense, data preprocessing converts the selected data into a form we can work with or can feed to ML algorithms. We always need to preprocess our data so that it meets the expectations of the machine learning algorithms.

Data Pre-processing Techniques

以下数据预处理技术可以应用于数据集以生成 ML 算法的数据 −

We have the following data preprocessing techniques that can be applied on data set to produce data for ML algorithms −

Scaling

我们数据集很可能包含具有不同比例的属性,但是我们不能将此类数据提供给 ML 算法,因此需要重新缩放。数据重新缩放确保属性具有相同的比例。一般来说,将属性重新缩放为 0 到 1 的范围。梯度下降和 k 近邻等 ML 算法需要缩放数据。我们可以借助 scikit-learn Python 库的 MinMaxScaler 类重新缩放数据。

Most probably our dataset comprises attributes with varying scales, but we cannot provide such data to an ML algorithm, hence it requires rescaling. Data rescaling makes sure that attributes are on the same scale. Generally, attributes are rescaled into the range of 0 and 1. ML algorithms like gradient descent and k-Nearest Neighbors require scaled data. We can rescale the data with the help of the MinMaxScaler class of the scikit-learn Python library.

Example

在这个示例中,我们将重新缩放我们之前使用的 Pima Indians Diabetes 数据集的数据。首先,将加载 CSV 数据(如前几章中所做的那样),然后借助 MinMaxScaler 类,将数据重新缩放至 0 和 1 之间的范围。

In this example we will rescale the data of Pima Indians Diabetes dataset which we used earlier. First, the CSV data will be loaded (as done in the previous chapters) and then with the help of MinMaxScaler class, it will be rescaled in the range of 0 and 1.

以下脚本的前几行与我们在加载 CSV 数据时在之前的章节中写的一样。

The first few lines of the following script are same as we have written in previous chapters while loading CSV data.

from pandas import read_csv
from numpy import set_printoptions
from sklearn import preprocessing
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

现在,我们可以使用 MinMaxScaler 类将数据重新缩放到 0 和 1 的范围内。

Now, we can use MinMaxScaler class to rescale the data in the range of 0 and 1.

data_scaler = preprocessing.MinMaxScaler(feature_range=(0,1))
data_rescaled = data_scaler.fit_transform(array)

我们还可以根据自己的选择总结数据以供输出。在这里,我们将精度设置为 1,并在输出中显示前 10 行。

We can also summarize the data for output as per our choice. Here, we are setting the precision to 1 and showing the first 10 rows in the output.

set_printoptions(precision=1)
print ("\nScaled data:\n", data_rescaled[0:10])

Output

Scaled data:
[[0.4 0.7 0.6 0.4 0. 0.5 0.2 0.5 1. ]
[0.1 0.4 0.5 0.3 0. 0.4 0.1 0.2 0. ]
[0.5 0.9 0.5 0. 0. 0.3 0.3 0.2 1. ]
[0.1 0.4 0.5 0.2 0.1 0.4 0. 0. 0. ]
[0. 0.7 0.3 0.4 0.2 0.6 0.9 0.2 1. ]
[0.3 0.6 0.6 0. 0. 0.4 0.1 0.2 0. ]
[0.2 0.4 0.4 0.3 0.1 0.5 0.1 0.1 1. ]
[0.6 0.6 0. 0. 0. 0.5 0. 0.1 0. ]
[0.1 1. 0.6 0.5 0.6 0.5 0. 0.5 1. ]
[0.5 0.6 0.8 0. 0. 0. 0.1 0.6 1. ]]

从上述输出中,所有数据都重新缩放到 0 和 1 的范围内。

From the above output, all the data got rescaled into the range of 0 and 1.

Normalization

另一种有用的数据预处理技术是归一化。这用于将每行数据重新缩放为长度为 1。它主要用于我们有很多零的稀疏数据集。我们可以借助 scikit-learn Python 库的 Normalizer 类重新缩放数据。

Another useful data preprocessing technique is normalization. It is used to rescale each row of data to have a length of 1. It is mainly useful with sparse datasets where we have lots of zeros. We can rescale the data with the help of the Normalizer class of the scikit-learn Python library.

Types of Normalization

在机器学习中,有两种归一化预处理技术,如下所示 −

In machine learning, there are two types of normalization preprocessing techniques as follows −

L1 Normalization

它可以定义为一种归一化技术,它以一种方式修改数据集值,使得每一行的绝对值之和始终高达 1。它也称为最小绝对偏差。

It may be defined as the normalization technique that modifies the dataset values in such a way that in each row the absolute values sum to 1. It is also called Least Absolute Deviations.

Example

在此示例中,我们使用 L1 归一化技术来归一化我们先前使用的 Pima Indians 糖尿病数据集的数据。首先,将加载 CSV 数据,然后在 Normalizer 类的帮助下对其进行归一化。

In this example, we use L1 Normalize technique to normalize the data of Pima Indians Diabetes dataset which we used earlier. First, the CSV data will be loaded and then with the help of Normalizer class it will be normalized.

以下脚本的前几行与我们在加载 CSV 数据时在之前的章节中写的一样。

The first few lines of following script are same as we have written in previous chapters while loading CSV data.

from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import Normalizer
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv (path, names=names)
array = dataframe.values

现在,我们可以使用 L1 的 Normalizer 类来对数据进行归一化。

Now, we can use Normalizer class with L1 to normalize the data.

Data_normalizer = Normalizer(norm='l1').fit(array)
Data_normalized = Data_normalizer.transform(array)

我们还可以根据自己的选择总结数据以供输出。在这里,我们将精度设置为 2,并在输出中显示前 3 行。

We can also summarize the data for output as per our choice. Here, we are setting the precision to 2 and showing the first 3 rows in the output.

set_printoptions(precision=2)
print ("\nNormalized data:\n", Data_normalized [0:3])

Output

Normalized data:
[[0.02 0.43 0.21 0.1 0. 0.1 0. 0.14 0. ]
[0. 0.36 0.28 0.12 0. 0.11 0. 0.13 0. ]
[0.03 0.59 0.21 0. 0. 0.07 0. 0.1 0. ]]
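To confirm the property described above, we can check that the absolute values in each normalized row sum to 1 (a quick verification sketch, assuming Data_normalized from the script above) −

import numpy

# each row of the L1-normalized array should have absolute values summing to 1
print(numpy.abs(Data_normalized).sum(axis=1)[0:3])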

L2 Normalization

它可以定义为一种归一化技术,它以一种方式修改数据集值,使得每一行的平方和始终高达 1。它也称为最小二乘。

It may be defined as the normalization technique that modifies the dataset values in such a way that in each row the squared values sum to 1. It is also called least squares.

Example

在此示例中,我们使用 L2 归一化技术来归一化我们先前使用的 Pima Indians 糖尿病数据集的数据。首先,将加载 CSV 数据(如前几章中所述),然后在 Normalizer 类的帮助下对其进行归一化。

In this example, we use L2 Normalization technique to normalize the data of Pima Indians Diabetes dataset which we used earlier. First, the CSV data will be loaded (as done in previous chapters) and then with the help of Normalizer class it will be normalized.

以下脚本的前几行与我们在加载 CSV 数据时在之前的章节中写的一样。

The first few lines of following script are same as we have written in previous chapters while loading CSV data.

from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import Normalizer
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv (path, names=names)
array = dataframe.values

现在，我们可以使用 L2 的 Normalizer 类来对数据进行归一化。

Now, we can use the Normalizer class with L2 to normalize the data.

Data_normalizer = Normalizer(norm='l2').fit(array)
Data_normalized = Data_normalizer.transform(array)

我们还可以根据自己的选择总结数据以供输出。在这里,我们将精度设置为 2,并在输出中显示前 3 行。

We can also summarize the data for output as per our choice. Here, we are setting the precision to 2 and showing the first 3 rows in the output.

set_printoptions(precision=2)
print ("\nNormalized data:\n", Data_normalized [0:3])

Output

Normalized data:
[[0.03 0.83 0.4 0.2 0. 0.19 0. 0.28 0.01]
[0.01 0.72 0.56 0.24 0. 0.22 0. 0.26 0. ]
[0.04 0.92 0.32 0. 0. 0.12 0. 0.16 0.01]]
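Similarly, we can check that the squared values in each normalized row sum to 1 (assuming Data_normalized from the L2 script above) −

# each row of the L2-normalized array should have squared values summing to 1
print((Data_normalized ** 2).sum(axis=1)[0:3])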

Binarization

顾名思义,这是一个可以帮助我们使数据实现二进制化的技术。我们可以使用二进制阈值使数据实现二进制化。高于该阈值的值将转换为 1,低于该阈值的值将转换为 0。例如,如果我们选择阈值 = 0.5,则数据集值高于该值将变为 1,低于则变为 0。这就是为什么我们可以称之为 binarizing 数据或 thresholding 数据。当我们数据集中有概率并希望将它们转换为固定值时,这个技术很有用。

As the name suggests, this is the technique with the help of which we can make our data binary. We can use a binary threshold for making our data binary. The values above that threshold value will be converted to 1 and below that threshold will be converted to 0. For example, if we choose threshold value = 0.5, then the dataset value above it will become 1 and below this will become 0. That is why we can call it binarizing the data or thresholding the data. This technique is useful when we have probabilities in our dataset and want to convert them into crisp values.

我们可以借助 scikit-learn Python 库中的 Binarizer 类对数据进行二值化。

We can binarize the data with the help of Binarizer class of scikit-learn Python library.

Example

在这个示例中，我们将对前面使用过的 Pima Indians Diabetes 数据集的数据进行二值化。首先加载 CSV 数据，然后借助 Binarizer 类，根据阈值将其转换为二进制值，即 0 和 1。我们选择 0.5 作为阈值。

In this example, we will binarize the data of the Pima Indians Diabetes dataset which we used earlier. First, the CSV data will be loaded and then with the help of the Binarizer class it will be converted into binary values, i.e. 0 and 1, depending upon the threshold value. We are taking 0.5 as the threshold value.

以下脚本的前几行与我们在加载 CSV 数据时在之前的章节中写的一样。

The first few lines of following script are same as we have written in previous chapters while loading CSV data.

from pandas import read_csv
from sklearn.preprocessing import Binarizer
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

现在，我们可以使用 Binarizer 类将该数据转换为二进制值。

Now, we can use the Binarizer class to convert the data into binary values.

binarizer = Binarizer(threshold=0.5).fit(array)
Data_binarized = binarizer.transform(array)

在此处,我们显示输出中的前 5 行。

Here, we are showing the first 5 rows in the output.

print ("\nBinary data:\n", Data_binarized [0:5])

Output

Binary data:
[[1. 1. 1. 1. 0. 1. 1. 1. 1.]
[1. 1. 1. 1. 0. 1. 0. 1. 0.]
[1. 1. 1. 0. 0. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 0. 1. 0.]
[0. 1. 1. 1. 1. 1. 1. 1. 1.]]

Standardization

标准化是另一种有用的数据预处理技术，主要用于将数据属性转换为均值为 0、标准差为 1 的标准高斯分布。这种技术对线性回归、逻辑回归等假设输入数据集呈高斯分布的 ML 算法很有用，使用重新缩放的数据可以产生更好的结果。我们可以借助 scikit-learn Python 库的 StandardScaler 类标准化数据（均值 = 0，SD = 1）。

Standardization is another useful data preprocessing technique, basically used to transform data attributes to a standard Gaussian-like distribution with a mean of 0 and a standard deviation (SD) of 1. This technique is useful for ML algorithms like linear regression and logistic regression that assume a Gaussian distribution in the input dataset and produce better results with rescaled data. We can standardize the data (mean = 0 and SD = 1) with the help of the StandardScaler class of the scikit-learn Python library.

Example

在此示例中,我们将重新调整先前使用的 Pima Indians 糖尿病数据集的数据。首先,将加载 CSV 数据,然后在 StandardScaler 类的帮助下,将其转换为均值 = 0 和 SD = 1 的高斯分布。

In this example, we will standardize the data of the Pima Indians Diabetes dataset which we used earlier. First, the CSV data will be loaded and then with the help of the StandardScaler class it will be rescaled so that each attribute has mean = 0 and SD = 1.

以下脚本的前几行与我们在加载 CSV 数据时在之前的章节中写的一样。

The first few lines of following script are same as we have written in previous chapters while loading CSV data.

from sklearn.preprocessing import StandardScaler
from pandas import read_csv
from numpy import set_printoptions
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

现在,我们可以使用 StandardScaler 类来重新缩放到数据。

Now, we can use StandardScaler class to rescale the data.

data_scaler = StandardScaler().fit(array)
data_rescaled = data_scaler.transform(array)

我们还可以根据自己的选择总结数据以供输出。在这里,我们将精度设置为 2,并在输出中显示前 5 行。

We can also summarize the data for output as per our choice. Here, we are setting the precision to 2 and showing the first 5 rows in the output.

set_printoptions(precision=2)
print ("\nRescaled data:\n", data_rescaled [0:5])

Output

Rescaled data:
[[ 0.64 0.85 0.15 0.91 -0.69 0.2 0.47 1.43 1.37]
[-0.84 -1.12 -0.16 0.53 -0.69 -0.68 -0.37 -0.19 -0.73]
[ 1.23 1.94 -0.26 -1.29 -0.69 -1.1 0.6 -0.11 1.37]
[-0.84 -1. -0.16 0.15 0.12 -0.49 -0.92 -1.04 -0.73]
[-1.14 0.5 -1.5 0.91 0.77 1.41 5.48 -0.02 1.37]]
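We can verify the standardization by checking that every column of the rescaled data now has a mean close to 0 and a standard deviation close to 1 (a quick check, assuming data_rescaled from the script above) −

print("Column means:\n", data_rescaled.mean(axis=0))
print("Column standard deviations:\n", data_rescaled.std(axis=0))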

Data Labeling

我们讨论了优质数据对于 ML 算法的重要性以及在将数据发送到 ML 算法之前预处理数据的一些技术。在这方面另一个方面是数据标记。将具有适当标记的数据发送到 ML 算法也非常重要。例如,对于分类问题,数据上存在许多以单词、数字等形式出现的标签。

We discussed the importance of good data for ML algorithms as well as some techniques to pre-process the data before sending it to ML algorithms. One more aspect in this regard is data labeling. It is also very important to send properly labeled data to ML algorithms. For example, in the case of classification problems, many labels in the form of words, numbers etc. are present in the data.

What is Label Encoding?

大多数 sklearn 函数都需要的是以数字标记表示的数据,而不是以单词标记的数据。因此,我们需要将此类标记转换为数字标记。此过程称为标记编码。我们可以借助 scikit-learn Python 库的 LabelEncoder() 函数对数据执行标记编码。

Most of the sklearn functions expect data with number labels rather than word labels. Hence, we need to convert such labels into number labels. This process is called label encoding. We can perform label encoding of data with the help of the LabelEncoder() function of the scikit-learn Python library.

Example

在以下示例中,Python 脚本将执行标记编码。

In the following example, Python script will perform the label encoding.

首先,按如下方式导入所需的 Python 库 −

First, import the required Python libraries as follows −

import numpy as np
from sklearn import preprocessing

现在,我们需要提供输入标记,如下所示 −

Now, we need to provide the input labels as follows −

input_labels = ['red','black','red','green','black','yellow','white']

代码的下一行将创建标记编码器并对其进行训练。

The next line of code will create the label encoder and train it.

encoder = preprocessing.LabelEncoder()
encoder.fit(input_labels)

脚本的下一行将通过对随机有序的列表进行编码来检查性能 −

The next lines of script will check the performance by encoding the random ordered list −

test_labels = ['green','red','black']
encoded_values = encoder.transform(test_labels)
print("\nLabels =", test_labels)
print("Encoded values =", list(encoded_values))
encoded_values = [3,0,4,1]
decoded_list = encoder.inverse_transform(encoded_values)

我们可以借助以下 Python 脚本打印编码值和解码后的标记 −

We can print the encoded values and the decoded labels with the help of the following Python script −

print("\nEncoded values =", encoded_values)
print("\nDecoded labels =", list(decoded_list))

Output

Labels = ['green', 'red', 'black']
Encoded values = [1, 2, 0]
Encoded values = [3, 0, 4, 1]
Decoded labels = ['white', 'black', 'yellow', 'green']
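The mapping learned by the encoder can also be inspected through its classes_ attribute, which lists the original labels in the order of their numeric codes (assuming the encoder fitted above) −

# position in classes_ is the numeric code assigned to that label
for index, item in enumerate(encoder.classes_):
   print(item, '-->', index)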

ML with Python - Data Feature Selection

在上一章节中,我们详细了解了如何对用于机器学习的数据进行预处理和准备。在这一章节中,让我们详细了解数据特征选择及其所涉及的各个方面。

In the previous chapter, we have seen in detail how to preprocess and prepare data for machine learning. In this chapter, let us understand in detail data feature selection and various aspects involved in it.

Importance of Data Feature Selection

机器学习模型的性能与用于对其训练的数据特征成正比。如果向其提供不相关的数据特征,将会对 ML 模型的性能产生负面影响。另一方面,使用相关数据特征可以提高 ML 模型的准确性,尤其是线性回归和逻辑回归。

The performance of a machine learning model is directly proportional to the quality of the data features used to train it. The performance of an ML model will be affected negatively if the data features provided to it are irrelevant. On the other hand, the use of relevant data features can increase the accuracy of your ML model, especially for linear and logistic regression.

现在的问题是,什么是自动特征选择?它可以定义为一个过程,借助这个过程,我们可以选择我们数据中最与我们感兴趣的输出或预测变量相关的特征。它也称为属性选择。

Now the question arises: what is automatic feature selection? It may be defined as the process with the help of which we select those features in our data that are most relevant to the output or prediction variable in which we are interested. It is also called attribute selection.

以下是在对数据建模之前进行自动特征选择的一些好处:

The following are some of the benefits of automatic feature selection before modeling the data −

  1. Performing feature selection before data modeling will reduce overfitting.

  2. Performing feature selection before data modeling will increase the accuracy of the ML model.

  3. Performing feature selection before data modeling will reduce the training time.

Feature Selection Techniques

以下是我们可以在 Python 中用于对 ML 数据建模的自动特征选择技术:

The following are some automatic feature selection techniques that we can use to model ML data in Python −

Univariate Selection

此特征选择技术对于选择那些与预测变量具有最强关系(借助统计测试)的特征非常有用。我们可以借助 scikit-learn Python 库的 SelectKBest0 类来实现单变量特征选择技术。

This feature selection technique is very useful in selecting, with the help of statistical testing, those features that have the strongest relationship with the prediction variable. We can implement the univariate feature selection technique with the help of the SelectKBest class of the scikit-learn Python library.

Example

在这个示例中,我们将使用皮马印第安人糖尿病数据集来选择 4 个具有最佳特征的属性(借助卡方统计测试)。

In this example, we will use Pima Indians Diabetes dataset to select 4 of the attributes having best features with the help of chi-square statistical test.

from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

接下来,我们将数组分成输入和输出组件:

Next, we will separate array into input and output components −

X = array[:,0:8]
Y = array[:,8]

以下代码行将从数据集中选择最佳特征:

The following lines of code will select the best features from dataset −

test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X,Y)

我们还可以根据自己的选择来总结输出数据。在这里,我们设置精度为 2,并显示具有最佳特征的 4 个数据属性以及每个属性的最佳分数:

We can also summarize the data for output as per our choice. Here, we are setting the precision to 2 and showing the 4 data attributes with best features along with best score of each attribute −

set_printoptions(precision=2)
print(fit.scores_)
featured_data = fit.transform(X)
print ("\nFeatured data:\n", featured_data[0:4])

Output

[ 111.52 1411.89 17.61 53.11 2175.57 127.67 5.39 181.3 ]
Featured data:
[[148. 0. 33.6 50. ]
[ 85. 0. 26.6 31. ]
[183. 0. 23.3 32. ]
[ 89. 94. 28.1 21. ]]
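The scores printed above are easier to interpret when paired with the attribute names. A small sketch, assuming fit and names from the script above −

# pair each input attribute with its chi-square score
for name, score in zip(names[0:8], fit.scores_):
   print(name, round(score, 2))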

Recursive Feature Elimination

顾名思义,RFE(递归特征消除)特征选择技术会递归地删除属性,并使用剩余属性构建模型。我们可以借助 scikit-learn Python 库的 RFE 类来实现 RFE 特征选择技术。

As the name suggests, RFE (Recursive feature elimination) feature selection technique removes the attributes recursively and builds the model with remaining attributes. We can implement RFE feature selection technique with the help of RFE class of scikit-learn Python library.

Example

在这个示例中,我们将使用 RFE 连同逻辑回归算法从皮马印第安人糖尿病数据集中选择具有最佳特征的最佳 3 个属性。

In this example, we will use RFE with the logistic regression algorithm to select the 3 best attributes from the Pima Indians Diabetes dataset.

from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

接下来,我们将数组分成其输入和输出组件:

Next, we will separate the array into its input and output components −

X = array[:,0:8]
Y = array[:,8]

以下代码行将从数据集中选择最佳特征:

The following lines of code will select the best features from a dataset −

model = LogisticRegression()
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Number of Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

Output

Number of Features: 3
Selected Features: [ True False False False False True True False]
Feature Ranking: [1 2 3 5 6 1 1 4]

我们可以在以上输出中看到,RFE 选择了 preg、mass 和 pedi 作为前 3 个最佳特征。它们在输出中标记为 1。

We can see in the above output that RFE chose preg, mass and pedi as the first 3 best features. They are marked as 1 in the output.

Principal Component Analysis (PCA)

PCA 通常称为数据还原技术,它是一种非常有用的特征选择技术,因为它使用线性代数将数据集转换为压缩形式。我们可以借助 scikit-learn Python 库的 PCA 类来实现 PCA 特征选择技术。我们可以在输出中选择主成分的数量。

PCA, generally called a data reduction technique, is a very useful feature selection technique as it uses linear algebra to transform the dataset into a compressed form. We can implement the PCA feature selection technique with the help of the PCA class of the scikit-learn Python library. We can select the number of principal components in the output.

Example

在此示例中,我们将使用 PCA 从 Pima Indians Diabetes 数据集中选择最佳的 3 个主成分。

In this example, we will use PCA to select best 3 Principal components from Pima Indians Diabetes dataset.

from pandas import read_csv
from sklearn.decomposition import PCA
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

接下来,我们将数组分成输入和输出组件:

Next, we will separate array into input and output components −

X = array[:,0:8]
Y = array[:,8]

以下几行代码将从数据集中提取特征 -

The following lines of code will extract features from dataset −

pca = PCA(n_components=3)
fit = pca.fit(X)
print("Explained Variance: %s") % fit.explained_variance_ratio_
print(fit.components_)

Output

Explained Variance: [ 0.88854663 0.06159078 0.02579012]
[[ -2.02176587e-03 9.78115765e-02 1.60930503e-02 6.07566861e-02
9.93110844e-01 1.40108085e-02 5.37167919e-04 -3.56474430e-03]
[ 2.26488861e-02 9.72210040e-01 1.41909330e-01 -5.78614699e-02
-9.46266913e-02 4.69729766e-02 8.16804621e-04 1.40168181e-01]
[ -2.24649003e-02 1.43428710e-01 -9.22467192e-01 -3.07013055e-01
2.09773019e-02 -1.32444542e-01 -6.39983017e-04 -1.25454310e-01]]

我们可以从以上输出看到,3 个主成分与源数据几乎没有相似之处。

We can observe from the above output that the 3 principal components bear little resemblance to the source data.
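The components above only describe the transformation; to obtain the compressed representation of the data itself we can apply transform(). A short sketch, assuming fit and X from the script above −

# project the 8 original attributes onto the 3 principal components
X_compressed = fit.transform(X)
print(X_compressed.shape)
print(X_compressed[0:3])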

Feature Importance

顾名思义，特征重要性技术用来挑选重要特征。它基本上使用经过训练的有监督分类器来选择特征。我们可以通过 scikit-learn Python 库的 ExtraTreesClassifier 类来实现此特征选择技术。

As the name suggests, the feature importance technique is used to choose the important features. It basically uses a trained supervised classifier to select features. We can implement this feature selection technique with the help of the ExtraTreesClassifier class of the scikit-learn Python library.

Example

在此示例中,我们将使用 ExtraTreeClassifier 从 Pima Indians Diabetes 数据集中选择特征。

In this example, we will use ExtraTreeClassifier to select features from Pima Indians Diabetes dataset.

from pandas import read_csv
from sklearn.ensemble import ExtraTreesClassifier
path = r'C:\Desktop\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

接下来,我们将数组分成输入和输出组件:

Next, we will separate array into input and output components −

X = array[:,0:8]
Y = array[:,8]

以下几行代码将从数据集中提取特征 -

The following lines of code will extract features from dataset −

model = ExtraTreesClassifier()
model.fit(X, Y)
print(model.feature_importances_)

Output

[ 0.11070069 0.2213717 0.08824115 0.08068703 0.07281761 0.14548537 0.12654214 0.15415431]

从输出中,我们可以看到每个属性都有得分。得分越高,该属性就越重要。

From the output, we can observe that there is a score for each attribute. The higher the score, the higher the importance of that attribute.
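Pairing each score with its attribute name and sorting makes the ranking easier to read. A small sketch, assuming model and names from the script above −

# rank the attributes by importance, highest first
ranked = sorted(zip(names[0:8], model.feature_importances_), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
   print(name, round(score, 3))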

Classification - Introduction

Introduction to Classification

分类可以定义为从观察到的值或给定的数据点中预测类或类别。分类后的输出形式可以是“黑”或“白”或“垃圾邮件”或“无垃圾邮件”。

Classification may be defined as the process of predicting class or category from observed values or given data points. The categorized output can have the form such as “Black” or “White” or “spam” or “no spam”.

从数学上讲,分类是从输入变量(X)到输出变量(Y)逼近映射函数(f)的任务。它基本上属于监督机器学习,其中目标也与输入数据集一起提供。

Mathematically, classification is the task of approximating a mapping function (f) from input variables (X) to output variables (Y). It basically belongs to supervised machine learning, in which targets are also provided along with the input data set.

分类问题的示例可以是电子邮件中的垃圾邮件检测。输出只能有两类,“垃圾邮件”和“无垃圾邮件”;因此,这是一个二元类型分类。

An example of classification problem can be the spam detection in emails. There can be only two categories of output, “spam” and “no spam”; hence this is a binary type classification.

为了实现此分类,我们首先需要训练分类器。对于此示例,“垃圾邮件”和“无垃圾邮件”电子邮件将用作训练数据。成功训练分类器后,可将其用于检测未知电子邮件。

To implement this classification, we first need to train the classifier. For this example, “spam” and “no spam” emails would be used as the training data. After successfully training the classifier, it can be used to detect an unknown email.

Types of Learners in Classification

在分类问题中,我们有两种类型的学习器——

We have two types of learners with respect to classification problems −

Lazy Learners

顾名思义,此类学习器会等待在存储训练数据后出现的测试数据。只有在获取测试数据后才会执行分类。它们花费在训练上的时间较少,而花费在预测上的时间较多。惰性学习器的示例包括 k 近邻和基于案例的推理。

As the name suggests, such learners wait for the testing data to appear after storing the training data. Classification is done only after getting the testing data. They spend less time on training but more time on predicting. Examples of lazy learners are k-Nearest Neighbors and case-based reasoning.

Eager Learners

与惰性学习器相反,主动学习器会在存储训练数据后等待在测试数据出现后构建分类模型。它们花费在训练上的时间较多,而花费在预测上的时间较少。主动学习器的示例包括决策树、朴素贝叶斯和人工神经网络 (ANN)。

As opposed to lazy learners, eager learners construct a classification model from the stored training data without waiting for the testing data to appear. They spend more time on training but less time on predicting. Examples of eager learners are Decision Trees, Naïve Bayes and Artificial Neural Networks (ANN).

Building a Classifier in Python

Scikit-learn(一个用于机器学习的 Python 库)可用于在 Python 中构建分类器。在 Python 中构建分类器的步骤如下——

Scikit-learn, a Python library for machine learning, can be used to build a classifier in Python. The steps for building a classifier in Python are as follows −

Step1: Importing necessary python package

要使用 scikit-learn 构建分类器,我们需要导入它。我们可以使用以下脚本导入它——

For building a classifier using scikit-learn, we need to import it. We can import it by using following script −

import sklearn

Step2: Importing dataset

导入必要的包后,我们需要一个数据集来构建分类预测模型。我们可以从 sklearn 数据集导入它,也可以根据我们的要求使用其他数据集。我们将使用 sklearn 的乳腺癌威斯康星州诊断数据库。我们可以借助以下脚本导入它——

After importing necessary package, we need a dataset to build classification prediction model. We can import it from sklearn dataset or can use other one as per our requirement. We are going to use sklearn’s Breast Cancer Wisconsin Diagnostic Database. We can import it with the help of following script −

from sklearn.datasets import load_breast_cancer

以下脚本将加载数据集;

The following script will load the dataset;

data = load_breast_cancer()

我们还需要整理数据,可以使用以下脚本来完成此操作——

We also need to organize the data and it can be done with the help of following scripts −

label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']

以下命令将打印标签的名称,“malignant”和“benign”,针对我们的数据库的情况。

The following command will print the name of the labels, ‘malignant’ and ‘benign’ in case of our database.

print(label_names)

以上命令的输出是标签名称——

The output of the above command is the names of the labels −

['malignant' 'benign']

这些标签映射到二进制值 0 和 1。 Malignant 癌由 0 表示, Benign 癌由 1 表示。

These labels are mapped to binary values 0 and 1. Malignant cancer is represented by 0 and Benign cancer is represented by 1.

这些标签的特征名称和特征值可以使用以下命令查看——

The feature names and feature values can be seen with the help of the following commands −

print(feature_names[0])

以上命令的输出是第一个特征的名称 −

The output of the above command is the name of the first feature −

mean radius

类似地，可以按如下方式打印第二个特征的名称 −

Similarly, the name of the second feature can be printed as follows −

print(feature_names[1])

上述命令的输出是第二个特征的名称 −

The output of the above command is the name of the second feature −

mean texture

我们可以使用以下命令打印第一个数据实例的特征值 −

We can print the feature values of the first data instance with the help of the following command −

print(features[0])

这将给出以下输出 −

This will give the following output −

[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
4.601e-01 1.189e-01]

类似地，我们可以使用以下命令打印第二个数据实例的特征值 −

Similarly, we can print the feature values of the second data instance with the help of the following command −

print(features[1])

这将给出以下输出 −

This will give the following output −

[2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-02 8.690e-02
7.017e-02 1.812e-01 5.667e-02 5.435e-01 7.339e-01 3.398e+00 7.408e+01
5.225e-03 1.308e-02 1.860e-02 1.340e-02 1.389e-02 3.532e-03 2.499e+01
2.341e+01 1.588e+02 1.956e+03 1.238e-01 1.866e-01 2.416e-01 1.860e-01
2.750e-01 8.902e-02]

Step3: Organizing data into training & testing sets

因为我们需要在不可见数据上测试我们的模型,我们会将我们的数据集分成两部分:一个训练集和一个测试集。我们可以使用 sklearn python 包的 train_test_split() 函数将数据分成数据集。以下命令将导入该函数 −

As we need to test our model on unseen data, we will divide our dataset into two parts: a training set and a test set. We can use train_test_split() function of sklearn python package to split the data into sets. The following command will import the function −

from sklearn.model_selection import train_test_split

现在,下一个命令会将数据分成训练和测试数据。在这个示例中,我们将 40% 的数据用于测试目的,60% 的数据用于训练目的 −

Now, the next command will split the data into training and testing data. In this example, we are taking 40 percent of the data for testing purposes and 60 percent of the data for training purposes −

train, test, train_labels, test_labels = train_test_split(features,labels,test_size = 0.40, random_state = 42)

Step4 - Model evaluation

在将数据分成训练和测试后,我们需要构建模型。为此,我们将使用朴素贝叶斯算法。以下命令会导入 GaussianNB 模块 −

After dividing the data into training and testing sets, we need to build the model. We will be using the Naïve Bayes algorithm for this purpose. The following command will import the GaussianNB module −

from sklearn.naive_bayes import GaussianNB

现在,初始化模型如下所示 −

Now, initialize the model as follows −

gnb = GaussianNB()

接下来,在以下命令的帮助下,我们可以训练模型 −

Next, with the help of following command we can train the model −

model = gnb.fit(train, train_labels)

现在,为了评估的目的,我们需要进行预测。它可以通过使用 predict() 函数来完成,如下所示 −

Now, for evaluation purpose we need to make predictions. It can be done by using predict() function as follows −

preds = gnb.predict(test)
print(preds)

这将给出以下输出 −

This will give the following output −

[1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0
1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0
1 1 1 1 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0
1 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0
1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0
0 1 1 0 1 0 1 1 1 1 0 1 1 0 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1
0 0 1 1 0 1]

输出中上述一系列 0 和 1 是 Malignant 和 Benign 肿瘤类别的预测值。

The above series of 0s and 1s in the output are the predicted values for the Malignant and Benign tumor classes.

Step5- Finding accuracy

我们可以通过比较 test_labels 和 preds 这两个数组来找到上一步构建的模型的精确度。我们将使用 accuracy_score() 函数来确定精确度。

We can find the accuracy of the model built in the previous step by comparing the two arrays, namely test_labels and preds. We will be using the accuracy_score() function to determine the accuracy.

from sklearn.metrics import accuracy_score
print(accuracy_score(test_labels, preds))

Output

0.951754385965

上述输出显示 NaiveBayes 分类器准确率为 95.17%。

The above output shows that the NaïveBayes classifier is 95.17% accurate.

Classification Evaluation Metrics

即使你已完成机器学习应用程序或模型的实现,工作还未完成。我们必须找出我们的模型有多有效?可能有不同的评估指标,但我们必须仔细选择,因为指标的选择会影响机器学习算法的性能的测量和比较方式。

The job is not done even if you have finished the implementation of your Machine Learning application or model. We must find out how effective our model is. There can be different evaluation metrics, but we must choose them carefully because the choice of metrics influences how the performance of a machine learning algorithm is measured and compared.

以下是你可以根据你的数据集和问题类型从中进行选择的一些重要的分类评估指标 −

The following are some of the important classification evaluation metrics among which you can choose based upon your dataset and kind of problem −

Confusion Matrix

这是衡量分类问题性能的最简单方法,其中输出可以是两种或更多种类的类。混淆矩阵只不过是一个有两维的表格,即“实际”和“预测”,此外,这两个维度都具有“真阳性(TP)”、“真阴性(TN)”、“假阳性(FP)”、“假阴性(FN) ”如下所示 -

It is the easiest way to measure the performance of a classification problem where the output can be of two or more types of classes. A confusion matrix is nothing but a table with two dimensions, viz. “Actual” and “Predicted”; furthermore, both the dimensions have “True Positives (TP)”, “True Negatives (TN)”, “False Positives (FP)” and “False Negatives (FN)” as shown below −

(Figure: Confusion matrix layout with Actual and Predicted dimensions)
  1. True Positives (TP) − It is the case when both actual class & predicted class of data point is 1.

  2. True Negatives (TN) − It is the case when both actual class & predicted class of data point is 0.

  3. False Positives (FP) − It is the case when actual class of data point is 0 & predicted class of data point is 1.

  4. False Negatives (FN) − It is the case when actual class of data point is 1 & predicted class of data point is 0.

我们可以借助 sklearn 的 confusion_matrix() 函数找到混淆矩阵。借助以下脚本,我们可以找到上述构建的二元分类器的混淆矩阵 -

We can find the confusion matrix with the help of confusion_matrix() function of sklearn. With the help of the following script, we can find the confusion matrix of above built binary classifier −

from sklearn.metrics import confusion_matrix
print(confusion_matrix(test_labels, preds))

Output

[[ 73 7]
[ 4 144]]

Accuracy

可以将其定义为由我们的机器学习模型做出的正确预测的数目。我们可以通过以下公式借助混淆矩阵轻松计算它 -

It may be defined as the number of correct predictions made by our ML model as a fraction of all predictions. We can easily calculate it from the confusion matrix with the help of the following formula −

Accuracy = (TP + TN) / (TP + FP + FN + TN)

对于上述构建的二元分类器,TP + TN = 73 + 144 = 217 和 TP+FP+FN+TN = 73+7+4+144=228。

For above built binary classifier, TP + TN = 73+144 = 217 and TP+FP+FN+TN = 73+7+4+144=228.

因此,准确性 = 217/228 = 0.951754385965,这与我们在创建二元分类器后计算得出的值相同。

Hence, Accuracy = 217/228 = 0.951754385965 which is same as we have calculated after creating our binary classifier.

Precision

精度,用于文档检索,可以定义为由我们的机器学习模型返回的正确文档数。我们可以通过以下公式借助混淆矩阵轻松计算它 -

Precision, used in document retrievals, may be defined as the fraction of the items returned by our ML model that are actually correct. We can easily calculate it from the confusion matrix with the help of the following formula −

Precision = TP / (TP + FP)

对于上述构建的二元分类器,TP = 73 和 TP+FP = 73+7 = 80。

For the above built binary classifier, TP = 73 and TP+FP = 73+7 = 80.

因此,精度 = 73/80 = 0.915

Hence, Precision = 73/80 = 0.915

Recall or Sensitivity

召回率可以定义为由我们的机器学习模型返回的正例数。我们可以通过以下公式借助混淆矩阵轻松计算它 -

Recall may be defined as the fraction of the actual positives that are returned by our ML model. We can easily calculate it from the confusion matrix with the help of the following formula −

Recall = TP / (TP + FN)

对于上述构建的二元分类器,TP = 73 和 TP+FN = 73+4 = 77。

For above built binary classifier, TP = 73 and TP+FN = 73+4 = 77.

因此，召回率 = 73/77 = 0.94805

Hence, Recall = 73/77 = 0.94805

Specificity

特异性与召回率相反,可以定义为我们的 ML 模型返回的负样本数量。我们可以通过使用以下公式轻松地通过混淆矩阵计算它−

Specificity, in contrast to recall, may be defined as the fraction of the actual negatives that are correctly identified by our ML model. We can easily calculate it from the confusion matrix with the help of the following formula −

Specificity = TN / (TN + FP)

对于上述构建的二元分类器,TN = 144,TN+FP = 144+7 = 151。

For the above built binary classifier, TN = 144 and TN+FP = 144+7 = 151.

因此，特异性 = 144/151 = 0.95364

Hence, Specificity = 144/151 = 0.95364
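Instead of computing these ratios by hand, scikit-learn can report precision and recall for each class directly. A short verification sketch, assuming test_labels and preds from the classifier built earlier in this chapter (note that which class is treated as the positive one affects the figures) −

from sklearn.metrics import classification_report

# precision, recall and f1-score are reported separately for class 0 and class 1
print(classification_report(test_labels, preds))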

Various ML Classification Algorithms

以下是某些重要的 ML 分类算法 −

The following are some important ML classification algorithms −

  1. Logistic Regression

  2. Support Vector Machine (SVM)

  3. Decision Tree

  4. Naïve Bayes

  5. Random Forest

我们将在后面的章节中详细讨论所有这些分类算法。

We will be discussing all these classification algorithms in detail in further chapters.

Applications

分类算法的一些最重要的应用程序如下 −

Some of the most important applications of classification algorithms are as follows −

  1. Speech Recognition

  2. Handwriting Recognition

  3. Biometric Identification

  4. Document Classification

Classification Algorithms - Logistic Regression

Introduction to Logistic Regression

逻辑回归是一种有监督学习分类算法,用于预测目标变量的概率。目标或因变量的本质是二分的,这意味着仅有两个可能的类别。

Logistic regression is a supervised learning classification algorithm used to predict the probability of a target variable. The nature of target or dependent variable is dichotomous, which means there would be only two possible classes.

简言之,因变量本质上是二元的,数据被编码为 1(表示成功/是)或 0(表示失败/否)。

In simple words, the dependent variable is binary in nature having data coded as either 1 (stands for success/yes) or 0 (stands for failure/no).

在数学上,逻辑回归模型会预测 P(Y=1) 作为 X 的函数。它是可以用于各种分类问题的最简单的 ML 算法之一,例如垃圾邮件检测、糖尿病预测、癌症检测等。

Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of the simplest ML algorithms that can be used for various classification problems such as spam detection, Diabetes prediction, cancer detection etc.

Types of Logistic Regression

通常,逻辑回归意味着具有二进制目标变量的二进制逻辑回归,但可能有更多两类目标变量可以通过它来预测。基于这些类别的数量,逻辑回归可以划分为以下类型 -

Generally, logistic regression means binary logistic regression having binary target variables, but the target variable can also have more than two categories that can be predicted by it. Based on the number of categories, logistic regression can be divided into the following types −

Binary or Binomial

在这种分类中,因变量只有两种可能的类型 1 和 0。例如,这些变量可能表示成功或失败、是或否、赢或输等。

In such a kind of classification, the dependent variable will have only two possible types, either 1 or 0. For example, these variables may represent success or failure, yes or no, win or loss etc.

Multinomial

在这种分类中,因变量可能有 3 个或更多种可能的无序类型或没有定量意义的类型。例如,这些变量可能表示“类型 A”或“类型 B”或“类型 C”。

In such a kind of classification, dependent variable can have 3 or more possible unordered types or the types having no quantitative significance. For example, these variables may represent “Type A” or “Type B” or “Type C”.

Ordinal

在这种分类中,因变量可能有 3 个或更多种可能的有序类型或具有定量意义的类型。例如,这些变量可能表示“差”或“好”、“非常好”、“优秀”,每个类别都可以有 0、1、2、3 等得分。

In such a kind of classification, dependent variable can have 3 or more possible ordered types or the types having a quantitative significance. For example, these variables may represent “poor” or “good”, “very good”, “Excellent” and each category can have the scores like 0,1,2,3.

Logistic Regression Assumptions

在开始实现逻辑回归之前,我们必须认识到以下关于此算法的假设 -

Before diving into the implementation of logistic regression, we must be aware of the following assumptions about the same −

  1. In case of binary logistic regression, the target variables must be binary always and the desired outcome is represented by the factor level 1.

  2. There should not be any multi-collinearity in the model, which means the independent variables must be independent of each other.

  3. We must include meaningful variables in our model.

  4. We should choose a large sample size for logistic regression.

Binary Logistic Regression model

逻辑回归最简单的形式是二元或二项逻辑回归,其中目标变量或因变量可以仅具有两种可能的类型,即 1 或 0。它允许我们对多个预测变量和二元/二项目标变量建模关系。对于逻辑回归,线性函数基本作为另一个函数(例如下面的关系中的𝑔)的输入使用 -

The simplest form of logistic regression is binary or binomial logistic regression in which the target or dependent variable can have only 2 possible types, either 1 or 0. It allows us to model a relationship between multiple predictor variables and a binary/binomial target variable. In case of logistic regression, the linear function is basically used as an input to another function such as 𝑔 in the following relation −

ℎ = 𝑔(𝑋𝜃), where 0 ≤ ℎ ≤ 1

这里,𝑔是逻辑斯蒂或 sigmoid 函数,可以表示如下 -

Here, 𝑔 is the logistic or sigmoid function, which can be given as follows −

g(z) = 1 / (1 + e^(−z)), where z = θᵀx

sigmoid 曲线可以借助下面的图像表示。我们可以看到 y 轴的值介于 0 到 1 之间,并且在 0.5 处穿过轴。

The sigmoid curve can be represented with the help of the following graph. We can see that the values on the y-axis lie between 0 and 1 and that the curve crosses the axis at 0.5.


类别可以分为正类或负类。输出落在 0 到 1 之间时,即正类的概率。对于我们的实现,我们解释假设函数的输出,如果它≥0.5 为正,否则为负。

The classes can be divided into positive or negative. The output lies between 0 and 1 and is interpreted as the probability of the positive class. For our implementation, we interpret the output of the hypothesis function as positive if it is ≥ 0.5, otherwise negative.
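The following minimal sketch shows the sigmoid function and the 0.5 threshold in isolation, using only NumPy; the input values are illustrative −

import numpy as np

def sigmoid(z):
   # logistic (sigmoid) function: maps any real number into (0, 1)
   return 1 / (1 + np.exp(-z))

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
probs = sigmoid(z)
labels = (probs >= 0.5).astype(int)   # threshold at 0.5
print(probs)    # ≈ [0.047 0.378 0.5 0.622 0.953]
print(labels)   # [0 0 1 1 1]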

我们还需要定义一个损失函数来度量算法使用函数权重的执行情况,该函数由 theta 如下所示 -

We also need to define a loss function to measure how well the algorithm performs using the weights, represented by theta, as follows −

ℎ = 𝑔(𝑋𝜃)

J(θ) = (1/m) · Σ ( −y · log(h) − (1 − y) · log(1 − h) )

现在,在定义了损失函数后,我们的主要目标是最小化损失函数。可以通过帮助拟合权重来实现,这意味着增加或减少权重。借助于相对于每个权重的损失函数的导数,我们将能够了解哪些参数应该具有高权重以及哪些参数应该具有较小的权重。

Now, after defining the loss function, our prime goal is to minimize it. This can be done by fitting the weights, which means increasing or decreasing them. With the help of the derivatives of the loss function with respect to each weight, we can learn which parameters should have high weights and which should have smaller ones.

以下梯度下降方程告诉我们如果我们修改参数,损失将如何改变 -

The following gradient descent equation tells us how the loss would change if we modified the parameters −

δJ(θ)/δθ = (1/m) · Xᵀ(g(Xθ) − y)

Implementation in Python

现在,我们将用 Python 实现二项式逻辑回归的上述概念。为此,我们使用了一个名为“iris”的多变量花数据集,其中有 3 类,每类有 50 个实例,但我们将使用前两个特征列。每个类代表一种鸢尾花。

Now we will implement the above concept of binomial logistic regression in Python. For this purpose, we are using a multivariate flower dataset named ‘iris’, which has 3 classes of 50 instances each, but we will be using only the first two feature columns. Every class represents a type of iris flower.

首先,我们需要导入必需的库,如下所示 -

First, we need to import the necessary libraries as follows −

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets

然后,加载 iris 数据集,如下所示 -

Next, load the iris dataset as follows −

iris = datasets.load_iris()
X = iris.data[:, :2]
y = (iris.target != 0) * 1

我们可以绘制我们的训练数据,如下所示 -

We can plot our training data as follows −

plt.figure(figsize=(6, 6))
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='g', label='0')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='y', label='1')
plt.legend();

接下来,我们将定义 sigmoid 函数、损失函数和梯度下降,如下所示 -

Next, we will define sigmoid function, loss function and gradient descend as follows −

class LogisticRegression:
   def __init__(self, lr=0.01, num_iter=100000, fit_intercept=True, verbose=False):
      self.lr = lr
      self.num_iter = num_iter
      self.fit_intercept = fit_intercept
      self.verbose = verbose
   def __add_intercept(self, X):
      intercept = np.ones((X.shape[0], 1))
      return np.concatenate((intercept, X), axis=1)
   def __sigmoid(self, z):
      return 1 / (1 + np.exp(-z))
   def __loss(self, h, y):
      return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()
   def fit(self, X, y):
      if self.fit_intercept:
         X = self.__add_intercept(X)

现在,初始化权重,如下所示 -

Now, initialize the weights as follows −

      # (continuation of the fit() method defined above)
      self.theta = np.zeros(X.shape[1])
      for i in range(self.num_iter):
         z = np.dot(X, self.theta)
         h = self.__sigmoid(z)
         gradient = np.dot(X.T, (h - y)) / y.size
         self.theta -= self.lr * gradient
         z = np.dot(X, self.theta)
         h = self.__sigmoid(z)
         loss = self.__loss(h, y)
         if self.verbose == True and i % 10000 == 0:
            print(f'loss: {loss} \t')

借助以下脚本,我们可以预测输出概率 -

With the help of the following script, we can predict the output probabilities −

   # (these methods also belong to the LogisticRegression class defined above)
   def predict_prob(self, X):
      if self.fit_intercept:
         X = self.__add_intercept(X)
      return self.__sigmoid(np.dot(X, self.theta))
   def predict(self, X):
      return self.predict_prob(X).round()

接下来,我们可以对模型进行评估并将其绘制如下 -

Next, we can evaluate the model and plot it as follows −

model = LogisticRegression(lr=0.1, num_iter=300000)
model.fit(X, y)
preds = model.predict(X)
(preds == y).mean()

plt.figure(figsize=(10, 6))
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='g', label='0')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='y', label='1')
plt.legend()
x1_min, x1_max = X[:,0].min(), X[:,0].max(),
x2_min, x2_max = X[:,1].min(), X[:,1].max(),
xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max), np.linspace(x2_min, x2_max))
grid = np.c_[xx1.ravel(), xx2.ravel()]
probs = model.predict_prob(grid).reshape(xx1.shape)
plt.contour(xx1, xx2, probs, [0.5], linewidths=1, colors='red');

Multinomial Logistic Regression Model

另一种有用的逻辑回归形式是多项逻辑回归,其中目标变量或因变量可以具有 3 种或更多种可能的无序类型,即类型没有数量上的意义。

Another useful form of logistic regression is multinomial logistic regression in which the target or dependent variable can have 3 or more possible unordered types i.e. the types having no quantitative significance.

Implementation in Python

现在,我们将在 Python 中实现多项逻辑回归的上述概念。为此,我们正在使用一个名为 digit 的 sklearn 中的数据集。

Now we will implement the above concept of multinomial logistic regression in Python. For this purpose, we are using a dataset from sklearn named digit.

首先,我们需要导入必需的库,如下所示 -

First, we need to import the necessary libraries as follows −

import sklearn
from sklearn import datasets
from sklearn import linear_model
from sklearn import metrics
from sklearn.model_selection import train_test_split

接下来,我们需要加载 digit 数据集 -

Next, we need to load digit dataset −

digits = datasets.load_digits()

现在,定义特征矩阵 (X) 和响应矢量 (y) 如下所示 −

Now, define the feature matrix(X) and response vector(y)as follows −

X = digits.data
y = digits.target

在下一行代码的帮助下,我们可以将 X 和 y 拆分为训练集和测试集 −

With the help of next line of code, we can split X and y into training and testing sets −

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

现在按以下方式创建一个逻辑回归对象 −

Now create an object of logistic regression as follows −

digreg = linear_model.LogisticRegression()

现在,我们需要使用训练集训练模型,如下所示 −

Now, we need to train the model by using the training sets as follows −

digreg.fit(X_train, y_train)

接下来,按以下方式对测试集进行预测 −

Next, make the predictions on testing set as follows −

y_pred = digreg.predict(X_test)

接下来按以下方式打印模型的准确度 −

Next print the accuracy of the model as follows −

print("Accuracy of Logistic Regression model is:",
metrics.accuracy_score(y_test, y_pred)*100)

Output

Accuracy of Logistic Regression model is: 95.6884561891516

从以上输出我们可以看到模型的准确度为 96%。

From the above output we can see the accuracy of our model is around 96 percent.
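As an optional extra step, the per-class behaviour of this model can also be inspected, reusing the digreg object and the test split from above −

from sklearn import metrics

# Per-class view of the predictions made above
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))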

Support Vector Machine (SVM)

Introduction to SVM

支持向量机 (SVM) 功能强大且灵活,是监督机器学习算法,可同时用于分类和回归。但一般而言,它们用于分类问题。SVm 于 1960 年代首次推出,但后来在 1990 年得到了改进。与其他机器学习算法相比,SVM 具有独特的实现方式。近年来,由于其处理多个连续变量和分类变量的能力而备受青睐。

Support vector machines (SVMs) are powerful yet flexible supervised machine learning algorithms which are used both for classification and regression. But generally, they are used in classification problems. SVMs were first introduced in the 1960s and were later refined in the 1990s. SVMs have their unique way of implementation as compared to other machine learning algorithms. Lately, they are extremely popular because of their ability to handle multiple continuous and categorical variables.

Working of SVM

SVM 模型本质上是对多维空间中超平面中不同类别的表示。SVM 会以迭代方式生成超平面,以便最大限度地减少错误。SVM 的目标是将数据集划分为不同的类别,以找到最大边缘超平面 (MMH)。

An SVM model is basically a representation of different classes in a hyperplane in multidimensional space. The hyperplane will be generated in an iterative manner by SVM so that the error can be minimized. The goal of SVM is to divide the datasets into classes to find a maximum marginal hyperplane (MMH).


SVM 中的重要概念包括以下几个 −

The followings are important concepts in SVM −

  1. Support Vectors − Datapoints that are closest to the hyperplane are called support vectors. The separating line will be defined with the help of these data points.

  2. Hyperplane − As we can see in the above diagram, it is a decision plane or space which divides a set of objects having different classes.

  3. Margin − It may be defined as the gap between two lines on the closest data points of different classes. It can be calculated as the perpendicular distance from the line to the support vectors. A large margin is considered a good margin and a small margin is considered a bad margin.

SVM 的主要目标是将数据集划分为不同类别,以找到最大边缘超平面 (MMH),可以分两步完成 −

The main goal of SVM is to divide the datasets into classes to find a maximum marginal hyperplane (MMH) and it can be done in the following two steps −

  1. First, SVM will generate hyperplanes iteratively that segregates the classes in best way.

  2. Then, it will choose the hyperplane that separates the classes correctly.

Implementing SVM in Python

我们在 Python 中实现 SVM 的过程从导入标准库开始,如下所示 −

For implementing SVM in Python we will start with the standard libraries import as follows −

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns; sns.set()

接下来,我们从 sklearn.dataset.sample_generator 使用 SVM 创建用于分类的线性可分离数据样本集 −

Next, we are creating a sample dataset, having linearly separable data, using make_blobs from sklearn.datasets for classification using SVM −

from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, centers=2,
      random_state=0, cluster_std=0.50)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer');

在生成包含 100 个样本和 2 个簇的样本集后,输出如下 −

The following would be the output after generating sample dataset having 100 samples and 2 clusters −


我们知道 SVM 支持判别分类。它通过在二维情况下简单地找到一条线或在多维情况下找到一个流形来区分不同的类。在上述数据集中实现 SVM 的过程如下 −

We know that SVM supports discriminative classification. It divides the classes from each other by simply finding a line in the case of two dimensions, or a manifold in the case of multiple dimensions. It is implemented on the above dataset as follows −

xfit = np.linspace(-1, 3.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer')
plt.plot([0.6], [2.1], 'x', color='black', markeredgewidth=4, markersize=12)
for m, b in [(1, 0.65), (0.5, 1.6), (-0.2, 2.9)]:
   plt.plot(xfit, m * xfit + b, '-k')
plt.xlim(-1, 3.5);

输出如下 −

The output is as follows −


从上述输出中我们可以看到,有三个不同的分隔符完美地区分了上述样本。

We can see from the above output that there are three different separators that perfectly discriminate the above samples.

如前所述,SVM 的主要目标是将数据集划分为类别以找到最大边缘超平面 (MMH),因此,我们可以在类之间绘制零线,也可以围绕每条线绘制一个边缘,其宽度最多可达最近点。它可以如下完成——

As discussed, the main goal of SVM is to divide the datasets into classes to find a maximum marginal hyperplane (MMH). Hence, rather than simply drawing a zero-width line between the classes, we can draw around each line a margin of some width up to the nearest point. It can be done as follows −

xfit = np.linspace(-1, 3.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer')
for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
   yfit = m * xfit + b
   plt.plot(xfit, yfit, '-k')
   plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none',
         color='#AAAAAA', alpha=0.4)
plt.xlim(-1, 3.5);

从输出中的上述图像中,我们很容易观察到判别分类器中的“边缘”。SVM 将选择最大化边缘的线。

From the above image in output, we can easily observe the “margins” within the discriminative classifiers. SVM will choose the line that maximizes the margin.

接下来,我们将使用 Scikit-Learn 的支持向量分类器对该数据训练 SVM 模型。在此,我们使用线性核来拟合 SVM,如下所示——

Next, we will use Scikit-Learn’s support vector classifier to train an SVM model on this data. Here, we are using linear kernel to fit SVM as follows −

from sklearn.svm import SVC # "Support vector classifier"
model = SVC(kernel='linear', C=1E10)
model.fit(X, y)

输出如下 −

The output is as follows −

SVC(C=10000000000.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)

现在,为了更好地理解,以下内容将绘制二维 SVC 的决策函数——

Now, for a better understanding, the following will plot the decision functions for 2D SVC −

def decision_function(model, ax=None, plot_support=True):
   if ax is None:
      ax = plt.gca()
   xlim = ax.get_xlim()
   ylim = ax.get_ylim()

为了评估模型,我们需要创建网格,如下所示——

For evaluating model, we need to create grid as follows −

   # (continuation of decision_function(): create a grid to evaluate the model)
   x = np.linspace(xlim[0], xlim[1], 30)
   y = np.linspace(ylim[0], ylim[1], 30)
   Y, X = np.meshgrid(y, x)
   xy = np.vstack([X.ravel(), Y.ravel()]).T
   P = model.decision_function(xy).reshape(X.shape)

接下来,我们需要绘制决策边界和边缘,如下所示——

Next, we need to plot decision boundaries and margins as follows −

   # (continuation: plot the decision boundary and margins)
   ax.contour(X, Y, P, colors='k',
      levels=[-1, 0, 1], alpha=0.5,
      linestyles=['--', '-', '--'])

现在,以类似的方式绘制支持向量,如下所示——

Now, similarly plot the support vectors as follows −

   # (continuation: plot the support vectors)
   if plot_support:
      ax.scatter(model.support_vectors_[:, 0],
         model.support_vectors_[:, 1],
         s=300, linewidth=1, facecolors='none');
   ax.set_xlim(xlim)
   ax.set_ylim(ylim)

现在,使用此函数拟合我们的模型,如下所示——

Now, use this function to plot the fit of our model as follows −

plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer')
decision_function(model);

我们可以从上述输出中观察到,一个 SVM 分类器适合具有裕度的数据,即虚线和支持向量,这是该拟合的关键元素,触及虚线。这些支持向量点存储在分类器的 support_vectors_ 属性中,如下所示——

We can observe from the above output that the SVM classifier fits the data with a margin, i.e. the dashed lines, and that the support vectors, the pivotal elements of this fit, touch the dashed lines. These support vector points are stored in the support_vectors_ attribute of the classifier, as follows −

model.support_vectors_

输出如下 −

The output is as follows −

array([[0.5323772 , 3.31338909],
   [2.11114739, 3.57660449],
   [1.46870582, 1.86947425]])

SVM Kernels

在实践中,SVM 算法使用将输入数据空间转换为所需形式的核来实现。SVM 使用称为核技巧的技术,其中核采用低维输入空间并将其转换为更高维的空间。简单来说,核通过向其中添加更多维度将不可分离的问题转换为可分离问题。它使 SVM 功能更强大、更灵活、更准确。以下是 SVM 使用的一些内核类型——

In practice, the SVM algorithm is implemented with a kernel that transforms an input data space into the required form. SVM uses a technique called the kernel trick, in which the kernel takes a low-dimensional input space and transforms it into a higher-dimensional space. In simple words, the kernel converts non-separable problems into separable problems by adding more dimensions. It makes SVM more powerful, flexible and accurate. The following are some of the types of kernels used by SVM −

Linear Kernel

它可用作任意两个观测之间的点积。线性核的公式如下——

It can be used as a dot product between any two observations. The formula of linear kernel is as below −

k(x,xi) = sum(x*xi)

从上述公式中,我们可以看到两个向量(例如 𝑥 和 𝑥𝑖)之间的乘积是输入值每对乘积的总和。

From the above formula, we can see that the product between two vectors say 𝑥 & 𝑥𝑖 is the sum of the multiplication of each pair of input values.

Polynomial Kernel

它是线性核的更通用形式,并且区分曲线或非线性输入空间。以下是多项式核的公式——

It is a more generalized form of the linear kernel and can distinguish curved or nonlinear input spaces. Following is the formula for the polynomial kernel −

K(x, xi) = 1 + sum(x * xi)^d

其中 d 是多项式的度数,我们需要在学习算法中手动指定它。

Here d is the degree of polynomial, which we need to specify manually in the learning algorithm.

Radial Basis Function (RBF) Kernel

RBF 核主要用于 SVM 分类,它将输入空间映射到无限维空间。以下公式在数学上对它进行了说明——

The RBF kernel, mostly used in SVM classification, maps the input space into an infinite-dimensional space. The following formula explains it mathematically −

K(x, xi) = exp(-gamma * sum((x – xi)^2))

在此,gamma 范围从 0 到 1。我们需要在学习算法中手动指定它。gamma 一个好的默认值为 0.1。

Here, gamma ranges from 0 to 1. We need to manually specify it in the learning algorithm. A good default value of gamma is 0.1.
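For illustration only, the three kernel formulas above can be evaluated directly with NumPy for a pair of sample vectors; the vectors and parameter values are arbitrary −

import numpy as np

x = np.array([1.0, 2.0, 3.0])
xi = np.array([2.0, 0.5, 1.0])

def linear_kernel(x, xi):
   return np.sum(x * xi)                         # dot product

def polynomial_kernel(x, xi, d=3):
   return (1 + np.sum(x * xi)) ** d              # degree d polynomial

def rbf_kernel(x, xi, gamma=0.1):
   return np.exp(-gamma * np.sum((x - xi) ** 2))

print(linear_kernel(x, xi))       # 6.0
print(polynomial_kernel(x, xi))   # 343.0
print(rbf_kernel(x, xi))          # ≈ 0.48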

由于我们为线性可分离数据实现了 SVM,所以我们可以使用 Python 针对不可线性分离的数据实现它。可以通过使用核函数来实现这一点。

As we implemented SVM for linearly separable data, we can implement it in Python for the data that is not linearly separable. It can be done by using kernels.

Example

下面是使用核函数创建 SVM 分类器的示例。我们将使用 scikit-learn 中的虹膜数据集:

The following is an example for creating an SVM classifier by using kernels. We will be using iris dataset from scikit-learn −

我们将通过导入以下包来开始:

We will start by importing following packages −

import pandas as pd
import numpy as np
from sklearn import svm, datasets
import matplotlib.pyplot as plt

现在,我们需要加载输入数据:

Now, we need to load the input data −

iris = datasets.load_iris()

从该数据集,我们按如下方式获取前两个特征:

From this dataset, we are taking first two features as follows −

X = iris.data[:, :2]
y = iris.target

接下来,我们将使用原始数据按如下方式绘制 SVM 边界:

Next, we will plot the SVM boundaries with original data as follows −

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = (x_max - x_min) / 100   # step size for the mesh grid
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
   np.arange(y_min, y_max, h))
X_plot = np.c_[xx.ravel(), yy.ravel()]

现在,我们需要按如下方式提供正则化参数的值:

Now, we need to provide the value of regularization parameter as follows −

C = 1.0

接下来,可以按如下方式创建 SVM 分类器对象:

Next, SVM classifier object can be created as follows −

svc_classifier = svm.SVC(kernel='linear', C=C).fit(X, y)

Z = svc_classifier.predict(X_plot)
Z = Z.reshape(xx.shape)
plt.figure(figsize=(15, 5))
plt.subplot(121)
plt.contourf(xx, yy, Z, cmap=plt.cm.tab10, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('Support Vector Classifier with linear kernel')

Output

Text(0.5, 1.0, 'Support Vector Classifier with linear kernel')

对于使用 rbf 核创建 SVM 分类器,我们可以按如下方式将核更改为 rbf

For creating SVM classifier with rbf kernel, we can change the kernel to rbf as follows −

svc_classifier = svm.SVC(kernel='rbf', gamma='auto', C=C).fit(X, y)
Z = svc_classifier.predict(X_plot)
Z = Z.reshape(xx.shape)
plt.figure(figsize=(15, 5))
plt.subplot(121)
plt.contourf(xx, yy, Z, cmap=plt.cm.tab10, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('Support Vector Classifier with rbf kernel')

Output

Text(0.5, 1.0, 'Support Vector Classifier with rbf kernel')

我们将 gamma 值设置为“自动”,但是你也可以提供 0 到 1 之间的值。

We set the value of gamma to ‘auto’, but you can also provide a value between 0 and 1.

Pros and Cons of SVM Classifiers

Pros of SVM classifiers

SVM 分类器提供极高的准确度,并且在高维度空间中表现良好。SVM 分类器基本上使用训练点的子集,因此实际上使用非常少的内存。

SVM classifiers offer great accuracy and work well with high-dimensional spaces. SVM classifiers basically use a subset of the training points and hence use very little memory.

Cons of SVM classifiers

它们的训练时间很长,因此在实践中不适用于大型数据集。另一个缺点是 SVM 分类器与重叠类别不匹配。

They have a high training time and hence, in practice, they are not suitable for large datasets. Another disadvantage is that SVM classifiers do not work well with overlapping classes.

Classification Algorithms - Decision Tree

Introduction to Decision Tree

一般而言,决策树分析是一种预测建模工具,可应用于许多领域。决策树可以通过一种算法方法来构建,该方法可以基于不同条件以不同方式拆分数据集。决策树是最强大的算法,属于监督算法范畴。

In general, decision tree analysis is a predictive modelling tool that can be applied across many areas. Decision trees can be constructed by an algorithmic approach that can split the dataset in different ways based on different conditions. Decision trees are among the most powerful algorithms that fall under the category of supervised algorithms.

它们可用于分类和回归任务。树的两个主要实体是决策节点(其中数据被拆分)和叶节点(其中我们得到结果)。下面给出了一个二叉树示例,用于预测人的健康状况,它提供了年龄、饮食习惯和锻炼习惯等各种信息:

They can be used for both classification and regression tasks. The two main entities of a tree are decision nodes, where the data is split, and leaves, where we get the outcome. An example of a binary tree for predicting whether a person is fit or unfit, given various information like age, eating habits and exercise habits, is given below −


在上述决策树中,问题是决策节点,最终结果是叶节点。我们有以下两种类型的决策树:

In the above decision tree, the questions are decision nodes and the final outcomes are leaves. We have the following two types of decision trees −

  1. Classification decision trees − In this kind of decision trees, the decision variable is categorical. The above decision tree is an example of classification decision tree.

  2. Regression decision trees − In this kind of decision trees, the decision variable is continuous.

Implementing Decision Tree Algorithm

Gini Index

这是用来评估数据集中二分类分割的成本函数名称,并且使用类别目标变量“成功”或“失败”。

It is the name of the cost function that is used to evaluate the binary splits in the dataset and works with a categorical target variable such as “Success” or “Failure”.

基尼指数的值越高,同质性越高。完美的基尼指数值为 0,最差为 0.5(对于 2 类问题)。可以用以下步骤计算分割的基尼指数 −

The lower the value of the Gini index, the higher the homogeneity. A perfect Gini index value is 0 and the worst is 0.5 (for a 2-class problem). The Gini index for a split can be calculated with the help of the following steps −

  1. First, calculate the Gini index for sub-nodes by using the formula 1 − (p² + q²), where p² + q² is the sum of the squares of the probabilities of success and failure.

  2. Next, calculate the Gini index for the split using the weighted Gini score of each node of that split.

分类和回归树 (CART) 算法使用基尼方法生成二进制分割。

Classification and Regression Tree (CART) algorithm uses Gini method to generate binary splits.
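The Gini calculation described above can be sketched in plain Python. This is a minimal, illustrative helper (not the scikit-learn implementation); every row is assumed to carry its class label as the last element −

# `groups` is the list of row-groups produced by a candidate split and
# `classes` is the list of class values (e.g. [0, 1])
def gini_index(groups, classes):
   n_instances = float(sum(len(group) for group in groups))
   gini = 0.0
   for group in groups:
      size = float(len(group))
      if size == 0:
         continue
      score = 0.0
      for class_val in classes:
         p = [row[-1] for row in group].count(class_val) / size
         score += p * p                                 # p^2 + q^2 for the group
      gini += (1.0 - score) * (size / n_instances)      # weighted Gini
   return gini

# Worst case: each group is a 50/50 mixture -> Gini = 0.5
print(gini_index([[[1, 1], [1, 0]], [[1, 1], [1, 0]]], [0, 1]))   # 0.5
# Perfect split: each group is pure -> Gini = 0.0
print(gini_index([[[1, 0], [1, 0]], [[1, 1], [1, 1]]], [0, 1]))   # 0.0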

Split Creation

分割基本上包括数据集中的一个属性和一个值。我们可以使用以下三部分在数据集中创建一个分割 −

A split is basically including an attribute in the dataset and a value. We can create a split in dataset with the help of following three parts −

  1. Part1 − Calculating Gini Score: We have just discussed this part in the previous section.

  2. Part2 − Splitting a dataset: It may be defined as separating a dataset into two lists of rows having index of an attribute and a split value of that attribute. After getting the two groups - right and left, from the dataset, we can calculate the value of split by using Gini score calculated in first part. Split value will decide in which group the attribute will reside.

  3. Part3 − Evaluating all splits: Next part after finding Gini score and splitting dataset is the evaluation of all splits. For this purpose, first, we must check every value associated with each attribute as a candidate split. Then we need to find the best possible split by evaluating the cost of the split. The best split will be used as a node in the decision tree.
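Parts 2 and 3 above can be sketched as follows, reusing the gini_index helper from the previous sketch; again this is illustrative code, not a library API −

# Part 2: separate the rows into two groups based on an attribute index and value
def test_split(index, value, dataset):
   left, right = [], []
   for row in dataset:
      if row[index] < value:
         left.append(row)
      else:
         right.append(row)
   return left, right

# Part 3: evaluate every candidate split and keep the one with the lowest Gini
def get_split(dataset):
   class_values = list(set(row[-1] for row in dataset))
   b_index, b_value, b_score, b_groups = None, None, float('inf'), None
   for index in range(len(dataset[0]) - 1):
      for row in dataset:
         groups = test_split(index, row[index], dataset)
         gini = gini_index(groups, class_values)
         if gini < b_score:
            b_index, b_value, b_score, b_groups = index, row[index], gini, groups
   return {'index': b_index, 'value': b_value, 'groups': b_groups}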

Building a Tree

众所周知,一颗树有根节点和终节点。创建根节点后,我们可以按照以下两部分构建树 −

As we know that a tree has root node and terminal nodes. After creating the root node, we can build the tree by following two parts −

Part1: Terminal node creation

在创建决策树的终节点时,一个重要的一点是确定何时停止加深树或创建更多终节点。它可以用两个标准来完成,即最大树深度和最小节点记录,如下所示 −

While creating terminal nodes of a decision tree, one important point is to decide when to stop growing the tree or creating further terminal nodes. It can be done by using two criteria, namely maximum tree depth and minimum node records, as follows −

  1. Maximum Tree Depth − As the name suggests, this is the maximum number of nodes in a tree after the root node. We must stop adding terminal nodes once the tree has reached the maximum depth, i.e. once the tree has got the maximum number of terminal nodes.

  2. Minimum Node Records − It may be defined as the minimum number of training patterns that a given node is responsible for. We must stop adding terminal nodes once the tree reaches these minimum node records or goes below this minimum.

终节点用于做出最终预测。

Terminal node is used to make a final prediction.

Part2: Recursive Splitting

当我们了解何时创建终节点时,我们现在可以开始构建树。递归分割是一种构建树的方法。用这种方法,一旦创建一个节点,我们就可以对由分割数据集生成的数据的每一组递归创建子节点(添加到现有节点的节点),方法是反复调用同一个函数。

As we have understood when to create terminal nodes, now we can start building our tree. Recursive splitting is a method to build the tree. In this method, once a node is created, we can create the child nodes (nodes added to an existing node) recursively on each group of data generated by splitting the dataset, by calling the same function again and again.
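A minimal sketch of terminal node creation and recursive splitting, building on the get_split helper sketched earlier; max_depth and min_size stand for the maximum tree depth and minimum node records discussed above −

# A terminal node simply predicts the most common class in its group
def to_terminal(group):
   outcomes = [row[-1] for row in group]
   return max(set(outcomes), key=outcomes.count)

# Recursive splitting with the two stopping criteria discussed above
def split(node, max_depth, min_size, depth):
   left, right = node['groups']
   del node['groups']
   if not left or not right:                      # no split happened
      node['left'] = node['right'] = to_terminal(left + right)
      return
   if depth >= max_depth:                         # maximum tree depth reached
      node['left'], node['right'] = to_terminal(left), to_terminal(right)
      return
   for side, group in (('left', left), ('right', right)):
      if len(group) <= min_size:                  # minimum node records reached
         node[side] = to_terminal(group)
      else:
         node[side] = get_split(group)
         split(node[side], max_depth, min_size, depth + 1)

def build_tree(train, max_depth, min_size):
   root = get_split(train)
   split(root, max_depth, min_size, 1)
   return root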

Prediction

在构建决策树之后,我们需要对它进行预测。本质上,预测涉及使用特定提供的数据行浏览决策树。

After building a decision tree, we need to make a prediction about it. Basically, prediction involves navigating the decision tree with the specifically provided row of data.

我们可以使用递归函数进行预测,如上所述。使用左右子节点再次调用相同的预测程序。

We can make a prediction with the help of a recursive function, as was done above. The same prediction routine is called again with the left or the right child nodes.
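The recursive prediction routine can be sketched as follows, using the node dictionaries produced by the build sketch above and a small, made-up dataset −

# Navigate the tree recursively with one row of data
def predict(node, row):
   if row[node['index']] < node['value']:
      branch = node['left']
   else:
      branch = node['right']
   if isinstance(branch, dict):      # internal node: keep descending
      return predict(branch, row)
   return branch                     # terminal node: the predicted class

# Usage with the helpers sketched above (toy rows: [feature1, feature2, label])
dataset = [[2.7, 1.1, 0], [1.3, 1.5, 0], [3.6, 4.4, 0],
           [7.5, 3.2, 1], [9.0, 3.0, 1], [7.4, 0.5, 1]]
tree = build_tree(dataset, max_depth=2, min_size=1)
print([predict(tree, row) for row in dataset])   # should reproduce the labels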

Assumptions

我们在创建决策树时做出的假设如下 −

The following are some of the assumptions we make while creating decision tree −

  1. While preparing decision trees, the whole training set is considered as the root node.

  2. The decision tree classifier prefers the feature values to be categorical. If you want to use continuous values, they must be discretized prior to model building.

  3. Based on the attribute’s values, the records are recursively distributed.

  4. A statistical approach is used to place attributes at any node position, i.e. as root node or internal node.

Implementation in Python

Example

在以下示例中,我们将对 Pima 印第安人糖尿病实施决策树分类器——

In the following example, we are going to implement Decision Tree classifier on Pima Indian Diabetes −

首先,从导入必要的 Python 包开始——

First, start with importing necessary python packages −

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

接下来,从其 Web 链接下载 iris 数据集,如下所示——

Next, load the Pima Indian Diabetes dataset from its CSV file as follows −

col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv(r"C:\pima-indians-diabetes.csv", header=None, names=col_names)
pima.head()
      pregnant    glucose  bp    skin  insulin  bmi   pedigree    age   label
0       6         148      72    35     0       33.6    0.627     50      1
1       1         85       66    29     0       26.6    0.351     31      0
2       8         183      64     0     0       23.3    0.672     32      1
3       1         89       66    23     94      28.1    0.167     21      0
4       0         137      40    35     168     43.1    2.288     33      1

现在,将数据集拆分为特征和目标变量,如下所示——

Now, split the dataset into features and target variable as follows −

feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = pima[feature_cols] # Features
y = pima.label # Target variable

接下来,我们将数据分为训练和测试拆分。以下代码将数据集拆分为 70% 的训练数据和 30% 的测试数据——

Next, we will divide the data into train and test split. The following code will split the dataset into 70% training data and 30% of testing data −

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

接下来,使用 sklearn 的 DecisionTreeClassifier 类对模型进行训练,如下所示——

Next, train the model with the help of DecisionTreeClassifier class of sklearn as follows −

clf = DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)

最后,我们需要进行预测。可以使用以下脚本完成——

At last we need to make prediction. It can be done with the help of following script −

y_pred = clf.predict(X_test)

接下来,我们可以得到准确度得分、混淆矩阵和分类报告,如下所示——

Next, we can get the accuracy score, confusion matrix and classification report as follows −

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:",)
print (result1)
result2 = accuracy_score(y_test,y_pred)
print("Accuracy:",result2)

Output

Confusion Matrix:
[[116 30]
[ 46 39]]
Classification Report:
            precision   recall   f1-score    support
      0       0.72      0.79       0.75     146
      1       0.57      0.46       0.51     85
micro avg     0.67      0.67       0.67     231
macro avg     0.64      0.63       0.63     231
weighted avg  0.66      0.67       0.66     231

Accuracy: 0.670995670995671

Visualizing Decision Tree

可以使用以下代码对上述决策树进行可视化——

The above decision tree can be visualized with the help of following code −

from sklearn.tree import export_graphviz
from io import StringIO   # sklearn.externals.six has been removed in newer scikit-learn versions
from IPython.display import Image
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
      filled=True, rounded=True,
      special_characters=True,feature_names = feature_cols,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('Pima_diabetes_Tree.png')
Image(graph.create_png())

Classification Algorithms - Naïve Bayes

Introduction to Naïve Bayes Algorithm

朴素贝叶斯算法是一种分类技术,它基于应用贝叶斯定理,并且有一个强假设,即所有预测变量都相互独立。简而言之,假设是类别中特征的存在独立于同一类别中任何其他特征的存在。例如,如果手机有触摸屏、便携功能、好的摄像头等,它可以被认为是智能的。尽管所有这些特性是相互依赖的,但它们会独立地影响该手机是智能手机的概率。

Naïve Bayes algorithms are a classification technique based on applying Bayes’ theorem with the strong assumption that all the predictors are independent of each other. In simple words, the assumption is that the presence of a feature in a class is independent of the presence of any other feature in the same class. For example, a phone may be considered smart if it has a touch screen, internet facility, a good camera etc. Though all these features are dependent on each other, they contribute independently to the probability that the phone is a smart phone.

在贝叶斯分类中,主要目的是找到后验概率,即给定某些观察到的特征的标签概率,P(L | fatures)。借助贝叶斯定理,我们可以将其定量表示如下−

In Bayesian classification, the main interest is to find the posterior probabilities, i.e. the probability of a label given some observed features, 𝑃(𝐿 | 𝑓𝑒𝑎𝑡𝑢𝑟𝑒𝑠). With the help of Bayes’ theorem, we can express this in quantitative form as follows −

P(L | features) = P(L) · P(features | L) / P(features)

此处,P(L | fatures) 是类的后验概率。

Here, 𝑃(𝐿 | 𝑓𝑒𝑎𝑡𝑢𝑟𝑒𝑠) is the posterior probability of class.

P(L) 是类的先验概率。

𝑃(𝐿) is the prior probability of class.

P(fatures | L) 是可能性,即给定类的预测变量的概率。

𝑃(𝑓𝑒𝑎𝑡𝑢𝑟𝑒𝑠 | 𝐿) is the likelihood which is the probability of predictor given class.

P(fatures) 是预测变量的先验概率。

𝑃(𝑓𝑒𝑎𝑡𝑢𝑟𝑒𝑠) is the prior probability of predictor.

Building model using Naïve Bayes in Python

Python 库 Scikit learn 是帮助我们在 Python 中构建朴素贝叶斯模型的最有用的库。我们可以在 Scikit learn Python 库中找到以下三种类型的朴素贝叶斯模型−

Python library, Scikit learn is the most useful library that helps us to build a Naïve Bayes model in Python. We have the following three types of Naïve Bayes model under Scikit learn Python library −

Gaussian Naïve Bayes

它是最简单的朴素贝叶斯分类器,假设来自每个标签的数据都是从一个简单的正态分布中获取的。

It is the simplest Naïve Bayes classifier having the assumption that the data from each label is drawn from a simple Gaussian distribution.

Multinomial Naïve Bayes

另一个有用的朴素贝叶斯分类器是多项式朴素贝叶斯,其中假设特征是从一个简单的多项式分布中获取的。这种朴素贝叶斯最适合表示离散计数的特征。

Another useful Naïve Bayes classifier is Multinomial Naïve Bayes in which the features are assumed to be drawn from a simple Multinomial distribution. Such kind of Naïve Bayes are most appropriate for the features that represents discrete counts.

Bernoulli Naïve Bayes

另一个重要的模型是伯努利朴素贝叶斯,其中假设特征是二进制的(0 和 1)。用“词袋”模型进行文本分类可以成为伯努利朴素贝叶斯的应用。

Another important model is Bernoulli Naïve Bayes in which features are assumed to be binary (0s and 1s). Text classification with ‘bag of words’ model can be an application of Bernoulli Naïve Bayes.
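A minimal sketch of the last two models on a tiny, hypothetical bag-of-words matrix; Multinomial Naïve Bayes works on raw counts, while Bernoulli Naïve Bayes works on binary (0/1) features −

import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

# Toy bag-of-words counts for 4 tiny documents (columns = vocabulary terms)
X_counts = np.array([[2, 1, 0, 0],
                     [3, 0, 1, 0],
                     [0, 0, 2, 3],
                     [0, 1, 1, 4]])
y = np.array([0, 0, 1, 1])   # hypothetical labels: 0 = sports, 1 = politics

mnb = MultinomialNB().fit(X_counts, y)                    # raw counts
bnb = BernoulliNB().fit((X_counts > 0).astype(int), y)    # presence/absence only

print(mnb.predict([[1, 2, 0, 0]]))   # expected: [0]
print(bnb.predict([[0, 0, 1, 1]]))   # expected: [1]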

Example

根据我们的数据集,我们可以选择上面解释的任何朴素贝叶斯模型。在这里,我们在 Python 中实现高斯朴素贝叶斯模型−

Depending on our data set, we can choose any of the Naïve Bayes model explained above. Here, we are implementing Gaussian Naïve Bayes model in Python −

我们将从所需导入开始,如下所示−

We will start with required imports as follows −

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

现在,通过使用 Scikit learn 的 make_blobs() 函数,我们可以生成具有正态分布的点团,如下所示−

Now, by using make_blobs() function of Scikit learn, we can generate blobs of points with Gaussian distribution as follows −

from sklearn.datasets import make_blobs
X, y = make_blobs(300, 2, centers=2, random_state=2, cluster_std=1.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer');

接下来,为了使用 GaussianNB 模型,我们需要导入它并使其成为对象,如下所示−

Next, for using GaussianNB model, we need to import and make its object as follows −

from sklearn.naive_bayes import GaussianNB
model_GBN = GaussianNB()
model_GNB.fit(X, y);

现在,我们必须进行预测。它可以在生成一些新数据后按照以下步骤进行 −

Now, we have to do prediction. It can be done after generating some new data as follows −

rng = np.random.RandomState(0)
Xnew = [-6, -14] + [14, 18] * rng.rand(2000, 2)
ynew = model_GNB.predict(Xnew)

接下来,我们要绘制新数据以找到它的边界 −

Next, we are plotting new data to find its boundaries −

plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer')
lim = plt.axis()
plt.scatter(Xnew[:, 0], Xnew[:, 1], c=ynew, s=20, cmap='summer', alpha=0.1)
plt.axis(lim);

现在,借助以下代码行,我们可以找到第一个和第二个标签的后验概率 −

Now, with the help of following line of codes, we can find the posterior probabilities of first and second label −

yprob = model_GNB.predict_proba(Xnew)
yprob[-10:].round(3)

Output

array([[0.998, 0.002],
      [1. , 0. ],
      [0.987, 0.013],
      [1. , 0. ],
      [1. , 0. ],
      [1. , 0. ],
      [1. , 0. ],
      [1. , 0. ],
      [0. , 1. ],
      [0.986, 0.014]])

Pros & Cons

Pros

以下是使用朴素贝叶斯分类器的一些优点 −

The followings are some pros of using Naïve Bayes classifiers −

  1. Naïve Bayes classification is easy to implement and fast.

  2. It will converge faster than discriminative models like logistic regression.

  3. It requires less training data.

  4. It is highly scalable in nature, or they scale linearly with the number of predictors and data points.

  5. It can make probabilistic predictions and can handle continuous as well as discrete data.

  6. Naïve Bayes classification algorithm can be used for binary as well as multi-class classification problems both.

Cons

以下是使用朴素贝叶斯分类器的一些缺点 −

The followings are some cons of using Naïve Bayes classifiers −

  1. One of the most important cons of Naïve Bayes classification is its strong feature independence because in real life it is almost impossible to have a set of features which are completely independent of each other.

  2. Another issue with Naïve Bayes classification is its ‘zero frequency’ problem, which means that if a categorical variable has a category that was not observed in the training data set, then the Naïve Bayes model will assign a zero probability to it and it will be unable to make a prediction.

Applications of Naïve Bayes classification

以下是朴素贝叶斯分类的一些常见应用程序 −

The following are some common applications of Naïve Bayes classification −

Real-time prediction − 由于易于实现和快速计算,它可用于进行实时预测。

Real-time prediction − Due to its ease of implementation and fast computation, it can be used to do prediction in real-time.

Multi-class prediction − 朴素贝叶斯分类算法可用于预测目标变量的多个类的后验概率。

Multi-class prediction − Naïve Bayes classification algorithm can be used to predict posterior probability of multiple classes of target variable.

Text classification − 由于多类预测的特性,朴素贝叶斯分类算法非常适合文本分类。这就是它也用于解决垃圾邮件过滤和情绪分析等问题的原因。

Text classification − Due to the feature of multi-class prediction, Naïve Bayes classification algorithms are well suited for text classification. That is why it is also used to solve problems like spam-filtering and sentiment analysis.

Recommendation system − 除了协同过滤等算法之外,朴素贝叶斯还构成一个推荐系统,该系统可用于过滤未见信息并预测用户是否会喜欢给定的资源。

Recommendation system − Along with algorithms like collaborative filtering, Naïve Bayes makes a recommendation system which can be used to filter unseen information and to predict whether a user would like a given resource or not.

Classification Algorithms - Random Forest

Introduction

随机森林是一种监督学习算法,用于分类和回归。但它主要用于分类问题。我们知道森林是由树木组成的,树越多,森林就越健壮。同样,随机森林算法针对数据样本创建决策树,然后从每个决策树中获取预测,最后通过投票方式选择最佳解决方案。它是一种集成方法,优于单个决策树,因为它通过平均结果来减少过度拟合。

Random forest is a supervised learning algorithm which is used for both classification as well as regression. However, it is mainly used for classification problems. As we know, a forest is made up of trees, and more trees mean a more robust forest. Similarly, the random forest algorithm creates decision trees on data samples, gets a prediction from each of them and finally selects the best solution by means of voting. It is an ensemble method which is better than a single decision tree because it reduces over-fitting by averaging the result.

Working of Random Forest Algorithm

借助以下步骤,我们可以了解随机森林算法的工作原理:

We can understand the working of Random Forest algorithm with the help of following steps −

Step1 ——首先,从给定数据集中开始选择随机样本。

Step1 − First, start with the selection of random samples from a given dataset.

Step2 ——接下来,此算法将为每个样本构建一个决策树。然后,它将从每个决策树中获取预测结果。

Step2 − Next, this algorithm will construct a decision tree for every sample. Then it will get the prediction result from every decision tree.

Step3 ——在此步骤中,将对每个预测结果执行投票。

Step3 − In this step, voting will be performed for every predicted result.

Step4 ——最后,选择票数最多的预测结果作为最终的预测结果。

Step4 − At last, select the most voted prediction result as the final prediction result.
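The voting in Step3 and Step4 can be sketched with a few lines of plain Python; the tree predictions below are hypothetical −

from collections import Counter

# Hypothetical predictions from three decision trees for one sample
tree_predictions = ['cat', 'dog', 'cat']

# Vote and take the most voted class as the final prediction
final_prediction = Counter(tree_predictions).most_common(1)[0][0]
print(final_prediction)   # cat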

以下图表将说明其工作原理——

The following diagram will illustrate its working −


Implementation in Python

首先,从导入必要的 Python 包开始——

First, start with importing necessary Python packages −

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

接下来,从其 Web 链接下载 iris 数据集,如下所示——

Next, download the iris dataset from its weblink as follows −

path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

接下来,我们需要按照以下方式为数据集分配列名称 −

Next, we need to assign column names to the dataset as follows −

headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

现在,我们需要按照以下方式将数据集读入 Pandas 数据框 −

Now, we need to read dataset to pandas dataframe as follows −

dataset = pd.read_csv(path, names=headernames)
dataset.head()

   sepal-length   sepal-width   petal-length   petal-width         Class
0           5.1           3.5            1.4           0.2   Iris-setosa
1           4.9           3.0            1.4           0.2   Iris-setosa
2           4.7           3.2            1.3           0.2   Iris-setosa
3           4.6           3.1            1.5           0.2   Iris-setosa
4           5.0           3.6            1.4           0.2   Iris-setosa

数据预处理将借助以下脚本行执行 −

Data Preprocessing will be done with the help of following script lines −

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

接下来,我们将数据分为训练和测试拆分。以下代码将数据集拆分为 70% 的训练数据和 30% 的测试数据——

Next, we will divide the data into train and test split. The following code will split the dataset into 70% training data and 30% of testing data −

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

接下来,使用 sklearn 的 RandomForestClassifier 类按照以下方式训练模型 −

Next, train the model with the help of RandomForestClassifier class of sklearn as follows −

from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=50)
classifier.fit(X_train, y_train)

最后,我们需要进行预测。可以借助以下脚本执行此操作 −

At last, we need to make prediction. It can be done with the help of following script −

y_pred = classifier.predict(X_test)

接下来,按照以下方式打印结果 −

Next, print the results as follows −

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:",)
print (result1)
result2 = accuracy_score(y_test,y_pred)
print("Accuracy:",result2)

Output

Confusion Matrix:
[[14 0 0]
[ 0 18 1]
[ 0 0 12]]
Classification Report:
               precision       recall     f1-score       support
Iris-setosa        1.00         1.00        1.00         14
Iris-versicolor    1.00         0.95        0.97         19
Iris-virginica     0.92         1.00        0.96         12
micro avg          0.98         0.98        0.98         45
macro avg          0.97         0.98        0.98         45
weighted avg       0.98         0.98        0.98         45

Accuracy: 0.9777777777777777

Pros and Cons of Random Forest

Pros

以下是 Random Forest 算法的优点 −

The following are the advantages of Random Forest algorithm −

  1. It overcomes the problem of overfitting by averaging or combining the results of different decision trees.

  2. Random forests work well for a larger range of data items than a single decision tree does.

  3. Random forest has less variance than a single decision tree.

  4. Random forests are very flexible and possess very high accuracy.

  5. The random forest algorithm does not require scaling of data. It maintains good accuracy even when given data without scaling.

Cons

以下是 Random Forest 算法的缺点 −

The following are the disadvantages of Random Forest algorithm −

  1. Complexity is the main disadvantage of random forest algorithms.

  2. Construction of random forests is much harder and more time-consuming than that of decision trees.

  3. More computational resources are required to implement the random forest algorithm.

  4. It is less intuitive when we have a large collection of decision trees.

  5. The prediction process using random forests is very time-consuming in comparison with other algorithms.

Regression Algorithms - Overview

Introduction to Regression

回归是另一个重要且广泛使用的统计和机器学习工具。基于回归的任务的主要目标是为给定的输入数据预测输出标签或响应,这些输出标签或响应是连续数值。输出将基于模型在训练阶段中学习到的内容。基本上,回归模型使用输入数据特征(自变量)及其相应的连续数值输出值(因变量或结果变量)来学习输入和相应输出之间的特定关联。

Regression is another important and broadly used statistical and machine learning tool. The key objective of regression-based tasks is to predict output labels or responses, which are continuous numeric values, for the given input data. The output will be based on what the model has learned in the training phase. Basically, regression models use the input data features (independent variables) and their corresponding continuous numeric output values (dependent or outcome variables) to learn the specific association between inputs and corresponding outputs.


Types of Regression Models


回归模型分为以下两类:

Regression models are of following two types −

Simple regression model - 这是最基本的回归模型,其中预测是从数据的单一单变量特征形成的。

Simple regression model − This is the most basic regression model in which predictions are formed from a single, univariate feature of the data.

Multiple regression model - 正如名称所示,在此回归模型中,预测是从数据的多个特征形成的。

Multiple regression model − As name implies, in this regression model the predictions are formed from multiple features of the data.

Building a Regressor in Python

Python 中的回归模型的构造方式与分类器的构造方式相同。Scikit-learn 是一个用于机器学习的 Python 库,还可以用于在 Python 中构建回归模型。

Regressor model in Python can be constructed just like we constructed the classifier. Scikit-learn, a Python library for machine learning can also be used to build a regressor in Python.

在以下示例中,我们将构建一个基本的回归模型,它将拟合一条数据线,即线性回归模型。在 Python 中构建回归模型所需的步骤如下:

In the following example, we will be building basic regression model that will fit a line to the data i.e. linear regressor. The necessary steps for building a regressor in Python are as follows −

Step1: Importing necessary python package

要使用 scikit-learn 构建回归模型,我们需要导入它以及其他必需的包。我们可以使用以下脚本导入:

For building a regressor using scikit-learn, we need to import it along with other necessary packages. We can import them by using the following script −

import numpy as np
from sklearn import linear_model
import sklearn.metrics as sm
import matplotlib.pyplot as plt

Step2: Importing dataset

在导入必要的包之后,我们需要一个数据集来构建回归预测模型。我们可以从 sklearn 数据集中导入它,也可以根据我们的需求使用其他数据集。我们将使用保存的输入数据。我们可以借助以下脚本导入:

After importing necessary package, we need a dataset to build regression prediction model. We can import it from sklearn dataset or can use other one as per our requirement. We are going to use our saved input data. We can import it with the help of following script −

input = r'C:\linear.txt'

接下来,我们需要加载此数据。我们使用 np.loadtxt 函数来加载它。

Next, we need to load this data. We are using np.loadtxt function to load it.

input_data = np.loadtxt(input, delimiter=',')
X, y = input_data[:, :-1], input_data[:, -1]

Step3: Organizing data into training & testing sets

由于我们需要在未见数据上测试我们的模型,因此我们将数据集分为两部分:训练集和测试集。以下命令将执行此操作:

As we need to test our model on unseen data hence, we will divide our dataset into two parts: a training set and a test set. The following command will perform it −

training_samples = int(0.6 * len(X))
testing_samples = len(X) - training_samples

X_train, y_train = X[:training_samples], y[:training_samples]

X_test, y_test = X[training_samples:], y[training_samples:]

Step4- Model evaluation & prediction

在将数据划分为训练和测试后,我们需要构建模型。我们将为此目的使用 Scikit-learn 的 LineaRegression() 函数。以下命令将创建一个线性回归对象。

After dividing the data into training and testing sets, we need to build the model. We will be using the LinearRegression() function of Scikit-learn for this purpose. The following command will create a linear regressor object.

reg_linear= linear_model.LinearRegression()

接下来,使用训练样本训练此模型,如下所示:

Next, train this model with the training samples as follows −

reg_linear.fit(X_train, y_train)

现在,最后我们需要使用测试数据进行预测了。

Now, at last we need to do the prediction with the testing data.

y_test_pred = reg_linear.predict(X_test)

Step5- Plot & visualization

预测之后,我们可以借助以下脚本绘制并对其进行可视化:

After prediction, we can plot and visualize it with the help of following script −

plt.scatter(X_test, y_test, color='red')
plt.plot(X_test, y_test_pred, color='black', linewidth=2)
plt.xticks(())
plt.yticks(())
plt.show()

Output


在上面的输出中,我们可以在数据点之间看到回归线。

In the above output, we can see the regression line between the data points.

Step6- Performance computation − 我们还可以借助以下各种性能指标来计算回归模型的性能 −

Step6- Performance computation − We can also compute the performance of our regression model with the help of various performance metrics as follows −

print("Regressor model performance:")
print("Mean absolute error(MAE) =", round(sm.mean_absolute_error(y_test, y_test_pred), 2))
print("Mean squared error(MSE) =", round(sm.mean_squared_error(y_test, y_test_pred), 2))
print("Median absolute error =", round(sm.median_absolute_error(y_test, y_test_pred), 2))
print("Explain variance score =", round(sm.explained_variance_score(y_test, y_test_pred), 2))
print("R2 score =", round(sm.r2_score(y_test, y_test_pred), 2))

Output

Regressor model performance:
Mean absolute error(MAE) = 1.78
Mean squared error(MSE) = 3.89
Median absolute error = 2.01
Explain variance score = -0.09
R2 score = -0.09

Types of ML Regression Algorithms

最有用的流行 ML 回归算法是线性回归算法,它进一步分为两类:

The most useful and popular ML regression algorithm is the linear regression algorithm, which is further divided into two types, namely −

  1. Simple Linear Regression algorithm

  2. Multiple Linear Regression algorithm.

我们将在下一章中对其进行讨论并在 Python 中实现它。

We will discuss about it and implement it in Python in the next chapter.

Applications

ML 回归算法的应用如下:

The applications of ML regression algorithms are as follows −

Forecasting or Predictive analysis - 回归的一个重要用途是预测或预测分析。例如,我们可以预测 GDP、石油价格或简单来说随着时间的推移而变化的数量化数据。

Forecasting or Predictive analysis − One of the important uses of regression is forecasting or predictive analysis. For example, we can forecast GDP, oil prices or in simple words the quantitative data that changes with the passage of time.

Optimization − 我们可以在回归的帮助下优化业务流程。例如,商店经理可以创建统计模型以了解顾客高峰期。

Optimization − We can optimize business processes with the help of regression. For example, a store manager can create a statistical model to understand the peak time of customer arrivals.

Error correction − 在业务中,做出正确的决定与优化业务流程一样重要。回归可以帮助我们做出正确的决定,并帮助纠正在已执行的决策。

Error correction − In business, taking correct decisions is just as important as optimizing the business process. Regression can help us take correct decisions as well as correct decisions that have already been implemented.

Economics − 这是经济学中最常用的工具。我们可以使用回归来预测供给、需求、消费、库存投资等。

Economics − It is the most used tool in economics. We can use regression to predict supply, demand, consumption, inventory investment etc.

Finance − 金融公司始终对最大程度降低风险组合感兴趣,并且想知道影响客户的因素。所有这些都可以使用回归模型进行预测。

Finance − A financial company is always interested in minimizing the risk of its portfolio and wants to know the factors that affect its customers. All of these can be predicted with the help of a regression model.

Regression Algorithms - Linear Regression

Introduction to Linear Regression

线性回归可以定义为分析因变量与给定的一组自变量之间的线性关系的统计模型。变量之间的线性关系意味着当一个或多个自变量的值变化(增加或减少)时,因变量的值也会相应地发生变化(增加或减少)。

Linear regression may be defined as the statistical model that analyzes the linear relationship between a dependent variable with given set of independent variables. Linear relationship between variables means that when the value of one or more independent variables will change (increase or decrease), the value of dependent variable will also change accordingly (increase or decrease).

在数学上,可以通过以下等式来表示这种关系 −

Mathematically the relationship can be represented with the help of following equation −

Y = mX + b

其中,Y 是我们尝试预测的因变量

Here, Y is the dependent variable we are trying to predict

X 是我们用于进行预测的自变量。

X is the independent variable we are using to make predictions.

m 是回归线的斜率,表示 X 对 Y 的影响。

m is the slope of the regression line, which represents the effect X has on Y.

b 是一个常量,称为 Y 截距。如果 X = 0,则 Y 等于 b。

b is a constant, known as the Y-intercept. If X = 0, Y would be equal to b.
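A tiny numeric sketch of this equation, with illustrative values for m and b −

# Illustrative values only: slope m = 2, intercept b = 5
m, b = 2, 5

for X in [0, 1, 2, 3]:
   Y = m * X + b
   print(X, Y)   # (0, 5) (1, 7) (2, 9) (3, 11) -> Y equals b when X = 0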

此外,线性关系的本质可以是正面的或负面的,如下所述 −

Furthermore, the linear relationship can be positive or negative in nature as explained below −

Positive Linear Relationship

如果自变量和因变量均增加,则线性关系将称为正相关关系。可以通过以下图形来理解这一点 −

A linear relationship will be called positive if both independent and dependent variable increases. It can be understood with the help of following graph −


Negative Linear relationship

如果自变量增加而因变量减小,则线性关系将称为正相关关系。可以通过以下图形来理解这一点 −

A linear relationship will be called negative if the independent variable increases and the dependent variable decreases. It can be understood with the help of the following graph −


Types of Linear Regression

线性回归具有以下两种类型 −

Linear regression is of the following two types −

  1. Simple Linear Regression

  2. Multiple Linear Regression

Simple Linear Regression (SLR)

它是线性回归的最基本版本,它使用单个特征预测响应。SLR 中的假设是这两个变量是线性相关的。

It is the most basic version of linear regression which predicts a response using a single feature. The assumption in SLR is that the two variables are linearly related.

Python implementation

我们可以用两种方法在 Python 中实现 SLR,一种是提供你自己的数据集,另一种是从 scikit-learn python 库中使用数据集。

We can implement SLR in Python in two ways, one is to provide your own dataset and other is to use dataset from scikit-learn python library.

Example1 − 在以下 Python 实现示例中,我们使用我们自己的数据集。

Example1 − In the following Python implementation example, we are using our own dataset.

首先,我们将从导入必要包开始,如下所示 −

First, we will start with importing necessary packages as follows −

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

接下来,定义一个将计算 SLR 重要值的函数 −

Next, define a function which will calculate the important values for SLR −

def coef_estimation(x, y):

以下脚本行将给出观测值 n −

The following script line will give number of observations n −

n = np.size(x)

x 和 y 向量的平均值可以按如下方式计算 −

The mean of x and y vector can be calculated as follows −

m_x, m_y = np.mean(x), np.mean(y)

我们可以按如下方式找到交叉差和围绕 x 的差 −

We can find cross-deviation and deviation about x as follows −

SS_xy = np.sum(y*x) - n*m_y*m_x
SS_xx = np.sum(x*x) - n*m_x*m_x

接下来,回归系数即 b 可按如下方式计算 −

Next, regression coefficients i.e. b can be calculated as follows −

b_1 = SS_xy / SS_xx
b_0 = m_y - b_1*m_x
return(b_0, b_1)

接下来,我们需要定义一个函数,它将绘制回归线并预测响应矢量 −

Next, we need to define a function which will plot the regression line as well as will predict the response vector −

def plot_regression_line(x, y, b):

以下脚本行将绘制实际点作为散点图 −

The following script line will plot the actual points as scatter plot −

plt.scatter(x, y, color = "m", marker = "o", s = 30)

以下脚本行将预测响应矢量 −

The following script line will predict response vector −

y_pred = b[0] + b[1]*x

以下脚本行将绘制回归线并在其上放置标签 −

The following script lines will plot the regression line and will put the labels on them −

plt.plot(x, y_pred, color = "g")
plt.xlabel('x')
plt.ylabel('y')
plt.show()

最后,我们需要定义 main() 函数,用于提供数据集并调用我们上面定义的函数 −

At last, we need to define main() function for providing dataset and calling the function we defined above −

def main():
   x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
   y = np.array([100, 300, 350, 500, 750, 800, 850, 900, 1050, 1250])
   b = coef_estimation(x, y)
   print("Estimated coefficients:\nb_0 = {} \nb_1 = {}".format(b[0], b[1]))
   plot_regression_line(x, y, b)

if __name__ == "__main__":
   main()

Output

Estimated coefficients:
b_0 = 154.5454545454545
b_1 = 117.87878787878788

Example2 − 在下面的 Python 实施示例中,我们正在使用 scikit-learn 中的糖尿病数据集。

Example2 − In the following Python implementation example, we are using diabetes dataset from scikit-learn.

首先,我们将从导入必要包开始,如下所示 −

First, we will start with importing necessary packages as follows −

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

接下来,我们将加载糖尿病数据集并创建其对象 −

Next, we will load the diabetes dataset and create its object −

diabetes = datasets.load_diabetes()

由于我们实施了 SLR,我们将仅使用一项功能,如下所示 −

As we are implementing SLR, we will be using only one feature as follows −

X = diabetes.data[:, np.newaxis, 2]

接下来,我们需要将数据分成训练和测试集,如下所示 −

Next, we need to split the data into training and testing sets as follows −

X_train = X[:-30]
X_test = X[-30:]

接下来,我们需要将目标分成训练和测试集,如下所示 −

Next, we need to split the target into training and testing sets as follows −

y_train = diabetes.target[:-30]
y_test = diabetes.target[-30:]

现在,要训练模型,我们需要创建如下所示的线性回归对象 −

Now, to train the model we need to create linear regression object as follows −

regr = linear_model.LinearRegression()

接下来,使用训练集训练模型,如下所示 −

Next, train the model using the training sets as follows −

regr.fit(X_train, y_train)

接下来,使用测试集进行预测,如下所示 −

Next, make predictions using the testing set as follows −

y_pred = regr.predict(X_test)

接下来,我们将打印一些系数,例如 MSE、方差分数等,如下所示 −

Next, we will be printing some coefficient like MSE, Variance score etc. as follows −

print('Coefficients: \n', regr.coef_)
print("Mean squared error: %.2f"
   % mean_squared_error(y_test, y_pred))
print('Variance score: %.2f' % r2_score(y_test, y_pred))

现在,绘制输出内容,如下所示 −

Now, plot the outputs as follows −

plt.scatter(X_test, y_test, color='blue')
plt.plot(X_test, y_pred, color='red', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()

Output

Coefficients:
   [941.43097333]
Mean squared error: 3035.06
Variance score: 0.41

Multiple Linear Regression (MLR)

这是简单线性回归的扩展,它使用两个或更多特征来预测响应。在数学上,我们可以解释如下 −

It is the extension of simple linear regression that predicts a response using two or more features. Mathematically we can explain it as follows −

考虑一个具有 n 个观测值、p 个特征(即自变量)和 y 作为响应(即因变量)的数据集线性回归线对于 p 个特征可以计算如下 −

Consider a dataset having n observations, p features (i.e. independent variables) and y as one response (i.e. dependent variable). The regression line for p features can be calculated as follows −

h(xi) = b0 + b1*xi1 + b2*xi2 + … + bp*xip

其中,h(xi) 是预测的响应值,b0、b1、b2…、bp 是回归系数。

Here, h(xi) is the predicted response value and b0,b1,b2…,bp are the regression coefficients.

多元线性回归模型始终包含称为残差误差的数据误差,该误差会更改计算,如下所示 −

Multiple linear regression models always include errors in the data, known as residual error, which changes the calculation as follows −

yi = b0 + b1*xi1 + b2*xi2 + … + bp*xip + ei

我们还可以将上述方程式写成以下形式 −

We can also write the above equation as follows −

yi = h(xi) + ei, or, in matrix form, y = Xb + e
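A minimal NumPy sketch of this matrix form with made-up data; the intercept b0 is absorbed into the coefficient vector by adding a column of ones −

import numpy as np

# Illustrative data: n = 4 observations, p = 2 features
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
y = np.array([9.0, 8.0, 19.0, 18.0])   # generated from y = 1 + 2*x1 + 3*x2

# Add a column of ones so that b0 (the intercept) is part of the coefficient vector
X_aug = np.c_[np.ones(len(X)), X]

# Least-squares estimate of [b0, b1, b2]
b, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print(b)           # ≈ [1. 2. 3.]
print(X_aug @ b)   # predicted responses h(xi)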

Python Implementation

在此示例中,我们将使用 scikit learn 的波士顿住房数据集 −

In this example, we will be using the Boston housing dataset from scikit-learn −

首先,我们将从导入必要包开始,如下所示 −

First, we will start with importing necessary packages as follows −

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model, metrics

接下来,加载数据集,如下所示 −

Next, load the dataset as follows −

boston = datasets.load_boston(return_X_y=False)   # note: load_boston was removed in scikit-learn 1.2+

以下脚本行将定义特征矩阵 X 和响应向量 Y −

The following script lines will define feature matrix, X and response vector, Y −

X = boston.data
y = boston.target

接下来,将数据集分成训练和测试集,如下所示 −

Next, split the dataset into training and testing sets as follows −

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=1)

现在,创建线性回归对象并训练模型,如下所示 −

Now, create linear regression object and train the model as follows −

reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)
print('Coefficients: \n', reg.coef_)
print('Variance score: {}'.format(reg.score(X_test, y_test)))
plt.style.use('fivethirtyeight')
plt.scatter(reg.predict(X_train), reg.predict(X_train) - y_train,
      color = "green", s = 10, label = 'Train data')
plt.scatter(reg.predict(X_test), reg.predict(X_test) - y_test,
      color = "blue", s = 10, label = 'Test data')
plt.hlines(y = 0, xmin = 0, xmax = 50, linewidth = 2)
plt.legend(loc = 'upper right')
plt.title("Residual errors")
plt.show()

Output

Coefficients:
[-1.16358797e-01 6.44549228e-02 1.65416147e-01 1.45101654e+00
-1.77862563e+01 2.80392779e+00 4.61905315e-02 -1.13518865e+00
3.31725870e-01 -1.01196059e-02 -9.94812678e-01 9.18522056e-03
-7.92395217e-01]
Variance score: 0.709454060230326

Assumptions

以下是线性回归模型对数据集所做的一些假设 −

The following are some assumptions about dataset that is made by Linear Regression model −

Multi-collinearity − 线性回归模型假设数据中几乎没有或没有多重共线性。基本上,当自变量或特征其中有依赖关系时,就会出现多重共线性。

Multi-collinearity − Linear regression model assumes that there is very little or no multi-collinearity in the data. Basically, multi-collinearity occurs when the independent variables or features have dependency in them.

Auto-correlation − 线性回归模型的另一项假设是数据中几乎没有或没有自相关。基本上,当残差误差之间存在依赖关系时,就会出现自相关。

Auto-correlation − The Linear regression model also assumes that there is very little or no auto-correlation in the data. Basically, auto-correlation occurs when there is dependency between the residual errors.

Relationship between variables − 线性回归模型假设响应变量和特征变量之间的关系必须是线性的。

Relationship between variables − Linear regression model assumes that the relationship between response and feature variables must be linear.
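As a quick, informal check of the multi-collinearity assumption above, one can inspect the pairwise correlation matrix of the features. The sketch below is a minimal illustration on synthetic data (the column names x1, x2, x3 are hypothetical); a high absolute correlation between two features hints at multi-collinearity −

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
x1 = rng.normal(size=100)
df = pd.DataFrame({
   'x1': x1,
   'x2': 2 * x1 + rng.normal(scale=0.1, size=100),   # nearly collinear with x1
   'x3': rng.normal(size=100)                        # independent feature
})
# Large off-diagonal values (close to +1 or -1) suggest multi-collinearity
print(df.corr().round(2))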

Clustering Algorithms - Overview

Introduction to Clustering

聚类方法是最有用的无监督 ML 方法之一。这些方法用于查找数据样本之间的相似性以及关系模式,然后将这些样本聚类到基于特征相似性的组中。

Clustering methods are one of the most useful unsupervised ML methods. These methods are used to find similarity as well as the relationship patterns among data samples and then cluster those samples into groups having similarity based on features.

聚类很重要,因为它确定了当前未标记数据之间的内在分组。它们在基本上对数据点做出了一些关于其相似性的假设。每一项假设都会构建不同但同样有效的聚类。

Clustering is important because it determines the intrinsic grouping among the present unlabeled data. Clustering methods basically make some assumptions about the data points in order to define their similarity. Each assumption will construct different but equally valid clusters.

例如,以下是显示聚类系统将不同聚类中相似类型的数据分组在一起的图表:

For example, the diagram below shows a clustering system that has grouped similar kinds of data together into different clusters −

[Figure: similar data points grouped into different clusters]

Cluster Formation Methods

聚类不必以球形形式形成。以下是一些其他聚类形成方法:

It is not necessary that clusters will be formed in spherical form. Followings are some other cluster formation methods −

Density-based

在这些方法中,聚类被形成为稠密区域。这些方法的优点在于,它们既具有良好的准确性,又有合并两个聚类的良好能力。例如,带噪声的基于密度的空间聚类应用 (DBSCAN),用于识别聚类结构的排序点 (OPTICS) 等。

In these methods, the clusters are formed as the dense region. The advantage of these methods is that they have good accuracy as well as good ability to merge two clusters. Ex. Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Ordering Points to identify Clustering structure (OPTICS) etc.
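As a minimal illustration of a density-based method, the following sketch runs scikit-learn's DBSCAN on a synthetic "two moons" dataset (a non-spherical shape); the data and parameter values are only for demonstration −

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles - a non-spherical cluster shape
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighbourhood radius, min_samples the density threshold
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
print("Labels found:", set(db.labels_))   # label -1 marks noise points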

Hierarchical-based

在这些方法中,聚类被基于分层形成为树型结构。它们有两个类别,即凝聚(自底向上的方法)和分裂(自顶向下的方法)。例如,使用代表的聚类 (CURE),使用层次结构的平衡迭代缩小聚类 (BIRCH) 等。

In these methods, the clusters are formed as a tree type structure based on the hierarchy. They have two categories namely, Agglomerative (Bottom up approach) and Divisive (Top down approach). Ex. Clustering using Representatives (CURE), Balanced iterative Reducing Clustering using Hierarchies (BIRCH) etc.

Partitioning

在这些方法中,聚类是由将各个对象分配到 k 个聚类中而形成的。聚类数将等于分区数。例如,K 均值,基于随机搜索聚类大型应用程序 (CLARANS)。

In these methods, the clusters are formed by partitioning the objects into k clusters. The number of clusters will be equal to the number of partitions. Ex. K-means, Clustering Large Applications based upon randomized Search (CLARANS).

Grid

在这些方法中,聚类被形成为网格状结构。这些方法的优点在于,在这些网格上进行的所有聚类操作都很快,并且与数据对象的数量无关。例如,统计信息网格 (STING),寻求聚类 (CLIQUE)。

In these methods, the clusters are formed as a grid like structure. The advantage of these methods is that all the clustering operation done on these grids are fast and independent of the number of data objects. Ex. Statistical Information Grid (STING), Clustering in Quest (CLIQUE).

Measuring Clustering Performance

有关 ML 模型最重要的考虑因素之一是评估其性能或可以称之为模型的质量。在监督学习算法的情况下,对模型质量的评估很简单,因为我们已经为每个示例都贴上了标签。

One of the most important considerations regarding an ML model is assessing its performance, or you can say the model's quality. In case of supervised learning algorithms, assessing the quality of our model is easy because we already have labels for every example.

另一方面,在无监督学习算法的情况下,由于我们处理的是未标记数据,因此我们没有那么幸运。但我们仍然有一些指标可以让从业者深入了解群集的变化,具体取决于算法。

On the other hand, in case of unsupervised learning algorithms we are not that fortunate because we deal with unlabeled data. But still we have some metrics that give the practitioner insight into how the clusters change depending on the algorithm.

在我们深入了解这些指标之前,我们必须了解这些指标只是评估模型之间的比较性能,而不是衡量模型预测的有效性。以下是我们可以在聚类算法中部署的一些指标来衡量模型质量 -

Before we deep dive into such metrics, we must understand that these metrics only evaluate the comparative performance of models against each other, rather than measuring the validity of the model's predictions. The following are some of the metrics that we can apply to clustering algorithms to measure the quality of the model −

Silhouette Analysis

轮廓分析用于检查聚类模型的质量,方法是测量聚类之间的距离。它基本上为我们提供了一种方法来评估聚类数量等参数,这得益于 Silhouette score 。此分数衡量一个聚类中的每个点与相邻聚类中的点的距离。

Silhouette analysis is used to check the quality of the clustering model by measuring the distance between the clusters. It basically provides us a way to assess parameters like the number of clusters with the help of the Silhouette score. This score measures how close each point in one cluster is to points in the neighboring clusters.

Analysis of Silhouette Score

轮廓分数的范围为 [-1, 1]。它的分析如下 -

The range of Silhouette score is [-1, 1]. Its analysis is as follows −

  1. +1 Score − Near +1 Silhouette score indicates that the sample is far away from its neighboring cluster.

  2. 0 Score − 0 Silhouette score indicates that the sample is on or very close to the decision boundary separating two neighboring clusters.

  3. -1 Score − A -1 Silhouette score indicates that the samples have been assigned to the wrong clusters.

轮廓分数的计算可以使用以下公式进行 −

The calculation of Silhouette score can be done by using the following formula −

silhouette 分数=(p-q)/max (p,q)

silhouette score = (p − q) / max(p, q)

此处,p = 到最近聚类中点的平均距离

Here, 𝑝 = mean distance to the points in the nearest cluster

并且,q = 到所有点的平均聚类内距离。

And, 𝑞 = mean intra-cluster distance to all the points.
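For instance, the short sketch below (on synthetic blob data, assuming scikit-learn is available) compares candidate numbers of clusters by their average Silhouette score −

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.60, random_state=0)

# A higher average silhouette score indicates better separated clusters
for k in range(2, 7):
   labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
   print("K =", k, "silhouette =", round(silhouette_score(X, labels), 3))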

Davies-Bouldin Index

DB 索引是执行聚类算法分析的另一个好指标。借助 DB 索引,我们可以了解有关聚类模型的以下几点 -

DB index is another good metric to perform the analysis of clustering algorithms. With the help of DB index, we can understand the following points about clustering model −

  1. Whether the clusters are well-spaced from each other or not.

  2. How dense the clusters are.

我们可以借助以下公式计算 DB 索引 -

We can calculate the DB index with the help of the following formula −

DB = (1/n) * Σ (i = 1 to n) max(j ≠ i) [ (σi + σj) / d(ci, cj) ]

此处,n = 聚类数

Here, 𝑛 = number of clusters

σi = 聚类 i 中所有点到聚类质心 ci 的平均距离。

σi = average distance of all points in cluster i from the cluster centroid ci.

And, d(ci, cj) = distance between the centroids of clusters i and j.

DB 索引越少,聚类模型越好。

The lower the DB index, the better the clustering model.
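A minimal sketch of computing the DB index with scikit-learn (davies_bouldin_score is available in recent versions of sklearn.metrics; the blob data here is synthetic) −

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.60, random_state=0)
labels = KMeans(n_clusters=4, random_state=0).fit_predict(X)

# Lower values indicate more compact, better separated clusters
print("DB index:", davies_bouldin_score(X, labels))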

Dunn Index

它的工作原理与 DB 索引相同,但有以下几点不同:

It works in the same way as the DB index, but the two differ on the following points −

  1. The Dunn index considers only the worst case i.e. the clusters that are close together while DB index considers dispersion and separation of all the clusters in clustering model.

  2. Dunn index increases as the performance increases while DB index gets better when clusters are well-spaced and dense.

我们可以借助以下公式计算 Dunn 索引:

We can calculate the Dunn index with the help of the following formula −

D = min(i ≠ j) p(i, j) / max(k) q(k)

其中,𝑖,𝑗,𝑘 = 每个簇的索引

Here, i, j, k = indices of the clusters

𝑝 = 簇间距离

𝑝 = inter-cluster distance

q = 簇内距离

q = intra-cluster distance
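scikit-learn does not ship a Dunn index function, so the sketch below computes one common variant of it directly with NumPy and SciPy (the helper name dunn_index is hypothetical): the smallest inter-cluster distance divided by the largest intra-cluster diameter −

import numpy as np
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def dunn_index(X, labels):
   clusters = [X[labels == c] for c in np.unique(labels)]
   # p: smallest distance between points belonging to two different clusters
   inter = min(cdist(a, b).min()
               for i, a in enumerate(clusters)
               for j, b in enumerate(clusters) if i < j)
   # q: largest distance between two points of the same cluster (its diameter)
   intra = max(pdist(c).max() for c in clusters if len(c) > 1)
   return inter / intra

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)
print("Dunn index:", dunn_index(X, labels))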

Types of ML Clustering Algorithms

以下是最重要的有用的 ML 聚类算法 −

The following are the most important and useful ML clustering algorithms −

K-means Clustering

此聚类算法计算质心并迭代直至找到最佳质心。它假定已知聚类数。它也被称为平面聚类算法。算法从数据识别的聚类数在 K 均值中表示为“K”。

This clustering algorithm computes the centroids and iterates until it finds the optimal centroids. It assumes that the number of clusters is already known. It is also called a flat clustering algorithm. The number of clusters identified from the data by the algorithm is represented by 'K' in K-means.

Mean-Shift Algorithm

这是无监督学习中使用的另一种强有力的聚类算法。与 K 均值聚类不同,它不作任何假设,因此它是一种非参数算法。

It is another powerful clustering algorithm used in unsupervised learning. Unlike K-means clustering, it does not make any assumptions hence it is a non-parametric algorithm.

Hierarchical Clustering

这是另一种无监督学习算法,用于对具有相似特征的未标记数据点进行分组。

It is another unsupervised learning algorithm that is used to group together the unlabeled data points having similar characteristics.

我们将在接下来的章节中详细讨论所有这些算法。

We will be discussing all these algorithms in detail in the upcoming chapters.

Applications of Clustering

我们可以在以下领域发现聚类很有用 −

We can find clustering useful in the following areas −

Data summarization and compression − 聚类被广泛用于我们要求数据汇总、压缩和减少的领域。例如图像处理和矢量量化。

Data summarization and compression − Clustering is widely used in the areas where we require data summarization, compression and reduction as well. The examples are image processing and vector quantization.

Collaborative systems and customer segmentation − 由于聚类可以用于查找类似产品或同类用户,因此它可以用于协作系统和客户细分领域。

Collaborative systems and customer segmentation − Since clustering can be used to find similar products or same kind of users, it can be used in the area of collaborative systems and customer segmentation.

Serve as a key intermediate step for other data mining tasks − 聚类分析可以生成用于分类、测试、假设生成的数据的紧凑摘要;因此,它也作为其他数据挖掘任务的关键中间步骤。

Serve as a key intermediate step for other data mining tasks − Cluster analysis can generate a compact summary of data for classification, testing, hypothesis generation; hence, it serves as a key intermediate step for other data mining tasks also.

Trend detection in dynamic data − 通过创建具有类似趋势的不同聚类,聚类还可以用于动态数据中的趋势检测。

Trend detection in dynamic data − Clustering can also be used for trend detection in dynamic data by making various clusters of similar trends.

Social network analysis − 聚类可以用于社交网络分析。例如,在图像、视频或音频中生成序列。

Social network analysis − Clustering can be used in social network analysis. The examples are generating sequences in images, videos or audios.

Biological data analysis − 聚类还可以用于生成图像和视频聚类,因此可以成功地用于生物数据分析。

Biological data analysis − Clustering can also be used to make clusters of images, videos hence it can successfully be used in biological data analysis.

Clustering Algorithms - K-means Algorithm

Introduction to K-Means Algorithm

K-means 聚类算法计算质心并迭代,直到找到最优质心。它假设已知集群数。它也被称为 flat clustering 算法。k-means 中算法通过数据识别的集群数由“K”表示。

K-means clustering algorithm computes the centroids and iterates until it finds the optimal centroids. It assumes that the number of clusters is already known. It is also called a flat clustering algorithm. The number of clusters identified from the data by the algorithm is represented by 'K' in K-means.

此算法中,数据点被分配给一个集群,数据点和质心之间的平方距离总和将达到最小值。可以理解的是,集群内方差越小,同一集群内的数据点越相似。

In this algorithm, the data points are assigned to a cluster in such a manner that the sum of the squared distance between the data points and centroid would be minimum. It is to be understood that less variation within the clusters will lead to more similar data points within same cluster.

Working of K-Means Algorithm

我们可借助以下步骤理解 K 均值聚类算法的工作原理:

We can understand the working of K-Means clustering algorithm with the help of following steps −

Step1 - 首先,我们需要指定算法需要生成的集群数量 K。

Step1 − First, we need to specify the number of clusters, K, that need to be generated by this algorithm.

Step2 - 接下来,随机选取 K 个数据点,并将每个数据点分配给一个集群。简而言之,根据数据点的数量对数据进行分类。

Step2 − Next, randomly select K data points and assign each data point to a cluster. In simple words, the data is initially partitioned into K groups.

Step3 - 现在,算法将计算聚类质心。

Step3 − Now it will compute the cluster centroids.

Step4 - 接下来,不断迭代以下步骤,直至找到最优质心,即将数据点分配给不再发生变化的集群:

Step4 − Next, keep iterating the following steps until we find the optimal centroids, i.e. until the assignment of data points to the clusters no longer changes −

4.1 - 首先,计算数据点和质心之间的平方距离之和。

4.1 − First, the sum of squared distance between data points and centroids would be computed.

4.2 - 现在,我们需要将每个数据点分配给比其他集群(质心)更近的集群。

4.2 − Now, we have to assign each data point to the cluster whose centroid is closest to it.

4.3 - 最后,通过取该集群内的所有数据点的平均值来计算此集群的质心。

4.3 − At last compute the centroids for the clusters by taking the average of all data points of that cluster.

K 均值采用 Expectation-Maximization 方法来解决问题。期望步用于将数据点分配给最接近的集群,最大化步用于计算每个集群的质心。

K-means follows Expectation-Maximization approach to solve the problem. The Expectation-step is used for assigning the data points to the closest cluster and the Maximization-step is used for computing the centroid of each cluster.
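To make these two alternating steps concrete, here is a minimal NumPy sketch of K-means (the helper simple_kmeans is hypothetical, not the scikit-learn implementation): the E-step assigns every point to its nearest centroid and the M-step recomputes each centroid as the mean of its points −

import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
   rng = np.random.RandomState(seed)
   centroids = X[rng.choice(len(X), k, replace=False)]   # random initial centroids
   for _ in range(n_iter):
      # Expectation step: assign every point to its nearest centroid
      dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
      labels = dists.argmin(axis=1)
      # Maximization step: move each centroid to the mean of its assigned points
      new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
      if np.allclose(new_centroids, centroids):   # assignments no longer change
         break
      centroids = new_centroids
   return labels, centroids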

使用 K 均值算法时,我们需要注意以下事项:

While working with K-means algorithm we need to take care of the following things −

  1. While working with clustering algorithms including K-Means, it is recommended to standardize the data because such algorithms use distance-based measurement to determine the similarity between data points.

  2. Due to the iterative nature of K-Means and the random initialization of centroids, K-Means may get stuck in a local optimum and may not converge to the global optimum. That is why it is recommended to try different initializations of the centroids.

Implementation in Python

用于实现 K 均值聚类算法的以下两个示例将有助于我们更好地理解此算法:

The following two examples of implementing K-Means clustering algorithm will help us in its better understanding −

Example1

这是一个简单的示例,用于理解 k 均值的工作方式。在此示例中,我们将首先生成包含 4 个不同斑点的 2D 数据集,然后应用 k 均值算法来查看结果。

It is a simple example to understand how k-means works. In this example, we are going to first generate 2D dataset containing 4 different blobs and after that will apply k-means algorithm to see the result.

首先,我们将通过导入必要的包来开始:

First, we will start by importing the necessary packages −

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn.cluster import KMeans

以下代码将生成包含 4 个斑点的 2D:

The following code will generate the 2D dataset containing four blobs −

from sklearn.datasets import make_blobs   # samples_generator is deprecated in newer scikit-learn
X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)

接下来,以下代码将帮助我们可视化数据集:

Next, the following code will help us to visualize the dataset −

plt.scatter(X[:, 0], X[:, 1], s=20);
plt.show()
[Plot: scatter of the generated 2D dataset with four blobs]

接下来,创建一个 KMeans 对象并同时提供集群数量,训练模型并进行预测,如下所示:

Next, make an object of KMeans along with providing number of clusters, train the model and do the prediction as follows −

kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

现在,在以下代码的帮助下,我们可以绘制并可视化均值 k-Means Python 估计器选择的集群中心−

Now, with the help of following code we can plot and visualize the cluster’s centers picked by k-means Python estimator −

plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=20, cmap='summer')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='blue', s=100, alpha=0.9);
plt.show()
[Plot: clustered data points with the cluster centers marked in blue]

Example 2

让我们切换到另一个示例,其中我们将对简单的数字数据集应用 K-Means 集群。K-Means 将尝试识别类似数字,而不使用原始标签信息。

Let us move to another example in which we are going to apply K-means clustering on simple digits dataset. K-means will try to identify similar digits without using the original label information.

首先,我们将通过导入必要的包来开始:

First, we will start by importing the necessary packages −

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn.cluster import KMeans

接下来,从 sklearn 中加载数字数据集并生成一个对象。我们还可以在此数据集中找到行数和列数,如下所示 -

Next, load the digit dataset from sklearn and make an object of it. We can also find number of rows and columns in this dataset as follows −

from sklearn.datasets import load_digits
digits = load_digits()
digits.data.shape

Output

(1797, 64)

以上输出表明,此数据集包含 1797 个样本,具有 64 个特征。

The above output shows that this dataset is having 1797 samples with 64 features.

我们可以如以上示例 1 中所述执行集群 −

We can perform the clustering as we did in Example 1 above −

kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)
kmeans.cluster_centers_.shape

Output

(10, 64)

以上输出表明,K-Means 创建了 10 个集群,具有 64 个特征。

The above output shows that K-means created 10 cluster centers, each with 64 features.

fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
   axi.set(xticks=[], yticks=[])
   axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)

Output

作为输出,我们将获得以下图像,显示 k-means 了解的集群中心。

As output, we will get following image showing clusters centers learned by k-means.

[Image: the 10 cluster centers learned by k-means, shown as 8x8 digit images]

以下代码行将匹配了解的集群标签和在其中找到的真实标签 -

The following lines of code will match the learned cluster labels with the true labels found in them −

from scipy.stats import mode
labels = np.zeros_like(clusters)
for i in range(10):
   mask = (clusters == i)
   labels[mask] = mode(digits.target[mask])[0]

接下来,我们可以按如下方式检查精确度 -

Next, we can check the accuracy as follows −

from sklearn.metrics import accuracy_score
accuracy_score(digits.target, labels)

Output

0.7935447968836951

以上输出表明,精确度约为 80%。

The above output shows that the accuracy is around 80%.

Advantages and Disadvantages

Advantages

以下是 K-Means 集群算法的一些优点 −

The following are some advantages of K-Means clustering algorithms −

  1. It is very easy to understand and implement.

  2. If we have large number of variables then, K-means would be faster than Hierarchical clustering.

  3. On re-computation of centroids, an instance can change the cluster.

  4. Tighter clusters are formed with K-means as compared to Hierarchical clustering.

Disadvantages

以下是 K-Means 集群算法的一些缺点 −

The following are some disadvantages of K-Means clustering algorithms −

  1. It is a bit difficult to predict the number of clusters i.e. the value of k.

  2. Output is strongly impacted by initial inputs like number of clusters (value of k).

  3. Order of data will have strong impact on the final output.

  4. It is very sensitive to rescaling. If we rescale our data by means of normalization or standardization, then the final output will completely change.

  5. It is not good in doing clustering job if the clusters have a complicated geometric shape.

Applications of K-Means Clustering Algorithm

聚类分析的主要目标为 -

The main goals of cluster analysis are −

  1. To get a meaningful intuition from the data we are working with.

  2. Cluster-then-predict where different models will be built for different subgroups.

为了实现上述目标,K 均值聚类表现得足够好。它可以用于以下应用中 -

To fulfill the above-mentioned goals, K-means clustering is performing well enough. It can be used in following applications −

  1. Market segmentation

  2. Document Clustering

  3. Image segmentation

  4. Image compression

  5. Customer segmentation

  6. Analyzing the trend on dynamic data

Clustering Algorithms - Mean Shift Algorithm

Introduction to Mean-Shift Algorithm

如前所述,这是无监督学习中使用的另一种强大的聚类算法。不同于 K 均值聚类,它不做出任何假设;因此它是一种非参数化算法。

As discussed earlier, it is another powerful clustering algorithm used in unsupervised learning. Unlike K-means clustering, it does not make any assumptions; hence it is a non-parametric algorithm.

均值漂移算法通过将点向数据点密度最高处(即群集质心)偏移,基本将数据点迭代分配到群集中。

Mean-shift algorithm basically assigns the datapoints to the clusters iteratively by shifting points towards the highest density of datapoints i.e. cluster centroid.

K 均值算法和均值漂移算法之间的区别在于后者不需要预先指定群集数,因为群集数将由算法相对于数据确定。

The difference between the K-Means algorithm and Mean-Shift is that the latter does not need the number of clusters to be specified in advance, because the number of clusters will be determined by the algorithm with respect to the data.

Working of Mean-Shift Algorithm

我们可以借助以下步骤了解均值漂移聚类算法的工作原理 -

We can understand the working of Mean-Shift clustering algorithm with the help of following steps −

Step1 - 首先,从分配给它们自己的群集的数据点开始。

Step1 − First, start with the data points assigned to a cluster of their own.

Step2 - 接下来,此算法将计算质心。

Step2 − Next, this algorithm will compute the centroids.

Step3 - 在此步骤中,将更新新质心的位置。

Step3 − In this step, location of new centroids will be updated.

Step4 - 现在,将迭代该过程并移动到更高密度区域。

Step4 − Now, the process will be iterated and moved to the higher density region.

Step5 - 最后,当质心达到无法再移动的位置时,它将停止。

Step5 − At last, it will stop once the centroids reach a position from which they cannot move any further.

Implementation in Python

这是一个简单的示例,用于了解均值漂移算法的工作原理。在此示例中,我们将首先生成包含 4 个不同斑点的 2D 数据集,然后应用均值漂移算法查看结果。

It is a simple example to understand how Mean-Shift algorithm works. In this example, we are going to first generate 2D dataset containing 4 different blobs and after that will apply Mean-Shift algorithm to see the result.

%matplotlib inline
import numpy as np
from sklearn.cluster import MeanShift
import matplotlib.pyplot as plt
from matplotlib import style
style.use("ggplot")
from sklearn.datasets import make_blobs   # samples_generator is deprecated in newer scikit-learn
centers = [[3,3,3],[4,5,5],[3,10,10]]
X, _ = make_blobs(n_samples = 700, centers = centers, cluster_std = 0.5)
plt.scatter(X[:,0],X[:,1])
plt.show()
[Plot: scatter of the generated data points]
ms = MeanShift()
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
print(cluster_centers)
n_clusters_ = len(np.unique(labels))
print("Estimated clusters:", n_clusters_)
colors = 10*['r.','g.','b.','c.','k.','y.','m.']
for i in range(len(X)):
   plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 3)
plt.scatter(cluster_centers[:,0],cluster_centers[:,1],
      marker=".",color='k', s=20, linewidths = 5, zorder=10)
plt.show()

Output

[[ 2.98462798 9.9733794 10.02629344]
[ 3.94758484 4.99122771 4.99349433]
[ 3.00788996 3.03851268 2.99183033]]
Estimated clusters: 3
[Plot: clustered data points with the estimated cluster centers]

Advantages and Disadvantages

Advantages

以下是均值漂移聚类算法的一些优点 -

The following are some advantages of Mean-Shift clustering algorithm −

  1. It does not need to make any model assumption, unlike K-means or Gaussian mixture.

  2. It can also model complex clusters which have a nonconvex shape.

  3. It only needs one parameter, named bandwidth, which automatically determines the number of clusters.

  4. There is no issue of local minima, as there is in K-means.

  5. It is robust to outliers.

Disadvantages

以下是均值漂移聚类算法的一些缺点 -

The following are some disadvantages of Mean-Shift clustering algorithm −

在高维度(其中簇的数量急剧变化)的情况下,均值漂移算法无法很好地工作。

  1. Mean-shift algorithm does not work well in case of high dimensions, where the number of clusters changes abruptly.

  2. We do not have any direct control on the number of clusters, but in some applications we need a specific number of clusters.

  3. It cannot differentiate between meaningful and meaningless modes.

Clustering Algorithms - Hierarchical Clustering

Introduction to Hierarchical Clustering

层次聚类是另一种无监督学习算法,用于将具有相似特征的未标记数据点分组在一起。层次聚类算法分为以下两类:

Hierarchical clustering is another unsupervised learning algorithm that is used to group together the unlabeled data points having similar characteristics. Hierarchical clustering algorithms fall into the following two categories −

Agglomerative hierarchical algorithms − 在凝聚层次算法中,每个数据点都被视为一个单一簇,然后依次合并或凝聚(自下而上方法)簇对。簇的层次结构表示为树状图或树结构。

Agglomerative hierarchical algorithms − In agglomerative hierarchical algorithms, each data point is treated as a single cluster and then successively merge or agglomerate (bottom-up approach) the pairs of clusters. The hierarchy of the clusters is represented as a dendrogram or tree structure.

Divisive hierarchical algorithms − 另一方面,在分裂层次算法中,所有数据点都被视为一个大簇,而聚类过程涉及将一个大簇分割(自上而下方法)为多个小簇。

Divisive hierarchical algorithms − On the other hand, in divisive hierarchical algorithms, all the data points are treated as one big cluster and the process of clustering involves dividing (Top-down approach) the one big cluster into various small clusters.

Steps to Perform Agglomerative Hierarchical Clustering

我们将解释最常用、最重要的层次聚类,即凝聚。执行此操作的步骤如下 −

We are going to explain the most used and important type of hierarchical clustering, i.e. agglomerative. The steps to perform it are as follows −

Step1 - 将每个数据点视为单个群集。因此,我们将从开始就有 K 个簇。数据点的数量在开始时也将是 K。

Step1 − Treat each data point as a single cluster. Hence, we will have, say, K clusters at the start. The number of data points will also be K at the start.

Step2 - 现在,在此步骤中,我们需要通过连接两个最接近的数据点来形成一个大簇。这将产生总共 K-1 个簇。

Step2 − Now, in this step we need to form a big cluster by joining the two closest datapoints. This will result in a total of K-1 clusters.

Step3 - 现在,为了形成更多簇,我们需要连接两个最接近的簇。这将产生总共 K-2 个簇。

Step3 − Now, to form more clusters we need to join the two closest clusters. This will result in a total of K-2 clusters.

Step4 - 现在,要形成一个大簇,重复上述三个步骤,直到 K 变为 1,即不再有可合并的簇。

Step4 − Now, to form one big cluster, repeat the above three steps until K becomes 1, i.e. there are no more clusters left to join.

Step5 - 最后,在制作了一个大簇之后,将使用树状图根据问题将其划分为多个簇。

Step5 − At last, after making one single big cluster, dendrograms will be used to divide into multiple clusters depending upon the problem.

Role of Dendrograms in Agglomerative Hierarchical Clustering

如我们在上一步中所讨论的,一旦形成了大簇,树状图的作用就开始了。树状图将用于根据我们的问题将簇分割为多个相关数据点的簇。我们可以借助以下示例来理解:

As we discussed in the last step, the role of dendrogram starts once the big cluster is formed. Dendrogram will be used to split the clusters into multiple cluster of related data points depending upon our problem. It can be understood with the help of following example −

Example1

为了理解,让我们开始导入所需的库,如下所示:

To understand, let us start with importing the required libraries as follows −

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

接下来,我们将绘制我们为此示例获取的数据点 −

Next, we will be plotting the datapoints we have taken for this example −

X = np.array([[7,8],[12,20],[17,19],[26,15],[32,37],[87,75],[73,85], [62,80],[73,60],[87,96],])
labels = range(1, 11)
plt.figure(figsize=(10, 7))
plt.subplots_adjust(bottom=0.1)
plt.scatter(X[:,0],X[:,1], label='True Position')
for label, x, y in zip(labels, X[:, 0], X[:, 1]):
   plt.annotate(label,xy=(x, y), xytext=(-3, 3),textcoords='offset points', ha='right', va='bottom')
plt.show()
[Plot: the ten data points labeled 1 to 10]

从上图中,很容易看出,我们的数据点中有两个簇,但实际数据中有可能是数千个簇。接下来,我们将使用 SciPy 库绘制我们的数据点的树状图:

From the above diagram, it is very easy to see that we have two clusters in our datapoints, but in real-world data there can be thousands of clusters. Next, we will plot the dendrogram of our datapoints by using the SciPy library −

from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt
linked = linkage(X, 'single')
labelList = range(1, 11)
plt.figure(figsize=(10, 7))
dendrogram(linked, orientation='top',labels=labelList, distance_sort='descending',show_leaf_counts=True)
plt.show()
[Plot: dendrogram of the data points]

现在,在大簇形成后,将选择最长的垂直距离。然后在其周围绘制一条垂直线,如下图所示。由于水平线在两点处穿过蓝线,因此簇的数量将是两个。

Now, once the big cluster is formed, the longest vertical distance in the dendrogram is selected and a horizontal line is drawn through it, as shown in the following diagram. Since this horizontal line crosses the vertical (blue) links at two points, the number of clusters would be two.

[Plot: dendrogram with a horizontal cut line crossing two vertical links]

接下来,我们需要导入用于聚类的类,并调用它的 fit_predict 方法来预测簇。我们导入的是 sklearn.cluster 库的 AgglomerativeClustering 类:

Next, we need to import the class for clustering and call its fit_predict method to predict the cluster. We are importing AgglomerativeClustering class of sklearn.cluster library −

from sklearn.cluster import AgglomerativeClustering
cluster = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward')
cluster.fit_predict(X)

接下来,使用以下代码绘制簇 −

Next, plot the cluster with the help of following code −

plt.scatter(X[:,0],X[:,1], c=cluster.labels_, cmap='rainbow')
[Plot: data points colored by their assigned cluster]

上图显示了我们数据点中的两个簇。

The above diagram shows the two clusters from our datapoints.

Example2

如我们从上面讨论的简单示例中了解到的树状图概念,让我们转到另一个示例,其中我们使用层次聚类创建了 Pima Indian Diabetes 数据集中数据点的簇:

As we understood the concept of dendrograms from the simple example discussed above, let us move to another example in which we are creating clusters of the data point in Pima Indian Diabetes Dataset by using hierarchical clustering −

import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
import numpy as np
from pandas import read_csv
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]
data.shape
(768, 9)
data.head()

slno.   preg   Plas   Pres   skin   test   mass   pedi    age   class
0       6      148    72     35     0      33.6   0.627   50    1
1       1      85     66     29     0      26.6   0.351   31    0
2       8      183    64     0      0      23.3   0.672   32    1
3       1      89     66     23     94     28.1   0.167   21    0
4       0      137    40     35     168    43.1   2.288   33    1

patient_data = data.iloc[:, 3:5].values
import scipy.cluster.hierarchy as shc
plt.figure(figsize=(10, 7))
plt.title("Patient Dendograms")
dend = shc.dendrogram(shc.linkage(data, method='ward'))
from sklearn.cluster import AgglomerativeClustering
cluster = AgglomerativeClustering(n_clusters=4, affinity='euclidean', linkage='ward')
cluster.fit_predict(patient_data)
plt.figure(figsize=(10, 7))
plt.scatter(patient_data[:,0], patient_data[:,1], c=cluster.labels_, cmap='rainbow')
[Plot: patients grouped into four clusters]

KNN Algorithm - Finding Nearest Neighbors

Introduction

K 近邻 (KNN) 算法是一种有监督的 ML 算法,可用于分类和回归预测问题。然而,它主要用于工业中的分类预测问题。以下两个特性很好地定义了 KNN -

K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm which can be used for both classification as well as regression predictive problems. However, it is mainly used for classification predictive problems in industry. The following two properties would define KNN well −

  1. Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a specialized training phase and uses all the data for training while classification.

  2. Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm because it doesn’t assume anything about the underlying data.

Working of KNN Algorithm

K 近邻 (KNN) 算法使用“特征相似性”来预测新数据点的值,这意味着新的数据点将根据它与训练集中的点的匹配程度分配一个值。我们可以通过以下步骤了解它的工作原理 -

K-nearest neighbors (KNN) algorithm uses ‘feature similarity’ to predict the values of new datapoints which further means that the new data point will be assigned a value based on how closely it matches the points in the training set. We can understand its working with the help of following steps −

Step1 - 为了实现任何算法,我们需要数据集。因此,在 KNN 的第一步中,我们必须加载训练数据和测试数据。

Step1 − For implementing any algorithm, we need dataset. So during the first step of KNN, we must load the training as well as test data.

Step2 - 接下来,我们需要选择 K 的值,即最近的数据点。K 可以是任何整数。

Step2 − Next, we need to choose the value of K, i.e. the number of nearest data points to consider. K can be any integer.

Step3 − 对于测试数据中的每个点做以下操作 −

Step3 − For each point in the test data do the following −

3.1 − 借助以下任何一种方法计算测试数据与训练数据中的每一行之间的距离:欧几里德距离、曼哈顿距离或汉明距离。用于计算距离最常见的方法是欧几里德距离。

3.1 − Calculate the distance between test data and each row of training data with the help of any of the method namely: Euclidean, Manhattan or Hamming distance. The most commonly used method to calculate distance is Euclidean.

3.2 − 现在,根据距离值,按升序对它们进行排序。

3.2 − Now, based on the distance value, sort them in ascending order.

3.3 − 接下来,它将从排序数组中选择前 K 行。

3.3 − Next, it will choose the top K rows from the sorted array.

3.4 − 现在,它将根据这些行的最频繁类为测试点分配一个类。

3.4 − Now, it will assign a class to the test point based on most frequent class of these rows.

Step4 − 结束

Step4 − End
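The following minimal NumPy sketch implements the steps above for classification (the helper knn_predict and the toy data are hypothetical, shown only to make the procedure concrete) −

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
   # Step 3.1: Euclidean distance from the new point to every training row
   distances = np.linalg.norm(X_train - x_new, axis=1)
   # Steps 3.2 and 3.3: sort by distance and keep the indices of the K nearest rows
   nearest = np.argsort(distances)[:k]
   # Step 3.4: majority vote among the K nearest labels
   return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two groups of points labelled 'red' and 'blue'
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array(['red', 'red', 'red', 'blue', 'blue', 'blue'])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))   # predicts 'red'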

Example

以下是理解 K 的概念和 KNN 算法的工作原理的一个示例 −

The following is an example to understand the concept of K and working of KNN algorithm −

假设我们有一个可以按如下方式绘制的数据集 −

Suppose we have a dataset which can be plotted as follows −

[Plot: a dataset with two classes of points, blue and red]

现在,我们需要将带黑点的新的数据点(在点 60,60)分类为蓝色或红色类别。我们假定 K = 3,即它会找到三个最邻近的数据点。它在下一张图中显示 −

Now, we need to classify new data point with black dot (at point 60,60) into blue or red class. We are assuming K = 3 i.e. it would find three nearest data points. It is shown in the next diagram −

[Plot: the new data point (black dot) with its three nearest neighbors circled]

我们可以在上图中看到带黑点的这个数据点的三近邻。在这三个数据点中,有两个属于红色类别,因此黑点也将被分配到红色类别。

We can see in the above diagram the three nearest neighbors of the data point with the black dot. Among those three, two of them lie in the Red class, hence the black dot will also be assigned to the Red class.

Implementation in Python

众所周知,K 近邻 (KNN) 算法既可以用于分类,也可以用于回归。以下是使用 Python 将 KNN 同时用作分类器和回归器的程序 −

As we know K-nearest neighbors (KNN) algorithm can be used for both classification as well as regression. The following are the recipes in Python to use KNN as classifier as well as regressor −

KNN as Classifier

首先,从导入必要的 Python 包开始——

First, start with importing necessary python packages −

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

接下来,从其 Web 链接下载 iris 数据集,如下所示——

Next, download the iris dataset from its weblink as follows −

path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

接下来,我们需要按照以下方式为数据集分配列名称 −

Next, we need to assign column names to the dataset as follows −

headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

现在,我们需要按照以下方式将数据集读入 Pandas 数据框 −

Now, we need to read dataset to pandas dataframe as follows −

dataset = pd.read_csv(path, names=headernames)
dataset.head()

slno.   sepal-length   sepal-width   petal-length   petal-width   Class
0       5.1            3.5           1.4            0.2           Iris-setosa
1       4.9            3.0           1.4            0.2           Iris-setosa
2       4.7            3.2           1.3            0.2           Iris-setosa
3       4.6            3.1           1.5            0.2           Iris-setosa
4       5.0            3.6           1.4            0.2           Iris-setosa

数据预处理将借助以下脚本行执行 −

Data Preprocessing will be done with the help of following script lines −

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

接下来,我们将数据分成训练集和测试集。以下代码会将数据集分成 60% 的训练数据和 40% 的测试数据 −

Next, we will divide the data into train and test split. Following code will split the dataset into 60% training data and 40% of testing data −

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)

接下来,将按照如下方式对数据进行缩放 −

Next, data scaling will be done as follows −

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

接下来,借助 sklearn 的 KNeighborsClassifier 类按如下方式训练模型 −

Next, train the model with the help of KNeighborsClassifier class of sklearn as follows −

from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=8)
classifier.fit(X_train, y_train)

最后,我们需要进行预测。可以使用以下脚本完成——

At last we need to make prediction. It can be done with the help of following script −

y_pred = classifier.predict(X_test)

接下来,按照以下方式打印结果 −

Next, print the results as follows −

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:",)
print (result1)
result2 = accuracy_score(y_test,y_pred)
print("Accuracy:",result2)

Output

Confusion Matrix:
[[21 0 0]
[ 0 16 0]
[ 0 7 16]]
Classification Report:
            precision      recall       f1-score       support
Iris-setosa       1.00        1.00         1.00          21
Iris-versicolor   0.70        1.00         0.82          16
Iris-virginica    1.00        0.70         0.82          23
micro avg         0.88        0.88         0.88          60
macro avg         0.90        0.90         0.88          60
weighted avg      0.92        0.88         0.88          60


Accuracy: 0.8833333333333333

KNN as Regressor

首先,从导入必要的 Python 包开始——

First, start with importing necessary Python packages −

import numpy as np
import pandas as pd

接下来,从其 Web 链接下载 iris 数据集,如下所示——

Next, download the iris dataset from its weblink as follows −

path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

接下来,我们需要按照以下方式为数据集分配列名称 −

Next, we need to assign column names to the dataset as follows −

headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

现在,我们需要按照以下方式将数据集读入 Pandas 数据框 −

Now, we need to read dataset to pandas dataframe as follows −

data = pd.read_csv(path, names=headernames)
array = data.values
X = array[:, :2]   # first two columns as features
y = array[:, 2]    # third column as the regression target
data.shape

output:(150, 5)

接下来,从 sklearn 导入 KNeighborsRegressor 以拟合模型 −

Next, import KNeighborsRegressor from sklearn to fit the model −

from sklearn.neighbors import KNeighborsRegressor
knnr = KNeighborsRegressor(n_neighbors=10)
knnr.fit(X, y)

最后,我们可以按如下方式找到 MSE −

At last, we can find the MSE as follows −

print ("The MSE is:",format(np.power(y-knnr.predict(X),2).mean()))

Output

The MSE is: 0.12226666666666669

Pros and Cons of KNN

Pros

  1. It is very simple algorithm to understand and interpret.

  2. It is very useful for nonlinear data because there is no assumption about data in this algorithm.

  3. It is a versatile algorithm as we can use it for classification as well as regression.

  4. It has relatively high accuracy but there are much better supervised learning models than KNN.

Cons

  1. It is computationally a bit expensive algorithm because it stores all the training data.

  2. High memory storage required as compared to other supervised learning algorithms.

  3. Prediction is slow in case of big N.

  4. It is very sensitive to the scale of data as well as irrelevant features.

Applications of KNN

以下是一些 KNN 可以成功应用的领域 −

The following are some of the areas in which KNN can be applied successfully −

Banking System

KNN 可用于银行系统预测某个人是否适合贷款审批?此人是否具有与违约者相似的特征?

KNN can be used in a banking system to predict whether an individual is fit for loan approval, i.e. whether that individual has characteristics similar to those of defaulters.

Calculating Credit Ratings

KNN 算法可用于通过与具有相似特征的人进行比较来查找个人信用评级。

KNN algorithms can be used to find an individual’s credit rating by comparing with the persons having similar traits.

Politics

借助 KNN 算法,我们可以将潜在选民分为各种类别,如“将投票”、“不会投票”、“将投票给‘国大党’”、“将投票给‘人民党’”。

With the help of KNN algorithms, we can classify a potential voter into various classes like "Will Vote", "Will Not Vote", "Will Vote for Party 'Congress'" and "Will Vote for Party 'BJP'".

KNN 算法可以应用的其他领域包括语音识别、手写检测、图像识别和视频识别。

Other areas in which KNN algorithm can be used are Speech Recognition, Handwriting Detection, Image Recognition and Video Recognition.

Machine Learning - Performance Metrics

有各种指标可以用来评估 ML 算法、分类算法以及回归算法的性能。我们必须仔细选择指标来评估 ML 性能,因为 −

There are various metrics which we can use to evaluate the performance of ML algorithms, classification as well as regression algorithms. We must carefully choose the metrics for evaluating ML performance because −

  1. How the performance of ML algorithms is measured and compared will be dependent entirely on the metric you choose.

  2. How you weight the importance of various characteristics in the result will be influenced completely by the metric you choose.

Performance Metrics for Classification Problems

我们在前面的章节中讨论了分类及其算法。在这里,我们将讨论可用于评价分类问题预测的各种性能指标。

We have discussed classification and its algorithms in the previous chapters. Here, we are going to discuss various performance metrics that can be used to evaluate predictions for classification problems.

Confusion Matrix

这是衡量分类问题性能的最简单方法,其中输出可以是两种或更多种类的类。混淆矩阵只不过是一个有两维的表格,即“实际”和“预测”,此外,这两个维度都具有“真阳性(TP)”、“真阴性(TN)”、“假阳性(FP)”、“假阴性(FN) ”如下所示 -

It is the easiest way to measure the performance of a classification problem where the output can be of two or more type of classes. A confusion matrix is nothing but a table with two dimensions viz. “Actual” and “Predicted” and furthermore, both the dimensions have “True Positives (TP)”, “True Negatives (TN)”, “False Positives (FP)”, “False Negatives (FN)” as shown below −

                     Predicted: 1            Predicted: 0
Actual: 1            True Positives (TP)     False Negatives (FN)
Actual: 0            False Positives (FP)    True Negatives (TN)

与混淆矩阵相关的术语的解释如下 -

Explanation of the terms associated with confusion matrix are as follows −

  1. True Positives (TP) − It is the case when both actual class & predicted class of data point is 1.

  2. True Negatives (TN) − It is the case when both actual class & predicted class of data point is 0.

  3. False Positives (FP) − It is the case when actual class of data point is 0 & predicted class of data point is 1.

  4. False Negatives (FN) − It is the case when actual class of data point is 1 & predicted class of data point is 0.

我们可以使用 sklearn.metrics 的 confusion_matrix 函数计算我们的分类模型的混淆矩阵。

We can use confusion_matrix function of sklearn.metrics to compute Confusion Matrix of our classification model.

Classification Accuracy

它是分类算法最常用的性能指标。它可以定义为预测的正确预测数与所做预测数之比。我们可以使用混淆矩阵轻松地根据以下公式计算它 -

It is the most common performance metric for classification algorithms. It may be defined as the number of correct predictions made as a ratio of all predictions made. We can easily calculate it from the confusion matrix with the help of the following formula −

Accuracy = (TP + TN) / (TP + FP + FN + TN)

我们可以使用 sklearn.metrics 的 accuracy_score 函数计算我们的分类模型的准确度。

We can use accuracy_score function of sklearn.metrics to compute accuracy of our classification model.

Classification Report

此报告包括精确度、召回率、F1 和支持的分数。它们解释如下 −

This report consists of the scores of Precisions, Recall, F1 and Support. They are explained as follows −

Precision

精度,用于文档检索,可以定义为由我们的机器学习模型返回的正确文档数。我们可以通过以下公式借助混淆矩阵轻松计算它 -

Precision, used in document retrievals, may be defined as the number of correct documents returned by our ML model. We can easily calculate it from the confusion matrix with the help of the following formula −

Precision = TP / (TP + FP)

Recall or Sensitivity

召回率可以定义为由我们的机器学习模型返回的正例数。我们可以通过以下公式借助混淆矩阵轻松计算它 -

Recall may be defined as the number of positives returned by our ML model. We can easily calculate it from the confusion matrix with the help of the following formula −

Recall = TP / (TP + FN)

Specificity

特异性与召回率相反,可以定义为我们的 ML 模型返回的负样本数量。我们可以通过使用以下公式轻松地通过混淆矩阵计算它−

Specificity, in contrast to recall, may be defined as the number of negatives returned by our ML model. We can easily calculate it from the confusion matrix with the help of the following formula −

Specificity = TN / (TN + FP)

Support

支持可以定义为目标值每个类别中存在的真实响应样本数。

Support may be defined as the number of samples of the true response that lies in each class of target values.

F1 Score

这个分数将为我们提供精确度和召回率的调和平均值。从数学角度讲,F1 分数是精确度和召回率的加权平均值。F1 的最佳值为 1,最差值为 0。我们可以借助以下公式计算 F1 分数:

This score will give us the harmonic mean of precision and recall. Mathematically, F1 score is the weighted average of the precision and recall. The best value of F1 would be 1 and worst would be 0. We can calculate F1 score with the help of following formula −

F1 = 2 ∗ (精确度 ∗ 召回率) / (精确度 + 召回率)

F1 = 2 * (precision * recall) / (precision + recall)

F1 分数中包含精确度和召回率相等的相关贡献。

F1 score is having equal relative contribution of precision and recall.

我们可以使用 sklearn.metrics 的 classification_report 函数获取我们分类模型的分类报告。

We can use classification_report function of sklearn.metrics to get the classification report of our classification model.
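As a small illustration, the sketch below computes precision, recall and F1 individually with sklearn.metrics on a toy pair of 0/1 label lists (the same kind of labels used in the combined example later in this chapter) −

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 1, 1, 1, 0, 1, 1, 0, 0]

# precision = TP/(TP+FP), recall = TP/(TP+FN), F1 is their harmonic mean
print('Precision:', precision_score(y_true, y_pred))   # 0.5
print('Recall:', recall_score(y_true, y_pred))          # 0.75
print('F1 score:', f1_score(y_true, y_pred))            # 0.6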

AUC (Area Under ROC curve)

AUC(曲线下面积)-ROC(接收者操作特征)是一种基于分类问题中变化的阈值,用于衡量其性能的指标。顾名思义,ROC 是一条概率曲线,AUC 是分离度的衡量标准。简而言之,AUC-ROC 指标将告诉我们模型区分不同类别。AUC 越大,模型也就越好。

AUC (Area Under Curve)-ROC (Receiver Operating Characteristic) is a performance metric, based on varying threshold values, for classification problems. As the name suggests, ROC is a probability curve and AUC measures the separability. In simple words, the AUC-ROC metric tells us about the capability of the model in distinguishing the classes. The higher the AUC, the better the model.

从数学角度讲,它可以通过绘制 TPR(真阳性率),即灵敏度或召回率与 FPR(假阳性率),即 1-特异性,在各种阈值下计算得来。以下图表展示了 ROC,AUC 在 y 轴上有 TPR,在 x 轴上有 FPR:

Mathematically, it can be created by plotting TPR (True Positive Rate), i.e. Sensitivity or Recall, vs FPR (False Positive Rate), i.e. 1-Specificity, at various threshold values. The following graph shows the ROC curve, with TPR on the y-axis and FPR on the x-axis −

[Graph: ROC curve with TPR on the y-axis and FPR on the x-axis; AUC is the area under the curve]

我们可以使用 sklearn.metrics 的 roc_auc_score 函数计算 AUC-ROC。

We can use roc_auc_score function of sklearn.metrics to compute AUC-ROC.

LOGLOSS (Logarithmic Loss)

它也称为逻辑回归损失或交叉熵损失。它基本上根据概率估算定义,且衡量分类模型的性能,其中输入是一个介于 0 到 1 之间的概率值。我们可以通过将其与准确性区分开来以更清晰地理解它。众所周知,准确性是我们模型中的预测计数(预测值 = 实际值),而对数损失是我们预测的不确定程度,取决于它与实际标签的差异有多大。借助对数损失值,我们可以更准确地了解我们模型的性能。我们可以使用 sklearn. metrics 的 log_loss 函数计算对数损失。

It is also called Logistic regression loss or cross-entropy loss. It is basically defined on probability estimates and measures the performance of a classification model where the input is a probability value between 0 and 1. It can be understood more clearly by differentiating it from accuracy. As we know, accuracy is the count of correct predictions (predicted value = actual value) in our model, whereas Log Loss is the amount of uncertainty of our prediction based on how much it varies from the actual label. With the help of the Log Loss value, we can have a more accurate view of the performance of our model. We can use the log_loss function of sklearn.metrics to compute Log Loss.

Example

以下是 Python 中的简单示例,它将让我们深入了解如何在二元分类模型上使用上述性能指标:

The following is a simple recipe in Python which will give us an insight about how we can use the above explained performance metrics on binary classification model −

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import log_loss
X_actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
Y_predic = [1, 0, 1, 1, 1, 0, 1, 1, 0, 0]
results = confusion_matrix(X_actual, Y_predic)
print ('Confusion Matrix :')
print(results)
print ('Accuracy Score is',accuracy_score(X_actual, Y_predic))
print ('Classification Report : ')
print (classification_report(X_actual, Y_predic))
print('AUC-ROC:',roc_auc_score(X_actual, Y_predic))
print('LOGLOSS Value is',log_loss(X_actual, Y_predic))

Output

Confusion Matrix :
[[3 3]
[1 3]]
Accuracy Score is 0.6
Classification Report :
            precision      recall      f1-score       support
      0       0.75          0.50      0.60           6
      1       0.50          0.75      0.60           4
micro avg     0.60          0.60      0.60           10
macro avg     0.62          0.62      0.60           10
weighted avg  0.65          0.60      0.60           10
AUC-ROC:  0.625
LOGLOSS Value is 13.815750437193334

Performance Metrics for Regression Problems

我们已经在前面的章节中讨论了回归及其算法。在这里,我们将讨论可以用来评估回归问题的预测的各种性能指标。

We have discussed regression and its algorithms in previous chapters. Here, we are going to discuss various performance metrics that can be used to evaluate predictions for regression problems.

Mean Absolute Error (MAE)

它是回归问题中最简单的误差指标。它基本上是预测值和实际值之间的绝对差的平均值之和。简而言之,有了 MAE,我们就可以了解预测有多错误。MAE 不会指明模型的方向,即不会指明模型的欠佳表现或表现过强。以下是计算 MAE 的公式:

It is the simplest error metric used in regression problems. It is basically the average of the absolute differences between the predicted and actual values. In simple words, with MAE, we can get an idea of how wrong the predictions were. MAE does not indicate the direction of the error, i.e. it gives no indication about underperformance or overperformance of the model. The following is the formula to calculate MAE −

MAE = (1/n) * Σ |Y − $\hat{Y}$|

其中,Y = 实际输出值

Here, 𝑌=Actual Output Values

以及 Y^ = 预测输出值。

And $\hat{Y}$= Predicted Output Values.

我们可以使用 sklearn.metrics 的 mean_absolute_error 函数计算 MAE。

We can use mean_absolute_error function of sklearn.metrics to compute MAE.

Mean Square Error (MSE)

MSE 类似于 MAE,唯一的区别在于在对所有数据求和之前,它对实际值和预测输出值之差进行求平方,而不是使用绝对值。以下公式显示了差异:

MSE is like MAE, but the only difference is that it squares the difference of the actual and predicted output values before summing them all, instead of using the absolute value. The difference can be noticed in the following equation −

MSE = (1/n) * Σ (Y − $\hat{Y}$)²

其中,Y = 实际输出值

Here, 𝑌=Actual Output Values

以及 Y^ = 预测输出值。

And $\hat{Y}$ = Predicted Output Values.

我们可以使用 sklearn.metrics 的 mean_squared_error 函数计算 MSE。

We can use mean_squared_error function of sklearn.metrics to compute MSE.

R Squared (R2)

R 平方指标通常用于解释目的,并指示一组预测输出值对实际输出值的拟合优度。以下公式将帮助我们理解它:

The R Squared metric is generally used for explanatory purposes and provides an indication of the goodness of fit of a set of predicted output values to the actual output values. The following formula will help us understand it −

R² = 1 − [ Σ (Y − $\hat{Y}$)² / n ] / [ Σ (Y − $\bar{Y}$)² / n ]

where $\bar{Y}$ is the mean of the actual output values.

在以上公式中,分子是 MSE,分母是 Y 值中的方差。

In the above equation, numerator is MSE and the denominator is the variance in 𝑌 values.

我们可以使用 Sklearn.metrics 的 r2_score 函数来计算 R 平方值。

We can use r2_score function of sklearn.metrics to compute R squared value.

Example

以下是 Python 中的一个简单代码示例,将让我们了解如何将上述解释的性能指标用于回归模型 -

The following is a simple recipe in Python which will give us an insight about how we can use the above explained performance metrics on regression model −

from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
X_actual = [5, -1, 2, 10]
Y_predic = [3.5, -0.9, 2, 9.9]
print ('R Squared =',r2_score(X_actual, Y_predic))
print ('MAE =',mean_absolute_error(X_actual, Y_predic))
print ('MSE =',mean_squared_error(X_actual, Y_predic))

Output

R Squared = 0.9656060606060606
MAE = 0.42499999999999993
MSE = 0.5674999999999999

Machine Learning - Automatic Workflows

Introduction

为了成功执行和生成结果,机器学习模型必须自动化一些标准工作流。自动化这些标准工作流的过程可以在 Scikit-learn Pipelines 的帮助下完成。从数据科学家的角度来看,管道是一个概括的,但非常重要的概念。它基本上允许数据从原始格式流向某些有用信息。管道的工作原理可以用以下图表理解 −

In order to execute and produce results successfully, a machine learning model must automate some standard workflows. The process of automating these standard workflows can be done with the help of Scikit-learn Pipelines. From a data scientist's perspective, the pipeline is a generalized, but very important concept. It basically allows data to flow from its raw format to some useful information. The working of pipelines can be understood with the help of the following diagram −

[Diagram: ML pipeline − data ingestion → data preparation → ML model training → model evaluation → model retraining → deployment]

机器学习管道的模块如下 −

The blocks of ML pipelines are as follows −

Data ingestion − 正如其名称所暗示的,这是为在机器学习项目中使用而导入数据的过程。数据可以从单个或多个系统中实时或批量提取。这是最具挑战性的步骤之一,因为数据质量会影响整个机器学习模型。

Data ingestion − As the name suggests, it is the process of importing the data for use in ML project. The data can be extracted in real time or batches from single or multiple systems. It is one of the most challenging steps because the quality of data can affect the whole ML model.

Data Preparation − 导入数据后,我们需要准备数据以便用于我们的机器学习模型。数据预处理是最重要的数据准备技术之一。

Data Preparation − After importing the data, we need to prepare data to be used for our ML model. Data preprocessing is one of the most important technique of data preparation.

ML Model Training − 下一步是训练我们的机器学习模型。我们有各种机器学习算法,如监督、无监督、强化,用于从数据中提取特征并进行预测。

ML Model Training − Next step is to train our ML model. We have various ML algorithms like supervised, unsupervised, reinforcement to extract the features from data, and make predictions.

Model Evaluation − 其次,我们需要评估机器学习模型。在自动机器学习管道的情况下,机器学习模型可以在各种统计方法和业务规则的帮助下进行评估。

Model Evaluation − Next, we need to evaluate the ML model. In case of AutoML pipeline, ML model can be evaluated with the help of various statistical methods and business rules.

ML Model retraining − 在自动机器学习管道的情况下,第一个模型不一定是最好的。第一个模型被视为基线模型,我们可以重复训练它以提高模型的准确性。

ML Model retraining − In case of AutoML pipeline, it is not necessary that the first model is the best one. The first model is considered as a baseline model and we can train it repeatedly to increase the model's accuracy.

Deployment − 最后,我们需要部署模型。此步骤涉及将模型应用于业务运营并将其迁移到业务运营中供其使用。

Deployment − At last, we need to deploy the model. This step involves applying and migrating the model to business operations for their use.

Challenges Accompanying ML Pipelines

为了创建机器学习管道,数据科学家面临许多挑战。这些挑战归为以下三类 −

In order to create ML pipelines, data scientists face many challenges. These challenges fall into the following three categories −

Quality of Data

任何机器学习模型的成功在很大程度上取决于数据的质量。如果我们提供给机器学习模型的数据不准确、不可靠和稳健,那么我们最终会得到错误或误导性的输出。

The success of any ML model depends heavily on the quality of data. If the data we are providing to the ML model is not accurate, reliable and robust, then we are going to end up with wrong or misleading output.

Data Reliability

与 ML 管道相关的另一个挑战是我们要提供给 ML 模型的数据的可靠性。如我们所知,数据科学家可以从多个来源获取数据,但要获得最佳结果,必须确保数据来源是可靠且受信任的。

Another challenge associated with ML pipelines is the reliability of data we are providing to the ML model. As we know, there can be various sources from which data scientist can acquire data but to get the best results, it must be assured that the data sources are reliable and trusted.

Data Accessibility

要从 ML 管道中获得最佳结果,数据本身必须是可访问的,这需要整合、清理和管理数据。由于数据可访问性属性,元数据将用新标签进行更新。

To get the best results out of ML pipelines, the data itself must be accessible which requires consolidation, cleansing and curation of data. As a result of data accessibility property, metadata will be updated with new tags.

Modelling ML Pipeline and Data Preparation

数据泄露(发生在训练数据集到测试数据集)是数据科学家在为 ML 模型准备数据时要处理的一个重要问题。通常,在数据准备时,数据科学家会在学习前对整个数据集使用标准化或归一化等技术。但这些技术无法帮助我们避免数据泄露,因为训练数据集将受到测试数据集中数据的规模的影响。

Data leakage, happening from the training dataset to the testing dataset, is an important issue for a data scientist to deal with while preparing data for an ML model. Generally, at the time of data preparation, the data scientist uses techniques like standardization or normalization on the entire dataset before learning. But these techniques cannot protect us from data leakage, because the training dataset would have been influenced by the scale of the data in the testing dataset.

通过使用 ML 管道,我们可以防止此数据泄露,因为管道确保数据准备(例如此处标准化)受到我们的交叉验证过程的每个折的影响。

By using ML pipelines, we can prevent this data leakage because pipelines ensure that data preparation like standardization is constrained to each fold of our cross-validation procedure.

Example

以下是一个演示数据准备和模型评估工作流的 Python 示例。为此,我们从 Sklearn 中使用 Pima Indian Diabetes 数据集。首先,我们将创建一个对数据进行标准化的管道。然后将创建线性判别分析模型,最后将使用 10 折交叉验证对管道进行评估。

The following is an example in Python that demonstrates the data preparation and model evaluation workflow. For this purpose, we are using the Pima Indian Diabetes dataset, loaded from a CSV file. First, we will create a pipeline that standardizes the data. Then a Linear Discriminant Analysis model will be created, and at last the pipeline will be evaluated using 10-fold cross validation.

首先,按如下所示导入所需包:

First, import the required packages as follows −

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

现在,我们需要加载 Pima diabetes 数据集,如在之前的示例中所做的那样:

Now, we need to load the Pima diabetes dataset as did in previous examples −

path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:, 0:8]   # input features
Y = array[:, 8]     # output class

接下来,我们将借助以下代码创建一个管道:

Next, we will create a pipeline with the help of the following code −

estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('lda', LinearDiscriminantAnalysis()))
model = Pipeline(estimators)

最后,我们将评估该管道并输出其准确度,如下所示:

At last, we are going to evaluate this pipeline and output its accuracy as follows −

kfold = KFold(n_splits=20, random_state=7)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

Output

0.7790148448043184

以上输出是对该设置的准确度的摘要,该设置位于数据集上。

The above output is the summary of accuracy of the setup on the dataset.

Modelling ML Pipeline and Feature Extraction

数据泄露也可能发生在机器学习模型的特征提取步骤中。这就是为什么也应当限制特征提取程序来阻止训练数据集中发生数据泄露。与数据准备的情况一样,通过使用机器学习管道,我们也可以防止此数据泄露。ML 管道提供的一个名为 FeatureUnion 的工具可用于此目的。

Data leakage can also happen at feature extraction step of ML model. That is why feature extraction procedures should also be restricted to stop data leakage in our training dataset. As in the case of data preparation, by using ML pipelines, we can prevent this data leakage also. FeatureUnion, a tool provided by ML pipelines can be used for this purpose.

Example

以下是在 Python 中演示特征提取和模型评估工作流的一个示例。为此,我们正在使用来自 Sklearn 的 Pima 印第安人糖尿病数据集。

The following is an example in Python that demonstrates the feature extraction and model evaluation workflow. For this purpose, we are using the Pima Indian Diabetes dataset, loaded from a CSV file.

首先,将用 PCA(主成分分析)提取 3 个特征。然后,将用统计分析选择 6 个特征。在特征提取之后,将使用 FeatureUnion 工具组合多个特征选择和提取过程的结果。最后,将创建一个逻辑回归模型,并使用十折交叉验证评估管道。

First, 3 features will be extracted with PCA (Principal Component Analysis). Then, 6 features will be selected with statistical analysis. After feature extraction, the results of the multiple feature selection and extraction procedures will be combined by using the FeatureUnion tool. At last, a Logistic Regression model will be created, and the pipeline will be evaluated using 10-fold cross validation.

首先,按如下所示导入所需包:

First, import the required packages as follows −

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

现在,我们需要加载 Pima diabetes 数据集,如在之前的示例中所做的那样:

Now, we need to load the Pima diabetes dataset as did in previous examples −

path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:, 0:8]   # input features
Y = array[:, 8]     # output class

接下来,按如下方式创建特征并集 -

Next, feature union will be created as follows −

features = []
features.append(('pca', PCA(n_components=3)))
features.append(('select_best', SelectKBest(k=6)))
feature_union = FeatureUnion(features)

接下来,使用以下脚本行创建管道 -

Next, the pipeline will be created with the help of the following script lines −

estimators = []
estimators.append(('feature_union', feature_union))
estimators.append(('logistic', LogisticRegression()))
model = Pipeline(estimators)

最后,我们将评估该管道并输出其准确度,如下所示:

At last, we are going to evaluate this pipeline and output its accuracy as follows −

kfold = KFold(n_splits=20, random_state=7)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

Output

0.7789811066126855

以上输出是对该设置的准确度的摘要,该设置位于数据集上。

The above output summarizes the accuracy of this setup on the dataset.

Improving Performance of ML Models

Performance Improvement with Ensembles

集成可以通过组合多个模型为我们提升机器学习结果。基本上,集成模型包含几个单独训练的监督学习模型,将它们的结果通过各种方式合并以实现比单个模型更好的预测性能。集成方法可以分成以下两组 −

Ensembles can give a boost to machine learning results by combining several models. Basically, ensemble models consist of several individually trained supervised learning models whose results are merged in various ways to achieve better predictive performance than a single model could. Ensemble methods can be divided into the following two groups −

Sequential ensemble methods

顾名思义,在这些种类的集成方法中,基础学习器会被顺序生成。此类方法的动机是利用基础学习器之间的依赖关系。

As the name implies, in this kind of ensemble method the base learners are generated sequentially. The motivation of such methods is to exploit the dependency among base learners.

Parallel ensemble methods

顾名思义,在这些种类的集成方法中,基础学习器会被并行生成。此类方法的动机是利用基础学习器之间的独立性。

As the name implies, in this kind of ensemble method the base learners are generated in parallel. The motivation of such methods is to exploit the independence among base learners.
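
The following is a minimal sketch, added here only for illustration, that contrasts the two groups on a toy dataset generated with make_classification (an assumption made for this sketch; the recipes below use the Pima data). BaggingClassifier fits its base learners independently of each other, so they can even be trained in parallel, while AdaBoostClassifier fits them one after another −

# Toy comparison of a parallel-style and a sequential-style ensemble.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=300, random_state=7)

# Parallel: each tree is fit independently on its own bootstrap sample.
parallel = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                             n_jobs=-1, random_state=7).fit(X_toy, y_toy)

# Sequential: each learner depends on the errors of the previous ones.
sequential = AdaBoostClassifier(n_estimators=10, random_state=7).fit(X_toy, y_toy)

print(parallel.score(X_toy, y_toy), sequential.score(X_toy, y_toy))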

Ensemble Learning Methods

以下是最流行的集成学习方法,即组合不同模型的预测的方法 −

The following are the most popular ensemble learning methods i.e. the methods for combining the predictions from different models −

Bagging

装袋法也称为引导聚合。在装袋方法中,集成模型尝试通过结合单个模型的预测(这些模型是根据随机生成的训练样本进行训练的)来提高预测精度并降低模型方差。集成模型的最终预测将通过计算所有预测的平均值来给出。装袋方法的最佳示例之一是随机森林。

Bagging is also known as bootstrap aggregation. In bagging methods, the ensemble model tries to improve prediction accuracy and decrease model variance by combining the predictions of individual models, each trained on a randomly generated (bootstrap) training sample. The final prediction of the ensemble model is obtained by averaging the predictions of the individual estimators. One of the best examples of bagging methods is the random forest.
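
To make the bootstrap-and-average idea concrete, the following is a small conceptual sketch (not one of the recipes below) that fits a few decision trees on bootstrap samples of a toy dataset from make_classification, an assumption used here only for illustration, and combines them by majority vote −

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=200, random_state=7)
rng = np.random.RandomState(7)

predictions = []
for _ in range(25):
    idx = rng.randint(0, len(X_toy), len(X_toy))   # bootstrap sample (drawn with replacement)
    tree = DecisionTreeClassifier(random_state=7).fit(X_toy[idx], y_toy[idx])
    predictions.append(tree.predict(X_toy))

# Averaging the 0/1 votes and rounding is the same as taking a majority vote.
ensemble_pred = (np.mean(predictions, axis=0) >= 0.5).astype(int)
print("ensemble training accuracy:", np.mean(ensemble_pred == y_toy))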

Boosting

在提升方法中,构建集成模型的主要原则是通过顺序训练每个基础模型估计器来逐步构建集成模型。顾名思义,它基本上将多个较弱的基础学习器(在训练数据的多次迭代中顺序训练)组合起来,构建强大的集成模型。在训练较弱的基础学习器期间,会为先前被错误分类的实例分配更高的权重。提升方法的一个示例是 AdaBoost。

In boosting methods, the main principle of building the ensemble model is to build it incrementally by training each base model estimator sequentially. As the name suggests, it basically combines several weak base learners, trained sequentially over multiple iterations of the training data, to build a powerful ensemble. During the training of the weak base learners, higher weights are assigned to the instances that were misclassified earlier. An example of a boosting method is AdaBoost.

Voting

在此集成学习模型中,会构建多种不同类型的模型,并使用一些简单的统计量(例如均值或中位数)来组合它们的预测。更高级的变体(称为堆叠,stacking)会把子模型的预测作为附加输入来训练另一个模型,由它做出最终预测。

In this ensemble learning model, multiple models of different types are built, and simple statistics, such as the mean or median, are used to combine their predictions. In more advanced variants (known as stacking), the sub-models' predictions serve as additional input to train another model that makes the final prediction.

Bagging Ensemble Algorithms

以下三个是装袋集成算法 −

The following are three bagging ensemble algorithms −

Bagged Decision Tree

众所周知,装袋集成法适用于方差较高的算法,在这一方面,决策树算法属于佼佼者。在以下 Python 配方中,我们将使用 sklearn 的 BaggingClassifier 函数和 DecisionTreeClassifier(一种分类和回归树算法)在 Pima Indians 糖尿病数据集上建立装袋决策树集成模型。

As we know, bagging ensemble methods work well with algorithms that have high variance, and in this regard the decision tree algorithm is one of the best candidates. In the following Python recipe, we are going to build a bagged decision tree ensemble model by using the BaggingClassifier function of sklearn with DecisionTreeClassifier (a classification & regression trees algorithm) on the Pima Indians diabetes dataset.

首先,按如下所示导入所需包:

First, import the required packages as follows −

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

现在,我们需要加载 Pima 糖尿病数据集,如我们在前面的示例中所做的那样 −

Now, we need to load the Pima diabetes dataset as we did in the previous examples −

path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]

接下来,输入 10 倍交叉验证,如下所示 −

Next, give the input for 10-fold cross validation as follows −

seed = 7
kfold = KFold(n_splits=10, random_state=seed)
cart = DecisionTreeClassifier()

我们需要提供要构建的树的数量。这里我们构建 150 棵树 −

We need to provide the number of trees we are going to build. Here we are building 150 trees −

num_trees = 150

接下来,借助以下脚本构建模型 −

Next, build the model with the help of following script −

model = BaggingClassifier(base_estimator=cart, n_estimators=num_trees, random_state=seed)

计算并打印结果,如下所示 −

Calculate and print the result as follows −

results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

Output

0.7733766233766234

上面显示的输出表明,我们的袋装决策树分类器模型的准确率约为 77%。

The output above shows that we got around 77% accuracy of our bagged decision tree classifier model.

Random Forest

它是装袋决策树的延伸。对于单独的分类器,训练数据集的样本是替换抽取的,但树是以这样的方式构建的,从而降低它们之间的相关性。另外,在构建每棵树时,会考虑特征的随机子集来选择每个分割点,而不是贪心地选择最佳分割点。

It is an extension of bagged decision trees. For the individual classifiers, the samples of the training dataset are taken with replacement, but the trees are constructed in such a way that the correlation between them is reduced: when building each tree, a random subset of features is considered for choosing each split point, rather than greedily choosing the best split point over all features.

在以下 Python 配方中,我们将使用 sklearn 的 RandomForestClassifier 类在 Pima Indians 糖尿病数据集上建立装袋随机森林集成模型。

In the following Python recipe, we are going to build a random forest ensemble model by using the RandomForestClassifier class of sklearn on the Pima Indians diabetes dataset.

首先,按如下所示导入所需包:

First, import the required packages as follows −

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

现在,我们需要加载 Pima diabetes 数据集,如在之前的示例中所做的那样:

Now, we need to load the Pima diabetes dataset as we did in previous examples −

path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]

接下来,输入 10 倍交叉验证,如下所示 −

Next, give the input for 10-fold cross validation as follows −

seed = 7
kfold = KFold(n_splits=10, random_state=seed)

我们需要提供要构建的树的数量。这里我们将构建 150 棵树,分割点从 5 个特征中选取 −

We need to provide the number of trees we are going to build. Here we are building 150 trees with split points chosen from 5 features −

num_trees = 150
max_features = 5

接下来,借助以下脚本构建模型 −

Next, build the model with the help of following script −

model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)

计算并打印结果,如下所示 −

Calculate and print the result as follows −

results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

Output

0.7629357484620642

上面显示的输出表明,我们的装袋随机森林分类器模型的准确率约为 76%。

The output above shows that we got around 76% accuracy of our bagged random forest classifier model.

Extra Trees

它是装袋决策树集成法的另一个延伸。在这种方法中,随机树是从训练数据集的样本构建的。

It is another extension of the bagged decision tree ensemble method. In this method, randomized trees are constructed from the training dataset, with split points chosen at random for each candidate feature rather than by searching for the most discriminative thresholds.

在以下 Python 配方中,我们将使用 sklearn 的 ExtraTreesClassifier 类在 Pima Indians 糖尿病数据集上构建额外树集成模型。

In the following Python recipe, we are going to build an extra trees ensemble model by using the ExtraTreesClassifier class of sklearn on the Pima Indians diabetes dataset.

首先,按如下所示导入所需包:

First, import the required packages as follows −

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import ExtraTreesClassifier

现在,我们需要加载 Pima diabetes 数据集,如在之前的示例中所做的那样:

Now, we need to load the Pima diabetes dataset as we did in previous examples −

path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]

接下来,输入 10 倍交叉验证,如下所示 −

Next, give the input for 10-fold cross validation as follows −

seed = 7
kfold = KFold(n_splits=10, random_state=seed)

我们需要提供要构建的树的数量。这里我们将构建 150 棵树,分割点从 5 个特征中选取 −

We need to provide the number of trees we are going to build. Here we are building 150 trees with split points chosen from 5 features −

num_trees = 150
max_features = 5

接下来,借助以下脚本构建模型 −

Next, build the model with the help of following script −

model = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features)

计算并打印结果,如下所示 −

Calculate and print the result as follows −

results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

Output

0.7551435406698566

上面显示的输出表明,我们的装袋额外树分类器模型的准确率约为 75.5%。

The output above shows that we got around 75.5% accuracy of our bagged extra trees classifier model.

Boosting Ensemble Algorithms

以下是两种最常见的提升集成算法 −

The followings are the two most common boosting ensemble algorithms −

AdaBoost

它是最成功的提升集成算法之一。此算法的关键在于为数据集中的实例分配权重的方式:被错误分类的实例会获得更高的权重,因此算法在构建后续模型时会更加关注这些实例。

It is one of the most successful boosting ensemble algorithms. The key to this algorithm lies in the way it assigns weights to the instances in the dataset: instances that were misclassified receive higher weights, so the algorithm pays more attention to them while constructing subsequent models.
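
The following is a small conceptual sketch of such a weight update, using hypothetical ±1 labels chosen only for illustration (sklearn's internal AdaBoost implementation differs in detail): after one round, the misclassified instance carries a much larger weight, so the next weak learner concentrates on it −

import numpy as np

y_true = np.array([ 1,  1, -1, -1])
y_pred = np.array([ 1, -1, -1, -1])        # the weak learner misclassifies the second instance
w = np.full(len(y_true), 1 / len(y_true))  # start with uniform instance weights

err = np.sum(w[y_true != y_pred])          # weighted error of the weak learner
alpha = 0.5 * np.log((1 - err) / err)      # contribution of this learner to the ensemble
w = w * np.exp(-alpha * y_true * y_pred)   # misclassified instances get larger weights
w = w / w.sum()                            # renormalize so the weights sum to 1
print(w)                                   # the misclassified instance now has the largest weight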

在以下 Python 配方中,我们将使用 sklearn 的 AdaBoostClassifier 类在 Pima Indians 糖尿病数据集上构建 Ada Boost 集成模型,用于分类。

In the following Python recipe, we are going to build an AdaBoost ensemble model for classification by using the AdaBoostClassifier class of sklearn on the Pima Indians diabetes dataset.

首先,按如下所示导入所需包:

First, import the required packages as follows −

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier

现在,我们需要加载 Pima diabetes 数据集,如在之前的示例中所做的那样:

Now, we need to load the Pima diabetes dataset as we did in previous examples −

path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]

接下来,输入 10 倍交叉验证,如下所示 −

Next, give the input for 10-fold cross validation as follows −

seed = 5
kfold = KFold(n_splits=10, random_state=seed)

我们需要提供要构建的树的数量。这里我们将构建 50 棵树 −

We need to provide the number of trees we are going to build. Here we are building 50 trees −

num_trees = 50

接下来,借助以下脚本构建模型 −

Next, build the model with the help of following script −

model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)

计算并打印结果,如下所示 −

Calculate and print the result as follows −

results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

Output

0.7539473684210527

上面显示的输出表明,我们的 AdaBoost 分类器集成模型的准确率约为 75%。

The output above shows that we got around 75% accuracy of our AdaBoost classifier ensemble model.

Stochastic Gradient Boosting

它也被称为梯度提升机。在以下 Python 配方中,我们将使用 sklearn 的 GradientBoostingClassifier 类在 Pima Indians 糖尿病数据集上构建随机梯度提升集成模型,用于分类。

It is also called Gradient Boosting Machines. In the following Python recipe, we are going to build a Stochastic Gradient Boosting ensemble model for classification by using the GradientBoostingClassifier class of sklearn on the Pima Indians diabetes dataset.

首先,按如下所示导入所需包:

First, import the required packages as follows −

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier

现在,我们需要加载 Pima diabetes 数据集,如在之前的示例中所做的那样:

Now, we need to load the Pima diabetes dataset as we did in previous examples −

path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]

接下来,输入 10 倍交叉验证,如下所示 −

Next, give the input for 10-fold cross validation as follows −

seed = 5
kfold = KFold(n_splits=10, random_state=seed)

我们需要提供要构建的树的数量。这里我们将构建 50 棵树 −

We need to provide the number of trees we are going to build. Here we are building 50 trees −

num_trees = 50

接下来,借助以下脚本构建模型 −

Next, build the model with the help of following script −

model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)

计算并打印结果,如下所示 −

Calculate and print the result as follows −

results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

Output

0.7746582365003418

上面的输出表明,我们的梯度提升分类器集成模型准确率约为 77.5%。

The output above shows that we got around 77.5% accuracy of our Gradient Boosting classifier ensemble model.

Voting Ensemble Algorithms

如讨论所述,投票首先从训练数据集创建两个或更多独立的模型,然后投票分类器将围绕模型进行封装,同时根据需要对子模型的预测结果求取平均值以生成新数据。

As discussed, voting first creates two or more standalone models from the training dataset, and then a voting classifier wraps these models and combines their predictions (for example, by averaging them or taking a majority vote) whenever predictions are required for new data.
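
The recipe below uses the default hard voting, where the class predicted by the majority of sub-models wins. As a side note, VotingClassifier also supports soft voting, in which the predicted class probabilities are averaged. A minimal sketch on a toy dataset from make_classification (used here only for illustration) looks like this −

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=200, random_state=7)
soft_ensemble = VotingClassifier(
    estimators=[('logistic', LogisticRegression(max_iter=1000)),
                ('cart', DecisionTreeClassifier()),
                ('svm', SVC(probability=True))],   # SVC must expose predict_proba for soft voting
    voting='soft')
soft_ensemble.fit(X_toy, y_toy)
print(soft_ensemble.predict(X_toy[:5]))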

在以下 Python 配方中,我们将使用 sklearn 中的 VotingClassifier 类对 Pima Indian 糖尿病数据集建立投票集成模型,用于分类。我们对逻辑回归、决策树分类器和 SVM 的预测结果进行组合,如下所示,用于解决分类问题 −

In the following Python recipe, we are going to build a Voting ensemble model for classification by using the VotingClassifier class of sklearn on the Pima Indians diabetes dataset. We are combining the predictions of logistic regression, a Decision Tree classifier and SVM together for a classification problem as follows −

首先,按如下所示导入所需包:

First, import the required packages as follows −

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

现在,我们需要加载 Pima diabetes 数据集,如在之前的示例中所做的那样:

Now, we need to load the Pima diabetes dataset as we did in previous examples −

path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]

接下来,输入 10 倍交叉验证,如下所示 −

Next, give the input for 10-fold cross validation as follows −

kfold = KFold(n_splits=10, random_state=7)

接下来,我们需要创建子模型,如下所示 −

Next, we need to create sub-models as follows −

estimators = []
model1 = LogisticRegression()
estimators.append(('logistic', model1))
model2 = DecisionTreeClassifier()
estimators.append(('cart', model2))
model3 = SVC()
estimators.append(('svm', model3))

现在,通过组合上述创建的子模型的预测结果来创建投票集成模型。

Now, create the voting ensemble model by combining the predictions of the sub-models created above.

ensemble = VotingClassifier(estimators)
results = cross_val_score(ensemble, X, Y, cv=kfold)
print(results.mean())

Output

0.7382262474367738

上面的输出表明,我们的投票分类器集成模型的准确率约为 74%。

The output above shows that we got around 74% accuracy of our voting classifier ensemble model.

Improving Performance of ML Model (Contd…)

Performance Improvement with Algorithm Tuning

众所周知,ML 模型以这样一种方式进行参数化,即可以调整它们的行为以解决特定问题。算法调优是指找到这些参数的最佳组合,以便提高 ML 模型的性能。此过程有时称为超参数优化,算法本身的参数称为超参数,ML 算法找到的系数称为参数。

As we know, ML models are parameterized in such a way that their behavior can be adjusted for a specific problem. Algorithm tuning means finding the best combination of these parameters so that the performance of the ML model can be improved. This process is sometimes called hyperparameter optimization; the parameters of the algorithm itself are called hyperparameters, while the coefficients found by the ML algorithm are called parameters.
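
A minimal sketch of this distinction, on a toy regression dataset from make_regression (an assumption used only for illustration): alpha is a hyperparameter that we set before training, whereas coef_ and intercept_ are the parameters the algorithm learns from the data −

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X_toy, y_toy = make_regression(n_samples=100, n_features=3, random_state=7)
model = Ridge(alpha=1.0)               # hyperparameter: chosen by us (and tuned below)
model.fit(X_toy, y_toy)
print(model.coef_, model.intercept_)   # parameters: found by the algorithm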

在此,我们将讨论 Python Scikit-learn 提供的算法参数调优的一些方法。

Here, we are going to discuss some methods for algorithm parameter tuning provided by Python Scikit-learn.

Grid Search Parameter Tuning

这是一种参数调优方法。此方法的工作要点是,针对网格中指定的每个可能的算法参数组合有条不紊地构建和评估模型方法。因此,我们可以说此算法具有搜索性质。

It is a parameter tuning approach. The key point of this method is that it methodically builds and evaluates a model for every possible combination of the algorithm parameters specified in a grid. Hence, we can say that this approach is exhaustive in its search.
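
The exhaustive nature of this search can be seen with sklearn's ParameterGrid helper. The small sketch below, with illustrative values only, enumerates every combination of a two-parameter grid (2 alpha values x 2 fit_intercept settings = 4 candidate models) −

from sklearn.model_selection import ParameterGrid

grid = {'alpha': [0.1, 1.0], 'fit_intercept': [True, False]}
for combination in ParameterGrid(grid):
    print(combination)   # GridSearchCV evaluates a model for each of these combinations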

Example

在以下 Python 代码示例中,我们将使用 Sklearn 的 GridSearchCV 类对 Pima 印第安人糖尿病数据集执行网格搜索,以评估岭回归算法的各个 alpha 值。

In the following Python recipe, we are going to perform a grid search by using the GridSearchCV class of sklearn to evaluate various alpha values for the Ridge Regression algorithm on the Pima Indians diabetes dataset.

首先,按如下所示导入所需包:

First, import the required packages as follows −

import numpy
from pandas import read_csv
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

现在,我们需要加载 Pima diabetes 数据集,如在之前的示例中所做的那样:

Now, we need to load the Pima diabetes dataset as we did in previous examples −

path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]

接下来,按如下方式评估各个 alpha 值 -

Next, evaluate the various alpha values as follows −

alphas = numpy.array([1,0.1,0.01,0.001,0.0001,0])
param_grid = dict(alpha=alphas)

现在,我们需要对我们的模型应用网格搜索 -

Now, we need to apply grid search on our model −

model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=param_grid)
grid.fit(X, Y)

使用以下脚本行打印结果 -

Print the result with following script line −

print(grid.best_score_)
print(grid.best_estimator_.alpha)

Output

0.2796175593129722
1.0

以上输出为我们提供了最佳分数以及达到该分数的网格中的参数集。此例中的 alpha 值为 1.0。

The above output gives us the optimal score and the set of parameters in the grid that achieved that score. The alpha value in this case is 1.0.

Random Search Parameter Tuning

这是一种参数调优方法。此方法的工作要点是,针对固定次数的迭代从随机分布中对算法参数进行采样。

It is another parameter tuning approach. The key point of this method is that it samples the algorithm parameters from a random distribution for a fixed number of iterations.
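
The sampling behaviour can be illustrated with sklearn's ParameterSampler helper. The small sketch below, with illustrative values only, draws a fixed number of candidate alpha values from a uniform distribution on [0, 1] −

from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

samples = ParameterSampler({'alpha': uniform(0, 1)}, n_iter=5, random_state=7)
for candidate in samples:
    print(candidate)   # RandomizedSearchCV evaluates a model for each sampled candidate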

Example

在以下 Python 代码示例中,我们将使用 Sklearn 的 RandomizedSearchCV 类对 Pima 印第安人糖尿病数据集执行随机搜索,以评估岭回归算法的 0 到 1 之间的不同 alpha 值。

In the following Python recipe, we are going to perform a random search by using the RandomizedSearchCV class of sklearn to evaluate different alpha values between 0 and 1 for the Ridge Regression algorithm on the Pima Indians diabetes dataset.

首先,按如下所示导入所需包:

First, import the required packages as follows −

import numpy
from pandas import read_csv
from scipy.stats import uniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

现在,我们需要加载 Pima diabetes 数据集,如在之前的示例中所做的那样:

Now, we need to load the Pima diabetes dataset as we did in previous examples −

path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]

接下来,按如下方式在岭回归算法上评估各个 alpha 值 -

Next, evaluate the various alpha values on Ridge regression algorithm as follows −

param_grid = {'alpha': uniform()}
model = Ridge()
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=50,
random_state=7)
random_search.fit(X, Y)

使用以下脚本行打印结果 -

Print the result with following script line −

print(random_search.best_score_)
print(random_search.best_estimator_.alpha)

Output

0.27961712703051084
0.9779895119966027

以上输出为我们提供了与网格搜索非常相似的最佳分数。

The above output gives us an optimal score very similar to that of the grid search, along with the best alpha value found by sampling.