Logistic Regression in Python - A Concise Tutorial
Logistic Regression in Python - Introduction
Logistic Regression is a statistical method for classifying objects. This chapter gives an introduction to logistic regression with the help of some examples.
Classification
To understand logistic regression, you should know what classification means. Let us consider the following examples to understand this better −
- A doctor classifies a tumor as malignant or benign.
- A bank transaction may be fraudulent or genuine.
For many years, humans have been performing such tasks, albeit in an error-prone manner. The question is: can we train machines to do these tasks for us with better accuracy?
One example of a machine doing classification is the email client on your machine, which classifies every incoming mail as "spam" or "not spam", and it does so with fairly high accuracy. The statistical technique of logistic regression has been successfully applied in email clients. In this case, we have trained our machine to solve a classification problem.
Logistic Regression is just one machine learning technique used for solving this kind of binary classification problem. Several other machine learning techniques have been developed and are in practice for solving other kinds of problems.
As you may have noticed, in all the above examples the outcome of the prediction has only two values - Yes or No. We call these classes, so we say that our classifier classifies objects into two classes. In technical terms, we can say that the outcome or target variable is dichotomous in nature.
There are other classification problems in which the output may be classified into more than two classes. For example, given a basket full of fruits, you are asked to separate fruits of different kinds. Now, the basket may contain Oranges, Apples, Mangoes, and so on. So when you separate out the fruits, you separate them into more than two classes. This is a multi-class classification problem.
Logistic Regression in Python - Case Study
Consider that a bank approaches you to develop a machine learning application that will help them identify the potential clients who would open a Term Deposit (also called a Fixed Deposit by some banks) with them. The bank regularly conducts a survey by means of telephone calls or web forms to collect information about potential clients. The survey is general in nature and is conducted over a very large audience, out of which many may not be interested in dealing with this bank at all. Of the rest, only a few may be interested in opening a Term Deposit. Others may be interested in other facilities offered by the bank. So the survey is not necessarily conducted for identifying the customers who would open TDs. Your task is to identify all those customers with a high probability of opening a TD from the humongous survey data that the bank is going to share with you.
Fortunately, such data is publicly available for those aspiring to develop machine learning models. This data was prepared by some students at UC Irvine with external funding. The database is available as part of the UCI Machine Learning Repository and is widely used by students, educators, and researchers all over the world. The data can be downloaded from here.
In the next chapters, let us now perform the application development using the same data.
Setting Up a Project
In this chapter, we will understand the process involved in setting up a project to perform logistic regression in Python, in detail.
Installing Jupyter
We will be using Jupyter, one of the most widely used platforms for machine learning. If you do not have Jupyter installed on your machine, download it from here. For installation, you can follow the instructions on their site. As the site suggests, you may prefer to use the Anaconda Distribution, which comes along with Python and many commonly used Python packages for scientific computing and data science. This will alleviate the need to install these packages individually.
After the successful installation of Jupyter, start a new project. Your screen at this stage should look like the following, ready to accept your code.
Now, change the name of the project from Untitled1 to “Logistic Regression” by clicking the title name and editing it.
First, we will be importing several Python packages that we will need in our code.
Importing Python Packages
For this purpose, type or cut-and-paste the following code in the code editor −
In [1]: # import statements
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
Your Notebook should look like the following at this stage −
Run the code by clicking on the Run button. If no errors are generated, you have successfully installed Jupyter and are now ready for the rest of the development.
The first three import statements import the pandas, numpy, and matplotlib.pyplot packages into our project. The next three statements import the specified modules from sklearn.
Our next task is to download the data required for our project. We will learn this in the next chapter.
Logistic Regression in Python - Getting Data
The steps involved in getting data for performing logistic regression in Python are discussed in detail in this chapter.
Downloading Dataset
If you have not already downloaded the UCI dataset mentioned earlier, download it now from here. Click on the Data Folder. You will see the following screen −
Download the bank.zip file by clicking on the given link. The zip file contains the following files −
We will use the bank.csv file for our model development. The bank-names.txt file contains the description of the database that you are going to need later. The bank-full.csv contains a much larger dataset that you may use for more advanced developments.
Here, we have included the bank.csv file in the downloadable source zip. This file contains comma-delimited fields. We have also made a few modifications in the file. It is recommended that you use the file included in the project source zip for your learning.
Loading Data
To load the data from the csv file that you just copied, type the following statement and run the code.
In [2]: df = pd.read_csv('bank.csv', header=0)
You will also be able to examine the loaded data by running the following code statement −
In [3]: df.head()
Once the command is run, you will see the following output −
Basically, it has printed the first five rows of the loaded data. Examine the 21 columns present. We will be using only a few of these columns for our model development.
Next, we need to clean the data. The data may contain some rows with NaN. To eliminate such rows, use the following command −
In [4]: df = df.dropna()
Fortunately, the bank.csv does not contain any rows with NaN, so this step is not truly required in our case. However, in general it is difficult to discover such rows in a huge database. So it is always safer to run the above statement to clean the data.
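To see what dropna actually does, here is a minimal, self-contained sketch using a small made-up DataFrame rather than bank.csv:

```python
import pandas as pd
import numpy as np

# A tiny made-up frame in which one row contains a NaN
df_demo = pd.DataFrame({
    'age': [30, np.nan, 45],
    'job': ['admin', 'technician', 'retired']
})

# dropna() removes every row that contains at least one NaN
cleaned = df_demo.dropna()
print(cleaned.shape)   # (2, 2)
```

The row with the missing 'age' is discarded; the other two rows survive intact.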
Note − You can easily examine the data size at any point of time by using the following statement −
In [5]: print(df.shape)
(41188, 21)
The number of rows and columns would be printed in the output as shown in the second line above.
The next thing to do is to examine the suitability of each column for the model that we are trying to build.
Logistic Regression in Python - Restructuring Data
Whenever an organization conducts a survey, it tries to collect as much information as possible from the customer, with the idea that this information would be useful to the organization one way or another at a later point in time. To solve the current problem, we have to pick out the information that is directly relevant to our problem.
Displaying All Fields
Now, let us see how to select the data fields useful to us. Run the following statement in the code editor.
In [6]: print(list(df.columns))
You will see the following output −
['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
'previous', 'poutcome', 'emp_var_rate', 'cons_price_idx', 'cons_conf_idx',
'euribor3m', 'nr_employed', 'y']
The output shows the names of all the columns in the database. The last column "y" is a Boolean value indicating whether this customer has a term deposit with the bank. In the file used here, the values of this field are either 1 (yes) or 0 (no). You can read the description and purpose of each column in the bank-names.txt file that was downloaded as part of the data.
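Before going further, it is often worth checking how the target is distributed, since most surveyed clients will not have opened a TD. A hedged sketch on a made-up miniature of the target column (with the real data, you would call df['y'].value_counts()):

```python
import pandas as pd

# Made-up sample of the 0/1 target column, for illustration only
toy = pd.DataFrame({'y': [0, 0, 1, 0, 1, 0, 0, 0]})

# Count how many clients subscribed (1) versus did not (0)
counts = toy['y'].value_counts()
print(counts.loc[0], counts.loc[1])   # 6 2
```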
Eliminating Unwanted Fields
Examining the column names, you will know that some of the fields have no significance to the problem at hand. For example, fields such as month, day_of_week, campaign, etc. are of no use to us. We will eliminate these fields from our database. To drop a column, we use the drop command as shown below −
In [8]: #drop columns which are not needed.
df.drop(df.columns[[0, 3, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19]],
axis = 1, inplace = True)
This command drops the columns at indexes 0, 3, 7, 8, and so on. To ensure that an index is properly selected, use the following statement −
In [7]: df.columns[9]
Out[7]: 'day_of_week'
This prints the column name for the given index.
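As an alternative that avoids index mistakes altogether, pandas can also drop columns by name. A minimal sketch on a made-up frame (the column names mirror a few from the survey data):

```python
import pandas as pd

demo = pd.DataFrame({
    'age': [30, 45],
    'month': ['may', 'jun'],
    'day_of_week': ['mon', 'tue'],
    'job': ['admin', 'retired'],
})

# Drop by column name instead of by position
demo.drop(columns=['month', 'day_of_week'], inplace=True)
print(list(demo.columns))   # ['age', 'job']
```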
After dropping the columns which are not required, examine the data with the head statement. The screen output is shown here −
In [9]: df.head()
Out[9]:
job marital default housing loan poutcome y
0 blue-collar married unknown yes no nonexistent 0
1 technician married no no no nonexistent 0
2 management single no yes no success 1
3 services married no no no nonexistent 0
4 retired married no yes no success 1
Now, we have only the fields which we feel are important for our data analysis and prediction. The importance of the data scientist comes into the picture at this step. The data scientist has to select the appropriate columns for model building.
For example, the type of job, though at first glance it may not convince everybody of its relevance, will be a very useful field. Not all types of customers will open a TD. Lower income people may not open TDs, while higher income people will usually park their excess money in TDs. So the type of job becomes significantly relevant in this scenario. Likewise, carefully select the columns which you feel will be relevant for your analysis.
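One quick way to test such an intuition is to compute the subscription rate per job category. A sketch on made-up data (with the real survey data, you would group the df columns in the same way):

```python
import pandas as pd

# Made-up rows; 'y' is 1 if the client opened a TD
toy = pd.DataFrame({
    'job': ['blue-collar', 'management', 'blue-collar', 'management', 'retired'],
    'y':   [0, 1, 0, 1, 1],
})

# Mean of a 0/1 target per group = fraction of subscribers in that category
rate = toy.groupby('job')['y'].mean()
print(rate.to_dict())
```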
In the next chapter, we will prepare our data for building the model.
Logistic Regression in Python - Preparing Data
For creating the classifier, we must prepare the data in the format expected by the classifier building module. We prepare the data by doing One Hot Encoding.
Encoding Data
We will discuss shortly what we mean by encoding data. First, let us run the code. Run the following command in the code window.
In [10]: # creating one hot encoding of the categorical columns.
data = pd.get_dummies(df, columns=['job', 'marital', 'default', 'housing', 'loan', 'poutcome'])
As the comment says, the above statement will create the one hot encoding of the data. Let us see what it has created. Examine the created data, called "data", by printing the head records.
In [11]: data.head()
You will see the following output −
To understand the above data, we will list out the column names by running the data.columns command as shown below −
In [12]: data.columns
Out[12]: Index(['y', 'job_admin.', 'job_blue-collar', 'job_entrepreneur',
'job_housemaid', 'job_management', 'job_retired', 'job_self-employed',
'job_services', 'job_student', 'job_technician', 'job_unemployed',
'job_unknown', 'marital_divorced', 'marital_married', 'marital_single',
'marital_unknown', 'default_no', 'default_unknown', 'default_yes',
'housing_no', 'housing_unknown', 'housing_yes', 'loan_no',
'loan_unknown', 'loan_yes', 'poutcome_failure', 'poutcome_nonexistent',
'poutcome_success'], dtype='object')
Now, we will explain how the one hot encoding is done by the get_dummies command. The first column in the newly generated database is the "y" field, which indicates whether this client has subscribed to a TD or not. Now, let us look at the columns which are encoded. The first encoded column is "job". In the database, you will find that the "job" column has many possible values such as "admin", "blue-collar", "entrepreneur", and so on. For each possible value, a new column is created in the database, with the original column name added as a prefix to the value.
Thus, we have columns called “job_admin”, “job_blue-collar”, and so on. For each encoded field in our original database, you will find a list of columns added in the created database with all possible values that the column takes in the original database. Carefully examine the list of columns to understand how the data is mapped to a new database.
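The expansion is easiest to see on a toy frame. A minimal sketch of what get_dummies does to a single categorical column:

```python
import pandas as pd

toy = pd.DataFrame({'job': ['admin', 'blue-collar', 'admin']})

# Each distinct value becomes its own indicator column, prefixed with 'job_'
encoded = pd.get_dummies(toy, columns=['job'])
print(list(encoded.columns))                       # ['job_admin', 'job_blue-collar']
print(encoded['job_admin'].astype(int).tolist())   # [1, 0, 1]
```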
Understanding Data Mapping
To understand the generated data, let us print out the entire data using the data command. The partial output after running the command is shown below.
In [13]: data
The above screen shows the first twelve rows. If you scroll down further, you would see that the mapping is done for all the rows.
A partial screen output further down the database is shown here for your quick reference.
To understand the mapped data, let us examine the first row.
It says that this customer has not subscribed to a TD, as indicated by the value in the "y" field. It also indicates that this customer is a "blue-collar" customer. Scrolling horizontally, it will tell you that he has "housing" and has taken no "loan".
After this one hot encoding, we need some more data processing before we can start building our model.
Dropping the “unknown”
If you examine the columns in the mapped database, you will find a few columns ending with "unknown". For example, examine the column at index 12 with the following command −
In [14]: data.columns[12]
Out[14]: 'job_unknown'
This indicates the job for the specified customer is unknown. Obviously, there is no point in including such columns in our analysis and model building. Thus, all columns with the “unknown” value should be dropped. This is done with the following command −
In [15]: data.drop(data.columns[[12, 16, 18, 21, 24]], axis=1, inplace=True)
Ensure that you specify the correct column numbers. In case of a doubt, you can examine the column name anytime by specifying its index in the columns command as described earlier.
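If you would rather not hard-code index numbers at all, you can collect the "unknown" columns by their name suffix and drop them in one go. A sketch on made-up column names:

```python
import pandas as pd

demo = pd.DataFrame(0, index=[0],
                    columns=['y', 'job_admin', 'job_unknown',
                             'housing_yes', 'housing_unknown'])

# Gather every column whose name ends with '_unknown', then drop them all
unknown_cols = [c for c in demo.columns if c.endswith('_unknown')]
demo.drop(columns=unknown_cols, inplace=True)
print(list(demo.columns))   # ['y', 'job_admin', 'housing_yes']
```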
After dropping the undesired columns, you can examine the final list of columns as shown in the output below −
In [16]: data.columns
Out[16]: Index(['y', 'job_admin.', 'job_blue-collar', 'job_entrepreneur',
'job_housemaid', 'job_management', 'job_retired', 'job_self-employed',
'job_services', 'job_student', 'job_technician', 'job_unemployed',
'marital_divorced', 'marital_married', 'marital_single', 'default_no',
'default_yes', 'housing_no', 'housing_yes', 'loan_no', 'loan_yes',
'poutcome_failure', 'poutcome_nonexistent', 'poutcome_success'],
dtype='object')
At this point, our data is ready for model building.
Logistic Regression in Python - Splitting Data
We have about forty-one thousand and odd records. If we use the entire data for model building, we will not be left with any data for testing. So generally, we split the entire data set into two parts, say 70/30 percentage. We use 70% of the data for model building and the rest for testing the accuracy in prediction of our created model. You may use a different splitting ratio as per your requirement.
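By default, sklearn's train_test_split holds out 25% of the data. To make a 70/30 split explicit, pass test_size=0.3; a self-contained sketch on synthetic arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)   # 100 synthetic samples, one feature
Y = np.zeros(100)                    # dummy target values

# test_size=0.3 keeps 70% for training and 30% for testing;
# random_state makes the shuffle reproducible
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0)

print(len(X_train), len(X_test))   # 70 30
```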
Creating Features Array
Before we split the data, we separate it into two arrays, X and Y. The X array contains all the features (data columns) that we want to analyze, and the Y array is a one-dimensional array of 0/1 values that is the target of the prediction. To understand this, let us run some code.
Firstly, execute the following Python statement to create the X array −
In [17]: X = data.iloc[:,1:]
To examine the contents of X, use head to print a few initial records. The following screen shows the contents of the X array.
In [18]: X.head()
The array has several rows and 23 columns.
Next, we will create the output array containing the "y" values.
Creating Output Array
To create an array for the predicted value column, use the following Python statement −
In [19]: Y = data.iloc[:,0]
Examine its contents by calling head. The screen output below shows the result −
In [20]: Y.head()
Out[20]: 0 0
1 0
2 1
3 0
4 1
Name: y, dtype: int64
Now, split the data using the following command −
In [21]: X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
This will create the four arrays called X_train, Y_train, X_test, and Y_test. As before, you may examine the contents of these arrays by using the head command. We will use X_train and Y_train arrays for training our model and X_test and Y_test arrays for testing and validating.
Now, we are ready to build our classifier. We will look into it in the next chapter.
Logistic Regression in Python - Building Classifier
It is not required that you build the classifier from scratch. Building classifiers is complex and requires knowledge of several areas such as statistics, probability theory, optimization techniques, and so on. Several pre-built libraries are available that have fully-tested and very efficient implementations of these classifiers. We will use one such pre-built model from sklearn.
The sklearn Classifier
Creating the Logistic Regression classifier from sklearn toolkit is trivial and is done in a single program statement as shown here −
In [22]: classifier = LogisticRegression(solver='lbfgs',random_state=0)
Once the classifier is created, you will feed your training data into the classifier so that it can tune its internal parameters and be ready for the predictions on your future data. To tune the classifier, we run the following statement −
In [23]: classifier.fit(X_train, Y_train)
The classifier is now ready for testing. The following code is the output of execution of the above two statements −
Out[23]: LogisticRegression(C=1.0, class_weight=None, dual=False,
   fit_intercept=True, intercept_scaling=1, max_iter=100,
   multi_class='warn', n_jobs=None, penalty='l2', random_state=0,
   solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)
Now, we are ready to test the created classifier. We will deal with this in the next chapter.
Logistic Regression in Python - Testing
We need to test the above classifier before we put it into production use. If the testing reveals that the model does not meet the desired accuracy, we will have to go back in the above process, select another set of features (data fields), build the model again, and test it. This is an iterative process that continues until the classifier meets the desired accuracy. So let us test our classifier.
Predicting Test Data
To test the classifier, we use the test data generated in the earlier stage. We call the predict method on the created object and pass the X array of the test data as shown in the following command −
In [24]: predicted_y = classifier.predict(X_test)
This generates a one-dimensional array for the entire test data set, giving the prediction for each row in the X array. You can examine this array by using the following command −
In [25]: predicted_y
The following is the output upon execution of the above two commands −
Out[25]: array([0, 0, 0, ..., 0, 0, 0])
The output indicates that the first three and the last three customers are not potential candidates for the Term Deposit. You can examine the entire array to pick out the potential customers. To do so, use the following Python code snippet −
In [26]: for x in range(len(predicted_y)):
if (predicted_y[x] == 1):
print(x, end="\t")
The output of running the above code is shown below −
The output shows the indexes of all the rows that are probable candidates for subscribing to a TD. You can now give this output to the bank's marketing team, who would pick up the contact details of each customer in the selected rows and proceed with their job.
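Equivalently, numpy can pull out those indexes in a single call, which is convenient when handing the list over. A sketch on a made-up prediction array standing in for the output of classifier.predict:

```python
import numpy as np

# Made-up predictions, for illustration only
predicted_y = np.array([0, 1, 0, 0, 1, 1, 0])

# Indexes of all rows predicted as 1, i.e. potential TD subscribers
candidates = np.flatnonzero(predicted_y == 1)
print(candidates.tolist())   # [1, 4, 5]
```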
Before we put this model into production, we need to verify the accuracy of prediction.
Verifying Accuracy
To test the accuracy of the model, use the score method on the classifier as shown below −
In [27]: print('Accuracy: {:.2f}'.format(classifier.score(X_test, Y_test)))
The screen output of running this command is shown below −
Accuracy: 0.90
It shows that the accuracy of our model is 90%, which is considered very good in most applications. Thus, no further tuning is required. Now, our client, the bank, is ready to run the next campaign, get the list of potential customers, and chase them for opening a TD with a probably high rate of success.
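One caution: with imbalanced classes (most surveyed clients do not open a TD), accuracy alone can be flattering. A confusion matrix gives a fuller picture; here is a hedged sketch on made-up label arrays standing in for Y_test and predicted_y:

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels, for illustration only
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Rows are the actual class (0, 1); columns are the predicted class (0, 1)
cm = confusion_matrix(y_true, y_pred)
print(cm.tolist())   # [[7, 1], [1, 1]]
```

With the real model, confusion_matrix(Y_test, predicted_y) would show how many of the actual subscribers the classifier finds, not just its overall accuracy.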
Logistic Regression in Python - Limitations
As you have seen from the above example, applying logistic regression for machine learning is not a difficult task. However, it comes with its own limitations. Logistic regression will not be able to handle a large number of categorical features. In the example we have discussed so far, we reduced the number of features to a very large extent.
However, if those features were important to our prediction, we would have been forced to include them, but then logistic regression would fail to give us good accuracy. Logistic regression is also vulnerable to overfitting. It cannot be applied to non-linear problems. It will perform poorly with independent variables that are not correlated to the target but are correlated to each other. Thus, you will have to carefully evaluate the suitability of logistic regression to the problem that you are trying to solve.
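On the overfitting point: sklearn's LogisticRegression applies L2 regularization by default, controlled by the C parameter (smaller C means a stronger penalty). A sketch on synthetic data showing how C shrinks the learned coefficients:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 5)                                  # synthetic features
y = (X[:, 0] + 0.1 * rng.randn(200) > 0).astype(int)   # synthetic target

# Smaller C = stronger L2 penalty = smaller coefficients, less overfitting risk
strong = LogisticRegression(C=0.01, solver='lbfgs').fit(X, y)
weak = LogisticRegression(C=100.0, solver='lbfgs').fit(X, y)

print(abs(strong.coef_).sum() < abs(weak.coef_).sum())   # True
```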
Many other techniques have been devised for other areas of machine learning. To name a few, we have algorithms such as k-nearest neighbours (kNN), Linear Regression, Support Vector Machines (SVM), Decision Trees, Naive Bayes, and so on. Before finalizing a particular model, you will have to evaluate the applicability of these various techniques to the problem that you are trying to solve.
Logistic Regression in Python - Summary
Logistic Regression is a statistical technique for binary classification. In this tutorial, you learned how to train a machine using logistic regression. When creating machine learning models, the most important requirement is the availability of data. Without adequate and relevant data, you cannot make the machine learn.
Once you have data, your next major task is cleansing it: eliminating unwanted rows and fields, and selecting the appropriate fields for your model development. After this is done, you need to map the data into the format required by the classifier for its training. Thus, data preparation is a major task in any machine learning application. Once you are ready with the data, you can select a particular type of classifier.
In this tutorial, you learned how to use the logistic regression classifier provided in the sklearn library. To train the classifier, we used about 70% of the data, and the rest for testing. We then measured the accuracy of the model. If this is not within acceptable limits, we go back and select a new set of features.
Once again, follow the entire process of preparing the data, training the model, and testing it, until you are satisfied with its accuracy. Before taking up any machine learning project, you must learn and have exposure to the wide variety of techniques which have been developed so far and which have been applied successfully in the industry.