Logistic Regression in Python: A Concise Tutorial

Logistic Regression in Python - Preparing Data

To create the classifier, we must prepare the data in the format required by the classifier building module. We prepare the data by applying one hot encoding.

Encoding Data

We will discuss shortly what we mean by encoding data. First, let us run the code. Run the following command in the code window.

In [10]: # creating one hot encoding of the categorical columns.
data = pd.get_dummies(df, columns=['job', 'marital', 'default', 'housing', 'loan', 'poutcome'])

As the comment says, the above statement creates the one hot encoding of the categorical columns. Let us see what it has created. Examine the result, called “data”, by printing its first few records.

In [11]: data.head()

You will see the following output:

[Screenshot: first records of the created data]

To understand the above data, we will list the column names by running the data.columns command as shown below:

In [12]: data.columns
Out[12]: Index(['y', 'job_admin.', 'job_blue-collar', 'job_entrepreneur',
'job_housemaid', 'job_management', 'job_retired', 'job_self-employed',
'job_services', 'job_student', 'job_technician', 'job_unemployed',
'job_unknown', 'marital_divorced', 'marital_married', 'marital_single',
'marital_unknown', 'default_no', 'default_unknown', 'default_yes',
'housing_no', 'housing_unknown', 'housing_yes', 'loan_no',
'loan_unknown', 'loan_yes', 'poutcome_failure', 'poutcome_nonexistent',
'poutcome_success'], dtype='object')

Now we will explain how the get_dummies command performs the one hot encoding. The first column in the newly generated DataFrame is the “y” field, which indicates whether the client has subscribed to a term deposit (TD). Next, look at the encoded columns. The first encoded field is “job”. In the original data, the “job” column takes many possible values, such as “admin.”, “blue-collar”, “entrepreneur”, and so on. For each possible value, a new column is created whose name is the value prefixed with the original column name.

Thus, we have columns called “job_admin.”, “job_blue-collar”, and so on. For each encoded field in the original data, the new DataFrame contains one column per value that the field takes. Carefully examine the list of columns to understand how the data is mapped to the new DataFrame.
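The naming scheme described above can be reproduced on a small scale. The following is a minimal sketch, assuming a made-up three-row DataFrame (the variable names toy and encoded and the row values are illustrative only and are not part of the tutorial's dataset). Passing dtype=int is a choice made here so that the indicator columns print as 0/1 regardless of the default in your pandas version.

```python
import pandas as pd

# Hypothetical toy data: column names mirror the tutorial, rows are invented.
toy = pd.DataFrame({
    "y": [0, 1, 0],
    "job": ["admin.", "blue-collar", "admin."],
    "marital": ["married", "single", "unknown"],
})

# One new column per distinct value, named "<original column>_<value>".
# dtype=int forces 0/1 integers instead of the version-dependent default.
encoded = pd.get_dummies(toy, columns=["job", "marital"], dtype=int)

print(list(encoded.columns))
print(encoded)
```

Columns that are not listed in the columns argument (here “y”) are carried over unchanged, and each encoded field contributes exactly one 1 per row across its indicator columns.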

Understanding Data Mapping

To understand the generated data, let us print the entire DataFrame using the data command. Part of the output is shown below.

In [13]: data

[Screenshot: partial output of the data command]

The above screen shows the first twelve rows. If you scroll down further, you will see that the mapping is done for all the rows.

For quick reference, part of the output from further down the DataFrame is shown here.

[Screenshot: partial output from further down the data]

To understand the mapped data, let us examine the first row.

[Screenshot: first row of the mapped data]

The value in the “y” field indicates that this customer has not subscribed to a TD. The row also shows that this customer is a “blue-collar” customer. Scrolling horizontally, you will see that the customer has a “housing” loan and has taken no personal “loan”.
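Reading a one hot encoded row by scrolling can be tedious, so it may help to list only the indicator columns that are set. The sketch below assumes a made-up two-row DataFrame whose first row matches the customer described above (the names sample, encoded, first and active are hypothetical and do not appear in the tutorial).

```python
import pandas as pd

# Hypothetical rows; the first one mimics the customer discussed in the text.
sample = pd.DataFrame({
    "y": [0, 1],
    "job": ["blue-collar", "technician"],
    "housing": ["yes", "no"],
    "loan": ["no", "no"],
})

encoded = pd.get_dummies(sample, columns=["job", "housing", "loan"], dtype=int)

# Take the first row, set the target "y" aside, and keep the columns equal to 1.
first = encoded.iloc[0].drop("y")
active = first[first == 1].index.tolist()

print(active)
```

Each name in the resulting list has the form column_value, so it reads directly as "job is blue-collar", "housing is yes", and "loan is no".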

After this one hot encoding, we need some more data processing before we can start building our model.

Dropping the “unknown”

If we examine the columns in the mapped DataFrame, we find a few columns ending with “unknown”. For example, examine the column at index 12 with the following command:

In [14]: data.columns[12]
Out[14]: 'job_unknown'

This column flags customers whose job is unknown. There is little point in including such columns in our analysis and model building. Thus, all columns ending with “unknown” should be dropped. This is done with the following command, which passes their positional indexes (12, 16, 18, 21 and 24 in the column list shown earlier):

In [15]: data.drop(data.columns[[12, 16, 18, 21, 24]], axis=1, inplace=True)

Ensure that you specify the correct column numbers. If in doubt, you can check a column name at any time by passing its index to data.columns, as described earlier.
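Because positional indexes are easy to get wrong, one alternative is to select the columns by name. This is a sketch of a different approach from the index-based drop used above, not the tutorial's own command; it assumes a small made-up DataFrame (toy, unknown_cols and cleaned are hypothetical names) and relies on the “_unknown” suffix that get_dummies produces for the “unknown” value.

```python
import pandas as pd

# Hypothetical, already-encoded toy data with two "unknown" indicator columns.
toy = pd.DataFrame({
    "y": [0, 1],
    "job_admin.": [1, 0],
    "job_unknown": [0, 1],
    "marital_married": [1, 0],
    "marital_unknown": [0, 1],
})

# Collect every column whose name ends with "_unknown", then drop them by name.
unknown_cols = [col for col in toy.columns if col.endswith("_unknown")]
cleaned = toy.drop(columns=unknown_cols)

print(unknown_cols)
print(list(cleaned.columns))
```

Selecting by name keeps working if the column order changes, whereas the index list has to be rechecked whenever the set of encoded fields is modified.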

After dropping the undesired columns, you can examine the final list of columns as shown in the output below:

In [16]: data.columns
Out[16]: Index(['y', 'job_admin.', 'job_blue-collar', 'job_entrepreneur',
'job_housemaid', 'job_management', 'job_retired', 'job_self-employed',
'job_services', 'job_student', 'job_technician', 'job_unemployed',
'marital_divorced', 'marital_married', 'marital_single', 'default_no',
'default_yes', 'housing_no', 'housing_yes', 'loan_no', 'loan_yes',
'poutcome_failure', 'poutcome_nonexistent', 'poutcome_success'],
dtype='object')

At this point, our data is ready for model building.