Logistic Regression in Python: A Concise Tutorial

Logistic Regression in Python - Restructuring Data

When an organization conducts a survey, it tries to collect as much information as possible from each customer, on the assumption that the information will prove useful in one way or another at some later point. To solve the problem at hand, we have to pick out only the information that is directly relevant to it.

Displaying All Fields

Now, let us see how to select the data fields that are useful to us. Run the following statement in the code editor.

In [6]: print(list(df.columns))

You will see the following output:

['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
'previous', 'poutcome', 'emp_var_rate', 'cons_price_idx', 'cons_conf_idx',
'euribor3m', 'nr_employed', 'y']

The output shows the names of all the columns in the dataset. The last column, “y”, is a Boolean field indicating whether the customer has a term deposit with the bank. In the data loaded here it is encoded as 1 (yes) or 0 (no), as the head() output later in this section shows. You can read the description and purpose of each column in the banks-name.txt file that was downloaded as part of the data.

Eliminating Unwanted Fields

Examining the column names, you will see that some of the fields are not relevant to the problem at hand. For example, fields such as month, day_of_week, and campaign are not needed for this analysis. We will eliminate these fields from our DataFrame. To drop columns, we use the drop command as shown below:

In [8]: # Drop the columns that are not needed.
   ...: df.drop(df.columns[[0, 3, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19]],
   ...:         axis=1, inplace=True)

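The command drops the columns at positions 0, 3, 7, 8, and so on, counting from zero. Here axis = 1 tells pandas to drop columns rather than rows, and inplace = True modifies df itself instead of returning a copy. To make sure you have picked the right indices, check the column name at a given position before running the drop (note the earlier cell number), using the following statement: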

In [7]: df.columns[9]
Out[7]: 'day_of_week'

This prints the name of the column at the given index.
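
Checking one index at a time is slow when fourteen columns are being dropped. As a minimal sketch, the loop below lists every position next to its column name. It assumes an empty stand-in DataFrame built from the column names printed earlier; with the real data you would run the loop on df directly, before the drop.

```python
import pandas as pd

# Column names copied from the output shown earlier in this section.
columns = ['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
           'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
           'previous', 'poutcome', 'emp_var_rate', 'cons_price_idx',
           'cons_conf_idx', 'euribor3m', 'nr_employed', 'y']

# Assumption: an empty stand-in for the tutorial's df, which is loaded elsewhere.
df = pd.DataFrame(columns=columns)

# Map each zero-based position to its column name and print the pairs.
index_to_name = dict(enumerate(df.columns))
for position, name in index_to_name.items():
    print(position, name)
```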

After dropping the columns that are not required, examine the data with the head statement. The screen output is shown here:

In [9]: df.head()
Out[9]:
           job  marital  default  housing  loan     poutcome  y
0  blue-collar  married  unknown      yes    no  nonexistent  0
1   technician  married       no       no    no  nonexistent  0
2   management   single       no      yes    no      success  1
3     services  married       no       no    no  nonexistent  0
4      retired  married       no      yes    no      success  1
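
Positional indices are easy to get wrong and break silently if the column order ever changes. One possible alternative, sketched here, is to translate the positions into names once and drop by name. The stand-in DataFrame with a single placeholder row and the variable names names_to_drop and reduced are assumptions for illustration, and the sketch returns a new DataFrame rather than modifying df in place.

```python
import pandas as pd

# Column names copied from the output shown earlier in this section.
columns = ['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
           'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
           'previous', 'poutcome', 'emp_var_rate', 'cons_price_idx',
           'cons_conf_idx', 'euribor3m', 'nr_employed', 'y']

# Assumption: a stand-in for the tutorial's df with one placeholder row.
df = pd.DataFrame([[0] * len(columns)], columns=columns)

# The same positions used in the tutorial's drop command.
positions = [0, 3, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19]

# Translate positions to names so the drop is readable and order-independent.
names_to_drop = [df.columns[i] for i in positions]

# Return a reduced copy; df itself is left untouched.
reduced = df.drop(columns=names_to_drop)
print(list(reduced.columns))
```

The remaining columns should match the seven shown in the head output above.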

Now we are left with only the fields that we consider important for our data analysis and prediction. This is the step where the data scientist's judgment matters: the data scientist has to select the appropriate columns for building the model.

For example, the job type may not at first glance seem worth including, yet it is a very useful field. Not all types of customers will open a term deposit (TD). Lower-income customers may not open TDs, while higher-income customers will usually park their excess money in them. The type of job therefore becomes significantly relevant in this scenario. Likewise, carefully select the columns that you feel will be relevant for your analysis.
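
One way to check such an intuition, rather than rely on it, is to compare the share of term-deposit holders across job types. The sketch below uses only the five rows shown in the head output above, so its numbers carry no statistical weight. The names sample and rate_by_job are hypothetical, and on the full data you would apply the same groupby to df.

```python
import pandas as pd

# The five rows from the head() output in this section (job and y only).
sample = pd.DataFrame({
    'job': ['blue-collar', 'technician', 'management', 'services', 'retired'],
    'y': [0, 0, 1, 0, 1],
})

# Because y is coded 0/1, its mean per group is the share of term-deposit holders.
rate_by_job = sample.groupby('job')['y'].mean().sort_values(ascending=False)
print(rate_by_job)
```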

In the next chapter, we will prepare our data for building the model.