Logistic Regression in Python: A Concise Tutorial

Logistic Regression in Python - Getting Data

This chapter walks through the steps involved in getting the data needed to perform logistic regression in Python.

Downloading Dataset

If you have not already downloaded the UCI dataset mentioned earlier, download it now from the UCI Machine Learning Repository. Click on the Data Folder. You will see the following screen:

[Image: machine learning databases]

Download the bank.zip file by clicking on the given link. The zip file contains the following files:

[Image: contents of bank.zip]

We will use the bank.csv file for model development. The bank-names.txt file contains a description of the dataset, which you will need later. The bank-full.csv file contains a much larger dataset that you may use for more advanced work.
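Extraction can also be scripted rather than done by hand. The following is a minimal sketch using Python's standard zipfile module. It assumes bank.zip has been saved to the current working directory; the helper name extract_archive and the target folder "data" are illustrative and not part of the original tutorial.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="data"):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(dest)
    return names


if __name__ == "__main__":
    # Assumption: bank.zip was downloaded to the current directory.
    if Path("bank.zip").exists():
        print(extract_archive("bank.zip"))
```

Returning the member names makes it easy to confirm that bank.csv, bank-names.txt and bank-full.csv are all present before moving on.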

The downloadable project source zip already includes the bank.csv file. The file contains comma-delimited fields, and we have made a few modifications to it. We recommend using the copy included in the project source zip as you follow along.

Loading Data

To load the data from the csv file that you just copied, type the following statement and run the code.

In [2]: df = pd.read_csv('bank.csv', header=0)
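The statement above relies on pandas already being imported under the alias pd. As a self-contained sketch, the snippet below wraps the load in a small helper (load_bank_data is a hypothetical name) and runs it on a two-row in-memory sample instead of the real file, so the sample values are placeholders only.

```python
import io

import pandas as pd


def load_bank_data(source):
    """Read comma-delimited data whose first line holds the column names."""
    # header=0 tells pandas that row 0 contains the column labels.
    return pd.read_csv(source, header=0)


# Placeholder sample standing in for bank.csv; the values are made up.
sample = io.StringIO("age,job,y\n30,admin.,0\n45,technician,1\n")
df_sample = load_bank_data(sample)
print(df_sample.columns.tolist())
```

With the real file in place, the same helper could be called as load_bank_data('bank.csv').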

You can examine the loaded data by running the following statement:

In [3]: df.head()

Once the command is run, you will see the following output:

[Image: loaded data]

The command prints the first five rows of the loaded data. Examine the 21 columns present. We will use only a few of these columns for model development.
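Besides head(), the column names and data types can be listed directly, and a subset of columns can be selected by name. The sketch below uses a small toy DataFrame with made-up values. The column names and the subset kept are illustrative assumptions, not the selection made later in this tutorial.

```python
import pandas as pd

# Toy frame with made-up values; the column names are illustrative.
df_toy = pd.DataFrame(
    {
        "age": [30, 45, 52],
        "job": ["admin.", "technician", "retired"],
        "duration": [120, 300, 85],
        "y": [0, 1, 0],
    }
)

print(df_toy.columns.tolist())  # list every column name
print(df_toy.dtypes)            # data type of each column

# Keep only a few columns (an arbitrary choice for this sketch).
selected = df_toy[["age", "job", "y"]]
print(selected.head())
```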

Next, we need to clean the data. The data may contain some rows with NaN values. To eliminate such rows, use the following command:

In [4]: df = df.dropna()

Fortunately, bank.csv does not contain any rows with NaN, so this step is not strictly required in our case. In general, however, such rows are hard to spot by eye in a large dataset, so it is safer to run the above statement anyway.
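Rather than hunting for NaN rows by eye, the missing values can be counted before anything is dropped. The following sketch does this on a toy DataFrame with deliberately inserted gaps; the data and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with two deliberately missing values.
df_gaps = pd.DataFrame(
    {
        "age": [30, np.nan, 52, 41],
        "job": ["admin.", "technician", None, "services"],
    }
)

# Count the missing entries in each column.
missing_per_column = df_gaps.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value and report how many went.
rows_before = len(df_gaps)
cleaned = df_gaps.dropna()
rows_dropped = rows_before - len(cleaned)
print(rows_dropped)
```

Comparing the row count before and after dropna() shows how much data the cleaning step actually removed.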

Note: You can check the size of the data at any point by using the following statement:

In [5]: print(df.shape)
(41188, 21)

The number of rows and columns is printed as a tuple, as shown in the second line above: 41188 rows and 21 columns.
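Because shape is a plain (rows, columns) tuple, it can be unpacked into two variables when the counts are needed separately. A short sketch on a made-up three-row DataFrame:

```python
import pandas as pd

# Made-up data purely to demonstrate the shape attribute.
df_small = pd.DataFrame({"age": [30, 45, 52], "y": [0, 1, 0]})

# shape is a (number_of_rows, number_of_columns) tuple.
n_rows, n_cols = df_small.shape
print(f"{n_rows} rows, {n_cols} columns")
```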

The next step is to examine the suitability of each column for the model that we are trying to build.