Machine Learning: A Concise Tutorial

Machine Learning - Data Understanding

When working on machine learning projects, we often neglect two of the most important ingredients: mathematics and data. Data understanding is a critical step because ML is data-driven: a model can only produce results as good, or as bad, as the data we provide to it.

Data understanding involves analyzing and exploring the data to identify any patterns or trends that may be present.

The data understanding phase typically involves the following steps −

Data Collection − This involves gathering the relevant data that you will be using for your analysis. The data can be collected from various sources such as databases, websites, and APIs.
Data Cleaning − This involves cleaning the data by removing any irrelevant or duplicate data, and dealing with missing data values. The data should be formatted in a way that makes it easy to analyze.
Data Exploration − This involves exploring the data to identify any patterns or trends that may be present. This can be done using statistical and graphical techniques such as summary statistics, histograms, scatter plots, and correlation analysis.
Data Visualization − This involves creating visual representations of the data to help you understand it better. This can be done using tools such as graphs, charts, and maps.
Data Preprocessing − This involves transforming the data to make it suitable for use in machine learning algorithms. This can include scaling the data, transforming it into a different format, or reducing its dimensionality.
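The cleaning and preprocessing steps above can be sketched with Pandas. This is only a minimal illustration on a made-up table: the column names and values are hypothetical, and median imputation and min-max scaling are just one possible choice for each step.

```python
import pandas as pd

# Hypothetical toy data: one duplicate row and one missing value.
raw = pd.DataFrame({
    "age": [25.0, 32.0, 32.0, None, 51.0],
    "income": [30000.0, 52000.0, 52000.0, 41000.0, 88000.0],
})

# Data cleaning: drop exact duplicate rows, then fill missing values
# with the median of each column.
clean = raw.drop_duplicates().reset_index(drop=True)
clean = clean.fillna(clean.median())

# Data preprocessing: min-max scale every column to the range [0, 1].
scaled = (clean - clean.min()) / (clean.max() - clean.min())

print(clean)
print(scaled)
```

Whether to impute, drop, or scale at all depends on the dataset and the algorithm, so treat these choices as placeholders.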

Understand the Data before Loading It into ML Projects

Understanding our data before loading it into an ML project is important for several reasons −

Identify Data Quality Issues

By understanding your data, you can identify data quality issues such as missing values, outliers, incorrect data types, and inconsistencies that can degrade the performance of your ML model. Addressing these issues improves the quality and accuracy of the model.
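As a rough sketch of how such issues might be detected with Pandas, the snippet below checks a made-up table for missing values, a numeric column stored as text, and outliers. The column names and values are hypothetical, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import pandas as pd

# Hypothetical toy data containing common quality problems.
df = pd.DataFrame({
    "height_cm": [170.0, 165.0, None, 172.0, 168.0, 900.0],  # missing value and an outlier
    "weight_kg": ["70", "65", "80", "n/a", "68", "75"],      # numbers stored as strings
})

# Missing values per column.
missing_counts = df.isna().sum()

# Incorrect data types: a numeric column stored as text has a non-numeric dtype.
column_types = df.dtypes

# Coerce the text column to numbers; entries that cannot be parsed become NaN.
weight = pd.to_numeric(df["weight_kg"], errors="coerce")

# Flag outliers with the 1.5 * IQR rule (an illustrative convention).
q1, q3 = df["height_cm"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["height_cm"] < q1 - 1.5 * iqr) | (df["height_cm"] > q3 + 1.5 * iqr)]

print(missing_counts)
print(column_types)
print(weight.isna().sum())
print(outliers)
```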

Determine Data Relevance

You can determine whether the data you have collected is relevant to the problem you are trying to solve. By understanding your data, you can also determine which features are important for your model and which ones can be ignored.

Select Appropriate ML Techniques

Depending on the characteristics of your data, you may need to choose a particular ML technique or algorithm. For example, if the target variable you want to predict is categorical, you would typically use classification techniques, whereas if it is continuous, you would typically use regression techniques. Understanding your data helps you select an appropriate ML technique for your problem.

Improve Model Performance

By understanding your data, you can engineer new features, preprocess the data, and select an appropriate ML technique to improve the performance of your model. This can result in better accuracy, precision, recall, and F1 score.

Data Understanding with Statistics

In the previous chapter, we discussed how to load CSV data into our ML project, but it is better to understand the data before loading it. We can understand the data in two ways: with statistics and with visualization.

In this chapter, with the help of the following Python recipes, we are going to understand ML data with statistics.

Looking at Raw Data

The first recipe is to look at your raw data. Looking at the raw data matters because the insight it gives improves our chances of pre-processing and handling the data well in an ML project.

The following Python script uses the head() function of a Pandas DataFrame on the Pima Indians diabetes dataset to look at the first 10 rows −

Example

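from pandas import read_csv

# Adjust the path to wherever the CSV file is stored.
path = r"C:\pima-indians-diabetes.csv"
# Column names are supplied explicitly through the names argument.
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
# Show the first 10 rows.
print(data.head(10))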

   preg  plas  pres  skin  test  mass   pedi  age  class
0     6   148    72    35     0  33.6  0.627   50      1
1     1    85    66    29     0  26.6  0.351   31      0
2     8   183    64     0     0  23.3  0.672   32      1
3     1    89    66    23    94  28.1  0.167   21      0
4     0   137    40    35   168  43.1  2.288   33      1
5     5   116    74     0     0  25.6  0.201   30      0
6     3    78    50    32    88  31.0  0.248   26      1
7    10   115     0     0     0  35.3  0.134   29      0
8     2   197    70    45   543  30.5  0.158   53      1
9     8   125    96     0     0   0.0  0.232   54      1

In the output above, the first column is the row index that Pandas adds automatically, which is useful for referencing a specific observation.

Checking Dimensions of Data

It is always good practice to know how much data, in terms of rows and columns, we have for our ML project. The reasons are −

If we have too many rows and columns, it will take a long time to run the algorithm and train the model.
If we have too few rows and columns, we will not have enough data to train the model well.

The following Python script prints the shape property of a Pandas DataFrame. We apply it to the iris dataset to get its total number of rows and columns.

Example

from pandas import read_csv
path = r"C:\iris.csv"
data = read_csv(path)
print(data.shape)

(150, 4)

We can easily observe from the output that the iris dataset we are going to use has 150 rows and 4 columns.

Getting Each Attribute’s Data Type

It is also good practice to know the data type of each attribute, because we may sometimes need to convert one data type to another. For example, we may need to convert a string into a floating point or integer value to represent categorical or ordinal values. We can get an idea of an attribute’s data type by looking at the raw data, but another way is to use the dtypes property of a Pandas DataFrame, which lists the data type of each attribute. The following Python script shows this −

Example

from pandas import read_csv
path = r"C:\iris.csv"
data = read_csv(path)
print(data.dtypes)

sepal_length  float64
sepal_width   float64
petal_length  float64
petal_width   float64
dtype: object

From the above output, we can easily read off the data type of each attribute.
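The conversions mentioned above, string to floating point and string to integer codes for ordinal values, might look like the following sketch. The table, the `size` column, and its ordering are hypothetical and only serve to illustrate the calls.

```python
import pandas as pd

# Hypothetical toy data: a numeric column read as text and an ordinal column.
df = pd.DataFrame({
    "petal_length": ["1.4", "4.7", "6.0"],
    "size": ["small", "medium", "large"],
})

# String -> floating point.
df["petal_length"] = df["petal_length"].astype(float)

# String -> ordered categorical, then integer codes that respect the order.
size_order = ["small", "medium", "large"]
df["size"] = pd.Categorical(df["size"], categories=size_order, ordered=True)
df["size_code"] = df["size"].cat.codes

print(df.dtypes)
print(df)
```

Passing the category order explicitly matters for ordinal data; without it, the codes follow alphabetical order instead.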

Statistical Summary of Data

We have discussed the Python recipe for getting the shape of the data, i.e. the number of rows and columns, but often we also need a summary of the values themselves. This can be done with the describe() function of a Pandas DataFrame, which provides the following 8 statistical properties for every numeric attribute −

Count
Mean
Standard deviation
Minimum value
25% (first quartile)
Median, i.e. 50%
75% (third quartile)
Maximum value

Example

from pandas import read_csv
from pandas import set_option

path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)
set_option('display.width', 100)
# Use the full option name; the short form 'precision' is rejected by recent Pandas versions.
set_option('display.precision', 2)
print(data.shape)
print(data.describe())

(768, 9)
         preg    plas    pres    skin    test    mass    pedi     age   class
count  768.00  768.00  768.00  768.00  768.00  768.00  768.00  768.00  768.00
mean     3.85  120.89   69.11   20.54   79.80   31.99    0.47   33.24    0.35
std      3.37   31.97   19.36   15.95  115.24    7.88    0.33   11.76    0.48
min      0.00    0.00    0.00    0.00    0.00    0.00    0.08   21.00    0.00
25%      1.00   99.00   62.00    0.00    0.00   27.30    0.24   24.00    0.00
50%      3.00  117.00   72.00   23.00   30.50   32.00    0.37   29.00    0.00
75%      6.00  140.25   80.00   32.00  127.25   36.60    0.63   41.00    1.00
max     17.00  199.00  122.00   99.00  846.00   67.10    2.42   81.00    1.00

The above output shows the shape of the Pima Indians diabetes dataset along with its statistical summary.
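One thing a summary like this can reveal is a suspicious minimum. Here the minimum is 0.00 for plas, pres, skin, test, and mass. The sketch below assumes that a zero in these columns means the measurement was not recorded (an assumption, not something the dataset states), and counts and masks those zeros. To stay self-contained it uses only the first ten rows printed by head(10) earlier, not the full file.

```python
import io

import numpy as np
import pandas as pd

# First ten rows of the Pima Indians diabetes data, as printed by head(10) earlier.
csv_text = """6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31.0,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
2,197,70,45,543,30.5,0.158,53,1
8,125,96,0,0,0.0,0.232,54,1
"""
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(io.StringIO(csv_text), names=names)

# Assumption: a zero in these columns stands for a missing measurement.
zero_as_missing = ['plas', 'pres', 'skin', 'test', 'mass']
zero_counts = (data[zero_as_missing] == 0).sum()

# Replace those zeros with NaN so that describe() leaves them out of the statistics.
cleaned = data.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)

print(zero_counts)
print(cleaned[zero_as_missing].describe())
```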

Reviewing Class Distribution

Class distribution statistics are useful in classification problems, where we need to know how balanced the class values are. This matters because a highly imbalanced class distribution, i.e. one class having many more observations than another, may need special handling at the data preparation stage of our ML project. We can easily get the class distribution in Python with the help of a Pandas DataFrame.

Example

from pandas import read_csv
path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)
count_class = data.groupby('class').size()
print(count_class)

class
0    500
1    268
dtype: int64

From the above output, it can be clearly seen that class 0 has almost twice as many observations as class 1.
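Raw counts are often easier to judge as proportions. As a small sketch, the snippet below rebuilds a class column from the counts printed above (500 and 268) instead of reading the CSV file, then computes each class's share and the ratio between the largest and smallest class. The variable names are illustrative.

```python
import pandas as pd

# Rebuild a class column from the counts printed above (500 zeros, 268 ones).
classes = pd.Series([0] * 500 + [1] * 268, name="class")

counts = classes.value_counts()
# Share of each class in the whole dataset.
proportions = classes.value_counts(normalize=True).round(3)
# How many times larger the biggest class is than the smallest one.
imbalance_ratio = counts.max() / counts.min()

print(proportions)
print(round(imbalance_ratio, 2))
```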

Reviewing Correlation between Attributes

The relationship between two variables is called correlation. In statistics, the most common measure of correlation is Pearson’s correlation coefficient, which can take any value from -1 to 1. Three reference values are −

Coefficient value = 1 − It represents perfect positive linear correlation between the variables.
Coefficient value = -1 − It represents perfect negative linear correlation between the variables.
Coefficient value = 0 − It represents no linear correlation between the variables.
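To make these reference values concrete, here is a small pure-Python sketch of the coefficient: the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations. The function name `pearson_r` and the sample lists are illustrative and not part of any library.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (the numerator of the coefficient).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable.
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # y rises exactly in line with x
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # y falls exactly in line with x
r_none = pearson_r(x, [2, 1, 3, 1, 2])   # no linear trend in y

print(r_up, r_down, r_none)
```

The three calls correspond to the three reference cases listed above. Real data usually falls somewhere in between.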

It is always good to review the pairwise correlations of the attributes in our dataset before using it in an ML project, because some machine learning algorithms, such as linear regression and logistic regression, can perform poorly when attributes are highly correlated. In Python, we can easily calculate the correlation matrix of the dataset attributes with the corr() function of a Pandas DataFrame.

Example

from pandas import read_csv
from pandas import set_option

path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)
set_option('display.width', 100)
# Use the full option name; the short form 'precision' is rejected by recent Pandas versions.
set_option('display.precision', 2)
# Pairwise Pearson correlation between all attributes.
correlations = data.corr(method='pearson')
print(correlations)

        preg   plas   pres   skin   test   mass   pedi    age  class
preg    1.00   0.13   0.14  -0.08  -0.07   0.02  -0.03   0.54   0.22
plas    0.13   1.00   0.15   0.06   0.33   0.22   0.14   0.26   0.47
pres    0.14   0.15   1.00   0.21   0.09   0.28   0.04   0.24   0.07
skin   -0.08   0.06   0.21   1.00   0.44   0.39   0.18  -0.11   0.07
test   -0.07   0.33   0.09   0.44   1.00   0.20   0.19  -0.04   0.13
mass    0.02   0.22   0.28   0.39   0.20   1.00   0.14   0.04   0.29
pedi   -0.03   0.14   0.04   0.18   0.19   0.14   1.00   0.03   0.17
age     0.54   0.26   0.24  -0.11  -0.04   0.04   0.03   1.00   0.24
class   0.22   0.47   0.07   0.07   0.13   0.29   0.17   0.24   1.00

The matrix in the above output gives the correlation between every pair of attributes in the dataset.
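With many attributes, scanning such a matrix by eye becomes tedious. One possible approach, sketched below on a made-up table, is to walk the upper triangle of the matrix and list the pairs whose absolute correlation exceeds a threshold. The column names, the data, and the 0.9 threshold are all illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data: column b closely follows column a, column c does not.
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 3, 7, 8, 11],
    "c": [5, 3, 4, 1, 2],
})

threshold = 0.9  # illustrative cut-off, not a universal rule
corr = df.corr(method="pearson")
cols = list(corr.columns)

# Visit each pair once (upper triangle, skipping the diagonal).
high_pairs = []
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        r = corr.iloc[i, j]
        if abs(r) > threshold:
            high_pairs.append((cols[i], cols[j], round(float(r), 2)))

print(high_pairs)
```

Applied to the Pima matrix above, the same threshold would flag nothing, since the strongest off-diagonal value there is 0.54 (preg and age).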

Reviewing Skew of Attribute Distribution

Skewness describes a distribution that is assumed to be Gaussian but appears distorted or shifted in one direction, either to the left or to the right. Reviewing the skewness of attributes is an important task for the following reasons −

The presence of skewness in the data may require correction at the data preparation stage so that we can get better accuracy from our model.
Many ML algorithms assume that the data has a Gaussian distribution, i.e. a normal, bell-shaped curve.

In Python, we can easily calculate the skew of each attribute by using the skew() function of a Pandas DataFrame.

Example

from pandas import read_csv

path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)
# Skew of each attribute, rounded to two decimals for readability.
print(data.skew().round(2))

preg   0.90
plas   0.17
pres  -1.84
skin   0.11
test   2.27
mass  -0.43
pedi   1.92
age    1.13
class  0.64
dtype: float64

From the above output, both positive (right) and negative (left) skew can be observed. The closer a value is to zero, the less skewed the attribute is.
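The text does not name a specific correction for skew. One common option for right-skewed, non-negative data is a log transform, sketched below on made-up values. The choice of `np.log1p` and the sample values are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values: most are small, a few are very large.
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 8, 20, 50])

skew_before = values.skew()

# log1p computes log(1 + x), which compresses the long right tail.
transformed = np.log1p(values)
skew_after = transformed.skew()

print(round(skew_before, 2), round(skew_after, 2))
```

For this sample the skew moves closer to zero after the transform, though it does not vanish. Left-skewed attributes or those with negative values would need a different treatment.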