Machine Learning Tutorial

Data Preparation in Machine Learning

Data preparation is a critical step in the machine learning process, and can have a significant impact on the accuracy and effectiveness of the final model. It requires careful attention to detail and a thorough understanding of the data and the problem at hand.

Let's discuss how data should be prepared so that it fits the model well, for better accuracy and outcomes.

What is Data Preparation?

Data preparation is the process of dealing with raw data, i.e., cleaning, organizing, and transforming it to align with machine learning algorithms. Data preparation is a continuous process and has a huge impact on the performance of a machine learning model. Clean and structured data results in better outcomes.

Importance of Data Preparation

In machine learning, the model learns from the data that is fed to it. So, the algorithm can learn efficiently only if the data is well organized and accurate. The quality of the data you use for your model can have a significant impact on its performance.

A few aspects that define the importance of data preparation in machine learning are −

  1. Improves model accuracy − Machine learning algorithms rely entirely on data. When you provide clean and structured data to models, the outcomes are more accurate.

  2. Facilitates feature engineering − Data preparation often includes selecting or creating new features to train the model. Hence, good data preparation makes feature engineering easier.

  3. Improves data quality − Collected data most often contains inconsistencies, errors, and irrelevant information. When tasks like data cleaning and transformation are applied, the data becomes well formatted and neat, and can be used to gain insights and patterns.

  4. Enables reliable predictions − Prepared data is easier to analyze and yields more accurate outcomes.

Data Preparation Process Steps

The data preparation process involves a sequence of steps required to make data suitable for analysis and modeling. The goal of data preparation is to make sure that the data is accurate, complete, and relevant for the analysis.

The following are some of the key steps involved in data preparation −

  1. Data Collection

  2. Data Cleaning

  3. Data Transformation

  4. Data Reduction

  5. Data Splitting

Let’s understand each of the above steps in detail −

Data Collection

Data collection is the first step in the machine learning process, where data from different sources is gathered to make decisions, answer research questions, and support statistical planning. Different sources such as databases, text files, pictures, sound files, or web scraping may be used for data collection. Once the data is selected, it has to be preprocessed in order to gain insights. This step puts the data in an appropriate format that is useful for problem solving. Sometimes, data collection is followed by a data integration step.

Data integration involves combining data from multiple sources into a single dataset for analysis. This may involve matching or linking records across different datasets, or merging datasets based on common variables.

After selecting the raw data, the most important task is data preprocessing. In a broad sense, data preprocessing converts the selected data into a form we can work with or can feed to ML algorithms. We always need to preprocess our data so that it matches the expectations of the machine learning algorithm. Data preprocessing includes data cleaning, transformation, and reduction. Let's discuss each of these three in detail.

Data Cleaning

Data cleaning is the process of identifying and correcting errors, missing values, duplicate values and outliers, etc. in the data. This step is crucial in the process of machine learning as it ensures that the data is accurate, relevant and error free.

Common techniques used for data cleaning include imputation, outlier detection and removal, etc. The following is a sequence of steps for data cleaning −

1. Handling duplicate values

Duplicates in the dataset mean that the same data is repeated, which might occur due to data entry errors or issues while collecting data. To remove duplicates, they are first identified and then deleted, for example using the drop_duplicates function in Pandas.
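
As a minimal sketch (using a made-up dataframe for illustration), duplicates can be flagged with the duplicated method and removed with drop_duplicates −

import pandas as pd

# Hypothetical dataset with one repeated row
df = pd.DataFrame({
   'age': [25, 32, 25, 47],
   'city': ['Pune', 'Delhi', 'Pune', 'Mumbai']
})

# Identify duplicate rows (True marks a repeat of an earlier row)
print(df.duplicated())

# Remove the duplicates, keeping the first occurrence of each row
df_unique = df.drop_duplicates()
print(df_unique)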

2. Fixing syntax errors

In this step, structural errors like inconsistencies in data formats or naming conventions should be addressed. Standardizing formats and fixing errors ensures data consistency and accurate analysis.

3. Dealing with outliers

Outliers are values that are unusual and differ greatly from the rest of the data. The techniques used to detect outliers include statistical methods like the z-score or IQR method, and machine learning methods like clustering and SVMs.
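
As a small sketch (using a made-up series with one extreme value), the two statistical methods can be applied as follows −

import numpy as np
import pandas as pd

# Hypothetical numeric column with one extreme value
values = pd.Series([10, 12, 11, 13, 12, 95])

# Z-score method: flag points far from the mean (a common cutoff is 2 or 3)
z_scores = (values - values.mean()) / values.std()
print(values[np.abs(z_scores) > 2])

# IQR method: flag points beyond 1.5 * IQR outside the quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
print(values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)])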

4. Handling Missing Values

Missing values are entries that are not recorded for some attributes in the dataset. There are several ways to handle missing data, such as:

  1. Imputation − In this process, the missing values are substituted with a different value, which can be a central tendency measure like the mean, median, or mode for numeric values, and the most frequent category for categorical data. Other imputation methods include regression imputation and multiple imputation.

  2. Deletion − In this process, the entire instances with missing values are removed. This is not always a reliable method since it causes loss of data. Both approaches are sketched below.
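
As a minimal sketch (on a made-up dataframe), mean imputation with scikit-learn's SimpleImputer and row deletion with Pandas look as follows −

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical dataset with missing entries
df = pd.DataFrame({'mass': [33.6, np.nan, 23.3, 28.1],
                   'age': [50, 31, np.nan, 21]})

# Imputation: replace each missing value with the column mean
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)

# Deletion: drop every row that contains a missing value
print(df.dropna())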

5. Validating the data

Data validation is another stage that makes sure that the data aligns perfectly with the requirements so that the predicted outcome is accurate. Some common data validation procedures that check the correctness of data before storing it in databases are (see the sketch after this list):

  1. Data type check

  2. Code check

  3. Format check

  4. Range check
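
As a hypothetical sketch of the data type, range, and format checks in Pandas (the column names and rules are made up for illustration) −

import pandas as pd

# Hypothetical records to validate before storing them
df = pd.DataFrame({'age': [25, -3, 47], 'code': ['AB12', 'XY34', 'bad!']})

# Data type check: ages must be stored as integers
assert pd.api.types.is_integer_dtype(df['age'])

# Range check: ages must fall between 0 and 120
print(df[~df['age'].between(0, 120)])

# Format check: codes must be two capital letters followed by two digits
print(df[~df['code'].str.match(r'^[A-Z]{2}\d{2}$')])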

Data Transformation

Data transformation is the process of converting the data from its original format into a format that is suitable for analysis and modeling. This could include defining the structure, aligning the data, extracting data from the source, and then storing it in an appropriate form.

There are many techniques available to transform data into a suitable format. Some commonly used data transformation techniques are as follows −

  1. Scaling

  2. Normalization − L1 & L2 Normalizations

  3. Standardization

  4. Binarization

  5. Encoding

  6. Log Transformation

Let's discuss each of the above data transformation techniques in detail −

1. Scaling

In most cases, the data we collect consists of attributes with varying scales, but we cannot provide such data to an ML algorithm as is, hence it requires rescaling. Data scaling makes sure that all attributes are on the same scale, usually in the range of 0 to 1.

We can rescale the data with the help of the MinMaxScaler class of the scikit-learn Python library.

In this example, we will rescale the data of the Pima Indians Diabetes dataset which we used earlier. First, the CSV data will be loaded (as done in the previous chapters) and then, with the help of the MinMaxScaler class, it will be rescaled into the range of 0 and 1.

The first few lines of the following script are the same as those we wrote in previous chapters while loading the CSV data.

from pandas import read_csv
from numpy import set_printoptions
from sklearn import preprocessing

# Load the CSV data into a dataframe and convert it to a NumPy array
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

Now, we can use the MinMaxScaler class to rescale the data into the range of 0 and 1.

data_scaler = preprocessing.MinMaxScaler(feature_range=(0,1))
data_rescaled = data_scaler.fit_transform(array)

We can also summarize the data for output as per our choice. Here, we are setting the precision to 1 and showing the first 10 rows in the output.

set_printoptions(precision=1)
print ("\nScaled data:\n", data_rescaled[0:10])
Scaled data:
[
   [0.4 0.7 0.6 0.4 0.  0.5 0.2 0.5 1. ]
   [0.1 0.4 0.5 0.3 0.  0.4 0.1 0.2 0. ]
   [0.5 0.9 0.5 0.  0.  0.3 0.3 0.2 1. ]
   [0.1 0.4 0.5 0.2 0.1 0.4 0.  0.  0. ]
   [0.  0.7 0.3 0.4 0.2 0.6 0.9 0.2 1. ]
   [0.3 0.6 0.6 0.  0.  0.4 0.1 0.2 0. ]
   [0.2 0.4 0.4 0.3 0.1 0.5 0.1 0.1 1. ]
   [0.6 0.6 0.  0.  0.  0.5 0.  0.1 0. ]
   [0.1 1.  0.6 0.5 0.6 0.5 0.  0.5 1. ]
   [0.5 0.6 0.8 0.  0.  0.  0.1 0.6 1. ]
]

From the above output, we can see that all the data has been rescaled into the range of 0 and 1.

2. Normalization

Normalization is used to rescale the data so that the values fall between 0 and 1.

This is used to rescale each row of data to have a length of 1. It is mainly useful for sparse datasets where we have lots of zeros. We can rescale the data with the help of the Normalizer class of the scikit-learn Python library.

In machine learning, there are two types of normalization preprocessing techniques as follows −

L1 Normalization

It may be defined as the normalization technique that modifies the dataset values in such a way that in each row the sum of the absolute values is always up to 1. It is also called Least Absolute Deviations.

In this example, we use the L1 normalization technique to normalize the data of the Pima Indians Diabetes dataset which we used earlier. First, the CSV data will be loaded and then, with the help of the Normalizer class, it will be normalized.

The first few lines of the following script are the same as those we wrote in previous chapters while loading the CSV data.

from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import Normalizer
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

Now, we can use the Normalizer class with the L1 norm to normalize the data.

Data_normalizer = Normalizer(norm='l1').fit(array)
Data_normalized = Data_normalizer.transform(array)

We can also summarize the data for output as per our choice. Here, we are setting the precision to 2 and showing the first 3 rows in the output.

set_printoptions(precision=2)
print ("\nNormalized data:\n", Data_normalized [0:3])
Normalized data:
[
   [0.02 0.43 0.21 0.1  0. 0.1  0. 0.14 0. ]
   [0.   0.36 0.28 0.12 0. 0.11 0. 0.13 0. ]
   [0.03 0.59 0.21 0.   0. 0.07 0. 0.1  0. ]
]

L2 Normalization

It may be defined as the normalization technique that modifies the dataset values in such a way that in each row the sum of the squares is always up to 1. It is also called least squares.

In this example, we use the L2 normalization technique to normalize the data of the Pima Indians Diabetes dataset which we used earlier. First, the CSV data will be loaded (as done in previous chapters) and then, with the help of the Normalizer class, it will be normalized.

The first few lines of the following script are the same as those we wrote in previous chapters while loading the CSV data.

from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import Normalizer
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

Now, we can use the Normalizer class with the L2 norm to normalize the data.

Data_normalizer = Normalizer(norm='l2').fit(array)
Data_normalized = Data_normalizer.transform(array)

We can also summarize the data for output as per our choice. Here, we are setting the precision to 2 and showing the first 3 rows in the output.

set_printoptions(precision=2)
print ("\nNormalized data:\n", Data_normalized [0:3])
Normalized data:
[
   [0.03 0.83 0.4  0.2  0. 0.19 0. 0.28 0.01]
   [0.01 0.72 0.56 0.24 0. 0.22 0. 0.26 0.  ]
   [0.04 0.92 0.32 0.   0. 0.12 0. 0.16 0.01]
]

3. Standardization

Standardization is used to transform data attributes to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1. This technique is useful in ML algorithms like linear regression and logistic regression that assume a Gaussian distribution in the input dataset and produce better results with rescaled data.

We can standardize the data (mean = 0 and SD = 1) with the help of the StandardScaler class of the scikit-learn Python library.

In this example, we will rescale the data of the Pima Indians Diabetes dataset which we used earlier. First, the CSV data will be loaded and then, with the help of the StandardScaler class, it will be converted into a Gaussian distribution with mean = 0 and SD = 1.

The first few lines of the following script are the same as those we wrote in previous chapters while loading the CSV data.

from sklearn.preprocessing import StandardScaler
from pandas import read_csv
from numpy import set_printoptions
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

Now, we can use the StandardScaler class to rescale the data.

data_scaler = StandardScaler().fit(array)
data_rescaled = data_scaler.transform(array)

We can also summarize the data for output as per our choice. Here, we are setting the precision to 2 and showing the first 5 rows in the output.

set_printoptions(precision=2)
print ("\nRescaled data:\n", data_rescaled [0:5])
Rescaled data:
[
   [ 0.64  0.85  0.15  0.91 -0.69  0.2   0.47  1.43  1.37]
   [-0.84 -1.12 -0.16  0.53 -0.69 -0.68 -0.37 -0.19 -0.73]
   [ 1.23  1.94 -0.26 -1.29 -0.69 -1.1   0.6  -0.11  1.37]
   [-0.84 -1.   -0.16  0.15  0.12 -0.49 -0.92 -1.04 -0.73]
   [-1.14  0.5  -1.5   0.91  0.77  1.41  5.48 -0.02  1.37]
]

4. Binarization

As the name suggests, this is the technique with the help of which we can make our data binary. We use a binary threshold: values above the threshold are converted to 1 and values below it are converted to 0. For example, if we choose a threshold value of 0.5, then dataset values above it become 1 and those below it become 0. That is why we call it binarizing the data or thresholding the data. This technique is useful when we have probabilities in our dataset and want to convert them into crisp values.

We can binarize the data with the help of the Binarizer class of the scikit-learn Python library.

In this example, we will rescale the data of the Pima Indians Diabetes dataset which we used earlier. First, the CSV data will be loaded and then, with the help of the Binarizer class, it will be converted into binary values, i.e., 0 and 1, depending upon the threshold value. We are taking 0.5 as the threshold value.

The first few lines of the following script are the same as those we wrote in previous chapters while loading the CSV data.

from pandas import read_csv
from sklearn.preprocessing import Binarizer
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

Now, we can use the Binarizer class to convert the data into binary values.

binarizer = Binarizer(threshold=0.5).fit(array)
Data_binarized = binarizer.transform(array)

Here, we are showing the first 5 rows in the output.

print ("\nBinary data:\n", Data_binarized [0:5])
Binary data:
[
   [1. 1. 1. 1. 0. 1. 1. 1. 1.]
   [1. 1. 1. 1. 0. 1. 0. 1. 0.]
   [1. 1. 1. 0. 0. 1. 1. 1. 1.]
   [1. 1. 1. 1. 1. 1. 0. 1. 0.]
   [0. 1. 1. 1. 1. 1. 1. 1. 1.]
]

5. Encoding

This technique is used to convert categorical variables into numerical representations. Some common encoding techniques include one-hot encoding, label encoding and target encoding.
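
As a minimal sketch of one-hot encoding (with a made-up column), the get_dummies function in Pandas creates one binary column per category −

import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# One-hot encoding: one binary column per category
print(pd.get_dummies(df, columns=['color']))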

Most sklearn functions expect data with numeric labels rather than word labels. Hence, we need to convert such labels into numeric labels. This process is called label encoding. We can perform label encoding of data with the help of the LabelEncoder class of the scikit-learn Python library.

In the following example, the Python script will perform label encoding.

First, import the required Python libraries as follows −

import numpy as np
from sklearn import preprocessing

Now, we need to provide the input labels as follows −

input_labels = ['red','black','red','green','black','yellow','white']

The next lines of code will create the label encoder and train it.

encoder = preprocessing.LabelEncoder()
encoder.fit(input_labels)

The next lines of the script will check the encoder by transforming a randomly ordered list of labels −

test_labels = ['green','red','black']
encoded_values = encoder.transform(test_labels)
print("\nLabels =", test_labels)
print("Encoded values =", list(encoded_values))

# Decode an arbitrary set of numeric values back into labels
encoded_values = [3,0,4,1]
decoded_list = encoder.inverse_transform(encoded_values)

We can print the encoded values and the decoded labels with the help of the following Python script −

print("\nEncoded values =", encoded_values)
print("\nDecoded labels =", list(decoded_list))
Labels = ['green', 'red', 'black']
Encoded values = [1, 2, 0]
Encoded values = [3, 0, 4, 1]
Decoded labels = ['white', 'black', 'yellow', 'green']

6. Log Transformation

This technique is usually used to handle skewed data. It involves applying the natural logarithm function to all values in the dataset to modify the scale of numeric values.
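
As a minimal sketch (with made-up, right-skewed values), the transformation can be applied with NumPy; log1p is a safe variant when the data contains zeros −

import numpy as np

# Hypothetical right-skewed values (e.g. incomes)
values = np.array([100, 200, 500, 10000, 100000])

# Natural log compresses the large values into a narrower scale
print(np.log(values))

# log1p computes log(1 + x), which is safe when the data contains zeros
print(np.log1p(values))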

Data Reduction

Data Reduction is a technique to reduce the size of the dataset by selecting a subset of features or observations that are most relevant for the analysis. This can help to reduce noise and improve the accuracy of the model.

This is useful when the dataset is very large or when it contains a large amount of irrelevant data.

One of the most common techniques used is Dimensionality Reduction, which reduces the size of the dataset without losing important information. Another method is Discretization, where continuous values like time and temperature are converted into discrete categories, which simplifies the data.
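
As a small, illustrative sketch of dimensionality reduction, PCA from scikit-learn can project the 30 features of the breast cancer dataset (used again later in this chapter) down to 2 components −

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load a dataset with 30 numeric features
X = load_breast_cancer().data

# Standardize the features, then project them onto 2 principal components
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=2).fit_transform(X_scaled)
print(X.shape, '->', X_reduced.shape)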

Data Splitting

Data splitting is the last step in the preparation of data for machine learning, where the data is split into different sets −

  1. Training − the subset used by the machine learning model for learning patterns.

  2. Validation − the subset used to evaluate the performance of the machine learning model during training.

  3. Testing − the subset used to evaluate the performance and efficiency of the trained model. A typical split is sketched after this list.
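
One common way to obtain all three sets is to call the train_test_split function twice, as in the sketch below (the 60/20/20 proportions are an assumption for illustration) −

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# First carve out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Then take 25% of the remaining 80% as the validation set (20% of the full data)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))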

Python Example

Let’s check an example of data preparation using the breast cancer dataset −

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# load the dataset
data = load_breast_cancer()

# separate the features and target
X = data.data
y = data.target

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# normalize the data using StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In this example, we first load the breast cancer dataset using the load_breast_cancer function from scikit-learn. Then we separate the features and target, and split the data into training and testing sets using the train_test_split function.

Finally, we standardize the data using StandardScaler from scikit-learn, which subtracts the mean and scales the data to unit variance. This helps to bring all the features to a similar scale, which is particularly important for models like SVMs and neural networks.

Data Preparation and Feature Engineering

Feature engineering involves creating new features from the existing data that may be more informative or useful for the analysis. It can involve combining or transforming existing features, or creating new features based on domain knowledge or insights. Both data preparation and feature engineering go hand-in-hand in the overall data preprocessing pipeline.
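
As a minimal, hypothetical sketch (the columns and the derived feature are made up for illustration), a new feature can be created from existing ones using domain knowledge −

import pandas as pd

# Hypothetical columns describing patients
df = pd.DataFrame({'weight_kg': [70, 82], 'height_m': [1.75, 1.80]})

# Domain knowledge: body mass index is often more informative
# than weight and height taken separately
df['bmi'] = df['weight_kg'] / df['height_m'] ** 2
print(df)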