Machine Learning: A Concise Tutorial

Machine Learning - Data Scaling

Data scaling is a pre-processing technique used in Machine Learning to normalize or standardize the range or distribution of the features in a dataset. It matters because features are often measured on very different scales, and algorithms that rely on distances or gradient-based optimization (for example, k-nearest neighbors, support vector machines, and models trained with gradient descent) can be dominated by the features with the largest values. Scaling brings every feature to a comparable range, which can improve the performance of the machine learning model.
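To make the effect of mismatched scales concrete, here is a minimal sketch. The sample values (ages and incomes) and the helper nearest_to_first() are hypothetical, invented only for this illustration. It finds the nearest neighbor of the first row by Euclidean distance, once on the raw data and once after standardizing with scikit-learn's StandardScaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical samples: [age in years, annual income in dollars]
X = np.array([
    [25.0, 50_000.0],   # row 0: the query point
    [60.0, 50_500.0],   # row 1: very different age, almost the same income
    [26.0, 52_000.0],   # row 2: almost the same age, slightly different income
    [40.0, 80_000.0],
    [35.0, 30_000.0],
])

def nearest_to_first(points):
    """Return the index of the row closest to row 0 (Euclidean distance)."""
    distances = np.linalg.norm(points[1:] - points[0], axis=1)
    return int(np.argmin(distances)) + 1  # +1 because row 0 was skipped

raw_nearest = nearest_to_first(X)
scaled_nearest = nearest_to_first(StandardScaler().fit_transform(X))

print("Nearest neighbor on raw data:   ", raw_nearest)
print("Nearest neighbor on scaled data:", scaled_nearest)
```

On the raw data the income column dominates the distance, so the 60-year-old in row 1 looks closest to the 25-year-old query point. After scaling, both features contribute comparably and row 2 becomes the nearest neighbor.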

There are two common techniques used for data scaling −

Normalization − Normalization (min-max scaling) rescales the values of a feature to the range 0 to 1. This is achieved by subtracting the feature's minimum value from each value and dividing the result by the feature's range (the difference between its maximum and minimum values).
Standardization − Standardization (z-score scaling) rescales the values of a feature to have a mean of 0 and a standard deviation of 1. This is achieved by subtracting the feature's mean from each value and dividing the result by the feature's standard deviation.
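The two formulas can be sketched directly with NumPy. The feature values below are hypothetical and chosen only so the arithmetic is easy to follow. The sketch assumes the population standard deviation (NumPy's default, ddof=0).

```python
import numpy as np

# Hypothetical feature values, chosen only for illustration
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Normalization (min-max): (x - min) / (max - min)
normalized = (values - values.min()) / (values.max() - values.min())

# Standardization (z-score): (x - mean) / std
# np.std defaults to the population standard deviation (ddof=0)
standardized = (values - values.mean()) / values.std()

print("Normalized:  ", normalized)    # values 0, 0.25, 0.5, 0.75, 1
print("Standardized:", standardized)  # symmetric around 0
print("Mean and std after standardization:",
      standardized.mean(), standardized.std())
```

Here the minimum is 10 and the range is 40, so 20 maps to (20 - 10) / 40 = 0.25. The mean is 30 and the population standard deviation is about 14.14, so 50 maps to roughly 1.41.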

Example

In Python, data scaling can be implemented with the scikit-learn library (imported as sklearn), whose sklearn.preprocessing sub-module provides classes for scaling data. Below is an example that standardizes the iris dataset using the StandardScaler class −

from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
import pandas as pd

# Load the iris dataset
data = load_iris()
X = data.data
y = data.target

# Create a DataFrame from the dataset
df = pd.DataFrame(X, columns=data.feature_names)
print("Before scaling:")
print(df.head())

# Scale the data using StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create a new DataFrame from the scaled data
df_scaled = pd.DataFrame(X_scaled, columns=data.feature_names)
print("After scaling:")
print(df_scaled.head())

In this example, we load the iris dataset and create a DataFrame from it. We then use the StandardScaler class to scale the data and create a new DataFrame from the scaled data. Finally, we print the first five rows of both DataFrames to compare the data before and after scaling. Note that the fit_transform() method of the scaler object does two things in one call: it computes each feature's mean and standard deviation (the fit step) and then applies them to the data (the transform step).

Output

Executing this code produces the following output −

Before scaling:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
After scaling:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0          -0.900681          1.019004          -1.340227         -1.315444
1          -1.143017         -0.131979          -1.340227         -1.315444
2          -1.385353          0.328414          -1.397064         -1.315444
3          -1.506521          0.098217          -1.283389         -1.315444
4          -1.021849          1.249201          -1.340227         -1.315444
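
The example above covers standardization only. As a complement, the following sketch applies the other technique, normalization, using scikit-learn's MinMaxScaler. It also illustrates a common practice that the original example does not cover: fitting the scaler on a training split and reusing the learned minimum and maximum on held-out data. The 80/20 split and the random_state value are arbitrary assumptions made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows; test_size and random_state are arbitrary choices
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from training rows
X_test_scaled = scaler.transform(X_test)        # reuse the same min/max

print("Train min per feature:", X_train_scaled.min(axis=0))
print("Train max per feature:", X_train_scaled.max(axis=0))
print("Test min per feature: ", X_test_scaled.min(axis=0))
print("Test max per feature: ", X_test_scaled.max(axis=0))
```

Every training feature spans exactly 0 to 1. The test rows are scaled with the training minimum and maximum, so their values may fall slightly outside that range, which is expected.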