Machine Learning - Quick Guide
Machine Learning - Automatic Workflows
Introduction
In order to execute successfully and produce results, a machine learning model must automate some standard workflows. The process of automating these standard workflows can be done with the help of Scikit-learn Pipelines. From a data scientist's perspective, a pipeline is a generalized but very important concept. It basically allows data to flow from its raw format into useful information. The working of pipelines can be understood with the help of the following diagram −
The blocks of ML pipelines are as follows −
Data ingestion − As the name suggests, it is the process of importing the data for use in an ML project. The data can be extracted in real time or in batches, from a single system or from multiple systems. It is one of the most challenging steps because the quality of the data can affect the whole ML model.
Data Preparation − After importing the data, we need to prepare the data to be used for our ML model. Data preprocessing is one of the most important techniques of data preparation.
ML Model Training − The next step is to train our ML model. We have various ML algorithms, such as supervised, unsupervised, and reinforcement learning, to extract features from the data and make predictions.
Model Evaluation − Next, we need to evaluate the ML model. In the case of an AutoML pipeline, the ML model can be evaluated with the help of various statistical methods and business rules.
ML Model retraining − In the case of an AutoML pipeline, the first model is not necessarily the best one. The first model is treated as a baseline model, and we can retrain it repeatedly to increase the model's accuracy.
Deployment − Finally, we need to deploy the model. This step involves applying and migrating the model to business operations for their use. A minimal code sketch of these stages follows below.
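As a concrete illustration, the stages above can be sketched with scikit-learn. This is only a minimal sketch under stated assumptions: the file name data.csv and its column layout are placeholders, and deployment is omitted −

from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Data ingestion: import a batch of raw data (placeholder file name)
data = read_csv("data.csv")
array = data.values
X, Y = array[:, :-1], array[:, -1]

# Hold out data so that evaluation reflects unseen records
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=7)

# Data preparation + model training, chained into one pipeline
model = Pipeline([('standardize', StandardScaler()), ('classifier', LogisticRegression())])
model.fit(X_train, Y_train)

# Model evaluation: mean accuracy on the held-out data
print(model.score(X_test, Y_test))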
Challenges Accompanying ML Pipelines
In order to create ML pipelines, data scientists face many challenges. These challenges fall into the following three categories −
Quality of Data
The success of any ML model depends heavily on the quality of the data. If the data we are providing to the ML model is not accurate, reliable, and robust, then we will end up with wrong or misleading output.
Data Reliability
Another challenge associated with ML pipelines is the reliability of the data we are providing to the ML model. As we know, data scientists can acquire data from various sources, but to get the best results, it must be ensured that the data sources are reliable and trusted.
Data Accessibility
To get the best results out of ML pipelines, the data itself must be accessible, which requires the consolidation, cleansing, and curation of data. As data is made accessible, its metadata is updated with new tags.
Modelling ML Pipeline and Data Preparation
Data leakage, where information from the testing dataset leaks into the training process, is an important issue for data scientists to deal with while preparing data for an ML model. Generally, at the time of data preparation, data scientists apply techniques like standardization or normalization to the entire dataset before learning. But these techniques cannot protect us from data leakage, because the training dataset would then have been influenced by the scale of the data in the testing dataset.
By using ML pipelines, we can prevent this data leakage, because the pipeline ensures that data preparation steps like standardization are constrained to each fold of our cross-validation procedure.
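As a quick contrast, here is a hedged sketch of the leaky approach versus the pipeline-based approach, using a small random dataset purely for illustration −

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X = np.random.rand(100, 5)
Y = np.random.randint(0, 2, 100)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

# Leaky: the scaler is fitted on all rows at once, so statistics from
# the test folds influence how the training folds are transformed
X_scaled = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(), X_scaled, Y, cv=kfold)

# Safe: the pipeline refits the scaler inside every training fold only
model = Pipeline([('standardize', StandardScaler()), ('classifier', LogisticRegression())])
safe_scores = cross_val_score(model, X, Y, cv=kfold)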
Example
The following is an example in Python that demonstrates a data preparation and model evaluation workflow. For this purpose, we are using the Pima Indian Diabetes dataset. First, we will create a pipeline that standardizes the data. Then a Linear Discriminant Analysis model will be created, and at last the pipeline will be evaluated using 10-fold cross-validation.
First, import the required packages as follows −
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
Now, we need to load the Pima diabetes dataset, as was done in previous examples −
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
Next, we will create a pipeline with the help of the following code −
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('lda', LinearDiscriminantAnalysis()))
model = Pipeline(estimators)
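As a quick aside, each step in the pipeline is addressed by the name given in its tuple, so the steps can be inspected like this −

# The step names double as keys into the pipeline
print(model.named_steps['standardize'])
print(model.named_steps['lda'])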
Finally, we will evaluate this pipeline and output its accuracy as follows −
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
0.7790148448043184
The output above summarizes the mean cross-validation accuracy of this setup on the dataset.
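Once evaluated, the same pipeline object can be fitted and used for prediction in a single call. A small sketch; the sample record below is made up for illustration −

# Fit the whole pipeline on the full dataset, then score a new record
model.fit(X, Y)
sample = [[6, 148, 72, 35, 0, 33.6, 0.627, 50]]  # hypothetical patient
print(model.predict(sample))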
Modelling ML Pipeline and Feature Extraction
Data leakage can also happen at the feature extraction step of an ML model. That is why feature extraction procedures should also be restricted to the training data, to stop data leakage. As in the case of data preparation, by using ML pipelines we can prevent this data leakage as well. FeatureUnion, a tool provided by scikit-learn's pipeline module, can be used for this purpose.
Example
The following is an example in Python that demonstrates a feature extraction and model evaluation workflow. For this purpose, we are again using the Pima Indian Diabetes dataset.
First, 3 features will be extracted with PCA (Principal Component Analysis). Then, 6 features will be selected with a statistical test (SelectKBest). After feature extraction, the results of these feature selection and extraction procedures will be combined using the FeatureUnion tool. At last, a Logistic Regression model will be created, and the pipeline will be evaluated using 10-fold cross-validation.
First, import the required packages as follows −
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
Now, we need to load the Pima diabetes dataset, as was done in previous examples −
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
Next, the feature union will be created as follows −
features = []
features.append(('pca', PCA(n_components=3)))
features.append(('select_best', SelectKBest(k=6)))
feature_union = FeatureUnion(features)
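As a quick check of what the union produces, the combined output should have the 3 PCA components plus the 6 selected columns, i.e. 9 features in total −

# The union concatenates the outputs of its transformers column-wise
print(feature_union.fit_transform(X, Y).shape)   # expected: (768, 9)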
Next, the pipeline will be created with the help of the following script lines −
estimators = []
estimators.append(('feature_union', feature_union))
estimators.append(('logistic', LogisticRegression()))
model = Pipeline(estimators)
Finally, we will evaluate this pipeline and output its accuracy as follows −
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
0.7789811066126855
The output above summarizes the mean cross-validation accuracy of this setup on the dataset.