Machine Learning: A Concise Tutorial

Machine Learning - Life Cycle

The machine learning life cycle is an iterative process for building an end-to-end machine learning project or ML solution. Building a machine learning model is a continuous process, especially as the amount of data keeps growing. Machine learning focuses on improving a system’s performance by training a model on real-world data. To make a machine learning project successful, we have to follow some well-defined steps, and the machine learning life cycle provides these steps or phases.

What is the Machine Learning Life Cycle?

The machine learning life cycle is an iterative process that moves from a business problem to a machine learning solution. It serves as a guide for developing a machine learning project to solve a problem, and it provides instructions and best practices for each phase of developing an ML solution.

The machine learning life cycle involves several phases, from problem identification to model deployment and monitoring. While developing an ML project, each step in the life cycle is revisited many times. The phases involved in the end-to-end machine learning life cycle are as follows −

  1. Problem Definition

  2. Data Preparation

  3. Model Development

  4. Model Deployment

  5. Monitoring and Maintenance

[Figure: machine learning life cycle]

Let’s discuss the phases of the machine learning life cycle listed above in detail −

Problem Definition

The first step in the machine learning life cycle is to identify the problem you want to solve. It is a crucial step that gets you started on building a machine learning solution for that problem. Identifying the problem establishes an understanding of what the output might be, the scope of the task, and its objective.

As this step lays the foundation for building a machine learning model, the problem definition has to be clear and concise.

This stage involves understanding the business problem, defining the problem statement, and identifying the success criteria for the machine learning model.

Data Preparation

Data preparation is the process of preparing data for analysis through data exploration, feature engineering, and feature selection. Data exploration involves visualizing and understanding the data, while feature engineering involves creating new features from the existing data. Feature selection involves selecting the most relevant features that will be used to train the machine learning model.

The data preparation process includes collecting data, preprocessing it, and performing feature engineering and feature selection. This stage generally also includes exploratory data analysis.

Let’s discuss each step involved in the data preparation phase of the machine learning life cycle −

1. Data Collection

After the problem statement is analyzed, the next step is to collect data. This involves gathering data from various sources, which serves as the raw material for the machine learning model. A few factors to consider while collecting data are −

  1. Relevance and usefulness − The data collected has to be relevant to the problem statement and useful enough to train the machine learning model efficiently.

  2. Quality and quantity − The quality and quantity of the data collected directly affect the performance of the machine learning model.

  3. Variety − Make sure that the data collected is diverse, so that the model is trained on multiple scenarios and can learn to recognize the patterns.

Data can be collected from various sources, such as surveys, existing databases, and online platforms like Kaggle. Data collected exclusively for the problem statement is called primary data, while existing data that is reused is called secondary data.

2. Data Preprocessing

The collected data is often unstructured and messy, which negatively affects the outcomes. Preprocessing the data is therefore important for improving the accuracy and performance of the machine learning model. Issues that have to be addressed include missing values, duplicate data, invalid data, and noise.

This data preprocessing step, also called data wrangling, is intended to make the data more consumable and useful for analytics.
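The cleaning issues listed above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch: the records, the column meanings (age and monthly income), the validity rule, and the choice of median imputation are all assumptions made for the example, and noise handling is left out.

```python
from statistics import median

# Hypothetical raw records: (age, monthly_income). None marks a missing value.
RAW_ROWS = [
    (25, 3200.0),
    (25, 3200.0),   # exact duplicate of the previous row
    (41, None),     # missing income
    (-3, 2800.0),   # invalid: age cannot be negative
    (37, 4100.0),
    (52, 5000.0),
]


def preprocess(rows):
    """Drop duplicates and invalid rows, then fill missing incomes with the median."""
    # 1. Remove exact duplicates while preserving the original order.
    unique_rows = list(dict.fromkeys(rows))

    # 2. Discard rows that break a basic validity rule (assumed: 0 <= age <= 120).
    valid_rows = [(age, income) for age, income in unique_rows if 0 <= age <= 120]

    # 3. Impute missing incomes with the median of the known incomes.
    known_incomes = [income for _, income in valid_rows if income is not None]
    fill_value = median(known_incomes)
    return [(age, fill_value if income is None else income)
            for age, income in valid_rows]


clean_rows = preprocess(RAW_ROWS)
print(clean_rows)
```

Median imputation is just one possible choice here; depending on the data, dropping incomplete rows or using a domain-specific default may be more appropriate.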

3. Analyzing Data

Once the data has been sorted out, it is time to understand what has been collected. The data is visualized and statistically summarized to gain insights.

Tools such as Power BI and Tableau are used to visualize the data, which helps in understanding its patterns and trends. This analysis informs the choices made in feature engineering and model selection.
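Alongside visual tools, a statistical summary can be produced directly in code. The sketch below is a minimal example that uses only Python's standard library; the two columns (hours studied and exam score) and the helper names `summarize` and `pearson` are hypothetical and chosen only for illustration.

```python
from statistics import mean, median, stdev

# Hypothetical columns from a small dataset.
HOURS_STUDIED = [1, 2, 3, 4, 5]
EXAM_SCORES = [52, 55, 61, 64, 68]


def summarize(values):
    """Return basic descriptive statistics for one numeric column."""
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),   # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long columns."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return covariance / (spread_x * spread_y)


print(summarize(EXAM_SCORES))
print(pearson(HOURS_STUDIED, EXAM_SCORES))
```

A correlation close to 1 between a candidate feature and the target, as in this made-up data, is the kind of insight that later guides feature engineering and model selection.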

4. Feature Engineering and Selection

A 'feature' is an individual measurable property of the data that the machine learning model observes while it is being trained. Feature engineering is the process of creating new features or enhancing the existing ones so that the patterns and trends in the data are captured more accurately.

Feature selection is the process of picking the features that are consistent with, and most relevant to, the problem statement. Together, feature engineering and selection are used to reduce the size of the dataset, which is important for tackling the issue of growing data.
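As a rough illustration of both ideas, the sketch below first engineers a new feature (body mass index, derived from height and weight) and then applies a simple variance threshold as the selection rule. The dataset, the column names, and the choice of a variance threshold are assumptions made for this example; many other selection techniques exist.

```python
from statistics import pvariance

# Hypothetical records; "sensor_version" is constant and carries no information.
ROWS = [
    {"height_m": 1.60, "weight_kg": 55.0, "sensor_version": 1.0},
    {"height_m": 1.75, "weight_kg": 80.0, "sensor_version": 1.0},
    {"height_m": 1.82, "weight_kg": 72.0, "sensor_version": 1.0},
    {"height_m": 1.68, "weight_kg": 90.0, "sensor_version": 1.0},
]


def add_bmi(row):
    """Feature engineering: derive body mass index from two existing features."""
    new_row = dict(row)
    new_row["bmi"] = row["weight_kg"] / row["height_m"] ** 2
    return new_row


def select_by_variance(rows, threshold=0.0):
    """Feature selection: keep only features whose variance exceeds the threshold."""
    feature_names = list(rows[0].keys())
    return [name for name in feature_names
            if pvariance([row[name] for row in rows]) > threshold]


engineered_rows = [add_bmi(row) for row in ROWS]
selected_features = select_by_variance(engineered_rows)
print(selected_features)
```

Here the constant column is dropped because it cannot help distinguish one record from another, while the engineered `bmi` feature is kept.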

Model Development

In the model development phase, the machine learning model is built using the prepared data. The model-building process involves selecting an appropriate machine learning algorithm, training it, tuning its hyperparameters, and evaluating the model’s performance using cross-validation techniques.

This phase mainly consists of three steps: model selection, model training, and model evaluation. Let’s discuss these three steps in detail −

1. Model Selection

Model selection is a crucial step in the machine learning workflow. The choice of model depends on factors such as the characteristics of the data, the complexity of the problem, the desired outcomes, and how well the model aligns with the defined problem. This step affects the outcomes and performance metrics of the model.
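One common way to compare candidate models is k-fold cross-validation, mentioned earlier in this phase. The following sketch compares two deliberately simple candidates, a mean baseline and a least-squares straight line, on made-up data. The data, the candidate models, and the helper names are assumptions for illustration, not a prescribed procedure.

```python
from statistics import mean

# Hypothetical data that roughly follows y = 2x + 1 with a little noise.
XS = [0, 1, 2, 3, 4, 5, 6, 7, 8]
YS = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.9, 15.2, 16.8]


def fit_mean(xs, ys):
    """Baseline candidate: always predict the average training target."""
    average = mean(ys)
    return lambda x: average


def fit_line(xs, ys):
    """Second candidate: least-squares straight line."""
    mean_x, mean_y = mean(xs), mean(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def cross_val_mse(fit, xs, ys, k=3):
    """Average mean squared error over k folds (every k-th sample is held out)."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        test_idx = list(range(fold, n, k))
        train_idx = [i for i in range(n) if i not in test_idx]
        predict = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        fold_errors.append(mean([(predict(xs[i]) - ys[i]) ** 2 for i in test_idx]))
    return mean(fold_errors)


CANDIDATES = {"mean_baseline": fit_mean, "straight_line": fit_line}
scores = {name: cross_val_mse(fit, XS, YS) for name, fit in CANDIDATES.items()}
best_model = min(scores, key=scores.get)
print(scores, best_model)
```

Because each fold is scored on samples the model did not see during fitting, the comparison reflects how the candidates generalize rather than how well they memorize the training data.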

2. Model Training

In this process, the algorithm is fed a preprocessed dataset so that it can identify and learn the patterns and relationships in the specified features.

Training the model repeatedly while adjusting its parameters improves its predictions and accuracy. This step makes the model reliable in real-world scenarios.
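To make "adjusting parameters" concrete, here is a minimal sketch of gradient descent fitting a straight line y = w * x + b. The data, the learning rate, and the number of epochs are assumptions chosen so that this small example converges; real projects would normally rely on a library implementation.

```python
# Hypothetical training data generated from y = 2x + 1.
TRAIN_X = [0, 1, 2, 3, 4]
TRAIN_Y = [1, 3, 5, 7, 9]


def train_linear(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Move each parameter a small step against its gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


weight, bias = train_linear(TRAIN_X, TRAIN_Y)
print(weight, bias)
```

Each pass nudges the parameters in the direction that lowers the error, which is the repeated adjustment described above; on this data the result should approach a weight of 2 and a bias of 1.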

3. Model Evaluation

In model evaluation, the performance of the machine learning model is assessed using a set of evaluation metrics, such as accuracy, precision, recall, and F1 score. If the model has not achieved the desired performance, its hyperparameters are tuned to improve the predictive accuracy. This continuous iteration is essential to make the model more accurate and reliable.
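The four metrics named above can be computed from the counts of true and false positives and negatives. The sketch below assumes a binary classification task with labels 0 and 1; the label lists are made up for illustration.

```python
# Hypothetical ground-truth labels and model predictions (1 = positive class).
Y_TRUE = [1, 0, 1, 1, 0, 0, 1, 0]
Y_PRED = [1, 0, 0, 1, 0, 1, 1, 0]


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 score for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)   # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


metrics = classification_metrics(Y_TRUE, Y_PRED)
print(metrics)
```

Precision asks how many predicted positives were correct, recall asks how many actual positives were found, and the F1 score is their harmonic mean.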

If the model’s performance is still not satisfactory, it may be necessary to return to the model selection stage and then continue with model training and evaluation to improve the model’s performance.

Model Deployment

In the model deployment phase, we deploy the machine learning model into production. This process involves integrating the tested model with existing systems to make it available to users, to management, or for other purposes. It also involves testing the model in a real-world scenario.

Two important factors to check before deploying are whether the model is portable, i.e., it can be transferred from one machine to another, and whether it is scalable, i.e., it need not be redesigned to maintain performance.
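One simple way to think about portability is to store the trained parameters in a machine-independent format and rebuild the model from them elsewhere. The sketch below assumes a tiny linear model whose parameters fit in a dictionary; the field names and the `format_version` key are hypothetical, and real models usually need a dedicated serialization format.

```python
import json

# Hypothetical trained parameters of a linear model y = weight * x + bias.
TRAINED_PARAMS = {"format_version": 1, "weight": 2.0, "bias": 1.0}


def export_model(params):
    """Serialize model parameters to a JSON string that any machine can read."""
    return json.dumps(params, sort_keys=True)


def import_model(text):
    """Rebuild the parameter dictionary from its JSON representation."""
    params = json.loads(text)
    if params.get("format_version") != 1:
        raise ValueError("unsupported model format")
    return params


def predict(params, x):
    """Apply the linear model to a single input value."""
    return params["weight"] * x + params["bias"]


serialized = export_model(TRAINED_PARAMS)
restored_params = import_model(serialized)
print(predict(restored_params, 3.0))
```

The serialized string could be written to a file or sent over a network; the point is that the restored model produces the same predictions as the original.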

Monitoring and Maintenance

Monitoring in machine learning involves techniques to measure the model’s performance metrics and to detect issues in the model. After an issue is detected, the model has to be retrained with new data or its architecture has to be modified.
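A minimal monitoring sketch might track accuracy over a sliding window of recent predictions and flag the model for retraining when it falls below a threshold. The class name `AccuracyMonitor`, the window size, and the threshold below are assumptions for illustration only; production systems typically track several metrics and also watch for changes in the input data.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent predictions of a deployed model."""

    def __init__(self, window_size=4, min_accuracy=0.75):
        # deque with maxlen discards the oldest outcome automatically.
        self.outcomes = deque(maxlen=window_size)
        self.min_accuracy = min_accuracy

    def record(self, prediction, actual):
        """Store whether one prediction matched the observed outcome."""
        self.outcomes.append(prediction == actual)

    def accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """Flag an issue only once the window is full and accuracy is too low."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.min_accuracy


monitor = AccuracyMonitor()
for prediction, actual in [(1, 1), (0, 0), (1, 1), (0, 0)]:
    monitor.record(prediction, actual)
print(monitor.accuracy(), monitor.needs_retraining())
```

Waiting for a full window before raising a flag avoids reacting to the first one or two unlucky predictions.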

Sometimes an issue detected in the model cannot be solved by training it with new data. In that case, the issue itself becomes the problem statement, and the machine learning life cycle starts over, from analyzing the problem again to developing an improved model.

The machine learning life cycle is an iterative process, and it may be necessary to revisit previous stages to improve the model’s performance or address new requirements. By following the machine learning life cycle, data scientists can ensure that their machine learning models are effective and accurate and that they meet the business requirements.