Data Science - Lifecycle

What is Data Science Lifecycle?

A data science lifecycle is a systematic approach to finding a solution to a data problem; it shows the steps taken to develop, deliver/deploy, and maintain a data science project. We can assume a general data science lifecycle with the most important common steps, as shown in the figure given below, but because not every data science project is built the same way, some steps may differ from project to project.

A standard data science lifecycle approach comprises the use of machine learning algorithms and statistical procedures that result in more accurate prediction models. Data extraction, preparation, cleaning, modelling, and assessment are some of the most important stages of data science. In the field of data science, this technique is known as the "Cross-Industry Standard Process for Data Mining" (CRISP-DM).

How many phases are there in the Data Science Life Cycle?

There are mainly six phases in the Data Science Life Cycle −

[Figure: the phases of the data science life cycle]

Identifying Problem and Understanding the Business

The data science lifecycle starts with "why?", just like any other business lifecycle. One of the most important parts of the data science process is figuring out what the problem is. This helps to establish a clear goal around which all the other steps can be planned. In short, it is important to understand the business goal as early as possible, because it will determine the end goal of the analysis.

This phase should evaluate business trends, assess case studies of comparable analyses, and research the industry's domain. The group will evaluate the feasibility of the project given the available employees, equipment, time, and technology. Once these factors have been identified and assessed, a preliminary hypothesis will be formulated to address the business issues arising from the existing environment. This phase should −

  1. Specify the issue and explain why the problem must be resolved immediately and demands an answer.

  2. Specify the business project’s potential value.

  3. Identify risks, including ethical concerns, associated with the project.

  4. Create and convey a flexible, highly integrated project plan.

Data Collection

The next step in the data science lifecycle is data collection, which means getting raw data from appropriate and reliable sources. The collected data can be either organized or unorganized. It could come from website logs, social media, or online data repositories; it could be streamed from online sources using APIs or web scraping; or it could live in Excel files or other sources.

The person doing this job should know the difference between the available data sets and how the organization invests in its data. Professionals find it hard to keep track of where each piece of data comes from and whether it is up to date. It is important to track this information throughout the lifecycle of a data science project, because it can help when testing hypotheses or running new experiments.

The information may be gathered through surveys or through the more prevalent method of automated data collection, such as internet cookies, which are a major source of unanalysed data.

We can also use secondary data, such as open-source datasets, which can be collected from many publicly available websites.
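As an illustration (not part of the original tutorial), a secondary dataset can often be pulled straight into a pandas dataframe; the URL below is a hypothetical placeholder for whatever source you choose −

import pandas as pd

# Hypothetical URL - replace with the actual location of your dataset
url = "https://example.com/datasets/my_data.csv"
raw = pd.read_csv(url)
print(raw.head())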

There are also some predefined datasets available in Python. Let's import the Iris dataset from scikit-learn and use it to illustrate the phases of data science.

from sklearn.datasets import load_iris
import pandas as pd

# Load Data
iris = load_iris()

# Create a dataframe
df = pd.DataFrame(iris.data, columns = iris.feature_names)
df['target'] = iris.target
X = iris.data

Data Processing

After collecting high-quality data from reliable sources, the next step is to process it. The purpose of data processing is to identify any problems in the acquired data so that they can be resolved before proceeding to the next phase. Without this step, we may produce mistaken or inaccurate findings.

There may be several difficulties with the obtained data. For instance, the data may have missing values across multiple rows or columns. It may include outliers, inaccurate numbers, timestamps with varying time zones, and so on. The data may also have problems with date formats: in certain countries the date is formatted as DD/MM/YYYY, while in others it is written as MM/DD/YYYY. Numerous problems can also occur during data collection itself; for instance, if data is gathered from many thermometers and any of them are defective, that data may need to be discarded or recollected.

At this phase, various concerns with the data must be resolved. Several of these problems have multiple solutions; for example, if the data includes missing values, we can either replace them with zero or with the column's mean value. However, if a column is missing a large number of values, it may be preferable to remove the column completely, since it contains too little data to be useful in solving the problem.
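For instance, here is a minimal pandas sketch of both options, using a small made-up dataframe rather than the Iris data −

import pandas as pd
import numpy as np

data = pd.DataFrame({
   'col': [1.0, np.nan, 3.0, 4.0],
   'mostly_empty': [np.nan, np.nan, np.nan, 5.0]
})

# Option 1: fill the few missing values with zero or the column mean
data['col'] = data['col'].fillna(data['col'].mean())

# Option 2: drop a column that is missing most of its values
data = data.drop('mostly_empty', axis=1)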

When the time zones are all mixed up, we cannot use the data in those columns and may have to remove them until we can determine the time zones used in the supplied timestamps. If we know the time zone in which each timestamp was gathered, we can convert all timestamp data to a single time zone. In this manner, there are a number of strategies to address the concerns that may exist in the obtained data.
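A minimal sketch of that conversion (assuming, hypothetically, that the timestamps were gathered in one known local time zone) −

import pandas as pd

# Hypothetical timestamps collected in a known local time zone
ts = pd.Series(pd.to_datetime(['2023-01-01 09:00', '2023-01-01 17:30']))

# Localize to the zone they were gathered in, then convert everything to UTC
ts_utc = ts.dt.tz_localize('US/Eastern').dt.tz_convert('UTC')
print(ts_utc)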

We will access the data and then store it in a dataframe using Python −

from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

# Load Data
iris = load_iris()

# Create a dataframe
df = pd.DataFrame(iris.data, columns = iris.feature_names)
df['target'] = iris.target
X = iris.data

All data must be in numeric representation for machine learning models. This implies that if a dataset includes categorical data, it must be converted to numeric values before the model can be executed. So we will be implementing label encoding.

Label Encoding

from sklearn.preprocessing import LabelEncoder

# Map the numeric target to species names
species = []
for i in range(len(df['target'])):
   if df['target'][i] == 0:
      species.append("setosa")
   elif df['target'][i] == 1:
      species.append('versicolor')
   else:
      species.append('virginica')
df['species'] = species

# Inspect a random sample of 10 rows
df.sample(10)

# Encode the species names back into numeric labels
labels = np.asarray(df.species)
le = LabelEncoder()
le.fit(labels)
labels = le.transform(labels)

# Keep only the petal measurements and the numeric target
df_selected1 = df.drop(['sepal length (cm)', 'sepal width (cm)', "species"], axis=1)

Data Analysis

Exploratory Data Analysis (EDA) is a set of visual techniques for analysing data. With this method, we can obtain specific details on the statistical summary of the data. We can also deal with duplicate values and outliers, and identify trends or patterns within the collection.

At this phase, we attempt to gain a better understanding of the acquired and processed data. We apply statistical and analytical techniques to draw conclusions about the data and to determine the links between columns in our dataset. Using pictures, graphs, charts, plots, and other visualisations, we can better comprehend and describe the data.

Professionals use statistical measures such as the mean and median to better comprehend the data. They also visualise the data and evaluate its distribution patterns using histograms, spectrum analysis, and population distributions. The data is analysed according to the problem at hand.
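For example, with the Iris dataframe built earlier, the central tendency and distribution of a single feature can be inspected like this (a sketch, not part of the original tutorial) −

import matplotlib.pyplot as plt

# Central tendency of one feature
print(df['petal length (cm)'].mean())
print(df['petal length (cm)'].median())

# Distribution pattern via a histogram
df['petal length (cm)'].hist(bins=20)
plt.show()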

Example

The code below is used to check whether there are any null values in the dataset −

df.isnull().sum()

Output

sepal length (cm) 0
sepal width (cm) 0
petal length (cm) 0
petal width (cm) 0
target 0
species 0
dtype: int64

From the above output we can conclude that there are no null values in the dataset, as the sum of the null values in each column is 0.

We will use the shape attribute to check the shape (rows, columns) of the dataset −

Example

df.shape

Output

(150, 5)

Now we will use info() to check the columns and their data types −

Example

df.info()

Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB

Only one column contains categorical data, whereas the other columns contain non-null numeric values.

Now we will use describe() on the data. The describe() method performs fundamental statistical calculations on a dataset, such as extreme values, the number of data points, standard deviation, and so on. Any missing or NaN values are automatically disregarded. The describe() method accurately depicts the distribution of the data.

Example

df.describe()

Output

[Output: statistical summary table produced by df.describe() − count, mean, std, min, quartiles, and max for each numeric column]

Data Visualization

Target column − Our target column will be the species column, since in the end we only want results based on species.

The Matplotlib and Seaborn libraries will be used for data visualization.

Below is the species countplot −

Example

import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='species', data=df)
plt.show()

Output

[Output: countplot of the number of samples for each species]

There are many other visualization plots in Data Science. To know more about them, refer to https://www.tutorialspoint.com/machine_learning_with_python
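One such plot, sketched here as an optional extra, is Seaborn's pairplot, which shows pairwise feature relationships coloured by species −

import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df, hue='species')
plt.show()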

Data Modeling

Data Modeling is one of the most important aspects of data science and is sometimes referred to as the core of data analysis. The intended output of a model is derived from the prepared and analysed data. The environment required to execute the data model is chosen and constructed before the specified criteria can be met.

At this phase, we develop datasets for training and testing the model for production-related tasks. This also involves selecting the correct model type and determining whether the problem involves classification, regression, or clustering. After analysing the model type, we must choose the appropriate implementation algorithms. This must be done with care, as it is crucial to extract the relevant insights from the provided data.

Here machine learning comes into the picture. Machine learning models are broadly divided into classification, regression, and clustering models, and each model has algorithms that are applied to the dataset to obtain the relevant information. These models are used in this phase. We will discuss them in detail in the machine learning chapter.
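As a minimal illustration of this phase (an added sketch, assuming the df_selected1 frame built in the label-encoding step), a classification model can be trained on one split of the data and evaluated on the held-out split −

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Features (petal measurements) and numeric labels
X = df_selected1.drop('target', axis=1)
y = df_selected1['target']

# Hold out 30% of the rows as an unseen test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit a simple classification model and measure its accuracy
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
print('Accuracy:', accuracy_score(y_test, model.predict(X_test)))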

Model Deployment

We have reached the final stage of the data science lifecycle. After a detailed review process, the model is finally ready to be deployed in the desired format through the chosen channel. Note that a machine learning model has no utility until it is deployed in production. Generally speaking, these models are associated and integrated with products and applications.

Model deployment involves establishing the delivery method necessary to deploy the model to market consumers or to another system. Machine learning models are also being implemented on devices, where they are gaining acceptance and appeal. Depending on the complexity of the project, this stage might range from a basic model output on a Tableau dashboard to a complicated cloud-based deployment with millions of users.
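A common first step, shown here as a sketch rather than a prescribed method (and assuming the model object trained in the previous section), is to persist the trained model so a serving application can reload it −

import joblib

# Persist the trained model from the modeling phase to disk
joblib.dump(model, 'iris_model.joblib')

# Later, inside the serving application, reload it and make predictions
loaded = joblib.load('iris_model.joblib')
print(loaded.predict([[1.4, 0.2]]))   # hypothetical petal length (cm), petal width (cm)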

Who is involved in the Data Science lifecycle?

Data is being generated, collected, and stored on voluminous servers and in data warehouses, from the individual level to the organisational level. But how will you access this massive data repository? This is where the data scientist comes in, as a specialist in extracting insights and patterns from unstructured text and statistics.

Below, we present the job profiles of the data science team members who participate in the data science lifecycle.

  1. *Business Analyst* − Understands business requirements and finds the right target customers.

  2. *Data Analyst* − Formats and cleans the raw data, then interprets and visualises it to perform the analysis and provide a technical summary of it.

  3. *Data Scientist* − Improves the quality of machine learning models.

  4. *Data Engineer* − In charge of gathering data from social networks, websites, blogs, and other internal and external web sources, ready for further analysis.

  5. *Data Architect* − Connects, centralises, protects, and keeps up with the organization's data sources.

  6. *Machine Learning Engineer* − Designs and implements machine learning-related algorithms and applications.