Big Data Analytics Tutorial
Big Data Analytics - Data Life Cycle
A life cycle is a process that denotes the sequential flow of the activities involved in Big Data Analytics. Before learning about the big data analytics life cycle, let's first understand the traditional data mining life cycle.
Traditional Data Mining Life Cycle
Its purpose is to provide a framework for organizing an organization's work systematically. The framework supports the entire business process and provides valuable business insights for making strategic decisions, helping the organization survive in a competitive world and maximize its profit.

The traditional data mining life cycle includes the following phases −
- Problem Definition − This is the initial phase of the data mining process; it defines the problem that needs to be uncovered or solved. A problem definition always includes the business goals to be achieved and the data that needs to be explored to identify patterns, business trends, and process flows that help achieve those goals.
- Data Collection − The next step is data collection. This phase involves extracting data from different sources, such as databases, web logs, or social media platforms, that are required for analysis and business intelligence. The collected data is considered raw data because it contains impurities and may not be in the required formats or structures.
- Data Pre-processing − After data collection, the data is cleaned and pre-processed: removing noise, imputing missing values, transforming data, selecting features, and converting the data into the required format before analysis can begin.
- Data Exploration and Visualization − Once the data is pre-processed, it is explored to understand its characteristics and to identify patterns and trends. This phase also includes data visualization, using scatter plots, histograms, or heat maps to show the data in graphical form.
- Modelling − This phase includes creating data models to solve the problem defined in the first phase. This can involve choosing an effective machine learning algorithm, training the model, and assessing its performance.
- Evaluation − The final stage of data mining is to assess the model's performance and determine whether it meets the business goals defined in the first phase. If the model underperforms, data exploration or feature selection may need to be repeated. A minimal end-to-end sketch of these phases follows this list.
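As a minimal, hypothetical sketch of the pre-processing, modelling, and evaluation phases above, the following Python example cleans a small customer file, trains a simple classifier, and measures its accuracy; the file name, the feature columns, and the `churned` target are assumptions made for illustration only.

```python
# A minimal sketch of the pre-processing, modelling and evaluation phases.
# "customers.csv", the feature columns and the "churned" target are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Data collection: load the raw data gathered in the previous phase
raw = pd.read_csv("customers.csv")

# Data pre-processing: drop duplicates, impute missing values, select features
data = raw.drop_duplicates()
data = data.fillna(data.median(numeric_only=True))
features = data[["age", "monthly_spend", "visits"]]
target = data["churned"]

# Modelling: train a simple classifier on a training split
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=42)
model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)

# Evaluation: check whether performance meets the business goal
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Hold-out accuracy: {accuracy:.2f}")
```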
CRISP-DM Methodology
CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It is a methodology that describes the approaches commonly used by data mining experts to tackle problems in traditional BI data mining, and it is still used by traditional BI data mining teams. The following figure illustrates the major phases of the CRISP-DM cycle and how they are interrelated.

CRISP-DM was introduced in 1996, and the following year it got underway as a European Union project under the ESPRIT funding initiative. The project was led by five companies: SPSS, Teradata, Daimler AG, NCR Corporation, and OHRA (an insurance company). The methodology was eventually incorporated into SPSS.
Phases of the CRISP-DM Life Cycle
- Business Understanding − This phase includes defining the problem, the project objectives, and the requirements from a business perspective, and then converting that knowledge into a data mining problem definition. A preliminary plan is designed to achieve the objectives.
- Data Understanding − The data understanding phase starts with an initial data collection and proceeds to assess data quality, discover first insights into the data, and detect interesting subsets that may form hypotheses about hidden information.
- Data Preparation − The data preparation phase covers all the activities needed to construct the final dataset (the data that will be fed into the modelling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for the modelling tools.
- Modelling − In this phase, different modelling techniques are selected and applied. Several techniques may be available for the same type of data, and an expert opts for the most effective and efficient ones.
- Evaluation − Once the proposed model is complete, and before its final deployment, it is important to evaluate it thoroughly and review the steps executed to construct it, to ensure that the model achieves the desired business objectives.
- Deployment − The creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained needs to be organized and presented in a way that is useful to the customer. In many cases it is the customer, not the data analyst, who carries out the deployment phase. Even if the analyst deploys the model, the customer needs to understand upfront the actions that will be required to make use of the created models.
SEMMA Methodology
SEMMA is another methodology, developed by SAS for data mining modelling. It stands for Sample, Explore, Modify, Model, and Assess.

The description of its phases is as follows −
- Sample − The process starts with data sampling, e.g., selecting the dataset for modelling. The dataset should be large enough to contain sufficient information to retrieve, yet small enough to be used efficiently. This phase also deals with data partitioning (a minimal sketch follows this list).
- Explore − This phase covers the understanding of the data by discovering anticipated and unanticipated relationships between the variables, as well as abnormalities, with the help of data visualization.
- Modify − The Modify phase contains methods to select, create, and transform variables in preparation for data modelling.
- Model − In the Model phase, the focus is on applying various modelling (data mining) techniques to the prepared variables in order to create models that may provide the desired outcome.
- Assess − The evaluation of the modelling results shows the reliability and usefulness of the created models.
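As a minimal sketch of the Sample phase, assuming pandas and scikit-learn are available, the snippet below draws a manageable sample from a larger file and partitions it into training, validation, and test sets; the file name and the split ratios are assumptions.

```python
# A minimal sketch of the SEMMA "Sample" phase: drawing a manageable sample
# and partitioning it into training, validation and test sets.
# The file name and the 60/20/20 split are assumptions for the example.
import pandas as pd
from sklearn.model_selection import train_test_split

full = pd.read_csv("transactions.csv")

# Sample: keep a fraction that is large enough to be informative
sample = full.sample(frac=0.1, random_state=42)

# Partition: 60% training, 20% validation, 20% test
train, rest = train_test_split(sample, test_size=0.4, random_state=42)
valid, test = train_test_split(rest, test_size=0.5, random_state=42)
print(len(train), len(valid), len(test))
```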
The main difference between CRISP-DM and SEMMA is that SEMMA focuses on the modelling aspect, whereas CRISP-DM gives more importance to the stages of the cycle that precede modelling, such as understanding the business problem to be solved and understanding and pre-processing the data that will be used as input to, for example, machine learning algorithms.
Big Data Life Cycle
Big Data Analytics is a field that involves managing the entire data lifecycle, including data collection, cleansing, organisation, storage, analysis, and governance. In the context of big data, traditional approaches are not well suited to analysing data of large volume, high variety, and high velocity.
For example, the SEMMA methodology pays little attention to the collection and pre-processing of data from different sources, yet these stages normally constitute most of the work in a successful big data project. Big Data analytics involves the identification, acquisition, processing, and analysis of large amounts of raw, unstructured, and semi-structured data, with the aim of extracting valuable information for identifying trends, enriching existing company data, and conducting large-scale searches.

The Big Data analytics lifecycle can be divided into the following phases −
- Business Case Evaluation
- Data Identification
- Data Acquisition & Filtering
- Data Extraction
- Data Validation & Cleansing
- Data Aggregation & Representation
- Data Analysis
- Data Visualization
- Utilization of Analysis Results
The primary differences between Big Data Analytics and traditional data analysis lie in the volume, velocity, and variety of the data processed. To address the specific requirements of big data analysis, an organised method is required. The Big Data analytics lifecycle phases are described below −
Business Case Evaluation
A Big Data analytics lifecycle begins with a well-defined business case that outlines the problem to be solved and the objectives and goals of the analysis. Before the real hands-on analytical work begins, the Business Case Evaluation phase requires a business case to be created, assessed, and approved.
An examination of the business case gives decision-makers a direction for understanding the business resources that will be required and the business problems that need to be addressed. The case evaluation also examines whether the business problem being addressed is really a Big Data problem.
Data Identification
The data identification phase focuses on identifying the datasets and sources required for the analysis project. Identifying a wider range of data sources may improve the chances of discovering hidden patterns and relationships. The firm may require internal or external datasets and sources, depending on the nature of the business problem it is addressing.
Data Acquisition and Filtering
The data acquisition process entails gathering data from all of the sources identified in the previous phase. The data is then subjected to automated filtering to remove corrupt records or records irrelevant to the analysis objectives. Depending on the type of data source, data might arrive as a collection of files, such as data acquired from a third-party data provider, or through API integration, such as with Twitter.

Once data is generated or enters the enterprise boundary, both internal and external data must be persisted. In batch analytics, the data is saved to disk first and analysed later; in real-time analytics, the data is analysed first and then saved to disk.
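A minimal sketch of such automated filtering is shown below, assuming newline-delimited JSON records received from an external provider; the field names and the validity rules are assumptions made for illustration.

```python
# A minimal sketch of automated filtering during data acquisition:
# records that are corrupted or irrelevant to the analysis are discarded
# before the data is persisted. Field names are assumptions for the example.
import json

def is_valid(record):
    """Keep only complete records that belong to the study period."""
    return (
        record.get("id") is not None
        and record.get("text")             # discard empty/corrupt payloads
        and record.get("year", 0) >= 2020  # discard records outside scope
    )

with open("raw_events.jsonl") as src, open("filtered_events.jsonl", "w") as dst:
    for line in src:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue                       # corrupted line, skip it
        if is_valid(record):
            dst.write(json.dumps(record) + "\n")
```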
Data Extraction
This phase focuses on extracting data from disparate sources and converting it into a format that the underlying Big Data solution can use for data analysis.
Data Validation and Cleansing
Incorrect data can bias and misrepresent analytical results. Unlike typical enterprise data, which has a predefined structure and has already been validated before it is fed into analysis, the data arriving for Big Data analysis is often unstructured and unvalidated, and its complexity can make it difficult to define an appropriate set of validation rules. The Data Validation and Cleansing phase is responsible for defining these, often complicated, validation criteria and removing any data known to be faulty.
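A minimal sketch of defining validation criteria and removing faulty records with pandas is shown below; the column names and value ranges are assumptions for illustration.

```python
# A minimal sketch of the Data Validation and Cleansing phase: a few explicit
# validation rules are applied and rows known to be faulty are removed.
# Column names and value ranges are assumptions for the example.
import pandas as pd

df = pd.read_csv("orders.csv")

# Validation rules: required fields present, values inside plausible ranges
valid = (
    df["order_id"].notna()
    & df["amount"].between(0, 100_000)
    & df["country"].isin(["US", "UK", "DE", "FR"])
)

clean = df[valid].copy()
print(f"Removed {len(df) - len(clean)} faulty records out of {len(df)}")
```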
Data Aggregation and Representation
The Data Aggregation and Representation phase focuses on combining multiple datasets to create a cohesive view. Performing this stage can get tricky because of differences in the following (a minimal reconciliation sketch is shown after the list) −
- Data Structure − The data format may be the same, but the data model may differ.
- Semantics − A variable labelled differently in two datasets may mean the same thing, for example, "surname" and "last name".
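A minimal reconciliation sketch, assuming two customer files whose columns differ in the ways described above; the file and column names are hypothetical.

```python
# A minimal sketch of the Data Aggregation and Representation phase:
# two datasets use different labels ("surname" vs "last name") and different
# data models, so they are reconciled before being combined into one view.
# File and column names are assumptions for the example.
import pandas as pd

crm = pd.read_csv("crm_customers.csv")   # uses the column "surname"
web = pd.read_csv("web_customers.csv")   # uses the column "last name"

# Semantics: map differently labelled variables to one common name
web = web.rename(columns={"last name": "surname"})

# Structure: keep only the attributes both sources share, then combine
common = ["customer_id", "surname", "email"]
unified = pd.concat([crm[common], web[common]], ignore_index=True)
unified = unified.drop_duplicates(subset="customer_id")
```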

Data Analysis
The data analysis phase carries out the actual analysis work, which usually comprises one or more types of analytics. Especially when the analysis is exploratory, this stage can be repeated iteratively until the appropriate pattern or correlation is uncovered.
Data Visualization
The Data Visualization phase presents the data graphically to communicate outcomes in a form that business users can interpret effectively. The resulting visualizations support visual analysis, allowing users to discover answers to questions they have not yet formulated.
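A minimal sketch of this phase using pandas and matplotlib is shown below; the result file and the plotted columns are assumptions.

```python
# A minimal sketch of the Data Visualization phase using matplotlib:
# a histogram and a scatter plot give business users a graphical view of
# the analysed data. Column names are assumptions for the example.
import pandas as pd
import matplotlib.pyplot as plt

results = pd.read_csv("analysis_results.csv")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(results["revenue"], bins=30)
ax1.set_title("Revenue distribution")
ax2.scatter(results["ad_spend"], results["revenue"], alpha=0.5)
ax2.set_title("Ad spend vs revenue")
ax2.set_xlabel("Ad spend")
ax2.set_ylabel("Revenue")
plt.tight_layout()
plt.show()
```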
Utilization of Analysis Results
In this phase, the analysis results are made available to business personnel to support business decision-making, for example via dashboards. The nine phases mentioned above are the primary phases of the Big Data Analytics life cycle.
The following phases can also be taken into consideration −
Analyse what other companies have done in the same situation. This involves looking for solutions that are reasonable for your company, even if it means adapting other solutions to the resources and requirements your company has. In this stage, a methodology for the future stages should be defined.
Once the problem is defined, it is reasonable to continue by analysing whether the current staff can complete the project successfully. Traditional BI teams might not be capable of delivering an optimal solution for all of the stages, so before starting the project it should be considered whether part of it needs to be outsourced or more people need to be hired.
This stage is key in a big data life cycle; it defines which types of profiles are needed to deliver the resulting data product. Data gathering is a non-trivial step of the process; it normally involves collecting unstructured data from different sources. For example, it could involve writing a crawler to retrieve reviews from a website. This means dealing with text, possibly in different languages, and normally requires a significant amount of time to complete.
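A minimal crawler sketch is shown below, assuming the requests and BeautifulSoup libraries; the URL and the CSS selector for review text are hypothetical placeholders, since every site structures its pages differently.

```python
# A minimal sketch of a review crawler. The URL and the ".review-text"
# selector are hypothetical placeholders for the example.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/product/123/reviews"   # hypothetical page
response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
reviews = [tag.get_text(strip=True) for tag in soup.select(".review-text")]

for review in reviews:
    print(review)
```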
Once the data is retrieved, for example from the web, it needs to be stored in an easy-to-use format. To continue with the review example, let's assume the data is retrieved from different sites, each of which displays the data differently.
Suppose one data source gives reviews as a rating in stars; this can be read as a mapping to the response variable $y \in \{1, 2, 3, 4, 5\}$. Another data source gives reviews using an arrow system, one arrow for upvoting and the other for downvoting. This implies a response variable of the form $y \in \{\mathrm{positive}, \mathrm{negative}\}$.
To combine both data sources, a decision has to be made to make the two response representations equivalent. This can involve converting the first representation into the second, treating one star as negative and five stars as positive. Doing this well often requires a significant amount of time.
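A minimal sketch of this conversion in Python is shown below; the tiny in-memory datasets and the exact thresholds for mapping stars to labels are assumptions made for illustration.

```python
# A minimal sketch of making the two review representations equivalent:
# star ratings are mapped onto the {positive, negative} scale used by the
# second source so the datasets can be combined. Thresholds are assumptions.
import pandas as pd

stars = pd.DataFrame({"text": ["great", "awful"], "rating": [5, 1]})
arrows = pd.DataFrame({"text": ["nice", "bad"], "label": ["positive", "negative"]})

def stars_to_label(rating):
    """Treat 4-5 stars as positive, 1-2 as negative, and drop neutral 3s."""
    if rating >= 4:
        return "positive"
    if rating <= 2:
        return "negative"
    return None

stars["label"] = stars["rating"].map(stars_to_label)
stars = stars.dropna(subset=["label"])[["text", "label"]]

combined = pd.concat([stars, arrows], ignore_index=True)
print(combined)
```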
Once the data is processed, it sometimes needs to be stored in a database. Big data technologies offer plenty of alternatives on this point. The most common alternative is using the Hadoop Distributed File System for storage together with Hive, which provides users with a limited, SQL-like language known as the Hive Query Language. From the user's perspective, this allows most analytics tasks to be done in a way similar to a traditional BI data warehouse. Other storage options to consider are MongoDB, Redis, and SPARK.
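A minimal sketch of querying Hive-managed data through PySpark's SQL interface is shown below; the table name and the query itself are assumptions for illustration.

```python
# A minimal sketch of querying data stored on the Hadoop file system through
# Hive's SQL-like interface, using PySpark; the "reviews" table is an assumption.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("reviews-analytics")
    .enableHiveSupport()
    .getOrCreate()
)

# Most analytics tasks can be expressed as familiar SQL, much like in a
# traditional BI data warehouse
summary = spark.sql("""
    SELECT label, COUNT(*) AS n_reviews
    FROM reviews
    GROUP BY label
""")
summary.show()
```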
This stage of the cycle is related to the knowledge of the human resources in terms of their ability to implement different architectures. Modified versions of traditional data warehouses are still used in large-scale applications. For example, Teradata and IBM offer SQL databases that can handle terabytes of data, and open-source solutions such as PostgreSQL and MySQL are still used for large-scale applications.
Even though the different storage solutions work differently in the background, from the client side most of them provide an SQL API. Hence, a good understanding of SQL is still a key skill for big data analytics. A priori this stage seems to be the most important topic; in practice, this is not true. It is not even an essential stage: it is possible to implement a big data solution that works with real-time data, in which case we only need to gather data to develop the model and then implement it in real time, so there would be no need to formally store the data at all.
Once the data has been cleaned and stored in a way that insights can be retrieved from it, the data exploration phase is mandatory. The objective of this stage is to understand the data; this is normally done with statistical techniques and by plotting the data. This is also a good stage to evaluate whether the problem definition makes sense or is feasible.
This stage involves reshaping the cleaned data retrieved previously and using statistical pre-processing for missing-value imputation, outlier detection, normalization, feature extraction, and feature selection.
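A minimal sketch of this statistical pre-processing with pandas and scikit-learn is shown below; the input file, the column types, and the thresholds are assumptions.

```python
# A minimal sketch of statistical pre-processing: missing-value imputation,
# simple outlier removal, feature selection and normalization.
# The file name and thresholds are assumptions for the example.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold

df = pd.read_csv("clean_reviews_features.csv")
numeric = df.select_dtypes(include="number")

# Missing value imputation
imputed = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(numeric),
    columns=numeric.columns,
)

# Outlier removal: drop rows more than 3 standard deviations from the mean
z_scores = (imputed - imputed.mean()) / imputed.std()
imputed = imputed[(z_scores.abs() < 3).all(axis=1)]

# Feature selection: drop near-constant features
selected = VarianceThreshold(threshold=0.01).fit_transform(imputed)

# Normalization
scaled = StandardScaler().fit_transform(selected)
```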
The prior stage should have produced several datasets for training and testing, for example, for a predictive model. This stage involves trying different models with the aim of solving the business problem at hand. In practice, it is normally desired that the model also gives some insight into the business. Finally, the best model or combination of models is selected by evaluating its performance on a left-out dataset.
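A minimal model-selection sketch with scikit-learn is shown below, using synthetic data in place of the datasets produced by the previous stage; the candidate models and the hold-out fraction are assumptions.

```python
# A minimal sketch of trying different models and selecting the best one on
# a left-out dataset; synthetic data stands in for the real feature matrix.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.25, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1_000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_hold, model.predict(X_hold))

best = max(scores, key=scores.get)
print(scores, "-> selected:", best)
```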
In this stage, the data product that has been developed is implemented in the data pipeline of the company. This involves setting up a validation scheme while the data product is working, in order to track its performance. For example, in the case of implementing a predictive model, this stage would involve applying the model to new data and, once the response is available, evaluating the model.