Big Data Analytics Tutorial
Big Data Analytics - Methodology
In terms of methodology, big data analytics differs significantly from the traditional statistical approach of experimental design. Analytics starts with data. Normally, we model the data in a way that can answer the questions business professionals have. The objectives of this approach are to predict the response behavior or to understand how the input variables relate to a response.
Typically, statistical experimental designs develop an experiment and then retrieve the resulting data. This enables the generation of data suitable for a statistical model, under the assumptions of independence, normality, and randomization. Big data analytics methodology begins with problem identification; once the business problem is defined, a research stage is required to design the methodology. However, some general guidelines are worth mentioning, as they apply to almost all problems.
The following figure demonstrates the methodology often followed in Big Data Analytics −

Big Data Analytics Methodology
The following are the stages of a big data analytics methodology −
Define Objectives
Clearly outline the analysis’s goals and objectives. What insights do you seek? What business difficulties are you attempting to solve? This stage is critical to steering the entire process.
Data Collection
Gather relevant data from a variety of sources. This includes structured data from databases, semi-structured data from logs or JSON files, and unstructured data from social media, emails, and documents.
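A minimal sketch of this step, using only the Python standard library: one structured source (a CSV export) and one semi-structured source (JSON-formatted log lines) are merged into a single list of records. The inline data and field names are hypothetical stand-ins for real sources.

```python
import csv
import io
import json

# Hypothetical raw inputs standing in for real sources: a CSV export
# (structured) and JSON-formatted application log lines (semi-structured).
CSV_EXPORT = "user_id,country\n1,US\n2,DE\n"
LOG_LINES = [
    '{"user_id": 1, "event": "login", "ts": "2024-01-01T10:00:00"}',
    '{"user_id": 2, "event": "purchase", "ts": "2024-01-01T10:05:00"}',
]

def collect_records():
    """Combine structured and semi-structured sources into one record list."""
    records = []
    # Structured: each CSV row becomes a dict keyed by the header.
    for row in csv.DictReader(io.StringIO(CSV_EXPORT)):
        records.append({"source": "csv", **row})
    # Semi-structured: each log line is a self-describing JSON object.
    for line in LOG_LINES:
        records.append({"source": "log", **json.loads(line)})
    return records

records = collect_records()
```

In practice each source would be read from a file, API, or message queue, but the pattern — normalize everything into a common record shape as early as possible — stays the same.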
Data Pre-processing
This step involves cleaning and pre-processing the data to ensure its quality and consistency. This includes addressing missing values, deleting duplicates, resolving inconsistencies, and transforming data into a useful format.
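The cleaning operations listed above can be sketched in a few lines of plain Python. The raw rows, the duplicate key, and the default used for imputation are all hypothetical; a real pipeline would make these choices per column.

```python
# A minimal pre-processing sketch using only the standard library.
raw_rows = [
    {"id": 1, "age": 34, "city": "Pune"},
    {"id": 2, "age": None, "city": "pune"},      # missing value
    {"id": 1, "age": 34, "city": "Pune"},        # duplicate of row 1
    {"id": 3, "age": 29, "city": " Mumbai "},    # inconsistent formatting
]

def preprocess(rows, default_age=0):
    seen, clean = set(), []
    for row in rows:
        if row["id"] in seen:        # delete duplicates by key
            continue
        seen.add(row["id"])
        clean.append({
            "id": row["id"],
            # handle missing values with a simple default (imputation)
            "age": row["age"] if row["age"] is not None else default_age,
            # resolve inconsistencies: trim whitespace, normalise casing
            "city": row["city"].strip().title(),
        })
    return clean

cleaned = preprocess(raw_rows)
```

Each decision here (drop vs. impute, which casing is canonical) should be recorded, since it affects every downstream stage.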
Data Storage and Management
Store the data in an appropriate storage system. This could include a typical relational database, a NoSQL database, a data lake, or a distributed file system such as Hadoop Distributed File System (HDFS).
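As a small illustration of the relational option, the sketch below persists records in SQLite, which stands in here for "a typical relational database"; the table and column names are made up for the example.

```python
import sqlite3

# Sketch: persisting cleaned records in a relational store.
conn = sqlite3.connect(":memory:")  # a file path would make this durable
conn.execute("CREATE TABLE events (user_id INTEGER, event TEXT)")
conn.executemany(
    "INSERT INTO events (user_id, event) VALUES (?, ?)",
    [(1, "login"), (2, "purchase"), (1, "logout")],
)
conn.commit()

# Once stored, the data can be queried with SQL.
(n_events,) = conn.execute("SELECT COUNT(*) FROM events").fetchone()
```

For large-scale or unstructured data, the same write-then-query pattern applies, but the store would be a data lake, a NoSQL database, or HDFS rather than a single relational table.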
Exploratory Data Analysis (EDA)
This phase includes the identification of data features, finding patterns, and detecting outliers. We often use visualization tools like histograms, scatter plots, and box plots.
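Alongside visual tools, EDA usually starts with summary statistics and a simple outlier rule. The sketch below uses hypothetical daily counts and Tukey's interquartile-range (IQR) rule — the same rule a box plot visualises — to flag anomalies.

```python
import statistics

# Hypothetical daily transaction counts; one value is clearly anomalous.
values = [10, 11, 12, 12, 12, 13, 13, 14, 15, 102]

mean = statistics.fmean(values)
median = statistics.median(values)

# Tukey's rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
```

Note how the outlier drags the mean far from the median — a pattern worth checking before choosing models that assume roughly symmetric data.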
Feature Engineering
Create new features or modify existing ones to improve the performance of machine learning models. This could include feature scaling, dimensionality reduction, or constructing composite features.
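Two of the techniques named above — feature scaling and constructing a composite feature — can be sketched directly. The income and housing figures are hypothetical, and the scaler uses the population standard deviation (a library implementation may use the sample version instead).

```python
import statistics

def standardize(column):
    """Scale a numeric feature to zero mean and unit variance (z-scores)."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population std; libraries may differ
    return [(x - mean) / std for x in column]

# Feature scaling on a hypothetical income column.
incomes = [30_000, 50_000, 70_000]
scaled = standardize(incomes)

# A composite feature built from two existing ones.
rooms, area = [3, 4, 5], [90, 100, 150]
area_per_room = [a / r for a, r in zip(area, rooms)]
```

Scaling matters for distance-based models and gradient methods; composite features encode domain knowledge the raw columns do not express on their own.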
Model Selection and Training
Choose relevant machine learning algorithms based on the nature of the problem and the properties of the data. If labeled data is available, train the models.
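To make the fit/predict cycle concrete, here is a deliberately tiny model — a 1-nearest-neighbour classifier written from scratch — standing in for whatever library algorithm the problem actually calls for. The training points and labels are invented for the sketch.

```python
import math

class OneNearestNeighbor:
    """Minimal 1-nearest-neighbour classifier; a stand-in for a library model."""

    def fit(self, X, y):
        # "Training" here is just memorising the labeled data.
        self.X, self.y = X, y
        return self

    def predict(self, X):
        # For each query point, return the label of the closest training point.
        return [
            self.y[min(range(len(self.X)), key=lambda i: math.dist(self.X[i], x))]
            for x in X
        ]

# Hypothetical labeled data: two well-separated clusters.
X_train = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.2, 4.9)]
y_train = ["low", "low", "high", "high"]

model = OneNearestNeighbor().fit(X_train, y_train)
preds = model.predict([(0.1, 0.1), (5.1, 5.0)])
```

The choice of algorithm depends on the data: nearest-neighbour methods, for instance, need the feature scaling discussed in the previous stage, since unscaled features dominate the distance calculation.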
Model Evaluation
Measure the trained models' performance using accuracy, precision, recall, F1-score, and ROC curves. This helps to determine the best-performing model for deployment.
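The first four metrics listed above all derive from the confusion-matrix counts, so they are easy to compute by hand. The labels below are a hypothetical set of predictions from a binary classifier.

```python
def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 from two label lists."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted positives
    recall = tp / (tp + fn) if tp + fn else 0.0      # of actual positives
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)            # harmonic mean
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical predictions from a trained binary classifier.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
metrics = evaluate(y_true, y_pred)
```

ROC curves extend this idea by sweeping the classification threshold and plotting the resulting true-positive rate against the false-positive rate.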
Deployment
In a production environment, deploy the model for real-world use. This could include integrating the model with existing systems, creating APIs for model inference, and establishing monitoring tools.
Monitoring and Maintenance
Continuously monitor the performance of the deployed model and maintain the supporting infrastructure. Update the analytics pipeline as needed to reflect changing business requirements or data characteristics.
Iterate
Big data analytics is an iterative process. Analyze the data, collect feedback, and update the models or procedures as needed to increase accuracy and effectiveness over time.
One of the most important tasks in big data analytics is statistical modeling − supervised and unsupervised classification or regression problems. After cleaning and pre-processing the data for modeling, carefully assess various models with appropriate loss metrics. After a model is implemented, conduct further evaluations and report the outcomes. A common pitfall in predictive modeling is to implement the model without ever measuring its performance.
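The pitfall above can be shown in a few lines: a model evaluated only on its own training data can look perfect, while a held-out set reveals the true performance. The data and the "memorising" model below are hypothetical illustrations, not a recommended technique.

```python
# A model that is only "implemented" can look perfect on its training data;
# measuring on held-out data reveals the truth.
data = [(x, x % 2) for x in range(10)]   # feature -> label (parity of x)
train, holdout = data[:6], data[6:]      # simple holdout split

lookup = {x: y for x, y in train}        # a "model" that just memorises
def predict(x):
    return lookup.get(x, 0)              # unseen inputs default to label 0

def accuracy(rows):
    return sum(predict(x) == y for x, y in rows) / len(rows)

train_acc = accuracy(train)      # perfect by construction
holdout_acc = accuracy(holdout)  # much worse on unseen data
```

This is why the evaluation stage always uses data the model never saw during training, whether via a holdout set or cross-validation.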