Big Data Analytics Tutorial

Big Data Analytics - Overview

What is Big Data Analytics?

Gartner defines Big Data as follows: “Big data is high-volume, high-velocity and/or high-variety information that demands cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”

Big Data is a collection of data sets so large and complex that traditional computing approaches cannot process or manage them. It is a broad term for the massive volume of complex data that businesses and governments generate in today’s digital world. It is often measured in terabytes or petabytes and originates from three key sources: transactional data, machine data, and social data.

Big Data encompasses the data itself as well as the frameworks, tools, and methodologies used to store, access, analyse, and visualise it. Technologically advanced communication channels such as social networks and powerful devices have created new ways to generate and transform data, which challenges industry participants to find new ways of handling it. Converting large amounts of unstructured raw data, retrieved from different sources, into a data product that is useful to organisations forms the core of Big Data Analytics.

Steps of Big Data Analytics

Big Data Analytics is a powerful way to unlock the potential of large and complex datasets. To understand it better, let’s break it down into key steps −

Data Collection

This is the initial step, in which data is collected from different sources such as social media, sensors, online channels, commercial transactions, and website logs. Collected data may be structured (with a predefined organisation, as in databases), semi-structured (such as log files), or unstructured (text documents, photos, and videos).

Data Cleaning (Data Pre-processing)

The next step is to process the collected data by removing errors and making it suitable for analysis. Raw data generally contains errors, missing values, inconsistencies, and noise. Data cleaning entails identifying and correcting these problems so that the data is accurate and consistent. Pre-processing may also involve data transformation, normalisation, and feature extraction to prepare the data for further analysis.

Overall, data cleaning and pre-processing entail replacing missing data, correcting inaccuracies, and removing duplicates. It is like sifting through a treasure trove, separating out the rocks and debris and leaving only the valuable gems behind.
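The three operations named above (removing duplicates, correcting inconsistencies, and replacing missing values) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names, and the choice to fill a missing age with the mean of the known ages are all hypothetical, and real pipelines usually pick a fill strategy per column.

```python
from statistics import mean

# Hypothetical raw records: inconsistent city spelling, a missing age, a duplicate id.
raw = [
    {"id": 1, "city": " Delhi ", "age": 34},
    {"id": 2, "city": "delhi", "age": None},
    {"id": 3, "city": "Mumbai", "age": 28},
    {"id": 3, "city": "Mumbai", "age": 28},
]

def clean(records):
    # 1. Remove duplicates: keep the first record seen for each id.
    seen, unique = set(), []
    for record in records:
        if record["id"] not in seen:
            seen.add(record["id"])
            unique.append(record)

    # 2. Choose a fill value for missing ages (assumed strategy: mean of known ages).
    fill_age = mean(r["age"] for r in unique if r["age"] is not None)

    # 3. Correct inconsistencies (whitespace, casing) and replace missing values.
    return [
        {
            "id": r["id"],
            "city": r["city"].strip().title(),
            "age": r["age"] if r["age"] is not None else fill_age,
        }
        for r in unique
    ]

cleaned = clean(raw)
```

Removing duplicates before computing the fill value matters here: otherwise the repeated record would be counted twice in the mean.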

Data Analysis

This is a key phase of big data analytics. Different techniques and algorithms are used to analyse the data and derive useful insights. These can include descriptive analytics (summarising data to better understand its characteristics), diagnostic analytics (identifying patterns and relationships that explain why something happened), predictive analytics (predicting future trends or outcomes), and prescriptive analytics (making recommendations or decisions based on the analysis).

Data Visualization

In this step, data is presented in visual form. Data visualisation techniques use charts, graphs, interactive dashboards, and other graphical formats to make the insights from data analysis clearer and more actionable.
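As a minimal, text-only sketch of the idea, the following Python snippet turns a few numbers into a horizontal bar chart. The region names and figures are made up, and `bar_chart` is a hypothetical helper; in practice a plotting library or dashboard tool would be used instead.

```python
def bar_chart(data, width=20):
    """Return one text line per category, scaled so the largest value fills `width`."""
    top = max(data.values())
    lines = []
    for label, value in data.items():
        bar = "#" * round(width * value / top)
        lines.append(f"{label:<8}{bar} {value}")
    return lines

# Hypothetical sales per region.
chart = bar_chart({"North": 40, "South": 20, "East": 10})
for line in chart:
    print(line)
```

Even in this crude form, the relative bar lengths make the comparison between categories visible at a glance, which is the point of the visualisation step.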

Interpretation and Decision Making

Once analysis and visualisation are complete and insights have been gained, stakeholders review the findings to make informed decisions. Such decisions include optimising business operations, improving customer experience, creating new products or services, and guiding strategic planning.

Data Storage and Management

Once collected, the data must be stored in a way that enables easy retrieval and analysis. Traditional databases may not be sufficient for handling large amounts of data, so many organisations use distributed storage systems such as the Hadoop Distributed File System (HDFS) or cloud-based storage solutions such as Amazon S3.

Continuous Learning and Improvement

Big data analytics is a continuous process of collecting, cleaning, and analysing data to uncover hidden insights. It helps businesses make better decisions and gain a competitive edge.

Types of Big Data

Big Data is generally categorised into three types −

Structured Data
Semi-Structured Data
Unstructured Data

Let us discuss each type in detail.

Structured Data

Structured data has a dedicated data model, a well-defined structure, and a consistent order, so that it can be easily accessed and used by humans or computers. It is usually stored in a well-defined tabular form, that is, in rows and columns. Examples: MS Excel spreadsheets, tables in Database Management Systems (DBMS).

Semi-Structured Data

Semi-structured data sits between structured and unstructured data. It inherits some qualities of structured data, but most of it lacks a rigid structure and does not follow the formal structure of data models such as those of an RDBMS. Example: Comma Separated Values (CSV) files.

Unstructured Data

Unstructured data does not follow any predefined structure. It lacks a uniform format and is constantly changing, although it may occasionally include date- and time-related information. Examples: audio files, images, etc.
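The difference between the three types can be made concrete with a small Python sketch. All values here are hypothetical, and JSON is used as an assumed illustration of a semi-structured format; it is not one of the examples named above.

```python
import json

# Structured: every row has the same columns in the same order.
columns = ("id", "name", "city")
rows = [(1, "Asha", "Pune"), (2, "Ravi", "Delhi")]
cities = [row[columns.index("city")] for row in rows]

# Semi-structured: each record names its own fields, and the fields may differ.
docs = json.loads(
    '[{"id": 1, "name": "Asha", "tags": ["vip"]},'
    ' {"id": 2, "name": "Ravi", "address": {"city": "Delhi"}}]'
)
fields_per_doc = [sorted(doc) for doc in docs]

# Unstructured: raw bytes with no fields at all (stand-in for audio or image data).
audio_bytes = bytes([0, 17, 255, 3])
```

A query such as "list all cities" is a one-liner for the structured rows, needs per-record checks for the semi-structured documents, and has no direct meaning for the raw bytes until some model extracts features from them.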

Types of Big Data Analytics

Some common types of Big Data analytics are as follows −

Descriptive Analytics

Descriptive analytics answers questions such as “What is happening in my business?” when the dataset is business-related. It summarises past facts and helps create reports on figures such as a company’s income, profit, and sales. It also helps tabulate social media metrics. It works best with comprehensive, accurate, live data and effective visualisation.
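A descriptive summary of this kind can be sketched with Python's standard `statistics` module. The monthly sales figures below are invented for illustration.

```python
from statistics import mean, median

# Hypothetical monthly sales figures (units sold).
monthly_sales = {"Jan": 120, "Feb": 95, "Mar": 140, "Apr": 105}
values = list(monthly_sales.values())

# Descriptive analytics: summarise what has already happened.
summary = {
    "total": sum(values),
    "mean": mean(values),
    "median": median(values),
    "best_month": max(monthly_sales, key=monthly_sales.get),
    "worst_month": min(monthly_sales, key=monthly_sales.get),
}
```

Nothing here explains or predicts anything; the summary only describes the past, which is what separates descriptive analytics from the other types.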

Diagnostic Analytics

Diagnostic analytics determines root causes from data. It answers questions such as “Why is it happening?” Common techniques include drill-down, data mining, and data discovery. Organisations use diagnostic analytics because it provides in-depth insight into a particular problem. Overall, it can drill down to root causes and isolate confounding information.

For example − A report from an online store shows that sales have decreased even though people are still adding items to their shopping carts. Several things could cause this, such as a form not loading properly, shipping costs being too high, or too few payment options being offered. Diagnostic analytics can help figure out which of these is responsible.
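One simple way to drill down into a case like this is to compare how many shoppers reach each checkout step. The sketch below assumes a hypothetical four-step funnel with made-up counts and looks for the step with the worst conversion rate.

```python
# Hypothetical number of shoppers reaching each checkout step.
funnel = [
    ("cart", 1000),
    ("shipping_form", 900),
    ("payment", 300),
    ("order_placed", 270),
]

def step_conversion(steps):
    """Return (transition, rate) pairs for each consecutive pair of steps."""
    rates = []
    for (prev_name, prev_count), (name, count) in zip(steps, steps[1:]):
        rates.append((f"{prev_name} -> {name}", count / prev_count))
    return rates

rates = step_conversion(funnel)
worst_step, worst_rate = min(rates, key=lambda pair: pair[1])
```

With these numbers, most shoppers are lost between the shipping form and the payment step, so shipping costs or the payment options would be the first things to investigate.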

Predictive Analytics

This kind of analytics looks at past and present data to predict what will happen in the future. It answers questions such as “What will happen in the future?” Data mining, AI, and machine learning are all used in predictive analytics to analyse current data and forecast outcomes such as market trends and customer trends.

For example − PayPal sets the rules that Bajaj Finance has to follow to keep its customers safe from fraudulent transactions. The business uses predictive analytics to examine all of its past payment and user behaviour data and build a program that can spot fraud.
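Fraud detection models are too involved for a short example, so the sketch below shows the underlying idea of predictive analytics on a much simpler task: fitting a straight line to past values with ordinary least squares and extrapolating one step ahead. The sales figures are hypothetical, and a linear trend is an assumption that real data rarely satisfies exactly.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

# Hypothetical sales for months 1 to 5.
months = [1, 2, 3, 4, 5]
sales = [100, 110, 120, 130, 140]

slope, intercept = fit_line(months, sales)
forecast_month_6 = slope * 6 + intercept
```

The same pattern (learn parameters from past data, then apply them to unseen inputs) carries over to the machine learning models mentioned above.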

Prescriptive Analytics

Prescriptive analytics makes it possible to frame a strategic decision; its results answer the question “What do I need to do?” Prescriptive analytics builds on both descriptive and predictive analytics. Most of the time, it relies on AI and machine learning.

For example − Prescriptive analytics can help a company maximise its business and profit. In the airline industry, prescriptive analytics applies a set of algorithms that automatically adjust flight prices based on customer demand, and that reduce ticket prices because of factors such as bad weather conditions, location, or holiday seasons.
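A toy version of such a pricing rule might look as follows. The function name, the 50% maximum demand surcharge, and the 10% bad-weather discount are all hypothetical values chosen for illustration; real airline pricing systems are far more elaborate.

```python
def recommend_price(base_price, seats_sold, seats_total, bad_weather=False):
    """Recommend a ticket price from demand and weather (illustrative rule only)."""
    load_factor = seats_sold / seats_total
    # Assumed rule: price rises linearly with demand, up to +50% on a full flight.
    price = base_price * (1 + 0.5 * load_factor)
    if bad_weather:
        # Assumed rule: 10% discount when bad weather is expected.
        price *= 0.9
    return round(price, 2)

normal_price = recommend_price(200, seats_sold=50, seats_total=100)
rainy_price = recommend_price(200, seats_sold=50, seats_total=100, bad_weather=True)
```

The output is an action (a price to charge) rather than a description or a forecast, which is what makes this prescriptive.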

Tools and Technologies of Big Data Analytics

Some commonly used big data analytics tools are as follows −

Hadoop

A tool to store and analyse large amounts of data. Hadoop makes it possible to deal with big data, and it is one of the tools that made big data analytics possible.
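Hadoop's classic processing model is MapReduce. The following single-process Python sketch imitates its three phases (map, shuffle, reduce) on a word count; it does not use any Hadoop API, and the function names and input lines are hypothetical. On a real cluster, each phase runs in parallel across many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

lines = ["big data needs big storage", "data drives decisions"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Because each line is mapped independently and each key is reduced independently, the work can be spread over many machines without changing the logic.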

MongoDB

A tool for managing unstructured data. It is a database specially designed to store, access, and process large quantities of unstructured data.

Talend

A tool for data integration and management. Talend’s solution package includes complete capabilities for data integration, data quality, master data management, and data governance. Talend integrates with big data technologies such as Hadoop, Spark, and NoSQL databases, allowing organisations to process and analyse enormous amounts of data efficiently. It includes connectors and components for interacting with these technologies, so users can create data pipelines for ingesting, processing, and analysing large amounts of data.

Cassandra

A distributed database used to handle large volumes of data. Cassandra is an open-source distributed NoSQL database management system that handles massive amounts of data across many commodity servers, ensuring high availability and scalability without sacrificing performance.
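Cassandra spreads rows over servers by hashing each row's partition key to a token and assigning token ranges to nodes. The sketch below illustrates that idea with a tiny hash ring. It is a simplification under stated assumptions: MD5 stands in for Cassandra's actual partitioner, each node gets a single token, replication is ignored, and the `Ring` class and node names are hypothetical.

```python
import bisect
import hashlib

def token(value):
    """Map a string to a 64-bit integer token (MD5 is a stand-in hash here)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

class Ring:
    def __init__(self, nodes):
        # Each node owns the token range that ends at its own token.
        self._ring = sorted((token(node), node) for node in nodes)
        self._tokens = [t for t, _ in self._ring]

    def node_for(self, partition_key):
        # First node clockwise from the key's token, wrapping around at the end.
        index = bisect.bisect_right(self._tokens, token(partition_key))
        return self._ring[index % len(self._ring)][1]

nodes = ["node-a", "node-b", "node-c"]
ring = Ring(nodes)
owner = ring.node_for("user:42")
```

Any client can compute the owner of a key locally, so no central coordinator is needed to route a read or write, which is one reason this layout scales out well.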

Spark

Used for processing and analysing large amounts of data in real time. Apache Spark is a robust and versatile distributed computing framework that provides a single platform for big data processing, analytics, and machine learning, making it popular in industries such as e-commerce, finance, healthcare, and telecommunications.

Storm

It is an open-source real-time computation system. Apache Storm is a robust and versatile stream processing framework that allows organisations to process and analyse real-time data streams at large scale, making it suitable for a wide range of use cases in industries such as banking, telecommunications, e-commerce, and IoT.
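In Storm's terminology, data sources are called spouts and processing steps are called bolts. The following sketch imitates that structure with plain Python generators; it does not use the Storm API, and the function names and log messages are hypothetical. The point is that results are updated record by record instead of after a full batch.

```python
from collections import Counter

def message_spout(messages):
    """Source of the stream: emits one raw message at a time."""
    for message in messages:
        yield message

def split_bolt(stream):
    """Processing step: splits each message into lower-cased words."""
    for message in stream:
        for word in message.split():
            yield word.lower()

def count_bolt(stream):
    """Processing step: emits the running count after every word."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]

log_messages = ["Error timeout", "error disk"]
updates = list(count_bolt(split_bolt(message_spout(log_messages))))
```

Each word flows through the whole chain as soon as it is produced, so the running counts are available while the stream is still being read.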

Kafka

It is a distributed streaming platform that also provides fault-tolerant storage. Apache Kafka is a versatile and powerful event streaming platform that allows organisations to create scalable, fault-tolerant, real-time data pipelines and streaming applications to efficiently meet their data processing requirements.
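Kafka keeps records in an append-only log, and each consumer tracks its own read position (offset) in that log. The sketch below models this idea in memory with a single log; it is not the Kafka client API, it ignores partitions and replication, and the `TopicLog` class, event names, and consumer names are hypothetical.

```python
class TopicLog:
    """In-memory stand-in for one append-only log of records."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return the offset it was stored at."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]

log = TopicLog()
offsets = [log.append(event) for event in ["signup", "login", "purchase"]]

# Each consumer keeps its own position, so readers do not affect each other.
consumer_offsets = {"billing": 0, "analytics": 2}
billing_batch = log.read(consumer_offsets["billing"])
analytics_batch = log.read(consumer_offsets["analytics"])
```

Because reading does not remove records, a new consumer can start from offset 0 and replay the full history, while existing consumers continue from where they left off.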