Big Data Analytics: A Concise Tutorial
Big Data Analytics - Architecture
What is Big Data Architecture?
Big data architecture is designed to manage the ingestion, processing, and analysis of data that is too large or too complex for conventional systems. Conventional relational databases cannot store, process, and manage data at this scale. The solution is to organize a set of technologies into a big data architecture that can manage and process such data.
Key Aspects of Big Data Architecture
The following are some key aspects of big data architecture −
- Store and process large volumes of data, for example 100 GB or more.
- Aggregate and transform a wide variety of unstructured data for analysis and reporting.
- Access, process, and analyze streamed data in real time.
Components of Big Data Architecture
The following are the different components of big data architecture −
Data Sources
All big data solutions start with one or more data sources. A big data architecture accommodates many kinds of data sources and manages a wide range of data types. Common data sources include transactional databases, logs, machine-generated data, social media and web data, streaming data, external data sources, cloud-based data, NoSQL databases, data warehouses, file systems, APIs, and web services.
These are only a few examples. In practice, the data landscape is broad and constantly changing, with new sources and technologies emerging over time. The primary challenge in big data architecture is to integrate, process, and analyze data from these different sources so that it yields relevant insights and drives decision-making.
Data Storage
Data storage is the part of a big data architecture that stores and manages large amounts of data. Big data involves large amounts of structured, semi-structured, and unstructured data, and traditional relational databases often prove inadequate for it because of scalability and performance limitations.
Data for batch processing operations is typically kept in a distributed file store that can hold large volumes of files in various formats. This kind of store is often called a data lake. You can use Azure Data Lake Storage or blob containers in Azure Storage for this purpose. The following image shows the key approaches to data storage in a big data architecture −

The choice of a data storage system depends on several factors, including the type of data, performance requirements, scalability, and budget. Different big data architectures combine these storage systems to meet different use cases and objectives.
Batch Processing
Batch processing uses long-running batch jobs to filter, aggregate, and otherwise prepare data for analysis. These jobs usually read source files, process them, and write the output to new files. Batch processing is an essential component of big data architecture because it allows large amounts of data to be processed efficiently in scheduled batches. Data is gathered, processed, and analyzed at predetermined intervals rather than in real time.
Batch processing is especially useful for operations that do not require an immediate response, such as data analytics, reporting, and bulk data conversions. You can run U-SQL jobs in Azure Data Lake Analytics, use Hive, Pig, or custom Map/Reduce jobs in an HDInsight Hadoop cluster, or use Java, Scala, or Python programs in an HDInsight Spark cluster.
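The read, filter, aggregate, and write pattern described above can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not how Hive, Pig, or Spark jobs are written: the function name run_batch, the column names region and amount, and the in-memory sample data are all hypothetical.

```python
import csv
import io
from collections import defaultdict

def run_batch(source, sink):
    """Read raw CSV records, drop invalid rows, and write per-region totals.

    `source` and `sink` are file-like objects; the column names
    "region" and "amount" are assumptions made for this sketch.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(source):
        try:
            amount = float(row["amount"])
        except (KeyError, TypeError, ValueError):
            continue  # filter step: skip malformed records
        if amount <= 0:
            continue  # filter step: skip non-positive amounts
        totals[row["region"]] += amount  # aggregate step

    # Write the prepared output to a new file for later analysis.
    writer = csv.writer(sink)
    writer.writerow(["region", "total_amount"])
    for region in sorted(totals):
        writer.writerow([region, f"{totals[region]:.2f}"])
    return dict(totals)

# In-memory stand-ins for a source file and an output file.
raw = io.StringIO(
    "region,amount\nnorth,10.5\nsouth,4\nnorth,abc\nsouth,-1\nnorth,2.5\n"
)
out = io.StringIO()
summary = run_batch(raw, out)
```

In a real deployment the same job would be triggered on a schedule and would read from and write to the distributed file store rather than in-memory buffers.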
Real-time Message Ingestion
Real-time message ingestion is the part of a big data architecture that captures data streams as they are generated or received. It helps enterprises deal with high-velocity data sources such as sensor feeds, log files, social media updates, clickstreams, and IoT devices.
Real-time message ingestion systems are critical for extracting important insights, identifying anomalies, and responding immediately to events. The following image shows the different methods used for real-time message ingestion within big data architecture −

If the solution includes real-time sources, the architecture must include a way to capture and store real-time messages for stream processing. This can be as simple as a data store where incoming messages are dropped into a folder for processing. Many solutions, however, need a message ingestion store that acts as a buffer for messages and supports scale-out processing, reliable delivery, and other message queuing semantics. Options include Azure Event Hubs, Azure IoT Hub, and Kafka.
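The buffering role of a message ingestion store can be illustrated with a bounded in-memory queue. This is a toy sketch only, assuming a single producer and a single consumer in one process; it does not use or imitate the Event Hubs, IoT Hub, or Kafka APIs, and the names producer, consumer, and the sensor-1 events are hypothetical.

```python
import queue
import threading

def producer(buffer, events):
    """Simulate a real-time source pushing messages into the buffer."""
    for event in events:
        buffer.put(event)  # blocks while the buffer is full (back-pressure)
    buffer.put(None)       # sentinel: no more messages

def consumer(buffer, processed):
    """Simulate a downstream stream processor draining the buffer."""
    while True:
        event = buffer.get()
        if event is None:
            break
        processed.append(f'{event["device"]}:{event["reading"]}')

# A small bounded queue stands in for the ingestion store.
buffer = queue.Queue(maxsize=2)
events = [{"device": "sensor-1", "reading": r} for r in (20, 21, 22, 23)]
processed = []

t_prod = threading.Thread(target=producer, args=(buffer, events))
t_cons = threading.Thread(target=consumer, args=(buffer, processed))
t_prod.start()
t_cons.start()
t_prod.join()
t_cons.join()
```

The buffer decouples the two sides: the producer can run ahead of the consumer by at most two messages here, whereas a production ingestion store would also persist messages and distribute them across partitions.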
Stream Processing
Stream processing is a type of data processing that handles data records continuously as they are generated or received in real time. It enables enterprises to quickly analyze, transform, and respond to data streams, resulting in timely insights, alerts, and actions. Stream processing is a critical component of big data architecture, especially for high-volume data sources such as sensor data, logs, social media updates, financial transactions, and IoT device telemetry.
The following figure illustrates how stream processing works within big data architecture −

After capturing real-time messages, the solution processes them by filtering, aggregating, and otherwise preparing the data for analysis. The processed stream data is then written to an output sink. Azure Stream Analytics offers a managed stream processing service based on continuously executing SQL queries on unbounded streams. In addition, you can use open-source Apache streaming technologies such as Storm and Spark Streaming on an HDInsight cluster.
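One common way to aggregate an unbounded stream is a tumbling (fixed, non-overlapping) window. The sketch below is a simplified illustration rather than the behavior of any particular engine: it assumes events arrive in timestamp order as (timestamp, value) pairs, and the function name tumbling_windows and the sample readings are hypothetical.

```python
def tumbling_windows(events, window_seconds):
    """Yield (window_start, average) each time a fixed window closes.

    Assumes `events` is an iterable of (timestamp, value) pairs that
    arrive in timestamp order; late or out-of-order events are not handled.
    """
    current_start = None
    total = 0.0
    count = 0
    for timestamp, value in events:
        start = (timestamp // window_seconds) * window_seconds
        if current_start is not None and start != current_start:
            # The previous window is complete: emit its result to the sink.
            yield current_start, total / count
            total, count = 0.0, 0
        current_start = start
        total += value
        count += 1
    if count:
        # Flush the last, possibly partial, window when the input ends.
        yield current_start, total / count

# Timestamps in seconds, values are sensor readings (made-up sample data).
readings = [(0, 10.0), (3, 20.0), (5, 30.0), (9, 50.0), (12, 5.0)]
averages = list(tumbling_windows(readings, window_seconds=5))
```

Because the function is a generator, each window result becomes available as soon as the first event of the next window arrives, which mirrors how results flow to an output sink while the stream is still running.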
Analytical Data Store
In big data analytics, an Analytical Data Store (ADS) is a specialized database or data storage system designed to handle complex analytical queries over massive amounts of data. An ADS supports ad hoc querying, data exploration, reporting, and advanced analytics tasks, which makes it an essential component of big data systems for business intelligence and analytics. The key features of analytical data stores in big data analytics are summarized in the following figure −

The processed data is served in a structured format that analytical tools can query. A low-latency NoSQL technology such as HBase, or an interactive Hive database, can present the data through an abstraction over the data files in the distributed data store. Azure Synapse Analytics is a managed service for large-scale, cloud-based data warehousing. You can also serve and analyze data using Hive, HBase, and Spark SQL with HDInsight.
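The kind of ad hoc aggregate query an analytical data store is built to answer can be shown on a very small scale with SQLite, which ships with Python. SQLite is used here purely as a stand-in and is not a big data store; the sales table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

# An in-memory database stands in for the analytical data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("north", "widget", 120.0),
        ("north", "gadget", 80.0),
        ("south", "widget", 50.0),
    ],
)

# Ad hoc analytical query: total sales per region, largest first.
rows = conn.execute(
    "SELECT region, SUM(amount) AS total "
    "FROM sales GROUP BY region ORDER BY total DESC"
).fetchall()
conn.close()
```

The same SQL shape (group, aggregate, order) is what a reporting tool would typically send to Hive, Spark SQL, or a cloud data warehouse, where the engine distributes the work across many nodes.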
Analysis and Reporting
Big data analysis and reporting is the process of extracting insights, patterns, and trends from large and complex data sets to support decision-making, strategic planning, and operational improvements. It draws on a range of strategies, tools, and methodologies for analyzing data and presenting the results in a useful, practical form.
The following image gives a brief overview of the different analysis and reporting methods in big data analytics −

Most big data solutions aim to extract insights from data through analysis and reporting. To let users analyze the data, the architecture may include a data modeling layer, such as a multidimensional OLAP cube or a tabular data model in Azure Analysis Services. It may also support self-service business intelligence through the modeling and visualization features in Microsoft Power BI or Excel. Data scientists and analysts may also explore the data interactively as part of their analysis and reporting work.
Orchestration
In big data analytics, orchestration refers to the coordination and management of the tasks, processes, and resources used to process data. To ensure that big data analytics workflows run efficiently and reliably, it is necessary to automate data movement and processing steps, schedule jobs, manage dependencies, and monitor task performance.
The following figure shows the different steps used in orchestration −

Most big data solutions consist of repeated workflows that transform source data, move data between multiple sources and sinks, load the processed data into an analytical data store, or push the results directly to a report or dashboard. To automate these workflows, you can use an orchestration technology such as Azure Data Factory or Apache Oozie, often alongside Sqoop for bulk data transfer.
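Dependency management, one of the orchestration duties mentioned above, can be sketched with the standard library's graphlib module (Python 3.9 or later). This is an illustrative toy, not the Azure Data Factory or Oozie model: the task names ingest, transform, load, and report, and the helper run_pipeline, are hypothetical.

```python
from graphlib import TopologicalSorter

def run_pipeline(tasks, dependencies):
    """Run each task only after all of the tasks it depends on have run.

    `tasks` maps a task name to a zero-argument callable.
    `dependencies` maps a task name to the set of tasks it depends on.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    log = []
    for name in order:
        log.append(tasks[name]())  # a real orchestrator would also retry and monitor
    return order, log

# Hypothetical four-step workflow: each step depends on the previous one.
names = ["ingest", "transform", "load", "report"]
tasks = {name: (lambda n=name: f"{n} done") for name in names}
dependencies = {
    "transform": {"ingest"},
    "load": {"transform"},
    "report": {"load"},
}

order, log = run_pipeline(tasks, dependencies)
```

TopologicalSorter raises CycleError if the dependencies form a loop, which is the kind of validation an orchestration tool performs before scheduling a workflow.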