HCatalog: A Concise Tutorial

HCatalog - Introduction

What is HCatalog?

HCatalog is a table storage management tool for Hadoop. It exposes the tabular data of the Hive metastore to other Hadoop applications, enabling users of different data processing tools (Pig, MapReduce) to easily read and write data on the grid. Users do not have to worry about where or in what format their data is stored.

HCatalog works as a key component of Hive, enabling users to store their data in any format and with any structure.

Why HCatalog?

Enabling the right tool for the right job

Hadoop ecosystem contains different tools for data processing such as Hive, Pig, and MapReduce. Although these tools do not require metadata, they can still benefit from it when it is present. Sharing a metadata store also enables users across tools to share data more easily. A workflow where data is loaded and normalized using MapReduce or Pig and then analyzed via Hive is very common. If all these tools share one metastore, then the users of each tool have immediate access to data created with another tool. No loading or transfer steps are required.

Capture processing states to enable sharing

HCatalog can publish your analytics results, so other programmers can access your analytics platform via REST. The schemas you publish are also useful to other data scientists, who can use your discoveries as inputs into subsequent discovery.
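The REST access described here is served by WebHCat, HCatalog's REST server. As a minimal sketch, the helper below only assembles WebHCat v1 resource URLs; the host name, user, and table names are illustrative assumptions, and 50111 is WebHCat's default port.

```python
from urllib.parse import quote, urlencode

# Sketch: build a URL for a WebHCat (HCatalog REST server) v1 resource.
# Host, user, and resource names below are illustrative assumptions;
# 50111 is WebHCat's default port.
def webhcat_url(host, resource, user, params=None, port=50111):
    """Build a URL such as .../templeton/v1/ddl/database/default/table."""
    query = {"user.name": user}
    if params:
        query.update(params)
    return "http://%s:%d/templeton/v1/%s?%s" % (
        host, port, quote(resource), urlencode(query))

# List the tables that some other tool registered in the shared metastore.
url = webhcat_url("hadoop-gw.example.com", "ddl/database/default/table", "alice")
```

A client would then issue an ordinary HTTP GET against such a URL to read the published schemas.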

Integrate Hadoop with everything

Hadoop, as a processing and storage environment, opens up many opportunities for the enterprise; however, to fuel adoption, it must work with and augment existing tools. Hadoop should serve as input into your analytics platform or integrate with your operational data stores and web applications. The organization should be able to enjoy the value of Hadoop without having to learn an entirely new toolset. REST services open up the platform to the enterprise with a familiar API and an SQL-like language. Enterprise data management systems use HCatalog to integrate more deeply with the Hadoop platform.

HCatalog Architecture

The following illustration shows the overall architecture of HCatalog.

[Figure: HCatalog architecture]

HCatalog supports reading and writing files in any format for which a SerDe (serializer-deserializer) can be written. By default, HCatalog supports RCFile, CSV, JSON, SequenceFile, and ORC file formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.
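A table using such a custom format is typically registered through Hive DDL that names the SerDe, InputFormat, and OutputFormat classes. As an illustration only, the sketch below assembles that DDL as a string; the `com.example.*` class names are hypothetical placeholders, and a real custom format must supply actual implementations on the classpath.

```python
# Sketch: assemble the Hive DDL that registers a table with a custom
# storage format. The Java class names used below are hypothetical
# placeholders, not real classes.
def custom_format_ddl(table, columns, serde, input_format, output_format):
    cols = ", ".join("%s %s" % (name, typ) for name, typ in columns)
    return (
        "CREATE TABLE %s (%s) "
        "ROW FORMAT SERDE '%s' "
        "STORED AS INPUTFORMAT '%s' OUTPUTFORMAT '%s';"
        % (table, cols, serde, input_format, output_format)
    )

ddl = custom_format_ddl(
    "events",
    [("id", "BIGINT"), ("payload", "STRING")],
    "com.example.hive.MySerDe",           # hypothetical SerDe class
    "com.example.mapred.MyInputFormat",   # hypothetical InputFormat class
    "com.example.mapred.MyOutputFormat",  # hypothetical OutputFormat class
)
```

For the built-in formats (RCFile, ORC, and so on) no such clause is needed; Hive's shorthand `STORED AS` keywords cover them.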

HCatalog is built on top of the Hive metastore and incorporates Hive’s DDL. HCatalog provides read and write interfaces for Pig and MapReduce and uses Hive’s command line interface for issuing data definition and metadata exploration commands.
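In practice, those data definition commands are issued through the `hcat` script, which accepts a statement via its `-e` option much like the Hive CLI. As a minimal sketch, the helper below only builds the argument list; the table definition is an example, and nothing here requires a running cluster.

```python
# Sketch: assemble an `hcat` command line that issues one DDL statement,
# the way HCatalog's Hive-derived CLI is used for data definition.
# The table definition is just an example.
def hcat_command(ddl_statement):
    return ["hcat", "-e", ddl_statement]

cmd = hcat_command("CREATE TABLE raw_events (line STRING);")
# On a node with HCatalog installed, this could then be executed with:
#   subprocess.run(cmd, check=True)
```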