DynamoDB Concise Tutorial

DynamoDB - MapReduce

Amazon's Elastic MapReduce (EMR) lets you process big data quickly and efficiently. EMR runs Apache Hadoop on EC2 instances but simplifies the process. You use Apache Hive to query MapReduce job flows through HiveQL, a query language resembling SQL. Hive thus gives you a way to optimize both your queries and your applications.

You can launch a job flow from the EMR tab of the management console, the EMR CLI, an API, or an SDK. You can also run Hive interactively or from a script.

EMR read/write operations count against throughput consumption. For large requests, however, EMR performs retries under the protection of a backoff algorithm. Running EMR concurrently with other operations and tasks may also result in throttling.
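The retry behaviour described above can be illustrated with a generic exponential backoff loop. This is only a sketch of the general technique, not EMR's actual algorithm: the `ThrottledError` class, the function name, and the delay values are all hypothetical, chosen for illustration.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a 'throughput exceeded' response."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.05,
                      max_delay=2.0, sleep=time.sleep):
    """Call `operation`, retrying on ThrottledError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttling error
            # The delay ceiling doubles on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # A random delay below the ceiling ("jitter") spreads out retries.
            sleep(random.uniform(0, ceiling))
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real waiting; in ordinary use the default `time.sleep` applies.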

The DynamoDB/EMR integration does not support binary and binary set attributes.

DynamoDB/EMR Integration Prerequisites

Review this checklist of required items before using EMR −

An AWS account
A populated table under the same account employed in EMR operations
A custom Hive version with DynamoDB connectivity
DynamoDB connectivity support
An S3 bucket (optional)
An SSH client (optional)
An EC2 key pair (optional)

Hive Setup

Before using EMR, create a key pair so that you can run Hive in interactive mode. The key pair allows you to connect to the EC2 instances and to the master node of a job flow.

Create the key pair with the following steps −

Log in to the management console, and open the EC2 console located at https://console.aws.amazon.com/ec2/
Select a region in the upper right-hand corner of the console. Ensure the region matches the DynamoDB region.
In the Navigation pane, select Key Pairs.
Select Create Key Pair.
In the Key Pair Name field, enter a name and select Create.
Download the resulting private key file, which uses the following format: filename.pem.

Note − You cannot connect to EC2 instances without the key pair.
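The console steps above can also be scripted. The following is a minimal sketch, assuming the boto3 SDK's EC2 client and its `create_key_pair` call; the helper name `save_key_pair`, the key name, and the region in the usage comment are placeholders rather than values from this tutorial.

```python
import os


def save_key_pair(ec2_client, key_name, directory="."):
    """Create an EC2 key pair and write the private key to <key_name>.pem."""
    response = ec2_client.create_key_pair(KeyName=key_name)
    path = os.path.join(directory, f"{key_name}.pem")
    # Create the file with owner-only permissions; fail if it already exists.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(response["KeyMaterial"])
    return path


# Usage (needs boto3 and configured AWS credentials):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")  # match the DynamoDB region
#   print(save_key_pair(ec2, "emr-hive-key"))
```

AWS returns the private key only once, at creation time, so the sketch writes it to disk immediately.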

Hive Cluster

Create a Hive-enabled cluster to run Hive. The cluster provides the applications and infrastructure required for a Hive-to-DynamoDB connection.

Create the cluster with the following steps −

Access the EMR console.
Select Create Cluster.
In the creation screen, configure the cluster: enter a descriptive cluster name, select Yes for Termination protection, select Enabled for Logging, enter an S3 destination for Log folder S3 location, and select Enabled for Debugging.
In the Software Configuration screen, ensure the fields hold Amazon for Hadoop distribution, the latest version for AMI version, the default Hive version for Applications to be Installed-Hive, and the default Pig version for Applications to be Installed-Pig.
In the Hardware Configuration screen, ensure the fields hold Launch into EC2-Classic for Network and No Preference for EC2 Availability Zone. For Master-Amazon EC2 Instance Type, keep the default and leave Request Spot Instances unchecked. For Core-Amazon EC2 Instance Type, keep the default, set Count to 2, and leave Request Spot Instances unchecked. For Task-Amazon EC2 Instance Type, keep the default, set Count to 0, and leave Request Spot Instances unchecked.

Be sure to set a limit that provides sufficient capacity to prevent cluster failure.

In the Security and Access screen, ensure fields hold your key pair in EC2 key pair, No other IAM users in IAM user access, and Proceed without roles in IAM role.
Review the Bootstrap Actions screen, but do not modify it.
Review settings, and select Create Cluster when finished.

A Summary pane appears when the cluster starts.
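For readers who prefer the SDK route to the console, the settings above map roughly onto the request for boto3's EMR `run_job_flow` call. This is a sketch under stated assumptions, not a reproduction of the console flow: the release label, the instance types, and the default role names are assumptions, and the helper name `build_cluster_request` is hypothetical. Unlike the "Proceed without roles" choice above, the sketch assumes the default EMR roles exist in the account.

```python
def build_cluster_request(name, key_pair, log_uri, release_label="emr-5.36.0"):
    """Return keyword arguments for the boto3 EMR client's run_job_flow call."""
    return {
        "Name": name,
        "LogUri": log_uri,  # S3 destination for cluster logs
        "ReleaseLabel": release_label,  # assumed release; choose a current one
        "Applications": [{"Name": "Hive"}, {"Name": "Pig"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed instance type
            "SlaveInstanceType": "m5.xlarge",   # assumed instance type
            "InstanceCount": 3,  # one master node plus two core nodes
            "Ec2KeyName": key_pair,
            "KeepJobFlowAliveWhenNoSteps": True,  # stay up for interactive Hive
            "TerminationProtected": True,
        },
        # Default role names; an assumption about how the account is set up.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# Usage (needs boto3 and configured AWS credentials):
#   import boto3
#   emr = boto3.client("emr", region_name="us-east-1")
#   request = build_cluster_request("hive-demo", "emr-hive-key", "s3://my-bucket/logs/")
#   cluster_id = emr.run_job_flow(**request)["JobFlowId"]
```

Building the request as a plain dictionary keeps it easy to inspect before any cluster is actually launched.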

Activate SSH Session

You need an active SSH session to connect to the master node and execute CLI operations. Locate the master node by selecting the cluster in the EMR console, which lists the node under Master Public DNS Name.
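The same DNS name can be read programmatically. A minimal sketch, assuming boto3's EMR `describe_cluster` call; the helper name and the cluster ID in the usage comment are placeholders.

```python
def master_public_dns(emr_client, cluster_id):
    """Return the master node's public DNS name, or None if not yet assigned."""
    description = emr_client.describe_cluster(ClusterId=cluster_id)
    # The field may be missing while the cluster is still starting.
    return description["Cluster"].get("MasterPublicDnsName")


# Usage (needs boto3 and configured AWS credentials):
#   import boto3
#   emr = boto3.client("emr", region_name="us-east-1")
#   print(master_public_dns(emr, "j-XXXXXXXXXXXXX"))
```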

Install PuTTY if you do not have it. Then launch PuTTYgen and select Load. Choose your PEM file and open it. PuTTYgen informs you when the import succeeds. Select Save private key to save the key in PuTTY private key format (PPK), and choose Yes to save it without a passphrase. Then enter a name for the PuTTY key, select Save, and close PuTTYgen.

Start PuTTY and use it to connect to the master node. Choose Session from the Category list. Enter hadoop@DNS in the Host Name field, where DNS is the master public DNS name. Expand Connection > SSH in the Category list, and choose Auth. In the controlling options screen, select Browse for Private key file for authentication. Then select your private key file and open it. Select Yes in the security alert pop-up.

Once you are connected to the master node, a Hadoop command prompt appears, which means you can begin an interactive Hive session.
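On systems with an OpenSSH client, the PuTTY steps above reduce to a single `ssh` invocation that uses the PEM file directly. The helper below is a small sketch that assembles that command; the function name and the host name in the usage comment are placeholders, while the `hadoop` user comes from the steps above.

```python
import shlex


def ssh_command(key_path, master_dns, user="hadoop"):
    """Build the argument list for an SSH session to the master node."""
    # -i selects the private key file downloaded when the key pair was created.
    return ["ssh", "-i", key_path, f"{user}@{master_dns}"]


# Usage: print a copy-and-paste form of the command, or run it with
# subprocess.run(ssh_command(...)) to open the session.
#   print(shlex.join(ssh_command("emr-hive-key.pem", "<master-public-dns>")))
```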

Hive Table

Hive is a data warehouse tool that lets you query EMR clusters using HiveQL. The previous setup steps give you a working prompt. Run Hive commands interactively by entering hive, followed by any commands you wish. See our Hive tutorial for more information on Hive.
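As an illustration of what you might enter at the hive prompt, the sketch below generates a HiveQL statement that maps a Hive table onto a DynamoDB table through the EMR DynamoDB storage handler. The table names, column names, and the helper `dynamodb_table_ddl` are hypothetical, and the helper does no escaping, so it assumes trusted identifiers.

```python
def dynamodb_table_ddl(hive_table, dynamodb_table, columns):
    """Build HiveQL that maps a Hive external table onto a DynamoDB table.

    `columns` is a list of (hive_column, hive_type, dynamodb_attribute) tuples.
    """
    column_defs = ", ".join(f"{name} {hive_type}" for name, hive_type, _ in columns)
    # Each mapping entry pairs a Hive column with a DynamoDB attribute name.
    mapping = ",".join(f"{name}:{attribute}" for name, _, attribute in columns)
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({column_defs})\n"
        "STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'\n"
        f'TBLPROPERTIES ("dynamodb.table.name" = "{dynamodb_table}",\n'
        f'"dynamodb.column.mapping" = "{mapping}");'
    )


# Hypothetical table and attribute names, for illustration only.
print(dynamodb_table_ddl(
    "hive_orders",
    "Orders",
    [("order_id", "string", "OrderId"), ("total", "bigint", "Total")],
))
```

The printed statement can be pasted into an interactive Hive session or saved to a script file.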