Mahout Quick Tutorial

Mahout - Environment

This chapter explains how to set up Mahout. Java and Hadoop are prerequisites for Mahout. The steps below cover downloading and installing Java, Hadoop, and Mahout.

Pre-Installation Setup

Before installing Hadoop in a Linux environment, we need to set up Linux using ssh (Secure Shell). Follow the steps below to set up the Linux environment.

Creating a User

It is recommended to create a separate user for Hadoop to isolate the Hadoop file system from the Unix file system. Follow the steps given below to create a user:

Open root using the command “su”.

Create a user from the root account using the command “useradd username”.

Now you can open an existing user account using the command “su username”.

Open the Linux terminal and type the following commands to create a user.

$ su
password:
# useradd hadoop
# passwd hadoop
New passwd:
Retype new passwd
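Running useradd again for an account that already exists fails, so it can help to check first. The following is a minimal sketch, assuming a hypothetical helper name user_exists (not part of the original steps) built on the standard id utility:

```shell
# Hypothetical helper: succeeds if the named account already exists.
user_exists() {
    id -u "$1" >/dev/null 2>&1
}

# Only report the state, so the snippet is safe to run as any user.
if user_exists hadoop; then
    echo "user hadoop already exists"
else
    echo "user hadoop not found; create it as root with: useradd hadoop"
fi
```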

SSH Setup and Key Generation

SSH setup is required to perform operations on a cluster such as starting and stopping daemons and running distributed daemon shell operations. To authenticate different users of Hadoop, a public/private key pair must be generated for the Hadoop user and shared with the other users.

The following commands generate a key pair using SSH, copy the public key from id_rsa.pub to authorized_keys, and give the owner read and write permissions on the authorized_keys file, respectively.

$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
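Because the cat command above appends on every run, repeating it adds duplicate entries to authorized_keys. The sketch below shows one way to make the step repeatable. The helper name add_key_once and the sample key text are assumptions made for illustration, and the demonstration works on a scratch directory rather than the real ~/.ssh:

```shell
# Hypothetical helper: append a public key to an authorized_keys file only if
# that exact line is not already present, then restrict the file to its owner.
add_key_once() {
    pub_file="$1"
    auth_file="$2"
    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    key=$(cat "$pub_file")
    grep -qxF -- "$key" "$auth_file" || printf '%s\n' "$key" >> "$auth_file"
    chmod 0600 "$auth_file"
}

# Demonstration on a scratch directory, so the real ~/.ssh is left untouched.
scratch=$(mktemp -d)
printf '%s\n' 'ssh-rsa AAAAexample hadoop@localhost' > "$scratch/id_rsa.pub"
add_key_once "$scratch/id_rsa.pub" "$scratch/authorized_keys"
add_key_once "$scratch/id_rsa.pub" "$scratch/authorized_keys"
wc -l < "$scratch/authorized_keys"   # the key is stored once despite two calls
```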

Verifying ssh

$ ssh localhost

Installing Java

Java is the main prerequisite for Hadoop and Mahout. First, verify that Java is present on your system using “java -version”. The syntax of the command is given below.

$ java -version

It should produce the following output.

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
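When this check is scripted, the version string can be pulled out of the first line of the output. The sketch below assumes a hypothetical helper named parse_java_version. Note that java -version writes to standard error, which is why the output is redirected before it is piped:

```shell
# Hypothetical helper: read `java -version` output on stdin and print the
# quoted version string from the first line that contains one.
parse_java_version() {
    sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}

# `java -version` prints to stderr, hence the 2>&1 redirect.
if command -v java >/dev/null 2>&1; then
    java -version 2>&1 | parse_java_version
else
    echo "java not found on PATH"
fi
```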

If Java is not installed on your system, follow the steps given below to install it.

Step 1

Download Java (JDK <latest version> - X64.tar.gz) by visiting the following link: Oracle

The file jdk-7u71-linux-x64.tar.gz is then downloaded to your system.

Step 2

Generally, the downloaded Java file is in the Downloads folder. Verify it and extract the jdk-7u71-linux-x64.tar.gz file using the following commands.

$ cd Downloads/
$ ls
jdk-7u71-linux-x64.tar.gz
$ tar zxf jdk-7u71-linux-x64.tar.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.tar.gz

Step 3

To make Java available to all users, move it to the location “/usr/local/”. Open root and type the following commands.

$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit

Step 4

To set up the PATH and JAVA_HOME variables, add the following commands to the ~/.bashrc file.

export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin
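After reloading ~/.bashrc, one way to confirm that the Java bin directory really is on PATH is to test for it as a whole path entry. This is a small sketch. The helper name path_contains is an assumption, and the fallback directory simply mirrors the JAVA_HOME value used in this section:

```shell
# Hypothetical helper: succeeds if the given directory is an entry in PATH.
# Wrapping PATH in colons lets the pattern match whole entries only.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

java_bin="${JAVA_HOME:-/usr/local/jdk1.7.0_71}/bin"
if path_contains "$java_bin"; then
    echo "$java_bin is on PATH"
else
    echo "$java_bin is missing from PATH"
fi
```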

Now, run the java -version command from the terminal again to verify the installation, as explained above.

Downloading Hadoop

After installing Java, install Hadoop. First, check whether Hadoop is already present using the “hadoop version” command as shown below.

$ hadoop version

It should produce the following output:

Hadoop 2.6.0
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using /home/hadoop/hadoop/share/hadoop/common/hadoop-common-2.6.0.jar

If your system is unable to locate Hadoop, download and install it as follows.

Download and extract hadoop-2.6.0 from the Apache Software Foundation using the following commands.

$ su
password:
# cd /usr/local
# wget http://mirrors.advancedhosters.com/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
# tar xzf hadoop-2.6.0.tar.gz
# mkdir hadoop
# mv hadoop-2.6.0/* hadoop/
# exit
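Before extracting a downloaded archive, it is common to compare its digest with the one published alongside the release. The sketch below assumes a hypothetical helper named verify_sha256 and the GNU sha256sum utility. The checksum file name and digest type differ between releases, so the choice of SHA-256 here is an assumption, and the expected digest must come from the download site:

```shell
# Hypothetical helper: succeeds if the file's SHA-256 digest equals the
# expected hex string passed as the second argument.
verify_sha256() {
    actual=$(sha256sum "$1" | awk '{print $1}')
    [ "$actual" = "$2" ]
}

# Example call (the digest is a placeholder to be copied from the release page):
#   verify_sha256 hadoop-2.6.0.tar.gz "<published digest>" || echo "checksum mismatch"
```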

Installing Hadoop

Install Hadoop in the mode you require. This tutorial demonstrates Mahout in pseudo-distributed mode, so install Hadoop in pseudo-distributed mode.

Follow the steps given below to install Hadoop on your system.

Step 1: Setting up Hadoop

You can set the Hadoop environment variables by appending the following commands to the ~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME

export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME

Now, apply the changes to the current shell session.

$ source ~/.bashrc
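A quick way to confirm that the variables took effect is to loop over their names and report any that are still empty. This is a minimal sketch. The helper name check_vars is an assumption, and the names passed to it are the ones exported in Step 1:

```shell
# Hypothetical helper: print the name of every listed variable that is unset
# or empty, and return non-zero if any were missing.
check_vars() {
    missing=0
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "unset: $name"
            missing=1
        fi
    done
    return "$missing"
}

check_vars HADOOP_HOME HADOOP_MAPRED_HOME HADOOP_COMMON_HOME \
    HADOOP_HDFS_HOME YARN_HOME HADOOP_INSTALL \
    || echo "some variables are missing; re-check ~/.bashrc"
```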

Step 2: Hadoop Configuration

All the Hadoop configuration files are located in “$HADOOP_HOME/etc/hadoop”. You need to change these configuration files to match your Hadoop infrastructure.

$ cd $HADOOP_HOME/etc/hadoop

To develop Hadoop programs in Java, reset the Java environment variable in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on your system.

export JAVA_HOME=/usr/local/jdk1.7.0_71

Given below is the list of files that you have to edit to configure Hadoop.

core-site.xml

The core-site.xml file contains information such as the port number used for the Hadoop instance, the memory allocated for the file system, the memory limit for storing data, and the size of the read/write buffers.

Open core-site.xml and add the following property between the <configuration> and </configuration> tags:

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>
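After editing, it can be useful to read a property back from the file to confirm the value that was saved. The sketch below assumes a hypothetical helper named get_prop. It is a line-oriented reader rather than a real XML parser, so it only works when each <name> and <value> element sits on a single line, as in the listing above:

```shell
# Hypothetical helper: print the <value> that follows a given <name> in a
# Hadoop-style configuration file. Usage: get_prop FILE PROPERTY_NAME
get_prop() {
    awk -v key="$2" '
        /<name>/ {
            n = $0; sub(/.*<name>/, "", n); sub(/<\/name>.*/, "", n)
        }
        /<value>/ && n == key {
            v = $0; sub(/.*<value>/, "", v); sub(/<\/value>.*/, "", v)
            print v; exit
        }
    ' "$1"
}

# The default path mirrors the HADOOP_HOME value used in this section.
conf="${HADOOP_HOME:-/usr/local/hadoop}/etc/hadoop/core-site.xml"
if [ -f "$conf" ]; then
    get_prop "$conf" fs.default.name
fi
```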

hdfs-site.xml

The hdfs-site.xml file contains information such as the data replication value, the namenode path, and the datanode paths on your local file systems. These paths are where you want to store the Hadoop infrastructure.

Let us assume the following data:

dfs.replication (data replication value) = 1

(In the path given below, hadoop is the user name.
hadoopinfra/hdfs/namenode is the directory created by the HDFS file system.)
namenode path = /home/hadoop/hadoopinfra/hdfs/namenode

(hadoopinfra/hdfs/datanode is the directory created by the HDFS file system.)
datanode path = /home/hadoop/hadoopinfra/hdfs/datanode

Open this file and add the following properties between the <configuration> and </configuration> tags.

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>

   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>

   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
   </property>
</configuration>

Note: In the above file, all the property values are user defined. You can change them to match your Hadoop infrastructure.

yarn-site.xml

This file is used to configure YARN in Hadoop. Open the yarn-site.xml file and add the following property between the <configuration> and </configuration> tags.

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>

mapred-site.xml

This file specifies which MapReduce framework we are using. By default, Hadoop contains a template of mapred-site.xml. First, copy mapred-site.xml.template to mapred-site.xml using the following command.

$ cp mapred-site.xml.template mapred-site.xml

Open the mapred-site.xml file and add the following property between the <configuration> and </configuration> tags.

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>
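Running the cp command a second time would overwrite an edited mapred-site.xml with the empty template. The sketch below shows a guarded copy. The helper name ensure_mapred_site is an assumption, and the default directory mirrors the HADOOP_HOME value used in this section:

```shell
# Hypothetical helper: create mapred-site.xml from the template only when it
# does not exist yet, so earlier edits are never overwritten.
ensure_mapred_site() {
    dir="$1"
    if [ ! -f "$dir/mapred-site.xml" ]; then
        cp "$dir/mapred-site.xml.template" "$dir/mapred-site.xml"
    fi
}

conf_dir="${HADOOP_HOME:-/usr/local/hadoop}/etc/hadoop"
if [ -f "$conf_dir/mapred-site.xml.template" ]; then
    ensure_mapred_site "$conf_dir"
fi
```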

Verifying Hadoop Installation

The following steps are used to verify the Hadoop installation.

Step 1: Name Node Setup

Set up the namenode using the command “hdfs namenode -format” as follows:

$ cd ~
$ hdfs namenode -format

The expected result is as follows:

10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.4.1
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory /home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
************************************************************/
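When the format step is scripted, the log can be checked for the success line shown above. This is a sketch only. The helper name format_succeeded is an assumption, and the matched phrase is taken from the sample output in this section, so other Hadoop versions may word it differently:

```shell
# Hypothetical helper: read namenode-format output on stdin and succeed if it
# reports that the storage directory was formatted.
format_succeeded() {
    grep -q 'has been successfully formatted'
}

# Example call, left commented out because formatting erases existing HDFS metadata:
#   hdfs namenode -format 2>&1 | format_succeeded && echo "namenode formatted"
```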

Step 2: Verifying Hadoop dfs

The following command starts dfs, which brings up your Hadoop file system.

$ start-dfs.sh

The expected output is as follows:

10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]

Step 3: Verifying Yarn Script

The following command starts the yarn script. Executing it starts your yarn daemons.

$ start-yarn.sh

The expected output is as follows:

starting yarn daemons
starting resource manager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-resourcemanager-localhost.out
localhost: starting node manager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-nodemanager-localhost.out
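Once both scripts have run, the JDK's jps tool lists the running Java processes, which gives a quick check that every daemon came up. The sketch below assumes a hypothetical helper named missing_daemons. The daemon names are the ones a pseudo-distributed setup is expected to show in jps output:

```shell
# Hypothetical helper: read `jps` output on stdin ("<pid> <name>" per line)
# and print each expected Hadoop daemon that is not listed.
missing_daemons() {
    listing=$(cat)
    for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # Compare the whole second field so NameNode does not match
        # SecondaryNameNode.
        printf '%s\n' "$listing" \
            | awk -v d="$d" '$2 == d { found = 1 } END { exit !found }' \
            || echo "$d"
    done
}

if command -v jps >/dev/null 2>&1; then
    jps | missing_daemons
fi
```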

Step 4: Accessing Hadoop on Browser

The default port number to access Hadoop is 50070. Use the following URL to reach the Hadoop services in your browser.

http://localhost:50070/

Step 5: Verify All Applications for Cluster

The default port number to access all the applications of the cluster is 8088. Use the following URL to visit this service.

http://localhost:8088/

Downloading Mahout

Mahout is available on the Mahout website. Download Mahout from the link provided there.

Step 1

Download Apache Mahout from the link https://mahout.apache.org/general/downloads using the following command.

[Hadoop@localhost ~]$ wget http://mirror.nexcess.net/apache/mahout/0.9/mahout-distribution-0.9.tar.gz

The file mahout-distribution-0.9.tar.gz is then downloaded to your system.

Step 2

Go to the folder where mahout-distribution-0.9.tar.gz is stored and extract the downloaded archive as shown below.

[Hadoop@localhost ~]$ tar zxvf mahout-distribution-0.9.tar.gz
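To see which directory an archive will create without extracting it, the archive listing can be inspected first. The sketch below assumes a hypothetical helper named archive_root. It reports the top-level directory of the first entry, which is enough for archives that unpack into a single folder:

```shell
# Hypothetical helper: print the top-level directory name of the first entry
# in a gzip-compressed tar archive.
archive_root() {
    tar tzf "$1" | awk -F/ 'NR == 1 { print $1 }'
}

if [ -f mahout-distribution-0.9.tar.gz ]; then
    archive_root mahout-distribution-0.9.tar.gz
fi
```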

Maven Repository

Given below are the pom.xml dependencies needed to build an Apache Mahout project with Maven in Eclipse.

<dependency>
   <groupId>org.apache.mahout</groupId>
   <artifactId>mahout-core</artifactId>
   <version>0.9</version>
</dependency>

<dependency>
   <groupId>org.apache.mahout</groupId>
   <artifactId>mahout-math</artifactId>
   <version>0.9</version>
</dependency>

<dependency>
   <groupId>org.apache.mahout</groupId>
   <artifactId>mahout-integration</artifactId>
   <version>0.9</version>
</dependency>