Mahout - Clustering

Clustering is the procedure to organize elements or items of a given collection into groups based on the similarity between the items. For example, the applications related to online news publishing group their news articles using clustering.

Applications of Clustering

  1. Clustering is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing.

  2. Clustering can help marketers discover distinct groups in their customer base and characterize those groups based on purchasing patterns.

  3. In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality and gain insight into structures inherent in populations.

  4. Clustering helps in identification of areas of similar land use in an earth observation database.

  5. Clustering also helps in classifying documents on the web for information discovery.

  6. Clustering is used in outlier detection applications such as detection of credit card fraud.

  7. As a data mining function, Cluster Analysis serves as a tool to gain insight into the distribution of data to observe characteristics of each cluster.

Using Mahout, we can cluster a given set of data. The steps required are as follows:

  1. Algorithm - You need to select a suitable clustering algorithm to group the elements of the collection into clusters.

  2. Similarity and Dissimilarity - You need to have a rule in place to verify the similarity between a newly encountered element and the elements already in the groups.

  3. Stopping Condition - A stopping condition is required to define the point at which no further clustering is required.

Procedure of Clustering

To cluster the given data you need to -

  1. Start the Hadoop server. Create the required directories for storing files in the Hadoop File System. (Create directories for the input file, the sequence file, and the clustered output in case of canopy clustering.)

  2. Copy the input file from the Unix file system to the Hadoop File System.

  3. Prepare the sequence file from the input data.

  4. Run any of the available clustering algorithms.

  5. Get the clustered data.

Starting Hadoop

Mahout works with Hadoop, hence make sure that the Hadoop server is up and running.

$ cd $HADOOP_HOME/bin
$ ./start-all.sh
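
To confirm that the Hadoop daemons are actually running, one quick check (assuming a JDK is installed and jps is on the PATH) is the jps command, which should list daemons such as NameNode, DataNode, and JobTracker:

$ jps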

Preparing Input File Directories

Create directories in the Hadoop file system to store the input file, sequence files, and clustered data using the following command:

$ hadoop fs -mkdir -p /mahout_data
$ hadoop fs -mkdir -p /clustered_data
$ hadoop fs -mkdir -p /mahout_seq
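
You can also verify from the command line that the directories have been created, for example:

$ hadoop fs -ls /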

You can verify whether the directories have been created using the Hadoop web interface at the following URL - http://localhost:50070/

It gives you the output as shown below:

[Screenshot: the input file directories listed in the Hadoop NameNode web interface]

Copying Input File to HDFS

Now, copy the input data file from the Linux file system to mahout_data directory in the Hadoop File System as shown below. Assume your input file is mydata.txt and it is in the /home/Hadoop/data/ directory.

$ hadoop fs -put /home/Hadoop/data/mydata.txt /mahout_data/
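
If you want to confirm the copy and preview the first few lines of the file (assuming the mydata.txt file used above), you can run:

$ hadoop fs -cat /mahout_data/mydata.txt | head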

Preparing the Sequence File

Mahout provides a utility to convert the given input file into the sequence file format. This utility requires two parameters.

  1. The input file directory where the original data resides.

  2. The output directory where the generated sequence files are to be stored.

Given below is the help prompt of mahout seqdirectory utility.

Step 1: Browse to the Mahout home directory. You can get help of the utility as shown below:

[Hadoop@localhost bin]$ ./mahout seqdirectory --help
Job-Specific Options:
--input (-i) input Path to job input directory.
--output (-o) output The directory pathname for output.
--overwrite (-ow) If present, overwrite the output directory

Generate the sequence file using the utility with the following syntax:

mahout seqdirectory -i <input file path> -o <output directory>

Example

mahout seqdirectory
-i hdfs://localhost:9000/mahout_data/
-o hdfs://localhost:9000/mahout_seq/
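
To check that the sequence files were produced, one option is Mahout's seqdumper utility; the path below assumes the output directory used above:

mahout seqdumper -i hdfs://localhost:9000/mahout_seq/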

Clustering Algorithms

Mahout supports two main algorithms for clustering, namely:

  1. Canopy clustering

  2. K-means clustering

Canopy Clustering

Canopy clustering is a simple and fast technique used by Mahout for clustering purposes. The objects are treated as points in a plane. This technique is often used as an initial step in other clustering techniques such as k-means clustering. You can run a Canopy job using the following syntax:

mahout canopy -i <input vectors directory>
-o <output directory>
-t1 <threshold value 1>
-t2 <threshold value 2>

A Canopy job requires an input directory containing the sequence file and an output directory where the clustered data is to be stored.

Example

mahout canopy -i hdfs://localhost:9000/mahout_seq/mydata.seq
-o hdfs://localhost:9000/clustered_data
-t1 30
-t2 20

You will get the clustered data generated in the given output directory.
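
To read the generated clusters in a human-readable form, one option is Mahout's clusterdump utility. The cluster directory name below (a clusters-0-final subdirectory) and the local output file are assumptions; adjust them to the directory actually produced by your run:

mahout clusterdump
-i hdfs://localhost:9000/clustered_data/clusters-0-final
-o clusters_report.txt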

K-means Clustering

K-means clustering is an important clustering algorithm. The k in the k-means clustering algorithm represents the number of clusters the data is to be divided into. For example, if the k value specified to this algorithm is 3, the algorithm will divide the data into 3 clusters.

Each object is represented as a vector in space. Initially, the algorithm chooses k points at random and treats them as centers; every object is then assigned to the closest center. There are several algorithms for the distance measure, and the user should choose the required one.

Creating Vector Files

  1. Unlike the Canopy algorithm, the k-means algorithm requires vector files as input; therefore, you have to create vector files.

  2. To generate vector files from the sequence file format, Mahout provides the seq2sparse utility.

Given below are some of the options of the seq2sparse utility. Create vector files using these options.

$MAHOUT_HOME/bin/mahout seq2sparse
--analyzerName (-a) analyzerName  The class name of the analyzer
--chunkSize (-chunk) chunkSize    The chunkSize in MegaBytes.
--output (-o) output              The directory pathname for o/p
--input (-i) input                Path to job input directory.
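
As an illustration only, a minimal seq2sparse invocation could look as follows; the sparse_vectors output directory is an assumed name (the utility typically writes its vectors into subdirectories such as tfidf-vectors inside it):

mahout seq2sparse
-i hdfs://localhost:9000/mahout_seq/
-o hdfs://localhost:9000/sparse_vectors/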

After creating vectors, proceed with the k-means algorithm. The syntax to run a k-means job is as follows:

mahout kmeans -i <input vectors directory>
-c  <input clusters directory>
-o  <output working directory>
-dm <Distance Measure technique>
-x  <maximum number of iterations>
-k  <number of initial clusters>

A k-means clustering job requires the input vector directory, the output clusters directory, the distance measure, the maximum number of iterations to be carried out, and an integer value representing the number of clusters the input data is to be divided into.
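
As an illustration only, a k-means run continuing the earlier examples could look as follows. The vector and cluster directory names are assumptions, and SquaredEuclideanDistanceMeasure is just one of the distance measures shipped with Mahout. When -k is given, the initial clusters in the -c directory are generated by sampling k random points from the input:

mahout kmeans
-i hdfs://localhost:9000/sparse_vectors/tfidf-vectors/
-c hdfs://localhost:9000/kmeans_clusters/
-o hdfs://localhost:9000/clustered_data/
-dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
-x 10
-k 3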