Mahout: A Concise Tutorial
Mahout - Classification
What is Classification?
Classification is a machine learning technique that uses known data to determine how new data should be assigned to a set of existing categories. For example:
- The iTunes application uses classification to prepare playlists.
- Mail service providers such as Yahoo! and Gmail use this technique to decide whether a new message should be classified as spam. The classification algorithm trains itself by analyzing which messages users mark as spam. Based on that, the classifier decides whether a future message should be delivered to your inbox or to the spam folder.
How Classification Works
While classifying a given set of data, the classifier system performs the following actions:
- First, a new data model is prepared using one of the learning algorithms.
- Then the prepared data model is tested.
- Thereafter, this data model is used to evaluate new data and determine its class.
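The three actions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "model" is a plain table of per-class word counts standing in for a real learning algorithm, and the function names (train, evaluate, classify) and the sample messages are hypothetical, not part of Mahout.

```python
from collections import Counter, defaultdict

def train(labeled_docs):
    """Phase 1: build a model from labeled examples.

    The model is just per-class word counts, a stand-in for
    whatever learning algorithm is actually used.
    """
    model = defaultdict(Counter)
    for label, text in labeled_docs:
        model[label].update(text.lower().split())
    return model

def classify(model, text):
    """Phase 3: pick the class whose training words overlap the new text most."""
    words = text.lower().split()
    scores = {label: sum(counts[w] for w in words) for label, counts in model.items()}
    return max(scores, key=scores.get)

def evaluate(model, labeled_docs):
    """Phase 2: measure accuracy on labeled data held out from training."""
    correct = sum(1 for label, text in labeled_docs if classify(model, text) == label)
    return correct / len(labeled_docs)

# Hypothetical sample data, invented for this illustration.
training_set = [
    ("spam", "win cash prize now"),
    ("spam", "cheap prize offer"),
    ("ham", "meeting agenda for monday"),
    ("ham", "project meeting notes"),
]
held_out_set = [
    ("spam", "claim your cash prize"),
    ("ham", "notes from the meeting"),
]

model = train(training_set)                       # prepare the model
accuracy = evaluate(model, held_out_set)          # test the model
prediction = classify(model, "free prize offer")  # classify new data
print(accuracy, prediction)
```

Keeping the held-out set separate from the training set is the point of the second phase: accuracy measured on data the model has already seen would overstate how well it handles new data.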

Applications of Classification
- Credit card fraud detection - The classification mechanism is used to predict credit card fraud. Using historical information about previous fraud, the classifier can predict which future transactions are likely to be fraudulent.
- Spam e-mail - Based on the characteristics of previous spam messages, the classifier determines whether a newly received e-mail should be sent to the spam folder.
Naive Bayes Classifier
Mahout implements the Naive Bayes classifier algorithm. It provides two implementations:
- Distributed Naive Bayes classification
- Complementary Naive Bayes classification
Naive Bayes is a simple technique for constructing classifiers. It is not a single algorithm for training such classifiers, but a family of algorithms that share one assumption: given the class, the features are treated as independent of one another. A naive Bayes classifier builds a model from the available labeled data and uses that model to classify problem instances.
An advantage of naive Bayes is that it requires only a small amount of training data to estimate the parameters necessary for classification.
For some types of probability models, naive Bayes classifiers can be trained very efficiently in a supervised learning setting.
Despite their oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations.
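To make the idea concrete, here is a minimal sketch of a multinomial naive Bayes text classifier in plain Python. It is not Mahout's implementation: the function names, the whitespace tokenization, the add-one (Laplace) smoothing, and the sample messages are all assumptions chosen for illustration. Each class is scored as log P(class) plus the sum of log P(word | class) over the words in the text, and the highest-scoring class wins.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Collect the counts a multinomial naive Bayes model needs."""
    class_doc_counts = Counter()        # documents per class, for the prior
    word_counts = defaultdict(Counter)  # word occurrences per class
    vocabulary = set()
    for label, text in labeled_docs:
        words = text.lower().split()
        class_doc_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_doc_counts, word_counts, vocabulary

def predict(model, text):
    """Return the class maximizing log P(class) + sum of log P(word | class)."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_doc_counts.items():
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one smoothing keeps unseen (word, class) pairs from zeroing the score.
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical sample data, invented for this illustration.
nb_model = train_naive_bayes([
    ("spam", "win cash prize now"),
    ("spam", "cheap prize offer"),
    ("ham", "meeting agenda for monday"),
    ("ham", "project meeting notes"),
])
print(predict(nb_model, "cheap cash offer"))
print(predict(nb_model, "monday project meeting"))
```

Training is a single counting pass over the data, which is why this family of models is cheap to train. Working with log probabilities avoids numeric underflow when many small probabilities are multiplied.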
Procedure of Classification
Follow these steps to implement classification:
- Generate example data
- Create sequence files from the data
- Convert the sequence files to vectors
- Train the model on the vectors
- Test the model
Step 1: Generate Example Data
Generate or download the data to be classified. For example, you can get the 20 newsgroups example data from the following link: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
Create a directory for storing the input data, then download and extract the example as shown below.
$ mkdir classification_example
$ cd classification_example
$ wget http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
$ tar xzvf 20news-bydate.tar.gz
Step 2: Create Sequence Files
Create sequence files from the example using the seqdirectory utility. The syntax for generating them is given below:
mahout seqdirectory -i <input file path> -o <output directory>
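Conceptually, this step turns a directory tree of text files into key/value records, one per file. The Python sketch below mimics that idea only; it does not read or write the Hadoop SequenceFile format. The function name directory_to_records is hypothetical, and the key convention used here (a leading slash plus the path relative to the input directory) is an assumption for illustration. The sample files are invented stand-ins for newsgroup postings.

```python
import os
import tempfile

def directory_to_records(input_dir):
    """Yield one (key, text) pair per file found under input_dir."""
    for root, _dirs, files in os.walk(input_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            # Assumed key convention: "/" + path relative to the input directory.
            key = "/" + os.path.relpath(path, input_dir).replace(os.sep, "/")
            # latin-1 decodes any byte sequence, so odd bytes in old postings cannot fail.
            with open(path, encoding="latin-1") as handle:
                yield key, handle.read()

# Build a tiny stand-in for the extracted data: one subdirectory per category.
with tempfile.TemporaryDirectory() as input_dir:
    for group, name, body in [
        ("sci.space", "1001", "Shuttle launch delayed"),
        ("rec.autos", "2001", "Engine oil question"),
    ]:
        os.makedirs(os.path.join(input_dir, group), exist_ok=True)
        with open(os.path.join(input_dir, group, name), "w", encoding="latin-1") as handle:
            handle.write(body)
    records = dict(directory_to_records(input_dir))

for key, text in sorted(records.items()):
    print(key, "->", text)
```

Because the category name is part of each key, a later step can recover the label of every document from its key alone.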
Step 3: Convert Sequence Files to Vectors
Create vector files from the sequence files using the seq2sparse utility. Some of its options are given below:
$MAHOUT_HOME/bin/mahout seq2sparse
--analyzerName (-a) analyzerName    The class name of the analyzer
--chunkSize (-chunk) chunkSize      The chunk size in megabytes
--output (-o) output                The directory pathname for output
--input (-i) input                  Path to the job input directory
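What "converting text to sparse vectors" means can be sketched in plain Python. This is a textbook TF-IDF weighting under stated assumptions, not a reproduction of what seq2sparse computes: the function names are hypothetical, whitespace splitting stands in for the analyzer that tokenizes the text, and the weight used is term count times log(N / document frequency). Each vector stores only its non-zero entries, keyed by the term's index in a dictionary.

```python
import math
from collections import Counter

def build_dictionary(tokenized_docs):
    """Map each distinct term to an integer index."""
    terms = sorted({term for doc in tokenized_docs for term in doc})
    return {term: index for index, term in enumerate(terms)}

def to_sparse_tfidf(tokenized_docs):
    """Return (dictionary, vectors); each vector maps term index to TF-IDF weight."""
    dictionary = build_dictionary(tokenized_docs)
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    vectors = []
    for doc in tokenized_docs:
        term_counts = Counter(doc)
        vectors.append({
            dictionary[term]: count * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
            if doc_freq[term] < n_docs  # a term in every document has weight zero
        })
    return dictionary, vectors

# Hypothetical sample documents, tokenized by whitespace.
docs = [text.lower().split() for text in [
    "the shuttle launch",
    "the engine oil",
    "the shuttle engine",
]]
dictionary, vectors = to_sparse_tfidf(docs)
print(dictionary)
print(vectors[0])
```

Terms that occur in every document carry no information for telling classes apart, so they drop out of the vectors entirely, which is what keeps the representation sparse.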