Mahout Tutorial
Mahout - Introduction
We are living in a day and age where information is available in abundance. The information overload has scaled to such heights that sometimes it becomes difficult to manage even our small mailboxes! Imagine the volume of data and records that popular websites such as Facebook, Twitter, and YouTube have to collect and manage on a daily basis. It is not uncommon even for lesser-known websites to receive huge volumes of information.
Normally we fall back on data mining algorithms to analyze bulk data, identify trends, and draw conclusions. However, no data mining algorithm can process very large datasets and deliver results quickly unless the computational tasks are distributed across multiple machines in the cloud.
We now have frameworks that allow us to break a computation task into multiple segments and run each segment on a different machine. Mahout is such a data mining framework; it normally runs coupled with Hadoop infrastructure in the background to manage huge volumes of data.
What is Apache Mahout?
A mahout is one who drives an elephant as its master. The name comes from the project's close association with Apache Hadoop, which uses an elephant as its logo.
Hadoop is an open-source framework from Apache that allows you to store and process big data in a distributed environment across clusters of computers using simple programming models.
Apache Mahout is an open-source project that is primarily used for creating scalable machine learning algorithms. It implements popular machine learning techniques such as the following (a brief recommender sketch appears after the list):
- Recommendation
- Classification
- Clustering
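To make the recommendation part concrete, below is a minimal sketch of a user-based recommender built with the Taste API that ships with classic Mahout (the pre-0.14 org.apache.mahout.cf.taste packages). The file name ratings.csv, the neighborhood size, and the user ID are hypothetical placeholders; the file is assumed to hold userID,itemID,preference lines.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
  public static void main(String[] args) throws Exception {
    // ratings.csv is a hypothetical file of userID,itemID,preference lines
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // Compare users by the Pearson correlation of their ratings
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

    // Treat the 10 most similar users as the neighborhood
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

    // Classic user-based collaborative filtering recommender
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 3 item recommendations for user 1
    List<RecommendedItem> recommendations = recommender.recommend(1, 3);
    for (RecommendedItem item : recommendations) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}
```

In the same API, item-based and matrix-factorization recommenders are set up in much the same way, reusing the same DataModel and swapping the similarity and recommender classes.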
Apache Mahout started as a sub-project of Apache Lucene in 2008. In 2010, Mahout became a top-level project of Apache.
Features of Mahout
The key features of Apache Mahout are listed below.
- The algorithms of Mahout are written on top of Hadoop, so they work well in distributed environments. Mahout uses the Apache Hadoop library to scale effectively in the cloud.
- Mahout offers the coder a ready-to-use framework for doing data mining tasks on large volumes of data.
- Mahout lets applications analyze large sets of data effectively and quickly.
- Includes several MapReduce-enabled clustering implementations such as k-means, fuzzy k-means, Canopy, Dirichlet, and Mean-Shift.
- Supports Distributed Naive Bayes and Complementary Naive Bayes classification implementations.
- Comes with distributed fitness function capabilities for evolutionary programming.
- Includes matrix and vector libraries (a short example follows this list).
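As a small illustration of the matrix and vector libraries, the sketch below uses the org.apache.mahout.math classes for a few basic vector operations; the values are arbitrary.

```java
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class VectorSketch {
  public static void main(String[] args) {
    // Two small dense vectors with arbitrary values
    Vector a = new DenseVector(new double[] {1.0, 2.0, 3.0});
    Vector b = new DenseVector(new double[] {4.0, 5.0, 6.0});

    // Element-wise addition and dot product
    Vector sum = a.plus(b);
    double dot = a.dot(b);

    // Euclidean (L2) norm of the first vector
    double norm = a.norm(2);

    System.out.println("sum  = " + sum);
    System.out.println("dot  = " + dot);
    System.out.println("norm = " + norm);
  }
}
```

These Vector types (and the companion Matrix classes) are the building blocks the distributed algorithms above operate on, typically after the input data has been converted into vector form.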
Applications of Mahout
- Companies such as Adobe, Facebook, LinkedIn, Foursquare, Twitter, and Yahoo use Mahout internally.
- Foursquare helps you find places, food, and entertainment available in a particular area. It uses the recommender engine of Mahout.
- Twitter uses Mahout for user interest modelling.
- Yahoo! uses Mahout for pattern mining.