Mahout: A Concise Tutorial

Mahout - Machine Learning

Apache Mahout is a highly scalable machine learning library that gives developers access to optimized algorithms. It implements popular machine learning techniques such as recommendation, classification, and clustering, so a brief overview of machine learning is useful before going further.

What is Machine Learning?

Machine learning is a branch of computer science that deals with programming systems so that they automatically learn and improve with experience. Here, learning means recognizing and understanding the input data and making informed decisions based on it.

It is very difficult to hand-code a decision for every possible input. To tackle this problem, algorithms are developed that build knowledge from specific data and past experience, drawing on the principles of statistics, probability theory, logic, combinatorial optimization, search, reinforcement learning, and control theory.

These algorithms form the basis of various applications, such as:

  1. Vision processing

  2. Language processing

  3. Forecasting (e.g., stock market trends)

  4. Pattern recognition

  5. Games

  6. Data mining

  7. Expert systems

  8. Robotics

Machine learning is a vast area, and covering all of it is beyond the scope of this tutorial. There are several ways to implement machine learning techniques; the most commonly used are supervised and unsupervised learning.

Supervised Learning

Supervised learning deals with learning a function from available training data. A supervised learning algorithm analyzes labeled training data and produces an inferred function, which can then be used to map new examples to outputs. Common examples of supervised learning include:

  1. classifying e-mails as spam,

  2. labeling webpages based on their content, and

  3. voice recognition.

There are many supervised learning algorithms, such as neural networks, Support Vector Machines (SVMs), and Naive Bayes classifiers. Mahout implements a Naive Bayes classifier.
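To make the idea of learning a function from training data concrete, here is a minimal sketch of a Naive Bayes text classifier in plain Python. It is not Mahout's implementation or API; the function names (`train_naive_bayes`, `classify`), the tiny training set, and the whitespace tokenization are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        # Start from the log prior of the label.
        score = math.log(count / total_examples)
        words_in_label = sum(word_counts[label].values())
        for word in text.lower().split():
            # Laplace (add-one) smoothing avoids log(0) for unseen words.
            numerator = word_counts[label][word] + 1
            denominator = words_in_label + len(vocabulary)
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled training data ("ham" means not spam).
training = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
model = train_naive_bayes(training)
print(classify(model, "free prize money"))
print(classify(model, "monday team meeting"))
```

The trained model plays the role of the inferred function: it is built once from the labeled examples and then applied to messages it has never seen.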

Unsupervised Learning

Unsupervised learning makes sense of unlabeled data without any predefined dataset for training. It is an extremely powerful tool for analyzing available data and looking for patterns and trends, and it is most commonly used to cluster similar inputs into logical groups. Common approaches to unsupervised learning include:

  1. k-means,

  2. self-organizing maps, and

  3. hierarchical clustering.
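As an illustration of the first approach in the list, the following is a minimal k-means sketch in plain Python. It is not Mahout's implementation; the function name `kmeans`, the sample points, and the naive choice of the first `k` points as initial centroids are assumptions made to keep the example short and deterministic.

```python
def kmeans(points, k, iterations=10):
    """Group points into k clusters by repeatedly reassigning and averaging."""
    # Naive initialization (an assumption): use the first k points as centroids.
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            distances = [
                sum((a - b) ** 2 for a, b in zip(point, centroid))
                for centroid in centroids
            ]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old_centroid in zip(clusters, centroids):
            if cluster:
                mean = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
                new_centroids.append(mean)
            else:
                new_centroids.append(old_centroid)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments have stabilized
        centroids = new_centroids
    return centroids, clusters

# Hypothetical unlabeled data: two visually separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

No labels are supplied anywhere; the grouping emerges only from the distances between the points.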

Recommendation

Recommendation is a popular technique that provides closely matched suggestions based on user information such as previous purchases, clicks, and ratings.

  1. Amazon uses this technique to display a list of recommended items that you might be interested in. Recommender engines working behind the site capture user behavior and select items based on your earlier actions.

  2. Facebook uses the recommender technique to identify and suggest the "People You May Know" list.
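The idea behind such engines can be sketched with a small user-based collaborative filter in plain Python. This is only an illustration under stated assumptions, not how Amazon, Facebook, or Mahout implement recommendation; the ratings data, user names, and the functions `cosine_similarity` and `recommend` are hypothetical.

```python
import math

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "alice": {"book_a": 5, "book_b": 4, "book_c": 1},
    "bob": {"book_a": 4, "book_b": 5, "book_d": 5},
    "carol": {"book_c": 5, "book_e": 4},
}

def cosine_similarity(a, b):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(user, ratings, top_n=3):
    """Rank items the user has not rated, weighted by similar users' ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        if similarity <= 0:
            continue  # ignore users with nothing in common
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("alice", ratings))
```

Here `bob` rates the same books as `alice` in a similar way, so the item only he has rated ranks above the item suggested by the less similar `carol`.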

Classification

Classification, also known as categorization, is a machine learning technique that uses known data to determine how new data should be assigned to a set of existing categories. Classification is a form of supervised learning.

  1. Mail service providers such as Yahoo! and Gmail use this technique to decide whether a new message should be classified as spam. The categorization algorithm trains itself by analyzing which messages users mark as spam. Based on that, the classifier decides whether a future message should be delivered to your inbox or to the spam folder.

  2. The iTunes application uses classification to prepare playlists.

Clustering

Clustering is used to form groups, or clusters, of similar data based on common characteristics. Clustering is a form of unsupervised learning.

  1. Search engines such as Google and Yahoo! use clustering techniques to group data with similar characteristics.

  2. Newsgroups use clustering techniques to group various articles based on related topics.

The clustering engine goes through the input data completely and, based on the characteristics of each item, decides which cluster it should be grouped under. Take a look at the following example.

Our library of tutorials contains topics on various subjects. When we receive a new tutorial at TutorialsPoint, it gets processed by a clustering engine that decides, based on its content, where it should be grouped.
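The placement of a new tutorial can be sketched as follows, assuming the groups already exist and each is summarized by a bag of words. This is a hypothetical illustration, not TutorialsPoint's or Mahout's actual engine; the cluster names, word profiles, and the functions `cosine` and `assign_cluster` are invented for the example.

```python
import math
from collections import Counter

# Hypothetical existing clusters, each summarized by word counts of its content.
clusters = {
    "java": Counter("java class object jvm thread class".split()),
    "databases": Counter("sql table query index join table".split()),
}

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def assign_cluster(text, clusters):
    """Return the name of the cluster whose word profile best matches the text."""
    words = Counter(text.lower().split())
    return max(clusters, key=lambda name: cosine(words, clusters[name]))

new_tutorial = "how to write a sql query that uses a join"
print(assign_cluster(new_tutorial, clusters))
```

The decision uses only the content of the new tutorial: it shares words such as `sql`, `query`, and `join` with one profile and none with the other.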