Big Data Analytics 简明教程

Big Data Analytics - Problem Definition

通过本教程，我们将开发一个项目。本教程中每一后续章节都会在迷你项目部分处理大项目的一部分。这一部分被认为是一个应用教程部分，它将提供接触现实世界问题的经验。在这种情况下，我们将从该项目的定义问题开始。

Through this tutorial, we will develop a project. Each subsequent chapter in this tutorial deals with a part of the larger project in the mini-project section. This is thought to be an applied tutorial section that will provide exposure to a real-world problem. In this case, we would start with the problem definition of the project.

Project Description

该项目的目的是开发一个机器学习模型，以预测人们使用其简历（CV）文本作为输入的小时工资。

The objective of this project would be to develop a machine learning model to predict the hourly salary of people using their curriculum vitae (CV) text as input.

使用上面定义的框架，定义问题很容易。我们可以将 X = {x1, x2, …, xn} 定义为用户的 CV，其中每个特征可能以最简单的方式出现，即此单词出现的次数。然后响应是有实值的，我们正试图预测个人每小时的工资，单位为美元。

Using the framework defined above, it is simple to define the problem. We can define X = {x1, x2, …, xn} as the CV’s of users, where each feature can be, in the simplest way possible, the amount of times this word appears. Then the response is real valued, we are trying to predict the hourly salary of individuals in dollars.

这两个考虑足以得出结论，即可以通过监督回归算法解决所提出的问题。

These two considerations are enough to conclude that the problem presented can be solved with a supervised regression algorithm.

Problem Definition

Problem Definition 可能是在大数据分析管道中最复杂、最容易被忽视的阶段之一。为了定义数据产品要解决的问题，经验是强制性的。大多数数据科学家准学者在这个阶段几乎没有或完全没有经验。

Problem Definition is probably one of the most complex and heavily neglected stages in the big data analytics pipeline. In order to define the problem a data product would solve, experience is mandatory. Most data scientist aspirants have little or no experience in this stage.

大多数大数据问题可以按以下方式分类−

Most big data problems can be categorized in the following ways −

Supervised classification
Supervised regression
Unsupervised learning
Learning to rank

现在让我们进一步了解这四个概念。

Let us now learn more about these four concepts.

Supervised Classification

给定特征矩阵 X = {x1, x2, …, xn}，我们开发模型 M 来预测定义为 y = {c1, c2, …, cn} 的不同类。例如：给定保险公司客户的交易数据，可以开发一个模型来预测客户是否会流失。后者是一个二元分类问题，其中有两个类或目标变量：流失和不流失。

Given a matrix of features X = {x1, x2, …, xn} we develop a model M to predict different classes defined as y = {c1, c2, …, cn}. For example: Given transactional data of customers in an insurance company, it is possible to develop a model that will predict if a client would churn or not. The latter is a binary classification problem, where there are two classes or target variables: churn and not churn.

其他问题涉及预测多个类，我们可能对识别数字感兴趣，因此响应向量将定义为：y = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}，最先进的模型将卷积神经网络和特征矩阵将定义为图像像素。

Other problems involve predicting more than one class, we could be interested in doing digit recognition, therefore the response vector would be defined as: y = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, a-state-of-the-art model would be convolutional neural network and the matrix of features would be defined as the pixels of the image.

Supervised Regression

在这种情况下，问题定义与前面的例子非常相似；区别在于响应。在回归问题中，响应 y ∈ ℜ，这意味着该响应是有实值的。例如，我们可以开发一个模型来预测给定简历语料库的情况下个人的每小时工资。

In this case, the problem definition is rather similar to the previous example; the difference relies on the response. In a regression problem, the response y ∈ ℜ, this means the response is real valued. For example, we can develop a model to predict the hourly salary of individuals given the corpus of their CV.

Unsupervised Learning

管理层常常渴望获得新的见解。分段模型可以提供此见解，以便营销部门为不同的细分开发产品。开发分段模型的一个好方法是选择与所需细分相关的特征，而不是考虑算法。

Management is often thirsty for new insights. Segmentation models can provide this insight in order for the marketing department to develop products for different segments. A good approach for developing a segmentation model, rather than thinking of algorithms, is to select features that are relevant to the segmentation that is desired.

例如，在电信公司中，按手机使用情况对客户进行细分很有意思。这将涉及忽视与细分目标无关的特征，并且仅包含相关的特征。在这种情况下，这将选择诸如一个月内使用的短信数量、入站和出站分钟数等特征。

For example, in a telecommunications company, it is interesting to segment clients by their cellphone usage. This would involve disregarding features that have nothing to do with the segmentation objective and including only those that do. In this case, this would be selecting features as the number of SMS used in a month, the number of inbound and outbound minutes, etc.

Learning to Rank

该问题可以被认为是一个回归问题，但它具有特殊特征，值得单独处理。该问题涉及给定我们寻求在给定查询的情况下查找最相关排序的文档集合。为了开发一个监督学习算法，需要标记给定查询的情况下排序的相关程度。

This problem can be considered as a regression problem, but it has particular characteristics and deserves a separate treatment. The problem involves given a collection of documents we seek to find the most relevant ordering given a query. In order to develop a supervised learning algorithm, it is needed to label how relevant an ordering is, given a query.

需要注意的是，为了开发一个监督学习算法，需要标记训练数据。这意味着为了训练一个例如能够从图像中识别数字的模型，我们需要手动标记大量示例。有一些 Web 服务可以加速此过程，并且通常用于此任务，例如 Amazon Mechanical Turk。事实证明，当提供更多数据时，学习算法可以提高其性能，因此在监督学习中实际上必须标记大量示例。

It is relevant to note that in order to develop a supervised learning algorithm, it is needed to label the training data. This means that in order to train a model that will, for example, recognize digits from an image, we need to label a significant amount of examples by hand. There are web services that can speed up this process and are commonly used for this task such as amazon mechanical turk. It is proven that learning algorithms improve their performance when provided with more data, so labeling a decent amount of examples is practically mandatory in supervised learning.