Data Mining: A Concise Tutorial

Data Mining - Classification & Prediction

Two forms of data analysis can be used to extract models that describe important classes or to predict future data trends. They are as follows −

  1. Classification

  2. Prediction

Classification models predict categorical class labels, whereas prediction models predict continuous-valued functions. For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to estimate how much potential customers will spend on computer equipment, in dollars, given their income and occupation.
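To make the distinction concrete, here is a minimal Python sketch. Both functions are hypothetical stand-ins: the names `classify_loan` and `predict_spending`, the threshold, and the coefficients are made up for illustration and are not learned from any real data. The point is only the return type: a class label in one case, a number in the other.

```python
# Toy illustration only: the rule and the coefficients below are invented,
# not learned from data.

def classify_loan(income: float, debt: float) -> str:
    """Classification: returns a categorical class label."""
    return "safe" if income >= 3 * debt else "risky"


def predict_spending(income: float, occupation: str) -> float:
    """Prediction: returns a continuous value (dollars)."""
    occupation_bonus = {"engineer": 400.0, "teacher": 150.0}
    return 0.02 * income + occupation_bonus.get(occupation, 0.0)


label = classify_loan(income=60_000, debt=10_000)
amount = predict_spending(income=60_000, occupation="engineer")
print(label)   # a label: "safe"
print(amount)  # a number of dollars
```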

What is classification?

The following are examples of cases where the data analysis task is classification −

  1. A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.

  2. A marketing manager at a company needs to determine whether a customer with a given profile will buy a new computer.

In both of the above examples, a model or classifier is constructed to predict the categorical labels. These labels are "risky" or "safe" for the loan application data and "yes" or "no" for the marketing data.

What is prediction?

The following is an example of a case where the data analysis task is prediction −

Suppose the marketing manager needs to predict how much a given customer will spend during a sale at the company. In this example we are interested in predicting a numeric value, so the data analysis task is an example of numeric prediction. In this case, a model or predictor is constructed that predicts a continuous-valued function, or ordered value.

Note − Regression analysis is a statistical methodology that is most often used for numeric prediction.
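As a rough illustration of numeric prediction by regression, the sketch below fits a straight line by ordinary least squares with a single input attribute. The function name `fit_line` and the income and spending figures are assumptions made up for this example; real regression analysis usually involves more attributes and a check of model fit.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical training data: income (thousands of dollars) and
# amount spent during a sale (dollars).
incomes = [20.0, 40.0, 60.0, 80.0]
spent = [150.0, 250.0, 350.0, 450.0]

intercept, slope = fit_line(incomes, spent)

# Predict the spending of a new customer with an income of 50 (thousand).
predicted = intercept + slope * 50.0
print(intercept, slope, predicted)
```

Because the made-up data lies exactly on a line, the fitted model reproduces it; with real data the line would only approximate the observations.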

How Does Classification Work?

Let us use the bank loan application discussed above to understand how classification works. The data classification process includes two steps −

  1. Building the Classifier or Model

  2. Using Classifier for Classification

Building the Classifier or Model

  1. This step is the learning step or the learning phase.

  2. In this step the classification algorithms build the classifier.

  3. The classifier is built from the training set made up of database tuples and their associated class labels.

  4. Each tuple in the training set is assumed to belong to a predefined category or class. These tuples are also referred to as samples, objects, or data points.

[Figure: Building the classifier]
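The learning step can be sketched with a deliberately simple classifier: a single-threshold rule on one attribute. This is an illustrative assumption, not the algorithm the tutorial has in mind; the function name `build_classifier`, the income attribute, and the training tuples are all hypothetical.

```python
def build_classifier(training_set):
    """Learning step: pick the income threshold with the fewest training errors.

    The learned rule is: income >= threshold -> "safe", otherwise "risky".
    """
    candidates = sorted({income for income, _ in training_set})
    best_threshold, best_errors = None, len(training_set) + 1
    for threshold in candidates:
        errors = sum(
            ("safe" if income >= threshold else "risky") != label
            for income, label in training_set
        )
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


# Training set: database tuples (income in thousands) with their class labels.
training_set = [
    (18, "risky"), (25, "risky"), (31, "risky"),
    (42, "safe"), (55, "safe"), (67, "safe"),
]

threshold = build_classifier(training_set)
print(threshold)  # 42
```

Practical classification algorithms such as decision tree induction build much richer models, but the shape of the step is the same: labeled tuples go in, a classifier comes out.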

Using Classifier for Classification

In this step, the classifier is used for classification. Test data is used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to new data tuples.

[Figure: Using the classifier for classification]
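The second step might look like the following sketch. The rule inside `classify` is a hypothetical classifier assumed to have come out of the learning step, and the test tuples and the 0.75 acceptance level are made-up values chosen for illustration.

```python
def classify(income, threshold=42):
    """Hypothetical learned rule: income >= threshold -> "safe"."""
    return "safe" if income >= threshold else "risky"


# Test set: labeled tuples that were NOT used to build the classifier.
test_set = [(20, "risky"), (38, "safe"), (45, "safe"), (60, "safe"), (29, "risky")]

correct = sum(classify(income) == label for income, label in test_set)
accuracy = correct / len(test_set)
print(accuracy)  # 0.8 (4 of 5 test tuples classified correctly)

ACCEPTABLE_ACCURACY = 0.75  # assumed acceptance level for this example

# Only apply the rule to new, unlabeled tuples if it is accurate enough.
new_incomes = [33, 58]
if accuracy >= ACCEPTABLE_ACCURACY:
    new_labels = [classify(income) for income in new_incomes]
else:
    new_labels = []
print(new_labels)  # ['risky', 'safe']
```

Keeping the test tuples separate from the training tuples matters: accuracy measured on the training set tends to be optimistic.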

Classification and Prediction Issues

The major issue is preparing the data for classification and prediction. Preparing the data involves the following activities −

  1. Data Cleaning − Data cleaning involves removing noise and treating missing values. Noise is removed by applying smoothing techniques, and missing values are handled by replacing each missing value with the most commonly occurring value for that attribute.

  2. Relevance Analysis − The database may also contain irrelevant attributes. Correlation analysis is used to determine whether any two given attributes are related.

  3. Data Transformation and Reduction − The data can be transformed by either of the following methods:

     Normalization − Normalization scales all values of a given attribute so that they fall within a small specified range. It is used when the learning step involves neural networks or methods based on distance measurements.

     Generalization − The data can also be transformed by generalizing it to higher-level concepts. Concept hierarchies can be used for this purpose.
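The preparation activities above can be sketched in a few lines of Python. All function names, the sample values, and the city-to-country hierarchy are assumptions introduced for illustration. Min-max scaling is used here as one common form of normalization, and the Pearson coefficient as one common correlation measure; the text does not prescribe either.

```python
from collections import Counter
from math import sqrt


def fill_missing_with_mode(values):
    """Data cleaning: replace None with the most common known value."""
    known = [v for v in values if v is not None]
    mode = Counter(known).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def pearson(xs, ys):
    """Relevance analysis: Pearson correlation between two numeric attributes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Normalization: scale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]


# Generalization: a hypothetical one-level concept hierarchy (city -> country).
CITY_TO_COUNTRY = {"Pune": "India", "Delhi": "India", "Lyon": "France"}

occupations = ["clerk", None, "engineer", "clerk", None]
incomes = [20.0, 30.0, 40.0, 50.0, 60.0]
spending = [200.0, 300.0, 400.0, 500.0, 600.0]
cities = ["Pune", "Lyon", "Delhi"]

filled = fill_missing_with_mode(occupations)
correlation = pearson(incomes, spending)  # close to 1.0: strongly related
normalized = min_max_normalize(incomes)
countries = [CITY_TO_COUNTRY[city] for city in cities]

print(filled)      # ['clerk', 'clerk', 'engineer', 'clerk', 'clerk']
print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(countries)   # ['India', 'France', 'India']
```

A correlation close to 1 or -1 suggests that one of the two attributes is largely redundant, which is the kind of finding relevance analysis looks for.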

Note − Data can also be reduced by other methods such as wavelet transformation, binning, histogram analysis, and clustering.
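Of these methods, binning is the easiest to sketch. The example below assumes equal-frequency bins and replaces each value with the mean of its bin, which reduces the number of distinct values. The function name `smooth_by_bin_means`, the bin size, and the price list are illustrative choices, not taken from the text.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size items,
    and replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, bin_size=3)
print(smoothed)
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Eight distinct prices collapse to three, at the cost of losing detail within each bin.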

Comparison of Classification and Prediction Methods

Here are the criteria for comparing the methods of classification and prediction −

  1. Accuracy − The accuracy of a classifier refers to its ability to predict the class label correctly. The accuracy of a predictor refers to how well it can estimate the value of the predicted attribute for new data.

  2. Speed − This refers to the computational cost in generating and using the classifier or predictor.

  3. Robustness − This refers to the ability of the classifier or predictor to make correct predictions from noisy data.

  4. Scalability − This refers to the ability to construct the classifier or predictor efficiently given a large amount of data.

  5. Interpretability − This refers to the level of understanding and insight that the classifier or predictor provides.
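The first two criteria are the easiest to quantify. The sketch below measures classifier accuracy as the fraction of correct labels, predictor accuracy as root-mean-square error (one of several possible error measures), and speed as wall-clock time. The helper names, the sample labels and values, and the threshold rule in `classify` are all hypothetical.

```python
import time
from math import sqrt


def accuracy(true_labels, predicted_labels):
    """Classifier accuracy: fraction of labels predicted correctly."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


def rmse(true_values, predicted_values):
    """Predictor error: root-mean-square error (lower is better)."""
    squared = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return sqrt(sum(squared) / len(squared))


def classify(income, threshold=42):
    """Hypothetical classifier used only to have something to time."""
    return "safe" if income >= threshold else "risky"


acc = accuracy(["safe", "risky", "safe", "safe"],
               ["safe", "risky", "risky", "safe"])
err = rmse([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])

# Speed: time how long it takes to use the classifier on many tuples.
start = time.perf_counter()
labels = [classify(income) for income in range(100_000)]
elapsed = time.perf_counter() - start

print(acc)      # 0.75
print(err)      # about 8.16
print(elapsed)  # seconds; varies by machine
```

Robustness and scalability can be probed with the same measurements by repeating them on noisier or larger data sets. Interpretability is harder to reduce to a number.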