Data Mining 简明教程

Data Mining - Cluster Analysis

簇是属于同一类的对象组。换句话说,相似对象被归入一个簇中,而不相似对象被归入另一个簇中。

Cluster is a group of objects that belongs to the same class. In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.

What is Clustering?

聚类是将一组抽象对象变成相似对象类的过程。

Clustering is the process of making a group of abstract objects into classes of similar objects.

Points to Remember

Points to Remember

  1. A cluster of data objects can be treated as one group.

  2. While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign the labels to the groups.

  3. The main advantage of clustering over classification is that, it is adaptable to changes and helps single out useful features that distinguish different groups.

Applications of Cluster Analysis

  1. Clustering analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing.

  2. Clustering can also help marketers discover distinct groups in their customer base. And they can characterize their customer groups based on the purchasing patterns.

  3. In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionalities and gain insight into structures inherent to populations.

  4. Clustering also helps in identification of areas of similar land use in an earth observation database. It also helps in the identification of groups of houses in a city according to house type, value, and geographic location.

  5. Clustering also helps in classifying documents on the web for information discovery.

  6. Clustering is also used in outlier detection applications such as detection of credit card fraud.

  7. As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data to observe characteristics of each cluster.

Requirements of Clustering in Data Mining

以下要点阐明了为什么需要在数据挖掘中进行聚类 −

The following points throw light on why clustering is required in data mining −

  1. Scalability − We need highly scalable clustering algorithms to deal with large databases.

  2. Ability to deal with different kinds of attributes − Algorithms should be capable to be applied on any kind of data such as interval-based (numerical) data, categorical, and binary data.

  3. Discovery of clusters with attribute shape − The clustering algorithm should be capable of detecting clusters of arbitrary shape. They should not be bounded to only distance measures that tend to find spherical cluster of small sizes.

  4. High dimensionality − The clustering algorithm should not only be able to handle low-dimensional data but also the high dimensional space.

  5. Ability to deal with noisy data − Databases contain noisy, missing or erroneous data. Some algorithms are sensitive to such data and may lead to poor quality clusters.

  6. Interpretability − The clustering results should be interpretable, comprehensible, and usable.

Clustering Methods

聚类方法可归类为以下类别−

Clustering methods can be classified into the following categories −

  1. Partitioning Method

  2. Hierarchical Method

  3. Density-based Method

  4. Grid-Based Method

  5. Model-Based Method

  6. Constraint-based Method

Partitioning Method

假设我们给定一个由“n”个对象组成的数据库,并且划分法构建了“k”个数据分区。每个分区都将表示一个簇,并且k ≤ n。这意味着它会将数据分成k个组,满足以下要求−

Suppose we are given a database of ‘n’ objects and the partitioning method constructs ‘k’ partition of data. Each partition will represent a cluster and k ≤ n. It means that it will classify the data into k groups, which satisfy the following requirements −

  1. Each group contains at least one object.

  2. Each object must belong to exactly one group.

Points to remember −

Points to remember −

  1. For a given number of partitions (say k), the partitioning method will create an initial partitioning.

  2. Then it uses the iterative relocation technique to improve the partitioning by moving objects from one group to other.

Hierarchical Methods

这种方法对给定的数据对象集创建了一种分层分解。我们可以根据分层分解的形成方式对分层方法进行分类。这里有两种方法−

This method creates a hierarchical decomposition of the given set of data objects. We can classify hierarchical methods on the basis of how the hierarchical decomposition is formed. There are two approaches here −

  1. Agglomerative Approach

  2. Divisive Approach

Agglomerative Approach

这种方法也称为自下而上的方法。在这种方法中,我们从每个对象形成一个单独的组开始。其不断合并彼此接近的对象或组。其会不断这样做,直到所有组都合并为一个或直到终止条件成立。

This approach is also known as the bottom-up approach. In this, we start with each object forming a separate group. It keeps on merging the objects or groups that are close to one another. It keep on doing so until all of the groups are merged into one or until the termination condition holds.

Divisive Approach

这种方法也称为自上而下的方法。在这种方法中,我们从同一个簇中的所有对象开始。在连续迭代中,将一个簇分成较小的簇。其会进行此操作,直到一个簇中的每个对象或终止条件成立。该方法是严格的,即一旦合并或分裂完成,则永远无法撤消。

This approach is also known as the top-down approach. In this, we start with all of the objects in the same cluster. In the continuous iteration, a cluster is split up into smaller clusters. It is down until each object in one cluster or the termination condition holds. This method is rigid, i.e., once a merging or splitting is done, it can never be undone.

Approaches to Improve Quality of Hierarchical Clustering

以下是用于提高分层聚类质量的两种方法−

Here are the two approaches that are used to improve the quality of hierarchical clustering −

  1. Perform careful analysis of object linkages at each hierarchical partitioning.

  2. Integrate hierarchical agglomeration by first using a hierarchical agglomerative algorithm to group objects into micro-clusters, and then performing macro-clustering on the micro-clusters.

Density-based Method

此方法基于密度的概念。基本想法是在邻域密度超过一定阈值时继续扩展给定簇,即对于给定簇中的每个数据点,给定簇的半径至少要包含最少数量的点。

This method is based on the notion of density. The basic idea is to continue growing the given cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each data point within a given cluster, the radius of a given cluster has to contain at least a minimum number of points.

Grid-based Method

在此当中,对象共同形成一个网格。对象空间被量子化成有限数量的单元格,形成了网格结构。

In this, the objects together form a grid. The object space is quantized into finite number of cells that form a grid structure.

Advantages

Advantages

  1. The major advantage of this method is fast processing time.

  2. It is dependent only on the number of cells in each dimension in the quantized space.

Model-based methods

在此方法中,假设每个簇一个模型,以找到给定模型的最佳数据拟合。此方法通过对密度函数进行聚类来定位簇。它反映了数据点的空间分布。

In this method, a model is hypothesized for each cluster to find the best fit of data for a given model. This method locates the clusters by clustering the density function. It reflects spatial distribution of the data points.

此方法还提供了一种基于标准统计数据自动确定簇数量的方法,将离群点或噪声考虑在内。因此,它产生了鲁棒的聚类方法。

This method also provides a way to automatically determine the number of clusters based on standard statistics, taking outlier or noise into account. It therefore yields robust clustering methods.

Constraint-based Method

在此方法中,聚类是通过纳入用户或面向应用程序的约束执行的。约束指的是用户期望或所需聚类结果的属性。约束为我们提供了一种与聚类过程进行交互的方式。约束可以由用户或应用程序要求指定。

In this method, the clustering is performed by the incorporation of user or application-oriented constraints. A constraint refers to the user expectation or the properties of desired clustering results. Constraints provide us with an interactive way of communication with the clustering process. Constraints can be specified by the user or the application requirement.