Machine Learning Concise Tutorial

Machine Learning - Density Plots

A density plot shows an estimate of the probability density function of a continuous variable. It is similar to a histogram, but instead of using bars to represent the frequency of values in each bin, it uses a smooth curve to represent the estimated density. The x-axis represents the range of values of the variable, and the y-axis represents the probability density.
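To make the idea of a smooth density curve concrete, here is a minimal sketch of kernel density estimation, the technique behind such plots. It places one Gaussian bump on each sample and averages them. The sample values, the fixed bandwidth of 1.0, and the helper name gaussian_kde_curve are assumptions chosen for illustration; plotting libraries normally pick the bandwidth automatically.

```python
import numpy as np


def gaussian_kde_curve(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at every point in `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # One Gaussian bump per sample, evaluated at every grid point.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    # Average the bumps and rescale so the curve integrates to 1.
    return kernels.sum(axis=1) / (samples.size * bandwidth)


# Hypothetical sample, used only for illustration.
rng = np.random.default_rng(0)
samples = rng.normal(loc=14.0, scale=3.5, size=500)

# Extend the grid well past the data so the tails are included.
grid = np.linspace(samples.min() - 10.0, samples.max() + 10.0, 1000)
density = gaussian_kde_curve(samples, grid, bandwidth=1.0)

# Area under the curve by the trapezoidal rule; it should be close to 1.
area = float(np.sum((density[1:] + density[:-1]) / 2.0 * np.diff(grid)))
print(f"area under the estimated density: {area:.4f}")
```

Because the curve is a density, its height at a point is not a probability; only the area under the curve over an interval is.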

Density plots are useful for identifying patterns in data, such as skewness, modality, and outliers. Skewness refers to the degree of asymmetry in the distribution of the variable. Modality refers to the number of peaks in the distribution. Outliers are data points that fall outside of the range of typical values for the variable.
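These three properties can also be estimated numerically. The sketch below is one possible way to do so, not a standard recipe: skewness is computed as the third standardized moment, peaks are counted as local maxima of an already evaluated density curve, and outliers are flagged with the common 1.5 x IQR rule. The function names and the toy values are hypothetical.

```python
import numpy as np


def sample_skewness(values):
    """Third standardized moment: 0 for symmetric data, positive for a long right tail."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    return float(np.mean(centered**3) / np.mean(centered**2) ** 1.5)


def count_peaks(density):
    """Count interior points that are strictly higher than both neighbours."""
    d = np.asarray(density, dtype=float)
    return int(np.sum((d[1:-1] > d[:-2]) & (d[1:-1] > d[2:])))


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return x[(x < q1 - k * iqr) | (x > q3 + k * iqr)]


# Toy data for illustration: one value far to the right of the rest.
toy = [1.0, 2.0, 3.0, 4.0, 100.0]
print("skewness:", sample_skewness(toy))
print("outliers:", iqr_outliers(toy))

# A toy density curve with two local maxima (a bimodal shape).
print("peaks:", count_peaks([0.0, 0.2, 0.1, 0.3, 0.0]))
```

On a real density curve the peak count depends on how smooth the estimate is, so it is best read together with the plot.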

Python Implementation of Density Plots

Python provides several libraries for data visualization, such as Matplotlib, Seaborn, Plotly, and Bokeh. For our example given below, we will use Seaborn to implement density plots.

We will use the breast cancer dataset from the scikit-learn (sklearn) library for this example. The dataset describes characteristics of breast cancer cells and records whether each sample is malignant or benign. It has 569 samples and 30 features.
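Before plotting, it can help to confirm the layout of the dataset. The short sketch below prints its shape, the name of the first feature column, and the class names, which tells us what column 0 and the labels 0 and 1 refer to.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Rows are samples, columns are features.
print("shape:", data.data.shape)

# Name of the feature stored in column 0.
print("first feature:", data.feature_names[0])

# Class names, listed in the order of the integer labels 0 and 1.
print("classes:", list(data.target_names))
```

The first feature is the mean radius, and label 0 stands for malignant while label 1 stands for benign.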

Example

Let’s start by importing the necessary libraries and loading the dataset −

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()

Next, we will create a density plot of the mean radius feature of the dataset −

plt.figure(figsize=(7.2, 3.5))
# Column 0 of data.data holds the mean radius feature
sns.kdeplot(x=data.data[:, 0], fill=True)
plt.xlabel('Mean Radius')
plt.ylabel('Density')
plt.show()

In this code, we have used the kdeplot() function from Seaborn to create a density plot of the mean radius feature of the dataset. We have set the fill parameter to True to shade the area under the curve; fill replaces the older shade parameter, which is deprecated in recent Seaborn releases. We have also added labels to the x and y axes using the xlabel() and ylabel() functions.

The resulting density plot shows the estimated probability density of mean radius values in the dataset. The distribution has a single peak around 12-14 and a longer tail toward larger values, so it is only approximately normal.

[Figure: density plot of the mean radius feature, drawn with kdeplot()]
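A visual impression of the shape can be checked with a few summary numbers. The sketch below compares the mean and the median of the mean radius feature and computes its skewness as the third standardized moment. For a symmetric distribution the mean and median nearly coincide and the skewness is close to 0. The variable names are ours.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Column 0 of the feature matrix is the mean radius.
mean_radius = load_breast_cancer().data[:, 0]

mean_value = float(np.mean(mean_radius))
median_value = float(np.median(mean_radius))

# Third standardized moment; positive values indicate a longer right tail.
centered = mean_radius - mean_value
skewness = float(np.mean(centered**3) / np.mean(centered**2) ** 1.5)

print(f"mean:     {mean_value:.2f}")
print(f"median:   {median_value:.2f}")
print(f"skewness: {skewness:.2f}")
```

A mean above the median together with a positive skewness points to a tail on the right side of the peak.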

Density Plot with Multiple Data Sets

We can also create a density plot with multiple data sets to compare their probability density functions. Let’s create density plots of the mean radius feature for both the malignant and benign samples −

Example

plt.figure(figsize=(7.5, 3.5))
# In this dataset, target 0 is malignant and target 1 is benign
sns.kdeplot(x=data.data[data.target == 0, 0], fill=True, label='Malignant')
sns.kdeplot(x=data.data[data.target == 1, 0], fill=True, label='Benign')
plt.xlabel('Mean Radius')
plt.ylabel('Density')
plt.legend()
plt.show()

In this code, we have used the kdeplot() function twice to create two density plots of the mean radius feature, one for the malignant samples and one for the benign samples. We have set the fill parameter to True to shade the area under each curve (fill replaces the older, deprecated shade parameter), and we have added labels to the plots using the label parameter. We have also added a legend to the plot using the legend() function.

On executing this code, you will get the following plot as the output −

[Figure: density plots of mean radius for the malignant and benign samples]

The resulting density plot shows the estimated probability density functions of mean radius values for both the malignant and benign samples. We can see that the curve for the malignant samples is shifted to the right, indicating that malignant samples tend to have larger mean radius values.
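The shift between the two curves can be backed up with per-class summary statistics. The sketch below computes the number of samples, the mean, and the median of the mean radius feature for each class. The summary dictionary is a helper structure introduced only for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
mean_radius = data.data[:, 0]  # column 0 is the mean radius

# Collect simple statistics for each class label (0 and 1).
summary = {}
for label, name in enumerate(data.target_names):
    values = mean_radius[data.target == label]
    summary[str(name)] = {
        "count": int(values.size),
        "mean": float(values.mean()),
        "median": float(np.median(values)),
    }

for name, stats in summary.items():
    print(f"{name:>9}: n={stats['count']}, "
          f"mean={stats['mean']:.2f}, median={stats['median']:.2f}")
```

A higher mean and median for the malignant class is the numerical counterpart of its curve sitting further to the right.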