Machine Learning: A Concise Tutorial

Machine Learning - Histograms

A histogram is a bar-chart-like representation of the distribution of a numeric variable. The range of values is divided into consecutive intervals (bins), and each bar shows how many data points fall into its bin. The x-axis represents the range of values of the variable, and the y-axis represents the frequency or count for each bin. The height of each bar is the number of data points that fall within that bin's range.
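As a minimal sketch of the binning described above, the following uses NumPy's histogram() function on a small made-up list of values. The sample values and the choice of 4 bins are assumptions made for illustration only; they are not part of the dataset used later.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5])

# Split the range [1, 5] into 4 equal-width bins and count the points in each.
# Every bin includes its left edge; only the last bin also includes its right edge.
counts, edges = np.histogram(values, bins=4)

print(edges)   # bin boundaries: [1. 2. 3. 4. 5.]
print(counts)  # points per bin: [1 2 3 3]
```

The counts are exactly the bar heights a histogram of these values would show, and they add up to the number of data points.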

Histograms are useful for identifying patterns in data, such as skewness, modality, and outliers. Skewness refers to the degree of asymmetry in the distribution of the variable. Modality refers to the number of peaks in the distribution. Outliers are data points that fall outside of the range of typical values for the variable.
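Skewness and outliers can also be checked numerically, which is a useful complement to reading them off a plot. The sketch below uses a made-up sample and two common conventions that are assumptions here rather than part of this tutorial: skewness as the mean of the cubed standardized values, and the 1.5 × IQR rule for flagging outliers.

```python
import numpy as np

# Hypothetical sample with one unusually large value
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 20], dtype=float)

# Sample skewness: mean of the cubed standardized values.
# A positive result means the distribution has a longer right tail.
z = (values - values.mean()) / values.std()
skewness = np.mean(z ** 3)

# 1.5 * IQR rule: flag points far outside the middle 50% of the data
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(skewness > 0)  # True
print(outliers)      # [20.]
```

Other outlier rules (for example, a z-score threshold) would work equally well here; the IQR rule is used only because it is simple to state.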

Python Implementation of Histograms

Python provides several libraries for data visualization, such as Matplotlib, Seaborn, Plotly, and Bokeh. For the example given below, we will use Matplotlib to implement histograms.

We will use the breast cancer dataset from the scikit-learn (sklearn) library for this example. The dataset contains measurements of cell nuclei from breast mass samples, along with a label indicating whether each sample is malignant or benign. It has 569 samples and 30 features.

Example

Let’s start by importing the necessary libraries and loading the dataset −

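# Matplotlib for plotting, scikit-learn for the dataset
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

# Load the dataset (features in data.data, labels in data.target)
data = load_breast_cancer()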
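
Before plotting, it can help to confirm what was loaded. The following sketch prints the shape of the feature matrix, the name of the first feature, and how the labels are encoded. It uses only attributes of the object returned by load_breast_cancer().

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

print(data.data.shape)            # (569, 30): 569 samples, 30 features
print(data.feature_names[0])      # mean radius
print(data.target_names)          # ['malignant' 'benign'], so 0 = malignant, 1 = benign
print(np.bincount(data.target))   # [212 357]: samples per class
```

The label encoding matters later when the samples are split into malignant and benign groups.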

Next, we will create a histogram of the mean radius feature of the dataset −

plt.figure(figsize=(7.2, 3.5))
# Column 0 of data.data is the 'mean radius' feature
plt.hist(data.data[:, 0], bins=20)
plt.xlabel('Mean Radius')
plt.ylabel('Frequency')
plt.show()

In this code, we have used the hist() function from Matplotlib to create a histogram of the mean radius feature, which is the first column of data.data. We have set the number of bins to 20, which divides the data range into 20 equal-width intervals. We have also added labels to the x and y axes using the xlabel() and ylabel() functions.
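To see what bins=20 means in numbers, the sketch below computes the same kind of binning with NumPy's histogram() function instead of plotting it. The variable names are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

mean_radius = load_breast_cancer().data[:, 0]

# 20 bins are described by 21 edges, from the minimum to the maximum value
counts, edges = np.histogram(mean_radius, bins=20)
widths = np.diff(edges)

print(len(edges))                      # 21
print(counts.sum())                    # 569: every sample lands in exactly one bin
print(np.allclose(widths, widths[0]))  # True: all bins have the same width
```

Matplotlib's hist() also returns the counts and bin edges alongside the drawn bars, so the same values can be captured directly from the plotting call if needed.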

The resulting histogram shows the distribution of mean radius values in the dataset. We can see that the distribution has a single peak around 12-14 and is only roughly bell-shaped: it is skewed to the right, with a longer tail toward larger radius values.

[Figure: histogram of the mean radius feature]
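
The shape seen in the plot can be cross-checked with a few summary numbers. This is a rough sketch, assuming the same skewness formula as a simple diagnostic (mean of cubed standardized values): a mean above the median together with a positive skewness points to a longer right tail.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

mean_radius = load_breast_cancer().data[:, 0]

mean = mean_radius.mean()
median = np.median(mean_radius)

# Skewness as the mean of cubed standardized values (0 for a symmetric distribution)
z = (mean_radius - mean) / mean_radius.std()
skewness = np.mean(z ** 3)

print(f"mean={mean:.2f}, median={median:.2f}, skewness={skewness:.2f}")
```

A formal normality test would be needed for a stronger statement; these numbers only indicate the direction of the asymmetry.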

Histogram with Multiple Data Sets

We can also create a histogram with multiple data sets to compare their distributions. Let’s create histograms of the mean radius feature for both the malignant and benign samples −

Example

plt.figure(figsize=(7.2, 3.5))
# In this dataset, target 0 is malignant and target 1 is benign
plt.hist(data.data[data.target == 0, 0], bins=20, alpha=0.5, label='Malignant')
plt.hist(data.data[data.target == 1, 0], bins=20, alpha=0.5, label='Benign')
plt.xlabel('Mean Radius')
plt.ylabel('Frequency')
plt.legend()
plt.show()

In this code, we have used the hist() function twice to create two histograms of the mean radius feature, one for the malignant samples (target 0) and one for the benign samples (target 1). We have set the transparency of the bars to 0.5 using the alpha parameter so that both histograms remain visible where they overlap. We have also added a legend to the plot using the legend() function.

On executing this code, you will get the following plot as the output −

[Figure: overlaid histograms of mean radius for malignant and benign samples]

The resulting histogram shows the distribution of mean radius values for both the malignant and benign samples. We can see that the two distributions differ: the malignant samples are shifted toward larger mean radius values, while the benign samples are concentrated at smaller values.
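
One detail worth noting about the overlaid plot: when hist() is called separately with bins=20, each group gets its own bin edges, computed from its own minimum and maximum, so the bars of the two groups are not aligned. A possible variation, sketched below, is to compute one set of edges over the full range and pass it to both calls. The names edges, malignant, and benign are illustrative choices, and the medians are printed only as a rough numeric summary of the shift between groups.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
mean_radius = data.data[:, 0]
malignant = mean_radius[data.target == 0]  # target 0 = malignant
benign = mean_radius[data.target == 1]     # target 1 = benign

# 21 edges spanning the full range give 20 bins shared by both groups
edges = np.linspace(mean_radius.min(), mean_radius.max(), 21)

plt.figure(figsize=(7.2, 3.5))
counts_m, _, _ = plt.hist(malignant, bins=edges, alpha=0.5, label='Malignant')
counts_b, _, _ = plt.hist(benign, bins=edges, alpha=0.5, label='Benign')
plt.xlabel('Mean Radius')
plt.ylabel('Frequency')
plt.legend()
plt.show()

# Rough numeric summary of how far apart the two groups are
print(f"median malignant={np.median(malignant):.2f}, median benign={np.median(benign):.2f}")
```

Because the two groups differ in size (212 malignant versus 357 benign samples), passing density=True to both hist() calls is another option when the goal is to compare shapes rather than raw counts.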