Machine Learning: A Concise Tutorial

Machine Learning - Box and Whisker Plots

A boxplot is a graphical representation of a dataset that displays the five-number summary of the data - the minimum value, the first quartile, the median, the third quartile, and the maximum value.

The boxplot consists of a box with whiskers extending from the top and bottom of the box.

  1. The box represents the interquartile range (IQR) of the data, which is the range between the first and third quartiles.

  2. The whiskers extend from the top and bottom of the box to the highest and lowest values that lie within 1.5 times the IQR of the third and first quartiles, respectively.

Any values that fall outside this range are considered outliers and are represented as points beyond the whiskers.
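
The quantities described above can be computed by hand. The following is a minimal sketch using only the Python standard library; the helper name `box_stats` and the sample values are hypothetical, and the quartiles use linear interpolation between data points, which is one common convention (plotting libraries may use a slightly different one).

```python
import statistics

def box_stats(values):
    """Return the quantities a boxplot draws, using the 1.5 * IQR rule."""
    data = sorted(values)
    # method="inclusive" interpolates linearly between data points
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme values that are still inside the fences
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1],
        "iqr": iqr,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # hypothetical data
stats = box_stats(sample)
print(stats)
```

In this sample the value 30 lies above the upper fence, so the upper whisker stops at 9 and 30 would be drawn as a separate point.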

Python Implementation of Box and Whisker Plots

Now that we have a basic understanding of boxplots, let’s implement them in Python. For our example, we will use the Iris dataset that ships with scikit-learn (sklearn), which contains measurements of the sepal length, sepal width, petal length, and petal width of 150 iris flowers belonging to three different species: Setosa, Versicolor, and Virginica.

To start, we need to import the necessary libraries and load the dataset.

Example

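import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

# Load the Iris dataset bundled with scikit-learn
iris = load_iris()
data = iris.data      # feature matrix, one row per flower
target = iris.target  # integer species codes (0, 1, 2)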
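
Before plotting, it can help to confirm what was loaded. The short check below is a sketch that assumes scikit-learn is installed; it prints the shape of the feature matrix together with the feature and species names, so you can verify that column 0 holds the sepal length.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# One row per flower, one column per measurement
print(iris.data.shape)
# Column order of the feature matrix
print(iris.feature_names)
# Species names, indexed by the integer codes in iris.target
print(list(iris.target_names))
```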

Next, we can create a boxplot of the sepal length for each of the three iris species using the Seaborn library.

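plt.figure(figsize=(7.5, 3.5))
# Map the integer codes to species names so the x-axis shows readable labels
species = iris.target_names[target]
# Column 0 of the feature matrix holds the sepal length
sns.boxplot(x=species, y=data[:, 0])
plt.xlabel('Species')
plt.ylabel('Sepal Length (cm)')
plt.show()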

This code will produce a boxplot of the sepal length for each of the three iris species, with the x-axis representing the species and the y-axis representing the sepal length in centimeters.

[Figure: boxplot of sepal length (cm) for each iris species]

From this boxplot, we can see that the setosa species has shorter sepals than the versicolor and virginica species, whose boxes sit higher and overlap with each other, with virginica having the highest median. Additionally, setosa shows no outliers; any point drawn beyond a whisker in the other two boxes marks an outlier for that species.
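
A reading of the plot can be cross-checked numerically. The sketch below assumes NumPy and scikit-learn are installed and applies the 1.5 × IQR rule to the sepal length of each species; the dictionary name `outliers_by_species` is only illustrative, and NumPy's default percentile interpolation may differ slightly from the one used by a plotting library.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
sepal_length = iris.data[:, 0]  # column 0 holds the sepal length in cm

outliers_by_species = {}
for code, name in enumerate(iris.target_names):
    values = sepal_length[iris.target == code]
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    # Flag values beyond 1.5 * IQR from the quartiles
    mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    outliers_by_species[str(name)] = values[mask].tolist()
    print(name, "median:", np.median(values),
          "outliers:", outliers_by_species[str(name)])
```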