Machine Learning: A Concise Tutorial
Machine Learning - Correlation Matrix Plot
A correlation matrix plot is a graphical representation of the pairwise correlations between the variables in a dataset. In its full form, the plot is a grid of scatterplots and correlation coefficients: each scatterplot shows the relationship between two variables, and the corresponding coefficient indicates the strength and direction of that relationship. The diagonal of the grid usually shows the distribution of each variable. A more compact variant, used in the example on this page, is a heatmap of the coefficients alone.
The correlation coefficient (here, the Pearson coefficient) measures the strength and direction of the linear relationship between two variables and ranges from -1 to 1. A coefficient of 1 indicates a perfect positive correlation, where an increase in one variable is associated with an increase in the other. A coefficient of -1 indicates a perfect negative correlation, where an increase in one variable is associated with a decrease in the other. A coefficient of 0 indicates no linear correlation; the variables may still be related in a nonlinear way.
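To make these three cases concrete, here is a minimal sketch that computes the Pearson coefficient directly from its definition (covariance divided by the product of the standard deviations). The function name `pearson_r` and the toy data are assumptions made for illustration; in practice `numpy.corrcoef` or `DataFrame.corr` would normally be used instead.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation of two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()   # deviations from the mean
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r_up = pearson_r(x, 2 * x + 1)        # increasing straight line: close to 1
r_down = pearson_r(x, -3 * x + 7)     # decreasing straight line: close to -1
r_curve = pearson_r(x, (x - 3) ** 2)  # symmetric parabola: close to 0

print(r_up, r_down, r_curve)
```

The last case shows why a coefficient of 0 should be read as "no linear relationship": the second variable is fully determined by the first, yet the coefficient vanishes because the relationship is symmetric rather than linear.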
Python Implementation of Correlation Matrix Plots
Now that we have a basic understanding of correlation matrix plots, let's implement one in Python. For our example, we will use the Iris flower dataset from scikit-learn, which contains measurements of the sepal length, sepal width, petal length, and petal width of 150 iris flowers belonging to three species: Setosa, Versicolor, and Virginica.
Example
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

# Load the four Iris measurements into a DataFrame
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)

# Pairwise Pearson correlation coefficients
corr = data.corr()

sns.set_theme(style='white')

# Hide the upper triangle (a mirror of the lower one) and the diagonal (always 1)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Use the full -1..1 color range and print each coefficient in its cell
sns.heatmap(corr, mask=mask, cmap=cmap, vmin=-1, vmax=1, center=0,
            annot=True, fmt='.2f', square=True, linewidths=.5,
            cbar_kws={"shrink": .5}, ax=ax)
plt.show()
This code produces a correlation matrix plot of the Iris dataset in which each square represents the correlation coefficient between two variables. Only the lower triangle is drawn, because the matrix is symmetric.
From this plot, we can see that 'sepal width (cm)' and 'petal length (cm)' have a moderate negative correlation (-0.43), while 'petal length (cm)' and 'petal width (cm)' have a strong positive correlation (0.96). We can also see that 'sepal length (cm)' has a strong positive correlation (0.87) with 'petal length (cm)'.
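Reading exact values off a colored square is error-prone, so it can help to print the coefficients as numbers. The following is a minimal sketch, assuming the same scikit-learn Iris data as above; the variable names `lower` and `pairs` are chosen here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
corr = data.corr()  # Pearson correlation is the pandas default

# Keep each pair of variables once: the strict lower triangle of the matrix
keep = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
lower = corr.where(keep)

# Flatten to one row per pair, sorted from most negative to most positive
pairs = lower.stack().dropna().sort_values()
print(pairs.round(2))
```

Sorting the pairs makes the strongest negative and positive relationships appear at the two ends of the listing.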