Machine Learning: A Concise Tutorial
Machine Learning - P-value
In machine learning, a p-value is used to test a null hypothesis, typically the hypothesis that there is no relationship between two variables. For example, given a dataset of house prices, we can use a p-value to assess whether the size of a house is related to its price.
To understand p-values, we first need the concepts of the null hypothesis and the alternative hypothesis. The null hypothesis states that there is no relationship between the two variables. The alternative hypothesis is its opposite and states that a relationship exists.
Once the null and alternative hypotheses are defined, we can use the p-value to test the null hypothesis. The p-value is the probability of obtaining the observed result, or a more extreme one, assuming that the null hypothesis is true.
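The definition can be made concrete with a small worked case. The sketch below is a hypothetical illustration, not part of the tutorial's dataset: it assumes a coin flipped 100 times that lands heads 60 times, takes "the coin is fair" as the null hypothesis, and adds up the probability of every outcome at least as far from 50 heads as the observed one. The function name binomial_two_sided_p is made up for this example.

```python
from math import comb


def binomial_two_sided_p(heads: int, flips: int) -> float:
    """Exact two-sided p-value under the null hypothesis of a fair coin."""
    expected = flips / 2
    observed_distance = abs(heads - expected)
    # Count the flip sequences whose number of heads is at least as far
    # from the expected count as the observed number of heads.
    extreme = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - expected) >= observed_distance
    )
    # Under the null, each of the 2 ** flips sequences is equally likely.
    return extreme / 2 ** flips


# Hypothetical observation: 60 heads in 100 flips.
p_value = binomial_two_sided_p(heads=60, flips=100)
print(f"P-value: {p_value:.4f}")
```

A result exactly at the expected count gives a p-value of 1, because every possible outcome is at least that extreme.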
If the p-value is less than the significance level (commonly set at 0.05), we reject the null hypothesis in favor of the alternative hypothesis and say that the relationship between the two variables is statistically significant. If the p-value is greater than the significance level, we fail to reject the null hypothesis. This means only that the data do not provide enough evidence of a relationship, not that a relationship has been shown to be absent.
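The decision rule can be written as a few lines of code. This is a minimal sketch: the helper name decide is hypothetical, and treating a p-value exactly equal to the significance level as "fail to reject" is an assumption, since the rule above only covers the "less than" and "greater than" cases.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the decision rule for a chosen significance level alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if p_value < alpha:
        return "reject the null hypothesis"
    # Assumption: a p-value equal to alpha is not treated as significant.
    return "fail to reject the null hypothesis"


for p in (0.003, 0.05, 0.20):
    print(f"p = {p}: {decide(p)}")
```

Note that the function never returns "accept the null hypothesis": a large p-value signals a lack of evidence, not proof that no relationship exists.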
Implementation of P-value in Python
Python provides several libraries for statistical analysis and hypothesis testing. One of the most popular is scipy, whose scipy.stats module provides a function called ttest_ind(). This function runs a t-test that compares the means of two independent samples and returns the test statistic and the p-value.
To demonstrate how p-values are used in machine learning, we will use the breast cancer dataset provided by scikit-learn. The task for this dataset is to predict whether a breast tumor is malignant or benign based on features such as the tumor's radius, texture, perimeter, area, smoothness, compactness, concavity, and symmetry.
First, we will load the dataset and split it into training and testing sets:
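from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
# Load the feature matrix X and the binary target y
data = load_breast_cancer()
X = data.data
y = data.target
# Hold out 20% of the samples for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)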
Next, we will use the SelectKBest class from scikit-learn to select the top k features based on a univariate statistical test. Here, we will select the top 5 features:
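from sklearn.feature_selection import SelectKBest, f_classif
k = 5
selector = SelectKBest(score_func=f_classif, k=k)
# Fit the selector on the training data only, then keep the selected columns
X_train_new = selector.fit_transform(X_train, y_train)
# Apply the same column selection to the testing data
X_test_new = selector.transform(X_test)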
The SelectKBest class takes a score function that returns a score, and optionally a p-value, for each feature, and it keeps the k features with the highest scores. We use the f_classif function, which computes the ANOVA F-value between each feature and the target variable along with the corresponding p-value. Because every feature is scored on the same samples and classes, a higher F-value corresponds to a lower p-value. The k parameter specifies the number of top features to select.
The fit_transform() method fits the selector on the training data and returns that data with only the top k features. The transform() method then applies the same selection to the testing data, so the features are chosen from the training data alone.
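To see which features were kept and what their scores and p-values are, the fitted selector can be inspected. The sketch below repeats the loading and selection steps so that it runs on its own. It relies on the scores_ and pvalues_ attributes and the get_support() method of SelectKBest; the output formatting is just one possible choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

selector = SelectKBest(score_func=f_classif, k=5)
X_train_new = selector.fit_transform(X_train, y_train)

# get_support() returns a boolean mask marking the columns that were kept.
mask = selector.get_support()

# Print the F-value and p-value of each selected feature.
for name, score, p in zip(
    data.feature_names[mask], selector.scores_[mask], selector.pvalues_[mask]
):
    print(f"{name:<25} F = {score:10.2f}  p = {p:.3e}")

# The transformed matrix holds exactly the original columns picked by the mask.
same_columns = np.array_equal(X_train_new, X_train[:, mask])
print(f"Transformed data matches masked columns: {same_columns}")
```

The p-values for this dataset are very small, which is why they are printed in scientific notation rather than with a fixed number of decimals.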
We can now train a model on the selected features and evaluate its performance:
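from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Raise max_iter: the default of 100 may not be enough to converge on these unscaled features
model = LogisticRegression(max_iter=10000)
model.fit(X_train_new, y_train)
# Predict on the held-out test set and compare against the true labels
y_pred = model.predict(X_test_new)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")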
In this example, we trained a logistic regression model on the top 5 selected features and evaluated its performance using accuracy. P-values can also be used directly in a hypothesis test to determine whether a single feature is statistically significant.
For example, to test whether the mean radius feature differs significantly between malignant and benign tumors, we can use the ttest_ind() function from the scipy.stats module:
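from scipy.stats import ttest_ind
# In this dataset, target 0 is malignant and target 1 is benign; column 0 is mean radius
malignant = X[y == 0, 0]
benign = X[y == 1, 0]
t, p_value = ttest_ind(malignant, benign)
# Use scientific notation so that a very small p-value is not printed as 0.00
print(f"P-value: {p_value:.3e}")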
The ttest_ind() function takes two arrays as input and returns the t-statistic and the two-tailed p-value. By default, it assumes that the two groups have equal variances.
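When the equal-variance assumption is in doubt, ttest_ind() can run Welch's t-test instead by passing equal_var=False. The sketch below is one way to compare the two variants on the mean radius feature. Reloading the dataset here is only to keep the block self-contained, and which variant suits a given analysis is a judgment call that this example does not settle.

```python
from scipy.stats import ttest_ind
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

# Target 0 is malignant and target 1 is benign; column 0 is mean radius.
malignant = X[y == 0, 0]
benign = X[y == 1, 0]

# Student's t-test (default, equal_var=True) assumes equal group variances.
student = ttest_ind(malignant, benign)

# Welch's t-test drops the equal-variance assumption.
welch = ttest_ind(malignant, benign, equal_var=False)

print(f"Student: t = {student.statistic:.2f}, p = {student.pvalue:.3e}")
print(f"Welch:   t = {welch.statistic:.2f}, p = {welch.pvalue:.3e}")
```

A positive t-statistic here indicates that the first sample (malignant) has the larger mean radius.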