Big Data Analytics: A Concise Tutorial

Big Data Analytics - Logistic Regression

Logistic regression is a classification model for a categorical response variable. The algorithm comes from statistics and is used for supervised classification problems. In the following equation, p_i is the probability that observation i belongs to the positive class, and fitting the model means finding the parameter vector β that minimizes the cost function (the negative log-likelihood of the training data).

\operatorname{logit}(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{1,i} + \dots + \beta_k x_{k,i}
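To make the link between the logit equation and the cost function concrete, here is a minimal sketch on simulated data. It assumes the cost function is the negative Bernoulli log-likelihood, minimizes it numerically with `optim()`, and compares the result with `glm()`. The names `n`, `x`, `beta_true`, `neg_log_lik`, and `fit_sim` are illustrative and not part of the original tutorial.

```r
# Minimal sketch on simulated data; all variable names are illustrative assumptions.
set.seed(42)

# Simulate one predictor and a binary response from known parameters
n <- 1000
x <- rnorm(n)
beta_true <- c(-0.5, 1.5)                      # (beta_0, beta_1)
eta_true <- beta_true[1] + beta_true[2] * x    # linear predictor
p <- plogis(eta_true)                          # inverse logit: 1 / (1 + exp(-eta))
y <- rbinom(n, size = 1, prob = p)

# The logit (qlogis) undoes the inverse logit (plogis)
stopifnot(isTRUE(all.equal(qlogis(p), eta_true)))

# Cost function: negative log-likelihood of the Bernoulli model with logit link
neg_log_lik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  -sum(y * eta - log1p(exp(eta)))
}

# Minimize the cost numerically, then fit the same model with glm()
opt <- optim(c(0, 0), neg_log_lik, method = "BFGS")
fit_sim <- glm(y ~ x, family = binomial())

# The two rows should agree closely
print(rbind(optim = opt$par, glm = unname(coef(fit_sim))))
```

Both approaches estimate the same β. `glm()` uses iteratively reweighted least squares rather than a general-purpose optimizer, so small numerical differences between the two rows are expected.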

The following code demonstrates how to fit a logistic regression model in R. It uses the spam dataset, the same one that was used for the Naive Bayes classifier.

In terms of prediction accuracy, the regression model achieves 92.5% on the test set, compared to the 72% achieved by the Naive Bayes classifier. Because the train/test split is random and no seed is set, the exact figure varies from run to run.

# Load the package that provides the spam dataset and inspect the first rows
library(ElemStatLearn)
head(spam)

# Split the dataset into training (80%) and test (20%) sets
inx = sample(nrow(spam), round(nrow(spam) * 0.8))
train = spam[inx,]
test = spam[-inx,]

# Fit the logistic regression model (binomial family, logit link by default)
fit = glm(spam ~ ., data = train, family = binomial())
summary(fit)

# Call:
#   glm(formula = spam ~ ., family = binomial(), data = train)
#
# Deviance Residuals:
#     Min       1Q   Median       3Q      Max
# -4.5172  -0.2039   0.0000   0.1111   5.4944
#
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.511e+00  1.546e-01  -9.772  < 2e-16 ***
# A.1         -4.546e-01  2.560e-01  -1.776 0.075720 .
# A.2         -1.630e-01  7.731e-02  -2.108 0.035043 *
# A.3          1.487e-01  1.261e-01   1.179 0.238591
# A.4          2.055e+00  1.467e+00   1.401 0.161153
# A.5          6.165e-01  1.191e-01   5.177 2.25e-07 ***
# A.6          7.156e-01  2.768e-01   2.585 0.009747 **
# A.7          2.606e+00  3.917e-01   6.652 2.88e-11 ***
# A.8          6.750e-01  2.284e-01   2.955 0.003127 **
# A.9          1.197e+00  3.362e-01   3.559 0.000373 ***
# ... (remaining coefficients not shown)
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

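# Make predictions: predicted probabilities, thresholded at 0.5
preds = predict(fit, test, type = 'response')
preds = ifelse(preds > 0.5, 1, 0)

# Confusion matrix: actual class (rows) against predicted class (columns)
tbl = table(target = test$spam, preds)
tbl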

#         preds
# target    0   1
# email   535  23
# spam     46 316

# Accuracy: proportion of test messages classified correctly
sum(diag(tbl)) / sum(tbl)
# 0.925
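
Accuracy alone hides how the errors are split between the two classes. As a rough sketch, the confusion matrix printed above can be re-entered by hand to compute precision and recall for the spam class. The counts are copied from the output above; the names `precision` and `recall` are illustrative and not part of the original tutorial.

```r
# Confusion matrix copied from the output above
# (rows: actual class, columns: predicted class; matrix() fills column by column)
tbl <- matrix(c(535, 46, 23, 316), nrow = 2,
              dimnames = list(target = c("email", "spam"),
                              preds  = c("0", "1")))

# Accuracy: correct predictions over all predictions
accuracy <- sum(diag(tbl)) / sum(tbl)

# Precision: of the messages predicted as spam, the share that really are spam
precision <- tbl["spam", "1"] / sum(tbl[, "1"])

# Recall: of the real spam messages, the share that the model caught
recall <- tbl["spam", "1"] / sum(tbl["spam", ])

round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
# accuracy precision    recall
#    0.925     0.932     0.873
```

With these counts, the model misses 46 of the 362 spam messages while flagging 23 of the 558 legitimate emails. The 0.5 threshold applied to the predicted probabilities controls that trade-off.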