Big Data Analytics - Statistical Methods
When analyzing data, it is possible to take a statistical approach. The basic tools needed to perform basic analysis are −
- Correlation analysis
- Analysis of Variance
- Hypothesis Testing
When working with large datasets, these methods do not pose a problem, because with the exception of correlation analysis they are not computationally intensive. In that case, it is always possible to take a sample, and the results should be robust.
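As a minimal sketch of that sampling idea (my own addition, using the diamonds dataset from ggplot2 that also appears in the examples below; the 5,000-row sample size is an arbitrary choice):

library(ggplot2)
# A correlation computed on a random sample is usually very close to the one
# computed on the full dataset (~54,000 rows here).
set.seed(123)
idx = sample(nrow(diamonds), 5000)
cor(diamonds$price, diamonds$carat)            # full data
cor(diamonds$price[idx], diamonds$carat[idx])  # 5,000-row sample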
Correlation Analysis
Correlation analysis seeks to find linear relationships between numeric variables. This can be useful in different circumstances. One common use is in exploratory data analysis; section 16.0.2 of the book contains a basic example of this approach. First of all, the correlation metric used in the mentioned example is based on the Pearson coefficient. There is, however, another interesting correlation metric that is not affected by outliers. This metric is called the Spearman correlation.
The Spearman correlation metric is more robust to the presence of outliers than the Pearson method, and it gives better estimates of the linear relations between numeric variables when the data is not normally distributed.
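The effect is easy to see on synthetic data. In the following toy illustration (my own addition, not from the book), a single extreme outlier distorts the Pearson coefficient much more than the rank-based Spearman coefficient:

# Two strongly related variables, then one extreme outlier is injected.
set.seed(1)
a = rnorm(100)
b = a + rnorm(100, sd = 0.3)
cor(a, b, method = 'pearson')   # high, close to 1
cor(a, b, method = 'spearman')  # similar value
a[1] = 50                       # inject a single extreme outlier
cor(a, b, method = 'pearson')   # drops sharply
cor(a, b, method = 'spearman')  # barely changes

Returning to the book's example: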
library(ggplot2)
# Select variables that are interesting to compare with the Pearson and
# Spearman correlation methods.
x = diamonds[, c('x', 'y', 'z', 'price')]
# From the histograms we can expect differences in the correlations of both
# metrics. In this case, as the variables are clearly not normally distributed,
# the Spearman correlation is a better estimate of the linear relation among
# numeric variables.
par(mfrow = c(2, 2))
colnm = names(x)
for (i in 1:4) {
   hist(x[[i]], col = 'deepskyblue3', main = sprintf('Histogram of %s', colnm[i]))
}
par(mfrow = c(1, 1))
From the histograms in the following figure, we can expect differences in the correlations of both metrics. In this case, as the variables are clearly not normally distributed, the Spearman correlation is a better estimate of the linear relation among the numeric variables.

In order to compute the correlation in R, open the file bda/part2/statistical_methods/correlation/correlation.R, which has this code section.
## Correlation Matrix - Pearson and Spearman
cor_pearson <- cor(x, method = 'pearson')
cor_spearman <- cor(x, method = 'spearman')

### Pearson Correlation
print(cor_pearson)
#               x         y         z     price
# x     1.0000000 0.9747015 0.9707718 0.8844352
# y     0.9747015 1.0000000 0.9520057 0.8654209
# z     0.9707718 0.9520057 1.0000000 0.8612494
# price 0.8844352 0.8654209 0.8612494 1.0000000

### Spearman Correlation
print(cor_spearman)
#               x         y         z     price
# x     1.0000000 0.9978949 0.9873553 0.9631961
# y     0.9978949 1.0000000 0.9870675 0.9627188
# z     0.9873553 0.9870675 1.0000000 0.9572323
# price 0.9631961 0.9627188 0.9572323 1.0000000
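As a quick follow-up (my own addition, not part of the original script), subtracting the two matrices shows how much the estimates disagree; the Spearman values are noticeably higher, especially for the heavily skewed price variable:

# Element-wise difference between the two correlation matrices computed above.
round(cor_spearman - cor_pearson, 3)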
Chi-squared Test
The chi-squared test allows us to test whether two random variables are independent, meaning that the probability distribution of each variable does not influence the other. In order to evaluate the test in R, we first need to create a contingency table and then pass the table to the chisq.test R function.
For example, let's check whether there is an association between the variables cut and color from the diamonds dataset. The test is formally defined as −
- H0: The variables cut and color are independent
- H1: The variables cut and color are not independent
Judging by their names, we would assume there is a relationship between these two variables, but the test can give an objective "rule" saying how significant this result is or not.
In the following code snippet, we find that the p-value of the test is below 2.2e-16, which is zero in practical terms. Then, after running the test with a Monte Carlo simulation, we find that the p-value is 0.0004998, still well below the 0.05 threshold. This result means that we reject the null hypothesis (H0), so we believe the variables cut and color are not independent.
library(ggplot2)
# Use the table function to compute the contingency table
tbl = table(diamonds$cut, diamonds$color)
tbl
#                D    E    F    G    H    I   J
# Fair         163  224  312  314  303  175 119
# Good         662  933  909  871  702  522 307
# Very Good   1513 2400 2164 2299 1824 1204 678
# Premium     1603 2337 2331 2924 2360 1428 808
# Ideal       2834 3903 3826 4884 3115 2093 896

# In order to run the test we just use the chisq.test function.
chisq.test(tbl)
# Pearson's Chi-squared test
# data: tbl
# X-squared = 310.32, df = 24, p-value < 2.2e-16

# It is also possible to compute the p-values using a Monte Carlo simulation.
# We need to add the simulate.p.value = TRUE flag and the number of simulations.
chisq.test(tbl, simulate.p.value = TRUE, B = 2000)
# Pearson's Chi-squared test with simulated p-value (based on 2000 replicates)
# data: tbl
# X-squared = 310.32, df = NA, p-value = 0.0004998
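As a small addition that is not in the original script, the expected counts under the independence hypothesis are stored in the fitted test object; comparing them with the observed table shows where the deviations that inflate the statistic come from:

# Expected cell counts if cut and color were independent; large gaps between
# these and the observed counts drive the X-squared statistic up.
test = chisq.test(tbl)
round(test$expected)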
T-test
The idea of the t-test is to evaluate whether there are differences in the distribution of a numeric variable between the groups of a nominal variable. To demonstrate this, I will select the Fair and Ideal levels of the factor variable cut, and then we will compare the values of a numeric variable between those two groups.
data = diamonds[diamonds$cut %in% c('Fair', 'Ideal'), ]
# Drop levels that aren't used from the cut variable
data$cut = droplevels(data$cut)
df1 = data[, c('cut', 'price')]
# We can see the price means are different for each group
tapply(df1$price, df1$cut, mean)
# Fair Ideal
# 4358.758 3457.542
The t-test is implemented in R with the t.test function. The formula interface to t.test is the simplest way to use it; the idea is that a numeric variable is explained by a group variable.
For example: t.test(numeric_variable ~ group_variable, data = data). In the previous example, the numeric_variable is price and the group_variable is cut.
From a statistical perspective, we are testing whether there are differences in the distributions of the numeric variable between the two groups. Formally, the hypothesis test is described with a null hypothesis (H0) and an alternative hypothesis (H1).
- H0: There are no differences in the distributions of the price variable between the Fair and Ideal groups
- H1: There are differences in the distributions of the price variable between the Fair and Ideal groups
This can be implemented in R with the following code −
t.test(price ~ cut, data = data)
# Welch Two Sample t-test
#
# data: price by cut
# t = 9.7484, df = 1894.8, p-value < 2.2e-16
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# 719.9065 1082.5251
# sample estimates:
# mean in group Fair mean in group Ideal
# 4358.758 3457.542
# Another way to validate the previous results is to just plot the
# distributions using a box plot.
plot(price ~ cut, data = data, ylim = c(0, 12000), col = 'deepskyblue3')
We can analyze the test result by checking whether the p-value is lower than 0.05. If this is the case, we keep the alternative hypothesis, which means we have found differences in price between the two levels of the cut factor. By the names of the levels we would have expected this result, but we would not have expected the mean price in the Fair group to be higher than in the Ideal group. We can see this by comparing the means of each factor.
The plot command produces a graph that shows the relationship between the price and cut variables. It is a box plot; we covered this plot in section 16.0.1, but it basically shows the distribution of the price variable for the two levels of cut we are analyzing.

Analysis of Variance
Analysis of Variance (ANOVA) is a statistical model used to analyze the differences among group distributions by comparing the mean and variance of each group; the model was developed by Ronald Fisher. ANOVA provides a statistical test of whether or not the means of several groups are equal, and therefore generalizes the t-test to more than two groups.
ANOVA is useful for comparing three or more groups for statistical significance, because performing multiple two-sample t-tests would increase the chance of committing a statistical Type I error.
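To make that inflation concrete, here is a small back-of-the-envelope computation (an illustration added here, not from the book), assuming the pairwise tests are independent and each is run at the 0.05 level:

# With k groups there are choose(k, 2) pairwise t-tests; the probability of
# at least one false positive grows quickly with the number of tests.
k = 3:6
m = choose(k, 2)
data.frame(groups = k, tests = m, familywise_error = round(1 - 0.95^m, 3))
# For 3 groups the family-wise error is already ~0.14 instead of 0.05.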
In terms of a mathematical explanation, the following is needed to understand the test. Each observation can be decomposed around the grand mean and its group mean −
x_{ij} = \bar{x} + (\bar{x}_i - \bar{x}) + (x_{ij} - \bar{x}_i)
This leads to the following model −
x_{ij} = \mu + \alpha_i + \epsilon_{ij}
where μ is the grand mean and α_i is the i-th group mean. The error term ε_ij is assumed to be iid from a normal distribution. The null hypothesis of the test is that −
\alpha_1 = \alpha_2 = \dots = \alpha_k
In terms of computing the test statistic, we need to compute two values −
- Sum of squares for between-group differences −
SSD_B = \sum_{i}^{k} \sum_{j}^{n} (\bar{x}_i - \bar{x})^2
- Sum of squares within groups −
SSD_W = \sum_{i}^{k} \sum_{j}^{n} (x_{ij} - \bar{x}_i)^2
where SSD_B has k − 1 degrees of freedom and SSD_W has N − k degrees of freedom. Then we can define the mean squared differences for each metric −
MS_B = SSD_B / (k - 1)
MS_W = SSD_W / (N - k)
Finally, the test statistic in ANOVA is defined as the ratio of the above two quantities −
F = MS_B / MS_W
which follows an F-distribution with k − 1 and N − k degrees of freedom. If the null hypothesis is true, F is likely to be close to 1. Otherwise, the between-group mean square MS_B is likely to be large, which results in a large F value.
Basically, ANOVA examines the two sources of the total variance and sees which part contributes more. This is why it is called analysis of variance although the intention is to compare group means.
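To connect the formulas above with actual numbers, the following sketch (my own addition, with hypothetical variable names) computes SSD_B, SSD_W and the F statistic by hand for the mpg/cyl example used below, treating cyl as a grouping factor:

# Manual one-way ANOVA of mpg across the groups of cyl in mtcars.
grp   = as.factor(mtcars$cyl)
y     = mtcars$mpg
k     = nlevels(grp)            # number of groups
N     = length(y)               # total number of observations
grand = mean(y)                 # grand mean
means = tapply(y, grp, mean)    # group means
n_i   = tapply(y, grp, length)  # group sizes
ssd_b = sum(n_i * (means - grand)^2)       # between-group sum of squares
ssd_w = sum((y - means[grp])^2)            # within-group sum of squares
f = (ssd_b / (k - 1)) / (ssd_w / (N - k))  # F = MS_B / MS_W
f
pf(f, k - 1, N - k, lower.tail = FALSE)    # p-value from the F-distribution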
In terms of computing the statistic, it is actually rather simple to do in R. The following example demonstrates how it is done and plots the results.
library(ggplot2)
# We will be using the mtcars dataset
head(mtcars)
#                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
# Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
# Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
# Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
# Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# Let's see if there are differences between the groups of cyl in the mpg variable.
data = mtcars[, c('mpg', 'cyl')]
fit = lm(mpg ~ cyl, data = data)
anova(fit)
# Analysis of Variance Table
# Response: mpg
#           Df Sum Sq Mean Sq F value    Pr(>F)
# cyl        1 817.71  817.71  79.561 6.113e-10 ***
# Residuals 30 308.33   10.28
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Plot the distribution
plot(mpg ~ as.factor(cyl), data = mtcars, col = 'deepskyblue3')
The code will produce the following output −

The p-value we get in the example is significantly smaller than 0.05, so R prints the significance code '***' next to it. This means that we reject the null hypothesis and that we have found differences between the mpg means among the different groups of the cyl variable.
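One caveat about the example above: in the lm call, cyl was kept numeric, which is why the table reports 1 degree of freedom for cyl. The classic one-way ANOVA from the formulas earlier treats cyl as a factor with k − 1 = 2 between-group degrees of freedom; a minimal sketch of that variant (the conclusion, a p-value far below 0.05, is the same):

# One-way ANOVA with cyl treated as a grouping factor (2 between-group df).
fit_factor = aov(mpg ~ as.factor(cyl), data = mtcars)
summary(fit_factor)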