Big Data Analytics: A Concise Tutorial

Big Data Analytics - Charts & Graphs

The first approach to analyzing data is to inspect it visually. The usual goals are to describe each variable on its own and to find relationships between variables. These strategies fall into two groups:

  1. Univariate analysis

  2. Multivariate analysis

Univariate Graphical Methods

Univariate is a statistical term. In practice, it means we want to analyze a variable independently of the rest of the data. The plots that let us do this efficiently are:

Box-Plots

Box plots are normally used to compare distributions. They are a good way to check visually whether distributions differ. Here we check whether the price of diamonds differs between types of cut.

# We will be using the ggplot2 library for plotting
library(ggplot2)
data("diamonds")

# We will be using the diamonds dataset to analyze distributions of numeric variables
head(diamonds)

#    carat   cut       color  clarity  depth  table   price    x     y     z
# 1  0.23    Ideal       E      SI2    61.5    55     326     3.95  3.98  2.43
# 2  0.21    Premium     E      SI1    59.8    61     326     3.89  3.84  2.31
# 3  0.23    Good        E      VS1    56.9    65     327     4.05  4.07  2.31
# 4  0.29    Premium     I      VS2    62.4    58     334     4.20  4.23  2.63
# 5  0.31    Good        J      SI2    63.3    58     335     4.34  4.35  2.75
# 6  0.24    Very Good   J      VVS2   62.8    57     336     3.94  3.96  2.48

### Box-Plots
p = ggplot(diamonds, aes(x = cut, y = price, fill = cut)) +
   geom_boxplot() +
   theme_bw()
print(p)

The plot shows that the distribution of diamond prices differs between the types of cut.

box plots
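The differences visible in the box plots can also be checked numerically. The sketch below is not part of the original script; it computes the price quartiles for each cut, and the variable name quartiles_by_cut is chosen here for illustration. According to the ggplot2 documentation, the box drawn by geom_boxplot() spans the first to the third quartile with the median marked inside, so these numbers should correspond to the boxes in the plot.

```r
library(ggplot2)  # provides the diamonds dataset
data("diamonds")

# Split price by the levels of cut, then compute the 25th, 50th and 75th
# percentiles within each group. The result is a matrix with one column per cut.
quartiles_by_cut <- sapply(
   split(diamonds$price, diamonds$cut),
   quantile, probs = c(0.25, 0.5, 0.75)
)
print(quartiles_by_cut)
```

Comparing the columns of this matrix gives a quick numeric check of what the plot suggests.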

Histograms

source('01_box_plots.R')

# We can plot histograms for each level of the cut factor variable using facet_grid
p = ggplot(diamonds, aes(x = price, fill = cut)) +
   geom_histogram() +
   facet_grid(cut ~ .) +
   theme_bw()

p
# The previous plot does not display the data well because the counts
# differ greatly in scale between cuts.
# We can let each panel use its own scale with the scales argument of facet_grid.

p = ggplot(diamonds, aes(x = price, fill = cut)) +
   geom_histogram() +
   facet_grid(cut ~ ., scales = 'free') +
   theme_bw()
p

png('02_histogram_diamonds_cut.png')
print(p)
dev.off()

The output of the above code is as follows:

histogram
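When no bin settings are given, geom_histogram() falls back to 30 bins and prints a message suggesting a better value. The sketch below is a variation on the code above that sets the bin width explicitly. The width of 500 (in the units of price) and the name p_bins are assumptions made here for illustration, not values from the original script. It also uses scales = 'free_y', which frees only the count axis of each panel.

```r
library(ggplot2)
data("diamonds")

# Same faceted histogram as above, but with an explicit bin width.
# binwidth = 500 is an illustrative choice; try several values.
p_bins <- ggplot(diamonds, aes(x = price, fill = cut)) +
   geom_histogram(binwidth = 500) +
   facet_grid(cut ~ ., scales = "free_y") +
   theme_bw()
print(p_bins)
```

Trying a few bin widths is usually worthwhile, since the apparent shape of a histogram depends on this choice.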

Multivariate Graphical Methods

Multivariate graphical methods in exploratory data analysis aim to find relationships among different variables. Two approaches are commonly used: plotting a correlation matrix of the numeric variables, or simply plotting the raw data as a matrix of scatter plots.

To demonstrate this, we will use the diamonds dataset. To follow the code, open the script bda/part2/charts/03_multivariate_analysis.R.

library(ggplot2)
data(diamonds)

# Correlation matrix plots
keep_vars = c('carat', 'depth', 'price', 'table')
df = diamonds[, keep_vars]
# compute the correlation matrix
M_cor = cor(df)

#          carat       depth      price      table
# carat 1.00000000  0.02822431  0.9215913  0.1816175
# depth 0.02822431  1.00000000 -0.0106474 -0.2957785
# price 0.92159130 -0.01064740  1.0000000  0.1271339
# table 0.18161755 -0.29577852  0.1271339  1.0000000

# plots
heatmap(M_cor)

The code produces the following output:

heat map

This summary tells us that there is a strong correlation between price and carat, and little correlation among the other variables.

A correlation matrix is useful when we have a large number of variables, in which case plotting the raw data would not be practical. As mentioned, it is also possible to show the raw data:

library(GGally)
ggpairs(df)

The plot confirms the result shown in the heat map: the correlation between the price and carat variables is 0.922.

scatterplot

This relationship can be seen in the price-carat scatterplot at position (3, 1) of the scatterplot matrix.