Big Data Analytics 简明教程
Big Data Analytics - Data Visualization
为了理解数据,通常有用可视化方式。通常在大型数据应用中,乐趣在于发现洞察,而不仅仅是创建美观的图表。以下是使用图表理解数据的不同方法的示例。
In order to understand data, it is often useful to visualize it. Normally in Big Data applications, the interest relies in finding insight rather than just making beautiful plots. The following are examples of different approaches to understanding data using plots.
为了开始分析航班数据,我们可以首先检查数值变量之间是否存在相关性。此代码也可用在 bda/part1/data_visualization/data_visualization.R 文件中。
To start analyzing the flights data, we can start by checking if there are correlations between numeric variables. This code is also available in bda/part1/data_visualization/data_visualization.R file.
# Install the package corrplot by running
install.packages('corrplot')
# then load the library
library(corrplot)
# Load the following libraries
library(nycflights13)
library(ggplot2)
library(data.table)
library(reshape2)
# We will continue working with the flights data
DT <- as.data.table(flights)
head(DT) # take a look
# We select the numeric variables after inspecting the first rows.
numeric_variables = c('dep_time', 'dep_delay',
'arr_time', 'arr_delay', 'air_time', 'distance')
# Select numeric variables from the DT data.table
dt_num = DT[, numeric_variables, with = FALSE]
# Compute the correlation matrix of dt_num
cor_mat = cor(dt_num, use = "complete.obs")
print(cor_mat)
### Here is the correlation matrix
# dep_time dep_delay arr_time arr_delay air_time distance
# dep_time 1.00000000 0.25961272 0.66250900 0.23230573 -0.01461948 -0.01413373
# dep_delay 0.25961272 1.00000000 0.02942101 0.91480276 -0.02240508 -0.02168090
# arr_time 0.66250900 0.02942101 1.00000000 0.02448214 0.05429603 0.04718917
# arr_delay 0.23230573 0.91480276 0.02448214 1.00000000 -0.03529709 -0.06186776
# air_time -0.01461948 -0.02240508 0.05429603 -0.03529709 1.00000000 0.99064965
# distance -0.01413373 -0.02168090 0.04718917 -0.06186776 0.99064965 1.00000000
# We can display it visually to get a better understanding of the data
corrplot.mixed(cor_mat, lower = "circle", upper = "ellipse")
# save it to disk
png('corrplot.png')
print(corrplot.mixed(cor_mat, lower = "circle", upper = "ellipse"))
dev.off()
此代码生成以下相关矩阵可视化 −
This code generates the following correlation matrix visualization −

我们可以在图表中看到,数据集中某些变量之间存在很强的相关性。例如,到达延迟和出发延迟似乎高度相关。我们可以看到这一点,因为椭圆显示了两个变量之间几乎线性关系;然而,从这个结果中发现因果关系并不容易。
We can see in the plot that there is a strong correlation between some of the variables in the dataset. For example, arrival delay and departure delay seem to be highly correlated. We can see this because the ellipse shows an almost lineal relationship between both variables, however, it is not simple to find causation from this result.
我们不能说因为两个变量相关,一个变量就会影响另一个变量。我们还在图表中发现了飞行时间和距离之间的强相关性,这是相当合理的,因为随着距离的增加,飞行时间应该增加。
We can’t say that as two variables are correlated, that one has an effect on the other. Also we find in the plot a strong correlation between air time and distance, which is fairly reasonable to expect as with more distance, the flight time should grow.
我们还可以对数据进行单变量分析。一种简单有效的可视化分布方式是 box-plots 。以下代码演示了如何使用 ggplot2 库生成箱线图和格子图。此代码也可用在 bda/part1/data_visualization/boxplots.R 文件中。
We can also do univariate analysis of the data. A simple and effective way to visualize distributions are box-plots. The following code demonstrates how to produce box-plots and trellis charts using the ggplot2 library. This code is also available in bda/part1/data_visualization/boxplots.R file.
source('data_visualization.R')
### Analyzing Distributions using box-plots
# The following shows the distance as a function of the carrier
p = ggplot(DT, aes(x = carrier, y = distance, fill = carrier)) + # Define the carrier
in the x axis and distance in the y axis
geom_box-plot() + # Use the box-plot geom
theme_bw() + # Leave a white background - More in line with tufte's
principles than the default
guides(fill = FALSE) + # Remove legend
labs(list(title = 'Distance as a function of carrier', # Add labels
x = 'Carrier', y = 'Distance'))
p
# Save to disk
png(‘boxplot_carrier.png’)
print(p)
dev.off()
# Let's add now another variable, the month of each flight
# We will be using facet_wrap for this
p = ggplot(DT, aes(carrier, distance, fill = carrier)) +
geom_box-plot() +
theme_bw() +
guides(fill = FALSE) +
facet_wrap(~month) + # This creates the trellis plot with the by month variable
labs(list(title = 'Distance as a function of carrier by month',
x = 'Carrier', y = 'Distance'))
p
# The plot shows there aren't clear differences between distance in different months
# Save to disk
png('boxplot_carrier_by_month.png')
print(p)
dev.off()