Big Data Analytics: A Concise Tutorial

Big Data Analytics - Data Exploration

Exploratory data analysis (EDA) is a concept developed by John Tukey (1977) that offers a new perspective on statistics. Tukey's idea was that in traditional statistics the data was not being explored graphically; it was only being used to test hypotheses. The first attempt to develop a tool took place at Stanford, in a project called PRIM-9. The tool could visualize data in nine dimensions, so it was able to provide a multivariate perspective of the data.

Today, exploratory data analysis is considered essential and is part of the big data analytics life cycle. Strong EDA capabilities underpin an organization's ability to find insights and communicate them effectively.

Based on Tukey's ideas, Bell Labs developed the S programming language to provide an interactive interface for doing statistics. The idea behind S was to provide extensive graphical capabilities in an easy-to-use language. Today, in the context of Big Data, R, which is based on the S programming language, is one of the most popular software environments for analytics.

The following program is an example of exploratory data analysis. The code is also available in the part1/eda/exploratory_data_analysis.R file.

library(nycflights13)
library(ggplot2)
library(data.table)
library(reshape2)

# Using the code from the previous section
# This computes the mean arrival and departure delays by carrier.
DT <- as.data.table(flights)
mean2 = DT[, list(mean_departure_delay = mean(dep_delay, na.rm = TRUE),
   mean_arrival_delay = mean(arr_delay, na.rm = TRUE)),
   by = carrier]

# ggplot2 generally expects data in long format, so the table is reshaped first:
# one row per (carrier, variable) pair instead of one column per delay type
dt = melt(mean2, id.vars = 'carrier')

# Take a look at the first rows
print(head(dt))

# See ?geom_point and ?geom_line for similar examples
# The carrier code goes on the x axis and the value column
# of the dt data.table goes on the y axis
# The variable column determines the color (and the grouping for the lines)
p = ggplot(dt, aes(x = carrier, y = value, color = variable, group = variable)) +
   geom_point() + # Plots points
   geom_line() + # Plots lines
   theme_bw() + # Uses a white background
   labs(title = 'Mean arrival and departure delay by carrier',
      x = 'Carrier', y = 'Mean delay')
print(p)

# Save the plot to disk
ggsave('mean_delay_by_carrier.png', p,
   width = 10.4, height = 5.07)
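
The reshaping step is the least obvious part of the script, so here is a minimal sketch of what melt does, run on a small made-up table rather than the flights data. The carrier codes and delay values below are hypothetical and chosen only for illustration.

```r
library(data.table)

# Hypothetical toy table in wide format: one row per carrier,
# one column per delay type (values are invented for illustration)
wide <- data.table(
  carrier = c("AA", "BB"),
  mean_departure_delay = c(10, 20),
  mean_arrival_delay = c(5, 15)
)

# melt() keeps the id column and stacks the remaining columns
# into name/value pairs called "variable" and "value"
long <- melt(wide, id.vars = "carrier")

# Each carrier now appears once per delay type
print(long)
```

In long format the delay type is an ordinary column, which is what allows it to be mapped to the color and group aesthetics in a single ggplot call.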

The plotting code above should produce an image such as the following: