Big Data Analytics 简明教程

Big Data Analytics - Introduction to SQL

SQL 代表结构化查询语言。它是从传统数据仓库和大数据技术中的数据库中提取数据的最广泛使用的语言之一。为了演示 SQL 的基础知识,我们将使用示例。为了专注于语言本身,我们将使用 R 中的 SQL。在编写 SQL 代码方面,这与在数据库中完成代码的方式完全相同。

SQL stands for structured query language. It is one of the most widely used languages for extracting data from databases in traditional data warehouses and big data technologies. In order to demonstrate the basics of SQL we will be working with examples. In order to focus on the language itself, we will be using SQL inside R. In terms of writing SQL code this is exactly as would be done in a database.

SQL 的核心是三个语句:SELECT、FROM 和 WHERE。以下示例使用了 SQL 的最常见用例。导航到文件夹 bda/part2/SQL_introduction 并打开 SQL_introduction.Rproj 文件。然后打开 01_select.R 脚本。为了在 R 中编写 SQL 代码,我们需要安装 sqldf 包,如下面的代码所示。

The core of SQL are three statements: SELECT, FROM and WHERE. The following examples make use of the most common use cases of SQL. Navigate to the folder bda/part2/SQL_introduction and open the SQL_introduction.Rproj file. Then open the 01_select.R script. In order to write SQL code in R we need to install the sqldf package as demonstrated in the following code.

# Install the sqldf package
install.packages('sqldf')

# load the library
library('sqldf')
library(nycflights13)

# We will be working with the fligths dataset in order to introduce SQL

# Let’s take a look at the table
str(flights)
# Classes 'tbl_d', 'tbl' and 'data.frame': 336776 obs. of  16 variables:

# $ year     : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
# $ month    : int  1 1 1 1 1 1 1 1 1 1 ...
# $ day      : int  1 1 1 1 1 1 1 1 1 1 ...
# $ dep_time : int  517 533 542 544 554 554 555 557 557 558 ...
# $ dep_delay: num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
# $ arr_time : int  830 850 923 1004 812 740 913 709 838 753 ...
# $ arr_delay: num  11 20 33 -18 -25 12 19 -14 -8 8 ...
# $ carrier  : chr  "UA" "UA" "AA" "B6" ...

# $ tailnum  : chr  "N14228" "N24211" "N619AA" "N804JB" ...
# $ flight   : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
# $ origin   : chr  "EWR" "LGA" "JFK" "JFK" ...
# $ dest     : chr  "IAH" "IAH" "MIA" "BQN" ...
# $ air_time : num  227 227 160 183 116 150 158 53 140 138 ...
# $ distance : num  1400 1416 1089 1576 762 ...
# $ hour     : num  5 5 5 5 5 5 5 5 5 5 ...
# $ minute   : num  17 33 42 44 54 54 55 57 57 58 ...

select 语句用于从表中检索列并对列进行计算。最简单的 SELECT 语句在 ej1 中演示。我们还可以创建新变量,如 ej2 中所示。

The select statement is used to retrieve columns from tables and do calculations on them. The simplest SELECT statement is demonstrated in ej1. We can also create new variables as shown in ej2.

### SELECT statement
ej1 = sqldf("
   SELECT
   dep_time
   ,dep_delay
   ,arr_time
   ,carrier
   ,tailnum
   FROM
   flights
")

head(ej1)
#    dep_time   dep_delay  arr_time  carrier  tailnum
# 1      517         2      830      UA       N14228
# 2      533         4      850      UA       N24211
# 3      542         2      923      AA       N619AA
# 4      544        -1     1004      B6       N804JB
# 5      554        -6      812      DL       N668DN
# 6      554        -4      740      UA       N39463

# In R we can use SQL with the sqldf function. It works exactly the same as in
a database

# The data.frame (in this case flights) represents the table we are querying
and goes in the FROM statement
# We can also compute new variables in the select statement using the syntax:

# old_variables as new_variable
ej2 = sqldf("
   SELECT
   arr_delay - dep_delay as gain,
   carrier
   FROM
   flights
")

ej2[1:5, ]
#    gain   carrier
# 1    9      UA
# 2   16      UA
# 3   31      AA
# 4  -17      B6
# 5  -19      DL

SQL 中最常用的功能之一是 group by 语句。这允许为另一个变量的不同组计算数值。打开脚本 02_group_by.R。

One of the most common used features of SQL is the group by statement. This allows to compute a numeric value for different groups of another variable. Open the script 02_group_by.R.

### GROUP BY

# Computing the average
ej3 = sqldf("
  SELECT
   avg(arr_delay) as mean_arr_delay,
   avg(dep_delay) as mean_dep_delay,
   carrier
   FROM
   flights
   GROUP BY
   carrier
")

#    mean_arr_delay   mean_dep_delay carrier
# 1       7.3796692      16.725769      9E
# 2       0.3642909       8.586016      AA
# 3      -9.9308886       5.804775      AS
# 4       9.4579733      13.022522      B6
# 5       1.6443409       9.264505      DL
# 6      15.7964311      19.955390      EV
# 7      21.9207048      20.215543      F9
# 8      20.1159055      18.726075      FL
# 9      -6.9152047       4.900585      HA
# 10     10.7747334      10.552041      MQ
# 11     11.9310345      12.586207      OO
# 12      3.5580111      12.106073      UA
# 13      2.1295951       3.782418      US
# 14      1.7644644      12.869421      VX
# 15      9.6491199      17.711744      WN
# 16     15.5569853      18.996330      YV

# Other aggregations
ej4 = sqldf("
   SELECT
   avg(arr_delay) as mean_arr_delay,
   min(dep_delay) as min_dep_delay,
   max(dep_delay) as max_dep_delay,
   carrier
   FROM
   flights
   GROUP BY
   carrier
")

# We can compute the minimun, mean, and maximum values of a numeric value
ej4
#      mean_arr_delay    min_dep_delay   max_dep_delay   carrier
# 1       7.3796692           -24           747          9E
# 2       0.3642909           -24          1014          AA
# 3      -9.9308886           -21           225          AS
# 4       9.4579733           -43           502          B6
# 5       1.6443409           -33           960         DL
# 6      15.7964311           -32           548         EV
# 7      21.9207048           -27           853         F9
# 8      20.1159055           -22           602         FL
# 9      -6.9152047           -16          1301         HA
# 10     10.7747334           -26          1137         MQ
# 11     11.9310345           -14           154         OO
# 12      3.5580111           -20           483         UA
# 13      2.1295951           -19           500         US
# 14      1.7644644           -20           653         VX
# 15      9.6491199           -13           471         WN
# 16     15.5569853           -16           387         YV

### We could be also interested in knowing how many observations each carrier has
ej5 = sqldf("
   SELECT
   carrier, count(*) as count
   FROM
   flights
   GROUP BY
   carrier
")

ej5
#      carrier  count
# 1       9E    18460
# 2       AA   32729
# 3       AS   714
# 4       B6   54635
# 5       DL   48110
# 6       EV   54173
# 7       F9   685
# 8       FL   3260
# 9       HA   342
# 10      MQ   26397
# 11      OO   32
# 12      UA   58665
# 13      US   20536
# 14      VX   5162
# 15      WN   12275
# 16      YV   601

SQL 最有用的功能是联接。联接意味着我们想要在表 A 和表 B 中的表 B 中使用一列来匹配两张表的值并将它们合并到一张表中。有不同类型的联接,实际操作中,以下联接将是最有用的:内部联接和左外部联接。

The most useful feature of SQL are joins. A join means that we want to combine table A and table B in one table using one column to match the values of both tables. There are different types of joins, in practical terms, to get started these will be the most useful ones: inner join and left outer join.

# Let’s create two tables: A and B to demonstrate joins.
A = data.frame(c1 = 1:4, c2 = letters[1:4])
B = data.frame(c1 = c(2,4,5,6), c2 = letters[c(2:5)])

A
# c1 c2
# 1  a
# 2  b
# 3  c
# 4  d

B
# c1 c2
# 2  b
# 4  c
# 5  d
# 6  e

### INNER JOIN
# This means to match the observations of the column we would join the tables by.
inner = sqldf("
   SELECT
   A.c1, B.c2
   FROM
   A INNER JOIN B
   ON A.c1 = B.c1
")

# Only the rows that match c1 in both A and B are returned
inner
# c1 c2
#  2  b
#  4  c

### LEFT OUTER JOIN
# the left outer join, sometimes just called left join will return the
# first all the values of the column used from the A table
left = sqldf("
  SELECT
   A.c1, B.c2
  FROM
   A LEFT OUTER JOIN B
   ON A.c1 = B.c1
")

# Only the rows that match c1 in both A and B are returned
left
#   c1    c2
#    1  <NA>
#    2    b
#    3  <NA>
#    4    c