Apache Tajo 简明教程

Apache Tajo - SQL Statements

在上一章中，您已经了解如何在 Tajo 中创建表。本章说明 Tajo 中的 SQL 语句。

In the previous chapter, you have understood how to create tables in Tajo. This chapter explains about the SQL statement in Tajo.

Create Table Statement

在转移到创建表之前，请按如下方法在 Tajo 安装目录路径中创建一个文本文件“students.csv” −

Before moving to create a table, create a text file “students.csv” in Tajo installation directory path as follows −

students.csv

Id	Name	Address	Age	Marks
1	Adam	23 New Street	21	90
2	Amit	12 Old Street	13	95
3	Bob	10 Cross Street	12	80
4	David	15 Express Avenue	12	85
5	Esha	20 Garden Street	13	50
6	Ganga	25 North Street	12	55
7	Jack	2 Park Street	12	60
8	Leena	24 South Street	12	70
9	Mary	5 West Street	12	75
10	Peter	16 Park Avenue	12	95

在文件创建后，转到终端并逐个启动 Tajo 服务器和 Shell。

After the file has been created, move to the terminal and start the Tajo server and shell one by one.

Create Database

使用以下命令创建新数据库 −

Create a new database using the following command −

Query

default> create database sampledb;
OK

连接到刚刚创建的数据库“sampledb”。

Connect to the database “sampledb” which is now created.

default> \c sampledb
You are now connected to database "sampledb" as user “user1”.

然后，按如下方法在“sampledb”中创建表 −

Then, create a table in “sampledb” as follows −

Query

sampledb>  create external table mytable(id int,name text,address text,age int,mark int)
   using text with('text.delimiter' = ',') location ‘file:/Users/workspace/Tajo/students.csv’;

Result

上述查询将生成以下结果。

The above query will generate the following result.

OK

在此创建外部表。现在，您只需输入文件路径。如果您必须从 hdfs 分配表，请使用 hdfs 而非 file。

Here, the external table is created. Now, you just have to enter the file location. If you have to assign the table from hdfs then use hdfs instead of file.

接下来， “students.csv” 文件包含逗号分隔的值。 text.delimiter 字段分配以“,”分隔。

Next, the “students.csv” file contains comma separated values. The text.delimiter field is assigned with ‘,’.

您现在已经成功在“sampledb”中创建“mytable”。

You have now created “mytable” successfully in “sampledb”.

Show Table

如需在 Tajo 中显示表，请使用以下查询。

To show tables in Tajo, use the following query.

Query

sampledb> \d
mytable
sampledb> \d mytable

Result

上述查询将生成以下结果。

The above query will generate the following result.

table name: sampledb.mytable
table uri: file:/Users/workspace/Tajo/students.csv
store type: TEXT
number of rows: unknown
volume: 261 B
Options:
'timezone' = 'Asia/Kolkata'
'text.null' = '\\N'
'text.delimiter' = ','
schema:
id INT4
name TEXT
address TEXT
age INT4
mark INT4

List table

如需获取表中的所有记录，请键入以下查询 −

To fetch all the records in the table, type the following query −

Query

sampledb> select * from mytable;

Result

上述查询将生成以下结果。

The above query will generate the following result.

Insert Table Statement

Tajo 使用以下语法将记录插入表。

Tajo uses the following syntax to insert records in table.

Syntax

create table table1 (col1 int8, col2 text, col3 text);
--schema should be same for target table schema
Insert overwrite into table1 select * from table2;
                     (or)
Insert overwrite into LOCATION '/dir/subdir' select * from table;

Tajo 的插入语句与 SQL 的 INSERT INTO SELECT 语句类似。

Tajo’s insert statement is similar to the INSERT INTO SELECT statement of SQL.

Query

我们来创建一张表以覆盖现有表的数据。

Let’s create a table to overwrite table data of an existing table.

sampledb> create table test(sno int,name text,addr text,age int,mark int);
OK
sampledb> \d

Result

上述查询将生成以下结果。

The above query will generate the following result.

mytable
test

Insert Records

如需在“test”表中插入记录，请键入以下查询。

To insert records in the “test” table, type the following query.

Query

sampledb> insert overwrite into test select * from mytable;

Result

上述查询将生成以下结果。

The above query will generate the following result.

Progress: 100%, response time: 0.518 sec

在此，“mytable”记录覆盖“test”表。如果您不想创建“test”表，则直接分配物理路径位置，如插入查询的替代选项中所述。

Here, “mytable" records overwrite the “test” table. If you don’t want to create the “test” table, then straight away assign the physical path location as mentioned in an alternative option for insert query.

Fetch records

使用以下查询列出“test”表中的所有记录 −

Use the following query to list out all the records in the “test” table −

Query

sampledb> select * from test;

Result

上述查询将生成以下结果。

The above query will generate the following result.

此语句用于添加、删除或修改现有表的列。

This statement is used to add, remove or modify columns of an existing table.

如需重命名表，请使用以下语法 −

To rename the table use the following syntax −

Alter table table1 RENAME TO table2;

Query

sampledb> alter table test rename to students;

Result

上述查询将生成以下结果。

The above query will generate the following result.

OK

如需检查更改的表名，请使用以下查询。

To check the changed table name, use the following query.

sampledb> \d
mytable
students

现在“test”表已更改为“students”表。

Now the table “test” is changed to “students” table.

Add Column

如需在“students”表中插入新列，请键入以下语法 −

To insert new column in the “students” table, type the following syntax −

Alter table <table_name> ADD COLUMN <column_name> <data_type>

Query

sampledb> alter table students add column grade text;

Result

上述查询将生成以下结果。

The above query will generate the following result.

OK

Set Property

此属性用于更改表的属性。

This property is used to change the table’s property.

Query

sampledb> ALTER TABLE students SET PROPERTY 'compression.type' = 'RECORD',
   'compression.codec' = 'org.apache.hadoop.io.compress.Snappy Codec' ;
OK

在此，分配压缩类型和编解码器属性。

Here, compression type and codec properties are assigned.

如需更改文本分隔符属性，请使用以下 −

To change the text delimiter property, use the following −

Query

ALTER TABLE students  SET PROPERTY ‘text.delimiter'=',';
OK

Result

上述查询将生成以下结果。

The above query will generate the following result.

sampledb> \d students
table name: sampledb.students
table uri: file:/tmp/tajo-user1/warehouse/sampledb/students
store type: TEXT
number of rows: 10
volume: 228 B
Options:
'compression.type' = 'RECORD'
'timezone' = 'Asia/Kolkata'
'text.null' = '\\N'
'compression.codec' = 'org.apache.hadoop.io.compress.SnappyCodec'
'text.delimiter' = ','
schema:
id INT4
name TEXT
addr TEXT
age INT4
mark INT4
grade TEXT

以上结果显示可以使用“SET”属性来更改表的属性。

The above result shows that the table’s properties are changed using the “SET” property.

Select Statement

SELECT 语句用于从数据库中选择数据。

The SELECT statement is used to select data from a database.

Select 语句的语法如下 -

The syntax for the Select statement is as follows −

SELECT [distinct [all]] * | <expression> [[AS] <alias>] [, ...]
   [FROM <table reference> [[AS] <table alias name>] [, ...]]
   [WHERE <condition>]
   [GROUP BY <expression> [, ...]]
   [HAVING <condition>]
   [ORDER BY <expression> [ASC|DESC] [NULLS (FIRST|LAST)] [, …]]

Where Clause

Where 子句用于从表中筛选记录。

The Where clause is used to filter records from the table.

Query

sampledb> select * from mytable where id > 5;

Result

上述查询将生成以下结果。

The above query will generate the following result.

查询返回 id 大于 5 的那些学生的记录。

The query returns the records of those students whose id is greater than 5.

Query

sampledb> select * from mytable where name = ‘Peter’;

Result

上述查询将生成以下结果。

The above query will generate the following result.

Progress: 100%, response time: 0.117 sec

  id,  name,  address,   age
-------------------------------
10,  Peter,  16 park avenue , 12

该结果只筛选了 Peter 的记录。

The result filters Peter’s records only.

Distinct Clause

一个表列可能包含重复的值。DISTINCT 关键字可用于仅返回不同的值。

A table column may contain duplicate values. The DISTINCT keyword can be used to return only distinct (different) values.

Syntax

SELECT DISTINCT column1,column2 FROM table_name;

Query

sampledb> select distinct age from mytable;

Result

上述查询将生成以下结果。

The above query will generate the following result.

Progress: 100%, response time: 0.216 sec
age
-------------------------------
13
12

该查询返回 mytable 中不同的学生年龄。

The query returns the distinct age of students from mytable.

Group By Clause

GROUP BY 子句与 SELECT 语句配合使用，将相同的数据整理成组。

The GROUP BY clause is used in collaboration with the SELECT statement to arrange identical data into groups.

Syntax

SELECT column1, column2 FROM table_name WHERE [ conditions ] GROUP BY column1, column2;

Query

select age,sum(mark) as sumofmarks from mytable group by age;

Result

上述查询将生成以下结果。

The above query will generate the following result.

age,  sumofmarks
-------------------------------
13,  145
12,  610

此处，“mytable”列有两种类型不同的年龄 - 12 和 13。现在，该查询按年龄对记录进行分组，并对学生相应年龄的成绩进行求和。

Here, the “mytable" column has two types of ages — 12 and 13. Now the query groups the records by age and produces the sum of marks for the corresponding ages of students.

Having Clause

HAVING 子句使你能够指定过滤哪些组结果出现在最终结果中的条件。WHERE 子句对所选列设置条件，而 HAVING 子句则对 GROUP BY 子句创建的组设置条件。

The HAVING clause enables you to specify conditions that filter which group results appear in the final results. The WHERE clause places conditions on the selected columns, whereas the HAVING clause places conditions on the groups created by the GROUP BY clause.

Syntax

SELECT column1, column2 FROM table1 GROUP BY column HAVING [ conditions ]

Query

sampledb> select age from mytable group by age  having  sum(mark) > 200;

Result

上述查询将生成以下结果。

The above query will generate the following result.

age
-------------------------------
12

该查询按年龄对记录进行分组并返回当条件结果 sum(mark) > 200 时的年龄。

The query groups the records by age and returns the age when the condition result sum(mark) > 200.

Order By Clause

ORDER BY 子句用于根据一列或多列以升序或降序对数据进行排序。Tajo 数据库默认按升序对查询结果进行排序。

The ORDER BY clause is used to sort the data in ascending or descending order, based on one or more columns. The Tajo database sorts query results in ascending order by default.

Syntax

SELECT column-list FROM table_name
[WHERE condition]
[ORDER BY column1, column2, .. columnN] [ASC | DESC];

Query

sampledb> select * from mytable where mark > 60 order by name desc;

Result

上述查询将生成以下结果。

The above query will generate the following result.

该查询以降序返回成绩大于 60 的那些学生的姓名。

The query returns the names of those students in descending order whose marks are greater than 60.

Create Index Statement

CREATE INDEX 语句用于在表中创建索引。索引用于快速检索数据。当前版本仅支持存储在 HDFS 上的纯文本格式的索引。

The CREATE INDEX statement is used to create indexes in tables. Index is used for fast retrieval of data. Current version supports index for only plain TEXT formats stored on HDFS.

Syntax

CREATE INDEX [ name ] ON table_name ( { column_name | ( expression ) }

Query

create index student_index on mytable(id);

Result

上述查询将生成以下结果。

The above query will generate the following result.

id
———————————————

要查看已为该列分配的索引，请键入以下查询。

To view assigned index for the column, type the following query.

default> \d mytable
table name: default.mytable
table uri: file:/Users/deiva/workspace/Tajo/students.csv
store type: TEXT
number of rows: unknown
volume: 307 B
Options:
   'timezone' = 'Asia/Kolkata'
   'text.null' = '\\N'
   'text.delimiter' = ','
schema:
id INT4
name TEXT
address TEXT
age INT4
mark INT4
Indexes:
"student_index" TWO_LEVEL_BIN_TREE (id ASC NULLS LAST )

此处，TWO_LEVEL_BIN_TREE 方法在 Tajo 中默认使用。

Here, TWO_LEVEL_BIN_TREE method is used by default in Tajo.

Drop Table Statement

Drop Table 语句用于从数据库中删除表。

The Drop Table Statement is used to drop a table from the database.

Syntax

drop table table name;

Query

sampledb> drop table mytable;

要检查该表是否已从该表中删除，请键入以下查询。

To check if the table has been dropped from the table, type the following query.

sampledb> \d mytable;

Result

上述查询将生成以下结果。

The above query will generate the following result.

ERROR: relation 'mytable' does not exist

您也可以使用 “\d” 命令检查查询以列出可用的 Tajo 表。

You can also check the query using “\d” command to list out the available Tajo tables.