Talend Tutorial

Talend - Big Data

The tag line for Open Studio with Big Data is “Simplify ETL and ELT with the leading free open source ETL tool for big data.” In this chapter, let us look into the usage of Talend as a tool for processing data in a big data environment.

Introduction

Talend Open Studio – Big Data is a free and open source tool for processing your data very easily in a big data environment. It offers plenty of big data components that let you create and run Hadoop jobs just by dragging and dropping a few Hadoop components.

Besides, you do not need to write long MapReduce code yourself; Talend Open Studio for Big Data does this for you with the components it provides. It automatically generates the MapReduce code − you just need to drag and drop the components and configure a few parameters.
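To give a sense of what that saves you, below is a minimal hand-written sketch of a classic Hadoop word-count job using the standard MapReduce Java API. This is illustrative only − it is not the code Talend generates, and the class name and paths are placeholders.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// A classic hand-written word-count job: the kind of boilerplate that
// Talend Open Studio for Big Data spares you from writing by hand.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1) for every token
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();            // add up the counts for this word
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input HDFS path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output HDFS path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

In Talend, the equivalent job is built visually and the corresponding code is generated and run for you.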

It also gives you the option to connect to several Big Data distributions such as Cloudera, Hortonworks, MapR, Amazon EMR and even plain Apache Hadoop.

Talend Components for Big Data

The categories of components used to run a job in a Big Data environment, grouped under Big Data, are shown below −

(Figure: Big Data component categories)

The list of Big Data connectors and components in Talend Open Studio is shown below −

  1. tHDFSConnection − Used for connecting to HDFS (Hadoop Distributed File System). A sketch of the HDFS API calls that the tHDFS* components wrap is given after this list.

  2. tHDFSInput − Reads the data from the given HDFS path, puts it into the Talend schema and then passes it to the next component in the job.

  3. tHDFSList − Retrieves all the files and folders in the given HDFS path.

  4. tHDFSPut − Copies a file/folder from the local file system (user-defined) to HDFS at the given path.

  5. tHDFSGet − Copies a file/folder from HDFS to the local file system (user-defined) at the given path.

  6. tHDFSDelete − Deletes the file from HDFS.

  7. tHDFSExist − Checks whether a file is present on HDFS or not.

  8. tHDFSOutput − Writes data flows on HDFS.

  9. tCassandraConnection − Opens the connection to the Cassandra server.

  10. tCassandraRow − Runs CQL (Cassandra Query Language) queries on the specified database.

  11. tHBaseConnection − Opens the connection to the HBase database.

  12. tHBaseInput − Reads data from the HBase database.

  13. tHiveConnection − Opens the connection to the Hive database.

  14. tHiveCreateTable − Creates a table inside a Hive database.

  15. tHiveInput − Reads data from the Hive database.

  16. tHiveLoad − Writes data to a Hive table or to a specified directory.

  17. tHiveRow − Runs HiveQL queries on the specified database.

  18. tPigLoad − Loads input data to the output stream.

  19. tPigMap − Used for transforming and routing the data in a Pig process.

  20. tPigJoin − Performs a join operation on two files based on join keys.

  21. tPigCoGroup − Groups and aggregates the data coming from multiple inputs.

  22. tPigSort − Sorts the given data based on one or more defined sort keys.

  23. tPigStoreResult − Stores the result from a Pig operation in a defined storage space.

  24. tPigFilterRow − Filters the specified columns in order to split the data based on the given condition.

  25. tPigDistinct − Removes duplicate tuples from the relation.

  26. tSqoopImport − Transfers data from a relational database such as MySQL or Oracle DB to HDFS.

  27. tSqoopExport − Transfers data from HDFS to a relational database such as MySQL or Oracle DB.
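As a rough illustration of what the tHDFS* components above do behind the scenes, the sketch below uses the standard Hadoop FileSystem Java API to connect, put, check, get and delete a file. The NameNode URI and the file paths are illustrative placeholders, not values Talend itself supplies, and the mapping to components is approximate.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of the HDFS operations wrapped by the tHDFS* components,
// using the standard Hadoop FileSystem API. URIs and paths are placeholders.
public class HdfsOperationsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");        // ~ tHDFSConnection
        FileSystem fs = FileSystem.get(conf);

        Path local  = new Path("/tmp/customers.csv");             // local file (user-defined)
        Path remote = new Path("/user/talend/customers.csv");     // target HDFS path

        fs.copyFromLocalFile(local, remote);                      // ~ tHDFSPut

        if (fs.exists(remote)) {                                  // ~ tHDFSExist
            fs.copyToLocalFile(remote, new Path("/tmp/copy_back.csv"));  // ~ tHDFSGet
        }

        fs.delete(remote, false);                                 // ~ tHDFSDelete (non-recursive)
        fs.close();
    }
}

In Talend you configure the same connection details and paths in the component settings instead of writing this code yourself.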