Apache Presto: A Concise Tutorial

Apache Presto − Overview

Data analytics is the process of analyzing raw data to extract relevant information and support better decisions. Many organizations rely on it to make business decisions. Big data analytics involves very large volumes of data, which makes the process complex, so companies adopt different strategies to handle it.

For example, Facebook is one of the world’s leading data-driven companies and operates one of the largest data warehouses. Facebook stored its warehouse data in Hadoop for large-scale computation. When that data grew to petabytes, the company decided to build a new low-latency system. In 2012, Facebook team members designed “Presto” for interactive query analytics that stays fast even with petabytes of data.

What is Apache Presto?

Apache Presto is a distributed, parallel query execution engine optimized for low-latency, interactive query analysis. It scales from gigabytes to petabytes of data without downtime.

A single Presto query can process data from multiple sources, such as HDFS, MySQL, Cassandra, Hive, and many others. Presto is written in Java and integrates easily with other data infrastructure components. Leading companies such as Airbnb, Dropbox, Groupon, and Netflix have adopted it.
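
To make the multi-source idea concrete, here is a toy Python sketch. It is not Presto's actual API: the `Connector` class, the `hive.orders` and `mysql.customers` names, and the sample rows are all hypothetical. It only illustrates how one query-like function can join rows that come from two different systems through a common interface.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass
class Connector:
    """Hypothetical stand-in for a data source: column names plus rows."""
    name: str
    columns: List[str]
    rows: List[tuple]

    def scan(self) -> Iterator[Dict[str, object]]:
        # Yield each row as a column-name -> value mapping.
        for row in self.rows:
            yield dict(zip(self.columns, row))


def join(left: Connector, right: Connector, key: str) -> List[Dict[str, object]]:
    """Hash join: build a lookup table on `right`, then probe it with `left`."""
    lookup: Dict[object, List[Dict[str, object]]] = {}
    for right_row in right.scan():
        lookup.setdefault(right_row[key], []).append(right_row)
    joined = []
    for left_row in left.scan():
        for right_row in lookup.get(left_row[key], []):
            joined.append({**left_row, **right_row})
    return joined


# Hypothetical sample data living in two different systems.
orders = Connector("hive.orders", ["customer_id", "amount"],
                   [(1, 30.0), (2, 12.5), (1, 7.5)])
customers = Connector("mysql.customers", ["customer_id", "name"],
                      [(1, "Ada"), (2, "Lin")])

result = join(orders, customers, "customer_id")
for row in result:
    print(row["name"], row["amount"])
```

In Presto itself, the equivalent would be a single SQL statement that joins tables from different catalogs. The engine, not the analyst, moves the data between sources.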

Presto − Features

Presto offers the following features:

  1. Simple and extensible architecture.

  2. Pluggable connectors - Presto supports pluggable connectors that provide metadata and data for queries.

  3. Pipelined execution - Avoids unnecessary I/O latency overhead.

  4. User-defined functions - Analysts can create custom user-defined functions, which makes migration easier.

  5. Vectorized columnar processing.
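
The pipelined-execution feature in the list above can be illustrated with a toy sketch. This is not Presto's implementation. It uses Python generators, with hypothetical `scan`, `filter_rows`, and `limit` stages, to show how streaming rows between stages lets a downstream stage stop upstream reads early instead of materializing every intermediate result.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List

reads: List[int] = []  # records which rows the scan stage actually produced


def scan(n: int) -> Iterator[int]:
    """Hypothetical source stage: yields rows 0..n-1 one at a time."""
    for i in range(n):
        reads.append(i)
        yield i


def filter_rows(rows: Iterable[int], predicate: Callable[[int], bool]) -> Iterator[int]:
    """Filter stage: passes matching rows downstream as soon as they arrive."""
    return (row for row in rows if predicate(row))


def limit(rows: Iterable[int], count: int) -> List[int]:
    """Limit stage: stops pulling from upstream after `count` rows."""
    return list(islice(rows, count))


# Roughly "SELECT x FROM t WHERE x % 2 = 0 LIMIT 3" over a million-row table.
result = limit(filter_rows(scan(1_000_000), lambda x: x % 2 == 0), 3)
print(result)      # the first three even rows
print(len(reads))  # how many rows the scan stage had to produce
```

Because each stage pulls rows on demand, the scan produces only as many rows as the limit needs, rather than reading the whole table first.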

Presto − Benefits

Apache Presto offers the following benefits:

  1. Specialized SQL operations

  2. Easy to install and debug

  3. Simple storage abstraction

  4. Scales quickly to petabytes of data with low latency

Presto − Applications

Presto powers many of today’s leading industrial applications. Here are some notable examples.

  1. Facebook − Facebook built Presto for its data analytics needs. Presto scales easily to handle high-velocity data.

  2. Teradata − Teradata provides end-to-end solutions in Big Data analytics and data warehousing. Teradata’s contributions to Presto make it easier for more companies to meet their analytical needs.

  3. Airbnb − Presto is an integral part of the Airbnb data infrastructure. Hundreds of employees run queries with it each day.

Why Presto?

Presto supports standard ANSI SQL, which makes it easy for data analysts and developers to use. Although it is written in Java, it avoids the typical Java problems related to memory allocation and garbage collection. Presto has a Hadoop-friendly connector architecture that makes it easy to plug in file systems.

Presto runs on multiple Hadoop distributions. In addition, it can reach out from a Hadoop platform to query Cassandra, relational databases, or other data stores. This cross-platform analytic capability lets Presto users extract maximum business value from data ranging from gigabytes to petabytes.