Architecture of Transformers in Generative AI

Large Language Models (LLMs) based on transformers have outperformed the earlier Recurrent Neural Networks (RNNs) in various tasks like sentiment analysis, machine translation, text summarization, etc.

Transformers derive their unique capabilities from their architecture. This chapter will explain the main ideas of the original transformer model in simple terms to make it easier to understand.

We will focus on the key components that make up the transformer: the encoder, the decoder, and the unique attention mechanism that connects them both.

How do Transformers Work in Generative AI?

Let’s understand how a transformer works −

  1. First, when we provide a sentence to the transformer, it pays extra attention to the important words in that sentence.

  2. It then considers all the words simultaneously rather than one after another, which helps the transformer find the dependencies between the words in that sentence (a minimal sketch of this idea follows this list).

  3. After that, it finds the relationships between the words in that sentence. For example, if a sentence is about stars and galaxies, it knows that these words are related.

  4. Once done, the transformer uses this knowledge to understand the complete story and how words connect with each other.

  5. With this understanding, the transformer can even predict which word might come next.
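
To make the idea of attending to all words at once more concrete, here is a minimal, illustrative sketch of scaled dot-product attention in NumPy. The toy sentence, embedding size, and random vectors are hypothetical and only serve to show how every word scores every other word simultaneously.

```python
import numpy as np

# Hypothetical toy example: 4 "words", each represented by an 8-dimensional vector.
np.random.seed(0)
words = ["stars", "and", "galaxies", "shine"]
d_model = 8
x = np.random.randn(len(words), d_model)   # (4, 8) word embeddings

# In the simplest case, queries, keys and values are the embeddings themselves.
q, k, v = x, x, x

# Scaled dot-product attention: every word scores every other word at once.
scores = q @ k.T / np.sqrt(d_model)                                     # (4, 4) pairwise similarity
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # softmax over each row
output = weights @ v                                                    # each word becomes a weighted mix

print(np.round(weights, 2))   # row i: how strongly word i attends to every other word
```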

Transformer Architecture in Generative AI

The transformer has two main components: the encoder and the decoder. Below is a simplified architecture of the transformer −

[Figure: Simplified architecture of the transformer]

As you can see in the diagram, on the left side of the transformer, the input enters the encoder. The input is first converted to input embeddings and then passes through an attention sub-layer and a FeedForward Network (FFN) sub-layer. Similarly, on the right side, the target output enters the decoder.

The output is also first converted to output embeddings and then passes through two attention sub-layers and a FeedForward Network (FFN) sub-layer. In this architecture, there is no RNN, LSTM, or CNN. Recurrence has been discarded and replaced by the attention mechanism.
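
As a rough sanity check of this encoder-decoder layout, the sketch below instantiates PyTorch's built-in torch.nn.Transformer, which follows the same overall design (attention plus FFN sub-layers, no recurrence). The tensor sizes and random inputs are arbitrary placeholders, not values from this tutorial.

```python
import torch
import torch.nn as nn

# A transformer with the dimensions used in the original paper:
# 6 encoder layers, 6 decoder layers, model size 512, 8 attention heads.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

# Dummy data: a source sequence of 10 tokens and a target sequence of 7 tokens,
# batch size 2, already converted to 512-dimensional embeddings.
src = torch.rand(10, 2, 512)   # (source length, batch, d_model)
tgt = torch.rand(7, 2, 512)    # (target length, batch, d_model)

out = model(src, tgt)
print(out.shape)               # torch.Size([7, 2, 512])
```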

Let's discuss the two main components of the transformer, i.e., the encoder and the decoder, in detail.

A Layer of the Encoder Stack of the Transformer

In the transformer, the encoder processes the input sequence and breaks it down into meaningful representations. The encoder of the transformer model is a stack of layers, where each encoder layer has the following structure −

[Figure: Structure of one layer of the encoder stack]

This encoder layer structure remains the same for all the layers of the Transformer model. Each layer of the encoder stack contains the following two sub-layers −

  1. A multi-head attention mechanism

  2. FeedForward Network (FFN)

As we can see in the above diagram, there is a residual connection around both sub-layers, i.e., the multi-head attention mechanism and the FeedForward Network. The job of these residual connections is to send the unprocessed input x of a sub-layer to a layer normalization function.

In this way, the normalized output of each layer can be calculated as below −

Layer Normalization (x + Sublayer(x))
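
To show how Layer Normalization (x + Sublayer(x)) is applied in practice, here is a minimal sketch of one encoder layer in PyTorch. It is a simplified illustration (post-layer-norm, no dropout), not the exact implementation of any particular library; the class name EncoderLayer and all dimensions are assumptions made for this example.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: multi-head attention + FFN, each wrapped in
    a residual connection followed by layer normalization."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Sub-layer 1: LayerNorm(x + MultiHeadAttention(x))
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Sub-layer 2: LayerNorm(x + FFN(x))
        x = self.norm2(x + self.ffn(x))
        return x

layer = EncoderLayer()
tokens = torch.rand(2, 10, 512)   # (batch, sequence length, d_model)
print(layer(tokens).shape)        # torch.Size([2, 10, 512])
```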

We will discuss the sub-layers, i.e., multi-head attention and FFN, as well as input embeddings, positional encodings, normalization, and residual connections in detail in subsequent chapters.

A Layer of the Decoder Stack of the Transformer

In the transformer, the decoder takes the representations generated by the encoder and processes them to generate the output sequence, such as a translation or a text continuation. Like the encoder, the decoder of the transformer model is also a stack of layers, where each decoder layer has the following structure −

[Figure: Structure of one layer of the decoder stack]

Like the encoder layer, the decoder layer structure remains the same for all N = 6 layers of the Transformer model. Each layer of the decoder stack contains the following three sub-layers −

  1. Masked multi-head attention mechanism

  2. A multi-head attention mechanism

  3. FeedForward Network (FFN)

Unlike the encoder, the decoder has a third sub-layer called masked multi-head attention, in which the words following a given position are masked. The advantage of this sub-layer is that the transformer makes its predictions based only on the words it has seen so far, without looking at the rest of the sequence.
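
One common way to implement this masking is with an upper-triangular "causal" mask that blocks attention to later positions. The sketch below is only an illustrative example under assumed sizes; the embedding dimension and sequence length are arbitrary.

```python
import torch

seq_len = 5
# True above the diagonal: position i is not allowed to attend to positions j > i.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(causal_mask)

# The mask is passed to the masked multi-head attention sub-layer, e.g.:
attn = torch.nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
x = torch.rand(1, seq_len, 16)
out, weights = attn(x, x, x, attn_mask=causal_mask)
print(torch.round(weights, decimals=2))   # each row has zero weight on future positions
```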

Like the encoder, there is a residual connection around all the sub-layers and the normalized output of each layer can be calculated as below −

Layer Normalization (x + Sublayer(x))
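
Putting the three sub-layers together, here is a minimal sketch of one decoder layer using the same LayerNorm(x + Sublayer(x)) pattern. Like the encoder sketch above, it is a simplified illustration rather than a reference implementation; the class name DecoderLayer and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder layer: masked self-attention, encoder-decoder attention,
    and an FFN, each followed by a residual connection and layer norm."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, memory):
        # Sub-layer 1: masked multi-head self-attention over the target generated so far.
        mask = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool), diagonal=1)
        out, _ = self.self_attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + out)
        # Sub-layer 2: encoder-decoder attention over the encoder output (memory).
        out, _ = self.cross_attn(x, memory, memory)
        x = self.norm2(x + out)
        # Sub-layer 3: position-wise FeedForward Network.
        x = self.norm3(x + self.ffn(x))
        return x

layer = DecoderLayer()
memory = torch.rand(2, 10, 512)      # encoder output: (batch, source length, d_model)
target = torch.rand(2, 7, 512)       # target embeddings: (batch, target length, d_model)
print(layer(target, memory).shape)   # torch.Size([2, 7, 512])
```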

As we can see in the above diagram, after all the decoder blocks there is a final linear layer. The role of this linear layer is to map the data to the desired output vocabulary size. A softmax function is then applied to the mapped data to generate a probability distribution over the target vocabulary. This will result in the final output sequence.
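
To illustrate this final projection step, here is a small sketch that maps the decoder output to vocabulary-sized logits and applies softmax. The vocabulary size, model size, and random tensors are hypothetical placeholders.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10000             # assumed model and vocabulary sizes
decoder_output = torch.rand(2, 7, d_model)   # (batch, target length, d_model)

# The final linear layer maps each position to a score for every word in the vocabulary.
to_vocab = nn.Linear(d_model, vocab_size)
logits = to_vocab(decoder_output)            # (2, 7, 10000)

# Softmax turns the scores into a probability distribution over the target vocabulary.
probs = torch.softmax(logits, dim=-1)
next_tokens = probs.argmax(dim=-1)           # greedy choice of the most likely word ids
print(probs.shape, next_tokens.shape)        # torch.Size([2, 7, 10000]) torch.Size([2, 7])
```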

Conclusion

In this chapter, we explained in detail the architecture of transformers in Generative AI. We mainly focused on its two main parts: the encoder and the decoder.

The role of the encoder is to understand the input sequence by looking at the relationships between all the words. It uses self-attention and feed-forward layers to create a detailed representation of the input.

The decoder takes the detailed representations of the input and generates the output sequence. It uses masked self-attention so that each position can only attend to earlier positions while generating the sequence, and it utilizes encoder-decoder attention to integrate the information from the encoder.

By exploring how the encoder and the decoder work, we see how Transformers have fundamentally transformed the field of natural language processing (NLP). It is the encoder-decoder structure that makes the Transformer so powerful and effective across various industries and transforms the way we interact with AI systems.