Input Embeddings in Transformers
The two main components of a Transformer, the encoder and the decoder, contain various mechanisms and sub-layers. In the Transformer architecture, the first sub-layer is the input embedding.
Input embeddings are a crucial component that serves as the initial representation of the input data. Before text is fed into the model for processing, these embeddings convert raw text data, such as words or subwords, into a format the model can process.
Read this chapter to understand what input embeddings are, why they are important, and how they are implemented in Transformers, with Python examples to illustrate these concepts.
What are Input Embeddings?
Input embeddings are vector representations of discrete tokens such as words, subwords, or characters. These vectors capture the semantic meaning of the tokens, which enables the model to understand and manipulate text data effectively.
The role of the input embedding sub-layer in a Transformer is to map the input tokens into a high-dimensional space of dimension $\mathrm{d_{model}} = 512$, where similar tokens have similar vector representations.
Importance of Input Embeddings in Transformers
Let's now understand why input embeddings are important in Transformers −
Semantic Representation
Input embeddings capture semantic similarities between the words of the input text. For example, words like "king" and "queen", or "cat" and "dog", have vectors that are close to each other in the embedding space.
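As a minimal sketch of this idea, the snippet below computes cosine similarity between a few hand-made 4-dimensional vectors. The vectors are invented purely for illustration and are not real trained embeddings −

import numpy as np

# Hypothetical embeddings, invented only to illustrate similarity
embeddings = {
    "king":  np.array([0.90, 0.80, 0.10, 0.30]),
    "queen": np.array([0.88, 0.82, 0.15, 0.28]),
    "cat":   np.array([0.10, 0.20, 0.90, 0.70]),
}

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # close to 1.0
print(cosine_similarity(embeddings["king"], embeddings["cat"]))    # noticeably smaller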
Dimensionality Reduction
Traditional one-hot encoding represents each token as a binary vector with a single high value, which requires a large amount of space. Input embeddings, on the other hand, reduce this computational complexity by providing a compact and dense representation.
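To make the contrast concrete, the short sketch below compares the size of a one-hot vector with that of a dense embedding. The vocabulary size of 10,000 is a toy figure chosen only for illustration −

import numpy as np

vocab_size = 10000   # toy vocabulary size, for illustration only
embed_dim = 512      # embedding dimension used throughout this chapter

# One-hot encoding: a vector as long as the vocabulary with a single 1
one_hot = np.zeros(vocab_size)
one_hot[42] = 1.0    # arbitrary token index

# Dense embedding: a much shorter, real-valued vector
dense_embedding = np.random.rand(embed_dim)

print(one_hot.shape)           # (10000,) - sparse, mostly zeros
print(dense_embedding.shape)   # (512,)   - compact and dense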
Working of the Input Embedding Sublayer
The input embedding sub-layer works like that of other standard transduction models and includes the following steps −
Step 1: Tokenization
Before the input tokens can be embedded, the raw text data has to be tokenized. Tokenization is the process of splitting text into smaller units such as words or subwords.
Let's look at both kinds of tokenization (a short sketch follows the list below) −
- Word-level tokenization − As the name implies, it splits the text into individual words.
- Subword-level tokenization − As the name implies, it splits the text into smaller units that can be parts of a word. This kind of tokenization is used in models like BERT and GPT to handle rare words and misspellings. For example, a subword-level tokenizer applied to the text "Transformers revolutionized the field of NLP" would produce tokens such as ["Transformers", "revolutionized", "the", "field", "of", "NLP"], with less common words split further into smaller pieces.
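The following minimal sketch shows word-level tokenization with plain Python. The naive_subwords helper is a hypothetical stand-in that only hints at the idea of subword splitting; real models use trained BPE or WordPiece tokenizers −

# Word-level tokenization: simple whitespace splitting
text = "Transformers revolutionized the field of NLP"
word_tokens = text.split()
print(word_tokens)   # ['Transformers', 'revolutionized', 'the', 'field', 'of', 'NLP']

# Naive stand-in for subword splitting (illustration only, not BPE/WordPiece)
def naive_subwords(word, size=6):
    # Break a word into fixed-size chunks
    return [word[i:i + size] for i in range(0, len(word), size)]

print(naive_subwords("Transformers"))   # ['Transf', 'ormers']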
Step 2: Embedding Layer
The second step is the embedding layer, which is basically a lookup table that maps each token to a dense vector of fixed dimension. This process relies on the following two components −
- Vocabulary − The set of unique tokens recognized by the model.
- Embedding Dimension − The size of the vector space in which tokens are represented, for example 512.
When a token is passed to the embedding layer, it returns the corresponding dense vector from the embedding matrix.
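For reference, the same lookup is available as a trainable layer in deep learning frameworks. The sketch below uses PyTorch's nn.Embedding, assuming PyTorch is installed; its weights are randomly initialized here and would normally be learned during training −

import torch
import torch.nn as nn

vocab_size = 6     # size of the toy vocabulary used in the example below
embed_dim = 512    # dimension of each embedding vector

# A lookup table of shape (vocab_size, embed_dim) with trainable weights
embedding_layer = nn.Embedding(vocab_size, embed_dim)

token_indices = torch.tensor([0, 1, 2, 3, 4, 5])   # indices of the tokens in a sentence
embeddings = embedding_layer(token_indices)

print(embeddings.shape)   # torch.Size([6, 512])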
How are Input Embeddings Implemented in a Transformer?
Given below is a Python example that illustrates how input embeddings can be implemented in a Transformer −
Example
import numpy as np
# Example text and tokenization
text = "Transformers revolutionized the field of NLP"
tokens = text.split()
# Creating a vocabulary
vocab = {word: idx for idx, word in enumerate(tokens)}
# Example input (sequence of token indices)
input_indices = np.array([vocab[word] for word in tokens])
print("Vocabulary:", vocab)
print("Input Indices:", input_indices)
# Parameters
vocab_size = len(vocab)
embed_dim = 512 # Dimension of the embeddings
# Initialize the embedding matrix with random values
embedding_matrix = np.random.rand(vocab_size, embed_dim)
# Get the embeddings for the input indices
input_embeddings = embedding_matrix[input_indices]
print("Embedding Matrix:\n", embedding_matrix)
print("Input Embeddings:\n", input_embeddings)
The above Python script first splits the text into tokens and creates a vocabulary that maps each word to a unique index. It then initializes an embedding matrix with random values, where each row corresponds to the embedding of one word. We use an embedding dimension of 512. Running the script produces output similar to the following (the random values will differ on each run) −
Vocabulary: {'Transformers': 0, 'revolutionized': 1, 'the': 2, 'field': 3, 'of': 4, 'NLP': 5}
Input Indices: [0 1 2 3 4 5]
Embedding Matrix:
[[0.29083668 0.70830247 0.22773598 ... 0.62831348 0.90910366 0.46552784]
[0.01269533 0.47420163 0.96242738 ... 0.38781376 0.33485277 0.53721033]
[0.62287977 0.09313765 0.54043664 ... 0.7766359 0.83653342 0.75300144]
[0.32937143 0.51701913 0.39535506 ... 0.60957358 0.22620172 0.60341522]
[0.65193484 0.25431826 0.55643452 ... 0.76561879 0.24922971 0.96247851]
[0.78385765 0.58940282 0.71930539 ... 0.61332926 0.24710099 0.5445185 ]]
Input Embeddings:
[[0.29083668 0.70830247 0.22773598 ... 0.62831348 0.90910366 0.46552784]
[0.01269533 0.47420163 0.96242738 ... 0.38781376 0.33485277 0.53721033]
[0.62287977 0.09313765 0.54043664 ... 0.7766359 0.83653342 0.75300144]
[0.32937143 0.51701913 0.39535506 ... 0.60957358 0.22620172 0.60341522]
[0.65193484 0.25431826 0.55643452 ... 0.76561879 0.24922971 0.96247851]
[0.78385765 0.58940282 0.71930539 ... 0.61332926 0.24710099 0.5445185 ]]
Conclusion
The input embedding sub-layer converts raw text data, such as words or subwords, into a format that the model can process. We also explained why input embeddings are important for the successful working of a Transformer.
By capturing semantic similarities between words and reducing computational complexity through a compact and dense representation, this sub-layer ensures that the model can effectively learn patterns and relationships in the data.
We also provided a Python implementation example covering the fundamental steps required to transform raw text data into a format suitable for further processing in a Transformer model.
Understanding and implementing the input embedding sub-layer is crucial for effectively using Transformer models for NLP tasks.