Natural Language Processing: A Concise Tutorial

Natural Language Processing - Syntactic Analysis

Syntactic analysis, also called parsing or syntax analysis, is the third phase of NLP. The purpose of this phase is to check the text against the rules of a formal grammar and to recover its grammatical structure. It should not be confused with semantic analysis, which checks the text for meaningfulness: a phrase like “hot ice-cream” is grammatically well formed, so it would be rejected by the semantic analyzer rather than by the syntactic one.

In this sense, syntactic analysis or parsing may be defined as the process of analyzing strings of symbols in natural language for conformance to the rules of a formal grammar. The word ‘parsing’ comes from the Latin word ‘pars’, which means ‘part’.

Concept of Parser

A parser is the software component that implements the task of parsing. It takes input data (text), checks it for correct syntax according to a formal grammar, and returns a structural representation of the input. That representation is usually a parse tree, an abstract syntax tree, or some other hierarchical data structure.

[Figure: symbol table]

The main roles of the parser include −

  1. To report any syntax error.

  2. To recover from commonly occurring errors so that processing of the remainder of the input can continue.

  3. To create parse tree.

  4. To create symbol table.

  5. To produce intermediate representations (IR).

Types of Parsing

Derivation divides parsing into the following two types −

  1. Top-down Parsing

  2. Bottom-up Parsing

Top-down Parsing

In this kind of parsing, the parser starts constructing the parse tree from the start symbol and then tries to transform the start symbol into the input. The most common form of top-down parsing, recursive descent, uses recursive procedures to process the input. Its main disadvantage is that it may need to backtrack when a chosen production does not match the input.
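The idea can be illustrated with a minimal sketch of a backtracking top-down parser. The toy grammar, the tuple-based tree format, and the function names (`expand`, `expand_sequence`, `parse_top_down`) are assumptions made for this example only. The grammar has no left recursion, which this naive approach would not handle.

```python
# Toy grammar (illustrative assumption): non-terminal -> list of alternatives.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
    "Det": [["the"], ["a"]],
    "N": [["dog"], ["cat"]],
    "V": [["sees"], ["sleeps"]],
}


def expand(symbol, tokens, pos):
    """Yield (tree, next_pos) for every way `symbol` can match tokens[pos:]."""
    if symbol not in GRAMMAR:  # terminal: must equal the next token
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for alternative in GRAMMAR[symbol]:  # try each production in turn
        for children, end in expand_sequence(alternative, tokens, pos):
            yield (symbol, children), end


def expand_sequence(symbols, tokens, pos):
    """Yield (subtrees, next_pos) for a sequence of grammar symbols."""
    if not symbols:
        yield [], pos
        return
    for first_tree, mid in expand(symbols[0], tokens, pos):
        for rest_trees, end in expand_sequence(symbols[1:], tokens, mid):
            yield [first_tree] + rest_trees, end


def parse_top_down(sentence):
    """Return the first tree that covers the whole sentence, or None."""
    tokens = sentence.split()
    for tree, end in expand("S", tokens, 0):
        if end == len(tokens):
            return tree
    return None
```

For "the dog sleeps", the parser first tries VP → V NP, fails to find a noun phrase after the verb, and backtracks to VP → V. The generators make that backtracking implicit: a failed alternative simply yields nothing.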

Bottom-up Parsing

In this kind of parsing, the parser starts with the input symbols and tries to construct the parse tree upward until it reaches the start symbol.
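A small shift-reduce recognizer is one way to sketch the bottom-up direction. The rule list and the names `shift_reduce` and `recognize` are hypothetical, chosen for this example. To stay short, the code only answers whether the input reduces to S and does not build the tree.

```python
# Toy rules (illustrative assumption): (left-hand side, right-hand side).
RULES = [
    ("S", ("NP", "VP")),
    ("NP", ("Det", "N")),
    ("VP", ("V", "NP")),
    ("Det", ("the",)),
    ("N", ("dog",)),
    ("N", ("cat",)),
    ("V", ("sees",)),
]


def shift_reduce(stack, remaining):
    """Return True if stack + remaining tokens can be reduced to ['S']."""
    if stack == ["S"] and not remaining:
        return True
    # Reduce: replace a right-hand side on top of the stack by its left-hand side.
    for lhs, rhs in RULES:
        n = len(rhs)
        if len(stack) >= n and tuple(stack[-n:]) == rhs:
            if shift_reduce(stack[:-n] + [lhs], remaining):
                return True
    # Shift: move the next input token onto the stack.
    if remaining:
        return shift_reduce(stack + [remaining[0]], remaining[1:])
    return False


def recognize(sentence):
    """Bottom-up recognition of a whitespace-tokenized sentence."""
    return shift_reduce([], sentence.split())
```

The search starts from the words and works upward: "the" is reduced to Det, "dog" to N, the pair to NP, and so on until only S remains on the stack.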

Concept of Derivation

In order to obtain the input string, we need to apply a sequence of production rules. A derivation is such a sequence of rule applications. During parsing, we need to decide which non-terminal to replace and which production rule to replace it with.

Types of Derivation

In this section, we will learn about the two types of derivation, which determine which non-terminal is replaced by a production rule at each step −

Left-most Derivation

In the left-most derivation, the sentential form is scanned from left to right and the left-most non-terminal is replaced at each step. The sentential form in this case is called the left-sentential form.

Right-most Derivation

In the right-most derivation, the sentential form is scanned from right to left and the right-most non-terminal is replaced at each step. The sentential form in this case is called the right-sentential form.
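To make the difference between the two orders concrete, here is a small sketch that prints the sentential forms of both derivations. The three-rule grammar and the function name `derive` are assumptions for illustration. Each non-terminal has exactly one rule, so the only choice left is which non-terminal to expand.

```python
# Toy grammar (illustrative assumption): one production per non-terminal.
NONTERMINALS = {"S", "A", "B"}
PRODUCTIONS = {"S": ["A", "B"], "A": ["a"], "B": ["b"]}


def derive(start, leftmost=True):
    """Return the sentential forms from `start` down to a terminal string."""
    form = [start]
    steps = [" ".join(form)]
    while any(sym in NONTERMINALS for sym in form):
        positions = [i for i, sym in enumerate(form) if sym in NONTERMINALS]
        i = positions[0] if leftmost else positions[-1]  # pick which one to expand
        form = form[:i] + PRODUCTIONS[form[i]] + form[i + 1:]
        steps.append(" ".join(form))
    return steps


print(derive("S", leftmost=True))   # ['S', 'A B', 'a B', 'a b']
print(derive("S", leftmost=False))  # ['S', 'A B', 'A b', 'a b']
```

Both derivations end in the same string. They differ only in the intermediate sentential forms.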

Concept of Parse Tree

A parse tree may be defined as the graphical depiction of a derivation. The start symbol of the derivation serves as the root of the parse tree. In every parse tree, the leaf nodes are terminals and the interior nodes are non-terminals. A property of the parse tree is that reading its leaves from left to right produces the original input string.
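The leaf property is easy to check on a small tree. In this sketch a tree is a nested `(label, children)` tuple and a leaf is a plain string. That representation, the example tree, and the function name `leaves` are assumptions for illustration.

```python
# Hypothetical parse tree: interior nodes are (label, children), leaves are strings.
TREE = ("S", [
    ("NP", [("Det", ["the"]), ("N", ["dog"])]),
    ("VP", [("V", ["sleeps"])]),
])


def leaves(node):
    """Collect the terminal symbols of a tree from left to right."""
    if isinstance(node, str):  # leaf node: a terminal
        return [node]
    _label, children = node    # interior node: a non-terminal
    result = []
    for child in children:
        result.extend(leaves(child))
    return result


print(" ".join(leaves(TREE)))  # the dog sleeps
```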

Concept of Grammar

Grammar is essential for describing the syntactic structure of well-formed programs. In the literary sense, grammars denote the syntactic rules for conversation in natural languages. Linguists have attempted to define grammars since the inception of natural languages such as English and Hindi.

The theory of formal languages is also applicable in computer science, mainly in programming languages and data structures. For example, in the ‘C’ language, precise grammar rules state how functions are made from lists and statements.

A mathematical model of grammar was given by Noam Chomsky in 1956, which is effective for writing computer languages.

Mathematically, a grammar G can be formally written as a 4-tuple (N, T, S, P) where −

  1. N or VN = set of non-terminal symbols, i.e., variables.

  2. T or ∑ = set of terminal symbols.

  3. S = start symbol, where S ∈ N.

  4. P = set of production rules for terminals as well as non-terminals. Each rule has the form α → β, where α and β are strings over VN ∪ ∑ and at least one symbol of α belongs to VN.
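The 4-tuple maps naturally onto a small record type. The sketch below is one possible encoding, not a standard API: the class name `Grammar`, its field names, and the `is_well_formed` check are assumptions introduced here. The check simply restates the conditions from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    nonterminals: frozenset  # N (also written VN)
    terminals: frozenset     # T (also written Σ)
    start: str               # S
    productions: tuple       # P: pairs (alpha, beta), each a tuple of symbols

    def is_well_formed(self):
        """Check the conditions stated in the 4-tuple definition."""
        symbols = self.nonterminals | self.terminals
        if self.nonterminals & self.terminals:  # N and T must be disjoint
            return False
        if self.start not in self.nonterminals:  # S ∈ N
            return False
        for alpha, beta in self.productions:
            if not all(sym in symbols for sym in alpha + beta):
                return False
            # At least one symbol of alpha must be a non-terminal.
            if not any(sym in self.nonterminals for sym in alpha):
                return False
        return True


# Example grammar (illustrative): S -> a S b | a b
EXAMPLE = Grammar(
    nonterminals=frozenset({"S"}),
    terminals=frozenset({"a", "b"}),
    start="S",
    productions=((("S",), ("a", "S", "b")), (("S",), ("a", "b"))),
)
print(EXAMPLE.is_well_formed())  # True
```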

Phrase Structure or Constituency Grammar

Phrase structure grammar, introduced by Noam Chomsky, is based on the constituency relation. That is why it is also called constituency grammar. It stands in contrast to dependency grammar.

Example

Before giving an example of constituency grammar, we need to know the fundamental points about constituency grammar and the constituency relation.

  1. All the related frameworks view the sentence structure in terms of constituency relation.

  2. The constituency relation is derived from the subject-predicate division of Latin as well as Greek grammar.

  3. The basic clause structure is understood in terms of noun phrase NP and verb phrase VP.

We can write the sentence “This tree is illustrating the constituency relation” as follows −

[Figure: constituency-based parse tree of the sentence]

Dependency Grammar

Dependency grammar (DG) stands in contrast to constituency grammar and is based on the dependency relation. It was introduced by Lucien Tesnière. It differs from constituency grammar in that it lacks phrasal nodes.

Example

Before giving an example of dependency grammar, we need to know the fundamental points about dependency grammar and the dependency relation.

  1. In DG, the linguistic units, i.e., words are connected to each other by directed links.

  2. The verb becomes the center of the clause structure.

  3. All other syntactic units are connected to the verb by directed links. These links are called dependencies.

We can write the sentence “This tree is illustrating the dependency relation” as follows −

[Figure: dependency-based parse tree of the sentence]

A parse tree that uses constituency grammar is called a constituency-based parse tree, and a parse tree that uses dependency grammar is called a dependency-based parse tree.
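Because a dependency-based tree has no phrasal nodes, it can be stored as a simple mapping from each word to its head. The sketch below uses the short sentence "the dog sleeps". The analysis shown (verb as root, noun depending on the verb, determiner depending on the noun) and the helper names `root` and `dependents` are assumptions for illustration.

```python
# Each word maps to its head; the root verb has no head (None).
# The analysis is an illustrative assumption for "the dog sleeps".
HEADS = {"the": "dog", "dog": "sleeps", "sleeps": None}


def root(heads):
    """Return the word that has no head, i.e. the center of the clause."""
    return next(word for word, head in heads.items() if head is None)


def dependents(heads, head):
    """Return the words linked directly to `head`."""
    return [word for word, h in heads.items() if h == head]


print(root(HEADS))                  # sleeps
print(dependents(HEADS, "sleeps"))  # ['dog']
print(dependents(HEADS, "dog"))     # ['the']
```

Every node in this structure is a word of the sentence. A constituency-based tree for the same sentence would add non-terminal nodes such as NP and VP above the words.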

Context Free Grammar

Context-free grammar, also called CFG, is a notation for describing languages and is a superset of regular grammar. This can be seen in the following diagram −

[Figure: context-free grammar as a superset of regular grammar]

Definition of CFG

A CFG consists of a finite set of grammar rules with the following four components −

Set of Non-terminals

It is denoted by V. The non-terminals are syntactic variables that denote sets of strings, which in turn help define the language generated by the grammar.

Set of Terminals

Terminals are also called tokens, and the set is denoted by Σ. They are the basic symbols from which strings are formed.

Set of Productions

It is denoted by P. The set defines how the terminals and non-terminals can be combined. Every production consists of a single non-terminal, an arrow, and a sequence of terminals and/or non-terminals. The non-terminal is called the left side of the production and the sequence is called the right side of the production.

Start Symbol

Every derivation begins from the start symbol. It is denoted by the symbol S. The start symbol is always a non-terminal.
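The four components can be written down directly and used to enumerate the strings a grammar generates. The sketch below uses an assumed example grammar, S → a S b | a b, whose language (n copies of a followed by n copies of b) is context-free but not regular. The names `V`, `SIGMA`, `P`, `START`, and `generate` are choices made for this example. The length bound relies on the assumption that no production has an empty right side, so sentential forms never shrink.

```python
from collections import deque

V = {"S"}                                   # set of non-terminals
SIGMA = {"a", "b"}                          # set of terminals
P = {"S": [("a", "S", "b"), ("a", "b")]}    # set of productions
START = "S"                                 # start symbol


def generate(max_len):
    """Return all terminal strings of length <= max_len derivable from START."""
    results = set()
    seen = {(START,)}
    queue = deque([(START,)])
    while queue:
        form = queue.popleft()
        # Find the left-most non-terminal in the sentential form, if any.
        idx = next((i for i, sym in enumerate(form) if sym in V), None)
        if idx is None:  # only terminals left: a string of the language
            results.add("".join(form))
            continue
        for rhs in P[form[idx]]:
            new_form = form[:idx] + rhs + form[idx + 1:]
            # Safe to prune by length because no rule shrinks the form.
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return sorted(results, key=len)


print(generate(6))  # ['ab', 'aabb', 'aaabbb']
```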