Natural Language Toolkit: A Concise Tutorial

Natural Language Toolkit - Parsing

Parsing and its relevance in NLP

The word ‘parsing’ comes from the Latin word ‘pars’ (meaning ‘part’). In NLP, parsing is used to draw the exact, or dictionary, meaning from text. It is also called syntactic analysis or syntax analysis. By comparing text against the rules of a formal grammar, syntax analysis checks it for meaningfulness. A sentence such as “Give me hot ice-cream”, for example, would be rejected by a parser or syntactic analyzer.

In this sense, parsing (also called syntactic analysis or syntax analysis) can be defined as follows −

It may be defined as the process of analyzing strings of symbols in natural language according to the rules of a formal grammar.

Relevance

The following points show why parsing is relevant in NLP −

  1. A parser is used to report syntax errors.

  2. It helps to recover from commonly occurring errors so that processing of the remainder of the input can continue.

  3. A parse tree is created with the help of a parser.

  4. A parser is used to create a symbol table, which plays an important role in NLP.

  5. A parser is also used to produce intermediate representations (IR).

Deep Vs Shallow Parsing

| Deep Parsing | Shallow Parsing |
| --- | --- |
| In deep parsing, the search strategy gives a complete syntactic structure to a sentence. | It is the task of parsing a limited part of the syntactic information from the given text. |
| It is suitable for complex NLP applications. | It can be used for less complex NLP applications. |
| Dialogue systems and summarization are examples of NLP applications where deep parsing is used. | Information extraction and text mining are examples of NLP applications where shallow parsing is used. |
| It is also called full parsing. | It is also called chunking. |
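To make the contrast concrete, the sketch below stores a full (deep) parse as a nested tuple and derives a flat, shallow view from it by keeping only the lowest-level NP groups. The sentence, the tree, and the helper names `is_leaf` and `shallow_chunks` are illustrative assumptions, not output from a real parser.

```python
# A deep parse: every word sits inside one complete, nested structure.
# Inner nodes are (label, child, child, ...); leaves are (tag, word) pairs.
deep_tree = (
    "S",
    ("NP", ("DT", "the"), ("NN", "fox")),
    ("VP",
        ("VBD", "jumped"),
        ("PP", ("IN", "over"),
               ("NP", ("DT", "the"), ("NN", "wall")))),
)


def is_leaf(node):
    # A leaf is a (tag, word) pair whose second element is a plain string.
    return len(node) == 2 and isinstance(node[1], str)


def shallow_chunks(node):
    """Flatten a deep tree into a chunk sequence.

    NP nodes whose children are all leaves stay grouped; every other
    word is emitted as a bare (tag, word) pair.
    """
    if is_leaf(node):
        return [node]
    label, children = node[0], node[1:]
    if label == "NP" and all(is_leaf(child) for child in children):
        return [("NP", list(children))]
    chunks = []
    for child in children:
        chunks.extend(shallow_chunks(child))
    return chunks


chunks = shallow_chunks(deep_tree)
print(chunks)
```

The shallow result keeps the two noun phrases but discards the information that the prepositional phrase belongs to the verb phrase, which is the kind of detail only a deep parse retains.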

Various types of parsers

As discussed, a parser is basically a procedural interpretation of a grammar. It searches through the space of possible trees and finds an optimal tree for the given sentence. Some of the available parsers are described below −

Recursive descent parser

Recursive descent parsing is one of the most straightforward forms of parsing. Following are some important points about the recursive descent parser −

  1. It follows a top down process.

  2. It attempts to verify whether the syntax of the input stream is correct.

  3. It reads the input sentence from left to right.

  4. One necessary operation for a recursive descent parser is to read characters from the input stream and match them against the terminals of the grammar.
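The points above can be sketched in plain Python without NLTK. The toy grammar, the word list, and the function names `expand`, `match_sequence`, and `parse` are assumptions made for illustration; the sketch works on whole words rather than characters and is not how any particular library implements the technique.

```python
# Toy grammar: each non-terminal maps to a list of alternative productions.
# Any symbol that is not a key of GRAMMAR is treated as a terminal (a word).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["DT", "NN"], ["DT", "JJ", "NN"]],
    "VP": [["VB", "NP"], ["VB"]],
    "DT": [["the"], ["a"]],
    "JJ": [["clever"]],
    "NN": [["fox"], ["wall"]],
    "VB": [["saw"], ["jumped"]],
}


def expand(symbol, tokens, pos):
    """Yield (tree, next_pos) for every way `symbol` can match tokens[pos:]."""
    if symbol not in GRAMMAR:
        # Terminal: it must equal the next input token.
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    # Top-down step: try each production of the non-terminal in turn.
    for production in GRAMMAR[symbol]:
        yield from match_sequence(production, tokens, pos, (symbol,))


def match_sequence(symbols, tokens, pos, partial):
    """Match the symbols of one production from left to right."""
    if not symbols:
        yield partial, pos
        return
    for subtree, next_pos in expand(symbols[0], tokens, pos):
        yield from match_sequence(symbols[1:], tokens, next_pos,
                                  partial + (subtree,))


def parse(sentence):
    """Return the first tree that covers the whole sentence, or None."""
    tokens = sentence.split()
    for tree, end in expand("S", tokens, 0):
        if end == len(tokens):
            return tree
    return None


print(parse("the clever fox saw a wall"))
print(parse("fox the saw"))
```

Because the parser expands the leftmost symbol first, a left-recursive rule such as NP -> NP PP would make this sketch recurse forever, which is a well-known limitation of plain recursive descent.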

Shift-reduce parser

Following are some important points about the shift-reduce parser −

  1. It follows a simple bottom-up process.

  2. It tries to find a sequence of words and phrases that correspond to the right-hand side of a grammar production and replaces them with the left-hand side of the production.

  3. This search for a matching sequence of words continues until the whole sentence is reduced.

  4. In other words, a shift-reduce parser starts with the input symbols and tries to construct the parse tree up to the start symbol.
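The bottom-up process described above can be sketched as a small loop over a stack and an input buffer. The rule list, the sentence, and the function names below are illustrative assumptions. The sketch is greedy (it reduces whenever it can and never backtracks), so it may fail on sentences that the grammar does in fact allow.

```python
# Each rule is (left-hand side, right-hand side).
RULES = [
    ("S", ("NP", "VP")),
    ("NP", ("DT", "NN")),
    ("VP", ("VB", "NP")),
    ("DT", ("the",)),
    ("NN", ("dog",)),
    ("NN", ("ball",)),
    ("VB", ("chased",)),
]


def label(node):
    # A stack item is either a raw word or a (label, children...) tree.
    return node if isinstance(node, str) else node[0]


def try_reduce(stack):
    """Replace a right-hand side on top of the stack with its left-hand side."""
    for lhs, rhs in RULES:
        n = len(rhs)
        if len(stack) >= n and tuple(label(x) for x in stack[-n:]) == rhs:
            children = stack[-n:]
            del stack[-n:]
            stack.append((lhs, *children))
            return True
    return False


def shift_reduce(sentence):
    stack, buffer = [], sentence.split()
    while True:
        if try_reduce(stack):       # reduce whenever a rule matches
            continue
        if buffer:                  # otherwise shift the next word
            stack.append(buffer.pop(0))
        else:
            break
    # Success means the whole input was reduced to the start symbol.
    if len(stack) == 1 and label(stack[0]) == "S":
        return stack[0]
    return None


print(shift_reduce("the dog chased the ball"))
```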

Chart parser

Following are some important points about the chart parser −

  1. It is mainly suited to ambiguous grammars, including grammars of natural languages.

  2. It applies dynamic programming to the parsing problem.

  3. Because of dynamic programming, partial hypothesized results are stored in a structure called a ‘chart’.

  4. The ‘chart’ can also be re-used.
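As one concrete instance of the chart idea, the sketch below uses the CYK algorithm, a specific chart-style method that requires a grammar in Chomsky normal form. It is not necessarily the strategy a given library's chart parser uses. The grammar, the lexicon, and the function name `count_parses` are assumptions chosen for illustration. Each chart cell stores how many ways a category can cover a span, so results for short spans are computed once and reused for longer ones.

```python
from collections import defaultdict

# Binary rules A -> B C of a small grammar in Chomsky normal form.
BINARY = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),
    ("PP", "P", "NP"),
]

# Lexical rules: word -> categories it can belong to.
LEXICAL = {
    "I": ["NP"], "shot": ["V"], "an": ["Det"], "my": ["Det"],
    "elephant": ["N"], "sleep": ["N"], "in": ["P"],
}


def count_parses(sentence, start="S"):
    """Count the distinct parse trees of `sentence` rooted in `start`."""
    tokens = sentence.split()
    n = len(tokens)
    # chart[(i, j)][A] = number of ways category A derives tokens[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        for category in LEXICAL.get(word, []):
            chart[(i, i + 1)][category] += 1
    # Fill wider spans from narrower ones that are already in the chart.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for lhs, left, right in BINARY:
                    ways = chart[(i, k)][left] * chart[(k, j)][right]
                    if ways:
                        chart[(i, j)][lhs] += ways
    return chart[(0, n)][start]


print(count_parses("I shot an elephant in my sleep"))
```

Under this toy grammar the sentence is ambiguous: the phrase "in my sleep" can attach either to the verb phrase or to "an elephant", and the chart accounts for both readings without re-parsing the shared sub-spans.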

Regexp parser

Regexp parsing is one of the most widely used parsing techniques. Following are some important points about the Regexp parser −

  1. As the name implies, it uses regular expressions, defined in the form of a grammar, on top of a POS-tagged string.

  2. It uses these regular expressions to parse the input sentences and generate a parse tree from them.

Example

Following is a working example of the Regexp parser −

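import nltk

# POS-tagged input sentence as (word, tag) pairs
sentence = [
   ("a", "DT"),
   ("clever", "JJ"),
   ("fox", "NN"),
   ("was", "VBD"),
   ("jumping", "VBG"),
   ("over", "IN"),
   ("the", "DT"),
   ("wall", "NN")
]

# NP chunk: an optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
reg_parser = nltk.RegexpParser(grammar)

# parse() returns an nltk.Tree; draw() displays it in a GUI window
output = reg_parser.parse(sentence)
print(output)
output.draw()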

Output

[Figure: parse tree drawn by the Regexp parser]
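To see why the tag pattern `<DT>?<JJ>*<NN>` picks out noun phrases, the sketch below applies the same pattern with Python's standard `re` module to a string of tags. This is a simplified analogy and is not meant to describe NLTK's internals. The helper name `chunk_noun_phrases` and the way match positions are mapped back to words are assumptions made for illustration.

```python
import re

tagged = [
    ("a", "DT"), ("clever", "JJ"), ("fox", "NN"), ("was", "VBD"),
    ("jumping", "VBG"), ("over", "IN"), ("the", "DT"), ("wall", "NN"),
]

# The tag pattern <DT>?<JJ>*<NN> written as an ordinary regular expression.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")


def chunk_noun_phrases(tagged_tokens):
    """Return the words of every span whose tags match NP_PATTERN."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>...".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        # Each "<" before the match start corresponds to one earlier token.
        first = tag_string.count("<", 0, match.start())
        length = match.group().count("<")
        chunks.append([word for word, _ in tagged_tokens[first:first + length]])
    return chunks


print(chunk_noun_phrases(tagged))
```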

Dependency Parsing

Dependency Parsing (DP) is a modern parsing mechanism whose main concept is that each linguistic unit, i.e. each word, is related to the others by direct links. In linguistics, these direct links are called ‘dependencies’. For example, the following diagram shows the dependency grammar for the sentence “John can hit the ball”.

[Figure: dependency structure of the sentence “John can hit the ball”]
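A dependency analysis can be stored very compactly as one head index per word. The sketch below does this for the example sentence. The particular analysis (with "hit" as the root) and the relation labels are assumptions, since the original diagram is not reproduced here. The helper names are also hypothetical. The second function is a common simplified check for projectivity, the property that no two arcs cross when drawn above the sentence.

```python
words = ["John", "can", "hit", "the", "ball"]

# heads[i] is the index of the word that words[i] depends on.
# None marks the root of the sentence (assumed here to be "hit").
heads = [2, 2, None, 4, 2]
labels = ["nsubj", "aux", "root", "det", "dobj"]


def dependency_triples(words, heads, labels):
    """Return (head word, relation, dependent word) for every arc."""
    triples = []
    for i, head in enumerate(heads):
        if head is not None:
            triples.append((words[head], labels[i], words[i]))
    return triples


def is_projective(heads):
    """True if no two dependency arcs cross each other."""
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads) if h is not None]
    for a, b in arcs:
        for c, d in arcs:
            if a < c < b < d:
                return False
    return True


print(dependency_triples(words, heads, labels))
print(is_projective(heads))
```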

NLTK Package

NLTK offers the following two ways to do dependency parsing −

Probabilistic, projective dependency parser

This is the first way to do dependency parsing with NLTK. However, this parser is restricted in that it is trained on a limited set of training data.

Stanford parser

This is another way to do dependency parsing with NLTK. The Stanford parser is a state-of-the-art dependency parser, and NLTK provides a wrapper around it. To use it, we need to download the following two things −

  1. The Stanford parser itself (the stanford-parser.jar file referenced in the example).

  2. A language model for the desired language, for example the English language model.

Example

Once the model is downloaded, we can use it through NLTK as follows −

from nltk.parse.stanford import StanfordDependencyParser

# Adjust both paths to wherever the Stanford parser was unpacked
path_jar = 'path_to/stanford-parser-full-2014-08-27/stanford-parser.jar'
path_models_jar = 'path_to/stanford-parser-full-2014-08-27/stanford-parser-3.4.1-models.jar'
dep_parser = StanfordDependencyParser(
   path_to_jar = path_jar, path_to_models_jar = path_models_jar
)

# raw_parse() returns an iterator over dependency graphs; take the first one
result = dep_parser.raw_parse('I shot an elephant in my sleep')
dependency = next(result)
list(dependency.triples())

Output

[
   ((u'shot', u'VBD'), u'nsubj', (u'I', u'PRP')),
   ((u'shot', u'VBD'), u'dobj', (u'elephant', u'NN')),
   ((u'elephant', u'NN'), u'det', (u'an', u'DT')),
   ((u'shot', u'VBD'), u'prep', (u'in', u'IN')),
   ((u'in', u'IN'), u'pobj', (u'sleep', u'NN')),
   ((u'sleep', u'NN'), u'poss', (u'my', u'PRP$'))
]
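Each triple in the output has the form ((head word, head tag), relation, (dependent word, dependent tag)). A minimal sketch of how such a list might be post-processed is shown below. It copies the triples from the output above as plain data, so it runs without the Stanford parser. The helper name `dependents_by_head` is a hypothetical choice.

```python
from collections import defaultdict

# The triples from the output above, copied as plain Python data.
triples = [
    (("shot", "VBD"), "nsubj", ("I", "PRP")),
    (("shot", "VBD"), "dobj", ("elephant", "NN")),
    (("elephant", "NN"), "det", ("an", "DT")),
    (("shot", "VBD"), "prep", ("in", "IN")),
    (("in", "IN"), "pobj", ("sleep", "NN")),
    (("sleep", "NN"), "poss", ("my", "PRP$")),
]


def dependents_by_head(triples):
    """Group (relation, dependent word) pairs under their head word."""
    grouped = defaultdict(list)
    for (head, _head_tag), relation, (dependent, _dep_tag) in triples:
        grouped[head].append((relation, dependent))
    return dict(grouped)


for head, dependents in dependents_by_head(triples).items():
    print(head, "->", dependents)
```

Grouping by head makes it easy to read off, for example, that "shot" governs its subject "I", its object "elephant", and the preposition "in".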