Natural Language Toolkit: A Concise Tutorial

Natural Language Toolkit - Transforming Trees

There are two main reasons to transform trees:

  1. To modify deep parse trees, and

  2. To flatten deep parse trees.

Converting Tree or Subtree to Sentence

The first recipe converts a Tree or subtree back into a sentence or chunk string. The leaves of a chunk tree are (word, tag) tuples, so it is enough to join the words with spaces, as the following example shows:

Example

from nltk.corpus import treebank_chunk
tree = treebank_chunk.chunked_sents()[2]
' '.join([w for w, t in tree.leaves()])

Output

'Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields
PLC , was named a nonexecutive director of this British industrial
conglomerate .'
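
The same join works on a subtree, because each chunk inside the sentence is itself a Tree. The sketch below uses a small hand-written chunk tree instead of treebank_chunk (an assumption, made so that no corpus download is needed), and the helper name tree_to_sentence() is hypothetical.

```python
from nltk.tree import Tree

def tree_to_sentence(tree):
    # Leaves of a chunk tree are (word, tag) tuples; keep only the words.
    return ' '.join(word for word, tag in tree.leaves())

# Hand-written chunk tree standing in for a treebank_chunk sentence (assumption).
tree = Tree('S', [
    Tree('NP', [('the', 'DT'), ('old', 'JJ'), ('chairman', 'NN')]),
    ('was', 'VBD'),
    ('named', 'VBN'),
    Tree('NP', [('a', 'DT'), ('director', 'NN')]),
    ('.', '.'),
])

# Whole tree back to a sentence string.
sentence = tree_to_sentence(tree)

# Each NP chunk (a subtree) back to its own chunk string.
chunks = [tree_to_sentence(st) for st in tree.subtrees(lambda t: t.label() == 'NP')]

print(sentence)
print(chunks)
```

Here sentence holds the full token string, while chunks holds one string per NP subtree.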

Flattening Deep Trees

Deep trees of nested phrases can't be used to train a chunker, so they must be flattened first. The following example uses the 3rd parsed sentence from the treebank corpus, which is a deep tree of nested phrases.

Example

To achieve this, we define a function named deeptree_flat() that takes a single Tree and returns a new Tree that keeps only the lowest-level subtrees. Most of the work is done by a recursive helper function named childtree_flat().

from nltk.tree import Tree

def childtree_flat(trees):
   children = []
   for t in trees:
      if t.height() < 3:
         # A tagged word: keep it as a (word, tag) tuple.
         children.extend(t.pos())
      elif t.height() == 3:
         # A lowest-level phrase: keep it as a chunk of (word, tag) tuples.
         children.append(Tree(t.label(), t.pos()))
      else:
         # A deeper phrase: recurse into its children.
         children.extend(childtree_flat([c for c in t]))
   return children

def deeptree_flat(tree):
   return Tree(tree.label(), childtree_flat([c for c in tree]))

Now, let us call deeptree_flat() on the 3rd parsed sentence from the treebank corpus, which is a deep tree of nested phrases. We saved the functions above in a file named deeptree.py.

from deeptree import deeptree_flat
from nltk.corpus import treebank
deeptree_flat(treebank.parsed_sents()[2])

Output

Tree('S', [Tree('NP', [('Rudolph', 'NNP'), ('Agnew', 'NNP')]),
(',', ','), Tree('NP', [('55', 'CD'),
('years', 'NNS')]), ('old', 'JJ'), ('and', 'CC'),
Tree('NP', [('former', 'JJ'),
('chairman', 'NN')]), ('of', 'IN'), Tree('NP', [('Consolidated', 'NNP'),
('Gold', 'NNP'), ('Fields', 'NNP'), ('PLC',
'NNP')]), (',', ','), ('was', 'VBD'),
('named', 'VBN'), Tree('NP-SBJ', [('*-1', '-NONE-')]),
Tree('NP', [('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN')]),
('of', 'IN'), Tree('NP',
[('this', 'DT'), ('British', 'JJ'),
('industrial', 'JJ'), ('conglomerate', 'NN')]), ('.', '.')])
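
The height() checks that drive the flattening are easier to follow on a small tree. The sketch below uses a hand-written parse tree rather than the treebank corpus (an assumption, made so that no corpus download is needed). A tagged word has height 2, a phrase that contains only tagged words has height 3, and a phrase that nests further phrases is taller.

```python
from nltk.tree import Tree

# Hand-written parse tree standing in for a treebank sentence (assumption).
tree = Tree.fromstring(
    "(S (NP (NP (DT the) (NN cat)) (PP (IN on) (NP (DT the) (NN mat))))"
    " (VP (VBD slept)) (. .))"
)

tagged_word = tree[0][0][0]   # (DT the)
lowest_phrase = tree[0][0]    # (NP (DT the) (NN cat))
nested_phrase = tree[0]       # (NP (NP ...) (PP ...))

heights = (tagged_word.height(), lowest_phrase.height(), nested_phrase.height())
print(heights)
print(tree.height())
```

A subtree with height below 3 is therefore emitted as a bare (word, tag) tuple, a subtree with height exactly 3 is kept as a chunk, and anything taller is recursed into.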

Building a Shallow Tree

In the previous section, we flattened a deep tree of nested phrases by keeping only the lowest-level subtrees. In this section, we keep only the highest-level subtrees, that is, we build a shallow tree. The following example again uses the 3rd parsed sentence from the treebank corpus, which is a deep tree of nested phrases.

Example

To achieve this, we define a function named tree_shallow() that eliminates all nested subtrees by keeping only the top-level subtree labels.

from nltk.tree import Tree
def tree_shallow(tree):
   children = []
   for t in tree:
      if t.height() < 3:
         children.extend(t.pos())
      else:
         children.append(Tree(t.label(), t.pos()))
   return Tree(tree.label(), children)

Now, let us call tree_shallow() on the 3rd parsed sentence from the treebank corpus, which is a deep tree of nested phrases. We saved the function above in a file named shallowtree.py.

from shallowtree import tree_shallow
from nltk.corpus import treebank
tree_shallow(treebank.parsed_sents()[2])

Output

Tree('S', [Tree('NP-SBJ-1', [('Rudolph', 'NNP'), ('Agnew', 'NNP'), (',', ','),
('55', 'CD'), ('years', 'NNS'), ('old', 'JJ'), ('and', 'CC'),
('former', 'JJ'), ('chairman', 'NN'), ('of', 'IN'), ('Consolidated', 'NNP'),
('Gold', 'NNP'), ('Fields', 'NNP'), ('PLC', 'NNP'), (',', ',')]),
Tree('VP', [('was', 'VBD'), ('named', 'VBN'), ('*-1', '-NONE-'), ('a', 'DT'),
('nonexecutive', 'JJ'), ('director', 'NN'), ('of', 'IN'), ('this', 'DT'),
('British', 'JJ'), ('industrial', 'JJ'), ('conglomerate', 'NN')]), ('.', '.')])

Comparing the heights of the shallow tree and the original tree shows the difference:

from nltk.corpus import treebank
tree_shallow(treebank.parsed_sents()[2]).height()

Output

3

from nltk.corpus import treebank
treebank.parsed_sents()[2].height()

Output

9
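
One way to sanity-check both transformations is to confirm that they change only the structure and never the tagged words. The sketch below repeats the flattening and shallow-tree functions so that it runs on its own, and it uses a small hand-written parse tree instead of the treebank corpus (an assumption, made so that no corpus download is needed).

```python
from nltk.tree import Tree

def childtree_flat(trees):
    children = []
    for t in trees:
        if t.height() < 3:
            children.extend(t.pos())                    # tagged word
        elif t.height() == 3:
            children.append(Tree(t.label(), t.pos()))   # lowest-level phrase
        else:
            children.extend(childtree_flat(list(t)))    # recurse into deeper phrase
    return children

def deeptree_flat(tree):
    return Tree(tree.label(), childtree_flat(list(tree)))

def tree_shallow(tree):
    children = []
    for t in tree:
        if t.height() < 3:
            children.extend(t.pos())                    # tagged word
        else:
            children.append(Tree(t.label(), t.pos()))   # top-level phrase
    return Tree(tree.label(), children)

# Hand-written parse tree standing in for a treebank sentence (assumption).
tree = Tree.fromstring(
    "(S (NP (NP (DT the) (NN cat)) (PP (IN on) (NP (DT the) (NN mat))))"
    " (VP (VBD slept)) (. .))"
)

flat = deeptree_flat(tree)
shallow = tree_shallow(tree)

print(tree.height(), flat.height(), shallow.height())
print([st.label() for st in flat if isinstance(st, Tree)])
print([st.label() for st in shallow if isinstance(st, Tree)])
```

Both results have height 3, but the flattened tree keeps the two innermost NP chunks separately, while the shallow tree merges everything under the top-level NP into a single chunk.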

Converting Tree Labels

Parse trees contain a variety of Tree label types that are not present in chunk trees. When using parse trees to train a chunker, we would like to reduce this variety by converting some Tree labels to more common label types. For example, there are two alternative NP subtree labels, NP-SBJ and NP-TMP, and both can be converted to NP. The following example shows how to do this.

Example

To achieve this, we define a function named tree_convert() that takes the following two arguments:

  1. Tree to convert

  2. A label conversion mapping

The function returns a new Tree in which every label that appears as a key in the mapping is replaced by the corresponding value.

from nltk.tree import Tree

def tree_convert(tree, mapping):
   children = []
   for t in tree:
      if isinstance(t, Tree):
         # Recurse so that nested subtree labels are converted too.
         children.append(tree_convert(t, mapping))
      else:
         children.append(t)
   # Labels that are not in the mapping are left unchanged.
   label = mapping.get(tree.label(), tree.label())
   return Tree(label, children)

Now, let us call tree_convert() on the 3rd parsed sentence from the treebank corpus, which is a deep tree of nested phrases. We saved the function above in a file named converttree.py.

from converttree import tree_convert
from nltk.corpus import treebank
mapping = {'NP-SBJ': 'NP', 'NP-TMP': 'NP'}
tree_convert(treebank.parsed_sents()[2], mapping)

Output

Tree('S', [Tree('NP-SBJ-1', [Tree('NP', [Tree('NNP', ['Rudolph']),
Tree('NNP', ['Agnew'])]), Tree(',', [',']),
Tree('UCP', [Tree('ADJP', [Tree('NP', [Tree('CD', ['55']),
Tree('NNS', ['years'])]),
Tree('JJ', ['old'])]), Tree('CC', ['and']),
Tree('NP', [Tree('NP', [Tree('JJ', ['former']),
Tree('NN', ['chairman'])]), Tree('PP', [Tree('IN', ['of']),
Tree('NP', [Tree('NNP', ['Consolidated']),
Tree('NNP', ['Gold']), Tree('NNP', ['Fields']),
Tree('NNP', ['PLC'])])])])]), Tree(',', [','])]),
Tree('VP', [Tree('VBD', ['was']),Tree('VP', [Tree('VBN', ['named']),
Tree('S', [Tree('NP', [Tree('-NONE-', ['*-1'])]),
Tree('NP-PRD', [Tree('NP', [Tree('DT', ['a']),
Tree('JJ', ['nonexecutive']), Tree('NN', ['director'])]),
Tree('PP', [Tree('IN', ['of']), Tree('NP',
[Tree('DT', ['this']), Tree('JJ', ['British']), Tree('JJ', ['industrial']),
Tree('NN', ['conglomerate'])])])])])])]), Tree('.', ['.'])])
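
In the output above, the top-level NP-SBJ-1 label is unchanged, because the mapping is matched against the whole label and 'NP-SBJ-1' is not one of its keys. The sketch below reproduces this on a small hand-written tree (an assumption, made so that no corpus download is needed), repeats tree_convert() so that the block runs on its own, and then adds the indexed label to the mapping. The helper labels() is hypothetical and only collects labels for inspection.

```python
from nltk.tree import Tree

def tree_convert(tree, mapping):
    children = []
    for t in tree:
        if isinstance(t, Tree):
            children.append(tree_convert(t, mapping))
        else:
            children.append(t)
    # Exact-match lookup: unknown labels are kept as they are.
    label = mapping.get(tree.label(), tree.label())
    return Tree(label, children)

def labels(tree):
    # Hypothetical helper: list every subtree label in traversal order.
    return [st.label() for st in tree.subtrees()]

# Hand-written parse tree standing in for a treebank sentence (assumption).
tree = Tree.fromstring("""
(S
  (NP-SBJ-1 (NNP Agnew))
  (VP (VBD was)
      (VP (VBN named)
          (S (NP-SBJ (-NONE- *-1)) (NP-PRD (NN director)))
          (NP-TMP (NN yesterday))))
  (. .))
""")

mapping = {'NP-SBJ': 'NP', 'NP-TMP': 'NP'}
converted = tree_convert(tree, mapping)
print(labels(converted))        # 'NP-SBJ-1' is still present

# Adding the indexed label as its own key converts it as well.
extended = {**mapping, 'NP-SBJ-1': 'NP'}
converted_all = tree_convert(tree, extended)
print(labels(converted_all))
```

Labels such as NP-PRD stay as they are in both runs, since only keys listed in the mapping are replaced.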