Natural Language Toolkit: A Concise Tutorial

Natural Language Toolkit - Transforming Trees

Following are the two reasons to transform trees:

  1. To modify a deep parse tree, and

  2. To flatten a deep parse tree

Converting Tree or Subtree to Sentence

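The first recipe we discuss here converts a tree or subtree back into a sentence or chunk string. It is very simple; let us see it in the following example: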

Example

from nltk.corpus import treebank_chunk
tree = treebank_chunk.chunked_sents()[2]
' '.join([w for w, t in tree.leaves()])

Output

'Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields
PLC , was named a nonexecutive director of this British industrial
conglomerate .'
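The same join works on any subtree, not only on the whole sentence. The following is a minimal sketch that needs no corpus download; the small tree and the helper name tree_to_text() are hypothetical and used only for illustration.

```python
from nltk.tree import Tree

# A small hand-built chunk tree (hypothetical data, not taken from the corpus).
# As in treebank_chunk, the leaves are (word, tag) tuples.
tree = Tree('S', [
    Tree('NP', [('the', 'DT'), ('old', 'JJ'), ('dog', 'NN')]),
    ('barked', 'VBD'),
    ('.', '.'),
])

def tree_to_text(t):
    """Join the words of a tree whose leaves are (word, tag) tuples."""
    return ' '.join(word for word, tag in t.leaves())

sentence = tree_to_text(tree)   # the whole sentence
chunk = tree_to_text(tree[0])   # only the first NP chunk

print(sentence)
print(chunk)
```

Note that punctuation stays separated by spaces, exactly as in the corpus output above, because each token is a separate leaf.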

Deep tree flattening

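A deep tree of nested phrases cannot be used to train a chunker, so we must flatten it before use. In the following example, we use the third parsed sentence from the treebank corpus, which is a deep tree of nested phrases.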

Example

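To achieve this, we define a function named deeptree_flat() that takes a single Tree and returns a new Tree that keeps only the lowest-level trees. To do most of the work, it uses a helper function named childtree_flat().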

from nltk.tree import Tree

def childtree_flat(trees):
   children = []
   for t in trees:
      if t.height() < 3:
         # A part-of-speech node: keep it as (word, tag) tuples.
         children.extend(t.pos())
      elif t.height() == 3:
         # A lowest-level phrase: keep it as a flat chunk.
         children.append(Tree(t.label(), t.pos()))
      else:
         # A deeper phrase: drop its label and recurse into its children.
         children.extend(childtree_flat([c for c in t]))
   return children

def deeptree_flat(tree):
   return Tree(tree.label(), childtree_flat([c for c in tree]))

Now, let us call deeptree_flat() on the third parsed sentence from the treebank corpus, which is a deep tree of nested phrases. We saved these functions in a file named deeptree.py.

from deeptree import deeptree_flat
from nltk.corpus import treebank
deeptree_flat(treebank.parsed_sents()[2])

Output

Tree('S', [Tree('NP', [('Rudolph', 'NNP'), ('Agnew', 'NNP')]),
(',', ','), Tree('NP', [('55', 'CD'),
('years', 'NNS')]), ('old', 'JJ'), ('and', 'CC'),
Tree('NP', [('former', 'JJ'),
('chairman', 'NN')]), ('of', 'IN'), Tree('NP', [('Consolidated', 'NNP'),
('Gold', 'NNP'), ('Fields', 'NNP'), ('PLC',
'NNP')]), (',', ','), ('was', 'VBD'),
('named', 'VBN'), Tree('NP-SBJ', [('*-1', '-NONE-')]),
Tree('NP', [('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN')]),
('of', 'IN'), Tree('NP',
[('this', 'DT'), ('British', 'JJ'),
('industrial', 'JJ'), ('conglomerate', 'NN')]), ('.', '.')])
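To see the effect without downloading the corpus, the sketch below applies the same two functions to a small hand-built parse tree. The bracketed sentence is hypothetical; the functions are repeated so that the snippet runs on its own.

```python
from nltk.tree import Tree

def childtree_flat(trees):
    children = []
    for t in trees:
        if t.height() < 3:
            # Part-of-speech node: keep as (word, tag) tuples.
            children.extend(t.pos())
        elif t.height() == 3:
            # Lowest-level phrase: keep as a flat chunk.
            children.append(Tree(t.label(), t.pos()))
        else:
            # Deeper phrase: discard the label, recurse into children.
            children.extend(childtree_flat(list(t)))
    return children

def deeptree_flat(tree):
    return Tree(tree.label(), childtree_flat(list(tree)))

# Hypothetical nested parse tree, built from a bracketed string.
deep = Tree.fromstring(
    "(S (NP (NP (DT the) (NN dog)) (PP (IN of) (NP (DT the) (NN farmer))))"
    " (VP (VBD barked)))"
)
flat = deeptree_flat(deep)

print(deep.height(), flat.height())
print(flat)
```

In this small tree the outer NP and the PP disappear, while the two innermost NP phrases and the one-word VP survive as chunks, because only phrases of height 3 are kept.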

Building Shallow tree

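In the previous section, we flattened a deep tree of nested phrases by keeping only the lowest-level subtrees. In this section, we keep only the highest-level subtrees, that is, we build a shallow tree. In the following example, we use the third parsed sentence from the treebank corpus, which is a deep tree of nested phrases.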

Example

To achieve this, we define a function named tree_shallow() that eliminates all the nested subtrees by keeping only the top-level subtree labels.

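from nltk.tree import Tree

def tree_shallow(tree):
   children = []
   for t in tree:
      if t.height() < 3:
         # A top-level part-of-speech node: keep it as (word, tag) tuples.
         children.extend(t.pos())
      else:
         # A top-level phrase: keep its label and flatten everything below it.
         children.append(Tree(t.label(), t.pos()))
   return Tree(tree.label(), children)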

Now, let us call tree_shallow() on the third parsed sentence from the treebank corpus, which is a deep tree of nested phrases. We saved this function in a file named shallowtree.py.

from shallowtree import tree_shallow
from nltk.corpus import treebank
tree_shallow(treebank.parsed_sents()[2])

Output

Tree('S', [Tree('NP-SBJ-1', [('Rudolph', 'NNP'), ('Agnew', 'NNP'), (',', ','),
('55', 'CD'), ('years', 'NNS'), ('old', 'JJ'), ('and', 'CC'),
('former', 'JJ'), ('chairman', 'NN'), ('of', 'IN'), ('Consolidated', 'NNP'),
('Gold', 'NNP'), ('Fields', 'NNP'), ('PLC', 'NNP'), (',', ',')]),
Tree('VP', [('was', 'VBD'), ('named', 'VBN'), ('*-1', '-NONE-'), ('a', 'DT'),
('nonexecutive', 'JJ'), ('director', 'NN'), ('of', 'IN'), ('this', 'DT'),
('British', 'JJ'), ('industrial', 'JJ'), ('conglomerate', 'NN')]), ('.', '.')])

We can see the difference by getting the height of each tree:

from nltk.corpus import treebank
tree_shallow(treebank.parsed_sents()[2]).height()

Output

3

from nltk.corpus import treebank
treebank.parsed_sents()[2].height()

Output

9
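The same comparison can be reproduced on a small hand-built tree without the corpus. This is a minimal sketch; the bracketed sentence is hypothetical and tree_shallow() is repeated so that the snippet is self-contained.

```python
from nltk.tree import Tree

def tree_shallow(tree):
    children = []
    for t in tree:
        if t.height() < 3:
            # Top-level part-of-speech node: keep as (word, tag) tuples.
            children.extend(t.pos())
        else:
            # Top-level phrase: keep its label, flatten everything below.
            children.append(Tree(t.label(), t.pos()))
    return Tree(tree.label(), children)

# Hypothetical nested parse tree, built from a bracketed string.
deep = Tree.fromstring(
    "(S (NP-SBJ (NP (DT the) (NN dog)) (PP (IN of) (NP (DT the) (NN farmer))))"
    " (VP (VBD barked)) (. .))"
)
shallow = tree_shallow(deep)

print(deep.height())     # height of the nested tree
print(shallow.height())  # height after keeping only top-level phrases
print(shallow)
```

Here the whole subject collapses into a single NP-SBJ chunk of five tagged words, which is the opposite choice from deep flattening, where the two inner NP phrases are kept instead.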

Tree labels conversion

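Parse trees contain a variety of Tree label types that are not present in chunk trees. When we use parse trees to train a chunker, we would like to reduce this variety by converting some of the tree labels to more common label types. For example, there are two alternative NP subtrees, namely NP-SBJ and NP-TMP. We can convert both of them into NP. Let us see how to do it in the following example.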

Example

To achieve this, we define a function named tree_convert() that takes the following two arguments:

  1. Tree to convert

  2. A label conversion mapping

This function returns a new Tree in which all matching labels are replaced according to the values in the mapping.

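from nltk.tree import Tree

def tree_convert(tree, mapping):
   children = []
   for t in tree:
      if isinstance(t, Tree):
         # Convert the labels of nested subtrees recursively.
         children.append(tree_convert(t, mapping))
      else:
         # A leaf (a word): keep it unchanged.
         children.append(t)
   # Replace the label if it is in the mapping; otherwise keep it.
   label = mapping.get(tree.label(), tree.label())
   return Tree(label, children)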

Now, let us call tree_convert() on the third parsed sentence from the treebank corpus, which is a deep tree of nested phrases. We saved this function in a file named converttree.py.

from converttree import tree_convert
from nltk.corpus import treebank
mapping = {'NP-SBJ': 'NP', 'NP-TMP': 'NP'}
tree_convert(treebank.parsed_sents()[2], mapping)

Output

Tree('S', [Tree('NP-SBJ-1', [Tree('NP', [Tree('NNP', ['Rudolph']),
Tree('NNP', ['Agnew'])]), Tree(',', [',']),
Tree('UCP', [Tree('ADJP', [Tree('NP', [Tree('CD', ['55']),
Tree('NNS', ['years'])]),
Tree('JJ', ['old'])]), Tree('CC', ['and']),
Tree('NP', [Tree('NP', [Tree('JJ', ['former']),
Tree('NN', ['chairman'])]), Tree('PP', [Tree('IN', ['of']),
Tree('NP', [Tree('NNP', ['Consolidated']),
Tree('NNP', ['Gold']), Tree('NNP', ['Fields']),
Tree('NNP', ['PLC'])])])])]), Tree(',', [','])]),
Tree('VP', [Tree('VBD', ['was']),Tree('VP', [Tree('VBN', ['named']),
Tree('S', [Tree('NP', [Tree('-NONE-', ['*-1'])]),
Tree('NP-PRD', [Tree('NP', [Tree('DT', ['a']),
Tree('JJ', ['nonexecutive']), Tree('NN', ['director'])]),
Tree('PP', [Tree('IN', ['of']), Tree('NP',
[Tree('DT', ['this']), Tree('JJ', ['British']), Tree('JJ', ['industrial']),
Tree('NN', ['conglomerate'])])])])])])]), Tree('.', ['.'])])
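Note that the top-level NP-SBJ-1 label in the output above is unchanged: the mapping is looked up by exact label, and NP-SBJ-1 is not one of its keys. The sketch below shows this on a small hand-built tree; the bracketed sentence is hypothetical, and adding an 'NP-SBJ-1' key is only one possible way to handle indexed labels.

```python
from nltk.tree import Tree

def tree_convert(tree, mapping):
    children = []
    for t in tree:
        if isinstance(t, Tree):
            # Convert nested subtrees recursively.
            children.append(tree_convert(t, mapping))
        else:
            # A leaf (a word): keep unchanged.
            children.append(t)
    # Exact-match lookup; labels not in the mapping are kept as-is.
    return Tree(mapping.get(tree.label(), tree.label()), children)

# Hypothetical parse tree with one indexed and one plain functional label.
parsed = Tree.fromstring(
    "(S (NP-SBJ-1 (NNP Rudolph) (NNP Agnew))"
    " (VP (VBD arrived) (NP-TMP (NN today))))"
)

mapping = {'NP-SBJ': 'NP', 'NP-TMP': 'NP'}
exact = tree_convert(parsed, mapping)

# Assumption for illustration: list the indexed label explicitly as well.
extended = tree_convert(parsed, {**mapping, 'NP-SBJ-1': 'NP'})

labels_exact = [st.label() for st in exact.subtrees()]
labels_extended = [st.label() for st in extended.subtrees()]
print(labels_exact)
print(labels_extended)
```

Because tree_convert() builds new Tree objects at every level, the original parsed tree is left untouched and can be converted again with a different mapping.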