Python Text Processing: A Concise Tutorial

Python - Chunks and Chinks

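Chunking is the process of grouping similar words together based on their part of speech. In the example below, we define a grammar that the generated chunks must follow. The grammar specifies the order of word classes, such as determiners, adjectives and nouns, that make up a chunk. The chunks are then shown as a tree diagram.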

import nltk

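# A part-of-speech tagged sentence as (word, tag) pairs
sentence = [("The", "DT"), ("small", "JJ"), ("red", "JJ"), ("flower", "NN"),
            ("flew", "VBD"), ("through", "IN"), ("the", "DT"), ("window", "NN")]
# NP chunk: an optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"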
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print(result)
result.draw()

When we run the above program, we get the following output:

[Image: chunk 1]
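Because result.draw() opens a graphical window, it can help to inspect the chunks as plain data instead. The following is a minimal sketch, assuming the noun-phrase pattern {<DT>?<JJ>*<NN>} (an assumption made for illustration); it collects the words of each NP subtree and also converts the tree to IOB tags with tree2conlltags.

```python
import nltk
from nltk.chunk import tree2conlltags

sentence = [("The", "DT"), ("small", "JJ"), ("red", "JJ"), ("flower", "NN"),
            ("flew", "VBD"), ("through", "IN"), ("the", "DT"), ("window", "NN")]

# Assumed pattern: optional determiner, any adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
tree = nltk.RegexpParser(grammar).parse(sentence)

# Join the words of every NP subtree into a phrase
np_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(np_phrases)

# Flat (word, tag, IOB) triples: B-NP starts a chunk, I-NP continues it, O is outside
iob_triples = tree2conlltags(tree)
for word, tag, iob in iob_triples:
    print(word, tag, iob)
```

Under this pattern, the verb and preposition fall outside every chunk, so they carry the O tag.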

Changing the grammar gives a different output, as shown below.

import nltk

sentence = [("The", "DT"), ("small", "JJ"), ("red", "JJ"), ("flower", "NN"),
            ("flew", "VBD"), ("through", "IN"), ("the", "DT"), ("window", "NN")]

# NP chunk: an optional determiner followed by any number of adjectives (no noun)
grammar = "NP: {<DT>?<JJ>*}"

chunkprofile = nltk.RegexpParser(grammar)
result = chunkprofile.parse(sentence)
print(result)
result.draw()

When we run the above program, we get the following output:

[Image: chunk 2]
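To see why a small change in the grammar changes the chunks, it helps to think of a tag pattern as a regular expression applied to the sequence of tags. The sketch below is a simplified, standard-library illustration of that idea, not NLTK's actual implementation; the helper simple_chunk and the two patterns are assumptions chosen for the comparison.

```python
import re

sentence = [("The", "DT"), ("small", "JJ"), ("red", "JJ"), ("flower", "NN"),
            ("flew", "VBD"), ("through", "IN"), ("the", "DT"), ("window", "NN")]

def simple_chunk(tagged, tag_regex):
    """Return the word groups whose tag sequence matches tag_regex (hypothetical helper)."""
    # Encode the tags as one string, e.g. "<DT><JJ><JJ><NN>..."
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in re.finditer(tag_regex, tag_string):
        if match.start() == match.end():
            continue  # ignore empty matches
        # The number of "<" before an offset equals the token index at that offset
        start = tag_string.count("<", 0, match.start())
        end = tag_string.count("<", 0, match.end())
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks

# With a trailing noun the whole noun phrase is captured
with_noun = simple_chunk(sentence, r"(<DT>)?(<JJ>)*<NN>")
# Without it, each chunk stops before the noun
without_noun = simple_chunk(sentence, r"(<DT>)?(<JJ>)*")

print(with_noun)
print(without_noun)
```

In this simplified version each tag needs explicit parentheses so that ? and * apply to the whole tag rather than to the closing bracket.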

Chinking

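Chinking is the process of removing a sequence of tokens from a chunk. If the sequence appears in the middle of a chunk, those tokens are removed and the remainder is left as two separate chunks, one on each side of the gap.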

import nltk

sentence = [("The", "DT"), ("small", "JJ"), ("red", "JJ"), ("flower", "NN"),
            ("flew", "VBD"), ("through", "IN"), ("the", "DT"), ("window", "NN")]

grammar = r"""
  NP:
    {<.*>+}         # Chunk everything
    }<JJ|NN>+{      # Chink sequences of JJ and NN
  """
chunkprofile = nltk.RegexpParser(grammar)
result = chunkprofile.parse(sentence)
print(result)
result.draw()
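The effect of chinking can also be checked without the drawing window. The following is a minimal sketch, assuming a chink pattern of }<JJ|NN>+{ (chosen here to match the comment "Chink sequences of JJ and NN"); it lists the words that remain inside NP chunks and the words that were chinked out.

```python
import nltk

sentence = [("The", "DT"), ("small", "JJ"), ("red", "JJ"), ("flower", "NN"),
            ("flew", "VBD"), ("through", "IN"), ("the", "DT"), ("window", "NN")]

# Assumed grammar: chunk the whole sentence, then chink adjectives and nouns
grammar = r"""
  NP:
    {<.*>+}         # Chunk everything
    }<JJ|NN>+{      # Chink sequences of JJ and NN
  """
tree = nltk.RegexpParser(grammar).parse(sentence)

# Words that are still inside an NP chunk, grouped per chunk
chunks = [
    [word for word, tag in subtree.leaves()]
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]

# Top-level children that are plain (word, tag) pairs were removed from the chunk
chinked = [child[0] for child in tree if not isinstance(child, nltk.Tree)]

print(chunks)
print(chinked)
```

The single all-inclusive chunk is split wherever a chinked sequence occurs, so the remaining chunks sit on either side of the removed tokens.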