Natural Language Toolkit Tutorial

Chunking & Information Extraction

What is Chunking?

Chunking, one of the important processes in natural language processing, is used to identify parts of speech (POS) and short phrases. In simple words, with chunking we can get the structure of the sentence. It is also called partial parsing.

Chunk patterns and chinks

Chunk patterns are the patterns of part-of-speech (POS) tags that define what kind of words make up a chunk. We can define chunk patterns with the help of modified regular expressions.

Moreover, we can also define patterns for what kind of words should not be in a chunk; these unchunked words are known as chinks.

Implementation example

In the example below, along with the result of parsing the sentence “the book has many chapters”, there is a grammar for noun phrases that combines both a chunk and a chink pattern −

import nltk

sentence = [
   ("the", "DT"),
   ("book", "NN"),
   ("has", "VBZ"),
   ("many", "JJ"),
   ("chapters", "NNS")
]
chunker = nltk.RegexpParser(
   r'''
   NP: {<DT><NN.*><.*>*<NN.*>}
       }<VB.*>{
   '''
)
output = chunker.parse(sentence)
output.draw()

Output

A parse-tree window is drawn, showing the chunked structure (S (NP the/DT book/NN) has/VBZ (NP many/JJ chapters/NNS)).
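If no graphical display is available, the same tree can be printed as text instead of drawn; a minimal sketch of this variant, reusing the tagged sentence and grammar from above:

```python
import nltk

# The same tagged sentence and grammar as above, but printing the tree
# as text instead of opening a drawing window.
sentence = [
    ("the", "DT"),
    ("book", "NN"),
    ("has", "VBZ"),
    ("many", "JJ"),
    ("chapters", "NNS")
]
chunker = nltk.RegexpParser(r'''
    NP: {<DT><NN.*><.*>*<NN.*>}
        }<VB.*>{
''')
output = chunker.parse(sentence)
print(output)  # (S (NP the/DT book/NN) has/VBZ (NP many/JJ chapters/NNS))
```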

As seen above, the pattern for specifying a chunk is to use curly braces as follows −

{<DT><NN>}

And to specify a chink, we can flip the braces such as follows −

}<VB>{

Now, for a particular phrase type, these rules can be combined into a grammar.
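As an illustration, chunk rules for several phrase types can be listed in one grammar string and applied in sequence; the tagged sentence and rules below are a hypothetical sketch, not part of the original example:

```python
import nltk

# A grammar combining rules for two phrase types; later rules may refer
# to chunks created by earlier ones (here VP consumes an NP chunk).
grammar = r'''
   NP: {<DT>?<JJ>*<NN.*>+}
   VP: {<VB.*><NP>+}
'''
chunker = nltk.RegexpParser(grammar)

sentence = [("the", "DT"), ("little", "JJ"), ("dog", "NN"),
            ("chased", "VBD"), ("a", "DT"), ("cat", "NN")]
tree = chunker.parse(sentence)
print(tree)
```

Here the NP rule fires first, producing two noun-phrase chunks, after which the VP rule groups the verb together with the following NP chunk.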

Information Extraction

We have gone through taggers as well as parsers that can be used to build an information extraction engine. Let us see a basic information extraction pipeline −

raw text → sentence segmentation → tokenization → part-of-speech tagging → entity detection → relation detection

Information extraction has many applications including −

  1. Business intelligence

  2. Resume harvesting

  3. Media analysis

  4. Sentiment detection

  5. Patent search

  6. Email scanning

Named-entity recognition (NER)

Named-entity recognition (NER) is actually a way of extracting some of the most common entities like names, organizations, locations, etc. Let us see an example that takes all the preprocessing steps such as sentence tokenization, POS tagging, chunking and NER, and follows the pipeline provided in the figure above.

Example

import nltk

file = open(
   # provide here the absolute path of the text file for which we want NER
)
data_text = file.read()
sentences = nltk.sent_tokenize(data_text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
for sent in tagged_sentences:
   print(nltk.ne_chunk(sent))

Modified versions of named-entity recognition (NER) can also be used to extract entities such as product names, bio-medical entities, brand names and much more.
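One simple way to approximate such a modified recognizer is a custom chunk grammar over POS-tagged text; a hypothetical sketch (the BRAND label and the tagged sentence below are illustrative, not from the original text):

```python
import nltk

# Hypothetical sketch: a tiny "modified NER" that chunks runs of proper
# nouns (NNP) as candidate brand/product names, using a regexp chunker.
grammar = r'BRAND: {<NNP>+}'
chunker = nltk.RegexpParser(grammar)

tagged = [("I", "PRP"), ("bought", "VBD"), ("an", "DT"),
          ("Apple", "NNP"), ("MacBook", "NNP"), ("yesterday", "NN")]
tree = chunker.parse(tagged)

# Collect the words inside every BRAND chunk
brands = [" ".join(word for word, tag in subtree.leaves())
          for subtree in tree.subtrees(lambda t: t.label() == "BRAND")]
print(brands)  # ['Apple MacBook']
```

A real system would of course use a trained model rather than a single tag pattern, but the chunk-grammar mechanism is the same as in the noun-phrase example earlier.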

Relation extraction

Relation extraction, another commonly used information extraction operation, is the process of extracting the different relationships between various entities. There can be different relationships such as inheritance, synonymy, analogy, etc., whose definition depends on the information need. For example, suppose we want to look up the writer of a book; then authorship would be a relation between the author name and the book name.

Example

In the following example, we use the same IE pipeline as shown in the above diagram, which we used up to named-entity recognition (NER), and extend it with a relation pattern based on the NER tags.

import nltk
import re

IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
   for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
      print(nltk.sem.rtuple(rel))

Output

[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan & Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
[ORG: 'WGBH'] 'in' [LOC: 'Boston']
[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
[ORG: 'Omnicom'] 'in' [LOC: 'New York']
[ORG: 'DDB Needham'] 'in' [LOC: 'New York']
[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']

In the above code, we have used an inbuilt corpus named ieer. In this corpus, the sentences are tagged up to the named-entity recognition (NER) level. Here we only need to specify the relation pattern that we want and the kinds of NER entities we want the relation to hold between. In our example, we defined a relationship between an organization and a location and extracted all combinations matching the pattern.
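The regular expression IN used above can be tested on its own: it accepts filler strings that contain the standalone word "in" but rejects those where "in" is followed by a gerund, which filters out fillers like "success in supervising" that do not express a location relation. A small self-contained check:

```python
import re

# Same filler pattern as in the relation-extraction example above
IN = re.compile(r'.*\bin\b(?!\b.+ing)')

# Plain "in" between an organization and a location: accepted
print(bool(IN.match('headquartered in')))        # True

# "in" followed by a gerund: rejected by the negative lookahead
print(bool(IN.match('success in supervising')))  # False
```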