Natural Language Toolkit: A Concise Tutorial
Natural Language Toolkit - Transforming Chunks
Why transform chunks?
So far we have extracted chunks, or phrases, from sentences, but what are we supposed to do with them? One important task is to transform them, for purposes such as:
- grammatical correction
- rearranging phrases
Filtering insignificant/useless words
Suppose you want to judge the meaning of a phrase. Many commonly used words, such as 'the' and 'a', contribute little to that meaning. For example, consider the following phrase:
'The movie was good'.
Here the most significant words are 'movie' and 'good'. The other words, 'the' and 'was', are insignificant because the phrase keeps the same meaning without them: 'Good movie'.
In the following Python recipe, we will learn how to remove insignificant words and keep the significant ones with the help of POS tags.
Example
First, by looking through the treebank corpus for stopwords, we need to decide which part-of-speech tags are significant and which are not. The following table lists insignificant words and their tags:
Word | Tag
a | DT
all | PDT
an | DT
and | CC
or | CC
that | WDT
the | DT
From the table above, we can see that, apart from CC, all the tags end with DT, so we can filter out insignificant words by checking the tag's suffix.
For this example, we use a function named filter() that takes a single chunk and returns a new chunk without the insignificant tagged words. It drops every word whose tag ends with DT or CC. Note that this name shadows Python's built-in filter() in any module that imports it.
Example
def filter(chunk, tag_suffixes=('DT', 'CC')):
    # Keep only the (word, tag) pairs whose tag matches none of the suffixes.
    significant = []
    for word, tag in chunk:
        ok = True
        for suffix in tag_suffixes:
            if tag.endswith(suffix):
                ok = False
                break
        if ok:
            significant.append((word, tag))
    return significant
Now let us use filter() in our Python recipe to remove the insignificant words. The import expects the function to be saved in a file named chunk_parse.py:
from chunk_parse import filter
print(filter([('the', 'DT'), ('good', 'JJ'), ('movie', 'NN')]))
# Output: [('good', 'JJ'), ('movie', 'NN')]
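The suffix test can also be written more compactly, because str.endswith() accepts a tuple of suffixes. The following is only a sketch of that idea; the name filter_insignificant is a hypothetical one chosen here to avoid shadowing the built-in filter(), and it is not part of the recipe above.

```python
# Sketch only: filter_insignificant is a hypothetical name, not from the tutorial.
def filter_insignificant(chunk, tag_suffixes=('DT', 'CC')):
    """Return the (word, tag) pairs whose tag ends with none of the suffixes."""
    suffixes = tuple(tag_suffixes)  # str.endswith() needs a str or a tuple
    return [(word, tag) for word, tag in chunk if not tag.endswith(suffixes)]


result = filter_insignificant([('the', 'DT'), ('good', 'JJ'), ('movie', 'NN')])
print(result)  # [('good', 'JJ'), ('movie', 'NN')]
```

The behavior matches the loop version: PDT and WDT are dropped as well, since both end with DT.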
Verb Correction
In real-world language we often see incorrect verb forms. For example, 'is you fine?' is incorrect; the sentence should be 'are you fine?'. We can correct such mistakes by creating verb correction mappings. Which mapping is applied depends on whether the chunk contains a plural or a singular noun.
Example
To implement the Python recipe, we first need to define the verb correction mappings. Let us create two mappings as follows:
Singular to plural mappings (applied when the noun is plural)
plural = {
    ('is', 'VBZ'): ('are', 'VBP'),
    ('was', 'VBD'): ('were', 'VBD')
}
Plural to singular mappings (applied when the noun is singular)
singular = {
    ('are', 'VBP'): ('is', 'VBZ'),
    ('were', 'VBD'): ('was', 'VBD')
}
As seen above, each mapping pairs a tagged verb with another tagged verb. These initial mappings cover only the basics: is to are, was to were, and vice versa.
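Because the mappings are plain dictionaries, a verb that has no entry can simply be left unchanged by giving dict.get() the original pair as its default. The sketch below illustrates that lookup; the helper name to_plural is hypothetical and exists only for this illustration.

```python
# Mapping copied from the tutorial: singular verb forms -> plural verb forms.
plural = {
    ('is', 'VBZ'): ('are', 'VBP'),
    ('was', 'VBD'): ('were', 'VBD'),
}


# Hypothetical helper, for illustration only.
def to_plural(tagged_verb):
    """Return the mapped plural form, or the input itself if no mapping exists."""
    return plural.get(tagged_verb, tagged_verb)


print(to_plural(('is', 'VBZ')))    # ('are', 'VBP')
print(to_plural(('runs', 'VBZ')))  # ('runs', 'VBZ'), no mapping defined
```

Extending the correction to more verbs is then a matter of adding entries to the dictionaries.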
Next, we define a function named verbs(). You pass it a chunk with an incorrect verb form and get a corrected chunk back. To do this, verbs() uses a helper function named index_chunk(), which searches the chunk for the position of the first tagged word that satisfies a given predicate.
Let us look at these functions:
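def index_chunk(chunk, pred, start=0, step=1):
    # Return the index of the first item satisfying pred, scanning from
    # start forwards (step > 0) or backwards (step < 0); None if not found.
    l = len(chunk)
    end = l if step > 0 else -1
    for i in range(start, end, step):
        if pred(chunk[i]):
            return i
    return None

def tag_startswith(prefix):
    # Build a predicate that tests whether a (word, tag) pair's tag starts with prefix.
    def f(wt):
        return wt[1].startswith(prefix)
    return f

def verbs(chunk):
    vbidx = index_chunk(chunk, tag_startswith('VB'))
    # No verb found: nothing to correct.
    if vbidx is None:
        return chunk
    verb, vbtag = chunk[vbidx]
    nnpred = tag_startswith('NN')
    # Look for the nearest noun to the right of the verb, then to the left.
    nnidx = index_chunk(chunk, nnpred, start=vbidx+1)
    if nnidx is None:
        nnidx = index_chunk(chunk, nnpred, start=vbidx-1, step=-1)
    # No noun found: nothing to correct.
    if nnidx is None:
        return chunk
    noun, nntag = chunk[nnidx]
    # Plural noun tags (NNS, NNPS) end with 'S'. The chunk is modified in place.
    if nntag.endswith('S'):
        chunk[vbidx] = plural.get((verb, vbtag), (verb, vbtag))
    else:
        chunk[vbidx] = singular.get((verb, vbtag), (verb, vbtag))
    return chunk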
Save these functions, together with the two mappings, in a Python file in your working directory. I have saved it as verbcorrect.py.
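Now let us call verbs() on a POS-tagged chunk. Note that verbs() only looks for noun tags (those starting with NN), so a chunk such as 'is you fine', which contains a pronoun but no noun, is returned unchanged. A chunk with a plural noun, 'is our children learning', shows the correction:
from verbcorrect import verbs
print(verbs([('is', 'VBZ'), ('our', 'PRP$'), ('children', 'NNS'), ('learning', 'VBG')]))
# Output: [('are', 'VBP'), ('our', 'PRP$'), ('children', 'NNS'), ('learning', 'VBG')]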
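To see how verbs() locates the noun, it may help to run index_chunk() on its own. The sketch below repeats the two helpers so that it is self-contained and uses a made-up chunk, 'the dogs is loud', in which the noun lies to the left of the verb.

```python
def index_chunk(chunk, pred, start=0, step=1):
    """Index of the first item satisfying pred, or None if there is none."""
    end = len(chunk) if step > 0 else -1
    for i in range(start, end, step):
        if pred(chunk[i]):
            return i
    return None


def tag_startswith(prefix):
    """Predicate factory: does the tag of a (word, tag) pair start with prefix?"""
    def f(wt):
        return wt[1].startswith(prefix)
    return f


# Illustrative chunk, not taken from the tutorial.
chunk = [('the', 'DT'), ('dogs', 'NNS'), ('is', 'VBZ'), ('loud', 'JJ')]
is_noun = tag_startswith('NN')

vbidx = index_chunk(chunk, tag_startswith('VB'))               # verb position
right = index_chunk(chunk, is_noun, start=vbidx + 1)           # search rightwards
left = index_chunk(chunk, is_noun, start=vbidx - 1, step=-1)   # search leftwards
print(vbidx, right, left)  # 2 None 1
```

The rightward search finds nothing, so the backward search with step=-1 supplies the noun at index 1; its NNS tag would make verbs() replace 'is' with 'are'.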
Eliminating passive voice from phrases
Another useful task is to eliminate passive voice from phrases. This can be done by swapping the words around a verb. For example, 'the tutorial was great' can be transformed into 'the great tutorial'.
Example
To achieve this, we define a function named eliminate_passive() that swaps the right-hand side of the chunk with the left-hand side, using the verb as the pivot point. To find the verb to pivot around, it also uses the index_chunk() function defined above.
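from verbcorrect import index_chunk  # helper defined above

def eliminate_passive(chunk):
    def vbpred(wt):
        word, tag = wt
        # Pivot only on inflected verbs: skip the base form 'VB' and gerunds 'VBG'.
        return tag != 'VBG' and tag.startswith('VB') and len(tag) > 2
    vbidx = index_chunk(chunk, vbpred)
    if vbidx is None:
        return chunk
    # Drop the verb and put the right-hand side before the left-hand side.
    return chunk[vbidx+1:] + chunk[:vbidx]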
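Now let us call eliminate_passive() on the POS-tagged chunk 'the tutorial was great'. The import expects the function to be saved in passiveverb.py. Because the function only swaps the two sides of the verb, the raw result reads 'great the tutorial', so it works best once the insignificant words have been filtered out: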
from passiveverb import eliminate_passive
print(eliminate_passive(
    [('the', 'DT'), ('tutorial', 'NN'), ('was', 'VBD'), ('great', 'JJ')]
))
# Output: [('great', 'JJ'), ('the', 'DT'), ('tutorial', 'NN')]
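Swapping around the verb leaves the determiner in the middle of the phrase, so one plausible follow-up is to chain the swap with the insignificant-word filter from earlier. The sketch below repeats the needed functions to stay self-contained; filter_insignificant is a hypothetical name for the filtering step, and the chaining itself is an assumption about how the two recipes could be combined, not something the tutorial prescribes.

```python
def index_chunk(chunk, pred, start=0, step=1):
    """Index of the first item satisfying pred, or None if there is none."""
    end = len(chunk) if step > 0 else -1
    for i in range(start, end, step):
        if pred(chunk[i]):
            return i
    return None


def eliminate_passive(chunk):
    """Drop the pivot verb and move the right-hand side before the left-hand side."""
    def vbpred(wt):
        word, tag = wt
        return tag != 'VBG' and tag.startswith('VB') and len(tag) > 2
    vbidx = index_chunk(chunk, vbpred)
    if vbidx is None:
        return chunk
    return chunk[vbidx + 1:] + chunk[:vbidx]


# Hypothetical name for the DT/CC filter, chosen to avoid shadowing filter().
def filter_insignificant(chunk, tag_suffixes=('DT', 'CC')):
    """Remove words whose tag ends with one of the given suffixes."""
    return [(w, t) for w, t in chunk if not t.endswith(tuple(tag_suffixes))]


chunk = [('the', 'DT'), ('tutorial', 'NN'), ('was', 'VBD'), ('great', 'JJ')]
swapped = eliminate_passive(chunk)
cleaned = filter_insignificant(swapped)
print(swapped)  # [('great', 'JJ'), ('the', 'DT'), ('tutorial', 'NN')]
print(cleaned)  # [('great', 'JJ'), ('tutorial', 'NN')]
```

Filtering before the swap gives the same result here, since the determiner is removed either way.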
Swapping noun cardinals
As we know, a cardinal number such as 5 is tagged CD in a chunk. Cardinals often occur before or after a noun, but for normalization it is useful to always put them before the noun. For example, the date January 5 can be written as 5 January. Let us understand this with the following example.
Example
To achieve this, we define a function named swapping_cardinals() that swaps any cardinal occurring immediately after a noun with that noun, so that the cardinal comes immediately before the noun. To compare against a given tag for equality, it uses a helper function named tag_eql().
def tag_eql(tag):
    def f(wt):
        return wt[1] == tag
    return f
Now we can define swapping_cardinals():
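from verbcorrect import index_chunk  # helper defined above

def swapping_cardinals(chunk):
    cdidx = index_chunk(chunk, tag_eql('CD'))
    # "not cdidx" covers both None (no cardinal) and 0 (nothing precedes it).
    if not cdidx or not chunk[cdidx-1][1].startswith('NN'):
        return chunk
    # Swap the cardinal with the noun before it. The chunk is modified in place.
    noun, nntag = chunk[cdidx-1]
    chunk[cdidx-1] = chunk[cdidx]
    chunk[cdidx] = noun, nntag
    return chunk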
Now let us call swapping_cardinals() on the date 'January 5'. The import expects the function to be saved in Cardinals.py:
from Cardinals import swapping_cardinals
print(swapping_cardinals([('January', 'NNP'), ('5', 'CD')]))
# Output: [('5', 'CD'), ('January', 'NNP')]
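Note that swapping_cardinals() modifies the list it receives. If the original chunk must stay intact, a copy can be swapped instead. The following is a minimal sketch of such a variant under that assumption; the name swapped_cardinals is hypothetical and the helper is repeated so the block runs on its own.

```python
def index_chunk(chunk, pred, start=0, step=1):
    """Index of the first item satisfying pred, or None if there is none."""
    end = len(chunk) if step > 0 else -1
    for i in range(start, end, step):
        if pred(chunk[i]):
            return i
    return None


# Hypothetical non-mutating variant of swapping_cardinals().
def swapped_cardinals(chunk):
    """Return a new list with the first cardinal moved before the noun preceding it."""
    result = list(chunk)  # shallow copy, so the caller's list is untouched
    cdidx = index_chunk(result, lambda wt: wt[1] == 'CD')
    # Nothing to do if there is no cardinal, it is first, or no noun precedes it.
    if not cdidx or not result[cdidx - 1][1].startswith('NN'):
        return result
    result[cdidx - 1], result[cdidx] = result[cdidx], result[cdidx - 1]
    return result


original = [('January', 'NNP'), ('5', 'CD')]
swapped = swapped_cardinals(original)
print(swapped)   # [('5', 'CD'), ('January', 'NNP')]
print(original)  # [('January', 'NNP'), ('5', 'CD')]
```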