Natural Language Toolkit: A Concise Tutorial
Natural Language Toolkit - Word Replacement
Stemming and lemmatization can be thought of as a kind of linguistic compression. In the same sense, word replacement can be thought of as text normalization or error correction.
But why do we need word replacement? Tokenization, for example, has trouble with contractions (such as can’t and won’t). To handle such cases we use word replacement, for instance by replacing contractions with their expanded forms.
Word replacement using regular expression
First, we will replace words that match a regular expression. This requires a basic understanding of regular expressions and of Python's re module. In the example below, we replace contractions with their expanded forms (e.g. “can’t” becomes “cannot”) using regular expressions.
Example
First, import the re package to work with regular expressions.
import re
Next, define the replacement patterns of your choice as follows:
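# Each entry is (regex, replacement). Specific contractions come first,
# so that the generic n't rule does not turn "won't" into "wo not".
R_patterns = [
    (r"won't", 'will not'),
    (r"can't", 'cannot'),
    (r"i'm", 'i am'),
    (r"(\w+)'ll", r'\g<1> will'),   # \g<1> re-inserts the captured word
    (r"(\w+)n't", r'\g<1> not'),
    (r"(\w+)'ve", r'\g<1> have'),
    (r"(\w+)'s", r'\g<1> is'),
    (r"(\w+)'re", r'\g<1> are'),
]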
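The order of the entries matters, because the rules are applied one after another. The following is a minimal sketch of that effect using only the re module; the names generic, specific and apply_rules are illustrative assumptions and are not part of the recipe itself.

```python
import re

# Two rules taken from the pattern list: one generic, one specific.
generic = (r"(\w+)n't", r"\g<1> not")   # \g<1> re-inserts the captured stem
specific = (r"won't", "will not")

def apply_rules(text, rules):
    # Apply each (regex, replacement) pair in list order.
    for regex, repl in rules:
        text = re.sub(regex, repl, text)
    return text

print(apply_rules("I won't go", [specific, generic]))  # I will not go
print(apply_rules("I won't go", [generic, specific]))  # I wo not go
```

With the generic rule first, the stem "wo" is captured before the specific "won't" rule ever gets a chance to match.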
Now, create a class that can be used for replacing words:
class REReplacer(object):
    def __init__(self, patterns=R_patterns):
        # Compile every pattern once, keeping it paired with its replacement.
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        # Apply the rules in list order.
        for (pattern, repl) in self.patterns:
            s = re.sub(pattern, repl, s)
        return s
Save this Python program as a module (say replacerRE.py). You can then import the REReplacer class whenever you want to replace words. Let us see how.
from replacerRE import REReplacer
rep_word = REReplacer()
rep_word.replace("I won't do it")
Output:
'I will not do it'
rep_word.replace("I can't do it")
Output:
'I cannot do it'
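Two limitations of these patterns are worth noting. They match only the straight apostrophe ('), and they are case-sensitive, so text containing a typographic apostrophe (’) or a capitalized "I'm" passes through unchanged. One possible workaround, sketched below, normalizes apostrophes first and compiles the patterns with re.IGNORECASE. The helper name expand and the two-rule RULES list are assumptions for illustration, not part of the recipe above.

```python
import re

# Illustrative subset of rules, compiled case-insensitively.
RULES = [
    (re.compile(r"can't", re.IGNORECASE), "cannot"),
    (re.compile(r"i'm", re.IGNORECASE), "I am"),
]

def expand(text):
    # Map the typographic apostrophe (U+2019) to the ASCII one first.
    text = text.replace("\u2019", "'")
    for regex, repl in RULES:
        text = regex.sub(repl, text)
    return text

print(expand("I can\u2019t do it"))  # I cannot do it
print(expand("I'm ready"))           # I am ready
```

Note that a case-insensitive match does not preserve the original capitalization, so "Can't" would also become "cannot" in this sketch.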
Complete implementation example
import re

# Each entry is (regex, replacement). Specific contractions come first.
R_patterns = [
    (r"won't", 'will not'),
    (r"can't", 'cannot'),
    (r"i'm", 'i am'),
    (r"(\w+)'ll", r'\g<1> will'),
    (r"(\w+)n't", r'\g<1> not'),
    (r"(\w+)'ve", r'\g<1> have'),
    (r"(\w+)'s", r'\g<1> is'),
    (r"(\w+)'re", r'\g<1> are'),
]

class REReplacer(object):
    def __init__(self, patterns=R_patterns):
        # Compile every pattern once, keeping it paired with its replacement.
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        # Apply the rules in list order.
        for (pattern, repl) in self.patterns:
            s = re.sub(pattern, repl, s)
        return s
Once you have saved the above program, you can import the class and use it as follows:
from replacerRE import REReplacer
rep_word = REReplacer()
rep_word.replace("I won't do it")
Replacement before text processing
A common practice in natural language processing (NLP) is to clean up the text before processing it. The REReplacer class created in the previous example can serve as such a preliminary step before text processing, i.e. before tokenization.
Example
from nltk.tokenize import word_tokenize
from replacerRE import REReplacer
rep_word = REReplacer()
word_tokenize("I won't be able to do this now")
Output:
['I', 'wo', "n't", 'be', 'able', 'to', 'do', 'this', 'now']
word_tokenize(rep_word.replace("I won't be able to do this now"))
Output:
['I', 'will', 'not', 'be', 'able', 'to', 'do', 'this', 'now']
The Python recipe above makes the difference clear: without regular expression replacement the tokenizer splits "won't" into 'wo' and "n't", whereas with replacement it produces the tokens 'will' and 'not'.
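The same effect can be reproduced without NLTK installed. The sketch below uses a simple regular-expression tokenizer as a stand-in for word_tokenize, so the exact tokens differ from NLTK's output; simple_tokenize and expand are hypothetical helpers, and expand applies only a single rule where REReplacer would apply the full list.

```python
import re

def simple_tokenize(text):
    # Stand-in tokenizer: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def expand(text):
    # One illustrative rule; the REReplacer class would apply the full list.
    return re.sub(r"won't", "will not", text)

sentence = "I won't be able to do this now"
print(simple_tokenize(sentence))
# ['I', 'won', "'", 't', 'be', 'able', 'to', 'do', 'this', 'now']
print(simple_tokenize(expand(sentence)))
# ['I', 'will', 'not', 'be', 'able', 'to', 'do', 'this', 'now']
```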
Removal of repeating characters
Are we strictly grammatical in our everyday language? No, we are not. For example, we sometimes write ‘Hiiiiiiiiiiii Mohan’ to emphasize the word ‘Hi’. A computer system, however, does not know that ‘Hiiiiiiiiiiii’ is a variation of the word ‘Hi’. In the example below, we create a class named Rep_word_removal that can be used to remove repeating characters from a word.
Example
First, import the re package to work with regular expressions, along with the wordnet corpus from NLTK, which is used to check whether a word exists:
import re
from nltk.corpus import wordnet
Now, create a class that can be used for removing the repeating characters:
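class Rep_word_removal(object):
    def __init__(self):
        # Match any character that is immediately followed by itself.
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        # Keep a single copy of the doubled character.
        self.repl = r'\1\2\3'

    def replace(self, word):
        # Stop as soon as WordNet recognizes the word.
        if wordnet.synsets(word):
            return word
        repl_word = self.repeat_regexp.sub(self.repl, word)
        # Recurse while a repeated character was still removed.
        if repl_word != word:
            return self.replace(repl_word)
        else:
            return repl_word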
Save this Python program as a module (say removalrepeat.py). You can then import the Rep_word_removal class whenever you want to remove repeating characters. Let us see how.
from removalrepeat import Rep_word_removal
rep_word = Rep_word_removal()
rep_word.replace("Hiiiiiiiiiiiiiiiiiiiii")
Output:
'Hi'
rep_word.replace("Hellooooooooooooooo")
Output:
'Hello'
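The dictionary check is what stops the reduction at a real word. The following sketch shows the same idea as a loop, with a small hand-written set standing in for the WordNet lookup; KNOWN_WORDS, squeeze and the vocabulary parameter are assumptions made for illustration, and a real vocabulary would be far larger.

```python
import re

REPEAT = re.compile(r"(\w*)(\w)\2(\w*)")
KNOWN_WORDS = {"hi", "hello", "good"}  # hypothetical stand-in for WordNet

def squeeze(word, vocabulary=KNOWN_WORDS):
    # Drop one repeated character per pass until the word is known
    # or no adjacent duplicates remain.
    while word.lower() not in vocabulary:
        shorter = REPEAT.sub(r"\1\2\3", word)
        if shorter == word:
            break
        word = shorter
    return word

print(squeeze("Hiiii"))                      # Hi
print(squeeze("Hellooo"))                    # Hello
print(squeeze("goooood"))                    # good
print(squeeze("goooood", vocabulary=set()))  # god
```

With an empty vocabulary nothing halts the reduction early, so the legitimate double letter in "good" is removed as well.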
Complete implementation example
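import re
from nltk.corpus import wordnet

class Rep_word_removal(object):
    def __init__(self):
        # Match any character that is immediately followed by itself.
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        # Keep a single copy of the doubled character.
        self.repl = r'\1\2\3'

    def replace(self, word):
        # Stop as soon as WordNet recognizes the word.
        if wordnet.synsets(word):
            return word
        replace_word = self.repeat_regexp.sub(self.repl, word)
        # Recurse while a repeated character was still removed.
        if replace_word != word:
            return self.replace(replace_word)
        else:
            return replace_word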
Once you have saved the above program, you can import the class and use it as follows:
from removalrepeat import Rep_word_removal
rep_word = Rep_word_removal()
rep_word.replace("Hiiiiiiiiiiiiiiiiiiiii")