Natural Language Toolkit: A Concise Tutorial

Natural Language Toolkit - Introduction

What is Natural Language Processing (NLP)?

Language is the method of communication that lets humans speak, read, and write. We think, make plans, and make decisions in our natural language. The big question is whether, in the era of artificial intelligence, machine learning, and deep learning, humans can communicate with computers in natural language. Developing NLP applications is a major challenge because computers require structured data, whereas human language is unstructured and often ambiguous.

Natural language processing is the subfield of computer science, and more specifically of AI, that enables computers to understand, process, and manipulate human language. In simple words, NLP is a way for machines to analyze, understand, and derive meaning from natural languages such as Hindi, English, French, and Dutch.

How does it work?

Before diving into how NLP works, we must understand how human beings use language. Every day, we use hundreds or thousands of words, and other people interpret them and answer accordingly. It is simple communication for humans, isn't it? But words run much deeper than that, and we always derive context from what is said and how it is said. That is why NLP draws on contextual patterns rather than on voice modulation.

Let us understand it with an example:

Man is to woman as king is to what?
We can interpret it easily and answer as follows:
Man relates to king, so woman can relate to queen.
Hence the answer is Queen.

How do humans know which word means what? The answer is that we learn through experience. But how do machines learn the same thing?

Let us understand it with the following steps:

  1. First, we feed the machine enough data so that it can learn from experience.

  2. Then the machine uses deep learning algorithms to create word vectors from the data we fed it and from each word's surrounding context.

  3. Finally, by performing simple algebraic operations on these word vectors, the machine can provide answers the way human beings do.
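
The algebra in step 3 can be sketched with a few hand-made vectors. This is only an illustration: the three-dimensional vectors below are invented for the example (real word vectors are learned from data and have hundreds of dimensions), and the names `VECTORS`, `cosine`, and `analogy` are hypothetical.

```python
import math

# Hypothetical word vectors, hand-picked for illustration only.
# The dimensions loosely stand for (royalty, maleness, femaleness).
VECTORS = {
    "man":   (0.0, 1.0, 0.0),
    "woman": (0.0, 0.0, 1.0),
    "king":  (1.0, 1.0, 0.0),
    "queen": (1.0, 0.0, 1.0),
    "apple": (0.1, 0.1, 0.1),
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' using the vector b - a + c."""
    target = tuple(
        vb - va + vc
        for va, vb, vc in zip(VECTORS[a], VECTORS[b], VECTORS[c])
    )
    # Exclude the three query words, then pick the closest remaining word.
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))

# "Man is to woman as king is to what?"
print(analogy("man", "woman", "king"))
```

With these toy vectors, `woman - man + king` lands exactly on the vector for "queen", which is why the nearest remaining word is the expected answer.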

Components of NLP

The following diagram represents the components of natural language processing (NLP):

[Figure: components of NLP]

Morphological Processing

Morphological processing is the first component of NLP. It breaks chunks of language input into sets of tokens corresponding to paragraphs, sentences, and words. For example, a word like “everyday” can be broken into two sub-word tokens, “every” and “day”.
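
The splitting into paragraphs, sentences, and words described above can be sketched with Python's standard `re` module. This is a deliberately naive sketch, not how a production tokenizer works: the function names are hypothetical, and the sentence rule (split after `.`, `!`, or `?`) will mishandle abbreviations such as "e.g.".

```python
import re

def split_paragraphs(text):
    """Split text into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph):
    """Naively split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]

def split_words(sentence):
    """Return words (keeping internal apostrophes) and punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)

text = "NLP is fun. It has many uses!\n\nTokens come first."
for paragraph in split_paragraphs(text):
    for sentence in split_sentences(paragraph):
        print(split_words(sentence))
```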

Syntax analysis

Syntax analysis, the second component, is one of the most important components of NLP. Its purposes are as follows:

  1. To check whether a sentence is well formed.

  2. To break the sentence up into a structure that shows the syntactic relationships between its words.

For example, a sentence like “The school goes to the student” would be rejected by the syntax analyzer.

Semantic analysis

Semantic analysis is the third component of NLP. It checks the meaningfulness of the text by drawing out its exact, or dictionary, meaning. For example, a sentence like “It’s a hot ice-cream.” would be discarded by the semantic analyzer.

Pragmatic analysis

Pragmatic analysis is the fourth component of NLP. It fits the actual objects or events that exist in a given context to the object references obtained by the previous component, semantic analysis. For example, the sentence “Put the fruits in the basket on the table” has two semantic interpretations, so the pragmatic analyzer must choose between them.

Examples of NLP Applications

NLP is an emerging technology behind many of the forms of AI we now see every day. For today’s and tomorrow’s increasingly cognitive applications, using NLP to create a seamless, interactive interface between humans and machines will continue to be a top priority. The following are some very useful applications of NLP.

Machine Translation

Machine translation (MT) is one of the most important applications of natural language processing. MT is the process of translating text from a source language into another language. A machine translation system can be either bilingual or multilingual.

Most search engines, such as Google, Yahoo, Bing, and WolframAlpha, base their machine translation technology on NLP deep learning models. Such models allow algorithms to read the text on a webpage, interpret its meaning, and translate it into another language.

Fighting Spam

With the enormous increase in unwanted emails, spam filters have become important as the first line of defense against this problem. NLP techniques can be used to build spam filtering systems, with false positives and false negatives as the main concerns.

N-gram modelling, word stemming, and Bayesian classification are some of the existing NLP techniques that can be used for spam filtering.
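
As a rough illustration of the Bayesian classification mentioned for spam filtering, here is a minimal naive Bayes sketch using only the standard library. The four training messages are invented for the example, and the names `TRAIN`, `train`, and `classify` are hypothetical; a real filter would be trained on a large labelled corpus.

```python
import math
from collections import Counter

# Tiny hand-made training set; purely illustrative, not real data.
TRAIN = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
]

def train(examples):
    """Count word occurrences per label and messages per label."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        # Start from the log prior, then add one log likelihood per word.
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

word_counts, doc_counts = train(TRAIN)
print(classify("claim your free prize", word_counts, doc_counts))
print(classify("project meeting on monday", word_counts, doc_counts))
```

Add-one smoothing keeps an unseen word such as "on" from driving a probability to zero. Lowering false positives (legitimate mail marked as spam) would typically mean requiring a margin between the two scores before labelling a message as spam.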

Automatic Text Summarization

Automatic text summarization is a technique that creates a short, accurate summary of a longer text document, helping us get relevant information in less time. In this digital era, the need for automatic summarization is pressing because the flood of information on the internet is not going to stop. NLP plays an important role in building automatic text summarization systems.
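
One simple way to approach summarization is extractive: score each sentence by how frequent its words are in the whole text and keep the top-scoring sentences. The sketch below assumes that approach; the stop-word list and the `summarize` function are hypothetical, and this is not a description of any particular product's method.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "it", "on", "for"}

def content_words(text):
    """Lowercased words with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

def summarize(text, max_sentences=1):
    """Keep the sentences whose content words are most frequent overall."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in tokens) / max(len(tokens), 1)

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    # Preserve the original order of the chosen sentences.
    return " ".join(s for s in sentences if s in top)

doc = ("NLP helps machines process text. "
       "Text summarization shortens long text. "
       "The weather is nice.")
print(summarize(doc))
```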

Grammar Correction

Spelling and grammar correction is a very useful feature of word processing software such as Microsoft Word. Natural language processing (NLP) is widely used for this purpose.

Question-answering

Question answering, another major application of natural language processing (NLP), focuses on building systems that automatically answer questions posed by users in their natural language.

Sentiment analysis

Sentiment analysis is another important application of natural language processing (NLP). As its name implies, sentiment analysis is used to:

  1. Identify the sentiment across several posts.

  2. Identify the sentiment where emotions are not expressed explicitly.

Online e-commerce companies such as Amazon and eBay use sentiment analysis to identify the opinions and sentiment their customers express online. It helps them understand what customers think about their products and services.
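
A very basic form of sentiment analysis counts positive and negative words from a lexicon. The sketch below assumes that approach with a tiny invented lexicon (`POSITIVE`, `NEGATIVE`, and `NEGATIONS` are hypothetical). It only handles explicitly stated sentiment; detecting sentiment that is not expressed explicitly generally requires trained models.

```python
import re

# Hypothetical mini-lexicon; real systems use much larger resources.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow"}
NEGATIONS = {"not", "never", "no"}

def sentiment(text):
    """Return 'positive', 'negative', or 'neutral' based on lexicon hits."""
    score = 0
    negate = False
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in NEGATIONS:
            # A negation flips only the word immediately after it.
            negate = True
            continue
        polarity = (word in POSITIVE) - (word in NEGATIVE)
        score += -polarity if negate else polarity
        negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for review in ["I love this phone, great battery",
               "Delivery was slow and support was terrible",
               "This is not good"]:
    print(review, "->", sentiment(review))
```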

Speech engines

Speech engines such as Siri, Google Voice, and Alexa are built on NLP so that we can communicate with them in our natural language.

Implementing NLP

To build the applications mentioned above, we need a specific skill set: a strong understanding of language and of the tools that process it efficiently. Various tools are available for this. Some are open source, while others are developed by organizations to build their own NLP applications. The following is a list of some NLP tools:

  1. Natural Language Toolkit (NLTK)

  2. Mallet

  3. GATE

  4. OpenNLP

  5. UIMA

  6. Gensim

  7. Stanford toolkit

Most of these tools are written in Java.

Natural Language Toolkit (NLTK)

Among the NLP tools listed above, NLTK scores very high for ease of use and for how clearly it explains concepts. Python is quick to learn, and because NLTK is written in Python, it also makes a very good learning kit. NLTK covers most common tasks, such as tokenization, stemming, lemmatization, punctuation handling, character counts, and word counts. It is elegant and easy to work with.
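
A few of these tasks can be tried with a short script. This sketch assumes NLTK is installed (for example with `pip install nltk`) and uses only `wordpunct_tokenize` and `PorterStemmer`, which need no extra data downloads; the sample sentence is invented for the example.

```python
from collections import Counter

from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

text = "NLTK makes tokenizing and stemming text easy."

# Tokenization: wordpunct_tokenize is regex-based and needs no corpus data.
tokens = wordpunct_tokenize(text)

# Stemming: reduce each alphabetic token to its Porter stem.
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens if t.isalpha()]

# Word count and character count.
word_count = Counter(t.lower() for t in tokens if t.isalpha())
char_count = len(text)

print(tokens)
print(stems)
print(word_count.most_common(3))
print(char_count)
```

Lemmatization is left out of this sketch because NLTK's `WordNetLemmatizer` first requires downloading the WordNet corpus with `nltk.download("wordnet")`.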