Natural Language Processing: A Concise Tutorial

Applications of NLP

Natural Language Processing (NLP) is an emerging technology that underlies many of the forms of AI we see today. Its use in creating seamless, interactive interfaces between humans and machines will remain a top priority for current and future cognitive applications. This section discusses some of the most useful applications of NLP.

Machine Translation

Machine translation (MT), the process of translating text from a source language into another language, is one of the most important applications of NLP. The following flowchart illustrates the process:

[Figure: machine translation flowchart]

Types of Machine Translation Systems

There are different types of machine translation systems, described below.

Bilingual MT System

Bilingual MT systems produce translations between two particular languages.

Multilingual MT System

Multilingual MT systems produce translations between any pair of languages. They may be either unidirectional or bidirectional.

Approaches to Machine Translation (MT)

The main approaches to machine translation are as follows:

Direct MT Approach

This is the oldest approach to MT, though it is now less popular. Systems that use it translate the source language (SL) directly into the target language (TL). Such systems are bilingual and unidirectional.

Interlingua Approach

Systems that use the interlingua approach translate the SL into an intermediate language called an interlingua (IL) and then translate the IL into the TL. The following MT pyramid illustrates the approach:

[Figure: MT pyramid illustrating the interlingua approach]

Transfer Approach

This approach involves three stages:

  1. In the first stage, source language (SL) texts are converted to abstract SL-oriented representations.

  2. In the second stage, SL-oriented representations are converted into equivalent target language (TL)-oriented representations.

  3. In the third stage, the final text is generated.

Empirical MT Approach

This is an emerging approach to MT. It uses large amounts of raw data in the form of parallel corpora, which consist of texts and their translations. Analogy-based, example-based, and memory-based machine translation techniques all follow the empirical MT approach.
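As a rough illustration of the example-based idea, the sketch below looks up the stored sentence most similar to the input and returns its recorded translation. The three-pair corpus, the function name `translate_by_example`, and the 0.6 similarity threshold are all assumptions made for this example; real systems work from far larger parallel corpora and recombine fragments rather than whole sentences.

```python
from difflib import SequenceMatcher

# Hypothetical toy parallel corpus: (source sentence, its translation).
PARALLEL_CORPUS = [
    ("good morning", "bonjour"),
    ("thank you very much", "merci beaucoup"),
    ("where is the station", "où est la gare"),
]


def translate_by_example(sentence, corpus=PARALLEL_CORPUS, threshold=0.6):
    """Return the translation of the most similar stored sentence, or None."""
    best_score, best_translation = 0.0, None
    for source, target in corpus:
        # ratio() is a string similarity in [0, 1]; 1.0 means identical.
        score = SequenceMatcher(None, sentence.lower(), source).ratio()
        if score > best_score:
            best_score, best_translation = score, target
    # Refuse to answer when nothing in the corpus is close enough.
    return best_translation if best_score >= threshold else None


print(translate_by_example("Where is the station?"))
```

The threshold trades coverage for safety: lowering it returns a translation more often, at the cost of more mismatched examples.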

Fighting Spam

Unwanted email is one of the most common problems today. This makes spam filters all the more important, because they are the first line of defense against it.

A spam filtering system can be built with NLP techniques, provided its design accounts for the two main kinds of error: false positives (legitimate mail flagged as spam) and false negatives (spam that gets through).

Existing NLP models for spam filtering

The following are some existing NLP models for spam filtering:

N-gram Modeling

An N-gram is an N-character slice of a longer string. In this model, N-grams of several different lengths are used simultaneously to process and detect spam emails.
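A minimal sketch of character N-gram extraction follows. The helper names `char_ngrams` and `ngram_profile`, and the choice of lengths 2 to 4, are assumptions for illustration; how the resulting counts feed a spam detector is left out.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return every N-character slice of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_profile(text, sizes=(2, 3, 4)):
    """Count N-grams of several lengths at once."""
    profile = Counter()
    for n in sizes:
        profile.update(char_ngrams(text.lower(), n))
    return profile


print(char_ngrams("spam", 2))
print(ngram_profile("Free offer").most_common(3))
```

Two messages can then be compared through their profiles, for example by counting the N-grams they share.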

Word Stemming

Spammers usually change one or more characters of the telltale words in their messages so that they can get past content-based spam filters. Content-based filters are therefore of little use if they cannot recognize the intended words or phrases in an email. To address this, a rule-based word stemming technique has been developed that matches words which look alike and sound alike.
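The sketch below illustrates the look-alike half of this idea with a few hand-written rules. It is not the published stemming technique: the substitution table, the rules, and the names `normalize` and `matches_blocked` are assumptions chosen for the example, and sound-alike matching is not covered.

```python
import re

# Hypothetical look-alike substitutions; a real rule set would be much larger.
LOOKALIKES = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"}
)


def normalize(word):
    """Reduce an obfuscated token to a canonical form for matching."""
    word = word.lower().translate(LOOKALIKES)
    word = re.sub(r"[^a-z]", "", word)     # drop separators such as '.', '-' and '_'
    return re.sub(r"(.)\1+", r"\1", word)  # collapse runs of a repeated letter


def matches_blocked(token, blocked_words):
    """True if the token normalizes to the same form as any blocked word."""
    return normalize(token) in {normalize(w) for w in blocked_words}


print(matches_blocked("Fr33", ["free"]))
print(matches_blocked("v.i.a.g.r.a", ["viagra"]))
```

Because both the token and the blocked words pass through the same normalization, collapsing repeated letters does not prevent "free" from matching itself.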

Bayesian Classification

Bayesian classification is now a widely used technique for spam filtering. It is a statistical technique: the incidence of each word in an email is measured against its typical occurrence in a database of unsolicited (spam) and legitimate (ham) email messages.
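One common way to realize this is a naive Bayes classifier, sketched below with add-one smoothing. The four training messages are invented for the example, and the sketch assumes both classes appear at least once in the training data.

```python
import math
from collections import Counter

# Hypothetical training messages, invented for illustration.
TRAINING = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]


def train(messages):
    """Count words per class and messages per class."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in messages:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, doc_counts


def classify(text, model):
    """Return the label with the higher log posterior under naive Bayes."""
    word_counts, doc_counts = model
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the probability.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING)
print(classify("free money", model))
print(classify("monday meeting", model))
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.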

Automatic Summarization

In this digital era, the most valuable thing is data, or information. But do we really get information that is both useful and in the amount we need? The answer is no: we are overloaded, and our access to knowledge and information far exceeds our capacity to understand it. Because the flood of information on the internet is not going to stop, we badly need automatic text summarization.

Text summarization is the technique of creating a short, accurate summary of a longer text document. Automatic text summarization helps us get to relevant information in less time, and NLP plays an important role in developing it.
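One simple family of methods is extractive summarization, which selects existing sentences instead of writing new ones. The sketch below scores each sentence by the average frequency of its words and keeps the top few. The stop-word list, the scoring rule, and the name `summarize` are assumptions for this example, not a method prescribed by the text.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "it", "that", "this"}


def content_words(sentence):
    """Lowercased words of a sentence with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]


def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        # Average word frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:max_sentences]))


TEXT = "NLP helps machines read text. NLP helps people find text quickly. Bananas are yellow."
print(summarize(TEXT, max_sentences=1))
```

Sentences that repeat the document's most frequent content words rank highest, which is why the off-topic sentence about bananas is dropped first.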

Question-answering

Another major application of NLP is question answering. Search engines put the world's information at our fingertips, but they still fall short when it comes to answering questions that people pose in natural language. Big tech companies such as Google are also working in this direction.

Question answering is a computer science discipline within the fields of AI and NLP. It focuses on building systems that automatically answer questions posed by people in natural language. A system that understands natural language can translate sentences written by humans into an internal representation from which it can generate valid answers. Exact answers can be produced through syntactic and semantic analysis of the question. Lexical gaps, ambiguity, and multilingualism are some of the challenges in building a good question-answering system.

Sentiment Analysis

Another important application of NLP is sentiment analysis. As the name suggests, it is used to identify the sentiment expressed across a set of posts, including cases where the emotion is not stated explicitly. Companies use sentiment analysis to identify the opinions and sentiment of their customers online, which helps them understand what customers think about their products and services and judge their overall reputation. Beyond determining simple polarity, sentiment analysis interprets sentiment in context to help us better understand what lies behind an expressed opinion.
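To make the notion of polarity concrete, here is a minimal lexicon-based sketch with a crude negation rule. The word list, the weights, and the names `sentiment_score` and `label` are invented for illustration. It captures only the simplest contextual effect (a negation directly before a sentiment word) and is far from the context-aware analysis described above.

```python
import re

# Hypothetical miniature lexicon; real systems use far larger resources or trained models.
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2, "hate": -2}
NEGATIONS = {"not", "never", "no"}


def sentiment_score(text):
    """Sum word polarities, flipping the word that directly follows a negation."""
    score, negate = 0, False
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in NEGATIONS:
            negate = True
            continue
        polarity = LEXICON.get(word, 0)
        score += -polarity if negate else polarity
        negate = False  # a negation only affects the very next word
    return score


def label(text):
    """Map the numeric score to a coarse polarity label."""
    score = sentiment_score(text)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


print(label("I love this phone"))
print(label("The battery is not good"))
```

A fixed lexicon cannot handle sarcasm or implicit sentiment, which is where trained models come in.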