Artificial Intelligence: A Concise Tutorial

AI - Natural Language Processing

Natural Language Processing (NLP) refers to the AI method of communicating with an intelligent system using a natural language such as English.

Natural language processing is required when you want an intelligent system such as a robot to act on your instructions, when you want to hear a decision from a dialogue-based clinical expert system, and so on.

The field of NLP involves making computers perform useful tasks with the natural languages humans use. The input and output of an NLP system can be −

  1. Speech

  2. Written Text

Components of NLP

NLP has two components, as given below −

Natural Language Understanding (NLU)

Understanding involves the following tasks −

  1. Mapping the given input in natural language into useful representations.

  2. Analyzing different aspects of the language.

Natural Language Generation (NLG)

It is the process of producing meaningful phrases and sentences in natural language from some internal representation.

It involves −

  1. Text planning − It includes retrieving the relevant content from the knowledge base.

  2. Sentence planning − It includes choosing the required words, forming meaningful phrases, and setting the tone of the sentence.

  3. Text realization − It is mapping the sentence plan into sentence structure.
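The three NLG stages can be pictured as a small pipeline. The following is only a toy sketch: the knowledge base, the (subject, verb, object) fact format, and all function names are assumptions made for illustration, and a real generator would be far richer at every stage.

```python
# Hypothetical knowledge base: (subject, verb, object) facts.
KNOWLEDGE_BASE = [
    ("bird", "pecks", "grains"),
    ("bird", "builds", "nest"),
    ("robot", "follows", "instructions"),
]

def text_planning(topic):
    """Retrieve the facts relevant to the topic from the knowledge base."""
    return [fact for fact in KNOWLEDGE_BASE if fact[0] == topic]

def sentence_planning(fact):
    """Choose the words for one fact; here this only adds determiners."""
    subject, verb, obj = fact
    return {"subject": ["the", subject], "verb": [verb], "object": ["the", obj]}

def text_realization(plan):
    """Map the sentence plan onto a subject-verb-object sentence structure."""
    words = plan["subject"] + plan["verb"] + plan["object"]
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:] + "."

def generate(topic):
    """Run the three stages in order for every fact about the topic."""
    return [text_realization(sentence_planning(f)) for f in text_planning(topic)]

sentences = generate("bird")
print(sentences)
```

Keeping each stage in its own function mirrors the list above: content selection, word choice, and surface structure can each be changed without touching the others.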

NLU is harder than NLG.

Difficulties in NLU

Natural language has an extremely rich form and structure.

It is very ambiguous. There can be different levels of ambiguity −

  1. Lexical ambiguity − It occurs at a very primitive level, such as the word level. For example, should the word “board” be treated as a noun or a verb?

  2. Syntax-level ambiguity − A sentence can be parsed in different ways. For example, “He lifted the beetle with red cap.” − Did he use a cap to lift the beetle, or did he lift a beetle that had a red cap?

  3. Referential ambiguity − Referring to something using pronouns. For example, Rima went to Gauri. She said, “I am tired.” − Exactly who is tired?

  4. One input can have several different meanings.

  5. Many inputs can mean the same thing.
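Lexical ambiguity compounds quickly, because every ambiguous word multiplies the number of candidate readings of a sentence. The sketch below enumerates them for a hypothetical mini-lexicon; the words other than “board”, their word classes, and the function name are assumptions chosen for illustration.

```python
from itertools import product

# Hypothetical mini-lexicon: each word maps to the word classes it may take.
POS_OPTIONS = {
    "they": ["PRON"],
    "board": ["N", "V"],   # the "board" example from the list above
    "the": ["DET"],
    "ship": ["N", "V"],
}

def readings(sentence):
    """Return every possible assignment of word classes to the sentence."""
    words = sentence.lower().split()
    options = [POS_OPTIONS[word] for word in words]
    return [list(zip(words, tags)) for tags in product(*options)]

for reading in readings("They board the ship"):
    print(reading)
```

Two words with two classes each already give four candidate readings, and later analysis stages have to rule out the ones that do not fit.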

NLP Terminology

  1. Phonology − It is the study of how sounds are organized systematically.

  2. Morphology − It is the study of the construction of words from primitive meaningful units.

  3. Morpheme − It is the primitive unit of meaning in a language.

  4. Syntax − It refers to arranging words to make a sentence. It also involves determining the structural role of words in the sentence and in phrases.

  5. Semantics − It is concerned with the meaning of words and how to combine words into meaningful phrases and sentences.

  6. Pragmatics − It deals with using and understanding sentences in different situations and how the situation affects the interpretation of the sentence.

  7. Discourse − It deals with how the immediately preceding sentence can affect the interpretation of the next sentence.

  8. World Knowledge − It includes general knowledge about the world.

Steps in NLP

There are generally five steps −

  1. Lexical Analysis − It involves identifying and analyzing the structure of words. The lexicon of a language is the collection of words and phrases in that language. Lexical analysis divides the whole chunk of text into paragraphs, sentences, and words.

  2. Syntactic Analysis (Parsing) − It involves analyzing the words in the sentence for grammar and arranging the words in a manner that shows the relationships among them. A sentence such as “The school goes to boy” is rejected by an English syntactic analyzer.

  3. Semantic Analysis − It draws the exact meaning, or the dictionary meaning, from the text. The text is checked for meaningfulness. This is done by mapping syntactic structures to objects in the task domain. The semantic analyzer rejects phrases such as “hot ice-cream”.

  4. Discourse Integration − The meaning of any sentence depends upon the meaning of the sentence just before it. In addition, it also affects the meaning of the sentence that immediately follows.

  5. Pragmatic Analysis − During this step, what was said is re-interpreted in terms of what was actually meant. It involves deriving those aspects of language which require real-world knowledge.

[Figure: steps in NLP]
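The first step, lexical analysis, can be sketched with regular expressions. This is a deliberately naive illustration: it assumes that paragraphs are separated by blank lines and that sentences end in '.', '!' or '?', so it will mis-split abbreviations such as "Dr."; the function name is hypothetical.

```python
import re

def lexical_analysis(text):
    """Split raw text into paragraphs, then sentences, then words."""
    result = []
    # Paragraphs are assumed to be separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        sentences = []
        # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
            # Words are runs of letters, digits, apostrophes or hyphens.
            words = re.findall(r"[A-Za-z0-9'-]+", sentence)
            if words:
                sentences.append(words)
        result.append(sentences)
    return result

sample = "The bird pecks the grains. It is small.\n\nThe school is far."
structure = lexical_analysis(sample)
print(structure)
```

The result is a nested list (paragraphs containing sentences containing words), which is the shape the later analysis steps would consume.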

Implementation Aspects of Syntactic Analysis

Researchers have developed a number of algorithms for syntactic analysis, but we consider only the following simple methods −

  1. Context-Free Grammar

  2. Top-Down Parser

Let us see them in detail −

Context-Free Grammar

It is a grammar whose rewrite rules each have a single symbol on the left-hand side. Let us create a grammar to parse the sentence −

“The bird pecks the grains”

Articles (DET) − a | an | the

Nouns (N) − bird | birds | grain | grains

Noun Phrase (NP) − Article + Noun | Article + Adjective + Noun, i.e. DET N | DET ADJ N

Verbs (V) − pecks | pecking | pecked

Verb Phrase (VP) − NP V | V NP

Adjectives (ADJ) − beautiful | small | chirping

The parse tree breaks the sentence down into structured parts so that the computer can easily understand and process it. For the parsing algorithm to construct this parse tree, a set of rewrite rules, which describe what tree structures are legal, needs to be constructed.

These rules say that a certain symbol may be expanded in the tree by a sequence of other symbols. For example, if there are two strings, a Noun Phrase (NP) and a Verb Phrase (VP), then the string formed by NP followed by VP is a sentence. The rewrite rules for the sentence are as follows −

S → NP VP

NP → DET N | DET ADJ N

VP → V NP

Lexicon −

DET → a | the

ADJ → beautiful | perching

N → bird | birds | grain | grains

V → peck | pecks | pecking
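One possible way to hold these rules in a program is a pair of dictionaries, one for the rewrite rules and one for the lexicon. The sketch below then expands S into every sequence of word classes the rules allow; the names RULES, LEXICON and expand are illustrative choices, and the approach only terminates because this particular grammar has no recursive rules.

```python
from itertools import product

# Rewrite rules: each symbol maps to its alternative right-hand sides.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["DET", "N"], ["DET", "ADJ", "N"]],
    "VP": [["V", "NP"]],
}
# Lexicon: each word class maps to the words it covers.
LEXICON = {
    "DET": {"a", "the"},
    "ADJ": {"beautiful", "perching"},
    "N": {"bird", "birds", "grain", "grains"},
    "V": {"peck", "pecks", "pecking"},
}

def expand(symbol):
    """Yield every sequence of word classes derivable from the symbol."""
    if symbol in LEXICON:  # a word class: nothing left to rewrite
        yield [symbol]
        return
    for rhs in RULES[symbol]:
        # Combine every expansion of each symbol on the right-hand side.
        for parts in product(*(list(expand(s)) for s in rhs)):
            yield [tag for part in parts for tag in part]

patterns = list(expand("S"))
for pattern in patterns:
    print(" ".join(pattern))
```

With two alternatives for each of the two noun phrases, S expands to four word-class patterns, the shortest being DET N V DET N.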

The parse tree can be created as shown −

[Figure: NLP parse tree]
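In text form, the parse tree that the rewrite rules assign to “The bird pecks the grains” can be written as nested tuples and printed with indentation. The tuple layout and the helper names render and leaves are assumptions made for this sketch.

```python
# Parse tree as nested tuples: (label, child, child, ...) or (word_class, word).
TREE = ("S",
        ("NP", ("DET", "the"), ("N", "bird")),
        ("VP", ("V", "pecks"),
               ("NP", ("DET", "the"), ("N", "grains"))))

def is_leaf(children):
    """A leaf node carries a single word instead of child nodes."""
    return len(children) == 1 and isinstance(children[0], str)

def render(node, depth=0):
    """Return the tree as indented lines, one node per line."""
    label, *children = node
    if is_leaf(children):
        return ["  " * depth + f"{label}: {children[0]}"]
    lines = ["  " * depth + label]
    for child in children:
        lines.extend(render(child, depth + 1))
    return lines

def leaves(node):
    """Read the words back off the tree from left to right."""
    label, *children = node
    if is_leaf(children):
        return [children[0]]
    return [word for child in children for word in leaves(child)]

print("\n".join(render(TREE)))
```

Reading the leaves from left to right recovers the original sentence, which is a quick sanity check that a tree really covers its input.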

Now consider the above rewrite rules. Since V can be replaced by either "peck" or "pecks", sentences such as "The bird peck the grains" are wrongly permitted; that is, the subject-verb agreement error is accepted as correct.
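The over-acceptance can be checked directly. The sketch below tests a sentence against the word-class patterns that the rewrite rules allow, and then applies a separate number-agreement check. The NUMBER table and the agrees function are assumptions added purely for illustration; they are not part of the grammar above.

```python
LEXICON = {
    "DET": {"a", "the"},
    "ADJ": {"beautiful", "perching"},
    "N": {"bird", "birds", "grain", "grains"},
    "V": {"peck", "pecks", "pecking"},
}
# Word-class sequences derivable from S with the rewrite rules above.
PATTERNS = {
    ("DET", "N", "V", "DET", "N"),
    ("DET", "N", "V", "DET", "ADJ", "N"),
    ("DET", "ADJ", "N", "V", "DET", "N"),
    ("DET", "ADJ", "N", "V", "DET", "ADJ", "N"),
}
# Assumption (not in the grammar): grammatical number of some words.
NUMBER = {"bird": "sg", "birds": "pl", "pecks": "sg", "peck": "pl"}

def tag(word):
    """Return the word class of a word, or None if it is not in the lexicon."""
    for word_class, words in LEXICON.items():
        if word in words:
            return word_class
    return None

def grammar_accepts(sentence):
    """True if the sentence's word classes match one of the allowed patterns."""
    tags = tuple(tag(word) for word in sentence.lower().split())
    return tags in PATTERNS

def agrees(sentence):
    """Extra check for accepted sentences: subject noun and verb share a number."""
    words = sentence.lower().split()
    verb_at = [tag(word) for word in words].index("V")
    subject = words[verb_at - 1]  # the noun directly before the verb
    return NUMBER.get(subject) == NUMBER.get(words[verb_at])

for s in ["The bird pecks the grains", "The bird peck the grains"]:
    print(s, grammar_accepts(s), agrees(s))
```

Both sentences pass the grammar, and only the added agreement check tells them apart, which is exactly the gap described above.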

Merit − It is the simplest style of grammar and is therefore widely used.

Demerits −

  1. They are not highly precise. For example, “The grains peck the bird” is syntactically correct according to the parser; even though it makes no sense, the parser accepts it as a correct sentence.

  2. To achieve high precision, multiple sets of grammar need to be prepared. Completely different sets of rules may be required for parsing singular and plural variations, passive sentences, etc., which can lead to a huge, unmanageable set of rules.

Top-Down Parser

Here, the parser starts with the S symbol and attempts to rewrite it into a sequence of terminal symbols that matches the classes of the words in the input sentence, continuing until the sequence consists entirely of terminal symbols.

These are then checked against the input sentence to see whether they match. If not, the process is started over again with a different set of rules. This is repeated until a specific rule is found that describes the structure of the sentence.
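A minimal sketch of this idea is a backtracking recursive-descent parser over the rewrite rules given earlier. It is a variant of the procedure described above: instead of rewriting S completely and comparing afterwards, it checks each word class against the input as soon as it is produced, and falls back to the next alternative rule when a match fails. All function names are illustrative.

```python
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["DET", "N"], ["DET", "ADJ", "N"]],
    "VP": [["V", "NP"]],
}
LEXICON = {
    "DET": {"a", "the"},
    "ADJ": {"beautiful", "perching"},
    "N": {"bird", "birds", "grain", "grains"},
    "V": {"peck", "pecks", "pecking"},
}

def parse(symbol, words, pos):
    """Yield (tree, next_pos) for every way the symbol can cover words[pos:next_pos]."""
    if symbol in LEXICON:
        # Word class: it must match the next input word.
        if pos < len(words) and words[pos] in LEXICON[symbol]:
            yield (symbol, words[pos]), pos + 1
        return
    # Non-terminal: try each alternative rule in turn (this is the backtracking).
    for rhs in RULES[symbol]:
        yield from parse_sequence(symbol, rhs, words, pos, [])

def parse_sequence(symbol, rhs, words, pos, children):
    """Match the remaining right-hand-side symbols one after another."""
    if not rhs:
        yield (symbol, *children), pos
        return
    for child, next_pos in parse(rhs[0], words, pos):
        yield from parse_sequence(symbol, rhs[1:], words, next_pos, children + [child])

def parse_sentence(sentence):
    """Return the first parse tree that covers the whole sentence, or None."""
    words = sentence.lower().split()
    for tree, end in parse("S", words, 0):
        if end == len(words):
            return tree
    return None

print(parse_sentence("The beautiful bird pecks the grains"))
print(parse_sentence("The school goes to boy"))
```

For “The beautiful bird pecks the grains”, the rule NP → DET N fails on “beautiful”, so the parser backs up and succeeds with NP → DET ADJ N. For “The school goes to boy”, every alternative fails and the result is None.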

Merit − It is simple to implement.

Demerits −

  1. It is inefficient, as the search process has to be repeated if an error occurs.

  2. It works slowly.