Natural Language Processing 简明教程

NLP - Word Level Analysis

在本章中，我们将理解自然语言处理中的世界级分析。

In this chapter, we will understand world level analysis in Natural Language Processing.

Regular Expressions

正则表达式 (RE) 是一种指定文本搜索字符串的语言。RE 帮助我们使用模式中保存的专门语法匹配或查找其他字符串或一组字符串。正则表达式用来在 UNIX 和 MS WORD 中以相同的方式搜索文本。我们有各种使用多个 RE 功能的搜索引擎。

A regular expression (RE) is a language for specifying text search strings. RE helps us to match or find other strings or sets of strings, using a specialized syntax held in a pattern. Regular expressions are used to search texts in UNIX as well as in MS WORD in identical way. We have various search engines using a number of RE features.

Properties of Regular Expressions

以下是 RE 的一些重要属性：

Followings are some of the important properties of RE −

American Mathematician Stephen Cole Kleene formalized the Regular Expression language.
RE is a formula in a special language, which can be used for specifying simple classes of strings, a sequence of symbols. In other words, we can say that RE is an algebraic notation for characterizing a set of strings.
Regular expression requires two things, one is the pattern that we wish to search and other is a corpus of text from which we need to search.

在数学上，正则表达式可以定义如下：

Mathematically, A Regular Expression can be defined as follows −

ε is a Regular Expression, which indicates that the language is having an empty string.
φ is a Regular Expression which denotes that it is an empty language.
If X and Y are Regular Expressions, then X, Y X.Y(Concatenation of XY) X+Y (Union of X and Y) X, Y* (Kleen Closure of X and Y)*

也是正则表达式。

are also regular expressions.

If a string is derived from above rules then that would also be a regular expression.

Examples of Regular Expressions

下表列出了几个正则表达式的例子 −

The following table shows a few examples of Regular Expressions −

Regular Expressions

Regular Set

(0 + 10*)

{0, 1, 10, 100, 1000, 10000, … }

(0*10*)

{1, 01, 10, 010, 0010, …}

(0 + ε)(1 + ε)

{ε, 0, 1, 01}

(a+b)*

It would be set of strings of a’s and b’s of any length which also includes the null string i.e. {ε, a, b, aa , ab , bb , ba, aaa…….}

(a+b)*abb

It would be set of strings of a’s and b’s ending with the string abb i.e. {abb, aabb, babb, aaabb, ababb, …………..}

(11)*

It would be set consisting of even number of 1’s which also includes an empty string i.e. {ε, 11, 1111, 111111, ……….}

(aa)*(bb)*b

It would be set of strings consisting of even number of a’s followed by odd number of b’s i.e. {b, aab, aabbb, aabbbbb, aaaab, aaaabbb, …………..}

(aa + ab + ba + bb)*

It would be string of a’s and b’s of even length that can be obtained by concatenating any combination of the strings aa, ab, ba and bb including null i.e. {aa, ab, ba, bb, aaab, aaba, …………..}

Regular Sets & Their Properties

可以将其定义为表示正则表达式的值的一组，并且具有特定属性。

It may be defined as the set that represents the value of the regular expression and consists specific properties.

Properties of regular sets

If we do the union of two regular sets then the resulting set would also be regula.
If we do the intersection of two regular sets then the resulting set would also be regular.
If we do the complement of regular sets, then the resulting set would also be regular.
If we do the difference of two regular sets, then the resulting set would also be regular.
If we do the reversal of regular sets, then the resulting set would also be regular.
If we take the closure of regular sets, then the resulting set would also be regular.
If we do the concatenation of two regular sets, then the resulting set would also be regular.

Finite State Automata

自动机一词源自希腊语单词“αὐτόματα”，意为“自动”，它是“自动机”的复数形式，可以将其定义为一种抽象的自推进计算设备，它自动执行预定的操作序列。

The term automata, derived from the Greek word "αὐτόματα" meaning "self-acting", is the plural of automaton which may be defined as an abstract self-propelled computing device that follows a predetermined sequence of operations automatically.

具有有限状态数的自动机称为有限自动机 (FA) 或有限状态自动机 (FSA)。

An automaton having a finite number of states is called a Finite Automaton (FA) or Finite State automata (FSA).

在数学上，自动机可以用 5 元组 (Q, Σ, δ, q0, F) 表示，其中 −

Mathematically, an automaton can be represented by a 5-tuple (Q, Σ, δ, q0, F), where −

Q is a finite set of states.
Σ is a finite set of symbols, called the alphabet of the automaton.
δ is the transition function
q0 is the initial state from where any input is processed (q0 ∈ Q).
F is a set of final state/states of Q (F ⊆ Q).

Relation between Finite Automata, Regular Grammars and Regular Expressions

以下几点将使我们清楚地了解有限自动机、正规语法和正规表达式之间的关系：

Following points will give us a clear view about the relationship between finite automata, regular grammars and regular expressions −

As we know that finite state automata are the theoretical foundation of computational work and regular expressions is one way of describing them.
We can say that any regular expression can be implemented as FSA and any FSA can be described with a regular expression.
On the other hand, regular expression is a way to characterize a kind of language called regular language. Hence, we can say that regular language can be described with the help of both FSA and regular expression.
Regular grammar, a formal grammar that can be right-regular or left-regular, is another way to characterize regular language.

下图显示了有限自动机、正规表达式和正规语法是描述正规语言的等效方式。

Following diagram shows that finite automata, regular expressions and regular grammars are the equivalent ways of describing regular languages.

Types of Finite State Automation (FSA)

有限状态自动化有两種類型。让我们来看看這些類型是什么。

Finite state automation is of two types. Let us see what the types are.

Deterministic Finite automation (DFA)

可以将其定义为有限自动化类型，其中，对于每个输入符号，我们可以确定机器将移动到的状态。它具有有限数量的状态，这就是机器被称为确定性有限自动机 (DFA) 的原因。

It may be defined as the type of finite automation wherein, for every input symbol we can determine the state to which the machine will move. It has a finite number of states that is why the machine is called Deterministic Finite Automaton (DFA).

在数学上，DFA 可以表示为 5 元组 (Q, Σ, δ, q0, F)，其中：

Mathematically, a DFA can be represented by a 5-tuple (Q, Σ, δ, q0, F), where −

Q is a finite set of states.
Σ is a finite set of symbols, called the alphabet of the automaton.
δ is the transition function where δ: Q × Σ → Q .
q0 is the initial state from where any input is processed (q0 ∈ Q).
F is a set of final state/states of Q (F ⊆ Q).

在图形上，DFA 可以通过称为状态图的有向图表示，其中：

Whereas graphically, a DFA can be represented by diagraphs called state diagrams where −

The states are represented by vertices.
The transitions are shown by labeled arcs.
The initial state is represented by an empty incoming arc.
The final state is represented by double circle.

Example of DFA

假设 DFA 为

Suppose a DFA be

Q = {a, b, c},
Σ = {0, 1},
q0 = {a},
F = {c},
Transition function δ is shown in the table as follows −

Current State

Next State for Input 0

Next State for Input 1

该 DFA 的图形表示如下 −

The graphical representation of this DFA would be as follows −

Non-deterministic Finite Automation (NDFA)

它可以定义为对于每个输入符号，我们不能确定机器将移动到的状态的有限自动化类型，即机器可以移动到任何状态的组合。它具有有限数量的状态，这就是机器被称为非确定性有限自动机 (NDFA) 的原因。

It may be defined as the type of finite automation where for every input symbol we cannot determine the state to which the machine will move i.e. the machine can move to any combination of the states. It has a finite number of states that is why the machine is called Non-deterministic Finite Automation (NDFA).

从数学上讲，NDFA 可以用 5 元组 (Q, Σ, δ, q0, F) 来表示，其中 −

Mathematically, NDFA can be represented by a 5-tuple (Q, Σ, δ, q0, F), where −

Q is a finite set of states.
Σ is a finite set of symbols, called the alphabet of the automaton.
δ :-is the transition function where δ: Q × Σ → 2 Q.
q0 :-is the initial state from where any input is processed (q0 ∈ Q).
F :-is a set of final state/states of Q (F ⊆ Q).

相反，在图形上（与 DFA 相同），NDFA 可以用称为状态图的有向图来表示，其中 −

Whereas graphically (same as DFA), a NDFA can be represented by diagraphs called state diagrams where −

The states are represented by vertices.
The transitions are shown by labeled arcs.
The initial state is represented by an empty incoming arc.
The final state is represented by double circle.

Example of NDFA

假设 NDFA 为

Suppose a NDFA be

Q = {a, b, c},
Σ = {0, 1},
q0 = {a},
F = {c},
Transition function δ is shown in the table as follows −

Current State

Next State for Input 0

Next State for Input 1

a, c

该 NDFA 的图形表示如下 −

The graphical representation of this NDFA would be as follows −

Morphological Parsing

形态解析这个术语与词素的解析有关。我们可以将形态解析定义为识别某个单词分解为较小的有意义单位（称为词素）的问题，从而为它产生某种语言结构。例如，我们可以把单词 foxes 分解为两个词，fox 和 -es。我们可以看到，单词 foxes 由两个词素组成，一个是 fox，另一个是 -es。

The term morphological parsing is related to the parsing of morphemes. We can define morphological parsing as the problem of recognizing that a word breaks down into smaller meaningful units called morphemes producing some sort of linguistic structure for it. For example, we can break the word foxes into two, fox and -es. We can see that the word foxes, is made up of two morphemes, one is fox and other is -es.

从另一个意义上，我们可以说形态学是 −

In other sense, we can say that morphology is the study of −

The formation of words.
The origin of the words.
Grammatical forms of the words.
Use of prefixes and suffixes in the formation of words.
How parts-of-speech (PoS) of a language are formed.

Types of Morphemes

词素是最小的带义单位，可分两类：

Morphemes, the smallest meaning-bearing units, can be divided into two types −

Stems
Word Order

Stems

它是单词的核心意义单位。我们也可称其为单词词根。例如，在单词 “foxes” 中，词干是 “fox”。

It is the core meaningful unit of a word. We can also say that it is the root of the word. For example, in the word foxes, the stem is fox.

Affixes − As the name suggests, they add some additional meaning and grammatical functions to the words. For example, in the word foxes, the affix is − es.

缀进一步可细分为以下四类：

Further, affixes can also be divided into following four types −

Word Order

单词顺序由形态句法分析决定。现在让我们了解构建形态句法分析器所需条件：

The order of the words would be decided by morphological parsing. Let us now see the requirements for building a morphological parser −

Lexicon

构建形态句法分析仪的最首要条件是词典，其中包括词干和缀的列表及其基本信息。例如，以下信息：词干是名词词干还是动词词干等。

The very first requirement for building a morphological parser is lexicon, which includes the list of stems and affixes along with the basic information about them. For example, the information like whether the stem is Noun stem or Verb stem, etc.

Morphotactics

它基本上是词素排序模型。换句话说，该模型解释单词中哪类词素可跟在其他类词素后面。例如，词法事实表明，英语复数词素总是跟在名词后面，而不是在名词前面。

It is basically the model of morpheme ordering. In other sense, the model explaining which classes of morphemes can follow other classes of morphemes inside a word. For example, the morphotactic fact is that the English plural morpheme always follows the noun rather than preceding it.

Orthographic rules

这些拼写规则用于对单词中发生的改变建模。例如，将单词中的 y 转换成 ie 的规则，比如 city+s = cities，而不是 citys。

These spelling rules are used to model the changes occurring in a word. For example, the rule of converting y to ie in word like city+s = cities not citys.