Natural Language Processing: A Concise Tutorial

Natural Language Processing - Inception

In this chapter, we discuss the inception of natural language in Natural Language Processing. To begin with, let us understand what natural language grammar is.

Natural Language Grammar

For linguists, language is a system of arbitrary vocal signs. We may say that language is creative, governed by rules, and both innate and universal. At the same time, it is distinctly human. The nature of language is understood differently by different people, and there are many misconceptions about it. That is why it is important to understand the meaning of the ambiguous term ‘grammar’. In linguistics, grammar may be defined as the rules or principles by which a language works. Broadly, we can divide grammar into two categories −

Descriptive Grammar

The set of rules that linguists and grammarians formulate to describe the speaker’s grammar is called descriptive grammar.

Prescriptive Grammar

It is a very different sense of grammar, which attempts to maintain a standard of correctness in the language. This category has little to do with how the language actually works.

Components of Language

The study of language is divided into interrelated components, which are conventional as well as arbitrary divisions of linguistic investigation. These components are explained below −

Phonology

The first component of language is phonology, the study of the speech sounds of a particular language. The word can be traced to Greek, where ‘phone’ means sound or voice. Phonetics, a closely related field, studies the speech sounds of human language from the perspective of their production, perception or physical properties. The IPA (International Phonetic Alphabet) is a tool for representing human speech sounds in a regular way. In the IPA, every written symbol represents one and only one speech sound and vice versa.

Phonemes

A phoneme may be defined as one of the units of sound that distinguish one word from another in a language. In linguistics, phonemes are written between slashes. For example, the phoneme /k/ occurs in words such as kit and skit.
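The idea that a phoneme distinguishes one word from another can be illustrated with minimal pairs, that is, words whose phoneme sequences differ in exactly one position. The sketch below is only an illustration: the transcriptions are hand-written and simplified, and `is_minimal_pair` is a hypothetical helper, not part of any library.

```python
def is_minimal_pair(a, b):
    """Return True if two phoneme sequences differ in exactly one position."""
    if len(a) != len(b):
        return False
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return differences == 1


# Hand-written, simplified transcriptions (assumptions, not dictionary data).
kit = ["k", "ɪ", "t"]
bit = ["b", "ɪ", "t"]
skit = ["s", "k", "ɪ", "t"]

print(is_minimal_pair(kit, bit))   # True: only /k/ vs /b/ differs
print(is_minimal_pair(kit, skit))  # False: the sequences differ in length
```

Here /k/ and /b/ are shown to be separate phonemes of English because swapping one for the other turns kit into a different word.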

Morphology

Morphology is the second component of language. It is the study of the structure and classification of the words in a particular language. The word comes from Greek, where ‘morphe’ means ‘form’. Morphology considers the principles by which words are formed in a language, in other words, how sounds combine into meaningful units such as prefixes, suffixes and roots. It also considers how words can be grouped into parts of speech.
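To make the notion of prefixes, suffixes and roots concrete, here is a deliberately naive sketch that strips at most one known prefix and one known suffix from a word. The affix lists, the `segment` function and the `min_root` threshold are all assumptions made for this illustration; real morphological analysers also handle spelling changes and ambiguity.

```python
# Tiny, hand-picked affix lists (assumptions for illustration only).
PREFIXES = ("un", "re", "dis")
SUFFIXES = ("ness", "ing", "ed", "s")


def segment(word, min_root=3):
    """Split a word into (prefix, root, suffix).

    At most one known prefix and one known suffix are stripped, and only
    if at least `min_root` characters remain. Empty strings mean 'no affix'.
    """
    prefix = suffix = ""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_root:
            prefix, word = p, word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_root:
            suffix, word = s, word[:-len(s)]
            break
    return prefix, word, suffix


print(segment("unkindness"))  # ('un', 'kind', 'ness')
print(segment("replayed"))    # ('re', 'play', 'ed')
print(segment("dogs"))        # ('', 'dog', 's')
```

Pure string matching has clear limits: it would wrongly split under into un + der, which is why practical systems rely on a lexicon of known roots.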

Lexeme

In linguistics, the abstract unit of morphological analysis that corresponds to the set of forms taken by a single word is called a lexeme. The way a lexeme is used in a sentence is determined by its grammatical category. A lexeme can be a single word or a multiword expression. For example, talk is a single-word lexeme with grammatical variants such as talks, talked and talking. A multiword lexeme is made up of more than one orthographic word. For example, speak up and pull through are multiword lexemes.
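The relation between a lexeme and its word forms can be sketched as a simple lookup table. The mini-lexicon below is hypothetical and hand-written; `lexeme_of` is an illustrative helper, not a real lemmatizer.

```python
# Hypothetical mini-lexicon: each lexeme (citation form) maps to its word forms.
LEXICON = {
    "talk": {"talk", "talks", "talked", "talking"},
    "speak up": {"speak up", "speaks up", "spoke up", "spoken up", "speaking up"},
}

# Invert the lexicon so that any word form can be looked up directly.
FORM_TO_LEXEME = {
    form: lexeme for lexeme, forms in LEXICON.items() for form in forms
}


def lexeme_of(form):
    """Return the lexeme a word form belongs to, or None if it is unknown."""
    return FORM_TO_LEXEME.get(form.lower())


print(lexeme_of("talked"))    # talk
print(lexeme_of("spoke up"))  # speak up
print(lexeme_of("walk"))      # None (not in this toy lexicon)
```

Note that the multiword lexeme speak up is stored as a single entry even though it spans two orthographic words.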

Syntax

Syntax is the third component of language. It is the study of the order and arrangement of words into larger units. The word can be traced to Greek, where suntassein means ‘to put in order’. Syntax studies the types of sentences and the structure of sentences, clauses and phrases.

Semantics

Semantics is the fourth component of language. It is the study of how meaning is conveyed. The meaning can be related to the outside world or to the grammar of the sentence. The word can be traced to Greek, where semainein means ‘to signify’, ‘show’, ‘signal’.

Pragmatics

Pragmatics is the fifth component of language. It is the study of the functions of language and its use in context. The word can be traced to Greek, where ‘pragma’ means ‘deed’, ‘affair’.

Grammatical Categories

A grammatical category may be defined as a class of units or features within the grammar of a language. These units are the building blocks of language and share a common set of characteristics. Grammatical categories are also called grammatical features.

The inventory of grammatical categories is described below −

Number

Number is the simplest grammatical category. It has two terms: singular and plural. Singular is the concept of ‘one’, whereas plural is the concept of ‘more than one’. For example, dog/dogs, this/these.

Gender

In English, grammatical gender is expressed by variation in the personal pronouns, and only in the 3rd person singular: he, she, it. The first and second person forms (I, we and you) and the 3rd person plural form they are not marked for gender; they are of either common or neuter gender.

Person

Another simple grammatical category is person. The following three terms are recognized −

  1. 1st person − The person who is speaking is recognized as 1st person.

  2. 2nd person − The person who is the hearer or the person spoken to is recognized as 2nd person.

  3. 3rd person − The person or thing about whom we are speaking is recognized as 3rd person.

Case

Case is one of the most difficult grammatical categories. It may be defined as an indication of the function of a noun phrase (NP), or of the relationship of a noun phrase to a verb or to the other noun phrases in the sentence. The following three cases are expressed in personal and interrogative pronouns −

  1. Nominative case − It is the function of subject. For example, I, we, you, he, she, it, they and who are nominative.

  2. Genitive case − It is the function of possessor. For example, my/mine, our/ours, his, her/hers, its, their/theirs, whose are genitive.

  3. Objective case − It is the function of object. For example, me, us, you, him, her, them, whom are objective.
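The three cases listed above can be written down as a small lookup table. The sketch below copies the pronoun lists from the text as given (lowercased for matching); `cases_of` is a hypothetical helper. It also makes visible that some forms, such as you and her, belong to more than one case.

```python
# Pronoun forms per case, taken from the lists above (lowercased).
CASE_FORMS = {
    "nominative": {"i", "we", "you", "he", "she", "it", "they", "who"},
    "genitive": {"my", "mine", "our", "ours", "his", "her", "hers",
                 "its", "their", "theirs", "whose"},
    "objective": {"me", "us", "you", "him", "her", "them", "whom"},
}


def cases_of(pronoun):
    """Return the sorted list of cases in which a pronoun form can occur."""
    word = pronoun.lower()
    return sorted(case for case, forms in CASE_FORMS.items() if word in forms)


print(cases_of("I"))     # ['nominative']
print(cases_of("her"))   # ['genitive', 'objective']
print(cases_of("you"))   # ['nominative', 'objective']
print(cases_of("whom"))  # ['objective']
```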

Degree

This grammatical category is related to adjectives and adverbs. It has the following three terms −

  1. Positive degree − It expresses a quality. For example, big, fast, beautiful are positive degrees.

  2. Comparative degree − It expresses greater degree or intensity of the quality in one of two items. For example, bigger, faster, more beautiful are comparative degrees.

  3. Superlative degree − It expresses greatest degree or intensity of the quality in one of three or more items. For example, biggest, fastest, most beautiful are superlative degrees.

Definiteness and Indefiniteness

Both of these concepts are simple. Definiteness marks a referent that is known, familiar or identifiable to the speaker or hearer. Indefiniteness, on the other hand, marks a referent that is not known or is unfamiliar. The concept can be seen in the co-occurrence of an article with a noun −

  1. definite article − the

  2. indefinite article − a/an

Tense

This grammatical category is related to verbs and can be defined as the linguistic indication of the time of an action. Tense establishes a relation, because it indicates the time of an event with respect to the moment of speaking. Broadly, it is of the following three types −

  1. Present tense − Represents the occurrence of an action in the present moment. For example, Ram works hard.

  2. Past tense − Represents the occurrence of an action before the present moment. For example, it rained.

  3. Future tense − Represents the occurrence of an action after the present moment. For example, it will rain.

Aspect

This grammatical category may be defined as the view taken of an event. It can be of the following types −

  1. Perfective aspect − The event is viewed as whole and complete. For example, the English simple past, as in yesterday I met my friend, is perfective in aspect because it views the event as complete and whole.

  2. Imperfective aspect − The event is viewed as ongoing and incomplete. For example, the English present progressive, as in I am working on this problem, is imperfective in aspect because it views the event as incomplete and ongoing.

Mood

This grammatical category is somewhat difficult to define, but it can be stated simply as an indication of the speaker’s attitude towards what he or she is talking about. It is a grammatical feature of verbs and is distinct from grammatical tense and grammatical aspect. Examples of moods are indicative, interrogative, imperative, injunctive, subjunctive, potential and optative. Gerunds and participles are sometimes listed alongside the moods, although strictly they are non-finite verb forms rather than moods.

Agreement

Agreement is also called concord. It happens when a word changes form depending on the other words to which it relates. In other words, it involves making the value of some grammatical category agree between different words or parts of speech. The following are the kinds of agreement, based on other grammatical categories −

  1. Agreement based on Person − It is the agreement between subject and the verb. For example, we always use “I am” and “He is” but never “He am” and “I is”.

  2. Agreement based on Number − This agreement is between the subject and the verb. In this case, there are specific verb forms for first person singular, first person plural and so on. For example, 1st person singular: I really am, 1st person plural: We really are, 3rd person singular: The boy sings, 3rd person plural: The boys sing.

  3. Agreement based on Gender − In English, there is agreement in gender between pronouns and antecedents. For example, He reached his destination. The ship reached her destination.

  4. Agreement based on Case − This kind of agreement is not a significant feature of English. For example, who came first − he or his sister?
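Agreement in person and number can be sketched as a feature check: the subject supplies person and number values, and the verb form must match them. The tables below cover only the present tense of be and a few subject pronouns; the names `BE_FORMS`, `PRONOUN_FEATURES` and `agrees` are assumptions made for this illustration.

```python
# Present-tense forms of "be", keyed by (person, number).
BE_FORMS = {
    (1, "singular"): "am",
    (2, "singular"): "are",
    (3, "singular"): "is",
    (1, "plural"): "are",
    (2, "plural"): "are",
    (3, "plural"): "are",
}

# Person and number features of a few subject pronouns.
PRONOUN_FEATURES = {
    "i": (1, "singular"),
    "you": (2, "singular"),  # singular and plural "you" both take "are"
    "he": (3, "singular"),
    "she": (3, "singular"),
    "it": (3, "singular"),
    "we": (1, "plural"),
    "they": (3, "plural"),
}


def agrees(subject, verb):
    """Return True if the form of 'be' agrees with the subject pronoun."""
    features = PRONOUN_FEATURES[subject.lower()]
    return BE_FORMS[features] == verb.lower()


print(agrees("I", "am"))   # True
print(agrees("He", "is"))  # True
print(agrees("He", "am"))  # False
print(agrees("I", "is"))   # False
```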

Spoken Language Syntax

Written and spoken English grammar share many features, but they also differ in a number of respects. The following features distinguish spoken English grammar from written English grammar −

Disfluencies and Repair

This striking feature distinguishes spoken English grammar from written English grammar. The individual phenomena are known as disfluencies, and collectively they are known as the phenomenon of repair. Disfluencies include the use of the following −

  1. Filler words − Sometimes, in the middle of a sentence, we use filler words. They are called fillers or filled pauses. Examples of such words are uh and um.

  2. Reparandum and repair − The segment of words that the speaker goes back to correct in the middle of a sentence is called the reparandum. The segment that replaces it is called the repair. Consider the following example to understand this −

Does ABC airlines offer any one-way flights uh one-way fares for 5000 rupees?

In the above sentence, one-way flights is the reparandum and one-way fares is the repair.
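The reparandum and repair in this example can be located with a very simple heuristic, sketched below. It assumes that a filler (uh or um) marks the interruption point and that the repair begins by repeating a word from the reparandum. Both assumptions hold for the sentence above but not for disfluent speech in general. The function name `remove_repairs` and the `window` parameter are hypothetical.

```python
FILLERS = {"uh", "um"}


def remove_repairs(utterance, window=4):
    """Drop fillers and, where detectable, the reparandum before them.

    Heuristic: if the word after a filler also occurs among the last
    `window` words before it, everything from that earlier occurrence
    up to the filler is treated as the reparandum and removed.
    """
    tokens = utterance.split()
    cleaned = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.lower() in FILLERS:
            # The word following the filler is the start of the repair.
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            start = max(0, len(cleaned) - window)
            # Search backwards through the most recent kept words.
            for j in range(len(cleaned) - 1, start - 1, -1):
                if nxt is not None and cleaned[j].lower() == nxt.lower():
                    del cleaned[j:]  # remove the reparandum
                    break
            i += 1  # skip the filler itself
            continue
        cleaned.append(token)
        i += 1
    return " ".join(cleaned)


sentence = ("Does ABC airlines offer any one-way flights "
            "uh one-way fares for 5000 rupees?")
print(remove_repairs(sentence))
# Does ABC airlines offer any one-way fares for 5000 rupees?
```

When no repeated word is found, as in I um think so, only the filler is removed.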

Restarts

A restart occurs after the filled pause. In the above sentence, the speaker starts asking about one-way flights, stops, corrects himself after the filler uh, and then restarts by asking about one-way fares.

Word Fragments

Sometimes we utter only fragments of words before completing them. For example, w- wha- what is the time? Here w- and wha- are word fragments.
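In transcripts, word fragments are often easy to spot if a marking convention is in place. The sketch below assumes one such convention, namely that each fragment is written as a separate token ending in a hyphen; this is an assumption for the illustration, and `split_fragments` is a hypothetical helper.

```python
def split_fragments(utterance):
    """Separate word fragments (tokens ending in '-') from complete words."""
    fragments, words = [], []
    for token in utterance.split():
        if token.endswith("-") and len(token) > 1:
            fragments.append(token)
        else:
            words.append(token)
    return fragments, " ".join(words)


fragments, cleaned = split_fragments("w- wha- what is the time?")
print(fragments)  # ['w-', 'wha-']
print(cleaned)    # what is the time?
```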