Natural Language Processing: A Concise Tutorial
Natural Language Processing - Introduction
Language is a method of communication that lets us speak, read, and write. We think, make decisions, and form plans in natural language; more precisely, in words. The big question in this AI era is whether we can communicate with computers in a similar manner. In other words, can human beings communicate with computers in their own natural language? Developing NLP applications is challenging because computers need structured data, whereas human language is unstructured and often ambiguous.
In this sense, Natural Language Processing (NLP) is the subfield of computer science, and especially of artificial intelligence (AI), that is concerned with enabling computers to understand and process human language. Technically, the main task of NLP is to program computers to analyze and process large amounts of natural language data.
History of NLP
We divide the history of NLP into four phases, each with distinctive concerns and styles.
First Phase (Machine Translation Phase) - Late 1940s to late 1960s
The work done in this phase focused mainly on machine translation (MT). It was a period of enthusiasm and optimism.
The main events of the first phase were as follows −
- Research on NLP started in the early 1950s, after Booth and Richens' investigation and Weaver's 1949 memorandum on machine translation.
- In 1954, the Georgetown-IBM experiment demonstrated a limited form of automatic translation from Russian to English.
- In the same year, publication of the journal MT (Machine Translation) began.
- The first international conference on machine translation was held in 1952, and the second in 1956.
- In 1961, the work presented at the Teddington International Conference on Machine Translation of Languages and Applied Language Analysis was the high point of this phase.
Second Phase (AI Influenced Phase) – Late 1960s to late 1970s
In this phase, the work mainly concerned world knowledge and its role in the construction and manipulation of meaning representations. That is why this phase is also called the AI-flavored phase.
The phase included the following −
- In early 1961, work began on the problems of addressing and constructing data or knowledge bases. This work was influenced by AI.
- In the same year, the BASEBALL question-answering system was also developed. Its input was restricted, and the language processing involved was simple.
- A much more advanced system was described in Minsky (1968). Compared with BASEBALL, this system recognized and provided for the need for inference over the knowledge base when interpreting and responding to language input.
Third Phase (Grammatico-logical Phase) – Late 1970s to late 1980s
This phase can be described as the grammatico-logical phase. Because practical system building had failed in the previous phase, researchers moved toward the use of logic for knowledge representation and reasoning in AI.
The third phase included the following −
- Toward the end of the decade, the grammatico-logical approach gave us powerful general-purpose sentence processors such as SRI's Core Language Engine, as well as Discourse Representation Theory, which offered a means of tackling more extended discourse.
- This phase also produced practical resources and tools such as parsers (e.g., the Alvey Natural Language Tools), along with more operational and commercial systems, e.g., for database query.
- Work on the lexicon in the 1980s also pointed in the direction of the grammatico-logical approach.
Fourth Phase (Lexical & Corpus Phase) – The 1990s
We can describe this as the lexical and corpus phase. It was marked by a lexicalized approach to grammar that appeared in the late 1980s and became increasingly influential. The decade saw a revolution in natural language processing with the introduction of machine learning algorithms for language processing.
Study of Human Languages
Language is a crucial component of human life and the most fundamental aspect of our behavior. We experience it mainly in two forms: written and spoken. In its written form, it is a way to pass knowledge from one generation to the next. In its spoken form, it is the primary medium through which human beings coordinate with each other in their day-to-day behavior. Language is studied in various academic disciplines. Each discipline comes with its own set of problems and its own set of tools for addressing them.
The following table summarizes this −
| Discipline | Problems | Tools |
| --- | --- | --- |
| Linguists | How can phrases and sentences be formed with words? What constrains the possible meanings of a sentence? | Intuitions about well-formedness and meaning. Mathematical models of structure, for example model-theoretic semantics and formal language theory. |
| Psycholinguists | How do human beings identify the structure of sentences? How are the meanings of words identified? When does understanding take place? | Experimental techniques, mainly for measuring human performance. Statistical analysis of observations. |
| Philosophers | How do words and sentences acquire meaning? How do words identify objects? What is meaning? | Natural language argumentation using intuition. Mathematical models such as logic and model theory. |
| Computational Linguists | How can we identify the structure of a sentence? How can knowledge and reasoning be modeled? How can we use language to accomplish specific tasks? | Algorithms. Data structures. Formal models of representation and reasoning. AI techniques such as search and representation methods. |
Ambiguity and Uncertainty in Language
Ambiguity, as the term is generally used in natural language processing, is the capability of being understood in more than one way. Natural language is highly ambiguous. NLP distinguishes the following types of ambiguity −
Lexical Ambiguity
The ambiguity of a single word is called lexical ambiguity. For example, the word "silver" can be treated as a noun, an adjective, or a verb.
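Lexical ambiguity can be illustrated with a small lookup table. The following is a minimal sketch, not a real tagger: `LEXICON`, `possible_tags`, and `is_lexically_ambiguous` are hypothetical names, and the word-to-category entries are hand-written assumptions for this example only.

```python
# Hypothetical toy lexicon: each word maps to the parts of speech it can take.
# The entries are illustrative assumptions, not drawn from a real dictionary.
LEXICON = {
    "silver": {"noun", "adjective", "verb"},
    "the": {"determiner"},
    "ring": {"noun", "verb"},
}


def possible_tags(word):
    """Return the set of parts of speech listed for a word (empty if unknown)."""
    return LEXICON.get(word.lower(), set())


def is_lexically_ambiguous(word):
    """A word is lexically ambiguous here if the lexicon lists more than one tag."""
    return len(possible_tags(word)) > 1


if __name__ == "__main__":
    for word in ("silver", "the"):
        print(word, sorted(possible_tags(word)), is_lexically_ambiguous(word))
```

A real system would have to pick one of the listed categories from the surrounding context; this sketch only detects that a choice exists.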
Syntactic Ambiguity
This kind of ambiguity occurs when a sentence can be parsed in more than one way. Consider the sentence "The man saw the girl with the telescope". It is ambiguous whether the man saw a girl who was carrying a telescope or saw her through his telescope.
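One way to make syntactic ambiguity concrete is to count how many parse trees a grammar assigns to a sentence. The sketch below uses a CKY-style chart over a tiny grammar in Chomsky normal form. The grammar, the lexicon, and the function name `count_parses` are assumptions written for this example; they cover only the telescope sentence.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form: (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # "the girl with the telescope"
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # "saw the girl" + "with the telescope"
    ("PP", "P", "NP"),
]

# Hypothetical lexicon: word -> preterminal categories.
LEXICAL_RULES = {
    "the": ["Det"],
    "man": ["N"],
    "girl": ["N"],
    "telescope": ["N"],
    "saw": ["V"],
    "with": ["P"],
}


def count_parses(words, start="S"):
    """Count the distinct parse trees for `words` with a CKY-style chart."""
    n = len(words)
    # chart[(i, j, symbol)] = number of trees with root `symbol` over words[i:j]
    chart = defaultdict(int)
    for i, word in enumerate(words):
        for symbol in LEXICAL_RULES.get(word, []):
            chart[(i, i + 1, symbol)] += 1
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    chart[(i, j, parent)] += chart[(i, k, left)] * chart[(k, j, right)]
    return chart[(0, n, start)]


if __name__ == "__main__":
    sentence = "the man saw the girl with the telescope".split()
    print(count_parses(sentence))  # prints 2
```

The two trees correspond to the two readings: the prepositional phrase attaches either to the noun phrase "the girl" or to the verb phrase "saw the girl".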
Semantic Ambiguity
This kind of ambiguity occurs when the meaning of the words themselves can be misinterpreted. In other words, semantic ambiguity arises when a sentence contains an ambiguous word or phrase. For example, the sentence "The car hit the pole while it was moving" is semantically ambiguous because it can be interpreted as either "The car, while moving, hit the pole" or "The car hit the pole while the pole was moving".
Anaphoric Ambiguity
This kind of ambiguity arises from the use of anaphoric expressions in discourse. For example: "The horse ran up the hill. It was very steep. It soon got tired." Here, the anaphoric reference of "it" in the two sentences causes ambiguity, because each "it" could refer to either the horse or the hill.
Pragmatic Ambiguity
This kind of ambiguity refers to situations where the context of a phrase gives it multiple interpretations. Put simply, pragmatic ambiguity arises when a statement is not specific. For example, the sentence "I like you too" can mean "I like you (just as you like me)" or "I like you (just as someone else does)".
NLP Phases
The following diagram shows the phases or logical steps in natural language processing −
Morphological Processing
This is the first phase of NLP. Its purpose is to break chunks of language input into sets of tokens corresponding to paragraphs, sentences, and words. For example, a word like "uneasy" can be broken into the two sub-word tokens "un" and "easy".
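The tokenization described above can be sketched with regular expressions from the Python standard library. This is a simplified illustration, not a production tokenizer: the function names, the `PREFIXES` tuple, and the small vocabulary passed to `split_prefix` are all assumptions made for this example.

```python
import re

# Hypothetical list of prefixes the toy morphological splitter knows about.
PREFIXES = ("un", "re", "dis")


def split_sentences(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize_words(sentence):
    """Split a sentence into word, number, and punctuation tokens."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+|[^\w\s]", sentence)


def split_prefix(word, vocabulary):
    """Split off a known prefix only when the remainder is a known word."""
    for prefix in PREFIXES:
        stem = word[len(prefix):]
        if word.startswith(prefix) and stem in vocabulary:
            return [prefix, stem]
    return [word]


if __name__ == "__main__":
    text = "He felt uneasy. She left!"
    for sentence in split_sentences(text):
        print(tokenize_words(sentence))
    print(split_prefix("uneasy", {"easy"}))  # prints ['un', 'easy']
```

Checking the remainder against a vocabulary keeps the splitter from breaking words such as "under", where "un" is not a prefix.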
Syntax Analysis
This is the second phase of NLP. Its purpose is twofold: to check whether a sentence is well formed, and to break it up into a structure that shows the syntactic relationships between its words. For example, a scrambled word sequence such as "Boy the go the to school" would be rejected by a syntax analyzer or parser. Note that a sentence such as "The school goes to the boy" is grammatically well formed; its oddness is a matter of meaning and is caught, if at all, by later phases.
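A well-formedness check of this kind can be sketched as a small recursive-descent parser. The grammar below (S → NP VP, NP → Det N, VP → V PP | V NP | V, PP → P NP), the `LEXICON` table, and all function names are assumptions chosen for illustration; a real parser would need a far larger grammar.

```python
# Hypothetical lexicon mapping each known word to a single category.
LEXICON = {
    "the": "Det",
    "boy": "N",
    "school": "N",
    "goes": "V",
    "go": "V",
    "to": "P",
}


def tag(word):
    """Look up the category of a word, or None if the word is unknown."""
    return LEXICON.get(word)


def parse_np(tokens, i):
    """NP -> Det N. Return (tree, next_index) or None."""
    if i + 1 < len(tokens) and tag(tokens[i]) == "Det" and tag(tokens[i + 1]) == "N":
        return ("NP", tokens[i], tokens[i + 1]), i + 2
    return None


def parse_pp(tokens, i):
    """PP -> P NP. Return (tree, next_index) or None."""
    if i < len(tokens) and tag(tokens[i]) == "P":
        result = parse_np(tokens, i + 1)
        if result:
            np_tree, j = result
            return ("PP", tokens[i], np_tree), j
    return None


def parse_vp(tokens, i):
    """VP -> V PP | V NP | V. Return (tree, next_index) or None."""
    if i < len(tokens) and tag(tokens[i]) == "V":
        for parse_complement in (parse_pp, parse_np):
            result = parse_complement(tokens, i + 1)
            if result:
                complement, j = result
                return ("VP", tokens[i], complement), j
        return ("VP", tokens[i]), i + 1
    return None


def parse_sentence(sentence):
    """S -> NP VP. Return a parse tree, or None if the sentence is rejected."""
    tokens = sentence.lower().split()
    result = parse_np(tokens, 0)
    if not result:
        return None
    np_tree, i = result
    result = parse_vp(tokens, i)
    if not result:
        return None
    vp_tree, j = result
    return ("S", np_tree, vp_tree) if j == len(tokens) else None


if __name__ == "__main__":
    print(parse_sentence("The boy goes to the school"))
    print(parse_sentence("Boy the go the to school"))  # prints None
```

The returned tuple doubles as the structure showing the syntactic relationships: the first element of each tuple is the phrase label and the rest are its parts.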
Semantic Analysis
This is the third phase of NLP. Its purpose is to draw the exact meaning, or dictionary meaning, from the text. The text is checked for meaningfulness. For example, a semantic analyzer would reject a phrase like "hot ice-cream".
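A meaningfulness check like the one above can be approximated by comparing what a modifier asserts with what is known about the noun. The sketch below is a deliberately tiny illustration: the feature tables and the function `is_meaningful` are hypothetical, and real semantic analyzers rely on much richer lexical resources.

```python
# Hypothetical dictionary knowledge: properties assumed for each noun.
NOUN_FEATURES = {
    "ice-cream": {"temperature": "cold"},
    "coffee": {"temperature": "hot"},
    "table": {},
}

# Hypothetical adjective meanings: the (property, value) each adjective asserts.
ADJECTIVE_CONSTRAINTS = {
    "hot": ("temperature", "hot"),
    "cold": ("temperature", "cold"),
}


def is_meaningful(adjective, noun):
    """Reject a phrase when the adjective contradicts a known noun property."""
    constraint = ADJECTIVE_CONSTRAINTS.get(adjective)
    if constraint is None:
        return True  # nothing is known about the adjective, so nothing conflicts
    attribute, value = constraint
    features = NOUN_FEATURES.get(noun, {})
    # A noun with no recorded value for the attribute is compatible by default.
    return features.get(attribute, value) == value


if __name__ == "__main__":
    print(is_meaningful("hot", "ice-cream"))  # prints False
    print(is_meaningful("hot", "coffee"))     # prints True
```

Treating missing information as compatible is a design choice of this sketch; it rejects only phrases that contradict something the tables state explicitly.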
Pragmatic Analysis
This is the fourth phase of NLP. Pragmatic analysis fits the actual objects and events that exist in a given context to the object references obtained during the previous phase (semantic analysis). For example, the sentence "Put the banana in the basket on the shelf" has two semantic interpretations, and the pragmatic analyzer chooses between them.
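How context can select between the two readings is sketched below. The representation of context as a dictionary from object to current location, the function `choose_interpretation`, and the decision rule itself are assumptions made for illustration only; they are not a general method for pragmatic analysis.

```python
# The two readings of "Put the banana in the basket on the shelf".
READING_BANANA_IN_BASKET = "move the banana that is in the basket onto the shelf"
READING_BASKET_ON_SHELF = "move the banana into the basket that is on the shelf"


def choose_interpretation(context):
    """Pick a reading from a hypothetical context: object -> current location."""
    # If the banana is already in the basket, "in the basket" describes the banana.
    if context.get("banana") == "basket":
        return READING_BANANA_IN_BASKET
    # If the basket is on the shelf, "on the shelf" describes the basket.
    if context.get("basket") == "shelf":
        return READING_BASKET_ON_SHELF
    # The context does not settle the question.
    return None


if __name__ == "__main__":
    print(choose_interpretation({"banana": "basket", "basket": "floor"}))
    print(choose_interpretation({"banana": "table", "basket": "shelf"}))
```

When the context supports neither reading, the sketch returns `None` rather than guessing, which mirrors the situation where a system would need to ask for clarification.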