NLP - Information Retrieval
Information retrieval (IR) may be defined as a software program that deals with the organization, storage, retrieval and evaluation of information from document repositories, particularly textual information. The system assists users in finding the information they require, but it does not explicitly return the answers to their questions. Instead, it informs the user of the existence and location of documents that might contain the required information. The documents that satisfy the user’s requirement are called relevant documents. A perfect IR system will retrieve only relevant documents.
With the help of the following diagram, we can understand the process of information retrieval (IR) −
It is clear from the above diagram that a user who needs information must formulate a request in the form of a natural language query. The IR system then responds by retrieving the relevant output, in the form of documents, about the required information.
Classical Problem in Information Retrieval (IR) System
The main goal of IR research is to develop a model for retrieving information from repositories of documents. Here, we are going to discuss a classical problem related to IR systems, named the ad-hoc retrieval problem.
In ad-hoc retrieval, the user must enter a query in natural language that describes the required information. Then the IR system will return the required documents related to the desired information. For example, suppose we search for something on the Internet: the system returns some pages that are exactly relevant to our requirement, but there can be some non-relevant pages too. This is due to the ad-hoc retrieval problem.
Aspects of Ad-hoc Retrieval
Following are some aspects of ad-hoc retrieval that are addressed in IR research −
- How can users, with the help of relevance feedback, improve the original formulation of a query?
- How to implement database merging, i.e., how can results from different text databases be merged into one result set?
- How to handle partly corrupted data? Which models are appropriate for this?
Information Retrieval (IR) Model
Mathematically, models are used in many scientific areas with the objective of understanding some phenomenon in the real world. A model of information retrieval predicts and explains what a user will find relevant for a given query. An IR model is basically a pattern that defines the above-mentioned aspects of the retrieval procedure and consists of the following −
- A model for documents.
- A model for queries.
- A matching function that compares queries to documents.
Mathematically, a retrieval model consists of −
D − Representation for documents.
Q − Representation for queries.
F − The modeling framework for D and Q, along with the relationship between them.
R(q, di) − A similarity function which orders the documents with respect to the query. It is also called ranking.
Types of Information Retrieval (IR) Model
An information retrieval (IR) model can be classified into the following three models −
Classical IR Model
It is the simplest and easiest to implement IR model. This model is based on mathematical knowledge that is easily recognized and understood as well. Boolean, Vector and Probabilistic are the three classical IR models.
Non-Classical IR Model
It is completely opposite to the classical IR model. Such IR models are based on principles other than similarity, probability and Boolean operations. The information logic model, situation theory model and interaction model are examples of non-classical IR models.
Alternative IR Model
It is an enhancement of the classical IR model that makes use of some specific techniques from other fields. The cluster model, fuzzy model and latent semantic indexing (LSI) model are examples of alternative IR models.
Design Features of Information Retrieval (IR) Systems
Let us now learn about the design features of IR systems −
Inverted Index
The primary data structure of most IR systems is the inverted index. We can define an inverted index as a data structure that lists, for every word, all documents that contain it and the frequency of its occurrences in each document. This makes it easy to search for ‘hits’ of a query word.
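As a minimal sketch, the following Python snippet builds such an index; the sample documents and whitespace tokenization are simplifying assumptions, not part of any particular IR system −

from collections import defaultdict

def build_inverted_index(documents):
    # Map each word to {doc_id: number of occurrences in that document}.
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index

docs = {1: "information retrieval of textual information",
        2: "storage and retrieval systems"}
index = build_inverted_index(docs)
print(index["retrieval"])     # {1: 1, 2: 1}
print(index["information"])   # {1: 2}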
Stop Word Elimination
Stop words are high frequency words that are deemed unlikely to be useful for searching. They have less semantic weight. All such words are kept in a list called a stop list. For example, articles like “a”, “an”, “the” and prepositions like “in”, “of”, “for”, “at” etc. are examples of stop words. The size of the inverted index can be significantly reduced by a stop list. As per Zipf’s law, a stop list covering a few dozen words reduces the size of the inverted index by almost half. On the other hand, sometimes the elimination of a stop word may cause the elimination of a term that is useful for searching. For example, if we eliminate the word “A” from “Vitamin A”, then it would have no significance.
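A minimal sketch of stop word elimination in Python; the stop list below is a tiny illustrative sample, not a standard list −

STOP_LIST = {"a", "an", "the", "in", "of", "for", "at"}

def remove_stop_words(tokens):
    # Keep only the tokens that are not in the stop list.
    return [t for t in tokens if t.lower() not in STOP_LIST]

print(remove_stop_words(["the", "storage", "of", "information"]))
# ['storage', 'information']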
Stemming
Stemming, the simplified form of morphological analysis, is the heuristic process of extracting the base form of words by chopping off the ends of words. For example, the words laughing, laughs, laughed would be stemmed to the root word laugh.
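A quick sketch using the Porter stemmer from the NLTK library (assuming the nltk package is installed) reproduces this example −

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["laughing", "laughs", "laughed"]:
    print(word, "->", stemmer.stem(word))
# laughing -> laugh, laughs -> laugh, laughed -> laugh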
In our subsequent sections, we will discuss some important and useful IR models.
The Boolean Model
It is the oldest information retrieval (IR) model. The model is based on set theory and Boolean algebra, where documents are sets of terms and queries are Boolean expressions on terms. The Boolean model can be defined as −
- D − A set of words, i.e., the indexing terms present in a document. Here, each term is either present (1) or absent (0).
- Q − A Boolean expression, where terms are the index terms and operators are logical products (AND), logical sums (OR) and logical differences (NOT).
- F − Boolean algebra over sets of terms as well as over sets of documents.

If we talk about relevance feedback, then in the Boolean IR model the relevance prediction can be defined as follows −

- R − A document is predicted as relevant to the query expression if and only if it satisfies the query expression, as in −
((text ∨ information) ∧ retrieval ∧ ¬theory)
We can explain this model by considering a query term as an unambiguous definition of a set of documents.
For example, the query term “economic” defines the set of documents that are indexed with the term “economic”.
Now, what would be the result after combining terms with the Boolean AND operator? It will define a document set that is smaller than or equal to the document set of any of the single terms. For example, the query with terms “social” and “economic” will produce the set of documents that are indexed with both terms. In other words, it is the intersection of both sets.
Now, what would be the result after combining terms with the Boolean OR operator? It will define a document set that is bigger than or equal to the document set of any of the single terms. For example, the query with terms “social” or “economic” will produce the set of documents that are indexed with either the term “social” or “economic”. In other words, it is the union of both sets.
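A minimal Python sketch of these set operations, using small hypothetical posting sets for each term −

# Each term maps to the set of document ids indexed with it (made-up data).
postings = {
    "social":   {1, 2, 4},
    "economic": {2, 3, 4},
}

# "social" AND "economic" -> intersection of the two sets
print(postings["social"] & postings["economic"])   # {2, 4}

# "social" OR "economic" -> union of the two sets
print(postings["social"] | postings["economic"])   # {1, 2, 3, 4}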
Advantages of the Boolean Model
The advantages of the Boolean model are as follows −
- The simplest model, which is based on sets.
- Easy to understand and implement.
- It only retrieves exact matches.
- It gives the user a sense of control over the system.
Disadvantages of the Boolean Model
The disadvantages of the Boolean model are as follows −
- The model’s similarity function is Boolean. Hence, there would be no partial matches. This can be annoying for the users.
- In this model, the Boolean operator usage has much more influence than a critical word.
- The query language is expressive, but it is complicated too.
- No ranking for retrieved documents.
Vector Space Model
Due to the above disadvantages of the Boolean model, Gerard Salton and his colleagues suggested a model, which is based on Luhn’s similarity criterion. The similarity criterion formulated by Luhn states, “the more two representations agreed in given elements and their distribution, the higher would be the probability of their representing similar information.”
Consider the following important points to understand more about the Vector Space Model −
- The index representations (documents) and the queries are considered as vectors embedded in a high dimensional Euclidean space.
- The similarity measure of a document vector to a query vector is usually the cosine of the angle between them.
Cosine Similarity Measure Formula
Cosine is a normalized dot product, which can be calculated with the help of the following formula −

Score \lgroup \vec{d}, \vec{q} \rgroup = \frac{\sum_{k=1}^{m} d_{k} \cdot q_{k}}{\sqrt{\sum_{k=1}^{m} \lgroup d_{k} \rgroup^{2}} \cdot \sqrt{\sum_{k=1}^{m} \lgroup q_{k} \rgroup^{2}}}

Score \lgroup \vec{d}, \vec{q} \rgroup = 1 \: when \: d = q

Score \lgroup \vec{d}, \vec{q} \rgroup = 0 \: when \: d \: and \: q \: share \: no \: items
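A direct Python translation of this formula; the two example vectors are hypothetical −

import math

def cosine_score(d, q):
    # Normalized dot product of the document and query vectors.
    dot = sum(dk * qk for dk, qk in zip(d, q))
    norm_d = math.sqrt(sum(dk * dk for dk in d))
    norm_q = math.sqrt(sum(qk * qk for qk in q))
    return dot / (norm_d * norm_q)

print(cosine_score([1, 1], [1, 1]))  # 1.0 -> d = q
print(cosine_score([1, 0], [0, 1]))  # 0.0 -> d and q share no items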
Vector Space Representation with Query and Document
The query and documents are represented in a two-dimensional vector space. The terms are car and insurance. There is one query and three documents in the vector space.
The top ranked document in response to the terms car and insurance will be the document d2, because the angle between q and d2 is the smallest. The reason behind this is that both the concepts car and insurance are salient in d2 and hence have high weights. On the other hand, d1 and d3 also mention both terms, but in each case one of them is not a centrally important term in the document.
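To make this concrete, here is a small self-contained sketch with made-up (car, insurance) weights chosen to mirror the description above; the numbers are illustrative only, not from the original figure −

import math

def cosine(d, q):
    dot = sum(x * y for x, y in zip(d, q))
    return dot / (math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q)))

q = [0.7, 0.7]                 # query: car and insurance equally important
docs = {"d1": [0.9, 0.1],      # car salient, insurance marginal
        "d2": [0.6, 0.7],      # both terms salient
        "d3": [0.1, 0.8]}      # insurance salient, car marginal

for name, d in docs.items():
    print(name, round(cosine(d, q), 3))
# d2 scores highest because its angle to q is the smallest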
Term Weighting
Term weighting means the weights on the terms in vector space. The higher the weight of a term, the greater would be the impact of that term on the cosine. More weight should be assigned to the more important terms in the model. Now the question that arises here is how we can model this.
One way to do this is to use the raw count of a word in a document as its term weight. However, do you think this would be an effective method?
Another method, which is more effective, is to use term frequency (tfij), document frequency (dfi) and collection frequency (cfi).
Term Frequency (tfij)
It may be defined as the number of occurrences of wi in dj. The information that is captured by term frequency is how salient a word is within the given document. In other words, we can say that the higher the term frequency, the better that word describes the content of the document.
Document Frequency (dfi)
It may be defined as the total number of documents in the collection in which wi occurs. It is an indicator of informativeness. Semantically focused words will occur several times in a document, unlike semantically unfocused words.
Forms of Document Frequency Weighting
Let us now learn about the different forms of document frequency weighting. The forms are described below −
Term Frequency Factor
This is also classified as the term frequency factor, which means that if a term t appears often in a document, then a query containing t should retrieve that document. We can combine a word’s term frequency (tfij) and document frequency (dfi) into a single weight as follows −
weight \left( i,j \right) = \begin{cases}(1+log(tf_{ij}))\:log\frac{N}{df_{i}} & if\:tf_{ij}\:\geq 1\\0 & if\:tf_{ij}\:=0\end{cases}
Here N is the total number of documents.
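A sketch of this weighting in Python; natural logarithms are used here as an assumption, since the formula does not fix the base (the base only rescales the weights) −

import math

def tf_df_weight(tf_ij, df_i, N):
    # (1 + log(tf)) * log(N / df) when tf >= 1, else 0, as in the formula above.
    if tf_ij == 0:
        return 0.0
    return (1 + math.log(tf_ij)) * math.log(N / df_i)

# e.g. a term occurring 3 times in a document and present in 10 of 1000 documents
print(round(tf_df_weight(3, 10, 1000), 3))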
Inverse Document Frequency (idf)
This is another form of document frequency weighting, often called idf weighting or inverse document frequency weighting. The important point of idf weighting is that the term’s scarcity across the collection is a measure of its importance, and importance is inversely proportional to frequency of occurrence.
Mathematically,
idf_{t} = log\left(1+\frac{N}{n_{t}}\right)
idf_{t} = log\left(\frac{N-n_{t}}{n_{t}}\right)
Here,
N = total number of documents in the collection

nt = number of documents containing term t
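A sketch of the first idf variant above; natural log is again an assumption −

import math

def idf(N, n_t):
    # idf_t = log(1 + N / n_t); rare terms get large idf, common terms small.
    return math.log(1 + N / n_t)

print(idf(1000, 10))    # rare term -> high idf
print(idf(1000, 900))   # common term -> low idf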
User Query Improvement
The primary goal of any information retrieval system must be accuracy − to produce relevant documents as per the user’s requirement. However, the question that arises here is how we can improve the output by improving the user’s query formulation style. Certainly, the output of any IR system is dependent on the user’s query, and a well-formulated query will produce more accurate results. The user can improve his/her query with the help of relevance feedback, an important aspect of any IR model.
Relevance Feedback
Relevance feedback takes the output that is initially returned for a given query. This initial output can be used to gather user information and to learn whether that output is relevant for performing a new query or not. The feedback can be classified as follows −
Explicit Feedback
It may be defined as the feedback that is obtained from the assessors of relevance. These assessors will also indicate the relevance of a document retrieved from the query. In order to improve query retrieval performance, the relevance feedback information needs to be interpolated with the original query.
Assessors or other users of the system may indicate the relevance explicitly by using the following relevance systems −
- Binary relevance system − This relevance feedback system indicates that a document is either relevant (1) or irrelevant (0) for a given query.
- Graded relevance system − The graded relevance feedback system indicates the relevance of a document, for a given query, on the basis of grading by using numbers, letters or descriptions. The description can be like “not relevant”, “somewhat relevant”, “very relevant” or “relevant”.
Implicit Feedback
It is the feedback that is inferred from user behavior. The behavior includes the duration of time a user spends viewing a document, which documents are selected for viewing and which are not, page browsing and scrolling actions, etc. One of the best examples of implicit feedback is dwell time, which is a measure of how much time a user spends viewing the page linked to in a search result.
Pseudo Feedback
It is also called blind feedback. It provides a method for automatic local analysis. The manual part of relevance feedback is automated with the help of pseudo relevance feedback, so that the user gets improved retrieval performance without extended interaction. The main advantage of this feedback system is that it does not require assessors, unlike the explicit relevance feedback system.
Consider the following steps to implement this feedback −
- Step 1 − First, the result returned by the initial query must be taken as the relevant result. The range of relevant results must be within the top 10-50 results.
- Step 2 − Now, select the top 20-30 terms from these documents using, for instance, the term frequency (tf) - inverse document frequency (idf) weight.
- Step 3 − Add these terms to the query and match the returned documents. Then return the most relevant documents.
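A minimal sketch of these three steps, assuming two hypothetical helpers that a real IR system would provide: run_query (returns a ranked list of documents for a query string) and top_terms_by_tf_idf (selects expansion terms from a set of documents) −

def pseudo_relevance_feedback(query, run_query, top_terms_by_tf_idf,
                              k_docs=10, k_terms=20):
    # Step 1: treat the top-ranked results of the initial query as relevant.
    assumed_relevant = run_query(query)[:k_docs]
    # Step 2: pick the best terms from those documents by tf-idf weight.
    expansion_terms = top_terms_by_tf_idf(assumed_relevant, k_terms)
    # Step 3: expand the query with those terms and search again.
    expanded_query = query + " " + " ".join(expansion_terms)
    return run_query(expanded_query)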