Data Mining: A Concise Tutorial
Data Mining - Mining Text Data
Text databases consist of huge collections of documents. They gather this information from several sources, such as news articles, books, digital libraries, e-mail messages, and web pages. As the amount of information increases, text databases are growing rapidly. In many text databases, the data is semi-structured.
For example, a document may contain a few structured fields, such as title, author, and publishing_date. Along with this structured data, the document also contains unstructured text components, such as the abstract and contents. Without knowing what the documents might contain, it is difficult to formulate effective queries for analyzing and extracting useful information from the data. Users need tools to compare documents and rank them by importance and relevance. Text mining has therefore become a popular and essential theme in data mining.
Information Retrieval
Information retrieval deals with retrieving information from a large number of text-based documents. Some features of database systems are usually not present in information retrieval systems, because the two handle different kinds of data. Examples of information retrieval systems include:
- Online library catalogue systems
- Online document management systems
- Web search systems
Note: The main problem in an information retrieval system is to locate the relevant documents in a document collection based on a user's query. Such a query consists of a few keywords that describe an information need.
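The keyword-query idea in the note can be illustrated with a minimal sketch. It assumes a deliberately naive matching rule (a document is returned only if it contains every query keyword) and a toy collection invented for this example; the names `tokenize`, `keyword_search`, and `documents` are hypothetical, and real retrieval systems use far richer indexing and ranking.

```python
def tokenize(text):
    """Lowercase the text and split on whitespace (a naive tokenizer, assumed for this sketch)."""
    return set(text.lower().split())


def keyword_search(query, documents):
    """Return the ids of documents that contain every keyword in the query."""
    keywords = tokenize(query)
    return [doc_id for doc_id, text in documents.items()
            if keywords <= tokenize(text)]


# Hypothetical toy collection, invented for illustration only.
documents = {
    "d1": "data mining extracts patterns from large data sets",
    "d2": "text mining applies data mining to documents",
    "d3": "web search systems rank pages for a query",
}

print(keyword_search("data mining", documents))  # ['d1', 'd2']
```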
In such search problems, the user takes the initiative to pull relevant information out of a collection. This is appropriate when the user has an ad hoc information need, i.e., a short-term need. If the user has a long-term information need, however, the retrieval system can also take the initiative and push any newly arrived information item to the user.
This kind of access to information is called information filtering, and the corresponding systems are known as filtering systems or recommender systems.
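The push model can be sketched roughly as follows. This assumes each user's long-term need is stored as a set of profile keywords and that an item is pushed when it mentions at least one of them. Both the matching rule and the names `matches_profile`, `push_new_items`, `profiles`, and `new_items` are assumptions made for illustration, not a description of any particular filtering system.

```python
def matches_profile(profile_keywords, text):
    """True if the item mentions at least one keyword from the long-term profile."""
    words = set(text.lower().split())
    return bool(words & profile_keywords)


def push_new_items(profiles, new_items):
    """Map each user to the newly arrived items that match that user's profile."""
    return {user: [item for item in new_items if matches_profile(keywords, item)]
            for user, keywords in profiles.items()}


# Hypothetical long-term profiles and newly arrived items.
profiles = {"alice": {"mining", "retrieval"}, "bob": {"sports"}}
new_items = ["New survey on text mining", "Local sports results", "Weather update"]

pushed = push_new_items(profiles, new_items)
print(pushed["alice"])  # ['New survey on text mining']
print(pushed["bob"])    # ['Local sports results']
```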
Basic Measures for Text Retrieval
When a system retrieves a number of documents based on a user's input, we need to check how accurate it is. Let the set of documents relevant to a query be denoted {Relevant} and the set of retrieved documents {Retrieved}. The set of documents that are both relevant and retrieved is then {Relevant} ∩ {Retrieved}. In a Venn diagram, this is the region where the two sets overlap.
There are three fundamental measures for assessing the quality of text retrieval:
- Precision
- Recall
- F-score
Precision
Precision is the percentage of retrieved documents that are in fact relevant to the query. It is defined as:
Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
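As a small worked example, precision can be computed directly from two Python sets. The document ids below are invented, and returning 0.0 when nothing was retrieved is a convention assumed for this sketch rather than part of the definition.

```python
def precision(relevant, retrieved):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0  # assumed convention for an empty result set
    return len(relevant & retrieved) / len(retrieved)


# Hypothetical document ids.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}

# Two of the three retrieved documents (d2, d3) are relevant.
print(f"{precision(relevant, retrieved):.3f}")  # 0.667
```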
Recall
Recall is the percentage of documents relevant to the query that were in fact retrieved. It is defined as:
Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
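Recall can be sketched the same way, this time dividing by the size of the relevant set. The document ids are invented, and returning 0.0 when no document is relevant is an assumed convention.

```python
def recall(relevant, retrieved):
    """Fraction of relevant documents that were retrieved."""
    if not relevant:
        return 0.0  # assumed convention when no document is relevant
    return len(relevant & retrieved) / len(relevant)


# Hypothetical document ids.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}

# Two of the four relevant documents (d2, d3) were retrieved.
print(f"{recall(relevant, retrieved):.3f}")  # 0.500
```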
F-score
The F-score is a commonly used trade-off measure. An information retrieval system often needs to trade recall for precision, or vice versa. The F-score is defined as the harmonic mean of recall and precision:
F-score = (recall x precision) / ((recall + precision) / 2) = 2 x recall x precision / (recall + precision)
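A minimal sketch of the harmonic mean, written in the equivalent form 2 x recall x precision / (recall + precision). The input values are hypothetical, and returning 0.0 when both measures are zero is an assumed convention to avoid division by zero.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both measures are zero
    return 2 * precision * recall / (precision + recall)


# Hypothetical values: 2 of 3 retrieved documents are relevant,
# and 2 of 4 relevant documents were retrieved.
p = 2 / 3
r = 2 / 4

print(f"{f_score(p, r):.3f}")  # 0.571
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a system cannot reach a high F-score by maximizing only precision or only recall.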