Data Mining: A Concise Tutorial

Data Mining - Mining the World Wide Web

The World Wide Web contains a huge amount of information, which makes it a rich source for data mining.

Challenges in Web Mining

The web poses great challenges for resource and knowledge discovery, based on the following observations −

  1. The web is too huge − The web is very large and growing rapidly, which makes it seem too large for data warehousing and data mining.

  2. Complexity of web pages − Web pages have no unifying structure and are far more complex than traditional text documents. The web can be viewed as a huge digital library, but its enormous number of documents is not arranged in any particular sorted order.

  3. The web is a dynamic information source − Information on the web is updated rapidly. Data such as news, stock markets, weather, sports, and shopping are updated regularly.

  4. Diversity of user communities − The user community on the web is expanding rapidly. These users have different backgrounds, interests, and usage purposes. More than 100 million workstations are connected to the Internet, and the number is still increasing rapidly.

  5. Relevancy of information − A particular person is generally interested in only a small portion of the web. The rest of the web contains information that is not relevant to that user and may swamp the desired results.

Mining Web page layout structure

The basic structure of a web page is based on the Document Object Model (DOM). The DOM structure is a tree-like structure in which each HTML tag in the page corresponds to a node in the DOM tree. We can segment a web page by using the predefined tags in HTML. Because HTML syntax is flexible, many web pages do not follow the W3C specifications, and not following them may cause errors in the DOM tree structure.
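The tag-to-node correspondence, and the effect of markup that breaks the specification, can be illustrated with a minimal sketch. It assumes Python's standard `html.parser` module as a stand-in for a browser's parser; the `Node`, `TreeBuilder`, `build_tree`, and `outline` names and both sample pages are hypothetical and exist only for this illustration. Real browsers apply much more elaborate error-recovery rules.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (a partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One node of a simplified DOM-like tree (hypothetical helper class)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a tag tree; a rough stand-in for a real browser DOM parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # later content nests under this tag

    def handle_endtag(self, tag):
        # Walk up to the nearest open ancestor with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent
        # An unmatched closing tag is silently ignored.

    def handle_data(self, data):
        self.current.text += data.strip()


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as a list of indented tag names."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


well_formed = "<html><body><div><p>News</p></div><div><p>Ads</p></div></body></html>"
# Malformed on purpose: the first <div> is never closed.
malformed = "<html><body><div><p>News</p><div><p>Ads</p></div></body></html>"

print("\n".join(outline(build_tree(well_formed))))
print("\n".join(outline(build_tree(malformed))))
```

In the well-formed page the two `div` nodes are siblings under `body`. In the malformed page the second `div` ends up nested inside the first, so the tree no longer reflects the intended layout.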

The DOM structure was initially introduced for presentation in the browser, not for describing the semantic structure of a web page. As a result, the DOM structure cannot correctly identify the semantic relationships between the different parts of a web page.

Vision-based page segmentation (VIPS)

  1. The purpose of VIPS is to extract the semantic structure of a web page based on its visual presentation.

  2. Such a semantic structure corresponds to a tree structure. In this tree, each node corresponds to a block.

  3. A value called the Degree of Coherence is assigned to each node. It indicates how coherent the content of the block is, based on visual perception.

  4. The VIPS algorithm first extracts all the suitable blocks from the HTML DOM tree. After that, it finds the separators between these blocks.

  5. The separators are the horizontal or vertical lines in a web page that visually cross no blocks.

  6. The semantic structure of the web page is constructed on the basis of these blocks.
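The separator idea from the list above can be sketched in a few lines. This is a simplified illustration rather than the VIPS algorithm itself: it assumes the blocks have already been extracted and are given as pixel bounding boxes, and it covers only separator detection, not block extraction or the Degree of Coherence. The `Block` type, the function names, and the sample layout are all hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    """Bounding box of a visual block in page pixels (hypothetical layout data)."""
    name: str
    top: int
    bottom: int
    left: int
    right: int


def gaps(intervals: List[Tuple[int, int]], start: int, end: int) -> List[Tuple[int, int]]:
    """Return the sub-ranges of [start, end] that no interval covers."""
    result = []
    cursor = start  # everything before the cursor is already accounted for
    for low, high in sorted(intervals):
        if low > cursor:
            result.append((cursor, low))
        cursor = max(cursor, high)
    if cursor < end:
        result.append((cursor, end))
    return result


def horizontal_separators(blocks: List[Block], top: int, bottom: int) -> List[Tuple[int, int]]:
    """Horizontal strips (y ranges) that cross no block."""
    return gaps([(b.top, b.bottom) for b in blocks], top, bottom)


def vertical_separators(blocks: List[Block], left: int, right: int) -> List[Tuple[int, int]]:
    """Vertical strips (x ranges) that cross no block."""
    return gaps([(b.left, b.right) for b in blocks], left, right)


page = [
    Block("header", top=0, bottom=80, left=0, right=800),
    Block("nav", top=100, bottom=500, left=0, right=200),
    Block("article", top=100, bottom=450, left=220, right=800),
    Block("footer", top=540, bottom=600, left=0, right=800),
]

# Whole page: horizontal separators split it into header, middle, and footer.
print(horizontal_separators(page, top=0, bottom=600))

# Middle region only: a vertical separator splits nav from article.
middle = [b for b in page if b.name in ("nav", "article")]
print(vertical_separators(middle, left=0, right=800))
```

Projecting the blocks onto one axis and collecting the uncovered ranges is one straightforward way to find lines that cross no block. Applying the same step again inside each resulting region yields a tree of nested blocks, in the spirit of the tree structure described above.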

The following figure shows the procedure of the VIPS algorithm −

[Figure: procedure of the VIPS algorithm]