Beautiful Soup: A Concise Tutorial
Beautiful Soup - Overview
A vast amount of unstructured data, most of it on the web, is freely available today. Some of it is easy to read and some is not. However the data is presented, web scraping is a very useful tool for transforming unstructured data into structured data that is easier to read and analyze. In other words, web scraping is a way to collect, organize and analyze this enormous amount of data. So let us first understand what web scraping is.
Introduction to Beautiful Soup
Beautiful Soup is a Python library named after the Lewis Carroll poem of the same name in "Alice's Adventures in Wonderland". As the name suggests, it helps make sense of messy web data: it parses the markup, repairs bad HTML, and presents the result to us as an easily traversable tree structure.
In short, Beautiful Soup is a Python package that allows us to pull data out of HTML and XML documents.
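As a minimal sketch of what "pulling data out" can look like, the snippet below parses a small HTML string and reads three values from it. It assumes the `beautifulsoup4` package is installed and uses Python's built-in `html.parser` backend; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# A small HTML string used purely for illustration.
html = (
    "<html><head><title>TutorialsPoint</title></head>"
    "<body><h1>Tutorialspoint Online Library</h1>"
    "<p><b>It's all Free</b></p></body></html>"
)

# Parse with the parser from the standard library (an assumption; other
# backends such as lxml can be used if they are installed).
soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string        # text inside <title>
heading = soup.h1.get_text()          # text inside the first <h1>
bold_text = soup.find("b").get_text() # text inside the first <b>

print(page_title)
print(heading)
print(bold_text)
```

Each lookup returns the first matching element, which is sufficient here because every tag appears only once.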
HTML Tree Structure
Before we look into the functionality provided by Beautiful Soup, let us first understand the HTML tree structure.
The root element of the document tree is html. Every other element has a parent and may have children and siblings, all determined by its position in the tree structure. To move among HTML elements, attributes and text, you have to move among the nodes of this tree.
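Moving between nodes can be sketched with Beautiful Soup's navigation attributes. The example assumes `beautifulsoup4` is installed and uses a small illustrative document; the variable names are not part of any API.

```python
from bs4 import BeautifulSoup

# Illustrative document; the markup has no whitespace between tags.
html = (
    "<html><head><title>TutorialsPoint</title></head>"
    "<body><h1>Tutorialspoint Online Library</h1>"
    "<p><b>It's all Free</b></p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

h1 = soup.h1

# Parent: the element that directly contains <h1>.
parent_name = h1.parent.name

# Sibling: the next element at the same level as <h1>.
sibling_name = h1.find_next_sibling().name

# Children: the elements directly inside <body>.
child_names = [tag.name for tag in soup.body.find_all(True, recursive=False)]

# The root <html> element sits directly under the BeautifulSoup object.
root_is_top_level = soup.html.parent is soup

print(parent_name, sibling_name, child_names, root_is_top_level)
```

`find_next_sibling()` and `find_all(True, recursive=False)` are used here instead of `.next_sibling` and `.children` because they skip whitespace text nodes, which keeps the result the same if the markup is reformatted with line breaks.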
Let us suppose a simple web page is written as the following HTML document −
<html>
<head>
<title>TutorialsPoint</title>
</head>
<body>
<h1>Tutorialspoint Online Library</h1>
<p><b>It's all Free</b></p>
</body>
</html>
In other words, the HTML document above has the following tree structure: html is the root with two children, head and body; head contains title, while body contains h1 and p, with b nested inside p.
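The tree for this document can also be printed programmatically. The sketch below assumes `beautifulsoup4` is installed; `outline` is a hypothetical helper written for this illustration, not a Beautiful Soup function.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = """<html>
<head>
<title>TutorialsPoint</title>
</head>
<body>
<h1>Tutorialspoint Online Library</h1>
<p><b>It's all Free</b></p>
</body>
</html>"""


def outline(node, depth=0):
    """Return indented lines, one per element or non-empty text node."""
    lines = []
    for child in node.children:
        if isinstance(child, Tag):
            # An element: record its name, then descend into it.
            lines.append("  " * depth + child.name)
            lines.extend(outline(child, depth + 1))
        else:
            # A text node: keep it only if it is more than whitespace.
            text = child.strip()
            if text:
                lines.append("  " * depth + '"' + text + '"')
    return lines


soup = BeautifulSoup(html_doc, "html.parser")
tree_lines = outline(soup)
print("\n".join(tree_lines))
```

Running it prints one line per node, indented by depth:

```
html
  head
    title
      "TutorialsPoint"
  body
    h1
      "Tutorialspoint Online Library"
    p
      b
        "It's all Free"
```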