Beautiful Soup Tutorial
Beautiful Soup - Overview
In today's world, we have tons of unstructured data/information (mostly web data) available freely. Sometimes the freely available data is easy to read, and sometimes not. No matter how your data is available, web scraping is a very useful tool to transform unstructured data into structured data that is easier to read and analyze. In other words, web scraping is a way to collect, organize and analyze this enormous amount of data. So let us first understand what web scraping is.
Introduction to Beautiful Soup
Beautiful Soup is a Python library named after a Lewis Carroll poem of the same name in "Alice's Adventures in Wonderland". As the name suggests, Beautiful Soup parses the unwanted data, and helps to organize and format messy web data by fixing bad HTML and presenting it to us in easily-traversable XML structures.
In short, Beautiful Soup is a Python package which allows us to pull data out of HTML and XML documents.
HTML Tree Structure
Before we look into the functionality provided by Beautiful Soup, let us first understand the HTML tree structure.
The root element in the document tree is html. Every element in the tree can have parents, children and siblings, as determined by its position in the tree structure. To move among HTML elements, attributes and text, you have to move among the nodes of the tree structure.
Let us suppose a webpage translates to the following HTML document −
<html>
<head>
<title>TutorialsPoint</title>
</head>
<body>
<h1>Tutorialspoint Online Library</h1>
<p><b>It's all Free</b></p>
</body>
</html>
This simply means that, for the above HTML document, we have an HTML tree structure as follows (shown here as a simple text sketch) −
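html
├── head
│   └── title ("TutorialsPoint")
└── body
    ├── h1 ("Tutorialspoint Online Library")
    └── p
        └── b ("It's all Free")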
Beautiful Soup - Web Scraping
Scraping is simply a process of extracting (through various means), copying and screening data.
When we scrape or extract data or feeds from the web (for example from web pages or websites), it is termed web scraping.
So, web scraping (also known as web data extraction or web harvesting) is the extraction of data from the web. In short, web scraping provides a way for developers to collect and analyze data from the internet.
Why Web Scraping?
Web scraping provides one of the great tools to automate most of the things a human does while browsing. Web scraping is used in enterprises in a variety of ways −
Data for Research
Smart analysts (such as researchers or journalists) use web scrapers instead of manually collecting and cleaning data from websites.
Products, prices & popularity comparison
Currently there are several services that use web scrapers to collect data from numerous online sites and use it to compare products' popularity and prices.
SEO Monitoring
There are numerous SEO tools such as Ahrefs, Seobility, SEMrush, etc., which are used for competitive analysis and for pulling data from your client’s websites.
Search engines
There are some big IT companies whose business solely depends on web scraping.
Sales and Marketing
The data gathered through web scraping can be used by marketers to analyze different niches and competitors, or by sales specialists to sell content marketing or social media promotion services.
Why Python for Web Scraping?
Python is one of the most popular languages for web scraping as it can handle most of the web crawling related tasks very easily.
Below are some of the reasons to choose Python for web scraping −
Ease of Use
Most developers agree that Python is very easy to code. We don't have to use curly braces "{ }" or semicolons ";" anywhere, which makes the code more readable and easier to work with while developing web scrapers.
Huge Library Support
Python provides a huge set of libraries for different requirements, so it is appropriate for web scraping as well as for data visualization, machine learning, etc.
Easily Explicable Syntax
Python is a very readable programming language, as Python syntax is easy to understand. Python is very expressive, and code indentation helps users differentiate different blocks or scopes in the code.
Beautiful Soup - Installation
Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
The BeautifulSoup package is not a part of Python's standard library, hence it must be installed. Before installing the latest version, let us create a virtual environment, as per Python's recommended method.
A virtual environment allows us to create an isolated working copy of python for a specific project without affecting the outside setup.
We shall use the venv module in Python's standard library to create the virtual environment. pip is included by default in Python 3.4 and later.
Use the following command to create a virtual environment on Windows −
C:\Users\user>python -m venv myenv
On Ubuntu Linux, update the APT repository and install venv, if required, before creating the virtual environment −
mvl@GNVBGL3:~ $ sudo apt update && sudo apt upgrade -y
mvl@GNVBGL3:~ $ sudo apt install python3-venv
Then use the following command to create a virtual environment −
mvl@GNVBGL3:~ $ python3 -m venv myenv
You need to activate the virtual environment. On Windows, use the command −
C:\Users\user>cd myenv
C:\Users\user\myenv>scripts\activate
(myenv) C:\Users\user\myenv>
On Ubuntu Linux, use the following command to activate the virtual environment −
mvl@GNVBGL3:~$ cd myenv
mvl@GNVBGL3:~/myenv$ source bin/activate
(myenv) mvl@GNVBGL3:~/myenv$
The name of the virtual environment appears in parentheses. Now that it is activated, we can install BeautifulSoup in it.
(myenv) mvl@GNVBGL3:~/myenv$ pip3 install beautifulsoup4
Collecting beautifulsoup4
Downloading beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
143.0/143.0 KB 325.2 kB/s eta 0:00:00
Collecting soupsieve>1.2
Downloading soupsieve-2.4.1-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.12.2 soupsieve-2.4.1
Note that the latest version of Beautifulsoup4 is 4.12.2 and requires Python 3.8 or later.
If you don’t have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py.
(myenv) mvl@GNVBGL3:~/myenv$ python setup.py install
To check if Beautifulsoup is properly installed, enter the following commands in the Python terminal −
>>> import bs4
>>> bs4.__version__
'4.12.2'
If the installation hasn't been successful, you will get a ModuleNotFoundError.
You will also need to install the requests library. It is an HTTP library for Python.
pip3 install requests
Installing a Parser
By default, Beautiful Soup supports the HTML parser included in Python's standard library; however, it also supports many external third-party Python parsers, like the lxml parser or the html5lib parser.
To install the lxml or html5lib parser, use the commands:
pip3 install lxml
pip3 install html5lib
These parsers have their advantages and disadvantages as shown below −
Parser: Python's html.parser
Usage − BeautifulSoup(markup, "html.parser")
Advantages
- Batteries included
- Decent speed
- Lenient (as of Python 3.2)
Disadvantages
- Not as fast as lxml, less lenient than html5lib.
Parser: lxml's HTML parser
Usage − BeautifulSoup(markup, "lxml")
Advantages
- Very fast
- Lenient
Disadvantages
- External C dependency.
Beautiful Soup - Souping the Page
It is time to test our Beautiful Soup package on an HTML page (we take the web page https://www.tutorialspoint.com/index.htm; you can choose any other web page you want) and extract some information from it.
In the code below, we try to extract the title from the webpage −
Example
from bs4 import BeautifulSoup
import requests
url = "https://www.tutorialspoint.com/index.htm"
req = requests.get(url)
soup = BeautifulSoup(req.content, "html.parser")
print(soup.title)
Output
<title>Online Courses and eBooks Library</title>
One common task is to extract all the URLs within a webpage. For that we just need to add the following lines of code −
for link in soup.find_all('a'):
   print(link.get('href'))
Output
Shown below is the partial output of the above loop −
https://www.tutorialspoint.com/index.htm
https://www.tutorialspoint.com/codingground.htm
https://www.tutorialspoint.com/about/about_careers.htm
https://www.tutorialspoint.com/whiteboard.htm
https://www.tutorialspoint.com/online_dev_tools.htm
https://www.tutorialspoint.com/business/index.asp
https://www.tutorialspoint.com/market/teach_with_us.jsp
https://www.facebook.com/tutorialspointindia
https://www.instagram.com/tutorialspoint_/
https://twitter.com/tutorialspoint
https://www.youtube.com/channel/UCVLbzhxVTiTLiVKeGV7WEBg
https://www.tutorialspoint.com/categories/development
https://www.tutorialspoint.com/categories/it_and_software
https://www.tutorialspoint.com/categories/data_science_and_ai_ml
https://www.tutorialspoint.com/categories/cyber_security
https://www.tutorialspoint.com/categories/marketing
https://www.tutorialspoint.com/categories/office_productivity
https://www.tutorialspoint.com/categories/business
https://www.tutorialspoint.com/categories/lifestyle
https://www.tutorialspoint.com/latest/prime-packs
https://www.tutorialspoint.com/market/index.asp
https://www.tutorialspoint.com/latest/ebooks
…
…
To parse a web page stored locally in the current working directory, obtain the file object pointing to the HTML file, and use it as an argument to the BeautifulSoup() constructor.
Example
from bs4 import BeautifulSoup
with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')
print(soup)
Output
<html>
<head>
<title>Hello World</title>
</head>
<body>
<h1 style="text-align:center;">Hello World</h1>
</body>
</html>
You can also use a string containing HTML script as the constructor's argument, as follows −
from bs4 import BeautifulSoup
html = '''
<html>
<head>
<title>Hello World</title>
</head>
<body>
<h1 style="text-align:center;">Hello World</h1>
</body>
</html>
'''
soup = BeautifulSoup(html, 'html.parser')
print(soup)
Beautiful Soup uses the best available parser to parse the document. It will use an HTML parser unless specified otherwise.
Beautiful Soup - Kinds of objects
When we pass an HTML document or string to the BeautifulSoup constructor, it basically converts a complex HTML page into different Python objects. Below we are going to discuss the four major kinds of objects defined in the bs4 package −
- Tag
- NavigableString
- BeautifulSoup
- Comment
Tag Object
An HTML tag is used to define various types of content. A Tag object in BeautifulSoup corresponds to an HTML or XML tag in the actual page or document.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
tag = soup.html
print (type(tag))
Output
<class 'bs4.element.Tag'>
Tags contain a lot of attributes and methods; two important features of a tag are its name and its attributes.
Name (tag.name)
Every tag has a name, accessible through the '.name' suffix. tag.name returns the type of tag it is.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
tag = soup.html
print (tag.name)
Output
html
However, if we change the tag name, the change will be reflected in the HTML markup generated by BeautifulSoup.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
tag = soup.html
tag.name = "strong"
print (tag)
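Output
<strong><body><b class="boldest">TutorialsPoint</b></body></strong>

Note that the lxml parser wraps the parsed fragment in html and body tags, so after renaming the html tag, strong becomes the outermost tag here.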
Attributes (tag.attrs)
A tag object can have any number of attributes. In the above example, the tag <b class="boldest"> has an attribute 'class' whose value is "boldest". Anything given inside a tag other than its name is basically an attribute and must contain a value. A dictionary of attributes and their values is returned by "attrs". You can also access individual attributes through their keys.
In the example below, the string argument for the BeautifulSoup() constructor contains an HTML input tag. The attributes of the input tag are returned by "attrs".
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
tag = soup.input
print (tag.attrs)
Output
{'type': 'text', 'name': 'name', 'value': 'Raju'}
We can do all kinds of modifications to our tag's attributes (add/remove/modify), using dictionary operators or methods.
In the following example, the value attribute is updated. The updated HTML string shows the change.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
tag = soup.input
print (tag.attrs)
tag['value']='Ravi'
print (soup)
Output
<html><body><input name="name" type="text" value="Ravi"/></body></html>
Next, we add a new id attribute, and delete the value attribute.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
tag = soup.input
tag['id']='nm'
del tag['value']
print (soup)
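Output
<html><body><input id="nm" name="name" type="text"/></body></html>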
Multi-valued attributes
Some HTML5 attributes can have multiple values. The most commonly used is the class attribute, which can have multiple CSS values. Others include rel, rev, headers, accesskey and accept-charset. Multi-valued attributes in Beautiful Soup are shown as a list.
Example
from bs4 import BeautifulSoup
css_soup = BeautifulSoup('<p class="body"></p>', 'lxml')
print ("css_soup.p['class']:", css_soup.p['class'])
css_soup = BeautifulSoup('<p class="body bold"></p>', 'lxml')
print ("css_soup.p['class']:", css_soup.p['class'])
Output
css_soup.p['class']: ['body']
css_soup.p['class']: ['body', 'bold']
However, if an attribute contains more than one value but is not a multi-valued attribute in any version of the HTML standard, Beautiful Soup leaves the attribute alone −
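For instance, the id attribute is not multi-valued in any version of the HTML standard, so its value is returned as a single string. A minimal sketch −

Example
from bs4 import BeautifulSoup
id_soup = BeautifulSoup('<p id="body bold"></p>', 'lxml')
print ("id_soup.p['id']:", id_soup.p['id'])

Output
id_soup.p['id']: body bold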
NavigableString object
Usually, a certain string is placed in the opening and closing tags of a certain type. The browser's HTML engine applies the intended effect to the string while rendering the element. For example, in <b>Hello World</b>, a string appears between the <b> and </b> tags so that it is rendered in bold.
The NavigableString object represents the contents of a tag. It is an object of the bs4.element.NavigableString class. To access the contents, use ".string" with the tag.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", 'html.parser')
print (soup.string)
print (type(soup.string))
Output
Hello, Tutorialspoint!
<class 'bs4.element.NavigableString'>
A NavigableString object is similar to a Python Unicode string. Some of its features support navigating the tree and searching the tree. A NavigableString can be converted to a Unicode string with the str() function.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>",'html.parser')
tag = soup.h2
string = str(tag.string)
print (string)
Output
Hello, Tutorialspoint!
Just like a Python string, which is immutable, a NavigableString can't be modified in place. However, you can use replace_with() to replace the inner string of a tag with another.
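A minimal sketch of replace_with() −

Example
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b>Hello</b>', 'html.parser')
soup.b.string.replace_with("Hello World")
print (soup)

Output
<b>Hello World</b>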
BeautifulSoup object
The BeautifulSoup object represents the entire parsed document. However, it can be considered similar to a Tag object. It is the object created when we try to scrape a web resource. Because it is similar to a Tag object, it supports the functionality required to parse and search the document tree.
Example
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
print (soup)
print (soup.name)
print ('type:',type(soup))
Output
<html>
<head>
<title>TutorialsPoint</title>
</head>
<body>
<h2>Departmentwise Employees</h2>
<ul>
<li>Accounts</li>
<ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ul>
<li>Rani</li>
<li>Ankita</li>
</ul>
</ul>
</body>
</html>
[document]
type: <class 'bs4.BeautifulSoup'>
The name property of the BeautifulSoup object always returns [document].
Two parsed documents can be combined if you pass a BeautifulSoup object as an argument to a certain function such as replace_with().
Comment object
Any text written between <!-- and --> in an HTML or XML document is treated as a comment. BeautifulSoup can detect such commented text as a Comment object.
Example
from bs4 import BeautifulSoup
markup = "<b><!--This is a comment text in HTML--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
print (comment, type(comment))
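Output
This is a comment text in HTML <class 'bs4.element.Comment'>

Note that the Comment object holds the comment text without the <!-- and --> markers.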
Beautiful Soup - Inspect Data Source
In order to scrape a web page with BeautifulSoup and Python, your first step in any web scraping project should be to explore the website that you want to scrape. So, first visit the website to understand the site structure before you start extracting the information that's relevant to you.
Let us visit TutorialsPoint’s Python Tutorial home page. Open https://www.tutorialspoint.com/python3/index.htm in your browser.
Developer tools can help you understand the structure of a website. All modern browsers come with developer tools installed.
If you are using the Chrome browser, open Developer Tools from the top-right menu button (⋮) by selecting More Tools → Developer Tools.
With Developer tools, you can explore the site’s document object model (DOM) to better understand your source. Select the Elements tab in developer tools. You’ll see a structure with clickable HTML elements.
The Tutorial page shows the table of contents in the left sidebar. Right-click on any chapter and choose the Inspect option.
In the Elements tab, locate the tag that corresponds to the TOC list.
Right click on the HTML element, copy the HTML element, and paste it in any editor.
The HTML script of the <ul>..</ul> element is now obtained.
<ul class="toc chapters">
<li class="heading">Python 3 Basic Tutorial</li>
<li class="current-chapter"><a href="/python3/index.htm">Python 3 - Home</a></li>
<li><a href="/python3/python3_whatisnew.htm">What is New in Python 3</a></li>
<li><a href="/python3/python_overview.htm">Python 3 - Overview</a></li>
<li><a href="/python3/python_environment.htm">Python 3 - Environment Setup</a></li>
<li><a href="/python3/python_basic_syntax.htm">Python 3 - Basic Syntax</a></li>
<li><a href="/python3/python_variable_types.htm">Python 3 - Variable Types</a></li>
<li><a href="/python3/python_basic_operators.htm">Python 3 - Basic Operators</a></li>
<li><a href="/python3/python_decision_making.htm">Python 3 - Decision Making</a></li>
<li><a href="/python3/python_loops.htm">Python 3 - Loops</a></li>
<li><a href="/python3/python_numbers.htm">Python 3 - Numbers</a></li>
<li><a href="/python3/python_strings.htm">Python 3 - Strings</a></li>
<li><a href="/python3/python_lists.htm">Python 3 - Lists</a></li>
<li><a href="/python3/python_tuples.htm">Python 3 - Tuples</a></li>
<li><a href="/python3/python_dictionary.htm">Python 3 - Dictionary</a></li>
<li><a href="/python3/python_date_time.htm">Python 3 - Date & Time</a></li>
<li><a href="/python3/python_functions.htm">Python 3 - Functions</a></li>
<li><a href="/python3/python_modules.htm">Python 3 - Modules</a></li>
<li><a href="/python3/python_files_io.htm">Python 3 - Files I/O</a></li>
<li><a href="/python3/python_exceptions.htm">Python 3 - Exceptions</a></li>
</ul>
We can now load this script into a BeautifulSoup object to parse the document tree.
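For instance, a minimal sketch, assuming the copied markup is pasted into a Python string named html −

Example
from bs4 import BeautifulSoup
html = '''<ul class="toc chapters">
...
</ul>'''   # paste the copied <ul> element here
soup = BeautifulSoup(html, 'html.parser')
# print the text of every chapter entry in the TOC
for li in soup.find_all('li'):
   print (li.text)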
Beautiful Soup - Scrape HTML Content
The process of extracting data from websites is called web scraping. A web page may have URLs, email addresses, images or any other content, which can be stored in a file or a database. Searching a website manually is a cumbersome process. There are different web scraping tools that automate the process.
Web scraping is sometimes prohibited by the use of a 'robots.txt' file. Some popular sites provide APIs to access their data in a structured way. Unethical web scraping may result in your IP getting blocked.
Python is widely used for web scraping. The Python standard library has the urllib package, which can be used to extract data from HTML pages. Since the urllib module is bundled with the standard library, it need not be installed.
The urllib package is an HTTP client for the Python programming language. The urllib.request module is useful when we want to open and read URLs. Other modules in the urllib package are −
- urllib.error defines the exception classes for exceptions raised by urllib.request.
- urllib.parse is used for parsing URLs.
- urllib.robotparser is used for parsing robots.txt files.
Use the urlopen() function in the urllib.request module to read the content of a web page from a website.
import urllib.request
response = urllib.request.urlopen('http://python.org/')
html = response.read()
You can also use the requests library for this purpose. You need to install it before use.
pip3 install requests
In the code below, the homepage of https://www.tutorialspoint.com is scraped −
from bs4 import BeautifulSoup
import requests
url = "https://www.tutorialspoint.com/index.htm"
req = requests.get(url)
The content obtained by either of the above two methods is then parsed with Beautiful Soup.
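For instance, continuing with the requests example above −

soup = BeautifulSoup(req.content, 'html.parser')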
Beautiful Soup - Navigating by Tags
One of the important elements in any piece of HTML document is the tag, which may contain other tags/strings (the tag's children). Beautiful Soup provides different ways to navigate and iterate over a tag's children.
The easiest way to search the parse tree is to search for a tag by its name.
soup.head
soup.head returns the contents put inside the <head> .. </head> element of an HTML page.
Consider the following HTML page to be scraped:
<html>
<head>
<title>TutorialsPoint</title>
<script>
document.write("Welcome to TutorialsPoint");
</script>
</head>
<body>
<h1>Tutorialspoint Online Library</h1>
<p><b>It's all Free</b></p>
</body>
</html>
The following code extracts the contents of the <head> element −
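A minimal sketch, assuming the page above is saved locally as index.html −

Example
from bs4 import BeautifulSoup
with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')
print (soup.head)

Output
<head>
<title>TutorialsPoint</title>
<script>
document.write("Welcome to TutorialsPoint");
</script>
</head>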
soup.body
Similarly, to return the contents of the body part of the HTML page, use soup.body −
Example
from bs4 import BeautifulSoup
with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')
print (soup.body)
Output
<body>
<h1>Tutorialspoint Online Library</h1>
<p><b>It's all Free</b></p>
</body>
You can also extract a specific tag (like the first <h1> tag) inside the <body> tag −
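For example, reusing the soup object from the example above −

print (soup.body.h1)

Output
<h1>Tutorialspoint Online Library</h1>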
soup.p
Our HTML file contains a <p> tag. We can extract the contents of this tag −
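Continuing with the same soup object −

print (soup.p)

Output
<p><b>It's all Free</b></p>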
Tag.contents
A Tag object may have one or more PageElements. The Tag object's contents property returns a list of all the elements included in it.
Let us find the elements in the <head> tag of our index.html file −
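A minimal sketch, assuming the same index.html as above −

Example
from bs4 import BeautifulSoup
with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')
tag = soup.head
print (tag.contents)

Output
['\n', <title>TutorialsPoint</title>, '\n', <script>
document.write("Welcome to TutorialsPoint");
</script>, '\n']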
Tag.children
The structure of tags in an HTML script is hierarchical. The elements are nested one inside the other. For example, the top-level <HTML> tag includes the <HEAD> and <BODY> tags, and each may have other tags in it.
The Tag object has a children property that returns a list iterator object containing the enclosed PageElements.
To demonstrate the children property, we shall use the following HTML script (index.html). In the <body> section, there are two <ul> list elements, one nested in another. In other words, the body tag has top level list elements, and each list element has another list under it.
<html>
<head>
<title>TutorialsPoint</title>
</head>
<body>
<h2>Departmentwise Employees</h2>
<ul>
<li>Accounts</li>
<ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ul>
<li>Rani</li>
<li>Ankita</li>
</ul>
</ul>
</body>
</html>
The following Python code gives a list of all the children elements of top level <ul> tag.
Example
from bs4 import BeautifulSoup
with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')
tag = soup.ul
print (list(tag.children))
Output
['\n', <li>Accounts</li>, '\n', <ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>, '\n', <li>HR</li>, '\n', <ul>
<li>Rani</li>
<li>Ankita</li>
</ul>, '\n']
Since the .children property returns a list_iterator, we can use a for loop to traverse the hierarchy −
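For example, a sketch that walks the same <ul> tag −

for child in tag.children:
   print (child)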
Tag.find_all()
This method returns a result set of all the tags matching the tag provided as the argument.
Let us consider the following HTML page (index.html) for this −
<html>
<body>
<h1>Tutorialspoint Online Library</h1>
<p><b>It's all Free</b></p>
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
<a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>
<a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>
<a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>
<a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>
</body>
</html>
以下代码列出了所有带有 <a> 标记的元素
The following code lists all the elements with <a> tag
Example
from bs4 import BeautifulSoup
with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')
result = soup.find_all("a")
print (result)
Output
[
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>,
<a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>,
<a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>,
<a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>,
<a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>
]
Beautiful Soup - Find Elements by ID
In an HTML document, usually each element is assigned a unique ID. This enables the value of an element to be extracted by front-end code, such as a JavaScript function.
With BeautifulSoup, you can find the contents of a given element by its ID. This can be achieved with the find() and find_all() methods, as well as with select().
Using find() method
The find() method of the BeautifulSoup object searches for the first element that satisfies the criteria given as an argument.
Let us use the following HTML script (as index.html) for this purpose −
<html>
<head>
<title>TutorialsPoint</title>
</head>
<body>
<form>
<input type = 'text' id = 'nm' name = 'name'>
<input type = 'text' id = 'age' name = 'age'>
<input type = 'text' id = 'marks' name = 'marks'>
</form>
</body>
</html>
The following Python code finds the element with its id as nm −
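A minimal sketch −

Example
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
obj = soup.find(id = 'nm')
print (obj)

Output
<input id="nm" name="name" type="text"/>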
Using find_all()
The find_all() method also accepts a filter argument. It returns a list of all the elements with the given id. In a given HTML document, there is usually a single element with a particular id. Hence, using find() instead of find_all() is preferable when searching for a given id.
Example
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
obj = soup.find_all(id = 'nm')
print (obj)
Output
[<input id="nm" name="name" type="text"/>]
Note that the find_all() method returns a list. The find_all() method also has a limit parameter. Setting limit=1 in find_all() is equivalent to calling find() −
obj = soup.find_all(id = 'nm', limit=1)
Using select() method
The select() method in the BeautifulSoup class accepts a CSS selector as an argument. The # symbol, followed by the value of the required id, is passed to the select() method. It works like the find_all() method.
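A minimal sketch −

Example
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
obj = soup.select("#nm")
print (obj)

Output
[<input id="nm" name="name" type="text"/>]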
Using select_one()
Like the find_all() method, the select() method also returns a list. There is also a select_one() method that returns only the first tag matching the given argument −
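For example −

obj = soup.select_one("#nm")
print (obj)

Output
<input id="nm" name="name" type="text"/>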
Beautiful Soup - Find Elements by Class
CSS (Cascading Style Sheets) is a tool for designing the appearance of HTML elements. CSS rules control the different aspects of an HTML element, such as size, color, alignment, etc. Applying styles is more effective than defining HTML element attributes. You can apply styling rules to each HTML element. Instead of applying a style to each element individually, CSS classes are used to apply similar styling to groups of HTML elements, to achieve a uniform web page appearance. In BeautifulSoup, it is possible to find tags styled with a CSS class. In this chapter, we shall use the following methods to search for elements of a specified CSS class −
- find_all() and find() methods
- select() and select_one() methods
Class in CSS
A class in CSS is a collection of attributes specifying different features related to appearance, such as font type, size and color, background color, alignment, etc. The name of the class is prefixed with a dot (.) when declaring it.
.class {
css declarations;
}
A CSS class may be defined inline, or in a separate css file which needs to be included in the HTML script. A typical example of a CSS class could be as follows −
.blue-text {
color: blue;
font-weight: bold;
}
You can search for HTML elements defined with a certain class style with the help of the following BeautifulSoup methods.
For the purpose of this chapter, we shall use the following HTML page −
<html>
<head>
<title>TutorialsPoint</title>
</head>
<body>
<h2 class="heading">Departmentwise Employees</h2>
<ul>
<li class="mainmenu">Accounts</li>
<ul>
<li class="submenu">Anand</li>
<li class="submenu">Mahesh</li>
</ul>
<li class="mainmenu">HR</li>
<ul>
<li class="submenu">Rani</li>
<li class="submenu">Ankita</li>
</ul>
</ul>
</body>
</html>
Using find() and find_all()
To search for elements with a certain CSS class used in a tag, pass the class name in the attrs argument of find_all() or find(), as follows −
Example
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
obj = soup.find_all(attrs={"class": "mainmenu"})
print (obj)
Output
[<li class="mainmenu">Accounts</li>, <li class="mainmenu">HR</li>]
The result is a list of all the elements with the mainmenu class.
To fetch the list of elements with any of the CSS classes mentioned in the attrs argument, change the find_all() statement to −
obj = soup.find_all(attrs={"class": ["mainmenu", "submenu"]})
This results in a list of all the elements with any of the CSS classes used above.
[
<li class="mainmenu">Accounts</li>,
<li class="submenu">Anand</li>,
<li class="submenu">Mahesh</li>,
<li class="mainmenu">HR</li>,
<li class="submenu">Rani</li>,
<li class="submenu">Ankita</li>
]
Using select() and select_one()
You can also use the select() method with a CSS selector as the argument. The (.) symbol followed by the name of the class is used as the CSS selector.
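A minimal sketch −

Example
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
obj = soup.select(".mainmenu")
print (obj)

Output
[<li class="mainmenu">Accounts</li>, <li class="mainmenu">HR</li>]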
Beautiful Soup - Find Elements by Attribute
Both the find() and find_all() methods are meant to find one or all the tags in the document, as per the arguments passed to these methods. You can pass the attrs parameter to these functions. The value of attrs must be a dictionary with one or more tag attributes and their values.
For the purpose of checking the behaviour of these methods, we shall use the following HTML document (index.html) −
<html>
<head>
<title>TutorialsPoint</title>
</head>
<body>
<form>
<input type = 'text' id = 'nm' name = 'name'>
<input type = 'text' id = 'age' name = 'age'>
<input type = 'text' id = 'marks' name = 'marks'>
</form>
</body>
</html>
Using find_all()
The following program returns a list of all the tags having the type="text" attribute −
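A minimal sketch −

Example
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
obj = soup.find_all(attrs={"type":"text"})
print (obj)

Output
[<input id="nm" name="name" type="text"/>, <input id="age" name="age" type="text"/>, <input id="marks" name="marks" type="text"/>]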
Using find()
The find() method returns the first tag in the parsed document that has the given attributes.
obj = soup.find(attrs={"name":'marks'})
Using select()
The select() method can be called by passing the attribute to be matched. The attribute name must be put in square brackets, as in a CSS attribute selector. It returns a list of all the tags that have the given attribute.
In the following code, the select() method returns all the tags with a type attribute −
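A minimal sketch, using a CSS attribute selector −

Example
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
obj = soup.select("[type]")
print (obj)

Output
[<input id="nm" name="name" type="text"/>, <input id="age" name="age" type="text"/>, <input id="marks" name="marks" type="text"/>]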
Beautiful Soup - Searching the Tree
In this chapter, we shall discuss different methods in Beautiful Soup for navigating the HTML document tree in different directions - going up and down, sideways, and back and forth.
We shall use the following HTML string in all the examples in this chapter −
html = """
<html><head><title>TutorialsPoint</title></head>
<body>
<p class="title"><b>Online Tutorials Library</b></p>
<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
<a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
<a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
Enhance your Programming skills.</p>
<p class="tutorial">...</p>
"""
The name of the required tag lets you navigate the parse tree. For example, soup.head fetches the <head> element −
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print (soup.head.prettify())
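Output
<head>
 <title>
  TutorialsPoint
 </title>
</head>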
Going down
A tag may contain strings or other tags enclosed in it. The .contents property of the Tag object returns a list of all the children elements belonging to it.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.head
print (tag.contents)
Output
[<title>TutorialsPoint</title>]
The returned object is a list, although in this case there is only a single child tag enclosed in the head element.
.children
The .children property also returns all the enclosed elements of a tag. Below, all the elements in the body tag are given as a list.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.body
print (list(tag.children))
Output
['\n', <p class="title"><b>Online Tutorials Library</b></p>, '\n',
<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a class="lang" href="https://tutorialspoint.com/Python" id="link1">Python</a>,
<a class="lang" href="https://tutorialspoint.com/Java" id="link2">Java</a> and
<a class="lang" href="https://tutorialspoint.com/PHP" id="link3">PHP</a>;
Enhance your Programming skills.</p>, '\n', <p class="tutorial">...</p>, '\n']
Instead of getting them as a list, you can iterate over a tag's children using the .children generator −
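A minimal sketch −

Example
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.body
for child in tag.children:
   print (child)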
Output
<p class="title"><b>Online Tutorials Library</b></p>
<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a class="lang" href="https://tutorialspoint.com/Python" id="link1">Python</a>,
<a class="lang" href="https://tutorialspoint.com/Java" id="link2">Java</a> and
<a class="lang" href="https://tutorialspoint.com/PHP" id="link3">PHP</a>;
Enhance your Programming skills.</p>
<p class="tutorial">...</p>
.descendants
The .contents and .children attributes only consider a tag’s direct children. The .descendants attribute lets you iterate over all of a tag’s children, recursively: its direct children, the children of its direct children, and so on.
The BeautifulSoup object is at the top of the hierarchy of all the tags. Hence its .descendants property includes all the elements in the HTML string.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print (soup.descendants)
The .descendants attribute returns a generator, which can be iterated with a for loop. Here, we list the descendants of the head tag.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.head
for element in tag.descendants:
   print (element)
Output
<title>TutorialsPoint</title>
TutorialsPoint
The head tag contains a title tag, which in turn encloses a NavigableString object TutorialsPoint. The <head> tag has only one child, but two descendants: the <title> tag and the <title> tag's child. The BeautifulSoup object, on the other hand, has only one direct child (the <html> tag), but many descendants.
Going Up
Just as you navigate the downstream of a document with the children and descendants properties, BeautifulSoup offers the .parent and .parents properties to navigate the upstream of a tag.
.parent
Every tag and every string has a parent tag that contains it. You can access an element's parent with the parent attribute. In our example, the <head> tag is the parent of the <title> tag.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.title
print (tag.parent)
Output
<head><title>TutorialsPoint</title></head>
Since the title tag contains a string (NavigableString), the parent of that string is the title tag itself.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.title
string = tag.string
print (string.parent)
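Output
<title>TutorialsPoint</title>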
.parents
You can iterate over all of an element's parents with .parents. The following example uses .parents to travel from an <a> tag buried deep within the document to the very top of the document; we track the parents of the first <a> tag in the example HTML string −
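A minimal sketch −

Example
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.a
for parent in tag.parents:
   print (parent.name)

Output
p
body
html
[document]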
Sideways
The HTML tags appearing at the same indentation level are called siblings. Consider the following HTML snippet
<p>
<b>
Hello
</b>
<i>
Python
</i>
</p>
Inside the outer <p> tag, we have the <b> and <i> tags at the same indent level; hence they are called siblings. BeautifulSoup makes it possible to navigate between tags at the same level.
.next_sibling and .previous_sibling
These attributes respectively return the next tag and the previous tag at the same level.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')
tag1 = soup.b
print ("next:",tag1.next_sibling)
tag2 = soup.i
print ("previous:",tag2.previous_sibling)
Output
next: <i>Python</i>
previous: <b>Hello</b>
Since the <b> tag doesn't have a sibling to its left and the <i> tag doesn't have a sibling to its right, None is returned in both cases −
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')
tag1 = soup.b
print ("previous:",tag1.previous_sibling)
tag2 = soup.i
print ("next:",tag2.next_sibling)
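Output
previous: None
next: None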
.next_siblings and .previous_siblings
If there are two or more siblings to the right or left of a tag, they can be navigated with the help of the .next_siblings and .previous_siblings attributes respectively. Both of them return generator objects, so a for loop can be used to iterate over them.
Let us use the following HTML snippet for this purpose −
<p>
<b>
Excellent
</b>
<i>
Python
</i>
<u>
Tutorial
</u>
</p>
Use the following code to traverse the next and previous sibling tags −
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", 'html.parser')
tag1 = soup.b
print ("next siblings:")
for tag in tag1.next_siblings:
   print (tag)
print ("previous siblings:")
tag2 = soup.u
for tag in tag2.previous_siblings:
   print (tag)
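Output
next siblings:
<i>Python</i>
<u>Tutorial</u>
previous siblings:
<i>Python</i>
<b>Excellent</b>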
Back and forth
In Beautiful Soup, the next_element property returns the next string or tag in the parse tree. On the other hand, the previous_element property returns the previous string or tag in the parse tree. Sometimes, the return value of next_element and previous_element attributes is similar to next_sibling and previous_sibling properties.
Example
html = """
<html><head><title>TutorialsPoint</title></head>
<body>
<p class="title"><b>Online Tutorials Library</b></p>
<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
<a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
<a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
Enhance your Programming skills.</p>
<p class="tutorial">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find("a", id="link3")
print (tag.next_element)
tag = soup.find("a", id="link1")
print (tag.previous_element)
Output
PHP
TutorialsPoint has an excellent collection of tutorials on:
The next_element after <a> tag with id = "link3" is the string PHP. Similarly, the previous_element returns the string before <a> tag with id = "link1".
.next_elements and .previous_elements
These attributes of the Tag object return generators of all the tags and strings that appear after and before it, respectively.
Next elements example
tag = soup.find("a", id="link1")
for element in tag.next_elements:
   print (element)
Output
Python
,
<a class="lang" href="https://tutorialspoint.com/Java" id="link2">Java</a>
Java
and
<a class="lang" href="https://tutorialspoint.com/PHP" id="link3">PHP</a>
PHP
;
Enhance your Programming skills.
<p class="tutorial">...</p>
...
Previous elements example
tag = soup.find("body")
for element in tag.previous_elements:
   print (element)
Output
<html><head><title>TutorialsPoint</title></head>
Beautiful Soup - Modifying the Tree
One of the powerful features of the Beautiful Soup library is the ability to manipulate the parsed HTML or XML document and modify its contents.
Beautiful Soup library has different functions to perform the following operations −
- Add contents or a new tag to an existing tag of the document
- Insert contents before or after an existing tag or string
- Clear the contents of an already existing tag
- Modify the contents of a tag element
Add content
You can add to the content of an existing tag by using the append() method on a Tag object. It works like the append() method of Python's list object.
In the following example, the HTML script has a <p> tag. With append(), additional text is appended.
Example
from bs4 import BeautifulSoup
markup = '<p>Hello</p>'
soup = BeautifulSoup(markup, 'html.parser')
print (soup)
tag = soup.p
tag.append(" World")
print (soup)
Output
<p>Hello</p>
<p>Hello World</p>
With the append() method, you can add a new tag at the end of an existing tag. First create a new Tag object with the new_tag() method and then pass it to the append() method.
Example
from bs4 import BeautifulSoup, Tag
markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b
tag1 = soup.new_tag('i')
tag1.string = 'World'
tag.append(tag1)
print (soup.prettify())
Output
<b>
Hello
<i>
World
</i>
</b>
If you have to add a string to the document, you can append a NavigableString object.
Example
from bs4 import BeautifulSoup, NavigableString
markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b
new_string = NavigableString(" World")
tag.append(new_string)
print (soup.prettify())
Output
<b>
Hello
World
</b>
From Beautiful Soup version 4.7 onwards, the extend() method has been added to the Tag class. It adds all the elements in a list to the tag.
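A minimal sketch of extend() −

Example
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b>Hello</b>', 'html.parser')
tag = soup.b
tag.extend([' World', ' of', ' Python'])
print (soup)

Output
<b>Hello World of Python</b>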
Insert Contents
Instead of adding a new element at the end, you can use the insert() method to add an element at a given position in the list of children of a Tag element. The insert() method in Beautiful Soup behaves like insert() on a Python list object.
In the following example, a new string is added to the <b> tag at position 1. The resultant parsed document shows the result.
Example
from bs4 import BeautifulSoup, NavigableString
markup = '<b>Excellent </b><u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b
tag.insert(1, "Tutorial ")
print (soup.prettify())
Output
<b>
Excellent
Tutorial
</b>
<u>
from TutorialsPoint
</u>
Beautiful Soup also has insert_before() and insert_after() methods. Their respective purposes are to insert a tag or a string before or after a given Tag object. The following code shows a string "Python Tutorial" being added after the <b> tag.
Example
from bs4 import BeautifulSoup, NavigableString
markup = '<b>Excellent </b><u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b
tag.insert_after("Python Tutorial")
print (soup.prettify())
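Output
<b>
Excellent
</b>
Python Tutorial
<u>
from TutorialsPoint
</u>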
Clear the Contents
Beautiful Soup provides more than one way to remove the contents of an element from the document tree. Each of these methods has its unique features.
The clear() method is the most straightforward. It simply removes the contents of a specified Tag element. The following example shows its usage.
Example
from bs4 import BeautifulSoup, NavigableString
markup = '<b>Excellent </b><u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.find('u')
tag.clear()
print (soup.prettify())
Output
<b>
Excellent
</b>
<u>
</u>
It can be seen that the clear() method removes the contents, keeping the tag intact.
For the next example, we parse the following HTML document and call the clear() method on all tags.
<html>
<body>
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
Here is the Python code using the clear() method −
Example
from bs4 import BeautifulSoup
fp = open('index.html')
soup = BeautifulSoup(fp, 'html.parser')
tags = soup.find_all()
for tag in tags:
   tag.clear()
print (soup.prettify())
Output
<html>
</html>
The extract() method removes either a tag or a string from the document tree, and returns the object that was removed.
Example
from bs4 import BeautifulSoup
fp = open('index.html')
soup = BeautifulSoup(fp, 'html.parser')
tags = soup.find_all()
for tag in tags:
   obj = tag.extract()
   print ("Extracted:",obj)
print (soup)
Output
Extracted: <html>
<body>
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
Extracted: <body>
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
Extracted: <p> The quick, brown fox jumps over a lazy dog.</p>
Extracted: <p> DJs flock by when MTV ax quiz prog.</p>
Extracted: <p> Junk MTV quiz graced by fox whelps.</p>
Extracted: <p> Bawds jog, flick quartz, vex nymphs.</p>
You can extract either a tag or a string. The following example shows a tag being extracted.
Example
html = '''
<ol id="HR">
<li>Rani</li>
<li>Ankita</li>
</ol>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
obj=soup.find('ol')
obj.find_next().extract()
print (soup)
Output
<ol id="HR">
<li>Ankita</li>
</ol>
Change the extract() statement to remove the inner text of the first <li> element −
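One possible way is to extract the NavigableString inside the first <li> element (a sketch) −

obj.find_next().string.extract()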
Output
<ol id="HR">
<li></li>
<li>Ankita</li>
</ol>
There is another method, decompose(), that removes a tag from the tree, then completely destroys it and its contents −
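A minimal sketch (the markup here is illustrative) −

Example
from bs4 import BeautifulSoup
markup = '<a href="https://www.tutorialspoint.com/">Online <i>Tutorials Library</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
soup.i.decompose()
print (soup)

Output
<a href="https://www.tutorialspoint.com/">Online </a>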
Modify the Contents
We shall now look at the replace_with() method, which allows the contents of a tag to be replaced.
Just like a Python string, which is immutable, a NavigableString can't be modified in place. However, you can use replace_with() to replace the inner string of a tag with another.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>",'html.parser')
tag = soup.h2
tag.string.replace_with("OnLine Tutorials Library")
print (tag.string)
Output
OnLine Tutorials Library
Here is another example to show the use of replace_with(). Two parsed documents can be combined if you pass a BeautifulSoup object as an argument to a function such as replace_with().
Example
from bs4 import BeautifulSoup
obj1 = BeautifulSoup("<book><title>Python</title></book>", features="xml")
obj2 = BeautifulSoup("<b>Beautiful Soup parser</b>", "lxml")
obj2.find('b').replace_with(obj1)
print (obj2)
Output
<html><body><book><title>Python</title></book></body></html>
The wrap() method wraps an element in the tag you specify. It returns the new wrapper.
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>Hello Python</p>", 'html.parser')
tag = soup.p
newtag = soup.new_tag('b')
tag.string.wrap(newtag)
print (soup)
Output
<p><b>Hello Python</b></p>
On the other hand, the unwrap() method replaces a tag with whatever’s inside that tag. It’s good for stripping out markup.
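The following sketch reverses the wrap() example above: unwrap() strips the <b> tag but keeps its inner text −
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p><b>Hello Python</b></p>', 'html.parser')
# unwrap() removes the <b> tag itself, leaving its contents in place
soup.b.unwrap()
print (soup)
Output
<p>Hello Python</p>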
Beautiful Soup - Parsing a Section of a Document
Let's say you want to use Beautiful Soup to look at only a document's <a> tags. Normally you would parse the tree and use the find_all() method with the required tag as the argument.
soup = BeautifulSoup(fp, "html.parser")
tags = soup.find_all('a')
But that would be time consuming, and it would take up more memory than necessary. Instead, you can create an object of the SoupStrainer class and use it as the value of the parse_only argument to the BeautifulSoup constructor.
A SoupStrainer tells BeautifulSoup which parts to extract, and the parse tree then consists of only these elements. If you narrow down the required information to a specific portion of the HTML, this speeds up your search.
product = SoupStrainer('div',{'id': 'products_list'})
soup = BeautifulSoup(html, "html.parser", parse_only=product)
The above lines of code will parse only the <div> element with id="products_list" from the page, ignoring the rest of the document.
Similarly, we can use other SoupStrainer objects to parse specific information from an HTML document. Below are some examples −
Example
from bs4 import BeautifulSoup, SoupStrainer
#Only "a" tags
only_a_tags = SoupStrainer("a")
#Will parse only the below mentioned "ids".
parse_only = SoupStrainer(id=["first", "third", "my_unique_id"])
soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)
#parse only where string length is less than 10
def is_short_string(string):
return len(string) < 10
only_short_strings = SoupStrainer(string=is_short_string)
The SoupStrainer class takes the same arguments as a typical method from Searching the tree: name, attrs, text, and **kwargs.
Note that this feature won’t work if you’re using the html5lib parser, because the whole document will be parsed in that case, no matter what. Hence, you should use either the inbuilt html.parser or lxml parser.
You can also pass a SoupStrainer into any of the methods covered in Searching the tree.
from bs4 import BeautifulSoup, SoupStrainer
a_tags = SoupStrainer("a")
soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all(a_tags)
Beautiful Soup - Find all Children of an Element
The structure of tags in an HTML script is hierarchical. The elements are nested one inside the other. For example, the top level <HTML> tag includes the <HEAD> and <BODY> tags, each of which may contain other tags. The top level element is called the parent. The elements nested inside the parent are its children. With the help of Beautiful Soup, we can find all the children elements of a parent element. In this chapter, we shall find out how to obtain the children of an HTML element.
There are two provisions in the BeautifulSoup class to fetch the children elements.
- The .children property
- The findChildren() method
Examples in this chapter use the following HTML script (index.html)
<html>
<head>
<title>TutorialsPoint</title>
</head>
<body>
<h2>Departmentwise Employees</h2>
<ul id="dept">
<li>Accounts</li>
<ul id='acc'>
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ul id="HR">
<li>Rani</li>
<li>Ankita</li>
</ul>
</ul>
</body>
</html>
Using .children property
The .children property of a Tag object returns a generator over the element's direct children only; it does not descend into nested tags (use .descendants for that).
The following Python code gives a list of all the children elements of the top level <ul> tag. We first obtain the Tag element corresponding to the <ul> tag, and then read its .children property.
Example
from bs4 import BeautifulSoup
with open("index.html") as fp:
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.ul
print (list(tag.children))
Output
['\n', <li>Accounts</li>, '\n', <ul id="acc">
<li>Anand</li>
<li>Mahesh</li>
</ul>, '\n', <li>HR</li>, '\n', <ul id="HR">
<li>Rani</li>
<li>Ankita</li>
</ul>, '\n']
Since the .children property returns a list_iterator, we can use a for loop to traverse the hierarchy.
for child in tag.children:
print (child)
Using findChildren() method
The findChildren() method offers a more comprehensive alternative. It returns a ResultSet of all the child elements under the given tag, recursively.
In the index.html document, we have two nested unordered lists. The top level <ul> element has id="dept" and the two enclosed lists have id="acc" and id="HR" respectively.
In the following example, we first instantiate a Tag object pointing to top level <ul> element and extract the list of children under it.
from bs4 import BeautifulSoup
fp = open('index.html')
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.find("ul", {"id": "dept"})
children = tag.findChildren()
for child in children:
print(child)
Note that the resultset includes the children under an element in a recursive fashion. Hence, in the following output, you’ll find the entire inner list, followed by individual elements in it.
<li>Accounts</li>
<ul id="acc">
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>Anand</li>
<li>Mahesh</li>
<li>HR</li>
<ul id="HR">
<li>Rani</li>
<li>Ankita</li>
</ul>
<li>Rani</li>
<li>Ankita</li>
Let us extract the children under an inner <ul> element with id='acc'. Here is the code −
Example
from bs4 import BeautifulSoup
fp = open('index.html')
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.find("ul", {"id": "acc"})
children = tag.findChildren()
for child in children:
print(child)
When the above program is run, you'll obtain the <li> elements under the <ul> whose id is "acc".
Beautiful Soup - Find Element using CSS Selectors
In the Beautiful Soup library, the select() method is an important tool for scraping an HTML/XML document. Similar to find() and the other find_*() methods, the select() method also helps in locating an element that satisfies a given criterion. However, while the find_*() methods search for PageElements according to the tag name and its attributes, the select() method searches the document tree for the given CSS selector.
Beautiful Soup also has a select_one() method. The difference between select() and select_one() is that select() returns a ResultSet of all the PageElements matching the CSS selector, whereas select_one() returns the first occurrence of an element satisfying the CSS selector based selection criteria.
Prior to Beautiful Soup version 4.7, the select() method supported only the common CSS selectors. With version 4.7, Beautiful Soup was integrated with the Soup Sieve CSS selector library. As a result, many more selectors can now be used. In version 4.12, a .css property was added in addition to the existing convenience methods select() and select_one(). The parameters for the select() method are as follows −
select(selector, limit, **kwargs)
selector − A string containing a CSS selector.
limit − After finding this number of results, stop looking.
kwargs − Keyword arguments to be passed.
If the limit parameter is set to 1, select() becomes equivalent to the select_one() method, except that select() returns a ResultSet of Tag objects while select_one() returns a single Tag object, as the sketch below shows.
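A quick sketch (assuming soup holds any parsed document containing <p> tags) −
tags = soup.select('p', limit=1)   # e.g. [<p>Java</p>] - a one-element ResultSet
tag = soup.select_one('p')         # e.g. <p>Java</p> - a bare Tag object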
Soup Sieve Library
Soup Sieve is a CSS selector library. It has been integrated with Beautiful Soup 4, so it is installed along with the Beautiful Soup package. It provides the ability to select, match, and filter the document tree tags using modern CSS selectors. Soup Sieve currently implements most of the CSS selectors from the CSS level 1 specification up to CSS level 4, except for some that are not yet implemented.
The Soup Sieve library has different types of CSS selectors. The basic CSS selectors are −
Type selector
Matching elements is done by node name. For example −
tags = soup.select('div')
Example
from bs4 import BeautifulSoup, NavigableString
markup = '''
<div id="Languages">
<p>Java</p> <p>Python</p> <p>C++</p>
</div>
'''
soup = BeautifulSoup(markup, 'html.parser')
tags = soup.select('div')
print (tags)
Universal selector (*)
It matches elements of any type. Example −
tags = soup.select('*')
ID selector
It matches an element based on its id attribute. The symbol # denotes the ID selector. Example −
tags = soup.select("#nm")
Example
from bs4 import BeautifulSoup
html = '''
<form>
<input type = 'text' id = 'nm' name = 'name'>
<input type = 'text' id = 'age' name = 'age'>
<input type = 'text' id = 'marks' name = 'marks'>
</form>
'''
soup = BeautifulSoup(html, 'html.parser')
obj = soup.select("#nm")
print (obj)
Class selector
It matches an element based on the values contained in the class attribute. The . symbol prefixed to the class name is the CSS class selector. Example −
tags = soup.select(".submenu")
Example
from bs4 import BeautifulSoup
markup = '''
<div id="Languages">
<p class="submenu">Java</p> <p class="submenu">Python</p> <p>C++</p>
</div>
'''
soup = BeautifulSoup(markup, 'html.parser')
tags = soup.select('.submenu')
print (tags)
Attribute Selectors
The attribute selector matches an element based on its attributes.
soup.select('[attr]')
Example
from bs4 import BeautifulSoup
html = '''
<h1>Tutorialspoint Online Library</h1>
<p><b>It's all Free</b></p>
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
<a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>
'''
soup = BeautifulSoup(html, 'html5lib')
print(soup.select('[href]'))
Pseudo Classes
CSS specification defines a number of pseudo CSS classes. A pseudo-class is a keyword added to a selector so as to define a special state of the selected elements. It adds an effect to the existing elements. For example, :link selects a link (every <a> and <area> element with an href attribute) that has not yet been visited.
The pseudo-class selectors nth-of-type and nth-child are very widely used.
:nth-of-type()
The :nth-of-type() selector matches elements of a given type based on their position among a group of siblings. The keywords even and odd select the elements at even or odd positions among siblings of the same type.
In the following example, the second element of <p> type is selected.
Example
from bs4 import BeautifulSoup
html = '''
<p id="0"></p>
<p id="1"></p>
<span id="2"></span>
<span id="3"></span>
'''
soup = BeautifulSoup(html, 'html5lib')
print(soup.select('p:nth-of-type(2)'))
:nth-child()
This selector matches elements based on their position in a group of siblings. The keywords even and odd will respectively select elements whose position is either even or odd amongst a group of siblings.
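Here is a minimal sketch, using the same markup as the :nth-of-type() example; the selector is anchored to <body> (added automatically by html5lib) and picks the element at the 2nd child position −
Example
from bs4 import BeautifulSoup
html = '''
<p id="0"></p>
<p id="1"></p>
<span id="2"></span>
<span id="3"></span>
'''
soup = BeautifulSoup(html, 'html5lib')
# the element that is the second child of its parent, whatever its type
print(soup.select('body :nth-child(2)'))
Output
[<p id="1"></p>]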
Beautiful Soup - Find all Comments
Inserting comments in computer code is considered a good programming practice. Comments are helpful for understanding the logic of the program. They also serve as documentation. You can put comments in an HTML as well as an XML script, just as in a program written in C, Java, Python etc. The BeautifulSoup API can help identify all the comments in an HTML document.
In HTML and XML, the comment text is written between the <!-- and --> tags.
<!-- Comment Text -->
The BeautifulSoup package, whose internal name is bs4, defines Comment as an important object. The Comment object is a special type of NavigableString object. Hence, the string property of any tag whose content is found between <!-- and --> is recognized as a Comment.
Example
from bs4 import BeautifulSoup
markup = "<b><!--This is a comment text in HTML--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
print (comment, type(comment))
Output
This is a comment text in HTML <class 'bs4.element.Comment'>
To search for all the occurrences of comments in an HTML document, we shall use the find_all() method. Without any argument, find_all() returns all the elements in the parsed HTML document. You can pass the keyword argument string to the find_all() method; we shall pass the function iscomment as its value.
comments = soup.find_all(string=iscomment)
The iscomment() function verifies whether the text in a tag is a Comment object or not, with the help of the isinstance() function.
def iscomment(elem):
return isinstance(elem, Comment)
The comments variable shall store all the comment text occurrences in the given HTML document. We shall use the following index.html file in the example code −
<html>
<head>
<!-- Title of document -->
<title>TutorialsPoint</title>
</head>
<body>
<!-- Page heading -->
<h2>Departmentwise Employees</h2>
<!-- top level list-->
<ul id="dept">
<li>Accounts</li>
<ul id='acc'>
<!-- first inner list -->
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ul id="HR">
<!-- second inner list -->
<li>Rani</li>
<li>Ankita</li>
</ul>
</ul>
</body>
</html>
The following Python program scrapes the above HTML document, and finds all the comments in it.
Example
from bs4 import BeautifulSoup, Comment
fp = open('index.html')
soup = BeautifulSoup(fp, 'html.parser')
def iscomment(elem):
return isinstance(elem, Comment)
comments = soup.find_all(string=iscomment)
print (comments)
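Running this program should print the comment strings, something like −
Output
[' Title of document ', ' Page heading ', ' top level list', ' first inner list ', ' second inner list ']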
Beautiful Soup - Scraping List from HTML
Web pages usually contain important data in the form of ordered or unordered lists. With Beautiful Soup, we can easily extract the HTML list elements, bring the data into Python objects, and store it in databases for further analysis. In this chapter, we shall use the find() and select() methods to scrape list data from an HTML document.
The easiest way to search a parse tree is to search for the tag by its name. The expression soup.<tag> fetches the contents of the given tag.
HTML provides <ol> and <ul> tags to compose ordered and unordered lists. Like any other tag, we can fetch the contents of these tags.
We shall use the following HTML document −
<html>
<body>
<h2>Departmentwise Employees</h2>
<ul id="dept">
<li>Accounts</li>
<ul id='acc'>
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ol id="HR">
<li>Rani</li>
<li>Ankita</li>
</ol>
</ul>
</body>
</html>
Scraping lists by Tag
In the above HTML document, we have a top-level <ul> list, inside which there is another <ul> tag and an <ol> tag. We first parse the document into the soup object and retrieve the contents of the first <ul> in the soup.ul Tag object.
Example
from bs4 import BeautifulSoup
fp = open('index.html')
soup = BeautifulSoup(fp, 'html.parser')
lst=soup.ul
print (lst)
Using select() method
The select() method is essentially used to obtain data using a CSS selector. However, you can also pass a tag name to it. Here, we can pass the ol tag to the select() method. The select_one() method is also available; it fetches the first occurrence of the given tag, as sketched below.
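A minimal sketch with the document above −
Example
from bs4 import BeautifulSoup
fp = open('index.html')
soup = BeautifulSoup(fp, 'html.parser')
# a bare tag name works as a CSS selector
lst = soup.select('ol')        # ResultSet of all <ol> elements
first = soup.select_one('ol')  # the first <ol> element only
print (lst)
print (first)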
Using find_all() method
The find() and find_all() methods are more comprehensive. You can pass various types of filters such as a tag, attributes or a string to these methods. In this case, we want to fetch the contents of a list tag.
In the following code, the find_all() method returns a list of all the <ul> elements.
Example
from bs4 import BeautifulSoup
fp = open('index.html')
soup = BeautifulSoup(fp, 'html.parser')
lst=soup.find_all("ul")
print (lst)
We can refine the search filter by including the attrs argument. In our HTML document, the <ul> and <ol> tags have their respective id attributes specified. So, let us fetch the contents of the <ul> element having id="acc".
Example
from bs4 import BeautifulSoup
fp = open('index.html')
soup = BeautifulSoup(fp, 'html.parser')
lst=soup.find_all("ul", {"id":"acc"})
print (lst)
Output
[<ul id="acc">
<li>Anand</li>
<li>Mahesh</li>
</ul>]
Here is another example. We collect all <li> elements whose inner text starts with 'A'. The find_all() method takes a keyword argument string; it accepts the text value if the user-defined startingwith() function returns True, as in the sketch below.
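The following sketch assumes startingwith() is a user-defined filter function (the name comes from the text above) −
Example
from bs4 import BeautifulSoup
fp = open('index.html')
soup = BeautifulSoup(fp, 'html.parser')
# the string filter function receives each tag's string
def startingwith(s):
   return s is not None and s.startswith('A')
tags = soup.find_all('li', string=startingwith)
print (tags)
Output
[<li>Accounts</li>, <li>Anand</li>, <li>Ankita</li>]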
Beautiful Soup - Scraping Paragraphs from HTML
One of the frequently appearing tags in an HTML document is the <p> tag that marks paragraph text. With Beautiful Soup, you can easily extract paragraphs from the parsed document tree. In this chapter, we shall discuss the following ways of scraping paragraphs with the help of the BeautifulSoup library.
- Scraping HTML paragraph with <p> tag
- Scraping HTML paragraph with find_all() method
- Scraping HTML paragraph with select() method
We shall use the following HTML document for these exercises −
<html>
<head>
<title>BeautifulSoup - Scraping Paragraph</title>
</head>
<body>
<p id='para1'>The quick, brown fox jumps over a lazy dog.</p>
<h2>Hello</h2>
<p>DJs flock by when MTV ax quiz prog.</p>
<p>Junk MTV quiz graced by fox whelps.</p>
<p>Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
Scraping by <p> tag
The easiest way to search a parse tree is to search for the tag by its name. Hence, the expression soup.p points to the first <p> tag in the parsed document.
para = soup.p
To fetch all the subsequent <p> tags, you can run a loop until the soup object is exhausted of <p> tags, as in the sketch below. It displays the prettified output of all the paragraph tags.
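A minimal sketch of this loop, using find_next() to hop from one <p> tag to the next −
para = soup.p
while para is not None:
   print (para.prettify())
   para = para.find_next('p')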
Using find_all() method
The find_all() method is more comprehensive. You can pass various types of filters such as a tag, attributes or a string to this method. In this case, we want to fetch the contents of the <p> tags.
In the following code, the find_all() method returns a list of all the <p> elements.
Example
from bs4 import BeautifulSoup
fp = open('index.html')
soup = BeautifulSoup(fp, 'html.parser')
paras = soup.find_all('p')
for para in paras:
print (para.prettify())
Output
<p>
The quick, brown fox jumps over a lazy dog.
</p>
<p>
DJs flock by when MTV ax quiz prog.
</p>
<p>
Junk MTV quiz graced by fox whelps.
</p>
<p>
Bawds jog, flick quartz, vex nymphs.
</p>
We can use another approach to find all <p> tags. To begin with, obtain the list of all tags using find_all() and check whether the Tag.name of each equals 'p'.
Example
from bs4 import BeautifulSoup
fp = open('index.html')
soup = BeautifulSoup(fp, 'html.parser')
tags = soup.find_all()
paras = [tag.contents for tag in tags if tag.name=='p']
print (paras)
The find_all() method also has an attrs parameter. It is useful when you want to extract a <p> tag with specific attributes. For example, in the given document, the first <p> element has id='para1'. To fetch it, we modify the call as −
paras = soup.find_all('p', attrs={'id':'para1'})
Using select() method
The select() method is essentially used to obtain data using a CSS selector. However, you can also pass a tag name to it. Here, we can pass the p tag to the select() method. The select_one() method is also available; it fetches the first occurrence of the <p> tag.
Example
from bs4 import BeautifulSoup
fp = open('index.html')
soup = BeautifulSoup(fp, 'html.parser')
paras = soup.select('p')
print (paras)
Output
[
<p>The quick, brown fox jumps over a lazy dog.</p>,
<p>DJs flock by when MTV ax quiz prog.</p>,
<p>Junk MTV quiz graced by fox whelps.</p>,
<p>Bawds jog, flick quartz, vex nymphs.</p>
]
To filter out the <p> tag with a certain id, use a for loop as follows −
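A minimal sketch −
for para in soup.select('p'):
   if para.get('id') == 'para1':
      print (para)
Alternatively, the id can be written into the selector itself, as in soup.select('p#para1').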
BeautifulSoup - Scraping Link from HTML
While scraping and analysing content from websites, you are often required to extract all the links that a certain page contains. In this chapter, we shall find out how we can extract links from an HTML document.
HTML has the anchor tag <a> to insert a hyperlink. The href attribute of the anchor tag lets you establish the link. It uses the following syntax −
<a href="web page URL">hypertext</a>
With the find_all() method we can collect all the anchor tags in a document and then print the value of href attribute of each of them.
In the example below, we extract all the links found on Google’s home page. We use requests library to collect the HTML contents of https://google.com, parse it in a soup object, and then collect all <a> tags. Finally, we print href attributes.
Example
from bs4 import BeautifulSoup
import requests
url = "https://www.google.com/"
req = requests.get(url)
soup = BeautifulSoup(req.content, "html.parser")
tags = soup.find_all('a')
links = [tag['href'] for tag in tags]
for link in links:
print (link)
Here’s the partial output when the above program is run −
Output
https://www.google.co.in/imghp?hl=en&tab=wi
https://maps.google.co.in/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
https://www.youtube.com/?tab=w1
https://news.google.com/?tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.co.in/intl/en/about/products?tab=wh
http://www.google.co.in/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://www.google.com/&ec=GAZAAQ
/advanced_search?hl=en-IN&authuser=0
https://www.google.com/url?q=https://io.google/2023/%3Futm_source%3Dgoogle-hpp%26utm_medium%3Dembedded_marketing%26utm_campaign%3Dhpp_watch_live%26utm_content%3D&source=hpp&id=19035434&ct=3&usg=AOvVaw0qzqTkP5AEv87NM-MUDd_u&sa=X&ved=0ahUKEwiPzpjku-z-AhU1qJUCHVmqDJoQ8IcBCAU
However, an HTML document may have hyperlinks of different protocol schemes, such as the mailto: protocol for a link to an email ID, the tel: scheme for a link to a telephone number, or a link to a local file with the file:// URL scheme. In such a case, if we are interested in extracting only the links with the https:// scheme, we can do so as in the following example. We have an HTML document that consists of hyperlinks of different types, out of which only the ones with the https:// prefix are extracted.
html = '''
<p><a href="https://www.tutorialspoint.com">Web page link </a></p>
<p><a href="https://www.example.com">Web page link </a></p>
<p><a href="mailto:nowhere@mozilla.org">Email link</a></p>
<p><a href="tel:+4733378901">Telephone link</a></p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all('a')
links = [tag['href'] for tag in tags]
for link in links:
if link.startswith("https"):
print (link)
Beautiful Soup - Get all HTML Tags
Tags in HTML are like keywords in a traditional programming language like Python or Java. Tags have a predefined behaviour according to which their content is rendered by the browser. With Beautiful Soup, it is possible to collect all the tags in a given HTML document.
The simplest way to obtain a list of tags is to parse the web page into a soup object and call the find_all() method without any argument. It returns a ResultSet, from which we can list the names of all the tags.
Let us extract the list of all tags in Google’s homepage.
Example
from bs4 import BeautifulSoup
import requests
url = "https://www.google.com/"
req = requests.get(url)
soup = BeautifulSoup(req.content, "html.parser")
tags = soup.find_all()
print ([tag.name for tag in tags])
Output
['html', 'head', 'meta', 'meta', 'title', 'script', 'style', 'style', 'script', 'body', 'script', 'div', 'div', 'nobr', 'b', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'u', 'div', 'nobr', 'span', 'span', 'span', 'a', 'a', 'a', 'div', 'div', 'center', 'br', 'div', 'img', 'br', 'br', 'form', 'table', 'tr', 'td', 'td', 'input', 'input', 'input', 'input', 'input', 'div', 'input', 'br', 'span', 'span', 'input', 'span', 'span', 'input', 'script', 'input', 'td', 'a', 'input', 'script', 'div', 'div', 'br', 'div', 'style', 'div', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'span', 'div', 'div', 'a', 'a', 'a', 'a', 'p', 'a', 'a', 'script', 'script', 'script']
Naturally, you may get such a list where one certain tag may appear more than once. To obtain a list of unique tags (avoiding the duplication), construct a set from the list of tag objects.
Change the print statement in the above code to the one sketched below −
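print ({tag.name for tag in tags})
Here a set comprehension drops the duplicate tag names, producing the set shown below.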
Output
{'body', 'head', 'p', 'a', 'meta', 'tr', 'nobr', 'script', 'br', 'img', 'b', 'form', 'center', 'span', 'div', 'input', 'u', 'title', 'style', 'td', 'table', 'html'}
To obtain the tags that have some text associated with them, check the string property and print it if it is not None −
tags = soup.find_all()
for tag in tags:
if tag.string is not None:
print (tag.name, tag.string)
There may be some singleton tags without text but with one or more attributes, as in the <img> tag. The following loop lists out such tags.
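A minimal sketch: a tag with no contents but at least one attribute is treated as such a singleton −
for tag in soup.find_all():
   if len(tag.contents) == 0 and len(tag.attrs) > 0:
      print (tag.name, tag.attrs)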
In the following code, the HTML string is not a complete HTML document, in the sense that the <html> and <body> tags are not given. But the html5lib and lxml parsers add these tags on their own while parsing the document tree. Hence, when we extract the tag list, the additional tags will also be seen.
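Here is a small sketch of this behaviour −
Example
from bs4 import BeautifulSoup
html = '<p>Hello World</p>'
soup = BeautifulSoup(html, 'html5lib')
print ([tag.name for tag in soup.find_all()])
Output
['html', 'head', 'body', 'p']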
Beautiful Soup - Get Text Inside Tag
There are two types of tags in HTML. Many of the tags come in pairs of opening and closing counterparts. The top level <html> tag with its corresponding closing </html> tag is the main example. Others are <body> and </body>, <p> and </p>, <h1> and </h1> and many more. Other tags are self-closing, such as <img> and <br>. The self-closing tags don't have text, whereas most of the paired tags (such as <b>Hello</b>) do. In this chapter, we shall have a look at how we can get the text part inside such tags with the help of the Beautiful Soup library.
There is more than one method/property available in Beautiful Soup with which we can fetch the text associated with a tag object.
1. text property − Gets all child strings of a PageElement, concatenated using a separator if specified.
2. string property − A convenience property to get the single string from a child element.
3. strings property − Yields the string parts from all the child objects under the current PageElement.
4. stripped_strings property − Same as the strings property, with the linebreaks and whitespaces removed.
5. get_text() method − Returns all child strings of this PageElement, concatenated using a separator if specified.
Consider the following HTML document −
<div id="outer">
<div id="inner">
<p>Hello<b>World</b></p>
<img src='logo.jpg'>
</div>
</div>
If we retrieve the stripped_strings property of each tag in the parsed document tree, we will find that the two div tags and the p tag yield two NavigableString objects, Hello and World. The <b> tag embeds the World string, while <img> doesn't have a text part.
The following example fetches the text from each of the tags in the given HTML document −
Example
html = """
<div id="outer">
<div id="inner">
<p>Hello<b>World</b></p>
<img src='logo.jpg'>
</div>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all():
print ("Tag: {} attributes: {} ".format(tag.name, tag.attrs))
for txt in tag.stripped_strings:
print (txt)
print()
Beautiful Soup - Find all Headings
In this chapter, we shall explore how to find all heading elements in an HTML document with BeautifulSoup. HTML defines six heading styles from H1 to H6, each with decreasing font size. Suitable tags are used for different page sections, such as the main heading, section headings, topics etc. Let us use the find_all() method in two different ways to extract all the heading elements in an HTML document.
We shall use the following HTML script (saved as index.html) in the code examples in this chapter −
<html>
<head>
<title>BeautifulSoup - Scraping Headings</title>
</head>
<body>
<h2>Scraping Headings</h2>
<b>The quick, brown fox jumps over a lazy dog.</b>
<h3>Paragraph Heading</h3>
<p>DJs flock by when MTV ax quiz prog.</p>
<h3>List heading</h3>
<ul>
<li>Junk MTV quiz graced by fox whelps.</li>
<li>Bawds jog, flick quartz, vex nymphs.</li>
</ul>
</body>
</html>
Example 1
In this approach, we collect all the tags in the parsed tree, and check if the name of each tag is found in a list of all heading tags.
from bs4 import BeautifulSoup
fp = open('index.html')
soup = BeautifulSoup(fp, 'html.parser')
headings = ['h1','h2','h3', 'h4', 'h5', 'h6']
tags = soup.find_all()
heads = [(tag.name, tag.contents[0]) for tag in tags if tag.name in headings]
print (heads)
Here, headings is a list of all heading styles h1 to h6. If the name of a tag is any of these, the tag and its contents are collected in a list named heads.
Example 2
You can pass a compiled regular expression to the find_all() method. Take a look at the following regex.
re.compile('^h[1-6]$')
This regex finds all tags that start with h, have a digit after the h, and end after the digit. Let us use this as an argument to the find_all() method in the code below −
from bs4 import BeautifulSoup
import re
fp = open('index.html')
soup = BeautifulSoup(fp, 'html.parser')
tags = soup.find_all(re.compile('^h[1-6]$'))
print (tags)
Beautiful Soup - Extract Title Tag
The <title> tag is used to provide a text caption to the page that appears in the browser’s title bar. It is not a part of the main content of the web page. The title tag is always present inside the <head> tag.
We can extract the contents of the title tag with Beautiful Soup. We parse the HTML tree and obtain the title tag object.
Example
html = '''
<html>
<head>
<title>Python Libraries</title>
</head>
<body>
<p>Hello World</p>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html5lib")
title = soup.title
print (title)
Output
<title>Python Libraries</title>
In HTML, we can use the title attribute with all tags. The title attribute gives additional information about an element. The information works as a tooltip text when the mouse hovers over the element.
We can extract the text of the title attribute of each tag with the following code snippet −
Example
html = '''
<html>
<body>
<p title='parsing HTML and XML'>Beautiful Soup</p>
<p title='HTTP library'>requests</p>
<p title='URL handling'>urllib</p>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html5lib")
tags = soup.find_all()
for tag in tags:
if tag.has_attr('title'):
print (tag.attrs['title'])
Beautiful Soup - Extract Email IDs
Extracting email addresses from a web page is an important application of a web scraping library such as BeautifulSoup. In any web page, the email IDs usually appear in the href attribute of the anchor <a> tag. The email ID is written using the mailto URL scheme. Many a time, the email address may be present in the page content as normal text (without any hyperlink). In this chapter, we shall use the BeautifulSoup library to fetch email IDs from an HTML page with simple techniques.
A typical usage of Email ID in href attribute is as below −
<a href = "mailto:xyz@abc.com">test link</a>
In the first example, we shall consider the following HTML document for extracting the Email IDs from the hyperlinks −
<html>
<head>
<title>BeautifulSoup - Scraping Email IDs</title>
</head>
<body>
<h2>Contact Us</h2>
<ul>
<li><a href = "mailto:sales@company.com">Sales Enquiries</a></li>
<li><a href = "mailto:careers@company.com">Careers</a></li>
<li><a href = "mailto:partner@company.com">Partner with us</a></li>
</ul>
</body>
</html>
Here is the Python code that finds the email IDs. We collect all the <a> tags in the document and check whether the tag has an href attribute whose value starts with mailto:. If so, the part of the value after the first 7 characters (the mailto: prefix) is the email ID.
from bs4 import BeautifulSoup
import re
fp = open("contact.html")
soup = BeautifulSoup(fp, "html.parser")
tags = soup.find_all("a")
for tag in tags:
if tag.has_attr("href") and tag['href'][:7]=='mailto:':
print (tag['href'][7:])
For the given HTML document, the Email IDs will be extracted as follows −
sales@company.com
careers@company.com
partner@company.com
In the second example, we assume that the Email IDs appear anywhere in the text. To extract them, we use the regex searching mechanism. Regex is a complex character pattern. Python’s re module helps in processing the regex (Regular Expression) patterns. The following regex pattern is used for searching the email address −
pat = r'[\w.+-]+@[\w-]+\.[\w.-]+'
For this exercise, we shall use the following HTML document, having email IDs in <li> tags.
<html>
<head>
<title>BeautifulSoup - Scraping Email IDs</title>
</head>
<body>
<h2>Contact Us</h2>
<ul>
<li>Sales Enquiries: sales@company.com</li>
<li>Careers: careers@company.com</li>
<li>Partner with us: partner@company.com</li>
</ul>
</body>
</html>
Using the email regex, we'll find the occurrences of the pattern in each <li> tag string. Here is the Python code −
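The sketch below assumes the document above is saved as contact.html −
Example
from bs4 import BeautifulSoup
import re
pat = r'[\w.+-]+@[\w-]+\.[\w.-]+'
fp = open('contact.html')
soup = BeautifulSoup(fp, 'html.parser')
for tag in soup.find_all('li'):
   # scan the text of each <li> tag for the email pattern
   for email in re.findall(pat, tag.get_text()):
      print (email)
Output
sales@company.com
careers@company.com
partner@company.com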
Beautiful Soup - Scrape Nested Tags
The arrangement of tags or elements in an HTML document is hierarchical in nature. The tags are nested up to multiple levels. For example, the <head> and <body> tags are nested inside the <html> tag. Similarly, one or more <li> tags may be inside a <ul> tag. In this chapter, we shall find out how to scrape a tag that has one or more child tags nested in it.
Let us consider the following HTML document −
<div id="outer">
<div id="inner">
<p>Hello<b>World</b></p>
<img src='logo.jpg'>
</div>
</div>
In this case, the two <div> tags and the <p> tag have one or more child elements nested inside. The <img> and <b> tags, on the other hand, do not have any child tags.
The findChildren() method returns a ResultSet of all the children under a tag. So, if a tag doesn’t have any children, the ResultSet will be an empty list like [].
Taking this as a cue, the following code finds out the tags under each tag in the document tree and displays the list.
Example
html = """
<div id="outer">
<div id="inner">
<p>Hello<b>World</b></p>
<img src='logo.jpg'>
</div>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all():
print ("Tag: {} attributes: {}".format(tag.name, tag.attrs))
print ("Child tags: ", tag.findChildren())
print()
Output
Tag: div attributes: {'id': 'outer'}
Child tags: [<div id="inner">
<p>Hello<b>World</b></p>
<img src="logo.jpg"/>
</div>, <p>Hello<b>World</b></p>, <b>World</b>, <img src="logo.jpg"/>]
Tag: div attributes: {'id': 'inner'}
Child tags: [<p>Hello<b>World</b></p>, <b>World</b>, <img src="logo.jpg"/>]
Tag: p attributes: {}
Child tags: [<b>World</b>]
Tag: b attributes: {}
Child tags: []
Tag: img attributes: {'src': 'logo.jpg'}
Child tags: []
Beautiful Soup - Parsing Tables
In addition to textual content, an HTML document may also have structured data in the form of HTML tables. With Beautiful Soup, we can extract the tabular data into Python objects such as a list or a dictionary, store it in databases or spreadsheets if required, and perform processing. In this chapter, we shall parse an HTML table using Beautiful Soup.
Although Beautiful Soup doesn't have any special function or method for extracting table data, we can achieve it with simple scraping techniques. Just like any table, say in SQL or a spreadsheet, an HTML table consists of rows and columns.
HTML has the <table> tag to build a tabular structure. There are one or more nested <tr> tags, one for each row. Each row consists of <td> tags to hold the data in each cell of the row. The first row is usually used for column headings, and the headings are placed in <th> tags instead of <td>.
The following HTML script renders a simple table in the browser window −
<html>
<body>
<h2>Beautiful Soup - Parse Table</h2>
<table border="1">
<tr>
<th>Name</th>
<th>Age</th>
<th>Marks</th>
</tr>
<tr class='data'>
<td>Ravi</td>
<td>23</td>
<td>67</td>
</tr>
<tr class='data'>
<td>Anil</td>
<td>27</td>
<td>84</td>
</tr>
</table>
</body>
</html>
Note that the appearance of the data rows is customized with a CSS class data, in order to distinguish them from the header row.
We shall now see how to parse the table data. First, we obtain the document tree in the BeautifulSoup object. Then collect all the column headers in a list.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup(markup, "html.parser")
tbltag = soup.find('table')
headers = []
headings = tbltag.find_all('th')
for h in headings: headers.append(h.string)
The data row tags with the class='data' attribute following the header row are then fetched. A dictionary object, with the column header as the key and the corresponding cell value as the value, is formed for each row and appended to a list of dict objects.
rows = tbltag.find_all_next('tr', {'class':'data'})
trows=[]
for i in rows:
row = {}
data = i.find_all('td')
n=0
for j in data:
row[headers[n]] = j.string
n+=1
trows.append(row)
A list of dictionary objects is collected in trows. You can then use it for various purposes, such as storing it in a SQL table, or saving it as JSON or a pandas DataFrame object, as sketched below.
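For instance, a pandas DataFrame can be built straight from trows (a sketch, assuming pandas is installed) −
import pandas as pd
# each dictionary in trows becomes one row of the DataFrame
df = pd.DataFrame(trows)
print (df)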
The complete code is given below −
markup = """
<html>
<body>
<p>Beautiful Soup - Parse Table</p>
<table>
<tr>
<th>Name</th>
<th>Age</th>
<th>Marks</th>
</tr>
<tr class='data'>
<td>Ravi</td>
<td>23</td>
<td>67</td>
</tr>
<tr class='data'>
<td>Anil</td>
<td>27</td>
<td>84</td>
</tr>
</table>
</body>
</html>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(markup, "html.parser")
tbltag = soup.find('table')
headers = []
headings = tbltag.find_all('th')
for h in headings: headers.append(h.string)
print (headers)
rows = tbltag.find_all_next('tr', {'class':'data'})
trows=[]
for i in rows:
row = {}
data = i.find_all('td')
n=0
for j in data:
row[headers[n]] = j.string
n+=1
trows.append(row)
print (trows)
Beautiful Soup - Selecting nth Child
HTML is characterized by the hierarchical order of tags. For example, the <html> tag encloses the <body> tag, inside which there may be a <div> tag that may further have <ul> and <li> elements nested in it. The .children property returns an iterator over the child elements directly under an element, and the findChildren() method returns a ResultSet of child tags (descending recursively by default). By traversing the result, you can obtain the child located at a desired position: the nth child.
The code below uses the children property of a <div> tag in the HTML document. Since the return type of the children property is a list iterator, we retrieve a Python list from it. We also need to filter out the whitespace and line break strings from the iterator. Once done, we can fetch the desired child. Here, the child element with index 1 of the <div> tag is displayed.
Example
from bs4 import BeautifulSoup, NavigableString
markup = '''
<div id="Languages">
<p>Java</p> <p>Python</p> <p>C++</p>
</div>
'''
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.div
children = tag.children
childlist = [child for child in children if child not in ['\n', ' ']]
print (childlist[1])
Output
<p>Python</p>
To use findChildren() method instead of children property, change the statement to
children = tag.findChildren()
There will be no change in the output.
A more efficient approach to locating the nth child is the select() method. The select() method uses CSS selectors to obtain the required PageElements from the current element.
The Soup and Tag objects support CSS selectors through their .css property, which is an interface to the CSS selector API. The selector implementation is handled by the Soup Sieve package, which gets installed along with bs4 package.
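For instance, from bs4 version 4.12 onwards the same selectors can be run through the .css property (a minimal sketch) −
tags = soup.css.select('div p')        # same as soup.select('div p')
tag = soup.css.select_one('div p')     # same as soup.select_one('div p')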
The Soup Sieve package defines different types of CSS selectors, namely simple, compound and complex CSS selectors made up of one or more type selectors, ID selectors and class selectors. These selectors are defined in the CSS language.
There are pseudo-class selectors as well in Soup Sieve. A CSS pseudo-class is a keyword added to a selector that specifies a special state of the selected element(s). We shall use the :nth-child pseudo-class selector in this example. Since we need to select the child of the <div> tag at the 2nd position, we shall pass :nth-child(2) to the select_one() method, as sketched below.
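A minimal sketch with the same markup as above −
Example
from bs4 import BeautifulSoup
markup = '''
<div id="Languages">
<p>Java</p> <p>Python</p> <p>C++</p>
</div>
'''
soup = BeautifulSoup(markup, 'html.parser')
# the element at child position 2 inside the <div>
child = soup.select_one('div > :nth-child(2)')
print (child)
Output
<p>Python</p>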
Beautiful Soup - Search by text inside a Tag
Beautiful Soup provides different means to search for a certain text in the given HTML document. Here, we use the string argument of the find() method for the purpose.
In the following example, we use the find() method to search for the word 'by'.
Example
html = '''
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
'''
from bs4 import BeautifulSoup
def search(s):
   # the string filter receives each tag's string (or None)
   return s is not None and 'by' in s
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('p', string=search)
print (tag)
Output
<p> DJs flock by when MTV ax quiz prog.</p>
You can find all occurrences of the word with the find_all() method −
tag = soup.find_all('p', string=search)
print (tag)
Output
[<p> DJs flock by when MTV ax quiz prog.</p>, <p> Junk MTV quiz graced by fox whelps.</p>]
There may be a situation where the required text is somewhere in a child tag deep inside the document tree. We need to first locate a tag which has no further elements and then check whether the required text is in it.
Example
html = '''
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all(lambda tag: len(tag.find_all()) == 0 and "by" in tag.text)
for tag in tags:
print (tag)
Beautiful Soup - Remove HTML Tags
In this chapter, let us see how we can remove all tags from an HTML document. HTML is a markup language made up of predefined tags. A tag marks the text associated with it so that the browser renders it as per the tag's predefined meaning. For example, the word Hello marked with the <b> tag (as in <b>Hello</b>) is rendered in bold face by the browser.
If we want to filter out the raw text between the different tags in an HTML document, we can use either of two methods in the Beautiful Soup library: get_text() or extract().
The get_text() method collects all the raw text part from the document and returns a string. However, the original document tree is not changed.
In the example below, the get_text() method removes all the HTML tags.
Example
html = '''
<html>
<body>
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()
print(text)
Output
The quick, brown fox jumps over a lazy dog.
DJs flock by when MTV ax quiz prog.
Junk MTV quiz graced by fox whelps.
Bawds jog, flick quartz, vex nymphs.
Note that the soup object in the above example still contains the parsed tree of the HTML document.
Another approach is to collect the string enclosed in a Tag object before extracting the tag from the soup object. In HTML, some tags don't have a string property (we can say that tag.string is None for tags such as <html> or <body>). So, we concatenate the strings from all the other tags to obtain the plain text out of the HTML document.
The following program demonstrates this approach.
Example
html = '''
<html>
<body>
<p>The quick, brown fox jumps over a lazy dog.</p>
<p>DJs flock by when MTV ax quiz prog.</p>
<p>Junk MTV quiz graced by fox whelps.</p>
<p>Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all()
string=''
for tag in tags:
#print (tag.name, tag.string)
if tag.string != None:
string=string+tag.string+'\n'
tag.extract()
print ("Document text after removing tags:")
print (string)
print ("Document:")
print (soup)
Output
Document text after removing tags:
The quick, brown fox jumps over a lazy dog.
DJs flock by when MTV ax quiz prog.
Junk MTV quiz graced by fox whelps.
Bawds jog, flick quartz, vex nymphs.
Document:
The clear() method removes the inner string of a tag object but doesn’t return it. Similarly the decompose() method destroys the tag as well as all its children elements. Hence, these methods are not suitable to retrieve the plain text from HTML document.
Beautiful Soup - Remove all Styles
This chapter explains how to remove all styles from an HTML document. Cascading Style Sheets (CSS) are used to control the appearance of different aspects of an HTML document. This includes styling the rendering of text with a specific font, color, alignment, spacing etc. CSS is applied to HTML tags in different ways.
One way is to define the styles in a CSS file and include it in the HTML script with the <link> tag in the <head> section of the document. For example −
Example
<html>
<head>
<link rel="stylesheet" href="style.css">
</head>
<body>
. . .
. . .
</body>
</html>
The different tags in the body part of the HTML script will use the definitions in the style.css file.
Another approach is to define the style configuration inside the <head> part of the HTML document itself. Tags in the body part will be rendered using the definitions provided internally.
Example of internal styling −
<html>
<head>
<style>
p {
text-align: center;
color: red;
}
</style>
</head>
<body>
<p>para1.</p>
<p id="para1">para2</p>
<p>para3</p>
</body>
</html>
In either case, to remove the styles programmatically, simply remove the head tag from the soup object −
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
soup.head.extract()
A third approach is to define the styles inline, by including the style attribute in the tag itself. The style attribute may contain one or more style property definitions such as color, size etc. For example −
<body>
<h1 style="color:blue;text-align:center;">This is a heading</h1>
<p style="color:red;">This is a paragraph.</p>
</body>
To remove such inline styles from an HTML document, you need to check whether the attrs dictionary of a tag object has a style key defined in it, and if yes, delete it.
tags=soup.find_all()
for tag in tags:
if tag.has_attr('style'):
del tag.attrs['style']
print (soup)
The following code removes the inline styles as well as removes the head tag itself, so that the resultant HTML tree will not have any styles left.
html = '''
<html>
<head>
<link rel="stylesheet" href="style.css">
</head>
<body>
<h1 style="color:blue;text-align:center;">This is a heading</h1>
<p style="color:red;">This is a paragraph.</p>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
soup.head.extract()
tags=soup.find_all()
for tag in tags:
if tag.has_attr('style'):
del tag.attrs['style']
print (soup.prettify())
Beautiful Soup - Remove all Scripts
One of the frequently used tags in HTML is the <script> tag. It facilitates embedding a client-side script, such as JavaScript code, in HTML. In this chapter, we will use BeautifulSoup to remove script tags from an HTML document.
The <script> tag has a corresponding </script> tag. Between the two, you may include either a reference to an external JavaScript file, or JavaScript code written inline in the HTML script itself.
To include an external JavaScript file, the syntax used is −
<head>
<script src="javascript.js"></script>
</head>
You can then invoke the functions defined in this file from inside the HTML.
Instead of referring to an external file, you can put JavaScript code inside the HTML between the <script> and </script> tags. If it is placed inside the <head> section of the HTML document, the functionality is available throughout the document tree. On the other hand, if placed anywhere in the <body> section, the JavaScript functions are available from that point onwards.
<body>
<p>Hello World</p>
<script>
alert("Hello World")
</script>
</body>
Removing all script tags with Beautiful Soup is easy. You have to collect the list of all the script tags from the parsed tree and extract them one by one.
Example
html = '''
<html>
<head>
<script src="javascript.js"></script>
</head>
<body>
<p>Hello World</p>
<script>
alert("Hello World")
</script>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all('script'):
tag.extract()
print (soup)
Output
<html>
<head>
</head>
<body>
<p>Hello World</p>
</body>
</html>
You can also use the decompose() method instead of extract(), the difference being that the latter returns the thing that was removed, whereas the former just destroys it. For more concise code, you may also use the list comprehension syntax to obtain the soup object with the script tags removed, as follows −
[tag.decompose() for tag in soup.find_all('script')]
Beautiful Soup - Remove Empty Tags
In HTML, many tags have an opening and a closing tag. Such tags are mostly used for defining formatting properties, such as <b> and </b>, <h1> and </h1> etc. There are also some self-closing tags which have no closing tag and no textual part, for example <img>, <br>, <input> etc. However, while composing HTML, a tag without any text, such as <p></p>, may be inadvertently inserted. We need to remove such empty tags with the help of Beautiful Soup library functions.
Removing textual tags without any text between the opening and closing symbols is easy. You can call the extract() method on a tag if the length of its inner text is 0.
for tag in tags:
if (len(tag.get_text(strip=True)) == 0):
tag.extract()
However, this would also remove tags such as <hr>, <img> and <input>. These are all self-closing or singleton tags. You would not like to remove tags that have one or more attributes, even if there is no text associated with them. So, you'll have to check whether a tag has any attributes in addition to get_text() returning an empty string.
In the following example, the HTML string contains both an empty textual tag and some singleton tags. The code retains tags with attributes but removes the ones without any embedded text.
Example
html ='''
<html>
<body>
<p>Paragraph</p>
<embed type="image/jpg" src="Python logo.jpg" width="300" height="200">
<hr>
<b></b>
<p>
<a href="#">Link</a>
<ul>
<li>One</li>
</ul>
<input type="text" id="fname" name="fname">
<img src="img_orange_flowers.jpg" alt="Flowers">
</body>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
tags =soup.find_all()
for tag in tags:
if (len(tag.get_text(strip=True)) == 0):
if len(tag.attrs)==0:
tag.extract()
print (soup)
Output
<html>
<body>
<p>Paragraph</p>
<embed height="200" src="Python logo.jpg" type="image/jpg" width="300"/>
<p>
<a href="#">Link</a>
<ul>
<li>One</li>
</ul>
<input id="fname" name="fname" type="text"/>
<img alt="Flowers" src="img_orange_flowers.jpg"/>
</p>
</body>
</html>
Note that the original HTML code has a <p> tag without its closing </p>. The parser automatically inserts the closing tag. The position of the closing tag may change if you switch the parser to lxml or html5lib.
Beautiful Soup - Remove Child Elements
An HTML document is a hierarchical arrangement of different tags, where a tag may have one or more tags nested in it at more than one level. How do we remove the child elements of a certain tag? With BeautifulSoup, it is very easy to do.
There are two main methods in the BeautifulSoup library to remove a certain tag: the decompose() method and the extract() method, the difference being that the latter returns the thing that was removed, whereas the former just destroys it.
Hence, to remove the child elements, call the findChildren() method for a given Tag object, and then extract() or decompose() each of them.
Consider the following code segment −
soup = BeautifulSoup(fp, "html.parser")
soup.decompose()
print (soup)
This will destroy the entire soup object itself, i.e. the parsed tree of the document. Obviously, that is not what we want to do.
Now look at the following code −
soup = BeautifulSoup(fp, "html.parser")
tags = soup.find_all()
for tag in tags:
for t in tag.findChildren():
t.extract()
In the document tree, <html> is the first tag, and all the other tags are its children. Hence the first iteration of the loop removes all the tags except <html> and </html>.
This approach is more useful when we want to remove the children of a specific tag. For example, you may want to remove the header row of an HTML table.
The following HTML script has a table whose first <tr> element contains headers marked with the <th> tag.
<html>
<body>
<h2>Beautiful Soup - Remove Child Elements</h2>
<table border="1">
<tr class='header'>
<th>Name</th>
<th>Age</th>
<th>Marks</th>
</tr>
<tr>
<td>Ravi</td>
<td>23</td>
<td>67</td>
</tr>
<tr>
<td>Anil</td>
<td>27</td>
<td>84</td>
</tr>
</table>
</body>
</html>
We can use the following Python code to remove all the child elements of the <tr> tag that contains <th> cells.
Example
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")
tags = soup.find_all('tr', {'class':'header'})
for tag in tags:
for t in tag.findChildren():
t.extract()
print (soup)
Output
<html>
<body>
<h2>Beautiful Soup - Remove Child Elements</h2>
<table border="1">
<tr class="header">
</tr>
<tr>
<td>Ravi</td>
<td>23</td>
<td>67</td>
</tr>
<tr>
<td>Anil</td>
<td>27</td>
<td>84</td>
</tr>
</table>
</body>
</html>
It can be seen that the <th> elements have been removed from the parsed tree.
Beautiful Soup - find vs find_all
The Beautiful Soup library includes the find() and find_all() methods. Both are among the most frequently used methods while parsing HTML or XML documents. From a particular document tree, you often need to locate a PageElement of a certain tag type, or having certain attributes, or having a certain CSS style, etc. These criteria are given as arguments to both the find() and find_all() methods. The main point of difference between the two is that find() locates the very first child element that satisfies the criteria, whereas find_all() searches for all the child elements satisfying the criteria.
The find() method is defined with the following syntax −
Syntax
find(name, attrs, recursive, string, **kwargs)
The name argument specifies a filter on the tag name. With attrs, a filter on tag attribute values can be set up. The recursive argument forces a recursive search if it is True. You can also pass keyword arguments (kwargs) as filters on attribute values.
soup.find(id = 'nm')
soup.find(attrs={"name":'marks'})
The find_all() method accepts all the arguments of the find() method, plus an additional limit argument. It is an integer, restricting the search to the specified number of occurrences of the given filter criteria. If not set, find_all() searches for the criteria among all the children under the said PageElement.
soup.find_all('input')
lst=soup.find_all('li', limit =2)
If the limit argument of find_all() is set to 1, it virtually acts as the find() method.
The return types of the two methods differ. The find() method returns either a Tag object or a NavigableString object, whichever is found first. The find_all() method returns a ResultSet consisting of all the PageElements satisfying the filter criteria.
Here is an example that demonstrates the difference between the find and find_all methods.
Example
from bs4 import BeautifulSoup
markup =open("index.html")
soup = BeautifulSoup(markup, 'html.parser')
ret1 = soup.find('input')
ret2 = soup.find_all ('input')
print (ret1, 'Return type of find:', type(ret1))
print (ret2)
print ('Return type of find_all:', type(ret2))
#set limit =1
ret3 = soup.find_all ('input', limit=1)
print ('find:', ret1)
print ('find_all:', ret3)
Output
<input id="nm" name="name" type="text"/> Return type of find: <class 'bs4.element.Tag'>
[<input id="nm" name="name" type="text"/>, <input id="age" name="age" type="text"/>, <input id="marks" name="marks" type="text"/>]
Return type of find_all: <class 'bs4.element.ResultSet'>
find: <input id="nm" name="name" type="text"/>
find_all: [<input id="nm" name="name" type="text"/>]
Beautiful Soup - Specifying the Parser
An HTML document is parsed into an object of the BeautifulSoup class. The constructor of this class needs a mandatory argument: the HTML string, or a file object pointing to the HTML file. All the other arguments of the constructor are optional, the most important being features.
BeautifulSoup(markup, features)
Here markup is an HTML string or file object. The features parameter specifies the parser to be used. It may be a specific parser such as "lxml", "lxml-xml", "html.parser" or "html5lib"; or the type of markup to be used ("html", "html5", "xml").
If the features argument is not given, BeautifulSoup chooses the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as the best, then html5lib's, then Python's built-in parser.
You can specify one of the following −
The type of markup you want to parse: Beautiful Soup currently supports "html", "xml", and "html5".
The name of the parser library to be used: currently supported options are "lxml", "html5lib", and "html.parser" (Python's built-in HTML parser).
To install the lxml or html5lib parser, use the command −
pip3 install lxml
pip3 install html5lib
These parsers have their respective advantages and disadvantages, as shown below −
Parser: Python’s html.parser
Usage − BeautifulSoup(markup, "html.parser")
Advantages − Batteries included, decent speed
Disadvantages − Not as fast as lxml, less lenient than html5lib
Parser: lxml’s HTML parser
Usage − BeautifulSoup(markup, "lxml")
Advantages − Very fast, lenient
Disadvantages − External C dependency
Parser: lxml’s XML parser
Usage − BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")
Advantages − Very fast, the only currently supported XML parser
Disadvantages − External C dependency
Parser: html5lib
Usage − BeautifulSoup(markup, "html5lib")
Advantages − Extremely lenient, parses pages the same way a web browser does, creates valid HTML5
Disadvantages − Very slow, external Python dependency
Different parsers will create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers. Here's a short document, parsed as HTML −
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("<a><b /></a>", "html.parser")
print (soup)
Output
<a><b></b></a>
An empty <b /> tag is not valid HTML. Hence the parser turns it into a <b></b> tag pair.
The same document is now parsed as XML. Note that the empty <b /> tag is left alone, and that the document is given an XML declaration instead of being put into an <html> tag.
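A minimal sketch of the corresponding code, assuming the lxml package is installed so that the "xml" feature is available −
from bs4 import BeautifulSoup
soup = BeautifulSoup("<a><b /></a>", "xml")
print (soup)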
Output
<?xml version="1.0" encoding="utf-8"?>
<a><b/></a>
In the case of a perfectly-formed HTML document, all HTML parsers produce similar parse trees, although one parser will be faster than another.
However, if the HTML document is not perfect, different types of parsers will produce different results. See how the results differ when "<a></p>" is parsed with different parsers −
lxml parser
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("<a></p>", "lxml")
print (soup)
Output
<html><body><a></a></body></html>
Note that the dangling </p> tag is simply ignored.
html5lib parser
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("<a></p>", "html5lib")
print (soup)
Output
<html><head></head><body><a><p></p></a></body></html>
html5lib pairs it with an opening <p> tag. This parser also adds an empty <head> tag to the document.
Built-in html parser
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("<a></p>", "html.parser")
print (soup)
Output
<a></a>
This parser also ignores the closing </p> tag. But unlike html5lib, it makes no attempt to create a well-formed HTML document by adding a <body> tag; it doesn't even bother to add an <html> tag.
The html5lib parser uses techniques that are part of the HTML5 standard, so it has the best claim on being the "correct" way.
Beautiful Soup - Comparing Objects
As per Beautiful Soup, two NavigableString or Tag objects are equal if they represent the same HTML/XML markup.
Now let us see the example below, where the two <b> tags are treated as equal, even though they live in different parts of the object tree, because they both look like "<b>Java</b>".
Example
from bs4 import BeautifulSoup
markup = "<p>Learn <i>Python</i>, <b>Java</b>, advanced <i>Python</i> and advanced <b>Java</b>! from Tutorialspoint</p>"
soup = BeautifulSoup(markup, "html.parser")
b1 = soup.find('b')
b2 = b1.find_next('b')
print(b1== b2)
print(b1 is b2)
Output
True
False
In the following example, two NavigableString objects are compared.
Example
from bs4 import BeautifulSoup
markup = "<p>Learn <i>Python</i>, <b>Java</b>, advanced <i>Python</i> and advanced <b>Java</b>! from Tutorialspoint</p>"
soup = BeautifulSoup(markup, "html.parser")
i1 = soup.find('i')
i2 = i1.find_next('i')
print(i1.string== i2.string)
print(i1.string is i2.string)
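The two <i> tags hold the same text, so the equality test succeeds, while the identity test fails because they are distinct NavigableString objects. The expected output is −
True
False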
Beautiful Soup - Copying Objects
To create a copy of any Tag or NavigableString, use the copy() function from the copy module in Python's standard library.
Example
from bs4 import BeautifulSoup
import copy
markup = "<p>Learn <b>Python, Java</b>, <i>advanced Python and advanced Java</i>! from Tutorialspoint</p>"
soup = BeautifulSoup(markup, "html.parser")
i1 = soup.find('i')
icopy = copy.copy(i1)
print (icopy)
Output
<i>advanced Python and advanced Java</i>
Although the two copies (the original and the copied one) contain the same markup, they do not represent the same object.
print (i1 == icopy)
print (i1 is icopy)
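Since the copy contains the same markup as the original but is a separate object, the expected output is −
True
False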
Beautiful Soup - Get Tag Position
The Tag object in Beautiful Soup possesses two useful properties that give information about its position in the HTML document. They are −
sourceline − the line number at which the tag is found.
sourcepos − the starting index of the tag within the line in which it is found.
These properties are supported by html.parser (Python's built-in parser) and the html5lib parser. They are not available when you are using the lxml parser.
In the following example, an HTML string is parsed with html.parser and we find the line number and position of the <p> tags in the HTML string.
Example
html = '''
<html>
<body>
<p>Web frameworks</p>
<ul>
<li>Django</li>
<li>Flask</li>
</ul>
<p>GUI frameworks</p>
<ol>
<li>Tkinter</li>
<li>PyQt</li>
</ol>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
p_tags = soup.find_all('p')
for p in p_tags:
print (p.sourceline, p.sourcepos, p.string)
Output
4 0 Web frameworks
9 0 GUI frameworks
For html.parser, these numbers represent the position of the initial less-than sign, which is 0 in this example. It is slightly different when the html5lib parser is used: there, the numbers represent the position of the final greater-than sign.
Example
html = '''
<html>
<body>
<p>Web frameworks</p>
<ul>
<li>Django</li>
<li>Flask</li>
</ul>
<p>GUI frameworks</p>
<ol>
<li>Tkinter</li>
<li>PyQt</li>
</ol>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html5lib')
li_tags = soup.find_all('li')
for l in li_tags:
print (l.sourceline, l.sourcepos, l.string)
Beautiful Soup - Encoding
All HTML or XML documents are written in some specific encoding like ASCII or UTF-8. However, when you load that HTML/XML document into BeautifulSoup, it is converted to Unicode.
Example
from bs4 import BeautifulSoup
markup = "<p>I will display £</p>"
soup = BeautifulSoup(markup, "html.parser")
print (soup.p)
print (soup.p.string)
Output
<p>I will display £</p>
I will display £
The above behavior is because BeautifulSoup internally uses a sub-library called Unicode, Dammit to detect a document's encoding and then convert it into Unicode.
However, Unicode, Dammit does not always guess correctly. As the document is searched byte-by-byte to guess the encoding, this takes a lot of time. You can save some time and avoid mistakes if you already know the encoding, by passing it to the BeautifulSoup constructor as from_encoding.
Below is an example where BeautifulSoup misidentifies an ISO-8859-8 document as ISO-8859-7 −
Example
from bs4 import BeautifulSoup
markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup, 'html.parser')
print (soup.h1)
print (soup.original_encoding)
Output
<h1>翴檛</h1>
ISO-8859-7
To resolve the above issue, pass the encoding to BeautifulSoup using from_encoding −
Example
from bs4 import BeautifulSoup
markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-8")
print (soup.h1)
print (soup.original_encoding)
Output
<h1>םולש</h1>
iso-8859-8
Another feature added in BeautifulSoup 4.4.0 is exclude_encodings. It can be used when you don't know the correct encoding, but are sure that Unicode, Dammit is showing the wrong result.
soup = BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])
Output encoding
The output from BeautifulSoup is a UTF-8 document, irrespective of the document fed to BeautifulSoup. Below is a document with Polish characters in ISO-8859-2 format.
Example
markup = """
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="content-type" CONTENT="text/html; charset=iso-8859-2">
</HEAD>
<BODY>
ą ć ę ł ń ó ś ź ż Ą Ć Ę Ł Ń Ó Ś Ź Ż
</BODY>
</HTML>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-2")
print (soup.prettify())
Output
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
</head>
<body>
ą ć ę ł ń ó ś ź ż Ą Ć Ę Ł Ń Ó Ś Ź Ż
</body>
</html>
In the above example, if you notice, the <meta> tag has been rewritten to reflect that the document generated by BeautifulSoup is now in UTF-8 format.
If you don't want the generated output in UTF-8, you can pass the desired encoding to prettify().
print(soup.prettify("latin-1"))
Output
b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n<html>\n <head>\n <meta content="text/html; charset=latin-1" http-equiv="content-type"/>\n </head>\n <body>\n ą ć ę ł ń \xf3 ś ź ż Ą Ć Ę Ł Ń \xd3 Ś Ź Ż\n </body>\n</html>\n'
In the above example, we encoded the complete document; however, you can also encode any particular element in the soup as if it were a Python string −
soup.p.encode("latin-1")
soup.h1.encode("latin-1")
Output
b'<p>My first paragraph.</p>'
b'<h1>My First Heading</h1>'
Any characters that can't be represented in your chosen encoding will be converted into numeric XML entity references. Below is one such example −
markup = u"<b>\N{SNOWMAN}</b>"
snowman_soup = BeautifulSoup(markup)
tag = snowman_soup.b
print(tag.encode("utf-8"))
Unicode, Dammit
Unicode, Dammit is used mainly when the incoming document is in an unknown format (often a foreign language) and we want to convert it into a known format (Unicode), without needing BeautifulSoup to do all of this.
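As a minimal sketch, the UnicodeDammit class can be used directly; the byte string here is an arbitrary UTF-8 example −
from bs4 import UnicodeDammit

dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!")
print (dammit.unicode_markup)      # the decoded text
print (dammit.original_encoding)   # the encoding Unicode, Dammit guessed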
Beautiful Soup - Output Formatting
If the HTML string given to the BeautifulSoup constructor contains any HTML entities, they will be converted to Unicode characters.
An HTML entity is a string that begins with an ampersand (&) and ends with a semicolon (;). Entities are used to display reserved characters (which would otherwise be interpreted as HTML code). Some examples of HTML entities are −
Entity | Description | Character
&lt; | less than | <
&gt; | greater than | >
&amp; | ampersand | &
&quot; | double quote | "
&apos; | single quote | '
&ldquo; | left double quote | “
&rdquo; | right double quote | ”
&pound; | pound | £
&yen; | yen | ¥
&euro; | euro | €
&copy; | copyright | ©
By default, the only characters that are escaped upon output are bare ampersands and angle brackets. These get turned into "&amp;", "&lt;", and "&gt;". All other entities are converted to Unicode characters.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("Hello “World!”", 'html.parser')
print (str(soup))
Output
Hello “World!”
If you then convert the document to a bytestring, the Unicode characters will be encoded as UTF-8. You won't get the HTML entities back −
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("Hello “World!”", 'html.parser')
print (soup.encode())
Output
b'Hello \xe2\x80\x9cWorld!\xe2\x80\x9d'
To change this behavior, provide a value for the formatter argument of the prettify() method. The following values are possible for the formatter −
formatter="minimal" - 这是默认值。字符串仅会得到足够的处理,以确保 Beautiful Soup 能够生成有效的 HTML/XML
formatter="minimal" − This is the default. Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML
formatter="html" − 只要有可能,Beautiful Soup 将把 Unicode 字符转换为 HTML 实体。
formatter="html" − Beautiful Soup will convert Unicode characters to HTML entities whenever possible.
formatter="html5" - 它类似于 formatter="html”,但 Beautiful Soup 将在 HTML 空标签(如“br”)中省略结束斜杠。
formatter="html5" − it’s similar to formatter="html", but Beautiful Soup will omit the closing slash in HTML void tags like "br"
formatter=None − Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML.
Example
from bs4 import BeautifulSoup
french = "<p>Il a dit <<Sacré bleu!>></p>"
soup = BeautifulSoup(french, 'html.parser')
print ("minimal: ")
print(soup.prettify(formatter="minimal"))
print ("html: ")
print(soup.prettify(formatter="html"))
print ("None: ")
print(soup.prettify(formatter=None))
Output
minimal:
<p>
 Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
</p>
html:
<p>
 Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
</p>
None:
<p>
 Il a dit <<Sacré bleu!>>
</p>
In addition, the Beautiful Soup library provides formatter classes. You can pass an instance of any of these classes as an argument to the prettify() method.
HTMLFormatter class − Used to customize the formatting rules for HTML documents.
XMLFormatter class − Used to customize the formatting rules for XML documents.
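For instance, Beautiful Soup sorts the attributes of every tag alphabetically on output. A short sketch of an HTMLFormatter subclass (the class name here is our own) that overrides attributes() to keep the original order −
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter

class UnsortedAttributes(HTMLFormatter):
    # Yield the attributes in their original order instead of sorting them
    def attributes(self, tag):
        for k, v in tag.attrs.items():
            yield k, v

soup = BeautifulSoup('<p z="1" m="2" a="3">text</p>', 'html.parser')
print (soup.p.encode(formatter=UnsortedAttributes()))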
Beautiful Soup - Pretty Printing
To display the entire parsed tree of an HTML document or the contents of a specific tag, you can use the print() function, or call the str() function.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("<h1>Hello World</h1>", "lxml")
print ("Tree:",soup)
print ("h1 tag:",str(soup.h1))
Output
Tree: <html><body><h1>Hello World</h1></body></html>
h1 tag: <h1>Hello World</h1>
The str() function returns a normal Python string. (To get a bytestring encoded in UTF-8, use the encode() method instead.)
To get a nicely formatted Unicode string, use Beautiful Soup's prettify() method. It formats the Beautiful Soup parse tree so that each tag is on its own separate line with indentation. It allows you to easily visualize the structure of the Beautiful Soup parse tree.
Consider the following HTML string.
<p>The quick, <b>brown fox</b> jumps over a lazy dog.</p>
Using the prettify() method, we can better understand its structure −
html = '''
<p>The quick, <b>brown fox</b> jumps over a lazy dog.</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
print (soup.prettify())
Output
<html>
<body>
<p>
The quick,
<b>
brown fox
</b>
jumps over a lazy dog.
</p>
</body>
</html>
You can call prettify() on any of the Tag objects in the document.
print (soup.b.prettify())
Output
<b>
brown fox
</b>
The prettify() method is for understanding the structure of the document. However, it should not be used to reformat it, as it adds whitespace (in the form of newlines) and changes the meaning of an HTML document.
The prettify() method can optionally be given a formatter argument to specify the formatting to be used.
Beautiful Soup - NavigableString Class
One of the main objects prevalent in the Beautiful Soup API is the object of the NavigableString class. It represents the string or text between the opening and closing counterparts of most HTML tags. For example, if <b>Hello</b> is the markup to be parsed, Hello is the NavigableString.
The NavigableString class is subclassed from the PageElement class in the bs4 package, as well as from Python's built-in str class. Hence, it inherits PageElement methods such as find_*(), insert, append, wrap and unwrap, as well as methods of the str class such as upper, lower, find, isalpha etc.
The constructor of this class takes a single argument, a str object.
Example
from bs4 import NavigableString
new_str = NavigableString('world')
You can now use this NavigableString object to perform all kinds of operations on the parsed tree, such as append, insert, find etc.
In the following example, we append the newly created NavigableString object to an existing Tag object.
Example
from bs4 import BeautifulSoup, NavigableString
markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b
new_str = NavigableString('world')
tag.append(new_str)
print (soup)
Output
<b>Helloworld</b>
Note that a NavigableString is a PageElement, hence it can also be appended to the soup object. Check the difference if we do so −
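A minimal sketch of that step: make a fresh soup and append the NavigableString directly to the soup object −
from bs4 import BeautifulSoup, NavigableString
markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')
soup.append(NavigableString('world'))
print (soup)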
Output
<b>Hello</b>world
As we can see, the string now appears after the <b> tag.
Beautiful Soup also offers a new_string() method, which creates a new NavigableString associated with the given BeautifulSoup object.
Let us use the new_string() method to create a NavigableString object, and add it to the PageElements.
Example
from bs4 import BeautifulSoup, NavigableString
markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b
ns=soup.new_string(' World')
tag.append(ns)
print (tag)
soup.append(ns)
print (soup)
Output
<b>Hello World</b>
<b>Hello</b> World
We find an interesting behaviour here. The NavigableString object is first appended to a tag inside the tree, and then to the soup object itself. While the tag initially shows the appended string, once the same NavigableString is appended to the soup object, the text World moves to the end of the document and no longer shows inside the tag. This is because a PageElement can occupy only one position in the tree, so appending it elsewhere moves it.
Beautiful Soup - Convert Object to String
The Beautiful Soup API has three main types of objects: the soup object, the Tag object, and the NavigableString object. Let us find out how we can convert each of these objects to a string. In Python, a string is a str object.
Assume that we have the following HTML document −
html = '''
<p>Hello <b>World</b></p>
'''
Let us pass this string as an argument to the BeautifulSoup constructor. The soup object can then be cast to a string object with Python's built-in str() function.
The parsed tree of this HTML string will be constructed depending upon which parser you use. The built-in html parser doesn't add the <html> and <body> tags −
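A minimal sketch of this step −
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print (str(soup))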
Output
<p>Hello <b>World</b></p>
On the other hand, the html5lib parser constructs the tree after inserting the formal tags such as <html> and <body> −
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html5lib')
print (str(soup))
Output
<html><head></head><body><p>Hello <b>World</b></p>
</body></html>
The Tag object has a string property that returns a NavigableString object −
tag = soup.find('b')
obj = (tag.string)
print (type(obj),obj)
Output
<class 'bs4.element.NavigableString'> World
There is also a text property defined for the Tag object. It returns the text contained in the tag, stripping off all the inner tags and attributes.
If the HTML string is −
html = '''
<p>Hello <div id='id'>World</div></p>
'''
We try to obtain the text property of the <p> tag −
tag = soup.find('p')
obj = (tag.text)
print ( type(obj), obj)
Output
<class 'str'> Hello World
You can also use the get_text() method, which returns a string representing the text inside the tag. The method is actually a wrapper around the text property, as it also gets rid of the inner tags and attributes, and returns a string −
obj = tag.get_text()
print (type(obj),obj)
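Assuming the same <p> tag as above, this is expected to print −
<class 'str'> Hello World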
Beautiful Soup - Convert HTML to Text
One of the important and frequently required applications of a web scraping library such as Beautiful Soup is to extract text from an HTML script. You may need to discard all the tags, along with the attributes associated with each tag, and separate out the raw text of the document. The get_text() method in Beautiful Soup is suitable for this purpose.
Here is a basic example demonstrating the usage of the get_text() method. You get all the text from the HTML document, with all the tags removed.
Example
html = '''
<html>
<body>
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()
print(text)
Output
The quick, brown fox jumps over a lazy dog.
DJs flock by when MTV ax quiz prog.
Junk MTV quiz graced by fox whelps.
Bawds jog, flick quartz, vex nymphs.
The get_text() method has an optional separator argument. In the following example, we specify the separator argument of the get_text() method as '#'.
html = '''
<p>The quick, brown fox jumps over a lazy dog.</p>
<p>DJs flock by when MTV ax quiz prog.</p>
<p>Junk MTV quiz graced by fox whelps.</p>
<p>Bawds jog, flick quartz, vex nymphs.</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(separator='#')
print(text)
Output
#The quick, brown fox jumps over a lazy dog.#
#DJs flock by when MTV ax quiz prog.#
#Junk MTV quiz graced by fox whelps.#
#Bawds jog, flick quartz, vex nymphs.#
The get_text() method has another argument, strip, which can be True or False. Let us check the effect of the strip parameter when it is set to True. By default it is False.
html = '''
<p>The quick, brown fox jumps over a lazy dog.</p>
<p>DJs flock by when MTV ax quiz prog.</p>
<p>Junk MTV quiz graced by fox whelps.</p>
<p>Bawds jog, flick quartz, vex nymphs.</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(strip=True)
print(text)
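Since strip=True strips the whitespace from each string before they are joined (with an empty separator by default), the sentences are expected to run together −
The quick, brown fox jumps over a lazy dog.DJs flock by when MTV ax quiz prog.Junk MTV quiz graced by fox whelps.Bawds jog, flick quartz, vex nymphs.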
Beautiful Soup - Parsing XML
BeautifulSoup can also parse an XML document. You need to pass the features='xml' argument to the BeautifulSoup() constructor.
Assume that we have the following books.xml in the current working directory −
Example
<?xml version="1.0" ?>
<books>
<book>
<title>Python</title>
<author>TutorialsPoint</author>
<price>400</price>
</book>
</books>
The following code parses the given XML file −
from bs4 import BeautifulSoup
fp = open("books.xml")
soup = BeautifulSoup(fp, features="xml")
print (soup)
print ('type:', type(soup))
When the above code is executed, you should get the following result −
<?xml version="1.0" encoding="utf-8"?>
<books>
<book>
<title>Python</title>
<author>TutorialsPoint</author>
<price>400</price>
</book>
</books>
type: <class 'bs4.BeautifulSoup'>
XML parser Error
By default, the BeautifulSoup package parses documents as HTML; however, it is very easy to use, and it handles ill-formed XML very elegantly.
To parse a document as XML, you need the lxml parser, and you just have to pass "xml" as the second argument to the BeautifulSoup constructor −
soup = BeautifulSoup(markup, "lxml-xml")
or
soup = BeautifulSoup(markup, "xml")
One common XML parsing error is −
AttributeError: 'NoneType' object has no attribute 'attrib'
This might happen when some element is missing or not defined while using the find() or find_all() functions.
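Continuing with the books.xml soup above, a small defensive sketch: check the result of find() before using it, since a missing element yields None (the <publisher> tag here is deliberately absent) −
tag = soup.find('publisher')
if tag is not None:
    print (tag.string)
else:
    print ('tag not found')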
Beautiful Soup - Error Handling
While trying to parse an HTML/XML document with Beautiful Soup, you may encounter errors, not from your script but from the structure of the snippet, because the BeautifulSoup API throws an error.
Apart from the parsing errors mentioned above, you may encounter other issues, such as environmental problems where your script works in one operating system but not in another, or works in one virtual environment but not in another, or does not work outside the virtual environment. All these issues may arise because the two environments have different parser libraries available.
It is recommended to know or check the default parser in your current working environment. You can check the current default parser available for the working environment, or else explicitly pass the required parser library as the second argument to the BeautifulSoup constructor.
As HTML tags and attributes are case-insensitive, all three HTML parsers convert tag and attribute names to lowercase. However, if you want to preserve mixed-case or uppercase tags and attributes, then it is better to parse the document as XML.
UnicodeEncodeError
Consider a script that raises the following error −
Output
UnicodeEncodeError: 'charmap' codec can't encode character '\u011f'
The above problem may arise in two main situations. You might be trying to print a Unicode character that your console doesn't know how to display. Or, you are trying to write to a file and you pass in a Unicode character that's not supported by your default encoding.
One way to resolve the above problem is to encode the response text/characters before making the soup, to get the desired result, as follows −
responseTxt = response.text.encode('UTF-8')
KeyError: [attr]
This error is caused by accessing tag['attr'] when the tag in question doesn't define the attr attribute. The most common errors are "KeyError: 'href'" and "KeyError: 'class'". Use tag.get('attr') if you are not sure that attr is defined.
for item in soup.find_all('a'):
    try:
        if (item['href'].startswith('/') or "tutorialspoint" in item['href']):
            (...)
    except KeyError:
        pass # or some other fallback action
AttributeError
You may encounter an AttributeError as follows −
AttributeError: 'list' object has no attribute 'find_all'
The above error occurs mainly because you expected find_all() to return a single tag or string. However, soup.find_all() returns a Python list of elements.
All you need to do is iterate through the list and fetch the data from those elements.
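For example, a small hedged sketch, assuming the document contains <a> tags −
for tag in soup.find_all('a'):
    print (tag.get('href'))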
To avoid the above errors while parsing, the offending result can be skipped, to make sure that a malformed snippet isn't inserted into the database −
try:
    ...  # process the parsed result here
except (AttributeError, KeyError) as er:
    pass
Beautiful Soup - Trouble Shooting
If you run into problems while trying to parse an HTML/XML document, it is most likely because of how the parser in use is interpreting the document. To help you locate and correct the problem, the Beautiful Soup API provides a diagnose() utility.
The diagnose() method in Beautiful Soup is a diagnostic suite for isolating common problems. If you're facing difficulty in understanding what Beautiful Soup is doing to a document, pass the document as an argument to the diagnose() function. A report shows you how the different parsers handle the document, and tells you if a parser is missing.
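A minimal usage sketch, assuming the markup to be examined is read from a local file (the filename here is arbitrary) −
from bs4.diagnose import diagnose

with open("index.html", "rb") as fp:
    data = fp.read()
diagnose(data)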
The diagnose() method is defined in the bs4.diagnose module. Its output starts with a message as follows −
Output
Diagnostic running on Beautiful Soup 4.12.2
Python version 3.11.2 (tags/v3.11.2:878ead1, Feb 7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)]
Found lxml version 4.9.2.0
Found html5lib version 1.1
Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
If any of these parsers is not found, a corresponding message also appears −
I noticed that html5lib is not installed. Installing it may help.
If the HTML document fed to the diagnose() method is perfectly formed, the tree parsed by any of the parsers will be identical. However, if it is not properly formed, different parsers interpret it differently. If you don't get the tree you anticipate, changing the parser might help.
Sometimes, you may have chosen an HTML parser for an XML document. The HTML parsers then add all the HTML tags while parsing the document incorrectly. Looking at the output, you will realize the error, which helps in correcting it.
If Beautiful Soup raises HTMLParser.HTMLParseError, try changing the parser.
The parse errors HTMLParser.HTMLParseError: malformed start tag and HTMLParser.HTMLParseError: bad end tag are both generated by Python's built-in HTML parser library, and the solution is to install lxml or html5lib.
If you encounter SyntaxError: Invalid syntax (on the line ROOT_TAG_NAME = '[document]'), it is caused by running the old Python 2 version of Beautiful Soup under Python 3, without converting the code.
An ImportError with the message No module named HTMLParser is caused by running the old Python 2 version of Beautiful Soup under Python 3.
While ImportError: No module named html.parser is caused by running the Python 3 version of Beautiful Soup under Python 2.
If you get ImportError: No module named BeautifulSoup, more often than not it is because you are running Beautiful Soup 3 code on a system that doesn't have BS3 installed, or writing Beautiful Soup 4 code without knowing that the package name has changed to bs4.
Finally, ImportError: No module named bs4 is due to the fact that you are trying to run Beautiful Soup 4 code on a system that doesn't have BS4 installed.
Beautiful Soup - Porting Old Code
You can make code written for an earlier version of Beautiful Soup compatible with the latest version by making the following change in the import statement −
Example
from BeautifulSoup import BeautifulSoup
#becomes this:
from bs4 import BeautifulSoup
If you get the ImportError "No module named BeautifulSoup", it means you're trying to run Beautiful Soup 3 code, but you only have Beautiful Soup 4 installed. Similarly, if you get the ImportError "No module named bs4", you're trying to run Beautiful Soup 4 code, but you only have Beautiful Soup 3 installed.
Beautiful Soup 3 used Python's SGMLParser, a module that was removed in Python 3.0. Beautiful Soup 4 uses html.parser by default, but you can also use lxml or html5lib.
Although BS4 is mostly backwards-compatible with BS3, most of its methods have been deprecated and given new names for PEP 8 compliance.
Here are a few examples −
replaceWith -> replace_with
findAll -> find_all
findNext -> find_next
findParent -> find_parent
findParents -> find_parents
findPrevious -> find_previous
getText -> get_text
nextSibling -> next_sibling
previousSibling -> previous_sibling
Beautiful Soup - contents Property
Method Description
The contents property is available on the Soup object as well as the Tag object. It returns a list of everything contained inside the object: all the immediate child elements and text nodes (i.e. NavigableStrings).
Return value
The contents property returns a list of the child elements and strings in the Tag/Soup object.
Example 1
Contents of a Tag object −
from bs4 import BeautifulSoup
markup = '''
<div id="Languages">
<p>Java</p>
<p>Python</p>
<p>C++</p>
</div>
'''
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.div
print (tag.contents)
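The <div> here contains the three <p> elements interleaved with newline text nodes, so the printed list is expected to look like −
['\n', <p>Java</p>, '\n', <p>Python</p>, '\n', <p>C++</p>, '\n']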
Example 2
Contents of the entire document −
from bs4 import BeautifulSoup, NavigableString
markup = '''
<div id="Languages">
<p>Java</p> <p>Python</p> <p>C++</p>
</div>
'''
soup = BeautifulSoup(markup, 'html.parser')
print (soup.contents)
Example 3
Note that a NavigableString object doesn't have the contents property. It throws an AttributeError if we try to access it.
from bs4 import BeautifulSoup, NavigableString
markup = '''
<div id="Languages">
<p>Java</p> <p>Python</p> <p>C++</p>
</div>
'''
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.p
s=tag.contents[0]
print (s.contents)
Output
Traceback (most recent call last):
File "C:\Users\user\BeautifulSoup\2.py", line 11, in <module>
print (s.contents)
^^^^^^^^^^
File "C:\Users\user\BeautifulSoup\Lib\site-packages\bs4\element.py", line 984, in __getattr__
raise AttributeError(
AttributeError: 'NavigableString' object has no attribute 'contents'
Beautiful Soup - children Property
Method Description
The Tag object in the Beautiful Soup library has a children property. It returns a generator used to iterate over the immediate child elements and text nodes (i.e. NavigableStrings).
Return value
The property returns a generator with which you can iterate over the direct children of the PageElement.
Example 1
from bs4 import BeautifulSoup, NavigableString
markup = '''
<div id="Languages">
<p>Java</p> <p>Python</p> <p>C++</p>
</div>
'''
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.div
children = tag.children
for child in children:
print (child)
Example 2
The soup object too bears the children property −
from bs4 import BeautifulSoup, NavigableString
markup = '''
<div id="Languages">
<p>Java</p> <p>Python</p> <p>C++</p>
</div>
'''
soup = BeautifulSoup(markup, 'html.parser')
children = soup.children
for child in children:
print (child)
Example 3
In the following example, we append strings to the <p> tag and get the list of children.
from bs4 import BeautifulSoup, NavigableString
markup = '''
<div id="Languages">
<p>Java</p> <p>Python</p> <p>C++</p>
</div>
'''
soup = BeautifulSoup(markup, 'html.parser')
soup.p.extend(['and', 'JavaScript'])
children = soup.p.children
for child in children:
print (child)
Beautiful Soup - string Property
Method Description
In Beautiful Soup, the soup and Tag objects have a convenience property: the string property. It returns a single string within a PageElement, Soup or Tag. If this element has a single NavigableString child, the NavigableString corresponding to it is returned. If this element has exactly one child tag, the return value is the 'string' attribute of that child tag. If the element has more than one child, the string property returns None.
Example 1
The following code has an HTML string with a <div> tag that encloses three <p> elements. We find the string property of the first <p> tag.
from bs4 import BeautifulSoup, NavigableString
markup = '''
<div id="Languages">
<p>Java</p> <p>Python</p> <p>C++</p>
</div>
'''
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.p
navstr = tag.string
print (navstr, type(navstr))
nav_str = str(navstr)
print (nav_str, type(nav_str))
Output
Java <class 'bs4.element.NavigableString'>
Java <class 'str'>
The string property returns a NavigableString. It can be cast to a regular Python string with the str() function.
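Continuing with the same soup, the <div> tag has more than one child, so its string property yields None −
tag = soup.div
print (tag.string)   # prints None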
Beautiful Soup - strings Property
Method Description
For any PageElement having more than one child, the inner text of each child can be fetched with the strings property. Unlike the string property, strings handles the case where the element contains multiple children. The strings property returns a generator object, yielding a sequence of NavigableStrings corresponding to each of the child elements.
Example 1
You can retrieve the value of the strings property for the soup as well as a Tag object. In the following example, the soup object's strings property is checked.
from bs4 import BeautifulSoup, NavigableString
markup = '''
<div id="Languages">
<p>Java</p> <p>Python</p> <p>C++</p>
</div>
'''
soup = BeautifulSoup(markup, 'html.parser')
print ([string for string in soup.strings])
Output
['\n', '\n', 'Java', ' ', 'Python', ' ', 'C++', '\n', '\n']
Note the line breaks and whitespaces in the list. We can remove them with the stripped_strings property.
Beautiful Soup - stripped_strings Property
Method Description
The stripped_strings property of a Tag/Soup object gives a result similar to the strings property, except that the extra line breaks and whitespaces are stripped off. Hence, it can be said that the stripped_strings property results in a generator of NavigableString objects of the inner elements belonging to the object in use.
Example 1
In the example below, the strings of all the elements in the document tree parsed into a BeautifulSoup object are displayed after the stripping is applied.
from bs4 import BeautifulSoup, NavigableString
markup = '''
<div id="Languages">
<p>Java</p> <p>Python</p> <p>C++</p>
</div>
'''
soup = BeautifulSoup(markup, 'html.parser')
print ([string for string in soup.stripped_strings])
Output
['Java', 'Python', 'C++']
Compared to the output of the strings property, you can see that the line breaks and whitespaces are removed.
Beautiful Soup - descendants Property
Method Description
With the descendants property of a PageElement object in the Beautiful Soup API, you can traverse the list of all children under it. This property returns a generator object, which yields the child elements in document order − a depth-first, pre-order traversal in which each element is followed by its own children before its next sibling is visited.
Example 1
In the code below, we have an HTML document with nested unordered list tags. We iterate over the parsed child elements in document order.
html = '''
<ul id='outer'>
<li class="mainmenu">Accounts</li>
<ul>
<li class="submenu">Anand</li>
<li class="submenu">Mahesh</li>
</ul>
<li class="mainmenu">HR</li>
<ul>
<li class="submenu">Anil</li>
<li class="submenu">Milind</li>
</ul>
</ul>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tags = soup.descendants
for desc in tags:
   print (desc)
Output
<ul id="outer">
<li class="mainmenu">Accounts</li>
<ul>
<li class="submenu">Anand</li>
<li class="submenu">Mahesh</li>
</ul>
<li class="mainmenu">HR</li>
<ul>
<li class="submenu">Anil</li>
<li class="submenu">Milind</li>
</ul>
</ul>
<li class="mainmenu">Accounts</li>
Accounts
<ul>
<li class="submenu">Anand</li>
<li class="submenu">Mahesh</li>
</ul>
<li class="submenu">Anand</li>
Anand
<li class="submenu">Mahesh</li>
Mahesh
<li class="mainmenu">HR</li>
HR
<ul>
<li class="submenu">Anil</li>
<li class="submenu">Milind</li>
</ul>
<li class="submenu">Anil</li>
Anil
<li class="submenu">Milind</li>
Milind
Example 2
In the following example, we list out the descendants of the <head> tag −
html = """
<html><head><title>TutorialsPoint</title></head>
<body>
<p>Hello World</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.head
for element in tag.descendants:
   print (element)
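Output
<title>TutorialsPoint</title>
TutorialsPoint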
Beautiful Soup - parent Property
Method Description
The parent property in the BeautifulSoup library returns the immediate parent element of the given PageElement as a Tag object. For a top-level tag such as <html>, the parent is the BeautifulSoup object itself, whose name appears as [document]; the parent of the BeautifulSoup object is None.
Return value
The parent property returns a Tag object. For a top-level tag, it returns the BeautifulSoup object; for the soup object itself, it returns None.
Example 1
This example uses the .parent property to find the immediate parent element of the first <p> tag in the example HTML string.
html = """
<html>
<head>
<title>TutorialsPoint</title>
</head>
<body>
<p>Hello World</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.p
print (tag.parent.name)
Example 2
In the following example, we see that the <title> tag is enclosed inside the <head> tag. Hence, the parent property of the <title> tag returns the <head> tag.
html = """
<html>
<head>
<title>TutorialsPoint</title>
</head>
<body>
<p>Hello World</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.title
print (tag.parent)
Example 3
The behaviour of Python's built-in HTML parser is a little different from the html5lib and lxml parsers. The built-in parser doesn't try to build a complete document out of the string provided; it doesn't add missing parent tags such as body or html if they are not present in the string. The html5lib and lxml parsers, on the other hand, add these tags to make the document a well-formed HTML document.
html = """
<p><b>Hello World</b></p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print (soup.p.parent.name)
soup = BeautifulSoup(html, 'html5lib')
print (soup.p.parent.name)
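Output
[document]
body
The built-in parser reports the soup object ([document]) as the parent of <p>, while html5lib inserts a <body> tag in between.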
Beautiful Soup - parents Property
Method Description
The parents property in the BeautifulSoup library retrieves all the parent elements of the given PageElement recursively. The value returned by the parents property is a generator, with the help of which we can list out the parents from the immediate parent upward.
Example 1
This example uses .parents to travel from a tag buried inside the document to the very top of the document. In the following code, we track the parents of the first <p> tag in the example HTML string.
html = """
<html><head><title>TutorialsPoint</title></head>
<body>
<p>Hello World</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.p
for element in tag.parents:
   print (element.name)
Output
body
html
[document]
Note that the last parent yielded is the BeautifulSoup object itself, shown as [document].
Example 2
In the following example, we see that the <b> tag is enclosed inside a <p> tag, and the two div tags above it both have an id attribute. We try to print only those parents that have an id attribute; the has_attr() method is used for the purpose.
html = """
<div id="outer">
<div id="inner">
<p>Hello<b>World</b></p>
</div>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.b
for parent in tag.parents:
   if parent.has_attr("id"):
      print (parent["id"])
Beautiful Soup - next_sibling Property
Method Description
The HTML tags appearing at the same indentation level are called siblings. The next_sibling property of a PageElement returns the next tag at the same level, i.e. under the same parent.
Return type
The next_sibling property returns a Tag or a NavigableString object.
Example 1
The index.html web page consists of an HTML form with three input elements, each with a name attribute. In the following example, the next sibling of the input tag whose id is nm is located.
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.find('input', {'id':'nm'})
sib = tag.next_sibling
print (sib)
Example 2
In the next example, we have an HTML document with a couple of tags inside a <p> tag. The next_sibling property returns the tag next to the <b> tag in it.
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')
tag1 = soup.b
print ("next:",tag1.next_sibling)
Example 3
Consider the HTML string in the following document. It has two <p> tags at the same level. The next_sibling of the first <p> should give the second <p> tag's contents.
html = '''
<p><b>Hello</b><i>Python</i></p>
<p>TutorialsPoint</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag1 = soup.p
print ("next:",tag1.next_sibling)
Output
next:
The blank line after the word next: is unexpected. It appears because of the \n character after the first <p> tag. Change the print statement as shown below to obtain the contents of the next sibling −
tag1 = soup.p
print ("next:",tag1.next_sibling.next_sibling)
Beautiful Soup - previous_sibling Property
Method Description
The HTML tags appearing at the same indentation level are called siblings. The previous_sibling property of a PageElement returns the previous tag (the tag appearing before the current tag) at the same level, i.e. under the same parent. This property encapsulates the find_previous_sibling() method.
Return type
The previous_sibling property returns a Tag or a NavigableString object.
Example 1
In the following code, the HTML string consists of two adjacent tags inside a <p> tag. It shows the sibling appearing before the <i> tag.
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')
tag = soup.i
sibling = tag.previous_sibling
print (sibling)
Example 2
We are using the index.html file for parsing. The page contains an HTML form with three input elements. Which element is the previous sibling of the input element whose id attribute is age? The following code shows it −
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.find('input', {'id':'age'})
sib = tag.previous_sibling.previous_sibling
print (sib)
Example 3
First we find the <p> tag containing the string 'Tutorial' and then find the tag previous to it.
html = '''
<p>Excellent</p><p>Python</p><p>Tutorial</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('p', string='Tutorial')
print (tag.previous_sibling)
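Output
<p>Python</p>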
Beautiful Soup - next_siblings Property
Method Description
The HTML tags appearing at the same indentation level are called siblings. The next_siblings property in Beautiful Soup returns a generator object used to iterate over all the subsequent tags and strings under the same parent.
Return type
The next_siblings property returns a generator of sibling PageElements.
Example 1
The HTML form in index.html contains three input elements. The following script uses the next_siblings property to collect the next siblings of the input element whose id attribute is nm −
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.find('input', {'id':'nm'})
siblings = tag.next_siblings
print (list(siblings))
Output
['\n', <input id="age" name="age" type="text"/>, '\n', <input id="marks" name="marks" type="text"/>, '\n']
Example 2
In the following example, the HTML string has three tags inside a <p> tag. We use the next_siblings property to traverse the tags appearing after the <b> tag.
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", 'html.parser')
tag1 = soup.b
print ("next siblings:")
for tag in tag1.next_siblings:
   print (tag)
Example 3
The next example shows that the <head> tag has only one next sibling tag, namely <body>.
html = '''
<html>
<head>
<title>Hello</title>
</head>
<body>
<p>Excellent</p><p>Python</p><p>Tutorial</p>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tags = soup.head.next_siblings
print ("next siblings:")
for tag in tags:
   print (tag)
Beautiful Soup - previous_siblings Property
Method Description
The HTML tags appearing at the same indentation level are called siblings. The previous_siblings property in Beautiful Soup returns a generator object used to iterate over all the tags and strings appearing before the current tag, under the same parent. It gives output similar to the find_previous_siblings() method.
Return type
The previous_siblings property returns a generator of sibling PageElements.
Example 1
The following example parses the given HTML string, which has a few tags embedded inside the outer <p> tag. The previous siblings of the <u> tag are fetched with the help of the previous_siblings property.
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", 'html.parser')
tag1 = soup.u
print ("previous siblings:")
for tag in tag1.previous_siblings:
   print (tag)
Example 2
In the index.html file used in the following example, there are three input elements in the HTML form. We find out which sibling tags under the <form> tag appear before the input element whose id is set to marks.
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.find('input', {'id':'marks'})
sibs = tag.previous_siblings
print ("previous siblings:")
for sib in sibs:
   print (sib)
Output
previous siblings:
<input id="age" name="age" type="text"/>
<input id="nm" name="name" type="text"/>
Example 3
The top level <html> tag always has two child tags − head and body. Hence, the <body> tag has only one previous sibling, namely head, as the following code shows −
html = '''
<html>
<head>
<title>Hello</title>
</head>
<body>
<p>Excellent</p><p>Python</p><p>Tutorial</p>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tags = soup.body.previous_siblings
print ("previous siblings:")
for tag in tags:
   print (tag)
Beautiful Soup - next_element Property
Method Description
In the Beautiful Soup library, the next_element property returns the Tag or NavigableString that appears immediately after the current PageElement, even if it is outside the parent tree. There is also a next property with similar behaviour.
Return value
The next_element and next properties return a Tag or a NavigableString appearing immediately after the current tag.
Example 1
In the document tree parsed from the given HTML string, we find the next_element of the <b> tag.
html = '''
<p><b>Excellent</b><p>Python</p><p id='id1'>Tutorial</p></p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
tag = soup.b
print (tag)
nxt = tag.next_element
print ("Next:",nxt)
nxt = tag.next_element.next_element
print ("Next:",nxt)
Output
<b>Excellent</b>
Next: Excellent
Next: <p>Python</p>
The output is a little strange, as the next element of <b>Excellent</b> is shown to be 'Excellent'; that is because the inner string is registered as the next element. To obtain the desired result (<p>Python</p>) as the next element, fetch the next_element property of the inner NavigableString object.
Example 2
BeautifulSoup PageElements also support the next property, which is analogous to the next_element property.
html = '''
<p><b>Excellent</b><p>Python</p><p id='id1'>Tutorial</p></p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
tag = soup.b
print (tag)
nxt = tag.next
print ("Next:",nxt)
nxt = tag.next.next
print ("Next:",nxt)
Example 3
In the next example, we try to determine the element next to the <body> tag. As the tag is followed by a line break (\n), we need to find the next element of the element next to the body tag. It happens to be the <h1> tag.
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.find('body')
nxt = tag.next_element.next
print ("Next:",nxt)
Beautiful Soup - previous_element Property
Method Description
In the Beautiful Soup library, the previous_element property returns the Tag or NavigableString that appears immediately before the current PageElement, even if it is outside the parent tree. There is also a previous property with similar behaviour.
Return value
The previous_element and previous properties return a Tag or a NavigableString appearing immediately before the current tag.
Example 1
In the document tree parsed from the given HTML string, we find the previous_element of the <p id='id1'> tag.
html = '''
<p><b>Excellent</b><p>Python</p><p id='id1'>Tutorial</p></p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
tag = soup.find('p', id='id1')
print (tag)
pre = tag.previous_element
print ("Previous:",pre)
pre = tag.previous_element.previous_element
print ("Previous:",pre)
Output
<p id="id1">Tutorial</p>
Previous: Python
Previous: <p>Python</p>
The output is a little strange, as the previous element is shown to be 'Python'; that is because the inner string is registered as the previous element. To obtain the desired result (<p>Python</p>) as the previous element, fetch the previous_element property of the inner NavigableString object.
Example 2
BeautifulSoup PageElements also support the previous property, which is analogous to the previous_element property.
html = '''
<p><b>Excellent</b><p>Python</p><p id='id1'>Tutorial</p></p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
tag = soup.find('p', id='id1')
print (tag)
pre = tag.previous
print ("Previous:",pre)
pre = tag.previous.previous
print ("Previous:",pre)
Example 3
In the next example, we try to determine the element appearing before the <input> tag whose id attribute is 'age'.
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html5lib')
tag = soup.find('input', id='age')
pre = tag.previous_element.previous
print ("Previous:",pre)
Beautiful Soup - next_elements Property
Method Description
In the Beautiful Soup library, the next_elements property returns a generator object yielding the subsequent strings and tags in the parse tree.
Example 1
The next_elements property returns the tags and NavigableStrings appearing after the <b> tag in the document string below −
html = '''
<p><b>Excellent</b><p>Python</p><p id='id1'>Tutorial</p></p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('b')
nexts = tag.next_elements
print ("Next elements:")
for next in nexts:
   print (next)
Example 2
All the elements appearing after the <p> tag are listed below −
from bs4 import BeautifulSoup
html = '''
<p>
<b>Excellent</b><i>Python</i>
</p>
<u>Tutorial</u>
'''
soup = BeautifulSoup(html, 'html.parser')
tag1 = soup.find('p')
print ("Next elements:")
print (list(tag1.next_elements))
Output
Next elements:
['\n', <b>Excellent</b>, 'Excellent', <i>Python</i>, 'Python', '\n', '\n', <u>Tutorial</u>, 'Tutorial', '\n']
Example 3
The elements appearing after the input tag present in the HTML form of index.html are listed below −
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html5lib')
tag = soup.find('input')
nexts = tag.next_elements
print ("Next elements:")
for next in nexts:
   print (next)
Beautiful Soup - previous_elements Property
Method Description
In the Beautiful Soup library, the previous_elements property returns a generator object yielding the preceding strings and tags in the parse tree.
Example 1
The previous_elements property returns the tags and NavigableStrings appearing before the <p id='id1'> tag in the document string below −
html = '''
<p><b>Excellent</b><p>Python</p><p id='id1'>Tutorial</p></p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('p', id='id1')
pres = tag.previous_elements
print ("Previous elements:")
for pre in pres:
   print (pre)
Output
Previous elements:
Python
<p>Python</p>
Excellent
<b>Excellent</b>
<p><b>Excellent</b><p>Python</p><p id="id1">Tutorial</p></p>
Example 2
All the elements appearing before the <u> tag are listed below −
from bs4 import BeautifulSoup
html = '''
<p>
<b>Excellent</b><i>Python</i>
</p>
<u>Tutorial</u>
'''
soup = BeautifulSoup(html, 'html.parser')
tag1 = soup.find('u')
print ("previous elements:")
print (list(tag1.previous_elements))
Output
previous elements:
['\n', '\n', 'Python', <i>Python</i>, 'Excellent', <b>Excellent</b>, '\n', <p>
<b>Excellent</b><i>Python</i>
</p>, '\n']
Example 3
The BeautifulSoup object itself doesn't have any previous elements −
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html5lib')
pres = soup.previous_elements
print ("Previous elements:")
for pre in pres:
   print (pre.name)
Beautiful Soup - find() Method
Method Description
The find() method in Beautiful Soup looks for the first element that matches the given criteria among the children of this PageElement and returns it.
Parameters
name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
recursive − If True, find() performs a recursive search of the descendants; otherwise, only the direct children are considered.
string − A filter for a NavigableString with specific text.
kwargs − Keyword arguments that act as filters on attribute values.
Return value
The find() method returns a Tag object or a NavigableString object.
Example 1
Let us use the following HTML script (saved as index.html) for the purpose −
<html>
<head>
<title>TutorialsPoint</title>
</head>
<body>
<form>
<input type = 'text' id = 'nm' name = 'name'>
<input type = 'text' id = 'age' name = 'age'>
<input type = 'text' id = 'marks' name = 'marks'>
</form>
</body>
</html>
The following Python code finds the element whose id is nm −
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
obj = soup.find(id = 'nm')
print (obj)
Example 2
The find() method returns the first tag in the parsed document that has the given attributes.
obj = soup.find(attrs={"name":'marks'})
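Printing the object returned by this call shows the matching input element (assuming the soup object created in Example 1) −
print (obj)
Output
<input id="marks" name="marks" type="text"/>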
Beautiful Soup - find_all() Method
Method Description
The find_all() method in Beautiful Soup looks for all the elements that match the given criteria among the children of this PageElement and returns a list of all matching elements.
Parameters
name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
recursive − If True, find_all() performs a recursive search of the descendants; otherwise, only the direct children are considered.
limit − Stop looking after the specified number of occurrences has been found.
kwargs − Keyword arguments that act as filters on attribute values.
Return type
The find_all() method returns a ResultSet object, which is a subclass of Python's list.
Example 1
When we pass a value for name, Beautiful Soup only considers tags with that name. Text strings are ignored, as are tags whose names don't match. In this example, we pass 'input' to the find_all() method.
from bs4 import BeautifulSoup
html = open('index.html')
soup = BeautifulSoup(html, 'html.parser')
obj = soup.find_all('input')
print (obj)
Output
[<input id="nm" name="name" type="text"/>, <input id="age" name="age" type="text"/>, <input id="marks" name="marks" type="text"/>]
Example 2
We shall use the following HTML script in this example −
<html>
<body>
<h2>Departmentwise Employees</h2>
<ul id="dept">
<li>Accounts</li>
<ul id='acc'>
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ol id="HR">
<li>Rani</li>
<li>Ankita</li>
</ol>
</ul>
</body>
</html>
The name argument of find_all() accepts a string, a regular expression, a list, a function, or the value True. With the string argument, you can search for strings instead of tags. In this example, a function is passed to the string argument; all the strings starting with 'A' are returned by the find_all() method.
from bs4 import BeautifulSoup
def startingwith(ch):
   return ch.startswith('A')

# html is the HTML script shown above
soup = BeautifulSoup(html, 'html.parser')
lst = soup.find_all(string=startingwith)
print (lst)
Beautiful Soup - find_parents() Method
Method Description
The find_parents() method in the BeautifulSoup package finds all the parents of this element that match the given criteria.
Parameters
name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
limit − Stop looking after the specified number of occurrences has been found.
kwargs − Keyword arguments that act as filters on attribute values.
Return Type
The find_parents() method returns a ResultSet consisting of all the parent elements, ordered from the innermost outwards.
Example 1
We shall use the following HTML script in this example −
<html>
<body>
<h2>Departmentwise Employees</h2>
<ul id="dept">
<li>Accounts</li>
<ul id='acc'>
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ol id="HR">
<li>Rani</li>
<li>Ankita</li>
</ol>
</ul>
</body>
</html>
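The following code locates the first <li> element and lists the names of all its enclosing elements with find_parents(). (It is assumed that the HTML script above is stored in the variable html.)
from bs4 import BeautifulSoup

# html is the HTML script shown above
soup = BeautifulSoup(html, 'html.parser')
obj = soup.find('li')
parents = obj.find_parents()
for parent in parents:
   print (parent.name)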
Output
ul
body
html
[document]
Note that the name property of a BeautifulSoup object always returns [document].
Example 2
In this example, the limit argument is passed to the find_parents() method to restrict the parent search to two levels up.
from bs4 import BeautifulSoup
# html is the HTML script shown above
soup = BeautifulSoup(html, 'html.parser')
obj = soup.find('li')
parents = obj.find_parents(limit=2)
for parent in parents:
   print (parent.name)
Beautiful Soup - find_parent() Method
Method Description
The find_parent() method in the BeautifulSoup package finds the closest parent of this PageElement that matches the given criteria.
Parameters
name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
kwargs − Keyword arguments that act as filters on attribute values.
Return Type
The find_parent() method returns a Tag object or a NavigableString object.
Example 1
We shall use the following HTML script in this example −
<html>
<body>
<h2>Departmentwise Employees</h2>
<ul id="dept">
<li>Accounts</li>
<ul id='acc'>
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ol id="HR">
<li>Rani</li>
<li>Ankita</li>
</ol>
</ul>
</body>
</html>
In the following example, we find the name of the tag that is the parent of the string 'HR'.
from bs4 import BeautifulSoup
# html is the HTML script shown above
soup = BeautifulSoup(html, 'html.parser')
obj=soup.find(string='HR')
print (obj.find_parent().name)
Example 2
The <body> tag is always enclosed within the top level <html> tag. In the following example, we confirm this fact with the find_parent() method −
from bs4 import BeautifulSoup
# html is the HTML script shown above
soup = BeautifulSoup(html, 'html.parser')
obj=soup.find('body')
print (obj.find_parent().name)
Beautiful Soup - find_next_siblings() Method
Method Description
The find_next_siblings() method is similar to the next_siblings property. It finds all the siblings of this PageElement that match the given criteria and appear later in the document.
Parameters
name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − The string to search for (rather than a tag).
limit − Stop looking after the specified number of occurrences has been found.
kwargs − Keyword arguments that act as filters on attribute values.
Return Type
The find_next_siblings() method returns a ResultSet of Tag or NavigableString objects.
Example 1
Let us use the following HTML snippet for this purpose −
<p>
<b>
Excellent
</b>
<i>
Python
</i>
<u>
Tutorial
</u>
</p>
In the code below, we try to find all the siblings of the <b> tag. There are two more tags at the same level in the HTML string used for scraping.
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", 'html.parser')
tag1 = soup.find('b')
print ("next siblings:")
for tag in tag1.find_next_siblings():
   print (tag)
Output
The ResultSet returned by find_next_siblings() is iterated with a for loop −
next siblings:
<i>Python</i>
<u>Tutorial</u>
Example 2
If there are no siblings to be found after a tag, this method returns an empty list.
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", 'html.parser')
tag1 = soup.find('u')
print ("next siblings:")
print (tag1.find_next_siblings())
Beautiful Soup - find_next_sibling() Method
Method Description
The find_next_sibling() method in Beautiful Soup finds the closest sibling of this PageElement that matches the given criteria and appears later in the document. This method is similar to the next_sibling property.
Parameters
name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − The string to search for (rather than a tag).
kwargs − Keyword arguments that act as filters on attribute values.
Return Type
The find_next_sibling() method returns a Tag object or a NavigableString object.
Example 1
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')
tag1 = soup.find('b')
print ("next:",tag1.find_next_sibling())
Beautiful Soup - find_previous_siblings() Method
Method Description
The find_previous_siblings() method in the Beautiful Soup package returns all the siblings that appear earlier in the document than this PageElement and match the given criteria.
Parameters
name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − A filter for a NavigableString with specific text.
limit − Stop looking after finding this many results.
kwargs − Keyword arguments that act as filters on attribute values.
Return Value
The find_previous_siblings() method returns a ResultSet of PageElements.
Example 1
Let us use the following HTML snippet for this purpose −
<p>
<b>
Excellent
</b>
<i>
Python
</i>
<u>
Tutorial
</u>
</p>
In the code below, we try to find all the siblings appearing before the <u> tag. There are two more tags at the same level in the HTML string used for scraping.
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", 'html.parser')
tag1 = soup.find('u')
print ("previous siblings:")
for tag in tag1.find_previous_siblings():
   print (tag)
Example 2
The web page (index.html) has an HTML form with three input elements. We locate the one whose id attribute is marks and then find its previous siblings.
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.find('input', {'id':'marks'})
sibs = tag.find_previous_siblings()
print (sibs)
Example 3
The HTML string has two <p> tags. We find the siblings previous to the one with id1 as its id attribute.
html = '''
<p><b>Excellent</b><p>Python</p><p id='id1'>Tutorial</p></p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('p', id='id1')
ptags = tag.find_previous_siblings()
for ptag in ptags:
   print ("Tag: {}, Text: {}".format(ptag.name, ptag.text))
Beautiful Soup - find_previous_sibling() Method
Method Description
The find_previous_sibling() method in Beautiful Soup returns the closest sibling of this PageElement that matches the given criteria and appears earlier in the document.
Parameters
name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − A filter for a NavigableString with specific text.
kwargs − Keyword arguments that act as filters on attribute values.
Return Value
The find_previous_sibling() method returns a PageElement that could be a Tag or a NavigableString.
Example 1
From the HTML string used in the following example, we find the previous sibling of the <i> tag that has the tag name 'u'.
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p><u>Excellent</u><b>Hello</b><i>Python</i></p>", 'html.parser')
tag = soup.i
sibling = tag.find_previous_sibling('u')
print (sibling)
Example 2
网页(index.html)有一个HTML表单,其中包含三个输入元素。我们通过id属性找到marks,然后找到其前一个兄弟,该兄弟的id设置为nm。
The web page (index.html) has a HTML form with three input elements. We locate one with id attribute as marks and then find its previous sibling that had id set to nm.
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.find('input', {'id':'marks'})
sib = tag.find_previous_sibling(id='nm')
print (sib)
Example 3
In the code below, the HTML string has two <p> elements and a string inside the outer <p> tag. We use the find_previous_sibling() method with the string argument to search for the NavigableString sibling of the <p>Tutorial</p> tag.
html = '''
<p>Excellent<p>Python</p><p>Tutorial</p></p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('p', string='Tutorial')
ptag = tag.find_previous_sibling(string='Excellent')
print (ptag, type(ptag))
Beautiful Soup - find_all_next() Method
Method Description
The find_all_next() method in Beautiful Soup finds all the PageElements that match the given criteria and appear after this element in the document. This method returns Tag or NavigableString objects, and takes in the same kind of parameters as find_all().
Parameters
name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − A filter for a NavigableString with specific text.
limit − Stop looking after the specified number of occurrences has been found.
kwargs − Keyword arguments that act as filters on attribute values.
Return Value
This method returns a ResultSet containing PageElements (Tag or NavigableString objects).
Example 1
Using index.html as the HTML document for this example, we first locate the <form> tag and collect all the elements after it with the find_all_next() method.
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.form
tags = tag.find_all_next()
print (tags)
Output
[<input id="nm" name="name" type="text"/>, <input id="age" name="age" type="text"/>, <input id="marks" name="marks" type="text"/>]
Example 2
Here, we apply a filter to the find_all_next() method to collect all the tags after <form> whose id is either nm or age.
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.form
tags = tag.find_all_next(id=['nm', 'age'])
print (tags)
Example 3
If we check the tags following the <body> tag, they include an <h1> tag as well as the <form> tag with its three input elements.
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.body
tags = tag.find_all_next()
print (tags)
Beautiful Soup - find_next() Method
Method Description
The find_next() method in Beautiful Soup finds the first PageElement that matches the given criteria and appears later in the document, i.e. the first tag or NavigableString that comes after the current tag. Like all the other find methods, this method has the following parameters −
Parameters
name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − A filter for a NavigableString with specific text.
kwargs − Keyword arguments that act as filters on attribute values.
Return Value
The find_next() method returns a Tag or a NavigableString.
Example 1
A web page index.html with the following script has been used for this example −
<html>
<head>
<title>TutorialsPoint</title>
</head>
<body>
<h1>TutorialsPoint</h1>
<form>
<input type = 'text' id = 'nm' name = 'name'>
<input type = 'text' id = 'age' name = 'age'>
<input type = 'text' id = 'marks' name = 'marks'>
</form>
</body>
</html>
We first locate the <h1> tag and then find the element next to it, which happens to be the <form> tag.
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.h1
print (tag.find_next())
Output
<form>
<input id="nm" name="name" type="text"/>
<input id="age" name="age" type="text"/>
<input id="marks" name="marks" type="text"/>
</form>
Example 2
In this example, we first locate the <input> tag with name='age' and then obtain the tag next to it.
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.find('input', {'name':'age'})
print (tag.find_next())
Beautiful Soup - find_all_previous() Method
Method Description
The find_all_previous() method in Beautiful Soup looks backwards in the document from this PageElement and finds all the PageElements that match the given criteria and appear before the current element. It returns a ResultSet of the PageElements that come before the current tag in the document. Like all the other find methods, this method has the following parameters −
Parameters
name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − A filter for a NavigableString with specific text.
limit − Stop looking after finding this many results.
kwargs − Keyword arguments that act as filters on attribute values.
Return Value
The find_all_previous() method returns a ResultSet of Tag or NavigableString objects. If the limit parameter is 1, the method is equivalent to the find_previous() method.
Example 1
In this example, the name property of each object that appears before the first input tag is displayed.
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.find('input')
for t in tag.find_all_previous():
   print (t.name)
Example 2
In the HTML document under consideration (index.html), there are three input elements. With the following code, we print the tag names of all the tags preceding the <input> tag whose name attribute is marks. To differentiate between the two input tags before it, we also print the attrs property. Note that the other tags don't have any attributes.
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.find('input', {'name':'marks'})
pretags = tag.find_all_previous()
for pretag in pretags:
   print (pretag.name, pretag.attrs)
Output
input {'type': 'text', 'id': 'age', 'name': 'age'}
input {'type': 'text', 'id': 'nm', 'name': 'name'}
form {}
h1 {}
body {}
title {}
head {}
html {}
Example 3
The BeautifulSoup object stores the entire document's tree. It doesn't have any previous elements, as the example below shows −
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tags = soup.find_all_previous()
print (tags)
Beautiful Soup - find_previous() Method
Method Description
The find_previous() method in Beautiful Soup looks backwards in the document from this PageElement and finds the first PageElement that matches the given criteria, i.e. the first tag or NavigableString that comes before the current tag. Like all the other find methods, this method has the following parameters −
Parameters
name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − A filter for a NavigableString with specific text.
kwargs − Keyword arguments that act as filters on attribute values.
Return Value
The find_previous() method returns a Tag or NavigableString object.
Example 1
In the example below, we try to find the object just before the <body> tag. It happens to be the <title> element.
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.body
print (tag.find_previous())
Example 2
There are three input elements in the HTML document used in this example. The following code locates the input element whose name attribute is age and looks for the element before it.
from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.find('input', {'name':'age'})
print (tag.find_previous())
Beautiful Soup - select() Method
Method Description
In the Beautiful Soup library, the select() method is an important tool for scraping an HTML/XML document. Like the find() and find_*() methods, the select() method helps in locating elements that satisfy a given criteria; the elements in the document tree are selected based on the CSS selector passed to it as an argument.
Beautiful Soup also has a select_one() method. The difference between select() and select_one() is that select() returns a ResultSet of all the elements matched by the CSS selector, whereas select_one() returns the first occurrence of an element satisfying the CSS selector based selection criteria.
Prior to Beautiful Soup version 4.7, the select() method supported only the common CSS selectors. With version 4.7, Beautiful Soup was integrated with the Soup Sieve CSS selector library, so many more selectors can now be used. In version 4.12, a .css property was added in addition to the existing convenience methods select() and select_one().
Parameters
selector − A string containing a CSS selector.
limit − After finding this number of results, stop looking.
kwargs − Keyword arguments to be passed.
If the limit parameter is set to 1, select() becomes equivalent to the select_one() method.
Return Value
The select() method returns a ResultSet of Tag objects. The select_one() method returns a single Tag object.
The Soup Sieve library supports different types of CSS selectors. The basic CSS selectors are −
Type selectors match elements by node name. For example −
tags = soup.select('div')
The Universal selector (*) matches elements of any type. Example −
tags = soup.select('*')
The ID selector matches an element based on its id attribute. The symbol # denotes the ID selector. Example −
tags = soup.select("#nm")
The class selector matches an element based on the values contained in the class attribute. The . symbol prefixed to the class name is the CSS class selector. Example −
tags = soup.select(".submenu")
Example: Type Selector
from bs4 import BeautifulSoup, NavigableString
markup = '''
<div id="Languages">
<p>Java</p> <p>Python</p> <p>C++</p>
</div>
'''
soup = BeautifulSoup(markup, 'html.parser')
tags = soup.select('div')
print (tags)
Example: ID selector
from bs4 import BeautifulSoup
html = '''
<form>
<input type = 'text' id = 'nm' name = 'name'>
<input type = 'text' id = 'age' name = 'age'>
<input type = 'text' id = 'marks' name = 'marks'>
</form>
'''
soup = BeautifulSoup(html, 'html.parser')
obj = soup.select("#nm")
print (obj)
Example: class selector
html = '''
<ul>
<li class="mainmenu">Accounts</li>
<ul>
<li class="submenu">Anand</li>
<li class="submenu">Mahesh</li>
</ul>
<li class="mainmenu">HR</li>
<ul>
<li class="submenu">Rani</li>
<li class="submenu">Ankita</li>
</ul>
</ul>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tags = soup.select(".mainmenu")
print (tags)
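Example: select_one()
The select_one() method returns only the first element matching the given selector. The following sketch reuses the HTML string from the class selector example above −
from bs4 import BeautifulSoup
# html is the string from the class selector example
soup = BeautifulSoup(html, 'html.parser')
tag = soup.select_one(".mainmenu")
print (tag)
Output
<li class="mainmenu">Accounts</li>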
Beautiful Soup - append() Method
Method Description
The append() method in Beautiful Soup adds a given string or another tag at the end of the current Tag object's contents. The append() method works similar to the append() method of a Python list object.
Example 1
In the following example, the HTML script has a <p> tag. With append(), additional text is appended to it.
from bs4 import BeautifulSoup
markup = '<p>Hello</p>'
soup = BeautifulSoup(markup, 'html.parser')
print (soup)
tag = soup.p
tag.append(" World")
print (soup)
Example 2
With the append() method, you can add a new tag at the end of an existing tag. First create a new Tag object with the new_tag() method and then pass it to the append() method.
from bs4 import BeautifulSoup, Tag
markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b
tag1 = soup.new_tag('i')
tag1.string = 'World'
tag.append(tag1)
print (soup.prettify())
Example 3
If you have to add a string to the document, you can append a NavigableString object.
from bs4 import BeautifulSoup, NavigableString
markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b
new_string = NavigableString(" World")
tag.append(new_string)
print (soup.prettify())
Beautiful Soup - extend() Method
Method Description
The extend() method in Beautiful Soup has been added to the Tag class from version 4.7 onwards. It adds all the elements of a list to the tag. This method is analogous to a standard Python list's extend() method − it takes in a sequence of strings to append to the tag's contents, as the example below shows.
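Example 1
A minimal sketch of extend() − a list of strings is appended to the contents of the <p> tag in a single call −
from bs4 import BeautifulSoup
markup = '<p>Beautiful Soup</p>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.p
# each item of the list is added at the end of the tag's contents
tag.extend([' parses', ' HTML', ' documents'])
print (soup)
Output
<p>Beautiful Soup parses HTML documents</p>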
Beautiful Soup - NavigableString() Method
Method Description
The NavigableString() method in the bs4 package is the constructor method of the NavigableString class. A NavigableString represents the innermost child element of a parsed document. This method casts a regular Python string to a NavigableString. Conversely, the built-in str() method converts a NavigableString object to a Unicode string.
Return Value
The NavigableString() method returns a NavigableString object.
Example 1
In the code below, the HTML string contains an empty <b> tag. We add a NavigableString object to it.
html = """
<p><b></b></p>
"""
from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup(html, 'html.parser')
navstr = NavigableString("Hello World")
soup.b.append(navstr)
print (soup)
Example 2
In this example, we see that two NavigableString objects are appended to an empty <b> tag. As a result, the tag responds to the strings property instead of the string property, giving a generator of NavigableString objects.
html = """
<p><b></b></p>
"""
from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup(html, 'html.parser')
navstr = NavigableString("Hello")
soup.b.append(navstr)
navstr = NavigableString("World")
soup.b.append(navstr)
for s in soup.b.strings:
   print (s, type(s))
Example 3
If we access the stripped_strings property of the <b> tag object instead of the strings property, we get a generator of Unicode strings, i.e. str objects.
html = """
<p><b></b></p>
"""
from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup(html, 'html.parser')
navstr = NavigableString("Hello")
soup.b.append(navstr)
navstr = NavigableString("World")
soup.b.append(navstr)
for s in soup.b.stripped_strings:
   print (s, type(s))
Beautiful Soup - new_tag() Method
The new_tag() method in the Beautiful Soup library creates a new Tag object that is associated with an existing BeautifulSoup object. You can use this factory method to append or insert the new tag into the document tree.
Parameters
name − The name of the new Tag.
namespace − The URI of the new Tag's XML namespace, optional.
prefix − The prefix for the new Tag's XML namespace, optional.
attrs − A dictionary of this Tag's attribute values.
sourceline − The line number where this tag was found in its source document.
sourcepos − The character position within sourceline where this tag was found.
kwattrs − Keyword arguments for the new Tag's attribute values.
Example 1
The following example shows the use of the new_tag() method. A new tag for the <a> element is created; the Tag object is initialized with the href and string attributes and then inserted into the document tree.
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Welcome to <b>online Tutorial library</b></p>', 'html.parser')
tag = soup.new_tag('a')
tag.attrs['href'] = "www.tutorialspoint.com"
tag.string = "Tutorialspoint"
soup.b.insert_before(tag)
print (soup)
Output
<p>Welcome to <a href="www.tutorialspoint.com">Tutorialspoint</a><b>online Tutorial library</b></p>
Example 2
In the following example, we have an HTML form with two input elements. We create a new input tag and append it to the form tag.
html = '''
<form>
<input type = 'text' id = 'nm' name = 'name'>
<input type = 'text' id = 'age' name = 'age'>
</form>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.form
newtag=soup.new_tag('input', attrs={'type':'text', 'id':'marks', 'name':'marks'})
tag.append(newtag)
print (soup)
Output
<form>
<input id="nm" name="name" type="text"/>
<input id="age" name="age" type="text"/>
<input id="marks" name="marks" type="text"/></form>
Beautiful Soup - insert() Method
Method Description
The insert() method in Beautiful Soup adds an element at the given position in the list of children of a Tag element. The insert() method in Beautiful Soup behaves similar to insert() on a Python list object.
Parameters
position − The position at which the new PageElement should be inserted.
child − A PageElement to be inserted.
Example 1
In the following example, a new string is added to the <b> tag at position 1. The resulting parsed document shows the result.
from bs4 import BeautifulSoup, NavigableString
markup = '<b>Excellent </b><u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b
tag.insert(1, "Tutorial ")
print (soup.prettify())
Example 2
In the following example, the insert() method is used to successively insert the strings from a list into a <p> tag in the HTML markup.
from bs4 import BeautifulSoup, NavigableString
markup = '<p>Excellent Tutorials from TutorialsPoint</p>'
soup = BeautifulSoup(markup, 'html.parser')
langs = ['Python', 'Java', 'C']
i = 0
for lang in langs:
   i += 1
   tag = soup.new_tag('p')
   tag.string = lang
   soup.p.insert(i, tag)
print (soup.prettify())
Beautiful Soup - insert_before() Method
Method Description
The insert_before() method in Beautiful Soup inserts tags or strings immediately before something else in the parse tree. The inserted element becomes the immediate predecessor of the given element. The inserted element can be a tag or a string.
Return Value
The insert_before() method doesn't return any new object.
Example 1
The following example inserts the text "Here is an" before "Excellent" in the given HTML markup string.
from bs4 import BeautifulSoup, NavigableString
markup = '<b>Excellent</b> Python Tutorial <u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b
tag.insert_before("Here is an ")
print (soup.prettify())
Example 2
You can also insert a tag before another tag. Take a look at this example.
from bs4 import BeautifulSoup, NavigableString
markup = '<P>Excellent <b>Tutorial</b> from TutorialsPoint</p>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b
tag1 = soup.new_tag('b')
tag1.string = "Python "
tag.insert_before(tag1)
print (soup.prettify())
Example 3
The following code passes more than one string to be inserted before the <b> tag.
from bs4 import BeautifulSoup
markup = '<p>There are <b>Tutorials</b> <u>from TutorialsPoint</u></p>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b
tag.insert_before("many ", 'excellent ')
print (soup.prettify())
Beautiful Soup - insert_after() Method
Method Description
The insert_after() method in Beautiful Soup inserts tags or strings immediately after something else in the parse tree. The inserted element becomes the immediate successor of the element on which the method is called. The inserted element can be a tag or a string.
Example 1
The following code inserts the string "Python " after the first <b> tag.
from bs4 import BeautifulSoup
markup = '<p>An <b>Excellent</b> Tutorial <u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b
tag.insert_after("Python ")
print (soup.prettify())
Example 2
You can also insert a tag after another tag. Take a look at this example.
from bs4 import BeautifulSoup, NavigableString
markup = '<P>Excellent <b>Tutorial</b> from TutorialsPoint</p>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b
tag1 = soup.new_tag('b')
tag1.string = "on Python "
tag.insert_after(tag1)
print (soup.prettify())
Example 3
Multiple tags or strings can be inserted after a certain tag.
from bs4 import BeautifulSoup, NavigableString
markup = '<P>Excellent <b>Tutorials</b> from TutorialsPoint</p>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.p
tag1 = soup.new_tag('i')
tag1.string = 'and Java'
tag.insert_after("on Python", tag1)
print (soup.prettify())
Beautiful Soup - clear() Method
Method Description
The clear() method in the Beautiful Soup library removes the inner content of a tag, keeping the tag intact. If there are any child elements, the extract() method is called on them. If the decompose argument is set to True, the decompose() method is called instead of extract().
Parameters
-
decompose − If this is True, decompose() (a more destructive method) will be called instead of extract()
Example 1
As the clear() method is called on the soup object that represents the entire document, all the content is removed, leaving the document blank.
html = '''
<html>
<body>
<p>The quick, brown fox jumps over a lazy dog.</p>
<p>DJs flock by when MTV ax quiz prog.</p>
<p>Junk MTV quiz graced by fox whelps.</p>
<p>Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
soup.clear()
print(soup)
Example 2
In the following example, we find all the <p> tags and call the clear() method on each of them.
html = '''
<html>
<body>
<p>The quick, brown fox jumps over a lazy dog.</p>
<p>DJs flock by when MTV ax quiz prog.</p>
<p>Junk MTV quiz graced by fox whelps.</p>
<p>Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all('p')
for tag in tags:
    tag.clear()
print(soup)
Output
The contents of each <p>...</p> element are removed; the tags themselves are retained.
<html>
<body>
<p></p>
<p></p>
<p></p>
<p></p>
</body>
</html>
Example 3
Here we clear the contents of the <body> tag with the decompose argument set to True.
html = '''
<html>
<body>
<p>The quick, brown fox jumps over a lazy dog.</p>
<p>DJs flock by when MTV ax quiz prog.</p>
<p>Junk MTV quiz graced by fox whelps.</p>
<p>Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
tags = soup.find('body')
ret = tags.clear(decompose=True)
print(soup)
Beautiful Soup - extract() Method
Method Description
The extract() method in the Beautiful Soup library is used to remove a tag or a string from the document tree. The extract() method returns the object that has been removed, similar to how the pop() method on a Python list works.
Return Type
The extract() method returns the element that has been removed from the document tree.
Example 1
html = '''
<div>
<p>Hello Python</p>
</div>
'''
from bs4 import BeautifulSoup
soup=BeautifulSoup(html, 'html.parser')
tag1 = soup.find("div")
tag2 = tag1.find("p")
ret = tag2.extract()
print ('Extracted:',ret)
print ('original:',soup)
Example 2
Consider the following HTML markup −
<html>
<body>
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
Here is the code −
from bs4 import BeautifulSoup
fp = open('index.html')
soup = BeautifulSoup(fp, 'html.parser')
tags = soup.find_all()
for tag in tags:
    obj = tag.extract()
    print ("Extracted:",obj)
print (soup)
Output
Extracted: <html>
<body>
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
Extracted: <body>
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
Extracted: <p> The quick, brown fox jumps over a lazy dog.</p>
Extracted: <p> DJs flock by when MTV ax quiz prog.</p>
Extracted: <p> Junk MTV quiz graced by fox whelps.</p>
Extracted: <p> Bawds jog, flick quartz, vex nymphs.</p>
Example 3
You can also use the extract() method along with the find_next() and find_previous() methods, or the next_element and previous_element properties.
html = '''
<div>
<p><b>Hello</b><b>Python</b></p>
</div>
'''
from bs4 import BeautifulSoup
soup=BeautifulSoup(html, 'html.parser')
tag1 = soup.find("b")
ret = tag1.next_element.extract()
print ('Extracted:',ret)
print ('original:',soup)
Beautiful Soup - decompose() Method
Method Description
The decompose() method destroys the current element along with its children, removing it from the tree and wiping out everything beneath it. You can check whether an element has been decomposed by inspecting the decomposed property, which returns True if the element has been destroyed and False otherwise.
Example 1
When we call the decompose() method on the BeautifulSoup object itself, the entire content is destroyed.
html = '''
<html>
<body>
<p>The quick, brown fox jumps over a lazy dog.</p>
<p>DJs flock by when MTV ax quiz prog.</p>
<p>Junk MTV quiz graced by fox whelps.</p>
<p>Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
soup.decompose()
print ("decomposed:",soup.decomposed)
print ("document:",soup)
Output
decomposed: True
document: Traceback (most recent call last):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~
TypeError: can only concatenate str (not "NoneType") to str
Since the soup object is decomposed, the decomposed property returns True; however, printing the document raises a TypeError as shown above.
Example 2
The code below uses the decompose() method to remove all occurrences of the <p> tag from the HTML string.
html = '''
<html>
<body>
<p>The quick, brown fox jumps over a lazy dog.</p>
<p>DJs flock by when MTV ax quiz prog.</p>
<p>Junk MTV quiz graced by fox whelps.</p>
<p>Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
p_all = soup.find_all('p')
for p in p_all:
    p.decompose()
print ("document:",soup)
Output
The rest of the HTML document, after all the <p> tags have been removed, is printed.
document:
<html>
<body>
</body>
</html>
Example 3
Here, we find the <body> tag in the HTML document tree and decompose the previous element, which happens to be the <title> tag. The resulting document tree omits the <title> tag.
html = '''
<html>
<head>
<title>TutorialsPoint</title>
</head>
<body>
Hello World
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
tag = soup.body
tag.find_previous().decompose()
print ("document:",soup)
Beautiful Soup - replace_with() Method
Method Description
Beautiful Soup’s replace_with() method replaces a tag or string in an element with the provided tag or string.
Example 1
In this example, the <p> tag is replaced by a <b> tag using the replace_with() method.
html = '''
<html>
<body>
<p>The quick, brown fox jumps over a lazy dog.</p>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
tag1 = soup.find('p')
txt = tag1.string
tag2 = soup.new_tag('b')
tag2.string = txt
tag1.replace_with(tag2)
print (soup)
Example 2
You can simply replace the inner text of a tag with another string by calling the replace_with() method on the tag.string object.
html = '''
<html>
<body>
<p>The quick, brown fox jumps over a lazy dog.</p>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
tag1 = soup.find('p')
tag1.string.replace_with("DJs flock by when MTV ax quiz prog.")
print (soup)
Example 3
The tag object to be used for replacement can be obtained by any of the find() methods. Here, we replace the text of the <b> tag found with find_next() from the <p> tag.
html = '''
<html>
<body>
<p>The quick, <b>brown</b> fox jumps over a lazy dog.</p>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
tag1 = soup.find('p')
tag1.find_next('b').string.replace_with('black')
print (soup)
Beautiful Soup - wrap() Method
Method Description
The wrap() method in Beautiful Soup encloses the element inside another element. You can wrap an existing tag element with another, or wrap the tag’s string with a tag.
Example 1
In this example, the <b> tag is wrapped in a <div> tag.
html = '''
<html>
<body>
<p>The quick, <b>brown</b> fox jumps over a lazy dog.</p>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
tag1 = soup.find('b')
newtag = soup.new_tag('div')
tag1.wrap(newtag)
print (soup)
Output
<html>
<body>
<p>The quick, <div><b>brown</b></div> fox jumps over a lazy dog.</p>
</body>
</html>
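Example 2
The description above also mentions wrapping a tag's string with a tag. The following minimal sketch, using markup of our own, wraps the string inside <p> in a new <b> tag −
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello</p>', 'html.parser')
# wrap() can be called on a NavigableString as well as on a Tag
soup.p.string.wrap(soup.new_tag('b'))
print (soup)    # expected: <p><b>Hello</b></p>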
Beautiful Soup - unwrap() Method
Method Description
The unwrap() method is the opposite of the wrap() method. It replaces a tag with whatever is inside that tag. It removes the tag from the element and returns it.
Example 1
In the following example, the <b> tag is removed from the HTML string.
html = '''
<p>The quick, <b>brown</b> fox jumps over a lazy dog.</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
tag1 = soup.find('b')
newtag = tag1.unwrap()
print (soup)
Example 2
The code below prints the value returned by the unwrap() method.
html = '''
<p>The quick, <b>brown</b> fox jumps over a lazy dog.</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
tag1 = soup.find('b')
newtag = tag1.unwrap()
print (newtag)
Example 3
The unwrap() method is good for stripping out markup, as the following code shows −
html = '''
<html>
<body>
<p>The quick, brown fox jumps over a lazy dog.</p>
<p>DJs flock by when MTV ax quiz prog.</p>
<p>Junk MTV quiz graced by fox whelps.</p>
<p>Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all():
    tag.unwrap()
print (soup)
Beautiful Soup - smooth() Method
Method Description
After calling a bunch of methods that modify the parse tree, you may end up with two or more NavigableString objects next to each other. The smooth() method smooths out this element’s children by consolidating consecutive strings. This makes pretty-printed output look more natural following a lot of operations that modified the tree.
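As a minimal sketch of the effect (with markup of our own): after replace_with() and append(), the <p> tag holds two adjacent NavigableString objects, and smooth() consolidates them into one −
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>A one</p>', 'html.parser')
soup.p.string.replace_with('A')     # 'A' is now the only child string
soup.p.append(' one')               # adds a second, adjacent string
print (soup.p.contents)             # expected: ['A', ' one']
soup.p.smooth()
print (soup.p.contents)             # expected: ['A one']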
Example 1
html ='''<html>
<head>
<title>TutorialsPoint</title>
</head>
<body>
Some Text
<div></div>
<p></p>
<div>Some more text</div>
<b></b>
<i></i> # COMMENT
</body>
</html>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
for item in soup.find_all():
    if not item.get_text(strip=True):
        p = item.parent
        item.replace_with('')
        p.smooth()
print (soup.prettify())
Output
<html>
<head>
<title>
TutorialsPoint
</title>
</head>
<body>
Some Text
<div>
Some more text
</div>
# COMMENT
</body>
</html>
Beautiful Soup - prettify() Method
Method Description
To get a nicely formatted Unicode string, use Beautiful Soup's prettify() method. It formats the Beautiful Soup parse tree so that each tag is on its own line with indentation. It allows you to easily visualize the structure of the parse tree.
Parameters
-
encoding − The eventual encoding of the string. If this is None, a Unicode string will be returned.
-
formatter − A Formatter object, or a string naming one of the standard formatters.
Return Type
The prettify() method returns a Unicode string (if encoding==None) or a bytestring (otherwise).
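A quick sketch of the two return types, using markup of our own −
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b>Hello</b>', 'html.parser')
print (type(soup.prettify()))          # expected: <class 'str'> (Unicode string)
print (type(soup.prettify('utf-8')))   # expected: <class 'bytes'> (bytestring)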
Example 1
Consider the following HTML string.
<p>The quick, <b>brown fox</b> jumps over a lazy dog.</p>
Using the prettify() method we can better understand its structure −
html = '''
<p>The quick, <b>brown fox</b> jumps over a lazy dog.</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
print (soup.prettify())
Example 2
You can call prettify() on any of the Tag objects in the document.
print (soup.b.prettify())
Output
<b>
brown fox
</b>
The prettify() method is for understanding the structure of the document. However, it should not be used to reformat it, as it adds whitespace (in the form of newlines), and changes the meaning of an HTML document.
The prettify() method can optionally be given a formatter argument to specify the formatting to be used.
The following are the possible values for formatter (a short sketch follows the list) −
formatter="minimal" − 它是默认值。将对字符串进行足够的处理以确保 Beautiful Soup 生成有效的 HTML/XML。
formatter="minimal" − This is the default. Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML.
formatter="html" − 只要有可能,Beautiful Soup 将把 Unicode 字符转换为 HTML 实体。
formatter="html" − Beautiful Soup will convert Unicode characters to HTML entities whenever possible.
formatter="html5" − 它类似于 formatter="html",但是 Beautiful Soup 将在 HTML 空标签(例如 "br")中省略结束斜杠。
formatter="html5" − it’s similar to formatter="html", but Beautiful Soup will omit the closing slash in HTML void tags like "br".
formatter=None − Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML.
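The following minimal sketch, using markup of our own, hints at the difference: with formatter="html5" the void tag <br/> is rendered without the closing slash −
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello<br/>World</p>', 'html.parser')
print (soup.prettify())                    # default 'minimal' keeps <br/>
print (soup.prettify(formatter='html5'))   # renders <br> without the slash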
Beautiful Soup - encode() Method
Method Description
The encode() method in Beautiful Soup renders a bytestring representation of the given PageElement and its contents.
The prettify() method, which allows you to easily visualize the structure of the Beautiful Soup parse tree, has an encoding argument. The encode() method plays the same role as the encoding argument of the prettify() method.
Parameters
-
encoding − The destination encoding.
-
indent_level − Each line of the rendering will be indented this many levels. Used internally in recursive calls while pretty-printing.
-
formatter − A Formatter object, or a string naming one of the standard formatters.
-
errors − An error handling strategy.
Return Value
The encode() method returns a byte string representation of the tag and its contents.
Example 1
The encoding parameter is utf-8 by default. The following code shows the encoded byte string representation of the soup object.
from bs4 import BeautifulSoup
soup = BeautifulSoup("Hello “World!”", 'html.parser')
print (soup.encode('utf-8'))
Example 2
The formatter argument has the following predefined values −
formatter="minimal" − 它是默认值。将对字符串进行足够的处理以确保 Beautiful Soup 生成有效的 HTML/XML。
formatter="minimal" − This is the default. Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML.
formatter="html" − 只要有可能,Beautiful Soup 将把 Unicode 字符转换为 HTML 实体。
formatter="html" − Beautiful Soup will convert Unicode characters to HTML entities whenever possible.
formatter="html5" − 它类似于 formatter="html",但是 Beautiful Soup 将在 HTML 空标签(例如 "br")中省略结束斜杠。
formatter="html5" − it’s similar to formatter="html", but Beautiful Soup will omit the closing slash in HTML void tags like "br".
formatter=None − Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML.
In the following example, different formatter values are used as arguments to the encode() method.
from bs4 import BeautifulSoup
french = "<p>Il a dit <<Sacré bleu!>></p>"
soup = BeautifulSoup(french, 'html.parser')
print ("minimal: ")
print(soup.p.encode(formatter="minimal"))
print ("html: ")
print(soup.p.encode(formatter="html"))
print ("None: ")
print(soup.p.encode(formatter=None))
Output
minimal:
b'<p>Il a dit <<Sacr\xc3\xa9 bleu!>></p>'
html:
b'<p>Il a dit <<Sacr&eacute; bleu!>></p>'
None:
b'<p>Il a dit <<Sacr\xc3\xa9 bleu!>></p>'
Example 3
The following example uses Latin-1 as the encoding parameter.
markup = '''
<html>
<head>
<meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
</head>
<body>
<p>Sacré bleu!</p>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(markup, 'lxml')
print(soup.p.encode("latin-1"))
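The errors parameter defaults to "xmlcharrefreplace", so characters that cannot be represented in the destination encoding are converted to XML character references. A small sketch with markup of our own −
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Sacré bleu!</p>', 'html.parser')
# é cannot be encoded in ASCII; expected: b'<p>Sacr&#233; bleu!</p>'
print (soup.p.encode('ascii'))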
Beautiful Soup - decode() Method
Method Description
The decode() method in Beautiful Soup returns a string or Unicode representation of the parse tree as an HTML or XML document. The method decodes the bytes using the codec registered for the encoding. Its function is the opposite of that of the encode() method: you call encode() to get a bytestring, and decode() to get Unicode. Let us study the decode() method with an example.
Parameters
-
pretty_print − If this is True, indentation will be used to make the document more readable.
-
encoding − The encoding of the final document. If this is None, the document will be a Unicode string.
-
formatter − A Formatter object, or a string naming one of the standard formatters.
-
errors − The error handling scheme to use for the handling of decoding errors. Values are 'strict', 'ignore' and 'replace'.
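Example
The following minimal sketch, using markup of our own, shows decode() as the counterpart of encode() −
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello <b>World</b></p>', 'html.parser')
print (soup.encode())                   # bytestring
print (soup.decode())                   # Unicode string
print (soup.decode(pretty_print=True))  # indented, human-readable form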
Beautiful Soup - get_text() Method
Method Description
The get_text() method returns only the human-readable text from the entire HTML document or a given tag. All the child strings are concatenated using the given separator, which is an empty string by default.
Parameters
-
separator − The child strings will be concatenated using this parameter. By default it is "".
-
strip − The strings will be stripped before concatenation.
Example 1
In the example below, the get_text() method removes all the HTML tags.
html = '''
<html>
<body>
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()
print(text)
Output
The quick, brown fox jumps over a lazy dog.
DJs flock by when MTV ax quiz prog.
Junk MTV quiz graced by fox whelps.
Bawds jog, flick quartz, vex nymphs.
Example 2
In the following example, we specify the separator argument of the get_text() method as '#'.
html = '''
<p>The quick, brown fox jumps over a lazy dog.</p>
<p>DJs flock by when MTV ax quiz prog.</p>
<p>Junk MTV quiz graced by fox whelps.</p>
<p>Bawds jog, flick quartz, vex nymphs.</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(separator='#')
print(text)
Output
#The quick, brown fox jumps over a lazy dog.#
#DJs flock by when MTV ax quiz prog.#
#Junk MTV quiz graced by fox whelps.#
#Bawds jog, flick quartz, vex nymphs.#
Example 3
Let us check the effect of the strip parameter when it is set to True. By default it is False.
html = '''
<p>The quick, brown fox jumps over a lazy dog.</p>
<p>DJs flock by when MTV ax quiz prog.</p>
<p>Junk MTV quiz graced by fox whelps.</p>
<p>Bawds jog, flick quartz, vex nymphs.</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(strip=True)
print(text)
Beautiful Soup - diagnose() Method
Method Description
The diagnose() method in Beautiful Soup is a diagnostic suite for isolating common problems. If you're facing difficulty in understanding what Beautiful Soup is doing to a document, pass the document as an argument to the diagnose() function. It prints a report showing how the different parsers handle the document, and tells you if you're missing a parser.
Return Value
The diagnose() method prints the result of parsing the given document with all the available parsers.
Example
Let us take this simple document for our exercise −
<h1>Hello World
<b>Welcome</b>
<P><b>Beautiful Soup</a> <i>Tutorial</i><p>
The following code runs the diagnostics on the above HTML script −
markup = '''
<h1>Hello World
<b>Welcome</b>
<P><b>Beautiful Soup</a> <i>Tutorial</i><p>
'''
from bs4.diagnose import diagnose
diagnose(markup)
The diagnose() output starts with a message showing which parsers are available −
Diagnostic running on Beautiful Soup 4.12.2
Python version 3.11.2 (tags/v3.11.2:878ead1, Feb 7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)]
Found lxml version 4.9.2.0
Found html5lib version 1.1
If the document to be diagnosed were a perfect HTML document, the results from all parsers would be much the same. In our example, however, there are many errors.
To begin with, the built-in html.parser is taken up. Its report is as follows −
Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<h1>
Hello World
<b>
Welcome
</b>
<p>
<b>
Beautiful Soup
<i>
Tutorial
</i>
<p>
</p>
</b>
</p>
</h1>
You can see that Python's built-in parser doesn't insert the <html> and <body> tags. The unclosed <h1> tag is given a matching </h1> at the end.
Both the html5lib and lxml parsers complete the document by wrapping it in <html>, <head> and <body> tags.
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
<head>
</head>
<body>
<h1>
Hello World
<b>
Welcome
</b>
<p>
<b>
Beautiful Soup
<i>
Tutorial
</i>
</b>
</p>
<p>
<b>
</b>
</p>
</h1>
</body>
</html>
With the lxml parser, note where the closing </h1> is inserted. The incomplete <b> tag is also rectified, and the dangling </a> is removed.
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
<body>
<h1>
Hello World
<b>
Welcome
</b>
</h1>
<p>
<b>
Beautiful Soup
<i>
Tutorial
</i>
</b>
</p>
<p>
</p>
</body>
</html>
The diagnose() method also parses the document as an XML document, which is probably superfluous in our case.
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<h1>
Hello World
<b>
Welcome
</b>
<P>
<b>
Beautiful Soup
</b>
<i>
Tutorial
</i>
<p/>
</P>
</h1>
Let us give the diagnose() method an XML document instead of an HTML document.
<?xml version="1.0" ?>
<books>
<book>
<title>Python</title>
<author>TutorialsPoint</author>
<price>400</price>
</book>
</books>
Now, if we run the diagnostics, the HTML parsers are applied even though the document is XML.
Trying to parse your markup with html.parser
Warning (from warnings module):
File "C:\Users\mlath\OneDrive\Documents\Feb23 onwards\BeautifulSoup\Lib\site-packages\bs4\builder\__init__.py", line 545
warnings.warn(
XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
With html.parser, a warning message is displayed. With html5lib, the first line, which contains the XML version information, is commented out, and the rest of the document is parsed as if it were an HTML document.
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<!--?xml version="1.0" ?-->
<html>
<head>
</head>
<body>
<books>
<book>
<title>
Python
</title>
<author>
TutorialsPoint
</author>
<price>
400
</price>
</book>
</books>
</body>
</html>
The lxml HTML parser doesn't comment out the XML declaration; it parses the document as HTML.
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<?xml version="1.0" ?>
<html>
<body>
<books>
<book>
<title>
Python
</title>
<author>
TutorialsPoint
</author>
<price>
400
</price>
</book>
</books>
</body>
</html>
The lxml-xml parser parses the document as XML.
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<?xml version="1.0" ?>
<books>
<book>
<title>
Python
</title>
<author>
TutorialsPoint
</author>
<price>
400
</price>
</book>
</books>
The diagnostics report may prove to be useful in finding errors in HTML/XML documents.