Beautiful Soup: A Concise Tutorial
Beautiful Soup - Specifying the Parser
An HTML document is parsed into a tree represented by an object of the BeautifulSoup class. The constructor takes the markup, given as an HTML string or as a file object opened on an HTML file, as its first argument. All other arguments are optional; the most important of them is features.
BeautifulSoup(markup, features)
Here markup is an HTML string or a file object. The features parameter specifies the parser to use. It may name a specific parser ("lxml", "lxml-xml", "html.parser", or "html5lib") or the type of markup to be parsed ("html", "html5", or "xml").
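As a minimal sketch of the two accepted forms of markup, the snippet below parses the same HTML once from a string and once from a file-like object. The sample HTML and the variable names are illustrative assumptions, and io.StringIO merely stands in for a file opened with open().

```python
import io
from bs4 import BeautifulSoup

# Illustrative sample markup (not from the tutorial)
html = "<html><body><p>Hello</p></body></html>"

# Markup passed as a string
soup_from_string = BeautifulSoup(html, "html.parser")

# Markup passed as a file-like object; io.StringIO stands in for an opened HTML file
soup_from_file = BeautifulSoup(io.StringIO(html), "html.parser")

print(soup_from_string.p.get_text())
print(str(soup_from_string) == str(soup_from_file))
```

Both calls name "html.parser" explicitly, so the result does not depend on which optional parsers happen to be installed.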
If the features argument is not given, Beautiful Soup chooses the best HTML parser that is installed. It ranks lxml’s parser as the best, then html5lib’s, then Python’s built-in parser.
You can specify one of the following −
The type of markup you want to parse. Beautiful Soup currently supports "html", "xml", and "html5".
The name of the parser library to be used. Currently supported options are "lxml", "html5lib", and "html.parser" (Python’s built-in HTML parser).
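Because the parser chosen by default depends on what is installed, one option is to name the parsers explicitly and fall back in a fixed order. The sketch below assumes a hypothetical helper, make_soup, that is not part of Beautiful Soup; it relies on the FeatureNotFound exception that bs4 raises when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first installed parser named in `preferred`.

    Hypothetical helper for illustration; returns (soup, parser_name).
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser library is not installed; try the next one
            continue
    raise FeatureNotFound(f"none of these parsers is installed: {preferred}")

soup, parser_name = make_soup("<p>Hello</p>")
print(parser_name, soup.p.get_text())
```

Since "html.parser" ships with Python, the default preference list always finds a usable parser.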
To install the lxml or html5lib parser, use the corresponding command −
pip3 install lxml
pip3 install html5lib
Each parser has its own advantages and disadvantages, as shown below −
Parser: Python’s html.parser
Usage − BeautifulSoup(markup, "html.parser")
Parser: lxml’s HTML parser
Usage − BeautifulSoup(markup, "lxml")
Parser: lxml’s XML parser
Usage − BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")
Parser: html5lib
Usage − BeautifulSoup(markup, "html5lib")
Disadvantages − Very slow; external Python dependency
Different parsers create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers. Here’s a short document, parsed as HTML −
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("<a><b /></a>", "html.parser")
print(soup)
Output
<a><b></b></a>
An empty <b /> tag is not valid HTML, so the parser turns it into a <b></b> tag pair.
Here is the same document parsed as XML, with "xml" passed as the features argument. Note that the empty <b /> tag is left alone, and that the document is given an XML declaration rather than being wrapped in an <html> tag, as lxml’s HTML parser would do.
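The XML parse described here can be sketched as follows. This assumes lxml is installed, since "xml" selects lxml’s XML parser; the try/except fallback and the variable name result are illustrative additions.

```python
from bs4 import BeautifulSoup, FeatureNotFound

try:
    # "xml" selects lxml's XML parser
    soup = BeautifulSoup("<a><b /></a>", "xml")
    result = str(soup)
except FeatureNotFound:
    # lxml is not installed, so no XML parser is available
    result = None

print(result)
```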
Output
<?xml version="1.0" encoding="utf-8"?>
<a><b/></a>
For a perfectly formed HTML document, all HTML parsers produce similar parse trees, although one parser will be faster than another.
However, if the HTML document is not perfectly formed, different parsers give different results. See how the results differ when "<a></p>" is parsed with each parser −
lxml parser
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("<a></p>", "lxml")
print(soup)
Output
<html><body><a></a></body></html>
Note that the dangling </p> tag is simply ignored.
html5lib parser
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("<a></p>", "html5lib")
print(soup)
Output
<html><head></head><body><a><p></p></a></body></html>
html5lib pairs the dangling </p> tag with an opening <p> tag. This parser also adds an empty <head> tag to the document.
Built-in html parser
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("<a></p>", "html.parser")
print(soup)
Output
<a></a>
This parser also ignores the closing </p> tag. Unlike the other two, however, it makes no attempt to create a well-formed HTML document: it adds neither a <body> tag nor an <html> tag.
The html5lib parser uses techniques that are part of the HTML5 standard, so it has the best claim to being the "correct" way.
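To compare the three HTML parsers side by side on your own machine, a small loop like the one below may help. The helper name parse_with_each is hypothetical, and parsers that are not installed are reported rather than treated as errors.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def parse_with_each(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Return {parser_name: serialized tree}; uninstalled parsers map to None.

    Hypothetical helper for illustration.
    """
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            results[name] = None
    return results

results = parse_with_each("<a></p>")
for name, tree in results.items():
    print(f"{name:12} {tree if tree is not None else '(not installed)'}")
```

If the output of a script must be identical across environments, naming the parser explicitly in every BeautifulSoup call avoids depending on whichever parser happens to be installed.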