Beautiful Soup: A Concise Tutorial
Beautiful Soup - Output Formatting
If the HTML string passed to the BeautifulSoup constructor contains HTML entities, they are converted to Unicode characters.
An HTML entity is a string that begins with an ampersand (&) and ends with a semicolon (;). Entities are used to display reserved characters (which would otherwise be interpreted as HTML code) and other special characters. Some examples of HTML entities:
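Character | Description | Entity name | Entity number
< | less than | &lt; | &#60;
> | greater than | &gt; | &#62;
& | ampersand | &amp; | &#38;
" | double quote | &quot; | &#34;
' | single quote | &apos; | &#39;
“ | left double quote | &ldquo; | &#8220;
” | right double quote | &rdquo; | &#8221;
£ | pound | &pound; | &#163;
¥ | yen | &yen; | &#165;
€ | euro | &euro; | &#8364;
© | copyright | &copy; | &#169;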
By default, the only characters that are escaped on output are bare ampersands and angle brackets. These are turned into "&amp;", "&lt;", and "&gt;".
All other entities are converted to Unicode characters and stay that way on output.
Example
from bs4 import BeautifulSoup
# The input uses the &ldquo; and &rdquo; entities for the curly quotes
soup = BeautifulSoup("Hello &ldquo;World!&rdquo;", 'html.parser')
print(str(soup))
Output
Hello “World!”
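The default escaping of ampersands and angle brackets described above can be checked with a short sketch. The sample markup and the variable names are illustrative assumptions, not part of the original tutorial.

```python
from bs4 import BeautifulSoup

# Illustrative markup: reserved characters written as entities
markup = "<p>1 &lt; 2 &amp; 3 &gt; 2</p>"
soup = BeautifulSoup(markup, "html.parser")

# Inside the tree, the entities have become plain Unicode characters
text = soup.p.get_text()
print(text)        # 1 < 2 & 3 > 2

# On output, the default formatter escapes them again
serialized = str(soup)
print(serialized)  # <p>1 &lt; 2 &amp; 3 &gt; 2</p>
```

Only these three characters make the round trip back to entities; anything else comes out as the Unicode character itself.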
If you then convert the document to a bytestring, the Unicode characters are encoded as UTF-8. You won't get the HTML entities back:
Example
from bs4 import BeautifulSoup
# The input uses the &ldquo; and &rdquo; entities for the curly quotes
soup = BeautifulSoup("Hello &ldquo;World!&rdquo;", 'html.parser')
print(soup.encode())
Output
b'Hello \xe2\x80\x9cWorld!\xe2\x80\x9d'
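As a side note to the bytestring example, here is a minimal sketch of what happens when the target encoding cannot represent a character. It assumes the default error handling of encode(), which replaces such characters with numeric character references; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("Hello &ldquo;World!&rdquo;", "html.parser")

# UTF-8 (the default) can represent the curly quotes directly
utf8_bytes = soup.encode()
print(utf8_bytes)

# ASCII cannot, so the quotes come out as numeric character references
ascii_bytes = soup.encode("ascii")
print(ascii_bytes)
```

The numeric references are not the named entities from the input; the original &ldquo; and &rdquo; spellings are still lost.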
To change this behavior, pass a value for the formatter argument of the prettify() method (encode() and decode() accept the same argument). The possible values are:
formatter="minimal" − This is the default. Strings are processed only enough to ensure that Beautiful Soup generates valid HTML/XML.
formatter="html" − Beautiful Soup converts Unicode characters to HTML entities whenever possible.
formatter="html5" − Similar to formatter="html", but Beautiful Soup omits the closing slash in HTML void tags like "br".
formatter=None − Beautiful Soup does not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML.
Example
from bs4 import BeautifulSoup
# The input escapes the angle brackets and the accented letter as entities
french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french, 'html.parser')
print("minimal: ")
print(soup.prettify(formatter="minimal"))
print("html: ")
print(soup.prettify(formatter="html"))
print("None: ")
print(soup.prettify(formatter=None))
Output
minimal:
<p>
 Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
</p>
html:
<p>
 Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
</p>
None:
<p>
 Il a dit <<Sacré bleu!>>
</p>
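The example above does not exercise formatter="html5". A minimal sketch of the difference for a void tag follows; it uses decode() rather than prettify() so the result is a single line, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Parse a single void tag and grab the resulting Tag object
br = BeautifulSoup("<br>", "html.parser").br

# "html" keeps the closing slash; "html5" omits it
as_html = br.decode(formatter="html")
as_html5 = br.decode(formatter="html5")

print(as_html)   # <br/>
print(as_html5)  # <br>
```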
In addition, the Beautiful Soup library provides formatter classes. You can pass an instance of either class as the formatter argument to the prettify() method.
HTMLFormatter class − Used to customize the formatting rules for HTML documents.
XMLFormatter class − Used to customize the formatting rules for XML documents.
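As a sketch of how a formatter class might be used, the snippet below builds an HTMLFormatter around a custom substitution function. The uppercase helper and the variable names are hypothetical, chosen only for illustration. The function replaces the usual entity substitution, so nothing is escaped unless the function does it.

```python
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter

def uppercase(text):
    """Hypothetical substitution function: upper-case every string."""
    return text.upper()

soup = BeautifulSoup(
    "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>", "html.parser"
)

# The first argument is the function applied to each string on output
shouting = HTMLFormatter(uppercase)

result = soup.decode(formatter=shouting)
print(result)  # <p>IL A DIT <<SACRÉ BLEU!>></p>
```

The same formatter object can be passed to prettify() or encode(). Because the angle brackets are left unescaped here, the output is not valid HTML, so a real substitution function would normally escape them itself.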