Beautiful Soup: A Concise Tutorial

Beautiful Soup - Encoding

Every HTML or XML document is written in a specific encoding such as ASCII or UTF-8. When you load such a document into BeautifulSoup, however, it is converted to Unicode.

Example

from bs4 import BeautifulSoup
markup = "<p>I will display £</p>"
soup = BeautifulSoup(markup, "html.parser")
print(soup.p)
print(soup.p.string)

Output

<p>I will display £</p>
I will display £

This happens because BeautifulSoup internally uses a sub-library called Unicode, Dammit to detect a document's encoding and convert it to Unicode.

However, Unicode, Dammit does not always guess correctly, and because it may search the document byte by byte to guess the encoding, detection can take a long time. If you already know the encoding, you can save time and avoid mistakes by passing it to the BeautifulSoup constructor as from_encoding.

Below is an example in which BeautifulSoup misidentifies an ISO-8859-8 document as ISO-8859-7 −

Example

from bs4 import BeautifulSoup
markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup, 'html.parser')
print(soup.h1)

print(soup.original_encoding)

Output

<h1>νεμω</h1>
ISO-8859-7
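
The mismatch can be reproduced without BeautifulSoup. The sketch below uses only Python's built-in codecs to decode the same four bytes under both encodings; the variable names (raw, as_greek, as_hebrew) are illustrative and not part of the original example. Both decodes succeed, which is why validity alone cannot tell a detector which encoding is the right one.

```python
# The four bytes from the <h1> element in the example above
raw = b"\xed\xe5\xec\xf9"

# Decoded as ISO-8859-7, the bytes map to Greek letters
as_greek = raw.decode("iso-8859-7")

# Decoded as ISO-8859-8, the same bytes map to Hebrew letters
as_hebrew = raw.decode("iso-8859-8")

print(as_greek)
print(as_hebrew)
```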

To resolve this, pass the correct encoding to BeautifulSoup using from_encoding −

Example

from bs4 import BeautifulSoup
markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-8")
print(soup.h1)

print(soup.original_encoding)

Output

<h1>םולש</h1>
iso-8859-8

Another feature, added in BeautifulSoup 4.4.0, is exclude_encodings. Use it when you do not know the correct encoding but are sure that Unicode, Dammit is guessing wrong.

soup = BeautifulSoup(markup, "html.parser", exclude_encodings=["ISO-8859-7"])
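
A slightly fuller sketch of the same call, reusing the bytes from the earlier example. Which encoding ends up being chosen depends on the detection libraries installed alongside Beautiful Soup, so no output is shown here; the sketch only checks that the name passed to exclude_encodings is not the one reported. The variable name detected is illustrative.

```python
from bs4 import BeautifulSoup

# Same Hebrew bytes that were misidentified as ISO-8859-7 earlier
markup = b"<h1>\xed\xe5\xec\xf9</h1>"

# Rule out the known-wrong guess and let detection fall back to other candidates
soup = BeautifulSoup(markup, "html.parser", exclude_encodings=["iso-8859-7"])

detected = soup.original_encoding
print(detected)
```

Excluding a wrong guess does not guarantee a right one. When the encoding is actually known, from_encoding remains the more reliable option.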

Output encoding

Output written from BeautifulSoup is a UTF-8 document, regardless of the encoding of the document that was passed in. Below is a document whose Polish characters are in ISO-8859-2 −

Example

markup = """
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
   <HEAD>
      <META HTTP-EQUIV="content-type" CONTENT="text/html; charset=iso-8859-2">
   </HEAD>
   <BODY>
   ą ć ę ł ń ó ś ź ż Ą Ć Ę Ł Ń Ó Ś Ź Ż
   </BODY>
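</HTML>
""".encode("iso-8859-2")  # encode to bytes so the markup really is ISO-8859-2

from bs4 import BeautifulSoup

# from_encoding must match the document's actual encoding (ISO-8859-2, not ISO-8859-8)
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-2")
print(soup.prettify())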

Output

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
   <head>
      <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
   </head>
   <body>
      ą ć ę ł ń ó ś ź ż Ą Ć Ę Ł Ń Ó Ś Ź Ż
   </body>
</html>

Notice that in the example above the <meta> tag has been rewritten to reflect that the document generated by BeautifulSoup is now in UTF-8.
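
Conceptually, the conversion amounts to decoding the input bytes into Unicode and encoding them again as UTF-8 on output. The following is a minimal standard-library sketch of that round trip, not a description of BeautifulSoup's internals; the variable names are illustrative.

```python
polish = "ą ć ę ł ń ó ś ź ż"

# Bytes as they would be stored in an ISO-8859-2 document (one byte per character)
raw_latin2 = polish.encode("iso-8859-2")

# Step 1: decode the input bytes to Unicode text
text = raw_latin2.decode("iso-8859-2")

# Step 2: encode the text as UTF-8 for output
raw_utf8 = text.encode("utf-8")

print(len(raw_latin2), len(raw_utf8))
```

The text survives unchanged, but the byte sequences differ, which is why the charset declared in the <meta> tag has to change along with them.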

If you do not want the output in UTF-8, pass the desired encoding to prettify().

print(soup.prettify("latin-1"))

Output

b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n<html>\n <head>\n  <meta content="text/html; charset=latin-1" http-equiv="content-type"/>\n </head>\n <body>\n  &#261; &#263; &#281; &#322; &#324; \xf3 &#347; &#378; &#380; &#260; &#262; &#280; &#321; &#323; \xd3 &#346; &#377; &#379;\n </body>\n</html>\n'

The example above encodes the complete document, but you can also call encode() on any individual element in the soup, just as you would on a Python string. The output below assumes a soup built from a document that contains these <p> and <h1> elements −

soup.p.encode("latin-1")
soup.h1.encode("latin-1")

Output

b'<p>My first paragraph.</p>'
b'<h1>My First Heading</h1>'

Any character that cannot be represented in the encoding you choose is converted into a numeric XML entity reference. Below is one such example −

from bs4 import BeautifulSoup
markup = "<b>\N{SNOWMAN}</b>"
snowman_soup = BeautifulSoup(markup, "html.parser")
tag = snowman_soup.b
print(tag.encode("utf-8"))

Output

b'<b>\xe2\x98\x83</b>'

If you encode the same tag as "latin-1" or "ascii", the snowman is written as "&#9731;", because neither encoding has a representation for that character.

print(tag.encode("latin-1"))
print(tag.encode("ascii"))

Output

b'<b>&#9731;</b>'
b'<b>&#9731;</b>'
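
The same kind of substitution is available in plain Python through the xmlcharrefreplace error handler of str.encode. The sketch below uses only the standard library to show where a reference such as &#9731; comes from; it illustrates the mechanism rather than documenting how BeautifulSoup is implemented, and the variable names are illustrative.

```python
snowman = "\N{SNOWMAN}"  # U+2603, decimal 9731

# UTF-8 can represent the character directly, as three bytes
as_utf8 = snowman.encode("utf-8")

# ASCII and Latin-1 cannot, so the handler substitutes a numeric reference
as_ascii = snowman.encode("ascii", errors="xmlcharrefreplace")
as_latin1 = snowman.encode("latin-1", errors="xmlcharrefreplace")

print(as_utf8)
print(as_ascii)
print(as_latin1)
```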

Unicode, Dammit

Unicode, Dammit is mainly useful when an incoming document is in an unknown encoding (often foreign-language text) and you want to convert it to a known one (Unicode) without needing BeautifulSoup to parse the document.
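
A minimal sketch of standalone use, assuming the UnicodeDammit class exported by the bs4 package. The sample bytes and the list of encodings to try first are illustrative choices, not taken from the text above.

```python
from bs4 import UnicodeDammit

# "Sacré bleu!" stored as Latin-1 bytes; the list names encodings to try first
dammit = UnicodeDammit(b"Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])

# The decoded text, and the encoding that was used to decode it
print(dammit.unicode_markup)
print(dammit.original_encoding)
```

The input must be bytes. If a str is passed there is nothing to detect, and the text is returned as-is.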