Beautiful Soup: A Concise Tutorial

Beautiful Soup - Convert Object to String

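The Beautiful Soup API has three main types of objects: the soup object (BeautifulSoup), the Tag object, and the NavigableString object. Let us find out how to convert each of these objects to a string. In Python, a string is a str object.

We have the following HTML document: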

html = '''
<p>Hello <b>World</b></p>
'''

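Let us pass this string as the argument to the BeautifulSoup constructor. The soup object can then be converted to a string with Python's built-in str() function.

The parsed tree for this HTML string depends on the parser in use. The built-in html.parser does not add <html> and <body> tags.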

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print(str(soup))

<p>Hello <b>World</b></p>

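The html5lib parser, on the other hand, builds the tree after inserting structural tags such as <html>, <head>, and <body>. Note that html5lib is a separate package and has to be installed before it can be used.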

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html5lib')
print(str(soup))

<html><head></head><body><p>Hello <b>World</b></p>
</body></html>
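The difference between the two parsers can be checked side by side. The following is a minimal sketch rather than a definitive recipe: the helper name render is hypothetical, and it assumes html5lib may or may not be installed, so a missing parser is reported instead of raising an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html = "<p>Hello <b>World</b></p>"

def render(markup, parser):
    """Return str(soup) for the given parser, or None if it is unavailable."""
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # Beautiful Soup raises FeatureNotFound when the requested parser is missing
        return None

# Parse the same markup with both parsers discussed above
results = {name: render(html, name) for name in ("html.parser", "html5lib")}

for name, text in results.items():
    print(name, "->", text if text is not None else "(parser not installed)")
```

With html.parser the string comes back unchanged, while html5lib, when available, wraps the fragment in a full document skeleton.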

A Tag object has a string attribute. When the tag contains a single piece of text, it returns that text as a NavigableString object.

tag = soup.find('b')
obj = tag.string
print(type(obj), obj)

<class 'bs4.element.NavigableString'> World
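The built-in str() function used on the soup object also works on a Tag and on a NavigableString. The sketch below illustrates this on the same markup; the variable names are illustrative only, and html.parser is assumed so that no extra package is needed.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>World</b></p>", "html.parser")
b_tag = soup.find("b")

# str() on a Tag gives its markup, including the tag itself
tag_markup = str(b_tag)
print(tag_markup)                    # <b>World</b>

# str() on a NavigableString gives an ordinary Python str
plain_text = str(b_tag.string)
print(type(plain_text), plain_text)  # <class 'str'> World

# <p> has two children (a string and a <b> tag), so its .string is None
print(soup.find("p").string)         # None
```

Converting a NavigableString to a plain str is useful when the text has to outlive the parse tree, because the NavigableString keeps a reference to the tree it came from.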

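A text property is also defined for Tag objects. It returns the text contained in the tag, with all inner tags and attributes stripped out.

Suppose the HTML string is: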

html = '''
   <p>Hello <div id='id'>World</div></p>
'''

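We parse this string and read the text property of the <p> tag. The result depends on the parser: html.parser keeps the <div> nested inside the <p>, whereas html5lib closes the <p> before the <div>, so the output shown here applies to html.parser.

soup = BeautifulSoup(html, 'html.parser')  # re-parse the new HTML string
tag = soup.find('p')
obj = tag.text
print(type(obj), obj)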

<class 'str'> Hello World

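You can also use the get_text() method, which returns a string representing the text inside the tag. The text property is in fact a thin wrapper around this method, so both strip the inner tags and attributes and return a plain str.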

obj = tag.get_text()
print(type(obj), obj)

<class 'str'> Hello World
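Unlike the text property, get_text() accepts arguments that control how the pieces of text are joined. The sketch below is an illustration, assuming html.parser and the simple markup from the first example; it uses the separator and strip parameters of get_text().

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>World</b></p>", "html.parser")
p_tag = soup.find("p")

# With no arguments, get_text() and .text give the same string
same = p_tag.text == p_tag.get_text()
print(same)      # True

# separator is placed between the text pieces; strip=True trims each piece first
joined = p_tag.get_text(separator="|", strip=True)
print(joined)    # Hello|World
```

A visible separator like this makes it easy to see where one text node ends and the next begins, which helps when the extracted text runs words together.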