Beautiful Soup: A Concise Tutorial

Beautiful Soup - Convert Object to String

The Beautiful Soup API has three main types of objects: the soup (BeautifulSoup) object, the Tag object, and the NavigableString object. Let us see how each of them can be converted to a string, which in Python means a str object.

Suppose we have the following HTML document:

html = '''
<p>Hello <b>World</b></p>
'''

Pass this string as the argument to the BeautifulSoup constructor. The resulting soup object can then be converted to a string with Python's built-in str() function.

The parse tree built from this HTML string depends on which parser you use. The built-in html.parser does not add the <html> and <body> tags.

Example

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print(str(soup))

Output

<p>Hello <b>World</b></p>

The html5lib parser, on the other hand, builds the tree after inserting the structural tags that a complete document requires, such as <html>, <head> and <body>.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html5lib')
print(str(soup))

Output

<html><head></head><body><p>Hello <b>World</b></p>
</body></html>
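
Besides str(), a soup object can be serialized in a few other ways. The following is a minimal sketch, assuming the html.parser backend so that nothing beyond bs4 is required; the variable names are illustrative only. It uses decode(), encode() and prettify(), which Beautiful Soup provides on soup and Tag objects.

```python
from bs4 import BeautifulSoup

html = "<p>Hello <b>World</b></p>"
soup = BeautifulSoup(html, "html.parser")

as_str = str(soup)               # str, the same markup that decode() returns
as_decoded = soup.decode()       # explicit method form, also returns str
as_bytes = soup.encode("utf-8")  # bytes, encoded with the given codec
pretty = soup.prettify()         # str, indented with one element per line

print(type(as_str), as_str)
print(type(as_bytes), as_bytes)
print(pretty)
```

Use str() or decode() when you need the markup as text, encode() when writing to a binary file or socket, and prettify() when the output is meant to be read by a person.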

A Tag object has a string attribute, which returns a NavigableString object when the tag's only content is a single piece of text.

tag = soup.find('b')
obj = tag.string
print(type(obj), obj)

Output

<class 'bs4.element.NavigableString'> World
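
A NavigableString is a subclass of str, so it can be passed to str() to obtain a plain Python string that no longer refers to the parse tree. The sketch below, which assumes the html.parser backend and illustrative variable names, shows that conversion and also the case where string returns None because the tag has more than one child.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<p>Hello <b>World</b></p>", "html.parser")

# <b> contains exactly one text node, so .string returns it.
inner = soup.find("b").string
print(isinstance(inner, NavigableString), isinstance(inner, str))

# str() produces an ordinary str detached from the tree.
plain = str(inner)
print(type(plain), plain)

# <p> has two children (a text node and <b>), so .string is None.
outer = soup.find("p").string
print(outer)
```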

Tag objects also have a text property. It returns all the text contained in the tag as a plain str, dropping any inner tags and their attributes.

Suppose the HTML string is:

html = '''
   <p>Hello <div id='id'>World</div></p>
'''

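Parse this string and read the text property of the <p> tag. The example uses html.parser, which keeps the <div> nested inside the <p> as written; a parser such as html5lib, which applies HTML5 nesting rules, closes the <p> before the <div>, so the result would differ.

soup = BeautifulSoup(html, 'html.parser')  # re-parse the new string
tag = soup.find('p')
obj = tag.text
print(type(obj), obj)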

Output

<class 'str'> Hello World

You can also call the get_text() method, which returns the text inside the tag as a str. The text property is in fact a shorthand for calling get_text() with no arguments, so both drop the inner tags and attributes and return the same string.

obj = tag.get_text()
print(type(obj), obj)

Output

<class 'str'> Hello World
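
Unlike the text property, get_text() accepts arguments that control how the pieces of text are joined. The following is a small sketch, assuming the html.parser backend and the first sample document; the "|" separator is an arbitrary choice for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>World</b></p>", "html.parser")
tag = soup.find("p")

# With no arguments the text nodes are concatenated as they are.
default_text = tag.get_text()

# separator is placed between text nodes; strip=True trims each node first.
joined = tag.get_text(separator="|", strip=True)

# stripped_strings yields the same trimmed pieces one at a time.
pieces = list(tag.stripped_strings)

print(repr(default_text))
print(repr(joined))
print(pieces)
```

Passing strip=True is useful when the source markup is indented, since whitespace-only text nodes are skipped rather than joined into the result.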