Beautiful Soup: A Concise Tutorial

Beautiful Soup - get_text() Method

Method Description

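The get_text() method returns only the human-readable text of the entire HTML document or of a given tag. All the child strings are joined with the given separator, which defaults to an empty string.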
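As a minimal sketch of the difference between calling the method on the whole document and on a single tag (the markup and the variable names whole_text and para_text are made up for illustration):

```python
from bs4 import BeautifulSoup

# Illustrative markup with no whitespace between tags.
html = "<div><h1>Title</h1><p>Hello, <b>world</b>!</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Called on the soup object: text of every string in the document.
whole_text = soup.get_text()
# Called on one tag: only the text inside that tag, nested tags included.
para_text = soup.p.get_text()

print(whole_text)  # TitleHello, world!
print(para_text)   # Hello, world!
```

Because the default separator is an empty string, "Title" and "Hello, " run together in the first result.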

Syntax

get_text(separator="", strip=False)

Parameters

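  1. separator − The child strings are joined using this string. Defaults to an empty string.

  2. strip − If True, whitespace is stripped from the start and end of each string before joining, and strings that become empty are dropped. Defaults to False.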
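The two parameters can also be combined. The following is a small sketch under assumed sample markup (the list items and the variable name joined are hypothetical), not one of the numbered examples:

```python
from bs4 import BeautifulSoup

# Illustrative markup with padding inside and between the tags.
html = "<ul>\n  <li> apple </li>\n  <li> banana </li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")

# strip=True trims each string and drops the whitespace-only ones,
# then the remaining strings are joined with the separator.
joined = soup.get_text(separator=", ", strip=True)
print(joined)  # apple, banana
```

With strip=True the separator only appears between strings that still have content, which usually gives cleaner results than a separator alone.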

Return Type

The get_text() method returns a string.

Example 1

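In the following example, the get_text() method removes all the HTML tags. The whitespace between the tags is part of the text, so the printed result also contains blank lines and leading spaces that are not shown in the output below.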

html = '''
<html>
<body>
   <p> The quick, brown fox jumps over a lazy dog.</p>
   <p> DJs flock by when MTV ax quiz prog.</p>
   <p> Junk MTV quiz graced by fox whelps.</p>
   <p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
'''
from bs4 import BeautifulSoup

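# Parse the markup with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")
# Join every string in the document using the default empty separator.
text = soup.get_text()
print(text)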

Output

The quick, brown fox jumps over a lazy dog.
DJs flock by when MTV ax quiz prog.
Junk MTV quiz graced by fox whelps.
Bawds jog, flick quartz, vex nymphs.

Example 2

In the following example, we set the separator parameter of the get_text() method to "#".

html = '''
   <p>The quick, brown fox jumps over a lazy dog.</p>
   <p>DJs flock by when MTV ax quiz prog.</p>
   <p>Junk MTV quiz graced by fox whelps.</p>
   <p>Bawds jog, flick quartz, vex nymphs.</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
# Join the strings with "#" instead of the default empty string.
text = soup.get_text(separator='#')
print(text)

Output

#The quick, brown fox jumps over a lazy dog.#
#DJs flock by when MTV ax quiz prog.#
#Junk MTV quiz graced by fox whelps.#
#Bawds jog, flick quartz, vex nymphs.#
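The "#" shows up on both sides of each sentence because the newline between two tags is itself a string that gets joined. A short sketch with assumed two-paragraph markup makes this visible using repr(); the names pieces and text are illustrative:

```python
from bs4 import BeautifulSoup

# Two paragraphs separated by a single newline.
html = "<p>one</p>\n<p>two</p>"
soup = BeautifulSoup(html, "html.parser")

# .strings yields every string in the tree, including the newline.
pieces = list(soup.strings)
print(pieces)      # ['one', '\n', 'two']

# The separator is placed between all three strings.
text = soup.get_text(separator="#")
print(repr(text))  # 'one#\n#two'
```

Passing strip=True as well would drop the whitespace-only string and leave a single "#" between the sentences.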

Example 3

Let us check the effect of setting the strip parameter to True. It is False by default.

html = '''
   <p>The quick, brown fox jumps over a lazy dog.</p>
   <p>DJs flock by when MTV ax quiz prog.</p>
   <p>Junk MTV quiz graced by fox whelps.</p>
   <p>Bawds jog, flick quartz, vex nymphs.</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
# Trim each string and drop whitespace-only ones before joining.
text = soup.get_text(strip=True)
print(text)

Output

The quick, brown fox jumps over a lazy dog.DJs flock by when MTV ax quiz prog.Junk MTV quiz graced by fox whelps.Bawds jog, flick quartz, vex nymphs.