Beautiful Soup 简明教程

Beautiful Soup - Convert HTML to Text

Beautiful Soup 库等网络爬虫的一个重要但经常需要的应用程序是从 HTML 脚本中提取文本。您可能需要丢弃所有标签以及与其关联的属性(如果有),并从文档中分离原始文本。Beautiful Soup 中的 get_text() 方法适用于此目的。

这里有一个演示 get_text() 方法用法的基本示例。您可以通过删除所有 HTML 标签来获取 HTML 文档中的所有文本。

Example

html = '''
<html>
   <body>
      <p> The quick, brown fox jumps over a lazy dog.</p>
      <p> DJs flock by when MTV ax quiz prog.</p>
      <p> Junk MTV quiz graced by fox whelps.</p>
      <p> Bawds jog, flick quartz, vex nymphs.</p>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()
print(text)

Output

The quick, brown fox jumps over a lazy dog.
DJs flock by when MTV ax quiz prog.
Junk MTV quiz graced by fox whelps.
Bawds jog, flick quartz, vex nymphs.

get_text() 方法有一个可选的分隔符参数。在以下示例中,我们将 get_text() 方法的分隔符参数指定为“#”。

html = '''
   <p>The quick, brown fox jumps over a lazy dog.</p>
   <p>DJs flock by when MTV ax quiz prog.</p>
   <p>Junk MTV quiz graced by fox whelps.</p>
   <p>Bawds jog, flick quartz, vex nymphs.</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(separator='#')
print(text)

Output

#The quick, brown fox jumps over a lazy dog.#
#DJs flock by when MTV ax quiz prog.#
#Junk MTV quiz graced by fox whelps.#
#Bawds jog, flick quartz, vex nymphs.#

get_text() 方法还有另一个参数 strip,可以为 True 或 False。让我们检查 strip 参数设置真时的效果。默认情况下它为 False。

html = '''
   <p>The quick, brown fox jumps over a lazy dog.</p>
   <p>DJs flock by when MTV ax quiz prog.</p>
   <p>Junk MTV quiz graced by fox whelps.</p>
   <p>Bawds jog, flick quartz, vex nymphs.</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(strip=True)
print(text)

Output

The quick, brown fox jumps over a lazy dog.DJs flock by when MTV ax quiz prog.Junk MTV quiz graced by fox whelps.Bawds jog, flick quartz, vex nymphs.