Beautiful Soup 简明教程

Beautiful Soup - Convert HTML to Text

Beautiful Soup 库等网络爬虫的一个重要但经常需要的应用程序是从 HTML 脚本中提取文本。您可能需要丢弃所有标签以及与其关联的属性(如果有),并从文档中分离原始文本。Beautiful Soup 中的 get_text() 方法适用于此目的。

One of the important and a frequently required application of a web scraper such as Beautiful Soup library is to extract text from a HTML script. You may need to discard all the tags along with the attributes associated if any with each tag and separate out the raw text in the document. The get_text() method in Beautiful Soup is suitable for this purpose.

这里有一个演示 get_text() 方法用法的基本示例。您可以通过删除所有 HTML 标签来获取 HTML 文档中的所有文本。

Here is a basic example demonstrating the usage of get_text() method. You get all the text from HTML document by removing all the HTML tags.

Example

html = '''
<html>
   <body>
      <p> The quick, brown fox jumps over a lazy dog.</p>
      <p> DJs flock by when MTV ax quiz prog.</p>
      <p> Junk MTV quiz graced by fox whelps.</p>
      <p> Bawds jog, flick quartz, vex nymphs.</p>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()
print(text)

Output

The quick, brown fox jumps over a lazy dog.
DJs flock by when MTV ax quiz prog.
Junk MTV quiz graced by fox whelps.
Bawds jog, flick quartz, vex nymphs.

get_text() 方法有一个可选的分隔符参数。在以下示例中,我们将 get_text() 方法的分隔符参数指定为“#”。

The get_text() method has an optional separator argument. In the following example, we specify the separator argument of get_text() method as '#'.

html = '''
   <p>The quick, brown fox jumps over a lazy dog.</p>
   <p>DJs flock by when MTV ax quiz prog.</p>
   <p>Junk MTV quiz graced by fox whelps.</p>
   <p>Bawds jog, flick quartz, vex nymphs.</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(separator='#')
print(text)

Output

#The quick, brown fox jumps over a lazy dog.#
#DJs flock by when MTV ax quiz prog.#
#Junk MTV quiz graced by fox whelps.#
#Bawds jog, flick quartz, vex nymphs.#

get_text() 方法还有另一个参数 strip,可以为 True 或 False。让我们检查 strip 参数设置真时的效果。默认情况下它为 False。

The get_text() method has another argument strip, which can be True or False. Let us check the effect of strip parameter when it is set to True. By default it is False.

html = '''
   <p>The quick, brown fox jumps over a lazy dog.</p>
   <p>DJs flock by when MTV ax quiz prog.</p>
   <p>Junk MTV quiz graced by fox whelps.</p>
   <p>Bawds jog, flick quartz, vex nymphs.</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(strip=True)
print(text)

Output

The quick, brown fox jumps over a lazy dog.DJs flock by when MTV ax quiz prog.Junk MTV quiz graced by fox whelps.Bawds jog, flick quartz, vex nymphs.