Beautiful Soup: A Concise Tutorial

Beautiful Soup - Find all Headings

In this chapter, we explore how to find all the heading elements in an HTML document with Beautiful Soup. HTML defines six heading levels, from h1 to h6, which browsers render by default in decreasing font size. Each level suits a different part of the page, such as the main heading, a section heading, or a topic heading. We use the find_all() method in two different ways to extract all the heading elements from an HTML document.

We shall use the following HTML document (saved as index.html) in the code examples in this chapter:

<html>
   <head>
      <title>BeautifulSoup - Scraping Headings</title>
   </head>
   <body>
      <h2>Scraping Headings</h2>
      <b>The quick, brown fox jumps over a lazy dog.</b>
      <h3>Paragraph Heading</h3>
      <p>DJs flock by when MTV ax quiz prog.</p>
      <h3>List heading</h3>
      <ul>
         <li>Junk MTV quiz graced by fox whelps.</li>
         <li>Bawds jog, flick quartz, vex nymphs.</li>
      </ul>
   </body>
</html>

Example 1

In this approach, we collect all the tags in the parsed tree and check whether the name of each tag appears in a list of all heading tag names.

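from bs4 import BeautifulSoup

# Parse the sample page; the with block closes the file afterwards
with open('index.html') as fp:
    soup = BeautifulSoup(fp, 'html.parser')

headings = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']
# find_all() with no arguments returns every tag in the document
tags = soup.find_all()
# Keep the name and text of each tag whose name is a heading name
heads = [(tag.name, tag.get_text()) for tag in tags if tag.name in headings]
print(heads)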

Here, headings is a list of the heading tag names h1 to h6. If the name of a tag is one of these, the tag's name and its text are collected in a list named heads.

Output

[('h2', 'Scraping Headings'), ('h3', 'Paragraph Heading'), ('h3', 'List heading')]
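
The name check in Example 1 can also be left to find_all() itself, which accepts a list of tag names and returns the tags matching any of them. The following is a minimal sketch of that variant, not part of the original example; it assumes the sample markup is held in a string (the html and heading_names variables are introduced here for illustration) so that it runs without index.html.

```python
from bs4 import BeautifulSoup

# Inline copy of the body of index.html, so the sketch is self-contained
html = """
<html>
   <body>
      <h2>Scraping Headings</h2>
      <b>The quick, brown fox jumps over a lazy dog.</b>
      <h3>Paragraph Heading</h3>
      <p>DJs flock by when MTV ax quiz prog.</p>
      <h3>List heading</h3>
   </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

heading_names = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']
# A list argument matches any tag whose name is in the list
heads = [(tag.name, tag.get_text()) for tag in soup.find_all(heading_names)]
print(heads)
```

This avoids building a list of every tag in the document before filtering, which may matter on large pages.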

Example 2

You can also pass a regular expression to the find_all() method. Take a look at the following pattern.

re.compile('^h[1-6]$')
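
A quick way to see what this pattern accepts is to try it against a few names using only the re module. The sketch below is illustrative: the names list is made up for the test, and entries such as h7, h10 and th1 are not real HTML tags. It uses search() on the assumption, based on the Beautiful Soup documentation, that the library applies compiled patterns to tag names in the same way.

```python
import re

pattern = re.compile('^h[1-6]$')

# Sample names for illustration; h7, h10 and th1 are not valid HTML tags
names = ['html', 'head', 'h1', 'h2', 'h6', 'h7', 'h10', 'th1', 'hr']

# ^ and $ anchor the match to the whole name, so only h1..h6 pass
matched = [name for name in names if pattern.search(name)]
print(matched)
```

Only h1, h2 and h6 are printed; every other name either lacks the digit, uses a digit outside 1 to 6, or has extra characters before or after.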

This pattern matches tag names that consist of the letter h followed by exactly one digit from 1 to 6, with nothing after the digit. Let us use it as the argument to the find_all() method in the code below:

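from bs4 import BeautifulSoup
import re

# Parse the sample page; the with block closes the file afterwards
with open('index.html') as fp:
    soup = BeautifulSoup(fp, 'html.parser')

# Match tags named h1 through h6 only
tags = soup.find_all(re.compile('^h[1-6]$'))
print(tags)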

Output

[<h2>Scraping Headings</h2>, <h3>Paragraph Heading</h3>, <h3>List heading</h3>]
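
As a further alternative to the two examples above, the headings can be selected with a CSS selector list through the select() method. This is a sketch under the assumption that the soupsieve package, which normally installs alongside Beautiful Soup, is available; the html and texts variables are introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Inline copy of the body of index.html, so the sketch is self-contained
html = """
<html>
   <body>
      <h2>Scraping Headings</h2>
      <h3>Paragraph Heading</h3>
      <p>DJs flock by when MTV ax quiz prog.</p>
      <h3>List heading</h3>
   </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# A comma-separated selector list matches any of the six heading tags
tags = soup.select('h1, h2, h3, h4, h5, h6')
texts = [tag.get_text() for tag in tags]
print(texts)
```

The selector form is convenient when the headings need to be narrowed further, for example to those inside a particular container.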