Beautiful Soup: A Concise Tutorial

Beautiful Soup - Scraping List from HTML

Web pages often present important data as ordered or unordered lists. With Beautiful Soup, we can extract the HTML list elements and bring the data into Python objects, ready to be stored in a database for further analysis. In this chapter, we use tag-name access together with the select() and find_all() methods to scrape list data from an HTML document.

The easiest way to search a parse tree is to look up a tag by its name. The expression soup.<tag> returns the first tag with that name, along with everything nested inside it.

HTML provides the <ol> and <ul> tags for ordered and unordered lists. We can fetch the contents of these tags just as we would for any other tag.

We shall use the following HTML document, saved as index.html:

<html>
   <body>
      <h2>Departmentwise Employees</h2>
      <ul id="dept">
      <li>Accounts</li>
         <ul id='acc'>
         <li>Anand</li>
         <li>Mahesh</li>
         </ul>
      <li>HR</li>
         <ol id="HR">
         <li>Rani</li>
         <li>Ankita</li>
         </ol>
      </ul>
   </body>
</html>

Scraping lists by Tag

The HTML document above has a top-level <ul> list that contains another <ul> tag and an <ol> tag. We first parse the document into a soup object, then retrieve the first <ul> as a Tag object through soup.ul.

Example

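from bs4 import BeautifulSoup

# Parse the saved document; the file is closed when the with block exits.
with open('index.html') as fp:
    soup = BeautifulSoup(fp, 'html.parser')

lst = soup.ul  # the first <ul> tag in the document
print(lst)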

Output

<ul id="dept">
<li>Accounts</li>
<ul id="acc">
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ol id="HR">
<li>Rani</li>
<li>Ankita</li>
</ol>
</ul>

To get the nested ordered list instead, point lst at the first <ol> element.

lst = soup.ol

Output

<ol id="HR">
<li>Rani</li>
<li>Ankita</li>
</ol>
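Once a list tag is in hand, its items can be turned into plain Python strings. The following is a minimal sketch rather than part of the original example: it embeds a compact copy of the sample document in a string (the name HTML is an assumption made here so the code runs without index.html) and collects the text of each <li> inside the first <ol>.

```python
from bs4 import BeautifulSoup

# Inline, compact copy of the sample document (assumption: used instead of index.html).
HTML = """
<html><body>
<h2>Departmentwise Employees</h2>
<ul id="dept">
<li>Accounts</li>
<ul id="acc"><li>Anand</li><li>Mahesh</li></ul>
<li>HR</li>
<ol id="HR"><li>Rani</li><li>Ankita</li></ol>
</ul>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# soup.ol is the first <ol>; each <li> inside it holds one employee name.
hr_names = [li.get_text(strip=True) for li in soup.ol.find_all("li")]
print(hr_names)  # ['Rani', 'Ankita']

# Attribute-style access returns None when the document has no such tag.
print(soup.table)  # None
```

Because soup.<tag> returns None for a missing tag, it is worth checking the result before chaining further calls on it.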

Using select() method

The select() method finds elements with a CSS selector and returns every match as a list. A bare tag name is itself a valid CSS selector, so here we can pass "ol" to select(). The select_one() method is also available; it returns only the first element that matches the selector.

Example

from bs4 import BeautifulSoup

with open('index.html') as fp:
    soup = BeautifulSoup(fp, 'html.parser')

lst = soup.select("ol")  # list of every <ol> tag in the document
print(lst)

Output

[<ol id="HR">
<li>Rani</li>
<li>Ankita</li>
</ol>]
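Since select() accepts CSS selectors, the id attributes in the sample document can be targeted directly. The sketch below is illustrative only and assumes an inline copy of the document (the name HTML is introduced here); it shows select_one() alongside an id selector and a child combinator.

```python
from bs4 import BeautifulSoup

# Inline, compact copy of the sample document (assumption: used instead of index.html).
HTML = """
<html><body>
<h2>Departmentwise Employees</h2>
<ul id="dept">
<li>Accounts</li>
<ul id="acc"><li>Anand</li><li>Mahesh</li></ul>
<li>HR</li>
<ol id="HR"><li>Rani</li><li>Ankita</li></ol>
</ul>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# select_one() returns a single Tag (or None), not a list.
first_ol = soup.select_one("ol")
print(first_ol["id"])  # HR

# "#acc li" matches every <li> nested inside the element with id="acc".
acc_names = [li.get_text() for li in soup.select("#acc li")]
print(acc_names)  # ['Anand', 'Mahesh']

# "ul#dept > li" matches only the direct <li> children of the top-level list.
departments = [li.get_text() for li in soup.select("ul#dept > li")]
print(departments)  # ['Accounts', 'HR']
```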

Using find_all() method

The find() and find_all() methods are more general. They accept several kinds of filters, such as a tag name, attribute values, or a string. In this case, we want to fetch the contents of a list tag.

In the following code, find_all() returns a list of all <ul> tags in the document: the top-level list and the one nested inside it.

Example

from bs4 import BeautifulSoup

with open('index.html') as fp:
    soup = BeautifulSoup(fp, 'html.parser')

lst = soup.find_all("ul")  # list of every <ul> tag in the document
print(lst)
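To make the result of find_all("ul") easier to inspect, the sketch below prints only the id of each match instead of the full markup. It is a hedged illustration that assumes an inline copy of the sample document (the name HTML is introduced here), and it also passes a list of tag names to match both kinds of list at once.

```python
from bs4 import BeautifulSoup

# Inline, compact copy of the sample document (assumption: used instead of index.html).
HTML = """
<html><body>
<h2>Departmentwise Employees</h2>
<ul id="dept">
<li>Accounts</li>
<ul id="acc"><li>Anand</li><li>Mahesh</li></ul>
<li>HR</li>
<ol id="HR"><li>Rani</li><li>Ankita</li></ol>
</ul>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find_all() returns matches in document order, including nested ones.
ul_ids = [ul["id"] for ul in soup.find_all("ul")]
print(ul_ids)  # ['dept', 'acc']

# A list of names matches any of the listed tags.
list_ids = [tag["id"] for tag in soup.find_all(["ul", "ol"])]
print(list_ids)  # ['dept', 'acc', 'HR']
```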

We can refine the search by also passing an attrs dictionary. In our HTML document, each <ul> and <ol> tag has its own id attribute, so let us fetch the <ul> element with id="acc".

Example

from bs4 import BeautifulSoup

with open('index.html') as fp:
    soup = BeautifulSoup(fp, 'html.parser')

# The dictionary is the attrs filter: only <ul> tags with id="acc" match.
lst = soup.find_all("ul", {"id": "acc"})
print(lst)

Output

[<ul id="acc">
<li>Anand</li>
<li>Mahesh</li>
</ul>]
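When only one element is expected, as with an id, find() may be more convenient because it returns the Tag itself rather than a one-element list. The following sketch assumes an inline copy of the sample document (the name HTML is introduced here) and shows the keyword form id="acc" next to the equivalent attrs dictionary.

```python
from bs4 import BeautifulSoup

# Inline, compact copy of the sample document (assumption: used instead of index.html).
HTML = """
<html><body>
<h2>Departmentwise Employees</h2>
<ul id="dept">
<li>Accounts</li>
<ul id="acc"><li>Anand</li><li>Mahesh</li></ul>
<li>HR</li>
<ol id="HR"><li>Rani</li><li>Ankita</li></ol>
</ul>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Keyword form: find() returns the first match as a Tag.
acc = soup.find("ul", id="acc")
acc_names = [li.get_text() for li in acc.find_all("li")]
print(acc_names)  # ['Anand', 'Mahesh']

# Equivalent dictionary form, here used for the ordered list.
hr = soup.find("ol", attrs={"id": "HR"})
hr_names = [li.get_text() for li in hr.find_all("li")]
print(hr_names)  # ['Rani', 'Ankita']

# find() returns None when nothing matches.
missing = soup.find("ul", id="sales")
print(missing)  # None
```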

Here is another example. We collect all <li> elements whose inner text starts with 'A'. The find_all() method takes a keyword argument named string, which can be a function. Beautiful Soup calls startingwith() with the string of each <li> tag and keeps the tag when the function returns True.

Example

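from bs4 import BeautifulSoup

def startingwith(text):
    # Called with the string of each <li> tag; returning True keeps the tag.
    return text is not None and text.startswith('A')

with open('index.html') as fp:
    soup = BeautifulSoup(fp, 'html.parser')

lst = soup.find_all('li', string=startingwith)
print(lst)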

Output

[<li>Accounts</li>, <li>Anand</li>, <li>Ankita</li>]
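The string argument is not limited to named functions. As a hedged sketch, assuming an inline copy of the sample document (the name HTML is introduced here), the same filter can be written with a compiled regular expression, and a lambda can express other conditions such as a suffix test.

```python
import re

from bs4 import BeautifulSoup

# Inline, compact copy of the sample document (assumption: used instead of index.html).
HTML = """
<html><body>
<h2>Departmentwise Employees</h2>
<ul id="dept">
<li>Accounts</li>
<ul id="acc"><li>Anand</li><li>Mahesh</li></ul>
<li>HR</li>
<ol id="HR"><li>Rani</li><li>Ankita</li></ol>
</ul>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# A compiled pattern is matched against each tag's string; ^A anchors it to the start.
starts_with_a = [li.get_text() for li in soup.find_all("li", string=re.compile(r"^A"))]
print(starts_with_a)  # ['Accounts', 'Anand', 'Ankita']

# A lambda works too; the None check guards against tags without a single string.
ends_with_a = [
    li.get_text()
    for li in soup.find_all("li", string=lambda s: s is not None and s.endswith("a"))
]
print(ends_with_a)  # ['Ankita']
```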