Beautiful Soup: A Concise Tutorial
Beautiful Soup - Scraping Paragraphs from HTML
One of the most frequently used tags in an HTML document is the <p> tag, which marks a paragraph of text. With Beautiful Soup, you can easily extract paragraphs from the parsed document tree. This chapter discusses the following ways of scraping paragraphs with the BeautifulSoup library:
- Scraping HTML paragraph with <p> tag
- Scraping HTML paragraph with find_all() method
- Scraping HTML paragraph with select() method
We shall use the following HTML document for these exercises (the examples read it from a file named index.html):
<html>
<head>
<title>BeautifulSoup - Scraping Paragraph</title>
</head>
<body>
<p id='para1'>The quick, brown fox jumps over a lazy dog.</p>
<h2>Hello</h2>
<p>DJs flock by when MTV ax quiz prog.</p>
<p>Junk MTV quiz graced by fox whelps.</p>
<p>Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
Scraping by <p> tag
The easiest way to search a parse tree is to look up a tag by its name. The expression soup.p refers to the first <p> tag in the parsed document.
para = soup.p
To fetch the subsequent <p> tags, you can run a loop that moves from one <p> tag to the next until no <p> tags remain in the soup object, printing the prettified output of each paragraph tag along the way.
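The loop described above can be sketched as follows. This is one possible version, not necessarily the original program: it assumes find_next('p') is used to step from one paragraph to the next, and it parses an inline copy of the sample document (the variable names html and texts are illustrative) so that it runs without index.html.

```python
from bs4 import BeautifulSoup

# Inline copy of the sample document, so the sketch does not need index.html.
html = """<html>
<head>
<title>BeautifulSoup - Scraping Paragraph</title>
</head>
<body>
<p id='para1'>The quick, brown fox jumps over a lazy dog.</p>
<h2>Hello</h2>
<p>DJs flock by when MTV ax quiz prog.</p>
<p>Junk MTV quiz graced by fox whelps.</p>
<p>Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>"""

soup = BeautifulSoup(html, 'html.parser')

# soup.p is the first <p> tag. find_next('p') returns the next <p> tag in
# document order, or None once there are no more.
texts = []
para = soup.p
while para is not None:
    print(para.prettify())
    texts.append(para.get_text())
    para = para.find_next('p')
```

The loop ends when find_next() returns None, so it visits each of the four paragraphs once.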
Using find_all() method
The find_all() method is more comprehensive. You can pass it various types of filters, such as a tag name, attributes, or a string. In this case, we want to fetch the contents of the <p> tags.
In the following code, the find_all() method returns a list of all the <p> elements in the document.
Example
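from bs4 import BeautifulSoup
# Parse the sample document; the with block closes the file afterwards
with open('index.html') as fp:
    soup = BeautifulSoup(fp, 'html.parser')
# find_all('p') returns every <p> tag in document order
paras = soup.find_all('p')
for para in paras:
    print(para.prettify())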
Output
<p id="para1">
 The quick, brown fox jumps over a lazy dog.
</p>
<p>
 DJs flock by when MTV ax quiz prog.
</p>
<p>
 Junk MTV quiz graced by fox whelps.
</p>
<p>
 Bawds jog, flick quartz, vex nymphs.
</p>
There is another way to find all the <p> tags. First obtain a list of all tags by calling find_all() with no arguments, then keep only the tags whose Tag.name equals 'p'.
Example
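from bs4 import BeautifulSoup
with open('index.html') as fp:
    soup = BeautifulSoup(fp, 'html.parser')
# find_all() with no arguments returns every tag in the document
tags = soup.find_all()
# Keep the contents (list of children) of each tag named 'p'
paras = [tag.contents for tag in tags if tag.name == 'p']
print(paras)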
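To make the result of the name-based filter concrete, here is a small sketch run on a shortened inline copy of the sample document (the string html and the variables names, contents and texts are illustrative, not part of the original example). It also shows get_text(), which returns a plain string where contents returns a list of child nodes.

```python
from bs4 import BeautifulSoup

# Shortened inline copy of the sample document (illustrative).
html = """<body>
<p id='para1'>The quick, brown fox jumps over a lazy dog.</p>
<h2>Hello</h2>
<p>DJs flock by when MTV ax quiz prog.</p>
</body>"""

soup = BeautifulSoup(html, 'html.parser')

# find_all() with no arguments returns every tag, not only paragraphs.
names = [tag.name for tag in soup.find_all()]
print(names)

# tag.contents is a list of children, so each paragraph yields a one-item list.
contents = [tag.contents for tag in soup.find_all() if tag.name == 'p']
print(contents)

# get_text() gives the paragraph text as a plain string instead.
texts = [tag.get_text() for tag in soup.find_all('p')]
print(texts)
```

Filtering by name after the fact gives the same tags as find_all('p'); passing the name directly is shorter and avoids building the full list of tags first.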
The find_all() method also has an attrs parameter. It is useful when you want to extract only the <p> tags that have specific attributes. For example, in the given document, the first <p> element has id='para1'. To fetch it, pass the attribute to find_all() as follows:
paras = soup.find_all('p', attrs={'id':'para1'})
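As a quick check of the attrs filter, the following sketch runs it on a two-paragraph inline fragment (the string html and the variable names are illustrative). It also shows the keyword form id='para1', which Beautiful Soup accepts as an attribute filter in the same way.

```python
from bs4 import BeautifulSoup

# Two-paragraph fragment taken from the sample document (illustrative).
html = """<p id='para1'>The quick, brown fox jumps over a lazy dog.</p>
<p>DJs flock by when MTV ax quiz prog.</p>"""

soup = BeautifulSoup(html, 'html.parser')

# Filter with an attrs dictionary, as in the text above.
by_attrs = soup.find_all('p', attrs={'id': 'para1'})

# The same filter written as a keyword argument.
by_keyword = soup.find_all('p', id='para1')

print(by_attrs)
```

Both calls return a list, even when only one tag matches, so index it (or use find()) to get the tag itself.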
Using select() method
The select() method obtains elements using CSS selectors. A bare tag name is itself a valid CSS selector, so passing 'p' to select() returns all the <p> tags. The select_one() method is also available; it returns only the first <p> tag that matches.
Example
from bs4 import BeautifulSoup
with open('index.html') as fp:
    soup = BeautifulSoup(fp, 'html.parser')
# 'p' is a CSS type selector that matches every <p> tag
paras = soup.select('p')
print(paras)
Output
[
<p id="para1">The quick, brown fox jumps over a lazy dog.</p>,
<p>DJs flock by when MTV ax quiz prog.</p>,
<p>Junk MTV quiz graced by fox whelps.</p>,
<p>Bawds jog, flick quartz, vex nymphs.</p>
]
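The select_one() method mentioned above can be sketched in the same way. Because select() takes a CSS selector, an id can also be written into the selector itself, as in 'p#para1'. The inline fragment and the variable names first and with_id are illustrative.

```python
from bs4 import BeautifulSoup

# Two-paragraph fragment taken from the sample document (illustrative).
html = """<p id='para1'>The quick, brown fox jumps over a lazy dog.</p>
<p>DJs flock by when MTV ax quiz prog.</p>"""

soup = BeautifulSoup(html, 'html.parser')

# select_one() returns the first matching tag, or None if nothing matches.
first = soup.select_one('p')
print(first)

# 'p#para1' matches only <p> tags whose id is para1; select() returns a list.
with_id = soup.select('p#para1')
print(with_id)
```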
To filter <p> tags by a certain id, use a for loop as follows: