Beautiful Soup: A Concise Tutorial
Beautiful Soup - Parsing a Section of a Document
Let's say you want to use Beautiful Soup to look at only the <a> tags in a document. Normally, you would parse the whole tree and call the find_all() method with the required tag as the argument.
soup = BeautifulSoup(fp, "html.parser")
tags = soup.find_all('a')
But that is time consuming and uses more memory than necessary, because the whole document is parsed even though only part of it is needed. Instead, you can create an object of the SoupStrainer class and pass it as the value of the parse_only argument to the BeautifulSoup constructor.
A SoupStrainer tells BeautifulSoup which parts to extract, and the parse tree consists of only those elements. If the information you need is confined to a specific portion of the HTML, this speeds up both parsing and any later searches.
product = SoupStrainer('div', {'id': 'products_list'})
soup = BeautifulSoup(html, 'html.parser', parse_only=product)
The lines above parse only the <div> element whose id is products_list. On a product site, this might be the element that contains the product titles.
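As a minimal sketch of the idea above, the following runs the same strainer against a small page. The HTML and the names in it (header, footer, the h2 titles) are invented for illustration and do not come from any real site.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page: only the div with id="products_list" is of interest.
html = """
<html><body>
<div id="header"><a href="/">Home</a></div>
<div id="products_list">
  <h2 class="title">Laptop</h2>
  <h2 class="title">Phone</h2>
</div>
<div id="footer"><p>Contact us</p></div>
</body></html>
"""

# Keep only the matching div (and everything inside it) while parsing.
product = SoupStrainer("div", {"id": "products_list"})
soup = BeautifulSoup(html, "html.parser", parse_only=product)

titles = [h2.get_text() for h2 in soup.find_all("h2")]
print(titles)                    # ['Laptop', 'Phone']

# Elements outside the strained div never enter the tree.
print(soup.find(id="header"))    # None
```

Because the header and footer are skipped during parsing, later searches on soup only ever see the product list.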
Similarly, we can use other SoupStrainer objects to parse specific information out of an HTML document. Below are some examples:
Example
from bs4 import BeautifulSoup, SoupStrainer
#Only "a" tags
only_a_tags = SoupStrainer("a")
#Will parse only the below mentioned "ids".
parse_only = SoupStrainer(id=["first", "third", "my_unique_id"])
soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)
#Parse only strings shorter than 10 characters
def is_short_string(string):
    return len(string) < 10
only_short_strings = SoupStrainer(string=is_short_string)
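To see what two of these strainers actually keep, here is a small runnable sketch. The sample document my_document and its ids are assumptions made up for this illustration.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical document, invented for this illustration.
my_document = """
<p id="first">Intro <a href="/one">one</a></p>
<p id="second">Middle <a href="/two">two</a></p>
<p id="third">End</p>
"""

# Strainer 1: keep only <a> tags, wherever they are nested.
only_a_tags = SoupStrainer("a")
links = BeautifulSoup(my_document, "html.parser", parse_only=only_a_tags)
hrefs = [a["href"] for a in links.find_all("a")]
print(hrefs)   # ['/one', '/two']

# Strainer 2: keep only tags whose id is in the given list.
by_id = SoupStrainer(id=["first", "third"])
picked = BeautifulSoup(my_document, "html.parser", parse_only=by_id)
ids = [p["id"] for p in picked.find_all("p")]
print(ids)     # ['first', 'third']
```

Note that a tag accepted by the strainer is kept together with its children, so the first paragraph still contains its link in the second tree.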
The SoupStrainer class takes the same arguments as a typical method from Searching the tree: name, attrs, string (called text in older releases), and **kwargs.
Note that this feature won't work if you're using the html5lib parser, because in that case the whole document is parsed no matter what. Hence, you should use either the built-in html.parser or the lxml parser.
You can also pass a SoupStrainer into any of the methods covered in Searching the tree. In that case the whole document is still parsed, and the strainer only acts as the search filter.
from bs4 import BeautifulSoup, SoupStrainer
a_tags = SoupStrainer("a")
soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all(a_tags)
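A runnable version of this snippet might look like the sketch below. The one-line html_doc is a made-up sample, and the comparison with find_all("a") is included only to illustrate that the strainer acts as an ordinary search filter here.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample markup for this illustration.
html_doc = '<p>See <a href="/a">A</a> and <a href="/b">B</a>.</p>'

# No parse_only here: the full tree is built first.
soup = BeautifulSoup(html_doc, "html.parser")

# The strainer is then used as the search criterion.
a_tags = SoupStrainer("a")
found = soup.find_all(a_tags)
hrefs = [tag["href"] for tag in found]
print(hrefs)                         # ['/a', '/b']

# Same result as searching by tag name directly.
print(found == soup.find_all("a"))   # True
```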