Beautiful Soup Concise Tutorial

Beautiful Soup - Find Element using CSS Selectors

In the Beautiful Soup library, the select() method is an important tool for scraping HTML/XML documents. Like find() and the other find_*() methods, select() locates elements that satisfy a given criterion. The difference is that the find_*() methods search for PageElements by tag name and attributes, whereas select() searches the document tree using a CSS selector.

Beautiful Soup also has a select_one() method. The difference between the two is that select() returns a ResultSet of all the elements that match the CSS selector, whereas select_one() returns only the first matching element.

Prior to Beautiful Soup version 4.7, the select() method supported only the most common CSS selectors. In version 4.7, Beautiful Soup was integrated with the Soup Sieve CSS selector library, so many more selectors can now be used. In version 4.12, a .css property was added alongside the existing convenience methods select() and select_one(). The parameters of the select() method are as follows −

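select(selector, namespaces=None, limit=None, **kwargs)

selector − A string containing a CSS selector.

namespaces − A dictionary mapping the namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup uses the prefixes it encountered while parsing the document.

limit − After finding this number of results, stop looking. Pass it by keyword (limit=...), because the second positional parameter is namespaces.

kwargs − Keyword arguments to be passed on to Soup Sieve.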

Setting the limit parameter to 1 makes select() stop at the first match, just as select_one() does, but the return types differ: select() still returns a ResultSet (containing at most one Tag object), while select_one() returns a single Tag object, or None if nothing matches.
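The difference in return types can be seen in a minimal sketch such as the following. The markup and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
markup = '<ul><li>Java</li><li>Python</li><li>C++</li></ul>'
soup = BeautifulSoup(markup, 'html.parser')

all_items = soup.select('li')            # ResultSet with every match
first_two = soup.select('li', limit=2)   # stops after two matches
first_item = soup.select_one('li')       # a single Tag
missing = soup.select_one('table')       # None, because nothing matches

print(len(all_items))
print([li.get_text() for li in first_two])
print(first_item.get_text())
print(missing)
```

Because select_one() can return None, it is usually worth checking the result before calling methods on it.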

Soup Sieve Library

Soup Sieve is a CSS selector library. It has been integrated with Beautiful Soup 4, so it is installed along with the Beautiful Soup package. It provides the ability to select, match, and filter the tags of a document tree using modern CSS selectors. Soup Sieve currently implements most of the CSS selectors from the CSS level 1 specification up to CSS level 4, except for some that are not yet implemented.
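Soup Sieve can also be called directly. The following is a small sketch of its select(), match(), and filter() functions, assuming the soupsieve package is importable (it normally is when Beautiful Soup was installed with pip). The markup and the "script" class are invented for the example.

```python
import soupsieve as sv
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
markup = '<div id="Languages"><p>Java</p><p class="script">Python</p><p>C++</p></div>'
soup = BeautifulSoup(markup, 'html.parser')
paragraphs = soup.find_all('p')

# select: find the descendants of a tag that match the selector
selected = sv.select('p.script', soup)

# match: test whether one particular tag matches the selector
is_match = sv.match('p.script', paragraphs[1])

# filter: keep only the tags in an iterable that match the selector
filtered = sv.filter('p:not(.script)', paragraphs)

print([tag.get_text() for tag in selected])
print(is_match)
print([tag.get_text() for tag in filtered])
```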

Soup Sieve supports several types of CSS selectors. The basic CSS selectors are −

Type selector

It matches elements by node name. For example −

tags = soup.select('div')

Example

from bs4 import BeautifulSoup

markup = '''
   <div id="Languages">
      <p>Java</p> <p>Python</p> <p>C++</p>
   </div>
'''
soup = BeautifulSoup(markup, 'html.parser')

tags = soup.select('div')
print (tags)

Output

[<div id="Languages">
<p>Java</p> <p>Python</p> <p>C++</p>
</div>]

Universal selector (*)

It matches elements of any type. For example −

tags = soup.select('*')
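As a quick illustration, the universal selector can be used on its own or combined with another selector. The sketch below uses a small made-up document and lists the tag names that are matched.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
markup = '<div id="Languages"><p>Java</p><p>Python</p></div>'
soup = BeautifulSoup(markup, 'html.parser')

# Every element in the document, in document order
all_names = [tag.name for tag in soup.select('*')]

# Every direct child of the element with id "Languages"
child_names = [tag.name for tag in soup.select('#Languages > *')]

print(all_names)
print(child_names)
```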

ID selector

It matches an element based on its id attribute. The # symbol denotes the ID selector. For example −

tags = soup.select("#nm")

Example

from bs4 import BeautifulSoup

html = '''
   <form>
      <input type = 'text' id = 'nm' name = 'name'>
      <input type = 'text' id = 'age' name = 'age'>
      <input type = 'text' id = 'marks' name = 'marks'>
   </form>
'''
soup = BeautifulSoup(html, 'html.parser')
obj = soup.select("#nm")
print (obj)

Output

[<input id="nm" name="name" type="text"/>]

Class selector

It matches an element based on the values contained in its class attribute. A . (dot) prefixed to the class name denotes the CSS class selector. For example −

tags = soup.select(".submenu")

Example

from bs4 import BeautifulSoup

markup = '''
   <ul>
      <li class="menu">File</li>
      <li class="submenu">Open</li>
      <li class="submenu">Save</li>
   </ul>
'''
soup = BeautifulSoup(markup, 'html.parser')

# Select every element whose class attribute contains "submenu"
tags = soup.select('.submenu')
print(tags)

Output

[<li class="submenu">Open</li>, <li class="submenu">Save</li>]

Attribute Selectors

An attribute selector matches elements based on their attributes. The simplest form, [attr], matches every element that has the named attribute, whatever its value.

soup.select('[attr]')

Example

from bs4 import BeautifulSoup

html = '''
   <h1>Tutorialspoint Online Library</h1>
   <p><b>It's all Free</b></p>
   <a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
   <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>
'''
soup = BeautifulSoup(html, 'html5lib')
print(soup.select('[href]'))

Output

[<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>, <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>]
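Attribute selectors can also test the attribute value. The sketch below reuses the two links from the example above and shows the exact-match (=), prefix (^=), suffix ($=), and substring (*=) operators. It uses the built-in html.parser so that no extra parser needs to be installed, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = '''
   <a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
   <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>
'''
soup = BeautifulSoup(html, 'html.parser')

def texts(selector):
    """Return the text of every element matched by the selector."""
    return [tag.get_text() for tag in soup.select(selector)]

exact = texts('[id="link1"]')              # value equals "link1"
prefix = texts('[href^="https://"]')       # value starts with "https://"
suffix = texts('[href$="index.htm"]')      # value ends with "index.htm"
contains = texts('[href*="java"]')         # value contains "java"

print(exact, prefix, suffix, contains)
```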

Pseudo Classes

The CSS specification defines a number of pseudo-classes. A pseudo-class is a keyword added to a selector that specifies a special state or position of the selected elements. In a stylesheet, this lets a style apply only to the elements in that state. For example, :link selects a link (every <a> and <area> element with an href attribute) that has not yet been visited. Because a parsed document has no visited state, Soup Sieve matches every such link.

The pseudo-class selectors :nth-of-type() and :nth-child() are very widely used.

:nth-of-type()

The :nth-of-type() selector matches elements of a given type based on their position among siblings of the same type. The keywords even and odd select the elements at even and odd positions, respectively, within that group of same-type siblings.

In the following example, the second element of type <p> is selected.

Example

from bs4 import BeautifulSoup

html = '''
<p id="0"></p>
<p id="1"></p>
<span id="2"></span>
<span id="3"></span>
'''
soup = BeautifulSoup(html, 'html5lib')
print(soup.select('p:nth-of-type(2)'))

Output

[<p id="1"></p>]
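The even and odd keywords work the same way. In the sketch below, the markup and ids are made up, and <p> and <span> elements are interleaved to show that positions are counted among siblings of the same type only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <p> and <span> siblings interleaved inside one <div>
html = '''
<div>
<p id="a"></p>
<span id="b"></span>
<p id="c"></p>
<span id="d"></span>
<p id="e"></p>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

def ids(selector):
    """Return the id of every element matched by the selector."""
    return [tag['id'] for tag in soup.select(selector)]

odd_p = ids('p:nth-of-type(odd)')        # 1st and 3rd <p>
even_p = ids('p:nth-of-type(even)')      # 2nd <p>
second_span = ids('span:nth-of-type(2)') # 2nd <span>

print(odd_p, even_p, second_span)
```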

:nth-child()

This selector matches elements based on their position in a group of siblings. The keywords even and odd select the elements whose position among their siblings is even or odd, respectively.

Usage

:nth-child(even)
:nth-child(odd)
:nth-child(2)

Example

from bs4 import BeautifulSoup

markup = '''
   <div id="Languages">
      <p>Java</p> <p>Python</p> <p>C++</p>
   </div>
'''
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.div

child = tag.select_one(':nth-child(2)')
print (child)

Output

<p>Python</p>
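The even and odd forms listed under Usage can be tried in the same way. The following sketch extends the list with a fourth, made-up entry so that each keyword matches two elements.

```python
from bs4 import BeautifulSoup

# Same structure as the example above, with one extra illustrative entry
markup = '<div id="Languages"><p>Java</p><p>Python</p><p>C++</p><p>Go</p></div>'
soup = BeautifulSoup(markup, 'html.parser')

# Children at positions 1 and 3
odd = [p.get_text() for p in soup.select('#Languages > p:nth-child(odd)')]

# Children at positions 2 and 4
even = [p.get_text() for p in soup.select('#Languages > p:nth-child(even)')]

# The child at position 2 only
second = soup.select_one('#Languages > p:nth-child(2)').get_text()

print(odd)
print(even)
print(second)
```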