Beautiful Soup: A Concise Tutorial
Beautiful Soup - Selecting nth Child
HTML is characterized by a hierarchy of nested tags. For example, the <html> tag encloses the <body> tag, which may contain a <div> tag, which in turn may contain nested <ul> and <li> elements. The .children property returns an iterator over the nodes directly under an element, including text nodes, while the findChildren() method returns a ResultSet (a list subclass) of the tags beneath it (all descendants by default). By traversing either one, you can obtain the child located at a desired position, that is, the nth child.
The code below uses the children property of a <div> tag in the HTML document. Because children is an iterator rather than a list, we build a Python list from it. The iterator also yields the whitespace and line-break text nodes that sit between the tags, so we filter those out. Once that is done, we can fetch the desired child by index. Here the child of the <div> tag at index 1 is displayed.
Example
from bs4 import BeautifulSoup, NavigableString
markup = '''
<div id="Languages">
<p>Java</p> <p>Python</p> <p>C++</p>
</div>
'''
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.div
children = tag.children
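# Keep only tags; drop the text nodes (spaces, newlines) between them
childlist = [child for child in children if not isinstance(child, NavigableString)]
# Index 1 is the second child tag
print(childlist[1])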
Output
<p>Python</p>
To use the findChildren() method instead of the children property, change the assignment of children to
children = tag.findChildren()
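The output does not change. Note that findChildren() is an older name for find_all(): it returns only tags and, by default, searches all descendants rather than just the direct children. The result matches here only because the <p> tags contain no nested tags.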
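The difference between direct children and all descendants can be seen in a small sketch. The markup below is a hypothetical variation of the example above, with a nested <ul> added for illustration; it assumes that passing recursive=False to find_all() limits the search to direct child tags.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second child of the <div> is a nested list
markup = '''
<div id="Languages">
<p>Java</p> <ul><li>Python</li><li>Ruby</li></ul> <p>C++</p>
</div>
'''
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.div

# Default search is recursive: the nested <li> tags are included
all_tags = tag.find_all()
# recursive=False restricts the search to direct child tags
direct_tags = tag.find_all(recursive=False)

print([t.name for t in all_tags])
print([t.name for t in direct_tags])
print(direct_tags[1].name)
```

With nested markup like this, indexing the recursive result would not reliably give the nth direct child, so recursive=False is the safer choice when position matters.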
A more direct way to locate the nth child is the select() method. The select() method uses CSS selectors to obtain the required elements from within the current element.
The BeautifulSoup and Tag objects support CSS selectors through their .css property, which is an interface to the CSS selector API. The selector implementation is handled by the Soup Sieve package, which is installed along with the bs4 package.
The Soup Sieve package supports different types of CSS selectors, namely simple, compound and complex selectors, which are built from one or more type selectors, ID selectors and class selectors. These selectors are defined by the CSS language.
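As a rough illustration of these selector types, the sketch below runs a few selectors through select(). The class names compiled and scripting are hypothetical additions to the earlier markup, introduced only for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: class attributes added for illustration
markup = '''
<div id="Languages">
<p class="compiled">Java</p> <p class="scripting">Python</p> <p class="compiled">C++</p>
</div>
'''
soup = BeautifulSoup(markup, 'html.parser')

by_type = soup.select('p')             # simple: type selector
by_id = soup.select('#Languages')      # simple: ID selector
by_class = soup.select('.compiled')    # simple: class selector
compound = soup.select('p.scripting')  # compound: type + class
# complex: two compound selectors joined by the child combinator
complex_sel = soup.select('div#Languages > p.compiled')

print([p.text for p in by_type])
print([p.text for p in by_class])
print([p.text for p in compound])
print([p.text for p in complex_sel])
```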
Soup Sieve also supports pseudo-class selectors. A CSS pseudo-class is a keyword added to a selector that specifies a special state of the selected element(s). We shall use the :nth-child pseudo-class selector in this example. Since we need to select the child at the 2nd position in the <div> tag, we shall pass :nth-child(2) to the select_one() method.
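A minimal sketch of this approach, reusing the markup from the earlier example, might look like the following. Note that :nth-child counts positions from 1 and considers only element nodes, so no whitespace filtering is needed.

```python
from bs4 import BeautifulSoup

markup = '''
<div id="Languages">
<p>Java</p> <p>Python</p> <p>C++</p>
</div>
'''
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.div

# :nth-child(2) matches an element that is the second child of its parent;
# select_one() returns the first such element found inside the <div>
second = tag.select_one(':nth-child(2)')
print(second)
```

Unlike the list-based version, where the second child sits at index 1, the CSS position is 1-based, so the same element is addressed as 2.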