Beautiful Soup Concise Tutorial
Beautiful Soup - Find all Children of an Element
The tags in an HTML document form a hierarchy: elements are nested one inside another. For example, the top-level <HTML> tag contains the <HEAD> and <BODY> tags, each of which may contain further tags. An element that contains other elements is their parent, and the elements nested directly inside it are its children. With Beautiful Soup, we can find all the children of a parent element. In this chapter, we look at how to obtain the children of an HTML element.
Beautiful Soup provides two ways to fetch the children of an element:
- The .children property
- The findChildren() method
Examples in this chapter use the following HTML document (index.html):
<html>
<head>
<title>TutorialsPoint</title>
</head>
<body>
<h2>Departmentwise Employees</h2>
<ul id="dept">
<li>Accounts</li>
<ul id='acc'>
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ul id="HR">
<li>Rani</li>
<li>Ankita</li>
</ul>
</ul>
</body>
</html>
Using .children property
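The .children property of a Tag object returns an iterator over the element's direct children only. It does not descend into nested elements; for a recursive walk over the whole subtree, Beautiful Soup offers the .descendants property. Note that the children include text nodes, such as the newline strings between tags, as well as Tag objects.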
The following Python code lists the children of the top-level <ul> tag. We first obtain the Tag object corresponding to the <ul> tag, and then read its .children property.
Example
from bs4 import BeautifulSoup

with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

# soup.ul is the first <ul> in the document, i.e. the one with id="dept"
tag = soup.ul
print(list(tag.children))
Output
['\n', <li>Accounts</li>, '\n', <ul id="acc">
<li>Anand</li>
<li>Mahesh</li>
</ul>, '\n', <li>HR</li>, '\n', <ul id="HR">
<li>Rani</li>
<li>Ankita</li>
</ul>, '\n']
Since the .children property returns an iterator, we can use a for loop to visit each direct child in turn.
for child in tag.children:
    print(child)
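To make the difference between direct children and the full subtree concrete, here is a minimal sketch. It assumes the nested list from index.html is pasted inline as a string so that the snippet runs without the file; the variable names (dept, direct, nested) are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline copy of the nested list from index.html (assumption: same markup).
html = """<ul id="dept">
<li>Accounts</li>
<ul id="acc">
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ul id="HR">
<li>Rani</li>
<li>Ankita</li>
</ul>
</ul>"""

soup = BeautifulSoup(html, "html.parser")
dept = soup.find("ul", id="dept")

# .children yields direct children only; keep the Tag objects and
# drop the newline strings that sit between them.
direct = [child for child in dept.children if isinstance(child, Tag)]
direct_names = [(child.name, child.get("id")) for child in direct]

# .descendants walks the whole subtree, so the inner <li> tags appear too.
nested = [node for node in dept.descendants if isinstance(node, Tag)]

print(direct_names)  # [('li', None), ('ul', 'acc'), ('li', None), ('ul', 'HR')]
print(len(nested))   # 8
```

Filtering with isinstance(child, Tag) is one simple way to ignore whitespace-only text nodes; other filters may suit documents where text children matter.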
Using findChildren() method
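The findChildren() method offers a more comprehensive alternative. It is a legacy alias of find_all(), and by default it returns every Tag nested under the element at any depth, not just the direct children. Unlike .children, it returns only Tag objects and skips text nodes.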
In the index.html document, we have two nested unordered lists. The top-level <ul> element has id="dept", and the two enclosed lists have id="acc" and id="HR" respectively.
In the following example, we first obtain a Tag object for the top-level <ul> element and then extract the list of elements under it.
from bs4 import BeautifulSoup

# Open the file in a with block so it is closed after parsing
with open('index.html') as fp:
    soup = BeautifulSoup(fp, 'html.parser')

tag = soup.find("ul", {"id": "dept"})
children = tag.findChildren()
for child in children:
    print(child)
Note that the result set includes the elements under a tag recursively. Hence, in the following output, each inner list appears in full, followed by the individual elements inside it.
<li>Accounts</li>
<ul id="acc">
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>Anand</li>
<li>Mahesh</li>
<li>HR</li>
<ul id="HR">
<li>Rani</li>
<li>Ankita</li>
</ul>
<li>Rani</li>
<li>Ankita</li>
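If only the direct child tags are wanted, find_all() accepts a recursive argument. The sketch below assumes the same markup as index.html, pasted inline so it runs on its own; it uses find_all() rather than the findChildren() alias, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Inline copy of the nested list from index.html (assumption: same markup).
html = """<ul id="dept">
<li>Accounts</li>
<ul id="acc">
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ul id="HR">
<li>Rani</li>
<li>Ankita</li>
</ul>
</ul>"""

soup = BeautifulSoup(html, "html.parser")
dept = soup.find("ul", id="dept")

# Default behaviour: every Tag at any depth under <ul id="dept">.
all_tags = dept.find_all()

# recursive=False restricts the search to direct children.
direct_tags = dept.find_all(recursive=False)
direct_names = [tag.name for tag in direct_tags]

print(len(all_tags))  # 8
print(direct_names)   # ['li', 'ul', 'li', 'ul']
```

With recursive=False the result contains the same tags that .children yields, minus the text nodes.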
Let us extract the children of the inner <ul> element with id="acc". Here is the code:
Example
from bs4 import BeautifulSoup

# Open the file in a with block so it is closed after parsing
with open('index.html') as fp:
    soup = BeautifulSoup(fp, 'html.parser')

tag = soup.find("ul", {"id": "acc"})
children = tag.findChildren()
for child in children:
    print(child)
When the above program is run, it prints the two <li> elements under the <ul> whose id is acc.
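Often the text of the child elements is more useful than the tags themselves. As a rough sketch, assuming the id="acc" list from index.html is pasted inline, the names can be collected with get_text(); the variable names here are hypothetical.

```python
from bs4 import BeautifulSoup

# Inline copy of the inner list from index.html (assumption: same markup).
html = """<ul id="acc">
<li>Anand</li>
<li>Mahesh</li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")
acc = soup.find("ul", id="acc")

# Collect the text of each <li> found under the list.
names = [li.get_text() for li in acc.find_all("li")]

print(names)  # ['Anand', 'Mahesh']
```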