Beautiful Soup: A Concise Tutorial

Beautiful Soup - Remove Child Elements

An HTML document is a hierarchy of tags, in which a tag may contain one or more other tags nested at several levels. How do we remove the child elements of a particular tag? With Beautiful Soup, this is straightforward.

The Beautiful Soup library provides two main methods for removing a tag: decompose() and extract(). The difference is that extract() returns the tag that was removed, whereas decompose() destroys it and returns nothing.
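The difference between the two methods can be illustrated with a minimal sketch. The HTML string and the variable names (html, removed, result) are made up for this illustration and are not part of the tutorial's index.html file.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration
html = "<div><p>first</p><span>second</span></div>"
soup = BeautifulSoup(html, "html.parser")

# extract() detaches the tag from the tree and returns it
removed = soup.p.extract()

# decompose() removes the tag, destroys it, and returns None
result = soup.span.decompose()

print(removed)  # <p>first</p>
print(result)   # None
print(soup)     # <div></div>
```

Use extract() when the removed tag is still needed, for example to insert it elsewhere; use decompose() when it can simply be discarded.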

Hence, to remove the child elements, call the findChildren() method on a given Tag object, and then call extract() or decompose() on each element it returns. Note that findChildren() is the older name for find_all() and, by default, returns all nested tags, not only the direct children.
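As a rough sketch of this approach, the snippet below removes the children of a single tag. It uses find_all(recursive=False), which limits the search to direct children; the sample markup and the names html and ul are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration
html = "<ul><li>one <b>bold</b></li><li>two</li></ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

# recursive=False returns only the direct child tags (the two <li> tags);
# the nested <b> tag disappears together with its parent <li>
for child in ul.find_all(recursive=False):
    child.decompose()

print(soup)  # <ul></ul>
```

Because find_all() returns a list-like result rather than a live view of the tree, it is safe to remove elements while looping over it.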

Consider the following code segment:

soup = BeautifulSoup(fp, "html.parser")
soup.decompose()
print(soup)

This destroys the soup object itself, that is, the entire parsed tree of the document. Obviously, that is not what we want.

Now consider the following code:

soup = BeautifulSoup(fp, "html.parser")
tags = soup.find_all()
for tag in tags:
   for t in tag.findChildren():
      t.extract()

In the document tree, <html> is the first tag and all other tags are its descendants. Hence the first iteration of the outer loop already removes every tag except <html> and </html>.
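This behaviour can be checked with a small self-contained sketch. The inline HTML string stands in for the file object fp used above and is an assumption made for the example; find_all() is used in place of its older alias findChildren().

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the contents of fp
html = "<html><body><h2>Title</h2><p>Text</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

for tag in soup.find_all():
    # On the first pass tag is <html>, so every nested tag is extracted;
    # on later passes the tags are already detached and have no children left
    for t in tag.find_all():
        t.extract()

print(soup)  # <html></html>
```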

This technique is more useful when applied to the children of a specific tag. For example, you may want to empty the header row of an HTML table.

The following HTML document has a table whose first <tr> element contains headers marked up with <th> tags.

<html>
   <body>
      <h2>Beautiful Soup - Remove Child Elements</h2>
      <table border="1">
         <tr class='header'>
            <th>Name</th>
            <th>Age</th>
            <th>Marks</th>
         </tr>
         <tr>
            <td>Ravi</td>
            <td>23</td>
            <td>67</td>
         </tr>
         <tr>
            <td>Anil</td>
            <td>27</td>
            <td>84</td>
         </tr>
      </table>
   </body>
</html>

We can use the following Python code to remove all the child elements of the <tr> tag that contains the <th> cells.

Example

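from bs4 import BeautifulSoup

# Parse the file and close it as soon as parsing is done
with open("index.html") as fp:
   soup = BeautifulSoup(fp, "html.parser")

# Select the header row(s) by class
tags = soup.find_all('tr', {'class':'header'})

for tag in tags:
   # Remove every tag nested inside the header row
   for t in tag.findChildren():
      t.extract()

print(soup)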

Output

<html>
<body>
<h2>Beautiful Soup - Remove Child Elements</h2>
<table border="1">
<tr class="header">

</tr>
<tr>
<td>Ravi</td>
<td>23</td>
<td>67</td>
</tr>
<tr>
<td>Anil</td>
<td>27</td>
<td>84</td>
</tr>
</table>
</body>
</html>

It can be seen that the <th> elements have been removed from the parsed tree.
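The example above leaves an empty <tr class="header"> element in the tree. If the goal is to drop the header row entirely, one possible variation is to call decompose() on the row itself rather than on its children. The sketch below uses a shortened inline table instead of index.html; the string and the name remaining are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the table from index.html
html = """<table>
<tr class="header"><th>Name</th><th>Age</th></tr>
<tr><td>Ravi</td><td>23</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

# Remove the header row itself, together with all its <th> cells
for row in soup.find_all("tr", class_="header"):
    row.decompose()

remaining = soup.find_all("tr")
print(len(remaining))   # 1
print(soup.find("th"))  # None
```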