Beautiful Soup Concise Tutorial
Beautiful Soup - Remove all Scripts
The <script> tag is one of the most frequently used tags in HTML. It embeds a client-side script, such as JavaScript code, in an HTML document. In this chapter, we will use Beautiful Soup to remove script tags from an HTML document.
The <script> tag has a corresponding </script> closing tag. A script element either references an external JavaScript file through its src attribute or holds inline JavaScript code between the two tags.
To include an external JavaScript file, use the following syntax:
<head>
<script src="javascript.js"></script>
</head>
You can then call the functions defined in this file from within the HTML document.
Instead of referring to an external file, you can put JavaScript code directly in the HTML, between the <script> and </script> tags. If the script is placed in the <head> section of the document, its functions are available throughout the document. If it is placed anywhere in the <body> section, its functions are available from that point on.
<body>
<p>Hello World</p>
<script>
alert("Hello World")
</script>
</body>
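Once such a document is parsed, the two forms can be told apart by checking for the src attribute. The following is a minimal sketch, assuming the sample markup above is combined into one document; the variable names external_sources and inline_code are illustrative only.

```python
from bs4 import BeautifulSoup

# Sample document (assumed for illustration) with one external and one inline script.
html = """
<html>
<head>
<script src="javascript.js"></script>
</head>
<body>
<p>Hello World</p>
<script>
alert("Hello World")
</script>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
scripts = soup.find_all("script")

# External scripts carry a src attribute pointing at the JavaScript file.
external_sources = [tag["src"] for tag in scripts if tag.has_attr("src")]

# Inline scripts have no src attribute; their code is the tag's text content.
inline_code = [(tag.string or "").strip() for tag in scripts if not tag.has_attr("src")]

print(external_sources)
print(inline_code)
```

Checking has_attr("src") keeps the distinction explicit and works whether or not the script tag has any text content.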
Removing all script tags with Beautiful Soup is easy: collect all script tags from the parsed tree and extract them one by one.
Example
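from bs4 import BeautifulSoup
html = '''
<html>
<head>
<script src="javascript.js"></script>
</head>
<body>
<p>Hello World</p>
<script>
alert("Hello World")
</script>
</body>
</html>
'''
soup = BeautifulSoup(html, "html.parser")
# find_all() returns every <script> tag; extract() detaches each one from the tree
for tag in soup.find_all('script'):
    tag.extract()
print(soup)
Output (blank lines left behind by the removed tags are omitted)
<html>
<head>
</head>
<body>
<p>Hello World</p>
</body>
</html>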
You can also use the decompose() method instead of extract(). The difference is that extract() returns the tag that was removed, whereas decompose() destroys it and returns nothing. For more concise code, you can also remove the script tags with a list comprehension, although this builds a throwaway list of None values, so a plain for loop is usually the more idiomatic choice:
[tag.decompose() for tag in soup.find_all('script')]
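The difference between the two methods can be seen by running both on the same markup. This is a minimal sketch rather than a definitive recipe; the sample document and the variable names (removed, results, and so on) are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Sample document (assumed for illustration).
html = """
<html>
<head>
<script src="javascript.js"></script>
</head>
<body>
<p>Hello World</p>
<script>
alert("Hello World")
</script>
</body>
</html>
"""

# extract(): each tag is detached from the tree and returned,
# so the removed tags can still be inspected afterwards.
soup = BeautifulSoup(html, "html.parser")
removed = [tag.extract() for tag in soup.find_all("script")]
removed_src = removed[0].get("src")
remaining_after_extract = len(soup.find_all("script"))

# decompose(): each tag is detached and destroyed; the call returns None,
# so the comprehension only collects None values.
soup2 = BeautifulSoup(html, "html.parser")
results = [tag.decompose() for tag in soup2.find_all("script")]
remaining_after_decompose = len(soup2.find_all("script"))

print(removed_src, remaining_after_extract)
print(results, remaining_after_decompose)
```

Use extract() when the removed scripts are needed later, for example to log their src values, and decompose() when they can simply be discarded.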