Beautiful Soup Concise Tutorial

Beautiful Soup - Remove all Styles

This chapter explains how to remove all styles from an HTML document. Cascading Style Sheets (CSS) control how the elements of an HTML document are displayed, including the font, color, alignment, and spacing of text. CSS can be applied to HTML tags in several ways.

One way is to define the styles in a separate CSS file and include that file in the HTML document with a <link> tag in the <head> section. For example:

Example

<html>
   <head>
      <link rel="stylesheet" href="style.css">
   </head>
   <body>
   . . .
   . . .
   </body>
</html>

The tags in the body of the HTML document will use the definitions in the style.css file.
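A stylesheet linked this way can also be removed on its own, without touching the rest of the head. The following is a minimal sketch, assuming the html.parser backend; the <title> element and the helper name is_stylesheet_link are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<html>
   <head>
      <title>Demo</title>
      <link rel="stylesheet" href="style.css">
   </head>
   <body>
      <p>Hello</p>
   </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")


def is_stylesheet_link(tag):
    # Hypothetical helper: html.parser stores rel as a list, e.g. ["stylesheet"]
    return tag.name == "link" and "stylesheet" in (tag.get("rel") or [])


# Collect the matching <link> tags first, then remove each one from the tree
links = soup.find_all(is_stylesheet_link)
removed_hrefs = [link.get("href") for link in links]
for link in links:
    link.decompose()

print(removed_hrefs)
print(soup.title.string)
```

The other contents of the head, such as the title, stay in place.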

Another approach is to define the styles inside the <head> section of the HTML document itself, in a <style> element. Tags in the body are then rendered using these internal definitions.

Example of internal styling:

<html>
   <head>
      <style>
         p {
            text-align: center;
            color: red;
         }
      </style>
   </head>
   <body>
      <p>para1.</p>
      <p id="para1">para2</p>
      <p>para3</p>
   </body>
</html>
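Internal styles like these can be removed by deleting the <style> element itself, which leaves the rest of the document alone. A minimal sketch, assuming the html.parser backend and a shortened copy of the markup above:

```python
from bs4 import BeautifulSoup

html = """
<html>
<head>
   <style>
      p { text-align: center; color: red; }
   </style>
</head>
<body>
   <p>para1.</p>
   <p id="para1">para2</p>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Remove every <style> element, wherever it appears in the document
for style_tag in soup.find_all("style"):
    style_tag.decompose()

print(soup.find("style"))       # None once all <style> tags are gone
print(len(soup.find_all("p")))  # the paragraphs are untouched
```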

In either case, to remove the styles programmatically, simply remove the head tag from the soup object. Note that this also discards everything else in the head, such as the <title> and <meta> tags.

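from bs4 import BeautifulSoup

# html is assumed to hold the document source as a string
soup = BeautifulSoup(html, "html.parser")
# Detach <head> from the tree, along with the <link> and <style> tags inside it
soup.head.extract()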
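If a document has no <head>, soup.head is None and calling extract() on it raises an AttributeError. The sketch below guards against that case. The function name remove_head and the two sample documents are illustrative assumptions, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup


def remove_head(soup):
    """Detach <head> if present and return it; return None otherwise."""
    head = soup.head
    if head is None:
        return None
    # extract() removes the tag from the tree and returns it
    return head.extract()


with_head = BeautifulSoup(
    "<html><head><title>T</title></head><body></body></html>", "html.parser"
)
without_head = BeautifulSoup("<html><body><p>text</p></body></html>", "html.parser")

removed = remove_head(with_head)
print(removed.title.string)       # the extracted tag is still usable
print(remove_head(without_head))  # None, and no exception is raised
```

Because extract() returns the removed tag, anything still needed from the head can be read from it afterwards.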

The third approach is to define styles inline by including a style attribute in the tag itself. The style attribute may contain one or more CSS property declarations, such as color and size. For example:

<body>
   <h1 style="color:blue;text-align:center;">This is a heading</h1>
   <p style="color:red;">This is a paragraph.</p>
</body>

To remove such inline styles from an HTML document, check whether the attrs dictionary of each tag object contains a style key and, if it does, delete it.

# Visit every tag in the document
tags = soup.find_all()
for tag in tags:
    # Delete the style attribute only where one is present
    if tag.has_attr('style'):
        del tag.attrs['style']
print(soup)
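The same check can be folded into the search itself. Passing style=True to find_all() returns only the tags that carry a style attribute, so the loop needs no condition. A minimal sketch; the third paragraph is added here purely for illustration.

```python
from bs4 import BeautifulSoup

html = """
<body>
   <h1 style="color:blue;text-align:center;">This is a heading</h1>
   <p style="color:red;">This is a paragraph.</p>
   <p>No inline style here.</p>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

# style=True matches only tags that have a style attribute
styled = soup.find_all(style=True)
print(len(styled))

for tag in styled:
    del tag["style"]

# No tag carries a style attribute any more
print(len(soup.find_all(style=True)))
```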

The following code removes the inline styles as well as the head tag itself, so the resulting HTML tree has no styles left.

html = '''
<html>
   <head>
      <link rel="stylesheet" href="style.css">
   </head>
   <body>
      <h1 style="color:blue;text-align:center;">This is a heading</h1>
      <p style="color:red;">This is a paragraph.</p>
   </body>
</html>
'''
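from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
# Remove <head>, and with it the linked stylesheet
soup.head.extract()

# Remove the inline style attribute from every tag that has one
tags = soup.find_all()
for tag in tags:
    if tag.has_attr('style'):
        del tag.attrs['style']
print(soup.prettify())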

Output

<html>
 <body>
  <h1>
   This is a heading
  </h1>
  <p>
   This is a paragraph.
  </p>
 </body>
</html>
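Where the rest of the head (a title, for instance) should survive, the three kinds of styling can be removed one by one instead of dropping the head. The following is a sketch under that assumption. The function name strip_styles and the sample markup are hypothetical, and the rel check relies on html.parser storing rel as a list.

```python
from bs4 import BeautifulSoup


def strip_styles(html):
    """Return a soup with stylesheet links, <style> blocks and style attributes removed."""
    soup = BeautifulSoup(html, "html.parser")
    # External stylesheets (<link rel="stylesheet">) and internal <style> blocks
    for tag in soup.find_all(["link", "style"]):
        if tag.name == "style" or "stylesheet" in (tag.get("rel") or []):
            tag.decompose()
    # Inline style attributes
    for tag in soup.find_all(style=True):
        del tag["style"]
    return soup


html = """
<html>
   <head>
      <title>Demo</title>
      <link rel="stylesheet" href="style.css">
      <style>p { color: red; }</style>
   </head>
   <body>
      <h1 style="color:blue;text-align:center;">This is a heading</h1>
      <p style="color:red;">This is a paragraph.</p>
   </body>
</html>
"""

clean = strip_styles(html)
print(clean.prettify())
```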