Beautiful Soup: A Concise Tutorial

Beautiful Soup - Find all Comments

Adding comments to code is considered good programming practice. Comments help readers understand a program's logic and also serve as documentation. Just as in programs written in C, Java, Python and other languages, you can put comments in HTML and XML documents. The Beautiful Soup API can identify all the comments in an HTML document.

In HTML and XML, comment text is written between the <!-- and --> delimiters.

<!-- Comment Text -->

The Beautiful Soup package, imported as bs4, defines Comment as a class of its own. A Comment object is a special type of NavigableString. Hence, when the only content of a Tag is text enclosed between <!-- and -->, the string property of that Tag is a Comment object.


from bs4 import BeautifulSoup
markup = "<b><!--This is a comment text in HTML--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
print(comment, type(comment))


This is a comment text in HTML <class 'bs4.element.Comment'>
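The relationship between Comment and NavigableString described above can be checked directly. The following is a minimal sketch that reuses the same markup; the variable names is_comment, is_navigable and is_str are introduced here only for illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

markup = "<b><!--This is a comment text in HTML--></b>"
soup = BeautifulSoup(markup, "html.parser")
comment = soup.b.string

# Comment is a subclass of NavigableString, which is itself a subclass of str
is_comment = isinstance(comment, Comment)
is_navigable = isinstance(comment, NavigableString)
is_str = isinstance(comment, str)
print(is_comment, is_navigable, is_str)
```

Because a Comment is also a str, ordinary string methods such as strip() can be applied to it.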

To find every comment in an HTML document, we use the find_all() method. Called without any argument, find_all() returns all the tags in the parsed document. It also accepts a keyword argument, string, which can be a function used as a filter. We pass the iscomment() function itself (not its return value) as that argument.

comments = soup.find_all(string=iscomment)

The iscomment() function uses isinstance() to check whether a given text node is a Comment object.

from bs4 import Comment

def iscomment(elem):
    return isinstance(elem, Comment)
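To see the filter at work without an external file, here is a small sketch that applies it to an inline string. The markup in sample and the names sample_soup, found and found_lambda are assumptions made for this illustration only. It also shows that an equivalent lambda can be passed in place of the named function.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical inline markup, used only for this illustration
sample = "<div><!-- first --><p>Text</p><!-- second --></div>"
sample_soup = BeautifulSoup(sample, "html.parser")

def iscomment(elem):
    # True for Comment nodes, False for ordinary text such as "Text"
    return isinstance(elem, Comment)

# Pass the function object itself; find_all() calls it for each string
found = sample_soup.find_all(string=iscomment)

# The same filter written inline as a lambda
found_lambda = sample_soup.find_all(string=lambda s: isinstance(s, Comment))

print(found)
```

The ordinary text node inside the p tag is skipped because it is a plain NavigableString rather than a Comment.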

The comments variable then holds every comment found in the given HTML document. The example code uses the following index.html file:

      <!-- Title of document -->
      <!-- Page heading -->
      <h2>Departmentwise Employees</h2>
      <!-- top level list-->
      <ul id="dept">
         <ul id='acc'>
         <!-- first inner list -->
         <ul id="HR">
         <!-- second inner list -->

The following Python program parses the above HTML document and finds all the comments in it.


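from bs4 import BeautifulSoup, Comment

def iscomment(elem):
    # True only for Comment nodes, not for ordinary text
    return isinstance(elem, Comment)

# Parse index.html; the with block closes the file afterwards
with open('index.html') as fp:
    soup = BeautifulSoup(fp, 'html.parser')

comments = soup.find_all(string=iscomment)
print(comments)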


[' Title of document ', ' Page heading ', ' top level list', ' first inner list ', ' second inner list ']
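The strings in such a list keep the whitespace that surrounds the comment text. As a hedged sketch, the snippet below trims that whitespace and also looks up the tag that encloses each comment. It uses a short, hypothetical inline markup (sample) rather than index.html, and the names cleaned and parents are introduced here for illustration.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical inline markup, used only for this illustration
sample = "<h2><!-- Page heading -->Departments</h2><ul><!-- top level list--></ul>"
sample_soup = BeautifulSoup(sample, "html.parser")

comments = sample_soup.find_all(string=lambda s: isinstance(s, Comment))

# Convert each Comment to a plain str and trim surrounding whitespace
cleaned = [str(c).strip() for c in comments]

# Each Comment remembers the tag that contains it
parents = [c.parent.name for c in comments]

print(cleaned)
print(parents)
```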

The output above is a list of all the comments. We can also iterate over that collection with a for loop, numbering each comment.


for i, comment in enumerate(comments, start=1):
    print(i, ".", comment)


1 .  Title of document
2 .  Page heading
3 .  top level list
4 .  first inner list
5 .  second inner list

In this chapter, we learned how to extract all the comment strings from an HTML document.