Beautiful Soup Concise Tutorial

Beautiful Soup - diagnose() Method

Method Description

The diagnose() method in Beautiful Soup is a diagnostic suite for isolating common problems. If you have difficulty understanding what Beautiful Soup is doing to a document, pass the document as an argument to diagnose(). It prints a report showing how the different parsers handle the document and tells you if a parser is missing.

Syntax

diagnose(data)

Parameters

  1. data − the document string.

Return Value

The diagnose() method does not return a value. It prints the result of parsing the given document with each of the available parsers.

Example

Let us take this simple document for our exercise −

<h1>Hello World
<b>Welcome</b>
<P><b>Beautiful Soup</a> <i>Tutorial</i><p>

The following code runs the diagnostics on the above HTML markup −

markup = '''
<h1>Hello World
<b>Welcome</b>
<P><b>Beautiful Soup</a> <i>Tutorial</i><p>
'''

from bs4.diagnose import diagnose

diagnose(markup)
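Because diagnose() prints its report instead of returning it, it can be handy to capture the text for logging or later comparison. The following is a minimal sketch that assumes the report is written to standard output; the helper name capture_diagnosis is hypothetical and not part of Beautiful Soup.

```python
import contextlib
import io

from bs4.diagnose import diagnose

markup = """
<h1>Hello World
<b>Welcome</b>
<P><b>Beautiful Soup</a> <i>Tutorial</i><p>
"""


def capture_diagnosis(data):
    """Run diagnose() and return everything it printed as one string.

    Hypothetical helper: it only redirects standard output, so anything
    written to standard error (such as warnings) is not captured.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        diagnose(data)
    return buffer.getvalue()


report = capture_diagnosis(markup)
# Show only the first line of the report, which names the library version.
print(report.splitlines()[0])
```

The captured string can then be written to a file or searched for a particular parser's section.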

The diagnose() output starts with a message showing which parsers are available −

Diagnostic running on Beautiful Soup 4.12.2
Python version 3.11.2 (tags/v3.11.2:878ead1, Feb  7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)]
Found lxml version 4.9.2.0
Found html5lib version 1.1
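The same availability check can be approximated without running the full diagnostics. The sketch below assumes that the optional parsers are importable under the module names lxml and html5lib; the function name available_parsers is made up for illustration.

```python
import importlib.util

# Assumption: each optional parser is installed under this module name.
OPTIONAL_PARSERS = {"lxml": "lxml", "html5lib": "html5lib"}


def available_parsers():
    """Return the parser names that appear to be usable in this environment."""
    found = ["html.parser"]  # part of the Python standard library
    for parser_name, module_name in OPTIONAL_PARSERS.items():
        # find_spec() returns None when the top-level module is not installed.
        if importlib.util.find_spec(module_name) is not None:
            found.append(parser_name)
    return found


print(available_parsers())
```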

If the document being diagnosed is well-formed HTML, all parsers produce nearly the same result. Our example, however, contains several errors: the <h1> tag and the second <b> tag are never closed, a stray </a> appears with no opening <a>, and the <p> tags are left open.

To begin with, the built-in html.parser is used. The report is as follows −

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
   <h1>
      Hello World
   <b>
      Welcome
   </b>
   <p>
      <b>
         Beautiful Soup
         <i>
            Tutorial
         </i>
         <p>
         </p>
      </b>
   </p>
</h1>

You can see that Python's built-in parser doesn't insert the <html> and <body> tags. The unclosed <h1> tag is given a matching </h1> at the very end of the document.
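These observations can also be checked directly on the parsed tree rather than by reading the report. A small sketch, using the same markup with html.parser only:

```python
from bs4 import BeautifulSoup

markup = """
<h1>Hello World
<b>Welcome</b>
<P><b>Beautiful Soup</a> <i>Tutorial</i><p>
"""

soup = BeautifulSoup(markup, "html.parser")

# html.parser does not wrap the fragment, so neither tag exists in the tree.
print("has <html>:", soup.find("html") is not None)
print("has <body>:", soup.find("body") is not None)

# The unclosed <h1> is still present as an element.
print("has <h1>:", soup.find("h1") is not None)

# The stray </a> had no opening tag, so no <a> element is created.
print("has <a>:", soup.find("a") is not None)
```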

Both the html5lib and lxml parsers complete the document by wrapping it in <html> and <body> tags; html5lib also adds an empty <head>.

Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
   <head>
   </head>
   <body>
      <h1>
         Hello World
         <b>
            Welcome
         </b>
         <p>
            <b>
               Beautiful Soup
               <i>
                  Tutorial
               </i>
            </b>
         </p>
         <p>
            <b>
            </b>
         </p>
      </h1>
   </body>
</html>

With the lxml parser, note where the closing </h1> is inserted: right after the first <b> element, so the <p> elements become siblings of <h1> rather than its children. The unclosed <b> tag is also closed, and the dangling </a> is removed.

Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
   <body>
      <h1>
         Hello World
         <b>
            Welcome
         </b>
      </h1>
      <p>
         <b>
            Beautiful Soup
            <i>
               Tutorial
            </i>
         </b>
      </p>
      <p>
      </p>
   </body>
</html>
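The difference in where each parser closes <h1> can be checked programmatically. The sketch below parses the markup with every parser that happens to be installed and reports whether a <p> ended up inside <h1>; the helper name parse_with_each is hypothetical, and parsers that are missing are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = """
<h1>Hello World
<b>Welcome</b>
<P><b>Beautiful Soup</a> <i>Tutorial</i><p>
"""


def parse_with_each(data, parsers=("html.parser", "html5lib", "lxml")):
    """Return {parser name: soup} for every requested parser that is installed."""
    results = {}
    for name in parsers:
        try:
            results[name] = BeautifulSoup(data, name)
        except FeatureNotFound:
            # Raised by Beautiful Soup when the requested parser is unavailable.
            continue
    return results


soups = parse_with_each(markup)
for name, soup in soups.items():
    h1 = soup.find("h1")
    nested = h1.find("p") is not None
    print(f"{name}: <p> nested inside <h1> = {nested}")
```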

The diagnose() method also parses the document as XML, which is probably superfluous in our case.

Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<h1>
   Hello World
   <b>
      Welcome
   </b>
   <P>
      <b>
         Beautiful Soup
      </b>
      <i>
         Tutorial
      </i>
   <p/>
   </P>
</h1>

Let us now give the diagnose() method an XML document instead of an HTML document.

<?xml version="1.0" ?>
   <books>
      <book>
         <title>Python</title>
         <author>TutorialsPoint</author>
         <price>400</price>
      </book>
   </books>

Now if we run the diagnostics, the HTML parsers are still applied, even though the document is XML.

Trying to parse your markup with html.parser

Warning (from warnings module):
  File "C:\Users\mlath\OneDrive\Documents\Feb23 onwards\BeautifulSoup\Lib\site-packages\bs4\builder\__init__.py", line 545
    warnings.warn(
XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.

With html.parser, a warning message is displayed. With html5lib, the first line, which contains the XML version information, is turned into a comment, and the rest of the document is parsed as if it were HTML.

Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<!--?xml version="1.0" ?-->
<html>
   <head>
   </head>
   <body>
      <books>
         <book>
            <title>
               Python
            </title>
            <author>
               TutorialsPoint
            </author>
            <price>
               400
            </price>
         </book>
      </books>
   </body>
</html>

The lxml HTML parser doesn't turn the first line into a comment, but it still parses the document as HTML.

Trying to parse your markup with lxml
Here's what lxml did with the markup:
<?xml version="1.0" ?>
<html>
   <body>
      <books>
         <book>
            <title>
               Python
            </title>
            <author>
               TutorialsPoint
            </author>
            <price>
               400
            </price>
         </book>
      </books>
   </body>
</html>

The lxml-xml parser parses the document as XML.

Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<?xml version="1.0" ?>
   <books>
      <book>
         <title>
            Python
         </title>
         <author>
            TutorialsPoint
         </author>
         <price>
            400
         </price>
      </book>
   </books>
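Outside of diagnose(), the same XML parsing can be requested by passing features="xml" to the BeautifulSoup constructor, as the earlier warning message suggests. The sketch below assumes lxml may or may not be installed and falls back gracefully; parse_as_xml is a hypothetical helper name.

```python
from bs4 import BeautifulSoup, FeatureNotFound

xml_doc = """<?xml version="1.0" ?>
<books>
   <book>
      <title>Python</title>
      <author>TutorialsPoint</author>
      <price>400</price>
   </book>
</books>
"""


def parse_as_xml(data):
    """Return a soup built with the XML parser, or None if it is unavailable."""
    try:
        return BeautifulSoup(data, features="xml")
    except FeatureNotFound:
        # The XML parser is provided by lxml, which may not be installed.
        return None


soup = parse_as_xml(xml_doc)
if soup is None:
    print("No XML parser available; install lxml to parse this as XML.")
else:
    for book in soup.find_all("book"):
        print(book.title.string, book.author.string, book.price.string)
```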

The diagnostics report can be useful for finding errors in HTML/XML documents.