Beautiful Soup: A Concise Tutorial
Beautiful Soup - Troubleshooting
If you run into problems while parsing an HTML/XML document, the most likely cause is the way the parser in use interprets the document. To help you locate and correct the problem, the Beautiful Soup API provides a diagnose() utility.
The diagnose() function in Beautiful Soup is a diagnostic suite for isolating common problems. If you have difficulty understanding what Beautiful Soup is doing to a document, pass the document as an argument to diagnose(). It prints a report showing how the different parsers handle the document and tells you whether any parser is missing.
The diagnose() function is defined in the bs4.diagnose module. Its output starts with a message like the following:
Output
Diagnostic running on Beautiful Soup 4.12.2
Python version 3.11.2 (tags/v3.11.2:878ead1, Feb 7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)]
Found lxml version 4.9.2.0
Found html5lib version 1.1
Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
If one of these parsers is not installed, a corresponding message also appears:
I noticed that html5lib is not installed. Installing it may help.
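As a minimal sketch of how the report above might be produced, the following calls diagnose() on a small piece of deliberately malformed markup. The sample markup and the diagnostic_report() helper are illustrative assumptions, not part of Beautiful Soup. Capturing the printed text is optional; it is done here only so the report can be stored or searched.

```python
import contextlib
import io

try:
    from bs4.diagnose import diagnose
except ImportError:  # Beautiful Soup 4 is not installed
    diagnose = None


def diagnostic_report(markup):
    """Return the text that diagnose() prints for the given markup.

    diagnostic_report is a hypothetical helper written for this example.
    """
    if diagnose is None:
        return "bs4 is not installed; try: pip install beautifulsoup4"
    buffer = io.StringIO()
    # diagnose() writes its report to standard output, so redirect it.
    with contextlib.redirect_stdout(buffer):
        diagnose(markup)
    return buffer.getvalue()


# Hypothetical sample: the <b> tag is never closed.
report = diagnostic_report("<p>Unclosed paragraph<b>bold text</p>")
print(report)
```

If you only want to read the report on screen, calling diagnose(markup) directly is enough.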
If the HTML document passed to diagnose() is well formed, every parser produces the same tree. If it is not, different parsers interpret it differently. If you don't get the tree you expect, changing the parser might help.
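To see this difference without reading a full diagnostic report, you could parse the same malformed snippet with each parser and compare the results. The sketch below assumes a hypothetical compare_parsers() helper and a tiny invalid sample; parsers that are not installed are reported as missing instead of raising an error.

```python
try:
    from bs4 import BeautifulSoup, FeatureNotFound
except ImportError:  # Beautiful Soup 4 is not installed
    BeautifulSoup = None


def compare_parsers(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to the serialized tree it builds.

    A parser that is not installed maps to None.
    compare_parsers is a hypothetical helper written for this example.
    """
    results = {}
    if BeautifulSoup is None:
        return results
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is missing.
            results[name] = None
    return results


# Hypothetical sample: a closing </p> tag with no matching opening tag.
trees = compare_parsers("<a></p>")
for name, tree in trees.items():
    print(f"{name:12} {tree if tree is not None else '(not installed)'}")
```

If the printed trees differ, the markup is being repaired in different ways, and choosing the parser whose result matches your expectation is usually the simplest fix.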
Sometimes you may have chosen an HTML parser for an XML document. An HTML parser treats the document as HTML and may add HTML tags that were never in the source, so the resulting tree is wrong. Looking at the diagnose() output makes this mistake visible and helps you correct it.
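One quick way to spot this mistake is to check whether the parsed tree contains an html tag that the XML source never had. The following is a sketch under stated assumptions: the sample XML and the has_html_wrapper() helper are made up for illustration, and the "xml" feature needs lxml to be installed.

```python
try:
    from bs4 import BeautifulSoup, FeatureNotFound
except ImportError:  # Beautiful Soup 4 is not installed
    BeautifulSoup = None

# Hypothetical sample XML document.
XML_DOC = "<catalog><item id='1'/><item id='2'/></catalog>"


def has_html_wrapper(markup, features):
    """Report whether the parsed tree contains an <html> tag.

    Returns True or False, or None when Beautiful Soup or the requested
    parser is unavailable. has_html_wrapper is a hypothetical helper.
    """
    if BeautifulSoup is None:
        return None
    try:
        soup = BeautifulSoup(markup, features)
    except FeatureNotFound:
        return None
    return soup.find("html") is not None


for features in ("html.parser", "lxml", "html5lib", "xml"):
    print(f"{features:12} {has_html_wrapper(XML_DOC, features)}")
```

A True result for an XML input suggests the document was parsed as HTML; passing "xml" as the second argument to BeautifulSoup is the usual correction.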
If Beautiful Soup raises HTMLParser.HTMLParseError, try changing the parser.
The parse errors HTMLParser.HTMLParseError: malformed start tag and HTMLParser.HTMLParseError: bad end tag are both generated by Python's built-in HTML parser library. The solution is to install lxml or html5lib and use it instead.
If you encounter SyntaxError: Invalid syntax (on the line ROOT_TAG_NAME = '[document]'), the cause is running an old Python 2 version of Beautiful Soup under Python 3 without converting the code.
ImportError: No module named HTMLParser is also caused by running an old Python 2 version of Beautiful Soup under Python 3.
Conversely, ImportError: No module named html.parser is caused by running the Python 3 version of Beautiful Soup under Python 2.
If you get ImportError: No module named BeautifulSoup, the usual cause is running Beautiful Soup 3 code on a system that doesn't have BS3 installed, or writing Beautiful Soup 4 code without knowing that the package name has changed to bs4.
Finally, ImportError: No module named bs4 means you are running Beautiful Soup 4 code on a system that doesn't have BS4 installed.
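The import errors above all come down to which modules the current interpreter can find. A small standard-library check such as the following sketch can narrow the cause; the is_importable() helper and the CHECKS table are assumptions made for this example, not part of Beautiful Soup.

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if Python can locate module_name on the import path.

    is_importable is a hypothetical helper written for this example.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or a module has no spec.
        return False


# Module names taken from the error messages discussed above.
CHECKS = {
    "bs4": "Beautiful Soup 4 (install with: pip install beautifulsoup4)",
    "BeautifulSoup": "Beautiful Soup 3 (legacy package name)",
    "html.parser": "HTML parser module in the Python 3 standard library",
    "HTMLParser": "HTML parser module in the Python 2 standard library",
}

print("Python", sys.version.split()[0])
for module_name, description in CHECKS.items():
    status = "found" if is_importable(module_name) else "missing"
    print(f"{module_name:15} {status:8} {description}")
```

Run it with the same interpreter that raises the error, since different Python installations on one machine can have different packages installed.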