Beautiful Soup: A Concise Tutorial

Beautiful Soup - Parsing XML

BeautifulSoup can also parse an XML document. To do so, pass the features='xml' argument to the BeautifulSoup() constructor.

Assume that the following books.xml file is in the current working directory −

Example

<?xml version="1.0" ?>
<books>
   <book>
      <title>Python</title>
      <author>TutorialsPoint</author>
      <price>400</price>
   </book>
</books>

The following code parses the given XML file −

from bs4 import BeautifulSoup

# features="xml" selects the lxml-based XML parser (the lxml package must be installed)
with open("books.xml") as fp:
    soup = BeautifulSoup(fp, features="xml")

print(soup)
print('type:', type(soup))

When the above code is executed, you should get the following result −

<?xml version="1.0" encoding="utf-8"?>
<books>
<book>
<title>Python</title>
<author>TutorialsPoint</author>
<price>400</price>
</book>
</books>
type: <class 'bs4.BeautifulSoup'>
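Once the document is parsed, individual elements can be read with the usual search methods. The following is a minimal sketch, assuming lxml is installed. It embeds the same XML as a string so that it runs without books.xml on disk; the variable name xml_doc is only illustrative.

```python
from bs4 import BeautifulSoup

# Same content as books.xml, inlined so the example is self-contained.
xml_doc = """<?xml version="1.0" ?>
<books>
   <book>
      <title>Python</title>
      <author>TutorialsPoint</author>
      <price>400</price>
   </book>
</books>"""

soup = BeautifulSoup(xml_doc, features="xml")

# find() returns the first matching element; get_text() returns its text content.
book = soup.find("book")
title = book.find("title").get_text()
author = book.find("author").get_text()
price = int(book.find("price").get_text())  # element text is a string, so convert it

print(title, author, price)
```

Note that element text is always returned as a string, which is why the price is converted with int() before use.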

XML parser Error

By default, the Beautiful Soup package (beautifulsoup4) parses documents as HTML. It is just as easy to use with XML, however, and it handles ill-formed XML gracefully.

To parse a document as XML, you need to have the lxml parser installed. Then pass "lxml-xml" or "xml" as the second argument to the BeautifulSoup constructor −

soup = BeautifulSoup(markup, "lxml-xml")

or

soup = BeautifulSoup(markup, "xml")
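To see why the choice of parser matters, the sketch below parses the same markup as HTML and as XML. It assumes lxml is installed, and the markup string is a made-up example. HTML parsers convert tag names to lowercase, whereas the XML parser preserves case, so a case-sensitive search behaves differently in each tree.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with mixed-case tag names and a self-closing element.
markup = "<Catalog><Item/></Catalog>"

html_soup = BeautifulSoup(markup, "html.parser")   # parsed as HTML
xml_soup = BeautifulSoup(markup, "xml")            # parsed as XML
lxml_xml_soup = BeautifulSoup(markup, "lxml-xml")  # the other name for the XML parser

# HTML parsers lowercase tag names, so a search for "Item" finds nothing.
html_item = html_soup.find("Item")
# The XML parser keeps the original case, so the same search succeeds.
xml_item = xml_soup.find("Item")

print("HTML tree has <Item>:", html_item is not None)
print("XML tree has <Item>:", xml_item is not None)
print("'xml' and 'lxml-xml' agree:", str(xml_soup) == str(lxml_xml_soup))
```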

One common XML parsing error is −

AttributeError: 'NoneType' object has no attribute 'attrib'

This can happen when an element is missing or not defined and the result of find() or find_all() is used without being checked. In particular, find() returns None when nothing matches, so any attribute access on that result fails.
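One way to avoid this class of error is to check the result of find() before using it. The sketch below assumes lxml is installed; text_or_default is a hypothetical helper written for this example, not part of Beautiful Soup. The attribute named in the error message depends on what was accessed on the None result.

```python
from bs4 import BeautifulSoup

# Hypothetical document in which <book> has no <isbn> element.
xml_doc = "<books><book><title>Python</title></book></books>"
soup = BeautifulSoup(xml_doc, "xml")


def text_or_default(parent, tag_name, default=""):
    """Return the text of the first matching element, or default if it is absent."""
    element = parent.find(tag_name)
    return element.get_text() if element is not None else default


book = soup.find("book")
title = text_or_default(book, "title")
isbn = text_or_default(book, "isbn", default="unknown")
print(title, isbn)

# Without the check, calling a method on the missing element raises AttributeError.
try:
    book.find("isbn").get_text()
    failed = False
except AttributeError as exc:
    failed = True
    print("AttributeError:", exc)
```

Returning a default keeps the calling code simple. If a missing element should count as a real error, raising a descriptive exception inside the helper is a reasonable alternative.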