A Concise Python Tutorial

Python - XML Processing

XML is a portable, open standard for structured data. It lets applications exchange data that other applications can read, regardless of operating system or development language.

What is XML?

The Extensible Markup Language (XML) is a markup language much like HTML or SGML. It is recommended by the World Wide Web Consortium (W3C) and is available as an open standard.

XML is extremely useful for keeping track of small to medium amounts of data without requiring an SQL-based backbone.

XML Parser Architectures and APIs

The Python standard library provides a minimal but useful set of interfaces for working with XML. All the submodules for XML processing are available in the xml package.

xml.etree.ElementTree − the ElementTree API, a simple and lightweight XML processor
xml.dom − the DOM API definition.
xml.dom.minidom − a minimal DOM implementation.
xml.dom.pulldom − support for building partial DOM trees.
xml.sax − SAX2 base classes and convenience functions.
xml.parsers.expat − the Expat parser binding.

The two most basic and broadly used APIs for XML data are the SAX and DOM interfaces.

Simple API for XML (SAX) − Here, you register callbacks for events of interest and then let the parser proceed through the document. This is useful when your documents are large or memory is limited: the parser processes the file as it reads it from disk, and the entire file is never held in memory.
Document Object Model (DOM) − This is a World Wide Web Consortium recommendation wherein the entire file is read into the memory and stored in a hierarchical (tree-based) form to represent all the features of an XML document.

Because SAX streams through a document in a single pass, it cannot offer the fast repeated access to content that an in-memory DOM tree provides. On the other hand, relying on DOM alone can consume a great deal of memory, especially with large documents or when many files are loaded.

SAX is read-only, while DOM allows changes to the XML document. Since the two APIs complement each other, there is no reason why you cannot use both in a large project.
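As a rough illustration of how the two APIs can complement each other, the sketch below counts elements with a streaming SAX pass and then edits an attribute through a DOM tree. The in-memory document, the MovieCounter class, and the "Classics" value are assumptions made for this example only.

```python
import xml.sax
import xml.dom.minidom

# Hypothetical sample document, kept in memory for this sketch.
XML_DATA = (b'<collection shelf="New Arrivals">'
            b'<movie title="Enemy Behind"/><movie title="Ishtar"/>'
            b'</collection>')

class MovieCounter(xml.sax.ContentHandler):
   """Counts <movie> elements as SAX events stream past."""
   def __init__(self):
      super().__init__()
      self.count = 0

   def startElement(self, tag, attributes):
      if tag == "movie":
         self.count += 1

# SAX: a read-only, streaming pass over the data
counter = MovieCounter()
xml.sax.parseString(XML_DATA, counter)

# DOM: the whole tree is held in memory and can be modified
dom = xml.dom.minidom.parseString(XML_DATA)
dom.documentElement.setAttribute("shelf", "Classics")
new_shelf = dom.documentElement.getAttribute("shelf")

print(counter.count, new_shelf)
```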

For all our XML code examples, let us use a simple XML file, movies.xml, as input −

<collection shelf="New Arrivals">
<movie title="Enemy Behind">
   <type>War, Thriller</type>
   <format>DVD</format>
   <year>2003</year>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Talk about a US-Japan war</description>
</movie>
<movie title="Transformers">
   <type>Anime, Science Fiction</type>
   <format>DVD</format>
   <year>1989</year>
   <rating>R</rating>
   <stars>8</stars>
   <description>A schientific fiction</description>
</movie>
<movie title="Trigun">
   <type>Anime, Action</type>
   <format>DVD</format>
   <episodes>4</episodes>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Vash the Stampede!</description>
</movie>
<movie title="Ishtar">
   <type>Comedy</type>
   <format>VHS</format>
   <rating>PG</rating>
   <stars>2</stars>
   <description>Viewable boredom</description>
</movie>
</collection>

Parsing XML with SAX APIs

SAX is a standard interface for event-driven XML parsing. Parsing XML with SAX generally requires you to create your own ContentHandler by subclassing xml.sax.ContentHandler.

Your ContentHandler handles the particular tags and attributes of your flavor(s) of XML. A ContentHandler object provides methods to handle various parsing events. The parser that owns it calls these methods as it parses the XML file.

The methods startDocument and endDocument are called at the start and the end of the XML file. The method characters(text) receives the character data of the XML file through its text parameter; the parser may deliver one run of text in several calls.

The parser calls the ContentHandler at the start and end of each element. If the parser is not in namespace mode, the methods startElement(tag, attributes) and endElement(tag) are called; otherwise, the corresponding methods startElementNS and endElementNS are called. Here, tag is the element tag, and attributes is an Attributes object.
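The order in which these callbacks fire can be sketched with a handler that simply records each event. The EventLogger class and the one-movie document are assumptions for illustration; the parser runs in its default, non-namespace mode.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
   """Records the name of every callback the parser invokes."""
   def __init__(self):
      super().__init__()
      self.events = []
      self.text = []

   def startDocument(self):
      self.events.append("startDocument")

   def endDocument(self):
      self.events.append("endDocument")

   def startElement(self, tag, attributes):
      self.events.append("start:" + tag)

   def endElement(self, tag):
      self.events.append("end:" + tag)

   def characters(self, content):
      # keep only non-blank character data
      if content.strip():
         self.text.append(content)

logger = EventLogger()
xml.sax.parseString(b'<movie title="Trigun"><format>DVD</format></movie>', logger)
for event in logger.events:
   print(event)
print("text:", "".join(logger.text))
```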

Here are other important methods to understand before proceeding −

The make_parser Method

The following function creates a new parser object and returns it. The parser object is of the first parser type the system finds.

xml.sax.make_parser( [parser_list] )

Here are the details of the parameter −

parser_list − An optional iterable of strings naming parser modules to try first; each named module must provide a create_parser() function.
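A minimal sketch of make_parser() in use, assuming the default parser list and an in-memory byte stream instead of a file on disk. The TagCollector class and the sample document are hypothetical.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

class TagCollector(ContentHandler):
   """Remembers the tag of every element the parser reports."""
   def __init__(self):
      super().__init__()
      self.tags = []

   def startElement(self, tag, attributes):
      self.tags.append(tag)

parser = xml.sax.make_parser()                  # first parser the system finds
parser.setFeature(feature_namespaces, False)    # plain startElement/endElement
collector = TagCollector()
parser.setContentHandler(collector)

# parse() accepts a file name or a file-like object
parser.parse(io.BytesIO(b"<collection><movie/><movie/></collection>"))
print(collector.tags)
```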

The parse Method

The following function creates a SAX parser and uses it to parse a document.

xml.sax.parse( xmlfile, contenthandler[, errorhandler])

Here are the details of the parameters −

xmlfile − This is the name of the XML file to read from.
contenthandler − This must be a ContentHandler object.
errorhandler − If specified, errorhandler must be a SAX ErrorHandler object.

The parseString Method

One more function creates a SAX parser and parses the specified XML string.

xml.sax.parseString(xmlstring, contenthandler[, errorhandler])

Here are the details of the parameters −

xmlstring − This is the XML string (str or bytes) to parse.
contenthandler − This must be a ContentHandler object.
errorhandler − If specified, errorhandler must be a SAX ErrorHandler object.
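When no errorhandler is supplied, the default behaviour is to raise an exception on malformed input. The sketch below, which uses a deliberately broken document invented for this example, shows one way to catch that exception and inspect it.

```python
import xml.sax

# Hypothetical malformed input: <movie> is never closed
BROKEN = b"<collection><movie></collection>"

error_message = None
error_line = None
try:
   xml.sax.parseString(BROKEN, xml.sax.ContentHandler())
except xml.sax.SAXParseException as exc:
   # the exception carries the parser's message and position
   error_message = exc.getMessage()
   error_line = exc.getLineNumber()

print("parse failed:", error_message, "at line", error_line)
```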

Example

import xml.sax
class MovieHandler( xml.sax.ContentHandler ):
   def __init__(self):
      self.CurrentData = ""
      self.type = ""
      self.format = ""
      self.year = ""
      self.rating = ""
      self.stars = ""
      self.description = ""

   # Call when an element starts
   def startElement(self, tag, attributes):
      self.CurrentData = tag
      if tag == "movie":
         print ("*****Movie*****")
         title = attributes["title"]
         print ("Title:", title)

   # Call when an element ends
   def endElement(self, tag):
      if self.CurrentData == "type":
         print ("Type:", self.type)
      elif self.CurrentData == "format":
         print ("Format:", self.format)
      elif self.CurrentData == "year":
         print ("Year:", self.year)
      elif self.CurrentData == "rating":
         print ("Rating:", self.rating)
      elif self.CurrentData == "stars":
         print ("Stars:", self.stars)
      elif self.CurrentData == "description":
         print ("Description:", self.description)
      self.CurrentData = ""

   # Call when a character is read
   def characters(self, content):
      if self.CurrentData == "type":
         self.type = content
      elif self.CurrentData == "format":
         self.format = content
      elif self.CurrentData == "year":
         self.year = content
      elif self.CurrentData == "rating":
         self.rating = content
      elif self.CurrentData == "stars":
         self.stars = content
      elif self.CurrentData == "description":
         self.description = content

if __name__ == "__main__":
   # create an XMLReader
   parser = xml.sax.make_parser()

   # turn off namespaces
   parser.setFeature(xml.sax.handler.feature_namespaces, 0)

   # override the default ContentHandler
   handler = MovieHandler()
   parser.setContentHandler(handler)

   parser.parse("movies.xml")

This would produce the following result −

*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Year: 2003
Rating: PG
Stars: 10
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Year: 1989
Rating: R
Stars: 8
Description: A schientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Stars: 10
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Stars: 2
Description: Viewable boredom
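A parser may deliver the text of one element in several characters() calls, for example around an entity reference, so a handler that keeps only the latest chunk can lose data. One possible way to guard against this is to buffer the chunks and join them in endElement. The TextCollector class and its sample input are assumptions for illustration.

```python
import xml.sax

class TextCollector(xml.sax.ContentHandler):
   """Buffers character data and stores it per tag when the element ends."""
   def __init__(self):
      super().__init__()
      self.buffer = []
      self.fields = {}

   def startElement(self, tag, attributes):
      self.buffer = []                 # start a fresh buffer for this element

   def characters(self, content):
      self.buffer.append(content)      # may be called more than once

   def endElement(self, tag):
      text = "".join(self.buffer).strip()
      if text:
         self.fields[tag] = text
      self.buffer = []

collector = TextCollector()
xml.sax.parseString(
   b"<movie><description>Tom &amp; Jerry</description></movie>", collector)
print(collector.fields)
```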

For complete details on the SAX API, please refer to the standard Python SAX API documentation.

Parsing XML with DOM APIs

The Document Object Model ("DOM") is a cross-language API from the World Wide Web Consortium (W3C) for accessing and modifying XML documents.

The DOM is extremely useful for random-access applications. SAX only lets you see one piece of the document at a time: while you are handling one element, you have no access to any other.

The easiest way to load an XML document quickly is to use the xml.dom.minidom module. It provides a simple parser function that creates a DOM tree from an XML file.

The sample code calls the parse(file[, parser]) function of the minidom module to parse the XML file designated by file into a DOM tree object.

import xml.dom.minidom

# Open XML document using minidom parser
DOMTree = xml.dom.minidom.parse("movies.xml")
collection = DOMTree.documentElement
if collection.hasAttribute("shelf"):
   print ("Root element : %s" % collection.getAttribute("shelf"))

# Get all the movies in the collection
movies = collection.getElementsByTagName("movie")

# Print detail of each movie.
for movie in movies:
   print ("*****Movie*****")
   if movie.hasAttribute("title"):
      print ("Title: %s" % movie.getAttribute("title"))

   # avoid shadowing the built-in names type and format
   movie_type = movie.getElementsByTagName('type')[0]
   print ("Type: %s" % movie_type.childNodes[0].data)
   movie_format = movie.getElementsByTagName('format')[0]
   print ("Format: %s" % movie_format.childNodes[0].data)
   rating = movie.getElementsByTagName('rating')[0]
   print ("Rating: %s" % rating.childNodes[0].data)
   description = movie.getElementsByTagName('description')[0]
   print ("Description: %s" % description.childNodes[0].data)

This would produce the following output −

Root element : New Arrivals
*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Rating: PG
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Rating: R
Description: A schientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Description: Viewable boredom
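Indexing getElementsByTagName(...)[0] raises an IndexError when a child element is absent, as <year> is for some movies in movies.xml. A small helper can return a default instead. The child_text function, its default value, and the shortened sample data are assumptions for this sketch.

```python
import xml.dom.minidom

# Hypothetical, shortened variant of movies.xml
DATA = """<collection>
<movie title="Enemy Behind"><year>2003</year></movie>
<movie title="Trigun"><episodes>4</episodes></movie>
</collection>"""

def child_text(node, tag, default="n/a"):
   """Return the text of the first <tag> descendant, or default if missing."""
   matches = node.getElementsByTagName(tag)
   if matches and matches[0].firstChild is not None:
      return matches[0].firstChild.data
   return default

dom = xml.dom.minidom.parseString(DATA)
years = {movie.getAttribute("title"): child_text(movie, "year")
         for movie in dom.getElementsByTagName("movie")}
print(years)
```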

For complete details on the DOM API, please refer to the standard Python DOM API documentation.

ElementTree XML API

The xml package includes the xml.etree.ElementTree module, a simple and lightweight XML processing API.

XML is a tree-like hierarchical data format. The ElementTree class in this module treats the whole XML document as a tree, and the Element class represents a single node in this tree. Reading and writing XML files is done at the ElementTree level. Interactions with a single XML element and its sub-elements are done at the Element level.
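To make the Element level concrete, here is a small sketch that parses an in-memory string with fromstring() and navigates the resulting elements with findall(), get() and findtext(). The shortened sample data is an assumption based on movies.xml.

```python
import xml.etree.ElementTree as et

# Hypothetical, shortened variant of movies.xml
DATA = ('<collection shelf="New Arrivals">'
        '<movie title="Enemy Behind"><stars>10</stars></movie>'
        '<movie title="Ishtar"><stars>2</stars></movie>'
        '</collection>')

root = et.fromstring(DATA)           # returns the root Element
titles = [movie.get("title") for movie in root.findall("movie")]
stars = {movie.get("title"): int(movie.findtext("stars"))
         for movie in root.findall("movie")}

print(root.tag, root.attrib)
print(titles)
print(stars)
```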

Create an XML File

The tree is a hierarchical structure of elements, starting with the root and followed by other elements. Each element is created with the Element() function of this module.

import xml.etree.ElementTree as et
e=et.Element('name')

Each element is characterized by a tag and by an attrib attribute, which is a dict object. For an element created without attributes, such as the tree's starting element, attrib is an empty dictionary.

>>> root = et.Element('employees')
>>> root.tag
'employees'
>>> root.attrib
{}
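For comparison, an element can also be given attributes, either as a dictionary at creation time or later through set(). The "id" and "dept" attributes below are invented for this sketch.

```python
import xml.etree.ElementTree as et

# "id" and "dept" are hypothetical attributes for illustration
emp = et.Element("employee", {"id": "101"})
emp.set("dept", "sales")             # add or update a single attribute

xml_text = et.tostring(emp, encoding="unicode")
print(emp.attrib)
print(xml_text)
```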

You may now set up one or more child elements to add under the root element. Each child may have one or more sub-elements of its own. Add them with the SubElement() function and set their text attribute.

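# sample record; the values are taken from the example that follows
employee = {'name': 'aaa', 'salary': 5000}
child = et.Element("employee")
nm = et.SubElement(child, "name")
nm.text = employee.get('name')
salary = et.SubElement(child, "salary")
salary.text = str(employee.get('salary'))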

Each child is added to the root with the append() function −

root.append(child)

After adding the required number of child elements, construct a tree object with the ElementTree() function −

tree = et.ElementTree(root)

The entire tree structure is written to a file opened in binary mode by the tree object's write() method −

# the with statement closes the file after writing
with open('employees.xml', "wb") as f:
   tree.write(f)

Example

In this example, a tree is constructed from a list of dictionaries. Each dictionary holds key-value pairs describing one employee. The resulting tree is written to 'employees.xml'.

import xml.etree.ElementTree as et

# Each dictionary describes one employee
employees = [{'name': 'aaa', 'age': 21, 'sal': 5000}, {'name': 'xyz', 'age': 22, 'sal': 6000}]
root = et.Element("employees")
for employee in employees:
   child = et.Element("employee")
   root.append(child)
   nm = et.SubElement(child, "name")
   nm.text = employee.get('name')
   age = et.SubElement(child, "age")
   age.text = str(employee.get('age'))
   sal = et.SubElement(child, "sal")
   sal.text = str(employee.get('sal'))
tree = et.ElementTree(root)
with open('employees.xml', "wb") as fh:
   tree.write(fh)

The file 'employees.xml' is stored in the current working directory.

<employees><employee><name>aaa</name><age>21</age><sal>5000</sal></employee><employee><name>xyz</name><age>22</age><sal>6000</sal></employee></employees>
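By default the tree is serialized on a single line, as above. If a more readable layout is wanted, one option is the indent() function, which is available in xml.etree.ElementTree from Python 3.9. The sketch below guards the call so it still runs on older versions; the three-space indent is an arbitrary choice.

```python
import xml.etree.ElementTree as et

root = et.Element("employees")
child = et.SubElement(root, "employee")
et.SubElement(child, "name").text = "aaa"

# indent() was added in Python 3.9; skip it on older interpreters
if hasattr(et, "indent"):
   et.indent(root, space="   ")

pretty = et.tostring(root, encoding="unicode")
print(pretty)
```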

Parse an XML File

Let us now read back the 'employees.xml' file created in the example above. For this purpose, the following functions of the ElementTree module are used −

ElementTree() − When called with the file argument, this constructor reads the hierarchical structure of elements from an XML file into a tree object.

tree = et.ElementTree(file='employees.xml')

getroot() − This function returns the root element of the tree.

root = tree.getroot()

You can obtain the list of sub-elements one level below an element.

children = list(root)

In the following example, the elements and sub-elements of 'employees.xml' are parsed into a list of dictionaries.

Example

import xml.etree.ElementTree as et
tree = et.ElementTree(file='employees.xml')
root = tree.getroot()
employees=[]
children = list(root)
for child in children:
   employee={}
   pairs = list(child)
   for pair in pairs:
      employee[pair.tag]=pair.text
   employees.append(employee)
print (employees)

It will produce the following output −

[{'name': 'aaa', 'age': '21', 'sal': '5000'}, {'name': 'xyz', 'age': '22', 'sal': '6000'}]
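Note that every value read back is a string, because element text is always text. If numbers are needed, they can be converted while the dictionaries are built. The NUMERIC set and the in-memory sample below are assumptions for this sketch.

```python
import xml.etree.ElementTree as et

# In-memory stand-in for the employee data, used only for this sketch
DATA = ("<employees><employee><name>aaa</name>"
        "<age>21</age><sal>5000</sal></employee></employees>")

NUMERIC = {"age", "sal"}             # tags whose text should become int

root = et.fromstring(DATA)
employees = [
   {field.tag: int(field.text) if field.tag in NUMERIC else field.text
    for field in employee}           # iterating an Element yields its children
   for employee in root
]
print(employees)
```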

Modify an XML file

We shall use the iter() method of Element. It creates a tree iterator with the current element as the root, optionally restricted to a given tag. The iterator visits this element and all elements below it in document (depth-first) order.

Let us build an iterator over all 'sal' sub-elements and increase the text of each sal tag by 100.

import xml.etree.ElementTree as et
tree = et.ElementTree(file='employees.xml')
root = tree.getroot()
for x in root.iter('sal'):
   s=int (x.text)
   s=s+100
   x.text=str(s)
with open("employees.xml", "wb") as fh:
   tree.write(fh)

Our 'employees.xml' will now be modified accordingly. We can also use set() to update the value of an attribute on an element.

x.set('marks', str(mark))   # 'marks' is the attribute name; mark is a value computed elsewhere
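The two kinds of update can be seen side by side in a self-contained sketch: the text of each element is changed through its text attribute, while set() adds or updates an attribute. The in-memory data and the "currency" attribute are invented for illustration.

```python
import xml.etree.ElementTree as et

# In-memory stand-in for the employee data, used only for this sketch
root = et.fromstring(
   "<employees>"
   "<employee><sal>5000</sal></employee>"
   "<employee><sal>6000</sal></employee>"
   "</employees>")

for sal in root.iter("sal"):
   sal.text = str(int(sal.text) + 100)   # change the element's text
   sal.set("currency", "USD")            # hypothetical attribute

updated = [(sal.text, sal.get("currency")) for sal in root.iter("sal")]
print(updated)
```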