Python Data Persistence 简明教程

Python Data Persistence - XML Parsers

XML 是 eXtensible Markup Language 的首字母缩写。它是一种便携式、开放源码且跨平台的语言,非常类似于 HTML 或 SGML,并且得到万维网联盟的推荐。

XML is acronym for eXtensible Markup Language. It is a portable, open source and cross platform language very much like HTML or SGML and recommended by the World Wide Web Consortium.

它是一种众所周知的用于大量应用程序(例如 Web 服务、办公工具和 Service Oriented Architectures (SOA))的数据交换格式。XML 格式机器可读且人类可读。

It is a well-known data interchange format, used by a large number of applications such as web services, office tools, and Service Oriented Architectures (SOA). XML format is both machine readable and human readable.

标准 Python 库的 XML 包包含以下用于处理 XML 的模块:

Standard Python library’s xml package consists of following modules for XML processing −

Sr.No.

Modules & Description

1

xml.etree.ElementTree the ElementTree API, a simple and lightweight XML processor

2

xml.dom the DOM API definition

3

xml.dom.minidom a minimal DOM implementation

4

xml.sax SAX2 interface implementation

5

xml.parsers.expat the Expat parser binding

XML 文档中数据以树状分层格式进行排列,从根元素开始。每个元素都是树中的一个单独节点,具有括在 <> 和 </> 标记之间的一个属性。可以将一个或多个子元素分配给每个元素。

Data in the XML document is arranged in a tree-like hierarchical format, starting with root and elements. Each element is a single node in the tree and has an attribute enclosed in <> and </> tags. One or more sub-elements may be assigned to each element.

以下是 XML 文档的一个典型示例:

Following is a typical example of a XML document −

<?xml version = "1.0" encoding = "iso-8859-1"?>
<studentlist>
   <student>
      <name>Ratna</name>
      <subject>Physics</subject>
      <marks>85</marks>
   </student>
   <student>
      <name>Kiran</name>
      <subject>Maths</subject>
      <marks>100</marks>
   </student>
   <student>
      <name>Mohit</name>
      <subject>Biology</subject>
      <marks>92</marks>
   </student>
</studentlist>

在使用 ElementTree 模块时,第一步是为树设置根元素。每个元素都具有一个标记和属性,它是一个字典对象。对于根元素,属性是一个空字典。

While using ElementTree module, first step is to set up root element of the tree. Each Element has a tag and attrib which is a dict object. For the root element, an attrib is an empty dictionary.

import xml.etree.ElementTree as xmlobj
root=xmlobj.Element('studentList')

现在,我们可以在根元素下添加一个或多个元素。每个元素对象可能具有 SubElements 。每个子元素具有一个属性和文本属性。

Now, we can add one or more elements under root element. Each element object may have SubElements. Each subelement has an attribute and text property.

student=xmlobj.Element('student')
   nm=xmlobj.SubElement(student, 'name')
   nm.text='name'
   subject=xmlobj.SubElement(student, 'subject')
   nm.text='Ratna'
   subject.text='Physics'
   marks=xmlobj.SubElement(student, 'marks')
   marks.text='85'

使用 append() 方法将此新元素附加到根元素。

This new element is appended to the root using append() method.

root.append(student)

使用上述方法追加任意多的元素。最后,将根元素对象写入文件。

Append as many elements as desired using above method. Finally, the root element object is written to a file.

tree = xmlobj.ElementTree(root)
   file = open('studentlist.xml','wb')
   tree.write(file)
   file.close()

现在,我们了解如何解析 XML 文件。为此,通过将它的名称作为 ElementTree 构造函数中的文件参数来构造文档树。

Now, we see how to parse the XML file. For that, construct document tree giving its name as file parameter in ElementTree constructor.

tree = xmlobj.ElementTree(file='studentlist.xml')

树对象具有 getroot() 方法来获取根元素,而 getchildren() 返回位于它下方的一系列元素。

The tree object has getroot() method to obtain root element and getchildren() returns a list of elements below it.

root = tree.getroot()
children = root.getchildren()

通过迭代每个子节点的子元素集合,构造一个对应于每个子元素的字典对象。

A dictionary object corresponding to each sub element is constructed by iterating over sub-element collection of each child node.

for child in children:
   student={}
   pairs = child.getchildren()
   for pair in pairs:
      product[pair.tag]=pair.text

然后将每个字典附加到列表,返回字典对象的原始列表。

Each dictionary is then appended to a list returning original list of dictionary objects.

SAX 是用于事件驱动 XML 解析的标准接口。使用 SAX 解析 XML 要求通过对 xml.sax.ContentHandler 进行子类化来使用 ContentHandler。您注册针对感兴趣的事件的回调,然后让解析器继续处理文档。

SAX is a standard interface for event-driven XML parsing. Parsing XML with SAX requires ContentHandler by subclassing xml.sax.ContentHandler. You register callbacks for events of interest and then, let the parser proceed through the document.

SAX 在文档较大或存在内存限制的情况下非常有用,因为它在磁盘读取文件时会对其进行解析,因此整个文件永远不会存储在内存中。

SAX is useful when your documents are large or you have memory limitations as it parses the file as it reads it from disk as a result entire file is never stored in the memory.

Document Object Model

(DOM) API 是一个万维网联盟推荐。在这种情况下,将整个文件读入内存并存储在分层(基于树)的形式中,以表示 XML 文档的所有特性。

(DOM) API is a World Wide Web Consortium recommendation. In this case, entire file is read into the memory and stored in a hierarchical (tree-based) form to represent all the features of an XML document.

SAX 速度不如 DOM,对于较大的文件。另一方面,如果 DOM 用于多个小文件,则可能会耗尽资源。SAX 是只读的,而 DOM 允许更改 XML 文件。

SAX, not as fast as DOM, with large files. On the other hand, DOM can kill resources, if used on many small files. SAX is read-only, while DOM allows changes to the XML file.