Ruby Concise Tutorial

Ruby - XML, XSLT and XPath Tutorial

What is XML?

The Extensible Markup Language (XML) is a markup language much like HTML or SGML. It is recommended by the World Wide Web Consortium and is available as an open standard.

XML is a portable, open standard that lets programmers develop applications whose data can be read by other applications, regardless of operating system or development language.

XML is extremely useful for keeping track of small to medium amounts of data without requiring a SQL-based backend.

XML Parser Architectures and APIs

XML parsers come in two different flavors:

  1. SAX-like (stream interfaces): Here you register callbacks for events of interest and then let the parser proceed through the document. This is useful when your documents are large or memory is limited, because the parser processes the file as it reads it from disk and never stores the entire file in memory.

  2. DOM-like (object tree interfaces): This is a World Wide Web Consortium recommendation in which the entire file is read into memory and stored in a hierarchical (tree-based) form that represents all the features of an XML document.

SAX cannot offer the random access that DOM provides: once a DOM tree is loaded, any node can be reached directly, whereas a SAX parser has to stream through the document again. On the other hand, relying on DOM exclusively can exhaust your memory, because every document is held in memory as a full tree, which hurts most with large files or with many files loaded at once.

SAX is read-only, while DOM allows changes to the XML file. Since these two APIs complement each other, there is no reason why you can't use both in a large project.

Parsing and Creating XML using Ruby

The most common way to manipulate XML in Ruby is with the REXML library by Sean Russell. Since 2002, REXML has been part of the standard Ruby distribution.

REXML is a pure-Ruby XML processor conforming to the XML 1.0 standard. It is a non-validating processor that passes all of the OASIS non-validating conformance tests.

The REXML parser has the following advantages over other available parsers:

  1. It is written 100 percent in Ruby.

  2. It can be used for both SAX and DOM parsing.

  3. It is lightweight, at fewer than 2,000 lines of code.

  4. Its methods and classes are easy to understand.

  5. It offers a SAX2-based API and full XPath support.

  6. It ships with the Ruby installation, so no separate installation is required.

For all our XML code examples, let's use a simple XML file, movies.xml, as input:

<collection shelf = "New Arrivals">
   <movie title = "Enemy Behind">
      <type>War, Thriller</type>
      <format>DVD</format>
      <year>2003</year>
      <rating>PG</rating>
      <stars>10</stars>
      <description>Talk about a US-Japan war</description>
   </movie>
   <movie title = "Transformers">
      <type>Anime, Science Fiction</type>
      <format>DVD</format>
      <year>1989</year>
      <rating>R</rating>
      <stars>8</stars>
      <description>A schientific fiction</description>
   </movie>
   <movie title = "Trigun">
      <type>Anime, Action</type>
      <format>DVD</format>
      <episodes>4</episodes>
      <rating>PG</rating>
      <stars>10</stars>
      <description>Vash the Stampede!</description>
   </movie>
   <movie title = "Ishtar">
      <type>Comedy</type>
      <format>VHS</format>
      <rating>PG</rating>
      <stars>2</stars>
      <description>Viewable boredom</description>
   </movie>
</collection>

DOM-like Parsing

Let's first parse our XML data in tree fashion. We begin by requiring the rexml/document library; for convenience, we often also include REXML to import its names into the top-level namespace.

#!/usr/bin/ruby -w

require 'rexml/document'
include REXML

# Parse the whole file into an in-memory tree. The block form of
# File.open closes the file as soon as parsing is done.
xmldoc = File.open("movies.xml") { |xmlfile| Document.new(xmlfile) }

# Now get the root element
root = xmldoc.root
puts "Root element : " + root.attributes["shelf"]

# This will output all the movie titles.
xmldoc.elements.each("collection/movie") do |e|
   puts "Movie Title : " + e.attributes["title"]
end

# This will output all the movie types.
xmldoc.elements.each("collection/movie/type") do |e|
   puts "Movie Type : " + e.text
end

# This will output all the movie descriptions.
xmldoc.elements.each("collection/movie/description") do |e|
   puts "Movie Description : " + e.text
end

This will produce the following result:

Root element : New Arrivals
Movie Title : Enemy Behind
Movie Title : Transformers
Movie Title : Trigun
Movie Title : Ishtar
Movie Type : War, Thriller
Movie Type : Anime, Science Fiction
Movie Type : Anime, Action
Movie Type : Comedy
Movie Description : Talk about a US-Japan war
Movie Description : A schientific fiction
Movie Description : Vash the Stampede!
Movie Description : Viewable boredom
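
Because the DOM-style tree lives in memory, it can also be built and changed, not just read. The following is a minimal sketch of that idea, assuming only the REXML calls add_element, attributes[]=, text= and write. The element names are borrowed from movies.xml, while the new shelf name "Classics" and the changed format are made up for illustration.

```ruby
require 'rexml/document'

# Build a small document from scratch instead of reading a file.
doc = REXML::Document.new
root = doc.add_element("collection", { "shelf" => "New Arrivals" })

movie = root.add_element("movie", { "title" => "Ishtar" })
movie.add_element("type").text = "Comedy"
movie.add_element("format").text = "VHS"

# Modify the in-memory tree: change an attribute and an element's text.
root.attributes["shelf"] = "Classics"
doc.elements["collection/movie[@title='Ishtar']/format"].text = "DVD"

# Serialize the tree into a string; the second argument asks for
# output indented by two spaces per level.
output = +""
doc.write(output, 2)
puts output
```

Writing to a File object instead of a string works the same way, since write only needs an object that responds to <<.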

SAX-like Parsing

To process the same file, movies.xml, in a stream-oriented way, we define a listener class whose methods are the targets of callbacks from the parser.

NOTE: SAX-like parsing is not recommended for a small file; it is used here only for demonstration.

#!/usr/bin/ruby -w

require 'rexml/document'
require 'rexml/streamlistener'
include REXML

class MyListener
   include REXML::StreamListener

   # Called for every opening tag with its name and attribute hash
   def tag_start(*args)
      puts "tag_start: #{args.map {|x| x.inspect}.join(', ')}"
   end

   # Called for every run of character data
   def text(data)
      return if data =~ /\A\s*\z/     # skip whitespace-only text
      abbrev = data.length > 40 ? data[0, 40] + "..." : data
      puts "  text   :   #{abbrev.inspect}"
   end
end

list = MyListener.new
File.open("movies.xml") { |xmlfile| Document.parse_stream(xmlfile, list) }

This will produce the following result (Ruby 3.4 and later print a non-empty hash with spaces around =>):

tag_start: "collection", {"shelf"=>"New Arrivals"}
tag_start: "movie", {"title"=>"Enemy Behind"}
tag_start: "type", {}
  text   :   "War, Thriller"
tag_start: "format", {}
  text   :   "DVD"
tag_start: "year", {}
  text   :   "2003"
tag_start: "rating", {}
  text   :   "PG"
tag_start: "stars", {}
  text   :   "10"
tag_start: "description", {}
  text   :   "Talk about a US-Japan war"
tag_start: "movie", {"title"=>"Transformers"}
tag_start: "type", {}
  text   :   "Anime, Science Fiction"
tag_start: "format", {}
  text   :   "DVD"
tag_start: "year", {}
  text   :   "1989"
tag_start: "rating", {}
  text   :   "R"
tag_start: "stars", {}
  text   :   "8"
tag_start: "description", {}
  text   :   "A schientific fiction"
tag_start: "movie", {"title"=>"Trigun"}
tag_start: "type", {}
  text   :   "Anime, Action"
tag_start: "format", {}
  text   :   "DVD"
tag_start: "episodes", {}
  text   :   "4"
tag_start: "rating", {}
  text   :   "PG"
tag_start: "stars", {}
  text   :   "10"
tag_start: "description", {}
  text   :   "Vash the Stampede!"
tag_start: "movie", {"title"=>"Ishtar"}
tag_start: "type", {}
  text   :   "Comedy"
tag_start: "format", {}
  text   :   "VHS"
tag_start: "rating", {}
  text   :   "PG"
tag_start: "stars", {}
  text   :   "2"
tag_start: "description", {}
  text   :   "Viewable boredom"
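
A listener does not have to print anything; more often it collects just the pieces it cares about while the document streams past. The sketch below illustrates this under a few assumptions: the class name TitleCollector and its accessors are hypothetical, and a two-movie excerpt of the sample data is inlined as a string so that the example runs without movies.xml on disk.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener that gathers movie titles and formats
# instead of printing every event.
class TitleCollector
  include REXML::StreamListener

  attr_reader :titles, :formats

  def initialize
    @titles = []
    @formats = []
    @in_format = false
  end

  # Remember each movie's title attribute and note when we enter <format>.
  def tag_start(name, attrs)
    @titles << attrs["title"] if name == "movie"
    @in_format = (name == "format")
  end

  def tag_end(name)
    @in_format = false if name == "format"
  end

  # Only text that appears inside a <format> element is kept.
  def text(data)
    @formats << data.strip if @in_format
  end
end

# Inlined excerpt of the sample data (an assumption made for this sketch).
xml = <<~XML
  <collection shelf="New Arrivals">
    <movie title="Trigun"><format>DVD</format></movie>
    <movie title="Ishtar"><format>VHS</format></movie>
  </collection>
XML

collector = TitleCollector.new
REXML::Document.parse_stream(xml, collector)

p collector.titles
p collector.formats
```

The listener keeps only a small amount of state (a flag and two arrays), which is the point of the stream approach: memory use does not grow with the size of the whole document.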

XPath and Ruby

An alternative way to view XML is XPath. This is a kind of pseudo-language that describes how to locate specific elements and attributes in an XML document, treating the document as a logically ordered tree.

REXML supports XPath via the XPath class. It assumes tree-based parsing (the document object model), as we have seen above.

#!/usr/bin/ruby -w

require 'rexml/document'
include REXML

xmlfile = File.new("movies.xml")
xmldoc = Document.new(xmlfile)

# Info for the first movie found
movie = XPath.first(xmldoc, "//movie")
p movie

# Print out all the movie types
XPath.each(xmldoc, "//type") { |e| puts e.text }

# Get an array of all of the movie formats.
formats = XPath.match(xmldoc, "//format").map {|x| x.text }
p formats

This will produce the following result:

<movie title='Enemy Behind'> ... </>
War, Thriller
Anime, Science Fiction
Anime, Action
Comedy
["DVD", "DVD", "DVD", "VHS"]
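
XPath expressions can also filter with predicates, which is often what makes them more convenient than walking the tree by hand. The following sketch assumes a shortened, inlined copy of the sample data and shows one predicate on an attribute and one on a child element's text; the variable names are chosen for illustration only.

```ruby
require 'rexml/document'

# Inlined excerpt of the sample data (an assumption made for this sketch).
xml = <<~XML
  <collection shelf="New Arrivals">
    <movie title="Enemy Behind"><format>DVD</format></movie>
    <movie title="Ishtar"><format>VHS</format></movie>
  </collection>
XML
doc = REXML::Document.new(xml)

# Predicate on an attribute: the movie whose title is "Ishtar".
ishtar = REXML::XPath.first(doc, "//movie[@title='Ishtar']")
ishtar_format = ishtar.elements["format"].text
puts ishtar_format

# Predicate on a child element: titles of all movies available on DVD.
dvd_titles = REXML::XPath.match(doc, "//movie[format='DVD']").map do |movie|
  movie.attributes["title"]
end
p dvd_titles
```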

XSLT and Ruby

There are two XSLT processors available that Ruby can use. A brief description of each is given here.

Ruby-Sablotron

This processor is written and maintained by Masayoshi Takahashi. It is written primarily for Linux and requires the following libraries:

  1. Sablot

  2. Iconv

  3. Expat

You can find this module at Ruby-Sablotron.

XSLT4R

XSLT4R is written by Michael Neumann and can be found at the RAA in the Library section under XML. XSLT4R uses a simple command-line interface, though it can also be used within a third-party application to transform an XML document.

XSLT4R needs XMLScan to operate, which is included in the XSLT4R archive and is also a 100 percent Ruby module. These modules can be installed using the standard Ruby installation method (i.e., ruby install.rb).

XSLT4R has the following syntax:

ruby xslt.rb stylesheet.xsl document.xml [arguments]

If you want to use XSLT4R from within an application, you can require xslt and pass in the parameters you need. Here is an example:

require "xslt"

# File.read returns the whole file as a single string
stylesheet = File.read("stylesheet.xsl")
xml_doc = File.read("document.xml")
arguments = { 'image_dir' => '/....' }
sheet = XSLT::Stylesheet.new( stylesheet, arguments )

# output to StdOut
sheet.apply( xml_doc )

# output to 'str'
str = ""
sheet.output = [ str ]
sheet.apply( xml_doc )

Further Reading

  1. For complete details on the REXML parser, please refer to the standard REXML documentation.

  2. You can download XSLT4R from RAA Repository.