SAXON: the Java API

Contents

Introduction
Structure
Choosing a SAX Parser
The Controller and Builder
Standard Handlers
Writing an Element Handler
Writing a Text Node Handler
Patterns and Expressions
The ElementInfo Object

Introduction

This document describes how to use SAXON as a Java class library, without making any use of XSLT stylesheets. If you want to know how to control stylesheet processing from a Java application, see using-xsl.html.

Note: The Java API was provided in SAXON long before the XSL interface. Most of the things that the Java API was designed to do can now be done more conveniently in XSL. Reflecting this, some of the features of the API have been withdrawn as redundant, and the focus in SAXON will increasingly be on doing everything possible through XSL.

The Java processing model in SAXON is an extension of the XSL processing model:

You write handlers for elements and other nodes in the document.
You define the rules that associate a particular handler with particular elements or other nodes. These rules are expressed as XSL-compatible patterns. The node handlers are analogous to XSL templates.
Within a node handler, you can select other nodes (typically but not necessarily the immediate children) for processing. The system will automatically invoke the appropriate handler for each selected node. Alternatively, you can use SAXON API calls to navigate directly to other nodes in the document.

You can process some elements in Java and others in XSL if you wish. You can also use a range of standard node handlers supplied with the SAXON package.

The standard node handlers issued with the SAXON package provide services such as the following:

Copying an element unchanged to the output
Skipping over an element, omitting it from the output
Supplying strings to replace the start and end tags of an element (for example to add HTML formatting)
Recognising a group of consecutive elements of the same type (for example to generate an HTML table or list structure)

When a Java node handler is invoked, it is provided with information about the node via a NodeInfo object (usually you will be processing element nodes, in which case the NodeInfo will be an ElementInfo object). The node handler is also given information about the processing context, and access to a wide range of processing services, via a Context object.

The NodeInfo provides facilities to:

determine the node's type and attributes
locate the node's parent or siblings
associate user data with a node

The Context object allows the node handler to:

access any parameters associated with the applyTemplates() call that invoked this node handler
get information about the current node list (the list of nodes being processed by this handler: for example, to determine if this is the last node in the list)
get rapid access to nodes based on registered keys and identifiers
declare and reference variables
set an output destination for output from this node or its children (useful when splitting an XML document hierarchically)
write output to the current destination

SAXON: comparison with SAX and DOM

There are two standard APIs for processing XML documents: the SAX interface and the DOM. SAX (see http://www.megginson.com/SAX/index.html) is an event-driven interface in which the parser reports things such as start and end tags to the application as they are encountered, while the Document Object Model (DOM) (see http://www.w3.org/dom) is a navigational interface in which the application can roam over the document in memory by following relationships between its nodes.
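The contrast between the two styles can be sketched with the SAX and DOM parsers bundled with a current JDK. This is only an illustration of the two standard APIs, not SAXON code; the class name SaxVersusDom and the sample document are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustration only: SaxVersusDom and its sample document are hypothetical.
class SaxVersusDom {

    static final String XML = "<book><title>SAXON</title><chapter/><chapter/></book>";

    // SAX: the parser pushes start and end tag events to the handler in document order.
    static List<String> saxEvents() {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                events.add("start:" + qName);
            }
            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end:" + qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(XML)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return events;
    }

    // DOM: the application navigates the in-memory tree, here from <title> to its parent.
    static String domParentOfTitle() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            Element title = (Element) doc.getElementsByTagName("title").item(0);
            return title.getParentNode().getNodeName();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(saxEvents());
        System.out.println(domParentOfTitle());
    }
}
```

With SAX the application sees each tag once, as it goes by; with DOM it can ask for the parent of a node at any time, at the cost of holding the whole tree in memory.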

SAXON offers a higher-level processing model than either SAX or DOM. It allows applications to be written using a rule-based design pattern, in which your application consists of a set of rules of the form "when this condition is encountered, do this processing". It is an event-condition-action model in which the events are the syntactic constructs of XML, the conditions are XSLT-compatible patterns, and the actions are Java methods.
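The event-condition-action idea can be sketched in a few lines on top of plain SAX. This is a deliberately simplified model and not the SAXON API: the class RuleDispatchSketch and its addRule() method are hypothetical, and the "condition" is just an element name, whereas SAXON uses full XSLT-compatible patterns.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of rule-based dispatch; none of these names belong to SAXON.
class RuleDispatchSketch {

    // Rule table: condition (an element name) -> action (a piece of Java code).
    private final Map<String, Consumer<StringBuilder>> rules = new LinkedHashMap<>();

    void addRule(String elementName, Consumer<StringBuilder> action) {
        rules.put(elementName, action);
    }

    // Event: each start tag. The action of the matching rule, if any, is run.
    String process(String xml) {
        StringBuilder out = new StringBuilder();
        DefaultHandler dispatcher = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                Consumer<StringBuilder> action = rules.get(qName);
                if (action != null) {
                    action.accept(out);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), dispatcher);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    static String demo() {
        RuleDispatchSketch rules = new RuleDispatchSketch();
        rules.addRule("title", out -> out.append("[title]"));
        rules.addRule("para", out -> out.append("[para]"));
        // <note/> has no rule, so it is ignored in this sketch.
        return rules.process("<doc><title>T</title><para>a</para><note/><para>b</para></doc>");
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The application code consists only of the rules; the dispatcher decides when each one fires.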

If you are familiar with SAX, the main differences in SAXON are:

You can provide a separate handler for each element type (or other node), making your application more modular
SAXON supplies context information to your application, so you can find out, for example, the parent element of the one you are currently processing
SAXON provides facilities for organizing the output of your application, allowing you to direct different parts of the output to different files. SAXON is a particularly convenient tool for splitting a large document into page-sized chunks for viewing, or into individual records for storing in a relational or object database.
SAXON provides a number of standard element handlers that you can use to perform common rendition tasks. A Java SAXON application to produce HTML output will often be as simple as an equivalent XSL script, and Java imposes fewer constraints.
SAXON allows you to register your preferred SAX-compliant XML parser; you do not need to hard-code the name of the parser into your application or supply it each time on the command line. SAXON also works with several DOM implementations.
SAXON extends the SAX InputSource class, allowing you to specify a file name as the source of input.

Serial and Direct processing: preview mode

An earlier release of SAXON allowed a purely serial mode of processing: each node was processed as it was encountered. With experience, this proved too restrictive, and caused the internal architecture to become too complex, so it was withdrawn. It has been replaced with a new facility, preview mode. This is available both with XSL and with the Java API.

Preview mode is useful where the document is too large to fit comfortably in main memory. It allows you to define node handlers that are called while the document tree is being built in memory, rather than waiting until the tree is fully built, which is the normal case.

When you define an element as a preview element (using the setPreviewElement() method of the PreviewManager class), its node handler is called as soon as the element end tag is encountered. When the node handler returns control to SAXON, the children of the preview element are discarded from memory.

This means, for example, that if your large XML document consists of a large number of chapters, you can process each chapter as it is read, and the memory available needs to be enough only for (a) the largest individual chapter, and (b) the top-level structure identifying the list of chapters.

When the document tree has been fully built, the node handler for its root element will be called in the normal way.
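The memory argument behind preview mode can be illustrated with plain SAX. The sketch below is not the PreviewManager API: it assumes a hypothetical chapter element and uses a text buffer as a stand-in for the subtree that SAXON would build, process at the end tag, and then discard.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of the preview idea; PreviewSketch is not a SAXON class.
class PreviewSketch {

    // Processes each <chapter> as soon as its end tag is seen, then drops its content.
    static List<String> chapterSizes(String xml) {
        List<String> sizes = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            // Content of the chapter currently being read; null outside a chapter.
            private StringBuilder current;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals("chapter")) {
                    current = new StringBuilder();
                }
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                if (current != null) {
                    current.append(ch, start, length);
                }
            }
            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("chapter")) {
                    sizes.add(current.length() + " chars"); // "process" the chapter
                    current = null;                          // discard it from memory
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(chapterSizes("<book><chapter>abc</chapter><chapter>defgh</chapter></book>"));
    }
}
```

At any moment only one chapter's content is held, plus the short list of results, which corresponds to points (a) and (b) above.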

Structure

The monolithic Controller class included in earlier SAXON releases has been split up, primarily to allow multithreaded stylesheets. The Controller still exists, but is now used in conjunction with a set of subsidiary classes to control specific aspects of processing.

There are several classes used to define the kind of processing you want to perform. These are the RuleManager for registering template rules, the KeyManager for registering key definitions, the PreviewManager for registering preview elements, the Stripper for registering which elements are to have whitespace nodes stripped, and the DecimalFormatManager for registering named decimal formats. These classes can all be reused freely, and they are thread safe once the definitions have been set up.

The Builder class is used to build a document tree from a SAX InputSource. Its main method is build(). The builder can be serially reused to build further documents, but it should only be used for one document at a time. The builder needs to know about the Stripper if whitespace nodes are to be stripped from the tree, and it needs to know about the PreviewManager if any elements are to be processed in preview mode. The relevant classes can be registered with the builder using the setStripper() and setPreviewManager() methods.

The Controller class is used to process a document tree by applying registered node handlers. Its main method is run(). The controller is responsible for navigating through the document and calling user-defined handlers which you associate with each element or other node type to define how it is to be processed. The controller can also be serially reused, but should not be used to process more than one document at a time. The Controller needs to know about the RuleManager to find the relevant node handlers to invoke. If keys are used it will need to know about the KeyManager, and if decimal formats are used it will need to know about the DecimalFormatManager. These classes can be registered with the Controller using setRuleManager(), setKeyManager(), and setDecimalFormatManager() respectively. If preview mode is used, the PreviewManager will need to know about the Controller, so it has a setController() method for this purpose.

Element handlers are called to process the start and end tags of the element. They can choose whether or not subsidiary elements should be processed (by calling applyTemplates()), and can dive off into a completely different part of the document tree before resuming.

Handlers for other kinds of node (character content, attributes) are only called once for each node, to process the relevant content. These will not normally select further nodes for processing, since these nodes have no children: but they can do so if they wish, because as in XSL, the applyTemplates() method can navigate in any direction.

A node handler can write to the current output destination. The controller maintains a stack of outputters. Your node handler can switch output to a new destination by calling setOutputDetails(), and can revert to the previous destination by calling resetOutputDetails(). This is useful both for splitting an input XML document into multiple XML documents, and for creating output fragments that can be reassembled in a different order for display. Details of the output format required must be set up in an OutputDetails object, which is supplied as a parameter to setOutputDetails(). The actual control of output destinations rests with a class called the OutputManager, but you will normally interact with this via wrapper methods in the Controller.
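The behaviour of a stack of output destinations can be modelled as follows. This is a minimal sketch, not the SAXON implementation: the class OutputStackSketch and the methods pushDestination() and popDestination() are hypothetical stand-ins for the roles played by setOutputDetails() and resetOutputDetails(), and a StringBuilder stands in for a real output destination.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of a stack of output destinations; not a SAXON class.
class OutputStackSketch {

    private final Deque<StringBuilder> stack = new ArrayDeque<>();

    OutputStackSketch(StringBuilder initial) {
        stack.push(initial);
    }

    // Switch output to a new destination (the role of setOutputDetails()).
    void pushDestination(StringBuilder destination) {
        stack.push(destination);
    }

    // Revert to the previous destination (the role of resetOutputDetails()).
    void popDestination() {
        if (stack.size() == 1) {
            throw new IllegalStateException("no previous destination to revert to");
        }
        stack.pop();
    }

    // Write to whichever destination is current.
    void write(String text) {
        stack.peek().append(text);
    }

    static String[] demo() {
        StringBuilder mainDoc = new StringBuilder();
        StringBuilder chapterDoc = new StringBuilder();
        OutputStackSketch out = new OutputStackSketch(mainDoc);
        out.write("<toc>");
        out.pushDestination(chapterDoc);      // output for this subtree goes elsewhere
        out.write("<chapter>text</chapter>");
        out.popDestination();                 // back to the enclosing destination
        out.write("</toc>");
        return new String[] { mainDoc.toString(), chapterDoc.toString() };
    }

    public static void main(String[] args) {
        String[] result = demo();
        System.out.println(result[0]);
        System.out.println(result[1]);
    }
}
```

Because the destinations are stacked, a handler that redirects output for one element does not need to know where its parent was writing.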

Choosing a SAX Parser

SAXON provides a layer of services on top of a SAX-compliant XML parser. It will work with any Java-based XML parser that implements the SAX1 or SAX2 interface.

SAXON uses the configuration file ParserManager.properties to decide which SAX parser to use. This file identifies a default parser and a list of alternatives. As issued, it lists some popular (and free) SAX-compliant parsers which have been tested with SAXON. The default is James Clark's xp parser. If you do nothing, SAXON will search your CLASSPATH to see if any of these known parsers are installed. If you want to specify a different parser, or change the default, simply edit the ParserManager.properties file.

At this release SAXON no longer uses or supports the DOM. It builds its own tree structure internally for performance reasons.

The SAXON package includes a copy of the Ælfred parser, so there is no need to download a parser separately.

The Controller and Builder

A simple application proceeds as follows:

Create an instance of the RuleManager.
Define a number of node handlers using the setHandler() method of the RuleManager class.
Create an instance of the Controller.
Register the node handlers with the Controller using the setRuleManager() method.
Create an instance of the Builder.
Supply an input XML document to build a document tree using the build() method of the Builder class. This returns a DocumentInfo object.
Supply this DocumentInfo object to the run() method of the Controller class. This will start processing at the root node, calling your node handlers as appropriate.

Standard Handlers

You can use the standard node handlers supplied with SAXON directly (via the setHandler() method), or you can subclass them to create your own.

Some of the standard handlers are:

ELEMENT HANDLERS
ElementHandlerBase	This element handler does nothing with this element, other than calling applyTemplates() to ensure that its children are processed.
ElementSuppressor	This element handler does nothing with this element: it does not even call applyTemplates(), so the element's children are not processed.
ElementCopier	This element handler copies the element (tags and attributes, but not child elements) to the current output Writer. Special characters are escaped using XML rules. This handler is useful when you are doing an XML-to-XML transformation; you can also subclass it with an element handler that only processes selected events. You will normally use it in conjunction with ContentCopier to copy the character content.
ItemRenderer	This element handler replaces the start and end tags of the element with character strings that you specify yourself. Special characters are escaped using rules that work both for XML and for HTML output. This handler is useful when you are doing an XML-to-HTML transformation: specify as your prefix and suffix the HTML tags you want to output. There is also a method setItemRendition() which provides a convenient shortcut for invoking this handler.
GroupHandlerBase	This element handler detects elements that are the first and last in a group of consecutive elements of the same type, and calls methods beforeGroup() and afterGroup() accordingly. This handler is useful as a superclass for user-written handlers that need to process a number of consecutive elements as a group.
GroupRenderer	This is a subclass of GroupHandlerBase. It allows you to specify strings that will be output at the start of a group of consecutive elements, between elements of the group, and at the end of the entire group. This handler is useful when you are doing an XML-to-HTML transformation when you want to generate a list or table. It is also useful in XML-to-XML transformations if you want to generate an extra level of structure. There is also a method setGroupRendition() which provides a convenient shortcut for invoking this handler.
NumberHandler	This element handler allows you to add sequential numbers to elements. The numbers are added as the value of a pseudo-attribute called "#". You can specify a base element that causes numbering to restart: for example LISTITEM elements might be numbered using LIST as the base element. If you don't specify a base element, the root element of the document acts as the implicit base element. This handler is provided mainly for backwards compatibility: there are methods getNodeNumber() and getNodeNumberAny() which usually achieve the required effect more conveniently.
MultiHandler	This element handler accepts as parameters to its constructor two further element handlers. When the MultiHandler is called, it calls these two handlers in turn. By chaining together MultiHandlers, you can invoke any number of element handlers to process each element. The handlers are invoked "outwards-in". That is, the first handler registered is called before the second in the case of the startElement() interface; the first handler is called after the second in the case of endElement().
TEXT NODE HANDLERS
ContentCopier	This text node handler copies all character content to the current output writer, escaping special characters using the usual HTML/XML conventions.
ContentSuppressor	This text node handler does nothing with the character content encountered.
ElementToAttributeConverter	This text node handler allows you to treat the character content of an element as if it were the value of an attribute. This will often be an attribute of the parent element, but it can be any ancestor element, or even the element holding the content itself. The attribute is available as soon as the element's end tag has been read. It is useful when an element contains a sequence of child elements which typically appear once each, but in any order. (If a child element appears more than once, the contents of multiple children are appended to each other, using a separator which you can define.) In general it is useful only where the elements have pure PCDATA content (not element or mixed content).
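The "outwards-in" calling order described for MultiHandler can be illustrated with a small self-contained model. The Handler interface and the chain() method below are hypothetical and much simpler than the real ElementHandler interface; the sketch only shows the order in which chained handlers are called.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of handler chaining; these types are not part of SAXON.
class ChainedHandlerSketch {

    interface Handler {
        void start(List<String> log);
        void end(List<String> log);
    }

    // A handler that simply records when it is called.
    static Handler named(String name) {
        return new Handler() {
            public void start(List<String> log) { log.add(name + ".start"); }
            public void end(List<String> log) { log.add(name + ".end"); }
        };
    }

    // Combines two handlers: first before second at the start, second before first at the end.
    static Handler chain(Handler first, Handler second) {
        return new Handler() {
            public void start(List<String> log) {
                first.start(log);
                second.start(log);
            }
            public void end(List<String> log) {
                second.end(log);
                first.end(log);
            }
        };
    }

    static List<String> demo() {
        List<String> log = new ArrayList<>();
        // Chaining chains gives any number of handlers for one element.
        Handler handler = chain(named("A"), chain(named("B"), named("C")));
        handler.start(log);
        handler.end(log);
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The calls nest like the tags themselves: the handler that opens first closes last.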

Writing an Element Handler

An element handler is one kind of NodeHandler. We focus here on handlers for elements rather than other kinds of node, because they are the most common.

User-written element handlers are written to implement the interface ElementHandler. Optionally, you can define them as subclasses of the system-supplied class ElementHandlerBase, an element handler that does nothing other than call applyTemplates() to process the element's children.

If you write your element handler by subclassing one of these supplied element handlers, then you only need to provide those methods that perform a different action from the default.

Always remember that if you want child elements to be processed recursively, your element handler must call the applyTemplates() method.

The element handler is supplied with an ElementInfo object, which provides information about the current element, and with a Context object that gives access to a range of standard services, such as an Outputter object, which includes a write() method to produce output.

Normally you will write one element handler for each type of element, but it is quite possible to use the same handler for several different elements. You can also write completely general-purpose handlers: the system-supplied NumberHandler which performs automatic section numbering is an example. You define which elements will be handled by each element handler using a pattern, exactly as in XSL.

You provide one method for each event associated with the selected element type. The two events notified are:

startElement()

This is called when the start of the element is encountered. The ElementInfo object passed gives you information about the element and its attributes. You can save information for later use if required, using one of several techniques:

The setAttribute() interface allows you to store a keyword=value pair as if it were an attribute encountered in the XML source. This technique is useful if you want to supply a default value that was omitted in the source; but the mechanism is completely general, and you can invent new attributes if you wish.
The setUserData() interface allows you to store an arbitrary object in the ElementInfo object. This is useful if you are building up an object model from the XML document, and you want to link XML elements to objects in your model. It also allows you to perform simple functions such as counting the length of character data encountered within an element.
You can save information in local variables within the element handler object: but take care not to do this if the same element handler might be used to process another element before the first one ends.
Finally, you can create XSL variables using the Context object. These variables are visible only within the current element handler, but the ability to reference them in XSL expressions gives added flexibility. For example, you can set up a variable which is then used in a filter in the expression passed to applyTemplates(), which thus controls which child nodes will be processed.

endElement()

This is called when the end of the element is encountered. You have access to the same ElementInfo structure as was used for the start of the corresponding element.

Writing a Text Node Handler

User-written text node handlers must be written to implement the interface CharacterHandler. This interface defines a single method, characters(), which takes a ContentInfo object as its only parameter.

The ContentInfo object can be used to obtain the character content (using the getValue() method), and to provide information about the context: the getParent() method returns the ElementInfo object describing the containing element. ContentInfo also provides a range of standard services such as a write() method to produce output.

You can register as many different character handlers as you like, each corresponding to a different pattern that the text node must satisfy. Usually, however, very few are needed. The default handlers behave exactly as in XSLT. If you want to change this behaviour, register a handler for the pattern "text()".

You can also register handlers to process attribute nodes and processing instructions, but these are far less common.

Patterns and Expressions

Patterns are used in the setHandler() interface to define which nodes a particular handler applies to. Expressions are used in the applyTemplates() interface to control which nodes are selected for processing. Patterns and expressions used in the SAXON Java API have exactly the same form as in XSLT.

The detailed rules for patterns can be found in patterns.html, and for expressions in expressions.html.

Expressions and Patterns are represented in the API by classes Expression and Pattern respectively. These include static methods to create an Expression or Pattern from a String. A few "convenience" methods also allow Expressions and Patterns to be supplied directly as Strings.

When you create an Expression or Pattern using the methods Expression.make() and Pattern.make() you may supply a StaticContext object. This object provides the information needed to interpret certain expressions and patterns: for example, it provides the ability to convert a namespace prefix within the expressions into a URI. In an XSL stylesheet, the StaticContext provides information the expression can get from the rest of the stylesheet; in a Java application, this is not available, so you must provide the context yourself. If you don't supply a StaticContext object, a default context is used: this will prevent you using context-dependent constructs such as variables and namespace prefixes.

The ElementInfo Object

The ElementInfo object represents an element node of the XML document (that is, a construct with a start and end tag).

The main purpose of the ElementInfo object is to provide element handlers with information about the element. The most commonly-used methods include:

getName()	get the name of the element, as a Name object. You can use the Name object to get the local part of the name, the prefix, or the URI of the namespace.
getAttribute()	get the value of a specified attribute, as a String.
getInheritedAttribute()	get the value of an inherited attribute, as a String. This is useful for attributes that are implicitly inherited by child elements from their parent elements (for example the xml:space attribute). It returns the first value of the attribute found on any enclosing element, starting the search at the current element.
getParentNode()	get the ElementInfo of the parent element, or the DocumentInfo object if this is the outermost element. Note that if you frequently need to take different action depending on the parent element type, it may be better to provide separate handlers for the element depending on the context it appears in.
getAncestor()	get the nearest ancestor matching a given pattern.
getDocumentElement()	get the outermost element of the document (not the Document node).
getPreviousSibling()	get the ElementInfo for the previous element at the same level.
getNextSibling()	get the ElementInfo for the next element at the same level.
setUserData(), getUserData()	These methods allow you to save information in your startElement() handler which will be available later in other handlers such as endElement(), and also when processing character data or child elements occurring within this element.
setAttribute()	As an alternative to setUserData(), this allows you to save information associated with the current element for later use. The information can be retrieved later using getAttribute() exactly as if it were present in the original XML source.
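The lookup rule described for getInheritedAttribute() (start at the current element and take the first value found on any enclosing element) can be sketched using the standard DOM API. This is an illustration of the rule only, not SAXON's implementation: the class InheritedAttributeSketch is hypothetical, and returning null when no enclosing element carries the attribute is an assumption of the sketch.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical DOM-based illustration; not part of SAXON.
class InheritedAttributeSketch {

    static final String XML =
            "<doc xml:space=\"preserve\"><sec><p/></sec><q xml:space=\"default\"/></doc>";

    // Walks from the element up through its ancestors and returns the first value found.
    // Returning null when nothing is found is an assumption of this sketch.
    static String inheritedAttribute(Element element, String name) {
        for (Node node = element; node instanceof Element; node = node.getParentNode()) {
            Element candidate = (Element) node;
            if (candidate.hasAttribute(name)) {
                return candidate.getAttribute(name);
            }
        }
        return null;
    }

    // Convenience for the demo: looks up an attribute for the first element with the given name.
    static String lookup(String elementName, String attributeName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            Element element = (Element) doc.getElementsByTagName(elementName).item(0);
            return inheritedAttribute(element, attributeName);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(lookup("p", "xml:space")); // value found on the enclosing <doc>
        System.out.println(lookup("q", "xml:space")); // value found on <q> itself
    }
}
```

The search stops at the nearest element that carries the attribute, so a value set on an inner element overrides one set further out.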

Michael H. Kay
14 December 1999