Monday, October 10, 2011

Choosing between SAX and DOM

The single biggest factor in deciding whether to code your programs with SAX or DOM is programmer preference. SAX and DOM are very different APIs: where SAX models the parser, DOM models the XML document. Most programmers find the DOM approach more to their taste, at least initially. Its pull model, in which the client program extracts the information it wants from a document by invoking various methods on that document, is much more familiar than SAX’s push model, in which the parser tells you what it reads when it reads it, whether you’re ready for that information or not.
However, SAX’s push model, unfamiliar as it is, can be much more efficient. SAX programs can be much faster than their DOM equivalents, and almost always use far less memory. In particular, SAX works extremely well when documents are streamed and the individual parts of each document can be processed in isolation from other parts. If complicated processes can be broken down into serial filters, then SAX is hard to beat. SAX lends itself to assembly-line-style automation where different stations perform small operations on just the parts of the document they have at hand right at that moment. By contrast, DOM is more like a factory where each worker operates only on an entire car. Every time a worker receives a new car off the line, they have to take the entire car apart to find the piece they need to work with, do their job, then put the car back together again before moving it along to the next worker. This system is not very efficient if there’s more than one station. DOM lends itself to monolithic applications where one program does everything. SAX works better when the program can be divided into small bits of independent work.
In particular, the following characteristics indicate that a program should probably be using a streaming API such as SAX, XNI, or XMLPULL:
  • Documents will not fit into available memory. This is the only rule that really mandates one or the other. If your documents are too big for available memory, then you must use a streaming API such as SAX, painful though it may be. You really have no other choice.
  • You can process the document in small contiguous chunks of input. The entire document does not need to be available before you can do useful work.
    A slightly weaker variant of this is if the decisions you make only depend on preceding parts of the document, never on what comes later.
  • Processing can be divided up into a chain of successive operations.
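As a rough illustration of the first two characteristics, the sketch below counts elements with a SAX handler while keeping nothing in memory except a counter. The class name ElementCounter and the sample element names are invented for this example, and it obtains its parser through JAXP's SAXParserFactory rather than any particular parser.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: counts start tags with a given name.
public class ElementCounter extends DefaultHandler {

    private final String target;
    private int count;

    public ElementCounter(String target) {
        this.target = target;
    }

    // The parser pushes each start tag here as soon as it has read it.
    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if (qName.equals(target)) {
            count++;
        }
    }

    public int getCount() {
        return count;
    }

    // Convenience wrapper that parses a string; a real program would
    // more likely hand the parser a file or network stream.
    public static int count(String xml, String elementName) {
        ElementCounter handler = new ElementCounter(elementName);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return handler.getCount();
    }
}
```

Calling ElementCounter.count("<order><item/><item/><note/></order>", "item") should return 2. The handler never sees more than one event at a time, so memory use stays flat however long the document is.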
However, if the problem matches this next set of characteristics, the program should probably be using DOM or perhaps another of the tree-based APIs such as JDOM:
  • The program needs to access widely separated parts of the document at the same time. Even more so, it needs access to multiple documents at the same time.
  • The internal data structures are almost as complicated as the document itself.
  • The program must modify the document repeatedly.
  • The program must store the document for a significant amount of time through many method calls, not just process it once and forget it.
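To make the tree-based case concrete, here is a small sketch in which each item element refers to a price element that appears later in the document, and the program writes the resolved price back into the tree. The class name ForwardReferenceResolver and the item, price, ref, and id names are assumptions made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example class: resolves references to elements
// that occur later in the same document.
public class ForwardReferenceResolver {

    // Build the whole tree in memory before doing any work.
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Copy each referenced price onto the item that points at it.
    public static void resolvePrices(Document doc) {
        // First visit: collect every price, wherever it sits in the tree.
        Map<String, String> prices = new HashMap<>();
        NodeList priceNodes = doc.getElementsByTagName("price");
        for (int i = 0; i < priceNodes.getLength(); i++) {
            Element price = (Element) priceNodes.item(i);
            prices.put(price.getAttribute("id"), price.getTextContent());
        }
        // Second visit: modify the items in place.
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String value = prices.get(item.getAttribute("ref"));
            if (value != null) {
                item.setAttribute("price", value);
            }
        }
    }
}
```

A SAX handler would reach each item before the price it needs, so it would have to buffer the items itself. With the tree already in memory, the program can simply visit the document twice and edit it as it goes.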
On occasion, it’s possible to use both SAX and DOM. In particular, you can parse the document using a SAX XMLReader attached to a series of SAX filters, then use the final output from that process to construct a DOM Document. Working in reverse, you can traverse a DOM tree while firing off SAX events to a SAX ContentHandler.
The approach is the same one Example 9.14 used earlier to serialize a DOM Document onto a stream: use JAXP to perform an identity transform from a source to a result. JAXP supports SAX, DOM, and streams as both sources and results. For example, this code fragment reads an XML document from the InputStream in and parses it with the SAX XMLReader named saxParser. Then it transforms this input into the equivalent DOMResult, from which the DOM Document is extracted.
// Parse with SAX, then copy the events into a DOM tree via an identity transform.
XMLReader saxParser = XMLReaderFactory.createXMLReader();
// SAXSource takes an org.xml.sax.InputSource, so wrap the InputStream.
Source input = new SAXSource(saxParser, new InputSource(in));
DOMResult output = new DOMResult();
TransformerFactory xformFactory 
 = TransformerFactory.newInstance();
Transformer idTransform = xformFactory.newTransformer();
idTransform.transform(input, output);
// The result object, not the transformer, holds the finished tree.
Node document = output.getNode();
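For readers who want something they can compile, the following sketch puts one SAX filter in front of the parser and then builds a DOM Document from the filtered events, along the lines described above. The SaxToDom and UpperCaseFilter names are hypothetical, the filter's behavior (upper-casing text) is chosen only because it is easy to observe, and the parser comes from SAXParserFactory as one possible way to obtain an XMLReader.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical example class: SAX parser -> SAX filter -> DOM tree.
public class SaxToDom {

    // Illustrative filter: upper-cases character data and passes
    // every other event through unchanged.
    static class UpperCaseFilter extends XMLFilterImpl {
        UpperCaseFilter(XMLReader parent) {
            super(parent);
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            char[] upper = new String(ch, start, length)
                .toUpperCase(Locale.ROOT).toCharArray();
            super.characters(upper, 0, upper.length);
        }
    }

    public static Document toDom(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader parser = factory.newSAXParser().getXMLReader();
            // A filter is itself an XMLReader, so filters can be chained.
            XMLReader filtered = new UpperCaseFilter(parser);

            Source input = new SAXSource(
                filtered, new InputSource(new StringReader(xml)));
            DOMResult output = new DOMResult();
            // A Transformer created without a stylesheet copies its input.
            TransformerFactory.newInstance().newTransformer()
                .transform(input, output);
            return (Document) output.getNode();
        } catch (ParserConfigurationException | SAXException
                 | TransformerException e) {
            throw new IllegalArgumentException("Could not build DOM", e);
        }
    }
}
```

Because each filter wraps another XMLReader, additional stations can be added to the chain by nesting constructors, and only the last one needs to be handed to the SAXSource.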
To go in the other direction, from DOM to SAX, just use a DOMSource and a SAXResult. The DOMSource is constructed from a DOM Document object, and the SAXResult is configured with a ContentHandler:
Source input = new DOMSource(document);
ContentHandler handler = new MyContentHandler();
Result output = new SAXResult(handler);
TransformerFactory xformFactory 
 = TransformerFactory.newInstance();
Transformer idTransform = xformFactory.newTransformer();
idTransform.transform(input, output);
// Nothing to extract afterward: the handler has already received every event.
The transform will walk the DOM tree firing off events to the SAX ContentHandler.
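A self-contained version of the DOM-to-SAX direction might look like the sketch below. The DomToSax class and its NameCollector handler are hypothetical stand-ins for whatever ContentHandler a real program would supply; the handler here just records element names so the effect of the walk is visible.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.sax.SAXResult;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: replays a DOM tree as SAX events.
public class DomToSax {

    // Illustrative handler: remembers each element name it is told about.
    static class NameCollector extends DefaultHandler {
        final List<String> names = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            names.add(qName);
        }
    }

    // Helper so the example has a DOM tree to start from.
    public static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Identity-transform the tree into the handler and report what it saw.
    public static List<String> elementNames(Document doc) {
        NameCollector handler = new NameCollector();
        try {
            TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new SAXResult(handler));
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not walk DOM tree", e);
        }
        return handler.names;
    }
}
```

The events arrive in document order, so for a tree parsed from "<a><b/><c><b/></c></a>" the collected names are a, b, c, b.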
Although TrAX (the transformation API within JAXP) is the most standard, parser-independent means of passing documents back and forth between SAX and DOM, many implementations of these APIs also provide their own utility classes for crossing the border between the APIs. For example, GNU JAXP has the gnu.xml.pipeline.DomConsumer class for building DOM Document objects from SAX event streams and the gnu.xml.util.DomParser class for feeding a DOM Document into a SAX program. The Oracle XML Parser for Java provides the oracle.xml.parser.v2.DocumentBuilder class, a SAX ContentHandler/LexicalHandler/DeclHandler that builds a DOM Document from a SAX XMLReader.
