This week’s topic: Spaces, pretty printing, and XML

We occasionally get comments or questions along these lines:

Your XML is very difficult to read (no pretty indenting)
Your file has spaces (white space characters)
Your XML has line returns that show up weirdly in my tool

Before we can get to the meat of what's happening, we need to address the implicit assumption that XML should look pretty, with indents that let you see the structure in a plain-text (ASCII) view.

The assumption is that the text inside an XML file should look like this:

Example #1
     <name>
          <firstname>Wally</firstname>
          <lastname>Wallpaper</lastname>
     </name>

Example #2
     <topic>
          <title>Welcome to my topic</title>
          <shortdescription>Maybe there's a little something happening
             right here. This is unplanned it really just happens. Do a
             little painting with us. Just make little strokes like that.
             That is when you can experience true joy, when you have no fear. 
         </shortdescription>
     </topic>

In both of these examples, the open tag is lined up horizontally with the close tag and the child elements (or text values) are indented and aligned as you go farther down into the structure.

While this may be easy on human eyes, it can be problematic for computer applications.
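
To make that concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module (chosen only for illustration; the idea is not specific to this parser). It parses Example #1 twice, once with no indentation and once indented, and collects the character data the parser hands back from between the tags. The function and variable names are made up for this example.

```python
import xml.etree.ElementTree as ET

def between_tags(xml_string):
    """Return the character data found between the root's child tags."""
    root = ET.fromstring(xml_string)
    # root.text is what precedes the first child; child.tail follows each child.
    return [root.text] + [child.tail for child in root]

compact = "<name><firstname>Wally</firstname><lastname>Wallpaper</lastname></name>"

indented = (
    "<name>\n"
    "     <firstname>Wally</firstname>\n"
    "     <lastname>Wallpaper</lastname>\n"
    "</name>"
)

print(between_tags(compact))   # nothing between the tags
print(between_tags(indented))  # line feeds and spaces show up as data
```

The indentation does not disappear when the file is parsed. It arrives as extra character data, and every application reading the file has to decide what to do with it.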

Yes, XML is designed to be a human-readable markup language, but that does not mean it has to be well-formatted. It is readable in the sense that it is plain text, not some proprietary binary.

XML must be well-formed (for example, every open tag must have a matching close tag) and, when it is checked against a DTD or schema, valid (conforming to the vocabulary). There is, however, absolutely no requirement for it to be well-formatted. The XML specification defines no notion of indentation or layout as a quality of a document; it is concerned with how information is represented, not with how the file looks.
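
The difference between well-formed and well-formatted can be sketched in a few lines of Python with the standard xml.etree.ElementTree module. This is only an illustration: is_well_formed is a hypothetical helper, and it checks well-formedness only, not validity against a schema.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_string):
    """Return True if the parser accepts the string as well-formed XML."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False

# No indentation at all, but every open tag has its close tag.
ugly = "<name><firstname>Wally</firstname><lastname>Wallpaper</lastname></name>"

# Nicely indented, but the </firstname> close tag is missing.
pretty_but_broken = (
    "<name>\n"
    "     <firstname>Wally\n"
    "     <lastname>Wallpaper</lastname>\n"
    "</name>"
)

print(is_well_formed(ugly))               # True
print(is_well_formed(pretty_but_broken))  # False
```

The parser accepts the one-line version and rejects the nicely indented one. Layout plays no part in the decision.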

From G. Ken Holman's book:

The XML Recommendation describes behavior required of an XML processor, how it must process an XML stream and identify constituent data, and what information it must provide to an application. An XML document is only a labeled hierarchy of information...information representation only, not information presentation or processing.

White space characters (space, tab, carriage return, line feed) are generally treated as insignificant unless they are specifically called out as "significant to the data" through one of several XML mechanisms, such as the xml:space attribute. Formatting inside the file is not one of those mechanisms.
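
One such mechanism is the xml:space attribute. As a rough sketch of how an application might honor it, the Python below collapses white space by default and leaves it alone where xml:space="preserve" is set. The element_text helper and the code element are hypothetical, and the sketch only looks at the element itself; a fuller version would also honor a value inherited from an ancestor.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml: prefix under this namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def element_text(elem):
    """Return the element's text, collapsing white space unless preserved."""
    raw = "".join(elem.itertext())
    if elem.get(XML_SPACE, "default") == "preserve":
        return raw                  # white space is significant: keep as-is
    return " ".join(raw.split())    # otherwise collapse runs to one space

doc = ET.fromstring(
    "<topic>"
    "<title>  Welcome to   my topic </title>"
    '<code xml:space="preserve">a  =  1\n    b = 2</code>'
    "</topic>"
)

title_text = element_text(doc.find("title"))
code_text = element_text(doc.find("code"))
print(repr(title_text))
print(repr(code_text))
```

The significance of the white space comes from the markup, not from how the file happens to be laid out.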

Most XML files that hold content for human consumption are not themselves meant to be read as XML. Rather, they are written in XML in order to produce another format, like PDF or HTML: a human-readable content product, such as a document, intended for human reading rather than machine processing.

XML files for content product creation can be long (pages and pages of content) or short. They are complex (including many types of entities surrounded by semantic markup) and are intended for downstream processing into a bigger consolidated content product, be it a single PDF, multiple PDFs, a set of packaged HTML (HTML Help, EPUB), or any number of other formats (CHM, etc.).

Because of this, most XML authoring tools aimed at professional and technical content authors do not require you to read past the tags to see the content. In addition, most provide a tree view, so you can see the structure rather than having to depend on indentation in the ASCII, tag-cluttered view to see it.
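
A tree view does not need any indentation in the file, because the structure is already in the tags. The following Python sketch (a toy illustration, not how any particular authoring tool is implemented; outline is a hypothetical helper) builds an indented outline from a single unformatted line of XML.

```python
import xml.etree.ElementTree as ET

def outline(elem, depth=0):
    """Return one line per element, indented by its depth in the tree."""
    lines = ["  " * depth + elem.tag]
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines

# Example #2 squeezed onto one line, with a shortened description.
compact = (
    "<topic><title>Welcome to my topic</title>"
    "<shortdescription>Do a little painting with us.</shortdescription></topic>"
)

tree_view = "\n".join(outline(ET.fromstring(compact)))
print(tree_view)
```

The indentation here is generated for display only. Nothing is written back into the XML file.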

Furthermore, they don't bother pretty printing the XML to the file, for two reasons:

  1. downstream processing (generally) doesn't care
  2. pretty printing can cause issues downstream for tools that strictly interpret the XML spec
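
As an illustration of the second point, the sketch below pretty prints a small piece of inline markup with Python's standard xml.dom.minidom module and then compares the text content before and after. The sample markup and the text_content helper are invented for the example; other pretty printers may behave differently.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

def text_content(xml_string):
    """Concatenate all character data in document order."""
    return "".join(ET.fromstring(xml_string).itertext())

# Two inline elements with no space between them.
compact = "<p><b>Hello</b><i>world</i></p>"

# toprettyxml puts each child element on its own indented line.
pretty = minidom.parseString(compact).toprettyxml(indent="  ")

print(repr(text_content(compact)))  # the words run together, as authored
print(repr(text_content(pretty)))   # line feeds and spaces now separate them
```

The pretty printed file is still well-formed, but its text content is no longer the same. A downstream tool that keeps or collapses that white space will now put a gap between two words the author wrote as one run.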

If you have a tool that is incorrectly responding to the horizontal or vertical spacing, or lack thereof, in an XML file, you have a non-compliant tool. Introducing or removing space characters in the file for your own comfort isn't why we use XML. So, when you find yourself formatting XML so it looks better, remember that you're potentially creating problems downstream, and adjust your comfort zone. Understand your schema and how to use it, and let the tools help you.

P.S. If you want to read more about the nitty-gritty details of how XML processors work and why they work this way, I highly recommend this article from the Mozilla Developer docs: "How whitespace is handled by HTML, CSS, and in the DOM."

Key Concepts:

basics, techcomm tools, xml authoring

Filed under:

Common Questions