JSON vs XML

Many web developers hate XML with a passion, and JSON has taken its place as the format of choice for a lot of API work. JSON has real advantages but also disadvantages, and XML certainly has problems, but the arguments are not as simple as they are generally made out to be.

I have been looking at the problem of writing filters for these formats that change documents as little as possible. This is a slightly difficult field in many ways: you have to store a fair amount of extra information to do it, because if you are, say, changing a few metadata items for a user, they may not want their CDATA sections removed and replaced with a semantically identical but syntactically different form. So I am looking at the formats partly from this point of view. Sometimes this brings up the conflict between the human-readable and machine-manipulation aspects of these formats, between their logical and physical structures. I have also been looking at making simple tools to allow modification of document formats, which has raised further issues.

Pro JSON

JSON is simple and now well defined. It used not to be clear in a few places, such as how to encode characters outside the Basic Multilingual Plane using the \u notation (surrogate pairs). Many people still manage to generate invalid JSON (no quotes around identifiers, single quotes instead of double, a BOM at the start), which is a problem. I mean, if people cannot even get possibly the simplest format ever invented right, what hope is there for civilization? This should improve as standard libraries that do the right thing become widespread, one would hope. Introducing lax JSON parsers (in the style of HTML5) seems unnecessary for a simple format that is normally generated by a computer. A strict JSON parser is not a lot of code.
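
As a small illustration of how little tolerance a strict parser needs (Python's built-in json module is used here purely as an example, the post is not tied to any language), all three of the common mistakes above are rejected outright:

    import json

    # Three common mistakes that a strict parser rejects outright.
    bad_documents = [
        '{name: 1}',             # unquoted identifier as a key
        "{'name': 1}",           # single quotes instead of double quotes
        '\ufeff{"name": 1}',     # byte order mark before the document
    ]

    for doc in bad_documents:
        try:
            json.loads(doc)
        except json.JSONDecodeError as err:
            print("rejected:", err)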

JSON has a simple way of showing which of the allowable Unicode encodings a document is in, based on the pattern of zero bytes at the start, since the first two characters must be ASCII. Allowing a BOM, and allowing the full set of Unicode whitespace, might be more standard, but the whitespace has no function beyond readability in text editors.
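
A rough sketch of the detection logic this implies, following RFC 4627 (again, Python only for illustration):

    def detect_json_encoding(data: bytes) -> str:
        # Because the first two characters of a JSON text must be ASCII,
        # the pattern of zero bytes in the first four bytes identifies
        # the encoding (RFC 4627, section 3).
        head = data[:4]
        if head.startswith(b"\x00\x00\x00"):
            return "utf-32-be"          # 00 00 00 xx
        if len(head) == 4 and head[1:] == b"\x00\x00\x00":
            return "utf-32-le"          # xx 00 00 00
        if head.startswith(b"\x00"):
            return "utf-16-be"          # 00 xx 00 xx
        if len(head) >= 2 and head[1] == 0:
            return "utf-16-le"          # xx 00 xx 00
        return "utf-8"                  # no zero bytes

    print(detect_json_encoding('{"a": 1}'.encode("utf-16-le")))   # utf-16-le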

Despite the attempts by, for example, E4X to add a native XML syntax and processing model to JavaScript, JSON remains much easier to process in most languages, as it is built around structures that most languages have natively, while XML is not. Some languages have a mismatch with parts of JSON, but most are fine. E4X, on the other hand, has security issues on the client side and is not seeing adoption outside some server-side applications.
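
The point about native structures is easy to see side by side; a sketch with made-up documents:

    import json
    import xml.etree.ElementTree as ET

    # JSON parses straight into the language's own dicts, lists and strings.
    order = json.loads('{"id": 42, "items": ["apple", "pear"]}')
    print(order["items"][0])                    # -> apple

    # The XML equivalent goes through a tree API rather than native structures.
    doc = ET.fromstring('<order id="42"><item>apple</item><item>pear</item></order>')
    print(doc.findall("item")[0].text)          # -> apple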

Against JSON

JSON does not have a native hyperlink type. This is unacceptable in a web format, in my opinion. For example, a REST interface (a real one, with HATEOAS, not the clean URLs that many people who use JSON think are REST) requires native links and link types. A native link convention such as {"next": "http://example.com/next"} would solve a lot of issues, and would be compatible. JSON Schema is trying to add schemas that can extend the type system, but having to have a schema just to understand links seems like overkill to me.
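
As a sketch of what such a convention buys you (the field name "next" and the paged collection API are hypothetical, not from any real specification):

    import json
    import urllib.request

    def fetch_all_items(url):
        """Page through a hypothetical collection by following "next" links,
        in the hypermedia style described above: the client needs no
        out-of-band schema, only the link convention."""
        while url:
            with urllib.request.urlopen(url) as resp:
                page = json.loads(resp.read().decode("utf-8"))
            yield from page.get("items", [])
            url = page.get("next")      # absent on the last page

    # for item in fetch_all_items("http://example.com/collection"):
    #     print(item)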

JSON really ought to specify that Unicode strings are normalized. I would guess most of them are in practice, but the lack of a requirement means you should normalize before doing comparisons on keys, for example.
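
For example, two keys that render identically can still compare unequal until they are normalized; a small sketch:

    import json
    import unicodedata

    def normalize_keys(obj):
        """Recursively NFC-normalize object keys, since the JSON spec does
        not require strings to be in any particular normal form."""
        if isinstance(obj, dict):
            return {unicodedata.normalize("NFC", k): normalize_keys(v)
                    for k, v in obj.items()}
        if isinstance(obj, list):
            return [normalize_keys(v) for v in obj]
        return obj

    a = json.loads('{"caf\\u00e9": 1}')       # precomposed e-acute
    b = json.loads('{"cafe\\u0301": 1}')      # e plus combining accent
    print(a == b)                             # False
    print(normalize_keys(a) == normalize_keys(b))   # True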

There are some syntactically different representations of the same data: arbitrary whitespace (although this does not include the full Unicode whitespace definition), and backslash-escaped characters, which can mostly be represented directly in the Unicode encoding of the document. Whitespace clearly needs to be preserved for readability and for line-oriented editors and tools. It is less clear how inconvenient it would be if \u escapes were normalized to the characters they represent, which is the sane default. I think there were some tools that did not support Unicode, although it is mandated by the standard; it is perhaps odd that an ASCII encoding is not an option, but that seems unlikely to be important. In fact, preserving the exact syntax used is not difficult in many applications because, unlike some cases in XML, it does not involve much additional state.
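
This is exactly what a naive load-and-dump filter loses; a small illustration:

    import json

    original = '{ "city":\t"\\u004corem" }'     # escaped "L", odd whitespace
    value = json.loads(original)
    round_tripped = json.dumps(value)

    print(round_tripped)                # {"city": "Lorem"}
    print(round_tripped == original)    # False: same data, different syntax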

Although JSON was designed to serialize data structures, its small set of types is limiting. We are not going to get around this for now, though, as programming languages still have such different ideas of types. Most uses of JSON have an implicit schema, which is both a strength and a weakness. Most early implementations were tightly coupled, for example in Ajax applications. Now we are seeing more APIs exposed to the world using JSON, and these have more need for a schema. I tend to prefer the idea of HATEOAS, the REST constraint of hypertext as the interface, over published schemas in the SOAP WSDL style, and JSON seems more inclined to move towards the latter. Especially if people use JSON-RPC; I thought people had given up on the RPC style on the web, but it appears not.

Data model

The JSON data model is simpler than XML's, but this is less clear-cut as a differentiator than it sounds. XML nodes have both attributes and children; JSON values have one or the other, if you consider the object to model an attribute set and the array to model an ordered list of children. This difference is not hard to work around, and the actual requirements are very domain specific.
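
One possible (and deliberately lossy) mapping between the two models, just to make the correspondence concrete; this is only a sketch, and many other conventions exist:

    import xml.etree.ElementTree as ET

    def element_to_json(elem):
        """Map an XML element onto JSON-style structures: attributes become
        an object (named, unordered), child elements a list (unnamed, ordered)."""
        return {
            "tag": elem.tag,
            "attributes": dict(elem.attrib),
            "children": [element_to_json(child) for child in elem],
            "text": (elem.text or "").strip(),
        }

    doc = ET.fromstring('<book id="1"><title>Dune</title><author>Herbert</author></book>')
    print(element_to_json(doc))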

Bill de hOra pointed out “you should add field cardinality to the distinctions – json needs to change structure [], xml needs just another element” which is a very good point.
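
A sketch of that cardinality point, with made-up documents: going from one value to two changes the JSON structure, while the XML simply gains another element:

    import json

    # One author: consumers read a plain string.
    v1 = json.loads('{"author": "Herbert"}')

    # Two authors: the value changes type from string to array, so code
    # written against v1 (expecting a str) breaks.
    v2 = json.loads('{"author": ["Herbert", "Anderson"]}')
    print(type(v1["author"]).__name__, type(v2["author"]).__name__)   # str list

    # The XML equivalent just repeats the element; a consumer that already
    # iterates over <author> elements keeps working:
    #   v1: <book><author>Herbert</author></book>
    #   v2: <book><author>Herbert</author><author>Anderson</author></book>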

Schemas

Schemas for validation are great; validation is an important activity. In general, though, it is complicated: rules such as "this field must be filled in if that one is not", and so on. A validation schema might need to be very complicated, although many are very simple. Having a choice of languages to express these constraints in seems to me to be a good thing. The XML DTD is too weak, and should not have been included in the language, as discussed below. Some constraints are computationally complex and need very expressive languages.

The second function of a schema is interpretation. This can relate to validation, in that a field must be readable as a number, say, and we are also going to read it as a number. But it is a different requirement: in many cases it is about object modelling and code generation, where a validated structure is then mapped to a native-language object. These are conceptually separate processes; a number may be constrained to be between 3 and 5 for domain reasons, while its representation in, say, Java may be an integer, though it need not be. Of course, the validation stage here is essential for security, to stop overflows and type errors, but the two are different activities and may have different schemas.
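
A minimal sketch of that separation, using the made-up 3-to-5 range from above as the domain rule:

    def validate_rating(raw):
        """Validation: enforce the document-level constraint, here the
        made-up rule that a rating is a number between 3 and 5."""
        value = int(raw)                       # also guards against type errors
        if not 3 <= value <= 5:
            raise ValueError(f"rating out of range: {value}")
        return value

    def interpret_rating(value):
        """Interpretation: map the validated value onto a native
        representation; a plain int here, but it need not be."""
        return value

    rating = interpret_rating(validate_rating("4"))
    print(rating)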

Against both

Binary data is a big problem for both. We will continue to need plenty of other formats for anything that carries binary data; they are just so much more efficient, even after compression. So the idea of a single universal format is not going to happen.

Against XML

XML is weak on unordered items. Most of the structure in an XML document is in the child relations, and these are ordered. This is used as a criticism, but I am not sure it is that reasonable: attributes are unordered, and as noted above there is an equivalence with the two structures provided by JSON (named, unordered items and unnamed, ordered ones) which seems natural.

Pro XML

It was pointed out by Erik Wilde that I had missed out the pro-XML section. This was an accident; I am actually very pro XML in many ways. First, it has enough structure that we can build rich data structures, and on top of that it has standard vocabularies (such as XHTML) with rich sets of attributes, elements and link relations which can be reused in a variety of domains. The other big thing is the set of extraction and transformation tools, which are generally well designed and fairly complete, and both stream and DOM parsers are widely available.
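
Both parsing styles in a few lines, as a sketch (Python's standard library stands in for the wider toolset):

    import io
    import xml.etree.ElementTree as ET

    xml_doc = b"<feed><entry>one</entry><entry>two</entry></feed>"

    # DOM style: the whole tree in memory, with random access and queries.
    tree = ET.fromstring(xml_doc)
    print([entry.text for entry in tree.findall("entry")])

    # Stream style: events as the document is read, constant memory.
    for event, elem in ET.iterparse(io.BytesIO(xml_doc), events=("end",)):
        if elem.tag == "entry":
            print(elem.text)
            elem.clear()    # free the subtree once it has been handled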

Against XML DTDs

The DTD, which XML inherited from SGML, is an anomaly in many ways. First, it has a non-XML syntax, so we need another set of parsers and tools to work with it. It also has several functions that really need to be separated. The first is as a schema for validating documents against. Unfortunately it is not a very good schema language, as the constraints it can apply to documents are limited. We now have, for example, XML Schema and RELAX NG, which are better schema languages, but the DTD has a special position in the specification that is difficult to drop.

In addition to being a schema, the DTD can also define default values for attributes, which the application sees just as if they were in the document. This is the kind of thing that makes preserving the textual form difficult, as there is a syntactic but not a semantic difference between an attribute written out explicitly and one supplied by default. I also do not think this is used much, as real defaults tend to be implied by the processing model rather than the document. It is at least easy to remove the feature from documents, simply by writing all the implied defaults out explicitly.

There are security problems arising from entity parsing (entity-expansion attacks such as "billion laughs", for example), which is why some parsers disable DTD processing altogether. SOAP, for example, does not support DTDs. This is of course non-conforming, but clearly a good idea in many situations.
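
A sketch of that non-conforming-but-sensible stance: defusedxml (a third-party Python library, not mentioned in the post, used here only as an illustration) simply refuses documents that declare entities:

    import defusedxml.ElementTree as SafeET   # third-party: pip install defusedxml

    # A classic entity-expansion document; a conforming parser would expand
    # &lol2; into ten copies of "lol", and exponentially more with extra levels.
    hostile = """<?xml version="1.0"?>
    <!DOCTYPE lolz [
      <!ENTITY lol "lol">
      <!ENTITY lol2 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
    ]>
    <lolz>&lol2;</lolz>"""

    try:
        SafeET.fromstring(hostile)
    except Exception as err:                  # defusedxml raises EntitiesForbidden
        print("rejected:", type(err).__name__)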

DTDs are not namespace aware, which makes them unusable with many namespaced documents. Another reason to deprecate them.

Against XML entities

Then there are entities. My reading of the original spec is that entities were designed to save typing, but I do not think they are used for much beyond memorable encodings of characters outside the ASCII set. The thing about that use case is that it is perfectly alright to substitute the values in, as they never change. Whereas if I create my own arbitrary entity in a DTD for the name of something, it may be because I want to use it like a search-and-replace function, substituting whatever I want later. That, in my opinion, is not really appropriate at the document-format level; it is an application-level tool, and the application should use regular XML tags for this kind of user-level structure.
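
Because named character entities never change, substituting them is a purely syntactic step; a small illustration (Python's html module happens to know the full XHTML/HTML named set as well as numeric references):

    import html

    # Named and numeric character references expand to fixed characters,
    # so a filter can always substitute them without changing the meaning.
    print(html.unescape("Tea &amp; caf&eacute; &#8211; &lt;open&gt;"))
    # -> Tea & café – <open>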

XML entities can also be used as an inclusion mechanism; again the DTD is not the place to define this. XInclude seems much better if this facility is needed.

Entities can contain other entities, markup and so on; recursion and unbalanced markup are not allowed. All of this adds enormously to parsing complexity, when the dominant use case is plain character data.

Against XML namespaces

I am not against XML namespaces per se, but there are pathological cases that make them very hard to process sanely. In particular, you can redefine the same prefix to refer to different URIs within the same document, and you can refer to the same URI with different prefixes. This effectively means that all processing needs to work with both the prefix and the full URI, which is exactly what the spec was trying to avoid, so it is pretty bad. The amount of state you need to keep a namespaced document textually identical after processing is very large. The nasty mess one tends to get from parsers for coping with namespaces is one measure of this; the complexity of XPath on namespaced documents, especially ones containing any of the pathological cases, is another.
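
A small illustration of why processing has to go through the full URI rather than the prefix (Python's ElementTree, which expands every name to {uri}local form, is used only as an example of how parsers cope):

    import xml.etree.ElementTree as ET

    # Two documents that mean the same thing but bind different prefixes.
    doc_a = '<r xmlns:xsd="http://www.w3.org/2001/XMLSchema"><xsd:string/></r>'
    doc_b = '<r xmlns:s="http://www.w3.org/2001/XMLSchema"><s:string/></r>'

    XSD = "{http://www.w3.org/2001/XMLSchema}"

    for doc in (doc_a, doc_b):
        root = ET.fromstring(doc)
        # Processing by full URI works for both; the original prefix is gone,
        # which is exactly the round-tripping problem described above.
        print(root[0].tag == XSD + "string")    # True, True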

The simple solutions seem to involve not allowing a prefix to be redefined to a different URI within the same document, or the converse; declaring all the namespaces that will be used on the root element is also an option. This means processing can be more or less namespace unaware, as xsd:type will mean the same thing regardless of context. It also matches standard usage, where a fairly small set of namespaces is used with prefixes that are conventional and remain constant across large sets of documents, so very little namespace-awareness complexity is needed.

Other issues

Mixed content, the role of CDATA, the significance of whitespace: these are all extremely complex issues that could be simplified.

Minimal XML proposals

XML: quite hard but worth it? For the applications I am interested in, I think simplification is needed. The first issue is that security and simplicity are related. Anything web-facing will have hostile documents thrown at it, and more constraint helps, in a way that the document-processing industry does not see as much of an issue.

There was a time, ten years or so ago, when minimal XML proposals were fashionable. XML itself was of course an attempt at a minimal SGML profile, but not enough was cut or changed, and too much compatibility was kept. Common XML seems the most reasonable of the proposals to me, and addresses many of the issues. XML tools do not get used in the way that was perhaps envisaged, and making things simpler and easier, evolving them, will make them more robust. JSON shows that the demand for simplicity is there, and XML will suffer if it does not answer it.

The first thing is to drop the DTD. It serves no real function now that we have alternative schema languages for XML. More radically, I think we can drop entities too, other than the predefined ones needed for escaping (amp, lt, gt, quot, apos) and numeric character references, which are again purely syntactic. The only real case for requiring named entities is XHTML, but it barely exists now, and those entities could be special-cased there without difficulty, as their values will never change and they do not contain markup or anything else that causes parsing problems. Arguably these named entities could be added to the XML spec for all documents, turned into a purely syntactic feature. I am not aware of any other XML usage of entities; there may be a few, I suppose.

For namespaces, there needs to be a solution that maps syntax to semantics, so that an attribute or element name has the same meaning throughout the document. Renaming in different scopes makes global transformations, comparisons and simple processing too hard; it breaks even simple search and replace, which has to become namespace aware.

Data versus applications

Part of the conflict is over whether XML is an application format or a data format. Some of the pieces that cause problems, like entities, really belong to an application format, for a class of applications that work according to the model in the minds of the XML designers, which in turn was based on real SGML applications. But data formats are winning. We now want to attach additional semantics to data through standard mechanisms, such as link relations and RDF, not by expanding the storage format. Complexity in a data format does not add to the richness that can be expressed; simple, uniform mechanisms do that. And simplicity is going to win: linked data over Microsoft Word-style application data formats.

What will happen?

I actually think these changes are, informally, happening. DTDs and entities are not used much now; they may be in some publishing applications, especially those based on SGML, but the web document architecture does not use them significantly. Namespaces are generally used in a particular, constrained way. HTML5 has shown what the logic of human readability and writeability implies, which is a non-XML language. The great advantage of XML is the variety of ways in which it can be processed, but issues such as security against hostile documents, parsing complexity, performance and ease of processing matter a lot, and despite many weaknesses JSON is showing the way of radical simplicity. A simplified XML would be no more complex than JSON, I think, and would have the advantages of richer tool support and widespread use. Most of the XML in the wild and in APIs is very simple; the sorts of XML that are embedded in other documents as metadata are simple too. Security concerns are limiting processing, and the traditional publishing applications that historically used more of the functionality could change too, although more slowly. Will simplicity win, and will JSON replace XML? I think not, because so much XML is in use, but I do think a specification of an XML subset is needed to stabilise the situation.



6 Comments

  1. Peter

    What about dropping the (imvho artificial) distinction between attributes and elements (as has been done in JSON)? To my way of thinking attributes are simply a shorthand notation for a single-valued element, so while there are arguably some benefits in allowing both for human comprehensibility, from the processing model's perspective there's an argument that they should be treated identically (i.e. with elements as the fundamental construct, since they're a strict superset of attributes).

    Posted January 27, 2010 at 23:04 | Permalink
  2. @peter: many have proposed to get rid of attributes. for data-oriented people, this makes a lot of sense. for document-oriented people, this is just inconceivable. try to think of an HTML without attributes; not very pretty, right? plus, some essential XML mechanics (most notably, namespaces) depend on attributes, and i don’t even want to think what namespaces would look like if they had to be stuffed into elements. it’s nice that some technologies (such as RELAX NG) try to mitigate the distinction where appropriate, but these things are different in many ways, and XML without attributes probably would be different enough from XML so that it wouldn’t be XML anymore.

    Posted February 7, 2010 at 02:19 | Permalink
  3. Peter

    @dret: I'm not proposing getting rid of attributes from the literal XML text, just making them syntactic sugar for single-valued elements in the processing model (i.e. so that my code doesn't have to look like a dog's breakfast whenever I have to deal with XML that contains both data-as-attribute-values and data-as-element-values).

    More generally, if Justin’s post is intended to foster discussion on what an “XML++” might look like, then I think a new kind of XML “that wouldn’t be XML anymore” is absolutely on the table – his namespace simplification straw man being a good example.

    Posted February 9, 2010 at 22:36 | Permalink
  4. Barry

    Is anyone worried that JSON is very tightly bound to JavaScript? How do you provide web services to both servers and browsers?

    Posted April 30, 2010 at 03:55 | Permalink
  5. Eric Bloch

    The reason JSON may “win” is because JavaScript is the language in the browser and, even though it has issues, it doesn't appear that we'll have another choice in the HTML5 timeframe. Having literals for data (as opposed to DOM) is a HUGE benefit. And, in the end, that is a big reason why having JSON in the server simplifies things. Less conversion is a good thing.

    I agree XML will be here for a long time, too. But it will become a second-class citizen over time as a data encoding choice for some languages, tools, and so on.

    Posted May 7, 2010 at 03:50 | Permalink
  6. Robert

    JSON is for data; XML is for documents.

    It really doesn’t have to be any harder than that.

    Posted June 9, 2011 at 17:02 | Permalink
