Last Sunday Peter Monks and I were discussing Jon Marks’ upcoming J. Boye presentation when Peter linked to a pair of articles he had been reading by Daniel Jacobson: Create once and publish anywhere and Content modularity: more than just data normalization. As it happened I had been thinking a lot about normalization that day.
Now Daniel has clearly done an excellent job of analyzing and modelling the NPR content, a job that is vital to do well, and has chosen the simplest solution that works for it. That simplicity has a lot of benefits, because simplicity cascades in a lovely way. My argument today, though, is that for most people this type of model is too simple: the general problem domain means that people need to embrace a more complicated solution for their content modelling, and that means learning to love HTML.
One way to start looking at the problem is to take your content and consider the case of inline hypertext links. No hypertext is the simple case: you can solve the problem by forbidding inline linking in body text, which is what NPR chose, and which I will discuss below. With hypertext, the challenges are to either:
- Rewrite links to what will be the output format’s reference
- Rewrite links to a canonical output medium
- Strip links
The choice may depend on the medium: your mobile site wants internal links to point to itself, but output to email might link to the main site. This violates the clean layering diagram, as the output is no longer a pure content object. The first option is easier in some ways, as it is not much different from handling the non-inline links that a model usually has to do anyway, although NPR actually store all the link URLs in the content store. The difficulties come when you want to output a subset of a site, say to mobile, and have to deal with potential dangling links. The second option involves asking another output processor what its URL would be for a content item, which again violates the clean layering diagram. Some media, print for example, require the third option. Making that work is a question of writing style: making sure that the hyperlinks flow with the text and are not required for sense. Yes, that means not allowing "to see this story click here", but that is how writing for the web should be written anyway, readable both with and without the hyperlinks.
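The three options above can be sketched in a few lines. This is a minimal illustration, not a production rewriter: the hostnames are hypothetical, and a real system would use a proper HTML parser rather than a regex over the body text.

```python
# Sketch of per-medium handling of inline links in stored body text.
# The example.org hostnames are invented; regex handling of HTML is
# only acceptable here because this is a toy illustration.
import re

def rewrite_links(html: str, medium: str) -> str:
    """Apply a per-output-medium policy to inline links in a fragment."""
    if medium == "mobile":
        # Option 1: rewrite internal links to the mobile site's own URLs.
        return html.replace('href="https://www.example.org/',
                            'href="https://m.example.org/')
    if medium == "email":
        # Option 2: point links at a canonical medium (the main site);
        # in this sketch links already target it, so nothing changes.
        return html
    if medium == "print":
        # Option 3: strip links but keep the anchor text, relying on the
        # writing style described above so the prose still reads.
        return re.sub(r'<a\b[^>]*>(.*?)</a>', r'\1', html, flags=re.S)
    return html

body = '<p>See the <a href="https://www.example.org/story/42">earlier story</a>.</p>'
print(rewrite_links(body, "print"))
# -> <p>See the earlier story.</p>
```

Note that the mobile case is where dangling links appear: a replaced URL may point at a story that was never published to the mobile subset, which is exactly the difficulty described above.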
This messiness is important though, and needs to be done well to not impinge on the API layer more than necessary, and to keep the caching working well. Adding this to the cleaner model allows the use of free hypertext, the foundation of the web.
Daniel goes on to say:
> But to truly separate content from display, the content repository needs to also avoid storing “dirty” content. Dirty content is content that contains any presentation layer information embedded in it, including HTML, XML, character encodings, microformats, and any other markup or rich formatting information.
Now I half agree with this. You cannot avoid character encodings, although mixing them is not usually advisable. But, as with the hyperlinks, he seems to be recommending content without any inline structure: just text, with anything else normalized out into separate structures.
Now you can normalize and flatten everything into low-level structures, rather than, say, XML or HTML documents, and reform those structures into XML for output. But one of the points of the web project is that well written semantic (X)HTML or XML is a neutral semantic data format, transformable into other forms if necessary, and it does make a good base format. Otherwise you are just normalizing into a database structure instead.
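The transformability claim is easy to demonstrate: well-formed semantic markup can be walked and re-emitted in any other form. In this small sketch (the `<span class="person">` convention is a hypothetical microformat, not anything NPR or Daniel uses) one stored fragment yields both plain text and structured data.

```python
# Sketch: one well-formed semantic fragment, two derived outputs.
# The "person" span class is an invented microformat for illustration.
import xml.etree.ElementTree as ET

fragment = '<p>Interview with <span class="person">Daniel Jacobson</span> of NPR.</p>'
root = ET.fromstring(fragment)

# Output 1: flatten to plain text, e.g. for a print or audio pipeline.
plain = "".join(root.itertext())

# Output 2: extract the marked-up names as structured data.
people = [span.text for span in root.iter("span") if span.get("class") == "person"]

print(plain)   # -> Interview with Daniel Jacobson of NPR.
print(people)  # -> ['Daniel Jacobson']
```

The point is that nothing here is presentational: the markup carries meaning, and each output layer takes what it needs.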
The NPR data model has the following structure:
> We then attach “resources” to the story, each of which is its own object in the database (examples of resources include full text with each paragraph stored as distinct records, audio, video, images, related links, and a range of other object types)
So no structure within a paragraph. No microformats, as we cannot mark up a name or a place. No hyperlinks in running text, no equations or formulas. No rich text. Just boxouts and diagrams between paragraphs. It works, it can be very good, and it is easier to implement, but it is not as rich or as excellent as truly structured, hierarchical, denormalized content can be.
Fundamentally, the structure of the story in the NPR model is flat. There is hierarchy above the story, but it is flat within it. For some types of application, perhaps like NPR's journalistic style, this can work. I have worked with clients who do not want hypertext, just boxed lists of links, often as a stylistic choice. It has a certain formality, and the unbroken text has a traditional look, more Britannica than Wikipedia. The NPR model is extreme, though: as far as I have seen it does not even allow a word to be emphasised. Emphasis, however, is not presentation layer information; it is semantic information that can be conveyed in pretty much every form of human communication.
Maybe it is because even for print I have often worked with richly substructured texts, or maybe just because I appreciate the browsability of well structured hypertext, but flat text is not my preference.
Once you give in to structure within the paragraph level, the normalization question becomes more interesting. You can still normalize items below the paragraph level, but it becomes less satisfactory: divided into increasingly small units, the objects start to lose context and stop being reusable on their own. There are still important normalization issues, as a structured document format does not mean that everything should be embedded into it; items that are reusable within a document should be nested by reference, so they can be reused elsewhere. You end up with a tree of components for a document rather than the flat NPR structure of a list of components. That is just a few extra levels of structure, as normal articles are not very heavily tree structured, and just a bit more processing to turn these into items for the API and content output layers.
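Nesting by reference can be sketched with a toy content store: a document is a small tree of component ids, and output processing resolves the references recursively. The store layout and field names here are invented for illustration.

```python
# Sketch of a component tree nested by reference, rather than a flat
# list of resources. The store, ids, and field names are hypothetical.
STORE = {
    "doc-1":  {"type": "article", "children": ["para-1", "box-1"]},
    "para-1": {"type": "paragraph", "html": "<p>Running text with <em>emphasis</em>.</p>"},
    "box-1":  {"type": "boxout", "children": ["para-2"]},
    "para-2": {"type": "paragraph", "html": "<p>A reusable boxout paragraph.</p>"},
}

def resolve(item_id: str) -> dict:
    """Expand an item and, recursively, any components it references."""
    item = dict(STORE[item_id])  # copy so the store stays reference-only
    if "children" in item:
        item["children"] = [resolve(child) for child in item["children"]]
    return item

tree = resolve("doc-1")
print(tree["children"][1]["children"][0]["html"])
# -> <p>A reusable boxout paragraph.</p>
```

Because `box-1` and `para-2` are stored once and referenced, they can appear in other documents too; the extra processing over NPR's flat list is just this one recursive expansion.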
Daniel raises more issues:
> First, as an example, the image references within the block of text will contain HTML and possibly other markup, making the text block dirty. Any distribution to other platforms could then require special treatment to prepare the content for that destination. More importantly, however, is the fact that these same images are very difficult to repurpose because they are embedded in text. So, it would be quite a challenge to make a feed of images, to identify only those posts that contain images, to resize some or all images in the system, or to consistently restrict distribution of images that do not have the rights cleared.
Now some of these points are important. Journalists were traditionally encouraged to write as if there were no pictures, which always struck me as odd as a child (describing in 1000 words what is in the picture right next to the story), but content has always been repurposed, or reprinted, or read out. The sizing issues, though, are easily dealt with by linking to a class of images rather than a particular size, and letting the output process choose the size. And the system can understand the content types being used, such as HTML or XML, and, say, strip out an image feed, or answer questions about the images. HTML (and SGML and XML before it) was designed as one of the first presentation-independent, structured, semantic content description languages, initially for structured document applications, precisely to solve problems that simpler, flatter representations did not.
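Making a feed of images from stored HTML really is straightforward when the format is known and valid. A minimal sketch, using only the standard library, with an invented story fragment; note the `src` names a class of images with no size baked in, so the output process can choose the size:

```python
# Sketch: because the stored body format is known (valid HTML), the
# system can still answer questions about embedded images, e.g. to
# build an image feed or find stories containing images.
from html.parser import HTMLParser

class ImageCollector(HTMLParser):
    """Collect the attributes of every <img> in a fragment."""
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs))

def images_in(html: str) -> list:
    collector = ImageCollector()
    collector.feed(html)
    return collector.images

# The src is a size-free image class; output chooses the rendition.
story = '<p>Text.</p><img src="/images/storm" alt="Storm clouds"><p>More.</p>'
print(images_in(story))
# -> [{'src': '/images/storm', 'alt': 'Storm clouds'}]
```

Identifying posts that contain images is then just `images_in(body) != []`, and rights restrictions can key off the collected attributes.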
Of course not just any old HTML is a good choice. It needs to be valid, and there must be agreed restrictions. These could, but need not, be minimal, NPR-style: only paragraphs, block quotes and lists, UTF-8, no HTML entities. Or they could allow more structure while still forbidding presentational elements. This contract, defining what is acceptable because it is not presentation layer information and has transform rules to the output layers, must be strongly enforced. Authoring tools still need work; tools for structured, denormalized content in particular need more support.
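Enforcing such a contract can be as simple as a whitelist check at ingest time. A sketch follows; the particular allowed set is an assumption, since as the text says the restrictions are agreed per project.

```python
# Sketch of enforcing an agreed markup contract: a whitelist of
# structural elements, rejecting presentational ones. The exact
# ALLOWED set here is a hypothetical per-project agreement.
from html.parser import HTMLParser

ALLOWED = {"p", "blockquote", "ul", "ol", "li", "em", "strong", "a", "cite"}

class ContractChecker(HTMLParser):
    """Record every element that falls outside the agreed contract."""
    def __init__(self):
        super().__init__()
        self.violations = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            self.violations.append(tag)

def check(html: str) -> list:
    checker = ContractChecker()
    checker.feed(html)
    return checker.violations

print(check("<p>Fine, with <em>emphasis</em>.</p>"))  # -> []
print(check('<p><font color="red">No.</font></p>'))   # -> ['font']
```

A real checker would also validate nesting and attributes (for instance forbidding `style`), and would reject content at save time rather than just reporting; this only shows the shape of the contract.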
Overall then, web content management, and structured content management generally, needs to embrace the document types that the web has adopted. Native HTML and XML storage is not the wrong approach. Yes, there are issues and complexities that have to be addressed, and addressed really well, because done wrong they can hugely hinder flexible reuse and repurposing; but done right they enable a rich, expressive, hypertext, denormalized content world.