My argument was mainly that storing flat, unstructured data is not enough for most content projects, and that the difficult questions of structure need to be addressed. Daniel’s third part explains how NPR does this. When I first looked through the NPR content after reading the first two articles, I could not find any content with inline HTML; that was clearly an accident of my sampling and I should have read more, as inline HTML is in fact used.
The actual NPR process is interesting. In particular, it shows the amount of care and attention needed to curate content so that it stays reusable, repurposable and valuable. Semantic content requires that you understand the range of meanings it encodes and how to work with them, transform them, and quality-control them. And that means you need to know every tag and attribute that is going into the system and what it means for every output you are, or may be, using.
One output that many people forget is plain text, and NPR is very clear about the writing style needed when writing for both HTML and plain text. Everything must make sense in the text form; links should be additional information that enriches the text, not something required to understand it. And of course, no “click here”. Text output for other devices may vary between the expressiveness of HTML and that of plain text. Text that reads sanely also helps screen readers and other assistive technologies make the content understandable.
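The discipline above can be made concrete with a small sketch: reduce simple inline HTML to plain text, carrying link targets along as trailing references so the prose still stands on its own. This is a minimal illustration using Python’s standard-library `html.parser`, not NPR’s actual transform; the class name and reference format are my own invention.

```python
from html.parser import HTMLParser

class PlainTextExtractor(HTMLParser):
    """Reduce simple HTML to plain text, keeping link targets as
    trailing numbered references so the text makes sense alone.
    (Illustrative sketch, not NPR's pipeline.)"""

    def __init__(self):
        super().__init__()
        self.parts = []   # text fragments, in document order
        self.links = []   # hrefs encountered, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        body = "".join(self.parts)
        if self.links:
            refs = "\n".join(f"[{i}] {url}"
                             for i, url in enumerate(self.links, 1))
            return body + "\n\n" + refs
        return body

extractor = PlainTextExtractor()
extractor.feed('<p>NPR <a href="https://www.npr.org/">publishes</a> stories.</p>')
print(extractor.text())
```

Note that the body reads correctly even if the reference list is dropped entirely, which is exactly the property the writing guidance asks for.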
The key points here are that content markup must be:
- Valid. Processing is likely to be inaccurate without valid content, and tools will be more limited in how they process it, or will fail unexpectedly. It is best to fix this at the beginning of the pipeline.
- Meaningful. You need the markup to mean what the author intended, so look at interface usability and training.
- Accepted. You do not have to accept all valid XHTML, say. For a start, XML is an extensible language! You can choose a whole range of markup for a story, from the very minimal, to marking up each person and place involved, or more.
- Stored. The marked-up text must be stored; the NPR decomposition is plain text plus normalized markup, which may work for some systems; storing marked-up HTML without output transforms may work out better for others.
- Processed. You need to handle each kind of markup for all output mechanisms, so new kinds need to be introduced in a controlled way, although this should not be difficult. Changing existing markup is also something that may need to happen.
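The “valid” and “accepted” points can be sketched as a gate at the start of the pipeline: first check well-formedness, then check that only a chosen subset of markup appears. This is a toy illustration, assuming a hypothetical `<story>` wrapper element and an invented whitelist of accepted inline tags; a real system would validate against a proper schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelist of tags this system accepts (not NPR's real set)
ACCEPTED = {"p", "em", "strong", "a"}

def check_story(xml_text):
    """Gate a story at the start of the pipeline: reject anything
    that is not well-formed, then anything using unaccepted tags."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, f"not well-formed: {err}"
    rejected = {el.tag for el in root.iter()} - ACCEPTED - {"story"}
    if rejected:
        return False, f"unaccepted tags: {sorted(rejected)}"
    return True, "ok"

print(check_story("<story><p>Valid and <em>accepted</em>.</p></story>"))
print(check_story("<story><blink>valid but not accepted</blink></story>"))
```

Rejecting early like this is cheap; discovering an unhandled tag later, in one output transform out of many, is not.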
I think the third part is the most interesting one. Information architecture for websites often stops at the content level, missing the information microarchitecture of the textual content itself, and leaving it to authors without enough guidance to build a consistent structure that maximises long-term content value.