Content microarchitecture: How I learned to love HTML part 2

I posted recently on an unfinished series by Daniel Jacobson, which was perhaps slightly unfair, so I thought I should write a followup to his final part.

My argument was mainly that storing flat, unstructured data is not enough for most content projects, and that the difficult questions of structure need to be addressed. Daniel’s third part addresses how they do this. In fact, when I first looked through the NPR content after reading the first two articles, I could not find any content that had inline HTML. That was clearly an accident and I should have read more, as inline HTML is used.

The actual NPR process is interesting. In particular it shows the amount of care and attention in curating content that is needed to keep it reusable, repurposable and valuable. Semantic content requires that you understand the range of meanings it encodes and how to work with them, transform them, and quality control them. And that means you need to know every tag and attribute that is going into the system, and what it means for every output you are, or may be, using.

One of the outputs that many people forget is plain text, and NPR is very clear on the writing style that is necessary when writing for both HTML and plain text. Everything must make sense in the text form; links should be additional information that adds to the text, not something needed to understand it. And of course no “click here”. Text output for other devices may fall anywhere between the expressiveness of HTML and that of plain text. Text that reads sanely also helps screen readers and other assistive technologies make the content understandable.
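As a rough sketch of how this rule could be checked automatically, the following Python strips a fragment down to its plain text and flags links whose text says nothing on its own. This is my own illustration, not anything NPR has described: the names `PlainTextChecker`, `check_fragment` and the `VAGUE_LINK_TEXT` list are all hypothetical, and a real check would need a house list of phrases.

```python
from html.parser import HTMLParser

# Hypothetical list of link phrases that carry no meaning out of context.
VAGUE_LINK_TEXT = {"click here", "here", "this link", "read more"}


class PlainTextChecker(HTMLParser):
    """Collects the plain-text form of a fragment and the text of each link."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []       # every piece of character data, in order
        self.link_texts = []       # the text found inside each <a> element
        self._current_link = None  # buffer for the link currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_link = []

    def handle_endtag(self, tag):
        if tag == "a" and self._current_link is not None:
            self.link_texts.append("".join(self._current_link).strip())
            self._current_link = None

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._current_link is not None:
            self._current_link.append(data)


def check_fragment(html):
    """Return (plain_text, vague_links) for an HTML fragment."""
    parser = PlainTextChecker()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the text reads as a single line.
    plain = " ".join("".join(parser.text_parts).split())
    vague = [t for t in parser.link_texts if t.lower() in VAGUE_LINK_TEXT]
    return plain, vague


if __name__ == "__main__":
    print(check_fragment('<p>For the schedule, <a href="/s">click here</a>.</p>'))
    print(check_fragment('<p>The <a href="/s">broadcast schedule</a> is online.</p>'))
```

The first fragment is flagged because its link text is “click here”; the second reads sensibly with the markup removed, which is the property that matters for text-only outputs.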

The key points here are that content markup must be

  1. Valid. Processing is likely to be inaccurate without valid content, and tools will be more limited in how they process it, or will fail unexpectedly. Best fix this at the beginning of the pipeline.
  2. Meaningful. You need the markup to mean what the author intended, so look at interface usability and training.
  3. Accepted. You do not have to accept all valid XHTML, say. For a start, XML is an extensible language! You can choose a whole range of markup for a story, from the very minimal, to marking up each person and place involved, or more.
  4. Stored. The marked up text must be stored; the NPR decomposition is plain text plus normalized markup, which may work for some systems; storing marked up HTML with output transforms may work out better for others.
  5. Processed. You need to handle each kind of markup for all output mechanisms, so new markup needs to be introduced in a controlled way, although this should not be difficult. Changing markup is something that may need to happen.
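To make a few of these points concrete, here is a minimal Python sketch of one way the “accepted”, “stored” and “processed” steps might fit together. It is an illustration under my own assumptions, not NPR’s actual schema: the `ACCEPTED` whitelist, the span dictionaries and the function names `decompose` and `render_html` are all hypothetical. Markup is checked against a small whitelist, stored as plain text plus a list of character-offset spans, and rebuilt into HTML on output. The balanced-tag check is only a very basic stand-in for real validation.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical house whitelist: tag name -> attributes allowed on that tag.
ACCEPTED = {"a": {"href"}, "em": set(), "strong": set()}


class Decomposer(HTMLParser):
    """Splits a fragment into plain text plus offset-based markup spans."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = []    # plain text pieces
        self.length = 0   # number of characters of plain text seen so far
        self.spans = []   # one dict per element, in opening order
        self.open = []    # stack of spans that are not yet closed

    def handle_starttag(self, tag, attrs):
        if tag not in ACCEPTED:
            raise ValueError(f"tag not accepted: <{tag}>")
        for name, _ in attrs:
            if name not in ACCEPTED[tag]:
                raise ValueError(f"attribute not accepted: {name} on <{tag}>")
        span = {"tag": tag, "attrs": dict(attrs), "start": self.length, "end": None}
        self.spans.append(span)
        self.open.append(span)

    def handle_endtag(self, tag):
        if not self.open or self.open[-1]["tag"] != tag:
            raise ValueError(f"unbalanced </{tag}>")
        self.open.pop()["end"] = self.length

    def handle_data(self, data):
        self.text.append(data)
        self.length += len(data)


def decompose(fragment):
    """Return (plain_text, spans); raise ValueError on unaccepted markup."""
    parser = Decomposer()
    parser.feed(fragment)
    parser.close()
    if parser.open:
        raise ValueError(f"unclosed <{parser.open[-1]['tag']}>")
    # Empty elements mark up no text, so they are dropped.
    spans = [s for s in parser.spans if s["end"] > s["start"]]
    return "".join(parser.text), spans


def render_html(text, spans):
    """Rebuild HTML from plain text and spans (one possible output transform)."""
    opens, closes = {}, {}
    # Outer spans open first and close last; the sort is stable, so spans
    # with identical offsets keep their original (outer-first) order.
    for s in sorted(spans, key=lambda s: (s["start"], -s["end"])):
        attr_str = ""
        for name, value in s["attrs"].items():
            attr_str += ' {}="{}"'.format(name, escape(value or ""))
        opens.setdefault(s["start"], []).append(f"<{s['tag']}{attr_str}>")
        closes.setdefault(s["end"], []).insert(0, f"</{s['tag']}>")
    out = []
    for i in range(len(text) + 1):
        out.extend(closes.get(i, []))
        out.extend(opens.get(i, []))
        if i < len(text):
            out.append(escape(text[i], quote=False))
    return "".join(out)


if __name__ == "__main__":
    src = 'Read the <a href="/schedule"><em>broadcast</em> schedule</a> today.'
    text, spans = decompose(src)
    print(text)                      # the plain text output needs no transform
    print(render_html(text, spans))  # the HTML output is rebuilt from the spans
```

With this layout the plain text output comes for free, while each new output format needs its own renderer that knows every accepted tag. That is the trade-off in the fourth point: a system that only ever publishes HTML may be better off storing the marked up HTML and transforming it on the way out.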

I think the third part is the interesting one. Information architecture for websites often stops at the content level and misses this information microarchitecture of the textual content itself. That leaves authors without enough guidance to build a consistent structure that maximises the long-term value of the content.

2 Trackbacks

  1. By Sorry, it’s too late to COPE « CMSish on December 4, 2009 at 13:22

    [...] on the content storage strategy they’ve adopted for the U.S National Public Radio (via Justin Cormack). The basic  idea is to ensure hardcore content portability by very cleanly splitting out [...]

  2. By paradox1x on January 18, 2010 at 20:58

    More from Daniel Jacobson on NPR’s content management ecosystem…

    Programmable Web: Daniel Jacobson: “Content Portability: Building an API is Not Enough” Previous entries in the series: Programmable Web: Daniel……

One Comment

  1. Justin, Nice breakout on the five points of handling markup in text. NPR’s philosophy of COPE (coupled with our strategy for distribution) suggests that we should be storing our markup in a normalized way. But as you point out in your fourth point, this decision may not be appropriate for other systems. Making that decision requires a sound review of the business goals as well as an assessment of the technical architecture. In some cases, implementing a normalized markup model will be very difficult and, if the distribution strategy is limited, that extra work would not realize dividends.

    As you emphasized, the decision to go one way or the other should be a calculated one. My article hopefully demonstrates that there are other approaches and opportunities other than (what seems to be) the default approach of adding plug-ins that strip out the markup on distribution rather than storage.

    Posted November 24, 2009 at 21:22 | Permalink
