NoSQL and content management

I went to many of the talks in the first ever NoSQL devroom at FOSDEM this year. For anyone who hasn’t been, FOSDEM is a great event, and the NoSQL room was well organized and full of interest. The term NoSQL is not even a year old; from memory I first came across CouchDB around a year ago. Tim Anglade gave an excellent introduction in which he reminded people of the historical roots of the field, both before relational databases and since, so the ideas are not new, but there is a renewed focus now. Why is that? I am going to look here at the field of content management and why you might be interested in different data models if that is your problem space, based loosely on some of the ideas from the talks at FOSDEM. There was also a talk specifically about content management and the Lily CMS by Evert Arckens; I missed it, but I have added some comments after watching the video.

The data model for content management

I have another draft post on this subject in more detail, which I am working on as part of my REST modelling in content management work, but I will outline here some of the types of data relations that are important. I will be quite abstract; if you want more concrete examples you will have to wait for the other post. Database models like the ones we are talking about here are, I think, more easily understood in the abstract.

First we need our unit of modelling, and this in itself is the first issue. Content management tends to deal, at the conceptual level, with something that looks like a document. It may be a fragment, in the sense that it is, say, a page component (an asset if you use that terminology) rather than a whole item, but the unit which the user edits and which is usually versioned is itself a structured object. The processing model tends to treat it almost as a binary blob, except that certain properties can be extracted, such as metadata, links in HTML and so forth, but it is stored as an item rather than decomposed further.
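
To make that shape concrete, here is a minimal sketch in Python of a content item as a blob plus extracted properties. The class and field names are my own invention for illustration, not any particular repository’s API.

```python
from dataclasses import dataclass, field

@dataclass
class ContentItem:
    """A content item: an opaque blob plus properties extracted from it."""
    item_id: str
    blob: bytes                                       # the document itself, stored whole
    content_type: str = "text/html"
    properties: dict = field(default_factory=dict)    # extracted metadata
    links: list = field(default_factory=list)         # e.g. hrefs found in the HTML

# A page fragment: the blob is kept intact, only selected properties are pulled out.
banner = ContentItem(
    item_id="home/banner",
    blob=b"<div><a href='/sheep'>Sheep</a></div>",
    properties={"title": "Banner"},
    links=["/sheep"],
)
```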

OK, so we have a piece of content and some attributes extracted from it as one basic model. This corresponds pretty much to the JCR data model, for example. There are variations; sometimes people do not store metadata in the file formats themselves, as historically many file formats had poor support for arbitrary structured metadata, although that is largely obsolete now, and the advantages of actually storing metadata and relations substantially within documents are high. External storage does not change the model much, it just complicates processing and storage. Another variant, often seen in document management systems, is to allow multiple ‘streams’, i.e. several document variants rolled into one, for example a video and a still from it. From the modelling point of view, however, you can regard these as another compound document format, kept together because conceptually they are a bundle of content; you might distribute them as a zip file if you haven’t got any other suitable container format.

So now we have a storage model where we have a blob, with rich media operations on it, and extracted structural and metadata information. There is also versioning to consider, but let us ignore that and treat it either as part of the blob, or as a new document with some relation to the old ones; those are the two core versioning models, and the choice does not really affect anything else here.

There are two kinds of metadata, although they are more similar than they appear: properties and relations. Properties are the standard attributes (this picture depicts sheep), while relations join two items in the repository (this is a cropped version of this other picture). Although the distinction seems clear, in the end richer information architectures demand that everything becomes a relation, so that I can browse a sheep node and find all the sheep items; every attribute value of any significance turns into a node with relations instead. Pure attribute values are only left for the less interesting properties (this PDF file is 176k in size).
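
To see the distinction, and the promotion of a significant attribute value into a relation, here is a toy sketch using plain Python structures; all the identifiers are made up.

```python
# Properties: plain attribute values hung off an item.
properties = {
    "picture-42": {"depicts": "sheep", "size_kb": 176},
}

# Relations: edges joining two items in the repository.
relations = [
    ("picture-43", "cropped_from", "picture-42"),
]

# Promote the interesting attribute: 'sheep' becomes a node of its own,
# so it can be browsed and everything related to it found.
relations.append(("picture-42", "depicts", "subject:sheep"))

# Browse the sheep node: every item with an edge pointing at it.
print([src for (src, rel, dst) in relations if dst == "subject:sheep"])
# ['picture-42']
```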

They are also less interesting from a relational versus non-relational storage point of view, although there is one important question, dense versus sparse, so let us take a look at that. Most real world attributes are sparse, that is, most attributes are not set on most items. In the relational model we have a row for our item and columns for all the attributes, so we are saying most of them are NULL. (I was brought up on matrix algorithms and still think in terms of sparse versus dense matrices; this is exactly the same problem, and matrices represent graphs anyway.) Storing huge, mainly NULL tables is not very efficient, so there are two common practices in relational mapping of attributes in content management systems. The first is to define a type based system, where a particular type of content item is defined to have certain attributes (or at least fewer NULLs!), and each type therefore gets its own table, which is assumed to have fewer NULL values. Mixins, sets of properties that cut across types, can potentially be added to this model, as can inheritance schemes, but the basic idea is one table per type. This gives a nice, simple, direct database programming model, and causes a complete nightmare if you ever want to change the schema, for example to add an attribute, as for any large database most DBMSs will effectively shut down the system while a schema change takes place, since schema changes require pretty much all locks. Silverstripe is one example of a content management system built like this; there are many others.
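
A minimal sketch of the one-table-per-type mapping, using an in-memory SQLite database for brevity; the content types and columns are invented for illustration. The ALTER TABLE at the end is the step that hurts on a large production database, where it generally needs exclusive locks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table per content type, with columns for that type's attributes.
conn.execute("""CREATE TABLE article (
    id INTEGER PRIMARY KEY,
    title TEXT,
    body TEXT,
    published TEXT
)""")
conn.execute("""CREATE TABLE image (
    id INTEGER PRIMARY KEY,
    title TEXT,
    width INTEGER,
    height INTEGER
)""")

conn.execute("INSERT INTO article (title, body) VALUES (?, ?)",
             ("NoSQL and content management", "..."))

# Adding an attribute later means a schema change on a live table,
# which is where this model starts to hurt at scale.
conn.execute("ALTER TABLE article ADD COLUMN author TEXT")
```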

The alternative is the entity attribute value (EAV) model (terrible Wikipedia article, please fix), where rather than mapping attributes directly to relations, you map indirectly, creating a table that joins entities, attributes and values; this table of course looks just like RDF triples. Doing this, though, loses everything that makes a relational database useful: constraints, typing, query optimization. It adds an extra layer of logical schema above the physical schema which the database layer does not understand. This is a pretty common relational mapping for content management systems, as it allows full flexibility in defining and redefining attributes. To implement it well needs a large mid layer to manage the constraints, provide an API, generate efficient queries, and in effect manage the logical to physical layer map. The Drupal CCK is an example of this model.
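
The corresponding EAV sketch, again in SQLite with invented data: the single table is effectively a triple store, and even a simple query needs a self-join that the database cannot type or constrain for you.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table holds everything: (entity, attribute, value), effectively triples.
conn.execute("CREATE TABLE eav (entity TEXT, attribute TEXT, value TEXT)")

conn.executemany("INSERT INTO eav VALUES (?, ?, ?)", [
    ("picture-42", "type", "image"),
    ("picture-42", "depicts", "sheep"),
    ("article-7", "type", "article"),
    ("article-7", "title", "NoSQL and content management"),
])

# "Images depicting sheep" needs a self-join; the values are untyped strings
# and no constraints apply, so the mid layer has to enforce the logical schema.
print(conn.execute("""
    SELECT a.entity
    FROM eav a JOIN eav b ON a.entity = b.entity
    WHERE a.attribute = 'type'    AND a.value = 'image'
      AND b.attribute = 'depicts' AND b.value = 'sheep'
""").fetchall())
# [('picture-42',)]
```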

Of course this is not to say that the two relational models do not work. The direct mapping works well with simple, unchanging content types in small websites, for example, or in models where attributes are not very sparse, or the sparseness is worth the overhead, and changing the schema is rare. EAV works well too, if managed carefully; it helps if the types of queries required on the model are not too complex.

Once you add relations as well as attributes, the already difficult mapping layer gets harder; you add another set of operations (recursion to handle tree structures) that the relational model does not handle well, so you may need to add more into the mapping layer. The promise of NoSQL is that you can bypass this for these types of applications, and program directly against a database model that handles sparse attributes and relations natively. But how much do the NoSQL databases get you? You can argue that if you are already looking at EAV, then you are already not getting much from a relational database, and you are building a modelling layer on top of it, so dropping that and going for something that maps the logical data layer directly does make sense from a development point of view. Whether that really helps performance is less clear; much of the original work on NoSQL has come out of huge scaling problems, not out of providing efficient solutions to the types of data mapping problem we are seeing here at medium scale; of course for huge sites there may be benefits.

The types of NoSQL database vary in their level of support for attributes and relations as they are used in content management. Document oriented databases do not give you much more than retrieval of content items; associative ones give key value type attribute lookups; graph databases should, in principle, let you query relations directly, expressing the types of queries that are needed for information architecture problems. The examples I am thinking of are things like tag clouds, which are simple to express as a graph problem, as a tag cloud is simply a count of the number of edges from a set of nodes. Indeed most information architecture problems look like graph problems, and also like OLAP processing operations, which also do not work well on relational databases. And of course one of the things that NoSQL shares with OLAP is the use of denormalization; you can use simpler models if you denormalize data to match the queries you will be using, rather than assuming that the types of query you will use can necessarily be optimized and made efficient by a general purpose system.
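
To show how directly the tag cloud example maps onto edge counting, here is a toy sketch in plain Python, with no particular graph database assumed.

```python
from collections import Counter

# Tagging edges: (content item, tag node).
edges = [
    ("article-7", "tag:nosql"),
    ("article-7", "tag:cms"),
    ("article-8", "tag:nosql"),
    ("picture-42", "tag:sheep"),
]

# A tag cloud is just the number of edges arriving at each tag node.
cloud = Counter(tag for _item, tag in edges)
print(cloud.most_common())
# [('tag:nosql', 2), ('tag:cms', 1), ('tag:sheep', 1)]
```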

Denormalization is not without its difficulties, although arguably it could become a tool embedded in databases, as indexes are now. One of the issues with NoSQL is that most of the database systems leave denormalization to the user: you need to use it because joins are not available, but you have to manage it yourself. Building an infrastructure to explicitly manage denormalization as a first class database item, akin to an index, might be interesting. So that gives us a first issue: in any NoSQL system except a graph database we will either need to denormalize or compose queries to get the results we want.
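
To make “denormalization as a first class database item” slightly more concrete, here is a toy sketch in which a derived tag-count view is kept in step with every write, much as an index would be; this is my own illustration, not a description of any existing system.

```python
class TagStore:
    """Keeps a denormalized tag-count view in step with writes,
    the way a database maintains an index alongside the data."""

    def __init__(self):
        self.items = {}        # item_id -> set of tags (the source of truth)
        self.tag_counts = {}   # the denormalized view, updated on every write

    def tag(self, item_id, tag):
        tags = self.items.setdefault(item_id, set())
        if tag not in tags:
            tags.add(tag)
            self.tag_counts[tag] = self.tag_counts.get(tag, 0) + 1

    def untag(self, item_id, tag):
        tags = self.items.get(item_id, set())
        if tag in tags:
            tags.remove(tag)
            self.tag_counts[tag] -= 1

store = TagStore()
store.tag("article-7", "nosql")
store.tag("article-8", "nosql")
print(store.tag_counts)   # {'nosql': 2} -- answered with no query-time join or scan
```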

So I think there are four realistic models for content management backends going forward:

  1. The direct relational model for small systems with simple data models, rare attribute changes, and little or no use of relations.
  2. EAV models wrapped in a content modelling layer; JCR is an example of this, hiding the underlying SQL layer very well, and indeed potentially allowing it to be replaced with another underlying storage model; I am sure someone is testing a Neo4J backend somewhere. This is where most production solutions are now.
  3. Direct, non-denormalized graph database backends, with the raw content stored in a document store (a small sketch of this split follows the list). This cuts out a special purpose middle level by mapping the domain more directly. As Emil from Neo4j says, it may not scale right up as far as the other NoSQL technologies, but it cuts the complexity of implementation; there are also questions about whether all the kinds of queries required are available efficiently. I think this will be the sweet spot in a few years, once the products mature and we see more open source activity in the field. Of course RDF based solutions, for example using SPARQL, fall into this category too, and the maturity of products around these technologies will help drive this category as well as the NoSQL models.
  4. Big, denormalized systems, probably with software support for managing the denormalization, and using underlying simple but scalable technologies like key-value stores. These already exist in large scale web applications, but may remain niche if the development effort remains high. If frameworks for modelling more easily on these turn up, they may trickle down for performance reasons even on smaller datasets; a key value store runs fine on a relational database backend, although the types of processing required probably mean a specialized backend is useful.
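
As promised in option 3, here is a small sketch of the split between a document store acting as system of record (blobs and versions) and a graph of relations over the current versions, used only for queries. Plain Python dicts stand in for the two databases, and every name here is invented.

```python
documents = {}   # (item_id, version) -> raw content: the system of record
current = {}     # item_id -> latest version number
graph = []       # (source item, relation, target item), current versions only

def put(item_id, blob, relations=()):
    """Store a new version of an item and repoint its relations in the graph."""
    version = current.get(item_id, 0) + 1
    documents[(item_id, version)] = blob
    current[item_id] = version
    graph[:] = [e for e in graph if e[0] != item_id] + \
               [(item_id, rel, target) for rel, target in relations]

put("article-7", b"<p>Sheep farming</p>", [("depicts", "subject:sheep")])
put("article-7", b"<p>Sheep farming, updated</p>", [("depicts", "subject:sheep")])

# Queries go to the graph; the document store just serves blobs and history.
print([s for s, r, t in graph if t == "subject:sheep"])   # ['article-7']
print(current["article-7"])                               # 2
```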

Note that the Lily CMS, which the talk I mentioned earlier was about, fits very much into the fourth option above; this is where the NoSQL technologies have perhaps seen most use, but I think there is a lot of work involved in building a CMS like this now, in particular in terms of tools to support the denormalization strategies that are needed. The approach outlined sounded much like the one I have been thinking about for this type of model, although I would focus more on tooling for denormalized queries and less on scaling other parts such as full text search right now. It will be interesting to follow the progress of this project.

We are at an interesting juncture, where it looks like there are some options that will let us do domain modelling in a way that corresponds more directly to the domain, but there are a lot of interesting challenges on the way.

6 Comments

  1. I’ve been puzzling over this stuff a lot recently as well, and this is a better analysis of the pros and cons than I could have managed. The attractions of moving to a model that maps more closely to the content management domain are obvious. You hit the nail on the head with this line though: “the sweet spot in a few years once the products mature”.

    Posted February 15, 2010 at 14:40 | Permalink
  2. One CMS system worth mentioning here is Plone. It is built upon Zope, which has been doing NoSQL for over a decade now.

    Plone sort of fits between points 3 and 4 above in your article. The ZODB is a persistent object store which natively stores objects (i.e. no need for an ORM). It is transactional, has MVCC and can be easily clustered.

    Plone has a couple of content type systems. Archetypes is the older one with a richer set of widgets. You can draw up a whole information model in a UML editor and get it converted into a set of installable content types, complete with all relations and containment hierarchy.

    Dexterity is the new content type system and tackled the problem that CCK and most other systems have of being unable to easily switch between a web based point and click method for building types and creating them in code on the filesystem (so they can be versioned in your VCS).

    -Matt

    Posted February 22, 2010 at 08:01 | Permalink
  3. @Matt Hamilton: Thanks for reminding me – I had forgotten that Zope had done that while I was writing this. It has always been an interesting framework.

    Posted February 23, 2010 at 04:14 | Permalink
  4. Matt Browne

    Very interesting, thoughtful, and informative post – thank you.

    One question – when you say, “Direct, nondenormalized graph database backends, with the raw content stored in a document store,” I’m wondering what exactly would be stored in the graph database versus the document store.

    It seems that everything could be stored in the graph database, including large blobs like the body of an article…

    Are you suggesting that the “properties” (e.g. title, body, date published, etc.) would go in the document database and the “relations” (e.g. author, tags, etc.) would go in the graph database?

    It seems like a nice idea in principle (combining two database technologies to get the strengths of both) but wouldn’t it complicate things to have the data for the same entity split across two separate databases?

    It’s entirely possible that I’m misunderstanding you – in any case, I’d be really curious to hear your comments on this.

    Posted May 18, 2011 at 01:22 | Permalink
  5. My suggestion is that everything goes into the document database, including all the versioning information, and whatever is needed for queries goes into the graph database, which probably does not include large blobs (send these off to your full text search) or old versions. It might in some cases be a performance optimisation to have everything in the graph database though, just so you could answer queries without referring back to the document database at all.

    You could argue it is a complication, but the document database is very simple, and is mainly aimed at versioning and being the system of record, while the graph database answers queries on the current version, which seems a clean separation to me.

    Posted May 20, 2011 at 11:17 | Permalink
  6. Matt Browne

    @justin: Just wanted to say an overdue thank-you for the explanation – that definitely helps clarify things, and makes a lot of sense.

    Posted August 30, 2011 at 16:38 | Permalink
