<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Technology of Content &#187; data modelling</title>
	<atom:link href="http://blog.technologyofcontent.com/tag/data-modelling/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.technologyofcontent.com</link>
	<description>Ramblings on the technology of content management</description>
	<lastBuildDate>Sun, 29 Jan 2012 16:38:05 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Towards a comparison of content repositories</title>
		<link>http://blog.technologyofcontent.com/2010/09/towards-a-comparison-of-content-repositories/</link>
		<comments>http://blog.technologyofcontent.com/2010/09/towards-a-comparison-of-content-repositories/#comments</comments>
		<pubDate>Sun, 19 Sep 2010 11:57:07 +0000</pubDate>
		<dc:creator>justin</dc:creator>
				<category><![CDATA[CMS]]></category>
		<category><![CDATA[content]]></category>
		<category><![CDATA[data modelling]]></category>
		<category><![CDATA[jcr]]></category>
		<category><![CDATA[modelling]]></category>
		<category><![CDATA[properties]]></category>
		<category><![CDATA[repositories]]></category>

		<guid isPermaLink="false">http://blog.technologyofcontent.com/?p=237</guid>
		<description><![CDATA[I am a bit behind on my blog at the moment, with a lot of unfinished posts. While I was writing about Lily CMS, I got distracted with an issue that I have been working on in the background for a long time. There was a comment saying &#8220;The Lily content model has been academically [...]]]></description>
			<content:encoded><![CDATA[<p>I am a bit behind on my blog at the moment, with a lot of unfinished posts. While I was writing about <a href="http://www.lilycms.org/">Lily CMS</a>, I got distracted with an issue that I have been working on in the background for a long time. There was a comment saying <a href="http://outerthought.org/blog/426-ot.html">&#8220;The Lily content model has been academically validated and accommodates data mapped from various domains, such as rich hypermedia, HTML5, NewsML, MXF, CMIS, RDF and many more&#8221;</a> which reminded me of the work I have been doing on classification of content models; after all, how can you validate a content model without a metamodel? And no one seems to have described the space of possible models, or the scope of choices. So here is my attempt.</p>

<p>The basic model is that we have resources, which may have some content and metadata attached. We are mainly interested here in the properties that can be attached to a resource, and the fact that some of those properties are relations to other resources. We are less concerned about what goes inside the structured body of a resource, although there are some issues about how many &#8220;bodies&#8221; a resource can have. So we have a model of resources, each of which has key-value pairs; some of the values may be links to other resources, and those links presumably have referential integrity support.</p>

<h2>Properties of properties</h2>

<ul>
<li><p><strong>STRING</strong>. Resources can have string valued properties.</p>

<p>This is a basic starting point; I don&#8217;t think I know of a CMS that does not support this. You can store any other type in a string if necessary, though binary values are more efficient in some cases.</p></li>
<li><p><strong>VALID</strong>. Property values can be validated.</p>

<p>Many core repositories do not validate property values at all; it is just a validation proxy layer that does. As a repository principle this is therefore fairly rare, although a small number of types (numbers, dates) might have native validated representations.</p></li>
<li><p><strong>TYPE</strong>. Property names can be validated.</p>

<p>Many systems have a typing facility that restricts the set of property names a resource can have. Others are unstructured, and any property can be added to any resource. There may be type composition mechanisms, such as mixins or type inheritance. Unlike VALID, this is more often tied to the core repository model than to a proxy layer: the repository may be typed for internal performance or indexing reasons, which usually goes with dense rather than sparse storage.</p></li>
<li><p><strong>BINARY</strong>. Resources can have at most one binary property.</p>

<p>I have split this property as some content management systems can only have one binary property (such as an image file) on a particular resource, and multiple ones have to be constructed from multiple linked resources. This is not generally a huge limitation in an otherwise flexible system, but in a weaker system could be annoying.</p></li>
<li><p><strong>N-BINARY</strong>. Resources can have any number of binary properties.</p>

<p>This is the fully flexible version; one may still be a distinguished value in some way, but you can store all the sizes of an image (say) as properties of one resource, which makes managing them easier, although it may actually make things more difficult if it is not easy to iterate over properties (STRUCTPROP), and using multiple resources could be easier.</p></li>
<li><p><strong>STRUCTPROP</strong>. Properties can be structured.</p>

<p>Some systems have structured properties, for example some systems have a JSON representation for properties, rather than the flat key-value namespace of other systems. JSON supports arrays that can be iterated over, and structures that can be repeated. To make this sort of structure with only key-value properties you may need to use more resources. Structured properties though add a lot more complexity, and perfectly useful, but different, systems can be made with or without this model type. Structured properties often have partial update interfaces, which adds complexity, so that one subproperty can be modified at a time. Note while technically JCR does not have structured properties, you can use the distinguished tree below any resource as a tree of properties, so it is rather similar to this model. Note also that property naming can informally add structure, such as in the way slashes denote URI hierarchy, they can denote property hierarchy in a technically flat namespace.</p></li>
<li><p><strong>MULPROP</strong>. Properties can have multiple values.</p>

<p>Structured properties can usually have multiple values (JSON array for example), but not all systems with key-value type properties allow the same key to be set multiple times with different values. This is the model with say HTML metadata, where each property (key) can be set multiple times; however some key value systems only allow a key to hold a single value, and so the user would have to make a structured value to hold the multiple data items instead, by some encoding scheme without the system providing support directly. Having multiple properties complicates the simple set and get interfaces that single valued properties have.</p></li>
<li><p><strong>TREE</strong>. Resources can be in exactly one tree structure.</p>

<p>Another split one. Many systems have one distinguished tree structure that content items must be in, and that tree has special operations, like fast access to parents and children; other trees might be constructed by other means, like using a general relation, but the operations on them might be difficult. Children in a tree are almost always ordered and can be reordered, although some systems might not have this property.</p></li>
<li><p><strong>N-TREE</strong>. Resources can be in any number of trees.</p>

<p>The distinguished tree is very common (although Lily for example does not have one); but I do not think I know of any system with multiple named trees that share a common tree interface (like a parent function). You can make a tree with general relations, but you will not get help in making it acyclic for example. So while this is a possible design, it is complex to implement, although arguably useful as a modeling tool. Generally you will have to manage general relations yourself to do this.</p></li>
<li><p><strong>CLONE</strong>. A resource can be cloned.</p>

<p>This means that the same item can appear as more than one resource at the same time, each of which will update in the same way. This is similar to say a Unix hard link. This is the usual way of turning a TREE into a <a href="http://en.wikipedia.org/wiki/Directed_acyclic_graph">dag</a>, which adds some flexibility. Different tree locations of the cloned resources may affect properties such as permissions in some systems, or inheritance so this property can add a fair amount of modeling flexibility; conversely without these it is of less use.</p></li>
<li><p><strong>RENAME</strong>. A resource can be renamed or moved.</p>

<p>Many content management systems provide this operation in their model, but it is not a native operation in others at the repository level, as the end user visible name may just be a property, for example. HTTP does not provide a rename operation, but WebDAV does.</p></li>
<li><p><strong>REL</strong>. General relations between resources can be created.</p>

<p>This is the basic relation (named by the key name) between two items. It corresponds to the HTML &lt;link&gt; metadata element, or an RDF triple. It turns the resources into a directed graph with named edges. It is certainly essential for any content management system; I will talk more about how you want to be able to use and query it later.</p></li>
<li><p><strong>RELNS</strong>. Relations have a different namespace from properties.</p>

<p>This is a distinguishing feature between XML, which has attributes and child elements (relations) syntactically distinguished, and JSON, which does not. Non-child relations in XML are still attributes though. This generally seems a pointless distinction, and using a single namespace is simpler.</p></li>
<li><p><strong>RELPROP</strong>. Relations can have properties.</p>

<p>This is an interesting one. Adding a value to the relationship triple to make it a quad means that a number (or other value) can be assigned to a relation, making each relation a weighted directed graph (or you can view the system as a matrix). The general model of the <a href="http://arxiv.org/abs/1006.2361">property graph</a> has properties for edges, but the RDF model, for example, does not. In RDF this is often what blank nodes are used to model, although of course blank nodes can have relations as well as properties. You can make up for a lack of properties on edges/relations by adding extra nodes like this, but they may proliferate and need managing, so allowing properties may help. It is also worth noting that a system without MULPROP can use naming of properties to implement RELPROP, as a relation could have a naming convention for its properties; similarly STRUCTPROP generally allows storing the extra information in the property structure.</p></li>
<li><p><strong>REFINT</strong>. Referential integrity is preserved for relations.</p>

<p>Preserving referential integrity at the repository layer is a fair amount of work, relational databases can do this, but not all content repositories do, for example over delete operations.</p></li>
<li><p><strong>ORDREL</strong>. Properties are ordered.</p>

<p>True key-value models do not tend to have an ordering for properties. As with many of these things, ordering adds interface complexity. Structured properties may however be ordered, and if the model supports a distinguished tree (TREE) this almost certainly has ordered children. If you have to build an ordered tree simply from basic relations it is quite complex. A sort order on a relation is another relation property that, like weights, seems to be rarely supported for general relations.</p></li>
<li><p><strong>REIFY</strong>. Properties can have properties.</p>

<p>RDF in principle lets properties themselves be resources (reification), so that they can in turn have properties. This allows me to add information about the properties, such as where they came from. This rarely seems to be useful in common models. Giving different properties different permissions might be a more useful side effect.</p></li>
<li><p><strong>EXTREL</strong>. Relations can be defined externally to their subject.</p>

<p>HTML originally had a rev relation, which defined a relation backwards from object to subject, and RDF triples can be stored in any document, divorced from subject and object referents. This causes all sorts of issues with updates and managing (even finding) relations, while adding no descriptive ability except potentially REIFY.</p></li>
<li><p><strong>INHERIT</strong>. Resources can inherit properties.</p>

<p>The inheritance tree might be set from other properties, or from a distinguished tree, but one model is that properties not explicitly set can be inherited from another resource, or a prototype. This often makes models simpler, as rather than explicitly walking a tree, you can implicitly do it through inheritance. Seems surprisingly uncommon in content repositories.</p></li>
<li><p><strong>MINHERIT</strong>. Multiple inheritance.</p>

<p>Allow inheritance from multiple resources, not just for example based on the primary distinguished tree. More complex.</p></li>
<li><p><strong>NAMESPACE</strong>. Namespacing on properties.</p>

<p>Some systems have a type of namespacing on properties, often used for multiple language variants for example, so that a property may differ across these namespaces. This can also be implemented with multiple resources, structured properties or inheritance. Usually not all properties are namespaced at once; some may not vary, which makes the set and get interface more complex.</p></li>
<li><p><strong>ATOMIC</strong>. All the properties of a resource must be updated together.</p>

<p>A resource and all its properties are all updated at once to a complete new state (this will also be the versioning state). Other than for versioning, this also affects how concurrent access works. The alternative is that each property can be updated independently. Atomic updates are the HTTP resource update model.</p></li>
<li><p><strong>RVERSION</strong>. Resources may be versioned.</p>

<p>An entire resource may be versioned; this is a similar namespacing operation, where a namespace is available to retrieve the old values of a resource.</p></li>
<li><p><strong>PVERSION</strong>. Properties may be individually versioned.</p>

<p>Some systems (such as Lily) allow versioning to be turned on or off on a per property basis. This property generally implies that ATOMIC is not true.</p></li>
<li><p><strong>DELVERSION</strong>. Versions are deleted when a resource is deleted.</p>

<p>This is surprisingly common, if versions are namespaced properties, then they are often deleted along with the resource when it is deleted. The better solution is not to delete versioned resources, just give them a tombstone (whiteout) marker.</p></li>
<li><p><strong>SNAPSHOT</strong>. The versioning namespace is whole system state not resource state based.</p>

<p>Although versioning of the total state of a system is now common in source code control systems, many content management systems only let individual resources be versioned (hence creating issues such as DELVERSION). The main issue here is that you cannot apply or undo a set of changes together, only individually. Apart from the difficulty in making easy user interfaces, whole system versioning is superior in every way to versioning of individual resources or properties and no one should be designing a system that does not behave like this.</p></li>
<li><p><strong>TREEVERSION</strong>. Versioning supports branching and merging.</p>

<p>A full versioning model like git or subversion, rather than just a linear series of checkpoints, is another model. It generally simplifies concurrent updates (i.e. it can avoid both CAS and LOCK in theory). Although this provides the richest model of versioning, it is the hardest to present to the non-technical user. Note also that this is one clear area where the content model for delivery can differ from the one for authoring; for authoring there are much more complex operations that are useful, while for delivery performance is key, and versioning may not be required at all, depending on how updates are applied.</p></li>
<li><p><strong>CAS</strong>. A <a href="http://en.wikipedia.org/wiki/Compare-and-swap">CAS</a>-type operation is supported on resource updates.</p>

<p>Some type of atomic update-if-unchanged-since-this-version operation is supported, for lockless updates. HTTP ETags are the canonical example. This is the simplest choice for API access, and simple for users too. The unit of atomicity is usually the whole resource here, making it the unit of transactions; atomic update of individual properties only does not let two properties be updated in a single transaction, so it is not so useful.</p></li>
<li><p><strong>LOCK</strong>. A locking operation is supported.</p>

<p>The traditional alternative to CAS is a locking operation that disables write operations while the resource is locked. Some administrative or time based unlock operations are required as well. It is less suited than other methods to automated APIs, due to issues like deadlock. As multiple locks can potentially be obtained, cross resource transactions are possible, although this could impact concurrency.</p></li>
<li><p><strong>TRANSACTION</strong>. Transactions across multiple resources are supported.</p>

<p>Generally individual resources are the unit of transaction, or possibly individual properties. Some systems however allow a transaction in which multiple resources are updated together. JCR is probably the main example of these. A system with snapshots may also have this property if moving between versions is atomic. HTTP deliberately does not have this sort of transaction, as it does not work well if the resources are distributed, and system design for HTTP should ensure that resources model the right things so that transactions across resources are not needed.</p></li>
</ul>
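
<p>To make one of the less common items above concrete, here is a minimal sketch of INHERIT layered on a distinguished TREE. It is only an illustration in Python: the <code>Resource</code> class, its <code>get</code> method and the example property names are all invented for this sketch and do not come from any particular repository.</p>

```python
from typing import Dict, Optional


class Resource:
    """A resource with flat key-value properties and one parent (TREE)."""

    def __init__(self, name: str, parent: Optional["Resource"] = None) -> None:
        self.name = name
        self.parent = parent
        self.props: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        """Return the nearest explicitly set value, walking up the tree (INHERIT)."""
        node: Optional[Resource] = self
        while node is not None:
            if key in node.props:
                return node.props[key]
            node = node.parent
        return None  # not set anywhere on the path to the root


# Hypothetical three-level tree: site, then news, then story.
site = Resource("site")
site.props["language"] = "en"
section = Resource("news", parent=site)
page = Resource("story", parent=section)
page.props["title"] = "Sheep"
```

<p>Here <code>page.get("language")</code> is answered from the root even though the page never sets it, which is the implicit tree walk described under INHERIT. MINHERIT would replace the single parent with an ordered list of sources, and the lookup order then becomes a design decision of its own.</p>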

<h2>Queries</h2>

<p>With the property model above, you can retrieve resources, and read and modify their properties. There may also be some additional maybe slightly different properties (the ones that TREE might expose for example, parent and child relations). We can traverse between resources by following their links. However we do generally want to make more complex queries, either about global questions, or more complex traversals based on properties. There are a lot of query models we should really explore, particularly we need to focus on how properties are indexed. I suspect that the analysis below is just a starting point.</p>

<ul>
<li><p><strong>PINDEX</strong>. Property values are indexed.</p>

<p>This is not necessarily essential, as for most interesting properties one would create a node rather than a value, and use relations, though you need reverse relation indexes anyway.</p></li>
<li><p><strong>REV</strong>. Relations have a reverse index.</p>

<p>This lets me find the opposite direction of a relation. This is a key property, as relations are directional, and important properties are in the other direction, such as finding all the resources tagged with a particular tag value.</p></li>
</ul>
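
<p>The REV item can be sketched in a few lines. This is an assumption-laden toy in Python, not how any real repository stores relations: the <code>RelationStore</code> class and the <code>tagged</code> relation name are made up for the example. The point is only that the reverse index has to be maintained on every write so that the backwards question is as cheap as the forwards one.</p>

```python
from collections import defaultdict
from typing import DefaultDict, Set, Tuple


class RelationStore:
    """Stores (subject, name, object) relations with forward and reverse indexes."""

    def __init__(self) -> None:
        self.forward: DefaultDict[Tuple[str, str], Set[str]] = defaultdict(set)
        self.reverse: DefaultDict[Tuple[str, str], Set[str]] = defaultdict(set)

    def add(self, subject: str, name: str, obj: str) -> None:
        self.forward[(subject, name)].add(obj)
        # The reverse index is updated in the same step as the forward one.
        self.reverse[(obj, name)].add(subject)

    def objects(self, subject: str, name: str) -> Set[str]:
        """Follow the relation in its natural direction."""
        return set(self.forward.get((subject, name), set()))

    def subjects(self, obj: str, name: str) -> Set[str]:
        """Follow the relation backwards (REV)."""
        return set(self.reverse.get((obj, name), set()))


store = RelationStore()
store.add("photo-1", "tagged", "sheep")
store.add("photo-2", "tagged", "sheep")
store.add("photo-2", "tagged", "field")
```

<p>With this in place, <code>store.subjects("sheep", "tagged")</code> finds every resource tagged with sheep without scanning all resources, which is the query that matters for navigation.</p>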

<h2>Interfaces to properties</h2>

<p>I mentioned above that some of the property models have differing interface complexities, and I think it probably helps to show what some of the interfaces look like.</p>

<p>The canonical interface in web content management is the one exposed by HTTP. Resources and all their properties have to be updated atomically (ATOMIC) &#8211; <a href="http://blog.technologyofcontent.com/2009/12/smart-resources-or-why-you-should-care-about-http-patch/">PATCH</a> is just an optimization. CAS is available (ETags or last update time). No versioning is specially supported (although the system could create resources for old versions and add properties to access them). Other property behaviours depend on the document types, so HTML for example supports a meta and link flat property namespace, but other schemes are possible. A resource TREE is very loosely defined by &#8216;/&#8217; in URLs, but does not provide any properties, so it is barely a tree.</p>
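
<p>The combination of ATOMIC and CAS described for HTTP can be sketched as an in-memory analogue. This is not an HTTP client or server: the <code>Repository</code> class, <code>ConflictError</code> and the integer version (standing in for an ETag) are all assumptions made for the illustration, and the sketch is single-threaded, so a real implementation would need to make the compare and the write one atomic step.</p>

```python
from typing import Dict, Tuple


class ConflictError(Exception):
    """Raised when the version supplied by the caller is no longer current."""


class Repository:
    """Whole-resource writes (ATOMIC) guarded by a version check (CAS)."""

    def __init__(self) -> None:
        # path -> (version, properties); version 0 means "does not exist yet"
        self._store: Dict[str, Tuple[int, Dict[str, str]]] = {}

    def read(self, path: str) -> Tuple[int, Dict[str, str]]:
        version, props = self._store.get(path, (0, {}))
        return version, dict(props)  # copy, so callers cannot edit in place

    def write(self, path: str, expected_version: int, props: Dict[str, str]) -> int:
        current_version, _ = self._store.get(path, (0, {}))
        if current_version != expected_version:
            raise ConflictError(path)  # someone else wrote first
        # The complete new state replaces the old one in a single step.
        self._store[path] = (current_version + 1, dict(props))
        return current_version + 1


repo = Repository()
v1 = repo.write("/page", 0, {"title": "Draft"})

version, props = repo.read("/page")
props["title"] = "Final"
v2 = repo.write("/page", version, props)

try:
    repo.write("/page", v1, {"title": "Stale edit"})  # based on an old read
    stale_rejected = False
except ConflictError:
    stale_rejected = True
```

<p>The second writer, still holding the old version, is refused rather than silently overwriting the newer state; it has to read again and retry, which is the same loop a client runs when a conditional request fails.</p>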

<p><a href="http://blog.technologyofcontent.com/2009/12/the-bottom-10-things-of-2009/">WebDAV</a> changed the HTTP data model to push it much closer to one traditional content model, supporting LOCK, RENAME, and TREE on top of HTTP, and an explicit property model independent of the resources in question. The property model is flat, with no STRUCTPROP although extending it is mentioned in the <a href="http://www.webdav.org/specs/rfc2518.html">RFC</a>, and no MULPROP. CLONE is allowed, as resources can have more than one URI. Updates to properties are not ATOMIC, as the PROPPATCH method can update some properties without others. The main HTTP resource is the body, which allows storage of one BINARY property, as the other properties are XML strings. This is pretty much the standard document management style set of properties.</p>

<p>Lily CMS, which I was looking at recently, is pretty different. There is no TREE; you have to construct it yourself. There is TYPE but no STRUCTPROP, and there are NAMESPACE and PVERSION. There is no CAS or LOCK; conflicts are resolved by time of modification alone. It is an interestingly different model which I will look at in more detail in another post, as it chooses complexity in some areas and simplicity in others.</p>

<p>One of the common themes of the NoSQL movement is saying things along the lines of &#8220;if you only give up this one feature, we can make the storage layer that much simpler and faster&#8221;, pushing some more work up to higher layers to resolve: issues like conflict resolution, say, or referential integrity, or tree structures. This is the path of a simpler core repository which does not implement all common usage patterns for a CMS application, with some layering and conventions on top to build the next level. This is not so different from the relational model, with a low level relational algebra and a set of database management tools, then the application. What is still unclear is exactly where to make that split, but certainly large monolithic repository models that try to do everything listed above end up very large and complex to use.</p>

<p>There is definitely a case for moving some of these properties out of the repository into the authoring tools. Referential integrity at the repository level is actually quite hard to work with, as you cannot refer to something you are about to create, for example, but the authoring layer can provide tooling to help the user here.</p>

<p>I will post some follow-ups about some of the other issues arising from this, and what I think the best set of constraints to work in is, and more on the CMIS and JCR models.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.technologyofcontent.com/2010/09/towards-a-comparison-of-content-repositories/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>NoSQL and content management</title>
		<link>http://blog.technologyofcontent.com/2010/02/nosql-and-content-management/</link>
		<comments>http://blog.technologyofcontent.com/2010/02/nosql-and-content-management/#comments</comments>
		<pubDate>Sun, 14 Feb 2010 23:34:15 +0000</pubDate>
		<dc:creator>justin</dc:creator>
				<category><![CDATA[CMS]]></category>
		<category><![CDATA[data modelling]]></category>
		<category><![CDATA[nosql]]></category>
		<category><![CDATA[sql]]></category>

		<guid isPermaLink="false">http://blog.technologyofcontent.com/?p=216</guid>
		<description><![CDATA[I went to many of the first ever NoSQL devroom talks at FOSDEM this year. For anyone who hasn&#8217;t been, FOSDEM is a great place, and the NoSQL room was well organized and full of interest. The term NoSQL is not even a year old; I first came across CouchDB around a year ago from [...]]]></description>
			<content:encoded><![CDATA[<p>I went to many of the first ever <a href="http://nosql.mypopescu.com/post/385372130/your-chance-to-review-the-fosdem-nosql-event">NoSQL devroom</a> talks at <a href="http://fosdem.org">FOSDEM</a> this year. For anyone who hasn&#8217;t been, FOSDEM is a great place, and the NoSQL room was well organized and full of interest. The term NoSQL is not even a year old; I first came across CouchDB around a year ago from memory; Tim Anglade gave an excellent introduction where he reminded people of the historical roots, both before relational databases and since then; so not new but there is a renewed focus now. Why is that? I am going to look here at the field of content management and why you might be interested in different data models if that is your problem space, based loosely on some of the ideas from the talks at FOSDEM. There was a talk about <a href="http://outerthought.org/blog/blog/353-OTC.html">content management specifically and the Lily CMS by Evert Arckens</a> although I missed it, but I have added some comments after watching the video.</p>

<p><a href="http://www.flickr.com/photos/justincormack/4375594326/" title="FOSDEM by Justin Cormack, on Flickr"><img src="http://farm5.static.flickr.com/4029/4375594326_7ebdafd796.jpg" width="450"  alt="FOSDEM" /></a></p>

<h2>The data model for content management</h2>

<p>I have another draft post on this subject in more detail, which I am working on as part of my REST modelling in content management work, but I will outline some of the types of data relations that are important. I will be quite abstract here; if you want more concrete examples you will have to wait for the other post. Database models like the ones we are talking about here are more easily understood in the abstract, I think.</p>

<p>First we need our unit of modelling. This in itself is the first issue. Content management tends to deal with, at the conceptual level, something that looks like a document. It may be a fragment, in the sense that it is say a page component (asset if you use that terminology) rather than a whole item, but the unit for the user to edit and which is usually versioned is a structured object itself. The processing model tends to treat it almost as a binary blob, except that certain properties can be extracted, such as metadata, links in HTML and so forth, but it is stored as an item rather than decomposed further.</p>

<p>OK, so we have a piece of content and some attributes extracted from it as one basic model. This corresponds pretty much to the JCR data model for example. There are variations; sometimes people do not store metadata in the file formats, as historically many file formats had poor support for arbitrary structured metadata, although that is largely obsolete now, and the advantages of actually storing metadata and relations substantially within documents are high. External storage does not change the model much, just complicates processing and storage. Another variant, often seen in document management systems, is to be able to have multiple &#8216;streams&#8217;, i.e. several document variants rolled into one, for example a video and a still from it. You can however from the modelling point of view regard these as another compound document format kept together because conceptually they are a bundle of content; you might distribute them as a zip file if you haven&#8217;t got any other suitable container format.</p>

<p>So now we have a storage model where we have a blob, with rich media operations on it, and extracted structural and metadata information. There is also versioning to consider, but let us ignore that and treat it either as part of the blob, or as a new document with some relation to the old ones, those being the two core versioning models, this does not really affect anything else.</p>

<p>There are two kinds of metadata, although they are more similar than they appear, properties and relations. Properties are the standard attributes (this picture depicts sheep), while relations join two items in the repository (this is a cropped version of this other picture). Although this distinction seems clear, in the end richer information architectures demand that everything becomes a relation, so I can browse a sheep node and find all the sheep items, turning every attribute value of any significance into a node with relations instead. Pure attribute values are only left for the less interesting properties (this PDF file is 176k in size).</p>

<p>They are also less interesting from a relational versus non relational storage point of view, although there is one important point, which is the dense versus sparse question, so let us take a look at this. Most real world attributes are sparse, that is most attributes are not set on most items. In the relational model we have a row for our item, and columns for all the attributes, so we are saying most are NULL. (I was brought up on matrix algorithms and still think in terms of sparse versus dense matrices as this is exactly the same problem, and matrices represent graphs anyway). Storing huge mainly null tables is not very efficient, so there are two common practices in relational mapping of attributes in content management systems. First is to define a type based system, where a particular type of content item is defined to have certain attributes, and each type therefore can have its own table, which is assumed to have fewer NULL values. Mixins, sets of properties that live across types, can potentially be added to this model, as can inheritance schemes, but the basic idea is one table per type. This gives a nice simple direct database programming model, and causes a complete nightmare if you ever want to change the schema, for example add an attribute, as for any large database most DBMSs will effectively shut down the system while a schema change takes place, as schema changes require pretty much all locks. <a href="http://www.silverstripe.com">Silverstripe</a> is one example of a content management system built like this; there are many others.</p>
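
<p>The dense versus sparse point is easy to check with a little arithmetic. The sketch below is a Python toy with invented items and attribute names; it simply counts what fraction of cells would be NULL if everything lived in one wide table, compared with one table per type.</p>

```python
from typing import Dict, List

# Four hypothetical content items; "type" picks the table in the typed layout.
items: List[Dict[str, str]] = [
    {"type": "image", "width": "800", "height": "600"},
    {"type": "image", "width": "640", "height": "480"},
    {"type": "article", "author": "justin", "summary": "A short note"},
    {"type": "pdf", "pages": "12"},
]


def null_fraction(rows: List[Dict[str, str]], columns: List[str]) -> float:
    """Fraction of (row, column) cells that would have to be stored as NULL."""
    cells = len(rows) * len(columns)
    if cells == 0:
        return 0.0
    empty = sum(1 for row in rows for col in columns if col not in row)
    return empty / cells


# Layout 1: a single wide table with a column for every attribute seen.
all_columns = sorted({key for item in items for key in item if key != "type"})
wide = null_fraction(items, all_columns)

# Layout 2: one table per type, each with only the columns that type uses.
by_type: Dict[str, List[Dict[str, str]]] = {}
for item in items:
    by_type.setdefault(item["type"], []).append(item)

per_type: Dict[str, float] = {}
for type_name, rows in by_type.items():
    columns = sorted({key for row in rows for key in row if key != "type"})
    per_type[type_name] = null_fraction(rows, columns)
```

<p>For this data the wide table is 13 NULL cells out of 20, while every per-type table is fully dense. Real content is rarely this tidy, but the gap is why typed tables are attractive until the schema has to change.</p>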

<p>The alternative is the <a href="http://en.wikipedia.org/wiki/Entity-attribute-value_model">entity attribute value</a> (EAV) model (terrible Wikipedia article, please fix), where rather than a direct mapping of the attributes to relations, you indirectly map, creating a table that joins entities, attributes and values; this table of course looks just like RDF triples. Doing this though loses everything that makes a relational database useful: constraints, typing, query optimization. It adds an extra layer of logical schema above the physical schema which the database layer does not understand. This is a pretty common relational mapping for content management systems, as it allows full flexibility in defining and redefining attributes. To implement well it needs a large mid layer to manage the constraints, provide an API layer, generate efficient queries, effectively to manage the logical layer to physical layer map. The <a href="http://drupal.org/node/82661">Drupal CCK</a> is an example of this model.</p>
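
<p>A small sketch shows both the flexibility and the cost of EAV. It uses the <code>sqlite3</code> module from the Python standard library with an in-memory database; the table name <code>eav</code>, its three text columns and the sample rows are assumptions for the illustration and are not taken from Drupal or any other system.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One generic table replaces a column per attribute; every value is text,
# so the database can no longer type-check or constrain individual attributes.
conn.execute("CREATE TABLE eav (entity TEXT, attribute TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO eav VALUES (?, ?, ?)",
    [
        ("photo-1", "depicts", "sheep"),
        ("photo-1", "format", "jpeg"),
        ("photo-2", "depicts", "sheep"),
        ("photo-2", "format", "png"),
        ("report-1", "format", "pdf"),
    ],
)

# A test on one attribute is a plain filter.
sheep = [
    row[0]
    for row in conn.execute(
        "SELECT entity FROM eav "
        "WHERE attribute = 'depicts' AND value = 'sheep' ORDER BY entity"
    )
]

# A test on two attributes of the same entity already needs a self-join.
sheep_jpegs = [
    row[0]
    for row in conn.execute(
        """
        SELECT a.entity
        FROM eav AS a JOIN eav AS b ON a.entity = b.entity
        WHERE a.attribute = 'depicts' AND a.value = 'sheep'
          AND b.attribute = 'format' AND b.value = 'jpeg'
        ORDER BY a.entity
        """
    )
]
```

<p>Adding a new attribute is just another insert, with no schema change, but each extra condition in a query adds another join on the same table. Generating and tuning those joins is the kind of work the mid layer ends up doing.</p>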

<p>Of course this is not to say that the two relational models do not work. The direct mapping works well with simple, unchanging content types in small websites, for example, or in models where attributes are not very sparse, or the sparseness is worth the overhead, and changing the schema is rare. EAV works well too, if managed carefully; it helps if the types of query required on the model are not too complex.</p>

<p>Once you add relations as well as attributes, the already difficult mapping layer gets harder; you add another set of operations (recursion to handle tree structures) that the relational model does not handle well, so you may need to add more into the mapping layer. The promise of NoSQL is that you can bypass this for these types of applications, and program directly to a database model that handles sparse attributes and relations natively. But how much do the NoSQL databases get you? You can argue that if you are already looking at EAV, then you are already not getting much from a relational database, and you are building a modeling layer on top of it, so dropping that and going for something that maps the logical data layer directly does make sense from a development point of view. Whether that really helps performance is less clear. Much of the original work on NoSQL has come out of huge scaling problems, rather than from providing efficient solutions to the types of data mapping problem we are seeing here on a medium scale; of course for huge sites there may be benefits.</p>
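<p>To make the tree structure point concrete, here is a sketch of the sort of code that ends up in the mapping layer when a hierarchy is stored as a plain parent pointer table. The <code>node</code> table and the <code>descendants</code> helper are assumptions made up for this example.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO node VALUES (?, ?, ?)",
    [(1, None, "site"), (2, 1, "blog"), (3, 1, "about"), (4, 2, "2010"), (5, 4, "09")],
)

def descendants(conn, root_id):
    """Walk the tree below root_id, issuing one query per level."""
    found, frontier, queries = [], [root_id], 0
    while frontier:
        marks = ",".join("?" * len(frontier))
        rows = conn.execute(
            f"SELECT id, name FROM node WHERE parent_id IN ({marks}) ORDER BY id",
            frontier,
        ).fetchall()
        queries += 1
        found.extend(name for _, name in rows)
        frontier = [node_id for node_id, _ in rows]
    return found, queries
```

<p>The loop, not the database, is doing the recursion here: the number of round trips grows with the depth of the tree, and that logic lives in application code.</p>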

<p>The types of NoSQL database vary in their level of support for attributes and relations as they are used in content management. Document oriented databases do not give you much more than retrieval of content items; associative ones give key value type attribute lookups; graph databases should, in principle, let you query relations directly, expressing the types of queries that are needed for information architecture problems. Examples I am thinking of are things like tag clouds, which are simple to express as a graph problem, as a tag cloud is simply a count of the number of edges from a set of nodes. Indeed most information architecture problems look like graph problems, and also like <a href="http://en.wikipedia.org/wiki/OLAP_cube">OLAP processing operations</a>, which also do not work well on relational databases. And of course one of the things that NoSQL has shared with OLAP is the use of denormalization; you can use simpler models if you denormalize data to match the queries you will be using, rather than assuming that the types of query you will use can necessarily be optimized and made efficient by a general purpose system.</p>
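<p>The tag cloud example can be sketched in a few lines if the graph is held as a plain list of edges. The data and the <code>tag_cloud</code> function are hypothetical; a real graph database would expose this through its own query API.</p>

```python
from collections import Counter

# Hypothetical graph: each edge links a content item node to a tag node.
edges = [
    ("post-1", "cms"),
    ("post-1", "jcr"),
    ("post-2", "cms"),
    ("post-3", "nosql"),
    ("post-3", "cms"),
]

def tag_cloud(edges, items):
    """Count the edges from the chosen item nodes to each tag node."""
    selected = set(items)
    return Counter(tag for item, tag in edges if item in selected)

cloud = tag_cloud(edges, ["post-1", "post-2", "post-3"])
```

<p>The weight of each tag in the cloud is just its edge count from the selected set of nodes, so narrowing the set (one section of a site, say) gives a different cloud from the same graph.</p>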

<p>Denormalization is not without its difficulties, although arguably it could become a tool embedded in databases, as indexes are now. One of the issues with NoSQL is that most of the database systems leave denormalization to the user: you need to use it because joins are not available, but you have to manage it yourself. Building an infrastructure to explicitly manage denormalization as a first class database item akin to an index might be interesting. So that gives us a first issue: in any NoSQL system except a graph database we will need either to denormalize or to compose queries to get the results we want.</p>
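<p>What managing denormalization yourself means in practice can be sketched with a toy in-memory store. The <code>TaggedStore</code> class is invented for this illustration and stands in for application code sitting on top of a key-value store; it is not the API of any real NoSQL product.</p>

```python
class TaggedStore:
    """Toy key-value store that keeps a denormalized tag count in step by hand."""

    def __init__(self):
        self.items = {}       # item id -> set of tags (the source data)
        self.tag_counts = {}  # tag -> number of items (the denormalized copy)

    def put(self, item_id, tags):
        old = self.items.get(item_id, set())
        new = set(tags)
        # Every write has to repair the denormalized copy as well.
        for tag in old - new:
            self.tag_counts[tag] -= 1
            if self.tag_counts[tag] == 0:
                del self.tag_counts[tag]
        for tag in new - old:
            self.tag_counts[tag] = self.tag_counts.get(tag, 0) + 1
        self.items[item_id] = new

store = TaggedStore()
store.put("post-1", ["cms", "jcr"])
store.put("post-2", ["cms"])
store.put("post-1", ["cms", "nosql"])  # retagging must also fix the counts
```

<p>Reading the tag counts is now a single lookup with no join, but every write path has to know about the copy. That bookkeeping is what an index does for you automatically in a relational database, and what you would want a denormalization tool to take over.</p>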

<p>So I think there are four realistic models for content management backends going forward:</p>

<ol>
<li>The direct relational model for small systems with simple data models, rare attribute changes, little or no use of relations.</li>
<li>EAV models wrapped in a content modeling layer; JCR is an example of this, hiding the underlying SQL layer very well, and indeed potentially allowing it to be replaced with another underlying storage model; I am sure someone is testing a Neo4j backend somewhere. This is where most production solutions are now.</li>
<li>Direct, non-denormalized graph database backends, with the raw content stored in a document store. This cuts out a special purpose middle level by mapping the domain more directly. As <a href="http://blogs.neotechnology.com/emil/2009/11/nosql-scaling-to-size-and-scaling-to-complexity.html">Emil Eifrem</a> says, it may not scale right up as far as the other NoSQL technologies, but it cuts complexity of implementation; there are also issues about whether all the kinds of queries required are available efficiently. I think this will be the sweet spot in a few years, once the products mature and we see more open source activity in the field. Of course RDF based solutions, for example those using SPARQL, fall into this category too, and the maturity of products around these technologies will help drive this category as well as the NoSQL models.</li>
<li>Big, denormalized systems, probably with software support for managing the denormalization, and using underlying simple but scalable technologies like key-value stores. These already exist in large scale web applications, but may remain niche if the development effort remains high. If frameworks that make modelling on these easier turn up, they may trickle down for performance reasons even to smaller datasets; a key-value store runs fine on a relational database backend, although the types of processing required probably mean a specialized backend is useful.</li>
</ol>

<p>Note that the <a href="http://lilycms.org/">Lily CMS</a>, which there was a talk about, fits very much into the fourth option above. This is where the NoSQL technologies have perhaps seen most use, but I think building a CMS like this now will take a lot of work, in particular on tools to support the denormalization strategies that are needed. The approach outlined sounded much like what I have been thinking about for this type of model, although I would focus more on tooling for denormalized queries and less on scaling other parts like full text search right now. It will be interesting to follow the progress of this project.</p>

<p>We are at an interesting juncture, where it looks like there are some options that will let us do domain modelling in a way that corresponds more directly to the domain, but there are a lot of interesting challenges on the way.</p>

<p><a href="http://dilbert.com/strips/comic/2008-02-12/" title="Dilbert.com"><img src="http://dilbert.com/dyn/str_strip/000000000/00000000/0000000/000000/00000/1000/800/1869/1869.strip.gif" border="0" alt="Dilbert.com" width="440"/></a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.technologyofcontent.com/2010/02/nosql-and-content-management/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
	</channel>
</rss>

