<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Technology of Content &#187; scalability</title>
	<atom:link href="http://blog.technologyofcontent.com/category/scalability/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.technologyofcontent.com</link>
	<description>Ramblings on the technology of content management</description>
	<lastBuildDate>Sun, 29 Jan 2012 16:38:05 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Search, SQL, NoSQL, Persistence</title>
		<link>http://blog.technologyofcontent.com/2011/04/search-sql-nosql-persistence/</link>
		<comments>http://blog.technologyofcontent.com/2011/04/search-sql-nosql-persistence/#comments</comments>
		<pubDate>Fri, 15 Apr 2011 14:01:20 +0000</pubDate>
		<dc:creator>justin</dc:creator>
				<category><![CDATA[cloud]]></category>
		<category><![CDATA[CMS]]></category>
		<category><![CDATA[scalability]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[nosql]]></category>

		<guid isPermaLink="false">http://blog.technologyofcontent.com/?p=330</guid>
		<description><![CDATA[I highly recommend the Enterprise Search London meetup, there are lots of interesting talks, thanks to our intrepid organizer Tyler Tate. Last meetup, H. Stefan Olafsson from Twigkit gave a short talk about the relation between relational databases and search engines, and whether you need a relational database if you have a search engine. Craigslist [...]]]></description>
			<content:encoded><![CDATA[<p>I highly recommend the <a href="http://www.meetup.com/es-london/events/17010043/">Enterprise Search London</a> meetup; there are lots of interesting talks, thanks to our intrepid organizer <a href="http://twitter.com/tylertate">Tyler Tate</a>. At the last meetup, <a href="http://twitter.com/mrolafsson">H. Stefan Olafsson</a> from <a href="http://www.twigkit.com/">Twigkit</a> gave a short talk about the relation between relational databases and search engines, and whether you need a relational database if you have a search engine.</p>

<p><a href="http://xkcd.com/886/">
<figure>
<img src="http://imgs.xkcd.com/comics/craigslist_apartments.png" alt="Craigslist apartments" width="400"/>
<figcaption>Craigslist Apartments, by XKCD</figcaption>
</figure></a></p>

<p>This is something I have been thinking about recently, and there are people who are moving big parts of their systems to be built purely on search, such as the <a href="http://www.guardian.co.uk/open-platform">Guardian API</a>, which is <a href="http://www.guardian.co.uk/open-platform/blog/what-is-powering-the-content-api">served from Apache Solr</a>. In this case, though, the search engine is still not the system of record for the core data; that remains the Oracle-based CMS, which did not scale up enough to serve the API. There was some discussion at the talk about search engines that do support persistence (durability, the D in ACID), something Lucene used to have a bad reputation for. My view is that, while making <code>fsync</code> work properly is a good thing, and you should not buy software that cannot recover from crashes, persistence now involves a lot more than this: replication, audit, versioning and access control. Building all of this directly into search products is a mistake. Another issue is that search engines are denormalized, while data stores of record should be normalized to a large extent, to minimise the amount of data to be replicated.</p>

<p>There are two approaches that should work instead, however.</p>

<p>The first is more or less the current approach: use the search engine as an index to a persistent store. I really like this approach if we follow it to its logical conclusion, which is that the persistent store in this type of application architecture should not be a relational database but a document store: a file, an HTTP resource, a document in a NoSQL document database, or an object in a replicated cloud storage system like S3. Modularize the database application, and split the persistence function from the index function. The persistence function provides durability, versioning, audit and access control, along with replication and backup. It can then update the search index, and potentially any other types of index, such as a graph database for querying relationships, or even a relational database if that is the best way of querying some aspects of the data.</p>
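
<p>The split between persistence and indexing can be sketched in a few lines. This is only an illustration under stated assumptions: <code>DocumentStore</code> and <code>SearchIndex</code> are hypothetical in-memory classes, not any real product, and replication, audit and access control are left out entirely.</p>

```python
import re
from collections import defaultdict


class SearchIndex:
    """Toy inverted index: maps each word to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.indexed = {}  # doc_id: words currently indexed for that document

    def update(self, doc_id, text):
        # Remove the old postings for this document, then add the new ones.
        for word in self.indexed.get(doc_id, set()):
            self.postings[word].discard(doc_id)
        words = set(re.findall(r"\w+", text.lower()))
        for word in words:
            self.postings[word].add(doc_id)
        self.indexed[doc_id] = words

    def search(self, word):
        return sorted(self.postings.get(word.lower(), set()))


class DocumentStore:
    """System of record: keeps every version and pushes changes to the indexes."""

    def __init__(self, indexes):
        self.versions = defaultdict(list)  # doc_id: list of all saved texts
        self.indexes = indexes

    def put(self, doc_id, text):
        self.versions[doc_id].append(text)  # versioning: never overwrite
        for index in self.indexes:          # indexes are derived, rebuildable data
            index.update(doc_id, text)
        return len(self.versions[doc_id])   # version number, starting at 1

    def get(self, doc_id, version=None):
        history = self.versions[doc_id]
        return history[-1] if version is None else history[version - 1]


index = SearchIndex()
store = DocumentStore([index])
store.put("post-1", "Search engines and NoSQL")
store.put("post-1", "Search engines and document stores")
print(index.search("nosql"))      # the old version is no longer indexed
print(index.search("document"))
print(store.get("post-1", version=1))
```

<p>Because the index holds only derived data, it could be thrown away and rebuilt from the store, and a graph or relational index could be added to the same list without touching the persistence code.</p>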

<p>Obviously there is a potential consistency issue if updates from the document store propagate slowly, so in practice you get an eventual consistency model. Historically search was a bad offender here, as dynamic updates were not the norm and everything was batched into nightly updates, but that is going away and dynamic updates are now more normal for search indexes. In principle you can have more consistency, especially in an architecture where there are fixed releases that can be consistently indexed, rather than distributed rolling updates; you choose your architecture and accept the trade-off. Small consistency lags rarely matter in a lot of applications.</p>

<p>So you end up with an architecture with a well defined persistence layer that is not a relational database, and a set of indexes appropriate to the application, almost certainly including a full text search engine, but perhaps a graph engine too. Maybe you <a href="http://highscalability.com/blog/2011/4/6/netflix-run-consistency-checkers-all-the-time-to-fixup-trans.html">run consistency checks</a> on your indexes for peace of mind.</p>
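
<p>A consistency check of this kind could be as simple as comparing checksums. The sketch below assumes each index records a checksum of the text it last saw for every document; the function and field names are made up for illustration and are not taken from the Netflix post linked above.</p>

```python
import hashlib


def checksum(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def check_index(store, index_checksums):
    """Compare the system of record with what an index believes it has indexed.

    store:           doc_id mapped to the current text in the persistence layer
    index_checksums: doc_id mapped to a checksum of the text the index last saw
    """
    report = {"missing": [], "stale": [], "orphaned": []}
    for doc_id, text in store.items():
        if doc_id not in index_checksums:
            report["missing"].append(doc_id)    # never indexed
        elif index_checksums[doc_id] != checksum(text):
            report["stale"].append(doc_id)      # indexed an older version
    for doc_id in index_checksums:
        if doc_id not in store:
            report["orphaned"].append(doc_id)   # deleted from the store
    return {kind: sorted(ids) for kind, ids in report.items()}


store = {"a": "first draft", "b": "second draft, revised", "c": "new post"}
index_checksums = {
    "a": checksum("first draft"),
    "b": checksum("second draft"),
    "d": checksum("deleted post"),
}
print(check_index(store, index_checksums))
```

<p>Anything reported as missing or stale can simply be reindexed from the store, which is the benefit of keeping the indexes as derived data.</p>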

<p>The second approach is to see that search engines were some of the original NoSQL data stores, building custom storage and indexing engines because they had such difficult problems. Indeed Google&#8217;s BigTable, and so the ancestry of a lot of NoSQL products, came from search. However, the search engines around now have not yet refactored themselves on top of the NoSQL engines that have emerged from this work, although this is starting with <a href="http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/">Lucandra</a>, which is Lucene persisted in Cassandra and looks promising, offering seamless replication and distribution, and <a href="https://github.com/akkumar/hbasene">HBasene</a>, an HBase Lucene backend. These make a huge amount of sense to me: if you are developing sophisticated search algorithms, not having to build the whole index and persistence layer as well is a big advantage, as is the scale-out potential. Of course this approach does not conflict with the first one; in fact you could choose a NoSQL backend that is aimed more at read performance than persistence, and at storing small index values fast. The hard part is that the search engines have specifically customised their data storage for their particular use cases, and reworking this onto a more general backend has few apparent advantages for them; as you can see from the examples above, most of these changes have come from people who already use the backends in question and want a single database to manage all their data requirements, particularly once they are working with high availability and replication. Software modularity really is not at the right level yet, is it? I blame object oriented programming for this lack of reusability.</p>
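
<p>To make the layering concrete, here is a minimal sketch of an inverted index whose postings lists are stored through a plain get and put interface. It does not reflect how Lucandra or HBasene actually lay out their data; <code>DictBackend</code> is a hypothetical stand-in for whichever NoSQL store sits underneath.</p>

```python
import re


class DictBackend:
    """Stand-in for a NoSQL store: only get and put by key."""

    def __init__(self):
        self.data = {}

    def get(self, key, default=None):
        return self.data.get(key, default)

    def put(self, key, value):
        self.data[key] = value


class KeyValueSearchIndex:
    """Inverted index whose postings lists live in the key-value backend."""

    def __init__(self, backend):
        self.backend = backend

    def add(self, doc_id, text):
        for term in set(re.findall(r"\w+", text.lower())):
            key = "term:" + term
            postings = self.backend.get(key, [])
            if doc_id not in postings:
                self.backend.put(key, sorted(postings + [doc_id]))

    def search(self, query):
        """Return ids of documents containing every term in the query."""
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return []
        result = set(self.backend.get("term:" + terms[0], []))
        for term in terms[1:]:
            result = result.intersection(self.backend.get("term:" + term, []))
        return sorted(result)


index = KeyValueSearchIndex(DictBackend())
index.add("doc1", "Lucene persisted in Cassandra")
index.add("doc2", "An HBase Lucene backend")
print(index.search("lucene"))
print(index.search("lucene hbase"))
```

<p>The search code never sees how the backend stores or replicates data, so swapping the dictionary for a distributed store would, in principle, only mean providing another class with the same two methods.</p>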

<p>Anyway, back to the main point. For applications like content management, I favour an architecture based on a content store that deals with persistence, versioning, access control and replication, with a set of indexes based on search engine techniques, graph databases, and anything else your application needs. Ideally the indexes are all based on a common set of low level primitives so the backend can be swapped out or shared between the search store and other application specific indexing requirements, so there is a single low level indexing infrastructure that can be available as a common scalable service, with different implementations available. This type of architecture is quite buildable now, and is certainly used in quite a few applications, and I think it will become much more widespread, particularly in the cloud where it seems more natural, certainly for many types of application that fit into a document type model, such as content based applications.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.technologyofcontent.com/2011/04/search-sql-nosql-persistence/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Scaling, Security and architecture in 2010</title>
		<link>http://blog.technologyofcontent.com/2010/01/scaling-security-and-architecture-in-2010/</link>
		<comments>http://blog.technologyofcontent.com/2010/01/scaling-security-and-architecture-in-2010/#comments</comments>
		<pubDate>Sun, 17 Jan 2010 18:47:53 +0000</pubDate>
		<dc:creator>justin</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[scalability]]></category>

		<guid isPermaLink="false">http://blog.technologyofcontent.com/?p=200</guid>
		<description><![CDATA[This post is about a bunch of stuff I have noticed recently, things that are affecting software and hardware architectures, and security; it is a bit miscellaneous perhaps. As application architectures on the enterprise move towards emulating web scale architectures these trends will affect software more widely. This concentrates on Linux, the operating system the [...]]]></description>
			<content:encoded><![CDATA[<p>This post is about a bunch of stuff I have noticed recently, things that are affecting software and hardware architectures, and security; it is a bit miscellaneous perhaps. As application architectures on the enterprise move towards emulating web scale architectures these trends will affect software more widely. This concentrates on Linux, the operating system the internet is now built on, and how it is modifying the trends to fit with ways of doing things that may be different from what goes on in other communities. Security continues to be more and more important as the environment for applications becomes more hostile.</p>

<h2>Virtualization</h2>

<p>Virtualization mainly started as a way to deal with compatibility issues when running multiple services on Windows. This has always been much less of an issue with Linux applications, due to the scale of supporting libraries packaged by distributions. It is still an issue though, for security reasons (Apache without suexec for shared hosting still exists, bypassing OS based multi tenancy security, a model that should have gone years ago). KVM, which uses Linux as a hypervisor and uses the hardware virtualization capabilities of newer hardware, is now in the Linux kernel and supported in Red Hat Enterprise Linux. I suspect this will gradually overtake Xen and VMware in areas where only Linux is of interest, due to the built in kernel support; however lighter weight solutions for the security issues such as containers will probably take off instead for many applications where running multiple kernels is unnecessary.</p>

<h2>Containers</h2>

<p>Linux now has a full container model called LXC, similar in principle to BSD jails and Solaris zones. It arrived a bit gradually as a set of patches to namespace various parts of the system such as the process ID space, so a container has its own init process with ID 1 and can have the same IDs as other containers (this also is needed for process migration). There is also a network namespace, so each container has its own loopback device, and independently named network devices (that can for example be bridged back to the host). There is also a read only bind mount which can be used to safely export libraries and binaries to multiple containers with updates done centrally if required; otherwise the container can be managed as a standalone system just sharing the kernel. This environment provides a level of secure isolation between containers that solutions such as chroot never had. Processes in containers can be seen from the container host so obviously this needs to be well secured. Because containers do not need hardware support and are very lightweight I think they will grow rapidly in popularity; they can also run within a virtual machine guest for process isolation in a virtual environment. Ubuntu 10.04 will have <a href="https://wiki.ubuntu.com/ContainersSpec">full support</a>; earlier versions do work.</p>

<h2>Capabilities</h2>

<p>The old high risk ways of setuid binaries (with broad permissions) are going at last, replaced by a fine grained capabilities system. In principle this means you can drop root capabilities completely, making root an unprivileged user. There is a <a href="http://ols.fedoraproject.org/OLS/Reprints-2008/hallyn-reprint.pdf">good summary article on this</a> and <a href="http://www.linuxjournal.com/article/10249">another on trying to remove root access</a>. It seems that we will not see pure capabilities based Linux distributions for a while, and will have setuid binaries in general purpose systems, but there is no reason why single application sandboxes should not drop root capabilities in their init process and just use capabilities set in the file system. Fedora seems the furthest ahead in trying this out as a full distribution, and hopefully this will move ahead, adding another security layer in addition to SELinux.</p>

<h2>Sandboxing</h2>

<p>Privilege separation in network applications has been around for a while, but it is starting to spread, with the best example being the <a href="http://blog.chromium.org/2008/10/new-approach-to-browser-security-google.html">Chrome security model</a>. The thing that has really started to change is treating all complex bits of code, such as HTML rendering in Chrome, as potentially hostile as they are likely to be buggy. There is a lot to do to get good security thinking pervasive in application design, but having some well thought out examples is a good start. Currently Linux Chrome seems to offer a <a href="http://code.google.com/p/chromium/wiki/LinuxSandboxing">choice of sandboxing methods</a> of varying effectiveness, from a suid helper to using <a href="http://lwn.net/Articles/332974/">seccomp</a>.</p>

<h2>SELinux</h2>

<p>SELinux has been available in Linux, providing a Mandatory Access Control framework, for ten years now, but it has taken that long for it to get really widespread use, mainly pushed by Red Hat. Gradually it is extending to other applications, such as mod_selinux for Apache that runs web applications in appropriate security contexts; Postgres SELinux extensions are also available. We are getting to a point when OS security mechanisms can and will be used as they provide the types of security hooks that modern applications need, after a period where we have had applications inventing their own security mechanisms because the OS did not provide the right ones.</p>

<h2>Physicalization</h2>

<p>There was an interesting new buzzword this year: <a href="http://arstechnica.com/business/news/2009/11/basics-of-physicalization.ars">physicalization</a>. Yes, just when you thought virtualization was an important new trend, along comes the opposite. What is the idea?</p>

<p>A two socket 8 core server with 16GB RAM and multiple ethernet ports divided into four virtual servers is actually quite expensive compared to four commodity low end boxes. There is a server premium built into the chip manufacture profit model for a start, and also a volume issue.</p>

<p>The price arbitrage is fairly compelling, although the other costs (disks, motherboards, networking) add up and reduce the saving. The example systems are things like <a href="http://www.sgi.com/products/servers/microslice/">SGI&#8217;s Microslice</a> &#8211; yes SGI, that name from the past! This offers dual core but single CPU systems, but with ECC, for significantly lower price and power consumption than typical two way servers, and potentially more throughput per $, for some workloads.</p>

<p>There are even some suggestions that for Linux workloads non x86 architectures (eg ARM) might be competitive for applications that scale out effectively to multiple machines, although I think the risk of introducing these would be high, and there would need to be a big buyer.</p>

<h2>Cloud</h2>

<p>The big coming trend as the world comes out of recession is that cloud computing platforms are cheap, very cheap, compared to in house server provision. Some estimates put it at 20% of cost now, falling to 10% this year. Part of this is economies of scale, part is standardized components and architectural options, and economies of scale in administration. Part of it may be untrue, as there certainly do not appear to be good figures. What is clear is that the SAAS model is compelling for many kinds of product, and fits in with a general movement to charge software as an expense not an investment. There is a lot of hype, and a lot of people have seen the cloud idea before under different names, but the web has produced a viable delivery mechanism, and the uniformity of hosting environments like EC2 cuts costs. Costs such as upgrades are much lower in a SAAS environment too; although the architecture of this software needs to be different to support that.</p>

<h2>Availability</h2>

<p>Over the last year or so, high availability programming has gained some wider awareness. The <a href="http://www.infoq.com/presentations/Systems-that-Never-Stop-Joe-Armstrong">Erlang model</a> has become better known, bringing more awareness of the base elements for building reliable systems such as process supervision. We are starting to see other implementations, such as <a href="http://akkasource.org/">Akka</a>. This is a great move, as availability needs to move from being a sysadmin and maintenance issue to being a coding issue; for too long effective handling of failure has been ignored by programmers.</p>
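
<p>The core of process supervision is small enough to sketch. Erlang supervisors watch isolated processes and are arranged in trees; the version below is only a single threaded illustration of the restart idea, and <code>supervise</code> and <code>make_flaky_worker</code> are hypothetical names rather than any library API.</p>

```python
def supervise(worker, max_restarts=3):
    """Run worker(); if it crashes, restart it, up to max_restarts times.

    Returns (result, restarts). Re-raises the last error once the
    restart budget is used up, so a parent supervisor could take over.
    """
    restarts = 0
    while True:
        try:
            return worker(), restarts
        except Exception as error:
            if restarts == max_restarts:
                raise
            restarts += 1
            print("worker crashed (%s), restart %d" % (error, restarts))


def make_flaky_worker(failures):
    """Hypothetical worker that fails a fixed number of times, then succeeds."""
    remaining = [failures]

    def worker():
        if remaining[0] != 0:
            remaining[0] -= 1
            raise RuntimeError("transient failure")
        return "done"

    return worker


print(supervise(make_flaky_worker(2)))
```

<p>The point of the pattern is that the worker contains no recovery code at all; failure handling lives in one place, and giving up is itself reported upwards as a failure.</p>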

<h2>Locks</h2>

<p>As applications start to scale to more threads on multicore CPUs, locking becomes more of an issue. <a href="http://en.wikipedia.org/wiki/Lock-free_and_wait-free_algorithms">Lock-free algorithms</a> are one interesting answer that has emerged and can work well for some algorithms. Getting past the scaling issues as architectures get more cores needs innovation in lots of areas such as this. Locks definitely fall into the sequential portions of a program that limit scaling under <a href="http://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl&#8217;s law</a>.</p>
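
<p>Amdahl&#8217;s law is easy to work through numerically. If a fraction p of the work can run in parallel on n cores, the best possible speedup is 1 / ((1 - p) + p / n). The sketch below assumes, purely for illustration, that 10% of the run time is spent serialized behind a lock.</p>

```python
def amdahl_speedup(parallel_fraction, cores):
    """Best-case speedup on `cores` cores when only `parallel_fraction`
    of the work can run in parallel (Amdahl's law)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Assumption for illustration: 10% of the run time is serialized by a lock.
for cores in (2, 8, 64, 1024):
    print(cores, round(amdahl_speedup(0.9, cores), 2))
```

<p>With a 10% serial portion the speedup can never exceed 10 times however many cores are added, which is why shrinking the locked sections matters more than adding hardware.</p>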

<h2>Summary</h2>

<p>Software architecture is at an interesting point; the principles of web architecture and the security mindset are gradually feeding into tools and infrastructure and becoming more widespread, and delivery is also changing. Scalable, available and secure systems are the aim.</p>

<p><a href="http://dilbert.com/strips/comic/2009-11-19/" title="Dilbert.com"><img src="http://dilbert.com/dyn/str_strip/000000000/00000000/0000000/000000/70000/4000/100/74150/74150.strip.gif" border="0" alt="Dilbert.com" width="450"/></a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.technologyofcontent.com/2010/01/scaling-security-and-architecture-in-2010/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>CMS technology choices</title>
		<link>http://blog.technologyofcontent.com/2009/08/cms-technology-choices/</link>
		<comments>http://blog.technologyofcontent.com/2009/08/cms-technology-choices/#comments</comments>
		<pubDate>Sun, 02 Aug 2009 20:40:08 +0000</pubDate>
		<dc:creator>justin</dc:creator>
				<category><![CDATA[CMS]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[scalability]]></category>

		<guid isPermaLink="false">http://blog.technologyofcontent.com/?p=33</guid>
		<description><![CDATA[Response to Julian Wraith's "The future of Content Management…" post covering some of my arguments with Jon and some technical decisions that the content management community will have to make to get to that future…]]></description>
			<content:encoded><![CDATA[<p>The future of Content Management is what we make of it right now; it has not been decided or built yet. Remarkably for a market with so many people in it there are no hard and fast rules and nothing definitive. However we are coming to the end of the experimental phase and the hard decisions are going to be made now, so the future for a fairly long period will be determined pretty soon.</p>

<p>Although the vast majority (but not all) of open source content management systems are continually trying to reinvent the blog, we are talking about internet infrastructure here, and the future of content is going to be open source, like the rest of internet development. I also believe that long term the web project will overwhelm the legacy areas of document management, although it may take some time. Hypertext, the web architecture, XML, HTML, and all those standards are here to stay and to dominate long term. Content management will also become pervasive long term, as the blogging projects show, as the right tools make content management a natural part of workflow. Content management succeeds when it replaces the file and folder paradigm with a content-led paradigm.</p>

<p>In my conversation the other day with <a href="http://jonontech.com/2009/08/01/i-have-a-dream-of-the-cms-future" title="Jon's post on the subject">Jon</a> I was arguing that although we agree on many of the technical issues there are real decisions that need to be made about what needs to be built to get to the content management future. Below are some of my lists of differences. Generally, I think the future of content management is going for the left hand one of the pairs, although some are not clear yet. I have probably missed a lot of the things to determine, but it is a start.</p>

<h2>Architecture &ndash; API differences</h2>

<p>These may cause API and other more significant differences, though some may not matter (eg git can read svn repos, but not vice versa).</p>

<ul>
<li>REST vs SOAP</li>
<li>REST vs Java native interfaces</li>
<li>distributed version control (git) vs file based (SVN)</li>
<li>compositional vs monolithic</li>
<li>structured content vs files</li>
<li>relations vs metadata</li>
<li>web (hypertext) content vs documents</li>
<li>URIs vs referential integrity</li>
<li>web applications with content management vs content management systems</li>
</ul>

<h2>Architecture &ndash; performance differences</h2>

<p>These could have different implementations with different performance characteristics potentially. These are basically IA differences to a large extent, so they do depend on the type of problem being modelled and the modelling process. Models and performance are linked though, and the best we can do is to make parts of this pluggable so that a range of performance characteristics can be used.</p>

<ul>
<li>unstructured vs structured</li>
<li>sparse vs dense</li>
<li>untyped vs typed</li>
<li>NoSQL vs RDBMS</li>
<li>permission hierarchy vs permission graph</li>
<li>scalable vs local</li>
</ul>

<h2>Development process</h2>

<p>This is key to getting the product to where you want it to be.</p>

<ul>
<li>open source vs proprietary</li>
<li>API driven vs UX driven</li>
<li>ubiquitous content management vs isolated systems</li>
<li>agile vs monolithic</li>
</ul>

<h2>Architecture &ndash; usage differences</h2>

<p>These could potentially just come down to the ways or tools with which components are joined together, maybe they do not affect architecture per se.</p>

<ul>
<li>social media vs controlled content</li>
<li>programming languages (Javascript, XSLT) vs templating systems</li>
</ul>

<p><a href="http://browsertoolkit.com/fault-tolerance.png"><img src="http://browsertoolkit.com/fault-tolerance.png" alt="fault tolerance" width="500px"/></a></p>

]]></content:encoded>
			<wfw:commentRss>http://blog.technologyofcontent.com/2009/08/cms-technology-choices/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Key value stores and the relational model</title>
		<link>http://blog.technologyofcontent.com/2009/03/key-value-stores-and-the-relational-model/</link>
		<comments>http://blog.technologyofcontent.com/2009/03/key-value-stores-and-the-relational-model/#comments</comments>
		<pubDate>Sun, 08 Mar 2009 12:44:15 +0000</pubDate>
		<dc:creator>justin</dc:creator>
				<category><![CDATA[scalability]]></category>
		<category><![CDATA[keyvalue]]></category>
		<category><![CDATA[relational]]></category>

		<guid isPermaLink="false">http://blog.technologyofcontent.com/?p=11</guid>
		<description><![CDATA[Short overview of key-value stores, and the nosql universe.]]></description>
			<content:encoded><![CDATA[<p>Large scale persistent key value stores are rather fashionable right now.</p>

<p>For anyone who has missed this there is a <a href="http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/">survey from Richard Jones</a> at <a href="http://last.fm">last.fm</a> of some of the main software projects out there &#8211; although missing some like <a href="http://tokyocabinet.sourceforge.net/spex-en.html">Tokyo Cabinet</a> &#8211; there is a lot in this area <a href="http://blog.plathome.com/2009/02/first-key-value-storage-meeting-held.html">going on in Japan</a>.</p>

<p>Scaling has suddenly become something that a lot of people have to do, and they are working in an open source world, so many of the things they build get released out into the world. Many of these projects come out of the big web sites, or are products for building scaling services like <a href="http://en.wikipedia.org/wiki/SimpleDB">Amazon&#8217;s SimpleDB</a>. Scaling a relational database above the point where replication alone works gets difficult, and replication only improves read performance while leaving a write bottleneck. <a href="http://highscalability.com/unorthodox-approach-database-design-coming-shard">Sharding</a> is the usual approach, partitioning the data across multiple machines, but it removes the ability to perform joins (as these now cross machines). Once you lose that you have lost the <a href="http://en.wikipedia.org/wiki/Relational_model">theoretical basis of the relational model</a>.</p>
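
<p>A small sketch shows why joins are the casualty. It assumes a simple hash partitioning scheme over four in-memory dictionaries standing in for machines; real systems differ in how they place and rebalance data, and every name here is made up for illustration.</p>

```python
import hashlib

SHARD_COUNT = 4  # assumed number of machines


def shard_for(key):
    """Pick a shard from a stable hash of the key (not Python's salted hash())."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % SHARD_COUNT


shards = [{} for _ in range(SHARD_COUNT)]


def put(key, value):
    shards[shard_for(key)][key] = value


def get(key):
    return shards[shard_for(key)].get(key)


put("user:1", {"name": "alice"})
put("user:2", {"name": "bob"})
put("post:10", {"author": "user:1", "title": "Sharding"})
put("post:11", {"author": "user:2", "title": "Joins"})

# The "join" of posts to authors is now application code: one lookup per
# row, each of which may land on a different machine.
for post_id in ("post:10", "post:11"):
    post = get(post_id)
    author = get(post["author"])
    print(post["title"], "by", author["name"], "- post on shard",
          shard_for(post_id), "author on shard", shard_for(post["author"]))
```

<p>Writes now spread across machines, which removes the bottleneck, but the database can no longer plan or optimize the join because no single node sees both sides of it.</p>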

<p>The key value stores that exist are more or less based on building persistent, distributed, high performance versions of basic data structures. Two dominate. The first is the simple hash table, which is just a pure key value store, as with <a href="http://memcachedb.org/">MemcacheDB</a>, which is built on <a href="http://www.oracle.com/technology/products/berkeley-db/index.html">Berkeley DB</a> (perhaps the oldest persistent key value store) but adds a network interface rather than just a language API. These just give very fast lookup and storage of values associated with a key. That's it, one single value. The other is the B-tree based form, which generally adds the ability to do range queries, so with careful selection of keys you can return multiple values from ranges, as in <a href="http://couchdb.apache.org/">CouchDB</a>. Pretty simple then.</p>
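
<p>The difference between the two forms shows up in the operations they can offer. In this sketch a sorted list of keys stands in for a B-tree (an assumption made to keep the example short); <code>HashStore</code> and <code>OrderedStore</code> are hypothetical classes, not the API of any of the products mentioned.</p>

```python
import bisect


class HashStore:
    """Pure key-value store: exact-key lookup only."""

    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


class OrderedStore(HashStore):
    """Keeps keys sorted, standing in for a B-tree, so ranges are cheap."""

    def __init__(self):
        super().__init__()
        self.keys = []

    def put(self, key, value):
        if key not in self.data:
            bisect.insort(self.keys, key)
        super().put(key, value)

    def scan(self, start, end):
        """Return (key, value) pairs with start inclusive and end exclusive."""
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, end)
        return [(key, self.data[key]) for key in self.keys[lo:hi]]


store = OrderedStore()
store.put("2009-03-08:post", "Key value stores")
store.put("2009-08-02:post", "CMS technology choices")
store.put("2010-01-17:post", "Scaling, security and architecture")
print(store.get("2009-03-08:post"))
print(store.scan("2009-01-01", "2010-01-01"))
```

<p>Prefixing keys with a date is the kind of careful key selection that turns a range scan into a useful query, here returning everything from 2009.</p>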

<p>Although relational databases are basically built on B-trees, there is a huge gap for most applications in using structure at this level. Normalisation has gone, so there goes that part of the storage design process. Some applications or parts of applications map easily, and these easy parts are where these technologies are getting most use now. Some people are using denormalisation now with database backends (<a href="http://highscalability.com/flickr-architecture">Flickr stores data twice</a> in some cases). So far there is no data model, so it is expensive custom work in each situation to map onto these scalable storage models.</p>

<p>As a side note, there used to be a lot of discussion about integrating SQL into programming languages as a native data type. But the only language the relational model really works with is <a href="http://en.wikipedia.org/wiki/Prolog">Prolog</a> as that has a matching native structure. Key value stores match much better to native structures, but mainly because of the weaker expressive structure.</p>

<p>The relational database did well because the model had the right balance between ease of implementation for people and scope for computers to optimize much of it. That is generally a hard balance. Key value stores are putting all the work back on the programmers right now. Some projects are starting to work on the next stages – CouchDB views get a few things right that other frameworks do not look at, but there is a lot of work to do here.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.technologyofcontent.com/2009/03/key-value-stores-and-the-relational-model/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

