Posted by tobi — 11:42 AM Sep 02
I have to confess: I really don’t like relational databases. I can’t wait for the day we can ditch them.
Think about it for a second: Databases store data to disk. That’s all 90% of us use them for. They are essentially elaborate hash tables backed by a disk drive. Why are they more lines of code than some operating systems?
Despite that, unless you have a really well thought out setup, a disk failure is still a major disaster. Even if you have backups, even if you have replication, there will be downtime and manual labor while a new master server is established. Databases never put your 10-20 commodity server boxes with all their spare disk space to use. They always sit on these really expensive ivory tower IBM boxes outside of your cheap cluster.
Despite millions of man-years of research, databases are actually pretty dumb. You have to tell them about every nuance of your schema, you have to tell them about indexes and so on. If you forget an index they are perfectly happy to run sequentially through all the data you ever inserted into them, many times a second.
Replication is generally a nightmare and every machine involved in the replication needs to have enough disk space to store the entire content of the database.
There are several interesting projects which try to re-invent the database as we know it. Yesterday I found out about a particularly interesting one: CouchDB, a contender for “The next generation web storage” as their website proclaims. The project started out using C++ but eventually changed to Erlang, which is a perfect choice for highly parallel server software.
CouchDB has no tables, it just has a flat global namespace for documents. A document is a simple JSON record.
POST /shopify/
{
  "value":
  {
    "title": "Arbor Draft",
    "type": "Product",
    "price": 299.00,
    "tags": ["snowboarding", "freestyle", "wintersport"],
    "description": "...."
  }
}
Instead of defining the schema we simply add arbitrary records. There are no tables.
So how do we retrieve all the records again? CouchDB uses the concept of views, which are essentially JavaScript functions. It uses map/reduce to match the records in its global namespace against those views ahead of time, so that at query time the results are available almost instantaneously. This is a huge performance boost for web applications, which generally have many more queries than updates and inserts.
Let’s install some useful views under /shopify/all:
PUT /shopify/all
{
  "_view_documents": "function(doc) { return doc; }",
  "_view_products": "function(doc) { if (doc.type == 'Product') { return doc; } }"
}
GET http://couchserver/shopify/all:products
returns:
{
  "_id": "all:products",
  "rows":
  [
    {
      "_id": "64ACF01B05F53ACFEC48C062A5D01D89",
      "_rev": "62D22746",
      "title": "Arbor Draft",
      "type": "Product",
      "price": 299.00,
      "tags": ["snowboarding", "freestyle", "wintersport"],
      "description": "...."
    }
  ]
}
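The idea behind such views, that every new or updated document is run through the installed view functions once so that reads become simple lookups, can be sketched in a few lines of Python. This is only an illustration of the concept, not how CouchDB is actually implemented; the `ViewIndex` class and its `save` and `query` methods are made-up names.

```python
from typing import Any, Callable, Dict, List, Optional

Document = Dict[str, Any]
MapFn = Callable[[Document], Optional[Document]]


class ViewIndex:
    """Toy in-memory store that keeps view results up to date on every write."""

    def __init__(self, views: Dict[str, MapFn]) -> None:
        self.views = views
        self.docs: Dict[str, Document] = {}
        # view name -> {doc id -> value returned by the view function}
        self.results: Dict[str, Dict[str, Document]] = {name: {} for name in views}

    def save(self, doc_id: str, doc: Document) -> None:
        """Store a document and re-evaluate every view for that one document."""
        self.docs[doc_id] = doc
        for name, fn in self.views.items():
            value = fn(doc)
            if value is None:
                # The document does not (or no longer does) match this view.
                self.results[name].pop(doc_id, None)
            else:
                self.results[name][doc_id] = value

    def query(self, name: str) -> List[Document]:
        """Return the precomputed rows; no view function runs at query time."""
        return list(self.results[name].values())


# Python stand-ins for the two JavaScript views installed above.
store = ViewIndex({
    "documents": lambda doc: doc,
    "products": lambda doc: doc if doc.get("type") == "Product" else None,
})
store.save("1", {"type": "Product", "title": "Arbor Draft", "price": 299.00})
store.save("2", {"type": "Page", "title": "About us"})
print([row["title"] for row in store.query("products")])  # ['Arbor Draft']
```

The work happens in `save`, once per write; `query` only reads what is already there, which is the trade-off that favors read-heavy web applications.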
There are a lot more cool things in CouchDB. Notice that the returned document has a _rev? Older revisions of documents are only deleted if you say so. If you are working on a wiki you just got your historical data for free. Unfortunately CouchDB is still in alpha, but I think the fundamentals are sound. It’s a lot more aligned with the way a modern web application works and needs its data represented. Its replication system is already much more powerful than that of other database systems and in fact is very similar to the way Google works with its BigTable and map/reduce infrastructure.
For further information head to the project’s wiki.
Jay Owack 02 Sep 15:14
Wow, that’s a pretty cool project. It looks like a great niche DB for special projects, but I can’t imagine it ever replacing relational databases. The performance must be horrendous for any large set of data considering there is no fixed structure, which relational databases use heavily for fast disk access and set operations. I can’t even imagine what sort of performance problems are encountered related to document locking on updates to the db. I guess if performance isn’t a requirement, then CouchDB may be fine.
tobi 02 Sep 15:29
Well, this is going to be the big question. I actually think that the potential for performance is a lot greater with the CouchDB approach. A new or updated record has to be compared to the installed views only once. This is probably a millisecond operation. After that the views know about their results and they can be queried in constant time.
I believe that for performance it’s better to know the possible queries than the structure of the data.
Also the CouchDB approach scales out very well, as opposed to scaling up. You could have 30 different servers which each keep a fraction of the total available data (every record would be stored 2-3x redundantly in case of a node failure). Because all the servers already have the results ready, a query would just go to all servers and the results would be merged together. This means that the load, which happens mostly at update time as opposed to query time in SQL databases, can be scaled with a shared-nothing approach just like web applications can today.
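A rough Python sketch of that scatter-gather idea, assuming each node already holds precomputed view rows for its share of the data and every record lives on more than one node. The node contents and the `scatter_gather` helper are invented for illustration; nothing here is taken from CouchDB itself.

```python
from typing import Dict, List

Row = Dict[str, object]


def scatter_gather(nodes: List[List[Row]]) -> List[Row]:
    """Collect the precomputed rows from every node and merge them.

    Because each record is stored on several nodes, duplicates are
    dropped by `_id` while merging.
    """
    merged: Dict[str, Row] = {}
    for rows in nodes:
        for row in rows:
            merged.setdefault(str(row["_id"]), row)
    return sorted(merged.values(), key=lambda row: str(row["_id"]))


# Three hypothetical nodes; every record is stored on two of them.
node_a = [{"_id": "1", "title": "Board A"}, {"_id": "2", "title": "Board B"}]
node_b = [{"_id": "2", "title": "Board B"}, {"_id": "3", "title": "Board C"}]
node_c = [{"_id": "3", "title": "Board C"}, {"_id": "1", "title": "Board A"}]

rows = scatter_gather([node_a, node_b, node_c])
print([row["_id"] for row in rows])  # ['1', '2', '3']

# With node_a down, the redundancy still yields the full result.
print([row["_id"] for row in scatter_gather([node_b, node_c])])  # ['1', '2', '3']
```

In a real cluster the per-node lookups would run in parallel over the network; the merge step is the only part that has to see everything.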
Scott 02 Sep 15:47
Yet another solution in search of a problem.
Oh don’t get me wrong, it’s cute, but it’ll never be confused for a real database.
Let’s face it, if you’re only ever interacting with your data via a single application, then hash tables are all you really need. However… If you have multiple systems hitting the same pieces of data, then defining the schema in the database is the only logical choice. Safer, more accurate, faster.
Consider a customer account schema for somebody like Amazon. You’re talking hundreds of millions of entries. It needs to interact with the application code, reporting, forecasting, etc. Do you define the rules in each application? (opening the door for inconsistencies) Or do you define the rules in the database itself? (single location, and always the same) In all honesty, for large scale applications, the most accurate thing to do is have less application code and use APIs defined by the database.
But your first sentence pretty much summed it up to begin with: ‘I really don’t like relational databases’ Sounds more to me like you don’t know how best to use them.
Silveira Neto 02 Sep 16:01
I like the idea. Hope to see it working soon.
Koz 02 Sep 17:29
@Scott: Interestingly enough, Amazon have repeatedly and publicly stated that they don’t share their data using their relational databases, and they don’t even use relational databases for some of their larger data sets.
All their data is shared by APIs defined by the application. So while you may find some cases to cite for your relational-centric view of the world, Amazon isn’t one of them.
lvanek 02 Sep 17:38
Disk failures may have been a major pain in the past, but RAID and hot-swappable drives make this pretty much a non-event. A drive goes out, order a new one, slide out the old, slide in the new drive. Watch the blinking yellow lights for a few seconds until they turn green. And this is on commodity rack servers. It really is that easy nowadays.
Scott 02 Sep 17:47
@Koz: OK, bad example on my part. Actually, not surprised that Amazon doesn’t use the relational model on exceedingly large datasets. That calls for a more warehouse-like star schema. Fast reads, slow writes, but the fastest way to retrieve any sort of large analysis. But I’ll stick with the concept that if you need to share data across applications, the rules should be in the database.
I’ll speak to my situation. We’ve got 30M customers, 4 brands shared across them, and at least 5 applications reading and writing the same data. It really makes no sense to define rules for the data in 5 separate places. Odds are very good that something would be missed in at least one location, or rules would be implemented slightly differently. That route leads to data corruption and errors. The only location that makes sense for the rules is at the level closest to the data. The database itself.
Bottom line: The value to the business is NOT in the application. It’s in the data. The business can’t forecast by looking at application usage. Keeping the data as clean as possible is the key to any business. And you do that by enforcing data integrity at the most basic level.
Steve Hannah 02 Sep 18:26
This looks like a very useful abstraction for ajax applications, as storing objects in JSON format seems like a good choice for fast retrieval and flexibility.
However, this seems more like an interface specification for a database than a database itself. Don’t get me wrong, I think that this is a good thing. This sort of system could be implemented using any type of database including a relational database.
In fact using an underlying relational database for storage would probably yield good results. One table to store the actual objects, and you could have other tables that serve as indexes to be able to look up objects quickly based on various parameters.
I look forward to seeing more on this.
alan 02 Sep 19:02
”... and every machine involved in the replication needs to have enough disk space to store the entire content of the database.”
This is correct, except in cases of using Oracle RAC or PG-Cluster II. Also, namespace queries aren’t a new idea for data storage (yes, I noticed the JSON format too). Please have a look at some of the many RDF databases out there, which use the proven technique of storing data as triple statements. The benefit of RDF is semantic data storage (no need to define relations separately from the stored data). There are also a few established RDF query languages, and many languages already have RDF querying plugins.
George 02 Sep 19:33
None of the most successful players in the information business (Google, Amazon, et al) rely on centralized databases. Even with replication and RAID, they still form a scary Single-Point-of-Failure.
And replication is actually not a solution to anything at all—suppose your code accidentally issues a “delete * from customers” and you don’t notice until the query has been replicated. You are screwed.
Shared-nothing architectures are the future, and traditional relational databases don’t fit into that category. I don’t know if CouchDB is the answer, but something very much like it is.
At least, that is, if not having service outages is important at all. I mean, come on, guys, you can’t tell me that your centralized DB systems haven’t ever had serious outage problems when that one system where all the data is goes down. I, for one, speak from experience. A couple of years ago I lived through an outage firedrill where we happened to have a disk fill up unexpectedly on the master just at the same time that the replica was having a hard time “catching up” with the master. And I’ve also worked with some systems that could tolerate 10s of systems failing simultaneously WITHOUT the service going down at all, and nary a user noticing that those boxes were gone.
Shaun Regauld 02 Sep 19:49
CouchDB seems to align somewhat with the “Eliminating Service Outages, Once and For All” (http://flud.org/blog/2007/04/26/eradicating-service-outages-once-and-for-all/) point of view.
I’ve also worked with brittle centralized relational DBs for far too long. And despite having the brightest DBAs, some very smart system architects, and very top-of-the-line hardware, it seems like such DBs are inherently weak when it comes to reliability.
Of course, there are lots of systems for which reliability doesn’t matter, or at least doesn’t seem to matter. The system is down for all users for several hours every couple of months? “Yawn, who cares?” seems to be the response. For these “toy” applications (and I’m lumping a bunch of those vital db-dependent financial apps in here), silly relational dbs are probably here to stay.
But for a real “man’s” application that needs to stay up all the time, well, you’d be crazy to invest in Oracle or IBM or any of the others.
Mr eel 02 Sep 19:58
@steve:
“However, this seems more like an interface specification for a database than a database itself.”
Did you have a poke around the wiki? It’s a database written in Erlang. You store and retrieve data with it. It works now. It’s a database.
“This sort of system could be implemented using any type of database including a relational database.”
This is true. ThingDB does something like that over the top of PostgreSQL. But that’s just a clever hack and doesn’t give you opportunities to develop nice approaches for distribution and querying.
@scott:
“Yet another solution in search of a problem.
Oh don’t get me wrong, it’s cute, but it’ll never be confused for a real database.”
Oh my $DEITY. Well I can see what you do for a living :)
I personally think that the current definition of ‘real database’ is way too narrow. Database != relational database.
Also, assuming that because CouchDB isn’t a relational DB it’s gonna be slow is… well, a bit silly really. Perhaps looking at the actual implementation might help.
I mean, it might be slow. Then again, maybe not.
Daniel Morrison 02 Sep 22:18
I’ve been ranting about RDBMS at a very frequent rate for the last month or so. I’m fed up with how web apps accept the limitations that come with databases, and how we all code with them in mind.
It seems that databases have become a nasty anti-pattern that we can’t escape from. Clients frequently ask what DBs we use. They never ask about not using one.
I’m really excited to look at CouchDB. Thanks for the tip!
Shanti Braford 03 Sep 00:39
Come on, guys. I don’t honestly think Tobias was suggesting Amazon scrap their system and use CouchDB or a CouchDB-like system.
I think smaller apps without a huge amount of “business rules” are better suited for something like this.
It would also work well for a kind of distributed open source database.
i.e. OpenIMDB or a wiki-like database.
Kevin Teague 03 Sep 01:14
Amen to the sentiment that relational databases are often more work than they are worth! Especially in the world of the Web, where our interfaces are defined in HTTP, XML, JSON and REST – SQL often just feels like an added, unnecessary hassle. Many Python web developers, this one included, ditched relational databases for the pure object persistence goodness of Zope Object Database (ZODB) many, many years ago and haven’t looked back. There are a lot of large web sites out there today that don’t use relational databases: Novell, United Nations, CIA, OXFAM International and The Free Software Foundation to name some of the larger ones.
Pat Maddox 03 Sep 05:23
I wrote some instructions on how to build and install it from source, since I ran into quite a bit of trouble.
http://evang.eli.st/blog/2007/9/3/building-and-installing-couchdb-on-os-x
Jonathan Conway 03 Sep 10:01
@scott – I worked on a very large enterprise project in the past where apps in different languages all had to share the DB. I had the following options:
a) Allow different apps to hold business logic (maintenance nightmare).
b) Use the relational database as the main integration point.
c) Use a web service app as the main integration point (SOA anyone?).
Of course I chose option c). People who think in terms of the database being the main integration point should really rethink their strategy, as it’s a seriously harder-to-scale solution on multiple fronts.
I’m constantly butting into the limitations of relational databases. So I for one applaud any innovations in database storage/retrieval.
Jan 03 Sep 10:42
@Scott You can choose to use per-app or per-database Views. If you use the same Views in all your apps, this is effectively a shared schema definition.
CouchDB allows you to be less strict, but gives you all the tools to be as tight as required.
@Jay On the contrary. CouchDB is designed specifically for high speed and concurrent access. It uses optimistic locking and an MVCC system (multiple non-blocking reads and serialized writes) that makes it scale orders of magnitude higher than a pessimistic locking model. Also, the Erlang runtime system is itself designed for a high level of concurrency. The operations that serve a request are loosely coupled, and if there’s a problem with one, all the others continue to run at normal speed.
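To illustrate what optimistic locking means in general, here is a minimal Python sketch. It is not CouchDB’s code; the `Store` class, its integer revisions and `ConflictError` are hypothetical. A writer must present the revision it read, and the write is rejected if somebody else updated the document in between. Readers never take a lock.

```python
from typing import Any, Dict, Optional, Tuple


class ConflictError(Exception):
    """Raised when a write is based on a stale revision."""


class Store:
    def __init__(self) -> None:
        # doc id -> (current revision, document body)
        self._docs: Dict[str, Tuple[int, Dict[str, Any]]] = {}

    def get(self, doc_id: str) -> Tuple[int, Dict[str, Any]]:
        """Reads take no lock; they return the current revision and a copy."""
        rev, body = self._docs[doc_id]
        return rev, dict(body)

    def put(self, doc_id: str, body: Dict[str, Any], rev: Optional[int] = None) -> int:
        """Write only if `rev` matches the stored revision (None for new docs)."""
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            raise ConflictError(f"{doc_id}: expected {current_rev}, got {rev}")
        new_rev = (current_rev or 0) + 1
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev


store = Store()
store.put("board", {"price": 299.00})

rev_a, doc_a = store.get("board")  # two clients read the same revision
rev_b, doc_b = store.get("board")

doc_a["price"] = 279.00
store.put("board", doc_a, rev=rev_a)  # first writer wins

doc_b["price"] = 310.00
try:
    store.put("board", doc_b, rev=rev_b)  # stale revision, rejected
except ConflictError:
    print("conflict: re-read the document and retry")
```

The losing client is never blocked waiting for a lock; it simply finds out at write time that it has to re-read and try again.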
@Pat Cheers!
@lvanek Ever worked with a system that resides in different locations? Disk failure is not the only problem a global system might have.
@Tobi Thanks for the great article!
Meneer R 03 Sep 11:24
There are currently four issues with relational databases:
You can’t change the database structure on the fly with a live system. You are forced to coerce all the old data into the new format. What kind of nightmare would that be with a huge database? The workarounds you find in enterprises in practice show the problem: they often have a tablename_new table and logic in their clients to check both tables.
It doesn’t have to be mandatory, but it should be possible per table.
This is problematic if you want to keep the schema simple. Relational links are horrible in this context because you can’t easily see who owns an object. You need to use id’s and move that kind of logic into the client.
We are forced to combine tables that should be separate, for performance. I thought database design was about separating structure from implementation. It isn’t in practice. And where are the round-a-bout index trees? No database vendor offers these indexes, which are CRUCIAL for almost all web applications. Round-a-bout trees reorder themselves on access, so the tree automatically adapts to the popularity of certain records.
Try defining a field that can only contain even numbers, or only prime numbers, or only strings without spaces. Yet maintaining data integrity in several places is asking for creepy bugs that are hard to catch. So we don’t enforce stuff the database can’t enforce. Now who is limiting who?
6. SQL is not a Turing-complete programming language.
We are limited in the type of queries we can apply at the database server. But if we have to do those at the client, we lose all performance benefits.
7. Abstract datatypes are just not supported at all. This falls somewhat in the constraint area, although I’m talking about user-defined types here. Stuff like: this record is either an A or a B. Or this field contains either a boolean or an integer. Try defining that in most DBMSs. You can’t.
What we need:
Easy to move around and no limitations on the structure by default. Use JSON or XML to interface. Each record should have one unique id by default.
Because of the flat-style access, this should not be a problem.
Examples:
Yaroslav Markin 04 Sep 11:07
Lotus Notes, Web 2.0 style. Hello my old friend. It’s been long.
Non-relational databases seem to act like Phoenix birds, dying and resurrecting all the time. Notes, then Zope (ZODB), then Intersystems Cache..
It will be interesting to see whether the ideas behind “enterprise oldie” Notes evolve into something better. One of the CouchDB developers was involved with Notes, am I right?
Yaroslav Markin 04 Sep 11:09
And of course it is really great to see a promising app built on Erlang. Good job :)
Jay Owack 05 Sep 18:00
@Jan It seems theoretically impossible for CouchDB to reach the performance of a relational database. With no fixed structure of known columns and known column lengths, how can disk access be optimized?
Moreover, information in the database will be stored in a more compact format than in a text file, so accessing it involves fewer disk accesses as well.
I’m not exactly sure how the data is going to be stored for CouchDB, but I fear from the overview that each document will be saved to a separate text file.
Once again, CouchDB seems like a nifty project with some nice potential uses. But, outperforming a relational database will not be one of them. Erlang has some nice concurrency features, but for extreme performance you have to get to a lower level than Erlang will allow.
Mr eel 06 Sep 08:51
“I’m not exactly sure how the data is going to be stored for CouchDB, but I fear from the overview that each document will be saved to a separate text file.”
They mean document in the abstract sense — an arbitrary collection of fields — not as individual files on disk.
“But, outperforming a relational database will not be one of them.”
Such certainty. How exactly do you know this? :)
“but for extreme performance you have to get to a lower level than Erlang will allow.”
Forgive me if this sounds patronising, but that is a very naive thing to say. Erlang’s performance has nothing to do with being a high-level language.
Jay Owack 06 Sep 13:09
“Such certainty. How exactly do you know this? :)” Wow, quoting people’s sentences while adding no additional info is fun… If you read the previous posts, then you’d see how.
“Forgive me if this sounds patronising, but that is a very naive thing to say. Erlang’s performance has nothing to do with being a high-level language.” Being a high-level language has EVERYTHING to do with performance. Every layer of abstraction added takes away from performance.
Higher levels = more abstractions = worse performance
Did you ever notice that Ruby performs slower than Java, Java performs slower than C, and C performs slower than assembler? That’s not to say that you should not program in Erlang or Ruby. I love Ruby, but I’m not going to be fooled into thinking its performance will ever be fabulous.
York 06 Sep 14:54
Jay, your rdbms is a pile of crap when it comes to scalability and performance. I don’t care how enterprisey it is, or how many cpus it has, or how many raided disks it has. It’s still going to have a lot fewer cpus and disk spindles than a completely distributed couchDB. And a couchDB approach will absolutely outperform the rdbms approach.
In addition, a couchDB-like system will be able to scale nearly infinitely. So while you are trying to figure out how to shard your db so that it can handle load, and juggling all the issues that result, couchDB will be humming along like it did on day 1.
Breton 07 Sep 02:58
There are a few myths floating around in this article and in the comments. I normally don’t have much luck dispelling myths, but hopefully, if the people reading this can think critically and do their own research instead of reacting with a knee-jerk, it should go alright.
Myth 1: SQL is relational. It turns out this is rubbish. SQL is so far from relational that the committee in charge of SQL had to remove the words “relation” and “relational” from the standard. This is easily verified: find the standard and do a simple search. What’s not relational about it? In short: in SQL, tables and rows are not sets. The order of records and columns is significant, duplicates are allowed, and NULL introduces three-valued logic, all violations of the relational model which introduce bugs, inconsistencies, and a significant risk of data corruption.
Myth 2: Relational Databases are flawed. Truth: There’s no way to know since there’s never been a widely used relational database. There are some who argue that should the relational model ever be implemented in a DBMS system, it would outperform current database systems in terms of speed and stability. It seems to me this is probably true, but then, what do I know? (Though some vendors claim that they produce relational database systems, this could not possibly be true as long as such systems use SQL)
Myth 3: Systems like CouchDB are in every way better than relational database systems.
Again, there’s no way to know since there isn’t a widely available and usable relational database. But a truly relational database system would hold significant advantages over a “flat” data storage system such as CouchDB. Assuming, of course, that it is utilized properly, a relational database is essentially a large scale inference engine. Meaning that, if you can reason about a subject using formal logic, you can reason about data on the subject on a large scale using the relational algebra or relational calculus. This is amazingly powerful, but since most people never bother learning about the relational model they never take advantage of it.
Apologies if this post comes across as a bit pedantic. Hopefully though, you guys will reconsider relational databases and not smear the concept with the inadequacies introduced by a bad attempt at implementing it. (SQL)
CouchDB to be fair though, looks promising for simple small scale projects.
Breton 07 Sep 03:12
A good analogy might be claiming that Web Standards, semantic html, and separation of presentation and content are unworthy concepts because they happen to be difficult to achieve in IE6. Don’t blame the inadequacies of the implementation on the concept itself.
Jay Owack 07 Sep 08:25
@York – If you’re right then nothing will be able to hold CouchDB back in popularity. Its marketshare will climb steadily until we’re all using it. Don’t hold your breath.
York 11 Sep 00:58
Jay, I’d bet my career that CouchDB-like databases will rule the future (they already do, if you count Google’s BigTable).
That doesn’t mean that people won’t still be running Oracle 34rxi in 2020. They will. Just like lots of enterprisey institutions are still running COBOL and moving data around with EDI files today.
Actually, I’ll tone it down a bit: for applications that don’t need to scale and don’t need to be reliable, traditional relational DBs are fine, and will probably continue to be fine for the foreseeable future. Doesn’t mean they don’t suck bigtime in comparison.
Chad 11 Sep 18:11
It seems like there is a bit of a disconnect from reality here. The CouchDB supporters are praising this new db based on its scalability, but it’s a new, untested project. How can anyone claim that it is so much more reliable and easier to scale than a typical RDBMS? After you’ve used CouchDB for several years, then you get the opportunity to brag about its scalability and whatnot.
York, you’re claiming that CouchDB-like databases already rule if you count Google’s BigTable. As if Google is somehow even a dot on the map in terms of all the data stored by companies? Does AT&T store all of its customer data in BigTable? What about Verizon? What about Aflac? There are bigger companies than Google out there and they aren’t storing their files in BigTable or CouchDB. Not only that, who said BigTable is anything like CouchDB? It seems that BigTable was designed with efficient disk access in mind and is written in C. These are two key elements missing from CouchDB.
James Wu 21 Sep 22:59
SQL databases are never going to go away, but I think we will see them used less and for more specialized scenarios. They’re not necessarily suited to being the one true data store that people tend to use them as.
We process a significant percentage of payment card transactions in Asia, and for pretty much everything involved with processing transactions we use embedded or flat file databases: The performance is predictable, the problems known, and we’re not surrendering financial integrity to a third party. We can offer service level guarantees that have penalties running to hundreds of thousands of dollars per minute of downtime.
However, a lot of metrics and reporting related data does eventually end up flowing to SQL databases for mining, and frankly, I can’t see you getting decent reports without the data in a relational database.