Monday, February 6, 2012

Robert Young is Wrong about Threads and PostgreSQL

Robert Young has claimed, not once but twice, that a multi-threaded engine is necessary for database performance to be good enough.  I am going to explain here why he is wrong.

Mr Young's argument goes more or less like this: threads are cheaper, in CPU terms, to spin up than processes, and therefore if you insist on a single-threaded, multi-process model, you can't have sufficient performance to make your database system relevant.

There are several important problems with this thinking.  The first is that for it to matter, the thread-vs-process overhead, plus related IPC and so forth, must make up a significant portion of actual CPU load.  If it does not, the difference largely gets lost in the noise.  In my experience looking at databases, I have yet to identify a single case where it does.  Instead, things like network connection setup and teardown, command parsing, and, well, actual work dwarf process/thread startup cycles by several orders of magnitude.  This means that the startup overhead is likely insignificant in the overall scheme of things.
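To put rough numbers on the claim, here is an unscientific sketch of my own (not a rigorous benchmark), assuming a POSIX system where `os.fork` is available.  It times thread creation against process creation; both typically come in somewhere between microseconds and a few milliseconds per startup, which is small next to the cost of parsing and executing a real query over a network connection.

```python
# Rough, illustrative timing of thread startup vs. process startup.
# Assumes a POSIX system (os.fork is not available on Windows).
import os
import threading
import time

def time_thread_spawn(n=100):
    """Average seconds to start and join one trivial thread."""
    start = time.perf_counter()
    for _ in range(n):
        t = threading.Thread(target=lambda: None)
        t.start()
        t.join()
    return (time.perf_counter() - start) / n

def time_process_spawn(n=100):
    """Average seconds to fork a child that exits immediately."""
    start = time.perf_counter()
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            os._exit(0)   # child does no work and exits at once
        os.waitpid(pid, 0)
    return (time.perf_counter() - start) / n
```

Whatever the exact ratio on a given machine, the point is that both figures are dwarfed by the per-query work a real engine does, which is why the difference gets lost in the noise.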

Of course, actually starting up a process or thread is not all there is to the picture.  There is communication between processes or threads to take into account.  There is also the added complexity of locking shared resources in a threaded versus a process model.  And there are, of course, the costs that come with a lack of process isolation on the other side.  These all have to be weighed against each other, and it is not clear which side wins.  In fact, even these costs will dwarf the startup costs by a significant margin.

Rather, the reason why many RDBMSs have gone to a multi-threaded model is that it is one way, perhaps even an elegant way, to support intraquery parallelism.  In other words, if you can spin up a few more threads in your process, those threads can conduct scans on different partitions in different table spaces, thereby increasing the overall resource utilization of a given query.  This is an important feature, and it is currently lacking in PostgreSQL.  Postgres-XC, however, is a project that offers such a solution based on a multi-server, distributed model.  It is perhaps not quite ready for prime time, but it is getting there.
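To make the idea concrete, here is a toy sketch of intraquery parallelism, assuming partitions are simply in-memory lists of rows (a real engine, of course, scans partitions in table spaces on disk): one worker per partition runs the scan, and the partial results are combined.  This uses processes rather than threads, which is exactly the point -- the parallelism does not require a threaded engine.

```python
# Toy intraquery parallelism: scan several "partitions" concurrently
# in separate worker processes and merge the partial results.
from multiprocessing import Pool

def scan_partition(rows):
    # Stand-in for a sequential scan applying a filter predicate.
    return [r for r in rows if r % 2 == 0]

def parallel_scan(partitions, workers=4):
    """Scan each partition in its own worker process, then merge."""
    with Pool(workers) as pool:
        partial_results = pool.map(scan_partition, partitions)
    return [row for part in partial_results for row in part]
```

For example, `parallel_scan([[1, 2, 3], [4, 5, 6]], workers=2)` scans the two partitions concurrently and merges the matches into `[2, 4, 6]`.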

But while a multi-threaded engine would make this sort of parallelism easier to implement, it isn't clear to me either that it is necessary or that the cost in complexity would be worth it compared to multi-process and even multi-server models.  Threaded systems are far harder to troubleshoot, far less transparent, and far more brittle than process-oriented systems.

Note that each approach has strong points and weak points.  Declaring a specific architecture the clear winner ignores that these differences exist.  Ideally, PostgreSQL would eventually support both multi-server and multi-process methods of intraquery parallelism.  I am not convinced that we need to support a multi-threaded method in addition.

Update:  Robert Young also links to an IBM piece about multiplexing network connections.  However, the overhead saved there is specific to the case where you have enough simple queries on their own connections to justify the additional complexity.  Yes, you can write benchmarks where you see a small but measurable benefit.  However, it is not clear that those benchmarks map to any significant portion of real-world scenarios.  Certainly in the sorts of programming I do, I would expect no performance gain whatsoever from such a model.


  1. > threads are more efficient, processor-power-wise to spin up than
    > processes, and therefore if you insist on a single-threaded multi-process
    > model, you can't have sufficient performance to make your database system
    > relevant.

    Not to mention that he's missing that you can re-use the same process for new sessions -- what's commonly called connection pooling. That mostly removes the process spawning overhead, and it's already very widely used in PostgreSQL deployments.

  2. Those who have been following DrCoddWasRight for a while understand that this is about the evolution of hardware architecture and web infrastructure and how RDBMSs fit in. All (most?) of the others have seen this co-evolution and gone to threaded engines. They didn't do it for yucks. I'll have more to say over there later.

    As to connection pooling: it was invented to deal with the (un)connected semantics of HTTP, and is now largely implemented in web servers. Connection pooling is exactly what you don't want in the future, as I expect it to play out.

    1. Robert: I didn't say they did it for yucks. I said they did it because it is one way, perhaps even an elegant way, to support intraquery parallelism. Intraquery parallelism is an important feature, and one PostgreSQL currently lacks.

      Interestingly, Oracle, outside of Windows, uses one process per connection. If that process is multithreaded, it is only multithreaded to support intraquery parallelism.

      Similarly multiplexing a connection really only makes sense when connecting from certain forms of middleware, and IBM specifically warns against doing so on some platforms. This is an area which may speed up certain narrow kinds of environments, but it isn't clear how many environments see a significant benefit over old-school connection pooling.

      The problem is that threading isn't something magical which gives you all benefits and no problems. Rather it is something of a tradeoff: give up more robust process isolation in exchange for the possibility of a performance improvement. Consequently I am less than convinced that it is as necessary as you say.

  3. This comment has been removed by a blog administrator.

  4. Apart from parallel query execution and connection speedup, another area where threads are simpler is shared memory. The database buffers need to live in shared memory, which is tough to access, share, and manage across processes, while a thread gets it in a very simple way. And in new-generation servers we are seeing very large memory.

    1. An RDBMS can't use just one process anyway -- which means it must use shared memory between processes in any case.

      And if a thread crashes, it takes down the parent process along with all the other threads in that process.

      For an RDBMS doing a lot of calculation (not only reads/writes), that's a critical restriction.
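The shared-memory point the commenters are debating can be illustrated with a toy sketch, assuming only the Python standard library (PostgreSQL itself uses System V / mmap shared memory, not this API): `multiprocessing.Array` allocates one block of shared memory that every forked worker process can read and write, guarded by a lock -- the same basic shape as a multi-process engine's shared buffer pool.

```python
# Toy shared "buffer pool" for a multi-process model: one shared
# memory segment, written by several worker processes under a lock.
from multiprocessing import Array, Process

def worker(buffers, idx, value):
    # Each worker writes into its own slot of the shared segment;
    # the lock serializes access, as a real buffer manager would.
    with buffers.get_lock():
        buffers[idx] = value

def demo():
    buffers = Array('i', 4)   # four shared integer "buffer slots", zeroed
    procs = [Process(target=worker, args=(buffers, i, i * 10))
             for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return list(buffers[:])   # -> [0, 10, 20, 30]
```

Managing such a segment across processes is indeed more ceremony than threads sharing an address space for free -- which is the tradeoff the comment thread is circling: simpler sharing versus the isolation that keeps one crashed backend from taking down the rest.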