There was a recent email to the development list from Erik Huelssman describing his experience extending LedgerSMB. In my view this demonstrates better than anything else that we are getting right what an ERP framework needs to do.
The email is reposted here with permission.
Hi all,
For one of my businesses, I needed to create an application to store some data on the services and products we provide. Basically, there would be one central table with some references to a few other tables, mostly to provide (short) selection lists. Also, there's one bigger complexity, because we wanted to link to our customer data.
The customer data is all stored in LSMB, so, developing this database inside LSMB's database seemed like a logical choice.
Since my colleague and I were completely unfamiliar with the web development framework for LSMB, we first made a start with the tools we did know (internally developed stuff, nothing particularly fancy). However, due to other priorities, things didn't quite roll forward as we wanted; it was simply too much work to get it done.
Last week we scrapped our own development. As a replacement I started looking at the tools that LSMB provides to get the same job done. Again, without prior knowledge of how this should be done, I simply studied the budgeting module. I figured that if budgeting was an extension, we might consider our own administration an extension too, and that it should probably work (roughly) the same way.
And it did: in 1.5 days I had our administration up and running, featuring a data entry screen and a search screen. The current version of the data entry screen searches for and links to a customer only if it finds an exact match. [Chris and I are working on functionality every module can use to look up cross-table references like these, so it'll be replaced by something better soon.]
The steps I had to take were really simple:
* Develop 2 main stored procedures:
- <mymodule>__search()
- <mymodule>__save()
and 1 stored procedure for each drop-down in the data-entry screen:
- <mymodule>__list_<characteristic1>()
- <mymodule>__list_<characteristic2>()
- ... etc...
* Create a class in LedgerSMB/DBObject/<objectname>.pm which contains functions callable from Perl and maps them to calls to the stored procedures in the database, as simply as:
sub search {
    my ($self) = @_;
    my $results = $self->execmethod({funcname=>'<mymodule>__search'});
    $self->merge($results);
    return $results;
}
* Create two Template Toolkit template files for my module:
- ./UI/services/entry.html
- ./UI/services/search_criteria.html
* Create a LedgerSMB 'module' <mymodule>.pl, with three entry points:
- edit
- search
- update
* Create a top level handler file by copying ./file.pl to ./<mymodule>.pl
* Add menu items
Basically, of the 1.5 days I spent, most of the time went into learning what I had to do. The actual time spent coding was negligible.
In summary: adding to or changing LedgerSMB can be extremely easy. It was a refreshing surprise to me how easy. We'll definitely develop the rest of our application using LedgerSMB's own framework and when it's done, we might be able to donate it to the project.
Anyway, I hope the above inspires others to try it out themselves. If you have any questions, both Chris and I will be glad to help you out!
Bye,
Erik.
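To make the stored procedure step in Erik's list a bit more concrete, here is a minimal sketch of what such a pair of functions might look like. The service_item table, its columns, and the function bodies are purely hypothetical; the double-underscore naming is the convention his steps rely on.

CREATE TABLE service_item (
    id serial PRIMARY KEY,
    description text NOT NULL,
    customer_id integer
);

-- Hypothetical search function: returns rows matching an optional filter
CREATE OR REPLACE FUNCTION service_item__search(in_description text)
RETURNS SETOF service_item AS $$
BEGIN
    RETURN QUERY
        SELECT * FROM service_item
         WHERE in_description IS NULL
            OR description ILIKE '%' || in_description || '%';
END;
$$ LANGUAGE plpgsql;

-- Hypothetical save function: inserts a row and returns its id
CREATE OR REPLACE FUNCTION service_item__save(
    in_description text,
    in_customer_id integer
) RETURNS integer AS $$
DECLARE
    new_id integer;
BEGIN
    INSERT INTO service_item (description, customer_id)
    VALUES (in_description, in_customer_id)
    RETURNING id INTO new_id;
    RETURN new_id;
END;
$$ LANGUAGE plpgsql;

The Perl class then only needs thin wrappers that call these functions by name, as in the search example in Erik's snippet above.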
Friday, March 2, 2012
At the edges: Relational, Hierarchical, and Key-Value models in LSMB using PostgreSQL
Introduction
It is no secret that I think of NoSQL as a kind of cool set of technologies applicable to a fairly narrow niche market. My view is that relational databases are still (and will be for the long term) the bread and butter of data management. NoSQL provides a useful set of tools for cases which have very different data requirements. There are edge cases where many NoSQL techniques make sense in a relational environment though.
Data in an ERP application can be conceptualized as sections of tables, each of which has a different balance of write vs read load. Some sections may be updated only rarely; others may prevent updates and deletes but allow inserts and selects, and receive data constantly. Data may migrate from one area to another over time, such as when the books are closed.
LedgerSMB currently uses a variety of models for different sections of the application. Settings and menu attributes (what a menu item does) are modelled using key-value modelling because that best fits, semantically, the sort of data being stored. The menu structure is modelled hierarchically, as (in 1.4) are projects, departments, and account classifications. Hierarchical modelling is currently done through a simple self-join.
Virtually all of the rest of the data is modelled using simple relations with no self-joins, and I don't see that changing. However, for the sections that use either self-joins or key-value modelling, I am always interested in seeing how these can be optimized, so when I read a blog post on NoSQL data modelling, it occurred to me that as time goes on some of the models discussed there may become more applicable to relational systems. Below I go over my thoughts on how these may or may not be helpful.
As PostgreSQL gets native JSON support, some parts may in fact be moved to that.
Key-Value and Schemaless Modelling
Key-value modelling is generally something to be avoided in relational design. There are, however, times when key-value pairs are in fact what one is modelling, and hence this is a good way forward. The best example in LedgerSMB is the settings table (called "defaults"). Here the data is semantically keys and values and hence there is no other clean way to model the data.
There are some other cases where data can be modelled well using key-value modelling, but where this is not quite so semantically clear cut. Consider the LedgerSMB menu, for example:
CREATE TABLE menu_node (
    id serial NOT NULL,
    label character varying NOT NULL,
    parent integer,
    "position" integer NOT NULL
);
and
CREATE TABLE menu_attribute (
    node_id integer NOT NULL REFERENCES menu_node(id),
    attribute character varying NOT NULL,
    value character varying NOT NULL,
    id serial NOT NULL,
    PRIMARY KEY (node_id, attribute)
);
The second table effectively models the keys and values submitted back to the application when a link on the menu is clicked. While this works acceptably as key-value modelling, it is not really clean. Ideally any common attributes would be moved to menu_node and only free-form ones would remain in this table.
But here's a case where JSON would be cleaner, per node. The data here is not updated frequently (unlike defaults); indeed it is typically written once unless a fix needs to be made, and those fixes would be cleaner if each node's attributes were all in one field. So here a NoSQL model would work very well, albeit incorporated into a relational database.
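As a rough sketch of what that could look like (a hypothetical table, not anything LedgerSMB currently ships, and it assumes the native json type arriving in PostgreSQL 9.2):

-- Store each node's submitted keys and values as one JSON document per node
CREATE TABLE menu_node_args (
    node_id integer PRIMARY KEY REFERENCES menu_node(id),
    args json NOT NULL
);

-- Assuming a menu_node row with id 42 exists; attribute names are illustrative only
INSERT INTO menu_node_args (node_id, args)
VALUES (42, '{"module": "ap.pl", "action": "add", "type": "invoice"}');

A fix to a node's behaviour then touches a single value in a single row instead of several attribute rows.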
Hierarchical Modelling
Since LedgerSMB 1.3, we have started to do hierarchical modelling in the database. See the above menu_node table as an example. 1.4 adds hierarchical business reporting units (for projects and departments) as well.
Currently these are handled entirely through simple self-joins, and no clustered indexes are added. When the menu is generated in 1.3, we use connectby() from the tablefunc contrib module (we support PostgreSQL 8.3 and above), but for trunk (which will become 1.4) we have moved to WITH RECURSIVE, which has shown a significant speed improvement (2 to 70 times, depending on various factors).
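For illustration, the WITH RECURSIVE version of a menu tree walk looks roughly like this (a simplified sketch rather than the exact query in trunk):

WITH RECURSIVE menu_tree AS (
    SELECT id, label, parent, "position", 1 AS level
      FROM menu_node
     WHERE parent IS NULL
    UNION ALL
    SELECT n.id, n.label, n.parent, n."position", t.level + 1
      FROM menu_node n
      JOIN menu_tree t ON n.parent = t.id
)
SELECT * FROM menu_tree
 ORDER BY level, "position";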
Initially, when I read the article, I thought some of the ideas might be applicable to my work with LedgerSMB, but as I think about it I am not sure. PostgreSQL's approach may perform well enough that we never have to move to something else: WITH RECURSIVE performs very well, so the major remaining option for improving performance (namely clustered indexes) may be sufficient.
Outside of clustering on indexes, I am hard-pressed to think of ways to optimize hierarchical lookups that would significantly beat other forms of hierarchical storage. Perhaps someone could explain the benefits of storing child node IDs and the like in the table; I sure don't see it.
OLAP
In addition to standard relational operations on a well-normalized database, there are also cases where analytics need to be run relatively quickly and in semi-flexible ways, over increasingly large datasets, and they must perform well.
Unfortunately in an ERP application, there is rarely a hard distinction between analytic and transactional workloads. Bank account reconciliation may enter data, but the workload depends on getting good data regarding reconciled transactions from the past, and that means aggregating potentially large datasets, and those datasets could increase in size over time.
The first forays into this area for LedgerSMB in fact were aimed to solve mixed transactional/reporting workflows. Our current approach uses "checkpoints" which effectively close the books on all previous dates. When one closes the books the checkpoints are created, and from that point forward, no transactions can be entered for previous dates.
This allows us eventually to tie a number of other pieces of information to checkpoints. Various reporting views could be materialized on closing (even asynchronously), and thus provide better performance for all kinds of reports on these static data sets.
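As a sketch of the idea (the table and column names here are hypothetical stand-ins, not the actual LedgerSMB schema): when a checkpoint is created, per-account balances as of the closing date could be written out once, so later reports read precomputed rows instead of re-aggregating the whole history.

CREATE TABLE account_checkpoint_sketch (
    checkpoint_date date NOT NULL,
    account_id integer NOT NULL,
    balance numeric NOT NULL,
    PRIMARY KEY (checkpoint_date, account_id)
);

-- Run at closing time; ledger_line_sketch stands in for the real journal table
INSERT INTO account_checkpoint_sketch (checkpoint_date, account_id, balance)
SELECT DATE '2012-02-29', account_id, sum(amount)
  FROM ledger_line_sketch
 WHERE transdate <= DATE '2012-02-29'
 GROUP BY account_id;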
These areas have not really been fully explored in LedgerSMB yet. However for some upcoming features (such as graphs in reports), I think it will be necessary to offer this sort of thing. Referential integrity constraints can tie a set of reports to a specific checkpoint so that the reports cannot get out of sync when books are re-opened (which happens on rare occasion).
Saturday, February 25, 2012
Application vs Database Programming
A few years ago, I had a problem. A database routine for processing bulk payments for LedgerSMB that was humming along in my test cases was failing when tested under load with real data prior to implementation. Some testing showed that while it ran fine with small numbers of inputs, it eventually slowed way down as the number of inputs rose. Andrew Sullivan, who was also on the project through a different company, suggested it was something called a "cache miss" because I was running through a loop, inserting two rows, and then updating a different table. Although he quickly abandoned that theory, his initial assessment was dead right and within a day I had a fix to the problem.
That experience taught me why database and application programming are so different. I have since seen the same mistake I made in contributions by many other developers, so I have concluded that it is a universal mistake.
When we program applications, we think in terms of instructions. We break a problem down into instructions, and we order those to get the right result. We usually favor simple, elegant code over more complex code and therefore tend to make things as simple as possible. This is what I was doing, and it failed miserably. I have also had the misfortune of looking through several-hundred-line stored procedures that were obviously written by application engineers and lacked any ability to think in db-land. Such stored procedures are not maintainable and usually contain a lot of hidden bugs.
In database-level routines, however, it is extremely important to think in set operations. A clear, complex query will be both easier to read and easier to maintain than a set of smaller, simpler queries. The database is an important math engine; use it as one. Consequently, simplicity and elegance in a stored procedure is often the ability to express the most complicated operations in the fewest database queries possible. Because SQL queries are highly structured (a SELECT involves a column list, followed by a set of tables linked by JOIN operations, followed by filter operations, etc.), it is relatively easy to track down problems, whether those are bugs or performance issues.
With smaller operations, like I had been doing, the database cannot optimize the operation as a whole, and one may have to contend with problems caching results in an effective way, or the like. Worse, the operations are not linked together in a predictable way and therefore problems are harder to track down when they involve specific inputs or concurrent operations.
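To make the contrast concrete, here is a deliberately simplified sketch (the tables are hypothetical and this is not the actual payment routine; the DO block needs PostgreSQL 9.0 or later). The first version loops row by row, so the planner never sees the operation as a whole; the second expresses the same work as one statement per table, assuming one pending payment per invoice.

-- Row-at-a-time, the pattern that bit me (PL/pgSQL inside a DO block):
DO $$
DECLARE
    r record;
BEGIN
    FOR r IN SELECT invoice_id, amount FROM pending_payment_sketch LOOP
        INSERT INTO payment_line_sketch (invoice_id, amount)
        VALUES (r.invoice_id, r.amount);
        UPDATE invoice_sketch
           SET paid = paid + r.amount
         WHERE id = r.invoice_id;
    END LOOP;
END;
$$;

-- Set-based: the database can plan and execute each statement as a whole
INSERT INTO payment_line_sketch (invoice_id, amount)
SELECT invoice_id, amount FROM pending_payment_sketch;

UPDATE invoice_sketch i
   SET paid = paid + p.amount
  FROM pending_payment_sketch p
 WHERE i.id = p.invoice_id;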
The way we think when we code SQL (including PL/PGSQL) and when we code Perl, Python, or C is just fundamentally different.
Wednesday, February 15, 2012
Thoughts on LedgerSMB 1.4
LedgerSMB 1.4 is now the primary development priority of my business and some others too. There are a number of important developments already a part of the project, and more that are on their way.
LedgerSMB 1.4 will require PostgreSQL 8.4 or higher. In 1.3, we made use of the connectby() routine in the tablefunc contrib module (now an extension in 9.1), but with the release of 9.1 this is hard to manage, and WITH RECURSIVE (which was new in 8.4) does the job at least as well. WITH RECURSIVE and connectby() have different traps and problems: the former is susceptible to infinite recursion while the latter handles some cases painfully slowly, so there is a tradeoff. However, we feel it is worth it. Even in the fastest cases, WITH RECURSIVE runs twice as fast as connectby().
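For the infinite-recursion trap, the usual guard is to carry the visited path along in the recursive term, roughly like this (node_sketch is a hypothetical self-joined table standing in for the real ones):

WITH RECURSIVE tree AS (
    SELECT id, parent, label, ARRAY[id] AS path
      FROM node_sketch
     WHERE parent IS NULL
    UNION ALL
    SELECT n.id, n.parent, n.label, t.path || n.id
      FROM node_sketch n
      JOIN tree t ON n.parent = t.id
     WHERE n.id <> ALL (t.path)   -- refuse to revisit a node already on the path
)
SELECT * FROM tree;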
The first major improvement was to make the change above and thus remove all dependencies on contrib modules or extensions. This will make installation, upgrades, and the like far smoother than they were for 1.3. It also provides a standard way for us to handle tree-like structures.
Along the same lines, we are working on replacing the current project and department tracking interface with a more extensible one we call business reporting units. Business reporting units can be departments or projects, but they can also be engineered-to-order jobs, funds (used by non-profits), or other links to the line items on an invoice. Generalizing this lets us provide a much more useful reporting system, one which can be used in markets where LedgerSMB is not yet suitable.
Business reporting units will support nested units. This will allow reporting on a project or a subproject, on a department or a sub-department, etc., and WITH RECURSIVE will once again be a primary element of this infrastructure.
We have already added budget logic, and there is more that is coming. There is a basic payroll framework (without locale-specific code) which is largely done. And of course new reports. There is some talk about adding forecasting as well.
Wednesday, February 8, 2012
Further Response to Robert Young
Robert Young has responded to my previous post and clarified his position. However, he is still incorrect in his assessment.
Robert Young writes:

"The issue is lost cycles as one moves down to a single thread. It's more about the number of clients that can be supported 'simultaneously'. At one time I was also enamoured of thread count to execute queries in parallel; not so much now. For reporting generally, and BI/DW specifically, sure. But for OLTP, where my interest lies, not so much. There, client count is what matters. Clients run relatively discrete transactions on few rows; parallelism isn't much help for that sort of client. Yes, to some extent I've changed my mind."

I think this makes the mistake of assuming that single-threaded, multiple-process models cannot utilize hyperthreading architectures to increase computational efficiency.
The first thing to note is that how CPU cycles are utilized is primarily the job of the scheduler, which is part of any multitasking kernel, and different operating systems use very different process and thread models. These cannot simply be reduced to their hardware equivalents. If they could, we would see process vs thread performance being largely equivalent on Windows and Linux; in fact Windows is very thread-friendly and process-unfriendly, though my understanding is that applications on Windows running under SUA see better multiprocess performance than under the Win32 subsystem.
On Linux and UNIX there is very little difference between a pthread and a process except for memory isolation. Individual processes can be scheduled on logical cores supporting hyperthreading, for example. The overhead for supporting multiple processes is low.
On the other hand, on Windows, threads are cheap and processes are expensive. This is one reason why I typically recommend that high-concurrency PostgreSQL instances are not run on Windows. My understanding is that SUA addresses this issue but I can't find solid comparisons at the moment and I don't know of a PostgreSQL port to that platform.
He continues:

"Think of the issue this way: if it were true that scaling down threads to one per processor/core yielded a scale up in performance equivalently (only one of two threads yields twice the performance on that last thread), then the question gets more difficult. But threads and cores don't scale that way. Such reverse scaling could only happen if the core's frequency increased as the inverse of threads active. That just doesn't happen, you get a bin or two; that nasty frequency brick wall is the reason vendors have turned to multi-core and threads in the first place."

I am not sure I buy that. Linux, for example, is quite happy to schedule different processes on different logical cores, meaning that two processes can share a hyperthreading core in the same way two threads can. Again, that behavior is specific to a combination of CPU, OS, and scheduler.
He goes on:

"Unless you've been building database applications from before 1995 (or thereabouts), which is to say only webbie thingees, the efficiency and user friendliness of *nix/RDBMS/RS-232/terminal is a foreign concept; just as it still is for mainframe COBOL coders writing to 3270 terminals, same semantics (I'll grant that javascript is more capable than 3270 edit language, and Wikipedia does it again, with this write up). In the web age, AJAX was the first widely known (and possibly absolute first) attempt to get back to that future, then there is Comet. Now, we have WebSocket, discussed in earlier posts."
I don't see what this has to do with anything. I personally find the user friendliness of an RDBMS and *nix to be very important. I also primarily build accounting apps, and I will tell you that doing so over the web is often not very performance-friendly, for reasons I am more than happy to discuss in another post (it's one reason why, in the long run, I want to move away from requiring that everything go through a web interface). But basically, transaction handling is somewhat crippled when tied to a stateless network protocol like HTTP: all transactions must be atomic within a single HTTP request, and where we would normally not do that, we have to invent ways to get around the limitation.
And further:

"The Achilles heel of those *nix/RDBMS/RS-232/VT-X00 applications was the limit on the number of clients that could be accommodated since the client patch remained while the user kept a session. The emerging hardware/infrastructure that I'm discussing may/should remove that restriction. This also means that developers must heed the mantra, 'the user doesn't determine transaction scope'. With HTTP (un)connectedness, that isn't much of an issue."
This is the wrong way to look at it. HTTP unconnectedness actually makes the problem worse rather than better in many cases, if you have to preserve db state across HTTP requests (which LedgerSMB, by the way, does in some cases). It's one thing to have 15 connections updating payments in a db with 10 million transactions and relying on advisory locks. It's a very different thing to have 15 users hitting a db with 10 million transactions and having to store in the db which transactions each of them is temporarily holding for payment. Now suddenly you go from 3-second queries to 45-second queries, and I would expect that we could support at least 10x the number of concurrent users in a large db if it weren't for this unconnectedness. Now you are dealing with a large amount of I/O contention and row locking (because we have to make sure we get the right information so invoices aren't double-paid), and while it works, it isn't pretty, nor does it perform particularly well.
This being said, the DB is usually even less of an issue in these cases than the web server...
Moreover, web apps, particularly those that run public sites, can experience traffic spikes that easily exceed the number of connections a db can be expected to handle gracefully. So in both cases, simple public-facing sites like blogs and complex web-based business apps, RDBMS-backed web apps fare worse performance-wise than thick clients.
But the overall point remains the same: scheduling of processes and threads is the OS's job, not the CPU's. Wasted cycles in scheduling are the OS's problem, not the CPU's. If an OS can't schedule processes as efficiently as it schedules threads, that's the OS's constraint, and it is only the app's problem if you are writing with an OS in mind and your OS doesn't behave the way you would like it to. Windows, in particular, is heavily oriented towards threads. It isn't clear that most *nix platforms see the same benefit, as witnessed by Oracle's architecture which assumes one process per connection.
Also a note on DB2's connection multiplexing. It isn't clear that this actually requires multiple threads to do. The fact is you can only receive a string of data over a connection. This has to be parsed and handled. It's not at all clear that this needs threads when it is clear that regardless of how you do it, it needs a queue.
BTW, PostgreSQL parallel performance is getting a LOT of attention in the move from 9.1 to 9.2.
Monday, February 6, 2012
Robert Young is Wrong about Threads and PostgreSQL
Robert Young has claimed, not once but twice, that a multi-threaded engine is necessary for database performance to be good enough. I am going to explain here why he is wrong.
Mr Young's argument more or less goes like this: threads are cheaper, in processor terms, to spin up than processes, and therefore if you insist on a single-threaded, multi-process model, you can't have sufficient performance to make your database system relevant.
There are several important problems with this thinking. The first is that for it to matter, the thread vs process overhead, plus related IPC and so on, must be a significant portion of actual CPU load. If it is not, the difference largely gets lost in the noise. In my experience looking at databases, I have yet to identify a case where it is. Instead, things like network connection setup and teardown, command parsing, and, well, actual work dwarf process/thread startup cycles by several orders of magnitude. This means that the actual startup processing overhead is likely insignificant in the overall scheme of things.
Of course, actually starting up a process or thread is not all there is to the picture. There's communication between processes or threads to take into account. There is the added complexity of locking shared resources in a threaded vs process model. And there are the costs that come with a lack of process isolation on the other side. These all have to be weighed against each other, and it is not clear who the winner is. Even these costs will dwarf the startup costs by a significant margin.
Rather, the reason many RDBMSs have gone to a multi-threaded model is that it is one way, perhaps even an elegant way, to support intraquery parallelism. In other words, if you can spin up a few more threads in your process, those threads can be conducting scans on different partitions in different tablespaces, thus increasing the overall resource utilization of a given query. This is an important feature and it is currently lacking in PostgreSQL. Postgres-XC, however, is a project that offers such a solution based on a multi-server, distributed model. It is perhaps not quite ready for prime time, but it is getting there.
But while a multi-threaded engine would make this sort of parallelism easier to implement, it isn't clear to me that either it is necessary or that the cost in complexity would be worth it compared to multi-process and even multi-server models. Threaded systems are far harder to troubleshoot, far less transparent, and far more brittle than process-oriented systems.
Note that each approach has strong points and weak points. Declaring a specific architecture as a clear winner forgets that these differences exist. Ideally eventually PostgreSQL would support multi-server and multi-process methods of intraquery parallelism. I am not convinced that we need to support a multi-threaded method in addition.
Update: Robert Young also links to an IBM piece talking about multiplexing network connections. However the additional overhead here is going to be specific to the case where you have enough simple queries on their own connections to justify the additional complexity. Yes, you can write benchmarks where you see a small but measurable benefit. However, it is not clear that these benchmarks necessarily map to any significant portion of real-world scenarios. Certainly in the sorts of programming I do, I would expect no performance gain whatsoever to such a model.
Sunday, February 5, 2012
LedgerSMB talk at FOSDEM
Hi,
I just arrived at FOSDEM - Brussels. Due to last minute arrangements, my talk about LedgerSMB isn't in the schedule. So, here's the announcement:
There will be a LedgerSMB talk in the Perl devroom (AW1.121)
between 12.25 and 12.45
The talk is on the fosdem site: http://fosdem.org/2012/schedule/event/ledgersmb_perl
With kind regards,
Erik.