Tuesday, March 13, 2012

3 Things I Wish PostgreSQL Did

There are times when working with PostgreSQL that I wish it could do things that neither it nor other relational databases can do at present.  In my view (and I will defend each of these feature requests), these all fit well within a relational-db-centric world.


3:  XML/JSON to tuple type

One of the problems one runs into in a stored-procedure-centric application is managing the input and output of the application.  An ability to convert XML or JSON to a tuple type would help greatly here (remember that tuples can have members that are themselves tuples, or arrays of tuples).  The formats are semantically equivalent, so why not allow conversion?
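A sketch of the kind of conversion I mean; the function name json_to_tuple and the types below are invented for illustration, and nothing like this exists today:

```sql
-- Hypothetical: json_to_tuple() is an invented name, not an existing function.
CREATE TYPE line_item AS (sku text, qty integer, price numeric);
CREATE TYPE invoice AS (id integer, lines line_item[]);

-- Desired: parse the document straight into a nested composite value,
-- using the target type to drive the mapping.
SELECT json_to_tuple(
    '{"id": 1, "lines": [{"sku": "A-1", "qty": 2, "price": 9.99}]}',
    NULL::invoice
);
```

The reverse direction (tuple to JSON or XML) would make round-tripping stored-procedure arguments and results straightforward.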


2:  Nested Namespaces

Managing hundreds or thousands of stored procedures in a flat namespace can be a problem.  We end up creating semantically hierarchical function names, or hierarchically named namespaces, in a flat-namespace world.  Even if it cannot be done for tables, there ought to be some way to do this for functions.

Part of the problem here is that the SQL standard gives semantic value to the number of elements in a qualified name, so this would have to be addressed.  It is not an easy problem to solve and it may not have a solution.  Perhaps a package delimiter within a namespace would be helpful?  Maybe a character like # or $?
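As a sketch of what I mean, assuming $ were allowed as a package delimiter (this syntax is invented and does not work today):

```sql
-- Hypothetical syntax: '$' as a package delimiter within a schema name.
CREATE SCHEMA lsmb;
CREATE SCHEMA lsmb$crm;   -- logically nested under lsmb for management purposes

CREATE FUNCTION lsmb$crm.save_person(in_name text) RETURNS void
    LANGUAGE plpgsql AS $body$ BEGIN NULL; /* real body elided */ END; $body$;

-- Name resolution would still see a single schema level, so the standard's
-- catalog.schema.object rules would not be violated.
SELECT lsmb$crm.save_person('A. Smith');
```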

1:  Rich Declarative Constraints for Accounting Applications

It seems strange to me that accounting applications have been one of the primary uses of RDBMSs since their inception, and yet there is no form of declarative constraint for ensuring that transactions are balanced.

Whether a transaction is balanced is fundamentally a set-based question.  A transaction is balanced when the sum of the debits of its rows equals the sum of the credits of its rows.  In LedgerSMB, debits are negative amounts and credits are positive amounts (they are just presented to the user as fundamentally different).  So I would like the ability to do something like:

CHECK FOR EACH TRANSACTION (SUM(amount) = 0) GROUP BY trans_id;

Part of the problem I suppose is that getting this right in a row-locking environment is hard.  However, in an MVCC environment it should be possible to check if this constraint matches the deleted and inserted rows, and if it does (0 + 0 = 0), we know we are balanced (deletions would only happen for unapproved transactions, i.e. ones that have not yet hit the books).  Such a check would only fire once per transaction, and only check rows modified by the transaction.  It therefore shouldn't be problematic in the way an aggregate check would be over the entire table.
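For comparison, the closest thing available today is a deferred constraint trigger that re-sums the affected transaction at commit time.  A minimal sketch, where the table and column names (acc_trans, trans_id, amount) are assumed for illustration:

```sql
-- Illustrative workaround: table and column names are assumed, not prescribed.
CREATE OR REPLACE FUNCTION check_transaction_balanced() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    IF (SELECT COALESCE(sum(amount), 0)
          FROM acc_trans
         WHERE trans_id = NEW.trans_id) <> 0
    THEN
        RAISE EXCEPTION 'transaction % is not balanced', NEW.trans_id;
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$;

-- DEFERRABLE INITIALLY DEFERRED postpones the check to commit time.
CREATE CONSTRAINT TRIGGER enforce_balance
    AFTER INSERT OR UPDATE ON acc_trans
    DEFERRABLE INITIALLY DEFERRED
    FOR EACH ROW EXECUTE PROCEDURE check_transaction_balanced();
```

Note that this still fires once per modified row rather than once per transaction, and re-scans each affected trans_id, which is part of why a true declarative per-transaction check would be preferable.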

The other two are nice to have, but if PostgreSQL could do this very well, it would be the king of databases for accounting software, and I am sure the feature would find new uses elsewhere.

So, what's on your list?

18 comments:

  1. Actually in progress already, and a big boost to your #3 is 1st class citizenship for plv8, for a few key reasons:

    1) It's very fast, leveraging the smarties at Google, and programming in plpgsql is limited compared to a prototype-based language like JavaScript.

    2) There are a lot more JavaScript programmers than pgsql programmers, and that's not going to change.

    3) It could very well be THE reason pg becomes more popular than mysql for web dev. Being a more Turing-complete database should be enough, but it's not. From my own experience, we have always used mysql for web apps because it's proven and we knew we could count on hiring web devs with experience with it. Everything has its pros and cons, but it has worked very well for us. But one night not too long ago, plv8 caught our attention and we haven't looked back. I look forward to writing a blog post soon about using pg/plv8 for web apps over mysql.

    1. Troy: plv8 is interesting but I don't see how it provides the sorts of declarative constraints I am asking for here. Also I think that most stored procedure languages run into issues regarding thinking in sets vs instructions (see my recent post about application vs database programming). These are avoidable, but throwing JS programmers at the database doesn't do it.

      I think that plv8 however might be very helpful in boosting #1. Just not at all convinced it has anything to do with #3.

  2. @3 - do you know constraint triggers? http://postgres.cz/wiki/PostgreSQL_SQL_Tricks#Deferred_constraints

    @2 - I am not sure it makes sense - the flat model we use is adequate for SQL - it is simple, and therefore simple to use.

    1. Pavel: What I am looking for is an ability to do things in a declarative manner. I am not sure triggers work because you really need to have the set of all deleted records at once. The information is specific to the set of affected tuples, not to any given tuple. The same goes with the inserted rows.

      The current workaround for #3 is to do all inserts through stored procedures which can check these things. However, it would be nice to make such declarative rather than imperative.

      As for #2, the problem really has to do with managing very large numbers of stored procedures, and the fact that comparatively few non-alphanumeric symbols are usable in names by default. Having some sort of nested namespace for management of these sprocs would be very helpful.

      You are right that a simple flat namespace does work, but then we end up with kludges like using double underscores to create something like a nesting hierarchy, which is problematic for a few reasons, including readability.

    2. Chris: @3 - the standard has "CREATE ASSERTION", but nobody has implemented it in PostgreSQL

      @2 - there is a clear limit - ANSI SQL - but some years ago we used "_" with success. We had names like application_output_xml.procedure

    3. Pavel: First on #2, I recognize that there are issues with ANSI or ISO SQL, but there may be ways of creating extensions to ISO SQL that don't violate those standards. We do this with various clauses as it is.

      But the point is that using things like underscores means you have to make sure your namespaces use only a single word. This can be problematic if you want to be more descriptive.

      On to #3. The problem with CREATE ASSERTION is that you still have the issue of having to scan the entire table to check the assertion when the transaction commits. That may work well on a small table. It does not work well when dealing with tables that have, say, ten million rows.

      What I am proposing is something that allows the checking of rows inserted or deleted (remember an update is a delete plus an insert in a MVCC world) such that they meet certain criteria as a set. Triggers don't work for this well because you can only evaluate one row at a time. Aggregative checks don't work well because they have to evaluate large numbers of rows which aren't applicable to the current transaction. CREATE ASSERTION gets us close with regard to small tables. But for large tables that add, maybe, 1 million rows or more per year, there are few ways to keep this performing well in an environment with decent load over time. In other words, from what I read, it still creates an assertion on the *table* rather than on the *set of rows affected by the transaction.*

      The only workaround that works currently is to implement this in an imperative rather than a declarative way. What I am suggesting is that it would be great to have declarative constraints that work well on sets of modified data, ignoring non-modified data. This would incur no more I/O than necessary.

    4. BTW, for #2, the closest thing we have right now is the extensions system for 9.1. It helps a great deal, but I worry about things like name collisions if these become widely used for packaging applications and extra functionality.

  3. This comment has been removed by the author.

  4. Chris, am I right in suggesting that what you want to do for #3 is something along these lines:

    For accounting it is important that for double-entry bookkeeping you can do things like make sure that all the accounts add up to zero all the time. So, in simplistic terms, "SELECT sum(balance) FROM account" would equal 0 at the point a commit is issued.

    For small tables it's fine to run a complete check of all records but as the table grows to millions of rows that the performance penalty of this check becomes too large to execute at the table level.

    Because the accounts always have to add up to zero this means that all the changes within a single commit also have to add up to zero, and that if there was a mechanism to check just the changes then this would be a far more efficient solution.

    Currently it is only possible either to check the whole table at a time, or to pass _all_ changes to the "accounts" table through a stored procedure. Whilst the latter works, it's an inelegant solution to a common problem.

    I too am storing account information in a similar manner within Postgres so if we could declare a COMMIT wide constraint then that would certainly help with this scenario.

  5. Forgive the horrible bastardisation of existing syntax, but something along these lines:

    CREATE CONSTRAINT TRIGGER example
    AFTER COMMIT ON accounts
    FOR EACH CHANGED ROW sum(balance) = 0

    1. Just declare the CONSTRAINT TRIGGER as DEFERRABLE INITIALLY DEFERRED, which means it is checked at commit time:

      CREATE CONSTRAINT TRIGGER name AFTER event [ OR ... ]
      ON table
      DEFERRABLE INITIALLY DEFERRED
      FOR EACH ROW
      [ WHEN ( condition ) ]
      EXECUTE PROCEDURE function_name ( arguments )

    2. Jaime: I am not convinced that works, because we aren't interested in the rows individually. We can't determine whether to raise an exception based solely on the values within a single row.

      Instead the properties we are interested in arise from the set of rows affected by the transaction. There is currently no way to do this.

      It wasn't obvious that Chris was talking about this in his top-level reply to my post above, but in his previous reply he in fact makes it very clear that he is on the same page as me on this.

  6. Chris,

    The issue with namespace nesting has been with us since SQL92. The basic issue is that the SQL committee chose a single level of namespacing. This drives our object identifier resolution, which means that additional levels of namespacing break things. For example, imagine that you have:

    schema accounts
    schema customer
    schema customer/accounts
    function accounts.balance()
    function customer/accounts.balance()

    ... and all of those are in your search_path.

    Now, if you call accounts.balance(), what do you get? Nested namespaces take something which is deterministic in the current implementation and either make it non-deterministic, or force you to namespace-qualify everything, which would break a lot of code.

    Saying that you can't have namespace name collisions isn't helpful either, since the whole purpose of nesting namespaces is to allow name collisions.

    This issue comes up every time anyone discusses Oracle packages. Nobody has come up with a solution yet. It's a real problem, and one worth solving, but it's a hard problem.

    1. Yeah, it's a hard problem. However, actually I am less interested in it for allowing namespace collisions than I am for easy management of large numbers of stored procedures.

      I suppose an alternative is to use logically hierarchically named schemas (f. ex., lsmb, lsmb_crm, lsmb_crm_person, lsmb_crm_company, lsmb_crm_eca), and then create stored procedures to manage these. The problem is that I would really like to call lsmb_crm_eca lsmb_crm_credit_account instead, but then the underscore becomes ambiguous.

      I guess I'd be happy with at least one other non-alphanumeric character being allowed :-P.....

  7. As for the XML: please check http://www.pgxn.org/dist/pg_xnode/0.6.1/
    It's a project that I'm trying to start. Still quite an early pre-release, but it already has a function that you might like:

    node[] xml.children(node)

    where the function can be recursively applied on each item of the result array.

    Any feedback is appreciated.

    1. Looks like an interesting project. I guess what I would like would be an easy way to translate one data type to the other. Might not be quite what I need but seems worth watching.
