Thursday, October 11, 2012

NoSQL-like Approaches to PostgreSQL, a reply to Martin Fowler

The major point of many of Martin Fowler's blog posts, repeated in his book, is that NoSQL represents a movement away from integrating on the database and towards encapsulating databases in applications and using services to integrate data.  However, the fact that higher-end RDBMS's are not always used that way says more about the market than about the RDBMS technology itself.  For the last five years I have been building relational databases along a similar approach, using web services as an inspiration for interfaces.  This post discusses a few different approaches to doing this on PostgreSQL.  Note that other RDBMS's may differ in many of these regards.

The Promise of the Relational Database Model


Relational databases are designed to store information in application-neutral formats and then present it in various ways to various applications.  The goal is, as Fowler puts it, the "integration database."  To facilitate this goal in a maintainable way, most RDBMS's come with solid sets of features designed to encapsulate physical storage behind application-specific presentations of the stored data.  These include views, functions, and stored procedures.  These features allow a database engineer to build databases which encapsulate the physical storage, so that the database management system essentially serves out an information model through a myriad of client-specific interfaces.  This achieves what Fowler is advocating, namely encapsulating the database in the application, but it allows for declarative access to that API.   At least in theory, there is no difference between what Fowler advocates for NoSQL and what many DBA's do with an RDBMS.  This also suggests that Fowler is on to something important, but that ditching the RDBMS may in many cases not be the answer.
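As a minimal sketch of what I mean (all names here are hypothetical), the base table stays private while applications talk to a view and a function:

CREATE TABLE customer_data (
    id      serial PRIMARY KEY,
    name    text NOT NULL,
    created timestamptz NOT NULL DEFAULT now()
);

-- Read interface: a presentation of the data, not the storage itself.
CREATE VIEW customer AS
    SELECT id, name FROM customer_data;

-- Write interface: the physical layout can change without breaking callers.
CREATE FUNCTION customer_save(in_name text) RETURNS customer AS
$$
    INSERT INTO customer_data (name) VALUES ($1)
    RETURNING id, name;
$$ LANGUAGE sql;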

Why the RDBMS Rarely Delivers on That Promise


For all the theory though, most RDBMS's are rarely used in this way.  I think there are several factors that typically weigh against such encapsulation:

  • Investment (expertise and license costs) in a specific RDBMS by IT departments means that application developers want to write portable code, and thus write lowest common denominator code
  • Application developers often want to ensure that all access goes through their applications, and thus do not want to centralize application logic in the database, and
  • The SQL standards themselves ensure that the portable tools are not really optimal for this task; the best tools are those which go beyond the standards, with corresponding costs to portability.
The first market factor has to do with the fact that relational database management systems are expensive in several senses.  In addition to the (often hefty) licensing costs, you have the necessity of hiring people with experience and/or training them.   Companies usually address this cost by consolidating their expertise and servers on a single RDBMS.  Therefore you see Oracle shops, SQL Server shops, DB2 shops, etc.  This leaves lock-in as a major cost for businesses, and it reduces the market for database-specific applications.  Application developers therefore, quite rationally, more often choose to write SQL code that runs on many different RDBMS's.  This naturally requires eschewing advanced features, like Oracle Objects or PL/pgSQL functions, in favor of vanilla SQL.

 The problem of course is that it is hard to encapsulate data when you are limited to the most basic feature set.  While views may work, making them updatable works differently, and has different consequences, on every RDBMS.  Stored procedures or functions are far worse in this regard.  Consequently the need to write portable code requires essentially dispensing with the very tools used to encapsulate data.  For this reason I don't think you can both write a database for use by multiple applications (i.e. where the internal data structures are encapsulated) and also write a database that runs on multiple RDBMS's.  You have to choose.  Developers cannot be blamed for choosing the option which gives their applications the most market and lock-in.

Competing desires for lock-in are another factor.  RDBMS vendors typically want to restrict access to a certain number of client access or seat licenses, and if connections are pooled in middleware this can circumvent some of that limitation (it may raise contractual issues depending on the EULA, but the technical controls are easily worked around this way, and at some point you run into definitional problems, particularly when systems are loosely coupled).  Application developers, meanwhile, want to sell their own licenses, and this can only be done if connections are checked in the application layer.  It is therefore against the interests of many application developers to ship encapsulated database schemas.  Instead, the RDBMS is used largely as a private data store with some additional reporting capabilities.

Some RDBMS vendors actually optimize their systems for the above needs.  One feature SQLAnywhere offers is that developers like Intuit can lock a database to a specific application, banning all third party access, and a large part of MySQL's current popularity may be tied to the fact that it is well-optimized for moving unencapsulated databases that run on other DB's to it.  In particular the sql_mode setting on one hand facilitates porting code to MySQL and on the other makes it relatively unsafe for this sort of encapsulation.

However, market factors aren't the only ones pushing developers away from building databases in this manner.  The perpetual debate over stored procedures illustrates a mismatch between at least one commonly used tool and the typical use case of that tool.

 Stored procedures, as the name implies, are essentially imperative constructs which take specific arguments and return records.  The idea is that they provide a basic imperative interface language to a database.  Instead of SELECT .... you issue CALL some_proc(arg1, ....);

 There are several problems however with stored procedures as they are typically used.  The first is that they still have a significant impedance mismatch with object-oriented programming.  The data structures they return tend to be fairly rigid, so adding a new column tends to require multiple changes in code, often at least one in each layer of the program.  You can't just define your object once and then re-use it over and over in other layers of your code.

A second significant problem is that stored procedures are at their best when they are a few well-connected and modularized queries, and at their worst when they are many different queries tied together in complex ways.  This leads to limited utility in actually encapsulating the data, and in all cases the abstraction layer is not really a good match for what is typically done with it.  For these reasons stored procedures as typically used make the most sense when working outside the object-oriented approach.

 Stored procedures have largely been replaced by object-relational mapping tools (ORMs), which attempt to provide a mapping between a relational interface and an object-oriented development environment.  ORMs automate basic database operations for things like insert, select, update, and delete, but they don't really provide an abstraction regarding the actual data mapping between the behavioral app layer and the information model layer.  This can currently only be done in the information model itself, so ORMs are best paired with updatable views, but this comes at the cost of portable SQL code.
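In PostgreSQL (9.1 and later), the updatable-view half of this pairing can be sketched with an INSTEAD OF trigger.  The schema below is purely hypothetical:

CREATE TABLE product_store (
    id         serial PRIMARY KEY,
    name       text NOT NULL,
    deleted_at timestamptz
);

-- The ORM maps a class to the view, not to the table.
CREATE VIEW product AS
    SELECT id, name FROM product_store WHERE deleted_at IS NULL;

CREATE FUNCTION product_insert() RETURNS trigger AS
$$
BEGIN
    INSERT INTO product_store (name) VALUES (NEW.name);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER product_insert
    INSTEAD OF INSERT ON product
    FOR EACH ROW EXECUTE PROCEDURE product_insert();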

Aside from these approaches, or moving to NoSQL, there are a variety of methods to encapsulate the data store inside a relational application.  These approaches require understanding both the desire for encapsulation and interfaces, and the desire to integrate with applications as a service rather than as a relatively simple persistence layer manipulated mathematically.

Service Oriented Database Architecture

For most of the last five years, I have been building LedgerSMB using an approach I call "Service Oriented Database Architecture," or SODA, which is inspired in part by RESTful web services and SOAP.  From SOAP I took the emphasis on discoverability, and from REST I took, to the extent possible, the goal of re-using everything in the database that can be re-used in order to define an API.  This approach thus uses the database semantics the way REST re-uses HTTP semantics, and while there are some differences forced by the way PostgreSQL does things (every function is called via a SELECT statement), this is not the end of the world.  The goal, of course, is to build database interfaces suitable for loosely coupled application/database combinations.

 The SODA approach is based on a number of principles, namely that:

  • Access to the database is through functions, not relations,
  • Functions, to the extent possible, always return a useful result, usually in a data structure corresponding to an object,
  • Function names (within the domain of this architecture) are unique,
  • Function argument names correspond to the properties expected, and
  • The database is responsible for its own security.
 If these are followed then the functions can be mapped, discovered, and run at run-time.  Here is a PHP class that implements such a run-time mapping:

 class DBObject
{
    protected $schema = 'public';
   
   /* function __call($procname, $order = null)
    * Maps in object properties into an arg array and calls call_procedure
    *
    * db procedures are checked for argument names and these are stripped of 
    * the "in_" prefix.  After this is complete, a property is matched and
    * mapped in.
    */
    public function __call($procname, $order = null){
        # variable initializations
        $procargs = array();

        # proc data lookup
        $procquery = "
            SELECT proname, pronargs, proargnames, proargtypes FROM pg_proc 
             WHERE proname = $1
               AND pronamespace = 
                   coalesce((SELECT oid FROM pg_namespace WHERE nspname = $2), 
                        pronamespace)";
         $db = DB::getObject();
         $sth = pg_query_params($db->dbhandle, $procquery, 
                               array($procname, $this->schema));
         $procdata = pg_fetch_assoc($sth);

         if (0 == pg_num_rows($sth)){
            throw new \Exception('Function not found');
         }
         # building argument list
         preg_match('/^{(.*)}$/', $procdata['proargnames'], $matches);
         $procargnames = $phpArr = str_getcsv($matches[1]);
         foreach ($procargnames as $argname){
              $argname = preg_replace('/^in_/', '', $argname);
              array_push($procargs, $this->$argname);
         }

         # calling call_procedure
         return $this->call_procedure($procname, $procargs, $order);
    }
    /* function call_procedure($procname, $args = array(), $order = null)
     *
     * generates a query in the form of:
     * select * from $procname($1, $2, etc) ORDER BY colname
     * and runs it.  It returns an array of associative arrays representing the
     * output.
     */
    public function call_procedure($procname, $args = array(), $order = null){
         $results = array();
         # query generation
         $query = "select * from "
                       . pg_escape_identifier($this->schema) . "." 
                       . pg_escape_identifier($procname) . "(";
         $count = 1;
         $first = 1;
         foreach ($args as $arg){
             if (!$first){
                 $query .= ", ";
             }
             $query .= '$' . $count;
             $first = 0;
             ++ $count;
         }
         $query .= ')';
         if ($order){
            $query .= " ORDER BY " . pg_escape_identifier($order);
         }


         # execution and returning results
         $db = DB::getObject();
         $sth = pg_query_params($db->dbhandle, $query, $args);
         if (!$sth){
             return null;
         }
         for ($i = 0; $i < pg_num_rows($sth); $i++){
              print "retrieving row $i \n";
              array_push($results, pg_fetch_assoc($sth, $i));
         }
         return $results;
    }
    /* function merge(array $data)
     * merges data into the current object from an associative array
     * 
     * null or undef values are not set
     */
    public function merge($data){
        foreach ($this as $prop => $value){
             if (array_key_exists($prop, $data) and null != $data[$prop]){
                 $this->$prop = $data[$prop];
             }
        }
    }
    /* function is_allowed_role($role)
     * returns true if the user is allowed the role for the specific db 
     * i.e. $role should not include the prefix.  Otherwise it returns false
     */
    public function is_allowed_role($role){
        $db = DB::getObject();
        return $db->is_allowed_role($role);
    }
}  


The above code seems long, but what it essentially allows is for inheriting classes to simply declare that methods are mapped to stored procedures, with those mappings resolved automatically at the time the stored procedure is actually called.  Additionally this centralizes essentially all db access in a single file, where it can be audited for SQL injection issues and the like, and you can go on programming as if you are hitting an object-oriented database.  Of course there are times when you need to make modifications in many layers, such as when a new attribute needs to be added and stored and it isn't in the table yet, but generally these are relatively rare.

In PHP, I can have a class which checks the version and selects the appropriate stored procedure easily even if they expect different object properties as arguments:




public function save(){
     $procname = 'company__save';
     if ('1.3' == \LedgerSMB\Config\DBVERSION){
         $procname = 'company_save';
     }
     $results = $this->$procname();
     $data = array_pop($results);
     $this->merge($data);
}



 What might a stored procedure look like?  Here is one:


CREATE OR REPLACE FUNCTION asset_dep_straight_line_yr_d
(in_asset_ids int[],  in_report_date date, in_report_id int)
RETURNS bool AS
$$
     INSERT INTO asset_report_line (asset_id, report_id, amount, department_id,
                                   warehouse_id)
     SELECT ai.id, $3,
            asset_dep__straight_line_base(
                  ai.usable_life, -- years
                  ai.usable_life -
                  get_fractional_year(coalesce(max(report_date),
                                         start_depreciation,
                                         purchase_date),
                                       coalesce(start_depreciation,
                                         purchase_date)),
                  get_fractional_year(coalesce(max(report_date),
                                         start_depreciation,
                                         purchase_date),
                                $2),
                  purchase_value - salvage_value,
                  coalesce(sum(l.amount), 0)),
            ai.department_id, ai.location_id
       FROM asset_item ai
  LEFT JOIN asset_report_line l ON (l.asset_id = ai.id)
  LEFT JOIN asset_report r ON (l.report_id = r.id)
      WHERE ai.id = ANY($1)
   GROUP BY ai.id, ai.start_depreciation, ai.purchase_date, ai.purchase_value,
            ai.salvage_value, ai.department_id, ai.location_id, ai.usable_life;

    UPDATE asset_report SET report_class = 1 WHERE id = $3;

    select true;
$$ language sql; 

Unfortunately the above has to return true because the nature of the operation does not really provide a more useful return value; if we find one, this will be adjusted in a following major version upgrade.

This approach is generally nice because it is light-weight and adapts relatively well to rapidly changing environments.  However, the lack of imposed structure may also be a problem in some environments.  Where more engineering is required, the other approaches may work better.  This works relatively well, however, if you build your API to assume a relatively loose coupling between your database and the applications hitting this sort of API.

Object-Oriented Database-level interface


Where tighter coupling is required, an object-oriented interface may be better.  In some ways this is worth avoiding because it leads to very ugly SQL queries, for example:


SELECT (save).*
FROM save(row(null, '12345', 'My Company, LTD', 232, '33-55432334')::company);


The overall issue here is that you have the possibility of multiple levels of discoverability involved.  It works very well for code generators, but not so well for the humans maintaining it.  Note the above could be rewritten, assuming no additional arguments, as:

 SELECT (save).*
   FROM (row(null, '12345', 'My Company, LTD', 232, '33-55432334')::company).save;


The advantage to this approach is that your objects form classes whose structure is discoverable, and overloading becomes possible.  Code generators thus can work well, as the database contains all information needed to create all boilerplate code.  The database code itself is simplified as well.  On the other hand, troubleshooting can be a bit of a pain.  It also has the property of essentially requiring the use of code generators in order to create libraries for interoperability, which closely ties the generated libraries to the interface created.
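For reference, the sort of definitions these queries presuppose might look roughly like this; the company layout shown is a hypothetical simplification, not the real LedgerSMB schema:

CREATE TABLE company (
    id         serial PRIMARY KEY,
    control    text,
    legal_name text,
    entity_id  int,
    tax_id     text
);

-- A save() method against the company row type; a real one would upsert.
CREATE FUNCTION save(in_company company) RETURNS company AS
$$
    INSERT INTO company (control, legal_name, entity_id, tax_id)
    VALUES ($1.control, $1.legal_name, $1.entity_id, $1.tax_id)
    RETURNING *;
$$ LANGUAGE sql;

With definitions along these lines, the functional form save(...) and the attribute form (...)::company).save resolve to the same function.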

In-database ORM and JSON (In Javascript!)


 One fascinating approach I came across recently, but have very little experience with, is xTuple's in-database ORM, which is largely written in PL/V8 stored procedures.  Yes, you read that right: the stored procedures are written in JavaScript.  I would invite people to check out the code and see what they think.  I have not played around with it yet, but it definitely shows how far encapsulation can be pushed within PostgreSQL.

Conclusions


Encapsulating the database behind an application-style interface is not something for which one must go to NoSQL.  RDBMS's which are strongly programmable are capable of doing this now, although perhaps few if any rival PostgreSQL's flexibility in this area.  The RDBMS can then be an 'information model server' which serves out information models as requested, each of which encapsulates further data within it.  The data model can then be consumed and expanded in the Model of an MVC framework, but the impedance mismatch issues can largely be eliminated by understanding and utilizing separation of concerns to one's advantage.

 Of course none of this is to disparage NoSQL.  These products have quite often been used successfully as adjuncts to traditional RDBMS's, either pre-processing or post-processing data for later use or re-use.  Some users of polyglot storage models have found rapid increases in development pace when these are used together, with data often being piped from something like Mongo or Neo4j into PostgreSQL after essentially using those engines to transform the data, or using them to transform the data on output.   This is particularly true for large amounts of data which are to be processed in relatively non-relational ways, such as with graph databases, array-native databases (in the sciences), etc.  The combination of these approaches becomes very rich.

 But it does suggest that the days of utilizing the RDBMS as a dumb data store for a given application are numbered in many cases.  I don't see anything challenging the primacy of the RDBMS environment, but at the same time, that doesn't mean there is no role for the other approaches.  If you want a dumb persistence store, you can go with a NoSQL solution and be happier.  However, the overall approach of avoiding the mismatch by encapsulating data storage inside an application is as applicable to the RDBMS environment as to any other.

Monday, October 8, 2012

Three Approaches to Object-Relational Databases: PostgreSQL, Oracle, and Informix

PostgreSQL vs Informix

Probably the database closest, object-relationally, to Postgres is Informix.  Informix in fact got its object-relational approach with the purchase of Illustra, a Postgres fork.  Illustra, however, split from Postgres before the latter adopted SQL, and so the SQL implementations are entirely independent.

Informix has perhaps had more influence than any other database software on how the industry sees object-relational databases, so because of the shared heritage and Informix's place in the industry, it is the first comparison to make.

Table Methods

I was not able to find information on class.method notation in the Informix documentation.  As far as I can tell, Informix requires methods to be called using function(object) syntax.  This is along the lines of Stonebraker's image processing example.  In this way the connection between structure and function feels somewhat more bolted-on than it does in PostgreSQL.  However it would be a mistake to see this as the entirety of Informix's object-relational capabilities.
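For contrast, here is a hedged sketch (hypothetical table) of the PostgreSQL behavior I am referring to, where a function over a table's row type can be invoked either way:

CREATE TABLE line_segment (
    id integer PRIMARY KEY,
    x1 float8, y1 float8,
    x2 float8, y2 float8
);

CREATE FUNCTION length(line_segment) RETURNS float8 AS
$$
    SELECT sqrt(($1.x2 - $1.x1)^2 + ($1.y2 - $1.y1)^2);
$$ LANGUAGE sql IMMUTABLE;

-- Both forms work in PostgreSQL; Informix, as far as I can tell, supports only the first:
SELECT length(li) FROM line_segment li;
SELECT li.length  FROM line_segment li;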

Inheritance

Informix supports single inheritance for both types and tables using the UNDER supertype syntax.  UNDER, similar to INHERITS in PostgreSQL, establishes an "is-a" relationship between the supertype and the subtype.  Unlike in PostgreSQL, indexes and referential integrity are inherited in Informix, meaning that foreign keys work properly in both directions.  Single inheritance is thus quite a bit more mature on Informix than on PostgreSQL, but the lack of multiple inheritance prevents composition of tables by in-lining type attributes (which is possible on PostgreSQL but not on Informix).

This shows that Informix has a different approach to table inheritance, namely that there is a simple use case which it supports very well and has polished thoroughly, while more complex use cases are beyond it.  In PostgreSQL, on the other hand, declarative referential integrity doesn't work against an inheritance tree, and so enforcing it requires writing your own constraint triggers.
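A quick sketch of the PostgreSQL limitation (hypothetical tables):

CREATE TABLE person (
    id   int PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE employee (
    employee_number int NOT NULL
) INHERITS (person);

-- The foreign key below only checks rows stored in person itself, so a
-- reference to a row stored in employee fails even though that row shows up
-- in SELECT * FROM person.  Enforcing this today means constraint triggers.
CREATE TABLE timesheet (
    person_id int REFERENCES person (id),
    hours     numeric
);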

Return Results

A select * from parent_table in Informix returns all attributes of all rows from the parent table and all descendant tables.  This can lead to a situation where the result set is "jagged" (i.e. where the rows have different numbers of elements) because child tables add additional columns.  In this case, it is necessary to check the row definition when receiving each row.

One way to look at it is that both PostgreSQL and Informix return a set of objects, but PostgreSQL coerces them into the type that is asked for, while Informix returns them as they are.  Thus if you select * from person, and employee inherits person, then you automatically get all the employee information as well.

This feature is, as far as I know, unique to Informix.  I know of no other ORDBMS that allows a result set to be jagged in this way, but it shows one important problem that arises when one starts merging object-oriented and relational approaches: what does select * mean in this case?  It isn't self-evident, and therefore Informix and PostgreSQL take different approaches.

Alternative Approaches

In the absence of a system of composing tables by in-lining types, the way multiple inheritance works in PostgreSQL, table composition in Informix requires using row types as columns.  The use of complex types in this way is much more mature in Informix than in PostgreSQL (which only began to support this relatively recently).

Consequently composition is done primarily through member types, but this evokes an important tradeoff between ease of querying in a relational sense and rich modelling.

I personally find the SQL required to make elegant use of columns as types in this way somewhat ugly, but I recognize this is a personal preference.  Still, consider the following:

SELECT print(e.name), mailing_label(e.address) from employees e;

If address and name are both complex types then whatever we are going to do is going to be a bit ugly.  There isn't much we can do about that.

PostgreSQL and Informix vs Oracle

Oracle in some ways shows some influence from Informix but it takes the approach in a somewhat different direction.  Oracle makes no bones about being a relational database first and foremost and has adopted an approach which avoids, rather than answers, questions of proper behavior.  Oracle Objects in some ways resemble Informix's approach and in some ways PostgreSQL's, but they are in general more different than the same.

Tables and Types 

Oracle Objects approaches the question of object-to-relation equivalence by allowing types to be inherited, while tables cannot be.  Tables can, however, copy type structures for their schemas.  Thus one generally has to create an object model first and then create tables to store those objects.  This approach sets up very different patterns and antipatterns than one would get in Informix and PostgreSQL, where tables themselves can inherit.  On one hand this (forcibly) separates data-holding tables from their abstract parents.  On the other, it makes the object model harder to work with, except where complex types are being used in columns of a table.

It is worth noting that types support methods in Oracle and this simplifies things greatly.  However, I am left wondering why one would use Oracle objects instead of virtual columns and just treat Oracle as a form of relational database management system with little use of object extensions.

Object Composition Approaches

In Oracle the only approach that works is to use columns to store complex types.  Fortunately those types can have methods, so the SQL is not as ugly as it would be on Informix.  You can:

select e.name.print(), e.address.mailing_label() from employees e;

This strikes me as a bit more readable.

It seems to me that Oracle Objects have two main uses.  The first is in the logical model of the database although this role can be taken over by ORMs to some extent.  The second and more attractive approach is to use Oracle Objects not for their advertised use of modelling of things like customers or invoices but rather to create intelligent datatypes for columns.

For example, if I want to store images and dynamically determine what type of image we are looking for, I could still do something like:

SELECT id FROM slides s
 WHERE s.picture.is_of('sunset');

This is far cleaner SQL-wise than the equivalent syntax in Informix or PostgreSQL:

SELECT id FROM slides s
 WHERE is_of(s.picture, 'sunset'); 

This allows the same sorts of problems to be addressed as Stonebraker talks about, and it allows structure and logic to be clearly tied together, at the expense of substitutability as is found in table inheritance in both Informix and PostgreSQL.

The complexity costs are likely to be worth it in Oracle in a very different set of cases than would be the case in PostgreSQL or Informix.  These cases intersect at queries which must return data based on very complex criteria which cannot be reduced to relational algebra and predicate calculus.

Final Thoughts

The above discusses three different approaches to the question of how to encapsulate complex data into a database which may need to query based on arbitrary criteria not readily reducible to relational algebra and predicate calculus.

PostgreSQL and Informix both tightly integrate object handling far more deeply than Oracle.  Informix seems more polished in the areas it supports, but PostgreSQL has a broader support for the overall idea.

Oracle's approach is essentially to move object handling capabilities to the column level for the most part.  Yes, you can create tables of objects, but you cannot inherit them directly and you cannot compose your object model using multiple parents without moving everything to the column level.  This makes object behavior mostly useful in the creation of "smart column types."

Each of these three ORDBMS's takes its own approach.  All three allow the development of very intelligent database models.  All three pose different complexity issues.

Wednesday, October 3, 2012

FAQ: Why is LedgerSMB PostgreSQL-only?

We get asked a lot: why is LedgerSMB PostgreSQL-only?  Why not run on MySQL?  After all, since 5.0, MySQL has offered relatively decent type constraints and a database engine sufficient for building ACID-compliant applications, and so forth.  This line of reasoning goes that we'd get more market share by being able to run on a larger number of hosting providers.

This post discusses why we made the decision to go with Postgres and decline to support any other databases.  I still believe it has been the correct decision for us but it might not be the right decision for other projects.  I hope though that writing this up will help other projects weigh their options and choose appropriately.

The short version is that we decided to favor an application with many avenues for integration over the idea of an application that would run on any platform.  We chose the path of moving towards giving our users as rich an experience as possible, and as rich opportunities as possible in integrating this software with their business over the idea of having as many customers as possible.  If the software works wonders, people will make room for it.

Our Goal

In our case, our goal has generally been to be a major part of the business infrastructure of our users.  We want to create a flexible framework which supports extension and repurposing, allowing businesses to find new uses for what we are doing.

Additionally, early on our decision was shaped by the desire to hook LedgerSMB into reporting tools written in other languages, and we later generalized this to other tools.  This has been a relatively safe bet, and with additional compatibility classes, auto-detection of DB-level API's is possible.

Finally we started as a fork of a PostgreSQL-only application and so the choice was whether to deepen our commitment to PostgreSQL only or whether to move towards portability.

Considerations

Our overall approach  was based on the idea that financial and ERP systems are best when they are open to integration, and that ERP frameworks need to be open to extension as well.  After all, ERP software typically runs a wide variety of functions in a business, and these are often connected to yet more workflows where custom apps may be developed.  If those apps can be developed against LedgerSMB this delivers greater value to the customer.

In other words, we see ERP as much about development platform and tools as it is about application functionality and flexibility in adding features is key.

Our Decision

Our decision was to solidify our usage of PostgreSQL, use it to add additional security and tune performance.  We quickly made the decision to move much of the logic into stored procedures in order to support tools written in other languages as well.

One important benefit which helped push us this way was the fact that a stored procedure interface could be centrally controlled and checked against SQL injection while ad hoc query interfaces required careful auditing.

Our Approach and Its Challenges

Our current approach is to name arguments such that they can be mapped in.  This works reasonably well with a single application; it is simple and requires relatively straight-forward queries.  However it is currently limited in two ways: namespaces, and the fact that the interface is very heavily procedural, which makes maintenance a bit of a pain.  It also means that class attributes must be manually maintained across integration classes, which is sub-optimal.

We are currently exploring the possibility of moving to a more object-oriented interface in the db which would solve the above issues.  The interface would be dynamically discovered by querying system catalogs or views we would create, and results cached by the application.  This would allow dynamic API discovery and classes to export the db-level API to other systems in a purely db-centric way.  This would hence be expected to cut our maintenance effort significantly.

A major challenge here however has to do with quality of stored procedures.  The current quality of stored procedures varies quite a bit.   In general, one of the problems one faces is that application developers don't always make good stored procedure developers because the sorts of thinking are rather different.  However, over time we expect to be able to fix this issue.

Another key part of our strategy is that of layered API's.  Not only are we building discoverable db-level API's but also RESTful web services.

Final Words

LedgerSMB chose a db-centric model because we want to support a wide range of applications written in a wide range of languages, running in a wide range of environments.  This is very specific to our approach, and it is not without challenges.   However it has already had some benefits in our PHP compatibility classes, and these are expected to multiply in the future.  In the end we chose a db for many apps over an app for many db's.

Monday, October 1, 2012

Towards a simple explanation of Object-Relational Database Management Systems

A relational database management system is a piece of software which takes organized information and processes it in order to answer questions presented by other software.

In a typical relational database management system, data is arranged into tables, each of which is similar to a worksheet in an Excel workbook.  Information in these tables is matched, and processed, and returned to the user.  A skilled database architect can arrange the data so that the structure (called the schema) is easily maintained and extended, so new types of information can be stored as the need arises.

Object-relational database management systems take this system and expand it, allowing for more complex data to be stored in each column and for a wider range of calculations to be attached to the table.   The combination of more complex data and complex calculations allows one to build advanced databases that can do more than one could before.

These systems are called object-relational because they are an attempt to expand what relational database management systems can do by borrowing some ideas from object-oriented programming.  The two disciplines are however very far apart, because object-oriented programming aims to model behavior encapsulated behind interfaces while object-relational database design seeks to extend relational math to model derived information as well as what is minimally known.

Thursday, September 27, 2012

O/R Series Epilogue 2: Towards a not-so-simple explanation of Object Relational Database Management Systems

Object relational databases build on the relational model but focus on answering questions regarding information that is derived from other information using a variety of general purpose tools.  Modelling of the information then shifts not only to what is known but what can be easily (or even not so easily) extrapolated from what is known.  Existing ORDBMS-type systems appear to include Informix (of course, given the PostgreSQL legacy), Oracle, and DB2, but PostgreSQL and Informix have the longest lineage in this area.

The key features of an ORDBMS are:
  1. The ability to model easily extrapolated information within the data model itself, through use of user defined functions
  2. Extensibility with regard to types and functions, written in general purpose programming languages
These features work together in important ways, and it is possible, as we have seen, to build a great deal of intelligence into a data model while only moving slightly beyond the standard relational approach.  Although every step further brings complexity costs with it, these features are very often useful and allow problems to be solved close to the data which cannot easily be solved otherwise.

Towards an Object-Relational Algebra

One way to think of object-relational modelling is in the language of relational algebra.  I haven't been able to find accepted notation for functions on individual tuples, so here I use f(relation) notation, to show that the function operates over the set of tuples in the relation.

Relational algebra as defined by Dr Codd is extremely important but it cannot solve certain classes of important queries and so SQL has gone beyond it.  Important blind-spots include:
  1. Transitive operations at arbitrary depth.  Transitive closure and "n next highest values" are both examples of this.
  2. Calculated data problems
Most problems fall into one of these two categories if they are outside the relational model.

We already know that elements of a tuple are functionally dependent on others if, and only if, for each value of the dependency, there is only one functionally dependent value.  So if a is functionally dependent on b, for every b there is exactly one valid value of a.  Functional dependency, as in algebra, is value-based, not expression-based.

I am choosing an algebraic-inspired notation for this, where f(R) is a function of relation R if, and only if, for every tuple in relation R,  there is only one f(R).

f(R) is trivial if, for every tuple (a, b, c) in relation R, f(R) returns a value or a tuple that is a subset of the tuple processed.  So for a tuple (a, b, c), if f(R) = a, or if f(R) = (a, c), then the function is trivial.   Every trivial function also represents a trivial functional dependency within the tuple.

A function is relational if it can be expressed solely through relational algebra.  All trivial functions can be expressed relationally (using π operations) and therefore are also relational.  A relational function thus always specifies a functional data dependency in or between relations.  Relational functions have the property of always denoting global functional dependencies.

A function is non-relational if it cannot be expressed solely through relational algebra, for example if it involves processing the actual value of one or more of the tuple elements or their functional dependencies.  If we have a relation (Emp) containing an attribute salary_per_wk, and annual_salary(Emp) = π_{salary_per_wk * 52}(Emp), then annual_salary is non-relational because it involves actual processing of data inside the tuple.  Relational functions can often be expanded into relational operations, but to relational operations non-relational functions are black boxes which function very much like attributes of a relation.

For example, id(R) = π_{id}(R) and c(R) = π_{name}(σ_{id = id(R)}(C)) are both relational functions, but only id(R) is trivial.  c(R) represents essentially a join and subselect.

An example of a general operation in functional notation might be:

π_{age(R)}(R).  Similarly we can write π_{name(R)}(σ_{age(R) = 41}(R)).

Of course, we must be careful.  Since age(R) is only locally functionally dependent, indexes are out of the question and we must be careful about specifying constraints.  However, defining a relation such that age(R) < 65 might prove problematic unless we are re-checking it every day.

This would be similar to the following statement in PostgreSQL:

SELECT r.name
  FROM employee r
 WHERE r.age = 41;

where name and age are table methods.  This allows us to store less information in the database (and hence with less duplication and fewer chances for error) and extrapolate important information from the data that is stored.  It also allows us to store data in less traditional ways (nested structures, etc.) for the sole purpose of writing functions against it in that specific format, and thus model constraints which cannot be modelled using more traditional structures (though as we have seen that poses significant gotchas and complexity costs).
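Such a table method might be sketched as follows (hypothetical schema; age is computed from the row rather than stored, yet reads like a column):

CREATE TABLE employee (
    id         serial PRIMARY KEY,
    name       text NOT NULL,
    birth_date date NOT NULL
);

CREATE FUNCTION age(employee) RETURNS int AS
$$
    SELECT date_part('year', age($1.birth_date))::int;
$$ LANGUAGE sql STABLE;

SELECT r.name FROM employee r WHERE r.age = 41;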

Similarly, recursive queries require recursive operators to work.  I place a capital Greek sigma (Σ) above the join operator to signify recursion.  This is borrowed because Σ is the series designator elsewhere in mathematics.  An optional maximum depth is specified as a subscript to the Σ, so Σ_5 would indicate that the expression or join should be subject to no more than 5 iterations.  In a recursive join, the join is repeated until the θ condition is no longer satisfied.  The functions path(R) and depth(R) are functionally dependent on the output of a recursive join, so Σ_5 is identical to σ_{depth(R) = 5}(Σ(...) R).  The choice of the Σ is also helpful because while Σ always returns an extended superset, σ always returns a subset.

Since path is functionally dependent on the output of a recursive join, we can prove transitive closure over a finite set using recursive self-joins, functions, and boolean operators.  We can also express next highest N tuples results.  Alternatively the Σ can be followed by parentheses to show that an entire expression should be repeated until it brings no further results into the set.  In a Σ expression the set is divided into two subsets:  previous and new.  New results are those returned by the last iteration, and are the only ones processed for join conditions.  On each iteration, the "new" tuples are moved into the previous set and the tuples which satisfied the join condition are moved into the "new" set.

I also use three other symbols for specifying order-dependent information.  ω (omega) denotes a "window" order and partition in which aggregates can be applied to tuples in order, tuples can be removed from the beginning (α with a subscript for number of tuples to "skip") or truncated from the end (τ with a subscript for the number of tuples after which the set is to be truncated).  These allow me to approach the problems which SQL can address but relational algebra cannot.  An interesting property of ω is that the window order only is valid for some specific operations and is lost on any join or select operations.  These operations have interesting properties as well but they are somewhat outside the scope of this posting.  I will however note that it is usually cleaner to solve next N result issues with window ordering, tuple omission, and truncation than it is with recursion and aggregates.

Next:  Towards a simple explanation of object-relational database systems.

Sunday, September 23, 2012

O/R Series Epilogue 1: 4 Things I Wish PostgreSQL did

In this area there are things I wish PostgreSQL did.  These are listed in order of priority.  Each item also has workarounds listed.

Parameters in Object Methods

In Oracle you can define an object method such that you can do something like:

SELECT li.length(30) from line_segment li;

In PostgreSQL there is no way to pass parameters to an object method called with object.method syntax.  The equivalent in PostgreSQL would be:

SELECT length(li, 30) from line_segment li;

This isn't a huge deal but it would make it a lot clearer that length was closely tied to the line_segment type structure if it was possible to do things that way.

Inheritable foreign key constraints

A very useful extension to what we have would be to allow foreign keys to be inherited on the referencing side.  This would save one from having to define foreign keys over and over for each child table, and it would make key classes even more useful.

Ideally I would like to be able to do

CREATE TABLE country_ref (
     country_id int references country(id)
);

and have the foreign key properly enforced on all inheriting tables.  This would just involve copying the foreign key constraint, and so would probably not require any deep changes to structure the way the ability to reference child tables and parent tables together might.
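Today the workaround is to repeat the constraint on every inheriting table, for example:

CREATE TABLE customer (
    customer_name text NOT NULL
) INHERITS (country_ref);

-- Only the column is inherited from country_ref, so the constraint must be re-added.
ALTER TABLE customer
    ADD FOREIGN KEY (country_id) REFERENCES country (id);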

Inheritable unique constraints 

Related to this, I think another feature would smooth things over and might ultimately lead to more functionality down the road.  For now, I think the first step would be a SCOPE predicate for unique constraints and indexes.  This could be something like:

CREATE TABLE country_ref_u (
    country_id int,
    UNIQUE (country_id) SCOPE PER TABLE,
    FOREIGN KEY (country_id) REFERENCES country(id)
);

Such a unique constraint could then specify that this is unique on each table where it is inherited.  Initially PER TABLE and ONLY HERE (i.e. not inherited) might be supported with an effort to eventually support SCOPE FULL (and DRI against an inheritance tree).

One of the fundamental problems one runs into with unique constraints across tables is that extending the index format to include tableoid might lead to significantly larger indexes and very likely slower lookups where inheritance is not used.  One option might be to have an additional index page with special pointers to other indexes.  So per-table indexes might be useful jumping-off points for full-tree constraints.

Consistent assumptions in composite types used for columns

Composite types in columns are one of those areas where the behavior is remarkably inconsistent and the edge cases are poorly documented.  This isn't surprising given that this is relatively new functionality at the edges of relational database usage, and different people have different ideas of how it should work.  However, it seems very difficult to get everyone onto the same set of assumptions, and it is impossible to fix the current mish-mash while maintaining backwards compatibility.  Right now there is no consistency, and this makes it a difficult area to use going forward.

Wednesday, September 19, 2012

Object/Relational modelling Part 7: General Design Considerations

The previous posts have hopefully opened up the way we look at modelling relational data, in a move from thinking about relations and tuples to thinking about sets of objects coupled with derived values and catalogs.  These approaches are all viable as long as the actual table storage layout still meets all criteria of a well-designed relational database.

Unfortunately with new techniques and approaches, additional antipatterns are possible.  Every feature can be badly used, and every feature can be badly combined with other features.  For this reason, the more one adds advanced features which complicate the data model, the more it becomes possible to spectacularly fail.  Some of the tools discussed so far can be liberally used, but many must be used sparingly and only as necessary.

I have personally been working with features like table inheritance for some years now.   The features can be of immense help if used properly, but much of the current documentation sets users up to be spectacularly disappointed.

I am now at a point where I can summarize the differences between relational thinking and object-relational thinking.  The difference is similar to the difference between algebra and calculus.  Relationally we want our information arranged so we can select, join, filter, and aggregate to provide useful answers to questions.    This is not so different from using algebra to factor and simplify expressions or to solve problems for specific unknowns.

When a student is learning calculus however, the same tools are used but applied in a very different way.  The approach to thinking is very different.  Calculus applies many of the same techniques to very small numbers and to trends between numbers.  For example, if you have a curve and want to find out what the slope is at a certain point, you look at the limit of the slope between two points as those points converge.  Similarly an integral is the sum of an infinite series of infinitesimals.   Calculus can thus be seen as a level of abstraction above algebra useful primarily for those problems that algebra cannot solve directly, but it builds upon algebraic techniques, and anyone doing calculus will apply algebra wherever possible first.

Here we have a similar process.  We are aggregating or decomposing relational information such that we can derive interesting values from them. However, for a whole host of reasons we do not want to lose sight of the fact that the database --- at least in the sections we are running day to day SELECT queries from --- is relationally well-formed.

If the above comparison has merit then it is worth heeding the words of my college Calculus teacher, Mike Lavender.  Many of these features are power tools and should be used both sparingly and with care.

A second important point is that object-relational and object-oriented thinking is also quite different.  Object-relational thinking ties data structures to their functional dependencies which may not be stored in the database, and tends to be best when approaching data structures from the perspective of what answers can be derived.  Thus, a square is a rectangle with an additional constraint, and this is perfectly consistent with the Liskov Substitution Principle applied to an information model.  A square is mathematically substitutable for a rectangle in terms of what we can derive from it.

On the other hand, object-oriented programming is about modelling behavior and what can be done to objects without transforming them across categories.  Thus in object-oriented programming, you cannot say that a square is a rectangle because there are things we can do to a rectangle that we cannot do to a square without transforming them across class boundaries.  As I will attempt to show in the future, while object-oriented design in application code is very much an art, object-relational design can be expressed in terms of mathematics, namely relational algebra with some extremely modest extensions.  This difference in role and approach means that while many of the object-oriented design principles can be found to apply, they apply to object-relational designs in ways very different than they would in the app layer.

I am planning a short (three post) series on my attempt at an object-relational algebra, and a longer series on applying SOLID principles to object-relational design.


The Problem:  "Relational Decay"

 

The basic problem can be described as the ability to silently degrade the usefulness of relational data constraints because the design does not allow for clean mapping of constraints.  This requires either ever-increasing complexity of data management or it requires degraded data constraints.  Both of these are dangerous and can lead eventually to data integrity and accuracy issues.

Anti-Pattern: Mixing Stored and Subclass Data in Same Virtual Table

Having data written to both parent and child tables complicates data modelling in ways that result in functionally ambiguous foreign keys.  Functionally ambiguous foreign keys are generally bad and to be avoided; foreign keys should be clear, unambiguous, and reference specific tables.  Data inheritance, mixing stored data in the parent with data of subclasses, is a recipe for problems.  If a single query pulls data like this together, relational integrity becomes problematic.  Don't do it.

Anti-Pattern:  Multi-way sync'ing.

One of the solutions for virtually any key issue is to use materialized views.   This is particularly helpful when the representation of data must be transformed from a form where derived values can be constrained to one where it can be reasonably relationally queried (nested table structures create this problem, among others).  It's tempting to try to synchronize data both ways, but this is unmanageably complex.  Data flow needs to go only one way.

Anti-Pattern: Commonly Retrieving Data Directly from Nested Storage

As we have shown, nested tables present massive problems regarding storage and indexing because the logical storage and the physical storage are not really of the same structure.  Nested storage makes a great deal of sense for some problems, as we have seen, but used indiscriminately, the result is invariably a mess.

Anti-Pattern:  Casting complex type to text

When prototyping types, perhaps before writing a type in C, it may be tempting to write casts of tuple types to text.  If you do this, however, things are likely to break because there are built-in casts to text that are used internally.  Instead use an as_text method.
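For example, something along these lines (a hypothetical type) avoids touching the built-in cast:

CREATE TYPE inventory_item AS (
    sku text,
    qty numeric
);

-- Usable as as_text(item) or item.as_text, leaving the internal text cast alone.
CREATE FUNCTION as_text(inventory_item) RETURNS text AS
$$
    SELECT $1.sku || ' x ' || $1.qty::text;
$$ LANGUAGE sql IMMUTABLE;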

 

Solutions:  Relational first, Object second


The solutions below have one thing in common.  They allow relational integrity to be preserved while putting these sorts of  power tools to use.  These however are not necessarily to be used willy-nilly either.  Each one has something of a complexity cost and this must be weighed against the reductions in complexity that each one provides for a project.

Interface inheritance

Interfaces should be inherited, not data.   In this regard, the inheritance design follows from data.  This first pattern is to build inheritance trees assuming that the only tables being actively queried will be those at the bottom of the tree, where data is actually stored.  In this case, inheritance is used not to provide additional query capabilities but rather to provide consistent interfaces across a domain.

Stock and Leaf Inheritance Tree Design

Stock and leaf is a name I gave to the idea that you can build an inheritance tree which separates tables into functional units from which no further inheritance occurs (the leaves) and infrastructure tables which provide consistent query capability for a subset of table information (the stock).  The stock tables never have data stored in them directly but provide general query capabilities.

A way to think of the stock and leaf design might be as a table partitioned in a complex way.  Maybe we have an initial partition based on ranges of one primary key column, but one of the first-level tables (not all of them) is partitioned further on a second primary key column.  Thus leaf nodes occur at several different levels of the inheritance tree.  This allows for organic growth of the tree, but avoids the functionally ambiguous foreign keys that often result.
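A skeletal example of what I mean (hypothetical tables; rows only ever live in the leaves):

CREATE TABLE note (           -- stock: query all notes here, never insert here
    id      serial,
    subject text,
    body    text
);

CREATE TABLE customer_note (  -- leaf: rows are stored here
    customer_id int NOT NULL
) INHERITS (note);

CREATE TABLE invoice_note (   -- leaf: rows are stored here
    invoice_id int NOT NULL
) INHERITS (note);

Foreign keys then point at the leaves, where they remain unambiguous, while SELECT ... FROM note still gives a uniform view of all notes.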

Simplified Write Model

If the data model requires synchronizing some aspects of it, it is very important to keep the write scenarios to a minimum.  Fewer write scenarios means fewer things that can go wrong, easier testing, more understandable failure cases, and easier troubleshooting when things go wrong.

Log-Aggregate-Snapshot Modelling

One way to greatly simplify writes is to move from representing current state in the database to representing cumulative changes in the database.  Accounting systems, for example, have a very restricted write model (usually few if any update scenarios, most important tables being append-only, etc) along with some current state (balance of the checking account) being calculated based on past changes.  This is a pattern which can be used elsewhere, and it enables other approaches (see below).

The typical objection is that representing state as a series of changes means one is storing a lot more data and that calculating state then requires expensive calculations.  The solution to this is to use periodic snapshots which can be used as roll-forward points, as well as to constrain when input may be entered.  For example we may take year-end snapshots of account balances, plus month-end snapshots of invoice balances, preventing invoices from being entered that are more than a month old, and any financial transactions from being entered into periods in which books are closed.  Similarly this allows us to delete old data without destroying our ability to track current state.
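A minimal sketch of the pattern (hypothetical tables): an append-only journal plus periodic snapshots used as roll-forward points.

CREATE TABLE acc_journal (
    account_id int NOT NULL,
    amount     numeric NOT NULL,
    entry_date date NOT NULL DEFAULT current_date
);

CREATE TABLE acc_snapshot (
    account_id int NOT NULL,
    as_of      date NOT NULL,
    balance    numeric NOT NULL,
    PRIMARY KEY (account_id, as_of)
);

-- Current balance = most recent snapshot plus everything journaled since it.
SELECT s.balance + coalesce(sum(j.amount), 0) AS balance
  FROM acc_snapshot s
  LEFT JOIN acc_journal j
         ON j.account_id = s.account_id AND j.entry_date > s.as_of
 WHERE s.account_id = 1
   AND s.as_of = (SELECT max(as_of) FROM acc_snapshot WHERE account_id = 1)
 GROUP BY s.balance;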

Log-Aggregate-Snapshot modelling trades data model complexity for write simplicity.  It is itself a power tool but it makes many other power tools safely usable.

Separation of Entry Storage and Query Storage for Append-Only Data

Another approach we can borrow from accounting, this time the paper world, is the idea that the format at point of entry can be transformed and re-saved for later use in another form.   This works well only when the data is append-only and the write transformations are well understood.  However one can use it for materialized views for inheritance trees where needed, or for breaking out data stored in nested tables where this is needed in order to enforce constraints.

For example, we may have nested tables in order to enforce subset constraints.  We can break these out on save into conventional tables where row constraints can be enforced and data more flexibly queried and optimized.
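As a hedged sketch (hypothetical names), the point-of-entry form can carry a document-level constraint while an AFTER INSERT trigger breaks the lines out into a conventional table for querying:

CREATE TYPE journal_line AS (account_id int, amount numeric);

CREATE FUNCTION lines_balance(journal_line[]) RETURNS boolean AS
$$
    SELECT sum(amount) = 0 FROM unnest($1);
$$ LANGUAGE sql IMMUTABLE;

CREATE TABLE journal_entry (
    id    serial PRIMARY KEY,
    lines journal_line[] NOT NULL CHECK (lines_balance(lines))
);

CREATE TABLE journal_line_entered (
    entry_id   int REFERENCES journal_entry (id),
    account_id int NOT NULL,
    amount     numeric NOT NULL
);

CREATE FUNCTION journal_entry_break_out() RETURNS trigger AS
$$
BEGIN
    INSERT INTO journal_line_entered (entry_id, account_id, amount)
    SELECT NEW.id, l.account_id, l.amount FROM unnest(NEW.lines) l;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER journal_entry_break_out AFTER INSERT ON journal_entry
    FOR EACH ROW EXECUTE PROCEDURE journal_entry_break_out();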

object_id inherited field

When using stock and leaf approaches, one of the difficulties can be in tracking rows from the stock table catalogs back into the actual tables where they are stored.  Inheriting an object_id field poses some problems, but it works relatively well.  The solution typically is to do something like:

CREATE TABLE object_root (
    object_id bigserial
);

And then inherit object_root.
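A leaf table then picks up the shared object_id (and its sequence) simply by inheriting it, for example:

CREATE TABLE asset_class (
    label text NOT NULL
) INHERITS (object_root);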

 

The O/R "Hand Tools"


The following are in the hand tools category.  They can be elegantly used wherever they seem useful.  However they are ranked by increasing complexity cost.

Object methods for derived data and storage

The major complexity cost here is that as table structures change, methods can get out of date or break.  This can be mitigated by source code control and factoring of the code so that it is easy to understand the codebase.

Table Inheritance to Model Interface Inheritance

This approach trades increased ramp-up time and knowledge requirements for enforced interface consistency.  It will slow down designing the table layout, but it will speed up the use of the database once designed.  On the whole this is a bit of a wash complexity-wise, but one should counsel against over-using it.

The O/R "Power Tools"


Like the hand tools above, these are ranked according to complexity costs.  Unlike the hand tools, these are specialized tools, best used rarely.  They have high complexity costs and usually require compensating designs to use safely.

Complex Data Types in Views

Complex data types in views are machine, not human interfaces.  These make it harder to access data and can confuse casual observers.  Additionally as types change underlying plumbing may have to change as well.

Complex Data Types in Tables

Complex data in tables greatly increases the complexity of table constraints.  There are cases where it helps, but in general multiple inheritance is a cleaner alternative with fewer gotchas.  This is likely to be useful when moving O/R code over from Oracle or other systems that do not support multiple inheritance or the rare case where table inheritance is inadequate to solve the problem at hand.

Nested Table Storage

Currently nested table storage is relatively useless except in cases where it allows otherwise impossible constraints to be modelled.  The current storage approach makes nested tables a poor choice for general-purpose queries, at least where the data sets are likely to get large.  They can, however, be used for small data sets where indexes might not be useful anyway, or as a point of original entry from which data is then copied into another table for actual access.

There isn't any reason why these issues can't be solved in future versions, but the fact that this is a tool that works well only with edge cases anyway means it is unlikely to be a very high priority to folks not using it.

Table Inheritance to Model Data Inheritance

This adds tremendous complexity to the schema, and requires a lot of effort to make work correctly.  Only use it if you are sure that in your specific case the complexity issues it solves for you are worth the costs.